Christianlauer

Summary

Google has integrated Apache Spark with BigQuery, allowing users to run Spark stored procedures in BigQuery SQL.

Abstract

Google has announced the integration of Apache Spark with its BigQuery service, enabling users to execute Spark stored procedures within BigQuery SQL. This integration leverages Spark's high-speed data processing capabilities on large datasets from diverse sources using a distributed architecture and cluster computing. Users can now create and run Spark-based stored procedures in Python, which can be invoked using SQL, streamlining the process of working with big data. The connection setup for Apache Spark is described as straightforward, with detailed instructions provided in the official documentation. This move is seen as beneficial for users who can now combine the speed and flexibility of Spark with the data warehousing capabilities of BigQuery, opening up opportunities for data integration, ad-hoc analysis, and machine learning applications.

Opinions

  • The integration of Apache Spark with BigQuery is positioned as a significant advancement that many users have anticipated.
  • Apache Spark's established presence in the big data landscape as a fast and versatile processing framework is emphasized.
  • The integration is seen as a logical step by Google, aligning BigQuery's capabilities with the needs of users who require the performance benefits of Spark.
  • The author suggests that this development could increase the appeal of BigQuery to Apache Spark users and vice versa, potentially expanding Google Cloud's customer base.
  • The integration is part of a series of updates from Google, indicating a commitment to enhancing BigQuery's functionality and user experience.

Google unites BigQuery with Apache Spark

How Google brings together what belongs together

Photo by Venti Views on Unsplash

Many of you may have been waiting for this for a while: Google is bringing BigQuery and Apache Spark together. You can now run Apache Spark code as stored procedures in Google BigQuery SQL.

Spark makes it possible to query large data sets from different sources at high speed. For this, the framework uses a distributed architecture and cluster computing. Google has now announced that, in BigQuery, you can create and run Apache Spark stored procedures written in Python [1]. After you create them, you can run them with SQL, just like SQL stored procedures. A stored procedure in BigQuery is a collection of statements that can be called from other queries or stored procedures. A procedure can accept input arguments and return values as output.
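To make the idea of input arguments more concrete, here is a minimal sketch of how the Python side of such a procedure might read an argument. It assumes, based on my reading of the documentation [2], that each argument reaches the Spark job as an environment variable named BIGQUERY_PROC_PARAM.<argument name> holding a JSON-encoded value; please verify this against the current documentation. The helper read_proc_param and the argument name num are hypothetical and only serve as an illustration.

```python
import json
import os

# Assumption: BigQuery exposes each procedure argument to the Spark job as an
# environment variable with this prefix, JSON-encoded (check the docs [2]).
PARAM_PREFIX = "BIGQUERY_PROC_PARAM."


def read_proc_param(name, environ=os.environ):
    """Return the decoded value of one procedure argument (hypothetical helper)."""
    raw = environ[PARAM_PREFIX + name]
    return json.loads(raw)


# Local simulation: pretend the procedure was called with num = 10.
fake_env = {"BIGQUERY_PROC_PARAM.num": "10"}
num = read_proc_param("num", fake_env)
print(num)
```

Passing the environment as a parameter keeps the helper easy to try out locally, without a Spark session or a BigQuery connection.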

Before you can start coding, you first have to set up a connection to Apache Spark, but this is pretty straightforward:

Implement a Connection to Apache Spark — Image by Author

In addition, some permissions may have to be assigned. For a detailed description, please refer to the official documentation [2].

After you set up the connection, you can start coding and create a stored procedure using Apache Spark. Here is a blueprint for you [2]:

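-- Template: replace the upper-case placeholders with your own values.
CREATE PROCEDURE `YOUR_PROJECT_ID`.YOUR_DATASET.PROCEDURE_NAME(PROCEDURE_ARGUMENT)
 WITH CONNECTION `CONNECTION_NAME`
 OPTIONS (
     engine="SPARK",
     main_file_uri=["URI"])
 -- [AS PYSPARK_CODE] is optional and embeds the PySpark code inline.
 LANGUAGE PYTHON [AS PYSPARK_CODE];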
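As an illustration of how the template might be filled in, the following sketch renders a concrete CREATE PROCEDURE statement and the matching CALL statement as plain strings. All values (my-project, my_dataset, spark_proc, the connection name, and the gs:// path) are hypothetical placeholders. Passing main_file_uri as a plain string and ending the statement with a single semicolon are my assumptions about the syntax, so check them against the documentation [2] before use. The sketch only builds text and does not talk to BigQuery.

```python
def render_create_procedure(project, dataset, name, args, connection, main_file_uri):
    """Fill the CREATE PROCEDURE template with concrete values (illustrative only)."""
    return (
        f"CREATE PROCEDURE `{project}`.{dataset}.{name}({args})\n"
        f"WITH CONNECTION `{connection}`\n"
        f'OPTIONS (engine="SPARK", main_file_uri="{main_file_uri}")\n'
        "LANGUAGE PYTHON;"
    )


def render_call(project, dataset, name, args_sql=""):
    """Build the SQL statement that invokes the procedure."""
    return f"CALL `{project}`.{dataset}.{name}({args_sql});"


# Hypothetical example values, not taken from a real project.
ddl = render_create_procedure(
    project="my-project",
    dataset="my_dataset",
    name="spark_proc",
    args="num INT64",
    connection="my-project.us.my-spark-connection",
    main_file_uri="gs://my-bucket/my_pyspark_main.py",
)
call = render_call("my-project", "my_dataset", "spark_proc", "10")

print(ddl)
print(call)
```

The resulting statements could then be pasted into the BigQuery SQL editor: the first creates the procedure, the second runs it like any other stored procedure.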

Apache Spark lends itself to numerous applications in the Big Data environment due to its speed and the variety of options for processing large amounts of data from a wide range of sources. Therefore, this step by Google makes perfect sense. Apache Spark is an established Big Data framework used by many users and companies. Customers who already use Apache Spark may now find the data warehouse BigQuery more attractive, and BigQuery users can take advantage of Spark more easily, directly in BigQuery with SQL. Important application areas of Apache Spark are:

  • Data Integration
  • Ad Hoc Analysis of Big Data
  • Machine Learning

All of these topics combine very well with BigQuery, so this is really good news. In addition, Google has released several other useful updates in recent weeks.

Sources and Further Readings

[1] Google, BigQuery Release Notes (2022)

[2] Google, Work with stored procedures for Apache Spark (2022)
