Google unites BigQuery with Apache Spark
How Google brings together what belongs together

Many of you may have been waiting for this for a while: Google is bringing BigQuery and Apache Spark together. You can now run Spark code as stored procedures in Google BigQuery and call them with SQL.
Spark makes it possible to query large data sets from different sources at high speed. To do so, the framework relies on a distributed architecture and cluster computing. Google has now announced that, when using BigQuery, you can create and run Apache Spark stored procedures that are written in Python [1]. Once you have created them, you can run them with SQL, just like SQL stored procedures. A stored procedure in BigQuery is a collection of statements that can be called from other queries or stored procedures. A procedure can accept input arguments and return values as output.
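To give a feel for the calling side, here is a minimal Python sketch that assembles the SQL CALL statement for such a procedure. The function names and the project, dataset and procedure names are made up for illustration, and the optional run_procedure helper assumes the google-cloud-bigquery package is installed and credentials are configured.

```python
def build_call_statement(project_id: str, dataset: str, procedure: str, *args: str) -> str:
    """Return a BigQuery CALL statement for a stored procedure.

    Each argument must already be a valid SQL literal, e.g. "'2022-01-01'" or "42".
    """
    arg_list = ", ".join(args)
    return f"CALL `{project_id}`.{dataset}.{procedure}({arg_list});"


def run_procedure(sql: str) -> None:
    """Submit the statement to BigQuery (needs google-cloud-bigquery and credentials)."""
    # Imported lazily so build_call_statement works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    client.query(sql).result()  # wait until the procedure has finished


# Hypothetical names, only for illustration.
statement = build_call_statement("my-project", "my_dataset", "spark_proc", "'2022-01-01'")
print(statement)
```

The arguments are inserted into the statement verbatim, so this sketch is only suitable for trusted, hand-written values.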
Before you can start coding, you first have to set up a connection to Apache Spark, which is pretty straightforward.

In addition, some permissions may have to be assigned. For a detailed description, please refer to the official documentation.
After you have set up the connection, you can start coding and create a stored procedure using Apache Spark. Here is a blueprint for you [2]:
CREATE PROCEDURE `YOUR_PROJECT_ID`.YOUR_DATASET.PROCEDURE_NAME(PROCEDURE_ARGUMENT)
WITH CONNECTION `CONNECTION_NAME`
OPTIONS (
engine="SPARK",
main_file_uri=["URI"])
LANGUAGE PYTHON [AS PYSPARK_CODE];

Apache Spark lends itself to numerous applications in the Big Data environment thanks to its speed and the variety of options for processing large amounts of data from a wide range of sources, so this step by Google makes perfect sense. Apache Spark is an established Big Data framework that is used by many users and companies. Customers who already use Apache Spark may now find the data warehouse technology BigQuery more appealing, while BigQuery users can take advantage of Spark more easily, directly on BigQuery with SQL. Important application areas of Apache Spark are:
- Data Integration
- Ad Hoc Analysis of Big Data
- Machine Learning
All of these topics combine very well with BigQuery, so this is really good news. In addition, Google has released several other useful updates in recent weeks.
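Returning to the blueprint above, the following Python sketch fills in its placeholders and renders a complete CREATE PROCEDURE statement. It is only an illustration: the helper name and all project, dataset, connection and bucket names are hypothetical, and it assumes that main_file_uri takes a single Cloud Storage path to the PySpark file.

```python
def render_create_procedure(
    project_id: str,
    dataset: str,
    procedure_name: str,
    connection_name: str,
    main_file_uri: str,
    arguments: str = "",
) -> str:
    """Fill the blueprint's placeholders and return the DDL as one string."""
    return (
        f"CREATE PROCEDURE `{project_id}`.{dataset}.{procedure_name}({arguments})\n"
        f"WITH CONNECTION `{connection_name}`\n"
        "OPTIONS (\n"
        '  engine="SPARK",\n'
        # Assumption: a single gs:// path pointing to the main PySpark file.
        f'  main_file_uri="{main_file_uri}")\n'
        "LANGUAGE PYTHON;"
    )


# Hypothetical values, only for illustration.
ddl = render_create_procedure(
    project_id="my-project",
    dataset="my_dataset",
    procedure_name="spark_proc",
    connection_name="my-project.us.my-spark-connection",
    main_file_uri="gs://my-bucket/my_pyspark_job.py",
)
print(ddl)
```

The rendered statement can then be pasted into the BigQuery SQL editor; check the official documentation [2] for any further options your setup may need.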
Sources and Further Readings
[1] Google, BigQuery Release Notes (2022)
[2] Google, Work with stored procedures for Apache Spark (2022)





