Google launches UDFs to export BigQuery Data as Protocol Buffer Columns

Export Data as Protobuf Columns

Protocol Buffers (Protobuf) is a binary format created by Google to serialize data exchanged between services. Google has released it as open source, and it supports languages such as JavaScript, Java, C#, and Ruby. It is also reported to be faster than XML or JSON[1]. For more information, see the auth0 article cited as [1].

BigQuery now offers the possibility to merge multiple column values into a single Protobuf value, which has the following benefits[2][3]:

Object type safety.

Better compression, data transfer time, and cost in comparison with JSON.

Flexibility as most programming languages have libraries to handle Protobuf.

Less overhead when reading from multiple columns and building a single object.
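To make the size benefit concrete, here is a minimal sketch that hand-encodes one row in the Protobuf wire format and compares it with compact JSON. It is only an illustration of the encoding, not how BigQuery builds the column. The row, the field numbers, and the helper names are assumptions made for this example, and in practice a generated Protobuf class would do this work.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)  # final byte, continuation bit clear
    return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode an integer field (wire type 0 = varint)."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode a string field (wire type 2 = length-delimited)."""
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


# Hypothetical row with three columns merged into one message.
row = {"id": 150, "name": "widget", "quantity": 3}

proto_bytes = (
    encode_int_field(1, row["id"])          # assumed field number 1
    + encode_string_field(2, row["name"])   # assumed field number 2
    + encode_int_field(3, row["quantity"])  # assumed field number 3
)
json_bytes = json.dumps(row, separators=(",", ":")).encode("utf-8")

print(len(proto_bytes), "bytes as Protobuf")
print(len(json_bytes), "bytes as JSON")
```

The saving comes mostly from replacing the repeated column names with one-byte field keys, so the effect grows with the number of rows.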

According to Google, other column types can also provide type safety, but a Protobuf column provides a fully typed object, which can reduce the work needed in the application layer or in another part of the pipeline[2]. Exporting BigQuery data as Protobuf columns also has some limitations. These two points should be taken into consideration[2][3]:

Protobuf columns cannot be indexed or filtered well. Searching by the content of a Protobuf column might therefore be less effective.

Sorting data in Protobuf format can be difficult.
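The two limitations above can be illustrated with a small sketch: once values are packed into an opaque binary message, a consumer has to decode each message before it can filter or sort on a field. The message layout (field 1 for an id, field 2 for a quantity), the sample data, and the helper names are assumptions for this example. The decoder only handles varint fields.

```python
from typing import Optional, Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_row(row_id: int, quantity: int) -> bytes:
    """Assumed layout: field 1 = id, field 2 = quantity, both varints."""
    return b"\x08" + encode_varint(row_id) + b"\x10" + encode_varint(quantity)


def read_field(blob: bytes, wanted: int) -> Optional[int]:
    """Scan the message and return the value of the wanted varint field."""
    pos = 0
    while pos < len(blob):
        key, pos = decode_varint(blob, pos)
        value, pos = decode_varint(blob, pos)  # sketch: varint fields only
        if key >> 3 == wanted:
            return value
    return None


blobs = [encode_row(i, q) for i, q in [(1, 3), (2, 500), (3, 7)]]

# Filtering on quantity requires decoding every message first.
large_ids = [read_field(b, 1) for b in blobs if read_field(b, 2) > 100]

# Sorting by quantity also needs the decoded value as the sort key.
ids_by_quantity = [read_field(b, 1) for b in sorted(blobs, key=lambda b: read_field(b, 2))]

print(large_ids)
print(ids_by_quantity)
```

If a field is needed often for filtering or sorting, keeping it as a regular column next to the Protobuf column avoids this decoding step.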

Query Performance Insights about High Cardinality Joins

The query performance insight for high cardinality joins is now generally available. When a query contains a join with non-unique keys on both sides of the join, the output table can be considerably larger than either of the input tables. This insight indicates that the ratio of output rows to input rows is high and reports these row counts. Check your join conditions to confirm that the increase in the size of the output table is expected. Avoid cross joins where possible. If a cross join is really needed, try using a GROUP BY clause to pre-aggregate results, or use a window function. For more information, see "Reduce data before using a JOIN".
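The row growth described above can be reproduced on a small scale. The following sketch uses Python's built-in sqlite3 module as a stand-in for BigQuery, which is an assumption made only so the example runs locally. The table names and data are hypothetical. It joins two tables that both have non-unique keys, then shows how pre-aggregating with GROUP BY keeps the output small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    CREATE TABLE visits (customer_id INTEGER, page TEXT);
    -- customer_id is non-unique in both tables (4 rows each)
    INSERT INTO orders VALUES (1, 10), (1, 20), (1, 30), (2, 5);
    INSERT INTO visits VALUES (1, 'a'), (1, 'b'), (1, 'c'), (2, 'a');
    """
)

# Direct join: every order row pairs with every visit row of the same customer.
joined_rows = conn.execute(
    """
    SELECT COUNT(*)
    FROM orders AS o
    JOIN visits AS v ON o.customer_id = v.customer_id
    """
).fetchone()[0]

# Pre-aggregate each side with GROUP BY so the join keys become unique.
aggregated = conn.execute(
    """
    SELECT o.customer_id, o.total_amount, v.visit_count
    FROM (SELECT customer_id, SUM(amount) AS total_amount
          FROM orders GROUP BY customer_id) AS o
    JOIN (SELECT customer_id, COUNT(*) AS visit_count
          FROM visits GROUP BY customer_id) AS v
      ON o.customer_id = v.customer_id
    ORDER BY o.customer_id
    """
).fetchall()

print("rows from direct join:", joined_rows)
print("rows after pre-aggregation:", aggregated)
conn.close()
```

With four rows in each input, the direct join already returns more rows than either table, while the pre-aggregated version returns one row per customer.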

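Sources and Further Readings

[1] auth0, Beating JSON performance with Protobuf (2023), https://auth0.com/blog/beating-json-performance-with-protobuf/

[2] Google, BigQuery release notes (2023), https://cloud.google.com/bigquery/docs/release-notes

[3] Google, Export data as Protobuf columns (2023), https://cloud.google.com/bigquery/docs/protobuf-export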