Christianlauer

Summary

Google has enhanced BigQuery with advanced mathematical functions for data science applications.

Abstract

Google has recently updated BigQuery with powerful data science functions, including COSINE_DISTANCE, EUCLIDEAN_DISTANCE, and EDIT_DISTANCE. These functions are designed to improve mathematical and statistical analyses in various fields such as machine learning, data analysis, and natural language processing. The cosine distance is particularly useful for text similarity and document clustering, while the euclidean distance is a general-purpose metric for clustering, classification, and regression tasks. The edit distance, also known as Levenshtein distance, is employed for string similarity and spell checking. The article provides practical examples of how to use these functions in BigQuery SQL and emphasizes the significance of these updates for data scientists and analysts who can now perform more sophisticated analyses using only SQL within BigQuery.

Opinions

  • The author suggests that the new functions are a significant enhancement for BigQuery users, particularly for those in data science roles.
  • The article implies that the integration of these advanced functions directly into BigQuery allows for more streamlined and efficient data analysis workflows.
  • The author seems to appreciate Google's commitment to continuously improving BigQuery by adding statistical and machine learning capabilities, as evidenced by recent updates like the Pattern Analyzer and Bag of Words features.
  • The provision of practical examples and links to further documentation indicates the author's view that these resources are valuable for users to fully leverage the new functionalities.

Google launches Powerful Data Science Functions for BigQuery

How you can now use more advanced Mathematical Functions in BigQuery

Photo by Jeswin Thomas on Unsplash

Google just launched several useful functions that you can use for a range of mathematical and statistical data science use cases.

Just last week, Google announced more advanced text analysis functions in BigQuery and BigQuery ML, with which you can do tokenization and bag-of-words in SQL. Now they have also launched new mathematical functions, namely COSINE_DISTANCE, EUCLIDEAN_DISTANCE and EDIT_DISTANCE [1].

You can use these distances in various fields such as machine learning, data analysis, and natural language processing. Each metric serves a different purpose and suits specific types of data. Here is a short overview of what each one is and how to use it [2][3][4][5]:

Cosine Distance

The cosine distance is based on the cosine of the angle between two vectors in a multidimensional space: it is 1 minus that cosine (the cosine similarity), so two vectors pointing in the same direction have a distance of 0. It is often used to compare how similar two vectors are. Use Cases: Cosine distance is commonly used in natural language processing (NLP) tasks, such as text similarity and document clustering. It is particularly useful when the magnitude of the vectors is not important and the focus is on their orientation.
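
As a minimal sketch of the point about orientation versus magnitude, the query below compares one vector against a second that points in a different direction and a third that is simply the first one doubled. The vectors and column names are made-up values for illustration; only COSINE_DISTANCE itself comes from BigQuery [3].

```sql
-- Illustrative vectors only; the column names are assumptions.
SELECT
  -- Slightly different direction: small but non-zero distance (about 0.0161).
  COSINE_DISTANCE([1.0, 2.0], [3.0, 4.0]) AS different_direction,
  -- Same direction, twice the length: distance is approximately 0.
  COSINE_DISTANCE([1.0, 2.0], [2.0, 4.0]) AS same_direction;
```

The second column shows why this metric suits text embeddings: scaling a vector does not change its cosine distance to another vector.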

Euclidean Distance

The Euclidean distance is the straight-line distance between two points in Euclidean space. In other words, it measures the length of the shortest path between two points. Use Cases: Euclidean distance is a general-purpose distance metric and is widely used in applications such as clustering, classification, and regression. It is suitable for scenarios where the absolute magnitude of the vectors is significant and the spatial relationship between data points matters.
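
To make the straight-line definition concrete, here is a small sketch that computes the distance once with the built-in function and once by hand, as the square root of the summed squared differences. The vectors and the names vectors, built_in and by_hand are assumptions chosen for illustration.

```sql
-- Illustrative vectors only; table and column names are assumptions.
WITH vectors AS (
  SELECT [1.0, 2.0] AS a, [3.0, 4.0] AS b
)
SELECT
  -- Built-in function on two dense vectors.
  EUCLIDEAN_DISTANCE(a, b) AS built_in,
  -- Same value by hand: sqrt((1 - 3)^2 + (2 - 4)^2) = sqrt(8).
  (
    SELECT SQRT(SUM(POW(x - b[OFFSET(i)], 2)))
    FROM UNNEST(a) AS x WITH OFFSET AS i
  ) AS by_hand
FROM vectors;
```

Both columns should come out at roughly 2.828. Unlike the cosine distance, doubling one of the vectors would change this result, which is why the metric fits cases where magnitude matters.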

Edit Distance or Levenshtein Distance

The Edit or Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. Use Cases: Edit distance is commonly used in applications related to string similarity and spell checking. It is suitable for scenarios where the similarity between sequences of characters needs to be assessed, such as in DNA sequence analysis, spell correction, or comparing words.
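
A short sketch of how this looks in SQL, using made-up strings. The optional max_distance argument is taken from the EDIT_DISTANCE documentation [5]; treat its exact behavior as something to verify there.

```sql
-- Illustrative strings only; the column names are assumptions.
SELECT
  -- Classic example: two substitutions plus one insertion.
  EDIT_DISTANCE('kitten', 'sitting') AS kitten_sitting,   -- 3
  -- A typical typo: one extra character.
  EDIT_DISTANCE('BigQuery', 'BigQuerry') AS typo,         -- 1
  -- Optional cap on the computed distance (see [5]).
  EDIT_DISTANCE('abcdefg', 'a', max_distance => 2) AS capped;
```

For spell checking, a small cap like this can be enough, since candidates that are many edits away are usually not interesting anyway.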

Example in BigQuery

So now from the theory back to a practical example with BigQuery SQL. In this example I will use the Euclidean distance, but it has to be said that the three functions work very similarly to each other. Please also use the links below to dive deeper.

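-- Sparse vectors: each (dimension, magnitude) pair sets one coordinate,
-- so the pairs may appear in any order.
SELECT EUCLIDEAN_DISTANCE(
 [(1, 1.0), (2, 2.0)],
 [(2, 4.0), (1, 3.0)])
AS results;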

This returns roughly 2.828, the square root of 8:

Query Result — Screenshot by Author

Summary

These new mathematical functions are very powerful and widely used in data science, so this is a pretty awesome update for all data scientists and analysts who work with BigQuery. Google continues down the path of bringing more and more statistical and machine learning capabilities to its data warehouse BigQuery, usable with SQL alone.

Sources and Further Readings

[1] Google, BigQuery release notes (2023)

[2] Asking ChatGPT about Cosine, Euclidean and Edit Distance (2023)

[3] Google, COSINE_DISTANCE (2023)

[4] Google, EUCLIDEAN_DISTANCE (2023)

[5] Google, EDIT_DISTANCE (2023)
