avatarChristianlauer

Summary

The NORMALIZE_AND_CASEFOLD function in Google BigQuery is a powerful tool for improving data quality by standardizing string comparisons across different Unicode encodings.

Abstract

The article discusses the NORMALIZE_AND_CASEFOLD function in BigQuery, which is designed to handle strings with various Unicode encodings and improve data quality. It explains the importance of Unicode normalization due to the multiple representations of characters that can complicate software processes. The function supports four normalization modes: NFC, NFKC, NFD, and NFKD, each with specific use cases for decomposing and reassembling characters based on canonical or compatibility equivalence. The article provides examples of how to use the function in BigQuery to compare strings that are encoded differently but represent the same text. It also highlights the unique capability of BigQuery to perform case-insensitive string comparisons using this function, which is not commonly found in other SQL databases.

Opinions

  • The author suggests that the adoption of Unicode has made it necessary to normalize strings to eliminate nonessential differences in encoding.
  • The article implies that the NORMALIZE_AND_CASEFOLD function is particularly useful for data preparation tasks in BigQuery.
  • The author emphasizes the practicality of the function, indicating that while it may not be used frequently, it is valuable for specific scenarios where string encoding consistency is critical.
  • The author appears to appreciate Google's inclusion of this function in BigQuery, considering it a distinctive feature that sets it apart from other SQL databases.

“NORMALIZE AND CASEFOLD” in BigQuery

How to improve Data Quality when working with Strings

Photo by Amanda Jones on Unsplash

With the BigQuery SQL function NORMALIZE_AND_CASEFOLDyou have a wonderful possibility to work with strings with different Unicode encodings.

What was Unicode and the different Encodings again?

You can use Unicode to represent strings in multiple forms. With the ever-increasing adoption of Unicode , especially over the Internet, it has become necessary to eliminate nonessential differences in Unicode strings. Multiple representations for a combination of characters complicate software, for example, when a web server responds to a page request or a linker searches for a particular identifier in a library. So what may be rendered the same on the screen may have different encoding. Examples of different encodings [1]:

  • 7-bit ASCII
  • ISO 8859–1 Latin-1
  • Utf-8 — 8-bit variable-length encoding

Example of different Encodings

To give you a concrete example: If you would display the different encoding of a string via Python, you would get different results. Here an example:

c = chr(0xa9)
print(c)
print(c.encode('utf-8'))
print(c.encode('iso-8859-1'))

The results showing the different encoding formats:

©
b'\xc2\xa9'
b'\xa9'

Normalization Modes in BigQuery

Google BigQuery supports four optional normalization modes [2]:

NFC: Normalization Form Canonical Composition Decomposes characters and reassembles them according to canonical equivalence. NFKC: Normalization Form Compatibility Composition Decomposes characters according to compatibility and reassembles them according to canonical equivalence. NFD: Normalization Form Canonical Decomposition Decomposes characters according to canonical equivalence and multiple combining characters are arranged in a specific order. NFKD: Normalization Form Compatibility Decomposition Decomposes characters according to compatibility and multiple combination characters are arranged in a specific order.

How to use “NORMALIZE AND CASEFOLD”

So let’s try it out, how we can apply the whole thing in BigQuery. Here are two strings that are encoded differently but actually represent the same thing.

WITH Strings AS (
 SELECT ‘A\u0308\uFB03n’ AS a, ‘Ä\uFB03n’ AS b
)
SELECT a, b,
 NORMALIZE_AND_CASEFOLD(a, NFD)=NORMALIZE_AND_CASEFOLD(b, NFD) AS nfd,
 NORMALIZE_AND_CASEFOLD(a, NFC)=NORMALIZE_AND_CASEFOLD(b, NFC) AS nfc,
 NORMALIZE_AND_CASEFOLD(a, NFKD)=NORMALIZE_AND_CASEFOLD(b, NFKD) AS nkfd,
 NORMALIZE_AND_CASEFOLD(a, NFKC)=NORMALIZE_AND_CASEFOLD(b, NFKC) AS nkfc
FROM Strings;

And the expected results:

Query result — Image by Author

And now another try with values which will result is a false for certain normalization modes.

WITH Strings AS (
  SELECT '\u2168' AS a, 'IX' AS b UNION ALL
)
SELECT a, b,
  NORMALIZE_AND_CASEFOLD(a, NFD)=NORMALIZE_AND_CASEFOLD(b, NFD) AS nfd,
  NORMALIZE_AND_CASEFOLD(a, NFC)=NORMALIZE_AND_CASEFOLD(b, NFC) AS nfc,
  NORMALIZE_AND_CASEFOLD(a, NFKD)=NORMALIZE_AND_CASEFOLD(b, NFKD) AS nkfd,
  NORMALIZE_AND_CASEFOLD(a, NFKC)=NORMALIZE_AND_CASEFOLD(b, NFKC) AS nkfc
FROM Strings;

Here, we get a false for the NFD and NFC mode — here only the one with Compatibility Decomposition (NFKD and NFKC) is working:

Query Result — Image by Author

Summary

This is a short description of the very practical function with which you can check strings to see if they have an identical encoding. Certainly not a function you use all the time, but it can be useful in data preparation — and it’s quite unique within SQL, so it’s worth mentioning that Google has equipped BigQuery with this function.

Last but not least a small Note: In BigQuery you can use the case folding for the caseless comparison of strings. If you need to compare strings and case should not be considered, use NORMALIZE_AND_CASEFOLD, otherwise you can use NORMALIZE.

Sources and Further Readings

[1] informIT, Operators and Expressions (2006)

[2] Google, NORMALIZE_AND_CASEFOLD (2022)

Data Science
Google
Sql
Python
Programming
Recommended from ReadMedium