Christianlauer



Data Lakehouse vs. Data Lake

What Are the Differences and How Do They Build on Each Other

Photo by Salmen Bejaoui on Unsplash

With the replacement of the classic Data Warehouse by modern, often cloud-based systems such as Data Lakes, certain problems occur. Because a Data Lake is a large container for all kinds of data, often still raw, it is not well suited to tools such as self-service BI. This is where the Data Lakehouse comes into play. Data Lakehouses are a mixture of Data Lakes and classic Data Warehouses.

Data Warehouses and Data Lakes

Data Lakes and Data Warehouses are both established terms when it comes to storing Big Data, but they are not synonymous. As mentioned above, a Data Lake is a large pool of raw data for which no use has yet been determined. A Data Warehouse, on the other hand, is a repository for structured, filtered data that has already been processed for a specific purpose [1].

While Data Warehouses use the classic ETL process in combination with structured data in a relational database, a Data Lake uses paradigms such as ELT and schema-on-read and often holds unstructured data [2].
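To make the difference between schema-on-write and schema-on-read more concrete, here is a minimal sketch in Python. The sample events, field names and both helper functions are invented for illustration and do not come from any particular product. The warehouse-style function validates a record before it is stored, while the lake-style function keeps every raw line and only interprets it when it is read.

```python
import json

# Hypothetical raw events as they might land in a data lake (no schema enforced).
RAW_EVENTS = [
    '{"user": "anna", "amount": "19.90", "country": "DE"}',
    '{"user": "ben", "amount": 5}',
    '{"user": "cleo", "amount": "n/a", "note": "refund pending"}',
]


def write_with_schema(line):
    """Schema-on-write: validate and convert before the record is stored."""
    record = json.loads(line)
    return {
        "user": str(record["user"]),
        "amount": float(record["amount"]),  # raises ValueError on bad input
        "country": str(record["country"]),  # raises KeyError if missing
    }


def read_with_schema(line):
    """Schema-on-read: interpret the stored raw line only when it is queried."""
    record = json.loads(line)
    try:
        amount = float(record.get("amount"))
    except (TypeError, ValueError):
        amount = None  # unusable values become NULL-like instead of failing
    return {
        "user": record.get("user"),
        "amount": amount,
        "country": record.get("country"),
    }


# Warehouse style: only records that fit the schema are accepted at load time.
warehouse_rows = []
rejected = 0
for line in RAW_EVENTS:
    try:
        warehouse_rows.append(write_with_schema(line))
    except (KeyError, ValueError):
        rejected += 1

# Lake style: every raw line is kept; the schema is applied at read time.
lake_rows = [read_with_schema(line) for line in RAW_EVENTS]

print(len(warehouse_rows), "rows loaded,", rejected, "rejected")
print(len(lake_rows), "rows readable from the lake")
```

In this sketch the strict load accepts only the first event, whereas the lake keeps all three and leaves it to the reader to decide how to handle missing or malformed fields.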

Differences Data Warehouse vs. Lake — Image by Author

So what is a Data Lakehouse?

It is not just about integrating a Data Lake with a Data Warehouse, but rather about integrating a Data Lake, a Data Warehouse, and purpose-built storage to enable unified governance and ease of data movement [3]. My own experience has often shown that a Data Lake can be realized much faster. Once all data is available, Data Warehouses can still be built on top of it as a hybrid solution.

Hybrid Data Lake Concept — Image from Author

This makes rigid and classically planned Data Warehouses a thing of the past. It greatly accelerates the provision of dashboards and analyses and is a good step towards a data-driven culture. An implementation with new SaaS services from the cloud and approaches such as ELT instead of ETL also accelerates development.
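The ELT idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory SQLite database stands in for a cloud data warehouse, and the table, view and sample rows are made up. The raw values are loaded unchanged first, and the transformation happens afterwards inside the database with plain SQL.

```python
import sqlite3

# Hypothetical raw export; in practice this would be files in object storage.
RAW_ROWS = [
    ("2022-01-03", "DE", "19.90"),
    ("2022-01-03", "DE", "5.10"),
    ("2022-01-04", "FR", "12.00"),
    ("2022-01-04", "FR", "not available"),
]

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

# Extract + Load: copy the raw values unchanged, everything as text.
conn.execute("CREATE TABLE raw_sales (day TEXT, country TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", RAW_ROWS)

# Transform: done later, inside the warehouse, with plain SQL.
# Rows whose amount does not start with a digit are skipped.
conn.execute(
    """
    CREATE VIEW sales_by_country AS
    SELECT country, ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
    FROM raw_sales
    WHERE amount GLOB '[0-9]*'
    GROUP BY country
    """
)

report = conn.execute(
    "SELECT country, revenue FROM sales_by_country ORDER BY country"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]

print(report)
print(raw_count, "raw rows kept")
```

Because the raw table is kept as loaded, a new view with different cleaning rules can be added later without reloading any data, which is the main reason this pattern speeds up the delivery of dashboards.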

The Time for Hybrid Systems

This system architecture makes clouds such as AWS, Google Cloud or Azure very attractive. Here you can use object storage such as S3 or Cloud Storage together with classic databases as a data lake and then combine it, often through already existing interfaces, with data warehouse technologies such as Google BigQuery, Azure Synapse or Amazon Redshift. And the data lakehouse is ready, so to speak. If necessary, companies can even rely on the latter alone, because these new data warehouse technologies offer column-based and NoSQL functionality together with interfaces to and query functions on other database systems, so that a data transfer may no longer be necessary. BigQuery Omni is a good example of this: here you can even query data from other cloud platforms.

Functionality of BigQuery Omni (Source: Google)

BigQuery Omni provides a unified management interface through Google Cloud. BigQuery Omni can use your existing Google Cloud account and BigQuery projects. You can write a standard SQL query in the Cloud Console to query data in AWS or Azure, and see the results displayed in the Cloud Console [4].
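As a rough sketch of what this looks like from code rather than from the Cloud Console, the snippet below builds a standard SQL query and passes it to the BigQuery Python client. The table name `my-project.aws_dataset.sales` and its columns are hypothetical, and the sketch assumes that a table over the AWS data has already been set up in a BigQuery dataset. Running `run_query` additionally requires the `google-cloud-bigquery` package and valid credentials.

```python
def build_revenue_query(table):
    """Return standard SQL for a table that may live in another cloud."""
    return (
        "SELECT country, SUM(amount) AS revenue\n"
        f"FROM `{table}`\n"
        "GROUP BY country"
    )


def run_query(sql):
    """Run the SQL through BigQuery (needs google-cloud-bigquery and credentials)."""
    # Imported lazily so the rest of the sketch loads without the package.
    from google.cloud import bigquery

    client = bigquery.Client()
    return list(client.query(sql).result())


# Hypothetical table defined over data stored in AWS S3.
sql = build_revenue_query("my-project.aws_dataset.sales")
print(sql)
```

The point of the sketch is that the query text is ordinary SQL: nothing in it reveals that the underlying data sits on another cloud platform.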

Summary

So it is not so much about Data Lakehouses vs. Data Lakes; rather, Data Lakehouses are based on Data Lakes. Different SQL and NoSQL databases serve as data storage for the raw data, which can then be processed and analyzed with modern Data Warehouses. Hybrid technologies such as Google BigQuery and other solutions even offer everything from a single source, so that you can access other systems and platforms directly from there using SQL.

Sources and Further Readings

[1] Talend, Data Lake vs. Data Warehouse

[2] IBM, Charting the data lake: Using the data models with schema-on-read and schema-on-write (2017)

[3] AWS, What is a Lake House approach? (2021)

[4] Google, What is BigQuery Omni (2022)
