Data Lake vs. Data Warehouse

Sources and Further Readings

[1] talend, Data Lake vs. Data Warehouse, https://www.talend.com/de/resources/data-lake-vs-data-warehouse/

[2] IBM, Charting the data lake: Using the data models with schema-on-read and schema-on-write (2017), https://www.ibmbigdatahub.com/blog/charting-data-lake-using-data-models-schema-read-and-schema-write

[3] AWS, What is a Lake House approach? (2021), https://aws.amazon.com/big-data/datalakes-and-analytics/data-lake-house/?nc1=h_ls

[4] Google, source of the "Data Lakehouse on GCP" figure, https://cloud.google.com/blog/products/data-analytics/open-data-lakehouse-on-google-cloud

Data Warehouse vs. Data Lake

Data Lakes and Data Warehouses are both established terms when it comes to storing Big Data, but the two are not synonymous. A Data Lake is a large pool of raw data whose purpose has not yet been defined. A Data Warehouse, on the other hand, is a repository for structured, filtered data that has already been processed for a specific purpose [1].

While Data Warehouses use the classic ETL process (extract, transform, load) in combination with structured data in a relational database, a Data Lake relies on paradigms such as ELT (extract, load, transform) and schema-on-read, and often holds unstructured data [2].
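To make the difference between schema-on-write and schema-on-read more tangible, here is a minimal Python sketch. It is only an illustration under my own assumptions: the sample events, field names and both helper functions are hypothetical and not tied to any particular product.

```python
import json

# Hypothetical raw events as they might land in a Data Lake: no schema is enforced.
RAW_EVENTS = [
    '{"user": "a", "amount": "12.50", "ts": "2021-05-01"}',
    '{"user": "b", "amount": "oops"}',
    '{"user": "c", "amount": "7", "channel": "web"}',
]


def write_with_schema(raw_lines):
    """Schema-on-write: validate and type each record before it is stored."""
    table, rejected = [], []
    for line in raw_lines:
        record = json.loads(line)
        try:
            table.append(
                {"user": str(record["user"]), "amount": float(record["amount"])}
            )
        except (KeyError, ValueError):
            # Records that do not fit the fixed schema never reach the table.
            rejected.append(line)
    return table, rejected


def read_with_schema(raw_lines, fields):
    """Schema-on-read: keep everything as-is, pick the fields only when querying."""
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


warehouse_rows, rejected_rows = write_with_schema(RAW_EVENTS)
lake_view = list(read_with_schema(RAW_EVENTS, ["user", "channel"]))
```

In the warehouse-style path, the malformed record is rejected at load time. In the lake-style path, every record is kept and the caller decides at query time which fields matter, at the price of having to handle missing values.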

Differences Data Warehouse vs. Lake — Image by Author

So it’s actually not about Data Lake vs. Data Warehouse

Even though Data Lakes are gaining in importance and the classic Data Warehouse has lost some of its prominence, both solutions are still needed. As shown above, they serve different use cases. Data Lakehouses and hybrid Data Warehouse solutions, which are split up and extended with NoSQL features, offer a good approach here.

The Era of Data Lakehouses

A Data Lakehouse combines the Data Lake with a Data Warehouse to enable unified governance and ease of data movement [3]. In my own experience, a Data Lake can often be set up much faster than a Data Warehouse. Once all data is available, Data Warehouses can still be built on top of it as a hybrid solution.

Data Lakehouse Concept — Image from Author

This makes rigid, classically planned Data Warehouses a thing of the past. It greatly accelerates the provision of dashboards and analyses and is a good step towards a data-driven culture. Implementing the solution with new SaaS services from the cloud and with approaches such as ELT instead of ETL also speeds up development. One example is a Google Data Lakehouse in which you use Cloud Storage for your Data Lake and BigQuery for your Data Warehouse. It is worth mentioning that BigQuery also fulfills some characteristics of a Data Lake, so it is itself something of a hybrid solution.
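The ELT idea mentioned here can be sketched in a few lines of Python. Note the assumptions: the in-memory SQLite database is only a local stand-in for a warehouse engine such as BigQuery, and the table names and sample rows are hypothetical. The point is the order of the steps, not the specific tooling.

```python
import sqlite3

# Hypothetical raw rows, standing in for files in a Data Lake; everything is text.
RAW_ORDERS = [
    ("1001", "2021-05-01", "19.50"),
    ("1002", "2021-05-01", "5.50"),
    ("1003", "2021-05-02", "12.00"),
]

# In-memory stand-in for the warehouse engine (an assumption, not BigQuery itself).
conn = sqlite3.connect(":memory:")

# Extract + Load: copy the raw data unchanged into a staging table.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)

# Transform: done afterwards, inside the engine, with plain SQL.
conn.execute(
    """
    CREATE TABLE daily_revenue AS
    SELECT order_date, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_orders
    GROUP BY order_date
    """
)

daily_revenue = conn.execute(
    "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
```

With classic ETL, the typing and aggregation would happen in a separate tool before anything is loaded. Here the raw table stays available, so further transformations can be added later without re-ingesting the data.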

How to build up a Data Lakehouse?

To be a bit more concrete, let us look at the technologies and services with which such Data Lakehouses can be built. The figure below shows an architecture that was realized on Google Cloud. Here, Cloud Storage and BigQuery are used as storage. Because the services on Google Cloud are well connected, they can easily exchange data with each other and thus be used for analysis, machine learning and other topics.

Data Lakehouse on GCP — Source: Google [4]

This architecture is of course also possible with other providers such as AWS or Microsoft Azure. Microsoft, for example, offers such an analysis platform with Azure Synapse Analytics. Azure already uses several Data Lakehouse approaches, such as the option of integrating data from a Data Lake as a virtual table.
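The idea of a virtual table over Data Lake files can be illustrated with a small, purely conceptual Python sketch. It does not use any Azure or Google API. The VirtualTable class, the folder layout and the CSV files are hypothetical and only show the principle that the data stays in the lake and is read at query time.

```python
import csv
import tempfile
from pathlib import Path


class VirtualTable:
    """Hypothetical read-only 'table' whose rows stay in Data Lake files."""

    def __init__(self, folder, pattern="*.csv"):
        self.folder = Path(folder)
        self.pattern = pattern

    def rows(self):
        # Files are opened at query time, so nothing is copied into the warehouse.
        for path in sorted(self.folder.glob(self.pattern)):
            with path.open(newline="") as handle:
                yield from csv.DictReader(handle)

    def select(self, predicate):
        return [row for row in self.rows() if predicate(row)]


# Build a tiny throwaway "lake" folder so the sketch runs end to end.
lake_dir = Path(tempfile.mkdtemp())
(lake_dir / "2021-05-01.csv").write_text("user,amount\na,12.5\nb,3.0\n")
(lake_dir / "2021-05-02.csv").write_text("user,amount\nc,7.0\n")

sales = VirtualTable(lake_dir)
big_sales = sales.select(lambda row: float(row["amount"]) > 5)

# A file added later is visible on the next query without any reload step.
(lake_dir / "2021-05-03.csv").write_text("user,amount\nd,9.0\n")
big_sales_after = sales.select(lambda row: float(row["amount"]) > 5)
```

Real platforms add query pushdown, caching and access control on top, but the trade-off is the same: no load step and always current data, in exchange for reading the files on every query.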

Summary

So instead of seeing the Data Lake as the defeater of the Data Warehouse, I see it as one more solution and one more step in the integration process. I load raw data into a Data Lake, which gives me more options for storing different data types. Afterwards, I load the data via an ETL process as usual into a Data Warehouse in a more easily analyzable form. This is then referred to as a Data Lakehouse or hybrid approach.