What is a Data Lakehouse?

Summary

The concept of a Data Lakehouse represents a modern hybrid approach to data storage and analytics, merging the benefits of Data Warehouses and Data Lakes into a unified system.

Abstract

A Data Lakehouse is an emerging architecture that seeks to combine the scalability and flexibility of a Data Lake with the structured analysis capabilities of a Data Warehouse. This hybrid model allows for the storage of both raw and processed data within a single system, facilitating easier data governance and movement. The Lakehouse approach is characterized by the integration of purpose-built storage solutions, enabling a more agile and data-driven culture within organizations. By leveraging cloud-based services and technologies such as ELT processes, the Lakehouse paradigm accelerates the development of analytics and machine learning applications. While the term 'Data Lakehouse' may be seen as a new buzzword, it essentially describes an established practice in cloud data warehousing and lakes, emphasizing the importance of a unified platform for diverse data needs.

Opinions

The author suggests that the Data Lakehouse is not merely a buzzword but a practical evolution in data management, representing a shift from traditional Data Warehouses.
The Lakehouse concept is seen as a way to overcome the limitations of both Data Lakes and Data Warehouses by providing a single system that can handle a variety of data workloads.
The author believes that the hybrid approach of combining Data Lakes and Data Warehouses has been in use for some time, particularly in cloud-based environments, and is now being encapsulated by the term 'Data Lakehouse.'
The author expresses that the Lakehouse architecture can lead to faster deployment of dashboards and analyses, thus promoting a data-driven culture within organizations.
The author points out that the Lakehouse model is supported by major cloud providers, such as Google Cloud with its Cloud Storage and BigQuery services, AWS, and Microsoft Azure with Azure Synapse Analytics, indicating its growing industry acceptance and adaptability.

New Paradigm or just a Buzzword?

What are Data Lakehouses? Just another buzzword or actually the successor to Data Lakes and Warehouses? In order to combine the advantages of Data Warehouses and Lakes, many companies have developed a hybrid BI environment. They store raw data in Data Lakes, while loading parts of it into the Data Warehouse as needed. The Data Lakehouse should combine the advantages of Data Lakes and Data Warehouses into a hybrid concept. The two systems are not operated side by side, but as a novel single system.

Data Warehouses vs. Data Lakes

Both, Data Lakes and Data Warehouses are established terms when it comes to storing Big Data, but the two terms are not synonymous. A Data Lake is a large pool of raw data for which no use has yet been determined. A Data Warehouse, on the other hand, is a repository for structured, filtered data that has already been processed for a specific purpose [1].

While Data Warehouses use the classic ETL process in combination with structured data in a relational database, a Data Lake uses paradigms such as ELT and a schema on read as well as often unstructured data [2].

Differences Data Warehouse vs. Lake — Image by Author

So what is a Data Lakehouse?

It is not just about integrating a Data Lake with a Data Warehouse, but rather integrating a Data Lake, a Data Warehouse, and purpose-built storage to enable unified governance and ease of data movement[3]. From my own experience has often shown that a Data Lakes can be realized much faster. Once all data is available, Data Warehouses can still be built on top of it as a hybrid solution.

Hybrid Data Lake Concept — Image from Author

This makes rigid and classically planned Data Warehouses a thing of the past. This greatly accelerates the provision of dashboards and analyses and is a good step towards a data-driven culture. An implementation with new SaaS services from the cloud and approaches such as ELT instead of ETL also accelerate the development.

In my opinion, this approach has been around for some time, especially in the area of Cloud Data Warehousing and Data Lakes. Here, these two technologies have long been combined with each other in a hybrid approach (Read here more about it). In my opinion, the new trend term Data Lakehouse simply describes this established approach.

How to build up a Data Lakehouse?

To be a bit more concrete, we can take a look at how and with which technologies and services such Data Lakehouses can be built. In the figure below, an architecture is shown that was realized in the Google Cloud. Here, Cloud Storage and BigQuery are used as storage. Due to the good connectivity in the Google Cloud, the services can easily exchange data with each other and thus be used for analysis, machine learning and other topics.

Data Lakehouse on GCP — Source: Google [4]

This architecture is of course also possible with other providers such as AWS or MS Azure. Microsoft for example offers with Azure Synapse Analytics such an analysis platform. Azure is already using several Data Lakehouse approaches, such as the option of integrating data from a Data Lake as a virtual table.

Summary

For me, the topic of Data Lakehouses is not new, this hybrid approach has been around for some time in the area of Cloud Data Warehousing and Lakes, and the term Data Lakehouse now describes this approach. In this article, I have shown you the difference between Data warehouses, Data Lakes and Data Lakehouses and an example of how such a Data Lakehouse can be technically implemented.

What is a Data Lakehouse?

New Paradigm or just a Buzzword?

Data Warehouses vs. Data Lakes

So what is a Data Lakehouse?

How to build up a Data Lakehouse?

Summary

Sources and Further Readings