Christianlauer

Summary

The web content discusses the concepts of Data Lakehouse and Data Mesh, explaining their differences, benefits, and how they can be integrated to enhance data management and governance within an organization.

Abstract

The article "Data Lakehouse vs. Data Mesh" delves into the modern approaches to data management, emphasizing the evolution from traditional Data Warehouses and Data Lakes to more sophisticated systems. The Data Lakehouse is presented as a hybrid system that combines the scalability and flexibility of Data Lakes with the structured analysis capabilities of Data Warehouses. It leverages open standards and supports various data types and workloads, including real-time data processing. On the other hand, the Data Mesh approach focuses on organizational change, advocating for decentralized data ownership, data product thinking, self-service infrastructure, and federated governance. The article suggests that these concepts are not mutually exclusive but can be complementary, with the Data Lakehouse providing the technical foundation and Data Mesh offering the organizational framework to effectively manage and distribute data across a company.

Opinions

  • The author suggests that the Data Lakehouse can lead to a paradigm shift in data management by decoupling storage from processing and supporting diverse data types and workloads.
  • The Data Mesh concept is seen as a transformative organizational strategy that prioritizes domain-specific data ownership and treats data as a product.
  • The article posits that a successful data strategy should not be a choice between Data Lakehouse and Data Mesh but rather an integration of both, leveraging the technical strengths of the Lakehouse and the governance model of the Mesh.
  • The author implies that the combination of Data Lakehouse and Data Mesh can shorten the time-to-value for data initiatives compared to traditional Data Warehouse approaches.
  • The article hints at the importance of self-serve data platforms and federated computational governance in empowering users and ensuring data quality and compliance.

Data Lakehouse vs. Data Mesh

How to use the new Approaches to become more Data Driven

Photo by Zdeněk Macháček on Unsplash

Data Warehouse, Data Lake and now Data Lakehouse and Data Mesh: what is what, where are the differences, and above all, how do they relate to each other? Here you will find out how they differ and how they can actually be built on each other.

The Data Lakehouse

The Data Lakehouse stores raw data in a Data Lake, or in several Data Lakes divided by business context, while loading transformed and aggregated parts of it into the Data Warehouse for purposes such as Self-Service BI, Data Marts or Machine Learning services. The Data Lakehouse is meant to combine the advantages of Data Lakes and Data Warehouses in one hybrid concept.

Data Lakehouse Concept — Image by Author
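
The flow described above (raw data stays in the lake, transformed and aggregated parts are loaded into a warehouse layer) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the in-memory list stands in for a Data Lake, SQLite stands in for the Data Warehouse, and the names `RAW_LAKE_LINES`, `load_curated_table` and `revenue_by_domain` are hypothetical, not taken from any product.

```python
import json
import sqlite3

# Hypothetical raw events as they might land in the lake (one JSON document per line).
RAW_LAKE_LINES = [
    '{"domain": "sales", "customer": "a", "amount": 120.0}',
    '{"domain": "sales", "customer": "b", "amount": 80.0}',
    '{"domain": "marketing", "customer": "a", "amount": 15.5}',
]


def load_curated_table(raw_lines, connection):
    """Aggregate raw lake records per domain and load them into a warehouse-style table."""
    totals = {}
    for line in raw_lines:
        record = json.loads(line)
        totals[record["domain"]] = totals.get(record["domain"], 0.0) + record["amount"]

    connection.execute(
        "CREATE TABLE IF NOT EXISTS revenue_by_domain (domain TEXT PRIMARY KEY, total REAL)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO revenue_by_domain (domain, total) VALUES (?, ?)",
        sorted(totals.items()),
    )
    connection.commit()
    return totals


# SQLite in memory is an assumption standing in for the Data Warehouse.
warehouse = sqlite3.connect(":memory:")
load_curated_table(RAW_LAKE_LINES, warehouse)

# A Self-Service BI tool would query the curated table, not the raw lines.
rows = warehouse.execute(
    "SELECT domain, total FROM revenue_by_domain ORDER BY domain"
).fetchall()
print(rows)
```

The raw lines remain untouched, so other consumers (for example a Machine Learning service) could still read them in full detail.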

The two systems are not operated side by side, but as a novel single system. Benefits of a Data Lakehouse are [1]:

  • Decoupling data storage from data processing to achieve better scalability.
  • Open standardized storage formats and interfaces.
  • Support for different data types, from unstructured to structured data.
  • Support for various workloads, such as Data Science, Machine Learning, SQL, and Analytics.
  • End-to-end streaming: streaming support eliminates the need for separate systems to serve real-time data applications.
  • Shorter time-to-value compared to a Data Warehouse.
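
To make the first benefits more tangible, here is a small sketch of what decoupling storage from processing can look like: one shared copy of the data in an open format, read independently by two different workloads. It is an illustration under assumptions, not a reference design: CSV stands in for an open storage format, and `SHARED_STORAGE`, `analytics_workload` and `data_science_workload` are hypothetical names.

```python
import csv
import io
import statistics

# One shared copy of the data in an open, text-based format.
# CSV is an assumption here; it stands in for whatever open format is used.
SHARED_STORAGE = "order_id,region,amount\n1,eu,10.0\n2,eu,30.0\n3,us,20.0\n"


def read_orders(storage_text):
    """Parse the shared storage into typed records; every workload reads it the same way."""
    return [
        {"order_id": int(row["order_id"]), "region": row["region"], "amount": float(row["amount"])}
        for row in csv.DictReader(io.StringIO(storage_text))
    ]


def analytics_workload(storage_text):
    """BI-style aggregate: total amount per region."""
    totals = {}
    for order in read_orders(storage_text):
        totals[order["region"]] = totals.get(order["region"], 0.0) + order["amount"]
    return totals


def data_science_workload(storage_text):
    """Feature-style statistic: mean order amount across all regions."""
    return statistics.mean(order["amount"] for order in read_orders(storage_text))


print(analytics_workload(SHARED_STORAGE))
print(data_science_workload(SHARED_STORAGE))
```

Because both functions only depend on the stored format, either one could be scaled or replaced without copying or moving the data.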

The Data Mesh Approach

It is important to understand that the Data Mesh concept primarily establishes a new organizational perspective and is less about solving technical problems. You should therefore consider these four principles when building a Data Mesh organization [2]:

  • Domain-oriented decentralized data ownership and architecture: A Data Mesh should serve the individual business units. To that end, one or several Data Lakehouses could be built.
  • Data as a product: The Data Lakehouse architecture helps to manage data as a product by providing different data team members in domain-specific teams complete control over the data lifecycle.
  • Self-serve data infrastructure as a platform: Users can supply themselves with data in a self-service BI tool, while Data Scientists, for example, access the same data and develop models.
  • Federated computational governance: The data should be backed up and distributed according to a role concept. Data catalogs, for example, are also helpful here.
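
The four principles above can be sketched as a tiny data model: each data product names its owning domain, a shared catalog makes products discoverable, and a role check decides who may read what. This is a minimal sketch under assumptions; `DataProduct`, `Catalog` and all field names are hypothetical and do not correspond to any specific tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """A data set treated as a product, owned by one domain team (hypothetical model)."""
    name: str
    domain: str
    owner: str
    allowed_roles: frozenset


@dataclass
class Catalog:
    """Shared catalog: self-service discovery plus a simple role-based access check."""
    products: dict = field(default_factory=dict)

    def register(self, product):
        if product.name in self.products:
            raise ValueError(f"data product already registered: {product.name}")
        self.products[product.name] = product

    def discover(self, domain):
        """Return the names of all data products owned by the given domain."""
        return sorted(p.name for p in self.products.values() if p.domain == domain)

    def can_read(self, product_name, role):
        """True only if the product exists and the role is allowed to read it."""
        product = self.products.get(product_name)
        return product is not None and role in product.allowed_roles


catalog = Catalog()
catalog.register(
    DataProduct("orders_daily", "sales", "sales-data-team", frozenset({"analyst", "data_scientist"}))
)
catalog.register(
    DataProduct("campaign_clicks", "marketing", "marketing-data-team", frozenset({"analyst"}))
)

print(catalog.discover("sales"))
print(catalog.can_read("campaign_clicks", "data_scientist"))
```

Each domain team registers and maintains its own products, while the catalog and the role check are shared across the company.
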
Data Mesh on Google Cloud — Source: Google [3]

Bring it all Together

So as you can see, it is less about Data Lakehouse vs. Data Mesh than about combining a Data Lake and a Data Warehouse into a Data Lakehouse and using the organizational approach of a Data Mesh to govern, manage and distribute the data in the company. In the architectural diagram above, the technical components such as data, code and infrastructure can be seen as the Data Lakehouse, while the upper part (catalog, security, logging and so on) brings the right data to the right person: the Data Mesh.

Summary

I hope this has helped you gain at least a first understanding of what Data Lakehouse and Data Mesh are and how they can be combined. If you want to learn more, see the sources below.

Sources and Further Readings

[1] Stefan Koch, Können Lakehouses einen Paradigmenwechsel anstossen? (2021)

[2] Michael Armbrust, Ali Ghodsi, Bharath Gowda, Arsalan Tavakoli-Shiraji, Reynold Xin and Matei Zaharia, Frequently Asked Questions About the Data Lakehouse (2021)

[3] Google, Build a data mesh on Google Cloud with Dataplex, now generally available (2022)
