What is a Data Mesh?

Sources and Further Readings

[1] Talend, Data Lake vs. Data Warehouse, https://www.talend.com/de/resources/data-lake-vs-data-warehouse/

[2] IBM, Charting the data lake: Using the data models with schema-on-read and schema-on-write (2017), https://www.ibmbigdatahub.com/blog/charting-data-lake-using-data-models-schema-read-and-schema-write#:~:text=Blogs-,Charting%20the%20data%20lake%3A%20Using%20the%20data%20models%20with%20schema,read%20and%20schema-on-write&text=There%20is%20no%20attempt%20to,-on-read%20data%20stores.

[3] Stefan Koch, Können Lakehouses einen Paradigmenwechsel anstossen? (2021), https://hub.hslu.ch/informatik/koennen-lakehouses-einen-paradigmenwechsel-anstossen/

[4] Michael Armbrust, Ali Ghodsi, Bharath Gowda, Arsalan Tavakoli-Shiraji, Reynold Xin and Matei Zaharia, Frequently Asked Questions About the Data Lakehouse (2021), https://databricks.com/blog/2021/08/30/frequently-asked-questions-about-the-data-lakehouse.html

New Technology or just an Approach for efficient Data Platforms?

Data Lakes, Data Warehouses, and their combination in a Data Lakehouse are considered key to making use of Big Data. However, with data arriving from many sources and in many formats, these platforms can become so complex that they hinder efficient use, which often leaves users frustrated.

Recap Data Lake vs. Data Warehouse

To understand the topic as a whole, here is a short recap of how a Data Warehouse and a Data Lake differ. Both are established terms when it comes to storing Big Data, but they are not synonymous. A Data Lake is a vast pool of raw data that does not yet have a specific use case. A Data Warehouse, on the other hand, is a repository for structured and organized data that has already been dedicated to a specific purpose [1].

Data Warehouses typically use the ETL process (extract, transform, load) with already organized data in a relational database, whereas a Data Lake uses paradigms such as ELT (extract, load, transform) and schema-on-read to handle unstructured data [2].

Differences Data Warehouse vs. Lake — Image by Author
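The difference between schema-on-write and schema-on-read can be illustrated with a small Python sketch. It is not taken from the cited sources: the sample events, the `WAREHOUSE_SCHEMA` mapping and both function names are hypothetical and only meant to show the idea, not a real warehouse or lake engine.

```python
import json

# Hypothetical raw events as they might land in a Data Lake (illustrative data only).
RAW_EVENTS = [
    '{"order_id": 1, "amount": "19.90", "country": "DE"}',
    '{"order_id": 2, "amount": "5.00"}',
    '{"order_id": "3", "amount": "12.50", "country": "CH", "coupon": "SPRING"}',
]

# Assumed target schema, enforced before loading (schema-on-write, warehouse style).
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float, "country": str}


def load_schema_on_write(lines):
    """Transform and validate first; records that do not fit the schema are rejected."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            row = {col: cast(record[col]) for col, cast in WAREHOUSE_SCHEMA.items()}
        except (KeyError, ValueError):
            rejected.append(line)
            continue
        table.append(row)
    return table, rejected


def read_schema_on_read(lines, columns):
    """Keep every record as-is; apply a projection only when a query runs."""
    for line in lines:
        record = json.loads(line)
        yield {col: record.get(col) for col in columns}


warehouse_rows, rejected_rows = load_schema_on_write(RAW_EVENTS)
lake_rows = list(read_schema_on_read(RAW_EVENTS, ["order_id", "coupon"]))
```

In this sketch the warehouse-style loader drops the record without a `country`, while the lake-style reader keeps all three records and can still answer a later question about `coupon`, a field the fixed schema never anticipated.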

Data Lakehouses combine both

Storing raw data in a Data Lake while loading parts of it into a Data Warehouse for purposes such as self-service BI or ML services is the basic idea of a Data Lakehouse. The Data Lakehouse is meant to combine the advantages of Data Lakes and Data Warehouses in a hybrid concept: the two systems are not operated side by side, but as a single new system. Benefits of a Data Lakehouse can include [3]:

Decoupling data storage from data processing to achieve better scalability.

Open standardized storage formats and interfaces.

Support for different data types, from unstructured to structured data.

Support for various workloads, such as data science, machine learning, SQL, and analytics.

End-to-end streaming: streaming support eliminates the need for separate systems to serve real-time data applications.

Shorter time-to-value compared to a Data Warehouse.

Data Lakehouse on GCP. Source: Google, https://cloud.google.com/blog/products/data-analytics/open-data-lakehouse-on-google-cloud [4]

So what is a Data Mesh Approach?

A Data Mesh approach could improve on the Data Lake as the dominant architectural paradigm. It is important to understand that the Data Mesh concept primarily establishes a new organizational perspective and is less about solving technical problems. You should therefore consider these four principles when building a Data Mesh organization [4]:

Domain-oriented decentralized data ownership and architecture: A Data Mesh should serve the individual business units. For this purpose, one or several Data Lakehouses could be built.

Data as a product: The Data Lakehouse architecture helps to manage data as a product by giving the data team members in domain-specific teams complete control over the data lifecycle.

Self-serve data infrastructure as a platform: Users can supply themselves with data in a self-service BI tool, while data scientists, for example, access the same data and develop models.

Federated computational governance: The data should be secured and distributed according to a role concept. Data catalogs are also helpful here, for example.
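To make the four principles a little more tangible, here is a minimal Python sketch of a shared data catalog in which each domain registers its own data products and access is decided by roles. All names (`DataProduct`, `Catalog`, the product, owner and role names) are hypothetical and chosen for illustration only; a real Data Mesh would rely on an actual catalog and access-management tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor a domain team publishes for one data product."""
    name: str
    domain: str   # owning business unit (decentralized ownership)
    owner: str    # accountable data product owner
    allowed_roles: frozenset = field(default_factory=frozenset)


class Catalog:
    """Minimal in-memory data catalog shared by all domains."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        """Each domain registers and maintains its own products."""
        self._products[product.name] = product

    def discover(self, domain=None):
        """Self-service lookup: list product names, optionally for one domain."""
        return sorted(
            p.name for p in self._products.values()
            if domain is None or p.domain == domain
        )

    def can_read(self, product_name, role):
        """One rule applied to every product: access only via a granted role."""
        return role in self._products[product_name].allowed_roles


catalog = Catalog()
catalog.register(DataProduct(
    "sales.orders", "sales", "sales-po",
    frozenset({"bi_analyst", "data_scientist"}),
))
catalog.register(DataProduct(
    "hr.headcount", "hr", "hr-po",
    frozenset({"hr_analyst"}),
))
```

With this setup, `catalog.discover("sales")` lists only the products of the sales domain, and `catalog.can_read("hr.headcount", "bi_analyst")` is `False` because that role was never granted by the HR domain. The point of the sketch is the split of responsibilities: domains own and describe their products, while the access rule itself is defined once and applied everywhere.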

How Agility and Product Owners create additional Value

When Data Lakehouses are built for individual departments, it is important to include these units in the development process. This is where product owners help, because they are the link between the departments on one side and the developers and data scientists on the other. The next logical step is to enable the departments themselves to define data products that help them, and then to develop and maintain these products on their own, usually with initial support from a data product owner who is already active elsewhere. This requires systematic communication between data providers and data users, so that needs and intentions can be matched with the data that could potentially serve them.

Summary

The Data Mesh is therefore not a new type of technology in the way Data Warehouses are, but an organizational approach to distributing data in the company securely, transparently and efficiently. A Data Lakehouse is a hybrid between a Data Lake and a Data Warehouse. It is not a new technology either, but an efficient way to operate Data Warehouse and Data Lake as one and to build a modern data platform. In practice, Data Lakehouses support the principles of a Data Mesh. To develop data-driven products close to the customer, agile teams and product owners are recommended.