avatarChristianlauer

Summary

The provided content discusses the differences between Data Warehouses and Data Lakehouses, outlining their characteristics, use cases, and potential for integration.

Abstract

The article delves into the distinctions between traditional Data Warehouses and the emerging concept of Data Lakehouses. It explains that Data Warehouses are long-established systems primarily designed for structured data analysis, often utilizing SQL technologies and ETL processes, with examples including Google Big Query, Hadoop, and Amazon Redshift. In contrast, a Data Lakehouse is a hybrid solution that combines the capabilities of a Data Lake and a Data Warehouse, allowing for unified governance and easier data movement. This approach enables faster data development and supports a data-driven culture, with the flexibility to handle both structured and unstructured data. The article suggests that while Data Warehouses are suitable for companies dealing with smaller volumes of structured data, the shift towards cloud-based solutions like Google BigQuery is leading to the adoption of Lakehouse architectures, which offer performance benefits, reduced maintenance, and cost savings. It also highlights that modern cloud Data Warehouses are evolving towards Lakehouse characteristics, indicating a convergence in data management solutions.

Opinions

  • The author suggests that Data Lakehouses can be implemented more quickly than traditional Data Warehouses and provide a modern, hybrid solution for data management.
  • The article implies that the rigidity of classic Data Warehouse planning is becoming outdated, with modern cloud services and ELT approaches accelerating development.
  • It is the author's view that combining a Data Lake with a classical Data Warehouse into a Lakehouse is often the appropriate step for companies to take, especially for those needing to handle raw data and expedite data development.
  • The author believes that modern cloud-based Data Warehouses are already moving towards the Lakehouse model, offering hybrid system benefits.
  • The article posits that while some companies may continue with traditional Data Warehouses for specific use cases, the trend is towards adopting Lakehouse architectures for their performance, maintenance, and cost advantages.

Data Warehouse vs. Data Lakehouse

What are the Differences and when to use what?

Photo by Daniel Jacob on Unsplash

Data Warehouses are now established in almost all companies, but recently you hear more and more about Data Lakehouses — what is what, where are the differences and how can both be combined?

So let’s look into both of the technologies and define what is what first to then think about how the two can be even combined.

The Data Warehouse

The Data Warehouse has been established in companies for a long time, it is intended as a source for all kinds of data analysis and can be extended by (Self-Service) Business Intelligence, often based on OLAP for Data Analysts but also normal employees. Often classical ETL processes with mostly structured data are performed with schema on write based on SQL technologies. Famous classical Data Warehouses often were developed on premise and technologies like MSSQL, MySQL or Oracle databases were used. Modern and frequently used cloud based systems are Google Big Query, Hadoop or Amazon Redshift.

Architecture of a Data Warehouse — Image Source: IBM[1]

The Data Lakehouse

The Data Lakehouses combines the Data Lake with a Data Warehouse to enable unified governance and ease of data movement [2]. My own experience has often shown that a Data Lake can be realized much faster. Once all data is available, Data Warehouses can still be built on top of it as a hybrid solution.

Data Lakehouse Concept — Image from Author

This makes rigid and classically planned Data Warehouses a thing of the past and also accelerates the provision of dashboards and analyses. Hence, it is a good step towards a data-driven culture. An implementation with new SaaS-services from the cloud and approaches such as ELT instead of ETL also accelerates the development.

An example is a Google Cloud based Data Lakehouse where you use cloud storage for your Data Lake and BigQuery for you Data Warehouse — here it’s important to mention that also BigQuery fulfills some characteristics of an Data Lake, so it’s somehow a hybrid solution.

Data Lakehouse as a Replacement for the Data Warehouse?

The answer: Not necessarily. There are certainly cases in companies that only process a small amount of data, often structured data, and thus continue to operate their classic Data Warehouse, but here too it may make sense to switch to a modern cloud Data Warehouse for reasons of performance, less maintenance effort and possibly also costs. However, it must also be said here that these modern cloud-based Data Warehouses such as Google BigQuery, AWS Redshift or Snowflake are often already hybrid systems and are already moving in the direction of a Data Lakehouse. Otherwise, it should be said that it is often the appropriate step for companies to combine a Data Lake for raw data and faster data development with a classical Data Warehouse in a Data Lakehouse.

Here are articles on this topic that might also be interesting:

Sources and Further Readings

[1] IBM, Data Warehouse (2020)

[2] AWS, What is a Lake House approach? (2021)

Data Science
Technology
Big Data
Business
Data
Recommended from ReadMedium