#Data Science #Data Architecture #Big Data #Technology #Business
Here’s How to Architect Data Lake Solutions
A technical solution approach

Purpose of the Article
In this article, I provide an overview of data lakes, data ponds, and data swamps from an architectural perspective.
My aim is to provide business value, use cases, architecture, design and implementation, and consumption considerations for data lakes.
Data lakes have become the de facto standard for Big Data analytics platforms supporting Internet of Things (IoT) and Artificial Intelligence (AI) initiatives.
The majority of the Big Data and Analytics solutions require considering a data lake architectural model.
Most of my recent digital transformation initiatives involve data lake solutions to support IoT, Mobility and AI goals.
Many clients have developed a strong interest in the technical capabilities and compelling use cases of data lakes.
Data lakes are fundamental aspects of Big Data lifecycle management. I cover Big Data lifecycle management in a different article.
What is a Data Lake?
A data lake is a dynamic data store that keeps data for an extended period of time. It can be fed iteratively from multiple data sources as new data is discovered, cleaned, and transformed in various parts of the enterprise.
For example, a data lake can store relational data from enterprise applications and non-relational data from IoT devices, social media streaming, and data from mobile applications and devices.
A data lake can serve as a single store of enterprise data kept in its native format. These data stores are usually reported on, visualized, and analyzed using advanced analytics.
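As a minimal sketch of landing data in its native format (the directory layout, source names, and field names here are my own illustrative assumptions, not a standard), raw events can be written untouched into a partitioned lake zone, keyed by a unique identifier:

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

def land_raw_event(lake_root: Path, source: str, payload: dict) -> Path:
    """Write one event to the lake in its native JSON format,
    partitioned by source and ingestion date, keyed by a unique id."""
    now = datetime.now(timezone.utc)
    partition = lake_root / source / now.strftime("%Y/%m/%d")
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / f"{uuid.uuid4()}.json"
    path.write_text(json.dumps(payload))
    return path

# Hypothetical lake root; a real lake would use object storage such as S3.
lake = Path(tempfile.mkdtemp())
p = land_raw_event(lake, "crm", {"customer": "c-42", "event": "signup"})
print(p)
```

Note that nothing is transformed on the way in; the payload is stored exactly as it arrived, which is what keeps ingestion cheap and flexible.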
A data lake can include structured, semi-structured, and unstructured data. Structured data is traditionally well-managed, relatively more straightforward, and not a big concern of the overall data management process.
However, the challenge is related to semi-structured and, more importantly, dealing with unstructured data. I plan to cover these concerns in a different article.
Use Cases for Data Lakes
There are several common use cases for data lakes.
The primary use case is taking advantage of a clean data store on a self-service basis, without needing technical data professionals.
This use case covers data consumers in departments or across the enterprise who need to use data for various purposes without the help of a data platform and practice team.
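As a toy illustration of self-service discovery (the catalog structure, dataset names, and tags are my own assumptions), consumers can locate datasets by metadata tags with no data engineer in the loop:

```python
# A toy in-memory catalog: dataset name -> metadata tags.
CATALOG = {
    "crm_customers": {"domain": "sales", "format": "parquet", "pii": True},
    "web_clickstream": {"domain": "marketing", "format": "json", "pii": False},
    "iot_telemetry": {"domain": "operations", "format": "json", "pii": False},
}

def find_datasets(**required_tags):
    """Return dataset names whose metadata matches all required tags."""
    return sorted(
        name
        for name, tags in CATALOG.items()
        if all(tags.get(k) == v for k, v in required_tags.items())
    )

print(find_datasets(format="json", pii=False))  # ['iot_telemetry', 'web_clickstream']
```

A real deployment would back this with a proper data catalog service, but the consumer-facing idea is the same: search metadata, not people.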
Real-time analysis of data arriving from various sources is another common use case.
In my recent digital transformation initiatives, auditing requirements for corporate compliance and centralization of data were frequently mentioned use cases for data lakes.
Another use case is related to the goal of having a complete view of customer data coming from multiple sources.
Business Value of Data Lakes
Data lakes can provide substantial business value propositions.
One of the key business value propositions of data lakes comes from being able to perform advanced analytics very quickly for data coming from various real-time sources such as clickstreams, social media, and system logs.
As data lakes are highly agile to deploy and easy to configure, they offer compelling business value to organizations aiming for agile and continuous service delivery frameworks for data consumption.
Effective use of data lakes for speedy and timely data processing and consumption helps business stakeholders identify opportunities rapidly, make informed decisions, act on those decisions expeditiously, and bring products and services to business customers quickly.
Architectural and Design Considerations for Data Lakes
Architecting and designing data lakes require upfront planning for data types with substantial input from business stakeholders.
For example, suppose the purpose of the data is unknown and not clearly stated by the business stakeholders. In that case, we may keep the data in raw format so that data professionals can use it in the future when it is needed.
In data lakes, instead of enforcing a schema, we usually store information using unique identifiers and metadata tags. If a schema is required for a data lake, it can only be applied on read rather than on write.
Schema-on-write is an essential requirement for a data warehouse, which I plan to cover in a different article. For a data lake, by contrast, the schema is created only when the data is read from the store.
Applying the schematic structure only at read time is what allows unstructured data to be stored in the lake.
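A minimal sketch of the schema-on-read idea (the field names and type coercions are my own illustrative assumptions): records land in the lake as raw JSON lines with no schema enforced on write, and a schema is applied only when a consumer reads them:

```python
import json

# Raw records land in the lake as-is: mixed shapes, no schema enforced on write.
raw_lines = [
    '{"id": "1", "temp": "21.5", "site": "berlin"}',
    '{"id": 2, "temp": 19, "site": "oslo", "extra": "ignored"}',
]

# The schema lives in the reader, not the store: field name -> type coercion.
READ_SCHEMA = {"id": int, "temp": float, "site": str}

def read_with_schema(lines, schema):
    """Apply a schema at read time, coercing and projecting fields."""
    for line in lines:
        record = json.loads(line)
        yield {field: cast(record[field]) for field, cast in schema.items()}

rows = list(read_with_schema(raw_lines, READ_SCHEMA))
print(rows[0])  # {'id': 1, 'temp': 21.5, 'site': 'berlin'}
```

In a schema-on-write system such as a data warehouse, the second record's stray `extra` field and inconsistent types would have been rejected at load time; here they cost nothing until someone reads the data.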
Another architectural consideration is that the data in a data lake does not go through an ETL process. ETL stands for Extract, Transform, Load: a procedure that extracts data from sources, transforms it, and loads it into a destination store.
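For contrast, here is a minimal sketch of an ETL step (the source and target are plain Python lists and the field names are my own assumptions; a real pipeline would use databases or files). The point is that data is transformed before it is loaded, which is exactly the step a data lake's raw landing zone skips:

```python
def extract(source):
    """Extract: pull rows from the source system."""
    return list(source)

def transform(rows):
    """Transform: clean and reshape rows before loading."""
    return [
        {"name": r["name"].strip().title(), "amount_cents": round(r["amount"] * 100)}
        for r in rows
        if r.get("amount") is not None  # drop incomplete rows up front
    ]

def load(rows, target):
    """Load: write transformed rows into the destination store."""
    target.extend(rows)
    return target

source = [{"name": "  ada ", "amount": 12.5}, {"name": "bob", "amount": None}]
warehouse = []
load(transform(extract(source)), warehouse)
print(warehouse)  # [{'name': 'Ada', 'amount_cents': 1250}]
```

Note that the incomplete row is discarded during the transform; in a data lake, it would have been kept in raw form, and any such decision deferred to read time.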
From a storage architecture perspective, data lakes are raw big data stores, not optimized or transformed for specific data consumers.
The storage architecture for data lakes calls for low-cost storage units. This consideration has a favorable impact from a business value point of view.
From the non-functional requirements perspective, the key architectural consideration for data lakes is scalability because data growth is the main strategic focus of data lakes in business organizations.
Hence, scalability coupled with capacity is a critical success factor for architecting effective data lake solutions.
One of the critical challenges of data lakes is maintaining security, as data arrives in the lake in real time from multiple uncontrolled sources.
To address this challenge, a well-governed data security architecture, specifically including access controls and semantic consistency, must be in place.
As architects, we need to engage security specialists to take the required measures to address the security concerns of the data lakes.
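As a toy illustration of the access-control point (the roles, tags, and policy structure here are hypothetical), even a simple policy check can pair dataset metadata tags with consumer roles:

```python
# Hypothetical policy: which roles may read datasets carrying each tag.
POLICY = {
    "pii": {"data_steward", "compliance"},
    "public": {"data_steward", "compliance", "analyst"},
}

def can_read(role, dataset_tags):
    """A consumer may read a dataset only if their role is allowed for every tag."""
    return all(role in POLICY.get(tag, set()) for tag in dataset_tags)

print(can_read("analyst", {"public"}))          # True
print(can_read("analyst", {"public", "pii"}))   # False: the pii tag blocks it
```

Real data lake platforms implement this kind of tag-based policy through dedicated governance services; the sketch only shows the shape of the check.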
As architects, we don’t design data lakes ourselves. Data lake design is a specialist-level activity, usually conducted by an experienced data storage architect or product specialist leveraging the skills of multiple data management specialists.
However, the design for data lakes must comply with the architectural framework, principles, and guidelines.
What is a Data Pond?
A data pond (a.k.a. a data puddle) is a relatively small, purpose-built data platform, usually serving a single team’s mission (e.g., only the marketing or sales department) in a business organization.
We can consider a data pond as an alternative solution for data-intensive ETL offloading engagements required by a single team with specific data management requirements.
Unlike data lakes, data ponds do not support data-driven processing for informed decisions at the enterprise level. From an organizational point of view, we can consider them at the departmental or group level.
For example, we can consider a data pond for designing a data warehouse on a smaller scale. I plan to cover the data warehouse in a different article as it is a different architectural model with different use cases.
What is a Data Swamp?
You might have heard the term “data swamp” used pejoratively by dissatisfied data consumers or by business stakeholders responsible for the data practice in an organization.
They usually use the term to voice their frustration with a statement like “our data lake turned into a data swamp; we have no idea what is in our data stores and cannot access or use our data.”
When we hear the term “data swamp” in a meeting, we should understand that our data platform is at risk and that we must find ways to re-architect the platform and the data lake solution.
Fundamentally, a data swamp is the opposite of a data lake. A data swamp is, technically, an unmanaged data lake that is not accessible to its intended consumers or does not provide the desired business value.
From lessons learned in the field of data management and analytics, I have witnessed many poorly architected, designed, and implemented data lakes turn into data swamps.
As it is a widespread concern, we must take necessary measures, use best practices, and architect our data lake solutions based on business goals, use cases, requirements, and strategic direction.
How are data lakes implemented?
Data lakes can be implemented using various data management tools, techniques, and services. There are many commercially available tools, products, and services.
For example, Azure Data Lake, Amazon S3, and IBM Cloud Pak for Data are some data lake implementation enablers with necessary tools, products, and services that can be considered for our data lake implementations based on our solution goals and use cases, requirements, and strategies.
As mentioned in my previous articles, open-source tools are embraced by many business customers for various reasons. Most of my clients have chosen open-source products for their data lake implementations. To this end, Apache Hadoop became the platform of choice for implementing data lakes.
Apache Hadoop is a highly scalable, modular, technology-agnostic, cost-effective system and presents no schema limitations. I also noticed in my collaboration circles that many Big Data and Storage solution architects commonly use Hadoop to empower their data platforms.
There are, of course, many more open-source tools that can be used for different aspects of architecting, designing, implementing, and consuming data lakes. Details of those tools are beyond the scope of this article. I covered some of them in detail in one of my recent books titled Architecting Big Data Solutions.
Conclusions and Takeaways
Data lakes are essential for various business reasons.
In a nutshell, the major reasons can be summarised as reducing the cost of storage, storing a wide variety of data types, increasing storage capacity, scaling multiple data types, and reducing risks for data management across the enterprise.
Data lakes can be considered strategic underlying infrastructure for IoT (Internet of Things) and AI (Artificial Intelligence).
We must architect, design, and deploy our data lakes methodically with rigor, considering the customer requirements, use cases, business goals, organizational strategy, and consumer expectations.
If not architected, designed, and deployed with these success factors and principles, our data lakes can turn into data swamps, an undesirable situation for our business organization and customers.
Thank you for reading my perspectives. I wish you a healthy and happy life.