Dr Mehmet Yildiz

Summary

The provided content outlines the twelve-phase lifecycle management process for Big Data solutions, emphasizing the role of a Big Data Solution Architect in each phase.

Abstract

The article presents an architectural overview of Big Data lifecycle management, detailing the essential role of Big Data Solution Architects throughout the process. It begins by defining Big Data and its characteristics, then introduces twelve distinct phases, from Foundations to Destruction, each critical for managing Big Data effectively. The author, who has authored a book on the subject, emphasizes the importance of understanding the data lifecycle to design and architect robust Big Data solutions. The phases cover aspects such as data acquisition, preparation, input and access, processing, output and interpretation, storage, integration, analytics and visualization, consumption, and finally, retention, backup, archival, and destruction. The article also touches on the importance of data governance, security, and quality controls, and concludes with an invitation for readers to engage with the author's broader body of work on technology, health, and well-being.

Opinions

  • The author believes that the foundational phase is crucial for setting up a successful Big Data solution, requiring detailed planning and collaboration between project managers and Big Data architects.
  • Experience from traditional data management should be transferred and enhanced for Big Data solution use cases.
  • Data cleaning and preparation are vital for ensuring the quality of Big Data sets.
  • The Big Data Solution Architect plays a leadership role throughout the lifecycle, especially in phases like Data Processing and Data Analytics & Visualisation, but delegates detailed tasks to domain specialists.
  • The author suggests that the proposed twelve-phase lifecycle is a guideline and can be customized based on organizational data practices and solution requirements.
  • The author stresses the importance of data analytics as the phase where business value is gained from Big Data solutions.
  • The article posits that data consumption should be guided by policies and rules, which the lead Big Data Solution Architect should facilitate.
  • The author emphasizes the need for established backup strategies and compliance with industry standards for data retention and destruction.
  • The author promotes the use of certain productivity tools for analytics and visualization, such as Scala, Python, and R notebooks.
  • The author shares a personal perspective on health and well-being, suggesting a holistic approach to health issues and prevention of diseases.
  • The author provides links to further reading on various health-related topics, indicating a belief in the value of informed lifestyle choices for overall well-being.
  • A disclaimer is provided, clarifying that the author's content is for informational purposes and not professional health advice.

#Data Science #Technology #Business

An Introduction to Big Data Lifecycle Management

An Architectural and Solution Design Overview

Image by Ranjith Siji from Pixabay

Background

In this article, I provide an architectural overview of Big Data lifecycle management based on key points extracted from my recent book titled “Architecting Big Data & Analytics Solutions — Integrated with IoT & Cloud”. Understanding this process is essential to architecting and designing Big Data solutions.

Big Data is different from traditional data. The main differences come from characteristics such as volume, velocity, variety, veracity, value, and the overall complexity of data sets in a data ecosystem. Understanding these V words provides useful insights into the nature of Big Data.

There are many definitions of Big Data in industry and academia; however, the most succinct yet comprehensive definition, and the one I agree with, comes from Gartner: “Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”.

The only missing keyword in this definition is ‘veracity’. I’d also add to this definition that these characteristics are interrelated and interdependent.

Introduction to Big Data Lifecycle Management

As Big Data solution architects, we need to understand the data lifecycle management process because we are engaged in all phases of the lifecycle as technical leaders.

Our roles and responsibilities may differ in different phases; however, we need to be on top of lifecycle management from an end-to-end perspective.

From an architectural and solution design perspective, a typical Big Data solution, similar to a traditional data lifecycle, can include a dozen distinct phases in the overall data lifecycle process.

Big Data solution architects are engaged in all phases of the lifecycle, providing different inputs and producing different outputs for each phase.

These phases may be implemented under various names in different data solution teams.

There is no rigorous, universal, systematic approach to the Big Data lifecycle in the industry, as the field is still evolving.

The common approach is that experience from traditional data management is transferred and enhanced for particular solution use cases.

For awareness and guidance of aspiring Big Data architects, I propose the following distinct phases. In several successful data architecture projects, I used this template to ensure the lifecycle was properly covered in the solutions.

Phase 1: Foundations

Phase 2: Acquisition

Phase 3: Preparation

Phase 4: Input and Access

Phase 5: Processing

Phase 6: Output and Interpretation

Phase 7: Storage

Phase 8: Integration

Phase 9: Analytics and Visualisation

Phase 10: Consumption

Phase 11: Retention, Backup, and Archival

Phase 12: Destruction

Let me provide you with an overview of each phase with some guiding points.

You can customise the names of these phases based on the requirements and organisational data practice of your Big Data solutions.

The key point is that these names are not set in stone and provided only as guidance.

Phase 1: Foundations

In the data management process, the foundations phase includes various aspects such as understanding and validating data requirements, solution scope, roles and responsibilities, data infrastructure preparation, technical and non-technical considerations, and understanding the data rules of an organisation.

This phase requires a detailed plan facilitated ideally by a data solution project manager with substantial input from the Big Data solution architect and some data domain specialists.

A Big Data solution project includes details such as plans, funding, commercials, resourcing, risks, assumptions, issues, and dependencies in a project definition report (PDR). Project Managers compile and author the PDR; however, the solution overview in this critical artifact is provided by the Big Data Architect.

Phase 2: Data Acquisition

Data acquisition refers to collecting data. Data sets can be obtained from various sources, which can be internal or external to the business organisation.

Data sources can be structured, such as data transferred from a data warehouse, a data mart, or various transaction systems; semi-structured, such as weblogs and system logs; or unstructured, such as media files consisting of videos, audio, and pictures.
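To make these three source shapes concrete, here is a minimal sketch, assuming a PySpark environment and hypothetical landing-zone paths, of how each kind might be read during acquisition:

```python
from pyspark.sql import SparkSession

# Minimal acquisition sketch; all paths are hypothetical placeholders.
spark = SparkSession.builder.appName("acquisition").getOrCreate()

# Structured source, e.g. an extract from a data warehouse or transaction system.
orders = spark.read.csv("/landing/warehouse/orders.csv", header=True, inferSchema=True)

# Semi-structured source, e.g. JSON-formatted weblogs or system logs.
weblogs = spark.read.json("/landing/logs/*.json")

# Unstructured source, e.g. media files; Spark 3+ can capture the raw bytes and
# file metadata, while content analysis happens in later phases.
media = spark.read.format("binaryFile").load("/landing/media/*.jpg")
```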

Even though data collection is conducted by various data specialists and database administrators, the Big Data architect has a substantial role in facilitating this phase optimally.

For example, data governance, security, privacy, and quality controls start with the data collection phase. Therefore, the Big Data architect takes technical and architectural leadership of this phase.

The lead Big Data solution architect, in liaison with enterprise and business architects, leads and documents the data collection strategy, user requirements, architectural decisions, use cases, and technical specifications in this phase.

For comprehensive solutions for sizable business organisations, the lead Big Data architect can delegate some of these activities to various domain architects and data specialists.

Phase 3: Data Preparation

In the data preparation phase, the collected data, in its raw format, is cleaned or cleansed; the two terms are used interchangeably in the data practices of various business organisations.

In this phase, data is rigorously checked for inconsistencies, errors, and duplicates. Redundant, duplicated, incomplete, and incorrect data are removed. The objective is to have clean and usable data sets.
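As a simple illustration of these checks, and not a prescription from the book, a PySpark sketch over a hypothetical customers DataFrame with illustrative columns might look like this:

```python
from pyspark.sql import functions as F

# Minimal cleaning sketch over a hypothetical 'customers' DataFrame:
# remove duplicates, drop incomplete records, and filter invalid values.
cleaned = (
    customers
    .dropDuplicates(["customer_id"])                        # duplicated records
    .na.drop(subset=["customer_id", "email"])               # incomplete records
    .filter(F.col("age").between(0, 120))                   # obviously incorrect values
    .withColumn("email", F.lower(F.trim(F.col("email"))))   # normalise formatting
)
```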

The Big Data solution architect facilitates this phase. However, most data cleaning tasks, given the granularity of the activities, are performed by data specialists trained in data preparation and cleaning techniques.

Phase 4: Data Input and Access

Data input refers to sending data to planned target data repositories, systems, or applications.

For example, we can send the clean data to determined destinations such as a CRM (Customer Relationship Management) application, a data lake for data scientists, or a data warehouse for use by specific departments. In this phase, data specialists transform the raw data into a usable format.

Data access refers to accessing data using various methods, which can include relational databases, flat files, or NoSQL databases. NoSQL is more relevant and widely used for Big Data solutions in various business organisations.
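As a hedged sketch of the input and access steps together, assuming the cleaned DataFrame from the preparation example and a hypothetical data-lake path:

```python
# Data input: send the cleaned data to a designated data-lake destination.
cleaned.write.mode("overwrite").parquet("/datalake/curated/customers")

# Data access: read the curated data back and expose it to SQL-based consumers.
customers_curated = spark.read.parquet("/datalake/curated/customers")
customers_curated.createOrReplaceTempView("customers")
spark.sql("SELECT COUNT(*) AS n FROM customers").show()
```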

Even though the Big Data solution architect leads this phase, they usually delegate the detailed activities to data specialists and database administrators who can fulfil the input and access requirements.

Phase 5: Data Processing

The data processing phase starts with processing the raw form of the data. Then, we convert the data into a readable format, giving it form and context. After completing this activity, we can interpret the data using the data analytics tools selected in our business organisation.

We can use common Big Data processing tools such as Hadoop MapReduce, Impala, Hive, Pig, and Spark SQL.
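As one example among these tools, a Spark SQL transformation that gives raw data form and context might look like the following sketch; the raw_events view and its columns are hypothetical:

```python
# Minimal Spark SQL processing sketch: convert raw events into a readable,
# contextualised form. The 'raw_events' view and its columns are hypothetical.
spark.sql("""
    SELECT
        CAST(event_time AS timestamp) AS event_time,
        UPPER(country_code)           AS country,
        CAST(amount AS double)        AS amount
    FROM raw_events
    WHERE event_time IS NOT NULL
""").createOrReplaceTempView("readable_events")
```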

The popular real-time data processing tool in most of my solutions was HBase, and the near real-time data processing tool was Spark Streaming. There are many open-source and proprietary tools on the market.

Data processing also includes activities such as data annotation, data integration, data aggregation, and data representation.

Let me summarise them for your awareness.

Data annotation refers to labelling the data. For example, once the data sets are labelled, they can be ready for machine learning activities.

Data integration aims to combine data residing in different sources and to provide a unified view of the data to data consumers.

Data representation refers to the way data is processed, transmitted, and stored. These three essential functions depict the representation of data in the lifecycle.

Data aggregation aims to compile data from multiple databases into combined datasets to be used for data processing.
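To illustrate the last of these activities, here is a hedged aggregation sketch, assuming the hypothetical readable_events view from the earlier Spark SQL example:

```python
from pyspark.sql import functions as F

# Data aggregation sketch: compile event-level records into a combined,
# analysis-ready dataset. Column names are hypothetical.
events = spark.table("readable_events")
daily_summary = (
    events
    .groupBy(F.to_date("event_time").alias("day"), "country")
    .agg(F.count("*").alias("events"), F.sum("amount").alias("total_amount"))
)
```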

In the data processing phase, data may change its format based on consumer requirements. Processed data can be used in various data outputs in data lakes, in enterprise networks, and connected devices.

We can further analyse the data sets with advanced processing techniques using various tools such as Spark MLlib, Spark GraphX, and several other machine learning tools.
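As a hedged sketch of such advanced processing, a minimal Spark MLlib clustering example over the aggregate from the previous sketch could look like this; the feature columns are illustrative only:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Minimal Spark MLlib sketch: cluster aggregated records into segments.
# Assumes the hypothetical 'daily_summary' DataFrame from the earlier sketch.
assembler = VectorAssembler(inputCols=["events", "total_amount"], outputCol="features")
features = assembler.transform(daily_summary)

model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
segmented = model.transform(features)  # adds a 'prediction' (cluster) column
```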

Big Data processing requires the involvement of various team members with different skill sets.

While the lead Big Data solution architect leads the processing phase, most of the tasks are performed by data specialists, data stewards, data engineers, and data scientists.

The Big Data solution architect facilitates the end-to-end process for this phase.

Phase 6: Data Output & Interpretation

In the data output phase, the data is in a format ready for consumption by business users. We can transform data into usable formats such as plain text, graphs, processed images, or video files.

The output phase declares the data ready for use and sends it to the next phase for storage. In some data practices and business organisations, this phase is also called data ingestion. For example, the data ingestion process aims to import data for immediate or future use, or to keep it in a database format.

The data ingestion process can run in real time or in batch mode. Some standard Big Data ingestion tools commonly used in my solutions were Sqoop, Flume, and Spark Streaming. These are popular open-source tools.
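To illustrate the real-time path, here is a minimal sketch using Spark Structured Streaming, the successor to the Spark Streaming API mentioned above; the Kafka broker, topic, and paths are hypothetical, and the spark-sql-kafka package must be on the classpath:

```python
# Real-time ingestion sketch with Spark Structured Streaming.
# Broker address, topic name, and paths are hypothetical placeholders.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
)

query = (
    stream.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .writeStream
    .format("parquet")
    .option("path", "/datalake/raw/clickstream")
    .option("checkpointLocation", "/checkpoints/clickstream")
    .start()
)
```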

One of the activities in this phase is to interpret the ingested data. This requires analysing the ingested data and extracting information or meaning from it to answer the questions posed by the Big Data business solutions.

Phase 7: Data Storage

Once we complete the data output phase, we store the data in designed and designated storage units. These units are part of the data platform and infrastructure design, which considers all non-functional architectural aspects such as capacity, scalability, security, compliance, performance, and availability.

The infrastructure can consist of storage area network (SAN), network-attached storage (NAS), or direct-attached storage (DAS) formats. Data and database administrators can manage the stored data and allow access to the defined user groups.

Big Data storage can include underlying technologies such as database clusters, relational data storage, or extended data storage, e.g. HDFS and HBase, which are open-source systems.

In addition, file formats such as text, binary, or other specialised formats such as SequenceFile, Avro, and Parquet must be considered in the data storage design phase.
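As a brief illustration of this format decision, here is a sketch writing the same hypothetical dataset in two of the formats named above; Parquet support is built into Spark, while Avro requires the external spark-avro package:

```python
# Storage format sketch with the hypothetical 'daily_summary' DataFrame.
daily_summary.write.mode("overwrite").parquet("/datalake/storage/summary_parquet")
daily_summary.write.mode("overwrite").format("avro").save("/datalake/storage/summary_avro")
```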

Phase 8: Data Integration

In traditional models, the data management process ends once the data is stored. However, for Big Data, there may be a need to integrate the stored data with different systems for various purposes.

Data integration is a complex and essential architectural consideration in the Big Data solution process. Big Data architects are engaged to architect and design the use of various data connectors for the integration of Big Data solutions.

There may be use cases and requirements for many connectors such as ODBC, JDBC, Kafka, DB2, Amazon S3, Netezza, Teradata, Oracle and many more based on the data sources used in the solution.
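For example, here is a hedged sketch of pulling a table from a relational source over JDBC; the URL, credentials, and table name are placeholders, and the matching JDBC driver must be on the classpath:

```python
# Integration sketch via a JDBC connector; connection details are hypothetical.
warehouse_orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)
```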

Some data models may require integration of data lakes with a data warehouse or data marts. There may also be application integration requirements for Big Data solutions.

For example, some integration activities may comprise integrating Big Data with dashboards, Tableau, websites, or various data visualisation applications. This activity may overlap with the next phase, data analytics.

Phase 9: Data Analytics & Visualisation

Integrated data can be useful and productive for data analytics and visualisation.

Data analytics is a significant component of Big Data management process. This phase is critical because this is where business value is gained from Big Data solutions. Data visualisation is one of the key functions of this phase.

We can use many productivity tools for analytics and visualisation based on the requirements of the solution. In my Big Data solutions, the most commonly used tools were Scala, Python, and R notebooks. Python was selected as the most productive tool, touching almost all aspects of data analytics, especially in empowering machine learning initiatives.
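As a small notebook-style sketch of this phase, assuming the hypothetical daily_summary aggregate from earlier and that pandas and matplotlib are installed:

```python
import matplotlib.pyplot as plt

# Visualisation sketch: bring a small aggregate to the driver as a pandas
# DataFrame and chart it, as one would in a Python notebook.
pdf = daily_summary.orderBy("day").toPandas()

plt.figure(figsize=(8, 4))
plt.plot(pdf["day"], pdf["total_amount"], marker="o")
plt.title("Daily total amount")
plt.xlabel("Day")
plt.ylabel("Total amount")
plt.tight_layout()
plt.show()
```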

In your business organisation, there may be a team responsible for data analytics led by a chief data scientist. Big Data solution architects have a limited role in this phase; however, they work closely with the data scientists to ensure the analytics practice and platforms are aligned with business goals.

The Big Data solution architect needs to ensure the phases of the lifecycle are completed with architectural rigour.

Phase 10: Data Consumption

Once data analytics takes place, the data is turned into information ready for consumption by internal or external users, including customers of the business organisation.

Data consumption requires architectural input for policies, rules, regulations, principles, and guidelines. For example, data consumption can be based on a service provision process, for which data governance bodies create the regulations.

The lead Big Data solution architect leads and facilitates the creation of these policies, rules, principles, and guidelines using an architectural framework selected in the business organisation.

Phase 11: Retention, Backup, & Archival

We know that critical data must be backed up for protection and to meet industry compliance requirements.

We need to use established data backup strategies, techniques, methods, and tools. The Big Data solution architect must identify, document, and obtain approval for the retention, backup, and archival decisions.

The Big Data solution architect may delegate the detailed design of this phase to an infrastructure architect assisted by several data, database, storage, and recovery domain specialists.

Some data, for regulatory or other business reasons, may need to be archived for a defined period of time. The data retention strategy must be documented, approved by the governing body, especially by enterprise architects, and implemented by the infrastructure architects and storage specialists.
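As a simplified, hypothetical sketch of a retention rule, and not a substitute for approved backup and archival tooling, plain Python can move files past a retention window into an archive area:

```python
import shutil
import time
from pathlib import Path

# Simplified archival sketch: move files older than a retention window from
# active storage to an archive area. Paths and the window are hypothetical.
RETENTION_DAYS = 365
cutoff = time.time() - RETENTION_DAYS * 24 * 3600

active = Path("/datalake/active")
archive = Path("/datalake/archive")
archive.mkdir(parents=True, exist_ok=True)

for f in active.rglob("*.parquet"):
    if f.stat().st_mtime < cutoff:
        shutil.move(str(f), archive / f.name)
```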

Phase 12: Data Destruction

There may be regulatory requirements to destroy a particular type of data after a certain amount of time.

The destruction requirements may change from industry to industry.

You need to confirm the destruction requirements with the data governance team in your business organisation.

Conclusions

Even though there is a chronological order to lifecycle management, when producing Big Data solutions, some phases may slightly overlap and can be done in parallel. Your organisation's proprietary method may require a certain order. You need to check with the method exponent in your organisation's data practice division.

The life cycle proposed in this article is only a guideline for awareness of the overall process. You can customise the process based on the structure of the data solution team, unique organisational data platforms, data solution requirements, use cases, and dynamics of the owner organisation, its departments, or the overall enterprise ecosystem.

This has been a quick overview of the Big Data lifecycle management using twelve phases. In the next article, I plan to introduce the Big Data solution components in further detail.

I also write articles on IoT (Internet of Things), Artificial Intelligence, Cognitive Computing, Business Architecture, and Enterprise Architecture disciplines.

You are welcome to join my 100K+ mailing list, to collaborate, enhance your network, and receive a technology newsletter reflecting my industry experience.

If you enjoyed this article, you might check out a summary of my digital transformation handbook, freely available on Medium.

In addition to technology, I write about health and well-being. I collect my stories on Euphoria, my personal publication on Medium.

Sample Health Improvement Articles for New Readers

If you are a new reader and find this article valuable, you might check my holistic health and well-being stories reflecting on my reviews, observations, and decades of sensible experiments.

I write about various hormones and neurotransmitters such as dopamine, serotonin, oxytocin, GABA, acetylcholine, norepinephrine, adrenaline, glutamate, and histamine.

One of my goals as a writer is to raise awareness about the causes and risk factors of prevalent diseases that can lead to suffering and death for a large portion of the population.

To raise awareness about health issues, I have written several articles that present my holistic health findings from research, personal observations, and unique experiences. Below are links to these articles for easy access.

Metabolic Syndrome, Type II Diabetes, Fatty Liver Disease, Heart Disease, Strokes, Obesity, Liver Cancer, Autoimmune Disorders, Homocysteine, Lungs Health, Pancreas Health, Kidneys Health, NCDs, Infectious Diseases, Brain Health, Dementia, Depression, Brain Atrophy, Neonatal Disorders, Skin Health, Dental Health, Bone Health, Leaky Gut, Leaky Brain, Brain Fog, Chronic Inflammation, Insulin Resistance, Elevated Cortisol, Leptin Resistance, Anabolic Resistance, Cholesterol, High Triglycerides, Metabolic Disorders, Gastrointestinal Disorders, Thyroid Disorder, and Major Diseases.

I also wrote about valuable nutrients. Here are the links for easy access:

Lutein/Zeaxanthin, Phosphatidylserine, Boron, Urolithin, taurine, citrulline malate, biotin, lithium orotate, alpha-lipoic acid, n-acetyl-cysteine, acetyl-l-carnitine, CoQ10, PQQ, NADH, TMG, creatine, choline, digestive enzymes, magnesium, zinc, hydrolyzed collagen, nootropics, pure nicotine, activated charcoal, Vitamin B12, Vitamin B1, Vitamin D, Vitamin K2, Omega-3 Fatty Acids, N-Acetyl L-Tyrosine, Cod Liver Oil, and other nutrients to improve metabolism and mental health.

Disclaimer: Please note that my posts do not include professional or health advice. I document my reviews, observations, experience, and perspectives only to provide information and create awareness.

I publish my lifestyle, health, and well-being stories on EUPHORIA. My focus is on metabolic, cellular, mitochondrial, and mental health. Here is my collection of Insightful Life Lessons from Personal Stories.

If you enjoy writing and storytelling, you can join Medium, NewsBreak, and Vocal as a creator to find your voice, reach out to a broad audience, and monetize your content.

You may also check my blog posts about my articles and articles of other writers contributing to my publications on Medium. I share them on my website digitalmehmet.com. Here is my professional bio. You can contact me via weblink.

As a writer, blogger, content developer, and reader, you might join Medium, Vocal Media, NewsBreak, Medium Writing Superstars, Writing Paychecks, WordPress, and Thinkers360 with my referral links. These affiliate links will not cost you extra to join the services.

You might join my six publications on Medium as a writer by sending a request via this link. 18K+ writers contribute to my publications. You might find more information about my professional background.

If you enjoy reading, you may join Medium with my referral link for limitless access to my stories and other writers.
