The Power of DataOps: Automated Data Unification Strategies for Scalable Solutions
Data unification creates a single, consistent view of data drawn from multiple sources, which is essential for business intelligence. Traditionally, organizations built these views with manually constructed rules in ETL systems, but this approach does not scale as the complexity of the problem increases. Machine learning opened a new era in data unification: scalable, automated systems capable of handling large and complex data sources.

Unification
The core requirements for unifying data sources are:
- Extracting data from a data source into a central processing location
- Transforming or standardizing data elements (e.g., WA to Washington)
- Cleaning data (e.g., –99 means a null value)
- Mapping schema to align attributes across source datasets (e.g., your “surname” is my “Last_Name”)
- Consolidating entities or clustering all records thought to represent the same entity (e.g., are Ronald McDonald and R. MacDonald the same clown?)
- Selecting the “golden value” for each attribute for each clustered entity
- Exporting unified data to a destination repository
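The transforming, cleaning, and schema-mapping steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the lookup tables (STATE_NAMES, NULL_SENTINELS, SCHEMA_MAP), the function unify_record, and the sample records are all hypothetical and would normally come from reference data or domain experts.

```python
# Hypothetical lookup tables, invented for this illustration.
STATE_NAMES = {"WA": "Washington", "OR": "Oregon"}
NULL_SENTINELS = {-99, "-99", ""}
SCHEMA_MAP = {
    "surname": "last_name",
    "Last_Name": "last_name",
    "st": "state",
    "State": "state",
}


def unify_record(raw):
    """Map source attributes to a shared schema, then clean and standardize values."""
    unified = {}
    for source_attr, value in raw.items():
        # Schema mapping: your "surname" is my "Last_Name".
        target_attr = SCHEMA_MAP.get(source_attr, source_attr)
        if isinstance(value, str):
            value = value.strip()
        # Cleaning: sentinel values such as -99 really mean "missing".
        if value in NULL_SENTINELS:
            value = None
        # Transforming: expand state abbreviations (WA becomes Washington).
        if target_attr == "state" and value is not None:
            value = STATE_NAMES.get(value, value)
        unified[target_attr] = value
    return unified


record_a = unify_record({"surname": "McDonald", "st": "WA", "age": -99})
record_b = unify_record({"Last_Name": "McDonald ", "State": "Washington", "age": 52})
print(record_a)
print(record_b)
```

Both sample records end up with the same attribute names and the same spelling of the state, which is what makes the later consolidation and golden-value steps possible. Hand-written tables like these are exactly the kind of rules that stop scaling once there are many sources.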
ETL (Extract, Transform, Load) systems and rules-based integration approaches were traditionally used for data unification. However, these approaches do not scale as the number and complexity of data sources increase. There are several reasons why:
- Rules in ETL systems and rules-based approaches take a great deal of time to write and maintain.
- As data sources grow in number and complexity, the rules become harder to create and understand. Beyond a few hundred rules, they exceed what a person can keep in mind.
- At scale, the rule set outstrips the ability of humans to verify it.
Enterprises typically operate at a large scale with orders of magnitude more data than ETL tools can manage. As a result, automated and machine learning-based solutions have become more popular for scalable data unification strategies.
Scalability
To achieve scalable data unification, certain rules must be followed. They include moving away from rules-based systems towards machine learning-based solutions and developing collaborative systems that involve domain experts.
A scalable approach, therefore, must perform the vast majority of its operations automatically:
This rule emphasizes that data unification systems must automate most operations to be scalable. Hand-writing rules in ETL systems is impractical at this scale, and manual operations on vast amounts of data take too much time and effort. Automation makes the data unification process faster and more efficient, reducing the time it takes to gain insights from the data. This positively impacts business systems, as the results can be used to make more informed and timely decisions.
Scalable data unification must be schema-last:
Traditionally, organizations define schemas upfront, constraining data’s ability to evolve and be adapted to new data sources. By adopting a “schema-last” approach, data unification systems can discover schemas from source attributes, enabling them to adapt to changes and accommodate data sources that change regularly. This ensures that the data unification process remains current with the latest data sources, enhancing the accuracy of insights gained from the data. Ultimately, this positively impacts business systems as the insights generated from the data are more accurate and reliable.
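One way to picture schema discovery is to propose attribute matches from the source data instead of fixing a target schema upfront. The sketch below uses simple name similarity from Python's standard difflib module. It is a deliberately simplified stand-in for the machine learning the article describes; the function names, the 0.8 threshold, and the sample attributes are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase an attribute name and drop punctuation such as underscores."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a, b):
    """Similarity between two attribute names, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def propose_mappings(source_attrs, known_attrs, threshold=0.8):
    """Suggest a known attribute for each source attribute, or None if nothing fits.

    A None result means the attribute looks new, so the unified schema
    would be extended rather than the source being forced into a fixed shape.
    """
    proposals = {}
    for attr in source_attrs:
        best = max(known_attrs, key=lambda k: name_similarity(attr, k), default=None)
        if best is not None and name_similarity(attr, best) >= threshold:
            proposals[attr] = best
        else:
            proposals[attr] = None
    return proposals


proposals = propose_mappings(
    ["Last_Name", "LASTNAME", "favorite_color"],
    ["last_name", "first_name"],
)
print(proposals)
```

Name similarity alone would not match "surname" to "Last_Name"; a realistic system would also compare the values stored under each attribute and learn from corrections made by experts.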
Scalable data unification systems must be collaborative and use domain experts to resolve ambiguities:
This rule focuses on collaboration between domain experts and computer professionals to resolve ambiguity, ensuring the data unification process is accurate and efficient. Collaboration provides a shared understanding of the data and how it should be processed, resulting in higher-quality insights. This rule recognizes that data unification is not just a technical process: it requires input from domain experts who know the data. The impact on business systems is that the insights generated from the data are more reliable, leading to better decision-making.
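In practice, collaboration often means that the system decides the clear cases itself and sends only the ambiguous ones to a person. The sketch below shows that routing idea. The match scores, the two thresholds, and the function route_match are hypothetical; in a real system the scores would come from a trained model and the thresholds would be tuned.

```python
def route_match(score, auto_accept=0.9, auto_reject=0.3):
    """Decide what to do with a candidate pair given a match score in [0, 1]."""
    if score >= auto_accept:
        return "merge"
    if score <= auto_reject:
        return "keep_separate"
    # Ambiguous: a domain expert makes the call.
    return "ask_expert"


# Hypothetical scores for candidate record pairs.
candidate_pairs = [
    (("Ronald McDonald", "Ronald McDonald"), 0.98),
    (("Ronald McDonald", "R. MacDonald"), 0.62),
    (("Ronald McDonald", "Jane Smith"), 0.05),
]

review_queue = [pair for pair, score in candidate_pairs if route_match(score) == "ask_expert"]
print(review_queue)
```

Only the uncertain pair reaches the expert, which keeps human effort focused where it adds the most value. Expert answers can also be fed back as training examples.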
Only machine learning can rise to the problem sizes found in large enterprises:
Machine learning is essential in scalable data unification systems as the problem sizes found in large enterprises are too complex for traditional ETL systems to handle. Machine learning can effectively manage these large volumes of data and make automated decisions when possible, thereby increasing the efficiency of the data unification process. The impact on business systems is that machine learning enables faster and more accurate data unification, which leads to more timely and informed decisions. This ultimately improves the overall performance of business systems.
Scalable unification systems must scale to multiple cores and processors:
With the increasing volume of data, organizations must ensure that their data unification systems have the processing power required. This rule emphasizes that scalable unification systems must be able to scale to multiple cores and processors so that they can handle the large amount of data generated by various sources. This positively impacts business systems by ensuring that the data unification process is efficient, timely, and able to handle large volumes of data.
Scalable unification systems must use a parallel algorithm with lower complexity than N**2:
As the volume of data increases, it becomes increasingly challenging to process it promptly and efficiently. Comparing every record against every other record takes on the order of N**2 comparisons for N records, which quickly becomes infeasible at enterprise scale. This rule emphasizes that scalable unification systems need an algorithm that is both parallel and of lower complexity than N**2, so that the work grows manageably as data is added. This ensures that the data unification process is efficient and that insights can be gained from the data promptly. The impact on business systems is that the insights generated from the data can be used to make informed decisions crucial for success.
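A common way to stay below N**2 comparisons is blocking: group records by a cheap key and compare only records that share a key. The sketch below illustrates the idea and is not necessarily the algorithm any particular product uses. The blocking key (first letter of the last name plus state) and the sample records are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical cheap key: first letter of the last name plus the state."""
    return (record["last_name"][:1].lower(), record["state"])


def candidate_pairs(records):
    """Return index pairs to compare, restricted to records that share a block."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"last_name": "McDonald", "state": "Washington"},
    {"last_name": "MacDonald", "state": "Washington"},
    {"last_name": "Smith", "state": "Washington"},
    {"last_name": "Smyth", "state": "Washington"},
    {"last_name": "Miller", "state": "Oregon"},
    {"last_name": "Jones", "state": "Oregon"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Because blocks do not depend on each other, each one can be handed to a separate core or machine. The savings depend on the key: if most records land in one block, the cost drifts back toward N**2, and a key that is too strict can separate true matches.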
Scalable unification systems must examine the changed records and perform incremental unification:
With the many changes that occur in data sources, it is crucial that scalable data unification systems can identify changes and perform incremental unification. This rule emphasizes that data unification systems must examine the changed records and only process those that have been modified rather than processing the entire dataset repeatedly. This ensures that the data unification process is efficient and that insights are gained from the most up-to-date information. The impact on business systems is that insights generated from the data are more reliable and up-to-date, enabling organizations to make more informed decisions that lead to improved performance.
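One simple way to find changed records is to store a fingerprint of each record and reprocess only those whose fingerprint differs on the next run. The sketch below assumes every record has a stable identifier and fits in a dictionary; the function names and sample snapshots are hypothetical.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of a record's content, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_ids(records_by_id, seen_fingerprints):
    """Return ids of new or modified records and remember their fingerprints."""
    changed = []
    for record_id, record in records_by_id.items():
        digest = fingerprint(record)
        if seen_fingerprints.get(record_id) != digest:
            changed.append(record_id)
            seen_fingerprints[record_id] = digest
    return changed


seen = {}
snapshot_1 = {
    1: {"last_name": "McDonald", "state": "Washington"},
    2: {"last_name": "Smith", "state": "Oregon"},
}
first_run = changed_ids(snapshot_1, seen)

snapshot_2 = {
    1: {"state": "Washington", "last_name": "McDonald"},  # same content, new key order
    2: {"last_name": "Smith", "state": "Washington"},     # modified
    3: {"last_name": "Jones", "state": "Oregon"},         # new
}
second_run = changed_ids(snapshot_2, seen)
print(first_run, second_run)
```

On the second run only records 2 and 3 need to go through unification again. This sketch does not handle deleted records, which a complete system would also need to detect.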
Data unification systems must prioritize data quality over speed:
While speed is essential in data unification, this rule emphasizes that data quality should not be compromised. Data quality issues can lead to incorrect insights, which can be costly for organizations. Therefore, scalable data unification systems must prioritize data quality over speed, ensuring that data is accurate and reliable. The impact on business systems is that insights generated from the data are more accurate, leading to better decision-making and improved performance. Prioritizing data quality ensures that organizations can reap the benefits of data-driven decision-making.
Additional Reading and Resources (mixture of free and subscription services):
- For PM, PMM, & ML: Bits, Bytes, and Bots
- For Education & Analytics: Education on Education
Palmer, Andy, et al. “Getting DataOps Right.” DataOps: The Complete Guide to Enterprise Data Operations Transformation, O’Reilly Media, Inc., 2019, pp. 1–13.
