DataKitchen

The DataOps Enterprise Software Industry, 2020

People are talking about DataOps. Companies are marketing DataOps products and services, and organizations are adopting DataOps to improve the efficiency, quality and cycle time of their data analytics.

If you are unfamiliar with the term, DataOps is a new approach to the end-to-end data lifecycle that applies processes and methodologies from software development and manufacturing to data analytics. Agile software development helps deliver new analytics faster and with higher quality. DevOps automates the deployment of new analytics and data. Statistical process control, used in lean manufacturing, tests and monitors the quality of data flowing through the data-analytics pipeline.
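To make the statistical process control idea concrete, here is a minimal sketch in Python. It is an illustration only, not the method of any vendor listed in this article: it assumes the monitored metric is a daily row count and that control limits are set at three standard deviations around the historical mean. The function names and sample numbers are hypothetical.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Return (lower, upper) control limits computed from past observations."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread


def in_control(value, history, sigmas=3.0):
    """True if a new observation falls inside the control limits."""
    lower, upper = control_limits(history, sigmas)
    return lower <= value <= upper


# Hypothetical row counts from previous pipeline runs.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

# A typical day passes; a sudden drop would raise an alert.
typical_ok = in_control(1015, daily_row_counts)
drop_ok = in_control(400, daily_row_counts)
```

In a real pipeline the same check would run on every batch, and an out-of-control value would stop the run or notify the team rather than just return `False`.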

Figure: Components of a DataOps Enterprise Solution. Source: Eckerson Group.

Growing enterprise interest in DataOps has spawned a robust ecosystem of vendors. To date, over $50M has been invested in companies that market a wide array of DataOps products and services.

Please email us if we forgot anyone or if you have any comments.

Key Components of a DataOps Platform

There are four key software components of a DataOps Platform: data pipeline orchestration, testing and production quality, deployment automation, and data science model deployment / sandbox management. Below is our running list of the vendors in each group.

1. Data Pipeline Orchestration: DataOps needs a directed graph-based workflow that contains all the data access, integration, model, and visualization steps in the data analytics production process.

  • Airflow — an open-source platform to programmatically author, schedule, and monitor data pipelines.
  • Apache Oozie — an open-source workflow scheduler system to manage Apache Hadoop jobs.
  • DBT (Data Build Tool) — a command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively.
  • BMC Control-M — a digital business automation solution that simplifies and automates diverse batch application workloads.
  • Composable Analytics — a DataOps Enterprise Platform with built-in services for data orchestration, automation, and analytics.
  • DataKitchen — a DataOps Platform that reduces analytics cycle time by monitoring data quality and providing automated support for the deployment of data and new analytics.
  • Reflow — Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs.
  • ElementL — A company, currently in stealth, founded by ex-Facebook director and GraphQL co-creator Nick Schrock. Open-source project: Dagster.
  • Astronomer.io — Astronomer recently re-focused on Airflow support. They make it easy to deploy and manage your own Apache Airflow webserver, so you can get straight to writing workflows.
  • Piperr.io — Use Piperr’s pre-built data pipelines across enterprise stakeholders: From IT to Analytics, From Tech, Data Science to LoBs.
  • Prefect Technologies — Open-source data engineering platform that builds, tests, and runs data workflows.
  • Genie — Distributed Big Data Orchestration Service by Netflix
  • Saagie — Saagie Data Fabric seamlessly orchestrates big data technologies to automate analytics workflows and deploy business apps anywhere.
  • DataOps.live — DataOps for Snowflake: 100% of your DataOps needs in one end-to-end platform. (added May 2020)
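The directed-graph workflow described for this category can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the API of Airflow or any other tool above: it uses the standard library's `graphlib` (Python 3.9+), and the step names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies):
    """Run each step after all of the steps it depends on.

    steps: dict mapping step name -> zero-argument callable
    dependencies: dict mapping step name -> set of prerequisite step names
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = [steps[name]() for name in order]
    return order, results


# Hypothetical steps mirroring access, integration, model, and visualization.
steps = {
    "access": lambda: "raw data loaded",
    "integrate": lambda: "sources joined",
    "model": lambda: "model scored",
    "visualize": lambda: "dashboard refreshed",
}

# Each key lists the steps that must finish before it can start.
dependencies = {
    "integrate": {"access"},
    "model": {"integrate"},
    "visualize": {"model"},
}

order, results = run_pipeline(steps, dependencies)
```

Real orchestrators add scheduling, retries, and monitoring on top of this ordering, but the dependency graph is the common core.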

2. Automated Testing and Production Quality and Alerts: DataOps automatically tests and monitors the production quality of all data and artifacts in the data analytics production process, and tests code changes during the deployment process.

  • ICEDQ — software used to automate the testing of ETL/Data Warehouse and Data Migration.
  • Naveego — A simple, cloud-based platform that allows you to deliver accurate dashboards by taking a bottom-up approach to data quality and exception management.
  • DataKitchen — a DataOps Platform that improves data quality by providing lean manufacturing controls to test and monitor data.
  • FirstEigen — Automatic Data Quality Rule Discovery and Continuous Data Monitoring
  • Great Expectations — Great Expectations is a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time). By Superconductive: https://www.superconductive.ai/index.html
  • Enterprise Data Foundation — Open-source enterprise data toolkit providing efficient unit testing, automated refreshes, and automated deployment.
  • RightData — a self-service suite of applications that help you achieve Data Quality Assurance, Data Integrity Audit and Continuous Data Quality Control with automated validation and reconciliation capabilities.
  • QuerySurge — Continuous Testing with QuerySurge for DevOps. QuerySurge is the smart Data Testing solution that automates the data validation and testing of Big Data, Data Warehouses, and Business Intelligence Reports.
  • CompactBI — TestDrive is a testing framework for your data and the processes behind them. (added July 2020) (Acquired by Informatica, July 2020)
  • Tricentis TOSCA — Tricentis BI and Data Warehouse testing ensures data integrity faster, more rigorously, and more reliably than manual ETL testing and report verification. (added April 2020)
  • Databand — Data Pipeline Performance Monitoring, Observability for data engineering teams. Achieve total visibility over your data pipelines, from source to machine learning model. (added Feb 2020)
  • Soda Data Monitoring — Soda tells you which data is worth fixing. Soda doesn’t just monitor datasets and send meaningful alerts to the relevant teams. It identifies and prioritizes data issues that are causing your business the most damage, and walks you through a resolution workflow. (added Feb 2020)
  • ToroData — Detect data quality problems and anomalies automatically. Then fix them before they hit production. (added May 2020)
  • OwlDQ — Unified Data Quality. Do you catch bad data, or does bad data catch you by surprise? (added June 2020)
  • Monte Carlo Data — Data reliability delivered. Data breaks. We ensure your team is the first to know and the first to solve. (added June 2020)
  • AccelData — Observability for Analytics & AI. Observe, optimize, and scale enterprise data pipelines. (added October 2020)
  • Validio — Automated real-time data validation and quality monitoring. (added March 2021)
  • LightUp Data — Silent breaks in data availability, conformity, and validity are leaving you exposed to hidden data outages. (added March 2021)
  • BigEval — Get the most professional tools to validate enterprise data and maintain a high level of information quality. (added March 2021)
  • Telm.ai — Telm.ai helps data engineers and data architects to design, build and maintain robust and reliable data systems. (added March 2021)
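Tests in this category are applied to the data itself, each time a batch arrives, rather than to code. The sketch below shows the general shape of such a check in plain Python. It does not use the API of Great Expectations or any other product listed here; the column names, rules, and the `check_batch` function are assumptions made for illustration.

```python
def check_batch(rows, required=("id", "amount")):
    """Return a list of human-readable failures for one batch of records."""
    failures = []
    if not rows:
        failures.append("batch is empty")
    seen_ids = set()
    for index, row in enumerate(rows):
        # Every required column must be present and non-null.
        for column in required:
            if row.get(column) is None:
                failures.append(f"row {index}: missing {column}")
        # The id column must be unique within the batch.
        row_id = row.get("id")
        if row_id is not None:
            if row_id in seen_ids:
                failures.append(f"row {index}: duplicate id {row_id}")
            seen_ids.add(row_id)
    return failures


# Hypothetical batches: one clean, one with a null and a duplicate key.
good_batch = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 12.5}]
bad_batch = [{"id": 1, "amount": 10.0}, {"id": 1, "amount": None}]

good_failures = check_batch(good_batch)
bad_failures = check_batch(bad_batch)
```

A pipeline would typically halt or raise an alert when the returned list is not empty, so bad data never reaches production reports.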

3. Deployment Automation and Development Sandbox Creation: DataOps continuously moves code and configuration from development environments into production.

  • Jenkins — a ‘CI/CD’ tool used by software development teams to deploy code from development into production
  • DataKitchen — a DataOps Platform that supports the deployment of all data analytics code and configuration.
  • Amaterasu — is a deployment tool for data pipelines. Amaterasu allows developers to write and easily deploy data pipelines, and clusters manage their configuration and dependencies.
  • Lentiq — Lentiq is the data science environment that brings your projects to life. (added July 2019)
  • Harbr_ — Harbr is a complete solution for your customers, suppliers, partners and employees to exchange, monetize and collaborate on data and models (added June 2020)

4. Data Science Model Deployment: DataOps-driven data science teams make reproducible development environments and move models into production. Some have called this “MLOps” or “ModelOps.”

  • Domino — accelerates the development and delivery of models with infrastructure automation, seamless collaboration, and automated reproducibility.
  • Hydrosphere.io — deploys batch Spark functions, machine-learning models, and assures the quality of end-to-end pipelines.
  • Open Data Group — a software solution that facilitates the deployment of analytics using models.
  • ParallelM — moves machine learning into production, automates orchestration, and manages the ML pipeline. (acquired by DataRobot June 2019)
  • Seldon — streamlines the data science workflow, with audit trails, advanced experiments, continuous integration, and deployment.
  • Metis Machine — Enterprise-scale Machine Learning and Deep Learning deployment and automation platform for rapid deployment of models into existing infrastructure and applications.
  • Datatron — Automate deployment and monitoring of AI Models
  • DataKitchen — a DataOps Platform that supports the testing and deployment of data science models and the creation of sandbox data science environments.
  • DSFlow — Go from data extraction to business value in days, not months. Build on top of open source tech, using Silicon Valley’s best practices.
  • Datmo — Datmo tools help you seamlessly deploy and manage models in a scalable, reliable, and cost-optimized way.
  • MLflow — An open-source platform for the complete machine learning lifecycle, from Databricks.
  • Studio.ML — Studio is a model management framework written in Python to help simplify and expedite your model building experience.
  • Comet.ML — Comet.ml allows data science teams and individuals to automagically track their datasets, code changes, experimentation history and production models creating efficiency, transparency, and reproducibility.
  • Polyaxon — An open source platform for reproducible machine learning at scale.
  • Kubeflow — The Machine Learning Toolkit for Kubernetes
  • Verta.ai — Models are the new code!
  • Omega | ML — Python AI/ML analytics deployment & collaboration for humans (added July 2019)
  • CD Foundation SIG on MLOps (added Mar 2020)

DataOps Supporting Functions

In addition to the foundational tools above, there are many software components that play a critical supporting role in the DataOps ecosystem.

  1. Code and artifact storage (e.g., Git, Docker Hub)
  2. Parametrization and secure key storage (e.g., Vault, Jinja2)
  3. Distributed computing (e.g., Mesos, Kubernetes)
  4. DataSecOps, Versioning, or Test Data Management:
  • Delphix — A software platform that enables teams to virtualize, secure and manage data.
  • Pachyderm — version control for data, similar to what Git does with code.
  • Quilt Data — Quilt versions and deploys data: like Docker for Data
  • Privitar — More data-driven decisions without compromising on privacy. Get more business value from sensitive data while enhancing privacy protection.
  • DVC — Open-source Version Control System for Machine Learning Projects … data version control
  • Instabase — a platform for data management and version control
  • Datical — Database release automation for software development teams
  • DBMaestro — Automate & govern database releases to accelerate time-to-market while preventing downtime & data-loss.
  • Hazy — Hazy generates smart synthetic data that’s safe to use and actually works as a drop-in replacement for real data science, model training and analytics workloads. (added April 2020)
  • GenRocket — Achieve Continuous Testing with Enterprise Test Data Generation. Our system was designed to help QA teams generate the exact test data they need at a low cost. (added April 2020)
  • Exate — The “Ink Bomb For Data” eXate unlocks the value in your data by allowing you to share it safely, because data privacy is at the heart of everything we do. (added April 2020)
  • Spectacles — Deploy your LookML with confidence. Spectacles automatically tests your LookML to ensure Looker always runs smoothly for your users. (added June 2020)

5. Big Data Performance Management

  • SelectStar — database monitoring solution with alerts, monitoring, and relationship mapping.
  • Unravel — manages the performance and utilization of big data applications and platforms.
  • Redgate — SQL tools to help users implement DataOps, monitor database performance, and provision new databases.

Other Vendors Talking DataOps

In addition to the tools above, many other vendors position their products and services around DataOps.

1. Data Integration and Unification with a DataOps Message

  • Nexla — Scalable and secure Data Operations platform that allows business users to send, receive, transform, and monitor data.
  • Switchboard Software — fully managed, cloud-hosted data operations solution that integrates, cleans, transforms and monitors data.
  • Tamr — enterprise data unification solution that uses a bottom-up, machine-learning-based approach.
  • StreamSets — The industry’s first data operations platform for full life-cycle management of data in motion.
  • Trifacta — end-user data prep.
  • Infoworks — Use Big Data Automation to Simplify Data Engineering and DataOps
  • Landoop — The enterprise overlay for Apache Kafka® & Kubernetes
  • Devo — Devo delivers real-time operational and business insights from analytics on streaming and historical data to operations, IT, security and business teams at the world’s largest organizations.
  • IBM DataOps — Organize your data to be trusted and business-ready for your Journey to AI. (added May 2020)

2. All-in-One Cloud Platforms talking DataOps

  • MapR — provides a Converged Data Platform that enables customers to harness the power of big data by combining analytics in real-time with operational applications to improve business outcomes.
  • Qubole — big-data-as-a-service company with a cloud-based platform that extracts value from huge volumes of structured and unstructured data.
  • John Snow Labs — The Data Lab is an enterprise platform featuring data integration, no-code interactive data discovery & analysis, a collaborative data science notebooks environment, and productizing models as APIs at scale.

3. Service and Consulting Organizations with a DataOps slant

  • Kinaesis — We work with our clients within Financial Services to leverage investment into Data Solutions and generate real value.
  • CapGemini — Capgemini is building a practice area around DataOps
  • John Snow Labs — Data curation, data science, data engineering, and data operations services, specializing in healthcare and life science.
  • XenonStack — DataOps, DevOps, decision support, big-data analytics, and IoT services
  • Locke Data — Data science services
  • Cognizant
  • Wipro
  • IBM — IBM renamed several of their products as DataOps
  • LEIT — LeadingEdge Technology is an Information Technology