How to Integrate Great Expectations with Databricks
Get better data quality metrics with one change to Great Expectations

A common challenge data engineering teams face is how to best measure data quality. Poor data quality leads to wrong insights and potentially bad business decisions. An integrated data quality framework reduces the team’s workload when assessing data quality issues.
Great Expectations (GE) is a great Python library for data quality. It comes with integrations for Apache Spark and dozens of preconfigured data expectations. Databricks is a top-tier data platform built on Spark. You’d expect the two to integrate seamlessly, but that is not quite the case.
In this article, I’ll walk through a simple change you can make to one GE class that allows for a more integrated solution between GE and Databricks.
All the code for this article is available in the repo here.
The Problem
I was hoping for a simple way to integrate GE with Databricks without switching between PySpark and configuration files, but I’ve found that using GE in a hosted environment is challenging. GE does offer a step-by-step guide on ‘How to Use Great Expectations in Databricks.’ If you follow it, though, you end up with a mountain of configuration.
Specifically, I wanted a data quality framework that would fit nicely with the Databricks Medallion Architecture and meet these requirements:
- Add minimal overhead and ‘just work’ with Databricks
- Let me write data quality tests in line with other PySpark code
- Throw an error if the underlying data changes in an unexpected way
- Save results as a file to a storage location
The end state would be an architectural pattern similar to this:

Data quality is progressively improved as data passes through each level. Along the way, each time the data is validated, the result is saved as JSON, and bad data is quarantined before loading it to the next level.
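The quarantine step can be pictured as a simple split of the incoming rows into a set that passes the checks and a set that does not. The sketch below is only an illustration under stated assumptions: plain Python dictionaries stand in for DataFrame rows so the snippet runs anywhere, and the names `split_for_quarantine`, `is_valid_order`, `order_id`, and `amount` are hypothetical, not taken from the article’s repo.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]


def split_for_quarantine(
    rows: Iterable[Row], is_valid: Callable[[Row], bool]
) -> Tuple[List[Row], List[Row]]:
    """Return (good_rows, quarantined_rows) according to the is_valid check."""
    good: List[Row] = []
    bad: List[Row] = []
    for row in rows:
        # Each row lands in exactly one of the two lists.
        (good if is_valid(row) else bad).append(row)
    return good, bad


# Hypothetical bronze-level data with two problem rows.
bronze_rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": None, "amount": 5.0},   # missing key
    {"order_id": 3, "amount": -1.0},     # negative amount
]


def is_valid_order(row: Row) -> bool:
    # Hypothetical rule: the key must be present and the amount non-negative.
    return row["order_id"] is not None and row["amount"] >= 0


silver_rows, quarantined_rows = split_for_quarantine(bronze_rows, is_valid_order)
```

On a real DataFrame the same split would typically be a pair of `filter` calls, one on the condition and one on its negation, with some care needed for null values so that no row is silently dropped from both sides.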
The Solution: Extending the SparkDFDataset class
One of the base dataset classes of GE is the SparkDFDataset. The SparkDFDataset wraps a PySpark DataFrame and has all the expectations implemented as methods.
By extending the SparkDFDataset class, you can add new methods to enhance integrations with Databricks.
The code below demonstrates how I’ve added a handful of methods to integrate GE with Databricks.
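As a minimal sketch of this kind of extension (not the code from the repo), the added behavior can live in a small mixin that is combined with SparkDFDataset. It assumes the legacy GE dataset API, where `validate()` returns a result object with a `success` attribute and a `to_json_dict()` method. The names `DatabricksValidationMixin`, `validate_and_save`, and `ValidationFailedError`, along with the table name and output path in the usage comment, are hypothetical.

```python
import json
from pathlib import Path


class ValidationFailedError(Exception):
    """Raised when a validation run reports success == False."""


class DatabricksValidationMixin:
    """Hypothetical mixin; the host class must provide validate()."""

    def validate_and_save(self, path, raise_on_failure=True, **validate_kwargs):
        # validate() is supplied by the GE dataset class this mixin is mixed into.
        result = self.validate(**validate_kwargs)

        # Save the full validation result as a JSON file.
        target = Path(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(result.to_json_dict(), indent=2))

        # Stop the pipeline if the data did not meet expectations.
        if raise_on_failure and not result.success:
            raise ValidationFailedError(
                f"Validation failed; details saved to {target}"
            )
        return result


# Intended use on Databricks (assumes a GE version that still ships the
# legacy dataset API; table name and path are placeholders):
#
#   from great_expectations.dataset import SparkDFDataset
#
#   class DatabricksDataset(DatabricksValidationMixin, SparkDFDataset):
#       pass
#
#   ds = DatabricksDataset(spark.table("bronze.orders"))
#   ds.expect_column_values_to_not_be_null("order_id")
#   ds.validate_and_save("/dbfs/tmp/dq/orders.json")
```

Keeping the extra methods in a mixin means the expectations stay in line with the rest of the PySpark code, while saving the result and raising on failure happen in a single call.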