Python’s Data Classes: a Data Engineer’s best friend
Data engineering applications of data classes
Data classes are a relatively new addition to Python, first released in Python 3.7. They provide an abstraction layer that leverages type annotations to define container objects for data. Compared to a normal Python class, data classes do away with much of the boilerplate around instantiation, and there are a number of areas where they can add value to data engineering.
Understanding Data Classes
Data classes
The dataclasses library introduces a lightweight way to define objects, automatically generating the instantiation boilerplate for the different fields defined within them.
from dataclasses import dataclass

@dataclass
class CustomerDataClass:
    ...
As shown above, it relies on the decorator pattern to wrap classes and enrich them with specific features.
Data class and field definitions
A data class is defined through a series of fields declared within the class along with their Python type annotations.
@dataclass
class CustomerDataClass:
    customer_id: int
The class can then be instantiated by providing customer_id as a constructor argument:
my_customer = CustomerDataClass(customer_id=123)
For each field defined within the data class, the generated constructor sets a regular instance attribute. The data can therefore be retrieved in the following manner:
my_customer.customer_id
>> 123
It is possible to extract the list of fields defined within the data class using the __annotations__ attribute:
CustomerDataClass.__annotations__
>> {'customer_id': int}
The __annotations__ attribute provides the raw annotations. There are, however, cleaner ways to resolve the different field types within a data class.
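For instance, as a minimal sketch, dataclasses.fields can be combined with typing.get_type_hints, which resolves annotations (including string annotations) into actual type objects:
import dataclasses
import typing

# get_type_hints resolves the annotations into real type objects
resolved_types = typing.get_type_hints(CustomerDataClass)
for field in dataclasses.fields(CustomerDataClass):
    print(field.name, resolved_types[field.name])
>> customer_id <class 'int'>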
Data class as definition objects:
@dataclass(init=False, repr=False)
class CustomerDataClass:
    customer_id: int
Data classes can also serve as pure definition objects. The init argument of the decorator determines whether an __init__ constructor will be generated for the data class. In order to use the data class in a full definition mode, it is also required to disable repr, since field values are by default outputted as part of the class’s string representation and, without initialization, those values are never set.
CustomerDataClass()
>> <__main__.CustomerDataClass at 0x10a8c3d30>
Data class and meta fields:
import dataclasses
from dataclasses import dataclass

@dataclass(init=False, repr=False)
class CustomerDataClass:
    customer_id: int = dataclasses.field(
        metadata={
            "description": "Some Customer Identifier",
        }
    )
Data classes can leverage extra properties defined through dataclasses.field, adding features such as default values, default factories, or, more importantly when leveraging data classes as definition objects, metadata. The information defined within a field’s metadata can be retrieved from the class in the following manner:
CustomerDataClass.__dataclass_fields__['customer_id'].metadata
>> mappingproxy({'description': 'Some Customer Identifier'})
Leveraging Data classes for Data Applications
Type validation:
We can use data classes to implement type validation. A dedicated library, dataclass-type-validator, exists to help support this use case.
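To illustrate the idea without the library, here is a minimal hand-rolled sketch; the ValidatedCustomer class is hypothetical, and it assumes plain (non-string) annotations so that each field’s type attribute is an actual class:
import dataclasses
from dataclasses import dataclass

@dataclass
class ValidatedCustomer:
    customer_id: int

    def __post_init__(self):
        # check each field's value against its annotated type at creation time
        for field in dataclasses.fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, field.type):
                raise TypeError(f"{field.name} expected {field.type}, got {type(value)}")

ValidatedCustomer(customer_id=123)    # passes
ValidatedCustomer(customer_id='123')  # raises TypeError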
We can also leverage data classes to validate data we would like to ingest, for example after loading it into a data frame:
import numpy as np
import pandas as pd

annotations = CustomerDataClass.__annotations__

df = pd.DataFrame([[11], [234]], columns=['customer_id'])
for index, row in df.iterrows():
    for column_name, column_type in annotations.items():
        # unwrap numpy scalars into native Python types before checking
        if isinstance(row[column_name], np.generic):
            assert isinstance(row[column_name].item(), column_type)
        else:
            assert isinstance(row[column_name], column_type)
>> passes the assertions
df = pd.DataFrame([['11'], ['234']], columns=['customer_id'])
for index, row in df.iterrows():
    for column_name, column_type in annotations.items():
        if isinstance(row[column_name], np.generic):
            assert isinstance(row[column_name].item(), column_type)
        else:
            assert isinstance(row[column_name], column_type)
>> throws an AssertionError
dtype specifications:
Sometimes it is important not only to leverage the native type annotations but also to enrich them with specific dtypes for when the data is loaded into a pandas DataFrame. Pandas’ read_csv function, for instance, lets us provide a dictionary of {“column_name”: “column_dtype”} when reading a file into a data frame. These dtypes can be inferred from a data class when specified in a metadata property:
import dataclasses
from dataclasses import dataclass

import pandas as pd

@dataclass(init=False, repr=False)
class CustomerDataClass:
    customer_id: int = dataclasses.field(
        metadata={
            "dtype": "Int64",
        }
    )

# build a {column_name: dtype} mapping from the fields' metadata
fields = CustomerDataClass.__dataclass_fields__
customer_dtypes = {field_name: fields[field_name].metadata.get('dtype') for field_name in fields}
pd.read_csv("/foo/bar", dtype=customer_dtypes)
This can be used to enforce the right dtypes, for instance pandas’ nullable “Int64” extension type for integer columns that may contain nulls, and can be more memory efficient than pandas’ automatic type inference.
SQLAlchemy Models & DDL:
It is also possible to generate a SQLAlchemy model dynamically out of a data class.
import dataclasses
from dataclasses import dataclass

from sqlalchemy import Column, Integer
from sqlalchemy.ext.declarative import declarative_base

@dataclass(init=False, repr=False)
class CustomerDataClass:
    customer_id: int = dataclasses.field(
        metadata={
            "sqlalchemy_type": Integer,
        }
    )

# base attributes of the model: table name, schema, and a primary key
attr_dict = {
    "__tablename__": "table_name",
    "__table_args__": {"schema": "schema"},
    "id": Column(Integer, primary_key=True),
}
# one column per data class field, typed from the field's metadata
sql_alchemy_dict = {
    k.lower(): Column(v.metadata.get("sqlalchemy_type"))
    for k, v in CustomerDataClass.__dataclass_fields__.items()
}
attr_dict.update(sql_alchemy_dict)

Base = declarative_base()
# dynamically create the model class from the attribute dictionary
SampleModel = type("SampleModel", (Base,), attr_dict)
The model generated is a pure SQLAlchemy model and can be instantiated like a normal model: SampleModel(customer_id=220). This model can also be used with Alembic to generate schema migrations.
The Pydantic framework provides a more direct way to leverage the type annotations to generate similar models; a decorator coming from the library is sufficient to generate the model.
from pydantic.dataclasses import dataclass

@dataclass
class CustomerDataClass:
    customer_id: int
Another use of the SQLAlchemy annotations is to leverage them to write a pandas data frame to a table with specific column types. This can be done with pandas’ to_sql method and its dtype parameter:
sql_alchemy_types = {
    k.lower(): v.metadata.get("sqlalchemy_type")
    for k, v in CustomerDataClass.__dataclass_fields__.items()
}
df.to_sql(..., dtype=sql_alchemy_types)
Protobuf:
A specific library called pure-protobuf exists that allows translating data classes into Protobuf messages. Protobuf is a protocol that facilitates data exchange across applications and programming languages.
Simple ETL
Simple ETL processes can also be defined and described as part of a data class. Operations such as type casting or column renaming can be encoded in the class and applied downstream, as the sketch below illustrates.
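As a minimal, hypothetical sketch, the source_column and cast metadata keys below are assumptions rather than a standard, driving a rename-and-cast step over a pandas data frame:
import dataclasses
from dataclasses import dataclass

import pandas as pd

@dataclass(init=False, repr=False)
class CustomerDataClass:
    # hypothetical metadata keys describing the ETL step
    customer_id: int = dataclasses.field(
        metadata={"source_column": "cust_id", "cast": "Int64"}
    )

def simple_etl(df: pd.DataFrame) -> pd.DataFrame:
    for field in dataclasses.fields(CustomerDataClass):
        # rename the raw source column to the data class field name
        df = df.rename(columns={field.metadata["source_column"]: field.name})
        # cast the column to the dtype declared in the metadata
        df[field.name] = df[field.name].astype(field.metadata["cast"])
    return df

df = simple_etl(pd.DataFrame({"cust_id": [11, 234]}))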
API
Combining FastAPI and Pydantic, it is possible to leverage data classes to build APIs in a streamlined fashion.
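As a small sketch, a Pydantic data class can serve directly as a request body; the /customers route and echo response below are purely illustrative:
from fastapi import FastAPI
from pydantic.dataclasses import dataclass

@dataclass
class CustomerDataClass:
    customer_id: int

app = FastAPI()

@app.post("/customers")
def create_customer(customer: CustomerDataClass):
    # FastAPI validates the request body against the data class annotations
    return {"customer_id": customer.customer_id}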
Summary
Data classes provide a versatile abstraction for dealing with data schemas and their downstream transformations. Through adapters, it is possible to leverage them for schema validation, DDL, APIs, or message passing. They should form part of the Swiss Army knife that data engineers working in Python leverage.