Python’s Data Classes: a Data Engineer’s best friend
Data engineering applications of data classes
Data classes are a relatively new addition to Python, first released in Python 3.7. They provide an abstraction layer that leverages type annotations to define container objects for data. Compared to a normal Python class, data classes do away with much of the boilerplate around instantiation, and there are a number of areas where they can add value to data engineering.
Understanding Data Classes
Data classes
The dataclasses library introduces a lightweight way to define objects, automatically generating the instantiation boilerplate for the different fields defined within them.
from dataclasses import dataclass

@dataclass
class CustomerDataClass:
    ...
As shown above, it relies on the decorator pattern to wrap classes and enrich them with specific features.
Data class and field definitions
A data class is defined through a series of fields declared within the class along with their Python type annotations.
@dataclass
class CustomerDataClass:
    customer_id: int
The class can then be instantiated by providing customer_id as a constructor argument:
my_customer = CustomerDataClass(customer_id=123)
For each field defined within the data class, the generated constructor sets a regular instance attribute. The data can therefore be retrieved in the following manner:
my_customer.customer_id
>> 123
It is possible to extract the list of fields defined within the data class using the __annotations__ attribute:
CustomerDataClass.__annotations__
>> {'customer_id': int}
The __annotations__ attribute provides the raw annotations. There are, however, cleaner ways to resolve the different field types within a data class.
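For instance, as a minimal sketch, dataclasses.fields can be combined with typing.get_type_hints, which resolves annotations (including string annotations) into actual type objects:
import dataclasses
import typing

# get_type_hints resolves the annotations into real type objects
resolved_types = typing.get_type_hints(CustomerDataClass)
for field in dataclasses.fields(CustomerDataClass):
    print(field.name, resolved_types[field.name])
>> customer_id <class 'int'>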
Data class as definition objects:
@dataclass(init=False, repr=False)
class CustomerDataClass:
    customer_id: int
Data classes can also serve as pure definition objects. The init argument of the decorator determines whether an __init__ constructor will be generated for the data class. In order to use the data class in a full definition mode, it is also required to disable repr, since field values are by default outputted as part of the class’s string representation and, without initialization, those values are never set.
CustomerDataClass()
>> <__main__.CustomerDataClass at 0x10a8c3d30>
Data class and meta fields:
import dataclasses
from dataclasses import dataclass

@dataclass(init=False, repr=False)
class CustomerDataClass:
    customer_id: int = dataclasses.field(
        metadata={
            "description": "Some Customer Identifier",
        }
    )
Data classes can leverage extra properties defined through dataclasses.field, adding features such as default values, default factories, or, more importantly when leveraging data classes as definition objects, metadata. The information defined within a field’s metadata can be retrieved from the class in the following manner:
CustomerDataClass.__dataclass_fields__['customer_id'].metadata
>> mappingproxy({'description': 'Some Customer Identifier'})
Leveraging Data classes for Data Applications
Type validation:
We can use data classes to implement type validation. A dedicated library, dataclass-type-validator, exists to help support this use case.
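To illustrate the idea without the library, here is a minimal hand-rolled sketch; the ValidatedCustomer class is hypothetical, and it assumes plain (non-string) annotations so that each field’s type attribute is an actual class:
import dataclasses
from dataclasses import dataclass

@dataclass
class ValidatedCustomer:
    customer_id: int

    def __post_init__(self):
        # check each field's value against its annotated type at creation time
        for field in dataclasses.fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, field.type):
                raise TypeError(f"{field.name} expected {field.type}, got {type(value)}")

ValidatedCustomer(customer_id=123)    # passes
ValidatedCustomer(customer_id='123')  # raises TypeError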
We can also leverage data classes to validate data we would like to ingest, for example after loading it into a data frame:
import numpy as np
import pandas as pd

annotations = CustomerDataClass.__annotations__

df = pd.DataFrame([[11], [234]], columns=['customer_id'])
for index, row in df.iterrows():
    for column_name, column_type in annotations.items():
        # unwrap numpy scalars into native Python types before checking
        if isinstance(row[column_name], np.generic):
            assert isinstance(row[column_name].item(), column_type)
        else:
            assert isinstance(row[column_name], column_type)
>> passes the assertions
df = pd.DataFrame([['11'], ['234']], columns=['customer_id'])
for index, row in df.iterrows():
    for column_name, column_type in annotations.items():
        if isinstance(row[column_name], np.generic):
            assert isinstance(row[column_name].item(), column_type)
        else:
            assert isinstance(row[column_name], column_type)
>> throws an AssertionError
dtype specifications:
Sometimes it is important not only to leverage the native type annotations but also to enrich them with specific dtypes for when the data is loaded into a pandas DataFrame. Pandas’ read_csv function, for instance, lets us provide a dictionary of {“column_name”: “column_dtype”} when reading a file into a data frame. These dtypes can be inferred from a data class when specified in a metadata property:
import dataclasses
from dataclasses import dataclass

import pandas as pd

@dataclass(init=False, repr=False)
class CustomerDataClass:
    customer_id: int = dataclasses.field(
        metadata={
            "dtype": "Int64",
        }
    )

# build a {column_name: dtype} mapping from the fields' metadata
fields = CustomerDataClass.__dataclass_fields__
customer_dtypes = {field_name: fields[field_name].metadata.get('dtype') for field_name in fields}
pd.read_csv("/foo/bar", dtype=customer_dtypes)
This can be used to enforce the right dtypes, for instance pandas’ nullable “Int64” extension type for integer columns that may contain nulls, and can be more memory efficient than pandas’ automatic type inference.
SQLAlchemy Models & DDL:
It is also possible to generate a SQLAlchemy model dynamically out of a data class.
import dataclasses
from dataclasses import dataclass

from sqlalchemy import Column, Integer
from sqlalchemy.ext.declarative import declarative_base

@dataclass(init=False, repr=False)
class CustomerDataClass:
    customer_id: int = dataclasses.field(
        metadata={
            "sqlalchemy_type": Integer,
        }
    )

# base attributes of the model: table name, schema, and a primary key
attr_dict = {
    "__tablename__": "table_name",
    "__table_args__": {"schema": "schema"},
    "id": Column(Integer, primary_key=True),
}
# one column per data class field, typed from the field's metadata
sql_alchemy_dict = {
    k.lower(): Column(v.metadata.get("sqlalchemy_type"))
    for k, v in CustomerDataClass.__dataclass_fields__.items()
}
attr_dict.update(sql_alchemy_dict)

Base = declarative_base()
# dynamically create the model class from the attribute dictionary
SampleModel = type("SampleModel", (Base,), attr_dict)
The model generated is a pure SQLAlchemy model and can be instantiated like a normal model: SampleModel(customer_id=220). This model can also be used with Alembic to generate schema migrations.
The Pydantic framework provides a more direct way to leverage the type annotations to generate similar models; a decorator coming from the library is sufficient to generate the model.
from pydantic.dataclasses import dataclass

@dataclass
class CustomerDataClass:
    customer_id: int
Another use of the SQLAlchemy annotations is to leverage them to write a pandas data frame to a table with specific column types. This can be done with pandas’ to_sql method and its dtype parameter:
sql_alchemy_types = {
    k.lower(): v.metadata.get("sqlalchemy_type")
    for k, v in CustomerDataClass.__dataclass_fields__.items()
}
df.to_sql(..., dtype=sql_alchemy_types)
Protobuf:
A specific library called pure-protobuf exists that allows translating data classes into Protobuf messages. Protobuf is a protocol that facilitates data exchange across applications and programming languages.
Simple ETL
Simple ETL processes can also be defined and described as part of a data class. Operations such as type casting or column renaming can be encoded in the class and applied downstream, as the sketch below illustrates.
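As a minimal, hypothetical sketch, the source_column and cast metadata keys below are assumptions rather than a standard, driving a rename-and-cast step over a pandas data frame:
import dataclasses
from dataclasses import dataclass

import pandas as pd

@dataclass(init=False, repr=False)
class CustomerDataClass:
    # hypothetical metadata keys describing the ETL step
    customer_id: int = dataclasses.field(
        metadata={"source_column": "cust_id", "cast": "Int64"}
    )

def simple_etl(df: pd.DataFrame) -> pd.DataFrame:
    for field in dataclasses.fields(CustomerDataClass):
        # rename the raw source column to the data class field name
        df = df.rename(columns={field.metadata["source_column"]: field.name})
        # cast the column to the dtype declared in the metadata
        df[field.name] = df[field.name].astype(field.metadata["cast"])
    return df

df = simple_etl(pd.DataFrame({"cust_id": [11, 234]}))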
API
Combining FastAPI and Pydantic, it is possible to leverage data classes to build APIs in a streamlined fashion.
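As a small sketch, a Pydantic data class can serve directly as a request body; the /customers route and echo response below are purely illustrative:
from fastapi import FastAPI
from pydantic.dataclasses import dataclass

@dataclass
class CustomerDataClass:
    customer_id: int

app = FastAPI()

@app.post("/customers")
def create_customer(customer: CustomerDataClass):
    # FastAPI validates the request body against the data class annotations
    return {"customer_id": customer.customer_id}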
Summary
Data classes provide a versatile abstraction for dealing with data schemas and their downstream transformations. Through adapters, it is possible to leverage them for schema validation, DDL, APIs, or message passing. They should form part of the Swiss Army knife that data engineers working in Python leverage.