Summary

The Azure Data Factory (ADF) article emphasizes the importance of creating reusable datasets to optimize data pipeline management and reduce resource duplication.

Abstract

The article "Azure Data Factory: Why You Need Reusable Datasets & How to Create Them?" outlines the benefits of using generic datasets within ADF to streamline the data engineering process. It explains that datasets in ADF represent logical references to data used in pipelines, containing details on data structure, connection, and format. The narrative presents a scenario with three data engineers, John, David, and Sara, each creating specific datasets for their projects, leading to redundancy and increased maintenance. The problem with this approach is highlighted by the potential for an unsustainable increase in datasets as new projects arise, causing higher maintenance overhead and increased storage consumption in ADF. The recommended solution is to implement generic datasets that can be parameterized and reused across different projects, significantly reducing the number of datasets needed and facilitating easier maintenance and scalability. The article concludes by demonstrating how to design and use these generic datasets, showcasing the before and after impact on the ADF interface, and summarizing the advantages of this approach for data teams and schema-less data sources.

Opinions

The article suggests that the traditional approach of creating specific datasets for each project is suboptimal and can lead to an unmanageable proliferation of datasets.
It posits that parameterizing datasets and making them generic is a more efficient and scalable method for managing data pipelines in ADF.
The author believes that generic datasets can significantly reduce the time and effort required for dataset maintenance and troubleshooting.
The article conveys that generic datasets are particularly beneficial for working with schema-less data sources, such as NoSQL databases or data lakes.
It is implied that while generic datasets are advantageous, their implementation should be evaluated on a case-by-case basis to ensure they align with the specific needs of a project.

Azure Data Factory: Why You Need Reusable Datasets & How to Create Them?

This article describes the importance of reusable/generic datasets in ADF and how to create them.

Datasets in Azure Data Factory (ADF) represent a logical reference to the data that you want to ingest, transform, or output in your data pipelines.

They usually include information such as the following:

data structure of the source and destination (also called sink in ADF).
connection details to the data in source and sink.
filepath, schema and file format that ADF needs to work with.

A Scenario

You have a team of 3 data engineers. Let’s call them John, David and Sara. All 3 design ADF pipelines as part of their daily job and thus create datasets.

John’s project reads a combination of json and csv files from a SharePoint to load those into a Synapse dedicated SQL pool table called table_a.

David’s project reads csv files and binary files from a network drive into another Synapse dedicated SQL pool table called table_b.

Sara’s project reads API data through a JSON and loads it into a third table called table_c.

All of them store their incoming data into an Azure data lake storage gen2 container.

Without Generic Datasets

Without creating generic datasets for their needs, John, David and Sara create datasets that cater to their specific use cases.

✓ John creates 3 datasets — csv file, json file and sql dataset for table_a. ✓ David creates 3 datasets — csv file, binary file and sql dataset for table_b. ✓ Sara creates 2datasets — json file and sql dataset for table_c.

Following is how their datasets look in ADF —

To provide further context, when John, David and Sara created their datasets, they linked those datasets to their data stores only. To put this into perspective, below are the datasets configured for the different tables —

John’s sink dataset —

David’s sink dataset —

Sara’s sink dataset —

The Problem

Can their data pipelines function like this? Absolutely. Assuming no other error occurs, all the 3 data pipelines will function and load data into their respective tables.

So what’s the problem?

The question to ask is if this approach is optimized and can this be sustained? We can see our ADF already contains-

3 datasets for the 3 different tables
2 datasets for the 2 csv files
2 datasets for the 2 json files

What happens six months or a year from now? As the influx of projects increases, new csv/excel/json files are discovered and new tables get created, our original strategy will balloon the resources used in ADF since the number of new datasets created appears to be directly proportional to the number of new files/tables.

There are additional challenges to be mindful about as well such as-

Higher maintenance overhead because the team has to maintain the connection settings of data sources/sink and debug multiple datasets should there be an issue with the network configuration.
ADF’s storage consumption increases with the addition of each duplicated dataset.

The Recommended Solution

To avoid problems like above, consider an approach where you create generic datasets. In simple terms, design a dataset such as a csv or sql table dataset that can be reused for different projects. To apply this approach, let’s analyze how many generic datasets we need to create-

As we know, John and David’s project uses csv files and thus I will create 1 generic csv dataset instead of John and David needing to create 2 datasets which is a redundant approach.
John and Sara’s project uses a JSON file each to load data into a SQL table. I will also create a generic dataset for a JSON.
Finally, all 3 projects feed into different tables which, even thought are different tables, use the same data warehouse/server. So I can create 1 generic dataset that caters to all 3 tables.

Let’s design the generic dataset

Generic CSV Dataset

Starting with generic csv dataset, go to parameters, click + New and add 3 different parameters. Since the datasets reside in a folder structure on Azure’s data lake, I will provide the following names —

main_folder
directory
filename

Return to the Connection tab and configure the parameters as shown-

Click Save.

Generic SQL Dataset

Navigate to any SQL dataset and click Parameters. Add 2 parameters-

schema_name
table_name

Return to the Connection tab and check Enter manually checkbox. Next, click Add dynamic content.

But, what if in another project, a CSV file uses a different column delimiter?

Often, there will be scenarios where despite your generic dataset’s availability, it may not be of use in a project. As an example, the default value of column delimiter property for a csv is comma. If a project you are working on uses a different delimiter, you will have to either modify the existing generic csv dataset or create a new csv dataset.

Here, I advise to have a forethought and parameterize as many fields as you can. This enables users to provide their custom values while still reusing the same dataset.

How to use a generic dataset?

Going back to the John’s project, he now has 2 generic datasets to work with — one for csv and one for sql. Create a copy activity and from the source, select the generic csv dataset.

John can now provide the values for all the parameters as shown above. The same dataset can now be used by David to load his csv file in the table_b.

On the sink side, select the generic sql dataset.

Because the schema and table name properties were parameterized, John is able to provide the schema and table name for his project. The same dataset can be used by David and Sara as well by simply providing the names of their schema and table to the parameters.

Before and After

Shown below is how the datasets look in ADF before creating generic datasets and after creating generic datasets —

Summary

To summarize, creating generic datasets not only optimizes a data team’s organization but also saves time by reducing the maintenance overhead needed to create/troubleshoot existing datasets. They are particularly useful for working with schema-less data sources like NoSQL databases or data lakes. Of course, I’d urge you to evaluate your use case and assess if creating a generic dataset provides any advantages or not.