Nicholas Leong

Summary

This article discusses a branching strategy for data engineers, emphasizing the importance of version control and branching strategies in managing changes to source code, separating development and production environments, and ensuring product stability.

Abstract

The article highlights the need for a branching strategy tailored to data engineers, as they build and maintain data warehouses rather than traditional software products. The strategy involves managing changes to source code, including SQL commands, data pipelines, CI/CD, DevOps, and data models. The article explains the differences between data engineering and software engineering branch design, focusing on the frequency of releases and the impact of changes on data warehouses. The proposed branching strategy includes DEV, UAT, and MASTER branches, with DEV containing the latest fixes and features, UAT reflecting the current state of the UAT environment, and MASTER representing the current state of the production environment. The article also discusses CI/CD pipelines, hotfixes, and deployment workflows.

Opinions

  • Data engineers require a specialized branching strategy due to the unique nature of their work in building and maintaining data warehouses.
  • The frequency of releases for data engineers is often higher than for software engineers due to urgent changes required by CEOs and CTOs.
  • A well-designed branching strategy can improve the stability and quality of data warehouses.
  • CI/CD pipelines are essential for deploying code changes and executing SQL statements to apply changes to the data warehouse.
  • Hotfixes are necessary to address bugs in the UAT and production environments, and they can bypass the process of going through the DEV branch.
  • Proper synchronization of branches is crucial to ensure that bug fixes are present in all branches.
  • A proper branching strategy allows data engineers to focus on their work without wasting time on unnecessary tasks.

Hands-on Tutorials

How To Structure Your Git Branching Strategy — By A Data Engineer

Data pipelines require version control too!

Image by Author

If you’ve ever dealt with code collaboratively, you’d understand the importance of version control and branching strategies. These are the key tools that allow multiple developers to work on a project in parallel. Without them, your product is very likely to break.

For those unfamiliar with version control and branches: in short, version control is the practice of managing changes to your source code. It allows developers to clone, work on, and deploy code without interfering with other developers’ work.

Branches are simply versions of your source code. They are useful for separating code that is currently in development from the working, stable code used in production environments.

You’ve probably heard of the DEV, UAT, and MASTER branches used by software engineers and developers. But have you ever come across a branching strategy for Data Engineers/Data Scientists?

Instead of a product, Data Engineers and Data Scientists build and maintain data warehouses. Data Scientists do build data products, but they often cannot do so until a stable data warehouse is in place to gather data.

Let’s talk about some of the differences in branch design for Data Engineers vs Software Engineers.

Content

Photo by Aaron Weiss on Unsplash

For data engineers, source code usually involves everything from data warehouse versions to data pipelines.

This includes:

  • SQL Commands (DMLs/DDLs) to apply changes to the Data Warehouse
  • Data Pipelines
  • CI/CD
  • DevOps
  • Data Models

And these are just off the top of my head. In layman's terms, a data warehouse is just a place where data is stored. But in reality, it is way more complex than that.

A data warehouse is used for many purposes, which vary according to the company you work for. In some places, data warehouses are used to produce Business Intelligence graphs for data analysts/data scientists, and sometimes even for upper management like the CEOs and CTOs of the company.

A data warehouse may store data that is fed into machine learning products through a real-time pipeline, updating machine learning models as new data comes in. If data stored in the warehouse has issues, it will be reflected in said machine learning models.

Each of these data warehouse functions has its own update interval. For example, data analysts expect to have their graphs updated daily, while Machine Learning products expect to have their data updated in real-time.

Hence, changes made to the source code may have a bigger impact than you think. It only takes one invalid SQL query execution to receive a phone call from the CEO himself, asking why the dashboards are breaking.

Releases

Photo by Fotis Fotopoulos on Unsplash

Typically, releases scheduled by Software Engineers are much more spaced out compared to Data Engineers. This is because it takes a longer time to develop a feature for a product compared to crafting a SQL query that adds a column in a table.

That being the case, I’ve noticed that Data Engineers roll out releases much more often than the typical Software Engineer. I’ve experienced as often as 3–4 releases per week. I would say that a typical feature release for Software Engineers takes a week at least.

This isn’t by design.

Most of the time, Data Engineers get requirements for urgent changes on important dashboards or products. CEOs and CTOs may require numbers on the dashboard changed for many reasons, and they want it to be reflected almost immediately. Hence the frequency of hotfixes.

Branching Strategy

Image by Author

Let’s talk about the branching strategy I designed for my organization. There are 3 main branches —

DEV — Contains latest fixes and features

UAT — Current State of UAT Environment

MASTER — Current State of Production Environment

With this design, the DEV branch will usually contain commits that are ahead of the MASTER branch. The production environment will therefore not reflect the team’s most recent features until they have been tested.

DEV

If a developer wants to start working on a feature/bug, they create a development branch named after the Jira Ticket. Jira Tickets are tasks assigned to engineers by managers. Development happens on this branch.

Engineers can deploy their work onto the DEV environment and perform tests. Once they are happy with their tests, they can merge the development branch into the DEV branch, essentially pushing their changes into DEV.

It is important that at this point engineers are confident that their code is working as expected. DEV may be merged into UAT at any point in time and any bugs, if present, will be brought forward.
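As a minimal sketch of the naming rule above, a small check could confirm that a development branch is named after a Jira ticket before it is merged into DEV. The ticket pattern (an uppercase project key, a hyphen, and a number, such as DATA-123) and the function names are assumptions made for illustration, not part of the original setup.

```python
import re

# Assumed convention: uppercase project key, hyphen, ticket number (e.g. "DATA-123").
TICKET_PATTERN = re.compile(r"^[A-Z][A-Z0-9]+-\d+$")

# The three long-lived branches described in the article.
LONG_LIVED_BRANCHES = {"DEV", "UAT", "MASTER"}


def is_ticket_branch(name: str) -> bool:
    """Return True if the branch name looks like a Jira ticket key."""
    return name not in LONG_LIVED_BRANCHES and bool(TICKET_PATTERN.match(name))


def merge_target(name: str) -> str:
    """Development branches are merged into DEV once their tests pass."""
    if not is_ticket_branch(name):
        raise ValueError(f"{name!r} is not named after a Jira ticket")
    return "DEV"


if __name__ == "__main__":
    print(is_ticket_branch("DATA-123"))  # a hypothetical ticket key
    print(merge_target("DATA-123"))
```

A check like this could run as a pre-merge hook, so that every commit on DEV can be traced back to a ticket.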

CI/CD

In this particular design, the only way to deploy onto UAT and Production is by pushing code into the branches they are connected to.

But first, what is CI/CD?

Continuous integration (CI) and continuous delivery (CD) are practices in Software Engineering, or Data Engineering in this case, for delivering code changes more frequently and reliably. Together they form a set of principles that development teams should follow.

In our case, we deploy CI/CD pipelines to overwrite code in Apache Airflow, which is our workflow management system for Data Pipelines. This includes code for DAGs, Operators, and everything else.

The CI/CD pipeline also executes DMLs/DDLs to apply changes to the Data Warehouse using a SQL version control tool named Liquibase.
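To make the link between branches and environments concrete, here is a rough sketch of how a CI/CD job might choose what to do based on the branch that received a push. The environment names, the step descriptions, and the function name are illustrative assumptions; the real pipeline would invoke Airflow deployment and Liquibase rather than return strings.

```python
from typing import Dict, List

# Only UAT and MASTER trigger a deployment in this design.
# The environment labels on the right are assumed names.
DEPLOY_TARGETS: Dict[str, str] = {
    "UAT": "uat",
    "MASTER": "production",
}


def deployment_plan(branch: str) -> List[str]:
    """Return the ordered steps a push to `branch` would trigger."""
    environment = DEPLOY_TARGETS.get(branch)
    if environment is None:
        # Ticket branches and DEV do not deploy to UAT or Production.
        return []
    return [
        f"overwrite Airflow code (DAGs, Operators) in {environment}",
        f"apply pending DML/DDL changes to the {environment} warehouse",
    ]


if __name__ == "__main__":
    for step in deployment_plan("UAT"):
        print(step)
```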

In the UAT and Production environments, Data Engineers are restricted in how they can deploy work. For example, only SELECT statements are allowed in the UAT and Production Data Warehouse. This is to prevent anyone from accidentally dropping a table or inserting a row.

The only way to execute DMLs/DDLs, which are SQL statements that change the schema or data of tables, is through the CI/CD pipeline. It may sound like overkill, but this practice has improved the stability and quality of our Data Warehouse tremendously.
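The restriction can be pictured with a deliberately naive guard like the one below. It only inspects the first keyword of a statement, so it is a sketch of the policy rather than a safe implementation; in practice the rule would more likely be enforced with database permissions. The environment names and function names are assumptions.

```python
# Environments where direct access is read-only (assumed labels).
RESTRICTED_ENVIRONMENTS = {"UAT", "PRODUCTION"}


def first_keyword(sql: str) -> str:
    """Return the first word of a SQL statement in upper case."""
    stripped = sql.strip()
    return stripped.split(None, 1)[0].upper() if stripped else ""


def is_allowed(sql: str, environment: str, via_cicd: bool) -> bool:
    """Apply the policy: restricted environments accept only SELECT
    unless the statement is executed by the CI/CD pipeline."""
    if via_cicd or environment not in RESTRICTED_ENVIRONMENTS:
        return True
    return first_keyword(sql) == "SELECT"


if __name__ == "__main__":
    print(is_allowed("SELECT * FROM orders", "PRODUCTION", via_cicd=False))
    print(is_allowed("DROP TABLE orders", "PRODUCTION", via_cicd=False))
    print(is_allowed("DROP TABLE orders", "PRODUCTION", via_cicd=True))
```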

UAT

Once a team has made significant progress and wants to move features into UAT, they can schedule a release. The DEV branch is merged into the UAT branch and CI/CD is run. The environment is then monitored, usually for a day, to verify there aren’t any bugs. The branch is then tagged using semantic versioning.
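Semantic versioning uses tags of the form MAJOR.MINOR.PATCH. The helper below is a small sketch of how the next tag could be computed; the "v" prefix and the choice of which part to bump for a given release are assumptions, since the article does not specify them.

```python
def bump(tag: str, part: str) -> str:
    """Return the next semantic version tag.

    `tag` looks like "v1.4.2"; `part` is "major", "minor", or "patch".
    """
    major, minor, patch = (int(piece) for piece in tag.lstrip("v").split("."))
    if part == "major":
        return f"v{major + 1}.0.0"
    if part == "minor":
        return f"v{major}.{minor + 1}.0"
    if part == "patch":
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


if __name__ == "__main__":
    print(bump("v1.4.2", "minor"))  # v1.5.0
    print(bump("v1.4.2", "patch"))  # v1.4.3
```

One common convention is to bump the minor version for a feature release and the patch version for a hotfix.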

MASTER

After thorough testing in the UAT environment, the team can proceed to move said features into production by merging the UAT branch into the MASTER branch. The CI/CD for production is run and the environment is monitored once more. The branch is then tagged accordingly.

HOTFIXES

Bugs. They are unavoidable. If bugs appear at any point in the UAT or Production environment, Data Engineers will need to apply hotfixes. These are done by creating a hotfix branch from the faulty branch, applying the changes, and merging the hotfix branch back into its original branch.

Hotfixes can bypass the process of having to go through DEV. They can be merged directly into MASTER if necessary. After merging, the branch is tagged accordingly as well. The branch must also be synced into the branches that feed into it.

For example, if a hotfix is merged into MASTER, the developer must then sync MASTER into UAT and DEV so that these branches contain the fix as well.
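The sync rule can be expressed as a short function: given the branch that received the hotfix, return the branches that still need it. This is a sketch that assumes the promotion order DEV, UAT, MASTER from the article; the function name is made up for illustration.

```python
from typing import List

# Code is promoted from left to right.
PROMOTION_ORDER = ["DEV", "UAT", "MASTER"]


def branches_to_sync(hotfixed_branch: str) -> List[str]:
    """Return the earlier branches that must receive the hotfix,
    starting with the one closest to the hotfixed branch."""
    position = PROMOTION_ORDER.index(hotfixed_branch)  # ValueError if unknown
    return PROMOTION_ORDER[:position][::-1]


if __name__ == "__main__":
    print(branches_to_sync("MASTER"))  # ['UAT', 'DEV']
    print(branches_to_sync("UAT"))     # ['DEV']
```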

Deployment Workflow

Image By Author

Here’s a step-by-step guide on how to deploy work onto environments using this branching strategy.

  1. Create branches named after Jira tickets for development from the DEV branch. We use A, B, and C as examples of features/bugs here.
  2. Once tests are satisfactory, A, B, and C can be merged into the DEV branch at any time. The DEV branch now contains A, B, and C.
  3. On a UAT release, the DEV branch is directly merged into the UAT branch through a pull request. CI/CD now deploys A, B, and C onto the UAT environment which applies changes onto both data pipelines and the actual data warehouse itself. The branch is then tagged accordingly.
  4. After careful monitoring, a Production release is scheduled. The UAT branch is merged into the MASTER branch through a pull request. CI/CD deploys A, B, and C onto the Production environment. The branch is tagged accordingly.
  5. During releases, development is not halted. Engineers can continue their work on the DEV branch as per usual. As an example, the engineers are working on features D and E as releases are performed.
  6. As a hypothetical scenario, a bug was found on production during the CI/CD run. A hotfix branch is created containing the bugfix F and is directly merged into MASTER for the hotfix to happen. The bugfix F is deployed onto the Production environment through CI/CD. The branch is tagged accordingly.
  7. Both the MASTER branch and the production environment now contain A, B, C, and F. Once the team is happy with the hotfix F, the MASTER branch is synced into the UAT and DEV branches so that F is present in all branches. DEV and UAT now contain A, B, C, and F.
  8. Once tests for D and E are done, they are merged into the DEV branch. The DEV branch now contains all features/bugs from A to F. It can go through the release cycle again, bringing D and E into UAT and production.
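The walkthrough above can be replayed with a toy model in which each branch is a set of features and a merge is a set union. This is only a sketch for checking which branch contains what after each step; it ignores real Git mechanics such as conflicts and commit history, and it treats D and E as the features still in development (as in step 5).

```python
from typing import Dict, Set


def merge(branches: Dict[str, Set[str]], source: str, target: str) -> None:
    """Model a merge as a set union: the target gains everything in the source."""
    branches[target] = branches[target] | branches[source]


def run_workflow() -> Dict[str, Set[str]]:
    branches: Dict[str, Set[str]] = {"DEV": set(), "UAT": set(), "MASTER": set()}

    # Steps 1-2: ticket branches A, B, and C are merged into DEV.
    branches["DEV"] |= {"A", "B", "C"}
    # Step 3: UAT release.
    merge(branches, "DEV", "UAT")
    # Step 4: Production release.
    merge(branches, "UAT", "MASTER")
    # Step 6: hotfix F goes straight into MASTER.
    branches["MASTER"] |= {"F"}
    # Step 7: MASTER is synced back into UAT and DEV.
    merge(branches, "MASTER", "UAT")
    merge(branches, "MASTER", "DEV")
    # Steps 5 and 8: D and E finish testing and are merged into DEV.
    branches["DEV"] |= {"D", "E"}
    return branches


if __name__ == "__main__":
    for name, features in run_workflow().items():
        print(name, sorted(features))
```

At the end of the run, DEV holds A through F while UAT and MASTER hold A, B, C, and F, which is the state from which the next release cycle starts.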

Conclusion

Photo by Kelly Sikkema on Unsplash

If you’ve read until this point, you must really have a thing for the boring stuff. Most of my friends fall asleep when I tell them about branches. They didn’t know I had a passion for nature.

Jokes aside, a proper branching strategy is important so that data engineers don’t have to waste time messing around with things they don’t want to.

They can just focus on the data.

In this article, we’ve gone through —

  • DEV, UAT, and MASTER branches
  • Differences in branch design between Software and Data Engineers
  • What is CI/CD?
  • Deployment Workflow

As usual, I end with a quote.

“The world is one big data problem.” - Andrew McAfee

I am working on more stories, writings, and guides in the data industry. You can absolutely expect more posts like this. In the meantime, feel free to check out my other articles to temporarily fill your hunger for data.

Thanks for reading! If you want to get in touch with me, feel free to reach me by email or through my LinkedIn Profile. You can also view the code for previous write-ups on my GitHub.
