How To Structure Your Git Branching Strategy — By A Data Engineer
Data pipelines require version control too!

If you’ve ever worked on code collaboratively, you understand the importance of version control and branching strategies. These are the key tools that allow multiple developers to work on a project in parallel. Without them, your product is very likely to break.
For those unfamiliar with version control and branches: version control is the practice of managing changes to your source code. It allows developers to clone, work on, and deploy code without interfering with other developers’ work.
Branches are simply versions of your source code. They are useful for separating code that is still in development from the working, stable code that runs in production environments.
You’ve probably heard of the DEV, UAT, and MASTER branches used by software engineers and developers. But have you ever come across a branching strategy for Data Engineers or Data Scientists?
Instead of a product, Data Engineers and Data Scientists build and maintain data warehouses. Data Scientists do build data products, but often cannot do so until a stable data warehouse is in place to gather data.
Let’s talk about some of the differences in branch design for Data Engineers vs Software Engineers.
Content
For data engineers, source code usually involves everything from data warehouse versions to data pipelines.
This includes:
- SQL Commands (DMLs/DDLs) to apply changes into Data Warehouse
- Data Pipelines
- CI/CD
- DevOps
- Data Models
and these are just off the top of my head. In layman's terms, a data warehouse is just a place where data is stored. But in reality, it is way more complex than that.
Data warehouses serve many purposes, which vary from company to company. In some places, they feed Business Intelligence dashboards for data analysts and data scientists, and sometimes even for upper management such as the CEO and CTO.
A data warehouse may store data that is fed into machine learning products through a real-time pipeline, updating machine learning models as new data comes in. If data stored in the warehouse has issues, it will be reflected in said machine learning models.
Each of these data warehouse functions has its own update interval. For example, data analysts expect to have their graphs updated daily, while Machine Learning products expect to have their data updated in real time.
Hence, changes made to the source code may have a bigger impact than you think. It only takes one invalid SQL query to receive a phone call from the CEO himself, asking why the dashboards are breaking.
Releases
Typically, releases scheduled by Software Engineers are much more spaced out than those of Data Engineers. This is because it takes longer to develop a product feature than to craft a SQL query that adds a column to a table.
That being the case, I’ve noticed that Data Engineers roll out releases much more often than the typical Software Engineer. I’ve seen as many as 3–4 releases per week, whereas I would say a typical feature release for Software Engineers takes at least a week.
This isn’t by design.
Most of the time, Data Engineers get requirements for urgent changes to important dashboards or products. CEOs and CTOs may need the numbers on a dashboard changed for many reasons, and they want those changes reflected almost immediately. Hence the frequency of hotfixes.
Branching Strategy

Let’s talk about the branching strategy I designed for my organization. There are 3 main branches:
DEV — Contains latest fixes and features
UAT — Current State of UAT Environment
MASTER — Current State of Production Environment
With this design, the DEV branch will usually be ahead of the MASTER branch. The production environment does not receive the team's most recent features until they have been tested.
DEV
If a developer wants to start working on a feature or bug, they create a development branch named after the Jira ticket. Jira tickets are tasks assigned to engineers by managers. Development happens on this branch.
Engineers can deploy their work onto the DEV environment and perform tests. Once they are happy with their tests, they can merge the development branch into the DEV branch, essentially pushing their changes into DEV.
It is important that, at this point, engineers are confident their code works as expected. DEV may be merged into UAT at any time, and any bugs present will be carried forward with it.
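To make the naming convention concrete, here is a minimal sketch of a check that a development branch is named after a Jira ticket. The key format (an uppercase project key, a hyphen, and a number, such as DATA-123) and the function name is_valid_dev_branch are assumptions made for illustration, not part of the setup described here.

```python
import re

# Assumed Jira key format: uppercase project key, hyphen, ticket number (e.g. "DATA-123").
JIRA_KEY = re.compile(r"^[A-Z][A-Z0-9]+-\d+$")

# The three long-lived branches from this strategy; nobody develops on them directly.
LONG_LIVED = {"DEV", "UAT", "MASTER"}


def is_valid_dev_branch(name: str) -> bool:
    """Return True if `name` looks like a Jira ticket key and is not a long-lived branch."""
    return name not in LONG_LIVED and JIRA_KEY.match(name) is not None


print(is_valid_dev_branch("DATA-123"))    # True
print(is_valid_dev_branch("my-feature"))  # False
```

A check like this could run as a pre-push hook or as an early CI step, so that every branch can be traced back to a ticket.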
CI/CD
In this particular design, the only way to deploy onto UAT and Production is by pushing code into the branches they are connected to.
But first, what is CI/CD?
Continuous integration (CI) and continuous delivery (CD) are a set of principles and practices, used in Software Engineering and in this case Data Engineering, that help development teams deliver code changes more frequently and reliably.
In our case, we use CI/CD pipelines to overwrite the code in Apache Airflow, which is our workflow management system for data pipelines. This includes code for DAGs, Operators, and everything else.
The CI/CD pipeline also executes DMLs/DDLs to apply changes to the Data Warehouse using a SQL version control tool named Liquibase.
In the UAT and Production environments, Data Engineers are restricted in what they can run directly. For example, only SELECT statements are allowed in the UAT and Production Data Warehouses. This is to prevent anyone from accidentally dropping a table or inserting a row.
The only way to execute DMLs/DDLs, which are SQL statements that change the schema or data of tables, is through the CI/CD pipeline. It may sound like overkill, but this practice has improved the stability and quality of our Data Warehouse tremendously.
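The rule above can be sketched as a small guard. This is only an illustration of the policy: in a real warehouse the restriction would normally be enforced with database permissions rather than by inspecting SQL text. The environment names and the functions is_read_only and can_run are hypothetical, and treating DEV as unrestricted is an assumption.

```python
import re

# Environments where direct access is read-only (names assumed for illustration).
RESTRICTED = {"UAT", "PROD"}

# Naive check: the statement starts with the SELECT keyword.
SELECT_ONLY = re.compile(r"\s*select\b", re.IGNORECASE)


def is_read_only(sql: str) -> bool:
    """Return True if the statement begins with SELECT."""
    return SELECT_ONLY.match(sql) is not None


def can_run(sql: str, environment: str, via_pipeline: bool) -> bool:
    """Allow anything through CI/CD; allow only SELECT for direct access to restricted environments."""
    if via_pipeline or environment not in RESTRICTED:
        return True
    return is_read_only(sql)


print(can_run("DROP TABLE sales", "PROD", via_pipeline=False))     # False
print(can_run("DROP TABLE sales", "PROD", via_pipeline=True))      # True
print(can_run("select * from sales", "UAT", via_pipeline=False))   # True
```

The prefix check is deliberately simple and would not recognise, for example, a query that starts with a WITH clause.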
UAT
Once a team has made significant progress and wants to move features into UAT, they can schedule a release. The DEV branch is merged into the UAT branch and the CI/CD pipeline is run. The environment is usually monitored for a day to verify there aren’t any bugs. The branch is then tagged using semantic versioning.
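As an illustration of the tagging step, here is a minimal sketch of computing the next semantic version (MAJOR.MINOR.PATCH) for a tag. The function name bump is hypothetical, and which part gets bumped for a regular release versus a hotfix is a team decision that is not specified here.

```python
def bump(version: str, part: str) -> str:
    """Return the next semantic version after bumping `part` ('major', 'minor', or 'patch')."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part}")


print(bump("1.4.2", "minor"))  # 1.5.0
print(bump("1.4.2", "patch"))  # 1.4.3
```

One possible convention would be a minor bump for each scheduled release and a patch bump for each hotfix.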
MASTER
After thorough testing in the UAT environment, the team can move the features into production by merging the UAT branch into the MASTER branch. The production CI/CD pipeline is run and the environment is monitored once more. The branch is then tagged accordingly.
HOTFIXES
Bugs. They are unavoidable. If bugs appear at any point in the UAT or Production environments, Data Engineers will need to apply hotfixes. These are done by creating a hotfix branch from the faulty branch, applying the changes, and merging the hotfix branch back into its original branch.
Hotfixes can bypass the process of having to go through DEV. They can be merged directly into MASTER if necessary. After merging, the branch is tagged accordingly as well. The branch must also be synced into its dependent branches.
For example, if a hotfix is merged into MASTER, the developer must then sync MASTER into DEV and UAT so that these branches contain the fix as well.
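The sync rule can be sketched as a lookup over the promotion order. The function name branches_to_sync is hypothetical. The MASTER case follows the example above; the UAT case (sync into DEV only, with the fix reaching MASTER through the next regular release) is an assumption about what "dependent branches" means.

```python
from typing import List

# Order in which code is promoted through the long-lived branches.
PROMOTION_ORDER = ["DEV", "UAT", "MASTER"]


def branches_to_sync(hotfixed_branch: str) -> List[str]:
    """Return the branches that must receive a hotfix after it lands on `hotfixed_branch`."""
    position = PROMOTION_ORDER.index(hotfixed_branch)
    # Every branch earlier in the promotion order needs the fix merged back into it.
    return PROMOTION_ORDER[:position]


print(branches_to_sync("MASTER"))  # ['DEV', 'UAT']
print(branches_to_sync("UAT"))     # ['DEV']
```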
Deployment Workflow

Here’s a step-by-step guide on how to deploy work onto environments using this branching strategy.
- Create branches named after Jira tickets for development from the DEV branch. We use A, B, and C as examples of features/bugs here.
- Once tests are satisfactory, A, B, and C can be merged into the DEV branch at any time. The DEV branch now contains A, B, and C.
- On a UAT release, the DEV branch is directly merged into the UAT branch through a pull request. CI/CD now deploys A, B, and C onto the UAT environment which applies changes onto both data pipelines and the actual data warehouse itself. The branch is then tagged accordingly.
- After careful monitoring, a Production release is scheduled. The UAT branch is merged into the MASTER branch through a pull request. CI/CD deploys A, B, and C onto the Production environment. The branch is tagged accordingly.
- During releases, development is not halted. Engineers can continue their work on the DEV branch as per usual. As an example, the engineers are working on features D and E as releases are performed.
- As a hypothetical scenario, a bug is found in production during the CI/CD run. A hotfix branch containing the bugfix F is created and merged directly into MASTER. The bugfix F is deployed onto the Production environment through CI/CD. The branch is tagged accordingly.
- Both the MASTER branch and the production environment now contain A, B, C, and F. Once the team is happy with the hotfix F, the MASTER branch is synced into the UAT and DEV branches so that F is present in all branches. DEV and UAT now contain A, B, C, and F.
- Once tests for D and E are done, they are merged into the DEV branch. The DEV branch now contains all features and fixes from A to F. It can go through the release cycle again, bringing D and E into UAT and production.
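The walkthrough above can be traced with a small simulation that models each long-lived branch as a set of the changes it contains. This is a sketch of the bookkeeping only, not of Git itself. The merge helper is hypothetical, and the last step merges the in-progress features D and E, with F being the hotfix.

```python
# Each long-lived branch is modelled as the set of changes it contains.
branches = {"DEV": set(), "UAT": set(), "MASTER": set()}


def merge(changes: set, target: str) -> None:
    """Merge a set of changes into the target branch."""
    branches[target] |= changes


# Ticket branches A, B, and C are tested and merged into DEV.
merge({"A", "B", "C"}, "DEV")

# UAT release: DEV is merged into UAT.
merge(branches["DEV"], "UAT")

# Production release: UAT is merged into MASTER.
merge(branches["UAT"], "MASTER")

# Hotfix F is merged directly into MASTER, bypassing DEV and UAT.
merge({"F"}, "MASTER")

# MASTER is synced back so that F is present in every branch.
merge(branches["MASTER"], "UAT")
merge(branches["MASTER"], "DEV")

# Features D and E, developed in parallel, are merged into DEV.
merge({"D", "E"}, "DEV")

print(sorted(branches["DEV"]))     # ['A', 'B', 'C', 'D', 'E', 'F']
print(sorted(branches["UAT"]))     # ['A', 'B', 'C', 'F']
print(sorted(branches["MASTER"]))  # ['A', 'B', 'C', 'F']
```

At the end, DEV is ahead of UAT and MASTER by exactly D and E, which is what the next release cycle will promote.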
Conclusion
If you’ve read until this point, you must really have a thing for the boring stuff. Most of my friends fall asleep when I tell them about branches; they didn’t know I had a passion for nature.