dcd"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*gTY2PKbOL0WAL_QP8pRO_g.png"><figcaption></figcaption></figure><h2 id="f60b">Step 2: Standardise model naming convention.</h2><p id="b4b0">Grouping models into stages helps us understand what a model does, but having a robust framework of naming models would save us plenty of time from scanning through a bajillion models, or infuriate our colleagues by asking the Nth time “<i>what does this model do?</i>” when we’re too lazy to open and read the query.</p><figure id="984a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*ydH44pmX05qZ6H77kIkqig.gif"><figcaption>when someone name a transformation task: <b><i>processed_accounts</i></b>.</figcaption></figure><p id="f2af">The result table shares the same name as the model file. Since we will be building so many models, it makes sense to provide a naming convention we wouldn’t be confuse ourselves over the tables we created in our warehouse.</p><p id="c892">Here’s our table naming convention template:</p><p id="b304"><code><environment>.<db_type><stage><db_name>__<table_name>_<country></code></p><p id="4f0c">…and the models directory looks like this:</p><figure id="55ec"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*gz46x1UESyRm_lE1XalL2Q.png"><figcaption></figcaption></figure><p id="3ade">We also specify the database schema in the model file config:</p>
<figure id="27a7">
<div>
<div>
<iframe class="gist-iframe" src="/gist/usagiramen/675828957f41265a9979abd440ef060f.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="ba9d"><b>Note</b>: <i>you may noticed this is a macro file, and not the actual model. I shall go into detail about this in the next article.</i></p><p id="68da">When we run <code>base_noodles__metadata_au</code> model in dbt, it will build a table named <code><b>test_mysql.base_noodles__metadata_au</b></code>.</p><p id="0068">By reading the <code>mysql</code> folder in models, we know <code>base_noodles__metadata_au </code>contains a deduplicated copy (because it’s a base) of <code>metadata</code> from a MySQL database named “noodles”. Lastly, it contains data from Australia only.</p><h1 id="023e">How we make pipelines readable?</h1><h2 id="9f3e">Step 3: Include documentation in every model.</h2><p id="00df">Noticed the <code>.yml</code> files in <code>mysql</code> folder? That’s where we document our models and tests. Here’s a peek at one of our YAML files:</p>
<figure id="f935">
<div>
<div>
<iframe class="gist-iframe" src="/gist/usagiramen/a88b3fea48be12aee845c908288b206a.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="8149">When we run <code>dbt docs generate</code> on the terminal, it compiles all relevant information about our dbt project so we can view the documented models and lineage graphs using <code>dbt docs serve</code> command later. Keeping our documentation in this format is a boon because:</p><ul><li>We can easily include a script to feed the column descriptions into any BI tool our stakeholders use.</li><li>It serves as our single source of documented truth. No more referring to random Google sheets.</li></ul><h1 id="0094">How do we ensure data quality in our pipelines?</h1><h2 id="b6ce">Step 4: Specify critical tests at the top of the pipeline.</h2><figure id="7687"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*QfEqhCXI62SyPwak9oQwNQ.gif"><figcaption>when you write models with no tests.</figcaption></figure><p id="9944">dbt allows you to test the results of your models. Adding tests in your pipeline ensures the data is transform as you would expect. Adding test <i>early</i> in your pipeline ensures you don’t add test needlessly.</p><p id="cdf0">Notice that we included <code>not_null</code> and <code>unique</code> tests for <code>metadata</code> table in the base stage? We specified <code>id</code> should be unique with no NULL values at t
Options
he top so every model down the pipeline should follow the same atomicity, even when performing JOINs in intermediate or final stages.</p><p id="02ca">This helps us avoid compensating poorly-defined transformation logic by enforcing test everywhere because we don’t know <code>base_noodles__metadata</code> is at user / event level. It would be a nightmare for someone to debug a pipeline that was transforming user level data at the source, to event level in intermediate stage (duplicated users), to something else.</p><h1 id="6f68">Results: what was the overall impact?</h1><p id="8919">After building the models, our final step is to create Airflow DAGs to trigger the pipelines.</p><p id="f176">The only quantifiable metric is time taken to complete the entire pipeline transformation. We ran the new <code>account</code> pipeline and compare the execution time to our old pipeline as a benchmark (2 hours). Surprisingly, the new pipeline takes an average 2 hours 30 minutes to complete!</p><p id="3bc2">While optimising our pipelines for speed is something we have to improve, the pipeline design shines in a few aspects:</p><h2 id="f276">Impact #1: Faster updates.</h2><p id="569d">We able to include additional data points requested by our stakeholders while the migration is ongoing. Adding new transformation logic is much easier now since we could easily identify which part of the flow to introduce these updates on.</p><p id="3190">Such tasks would usually require days (reading the transformation task, adding and testing the fix), now can be completed within the morning.</p><h2 id="c869">Impact #2: We found more hidden errors.</h2><p id="c1fe">One of our interns who helped us with the pipeline migration was able to raise several issues about how some table JOINs in the new pipeline don’t make sense.</p><p id="b395">This was interesting because we did not change the original logic from the old pipeline during migration, and we only onboarded him on dbt, not specifically on the pipeline. Yet he was able to read what we had documented and unearth potential errors, fully hidden in the old pipeline.</p><p id="f446">In detail, we renamed a column <code>id AS old_id</code> so we could retain <code>id</code> and <code>old_id</code> after the transformation, but we mistakenly perform a LEFT JOIN on the wrong id column somewhere in the subquery, resulting in fewer results than expected. The scary thing is we didn’t know until now.</p><p id="5539"><b>Note</b>: <i>This is where we learned our lesson not to rename columns everywhere.</i></p><figure id="d4bc"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*x0svqrNO8swf_ydzad5FeQ.gif"><figcaption>when i understood the bare minimum of dbt.</figcaption></figure><h1 id="c30e">Conclusion</h1><p id="080d">Overall, we definitely achieve the two goals we highlighted. Our migrated pipelines are now easier to read and understand, especially someone else’s. More complex metrics can be introduced directly in dbt models, so our stakeholders won’t need to download the data and apply their calculations on their excel.</p><p id="bcec">The easier it is to consume data from our tables, the quicker they can generate insights, and they can make decisive actions sooner.</p><figure id="67c7"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*MwHE1HpGaELw7L0tL7m0IA.gif"><figcaption>when our stakeholders love our tables.</figcaption></figure><p id="d8bc">I’ll discuss on more lessons learned and how we optimise for speed in the next article. dbt had documented some <a href="https://docs.getdbt.com/docs/guides/best-practices#best-practices-in-dbt-projects">best practices</a> to apply when setting up your first project; I highly recommend you to check it out too.</p><p id="4c88">❗️ <b><i>Interested in what else we work on?
</i></b>Follow us <b>(<a href="https://medium.com/shopback-tech-blog/">ShopBack Tech blog</a></b> <b>| <a href="https://www.linkedin.com/company/shopback-com">ShopBack LinkedIn page</a>)</b> now to get further insights on what our tech teams do!</p><p id="4f68">❗❗️<b><i> Or… interested in us? (definitely, we hope)</i></b>
Check out <a href="http://careers.shopback.com/"><b>here</b></a> how you might be able to fit into ShopBack — we’re excited to start our journey together!</p></article></body>
Designing ELT pipelines with dbt (Part 2)
This is the second part of a series about my data pipeline experience with dbt. Our focus in this article is on model structures and best practices we use to design pipelines while aligning with dbt’s framework.
What are our requirements to make this migration a success?
To clarify again, dbt only takes care of the data transformation process. So when we chose this tool, it should help us achieve these goals:
Goal #1: Readable pipelines.
Scalability is a common buzzword in the tech scene. No doubt we want “scalable pipelines”, but what really matters is any analyst can take a look at our dbt models (where we write our transformation query) and go “Aha! I know what this genius is trying to do.”.
When analysts understand how data is transformed, it’s easier to break down complex queries into smaller pieces, reuse models that has several business use cases, etc. Given our international presence, building a new pipeline to ingest data from a new country should be as simple as updating a few parameters.
Goal #2: Data we can trust.
Quick background context: we currently adopt a self-serve analytics culture in our company. An executive should be able to build a chart to understand the insights themselves, rather than requesting an analyst to build one.
To support this culture, we need the right set of BI tools and clean, descriptive datasets so our data consumers won’t ping us on Slack with “Are you sure this data is accurate?” or “Where can I get metric X in this table?”. The data in these tables should be consistent and the column descriptions are readily available should they need further reading.
How we setup our dbt models?
Step 1: Structure pipelines in meaningful stages.
We introduced a series of stages how we want our pipeline to flow. As mentioned in our previous article, every model in each stage should have only one job to do.
Our stages are defined as follows:
source. This is where our source data originates. The transformation process begins here, and is prepared by our data engineering folks.
base. Data in this stage is copied from the source and deduplicated by a unique key. Its primary role is to ensure each row is unique, and only contains the most updated values.
staging. This is where we perform core data cleaning such as applying fix to incorrect logic and adding important columns we need down the pipeline.
intermediate. Business-related logic and metrics are introduced at this stage. An intermediate table often consists of data from multiple staging tables, so this is where the bulk of query complexity occurs.
final. The final table is the dataset which stakeholders and analysts will use for operational dashboards and deep dive analysis. It contains all the necessary columns from intermediate tables or staging tables. Also, we only rename columns and recast values to the desired type here.
me when my manager brief us about pipeline stages.
When we migrate our infamous accounts table to dbt, we studied all related transformation tasks (up to the correct source tables), and assign each part of the transformation logic to the correct stage. Here’s an illustration how the new accounts pipeline looks like:
Step 2: Standardise model naming convention.
Grouping models into stages helps us understand what a model does, but having a robust framework of naming models would save us plenty of time from scanning through a bajillion models, or infuriate our colleagues by asking the Nth time “what does this model do?” when we’re too lazy to open and read the query.
when someone name a transformation task: processed_accounts.
The result table shares the same name as the model file. Since we will be building so many models, it makes sense to provide a naming convention we wouldn’t be confuse ourselves over the tables we created in our warehouse.
We also specify the database schema in the model file config:
Note: you may noticed this is a macro file, and not the actual model. I shall go into detail about this in the next article.
When we run base_noodles__metadata_au model in dbt, it will build a table named test_mysql.base_noodles__metadata_au.
By reading the mysql folder in models, we know base_noodles__metadata_au contains a deduplicated copy (because it’s a base) of metadata from a MySQL database named “noodles”. Lastly, it contains data from Australia only.
How we make pipelines readable?
Step 3: Include documentation in every model.
Noticed the .yml files in mysql folder? That’s where we document our models and tests. Here’s a peek at one of our YAML files:
When we run dbt docs generate on the terminal, it compiles all relevant information about our dbt project so we can view the documented models and lineage graphs using dbt docs serve command later. Keeping our documentation in this format is a boon because:
We can easily include a script to feed the column descriptions into any BI tool our stakeholders use.
It serves as our single source of documented truth. No more referring to random Google sheets.
How do we ensure data quality in our pipelines?
Step 4: Specify critical tests at the top of the pipeline.
when you write models with no tests.
dbt allows you to test the results of your models. Adding tests in your pipeline ensures the data is transform as you would expect. Adding test early in your pipeline ensures you don’t add test needlessly.
Notice that we included not_null and unique tests for metadata table in the base stage? We specified id should be unique with no NULL values at the top so every model down the pipeline should follow the same atomicity, even when performing JOINs in intermediate or final stages.
This helps us avoid compensating poorly-defined transformation logic by enforcing test everywhere because we don’t know base_noodles__metadata is at user / event level. It would be a nightmare for someone to debug a pipeline that was transforming user level data at the source, to event level in intermediate stage (duplicated users), to something else.
Results: what was the overall impact?
After building the models, our final step is to create Airflow DAGs to trigger the pipelines.
The only quantifiable metric is time taken to complete the entire pipeline transformation. We ran the new account pipeline and compare the execution time to our old pipeline as a benchmark (2 hours). Surprisingly, the new pipeline takes an average 2 hours 30 minutes to complete!
While optimising our pipelines for speed is something we have to improve, the pipeline design shines in a few aspects:
Impact #1: Faster updates.
We able to include additional data points requested by our stakeholders while the migration is ongoing. Adding new transformation logic is much easier now since we could easily identify which part of the flow to introduce these updates on.
Such tasks would usually require days (reading the transformation task, adding and testing the fix), now can be completed within the morning.
Impact #2: We found more hidden errors.
One of our interns who helped us with the pipeline migration was able to raise several issues about how some table JOINs in the new pipeline don’t make sense.
This was interesting because we did not change the original logic from the old pipeline during migration, and we only onboarded him on dbt, not specifically on the pipeline. Yet he was able to read what we had documented and unearth potential errors, fully hidden in the old pipeline.
In detail, we renamed a column id AS old_id so we could retain id and old_id after the transformation, but we mistakenly perform a LEFT JOIN on the wrong id column somewhere in the subquery, resulting in fewer results than expected. The scary thing is we didn’t know until now.
Note: This is where we learned our lesson not to rename columns everywhere.
when i understood the bare minimum of dbt.
Conclusion
Overall, we definitely achieve the two goals we highlighted. Our migrated pipelines are now easier to read and understand, especially someone else’s. More complex metrics can be introduced directly in dbt models, so our stakeholders won’t need to download the data and apply their calculations on their excel.
The easier it is to consume data from our tables, the quicker they can generate insights, and they can make decisive actions sooner.
when our stakeholders love our tables.
I’ll discuss on more lessons learned and how we optimise for speed in the next article. dbt had documented some best practices to apply when setting up your first project; I highly recommend you to check it out too.
❗❗️ Or… interested in us? (definitely, we hope)
Check out here how you might be able to fit into ShopBack — we’re excited to start our journey together!