Data Model Design 101: Composite vs Surrogate Keys

When to know which type of key to use in your data models

I’ve recently been writing a data model to represent a new part of our business. The data requires a lot of questions to be asked, as it’s quite difficult to understand intuitively.

The data model requires that I join similar, yet different, datasets from two different sources into one dataset. Any time you merge two datasets, it’s integral for you to think about the unique key that will then act as the primary key of this new dataset.

Unfortunately, you can’t assume that the primary key in each dataset will carry over into the resultant one. This is because these keys, if incrementing integers, will tend to be duplicated from dataset to dataset.

However, you can create a new key.

In this article, we will discuss two options for creating a unique key in a data model—a surrogate key or a composite key. What are the differences between these? When should you use one versus the other?

Surrogate vs Composite Key

Composite keys are made up of more than one identifying field, together the fields that make it up are unique. They are created from real-world values and whose meaning can be understood when read

Surrogate keys are generated for the sole purpose of being a primary key and don’t contain any real-world meaning. They are typically hash values that make data retrieval fast and easy.

When to Use a Surrogate Key vs Composite Key

Composite keys are ideal when you still want to maintain the value of your data. While composite keys are a unique combination of fields, you can generate a new field based on these values to make the unique lookup of records easier.

This is what I recommend when using a composite key in a data model. We will go over an easy way to use SQL or dbt to generate a composite key within any of your data models.

Surrogate keys are ideal when you don’t need to maintain the value of your data and want a fast and efficient way to retrieve your data. These are often used when datasets are unique across 3 or more fields, making a composite key difficult to use. They are also used in very large tables where a composite key may not be practical.

How to Generate a Composite Key

Composite keys are quite easy to create using basic SQL. You simply need to concatenate two or more fields:

SELECT
  user_id || customer_signup_year AS _user_ck
FROM users_2023_unioned_users_2024

Before doing this, make sure you check for NULL values. If there’s a possibility that either of these fields is NULL, you will need to figure out how to deal with that so that these NULL values do not persist in your composite key.

If using a data transformation tool like dbt, I highly recommend turning this simple function into a macro that you can reuse across multiple different data models.

{% macro generate_composite_key(column_name_1, column_name_2) -%} 

  {{ column_name_1 }} || {{ column_name_2 }}

{%- endmacro %}

Now you can simply reference this macro whenever generating a composite key!

SELECT 
  {{ generate_composite_key('user_id', 'customer_signup_year' }} AS _user_ck
FROM users_2023_unioned_users_2024

This will come in handy especially if you add more complex logic to handle NULL values.

How to Generate a Surrogate Key

To generate a surrogate key, you can build off the same logic that we used to generate a composite key. You simply add an MD5 function to the concatenated fields.

SELECT
  md5(user_id || customer_signup_year) AS _user_sk
FROM users_2023_unioned_users_2024

The MD5 function in SQL returns a 128-bit checksum for a string which can be used as a hashed key.

One thing to note- make sure the datatypes of the fields being used to generate these keys are the same across all locations where you generate them. If the data types are different, the key values will vary.

dbt has a particularly useful function for generating surrogate keys as part of the dbt_utils package. The best part is that it handles any NULL values for you, so you don’t need to build out this logic yourself.

SELECT
  {{ dbt_utils.generate_surrogate_key(['user_id', 'customer_signup_year']) }} AS _user_sk 
FROM users_2023_unioned_users_2024

With this function, if user_id or customer_signup_year is NULL, the function will handle it so that the values will be repeated across each use of the function.

Conclusion

Data modeling requires a lot of architectural choices, especially when it comes to keys. Choosing the right key will allow your data model to be more user-friendly, faster, and more efficient.

Be sure to keep these 3 things in mind when deciding between a composite and surrogate key:

Is it helpful for the end user to see the values that make up the key?
How large is the data model going to be?
Do NULL values need to be handled?

Let these questions guide you on which architectural decision to make for your model. Happy data modeling!