avatarSystemDesign

Summary

The provided content discusses the design of a scalable unique ID generator inspired by Twitter Snowflake, which is essential for distributed systems requiring unique identifiers for various objects.

Abstract

The article delves into the concept of a unique ID generator, a critical component in distributed systems for generating unique identifiers that are time-stamped and sortable. It outlines the functional and non-functional requirements for such a system, including the need for uniqueness across multiple nodes, fixed size IDs, and high consistency without duplication. The naïve solutions of auto-incrementing IDs and time-based IDs are discussed, along with their limitations in distributed environments. The article then introduces the Universally Unique Identifier (UUID) and its advantages and disadvantages, before focusing on Twitter Snowflake's approach to creating 64-bit unique IDs using a combination of timestamp, machine ID, and a local counter. The design of a custom ID generator based on Snowflake's principles is detailed, including the steps to generate the epoch timestamp, create a node ID from the machine's MAC address, and implement a counter that resets every millisecond. The article concludes with resources for further learning and preparation for system design interviews.

Opinions

  • The author suggests that the naïve solutions are insufficient for high-scale applications requiring unique IDs across distributed environments.
  • UUIDs are praised for their low probability of duplication but criticized for their large size and non-sequential nature.
  • Twitter Snowflake is highlighted as an ideal system for generating unique IDs due to its smaller ID size, time-sortability, and scalability.
  • The article promotes the use of a custom ID generator inspired by Twitter Snowflake for applications that do not require the full 128-bit UUID.
  • The author emphasizes the importance of preparing for system design interviews with courses and resources, particularly those offered by Educative and other educational platforms.

System Design Interview: Scalable Unique ID Generator (Twitter Snowflake or a similar service)

PREV | HOME | NEXT

Don’t forget to get your copy of Designing Data Intensive Applications, the single most important book to read for system design interview prep! Udacity | Coursera | Pluralsight.

Check out ByteByteGo’s popular System Design Interview Course

Consider signing-up for paid Medium account to access our curated content for system design resources.

Grokking Modern System Design for Software Engineers and Managers

Generating unique IDs is a popular task involved in the development of most large scale applications. Unique IDs help identify objects and retrieve them when needed. But how can you create random IDs that are unique even when the system works on a distributed environment where multiple computing nodes are involved? Twitter Snowflake is an ideal example of a high-scale random ID generator, designed to work on a global scale. Let’s learn how to design a similar random ID generator.

Get a leg up on your competition with the Grokking the Advanced System Design Interview course and land that dream job! Or if you prefer video-based courses check out Udacity.

Applications of Random ID Generation

When working with sharded MySQL databases, you’ll need unique IDs to use as primary keys for MySQL tables. When you only have a single MySQL database to work with, you can simply autoincrement the IDs to generate unique IDs for the different tables. However, when working with sharded MySQL databases, you’ll need a unique ID generator.

Other than sharded databases, generating unique IDs is a requirement in many distributed systems. Generating bank account numbers, image ID, file ID, and user ID for large scale distributed systems will need a random ID generator. URL shorteners, such as TinyURL, also use a random ID generator to assign unique IDs to create each short URL.

If you are interviewing, consider buying our number#1 course for Java Multithreading Interviews.

Twitter created its own random ID generator called Twitter Snowflake to create unique IDs for tweets, users, messages and all other objects that need identification. Discord, Instagram and other social networking systems also use similar programs to create IDs.

Grokking the Coding Interview: Patterns for Coding Questions

Requirements For A Random ID Generator

Functional Requirements

  • Each time the ID generator is called, it returns a key that is unique across all node points for the system.
  • The ID should be of a fixed size, let’s say 64 bits.
  • The ID has a timestamp. A timestamp allows the user to order the entries according to the time they were created.

Non Functional Requirements

  • The system should be highly consistent. There should not be any duplicate IDs created across all data centers.
  • The system should be able to scale to accommodate a large number of requests each second.

Naïve Solution — Single Computing Node

Let’s look at the simplest case where we have a single computing node.

Auto-Incrementing IDs

Since there’s only a single machine that generates unique IDs for that particular object, you can simply increment the IDs for each new entry. For example, if your system needs unique IDs to identify posts made by users, you can append each new entry in a hash table with an incrementing ID as the key.

Check out the course Coderust: Hacking the Coding Interview for Facebook and Google coding interviews.

The ‘Post’ table will have a column for the ID, post creator’s user ID (from the User Table), and content. It can also have additional columns for creation time, number of likes etc. A unique ID can be created by making the ID an auto-incrementing integer.

Breeze through your coding interviews with Hacking the Coding Interview.

IDs Generated By Time

Alternatively, you can use the gettimeofday() function to make the time (accurate to each microsecond) of the request its ID. The IDGenerator function with this approach is written as below. gettimeofday() function and timeval structure is predefined in the sys/time.h header file.

id IDGenerator() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return cat(tv.tv_sec,tv.tv_usec);
}

Drawbacks Of The Naïve Solution

The problem with the two alternatives above is that while they create unique IDs when used on a single machine, they will fail at creating unique IDs when multiple machines are involved. High-scale applications, such as Twitter and Facebook are often built on a distributed environment, with multiple computing nodes. There are high chances of duplication when employing either of the two techniques discussed above.

Scalable Solution — Random Numbers

Where multiple machines are involved, developers often write ID generators such that the function picks an ID randomly from a large range every time it is called.

If you are preparing for tech interviews at FANGs, you may want to check out the course Grokking the System Design Interview by Educative.

Universally Unique Identifier (UUID)

One of the most popular examples for generating random numbers is a Universally Unique Identifier (UUID), also called a Globally Unique Identifier (GUID), which is a 128-bit number. This means that every time the ID generator function is called, it will randomly pick a number from a set of 2128 numbers. The chances of duplication are negligible.

Brush-up on Big-O notation for tech interviews with: Big-O Notation For Coding Interviews and Beyond

Windows and many SQL systems give you functions to generate UUIDs. In SQL server, there are two functions that you can use to generate UUIDS: newid() and newsequentialid(). While newid() simply generates UUIDs, the newsequentialid() generates UUIDs in a sequential manner. The function generates a UUID that’s greater than all other UUIDs created by the same function on the machine since the start of Windows. This is especially useful when you want to use the generated UUIDs as row identifiers. The numbers can start from a lower range after a restart but they are still guaranteed to be globally unique on that machine.

Advantages and Disadvantages of UUIDs

UUIDs are excellently suited to generating unique IDs across a distributed system since the chance of duplication even when the function is called on different machines is extremely low. Also, the generated IDs are random, which means that the UUID created cannot be predicted in advance. This is especially useful for the cases with sensitive transactions.

The major disadvantage of UUIDs is the length of the ID. Each ID is a 128-bit (32 char) long number. The large size of the ID means it will take more storage space, and will increase the response time of the machine in retrieving or using the IDs. Additionally, by default, GUIDs are not sequential. As discussed above, though newsequentialid() can make the IDs sequential for a single machine, it’s not possible to guarantee the same for multiple machines. However, time-stamped GUIDs can solve this problem.

Grokking Comp Negotiation in Tech

What Is Twitter Snowflake

Twitter Snowflake optimizes the original concept of UUIDs to work on a large scale, while keeping the IDs sortable and the ID size manageable. It’s an ideal example to base our design on.

Twitter Snowflake creates a 64-bit random, unique ID (as compared to 128-bit UUID). The ID is composed of:

  • 41-bit Epoch timestamp (with millisecond precision)
  • 10-bit machine ID (accommodates 1024 machines)
  • 12-bit counter bits (generated by a local counter for every machine that rolls over when the number turns 4096)
  • 1 extra bit for future use. By default, this bit is set to 0.

Check out the course Coderust: Hacking the Coding Interview for Facebook and Google coding interviews.

Advantages of Twitter Snowflake

Unlike UUIDs, Twitter Snowflake IDs are sortable by time since they have a timestamp at the beginning. This makes them ideal for indexing. Additionally, since they’re 64-bit long, they’re half the size of UUIDs, practically cutting down the storage size and processing times by half. Also, the approach is scalable, considering the fact that it can accommodate 1024 machines. It is highly available, since there can be 1024 machines, where each machine can generate 4096 unique IDs each millisecond (the system also prevents rollover in the same millisecond).

On the downside, the design requires Zookeeper to maintain Machine IDs. Also, the generated IDs are not random, unlike UUIDs. Since the future IDs are predictable, the design may not be applicable to applications where security is a requirement.

Land a higher salary with Grokking Comp Negotiation in Tech.

Designing A Customized ID Generator Inspired By Twitter Snowflake

Let’s design our own ID generator on the same concept as that of Twitter Snowflake. Similar to Twitter Snowflake, let’s design a system to generate a 64-bit unique ID. For the next part of the post, let’s call the ID generated by the ID generator ‘id’. So the id consists of:

  • 1-bit sign (or unused bit) the value for which is always set to 0.
  • 41-bit epoch timestamp (which turns out to be 69.73 years before the time rolls over )
  • 10-bit Node/Machine ID (our system can scale to 1024 nodes or machines)
  • 12-bit local counter for each machine (maximum value of the counter can be:

While all the rest of the components can be represented by int datatype (32-bit integer), time will need to be assigned long datatype (64-bit integer).

Learn patterns to dynamic programming interview questions to land your next Big Tech job!

Depending on the application, you might not always need a 64-bit ID. If the requirements are smaller, you can cut down the timestamp to, say, 20-bit or even smaller. If the application has fewer nodes, you can have fewer bits for the Node ID. Similarly, counter bits may also be reduced depending on the request frequency handled by every node.

If you are preparing for tech interviews at FANGs, you may want to check out the course Grokking the System Design Interview by Educative.

System Diagram

An application calls the ID generator to generate a new ID. For example, a social networking application calls the ID generator to assign a unique ID to each post.

This is how the request will flow:

Each new post reaches the load balancer and is directed to an application server. The application server calls the distributed ID generator to assign a unique ID to the new request. The ID generator creates an ID and returns it to the application server.

Next, let’s zoom into the “ID Generator” block in the above diagram and understand how it works. Here’s the system diagram for the ID Generator:

When a new request reaches the ID Generator, a vacant id of datatype ‘long’ is generated. This id has 64 bits that are initially vacant.

Next, the system will determine the time of the request and append this epoch time 22 bits to the left in the id so that there’s space for appending the Node ID and Counter later on.

Master multi-threading in Python with: Python Concurrency for Senior Engineering Interviews.

Next, the system determines the Node ID (or Machine ID) and fills it into the id, 12 bits to the left, leaving space for the counter.

Finally, the system appends the counter (incrementing with every new request).

The final ID is returned to the application server. The application server will store this id against the request in its database.

Check out the course Coderust: Hacking the Coding Interview for Facebook and Google coding interviews.

Steps Followed by ID Generator

Generating the Epoch Timestamp

The epoch timestamp is the number of seconds that have elapsed since January 1, 1970, midnight (GMT). However, since our timestamp needs to be of millisecond precision, we will use EpochMilli to find the timestamp in milliseconds.

This value can be generated by the system as a long datatype and appended to the unique ID as the prefix. When time is set as a prefix for the ID, you can be sure that every ID generated by the system, even across multiple machines, will be larger than the IDs generated before it.

For instance, it is 22 October 2021, 12:39:16 at the time of writing this sentence. Converting it to EpochMilli timestamp, we have 1634906356000.

To get the EpochMilli timestamp for the request at that instant, make sure you import the library java.time.Instant at the start of the program.

Ace the machine learning engineer interview with Grokking the Machine Learning Interview.

You can then use the function Instant.now(). Adjust it for the Custom Epoch (time at the start of the system to give full 69.73 years before rolling back) to get the timestamp that you can return to the ID generating function. Suppose our ID generator started on 23 October, 2018, 5:26:20, we can take the Custom Epoch time from this reference. The field CUSTOM_EPOCH is thus set to1540272380000.

private static long timestamp() {
    return Instant.now().toEpochMilli() — CUSTOM_EPOCH;
}

With the above code, we have the first 41 bits of our id. We fill the timestamp into the id using a left shift:

long id = Timestamp << (NODEIDBITS + COUNTERBITS);

When we shift Timestamp 10+12 bits to the left when adding it to the id, we’ll have the right 22 bits vacant for appending the Node ID and Counter bits later.

This is what the is looks like now:

Create Node ID

To generate the Node ID or Machine ID, we will use the machine’s MAC address. This ensures that the Machine ID is unique for every machine. To get the MAC address of the machine, make sure you import the library java.net.NetworkInterface at the start of the program.

Interviewing? consider buying our number#1 course for Java Multithreading Interviews.

You can then use the command;

networkInterface.getHardwareAddress()

to get the Node ID component of int datatype.

Now that we have the next component, i.e. the Node ID, let’s use it to fill the next 10 bits of the id:

id |= (nodeId << COUNTERBITS);

By shifting the Node ID 12 bits to the left, we place the Node ID in its correct place, next to the Timestamp, while keeping the last 12 bits vacant for the Counter bit.

If you are preparing for tech interviews at FANGs, you may want to check out the course Grokking the System Design Interview by Educative.

This is what the id looks like now:

Create Counter

Now the last 12 bits are still vacant. This space is reserved for the Counter.

We know that the counter cannot exceed 212–1=4095. When the counter reaches the value of 4095, they will roll back to 0 in the next id. For this, we need to define the max value of the counter.

private static final int maxCounter = 
                               (int)(Math.pow(2, COUNTERBITS) — 1);
Crack your next Java interview with: The Java Interview Handbook: 300+ Interview Questions

To increment the counter for each new request that comes in during the same millisecond, we have the following code. Note that if the counter reaches maxCounter, the request will have to wait until the next millisecond to be given an id. If the request comes in for a fresh millisecond, the counter is set to 0, and increments for each new request that comes in during the same millisecond.

if (currentTimestamp == lastTimestamp) {
    counter = (counter + 1) & maxCounter;
    if(counter == 0) {
        // maxCounter reached, wait until next millisecond
        currentTimestamp = wait(currentTimestamp);
    }
} else {
    // reset counter to 0 for each new millisecond
    counter = 0;
}

The counter is appended to the id with the following code.

id |= counter;

The above code places the counter into its right place in the id (rightmost bits that were vacant and reserved for the counter.

Check out the course Coderust: Hacking the Coding Interview for Facebook and Google coding interviews.

ID Generated

The id looks like this now:

All the bits are filled and the complete id is ready for the request. Depending on the application, the id can be associated with a post, a user, a transaction, or something else. The unique id is returned to the app server. The app server stores it in its database against the request.

Start your journey in machine learning with Grokking Machine Learning Design.

Top YouTube Videos On Designing A Random ID Generator

Teach kids to code and have fun at the same time — CodeMonkey!

Conclusion

This is how a unique ID generator, similar to Twitter Snowflake is designed. The Id generator can be integrated into many different applications.

If you enjoyed the article, kindly support us and consider following. Thank you :)

Your Comprehensive Interview Kit for Big Tech Jobs

0. Grokking the Machine Learning Interview This course helps you build that skill, and goes over some of the most popularly asked interview problems at big tech companies.

1. Grokking the System Design Interview Learn how to prepare for system design interviews and practice common system design interview questions.

2. Grokking Dynamic Programming Patterns for Coding Interviews Faster preparation for coding interviews.

3. Grokking the Advanced System Design Interview Learn system design through architectural review of real systems.

4. Grokking the Coding Interview: Patterns for Coding Questions Faster preparation for coding interviews.

5. Grokking the Object Oriented Design Interview Learn how to prepare for object oriented design interviews and practice common object oriented design interview questions

6. Machine Learning System Design

7. System Design Course Bundle

8. Coding Interviews Bundle

9. Tech Design Bundle

10. All Courses Bundle

Code
Coding
Technology
Codingbootcamp
System Design Interview
Recommended from ReadMedium