Vishal Barvaliya

Golden Rules for System Design Interview.

System Design is explained in detail with examples.

Welcome to the world of System Design! In this blog, we will break down the essential rules for creating efficient and scalable systems. Whether you are tackling latency issues or optimizing for read-heavy workloads, I have got you covered. Let’s dive in!

1. Low Latency Requirement: Make use of Cache and CDN.

  • When a system has a low-latency requirement, a good option is to use a cache and a CDN (Content Delivery Network).

Example Scenario:

  • Imagine an online streaming service where users expect low latency when accessing video content. In this case, a Cache can store frequently requested video metadata and a Content Delivery Network (CDN) can be used to deliver video files with low latency.
  • When a user requests information about a video, the system checks the cache for metadata. If the metadata is present, it’s served quickly. For video files, the CDN ensures that content is delivered from a server that is geographically closer to the user, which reduces latency even more.
  • By combining a Cache for metadata and a CDN for content delivery, the system optimizes response times for both information retrieval and media playback. This is especially important for services where low latency is a high priority, such as streaming platforms or real-time communication apps.

2. Read-Heavy System: Use Cache for faster reads.

  • This rule suggests using a Cache for a Read-Heavy System. Caching involves storing copies of frequently accessed data in a faster storage layer to reduce the time and resources needed to fetch that data from the primary storage.

Example Scenario:

  • Let’s take the example of an e-commerce website with a high volume of read operations, such as product listings, user profiles, product details, etc. Instead of hitting the database every time a user requests product information, the system can use a cache to store this information temporarily.
  • The system checks the cache when a user searches for popular products. If the product details are available in the cache, then the system will retrieve the information from the cache instead of fetching it from the database. This method significantly reduces the load on the database and improves the overall response time for the user.
  • Caching systems like Memcached or Redis are commonly used for such scenarios. They store key-value pairs, where the key identifies the requested data and the value is the data retrieved from the database. Subsequent read requests for the same data can then be served directly from the cache, which improves overall system performance.
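
The read path described above is often called cache-aside. Here is a minimal sketch of it, assuming an in-memory dictionary in place of Redis or Memcached and a hypothetical `fetch_from_db` function in place of the real database; the names and the 60-second TTL are illustrative only.

```python
import time

# Hypothetical stand-in for the primary database.
PRODUCT_DB = {
    "p1": {"name": "Keyboard", "price": 49.0},
    "p2": {"name": "Mouse", "price": 19.0},
}

db_reads = 0  # counts how often the "database" is actually hit


def fetch_from_db(product_id):
    """Simulate a database lookup (assumed helper, not a real driver call)."""
    global db_reads
    db_reads += 1
    return PRODUCT_DB.get(product_id)


class ProductCache:
    """Cache-aside lookup with a simple time-to-live per entry."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]                 # cache hit
        value = fetch_from_db(key)          # cache miss: go to the database
        if value is not None:
            self._entries[key] = (self._clock() + self._ttl, value)
        return value


cache = ProductCache()
first = cache.get("p1")    # miss, reads the database
second = cache.get("p1")   # hit, served from memory
```

After both calls the database has been read only once. In a real deployment the TTL (or explicit invalidation on writes) decides how stale a cached product may become.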

3. Write-Heavy System: Employ Message Queues for async writing.

  • This rule advises using Message Queues for asynchronous processing in a Write-Heavy System. In scenarios with a high volume of write operations, a Message Queue decouples receiving requests from processing them.

Example Scenario:

  • Let’s take a social media platform as an example. Users frequently post updates and comments on this platform and perform other write operations. Instead of handling these operations synchronously, which might lead to performance bottlenecks and increased latency, the system can use a Message Queue.
  • When a user makes a post, the system places a message in the queue containing information about the post. A separate service, asynchronously processing messages from the queue, handles the actual write operation. This separation allows the main application to quickly acknowledge the user’s action while the actual processing happens in the background.
  • Popular message queue systems like Apache Kafka, RabbitMQ, or Amazon SQS can be used in such scenarios. They help ensure that write-heavy operations don’t directly impact the user experience by processing tasks asynchronously and absorbing spikes in write requests.
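
To make the flow concrete, here is a small sketch that uses Python's in-process `queue.Queue` as a stand-in for a real broker such as Kafka, RabbitMQ, or SQS. The `submit_post` and `writer_worker` names, and the list that plays the role of the database, are assumptions for illustration.

```python
import queue
import threading

post_queue = queue.Queue()   # stands in for the message broker
stored_posts = []            # stands in for the database


def submit_post(user, text):
    """Request handler: enqueue the write and acknowledge immediately."""
    post_queue.put({"user": user, "text": text})
    return "accepted"


def writer_worker():
    """Background consumer: performs the actual write."""
    while True:
        message = post_queue.get()
        if message is None:          # sentinel: stop the worker
            post_queue.task_done()
            break
        stored_posts.append(message)
        post_queue.task_done()


worker = threading.Thread(target=writer_worker, daemon=True)
worker.start()

ack = submit_post("alice", "Hello, world!")
submit_post("bob", "Second post")

post_queue.join()        # wait until every queued write has been processed
post_queue.put(None)     # ask the worker to exit
worker.join()
```

The handler returns as soon as the message is queued, so the user gets a fast acknowledgement even if the write itself is slow. A real broker would add durability and delivery guarantees that this in-memory queue does not have.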

4. Unstructured Data: Use NoSQL Database.

  • Use a NoSQL database when dealing with unstructured data.

Example Scenario:

  • Let’s take the example of a social media platform where users can post diverse content like text, images, and videos. In this type of system, the data associated with each post varies in type and shape and doesn’t fit neatly into a fixed, structured format.
  • A NoSQL database like MongoDB or Cassandra is beneficial because it offers a more flexible data model than a fixed relational schema. Unlike traditional relational databases, NoSQL databases can efficiently handle unstructured or semi-structured data, which makes them suitable for scenarios where data formats can evolve or vary, such as in content-rich applications like social media platforms or content management systems.

5. ACID Compliant DB: Choose RDBMS/SQL Database.

  • Using a Relational Database Management System (RDBMS) or SQL Database is beneficial when you need Atomicity, Consistency, Isolation, and Durability (ACID) compliance.

Example Scenario:

  • Consider a banking application where financial transactions are happening. ACID properties are important in financial systems to ensure the integrity of transactions.
  • When a user transfers money from one account to another, it’s essential that the transaction is atomic (either fully completed or not at all), consistent (preserving the system’s integrity), isolated (not affected by other concurrent transactions), and durable (persisted even in the face of failures).
  • In the above scenario, an RDBMS like MySQL, PostgreSQL, or Oracle would be suitable. These databases provide ACID guarantees, which makes them well-suited for applications where data integrity and reliability are crucial, such as financial systems and healthcare databases.

6. Complex Data (Videos, Images, Files): Prefer Blob/Object storage.

  • We should use Blob/Object storage for handling complex data like videos, images, and files.

Example Scenario:

  • Let’s take an example of a cloud storage service where users upload various media files — images, videos, and documents. Instead of storing these directly in a database, which may not be optimized for large binary files, the system can utilize Blob/Object storage.
  • Blob storage, offered by services like Amazon S3 or Azure Blob Storage, is designed to handle large binary objects. When a user uploads a video or an image, the system stores it in the Blob storage, ensuring efficient handling and retrieval of these complex data types.
  • This separation allows for better scalability and performance, especially in scenarios where a large volume of multimedia content needs to be stored and retrieved.

7. Complex Pre-computation: Combine Message Queues & Cache.

  • Consider using Message Queues and a Cache for complex pre-computation tasks.

Example Scenario:

  • Imagine an e-commerce platform where the system must compute and update each user's product recommendations based on their browsing history and preferences. Instead of performing these computations synchronously during each user request, the system can use Message Queues and Cache.
  • When a user interacts with the platform, a message is placed in the queue to initiate the pre-computation task for personalized recommendations. Meanwhile, a cache stores recent recommendations for quick access. The main application can then quickly retrieve the latest recommendations from the cache, reducing the need to perform complex computations in real-time repeatedly.
  • The system optimizes performance and responsiveness by employing Message Queues and Cache for pre-computation, providing users with up-to-date and personalized recommendations without slowing down the core application.

8. High Availability: Use Load Balancer.

  • This rule advises using a Load Balancer for achieving high availability, performance, and throughput.

Example Scenario:

  • Let’s take an example of a web application that experiences varying levels of traffic throughout the day. A Load Balancer sits in front of multiple servers, distributing incoming user requests across these servers. This ensures that no single server is overwhelmed with too many requests, preventing performance degradation and potential downtime.
  • For instance, during peak hours, the Load Balancer can intelligently distribute traffic, evenly spreading the load among available servers. In case one server becomes unavailable, the Load Balancer can redirect traffic to other healthy servers, contributing to high availability.
  • Load Balancers play a crucial role in enhancing system reliability, improving response times, and efficiently utilizing resources, making them essential components in scenarios where maintaining consistent performance and availability is a priority, such as in web applications or online services.

9. High-Volume Data Search: Consider a search index or engine.

  • This rule suggests considering search indexes, tries, or search engines for high-volume data search.

Example Scenario:

  • Let’s take the scenario of a large e-commerce website with a vast product catalog, where users frequently search for products based on various criteria. To handle high-volume searches efficiently, the system can implement a search index.
  • A search engine such as Elasticsearch or Apache Solr builds a search index over the product data, which lets the system quickly locate and retrieve relevant items for a query instead of scanning the entire database. Tries, which are tree-like structures, can also be employed for efficient prefix-based searches.
  • By using these techniques, the system enhances the speed and accuracy of search operations, providing a better user experience, especially in applications where quick and precise search capabilities are critical, such as e-commerce platforms or large-scale content repositories.
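
As an illustration of the prefix search mentioned above, here is a minimal trie sketch. The `Trie` class and the sample product names are made up for this example; a production search engine does far more (ranking, tokenization, typo tolerance).

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.is_word = False   # True if a stored word ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def starts_with(self, prefix):
        """Return all stored words beginning with prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.is_word:
                results.append(word)
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        return sorted(results)


catalog = Trie()
for name in ["laptop", "laptop bag", "lamp", "desk"]:
    catalog.insert(name)

suggestions = catalog.starts_with("la")   # ["lamp", "laptop", "laptop bag"]
```

The lookup cost depends on the length of the prefix and the number of matches, not on the total size of the catalog, which is why tries suit autocomplete-style queries.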

10. Scaling SQL Database: Implement Database Sharding.

  • This rule recommends implementing Database Sharding for scaling SQL databases.

Example Scenario:

  • Let’s take the example of an online marketplace with a rapidly growing number of users and transactions. As the database grows, it might face performance challenges. Database Sharding involves breaking the database into smaller, more manageable pieces called shards. Each shard is responsible for a subset of the data.
  • For instance, instead of storing all user data in a single database, the system can shard the user data across multiple databases based on some criterion, such as user ID range or location. This way, when querying for user information, the system can target the specific shard containing the relevant data, distributing the load and improving performance.
  • Sharding is useful for horizontally scaling databases and handling increased data volume, making it a valuable strategy for applications experiencing significant growth in user base and data complexity.
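
Here is a rough sketch of shard routing. It assumes four shards and routes by user ID modulo the shard count, which is one common alternative to the ID ranges mentioned above; each dictionary stands in for a separate database server, and all names are hypothetical.

```python
NUM_SHARDS = 4

# Each dict stands in for a separate database server.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(user_id):
    """Pick a shard from the user ID (hash-based: ID modulo shard count)."""
    return user_id % NUM_SHARDS


def save_user(user_id, profile):
    shards[shard_for(user_id)][user_id] = profile


def load_user(user_id):
    # Only the one shard that owns this ID is queried.
    return shards[shard_for(user_id)].get(user_id)


for uid in range(1, 9):
    save_user(uid, {"name": f"user-{uid}"})

profile = load_user(6)   # routed to shard 2
```

Modulo routing is simple, but changing `NUM_SHARDS` remaps most users, which is the problem consistent hashing (rule 24) addresses.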

11. Global Data Delivery: Consider CDN.

  • This rule recommends using a Content Delivery Network (CDN) for global data delivery.

Example Scenario:

  • Let’s understand this with the example of a content-heavy website, like a news portal or a streaming service, with users distributed globally. Without a CDN, users might experience slow load times due to the physical distance between their location and the origin server.
  • By integrating a CDN, the content — such as images, videos, or web pages — is distributed across multiple servers strategically placed in various geographic locations. When a user requests content, the CDN delivers it from the nearest server rather than fetching it from the origin server. This significantly reduces latency and improves the overall user experience.
  • CDNs are particularly beneficial in scenarios where global data delivery is crucial, ensuring that users around the world can access content quickly and efficiently.

12. Graph Data: Utilize Graph Database.

  • This rule suggests utilizing a Graph Database for managing data with nodes, edges, and relationships.

Example Scenario:

  • Imagine a social networking platform where users are connected through relationships. Each user is a node, and their connections (friendships) form edges. A traditional relational database might struggle to efficiently represent and query such complex relationships.
  • In this scenario, a Graph Database like Neo4j or Amazon Neptune is well-suited. It excels in managing and querying interconnected data, making it easier to retrieve information about relationships and traverse the graph efficiently. This is particularly beneficial for applications with complex relationships between entities, such as social networks, fraud detection systems, or recommendation engines.

13. Scaling Various Components: Go for Horizontal Scaling.

  • Horizontal Scaling is an efficient way to scale the various components of a system.

Example Scenario:

  • Let’s take an example of an e-commerce platform having multiple services like user authentication, product catalog, and order processing. As the user base grows, individual components may experience increased load.
  • Instead of vertically scaling (adding more resources to a single node/server), the system can horizontally scale by adding more servers or instances of specific services. For example, if the product catalog service is under heavy load, additional instances of that service can be deployed to share the load and distribute incoming requests.
  • Horizontal scaling enhances the system’s ability to handle increased traffic and provides a more cost-effective and flexible approach to scaling, especially in distributed architectures where different components may experience varying levels of demand.

14. High-Performing Database Queries: Utilize Database Indexes.

  • Using Database Indexes for high-performing database queries is beneficial.

Example Scenario:

  • In a database with a large volume of data, retrieving specific information quickly can be challenging without proper optimization. Database Indexes serve as data structures that improve the speed of data retrieval operations on a database table.
  • For instance, in an e-commerce database, if there’s a frequent need to search for products based on their category, creating an index on the “category” column can significantly speed up such queries. The index allows the database engine to quickly locate and retrieve relevant records without scanning the entire table.
  • By strategically using indexes, the system can ensure that common queries are executed efficiently, contributing to faster response times and improved overall database performance.
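
The effect of an index can be observed with SQLite from Python's standard library. This sketch assumes a hypothetical `products` table and asks SQLite for its query plan before and after creating an index on `category`; the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category TEXT)"
)
conn.executemany(
    "INSERT INTO products (name, category) VALUES (?, ?)",
    [(f"item-{i}", "books" if i % 2 else "toys") for i in range(1000)],
)


def query_plan(sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)   # last column holds the detail text


sql = "SELECT name FROM products WHERE category = ?"

plan_before = query_plan(sql, ("books",))   # full table scan
conn.execute("CREATE INDEX idx_products_category ON products (category)")
plan_after = query_plan(sql, ("books",))    # lookup through the new index
```

Printing `plan_before` and `plan_after` shows the switch from a scan to an index search. Indexes are not free: each one adds storage and slows down writes, so they are best reserved for columns that are queried often.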

15. Single Point of Failure: Introduce Redundancy.

  • Whenever your system has a Single Point of Failure, introduce Redundancy to address it.

Example Scenario:

  • In any system, a single point of failure can lead to service disruptions or outages. Redundancy involves creating duplicates or backups of critical components to ensure that if one fails, another can seamlessly take over.
  • For instance, in a web application, having multiple servers hosting the application allows for redundancy. If one server goes down due to hardware failure or other issues, the load balancer can redirect traffic to the remaining servers, which prevents a complete service interruption.
  • Redundancy is important for maintaining system availability and reliability, especially in scenarios where uninterrupted service is essential, such as financial systems and healthcare applications.

16. Bulk Job Processing: Use Batch Processing and Message Queues.

  • When bulk jobs need to be processed, Batch Processing and Message Queues are beneficial.

Example Scenario:

  • Let’s take the example of an analytics platform that needs to process large volumes of data periodically, such as generating daily reports or running complex computations on historical data. Instead of handling these tasks in real time, which might overload the system, Batch Processing can be employed. The system collects data over a period and queues it for batch processing.
  • A Message Queue manages the order and distribution of these processing tasks. The Batch Processing system can then efficiently handle these jobs, performing computations on the collected data without affecting real-time operations.
  • This approach ensures that resource-intensive tasks are executed in a controlled manner, preventing system overload during peak usage times and optimizing overall performance.

17. Server Load Management: Apply Rate Limiter.

  • This rule advises using a Rate Limiter for Server Load Management and for mitigating Denial-of-Service (DoS) attacks.

Example Scenario:

  • In an online service, particularly one that involves user authentication or API requests, implementing a Rate Limiter helps control the rate at which requests are accepted from a particular client or IP address. This is essential for preventing abuse, such as brute-force attacks on passwords or overwhelming the server with a high volume of requests.
  • For instance, if a user attempts to log in and enters the wrong password multiple times within a short period, a Rate Limiter can temporarily block further login attempts from that IP address. Similarly, it can limit the number of API requests a client can make in a given time frame.
  • By implementing a Rate Limiter, the system can protect itself from potential malicious activities, ensuring fair and controlled access to its services.
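
One simple way to implement this is a sliding-window limiter per client. The sketch below is an assumption-laden illustration, not a production design: the class name, the limit of 3 requests per 60 seconds, and the injected fake clock are all chosen just to keep the example easy to follow.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id):
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# A fake clock makes the behaviour visible without sleeping.
fake_now = [0.0]
limiter = SlidingWindowRateLimiter(
    max_requests=3, window_seconds=60, clock=lambda: fake_now[0]
)

results = [limiter.allow("203.0.113.7") for _ in range(4)]  # 4th call is rejected
fake_now[0] = 61.0
after_window = limiter.allow("203.0.113.7")                  # window has passed
```

In a multi-server setup the counters would usually live in a shared store rather than in process memory, so that every server sees the same request history.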

18. Microservices Architecture: Employ API Gateway.

  • It is highly recommended to use an API Gateway for Microservices Architecture.

Example Scenario:

  • In a microservices architecture, where various components or services handle specific functionalities, an API Gateway acts as a centralized entry point for managing and directing incoming requests. This is beneficial for several reasons, including simplifying client communication, handling authentication, and aggregating responses from multiple services.
  • For example, imagine an e-commerce application with separate microservices for handling user authentication, product catalog, and order processing. The API Gateway can receive a user’s request, authenticate the user, route the request to the appropriate microservices, aggregate the results, and then send a unified response back to the client.
  • Using an API Gateway simplifies the client-side experience, enhances security, and allows for better control and management of the microservices ecosystem.

19. Data Integrity: Use a Checksum Algorithm.

  • This rule suggests ensuring Data Integrity with a Checksum Algorithm.

Example Scenario:

  • In data transmission or storage, errors can occur, which leads to data corruption. A checksum is a value calculated from a set of data, and it serves as a verification mechanism to ensure data integrity.
  • For example, when transferring a file over a network, the sender can calculate a checksum before sending it and transmit the checksum along with the file. The receiving end then recalculates the checksum upon receiving the file. If the recalculated checksum matches the original, the file was very likely not corrupted during the transfer.
  • Checksum algorithms are essential for maintaining data integrity in scenarios where accuracy is crucial, such as in financial transactions, file transfers, or any data exchange where errors must be detected.
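
The send-and-verify flow can be sketched with Python's `hashlib`. SHA-256 is used here as one possible choice (an assumption; cheaper checksums such as CRC32 are also common when only accidental corruption matters), and the `send` and `verify` helpers are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Compute a SHA-256 digest of the payload as a hex string."""
    return hashlib.sha256(data).hexdigest()


def send(data: bytes):
    """Sender: ship the payload together with its checksum."""
    return data, checksum(data)


def verify(data: bytes, expected: str) -> bool:
    """Receiver: recompute the checksum and compare."""
    return checksum(data) == expected


payload, digest = send(b"transfer 100 USD to account 42")
intact = verify(payload, digest)

corrupted = bytearray(payload)
corrupted[0] ^= 0x01                         # flip one bit "in transit"
tampered = verify(bytes(corrupted), digest)  # the mismatch is detected
```

A checksum only detects corruption. Recovering from it still requires a retry or a redundant copy of the data.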

20. Analytics and Audit Trails: Consider data lakes or append-only databases.

  • This rule recommends using data lakes or append-only databases for analytics and audit trails.

Example Explanation:

  • In systems that require extensive analytics and comprehensive audit trails, using data lakes or append-only databases can be beneficial. These storage solutions allow for efficient and scalable handling of large volumes of data.
  • For example, in a healthcare application, where detailed audit trails and extensive analytics on patient data are necessary, a data lake can store diverse data types and provide a platform for robust analytics.
  • By leveraging these storage solutions, a system can effectively manage and analyze vast amounts of data, supporting requirements for analytics, compliance, and auditing, especially in applications dealing with complex and varied datasets.

21. Fault-Tolerance: Implement Data Replication.

  • Implementing Data Replication for Fault-Tolerance and Durability is beneficial.

Example Scenario:

  • In a distributed system, data replication involves creating and maintaining copies of data across multiple nodes in a cluster. This redundancy ensures that even if one node fails, copies of the data are still available, which provides fault tolerance and durability.
  • For example, in a database system, data replication might involve having mirrored databases across geographically dispersed locations. If one database becomes unavailable due to a hardware failure or other issues, the system can seamlessly switch to another replica, minimizing downtime and data loss.
  • Data replication is valuable in scenarios where continuous access to data is important, such as in financial systems, healthcare databases, or any application where fault tolerance and data durability are top priorities.

22. User-to-User Fast Communication: Use WebSockets.

  • This rule advises using WebSockets for fast communication between users in a User-to-User scenario.

Example Scenario:

  • Consider a real-time chat application where users need to exchange messages instantly. Traditional HTTP request-response communication adds latency here, because the server cannot push a message on its own: the client has to keep polling, and every poll is a new request.
  • WebSockets provide a persistent, bidirectional communication channel between clients (web browsers or applications) and the server. In a chat application, when one user sends a message, the server can instantly push that message to the recipient’s device over the established WebSocket connection, ensuring low latency and real-time communication.
  • WebSockets are valuable in scenarios where immediate and interactive communication between users is essential, such as in chat applications, online gaming, or collaborative editing platforms.

23. Failure Detection: Implement Heartbeat.

  • This rule recommends implementing a Heartbeat for Failure Detection in Distributed Systems.

Example Scenario:

  • In a distributed system where multiple nodes or components work together, it’s very important to detect failures promptly. A heartbeat mechanism involves sending regular signals or messages (heartbeats) between components to verify their status.
  • For example, in a cluster of servers, each server can send periodic heartbeats to confirm its operational status. If a server stops sending heartbeats, it indicates a potential failure, and the system can take appropriate actions, such as redirecting traffic to healthy servers or initiating recovery procedures.
  • Implementing a heartbeat mechanism helps maintain system reliability by quickly identifying and responding to failures in a distributed environment, ensuring uninterrupted service.
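
A heartbeat monitor can be as simple as a table of "last seen" timestamps. The sketch below assumes a 5-second timeout and uses an injected fake clock so the timeline is easy to follow; the `HeartbeatMonitor` class and server names are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per node and flag nodes that go silent."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self._last_seen = {}   # node id -> time of last heartbeat

    def beat(self, node_id):
        self._last_seen[node_id] = self.clock()

    def failed_nodes(self):
        now = self.clock()
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5, clock=lambda: fake_now[0])

monitor.beat("server-a")
monitor.beat("server-b")

fake_now[0] = 4.0
monitor.beat("server-a")            # server-b stays silent

fake_now[0] = 8.0
suspected = monitor.failed_nodes()  # ["server-b"]
```

The timeout is a trade-off: a short one detects failures quickly but may flag nodes that are merely slow, while a long one delays failover.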

24. Efficient Server Scaling: Apply Consistent Hashing.

  • This rule suggests implementing Consistent Hashing for Efficient Server Scaling.

Example Scenario:

  • In a distributed system where data is spread across multiple servers, consistent hashing is a technique that makes it smooth and efficient to scale the server pool. Traditional hashing methods, such as taking the key’s hash modulo the number of servers, remap most keys when a server is added or removed, but consistent hashing minimizes this impact.
  • For example, in a key-value store, when a new node is added to the cluster, only a fraction of the keys need to be remapped to the new server, which reduces the overall data movement. In the same way, when a server is removed, only the data associated with that server needs to be redistributed.
  • Consistent hashing is valuable in scenarios where dynamic scaling of servers is common, as it allows for efficient and balanced distribution of data across the server pool with minimal disruption.
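
Here is a compact sketch of a hash ring with virtual nodes. The `HashRing` class, the choice of SHA-256, and the 100 virtual nodes per server are assumptions made for illustration; real systems tune these differently.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []   # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each server appears at many positions to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        index = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[index % len(self._ring)][1]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {key: ring.get_node(key) for key in keys}

ring.add_node("cache-d")
after = {key: ring.get_node(key) for key in keys}

# Only keys that now belong to the new server have moved.
moved = [key for key in keys if before[key] != after[key]]
```

Every key in `moved` now maps to `cache-d`, and the rest stay where they were. With plain modulo hashing, going from three to four servers would remap most of the keys instead.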

25. Decentralized Data Transfer: Consider Gossip Protocol.

  • This rule suggests using the Gossip Protocol for Decentralized Data Transfer.

Example Scenario:

  • Let’s take the example of a decentralized system where nodes need to exchange information without relying on a central authority. The Gossip Protocol is a communication method in which each node shares information with a few randomly selected peers. Over time, the information spreads throughout the network.
  • For instance, in a peer-to-peer network, when a node has new data, it gossips this information to a few randomly chosen neighboring nodes. Those nodes, in turn, do the same. This process continues, ensuring that information is gradually disseminated across the entire network.
  • The Gossip Protocol is useful in scenarios where decentralized communication and data transfer are essential, such as in blockchain networks or peer-to-peer systems.
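
The spreading behaviour can be simulated in a few lines. This is a simplified push-style model under stated assumptions: 32 nodes, a fanout of 2 peers per round, synchronous rounds, and a fixed random seed; the `gossip_rounds` function is hypothetical and ignores message loss and node failures.

```python
import random


def gossip_rounds(num_nodes, fanout=2, seed=0, max_rounds=100):
    """Simulate push gossip; return how many nodes are informed after each round."""
    rng = random.Random(seed)
    informed = {0}                       # node 0 starts with the new data
    history = [len(informed)]
    for _ in range(max_rounds):
        if len(informed) == num_nodes:
            break
        newly_informed = set()
        for node in informed:
            # Every informed node tells `fanout` randomly chosen other nodes.
            peers = [p for p in range(num_nodes) if p != node]
            newly_informed.update(rng.sample(peers, fanout))
        informed |= newly_informed
        history.append(len(informed))
    return history


history = gossip_rounds(num_nodes=32)   # informed-node count per round
```

Printing `history` shows the count growing quickly at first and then levelling off as most randomly chosen peers already have the data. That is why gossip typically reaches the whole network in a number of rounds that grows roughly with the logarithm of the cluster size.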

27. High Availability Trade-Off: Embrace Eventual Consistency.

  • This rule suggests acknowledging the trade-off between High Availability and Consistency and considering Eventual Consistency.

Example Explanation:

  • In distributed systems, achieving both high availability and strong consistency simultaneously can be challenging. Eventual Consistency is a model where, given time, all replicas of a piece of data will converge to the same value, but it might not be immediately consistent.
  • For example, in a distributed database, after a write operation it might take some time for all replicas to reflect the updated data. During this period, there might be a temporary inconsistency between replicas. Eventual Consistency is acceptable in scenarios where real-time consistency is not a strict requirement, and availability and partition tolerance are prioritized, such as in many web applications.

28. Handling Large Data: Implement Pagination.

  • This rule suggests implementing Pagination to handle large data in network requests.

Example Explanation:

  • Retrieving a large data set from a server in a single request can be resource-intensive or impractical. In such scenarios, pagination breaks the data into smaller, manageable chunks.
  • Let’s take the example of an e-commerce website that displays a list of products. Instead of loading all products at once, the system might retrieve and display the first 20 products initially. Then, as the user scrolls down or navigates to the next page, the system fetches the next set of products dynamically.
  • Pagination improves the user experience by reducing load times and optimizing resource usage, particularly in situations where loading extensive data in a single request may lead to performance issues.
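
A minimal offset-based version of this looks like the following. The `get_page` helper and its response shape are assumptions for illustration; the page size of 20 matches the example above.

```python
def get_page(items, page, page_size=20):
    """Return one page of items plus the metadata a client needs to continue."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    chunk = items[start:start + page_size]
    return {
        "items": chunk,
        "page": page,
        "has_next": start + page_size < len(items),
    }


products = [f"product-{i}" for i in range(1, 46)]   # 45 products

first_page = get_page(products, page=1)   # product-1 .. product-20
last_page = get_page(products, page=3)    # product-41 .. product-45
```

Offset-based paging is the simplest form. For very large or frequently changing data sets, cursor-based pagination (continuing from the last item seen) is a common alternative.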

29. Cache Eviction Policy: Prefer LRU Cache.

  • This rule recommends defining a Cache Eviction Policy, with the preferred choice being LRU (Least Recently Used) Cache.

Example Explanation:

  • In caching systems, when the cache reaches its capacity, an eviction policy determines which items to remove to make space for new ones. LRU Cache is a commonly used policy where the least recently accessed items are removed first.
  • For example, if a web server caches frequently accessed images, and the cache is full, the LRU policy ensures that the images least recently requested are evicted to make room for new content.
  • Choosing and implementing an effective cache eviction policy is crucial for optimizing cache performance and ensuring that the most relevant data is retained, especially in systems dealing with a large volume of data and limited cache space.
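
An LRU cache is short to write with `collections.OrderedDict`. This sketch assumes a capacity of two entries and made-up image names so that the eviction is easy to see.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()   # oldest entry first, newest last

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used

    def keys(self):
        return list(self._data)


images = LRUCache(capacity=2)
images.put("logo.png", b"logo-bytes")
images.put("banner.png", b"banner-bytes")
images.get("logo.png")                   # logo.png is now the most recent
images.put("hero.png", b"hero-bytes")    # evicts banner.png
```

After the last call the cache holds `logo.png` and `hero.png`. `banner.png` was evicted because it was the entry that had gone longest without being accessed.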

30. Handling Traffic Spikes: Use Autoscaling.

  • This rule suggests implementing Autoscaling to manage resources dynamically, especially to handle traffic spikes.

Example Explanation:

  • Autoscaling is a cloud computing feature that automatically adjusts the number of resources (such as servers or instances) based on the current demand. This is particularly useful during traffic spikes or periods of increased workload.
  • For instance, if a web application experiences a sudden surge in user traffic, autoscaling can dynamically add more server instances to handle the increased load. Conversely, during periods of low demand, it can scale down to save resources and costs.
  • Implementing autoscaling ensures that a system can adapt to varying workloads efficiently, which provides optimal performance and resource utilization, especially in scenarios where demand fluctuates unpredictably.

Resources used to write this blog:

  • Learned from YouTube channels
  • Used Google to research and resolve my doubts
  • Inspired by and learned from many Medium blogs on System Design
  • My own experience
  • Used Grammarly to check my grammar and word choice
  • Special thanks to Aqil Zeka, who inspired me to write it in more detail!

If you enjoy reading my blogs, consider subscribing to my feeds. Also, if you are not a Medium member and you would like to gain unlimited access to the platform, consider using my referral link to sign up.
