avatarJason Ngan

Summary

The article contrasts row-based and column-based databases, detailing their storage mechanisms, performance characteristics, data compression, and parallel processing capabilities to guide users in selecting the appropriate database type for their specific use cases.

Abstract

The article "Database 101: Row-Based vs. Column-Based Databases" delves into the foundational differences between row-oriented and column-oriented database storage systems. It explains that row-based databases store all data for a single record in contiguous data pages, optimizing for transactional operations where full records are frequently accessed. In contrast, column-based databases store data by column, which is beneficial for analytical queries that typically involve a subset of columns from a table. The article highlights the advantages of column-based databases in terms of data compression, due to the homogeneity of data types in columns, and their superiority in parallel processing. However, it also acknowledges the drawbacks of column-based databases when it comes to retrieving or updating entire records. The conclusion emphasizes the importance of understanding these differences to make informed decisions about which database system to use based on the application's requirements.

Opinions

  • The author suggests that row-based databases are ideal for transactional systems where operations often involve reading and writing full records.
  • It is implied that column-based databases are more efficient for analytical and data warehousing tasks due to their ability to handle large datasets with queries that touch only a few columns.
  • The article conveys that column-based databases generally offer better data compression due to the use of techniques like run-length encoding, which is facilitated by the uniformity of data types within columns.
  • The author expresses that column-based databases have an edge in parallel processing because columns can be independently accessed and processed by multiple threads or nodes without contention.
  • The article concludes by advising that the choice between row-based and column-based databases should be made based on the specific use case to ensure optimal performance and efficiency.

Database 101: Row-Based vs. Column-Based Databases

What are the differences and why should you know?

Photo by Joao Vitor Heinrichs from Pexels

In database management, there are two fundamental approaches for storing data: row-based and column-based database.

Each approach carries its own strengths and weaknesses and is suited for different use cases.

In this article, we will learn the key differences between row and column-oriented databases so that you can choose the right solution for your application.

Let’s get started.

How data is stored on disk?

Each record is stored in a data page within a file

Before we dive into the differences, let’s revisit how data is stored on disk.

Every record in a database is stored in a fixed-size data page within a larger file.

Each data page is typically 8kb in size and all IO operations in the database works on the page-level, aka, page-level IO.

When querying a single record, an I/O operation retrieves a data page with multiple rows , rather than just an individual record.

Row-based database

Data is stored as rows in a data page

In a row-based database, each record is stored as a row in a data page.

As all data in a single row is stored on the same data page, reading and writing a row as a whole is quick and efficient.

This is ideal for transactional use cases, where individual rows of data are frequently read and updated as a whole.

However, as the entire row must be read in each IO, row-based database can be slower for queries that only access specific columns.

This can be a problem in analytical use cases where large amounts of data are queried from only a few columns.

Column-based database

Data is stored as columns in a data page

In a column-based database, data is stored as columns in a data page.

This allows the database engine to only retrieve the required columns for a particular query.

This is ideal for analytical queries where large amounts of data is drawn from a few columns.

For example, when aggregating data on user age in a user table, only the age column needs to be loaded into memory in a column-based database, while a row-based database would need to load all rows and columns.

However, since each column is stored separately in different data pages, querying or writing a full record can be slower, as multiple IOs are needed to read or write to the data pages.

What are their differences?

Now that we have a rough idea of how row-oriented and column-oriented database work, let’s dive deeper into their differences.

Performance

Row-based

  • Row-based database works best when a single record is read and updated as a whole frequently. A single IO fetches a full data page containing all columns in a row.
  • When querying data that requires only a specific column, row-based database bring in extra data to the memory.

Column-based

  • Column-based database thrives when the query only touches a subset of the columns in a table as each column is stored separately in different data pages.
  • When querying or writing a single record as a whole, column-based database has a slower performance.

Data compression and encoding

Run-length encoding technique

Column-based databases are typically better suited for data compression than row-based databases.

In a column-based database, data is stored as columns in data pages, with most columns having the same data type. As a result, the database can achieve better compression rates.

For instance, if a column contains the values “A”, “A”, “A”, “B”, “B”, a column-based database can leverage the run-length encoding technique to compress the data as “A(3), B(2)”

While a row-based database can still utilise data compression, the compression rate is limited since all columns are stored in the same data page and most have different values or data types.

Parallel processing

Column-based database is better in parallel processing than row-based database as columns can be divided and processed independently by multiple threads or nodes.

In contrast, row-based databases stores each record as a row in a single data page, so individual columns cannot be accessed or processed independently without first accessing the entire row.

A row-based database can suffer from contention issues when multiple threads or nodes try to access the same data page,, limiting the scalability of the database for parallel processing.

Conclusion

Photo by Monstera from Pexels

In short, a row-based database stores data as rows in data pages and is better suited for transactional use cases where rows are often accessed as a whole.

Column-based database, on the other hand, stores data as columns and allows more efficient query when only a specific columns are needed.

Choosing between the two approaches depends on the specific use case, and understanding the differences between them is crucial for making smarter design decisions.

I hope you find this helpful and I will see you at the next one!

If you are interested in articles like this, join me and sign up for Medium today!

Database
System Design Interview
Software Development
Software Engineering
Software Architecture
Recommended from ReadMedium