Database 101: Row-Based vs. Column-Based Databases
What are the differences and why should you know?

In database management, there are two fundamental approaches for storing data: row-based and column-based database.
Each approach carries its own strengths and weaknesses and is suited for different use cases.
In this article, we will learn the key differences between row and column-oriented databases so that you can choose the right solution for your application.
Let’s get started.
How data is stored on disk?

Before we dive into the differences, let’s revisit how data is stored on disk.
Every record in a database is stored in a fixed-size data page within a larger file.
Each data page is typically 8kb in size and all IO operations in the database works on the page-level, aka, page-level IO.
When querying a single record, an I/O operation retrieves a data page with multiple rows , rather than just an individual record.
Row-based database

In a row-based database, each record is stored as a row in a data page.
As all data in a single row is stored on the same data page, reading and writing a row as a whole is quick and efficient.
This is ideal for transactional use cases, where individual rows of data are frequently read and updated as a whole.
However, as the entire row must be read in each IO, row-based database can be slower for queries that only access specific columns.
This can be a problem in analytical use cases where large amounts of data are queried from only a few columns.
Column-based database

In a column-based database, data is stored as columns in a data page.
This allows the database engine to only retrieve the required columns for a particular query.
This is ideal for analytical queries where large amounts of data is drawn from a few columns.
For example, when aggregating data on user age in a user table, only the age column needs to be loaded into memory in a column-based database, while a row-based database would need to load all rows and columns.
However, since each column is stored separately in different data pages, querying or writing a full record can be slower, as multiple IOs are needed to read or write to the data pages.
What are their differences?
Now that we have a rough idea of how row-oriented and column-oriented database work, let’s dive deeper into their differences.
Performance
Row-based
- Row-based database works best when a single record is read and updated as a whole frequently. A single IO fetches a full data page containing all columns in a row.
- When querying data that requires only a specific column, row-based database bring in extra data to the memory.
Column-based
- Column-based database thrives when the query only touches a subset of the columns in a table as each column is stored separately in different data pages.
- When querying or writing a single record as a whole, column-based database has a slower performance.
Data compression and encoding

Column-based databases are typically better suited for data compression than row-based databases.
In a column-based database, data is stored as columns in data pages, with most columns having the same data type. As a result, the database can achieve better compression rates.
For instance, if a column contains the values “A”, “A”, “A”, “B”, “B”, a column-based database can leverage the run-length encoding technique to compress the data as “A(3), B(2)”
While a row-based database can still utilise data compression, the compression rate is limited since all columns are stored in the same data page and most have different values or data types.
Parallel processing
Column-based database is better in parallel processing than row-based database as columns can be divided and processed independently by multiple threads or nodes.
In contrast, row-based databases stores each record as a row in a single data page, so individual columns cannot be accessed or processed independently without first accessing the entire row.
A row-based database can suffer from contention issues when multiple threads or nodes try to access the same data page,, limiting the scalability of the database for parallel processing.
Conclusion

In short, a row-based database stores data as rows in data pages and is better suited for transactional use cases where rows are often accessed as a whole.
Column-based database, on the other hand, stores data as columns and allows more efficient query when only a specific columns are needed.
Choosing between the two approaches depends on the specific use case, and understanding the differences between them is crucial for making smarter design decisions.
I hope you find this helpful and I will see you at the next one!
If you are interested in articles like this, join me and sign up for Medium today!





