Use the Source: Redis Internal Tricks
You can stare at architecture diagrams all day, but the real power lies in the source code.

Redis (https://github.com/redis) is a C-based in-memory first database server with persistence options. It’s often compared with Memcached. One key difference is that Redis provides a richer set of primitive operations that make it more versatile for a variety of use cases. There are abundant Redis guides online in the form of books, blogs, and websites. This post aims to discover some interesting details in the Redis’ source code, a perspective that’s rarely seen in other high level summaries.
Event Loop
One of the biggest surprises to people unfamiliar with Redis is that Redis is single-threaded. Well, OK, it’s not entirely true. It uses thread-pool to manage some cleanup and even separate processes for the persistence operations. But the main event loop is single-threaded. The argument is that if you can keep every event handler lightweight, single-thread can be very performant (because no context switch is needed), and needless to say, so much more manageable. This turns out to be true for Redis, and it’s proven to work really well. It goes without saying that Redis employs many helpful tricks to minimize the event handler scope, such as amortized key expiration, on the fly table rehashing, etc.
Redis’ main event loop is a select call, which is an interface implemented by the real select, epoll, evpoll, kqueue, etc. These are platform specific system calls. Redis evaluates the environment macros (such as __linux__, __APPLE__) predefined by the gcc compiler to determine which implementation to compile in. By the way, the different versions of “poll” is a very interesting topic and there are numerous resources online for them.
The event loop handles both file events and time events. File events include TCP and Unix sockets. Regarding to select on file events, one off-topic point I should mention is that file descriptors associated with regular files will always select true immediately for read/write/error. So don’t try to apply this paradigm to regular files. The event loop waits for the file events with a timeout. The timeout is set to be the time to the next time event. This way, the event loop wakes up either for any available file operations or when the next time event is due. Time events include operations like serverCron which does many important jobs. One interesting but less conspicuous thing serverCron does is to obtain a global system time that other functions can reference. This saves a lot of system calls and makes the time for the binary more consistent. Best of all, no access synchronization is needed for this global time because of the simplicity of single-thread.
In addition to the handlers of the file and time events. The event loop can also be registered with beforeSleep and afterSleep functions for operations before entering select and after leaving select.
Memory Optimization
An in-memory database is about using memory, and it’s also about how to use less of it. Redis builds in a lot of intriguing techniques to improve memory efficiency.
It started with a superior memory allocation library, jemalloc (https://github.com/jemalloc/jemalloc). Memory is allocated in the so-called arenas. Different threads may be assigned to different arenas for memory allocation so that no synchronization is needed. Inside the arena, memory is managed by size category — small, large, huge — to reduce fragmentation. This is a helpful post that does a good job analyzing the advantage of jemalloc. One thing to note is that in Redis, the memory allocation call is wrapped behind a zmalloc interface, and the intention is for platform compatibility. zmalloc prefers jemalloc and uses it as a drop-in replacement for the regular malloc when possible.
Redis tries to save memory in every place. If there is a chance to use a C union primitive, it will. For example, the dictionary entry uses a union of void* and long long to support multiple data types. What’s more, when storing values, Redis spends some CPU cycles to attest if the content in a string is actually a number. If so, it stores it to a number to save space. In addition, Redis defines variances of the common structures so that it can choose smaller struct definition when possible. For instance, its canonical string buffer SDS has SDS8, SDS16, SDS32, and SDS64 variances in which integer variables take up respective bit lengths. Last but not least, Redis often uses static shared objects to avoid recreating them, e.g., many error messages are global and immutable.
Redis also pays close attention to its hash table size. If the keys become too sparse, Redis will contract the table size. To avoid heavy operations in one round of the event loop, Redis actually uses two hash tables during the rehashing period and incrementally moves keys over. In the end, it switches to use the new table atomically. Other advanced data structures are also in use. Ziplist is a typical one. Ziplist is a compact representation of a double-linked list in a continuous blob of memory (think about all the space we can save in the absence of those 32-bit pointers). Each entry stores the length of previous entry, the type and the length of its own so that the list can be traversed both ways. The obvious shortcoming is that it needs to be resized when the content outgrows the buffer or when there is too much free space in the end. Ziplist works best with the right size. Therefore, Redis uses a top level double-linked list for a bunch of ziplists. In a way, it’s like a two level index. Each element of the top level double-linked list points to a ziplist. This greatly reduces the need to resize compared to a single monolithic ziplist. Another trick is to compress the ziplists in the middle of the top level double-linked list as only the head and tail are commonly accessed.
One more thing Redis optimizes is to avoid rehashing during database dump. Redis database dump is done through forking a child process and relying on OS’s copy-on-write to avoid synchronizing data in code. Rehashing tends to modify a large range of data, causing the OS to copy all those memory pages.
Persistence Layer
Redis provides persistence options. It’s a tradeoff users can configure for throughput vs data lost. The intuition is that if you want higher throughput, you’ll have to save the data to disk less often, increasing the chances of data lost when a crash or power down occurs.
Redis saves data in two formats, RDB (Redis Database) and AoF (Append-only Files). RDB takes a snapshot of the current hash table and writes it to the disk. Redis forks a child process for this purpose so that the code doesn’t have to bother with synchronization. It relies on the OS’s copy-on-write to maintain the integrity of the snapshot. In most modern OS, a forked process shares most of its parent’s memory pages until they are modified, during which the OS will intervene before the write and make a copy for the forked process. The cadence of RDB saving is configured by a save point config, which is essentially a config to instruct Redis to save RDB after certain number changes or a certain time window.
AoF is analogous to the transaction commit log of other databases. Every write command is logged into an aof_buf before the command is considered processed. Redis periodically writes data in aof_buf to the output, usually a file. A worth noting detail is that when a program calls the C fflush, the data is flushed from FILE’s internal buffer to the OS. Then when the program calls fsync, the OS flushes it to the device. The device itself, however, may have other layers of caches. So for many platforms, there is actually no way for the application to be certain that data is persisted on the device — some platforms though do support custom fsync that only returns after the data is actually written on device, but Redis does not assume such OS support. So Redis basically accepts that there is a slim chance of data lost no matter how frequent you configure the fsync. By the way, fsync is such a heavy operation that you want to minimize as long as possible.
To stop the AoF from growing forever, AoF rewrite takes place every once a while. It reduces the size of AoF by merging write commands. For example, you only need to keep the last write record for the same key, and you can remove any write record of a key that was later deleted. This is because AoF is designed to reproduce the final database snapshot rather than providing a revision history. During the rewrite, data can continue to be appended to the old AoF. In the background, the old AoF is merged into the new one. At some point in the process, Redis will stop writing its in-memory aof_buf to the file because it needs the new AoF to catch up with the old one so that it can switch to the new AoF. Once that switching is done, the aof_buf is then flushed to the new AoF.
The AoF file switching is done by renaming the temporary new AoF to the canonical AoF (the same name as the old AoF). This effectively overrides the old AoF. A useful thing to note is that the file descriptor of the old AoF stays valid even when the underlying file is overridden (an OS feature). The underlying old file is only removed when the application calls unlink on the old file descriptor. Since calling unlink may lead to deleting a large old AoF, Redis delegates it to a separate thread to avoid blocking the main event loop.
Closing
OK, I think that’s enough juicy details for one post. I am thinking about writing a series of “Use the Source” to share more interesting findings from the implementation of famous open-source projects. So, see you next time.






