Summary

The web content discusses the complexities and strategies of implementing cache in software systems to enhance performance while ensuring data consistency across different levels of the system, from CPU cache to distributed system cache.

Abstract

The article delves into the intricacies of cache management in software development, emphasizing the challenges of maintaining data consistency in multi-threaded environments. It explores CPU-level cache coherence issues and the use of the volatile keyword in programming languages like Java and C# to ensure memory visibility without the overhead of locks. The narrative progresses to system-level caching strategies, such as write-through and read-through, and their effectiveness in keeping cache and database values synchronized. The author presents a range of solutions for cache coherence problems in distributed systems, including eventual consistency with a delayed second cache deletion, strong consistency using distributed locks, and optimistic versioning. The trade-offs between performance and consistency are highlighted, with the conclusion that a balance must be struck based on the system's requirements.

Opinions

The author suggests that while cache implementation can significantly improve system performance, it introduces complex issues such as cache invalidation and data consistency.
The use of volatile is seen as a cheaper alternative to locks for ensuring visibility of boolean variables across threads, but it does not replace the need for proper synchronization mechanisms.
The author implies that a deep understanding of CPU cache architecture and cache coherence protocols like MESI is beneficial for software developers working with multi-threaded code.
The article conveys that cache coherence issues persist at the system level and can be addressed through strategies like write-through and read-through caching, though these strategies have limitations and can lead to out-of-sync data under certain conditions.
The author opines that achieving strong consistency in distributed caching scenarios often requires sacrificing performance, but optimizations like versioning can offer a middle ground.
The author recommends using message queues or similar pub/sub mechanisms for cache invalidation in distributed systems, akin to the MESI protocol at the CPU level.
Retries are suggested as a practical approach to handle scenarios where cache deletion operations fail in a distributed environment.
The article concludes with the pragmatic view that there is no one-size-fits-all solution to cache coherence, and developers must carefully consider the trade-offs between performance and data consistency.

I have asked this cache question to candidates for a couple of senior positions.

It usually starts when I see a line like “I boosted system performance X times by using Y cache”.

Cache is hard

There are only two hard things in Computer Science: cache invalidation and naming things.

- Phil Karlton

A cache brings performance, but it also brings problems. The biggest one is data consistency.

Let’s walk through it with details from the interviews.

Story #1: cache in the small world

Many candidates are not aware that there is also a cache at the CPU level.

“You mentioned here that you fixed a critical issue in multi-threaded code. Could you explain what it was?”

“Sure, I made the variable that is shared across multiple threads volatile,” he answered.

“So what is the type of this volatile variable, and why are you using it?”

“It is an integer. It makes sure all threads get the updated value of the variable,” he said.

“Do you know the cost of using it? And could you please explain why it works?”

“I’m not sure. It just works. Without volatile, multiple threads see different values,” he said.

He was not wrong. I just wanted to dig deeper to find the limits of his knowledge (it was a senior C# role).

“Do you know about CPU caches? And could you tell me the difference between the volatile and lock keywords?”

“I guess using volatile could be faster,” he said.

We discussed something else.

…

Don’t get me wrong, I was fine with the answer.

In this post, let’s just discuss it in more depth.

CPU-level cache

When we write multi-threaded code, we use locks or the synchronized keyword to ensure thread safety (atomic operations). Under the hood, they are built on a mutex or a semaphore.

If you are not familiar with semaphores, you can check this post; and if you want a general idea of the difference between threads, processes, and coroutines, you can check here.

Let’s move on and look at a race condition between threads.

At the CPU level, there are multiple levels of cache. Typically, L1 and L2 are private to each CPU core, while L3 is shared among the cores. Main memory is accessed only when all 3 levels miss.

As software (not hardware) developers, we don’t need many details; a rough idea of how the cache inconsistency problem happens is enough.

Here is how it happens:

Thread1 updates the variable ‘flag’ to 1.
At almost the same moment, thread2 reads the value and gets flag=0, because CPU2’s L1 or L2 cache is hit and still holds the old value.

As a result, thread2 continues its logic with the stale value ‘0’.

We already know that a lock can make it work, but locks are not cheap. What if we only have a boolean variable like ‘flag=true/false’? Is there a cheaper way to make sure this variable is “VISIBLE” to other threads? Yes, that’s where the ‘volatile’ keyword comes into play.
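A minimal Java sketch of the flag pattern, assuming one thread writes the flag and another only reads it. The class and field names (`StopFlagDemo`, `running`) are made up for illustration. Because `running` is `volatile`, the worker is guaranteed to eventually see the main thread’s write and leave its loop; without `volatile`, the JIT is allowed to hoist the read out of the loop and the worker may spin forever.

```java
import java.util.concurrent.TimeUnit;

class StopFlagDemo {
    // volatile: a write by one thread is guaranteed to become visible to other threads' reads
    private static volatile boolean running = true;

    // Starts a worker that spins until the flag is cleared; returns true if the worker exited.
    static boolean demo() {
        running = true;
        Thread worker = new Thread(() -> {
            long spins = 0;
            while (running) { // re-reads the volatile field on every iteration
                spins++;
            }
            System.out.println("worker stopped after " + spins + " spins");
        });
        worker.setDaemon(true); // never keep the JVM alive if something goes wrong
        worker.start();
        try {
            TimeUnit.MILLISECONDS.sleep(50); // let the worker spin for a moment
            running = false;                 // a plain write: no lock needed for a simple flag
            worker.join(2_000);              // wait up to 2 seconds for the worker to exit
        } catch (InterruptedException e) {
            running = false;
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + demo());
    }
}
```

Only one thread ever writes the flag here, which is exactly the case where visibility alone is enough.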

How volatile exactly works

The mechanism involves a couple of steps, but there are just 3 key points.

Cache coherence protocol, e.g. MESI. Different CPU architectures implement different coherence protocols. Once the volatile write reaches the cache, the CPU effectively “tells” the other CPUs that the cache line holding ‘flag’ has changed: their copies are invalidated, so their next read fetches the new value instead of the stale one.
Snooping (bus-based coherence). Snooping is the mechanism used by bus-based cache coherence protocols (e.g., MESI): each CPU monitors the shared bus for memory operations issued by the other CPUs. When one CPU writes to a memory location, the others detect it on the bus and invalidate or update their cached copies. This keeps a consistent view of memory across all CPUs.
Memory barrier. In languages like Java or C#, the JIT (just-in-time) compiler inserts the memory barriers the platform requires around volatile reads and writes. A barrier tells both the compiler and the CPU not to reorder memory operations across it, so a volatile read or write is not served from a thread-local copy such as a register, and other threads observe it in program order.
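To make the invalidation idea concrete, here is a toy, single-threaded Java model of the MESI states for one cache line (one shared variable) across a few cores. It rests on simplifying assumptions: each core caches only this one line, bus operations happen one at a time, and a modified line is written back to memory when another core snoops it. The names `MesiToy`, `Core` and `Bus` are hypothetical, and real hardware is far more involved.

```java
import java.util.ArrayList;
import java.util.List;

class MesiToy {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    // One core's private cache, holding a single line (the shared variable).
    static final class Core {
        final int id;
        State state = State.INVALID;
        int value;
        Core(int id) { this.id = id; }
    }

    // The "bus": every request by one core is snooped by all the others.
    static final class Bus {
        final List<Core> cores = new ArrayList<>();
        int memory; // value in main memory

        Bus(int coreCount, int initialValue) {
            for (int i = 0; i < coreCount; i++) cores.add(new Core(i));
            memory = initialValue;
        }

        int read(int coreId) {
            Core me = cores.get(coreId);
            if (me.state != State.INVALID) return me.value; // hit in M, E or S
            boolean othersHaveCopy = false;
            for (Core other : cores) {
                if (other == me || other.state == State.INVALID) continue;
                if (other.state == State.MODIFIED) memory = other.value; // write back dirty line
                other.state = State.SHARED; // a second reader means nobody is exclusive anymore
                othersHaveCopy = true;
            }
            me.value = memory;
            me.state = othersHaveCopy ? State.SHARED : State.EXCLUSIVE;
            return me.value;
        }

        void write(int coreId, int newValue) {
            Core me = cores.get(coreId);
            if (me.state == State.SHARED || me.state == State.INVALID) {
                // Broadcast an invalidation: every other copy of the line becomes stale.
                for (Core other : cores) {
                    if (other == me) continue;
                    if (other.state == State.MODIFIED) memory = other.value; // flush first
                    other.state = State.INVALID;
                }
            }
            // From E or M no bus traffic is needed: no other core holds a valid copy.
            me.value = newValue;
            me.state = State.MODIFIED; // dirty: memory is stale until a write-back
        }
    }

    public static void main(String[] args) {
        Bus bus = new Bus(2, 0);
        System.out.println("core1 reads " + bus.read(1)); // miss, loads 0 and becomes EXCLUSIVE
        bus.write(0, 1);                                   // core0 sets flag=1, core1 is invalidated
        System.out.println("core1 state " + bus.cores.get(1).state);
        System.out.println("core1 reads " + bus.read(1)); // miss, core0 writes back, both SHARED
        System.out.println("core0 state " + bus.cores.get(0).state);
    }
}
```

In this model every write to a shared line triggers an invalidation; what volatile adds at the language level is that the write is actually issued, in order, instead of lingering in a register.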

As you can see, this works quite differently from a lock, and it is the reason why volatile is safe for a simple boolean flag: it guarantees memory visibility, but not atomicity (a compound operation such as count++ still needs a lock).

So far so good?

Let’s move on to the system-level cache coherence problem.

Story #2: cache at the system level
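The visibility-versus-atomicity distinction is easy to demonstrate with a counter, which is closer to the candidate’s integer than a boolean flag is. A minimal sketch with hypothetical names: several threads increment a `volatile int` and a lock-protected `int` the same number of times. The volatile counter can lose updates because `++` is a read, an add, and a write; the locked counter cannot.

```java
import java.util.ArrayList;
import java.util.List;

class CounterDemo {
    static volatile int volatileCount = 0; // visible to all threads, but ++ is not atomic
    static int lockedCount = 0;            // only touched while holding LOCK
    static final Object LOCK = new Object();

    // Each of `threads` threads increments both counters `perThread` times.
    static int[] run(int threads, int perThread) {
        volatileCount = 0;
        lockedCount = 0;
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread w = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    volatileCount++;      // read, add 1, write back: increments can be lost
                    synchronized (LOCK) { // the whole read-modify-write runs as one unit
                        lockedCount++;
                    }
                }
            });
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // join also makes lockedCount's final value visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return new int[] { volatileCount, lockedCount };
    }

    public static void main(String[] args) {
        int[] r = run(4, 100_000);
        System.out.println("volatile counter: " + r[0] + " (can be less than 400000)");
        System.out.println("locked counter:   " + r[1]);
    }
}
```

`java.util.concurrent.atomic.AtomicInteger.incrementAndGet()` is a lock-free alternative for this particular case.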

“Here you mentioned that you increased system throughput 3–4 times by introducing a cache. That’s amazing. Could you elaborate?” I asked.

“Sure, it was xx cache. We used it to store yy,” he said.

“How do you make sure the cache and database values are in sync? And have you faced any issues with the cache being out of sync with the database?”

“We faced an issue where the cache was not updated. Then we switched to the write-through strategy, which solved it,” he said.

“Can write-through solve the cache coherence issue?” I asked.

“Yes. Initially we only deleted the cache without updating it, and we had issues, so we changed to write-through,” he said.

“Ok, what about this case ….” I described the problem exactly like below.

“Sorry, I think we solved the problem not with write-through, but with read-through,” he said.

“Then how about this case when using read-through, ….” I described the below case to him.

“I can’t remember exactly. Maybe deleting the cache and then updating the database, or updating the database and then deleting the cache, would solve the problem?” he said.

I explained to him that neither of these two orders solves the cache coherence issue.
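Both orders can be replayed step by step. The sketch below is a hypothetical, single-threaded replay of one unlucky interleaving for each order, with plain `HashMap`s standing in for the database and the cache (all names are made up). Real races need real concurrency, but the sequence of reads and writes is the same. The second order is less likely in practice, because the reader’s database read must land before the write and its cache refill after the delete.

```java
import java.util.HashMap;
import java.util.Map;

class CacheAsideRaces {
    // Plain maps standing in for the database and the cache.
    final Map<String, String> db = new HashMap<>();
    final Map<String, String> cache = new HashMap<>();

    CacheAsideRaces() {
        db.put("price", "v1");
    }

    // Order 1: the writer deletes the cache first, then updates the database.
    static CacheAsideRaces deleteThenUpdate() {
        CacheAsideRaces s = new CacheAsideRaces();
        s.cache.put("price", "v1");
        s.cache.remove("price");         // writer: delete cache
        String read = s.db.get("price"); // reader: cache miss, reads old "v1" from the DB
        s.cache.put("price", read);      // reader: refills the cache with "v1"
        s.db.put("price", "v2");         // writer: update DB to "v2"
        return s;
    }

    // Order 2: the writer updates the database first, then deletes the cache.
    static CacheAsideRaces updateThenDelete() {
        CacheAsideRaces s = new CacheAsideRaces(); // cache empty, e.g. the entry just expired
        String read = s.db.get("price"); // reader: cache miss, reads old "v1"
        s.db.put("price", "v2");         // writer: update DB
        s.cache.remove("price");         // writer: delete cache (nothing there yet)
        s.cache.put("price", read);      // reader: a slow refill writes the old "v1"
        return s;
    }

    // True when the cache holds a value that differs from the database.
    boolean stale() {
        return cache.containsKey("price") && !cache.get("price").equals(db.get("price"));
    }

    public static void main(String[] args) {
        System.out.println("delete then update, stale: " + deleteThenUpdate().stale());
        System.out.println("update then delete, stale: " + updateThenDelete().stale());
    }
}
```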

We discussed something else.

…

Let’s try some solutions.

A dirty patch

Whenever we delete the cache, we need to make sure a racing thread does not write an outdated value back into it. So what if:

We delete it again after X minutes (achieving eventual consistency).
X = the maximum number of minutes of out-of-sync data that can be tolerated.

But what if it is a finance system, and we want 0 minutes out of sync, meaning strong consistency?
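A minimal sketch of this “delete twice” patch, assuming a `ScheduledExecutorService` fires the second delete and `ConcurrentHashMap`s stand in for the database and the cache (class and method names are hypothetical). The `delayMs` parameter plays the role of X; it is milliseconds here only so the demo finishes quickly, and in practice it must be longer than the slowest read-and-refill that can race with the write.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DelayedDoubleDelete {
    final Map<String, String> db = new ConcurrentHashMap<>();
    final Map<String, String> cache = new ConcurrentHashMap<>();
    final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Update the DB, delete the cache now, and delete it once more after delayMs
    // to drop any stale value a racing reader wrote back in the meantime.
    void update(String key, String value, long delayMs) {
        db.put(key, value);
        cache.remove(key);
        scheduler.schedule(() -> { cache.remove(key); }, delayMs, TimeUnit.MILLISECONDS);
    }

    // Replays the race: a reader refills the old value right after the first delete.
    static boolean demo() {
        DelayedDoubleDelete s = new DelayedDoubleDelete();
        s.db.put("price", "v1");
        s.cache.put("price", "v1");
        s.update("price", "v2", 100);
        s.cache.put("price", "v1"); // racing reader writes back the value it read before the update
        boolean staleBefore = "v1".equals(s.cache.get("price"));
        s.scheduler.shutdown();     // by default, already-scheduled delayed tasks still run
        try {
            s.scheduler.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return staleBefore && !s.cache.containsKey("price");
    }

    public static void main(String[] args) {
        System.out.println("stale entry removed by second delete: " + demo());
    }
}
```

Between the two deletes the cache can still serve the stale value, which is why this only buys eventual consistency.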

Strong consistency

Then we have to sacrifice some performance and use a distributed lock. You can implement it with either a database table or Redis (cluster).

When accessing the cache, check for the lock record; if there is none, insert one (otherwise wait and try again).
Once done, remove the lock record.

This achieves strong consistency, but the lock can become a bottleneck for system throughput.
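A minimal sketch of the lock-record steps, assuming an in-memory `ConcurrentHashMap` stands in for the lock table or Redis. `putIfAbsent` plays the role of “if no record then insert” (like an INSERT on a unique key, or Redis `SET key value NX`), and the owner token ensures a client only removes its own record. All names are hypothetical, and a real distributed lock also needs an expiry so that a crashed holder cannot block everyone forever, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class LockRecordDemo {
    // Stand-ins: a lock table (key -> owner token), a database and a cache.
    static final Map<String, String> locks = new ConcurrentHashMap<>();
    static final Map<String, Integer> db = new ConcurrentHashMap<>();
    static final Map<String, Integer> cache = new ConcurrentHashMap<>();

    // "If no record then insert": putIfAbsent succeeds only when no lock record exists.
    static String tryLock(String lockKey) {
        String token = UUID.randomUUID().toString();
        return locks.putIfAbsent(lockKey, token) == null ? token : null;
    }

    // "Once done remove the lock record", but only if we still own it.
    static void unlock(String lockKey, String token) {
        locks.remove(lockKey, token);
    }

    // Runs body while holding the lock for key; spins until the lock is free.
    static <T> T withLock(String key, Supplier<T> body) {
        String lockKey = "lock:" + key;
        String token;
        while ((token = tryLock(lockKey)) == null) {
            Thread.onSpinWait(); // a real client would back off and give up after a timeout
        }
        try {
            return body.get();
        } finally {
            unlock(lockKey, token);
        }
    }

    // While the lock is held, no other writer can slip in between the DB and cache writes.
    static void increment(String key) {
        withLock(key, () -> {
            int next = db.getOrDefault(key, 0) + 1;
            db.put(key, next);
            cache.put(key, next);
            return null;
        });
    }

    static int[] run(int threads, int perThread) {
        db.clear();
        cache.clear();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread w = new Thread(() -> {
                for (int i = 0; i < perThread; i++) increment("balance");
            });
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return new int[] { db.get("balance"), cache.get("balance") };
    }

    public static void main(String[] args) {
        int[] r = run(4, 1_000);
        System.out.println("db=" + r[0] + " cache=" + r[1]);
    }
}
```

Every writer waits its turn on the same record, which is exactly where the throughput cost comes from.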

What if we want a bit better performance?

Then we need to be a bit more “optimistic”.

Versioning of cache item

When accessing a cache item:

Get the version of the cache item
Before updating (or deleting) it, check that the version is still the same; only then do the update.

Wait, there is a chance that right after the check, another thread updates the value very quickly. True, that can happen, though not always, and that’s why versioning cannot guarantee 100% strong consistency when the version check and the update are separate steps.

If you really need 100% consistency but still want a bit more throughput, you can combine versioning with the locking above: before updating, acquire the lock, update, then release the lock.
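A sketch of the version check, assuming an in-memory `ConcurrentHashMap` stands in for the cache (class and method names are hypothetical). Here the check and the write are fused into one atomic `replace(key, oldValue, newValue)` call, so a writer that slips in after the read makes the update fail instead of being overwritten. If your store cannot compare and write in one step, the window described next stays open; in Redis that role is typically played by `WATCH`/`MULTI`/`EXEC` or a Lua script, and in SQL by `UPDATE ... WHERE version = ?`.

```java
import java.util.concurrent.ConcurrentHashMap;

class VersionedCache {
    // Immutable cache entry: a value plus the version it was written at.
    static final class Entry {
        final long version;
        final String value;
        Entry(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();

    // Step 1: get the current version of the item (-1 if absent).
    long versionOf(String key) {
        Entry e = cache.get(key);
        return e == null ? -1 : e.version;
    }

    // Step 2: update only if the version is still the one we read.
    boolean updateIfVersion(String key, long expectedVersion, String newValue) {
        Entry current = cache.get(key);
        if (current == null || current.version != expectedVersion) {
            return false; // someone else changed it: re-read and decide again
        }
        // replace(key, old, new) compares and swaps atomically, so a writer that
        // got in after the get() above makes this call return false.
        return cache.replace(key, current, new Entry(expectedVersion + 1, newValue));
    }

    public static void main(String[] args) {
        VersionedCache c = new VersionedCache();
        c.cache.put("price", new Entry(1, "v1"));
        long seenByA = c.versionOf("price");
        long seenByB = c.versionOf("price");
        System.out.println("A updates: " + c.updateIfVersion("price", seenByA, "v2"));
        System.out.println("B updates: " + c.updateIfVersion("price", seenByB, "v3"));
        System.out.println("value now: " + c.cache.get("price").value);
    }
}
```

Writer B loses because it read version 1, which writer A has already bumped to 2; B must re-read and retry rather than overwrite A’s value.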

Multiple nodes

What if my cache is distributed, and when I delete an entry on one node, the other nodes need to be notified that their copies are invalid?

Remember how one CPU “tells” the other CPUs that their cached copies are outdated? Just use the same idea. At the CPU level, this is done by bus snooping with the MESI protocol; at the system level, we can use a message queue (or subscribe to the MySQL binlog) to do pub/sub.

What if a delete fails, say some nodes never acknowledge the message? Retry.
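A minimal in-process sketch of pub/sub invalidation. The hypothetical `InvalidationTopic` stands in for a message-queue topic or a binlog subscriber and simply delivers each message to every subscriber; each `Node` keeps a local cache and drops the key named in the message, much like a snooping CPU invalidating its copy of a cache line. All names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

class PubSubInvalidation {
    // In-process stand-in for a message-queue topic: every subscriber receives every message.
    static final class InvalidationTopic {
        private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();

        void subscribe(Consumer<String> subscriber) {
            subscribers.add(subscriber);
        }

        void publish(String key) {
            for (Consumer<String> s : subscribers) s.accept(key);
        }
    }

    // One application node with its own local cache, listening for invalidations.
    static final class Node {
        final Map<String, String> localCache = new ConcurrentHashMap<>();

        Node(InvalidationTopic topic) {
            topic.subscribe(localCache::remove); // message = key to drop from this node's cache
        }
    }

    // Every node caches "price"; a single publish invalidates all copies.
    static boolean demo(int nodeCount) {
        InvalidationTopic topic = new InvalidationTopic();
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            Node n = new Node(topic);
            n.localCache.put("price", "v1");
            nodes.add(n);
        }
        topic.publish("price"); // e.g. sent after the DB update commits
        for (Node n : nodes) {
            if (n.localCache.containsKey("price")) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("all nodes invalidated: " + demo(3));
    }
}
```

With a real broker, delivery is asynchronous and can fail per node, which is where the retry below comes in.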
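A minimal retry helper with exponential backoff, as a sketch. The `op` argument is a hypothetical stand-in for “send the invalidation and wait for the ack”, returning `true` once acknowledged. Retrying is safe here because deleting a cache key is idempotent: deleting it twice leaves the same state as deleting it once.

```java
import java.util.function.BooleanSupplier;

class RetryDemo {
    // Calls op until it returns true or maxAttempts is reached,
    // doubling the pause between attempts (exponential backoff).
    static boolean retry(BooleanSupplier op, int maxAttempts, long initialBackoffMs) {
        long backoff = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (op.getAsBoolean()) return true;
            if (attempt == maxAttempts) break; // no point sleeping after the last try
            try {
                Thread.sleep(backoff);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            backoff *= 2;
        }
        return false; // caller decides what to do next, e.g. alert or park the message
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical node that fails to ack the first two invalidation messages.
        boolean acked = retry(() -> ++calls[0] >= 3, 5, 10);
        System.out.println("acked=" + acked + " after " + calls[0] + " attempts");
    }
}
```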

Recap

Nothing is free, including caching.

The cache coherence issue is everywhere: at the CPU level (locks, volatile) and at the database level (all kinds of locks, MVCC).

When you are asked about cache consistency issues during an interview, remember 5 words and 2 sentences.

“double” invalidate
Lock
Versioning
Pub/Sub
Retry

Strong consistency: lock; better performance: versioning (does not guarantee 100% consistency). Balancing performance and data consistency: combine both.

That’s all. Hope you enjoyed it. See you in the next post!