On DevOps — 24. Deployment Strategy: In-Place, All-at-Once, Rolling Update, Blue/Green, and Canary.

In the previous two chapters, we discussed how a deployment is triggered: continuous deployment and GitOps.
Now that we have some mechanism that can trigger deployment, let’s look at how exactly a new version is rolled out.
0. The Good Old Days
Let me paint a picture and see if this resonates with you:
Not that long ago (about six years), I was working at a web-based start-up. The infrastructure was what we would now call “hybrid,” since we had resources both in AWS and on-premises.
Back then, we hadn’t adopted a microservice architecture yet; the only thing we had was a Python monolith running on two virtual machines (one was “production,” the other served as a backup for disaster recovery) fronted by a load balancer. Of course, the actual architecture was more complicated, but this was the core of the whole website.
It was a “single-point-of-failure” type of architecture, i.e., only one instance of the app was serving traffic. We did have a backup environment, though, so every deployment had to be repeated on the backup instance, too.
When deployment was needed, what we did was simple:
- Log in to the prod and the backup VMs.
- Do a “git pull” to get the specific version that had already been deployed and tested in our staging environment.
- Restart the web server, which would cause a little downtime.
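The steps above can be sketched as a small script that only builds the commands, so they can be reviewed before anything runs. This is a minimal sketch, not the procedure we actually used: the host names, the `/srv/app` directory, the `webapp` service name, and the use of `systemctl` are all assumptions, and `git fetch` plus `git checkout` is swapped in for the plain “git pull” so that a specific version is pinned.

```python
import shlex


def in_place_deploy_commands(hosts, version, app_dir="/srv/app", service="webapp"):
    """Return one ssh command per host that updates the app where it already runs.

    All names here (hosts, app_dir, service) are hypothetical placeholders.
    """
    commands = []
    for host in hosts:
        remote_steps = [
            f"cd {shlex.quote(app_dir)}",
            "git fetch --tags",                               # get the new code
            f"git checkout {shlex.quote(version)}",           # pin the tested version
            f"sudo systemctl restart {shlex.quote(service)}", # brief downtime here
        ]
        commands.append(["ssh", host, " && ".join(remote_steps)])
    return commands


plan = in_place_deploy_commands(["prod-vm", "backup-vm"], "v1.4.2")
for command in plan:
    # Print instead of executing; pass each list to subprocess.run to run it.
    print(shlex.join(command))
```

Note that nothing new is created here: each machine is modified where it stands, which is exactly what makes this approach simple and, later, hard to reproduce.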
In an ideal world, in a traditional (and highly recommended) architecture, a proxy/load balancer (like Nginx) forwards requests to more than one instance of the application. In that case, the same deployment procedure has to be repeated on all the other machines as well, manually or automatically.
1. In-Place Update
The story above is precisely an “in-place” deployment strategy.
During the deployment process, no new infrastructure or other stuff is “created”; instead, the app is “updated” where it had already been running before the deployment.
Maybe you don’t use Python. If you run a C program, for example, deploying means stopping the running process, putting the new binary in place, and starting it again. You get the gist; it’s the same idea: the old version of the app is updated “in place,” right where it runs.
1.1 Advantages and Disadvantages of In-Place
The upside of this strategy is quite obvious: it’s simple. That simplicity is also its downside. If your app depends on other things, such as a specific version of a library or a DB schema change that must be deployed together with the new version, an in-place update gives you no easy way to manage them.
And what if the new version doesn’t work? After all, it has only been tested in the “staging” environment, and there might be differences between staging and production. When things go wrong (and they will; it’s not a question of if, but when), it can be quite hard to revert to the previous working version.
The list of downsides actually continues to grow over time.
Let’s say the environment has been up and running for years, and now that your business has grown, you need to deploy the same thing in another data center (or cloud). How exactly do you deploy everything on a completely new virtual machine? What are the dependencies? In what order do the steps have to happen?
Or let’s say the developers want to set up a local dev environment for testing. How exactly do they do that? With the same tool used for production? Manually? Some other way?
When the business is small and you want to deploy something quickly, in-place might be the right strategy for you; but over time, you will outgrow it and need something better.
2. Blue/Green Deployment
The blue/green deployment is a strategy designed to solve some of the challenges of the in-place strategy.
As the name hints, you have not one but two environments. Your old version of the app is running in the Blue environment; when releasing a new version, instead of updating it in place in the Blue environment, you deploy the new version to the Green environment. At this point, only Blue is serving traffic to your users.
After the new version is deployed in the Green environment, you have plenty of time to test it internally to make sure it fully functions as intended. Only then do you switch (most likely at the load balancer or DNS level) to make the Green environment publicly available to your customers and take the original Blue out of service.
Next time, you deploy the new version to Blue and do the switch again.
If things go wrong, you can at least switch back quickly without worrying about dependencies and the like.
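To make the switch-and-switch-back idea concrete, here is a minimal in-memory sketch. The `BlueGreenRouter` class and its method names are hypothetical; in a real setup the “switch” would be a load balancer or DNS change rather than a Python attribute.

```python
class BlueGreenRouter:
    """Toy model of two environments with exactly one of them live."""

    def __init__(self, live="blue", live_version=None):
        self.versions = {"blue": None, "green": None}
        self.live = live
        self.versions[live] = live_version

    @property
    def idle(self):
        # The environment that is NOT serving users right now.
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version):
        # The new version goes to the idle side; users are unaffected.
        self.versions[self.idle] = version
        return self.idle

    def switch(self):
        # Flip traffic to the other side. Calling it again is the rollback.
        if self.versions[self.idle] is None:
            raise RuntimeError("idle environment has nothing deployed")
        self.live = self.idle

    def serving(self):
        return self.versions[self.live]


router = BlueGreenRouter(live="blue", live_version="v1")
router.deploy_to_idle("v2")   # Green now holds v2, Blue still serves v1
router.switch()               # Green goes live
router.switch()               # something went wrong: back to Blue and v1
```

The rollback is cheap because the old environment is left untouched until the next release overwrites it.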
2.1 Advantages of Blue/Green
The advantage is twofold: first, you can thoroughly test the new version before it is publicly available, and second, you can quickly switch back to the original version should something go wrong.
This is already better than the in-place strategy, but it comes with overhead: you need two environments instead of one. It also brings new challenges. For example, do you put your DB into the scope of Blue/Green as well? If not, how do you make sure that, after a switch back, the DB schema already changed by the new version is still backward compatible with the old version? These are technical challenges that can be solved, but some extra effort is required.
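One way to reason about the schema question is a pre-switch check: list the columns the old version still reads and verify they all survive the migration. The sketch below is only an illustration under that assumption; the function name and the dictionaries of table and column names are hypothetical and not tied to any migration tool.

```python
def missing_for_old_version(columns_old_app_needs, columns_after_migration):
    """Return {table: [columns]} that the old version needs but the new schema lacks.

    An empty result means switching back to the old version should still work
    as far as column presence is concerned (types and constraints are not checked).
    """
    problems = {}
    for table, needed in columns_old_app_needs.items():
        missing = needed - columns_after_migration.get(table, set())
        if missing:
            problems[table] = sorted(missing)
    return problems


old_needs = {"users": {"id", "name", "email"}}

# A rename drops "name", so the old version would break after a switch back.
renamed = {"users": {"id", "full_name", "email"}}

# An additive change keeps "name" and adds "full_name" next to it.
additive = {"users": {"id", "name", "full_name", "email"}}

print(missing_for_old_version(old_needs, renamed))
print(missing_for_old_version(old_needs, additive))
```

Additive changes of this kind are what keep a switch back possible; removing the old column can wait until the old version is retired for good.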
2.2 Challenges of Blue/Green
However, there is still one issue not resolved yet: be it in-place or Blue/Green, we need to manage dependencies and libraries, and if I know anything about dependency management, it’s that it’s a tough job.
When we need a new environment, local testing, or another cloud account, how do we ensure the new environment is precisely the same as the old one? What if we forget to change one environment when we deploy? What if there is a bug in production, but we can’t reproduce it locally because we don’t even know for sure whether the two environments are exactly the same? Configuration management could solve this (partially), but to ensure you can always create the same environment whenever you want, the simplest solution is never to upgrade your environment. Sounds silly, right? But no. If you never update it, it will always be the same:
3. Immutable Infrastructure
This is the opposite of the in-place strategy because, as the name suggests, you don’t update anything in-place; instead, you delete the old thing and spin up a new thing to replace it.
If you never change your old environment and instead create a new one from scratch (with automation, so that it can be reproduced again and again with exactly the same results), you never risk any difference between your environments.
3.1 Immutable Infrastructure with Docker
Docker has become more and more popular partly because a container is much more lightweight than a VM. What does a virtual machine virtualize? The hardware. Each VM has its own operating system (guest OS), and it’s the hypervisor’s job to present virtual hardware to that guest OS. A Docker container, on the other hand, virtualizes at the OS level, not the hardware level. A container doesn’t run its own guest kernel; it packages only the application with its libraries and dependencies and shares the host’s kernel. The Docker daemon’s job is to build images and to start and manage containers. That’s why a container is more lightweight and starts faster.
We don’t change a Docker image. If we need to change something, like deploying a new version of an app or changing some dependencies, we simply build a new image and start a new container from that image. This is immutable infrastructure: we never change anything in-place; we only destroy the old ones and create new ones.
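As a rough sketch of “build a new image, destroy the old container, start a new one,” the function below returns the three Docker CLI commands involved. The container name `web`, the repository `myapp`, and the version tags are made-up placeholders; the commands are only built, not executed.

```python
def immutable_redeploy_commands(container, repository, version, build_context="."):
    """Return the docker commands that replace a container instead of patching it."""
    image = f"{repository}:{version}"
    return [
        # 1. Build a brand-new image for this version.
        ["docker", "build", "-t", image, build_context],
        # 2. Throw away the old container (stops it first if it is running).
        ["docker", "rm", "-f", container],
        # 3. Start a fresh container from the new image.
        ["docker", "run", "-d", "--name", container, image],
    ]


for command in immutable_redeploy_commands("web", "myapp", "v2"):
    print(" ".join(command))
```

What is deliberately absent is any step that logs in to the running container and edits it; every change goes through a new image.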
3.2 Immutable Infrastructure with VMs
If you use virtual machines rather than Docker, you can still do immutable infrastructure. You can define the VM with a configuration management tool, install everything needed, and save the result as a base image. When launching a new instance of the app or creating a new environment, you simply start a new VM from the base image, and it works out of the box with no extra effort.
When you need to do a new deployment or add/change anything to that VM, you don’t. Instead, you update your configuration management code, build a new VM, save it as a new base image, and launch the app from the new base image.
3.3 Pets vs. Cattle
There is an excellent analogy for mutable and immutable infrastructure: pets versus cattle.
In-place means treating your servers as pets: you care about each of them, you give them names, they are all different from each other, and when they die, you cry.
Cattle, on the other hand, are all the same. You don’t bother yourself by naming them and giving them extraordinary care; they are all identical to each other. If you lose one, you simply buy another one to replace it.
When you treat your infra like pets, you have operational overhead, and things go wrong more easily. If you treat your infra like cattle, you save yourself the manual operational overhead and achieve consistency across all environments.
4. All-at-Once
The deployment process described in the opening story is also an “all-at-once” strategy: we had only two instances and triggered the deployment on both of them simultaneously. That is, we deployed everything “all at once,” hence the name of this deployment strategy.
The downside is obvious: it’s not the “right” thing to do, because if the deployment fails somehow, every instance is affected at the same time, and we have to revert everything. There goes your SLA.
But don’t think that you should never do this. It depends. It always does.
For many businesses, 24/7 uptime isn’t necessary. Even less than half of that should suffice. Why bother achieving high availability when the business doesn’t require it?
5. Rolling Update
In the traditional and highly recommended architecture (you probably have it already), there is a load balancer with multiple instances of the application behind it.
In this case, you don’t necessarily have to deploy everything at once. Instead of all-at-once, how about removing one old instance first, then adding a new instance? See if it works fine. If yes, continue. If not, revert. That is, you are updating your instances in a “rolling” fashion, hence the name “rolling update.”
If you have 10 instances, changing one at a time means a failed deployment costs you 10% of your capacity instead of 100% as in all-at-once.
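The “update one, check, continue or revert” loop and the capacity arithmetic can be sketched in a few lines. This is a toy simulation under stated assumptions: instances are just a dictionary of name to version, `is_healthy` stands in for whatever health check you have, and a failure reverts every instance touched so far.

```python
def rolling_update(instances, new_version, is_healthy, batch_size=1):
    """Update instances batch by batch; revert everything on the first failure."""
    original = dict(instances)
    names = list(instances)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for name in batch:
            instances[name] = new_version
        if not all(is_healthy(name, new_version) for name in batch):
            instances.update(original)  # put the old version back everywhere
            return False
    return True


def capacity_at_risk(total_instances, batch_size):
    """Fraction of capacity exposed to a bad release at any one moment."""
    return batch_size / total_instances


fleet = {f"app-{i}": "v1" for i in range(10)}
ok = rolling_update(fleet, "v2", is_healthy=lambda name, version: True)

print(capacity_at_risk(10, 1))    # one at a time
print(capacity_at_risk(10, 10))   # batch of 10 is the all-at-once case
```

Setting `batch_size` to the fleet size turns the same loop into all-at-once, which shows that the two strategies differ only in how much is changed before the first check.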
Given this great advantage, should you use rolling update all the time? Well, yes, and no. Yes, because you could use this strategy whenever possible to reduce the impact radius and deploy without any downtime; no, because the rolling update isn’t a silver bullet; it can’t solve everything.
Say your old app version isn’t compatible with your new version, because of a non-backward-compatible API change or DB schema change. In that case, it’s simply not possible to have both versions running in parallel. This is the main downside of the rolling update.
6. Canary Release
Previously we mentioned Blue/Green, a strategy that allows you to fully test the new version internally before it goes public.
What if an internal test isn’t enough? What if I want a small portion of my real customers to try the new version, maybe collect some feedback from them, and only then decide whether to roll the update out to all customers? Like Blue/Green, but with both environments public: one is used by 99% of the users, and the other is available to only 1% of them.
Well, this is the “canary release.” The strategy gets its name from an old coal mining practice: miners carried caged canaries into the mines because the birds react to toxic gases such as carbon monoxide well before humans do, which gave the miners an early warning. In a canary release, you configure the percentage and the rules that decide whether a user reaches the “Blue” or the “Green” version.
Why is this necessary? Because some decisions are one-way doors: once you make that decision, it would be tough to revert it. When making these kinds of decisions, you need to think twice (or three times) before acting. Maybe you want to add a new feature, but you can’t be really sure until real customers have tried it. The canary release enables you to test your new version publicly with a small batch of users before rolling it out to everyone.
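A common way to implement the “percentage and rules” part is to hash a stable user identifier into one of 100 buckets, so the same user always lands on the same side. The sketch below assumes string user IDs and uses the labels “stable” and “canary” in place of Blue and Green; both function names are made up for illustration.

```python
import hashlib


def canary_bucket(user_id):
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id, canary_percent):
    """Send roughly canary_percent% of users to the new version."""
    return "canary" if canary_bucket(user_id) < canary_percent else "stable"


for user in ["alice", "bob", "carol"]:
    print(user, route(user, canary_percent=1))
```

Because the bucket depends only on the user ID, raising the percentage later keeps every existing canary user on the canary and simply adds more users to it.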
7. Summary
It’s probably not possible to cover all the strategies in this article, but we have more or less covered the important ones. In real-world scenarios, it’s not uncommon to combine them to achieve what you want, such as a rolling update with immutable infrastructure, or an automated canary whose traffic percentage is ramped up until the new version handles 100% of the traffic.
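The automated canary with a growing percentage can be sketched as a loop over traffic steps that aborts when a measured signal looks bad. Everything specific here is an assumption for illustration: the step list, the 1% error-rate threshold, and the `error_rate_at` callback that stands in for your real monitoring.

```python
def progressive_rollout(steps, error_rate_at, max_error_rate=0.01):
    """Raise canary traffic step by step; fall back to 0% on a bad step.

    Returns (final_percent, history), where history lists (percent, error_rate).
    """
    history = []
    current = 0
    for percent in steps:
        rate = error_rate_at(percent)   # observe the canary at this traffic level
        history.append((percent, rate))
        if rate > max_error_rate:
            return 0, history           # abort: all traffic back to the old version
        current = percent
    return current, history


final, seen = progressive_rollout(
    steps=[1, 10, 50, 100],
    error_rate_at=lambda percent: 0.001,  # pretend the canary stays healthy
)
print(final, seen)
```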





