Netflix Technology Blog

Project Nimble: Region Evacuation Reimagined

We are proud to present Nimble: the evolution of the Netflix failover architecture that makes region evacuation an order of magnitude faster. At Netflix, our goal is to be there for our customers whenever they want to come and watch their favorite shows. A lot of the work we do centers around making our systems ever more available, and averting or limiting customer-facing outages. One of the most important tools in our toolbox is to route traffic away from an AWS region that is unhealthy. Because Netflix continues to grow quickly, we are now at a point where even short or partial outages affect many of our customers. So it’s critical that we are able to route traffic away from a region quickly when needed. This article describes how we re-imagined region failover from what used to take close to an hour to less than 10 minutes, all while remaining cost neutral.

The history of region evacuation at Netflix is captured in three prior articles. While traffic failovers have been an important tool at our disposal for some time, Nimble takes us to the next level by optimizing the way in which we use existing capacity to migrate traffic. As part of our project requirements, we wanted minimal changes to core infrastructure, no disruptions to work schedules, and no onerous maintenance requirements dropped on other engineering teams at the company.

Failovers Took Too Long

When we set out on this journey, we began by breaking down the roughly 50 minutes a traffic failover took at the time:

  • 5 minutes to decide whether we would push the failover button or not. A failover took time and performed an enormous number of AWS EC2 mutations that could potentially confuse the state of a healthy region, so it was not only a slow path to take but also a somewhat risky one.
  • 3–5 minutes to provision resources from AWS. This included predicting necessary scale-up and then scaling destination regions to absorb the traffic. This was nontrivial: Netflix services autoscale following diurnal patterns of traffic. Our clusters were (and are) not overprovisioned to the point where they could absorb the additional traffic they would see if we failed traffic from another region to them. As a result, failover needed to include a step of computing how much capacity was required for each of the services in our ecosystem.
  • 25 minutes for our services to start up. Boot an AWS instance, launch a service, download resources required to operate, make backend connections, register with Eureka, apply any runtime property changes specified through our Archaius configuration management, register with AWS ELBs… our instances have it tough, and we could only do so much to coax them into starting faster under threat of receiving traffic from other regions.
  • 10 minutes or more to proxy our traffic to destination regions. To compensate for DNS TTL delays, we used our Zuul proxies to migrate traffic via back-end tunnels between regions. This approach also allowed us to gauge the “readiness” of a region to take traffic, because instances generally need some time to reach optimal operation (e.g., via JITing). We found that we needed to move traffic in increments to give new instances a chance to absorb the new traffic well; a sketch of this kind of stepped ramp follows the list.
  • 5 minutes to cut over DNS. While the calls to repoint our DNS entries completed within seconds, our DNS TTLs generally meant that the bulk of our devices would move within about 5 minutes. We considered a failover complete when the vast majority of devices had moved to using the new DNS entries and were being served by a destination region.
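
To make the incremental shift concrete, here is a minimal sketch in Python of the kind of stepped ramp described above. The real mechanism drove Zuul routing rules and read health signals from our monitoring stack; the function names, step values, and settle times below are illustrative assumptions, not the production implementation.

```python
import time

# Hypothetical hooks; the real system adjusted Zuul routing rules and
# checked error rates and latency in the destination region.
def set_cross_region_weight(percent: int) -> None:
    print(f"routing {percent}% of origin-region traffic to the destination region")

def destination_region_healthy() -> bool:
    return True  # stand-in for real error-rate / latency checks

def ramp_traffic(steps=(10, 25, 50, 75, 100), settle_seconds=60) -> None:
    """Shift traffic in increments, giving new instances time to warm up."""
    for percent in steps:
        set_cross_region_weight(percent)
        time.sleep(settle_seconds)  # let JIT, caches, and connections warm up
        if not destination_region_healthy():
            raise RuntimeError(f"aborting ramp at {percent}%: destination unhealthy")

if __name__ == "__main__":
    ramp_traffic(settle_seconds=1)  # short settle time for demonstration only
```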

All of the above steps add up to about 50 minutes, which we considered unacceptably long. Remember that we are operating at a scale of 117+M customers who, together, watch 140 million hours of content every day! 50 minutes of a broken experience impacts a lot of people. What we needed was something that was much faster.

How to Make Failing Fast Fast

We set ourselves an aggressive goal of being able to fail over traffic in less than 10 minutes. In order to hit that kind of speed, we needed to eliminate the long poles. We needed services to start up instantly and be ready to take traffic without a warm-up period. If we could meet that requirement, regional failover would consist purely of flipping DNS records and letting the network move users over.

First Iteration — Why Not Pin High?

Our services maintain homeostatic balance using autoscaling policies. Each service runs as an ASG (Auto Scaling Group). If a service is CPU-bound, for instance, we may choose to add more instances to its group when the average CPU usage crosses a threshold, and remove some instances when average CPU usage drops below another, lower threshold. After years of operating in the cloud, this is a mechanism Netflix’s dev teams are operationally familiar with, and one that is well understood during both normal and crisis operations.
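
As a concrete illustration, the sketch below uses raw boto3 calls to express the threshold-based CPU policy described above: scale out when average CPU crosses an upper bound, scale in when it drops below a lower one. Netflix manages its policies through its own tooling; the ASG name, thresholds, and cooldowns here are made up.

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

ASG = "myservice-v042"  # illustrative ASG name

def cpu_policy(name: str, adjustment: int) -> str:
    """Create a simple scaling policy that adds or removes instances."""
    resp = autoscaling.put_scaling_policy(
        AutoScalingGroupName=ASG,
        PolicyName=name,
        PolicyType="SimpleScaling",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=adjustment,
        Cooldown=300,
    )
    return resp["PolicyARN"]

def cpu_alarm(name: str, threshold: float, comparison: str, policy_arn: str) -> None:
    """Trigger the given policy when the ASG's average CPU crosses a threshold."""
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Statistic="Average",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG}],
        Period=300,
        EvaluationPeriods=2,
        Threshold=threshold,
        ComparisonOperator=comparison,
        AlarmActions=[policy_arn],
    )

# Scale out above 60% average CPU, scale in below 30% (illustrative values).
cpu_alarm("cpu-high", 60.0, "GreaterThanThreshold", cpu_policy("scale-out", +1))
cpu_alarm("cpu-low", 30.0, "LessThanThreshold", cpu_policy("scale-in", -1))
```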

If we attempted to modify groups to run “cold” by pre-calculating needed capacity to absorb a failover from another region, we would need to make significant changes. Either we would need to change the signals that teams used for autoscaling into something centralized, giving their services instructions divorced from their normal operation, or we would need to alter every autoscaling policy with some kind of linear (or worse) transformation to take into account failover absorption needs. The idea of opening targeted consultations on each of the hundreds of scaling policies at Netflix did not seem like a winning strategy.

We also considered simply abandoning autoscaling altogether and pinning to a calculated value, but this would hide performance regressions in the code by absorbing them into a potentially enormous buffer intended for regional evacuation absorption. We would need to come up with some automated way to frequently calculate a desired service size given incoming RPS and scale the buffer based on this metric, but no such mechanism was, as yet, available.

We needed to come up with something more clever.

The Right Tool — Dark Capacity

Having capacity ready to take traffic seemed like the right solution, but adding it to active services on the front line would add an operational burden that we didn’t want to incur. How, then, would we keep spare instances at the ready without affecting production? Our solution essentially delivers the benefits of extra capacity without the distributed burden of operating it.

Netflix bakes AMIs rather than having configuration management prepare instances from a base after launch. We realized that we could keep instances hidden away in shadow ASGs that would act as “dark” groups topping off capacity for the services they were shadowing. We would have to ensure total isolation of these groups from the streaming and metrics-analysis paths until they were activated, and we would need a mechanism to add them to running services on demand, so that on failover it would look as though we had provisioned new instances the moment they were called upon.

Our setup is based on the relatively unknown detach and attach instance mechanisms that AWS provides for EC2 ASGs. Essentially, we can pluck an instance from the dark autoscaling group and push it into the ether, then make a subsequent EC2 API call to pop it into a running service group. We created an orchestrator that detaches instances, keeps track of them, and attaches them to their intended destinations.
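
A minimal boto3 sketch of that detach-and-attach flow, assuming a shadow group named with a “-dark” marker; the real orchestrator adds bookkeeping, batching of API calls, and failure handling, and the ASG names here are hypothetical.

```python
import boto3

autoscaling = boto3.client("autoscaling")

def instances_in(asg_name: str) -> list[str]:
    """Return the instance ids currently registered to an ASG."""
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    return [i["InstanceId"] for g in groups for i in g["Instances"]]

def promote_dark_instances(dark_asg: str, live_asg: str, count: int) -> list[str]:
    """Move pre-warmed instances from a shadow ASG into the live service ASG."""
    candidates = instances_in(dark_asg)[:count]
    if not candidates:
        return []
    # Detach without replacement so the dark group does not relaunch them.
    autoscaling.detach_instances(
        InstanceIds=candidates,
        AutoScalingGroupName=dark_asg,
        ShouldDecrementDesiredCapacity=True,
    )
    # Attaching bumps the live ASG's desired capacity to cover the new members.
    autoscaling.attach_instances(
        InstanceIds=candidates,
        AutoScalingGroupName=live_asg,
    )
    return candidates

# e.g. promote_dark_instances("myservice-dark-v003", "myservice-v042", count=20)
```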

It was straightforward to test the detach and attach mechanism with a single ASG, but for a production environment incorporating many ASGs, we would need a much better mechanism to track active services, follow “red/black” pushes, and clone configurations from production services. We leveraged our Netflix Edda and Spinnaker cloud management and deployment tools and their APIs to track changes to frontline ASGs and clone them into dark autoscaling clusters with identical launch configurations automatically when deploy pipelines, rollbacks, or other operations happen.

We also needed to predict how many dark instances we need for each service. If we have too few instances for a service, this service may easily get overwhelmed, which can then have negative downstream and upstream effects on other services and eventually on our customers. For this reason, it’s important that we get this prediction right for each and every one of our services. At a high level, the calculation is based on time of day and how much traffic we expect each service in a specific region to see if we were to do a failover.
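
At its simplest, that sizing calculation might look like the sketch below: given a forecast of the extra traffic a region would absorb in a failover and a per-instance throughput figure for each service, compute how many dark instances to keep warm right now. The service names, RPS figures, and headroom factor are invented; the real prediction is driven by Netflix’s traffic forecasting data.

```python
import math
from datetime import datetime, timezone

# Illustrative per-service profile: how many requests per second one instance
# can absorb, and an hourly forecast of the extra RPS this region would take
# on if we evacuated a neighboring region at that hour.
SERVICE_PROFILES = {
    "api":      {"rps_per_instance": 400, "failover_rps_by_hour": [90_000] * 24},
    "playback": {"rps_per_instance": 250, "failover_rps_by_hour": [40_000] * 24},
}

def dark_instances_needed(service: str, hour: int, headroom: float = 0.15) -> int:
    """How many pre-warmed dark instances to keep for a service at this hour."""
    profile = SERVICE_PROFILES[service]
    extra_rps = profile["failover_rps_by_hour"][hour]
    # Pad slightly so a single under-provisioned service cannot cascade failures.
    return math.ceil(extra_rps * (1 + headroom) / profile["rps_per_instance"])

hour = datetime.now(timezone.utc).hour
for svc in SERVICE_PROFILES:
    print(svc, dark_instances_needed(svc, hour))
```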

Nearly all Netflix production applications inherit from a base AMI. The base AMI provides a well-known Netflix environment featuring consistent packages and system configuration as well as kernel tuning and a pre-populated set of environment variables. It also autodetects the ASG that a service is in and sets a number of variables corresponding to this — variables we needed to match to the parent service we were shadowing. The base AMI team helped us interject early in the boot process, making sure that all dark instances match the system environment they are shadowing and are blissfully unaware of their actual location.
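
The sketch below only illustrates the idea of masking a dark instance’s identity at boot so that it inherits its parent service’s environment. The variable names and the “-dark-” naming convention are assumptions for illustration, not the actual base AMI contract.

```python
# Purely illustrative: variable names and the "-dark-" naming convention are
# assumptions, not the real base AMI hook.
DARK_MARKER = "-dark-"

def mask_shadow_identity(env: dict) -> dict:
    """Make a dark instance's environment look like its parent service's."""
    asg = env.get("NETFLIX_AUTO_SCALE_GROUP", "")
    if DARK_MARKER not in asg:
        return env                                # normal instance: leave untouched
    parent_cluster = asg.split(DARK_MARKER, 1)[0]
    masked = dict(env)
    masked["NETFLIX_CLUSTER"] = parent_cluster
    masked["NETFLIX_APP"] = parent_cluster.split("-", 1)[0]
    return masked

print(mask_shadow_identity({"NETFLIX_AUTO_SCALE_GROUP": "myservice-dark-v003"}))
# -> {'NETFLIX_AUTO_SCALE_GROUP': 'myservice-dark-v003',
#     'NETFLIX_CLUSTER': 'myservice', 'NETFLIX_APP': 'myservice'}
```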

Now we had a mechanism to create dark autoscaling groups and move their instances into production. We had to keep these instances out of the traffic path; otherwise we’d just have created a very elaborate mechanism for pinning services high, which we were trying to avoid.

Netflix uses two primary mechanisms to send traffic to instances — Amazon’s Elastic Load Balancers (ELBs) and our internal Ribbon system. ELBs attach to one or more ASGs, and because dark instances existed in a separate shadow group rather than a service’s main ASG, the service’s ELBs never saw the dark instances, so this traffic path was effectively disabled. To keep Ribbon traffic away from dark capacity, our Runtime team helped us devise a library (included in all of our services through our common platform) that prevents dark capacity from registering as UP with Eureka (our service discovery system), gating those instances in a STARTING state. In this mode, Ribbon never saw the instances as ready to take traffic; they would come up at the ready but wait for our signal before registering UP with Eureka.

Finally, even when not munching on customer traffic, our instances produce an incredible amount of metrics about their functioning. We needed the instances to be silent, so we enlisted the Insight Team’s help to disable Atlas reporting while instances were held at our STARTING gate. This was the final piece of the puzzle: until the transition to a functioning UP status, dark instances were not registered to take traffic and reported no metrics, but they had in fact gone through their entire startup procedure and were ready to take on traffic at the flip of a switch. So we built the switch.
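
A toy Python sketch of that gate, purely to illustrate the behavior: dark instances report STARTING to discovery and keep metrics off until the switch is flipped. The real library is part of the Java platform and is wired into the Eureka and Atlas clients; the class and method names here are invented.

```python
import threading

class DarkCapacityGate:
    """Illustrative stand-in for the platform library described above: hold
    dark instances in a STARTING state (invisible to Ribbon, silent in Atlas)
    until the failover orchestrator flips the switch."""

    def __init__(self, is_dark: bool):
        self.is_dark = is_dark
        self._released = threading.Event()
        if not is_dark:
            self._released.set()        # normal instances are never gated

    def discovery_status(self) -> str:
        # Status reported to service discovery; Ribbon only routes to UP instances.
        return "UP" if self._released.is_set() else "STARTING"

    def metrics_enabled(self) -> bool:
        # Metrics publishing stays off while the instance is held at the gate.
        return self._released.is_set()

    def release(self) -> None:
        """The 'switch': called on failover to bring dark capacity online."""
        self._released.set()

gate = DarkCapacityGate(is_dark=True)
assert gate.discovery_status() == "STARTING" and not gate.metrics_enabled()
gate.release()
assert gate.discovery_status() == "UP" and gate.metrics_enabled()
```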

How It Turned Out and Where We’re Going Next

We indeed reached our goal of sub-10-minute failovers. We can now complete the operation in 8 minutes, as opposed to the 50 minutes it used to take. Rolling out all the changes above and the software to orchestrate all of this took a team of two approximately six months. The project timeline is a great example of how a small team can make a big difference fast at Netflix.

For such a wide-ranging and impactful project, one that touches all of our control-plane services in every AWS region we operate out of, Nimble’s simplicity is what allows it to scale. As long as teams have no cross-regional dependencies, Nimble “just works” for them with no extra effort, no special integration, and no special code to enable. At this time, all streaming-path services are enabled for Nimble by default.

The lack of a heavy maintenance burden allows the Traffic Team to focus on innovation and on the future of Nimble, some of which can already be seen in how we tackle regional disruptions. We’re looking to explore new ways to use Nimble. For instance, if a service needs an emergency dose of capacity, should its owners be able to hit a button and engage any failover capacity? We’re also investigating using Nimble as a basis for quicker autoscaling response: why have AWS start up fresh instances when we have pre-warmed instances ready to go?

If answering questions such as these sounds interesting, join us!

- Luke Kosewski, Amjith Ramanujam, Niosha Behnam, Aaron Blohowiak, and most recently Katharina Probst

AWS
Traffic Management
Reliability
Traffic Optimization
Efficiency