Summary

Netflix's transition to AWS over the past year has taught them valuable lessons about operating in a cloud environment, emphasizing the need to embrace failure as a means to build resilient systems.

Abstract

Netflix has shared insights gained from their move to AWS, highlighting the importance of adapting to cloud-specific operational models and the necessity of designing systems that can withstand failure. They've learned that the reliability assumptions from self-managed data centers do not directly translate to the cloud, where instance failures are more common and network latency is more variable. Netflix has adopted a philosophy of constant failure testing, exemplified by their Chaos Monkey tool, to ensure their systems can handle unexpected outages. They also stress the importance of learning at real scale rather than relying on toy models, as this approach provided them with a more accurate understanding of their architecture's limitations and strengths. Finally, Netflix emphasizes the need for commitment and perseverance when facing the challenges inherent in migrating to and operating within a cloud environment like AWS.

Opinions

Embracing the cloud requires unlearning traditional data center practices and adopting new operational paradigms.
Co-tenancy in the cloud introduces variability that systems must be designed to accommodate or circumvent.
Building systems that can tolerate failure is critical; Netflix's Rambo Architecture approach ensures that services can succeed independently despite dependencies failing.
Testing systems with real-world traffic at scale is invaluable for understanding how they will perform in production.
Committing to the cloud and pushing through difficult periods is essential for success, with leadership buy-in being a significant factor in overcoming migration challenges.

5 Lessons We’ve Learned Using AWS

In my last post I talked about some of the reasons we chose AWS as our computing platform. We’re about one year into our transition to AWS from our own data centers. We’ve learned a lot so far, and I thought it might be helpful to share with you some of the mistakes we’ve made and some of the lessons we’ve learned.

1. Dorothy, you’re not in Kansas anymore.

If you’re used to designing and deploying applications in your own data centers, you need to be prepared to unlearn a lot of what you know. Seek to understand and embrace the differences operating in a cloud environment.

Many examples come to mind, such as hardware reliability. In our own data centers, session-based memory management was a fine approach, because any single hardware instance failure was rare. Managing state in volatile memory was reasonable, because it was rare that we would have to migrate from one instance to another. I knew to expect higher rates of individual instance failure in AWS, but I hadn’t thought through some of these sorts of implications.

Another example: in the Netflix data centers, we have a high capacity, super fast, highly reliable network. This has afforded us the luxury of designing around chatty APIs to remote systems. AWS networking has more variable latency. We’ve had to be much more structured about “over the wire” interactions, even as we’ve transitioned to a more highly distributed architecture.

2. Co-tenancy is hard.

When designing customer-facing software for a cloud environment, it is all about managing down expected overall latency of response. AWS is built around a model of sharing resources: hardware, network, storage, etc. Co-tenancy can introduce variance in throughput at any level of the stack. You’ve got to either be willing to abandon any specific subtask, or manage your resources within AWS to avoid co-tenancy where you must.

Your best bet is to build your systems to expect and accommodate failure at any level, which introduces the next lesson.

3. The best way to avoid failure is to fail constantly.

We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.

If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.

One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most — in the event of an unexpected outage.

4. Learn with real scale, not toy models.

Before we committed ourselves to AWS, we spent time researching the platform and building test systems within it. We tried hard to simulate realistic traffic patterns against these research projects.

This was critical in helping us select AWS, but not as helpful as we expected in thinking through our architecture. Early in our production build out, we built a simple repeater and started copying full customer request traffic to our AWS systems. That is what really taught us where our bottlenecks were, and some design choices that had seemed wise on the whiteboard turned out foolish at big scale.

We continue to research new technologies within AWS, but today we’re doing it at full scale with real data. If we’re thinking about new NoSQL options, for example, we’ll pick a real data store and port it full scale to the options we want to learn about.

5. Commit yourself.

When I look back at what the team has accomplished this year in our AWS migration, I’m truly amazed. But it didn’t always feel this good. AWS is only a few years old, and building at a high scale within it is a pioneering enterprise today. There were some dark days as we struggled with the sheer size of the task we’d taken on, and some of the differences between how AWS operates vs. our own data centers.

As you run into the hurdles, have the grit and the conviction to fight through them. Our CEO, Reed Hastings, has not only been fully on board with this migration, he is the person who motivated it! His commitment, the commitment of the technology leaders across the company, helped us push through to success when we could have chosen to retreat instead.

AWS is a tremendous suite of services, getting better all the time, and some big technology companies are running successfully there today. You can too! We hope some of our mistakes and the lessons we’ve learned can help you do it well.

— John Ciancutti.

5 Lessons We’ve Learned Using AWS

See Also:

Four Reasons We Choose Amazon’s Cloud as Our Computing Platform

We think cloud computing is the future.