Three lessons from Netflix on how to live in the cloud
- 09 October, 2013 10:57
Netflix is a big company and a big cloud user. With 38 million members across 40 countries, it streams a billion hours of content per month.
Almost all of Netflix's customer-facing services, such as a massive database that creates personalized content recommendations based on prior viewing history, run in Amazon Web Services' public cloud.
The company has a content-delivery platform named Open Connect that it manages with partnering ISPs to actually stream movies to users.
As one of the biggest cloud users in the world, the company has gleaned lessons from its operations. Below are three takeaways on how the company approaches the cloud, from Ariel Tseitlin, director of cloud solutions for Netflix, who spoke at the Massachusetts Technology Leadership's Cloud Summit on Tuesday.
One Netflix goal is to build each application at the smallest possible level of abstraction, which minimizes the effect of any downtime or service failure in the cloud. Done successfully, this drastically reduces the "blast radius" of any cloud outage, says Tseitlin, who's responsible for building out the company's cloud and ensuring its reliability.
For example, if Netflix's personalization service goes down, then the company defaults to a more generic recommended movies list that will suggest the most popular titles, but not necessarily those personalized to the user. That minimizes the snowball effect of one service bringing down others.
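The fallback described above can be sketched in a few lines of Python. This is a minimal illustration, not Netflix's code: the function names, the placeholder titles and the simulated failure are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical placeholder for a "most popular" list; not real Netflix data.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


def recommendations_with_fallback(
    user_id: str, personalized: Callable[[str], List[str]]
) -> List[str]:
    """Return personalized picks, or the generic popular list if the call fails."""
    try:
        return personalized(user_id)
    except Exception:
        # Contain the failure here so it does not spread to the caller.
        return list(POPULAR_TITLES)


def healthy_service(user_id: str) -> List[str]:
    # Stand-in for a working personalization service.
    return [f"Pick for {user_id}"]


def broken_service(user_id: str) -> List[str]:
    # Stand-in for a personalization service that is down.
    raise ConnectionError("personalization unavailable")


print(recommendations_with_fallback("u1", healthy_service))
print(recommendations_with_fallback("u1", broken_service))
```

The caller gets a usable list either way, which is what keeps one failing service from taking others down with it.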
Build in redundancy
It's one thing to have functionality of applications and services deployed to the cloud at granular levels; it's another to scale it and ensure it works all the time. That's why Netflix has horizontally scaled its service across the globe. Each service is deployed to at least three Availability Zones (AZs), which are isolated locations within Amazon's cloud. AWS recommends deploying to at least two AZs for its service-level agreement (SLA) to kick in. Not only are Netflix services deployed to three AZs, but they are each scaled independently so that if an AZ fails, load balancers migrate traffic to the healthy AZs.
In addition to scaling across multiple AZs, the entire Netflix service is replicated asynchronously across two regions within Amazon's cloud, U.S. East and EU West. The idea is that if an entire region in Amazon's cloud were to fail, the service would still be available.
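As a rough sketch of how traffic can be steered away from a failed zone, the Python below rotates requests across only the zones marked healthy. The zone names and the health map are hypothetical, and real load balancers rely on live health checks rather than a static dictionary.

```python
from itertools import cycle
from typing import Dict, Iterator


def healthy_zone_cycle(zone_health: Dict[str, bool]) -> Iterator[str]:
    """Return an endless round-robin iterator over the healthy zones only."""
    healthy = [zone for zone, ok in zone_health.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy zones available")
    return cycle(healthy)


# Hypothetical zone names; "zone-b" is marked as failed.
zones = {"zone-a": True, "zone-b": False, "zone-c": True}
router = healthy_zone_cycle(zones)
picks = [next(router) for _ in range(4)]
print(picks)
```

Requests alternate between the two surviving zones and never reach the failed one.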
Even with monitoring and alerts that cover the entire operations of Netflix, failures will still happen. That's why the company has built a platform for monitoring its service and fixing mistakes. The Simian Army is a series of open source tools, developed internally by Netflix, that test the fault tolerance of the company's operations. Chaos Monkey randomly kills various services to test failure at the application layer. Chaos Gorilla brings down an entire AZ to test for high availability. Chaos Kong is a service in development that Netflix hopes to eventually use to test an entire region shutting down. Tseitlin says that Netflix is so concerned with testing and monitoring that it jokingly refers to itself as a monitoring company that occasionally delivers movies.
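The idea behind random failure injection can be illustrated with a toy Python sketch. This is not the actual Chaos Monkey implementation: the fleet is a plain dictionary, the instance names are invented, and "killing" an instance only flips a flag.

```python
import random
from typing import Dict, Optional


def kill_random_instance(
    instances: Dict[str, bool], rng: random.Random
) -> Optional[str]:
    """Mark one randomly chosen running instance as down and return its name."""
    running = sorted(name for name, up in instances.items() if up)
    if not running:
        return None  # nothing left to terminate
    victim = rng.choice(running)
    instances[victim] = False
    return victim


# Hypothetical fleet: instance name -> whether it is running.
fleet = {"api-1": True, "api-2": True, "api-3": True}
victim = kill_random_instance(fleet, random.Random())
print(f"terminated {victim}; still running: {sum(fleet.values())}")
```

The point of such a test is what happens next: whether the remaining instances keep serving traffic without manual intervention.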
Another aspect of being resilient is the way the company distributes responsibility to its workers. The company relies heavily on developers to build out the Simian Army and cloud services. Whenever a developer builds something, they're responsible for keeping it up. While this may sound like a "devops" model, which is the idea of developers provisioning their own infrastructure resources, Netflix instead embraces what Tseitlin calls a "distributed ops" model. Each developer is responsible for the entire life cycle of the code and applications they create. Developers write the programs, run them and are responsible for keeping them up to date.
While Netflix has already moved almost all of the company's customer-facing services to the public cloud, it still has more work to do. On the road map is moving all the company's in-house, back-end services to the cloud as well. That process has already started with a migration from Exchange to Google Apps for email. It has also transitioned from Concur to Workday for expense management and from traditional internal file sharing to Box, Tseitlin says.
Billing and payments are still mostly in Netflix-controlled data centers to comply with Payment Card Industry (PCI) standards. If all goes well, that may change soon though. Netflix wants to be all in the cloud if it can be: "The goal is to not run data centers at all," Tseitlin says.