On Monday, Amazon explained in detail why a Netflix outage occurred on Christmas Eve. A developer mistakenly deleted state data from Amazon's Elastic Load Balancing (ELB) service, PC Mag reports. It took Amazon a while to determine the cause of the problem, and when initial recovery attempts failed, the Netflix outage was extended.
Amazon later posted an apology on its AWS website that read:
"We want to apologize. We know how critical our services are to our customers' businesses, and we know this disruption came at an inopportune time for some of our customers. We will do everything we can to learn from this event and use it to drive further improvement in the ELB service."
On Dec. 24, Netflix users reported issues with the Watch Instantly service on devices that support Netflix streaming. However, not all devices were affected and the problem was fixed by Tuesday, the company stated.
Amazon also issued the following statement, which further explains how the deleted ELB state data led to the Netflix outage:
"This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example tracking all the backend hosts to which traffic should be routed by each load balancer).
Unfortunately, the developer did not realize the mistake at the time. After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers.
The ELB service had authorized additional access for a small number of developers to allow them to execute operational processes that are currently being automated. This access was incorrectly set to be persistent rather than requiring a per access approval.
We have reverted this incorrect configuration and all access to production ELB data will require a per-incident CM approval. This would have prevented the ELB state data from being deleted in this event."