Update From Amazon Regarding Friday’s S3 Downtime
We were one of the first to report on the Amazon S3 outage yesterday. What I thought would be a quick outage turned into several hours of "semi-downtime" for many popular sites, including Twitter and Tumblr. Since Amazon S3 is only a storage location, the services still worked as expected; they just looked like poo.
Early this morning, Amazon commented on what happened to cause the outage. Here is the commentary from Amazon via the Web Services forum:
Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.
Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.
As we said earlier today, though we’re proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable. As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements. We are taking immediate action on the following: (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls. Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.
The Amazon Web Services Team
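The monitoring gap Amazon describes is easy to picture: total request volume can stay flat while the share of authenticated requests climbs. Here is a minimal sketch of that idea in Python. It is not Amazon's implementation; the class name, window size, and alert threshold are all hypothetical and chosen only for illustration.

```python
from collections import deque


class AuthShareMonitor:
    """Tracks the share of authenticated requests among the most recent ones.

    Hypothetical illustration only: the window and threshold are made-up values.
    """

    def __init__(self, window=1000, max_share=0.5):
        # Keep only the last `window` requests; older entries fall off the left.
        self.recent = deque(maxlen=window)
        self.max_share = max_share

    def record(self, authenticated):
        """Record one request; `authenticated` is True for signed requests."""
        self.recent.append(bool(authenticated))

    def auth_share(self):
        """Return the fraction of recent requests that were authenticated."""
        if not self.recent:
            return 0.0
        return sum(self.recent) / len(self.recent)

    def should_alert(self):
        """True when the authenticated share exceeds the configured threshold."""
        return self.auth_share() > self.max_share
```

A monitor like this would flag a shift from, say, 3 in 10 requests being authenticated to 8 in 10, even though the request count per window never changes.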
This afternoon, Reuven from ElasticDrive posted the following message in response to the above:
Gee, I hope we didn’t bring down S3 with our elasticdrive software which makes extensive use of cryptographic credentials, we’ve seen a major up tick in usage lately. (Specially yesterday)
Whatever the case, Amazon was quick to respond to all of our questions and kept us updated, unlike, say, how Skype handled its outage late last year. It looks like we all need a backup plan for how we use S3 going forward, so that its downtime does not affect our live Web sites.
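One simple shape such a backup plan could take is an ordered list of places to fetch an asset from, trying the next one whenever the previous one fails. The sketch below is only an assumption about how a site might do this; `fetch_with_fallback` and the idea of passing fetchers in as plain callables are hypothetical, and the real fetchers (an S3 client, a local mirror, a second storage provider) are left to you.

```python
def fetch_with_fallback(key, sources):
    """Return (source_name, data) from the first source that succeeds.

    `sources` is an ordered list of (name, fetch) pairs, where `fetch` is any
    callable that takes a key and returns the data or raises on failure.
    Hypothetical helper for illustration; it is not part of any S3 library.
    """
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure means "try the next source"
            errors.append((name, exc))
    raise RuntimeError(f"all sources failed for {key!r}: {errors}")


# Stand-ins for real storage back ends, so the sketch runs without a network.
def unavailable_primary(key):
    raise ConnectionError("primary storage is not responding")


local_mirror = {"logo.png": b"fake image bytes"}

source, data = fetch_with_fallback(
    "logo.png",
    [("primary", unavailable_primary), ("mirror", local_mirror.__getitem__)],
)
```

The catch is that the mirror has to be kept in sync with the primary copy, which is the real cost of any plan like this.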