Amazon Explains The S3 Outage and Downtime Last Weekend

Amazon has posted an announcement regarding what happened last weekend with their S3 storage service and the downtime of roughly 8 hours. We covered the outage extensively here on CenterNetworks. Overall the downtime ran from 8:40am Pacific Time to 5:00pm Pacific Time. It’s cute how they call it an "availability event" – I need to add this to my list of synonyms for the words dead, down, outage and not working.

Here’s their final conclusion:

We’ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers’ objects. However, we didn’t have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn’t detect it and it spread throughout the system causing the symptoms described above. We hadn’t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
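
To make the checksum idea concrete, here is a minimal Python sketch of how an MD5 digest attached to a message can catch a single flipped bit. This is not Amazon's code and says nothing about how S3 frames its internal messages; the function names (seal, unseal, flip_bit) and the "digest followed by payload" layout are assumptions made purely for illustration.

```python
import hashlib

MD5_SIZE = 16  # an MD5 digest is always 16 bytes


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its MD5 digest (hypothetical framing)."""
    return hashlib.md5(payload).digest() + payload


def unseal(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the digest does not match."""
    digest, payload = message[:MD5_SIZE], message[MD5_SIZE:]
    if hashlib.md5(payload).digest() != digest:
        raise ValueError("checksum mismatch: message was corrupted")
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate corruption by inverting a single bit of the data."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)
```

Calling unseal(seal(b"some state")) hands the payload back unchanged, while unseal(flip_bit(seal(b"some state"), 130)) raises ValueError instead of letting the damaged state through. A check like this only detects accidental corruption; it does not repair the message, so the receiver still has to drop it or ask for it again.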

Overall I would say that Amazon did a good job in keeping everyone informed via their health status page. They have made some changes that will let their servers become more "chatty" in the future and hopefully prevent this type of outage and severe downtime from happening again.

4 COMMENTS
  1. Greg Clinton says:

    So Amazon learned a lesson. Incremental improvement. They’re better today than they were yesterday. I’m happy to have chosen them.

  2. centernetworks says:

    Yep, most people say the key to failing is to learn from it and not do it again. Unlike say another company out there today who seems to fail every other day.

    I want Amazon to use their services for Amazon.com – so they can feel the same pain we do when it’s down. They say they do but everything is in a cache. Not the same in that case.

  3. Mukul Kumar says:

    Hi,

    I just posted some thoughts on “Cloud Availability” at http://mukulblog.blogspot.com/2008/07/cloud-availability.html . Your thoughts are welcome.

    Thanks,
    Mukul.

  4. Maybe I’m in the minority – but I find this explanation more disturbing than the outage.

    In the web 2.0 world, I think that transparency and accountability are two of the most essential elements of a successful business. So, when Apple or Google, or Yahoo or Twitter has an issue, or has unhappy customers, there’s someone whose name is on the product, the company, and the response.

    At Amazon, we know the name – Jeff Bezos. And we know the product S3. But the ‘announcement’ of the outage is signed:

    Sincerely,
    The Amazon S3 Team

    Why?

    No product manager? No business unit head? No ‘human’ who can be held responsible for a level of service? And why is this posted on the ‘health’ board, rather than on the AWS blog? Don’t we expect the companies in the space to both sign their name to their work, and to talk to customers in a format that allows for commenting and discussion?

    I’m really disappointed this took a week. I’m really disappointed that this isn’t posted in a forum where customers can respond.

    And I’m more concerned than I was last Sunday that Amazon isn’t building an enterprise class product with accountability and transparency.

    And finally – and I’ve blogged about this before… Why isn’t Amazon willing to use its own product for its web site? How much revenue would Amazon have lost had the Amazon.com site been down for 8 hours last Sunday? I’m pretty sure that Jeff Bezos would have had a statement had their product been down (and their partners whose stores are powered by Amazon had suffered such an outage).

    Here’s my previous post on this topic:

    http://www.vator.tv/news/show/2008-07-21-will-jeff-bezos-eat-his-own-dogfood
