Amazon Simple Storage Service Outage - Some Learnings

February 16, 2008 – 7:42 am

Amazon’s Simple Storage Service (S3) experienced an outage earlier today, which affected KnowledgeTreeLive and its users. The outage was quite widely reported.

KnowledgeTreeLive is in beta, so this was a great way to learn about our contingency planning, both from a technology perspective and from a process and communication perspective.

The pundits appear to be pointing to major holes in the cloud computing model, which I think somewhat exaggerates the impact and significance of today’s outage. All systems have issues, and this is not entirely unexpected. To provide some context, we recently experienced outages with our (expensive) hosting provider, RackSpace, who are supposed to be the best in the business (and we were by no means the only parties affected). Amazon has had a really great track record of keeping S3 up and running for the last few years (over 99.993% of the time), and one or two small, isolated outages are acceptable and, indeed, expected.

There are, however, some learnings suggested by others that I do hope Amazon will take to heart.

This is all well and good, but it is up to us, the companies that leverage cloud computing technologies to provide our customers with innovative (and reasonably priced) services, to ensure that we engineer our systems to deal gracefully with these situations.

What this outage meant for KnowledgeTreeLive users

  • During the period of the outage, all documents stored in Amazon S3 were safe and unaffected.
  • We experienced problems with the creation of new KnowledgeTreeLive accounts, particularly if you asked for demo data to be placed into your repository. Our support guys picked up on these pretty quickly and contacted the users affected by this.
  • Users weren’t able to upload new documents. This is not a great state to be in, but we think it is appropriate behavior: we want users to be certain that their documents are stored safely in persistent storage.
  • Users weren’t able to download documents they had previously stored. This is certainly not an ideal situation and we’ll be investigating how we can implement a cache of documents within our cluster, probably utilizing the distributed filesystem between the various front-end web server appliances.
  • We couldn’t start up new Amazon Elastic Compute Cloud (EC2) instances from our AMIs. Had we needed to because of a significant increase in load, we would have had to take the entire system into maintenance mode. Maintenance mode and other fail-safes are managed from outside of the Amazon cloud.

Some learnings for KnowledgeTreeLive 

  • We need to investigate a “Hot Cache” for documents uploaded to KnowledgeTreeLive, most likely leveraging a distributed filesystem running between our web server appliances. This will allow our customers to continue to have access to their documents during an S3 outage.
  • We need to be better at keeping users informed about what’s going on. We have a KnowledgeTreeLive Beta blog and send an RSS feed of the blog to the KnowledgeTreeLive dashboard, but we didn’t post updates fast enough this time around.
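
To make the “Hot Cache” idea a little more concrete, here is a minimal sketch in Python. It is not how KnowledgeTreeLive is implemented. Every name in it (HotCacheStore, InMemoryBackend, BackendUnavailable) is hypothetical, the in-memory backend merely stands in for S3, and a plain local directory stands in for the distributed filesystem shared by the web server appliances. The sketch mirrors the behavior described above: uploads are refused unless the persistent store accepts them, while downloads fall back to the cached copy when the store is unreachable.

```python
import hashlib
import os
import tempfile


class BackendUnavailable(Exception):
    """Raised when the persistent store cannot be reached."""


class InMemoryBackend:
    """Hypothetical stand-in for the persistent store (S3 in this post)."""

    def __init__(self):
        self.objects = {}
        self.available = True  # flip to False to simulate an outage

    def put(self, key, data):
        if not self.available:
            raise BackendUnavailable(key)
        self.objects[key] = data

    def get(self, key):
        if not self.available:
            raise BackendUnavailable(key)
        return self.objects[key]


class HotCacheStore:
    """Write-through store that keeps a cached copy of every document."""

    def __init__(self, backend, cache_dir):
        self.backend = backend
        # Assumption: cache_dir would live on the shared filesystem.
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key):
        # Hash the key so any document name maps to a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def _write_cache(self, key, data):
        # Write to a temporary file, then rename, so that readers
        # never see a half-written cache entry.
        fd, tmp_path = tempfile.mkstemp(dir=self.cache_dir)
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, self._path(key))

    def upload(self, key, data):
        # Persist first. If the store is down the exception propagates
        # and nothing is cached, so the user knows the upload failed.
        self.backend.put(key, data)
        self._write_cache(key, data)

    def download(self, key):
        try:
            data = self.backend.get(key)
        except BackendUnavailable:
            # Outage: serve the cached copy if we have one.
            path = self._path(key)
            if os.path.exists(path):
                with open(path, "rb") as cached:
                    return cached.read()
            raise
        self._write_cache(key, data)
        return data
```

With this sketch, a document uploaded before an outage stays downloadable during it, while a new upload attempted during the outage raises BackendUnavailable rather than silently living only in the cache. A real cache would also need eviction and access control, which are left out here.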

We’re meeting early this coming week to discuss how we can plug these technology and process holes, and I’m likely to blog about the outcomes.
