The AWS Outage—What Happened and Why Does It Matter?
Wednesday 1 Mar 2017, 4:04 pmYesterday, one of Amazon’s data centres suffered some kind of catastrophic failure, and their S3 service went down. The US-EAST-1 DC is one of the data centres the tech giant have on the Eastern seaboard (it’s in North Virginia, to be precise). A large portion of the internet was affected, and a variety of websites were affected by the outage.
What is AWS and S3?
AWS stands for Amazon Web Services, and is a collection of technology platforms which can be used by anyone to run websites. It includes EC2 (virtual server hosting), RDS (hosted databases), SES (email sending and receiving service), and many other services such as S3, which was affected in the outage yesterday.
S3 stands for Simple Storage Service, and is basically an online file storage platform. For a small fee, website owners can upload static assets to Amazon, who will look after it and serve them up on request. As an example, Netflix, Reddit, Dropbox, Tumblr, and Pinterest all use S3 to host critical parts of their website. Think of it kind of like a cloud hosted USB hard drive.
Why Would a Web Developer Use S3?
Often, hosting platforms limit the amount of data a particular website can transfer in a month, or charge money based on the amount of data transferred. Using a third party file storage system can be much cheaper, and you only pay for what you use (as opposed to what you might use, which is the normal pricing model used by web hosts — you don’t have to pay for ‘unused space’ on your server).
If you have a website where users might upload content, then S3 already has all of the infrastructure needed to upload and manage content, and the web developer doesn’t need to write server code to handle file uploads.
S3 has an availability SLA of 99.9% (known as ‘three nines’), which works out at around 43 minutes per month of downtime, after which service credits are offered. This is much more than most web hosts offer.
So What Actually Happened?
As yet, Amazon have not released details of what exactly caused the outage, but it is clear that their US-EAST-1 DC was failing to deliver files some portion of the files stored there. Amazon referred to this simply as ‘increased error rates’, but many users were reporting a full outage. You may have seen images missing, found some websites completely unusable, or seen core functionality of websites working incorrectly.
In a rather funny twist, Amazon themselves use S3 to host their service status page, so they were not immediately able to update this to reflect the fact that the service was unavailable.
What Lessons Can Developers Learn From This?
In the wake of this outage, developers around the world will be under pressure to build more redundant storage solutions. This outage, although it was relatively short (a few hours), will probably result in an awful lot of lost productivity, potentially also lost sales and revenue. Business people in the higher echelons of companies will likely be very unhappy with this loss, and will be looking to mitigate going forwards.
What Lessons Can Amazon Learn From This?
In the short term, Amazon are going to remain the leading storage provider due to their cost and reputation. However, Amazon are going to need to work to rebuild confidence in their services going forward to retain their huge market share. Other providers will be starting to offer redundant solutions to compete with Amazon. To counter this, Amazon will need to consider whether they can make a profitable service with automatic fail-overs, duplicating data across multiple data centres, rather than relying on developers to implement this themselves.
If Amazon get their marketing right, this could even end up turning them a profit in the long run.
What Lessons Can Users Learn From This?
There’s not much you can do about this yourself, but it’s a great opportunity to better understand how much of the internet is heaped up in one place. Amazon Web Services is a huge platform that powers many of your favourite websites, and in the 10 years it’s been operational, it has mostly been working silently in the background.
As of writing this, S3 is back up, and all services in North America are operating normally. I would expect a preliminary root cause analysis to be available by the end of this week.