AWS Outage

Recently, Amazon Web Services had a problem.  A big problem.

You probably saw the effects of the outage in one way or another.  For us, several of our internal systems related to sales and marketing were simply unusable during much of the afternoon.

Problems like this are all too common in the technology space, but this one was a doozy, and it raises questions about the often-ignored risks of cloud services.

According to Cyence, the four-hour outage cost S&P 500 companies an estimated $150 million in sales and cost US financial services companies an additional $160 million.  Sites like Disney, Target, Nike and Nordstrom were all affected.

The cause of the outage?  A simple typo by some poor engineer who was debugging an issue with their billing system and accidentally took more servers offline than was intended.  This caused a cascading effect where major subsystems needed to be restarted.  Unfortunately for the Internet, restarting these services took way longer than expected, which made the problem worse.
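To make that failure mode concrete, here is a minimal sketch of the kind of guardrail that can keep a mistyped command from taking too much capacity offline.  To be clear, this is not Amazon's tooling; the function name, the parameters and the 5% threshold are all hypothetical and only there for illustration.

```python
class CapacityGuardError(Exception):
    """Raised when a removal request exceeds the configured safety limits."""


def plan_removal(active, requested, min_active, max_fraction=0.05):
    """Return the number of servers to remove, or raise if the request is unsafe.

    Hypothetical parameters (assumptions, not from any real tool):
      active       -- servers currently in service
      requested    -- servers the operator asked to take offline
      min_active   -- floor the fleet must never drop below
      max_fraction -- largest share of the fleet one command may remove
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")

    # Limit how much a single command can remove, so a typo such as
    # an extra zero is rejected instead of executed.
    max_per_command = int(active * max_fraction)
    if requested > max_per_command:
        raise CapacityGuardError(
            f"refusing to remove {requested} servers; limit is {max_per_command}"
        )

    # Never let the fleet fall below the minimum needed to keep serving.
    if active - requested < min_active:
        raise CapacityGuardError(
            f"removal would leave {active - requested} servers; minimum is {min_active}"
        )

    return requested
```

With a fleet of 1,000 servers and a floor of 900, a request to remove 10 goes through, while a fat-fingered request for 500 is refused before anything is taken offline.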

You can read Amazon’s official response to the issue here and more from USA Today and Business Insider.

I trust that Amazon will learn from this incident and change its systems and procedures to prevent this type of thing from happening again.  What’s more, this incident was a very small blip on the radar compared to the scope and scale of services running on AWS.

But it does raise an interesting question – where is the trade-off between using cloud services and accepting the risks that come along with outsourcing critical functions of your business?

Can the likes of Target & Nike internally provide the level of performance, scalability and redundancy offered by AWS?  Maybe, but the costs would be huge compared to what they pay AWS for services.  Does it make sense to offload the core functionality of a cloud platform to a company like Amazon, even knowing that something like this can happen?  Of course it does – even knowing that things like this WILL happen.

But it does give food for thought: as reliance on providers like this grows, so does the huge (and costly) impact when things don’t go as planned.  As more and more infrastructure moves to cloud providers, the risk continues to increase.  It's the proverbial ‘all your eggs in one basket.’

Look, while this incident made headlines, I have no doubt that 1) it could have happened to any cloud provider (Azure, for example) and 2) disruptions like this will happen in the future.

As systems get larger and more complex, you take on an increasing risk that a simple mistake (or targeted attack) can have far-reaching and unintended consequences.  Even so, the advantages of cloud services for performance, reliability, security and scalability far outweigh the risks of future problems.