AWS Outage: Key Takeaways Following Huge Downtime Incident - NETSERVE365

March 9, 2017

Early last week, Amazon Web Services’ S3 system went down and things went haywire. Many popular sites crashed, some partially and some entirely. According to SimilarTech, over 124,000 sites were affected. S3, a cloud storage system, experienced what Amazon called “high error rates” that impacted sites like Business Insider, Reddit, Slack and more.
 
What was the root of this problem? A typo. The Amazon S3 team was debugging an issue with the billing system. An authorized team member executed a command that was intended to remove a small number of servers for one of the S3 subsystems used by the billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed. The removed servers supported two other S3 subsystems, one of these being the index subsystem that manages the metadata and location information of all S3 objects in the region.
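One way a capacity-removal tool could reject a mistyped input is to check the request against simple limits before acting on it. The following is a minimal sketch of that idea, not Amazon's actual tooling: the function name `plan_removal`, the `max_fraction` cap and the `min_capacity` floor are all hypothetical values chosen for illustration.

```python
def plan_removal(active_servers, requested, min_capacity, max_fraction=0.05):
    """Return the servers to remove, or raise ValueError if the request looks unsafe.

    Every name and threshold here is an illustrative assumption.
    """
    active = set(active_servers)
    requested = sorted(set(requested))

    # Reject names that are not part of the fleet (a likely sign of a typo).
    unknown = [s for s in requested if s not in active]
    if unknown:
        raise ValueError(f"unknown servers: {unknown}")

    # Cap how much of the fleet a single command may remove.
    limit = max(1, int(len(active) * max_fraction))
    if len(requested) > limit:
        raise ValueError(
            f"refusing to remove {len(requested)} servers; limit is {limit}"
        )

    # Never let the subsystem drop below its minimum required capacity.
    remaining = len(active) - len(requested)
    if remaining < min_capacity:
        raise ValueError(
            f"only {remaining} servers would remain; minimum is {min_capacity}"
        )

    return requested


fleet = [f"srv-{i}" for i in range(100)]
print(plan_removal(fleet, ["srv-1", "srv-2"], min_capacity=80))
```

With a 100-server fleet and the default 5% cap, a request for two servers passes, while a request for thirty servers is refused before anything is removed.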
 
While large-scale downtime incidents like this don’t happen too often, they do underscore the need for organizations to have the right systems and processes in place with proper redundancy. Here are a few things to take away from this multi-million dollar downtime event:
 

Downtime is Inevitable

Unfortunately, downtime is inevitable, even on the best systems and technologies. Amazon’s S3 is designed for 99.99% availability and 99.999999999% durability, and is relied on by more than 120,000 websites. Even then, it can and will experience the occasional issue. The point is, expect downtime, but work with a technology provider that can minimize it and manage the recovery process efficiently when it happens. How downtime is dealt with can make the difference between a brief disruption and thousands of dollars in associated costs.
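To put uptime percentages in perspective, an availability target can be converted into a downtime budget. The short sketch below does that arithmetic; the function name `allowed_downtime_minutes` is made up for this example, and it assumes a 365-day year.

```python
def allowed_downtime_minutes(availability_percent, period_hours=365 * 24):
    """Minutes of downtime permitted over the period at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * 60 * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% availability allows about {minutes:.1f} minutes of downtime per year")
```

At 99.9%, the budget works out to roughly 8.8 hours per year, so even a service that meets its target can still be unavailable for most of a working day.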
 

Don’t Keep All Your Tech Eggs in One Basket

The S3 outage has taught us many things, but this should be your biggest takeaway. While these large-scale cloud provider outages don’t happen often, there is still a need to cover all the bases in case of downtime. The workforce has become increasingly connected and mobile, and this strengthens the need for continuous access to data and applications. Investing in a solution that doesn’t rely exclusively on a single cloud for storage, disaster recovery and business continuity will help protect your organization from disasters.
 
Solutions that combine a local, onsite appliance with cloud replication and redundancy can help ensure that if either the physical or the web-based system fails, its counterpart allows users to maintain full or partial continuity. With this, you don’t have to put work on pause while an outside vendor, such as Amazon, works to recover from the outage.
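The failover idea can be sketched in a few lines. This is an illustration under stated assumptions rather than any vendor's implementation: `DictStore` is an in-memory stand-in for a real storage backend, and `StoreUnavailable` and `read_with_failover` are hypothetical names.

```python
class StoreUnavailable(Exception):
    """Raised by a store that cannot be reached (hypothetical error type)."""


class DictStore:
    """In-memory stand-in for a real storage backend."""

    def __init__(self, name, data, online=True):
        self.name = name
        self.data = dict(data)
        self.online = online

    def read(self, key):
        if not self.online:
            raise StoreUnavailable(f"{self.name} is unreachable")
        return self.data[key]


def read_with_failover(key, stores):
    """Try each store in order; return (store_name, value) from the first that answers."""
    errors = []
    for store in stores:
        try:
            return store.name, store.read(key)
        except StoreUnavailable as exc:
            errors.append(str(exc))
    raise StoreUnavailable("all stores failed: " + "; ".join(errors))


# Simulate a cloud outage: the cloud copy is offline, the onsite copy still answers.
cloud = DictStore("cloud-replica", {"report.docx": b"v1"}, online=False)
onsite = DictStore("onsite-appliance", {"report.docx": b"v1"})
print(read_with_failover("report.docx", [cloud, onsite]))
```

The sketch only covers reads. A real deployment would also need to keep the two copies in sync and decide how to handle writes made while one side is down.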
 
If you want to be really safe, it’s time to reconsider relying so much on public cloud. Look into options like private cloud or hybrid cloud models that provide more flexibility and security. As more organizations host their infrastructure in a single set of data centers provided by companies like Google, Amazon, and Microsoft, they potentially increase the frailty of the web by creating a growing number of single points of failure.
 
No matter what, there will still be failures and we will have to overcome them. But if you take the right precautions now by not putting all your IT eggs in one basket, you will be much happier during the next outage.
 
 