Top 10 Cloud Outages of 2018: None of the Mainstream Public Clouds Was Spared!

According to IDC’s “China Public Cloud Service Market Semi-Annual Tracking Report,” released in July this year, Alibaba Cloud’s market share has exceeded 45%, while Tencent Cloud’s has reached 10%. Globally, according to the latest data from Gartner, Amazon AWS holds 51.8% of the market; Microsoft Azure is second with 13.3%; Alibaba Cloud ranks third with 4.6%; Google Cloud has 3.3%; and IBM follows with 1.9%. These major providers account for the vast majority of the global market, so when their cloud services go down, countless enterprises are affected.
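As a quick check on the “vast majority” claim, the Gartner shares quoted above can simply be added up. The snippet below is only an illustrative sketch: the variable names are made up for this example, and the figures are exactly the ones cited in the text.

```python
# Global market shares (percent) as quoted from Gartner in the text above.
# The dictionary and variable names are illustrative only.
global_share = {
    "Amazon AWS": 51.8,
    "Microsoft Azure": 13.3,
    "Alibaba Cloud": 4.6,
    "Google Cloud": 3.3,
    "IBM": 1.9,
}

# Round to one decimal place to avoid floating-point noise in the sum.
top_five_total = round(sum(global_share.values()), 1)
everyone_else = round(100 - top_five_total, 1)

print(f"Top five providers: {top_five_total}%")
print(f"All other providers: {everyone_else}%")
```

By these figures, the five providers together hold roughly three quarters of the global market, with AWS alone holding more than half.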

In 2018, the cloud computing market grew rapidly, but it also ran into problems. The conflict between cloud providers and the open source community continued to escalate. None of the mainstream cloud vendors escaped outages, and some suffered several service interruptions within the year, steadily eroding enterprise confidence in the public cloud. This post reviews the top 10 cloud outages of 2018 and invites you to add your own darkest cloud moments.

Details of the incident: On January 18, 2018, an automation failure at Google Cloud caused a 93-minute outage of Compute Engine in two major locations, us-central1 and europe-west3. Google attributed the incident to a “network programming failure” that caused the Autoscaler service to fail. As a result, new or newly migrated virtual machines could not communicate with VMs in other availability zones.

Remedy: The engineering team manually switched to a replacement task to restore the data persistence layer to normal operation.

Downtime: 93 minutes

Follow-up: Google promised that, in the future, it would stop VM migrations if the configuration data became stale, and that the data persistence layer would re-resolve its peers during long-running processes so that it could switch quickly to a replacement task in the event of a failure.

Details: In the early morning of March 2, 2018, Alexa, which relies on AWS, began to lose its voice. The red indicator on the smart speaker kept flashing to signal an interruption, and Alexa kept repeating its built-in apologies. In the hours that followed, thousands of complaints about Alexa came in. The fault is understood to have been caused by a problem with Amazon Web Services (AWS). Other services that rely on AWS as a backbone were also affected during the day, including software developer Atlassian and cloud messaging company Twilio.

Fix: The AWS online support team resolved the issue.

Downtime: several hours (because the incident began in the early morning, it did not draw wide attention at first)

Follow-up: AWS did not elaborate on the glitch, saying only that it was related to network connectivity.

Incident details: On May 31, 2018, AWS again experienced connectivity issues, this time because of a hardware failure at a data center in the Northern Virginia region. AWS’s core EC2 service, the WorkSpaces virtual desktop service, and the Redshift data warehouse service were all affected.

Remedy: manual repair

Duration: about 30 minutes

Follow-up: Mai-lan Tomsen Bukovec, vice president and general manager of Amazon S3, said in a recent interview that Amazon has never seen a data center crash. This means that none of the past incidents resulted in an entire data center crash, and AWS has made system design improvements to prevent such incidents.

Details of the incident: On June 17-18, 2018, a problem with the temperature control system in Microsoft Azure’s data center in Ireland caused high temperatures, which led to storage and network outages.

Downtime: more than 5 hours

Details of the accident: On June 27, 2018, Alibaba Cloud suffered a major technical failure at about 16:21 and began to recover at 16:50. Officially, the failure lasted about 30 minutes, and full recovery took about an hour. After a technical review, Alibaba Cloud said the cause was a change verification operation that the engineering team performed while launching a new automated operations and maintenance feature. No problem had appeared in the test environment, but an unknown bug was triggered after the launch.

Remedy: Engineers intervened manually to locate and resolve the problem.

Downtime: about 30 minutes; full recovery took about an hour.

Follow-up: The incident was classified as S1, meaning that important functions of core services were unavailable, affecting some users and causing a certain amount of loss. Alibaba Cloud issued an official statement, saying, “There is no excuse for this failure. We cannot and should not make such a mistake! We will seriously review and improve our automated operation and maintenance technology and our release verification process, and revere every line of code and every commitment.”

Accident details: On August 5, 2018, Beijing Qingbo CNC Technology Co., Ltd. (hereinafter “Frontier CNC”) published an article on its official Weibo account titled “The Disaster Tencent Cloud Brought to a Startup.” According to the article, a Tencent Cloud hard disk failed on July 20, 2018 (Tencent Cloud later published an explanation of the cause of the accident). As a result, all the data the company had stored was lost and could not be restored. The startup valued the lost platform data at nearly 10 million yuan; it included targeted registered users and content accumulated through long-term promotion and traffic acquisition.

Remedy: Tencent Cloud said it notified the user of the fault as soon as monitoring detected the anomaly, and immediately brought in file system experts, together with the vendor’s technical experts, to try to repair the data. Despite repeated attempts, part of the data still failed integrity verification.

Follow-up: Tencent Cloud proposed an “indemnity + compensation” plan and promised to stay in communication with “Frontier CNC” to help it recover its business.

Accident details: On July 24, 2018, users found that they were repeatedly logged out of Tencent Cloud after logging in, and switching to a different carrier made no difference. Tencent Cloud subsequently issued a notice saying that the preliminary cause was a severed carrier fiber-optic cable, and that the carrier had located the break and was reconnecting it. The impact was mainly on some users in the Guangzhou region.

Remedy: The carrier immediately stepped in to repair the cable.

Downtime: The exact duration is unknown; recovery took 30 to 40 minutes.

Details of the incident: Prime Day is a 36-hour promotion for Amazon Prime members around the world. Just after the event started, Amazon’s website and app went down at the same time, affecting not only its e-commerce business but also its other products and services. Amazon attributed this to a global issue with the AWS management console.

Downtime: The outage lasted nearly six hours.

Follow-up: An AWS spokesperson said the intermittent AWS management console issue did not have any meaningful impact on Amazon’s consumer business.

Incident details: On the morning of September 4th, severe weather, including a lightning strike, hit the area near Microsoft Azure’s South Central US data center. This affected the voltage in the cooling system, causing connectivity problems for multiple Azure services and making it difficult for customers to access resources stored in the data center. The affected services included Office 365, Active Directory, Visual Studio Online, Visual Studio Team Services, and more.

Fix: Microsoft engineers had restored power to the data center and to most network equipment by the morning of September 5th, and the remaining services were brought back gradually.

Downtime: more than 24 hours

Details: On November 9th, node pool creation in Google Cloud’s Kubernetes service (GKE) failed, and operators could not create new nodes through the Cloud Console UI.

Remedy: Google sent an engineering team to investigate the cause of the problem and begin repairing it. According to Google, affected enterprise users could, in the meantime, use the gcloud command-line tool provided with GCP to create new Kubernetes nodes.

Downtime: nearly 19 hours

For many SMEs, the staffing and maintenance costs of running their own server rooms are too high. They want the benefits of cloud computing, such as low cost, scalability, reliability, and convenience, but they worry about the risks. The risks tend to be the same ones: security breaches, regulatory issues, and a lack of knowledge about how to build the best cloud infrastructure. Over the past few years, cloud providers have experienced a number of outages, large and small, which shows that these concerns are not unfounded. As more businesses and government agencies move their data into the cloud, even a minor outage can be catastrophic. Even with Alibaba Cloud (Aliyun), which promises 99.9 percent reliability, the remaining 0.1 percent of downtime still happens.
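To put the 99.9 percent figure in perspective, the sketch below converts it into a yearly downtime budget and compares that budget with two of the outage durations listed earlier (93 minutes and nearly 19 hours). It assumes the 99.9 percent is read as availability over a 365-day year; real service level agreements define their own measurement periods and exclusions, and the function names here are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_after_outage(outage_minutes: float) -> float:
    """Yearly availability (percent) if one outage is the only downtime."""
    return 100 * (1 - outage_minutes / MINUTES_PER_YEAR)


budget = downtime_budget_minutes(99.9)
print(f"99.9% allows about {budget / 60:.2f} hours of downtime per year")

# Outage durations taken from the incidents described in this post.
for label, minutes in [("93-minute outage", 93), ("19-hour outage", 19 * 60)]:
    verdict = "within" if minutes <= budget else "beyond"
    achieved = availability_after_outage(minutes)
    print(f"{label}: {achieved:.3f}% for the year, {verdict} the 99.9% budget")
```

Under these assumptions, 99.9 percent leaves about 8.76 hours of downtime per year, so a single 19-hour incident uses up more than two years of that budget.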

Given these enterprise needs, the trend toward hybrid cloud is now clear, and many public cloud vendors are positioning themselves in the hybrid cloud market. With a hybrid cloud, companies can reduce costs and increase productivity without committing themselves entirely to the public cloud. However, hybrid clouds bring their own compatibility and security compliance challenges. To minimize the losses caused by failures, enterprises should therefore not only establish a comprehensive disaster recovery system but also run regular drills against it.

Did you experience any of these public cloud outages in 2018? What do you think about it?

Feel free to leave your thoughts in the comments.
