Abstract: Alibaba Cloud is committed to providing a better operation and maintenance (O&M) experience, making your use of ECS more transparent and efficient and making O&M more standardized and automated. With Active O&M 2.0, using ECS cloud servers becomes smoother. With system events, you no longer need to submit work orders and contact customer service. You can handle instance restarts triggered by active O&M yourself, which reduces the impact on system reliability and business continuity.
Elastic Compute Service (ECS) is an elastic, scalable computing service that helps you reduce IT costs, improve O&M efficiency, and focus on core business innovation. When you build a business system on ECS cloud servers, the advantages and features of cloud computing let you respond quickly to business needs and strongly safeguard business continuity. On this basis, Alibaba Cloud is committed to providing a better O&M experience, making your use of ECS more transparent and efficient and making O&M more standardized and automated.


Active O&M
Alibaba Cloud applies strict IDC (data center) standards, server admission standards, and O&M standards to ensure the high availability of the entire cloud computing infrastructure, the reliability of data, and the availability of cloud servers. For a single ECS instance, Alibaba Cloud promises service availability of no less than 99.95% within a service cycle. For multiple availability zones in a single region, Alibaba Cloud promises service availability of no less than 99.99% within a service cycle.
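
To make these percentages concrete, the following minimal sketch converts an availability commitment into the maximum downtime it allows. It assumes a 30-day service cycle purely for illustration; the actual cycle length is defined by the SLA, and the function name is hypothetical.

```python
# Assumption: a 30-day service cycle. The real cycle length is defined by the SLA.
MINUTES_PER_CYCLE = 30 * 24 * 60  # 43,200 minutes


def max_downtime_minutes(availability_percent: float,
                         cycle_minutes: int = MINUTES_PER_CYCLE) -> float:
    """Return the downtime (in minutes) permitted by an availability commitment."""
    return cycle_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for label, pct in [("single instance", 99.95), ("multiple zones in one region", 99.99)]:
        print(f"{label}: {pct}% allows about {max_downtime_minutes(pct):.2f} minutes of downtime")
```

Under this assumption, 99.95% corresponds to roughly 21.6 minutes of downtime per cycle and 99.99% to roughly 4.32 minutes.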


At the infrastructure level, there are always potential factors, such as software bugs or hardware failures, that can affect the operation of ECS instances. To guarantee the high level of service availability described above, the high-availability design of the cloud computing infrastructure is therefore not enough: active O&M for ECS is also indispensable. As the invisible guardian of ECS, active O&M proactively performs routine maintenance and fault detection on the physical servers that host ECS instances, and repairs potential faults through online or rolling upgrades. This continuously improves system reliability, performance, and security, and keeps cloud servers running stably.


In some cases, however, a physical server needs to be restarted or shut down for maintenance. The active O&M system then sends a message to the ECS users on that server, informing them that their ECS instances need to be restarted and migrated to a healthy physical server. Previously, after receiving such a notification, users had to submit a work order and contact customer service to grant authorization. With the evolution to Active O&M 2.0, this experience has improved in several ways.


Experience upgrades


1. Live migration in active O&M, with no interruption to running instances
If active O&M detects that a physical server is at risk of failure, the system first tries to live migrate the ECS instances on that server to another physical server. An ECS instance that is successfully live migrated is not interrupted, and its services remain online. Only the small number of instances for which live migration is risky enter the active O&M restart-and-migrate process. This strategy upgrade effectively reduces the impact on users' business continuity. Even with the rapid growth in Alibaba Cloud users, the number of work orders related to active O&M is 125 times lower than last year!


2. Clearer risk notifications, so you know the impact of migration in advance
For instances that must be restarted and migrated, Alibaba Cloud sends users a notification with specific guidance in advance. Because a local disk resides on a single physical server and is not based on multi-replica distributed storage, the data stored on the local disk is erased during migration. For local disk instances, the notification therefore states this risk clearly and reminds you to back up your data before migration. For cloud disk instances, the notification provides operation guidance. You no longer need to submit a work order and contact customer service. You can restart the instance to migrate it directly from the console or through the API.
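
The distinction between the two storage types can be sketched as a small decision function. This is an illustration only, under the assumptions stated in the comments: the class and function names are hypothetical and do not correspond to any ECS API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class StorageType(Enum):
    LOCAL_DISK = "local_disk"  # data lives on a single physical server
    CLOUD_DISK = "cloud_disk"  # data lives on distributed, multi-replica storage


@dataclass
class MigrationNotice:
    """Hypothetical model of a restart-and-migrate notification."""
    instance_id: str
    storage: StorageType


def recommended_steps(notice: MigrationNotice) -> List[str]:
    """Return the steps a user should take after receiving the notice."""
    if notice.storage is StorageType.LOCAL_DISK:
        # Local disk data is erased during migration, so a backup comes first.
        return [
            "back up the data on the local disk",
            "proceed with migration as described in the notification",
        ]
    # Cloud disk instances can be handled without a work order.
    return ["restart the instance from the console or the API to migrate it"]


if __name__ == "__main__":
    # "i-example" is a placeholder instance ID.
    for step in recommended_steps(MigrationNotice("i-example", StorageType.LOCAL_DISK)):
        print(step)
```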


3. No work orders or customer service needed: system events let you handle it yourself
Self-service handling of restart and migration for cloud disk instances is now available on the console and through the API. When you receive a scheduled system event for a restart and migration, you can see the execution plan for the event. As shown in the following figure, you can restart the instance immediately, schedule the restart for off-peak hours, or prepare for the O&M operation and wait for the system to perform it at the scheduled time. This process no longer depends on work orders, which improves efficiency and reduces the impact of instance restarts on your online business.
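
The three options above (restart now, restart during off-peak hours, or wait for the system) can be sketched as a simple scheduling rule. This is a minimal illustration, not the ECS implementation: it makes no real API calls, and the function name, parameters, and the idea of a single daily off-peak start time are all assumptions.

```python
from datetime import datetime, time, timedelta


def choose_restart_time(now: datetime,
                        scheduled_at: datetime,
                        off_peak_start: time,
                        urgent: bool = False) -> datetime:
    """Pick when to restart an instance that has a scheduled restart event.

    now            -- current time
    scheduled_at   -- time at which the system will restart the instance itself
    off_peak_start -- hypothetical start of the daily off-peak window
    urgent         -- True to restart immediately
    """
    if urgent:
        return now  # option 1: restart immediately

    # Find the next off-peak start that is still in the future.
    candidate = datetime.combine(now.date(), off_peak_start)
    if candidate <= now:
        candidate += timedelta(days=1)

    if candidate < scheduled_at:
        return candidate  # option 2: restart during off-peak hours

    return scheduled_at  # option 3: wait for the system's scheduled execution


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 10, 0)
    scheduled = datetime(2024, 1, 3, 12, 0)
    print(choose_restart_time(now, scheduled, time(2, 0)))
```

In a real setup, the chosen time would be used to trigger a restart through the console or the API before the scheduled execution time.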






Get twice the result with half the effort


As the experience upgrades above show, getting twice the result with half the effort comes not only from the evolution of the active O&M process but also from the release of system events. System events make users more aware of changes in the running status of their ECS instances, so users can take specific actions to respond to events or avoid their impact on online services. By closing the loop with system events, more O&M scenarios can be standardized and automated, giving users a better O&M experience on the cloud.

