Disaster recovery (DR) becomes a basic requirement for enterprises moving to and using the cloud
According to the “Global Cloud Computing IT Infrastructure Market Forecast Report” released by IDC in 2019, the share of global IT infrastructure in the cloud exceeded that of traditional data centers in 2019. More and more enterprises choose to build their systems in the cloud because of the low cost and stability of cloud computing, and the cloud has become mainstream IT infrastructure. In recent years, open source and cloud technologies have developed rapidly, a wide variety of products and services have emerged, technical staff can make decisions more readily, and the pace of architecture change is accelerating. During this rapid evolution, enterprises need to guard against failures caused by human error and also pay attention to the impact of natural disasters. A poorly handled business interruption can cause serious brand, customer, and financial losses.
Every enterprise in the cloud should treat building disaster recovery capability as a most basic goal and guarantee investment in it. Only by ensuring that critical data is not lost and that system services are restored as quickly as possible when a disaster occurs can an enterprise achieve long-term, stable, and rapid development.
Common disasters and failures
In production, failures large and small inevitably occur and affect system stability. Some faults are recovered so quickly that external users notice nothing. Others cannot be recovered for a long time, leading to negative public opinion, financial loss, and other problems, and may even bankrupt the company. Faults generally fall into the following types:
- Human error, such as configuration errors and failed application releases;
- Hardware faults, such as a network device failure that affects multiple servers in an equipment room or cluster;
- Network attacks, such as DDoS attacks;
- Network or power outages, such as a cable being cut;
- Natural disasters, such as a lightning strike that causes a power failure in the equipment room.
In these disasters, facilities such as the public network, the access network, or the equipment room are often interrupted, which causes dropped connections, websites that will not open, fault alarms, and other business problems. Enterprises then face two hard problems: “business recovery” and “fault recovery”. The best approach is to decouple the two. When a failure occurs, switch traffic quickly and restore the business first, then locate and rectify the fault after services are recovered.
Growing the ability to escape failures
Locating and rectifying a fault traditionally involves four steps: discovering the fault, locating the fault, rectifying the fault, and restoring services. In this flow, service recovery is not decoupled from fault recovery: services come back only after the fault is fixed. A better approach is to replace the four steps with three: discover the problem, switch traffic, restore the business. Switching traffic restores the business quickly, shortening business recovery time from “tens of minutes or even several hours” to “minutes or even seconds” and improving the disaster recovery capability of the business.
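The three-step flow above depends on being able to move traffic away from a failed equipment room without first diagnosing the fault. The following is a minimal sketch of that idea, not a description of any real product: the `Router` class, the room names, and the weight-based model are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical traffic router: maps each equipment room to a traffic weight."""
    weights: dict[str, float] = field(default_factory=dict)

    def switch_away_from(self, failed_room: str) -> None:
        """Move the failed room's share of traffic onto the remaining rooms."""
        healthy = {r: w for r, w in self.weights.items()
                   if r != failed_room and w > 0}
        if not healthy:
            raise RuntimeError("no healthy room left to receive traffic")
        total = sum(healthy.values())
        # Redistribute proportionally so the weights still sum to 1.0.
        self.weights = {r: (healthy[r] / total if r in healthy else 0.0)
                        for r in self.weights}


def handle_incident(router: Router, failed_room: str) -> list[str]:
    """Three-step flow: discover the problem, switch traffic, restore the business."""
    steps = [f"detected failure in {failed_room}"]
    router.switch_away_from(failed_room)
    steps.append(f"switched traffic away from {failed_room}")
    steps.append("business restored; locate and rectify the fault afterwards")
    return steps


router = Router({"room-a": 0.5, "room-b": 0.5})  # illustrative room names
for step in handle_incident(router, "room-a"):
    print(step)
print(router.weights)
```

In this sketch, locating and rectifying the fault never appears on the recovery path. It is deferred until traffic is already being served elsewhere, which is what shortens the recovery time.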
To make sure traffic can be switched quickly, and that the switch is actually effective in real scenarios, enterprises need to build a higher-order disaster recovery architecture and strengthen coordination among “infrastructure”, “business systems”, “support tools”, “production systems”, and “emergency personnel”. Disaster recovery is ensured by the collaboration between architecture and organization.
This capability cannot be built overnight. It requires continuously optimizing the architecture and organizational collaboration so that the disaster recovery and multi-active capability of the business improves in an upward spiral.
Break through geographical constraints
Enterprises usually deploy equipment rooms in a single region at first. As the business grows, however, equipment rooms in a single region can no longer meet service requirements. At the same time, as the number of connections to clustered components in that region grows explosively, a single cluster cannot keep expanding and urgently needs to be split.
When a cluster is split across regions, two principles must be met: routing consistency and data consistency. With these in place, services can break through geographical restrictions, scale out horizontally across regions, schedule traffic flexibly, and solve the capacity challenges of a single region. For example:
1. Machine capacity. Multiple remote equipment rooms are deployed as peers, and enterprise applications can be deployed flexibly across equipment rooms in multiple locations.
2. Connection capacity. The clustered components in each equipment room are independent, and each equipment room connects only to its own components, so the number of connections does not grow without limit.
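Routing consistency, as required above, can be read as: every layer that handles a request must agree on which equipment room owns a given user. The sketch below illustrates one simple way to get that agreement, a stable hash of the user ID. The room names and the `room_for` function are hypothetical, and hashing is only one possible routing scheme; the source does not say which scheme is used.

```python
import hashlib

# Hypothetical equipment rooms; the names are illustrative only.
ROOMS = ["region-a", "region-b", "region-c"]


def room_for(user_id: str, rooms: list[str] = ROOMS) -> str:
    """Map a user ID to one room using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would give different answers on different machines.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(rooms)
    return rooms[index]


# If every layer (gateway, service, data access) calls the same function,
# a given user's requests and writes always land in the same room.
gateway_choice = room_for("user-1001")
service_choice = room_for("user-1001")
assert gateway_choice == service_choice
print(gateway_choice)
```

A plain modulo mapping like this reassigns many users whenever the list of rooms changes, so it is suitable only as an illustration. Keeping each user's writes in one room is also what makes data consistency tractable, since conflicting writes to the same record from two regions are avoided.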
Limitations of traditional DR
Traditional DR is based on data-level DR. It is commonly implemented by building an identical application system in a backup equipment room. When a disaster occurs, the application system is recovered within a specified time, the recovery time objective (RTO), to minimize the loss caused by the disaster. In practice, this approach has the following problems:
1. The disaster recovery center does not serve traffic in normal times, so at the critical moment there is no certainty that a switchover to it will succeed.
2. The disaster recovery center does not serve traffic in normal times, so its resources sit idle and the cost is high.
3. The disaster recovery center does not serve traffic in normal times, so the equipment rooms that actually provide services are still confined to a single region.
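Because the RTO is a time budget, whether a switchover met it can be checked directly from the timeline of an incident or a drill. The following sketch shows that check. The function name and the timestamps are assumptions for illustration, not real measurements.

```python
from datetime import datetime, timedelta


def meets_rto(outage_start: datetime, service_restored: datetime,
              rto: timedelta) -> bool:
    """Return True if the service was restored within the RTO."""
    if service_restored < outage_start:
        raise ValueError("service_restored is earlier than outage_start")
    return service_restored - outage_start <= rto


# Illustrative drill timestamps: the outage lasts 42 minutes.
start = datetime(2024, 1, 1, 10, 0, 0)
restored = datetime(2024, 1, 1, 10, 42, 0)
print(meets_rto(start, restored, rto=timedelta(minutes=30)))  # False: 42 min > 30 min
print(meets_rto(start, restored, rto=timedelta(hours=1)))     # True: 42 min <= 60 min
```

A check like this is only meaningful if switchovers are actually exercised. A standby center that never carries traffic produces no such measurements until a real disaster, which is the uncertainty described in the first problem above.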
The concept of application multi-active
Application multi-active is an advanced form of application disaster recovery (DR) technology. It means that production systems deployed in equipment rooms in the same city or in remote locations partially or completely mirror one another, and the applications in all equipment rooms serve external traffic at the same time. When a disaster occurs, a multi-active system can switch service traffic within minutes, and users may not even notice the fault.
Common application multi-active architectures include same-city multi-active, remote multi-active, and hybrid cloud multi-active. Compared with traditional DR, application multi-active has the following four advantages:
- Minute-level RTO. Recovery is fast. The average recovery time is less than 30 seconds for Alibaba's internal production systems and less than 1 minute for external customers' production systems.
- Full utilization of resources. No resources sit idle. Multiple equipment rooms and their resources are all in use, which avoids waste.
- High switching success rate. Backed by a mature multi-active architecture and a visual operation and maintenance platform, the switching success rate is higher than that of existing disaster recovery architectures. Across thousands of switchovers per year inside Alibaba, the success rate is above 99.9%.
- Precise traffic control. Application multi-active supports keeping traffic closed from top to bottom and relies on precise traffic routing to direct specific business traffic to the corresponding equipment room. Building on this, enterprises can develop capabilities such as full-domain gray release and guaranteed handling of key traffic.
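The last advantage in the list above, directing specific business traffic to a chosen equipment room, can be pictured as a small rule table evaluated per request. The sketch below is one possible reading under stated assumptions: the `Rule` fields, the business tags, and the room names are hypothetical, and a real platform would likely match on many more attributes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical routing rule: send a share of one business's traffic to a room."""
    business: str  # business tag carried by the request, e.g. "checkout"
    room: str      # target equipment room
    percent: int   # 0-100: share of that business's users routed to `room`


def bucket(user_id: str) -> int:
    """Stable bucket in [0, 100) so a user stays on one side of a gray release."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(business: str, user_id: str, rules: list[Rule], default_room: str) -> str:
    """Return the room for a request; the first matching rule wins."""
    for rule in rules:
        if rule.business == business and bucket(user_id) < rule.percent:
            return rule.room
    return default_room


rules = [
    Rule(business="checkout", room="room-b", percent=100),  # key traffic pinned to room-b
    Rule(business="search", room="room-c", percent=10),     # 10% gray release in room-c
]
print(route("checkout", "user-1", rules, default_room="room-a"))  # room-b
print(route("profile", "user-1", rules, default_room="room-a"))   # room-a
```

Raising `percent` on the `search` rule step by step widens the gray release, and setting it back to 0 returns all of that traffic to the default room.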
By 2025, more than 50% of enterprises will be using distributed clouds. Public cloud services will extend to edge computing and IDC, and distributed cloud will cover all scenarios. Multi-active scenarios and technologies for applications that span clouds, platforms, and geographies will begin to emerge. Application systems must be able to escape disasters and failures at any time, and smooth migration to the cloud is a key decision point for every decision maker.
As services and architectures evolve, DR governance must address issues that evolve along with them. Enterprises are increasingly concerned with how to implement an application multi-active disaster recovery architecture and the organizational collaboration it requires.
The above is excerpted from the White Paper on Application of Live Technology.