Brief introduction: This article describes how the Alibaba Cloud block storage snapshot service, built on high-performance ESSD cloud disks, improves snapshot performance and delivers a lightweight, real-time user experience, and it explains the technical principles behind the service. Drawing on industry trends and cloud data protection scenarios, the snapshot service offers enterprise users and backup vendors advanced snapshot-based data protection, meeting cloud users' urgent data protection requirements and ensuring business continuity.

In July 2021, Gartner, an internationally renowned consulting company, released its “Magic Quadrant” for public cloud IaaS (Infrastructure as a Service) and PaaS (Platform as a Service) platforms. On the strength of its technical capabilities, Alibaba Cloud entered the “Visionary” quadrant for the first time. Within that evaluation, Alibaba Cloud storage ranked first in its individual score, and Alibaba Cloud's computing, storage, network, and security capabilities scored first in the world. ESSD cloud disks provide users with highly available, reliable, high-performance block-level random access and native snapshot data protection.

New demands from cloud native business

As cloud native technology matures, more and more enterprises build large-scale, elastically scalable distributed services on the cloud, relying on cloud virtualization and elastic scaling and on the fast-growing cloud native ecosystem of distributed frameworks, containers, orchestration systems, continuous delivery, and rapid iteration. As enterprise deployments grow, demand for storage, compute, and other resources grows exponentially, and traditional data protection solutions cannot keep up with these changes. Users face fiercer market competition and urgently need cloud data protection that matches the scale and pace of their business. Although cloud computing and cloud native have changed the background and scenarios of data protection, what users demand of it has not changed. Data protection is still measured by the recovery time objective (RTO) and the recovery point objective (RPO).

The primary goal for users is still business continuity: recovering services quickly when an interruption threatens. At the same time, businesses are under pressure to grow and expand rapidly. Based on their service scenarios, users have the following urgent requirements for cloud data protection and snapshot services:

  • Short creation time: Snapshots are created quickly so that critical services can be backed up immediately.
  • Instant availability: Snapshots are usable as soon as they are created, so a cloud disk can be rolled back in an emergency.
  • Service expansion: A sudden increase in service volume requires rapid scale-out.
  • Whole-instance protection: Consistent data protection for a single ECS instance, or for multiple ECS instances, with multiple associated disks.
  • Test and verification: Data can be tested, verified, and recovered outside the production environment.
  • Fast recovery: File system and application data are backed up consistently, avoiding application downtime.
  • Container backup: Rapid iteration and release in container environments make it necessary to protect both metadata and application data.

According to the Storage Networking Industry Association (SNIA), a snapshot is a fully usable copy of a defined collection of data that contains an image of the data as it appeared at the point in time at which the copy was initiated. An Alibaba Cloud block storage snapshot provides a consistent image of an ESSD cloud disk at a given point in time. To keep pace with industry trends, the snapshot service has tracked users' new requirements and scenarios and has iterated continuously, adding advanced enterprise features to ESSD cloud disk snapshots: instant snapshot availability, consistency group snapshots, application-consistent snapshots adapted to distributed application architectures, and cross-region snapshot copy for remote disaster recovery. These features are offered both independently and integrated with other products. They serve enterprise users on the cloud in big data, gaming, artificial intelligence, finance, and other fields, and they have drawn feedback from other Alibaba Cloud teams and their users, including the RDS cloud database team, Hybrid Cloud Backup, Elastic Container Instance (ECI), and Container Service ACK:

  • RDS cloud database team, reporting feedback from industry users: RDS second-level backup is on par with database backup products in the industry. It reduces the instance resources consumed by the original physical file backup and effectively lowers data protection risk.
  • Elastic Container Instance (ECI), container acceleration. Customer Tousson's evaluation: the ultra-fast cache acceleration feature speeds up container application release, shortens simulation platform computation to under 5 minutes per task on average, and greatly shortens the product release cycle.
  • According to Hybrid Cloud Backup customers, the whole-instance application-consistent backup capability is comparable to the snapshot function of the VMware virtualization platform.
  • The snapshot service provides consistency group snapshots and application consistency, which fully meets the capabilities Gartner evaluated for Alibaba Cloud block storage in 2021. The Container Service ACK team measures its container backup capabilities against Forrester's 2021 evaluation.

Typical scenarios

Lightweight, real-time instant snapshot availability, together with the advanced consistency group snapshot and application-consistent snapshot features, gives enterprise users and third-party backup vendors fast copy data management (CDM) scenarios such as rapid backup and recovery, disaster recovery (DR) testing, copy utilization, and DR switchover. Gartner’s Hype Cycle for storage and data protection, released in July 2021, lists container backup, cloud backup, and copy data management as data protection trends for the coming years. In Gartner’s basic definition of copy data management, a “golden image” is generated on secondary storage from application-consistent snapshots of primary storage and is used for backup, disaster recovery, and testing, with support for heterogeneous storage regarded as a basic capability. The advanced ESSD snapshot features of Alibaba Cloud fully meet the conditions for building CDM and help users implement native copy data management for data protection on the cloud. Typical scenarios:

Backup and recovery: instant backup and standard backup are combined to provide both near and remote recovery points. For whole-instance protection of ECS instances and for container applications in K8S environments, instantly available snapshots are created on a regular schedule. With the consistency group snapshot and instant availability features enabled, local instant snapshots can be generated at second-level intervals. The instant snapshot copy is retained locally as an ultra-fast backup and supports second-level recovery with no loss of I/O performance. Application-consistent snapshots are generated periodically for the enterprise applications running on top. Local snapshot copies are uploaded over the network to Object Storage Service (OSS) as standard backups. Once a standard backup has been uploaded, it is visible to all availability zones in the local region, which suits historical data that must be retained for a long time.
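
The two-tier scheme above (frequent local copies plus periodic uploads) can be sketched as a small scheduling model. This is only an illustration under stated assumptions: the `SnapshotPolicy` class, its fields, and the intervals are hypothetical and are not part of the snapshot service's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotPolicy:
    """Hypothetical two-tier policy: frequent local copies, periodic uploads."""
    local_interval_s: int   # seconds between local instant snapshots
    local_keep: int         # number of local copies retained
    upload_every: int       # every N-th snapshot is also uploaded as a standard backup


def plan(policy, horizon_s):
    """Return (time, tier) pairs for all snapshots taken in [0, horizon_s)."""
    events = []
    for i, t in enumerate(range(0, horizon_s, policy.local_interval_s)):
        events.append((t, "local"))
        if i % policy.upload_every == 0:
            events.append((t, "standard"))
    return events


def worst_case_rpo_s(policy, tier):
    """Largest data-loss window if recovery must use the given tier."""
    if tier == "local":
        return policy.local_interval_s
    return policy.local_interval_s * policy.upload_every


def retained_local(policy, events):
    """Times of the local copies still kept after applying retention."""
    local_times = [t for t, tier in events if tier == "local"]
    return local_times[-policy.local_keep:]
```

With a 10-second local interval and an upload every sixth snapshot, the model gives a 10-second worst-case recovery point for local recovery and 60 seconds when only the uploaded copies are usable, which is the near/remote trade-off the paragraph describes.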

DR testing: DR tests based on instant backup. Copy data management requires the disaster recovery environment to be tested regularly. Periodic testing improves the reliability of the DR environment and guards against the case where, because of configuration problems or environment changes, services cannot be recovered quickly on the DR system when a disaster occurs. Using fast clone technology on local snapshot copies, DR instances and container applications are brought up, and the backup data is periodically mounted, tested, and verified. In traditional solutions based on replication, testing can begin only after the snapshot has been replicated to the DR site. In instant backup mode, the DR side can clone, mount, and start tests within seconds.

Copy utilization: data analysis based on instant backup. Without affecting the production environment, the DR environment uses fast clone technology to bring up container applications on a schedule, run big data computation and analysis on the copies, and mine the value of the data. In practice, copy utilization also covers MySQL database applications that use instant backup to bring up a read-only standby database immediately for offline data analysis.

DR switchover: services are switched from the production environment to the DR environment. If a disaster occurs and services cannot be recovered within a short time, production cannot continue. In this case, services are switched from the production center to the disaster recovery center. After the production center recovers, the services can be switched back.

Compared with traditional copy data management (CDM) solutions, cloud computing and cloud native environments offer large-scale, elastic, homogeneous compute, so enterprise users do not need to invest in equipment or software. Instant backup and fast clone technology greatly reduce the recovery time objective (RTO) for copy-based development, testing, and DR switchover. The unified backup data format of the cloud snapshot service reduces the number of copies needed across management processes and eliminates data format incompatibilities among backup software.

Technical principles

The distributed snapshot algorithm and its implementation have been optimized in a number of ways so that users get lightweight, real-time data protection without having to worry about performance. Light: I/O performance is not affected during snapshot creation. Fast: ESSD cloud disk snapshots can be created, rolled back, and cloned in seconds, meeting users’ requirements for real-time data protection and rapid DevOps orchestration.

Instant availability feature

With the instant availability feature, the snapshot service still supports data backup, compliance, and long-term archiving, including one-click backup of cloud disk data to Alibaba Cloud Object Storage Service, and it also lets a snapshot protection policy retain local snapshot copies at second-level intervals. This enables lightweight snapshot creation, fast clones that are available in real time, and second-level lossless rollback.

Fast clone: In a cross-availability-zone DR environment isolated from production, a new disk is cloned from a snapshot to obtain a writable snapshot for application test verification and service recovery preparation. Fast clone also relieves service pressure on the cloud by enabling horizontal scale-out. For example, scaling out MySQL database applications, building standby databases, creating instances, and separating reads from writes all require instances to come up within seconds. Fast clone uses lazy loading to make the data of a local snapshot copy available within seconds, both in the local region and across clusters, so new disks can be cloned quickly and instances brought up in seconds.
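
The lazy-loading idea can be illustrated with a toy model in which a clone is usable immediately and pulls each block from the snapshot only when it is first read. This is a minimal sketch, not the ESSD implementation: the `LazyClone` class is hypothetical, and a plain dictionary stands in for the snapshot copy.

```python
class LazyClone:
    """Writable clone that loads snapshot blocks only on first access."""

    def __init__(self, snapshot_blocks):
        self._snapshot = snapshot_blocks   # read-only source: block number -> bytes
        self._local = {}                   # blocks already loaded or overwritten
        self.fetches = 0                   # how many blocks were pulled from the snapshot

    def read(self, block_no):
        # Pull the block from the snapshot the first time it is needed.
        if block_no not in self._local:
            self._local[block_no] = self._snapshot[block_no]
            self.fetches += 1
        return self._local[block_no]

    def write(self, block_no, data):
        # A whole-block overwrite never needs the snapshot's old contents.
        self._local[block_no] = data


snapshot = {0: b"boot", 1: b"data"}
clone = LazyClone(snapshot)        # ready at once: nothing has been copied yet
clone.write(1, b"new")             # the clone is writable
first = clone.read(0)              # block 0 is fetched here, on demand
```

Creation cost is independent of disk size in this model because no block is copied up front, which is why a clone can be brought up quickly and filled in as it is used.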

Second-level rollback: Rollback from a local snapshot copy to the cloud disk's local storage completes in seconds with no loss of I/O performance. Snapshot generation is based on improved redirect-on-write (ROW) technology and holographic index technology. As data blocks written to the ESSD cloud disk change, cloud disk reads are optimized to follow the read pattern that gives ESSD its best I/O performance. Because no data has to be pulled from remote object storage, rollback completes in seconds without degrading I/O performance.
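
To show why rollback from a local copy can be fast, here is a toy model of plain redirect-on-write. It does not reproduce the improved ROW or holographic index techniques mentioned above, whose details the article does not give; the `RowDisk` class and its methods are hypothetical.

```python
class RowDisk:
    """Toy redirect-on-write volume: data is append-only, snapshots copy only the index."""

    def __init__(self):
        self._store = []       # append-only list of block payloads
        self._index = {}       # logical block number -> position in _store
        self._snapshots = {}   # snapshot id -> frozen copy of the index

    def write(self, block_no, data):
        # New data always goes to a new location; old data stays untouched.
        self._store.append(data)
        self._index[block_no] = len(self._store) - 1

    def read(self, block_no):
        pos = self._index.get(block_no)
        return None if pos is None else self._store[pos]

    def snapshot(self, snap_id):
        # Only metadata is captured; no block payload is copied.
        self._snapshots[snap_id] = dict(self._index)

    def rollback(self, snap_id):
        # Restoring the index is enough; the old payloads are still in place.
        self._index = dict(self._snapshots[snap_id])


disk = RowDisk()
disk.write(0, b"v1")
disk.snapshot("s1")
disk.write(0, b"v2")
disk.rollback("s1")    # block 0 reads b"v1" again, with no data movement
```

In this model both snapshot and rollback touch only the index, so their cost does not depend on how much data the disk holds.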

After multiple snapshots are created for an ESSD cloud disk and a rollback is initiated, cloud disk performance remains unchanged. By contrast, on some other vendors' cloud disks, I/O read performance degrades to varying degrees once multiple local snapshots are retained.

Consistency Group Snapshot

Container environments and ECS instances need to protect stateful applications that span multiple disks. Single-disk snapshots have two major problems. First, stateful applications use LVM volumes, Windows dynamic disks, or file systems that span cloud disks as persistent storage, so backing up each disk with its own snapshot yields inconsistent data. Second, database applications balance performance against data safety by storing WAL log files and data files on different storage devices, so consistent whole-system backup and disaster recovery (DR) cannot be performed reliably.

Besides stateful applications deployed in pods under K8S and on single ECS instances, cloud environments also host distributed application architectures and high-availability clusters, such as Windows Failover Cluster, active/standby application server architectures, and Oracle RAC on shared storage. These also require data consistency protection across cloud disks and nodes.

Cloud storage backends usually adopt a distributed storage architecture. Because a distributed environment has no global logical clock, it is difficult to implement consistency group snapshots within a single ECS instance, across ECS instances, within a single pod in a K8S environment, or across multiple nodes and cloud disks. Minimizing the impact of snapshots on I/O performance is a further technical challenge. In the industry, multi-disk crash-consistent snapshots are implemented in two main ways:

  • Block write I/O while the snapshot is taken, achieving point-in-time crash consistency across multiple disks.
  • Order I/O with a logical clock sequencing algorithm, which depends on the distributed storage layer and is therefore difficult to implement.

Consistency group snapshots adopt the second method. Write I/O is not blocked during the snapshot, which minimizes the impact on application performance.

Implementation principle: An I/O sequencing algorithm creates the snapshot without blocking write I/O. Many users run snapshot-based data protection only during off-peak periods because they worry about the impact of snapshot creation on I/O performance. Our optimized multi-disk consistency group snapshot algorithm overturns that impression. Building on a write-order preservation mechanism, it marks and orders I/O according to the order in which writes reach the underlying storage, and it determines the set of I/O contained in a snapshot from the snapshot completion time and this ordering. Unlike the traditional method, sequencing does not block writes. Unlike traditional copy-on-write (COW), snapshot generation uses redirect-on-write (ROW). References to the data set are generated in the background off the I/O path, so the snapshot's impact on I/O performance is minimal, and I/O performance in database read/write scenarios is not degraded.
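
The sequencing idea can be sketched as follows: every write is tagged in arrival order, a snapshot is defined by a cut in that order, and each disk's image contains exactly the writes at or before the cut. This is a simplified, single-process illustration. The `ConsistencyGroup` class is hypothetical, and a shared in-memory counter stands in for the ordering mechanism, which the article does not describe in detail and which in a real distributed store cannot be a simple global counter.

```python
import itertools


class ConsistencyGroup:
    """Toy model: one counter tags writes to all member disks in arrival order."""

    def __init__(self, disk_names):
        self._seq = itertools.count(1)
        self._last = 0
        self._logs = {name: [] for name in disk_names}   # disk -> [(seq, block, data)]

    def write(self, disk, block_no, data):
        # Writers are never paused; they only receive a sequence number.
        seq = next(self._seq)
        self._last = seq
        self._logs[disk].append((seq, block_no, data))
        return seq

    def cut(self):
        """Pick the snapshot point: the latest sequence number seen so far."""
        return self._last

    def view(self, cut_seq):
        """Contents of every disk as of cut_seq, built after the fact."""
        image = {}
        for disk, log in self._logs.items():
            blocks = {}
            for seq, block_no, data in log:
                if seq <= cut_seq:
                    blocks[block_no] = data
            image[disk] = blocks
        return image


group = ConsistencyGroup(["data", "wal"])
group.write("wal", 0, b"log1")
group.write("data", 0, b"row1")
point = group.cut()                 # snapshot point chosen here
group.write("wal", 1, b"log2")      # later writes proceed without blocking
image = group.view(point)           # excludes log2 on every disk
```

Because the cut is a single position in one total order, the WAL disk and the data disk can never disagree about which writes belong to the snapshot, which is the cross-disk consistency property the section is after.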

Test setup: two disks, two clients, 4 TB capacity, random write, iodepth=16, jobs=1, and a 16 KB write block size, measuring the impact on I/O during snapshot creation in a high-IOPS database scenario. By comparison, the impact of snapshot creation on I/O performance for vendor 1 and vendor 2 was 1 to 3 times greater.
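
The listed parameters resemble options of the fio benchmark tool, although the article does not name the tool it used. Assuming fio, a comparable random-write job could be assembled as below. The device path, I/O engine, direct I/O flag, and runtime are assumptions added for illustration, and running such a job against a raw device overwrites its contents.

```python
import shlex


def fio_randwrite_cmd(device, runtime_s=60):
    """Build an fio command line matching the stated workload parameters."""
    return [
        "fio",
        "--name=snapshot-impact",
        f"--filename={device}",    # assumption: the test targets a block device
        "--rw=randwrite",          # random write, as stated in the test setup
        "--bs=16k",                # 16 KB write block size
        "--iodepth=16",
        "--numjobs=1",             # "jobs=1" in the test description
        "--ioengine=libaio",       # assumption: asynchronous I/O on Linux
        "--direct=1",              # assumption: bypass the page cache
        "--time_based",
        f"--runtime={runtime_s}",  # assumption: fixed-duration run
    ]


cmd = fio_randwrite_cmd("/dev/vdb")    # hypothetical device path
print(shlex.join(cmd))
```

Running the same job before, during, and after snapshot creation and comparing IOPS and latency would give the kind of impact measurement the test describes.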

Application-Consistent Snapshot

ESSD cloud disk snapshots offer two consistency types: crash consistency and application consistency. Crash consistency requires file systems and applications to be able to recover from a crash on their own. It achieves a low recovery point objective (RPO) with little impact on services. However, it cannot deliver highly reliable backup data and a second-level recovery time objective (RTO) in the following scenarios:

  • Atomicity defect risk: File systems and database applications find atomic transactions hard to implement, and their implementations may be flawed. The paper “All File Systems Are Not Created Equal”, presented at USENIX OSDI, a top systems conference, illustrates possible flaws in how applications and kernels guarantee atomicity.
  • Risk of data loss: Mainstream file systems run in a performance-first mode by default, so data loss may occur with crash-consistent backups. The default data writing mode of the Ext4 file system on Linux is ordered mode, which carries a risk of data loss during file system check and repair. When database applications are configured as performance-first, service data may also be lost.
  • Long generation time and high impact: Traditional file-level physical backup and backup agents rely on logical volume snapshots, which take a long time to generate and have a high impact on the system. Backup agents require kernel drivers, which raise compatibility problems and are costly to maintain. File backup has to read the data, consuming system CPU and I/O resources. Application-consistent snapshots, by contrast, communicate with applications only at the point in time when consistency is established, and they generate no incremental data and perform no backup reads or writes.

Implementation principle: Compared with traditional backup, the application-consistent snapshot is cloud native and agent-free, which spares users the resource consumption, release complexity, software compatibility problems, kernel development, and software maintenance costs that traditional backup brings. To meet enterprise applications' data consistency requirements for storage snapshots, I/O and application transactions are quiesced using file system mechanisms in the kernel and the Volume Shadow Copy Service (VSS) on Windows. The creation protocol automatically resumes I/O based on how long I/O has been held. The consistency type of the resulting snapshot depends on the outcome of the creation protocol and on the application state. The path from the upper-layer application down to the underlying storage and the performance of the consistency components have been optimized to reduce the duration of the I/O impact to seconds. Creation interval: depending on service requirements, file-system-consistent snapshots can be created at second-level intervals and application-consistent snapshots at minute-level intervals.
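
The quiesce, snapshot, resume sequence, and the way the reported consistency type follows from its outcome, can be sketched as below. This is an illustrative model only: `FakeFilesystem`, `consistent_snapshot`, the consistency labels, and the time budget are hypothetical, and the real creation protocol is not described in enough detail here to reproduce.

```python
import time


class FakeFilesystem:
    """Stand-in for a guest file system that can be frozen and thawed."""

    def __init__(self, freeze_ok=True):
        self.freeze_ok = freeze_ok
        self.frozen = False

    def freeze(self):
        if not self.freeze_ok:
            raise RuntimeError("freeze failed")
        self.frozen = True

    def thaw(self):
        self.frozen = False


def consistent_snapshot(fs, take_snapshot, max_freeze_s=10.0):
    """Return (snapshot, consistency_type); I/O is always resumed before returning."""
    try:
        fs.freeze()
    except RuntimeError:
        # Quiescing failed: still take a snapshot, but report it as crash-consistent.
        return take_snapshot(), "crash"
    start = time.monotonic()
    try:
        snap = take_snapshot()
        elapsed = time.monotonic() - start
    finally:
        fs.thaw()   # resume I/O even if the snapshot call raised
    # Assumption: a real agent would thaw from a watchdog once the budget is
    # exceeded; this sketch only downgrades the reported type in that case.
    kind = "filesystem" if elapsed <= max_freeze_s else "crash"
    return snap, kind


fs = FakeFilesystem()
snap, kind = consistent_snapshot(fs, lambda: "snap-1")
```

The design point is that a failed or overlong quiesce does not fail the backup. The snapshot is still taken, and only its consistency label is downgraded.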

From crash consistency to application consistency, and from single-disk consistency to consistency across groups of cloud disks, ESSD snapshots cover every snapshot consistency type offered by public cloud block storage in the industry. Compared with other vendors in security risk and in extensibility of application support, the native agent-free snapshot has these advantages: no resident service, no exposure from a public IP address or open ports, role-based security authorization, and no additional kernel driver. It supports dynamic discovery of logical volumes and enterprise applications. Because it builds on ESSD cloud disk snapshots, it requires no backup agent and no kernel driver maintenance, and no data is read or moved inside VMs.

In tests of actual snapshot creation time and I/O impact time across major cloud vendors at home and abroad, a SQL Server database application on ESSD system and data disks achieved second-level write I/O blocking and minute-level snapshot intervals, and the creation time of application-consistent snapshots was 2 to 3 times shorter than with other vendors. Restoring from an application-consistent snapshot avoids the log replay that recovery from a crash-consistent snapshot requires, which speeds up database startup.

Industry Feature Comparison

Compared with the snapshot features of other public cloud vendors, Alibaba Cloud is the only vendor whose cloud disks (ESSD) fully support both instant snapshots and consistency group snapshots, meeting the snapshot RTO and RPO requirements of data protection for core applications on the cloud.

Future

Data protection should be proactive rather than reactive. With the rapid development of cloud native technology, and of container technology in particular, enterprise users set ever stricter requirements for the recovery point objective (RPO) and recovery time objective (RTO) of data protection on the cloud. In the future, we will launch more new functions based on ESSD cloud disks, such as high-density snapshots, continuous data protection, and application-consistent protection across multiple ECS instances. We will continue to deliver “light”, “fast”, and “snap” snapshots, reducing the RTO and RPO of enterprise data protection and offering more advanced native snapshot features to support it.

Original author: Fan Jun, Alibaba Cloud Storage

This article is original content of Alibaba Cloud and may not be reproduced without permission.