AWS Disaster Recovery white paper translation summary

Reference text from AWS official website: docs.aws.amazon.com/zh_cn/white…

Disclaimer: This article was written to review for the exam, so parts of the original have been removed and my own understanding, summaries, and explanations added. If that concerns you, please read the official text.

The opening

Disaster recovery consists of two processes: preparing for disasters and recovering from them. This article summarizes best practices for disaster recovery of various AWS workloads, and describes different ways to reduce risk and meet a workload's Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

Disaster recovery

AWS classifies disasters into three categories:

  • Natural disasters: e.g. earthquakes and floods
  • Technical failures: e.g. power outages and network interruptions
  • Human actions: e.g. malicious deletion and incorrect configuration changes

Disaster Recovery and High Availability differences

Both high availability and disaster recovery rely on some of the same best practices, such as failure monitoring, deployment to multiple locations, and automatic failover. However, while availability focuses on the components of a workload, disaster recovery focuses on preserving discrete copies of the entire workload. The effectiveness of disaster recovery is measured mainly by recovery time.

We also have to consider workload availability in a disaster recovery plan, because it affects the choice of approach. A workload running on a single Amazon EC2 instance in a single Availability Zone is not highly available: if the site is affected by a local flood, we need to fail over to another Availability Zone to meet the DR goal. As a result, we usually deploy services to multiple Availability Zones, or, even more safely, to multiple Regions.

Availability and disaster recovery also differ in how data is handled. For high availability, we usually replicate data to other devices (splitting requests, and replicating read modules across devices to speed up reads). However, if one or more files on the primary storage device are deleted or corrupted, these destructive changes are immediately replicated to the secondary storage device. In that case, despite high availability, we cannot recover the correct pre-disaster data. For this reason, disaster recovery usually relies on point-in-time backups, which let you return the data to a specific point in time. That is the difference between the two.
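
The point above can be made concrete with a short sketch: replication would propagate the corruption, but with point-in-time backups you can restore the latest backup taken before the corruption occurred. The timestamps below are made-up example values:

```python
from datetime import datetime

def pick_restore_point(backup_times, corruption_time):
    """Return the latest backup taken strictly before the corruption,
    or None if every backup already contains the corrupted data."""
    candidates = [t for t in backup_times if t < corruption_time]
    return max(candidates) if candidates else None

# Backups taken every 6 hours; corruption detected at 14:30
backups = [datetime(2024, 1, 1, h) for h in (0, 6, 12, 18)]
corrupted_at = datetime(2024, 1, 1, 14, 30)

print(pick_restore_point(backups, corrupted_at))  # 2024-01-01 12:00:00
```

A replicated copy, by contrast, would already contain the 14:30 corruption, which is why replication alone is not a disaster recovery strategy.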

Business Continuity Plan (BCP)

I won’t go into detail here; simply put, disaster recovery plans should be based on the actual business situation, neither so lax that user availability suffers nor overly aggressive. docs.aws.amazon.com/zh_cn/white…

Recovery objectives (RTO and RPO)

This is a key point of the exam; the official diagram illustrates it.

  • RTO: how quickly the business must be restored to normal operation after a disaster (the maximum acceptable downtime)
  • RPO: how much data loss, measured in time, can be tolerated in a disaster (the maximum acceptable gap between the last recovery point and the disaster)
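
As a concrete illustration of these two objectives, the following sketch computes the recovery point and recovery time actually achieved in an incident. The timestamps are made-up example values:

```python
from datetime import datetime, timedelta

def recovery_metrics(last_backup, disaster, service_restored):
    """Actual recovery point (data lost) and recovery time (downtime)."""
    rpo_actual = disaster - last_backup       # data written after the last
                                              # backup is lost
    rto_actual = service_restored - disaster  # how long the workload was down
    return rpo_actual, rto_actual

last_backup      = datetime(2024, 5, 1, 2, 0)
disaster         = datetime(2024, 5, 1, 9, 30)
service_restored = datetime(2024, 5, 1, 11, 0)

rpo, rto = recovery_metrics(last_backup, disaster, service_restored)
print(rpo, rto)  # 7:30:00 1:30:00
```

If the targets were, say, an RTO of 2 hours and an RPO of 8 hours, this incident would meet both; a tighter RPO would require more frequent backups or continuous replication.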

Cloud-based disaster recovery

Disaster recovery strategies are evolving as technology evolves. Traditional local disaster management strategies focus on physical disk backup and migration. When services were transferred to AWS cloud services, our organization needed to re-evaluate the business impact, risk, and cost viability of our previous disaster recovery strategy. In general, AWS has the following advantages over traditional disaster recovery:

  • Recovery is faster and less complex.
  • Simple, repeatable testing schemes allow recovery to be tested more easily and more often.
  • Reduced management and operational burden.
  • Fewer errors during recovery and higher recovery success rates.

AWS transforms fixed capital expenditures for our physical backup data centers into variable expenditures based on environmental changes, which can significantly reduce costs.

For many organizations running services on premises, a workload outage means restoring backup or replicated data to a secondary data center in a timely manner. When deploying workloads on AWS, they can instead build well-architected workloads and rely on the design of the AWS global cloud infrastructure to help mitigate the impact of such outages. See the AWS Well-Architected Framework – Reliability Pillar whitepaper for more information on architectural best practices for designing and operating reliable, secure, efficient, and cost-effective workloads in the cloud.

If your workload is on AWS, you don’t have to worry about data center power, air conditioning, fire suppression, and so on. You can access multiple fail-isolated availability zones (each consisting of one or more discrete data centers).

A single AWS Region

For disaster events based on the outage or loss of a physical data center, implementing highly available workloads across multiple Availability Zones within a single AWS Region can help mitigate natural and technical disasters and reduce the risk of human threats, such as data loss caused by erroneous or unauthorized activity. Each AWS Region consists of multiple Availability Zones, each isolated from failures in the other zones, and each Availability Zone in turn consists of one or more physical data centers. To better isolate problems and achieve high availability, you can partition workloads across multiple Availability Zones in the same Region. Availability Zones are designed for physical redundancy and resilience, providing uninterrupted performance even in the event of power outages, internet outages, floods, and other natural disasters.

By deploying across multiple Availability Zones in a single AWS Region, your workload is better protected against the failure of a single (or even multiple) data centers. To provide additional protection for a single-Region deployment, you can back up data and configuration, including infrastructure definitions, to another Region. This strategy narrows the scope of the disaster recovery plan to data backup and restore only. Relying on multi-AZ resilience while backing up to another AWS Region is both simpler and cheaper than the multi-Region options described in the next section. For example, by backing up to Amazon Simple Storage Service (Amazon S3), you can retrieve data immediately. However, if part of your data has more relaxed retrieval-time requirements (minutes to hours), then using Amazon S3 Glacier or Amazon S3 Glacier Deep Archive will significantly reduce the cost of your backup and restore strategy.
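
The tiering decision above can be sketched as a simple lookup from the maximum retrieval delay your DR strategy tolerates to a storage tier. The thresholds are illustrative assumptions for this sketch, not official service retrieval limits:

```python
from datetime import timedelta

def backup_tier(max_retrieval_delay):
    """Pick a backup storage tier from the tolerated retrieval delay.
    Thresholds are illustrative assumptions, not AWS-published limits."""
    if max_retrieval_delay < timedelta(minutes=5):
        return "S3 Standard"             # retrieve data immediately
    if max_retrieval_delay < timedelta(hours=12):
        return "S3 Glacier"              # minutes to hours, lower cost
    return "S3 Glacier Deep Archive"     # hours, lowest cost

print(backup_tier(timedelta(seconds=30)))  # S3 Standard
print(backup_tier(timedelta(hours=4)))     # S3 Glacier
print(backup_tier(timedelta(days=1)))      # S3 Glacier Deep Archive
```

The trade-off runs in one direction: the longer a retrieval delay you can tolerate, the cheaper the storage tier.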

Some workloads may be subject to regulatory data residency requirements. If your workload must remain within a single AWS Region, then in addition to the multi-AZ design described above for high availability, you can treat the Availability Zones within that Region as discrete locations, which helps satisfy data residency for the workload. The DR strategies described in the following sections use multiple AWS Regions, but they can also be implemented using Availability Zones instead of Regions.

Multiple AWS Regions

For disaster events that carry the risk of losing multiple data centers far apart, you should consider disaster recovery options that mitigate natural and technical disasters affecting an entire AWS Region. All of the strategies described in the next section can be implemented as multi-Region architectures to protect against such disasters.

Four strategies for disaster recovery

AWS provides four strategies:

  • Backup and restore: In simple terms, data is backed up to multiple Regions and infrastructure is then deployed as code (IaC) using services such as AWS CloudFormation or the AWS Cloud Development Kit (CDK). Without IaC, restoring the workload in the recovery Region can be complicated, increasing recovery time and possibly exceeding the RTO you have set. You also need to back up code and configuration, for example by creating Amazon Machine Images (AMIs) of your Amazon EC2 instances. You can use AWS CodePipeline to automatically redeploy application code and configuration.
    • Amazon EBS snapshots
    • Amazon DynamoDB backups
    • Amazon RDS snapshots
    • Amazon Aurora DB snapshots
    • Amazon EFS backups (when using AWS Backup)
    • Amazon Redshift snapshots
    • AWS Storage Gateway
    • Amazon FSx for Windows File Server and Amazon FSx for Lustre

    All of these services store snapshots or backups in Amazon S3; when a disaster occurs, you restore from the required snapshot.

    • Amazon S3 Cross-Region Replication (CRR)
    • S3 object versioning

    For backing up configuration, architecture, and deployment methods, the following can help you improve RTO:

    • AWS CloudFormation

      Provides infrastructure as Code (IaC) and enables you to define all AWS resources in your workload so that you can reliably deploy and redeploy to multiple AWS accounts and AWS regions.

    • AMIs and instance metadata, for EC2 instances

Pilot light

With Pilot Light, you replicate data from one Region to another and pre-provision a copy of the core workload infrastructure. The resources needed to support data replication and backup, such as databases and object storage, are always on. Other elements, such as application servers, have application code and configuration loaded but are switched off, used only during testing or when disaster recovery failover is invoked. Unlike the backup and restore approach, your core infrastructure is always available, and you always have the option to quickly provision a full production environment by switching on and scaling out the application servers.

The Pilot Light approach minimizes the ongoing cost of disaster recovery by minimizing active resources, and simplifies recovery when a disaster occurs because the core infrastructure is already in place. This recovery option requires you to change your deployment method: core infrastructure changes must be made in each Region, and workload changes (configuration, code) must be deployed to each Region simultaneously. This step can be simplified by automating your deployment and using infrastructure as code (IaC) to deploy the infrastructure across multiple accounts and Regions (deploying the full infrastructure to the primary Region and a scaled-down or switched-off infrastructure to the DR Region). It is recommended that you use a different account for each Region to provide the highest level of resource and security isolation (credential theft should also be part of your disaster recovery plan).

With this approach, you must also mitigate data disasters. Continuous data replication protects you from certain types of disaster, but it may not protect you from data corruption or destruction unless your strategy also includes versioning of stored data or point-in-time recovery options. You can back up the replicated data in the DR Region to create point-in-time backups within that Region.

For Pilot Light, continuously replicating data to live databases and data stores in the DR Region is the best approach for a low RPO (when combined with the point-in-time backups discussed earlier). AWS provides continuous, cross-Region, asynchronous data replication through the following services and resources:

  • Amazon Simple Storage Service (Amazon S3) Replication
  • Amazon RDS read replicas
  • Amazon Aurora global database
  • Amazon DynamoDB global tables

When failing over to run read/write workloads in the disaster recovery Region, you must promote an RDS read replica to become the primary instance. For database instances other than Aurora, this process takes several minutes to complete, and a restart is part of the process. Using the Amazon Aurora global database provides several advantages over RDS cross-Region replication and failover. A global database uses dedicated infrastructure that leaves your database fully available to serve your application, and can replicate to secondary Regions with typical latency of under a second. If your primary Region suffers performance degradation or an outage, you can promote one of the secondary Regions to take on read/write responsibilities in under a minute, even in the case of a complete Regional outage. Promotion can be automated and requires no restart.

A scaled-down version of the core workload infrastructure, with fewer or smaller resources, must be deployed in the DR Region. With AWS CloudFormation, you can define your infrastructure and deploy it consistently across AWS accounts and AWS Regions. AWS CloudFormation uses predefined pseudo parameters to identify the AWS account and Region in which a stack is deployed, so you can implement conditional logic in your CloudFormation templates to deploy only a scaled-down version of the infrastructure in the DR Region. For EC2 instance deployments, an Amazon Machine Image (AMI) supplies information such as hardware configuration and installed software. You can implement an EC2 Image Builder pipeline to create the AMIs you need and copy them to your primary and backup Regions. This helps ensure that these golden AMIs have everything needed to redeploy or scale out the workload in a new Region in the event of a disaster. Amazon EC2 instances are deployed in a scaled-down configuration (fewer instances than in your primary Region). You can also keep EC2 instances in a stopped state, in which case you pay no EC2 compute fees, only for the storage used. To start EC2 instances, you can create scripts using the AWS Command Line Interface (CLI) or the AWS SDKs. To scale the infrastructure up to support production traffic, see AWS Auto Scaling in the warm standby section.
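
The conditional-logic idea above can be sketched as a CloudFormation template fragment keyed on the `AWS::Region` pseudo parameter. This is a hedged illustration, not a complete template: `PrimaryRegion`, `AppServerGroup`, `SubnetIds`, and `AppLaunchTemplate` are assumed names, and the sizes are arbitrary:

```yaml
Parameters:
  PrimaryRegion:
    Type: String
    Default: us-east-1            # assumed primary Region

Conditions:
  # True when the stack is being deployed anywhere except the primary Region
  IsDRRegion: !Not [!Equals [!Ref "AWS::Region", !Ref PrimaryRegion]]

Resources:
  AppServerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      # Scaled down in the DR Region, full size in the primary Region
      MinSize: !If [IsDRRegion, "1", "3"]
      DesiredCapacity: !If [IsDRRegion, "1", "3"]
      MaxSize: "9"
      VPCZoneIdentifier: !Ref SubnetIds            # hypothetical parameter
      LaunchTemplate:
        LaunchTemplateId: !Ref AppLaunchTemplate   # hypothetical resource
        Version: !GetAtt AppLaunchTemplate.LatestVersionNumber
```

During failover, raising `MinSize`/`DesiredCapacity` in the DR Region (or letting Auto Scaling react to the shifted traffic) brings the environment up to production capacity.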

For Pilot Light, all traffic initially goes to the primary Region and is switched to the disaster recovery Region if the primary Region becomes unavailable. There are two traffic-management options to consider among AWS services. The first is Amazon Route 53. With Amazon Route 53, you can associate multiple IP endpoints in one or more AWS Regions with a Route 53 domain name, and then route traffic to the appropriate endpoint under that domain name. Amazon Route 53 health checks monitor these endpoints; using them, you can configure DNS failover to ensure that traffic is sent only to healthy endpoints.
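
The failover behaviour that these health checks enable can be sketched as priority-ordered endpoint selection: traffic goes to the primary endpoint while it is healthy, otherwise to the DR endpoint. The endpoint names below are illustrative assumptions, and this is a simulation of the routing decision, not Route 53's actual implementation:

```python
def route(endpoints):
    """endpoints: list of (name, healthy) pairs in priority order.
    Return the first healthy endpoint, or None if none is healthy."""
    for name, healthy in endpoints:
        if healthy:
            return name
    return None  # no healthy endpoint: e.g. serve a static error page

# Normal operation: primary is healthy, so it receives the traffic
print(route([("primary.us-east-1", True),  ("dr.us-west-2", True)]))
# Disaster: primary fails its health check, traffic fails over to DR
print(route([("primary.us-east-1", False), ("dr.us-west-2", True)]))
```

In Route 53 terms, this corresponds to a failover routing policy with a health check attached to the primary record.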

The second option is AWS Global Accelerator. Using anycast IP, you can associate multiple endpoints in one or more AWS Regions with the same static IP address. AWS Global Accelerator then routes traffic to the appropriate endpoint associated with that address. Global Accelerator health checks monitor the endpoints; using them, AWS Global Accelerator automatically checks the health of your application and routes user traffic only to healthy application endpoints. Global Accelerator offers lower latency to application endpoints because it leverages the extensive AWS edge network to put traffic onto the AWS network backbone as soon as possible. Global Accelerator also avoids the caching problems that can occur with DNS-based systems such as Route 53.

Warm standby

The warm standby approach maintains a scaled-down but fully functional copy of your production environment in another Region. It extends the Pilot Light concept and further reduces recovery time, because your workload is always running in another Region. This approach also makes it easier to run tests, or to test continuously, to increase confidence in your ability to recover from a disaster.

The difference between Pilot Light and warm standby can sometimes be hard to grasp. Both include a copy of the environment in your DR Region. The difference is that Pilot Light cannot process requests without first taking additional action, whereas warm standby can handle traffic immediately (at reduced capacity). Pilot Light requires you to “turn on” the servers, possibly deploy additional (non-core) infrastructure, and scale up, whereas warm standby only requires you to scale up (everything is already deployed and running). Use your RTO and RPO requirements to choose between these approaches.

AWS Auto Scaling plays an effective role in warm standby. Generally, only minimal resources are deployed in the warm standby Region; Auto Scaling helps the environment quickly scale up to normal production capacity when disaster recovery is invoked.

Multi-site active/active

A multi-site active/active strategy serves traffic from all Regions to which it is deployed, whereas a hot standby strategy serves traffic from only a single Region, with the other Regions used only for disaster recovery. With the multi-site active/active approach, users can access your workload in any Region where it is deployed. This is the most complex and costly disaster recovery approach, but with the right technology choices and implementation it can reduce recovery time to near zero for most disasters (although data corruption may require restoring from backups, which usually means a non-zero recovery point). Hot standby uses an active/passive configuration, in which users are directed to only one Region and the DR Region takes no traffic. Most customers find that if they are going to build a full environment in a second Region, it makes sense to use it actively (active/active). Alternatively, if you do not want to use both Regions to handle user traffic, warm standby offers a more economical and operationally simpler approach.

Because a multi-site active/active workload runs in multiple Regions, there is no failover in the usual sense. Disaster recovery testing in this case focuses on how the workload responds to the loss of a Region: Is traffic routed away from the failed Region? Can the other Regions handle all of the traffic? You also need to test for data disasters. Backup and restore are still required and should be tested regularly. Note also that for data disasters involving data corruption, deletion, or destruction, the recovery time will always be greater than zero, and the recovery point will always be some point in time before the disaster was discovered. If you accept the additional complexity and cost of a multi-site active/active (or hot standby) approach in order to maintain near-zero recovery times, you should take additional measures to maintain security and prevent human error, to mitigate human-caused disasters.

For the active/passive scenarios discussed earlier (Pilot Light and warm standby), both Amazon Route 53 and AWS Global Accelerator can be used to route network traffic to the active Region. For the active/active strategy here, both services also let you define policies that determine which users go to which active Region endpoints. With AWS Global Accelerator, you can set traffic dials to control the percentage of traffic directed to each application endpoint. Amazon Route 53 supports this percentage approach as well as several other policies, including geolocation-based and latency-based routing. Global Accelerator automatically leverages the extensive network of AWS edge servers to put traffic onto the AWS network backbone as soon as possible, reducing request latency.
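
The "traffic dial" idea can be sketched as deterministically sending a fixed percentage of users to each active Region. The hashing scheme below is an illustrative assumption for the sketch, not Global Accelerator's or Route 53's real algorithm:

```python
import hashlib

def pick_region(user_id, dials):
    """dials: list of (region, weight) pairs whose weights sum to 100.
    Hash the user into one of 100 buckets, then map buckets to Regions
    in proportion to the dial weights."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    cumulative = 0
    for region, weight in dials:
        cumulative += weight
        if bucket < cumulative:
            return region
    return dials[-1][0]  # guard against rounding in the weights

dials = [("us-east-1", 70), ("eu-west-1", 30)]
counts = {"us-east-1": 0, "eu-west-1": 0}
for i in range(1000):
    counts[pick_region(f"user-{i}", dials)] += 1
print(counts)  # roughly a 70/30 split
```

Because the choice is a pure function of the user ID, the same user always lands in the same Region until the dials change, which keeps session affinity simple.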

Data replication with this strategy yields a near-zero RPO. AWS services such as the Amazon Aurora global database use dedicated infrastructure that leaves your database fully available to serve your application, and can replicate to a secondary Region with typical latency of under a second. With an active/passive strategy, writes occur only in the primary Region. The difference with active/active is how the design handles writes in each active Region. Reads are usually designed to be served from the Region closest to the user, called read local. For writes, you have several options:

  • A write global strategy routes all writes to a single Region. If that Region fails, another Region is promoted to accept writes. The Aurora global database is well suited to write global, because it supports synchronized read-only copies across Regions and you can promote one of the secondary Regions to take on read/write responsibilities in under a minute.

  • A write local strategy routes writes to the nearest Region (just like reads). Amazon DynamoDB global tables support this strategy, allowing reads and writes in every Region where the global table is deployed. DynamoDB global tables use last-writer-wins reconciliation between concurrent updates.

  • A write partitioned strategy assigns writes to a specific Region based on a partition key, such as a user ID, to avoid write conflicts. Amazon S3 replication configured bi-directionally can be used in this case, currently supporting replication between two Regions. When using this approach, make sure replica modification sync is enabled on both bucket A and bucket B, so that replica metadata changes such as object access control lists (ACLs), object tags, or object locks are replicated on the replicated objects. You can also configure whether to replicate delete markers between buckets in each active Region. In addition to replication, your strategy must include point-in-time backups to protect against data corruption or destruction events.
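
Two of the write strategies above can be sketched in a few lines: last-writer-wins reconciliation (the approach DynamoDB global tables use for concurrent updates) and pinning each user's writes to a home Region by partition key. The details below are illustrative assumptions, not either service's real implementation:

```python
import hashlib

def last_writer_wins(a, b):
    """Each version is a (timestamp, value) pair; keep the newer write."""
    return a if a[0] >= b[0] else b

def home_region(user_id, regions):
    """Write partitioning: pin each user's writes to one Region so the
    same key is never written concurrently in two Regions."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return regions[h % len(regions)]

# Concurrent updates replicated from two Regions: the later write wins
print(last_writer_wins((10, "from us-east-1"), (12, "from eu-west-1")))

# Every write for user-42 goes to the same Region
print(home_region("user-42", ["us-east-1", "eu-west-1"]))
```

Last-writer-wins silently discards the older concurrent update, which is why the text stresses keeping point-in-time backups alongside replication; write partitioning avoids the conflict entirely, at the cost of routing logic in the application.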

AWS CloudFormation is a powerful tool for consistently deploying infrastructure across AWS accounts in multiple AWS Regions. AWS CloudFormation StackSets extends this capability, enabling you to create, update, or delete CloudFormation stacks across multiple accounts and Regions in a single operation. While AWS CloudFormation uses YAML or JSON to define infrastructure as code, the AWS Cloud Development Kit (CDK) lets you define infrastructure as code in familiar programming languages. Your code is converted to CloudFormation, which is then used to deploy the resources in AWS.