Statistics show that over 40% of businesses will not survive a major data loss event without adequate preparation and data protection. Though disasters don’t occur often, the effects can be devastating when they do.
A Disaster Recovery Plan (DRP) specifies the measures that minimize the damage of a major data loss event, so a business can respond quickly and resume operations as soon as possible. A well-designed DRP is essential to business continuity for any organization. If you are running an application, you need a Disaster Recovery Plan in place, because it is what makes IT recovery possible and limits data loss. While traditional disaster recovery solutions exist, there has been a shift to the cloud because of its affordability, stability, and scalability.
AWS lets you launch an application infrastructure across multiple Availability Zones. Within an AWS Region, each Availability Zone consists of one or more discrete data centers with redundant power, networking, and connectivity. If a single Availability Zone goes down, resources that are configured to span several zones (for example, Auto Scaling Groups and Multi-AZ databases) can fail over to, or launch replacements in, a different Availability Zone.
Of course, outages do occur occasionally. To handle them better, configure the Auto Scaling Groups (ASGs), Load Balancers, Database Clusters, and NAT Gateways across at least three Availability Zones so the application can withstand (n-1) zone failures; with three zones, that means the failure of two (as depicted in the diagram below). This only holds if each zone is sized to carry the full load on its own.
Disaster Management within an AWS Region
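The (n-1) rule has a capacity consequence that is easy to work out. The following is a minimal sketch, not an AWS API: the function name `instances_per_az` and the instance counts are assumptions made for illustration. It computes how many instances each Availability Zone would need so that the surviving zones can still carry the required load.

```python
import math


def instances_per_az(required_instances: int, az_count: int,
                     tolerated_az_failures: int) -> int:
    """Instances to run in each AZ so the surviving AZs still cover the load."""
    if not 0 <= tolerated_az_failures < az_count:
        raise ValueError("tolerated failures must be between 0 and az_count - 1")
    surviving_azs = az_count - tolerated_az_failures
    # Round up: a fraction of an instance cannot be launched.
    return math.ceil(required_instances / surviving_azs)


# Three AZs, tolerate the loss of two (n-1): every AZ must carry the full load.
print(instances_per_az(required_instances=6, az_count=3, tolerated_az_failures=2))  # 6
# Three AZs, tolerate the loss of one: each AZ carries half the load.
print(instances_per_az(required_instances=6, az_count=3, tolerated_az_failures=1))  # 3
```

In this hypothetical case, surviving two zone failures triples the steady-state footprint (18 instances for a load that needs 6), which is worth weighing against the uptime SLA.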
Regional Disaster Recovery Plan Options
A regional disaster recovery plan is a prerequisite for a successful business continuity plan and addresses questions our customers often ask, such as:
- What will be the recovery plan if the entire production AWS region goes down?
- Do you have a provision to restore the application and database in any other region?
- What is the recovery time of a regional disaster?
- What is the anticipated data loss if a regional disaster occurs?
The regional disaster recovery plan options available with AWS range from low-cost, low-complexity approaches (making backups) to more complex ones (using multiple active AWS Regions). Depending on your budget and uptime SLA, there are three options:
- Zero-cost option
- Moderate-cost option
- High-cost option
While preparing the regional disaster recovery plan, you need to define two important factors:
- RTO (Recovery Time Objective), i.e. the maximum acceptable time to restore service after a disaster
- RPO (Recovery Point Objective), i.e. the maximum acceptable amount of data loss, measured as the time between the last usable backup and the disaster
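Both figures can be estimated with simple arithmetic before any drill is run. The sketch below is illustrative only: the function names, the backup interval, the replication lag, and the step durations are all assumed values, not measurements from a real environment.

```python
from datetime import timedelta
from typing import Mapping


def worst_case_rpo(backup_interval: timedelta,
                   replication_lag: timedelta) -> timedelta:
    """Data written just after a backup is at risk until the next copy
    has landed in the recovery region."""
    return backup_interval + replication_lag


def estimated_rto(step_durations: Mapping[str, timedelta]) -> timedelta:
    """Assumes the recovery steps run one after another, so the total
    recovery time is the sum of the individual steps."""
    return sum(step_durations.values(), timedelta())


# Hypothetical figures for illustration only.
steps = {
    "launch infrastructure": timedelta(minutes=40),
    "restore database backup": timedelta(minutes=90),
    "configure and deploy application": timedelta(minutes=30),
    "switch DNS": timedelta(minutes=10),
}

print(worst_case_rpo(timedelta(hours=6), timedelta(minutes=30)))  # 6:30:00
print(estimated_rto(steps))  # 2:50:00
```

If the estimate exceeds the objective, the usual levers are more frequent backups (for RPO) and more automation or pre-provisioned infrastructure (for RTO).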
- Zero-cost option:
In this approach, you begin with database and configuration backups in the recovery region. The next step is writing automation scripts that launch the infrastructure in the recovery region as quickly as possible. In case of a disaster, the production environment is restored using the existing automation scripts and backups. Though this option increases the RTO, there is no need to launch any infrastructure for disaster recovery in advance.
- Moderate-cost option:
This approach keeps a minimum infrastructure, i.e. the database and configuration servers, running and in sync in the recovery region. This arrangement reduces the database backup restoration time, significantly lowering the RTO.
- High-cost option:
This is a resource-heavy approach that runs the production environment in multiple regions, with load balancing distributing traffic across them. Though it's an expensive arrangement, with proper planning and implementation the application can recover from a single-region disaster with little downtime.
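One way to frame the choice among the three options is to pick the cheapest one whose expected RTO still meets the target. The sketch below assumes this selection rule; the `DrOption` type, the cost ranks, and the RTO estimates are hypothetical placeholders that should be replaced with figures from your own DR drills.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    relative_cost: int          # ordering only: 1 is the cheapest
    estimated_rto_hours: float  # hypothetical; measure yours in a drill


# Placeholder estimates, not benchmarks.
OPTIONS: List[DrOption] = [
    DrOption("Zero-cost", relative_cost=1, estimated_rto_hours=8.0),
    DrOption("Moderate-cost", relative_cost=2, estimated_rto_hours=2.0),
    DrOption("High-cost", relative_cost=3, estimated_rto_hours=0.25),
]


def cheapest_option(options: List[DrOption],
                    target_rto_hours: float) -> Optional[DrOption]:
    """Return the cheapest option whose estimated RTO meets the target,
    or None if no option is fast enough."""
    viable = [o for o in options if o.estimated_rto_hours <= target_rto_hours]
    return min(viable, key=lambda o: o.relative_cost) if viable else None


choice = cheapest_option(OPTIONS, target_rto_hours=4.0)
print(choice.name if choice else "no option meets the target")  # Moderate-cost
```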
Zero-cost Option: The Steps
The zero-cost option does not require launching additional resources in the recovery region in advance; beyond storing the backups, the only cost incurred is for practicing the DR drills.
Step 1: Configure Backups
At this stage, reducing data loss is the top priority. The first step is configuring the cross-region backups in the recovery region. With a proper backup configuration, you can reduce RPO. It's essential to configure the cross-region backups of:
- S3 buckets
- Databases
- DNS zone files
- Configuration management (Chef/Puppet) server configuration
- CI/CD (Jenkins/GoCD/ArgoCD) server configuration
- Application configurations
- Ansible playbooks
- Bash scripts for deployments and cronjobs
- Any other application dependencies required for restoring the application
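A backup only helps the RPO if a recent copy actually exists in the recovery region. The following sketch shows one way to check that, assuming you can collect the timestamp of the newest recovery-region copy for each item in the list above. The function name `stale_backups`, the item names, and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_backups(last_backup_at: Dict[str, Optional[datetime]],
                  rpo: timedelta, now: datetime) -> List[str]:
    """Names of items whose newest recovery-region copy is missing
    or older than the RPO."""
    stale = []
    for name, taken_at in last_backup_at.items():
        if taken_at is None or now - taken_at > rpo:
            stale.append(name)
    return sorted(stale)


# Hypothetical inventory; None means no copy was found in the recovery region.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
inventory = {
    "database": now - timedelta(hours=2),
    "s3 buckets": now - timedelta(minutes=30),
    "dns zone file": now - timedelta(hours=26),
    "ci/cd config": None,
}

print(stale_backups(inventory, rpo=timedelta(hours=6), now=now))
# ['ci/cd config', 'dns zone file']
```

Running a check like this on a schedule would surface a silently failing backup job before a disaster does.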
Step 2: Write Infrastructure as Code (IaC) Templates - Process Automation
Using IaC to launch the AWS infrastructure and configure the application will reduce the RTO significantly, and automating the process will lessen the likelihood of human error. Many automation tools are widely available; for example:
- Terraform code to launch application infrastructure in AWS
- Ansible playbooks to configure the application AMI, Chef server, CI/CD servers, MongoDB replica set clusters, and other standalone servers
- Scripts to bootstrap the EKS cluster
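These tools are typically chained into a single recovery run. The sketch below is one possible wrapper, assuming a layout that is entirely hypothetical: an `infra` directory holding the Terraform code with a `region` variable, an `inventory/recovery.ini` inventory, a `site.yml` playbook, and a `scripts/bootstrap_eks.sh` script. It defaults to a dry run that only lists the commands.

```python
import subprocess
from typing import Callable, List


def build_recovery_commands(region: str) -> List[List[str]]:
    """Ordered commands for a recovery run. All paths and the `region`
    Terraform variable are hypothetical; adapt them to your repository."""
    return [
        ["terraform", "-chdir=infra", "init", "-input=false"],
        ["terraform", "-chdir=infra", "apply", "-auto-approve",
         "-input=false", "-var", f"region={region}"],
        ["ansible-playbook", "-i", "inventory/recovery.ini", "site.yml"],
        ["bash", "scripts/bootstrap_eks.sh", region],
    ]


def run_recovery(commands: List[List[str]],
                 runner: Callable = subprocess.run,
                 dry_run: bool = True) -> List[str]:
    """Run the commands in order and return them as strings.
    With dry_run=True nothing is executed."""
    executed = []
    for cmd in commands:
        executed.append(" ".join(cmd))
        if not dry_run:
            # check=True raises on a non-zero exit, stopping at the first failure.
            runner(cmd, check=True)
    return executed


# Dry run: print the plan without touching any infrastructure.
for line in run_recovery(build_recovery_commands("us-west-2")):
    print(line)
```

Keeping the command list separate from the runner makes the plan reviewable on its own and lets a drill substitute a fake runner.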
Step 3: Prepare for a DR Drill
Prepare for a DR drill in advance, following a defined process. The following is a sample method to get ready for a DR drill:
- Select an environment similar to production
- Prepare a plan to launch complete production infrastructure in the recovery region
- Identify all the application dependencies in the recovery region
- Configure the cross-region backup of all the databases & configurations
- Prepare the automation scripts using Terraform, Ansible, and shell scripts
- Identify the team members for the DR drill and make their responsibilities known
- Test your automation scripts and backup restoration in the recovery region
- Note the time taken for each task to get a rough estimate of the drill time
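Noting the time taken for each task can be automated rather than done by hand. The following is a minimal sketch of a timing log; the class name `DrillLog` and the task names are assumptions for illustration, and the `time.sleep` calls stand in for real restoration work.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict


class DrillLog:
    """Records how long each drill task takes, in seconds."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self.durations: Dict[str, float] = {}

    @contextmanager
    def task(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the task raised an error.
            self.durations[name] = self._clock() - start

    def total(self) -> float:
        """Rough drill time, assuming the tasks ran one after another."""
        return sum(self.durations.values())


log = DrillLog()
with log.task("restore database backup"):
    time.sleep(0.01)  # placeholder for the real restore
with log.task("launch infrastructure"):
    time.sleep(0.01)  # placeholder for the real launch

for name, seconds in log.durations.items():
    print(f"{name}: {seconds:.2f}s")
print(f"total: {log.total():.2f}s")
```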
Step 4: Execute the DR Drill
The objective of the DR drill is to test the automation scripts and measure the actual RTO. Once the plan is set, decide on a date and time to execute your DR drill. Regular practice is advisable to perfect your restoration capabilities.
Benefits of DR Drills
- Practicing DR drills boosts confidence that the production environment can be restored within the agreed timeline.
- Drills help identify gaps and provide measured RTO and RPO figures.
- They give your customers tested evidence of your disaster readiness.
Conclusion
Though AWS regions are very reliable, preparing for a regional disaster is a business-critical requirement for any SaaS application. Multi-region and multi-cloud deployments are complex, expensive architectures, so the appropriate DR option depends on your budget and on the uptime SLA you must meet during such disasters.