Zero-cost Disaster Recovery Plan for Applications Running on AWS

By some estimates, over 40% of businesses do not survive a major data loss event without adequate preparation and data protection. Though disasters don’t occur often, the effects can be devastating when they do.

A Disaster Recovery Plan (DRP) specifies the measures that minimize the damage of a major data loss event so that a business can respond quickly and resume operations as soon as possible. A well-designed DRP is imperative for business continuity in any organization. If you are running an application, you must have a Disaster Recovery Plan in place, because it enables timely IT recovery and limits data loss. While traditional disaster recovery solutions exist, there has been a shift to the cloud because of its affordability, stability, and scalability.

AWS lets you launch application infrastructure across multiple Availability Zones. Within an AWS Region, each Availability Zone consists of one or more discrete data centers with redundant power, networking, and connectivity. If a single Availability Zone goes down, resources that are configured to span multiple zones, such as Auto Scaling Groups behind a load balancer, can shift traffic to a healthy zone and launch replacement capacity there.

Of course, downtime does occur occasionally. To better handle it, configure Auto Scaling Groups (ASGs), Load Balancers, Database Clusters, and NAT Gateways in at least three Availability Zones so that the application withstands (n-1) failures; that is, the failure of two of the three Availability Zones (as depicted in the diagram below).

Diagram of disaster management in an AWS Region when two of three Availability Zones fail.

Disaster Management within an AWS Region
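
To make the (n-1) sizing concrete, the following minimal sketch computes how many instances each zone would need so that the surviving zones can still carry the full load. The function name and the figure of 6 instances are assumptions chosen for illustration; they do not come from AWS or from a real workload.

```python
import math


def instances_per_zone(required: int, zones: int, tolerated_failures: int) -> int:
    """Instances to run in each zone so that `required` instances
    remain available after `tolerated_failures` zones are lost."""
    surviving = zones - tolerated_failures
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    # The surviving zones must carry the whole load between them.
    return math.ceil(required / surviving)


# Hypothetical workload that needs 6 instances at peak, spread over 3 zones.
per_zone = instances_per_zone(required=6, zones=3, tolerated_failures=2)
print(f"{per_zone} instances per zone, {per_zone * 3} in total")
```

With three zones and two tolerated failures, each zone has to be able to carry the entire load on its own. Pre-provisioning that capacity triples the steady-state footprint in this example; relying on an Auto Scaling Group to launch replacement capacity in the surviving zone is cheaper but recovers more slowly.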

Regional Disaster Recovery Plan Options

A regional disaster recovery plan is the precursor for a successful business continuity plan and addresses questions our customers often ask, such as:

What will be the recovery plan if the entire production AWS region goes down?
Do you have a provision to restore the application and database in any other region?
What is the recovery time of a regional disaster?
What is the anticipated data loss if a regional disaster occurs?

The regional disaster recovery plan options available with AWS range from low-cost, low-complexity approaches (taking backups) to more complex ones (running multiple active AWS Regions). Depending on your budget and the uptime SLA, there are three options available:

Zero-cost option
Moderate-cost option
High-cost option

While preparing for the regional disaster recovery plan, you need to define two important factors:

RTO (Recovery Time Objective), i.e. the maximum acceptable time to restore the application after a disaster
RPO (Recovery Point Objective), i.e. the maximum acceptable amount of data loss, measured as the time between the last usable backup and the disaster
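
As a rough worked example of how the RPO relates to a backup schedule, the sketch below estimates the worst-case data loss from the backup interval and the time a backup takes to reach the recovery region, then compares it with an objective. All figures and function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta, copy_lag: timedelta) -> timedelta:
    """If disaster strikes just before the next backup lands in the recovery
    region, one full interval plus the cross-region copy lag is lost."""
    return backup_interval + copy_lag


def meets_objective(measured: timedelta, objective: timedelta) -> bool:
    """True when a measured or estimated value is within the objective."""
    return measured <= objective


# Hypothetical schedule: hourly backups that take 15 minutes to copy.
loss = worst_case_data_loss(timedelta(hours=1), timedelta(minutes=15))
print(loss)                                        # 1:15:00
print(meets_objective(loss, timedelta(hours=4)))   # True
print(meets_objective(loss, timedelta(hours=1)))   # False
```

The same comparison applies to the RTO once a drill has produced a measured recovery time.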

Zero-cost Option:

In this approach, you begin by keeping database and configuration backups in the recovery region. The next step is writing automation scripts that launch the infrastructure in the recovery region in the shortest possible time. In case of a disaster, the production environment is restored from the backups using these automation scripts. Though this option has the highest RTO of the three, there is no need to keep any infrastructure running for disaster recovery.

Diagram of a zero-cost disaster recovery option with database and configuration backups in the recovery region.

Moderate-cost Option:

This approach keeps a minimum infrastructure in sync in the recovery region, i.e. the database and configuration servers. This arrangement reduces the DB backup restoration time, significantly lowering the RTO.

Diagram of a moderate-cost disaster recovery option with database and configuration servers in sync in the recovery region.

High-cost Option:

This is a resource-heavy approach in which the production environment runs in multiple regions, with load balancers distributing traffic across them. Though it's an expensive arrangement, with proper implementation and planning the application can be recovered from a single-region disaster with little downtime.

Diagram of a high-cost disaster recovery option with load balancers across multiple regions in the production environment.

Zero-cost Option: The Steps

The zero-cost option does not require the advance launch of additional resources in the recovery region; apart from storing the backups themselves, the only cost incurred is for practicing the DR drills.

Step 1: Configure Backups

At this stage, reducing data loss is the top priority. The first step is configuring cross-region backups in the recovery region. With a proper backup configuration, you can reduce the RPO. It's essential to configure cross-region backups of:

S3 buckets
Database backups
DNS zone file backups
Configuration management (Chef/Puppet) server configuration
CI/CD (Jenkins/GoCD/ArgoCD) server configuration
Application configurations
Ansible playbooks
Bash scripts for deployments and cronjobs
Any other application dependencies required for restoring the application
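
One way to keep this list honest is to check it periodically against what is actually stored in the recovery region. The following is a minimal sketch that assumes you can already collect, for each backup copy, its name, region, and completion time; the `BackupRecord` type, the item names, and the region names are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class BackupRecord:
    name: str               # e.g. "database", "dns-zone-file"
    region: str             # region where this copy is stored
    completed_at: datetime  # when the copy finished


def stale_or_missing(required, records, recovery_region, rpo, now):
    """Return the required items that have no copy in `recovery_region`
    newer than `now - rpo`, in the order given by `required`."""
    newest = {}
    for record in records:
        if record.region != recovery_region:
            continue  # copies outside the recovery region do not count
        if record.name not in newest or record.completed_at > newest[record.name]:
            newest[record.name] = record.completed_at
    return [n for n in required if n not in newest or now - newest[n] > rpo]


# Hypothetical inventory; region names are placeholders.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    BackupRecord("database", "us-west-2", now - timedelta(minutes=30)),
    BackupRecord("dns-zone-file", "us-east-1", now - timedelta(minutes=5)),
    BackupRecord("s3-buckets", "us-west-2", now - timedelta(hours=12)),
]
required = ["database", "dns-zone-file", "s3-buckets"]
problems = stale_or_missing(required, records, "us-west-2", timedelta(hours=1), now)
print(problems)  # ['dns-zone-file', 's3-buckets']
```

Here the DNS zone file exists only in the production region and the S3 copy is older than the one-hour RPO, so both are flagged.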

Step 2: Write Infrastructure as Code (IaC) Templates - Process Automation

Using IaC to launch the AWS infrastructure and configure the application reduces the RTO significantly, and automating the process lessens the likelihood of human error. Many automation tools are widely available; a typical set of artifacts includes:

Terraform code to launch application infrastructure in AWS
Ansible playbooks to configure the application AMI, Chef server, CI/CD servers, MongoDB replica set clusters, and other standalone servers
Scripts to bootstrap the EKS cluster
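
These artifacts are usually chained together by a small wrapper so that the recovery can be started with a single command. The sketch below shows one possible shape for such a wrapper. The directory, inventory, and playbook names are hypothetical, and it assumes the Terraform code exposes an input variable named `region`. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def recovery_commands(region, terraform_dir, inventory, playbook):
    """Build the ordered list of commands for a recovery run."""
    return [
        # Initialise providers and modules without prompting.
        ["terraform", f"-chdir={terraform_dir}", "init", "-input=false"],
        # Launch the infrastructure; `region` is an assumed Terraform variable.
        ["terraform", f"-chdir={terraform_dir}", "apply", "-input=false",
         "-auto-approve", "-var", f"region={region}"],
        # Configure the launched servers.
        ["ansible-playbook", "-i", inventory, playbook],
    ]


def run(commands, dry_run=True):
    """Print each command and execute it unless `dry_run` is set.
    Stops at the first failing command. Returns the printed lines."""
    lines = []
    for cmd in commands:
        line = shlex.join(cmd)
        print(line)
        lines.append(line)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return lines


# Hypothetical paths and region; nothing is executed in dry-run mode.
plan = recovery_commands("us-west-2", "infra", "hosts.ini", "site.yml")
printed = run(plan, dry_run=True)
```

Keeping the command list separate from its execution makes the sequence easy to review before a drill and easy to extend, for example with the EKS bootstrap scripts.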

Step 3: Prepare for a DR Drill

The preparation for a DR drill should be done in advance through a specified process. The following is a sample method to get ready for a DR drill:

Select an environment similar to production
Prepare a plan to launch complete production infrastructure in the recovery region
Identify all the application dependencies in the recovery region
Configure the cross-region backup of all the databases & configurations
Prepare the automation scripts using Terraform, Ansible, and shell scripts
Identify the team members for the DR drill and assign their responsibilities
Test your automation scripts and backup restoration in the recovery region
Note the time taken for each task to get a rough estimate of the drill time
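
For the last item, the per-task timings can be captured by the scripts themselves rather than by hand. The following is a minimal sketch using a context manager; the task names are hypothetical and the `time.sleep` calls stand in for real work.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(task, log):
    """Record the wall-clock duration of the enclosed block in `log[task]`,
    even if the block raises."""
    start = time.monotonic()
    try:
        yield
    finally:
        log[task] = time.monotonic() - start


timings = {}

# Placeholders for real drill tasks.
with timed("launch infrastructure", timings):
    time.sleep(0.02)
with timed("restore database", timings):
    time.sleep(0.01)

for task, seconds in timings.items():
    print(f"{task}: {seconds:.2f}s")
print(f"estimated drill time: {sum(timings.values()):.2f}s")
```

Summing the recorded durations gives the rough estimate of the drill time, and the per-task breakdown shows which step is worth optimizing first.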

Step 4: Execute the DR Drill

The objective of the DR drill is to test the automation scripts and measure the actual recovery time against the RTO. Once the plan is set, decide on a date and time to execute your DR drill. Regular practice is advisable to perfect your restoration capabilities.

Benefits of DR Drills

Practicing DR drills builds confidence that the production environment can be restored within the agreed timeline.
Drills help identify gaps and provide measured RTO and RPO figures.
They provide your customers with tested evidence of your disaster readiness.

Conclusion

Though AWS Regions are very reliable, preparing for a regional disaster is a business-critical requirement for any SaaS application. Multi-region and multi-cloud deployments are complex, expensive architectures, so the appropriate DR option depends on your budget and uptime SLA.