Introduction

Hey everyone! 👋

I recently worked on a project where I automated disaster recovery (DR) on AWS using Terraform and CI/CD pipelines. Initially, I had no idea how to set up a proper DR solution, but after diving deep into AWS services and Terraform, I was able to create a fully automated system that recovers from failures without manual intervention.

I thought it would be a great idea to document the entire process in this blog so that others (especially students like me) can learn how to build a similar system. If you're new to Terraform, AWS, or CI/CD, don't worry—I’ll explain everything in simple terms.

📌 What will you learn?

How to automate disaster recovery on AWS
How to set up EC2, RDS, S3, CloudWatch, and more
How to use Terraform to manage infrastructure as code
How to implement CI/CD pipelines for automatic deployment
How to test the DR system and make sure it actually works

🔗 GitHub Repository: Disaster Recovery CI/CD

Why Do We Need Disaster Recovery?

Imagine working on a project, and suddenly, the server hosting your application crashes. 😱 If you don’t have a backup plan, your website or app will go offline, leading to data loss, revenue loss, and frustrated users.

An automated disaster recovery setup solves this by restoring services when something goes wrong. The main causes of system failures include:
✅ Hardware failures (server crashes, storage failures)
✅ Cyber attacks (DDoS, malware, hacking)
✅ Human errors (accidental data deletion, misconfiguration)
✅ Natural disasters (earthquakes, floods)

A manual recovery process can take hours (or even days), but if we automate it, the system will restore itself in minutes! 💡

Project Overview

Tools & Technologies Used

🔹 Terraform – To define infrastructure as code
🔹 AWS EC2 – For hosting the application
🔹 AWS RDS (MySQL) – For database recovery
🔹 AWS S3 – For storing backups
🔹 AWS CloudWatch – For monitoring system health
🔹 AWS SNS – For sending failure notifications
🔹 GitHub Actions – For automating deployments

The goal of this project is to automatically recover from failures without manual intervention.

Step-by-Step Implementation

Step 1: Setting Up AWS Infrastructure Using Terraform

Terraform helps us define cloud resources using code so we can easily manage, update, and replicate our infrastructure.

Terraform Resources Used

✅ VPC & Subnets – Create a secure networking environment
✅ EC2 Instances – Run the application in an Auto Scaling group so failed instances get replaced
✅ RDS Database – Use a multi-AZ setup for database failover
✅ S3 Bucket – Store logs and backup data
✅ CloudWatch & SNS – Monitor and alert in case of failures

Deploying Infrastructure with Terraform

Once the Terraform code is ready, we deploy it with these commands:

terraform init
terraform plan
terraform apply -auto-approve

✅ This creates the AWS infrastructure required for disaster recovery.
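Since -auto-approve skips the confirmation prompt, one option is to save the plan to a file, read it, and then apply exactly that file. Here is a minimal sketch of that idea. The function name deploy_reviewed and the default file name tfplan are my own choices for illustration, not something taken from the project repo.

```shell
# Plan once, review, then apply exactly what was reviewed.
# "tfplan" is an arbitrary file name chosen for this sketch.
deploy_reviewed() {
  plan_file="${1:-tfplan}"

  terraform init -input=false || return 1
  terraform validate || return 1                         # catch configuration errors early
  terraform plan -input=false -out="$plan_file" || return 1
  terraform show "$plan_file" || return 1                # human-readable view of the saved plan
  terraform apply -input=false "$plan_file"              # a saved plan is applied without a prompt
}
```

Calling deploy_reviewed from the folder that holds the Terraform code runs the whole sequence. Applying a saved plan means the changes you read are the changes that get made, even if someone edits the code in between.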

Step 2: Automating Deployment with CI/CD Pipelines

Manually applying Terraform every time isn’t efficient. Instead, I used GitHub Actions to automate the deployment. Here’s how the CI/CD pipeline works:

1️⃣ A developer pushes changes to GitHub.
2️⃣ GitHub Actions automatically runs terraform apply.
3️⃣ AWS infrastructure updates without manual intervention.
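4️⃣ If the apply fails, the pipeline stops and reports the error. Terraform has no automatic rollback, so getting back to the last working state means reverting the commit and letting the pipeline apply it again.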

This ensures that any changes to the infrastructure are deployed automatically and in the same way every time. 🚀
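The pipeline steps above can be sketched as the script a workflow job might run. This is not the actual workflow from my repo, just an illustration of one way to gate the apply step. It relies on the -detailed-exitcode flag of terraform plan, which returns 0 for no changes, 1 for an error, and 2 when there are changes. The function names plan_decision and ci_deploy are hypothetical.

```shell
# Translate the exit code of `terraform plan -detailed-exitcode`
# into a pipeline decision: 0 = no changes, 1 = error, 2 = changes present.
plan_decision() {
  case "$1" in
    0) echo "skip"  ;;   # nothing to apply
    2) echo "apply" ;;   # plan succeeded and contains changes
    *) echo "fail"  ;;   # plan errored; stop the pipeline
  esac
}

ci_deploy() {
  terraform init -input=false || return 1

  # Capture the exit code without tripping `set -e` in CI shells.
  rc=0
  terraform plan -input=false -detailed-exitcode -out=tfplan || rc=$?

  case "$(plan_decision "$rc")" in
    skip)  echo "No infrastructure changes." ;;
    apply) terraform apply -input=false tfplan ;;
    *)     echo "terraform plan failed; nothing was applied." >&2
           return 1 ;;
  esac
}
```

Keeping the decision in its own small function makes it easy to check by hand: plan_decision 2 prints apply, and anything other than 0 or 2 prints fail.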

Step 3: Simulating Failures & Testing the DR System

Now that everything is set up, let's test if it actually works by intentionally causing failures! 😈

1️⃣ Simulating an EC2 Failure

Terminate an EC2 instance that belongs to the Auto Scaling group and check that the group launches a replacement:

aws ec2 terminate-instances --instance-ids <INSTANCE_ID> --region ap-south-1

✅ Expected Result: A new EC2 instance should launch automatically.
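Instead of refreshing the console, you can poll the Auto Scaling group until it is back at full strength. Below is a rough sketch under a few assumptions: the group name is a placeholder you pass in, the function names are made up for this post, and the region falls back to ap-south-1 (the one used above) unless AWS_REGION is set.

```shell
# Count instances in the group that are currently InService.
# $1 is the Auto Scaling group name (a placeholder; use your own).
in_service_count() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --region "${AWS_REGION:-ap-south-1}" \
    --query "length(AutoScalingGroups[0].Instances[?LifecycleState=='InService'])" \
    --output text
}

# Poll until the group has at least the expected number of healthy
# instances, or give up after a fixed number of attempts.
# Args: group name, expected count, attempts (default 30), delay in seconds (default 10).
wait_for_replacement() {
  asg_name="$1"; expected="$2"; attempts="${3:-30}"; delay="${4:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    count="$(in_service_count "$asg_name")" || return 1
    echo "attempt $i: $count/$expected InService"
    [ "$count" -ge "$expected" ] && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for $asg_name" >&2
  return 1
}
```

For a group named my-dr-asg (hypothetical) with a desired capacity of 2, running wait_for_replacement my-dr-asg 2 right after the terminate command gives you a simple log of how long the recovery took.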

2️⃣ Testing RDS Failover

Reboot the RDS instance with the --force-failover flag to trigger a Multi-AZ failover:

aws rds reboot-db-instance --db-instance-identifier <RDS_INSTANCE_ID> --region ap-south-1 --force-failover

✅ Expected Result: The standby in the other Availability Zone should be promoted to primary, while the database endpoint stays the same.
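One way to confirm that the failover really happened is to compare the primary's Availability Zone before and after the reboot. The sketch below assumes a Multi-AZ DB instance (not a cluster). The function names are my own, and the region defaults to ap-south-1 unless AWS_REGION is set.

```shell
# Print the Availability Zone currently hosting the primary instance.
primary_az() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --region "${AWS_REGION:-ap-south-1}" \
    --query "DBInstances[0].AvailabilityZone" \
    --output text
}

# Force a failover and report whether the primary moved to another AZ.
verify_failover() {
  db_id="$1"
  before="$(primary_az "$db_id")" || return 1

  aws rds reboot-db-instance --db-instance-identifier "$db_id" \
    --region "${AWS_REGION:-ap-south-1}" --force-failover >/dev/null || return 1

  # Built-in waiter: polls until the instance status is "available".
  # Caveat: it can return early if the status has not yet left "available".
  aws rds wait db-instance-available --db-instance-identifier "$db_id" \
    --region "${AWS_REGION:-ap-south-1}" || return 1

  after="$(primary_az "$db_id")" || return 1
  echo "primary AZ: $before -> $after"
  [ "$before" != "$after" ]   # success only if the AZ actually changed
}
```

If the two zones printed are the same, either the instance is not Multi-AZ or the check ran before the failover finished, so it is worth running primary_az again after a minute.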

3️⃣ Checking CloudWatch Alarms

List CloudWatch alarms to see if they detect failures:

aws cloudwatch describe-alarms --region ap-south-1 --query "MetricAlarms[*].{Name:AlarmName,State:StateValue}"

✅ Expected Result: Any alarm whose threshold is breached (for example, high CPU usage) should show the state ALARM, and SNS should send a notification.
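Waiting for real CPU load is slow, so for a drill you can push an alarm into the ALARM state by hand and check that the notification arrives. A small sketch follows. The function name is hypothetical and the alarm name is whatever your Terraform code created. CloudWatch puts the alarm back into its real state at the next metric evaluation, so this is only a temporary nudge.

```shell
# Temporarily force an alarm into ALARM so that its actions
# (such as the SNS notification) fire, then read the state back.
trigger_alarm_drill() {
  alarm_name="$1"

  aws cloudwatch set-alarm-state \
    --alarm-name "$alarm_name" \
    --state-value ALARM \
    --state-reason "DR drill: manual test of the notification path" \
    --region "${AWS_REGION:-ap-south-1}" || return 1

  # Confirm the change was accepted by printing the current state.
  aws cloudwatch describe-alarms \
    --alarm-names "$alarm_name" \
    --region "${AWS_REGION:-ap-south-1}" \
    --query "MetricAlarms[0].StateValue" \
    --output text
}
```

Running trigger_alarm_drill with one of the alarm names from the list above should print ALARM, and shortly after that the SNS subscribers should get a message.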

🔴 Challenges Faced & Lessons Learned

💡 Challenge 1: RDS Multi-AZ Setup Issues

👉 Problem: Initially, the Terraform script used db.t2.micro, an older instance class that was not supported for this database configuration.
✅ Solution: Switched to a supported instance type (db.t3.micro).

💡 Challenge 2: CI/CD Pipeline Debugging

👉 Problem: Terraform apply failed due to missing IAM permissions.
✅ Solution: Created a dedicated IAM role for Terraform with necessary AWS permissions.
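A permissions error is easier to debug once you know which identity the pipeline is actually running as. Here is a tiny preflight sketch that could run before Terraform. The function name is my own, and it only confirms who you are, not what you are allowed to do.

```shell
# Fail fast if there are no usable AWS credentials, and show which
# identity Terraform is about to act as.
preflight_identity() {
  arn="$(aws sts get-caller-identity --query Arn --output text)" || {
    echo "No valid AWS credentials found." >&2
    return 1
  }
  echo "Running as: $arn"
}
```

If the ARN printed is not the dedicated Terraform role, the pipeline is picking up the wrong credentials, which explains a lot of confusing "access denied" messages.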

💡 Challenge 3: Auto Scaling Errors

👉 Problem: The Auto Scaling group was defined with a Launch Configuration, which AWS has deprecated.
✅ Solution: Switched the Auto Scaling group to a Launch Template.

✅ Key Takeaway: Debugging Terraform issues teaches valuable AWS troubleshooting skills.

Final Outcome 🎯

By automating disaster recovery, I achieved:
🚀 Automatic recovery from EC2 and RDS failures
📉 Minimal downtime with fast failover
📊 Real-time monitoring & alerts for failures
💡 Scalable and cost-efficient disaster recovery

This setup ensures that even if a failure happens, the system can heal itself—reducing downtime and keeping applications running smoothly.

Best Practices for Disaster Recovery

✅ Use Multi-AZ for RDS – Ensures database availability
✅ Enable Auto Scaling for EC2 – Replaces failed instances automatically
✅ Use CloudWatch for monitoring – Detects failures before they become critical
✅ Store backups in S3 – Protects against accidental data loss
✅ Automate deployments with CI/CD – Prevents manual errors

🧹 Cleaning Up – Destroying the Infrastructure

Once testing is done, delete the AWS resources to avoid extra costs:

terraform destroy -auto-approve
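Because destroy with -auto-approve deletes everything without asking, it can be worth previewing the teardown first and checking afterwards that nothing is left behind. A minimal sketch, with function names that are my own:

```shell
# Show everything that a destroy would delete, without deleting it.
preview_teardown() {
  terraform plan -destroy -input=false
}

# Destroy the infrastructure, then confirm the state is empty.
teardown_and_verify() {
  terraform destroy -auto-approve || return 1

  remaining="$(terraform state list)" || return 1
  if [ -n "$remaining" ]; then
    echo "Resources still tracked in state:" >&2
    echo "$remaining" >&2
    return 1
  fi
  echo "All resources destroyed."
}
```

Run preview_teardown, read the list, and only then run teardown_and_verify. Anything still printed at the end is a resource that could keep costing money.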

Final Thoughts & What’s Next?

This project was a great learning experience for me. Before starting, I had very little knowledge about Terraform and disaster recovery, but by breaking things down step by step, I was able to build an automated system that recovers from failures on its own.

If you’re interested in cloud automation, DevOps, or disaster recovery, I highly recommend trying this project out. It’s a great way to learn Terraform, AWS, and CI/CD while working on a real-world use case.

🔗 GitHub Repository: Disaster Recovery CI/CD

💡 Next Steps:

Implement AWS Lambda for automated recovery actions
Add AWS Backup for additional resilience
Extend failover to multi-region setups

More Resources

If you're interested in AWS projects, check out my other blog:
🔗 Building a Development Environment on AWS

Thanks for Reading! 😊

If you found this helpful, feel free to share it with others and ⭐ the GitHub repo! 🚀

My AWS Cloud Journey

Automating Disaster Recovery Using Terraform & CI/CD on AWS