Amazon RDS Essentials
RDS essentials, storage types, backups, Multi-AZ, Read Replicas, and best practices
Amazon RDS (Relational Database Service) is a managed relational database service that automates administrative tasks like provisioning, patching, backups, recovery, and scaling. It supports multiple engines: PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Amazon Aurora.
The problem it solves: It eliminates the operational burden of managing databases - no more manual installation, backup configuration, infrastructure monitoring, OS/DB patching, or failover management. You focus on your schema and queries, AWS handles everything else.
When to use it: For any production application that needs a relational database. Compared with a self-managed database on EC2, RDS typically saves 10-20 hours/month of administration and includes automated backups and point-in-time recovery. Alternatives: Aurora (deeper AWS integration; AWS claims up to 5x MySQL throughput), self-managed on EC2 (only if you need very specific configurations RDS doesn't support), DynamoDB (if NoSQL fits your data model).
Key Concepts
| Concept | Description |
|---|---|
| DB Instance | Managed database server with compute, storage, and networking. It's the main RDS component - similar to an EC2 specialized for databases |
| Instance Class | Compute type for your DB (CPU, RAM, network). Families: T (burstable, dev/test), M (general purpose, balanced production), R (memory optimized, analytics), X (extra memory, in-memory workloads) |
| Storage Type | Disk type for persistence. gp3 (General Purpose SSD) for 90% of cases, io2 (Provisioned IOPS) for I/O-intensive workloads with strict SLA, Magnetic (legacy, avoid) |
| IOPS | Input/Output Operations Per Second - disk performance measure. gp3 baseline = 3,000 IOPS, configurable up to 16,000. io2 up to 256,000 IOPS |
| Allocated Storage | Disk space assigned to the DB. Minimum 20 GB (gp3), maximum 64 TB. Can grow automatically with storage autoscaling |
| Automated Backup | Daily full snapshot + continuous transaction logs every 5 minutes. Enables point-in-time recovery within the retention period (1-35 days) |
| Backup Retention Period | Days RDS retains automated backups. Default 7 days, maximum 35 days. Backups free up to DB size, then $0.095/GB-month |
| Point-in-Time Recovery (PITR) | Ability to restore DB to any specific second within the retention period. Useful for recovery from human errors (DROP TABLE, DELETE without WHERE) |
| Manual Snapshot | Explicit backup you create. Persists indefinitely (until you delete it), survives if you delete the DB instance. You pay $0.095/GB-month for storage |
| DB Endpoint | DNS hostname to connect to the DB (e.g., mydb.abc123.sa-east-1.rds.amazonaws.com). Doesn't change during instance lifetime (except for rename) |
| Master Username/Password | DB administrator user credentials. Configurable at creation, modifiable afterward via CLI/Console |
| DB Subnet Group | Set of subnets where RDS can launch DB instances. Minimum 2 subnets in different AZs (for Multi-AZ support) |
| Security Group | Firewall controlling which IPs/security groups can connect to your DB. Typically only allow connections from your app servers (EC2/ECS/Lambda) |
| Multi-AZ Deployment | High availability configuration with automatic standby replica in another AZ. Automatic failover in ~1-2 minutes on hardware/AZ failure |
| Read Replica | Read-only copy of your DB to scale reads. Asynchronous replication, can be in same region or cross-region. Useful for reports/analytics without impacting production |
| Maintenance Window | Weekly time window where AWS can apply patches/updates. Configurable, recommended during low traffic hours |
| Engine Version | Specific DB engine version (e.g., PostgreSQL 15.4). AWS handles minor version upgrades automatically, major versions require manual upgrade |
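The endpoint plus the master credentials are all an application needs to connect. A minimal sketch (Python; the host and database name are made-up examples) of assembling a libpq-style connection string that forces TLS, in line with the SSL recommendation in the security checklist further below:

```python
def build_dsn(host: str, port: int, dbname: str, user: str, password: str) -> str:
    """Assemble a libpq-style connection string for an RDS endpoint.

    sslmode=require forces TLS on the connection.
    """
    return (
        f"host={host} port={port} dbname={dbname} "
        f"user={user} password={password} sslmode=require"
    )

# Hypothetical endpoint as returned by describe-db-instances
dsn = build_dsn(
    host="mydb.abc123.sa-east-1.rds.amazonaws.com",
    port=5432,
    dbname="myapp",
    user="admin",
    password="SecurePassword123!",
)
print(dsn)
```

In practice the password would come from Secrets Manager rather than being inlined.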
Essential AWS CLI Commands
Creating and Querying DB Instances
# Create DB Instance (PostgreSQL)
aws rds create-db-instance \
--db-instance-identifier myapp-db \
--db-instance-class db.t3.micro \
--engine postgres \
--engine-version 15.4 \
--master-username admin \
--master-user-password SecurePassword123! \
--allocated-storage 20 \
--storage-type gp3 \
--backup-retention-period 7 \
--preferred-backup-window "03:00-04:00" \
--vpc-security-group-ids sg-0abc123 \
--db-subnet-group-name my-db-subnet-group \
--no-publicly-accessible \
--storage-encrypted \
--tags Key=Environment,Value=Production
# List all DB instances
aws rds describe-db-instances
# View specific instance details
aws rds describe-db-instances \
--db-instance-identifier myapp-db \
--query 'DBInstances[0].{
Status:DBInstanceStatus,
Endpoint:Endpoint.Address,
Engine:Engine,
Class:DBInstanceClass,
Storage:AllocatedStorage,
IOPS:Iops
}'
# Verify if available
aws rds describe-db-instances \
--db-instance-identifier myapp-db \
--query 'DBInstances[0].DBInstanceStatus' \
--output text
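Instance creation takes several minutes, so scripts usually poll the status until it reads available (the AWS CLI also provides `aws rds wait db-instance-available` for exactly this). A hedged sketch of the polling loop, with the status lookup injected as a callable so it can be tested without AWS:

```python
import time
from typing import Callable

def wait_until_available(
    get_status: Callable[[], str],
    timeout_s: float = 1800,
    poll_s: float = 30,
    sleep=time.sleep,
) -> bool:
    """Poll a status callable until it returns 'available' or the timeout expires."""
    waited = 0.0
    while waited <= timeout_s:
        if get_status() == "available":
            return True
        sleep(poll_s)  # injected so tests can skip real waiting
        waited += poll_s
    return False

# Fake status source standing in for repeated describe-db-instances calls
statuses = iter(["creating", "backing-up", "available"])
ok = wait_until_available(lambda: next(statuses), sleep=lambda s: None)
print(ok)  # True
```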
# Get connection endpoint
aws rds describe-db-instances \
--db-instance-identifier myapp-db \
--query 'DBInstances[0].Endpoint.Address' \
--output text
Modifying DB Instances
# Change instance class (upgrade/downgrade)
aws rds modify-db-instance \
--db-instance-identifier myapp-db \
--db-instance-class db.t3.small \
--apply-immediately
# Increase storage
aws rds modify-db-instance \
--db-instance-identifier myapp-db \
--allocated-storage 50 \
--apply-immediately
# Increase IOPS (gp3)
aws rds modify-db-instance \
--db-instance-identifier myapp-db \
--iops 6000 \
--apply-immediately
# Change storage type (gp3 to io2)
aws rds modify-db-instance \
--db-instance-identifier myapp-db \
--storage-type io2 \
--iops 10000 \
--apply-immediately
# Note: storage conversion can take hours (the instance enters "storage-optimization");
# it usually stays available, but performance may degrade, so plan a low-traffic window
# Change backup retention period
aws rds modify-db-instance \
--db-instance-identifier myapp-db \
--backup-retention-period 14 \
--apply-immediately
# Rename DB instance
aws rds modify-db-instance \
--db-instance-identifier myapp-db \
--new-db-instance-identifier myapp-db-v2 \
--apply-immediately
# Note: renaming changes the DB endpoint hostname
Deleting DB Instances
# Delete DB instance WITH final snapshot
aws rds delete-db-instance \
--db-instance-identifier myapp-db \
--final-db-snapshot-identifier myapp-db-final-snapshot
# Delete DB instance WITHOUT snapshot (not recommended in prod)
aws rds delete-db-instance \
--db-instance-identifier myapp-db \
--skip-final-snapshot
# Verify deletion
aws rds describe-db-instances \
--db-instance-identifier myapp-db \
--query 'DBInstances[0].DBInstanceStatus'
# Returns a DBInstanceNotFound error once deletion completes
Manual Snapshots
# Create manual snapshot
aws rds create-db-snapshot \
--db-instance-identifier myapp-db \
--db-snapshot-identifier myapp-pre-migration-2025-11-30
# List snapshots
aws rds describe-db-snapshots \
--db-instance-identifier myapp-db
# View manual snapshots only
aws rds describe-db-snapshots \
--db-instance-identifier myapp-db \
--snapshot-type manual
# View automated snapshots only
aws rds describe-db-snapshots \
--db-instance-identifier myapp-db \
--snapshot-type automated
# Delete manual snapshot
aws rds delete-db-snapshot \
--db-snapshot-identifier myapp-pre-migration-2025-11-30
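The "delete old snapshots" habit is easy to script: list snapshots, keep the recent ones, delete the rest. A sketch of the age filter in pure Python, operating on records shaped like (but abbreviated from) what describe-db-snapshots returns:

```python
from datetime import datetime, timedelta, timezone

def snapshots_older_than(snapshots, days, now=None):
    """Return identifiers of manual snapshots created more than `days` ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [
        s["id"] for s in snapshots
        if s["type"] == "manual" and s["created"] < cutoff
    ]

snaps = [
    {"id": "pre-migration-old", "type": "manual",
     "created": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"id": "pre-migration-new", "type": "manual",
     "created": datetime(2025, 11, 29, tzinfo=timezone.utc)},
    {"id": "rds:auto", "type": "automated",  # automated snapshots are skipped
     "created": datetime(2025, 1, 1, tzinfo=timezone.utc)},
]
now = datetime(2025, 11, 30, tzinfo=timezone.utc)
print(snapshots_older_than(snaps, days=90, now=now))  # ['pre-migration-old']
```

Each returned identifier would then be passed to delete-db-snapshot.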
# Copy snapshot to another region (disaster recovery)
aws rds copy-db-snapshot \
--source-db-snapshot-identifier arn:aws:rds:sa-east-1:123456789012:snapshot:myapp-snapshot \
--target-db-snapshot-identifier myapp-snapshot-us-east-1 \
--region us-east-1
Point-in-Time and Snapshot Restore
# Point-in-Time Restore (any second within the retention period, up to the latest restorable time)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier myapp-db \
--target-db-instance-identifier myapp-db-restored \
--restore-time "2025-11-30T15:06:00Z" \
--vpc-security-group-ids sg-0abc123
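The requested restore time must fall between EarliestRestorableTime and LatestRestorableTime (queried further below under available restore points). A small sketch validating that before issuing the restore call:

```python
from datetime import datetime, timezone

def in_restore_window(restore_time, earliest, latest):
    """True if restore_time falls inside the instance's restorable window."""
    return earliest <= restore_time <= latest

# Example values standing in for the describe-db-instances output
earliest = datetime(2025, 11, 23, 3, 0, tzinfo=timezone.utc)
latest = datetime(2025, 11, 30, 15, 10, tzinfo=timezone.utc)
target = datetime(2025, 11, 30, 15, 6, tzinfo=timezone.utc)
print(in_restore_window(target, earliest, latest))  # True
```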
# Restore from manual snapshot
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier myapp-db-from-snapshot \
--db-snapshot-identifier myapp-pre-migration-2025-11-30 \
--db-instance-class db.t3.micro
# View available restore points
aws rds describe-db-instances \
--db-instance-identifier myapp-db \
--query 'DBInstances[0].{
EarliestRestorableTime:EarliestRestorableTime,
LatestRestorableTime:LatestRestorableTime
}'
CloudWatch Monitoring
# View CPU utilization (last 24 hours; GNU date syntax)
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value=myapp-db \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average,Maximum
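CloudWatch alarms (used throughout the checklists below) fire only when the statistic breaches the threshold for `evaluation-periods` consecutive periods, so a single spike doesn't page anyone. A sketch of that evaluation rule:

```python
def alarm_breached(datapoints, threshold, evaluation_periods, op=">"):
    """True if the last `evaluation_periods` datapoints all breach the threshold."""
    if len(datapoints) < evaluation_periods:
        return False
    recent = datapoints[-evaluation_periods:]
    if op == ">":
        return all(v > threshold for v in recent)
    return all(v < threshold for v in recent)  # e.g., FreeableMemory alarms

# CPU averages per 5-min period: two consecutive breaches fire a 2-period alarm
print(alarm_breached([40, 95, 60, 85, 90], threshold=80, evaluation_periods=2))  # True
# A lone spike followed by recovery does not
print(alarm_breached([40, 95, 60, 90, 70], threshold=80, evaluation_periods=2))  # False
```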
# View database connections
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=myapp-db \
--period 300 \
--statistics Average
# View Read/Write IOPS
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReadIOPS \
--dimensions Name=DBInstanceIdentifier,Value=myapp-db \
--period 300 \
--statistics Average
# View disk queue depth (if high, you need more IOPS)
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DiskQueueDepth \
--dimensions Name=DBInstanceIdentifier,Value=myapp-db \
--period 300 \
--statistics Average
Logs Access
# List available log files
aws rds describe-db-log-files \
--db-instance-identifier myapp-db
# Download complete log file (paginate from the beginning)
aws rds download-db-log-file-portion \
--db-instance-identifier myapp-db \
--log-file-name error/postgresql.log.2025-11-30-15 \
--starting-token 0 \
--output text > postgresql.log
# View last lines of log (tail)
# Without --starting-token, the latest portion is returned by default
aws rds download-db-log-file-portion \
--db-instance-identifier myapp-db \
--log-file-name error/postgresql.log \
--output text | tail -n 100
Architecture and Flows
Typical Architecture with RDS
Point-in-Time Recovery Flow
Storage Type Decision Tree
Multi-AZ Failover Flow
Best Practices Checklist
Security
- Never use publicly-accessible in production: DB instances should be in private subnets, accessible only from app servers' security group
- Encryption at rest enabled: --storage-encrypted at creation (can't be enabled afterward on an existing instance; you'd have to restore an encrypted copy from a snapshot)
- SSL/TLS for connections: Force SSL in DB parameters, use RDS certificates in app
- Secrets Manager for credentials: Don't hardcode passwords, use automatic rotation
- IAM Database Authentication: For applications that support it (eliminates need for passwords)
- Restrictive security groups: Only necessary ports (5432 for PostgreSQL) from specific IPs/SGs
- Audit logging enabled: Enable log_statement, log_connections for PostgreSQL
- Encrypted snapshots: Manual snapshots inherit encryption from DB source
Cost Optimization
- Right-sizing instance class: Monitor CPU/Memory in CloudWatch, downgrade if sustained utilization under 40%
- gp3 over gp2: gp3 is cheaper and more flexible (3,000 IOPS baseline independent of storage size)
- Storage autoscaling configured: Avoids running out of space, grows only when needed
- Appropriate backup retention: 7 days for dev/test, 14-30 days for prod, not more than necessary
- Delete old snapshots: Manual snapshots cost $0.095/GB-month indefinitely
- Reserved Instances for stable production: 1-year RI = ~40% savings, 3-year = ~60% savings
- Dev/test instances stopped off-hours: Use scheduled Lambda/EventBridge for stop/start
- Read Replicas in same region when possible: Cross-region replicas have data transfer costs
Performance
- CloudWatch alarms configured: CPU over 80%, FreeableMemory under 500MB, DiskQueueDepth over 5
- Monitor IOPS utilization: If consistently over 80% baseline, increase provisioned IOPS
- Connection pooling in application: PgBouncer/RDS Proxy to handle many connections efficiently
- Appropriate indexes: EXPLAIN ANALYZE slow queries, add indexes where needed
- Read Replicas for read-heavy workloads: Offload reports/analytics to replica
- Optimized parameter group: Adjust shared_buffers, work_mem, effective_cache_size per workload
- Upgrade to modern instances: M6i/R6i more performance per $ than previous generations
Reliability
- Multi-AZ enabled in production: Automatic failover in ~1-2 min on hardware/AZ failure
- Automated backups enabled: Minimum 7 days retention, NEVER backup-retention-period = 0
- Manual snapshot before risky changes: Migrations, upgrades, major schema changes
- Test restore process quarterly: Disaster recovery drill, measure recovery time
- Maintenance window configured: Low traffic hours (early morning/weekend)
- CloudWatch Events for alerts: Notify when failover occurs, backups fail, storage low
- RDS Proxy for faster failover: Reduces connection storm post-failover
Operational Excellence
- Consistent tagging: Environment, Application, Owner, CostCenter for tracking
- Infrastructure as Code: Terraform/CloudFormation for DB instances, parameter groups, subnet groups
- Snapshot before deleting: Final snapshot or verify backups before delete-db-instance
- Naming conventions: {app}-{env}-{region}, e.g., myapp-prod-sa-east-1
- Enhanced Monitoring enabled: OS metrics (processes, threads, memory detail) every 60s
- Performance Insights enabled: Query-level metrics, wait event analysis (first 7 days free)
- Document custom configurations: Modified parameter groups, security group rules, etc.
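The naming convention above can be enforced in code. A sketch that builds `{app}-{env}-{region}` identifiers and applies RDS's instance-identifier rules (must start with a letter; letters, digits, and hyphens only; no trailing or consecutive hyphens; max 63 characters):

```python
import re

IDENTIFIER_RE = re.compile(r"^[a-z][a-z0-9-]{0,62}$")

def db_identifier(app: str, env: str, region: str) -> str:
    """Build an RDS instance identifier following {app}-{env}-{region}."""
    ident = f"{app}-{env}-{region}".lower()
    if not IDENTIFIER_RE.match(ident) or "--" in ident or ident.endswith("-"):
        raise ValueError(f"invalid RDS identifier: {ident!r}")
    return ident

print(db_identifier("myapp", "prod", "sa-east-1"))  # myapp-prod-sa-east-1
```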
Common Mistakes to Avoid
Backup Retention Period = 0
Why it happens: "I want to save costs, I'll disable automated backups."
The real problem: Automated backups up to your DB size are FREE. Disabling them saves $0 but leaves you unprotected against disasters.
Typical scenario:
# BAD: Disable backups
aws rds modify-db-instance \
--db-instance-identifier myapp-db \
--backup-retention-period 0
# Savings: $0 (backups are free up to DB size)
# Risk: Total data loss if DROP TABLE, DELETE without WHERE, etc.
How to avoid it:
# GOOD: Minimum 7 days in production
aws rds modify-db-instance \
--db-instance-identifier myapp-db \
--backup-retention-period 7
# For compliance: 14-30 days
--backup-retention-period 30
Golden rule: NEVER backup-retention-period = 0 in production. Only acceptable in temporary testing environments.
Choosing io2 Without Analyzing CloudWatch Metrics First
Why it happens: "More IOPS = better performance, let's go with io2."
The real problem: io2 is 10-15x more expensive than gp3. Most workloads work perfectly fine with gp3.
Before considering io2:
# 1. Check current IOPS utilization
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReadIOPS \
--dimensions Name=DBInstanceIdentifier,Value=myapp-db \
--period 300 \
--statistics Average,Maximum
# 2. If Average under 2,500 IOPS -> gp3 baseline (3K) is sufficient
# 3. If Average 5,000-10,000 -> increase IOPS in gp3 first
# 4. Only if over 16,000 sustained IOPS -> consider io2
Cost comparison (100 GB, 10,000 IOPS):
gp3: $11.50 (storage) + $35 (7K IOPS extra) = $46.50/month
io2: $13.80 (storage) + $650 (10K IOPS) = $663.80/month
Difference: ~14x more expensive
When to USE io2: Financial applications (trading), workloads where sub-millisecond latency is critical, 99.99%+ SLAs where every millisecond counts.
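The decision steps above can be folded into a small helper. The IOPS cutoffs are the ones from the comments (treat them as rules of thumb, not AWS guidance):

```python
def recommend_storage(avg_iops: float, sustained_max_iops: float) -> str:
    """Rule-of-thumb storage recommendation based on observed CloudWatch IOPS."""
    if sustained_max_iops > 16_000:
        return "consider io2"           # beyond gp3's configurable maximum
    if avg_iops > 2_500:
        return "gp3 with provisioned IOPS"  # grow gp3 before jumping to io2
    return "gp3 baseline (3,000 IOPS)"

print(recommend_storage(1_800, 2_900))    # gp3 baseline (3,000 IOPS)
print(recommend_storage(7_000, 12_000))   # gp3 with provisioned IOPS
print(recommend_storage(20_000, 25_000))  # consider io2
```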
Not Creating Manual Snapshot Before Risky Changes
Why it happens: "Automated backups are enough."
The real problem: The daily automated snapshot may be hours old, and while point-in-time recovery (transaction logs shipped roughly every 5 minutes) can restore to just before the change, you have to pinpoint the exact pre-migration timestamp under pressure. A named manual snapshot gives you an unambiguous, pre-verified rollback point.
Risky scenario:
# Thursday 10 AM: Schema migration
ALTER TABLE users ADD COLUMN preferences JSONB;
# Migration corrupts data
# Last automated snapshot: Thursday 3 AM (7 hours ago)
# PITR can get you back to ~9:55 AM, but only if you identify the
# right timestamp; a named snapshot removes the guesswork
How to avoid it:
# ALWAYS before major changes:
aws rds create-db-snapshot \
--db-instance-identifier myapp-db \
--db-snapshot-identifier pre-migration-$(date +%Y%m%d-%H%M)
# Do risky change
# If it fails -> restore from manual snapshot (minutes lost, not hours)
# After confirming success:
aws rds delete-db-snapshot \
--db-snapshot-identifier pre-migration-20251130-1000
When to create manual snapshots:
- Schema migrations
- Major version upgrades
- Bulk data modifications
- Deployment of critical changes
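A timestamped identifier like the pre-migration-$(date ...) pattern above can also be generated in code. A sketch (snapshot identifiers follow the same character rules as instance identifiers, with a 255-character limit):

```python
from datetime import datetime, timezone

def snapshot_identifier(prefix: str, when=None) -> str:
    """Build a pre-change snapshot name like 'pre-migration-20251130-1000'."""
    when = when or datetime.now(timezone.utc)
    ident = f"{prefix}-{when.strftime('%Y%m%d-%H%M')}"
    # RDS snapshot identifiers: start with a letter, letters/digits/hyphens, <= 255 chars
    assert ident[0].isalpha() and len(ident) <= 255
    return ident

ts = datetime(2025, 11, 30, 10, 0, tzinfo=timezone.utc)
print(snapshot_identifier("pre-migration", ts))  # pre-migration-20251130-1000
```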
Publicly Accessible = True in Production
Why it happens: "I need to connect from my laptop for debugging."
The real problem: You expose your DB to the internet. Bots constantly scan for open DBs, attempting brute force.
# BAD: DB accessible from internet
aws rds create-db-instance \
--publicly-accessible
# Security group allows 0.0.0.0/0 -> DB exposed
# Result: You appear on Shodan, constant brute force attempts
How to avoid it:
# GOOD: DB in private subnet, not publicly accessible
aws rds create-db-instance \
--no-publicly-accessible \
--vpc-security-group-ids sg-private-db
# Security group allows ONLY from app servers:
aws ec2 authorize-security-group-ingress \
--group-id sg-private-db \
--protocol tcp \
--port 5432 \
--source-group sg-app-servers
# For debugging from laptop:
# Option 1: Bastion host in public subnet
# Option 2: VPN/Direct Connect
# Option 3: SSM Session Manager port forwarding
Not Monitoring FreeableMemory and SwapUsage
Why it happens: "CloudWatch shows CPU under 50%, must be fine."
The real problem: Normal CPU but saturated memory = slow queries, swapping to disk (100x slower than RAM).
Symptoms you're ignoring:
# CPU: 45% (seems OK)
# FreeableMemory: 50 MB (of 1 GB total) - PROBLEM
# SwapUsage: 500 MB - SWAP ACTIVE = SLOW
# Latency: 2000ms (should be under 50ms)
Correct diagnosis:
# View memory metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name FreeableMemory \
--dimensions Name=DBInstanceIdentifier,Value=myapp-db \
--period 300 \
--statistics Average
# If FreeableMemory consistently under 500 MB:
# -> You need instance class upgrade (more RAM)
aws rds modify-db-instance \
--db-instance-identifier myapp-db \
--db-instance-class db.t3.small \
--apply-immediately
Preventive alarm:
aws cloudwatch put-metric-alarm \
--alarm-name myapp-db-low-memory \
--metric-name FreeableMemory \
--namespace AWS/RDS \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 524288000 \
--comparison-operator LessThanThreshold \
--dimensions Name=DBInstanceIdentifier,Value=myapp-db
Not Testing Restore Process Until Real Disaster
Why it happens: "The configuration looks good, it must work when I need it."
The real problem: Disaster arrives, you discover that:
- You don't know how to use restore CLI correctly
- Snapshot is corrupted
- Process takes longer than expected
- App doesn't connect to restored DB
Time lost in production: 2-3 hours troubleshooting in panic vs 20 min if you had practiced.
How to avoid it - Quarterly Disaster Recovery Drill:
# Every 3 months (Friday afternoon, low traffic):
# 1. Create snapshot
aws rds create-db-snapshot \
--db-instance-identifier myapp-prod \
--db-snapshot-identifier dr-drill-2025-q4
# 2. Measure time: the restore call returns immediately, so time the waiter
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier myapp-prod-dr-test \
--db-snapshot-identifier dr-drill-2025-q4
time aws rds wait db-instance-available \
--db-instance-identifier myapp-prod-dr-test
# 3. Verify connectivity
psql -h myapp-prod-dr-test.xxx.rds.amazonaws.com \
-U admin -d myapp -c "SELECT COUNT(*) FROM users;"
# 4. Document:
# - Total restore time: X minutes
# - Issues found
# - Update runbook
# 5. Clean up
aws rds delete-db-instance \
--db-instance-identifier myapp-prod-dr-test \
--skip-final-snapshot
aws rds delete-db-snapshot \
--db-snapshot-identifier dr-drill-2025-q4
Storage Full Without Autoscaling Configured
Why it happens: "I have 20 GB, takes months to fill up, I'll check when it gets close."
The real problem: Friday 6 PM, storage reaches 100%, DB enters read-only mode, app stops working.
# DB state when storage is full:
Status: "storage-full"
# Queries: Only SELECT works
# INSERT/UPDATE/DELETE: ERROR
How to avoid it - Storage Autoscaling:
aws rds modify-db-instance \
--db-instance-identifier myapp-db \
--max-allocated-storage 100 \
--apply-immediately
# RDS automatically increases storage when:
# - Free space under 10% of allocated storage
# - Or under 6 GB free
# - Low-storage lasts at least 5 minutes
# - At least 6 hours since last storage modification
Preventive alarm:
aws cloudwatch put-metric-alarm \
--alarm-name myapp-db-low-storage \
--metric-name FreeStorageSpace \
--namespace AWS/RDS \
--statistic Average \
--period 300 \
--evaluation-periods 1 \
--threshold 2147483648 \
--comparison-operator LessThanThreshold \
--dimensions Name=DBInstanceIdentifier,Value=myapp-db
Cost Considerations
Cost Components in RDS
| Component | Pricing | Free Tier |
|---|---|---|
| DB Instance (compute) | Per hour based on instance class | 750 hrs/month db.t3.micro (12 months) |
| Storage - gp3 | $0.115/GB-month | 20 GB (12 months) |
| Storage - io2 | ~$0.138/GB-month + $0.065/IOPS-month | Not included |
| Provisioned IOPS (gp3) | $0.005/IOPS above 3,000 | Not included |
| Automated Backups | Free up to DB size, then $0.095/GB-month | Included |
| Manual Snapshots | $0.095/GB-month | Not included |
| Data Transfer Out | $0.09/GB (to internet) | 1 GB/month (12 months) |
| Multi-AZ (standby replica) | 2x cost of DB instance + storage | Not included |
| Read Replica | Separate DB instance + storage cost | Not included |
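The gp3 rows of the table combine into a simple monthly cost formula (prices taken from the table; they vary by region):

```python
def gp3_monthly_cost(storage_gb: float, provisioned_iops: int = 3000,
                     gb_price: float = 0.115, iops_price: float = 0.005) -> float:
    """Monthly gp3 cost: storage plus IOPS provisioned above the 3,000 baseline."""
    extra_iops = max(0, provisioned_iops - 3000)
    return storage_gb * gb_price + extra_iops * iops_price

# 100 GB at 5,000 IOPS: $11.50 storage + $10 for 2,000 extra IOPS
print(round(gp3_monthly_cost(100, 5000), 2))  # 21.5
```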
Real Application Cost Example
Configuration:
- Instance: db.t3.small (2 vCPU, 2 GB RAM)
- Storage: 100 GB gp3, 5,000 IOPS (2,000 extra)
- Backups: 7 days retention (~150 GB backups total)
- Manual snapshots: 2 snapshots of 100 GB each
- Multi-AZ: Yes
- Region: sa-east-1
Monthly calculation:
DB Instance (primary):
db.t3.small x 730 hrs = $30/month
DB Instance (standby - Multi-AZ):
db.t3.small x 730 hrs = $30/month
Storage (primary):
100 GB gp3 x $0.115 = $11.50/month
Storage (standby):
100 GB gp3 x $0.115 = $11.50/month
Provisioned IOPS extra:
2,000 IOPS x $0.005 = $10/month (primary only)
Automated Backups:
First 100 GB free (= DB size)
Excess: 50 GB x $0.095 = $4.75/month
Manual Snapshots:
200 GB x $0.095 = $19/month
TOTAL MONTHLY: $116.75/month
Optimization Strategies
1. Reserved Instances for stable production
db.t3.small on-demand: $30/month x 12 = $360/year
Reserved Instance (1 year, No Upfront):
~$216/year (40% savings)
Reserved Instance (3 years, All Upfront):
~$140/year (60% savings)
Recommendation: 1-year RI for prod DBs you know will run over 1 year
2. Rightsizing with CloudWatch Metrics
# If I see CPU under 30% and Memory under 50% for 2+ weeks:
# Downgrade from db.t3.small -> db.t3.micro
Savings: ~$15/month x 2 (primary + standby) = ~$30/month (db.t3.micro costs about half of db.t3.small)
3. Optimized backup retention
Automated backups:
7 days (dev/test)
14 days (prod - cost/recovery balance)
30 days (compliance only if required)
Manual snapshots:
Only for compliance/long-term
Delete after verifying no longer needed
4. Dev/Test instances stopped off-hours
# Lambda scheduled (EventBridge):
# Monday-Friday 8 AM: Start DB
# Monday-Friday 6 PM: Stop DB
Savings: ~70% of instance cost (running 10 hrs/day on weekdays = ~50 of 168 weekly hours)
db.t3.small: $30/month -> ~$9/month
Note: RDS automatically starts a stopped instance back up after 7 days, so the scheduler must re-stop it.
Integration with Other Services
| AWS Service | How It Integrates | Typical Use Case |
|---|---|---|
| EC2 | App servers on EC2 connect to RDS via endpoint | Backend apps (Laravel, Django, Rails) using RDS as datastore |
| Lambda | Lambda functions connect to RDS (via VPC or RDS Proxy) | Serverless APIs, scheduled jobs reading/writing to DB |
| VPC | RDS instances live in VPC subnets, security groups control access | Network isolation, private subnets for DBs, granular traffic control |
| Secrets Manager | Stores DB credentials, automatic rotation | Don't hardcode passwords, automatic credential rotation every 30-90 days |
| CloudWatch | Automatic metrics (CPU, connections, IOPS), logs, alarms | Performance monitoring, alerts when thresholds exceeded |
| CloudTrail | Audit trail of RDS operations (create, modify, delete) | Compliance, forensics, "who deleted the DB on Friday?" |
| S3 | Snapshot export, long-term backups, query results (Athena) | Cross-region disaster recovery, compliance archives, analytics on exported data |
| IAM | Control who can manage RDS, IAM database authentication | Principle of least privilege, developers only read access to prod DBs |
| KMS | Encryption keys for storage at rest, snapshot encryption | Compliance (HIPAA, PCI-DSS), encrypted databases |
| EventBridge | RDS events (backup complete, failover, low storage) | Automation (notify Slack on failover, trigger Lambda on backup failure) |
| SNS | Target of CloudWatch alarms for notifications | Email/SMS when CPU over 80%, storage under 10%, connections maxed |
| Route53 | Health checks of RDS endpoint, failover DNS | Multi-region DR, automatic DNS failover if primary region fails |
| DMS | Migrate data to RDS from on-premise or other DBs | Lift-and-shift migrations, ongoing replication, zero-downtime migrations |
| ECS/Fargate | Containerized apps connect to RDS | Microservices architecture, each service with its own DB connection pool |
| ElastiCache | Cache layer in front of RDS to reduce DB load | Session storage, query result caching, hot data acceleration |
| RDS Proxy | Managed connection pooler between app and RDS | Lambda functions (avoid connection exhaustion), faster failover |
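The ElastiCache row describes the classic cache-aside pattern: check the cache first, fall back to the database, then populate the cache for subsequent reads. A minimal sketch with dicts standing in for Redis and RDS:

```python
def get_user(user_id, cache, db, stats):
    """Cache-aside read: serve from cache when possible, else query the DB."""
    if user_id in cache:
        stats["hits"] += 1
        return cache[user_id]
    stats["misses"] += 1
    row = db[user_id]          # stand-in for a SELECT against RDS
    cache[user_id] = row       # populate the cache for subsequent reads
    return row

cache, db = {}, {42: {"name": "Ada"}}
stats = {"hits": 0, "misses": 0}
get_user(42, cache, db, stats)   # miss -> hits the database
get_user(42, cache, db, stats)   # hit -> served from cache
print(stats)  # {'hits': 1, 'misses': 1}
```

A real implementation would add a TTL and an invalidation strategy on writes.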
Additional Resources
Whitepapers and Best Practices
- Well-Architected Framework - Reliability Pillar
- Database Caching Strategies
- Backup and Recovery Approaches
- RDS Security Best Practices
Engine-Specific Resources
- RDS for PostgreSQL Best Practices
- Working with PostgreSQL Read Replicas
- RDS for MySQL Best Practices