Amazon CloudWatch Essentials
Metrics, logs, alarms, EventBridge, and best practices
Amazon CloudWatch is AWS's observability service that collects, monitors, and analyzes metrics, logs, and events from your infrastructure and applications. It acts as the "nervous system" of your architecture, providing complete visibility into the state of your resources in real-time.
The problem it solves: It eliminates operational opacity - instead of discovering problems when users report them, CloudWatch lets you monitor proactively, detect anomalies, automate responses, and maintain performance history. It transforms reactive debugging into proactive observability.
When to use it: ALWAYS in production. CloudWatch is fundamental to any serious AWS architecture. Without observability, you're operating blind - you don't know if your resources are healthy, what caused an incident, or when you need to scale.
Alternatives: DataDog, New Relic, Prometheus+Grafana, ELK Stack (Elasticsearch+Logstash+Kibana). CloudWatch has the advantage of native integration with AWS services and doesn't require additional infrastructure setup, but third-party tools offer more advanced analytics and better UX.
Key Concepts
| Concept | Description |
|---|---|
| Metric | Numeric data that varies over time, represented as a time series (timestamp + value). Examples: CPUUtilization, RequestCount, DiskReadOps. Each metric belongs to a namespace (AWS/EC2, AWS/ApplicationELB) |
| Namespace | Container that groups related metrics. AWS uses namespaces like AWS/EC2, AWS/RDS. You can create custom namespaces for your application metrics (e.g., CustomApp/Business) |
| Dimension | Key-value pair that identifies a metric variation (e.g., InstanceId=i-xxxxx, LoadBalancer=app/myapp-lb/xxx). Allows filtering and aggregating metrics |
| Statistic | Aggregation of data points over a period (Average, Sum, Minimum, Maximum, SampleCount). Defines how the reported value is calculated |
| Period | Time interval over which statistics are calculated (60s, 300s, 3600s). Standard metrics have 60s minimum, high-resolution can be 1s |
| Alarm | Automated monitor that compares a metric against a threshold and executes actions when crossed. Has 3 states: OK, ALARM, INSUFFICIENT_DATA |
| Alarm State | Current alarm state - OK (threshold condition not met), ALARM (threshold condition met), INSUFFICIENT_DATA (not enough data to evaluate) |
| Evaluation Period | Number of consecutive periods that must meet condition before changing alarm state. Prevents false positives from temporary spikes |
| Datapoints to Alarm | Number of data points within evaluation periods that must violate threshold to trigger alarm (e.g., "3 of 5" means 3 bad data points in 5 periods) |
| Composite Alarm | Alarm that combines multiple alarms using boolean logic (AND/OR). Allows creating complex alerts based on multiple conditions |
| Log Group | Container for related log streams (e.g., /aws/lambda/my-function, /aws/ec2/myapp). Defines retention policy and encryption settings |
| Log Stream | Sequence of log events from a single source (one EC2 instance, one Lambda invocation). A log group can have multiple streams |
| Log Event | Individual log message with timestamp and content. Can be plain text or structured JSON |
| Metric Filter | Search pattern over logs that extracts data and publishes it as a metric. Allows creating custom metrics from logs without modifying code |
| Logs Insights | SQL-like interactive query service for analyzing logs. Supports aggregations, filters, regex, and statistics over large log volumes |
| Log Retention | Period CloudWatch stores logs before automatically deleting them (1 day to 10 years, or indefinite). Critical for compliance and cost optimization |
| EventBridge | Event bus service that captures AWS resource state changes and enables event-based automation |
| Event Pattern | JSON that defines which events to capture (e.g., "EC2 instance terminated in specific ASG"). Uses pattern matching against event structure |
| Event Rule | Combination of event pattern + target action. Defines "when X happens, do Y" |
| Event Target | Destination of an event (Lambda, SNS, SQS, Step Functions, etc.). A rule can have multiple targets |
| Scheduled Expression | Cron or rate expression for periodic events (e.g., cron(0 2 * * ? *) = 2 AM daily, rate(5 minutes) = every 5 minutes) |
| CloudWatch Agent | Daemon that runs on EC2/on-premise to send custom metrics and logs to CloudWatch. Enables monitoring of memory, disk space, custom processes |
| Dashboard | Customizable visualization of metrics in charts, numbers, and text. Provides consolidated view of application health |
| Widget | Individual dashboard component (line chart, number, log widget, alarm status). Configured with specific metrics and time period |
| Anomaly Detection | Machine learning that learns normal metric patterns and creates expected value bands. Alarms can be based on deviations from the pattern |
Essential AWS CLI Commands
Querying Metrics
# List available metrics for a namespace
aws cloudwatch list-metrics \
--namespace AWS/EC2
# List metrics for a specific instance
aws cloudwatch list-metrics \
--namespace AWS/EC2 \
--dimensions Name=InstanceId,Value=i-xxxxx
# Get CPU statistics (last 24 hours, 5 min periods)
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxxxx \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average,Maximum \
--output table
# Get multiple metrics with get-metric-data (more efficient)
cat > metric-queries.json <<EOF
[
{
"Id": "cpu",
"MetricStat": {
"Metric": {
"Namespace": "AWS/EC2",
"MetricName": "CPUUtilization",
"Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}]
},
"Period": 300,
"Stat": "Average"
}
},
{
"Id": "network",
"MetricStat": {
"Metric": {
"Namespace": "AWS/EC2",
"MetricName": "NetworkIn",
"Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}]
},
"Period": 300,
"Stat": "Sum"
}
}
]
EOF
aws cloudwatch get-metric-data \
--metric-data-queries file://metric-queries.json \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S)
# Get ALB metrics (request count, latency, errors)
aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name TargetResponseTime \
--dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Average,Maximum
# Get Route53 health check status
aws cloudwatch get-metric-statistics \
--namespace AWS/Route53 \
--metric-name HealthCheckStatus \
--dimensions Name=HealthCheckId,Value=abc123-healthcheck \
--start-time $(date -u -d '12 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Minimum
Publishing Custom Metrics
# Publish simple custom metric
aws cloudwatch put-metric-data \
--namespace CustomApp/Business \
--metric-name OrdersProcessed \
--value 142 \
--timestamp $(date -u +%Y-%m-%dT%H:%M:%S)
# Publish metric with dimensions
aws cloudwatch put-metric-data \
--namespace CustomApp/API \
--metric-name ResponseTime \
--value 234 \
--unit Milliseconds \
--dimensions Environment=Production,Region=sa-east-1
# Publish multiple metrics (batch)
aws cloudwatch put-metric-data \
--namespace CustomApp/Database \
--metric-data \
MetricName=ActiveConnections,Value=45,Unit=Count \
MetricName=QueryTime,Value=123,Unit=Milliseconds \
MetricName=CacheHitRate,Value=89,Unit=Percent
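If the application already aggregates values client-side, a single call can publish a statistic set instead of individual data points, which keeps PutMetricData call volume (and cost) down. A minimal sketch with made-up numbers:
# Publish a pre-aggregated statistic set (one API call instead of many raw values)
aws cloudwatch put-metric-data \
  --namespace CustomApp/API \
  --metric-data 'MetricName=RequestLatency,Unit=Milliseconds,StatisticValues={SampleCount=250,Sum=31250,Minimum=18,Maximum=490}'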
# Publish with high resolution (1 second)
aws cloudwatch put-metric-data \
--namespace CustomApp/HighRes \
--metric-name Latency \
--value 78 \
--unit Milliseconds \
--storage-resolution 1
Creating Alarms
# Simple alarm: CPU over 80%
aws cloudwatch put-metric-alarm \
--alarm-name high-cpu-i-xxxxx \
--alarm-description "CPU above 80% for 10 minutes" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxxxx \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--treat-missing-data notBreaching
# Alarm with SNS notification
aws cloudwatch put-metric-alarm \
--alarm-name high-cpu-with-notification \
--alarm-description "Alert ops team when CPU high" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxxxx \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts \
--ok-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts
# Alarm for ALB (high latency)
aws cloudwatch put-metric-alarm \
--alarm-name alb-high-latency \
--alarm-description "Target response time over 500ms" \
--namespace AWS/ApplicationELB \
--metric-name TargetResponseTime \
--dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
--statistic Average \
--period 60 \
--evaluation-periods 3 \
--threshold 0.5 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts
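Averages can hide a slow tail. A variant of the alarm above (a sketch, same placeholder ARNs and names) uses a percentile via --extended-statistic instead of --statistic:
# Alarm on p99 latency instead of the average (tail latency)
aws cloudwatch put-metric-alarm \
  --alarm-name alb-high-p99-latency \
  --alarm-description "p99 target response time over 1 second" \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
  --extended-statistic p99 \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 1 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts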
# Alarm for Route53 health check
aws cloudwatch put-metric-alarm \
--alarm-name route53-primary-unhealthy \
--alarm-description "Primary region health check failed - failover activated" \
--namespace AWS/Route53 \
--metric-name HealthCheckStatus \
--dimensions Name=HealthCheckId,Value=abc123-healthcheck \
--statistic Minimum \
--period 60 \
--evaluation-periods 2 \
--threshold 1 \
--comparison-operator LessThanThreshold \
--alarm-actions arn:aws:sns:sa-east-1:123456789012:failover-alerts
# Alarm for error rate (5XX over 100 requests)
aws cloudwatch put-metric-alarm \
--alarm-name alb-high-error-rate \
--alarm-description "Too many 5XX errors from targets" \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 100 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts
# Alarm with "M of N" datapoints (more flexible)
aws cloudwatch put-metric-alarm \
--alarm-name cpu-high-3-of-5 \
--alarm-description "CPU high in 3 out of 5 datapoints" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxxxx \
--statistic Average \
--period 60 \
--evaluation-periods 5 \
--datapoints-to-alarm 3 \
--threshold 80 \
--comparison-operator GreaterThanThreshold
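Anomaly detection (see the concepts table) can also drive alarms: instead of a fixed threshold, the alarm fires when the metric leaves a band learned from its history. A hedged sketch; the instance ID and the band width (2 standard deviations) are placeholders, and creating the alarm also creates the detector model if it doesn't exist yet:
# Optional: create the anomaly detector explicitly
aws cloudwatch put-anomaly-detector \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-xxxxx \
  --stat Average
# Alarm when CPU goes above the expected band
aws cloudwatch put-metric-alarm \
  --alarm-name cpu-anomaly \
  --comparison-operator GreaterThanUpperThreshold \
  --evaluation-periods 3 \
  --threshold-metric-id band \
  --metrics '[
    {"Id": "m1", "ReturnData": true, "MetricStat": {"Metric": {"Namespace": "AWS/EC2", "MetricName": "CPUUtilization", "Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}]}, "Period": 300, "Stat": "Average"}},
    {"Id": "band", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)", "Label": "Expected CPU", "ReturnData": true}
  ]'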
# Composite Alarm (combines multiple alarms)
aws cloudwatch put-composite-alarm \
--alarm-name app-unhealthy \
--alarm-description "App is unhealthy if high CPU AND high error rate" \
--alarm-rule "ALARM(high-cpu-i-xxxxx) AND ALARM(alb-high-error-rate)" \
--alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts
Querying and Modifying Alarms
# List all alarms
aws cloudwatch describe-alarms
# View alarms in ALARM state
aws cloudwatch describe-alarms \
--state-value ALARM
# View specific alarm details
aws cloudwatch describe-alarms \
--alarm-names high-cpu-i-xxxxx
# View alarm history (state changes)
aws cloudwatch describe-alarm-history \
--alarm-name high-cpu-i-xxxxx \
--max-records 10
# Disable alarm temporarily
aws cloudwatch disable-alarm-actions \
--alarm-names high-cpu-i-xxxxx
# Re-enable alarm
aws cloudwatch enable-alarm-actions \
--alarm-names high-cpu-i-xxxxx
# Change state manually (testing)
aws cloudwatch set-alarm-state \
--alarm-name high-cpu-i-xxxxx \
--state-value ALARM \
--state-reason "Testing alarm notifications"
# Delete alarms
aws cloudwatch delete-alarms \
--alarm-names high-cpu-i-xxxxx alb-high-latency
Creating and Managing Logs
# Create log group
aws logs create-log-group \
--log-group-name /aws/ec2/myapp
# Configure retention (7 days)
aws logs put-retention-policy \
--log-group-name /aws/ec2/myapp \
--retention-in-days 7
# Create log stream
aws logs create-log-stream \
--log-group-name /aws/ec2/myapp \
--log-stream-name i-xxxxx-app.log
# Send log events (from application)
cat > log-events.json <<EOF
[
{
"timestamp": $(date +%s)000,
"message": "Application started successfully"
},
{
"timestamp": $(date +%s)000,
"message": "Connected to database"
}
]
EOF
aws logs put-log-events \
--log-group-name /aws/ec2/myapp \
--log-stream-name i-xxxxx-app.log \
--log-events file://log-events.json
# Tag log group (for cost allocation)
aws logs tag-log-group \
--log-group-name /aws/ec2/myapp \
--tags Environment=Production,Application=MyApp
Querying Logs
# List log groups
aws logs describe-log-groups
# List log streams in a group
aws logs describe-log-streams \
--log-group-name /aws/ec2/myapp \
--order-by LastEventTime \
--descending \
--max-items 10
# Tail logs (real-time)
aws logs tail /aws/ec2/myapp --follow
# Filter logs by pattern (last 24 hours)
aws logs filter-log-events \
--log-group-name /aws/ec2/myapp \
--start-time $(date -u -d '24 hours ago' +%s)000 \
--filter-pattern "ERROR"
# Filter with structured pattern
aws logs filter-log-events \
--log-group-name /aws/ec2/myapp \
--filter-pattern '[timestamp, level=ERROR, msg]' \
--max-items 50
# Search in multiple log groups (filter-log-events handles one group per call)
for lg in $(aws logs describe-log-groups --log-group-name-prefix /aws/ec2/ \
  --query 'logGroups[].logGroupName' --output text); do
  aws logs filter-log-events \
    --log-group-name "$lg" \
    --filter-pattern "database connection failed"
done
# Export logs to S3 (for later analysis)
aws logs create-export-task \
--log-group-name /aws/ec2/myapp \
--from $(date -u -d '7 days ago' +%s)000 \
--to $(date -u +%s)000 \
--destination myapp-logs-bucket \
--destination-prefix logs/2025/11
Logs Insights Queries
# Query: Top 10 most frequent errors
aws logs start-query \
--log-group-name /aws/ec2/myapp \
--start-time $(date -u -d '24 hours ago' +%s) \
--end-time $(date -u +%s) \
--query-string '
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by @message
| sort count desc
| limit 10
'
# Get query results
aws logs get-query-results --query-id xxxxx-yyyy-zzzz
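start-query returns a queryId immediately; results only appear once the query status is Complete, so scripts have to poll. A small sketch reusing the error query above:
QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/ec2/myapp \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20' \
  --output text --query 'queryId')
# Poll until the query finishes, then fetch the results
until [ "$(aws logs get-query-results --query-id "$QUERY_ID" --query 'status' --output text)" = "Complete" ]; do
  sleep 2
done
aws logs get-query-results --query-id "$QUERY_ID"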
# Query: Average latency by endpoint
aws logs start-query \
--log-group-name /aws/lambda/my-api \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string '
fields @timestamp, endpoint, duration
| stats avg(duration) as avg_latency, max(duration) as max_latency by endpoint
| sort avg_latency desc
'
# Query: Slow requests (over 500ms)
aws logs start-query \
--log-group-name /aws/ec2/myapp \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string '
fields @timestamp, @message, latency
| filter latency > 500
| sort latency desc
| limit 100
'
Creating Metric Filters from Logs
# Create metric filter to count errors
aws logs put-metric-filter \
--log-group-name /aws/ec2/myapp \
--filter-name ErrorCount \
--filter-pattern "[timestamp, level=ERROR, msg]" \
--metric-transformations \
metricName=ApplicationErrors,\
metricNamespace=CustomApp/Logs,\
metricValue=1,\
defaultValue=0
# Metric filter to extract latency from logs
aws logs put-metric-filter \
--log-group-name /aws/ec2/myapp \
--filter-name ResponseTime \
--filter-pattern "[timestamp, level, msg, latency]" \
--metric-transformations \
metricName=ResponseLatency,\
metricNamespace=CustomApp/Logs,\
metricValue='$latency',\
unit=Milliseconds
# List metric filters
aws logs describe-metric-filters \
--log-group-name /aws/ec2/myapp
# Delete metric filter
aws logs delete-metric-filter \
--log-group-name /aws/ec2/myapp \
--filter-name ErrorCount
Creating EventBridge Rules
# Create SNS topic for notifications
aws sns create-topic --name ec2-state-changes
# Subscribe email
aws sns subscribe \
--topic-arn arn:aws:sns:sa-east-1:123456789012:ec2-state-changes \
--protocol email \
--notification-endpoint ops@example.com
# Event pattern: EC2 instance terminated
cat > event-pattern-terminated.json <<EOF
{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"],
"detail": {
"state": ["terminated"]
}
}
EOF
# Create rule
aws events put-rule \
--name notify-instance-terminated \
--description "Notify when EC2 instance is terminated" \
--event-pattern file://event-pattern-terminated.json
# Add SNS as target
aws events put-targets \
--rule notify-instance-terminated \
--targets "Id"="1","Arn"="arn:aws:sns:sa-east-1:123456789012:ec2-state-changes"
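By default the target receives the raw event JSON. If the SNS message should be human-readable, an input transformer can reshape it; this is a sketch and the extracted paths and message wording are illustrative:
# Same rule, but turn the event into a readable notification
aws events put-targets \
  --rule notify-instance-terminated \
  --targets '[{
    "Id": "1",
    "Arn": "arn:aws:sns:sa-east-1:123456789012:ec2-state-changes",
    "InputTransformer": {
      "InputPathsMap": {"instance": "$.detail.instance-id", "state": "$.detail.state"},
      "InputTemplate": "\"EC2 instance <instance> is now <state>\""
    }
  }]'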
# Event pattern: Auto Scaling activities
cat > event-pattern-asg.json <<EOF
{
"source": ["aws.autoscaling"],
"detail-type": ["EC2 Instance Launch Successful", "EC2 Instance Terminate Successful"],
"detail": {
"AutoScalingGroupName": ["myapp-asg"]
}
}
EOF
aws events put-rule \
--name asg-scaling-events \
--event-pattern file://event-pattern-asg.json
# Scheduled rule (cron - daily at 2 AM UTC)
aws events put-rule \
--name daily-cleanup \
--schedule-expression "cron(0 2 * * ? *)" \
--description "Run cleanup Lambda daily at 2 AM UTC"
# Scheduled rule (rate - every 5 minutes)
aws events put-rule \
--name health-check-poller \
--schedule-expression "rate(5 minutes)" \
--description "Poll external health check every 5 minutes"
# Add Lambda as target
aws events put-targets \
--rule daily-cleanup \
--targets "Id"="1","Arn"="arn:aws:lambda:sa-east-1:123456789012:function:cleanup-function"
# Give EventBridge permission to invoke Lambda
aws lambda add-permission \
--function-name cleanup-function \
--statement-id AllowEventBridgeInvoke \
--action lambda:InvokeFunction \
--principal events.amazonaws.com \
--source-arn arn:aws:events:sa-east-1:123456789012:rule/daily-cleanup
Querying and Modifying EventBridge
# List rules
aws events list-rules
# View rule details
aws events describe-rule --name notify-instance-terminated
# List targets for a rule
aws events list-targets-by-rule --rule notify-instance-terminated
# Disable rule temporarily
aws events disable-rule --name daily-cleanup
# Re-enable rule
aws events enable-rule --name daily-cleanup
# Remove targets first
aws events remove-targets \
--rule daily-cleanup \
--ids "1"
# Then delete rule
aws events delete-rule --name daily-cleanup
Dashboards
# Create dashboard
cat > dashboard-body.json <<EOF
{
"widgets": [
{
"type": "metric",
"properties": {
"metrics": [
["AWS/EC2", "CPUUtilization", {"stat": "Average"}]
],
"period": 300,
"stat": "Average",
"region": "sa-east-1",
"title": "EC2 CPU Utilization"
}
},
{
"type": "metric",
"properties": {
"metrics": [
["AWS/ApplicationELB", "RequestCount", {"stat": "Sum"}]
],
"period": 60,
"stat": "Sum",
"region": "sa-east-1",
"title": "ALB Request Count"
}
}
]
}
EOF
aws cloudwatch put-dashboard \
--dashboard-name MyApp-Production \
--dashboard-body file://dashboard-body.json
# List dashboards
aws cloudwatch list-dashboards
# View dashboard
aws cloudwatch get-dashboard --dashboard-name MyApp-Production
# Delete dashboard
aws cloudwatch delete-dashboards --dashboard-names MyApp-Production
CloudWatch Agent Configuration
# Install CloudWatch Agent on EC2 (Amazon Linux 2)
sudo yum install amazon-cloudwatch-agent -y
# Create configuration
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/config.json > /dev/null <<EOF
{
"agent": {
"metrics_collection_interval": 60,
"run_as_user": "cwagent"
},
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/app.log",
"log_group_name": "/aws/ec2/myapp",
"log_stream_name": "{instance_id}-app.log",
"timezone": "UTC"
},
{
"file_path": "/var/log/nginx/error.log",
"log_group_name": "/aws/ec2/nginx",
"log_stream_name": "{instance_id}-error.log"
}
]
}
}
},
"metrics": {
"namespace": "CustomApp/System",
"metrics_collected": {
"mem": {
"measurement": [
{
"name": "mem_used_percent",
"rename": "MemoryUtilization",
"unit": "Percent"
}
],
"metrics_collection_interval": 60
},
"disk": {
"measurement": [
{
"name": "used_percent",
"rename": "DiskUtilization",
"unit": "Percent"
}
],
"metrics_collection_interval": 60,
"resources": ["/"]
}
}
}
}
EOF
# Start agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config \
-m ec2 \
-s \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json
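For a fleet of instances it is usually easier to keep the agent configuration in SSM Parameter Store and have the agent fetch it with the ssm: prefix. A sketch; the parameter name is an example and the instance role needs ssm:GetParameter:
# Store the config once (run from an admin machine)
aws ssm put-parameter \
  --name AmazonCloudWatch-myapp-config \
  --type String \
  --value file:///opt/aws/amazon-cloudwatch-agent/etc/config.json
# On each instance, load the config from SSM instead of a local file
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c ssm:AmazonCloudWatch-myapp-config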
# Verify status
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a query \
-m ec2 \
-s
# Stop agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a stop \
-m ec2 \
-s
Architecture and Flows
- Complete Multi-Region Observability
- Alarm Lifecycle Flow
- Logs Pipeline
- EventBridge Automation Flow
- Decision: Alarm Configuration
Best Practices Checklist
Observability Strategy
- Define key metrics: Identify critical SLIs (Service Level Indicators) for your application
- Implement Golden Signals: Monitor Latency, Traffic, Errors, Saturation (see the error-rate alarm sketch after this list)
- Structured logs: JSON logging with consistent fields (timestamp, level, message, context)
- Appropriate log levels: DEBUG in dev, INFO/WARN in staging, ERROR+ in prod
- Distributed tracing: Use X-Ray for microservices and latency debugging
- Custom business metrics: Don't just monitor infrastructure - track business KPIs (orders/min, revenue)
- Dashboard per environment: Separate Production, Staging, Development dashboards
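For the Golden Signals item above, alerting on the error rate (errors as a percentage of traffic) usually works better than an absolute 5XX count. A hedged metric-math sketch for an ALB; the load balancer name, SNS ARN, and 5% threshold are placeholders:
# Alarm when more than 5% of requests return 5XX in 2 of the last 3 five-minute periods
aws cloudwatch put-metric-alarm \
  --alarm-name alb-high-error-rate-pct \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --datapoints-to-alarm 2 \
  --threshold 5 \
  --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts \
  --metrics '[
    {"Id": "errors", "ReturnData": false, "MetricStat": {"Metric": {"Namespace": "AWS/ApplicationELB", "MetricName": "HTTPCode_Target_5XX_Count", "Dimensions": [{"Name": "LoadBalancer", "Value": "app/myapp-lb/xxxxx"}]}, "Period": 300, "Stat": "Sum"}},
    {"Id": "requests", "ReturnData": false, "MetricStat": {"Metric": {"Namespace": "AWS/ApplicationELB", "MetricName": "RequestCount", "Dimensions": [{"Name": "LoadBalancer", "Value": "app/myapp-lb/xxxxx"}]}, "Period": 300, "Stat": "Sum"}},
    {"Id": "errorRate", "Expression": "(errors / requests) * 100", "Label": "5XX error rate (%)", "ReturnData": true}
  ]'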
Alarming
- Prioritize alarms: Critical (pager), High (email), Medium (dashboard only)
- Avoid alarm fatigue: Only alert on what's truly important
- Evaluation periods over 1: Minimum 2 periods to avoid false positives from spikes
- Configure datapoints to alarm: Use M-of-N pattern (e.g., 3 of 5) for more flexibility
- SNS topics by severity: critical-alerts, warning-alerts, info-alerts
- Document runbooks: Each alarm has documentation on "what to do"
- Configure OK actions: Notify when alarm resolves too
- Test alarms regularly: Simulate alarm conditions monthly
- Composite alarms for correlation: Combine related signals (CPU + Memory + Disk)
Logs Management
- CloudWatch Agent on all instances: Centralize logs automatically
- Retention policy by log type: Debug 7 days, Errors 90 days, Audit 1+ year
- Metric filters for critical errors: Convert log patterns into metrics
- Save Logs Insights queries: Document common queries for quick troubleshooting
- Structured logging: JSON with standard fields makes queries easier
- Don't log secrets: Sanitize passwords, tokens, PII before logging
- Correlation IDs: Request ID in all logs for request tracing (see the cross-log-group query sketch after this list)
- Periodic export to S3: Old logs to S3 for compliance and cost optimization
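For the correlation-ID item: if every service logs the same request ID field, a single Logs Insights query can reconstruct one request across log groups. The field name requestId and the ID value are hypothetical:
aws logs start-query \
  --log-group-names /aws/ec2/myapp /aws/lambda/my-api \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string '
  fields @timestamp, @log, @message
  | filter requestId = "req-abc-123"
  | sort @timestamp asc
  '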
Cost Optimization
- Aggressive retention policy: Don't use 90 days for ALL logs
- Standard vs High-resolution: Only use high-res when critical
- Consolidated dashboards: Don't create a dashboard for each resource
- Delete obsolete alarms: Regular cleanup of alarms for deleted resources
- Metric filters instead of custom metrics: More economical to extract from logs
- Log sampling: Log sample of successful requests, all errors
- Subscription filters for export: Don't use Insights queries for batch analysis
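The subscription-filter item above refers to streaming matching log events to Kinesis Data Firehose (or Lambda) in near real time instead of re-scanning with Insights. A sketch; the Firehose delivery stream and IAM role must already exist and their ARNs are placeholders:
# Stream ERROR events to a Firehose delivery stream that lands in S3
aws logs put-subscription-filter \
  --log-group-name /aws/ec2/myapp \
  --filter-name errors-to-firehose \
  --filter-pattern "ERROR" \
  --destination-arn arn:aws:firehose:sa-east-1:123456789012:deliverystream/logs-to-s3 \
  --role-arn arn:aws:iam::123456789012:role/CWLtoFirehoseRole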
Security and Compliance
- Encryption at rest: Sensitive logs with KMS encryption (see the associate-kms-key example after this list)
- Restrictive IAM: Only certain roles can view production logs
- Audit trail: CloudTrail logs of who accesses/modifies alarms and dashboards
- Log retention compliance: Respect GDPR, HIPAA, SOX requirements
- PII redaction: Automatic redaction of sensitive information
- Cross-account logs: Centralize logs from multiple accounts in security account
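For the encryption-at-rest item, a log group can be associated with a customer-managed KMS key. A sketch; the key ARN is a placeholder and its key policy must allow the CloudWatch Logs service principal:
aws logs associate-kms-key \
  --log-group-name /aws/ec2/myapp \
  --kms-key-id arn:aws:kms:sa-east-1:123456789012:key/11111111-2222-3333-4444-555555555555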
Automation
- EventBridge for remediation: Auto-scaling, restart services, failover (see the alarm-state rule sketch after this list)
- Lambda triggered by alarms: Automatic actions (snapshot before termination)
- Infrastructure as Code: Alarms and dashboards in Terraform/CloudFormation
- CI/CD integration: Deployments automatically create/update alarms
- Scheduled maintenance windows: Disable alarms during maintenance with EventBridge
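For the EventBridge remediation item, alarm state changes are themselves events, so a rule can invoke a remediation Lambda whenever an alarm enters ALARM. A sketch with placeholder names; the function also needs the events.amazonaws.com permission shown earlier:
cat > alarm-state-pattern.json <<EOF
{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "state": {"value": ["ALARM"]}
  }
}
EOF
aws events put-rule \
  --name alarm-auto-remediation \
  --event-pattern file://alarm-state-pattern.json
aws events put-targets \
  --rule alarm-auto-remediation \
  --targets "Id"="1","Arn"="arn:aws:lambda:sa-east-1:123456789012:function:remediation-function"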
Common Mistakes to Avoid
Evaluation Periods = 1 Leading to Alarm Fatigue
Why it happens: Creating alarm with --evaluation-periods 1 thinking "I want to know immediately."
The real problem: CPU/Latency have natural 1-2 minute spikes that are normal. With evaluation=1, each spike triggers an alarm, causing the team to ignore alerts and creating alarm fatigue. When there's a real problem, nobody responds.
Typical scenario:
# BAD: Evaluation = 1
aws cloudwatch put-metric-alarm \
--alarm-name cpu-high \
--evaluation-periods 1 \
--period 60 \
--threshold 80
# Result: Alarms every hour from temporary spikes
# 10:00 AM - ALARM (CPU spike to 85% from deployment)
# 10:05 AM - OK
# 11:30 AM - ALARM (garbage collection spike)
# 11:32 AM - OK
# ... team stops paying attention
How to avoid it:
# GOOD: Evaluation = 2-3 with datapoints-to-alarm
aws cloudwatch put-metric-alarm \
--alarm-name cpu-sustained-high \
--evaluation-periods 3 \
--datapoints-to-alarm 2 \
--period 300 \
--threshold 80
# Meaning: "Alarm if CPU over 80% in 2 of the last 3 periods of 5 min"
# Tolerates a momentary spike, detects sustained problems
Golden rule: Evaluation periods >= 2 for volatile metrics (CPU, latency). Only evaluation=1 for binary metrics (health check status).
Not Configuring Retention Policy Leading to Unexpected Costs
Why it happens: Creating a log group, forgetting to configure retention, logs accumulate infinitely.
Financial impact:
Application logging 10 GB/day
Without retention, the stored volume keeps growing: after 30 days you hold 300 GB, costing 300 GB x $0.03/GB = $9/month
After a year you hold ~3,650 GB, costing ~$109.50 every month (and still climbing)
With 7-day retention you hold only ~70 GB: 70 GB x $0.03/GB = $2.10/month
Savings after the first year: over $100/month
How to avoid it:
# ALWAYS configure retention when creating log group
aws logs create-log-group --log-group-name /aws/ec2/myapp
aws logs put-retention-policy \
--log-group-name /aws/ec2/myapp \
--retention-in-days 7 # Debug logs
# For critical logs
aws logs put-retention-policy \
--log-group-name /aws/ec2/myapp-errors \
--retention-in-days 90
# For audit logs (compliance)
aws logs put-retention-policy \
--log-group-name /aws/audit \
--retention-in-days 365
Monthly verification script:
# List log groups without retention configured
aws logs describe-log-groups \
--query 'logGroups[?!retentionInDays].logGroupName' \
--output table
# If there are results, configure retention immediately
Using Sum Instead of Average for Latency
Why it happens: Confusion about when to use each statistic.
Technical problem:
# BAD: Sum for latency
aws cloudwatch put-metric-alarm \
--metric-name TargetResponseTime \
--statistic Sum \
--period 300 \
--threshold 500
# If you have 1000 requests in 5 min with avg 200ms each:
# Sum = 200ms x 1000 = 200,000ms
# Alarm triggers because 200,000 over 500 (meaningless)
# GOOD: Average for latency
--statistic Average
# Average = 200,000ms / 1000 requests = 200ms
# Alarm does NOT trigger because 200 under 500 (correct)
Statistic selection rule:
Average: Latency, CPU%, Memory%, ratio metrics
Sum: Request count, error count, bytes transferred
Maximum: Worst-case latency (p100), disk queue depth
Minimum: Health checks (0 or 1)
SampleCount: Verify there's data
Logs Without Structured Format Making Queries Impossible
Why it happens: Logging plain text without structure.
Plain text log (BAD):
2025-11-16 15:30:45 User john@email.com placed order 12345 for $99.50 ERROR payment failed
Problem: How do you extract "all orders over $100"? How do you filter by user? Complex and fragile regex.
JSON structured log (GOOD):
{
"timestamp": "2025-11-16T15:30:45Z",
"level": "ERROR",
"user": "john@email.com",
"orderId": "12345",
"amount": 99.50,
"currency": "USD",
"event": "payment_failed",
"errorCode": "CARD_DECLINED"
}
Now queries are trivial:
# Logs Insights query
fields @timestamp, user, orderId, amount
| filter event = "payment_failed" and amount > 100
| stats count() by errorCode
Alarm Without SNS Action Creating Silent Alarms
Why it happens: Creating alarm only with threshold, forgetting to add --alarm-actions.
Problem: Alarm changes to ALARM state but nobody finds out. You discover the problem 2 days later while manually reviewing the dashboard.
Useless alarm (BAD):
aws cloudwatch put-metric-alarm \
--alarm-name critical-cpu \
--threshold 90 \
--evaluation-periods 2
# Missing --alarm-actions
Alarm that notifies (GOOD):
# 1. Create SNS topic
aws sns create-topic --name critical-alerts
# 2. Subscribe email
aws sns subscribe \
--topic-arn arn:aws:sns:sa-east-1:123456789012:critical-alerts \
--protocol email \
--notification-endpoint oncall@company.com
# 3. Create alarm WITH action
aws cloudwatch put-metric-alarm \
--alarm-name critical-cpu \
--threshold 90 \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts \
--ok-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts
Unnecessary High-Resolution Metrics
Why it happens: "More granularity is better, right?"
Cost impact:
Standard metric (60s): $0.30/month
High-resolution (1s): $0.30/month (same cost per metric)
BUT:
High-res alarm: $0.30/month (vs $0.10 standard)
High-res storage: 60x more data points
GetMetricStatistics: 60x more API calls ($0.01/1000)
When to use each:
Standard (60s):
- General monitoring (CPU, disk, memory)
- Business metrics (orders/hour, revenue/day)
- Most use cases
High-resolution (1s):
- Financial trading systems
- Real-time gaming leaderboards
- Sub-minute auto-scaling (Lambda, containers)
- NOT for web application monitoring (overkill)
CloudWatch Agent Without IAM Role Leading to Missing Logs
Why it happens: Installing CloudWatch Agent on EC2 but not granting permissions.
Symptom: Agent is running, but logs don't appear in CloudWatch Logs.
Debugging:
# On the EC2 instance
sudo tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
# Typical error:
# AccessDeniedException: User is not authorized to perform: logs:CreateLogGroup
How to avoid it: IAM Instance Profile with necessary permissions
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogStreams",
"cloudwatch:PutMetricData",
"ec2:DescribeVolumes",
"ec2:DescribeTags"
],
"Resource": "*"
}
]
}
Verify before deploying agent:
# On EC2 instance, verify it has a role
curl http://169.254.169.254/latest/meta-data/iam/info
# Should return role info
# If returns 404, instance does NOT have role, agent will fail
Not Testing Alarms Before Disaster
Why it happens: "The configuration looks good, it must work."
Problem: Real incident arrives, you discover that:
- SNS topic misconfigured (email not confirmed)
- Threshold too high (alarm never triggers)
- Evaluation period too long (detects problem 20 min late)
- Lambda target without permissions (EventBridge can't invoke)
How to avoid it: Monthly alarm testing
# 1. Test alarm manually
aws cloudwatch set-alarm-state \
--alarm-name critical-cpu \
--state-value ALARM \
--state-reason "Manual test - monthly drill"
# 2. Verify you receive email/Slack notification
# 3. For EventBridge rules, manual trigger
aws events put-events \
--entries '[{
"Source": "custom.test",
"DetailType": "Manual Test",
"Detail": "{\"test\": true}"
}]'
# 4. Verify Lambda executed
aws logs tail /aws/lambda/my-response-function --since 5m
Cost Considerations
What Generates Costs in CloudWatch
| Concept | Cost | Unit | Free Tier |
|---|---|---|---|
| Metrics - Standard (AWS services) | FREE | Unlimited | Permanent |
| Metrics - Custom | $0.30/month | Per metric | 10 metrics free |
| Metrics - High-Resolution Custom | $0.30/month | Per metric | Not included |
| API Requests (GetMetricStatistics) | $0.01 | Per 1,000 requests | 1M free/month |
| Dashboard | $3/month | Per dashboard | 3 dashboards free |
| Alarms - Standard | $0.10/month | Per alarm | 10 alarms free |
| Alarms - High-Resolution | $0.30/month | Per alarm | Not included |
| Alarms - Composite | $0.50/month | Per alarm | Not included |
| Logs - Ingestion | $0.50 | Per GB ingested | 5 GB free/month |
| Logs - Storage | $0.03 | Per GB-month | 5 GB free/month |
| Logs - Archive (S3 export) | S3 pricing | See S3 costs | See S3 free tier |
| Logs Insights - Queries | $0.005 | Per GB scanned | Included in free tier logs |
| EventBridge - AWS service events (default bus) | FREE | Unlimited | Yes |
| EventBridge - Custom events published | $1.00 | Per 1M events | Not included |
| Anomaly Detection | $0.30/month | Per metric monitored | Not included |
Real Application Cost Example
Scenario: Multi-region web app with complete monitoring
Infrastructure:
- 10 EC2 instances (5 per region)
- 2 ALB (1 per region)
- 2 RDS instances
- Auto-Scaling Groups
- Route53 health checks
CloudWatch Setup:
- AWS service metrics (EC2, ALB, RDS, ASG): FREE
- 5 custom business metrics (orders/min, revenue, etc.): 5 x $0.30 = $1.50/month
- 20 standard alarms (CPU, latency, errors, etc.): 20 x $0.10 = $2.00/month
- 2 composite alarms (app-healthy): 2 x $0.50 = $1.00/month
- 1 dashboard (production overview): $3.00/month
Logs:
- 50 GB/month ingestion: (50 - 5 free) x $0.50 = $22.50/month
- Average 200 GB stored (30 day retention): (200 - 5 free) x $0.03 = $5.85/month
- Logs Insights queries: ~10 GB/month scanned: 10 x $0.005 = $0.05/month
EventBridge:
- ~5M events/month from AWS services (EC2, ASG state changes): FREE (AWS service events on the default bus are free; custom events would add $1.00 per million)
TOTAL MONTHLY: $35.90/month
Breakdown by category:
Custom metrics: $1.50 (4%)
Alarms: $3.00 (8%)
Dashboard: $3.00 (8%)
Logs ingestion: $22.50 (63%) - Largest cost
Logs storage: $5.85 (16%)
Logs queries: $0.05 (under 1%)
EventBridge: $0.00 (0%)
Cost Optimization Strategies
1. Logs - Greatest savings opportunity
# EXPENSIVE: 90 day retention for ALL logs
# 50 GB/day ingested with 90-day retention = 4,500 GB stored x $0.03/GB = $135/month
# ECONOMICAL: Differentiated retention by type
# Debug logs: 7 days
aws logs put-retention-policy \
--log-group-name /aws/ec2/myapp-debug \
--retention-in-days 7
# Application logs: 30 days
aws logs put-retention-policy \
--log-group-name /aws/ec2/myapp \
--retention-in-days 30
# Error logs: 90 days
aws logs put-retention-policy \
--log-group-name /aws/ec2/myapp-errors \
--retention-in-days 90
# Audit logs: 1 year (compliance)
aws logs put-retention-policy \
--log-group-name /aws/audit \
--retention-in-days 365
# Savings: ~60% in storage costs
2. Log sampling for successful requests
import logging
import random

logger = logging.getLogger(__name__)

def log_request(status_code, latency):
    # Always log errors
    if status_code >= 400:
        logger.error(f"Error {status_code}, latency {latency}ms")
        return
    # Sample 10% of successful requests
    if random.random() < 0.1:
        logger.info(f"Success {status_code}, latency {latency}ms")
# Log reduction: ~90% less data
# Maintain visibility of problems (100% of errors logged)
# Still see trends (10% sample is statistically significant)
3. Metric Filters instead of Custom Metrics
# EXPENSIVE: Publish custom metric from code
# Code: cloudwatch.put_metric_data(...)
# Cost: $0.30/month per custom metric
# ECONOMICAL: Metric filter on existing logs
aws logs put-metric-filter \
--log-group-name /aws/ec2/myapp \
--filter-name ErrorCount \
--filter-pattern "[timestamp, level=ERROR, msg]" \
--metric-transformations \
metricName=ApplicationErrors,\
metricNamespace=CustomApp/Logs,\
metricValue=1
# Cost: $0 (you're already paying for log ingestion)
# Same result: error metric
4. Consolidated dashboards
# EXPENSIVE: Dashboard per resource
# Dashboard EC2 Instance 1: $3/month
# Dashboard EC2 Instance 2: $3/month
# ... x 10 instances = $30/month
# ECONOMICAL: Aggregated dashboard
# 1 Dashboard "Production Overview" with all instances: $3/month
# Savings: $27/month
Integration with Other Services
| AWS Service | How It Integrates | Typical Use Case |
|---|---|---|
| EC2 | CloudWatch Agent sends metrics and logs | Memory, disk, application logs monitoring |
| Auto Scaling | Alarms trigger scaling policies | Scale out when CPU over 80% |
| ALB/NLB | Native metrics (RequestCount, Latency, 5XX) | Alarm on high error rate |
| RDS | Native metrics (CPUUtilization, FreeStorageSpace, Connections) | Alarm when storage under 10GB |
| Lambda | Native metrics and logs (Invocations, Errors, Duration) | Alarm on error rate, log analysis |
| S3 | Request metrics, storage metrics | Monitor bucket size, request patterns |
| Route53 | Health check metrics (HealthCheckStatus) | Alarm on failover events |
| SNS | Alarm notifications via topics | Email, SMS, Lambda triggers |
| SQS | Queue metrics (ApproximateNumberOfMessages) | Scale workers based on queue depth |
| API Gateway | Native metrics (Count, Latency, 4XX, 5XX) | Monitor API performance |
| ECS/EKS | Container metrics and logs | Monitor containerized applications |
| Step Functions | Execution metrics | Monitor workflow success/failure |
| X-Ray | Distributed tracing integration | End-to-end request tracing |
| Systems Manager | Run Command integration | Auto-remediation actions |
| EventBridge | Event-driven automation | React to resource state changes |
| CloudTrail | Audit logs | Security and compliance monitoring |
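As one example of the Auto Scaling integration above, a hedged sketch of wiring a CPU alarm directly to a step scaling policy (the ASG name, threshold, and region are placeholders):
# Create a step scaling policy and capture its ARN
POLICY_ARN=$(aws autoscaling put-scaling-policy \
  --auto-scaling-group-name myapp-asg \
  --policy-name scale-out-on-cpu \
  --policy-type StepScaling \
  --adjustment-type ChangeInCapacity \
  --step-adjustments MetricIntervalLowerBound=0,ScalingAdjustment=1 \
  --query 'PolicyARN' --output text)
# Alarm on ASG-wide CPU and use the policy ARN as the alarm action
aws cloudwatch put-metric-alarm \
  --alarm-name asg-scale-out-on-cpu \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=AutoScalingGroupName,Value=myapp-asg \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "$POLICY_ARN"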
Additional Resources
Official AWS Documentation
- CloudWatch Developer Guide
- CloudWatch Logs Developer Guide
- EventBridge Developer Guide
- CloudWatch Agent Configuration
- CloudWatch Pricing
Whitepapers and Best Practices
- AWS Well-Architected Framework - Operational Excellence Pillar
- Monitoring and Observability Best Practices
- AWS Observability Best Practices
Hands-On Tutorials
- CloudWatch Workshop
- One Observability Workshop
- Building Dashboards
- CloudWatch Logs Insights Tutorial