Amazon CloudWatch Essentials

CloudWatch essentials, metrics, logs, alarms, EventBridge, and best practices

@geomena · Sun Aug 10, 2025 · 1,059 views

Amazon CloudWatch is AWS's observability service that collects, monitors, and analyzes metrics, logs, and events from your infrastructure and applications. It acts as the "nervous system" of your architecture, providing complete visibility into the state of your resources in real-time.

The problem it solves: it eliminates operational opacity. Instead of discovering problems when users report them, CloudWatch lets you monitor proactively, detect anomalies, automate responses, and keep a performance history. It transforms reactive debugging into proactive observability.

When to use it: ALWAYS in production. CloudWatch is fundamental to any serious AWS architecture. Without observability you're operating blind: you don't know whether your resources are healthy, what caused an incident, or when you need to scale.

Alternatives: DataDog, New Relic, Prometheus+Grafana, ELK Stack (Elasticsearch+Logstash+Kibana). CloudWatch has the advantage of native integration with AWS services and doesn't require additional infrastructure setup, but third-party tools offer more advanced analytics and better UX.

Key Concepts

Metric: Numeric data that varies over time, represented as a time series (timestamp + value). Examples: CPUUtilization, RequestCount, DiskReadOps. Each metric belongs to a namespace (AWS/EC2, AWS/ApplicationELB).
Namespace: Container that groups related metrics. AWS uses namespaces like AWS/EC2 and AWS/RDS. You can create custom namespaces for your application metrics (e.g., CustomApp/Business).
Dimension: Key-value pair that identifies a metric variation (e.g., InstanceId=i-xxxxx, LoadBalancer=app/myapp-lb/xxx). Allows filtering and aggregating metrics.
Statistic: Aggregation of data points over a period (Average, Sum, Minimum, Maximum, SampleCount). Defines how the reported value is calculated.
Period: Time interval over which statistics are calculated (60s, 300s, 3600s). Standard metrics have a 60s minimum; high-resolution metrics can go down to 1s.
Alarm: Automated monitor that compares a metric against a threshold and executes actions when it is crossed. Has 3 states: OK, ALARM, INSUFFICIENT_DATA.
Alarm State: Current alarm state: OK (threshold not breached), ALARM (threshold breached), INSUFFICIENT_DATA (not enough data to evaluate).
Evaluation Period: Number of consecutive periods that must meet the condition before the alarm changes state. Prevents false positives from temporary spikes.
Datapoints to Alarm: Number of data points within the evaluation periods that must violate the threshold to trigger the alarm (e.g., "3 of 5" means 3 bad data points in 5 periods).
Composite Alarm: Alarm that combines multiple alarms using boolean logic (AND/OR). Allows complex alerts based on multiple conditions.
Log Group: Container for related log streams (e.g., /aws/lambda/my-function, /aws/ec2/myapp). Defines retention policy and encryption settings.
Log Stream: Sequence of log events from a single source (one EC2 instance, one Lambda invocation). A log group can have multiple streams.
Log Event: Individual log message with timestamp and content. Can be plain text or structured JSON.
Metric Filter: Search pattern over logs that extracts data and publishes it as a metric. Allows creating custom metrics from logs without modifying code.
Logs Insights: Interactive query service for analyzing logs with a SQL-like, pipe-based language. Supports aggregations, filters, regex, and statistics over large log volumes.
Log Retention: Period CloudWatch stores logs before automatically deleting them (1 day to 10 years, or indefinite). Critical for compliance and cost optimization.
EventBridge: Event bus service that captures AWS resource state changes and enables event-based automation.
Event Pattern: JSON that defines which events to capture (e.g., "EC2 instance terminated in a specific ASG"). Uses pattern matching against the event structure.
Event Rule: Combination of event pattern + target action. Defines "when X happens, do Y".
Event Target: Destination of an event (Lambda, SNS, SQS, Step Functions, etc.). A rule can have multiple targets.
Scheduled Expression: Cron or rate expression for periodic events (e.g., cron(0 2 * * ? *) = 2 AM daily, rate(5 minutes) = every 5 minutes).
CloudWatch Agent: Daemon that runs on EC2/on-premises hosts to send custom metrics and logs to CloudWatch. Enables monitoring of memory, disk space, and custom processes.
Dashboard: Customizable visualization of metrics in charts, numbers, and text. Provides a consolidated view of application health.
Widget: Individual dashboard component (line chart, number, log widget, alarm status). Configured with specific metrics and a time period.
Anomaly Detection: Machine learning that learns normal metric patterns and builds expected-value bands. Alarms can be based on deviations from the band.

Essential AWS CLI Commands

Querying Metrics

# List available metrics for a namespace
aws cloudwatch list-metrics \
    --namespace AWS/EC2

# List metrics for a specific instance
aws cloudwatch list-metrics \
    --namespace AWS/EC2 \
    --dimensions Name=InstanceId,Value=i-xxxxx

# Get CPU statistics (last 24 hours, 5 min periods)
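# (The -d '…' syntax below is GNU date; on macOS use gdate from coreutils or date -v-24H.)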
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 300 \
    --statistics Average,Maximum \
    --output table

# Get multiple metrics with get-metric-data (more efficient)
cat > metric-queries.json <<EOF
[
  {
    "Id": "cpu",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}]
      },
      "Period": 300,
      "Stat": "Average"
    }
  },
  {
    "Id": "network",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/EC2",
        "MetricName": "NetworkIn",
        "Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}]
      },
      "Period": 300,
      "Stat": "Sum"
    }
  }
]
EOF

aws cloudwatch get-metric-data \
    --metric-data-queries file://metric-queries.json \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S)

# Get ALB metrics (request count, latency, errors)
aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 60 \
    --statistics Average,Maximum

# Get Route53 health check status
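# Note: Route 53 is a global service; its health check metrics are published
# in us-east-1, so add --region us-east-1 if your default region differs.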
aws cloudwatch get-metric-statistics \
    --namespace AWS/Route53 \
    --metric-name HealthCheckStatus \
    --dimensions Name=HealthCheckId,Value=abc123-healthcheck \
    --start-time $(date -u -d '12 hours ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 60 \
    --statistics Minimum

Publishing Custom Metrics

# Publish simple custom metric
aws cloudwatch put-metric-data \
    --namespace CustomApp/Business \
    --metric-name OrdersProcessed \
    --value 142 \
    --timestamp $(date -u +%Y-%m-%dT%H:%M:%S)

# Publish metric with dimensions
aws cloudwatch put-metric-data \
    --namespace CustomApp/API \
    --metric-name ResponseTime \
    --value 234 \
    --unit Milliseconds \
    --dimensions Environment=Production,Region=sa-east-1

# Publish multiple metrics (batch)
aws cloudwatch put-metric-data \
    --namespace CustomApp/Database \
    --metric-data \
        MetricName=ActiveConnections,Value=45,Unit=Count \
        MetricName=QueryTime,Value=123,Unit=Milliseconds \
        MetricName=CacheHitRate,Value=0.89,Unit=Percent

# Publish with high resolution (1 second)
aws cloudwatch put-metric-data \
    --namespace CustomApp/HighRes \
    --metric-name Latency \
    --value 78 \
    --unit Milliseconds \
    --storage-resolution 1
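
A minimal publishing-loop sketch: the my_app_stats helper is hypothetical (stand in your own data source); the put-metric-data call reuses the forms shown above.

# Publish a business metric every 60 seconds (sketch)
while true; do
    VALUE=$(my_app_stats --pending-orders)   # hypothetical helper command
    aws cloudwatch put-metric-data \
        --namespace CustomApp/Business \
        --metric-name PendingOrders \
        --value "$VALUE" \
        --unit Count
    sleep 60
done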

Creating Alarms

# Simple alarm: CPU over 80%
aws cloudwatch put-metric-alarm \
    --alarm-name high-cpu-i-xxxxx \
    --alarm-description "CPU above 80% for 10 minutes" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching

# Alarm with SNS notification
aws cloudwatch put-metric-alarm \
    --alarm-name high-cpu-with-notification \
    --alarm-description "Alert ops team when CPU high" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts \
    --ok-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts

# Alarm for ALB (high latency)
aws cloudwatch put-metric-alarm \
    --alarm-name alb-high-latency \
    --alarm-description "Target response time over 500ms" \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
    --statistic Average \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 0.5 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts

# Alarm for Route53 health check
aws cloudwatch put-metric-alarm \
    --alarm-name route53-primary-unhealthy \
    --alarm-description "Primary region health check failed - failover activated" \
    --namespace AWS/Route53 \
    --metric-name HealthCheckStatus \
    --dimensions Name=HealthCheckId,Value=abc123-healthcheck \
    --statistic Minimum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:failover-alerts

# Alarm for error rate (5XX over 100 requests)
aws cloudwatch put-metric-alarm \
    --alarm-name alb-high-error-rate \
    --alarm-description "Too many 5XX errors from targets" \
    --namespace AWS/ApplicationELB \
    --metric-name HTTPCode_Target_5XX_Count \
    --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 100 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts

# Alarm with "M of N" datapoints (more flexible)
aws cloudwatch put-metric-alarm \
    --alarm-name cpu-high-3-of-5 \
    --alarm-description "CPU high in 3 out of 5 datapoints" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --statistic Average \
    --period 60 \
    --evaluation-periods 5 \
    --datapoints-to-alarm 3 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold

# Composite Alarm (combines multiple alarms)
aws cloudwatch put-composite-alarm \
    --alarm-name app-unhealthy \
    --alarm-description "App is unhealthy if high CPU AND high error rate" \
    --alarm-rule "ALARM(high-cpu-i-xxxxx) AND ALARM(alb-high-error-rate)" \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts
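
A common pattern: attach notification actions only to the composite and create the child alarms without --alarm-actions, so individual signals can flap without paging anyone. Another sketch combining earlier alarms with OR logic:

aws cloudwatch put-composite-alarm \
    --alarm-name app-degraded \
    --alarm-rule "ALARM(high-cpu-i-xxxxx) OR ALARM(alb-high-latency)" \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts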

Querying and Modifying Alarms

# List all alarms
aws cloudwatch describe-alarms

# View alarms in ALARM state
aws cloudwatch describe-alarms \
    --state-value ALARM

# View specific alarm details
aws cloudwatch describe-alarms \
    --alarm-names high-cpu-i-xxxxx

# View alarm history (state changes)
aws cloudwatch describe-alarm-history \
    --alarm-name high-cpu-i-xxxxx \
    --max-records 10

# Disable alarm temporarily
aws cloudwatch disable-alarm-actions \
    --alarm-names high-cpu-i-xxxxx

# Re-enable alarm
aws cloudwatch enable-alarm-actions \
    --alarm-names high-cpu-i-xxxxx

# Change state manually (testing)
aws cloudwatch set-alarm-state \
    --alarm-name high-cpu-i-xxxxx \
    --state-value ALARM \
    --state-reason "Testing alarm notifications"

# Delete alarms
aws cloudwatch delete-alarms \
    --alarm-names high-cpu-i-xxxxx alb-high-latency

Creating and Managing Logs

# Create log group
aws logs create-log-group \
    --log-group-name /aws/ec2/myapp

# Configure retention (7 days)
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp \
    --retention-in-days 7

# Create log stream
aws logs create-log-stream \
    --log-group-name /aws/ec2/myapp \
    --log-stream-name i-xxxxx-app.log

# Send log events (from application)
cat > log-events.json <<EOF
[
  {
    "timestamp": $(date +%s)000,
    "message": "Application started successfully"
  },
  {
    "timestamp": $(date +%s)000,
    "message": "Connected to database"
  }
]
EOF

aws logs put-log-events \
    --log-group-name /aws/ec2/myapp \
    --log-stream-name i-xxxxx-app.log \
    --log-events file://log-events.json
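
# Note: current CloudWatch Logs no longer requires a --sequence-token between
# successive put-log-events calls (older examples often include one).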

# Tag log group (for cost allocation)
aws logs tag-log-group \
    --log-group-name /aws/ec2/myapp \
    --tags Environment=Production,Application=MyApp

Querying Logs

# List log groups
aws logs describe-log-groups

# List log streams in a group
aws logs describe-log-streams \
    --log-group-name /aws/ec2/myapp \
    --order-by LastEventTime \
    --descending \
    --max-items 10

# Tail logs (real-time)
aws logs tail /aws/ec2/myapp --follow

# Filter logs by pattern (last 24 hours)
aws logs filter-log-events \
    --log-group-name /aws/ec2/myapp \
    --start-time $(date -u -d '24 hours ago' +%s)000 \
    --filter-pattern "ERROR"

# Filter with structured pattern
aws logs filter-log-events \
    --log-group-name /aws/ec2/myapp \
    --filter-pattern '[timestamp, level=ERROR, msg]' \
    --max-items 50

# Search in multiple log groups
aws logs filter-log-events \
    --log-group-name-prefix /aws/ec2/ \
    --filter-pattern "database connection failed"

# Export logs to S3 (for later analysis)
aws logs create-export-task \
    --log-group-name /aws/ec2/myapp \
    --from $(date -u -d '7 days ago' +%s)000 \
    --to $(date -u +%s)000 \
    --destination myapp-logs-bucket \
    --destination-prefix logs/2025/11/
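
Export tasks are asynchronous, only one can be active per account at a time, and the destination bucket needs a policy allowing the CloudWatch Logs service to write to it. A sketch for checking progress (the task ID comes from the create-export-task output):

aws logs describe-export-tasks \
    --task-id "task-id-from-create-export-task" \
    --query 'exportTasks[0].status.code'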

Logs Insights Queries

# Query: Top 10 most frequent errors
aws logs start-query \
    --log-group-name /aws/ec2/myapp \
    --start-time $(date -u -d '24 hours ago' +%s) \
    --end-time $(date -u +%s) \
    --query-string '
        fields @timestamp, @message
        | filter @message like /ERROR/
        | stats count() as error_count by @message
        | sort error_count desc
        | limit 10
    '

# Get query results
aws logs get-query-results --query-id xxxxx-yyyy-zzzz
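
start-query returns only a queryId; results are available once the status reaches Complete. A minimal polling sketch built from the commands above:

QUERY_ID=$(aws logs start-query \
    --log-group-name /aws/ec2/myapp \
    --start-time $(date -u -d '1 hour ago' +%s) \
    --end-time $(date -u +%s) \
    --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20' \
    --query 'queryId' --output text)

while [ "$(aws logs get-query-results --query-id "$QUERY_ID" \
        --query 'status' --output text)" != "Complete" ]; do
    sleep 2
done

aws logs get-query-results --query-id "$QUERY_ID"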

# Query: Average latency by endpoint
aws logs start-query \
    --log-group-name /aws/lambda/my-api \
    --start-time $(date -u -d '1 hour ago' +%s) \
    --end-time $(date -u +%s) \
    --query-string '
        fields @timestamp, endpoint, duration
        | stats avg(duration) as avg_latency, max(duration) as max_latency by endpoint
        | sort avg_latency desc
    '

# Query: Slow requests (over 500ms)
aws logs start-query \
    --log-group-name /aws/ec2/myapp \
    --start-time $(date -u -d '1 hour ago' +%s) \
    --end-time $(date -u +%s) \
    --query-string '
        fields @timestamp, @message, latency
        | filter latency > 500
        | sort latency desc
        | limit 100
    '

Creating Metric Filters from Logs

# Create metric filter to count errors
aws logs put-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ErrorCount \
    --filter-pattern "[timestamp, level=ERROR, msg]" \
    --metric-transformations metricName=ApplicationErrors,metricNamespace=CustomApp/Logs,metricValue=1,defaultValue=0

# Metric filter to extract latency from logs
aws logs put-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ResponseTime \
    --filter-pattern "[timestamp, level, msg, latency]" \
    --metric-transformations metricName=ResponseLatency,metricNamespace=CustomApp/Logs,metricValue='$latency',unit=Milliseconds
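
A filter-generated metric only emits data points when matching events arrive, so alarms on it should treat missing data as healthy. A sketch pairing the ApplicationErrors metric from above with an alarm (the SNS ARN is a placeholder):

aws cloudwatch put-metric-alarm \
    --alarm-name app-errors-from-logs \
    --namespace CustomApp/Logs \
    --metric-name ApplicationErrors \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 10 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts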

# List metric filters
aws logs describe-metric-filters \
    --log-group-name /aws/ec2/myapp

# Delete metric filter
aws logs delete-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ErrorCount

Creating EventBridge Rules

# Create SNS topic for notifications
aws sns create-topic --name ec2-state-changes

# Subscribe email
aws sns subscribe \
    --topic-arn arn:aws:sns:sa-east-1:123456789012:ec2-state-changes \
    --protocol email \
    --notification-endpoint ops@example.com

# Event pattern: EC2 instance terminated
cat > event-pattern-terminated.json <<EOF
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["terminated"]
  }
}
EOF

# Create rule
aws events put-rule \
    --name notify-instance-terminated \
    --description "Notify when EC2 instance is terminated" \
    --event-pattern file://event-pattern-terminated.json

# Add SNS as target
aws events put-targets \
    --rule notify-instance-terminated \
    --targets "Id"="1","Arn"="arn:aws:sns:sa-east-1:123456789012:ec2-state-changes"

# Event pattern: Auto Scaling activities
cat > event-pattern-asg.json <<EOF
{
  "source": ["aws.autoscaling"],
  "detail-type": ["EC2 Instance Launch Successful", "EC2 Instance Terminate Successful"],
  "detail": {
    "AutoScalingGroupName": ["myapp-asg"]
  }
}
EOF

aws events put-rule \
    --name asg-scaling-events \
    --event-pattern file://event-pattern-asg.json

# Scheduled rule (cron - daily at 2 AM UTC)
aws events put-rule \
    --name daily-cleanup \
    --schedule-expression "cron(0 2 * * ? *)" \
    --description "Run cleanup Lambda daily at 2 AM UTC"

# Scheduled rule (rate - every 5 minutes)
aws events put-rule \
    --name health-check-poller \
    --schedule-expression "rate(5 minutes)" \
    --description "Poll external health check every 5 minutes"

# Add Lambda as target
aws events put-targets \
    --rule daily-cleanup \
    --targets "Id"="1","Arn"="arn:aws:lambda:sa-east-1:123456789012:function:cleanup-function"

# Give EventBridge permission to invoke Lambda
aws lambda add-permission \
    --function-name cleanup-function \
    --statement-id AllowEventBridgeInvoke \
    --action lambda:InvokeFunction \
    --principal events.amazonaws.com \
    --source-arn arn:aws:events:sa-east-1:123456789012:rule/daily-cleanup

Querying and Modifying EventBridge

# List rules
aws events list-rules

# View rule details
aws events describe-rule --name notify-instance-terminated

# List targets for a rule
aws events list-targets-by-rule --rule notify-instance-terminated

# Disable rule temporarily
aws events disable-rule --name daily-cleanup

# Re-enable rule
aws events enable-rule --name daily-cleanup

# Remove targets first
aws events remove-targets \
    --rule daily-cleanup \
    --ids "1"

# Then delete rule
aws events delete-rule --name daily-cleanup

Dashboards

# Create dashboard
cat > dashboard-body.json <<EOF
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/EC2", "CPUUtilization", {"stat": "Average"}]
        ],
        "period": 300,
        "stat": "Average",
        "region": "sa-east-1",
        "title": "EC2 CPU Utilization"
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/ApplicationELB", "RequestCount", {"stat": "Sum"}]
        ],
        "period": 60,
        "stat": "Sum",
        "region": "sa-east-1",
        "title": "ALB Request Count"
      }
    }
  ]
}
EOF

aws cloudwatch put-dashboard \
    --dashboard-name MyApp-Production \
    --dashboard-body file://dashboard-body.json

# List dashboards
aws cloudwatch list-dashboards

# View dashboard
aws cloudwatch get-dashboard --dashboard-name MyApp-Production

# Delete dashboard
aws cloudwatch delete-dashboards --dashboard-names MyApp-Production
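
Dashboards can also surface alarm state directly via an "alarm" widget type; a minimal sketch to drop into the widgets array above (the alarm ARN is a placeholder built from an earlier example):

{
  "type": "alarm",
  "properties": {
    "title": "Critical Alarms",
    "alarms": [
      "arn:aws:cloudwatch:sa-east-1:123456789012:alarm:alb-high-error-rate"
    ]
  }
}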

CloudWatch Agent Configuration

# Install CloudWatch Agent on EC2 (Amazon Linux 2)
sudo yum install amazon-cloudwatch-agent -y

# Create configuration
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/config.json > /dev/null <<'EOF'
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app.log",
            "log_group_name": "/aws/ec2/myapp",
            "log_stream_name": "{instance_id}-app.log",
            "timezone": "UTC"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "/aws/ec2/nginx",
            "log_stream_name": "{instance_id}-error.log"
          }
        ]
      }
    }
  },
  "metrics": {
    "namespace": "CustomApp/System",
    "metrics_collected": {
      "mem": {
        "measurement": [
          {
            "name": "mem_used_percent",
            "rename": "MemoryUtilization",
            "unit": "Percent"
          }
        ],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": [
          {
            "name": "used_percent",
            "rename": "DiskUtilization",
            "unit": "Percent"
          }
        ],
        "metrics_collection_interval": 60,
        "resources": ["/"]
      }
    }
  }
}
EOF

# Start agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config \
    -m ec2 \
    -s \
    -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json

# Verify status
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a query \
    -m ec2 \
    -s

# Stop agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a stop \
    -m ec2 \
    -s

Architecture and Flows

The original post illustrates this section with diagrams: a complete multi-region observability view, the alarm lifecycle flow, the logs pipeline, the EventBridge automation flow, and a decision tree for alarm configuration.

Best Practices Checklist

Observability Strategy

  • Define key metrics: Identify critical SLIs (Service Level Indicators) for your application
  • Implement Golden Signals: Monitor Latency, Traffic, Errors, Saturation
  • Structured logs: JSON logging with consistent fields (timestamp, level, message, context)
  • Appropriate log levels: DEBUG in dev, INFO/WARN in staging, ERROR+ in prod
  • Distributed tracing: Use X-Ray for microservices and latency debugging
  • Custom business metrics: Don't just monitor infrastructure - track business KPIs (orders/min, revenue)
  • Dashboard per environment: Separate Production, Staging, Development dashboards

Alarming

  • Prioritize alarms: Critical (pager), High (email), Medium (dashboard only)
  • Avoid alarm fatigue: Only alert on what's truly important
  • Evaluation periods over 1: Minimum 2 periods to avoid false positives from spikes
  • Configure datapoints to alarm: Use M-of-N pattern (e.g., 3 of 5) for more flexibility
  • SNS topics by severity: critical-alerts, warning-alerts, info-alerts
  • Document runbooks: Each alarm has documentation on "what to do"
  • Configure OK actions: Notify when alarm resolves too
  • Test alarms regularly: Simulate alarm conditions monthly
  • Composite alarms for correlation: Combine related signals (CPU + Memory + Disk)

Logs Management

  • CloudWatch Agent on all instances: Centralize logs automatically
  • Retention policy by log type: Debug 7 days, Errors 90 days, Audit 1+ year
  • Metric filters for critical errors: Convert log patterns into metrics
  • Save Logs Insights queries: Document common queries for quick troubleshooting
  • Structured logging: JSON with standard fields makes queries easier
  • Don't log secrets: Sanitize passwords, tokens, PII before logging
  • Correlation IDs: Request ID in all logs for request tracing
  • Periodic export to S3: Old logs to S3 for compliance and cost optimization

Cost Optimization

  • Aggressive retention policy: Don't use 90 days for ALL logs
  • Standard vs High-resolution: Only use high-res when critical
  • Consolidated dashboards: Don't create a dashboard for each resource
  • Delete obsolete alarms: Regular cleanup of alarms for deleted resources
  • Metric filters instead of custom metrics: More economical to extract from logs
  • Log sampling: Log sample of successful requests, all errors
  • Subscription filters for export: Don't use Insights queries for batch analysis

Security and Compliance

  • Encryption at rest: Sensitive logs with KMS encryption
  • Restrictive IAM: Only certain roles can view production logs
  • Audit trail: CloudTrail logs of who accesses/modifies alarms and dashboards
  • Log retention compliance: Respect GDPR, HIPAA, SOX requirements
  • PII redaction: Automatic redaction of sensitive information
  • Cross-account logs: Centralize logs from multiple accounts in security account

Automation

  • EventBridge for remediation: Auto-scaling, restart services, failover
  • Lambda triggered by alarms: Automatic actions (snapshot before termination)
  • Infrastructure as Code: Alarms and dashboards in Terraform/CloudFormation
  • CI/CD integration: Deployments automatically create/update alarms
  • Scheduled maintenance windows: Disable alarm actions during maintenance, via EventBridge schedules or deploy scripts (see the sketch below)
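
A minimal maintenance-window sketch using commands shown earlier (for unattended windows, an EventBridge schedule can invoke a Lambda that makes the same calls):

# Start of maintenance window
aws cloudwatch disable-alarm-actions \
    --alarm-names high-cpu-i-xxxxx alb-high-latency

# ... perform maintenance ...

# End of maintenance window
aws cloudwatch enable-alarm-actions \
    --alarm-names high-cpu-i-xxxxx alb-high-latency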

Common Mistakes to Avoid

Evaluation Periods = 1 Leading to Alarm Fatigue

Why it happens: Creating alarm with --evaluation-periods 1 thinking "I want to know immediately."

The real problem: CPU and latency naturally spike for 1-2 minutes at a time. With evaluation=1, every spike triggers an alarm, the team learns to ignore alerts, and alarm fatigue sets in. When there's a real problem, nobody responds.

Typical scenario:

# BAD: Evaluation = 1
aws cloudwatch put-metric-alarm \
    --alarm-name cpu-high \
    --evaluation-periods 1 \
    --period 60 \
    --threshold 80

# Result: Alarms every hour from temporary spikes
# 10:00 AM - ALARM (CPU spike to 85% from deployment)
# 10:05 AM - OK
# 11:30 AM - ALARM (garbage collection spike)
# 11:32 AM - OK
# ... team stops paying attention

How to avoid it:

# GOOD: Evaluation = 2-3 with datapoints-to-alarm
aws cloudwatch put-metric-alarm \
    --alarm-name cpu-sustained-high \
    --evaluation-periods 3 \
    --datapoints-to-alarm 2 \
    --period 300 \
    --threshold 80

# Meaning: "Alarm if CPU over 80% in 2 of the last 3 periods of 5 min"
# Tolerates a momentary spike, detects sustained problems

Golden rule: Evaluation periods >= 2 for volatile metrics (CPU, latency). Only evaluation=1 for binary metrics (health check status).

Not Configuring Retention Policy Leading to Unexpected Costs

Why it happens: Creating a log group, forgetting to configure retention, logs accumulate infinitely.

Financial impact:

Application logging 10 GB/day
Without retention: 10 GB x 30 days x $0.03/GB = $9/month first month
                   10 GB x 365 days x $0.03/GB = $109.50/year
With 7 day retention: 10 GB x 7 days x $0.03/GB = $2.10/month
Annual savings: $107.40

How to avoid it:

# ALWAYS configure retention when creating log group
aws logs create-log-group --log-group-name /aws/ec2/myapp

aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp \
    --retention-in-days 7  # Debug logs

# For critical logs
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp-errors \
    --retention-in-days 90

# For audit logs (compliance)
aws logs put-retention-policy \
    --log-group-name /aws/audit \
    --retention-in-days 365

Monthly verification script:

# List log groups without retention configured
aws logs describe-log-groups \
    --query 'logGroups[?!retentionInDays].logGroupName' \
    --output table

# If there are results, configure retention immediately

Using Sum Instead of Average for Latency

Why it happens: Confusion about when to use each statistic.

Technical problem:

# BAD: Sum for latency
aws cloudwatch put-metric-alarm \
    --metric-name TargetResponseTime \
    --statistic Sum \
    --period 300 \
    --threshold 500

# If you have 1000 requests in 5 min with avg 200ms each:
# Sum = 200ms x 1000 = 200,000ms
# Alarm triggers because 200,000 over 500 (meaningless)

# GOOD: Average for latency
--statistic Average

# Average = 200,000ms / 1000 requests = 200ms
# Alarm does NOT trigger because 200 under 500 (correct)

Statistic selection rule:

Average: Latency, CPU%, Memory%, ratio metrics
Sum: Request count, error count, bytes transferred
Maximum: Worst-case latency (p100), disk queue depth
Minimum: Health checks (0 or 1)
SampleCount: Verify there's data
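
For latency specifically, Average can hide a bad tail; percentile statistics usually make better alarm signals. A p99 sketch (--extended-statistic replaces --statistic; the load balancer value is the same placeholder as earlier):

aws cloudwatch put-metric-alarm \
    --alarm-name alb-p99-latency \
    --alarm-description "p99 target response time over 1 second" \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
    --extended-statistic p99 \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 1.0 \
    --comparison-operator GreaterThanThreshold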

Logs Without Structured Format Making Queries Impossible

Why it happens: Logging plain text without structure.

Plain text log (BAD):

2025-11-16 15:30:45 User john@email.com placed order 12345 for $99.50 ERROR payment failed

Problem: How do you extract "all orders over $100"? How do you filter by user? Complex and fragile regex.

JSON structured log (GOOD):

{
  "timestamp": "2025-11-16T15:30:45Z",
  "level": "ERROR",
  "user": "john@email.com",
  "orderId": "12345",
  "amount": 99.50,
  "currency": "USD",
  "event": "payment_failed",
  "errorCode": "CARD_DECLINED"
}

Now queries are trivial:

# Logs Insights query
fields @timestamp, user, orderId, amount
| filter event = "payment_failed" and amount > 100
| stats count() by errorCode

Alarm Without SNS Action Creating Silent Alarms

Why it happens: Creating alarm only with threshold, forgetting to add --alarm-actions.

Problem: Alarm changes to ALARM state but nobody finds out. You discover the problem 2 days later while manually reviewing the dashboard.

Useless alarm (BAD):

aws cloudwatch put-metric-alarm \
    --alarm-name critical-cpu \
    --threshold 90 \
    --evaluation-periods 2
    # Missing --alarm-actions

Alarm that notifies (GOOD):

# 1. Create SNS topic
aws sns create-topic --name critical-alerts

# 2. Subscribe email
aws sns subscribe \
    --topic-arn arn:aws:sns:sa-east-1:123456789012:critical-alerts \
    --protocol email \
    --notification-endpoint oncall@company.com

# 3. Create alarm WITH action
aws cloudwatch put-metric-alarm \
    --alarm-name critical-cpu \
    --threshold 90 \
    --evaluation-periods 2 \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts \
    --ok-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts

Unnecessary High-Resolution Metrics

Why it happens: "More granularity is better, right?"

Cost impact:

Standard metric (60s): $0.30/month
High-resolution (1s):  $0.30/month (same cost per metric)

BUT:
High-res alarm:         $0.30/month (vs $0.10 standard)
High-res storage:       60x more data points
GetMetricStatistics:    60x more API calls ($0.01/1000)

When to use each:

Standard (60s):
- General monitoring (CPU, disk, memory)
- Business metrics (orders/hour, revenue/day)
- Most use cases

High-resolution (1s):
- Financial trading systems
- Real-time gaming leaderboards
- Sub-minute auto-scaling (Lambda, containers)
- NOT for web application monitoring (overkill)

CloudWatch Agent Without IAM Role Leading to Missing Logs

Why it happens: Installing CloudWatch Agent on EC2 but not granting permissions.

Symptom: Agent is running, but logs don't appear in CloudWatch Logs.

Debugging:

# On the EC2 instance
sudo tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log

# Typical error:
# AccessDeniedException: User is not authorized to perform: logs:CreateLogGroup

How to avoid it: IAM Instance Profile with necessary permissions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "cloudwatch:PutMetricData",
        "ec2:DescribeVolumes",
        "ec2:DescribeTags"
      ],
      "Resource": "*"
    }
  ]
}

Verify before deploying agent:

# On EC2 instance, verify it has a role
curl http://169.254.169.254/latest/meta-data/iam/info

# Should return role info
# If returns 404, instance does NOT have role, agent will fail
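
If the instance enforces IMDSv2 (the default on newer AMIs), the plain curl above returns a 401; fetch a session token first:

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/iam/info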

Not Testing Alarms Before Disaster

Why it happens: "The configuration looks good, it must work."

Problem: Real incident arrives, you discover that:

  • SNS topic misconfigured (email not confirmed)
  • Threshold too high (alarm never triggers)
  • Evaluation period too long (detects problem 20 min late)
  • Lambda target without permissions (EventBridge can't invoke)

How to avoid it: Monthly alarm testing

# 1. Test alarm manually
aws cloudwatch set-alarm-state \
    --alarm-name critical-cpu \
    --state-value ALARM \
    --state-reason "Manual test - monthly drill"

# 2. Verify you receive email/Slack notification

# 3. For EventBridge rules, manual trigger
aws events put-events \
    --entries '[{
      "Source": "custom.test",
      "DetailType": "Manual Test",
      "Detail": "{\"test\": true}"
    }]'

# 4. Verify Lambda executed
aws logs tail /aws/lambda/my-response-function --since 5m

Cost Considerations

What Generates Costs in CloudWatch

Concept                            | Cost        | Unit                 | Free Tier
Metrics - Standard (AWS services)  | FREE        | Unlimited            | Permanent
Metrics - Custom                   | $0.30/month | Per metric           | 10 metrics free
Metrics - High-Resolution Custom   | $0.30/month | Per metric           | Not included
API Requests (GetMetricStatistics) | $0.01       | Per 1,000 requests   | 1M free/month
Dashboard                          | $3/month    | Per dashboard        | 3 dashboards free
Alarms - Standard                  | $0.10/month | Per alarm            | 10 alarms free
Alarms - High-Resolution           | $0.30/month | Per alarm            | Not included
Alarms - Composite                 | $0.50/month | Per alarm            | Not included
Logs - Ingestion                   | $0.50       | Per GB ingested      | 5 GB free/month
Logs - Storage                     | $0.03       | Per GB-month         | 5 GB free/month
Logs - Archive (S3 export)         | S3 pricing  | See S3 costs         | See S3 free tier
Logs Insights - Queries            | $0.005      | Per GB scanned       | Included in logs free tier
EventBridge - Custom Events        | FREE        | First 14M/month      | Yes
EventBridge - Events over 14M      | $1.00       | Per 1M events        | Not included
Anomaly Detection                  | $0.30/month | Per metric monitored | Not included

Real Application Cost Example

Scenario: Multi-region web app with complete monitoring

Infrastructure:
- 10 EC2 instances (5 per region)
- 2 ALB (1 per region)
- 2 RDS instances
- Auto-Scaling Groups
- Route53 health checks

CloudWatch Setup:
- AWS service metrics (EC2, ALB, RDS, ASG): FREE
- 5 custom business metrics (orders/min, revenue, etc.): 5 x $0.30 = $1.50/month
- 20 standard alarms (CPU, latency, errors, etc.): 20 x $0.10 = $2.00/month
- 2 composite alarms (app-healthy): 2 x $0.50 = $1.00/month
- 1 dashboard (production overview): $3.00/month

Logs:
- 50 GB/month ingestion: (50 - 5 free) x $0.50 = $22.50/month
- Average 200 GB stored (30 day retention): (200 - 5 free) x $0.03 = $5.85/month
- Logs Insights queries: ~10 GB/month scanned: 10 x $0.005 = $0.05/month

EventBridge:
- ~5M events/month (EC2, ASG, custom): FREE (under 14M)

TOTAL MONTHLY: $35.90/month

Breakdown by category:

Custom metrics:         $1.50  (4%)
Alarms:                 $3.00  (8%)
Dashboard:              $3.00  (8%)
Logs ingestion:        $22.50 (63%) - Largest cost
Logs storage:           $5.85 (16%)
Logs queries:           $0.05 (under 1%)
EventBridge:            $0.00  (0%)

Cost Optimization Strategies

1. Logs - Greatest savings opportunity

# EXPENSIVE: 90-day retention for ALL logs
# At 50 GB/month ingested, ~150 GB stays stored x $0.03 = $4.50/month, forever

# ECONOMICAL: Differentiated retention by type
# Debug logs: 7 days
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp-debug \
    --retention-in-days 7

# Application logs: 30 days
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp \
    --retention-in-days 30

# Error logs: 90 days
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp-errors \
    --retention-in-days 90

# Audit logs: 1 year (compliance)
aws logs put-retention-policy \
    --log-group-name /aws/audit \
    --retention-in-days 365

# Savings: ~60% in storage costs

2. Log sampling for successful requests

import logging
import random

logger = logging.getLogger(__name__)

def log_request(status_code, latency):
    # Always log errors
    if status_code >= 400:
        logger.error(f"Error {status_code}, latency {latency}ms")
        return

    # Sample 10% of successful requests
    if random.random() < 0.1:
        logger.info(f"Success {status_code}, latency {latency}ms")

# Log reduction: ~90% less data
# Maintain visibility of problems (100% errors)
# Still see trends (10% sample is statistically significant)

3. Metric Filters instead of Custom Metrics

# EXPENSIVE: Publish custom metric from code
# Code: cloudwatch.put_metric_data(...)
# Cost: $0.30/month per custom metric

# ECONOMICAL: Metric filter on existing logs
aws logs put-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ErrorCount \
    --filter-pattern "[timestamp, level=ERROR, msg]" \
    --metric-transformations metricName=ApplicationErrors,metricNamespace=CustomApp/Logs,metricValue=1

# Cost: $0 (you're already paying for log ingestion)
# Same result: error metric

4. Consolidated dashboards

# EXPENSIVE: Dashboard per resource
# Dashboard EC2 Instance 1: $3/month
# Dashboard EC2 Instance 2: $3/month
# ... x 10 instances = $30/month

# ECONOMICAL: Aggregated dashboard
# 1 Dashboard "Production Overview" with all instances: $3/month
# Savings: $27/month

Integration with Other Services

AWS Service     | How It Integrates                                               | Typical Use Case
EC2             | CloudWatch Agent sends metrics and logs                         | Memory, disk, application log monitoring
Auto Scaling    | Alarms trigger scaling policies                                 | Scale out when CPU over 80%
ALB/NLB         | Native metrics (RequestCount, Latency, 5XX)                     | Alarm on high error rate
RDS             | Native metrics (CPUUtilization, FreeStorageSpace, Connections)  | Alarm when storage under 10 GB
Lambda          | Native metrics and logs (Invocations, Errors, Duration)         | Alarm on error rate, log analysis
S3              | Request metrics, storage metrics                                | Monitor bucket size, request patterns
Route53         | Health check metrics (HealthCheckStatus)                        | Alarm on failover events
SNS             | Alarm notifications via topics                                  | Email, SMS, Lambda triggers
SQS             | Queue metrics (ApproximateNumberOfMessages)                     | Scale workers based on queue depth
API Gateway     | Native metrics (Count, Latency, 4XX, 5XX)                       | Monitor API performance
ECS/EKS         | Container metrics and logs                                      | Monitor containerized applications
Step Functions  | Execution metrics                                               | Monitor workflow success/failure
X-Ray           | Distributed tracing integration                                 | End-to-end request tracing
Systems Manager | Run Command integration                                         | Auto-remediation actions
EventBridge     | Event-driven automation                                         | React to resource state changes
CloudTrail      | Audit logs                                                      | Security and compliance monitoring
