Amazon CloudWatch Essentials

CloudWatch essentials, metrics, logs, alarms, EventBridge, and best practices

@geomena · Sun Aug 10, 2025 · 1,059 views

Amazon CloudWatch is AWS's observability service that collects, monitors, and analyzes metrics, logs, and events from your infrastructure and applications. It acts as the "nervous system" of your architecture, providing complete visibility into the state of your resources in real-time.

The problem it solves: it eliminates operational opacity. Instead of discovering problems when users report them, CloudWatch lets you monitor proactively, detect anomalies, automate responses, and keep a performance history. It transforms reactive debugging into proactive observability.

When to use it: ALWAYS in production. CloudWatch is fundamental to any serious AWS architecture. Without observability you're operating blind: you don't know whether your resources are healthy, what caused an incident, or when you need to scale.

Alternatives: DataDog, New Relic, Prometheus+Grafana, ELK Stack (Elasticsearch+Logstash+Kibana). CloudWatch has the advantage of native integration with AWS services and doesn't require additional infrastructure setup, but third-party tools offer more advanced analytics and better UX.

Key Concepts

Metric: Numeric data that varies over time, represented as a time series (timestamp + value). Examples: CPUUtilization, RequestCount, DiskReadOps. Each metric belongs to a namespace (AWS/EC2, AWS/ApplicationELB).
Namespace: Container that groups related metrics. AWS uses namespaces like AWS/EC2 and AWS/RDS. You can create custom namespaces for your application metrics (e.g., CustomApp/Business).
Dimension: Key-value pair that identifies a metric variation (e.g., InstanceId=i-xxxxx, LoadBalancer=app/myapp-lb/xxx). Allows filtering and aggregating metrics.
Statistic: Aggregation of data points over a period (Average, Sum, Minimum, Maximum, SampleCount). Defines how the reported value is calculated.
Period: Time interval over which statistics are calculated (60s, 300s, 3600s). Standard metrics have a 60s minimum; high-resolution metrics can go down to 1s.
Alarm: Automated monitor that compares a metric against a threshold and executes actions when it is crossed. Has 3 states: OK, ALARM, INSUFFICIENT_DATA.
Alarm State: Current alarm state: OK (threshold not breached), ALARM (threshold breached), INSUFFICIENT_DATA (not enough data to evaluate).
Evaluation Period: Number of consecutive periods that must meet the condition before the alarm changes state. Prevents false positives from temporary spikes.
Datapoints to Alarm: Number of data points within the evaluation periods that must violate the threshold to trigger the alarm (e.g., "3 of 5" means 3 bad data points in 5 periods).
Composite Alarm: Alarm that combines multiple alarms using boolean logic (AND/OR). Allows complex alerts based on multiple conditions.
Log Group: Container for related log streams (e.g., /aws/lambda/my-function, /aws/ec2/myapp). Defines retention policy and encryption settings.
Log Stream: Sequence of log events from a single source (one EC2 instance, one Lambda invocation). A log group can have multiple streams.
Log Event: Individual log message with timestamp and content. Can be plain text or structured JSON.
Metric Filter: Search pattern over logs that extracts data and publishes it as a metric. Allows creating custom metrics from logs without modifying code.
Logs Insights: Interactive query service for analyzing logs with a SQL-like, pipe-based language. Supports aggregations, filters, regex, and statistics over large log volumes.
Log Retention: Period CloudWatch stores logs before automatically deleting them (1 day to 10 years, or indefinite). Critical for compliance and cost optimization.
EventBridge: Event bus service that captures AWS resource state changes and enables event-based automation.
Event Pattern: JSON that defines which events to capture (e.g., "EC2 instance terminated in a specific ASG"). Uses pattern matching against the event structure.
Event Rule: Combination of event pattern + target action. Defines "when X happens, do Y".
Event Target: Destination of an event (Lambda, SNS, SQS, Step Functions, etc.). A rule can have multiple targets.
Scheduled Expression: Cron or rate expression for periodic events (e.g., cron(0 2 * * ? *) = 2 AM daily, rate(5 minutes) = every 5 minutes).
CloudWatch Agent: Daemon that runs on EC2/on-premises hosts to send custom metrics and logs to CloudWatch. Enables monitoring of memory, disk space, and custom processes.
Dashboard: Customizable visualization of metrics in charts, numbers, and text. Provides a consolidated view of application health.
Widget: Individual dashboard component (line chart, number, log widget, alarm status). Configured with specific metrics and a time period.
Anomaly Detection: Machine learning that learns normal metric patterns and builds expected-value bands. Alarms can be based on deviations from the band.

Essential AWS CLI Commands

Querying Metrics

# List available metrics for a namespace
aws cloudwatch list-metrics \
    --namespace AWS/EC2

# List metrics for a specific instance
aws cloudwatch list-metrics \
    --namespace AWS/EC2 \
    --dimensions Name=InstanceId,Value=i-xxxxx

# Get CPU statistics (last 24 hours, 5 min periods)
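# (The -d '…' syntax below is GNU date; on macOS use gdate from coreutils or date -v-24H.)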
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 300 \
    --statistics Average,Maximum \
    --output table

# Get multiple metrics with get-metric-data (more efficient)
cat > metric-queries.json <<EOF
[
  {
    "Id": "cpu",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}]
      },
      "Period": 300,
      "Stat": "Average"
    }
  },
  {
    "Id": "network",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/EC2",
        "MetricName": "NetworkIn",
        "Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}]
      },
      "Period": 300,
      "Stat": "Sum"
    }
  }
]
EOF

aws cloudwatch get-metric-data \
    --metric-data-queries file://metric-queries.json \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S)

# Get ALB metrics (request count, latency, errors)
aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 60 \
    --statistics Average,Maximum

# Get Route53 health check status
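# Note: Route 53 is a global service; its health check metrics are published
# in us-east-1, so add --region us-east-1 if your default region differs.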
aws cloudwatch get-metric-statistics \
    --namespace AWS/Route53 \
    --metric-name HealthCheckStatus \
    --dimensions Name=HealthCheckId,Value=abc123-healthcheck \
    --start-time $(date -u -d '12 hours ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 60 \
    --statistics Minimum

Publishing Custom Metrics

# Publish simple custom metric
aws cloudwatch put-metric-data \
    --namespace CustomApp/Business \
    --metric-name OrdersProcessed \
    --value 142 \
    --timestamp $(date -u +%Y-%m-%dT%H:%M:%S)

# Publish metric with dimensions
aws cloudwatch put-metric-data \
    --namespace CustomApp/API \
    --metric-name ResponseTime \
    --value 234 \
    --unit Milliseconds \
    --dimensions Environment=Production,Region=sa-east-1

# Publish multiple metrics (batch)
aws cloudwatch put-metric-data \
    --namespace CustomApp/Database \
    --metric-data \
        MetricName=ActiveConnections,Value=45,Unit=Count \
        MetricName=QueryTime,Value=123,Unit=Milliseconds \
        MetricName=CacheHitRate,Value=0.89,Unit=Percent

# Publish with high resolution (1 second)
aws cloudwatch put-metric-data \
    --namespace CustomApp/HighRes \
    --metric-name Latency \
    --value 78 \
    --unit Milliseconds \
    --storage-resolution 1
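
A minimal publishing-loop sketch: the my_app_stats helper is hypothetical (stand in your own data source); the put-metric-data call reuses the forms shown above.

# Publish a business metric every 60 seconds (sketch)
while true; do
    VALUE=$(my_app_stats --pending-orders)   # hypothetical helper command
    aws cloudwatch put-metric-data \
        --namespace CustomApp/Business \
        --metric-name PendingOrders \
        --value "$VALUE" \
        --unit Count
    sleep 60
done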

Creating Alarms

# Simple alarm: CPU over 80%
aws cloudwatch put-metric-alarm \
    --alarm-name high-cpu-i-xxxxx \
    --alarm-description "CPU above 80% for 10 minutes" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching

# Alarm with SNS notification
aws cloudwatch put-metric-alarm \
    --alarm-name high-cpu-with-notification \
    --alarm-description "Alert ops team when CPU high" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts \
    --ok-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts

# Alarm for ALB (high latency)
aws cloudwatch put-metric-alarm \
    --alarm-name alb-high-latency \
    --alarm-description "Target response time over 500ms" \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
    --statistic Average \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 0.5 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts

# Alarm for Route53 health check
aws cloudwatch put-metric-alarm \
    --alarm-name route53-primary-unhealthy \
    --alarm-description "Primary region health check failed - failover activated" \
    --namespace AWS/Route53 \
    --metric-name HealthCheckStatus \
    --dimensions Name=HealthCheckId,Value=abc123-healthcheck \
    --statistic Minimum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:failover-alerts

# Alarm for error rate (5XX over 100 requests)
aws cloudwatch put-metric-alarm \
    --alarm-name alb-high-error-rate \
    --alarm-description "Too many 5XX errors from targets" \
    --namespace AWS/ApplicationELB \
    --metric-name HTTPCode_Target_5XX_Count \
    --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 100 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts

# Alarm with "M of N" datapoints (more flexible)
aws cloudwatch put-metric-alarm \
    --alarm-name cpu-high-3-of-5 \
    --alarm-description "CPU high in 3 out of 5 datapoints" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --statistic Average \
    --period 60 \
    --evaluation-periods 5 \
    --datapoints-to-alarm 3 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold

# Composite Alarm (combines multiple alarms)
aws cloudwatch put-composite-alarm \
    --alarm-name app-unhealthy \
    --alarm-description "App is unhealthy if high CPU AND high error rate" \
    --alarm-rule "ALARM(high-cpu-i-xxxxx) AND ALARM(alb-high-error-rate)" \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts
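
A common pattern: attach notification actions only to the composite and create the child alarms without --alarm-actions, so individual signals can flap without paging anyone. Another sketch combining earlier alarms with OR logic:

aws cloudwatch put-composite-alarm \
    --alarm-name app-degraded \
    --alarm-rule "ALARM(high-cpu-i-xxxxx) OR ALARM(alb-high-latency)" \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts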

Querying and Modifying Alarms

# List all alarms
aws cloudwatch describe-alarms

# View alarms in ALARM state
aws cloudwatch describe-alarms \
    --state-value ALARM

# View specific alarm details
aws cloudwatch describe-alarms \
    --alarm-names high-cpu-i-xxxxx

# View alarm history (state changes)
aws cloudwatch describe-alarm-history \
    --alarm-name high-cpu-i-xxxxx \
    --max-records 10

# Disable alarm temporarily
aws cloudwatch disable-alarm-actions \
    --alarm-names high-cpu-i-xxxxx

# Re-enable alarm
aws cloudwatch enable-alarm-actions \
    --alarm-names high-cpu-i-xxxxx

# Change state manually (testing)
aws cloudwatch set-alarm-state \
    --alarm-name high-cpu-i-xxxxx \
    --state-value ALARM \
    --state-reason "Testing alarm notifications"

# Delete alarms
aws cloudwatch delete-alarms \
    --alarm-names high-cpu-i-xxxxx alb-high-latency

Creating and Managing Logs

# Create log group
aws logs create-log-group \
    --log-group-name /aws/ec2/myapp

# Configure retention (7 days)
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp \
    --retention-in-days 7

# Create log stream
aws logs create-log-stream \
    --log-group-name /aws/ec2/myapp \
    --log-stream-name i-xxxxx-app.log

# Send log events (from application)
cat > log-events.json <<EOF
[
  {
    "timestamp": $(date +%s)000,
    "message": "Application started successfully"
  },
  {
    "timestamp": $(date +%s)000,
    "message": "Connected to database"
  }
]
EOF

aws logs put-log-events \
    --log-group-name /aws/ec2/myapp \
    --log-stream-name i-xxxxx-app.log \
    --log-events file://log-events.json
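
# Note: current CloudWatch Logs no longer requires a --sequence-token between
# successive put-log-events calls (older examples often include one).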

# Tag log group (for cost allocation)
aws logs tag-log-group \
    --log-group-name /aws/ec2/myapp \
    --tags Environment=Production,Application=MyApp

Querying Logs

# List log groups
aws logs describe-log-groups

# List log streams in a group
aws logs describe-log-streams \
    --log-group-name /aws/ec2/myapp \
    --order-by LastEventTime \
    --descending \
    --max-items 10

# Tail logs (real-time)
aws logs tail /aws/ec2/myapp --follow

# Filter logs by pattern (last 24 hours)
aws logs filter-log-events \
    --log-group-name /aws/ec2/myapp \
    --start-time $(date -u -d '24 hours ago' +%s)000 \
    --filter-pattern "ERROR"

# Filter with structured pattern
aws logs filter-log-events \
    --log-group-name /aws/ec2/myapp \
    --filter-pattern '[timestamp, level=ERROR, msg]' \
    --max-items 50

# Search in multiple log groups
aws logs filter-log-events \
    --log-group-name-prefix /aws/ec2/ \
    --filter-pattern "database connection failed"

# Export logs to S3 (for later analysis)
aws logs create-export-task \
    --log-group-name /aws/ec2/myapp \
    --from $(date -u -d '7 days ago' +%s)000 \
    --to $(date -u +%s)000 \
    --destination myapp-logs-bucket \
    --destination-prefix logs/2025/11/
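
Export tasks are asynchronous, only one can be active per account at a time, and the destination bucket needs a policy allowing the CloudWatch Logs service to write to it. A sketch for checking progress (the task ID comes from the create-export-task output):

aws logs describe-export-tasks \
    --task-id "task-id-from-create-export-task" \
    --query 'exportTasks[0].status.code'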

Logs Insights Queries

# Query: Top 10 most frequent errors
aws logs start-query \
    --log-group-name /aws/ec2/myapp \
    --start-time $(date -u -d '24 hours ago' +%s) \
    --end-time $(date -u +%s) \
    --query-string '
        fields @timestamp, @message
        | filter @message like /ERROR/
        | stats count() as error_count by @message
        | sort error_count desc
        | limit 10
    '

# Get query results
aws logs get-query-results --query-id xxxxx-yyyy-zzzz
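
start-query returns only a queryId; results are available once the status reaches Complete. A minimal polling sketch built from the commands above:

QUERY_ID=$(aws logs start-query \
    --log-group-name /aws/ec2/myapp \
    --start-time $(date -u -d '1 hour ago' +%s) \
    --end-time $(date -u +%s) \
    --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20' \
    --query 'queryId' --output text)

while [ "$(aws logs get-query-results --query-id "$QUERY_ID" \
        --query 'status' --output text)" != "Complete" ]; do
    sleep 2
done

aws logs get-query-results --query-id "$QUERY_ID"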

# Query: Average latency by endpoint
aws logs start-query \
    --log-group-name /aws/lambda/my-api \
    --start-time $(date -u -d '1 hour ago' +%s) \
    --end-time $(date -u +%s) \
    --query-string '
        fields @timestamp, endpoint, duration
        | stats avg(duration) as avg_latency, max(duration) as max_latency by endpoint
        | sort avg_latency desc
    '

# Query: Slow requests (over 500ms)
aws logs start-query \
    --log-group-name /aws/ec2/myapp \
    --start-time $(date -u -d '1 hour ago' +%s) \
    --end-time $(date -u +%s) \
    --query-string '
        fields @timestamp, @message, latency
        | filter latency > 500
        | sort latency desc
        | limit 100
    '

Creating Metric Filters from Logs

# Create metric filter to count errors
aws logs put-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ErrorCount \
    --filter-pattern "[timestamp, level=ERROR, msg]" \
    --metric-transformations metricName=ApplicationErrors,metricNamespace=CustomApp/Logs,metricValue=1,defaultValue=0

# Metric filter to extract latency from logs
aws logs put-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ResponseTime \
    --filter-pattern "[timestamp, level, msg, latency]" \
    --metric-transformations metricName=ResponseLatency,metricNamespace=CustomApp/Logs,metricValue='$latency',unit=Milliseconds
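
A filter-generated metric only emits data points when matching events arrive, so alarms on it should treat missing data as healthy. A sketch pairing the ApplicationErrors metric from above with an alarm (the SNS ARN is a placeholder):

aws cloudwatch put-metric-alarm \
    --alarm-name app-errors-from-logs \
    --namespace CustomApp/Logs \
    --metric-name ApplicationErrors \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 10 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts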

# List metric filters
aws logs describe-metric-filters \
    --log-group-name /aws/ec2/myapp

# Delete metric filter
aws logs delete-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ErrorCount

Creating EventBridge Rules

# Create SNS topic for notifications
aws sns create-topic --name ec2-state-changes

# Subscribe email
aws sns subscribe \
    --topic-arn arn:aws:sns:sa-east-1:123456789012:ec2-state-changes \
    --protocol email \
    --notification-endpoint ops@example.com

# Event pattern: EC2 instance terminated
cat > event-pattern-terminated.json <<EOF
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["terminated"]
  }
}
EOF

# Create rule
aws events put-rule \
    --name notify-instance-terminated \
    --description "Notify when EC2 instance is terminated" \
    --event-pattern file://event-pattern-terminated.json

# Add SNS as target
aws events put-targets \
    --rule notify-instance-terminated \
    --targets "Id"="1","Arn"="arn:aws:sns:sa-east-1:123456789012:ec2-state-changes"

# Event pattern: Auto Scaling activities
cat > event-pattern-asg.json <<EOF
{
  "source": ["aws.autoscaling"],
  "detail-type": ["EC2 Instance Launch Successful", "EC2 Instance Terminate Successful"],
  "detail": {
    "AutoScalingGroupName": ["myapp-asg"]
  }
}
EOF

aws events put-rule \
    --name asg-scaling-events \
    --event-pattern file://event-pattern-asg.json

# Scheduled rule (cron - daily at 2 AM UTC)
aws events put-rule \
    --name daily-cleanup \
    --schedule-expression "cron(0 2 * * ? *)" \
    --description "Run cleanup Lambda daily at 2 AM UTC"

# Scheduled rule (rate - every 5 minutes)
aws events put-rule \
    --name health-check-poller \
    --schedule-expression "rate(5 minutes)" \
    --description "Poll external health check every 5 minutes"

# Add Lambda as target
aws events put-targets \
    --rule daily-cleanup \
    --targets "Id"="1","Arn"="arn:aws:lambda:sa-east-1:123456789012:function:cleanup-function"

# Give EventBridge permission to invoke Lambda
aws lambda add-permission \
    --function-name cleanup-function \
    --statement-id AllowEventBridgeInvoke \
    --action lambda:InvokeFunction \
    --principal events.amazonaws.com \
    --source-arn arn:aws:events:sa-east-1:123456789012:rule/daily-cleanup

Querying and Modifying EventBridge

# List rules
aws events list-rules

# View rule details
aws events describe-rule --name notify-instance-terminated

# List targets for a rule
aws events list-targets-by-rule --rule notify-instance-terminated

# Disable rule temporarily
aws events disable-rule --name daily-cleanup

# Re-enable rule
aws events enable-rule --name daily-cleanup

# Remove targets first
aws events remove-targets \
    --rule daily-cleanup \
    --ids "1"

# Then delete rule
aws events delete-rule --name daily-cleanup

Dashboards

# Create dashboard
cat > dashboard-body.json <<EOF
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/EC2", "CPUUtilization", {"stat": "Average"}]
        ],
        "period": 300,
        "stat": "Average",
        "region": "sa-east-1",
        "title": "EC2 CPU Utilization"
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/ApplicationELB", "RequestCount", {"stat": "Sum"}]
        ],
        "period": 60,
        "stat": "Sum",
        "region": "sa-east-1",
        "title": "ALB Request Count"
      }
    }
  ]
}
EOF

aws cloudwatch put-dashboard \
    --dashboard-name MyApp-Production \
    --dashboard-body file://dashboard-body.json

# List dashboards
aws cloudwatch list-dashboards

# View dashboard
aws cloudwatch get-dashboard --dashboard-name MyApp-Production

# Delete dashboard
aws cloudwatch delete-dashboards --dashboard-names MyApp-Production
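
Dashboards can also surface alarm state directly via an "alarm" widget type; a minimal sketch to drop into the widgets array above (the alarm ARN is a placeholder built from an earlier example):

{
  "type": "alarm",
  "properties": {
    "title": "Critical Alarms",
    "alarms": [
      "arn:aws:cloudwatch:sa-east-1:123456789012:alarm:alb-high-error-rate"
    ]
  }
}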

CloudWatch Agent Configuration

# Install CloudWatch Agent on EC2 (Amazon Linux 2)
sudo yum install amazon-cloudwatch-agent -y

# Create configuration
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/config.json > /dev/null <<'EOF'
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app.log",
            "log_group_name": "/aws/ec2/myapp",
            "log_stream_name": "{instance_id}-app.log",
            "timezone": "UTC"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "/aws/ec2/nginx",
            "log_stream_name": "{instance_id}-error.log"
          }
        ]
      }
    }
  },
  "metrics": {
    "namespace": "CustomApp/System",
    "metrics_collected": {
      "mem": {
        "measurement": [
          {
            "name": "mem_used_percent",
            "rename": "MemoryUtilization",
            "unit": "Percent"
          }
        ],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": [
          {
            "name": "used_percent",
            "rename": "DiskUtilization",
            "unit": "Percent"
          }
        ],
        "metrics_collection_interval": 60,
        "resources": ["/"]
      }
    }
  }
}
EOF

# Start agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config \
    -m ec2 \
    -s \
    -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json

# Verify status
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a query \
    -m ec2 \
    -s

# Stop agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a stop \
    -m ec2 \
    -s

Architecture and Flows

The original post illustrates this section with diagrams: a complete multi-region observability view, the alarm lifecycle flow, the logs pipeline, the EventBridge automation flow, and a decision tree for alarm configuration.

Best Practices Checklist

Observability Strategy

  • Define key metrics: Identify critical SLIs (Service Level Indicators) for your application
  • Implement Golden Signals: Monitor Latency, Traffic, Errors, Saturation
  • Structured logs: JSON logging with consistent fields (timestamp, level, message, context)
  • Appropriate log levels: DEBUG in dev, INFO/WARN in staging, ERROR+ in prod
  • Distributed tracing: Use X-Ray for microservices and latency debugging
  • Custom business metrics: Don't just monitor infrastructure - track business KPIs (orders/min, revenue)
  • Dashboard per environment: Separate Production, Staging, Development dashboards

Alarming

  • Prioritize alarms: Critical (pager), High (email), Medium (dashboard only)
  • Avoid alarm fatigue: Only alert on what's truly important
  • Evaluation periods over 1: Minimum 2 periods to avoid false positives from spikes
  • Configure datapoints to alarm: Use M-of-N pattern (e.g., 3 of 5) for more flexibility
  • SNS topics by severity: critical-alerts, warning-alerts, info-alerts
  • Document runbooks: Each alarm has documentation on "what to do"
  • Configure OK actions: Notify when alarm resolves too
  • Test alarms regularly: Simulate alarm conditions monthly
  • Composite alarms for correlation: Combine related signals (CPU + Memory + Disk)

Logs Management

  • CloudWatch Agent on all instances: Centralize logs automatically
  • Retention policy by log type: Debug 7 days, Errors 90 days, Audit 1+ year
  • Metric filters for critical errors: Convert log patterns into metrics
  • Save Logs Insights queries: Document common queries for quick troubleshooting
  • Structured logging: JSON with standard fields makes queries easier
  • Don't log secrets: Sanitize passwords, tokens, PII before logging
  • Correlation IDs: Request ID in all logs for request tracing
  • Periodic export to S3: Old logs to S3 for compliance and cost optimization

Cost Optimization

  • Aggressive retention policy: Don't use 90 days for ALL logs
  • Standard vs High-resolution: Only use high-res when critical
  • Consolidated dashboards: Don't create a dashboard for each resource
  • Delete obsolete alarms: Regular cleanup of alarms for deleted resources
  • Metric filters instead of custom metrics: More economical to extract from logs
  • Log sampling: Log sample of successful requests, all errors
  • Subscription filters for export: Don't use Insights queries for batch analysis

Security and Compliance

  • Encryption at rest: Sensitive logs with KMS encryption
  • Restrictive IAM: Only certain roles can view production logs
  • Audit trail: CloudTrail logs of who accesses/modifies alarms and dashboards
  • Log retention compliance: Respect GDPR, HIPAA, SOX requirements
  • PII redaction: Automatic redaction of sensitive information
  • Cross-account logs: Centralize logs from multiple accounts in security account

Automation

  • EventBridge for remediation: Auto-scaling, restart services, failover
  • Lambda triggered by alarms: Automatic actions (snapshot before termination)
  • Infrastructure as Code: Alarms and dashboards in Terraform/CloudFormation
  • CI/CD integration: Deployments automatically create/update alarms
  • Scheduled maintenance windows: Disable alarm actions during maintenance, via EventBridge schedules or deploy scripts (see the sketch below)
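
A minimal maintenance-window sketch using commands shown earlier (for unattended windows, an EventBridge schedule can invoke a Lambda that makes the same calls):

# Start of maintenance window
aws cloudwatch disable-alarm-actions \
    --alarm-names high-cpu-i-xxxxx alb-high-latency

# ... perform maintenance ...

# End of maintenance window
aws cloudwatch enable-alarm-actions \
    --alarm-names high-cpu-i-xxxxx alb-high-latency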

Common Mistakes to Avoid

Evaluation Periods = 1 Leading to Alarm Fatigue

Why it happens: Creating alarm with --evaluation-periods 1 thinking "I want to know immediately."

The real problem: CPU and latency naturally spike for 1-2 minutes at a time. With evaluation=1, every spike triggers an alarm, the team learns to ignore alerts, and alarm fatigue sets in. When there's a real problem, nobody responds.

Typical scenario:

# BAD: Evaluation = 1
aws cloudwatch put-metric-alarm \
    --alarm-name cpu-high \
    --evaluation-periods 1 \
    --period 60 \
    --threshold 80

# Result: Alarms every hour from temporary spikes
# 10:00 AM - ALARM (CPU spike to 85% from deployment)
# 10:05 AM - OK
# 11:30 AM - ALARM (garbage collection spike)
# 11:32 AM - OK
# ... team stops paying attention

How to avoid it:

# GOOD: Evaluation = 2-3 with datapoints-to-alarm
aws cloudwatch put-metric-alarm \
    --alarm-name cpu-sustained-high \
    --evaluation-periods 3 \
    --datapoints-to-alarm 2 \
    --period 300 \
    --threshold 80

# Meaning: "Alarm if CPU over 80% in 2 of the last 3 periods of 5 min"
# Tolerates a momentary spike, detects sustained problems

Golden rule: Evaluation periods >= 2 for volatile metrics (CPU, latency). Only evaluation=1 for binary metrics (health check status).

Not Configuring Retention Policy Leading to Unexpected Costs

Why it happens: Creating a log group, forgetting to configure retention, logs accumulate infinitely.

Financial impact:

Application logging 10 GB/day
Without retention: 10 GB x 30 days x $0.03/GB = $9/month first month
                   10 GB x 365 days x $0.03/GB = $109.50/year
With 7 day retention: 10 GB x 7 days x $0.03/GB = $2.10/month
Annual savings: $107.40

How to avoid it:

# ALWAYS configure retention when creating log group
aws logs create-log-group --log-group-name /aws/ec2/myapp

aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp \
    --retention-in-days 7  # Debug logs

# For critical logs
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp-errors \
    --retention-in-days 90

# For audit logs (compliance)
aws logs put-retention-policy \
    --log-group-name /aws/audit \
    --retention-in-days 365

Monthly verification script:

# List log groups without retention configured
aws logs describe-log-groups \
    --query 'logGroups[?!retentionInDays].logGroupName' \
    --output table

# If there are results, configure retention immediately

Using Sum Instead of Average for Latency

Why it happens: Confusion about when to use each statistic.

Technical problem:

# BAD: Sum for latency
aws cloudwatch put-metric-alarm \
    --metric-name TargetResponseTime \
    --statistic Sum \
    --period 300 \
    --threshold 500

# If you have 1000 requests in 5 min with avg 200ms each:
# Sum = 200ms x 1000 = 200,000ms
# Alarm triggers because 200,000 over 500 (meaningless)

# GOOD: Average for latency
--statistic Average

# Average = 200,000ms / 1000 requests = 200ms
# Alarm does NOT trigger because 200 under 500 (correct)

Statistic selection rule:

Average: Latency, CPU%, Memory%, ratio metrics
Sum: Request count, error count, bytes transferred
Maximum: Worst-case latency (p100), disk queue depth
Minimum: Health checks (0 or 1)
SampleCount: Verify there's data
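
For latency specifically, Average can hide a bad tail; percentile statistics usually make better alarm signals. A p99 sketch (--extended-statistic replaces --statistic; the load balancer value is the same placeholder as earlier):

aws cloudwatch put-metric-alarm \
    --alarm-name alb-p99-latency \
    --alarm-description "p99 target response time over 1 second" \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
    --extended-statistic p99 \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 1.0 \
    --comparison-operator GreaterThanThreshold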

Logs Without Structured Format Making Queries Impossible

Why it happens: Logging plain text without structure.

Plain text log (BAD):

2025-11-16 15:30:45 User john@email.com placed order 12345 for $99.50 ERROR payment failed

Problem: How do you extract "all orders over $100"? How do you filter by user? Complex and fragile regex.

JSON structured log (GOOD):

{
  "timestamp": "2025-11-16T15:30:45Z",
  "level": "ERROR",
  "user": "john@email.com",
  "orderId": "12345",
  "amount": 99.50,
  "currency": "USD",
  "event": "payment_failed",
  "errorCode": "CARD_DECLINED"
}

Now queries are trivial:

# Logs Insights query
fields @timestamp, user, orderId, amount
| filter event = "payment_failed" and amount > 100
| stats count() by errorCode

Alarm Without SNS Action Creating Silent Alarms

Why it happens: Creating alarm only with threshold, forgetting to add --alarm-actions.

Problem: Alarm changes to ALARM state but nobody finds out. You discover the problem 2 days later while manually reviewing the dashboard.

Useless alarm (BAD):

aws cloudwatch put-metric-alarm \
    --alarm-name critical-cpu \
    --threshold 90 \
    --evaluation-periods 2
    # Missing --alarm-actions

Alarm that notifies (GOOD):

# 1. Create SNS topic
aws sns create-topic --name critical-alerts

# 2. Subscribe email
aws sns subscribe \
    --topic-arn arn:aws:sns:sa-east-1:123456789012:critical-alerts \
    --protocol email \
    --notification-endpoint oncall@company.com

# 3. Create alarm WITH action
aws cloudwatch put-metric-alarm \
    --alarm-name critical-cpu \
    --threshold 90 \
    --evaluation-periods 2 \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts \
    --ok-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts

Unnecessary High-Resolution Metrics

Why it happens: "More granularity is better, right?"

Cost impact:

Standard metric (60s): $0.30/month
High-resolution (1s):  $0.30/month (same cost per metric)

BUT:
High-res alarm:         $0.30/month (vs $0.10 standard)
High-res storage:       60x more data points
GetMetricStatistics:    60x more API calls ($0.01/1000)

When to use each:

Standard (60s):
- General monitoring (CPU, disk, memory)
- Business metrics (orders/hour, revenue/day)
- Most use cases

High-resolution (1s):
- Financial trading systems
- Real-time gaming leaderboards
- Sub-minute auto-scaling (Lambda, containers)
- NOT for web application monitoring (overkill)

CloudWatch Agent Without IAM Role Leading to Missing Logs

Why it happens: Installing CloudWatch Agent on EC2 but not granting permissions.

Symptom: Agent is running, but logs don't appear in CloudWatch Logs.

Debugging:

# On the EC2 instance
sudo tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log

# Typical error:
# AccessDeniedException: User is not authorized to perform: logs:CreateLogGroup

How to avoid it: IAM Instance Profile with necessary permissions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "cloudwatch:PutMetricData",
        "ec2:DescribeVolumes",
        "ec2:DescribeTags"
      ],
      "Resource": "*"
    }
  ]
}

Verify before deploying agent:

# On EC2 instance, verify it has a role
curl http://169.254.169.254/latest/meta-data/iam/info

# Should return role info
# If returns 404, instance does NOT have role, agent will fail
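
If the instance enforces IMDSv2 (the default on newer AMIs), the plain curl above returns a 401; fetch a session token first:

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/iam/info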

Not Testing Alarms Before Disaster

Why it happens: "The configuration looks good, it must work."

Problem: Real incident arrives, you discover that:

  • SNS topic misconfigured (email not confirmed)
  • Threshold too high (alarm never triggers)
  • Evaluation period too long (detects problem 20 min late)
  • Lambda target without permissions (EventBridge can't invoke)

How to avoid it: Monthly alarm testing

# 1. Test alarm manually
aws cloudwatch set-alarm-state \
    --alarm-name critical-cpu \
    --state-value ALARM \
    --state-reason "Manual test - monthly drill"

# 2. Verify you receive email/Slack notification

# 3. For EventBridge rules, manual trigger
aws events put-events \
    --entries '[{
      "Source": "custom.test",
      "DetailType": "Manual Test",
      "Detail": "{\"test\": true}"
    }]'

# 4. Verify Lambda executed
aws logs tail /aws/lambda/my-response-function --since 5m

Cost Considerations

What Generates Costs in CloudWatch

Concept                            | Cost        | Unit                 | Free Tier
Metrics - Standard (AWS services)  | FREE        | Unlimited            | Permanent
Metrics - Custom                   | $0.30/month | Per metric           | 10 metrics free
Metrics - High-Resolution Custom   | $0.30/month | Per metric           | Not included
API Requests (GetMetricStatistics) | $0.01       | Per 1,000 requests   | 1M free/month
Dashboard                          | $3/month    | Per dashboard        | 3 dashboards free
Alarms - Standard                  | $0.10/month | Per alarm            | 10 alarms free
Alarms - High-Resolution           | $0.30/month | Per alarm            | Not included
Alarms - Composite                 | $0.50/month | Per alarm            | Not included
Logs - Ingestion                   | $0.50       | Per GB ingested      | 5 GB free/month
Logs - Storage                     | $0.03       | Per GB-month         | 5 GB free/month
Logs - Archive (S3 export)         | S3 pricing  | See S3 costs         | See S3 free tier
Logs Insights - Queries            | $0.005      | Per GB scanned       | Included in logs free tier
EventBridge - Custom Events        | FREE        | First 14M/month      | Yes
EventBridge - Events over 14M      | $1.00       | Per 1M events        | Not included
Anomaly Detection                  | $0.30/month | Per metric monitored | Not included

Real Application Cost Example

Scenario: Multi-region web app with complete monitoring

Infrastructure:
- 10 EC2 instances (5 per region)
- 2 ALB (1 per region)
- 2 RDS instances
- Auto-Scaling Groups
- Route53 health checks

CloudWatch Setup:
- AWS service metrics (EC2, ALB, RDS, ASG): FREE
- 5 custom business metrics (orders/min, revenue, etc.): 5 x $0.30 = $1.50/month
- 20 standard alarms (CPU, latency, errors, etc.): 20 x $0.10 = $2.00/month
- 2 composite alarms (app-healthy): 2 x $0.50 = $1.00/month
- 1 dashboard (production overview): $3.00/month

Logs:
- 50 GB/month ingestion: (50 - 5 free) x $0.50 = $22.50/month
- Average 200 GB stored (30 day retention): (200 - 5 free) x $0.03 = $5.85/month
- Logs Insights queries: ~10 GB/month scanned: 10 x $0.005 = $0.05/month

EventBridge:
- ~5M events/month (EC2, ASG, custom): FREE (under 14M)

TOTAL MONTHLY: $35.90/month

Breakdown by category:

Custom metrics:         $1.50  (4%)
Alarms:                 $3.00  (8%)
Dashboard:              $3.00  (8%)
Logs ingestion:        $22.50 (63%) - Largest cost
Logs storage:           $5.85 (16%)
Logs queries:           $0.05 (under 1%)
EventBridge:            $0.00  (0%)

Cost Optimization Strategies

1. Logs - Greatest savings opportunity

# EXPENSIVE: 90-day retention for ALL logs
# At 50 GB/month ingested, ~150 GB stays stored x $0.03 = $4.50/month, forever

# ECONOMICAL: Differentiated retention by type
# Debug logs: 7 days
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp-debug \
    --retention-in-days 7

# Application logs: 30 days
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp \
    --retention-in-days 30

# Error logs: 90 days
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp-errors \
    --retention-in-days 90

# Audit logs: 1 year (compliance)
aws logs put-retention-policy \
    --log-group-name /aws/audit \
    --retention-in-days 365

# Savings: ~60% in storage costs

2. Log sampling for successful requests

import logging
import random

logger = logging.getLogger(__name__)

def log_request(status_code, latency):
    # Always log errors
    if status_code >= 400:
        logger.error(f"Error {status_code}, latency {latency}ms")
        return

    # Sample 10% of successful requests
    if random.random() < 0.1:
        logger.info(f"Success {status_code}, latency {latency}ms")

# Log reduction: ~90% less data
# Maintain visibility of problems (100% errors)
# Still see trends (10% sample is statistically significant)

3. Metric Filters instead of Custom Metrics

# EXPENSIVE: Publish custom metric from code
# Code: cloudwatch.put_metric_data(...)
# Cost: $0.30/month per custom metric

# ECONOMICAL: Metric filter on existing logs
aws logs put-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ErrorCount \
    --filter-pattern "[timestamp, level=ERROR, msg]" \
    --metric-transformations metricName=ApplicationErrors,metricNamespace=CustomApp/Logs,metricValue=1

# Cost: $0 (you're already paying for log ingestion)
# Same result: error metric

4. Consolidated dashboards

# EXPENSIVE: Dashboard per resource
# Dashboard EC2 Instance 1: $3/month
# Dashboard EC2 Instance 2: $3/month
# ... x 10 instances = $30/month

# ECONOMICAL: Aggregated dashboard
# 1 Dashboard "Production Overview" with all instances: $3/month
# Savings: $27/month

Integration with Other Services

AWS Service     | How It Integrates                                               | Typical Use Case
EC2             | CloudWatch Agent sends metrics and logs                         | Memory, disk, application log monitoring
Auto Scaling    | Alarms trigger scaling policies                                 | Scale out when CPU over 80%
ALB/NLB         | Native metrics (RequestCount, Latency, 5XX)                     | Alarm on high error rate
RDS             | Native metrics (CPUUtilization, FreeStorageSpace, Connections)  | Alarm when storage under 10 GB
Lambda          | Native metrics and logs (Invocations, Errors, Duration)         | Alarm on error rate, log analysis
S3              | Request metrics, storage metrics                                | Monitor bucket size, request patterns
Route53         | Health check metrics (HealthCheckStatus)                        | Alarm on failover events
SNS             | Alarm notifications via topics                                  | Email, SMS, Lambda triggers
SQS             | Queue metrics (ApproximateNumberOfMessages)                     | Scale workers based on queue depth
API Gateway     | Native metrics (Count, Latency, 4XX, 5XX)                       | Monitor API performance
ECS/EKS         | Container metrics and logs                                      | Monitor containerized applications
Step Functions  | Execution metrics                                               | Monitor workflow success/failure
X-Ray           | Distributed tracing integration                                 | End-to-end request tracing
Systems Manager | Run Command integration                                         | Auto-remediation actions
EventBridge     | Event-driven automation                                         | React to resource state changes
CloudTrail      | Audit logs                                                      | Security and compliance monitoring
