Amazon CloudWatch: Comprehensive Observability and Monitoring on AWS

CloudWatch metrics, logs, alarms, EventBridge, dashboards, and observability strategies for production workloads

@geomena | Sun Aug 10 2025 | #aws-roadmap #monitoring #observability

Amazon CloudWatch serves as the foundational observability platform within the AWS ecosystem, collecting, monitoring, and analyzing metrics, logs, and events across your entire infrastructure and application stack. It functions as the central nervous system of any well-architected AWS deployment, delivering real-time visibility into the health, performance, and operational state of every resource under management.

The core problem it addresses: operational opacity. Without a robust observability layer, engineering teams discover failures only when end users report them — a reactive posture that erodes trust and extends incident resolution times. CloudWatch transforms this paradigm by enabling proactive monitoring, anomaly detection, automated remediation, and comprehensive performance trending.

When to deploy it: in every production environment, without exception. CloudWatch is not optional for any serious AWS architecture. Operating without observability is equivalent to flying blind — you cannot determine whether resources are healthy, diagnose the root cause of incidents, or anticipate when scaling is necessary.

Alternatives worth evaluating: Datadog, New Relic, Prometheus with Grafana, and the ELK Stack comprising Elasticsearch, Logstash, and Kibana. CloudWatch holds a distinct advantage through its native integration with AWS services and zero-infrastructure setup requirements, though third-party solutions often deliver more sophisticated analytics interfaces and richer visualization capabilities.

Key Concepts

Concept | Description
Metric | A numeric data point that varies over time, represented as a time series of timestamp-value pairs. Examples include CPUUtilization, RequestCount, and DiskReadOps. Each metric belongs to a namespace such as AWS/EC2 or AWS/ApplicationELB.
Namespace | A container that groups related metrics. AWS uses namespaces like AWS/EC2 and AWS/RDS. Custom namespaces, such as CustomApp/Business, can be created for application-specific metrics.
Dimension | A key-value pair that identifies a metric variation, such as InstanceId=i-xxxxx or LoadBalancer=app/myapp-lb/xxx. Dimensions enable filtering and aggregation of metrics.
Statistic | An aggregation of data points over a defined period: Average, Sum, Minimum, Maximum, or SampleCount. This determines how the reported value is calculated.
Period | The time interval over which statistics are calculated, for example 60s, 300s, or 3600s. Standard metrics support a minimum of 60s, while high-resolution metrics can achieve 1s granularity.
Alarm | An automated monitor that compares a metric against a threshold and executes predefined actions when that threshold is breached. Alarms maintain three states: OK, ALARM, and INSUFFICIENT_DATA.
Alarm State | The current condition of an alarm. OK indicates the metric is within the defined threshold, ALARM indicates the metric is breaching the threshold (above or below it, depending on the comparison operator), and INSUFFICIENT_DATA indicates there is not enough data to evaluate.
Evaluation Period | The number of most recent periods CloudWatch evaluates when determining the alarm state. Requiring more than one period prevents false positives from transient spikes.
Datapoints to Alarm | The number of data points within the evaluation periods that must violate the threshold to trigger the alarm. A "3 of 5" configuration requires 3 breaching data points within 5 evaluation periods.
Composite Alarm | An alarm that combines multiple alarms using boolean logic (AND, OR, and NOT operators), enabling complex alerting conditions based on multiple signals.
Log Group | A container for related log streams, such as /aws/lambda/my-function or /aws/ec2/myapp. Log groups define retention policies and encryption settings.
Log Stream | A sequence of log events from a single source, such as one EC2 instance or one Lambda execution environment. A log group can contain multiple streams.
Log Event | An individual log message comprising a timestamp and content. Events can be plain text or structured JSON.
Metric Filter | A search pattern applied to logs that extracts data and publishes it as a CloudWatch metric. This enables custom metric creation from log data without code modifications.
Logs Insights | An interactive query service with SQL-like syntax for analyzing logs. It supports aggregations, filters, regular expressions, and statistical operations over large log volumes.
Log Retention | The duration CloudWatch stores logs before automatic deletion, configurable from 1 day to 10 years, or set to indefinite. This setting is critical for both compliance requirements and cost optimization.
EventBridge | An event bus service that captures AWS resource state changes and enables event-driven automation across your architecture.
Event Pattern | A JSON definition specifying which events to capture, such as "EC2 instance terminated in a specific Auto Scaling Group." Pattern matching is applied against the event structure.
Event Rule | The combination of an event pattern and a target action. Rules define the logic: "when X happens, execute Y."
Event Target | The destination for an event: Lambda, SNS, SQS, Step Functions, and others. A single rule can route to multiple targets.
Scheduled Expression | A cron or rate expression for periodic events. For example, cron(0 2 * * ? *) triggers daily at 2 AM UTC, while rate(5 minutes) triggers every 5 minutes.
CloudWatch Agent | A daemon that runs on EC2 or on-premises servers to send custom metrics and logs to CloudWatch. It enables monitoring of memory utilization, disk space, and custom application processes.
Dashboard | A customizable visualization surface for metrics, rendered as charts, numeric displays, and text widgets. Dashboards provide a consolidated view of application health.
Widget | An individual dashboard component (line chart, numeric display, log widget, or alarm status indicator) configured with specific metrics and time periods.
Anomaly Detection | A machine learning capability that learns normal metric behavior patterns and establishes expected value bands. Alarms can then trigger based on deviations from the learned baseline.
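
To make the relationship between Evaluation Period and Datapoints to Alarm more concrete, here is a minimal Python sketch. It is a simplification for illustration only, not CloudWatch's actual evaluation engine: it assumes a GreaterThanThreshold comparison, ignores missing-data handling, and the function name `alarm_state` is hypothetical.

```python
from typing import Sequence


def alarm_state(
    datapoints: Sequence[float],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> str:
    """Return 'ALARM', 'OK', or 'INSUFFICIENT_DATA' for a GreaterThanThreshold alarm.

    Simplified model (an assumption, not the real CloudWatch algorithm):
    only the most recent `evaluation_periods` datapoints are considered,
    and missing data is not modelled.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# "3 of 5": three of the last five one-minute CPU averages exceed 80%.
print(alarm_state([70, 85, 90, 75, 95], threshold=80,
                  evaluation_periods=5, datapoints_to_alarm=3))  # ALARM
```

With the same five datapoints and `datapoints_to_alarm=4`, the sketch returns OK, which is the trade-off the M-of-N setting controls: a higher M tolerates more isolated spikes before alerting.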

Essential AWS CLI Commands

List available metrics for a namespace
aws cloudwatch list-metrics \
    --namespace AWS/EC2
List metrics for a specific instance
aws cloudwatch list-metrics \
    --namespace AWS/EC2 \
    --dimensions Name=InstanceId,Value=i-xxxxx
Get CPU statistics -- last 24 hours, 5-minute periods
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 300 \
    --statistics Average,Maximum \
    --output table
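
The `--period` and `--statistics` options above ask CloudWatch to bucket raw samples and aggregate each bucket. The following Python sketch shows that idea on local data. It is an illustration under assumptions (the `aggregate` helper and the sample values are made up) and does not call AWS.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate(
    samples: Iterable[Tuple[int, float]], period: int
) -> Dict[int, Dict[str, float]]:
    """Group (epoch_seconds, value) samples into period-aligned buckets.

    Returns {bucket_start: {Average, Sum, Minimum, Maximum, SampleCount}}.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - (timestamp % period)
        buckets[bucket_start].append(value)

    return {
        start: {
            "Average": sum(values) / len(values),
            "Sum": sum(values),
            "Minimum": min(values),
            "Maximum": max(values),
            "SampleCount": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical one-minute CPU samples aggregated into 5-minute (300s) periods.
samples = [(0, 40.0), (60, 50.0), (120, 90.0), (300, 20.0), (360, 30.0)]
stats = aggregate(samples, period=300)
print(stats[0]["Average"], stats[0]["Maximum"])      # 60.0 90.0
print(stats[300]["Average"], stats[300]["Maximum"])  # 25.0 30.0
```

Note how the 90% spike is visible in Maximum but diluted in Average, which is why the command above requests both statistics.
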
Get multiple metrics efficiently with get-metric-data
cat > metric-queries.json <<EOF
[
  {
    "Id": "cpu",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}]
      },
      "Period": 300,
      "Stat": "Average"
    }
  },
  {
    "Id": "network",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/EC2",
        "MetricName": "NetworkIn",
        "Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}]
      },
      "Period": 300,
      "Stat": "Sum"
    }
  }
]
EOF
Execute the multi-metric query
aws cloudwatch get-metric-data \
    --metric-data-queries file://metric-queries.json \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S)
Get ALB target response time -- average and maximum latency, last hour
aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 60 \
    --statistics Average,Maximum
Get Route53 health check status -- Route53 metrics are published only in us-east-1
aws cloudwatch get-metric-statistics \
    --region us-east-1 \
    --namespace AWS/Route53 \
    --metric-name HealthCheckStatus \
    --dimensions Name=HealthCheckId,Value=abc123-healthcheck \
    --start-time $(date -u -d '12 hours ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 60 \
    --statistics Minimum
Publish a simple custom metric
aws cloudwatch put-metric-data \
    --namespace CustomApp/Business \
    --metric-name OrdersProcessed \
    --value 142 \
    --timestamp $(date -u +%Y-%m-%dT%H:%M:%S)
Publish a metric with dimensions
aws cloudwatch put-metric-data \
    --namespace CustomApp/API \
    --metric-name ResponseTime \
    --value 234 \
    --unit Milliseconds \
    --dimensions Environment=Production,Region=sa-east-1
Publish multiple metrics in a single batch
aws cloudwatch put-metric-data \
    --namespace CustomApp/Database \
    --metric-data \
        MetricName=ActiveConnections,Value=45,Unit=Count \
        MetricName=QueryTime,Value=123,Unit=Milliseconds \
        MetricName=CacheHitRate,Value=89,Unit=Percent
Publish with high resolution -- 1-second granularity
aws cloudwatch put-metric-data \
    --namespace CustomApp/HighRes \
    --metric-name Latency \
    --value 78 \
    --unit Milliseconds \
    --storage-resolution 1
Simple alarm -- CPU exceeding 80%
aws cloudwatch put-metric-alarm \
    --alarm-name high-cpu-i-xxxxx \
    --alarm-description "CPU above 80% for 10 minutes" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching
Alarm with SNS notification
aws cloudwatch put-metric-alarm \
    --alarm-name high-cpu-with-notification \
    --alarm-description "Alert ops team when CPU high" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts \
    --ok-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts
Alarm for ALB high latency
aws cloudwatch put-metric-alarm \
    --alarm-name alb-high-latency \
    --alarm-description "Target response time over 500ms" \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
    --statistic Average \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 0.5 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts
Alarm for Route53 health check failure -- create it in us-east-1, where Route53 metrics live, with an SNS topic in the same region
aws cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name route53-primary-unhealthy \
    --alarm-description "Primary region health check failed - failover activated" \
    --namespace AWS/Route53 \
    --metric-name HealthCheckStatus \
    --dimensions Name=HealthCheckId,Value=abc123-healthcheck \
    --statistic Minimum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:failover-alerts
Alarm for elevated 5XX error rate
aws cloudwatch put-metric-alarm \
    --alarm-name alb-high-error-rate \
    --alarm-description "Too many 5XX errors from targets" \
    --namespace AWS/ApplicationELB \
    --metric-name HTTPCode_Target_5XX_Count \
    --dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 100 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts
Alarm with M-of-N datapoints for greater flexibility
aws cloudwatch put-metric-alarm \
    --alarm-name cpu-high-3-of-5 \
    --alarm-description "CPU high in 3 out of 5 datapoints" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxx \
    --statistic Average \
    --period 60 \
    --evaluation-periods 5 \
    --datapoints-to-alarm 3 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold
Composite alarm combining multiple signals
aws cloudwatch put-composite-alarm \
    --alarm-name app-unhealthy \
    --alarm-description "App is unhealthy if high CPU AND high error rate" \
    --alarm-rule "ALARM(high-cpu-i-xxxxx) AND ALARM(alb-high-error-rate)" \
    --alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts
List all alarms
aws cloudwatch describe-alarms
View alarms currently in ALARM state
aws cloudwatch describe-alarms \
    --state-value ALARM
View specific alarm details
aws cloudwatch describe-alarms \
    --alarm-names high-cpu-i-xxxxx
View alarm history -- state change audit trail
aws cloudwatch describe-alarm-history \
    --alarm-name high-cpu-i-xxxxx \
    --max-records 10
Disable alarm actions temporarily
aws cloudwatch disable-alarm-actions \
    --alarm-names high-cpu-i-xxxxx
Re-enable alarm actions
aws cloudwatch enable-alarm-actions \
    --alarm-names high-cpu-i-xxxxx
Set alarm state manually for testing
aws cloudwatch set-alarm-state \
    --alarm-name high-cpu-i-xxxxx \
    --state-value ALARM \
    --state-reason "Testing alarm notifications"
Delete alarms
aws cloudwatch delete-alarms \
    --alarm-names high-cpu-i-xxxxx alb-high-latency
Create a log group
aws logs create-log-group \
    --log-group-name /aws/ec2/myapp
Configure retention -- 7 days
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp \
    --retention-in-days 7
Create a log stream
aws logs create-log-stream \
    --log-group-name /aws/ec2/myapp \
    --log-stream-name i-xxxxx-app.log
Prepare log events for submission
cat > log-events.json <<EOF
[
  {
    "timestamp": $(date +%s)000,
    "message": "Application started successfully"
  },
  {
    "timestamp": $(date +%s)000,
    "message": "Connected to database"
  }
]
EOF
Send log events
aws logs put-log-events \
    --log-group-name /aws/ec2/myapp \
    --log-stream-name i-xxxxx-app.log \
    --log-events file://log-events.json
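
The same payload can be generated from Python instead of a shell heredoc. This is a sketch under assumptions: `build_log_events` is a hypothetical helper, and it only builds the JSON that `put-log-events` expects (millisecond timestamps, events in chronological order); it does not call AWS.

```python
import json
import time
from typing import Dict, List, Optional, Union


def build_log_events(
    messages: List[str], now_ms: Optional[int] = None
) -> List[Dict[str, Union[int, str]]]:
    """Build a put-log-events payload: millisecond timestamps, oldest first."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Space events 1 ms apart so the chronological order is unambiguous.
    return [
        {"timestamp": now_ms + offset, "message": message}
        for offset, message in enumerate(messages)
    ]


events = build_log_events(
    ["Application started successfully", "Connected to database"]
)
# Redirect this output to log-events.json to use it with the command above.
print(json.dumps(events, indent=2))
```
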
Tag a log group for cost allocation
aws logs tag-log-group \
    --log-group-name /aws/ec2/myapp \
    --tags Environment=Production,Application=MyApp
List all log groups
aws logs describe-log-groups
List log streams ordered by last event time
aws logs describe-log-streams \
    --log-group-name /aws/ec2/myapp \
    --order-by LastEventTime \
    --descending \
    --max-items 10
Tail logs in real time
aws logs tail /aws/ec2/myapp --follow
Filter logs by pattern -- last 24 hours
aws logs filter-log-events \
    --log-group-name /aws/ec2/myapp \
    --start-time $(date -u -d '24 hours ago' +%s)000 \
    --filter-pattern "ERROR"
Filter with structured pattern matching
aws logs filter-log-events \
    --log-group-name /aws/ec2/myapp \
    --filter-pattern '[timestamp, level=ERROR, msg]' \
    --max-items 50
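Search across multiple log groups by prefix -- filter-log-events accepts one log group per call, so loop over the matching groups
for group in $(aws logs describe-log-groups \
    --log-group-name-prefix /aws/ec2/ \
    --query 'logGroups[].logGroupName' \
    --output text); do
    aws logs filter-log-events \
        --log-group-name "$group" \
        --filter-pattern "database connection failed"
done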
Export logs to S3 for archival analysis
aws logs create-export-task \
    --log-group-name /aws/ec2/myapp \
    --from $(date -u -d '7 days ago' +%s)000 \
    --to $(date -u +%s)000 \
    --destination myapp-logs-bucket \
    --destination-prefix logs/2025/11/
Query: top 10 most frequent errors
aws logs start-query \
    --log-group-name /aws/ec2/myapp \
    --start-time $(date -u -d '24 hours ago' +%s) \
    --end-time $(date -u +%s) \
    --query-string '
        fields @timestamp, @message
        | filter @message like /ERROR/
        | stats count(*) as error_count by @message
        | sort error_count desc
        | limit 10
    '
Retrieve query results
aws logs get-query-results --query-id xxxxx-yyyy-zzzz
Query: average latency by endpoint
aws logs start-query \
    --log-group-name /aws/lambda/my-api \
    --start-time $(date -u -d '1 hour ago' +%s) \
    --end-time $(date -u +%s) \
    --query-string '
        fields @timestamp, endpoint, duration
        | stats avg(duration) as avg_latency, max(duration) as max_latency by endpoint
        | sort avg_latency desc
    '
Query: slow requests exceeding 500ms
aws logs start-query \
    --log-group-name /aws/ec2/myapp \
    --start-time $(date -u -d '1 hour ago' +%s) \
    --end-time $(date -u +%s) \
    --query-string '
        fields @timestamp, @message, latency
        | filter latency > 500
        | sort latency desc
        | limit 100
    '
Create metric filter to count errors
aws logs put-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ErrorCount \
    --filter-pattern "[timestamp, level=ERROR, msg]" \
    --metric-transformations metricName=ApplicationErrors,metricNamespace=CustomApp/Logs,metricValue=1,defaultValue=0
Metric filter to extract latency values from logs
aws logs put-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ResponseTime \
    --filter-pattern "[timestamp, level, msg, latency]" \
    --metric-transformations metricName=ResponseLatency,metricNamespace=CustomApp/Logs,metricValue='$latency',unit=Milliseconds
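
To show what a space-delimited pattern like `[timestamp, level, msg, latency]` extracts, here is a rough local approximation in Python. It is only a sketch: the log line layout is an assumed example, `extract_latency` is a hypothetical helper, and the real CloudWatch matcher supports more syntax than this.

```python
import shlex
from typing import Optional

FIELDS = ["timestamp", "level", "msg", "latency"]


def extract_latency(line: str) -> Optional[float]:
    """Return the latency field of a space-delimited log line, or None.

    Approximates the filter pattern [timestamp, level, msg, latency]:
    a double-quoted string counts as one field, and lines with a
    different number of fields do not match.
    """
    try:
        parts = shlex.split(line)
    except ValueError:  # unbalanced quotes
        return None
    if len(parts) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    try:
        return float(record["latency"])
    except ValueError:  # latency field is not numeric
        return None


lines = [
    '2025-08-10T12:00:00Z INFO "GET /orders" 234',
    '2025-08-10T12:00:01Z ERROR "GET /orders" 912',
    "malformed line",
]
print([extract_latency(line) for line in lines])  # [234.0, 912.0, None]
```

Running a few real log lines through a check like this before creating the filter can catch a field-count mismatch, which would otherwise show up only as a metric with no data.
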
List existing metric filters
aws logs describe-metric-filters \
    --log-group-name /aws/ec2/myapp
Delete a metric filter
aws logs delete-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ErrorCount
Create an SNS topic for notifications
aws sns create-topic --name ec2-state-changes
Subscribe an email endpoint
aws sns subscribe \
    --topic-arn arn:aws:sns:sa-east-1:123456789012:ec2-state-changes \
    --protocol email \
    --notification-endpoint ops@example.com
Define event pattern -- EC2 instance terminated
cat > event-pattern-terminated.json <<EOF
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["terminated"]
  }
}
EOF
Create the EventBridge rule
aws events put-rule \
    --name notify-instance-terminated \
    --description "Notify when EC2 instance is terminated" \
    --event-pattern file://event-pattern-terminated.json
Add SNS as the rule target
aws events put-targets \
    --rule notify-instance-terminated \
    --targets "Id"="1","Arn"="arn:aws:sns:sa-east-1:123456789012:ec2-state-changes"
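
To illustrate how an event pattern such as event-pattern-terminated.json selects events, the sketch below implements a deliberately reduced matcher in Python. It is an assumption-laden simplification: it handles only exact-value lists and nested objects, not the prefix, numeric, or anything-but operators that EventBridge also supports, and the instance ID in the sample event is a placeholder.

```python
from typing import Any, Dict


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True if every pattern key matches the event (exact values only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the corresponding part of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the pattern lists the allowed values.
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]},
}

event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "terminated"},
}

print(matches(pattern, event))  # True
print(matches(pattern, {**event, "detail": {"state": "running"}}))  # False
```

Fields present in the event but absent from the pattern (here `instance-id`) are ignored, which is why a narrow pattern still matches events carrying extra detail.
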
Define event pattern -- Auto Scaling activities
cat > event-pattern-asg.json <<EOF
{
  "source": ["aws.autoscaling"],
  "detail-type": ["EC2 Instance Launch Successful", "EC2 Instance Terminate Successful"],
  "detail": {
    "AutoScalingGroupName": ["myapp-asg"]
  }
}
EOF
Create Auto Scaling event rule
aws events put-rule \
    --name asg-scaling-events \
    --event-pattern file://event-pattern-asg.json
Scheduled rule -- daily at 2 AM UTC via cron
aws events put-rule \
    --name daily-cleanup \
    --schedule-expression "cron(0 2 * * ? *)" \
    --description "Run cleanup Lambda daily at 2 AM UTC"
Scheduled rule -- every 5 minutes via rate
aws events put-rule \
    --name health-check-poller \
    --schedule-expression "rate(5 minutes)" \
    --description "Poll external health check every 5 minutes"
Add Lambda as a rule target
aws events put-targets \
    --rule daily-cleanup \
    --targets "Id"="1","Arn"="arn:aws:lambda:sa-east-1:123456789012:function:cleanup-function"
Grant EventBridge permission to invoke Lambda
aws lambda add-permission \
    --function-name cleanup-function \
    --statement-id AllowEventBridgeInvoke \
    --action lambda:InvokeFunction \
    --principal events.amazonaws.com \
    --source-arn arn:aws:events:sa-east-1:123456789012:rule/daily-cleanup
List all EventBridge rules
aws events list-rules
View rule details
aws events describe-rule --name notify-instance-terminated
List targets for a specific rule
aws events list-targets-by-rule --rule notify-instance-terminated
Disable a rule temporarily
aws events disable-rule --name daily-cleanup
Re-enable a rule
aws events enable-rule --name daily-cleanup
Remove targets before deletion
aws events remove-targets \
    --rule daily-cleanup \
    --ids "1"
Delete the rule
aws events delete-rule --name daily-cleanup
Define the dashboard body
cat > dashboard-body.json <<EOF
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/EC2", "CPUUtilization", {"stat": "Average"}]
        ],
        "period": 300,
        "stat": "Average",
        "region": "sa-east-1",
        "title": "EC2 CPU Utilization"
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/ApplicationELB", "RequestCount", {"stat": "Sum"}]
        ],
        "period": 60,
        "stat": "Sum",
        "region": "sa-east-1",
        "title": "ALB Request Count"
      }
    }
  ]
}
EOF
Create the dashboard
aws cloudwatch put-dashboard \
    --dashboard-name MyApp-Production \
    --dashboard-body file://dashboard-body.json
List all dashboards
aws cloudwatch list-dashboards
View a specific dashboard
aws cloudwatch get-dashboard --dashboard-name MyApp-Production
Delete a dashboard
aws cloudwatch delete-dashboards --dashboard-names MyApp-Production
Install CloudWatch Agent on Amazon Linux 2
sudo yum install amazon-cloudwatch-agent -y
Create agent configuration -- the target directory is root-owned, so write the file through sudo tee
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/config.json > /dev/null <<EOF
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app.log",
            "log_group_name": "/aws/ec2/myapp",
            "log_stream_name": "{instance_id}-app.log",
            "timezone": "UTC"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "/aws/ec2/nginx",
            "log_stream_name": "{instance_id}-error.log"
          }
        ]
      }
    }
  },
  "metrics": {
    "namespace": "CustomApp/System",
    "metrics_collected": {
      "mem": {
        "measurement": [
          {
            "name": "mem_used_percent",
            "rename": "MemoryUtilization",
            "unit": "Percent"
          }
        ],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": [
          {
            "name": "used_percent",
            "rename": "DiskUtilization",
            "unit": "Percent"
          }
        ],
        "metrics_collection_interval": 60,
        "resources": ["/"]
      }
    }
  }
}
EOF
Start the agent with the configuration
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config \
    -m ec2 \
    -s \
    -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json
Verify agent status
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -m ec2 \
    -a status
Stop the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -m ec2 \
    -a stop

Architecture and Flows

Complete Multi-Region Observability

Alarm Lifecycle Flow

Logs Pipeline

Best Practices

Observability Strategy

  • Define key metrics first — identify the critical Service Level Indicators for your application before deploying any monitoring
  • Implement the four Golden Signals — monitor Latency, Traffic, Errors, and Saturation as the foundation of your observability posture
  • Adopt structured logging — use JSON with consistent fields such as timestamp, level, message, and context across all services
  • Calibrate log levels by environment — DEBUG in development, INFO and WARN in staging, ERROR and above in production
  • Enable distributed tracing — use X-Ray for microservices architectures and latency debugging across service boundaries
  • Track custom business metrics — infrastructure monitoring alone is insufficient; measure business KPIs such as orders per minute and revenue throughput
  • Maintain per-environment dashboards — separate Production, Staging, and Development dashboards to avoid confusion during incident response

The four Golden Signals — Latency, Traffic, Errors, and Saturation — originate from Google's Site Reliability Engineering methodology. Monitoring these four dimensions provides comprehensive coverage of most failure modes in distributed systems.

Alarming

  • Prioritize alarms by severity — Critical triggers a pager, High sends email, Medium appears on the dashboard only
  • Guard against alarm fatigue — alert exclusively on conditions that demand immediate human attention
  • Set evaluation periods to 2 or higher — a minimum of 2 periods prevents false positives from transient spikes
  • Use the M-of-N datapoints pattern — a configuration such as 3 of 5 offers greater flexibility and resilience against noise
  • Segment SNS topics by severity — maintain separate topics for critical-alerts, warning-alerts, and info-alerts
  • Document runbooks for every alarm — each alarm must have associated documentation describing the appropriate response
  • Configure OK actions — notify the team when an alarm resolves, not only when it triggers
  • Test alarms on a monthly cadence — simulate alarm conditions regularly to verify the entire notification chain
  • Deploy composite alarms for signal correlation — combine related signals such as CPU, Memory, and Disk into a single composite evaluation

Alarm fatigue is one of the most dangerous failure modes in operational monitoring. When teams receive hundreds of non-actionable alerts, they begin ignoring all notifications — including the ones that indicate genuine production incidents. Every alarm must justify its existence.

Logs Management

  • Deploy the CloudWatch Agent on every instance — centralize log collection automatically across your fleet
  • Apply differentiated retention by log type — Debug logs at 7 days, Error logs at 90 days, Audit logs at 1 year or longer
  • Create metric filters for critical error patterns — convert log patterns into metrics without modifying application code
  • Save and document Logs Insights queries — maintain a library of common queries for rapid troubleshooting during incidents
  • Never log secrets — sanitize passwords, tokens, and personally identifiable information before writing to any log stream
  • Embed correlation IDs — include a request ID in all log entries to enable end-to-end request tracing across services
  • Export aged logs to S3 periodically — move older logs to S3 for compliance requirements and cost optimization
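
The structured-logging and correlation-ID practices above can be sketched with Python's standard logging module. This is one possible shape, not a prescribed format: the field names, the `JsonFormatter` class, and the logger name `myapp` are assumptions chosen for illustration.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object with consistent fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlation ID passed via `extra`; None when the caller omits it.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("myapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Generate one ID per incoming request and attach it to every log entry.
request_id = str(uuid.uuid4())
logger.info("order created", extra={"request_id": request_id})
```

Because every entry is JSON with a `request_id` field, a Logs Insights query can filter on that field directly to reconstruct a single request across services.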

Security and Compliance

  • Encrypt sensitive logs at rest — use KMS encryption for log groups containing sensitive data
  • Apply restrictive IAM policies — ensure only authorized roles can view production logs
  • Maintain an audit trail — use CloudTrail to log all access and modifications to alarms and dashboards
  • Enforce log retention for compliance — respect GDPR, HIPAA, and SOX requirements when configuring retention periods
  • Implement PII redaction — configure automatic redaction of sensitive information before it reaches CloudWatch
  • Centralize cross-account logs — aggregate logs from multiple accounts into a dedicated security account
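
As a rough illustration of redacting sensitive values before they reach a log stream, the sketch below applies two regular expressions in application code. It is a minimal example under assumptions: the patterns cover only email addresses and `password=`, `token=`, or `secret=` pairs, the `redact` helper is hypothetical, and a production setup would need a broader, tested rule set.

```python
import re

# Simplified patterns for illustration; they are not exhaustive.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET = re.compile(r"(?i)\b(password|token|secret)=\S+")


def redact(message: str) -> str:
    """Mask email addresses and key=value secrets in a log message."""
    message = EMAIL.sub("[REDACTED_EMAIL]", message)
    return SECRET.sub(lambda match: f"{match.group(1)}=[REDACTED]", message)


print(redact("login failed for ops@example.com password=hunter2"))
# login failed for [REDACTED_EMAIL] password=[REDACTED]
```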

Failing to configure log retention policies for compliance-regulated data — HIPAA, GDPR, SOX, or PCI-DSS — can result in significant regulatory penalties. Establish retention policies as part of the initial log group creation process, never as an afterthought.

Automation

  • Leverage EventBridge for remediation — automate auto-scaling, service restarts, and failover procedures
  • Trigger Lambda functions from alarms — execute automatic actions such as creating snapshots before termination
  • Define monitoring as Infrastructure as Code — manage all alarms and dashboards in Terraform or CloudFormation
  • Integrate with CI/CD pipelines — ensure deployments automatically create or update associated alarms
  • Schedule maintenance windows — use EventBridge to disable alarm actions during planned maintenance

Common Mistakes

Cost Considerations

CloudWatch Pricing Breakdown

Component | Cost | Unit | Free Tier
Metrics: Standard AWS services | FREE | Unlimited | Permanent
Metrics: Custom | $0.30/month | Per metric | 10 metrics free
Metrics: High-Resolution Custom | $0.30/month | Per metric | Not included
API Requests: GetMetricStatistics | $0.01 | Per 1,000 requests | 1M free/month
Dashboard | $3/month | Per dashboard | 3 dashboards free
Alarms: Standard | $0.10/month | Per alarm | 10 alarms free
Alarms: High-Resolution | $0.30/month | Per alarm | Not included
Alarms: Composite | $0.50/month | Per alarm | Not included
Logs: Ingestion | $0.50 | Per GB ingested | 5 GB free/month
Logs: Storage | $0.03 | Per GB-month | 5 GB free/month
Logs: Archive to S3 | S3 pricing | See S3 costs | See S3 free tier
Logs Insights: Queries | $0.005 | Per GB scanned | Included in free tier logs
EventBridge: Custom Events | FREE | First 14M/month | Yes
EventBridge: Events over 14M | $1.00 | Per 1M events | Not included
Anomaly Detection | $0.30/month | Per metric monitored | Not included

Real Application Cost Example

Scenario: a multi-region web application with comprehensive monitoring.

Infrastructure under observation:

  • 10 EC2 instances, 5 per region
  • 2 Application Load Balancers, 1 per region
  • 2 RDS instances
  • Auto-Scaling Groups
  • Route53 health checks

Cost Category | Calculation | Monthly Cost
AWS service metrics (EC2, ALB, RDS, ASG) | Included free | $0.00
5 custom business metrics | 5 x $0.30 | $1.50
20 standard alarms | 20 x $0.10 | $2.00
2 composite alarms | 2 x $0.50 | $1.00
1 production dashboard | 1 x $3.00 | $3.00
Logs ingestion (50 GB/month) | 45 billable GB x $0.50 | $22.50
Logs storage (200 GB average with 30-day retention) | 195 billable GB x $0.03 | $5.85
Logs Insights queries (10 GB/month scanned) | 10 x $0.005 | $0.05
EventBridge (approximately 5M events/month) | Under 14M free threshold | $0.00
Total | | $35.90

Cost distribution insight: logs ingestion alone represents 63% of the total CloudWatch spend. This is the single greatest optimization opportunity.
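
The arithmetic behind the example can be reproduced with a few lines of Python. This sketch simply re-adds the line items using the unit prices and free-tier allowances quoted in this article; the dictionary keys are illustrative labels, and actual AWS prices vary by region and over time.

```python
from decimal import Decimal

# Quantities and unit prices as quoted in the cost example above.
line_items = {
    "custom metrics": Decimal("5") * Decimal("0.30"),
    "standard alarms": Decimal("20") * Decimal("0.10"),
    "composite alarms": Decimal("2") * Decimal("0.50"),
    "dashboards": Decimal("1") * Decimal("3.00"),
    # 5 GB of ingestion and 5 GB of storage fall under the free tier.
    "logs ingestion": (Decimal("50") - Decimal("5")) * Decimal("0.50"),
    "logs storage": (Decimal("200") - Decimal("5")) * Decimal("0.03"),
    "logs insights": Decimal("10") * Decimal("0.005"),
}

total = sum(line_items.values(), Decimal("0"))
ingestion_share = line_items["logs ingestion"] / total * 100

print(f"total: ${total:.2f}")                      # total: $35.90
print(f"ingestion share: {ingestion_share:.0f}%")  # ingestion share: 63%
```

Changing the ingestion volume in this sketch is a quick way to see how strongly the total tracks log volume compared with the other line items.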

Cost Optimization Strategies

Strategy 1: Differentiated Log Retention

The most impactful cost reduction comes from applying tiered retention rather than a uniform policy across all log groups.

Debug logs -- 7-day retention
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp-debug \
    --retention-in-days 7
Application logs -- 30-day retention
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp \
    --retention-in-days 30
Error logs -- 90-day retention
aws logs put-retention-policy \
    --log-group-name /aws/ec2/myapp-errors \
    --retention-in-days 90
Audit logs -- 1-year retention for compliance
aws logs put-retention-policy \
    --log-group-name /aws/audit \
    --retention-in-days 365

This tiered approach yields approximately 60% savings in storage costs compared to a blanket 90-day retention policy.

Strategy 2: Log Sampling for Successful Requests

# Sample 10% of successes, capture 100% of errors
import logging
import random

logger = logging.getLogger(__name__)

SUCCESS_SAMPLE_RATE = 0.1  # fraction of successful requests to log


def log_request(status_code, latency):
    # Always log errors (4xx and 5xx responses)
    if status_code >= 400:
        logger.error(f"Error {status_code}, latency {latency}ms")
        return

    # Sample 10% of successful requests
    if random.random() < SUCCESS_SAMPLE_RATE:
        logger.info(f"Success {status_code}, latency {latency}ms")

# Log reduction: up to roughly 90% less data when most requests succeed
# Full visibility of problems: 100% of errors captured
# Trend analysis preserved: at high request volumes a 10% sample is usually enough to track trends

Strategy 3: Metric Filters Instead of Custom Metrics

Cost-effective alternative -- metric filter on existing logs
aws logs put-metric-filter \
    --log-group-name /aws/ec2/myapp \
    --filter-name ErrorCount \
    --filter-pattern "[timestamp, level=ERROR, msg]" \
    --metric-transformations metricName=ApplicationErrors,metricNamespace=CustomApp/Logs,metricValue=1

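A metric filter derives the metric from logs you are already ingesting, so there is no application code to change and no PutMetricData API requests to pay for. Be aware that the metric a filter produces is still billed as a custom metric ($0.30/month per metric in the table above), so the saving comes from avoided API calls and engineering effort rather than from the metric charge itself.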

Strategy 4: Consolidated Dashboards

Instead of creating a dashboard per resource at $3/month each, aggregate all instances into a single "Production Overview" dashboard. For a fleet of 10 instances, this reduces dashboard costs from $30/month to $3/month.
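
One way to build such a consolidated dashboard is to generate the dashboard body programmatically. The Python sketch below is an illustration under assumptions: `overview_dashboard` is a hypothetical helper, the instance IDs are placeholders, and it only produces the JSON body, which would then be passed to `aws cloudwatch put-dashboard --dashboard-body file://...`.

```python
import json
from typing import Dict, List


def overview_dashboard(instance_ids: List[str], region: str) -> Dict:
    """Build one dashboard body with a single CPU widget covering every instance."""
    return {
        "widgets": [
            {
                "type": "metric",
                "properties": {
                    # One metric line per instance, all in the same widget.
                    "metrics": [
                        ["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]
                        for instance_id in instance_ids
                    ],
                    "period": 300,
                    "stat": "Average",
                    "region": region,
                    "title": "EC2 CPU Utilization (all instances)",
                },
            }
        ]
    }


# Placeholder instance IDs for a fleet of 10.
fleet = [f"i-{n:017x}" for n in range(10)]
body = overview_dashboard(fleet, "sa-east-1")
print(json.dumps(body, indent=2))
```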

Integration with Other AWS Services

AWS Service | Integration Method | Typical Use Case
EC2 | CloudWatch Agent sends metrics and logs | Memory, disk, and application log monitoring
Auto Scaling | Alarms trigger scaling policies | Scale out when CPU exceeds 80%
ALB/NLB | Native metrics: RequestCount, Latency, 5XX | Alarm on elevated error rate
RDS | Native metrics: CPUUtilization, FreeStorageSpace, Connections | Alarm when storage drops below 10 GB
Lambda | Native metrics and logs: Invocations, Errors, Duration | Alarm on error rate, log analysis
S3 | Request metrics, storage metrics | Monitor bucket size and request patterns
Route53 | Health check metrics: HealthCheckStatus | Alarm on failover events
SNS | Alarm notifications via topics | Email, SMS, and Lambda triggers
SQS | Queue metrics: ApproximateNumberOfMessages | Scale workers based on queue depth
API Gateway | Native metrics: Count, Latency, 4XX, 5XX | Monitor API performance and error rates
ECS/EKS | Container metrics and logs | Monitor containerized applications
Step Functions | Execution metrics | Monitor workflow success and failure rates
X-Ray | Distributed tracing integration | End-to-end request tracing across services
Systems Manager | Run Command integration | Automated remediation actions
EventBridge | Event-driven automation | React to resource state changes in real time
CloudTrail | Audit logs | Security and compliance monitoring

For AWS Solutions Architect Associate Certification