Amazon CloudWatch: Comprehensive Observability and Monitoring on AWS
CloudWatch metrics, logs, alarms, EventBridge, dashboards, and observability strategies for production workloads
Amazon CloudWatch serves as the foundational observability platform within the AWS ecosystem, collecting, monitoring, and analyzing metrics, logs, and events across your entire infrastructure and application stack. It functions as the central nervous system of any well-architected AWS deployment, delivering real-time visibility into the health, performance, and operational state of every resource under management.
The core problem it addresses: operational opacity. Without a robust observability layer, engineering teams discover failures only when end users report them — a reactive posture that erodes trust and extends incident resolution times. CloudWatch transforms this paradigm by enabling proactive monitoring, anomaly detection, automated remediation, and comprehensive performance trending.
When to deploy it: in every production environment, without exception. CloudWatch is not optional for any serious AWS architecture. Operating without observability is equivalent to flying blind — you cannot determine whether resources are healthy, diagnose the root cause of incidents, or anticipate when scaling is necessary.
Alternatives worth evaluating: Datadog, New Relic, Prometheus with Grafana, and the ELK Stack comprising Elasticsearch, Logstash, and Kibana. CloudWatch holds a distinct advantage through its native integration with AWS services and zero-infrastructure setup requirements, though third-party solutions often deliver more sophisticated analytics interfaces and richer visualization capabilities.
Key Concepts
| Concept | Description |
|---|---|
| Metric | A numeric data point that varies over time, represented as a time series of timestamp-value pairs. Examples include CPUUtilization, RequestCount, and DiskReadOps. Each metric belongs to a namespace such as AWS/EC2 or AWS/ApplicationELB. |
| Namespace | A container that groups related metrics. AWS uses namespaces like AWS/EC2 and AWS/RDS. Custom namespaces — such as CustomApp/Business — can be created for application-specific metrics. |
| Dimension | A key-value pair that identifies a metric variation, such as InstanceId=i-xxxxx or LoadBalancer=app/myapp-lb/xxx. Dimensions enable filtering and aggregation of metrics. |
| Statistic | An aggregation of data points over a defined period — Average, Sum, Minimum, Maximum, or SampleCount. This determines how the reported value is calculated. |
| Period | The time interval over which statistics are calculated, for example 60s, 300s, or 3600s. Standard metrics support a minimum of 60s, while high-resolution metrics can achieve 1s granularity. |
| Alarm | An automated monitor that compares a metric against a threshold and executes predefined actions when that threshold is breached. Alarms maintain three states: OK, ALARM, and INSUFFICIENT_DATA. |
| Alarm State | The current condition of an alarm. OK indicates the metric is within the defined threshold, ALARM indicates the metric has breached it (above or below, depending on the comparison operator), and INSUFFICIENT_DATA indicates there is not enough data to evaluate. |
| Evaluation Period | The number of most recent periods considered when evaluating the alarm. Unless Datapoints to Alarm is set lower, every one of these periods must breach the threshold before the state changes, which prevents false positives from transient spikes. |
| Datapoints to Alarm | The number of data points within the evaluation periods that must violate the threshold to trigger the alarm. A "3 of 5" configuration requires 3 breaching data points within 5 evaluation periods. |
| Composite Alarm | An alarm that combines multiple alarms using boolean logic — AND/OR operators — enabling complex alerting conditions based on multiple signals. |
| Log Group | A container for related log streams, such as /aws/lambda/my-function or /aws/ec2/myapp. Log groups define retention policies and encryption settings. |
| Log Stream | A sequence of log events from a single source, such as one EC2 instance or one Lambda execution environment (which may serve many invocations). A log group can contain multiple streams. |
| Log Event | An individual log message comprising a timestamp and content. Events can be plain text or structured JSON. |
| Metric Filter | A search pattern applied to logs that extracts data and publishes it as a CloudWatch metric. This enables custom metric creation from log data without code modifications. |
| Logs Insights | An interactive query service with SQL-like syntax for analyzing logs. It supports aggregations, filters, regular expressions, and statistical operations over large log volumes. |
| Log Retention | The duration CloudWatch stores logs before automatic deletion — configurable from 1 day to 10 years, or set to indefinite. This setting is critical for both compliance requirements and cost optimization. |
| EventBridge | An event bus service that captures AWS resource state changes and enables event-driven automation across your architecture. |
| Event Pattern | A JSON definition specifying which events to capture, such as "EC2 instance terminated in a specific Auto Scaling Group." Pattern matching is applied against the event structure. |
| Event Rule | The combination of an event pattern and a target action. Rules define the logic: "when X happens, execute Y." |
| Event Target | The destination for an event — Lambda, SNS, SQS, Step Functions, and others. A single rule can route to multiple targets. |
| Scheduled Expression | A cron or rate expression for periodic events. For example, cron(0 2 * * ? *) triggers daily at 2 AM UTC, while rate(5 minutes) triggers every 5 minutes. |
| CloudWatch Agent | A daemon that runs on EC2 or on-premises servers to send custom metrics and logs to CloudWatch. It enables monitoring of memory utilization, disk space, and custom application processes. |
| Dashboard | A customizable visualization surface for metrics, rendered as charts, numeric displays, and text widgets. Dashboards provide a consolidated view of application health. |
| Widget | An individual dashboard component — line chart, numeric display, log widget, or alarm status indicator — configured with specific metrics and time periods. |
| Anomaly Detection | A machine learning capability that learns normal metric behavior patterns and establishes expected value bands. Alarms can then trigger based on deviations from the learned baseline. |
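The relationship between Period and Statistic can be illustrated with a small local sketch. It does not call CloudWatch; it only shows how raw data points might be grouped into fixed periods and summarized with the five standard statistics. The function name `aggregate` and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate(datapoints, period_seconds):
    """Group (epoch_seconds, value) pairs into fixed periods and summarize each one."""
    buckets = defaultdict(list)
    for timestamp, value in datapoints:
        # Align each data point to the start of the period it falls in.
        bucket_start = timestamp - (timestamp % period_seconds)
        buckets[bucket_start].append(value)
    return {
        start: {
            "Average": mean(values),
            "Sum": sum(values),
            "Minimum": min(values),
            "Maximum": max(values),
            "SampleCount": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical CPU samples: (epoch seconds, percent).
samples = [(0, 10.0), (60, 20.0), (120, 30.0), (300, 50.0), (360, 70.0)]
result = aggregate(samples, period_seconds=300)
print(result)
```

With a 300-second period the five samples collapse into two data points, which is why the same metric can look smooth at one period and spiky at another.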
Essential AWS CLI Commands
aws cloudwatch list-metrics \
--namespace AWS/EC2

aws cloudwatch list-metrics \
--namespace AWS/EC2 \
--dimensions Name=InstanceId,Value=i-xxxxx

aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxxxx \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average Maximum \
--output table

cat > metric-queries.json <<EOF
[
{
"Id": "cpu",
"MetricStat": {
"Metric": {
"Namespace": "AWS/EC2",
"MetricName": "CPUUtilization",
"Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}]
},
"Period": 300,
"Stat": "Average"
}
},
{
"Id": "network",
"MetricStat": {
"Metric": {
"Namespace": "AWS/EC2",
"MetricName": "NetworkIn",
"Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}]
},
"Period": 300,
"Stat": "Sum"
}
}
]
EOF

aws cloudwatch get-metric-data \
--metric-data-queries file://metric-queries.json \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S)

aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name TargetResponseTime \
--dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Average Maximum

aws cloudwatch get-metric-statistics \
--namespace AWS/Route53 \
--metric-name HealthCheckStatus \
--dimensions Name=HealthCheckId,Value=abc123-healthcheck \
--start-time $(date -u -d '12 hours ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Minimum

aws cloudwatch put-metric-data \
--namespace CustomApp/Business \
--metric-name OrdersProcessed \
--value 142 \
--timestamp $(date -u +%Y-%m-%dT%H:%M:%S)

aws cloudwatch put-metric-data \
--namespace CustomApp/API \
--metric-name ResponseTime \
--value 234 \
--unit Milliseconds \
--dimensions Environment=Production,Region=sa-east-1

aws cloudwatch put-metric-data \
--namespace CustomApp/Database \
--metric-data \
MetricName=ActiveConnections,Value=45,Unit=Count \
MetricName=QueryTime,Value=123,Unit=Milliseconds \
MetricName=CacheHitRate,Value=89,Unit=Percent

aws cloudwatch put-metric-data \
--namespace CustomApp/HighRes \
--metric-name Latency \
--value 78 \
--unit Milliseconds \
--storage-resolution 1

aws cloudwatch put-metric-alarm \
--alarm-name high-cpu-i-xxxxx \
--alarm-description "CPU above 80% for 10 minutes" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxxxx \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--treat-missing-data notBreaching

aws cloudwatch put-metric-alarm \
--alarm-name high-cpu-with-notification \
--alarm-description "Alert ops team when CPU high" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxxxx \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts \
--ok-actions arn:aws:sns:sa-east-1:123456789012:ops-alerts

aws cloudwatch put-metric-alarm \
--alarm-name alb-high-latency \
--alarm-description "Target response time over 500ms" \
--namespace AWS/ApplicationELB \
--metric-name TargetResponseTime \
--dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
--statistic Average \
--period 60 \
--evaluation-periods 3 \
--threshold 0.5 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts

aws cloudwatch put-metric-alarm \
--alarm-name route53-primary-unhealthy \
--alarm-description "Primary region health check failed - failover activated" \
--namespace AWS/Route53 \
--metric-name HealthCheckStatus \
--dimensions Name=HealthCheckId,Value=abc123-healthcheck \
--statistic Minimum \
--period 60 \
--evaluation-periods 2 \
--threshold 1 \
--comparison-operator LessThanThreshold \
--alarm-actions arn:aws:sns:sa-east-1:123456789012:failover-alerts

aws cloudwatch put-metric-alarm \
--alarm-name alb-high-error-rate \
--alarm-description "Too many 5XX errors from targets" \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--dimensions Name=LoadBalancer,Value=app/myapp-lb/xxxxx \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 100 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts

aws cloudwatch put-metric-alarm \
--alarm-name cpu-high-3-of-5 \
--alarm-description "CPU high in 3 out of 5 datapoints" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxxxx \
--statistic Average \
--period 60 \
--evaluation-periods 5 \
--datapoints-to-alarm 3 \
--threshold 80 \
--comparison-operator GreaterThanThreshold

aws cloudwatch put-composite-alarm \
--alarm-name app-unhealthy \
--alarm-description "App is unhealthy if high CPU AND high error rate" \
--alarm-rule "ALARM(high-cpu-i-xxxxx) AND ALARM(alb-high-error-rate)" \
--alarm-actions arn:aws:sns:sa-east-1:123456789012:critical-alerts

aws cloudwatch describe-alarms

aws cloudwatch describe-alarms \
--state-value ALARM

aws cloudwatch describe-alarms \
--alarm-names high-cpu-i-xxxxx

aws cloudwatch describe-alarm-history \
--alarm-name high-cpu-i-xxxxx \
--max-records 10

aws cloudwatch disable-alarm-actions \
--alarm-names high-cpu-i-xxxxx

aws cloudwatch enable-alarm-actions \
--alarm-names high-cpu-i-xxxxx

aws cloudwatch set-alarm-state \
--alarm-name high-cpu-i-xxxxx \
--state-value ALARM \
--state-reason "Testing alarm notifications"

aws cloudwatch delete-alarms \
--alarm-names high-cpu-i-xxxxx alb-high-latency

aws logs create-log-group \
--log-group-name /aws/ec2/myapp

aws logs put-retention-policy \
--log-group-name /aws/ec2/myapp \
--retention-in-days 7

aws logs create-log-stream \
--log-group-name /aws/ec2/myapp \
--log-stream-name i-xxxxx-app.log

cat > log-events.json <<EOF
[
{
"timestamp": $(date +%s)000,
"message": "Application started successfully"
},
{
"timestamp": $(date +%s)000,
"message": "Connected to database"
}
]
EOF

aws logs put-log-events \
--log-group-name /aws/ec2/myapp \
--log-stream-name i-xxxxx-app.log \
--log-events file://log-events.json

aws logs tag-log-group \
--log-group-name /aws/ec2/myapp \
--tags Environment=Production,Application=MyApp

aws logs describe-log-groups

aws logs describe-log-streams \
--log-group-name /aws/ec2/myapp \
--order-by LastEventTime \
--descending \
--max-items 10

aws logs tail /aws/ec2/myapp --follow

aws logs filter-log-events \
--log-group-name /aws/ec2/myapp \
--start-time $(date -u -d '24 hours ago' +%s)000 \
--filter-pattern "ERROR"

aws logs filter-log-events \
--log-group-name /aws/ec2/myapp \
--filter-pattern '[timestamp, level=ERROR, msg]' \
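--max-items 50

# filter-log-events accepts a single log group, so loop over the groups that match the prefix
for group in $(aws logs describe-log-groups \
--log-group-name-prefix /aws/ec2/ \
--query 'logGroups[].logGroupName' \
--output text); do
aws logs filter-log-events \
--log-group-name "$group" \
--filter-pattern "database connection failed"
done

aws logs create-export-task \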
--log-group-name /aws/ec2/myapp \
--from $(date -u -d '7 days ago' +%s)000 \
--to $(date -u +%s)000 \
--destination myapp-logs-bucket \
--destination-prefix logs/2025/11/

aws logs start-query \
--log-group-name /aws/ec2/myapp \
--start-time $(date -u -d '24 hours ago' +%s) \
--end-time $(date -u +%s) \
--query-string '
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as error_count by @message
| sort error_count desc
| limit 10
'

aws logs get-query-results --query-id xxxxx-yyyy-zzzz

aws logs start-query \
--log-group-name /aws/lambda/my-api \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string '
fields @timestamp, endpoint, duration
| stats avg(duration) as avg_latency, max(duration) as max_latency by endpoint
| sort avg_latency desc
'

aws logs start-query \
--log-group-name /aws/ec2/myapp \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string '
fields @timestamp, @message, latency
| filter latency > 500
| sort latency desc
| limit 100
'

aws logs put-metric-filter \
--log-group-name /aws/ec2/myapp \
--filter-name ErrorCount \
--filter-pattern "[timestamp, level=ERROR, msg]" \
--metric-transformations \
metricName=ApplicationErrors,\
metricNamespace=CustomApp/Logs,\
metricValue=1,\
defaultValue=0

aws logs put-metric-filter \
--log-group-name /aws/ec2/myapp \
--filter-name ResponseTime \
--filter-pattern "[timestamp, level, msg, latency]" \
--metric-transformations \
metricName=ResponseLatency,\
metricNamespace=CustomApp/Logs,\
metricValue='$latency',\
unit=Milliseconds

aws logs describe-metric-filters \
--log-group-name /aws/ec2/myapp

aws logs delete-metric-filter \
--log-group-name /aws/ec2/myapp \
--filter-name ErrorCount

aws sns create-topic --name ec2-state-changes

aws sns subscribe \
--topic-arn arn:aws:sns:sa-east-1:123456789012:ec2-state-changes \
--protocol email \
--notification-endpoint ops@example.com

cat > event-pattern-terminated.json <<EOF
{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"],
"detail": {
"state": ["terminated"]
}
}
EOF

aws events put-rule \
--name notify-instance-terminated \
--description "Notify when EC2 instance is terminated" \
--event-pattern file://event-pattern-terminated.json

aws events put-targets \
--rule notify-instance-terminated \
--targets "Id"="1","Arn"="arn:aws:sns:sa-east-1:123456789012:ec2-state-changes"

cat > event-pattern-asg.json <<EOF
{
"source": ["aws.autoscaling"],
"detail-type": ["EC2 Instance Launch Successful", "EC2 Instance Terminate Successful"],
"detail": {
"AutoScalingGroupName": ["myapp-asg"]
}
}
EOF

aws events put-rule \
--name asg-scaling-events \
--event-pattern file://event-pattern-asg.json

aws events put-rule \
--name daily-cleanup \
--schedule-expression "cron(0 2 * * ? *)" \
--description "Run cleanup Lambda daily at 2 AM UTC"

aws events put-rule \
--name health-check-poller \
--schedule-expression "rate(5 minutes)" \
--description "Poll external health check every 5 minutes"

aws events put-targets \
--rule daily-cleanup \
--targets "Id"="1","Arn"="arn:aws:lambda:sa-east-1:123456789012:function:cleanup-function"

aws lambda add-permission \
--function-name cleanup-function \
--statement-id AllowEventBridgeInvoke \
--action lambda:InvokeFunction \
--principal events.amazonaws.com \
--source-arn arn:aws:events:sa-east-1:123456789012:rule/daily-cleanup

aws events list-rules

aws events describe-rule --name notify-instance-terminated

aws events list-targets-by-rule --rule notify-instance-terminated

aws events disable-rule --name daily-cleanup

aws events enable-rule --name daily-cleanup

aws events remove-targets \
--rule daily-cleanup \
--ids "1"

aws events delete-rule --name daily-cleanup

cat > dashboard-body.json <<EOF
{
"widgets": [
{
"type": "metric",
"properties": {
"metrics": [
["AWS/EC2", "CPUUtilization", {"stat": "Average"}]
],
"period": 300,
"stat": "Average",
"region": "sa-east-1",
"title": "EC2 CPU Utilization"
}
},
{
"type": "metric",
"properties": {
"metrics": [
["AWS/ApplicationELB", "RequestCount", {"stat": "Sum"}]
],
"period": 60,
"stat": "Sum",
"region": "sa-east-1",
"title": "ALB Request Count"
}
}
]
}
EOF

aws cloudwatch put-dashboard \
--dashboard-name MyApp-Production \
--dashboard-body file://dashboard-body.json

aws cloudwatch list-dashboards

aws cloudwatch get-dashboard --dashboard-name MyApp-Production

aws cloudwatch delete-dashboards --dashboard-names MyApp-Production

sudo yum install amazon-cloudwatch-agent -y

# The config directory is root-owned, so write the file through sudo tee
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/config.json > /dev/null <<EOF
{
"agent": {
"metrics_collection_interval": 60,
"run_as_user": "cwagent"
},
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/app.log",
"log_group_name": "/aws/ec2/myapp",
"log_stream_name": "{instance_id}-app.log",
"timezone": "UTC"
},
{
"file_path": "/var/log/nginx/error.log",
"log_group_name": "/aws/ec2/nginx",
"log_stream_name": "{instance_id}-error.log"
}
]
}
}
},
"metrics": {
"namespace": "CustomApp/System",
"metrics_collected": {
"mem": {
"measurement": [
{
"name": "mem_used_percent",
"rename": "MemoryUtilization",
"unit": "Percent"
}
],
"metrics_collection_interval": 60
},
"disk": {
"measurement": [
{
"name": "used_percent",
"rename": "DiskUtilization",
"unit": "Percent"
}
],
"metrics_collection_interval": 60,
"resources": ["/"]
}
}
}
}
EOF

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config \
-m ec2 \
-s \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a status \
-m ec2 \
-s

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a stop \
-m ec2 \
-s

Architecture and Flows
Complete Multi-Region Observability
Alarm Lifecycle Flow
Logs Pipeline
Best Practices
Observability Strategy
- Define key metrics first — identify the critical Service Level Indicators for your application before deploying any monitoring
- Implement the four Golden Signals — monitor Latency, Traffic, Errors, and Saturation as the foundation of your observability posture
- Adopt structured logging — use JSON with consistent fields such as timestamp, level, message, and context across all services
- Calibrate log levels by environment — DEBUG in development, INFO and WARN in staging, ERROR and above in production
- Enable distributed tracing — use X-Ray for microservices architectures and latency debugging across service boundaries
- Track custom business metrics — infrastructure monitoring alone is insufficient; measure business KPIs such as orders per minute and revenue throughput
- Maintain per-environment dashboards — separate Production, Staging, and Development dashboards to avoid confusion during incident response
The four Golden Signals — Latency, Traffic, Errors, and Saturation — originate from Google's Site Reliability Engineering methodology. Monitoring these four dimensions provides comprehensive coverage of most failure modes in distributed systems.
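The structured logging recommendation above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed setup: the `JsonFormatter` class, the `build_logger` helper, and the `request_id` field are hypothetical names, and the example writes to an in-memory buffer where a real service would typically write to stdout or a file collected by the CloudWatch Agent.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent fields."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlation ID passed through `extra`; None when the caller omits it.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


def build_logger(stream, name="myapp"):
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("order processed", extra={"request_id": "req-123"})
line = json.loads(buffer.getvalue())
print(line)
```

Because every entry is a single JSON object, Logs Insights can address `level`, `message`, and `request_id` as fields instead of relying on text matching.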
Alarming
- Prioritize alarms by severity — Critical triggers a pager, High sends email, Medium appears on the dashboard only
- Guard against alarm fatigue — alert exclusively on conditions that demand immediate human attention
- Set evaluation periods to 2 or higher — a minimum of 2 periods prevents false positives from transient spikes
- Use the M-of-N datapoints pattern — a configuration such as 3 of 5 offers greater flexibility and resilience against noise
- Segment SNS topics by severity — maintain separate topics for critical-alerts, warning-alerts, and info-alerts
- Document runbooks for every alarm — each alarm must have associated documentation describing the appropriate response
- Configure OK actions — notify the team when an alarm resolves, not only when it triggers
- Test alarms on a monthly cadence — simulate alarm conditions regularly to verify the entire notification chain
- Deploy composite alarms for signal correlation — combine related signals such as CPU, Memory, and Disk into a single composite evaluation
Alarm fatigue is one of the most dangerous failure modes in operational monitoring. When teams receive hundreds of non-actionable alerts, they begin ignoring all notifications — including the ones that indicate genuine production incidents. Every alarm must justify its existence.
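To make the M-of-N pattern concrete, the following sketch evaluates a list of data points the way a "3 of 5" alarm with a GreaterThanThreshold operator would, under a simplifying assumption: missing data points are ignored rather than handled through the treat-missing-data setting, which CloudWatch treats with more nuance. The function name `evaluate_alarm` is hypothetical.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return 'ALARM', 'OK', or 'INSUFFICIENT_DATA' for a GreaterThanThreshold alarm.

    datapoints holds one value per period, oldest first; None marks a missing period.
    """
    # Only the most recent N periods take part in the evaluation.
    window = datapoints[-evaluation_periods:]
    present = [value for value in window if value is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in present if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Three of the last five data points exceed 80, so a 3-of-5 alarm fires.
sustained = evaluate_alarm([50, 85, 90, 40, 95], 80, 5, 3)
# A single spike does not satisfy a 2-of-2 configuration.
spike = evaluate_alarm([40, 95], 80, 2, 2)
# No data at all leaves the alarm without enough information.
empty = evaluate_alarm([None, None], 80, 2, 2)
print(sustained, spike, empty)
```

The spike case shows why two or more evaluation periods reduce false positives: one breaching data point is never enough on its own.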
Logs Management
- Deploy the CloudWatch Agent on every instance — centralize log collection automatically across your fleet
- Apply differentiated retention by log type — Debug logs at 7 days, Error logs at 90 days, Audit logs at 1 year or longer
- Create metric filters for critical error patterns — convert log patterns into metrics without modifying application code
- Save and document Logs Insights queries — maintain a library of common queries for rapid troubleshooting during incidents
- Never log secrets — sanitize passwords, tokens, and personally identifiable information before writing to any log stream
- Embed correlation IDs — include a request ID in all log entries to enable end-to-end request tracing across services
- Export aged logs to S3 periodically — move older logs to S3 for compliance requirements and cost optimization
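One way to approach the "never log secrets" rule is to sanitize messages before they reach any handler. The sketch below is illustrative only: the patterns in `REDACTIONS` are hypothetical and deliberately narrow, so a real deployment would need patterns that match the secrets and identifiers its own services handle.

```python
import re

# Hypothetical patterns; extend them for the secrets your services actually handle.
REDACTIONS = [
    (re.compile(r"(password|token|api_key)=\S+", re.IGNORECASE), r"\1=[REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
]


def sanitize(message):
    """Apply every redaction pattern to a log message before it is written."""
    for pattern, replacement in REDACTIONS:
        message = pattern.sub(replacement, message)
    return message


clean = sanitize("login failed user=ana@example.com password=hunter2 token=abc123")
print(clean)
```

Redacting in the application keeps sensitive values out of the log stream entirely, which is safer than relying on masking after ingestion.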
Security and Compliance
- Encrypt sensitive logs at rest — use KMS encryption for log groups containing sensitive data
- Apply restrictive IAM policies — ensure only authorized roles can view production logs
- Maintain an audit trail — use CloudTrail to log all access and modifications to alarms and dashboards
- Enforce log retention for compliance — respect GDPR, HIPAA, and SOX requirements when configuring retention periods
- Implement PII redaction — configure automatic redaction of sensitive information before it reaches CloudWatch
- Centralize cross-account logs — aggregate logs from multiple accounts into a dedicated security account
Failing to configure log retention policies for compliance-regulated data — HIPAA, GDPR, SOX, or PCI-DSS — can result in significant regulatory penalties. Establish retention policies as part of the initial log group creation process, never as an afterthought.
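Retention policies accept only a fixed set of day values, so a compliance requirement such as "keep for 6 years" has to be rounded up to the next accepted value. The sketch below shows that rounding. The list in `ALLOWED_RETENTION_DAYS` reflects the values believed to be accepted at the time of writing and should be verified against the current `put-retention-policy` reference; the function name `retention_for` is hypothetical.

```python
import bisect

# Assumed set of accepted retention values; confirm against the current API reference.
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def retention_for(required_days):
    """Return the smallest accepted retention that covers the required number of days."""
    index = bisect.bisect_left(ALLOWED_RETENTION_DAYS, required_days)
    if index == len(ALLOWED_RETENTION_DAYS):
        raise ValueError("requirement exceeds the longest finite retention period")
    return ALLOWED_RETENTION_DAYS[index]


print(retention_for(90))    # exact match
print(retention_for(100))   # rounded up to the next accepted value
print(retention_for(2190))  # six years, rounded up
```

Rounding up rather than down keeps the policy on the safe side of the requirement, at the cost of slightly more storage.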
Automation
- Leverage EventBridge for remediation — automate auto-scaling, service restarts, and failover procedures
- Trigger Lambda functions from alarms — execute automatic actions such as creating snapshots before termination
- Define monitoring as Infrastructure as Code — manage all alarms and dashboards in Terraform or CloudFormation
- Integrate with CI/CD pipelines — ensure deployments automatically create or update associated alarms
- Schedule maintenance windows — use EventBridge to disable alarm actions during planned maintenance
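Event-driven remediation depends on understanding which events a rule will match. The sketch below mimics the basic exact-value matching used by event patterns: each pattern key lists allowed values, and nested objects are matched recursively. It is a local approximation for reasoning about patterns, not the EventBridge engine, and it omits features such as prefix, numeric, and anything-but matching. The function name `matches` is hypothetical.

```python
def matches(pattern, event):
    """Return True when every key in the pattern matches the event.

    Lists hold the allowed values for a field; nested dicts are matched recursively.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]},
}
terminated = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-xxxxx", "state": "terminated"},
}
running = {**terminated, "detail": {"instance-id": "i-xxxxx", "state": "running"}}

print(matches(pattern, terminated), matches(pattern, running))
```

Fields that appear in the event but not in the pattern, such as `instance-id` here, are ignored, which is why narrow patterns remain stable when AWS adds new event fields.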
Common Mistakes
Cost Considerations
CloudWatch Pricing Breakdown
| Component | Cost | Unit | Free Tier |
|---|---|---|---|
| Metrics — Standard AWS services | FREE | Unlimited | Permanent |
| Metrics — Custom | $0.30/month | Per metric | 10 metrics free |
| Metrics — High-Resolution Custom | $0.30/month | Per metric | Not included |
| API Requests — GetMetricStatistics | $0.01 | Per 1,000 requests | 1M free/month |
| Dashboard | $3/month | Per dashboard | 3 dashboards free |
| Alarms — Standard | $0.10/month | Per alarm | 10 alarms free |
| Alarms — High-Resolution | $0.30/month | Per alarm | Not included |
| Alarms — Composite | $0.50/month | Per alarm | Not included |
| Logs — Ingestion | $0.50 | Per GB ingested | 5 GB free/month |
| Logs — Storage | $0.03 | Per GB-month | 5 GB free/month |
| Logs — Archive to S3 | S3 pricing | See S3 costs | See S3 free tier |
| Logs Insights — Queries | $0.005 | Per GB scanned | Included in free tier logs |
| EventBridge — Custom Events | FREE | First 14M/month | Yes |
| EventBridge — Events over 14M | $1.00 | Per 1M events | Not included |
| Anomaly Detection | $0.30/month | Per metric monitored | Not included |
Real Application Cost Example
Scenario: a multi-region web application with comprehensive monitoring.
Infrastructure under observation:
- 10 EC2 instances, 5 per region
- 2 Application Load Balancers, 1 per region
- 2 RDS instances
- Auto-Scaling Groups
- Route53 health checks
| Cost Category | Calculation | Monthly Cost |
|---|---|---|
| AWS service metrics — EC2, ALB, RDS, ASG | Included free | $0.00 |
| 5 custom business metrics | 5 x $0.30 | $1.50 |
| 20 standard alarms | 20 x $0.10 | $2.00 |
| 2 composite alarms | 2 x $0.50 | $1.00 |
| 1 production dashboard | 1 x $3.00 | $3.00 |
| Logs ingestion — 50 GB/month | 45 billable GB x $0.50 | $22.50 |
| Logs storage — 200 GB average with 30-day retention | 195 billable GB x $0.03 | $5.85 |
| Logs Insights queries — 10 GB/month scanned | 10 x $0.005 | $0.05 |
| EventBridge — approximately 5M events/month | Under 14M free threshold | $0.00 |
| Total | | $35.90 |
Cost distribution insight: logs ingestion alone represents 63% of the total CloudWatch spend. This is the single greatest optimization opportunity.
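The arithmetic in the table can be reproduced with a few lines, which also makes it easy to substitute your own quantities. The quantities and unit prices below are copied from the table; the variable names are illustrative, and `Decimal` is used to avoid floating-point rounding in currency sums.

```python
from decimal import Decimal

# (description, billable quantity, unit price in USD) taken from the table above.
line_items = [
    ("custom metrics", 5, Decimal("0.30")),
    ("standard alarms", 20, Decimal("0.10")),
    ("composite alarms", 2, Decimal("0.50")),
    ("dashboards", 1, Decimal("3.00")),
    ("logs ingestion GB", 45, Decimal("0.50")),
    ("logs storage GB", 195, Decimal("0.03")),
    ("Logs Insights GB scanned", 10, Decimal("0.005")),
]

costs = {name: quantity * price for name, quantity, price in line_items}
total = sum(costs.values())
ingestion_share = costs["logs ingestion GB"] / total * 100

print(f"total: ${total:.2f}")
print(f"ingestion share: {ingestion_share:.0f}%")
```

Changing the ingestion quantity alone shows how strongly the total tracks log volume compared with any other line item.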
Cost Optimization Strategies
Strategy 1: Differentiated Log Retention
The most impactful cost reduction comes from applying tiered retention rather than a uniform policy across all log groups.
aws logs put-retention-policy \
--log-group-name /aws/ec2/myapp-debug \
--retention-in-days 7

aws logs put-retention-policy \
--log-group-name /aws/ec2/myapp \
--retention-in-days 30

aws logs put-retention-policy \
--log-group-name /aws/ec2/myapp-errors \
--retention-in-days 90

aws logs put-retention-policy \
--log-group-name /aws/audit \
--retention-in-days 365

This tiered approach yields approximately 60% savings in storage costs compared to a blanket 90-day retention policy.
Strategy 2: Log Sampling for Successful Requests
import logging
import random

logger = logging.getLogger(__name__)

def log_request(status_code, latency):
    """Log every error but only a 10% sample of successful requests."""
    # Always log errors
    if status_code >= 400:
        logger.error(f"Error {status_code}, latency {latency}ms")
        return
    # Sample 10% of successful requests
    if random.random() < 0.1:
        logger.info(f"Success {status_code}, latency {latency}ms")

# Log reduction: approximately 90% less data for successful requests
# Full visibility of problems: 100% of errors captured
# Trend analysis preserved: a 10% sample is usually sufficient at high request volumes

Strategy 3: Metric Filters Instead of Custom Metrics
aws logs put-metric-filter \
--log-group-name /aws/ec2/myapp \
--filter-name ErrorCount \
--filter-pattern "[timestamp, level=ERROR, msg]" \
--metric-transformations \
metricName=ApplicationErrors,\
metricNamespace=CustomApp/Logs,\
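metricValue=1

Publishing custom metrics from application code costs $0.30/month per metric plus the PutMetricData API calls. A metric filter derives the same error metric from logs you are already ingesting, with no code changes and no API calls from the application. Note that metrics produced by metric filters are still billed at the custom metric rate, so the saving comes from avoided API requests and engineering effort rather than from the per-metric charge.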
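Strategy 4: Consolidated Dashboards
Instead of creating a dashboard per resource at $3/month each, consolidate related widgets into a single dashboard. Replacing ten per-resource dashboards with one consolidated dashboard reduces the cost from $30/month to $3/month.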
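A consolidated dashboard is easier to maintain when its body is generated rather than written by hand. The sketch below builds a dashboard body with several metric widgets laid out on the 24-column grid. The helper `metric_widget`, the panel list, and the default region are hypothetical, and the resulting JSON would still need to be passed to `put-dashboard` separately.

```python
import json


def metric_widget(title, namespace, metric, stat, x, y, region="sa-east-1"):
    """Build one metric widget occupying half of the 24-column dashboard grid."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "metrics": [[namespace, metric]],
            "period": 300,
            "stat": stat,
            "region": region,
            "title": title,
        },
    }


# Hypothetical panels that would otherwise live on separate dashboards.
panels = [
    ("EC2 CPU Utilization", "AWS/EC2", "CPUUtilization", "Average"),
    ("ALB Request Count", "AWS/ApplicationELB", "RequestCount", "Sum"),
    ("RDS Free Storage", "AWS/RDS", "FreeStorageSpace", "Minimum"),
]

# Place widgets two per row: columns 0 and 12, a new row every 6 grid units.
widgets = [
    metric_widget(title, namespace, metric, stat, x=(i % 2) * 12, y=(i // 2) * 6)
    for i, (title, namespace, metric, stat) in enumerate(panels)
]
dashboard_body = json.dumps({"widgets": widgets}, indent=2)
print(dashboard_body)
```

Saving the output to a file such as dashboard-body.json lets it be used with the put-dashboard command shown earlier.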
Integration with Other AWS Services
| AWS Service | Integration Method | Typical Use Case |
|---|---|---|
| EC2 | CloudWatch Agent sends metrics and logs | Memory, disk, and application log monitoring |
| Auto Scaling | Alarms trigger scaling policies | Scale out when CPU exceeds 80% |
| ALB/NLB | Native metrics — RequestCount, Latency, 5XX | Alarm on elevated error rate |
| RDS | Native metrics — CPUUtilization, FreeStorageSpace, Connections | Alarm when storage drops below 10 GB |
| Lambda | Native metrics and logs — Invocations, Errors, Duration | Alarm on error rate, log analysis |
| S3 | Request metrics, storage metrics | Monitor bucket size and request patterns |
| Route53 | Health check metrics — HealthCheckStatus | Alarm on failover events |
| SNS | Alarm notifications via topics | Email, SMS, and Lambda triggers |
| SQS | Queue metrics — ApproximateNumberOfMessages | Scale workers based on queue depth |
| API Gateway | Native metrics — Count, Latency, 4XX, 5XX | Monitor API performance and error rates |
| ECS/EKS | Container metrics and logs | Monitor containerized applications |
| Step Functions | Execution metrics | Monitor workflow success and failure rates |
| X-Ray | Distributed tracing integration | End-to-end request tracing across services |
| Systems Manager | Run Command integration | Automated remediation actions |
| EventBridge | Event-driven automation | React to resource state changes in real time |
| CloudTrail | Audit logs | Security and compliance monitoring |
Additional Resources
Official AWS Documentation
- CloudWatch Developer Guide
- CloudWatch Logs Developer Guide
- EventBridge Developer Guide
- CloudWatch Agent Configuration
- CloudWatch Pricing
Whitepapers and Best Practices
- AWS Well-Architected Framework — Operational Excellence Pillar
- Monitoring and Observability Best Practices
- AWS Observability Best Practices
Hands-On Tutorials
- CloudWatch Workshop
- One Observability Workshop
- Building Dashboards
- CloudWatch Logs Insights Tutorial