AWS CloudWatch Alarms Too Noisy Due To Ignoring Missing Data in Averages

I want to know when our app starts getting slower, so I set up an alarm on the Latency metric of our ELB. According to the AWS Console, "This alarm will trigger when the blue line [average latency over the period of 15 min] goes above the red line [2 sec] for a duration of 45 minutes." (I.e., it triggers if Latency > 2 for 3 consecutive period(s).) This is exactly what I need - except that it is a lie.

Last night I got 8 alarm/OK notifications even though the average latency was never over 2 sec for 45 minutes. The problem is that CloudWatch ignores null/missing data. So if you have a slow request at 3am and no other request comes until 4am, it will look at [slow, null, null, null] and trigger the alarm.
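
To make the difference concrete, here is a toy model of the behavior described above. It is not CloudWatch's actual evaluation algorithm, only a sketch of "ignore missing data" versus "treat missing data as 0"; the function name `alarm_fires` and the sample latency of 3.5 sec are made up for illustration.

```python
from typing import Optional, Sequence


def alarm_fires(
    window: Sequence[Optional[float]],
    threshold: float,
    missing_as_zero: bool = False,
) -> bool:
    """Toy alarm check over one evaluation window (e.g. 3 periods of 15 min).

    `None` stands for a period in which no request arrived.
    """
    if missing_as_zero:
        # Desired behavior: a period without traffic counts as latency 0.
        values = [0.0 if v is None else v for v in window]
    else:
        # Observed behavior: periods without data are simply skipped.
        values = [v for v in window if v is not None]
    # Fire only if there is at least one datapoint and all of them breach.
    return bool(values) and all(v > threshold for v in values)


# One slow request at 3am, then nothing for the rest of the window.
night = [3.5, None, None]
ignoring = alarm_fires(night, threshold=2.0)
as_zero = alarm_fires(night, threshold=2.0, missing_as_zero=True)
```

With missing periods skipped, the single slow datapoint is the only evidence and the alarm fires; with missing periods counted as 0, it stays quiet.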

So I want to configure it to treat null as 0 and preferably to ignore latency if it only affected a single user. But there is no way to do this in CloudWatch.

Solution: I will likely need to run my own job that reads the metrics and produces a normalized, reasonable metric - replacing null / missing data with 0 and weighting the average latency by the number of users in the period.
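
A minimal sketch of what such a normalization could look like, under my own assumptions: the number of requests in a period stands in for the number of users, and `min_requests` is a hypothetical cutoff below which the latency is scaled down proportionally. Neither the function name nor the cutoff comes from CloudWatch.

```python
from typing import Optional


def normalized_latency(
    avg_latency: Optional[float],
    request_count: int,
    min_requests: int = 10,  # hypothetical cutoff, tune to your traffic
) -> float:
    """Return a latency value that is safe to alarm on.

    - A period without data (None or zero requests) becomes 0.
    - A period with few requests is scaled down, so that a single slow
      request cannot push the metric over the threshold by itself.
    """
    if avg_latency is None or request_count <= 0:
        return 0.0
    # Weight grows linearly with traffic and is capped at 1.
    weight = min(1.0, request_count / min_requests)
    return avg_latency * weight


quiet_night = normalized_latency(4.0, request_count=1)   # one slow request
busy_day = normalized_latency(4.0, request_count=200)    # many slow requests
no_traffic = normalized_latency(None, request_count=0)   # missing data
```

The job would read the raw Latency and RequestCount datapoints from CloudWatch, push each period through a function like this, and publish the result as a custom metric on which the alarm is then defined (with boto3 that would presumably be `get_metric_statistics` and `put_metric_data`). A linear weight is only one possible choice; a hard cutoff that zeroes out low-traffic periods would be another.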

Tags: monitoring DevOps


Copyright © 2024 Jakub Holý