"Deviation from Norm" and "Frog Boil" type alerting
Currently Atlas only alerts us when CPU reaches a critical threshold such as 90%. We would like to see additional types of alerts to detect issues sooner, including the following.
- “Deviation from norm” - A given metric is X% worse than its Y-hour average over the same time window for the past Z days (e.g. “CPU at 4am today is much higher than the average for 3am to 4am over the past 7 days.”)
- “Frog boil” - A given metric has gradually become X% worse over Y hours / days / weeks (e.g. “CPU usage is 10% higher on average today than it was 1 week ago.”)
Such alert criteria would allow us to detect and respond to critical issues earlier, i.e. before we hit 90% CPU.
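To make the two criteria concrete, here is a rough Python sketch of how we imagine them being evaluated. This is not existing Atlas functionality. The function names, the plain lists of samples, and the reading of "X% worse" as a relative increase over the baseline mean (rather than percentage points) are all assumptions made for illustration.

```python
from statistics import mean


def pct_increase(current, baseline):
    """Relative increase of current over baseline, in percent (assumed reading of 'X% worse')."""
    if baseline == 0:
        # No meaningful ratio exists; treat any positive value as an unbounded increase.
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline * 100.0


def deviation_from_norm(current, same_window_history, threshold_pct):
    """'Deviation from norm': the current value is more than threshold_pct (X) percent
    above the mean of samples from the same time window over the past Z days."""
    return pct_increase(current, mean(same_window_history)) > threshold_pct


def frog_boil(recent_samples, earlier_samples, threshold_pct):
    """'Frog boil': the mean of the recent period is more than threshold_pct (X) percent
    above the mean of an equally long period Y hours / days / weeks earlier."""
    return pct_increase(mean(recent_samples), mean(earlier_samples)) > threshold_pct


# Hypothetical CPU readings (percent), one per day for the 3am to 4am window.
past_week_3am_to_4am = [40, 42, 38, 40, 41, 39, 40]
spike_alert = deviation_from_norm(60, past_week_3am_to_4am, threshold_pct=25)

# Hypothetical daily averages for today versus the same day one week ago.
creep_alert = frog_boil([56, 57, 55], [50, 50, 50], threshold_pct=10)
```

In this sketch both checks would fire well below a fixed 90% CPU threshold, which is the point of the request. A real implementation would also need to decide how to handle missing samples and noisy metrics.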
6 votes
Johnny Shields shared this idea