Add 1-second granularity to Atlas metrics
At present the finest granularity of Atlas metrics is 1 minute. Because the metrics are averaged over each minute, they give no information on spikes that last only a few seconds.
Reducing the granularity to 1 second would give more insight.
Rather than increasing granularity, just reporting the `max` rather than (or in addition to) the average would go a long way towards making the metrics useful.
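To illustrate the point about averages versus `max`, here is a minimal sketch with made-up numbers (these are not real Atlas data, and `summarize` is a hypothetical helper, not an Atlas API). It assumes one CPU sample per second, a steady 20% baseline, and a 3-second spike to 100%.

```python
from statistics import mean

# Hypothetical per-second CPU utilisation (%) for one minute:
# a steady 20% baseline with a 3-second spike to 100%.
samples = [20.0] * 60
samples[30:33] = [100.0, 100.0, 100.0]


def summarize(window):
    """Return the (average, max) of one window of per-second samples."""
    return mean(window), max(window)


avg, peak = summarize(samples)
print(f"1-minute average: {avg:.1f}%")  # the spike barely moves the average
print(f"1-minute max: {peak:.1f}%")     # the max preserves the spike
```

With these assumed values the one-minute average comes out at 24%, while the max still shows 100%, so a chart of averages alone would look unremarkable.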
We have to open up a support ticket to request "FTDC diagnostic data" every time our clusters do anything weird, and it's a pain.
For write-heavy workloads, sub-minute granularity of disk latency and IOPS would be useful to visually identify the limits of performance.
We have also recently had alerts for metrics that we cannot view at a finer granularity. We provisioned our IOPS for what we believed our usage to be, but recently found out that we have occasional sub-minute spikes of much higher usage.
Having alerts that fire from data points at a finer granularity than the visible metrics is very misleading.
In our case, multiple alerts trigger, but when you look at the metrics everything looks fine.
It makes debugging issues almost impossible.
Hi Rez, please look at support case 00643354 - a short CPU usage spike causes disconnections from the server. This CPU spike is invisible to Atlas users since over a period of 1 minute the average CPU usage is smoothed down 60%.
I suggest exposing CPU (and possibly network as well) at a finer granularity.
Hi Sudheer - Thanks for the feedback. We are considering this. Which metrics do you reckon are the most important to provide at <1m granularity, and why?