Status Submitted
Created by Carolin Klose
Created on Feb 27, 2026

Prometheus Metric for Atlas Maintenance Events

What problem are you trying to solve?

Our customer operates a large, business‑critical Atlas footprint and standardizes all alerting on Prometheus + Alertmanager. Today, they have no direct, machine‑readable signal in Prometheus for upcoming or ongoing Atlas maintenance events. As a result, they cannot easily distinguish planned maintenance from genuine incidents in their central monitoring stack, and they must configure and maintain separate alerts in the Atlas UI, which also requires broader Atlas access for non‑admin team members.

What would you like to see happen?

We would like Atlas’s Prometheus integration to expose a dedicated metric family for maintenance, e.g. mongodb_maintenance_*, that indicates both scheduled and in‑progress maintenance at the project/cluster level. The metric should include labels for the configured maintenance window (day/time/timezone) and the timestamp of the next scheduled maintenance, as well as a clear signal when maintenance is currently running. The intent is that customers can drive all maintenance‑related alerting directly from Prometheus/Alertmanager through the existing Prometheus integration. It is important that this also works over PrivateLink where configured.
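
To make the request concrete, here is a minimal sketch of what such a metric family could look like in the Prometheus text exposition format. Everything in it is an assumption for illustration: the metric names (mongodb_maintenance_window_info, mongodb_maintenance_next_start_timestamp_seconds, mongodb_maintenance_in_progress), the label names, and the MaintenanceState type are hypothetical and do not exist in Atlas today.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceState:
    """Hypothetical per-cluster maintenance state; all field names are assumptions."""
    project: str
    cluster: str
    window_day: str        # e.g. "Sunday"
    window_hour: int       # 0-23, in the window's timezone
    timezone: str          # e.g. "UTC"
    next_start_epoch: int  # Unix seconds of the next scheduled maintenance
    in_progress: bool


def render_metrics(state: MaintenanceState) -> str:
    """Render the state in Prometheus text exposition format.

    Label values are not escaped here; a real exporter would have to do that.
    """
    base = f'project="{state.project}",cluster="{state.cluster}"'
    window = (
        f'{base},day="{state.window_day}",'
        f'hour="{state.window_hour:02d}",timezone="{state.timezone}"'
    )
    lines = [
        "# TYPE mongodb_maintenance_window_info gauge",
        f"mongodb_maintenance_window_info{{{window}}} 1",
        "# TYPE mongodb_maintenance_next_start_timestamp_seconds gauge",
        f"mongodb_maintenance_next_start_timestamp_seconds{{{base}}} {state.next_start_epoch}",
        "# TYPE mongodb_maintenance_in_progress gauge",
        f"mongodb_maintenance_in_progress{{{base}}} {int(state.in_progress)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values only; the epoch below is arbitrary.
    print(render_metrics(
        MaintenanceState("my-project", "cluster0", "Sunday", 3, "UTC", 1800000000, False)
    ))
```

One design choice in this sketch differs slightly from the wording above: the next maintenance timestamp is exposed as a gauge value rather than as a label, because a label whose value changes creates a new time series on every change. The window configuration changes rarely, so it is kept as labels on an info-style metric with a constant value of 1.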

Why is this important to you or your team?

The customer has strict reliability requirements and wants a single, trusted observability stack. Having maintenance visibility only in the Atlas UI or via ad‑hoc API/webhook integrations fragments their alerting story and makes it harder to coordinate operational response. A first‑class Prometheus metric would:

  • Centralize alerting for Atlas alongside other systems.

  • Make it much easier to separate planned maintenance from real incidents.

  • Allow proactive planning (dashboards, alerts before maintenance windows).
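
As a rough illustration of the second and third points, the sketch below shows the kind of decision logic a native maintenance signal would enable. It assumes two hypothetical inputs that Atlas does not expose today: a boolean "maintenance in progress" value and the Unix timestamp of the next scheduled maintenance. The function names and the 24-hour default lead time are also assumptions.

```python
def classify_member_down(member_down: bool, maintenance_in_progress: bool) -> str:
    """Separate planned maintenance from a genuine incident.

    Both inputs are assumed to come from Prometheus; the maintenance flag
    is the hypothetical signal requested in this idea.
    """
    if not member_down:
        return "ok"
    return "planned-maintenance" if maintenance_in_progress else "incident"


def maintenance_due_soon(next_start_epoch: int, now_epoch: int,
                         lead_seconds: int = 24 * 3600) -> bool:
    """Return True when the next maintenance starts within the lead time."""
    remaining = next_start_epoch - now_epoch
    return 0 <= remaining <= lead_seconds
```

In practice this logic would more likely live in Prometheus alerting rules and Alertmanager inhibition than in application code; the Python form is only meant to show which inputs are missing today.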

The customer explicitly called out the next World Cup as a high‑risk period with expected peak traffic and would like unified monitoring for Atlas maintenance in place before then, in line with their internal reliability objectives. The need is long‑term and not limited to that period.

What steps, if any, are you taking today to manage this problem?

Today, the customer uses an indirect workaround: they alert on the availability of cluster members via Prometheus and infer that maintenance is happening when replicas briefly go down during a failover. This helps, but it is noisy and does not provide forward‑looking insight into upcoming maintenance. We also discussed alternatives such as consuming maintenance events from the Atlas Activity Feed via the Admin API or webhooks, but these options require the customer to build and operate custom integrations and/or manage separate alert rules in Atlas, which they view as less desirable than a native Prometheus metric.
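
The indirect workaround can be sketched roughly as follows, assuming the customer has per-member availability samples of the form (epoch_seconds, up) with up equal to 1 or 0. The function names and the 120-second "brief outage" threshold are hypothetical and not taken from the customer's setup.

```python
def down_intervals(samples):
    """Return (start, end) pairs during which a member was reported down.

    samples: list of (epoch_seconds, up) tuples sorted by time, up is 1 or 0.
    """
    intervals = []
    start = None
    for ts, up in samples:
        if up == 0 and start is None:
            start = ts                      # outage begins
        elif up == 1 and start is not None:
            intervals.append((start, ts))   # outage ends
            start = None
    if start is not None:
        # Still down at the last sample; close the interval there.
        intervals.append((start, samples[-1][0]))
    return intervals


def guess_cause(interval, brief_seconds=120):
    """Guess the cause from outage length alone (the noisy heuristic)."""
    start, end = interval
    if end - start <= brief_seconds:
        return "possible maintenance failover"
    return "possible incident"
```

The sketch makes the limitation visible: outage duration is the only input, so a short genuine incident looks the same as a planned failover, and nothing is known before the member goes down.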