|
What problem are you trying to solve?
Focus on the what and why of the need you have, not the how you'd like it solved.
|
Our service operates under strict operational policies: quarterly maintenance cycles, no scheduled downtime, and an expectation of fully uninterrupted service. Other managed services such as Amazon RDS (Aurora MySQL) or Elastic allow customers to ignore or skip minor patches and apply only major updates in a controlled, non-disruptive manner. In Atlas, however, system-initiated maintenance can still trigger instance restarts or minor version updates that we cannot defer or ignore. Even though these events are designed to be non-disruptive, the resulting failovers and alerts create operational workload, trigger internal escalations, and raise service reliability concerns.
|
|
What would you like to see happen?
Describe the desired outcome or enhancement.
|
A customer-controlled option to fully skip, defer, or disable maintenance tasks, including minor version patches and system-initiated host updates.
Version-level control, allowing customers to skip specific minor versions or apply updates only during explicitly approved windows.
A strict no-downtime mode, similar to AWS/Aurora’s optional OS/security patch controls, allowing customers to opt out of non-critical maintenance.
|
|
Why is this important to you or your team?
Explain how the request adds value or solves a business need.
|
We run a B2C service that is extremely sensitive to even momentary failovers. Although Atlas maintenance is designed to avoid service interruption, the failovers and the CPU spikes during replica set elections still generate nighttime alerts that require manual checks and follow-up actions.
This adds unnecessary operational cost and creates reliability concerns for our service teams.
If unplanned maintenance continues, it may become a barrier to expanding Atlas adoption across more services, despite our willingness to grow usage.
|
What steps, if any, are you taking today to manage this problem?
|
Currently, we manage the issue by:
Conducting service health checks after every maintenance event
Responding to nighttime alerts caused by instance restarts
Manually verifying stability after minor version changes
Enforcing maintenance control and no-downtime operation policies internally
|
This is a critical matter that significantly impacts our credibility with users. It also imposes a substantial operational burden, because each event requires follow-up actions from our Developer, DBA, Product, and CS teams.