Force restart of unhealthy node

Right now, there's a "test failover" option, which shuts down the primary and forces an election. However, the option is only available if the cluster is in a healthy state.

If the cluster is unhealthy for any reason, there is no way to restart the primary manually. It should be possible to force an election even when the cluster is unhealthy. Often this is all that is required to return to a healthy state (e.g. when the primary is stuck in a CPU-burning loop caused by an unexpected write pattern that has since stopped).

124 votes

Robbie shared this idea · Jul 22, 2020

Nathan commented · July 10, 2025 2:27 PM

We've been waiting 18 hours for support (which we are paying good money for) to reboot our cluster. A large operation caused the dirty cache fill ratio to jump to over 20%, which completely CPU-locked a secondary node that was evicting pages.

We can't scale it up ourselves; we tried, but it is simply stuck and the operation failed.

It's completely ridiculous that the best they can suggest for self-service is "Test Resilience", which causes a primary failover and which their UI blocks you from doing in many cases (such as the one we currently find ourselves in).

I will never recommend Atlas again until this is resolved. This suggestion has been open for nearly 5 years so I wouldn't hold my breath.

Surajit commented · December 27, 2024 7:19 AM

We have faced this problem many times on our production server. There is no way out of the 100% CPU burn problem unless someone from the support team restarts our server for us, which causes almost 2-3 hours of downtime for our servers.

Jon Sapyta commented · October 25, 2023 12:05 PM

We also have this issue: if the primary gets loaded, scaling doesn't have the capacity to scale and just gets stuck, and you have to call support while your application is effectively down or struggling. We've scaled up our instances to work around this, but keeping them scaled up so much is a massive waste of resources, because scaling can't respond under load and isn't configurable. We've started to look at other DB solutions with more effective scaling.

It's unbelievable that 'dark mode' for the UI is being worked on, while critical issues with scaling that cause outages are not.

Hemang commented · May 12, 2023 5:39 AM

We have faced this issue multiple times: when the primary got loaded and we tried to upscale the instance, it became unresponsive. At that point we need help from the support team, which is again a process-based task. If we had control to restart the node, recovery would be faster than what we are facing right now.

Jason commented · October 29, 2020 12:47 PM

This is a badly needed feature. The only solution to force an election is to contact Mongo support and wait upwards of 2 hours so that they can force a restart of the process on the unhealthy node. This has happened several times since we've started using this service and it's getting to the point now that we may need to start looking at alternatives because we might lose customers due to a lack of confidence in the system being available.

Pedro commented · October 15, 2020 6:43 AM

This is definitely a useful feature that should get implemented. Having to wait for support to restart a faulty node increases MTTR, which could have a huge impact on averting a disaster or at least mitigating it more quickly. Please consider implementing this.
