Force restart of unhealthy node
Right now, there's a "test failover" option, which shuts down the primary and forces an election. However, the option is only available if the cluster is in a healthy state.
If, for whatever reason, the cluster is unhealthy, it's impossible to manually restart the primary. It should be possible to force an election in an unhealthy state. Often, this is all that is required to get back into a healthy state (e.g. if the primary is stuck in a CPU-burning loop caused by an unexpected write pattern that has since stopped).
-
Jon Sapyta commented
We also have this issue: if the primary gets loaded, auto-scaling doesn't have the capacity to scale, so it just gets stuck, and you have to call support while your application is effectively down or struggling. We've just scaled up our instances to overcome this, but it's a massive waste of resources to keep them scaled up so much just because scaling can't respond under load and isn't configurable. We've started to look at other DB solutions with more effective scaling.
It's unbelievable that 'dark mode' for the UI is being worked on, while critical issues with scaling that cause outages are not.
-
Hemang commented
We have faced this issue multiple times: the primary got loaded, tried to upscale the instance, and became unresponsive. At that point we need help from the support team, which is again a process-based task. If we had control to restart the node ourselves, it would be faster than what we are facing right now.
-
Jason commented
This is a badly needed feature. The only solution to force an election is to contact Mongo support and wait upwards of 2 hours so that they can force a restart of the process on the unhealthy node. This has happened several times since we've started using this service and it's getting to the point now that we may need to start looking at alternatives because we might lose customers due to a lack of confidence in the system being available.
-
Pedro commented
This is definitely a useful feature that should get implemented. Having to wait for support to restart a faulty node increases MTTR, and that delay could make a huge difference in averting a disaster or at least mitigating it more quickly. Please consider implementing this.