Scheduled stepdown for smoother primary election
Stepdown is a great tool for keeping clusters operating smoothly. We use it, for example, when we need to perform maintenance on the host where the primary is currently running, during a rolling upgrade, and in many other cases where the primary has to be moved to another node.
While electing a new primary is usually fast enough, on clusters with very high write traffic it sometimes leads to write errors on the application side. The reason is that drivers need to disconnect from the previous primary and connect to the new one, and this takes time. Every write attempted during that window fails with an error. Of course those operations can be retried after a short time, but it seems to me that this situation could be avoided fairly easily.
Here's my idea:
- We add an optional parameter to the stepDown method, let's name it electionEffectiveTimeSeconds. If it is not set, stepDown works as it does today,
- The new primary is elected, but the cluster and drivers do not use it as a primary yet; instead they wait for electionEffectiveTimeSeconds seconds after the election is done,
- Until then, the previous primary keeps handling write operations,
- In the meantime, all cluster nodes and drivers receive information about the new primary and about when the change will take effect, based on electionEffectiveTimeSeconds,
- This allows all parties to prepare properly for the switchover, which should minimize the time during which no primary is available,
- When the time expires, the cluster and drivers switch to the new primary.
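The steps above can be sketched as a small simulation. This is only an illustration of the idea, not an existing server or driver API: the `ScheduledSwitchover` class, its fields, and the node names are all made up for the example, and electionEffectiveTimeSeconds is the parameter proposed in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledSwitchover:
    """Hypothetical announcement sent to all nodes and drivers after the election."""

    old_primary: str
    new_primary: str
    elected_at: float  # time the election finished, in seconds
    election_effective_time_seconds: float  # the proposed stepDown parameter

    @property
    def effective_at(self) -> float:
        # Moment at which everyone switches to the new primary.
        return self.elected_at + self.election_effective_time_seconds

    def write_target(self, now: float) -> str:
        """Return the node that should receive writes at time `now`."""
        if now < self.effective_at:
            return self.old_primary  # old primary keeps serving writes
        return self.new_primary


plan = ScheduledSwitchover(
    old_primary="node-a",
    new_primary="node-b",
    elected_at=100.0,
    election_effective_time_seconds=5.0,
)
print(plan.write_target(102.0))  # node-a
print(plan.write_target(105.0))  # node-b
```

Because every party knows `effective_at` in advance, a driver could open its connections to the new primary during the waiting period instead of discovering the change through a failed write.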
The key element will be determining the optimal range for this new parameter. On the one hand, it must be long enough for the information to spread through the cluster and to the drivers. On the other hand, it must not be too long, because there is a risk that, during that time, the newly elected primary will fall behind on replication and will not be able to take over. In that case, however, the scheduled stepdown could be tried one more time, and if that also fails, a normal stepdown could be performed on the third attempt.
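The fallback described here could look roughly like the sketch below. The function and its arguments are hypothetical placeholders: `scheduled_attempt` stands for one scheduled stepdown that returns False when the elected node could not take over (for example because it lagged), and `normal_step_down` stands for today's behavior.

```python
from typing import Callable


def step_down_with_fallback(
    scheduled_attempt: Callable[[], bool],
    normal_step_down: Callable[[], None],
    max_scheduled_attempts: int = 2,
) -> str:
    """Try the scheduled stepdown a limited number of times, then fall back."""
    for attempt in range(1, max_scheduled_attempts + 1):
        if scheduled_attempt():
            return f"scheduled (attempt {attempt})"
    # Both scheduled attempts failed: perform a normal, immediate stepdown.
    normal_step_down()
    return "normal"


# Example: both scheduled attempts fail, so the third try is a normal stepdown.
outcomes = iter([False, False])
log = []
result = step_down_with_fallback(
    scheduled_attempt=lambda: next(outcomes),
    normal_step_down=lambda: log.append("normal stepDown"),
)
print(result)  # normal
```

Capping the scheduled attempts keeps the worst case bounded: the operator still gets a stepdown, just without the smoother handover.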
I think this method would be useful for all stepdowns that are planned in advance and do not need to happen immediately.