Improve handling of ROLLBACK state
Recently, a member of a replica set in one of our sharded clusters entered ROLLBACK state. We only noticed by chance, because an automated email from Atlas about primary elections showed one of the members in ROLLBACK state.
Fortunately for us, the writes that were reverted were not critical, and we were fine with losing them. However, this could have resulted in serious data loss that went unnoticed until a customer reached out to us.
The handling of a situation like this should be much better and more user-friendly. We should have received a proper notification for an issue like this, and there should be an easy way to recover the data if we want to. Currently, if you happen to notice that a member entered ROLLBACK state, you need to reach out to Support, get them to make the files available to you, manually spin up a new cluster, import the data into the new cluster and inspect it there, and finally upload it to the original cluster. It's a very manual process, and I think it could be improved considering this is a managed service.
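Until something like this is built in, a stopgap for the notification part could be a small script that polls the replica set status and flags any member in ROLLBACK. The sketch below is only an illustration under some assumptions: it relies on the `replSetGetStatus` command, assumes the database user is allowed to run it on the cluster tier in use, and assumes the connection goes directly to a shard's replica set rather than through `mongos`. The function names and the URI are hypothetical. A short rollback that starts and finishes between two polls would still be missed.

```python
ROLLBACK_STATE = 9  # numeric replica set member state for ROLLBACK


def members_in_rollback(status):
    """Return the names of members that a replSetGetStatus document
    reports as being in ROLLBACK state."""
    return [
        member.get("name", "<unknown>")
        for member in status.get("members", [])
        if member.get("state") == ROLLBACK_STATE
        or member.get("stateStr") == "ROLLBACK"
    ]


def check_replica_set(uri):
    """Hypothetical helper: fetch the status from a live replica set.

    Assumes pymongo is installed and that the user behind `uri`
    has permission to run replSetGetStatus.
    """
    from pymongo import MongoClient  # third-party dependency

    client = MongoClient(uri)
    try:
        status = client.admin.command("replSetGetStatus")
    finally:
        client.close()
    return members_in_rollback(status)
```

Running `check_replica_set` on a schedule against each shard and sending an alert whenever it returns a non-empty list would at least cover the case where nobody happens to read the election email.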