Improve handling of ROLLBACK state

Recently, a member of a replica set in one of our sharded clusters entered ROLLBACK state. We only noticed it by chance, in an automated Atlas email about primary elections that listed one of the members in ROLLBACK state.

Fortunately, the writes that were rolled back were not critical, so we could accept losing them. However, this could just as easily have caused serious data loss that went unnoticed until a customer contacted us.

Atlas should handle this situation much better and in a more user-friendly way. We should have received a proper notification, and there should be a simple way to recover the rolled-back data if we choose to. Currently, if you happen to notice that a member entered ROLLBACK state, you have to contact Support and ask them to make the rollback files available, manually spin up a new cluster, import the data into it and inspect it there, and finally upload the data you need back to the original cluster. For a managed service, that is a very manual process, and I think it could be improved.
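Until Atlas sends a proper notification, the missing alert can be roughly approximated on the client side. The sketch below is a hypothetical helper (`members_in_rollback` is our own name, not an Atlas or driver API) that scans the output of the `replSetGetStatus` command for members reporting state `9` / `"ROLLBACK"`. It assumes you run `replSetGetStatus` against a member of each shard's replica set (not through `mongos`) and that your database user has the privileges to do so.

```python
from typing import Any, Dict, List

# Numeric replica set member state for ROLLBACK, as reported by replSetGetStatus.
ROLLBACK_STATE = 9


def members_in_rollback(status: Dict[str, Any]) -> List[str]:
    """Return the names (host:port) of members currently in ROLLBACK state.

    `status` is assumed to have the shape of a replSetGetStatus result:
    a dict with a "members" list, where each member has "name", "state"
    and "stateStr" fields.
    """
    rolled_back = []
    for member in status.get("members", []):
        # Accept either the numeric code or the string form of the state.
        if member.get("state") == ROLLBACK_STATE or member.get("stateStr") == "ROLLBACK":
            rolled_back.append(member.get("name", "<unknown>"))
    return rolled_back


if __name__ == "__main__":
    # Hypothetical, trimmed-down replSetGetStatus output for illustration only.
    sample_status = {
        "set": "shard0",
        "members": [
            {"name": "host-a:27017", "state": 1, "stateStr": "PRIMARY"},
            {"name": "host-b:27017", "state": 9, "stateStr": "ROLLBACK"},
            {"name": "host-c:27017", "state": 2, "stateStr": "SECONDARY"},
        ],
    }
    affected = members_in_rollback(sample_status)
    if affected:
        # A real check would send an alert here (email, pager, etc.).
        print("ROLLBACK detected on: " + ", ".join(affected))
```

With PyMongo, the input could come from `MongoClient(uri).admin.command("replSetGetStatus")`. A member may only stay in ROLLBACK state briefly, so a periodic poll like this can miss a short rollback; it is a stopgap, not a substitute for a notification from Atlas.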

Pedro shared this idea · Aug 18, 2021