Add a grace period between a failure to obtain a certificate and the cluster shutting down.
If you set up encryption at rest on Atlas using an external key provider, such as Azure Key Vault, the status of the MongoDB cluster becomes tied to Atlas being able to access the key vault.
For example, on 28 September there was an Azure incident (SM79-F88) during which around 19% of all authentication processes in Europe failed for over 2 hours. This included access to the Key Vault used by our MongoDB cluster. Because Atlas checks the keys roughly every 15 minutes, it failed to obtain the key at the beginning of this period and our cluster was stopped. The cluster was only restarted once Atlas could successfully authenticate and reach the Key Vault, which was over 2 hours later.
I completely understand the need for this type of linkage and for shutting down the cluster when encryption is no longer valid. However, it is very fragile during the type of incident described above.
What I propose is a configurable grace period between failing to obtain a key and shutting down the cluster. Had this been available to us, we could have weathered this Azure issue without any downtime, as all our other infrastructure was running and accessible.
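To make the proposal concrete, here is a minimal sketch of the grace-period logic in Python. It is only an illustration of the idea, not how Atlas works internally: the class name `KeyCheckMonitor`, the setting `grace_period_s`, and the state names are all hypothetical. The only figures taken from the incident are the roughly 15-minute check interval and the roughly 2-hour outage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeyCheckMonitor:
    """Hypothetical tracker for periodic key checks (not an Atlas API)."""

    grace_period_s: float                     # assumed configurable setting
    first_failure_at: Optional[float] = None  # start of the current failure run

    def record(self, now: float, key_accessible: bool) -> str:
        """Record one key check at time `now` (seconds) and return the state."""
        if key_accessible:
            # A successful check ends the failure run and resets the clock.
            self.first_failure_at = None
            return "running"
        if self.first_failure_at is None:
            # Remember when the current run of failures began.
            self.first_failure_at = now
        if now - self.first_failure_at >= self.grace_period_s:
            # The key has been unavailable for the whole grace period.
            return "shutdown"
        # Still within the grace period: keep serving, but raise an alert.
        return "degraded"


# Checks every 15 minutes (900 s) across a 2-hour outage, with a 3-hour grace period.
monitor = KeyCheckMonitor(grace_period_s=3 * 3600)
states = [monitor.record(now=t * 900, key_accessible=False) for t in range(9)]
recovered = monitor.record(now=9 * 900, key_accessible=True)
```

With a 3-hour grace period, every check during the 2-hour outage stays in the "degraded" state and the cluster returns to "running" once the key vault is reachable again. Setting the grace period to 0 reproduces the behaviour we experienced, where the first failed check stops the cluster.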
-
Hi Glenn,
As communicated to others in https://feedback.mongodb.com/forums/924145-atlas/suggestions/41578642-allow-customer-encryption-key-validation-time-inte
Please accept our apologies for the availability consequences of the Azure outage you mentioned. You have my commitment that we are making changes on our side so that an outage like this does *not* lead to an Atlas cluster shutdown in future; we will instead treat transient errors like this differently.
-Andrew (VP Cloud Products)