"Chaos testing" for Atlas - simulate node(s) down
The current "Test Failover" feature supports testing application and driver resiliency during elections. For additional testing, we want to be able to shut down and restart one or more nodes in a cluster. It should be possible to choose whether the entire node or just the mongod or mongos process is shut down and restarted.
We're starting to scope out the ability to test region-level outages. Expect an update later this year!
Rob Powell commented
We've had customer requests for this during POC, to test these scenarios.
It would be useful to be able to test a DR scenario such as the loss of a cloud region.
This may also be useful for 'stuck' secondaries that need rebooting or to trigger the bouncing of a mongod, which can only be done by a Cloud TS at this time.
Along the same lines, it would be nice to be able to trigger the standard processes used when deploying maintenance to Atlas clusters, to test that applications are resilient to more than just the election process that occurs during Atlas maintenance.