Record running queries at the time of a failover
I recently observed some failovers in a replica set that may have been caused by long-running queries. Because these queries ran only sporadically, they were hard to track down. And because they never finished, they were not in the logs at the time of the crash.
It would be useful to be able to capture the queries that were still running when something went wrong. I recognize that this may not always be possible, since once something has gone wrong it can be difficult for the server to do anything.
We think we diagnosed this by watching the real-time stats tab in Cloud Manager at the right moment, though we aren't certain of the diagnosis. This suggests that it could be an Atlas feature. I don't know how real-time stats works, but if it can run when the user isn't watching it, would it be possible to record slow queries at the time of a failover? I could also imagine this as a server feature, where the server makes a best-effort attempt to write long-running queries to its logs when it steps down to secondary. There may be no guarantee that this would work, but in the cases where it did, it would be useful.
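Until something like this exists, a rough client-side approximation is to poll the `currentOp` command and log any operation that has been running longer than a threshold, so that the last snapshot before a failover survives outside the server. The sketch below assumes a PyMongo `MongoClient` connected as a user allowed to run `currentOp`; the function names, the 30-second threshold, and the 5-second poll interval are illustrative choices, not part of any existing MongoDB feature.

```python
import json
import logging
import time

log = logging.getLogger("long_running_ops")


def long_running_ops(client, min_secs=30):
    """Return active operations that have run for at least min_secs.

    `client` is assumed to be a pymongo.MongoClient (hypothetical setup).
    """
    reply = client.admin.command(
        {"currentOp": 1, "active": True, "secs_running": {"$gte": min_secs}}
    )
    return reply.get("inprog", [])


def summarize(op):
    """Keep only the fields that help identify the query afterwards."""
    return {
        "opid": op.get("opid"),
        "secs_running": op.get("secs_running"),
        "ns": op.get("ns"),
        "op": op.get("op"),
        "command": op.get("command"),
    }


def record_once(client, min_secs=30):
    """Log one snapshot of long-running operations and return it."""
    ops = [summarize(op) for op in long_running_ops(client, min_secs)]
    for op in ops:
        # default=str handles BSON values that json cannot encode natively.
        log.warning("long-running op: %s", json.dumps(op, default=str))
    return ops


def watch(client, min_secs=30, interval_secs=5):
    """Poll forever; a failed poll (e.g. mid-failover) is logged, not fatal."""
    while True:
        try:
            record_once(client, min_secs)
        except Exception:
            log.exception("currentOp poll failed")
        time.sleep(interval_secs)
```

Calling `watch(MongoClient(uri))` from a small sidecar process would leave a trail of the long-running operations seen just before the primary stepped down. This is only best effort: an operation that starts and brings the node down between two polls will still be missed, which is why a server-side or Atlas-side feature would be preferable.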