Trigger event queue metrics
We recently ran into the problem that our trigger was suspended because the resume token was no longer present in the oplog.
The cause was a combination of queued-up trigger events and a short oplog window. The suggested solution was twofold: improve the trigger's performance by parallelizing it, and increase the oplog size to get a larger oplog window.
Unfortunately, we can't verify whether these changes actually help: we can monitor the oplog window in the metrics, but there are no metrics for the trigger queue itself. As a result, we can't see how close the oldest queued event comes to the edge of our oplog window, so we can't predict a potential suspension in the future.
Our suggestion is to include some metrics for the trigger queue. Examples could be the number of events in the queue or the age of the oldest event in it.
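To illustrate how we would use such a metric, here is a minimal sketch of the check we have in mind. It assumes an "oldest queued event" timestamp were exposed, which is not the case today. The function names, the 20% threshold, and the sample values are all hypothetical and only show the arithmetic.

```python
from datetime import datetime, timedelta, timezone


def suspension_headroom(oplog_window: timedelta,
                        oldest_event_time: datetime,
                        now: datetime) -> timedelta:
    """Time left before the oldest queued event falls out of the oplog window.

    oldest_event_time is the hypothetical metric requested in this post.
    """
    oldest_event_age = now - oldest_event_time
    return oplog_window - oldest_event_age


def is_at_risk(headroom: timedelta,
               oplog_window: timedelta,
               threshold: float = 0.2) -> bool:
    """Flag the trigger when less than `threshold` of the window remains.

    The 20% default is an arbitrary example value, not a recommendation.
    """
    return headroom < oplog_window * threshold


if __name__ == "__main__":
    # Made-up sample values: a 24 h oplog window and an event queued 20 h ago.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    window = timedelta(hours=24)
    oldest = now - timedelta(hours=20)

    headroom = suspension_headroom(window, oldest, now)
    print("headroom:", headroom, "at risk:", is_at_risk(headroom, window))
```

With a queue-age metric like this, an alert on shrinking headroom would warn us before the resume token leaves the oplog, instead of after the trigger has already been suspended.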