Hi, we have tried the beta of the Autoscaling feature and have some thoughts on how to make it better. In its current setup it's not really suitable for our production workloads. Here are our suggestions:
1. Define separate scaling steps
At the moment the scaling step is always 1, e.g. going from M10 -> M20. That is not really suitable for burst loads, where going one step up might not be enough. The same goes for rapid scaling down. For example:
Scale range = M10 - M50
Scale step up 4 = (M10 -> M50)
Scale step down 2 = (M50 -> M30 -> M10)
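To make the example above concrete, here is a minimal sketch of how a configurable step size could walk a tier ladder. The `TIERS` list and the `scale_path` function are hypothetical names used for illustration only; they are not part of any Atlas API.

```python
# Hypothetical tier ladder; only the M10..M50 range comes from the example above.
TIERS = ["M10", "M20", "M30", "M40", "M50"]


def scale_path(start, direction, step, min_tier="M10", max_tier="M50"):
    """Return the tiers visited when scaling from `start` to the range limit."""
    lo, hi = TIERS.index(min_tier), TIERS.index(max_tier)
    i = TIERS.index(start)
    path = [TIERS[i]]
    while True:
        nxt = i + step if direction == "up" else i - step
        nxt = max(lo, min(hi, nxt))  # clamp to the configured scale range
        if nxt == i:  # already at the limit, nothing left to do
            break
        i = nxt
        path.append(TIERS[i])
    return path


print(scale_path("M10", "up", 4))    # step up 4
print(scale_path("M50", "down", 2))  # step down 2
```

With a step of 1 the same function reproduces what we see today: every tier in the range is visited one at a time.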
2. Define custom timescale
It seems that the current setup starts scaling down after 72h and then repeats every 24h.
Our system can be scaled down much more rapidly. When our burst load goes away, it is gone for a few days, so we know we can start scaling down after 12h and repeat every 6h.
With the current setup it takes 6 days to scale down from M50 -> M10.
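The 6-day figure follows from four single-tier steps (M50 -> M40 -> M30 -> M20 -> M10). A small sketch of the arithmetic, assuming the 72h/24h timing we observed is accurate; `hours_to_scale_down` is a hypothetical helper:

```python
def hours_to_scale_down(steps, first_delay_h, repeat_h):
    """Hours until `steps` single-tier scale-downs have completed."""
    if steps <= 0:
        return 0
    # The first step waits for the initial delay, every later step for the repeat interval.
    return first_delay_h + (steps - 1) * repeat_h


steps = 4  # M50 -> M40 -> M30 -> M20 -> M10
current = hours_to_scale_down(steps, first_delay_h=72, repeat_h=24)
proposed = hours_to_scale_down(steps, first_delay_h=12, repeat_h=6)
print(current, current / 24)  # 144 hours, i.e. 6.0 days
print(proposed)               # 30 hours
```

With the 12h/6h timing we are asking for, the same scale-down would finish in 30 hours, even with a step of 1.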
3. Define custom scaling metrics and thresholds
It seems that the current system does not take the number of connected clients into account as a scaling metric.
- When connecting from cloud functions it's easy to have a lot of connections which are not draining CPU or memory. When we are scaled down to M10, that limit is only 200.
- Additionally, we would like to scale up when our CPU usage is >50%, which is not possible at the moment.
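As a rough sketch of what we mean by custom metrics and thresholds: a rule that triggers a scale-up on either CPU or connection usage. `ScaleUpRule` and `should_scale_up` are hypothetical names, and the 80% connection threshold is just an assumed value; only the 50% CPU threshold and the 200-connection limit on M10 come from our setup.

```python
from dataclasses import dataclass


@dataclass
class ScaleUpRule:
    cpu_threshold: float         # fraction of CPU in use, e.g. 0.50
    connection_threshold: float  # fraction of the tier's connection limit (assumed value)


def should_scale_up(cpu, connections, connection_limit, rule):
    """Scale up when either CPU or connection usage crosses its threshold."""
    cpu_hot = cpu > rule.cpu_threshold
    conn_hot = connections / connection_limit > rule.connection_threshold
    return cpu_hot or conn_hot


rule = ScaleUpRule(cpu_threshold=0.50, connection_threshold=0.80)
# Idle CPU but 180 of the 200 connections allowed on M10 are in use.
print(should_scale_up(cpu=0.10, connections=180, connection_limit=200, rule=rule))
```

The point is that the connection count alone should be able to trigger a scale-up, even when CPU and memory look idle.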
Nice to have:
4. Time based scaling events
Scale up/down at a specified time & day. Useful for scaling up DEV/Research environments during working hours.
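A minimal sketch of such a schedule, assuming working hours of 9:00 to 17:00 on weekdays. The `scheduled_tier` function, the hours, and the tier choices are all hypothetical and only illustrate the idea:

```python
from datetime import datetime


def scheduled_tier(now, work_tier="M30", idle_tier="M10", start_hour=9, end_hour=17):
    """Pick a tier from the day of week and hour (all defaults are assumptions)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = start_hour <= now.hour < end_hour
    return work_tier if is_weekday and in_hours else idle_tier


print(scheduled_tier(datetime(2024, 1, 8, 10, 0)))  # a Monday morning
print(scheduled_tier(datetime(2024, 1, 6, 10, 0)))  # a Saturday morning
```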
PS: writing a long post in this form is terrible
Karan Munjal commented
A must-have feature for real-time applications.
The way Atlas cluster tier auto-scaling works is that you select the maximum tier you're willing to be scaled up to. In other words what you're looking for is already there today.
Jonathan Weintraub commented
We'd use more cluster tier autoscaling if it could be turned on with a maximum permitted tier. (upper limit to prevent runaway costs beyond some acceptable X.)
This is a really good suggestion. Scaling up and down based on custom rules and times would be a huge improvement.
For example, one hour to scale up is kind of long, and for us scaling is usually driven by IO load and not so much by CPU.
For now cluster scaling seems to be more targeted at workloads with different CPU loads, but there is also definitely a need for IO based scaling.
For example, while IO load is low an M20 is sufficient; when it increases we need an M30 with provisioned IOPS at different levels. That is something we would like to automate instead of performing manually.
AWS now also offers IOPS scaling. Those interested in this feature should also vote for that one: https://feedback.mongodb.com/forums/924145-atlas/suggestions/42288652-aws-ebs-gp3-volumes
Our workload is highly predictable. We serve K-12 students. For 8 hours, M-F we have very heavy loads. Evenings, weekends, holidays and summers we have nothing.
I'd like to +1 on the time-based scaling... but only as a substitute for better granularity on perf metrics.
It would be better to trigger scale up/scale down on IOPS or Ops. Ops is the better metric b/c it does not change when the scale changes (whereas read IOPS can drop precipitously after a scale up).
For instance, scale up when OPS hit 500, 1000, 2000. To scale down, you could specify these metrics as pairs.
500,100 => scale up when hit 500, down when fall back to 100.
1000, 500 => scale up when hit 1000, down when back to 500.
2000, 1000 => scale up when hit 2000, down when back to 1000.
Or, just take single points... 100, 500, 1000, 2000 and infer the scale down from the previous up point.
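Here is a rough sketch of how those pairs could drive scaling, with a gap between the up and down thresholds so the cluster does not flap. `PAIRS`, `next_level`, and `pairs_from_points` are hypothetical names for illustration; level 0 stands for the smallest tier, and each evaluation moves at most one level.

```python
# (scale_up_at, scale_down_at) pairs on the Ops metric, one per step, as listed above.
PAIRS = [(500, 100), (1000, 500), (2000, 1000)]


def next_level(level, ops, pairs=PAIRS):
    """Return the new level (0 = smallest tier) after observing `ops`."""
    if level < len(pairs) and ops >= pairs[level][0]:
        return level + 1  # crossed the scale-up point for this level
    if level > 0 and ops <= pairs[level - 1][1]:
        return level - 1  # fell back to the scale-down point of the previous step
    return level          # inside the band, stay put


def pairs_from_points(points):
    """Infer pairs from single points, e.g. [100, 500, 1000, 2000]."""
    return [(up, down) for down, up in zip(points, points[1:])]


print(pairs_from_points([100, 500, 1000, 2000]) == PAIRS)
print(next_level(1, 300))  # between 100 and 1000, so the level is unchanged
```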
Hi Rez and Andrew,
I have sent you an email, looking forward to speaking with you.
@Marek - Thanks a lot for the feedback! Would love to chat with you more about this in person. Do you mind shooting me an email at firstname.lastname@example.org ?
Thank you so much for this detailed suggestion. We will likely reach out to you to get a chance to speak to you in more detail.
You've got great ideas here--I think you can tell that our initial auto-scaling capability is definitely very conservative and you bring up great examples of use cases that we need to better address in the future.