Hi, we have tried the beta of the auto-scaling feature. In its current setup it is not really suitable for our production workloads, so here are some thoughts on how to make it better:
1. Define separate scaling steps
At the moment the scaling step is always 1 (e.g. M10 -> M20), which is not really suitable for burst loads where going one step up might not be enough. The same goes for rapid scaling down. For example:
Scale range = M10 - M50
Scale step up 4 = (M10 -> M50)
Scale step down 2 = (M50 -> M30 -> M10)
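The asymmetric step idea above could be sketched like this (a hypothetical illustration of the proposed behaviour, not an actual Atlas setting; tier names and step sizes are from the example above):

```python
# Hypothetical asymmetric scaling steps: one scaling event may jump
# several tiers up (for bursts) and a different number of tiers down.
TIERS = ["M10", "M20", "M30", "M40", "M50"]

def next_tier(current, direction, step_up=4, step_down=2,
              min_tier="M10", max_tier="M50"):
    """Return the tier a single scaling event would move to."""
    lo, hi = TIERS.index(min_tier), TIERS.index(max_tier)
    i = TIERS.index(current)
    if direction == "up":
        i = min(i + step_up, hi)
    else:
        i = max(i - step_down, lo)
    return TIERS[i]

print(next_tier("M10", "up"))    # burst: jump straight to M50
print(next_tier("M50", "down"))  # ease back down two tiers at a time, to M30
```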
2. Define a custom timescale
It seems that the current setup starts scaling down after 72h and then repeats every 24h.
Our system can be scaled down much more rapidly: when our burst load goes away, it is gone for a few days, so we know we can start scaling down after 12h and repeat every 6h.
With the current setup it will take 6 days to scale down from M50 -> M10 (four tier steps: 72h for the first, then 24h for each of the remaining three, i.e. 144h).
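A quick check of the numbers above (the cadence values are the ones proposed in this post):

```python
def scale_down_hours(steps, first_wait_h, repeat_h):
    """Total hours to complete `steps` consecutive scale-down events."""
    return first_wait_h + (steps - 1) * repeat_h

# Current behaviour: first scale-down after 72h, then every 24h.
print(scale_down_hours(4, 72, 24) / 24)  # M50 -> M10: 6.0 days

# Proposed: first scale-down after 12h, then every 6h.
print(scale_down_hours(4, 12, 6))        # M50 -> M10: 30 hours
```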
3. Define custom scaling metrics and thresholds
It seems that the current system does not take the number of connected clients into account as a scaling metric.
When connecting from cloud functions it is easy to accumulate a lot of connections that are not draining CPU or memory, and when we are scaled down to M10 that connection limit is only 200.
Additionally, we would like to scale up when our CPU usage is above 50%, which is not possible at the moment.
Nice to have:
4. Time-based scaling events
Scale up/down at a specified time & day; useful for scaling up DEV/Research environments during working hours.
PS: writing a long post in this form is terrible
We've found that when there is a sudden burst of activity that takes Atlas to 100%, autoscaling fails because it relies on there being excess capacity to perform the scaling operation. Then you need to call MongoDB support and have an engineer intervene: exactly the situation scaling is meant to prevent. They need to change this architecture, and also make scaling more configurable so you can take what you know about your workload into account.
"Define custom scaling metrics and thresholds" is critical for us to be able to handle unpredictable data growth. The capability to set the storage threshold to a value lower than the current fixed 90% would save us from Atlas downtime caused by disks filling up.
Any update on custom and time-based auto scaling? Implementation of these features would move my team from AWS to Atlas
Transparency as a feature: as a user of MongoDB Atlas for years, I believe it's important to include in the activity feed the exact reason/criteria/trigger that made the cluster upgrade or downgrade. In our experience, without it, it can be complicated, if not impossible, to understand why the cluster was upgraded or downgraded, leaving our hands tied when trying to make the changes required for efficient use of our database infrastructure.
+1 for "Time based scaling events" as it will be useful for DEV environments
This is a much-needed feature.
Is there any timeline to implement such functionality?
It is a very important feature for us, as the current auto-scale functionality is not answering our needs.
It's in the top 5 and has been open for 2+ years: any update / ETA on this subject?
Definitely need some way of configuring auto-scaling conditions and windows:
- Atlas is billed hourly
- most use cases would need to scale up fast during peak hours, and scale down 1-2 hours after
These conditions mean that if we have 1 hour of very high traffic per day (~30 hours per month), we'd have to pay the higher tier for the full month (~720 hours), i.e. 24 times more than the hours we actually need.
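The overpay factor in that comment works out as follows (figures taken from the comment; a rough illustration, ignoring the baseline tier's own cost):

```python
# With hourly billing but no fast scale-down, a 1-hour daily peak means
# paying the peak tier for the whole month.
hours_per_month = 24 * 30   # ~720 billable hours
peak_hours = 1 * 30         # ~30 hours of actual high traffic
print(hours_per_month / peak_hours)  # 24.0x the hours actually needed
```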
Karan Munjal commented
This is a must-have feature for real-time applications.
The way Atlas cluster tier auto-scaling works is that you select the maximum tier you're willing to be scaled up to. In other words, what you're looking for is already available today.
We'd use more cluster tier autoscaling if it could be turned on with a maximum permitted tier. (upper limit to prevent runaway costs beyond some acceptable X.)
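For reference, the max-tier cap is exposed in the cluster configuration. A rough sketch of the relevant fragment of a cluster modify request body, with field names as I understand the Atlas Admin API (the instance sizes here are just examples; verify the exact field names against the current API docs before relying on this):

```json
{
  "autoScaling": {
    "compute": { "enabled": true, "scaleDownEnabled": true }
  },
  "providerSettings": {
    "autoScaling": {
      "compute": { "minInstanceSize": "M10", "maxInstanceSize": "M30" }
    }
  }
}
```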
This is a really good suggestion. Scaling up and down based on custom rules and times would be a huge improvement.
For example, one hour to scale up is kind of long, and for us scaling is usually driven by IO load and not so much by CPU.
For now, cluster scaling seems to be targeted more at workloads with varying CPU load, but there is definitely a need for IO-based scaling as well.
For example, while IO load is low an M20 is sufficient; when it increases, an M30 with provisioned IOPS at different levels is needed. That is something we would like to automate instead of doing manually.
AWS now also offers IOPS scaling. Those interested in this feature should also vote for that one: https://feedback.mongodb.com/forums/924145-atlas/suggestions/42288652-aws-ebs-gp3-volumes
Our workload is highly predictable. We serve K-12 students. For 8 hours, M-F we have very heavy loads. Evenings, weekends, holidays and summers we have nothing.
I'd like to +1 the time-based scaling... but only as a substitute for better granularity on perf metrics.
It would be better to trigger scale up/scale down on IOPS or ops. Ops is the better metric because it does not change when the tier changes (whereas read IOPS can drop precipitously after a scale-up).
For instance, scale up when ops hit 500, 1000, 2000. To control scale-down as well, you could specify these thresholds as pairs:
500, 100 => scale up when we hit 500, down when we fall back to 100.
1000, 500 => scale up when we hit 1000, down when back to 500.
2000, 1000 => scale up when we hit 2000, down when back to 1000.
Or just take single points (100, 500, 1000, 2000) and infer the scale-down threshold from the previous scale-up point.
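The paired thresholds described here are a classic hysteresis scheme. A minimal sketch of the decision logic (the rule values are the ones from this comment; this is an illustration of the proposal, not an Atlas feature):

```python
# Hysteresis rules as (scale_up_at, scale_down_at) ops thresholds.
# Scale up when ops cross a high mark; scale down only once ops fall
# back below the paired low mark, so the cluster doesn't flap.
RULES = [(500, 100), (1000, 500), (2000, 1000)]

def decide(ops, level):
    """level = number of rules already scaled up through (0..len(RULES))."""
    if level < len(RULES) and ops >= RULES[level][0]:
        return "scale_up"
    if level > 0 and ops <= RULES[level - 1][1]:
        return "scale_down"
    return "hold"

print(decide(600, 0))   # past 500 at base level -> scale_up
print(decide(300, 1))   # between 100 and 1000 -> hold
print(decide(80, 1))    # back below 100 -> scale_down
```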
Hi Rez and Andrew,
I have sent you an email, looking forward to speaking with you.
@Marek - Thanks a lot for the feedback! Would love to chat with you more about this in person. Do you mind shooting me an email at firstname.lastname@example.org ?
Thank you so much for this detailed suggestion. We will likely reach out to you to get a chance to speak to you in more detail.
You've got great ideas here--I think you can tell that our initial auto-scaling capability is definitely very conservative and you bring up great examples of use cases that we need to better address in the future.