Autoscaling improvements

Hi, we have tried the BETA of the Autoscaling feature. In its current setup it's not really suitable for our production workloads. Here are some thoughts on how to make it better:

1. Define separate scaling steps
At the moment the scaling step is always 1 (for example M10 -> M20), which is not really suitable for burst loads, where going one step up might not be enough. The same goes for rapid scaling down.

Example:
Scale range = M10 - M50
Scale step up 4 = (M10 -> M50)
Scale step down 2 = (M50 -> M30 -> M10)
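
A rough Python sketch of the example above. The tier ladder and the `scale` function are illustrative assumptions based on the M10 - M50 range in this post, not an Atlas API:

```python
# Hypothetical tier ladder, taken from the example range in the post (M10 - M50).
TIERS = ["M10", "M20", "M30", "M40", "M50"]


def scale(current: str, steps: int, min_tier: str = "M10", max_tier: str = "M50") -> str:
    """Move `steps` tiers up (positive) or down (negative), clamped to the scale range."""
    lowest, highest = TIERS.index(min_tier), TIERS.index(max_tier)
    target = TIERS.index(current) + steps
    return TIERS[max(lowest, min(highest, target))]


print(scale("M10", 4))   # step up 4:   M10 -> M50
print(scale("M50", -2))  # step down 2: M50 -> M30
print(scale("M30", -2))  # step down 2: M30 -> M10
```

Clamping to the configured range means a large step can never overshoot the maximum or minimum tier.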

2. Define custom timescale
It seems that the current setup starts scaling down after 72h and then repeats every 24h.
Our system can be scaled down much more rapidly: when our burst load goes away, it is gone for a few days, so we know we can start scaling down after 12h and repeat every 6h.
With the current setup it takes 6 days to scale down from M50 -> M10.
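
The 6-day figure can be checked with a small Python sketch. It assumes the timing described above (first scale-down after an initial delay, then one scale-down per repeat interval) and the same illustrative M10 - M50 ladder; the function name and parameters are hypothetical:

```python
import math

# Hypothetical tier ladder, taken from the example range in the post (M10 - M50).
TIERS = ["M10", "M20", "M30", "M40", "M50"]


def hours_to_scale_down(start: str, end: str, first_delay_h: int,
                        repeat_h: int, step: int = 1) -> int:
    """Hours until `end` is reached when scaling down `step` tiers per event."""
    tiers_to_drop = TIERS.index(start) - TIERS.index(end)
    if tiers_to_drop <= 0:
        return 0
    events = math.ceil(tiers_to_drop / step)
    # The first event waits for the initial delay, each later one for the repeat interval.
    return first_delay_h + (events - 1) * repeat_h


print(hours_to_scale_down("M50", "M10", 72, 24))  # 144 hours = 6 days
print(hours_to_scale_down("M50", "M10", 12, 6))   # 30 hours
```

With the requested 12h / 6h timing, the same M50 -> M10 path would take 30 hours instead of 144.
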
3. Define custom scaling metrics and thresholds
It seems that the current system does not take the number of connected clients as a scaling metric.
When connecting from cloud functions it's easy to have a lot of connections which are not draining CPU or memory; when we are scaled down to M10, that limit is only 200.
Additionally, we would like to scale up when our CPU usage is >50%, which is not possible at the moment.
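
A minimal Python sketch of what such a rule could look like. The 200-connection limit and the 50% CPU threshold come from this post; the `Metrics` type, the `should_scale_up` function and the 80% connection ratio are assumptions for illustration only:

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    cpu_percent: float   # current CPU usage, 0-100
    connections: int     # currently connected clients


def should_scale_up(m: Metrics, cpu_threshold: float = 50.0,
                    connection_limit: int = 200,
                    connection_ratio: float = 0.8) -> bool:
    """Scale up on high CPU OR when connections approach the tier's limit."""
    # connection_ratio is a hypothetical safety margin, not an Atlas setting.
    near_connection_limit = m.connections >= connection_limit * connection_ratio
    return m.cpu_percent > cpu_threshold or near_connection_limit


print(should_scale_up(Metrics(cpu_percent=10.0, connections=180)))  # True
print(should_scale_up(Metrics(cpu_percent=30.0, connections=50)))   # False
```

The first call shows the cloud-functions case: CPU is idle, but the connection count alone triggers the scale-up.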

Nice to have:
4. Time based scaling events
Scale up/down at a specified time and day; useful for scaling up DEV/Research environments during working hours.
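
As a sketch, a time-based rule for a DEV environment might look like this in Python. The working hours (Mon-Fri, 09:00-18:00), the tier names and the `scheduled_tier` function are all assumptions for illustration:

```python
from datetime import datetime


def scheduled_tier(now: datetime, work_tier: str = "M30", idle_tier: str = "M10",
                   start_hour: int = 9, end_hour: int = 18) -> str:
    """Return the larger tier during working hours on weekdays, else the smaller one."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_working_hours = start_hour <= now.hour < end_hour
    return work_tier if is_weekday and in_working_hours else idle_tier


print(scheduled_tier(datetime(2019, 10, 16, 10, 0)))  # Wednesday morning -> M30
print(scheduled_tier(datetime(2019, 10, 19, 10, 0)))  # Saturday morning  -> M10
```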

PS: writing a long post in this form is terrible

156 votes

Marek shared this idea · Oct 16, 2019

Jon Sapyta commented · October 25, 2023 11:53 AM

We've found that when a sudden burst of activity takes Atlas to 100%, autoscaling fails because it relies on there being excess capacity to perform the scaling. Then you need to call Mongo support and have an engineer intervene, which is exactly the situation that scaling is meant to prevent. They need to change this architecture, and also make scaling more configurable so you can take into account what you know about your workload.

Anders commented · June 21, 2023 6:05 AM

"Define custom scaling metrics and thresholds" is critical for us to be able to handle unpredictable data growth. Being able to set the storage threshold to a lower value than the current fixed 90% would save us from Atlas downtime caused by disks getting full.

Paul commented · March 21, 2023 3:45 PM

Any update on custom and time-based auto scaling? Implementation of these features would move my team from AWS to Atlas

Hernán commented · February 28, 2023 8:22 AM

Transparency as a feature: As someone who has used Mongo Atlas for years, I believe it's important to include in the activity feed the exact reason/criteria/trigger that made the cluster upgrade or downgrade. Without it, in our experience, it can be complicated, if not impossible, to understand why the cluster was upgraded or downgraded, which leaves our hands tied when we try to manage the cluster and make the changes required to use our database infrastructure efficiently.

Daniele commented · February 25, 2023 6:55 AM

+1 for "Time based scaling events" as it will be useful for DEV environments

Thirumalaisamy commented · August 15, 2022 7:13 PM

It is a needed one

Amit commented · January 13, 2022 4:55 AM

Is there any timeline for implementing this functionality?
It is a very important feature for us, as the current auto-scale functionality does not meet our needs.

Julien commented · January 7, 2022 10:56 AM

It's in the top 5 and has been open for 2+ years: any update / ETA on this subject?

We definitely need some way of configuring auto-scaling conditions and windows:
- Atlas is billed hourly
- most use cases would need to upscale fast during peak hours, and downscale 1-2 hours after

These conditions mean that if we have 1 hour of very high traffic per day (~30 hours per month), we'd have to pay for the full month (~720 hours), that is, 24 times more.
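
The 24x figure follows from simple arithmetic, sketched here in Python. The function name is hypothetical, and the comparison only counts hours billed at the high tier; it ignores what the smaller tier would cost during off-peak hours:

```python
def billed_hours_ratio(peak_hours_per_day: float, days_per_month: int = 30) -> float:
    """Ratio of full-month hours to the peak hours that actually need the high tier."""
    hours_in_month = 24 * days_per_month              # ~720 hours
    peak_hours = peak_hours_per_day * days_per_month  # ~30 hours for 1 hour per day
    return hours_in_month / peak_hours


print(billed_hours_ratio(1))  # 24.0
```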

Karan Munjal commented · June 9, 2021 7:11 AM

A must-have feature for real-time applications.

Andrew Davidson (VP, Cloud Products, MongoDB) commented · May 3, 2021 9:29 AM

Hi Jonathan,

The way Atlas cluster tier auto-scaling works is that you select the maximum tier you're willing to be scaled up to. In other words what you're looking for is already there today.

Cheers
-Andrew

Jonathan commented · April 30, 2021 3:37 PM

We'd use more cluster tier autoscaling if it could be turned on with a maximum permitted tier. (upper limit to prevent runaway costs beyond some acceptable X.)

Jan commented · February 26, 2021 7:34 AM

This is a really good suggestion. Scaling up and down based on custom rules and times would be a huge improvement.

For example, one hour to scale up is kind of long, and for us it is usually based on IO load and not so much CPU.

For now cluster scaling seems to be more targeted at workloads with different CPU loads, but there is also definitely a need for IO based scaling.

For example, while IO load is low an M20 is sufficient; when it increases, we need an M30 with provisioned IOPS at different levels. That is something we would like to automate instead of performing manually.

Sinan commented · January 13, 2021 6:53 AM

AWS now also offers IOPS scaling. Those interested in this feature should also vote for that one: https://feedback.mongodb.com/forums/924145-atlas/suggestions/42288652-aws-ebs-gp3-volumes

Omar commented · September 16, 2020 4:27 AM

THIS!

Eric commented · February 25, 2020 1:08 PM

Our workload is highly predictable. We serve K-12 students. For 8 hours, M-F we have very heavy loads. Evenings, weekends, holidays and summers we have nothing.

I'd like to +1 on the time-based scaling... but only as a substitute for better granularity on perf metrics.

It would be better to trigger scale up/scale down on IOPS or Ops. Ops is the better metric b/c it does not change when the scale changes (whereas read IOPS can drop precipitously after a scale up).

For instance, scale up when OPS hit 500, 1000, 2000. To scale down, you could specify these metrics as pairs.

500, 100 => scale up when hit 500, down when fall back to 100.
1000, 500 => scale up when hit 1000, down when back to 500.
2000, 1000 => scale up when hit 2000, down when back to 1000.

Or just take single points (100, 500, 1000, 2000) and infer each scale-down point from the previous scale-up point.
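
The threshold pairs above amount to a hysteresis rule, which could be sketched in Python as follows. The function names and the idea of tracking a numeric "level" (how many scale-ups are currently applied) are assumptions for illustration, not an Atlas feature:

```python
def bands_from_points(points: list[int]) -> list[tuple[int, int]]:
    """Infer (scale_up_at, scale_down_at) pairs from single points."""
    return list(zip(points[1:], points[:-1]))


def next_level(level: int, ops: int, bands: list[tuple[int, int]]) -> int:
    """Return the new level: 0 is the base tier, len(bands) is the top tier."""
    if level < len(bands) and ops >= bands[level][0]:
        return level + 1   # hit the scale-up threshold of the next band
    if level > 0 and ops <= bands[level - 1][1]:
        return level - 1   # fell back to the scale-down threshold of the current band
    return level           # in between: stay put, which avoids flapping


bands = bands_from_points([100, 500, 1000, 2000])
print(bands)                      # [(500, 100), (1000, 500), (2000, 1000)]
print(next_level(0, 600, bands))  # 1 (scaled up at >= 500)
print(next_level(1, 300, bands))  # 1 (between 100 and 1000: no change)
print(next_level(1, 100, bands))  # 0 (back down at <= 100)
```

Keeping the scale-down point well below the scale-up point is what stops the cluster from bouncing between tiers when load hovers around a single threshold.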

Marek commented · October 18, 2019 11:34 AM

Hi Rez and Andrew,
I have sent you an email, looking forward to speaking with you.

Rez (Admin, MongoDB) commented · October 17, 2019 10:34 AM

@Marek - Thanks a lot for the feedback! Would love to chat with you more about this in person. Do you mind shooting me an email at rez@mongodb.com?

Andrew Davidson (VP, Cloud Products, MongoDB) commented · October 17, 2019 9:37 AM

Hi Marek,

Thank you so much for this detailed suggestion. We will likely reach out to you to get a chance to speak to you in more detail.

You've got great ideas here--I think you can tell that our initial auto-scaling capability is definitely very conservative and you bring up great examples of use cases that we need to better address in the future.

-Andrew
