Proposal for an Optimized Load Mechanism in MongoDB Atlas via Spark + Databricks
I currently use the MongoDB Spark connector in Databricks jobs to load data into MongoDB Atlas. To handle large volumes and minimize the impact of writes on active collections, I developed a mechanism that significantly speeds up ingestion while maintaining consistency and query performance.
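For context, the baseline load today is a plain append with the connector, written straight into the live, indexed collection. A minimal sketch of that baseline follows; the URI, database, collection, and source table names are placeholders, and df stands for whatever DataFrame the Databricks job prepares:

```python
# Baseline (current approach): append directly into the indexed, actively
# queried collection with the MongoDB Spark Connector 10.x ("mongodb" format).
# The URI, database, collection, and source table names are placeholders.
df = spark.table("<source_table>")  # placeholder: data prepared earlier in the job

(df.write.format("mongodb")
    .mode("append")
    .option("connection.uri", "mongodb+srv://<user>:<password>@<cluster>/")
    .option("database", "analytics")   # placeholder database
    .option("collection", "events")    # the live ("hot") collection
    .save())
```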
The strategy involves the following steps (a pymongo + PySpark sketch follows the list):
- Creating a temporary collection, cloning the structure of the original collection without indexes.
- Inserting data directly into the temporary collection, avoiding the write overhead caused by indexes.
- Recreating the original collection’s indexes on the temporary collection once the load completes.
- Swapping the collections, promoting the new one as the “hot” collection and dropping the previous version.
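Below is a minimal sketch of the four steps, assuming pymongo and the MongoDB Spark Connector 10.x are available on the Databricks cluster; MONGO_URI, the database and collection names, and df (the DataFrame to load) are placeholders, and copying collection options such as validators or collation is omitted for brevity:

```python
# Sketch of the load-and-swap flow (pymongo + MongoDB Spark Connector 10.x).
# MONGO_URI, DB_NAME, COLL_NAME, and df are placeholders/assumptions.
from pymongo import IndexModel, MongoClient

MONGO_URI = "mongodb+srv://<user>:<password>@<cluster>/"  # placeholder URI
DB_NAME = "analytics"                                     # placeholder database
COLL_NAME = "events"                                      # the live ("hot") collection
TMP_NAME = f"{COLL_NAME}_staging"                         # temporary load target

client = MongoClient(MONGO_URI)
db = client[DB_NAME]

# 1. Create the temporary collection without indexes. MongoDB collections are
#    schemaless, so "cloning the structure" amounts to creating an empty
#    collection (copying validators/collation is left out here).
if TMP_NAME in db.list_collection_names():
    db.drop_collection(TMP_NAME)
db.create_collection(TMP_NAME)

# 2. Bulk-load into the temporary collection, so the write pays no per-document
#    index maintenance and never touches the hot collection.
(df.write.format("mongodb")
    .mode("append")
    .option("connection.uri", MONGO_URI)
    .option("database", DB_NAME)
    .option("collection", TMP_NAME)
    .save())

# 3. Recreate the hot collection's secondary indexes on the staging collection.
index_models = [
    IndexModel(
        list(spec["key"].items()),
        **{k: v for k, v in spec.items() if k not in ("key", "v", "ns")},
    )
    for spec in db[COLL_NAME].list_indexes()
    if spec["name"] != "_id_"  # the _id index is created automatically
]
if index_models:
    db[TMP_NAME].create_indexes(index_models)

# 4. Swap: renameCollection with dropTarget=True promotes the staging collection
#    and drops the previous hot collection in one server-side operation.
client.admin.command(
    "renameCollection", f"{DB_NAME}.{TMP_NAME}",
    to=f"{DB_NAME}.{COLL_NAME}", dropTarget=True,
)
```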
This process reduces load time, avoids write contention on the live collection, and keeps reads responsive during ingestion, since queries continue to hit the untouched “hot” collection until the swap. It also gives greater control over atomicity and consistency: readers see either the old dataset or the fully loaded, fully indexed new one, never an intermediate state.
My suggestion to the MongoDB community is to incorporate this mechanism as a new write mode in the PySpark API, for example: insert_swap. This mode could encapsulate the entire logic described above, making the process more accessible, reusable, and standardized for data engineers working with Spark and MongoDB Atlas.
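To make the proposal concrete, here is a purely hypothetical example of how such a mode might surface in PySpark. Neither insert_swap nor the writeMode option used below exists in the connector today; both are invented here only to illustrate the suggested shape of the API:

```python
# Hypothetical API only: "insert_swap" is a proposal, not an existing connector mode,
# and the option name "writeMode" is invented purely for illustration.
(df.write.format("mongodb")
    .mode("append")
    .option("connection.uri", "mongodb+srv://<user>:<password>@<cluster>/")  # placeholder
    .option("database", "analytics")      # placeholder database
    .option("collection", "events")       # the hot collection to be swapped in place
    .option("writeMode", "insert_swap")   # proposed: stage without indexes, reindex, rename-swap
    .save())
```

Under such a mode, the connector would handle the staging, re-indexing, and rename-with-drop steps described above, so individual jobs would not need to reimplement them.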
Important note: For this type of implementation to work, the user must have sufficient privileges to perform operations such as collection creation, index management, and atomic renaming. Proper access control and permission handling are essential to ensure the reliability and security of the process.
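As one possible safeguard, a job could fail fast by inspecting the connected user’s privileges before starting the load. The sketch below uses the connectionStatus command via pymongo; the set of required actions is an assumed minimum for the stage/index/swap flow, and the check is deliberately coarse in that it ignores which resource each privilege is scoped to:

```python
# Coarse pre-flight privilege check via the connectionStatus command (pymongo).
# REQUIRED_ACTIONS is an assumed minimum for the stage/index/swap flow;
# URI and database name are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder
db = client["analytics"]                                            # placeholder database

REQUIRED_ACTIONS = {
    "insert",                  # bulk load into the staging collection
    "createCollection",        # create the staging collection
    "createIndex",             # rebuild indexes after the load
    "dropCollection",          # dropTarget during the swap
    "renameCollectionSameDB",  # the atomic rename itself
}

status = db.command("connectionStatus", showPrivileges=True)
granted = {
    action
    for priv in status["authInfo"]["authenticatedUserPrivileges"]
    for action in priv["actions"]
}

missing = REQUIRED_ACTIONS - granted
if missing:
    raise PermissionError(f"Swap load blocked, missing actions: {sorted(missing)}")
```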
Additionally, I have developed a production-ready module that implements this mechanism. I’d be happy to share it with the community if needed, to help validate and evolve the idea collaboratively.
