Connectors (BI, Kafka, Spark)
Proposal for an Optimized Load Mechanism in MongoDB Atlas via Spark + Databricks
I currently use the Spark connector in Databricks jobs to load data into MongoDB Atlas. To handle large volumes and minimize the impact of writing to active collections, I developed a mechanism that significantly accelerates ingestion while maintaining consistency and query performance.
The strategy involves (see the sketch after the list):
- Creating a temporary collection, cloning the structure of the original collection without indexes.
- Inserting data directly into the temporary collection, avoiding the write overhead caused by indexes.
- Recreating the indexes on the temporary collection after the load completes.
- Swapping collections, promoting the new collection as the “hot” one and deactivating the previous version (via…
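A minimal sketch of this flow, assuming a PySpark job with pymongo available and the v10.x MongoDB Spark connector; the URI, database, collection, and source-path names are placeholders, not details from the original post:

```python
# Hedged sketch: temp-collection load, index rebuild, then swap.
# All names and the source path are assumptions for illustration only.
from pymongo import MongoClient
from pyspark.sql import SparkSession

MONGO_URI = "mongodb+srv://user:pass@cluster.example.mongodb.net"  # placeholder
DB, HOT, TMP = "analytics", "events", "events_tmp"

spark = SparkSession.builder.getOrCreate()
client = MongoClient(MONGO_URI)
db = client[DB]

# 1. Create the temporary collection with no secondary indexes.
db.drop_collection(TMP)
db.create_collection(TMP)

# 2. Bulk-insert via the Spark connector; no index maintenance during the write.
df = spark.read.parquet("/mnt/staging/events")  # hypothetical source
(df.write
   .format("mongodb")
   .option("connection.uri", MONGO_URI)
   .option("database", DB)
   .option("collection", TMP)
   .mode("append")
   .save())

# 3. Recreate the hot collection's indexes on the temporary collection.
for name, spec in db[HOT].index_information().items():
    if name != "_id_":
        db[TMP].create_index(spec["key"])

# 4. Swap: rename the temp collection over the hot one (needs admin privileges).
client.admin.command("renameCollection", f"{DB}.{TMP}",
                     to=f"{DB}.{HOT}", dropTarget=True)
```

Because index maintenance is deferred until after the bulk insert, the write path stays cheap, and the rename makes the promotion of the new collection effectively atomic for readers.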
4 votes
Mongo Spark Connector Option to refresh the Schema
This is with regard to the ticket we raised: https://support.mongodb.com/case/01352011
In the current Spark connector, automatic schema inference is enabled by setting the option "stream.publish.full.document.only" to "true". Once this is configured, no explicit schema needs to be passed; the connector infers the schema from the first document it streams and applies that schema to every subsequent document from the collection.
The issue is that when new fields are added to the source collection, the stream does not pick up the change and keeps using the old schema.
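To make the current behaviour and the usual workaround concrete, here is a hedged sketch assuming the v10.x connector (where the documented option name is "change.stream.publish.full.document.only", which appears to be the option referred to above); the URI, database, collection, and field names are placeholders:

```python
# Hedged sketch of the behaviour described above; names are placeholders and
# the exact option spelling may differ by connector version.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# With full-document publishing enabled and no explicit schema, the connector
# infers the schema from the first streamed document and keeps applying it,
# so fields added to the source collection later are dropped from the stream.
inferred = (spark.readStream
    .format("mongodb")
    .option("connection.uri", "mongodb+srv://cluster.example.net")  # placeholder
    .option("database", "sales")
    .option("collection", "orders")
    .option("change.stream.publish.full.document.only", "true")
    .load())

# Current workaround: pass an explicit schema that already includes the fields
# expected to appear later, instead of relying on inference.
explicit_schema = StructType([
    StructField("_id", StringType()),
    StructField("status", StringType()),
    StructField("new_field", StringType()),  # hypothetical field added later
])

explicit = (spark.readStream
    .format("mongodb")
    .schema(explicit_schema)
    .option("connection.uri", "mongodb+srv://cluster.example.net")
    .option("database", "sales")
    .option("collection", "orders")
    .option("change.stream.publish.full.document.only", "true")
    .load())
```

Maintaining such explicit schemas by hand is exactly the burden the requested refresh option would remove.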
We should either design…
4 votes