Data Federation and Data Lake

How can we improve Data Federation and Data Lake?

Specify when you'd like Online Archive to migrate data

I'd like the ability to specify when a migration from my Atlas cluster to my Online Archive takes place.

2 votes

0 comments

Connect Data Lake to Self Managed MongoDB Clusters

Connect your Atlas Data Lake to self-managed MongoDB clusters, whether they run in private data centers, in a public cloud, or on a local host.

(This would also allow certain cloud services like Charts and a component of Realm Scheduled Triggers to work with these clusters.)

2 votes

0 comments · Infrastructure Options

M0 Support for Evaluation

Please provide M0 support for evaluation purposes.

2 votes

0 comments · Infrastructure Options

The "Date field to archive on" option under the Archiving Rule tab should also accept dates in timestamp format.

The "Date field to archive on" option under the Archiving Rule tab in Online Archive should also accept a date field in timestamp format, rather than only a field in date format.

2 votes

0 comments · Query Functionality

Add support for XML

I would like to be able to query XML files using my Atlas Data Lake.

2 votes

0 comments · File Formats

Write a "_SUCCESS" file when data export finishes

We use MongoDB to store time-series data and export it incrementally, on a daily basis, to S3 as Parquet via Data Federation. The data is relatively large, and the export duration varies from day to day, so it is hard for downstream services to know when an export has completed. Sometimes a downstream service starts reading the Parquet files while MongoDB is still writing them, which causes partial extraction. Normally, a big data job creates a flag file, such as _SUCCESS, to indicate that it has finished writing the dataset. This file serves as a marker that all tasks associated with the job finished successfully and that the data files in the directory are complete and consistent. Could you consider adding such a feature?

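// Placeholders: bucketName, fileName, collName, matchStage, and db are defined by the caller.
// Run this inside an async function (or an ES module) so that await is valid.
const outStage = {
  "$out": {
    "s3": {
      "bucket": bucketName,
      "filename": fileName,
      "format": {
        "name": "parquet",
        "maxFileSize": "1GB",
        "maxRowGroupSize": "128MB"
      }
    }
  }
};
const coll = db.collection(collName);
// Submit the export as a background aggregation.
await coll.aggregate([matchStage, outStage], { background: true }).toArray();
console.log("Job Submitted");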
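
The marker convention described above can be sketched from the consumer side. This is only an illustration, not an Atlas feature: the helper names `isExportComplete` and `readableParquetFiles` and the key layout are hypothetical, and the list of object keys is assumed to come from whatever S3 client the downstream service already uses.

```javascript
// Hypothetical helper: decide whether an export directory is safe to read.
// `keys` is a plain array of object keys already listed from the bucket.
function isExportComplete(keys, prefix, marker = "_SUCCESS") {
  const normalized = prefix.endsWith("/") ? prefix : prefix + "/";
  return keys.includes(normalized + marker);
}

// Return the Parquet files under the prefix, or an empty list
// when the marker file is absent (export possibly still running).
function readableParquetFiles(keys, prefix) {
  if (!isExportComplete(keys, prefix)) {
    return [];
  }
  const normalized = prefix.endsWith("/") ? prefix : prefix + "/";
  return keys.filter(
    (key) => key.startsWith(normalized) && key.endsWith(".parquet")
  );
}
```

With a marker in place, a downstream job would simply skip any directory for which `readableParquetFiles` returns an empty list and retry later.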

1 vote

0 comments · Automation

Allow a single timestamp field to be split by Year Month Day and Hour for folders instead of just one field like Year in filepath for Azure

I checked internally, and it has been confirmed that an attribute can only appear once in a template. If Atlas Data Federation (ADF) has a template like the one you are using, it wouldn't know what value to assign to StatusDatetime because it's being assigned multiple values. Unfortunately, ADF doesn't support defining a single field value across multiple segments of the path. Instead, each of those segments should be different attributes.

{
  "path": "/HistoryCollection/{StatusDatetime isodate:Year}/{StatusDatetime isodate:Month}/{StatusDatetime isodate:Day}/{StatusDatetime isodate:Hour}/{RecordSource string}/{Status string}/*",
  "storeName": "sampledatabase"
}

We would like the store we are creating as an archive to be queryable by StatusDatetime, RecordSource, and Status, so that it matches the queries we run against the live collections under Federation, instead of extracting Year, Month, Day, and Hour fields that don't exist in the live collection.
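
To illustrate why one field is enough to determine all four folder levels, here is a minimal sketch that derives the path segments from a single timestamp. The function name `archivePath`, the use of UTC, and the zero-padded folder names are assumptions made for the example; this is not how Atlas Data Federation resolves templates.

```javascript
// Hypothetical helper: derive Year/Month/Day/Hour path segments
// from one timestamp, followed by the two string partitions.
function archivePath(statusDatetime, recordSource, status) {
  const pad = (n) => String(n).padStart(2, "0");
  const segments = [
    "HistoryCollection",
    statusDatetime.getUTCFullYear(),
    pad(statusDatetime.getUTCMonth() + 1), // getUTCMonth() is zero-based
    pad(statusDatetime.getUTCDate()),
    pad(statusDatetime.getUTCHours()),
    recordSource,
    status,
  ];
  return "/" + segments.join("/") + "/";
}

console.log(archivePath(new Date("2024-05-13T12:25:50Z"), "web", "Active"));
// prints "/HistoryCollection/2024/05/13/12/web/Active/"
```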

1 vote

0 comments · Query Functionality

Background aggregation queries return a query ID or correlation ID to be able to quickly poll for completion

1 vote

0 comments · Query Functionality

Support Azure Data Federation private endpoint

Now that Azure Blob Storage is supported for Data Federation, it would be great to have a private endpoint connection to the storage account.

1 vote

0 comments · Connectors

Include support for Federated Database Instance with Data API services.

1 vote

0 comments

AWS IAM AuthN for Atlas SQL

Support the AWS IAM authentication mechanism in the JDBC and ODBC drivers (Atlas SQL).

1 vote

0 comments · Connectors

Create a read/write Data Federation connection string

Some customers need a single connection string that covers both the cluster and Online Archive, with the ability to write to the cluster only.

So far, the only option is to use more than one connection string in the application.
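
The current workaround can be sketched as a thin router in the application. Everything here is hypothetical: `createRouter`, `stubClient`, and the two-method client interface are stand-ins so the example runs without a database, and a real application would wrap two driver clients, one per connection string.

```javascript
// Hypothetical router: reads go to the federated endpoint (cluster plus archive),
// writes go to the cluster only.
function createRouter(clusterClient, federatedClient) {
  return {
    find: (collection, filter) => federatedClient.find(collection, filter),
    insertOne: (collection, doc) => clusterClient.insertOne(collection, doc),
  };
}

// Stub client that records which endpoint handled each call.
function stubClient(name, log) {
  return {
    find: (collection, filter) => {
      log.push(`${name}:find:${collection}`);
      return [];
    },
    insertOne: (collection, doc) => {
      log.push(`${name}:insertOne:${collection}`);
      return { acknowledged: true };
    },
  };
}

const calls = [];
const router = createRouter(stubClient("cluster", calls), stubClient("federated", calls));
router.insertOne("orders", { total: 10 });
router.find("orders", {});
console.log(calls);
```

A single read/write connection string, as requested, would make this routing layer unnecessary.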

1 vote

0 comments · Query Functionality

Implement a feature to track data download volume per DB user

In order to enhance data security and prevent unauthorized data exfiltration, our team proposes the implementation of a metric within MongoDB Atlas that allows administrators to monitor and measure the amount of data downloaded by each database user over a specified period. This feature would provide critical insights into user behavior, helping to identify unusual data access patterns or potential data breaches. By tracking network data usage at the user level, we can more effectively audit data access and transfer, ensuring that data is used appropriately and in compliance with organizational data governance policies. This granularity in monitoring would be a significant step forward in data management and security within MongoDB Atlas, offering a proactive tool for administrators in safeguarding sensitive data.

1 vote

0 comments · Reporting

Combine data lake snapshots into a single federated collection

A common use case for data analytics is to analyse how your data evolves over time.
For example, imagine you have an e-commerce database whose product prices change every day. You may store only the price in your database, but you'd like to make a chart that shows how your product prices evolve over time (price on the y axis, time on the x axis).

It is possible today to make this happen with the combination of Data Lake and Data Federation, but the Storage Configuration JSON needs to be manually updated like this:

{
  "databases": [
    {
      "collections": [
        {
          "name": "collectionName",
          "dataSources": [
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230814T050424Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230813T050415Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230812T050424Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            ...
          ]
        }
      ]
    }
  ],
  ...
}

The Data Federation configuration JSON will grow to thousands of lines and needs to be maintained daily, perhaps with a script and the API (3 lines of JSON per collection × 365 snapshots a year × 20 collections ≈ 22,000 lines of JSON a year).
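
The "script" option mentioned above might look roughly like the following sketch. It only builds the `dataSources` array in memory; the function name `buildDataSources`, the `myStore` placeholder, and the idea of passing the dataset-name prefix as a string are assumptions, and pushing the result to Atlas through the API is left out.

```javascript
// Build one dataSources entry per snapshot timestamp,
// mirroring the hand-written JSON entries shown above.
function buildDataSources(datasetPrefix, timestamps, storeName) {
  return timestamps.map((ts) => ({
    datasetName: datasetPrefix + "$" + ts,
    provenanceFieldName: "provenance",
    storeName,
  }));
}

const sources = buildDataSources(
  "v1$atlas$snapshot$Cluster0$env$collectionName",
  ["20230814T050424Z", "20230813T050415Z"],
  "myStore" // placeholder store name
);
console.log(JSON.stringify(sources, null, 2));

// Rough size of a hand-maintained file, using the figures from the post.
const linesPerYear = 3 * 365 * 20;
console.log(linesPerYear); // prints 21900
```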

One idea could be to use a simple wildcard instead of the timestamp like this:

{
  "databases": [
    {
      "collections": [
        {
          "name": "collectionName",
          "dataSources": [
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$*",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            }
          ]
        }
      ]
    }
  ]
}
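
One possible reading of the proposed wildcard is plain glob-style matching on dataset names, sketched below. The helper names are hypothetical and the semantics (each `*` matches any run of characters) are an assumption about the request, not existing Data Federation behavior. Note that the `$` separators must be escaped before building the regular expression.

```javascript
// Escape regex metacharacters in each literal part,
// then join the parts with ".*" where the pattern had "*".
function wildcardToRegExp(pattern) {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + "$");
}

// Keep only the dataset names that match the wildcard pattern.
function matchDatasets(pattern, datasetNames) {
  const re = wildcardToRegExp(pattern);
  return datasetNames.filter((name) => re.test(name));
}
```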

P.S. I know that a time series collection could be useful in the specific example I just gave. But sometimes you may want to analyse various properties over time, and that's where a Data Lake solution makes sense.

1 vote

0 comments · Storage Configuration

Schema inference

Schemaless is flexible, but it has a big impact on downstream consumers, especially for data exchange and DW/AI.

It is essential to derive and infer the schema from the actual documents, so that we can understand, track, evolve, and translate the document schema.

https://www.mongodb.com/blog/post/engblog-implementing-online-parquet-shredder is a great article.

I'd like to propose an additional feature in ADL/ADF that makes schema inference a first-class citizen, with faster turnaround and lower operating cost.

After the $out operation of ADL/ADF, please collect the Parquet schema from each data file and unify them into a single schema. This schema would be stored in a .schema.json or .schema.txt file in the same S3/GCS location.

Add a new flag or parameter for $out that scans all the documents matching the filter condition in the query but, instead of writing the Parquet files, writes only the .schema.json or .schema.txt file to S3. This could be a useful routine to run every week, with or without a rough incremental datetime filter, to infer the schema and then update the corporate schema repository.
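
The "union/unify" step can be illustrated with a small sketch that works on in-memory documents rather than Parquet files. The function names and the output shape (field path mapped to a sorted list of observed type names) are assumptions for the example, not a proposed .schema.json format.

```javascript
// Classify a value into a coarse type name.
function typeOf(value) {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  if (value instanceof Date) return "date";
  return typeof value; // "string", "number", "boolean", "object", ...
}

// Union the field types seen across all documents into one schema:
// { "field.path": [sorted type names] }.
function inferSchema(docs) {
  const seen = new Map();
  const visit = (obj, prefix) => {
    for (const [key, value] of Object.entries(obj)) {
      const path = prefix ? `${prefix}.${key}` : key;
      const type = typeOf(value);
      if (!seen.has(path)) seen.set(path, new Set());
      seen.get(path).add(type);
      if (type === "object") visit(value, path); // recurse into subdocuments
    }
  };
  docs.forEach((doc) => visit(doc, ""));
  const schema = {};
  for (const path of [...seen.keys()].sort()) {
    schema[path] = [...seen.get(path)].sort();
  }
  return schema;
}
```

A field that appears with more than one type across documents shows up with several entries, which is exactly the kind of drift a weekly inference run would surface.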

I've discussed this idea in detail with MGM and benjamin.flast.

Thank you.

1 vote

0 comments · Query Functionality

The data upload process is a little difficult for new users. Please publish a demo video of uploading.

Overall, I found this to be interesting and user-friendly software.

1 vote

0 comments

