Combine data lake snapshots into a single federated collection
A common use case for data analytics is analysing how your data evolves over time.
For example, imagine you have an e-commerce database in which product prices change every day. Your database only stores the current price, but you'd like to build a chart showing the evolution of your product prices over time (price on the y axis, time on the x axis).
It is possible today to make this happen by combining Data Lake and Data Federation, but the Storage Configuration JSON needs to be updated manually, like this:
```json
{
  "databases": [
    {
      "collections": [
        {
          "name": "collectionName",
          "dataSources": [
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230814T050424Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230813T050415Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230812T050424Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            ...
          ]
        }
      ]
    }
  ],
  ...
}
```
With one dataSources entry per snapshot, the Data Federation configuration JSON grows to thousands of lines and needs to be maintained daily, either by hand or with a script calling the API (3 lines of JSON per snapshot * 365 snapshots a year * 20 collections ≈ 22,000 lines of JSON a year).
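To show the maintenance burden concretely, here is a minimal sketch of the kind of script that would have to run daily to regenerate the dataSources array. The `build_data_sources` helper, the store name, and the snapshot schedule are all hypothetical; only the dataset-name pattern (`v1$atlas$snapshot$<cluster>$<db>$<coll>$<timestamp>`) comes from the configuration above.

```python
from datetime import datetime, timedelta

def build_data_sources(cluster, database, collection, store_name, start, days):
    """Hypothetical helper: one dataSources entry per daily snapshot.

    Real snapshot timestamps drift by a few seconds from day to day
    (see 050424Z vs 050415Z above); a real script would list the actual
    snapshots via the Atlas Administration API instead of computing them.
    """
    sources = []
    for i in range(days):
        ts = (start + timedelta(days=i)).strftime("%Y%m%dT%H%M%SZ")
        sources.append({
            "datasetName": f"v1$atlas$snapshot${cluster}${database}${collection}${ts}",
            "provenanceFieldName": "provenance",
            "storeName": store_name,  # hypothetical store name
        })
    return sources

config = {
    "databases": [{
        "collections": [{
            "name": "collectionName",
            "dataSources": build_data_sources(
                "Cluster0", "env", "collectionName", "snapshotStore",
                datetime(2023, 8, 12, 5, 4, 24), 365),
        }],
    }],
}

# A year of snapshots for a single collection already yields 365 entries
# that have to be pushed back to the storage configuration every day.
print(len(config["databases"][0]["collections"][0]["dataSources"]))  # → 365
```

And that is for one collection; the 20-collection estimate above multiplies this accordingly.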
One idea would be to support a simple wildcard in place of the timestamp, like this:
```json
{
  "databases": [
    {
      "collections": [
        {
          "name": "collectionName",
          "dataSources": [
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$*",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            }
          ]
        }
      ]
    }
  ]
}
```
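With such a wildcard, a single federated collection would expose every snapshot at once, and the provenance field could drive the time axis of the price chart. A sketch of the kind of aggregation this would enable, assuming the provenance document exposes the source dataset name (the `provenance.datasetName`, `productId`, and `price` fields are illustrative, not confirmed Data Federation output):

```python
# Pipeline sketch: one price point per product per daily snapshot,
# extracting the snapshot timestamp from a dataset name such as
# "v1$atlas$snapshot$Cluster0$env$collectionName$20230814T050424Z".
price_over_time = [
    {"$addFields": {
        # Take the last "$"-separated token of the dataset name;
        # {"$literal": "$"} keeps the delimiter from being read as a field path.
        "snapshotDate": {
            "$arrayElemAt": [
                {"$split": ["$provenance.datasetName", {"$literal": "$"}]},
                -1,
            ]
        }
    }},
    # Group to one (product, snapshot date) pair: x = date, y = price.
    {"$group": {
        "_id": {"product": "$productId", "date": "$snapshotDate"},
        "price": {"$first": "$price"},
    }},
    {"$sort": {"_id.date": 1}},
]
```

Running this against the federated collection (e.g. `db.collectionName.aggregate(price_over_time)`) would produce the time series the chart needs, with no daily configuration churn.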
P.S.: I know a time series collection could be useful in the specific example I just gave. But sometimes you want to analyse arbitrary properties over time, and that's where a Data Lake solution makes sense.