Combine data lake snapshots into a single federated collection
A common use case for data analytics is analysing how your data evolves over time.
For example, imagine you have an e-commerce database in which product prices change every day. Your database only stores the current price, but you'd like to build a chart showing the evolution of your product prices over time (price on the y axis, time on the x axis).
It is possible today to make this happen by combining Data Lake and Data Federation, but the Storage Configuration JSON needs to be updated manually, like this:
```json
{
  "databases": [
    {
      "collections": [
        {
          "name": "collectionName",
          "dataSources": [
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230814T050424Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230813T050415Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230812T050424Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            ...
          ]
        }
      ]
    }
  ],
  ...
}
```
With one dataSources entry per snapshot, the Data Federation configuration JSON grows to thousands of lines and needs to be maintained daily, either by hand or with a script calling the API (3 lines of JSON per snapshot * 365 snapshots a year * 20 collections ≈ 22,000 lines of JSON a year).
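To show the maintenance burden concretely, here is a minimal sketch of the kind of script that would have to run daily to regenerate the dataSources array. The `build_data_sources` helper, the store name, and the snapshot schedule are all hypothetical; only the dataset-name pattern (`v1$atlas$snapshot$<cluster>$<db>$<coll>$<timestamp>`) comes from the configuration above.

```python
from datetime import datetime, timedelta

def build_data_sources(cluster, database, collection, store_name, start, days):
    """Hypothetical helper: one dataSources entry per daily snapshot.

    Real snapshot timestamps drift by a few seconds from day to day
    (see 050424Z vs 050415Z above); a real script would list the actual
    snapshots via the Atlas Administration API instead of computing them.
    """
    sources = []
    for i in range(days):
        ts = (start + timedelta(days=i)).strftime("%Y%m%dT%H%M%SZ")
        sources.append({
            "datasetName": f"v1$atlas$snapshot${cluster}${database}${collection}${ts}",
            "provenanceFieldName": "provenance",
            "storeName": store_name,  # hypothetical store name
        })
    return sources

config = {
    "databases": [{
        "collections": [{
            "name": "collectionName",
            "dataSources": build_data_sources(
                "Cluster0", "env", "collectionName", "snapshotStore",
                datetime(2023, 8, 12, 5, 4, 24), 365),
        }],
    }],
}

# A year of snapshots for a single collection already yields 365 entries
# that have to be pushed back to the storage configuration every day.
print(len(config["databases"][0]["collections"][0]["dataSources"]))  # → 365
```

And that is for one collection; the 20-collection estimate above multiplies this accordingly.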
One idea would be to support a simple wildcard in place of the timestamp, like this:
```json
{
  "databases": [
    {
      "collections": [
        {
          "name": "collectionName",
          "dataSources": [
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$*",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            }
          ]
        }
      ]
    }
  ]
}
```
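With such a wildcard, a single federated collection would expose every snapshot at once, and the provenance field could drive the time axis of the price chart. A sketch of the kind of aggregation this would enable, assuming the provenance document exposes the source dataset name (the `provenance.datasetName`, `productId`, and `price` fields are illustrative, not confirmed Data Federation output):

```python
# Pipeline sketch: one price point per product per daily snapshot,
# extracting the snapshot timestamp from a dataset name such as
# "v1$atlas$snapshot$Cluster0$env$collectionName$20230814T050424Z".
price_over_time = [
    {"$addFields": {
        # Take the last "$"-separated token of the dataset name;
        # {"$literal": "$"} keeps the delimiter from being read as a field path.
        "snapshotDate": {
            "$arrayElemAt": [
                {"$split": ["$provenance.datasetName", {"$literal": "$"}]},
                -1,
            ]
        }
    }},
    # Group to one (product, snapshot date) pair: x = date, y = price.
    {"$group": {
        "_id": {"product": "$productId", "date": "$snapshotDate"},
        "price": {"$first": "$price"},
    }},
    {"$sort": {"_id.date": 1}},
]
```

Running this against the federated collection (e.g. `db.collectionName.aggregate(price_over_time)`) would produce the time series the chart needs, with no daily configuration churn.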
P.S.: I know a time series collection could be useful in the specific example I just gave. But sometimes you want to analyse arbitrary properties over time, and that's where a Data Lake solution makes sense.