Schema inference
Schemaless design is flexible, but it has a big impact on downstream consumers, especially for data exchange and DW/AI workloads.
Deriving and inferring the schema from the actual documents is a must-have effort, so that we can understand, track, evolve, and translate the document schema.
https://www.mongodb.com/blog/post/engblog-implementing-online-parquet-shredder is a great article.
I'd like to propose an additional feature in ADL/ADF that makes schema inference a first-class citizen, with faster turnaround and lower operational cost.
After the $out operation of ADL/ADF completes, collect the Parquet schema from each data file and union/unify them into a single schema. Store this schema in a .schema.json or .schema.txt file in the same S3/GCS location.
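As a rough illustration of the unification step (not the proposed built-in behavior), the sketch below reads only the Parquet footers of each output file and unifies them with pyarrow; the bucket prefix, filename, and use of s3fs are assumptions for the example.

```python
# Minimal sketch: unify per-file Parquet schemas into one .schema.json.
# Assumes pyarrow and s3fs are installed and credentials come from the environment.
import json
import pyarrow as pa
import pyarrow.parquet as pq
from s3fs import S3FileSystem

fs = S3FileSystem()
prefix = "my-bucket/analytics/orders/"   # hypothetical $out destination

schemas = []
for path in fs.glob(prefix + "*.parquet"):
    # read_schema only touches the footer, so this stays cheap even for large files.
    schemas.append(pq.read_schema(fs.open(path, "rb")))

# Union/unify the per-file schemas into a single schema
# (raises if two files disagree on a field's type).
unified = pa.unify_schemas(schemas)

# Persist a human/tool-readable description next to the data files.
with fs.open(prefix + ".schema.json", "w") as f:
    json.dump(
        [{"name": field.name, "type": str(field.type), "nullable": field.nullable}
         for field in unified],
        f,
        indent=2,
    )
```

The key point is that the schema can be produced from footer metadata alone, without re-reading the row groups, so the cost of emitting the .schema.json file is small relative to the $out itself.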
Add a new flag/parameter for $out that scans all documents matching the filter conditions in the query, but instead of writing the Parquet data files, writes only the .schema.json or .schema.txt file to S3. This would be a useful routine to run every week, with or without a rough datetime incremental filter, to infer the schema and then update the corporate/enterprise central schema repository; see the sketch below.
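To make the request concrete, here is a hypothetical pymongo call showing roughly what the weekly schema-only run could look like. The "schemaOnly" option does not exist today and is purely illustrative, and the connection string, namespace, field names, and bucket layout are placeholders.

```python
# Hypothetical sketch of a weekly schema-only $out run against an ADL/ADF
# federated database instance. "schemaOnly" is a proposed flag, not a real option.
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
coll = client["sales"]["orders"]                     # placeholder namespace

one_week_ago = datetime.utcnow() - timedelta(days=7)

coll.aggregate([
    # Rough datetime incremental filter so the weekly run only scans recent documents.
    {"$match": {"updatedAt": {"$gte": one_week_ago}}},
    {"$out": {
        "s3": {
            "bucket": "corporate-schema-repo",       # placeholder bucket
            "filename": "orders/",
            "format": {"name": "parquet"},
        },
        # Proposed flag: scan the matching documents and emit only the
        # .schema.json file, skipping the Parquet data files themselves.
        "schemaOnly": True,
    }},
])
```

The output of this job would then be pushed into the central schema repository, so schema drift is detected on a fixed cadence rather than discovered by downstream consumers.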
I've discussed this idea with MGM and benjamin.flast.
Thank you.