Challenge 5 : DDM & Access Control

In this challenge, you will explore the DDM data lifecycle end to end, from upload through metadata enrichment to data quality validation, and observe how identity and project scoping are enforced at every stage.

You will validate that DDM:

Ingests datasets and registers them in a searchable catalog
Generates and stores derived artifacts bound to your identity and project, including:
- system metadata
- profiling reports (HTML)

Enriches dataset schema using LLM-assisted column descriptions
Supports reusable expectation suites for data quality evaluation
Produces validation artifacts, linked to both the dataset and the expectation suite
Enforces access control consistently across all artifacts via:
- project_id scoping (primary isolation boundary)
- uploader identity binding (ownership + provenance)
- artifact lineage (derived assets inherit the same scope as the source dataset)
- catalog visibility rules (no cross-project or cross-user discovery)
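
The scoping rules listed above can be pictured with a small, self-contained model. This is only an illustrative sketch and not DDM's implementation: the Artifact class, the derive and visible_to helpers, and the sample identifiers are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Artifact:
    """Hypothetical stand-in for a catalog entry (dataset or derived asset)."""
    artifact_id: str
    kind: str                 # e.g. "dataset", "report", "validation"
    project_id: str           # primary isolation boundary
    owner: str                # uploader identity
    parent_id: Optional[str] = None  # lineage reference to the source


def derive(source: Artifact, artifact_id: str, kind: str) -> Artifact:
    """Create a derived artifact that inherits the source's project and owner."""
    return Artifact(artifact_id, kind, source.project_id, source.owner,
                    parent_id=source.artifact_id)


def visible_to(catalog: List[Artifact], user: str, project_id: str) -> List[Artifact]:
    """Return only the artifacts that match both the user and the project."""
    return [a for a in catalog if a.owner == user and a.project_id == project_id]


dataset = Artifact("f1", "dataset", "tutorial-alice", "alice")
report = derive(dataset, "r1", "report")
print(len(visible_to([dataset, report], "alice", "tutorial-alice")))  # 2
print(len(visible_to([dataset, report], "bob", "tutorial-bob")))      # 0
```

In this model a derived report can never end up in a different scope than its source dataset, which is the behavior the challenge asks you to observe.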

Estimated Time : 30–45 minutes

Difficulty Level : Low/Intermediate (Python knowledge needed if you use the SDK)

Background

The Decentralized Data Management (DDM) layer provides:

1. Dataset Ingestion (Zenoh P2P network) + Catalog

Upload heterogeneous datasets (CSV, Parquet, XLSX, PNG/JPG, GeoPandas, etc.) and store them in the Zenoh P2P network
Register them in the catalog with metadata (type, size, category, description, owner, project_id, paths, lineage references)

2. Automated Metadata & Enrichment

Parse structural metadata (schema, column types, summary statistics)
Enrich metadata with human-readable column descriptions (LLM-assisted)

3. Derived Artifacts

Generate dataset profiling reports (HTML artifacts)
Store system-generated file metadata alongside the raw dataset
Track derived operational artifacts, including file stats (downloads/views), file history (lineage/parent files), and user action logging for auditability

4. User-Driven Operations (Human in the loop)

Build and run advanced catalog searches using filters
Save reusable queries as personal artifacts (user-scoped)
Use UI feedback (reports, validation pages) to iteratively refine and verify results

5. Data Quality Artifacts

Create expectation suites (rules/constraints describing “high quality data”)
Validate datasets against suites, producing validation results artifacts

6. Authorization & Access Control (ABAC Context Handlers)

Enforce access decisions using Attribute-Based Access Control via context handlers
Authorize requests based on runtime attributes such as:
- user identity (username / sub / roles)
- resource ownership (file.user_id, suite.user_id)
- project scope (project_id)
- action type (upload, download, validate, view report, etc.)

Ensure access control applies not only to raw datasets, but also to all derived artifacts (metadata, reports, suites, validations, saved queries)
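
An attribute-based decision of the kind described in this section can be sketched as a single function over request and resource attributes. This is a simplified illustration and not the actual DDM context handler: RequestContext, Resource, KNOWN_ACTIONS, and is_allowed are hypothetical names, and a real handler may also consider roles or explicit sharing.

```python
from dataclasses import dataclass

# Hypothetical action identifiers, based on the action types listed above.
KNOWN_ACTIONS = frozenset({"upload", "download", "validate", "view_report"})


@dataclass(frozen=True)
class RequestContext:
    username: str     # user identity attribute
    project_id: str   # project scope of the request
    action: str       # requested action type


@dataclass(frozen=True)
class Resource:
    user_id: str      # resource ownership (e.g. file.user_id, suite.user_id)
    project_id: str   # project the resource belongs to


def is_allowed(ctx: RequestContext, resource: Resource) -> bool:
    """Allow a known action only for the owner, inside the resource's project."""
    return (
        ctx.action in KNOWN_ACTIONS
        and ctx.project_id == resource.project_id
        and ctx.username == resource.user_id
    )


titanic = Resource(user_id="alice", project_id="tutorial-alice")
print(is_allowed(RequestContext("alice", "tutorial-alice", "download"), titanic))  # True
print(is_allowed(RequestContext("bob", "tutorial-bob", "download"), titanic))      # False
```

Because the same check can be applied to any resource that carries an owner and a project, it covers derived artifacts as well as raw datasets.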

See Appendix for full DDM architecture and component overview.

Prerequisites

You can complete this challenge in two equivalent ways: through the DDM UI or through the DDM Client (SDK).

Both options follow the same steps and validate the same access-control and artifact-lineage behavior.

If you choose the DDM UI option, you can reference the implementation details here.

If you choose the DDM Client (SDK) option, use this repository.

A dedicated tutorial page is available for this challenge, including:

real-time progress tracking
step-by-step guidance for the UI and the SDK
screenshots and links to GitHub assets

Files used in this guide : sample_files/

Challenge Steps

Step 1 — Upload files under a user-scoped project

Client SDK : script 01_upload_file.py

DDM UI: Upload Page

Upload the following datasets:

Titanic-Dataset.csv

titanic.parquet

titanic_large.csv

under:

project_id = (a) <username> if using the UI

(b) tutorial-<username> if using the client SDK

Use consistent metadata for all three uploads:

use_case = crisis

description (CSV) = “Challenge 05 CSV upload”

description (Parquet) = “Challenge 05 Parquet upload”

description (Large CSV) = “Challenge 05 Large CSV upload”

Capture all returned file_id values.
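
Before uploading, it can help to write the three uploads down as data so that the metadata stays consistent. The sketch below only prepares that plan and does not call the DDM client: project_id_for, build_upload_plan, and the "filename" key are hypothetical names for illustration, while the file names, use_case, and descriptions come from this step.

```python
USE_CASE = "crisis"

# File names and descriptions as given in this step.
DESCRIPTIONS = {
    "Titanic-Dataset.csv": "Challenge 05 CSV upload",
    "titanic.parquet": "Challenge 05 Parquet upload",
    "titanic_large.csv": "Challenge 05 Large CSV upload",
}


def project_id_for(username: str, client: str) -> str:
    """Return the project_id for this step: <username> (UI) or tutorial-<username> (SDK)."""
    if client == "ui":
        return username
    if client == "sdk":
        return f"tutorial-{username}"
    raise ValueError(f"unknown client: {client!r}")


def build_upload_plan(username: str, client: str) -> list:
    """Build one metadata record per file, all sharing project_id and use_case."""
    project_id = project_id_for(username, client)
    return [
        {
            "filename": filename,
            "project_id": project_id,
            "use_case": USE_CASE,
            "description": description,
        }
        for filename, description in DESCRIPTIONS.items()
    ]


for record in build_upload_plan("alice", "sdk"):
    print(record)
```

After each upload, you would store the returned file_id next to the matching record, since later steps refer to the files by file_id.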

Step 2 — Attach uploader metadata (identity-bound metadata)

Client SDK : attach_metadata.py

DDM UI: Upload Page

Locate the file:

titanic.parquet (the parquet file uploaded in Step 1)

Edit the sample metadata file:

uploader_metadata.json

Fill in the fields that identify you (username, challenge).

Attach the edited uploader_metadata.json to titanic.parquet as uploader metadata.
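
Editing the metadata file can be done by hand or with a few lines of Python. The sketch below assumes that username and challenge are top-level keys of uploader_metadata.json, which may not match the actual sample file; fill_uploader_metadata is a hypothetical helper and is not part of the DDM client.

```python
import json
from pathlib import Path


def fill_uploader_metadata(path: str, username: str, challenge: str) -> dict:
    """Set the identifying fields in the JSON file and keep all other keys."""
    file_path = Path(path)
    # Start from the existing sample content if the file is present.
    if file_path.exists():
        metadata = json.loads(file_path.read_text(encoding="utf-8"))
    else:
        metadata = {}
    # Assumption: both fields are top-level keys in the sample file.
    metadata["username"] = username
    metadata["challenge"] = challenge
    file_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata
```

A call such as fill_uploader_metadata("uploader_metadata.json", "alice", "challenge-05") would update the file in place; the challenge value here is only a placeholder.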

Access control insight 
The parquet file now has identity-bound uploader metadata, meaning it becomes part of the dataset’s provenance and should only be visible/usable within your identity and project scope.

Key question : Can uploader metadata be observed only where expected?

Step 3 — Download generated file metadata

Client SDK : 03_download_file_metadata.py

DDM UI: My Catalog

Use the same project_id = (a) <username> if using the UI

(b) tutorial-<username> if using the client SDK

For two files (Titanic-Dataset.csv and titanic.parquet), retrieve the system-generated file metadata that DDM produced after ingestion (schema, types, summary stats, system fields, etc.).

Step 4 — Download profiling report (derived artifact)

Client SDK : 04_download_report_html.py

DDM UI: My Catalog

Use the CSV file_id (Titanic-Dataset.csv from Step 1).

Download the profiling report HTML derived from the dataset.

Open the report and inspect the data quality metrics, statistics, missingness, and distributions.

Step 5 — Advanced catalog query (visibility boundary)

Client SDK : 05_catalog_advanced.py

DDM UI: Catalog advanced

Create and run an advanced query using filters.json.

Ensure the query is scoped to your tutorial project:

project_id = (a) <username> if using the UI

(b) tutorial-<username> if using the client SDK

Note how many files are returned initially.

Modify filters until only one file is returned, specifically the CSV you uploaded in Step 1 (Titanic-Dataset.csv).
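
The narrowing process in this step can be rehearsed locally on a few mock catalog entries. The sketch below is not the DDM query engine and does not reflect the real filters.json layout: apply_filters, the entry keys, and the sample entries are hypothetical, and only simple equality filters are modeled.

```python
def apply_filters(entries: list, filters: dict) -> list:
    """Keep entries whose fields equal every key/value pair in filters."""
    return [
        entry for entry in entries
        if all(entry.get(key) == value for key, value in filters.items())
    ]


# Illustrative entries that mirror the three uploads from Step 1.
SAMPLE_ENTRIES = [
    {"filename": "Titanic-Dataset.csv", "file_type": "csv", "project_id": "tutorial-alice"},
    {"filename": "titanic.parquet", "file_type": "parquet", "project_id": "tutorial-alice"},
    {"filename": "titanic_large.csv", "file_type": "csv", "project_id": "tutorial-alice"},
]

filters = {"project_id": "tutorial-alice"}
print(len(apply_filters(SAMPLE_ENTRIES, filters)))  # 3

filters["file_type"] = "csv"
print(len(apply_filters(SAMPLE_ENTRIES, filters)))  # 2

filters["filename"] = "Titanic-Dataset.csv"
print(len(apply_filters(SAMPLE_ENTRIES, filters)))  # 1
```

Each added filter can only shrink the result set, so recording the count after every change shows which filter isolates the target file.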

Step 6 — Save advanced query (user-bound artifact)

Client SDK : 06_save_advanced_query.py

DDM UI: Catalog advanced

Save the final query (the one returning 1 CSV file).

Use a name such as:

tutorial-<username> (or <tutorial> if you are using the UI)

Treat the saved query as a personal reproducible artifact (it should remain bound to your identity and not be reusable by other users unless explicitly shared by design).

Step 7 — Create an expectation suite (quality rules + LLM enrichment)

Client SDK : 07_create_suite.py

DDM UI: Create Expectations page

Create an expectation suite using the titanic-sample.csv dataset (for generating rules and enriched descriptions) and expectations.json.

Use:

project_id = tutorial-<username> if you are using SDK

<tutorial> if you are using UI

suite_name = tutorial-<username>

expectations source = expectations.json

The expectations you have to set are listed in the expectations.json file.
If you are using the DDM UI, see the Appendix for the same expectations in human-readable format.

This step also triggers LLM-assisted enrichment, producing human-readable column descriptions linked to the suite.

In the UI Expectations page, you can see:

suite details

enriched column descriptions

charts/graphs derived from suite metadata

The suite is reusable by reference, but does not grant access to datasets outside your project.

Generate the expectation suite and capture the suite_id.
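
To get a feel for what a machine-readable expectation list looks like, the sketch below builds a few of the rules from the Appendix as plain Python dictionaries. The expectation names and parameter values are taken from the Appendix; the "expectation_type" and "kwargs" layout is an assumption borrowed from the Great Expectations suite convention, and the real expectations.json in the repository may be structured differently.

```python
import json


def expectation(expectation_type: str, **kwargs) -> dict:
    """Build one expectation entry (assumed layout, see the note above)."""
    return {"expectation_type": expectation_type, "kwargs": kwargs}


suite_expectations = [
    expectation("expect_table_column_count_to_be_between",
                min_value=1, max_value=11),
    expectation("expect_table_row_count_to_be_between",
                min_value=1000, max_value=200000,
                strict_min=False, strict_max=False),
    expectation("expect_column_values_to_not_be_null",
                column="PassengerId", mostly=0.95),
    expectation("expect_column_values_to_be_in_set",
                column="Sex", value_set=["male", "female"], mostly=0.95),
]

# Serialize to JSON text, as it might appear in a file.
print(json.dumps(suite_expectations, indent=2))
```

Keeping the rules as data, separate from any dataset, is what lets the same suite be applied to several files in Step 8.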

Step 8 — Validate datasets against suite

Client SDK: 08_validate.py

DDM UI: Validations page

Validate Titanic-Dataset.csv against the suite from Step 7.

Inputs you must use:

suite_id (Step 7)

file_id (Step 1 CSV)

project_id = tutorial-<username> if you are using SDK

<tutorial> if you are using UI

After completion, open Validation Results and review:

success/failure status

failed expectations and details

linked dataset + linked suite

Then repeat validation using the large titanic_large.csv file_id from Step 1, and compare differences.

Validation results are stored as artifacts linked to both:

the dataset (file_id)

the suite (suite_id)

Access control insight 

Reusing a suite across datasets does not bypass access controls: every validation remains scoped to the dataset’s project/owner.
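
When comparing the two validation runs, it helps to reduce each run to the set of failed expectations and look at the differences. The sketch below assumes a hypothetical result format, a list of records with "expectation" and "success" keys, which may not match what DDM returns; the sample records are made up for illustration and are not real results.

```python
def failed_expectations(results: list) -> set:
    """Collect the names of expectations that did not succeed."""
    return {r["expectation"] for r in results if not r["success"]}


def compare_runs(first: list, second: list) -> dict:
    """Group failures by whether they occur in one run or in both."""
    a, b = failed_expectations(first), failed_expectations(second)
    return {
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
        "both": sorted(a & b),
    }


# Made-up results, for illustration only.
run_small = [
    {"expectation": "row_count_between", "success": False},
    {"expectation": "age_not_null", "success": False},
    {"expectation": "sex_in_set", "success": True},
]
run_large = [
    {"expectation": "row_count_between", "success": True},
    {"expectation": "age_not_null", "success": False},
    {"expectation": "sex_in_set", "success": True},
]

print(compare_runs(run_small, run_large))
```

Failures that appear in only one run usually point to properties that depend on the dataset (such as its size), which is useful input for the short-answer questions in the Deliverables.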

Step 9 — Download file

Client SDK: 09_download_file.py

DDM UI: My Catalog

Use:

file_id = the file you want to download (the Titanic-Dataset.csv file id from Step 1)

to download the uploaded file.

Step 10 — Catalog list

Client SDK: 10_catalog_list.py

DDM UI: My Catalog

List catalog entries for your tutorial project:

project_id = tutorial-<username> if you are using SDK

<tutorial> if you are using UI

Confirm that:

all 3 uploaded files exist

derived artifacts (metadata, reports, validations) are visible where expected.
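
The first confirmation can be automated with a simple set comparison. The sketch below works on a plain list of dictionaries standing in for a catalog listing; missing_files and the "filename" and "project_id" entry keys are hypothetical and may differ from what the DDM client returns.

```python
# The three files uploaded in Step 1.
EXPECTED_FILES = {"Titanic-Dataset.csv", "titanic.parquet", "titanic_large.csv"}


def missing_files(entries: list, project_id: str) -> list:
    """Return the expected file names that are absent from the project's listing."""
    present = {
        entry["filename"]
        for entry in entries
        if entry.get("project_id") == project_id
    }
    return sorted(EXPECTED_FILES - present)


listing = [
    {"filename": "Titanic-Dataset.csv", "project_id": "tutorial-alice"},
    {"filename": "titanic.parquet", "project_id": "tutorial-alice"},
]
print(missing_files(listing, "tutorial-alice"))  # ['titanic_large.csv']
```

An empty list means all three uploads are visible in the project.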

Step 11 — Access Control

Create a new account and log in.
Run any of Steps 2, 3, 4, or 9 using the file_id values captured with your original account.
What message do you receive?

Deliverables

The backend endpoint checks step completion based on stored artifacts and logs, returning a per-step status (pending/success/failure). Additionally, you will have to provide a file with short answers to the following questions:

What defines access control in DDM?
Why did expectations fail or pass? 
Which dataset had better validation performance and why? 
What is the key for reusable expectations?

Can uploader metadata, reports etc. be observed only where expected?

Appendix

1. Architecture

Expectations Sample

0) Create the suite shell

1. Go to Expectation Suites → Create Suite
2. Name it (e.g. tutorial__suite)
3. Click Next to upload sample file
4. Select the sample dataset: titanic_sample.csv and upload it
5. Click Next to enter the suite editor and fill in the expectations below.

1) Table-level expectations

A) Column count between

Add Expectation: Select (expect_table_column_count_to_be_between); Min value (1); Max value (11)

B) Columns match ordered list

Add Expectation: Select (expect_table_columns_to_match_ordered_list); Column list (in this exact order, prepopulated by DDM): PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Embarked

C) Row count between

Add Expectation: Select (expect_table_row_count_to_be_between); Min value (1000); Max value (200000); Strict min (false); Strict max (false)
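
The "between" expectations with their Strict min and Strict max options can be read as the range check sketched below. This is a minimal illustration of the semantics, assuming that a strict bound excludes the boundary value itself; is_between is a hypothetical helper, and the counts in the example are made up.

```python
def is_between(value, min_value=None, max_value=None,
               strict_min=False, strict_max=False) -> bool:
    """Check value against optional bounds; strict bounds exclude equality."""
    if min_value is not None:
        if value < min_value or (strict_min and value == min_value):
            return False
    if max_value is not None:
        if value > max_value or (strict_max and value == max_value):
            return False
    return True


# Row count rule from C): Min value 1000, Max value 200000, non-strict.
print(is_between(500, 1000, 200000))    # False: below the minimum
print(is_between(1000, 1000, 200000))   # True: boundary allowed when not strict
print(is_between(1000, 1000, 200000, strict_min=True))  # False

# Column count rule from A): Min value 1, Max value 11.
print(is_between(12, 1, 11))            # False: one column too many
```

A table-level rule like this passes or fails for the whole dataset at once, so the same suite can give different outcomes for a small and a large file.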

2) Column-level expectations (add these one by one)

PassengerId

Not null

Add Expectation

Select: expect_column_values_to_not_be_null

Column: PassengerId

Mostly: 0.95

Unique

Add Expectation

Select: expect_column_values_to_be_unique

Column: PassengerId

Mostly: 0.95

Survived

Not null

Select: expect_column_values_to_not_be_null

Column: Survived

Mostly: 0.95

In set

Select: expect_column_values_to_be_in_set

Column: Survived

Mostly: 0.95

Value set: [“1”, “0”]

In the UI: add the two values 1 and 0, pressing Enter after each one (the field uses “chips” elements).

Pclass

Not null

Select: expect_column_values_to_not_be_null

Column: Pclass

Mostly: 0.95

Name

Not null

Select: expect_column_values_to_not_be_null

Column: Name

Mostly: 0.95

Unique

Select: expect_column_values_to_be_unique

Column: Name

Mostly: 0.95

Sex

Not null

Select: expect_column_values_to_not_be_null

Column: Sex

Mostly: 0.95

In set

Select: expect_column_values_to_be_in_set

Column: Sex

Mostly: 0.95

Value set: [“male”, “female”]

Age

Not null

Select: expect_column_values_to_not_be_null

Column: Age

Mostly: 0.95

Min between

Select: expect_column_min_to_be_between

Column: Age

Min value: 0

Max value: 2

Strict min: false (unchecked)

Strict max: false (unchecked)

Mean between

Select: expect_column_mean_to_be_between

Column: Age

Min value: 33

Max value: 88

Strict min: false (unchecked)

Strict max: false (unchecked)

SibSp

Not null

Select: expect_column_values_to_not_be_null

Column: SibSp

Mostly: 0.95

Min between

Select: expect_column_min_to_be_between

Column: SibSp

Min value: 0

Max value: 15

Strict min: false

Strict max: false

Parch

Not null

Select: expect_column_values_to_not_be_null

Column: Parch

Mostly: 0.95

Ticket

Not null

Select: expect_column_values_to_not_be_null

Column: Ticket

Mostly: 0.95

Unique

Select: expect_column_values_to_be_unique

Column: Ticket

Mostly: 0.95

Fare

Not null

Select: expect_column_values_to_not_be_null

Column: Fare

Mostly: 0.95

Embarked

Not null

Select: expect_column_values_to_not_be_null

Column: Embarked

Mostly: 0.95

In set

Select: expect_column_values_to_be_in_set

Column: Embarked

Mostly: 0.95

Value set: [“C”, “Q”, “S”]
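
The Mostly value used throughout the column-level rules is a tolerance: the rule passes when at least that fraction of the values satisfies the condition. The sketch below illustrates this for the not-null rule only. It is a simplified model and not the validation engine used by DDM; not_null_passes is a hypothetical helper, and treating an empty column as passing is a choice made for this example.

```python
def not_null_passes(values: list, mostly: float = 1.0) -> bool:
    """Pass when the share of non-null values is at least `mostly`."""
    if not values:
        return True  # choice made for this sketch: nothing to violate
    non_null = sum(value is not None for value in values)
    return non_null / len(values) >= mostly


# 100 values with 4 missing: 96% non-null, which meets Mostly 0.95.
mostly_complete = [1] * 96 + [None] * 4
print(not_null_passes(mostly_complete, mostly=0.95))  # True

# 100 values with 10 missing: 90% non-null, which does not.
sparse = [1] * 90 + [None] * 10
print(not_null_passes(sparse, mostly=0.95))           # False
```

With Mostly set to 0.95, a column may contain up to 5% missing values and still pass, which is worth keeping in mind when you explain why individual expectations passed or failed.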

3) Final step

Click Save Suite

Your feedback

Send us your results and your comments about this challenge!

Click here to send your feedback (Google Forms)
