Challenge 5 : DDM & Access Control

In this challenge, you will explore the DDM data lifecycle end-to-end from upload to metadata enrichment to data quality validation  and observe how identity + project scoping are enforced across every stage.

You will validate that DDM: 

  • Ingests datasets and registers them in a searchable catalog,  
  • Generates and stores derived artifacts, including: 
    • system metadata 
    • profiling reports (HTML), 
  • bound to your identity and project 
  • Enriches dataset schema using LLM-assisted column descriptions 
  • Supports reusable expectation suites for data quality evaluation 
  • Produces validation artifacts, linked to both the dataset and the expectation suite 
  • Enforces access control consistently across all artifacts via: 
    • project_id scoping (primary isolation boundary) 
    • uploader identity binding (ownership + provenance) 
    • artifact lineage (derived assets inherit the same scope as the source dataset) 
    • catalog visibility rules (no cross-project or cross-user discovery) 

Estimated Time : 30–45 minutes

Difficulty Level : Low/Intermediate (Python understanding needed if challenging with sdk)

Background

Decentralized Data Management layer provides:
1. Dataset Ingestion (Zenoh P2P network) + Catalog
  • Upload heterogeneous datasets (CSV, Parquet, XLSX, PNG/JPG, GeoPandas, etc.) and store in zenoh p2p network 
  • Register them in the catalog with metadata (type, size, category, description, owner, project_id, paths, lineage references) 
2. Automated Metadata & Enrichment
  • Parse structural metadata (schema, column types, summary statistics) 
  • Enrich metadata with human-readable column descriptions (LLM-assisted) 
3. Derived Artifacts
  • Generate dataset profiling reports (HTML artifacts) 
  • Store system-generated file metadata alongside the raw dataset 
  • Track derived operational artifacts, including file stats (downloads/views), file history (lineage/parent files), and user action logging for auditability 
4. User Driven Operations (Human in the middle)
  • Build and run advanced catalog searches using filters 
  • Save reusable queries as personal artifacts (user-scoped) 
  • Use UI feedback (reports, validation pages) to iteratively refine and verify results 
5. Data Quality Artifacts
  • Create expectation suites (rules/constraints describing “high quality data”) 
  • Validate datasets against suites, producing validation results artifacts 

 

6. Authorization & Access Control (ABAC Context Handlers)
  • Enforce access decisions using Attribute-Based Access Control via context handlers 
  • Authorize requests based on runtime attributes such as: 
    • user identity (username / sub / roles) 
    • resource ownership (file.user_id, suite.user_id) 
    • project scope (project_id) 
  • action type (upload, download, validate, view report, etc.)Ensure access control applies not only to raw datasets, but also to all derived artifacts (metadata, reports, suites, validations, saved queries) 

See Appendix for full DDM architecture and component overview. 

Prerequisites

You can complete this challenge in two equivalent ways:

Both options follow the same steps and validate the same access-control + artifact lineage behavior. 
If you choose the DDM UI option, you can reference the implementation details here. 

If you choose the DDM Client (SDK) option, use this repository. 

A dedicated tutorial page is available for this challenge, including:
  • progress tracking in real-time
  • step-by-step guidance for UI and SDK 
  • screenshots + links to github assets 

Files used in this guide : sample_files/ 

Challenge Steps

Step 1 — Upload files under a user-scoped project

Client SDK script 01_upload_file.py 

DDM UI: Upload Page 

  • Upload the following datasets: 
  • Titanic-Dataset.csv 
  • titanic.parquet 
  • titanic_large.csv 

under: 

  • project_id =  (a) <username>   if using ui 

(b) tutorial-<username>  if using client sdk 

Use consistent metadata for all three uploads: 

  • use_case = crisis 
  • description (CSV) = “Challenge 05 CSV upload” 
  • description (Parquet) = “Challenge 05 Parquet upload” 
  • description (Large CSV) = “Challenge 05 Large CSV upload” 

Capture all returned file_id values. 

Step 2 — Attach uploader metadata (identity-bound metadata)

Client SDK attach_metadata.py 

DDM UI: Upload Page 

  • Locate the file: 
  • titanic.parquet (the parquet file uploaded in Step 1) 
  • Edit the sample metadata file: 
  • uploader_metadata.json 
  • fill fields that identify you ( username , challenge ) 
  • Attach the edited uploader_metadata.json to titanic.parquet as uploader metadata. 

Access control insight 
The parquet file now has identity-bound uploader metadata, meaning it becomes part of the dataset’s provenance and should only be visible/usable within your identity and project scope. 

Key question : Can uploader metadata be observed only where expected? 

Step 3 — Download generated file metadata

Client SDK 03_download_file_metadata.py 

DDM UI: My Catalog 

  • Use the same project_id =  (a) <username> if using ui 

(b) tutorial-<username>  If using client sdk 

 

  • For two files (TitanicDataset.csv and titanic.parquet), retrieve the system-generated file metadata that DDM produced after ingestion (schema, types, summary stats, system fields, etc.) 
Step 4 — Download profiling report (derived artifact)

 

Client SDK 04_download_report_html.py 

DDM UI: My Catalog 

  • Download the profiling report HTML derived from the dataset. 
  • Open the report and inspect the data duality metrics / statistics / missingness / distributions. 

 

Step 5 — Advanced catalog query (visibility boundary)

Client SDK 05_catalog_advanced.py 

DDM UI: Catalog advanced 

  • Create and run advanced query using filters.json. 
  • Ensure the query is scoped to your tutorial project: 
  • project_id =  (a) <username> if using ui 

(b) tutorial-<username>  If using client sdk 

  • Note how many files are returned initially.  

Modify filters until only one file is returned, specifically the CSV you uploaded in Step 1 (Titanic-Dataset.csv). 

Step 6 — Save advanced query (user-bound artifact)

Client SDK 06_save_advanced_query.py 

DDM UI: Catalog advanced 

Client SDK script: 06_save_advanced_query.py 

  • Save the final query (the one returning 1 CSV file). 
  • Use name such as: 
  • tutorial-<username> (or <tutorial> if you re using UI) 

Treat the saved query as a personal reproducible artifact (it should remain bound to your identity and not be reusable by other users unless explicitly shared by design). 

Step 7 — Create an expectation suite (quality rules + LLM enrichment)

Client SDK 07_create_suite.py 

DDM UI: Create Expectations page 

  • Use: 
  • project_id  = tutorial-<username>  if you are using SDK 

<tutorial>  if you are using UI 

  • suite_name = tutorial-<username> 
  • The expectations you will have to set, are shown in expecttions.json ffile. 
    If you are using DDM UI,see Appendix for expectations you will have to set in human readable format. 

This step also triggers LLM-assisted enrichment, producing human-readable column descriptions linked to the suite. 

In the UI Expectations page, you can see: 

  • suite details 
  • enriched column descriptions 
  • charts/graphs derived from suite metadata 

The suite is reusable by reference, but does not grant access to datasets outside your project. 

Generate expectations suite and capture suite_id. 

Step 8 — Validate datasets against suite

Client SDK: 08_validate.py 

DDM UI: Validations page 

Validate Titanic-Dataset.csv against the suite from Step 7. 

Inputs you must use: 

  • suite_id (Step 7) 
  • file_id  (Step 1 CSV) 
  • project_id  tutorial-<username>  if you are using SDK 

<tutorial>  if you are using UI 

After completion, open Validation Results and review: 

  • success/failure status 
  • failed expectations and details 
  • linked dataset + linked suite  

Then repeat validation using the large titanic_large.csv  file_id from Step 1, and compare differences. 

Validation results are stored as artifacts linked to both: 

  • the dataset (file_id) 
  • the suite (suite_id) 

Access control insight 

Reusing a suite across datasets does not bypass access controls: every validation remains scoped to the dataset’s project/owner. 

 

Step 9 — Download file

Client SDK: 09_download_file.py 

DDM UI: My Catalog 

  • Use: 
  • file_id = the file you want to download (the Titanic-Dataset.csv file id from Step 1) 

to download the uploaded file. 

Step 10 — Catalog list

Client SDK: 10_catalog_list.py 

DDM UI: My Catalog 

  • List catalog entries for your tutorial project: 
  • project_id =  tutorial-<username>  if you are using SDK 

<tutorial>  if you are using UI 

  • Confirm that: 
  • all 3 uploaded files exist 
  • derived artifacts (metadata, reports, validations) are visible where expected. 
Step 11 — Access Control

Create a new account and login. 
Run whichever steps you want among 2,3,4,9. 
What message you receive? 

Deliverables

The backend endpoint checks step completion based on stored artifacts and logs, returning per-step status (pending/success/failure). Additionally you will have to provide a file with short answers:

  • What defines access control in DDM?

  • Why did expectations fail or pass? 

  • Which dataset had better validation performance and why? 

  • What is the key for reusable expectations?

Can uploader metadata, reports etc. be observedonly where expected?

Appendix

1. Architecture

1. Architecture

Expectations Sample

0) Create the suite shell
  • 1. Go to Expectation Suites → Create Suite

  • 2. Name it (e.g. tutorial__suite)

  • 3. Click Next to upload sample file

  • 4. Select the sample dataset: titanic_sample.csv and upload it

  • 5. Click Next to enter the suite editor and fill in the next.

1) Table-level expectations
  • A) Column count between

    Add Expectation: Select (expect_table_column_count_to_be_between); Min value (1); Max value (11)

  • B) Columns match ordered list

    Add Expectation: Select (expect_table_columns_to_match_ordered_list); Column list (in this exact order-prepopulated by DDM): PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Embarked

  • C) Row count between

    Add Expectation: Select (expect_table_row_count_to_be_between); Min value (1000); Max value (200000); Strict min (false); Strict max (false)

2) Column-level expectations (add these one by one)
PassengerId
  1. Not null

Add Expectation 

  • Select: expect_column_values_to_not_be_null 
  • Column: PassengerId 
  • Mostly: 0.95 
  1. Unique

Add Expectation 

  • Select: expect_column_values_to_be_unique 
  • Column: PassengerId 
  • Mostly: 0.95 
Survived
  1. Not null
  • Select: expect_column_values_to_not_be_null 
  • Column: Survived 
  • Mostly: 0.95 
  1. In set
  • Select: expect_column_values_to_be_in_set 
  • Column: Survived 
  • Mostly: 0.95 
  • Value set: [“1”, “0”] 
  • In the UI: add 2 values: 1 and 0 (press enter-it is “chips” elements)
Pclass

Not null 

  • Select: expect_column_values_to_not_be_null 
  • Column: Pclass 
  • Mostly: 0.95 
Name
  1. Not null
  • Select: expect_column_values_to_not_be_null 
  • Column: Name 
  • Mostly: 0.95 
  1. Unique
  • Select: expect_column_values_to_be_unique 
  • Column: Name 
  • Mostly: 0.95 
Sex
  1. Not null
  • Select: expect_column_values_to_not_be_null 
  • Column: Sex 
  • Mostly: 0.95 
  1. In set
  • Select: expect_column_values_to_be_in_set 
  • Column: Sex 
  • Mostly: 0.95 
  • Value set: [“male”, “female”] 
Age
  1. Not null
  • Select: expect_column_values_to_not_be_null 
  • Column: Age 
  • Mostly: 0.95 
  1. Min between

Select: expect_column_min_to_be_between 

  • Column: Age 
  • Min value: 0 
  • Max value: 2 
  • Strict min: false (unchecked) 
  • Strict max: false (unchecked) 
  1. Mean between

Select: expect_column_mean_to_be_between 

  • Column: Age 
  • Min value: 33 
  • Max value: 88 
  • Strict min: false (unchecked) 
  • Strict max: false (unchecked) 

 

SibSp
  1. Not null

Select: expect_column_values_to_not_be_null 

  • Column: SibSp 
  • Mostly: 0.95 
  1. Min between

Select: expect_column_min_to_be_between 

  • Column: SibSp 
  • Min value: 0 
  • Max value: 15 
  • Strict min: false 
  • Strict max: false 

 

Parch

Not null 

Select: expect_column_values_to_not_be_null 

  • Column: Parch 
  • Mostly: 0.95 
Ticket
  1. Not null

Select: expect_column_values_to_not_be_null 

  • Column: Ticket 
  • Mostly: 0.95 
  1. Unique

Select: expect_column_values_to_be_unique 

  • Column: Ticket 
  • Mostly: 0.95 
Fare

Not null 

Select: expect_column_values_to_not_be_null 

  • Column: Fare 
  • Mostly: 0.95 

Embarked 

  1. Not null

Select: expect_column_values_to_not_be_null 

  • Column: Embarked 
  • Mostly: 0.95 
  1. In set

Select: expect_column_values_to_be_in_set 

  • Column: Embarked 
  • Mostly: 0.95 
  • Value set: [“C”, “Q”, “S”] 
3) Final step
  • Click Save Suite

The ExtremeXP project is co-funded by the European Union Horizon Program HORIZON CL4-2022-DATA-01-01, under Grant Agreement No. 101093164

© ExtremeXP 2026. All Rights Reserved – Privacy Policy