Challenge 8: Visualization and Explainability
In this challenge, you will explore the Visualization and Explainability capabilities of the ExtremeXP framework. Building on a predefined and already-implemented experiment, you will analyze workflow executions, compare model configurations, inspect metrics and datasets, and generate explainability insights to better understand model behavior.
This challenge focuses on post-execution analysis, interactive visualization, and explainability-driven interpretation, rather than experiment specification and execution (you will not need to write or modify any code or DSL definitions; all workflows, experiments, and tasks are provided).
Estimated Time: 45–60 minutes (excluding experiment execution time)
Difficulty Level: Intermediate (assumes a basic understanding of DSL workflows and experiments, and of explainability concepts)
Prerequisites
Before starting this challenge, you must have successfully completed Challenge 1: Core functionalities of the ExtremeXP framework, including:
- defining an experiment using the ExtremeXP DSL,
- sending the experiment to the Experimentation Engine,
- running the experiment on the Execution Engine (e.g. ProActive),
- verifying that workflows completed successfully.
No additional challenges are required prior to this one.
Before proceeding, note that although this challenge builds on the experiment already executed in Challenge 1, the workflow has been slightly adjusted to support more advanced ML visualization and explainability. In particular, the workflow includes an additional explainability-related task that prepares trained models, evaluation datasets, predictions, and ground-truth labels in a standardized form. It also ensures that the relevant artifacts produced during execution are persisted in the Decentralized Data Management (DDM) of ExtremeXP and made accessible to the Visualization Dashboard and Explainability Module.
These adjustments do not change the learning task, model training logic, or overall workflow structure; they are required solely to enable the analyses explored in this challenge.
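For intuition only, the sketch below illustrates the kind of artifacts such a preparation task persists. All names and the file layout are assumptions; the actual task in tasks/task6_explainability defines its own structure and uses the ExtremeXP DDM integration, which is not reproduced here.

```python
import joblib
import pandas as pd

def export_explainability_artifacts(model, X_test, y_test, output_dir="artifacts"):
    """Illustrative only: persist the trained model, evaluation features,
    predictions, and ground-truth labels in a standardized form so that
    downstream visualization/explainability components can load them."""
    joblib.dump(model, f"{output_dir}/model.joblib")                # trained model
    X_test.to_csv(f"{output_dir}/test_features.csv", index=False)   # evaluation dataset
    pd.DataFrame({
        "y_true": y_test,                                # ground-truth labels
        "y_pred": model.predict(X_test),                 # predicted labels
        "y_score": model.predict_proba(X_test)[:, 1],    # positive-class probabilities
    }).to_csv(f"{output_dir}/predictions.csv", index=False)
```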
Background
In this challenge, you will focus on the analytic capabilities of the ExtremeXP Visualization and Explainability Dashboard. As a reminder, the ExtremeXP framework consists of several interacting components that together support experiment execution, monitoring, and analysis. An overview of the system architecture was introduced in Challenge 1; the components most relevant to this challenge are summarized below.
The Experimentation Engine orchestrates the execution of experiments defined using the ExtremeXP DSL. It interprets workflows, expands parameter spaces, schedules workflow instances, and collects execution metadata and results. Once execution completes, all experiment-related information is made available to downstream components.
DDM stores datasets and data artifacts produced during execution, such as training and test datasets, model outputs, predictions, and exported artifacts. These datasets are dynamically loaded by the Visualization UI for interactive exploration and explainability analysis.
The executionware (e.g. ProActive) is responsible for running workflows and tasks defined by the Experimentation Engine. While execution is not the focus of this challenge, execution status and runtime behavior are visualized and analyzed through the UI.
The Visualization Dashboard, which is the main focus of this challenge, is a web-based user interface that enables interactive monitoring, exploration, comparison, and analysis of experiments, as well as more advanced ML visualization and explainability for experiments that support these capabilities. It integrates experiment monitoring, workflow analysis, model performance diagnostics, data exploration, and explainability views into a unified environment that supports human-in-the-loop experimentation.
In this challenge, the focus is on using the analytic capabilities of the dashboard to explore experiment results and explain model behavior.
Before starting, it is recommended that you familiarize yourself with the provided assets.
- Download the provided files from the shared folder.
- Inspect the provided DSL and task Python files.
- Review the files related to the explainability extension, located under tasks/task6_explainability. At a high level, identify how this task prepares models, datasets, predictions, and labels so that they can be consumed by the visualization and explainability components.
You are not required to modify any code or DSL files.
- Follow the same instructions as in Challenge 1 to execute the experiment on ProActive.
- Ensure ProActive is configured as the executionware.
- Submit the experiment for execution.
- In the ProActive UI, locate your experiment and note its experiment ID.
- Open the Visualization Dashboard for your experiment at https://vis.extremexp-icom.intracom-telecom.com/<experiment id>.
Once logged in, the central page of the Visualization Dashboard is the Experiment Monitoring Page. This page provides an overview of the experiment progress and results. Using this page:
- Inspect the experiment progress summary bar and verify that all workflows have completed successfully.
- Examine the Workflow Execution Table.
- Inspect variability points (e.g. model type, hyperparameters).
- Review recorded metrics such as accuracy, F1 score, ROC AUC, and execution time.
- Use sorting, filtering, and grouping controls to explore the results.
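As a reminder of what these table columns represent, the recorded metrics are standard classification scores. A minimal scikit-learn sketch, assuming `y_true`, `y_pred`, and `y_score` (positive-class probabilities) are available from one workflow's persisted predictions:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# `y_true`, `y_pred`, `y_score`: ground-truth labels, predicted labels,
# and positive-class probabilities of a single workflow (assumed names).
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_score)   # ranking quality, threshold-independent
print(f"accuracy={accuracy:.3f}  f1={f1:.3f}  roc_auc={roc_auc:.3f}")
```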
Questions
- Identify workflows with roc-auc higher than 0.71.
- Which one of them achieved the highest accuracy?
- Which model type (Logistic Regression or Random Forest) performed better on average, based on accuracy?
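The sorting and filtering controls of the table are sufficient to answer these questions. If you want to cross-check the answers offline, a minimal pandas sketch, assuming the workflow table can be exported to a CSV with hypothetical columns `workflow`, `model_type`, `accuracy`, and `roc_auc`:

```python
import pandas as pd

df = pd.read_csv("workflow_results.csv")   # hypothetical export of the workflow table

# Workflows with ROC AUC above 0.71, and the highest-accuracy one among them.
high_auc = df[df["roc_auc"] > 0.71]
best = high_auc.sort_values("accuracy", ascending=False).head(1)
print(best[["workflow", "model_type", "accuracy", "roc_auc"]])

# Average accuracy per model type (Logistic Regression vs Random Forest).
print(df.groupby("model_type")["accuracy"].mean())
```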
Using the Parallel Coordinates Plot:
- Select a performance metric (e.g. accuracy) as the color encoding.
- Explore how different parameter values relate to model performance.
- Identify sensitivities, trade-offs, and promising parameter regions.
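The dashboard renders this plot interactively. For reference only, a similar view can be sketched with Plotly Express under the same hypothetical CSV export as above; note that `parallel_coordinates` expects numeric dimensions, so categorical variability points such as model type would need to be encoded first.

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("workflow_results.csv")   # hypothetical export of the workflow table

fig = px.parallel_coordinates(
    df,
    dimensions=["rfMaxDepth", "rfNEstimators", "accuracy", "f1"],  # hypothetical columns
    color="accuracy",                        # metric used as the color encoding
    color_continuous_scale=px.colors.sequential.Viridis,
)
fig.show()
```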
Questions
- Which variability points appear to have a strong effect on accuracy?
Using the Comparative Analysis Page:
- Select the top 5 workflows with respect to accuracy.
- Compare them using:
  - metrics visualizations,
  - model insights (e.g. confusion matrices, ROC curves).
- Inspect how these configurations trade off precision and recall.
- Group the selected workflows by model type and compare their performance using metrics such as accuracy, precision, recall, F1 score, ROC AUC, as well as execution time.
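The Model Insights views correspond to standard classification diagnostics. A minimal scikit-learn sketch, assuming one workflow's persisted `y_true`, `y_pred`, and `y_score` (positive-class probabilities) are available:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay,
    RocCurveDisplay,
    classification_report,
)

# Confusion matrix, ROC curve, and precision-recall curve for one workflow.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
ConfusionMatrixDisplay.from_predictions(y_true, y_pred, ax=axes[0])
RocCurveDisplay.from_predictions(y_true, y_score, ax=axes[1])
PrecisionRecallDisplay.from_predictions(y_true, y_score, ax=axes[2])
plt.tight_layout()
plt.show()

# Per-class precision, recall, and F1 in text form.
print(classification_report(y_true, y_pred))
```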
Questions
(Based on the comparison of the top 5 workflows)
- Among the top-performing workflows in terms of accuracy, are there noticeable differences in recall?
(Based on the grouping by model type)
- Do the two model types exhibit systematic differences in performance (e.g. accuracy, F1 score, ROC AUC) or execution time?
1. Identify the workflow that:
- belongs to the top-performing group in terms of accuracy, and
- achieves the highest recall among those workflows.
2. Navigate to its Workflow Analysis Page and:
- Examine the workflow structure and task-level execution details.
- Inspect parameters, metrics, and input/output artifacts associated with each task.
1. Expand the Model Insights section for the workflow identified in the previous step.
2. Select the trained model entry and inspect model performance visualizations such as:
- confusion matrix
- ROC curve
- precision–recall curve
- classification report
3. Explore instance-level predictions using the Instance View.
- Identify misclassified instances.
- Focus on instances where the true label is 1 (churn), but the model predicted 0 (non-churn).
- Inspect their feature values.
4. Select one such instance and:
- generate local counterfactual explanations to understand which changes in feature values would lead the model to predict the correct class (a minimal offline sketch of this idea follows this list),
- generate hyperparameter counterfactuals to explore whether alternative training configurations would change the prediction,
- inspect and interpret the generated explanations and results.
5. Generate feature-level explainability results:
- feature importance,
- partial dependence plots (PDP),
- accumulated local effects (ALE).
6. Explore hyperparameter explainability views to understand how training configurations influence performance.
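The dashboard generates counterfactuals through its own Explainability Module; hyperparameter counterfactuals in particular are dashboard-specific and are not sketched here. Purely as an offline illustration of the idea behind local counterfactuals in step 4, a sketch with the dice-ml library, assuming `train_df` (training features plus a binary `churn` column), the trained sklearn classifier `model`, and a misclassified `query_instance` without the outcome column (all hypothetical names):

```python
import dice_ml

data = dice_ml.Data(
    dataframe=train_df,                                  # training data incl. outcome
    continuous_features=["tenure", "monthly_charges"],   # hypothetical column names
    outcome_name="churn",
)
ml_model = dice_ml.Model(model=model, backend="sklearn")
explainer = dice_ml.Dice(data, ml_model, method="random")

# Ask for feature changes that would flip the prediction to the true class (1 = churn).
counterfactuals = explainer.generate_counterfactuals(
    query_instance, total_CFs=3, desired_class=1
)
counterfactuals.visualize_as_dataframe(show_only_changes=True)
```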
Questions
Instance-level analysis
- Provide an example of a misclassified instance where the true label is 1 but the model predicted 0.
- Which feature changes, suggested by local counterfactual explanations, would lead the model to predict the correct class?
- Do hyperparameter counterfactuals suggest that changes in the training configuration could also correct this prediction? If yes, which ones?
SHAP-based explanations
- Using SHAP values for the selected instance:
- Which features contributed most to the model’s prediction?
- In which direction (positive or negative) and how much did they influence the predicted outcome?
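For reference, a minimal sketch of how SHAP values of this kind can be computed offline, assuming the trained tree-based model `model` and the evaluation features `X_test` (a pandas DataFrame) are available; the row index is hypothetical.

```python
import shap

explainer = shap.TreeExplainer(model)
explanation = explainer(X_test)            # shap.Explanation, one row per instance

# For binary classifiers TreeExplainer may return one set of values per class;
# keep only the positive (churn) class before plotting.
if len(explanation.shape) == 3:
    explanation = explanation[:, :, 1]

shap.plots.waterfall(explanation[42])      # per-feature contributions for one instance
shap.plots.beeswarm(explanation)           # global view across the evaluation set
```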
Comparing Feature Importances
- Compare feature importance and SHAP-based feature importance in the Feature Explainability section:
- Are the most important features the same?
- If not, how do they differ?
- Which two features are most important for the model's decisions?
Feature Effects
- How exactly do the two most important features affect the model's predictions?
- Provide any insights or hypotheses that the PDP and ALE plots reveal about customer behavior.
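As a point of comparison for the dashboard's feature-effect views, a minimal scikit-learn sketch of permutation importance and a one-dimensional PDP, assuming `model`, `X_test`, and `y_test` are available and `"tenure"` is a hypothetical feature name; ALE plots require a dedicated library and are not sketched here.

```python
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

# Rank features by how much shuffling each one degrades the model's score.
importance = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(
    zip(X_test.columns, importance.importances_mean), key=lambda item: -item[1]
)
for name, score in ranked[:5]:
    print(f"{name}: {score:.4f}")

# One-dimensional partial dependence of the prediction on a single feature.
PartialDependenceDisplay.from_estimator(model, X_test, features=["tenure"])
```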
Hyperparameter effects
- Across the workflows that train a Random Forest, analyze how changes in the rfMaxDepth hyperparameter affect different evaluation metrics (e.g. accuracy, recall, precision, F1-score).
- Which metric appears to be most sensitive to changes in rfMaxDepth?
- What evidence from the visualizations supports your conclusion?
- Generate a 2D Partial Dependence Plot (2D PDP) involving rfMaxDepth and one additional feature of your choice, using recall as the target metric.
- How do the two variables interact to influence recall?
- Are there regions of the parameter–feature space where recall improves or degrades significantly?
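The dashboard produces these hyperparameter-level plots for you. Conceptually, a 2D PDP over hyperparameters can be approximated by fitting a surrogate model that maps the recorded configurations to the recall of each Random Forest workflow, as in the sketch below; column names are hypothetical, the same CSV export as before is assumed, and rfMaxDepth is paired with a second hyperparameter for simplicity.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

df = pd.read_csv("workflow_results.csv")             # hypothetical export of the workflow table
rf_runs = df[df["model_type"] == "RandomForest"]     # keep only Random Forest workflows

# Surrogate model: hyperparameters -> recall observed for each workflow.
hyperparams = ["rfMaxDepth", "rfNEstimators"]
surrogate = RandomForestRegressor(random_state=0).fit(rf_runs[hyperparams], rf_runs["recall"])

# 2D partial dependence of predicted recall on the pair of hyperparameters.
PartialDependenceDisplay.from_estimator(
    surrogate, rf_runs[hyperparams], features=[("rfMaxDepth", "rfNEstimators")]
)
```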
Deliverables
Below we describe all artifacts that you must produce and submit for this challenge.
- Screenshots of:
  - the Experiment Monitoring Page with completed workflows,
  - at least one Workflow Analysis Page,
  - at least one explainability visualization (feature- or hyperparameter-level),
  - the Comparative Analysis Page comparing multiple workflows.
- A short report (or answers to the questions above) summarizing:
  - which workflow configuration performed best,
  - key insights obtained from visualization and explainability,
  - how explainability influenced your understanding of model behavior,
  - other key insights and observations.
Success criteria
- The provided experiment is successfully executed.
- The experiment is accessible in the Visualization Dashboard via its experiment ID.
- Workflow parameters, metrics, and artifacts are correctly registered and visible.
- The participant answered the posed questions correctly:
  - they found the best-performing workflow,
  - they drew correct conclusions about how different configurations impact performance,
  - they provided screenshots that serve as evidence for their observations.