Challenge 5: Analytics task library

In this challenge, you will learn to include conditional tasks in a workflow. This allows you to incorporate external tasks provided by the analytics task library and to improve a workflow with tasks generated by the ExtremeXP Community.

Estimated Time: 60–90 minutes

Difficulty Level: Intermediate (intermediate DSL knowledge, some Python understanding)

Background

One of the benefits of ExtremeXP is its support for conditional workflows. These workflows allow additional experimental setups to be triggered when a condition is not achieved. For instance, one can declare that if a condition, typically a benchmarking metric, is not met, auxiliary workflows are executed with tasks intended to improve the results, and the new workflow is then evaluated to see whether it does so.

Figure 1: Conditional workflow

There is no magic recipe for determining which tasks a practitioner should include in these auxiliary workflows to improve the overall performance of a workflow. Nevertheless, the analytical catalogue provides a set of common tasks, identified in the literature, that can help improve benchmarking results. In a nutshell, the tasks included in the first release of ExtremeXP are the following:

Feature Selection

The process of identifying and retaining the most relevant features while removing redundant or irrelevant ones to improve model performance.

Data Integration

The task of combining data from multiple sources or modalities into a unified representation to improve the model by exploring complementary information.

Data Augmentation

The generation of additional training samples through transformations to increase data diversity and reduce overfitting.

From a technical perspective, Figure 2 illustrates the execution of an experiment that defines a conditional workflow. If the defined condition is not achieved when using the first workflow (for instance, an F-score > 0.65), a second workflow (Workflow 1″) is executed. This second workflow expands the experimentation space by including additional tasks.
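The control flow described above can be sketched in plain Python. This is an illustrative sketch, not the ExtremeXP engine API: `run_workflow`, the workflow names, and the 0.65 threshold are placeholders taken from the example in the text.

```python
def run_conditional_experiment(run_workflow, threshold=0.65):
    """Run Workflow 1; if its F-score does not exceed the threshold,
    fall back to the extended Workflow 1'' and return its result.

    run_workflow is a hypothetical callable mapping a workflow name
    to the F-score obtained by executing it.
    """
    f_score = run_workflow("workflow_1")
    if f_score > threshold:          # success criterion met, stop here
        return "workflow_1", f_score
    # Condition not achieved: execute the expanded workflow instead
    f_score = run_workflow("workflow_1_extended")
    return "workflow_1_extended", f_score


# Example with stubbed scores: the baseline misses the threshold,
# so the extended workflow is executed.
scores = {"workflow_1": 0.60, "workflow_1_extended": 0.72}
print(run_conditional_experiment(scores.get))
```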

Figure 2: Conditional workflow execution

Prerequisites

This challenge requires that you have already completed the first assignment.

Exercise

In this challenge we will use the E. coli dataset from the UCI Machine Learning Repository, a typical dataset for benchmarking classification algorithms. The objective of this dataset is to predict the cellular localization site of proteins in E. coli bacteria. As the success criterion of this experiment, we would like to obtain an F-score of at least 0.90.
  • Format

    CSV with numerical features.

  • Size

    336 rows, 8 features

  • Target

    Protein localization site (class label)

  • Features include

    mcg, gvh, lip, chg, aac, alm1, alm2
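With the format above, each row holds the seven numeric features followed by the class label. A minimal stdlib-only loader could look like the sketch below; the two sample rows and the exact column order are assumptions about the provided CSV, so check it against the actual file from the shared folder.

```python
import csv
import io

# Two illustrative rows in the assumed layout:
# mcg, gvh, lip, chg, aac, alm1, alm2, label
SAMPLE = (
    "0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp\n"
    "0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp\n"
)

def load_ecoli(fobj):
    """Parse rows of 7 numeric features plus a trailing class label."""
    features, labels = [], []
    for row in csv.reader(fobj):
        *values, label = row            # last column is the target
        features.append([float(v) for v in values])
        labels.append(label)
    return features, labels

feats, labs = load_ecoli(io.StringIO(SAMPLE))
print(len(feats), len(feats[0]), labs)
```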

Step 0: Set up the environment
  • Download the challenge material provided in the shared folder.
  • Follow the instructions in the README to generate a uv environment with all the dependencies to run the workflows.
Step 1: Understanding the ML problem
  • Once the environment is created, the next step is to generate a first workflow that runs a supervised machine learning model. In this assignment, we already provide the set of tasks; you will find them under the library-tasks directory:
    • tasks/load_data.py → Load the CSV.
    • tasks/preprocess_data.py → Split the data into train/test sets.
    • tasks/train_model.py → Train a classification model.
    • tasks/evaluate_model.py → Evaluate the model.
  • Complete the workflow_challenge5_baseline.dsl with the expected tasks and variability points. For simplicity, use a max_depth value between 2 and 4.
  • Run the workflow and analyse the results. Is the workflow obtaining the expected F-score?
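When checking the success criterion, it helps to recall what the F-score actually measures: the harmonic mean of precision and recall, F = 2PR/(P+R). A small pure-Python version for the binary case (evaluate_model.py may compute it differently, e.g. with a multiclass average):

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F-score: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0                      # no true positives => F-score is 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# One true positive, one false positive, one false negative:
# precision = recall = 0.5, so F = 0.5
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```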
Step 2: Define the Analytical Task library

In order to improve the analytical results of the workflow, one option suggested in the literature is the use of data augmentation.

  • Check the data augmentation endpoint (http://146.124.106.171:9010).
  • Complete the file tasks/analytics_catalogue.py with the correct endpoint information.
  • Complete the Challenge5_experiment2.xxp definition, and highlight how the condition is implemented.
  • Run the workflow again and evaluate whether the F-score improves when data augmentation is used.
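A task calling the augmentation service would typically POST the training rows as JSON to the endpoint above. The sketch below only builds the request; the `/augment` path and the `{"data": ...}` payload shape are assumptions for illustration, so use the contract documented by the endpoint itself when completing tasks/analytics_catalogue.py.

```python
import json
from urllib import request

# Endpoint given in the challenge text
AUGMENT_ENDPOINT = "http://146.124.106.171:9010"

def build_augment_request(rows, path="/augment"):
    """Build a POST request for the augmentation service.

    The path and JSON payload shape are hypothetical; check the
    service's own documentation for the real contract.
    """
    body = json.dumps({"data": rows}).encode()
    return request.Request(
        AUGMENT_ENDPOINT + path,
        data=body,
        headers={"Content-Type": "application/json"},
    )

# The request would be sent with request.urlopen(req) inside the task.
req = build_augment_request([[0.49, 0.29]])
print(req.full_url, req.get_method())
```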
Step 3: Define the new experiment using a conditional workflow
  • Create a file named analytics_experiment.xxp that defines the experiment space to explore.
  • Declare the experiment parameters and their value ranges.
  • Define the output metrics.
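Declaring parameters with value ranges implicitly defines a grid of configurations to explore. The sketch below enumerates such a grid in plain Python; the parameter names and ranges are examples, not the actual analytics_experiment.xxp schema.

```python
from itertools import product

# Hypothetical experiment space: names and ranges are illustrative only
space = {"max_depth": [2, 3, 4], "augmentation_factor": [1, 2]}

def expand(space):
    """Enumerate every parameter combination in the experiment space."""
    keys = list(space)
    return [dict(zip(keys, combo))
            for combo in product(*(space[k] for k in keys))]

# 3 values x 2 values -> 6 configurations to run and compare
print(len(expand(space)))
```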
Step 4: Improve the workflow with extended variability points

As you might observe, the potential variability points for Data Augmentation are statically defined. In this step, the workflow is extended to support additional variability points, enabling a more flexible exploration of data augmentation strategies.

  • Check the endpoint for the valid potential variability points.
  • Implement the new variability points.
  • Re-run the workflow.
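Before re-running, it is worth checking that every variability point you implement is one the endpoint actually supports. A tiny helper like the following can catch typos early; the point names in the example are made up for illustration.

```python
def invalid_points(chosen, allowed):
    """Return the chosen variability points the endpoint does not support."""
    return sorted(set(chosen) - set(allowed))


# Hypothetical names: "rotate" would be rejected here
print(invalid_points(["noise", "rotate", "mixup"], ["noise", "mixup"]))
```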

Deliverables

A document with the observed performance metrics

Success criteria

Execute all the steps!
The ExtremeXP project is co-funded by the European Union Horizon Program HORIZON CL4-2022-DATA-01-01, under Grant Agreement No. 101093164

© ExtremeXP 2025. All Rights Reserved