Challenge 2 : Intents

In this challenge you will use I2WG to automatically generate ExtremeXP-compatible analytical workflows from the given datasets. You will run it with different configurations and data, and compare the results.

Estimated Time : 30 minutes

Difficulty Level : Easy (no extensive technical background required)

Background

Overview

This component translates user-defined intents into actionable workflows. The user starts the process by defining, at a high level, the analytical task to be executed. I2W extracts the key features from the description and maps them to a rich knowledge base. From there, it follows the dependencies indicated in the ontology to produce workflows that implement the defined task according to the specified intention. These workflows are initially encoded in RDF, which makes them easy to translate to other representations, such as the DSL required by the execution engine.

Architecture and Features

The backbone of the project is the backend/modules folder, which contains the two modules that implement the main functionalities (belonging to T4.3 and T4.4). Besides the backend logic, we provide a frontend for easy and intuitive interaction with the system. The frontend communicates with the backend via a main API (backend/api) that exposes the functionalities of the two modules.

The intentAnticipation module is in charge of anticipating, capturing and processing the user-defined intent. The user defines this intent by indicating the required parameters (type of task, dataset to use, etc.). Then, the system (i) maps the intent to the concepts defined by our knowledge base and (ii) provides the user with recommendations, extracted from past experiments, regarding the definition of the intent. That is, it indicates which additional constraints are recommended for optimizing the workflow (e.g. the best algorithm to use). Optionally, the intent can be defined at an even higher level via natural language, which is processed by LLMs to extract the required elements.

The IntentSpecification2WorkflowGenerator module generates, once the intent has been captured, the corresponding workflow. This is done in a series of steps :

Data annotation

The data is annotated following the concepts represented in the knowledge base. This allows us to understand the characteristics of the data and provide the appropriate operations to adequately work with them.
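
As a rough illustration of what annotation involves, the sketch below labels each column of a CSV as numeric or categorical using only the Python standard library. The function name annotate_columns, the two labels and the sample rows are hypothetical and made up for this example; the actual module annotates data against the knowledge base and may work quite differently.

```python
import csv
import io


def _is_number(text):
    """Return True if the cell can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def annotate_columns(csv_text):
    """Label each column 'numeric' or 'categorical' (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    annotations = {}
    for column in reader.fieldnames or []:
        # Ignore empty cells so missing values do not affect the label.
        values = [row[column] for row in rows if row[column] not in ("", None)]
        is_numeric = bool(values) and all(_is_number(v) for v in values)
        annotations[column] = "numeric" if is_numeric else "categorical"
    return annotations


# Made-up sample rows, only to exercise the function.
sample = "Age,Sex,Fare\n22,male,7.25\n,female,71.28\n"
print(annotate_columns(sample))
```

A purely syntactic check like this would label a coded column such as Pclass as numeric, which is one reason annotation against a knowledge base is useful.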

Abstract plans

The first proposed workflows are very abstract instantiations of the pipeline to encode. They include high-level tasks that give a general orientation on how to define the workflow. One abstract plan is created for each algorithm that can be employed for the task. The user has to select at least one of these plans.

Logical plans

The abstract plans are mapped to specific workflows that contain all the tasks necessary to execute the intent. These plans explore all the potential variability points regarding the needs of the intent, pruning the paths that are deemed less relevant (so as not to overwhelm the user with too many indistinguishable alternatives). They are grouped by algorithm and by specific implementation of each algorithm (e.g. neural networks -> LSTM networks, convolutional networks, etc.). The user selects at least one of these plans.
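
The expansion from abstract to logical plans can be pictured as enumerating combinations of variability points and pruning some of them. The sketch below assumes a hypothetical catalogue of algorithms, implementations and partitioning options; none of these names come from the tool, and the real planner derives its alternatives from the ontology.

```python
from itertools import product

# Hypothetical catalogue: algorithm -> concrete implementations.
IMPLEMENTATIONS = {
    "neural_network": ["lstm", "convolutional"],
    "decision_tree": ["cart"],
}
# Hypothetical variability point shared by every plan.
PARTITIONS = ["random_relative_split", "top_k_split"]


def expand(abstract_plans, keep=None):
    """Enumerate logical plans for the selected abstract plans.

    `keep` is an optional predicate used to prune unwanted combinations.
    """
    logical = []
    for algorithm in abstract_plans:
        options = product(IMPLEMENTATIONS[algorithm], PARTITIONS)
        for implementation, partition in options:
            plan = {
                "algorithm": algorithm,
                "implementation": implementation,
                "partition": partition,
            }
            if keep is None or keep(plan):
                logical.append(plan)
    return logical


all_plans = expand(["neural_network", "decision_tree"])
pruned = expand(
    ["decision_tree"],
    keep=lambda plan: plan["partition"] == "random_relative_split",
)
print(len(all_plans), len(pruned))
```

Under these assumptions, selecting fewer abstract plans or keeping fewer logical plans directly reduces the number of workflows produced.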

Workflow representation

Once the definitive list of workflows has been selected, they can be visualized, stored in the system for later use or exported in RDF format. Alternatively, we offer the possibility of directly converting the workflows to the DSL required by the experimentation engine.
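
To give a feel for what an RDF encoding of a workflow can look like, the sketch below writes a linear sequence of steps as N-Triples lines. The namespace http://example.org/wf# and the property names hasStep, label and followedBy are invented for this example; the tool's actual ontology and serialization are not described in this document.

```python
BASE = "http://example.org/wf#"  # hypothetical namespace, not the tool's ontology


def workflow_to_ntriples(name, steps):
    """Serialize an ordered list of step names as N-Triples lines."""
    lines = []
    workflow = f"<{BASE}{name}>"
    previous = None
    for index, step in enumerate(steps):
        node = f"<{BASE}{name}_step{index}>"
        # Attach the step to the workflow and give it a readable label.
        lines.append(f"{workflow} <{BASE}hasStep> {node} .")
        lines.append(f'{node} <{BASE}label> "{step}" .')
        # Record the execution order between consecutive steps.
        if previous is not None:
            lines.append(f"{previous} <{BASE}followedBy> {node} .")
        previous = node
    return "\n".join(lines)


print(workflow_to_ntriples("titanic", ["load_data", "split", "train", "evaluate"]))
```

Because each statement is an independent subject-predicate-object triple, a converter to another representation only needs to walk the followedBy links in order.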

Prerequisites

Basic understanding of Machine Learning pipelines.

Exercise

The exercise consists of the steps shown below. The two datasets it uses are described first.

Titanic dataset description

Format: CSV with mixed numeric and categorical features
Size: 891 rows, 10 features
Target: Binary column survived (0=No/1=Yes)
Features include:
- Numeric: Age, Fare, SibSp (no. of siblings/spouses aboard the Titanic), Parch (no. of parents/children aboard the Titanic)
- Categorical: Pclass (ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd), Sex

Breast cancer dataset description

Format: CSV with mixed numeric and categorical features
Size: 569 rows, 30 features
Target: Binary column diagnosis (0=Benign/1=Malignant)
Features include:
- Numeric: mean concave points, mean symmetry, mean fractal dimension, radius error, texture error, perimeter error, area error, smoothness error, compactness error, concavity error

Step 0 : Create two data products

Before starting the process, you need to create the required data products.

Download datasets from the shared folder.
Go to the data products page
Click import from file
To import data product 1: Import file titanic.csv from the shared folder
To import data product 2: Repeat the same process and import breast_cancer.csv

Data Product 1

Step 1 : Create titanic intent

Go to intents page
Create a new intent with name TitanicIntent
Infer intent type with textual input “Predict whether a passenger survived the Titanic disaster or not”
Select Survived as target variable
Click suggest parameters

Step 2 : Create ExtremeXP experiment files

Execute abstract planner with no restrictions
Execute logical planner only with the abstract plan with the best estimated accuracy
Materialize only the logical plan that contains a random relative train-test partition
Download intent to dsl
Rename the downloaded ZIP file to Titanic_XXP.zip

Data Product 2

Step 3 : Create breast cancer intent

Go to intents page
Create a new intent with name BreastCancerIntent
Infer intent type with textual input “Predict whether a breast tumor is malignant or benign based on measurements computed from a digitized image”
Select y as target variable
Click suggest parameters

Step 4 : Generate ExtremeXP experiment files

Execute abstract planner restricted to the algorithm recommended from similar cases
Execute logical planner with all abstract plans
Execute workflow planner with all logical plans
Download intent to dsl
Rename the downloaded ZIP file to BreastCancer_XXP.zip

Comparison of results from data product 1 and data product 2

Step 5 : Analyze the results

Extract the contents of produced ZIPs
Compare both experiments.xxp files.
Is there any major difference in the number of experiments being generated? Why?
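
As a starting point for the comparison, the sketch below lists the files in each ZIP with their line counts, using only the Python standard library. The helper names summarize_zip and compare_zips are hypothetical. Since the syntax of the .xxp files is not described here, the script does not try to count experiments; it only shows which files exist and how large they are.

```python
import zipfile


def summarize_zip(path):
    """Return {member name: number of text lines} for every file in a ZIP."""
    summary = {}
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            text = archive.read(info).decode("utf-8", errors="replace")
            summary[info.filename] = len(text.splitlines())
    return summary


def compare_zips(path_a, path_b):
    """Print a side-by-side line count for the members of two ZIPs."""
    a, b = summarize_zip(path_a), summarize_zip(path_b)
    for name in sorted(set(a) | set(b)):
        print(f"{name}: {a.get(name, '-')} vs {b.get(name, '-')}")


# Example call, assuming both files are in the current directory:
# compare_zips("Titanic_XXP.zip", "BreastCancer_XXP.zip")
```

A dash in the output means the file is present in only one of the two archives.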

Deliverables

Input : you are provided with the following files in DDM

titanic.csv
breast_cancer.csv

Output : your Deliverables (What You Will Create)

Titanic_XXP.zip (the file obtained from titanic intent)
BreastCancer_XXP.zip (the file obtained from breast cancer intent)

Learning Resources

Intent2Workflows docs

Feedback

Click here to access the feedback form, send us your results and report potential usability issues.


Success criteria

Titanic_XXP.zip and BreastCancer_XXP.zip are generated according to the criteria specified in the exercise, that is, with the correct selection of workflows.