Challenge 2: Intents
In this challenge you will use I2WG to automatically generate ExtremeXP-compatible analytical workflows for the given datasets. You will apply different configurations to different data and compare the results.
Estimated Time: 30 minutes
Difficulty Level: Easy (no extensive technical background required)
Background
Overview
This component translates user-defined intents into actionable workflows. The user starts the process by defining, at a high level, the analytical task to be executed. I2W extracts the key features from the description and maps them to a rich knowledge base. From there, it follows the dependencies indicated in the ontology to produce workflows that implement the defined task according to the specified intent. These workflows are initially encoded in RDF, which makes them easy to translate into other representations, such as the DSL required by the execution engine.
Architecture and Features
The backbone of the project is the backend/modules folder, which contains the two modules that implement the main functionality (belonging to T4.3 and T4.4). Besides the backend logic, we provide a frontend for easy and intuitive interaction with the system. The frontend communicates with the backend via a main API (backend/api) that collects the functionality exposed by the two modules.
The intentAnticipation module is in charge of anticipating, capturing and processing the user-defined intent. The user defines this intent by indicating the required parameters (type of task, dataset to use, etc.). Then, the system (i) maps the intent to the concepts defined by our knowledge base and (ii) provides the user with recommendations, extracted from past experiments, regarding the definition of the intent. That is, it indicates which additional constraints are recommended for optimizing the workflow (e.g. which algorithm is best to use). Optionally, the intent can be defined at an even higher level via natural language, which is processed by LLMs to extract the required elements.
Once the intent has been captured, the IntentSpecification2WorkflowGenerator module generates the corresponding workflow. This is done in a series of steps:
The data is annotated following the concepts represented in the knowledge base. This allows us to understand the characteristics of the data and provide the appropriate operations to adequately work with them.
The first proposed workflows are very abstract instantiations of the pipeline to encode. These include high-level tasks that provide a general orientation regarding how to define the workflow. One plan is created for each of the algorithms that can be employed for the task. The user has to select at least one of these plans.
The abstract plans are mapped to specific workflows that contain all the tasks necessary to execute the intent. These plans explore all the potential variability points regarding the needs of the intent, pruning those paths that are deemed less relevant (so as not to overwhelm the user with too many indistinguishable alternatives). These are divided by algorithm and by specific implementation of each algorithm (e.g. neural networks -> LSTM networks, convolutional networks, etc.). The user selects at least one of these plans.
Once the definitive list of workflows has been selected, these can be visualized, stored in the system for later use or exported in RDF format. Alternatively, we offer the possibility of directly converting the workflows to the DSL required by the experimentation engine.
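The staged narrowing described in the steps above can be illustrated with a minimal Python sketch. It is not the I2WG implementation: the toy knowledge base, the variability point and every name in it (IMPLEMENTATIONS, VARIABILITY, abstract_plans, logical_plans, workflows) are assumptions made up for illustration. It only shows how the number of resulting workflows depends on what is selected at each stage.

```python
from itertools import product

# Hypothetical toy knowledge base: algorithm -> concrete implementations.
# These names are illustrative only and are not taken from I2WG.
IMPLEMENTATIONS = {
    "decision_tree": ["decision_tree_learner"],
    "neural_network": ["lstm_network", "convolutional_network"],
}

# Hypothetical variability points explored for every implementation.
VARIABILITY = {
    "partitioning": ["random_relative", "top_relative"],
}


def abstract_plans(algorithms=None):
    """Return one abstract plan per algorithm, optionally restricted to a subset."""
    return [a for a in IMPLEMENTATIONS if algorithms is None or a in algorithms]


def logical_plans(selected_abstract):
    """Expand each selected abstract plan across implementations and variability points."""
    plans = []
    for algorithm in selected_abstract:
        for implementation in IMPLEMENTATIONS[algorithm]:
            for choice in product(*VARIABILITY.values()):
                plan = {"algorithm": algorithm, "implementation": implementation}
                plan.update(zip(VARIABILITY, choice))
                plans.append(plan)
    return plans


def workflows(selected_logical, keep=lambda plan: True):
    """Materialize only the logical plans accepted by the `keep` predicate."""
    return [plan for plan in selected_logical if keep(plan)]


# No restrictions at any stage: every combination becomes a workflow.
unrestricted = workflows(logical_plans(abstract_plans()))

# One algorithm and one partitioning strategy: far fewer workflows.
restricted = workflows(
    logical_plans(abstract_plans({"decision_tree"})),
    keep=lambda plan: plan["partitioning"] == "random_relative",
)

print(len(unrestricted), len(restricted))
```

Selecting fewer plans at an early stage shrinks everything that follows, which is worth keeping in mind when comparing the outputs of the two data products later in this challenge.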
Prerequisites
Basic understanding of Machine Learning pipelines.
Titanic dataset (titanic.csv):
- Format: CSV with mixed numeric and categorical features
- Size: 891 rows, 10 features
- Target: Binary column survived (0=No/1=Yes)
- Features include:
- Numeric: Age, Fare, SibSp (No. of siblings/spouses aboard the Titanic), Parch (No. of parents/children aboard the Titanic)
- Categorical: Pclass (Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd), Sex
Breast cancer dataset (breast_cancer.csv):
- Format: CSV with mixed numeric and categorical features
- Size: 569 rows, 30 features
- Target: Binary column diagnosis (0=Benign/1=Malignant)
- Features include:
- Numeric: mean concave points, mean symmetry, mean fractal dimension, radius error, texture error, perimeter error, area error, smoothness error, compactness error, concavity error
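Before importing the files, it can be useful to check that they match the descriptions above. The following is a minimal sketch using only the Python standard library. The helper name summarize_csv is hypothetical, and the target column names passed to it are assumptions: check the exact spelling and casing in the file header before using them.

```python
import csv
from collections import Counter


def summarize_csv(path, target):
    """Return (row count, column count, value counts of the target column)."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        if target not in columns:
            raise KeyError(f"target column {target!r} not found in {columns}")
        rows = 0
        counts = Counter()
        for row in reader:
            rows += 1
            counts[row[target]] += 1
    # The column count includes the target column itself.
    return rows, len(columns), dict(counts)


# Example usage (assumes the file is in the working directory and that the
# target column is spelled exactly as given here):
# print(summarize_csv("titanic.csv", "Survived"))
```

The row count should match the sizes listed above, and the target column should contain only two distinct values.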
Before starting the process, you need to create the required data products:
- Download datasets from the shared folder.
- Go to the data products page
- Click import from file
- To import data product 1: Import file titanic.csv from the shared folder
- To import data product 2: Repeat the same process and import breast_cancer.csv
Data Product 1
- Go to intents page
- Create a new intent with name TitanicIntent
- Infer intent type with textual input “Predict whether a passenger survived the Titanic disaster or not”
- Select Survived as target variable
- Click suggest parameters
- Execute abstract planner with no restrictions
- Execute logical planner only with the abstract plan with the best estimated accuracy
- Materialize only the logical plan that contains a random relative train-test partition
- Download intent to dsl
- Rename the downloaded ZIP file to Titanic_XXP.zip
Data Product 2
- Go to intents page
- Create a new intent with name BreastCancerIntent
- Infer intent type with textual input “Predict whether a breast tumor is malignant or benign based on measurements computed from a digitized image”
- Select y as target variable
- Click suggest parameters
- Execute abstract planner restricted using the recommended algorithm used in similar cases
- Execute logical planner with all abstract plans
- Execute workflow planner with all logical plans
- Download intent to dsl
- Rename the downloaded ZIP file to BreastCancer_XXP.zip
Comparison of results from data product 1 and data product 2
- Extract the contents of both ZIP files
- Compare both experiments.xxp files
- Is there any major difference in the number of experiments being generated? Why?
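The comparison can also be done programmatically. The sketch below reads experiments.xxp directly from each ZIP file and produces a line-based diff with the Python standard library. It assumes only that each archive contains a member whose name ends in experiments.xxp and that the file is UTF-8 text. It makes no assumption about the DSL syntax, so counting the experiments themselves is left to manual inspection. The function names are hypothetical.

```python
import difflib
import zipfile


def read_member(zip_path, suffix="experiments.xxp"):
    """Return the text lines of the first archive member whose name ends with `suffix`."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(suffix):
                return archive.read(name).decode("utf-8").splitlines()
    raise FileNotFoundError(f"no member ending in {suffix!r} in {zip_path}")


def compare_archives(left_zip, right_zip):
    """Return (left line count, right line count, unified diff lines)."""
    left = read_member(left_zip)
    right = read_member(right_zip)
    diff = list(
        difflib.unified_diff(
            left, right, fromfile=str(left_zip), tofile=str(right_zip), lineterm=""
        )
    )
    return len(left), len(right), diff


# Example usage (assumes both files are in the working directory):
# left_len, right_len, diff = compare_archives("Titanic_XXP.zip", "BreastCancer_XXP.zip")
# print(left_len, right_len)
# print("\n".join(diff))
```

A large gap between the two line counts is a quick hint that one intent produced many more experiments than the other.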
Deliverables
- titanic.csv
- breast_cancer.csv
- Titanic_XXP.zip (the file obtained from titanic intent)
- BreastCancer_XXP.zip (the file obtained from breast cancer intent)
Learning Resources
Feedback
Titanic_XXP.zip and BreastCancer_XXP.zip are generated according to the criteria specified in the exercise, that is, with the correct selection of workflows.