Publications

Auditing for Spatial Fairness

Authors:
Dimitris Sacharidis, Giorgos Giannopoulos, George Papastefanatos, Kostas Stefanidis

Source: https://dx.doi.org/10.48786/edbt.2023.41

Abstract:
This paper studies algorithmic fairness when the protected attribute is location. To handle protected attributes that are continuous, such as age or income, the standard approach is to discretize the domain into predefined groups, and compare algorithmic outcomes across groups. However, applying this idea to location raises concerns of gerrymandering and may introduce statistical bias. Prior work addresses these concerns but only for regularly spaced locations, while raising other issues, most notably its inability to discern regions that are likely to exhibit spatial unfairness. Similar to established notions of algorithmic fairness, we define spatial fairness as the statistical independence of outcomes from location. This translates into requiring that for each region of space, the distribution of outcomes is identical inside and outside the region. To allow for localized discrepancies in the distribution of outcomes, we compare how well two competing hypotheses explain the observed outcomes. The null hypothesis assumes spatial fairness, while the alternate allows different distributions inside and outside regions. Their goodness of fit is then assessed by a likelihood ratio test. If there is no significant difference in how well the two hypotheses explain the observed outcomes, we conclude that the algorithm is spatially fair.
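
As a rough illustration of the likelihood ratio idea described in this abstract, the sketch below compares the two hypotheses for a single, pre-specified region with binary outcomes. It is not the paper's implementation: the Bernoulli outcome model, the function names, and the example counts are all assumptions made for illustration.

```python
import math


def bernoulli_loglik(positives, total, rate):
    """Log-likelihood of `positives` successes in `total` Bernoulli trials."""
    negatives = total - positives
    loglik = 0.0
    if positives:  # skip the term when it would be 0 * log(0)
        loglik += positives * math.log(rate)
    if negatives:
        loglik += negatives * math.log(1.0 - rate)
    return loglik


def region_lrt(pos_in, n_in, pos_out, n_out):
    """Likelihood ratio statistic for one region (hypothetical helper).

    Null hypothesis: one positive rate everywhere (spatial fairness).
    Alternative: separate rates inside and outside the region.
    Both n_in and n_out must be non-zero.
    """
    rate_all = (pos_in + pos_out) / (n_in + n_out)
    loglik_null = (bernoulli_loglik(pos_in, n_in, rate_all)
                   + bernoulli_loglik(pos_out, n_out, rate_all))
    loglik_alt = (bernoulli_loglik(pos_in, n_in, pos_in / n_in)
                  + bernoulli_loglik(pos_out, n_out, pos_out / n_out))
    return 2.0 * (loglik_alt - loglik_null)


# Made-up counts: identical rates give a statistic of zero,
# strongly different rates give a large statistic.
fair_stat = region_lrt(50, 100, 50, 100)
unfair_stat = region_lrt(80, 100, 20, 100)
```

For one region fixed in advance, the statistic can be compared against the chi-square critical value with one degree of freedom (about 3.84 at the 0.05 level). When many candidate regions are scanned, that threshold no longer applies directly and the significance level has to be calibrated differently.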

Evolvability of Machine Learning-based Systems: An Architectural Design Decision Framework

Authors:
Joran Leest, Ilias Gerostathopoulos and Claudia Raibulet

Source: https://doi.org/10.1109/ICSA-C57050.2023.00033

Abstract:
The increasing integration of machine learning (ML) in modern software systems has led to new challenges as a result of the shift from human-determined behavior to data-determined behavior. One of the key relevant challenges concerns concept drift (CD), i.e., the potential performance degradation due to changes in the data distribution. CD may severely affect the quality of the provided services, being also difficult to predict and detect, as well as costly to address. In this context, we focus on the evolvability of ML-based systems and the architectural considerations in addressing this concern. In this paper, we propose a novel scenario-based framework to support, justify and underpin architectural design decisions that address evolvability concerns in ML-based systems. The applicability and relevance of our framework are outlined through an illustrative example. We envision our framework to be extended to address other quality attributes important to ML-based systems and, overall, provide architectural support for ML operations (MLOps). Finally, we outline our plan to apply it in a number of industrial case studies, evaluate it with practitioners, and iteratively refine it.

Towards a Reference Component Model of Edge-Cloud Continuum

Authors:
Danylo Khalyeyev, Tomáš Bureš, and Petr Hnětynka

Source: https://doi.org/10.1109/ICSA-C57050.2023.00030

Abstract:
Edge-cloud continuum (ECC) is a novel paradigm that seeks to blend the worlds of cloud computing and IoT into a continuous ecosystem capable of providing access to a range of previously impossible applications with significantly improved quality of service. However, while using the term ECC becomes increasingly common, there is still no clear and commonly accepted consensus on what the term entails and which properties the ECC environment must possess. Consequently, there is a lack of tools and examples for reasoning about applications in ECC and their specific properties. In this paper, we present the results of our literature study aimed at identifying the most common properties ascribed to ECC. Based on this, we outline a reference component model that can serve as a tool for reasoning about ECC systems and their properties.

An Empirical Performance Comparison between Matrix Multiplication Join and Hash Join on GPUs

Authors:
Wenbo Sun, Asterios Katsifodimos and Rihan Hai

Source: https://doi.org/10.1109/ICDEW58674.2023.00034

Abstract:
Recent advances in Graphics Processing Units (GPUs) have facilitated a significant performance boost for database operators, in particular, joins. It has been intensively studied how conventional join implementations, such as hash joins, benefit from the massive parallelism of GPUs. With the proliferation of machine learning, more databases have started to provide native support for the basic building blocks of ML algorithms, i.e., linear algebra operators such as matrix multiplication (MM). Despite the recent increasing interest in processing relational joins using matrix multiplication (MM-join), two crucial questions still remain open: i) how efficient are current MM-join implementations compared to the GPU-based join algorithms; ii) how should practitioners choose among MM-join and conventional GPU-based joins given different data characteristics. In this paper, we compare the execution time and memory I/O of MM-join against multiple GPU hash joins. An empirical analysis of our experimental results reveals that the state-of-the-art hash join implementation shows substantial scalability for various data characteristics. In contrast, MM-join outperforms the SOTA hash join in low join selectivity and low table cardinality but shows unsatisfactory scalability due to synchronous data movement and computation.

Amalur: Data Integration Meets Machine Learning

Authors:
Rihan Hai, Christos Koutras, Andra Ionescu, Ziyu Li, Wenbo Sun, Jessie Van Schijndel, Yan Kang and Asterios Katsifodimos

Source: https://doi.org/10.1109/ICDE55515.2023.00301

Abstract:
Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manual work and computational resources. With data privacy and security constraints, data often cannot leave the premises of data silos, hence model training should proceed in a decentralized manner. In this work, we present a vision of how to bridge the traditional data integration (DI) techniques with the requirements of modern machine learning. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness and efficiency of ML models. Towards this direction, we analyze two common use cases over data silos, feature augmentation and federated learning. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning and federated learning.

Controlling Automatic Experiment-Driven Systems Using Statistics and Machine Learning

Authors:
Milad Abdullah

Source: https://doi.org/10.1007/978-3-031-36889-9_9

Abstract:
Experiments are used in many modern systems to optimize their operation. Such experiment-driven systems are used in various fields, such as web-based systems, smart-* systems, and various self-adaptive systems. There is a class of these systems that derive their data from running simulations or another type of computation, such as in digital twins, online planning using probabilistic model-checking, or performance benchmarking. To obtain statistically significant results, these systems must repeat the experiments multiple times. As a result, they consume extensive computation resources. The GraalVM benchmarking project detects performance changes in the GraalVM compiler. However, the benchmarking project has an extensive usage of computational resources and time. The doctoral research project proposed in this paper focuses on controlling the experiments with the goal of reducing computation costs. The plan is to use statistical and machine learning approaches to predict the outcomes of experiments and select the experiments yielding more useful information. As an evaluation, we are applying these methods to the GraalVM benchmarking project; the initial results confirm that these methods have the potential to significantly reduce computation costs.

Early Stopping of Non-productive Performance Testing Experiments Using Measurement Mutations

Authors:
Milad Abdullah, Lubomír Bulej, Tomáš Bureš, Vojtěch Horký and Petr Tůma

Source: https://doi.org/10.1109/SEAA60479.2023.00022

Abstract:
Modern software projects often incorporate some form of performance testing into their development cycle, intending to detect changes in performance between commits or releases. Performance testing generally relies on experimental evaluation using various benchmark workloads. To detect performance changes reliably, benchmarks must be executed many times to account for variability in the measurement results. While considered best practice, this approach can become prohibitively expensive when the number of versions and benchmark workloads increases. To alleviate the cost of performance testing, we propose an approach for the early stopping of non-productive experiments that are unlikely to detect a performance bug in a particular benchmark. The stopping conditions are based on benchmark-specific thresholds determined from historical data modified to emulate the potential effects of software changes on benchmark performance. We evaluate the approach on the GraalVM benchmarking project and show that it can eliminate about 50% of the experiments if we can afford to ignore about 15% of the least significant performance changes.

Online ML Self-adaptation in Face of Traps

Authors:
Michal Töpfer, František Plášil, Tomáš Bureš, Petr Hnětynka, Martin Kruliš and Danny Weyns

Source: https://doi.org/10.1109/ACSOS58161.2023.00023

Abstract:
Online machine learning (ML) is often used in self-adaptive systems to strengthen the adaptation mechanism and improve the system utility. Despite such benefits, applying online ML for self-adaptation can be challenging, and not many papers report its limitations. Recently, we experimented with applying online ML for self-adaptation of a smart farming scenario and we faced several unexpected difficulties – traps – that, to our knowledge, are not discussed enough in the community. In this paper, we report our experience with these traps. Specifically, we discuss several traps that relate to the specification and online training of the ML-based estimators, their impact on self-adaptation, and the approach used to evaluate the estimators. Our overview of these traps provides a list of lessons learned, which can serve as guidance for other researchers and practitioners when applying online ML for self-adaptation.

METIS: An Open-Architecture for Building AI-Ready Cloud Platforms – Application to Foster Research on Hydrological Modeling

Authors:
Vincent GAUDISSART, Yasmine BOULFANI, Kevin LARNIER, Gwendoline STEPHAN, Jacques COVES and Christophe TRIQUET – CS GROUP France

Source: https://data.europa.eu/doi/10.2760/46796

Abstract:
In today’s data-driven world, organizations often face the challenge of implementing and maintaining complex platforms for processing and leveraging big data and artificial intelligence (AI) technologies. This article introduces METIS, a powerful software suite that simplifies the creation of such platforms by providing a range of reusable components. METIS, based on open-source components, follows an open architecture philosophy, enabling integration with existing systems and the flexibility to meet diverse project requirements. We specifically highlight BISAW, an instantiation of METIS components tailored for Business Intelligence (BI) and AI applications using earth observation data. BISAW offers comprehensive functionality to manage the entire data lifecycle and fosters collaboration among data scientists, engineers, decision-makers, and data providers. This article explores the challenges of implementing such platforms and demonstrates how BISAW facilitates the exploitation of data through its flexibility, integration capabilities, and streamlined development flow. Furthermore, we present the current METIS use case involving the ExtremeXP EU Horizon-funded project.

Index Terms: Big Data, Earth Observation, Open Architecture, MLOps, Hydrology

GEANT Security Days: Extending UEBA for emerging threat detection, characterisation and intelligence generation

Authors:
Carolina FERNÁNDEZ, Maxime COMPASTIÉ, Nil ORTIZ RABELLA, Sonu PREETAM and Xavier MARRUGAT – i2Cat

Source: https://data.europa.eu/doi/10.2760/46796

Abstract:
This presentation aims to overcome some of the challenges regarding emerging and mutable threats, which may go unnoticed for some time due to a constrained data foundation that does not extract enough knowledge from the network status. We bring an AI, knowledge-based technology and one of its applied use cases to detect and categorise threats based on user’s, device’s and tool’s behaviour across the network. The presented technology can also be used to foster collaboration across academic and research centres regarding threat intelligence sharing, since both the extracted knowledge and some particularities of the models can be exported for others to learn, adapt and act on it.

An Approach for Intelligent Behaviour-Based Threat Modelling with Explanations

Authors:
Sonu Preetam, Maxime Compastié, Vanesa Daza, Shuaib Siddiqui

Source: https://doi.org/10.1109/NFV-SDN59219.2023.10329587

Abstract:
To disrupt the emergence of novel threats, defenders must obtain insights into the attacker’s behaviours through Tactics, Techniques, and Procedures (TTP) to establish adequate countermeasures. However, although detecting the usage of a subset of techniques is well documented and investigated, understanding the chaining of these techniques into a complete set of attack scenarios remains a manned process, prone to errors in complex and dynamic environments, such as software networks. In this paper, we propose a hybrid model for threat behaviour profiling. Our model exploits multimodal threat data using diverse real-time logs from virtualised environments to generate a novel dataset that maximises the explainability of a technique. Once a set of techniques is qualified, we leverage attack graphs and AI model explanations to correlate techniques usage into attack scenarios describing a complete behaviour from a threat actor. Our proposed approach is generalizable to distributed and heterogeneous environments, making it a promising method against ever-evolving threats.

Expert Driven Monitoring of Operational ML Models

Authors:
Joran Leest, Ilias Gerostathopoulos, Claudia Raibulet and Patricia Lago

Source: https://doi.org/10.48550/arXiv.2401.11993

Abstract:
We propose Expert Monitoring, an approach that leverages domain expertise to enhance the detection and mitigation of concept drift in machine learning (ML) models. Our approach supports practitioners by consolidating domain expertise related to concept drift-inducing events, making this expertise accessible to on-call personnel, and enabling automatic adaptability with expert oversight.

Article in the Information Systems Journal

Authors:
Joseph Giovanelli, Besim Bilalli, Alberto Abelló, Fernando Silva-Coira and Guillermo de Bernardo

Source: https://doi.org/10.1016/j.is.2023.102314

Abstract:
This work is a companion reproducibility paper of the experiments and results reported in Giovanelli et al. (2022), where data pre-processing pipelines are evaluated in order to find pipeline prototypes that reduce the classification error of supervised learning algorithms. With the recent shift towards data-centric approaches, where instead of the model, the dataset is systematically changed for better model performance, data pre-processing is receiving a lot of attention. Yet, its impact on the final analysis is not widely recognized, primarily due to the lack of publicly available experiments that quantify it. To bridge this gap, this work introduces a set of reproducible experiments on the impact of data pre-processing by providing a detailed reproducibility protocol together with a software tool and a set of extensible datasets, which allow for all the experiments and results of our aforementioned work to be reproduced. We introduce a set of strongly reproducible experiments based on a collection of intermediate results, and a set of weakly reproducible experiments (Lastra-Díaz, 0000) that allows reproducing our end-to-end optimization process and evaluation of all the methods reported in our primary paper. The reproducibility protocol is created in Docker and tested in Windows and Linux. In brief, our primary work (i) develops a method for generating effective prototypes, as templates or logical sequences of pre-processing transformations, and (ii) instantiates the prototypes into pipelines, in the form of executable or physical sequences of actual operators that implement the respective transformations. For the first, a set of heuristic rules learned from extensive experiments is used, and for the second, techniques from Automated Machine Learning (AutoML) are applied.

Non-Expert Level Analysis of Self-Adaptive Systems

Authors:
Claudia Raibulet and Xiaojun Ling

Source: https://doi.org/10.1007/978-981-97-0989-2_8

Abstract:
Self-adaptivity is mainly used to address uncertainties, unpredicted events, as well as to automate administration tasks. It allows systems to change themselves while executing in order to address expected or unexpected changes and to adapt as much as possible to the current execution context. Self-adaptivity is particularly meaningful for dynamic application domains such as Internet of Things (IoT), Cyber-Physical Systems (CPS), service oriented based solutions (SOA), cloud computing, robotics, among many others. There are various available solutions in these domains that exploit self-adaptivity. The question is how can we analyze them to understand how self-adaptivity is implemented and exploited in order to use and re-use, as well as to adapt existing solutions to new or other systems? In this paper, we propose a first step in this direction, by analyzing available self-adaptive systems (and especially their self-adaptive mechanisms) in various application domains using the Understand tool – widely used for software development, analysis, and quality assessment.

Model Selection with Model Zoo via Graph Learning

Authors:
Ziyu Li, Hilco van der Wilk, Danning Zhan, Megha Khosla, Alessandro Bozzon and Rihan Hai

Source: https://doi.org/10.48550/arXiv.2404.03988

Abstract:
Pre-trained deep learning (DL) models are increasingly accessible in public repositories, i.e., model zoos. Given a new prediction task, finding the best model to fine-tune can be computationally intensive and costly, especially when the number of pre-trained models is large. Selecting the right pre-trained models is crucial, yet complicated by the diversity of models from various model families (like ResNet, ViT, Swin) and the hidden relationships between models and datasets. Existing methods, which utilize basic information from models and datasets to compute scores indicating model performance on target datasets, overlook the intrinsic relationships, limiting their effectiveness in model selection. In this study, we introduce TransferGraph, a novel framework that reformulates model selection as a graph learning problem. TransferGraph constructs a graph using extensive metadata extracted from models and datasets, while capturing their inherent relationships. Through comprehensive experiments across 16 real datasets, both images and texts, we demonstrate TransferGraph’s effectiveness in capturing essential model-dataset relationships, yielding up to a 32% improvement in correlation between predicted performance and the actual fine-tuning results compared to the state-of-the-art methods.

Mitigating Data Sparsity in Integrated Data through Text Conceptualization

Authors:
Md Ataur Rahman, Sergi Nadal, Oscar Romero and Dimitris Sacharidis

Source: https://doi.org/10.1109/ICDE60146.2024.00269

Abstract:
We study the data sparsity problem for data generated from an integration system. We approach the problem from a textual information extraction perspective and propose to conceptualize external documents using the concepts in the integrated schema. We present THOR, a novel system that, unlike related approaches, relies neither on complex rules nor on models trained with a large annotated corpus, but on the integrated data and its schema without the need for human annotations. An extensive evaluation on the text conceptualization task demonstrates the superiority of our approach in terms of F1-score, effort and use of resources over the state-of-the-art language models.

HYPPO: Using Equivalences to Optimise Pipelines in Exploratory Machine Learning

Authors:
Antonis Kontaxakis, Dimitris Sacharidis, Alkis Simitsis, Alberto Abelló and Sergi Nadal

Source: https://doi.org/10.1109/ICDE60146.2024.00024

Abstract:
We present HYPPO, a novel system to optimize pipelines encountered in exploratory machine learning. HYPPO exploits alternative computational paths of artifacts from past executions to derive better execution plans while reusing materialized artifacts. Adding alternative computations introduces new challenges for exploratory machine learning regarding workload representation, system architecture, and optimal execution plan generation. To this end, we present a novel workload representation based on directed hypergraphs, and we formulate the problem of discovering the optimal execution plan as a search problem over directed hypergraphs and that of selecting artifacts to materialize as an optimization problem. A thorough experimental evaluation shows that HYPPO results in plans that are typically one order (up to two orders) of magnitude faster and cheaper than the non-optimized pipeline and considerably (up to one order of magnitude) faster and cheaper than plans generated by the state of the art when materializing artifacts is possible. Lastly, our evaluation reveals that HYPPO reduces the cost by 3–4× even when materialization cannot be exploited.

AutoFeat: Transitive Feature Discovery over Join Paths

Authors:
Andra Ionescu, Kiril Vasilev, Florena Buse, Rihan Hai and Asterios Katsifodimos

Source: https://doi.org/10.1109/ICDE60146.2024.00150

Abstract:
Can we automatically discover machine learning (ML) features in a large data lake in order to increase the accuracy of a given ML model? Existing solutions either focus on simple star schemata, failing to discover features in more complex real-world schemata or consider only PK-FK relationships in clean, curated databases. However, real-world data lakes can contain long join paths of uncurated joinability relationships resulting from automated dataset discovery methods. This paper proposes a novel ranking-based feature discovery method called AutoFeat. Given a base table with a target label, AutoFeat explores multi-hop, transitive join paths to find relevant features in order to augment the base table with additional features, ultimately leading to increased accuracy of an ML model. AutoFeat is general: it evaluates the predictive power of features without the need to train an ML model, ranking join paths using the concepts of relevance and redundancy. Our experiments on real-world open data show that AutoFeat is efficient: it can find features of high predictive power on data lakes with an increased number of dataset joinability relationships 5x-44x faster than baseline approaches. In addition, AutoFeat is effective, improving accuracy by 16% on average compared to the baseline approaches, even in noisy, uncurated data lakes.

Visualization-aware Time Series Min-Max Caching with Error Bound Guarantees, 50th International Conference on Very Large Databases (VLDB 2024)

Authors:
Stavros Maroulis, Vassilis Stamatopoulos, George Papastefanatos and Manolis Terrovitis

Source: https://doi.org/10.14778/3659437.3659460

Abstract:
This paper addresses the challenges in interactive visual exploration of large multi-variate time series data. Traditional data reduction techniques may improve latency but can distort visualizations. State-of-the-art methods aimed at 100% accurate visualization often fail to maintain interactive response times or require excessive preprocessing and additional storage. We propose an in-memory adaptive caching approach, MinMaxCache, that efficiently reuses previous query results to accelerate visualization performance within accuracy constraints. MinMaxCache fetches data at adaptively determined aggregation granularities to maintain interactive response times and generate approximate visualizations with accuracy guarantees. Our results show that it is up to 10 times faster than current solutions without significant accuracy compromise.
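
To make the min-max idea behind this abstract concrete, the following sketch reduces a time series to one (min, max) pair per pixel column, which is the kind of aggregate a line chart needs to draw each column's vertical extent. It is only an illustration under our own assumptions: the function name, the fixed bucket width, and the sample data are hypothetical, and the sketch does not include MinMaxCache's caching, adaptive granularity, or error-bound logic.

```python
def min_max_aggregate(points, t_start, t_end, width):
    """Reduce (timestamp, value) pairs to one (min, max) pair per pixel column.

    `width` is the number of pixel columns; the half-open range
    [t_start, t_end) is split into `width` equal buckets. Empty buckets
    stay None.
    """
    span = (t_end - t_start) / width
    buckets = [None] * width
    for timestamp, value in points:
        if not (t_start <= timestamp < t_end):
            continue  # ignore points outside the visible range
        # Clamp guards against floating-point rounding at the right edge.
        index = min(int((timestamp - t_start) / span), width - 1)
        if buckets[index] is None:
            buckets[index] = (value, value)
        else:
            low, high = buckets[index]
            buckets[index] = (min(low, value), max(high, value))
    return buckets


# Made-up series: four points rendered into a two-pixel-wide chart.
series = [(0, 1.0), (1, 5.0), (2, -2.0), (3, 4.0)]
columns = min_max_aggregate(series, t_start=0, t_end=4, width=2)
```

Keeping both extremes per column preserves spikes that a plain average or a sampled point would smooth away, which is why min-max aggregates are a common basis for visualization-oriented reduction.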

Optimizing Data Analytics Workflows through User-driven Experimentation, 3rd International Conference on AI Engineering, Software Engineering for AI (CAIN 2024)

Authors:
Keerthiga Rajenthiram

Source: https://doi.org/10.1145/3644815.3644971

Abstract:
In the Big Data era, efficient data analytics workflows are imperative to extract useful and meaningful insights. Data analysts and scientists spend an inordinate amount of time finding the best workflow via trial and error to get accurate and meaningful results that meet their expectations. We propose an Experimentation Engine that selects and optimizes the best workflow variant through continuous experimentation and having the user in the loop. Experimentation Engine saves time finding the workflow that satisfies the user requirements and provides accurate, useful and trustworthy results.

Adaptive Strategies Metric Suite

Authors:
Koen Kraaijveld and Claudia Raibulet

Source: https://doi.org/10.1007/978-3-031-64182-4_14

Abstract:
Self-Adaptive Systems (SAS) are able to cope with the changes occurring dynamically in their execution environment in an autonomous and automated manner. They use adaptive strategies to address such changes with the main objective to ensure the proper functionality of a SAS and its performance. Hence, adaptive strategies play a key role in SAS. To ensure their use and re-use in various systems in different application domains, there is a need for common mechanisms that enable their analysis, evaluation, and comparison. In this direction, our Adaptive Strategies Metric Suite (ASMS) defines a set of software metrics for the measurement of various static and dynamic properties of adaptive strategies in SAS. The metrics concerning the static properties have been implemented in a plugin for the Understand tool.

An Analysis of MLOps Architectures: A Systematic Mapping Study

Authors:
Faezeh Amou Najafabadi, Justus Bogner, Ilias Gerostathopoulos and Patricia Lago

Source: https://doi.org/10.1007/978-3-031-70797-1_5

Abstract:
Despite the increasing adoption of Machine Learning Operations (MLOps), teams still encounter challenges in effectively applying this paradigm to their specific projects. While there is a large variety of available tools usable for MLOps, there is simultaneously a lack of consolidated architecture knowledge that can inform the architecture design. Objective. Our primary objective is to provide a comprehensive overview of (i) how MLOps architectures are defined across the literature and (ii) which tools are mentioned to support the implementation of each architecture component. Method. We apply the Systematic Mapping Study method and select 43 primary studies via automatic, manual, and snowballing-based search and selection procedures. Subsequently, we use card sorting to synthesize the results. Results. We contribute (i) a categorization of 35 MLOps architecture components, (ii) a description of several MLOps architecture variants, and (iii) a systematic map between the identified components and the existing MLOps tools. Conclusion. This study provides an overview of the state of the art in MLOps from an architectural perspective. Researchers and practitioners can use our findings to inform the architecture design of their MLOps systems.

Reference Architecture of MLOps Workflows

Authors:
Faezeh Amou Najafabadi

Source: https://doi.org/10.1007/978-3-031-71246-3_6

Abstract:
The rapid growth in the adoption of Machine Learning Operations Workflows (MLOps WFs) has given rise to the development of numerous guidelines and tools aimed at supporting the creation and management of these WFs. However, MLOps stakeholders continue to encounter challenges in employing these guidelines and tools. Firstly, there is a lack of consensus on the standard implementation of MLOps. Secondly, the current tools only support one or a number of components within MLOps workflows, complicating their integration into end-to-end WFs. Furthermore, the tendency towards automation in MLOps has led to uncertainty about the optimal level of human involvement, raising concerns about whether complete automation is the ideal approach. Responding to these issues, our goal in this research is to aid the MLOps WF stakeholders by providing a comprehensive reference architecture that can be consulted as a basis of consolidated knowledge and experience in designing and managing MLOps WFs.

Discovery of Semantic Non-Syntactic Joins, DOLAP 2024: 26th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data

Authors:
Marc Maynou and Sergi Nadal

Source: http://hdl.handle.net/2117/409838

Abstract:
Data discovery is an essential step in the data integration pipeline involving finding datasets whose combined information provides relevant insights. Discovering joinable attributes requires assessing the closeness of the semantic concepts that two attributes represent, which is highly sensitive and dependent on the chosen similarity metric. The state of the art commonly approaches this task from a syntactic perspective, that is, performing comparisons based on the data values or on direct transformations (e.g., via hash functions). These approaches suffice when the two sets of instances share the same syntactic representation, but fail to detect cases in which the same semantic concept is represented by different sets of values, which we refer to as semantic non-syntactic joins. This is a relevant problem in data lake scenarios, when the underlying datasets present high heterogeneity and lack of standardization. To that end, in this paper, we propose an empirical approach to detect semantic non-syntactic joins, which leverages, simultaneously, syntactic and semantic measurements of the data. We demonstrate that our approach is effective in detecting this kind of join.

Capturing Analytical Intents from Text

Authors:
Gerard Pons, Miona Dimic and Besim Bilalli

Source: 

Abstract:
T

MAPE-K based Guidelines for Designing Reactive and Proactive Self-Adaptive Systems

Authors:
Hendrik Jilderda and Claudia Raibulet

Source: https://doi.org/10.1007/978-3-031-66326-0_4

Abstract:
The rapid evolution of complex software systems calls for novel approaches, such as self-adaptive systems that can autonomously adjust their behavior to meet changing requirements and dynamic contexts. This paper proposes guidelines for the development of reactive and proactive approaches to self-adaptive systems. The guidelines concern architecture-based approaches that explicitly use the MAPE-K (Monitor, Analyze, Plan, Execute using Knowledge) control feedback loop. The proposed guidelines are organized into five parts, one for each step of the MAPE-K loop including the knowledge, with the goal of providing guidance when designing and implementing self-adaptive systems from a software engineering point of view. Three self-adaptive artifacts (i.e., OCCI Monitoring, DingNet, and SWIM) from different application domains (i.e., cloud computing, Internet of Things, Web application) have been analyzed by considering the proposed guidelines. Their reactive and proactive characteristics are compared.

There is no Data Science without Data Governance: A Proposal Based on Knowledge Graphs, DOLAP 2024: 26th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data

Authors:
Besim Bilalli, Petar Jovanovic, Sergi Nadal, Anna Queralt and Oscar Romero

Source: http://hdl.handle.net/2117/409892

Abstract:
Data Science and data-driven Artificial Intelligence are here to stay and they are expected to further transform the current global economy. From a technical point of view, there is an overall agreement that disciplines based on data require to combine data engineering and data analysis skills, but the fact is that data engineering is nowadays trailing and catching up with the rapid changes in the data analysis landscape. To unleash the real power of data, data-centric systems must be professionalized, i.e., operationalized and systematized, so that repetitive, time-consuming and error-prone tasks are automated. To such end, we propose our vision on next generation data governance for data-centric systems based on knowledge graphs. We claim that without the knowledge embedded in the data governance layer, Data Science will not unleash its potential.

Flash flood modelling and in urban areas using High Resolution hydrodynamic model and machine learning models, Fifth Space for Hydrology Workshop

Authors:
K. Larnier, J. Coves, G. Stephan and L. Dumas

Source: i-aida.org

Abstract:
Modeling flash floods in urban areas with complex topography is always challenging. Using a fine-scale 2D hydrodynamic shallow-water model requires a lot of manual or semi-automatic data processing before simulations can be run. This involves transforming a high-resolution Digital Surface Model (Lidar) into a Digital Elevation Model that conserves the main hydraulic properties of the ground (culverts, weirs, barriers, etc.), as well as accurately delineating streets, buildings, etc. In the context of the ExtremeXP project funded by the European Commission, we assess the role of machine learning in improving the simulation and nowcasting (forecasting with a short-term horizon) of flash flood events in the city of Nîmes in the South of France. First, we prepare all relevant datasets to design a fine-scale 2D hydrodynamic model, and then we calibrate it on several historical flood events. Once this model is calibrated and validated, we use it as a reference for evaluating several improvement scenarios based on machine learning models. Two kinds of scenarios are analyzed. The first kind covers the machine learning techniques that would facilitate the design of the hydrodynamic model, either by reducing the amount of input data or by reducing the necessary data transformation processes. The second kind consists of designing surrogates for the reference hydrodynamic model itself, for nowcasting flood propagation during an event.

RBD24: A labelled dataset with risk activities using log application data

Authors:
Albert Calvo, Santiago Escuder, Nil Ortiz, Josep Escrig and Maxime Compastié

Source: https://doi.org/10.1016/j.cose.2024.104290

Abstract:
This paper introduces the Risk Activities Dataset 2024 (RBD24), an open-source dataset designed to facilitate the identification and analysis of risk activities within the cybersecurity domain. The RBD24 dataset is derived from multimodal application logs collected over a two-week period at a Spanish state university, identifying activities aligned with the early stages of the attack scenario. This dataset paves the way for novel User and Entity Behaviour Analytics (UEBA) and risk assessment frameworks within the cybersecurity domain. In detail, the dataset offers a fully user-centric approach by providing ground-truth data for various risk behaviours, including cryptocurrency activities, outdated software usage, P2P file sharing, and phishing incidents. These ground-truth data, identified through intrusion detection systems (IDS) and experimental campaigns, are represented as a set of indicators extracted from DNS, HTTP, SSL, and SMTP protocol logs. This dataset is expected to be a valuable resource for developing and benchmarking cybersecurity models, particularly in the realm of risk behaviour assessment.

The ExtremeXP project is co-funded by the European Union Horizon Program HORIZON-CL4-2022-DATA-01-01, under Grant Agreement No. 101093164
© ExtremeXP 2023. All Rights Reserved