Publications
Authors:
Dimitris Sacharidis, Giorgos Giannopoulos, George Papastefanatos, Kostas Stefanidis
Source: https://dx.doi.org/10.48786/edbt.2023.41
Abstract:
This paper studies algorithmic fairness when the protected attribute is location. To handle protected attributes that are continuous, such as age or income, the standard approach is to discretize the domain into predefined groups, and compare algorithmic outcomes across groups. However, applying this idea to location raises concerns of gerrymandering and may introduce statistical bias. Prior work addresses these concerns but only for regularly spaced locations, while raising other issues, most notably its inability to discern regions that are likely to exhibit spatial unfairness. Similar to established notions of algorithmic fairness, we define spatial fairness as the statistical independence of outcomes from location. This translates into requiring that for each region of space, the distribution of outcomes is identical inside and outside the region. To allow for localized discrepancies in the distribution of outcomes, we compare how well two competing hypotheses explain the observed outcomes. The null hypothesis assumes spatial fairness, while the alternate allows different distributions inside and outside regions. Their goodness of fit is then assessed by a likelihood ratio test. If there is no significant difference in how well the two hypotheses explain the observed outcomes, we conclude that the algorithm is spatially fair.
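The comparison of the two hypotheses for a single region can be illustrated with a minimal sketch. It assumes binary outcomes (for example, positive or negative decisions) and takes the counts inside and outside the region as given. The function names and inputs are hypothetical and not taken from the paper, and how the significance of the statistic is assessed is not shown.

```python
import math


def bernoulli_log_likelihood(positives, total, rate):
    """Log-likelihood of `positives` successes in `total` trials at success rate `rate`."""
    log_likelihood = 0.0
    if positives > 0:
        log_likelihood += positives * math.log(rate)
    if total - positives > 0:
        log_likelihood += (total - positives) * math.log(1.0 - rate)
    return log_likelihood


def region_log_likelihood_ratio(pos_in, n_in, pos_out, n_out):
    """Log-likelihood ratio of the alternative over the null hypothesis for one region.

    Null: a single positive rate explains outcomes everywhere (spatial fairness).
    Alternative: the inside and the outside of the region have their own rates.
    """
    if n_in <= 0 or n_out <= 0:
        raise ValueError("both the inside and the outside must contain observations")
    total = n_in + n_out
    positives = pos_in + pos_out
    # Each hypothesis is evaluated at its maximum-likelihood rate(s).
    null_ll = bernoulli_log_likelihood(positives, total, positives / total)
    alt_ll = (bernoulli_log_likelihood(pos_in, n_in, pos_in / n_in)
              + bernoulli_log_likelihood(pos_out, n_out, pos_out / n_out))
    return alt_ll - null_ll
```

A value close to zero means that the single-rate hypothesis explains the observations about as well as the two-rate one; a large value points to a region whose outcomes differ from the rest of the space.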
Authors:
Sonu Preetam, Maxime Compastié, Vanesa Daza, Shuaib Siddiqui
Source: https://doi.org/10.1109/NFV-SDN59219.2023.10329587
Abstract:
To disrupt the emergence of novel threats, defenders must obtain insights into the attacker’s behaviours through Tactics, Techniques, and Procedures (TTP) to establish adequate countermeasures. However, although detecting the usage of a subset of techniques is well documented and investigated, understanding the chaining of these techniques into a complete set of attack scenarios remains a manual process, prone to errors in complex and dynamic environments, such as software networks. In this paper, we propose a hybrid model for threat behaviour profiling. Our model exploits multimodal threat data using diverse real-time logs from virtualised environments to generate a novel dataset that maximises the explainability of a technique. Once a set of techniques is qualified, we leverage attack graphs and AI model explanations to correlate technique usage into attack scenarios describing the complete behaviour of a threat actor. Our proposed approach is generalizable to distributed and heterogeneous environments, making it a promising method against ever-evolving threats.
Index Terms— Big Data, Earth Observation, Open Architecture, MLOps, Hydrology
Authors:
Vincent GAUDISSART, Yasmine BOULFANI, Kevin LARNIER, Gwendoline STEPHAN, Jacques COVES and Christophe TRIQUET – CS GROUP France
Source: https://data.europa.eu/doi/10.2760/46796
Abstract:
In today’s data-driven world, organizations often face the challenge of implementing and maintaining complex platforms for processing and leveraging big data and artificial intelligence (AI) technologies. This article introduces METIS, a powerful software suite that simplifies the creation of such platforms by providing a range of reusable components. METIS, based on open-source components, follows an open architecture philosophy, enabling integration with existing systems and the flexibility to meet diverse project requirements. We specifically highlight BISAW, an instantiation of METIS components tailored for Business Intelligence (BI) and AI applications using earth observation data. BISAW offers comprehensive functionality to manage the entire data lifecycle and fosters collaboration among data scientists, engineers, decision-makers, and data providers. This article explores the challenges of implementing such platforms and demonstrates how BISAW facilitates the exploitation of data through its flexibility, integration capabilities, and streamlined development flow. Furthermore, we present the current METIS use case involving the ExtremeXP EU Horizon-funded project.
Authors:
Carolina FERNÁNDEZ, Maxime COMPASTIÉ, Nil ORTIZ RABELLA, Sonu PREETAM and Xavier MARRUGAT – i2Cat
Source: https://data.europa.eu/doi/10.2760/46796
Abstract:
This presentation aims to overcome some of the challenges posed by emerging and mutable threats, which may go unnoticed for some time due to a constrained data foundation that does not extract enough knowledge from the network status. We present an AI, knowledge-based technology and one of its applied use cases, which detects and categorises threats based on user, device and tool behaviour across the network. The presented technology can also be used to foster collaboration across academic and research centres on threat intelligence sharing, since both the extracted knowledge and some particularities of the models can be exported for others to learn from, adapt and act on.
Authors:
Danylo Khalyeyev, Tomáš Bureš, and Petr Hnětynka
Source: https://doi.org/10.1109/ICSA-C57050.2023.00030
Abstract:
Edge-cloud continuum (ECC) is a novel paradigm that seeks to blend the worlds of cloud computing and IoT into a continuous ecosystem capable of providing access to a range of previously impossible applications with significantly improved quality of service. However, while using the term ECC becomes increasingly common, there is still no clear and commonly accepted consensus on what the term entails and which properties the ECC environment must possess. Consequently, there is a lack of tools and examples for reasoning about applications in ECC and their specific properties. In this paper, we present the results of our literature study aimed at identifying the most common properties ascribed to ECC. Based on this, we outline a reference component model that can serve as a tool for reasoning about ECC systems and their properties.
Authors:
Milad Abdullah
Source: https://doi.org/10.1007/978-3-031-36889-9_9
Abstract:
Experiments are used in many modern systems to optimize their operation. Such experiment-driven systems are used in various fields, such as web-based systems, smart-* systems, and various self-adaptive systems. There is a class of these systems that derive their data from running simulations or another type of computation, such as in digital twins, online planning using probabilistic model-checking, or performance benchmarking. To obtain statistically significant results, these systems must repeat the experiments multiple times. As a result, they consume extensive computation resources. The GraalVM benchmarking project detects performance changes in the GraalVM compiler. However, the benchmarking project has an extensive usage of computational resources and time. The doctoral research project proposed in this paper focuses on controlling the experiments with the goal of reducing computation costs. The plan is to use statistical and machine learning approaches to predict the outcomes of experiments and select the experiments yielding more useful information. As an evaluation, we are applying these methods to the GraalVM benchmarking project; the initial results confirm that these methods have the potential to significantly reduce computation costs.
Authors:
Rihan Hai, Christos Koutras, Andra Ionescu, Ziyu Li, Wenbo Sun, Jessie Van Schijndel, Yan Kang and Asterios Katsifodimos
Source: https://doi.org/10.1109/ICDE55515.2023.00301
Abstract:
Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demands a lot of manual work and computational resources. With data privacy and security constraints, data often cannot leave the premises of data silos, hence model training should proceed in a decentralized manner. In this work, we present a vision of how to bridge the traditional data integration (DI) techniques with the requirements of modern machine learning. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness and efficiency of ML models. Towards this direction, we analyze two common use cases over data silos, feature augmentation and federated learning. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning and federated learning.
Authors:
Wenbo Sun, Asterios Katsifodimos and Rihan Hai
Source: https://doi.org/10.1109/ICDEW58674.2023.00034
Abstract:
Recent advances in Graphics Processing Units (GPUs) have facilitated a significant performance boost for database operators, in particular, joins. It has been intensively studied how conventional join implementations, such as hash joins, benefit from the massive parallelism of GPUs. With the proliferation of machine learning, more databases have started to provide native support for the basic building blocks of ML algorithms, i.e., linear algebra operators such as matrix multiplication (MM). Despite the recent increasing interest in processing relational joins using matrix multiplication (MM-join), two crucial questions still remain open: i) how efficient are current MM-join implementations compared to the GPU-based join algorithms; ii) how should practitioners choose among MM-join and conventional GPU-based joins given different data characteristics. In this paper, we compare the execution time and memory I/O of MM-join against multiple GPU hash joins. An empirical analysis of our experimental results reveals that the state-of-the-art hash join implementation shows substantial scalability for various data characteristics. In contrast, MM-join outperforms the SOTA hash join in low join selectivity and low table cardinality but shows unsatisfactory scalability due to synchronous data movement and computation.
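To illustrate the general idea of expressing an equi-join as a matrix multiplication, here is a minimal CPU-only sketch in plain Python. It is not the implementation evaluated in the paper: the one-hot encoding, the function names and the list-based matrices are assumptions chosen for readability, and a simple hash join is included only as a reference for comparison.

```python
def one_hot(keys, domain):
    """Encode each join key as a 0/1 row vector over the shared key domain."""
    column_of = {key: column for column, key in enumerate(domain)}
    rows = []
    for key in keys:
        row = [0] * len(domain)
        row[column_of[key]] = 1
        rows.append(row)
    return rows


def mm_join(left_keys, right_keys):
    """Equi-join via the product of the left matrix and the transposed right matrix.

    Entry (i, j) of the product is non-zero exactly when left row i and
    right row j carry the same join key.
    """
    domain = sorted(set(left_keys) | set(right_keys))
    left_matrix = one_hot(left_keys, domain)
    right_matrix = one_hot(right_keys, domain)
    pairs = []
    for i, left_row in enumerate(left_matrix):
        for j, right_row in enumerate(right_matrix):
            # Dot product of two one-hot rows: 1 if the keys match, else 0.
            if sum(a * b for a, b in zip(left_row, right_row)):
                pairs.append((i, j))
    return pairs


def hash_join(left_keys, right_keys):
    """Reference hash join: build on the right input, probe with the left."""
    table = {}
    for j, key in enumerate(right_keys):
        table.setdefault(key, []).append(j)
    return [(i, j) for i, key in enumerate(left_keys) for j in table.get(key, [])]
```

Both functions return the same pairs of matching row indices. The matrix formulation does work proportional to the product of the two table sizes and the key domain size, which hints at why its scalability depends on data characteristics.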
Authors:
Michal Töpfer, František Plášil, Tomáš Bureš, Petr Hnětynka, Martin Kruliš and Danny Weyns
Source: https://doi.org/10.1109/ACSOS58161.2023.00023
Abstract:
Online machine learning (ML) is often used in self-adaptive systems to strengthen the adaptation mechanism and improve the system utility. Despite such benefits, applying online ML for self-adaptation can be challenging, and not many papers report its limitations. Recently, we experimented with applying online ML for self-adaptation of a smart farming scenario and we faced several unexpected difficulties – traps – that, to our knowledge, are not discussed enough in the community. In this paper, we report our experience with these traps. Specifically, we discuss several traps that relate to the specification and online training of the ML-based estimators, their impact on self-adaptation, and the approach used to evaluate the estimators. Our overview of these traps provides a list of lessons learned, which can serve as guidance for other researchers and practitioners when applying online ML for self-adaptation.
Authors:
Milad Abdullah, Lubomír Bulej, Tomáš Bureš, Vojtěch Horký and Petr Tůma
Source: https://doi.org/10.1109/SEAA60479.2023.00022
Abstract:
Modern software projects often incorporate some form of performance testing into their development cycle, intending to detect changes in performance between commits or releases. Performance testing generally relies on experimental evaluation using various benchmark workloads. To detect performance changes reliably, benchmarks must be executed many times to account for variability in the measurement results. While considered best practice, this approach can become prohibitively expensive when the number of versions and benchmark workloads increases. To alleviate the cost of performance testing, we propose an approach for the early stopping of non-productive experiments that are unlikely to detect a performance bug in a particular benchmark. The stopping conditions are based on benchmark-specific thresholds determined from historical data modified to emulate the potential effects of software changes on benchmark performance. We evaluate the approach on the GraalVM benchmarking project and show that it can eliminate about 50% of the experiments if we can afford to ignore about 15% of the least significant performance changes.
Authors:
Joran Leest, Ilias Gerostathopoulos and Claudia Raibulet
Source: https://doi.org/10.1109/ICSA-C57050.2023.00033
Abstract:
The increasing integration of machine learning (ML) in modern software systems has led to new challenges as a result of the shift from human-determined behavior to data-determined behavior. One of the key relevant challenges concerns concept drift (CD), i.e., the potential performance degradation due to changes in the data distribution. CD may severely affect the quality of the provided services, and it is difficult to predict and detect, as well as costly to address. In this context, we focus on the evolvability of ML-based systems and the architectural considerations in addressing this concern. In this paper, we propose a novel scenario-based framework to support, justify and underpin architectural design decisions that address evolvability concerns in ML-based systems. The applicability and relevance of our framework are outlined through an illustrative example. We envision our framework being extended to address other quality attributes important to ML-based systems and, overall, providing architectural support for ML operations (MLOps). Finally, we outline our plan to apply it in a number of industrial case studies, evaluate it with practitioners, and iteratively refine it.
Authors:
Joseph Giovanelli, Besim Bilalli, Alberto Abelló, Fernando Silva-Coira and Guillermo de Bernardo
Source: https://doi.org/10.1016/j.is.2023.102314
Abstract:
This work is a companion reproducibility paper of the experiments and results reported in Giovanelli et al. (2022), where data pre-processing pipelines are evaluated in order to find pipeline prototypes that reduce the classification error of supervised learning algorithms. With the recent shift towards data-centric approaches, where instead of the model, the dataset is systematically changed for better model performance, data pre-processing is receiving a lot of attention. Yet, its impact on the final analysis is not widely recognized, primarily due to the lack of publicly available experiments that quantify it. To bridge this gap, this work introduces a set of reproducible experiments on the impact of data pre-processing by providing a detailed reproducibility protocol together with a software tool and a set of extensible datasets, which allow for all the experiments and results of our aforementioned work to be reproduced. We introduce a set of strongly reproducible experiments based on a collection of intermediate results, and a set of weakly reproducible experiments (Lastra-Díaz, 0000) that allows reproducing our end-to-end optimization process and evaluation of all the methods reported in our primary paper. The reproducibility protocol is created in Docker and tested in Windows and Linux. In brief, our primary work (i) develops a method for generating effective prototypes, as templates or logical sequences of pre-processing transformations, and (ii) instantiates the prototypes into pipelines, in the form of executable or physical sequences of actual operators that implement the respective transformations. For the first, a set of heuristic rules learned from extensive experiments is used, and for the second, techniques from Automated Machine Learning (AutoML) are applied.
Authors:
Joran Leest, Ilias Gerostathopoulos, Claudia Raibulet and Patricia Lago
Source: https://doi.org/10.48550/arXiv.2401.11993
Abstract:
We propose Expert Monitoring, an approach that leverages domain expertise to enhance the detection and mitigation of concept drift in machine learning (ML) models. Our approach supports practitioners by consolidating domain expertise related to concept drift-inducing events, making this expertise accessible to on-call personnel, and enabling automatic adaptability with expert oversight.
Authors:
Hendrik Jilderda and Claudia Raibulet
Source: https://doi.org/10.1007/978-3-031-66326-0_4
Abstract:
The rapid evolution of complex software systems calls for novel approaches, such as self-adaptive systems that can autonomously adjust their behavior to meet changing requirements and dynamic contexts. This paper proposes guidelines for the development of reactive and proactive approaches to self-adaptive systems. The guidelines concern architecture-based approaches that explicitly use the MAPE-K (Monitor, Analyze, Plan, Execute using Knowledge) control feedback loop. The proposed guidelines are organized into five parts, one for each step of the MAPE-K loop including the knowledge, with the goal of providing guidance when designing and implementing self-adaptive systems from a software engineering point of view. Three self-adaptive artifacts (i.e., OCCI Monitoring, DingNet, and SWIM) from different application domains (i.e., cloud computing, Internet of Things, Web application) have been analyzed by considering the proposed guidelines. Their reactive and proactive characteristics are compared.
Authors:
Claudia Raibulet and Xiaojun Ling
Source: https://doi.org/10.1007/978-981-97-0989-2_8
Abstract:
Self-adaptivity is mainly used to address uncertainties and unpredicted events, as well as to automate administration tasks. It allows systems to change themselves while executing in order to address expected or unexpected changes and to adapt as much as possible to the current execution context. Self-adaptivity is particularly meaningful for dynamic application domains such as Internet of Things (IoT), Cyber-Physical Systems (CPS), service-oriented solutions (SOA), cloud computing, and robotics, among many others. There are various available solutions in these domains that exploit self-adaptivity. The question is how we can analyze them to understand how self-adaptivity is implemented and exploited in order to use and re-use, as well as to adapt, existing solutions to new or other systems. In this paper, we propose a first step in this direction, by analyzing available self-adaptive systems (and especially their self-adaptive mechanisms) in various application domains using the Understand tool – widely used for software development, analysis, and quality assessment.
Authors:
Koen Kraaijveld and Claudia Raibulet
Source: https://doi.org/10.1007/978-3-031-64182-4_14
Abstract:
Self-Adaptive Systems (SAS) are able to cope with the changes occurring dynamically in their execution environment in an autonomous and automated manner. They use adaptive strategies to address such changes with the main objective to ensure the proper functionality of a SAS and its performance. Hence, adaptive strategies play a key role in SAS. To ensure their use and re-use in various systems in different application domains, there is a need for common mechanisms that enable their analysis, evaluation, and comparison. In this direction, our Adaptive Strategies Metric Suite (ASMS) defines a set of software metrics for the measurement of various static and dynamic properties of adaptive strategies in SAS. The metrics concerning the static properties have been implemented in a plugin for the Understand tool.
Authors:
Keerthiga Rajenthiram
Source: https://doi.org/10.1145/3644815.3644971
Abstract:
In the Big Data era, efficient data analytics workflows are imperative to extract useful and meaningful insights. Data analysts and scientists spend an inordinate amount of time finding the best workflow via trial and error to get accurate and meaningful results that meet their expectations. We propose an Experimentation Engine that selects and optimizes the best workflow variant through continuous experimentation and having the user in the loop. Experimentation Engine saves time finding the workflow that satisfies the user requirements and provides accurate, useful and trustworthy results.
Authors:
Md Ataur Rahman, Sergi Nadal, Oscar Romero and Dimitris Sacharidis
Source: https://doi.org/10.1109/ICDE60146.2024.00269
Abstract:
We study the data sparsity problem for data generated from an integration system. We approach the problem from a textual information extraction perspective and propose to conceptualize external documents using the concepts in the integrated schema. We present THOR, a novel system that, unlike related approaches, relies neither on complex rules nor on models trained with large annotated corpora, but on the integrated data and its schema, without the need for human annotations. An extensive evaluation on the text conceptualization task demonstrates the superiority of our approach in terms of F1-score, effort and use of resources over the state-of-the-art language models.
Authors:
Antonis Kontaxakis, Dimitris Sacharidis, Alkis Simitsis, Alberto Abelló and Sergi Nadal
Source: https://doi.org/10.1109/ICDE60146.2024.00024
Abstract:
We present HYPPO, a novel system to optimize pipelines encountered in exploratory machine learning. HYPPO exploits alternative computational paths of artifacts from past executions to derive better execution plans while reusing materialized artifacts. Adding alternative computations introduces new challenges for exploratory machine learning regarding workload representation, system architecture, and optimal execution plan generation. To this end, we present a novel workload representation based on directed hypergraphs, and we formulate the problem of discovering the optimal execution plan as a search problem over directed hypergraphs and that of selecting artifacts to materialize as an optimization problem. A thorough experimental evaluation shows that HYPPO results in plans that are typically one order (up to two orders) of magnitude faster and cheaper than the non-optimized pipeline and considerably (up to one order of magnitude) faster and cheaper than plans generated by the state of the art when materializing artifacts is possible. Lastly, our evaluation reveals that HYPPO reduces the cost by 3–4× even when materialization cannot be exploited.
Authors:
Stavros Maroulis, Vassilis Stamatopoulos, George Papastefanatos and Manolis Terrovitis
Source: https://doi.org/10.14778/3659437.3659460
Abstract:
This paper addresses the challenges in interactive visual exploration of large multi-variate time series data. Traditional data reduction techniques may improve latency but can distort visualizations. State-of-the-art methods aimed at 100% accurate visualization often fail to maintain interactive response times or require excessive preprocessing and additional storage. We propose an in-memory adaptive caching approach, MinMaxCache, that efficiently reuses previous query results to accelerate visualization performance within accuracy constraints. MinMaxCache fetches data at adaptively determined aggregation granularities to maintain interactive response times and generate approximate visualizations with accuracy guarantees. Our results show that it is up to 10 times faster than current solutions without significant accuracy compromise.
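As a rough illustration of aggregating a time series at a chosen granularity, the sketch below keeps the minimum and maximum value per time bucket. It assumes min-max aggregation over equal-width buckets, as the system's name suggests, and is not MinMaxCache's actual algorithm: the function name and parameters are hypothetical, and the caching and adaptive choice of granularity described in the abstract are not shown.

```python
def min_max_aggregate(points, start, end, buckets):
    """Aggregate (timestamp, value) points into `buckets` equal-width time buckets.

    Returns one (min, max) tuple per bucket, or None for a bucket with no points.
    Points outside the half-open interval [start, end) are ignored.
    """
    if buckets <= 0 or end <= start:
        raise ValueError("need a positive bucket count and a non-empty time range")
    width = (end - start) / buckets
    result = [None] * buckets
    for timestamp, value in points:
        if not (start <= timestamp < end):
            continue
        # Clamp to the last bucket to guard against floating-point rounding.
        index = min(int((timestamp - start) / width), buckets - 1)
        if result[index] is None:
            result[index] = (value, value)
        else:
            low, high = result[index]
            result[index] = (min(low, value), max(high, value))
    return result
```

Fewer buckets mean less data to fetch and draw at the price of a coarser picture; more buckets approach the raw series.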
Authors:
Gerard Pons, Miona Dimic and Besim Bilalli
Source:
Abstract:
T
Authors:
Andra Ionescu, Kiril Vasilev, Florena Buse, Rihan Hai and Asterios Katsifodimos
Source: https://doi.org/10.1109/ICDE60146.2024.00150
Abstract:
Can we automatically discover machine learning (ML) features in a large data lake in order to increase the accuracy of a given ML model? Existing solutions either focus on simple star schemata, failing to discover features in more complex real-world schemata or consider only PK-FK relationships in clean, curated databases. However, real-world data lakes can contain long join paths of uncurated joinability relationships resulting from automated dataset discovery methods. This paper proposes a novel ranking-based feature discovery method called AutoFeat. Given a base table with a target label, AutoFeat explores multi-hop, transitive join paths to find relevant features in order to augment the base table with additional features, ultimately leading to increased accuracy of an ML model. AutoFeat is general: it evaluates the predictive power of features without the need to train an ML model, ranking join paths using the concepts of relevance and redundancy. Our experiments on real-world open data show that AutoFeat is efficient: it can find features of high predictive power on data lakes with an increased number of dataset joinability relationships 5x-44x faster than baseline approaches. In addition, AutoFeat is effective, improving accuracy by 16% on average compared to the baseline approaches, even in noisy, uncurated data lakes.
Authors:
Ziyu Li, Hilco van der Wilk, Danning Zhan, Megha Khosla, Alessandro Bozzon and Rihan Hai
Source: https://doi.org/10.48550/arXiv.2404.03988
Abstract:
Pre-trained deep learning (DL) models are increasingly accessible in public repositories, i.e., model zoos. Given a new prediction task, finding the best model to fine-tune can be computationally intensive and costly, especially when the number of pre-trained models is large. Selecting the right pre-trained models is crucial, yet complicated by the diversity of models from various model families (like ResNet, ViT, Swin) and the hidden relationships between models and datasets. Existing methods, which utilize basic information from models and datasets to compute scores indicating model performance on target datasets, overlook the intrinsic relationships, limiting their effectiveness in model selection. In this study, we introduce TransferGraph, a novel framework that reformulates model selection as a graph learning problem. TransferGraph constructs a graph using extensive metadata extracted from models and datasets, while capturing their inherent relationships. Through comprehensive experiments across 16 real datasets, both images and texts, we demonstrate TransferGraph’s effectiveness in capturing essential model-dataset relationships, yielding up to a 32% improvement in correlation between predicted performance and the actual fine-tuning results compared to the state-of-the-art methods.
Authors:
Marc Maynou and Sergi Nadal
Source: http://hdl.handle.net/2117/409838
Abstract:
Data discovery is an essential step in the data integration pipeline that involves finding datasets whose combined information provides relevant insights. Discovering joinable attributes requires assessing the closeness of the semantic concepts that two attributes represent, which is highly sensitive to and dependent on the chosen similarity metric. The state of the art commonly approaches this task from a syntactic perspective, that is, performing comparisons based on the data values or on direct transformations (e.g., via hash functions). These approaches suffice when the two sets of instances share the same syntactic representation, but fail to detect cases in which the same semantic concept is represented by different sets of values, which we refer to as semantic non-syntactic joins. This is a relevant problem in data lake scenarios, where the underlying datasets present high heterogeneity and a lack of standardization. To that end, in this paper, we propose an empirical approach to detect semantic non-syntactic joins, which leverages, simultaneously, syntactic and semantic measurements of the data. We demonstrate that our approach is effective in detecting this kind of join.
Authors:
Besim Bilalli, Petar Jovanovic, Sergi Nadal, Anna Queralt and Oscar Romero
Source: http://hdl.handle.net/2117/409892
Abstract:
Data Science and data-driven Artificial Intelligence are here to stay and they are expected to further transform the current global economy. From a technical point of view, there is overall agreement that data-based disciplines require a combination of data engineering and data analysis skills, but the fact is that data engineering is nowadays trailing and catching up with the rapid changes in the data analysis landscape. To unleash the real power of data, data-centric systems must be professionalized, i.e., operationalized and systematized, so that repetitive, time-consuming and error-prone tasks are automated. To this end, we propose our vision of next-generation data governance for data-centric systems based on knowledge graphs. We claim that without the knowledge embedded in the data governance layer, Data Science will not unleash its potential.
Authors:
K. Larnier, J. Coves, G. Stephan and L. Dumas
Source: i-aida.org
Abstract:
Modeling flash floods in urban areas with complex topography is always challenging. Running simulations with a fine-scale 2D shallow-water hydrodynamic model requires a lot of manual or semi-automatic data processing beforehand. This involves transforming a high-resolution Digital Surface Model (Lidar) into a Digital Elevation Model that conserves the main hydraulic properties of the ground (culverts, weirs, barriers, etc.), as well as accurately delineating streets and buildings. In the context of the ExtremeXP project funded by the European Commission, we assess the role of machine learning in improving the simulation and nowcasting (forecasting with a short-term horizon) of flash flood events in the city of Nîmes in the South of France. First, we prepare all relevant datasets to design a fine-scale 2D hydrodynamic model and then we calibrate it on several historical flood events. Once this model is calibrated and validated, we use it as a reference for conducting several improvement scenarios using machine learning models. Two kinds of scenarios are analyzed. The first kind covers machine learning techniques that would facilitate the design of the hydrodynamic model, either by reducing the amount of input data or by reducing the necessary data transformation processes. The second kind consists of designing surrogates for the reference hydrodynamic model itself for nowcasting flood propagation during an event.