%0 Journal Article %J Ethics and Information Technology %D 2023 %T Generative AI models should include detection mechanisms as a condition for public release %A Knott, Alistair %A Pedreschi, Dino %A Chatila, Raja %A Chakraborti, Tapabrata %A Leavy, Susan %A Baeza-Yates, Ricardo %A Eyers, David %A Trotman, Andrew %A Teal, Paul D. %A Biecek, Przemyslaw %A Russell, Stuart %A Bengio, Yoshua %X The new wave of ‘foundation models’—general-purpose generative AI models, for production of text (e.g., ChatGPT) or images (e.g., MidJourney)—represent a dramatic advance in the state of the art for AI. But their use also introduces a range of new risks, which has prompted an ongoing conversation about possible regulatory mechanisms. Here we propose a specific principle that should be incorporated into legislation: that any organization developing a foundation model intended for public use must demonstrate a reliable detection mechanism for the content it generates, as a condition of its public release. The detection mechanism should be made publicly available in a tool that allows users to query, for an arbitrary item of content, whether the item was generated (wholly or partly) by the model. In this paper, we argue that this requirement is technically feasible and would play an important role in reducing certain risks from new AI models in many domains. We also outline a number of options for the tool’s design, and summarize a number of points where further input from policymakers and researchers would be required. %B Ethics and Information Technology %V 25 %8 Jan-12-2023 %G eng %U https://link.springer.com/article/10.1007/s10676-023-09728-4 %! 
Ethics Inf Technol %R 10.1007/s10676-023-09728-4 %0 Journal Article %D 2021 %T Give more data, awareness and control to individual citizens, and they will help COVID-19 containment %A Mirco Nanni %A Andrienko, Gennady %A Barabasi, Albert-Laszlo %A Boldrini, Chiara %A Bonchi, Francesco %A Cattuto, Ciro %A Chiaromonte, Francesca %A Comandé, Giovanni %A Conti, Marco %A Coté, Mark %A Dignum, Frank %A Dignum, Virginia %A Domingo-Ferrer, Josep %A Ferragina, Paolo %A Fosca Giannotti %A Riccardo Guidotti %A Helbing, Dirk %A Kaski, Kimmo %A Kertész, János %A Lehmann, Sune %A Lepri, Bruno %A Lukowicz, Paul %A Matwin, Stan %A Jiménez, David Megías %A Anna Monreale %A Morik, Katharina %A Oliver, Nuria %A Passarella, Andrea %A Passerini, Andrea %A Dino Pedreschi %A Pentland, Alex %A Pianesi, Fabio %A Francesca Pratesi %A S Rinzivillo %A Salvatore Ruggieri %A Siebes, Arno %A Torra, Vicenc %A Roberto Trasarti %A Hoven, Jeroen van den %A Vespignani, Alessandro %X The rapid dynamics of COVID-19 calls for quick and effective tracking of virus transmission chains and early detection of outbreaks, especially in the “phase 2” of the pandemic, when lockdown and other restriction measures are progressively withdrawn, in order to avoid or minimize contagion resurgence. For this purpose, contact-tracing apps are being proposed for large scale adoption by many countries. A centralized approach, where data sensed by the app are all sent to a nation-wide server, raises concerns about citizens’ privacy and needlessly strong digital surveillance, thus alerting us to the need to minimize personal data collection and avoiding location tracking. 
We advocate the conceptual advantage of a decentralized approach, where both contact and location data are collected exclusively in individual citizens’ “personal data stores”, to be shared separately and selectively (e.g., with a backend system, but possibly also with other citizens), voluntarily, only when the citizen has tested positive for COVID-19, and with a privacy preserving level of granularity. This approach better protects the personal sphere of citizens and affords multiple benefits: it allows for detailed information gathering for infected people in a privacy-preserving fashion; and, in turn this enables both contact tracing, and, the early detection of outbreak hotspots on more finely-granulated geographic scale. The decentralized approach is also scalable to large populations, in that only the data of positive patients need be handled at a central level. Our recommendation is two-fold. First to extend existing decentralized architectures with a light touch, in order to manage the collection of location data locally on the device, and allow the user to share spatio-temporal aggregates—if and when they want and for specific aims—with health authorities, for instance. Second, we favour a longer-term pursuit of realizing a Personal Data Store vision, giving users the opportunity to contribute to collective good in the measure they want, enhancing self-awareness, and cultivating collective efforts for rebuilding society. %8 2021/02/02 %@ 1572-8439 %G eng %U https://link.springer.com/article/10.1007/s10676-020-09572-w %! Ethics and Information Technology %R https://doi.org/10.1007/s10676-020-09572-w %0 Conference Paper %B Formal Methods. FM 2019 International Workshops %D 2020 %T Analysis and Visualization of Performance Indicators in University Admission Tests %A Michela Natilli %A Daniele Fadda %A S Rinzivillo %A Dino Pedreschi %A Licari, Federica %E Sekerinski, Emil %E Moreira, Nelma %E Oliveira, José N. 
%E Ratiu, Daniel %E Riccardo Guidotti %E Farrell, Marie %E Luckcuck, Matt %E Marmsoler, Diego %E Campos, José %E Astarte, Troy %E Gonnord, Laure %E Cerone, Antonio %E Couto, Luis %E Dongol, Brijesh %E Kutrib, Martin %E Monteiro, Pedro %E Delmas, David %X This paper presents an analytical platform for evaluation of the performance and anomaly detection of tests for admission to public universities in Italy. Each test is personalized for each student and is composed of a series of questions, classified on different domains (e.g. maths, science, logic, etc.). Since each test is unique in its composition, it is crucial to guarantee a similar level of difficulty for all the tests in a session. For this reason, a domain expert assigns a level of difficulty to each question. Thus, the overall difficulty of a test depends on the correct classification of each item. We propose two approaches to detect outliers: a visualization-based approach using dynamic filters and responsive visual widgets, and a data mining approach that evaluates the performance of the different questions over five years. We used clustering to group the questions according to a set of performance indicators to provide a data-driven labeling of the level of difficulty. The measured level is compared with the level assigned a priori by experts. The misclassifications are then highlighted to the expert, who will be able to refine the question or the classification. Sequential pattern mining is used to check if biases are present in the composition of the tests and their performance. This analysis is meant to exclude overlaps or direct dependencies among questions. Analyzing co-occurrences we are able to state that the composition of each test is fair and uniform for all the students, even across several sessions. The analytical results are presented to the expert through a visual web application that loads the analytical data and indicators and composes an interactive dashboard. 
The user may explore the patterns and models extracted by filtering and changing thresholds and analytical parameters. %B Formal Methods. FM 2019 International Workshops %I Springer International Publishing %C Cham %8 2020// %@ 978-3-030-54994-7 %G eng %U https://link.springer.com/chapter/10.1007/978-3-030-54994-7_14 %R https://doi.org/10.1007/978-3-030-54994-7_14 %0 Journal Article %J Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery %D 2020 %T Bias in data-driven artificial intelligence systems—An introductory survey %A Ntoutsi, Eirini %A Fafalios, Pavlos %A Gadiraju, Ujwal %A Iosifidis, Vasileios %A Nejdl, Wolfgang %A Vidal, Maria-Esther %A Salvatore Ruggieri %A Franco Turini %A Papadopoulos, Symeon %A Krasanakis, Emmanouil %A others %X Artificial Intelligence (AI)‐based systems are widely employed nowadays to make decisions that have far‐reaching impact on individuals and society. Their decisions might affect everyone, everywhere, and anytime, entailing concerns about potential human rights issues. Therefore, it is necessary to move beyond traditional AI algorithms optimized for predictive performance and embed ethical and legal principles in their design, training, and deployment to ensure social good while still benefiting from the huge potential of the AI technology. The goal of this survey is to provide a broad multidisciplinary overview of the area of bias in AI systems, focusing on technical challenges and solutions as well as to suggest new research directions towards approaches well‐grounded in a legal frame. In this survey, we focus on data‐driven AI, as a large part of AI is powered nowadays by (big) data and powerful machine learning algorithms. If otherwise not specified, we use the general term bias to describe problems related to the gathering or processing of data that might result in prejudiced decisions on the bases of demographic features such as race, sex, and so forth. 
%B Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery %V 10 %P e1356 %G eng %U https://onlinelibrary.wiley.com/doi/full/10.1002/widm.1356 %R https://doi.org/10.1002/widm.1356 %0 Conference Paper %B Machine Learning and Knowledge Discovery in Databases %D 2020 %T Black Box Explanation by Learning Image Exemplars in the Latent Feature Space %A Riccardo Guidotti %A Anna Monreale %A Matwin, Stan %A Dino Pedreschi %E Brefeld, Ulf %E Fromont, Elisa %E Hotho, Andreas %E Knobbe, Arno %E Maathuis, Marloes %E Robardet, Céline %X We present an approach to explain the decisions of black box models for image classification. While using the black box to label images, our explanation method exploits the latent feature space learned through an adversarial autoencoder. The proposed method first generates exemplar images in the latent feature space and learns a decision tree classifier. Then, it selects and decodes exemplars respecting local decision rules. Finally, it visualizes them in a manner that shows to the user how the exemplars can be modified to either stay within their class, or to become counter-factuals by “morphing” into another class. Since we focus on black box decision systems for image classification, the explanation obtained from the exemplars also provides a saliency map highlighting the areas of the image that contribute to its classification, and areas of the image that push it into another class. We present the results of an experimental evaluation on three datasets and two black box models. Besides providing the most useful and interpretable explanations, we show that the proposed method outperforms existing explainers in terms of fidelity, relevance, coherence, and stability. 
%B Machine Learning and Knowledge Discovery in Databases %I Springer International Publishing %C Cham %8 2020// %@ 978-3-030-46150-8 %G eng %U https://link.springer.com/chapter/10.1007/978-3-030-46150-8_12 %R https://doi.org/10.1007/978-3-030-46150-8_12 %0 Journal Article %D 2020 %T Causal inference for social discrimination reasoning %A Qureshi, Bilal %A Kamiran, Faisal %A Karim, Asim %A Salvatore Ruggieri %A Dino Pedreschi %X The discovery of discriminatory bias in human or automated decision making is a task of increasing importance and difficulty, exacerbated by the pervasive use of machine learning and data mining. Currently, discrimination discovery largely relies upon correlation analysis of decisions records, disregarding the impact of confounding biases. We present a method for causal discrimination discovery based on propensity score analysis, a statistical tool for filtering out the effect of confounding variables. We introduce causal measures of discrimination which quantify the effect of group membership on the decisions, and highlight causal discrimination/favoritism patterns by learning regression trees over the novel measures. We validate our approach on two real world datasets. Our proposed framework for causal discrimination has the potential to enhance the transparency of machine learning with tools for detecting discriminatory bias both in the training data and in the learning algorithms. %V 54 %P 425 - 437 %8 2020/04/01 %@ 1573-7675 %G eng %U https://link.springer.com/article/10.1007/s10844-019-00580-x %! Journal of Intelligent Information Systems %R https://doi.org/10.1007/s10844-019-00580-x %0 Conference Paper %B International Symposium on Intelligent Data Analysis %D 2020 %T Digital Footprints of International Migration on Twitter %A Jisu Kim %A Alina Sirbu %A Fosca Giannotti %A Lorenzo Gabrielli %X Studying migration using traditional data has some limitations. 
To date, there have been several studies proposing innovative methodologies to measure migration stocks and flows from social big data. Nevertheless, a uniform definition of a migrant is difficult to find as it varies from one work to another depending on the purpose of the study and nature of the dataset used. In this work, a generic methodology is developed to identify migrants within the Twitter population. This describes a migrant as a person who has the current residence different from the nationality. The residence is defined as the location where a user spends most of his/her time in a certain year. The nationality is inferred from linguistic and social connections to a migrant’s country of origin. This methodology is validated first with an internal gold standard dataset and second with two official statistics, and shows strong performance scores and correlation coefficients. Our method has the advantage that it can identify both immigrants and emigrants, regardless of the origin/destination countries. The new methodology can be used to study various aspects of migration, including opinions, integration, attachment, stocks and flows, motivations for migration, etc. Here, we exemplify how trending topics across and throughout different migrant communities can be observed. %B International Symposium on Intelligent Data Analysis %I Springer %G eng %U https://link.springer.com/chapter/10.1007/978-3-030-44584-3_22 %R https://doi.org/10.1007/978-3-030-44584-3_22 %0 Journal Article %D 2020 %T An ethico-legal framework for social data science %A Forgó, Nikolaus %A Hänold, Stefanie %A van den Hoven, Jeroen %A Krügel, Tina %A Lishchuk, Iryna %A Mahieu, René %A Anna Monreale %A Dino Pedreschi %A Francesca Pratesi %A van Putten, David %X This paper presents a framework for research infrastructures enabling ethically sensitive and legally compliant data science in Europe. 
Our goal is to describe how to design and implement an open platform for big data social science, including, in particular, personal data. To this end, we discuss a number of infrastructural, organizational and methodological principles to be developed for a concrete implementation. These include not only systematically tools and methodologies that effectively enable both the empirical evaluation of the privacy risk and data transformations by using privacy-preserving approaches, but also the development of training materials (a massive open online course) and organizational instruments based on legal and ethical principles. This paper provides, by way of example, the implementation that was adopted within the context of the SoBigData Research Infrastructure. %8 2020/03/31 %@ 2364-4168 %G eng %U https://link.springer.com/article/10.1007/s41060-020-00211-7 %! International Journal of Data Science and Analytics %R https://doi.org/10.1007/s41060-020-00211-7 %0 Journal Article %J International Journal of Data Science and Analytics %D 2020 %T Human migration: the big data perspective %A Alina Sirbu %A Andrienko, Gennady %A Andrienko, Natalia %A Boldrini, Chiara %A Conti, Marco %A Fosca Giannotti %A Riccardo Guidotti %A Bertoli, Simone %A Jisu Kim %A Muntean, Cristina Ioana %A Luca Pappalardo %A Passarella, Andrea %A Dino Pedreschi %A Pollacci, Laura %A Francesca Pratesi %A Sharma, Rajesh %X How can big data help to understand the migration phenomenon? In this paper, we try to answer this question through an analysis of various phases of migration, comparing traditional and novel data sources and models at each phase. We concentrate on three phases of migration, at each phase describing the state of the art and recent developments and ideas. The first phase includes the journey, and we study migration flows and stocks, providing examples where big data can have an impact. The second phase discusses the stay, i.e. migrant integration in the destination country. 
We explore various data sets and models that can be used to quantify and understand migrant integration, with the final aim of providing the basis for the construction of a novel multi-level integration index. The last phase is related to the effects of migration on the source countries and the return of migrants. %B International Journal of Data Science and Analytics %P 1–20 %8 2020/03/23 %@ 2364-4168 %G eng %U https://link.springer.com/article/10.1007%2Fs41060-020-00213-5 %! International Journal of Data Science and Analytics %R https://doi.org/10.1007/s41060-020-00213-5 %0 Conference Paper %B Formal Methods. FM 2019 International Workshops %D 2020 %T “Know Thyself” How Personal Music Tastes Shape the Last.Fm Online Social Network %A Riccardo Guidotti %A Giulio Rossetti %E Sekerinski, Emil %E Moreira, Nelma %E Oliveira, José N. %E Ratiu, Daniel %E Riccardo Guidotti %E Farrell, Marie %E Luckcuck, Matt %E Marmsoler, Diego %E Campos, José %E Astarte, Troy %E Gonnord, Laure %E Cerone, Antonio %E Couto, Luis %E Dongol, Brijesh %E Kutrib, Martin %E Monteiro, Pedro %E Delmas, David %X As Nietzsche once wrote “Without music, life would be a mistake” (Twilight of the Idols, 1889.). The music we listen to reflects our personality, our way to approach life. In order to enforce self-awareness, we devised a Personal Listening Data Model that allows for capturing individual music preferences and patterns of music consumption. We applied our model to 30k users of Last.Fm for which we collected both friendship ties and multiple listening. Starting from such rich data we performed an analysis whose final aim was twofold: (i) capture, and characterize, the individual dimension of music consumption in order to identify clusters of like-minded Last.Fm users; (ii) analyze if, and how, such clusters relate to the social structure expressed by the users in the service. Do there exist individuals having similar Personal Listening Data Models? 
If so, are they directly connected in the social graph, or do they belong to the same community? %B Formal Methods. FM 2019 International Workshops %I Springer International Publishing %C Cham %8 2020// %@ 978-3-030-54994-7 %G eng %U https://link.springer.com/chapter/10.1007/978-3-030-54994-7_11 %R https://doi.org/10.1007/978-3-030-54994-7_11 %0 Conference Paper %B ECML PKDD 2020 Workshops %D 2020 %T Prediction and Explanation of Privacy Risk on Mobility Data with Neural Networks %A Francesca Naretto %A Roberto Pellungrini %A Nardini, Franco Maria %A Fosca Giannotti %E Koprinska, Irena %E Kamp, Michael %E Appice, Annalisa %E Loglisci, Corrado %E Antonie, Luiza %E Zimmermann, Albrecht %E Riccardo Guidotti %E Özgöbek, Özlem %E Ribeiro, Rita P. %E Gavaldà, Ricard %E Gama, João %E Adilova, Linara %E Krishnamurthy, Yamuna %E Ferreira, Pedro M. %E Malerba, Donato %E Medeiros, Ibéria %E Ceci, Michelangelo %E Manco, Giuseppe %E Masciari, Elio %E Ras, Zbigniew W. %E Christen, Peter %E Ntoutsi, Eirini %E Schubert, Erich %E Zimek, Arthur %E Anna Monreale %E Biecek, Przemyslaw %E S Rinzivillo %E Kille, Benjamin %E Lommatzsch, Andreas %E Gulla, Jon Atle %X The analysis of privacy risk for mobility data is a fundamental part of any privacy-aware process based on such data. Mobility data are highly sensitive. Therefore, the correct identification of the privacy risk before releasing the data to the public is of utmost importance. However, existing privacy risk assessment frameworks have high computational complexity. To tackle these issues, some recent work proposed a solution based on classification approaches to predict privacy risk using mobility features extracted from the data. In this paper, we propose an improvement of this approach by applying long short-term memory (LSTM) neural networks to predict the privacy risk directly from original mobility data. We empirically evaluate privacy risk on real data by applying our LSTM-based approach. 
Results show that our proposed method based on an LSTM network is effective in predicting the privacy risk, with F1 scores of up to 0.91. Moreover, to explain the predictions of our model, we employ a state-of-the-art explanation algorithm, SHAP. We explore the resulting explanation, showing how it is possible to provide effective predictions while explaining them to the end-user. %B ECML PKDD 2020 Workshops %I Springer International Publishing %C Cham %8 2020// %@ 978-3-030-65965-3 %G eng %U https://link.springer.com/chapter/10.1007/978-3-030-65965-3_34 %R https://doi.org/10.1007/978-3-030-65965-3_34 %0 Journal Article %J PloS one %D 2019 %T Algorithmic bias amplifies opinion fragmentation and polarization: A bounded confidence model %A Alina Sirbu %A Dino Pedreschi %A Fosca Giannotti %A Kertész, János %X The flow of information reaching us via the online media platforms is optimized not by the information content or relevance but by popularity and proximity to the target. This is typically performed in order to maximise platform usage. As a side effect, this introduces an algorithmic bias that is believed to enhance fragmentation and polarization of the societal debate. To study this phenomenon, we modify the well-known continuous opinion dynamics model of bounded confidence in order to account for the algorithmic bias and investigate its consequences. In the simplest version of the original model the pairs of discussion participants are chosen at random and their opinions get closer to each other if they are within a fixed tolerance level. We modify the selection rule of the discussion partners: there is an enhanced probability to choose individuals whose opinions are already close to each other, thus mimicking the behavior of online media which suggest interaction with similar peers. 
As a result we observe: a) an increased tendency towards opinion fragmentation, which emerges also in conditions where the original model would predict consensus, b) increased polarisation of opinions and c) a dramatic slowing down of the speed at which the convergence at the asymptotic state is reached, which makes the system highly unstable. Fragmentation and polarization are augmented by a fragmented initial population. %B PloS one %V 14 %P e0213246 %G eng %U https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0213246 %R 10.1371/journal.pone.0213246 %0 Journal Article %J Journal of Intelligent Information Systems %D 2019 %T Causal inference for social discrimination reasoning %A Qureshi, Bilal %A Kamiran, Faisal %A Karim, Asim %A Salvatore Ruggieri %A Dino Pedreschi %X The discovery of discriminatory bias in human or automated decision making is a task of increasing importance and difficulty, exacerbated by the pervasive use of machine learning and data mining. Currently, discrimination discovery largely relies upon correlation analysis of decisions records, disregarding the impact of confounding biases. We present a method for causal discrimination discovery based on propensity score analysis, a statistical tool for filtering out the effect of confounding variables. We introduce causal measures of discrimination which quantify the effect of group membership on the decisions, and highlight causal discrimination/favoritism patterns by learning regression trees over the novel measures. We validate our approach on two real world datasets. Our proposed framework for causal discrimination has the potential to enhance the transparency of machine learning with tools for detecting discriminatory bias both in the training data and in the learning algorithms. 
%B Journal of Intelligent Information Systems %P 1–13 %G eng %U https://link.springer.com/article/10.1007/s10844-019-00580-x %R 10.1007/s10844-019-00580-x %0 Journal Article %J ERCIM News %D 2019 %T Public opinion and Algorithmic bias %A Alina Sirbu %A Fosca Giannotti %A Dino Pedreschi %A Kertész, János %B ERCIM News %G eng %U https://ercim-news.ercim.eu/en116/special/public-opinion-and-algorithmic-bias %0 Journal Article %J Data Mining and Constraint Programming: Foundations of a Cross-Disciplinary Approach %D 2017 %T ICON Loop Carpooling Show Case %A Mirco Nanni %A Lars Kotthoff %A Riccardo Guidotti %A Barry O'Sullivan %A Dino Pedreschi %X In this chapter we describe a proactive carpooling service that combines induction and optimization mechanisms to maximize the impact of carpooling within a community. The approach autonomously infers the mobility demand of the users through the analysis of their mobility traces (i.e. Data Mining of GPS trajectories) and builds the network of all possible ride sharing opportunities among the users. Then, the maximal set of carpooling matches that satisfy some standard requirements (maximal capacity of vehicles, etc.) is computed through Constraint Programming models, and the resulting matches are proactively proposed to the users. Finally, in order to maximize the expected impact of the service, the probability that each carpooling match is accepted by the users involved is inferred through Machine Learning mechanisms and put in the CP model. The whole process is reiterated at regular intervals, thus forming an instance of the general ICON loop. 
%B Data Mining and Constraint Programming: Foundations of a Cross-Disciplinary Approach %V 10101 %P 310 %G eng %U https://link.springer.com/content/pdf/10.1007/978-3-319-50137-6.pdf#page=314 %0 Journal Article %J IEEE Intelligent Systems %D 2017 %T The Inductive Constraint Programming Loop %A Bessiere, Christian %A De Raedt, Luc %A Tias Guns %A Lars Kotthoff %A Mirco Nanni %A Siegfried Nijssen %A Barry O'Sullivan %A Paparrizou, Anastasia %A Dino Pedreschi %A Simonis, Helmut %X Constraint programming is used for a variety of real-world optimization problems, such as planning, scheduling and resource allocation problems. At the same time, one continuously gathers vast amounts of data about these problems. Current constraint programming software does not exploit such data to update schedules, resources and plans. We propose a new framework, which we call the inductive constraint programming loop. In this approach data is gathered and analyzed systematically in order to dynamically revise and adapt constraints and optimization criteria. Inductive Constraint Programming aims at bridging the gap between the areas of data mining and machine learning on the one hand, and constraint programming on the other. %B IEEE Intelligent Systems %G eng %R 10.1109/MIS.2017.265115706 %0 Journal Article %J arXiv preprint arXiv:1608.03735 %D 2016 %T Causal Discrimination Discovery Through Propensity Score Analysis %A Qureshi, Bilal %A Kamiran, Faisal %A Karim, Asim %A Salvatore Ruggieri %X Social discrimination is considered illegal and unethical in the modern world. Such discrimination is often implicit in observed decisions' datasets, and anti-discrimination organizations seek to discover cases of discrimination and to understand the reasons behind them. Previous work in this direction adopted simple observational data analysis; however, this can produce biased results due to the effect of confounding variables. 
In this paper, we propose a causal discrimination discovery and understanding approach based on propensity score analysis. The propensity score is an effective statistical tool for filtering out the effect of confounding variables. We employ propensity score weighting to balance the distribution of individuals from protected and unprotected groups w.r.t. the confounding variables. For each individual in the dataset, we quantify its causal discrimination or favoritism with a neighborhood-based measure calculated on the balanced distributions. Subsequently, the causal discrimination/favoritism patterns are understood by learning a regression tree. Our approach avoids common pitfalls in observational data analysis and makes its results legally admissible. We demonstrate the results of our approach on two discrimination datasets. %B arXiv preprint arXiv:1608.03735 %G eng %U https://arxiv.org/abs/1608.03735 %0 Generic %D 2016 %T Data Mining and Constraint Programming - Foundations of a Cross-Disciplinary Approach. %A Bessiere, Christian %A De Raedt, Luc %A Lars Kotthoff %A Siegfried Nijssen %A Barry O'Sullivan %A Dino Pedreschi %X A successful integration of constraint programming and data mining has the potential to lead to a new ICT paradigm with far reaching implications. It could change the face of data mining and machine learning, as well as constraint programming technology. It would not only allow one to use data mining techniques in constraint programming to identify and update constraints and optimization criteria, but also to employ constraints and criteria in data mining and machine learning in order to discover models compatible with prior knowledge. This book reports on some key results obtained on this integrated and cross-disciplinary approach within the European FP7 FET Open project no. 284715 on “Inductive Constraint Programming” and a number of associated workshops and Dagstuhl seminars. 
The book is structured in five parts: background; learning to model; learning to solve; constraint programming for data mining; and showcases. %G eng %R 10.1007/978-3-319-50137-6 %0 Journal Article %J Social Network Analysis and Mining %D 2016 %T Homophilic network decomposition: a community-centric analysis of online social services %A Giulio Rossetti %A Luca Pappalardo %A Riivo Kikas %A Dino Pedreschi %A Fosca Giannotti %A Marlon Dumas %X In this paper we formulate the homophilic network decomposition problem: Is it possible to identify a network partition whose structure is able to characterize the degree of homophily of its nodes? The aim of our work is to understand the relations between the homophily of individuals and the topological features expressed by specific network substructures. We apply several community detection algorithms on three large-scale online social networks—Skype, LastFM and Google+—and advocate the need of identifying the right algorithm for each specific network in order to extract a homophilic network decomposition. Our results show clear relations between the topological features of communities and the degree of homophily of their nodes in three online social scenarios: product engagement in the Skype network, number of listened songs on LastFM and homogeneous level of education among users of Google+. %B Social Network Analysis and Mining %V 6 %P 103 %G eng %R 10.1007/s1327 %0 Conference Paper %B Cloud Computing Technology and Science (CloudCom), 2016 IEEE International Conference on %D 2016 %T Privacy-Preserving Outsourcing of Pattern Mining of Event-Log Data-A Use-Case from Process Industry %A Marrella, Alessandro %A Anna Monreale %A Kloepper, Benjamin %A Krueger, Martin W %X With the advent of cloud computing and its model for IT services based on the Internet and big data centers, the interest of industries into XaaS ("Anything as a Service") paradigm is increasing. 
Business intelligence and knowledge discovery services are typical services that companies tend to externalize on the cloud, due to their data intensive nature and algorithmic complexity. What is appealing for a company is to rely on external expertise and infrastructure to compute the analytical results and models which are required by the business analysts for understanding the business phenomena under observation. Although it is advantageous to achieve sophisticated analysis, there exist several serious privacy issues in this paradigm. In this paper we investigate through an industrial use-case the application of a framework for privacy-preserving outsourcing of pattern mining on event-log data. Moreover, we present and discuss some ideas about possible extensions. %B Cloud Computing Technology and Science (CloudCom), 2016 IEEE International Conference on %I IEEE %G eng %R 10.1109/CloudCom.2016.0095 %0 Conference Paper %B International conference on Advances in Social Network Analysis and Mining %D 2015 %T Community-centric analysis of user engagement in Skype social network %A Giulio Rossetti %A Luca Pappalardo %A Riivo Kikas %A Dino Pedreschi %A Fosca Giannotti %A Marlon Dumas %B International conference on Advances in Social Network Analysis and Mining %I IEEE %C Paris, France %@ 978-1-4503-3854-7 %G eng %U http://dl.acm.org/citation.cfm?doid=2808797.2809384 %R 10.1145/2808797.2809384 %0 Conference Paper %B Principles and Practice of Constraint Programming %D 2015 %T Find Your Way Back: Mobility Profile Mining with Constraints %A Lars Kotthoff %A Mirco Nanni %A Riccardo Guidotti %A Barry O'Sullivan %X Mobility profile mining is a data mining task that can be formulated as clustering over movement trajectory data. The main challenge is to separate the signal from the noise, i.e. one-off trips. We show that standard data mining approaches suffer the important drawback that they cannot take the symmetry of non-noise trajectories into account. 
That is, if a trajectory has a symmetric equivalent that covers the same trip in the reverse direction, it should become more likely that neither of them is labelled as noise. We present a constraint model that takes this knowledge into account to produce better clusters. We show the efficacy of our approach on real-world data that was previously processed using standard data mining techniques. %B Principles and Practice of Constraint Programming %I Springer International Publishing %C Cork %G eng %0 Journal Article %J Journal of Trust Management %D 2015 %T A risk model for privacy in trajectory data %A Anirban Basu %A Anna Monreale %A Roberto Trasarti %A Juan Camilo Corena %A Fosca Giannotti %A Dino Pedreschi %A Shinsaku Kiyomoto %A Yutaka Miyake %A Tadashi Yanagihara %X Time sequence data relating to users, such as medical histories and mobility data, are good candidates for data mining, but often contain highly sensitive information. Different methods in privacy-preserving data publishing are utilised to release such private data so that individual records in the released data cannot be re-linked to specific users with a high degree of certainty. These methods provide theoretical worst-case privacy risks as measures of the privacy protection that they offer. However, often with many real-world data the worst-case scenario is too pessimistic and does not provide a realistic view of the privacy risks: the real probability of re-identification is often much lower than the theoretical worst-case risk. In this paper, we propose a novel empirical risk model for privacy which, in relation to the cost of privacy attacks, demonstrates better the practical risks associated with a privacy preserving data release. We show detailed evaluation of the proposed risk model by using k-anonymised real-world mobility data and then, we show how the empirical evaluation of the privacy risk has a different trend in synthetic data describing random movements. 
%B Journal of Trust Management %V 2 %P 9 %G eng %R 10.1186/s40493-015-0020-6 %0 Conference Paper %B Joint European Conference on Machine Learning and Knowledge Discovery in Databases %D 2014 %T Anti-discrimination analysis using privacy attack strategies %A Salvatore Ruggieri %A Sara Hajian %A Kamiran, Faisal %A Zhang, Xiangliang %X Social discrimination discovery from data is an important task to identify illegal and unethical discriminatory patterns towards protected-by-law groups, e.g., ethnic minorities. We deploy privacy attack strategies as tools for discrimination discovery under hard assumptions which have rarely been tackled in the literature: indirect discrimination discovery, privacy-aware discrimination discovery, and discrimination data recovery. The intuition comes from the intriguing parallel between the role of the anti-discrimination authority in the three scenarios above and the role of an attacker in private data publishing. We design strategies and algorithms inspired by or based on Fréchet bounds attacks, attribute inference attacks, and minimality attacks for the purpose of unveiling hidden discriminatory practices. Experimental results show that they can be effective tools in the hands of anti-discrimination authorities. %B Joint European Conference on Machine Learning and Knowledge Discovery in Databases %I Springer, Berlin, Heidelberg %G eng %R 10.1007/978-3-662-44851-9_44 %0 Conference Paper %B Cloud Computing Technology and Science (CloudCom), 2014 IEEE 6th International Conference on %D 2014 %T CF-inspired Privacy-Preserving Prediction of Next Location in the Cloud %A Anirban Basu %A Juan Camilo Corena %A Anna Monreale %A Dino Pedreschi %A Fosca Giannotti %A Shinsaku Kiyomoto %A Vaidya, Jaideep %A Yutaka Miyake %X Mobility data gathered from location sensors such as Global Positioning System (GPS) enabled phones and vehicles is valuable for spatio-temporal data mining for various location-based services (LBS). 
Such data is often considered sensitive and there exist many mechanisms for privacy-preserving analyses of the data. Through various anonymisation mechanisms, it can be ensured with a high probability that a particular individual cannot be identified when mobility data is outsourced to third parties for analysis. However, challenges remain with the privacy of the queries on outsourced analysis results, especially when the queries are sent directly to third parties by end-users. Drawing inspiration from our earlier work in privacy preserving collaborative filtering (CF) and next location prediction, in this exploratory work, we propose a novel representation of trajectory data in the CF domain and experiment with a privacy preserving Slope One CF predictor. We present evaluations for the accuracy and the computational performance of our proposal using anonymised data gathered from real traffic data in the Italian cities of Pisa and Milan. One use-case is a third-party location-prediction-as-a-service deployed on a public cloud, which can respond to privacy-preserving queries while enabling data owners to build a rich predictor on the cloud. %B Cloud Computing Technology and Science (CloudCom), 2014 IEEE 6th International Conference on %I IEEE %G eng %U http://dx.doi.org/10.1109/CloudCom.2014.114 %R 10.1109/CloudCom.2014.114 %0 Conference Paper %B Trust Management VIII - 8th IFIP WG 11.11 International Conference, IFIPTM 2014, Singapore, July 7-10, 2014. Proceedings %D 2014 %T A Privacy Risk Model for Trajectory Data %A Anirban Basu %A Anna Monreale %A Juan Camilo Corena %A Fosca Giannotti %A Dino Pedreschi %A Shinsaku Kiyomoto %A Yutaka Miyake %A Tadashi Yanagihara %A Roberto Trasarti %X Time sequence data relating to users, such as medical histories and mobility data, are good candidates for data mining, but often contain highly sensitive information. 
Different methods in privacy-preserving data publishing are utilised to release such private data so that individual records in the released data cannot be re-linked to specific users with a high degree of certainty. These methods provide theoretical worst-case privacy risks as measures of the privacy protection that they offer. However, often with many real-world data the worst-case scenario is too pessimistic and does not provide a realistic view of the privacy risks: the real probability of re-identification is often much lower than the theoretical worst-case risk. In this paper we propose a novel empirical risk model for privacy which, in relation to the cost of privacy attacks, demonstrates better the practical risks associated with a privacy preserving data release. We show detailed evaluation of the proposed risk model by using k-anonymised real-world mobility data. %B Trust Management VIII - 8th IFIP WG 11.11 International Conference, IFIPTM 2014, Singapore, July 7-10, 2014. 
Proceedings %P 125–140 %U http://dx.doi.org/10.1007/978-3-662-43813-8_9 %R 10.1007/978-3-662-43813-8_9 %0 Conference Paper %B Entity Relationship Conference - ER 2013 %D 2013 %T Baquara: A Holistic Ontological Framework for Movement Analysis with Linked Data %A Renato Fileto %A Marcelo Krüger %A Nikos Pelekis %A Yannis Theodoridis %A Chiara Renso %B Entity Relationship Conference - ER 2013 %C Hong Kong %0 Conference Paper %B Proceedings of the 3rd International Conference on Ambient Systems, Networks and Technologies (ANT 2012), the 9th International Conference on Mobile Web Information Systems (MobiWIS-2012), Niagara Falls, Ontario, Canada, August 27-29, 2012 %D 2012 %T An Agent-Based Model to Evaluate Carpooling at Large Manufacturing Plants %A Tom Bellemans %A Sebastian Bothe %A Sungjin Cho %A Fosca Giannotti %A Davy Janssens %A Luk Knapen %A Christine Körner %A Michael May %A Mirco Nanni %A Dino Pedreschi %A Hendrik Stange %A Roberto Trasarti %A Ansar-Ul-Haque Yasar %A Geert Wets %B Proceedings of the 3rd International Conference on Ambient Systems, Networks and Technologies (ANT 2012), the 9th International Conference on Mobile Web Information Systems (MobiWIS-2012), Niagara Falls, Ontario, Canada, August 27-29, 2012 %G eng %U http://dx.doi.org/10.1016/j.procs.2012.08.001 %R 10.1016/j.procs.2012.08.001 %0 Journal Article %J PLoS One %D 2012 %T RNA-Seq vs dual- and single-channel microarray data: sensitivity analysis for differential expression and clustering. %A Alina Sirbu %A Kerr, Gráinne %A Martin Crane %A Heather J Ruskin %X

With the fast development of high-throughput sequencing technologies, a new generation of genome-wide gene expression measurements is under way. This is based on mRNA sequencing (RNA-seq), which complements the already mature technology of microarrays, and is expected to overcome some of the latter's disadvantages. These RNA-seq data pose new challenges, however, as strengths and weaknesses have yet to be fully identified. Ideally, Next (or Second) Generation Sequencing measures can be integrated for more comprehensive gene expression investigation to facilitate analysis of whole regulatory networks. At present, however, the nature of these data is not very well understood. In this paper we study three alternative gene expression time series datasets for the Drosophila melanogaster embryo development, in order to compare three measurement techniques: RNA-seq, single-channel and dual-channel microarrays. The aim is to study the state of the art for the three technologies, with a view of assessing overlapping features, data compatibility and integration potential, in the context of time series measurements. This involves using established tools for each of the three different technologies, and technical and biological replicates (for RNA-seq and microarrays, respectively), due to the limited availability of biological RNA-seq replicates for time series data. The approach consists of a sensitivity analysis for differential expression and clustering. In general, the RNA-seq dataset displayed highest sensitivity to differential expression. The single-channel data performed similarly for the differentially expressed genes common to gene sets considered. Cluster analysis was used to identify different features of the gene space for the three datasets, with higher similarities found for the RNA-seq and single-channel microarray dataset.

%B PLoS One %V 7 %P e50986 %8 2012 %G eng %R 10.1371/journal.pone.0050986 %0 Journal Article %J Nature Methods %D 2012 %T Wisdom of crowds for robust gene network inference %A Daniel Marbach %A J.C. Costello %A Robert Küffner %A N.M. Vega %A R.J. Prill %A D.M. Camacho %A K.R. Allison %A Manolis Kellis %A J.J. Collins %A Aderhold, A. %A Gustavo Stolovitzky %A Bonneau, R. %A Chen, Y. %A Cordero, F. %A Martin Crane %A Dondelinger, F. %A Drton, M. %A Esposito, R. %A Foygel, R. %A De La Fuente, A. %A Gertheiss, J. %A Geurts, P. %A Greenfield, A. %A Grzegorczyk, M. %A Haury, A.-C. %A Holmes, B. %A Hothorn, T. %A Husmeier, D. %A Huynh-Thu, V.A. %A Irrthum, A. %A Karlebach, G. %A Lebre, S. %A De Leo, V. %A Madar, A. %A Mani, S. %A Mordelet, F. %A Ostrer, H. %A Ouyang, Z. %A Pandya, R. %A Petri, T. %A Pinna, A. %A Poultney, C.S. %A Rezny, S. %A Heather J Ruskin %A Saeys, Y. %A Shamir, R. %A Alina Sirbu %A Song, M. %A Soranzo, N. %A Statnikov, A. %A N.M. Vega %A Vera-Licona, P. %A Vert, J.-P. %A Visconti, A. %A Haizhou Wang %A Wehenkel, L. %A Windhager, L. %A Zhang, Y. %A Zimmer, R. %B Nature Methods %V 9 %P 796-804 %G eng %U http://www.scopus.com/inward/record.url?eid=2-s2.0-84870305264&partnerID=40&md5=04a686572bdefff60157bf68c95df7ea %R 10.1038/nmeth.2016 %0 Journal Article %J Nat Methods %D 2012 %T Wisdom of crowds for robust gene network inference. %A Daniel Marbach %A J.C. Costello %A Robert Küffner %A N.M. Vega %A R.J. Prill %A D.M. Camacho %A K.R. Allison %A Manolis Kellis %A J.J. Collins %A Gustavo Stolovitzky %X

Reconstructing gene regulatory networks from high-throughput data is a long-standing challenge. Through the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project, we performed a comprehensive blind assessment of over 30 network inference methods on Escherichia coli, Staphylococcus aureus, Saccharomyces cerevisiae and in silico microarray data. We characterize the performance, data requirements and inherent biases of different inference approaches, and we provide guidelines for algorithm application and development. We observed that no single inference method performs optimally across all data sets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse data sets. We thereby constructed high-confidence networks for E. coli and S. aureus, each comprising ~1,700 transcriptional interactions at a precision of ~50%. We experimentally tested 53 previously unobserved regulatory interactions in E. coli, of which 23 (43%) were supported. Our results establish community-based methods as a powerful and robust tool for the inference of transcriptional gene regulatory networks.

%B Nat Methods %V 9 %P 796-804 %8 2012 Aug %G eng %R 10.1038/nmeth.2016 %0 Book Section %B Data Mining and Knowledge Discovery Handbook %D 2010 %T Spatio-temporal clustering %A Slava Kisilevich %A Florian Mansmann %A Mirco Nanni %A S Rinzivillo %B Data Mining and Knowledge Discovery Handbook %P 855-874 %0 Conference Paper %B GIS %D 2008 %T Clustering of German municipalities based on mobility characteristics: an overview of results %A Andrea Zanda %A Christine Körner %A Fosca Giannotti %A Daniel Schulz %A Michael May %B GIS %P 69 %0 Book Section %B Mobility, Data Mining and Privacy %D 2008 %T Knowledge Discovery from Geographical Data %A S Rinzivillo %A Franco Turini %A Vania Bogorny %A Christine Körner %A Bart Kuijpers %A Michael May %B Mobility, Data Mining and Privacy %P 243-265 %0 Book Section %B Mobility, Data Mining and Privacy %D 2008 %T Querying and Reasoning for Spatiotemporal Data Mining %A Giuseppe Manco %A Miriam Baglioni %A Fosca Giannotti %A Bart Kuijpers %A Alessandra Raffaetà %A Chiara Renso %B Mobility, Data Mining and Privacy %P 335-374 %0 Book Section %D 2008 %T Querying and Reasoning for Spatio-Temporal Data Mining %A Giuseppe Manco %A Miriam Baglioni %A Fosca Giannotti %A Bart Kuijpers %A Alessandra Raffaetà %A Chiara Renso %I a Knowledge Discovery vision %C Mobility, Privacy, and Geography %G eng %0 Book Section %B Mobility, Data Mining and Privacy %D 2008 %T Spatiotemporal Data Mining %A Mirco Nanni %A Bart Kuijpers %A Christine Körner %A Michael May %A Dino Pedreschi %B Mobility, Data Mining and Privacy %P 267-296