TY - CONF T1 - Agnostic Label-Only Membership Inference Attack T2 - 17th International Conference on Network and System Security Y1 - 2023 A1 - Anna Monreale A1 - Francesca Naretto A1 - Simone Rizzo JF - 17th International Conference on Network and System Security PB - Springer ER - TY - CONF T1 - Evaluating the Privacy Exposure of Interpretable Global and Local Explainers T2 - Submitted at Journal of Artificial Intelligence and Law Y1 - 2023 A1 - Francesca Naretto A1 - Anna Monreale A1 - Fosca Giannotti JF - Submitted at Journal of Artificial Intelligence and Law ER - TY - CONF T1 - Monitoring Fairness in HOLDA T2 - HHAI 2022: Augmenting Human Intellect - Proceedings of the First International Conference on Hybrid Human-Artificial Intelligence, Amsterdam, The Netherlands, 13-17 June 2022 Y1 - 2022 A1 - Michele Fontana A1 - Francesca Naretto A1 - Anna Monreale A1 - Fosca Giannotti ED - Stefan Schlobach ED - María Pérez-Ortiz ED - Myrthe Tielman JF - HHAI 2022: Augmenting Human Intellect - Proceedings of the First International Conference on Hybrid Human-Artificial Intelligence, Amsterdam, The Netherlands, 13-17 June 2022 PB - IOS Press UR - https://doi.org/10.3233/FAIA220205 ER - TY - Generic T1 - Semantic Enrichment of XAI Explanations for Healthcare T2 - 24th International Conference on Artificial Intelligence Y1 - 2022 A1 - Corbucci, Luca A1 - Anna Monreale A1 - Cecilia Panigutti A1 - Michela Natilli A1 - Smiraglio, Simona A1 - Dino Pedreschi AB - Explaining black-box models' decisions is crucial to increase doctors' trust in AI-based clinical decision support systems. However, current eXplainable Artificial Intelligence (XAI) techniques usually provide explanations that are not easily understandable by experts outside of AI. Furthermore, most of them produce explanations that consider only the input features of the algorithm. 
However, broader information about the clinical context of a patient is usually available even if not processed by the AI-based clinical decision support system for its decision. Enriching the explanations with relevant clinical information concerning the health status of a patient would increase the ability of human experts to assess the reliability of the AI decision. Therefore, in this paper we present a methodology that aims to enable clinical reasoning by semantically enriching AI explanations. Starting from a medical AI explanation based only on the input features provided to the algorithm, our methodology leverages medical ontologies and NLP embedding techniques to link relevant information present in the patient's clinical notes to the original explanation. We validate our methodology with two experiments involving a human expert. Our results highlight promising performance in correctly identifying relevant information about the diseases of the patients, in particular about the associated morphology. This suggests that the presented methodology could be a first step toward developing a natural language explanation of AI decision support systems. 
JF - 24th International Conference on Artificial Intelligence ER - TY - JOUR T1 - Give more data, awareness and control to individual citizens, and they will help COVID-19 containment Y1 - 2021 A1 - Mirco Nanni A1 - Andrienko, Gennady A1 - Barabasi, Albert-Laszlo A1 - Boldrini, Chiara A1 - Bonchi, Francesco A1 - Cattuto, Ciro A1 - Chiaromonte, Francesca A1 - Comandé, Giovanni A1 - Conti, Marco A1 - Coté, Mark A1 - Dignum, Frank A1 - Dignum, Virginia A1 - Domingo-Ferrer, Josep A1 - Ferragina, Paolo A1 - Fosca Giannotti A1 - Riccardo Guidotti A1 - Helbing, Dirk A1 - Kaski, Kimmo A1 - Kertész, János A1 - Lehmann, Sune A1 - Lepri, Bruno A1 - Lukowicz, Paul A1 - Matwin, Stan A1 - Jiménez, David Megías A1 - Anna Monreale A1 - Morik, Katharina A1 - Oliver, Nuria A1 - Passarella, Andrea A1 - Passerini, Andrea A1 - Dino Pedreschi A1 - Pentland, Alex A1 - Pianesi, Fabio A1 - Francesca Pratesi A1 - S Rinzivillo A1 - Salvatore Ruggieri A1 - Siebes, Arno A1 - Torra, Vicenc A1 - Roberto Trasarti A1 - Hoven, Jeroen van den A1 - Vespignani, Alessandro AB - The rapid dynamics of COVID-19 calls for quick and effective tracking of virus transmission chains and early detection of outbreaks, especially in the “phase 2” of the pandemic, when lockdown and other restriction measures are progressively withdrawn, in order to avoid or minimize contagion resurgence. For this purpose, contact-tracing apps are being proposed for large scale adoption by many countries. A centralized approach, where data sensed by the app are all sent to a nation-wide server, raises concerns about citizens’ privacy and needlessly strong digital surveillance, thus alerting us to the need to minimize personal data collection and avoiding location tracking. 
We advocate the conceptual advantage of a decentralized approach, where both contact and location data are collected exclusively in individual citizens’ “personal data stores”, to be shared separately and selectively (e.g., with a backend system, but possibly also with other citizens), voluntarily, only when the citizen has tested positive for COVID-19, and with a privacy preserving level of granularity. This approach better protects the personal sphere of citizens and affords multiple benefits: it allows for detailed information gathering for infected people in a privacy-preserving fashion; and, in turn this enables both contact tracing, and, the early detection of outbreak hotspots on more finely-granulated geographic scale. The decentralized approach is also scalable to large populations, in that only the data of positive patients need be handled at a central level. Our recommendation is two-fold. First to extend existing decentralized architectures with a light touch, in order to manage the collection of location data locally on the device, and allow the user to share spatio-temporal aggregates—if and when they want and for specific aims—with health authorities, for instance. Second, we favour a longer-term pursuit of realizing a Personal Data Store vision, giving users the opportunity to contribute to collective good in the measure they want, enhancing self-awareness, and cultivating collective efforts for rebuilding society. SN - 1572-8439 UR - https://link.springer.com/article/10.1007/s10676-020-09572-w JO - Ethics and Information Technology ER - TY - JOUR T1 - GLocalX - From Local to Global Explanations of Black Box AI Models Y1 - 2021 A1 - Mattia Setzu A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Franco Turini A1 - Dino Pedreschi A1 - Fosca Giannotti AB - Artificial Intelligence (AI) has come to prominence as one of the major components of our society, with applications in most aspects of our lives. 
In this field, complex and highly nonlinear machine learning models such as ensemble models, deep neural networks, and Support Vector Machines have consistently shown remarkable accuracy in solving complex tasks. Although accurate, AI models often are “black boxes” which we are not able to understand. Relying on these models has a multifaceted impact and raises significant concerns about their transparency. Applications in sensitive and critical domains are a strong motivational factor in trying to understand the behavior of black boxes. We propose to address this issue by providing an interpretable layer on top of black box models by aggregating “local” explanations. We present GLocalX, a “local-first” model agnostic explanation method. Starting from local explanations expressed in form of local decision rules, GLocalX iteratively generalizes them into global explanations by hierarchically aggregating them. Our goal is to learn accurate yet simple interpretable models to emulate the given black box, and, if possible, replace it entirely. We validate GLocalX in a set of experiments in standard and constrained settings with limited or no access to either data or local explanations. Experiments show that GLocalX is able to accurately emulate several models with simple and small models, reaching state-of-the-art performance against natively global solutions. Our findings show how it is often possible to achieve a high level of both accuracy and comprehensibility of classification models, even in complex domains with high-dimensional data, without necessarily trading one property for the other. This is a key requirement for a trustworthy AI, necessary for adoption in high-stakes decision making applications. 
VL - 294 SN - 0004-3702 UR - https://www.sciencedirect.com/science/article/pii/S0004370221000084 JO - Artificial Intelligence ER - TY - CONF T1 - A new approach for cross-silo federated learning and its privacy risks T2 - 18th International Conference on Privacy, Security and Trust, PST 2021, Auckland, New Zealand, December 13-15, 2021 Y1 - 2021 A1 - Michele Fontana A1 - Francesca Naretto A1 - Anna Monreale JF - 18th International Conference on Privacy, Security and Trust, PST 2021, Auckland, New Zealand, December 13-15, 2021 PB - IEEE UR - https://doi.org/10.1109/PST52912.2021.9647753 ER - TY - CONF T1 - Privacy Risk Assessment of Individual Psychometric Profiles T2 - Discovery Science - 24th International Conference, DS 2021, Halifax, NS, Canada, October 11-13, 2021, Proceedings Y1 - 2021 A1 - Giacomo Mariani A1 - Anna Monreale A1 - Francesca Naretto ED - Carlos Soares ED - Luís Torgo JF - Discovery Science - 24th International Conference, DS 2021, Halifax, NS, Canada, October 11-13, 2021, Proceedings PB - Springer UR - https://doi.org/10.1007/978-3-030-88942-5_32 ER - TY - JOUR T1 - Authenticated Outlier Mining for Outsourced Databases JF - IEEE Transactions on Dependable and Secure Computing Y1 - 2020 A1 - Dong, Boxiang A1 - Wang, Hui A1 - Anna Monreale A1 - Dino Pedreschi A1 - Fosca Giannotti A1 - Guo, Wenge AB - The Data-Mining-as-a-Service (DMaS) paradigm is becoming the focus of research, as it allows the data owner (client) who lacks expertise and/or computational resources to outsource their data and mining needs to a third-party service provider (server). Outsourcing, however, raises some issues about result integrity: how could the client verify the mining results returned by the server are both sound and complete? In this paper, we focus on outlier mining, an important mining task. Previous verification techniques use an authenticated data structure (ADS) for correctness authentication, which may incur much space and communication cost. 
In this paper, we propose a novel solution that returns a probabilistic result integrity guarantee with much cheaper verification cost. The key idea is to insert a set of artificial records (ARs) into the dataset, from which it constructs a set of artificial outliers (AOs) and artificial non-outliers (ANOs). The AOs and ANOs are used by the client to detect any incomplete and/or incorrect mining results with a probabilistic guarantee. The main challenge that we address is how to construct ARs so that they do not change the (non-)outlierness of original records, while guaranteeing that the client can identify ANOs and AOs without executing mining. Furthermore, we build a strategic game and show that a Nash equilibrium exists only when the server returns correct outliers. Our implementation and experiments demonstrate that our verification solution is efficient and lightweight. VL - 17 UR - https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=8858 UR - https://ieeexplore.ieee.org/document/8048342/ UR - http://xplorestaging.ieee.org/ielx7/8858/9034462/08048342.pdf?arnumber=8048342 UR - https://ieeexplore.ieee.org/ielam/8858/9034462/8048342-aam.pdf JO - IEEE Trans. Dependable and Secure Comput. ER - TY - CONF T1 - Black Box Explanation by Learning Image Exemplars in the Latent Feature Space T2 - Machine Learning and Knowledge Discovery in Databases Y1 - 2020 A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Matwin, Stan A1 - Dino Pedreschi ED - Brefeld, Ulf ED - Fromont, Elisa ED - Hotho, Andreas ED - Knobbe, Arno ED - Maathuis, Marloes ED - Robardet, Céline AB - We present an approach to explain the decisions of black box models for image classification. While using the black box to label images, our explanation method exploits the latent feature space learned through an adversarial autoencoder. The proposed method first generates exemplar images in the latent feature space and learns a decision tree classifier. Then, it selects and decodes exemplars respecting local decision rules. 
Finally, it visualizes them in a manner that shows to the user how the exemplars can be modified to either stay within their class, or to become counter-factuals by “morphing” into another class. Since we focus on black box decision systems for image classification, the explanation obtained from the exemplars also provides a saliency map highlighting the areas of the image that contribute to its classification, and areas of the image that push it into another class. We present the results of an experimental evaluation on three datasets and two black box models. Besides providing the most useful and interpretable explanations, we show that the proposed method outperforms existing explainers in terms of fidelity, relevance, coherence, and stability. JF - Machine Learning and Knowledge Discovery in Databases PB - Springer International Publishing CY - Cham SN - 978-3-030-46150-8 UR - https://link.springer.com/chapter/10.1007/978-3-030-46150-8_12 ER - TY - JOUR T1 - An ethico-legal framework for social data science Y1 - 2020 A1 - Forgó, Nikolaus A1 - Hänold, Stefanie A1 - van den Hoven, Jeroen A1 - Krügel, Tina A1 - Lishchuk, Iryna A1 - Mahieu, René A1 - Anna Monreale A1 - Dino Pedreschi A1 - Francesca Pratesi A1 - van Putten, David AB - This paper presents a framework for research infrastructures enabling ethically sensitive and legally compliant data science in Europe. Our goal is to describe how to design and implement an open platform for big data social science, including, in particular, personal data. To this end, we discuss a number of infrastructural, organizational and methodological principles to be developed for a concrete implementation. 
These include not only systematically tools and methodologies that effectively enable both the empirical evaluation of the privacy risk and data transformations by using privacy-preserving approaches, but also the development of training materials (a massive open online course) and organizational instruments based on legal and ethical principles. This paper provides, by way of example, the implementation that was adopted within the context of the SoBigData Research Infrastructure. SN - 2364-4168 UR - https://link.springer.com/article/10.1007/s41060-020-00211-7 JO - International Journal of Data Science and Analytics ER - TY - CONF T1 - Global Explanations with Local Scoring T2 - Machine Learning and Knowledge Discovery in Databases Y1 - 2020 A1 - Mattia Setzu A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Franco Turini ED - Cellier, Peggy ED - Driessens, Kurt AB - Artificial Intelligence systems often adopt machine learning models encoding complex algorithms with potentially unknown behavior. As the application of these “black box” models grows, it is our responsibility to understand their inner working and formulate them in human-understandable explanations. To this end, we propose a rule-based model-agnostic explanation method that follows a local-to-global schema: it generalizes a global explanation summarizing the decision logic of a black box starting from the local explanations of single predicted instances. We define a scoring system based on a rule relevance score to extract global explanations from a set of local explanations in the form of decision rules. Experiments on several datasets and black boxes show the stability, and low complexity of the global explanations provided by the proposed solution in comparison with baselines and state-of-the-art global explainers. 
JF - Machine Learning and Knowledge Discovery in Databases PB - Springer International Publishing CY - Cham SN - 978-3-030-43823-4 UR - https://link.springer.com/chapter/10.1007%2F978-3-030-43823-4_14 ER - TY - JOUR T1 - Modeling Adversarial Behavior Against Mobility Data Privacy JF - IEEE Transactions on Intelligent Transportation Systems Y1 - 2020 A1 - Roberto Pellungrini A1 - Luca Pappalardo A1 - F. Simini A1 - Anna Monreale AB - Privacy risk assessment is a crucial issue in any privacy-aware analysis process. Traditional frameworks for privacy risk assessment systematically generate the assumed knowledge for a potential adversary, evaluating the risk without realistically modelling the collection of the background knowledge used by the adversary when performing the attack. In this work, we propose Simulated Privacy Annealing (SPA), a new adversarial behavior model for privacy risk assessment in mobility data. We model the behavior of an adversary as a mobility trajectory and introduce an optimization approach to find the most effective adversary trajectory in terms of privacy risk produced for the individuals represented in a mobility data set. We use simulated annealing to optimize the movement of the adversary and simulate a possible attack on mobility data. We finally test the effectiveness of our approach on real human mobility data, showing that it can simulate the knowledge gathering process for an adversary in a more realistic way. 
SN - 1558-0016 UR - https://ieeexplore.ieee.org/abstract/document/9199893 JO - IEEE Transactions on Intelligent Transportation Systems ER - TY - CONF T1 - Predicting and Explaining Privacy Risk Exposure in Mobility Data T2 - Discovery Science Y1 - 2020 A1 - Francesca Naretto A1 - Roberto Pellungrini A1 - Anna Monreale A1 - Nardini, Franco Maria A1 - Musolesi, Mirco ED - Appice, Annalisa ED - Tsoumakas, Grigorios ED - Manolopoulos, Yannis ED - Matwin, Stan AB - Mobility data is a proxy of different social dynamics and its analysis enables a wide range of user services. Unfortunately, mobility data are very sensitive because the sharing of people’s whereabouts may raise serious privacy concerns. Existing frameworks for privacy risk assessment provide tools to identify and measure privacy risks, but they often (i) have high computational complexity; and (ii) are not able to provide users with a justification of the reported risks. In this paper, we propose expert, a new framework for the prediction and explanation of privacy risk on mobility data. We empirically evaluate privacy risk on real data, simulating a privacy attack with a state-of-the-art privacy risk assessment framework. We then extract individual mobility profiles from the data for predicting their risk. We compare the performance of several machine learning algorithms in order to identify the best approach for our task. Finally, we show how it is possible to explain privacy risk prediction on real data, using two algorithms: Shap, a feature importance-based method and Lore, a rule-based method. Overall, expert is able to provide a user with the privacy risk and an explanation of the risk itself. The experiments show excellent performance for the prediction task. 
JF - Discovery Science PB - Springer International Publishing CY - Cham SN - 978-3-030-61527-7 UR - https://link.springer.com/chapter/10.1007/978-3-030-61527-7_27 ER - TY - CONF T1 - Prediction and Explanation of Privacy Risk on Mobility Data with Neural Networks T2 - ECML PKDD 2020 Workshops Y1 - 2020 A1 - Francesca Naretto A1 - Roberto Pellungrini A1 - Nardini, Franco Maria A1 - Fosca Giannotti ED - Koprinska, Irena ED - Kamp, Michael ED - Appice, Annalisa ED - Loglisci, Corrado ED - Antonie, Luiza ED - Zimmermann, Albrecht ED - Riccardo Guidotti ED - Özgöbek, Özlem ED - Ribeiro, Rita P. ED - Gavaldà, Ricard ED - Gama, João ED - Adilova, Linara ED - Krishnamurthy, Yamuna ED - Ferreira, Pedro M. ED - Malerba, Donato ED - Medeiros, Ibéria ED - Ceci, Michelangelo ED - Manco, Giuseppe ED - Masciari, Elio ED - Ras, Zbigniew W. ED - Christen, Peter ED - Ntoutsi, Eirini ED - Schubert, Erich ED - Zimek, Arthur ED - Anna Monreale ED - Biecek, Przemyslaw ED - S Rinzivillo ED - Kille, Benjamin ED - Lommatzsch, Andreas ED - Gulla, Jon Atle AB - The analysis of privacy risk for mobility data is a fundamental part of any privacy-aware process based on such data. Mobility data are highly sensitive. Therefore, the correct identification of the privacy risk before releasing the data to the public is of utmost importance. However, existing privacy risk assessment frameworks have high computational complexity. To tackle these issues, some recent work proposed a solution based on classification approaches to predict privacy risk using mobility features extracted from the data. In this paper, we propose an improvement of this approach by applying long short-term memory (LSTM) neural networks to predict the privacy risk directly from original mobility data. We empirically evaluate privacy risk on real data by applying our LSTM-based approach. 
Results show that our proposed method based on an LSTM network is effective in predicting the privacy risk with results in terms of F1 of up to 0.91. Moreover, to explain the predictions of our model, we employ a state-of-the-art explanation algorithm, Shap. We explore the resulting explanation, showing how it is possible to provide effective predictions while explaining them to the end-user. JF - ECML PKDD 2020 Workshops PB - Springer International Publishing CY - Cham SN - 978-3-030-65965-3 UR - https://link.springer.com/chapter/10.1007/978-3-030-65965-3_34 ER - TY - JOUR T1 - PRIMULE: Privacy risk mitigation for user profiles Y1 - 2020 A1 - Francesca Pratesi A1 - Lorenzo Gabrielli A1 - Paolo Cintia A1 - Anna Monreale A1 - Fosca Giannotti AB - The availability of mobile phone data has encouraged the development of different data-driven tools, supporting social science studies and providing new data sources to the standard official statistics. However, this particular kind of data is subject to privacy concerns because it can enable the inference of personal and private information. In this paper, we address the privacy issues related to the sharing of user profiles, derived from mobile phone data, by proposing PRIMULE, a privacy risk mitigation strategy. Such a method relies on PRUDEnce (Pratesi et al., 2018), a privacy risk assessment framework that provides a methodology for systematically identifying risky users in a set of data. Extensive experimentation on real-world data shows the effectiveness of the PRIMULE strategy in terms of both quality of mobile user profiles and utility of these profiles for analytical services such as the Sociometer (Furletti et al., 2013), a data mining tool for city users classification. 
VL - 125 SN - 0169-023X UR - https://www.sciencedirect.com/science/article/pii/S0169023X18305342 JO - Data & Knowledge Engineering ER - TY - JOUR T1 - The AI black box Explanation Problem JF - ERCIM NEWS Y1 - 2019 A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Dino Pedreschi ER - TY - CONF T1 - Explaining multi-label black-box classifiers for health applications T2 - International Workshop on Health Intelligence Y1 - 2019 A1 - Cecilia Panigutti A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Dino Pedreschi AB - Today the state-of-the-art performance in classification is achieved by the so-called “black boxes”, i.e. decision-making systems whose internal logic is obscure. Such models could revolutionize the health-care system, however their deployment in real-world diagnosis decision support systems is subject to several risks and limitations due to the lack of transparency. The typical classification problem in health-care requires a multi-label approach since the possible labels are not mutually exclusive, e.g. diagnoses. We propose MARLENA, a model-agnostic method which explains multi-label black box decisions. MARLENA explains an individual decision in three steps. First, it generates a synthetic neighborhood around the instance to be explained using a strategy suitable for multi-label decisions. It then learns a decision tree on such neighborhood and finally derives from it a decision rule that explains the black box decision. Our experiments show that MARLENA performs well in terms of mimicking the black box behavior while gaining at the same time a notable amount of interpretability through compact decision rules, i.e. rules with limited length. 
JF - International Workshop on Health Intelligence PB - Springer UR - https://link.springer.com/chapter/10.1007/978-3-030-24409-5_9 ER - TY - JOUR T1 - Factual and Counterfactual Explanations for Black Box Decision Making JF - IEEE Intelligent Systems Y1 - 2019 A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Fosca Giannotti A1 - Dino Pedreschi A1 - Salvatore Ruggieri A1 - Franco Turini AB - The rise of sophisticated machine learning models has brought accurate but obscure decision systems, which hide their logic, thus undermining transparency, trust, and the adoption of artificial intelligence (AI) in socially sensitive and safety-critical contexts. We introduce a local rule-based explanation method, providing faithful explanations of the decision made by a black box classifier on a specific instance. The proposed method first learns an interpretable, local classifier on a synthetic neighborhood of the instance under investigation, generated by a genetic algorithm. Then, it derives from the interpretable classifier an explanation consisting of a decision rule, explaining the factual reasons of the decision, and a set of counterfactuals, suggesting the changes in the instance features that would lead to a different outcome. Experimental results show that the proposed method outperforms existing approaches in terms of the quality of the explanations and of the accuracy in mimicking the black box. UR - https://ieeexplore.ieee.org/abstract/document/8920138 ER - TY - CONF T1 - Investigating Neighborhood Generation Methods for Explanations of Obscure Image Classifiers T2 - Pacific-Asia Conference on Knowledge Discovery and Data Mining Y1 - 2019 A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Cariaggi, Leonardo AB - Given the wide use of machine learning approaches based on opaque prediction models, understanding the reasons behind decisions of black box decision systems is nowadays a crucial topic. 
We address the problem of providing meaningful explanations in the widely applied image classification tasks. In particular, we explore the impact of changing the neighborhood generation function for a local interpretable model-agnostic explanator by proposing four different variants. All the proposed methods are based on a grid-based segmentation of the images, but each of them proposes a different strategy for generating the neighborhood of the image for which an explanation is required. In-depth experimentation shows both the improvements and the weaknesses of each proposed approach. JF - Pacific-Asia Conference on Knowledge Discovery and Data Mining PB - Springer UR - https://link.springer.com/chapter/10.1007/978-3-030-16148-4_5 ER - TY - CONF T1 - Meaningful explanations of Black Box AI decision systems T2 - Proceedings of the AAAI Conference on Artificial Intelligence Y1 - 2019 A1 - Dino Pedreschi A1 - Fosca Giannotti A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Salvatore Ruggieri A1 - Franco Turini AB - Black box AI systems for automated decision making, often based on machine learning over (big) data, map a user’s features into a class or a score without exposing the reasons why. This is problematic not only for lack of transparency, but also for possible biases inherited by the algorithms from human prejudices and collection artifacts hidden in the training data, which may lead to unfair or wrong decisions. 
We focus on the urgent open challenge of how to construct meaningful explanations of opaque AI/ML systems, introducing the local-to-global framework for black box explanation, articulated along three lines: (i) the language for expressing explanations in terms of logic rules, with statistical and causal interpretation; (ii) the inference of local explanations for revealing the decision rationale for a specific case, by auditing the black box in the vicinity of the target instance; (iii) the bottom-up generalization of many local explanations into simple global ones, with algorithms that optimize for quality and comprehensibility. We argue that the local-first approach opens the door to a wide variety of alternative solutions along different dimensions: a variety of data sources (relational, text, images, etc.), a variety of learning problems (multi-label classification, regression, scoring, ranking), a variety of languages for expressing meaningful explanations, a variety of means to audit a black box. JF - Proceedings of the AAAI Conference on Artificial Intelligence UR - https://aaai.org/ojs/index.php/AAAI/article/view/5050 ER - TY - CONF T1 - Privacy Risk for Individual Basket Patterns T2 - ECML PKDD 2018 Workshops Y1 - 2019 A1 - Roberto Pellungrini A1 - Anna Monreale A1 - Riccardo Guidotti ED - Alzate, Carlos ED - Anna Monreale ED - Bioglio, Livio ED - Bitetta, Valerio ED - Bordino, Ilaria ED - Caldarelli, Guido ED - Ferretti, Andrea ED - Riccardo Guidotti ED - Gullo, Francesco ED - Pascolutti, Stefano ED - Pensa, Ruggero G. ED - Robardet, Céline ED - Squartini, Tiziano AB - Retail data are of fundamental importance for businesses and enterprises that want to understand the purchasing behaviour of their customers. Such data is also useful to develop analytical services and for marketing purposes, often based on individual purchasing patterns. 
However, retail data and extracted models may also provide very sensitive information to possible malicious third parties. Therefore, in this paper we propose a methodology for empirically assessing privacy risk in the release of individual purchasing data. The experiments on real-world retail data show that although individual patterns describe a summary of the customer activity, they may be successfully used for customer re-identification. JF - ECML PKDD 2018 Workshops PB - Springer International Publishing CY - Cham SN - 978-3-030-13463-1 UR - https://link.springer.com/chapter/10.1007/978-3-030-13463-1_11 ER - TY - CONF T1 - Analyzing Privacy Risk in Human Mobility Data T2 - Software Technologies: Applications and Foundations - STAF 2018 Collocated Workshops, Toulouse, France, June 25-29, 2018, Revised Selected Papers Y1 - 2018 A1 - Roberto Pellungrini A1 - Luca Pappalardo A1 - Francesca Pratesi A1 - Anna Monreale AB - Mobility data are of fundamental importance for understanding the patterns of human movements, developing analytical services and modeling human dynamics. Unfortunately, mobility data also contain individual sensitive information, making an accurate privacy risk assessment necessary for the individuals involved. In this paper, we propose a methodology for assessing privacy risk in human mobility data. Given a set of individual and collective mobility features, we define the minimum data format necessary for the computation of each feature and we define a set of possible attacks on these data formats. We perform experiments computing the empirical risk in a real-world mobility dataset, and show how the distributions of the considered mobility features are affected by the removal of individuals with different levels of privacy risk. 
JF - Software Technologies: Applications and Foundations - STAF 2018 Collocated Workshops, Toulouse, France, June 25-29, 2018, Revised Selected Papers UR - https://doi.org/10.1007/978-3-030-04771-9_10 ER - TY - JOUR T1 - Discovering temporal regularities in retail customers’ shopping behavior JF - EPJ Data Science Y1 - 2018 A1 - Riccardo Guidotti A1 - Lorenzo Gabrielli A1 - Anna Monreale A1 - Dino Pedreschi A1 - Fosca Giannotti AB - In this paper we investigate the regularities characterizing the temporal purchasing behavior of the customers of a retail market chain. Most of the literature studying purchasing behavior focuses on what customers buy while giving little importance to the temporal dimension. As a consequence, the state of the art does not allow capturing the temporal purchasing patterns of each customer. These patterns should describe the customer’s temporal habits highlighting when she typically makes a purchase in correlation with information about the amount of expenditure, number of purchased items and other similar aggregates. This knowledge could be exploited for different scopes: set temporal discounts for making the purchases of customers more regular with respect to time, set personalized discounts in the day and time window preferred by the customer, provide recommendations for shopping time schedule, etc. To this aim, we introduce a framework for extracting from personal retail data a temporal purchasing profile able to summarize whether and when a customer makes her distinctive purchases. The individual profile describes a set of regular and characterizing shopping behavioral patterns, and the sequences in which these patterns take place. We show how to compare different customers by providing a collective perspective to their individual profiles, and how to group the customers with respect to these comparable profiles. 
By analyzing real datasets containing millions of shopping sessions we found that there is a limited number of patterns summarizing the temporal purchasing behavior of all the customers, and that they are sequentially followed in a finite number of ways. Moreover, we recognized regular customers characterized by a small number of temporal purchasing behaviors, and changing customers characterized by various types of temporal purchasing behaviors. Finally, we discuss how the profiles can be exploited both by customers to enable personalized services, and by the retail market chain for providing tailored discounts based on temporal purchasing regularity. VL - 7 UR - https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-018-0133-0 ER - TY - CONF T1 - Exploring Students Eating Habits Through Individual Profiling and Clustering Analysis T2 - ECML PKDD 2018 Workshops Y1 - 2018 A1 - Michela Natilli A1 - Anna Monreale A1 - Riccardo Guidotti A1 - Luca Pappalardo JF - ECML PKDD 2018 Workshops PB - Springer ER - TY - JOUR T1 - Gastroesophageal reflux symptoms among Italian university students: epidemiology and dietary correlates using automatically recorded transactions JF - BMC Gastroenterology Y1 - 2018 A1 - Martinucci, Irene A1 - Michela Natilli A1 - Lorenzoni, Valentina A1 - Luca Pappalardo A1 - Anna Monreale A1 - Turchetti, Giuseppe A1 - Dino Pedreschi A1 - Marchi, Santino A1 - Barale, Roberto A1 - de Bortoli, Nicola AB - Background: Gastroesophageal reflux disease (GERD) is one of the most common gastrointestinal disorders worldwide, with relevant impact on the quality of life and health care costs. The aim of our study is to assess the prevalence of GERD based on self-reported symptoms among university students in central Italy. The secondary aim is to evaluate lifestyle correlates, particularly eating habits, in GERD students using automatically recorded transactions through cashiers at university canteen. 
Methods: A web-survey was created and launched through an app, ad-hoc developed for an interactive exchange of information with students, including anthropometric data and lifestyle habits. Moreover, the web-survey allowed users a self-diagnosis of GERD through a simple questionnaire. As regards eating habits, details of the meals consumed, including number and type of dishes, were automatically recorded through cashiers at the university canteen equipped with an automatic registration system. Results: We collected 3012 questionnaires. A total of 792 students (26.2% of the respondents) reported typical GERD symptoms occurring at least weekly. Female sex was more prevalent than male sex. In the set of students with GERD, the percentage of smokers was higher, and our results showed that when BMI tends to higher values the percentage of students with GERD tends to increase. When evaluating correlates with diet, we found, among all users, a lower frequency of legumes choice in GERD students and, among frequent users, a lower frequency of choice of pasta and rice in GERD students. Discussion: The results of our study are in line with the values reported in the literature. Nowadays, GERD is a common problem in our communities, and can potentially lead to serious medical complications; the economic burden involved in the diagnostic and therapeutic management of the disease has a relevant impact on healthcare costs. Conclusions: To our knowledge, this is the first study evaluating the prevalence of typical GERD-related symptoms in a young population of University students in Italy. Considering the young age of enrolled subjects, our prevalence rate, relatively high compared to the usual estimates, could represent a further negative factor for the future economic sustainability of the healthcare system. 
Keywords: Gastroesophageal reflux disease, GERD, Heartburn, Regurgitation, Diet, Prevalence, University students VL - 18 UR - https://bmcgastroenterol.biomedcentral.com/articles/10.1186/s12876-018-0832-9 ER - TY - CHAP T1 - How Data Mining and Machine Learning Evolved from Relational Data Base to Data Science T2 - A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years Y1 - 2018 A1 - Amato, G. A1 - Candela, L. A1 - Castelli, D. A1 - Esuli, A. A1 - Falchi, F. A1 - Gennaro, C. A1 - Fosca Giannotti A1 - Anna Monreale A1 - Mirco Nanni A1 - Pagano, P. A1 - Luca Pappalardo A1 - Dino Pedreschi A1 - Francesca Pratesi A1 - Rabitti, F. A1 - S Rinzivillo A1 - Giulio Rossetti A1 - Salvatore Ruggieri A1 - Sebastiani, F. A1 - Tesconi, M. ED - Flesca, Sergio ED - Greco, Sergio ED - Masciari, Elio ED - Saccà, Domenico AB - During the last 35 years, data management principles such as physical and logical independence, declarative querying and cost-based optimization have led to profound pervasiveness of relational databases in any kind of organization. More importantly, these technical advances have enabled the first round of business intelligence applications and laid the foundation for managing and analyzing Big Data today. JF - A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years PB - Springer International Publishing CY - Cham SN - 978-3-319-61893-7 UR - https://link.springer.com/chapter/10.1007%2F978-3-319-61893-7_17 ER - TY - CONF T1 - Learning Data Mining T2 - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) Y1 - 2018 A1 - Riccardo Guidotti A1 - Anna Monreale A1 - S Rinzivillo AB - In the last decade the usage and study of data mining and machine learning algorithms have received an increasing attention from several and heterogeneous fields of research. 
Learning how and why a certain algorithm returns a particular result, and understanding the main problems connected to its execution, is a hot topic in the education of data mining methods. In order to support data mining beginners, students, teachers, and researchers we introduce a novel didactic environment. The Didactic Data Mining Environment (DDME) allows users to execute a data mining algorithm on a dataset and to observe the algorithm behavior step by step to learn how and why a certain result is returned. DDME can be practically exploited by teachers and students for having a more interactive learning of data mining. Indeed, on top of the core didactic library, we designed a visual platform that allows online execution of experiments and the visualization of the algorithm steps. The visual platform abstracts the coding activity and makes available the execution of algorithms to non-technicians. JF - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) UR - https://ieeexplore.ieee.org/document/8631453 ER - TY - RPRT T1 - Local Rule-Based Explanations of Black Box Decision Systems Y1 - 2018 A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Salvatore Ruggieri A1 - Dino Pedreschi A1 - Franco Turini A1 - Fosca Giannotti JF - arXiv preprint arXiv:1805.10820 ER - TY - RPRT T1 - Open the Black Box Data-Driven Explanation of Black Box Decision Systems Y1 - 2018 A1 - Dino Pedreschi A1 - Fosca Giannotti A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Luca Pappalardo A1 - Salvatore Ruggieri A1 - Franco Turini JF - arXiv preprint arXiv:1806.09936 ER - TY - JOUR T1 - PRUDEnce: a system for assessing privacy risk vs utility in data sharing ecosystems JF - Transactions on Data Privacy Y1 - 2018 A1 - Francesca Pratesi A1 - Anna Monreale A1 - Roberto Trasarti A1 - Fosca Giannotti A1 - Dino Pedreschi A1 - Yanagihara, Tadashi AB - Data describing human activities 
are an important source of knowledge useful for understanding individual and collective behavior and for developing a wide range of user services. Unfortunately, this kind of data is sensitive, because people’s whereabouts may allow re-identification of individuals in a de-identified database. Therefore, Data Providers, before sharing those data, must apply any sort of anonymization to lower the privacy risks, but they must be aware and capable of controlling also the data quality, since these two factors are often a trade-off. In this paper we propose PRUDEnce (Privacy Risk versus Utility in Data sharing Ecosystems), a system enabling a privacy-aware ecosystem for sharing personal data. It is based on a methodology for assessing both the empirical (not theoretical) privacy risk associated to users represented in the data, and the data quality guaranteed only with users not at risk. Our proposal is able to support the Data Provider in the exploration of a repertoire of possible data transformations with the aim of selecting one specific transformation that yields an adequate trade-off between data quality and privacy risk. We study the practical effectiveness of our proposal over three data formats underlying many services, defined on real mobility data, i.e., presence data, trajectory data and road segment data. VL - 11 UR - http://www.tdp.cat/issues16/tdp.a284a17.pdf ER - TY - JOUR T1 - A survey of methods for explaining black box models JF - ACM computing surveys (CSUR) Y1 - 2018 A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Salvatore Ruggieri A1 - Franco Turini A1 - Fosca Giannotti A1 - Dino Pedreschi AB - In recent years, many accurate decision support systems have been constructed as black boxes, that is as systems that hide their internal logic to the user. This lack of explanation constitutes both a practical and an ethical issue. 
The literature reports many approaches aimed at overcoming this crucial weakness, sometimes at the cost of sacrificing accuracy for interpretability. The applications in which black box decision systems can be used are various, and each approach is typically developed to provide a solution for a specific problem and, as a consequence, it explicitly or implicitly delineates its own definition of interpretability and explanation. The aim of this article is to provide a classification of the main problems addressed in the literature with respect to the notion of explanation and the type of black box system. Given a problem definition, a black box type, and a desired explanation, this survey should help the researcher to find the proposals more useful for his own work. The proposed classification of approaches to open black box models should also be useful for putting the many research open questions in perspective. VL - 51 UR - https://dl.acm.org/doi/abs/10.1145/3236009 ER - TY - JOUR T1 - Authenticated Outlier Mining for Outsourced Databases JF - IEEE Transactions on Dependable and Secure Computing Y1 - 2017 A1 - Dong, Boxiang A1 - Hui Wendy Wang A1 - Anna Monreale A1 - Dino Pedreschi A1 - Fosca Giannotti A1 - W Guo AB - The Data-Mining-as-a-Service (DMaS) paradigm is becoming the focus of research, as it allows the data owner (client) who lacks expertise and/or computational resources to outsource their data and mining needs to a third-party service provider (server). Outsourcing, however, raises some issues about result integrity: how could the client verify the mining results returned by the server are both sound and complete? In this paper, we focus on outlier mining, an important mining task. Previous verification techniques use an authenticated data structure (ADS) for correctness authentication, which may incur much space and communication cost. 
In this paper, we propose a novel solution that returns a probabilistic result integrity guarantee with much cheaper verification cost. The key idea is to insert a set of artificial records (ARs) into the dataset, from which it constructs a set of artificial outliers (AOs) and artificial non-outliers (ANOs). The AOs and ANOs are used by the client to detect any incomplete and/or incorrect mining results with a probabilistic guarantee. The main challenge that we address is how to construct ARs so that they do not change the (non-)outlierness of original records, while guaranteeing that the client can identify ANOs and AOs without executing mining. Furthermore, we build a strategic game and show that a Nash equilibrium exists only when the server returns correct outliers. Our implementation and experiments demonstrate that our verification solution is efficient and lightweight. UR - https://ieeexplore.ieee.org/document/8048342/ ER - TY - CONF T1 - Clustering Individual Transactional Data for Masses of Users T2 - Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Y1 - 2017 A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Mirco Nanni A1 - Fosca Giannotti A1 - Dino Pedreschi AB - Mining a large number of datasets recording human activities for making sense of individual data is the key enabler of a new wave of personalized knowledge-based services. In this paper we focus on the problem of clustering individual transactional data for a large mass of users. Transactional data is a very pervasive kind of information that is collected by several services, often involving huge pools of users. We propose txmeans, a parameter-free clustering algorithm able to efficiently partition transactional data in a completely automatic way. 
Txmeans is designed for the case where clustering must be applied on a massive number of different datasets, for instance when a large set of users need to be analyzed individually and each of them has generated a long history of transactions. A deep experimentation on both real and synthetic datasets shows the practical effectiveness of txmeans for the mass clustering of different personal datasets, and suggests that txmeans outperforms existing methods in terms of quality and efficiency. Finally, we present a personal cart assistant application based on txmeans. JF - Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining PB - ACM ER - TY - JOUR T1 - A Data Mining Approach to Assess Privacy Risk in Human Mobility Data JF - ACM Trans. Intell. Syst. Technol. Y1 - 2017 A1 - Roberto Pellungrini A1 - Luca Pappalardo A1 - Francesca Pratesi A1 - Anna Monreale AB - Human mobility data are an important proxy to understand human mobility dynamics, develop analytical services, and design mathematical models for simulation and what-if analysis. Unfortunately mobility data are very sensitive since they may enable the re-identification of individuals in a database. Existing frameworks for privacy risk assessment provide data providers with tools to control and mitigate privacy risks, but they suffer from two main shortcomings: (i) they have a high computational complexity; (ii) the privacy risk must be recomputed every time new data records become available and for every selection of individuals, geographic areas, or time windows. In this article, we propose a fast and flexible approach to estimate privacy risk in human mobility data. The idea is to train classifiers to capture the relation between individual mobility patterns and the level of privacy risk of individuals. 
We show the effectiveness of our approach by an extensive experiment on real-world GPS data in two urban areas and investigate the relations between human mobility patterns and the privacy risk of individuals. VL - 9 UR - http://doi.acm.org/10.1145/3106774 ER - TY - ABST T1 - Fast Estimation of Privacy Risk in Human Mobility Data Y1 - 2017 A1 - Roberto Pellungrini A1 - Luca Pappalardo A1 - Francesca Pratesi A1 - Anna Monreale AB - Mobility data are an important proxy to understand the patterns of human movements, develop analytical services and design models for simulation and prediction of human dynamics. Unfortunately mobility data are also very sensitive, since they may contain personal information about the individuals involved. Existing frameworks for privacy risk assessment enable the data providers to quantify and mitigate privacy risks, but they suffer two main limitations: (i) they have a high computational complexity; (ii) the privacy risk must be re-computed for each new set of individuals, geographic areas or time windows. In this paper we explore a fast and flexible solution to estimate privacy risk in human mobility data, using predictive models to capture the relation between an individual’s mobility patterns and her privacy risk. We show the effectiveness of our approach by experimentation on a real-world GPS dataset and provide a comparison with traditional methods. SN - 978-3-319-66283-1 ER - TY - JOUR T1 - MyWay: Location prediction via mobility profiling JF - Information Systems Y1 - 2017 A1 - Roberto Trasarti A1 - Riccardo Guidotti A1 - Anna Monreale A1 - Fosca Giannotti AB - Forecasting the future positions of mobile users is a valuable task allowing us to operate efficiently a myriad of different applications which need this type of information. We propose MyWay, a prediction system which exploits the individual systematic behaviors modeled by mobility profiles to predict human movements. 
MyWay provides three strategies: the individual strategy uses only the user's individual mobility profile, the collective strategy takes advantage of all users' individual systematic behaviors, and the hybrid strategy is a combination of the previous two. A key point is that MyWay only requires the sharing of individual mobility profiles, a concise representation of the user's movements, instead of raw trajectory data revealing the detailed movement of the users. We evaluate the prediction performances of our proposal by a deep experimentation on large real-world data. The results highlight that the synergy between the individual and collective knowledge is the key for a better prediction and allow the system to outperform the state-of-the-art methods. VL - 64 ER - TY - CONF T1 - Privacy Preserving Multidimensional Profiling T2 - International Conference on Smart Objects and Technologies for Social Good Y1 - 2017 A1 - Francesca Pratesi A1 - Anna Monreale A1 - Fosca Giannotti A1 - Dino Pedreschi AB - Recently, big data has become central in the analysis of human behavior and the development of innovative services. In particular, a new class of services is emerging, taking advantage of different sources of data, in order to consider the multiple aspects of human beings. Unfortunately, these data can lead to re-identification problems and other privacy leaks, as widely reported in both scientific literature and media. The risk is even more pressing if multiple sources of data are linked together since a potential adversary could know information related to each dataset. For this reason, it is necessary to evaluate accurately and mitigate the individual privacy risk before releasing personal data. In this paper, we propose a methodology for the first task, i.e., assessing privacy risk, in a multidimensional scenario, defining some possible privacy attacks and simulating them using real-world datasets. 
JF - International Conference on Smart Objects and Technologies for Social Good PB - Springer UR - https://link.springer.com/chapter/10.1007/978-3-319-76111-4_15 ER - TY - JOUR T1 - Big Data Research in Italy: A Perspective JF - Engineering Y1 - 2016 A1 - Sonia Bergamaschi A1 - Emanuele Carlini A1 - Michelangelo Ceci A1 - Barbara Furletti A1 - Fosca Giannotti A1 - Donato Malerba A1 - Mario Mezzanzanica A1 - Anna Monreale A1 - Gabriella Pasi A1 - Dino Pedreschi A1 - Raffaele Perego A1 - Salvatore Ruggieri AB - The aim of this article is to synthetically describe the research projects that a selection of Italian universities is undertaking in the context of big data. Far from being exhaustive, this article has the objective of offering a sample of distinct applications that address the issue of managing huge amounts of data in Italy, collected in relation to diverse domains. VL - 2 UR - http://engineering.org.cn/EN/abstract/article_12288.shtml ER - TY - JOUR T1 - Driving Profiles Computation and Monitoring for Car Insurance CRM JF - ACM Transactions on Intelligent Systems and Technology (TIST) Y1 - 2016 A1 - Mirco Nanni A1 - Roberto Trasarti A1 - Anna Monreale A1 - Valerio Grossi A1 - Dino Pedreschi AB - Customer segmentation is one of the most traditional and valued tasks in customer relationship management (CRM). In this article, we explore the problem in the context of the car insurance industry, where the mobility behavior of customers plays a key role: Different mobility needs, driving habits, and skills imply also different requirements (level of coverage provided by the insurance) and risks (of accidents). In the present work, we describe a methodology to extract several indicators describing the driving profile of customers, and we provide a clustering-oriented instantiation of the segmentation problem based on such indicators. 
Then, we consider the availability of a continuous flow of fresh mobility data sent by the circulating vehicles, aiming at keeping our segments constantly up to date. We tackle a major scalability issue that emerges in this context when the number of customers is large (namely, the communication bottleneck) by proposing and implementing a sophisticated distributed monitoring solution that reduces communications between vehicles and company servers to the essential. We validate the framework on a large database of real mobility data coming from GPS devices on private cars. Finally, we analyze the privacy risks that the proposed approach might involve for the users, providing and evaluating a countermeasure based on data perturbation. VL - 8 UR - http://doi.acm.org/10.1145/2912148 ER - TY - CHAP T1 - Partition-Based Clustering Using Constraint Optimization T2 - Data Mining and Constraint Programming - Foundations of a Cross-Disciplinary Approach Y1 - 2016 A1 - Valerio Grossi A1 - Tias Guns A1 - Anna Monreale A1 - Mirco Nanni A1 - Siegfried Nijssen AB - Partition-based clustering is the task of partitioning a dataset in a number of groups of examples, such that examples in each group are similar to each other. Many criteria for what constitutes a good clustering have been identified in the literature; furthermore, the use of additional constraints to find more useful clusterings has been proposed. In this chapter, it will be shown that most of these clustering tasks can be formalized using optimization criteria and constraints. We demonstrate how a range of clustering tasks can be modelled in generic constraint programming languages with these constraints and optimization criteria. Using the constraint-based modeling approach we also relate the DBSCAN method for density-based clustering to the label propagation technique for community discovery. 
JF - Data Mining and Constraint Programming - Foundations of a Cross-Disciplinary Approach PB - Springer International Publishing UR - http://dx.doi.org/10.1007/978-3-319-50137-6_11 ER - TY - CONF T1 - Privacy-Preserving Outsourcing of Data Mining T2 - 40th IEEE Annual Computer Software and Applications Conference, COMPSAC Workshops 2016, Atlanta, GA, USA, June 10-14, 2016 Y1 - 2016 A1 - Anna Monreale A1 - Hui Wendy Wang AB - Data mining is gaining momentum in society due to the ever increasing availability of large amounts of data, easily gathered by a variety of collection technologies and stored via computer systems. Due to the limited computational resources of data owners and the developments in cloud computing, there has been considerable recent interest in the paradigm of data mining-as-a-service (DMaaS). In this paradigm, a company (data owner) lacking in expertise or computational resources outsources its mining needs to a third party service provider (server). Given the fact that the server may not be fully trusted, one of the main concerns of the DMaaS paradigm is the protection of data privacy. In this paper, we provide an overview of a variety of techniques and approaches that address the privacy issues of the DMaaS paradigm. 
JF - 40th IEEE Annual Computer Software and Applications Conference, COMPSAC Workshops 2016, Atlanta, GA, USA, June 10-14, 2016 PB - IEEE Computer Society CY - Atlanta, GA, USA UR - http://dx.doi.org/10.1109/COMPSAC.2016.169 ER - TY - CONF T1 - Privacy-Preserving Outsourcing of Pattern Mining of Event-Log Data-A Use-Case from Process Industry T2 - Cloud Computing Technology and Science (CloudCom), 2016 IEEE International Conference on Y1 - 2016 A1 - Marrella, Alessandro A1 - Anna Monreale A1 - Kloepper, Benjamin A1 - Krueger, Martin W AB - With the advent of cloud computing and its model for IT services based on the Internet and big data centers, the interest of industries in the XaaS ("Anything as a Service") paradigm is increasing. Business intelligence and knowledge discovery services are typical services that companies tend to externalize on the cloud, due to their data intensive nature and the complexity of the algorithms. What is appealing for a company is to rely on external expertise and infrastructure to compute the analytical results and models which are required by the business analysts for understanding the business phenomena under observation. Although it is advantageous to achieve sophisticated analysis, there exist several serious privacy issues in this paradigm. In this paper we investigate through an industrial use-case the application of a framework for privacy-preserving outsourcing of pattern mining on event-log data. Moreover, we present and discuss some ideas about possible extensions. 
JF - Cloud Computing Technology and Science (CloudCom), 2016 IEEE International Conference on PB - IEEE ER - TY - BOOK T1 - Realising the European open science cloud Y1 - 2016 A1 - Ayris, Paul A1 - Berthou, Jean-Yves A1 - Bruce, Rachel A1 - Lindstaedt, Stefanie A1 - Anna Monreale A1 - Mons, Barend A1 - Murayama, Yasuhiro A1 - Södergård, Caj A1 - Tochtermann, Klaus A1 - Wilkinson, Ross AB - The European Open Science Cloud (EOSC) aims to accelerate and support the current transition to more effective Open Science and Open Innovation in the Digital Single Market. It should enable trusted access to services, systems and the re-use of shared scientific data across disciplinary, social and geographical borders. This report approaches the EOSC as a federated environment for scientific data sharing and re-use, based on existing and emerging elements in the Member States, with light-weight international guidance and governance, and a large degree of freedom regarding practical implementation. SN - 978-92-79-61762-1 UR - http://dx.doi.org/10.2777/940154 ER - TY - JOUR T1 - Unveiling mobility complexity through complex network analysis JF - Social Network Analysis and Mining Y1 - 2016 A1 - Riccardo Guidotti A1 - Anna Monreale A1 - S Rinzivillo A1 - Dino Pedreschi A1 - Fosca Giannotti AB - The availability of massive digital traces of individuals is offering a series of novel insights on the understanding of patterns characterizing human mobility. Many studies try to semantically enrich mobility data with annotations about human activities. However, these approaches either focus on places with high frequencies (e.g., home and work), or rely on background knowledge (e.g., publicly available points of interest). In this paper, we depart from the concept of frequency and we focus on a high level representation of mobility using network analytics. 
The visits of each driver to each systematic destination are modeled as links in a bipartite network where a set of nodes represents drivers and the other set represents places. We extract such network from two real datasets of human mobility based, respectively, on GPS and GSM data. We introduce the concept of mobility complexity of drivers and places as a ranking analysis over the nodes of these networks. In addition, by means of community discovery analysis, we differentiate subgroups of drivers and places according both to their homogeneity and to their mobility complexity. VL - 6 ER - TY - CONF T1 - Clustering Formulation Using Constraint Optimization T2 - Software Engineering and Formal Methods - SEFM 2015 Collocated Workshops: ATSE, HOFM, MoKMaSD, and VERY*SCART, York, UK, September 7-8, 2015, Revised Selected Papers Y1 - 2015 A1 - Valerio Grossi A1 - Anna Monreale A1 - Mirco Nanni A1 - Dino Pedreschi A1 - Franco Turini AB - The problem of clustering a set of data is a textbook machine learning problem, but at the same time, at heart, a typical optimization problem. Given an objective function, such as minimizing the intra-cluster distances or maximizing the inter-cluster distances, the task is to find an assignment of data points to clusters that achieves this objective. In this paper, we present a constraint programming model for a centroid based clustering and one for a density based clustering. In particular, as a key contribution, we show how the expressivity introduced by the formulation of the problem by constraint programming makes the standard problem easy to extend with other constraints that make it possible to generate interesting variants of the problem. We show this important aspect in two different ways: first, we show how the formulation of the density-based clustering by constraint programming makes it very similar to the label propagation problem and then, we propose a variant of the standard label propagation approach. 
JF - Software Engineering and Formal Methods - SEFM 2015 Collocated Workshops: ATSE, HOFM, MoKMaSD, and VERY*SCART, York, UK, September 7-8, 2015, Revised Selected Papers PB - Springer Berlin Heidelberg UR - http://dx.doi.org/10.1007/978-3-662-49224-6_9 ER - TY - JOUR T1 - Discrimination- and privacy-aware patterns JF - Data Min. Knowl. Discov. Y1 - 2015 A1 - Sara Hajian A1 - Josep Domingo-Ferrer A1 - Anna Monreale A1 - Dino Pedreschi A1 - Fosca Giannotti AB - Data mining is gaining societal momentum due to the ever increasing availability of large amounts of human data, easily collected by a variety of sensing technologies. We are therefore faced with unprecedented opportunities and risks: a deeper understanding of human behavior and how our society works is darkened by a greater chance of privacy intrusion and unfair discrimination based on the extracted patterns and profiles. Consider the case when a set of patterns extracted from the personal data of a population of individual persons is released for a subsequent use into a decision making process, such as, e.g., granting or denying credit. First, the set of patterns may reveal sensitive information about individual persons in the training population and, second, decision rules based on such patterns may lead to unfair discrimination, depending on what is represented in the training cases. Although methods independently addressing privacy or discrimination in data mining have been proposed in the literature, in this context we argue that privacy and discrimination risks should be tackled together, and we present a methodology for doing so while publishing frequent pattern mining results. We describe a set of pattern sanitization methods, one for each discrimination measure used in the legal literature, to achieve a fair publishing of frequent patterns in combination with two possible privacy transformations: one based on k-anonymity and one based on differential privacy. 
Our proposed pattern sanitization methods based on k-anonymity yield both privacy- and discrimination-protected patterns, while introducing reasonable (controlled) pattern distortion. Moreover, they obtain a better trade-off between protection and data quality than the sanitization methods based on differential privacy. Finally, the effectiveness of our proposals is assessed by extensive experiments. VL - 29 UR - http://dx.doi.org/10.1007/s10618-014-0393-7 ER - TY - CONF T1 - Quantification in Social Networks T2 - International Conference on Data Science and Advanced Analytics (IEEE DSAA'2015) Y1 - 2015 A1 - Letizia Milli A1 - Anna Monreale A1 - Giulio Rossetti A1 - Dino Pedreschi A1 - Fosca Giannotti A1 - Fabrizio Sebastiani AB - In many real-world applications there is a need to monitor the distribution of a population across different classes, and to track changes in this distribution over time. As an example, an important task is to monitor the percentage of unemployed adults in a given region. When the membership of an individual in a class cannot be established deterministically, a typical solution is the classification task. However, in the above applications the final goal is not determining which class the individuals belong to, but estimating the prevalence of each class in the unlabeled data. This task is called quantification. Most of the work in the literature addressed the quantification problem considering data presented in conventional attribute format. With the ever-growing availability of web and social media, there is a flourishing of network data representing a new important source of information, and by using network quantification techniques we could quantify collective behavior, i.e., the number of users that are involved in certain types of activities, preferences, or behaviors. In this paper we exploit the homophily effect observed in many social networks in order to construct a quantifier for networked data. 
Our experiments show the effectiveness of the proposed approaches and the comparison with the existing state-of-the-art quantification methods shows that they are more accurate. JF - International Conference on Data Science and Advanced Analytics (IEEE DSAA'2015) PB - IEEE CY - Paris, France UR - http://www.giuliorossetti.net/about/wp-content/uploads/2015/12/main_DSAA.pdf ER - TY - JOUR T1 - A risk model for privacy in trajectory data JF - Journal of Trust Management Y1 - 2015 A1 - Anirban Basu A1 - Anna Monreale A1 - Roberto Trasarti A1 - Juan Camilo Corena A1 - Fosca Giannotti A1 - Dino Pedreschi A1 - Shinsaku Kiyomoto A1 - Yutaka Miyake A1 - Tadashi Yanagihara AB - Time sequence data relating to users, such as medical histories and mobility data, are good candidates for data mining, but often contain highly sensitive information. Different methods in privacy-preserving data publishing are utilised to release such private data so that individual records in the released data cannot be re-linked to specific users with a high degree of certainty. These methods provide theoretical worst-case privacy risks as measures of the privacy protection that they offer. However, often with many real-world data the worst-case scenario is too pessimistic and does not provide a realistic view of the privacy risks: the real probability of re-identification is often much lower than the theoretical worst-case risk. In this paper, we propose a novel empirical risk model for privacy which, in relation to the cost of privacy attacks, demonstrates better the practical risks associated with a privacy preserving data release. We show detailed evaluation of the proposed risk model by using k-anonymised real-world mobility data and then, we show how the empirical evaluation of the privacy risk has a different trend in synthetic data describing random movements. VL - 2 ER - TY - JOUR T1 - Anonymity preserving sequential pattern mining JF - Artif. Intell. 
Law Y1 - 2014 A1 - Anna Monreale A1 - Dino Pedreschi A1 - Ruggero G. Pensa A1 - Fabio Pinelli AB - The increasing availability of personal data of a sequential nature, such as time-stamped transaction or location data, enables increasingly sophisticated sequential pattern mining techniques. However, privacy is at risk if it is possible to reconstruct the identity of individuals from sequential data. Therefore, it is important to develop privacy-preserving techniques that support publishing of really anonymous data, without altering the analysis results significantly. In this paper we propose to apply the Privacy-by-design paradigm for designing a technological framework to counter the threats of undesirable, unlawful effects of privacy violation on sequence data, without obstructing the knowledge discovery opportunities of data mining technologies. First, we introduce a k-anonymity framework for sequence data, by defining the sequence linking attack model and its associated countermeasure, a k-anonymity notion for sequence datasets, which provides a formal protection against the attack. Second, we instantiate this framework and provide a specific method for constructing the k-anonymous version of a sequence dataset, which preserves the results of sequential pattern mining, together with several basic statistics and other analytical properties of the original data, including the clustering structure. A comprehensive experimental study on realistic datasets of process-logs, web-logs and GPS tracks is carried out, which empirically shows how, in our proposed method, the protection of privacy meets analytical utility. 
VL - 22 UR - http://dx.doi.org/10.1007/s10506-014-9154-6 ER - TY - CONF T1 - CF-inspired Privacy-Preserving Prediction of Next Location in the Cloud T2 - Cloud Computing Technology and Science (CloudCom), 2014 IEEE 6th International Conference on Y1 - 2014 A1 - Anirban Basu A1 - Juan Camilo Corena A1 - Anna Monreale A1 - Dino Pedreschi A1 - Fosca Giannotti A1 - Shinsaku Kiyomoto A1 - Vaidya, Jaideep A1 - Yutaka Miyake AB - Mobility data gathered from location sensors such as Global Positioning System (GPS) enabled phones and vehicles is valuable for spatio-temporal data mining for various location-based services (LBS). Such data is often considered sensitive and there exist many a mechanism for privacy preserving analyses of the data. Through various anonymisation mechanisms, it can be ensured with a high probability that a particular individual cannot be identified when mobility data is outsourced to third parties for analysis. However, challenges remain with the privacy of the queries on outsourced analysis results, especially when the queries are sent directly to third parties by end-users. Drawing inspiration from our earlier work in privacy preserving collaborative filtering (CF) and next location prediction, in this exploratory work, we propose a novel representation of trajectory data in the CF domain and experiment with a privacy preserving Slope One CF predictor. We present evaluations for the accuracy and the computational performance of our proposal using anonymised data gathered from real traffic data in the Italian cities of Pisa and Milan. One use-case is a third-party location-prediction-as-a-service deployed on a public cloud, which can respond to privacy-preserving queries while enabling data owners to build a rich predictor on the cloud. 
JF - Cloud Computing Technology and Science (CloudCom), 2014 IEEE 6th International Conference on PB - IEEE UR - http://dx.doi.org/10.1109/CloudCom.2014.114 ER - TY - CONF T1 - Fair pattern discovery T2 - Symposium on Applied Computing, {SAC} 2014, Gyeongju, Republic of Korea - March 24 - 28, 2014 Y1 - 2014 A1 - Sara Hajian A1 - Anna Monreale A1 - Dino Pedreschi A1 - Josep Domingo-Ferrer A1 - Fosca Giannotti AB - Data mining is gaining societal momentum due to the ever increasing availability of large amounts of human data, easily collected by a variety of sensing technologies. We are witnessing unprecedented opportunities for understanding human and societal behavior, which are unfortunately darkened by several risks for human rights: one of these is unfair discrimination based on the extracted patterns and profiles. Consider the case when a set of patterns extracted from the personal data of a population of individual persons is released for subsequent use in a decision making process, such as, e.g., granting or denying credit. Decision rules based on such patterns may lead to unfair discrimination, depending on what is represented in the training cases. In this context, we address the discrimination risks resulting from publishing frequent patterns. We present a set of pattern sanitization methods, one for each discrimination measure used in the legal literature, for fair (discrimination-protected) publishing of frequent pattern mining results. Our proposed pattern sanitization methods yield discrimination-protected patterns, while introducing reasonable (controlled) pattern distortion. Finally, the effectiveness of our proposals is assessed by extensive experiments. 
JF - Symposium on Applied Computing, {SAC} 2014, Gyeongju, Republic of Korea - March 24 - 28, 2014 UR - http://doi.acm.org/10.1145/2554850.2555043 ER - TY - CONF T1 - A Privacy Risk Model for Trajectory Data T2 - Trust Management {VIII} - 8th {IFIP} {WG} 11.11 International Conference, {IFIPTM} 2014, Singapore, July 7-10, 2014. Proceedings Y1 - 2014 A1 - Anirban Basu A1 - Anna Monreale A1 - Juan Camilo Corena A1 - Fosca Giannotti A1 - Dino Pedreschi A1 - Shinsaku Kiyomoto A1 - Yutaka Miyake A1 - Tadashi Yanagihara A1 - Roberto Trasarti AB - Time sequence data relating to users, such as medical histories and mobility data, are good candidates for data mining, but often contain highly sensitive information. Different methods in privacy-preserving data publishing are utilised to release such private data so that individual records in the released data cannot be re-linked to specific users with a high degree of certainty. These methods provide theoretical worst-case privacy risks as measures of the privacy protection that they offer. However, often with many real-world data the worst-case scenario is too pessimistic and does not provide a realistic view of the privacy risks: the real probability of re-identification is often much lower than the theoretical worst-case risk. In this paper we propose a novel empirical risk model for privacy which, in relation to the cost of privacy attacks, demonstrates better the practical risks associated with a privacy preserving data release. We show detailed evaluation of the proposed risk model by using k-anonymised real-world mobility data. JF - Trust Management {VIII} - 8th {IFIP} {WG} 11.11 International Conference, {IFIPTM} 2014, Singapore, July 7-10, 2014. 
Proceedings UR - http://dx.doi.org/10.1007/978-3-662-43813-8_9 ER - TY - JOUR T1 - Privacy-by-Design in Big Data Analytics and Social Mining JF - EPJ Data Science Y1 - 2014 A1 - Anna Monreale A1 - S Rinzivillo A1 - Francesca Pratesi A1 - Fosca Giannotti A1 - Dino Pedreschi AB - Privacy is an ever-growing concern in our society and is becoming a fundamental aspect to take into account when one wants to use, publish and analyze data involving human personal sensitive information. Unfortunately, it is increasingly hard to transform the data in a way that protects sensitive information: we live in the era of big data characterized by unprecedented opportunities to sense, store and analyze social data describing human activities in great detail and resolution. As a result, privacy preservation simply cannot be accomplished by de-identification alone. In this paper, we propose the privacy-by-design paradigm to develop technological frameworks for countering the threats of undesirable, unlawful effects of privacy violation, without obstructing the knowledge discovery opportunities of social mining and big data analytical technologies. Our main idea is to inscribe privacy protection into the knowledge discovery technology by design, so that the analysis incorporates the relevant privacy requirements from the start. VL - 10 N1 - 2014:10 ER - TY - CHAP T1 - Retrieving Points of Interest from Human Systematic Movements T2 - Software Engineering and Formal Methods Y1 - 2014 A1 - Riccardo Guidotti A1 - Anna Monreale A1 - S Rinzivillo A1 - Dino Pedreschi A1 - Fosca Giannotti AB - Human mobility analysis is emerging as an increasingly fundamental task for deeply understanding human behavior. In the last decade these kinds of studies have become feasible thanks to the massive increase in availability of mobility data. A crucial point for many mobility applications and analyses is to extract interesting locations for people. 
In this paper, we propose a novel methodology to efficiently retrieve significant places of interest from movement data. Using car drivers’ systematic movements we mine everyday interesting locations, that is, places around which people's lives gravitate. The outcomes provide empirical evidence that these places capture nearly the whole mobility even though they are generated only from abstractions of systematic movements. JF - Software Engineering and Formal Methods PB - Springer International Publishing ER - TY - JOUR T1 - Evolving networks: Eras and turning points JF - Intell. Data Anal. Y1 - 2013 A1 - Michele Berlingerio A1 - Michele Coscia A1 - Fosca Giannotti A1 - Anna Monreale A1 - Dino Pedreschi AB - Within the large body of research in complex network analysis, an important topic is the temporal evolution of networks. Existing approaches aim at analyzing the evolution on the global and the local scale, extracting properties of either the entire network or local patterns. In this paper, we focus on detecting clusters of temporal snapshots of a network, to be interpreted as eras of evolution. To this aim, we introduce a novel hierarchical clustering methodology, based on a dissimilarity measure (derived from the Jaccard coefficient) between two temporal snapshots of the network, able to detect the turning points at the beginning of the eras. We devise a framework to discover and browse the eras, in either a top-down or a bottom-up fashion, supporting the exploration of the evolution at any level of temporal resolution. We show how our approach applies to real networks and null models, by detecting eras in an evolving co-authorship graph extracted from a bibliographic dataset, a collaboration graph extracted from a cinema database, and a network extracted from a database of terrorist attacks; we illustrate how the discovered temporal clustering highlights the crucial moments when the networks witnessed profound changes in their structure. 
Our approach is finally boosted by introducing a meaningful labeling of the obtained clusters, such as the characterizing topics of each discovered era, thus adding a semantic dimension to our analysis. VL - 17 UR - http://dx.doi.org/10.3233/IDA-120566 ER - TY - CONF T1 - On multidimensional network measures T2 - SEDB 2013 Y1 - 2013 A1 - Matteo Magnani A1 - Anna Monreale A1 - Giulio Rossetti A1 - Fosca Giannotti AB - Networks, i.e., sets of interconnected entities, are ubiquitous, spanning disciplines as diverse as sociology, biology and computer science. The recent availability of large amounts of network data has thus provided a unique opportunity to develop models and analysis tools applicable to a wide range of scenarios. However, real-world phenomena are often more complex than existing graph data models. One relevant example concerns the numerous types of social relationships (or edges) that can be present between individuals in a social network. In this short paper we present a unified model and a set of measures recently developed to represent and analyze network data with multiple types of edges. JF - SEDB 2013 UR - https://www.researchgate.net/publication/256194479_On_multidimensional_network_measures ER - TY - CONF T1 - Privacy-Aware Distributed Mobility Data Analytics T2 - SEBD Y1 - 2013 A1 - Francesca Pratesi A1 - Anna Monreale A1 - Hui Wendy Wang A1 - S Rinzivillo A1 - Dino Pedreschi A1 - Gennady Andrienko A1 - Natalia Andrienko AB - We propose an approach to preserve privacy in an analytical processing within a distributed setting, and tackle the problem of obtaining aggregated information about vehicle traffic in a city from movement data collected by individual vehicles and shipped to a central server. Movement data are sensitive because they may describe typical movement behaviors and therefore be used for re-identification of individuals in a database. 
We provide a privacy-preserving framework for movement data aggregation based on trajectory generalization in a distributed environment. The proposed solution, based on the differential privacy model and on sketching techniques for efficient data compression, provides a formal data protection safeguard. Using real-life data, we demonstrate the effectiveness of our approach also in terms of data utility preserved by the data transformation. JF - SEBD CY - Roccella Jonica ER - TY - CHAP T1 - Privacy-Preserving Distributed Movement Data Aggregation T2 - Geographic Information Science at the Heart of Europe Y1 - 2013 A1 - Anna Monreale A1 - Hui Wendy Wang A1 - Francesca Pratesi A1 - S Rinzivillo A1 - Dino Pedreschi A1 - Gennady Andrienko A1 - Natalia Andrienko ED - Vandenbroucke, Danny ED - Bucher, Bénédicte ED - Crompvoets, Joep AB - We propose a novel approach to privacy-preserving analytical processing within a distributed setting, and tackle the problem of obtaining aggregated information about vehicle traffic in a city from movement data collected by individual vehicles and shipped to a central server. Movement data are sensitive because people’s whereabouts have the potential to reveal intimate personal traits, such as religious or sexual preferences, and may allow re-identification of individuals in a database. We provide a privacy-preserving framework for movement data aggregation based on trajectory generalization in a distributed environment. The proposed solution, based on the differential privacy model and on sketching techniques for efficient data compression, provides a formal data protection safeguard. Using real-life data, we demonstrate the effectiveness of our approach also in terms of data utility preserved by the data transformation. 
JF - Geographic Information Science at the Heart of Europe T3 - Lecture Notes in Geoinformation and Cartography PB - Springer International Publishing SN - 978-3-319-00614-7 UR - http://dx.doi.org/10.1007/978-3-319-00615-4_13 ER - TY - JOUR T1 - Privacy-Preserving Mining of Association Rules From Outsourced Transaction Databases JF - IEEE Systems Journal Y1 - 2013 A1 - Fosca Giannotti A1 - L.V.S. Lakshmanan A1 - Anna Monreale A1 - Dino Pedreschi A1 - Hui Wendy Wang AB - Spurred by developments such as cloud computing, there has been considerable recent interest in the paradigm of data mining-as-a-service. A company (data owner) lacking in expertise or computational resources can outsource its mining needs to a third party service provider (server). However, both the items and the association rules of the outsourced database are considered private property of the corporation (data owner). To protect corporate privacy, the data owner transforms its data and ships it to the server, sends mining queries to the server, and recovers the true patterns from the extracted patterns received from the server. In this paper, we study the problem of outsourcing the association rule mining task within a corporate privacy-preserving framework. We propose an attack model based on background knowledge and devise a scheme for privacy preserving outsourced mining. Our scheme ensures that each transformed item is indistinguishable with respect to the attacker's background knowledge, from at least k-1 other transformed items. Our comprehensive experiments on a very large and real transaction database demonstrate that our techniques are effective, scalable, and protect privacy. 
ER - TY - CONF T1 - Quantification Trees T2 - 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013 Y1 - 2013 A1 - Letizia Milli A1 - Anna Monreale A1 - Giulio Rossetti A1 - Fosca Giannotti A1 - Dino Pedreschi A1 - Fabrizio Sebastiani AB - In many applications there is a need to monitor how a population is distributed across different classes, and to track the changes in this distribution that derive from varying circumstances; an example of such an application is monitoring the percentage (or "prevalence") of unemployed people in a given region, or in a given age range, or at different time periods. When the membership of an individual in a class cannot be established deterministically, this monitoring activity requires classification. However, in the above applications the final goal is not determining which class each individual belongs to, but simply estimating the prevalence of each class in the unlabeled data. This task is called quantification. In a supervised learning framework we may estimate the distribution across the classes in a test set from a training set of labeled individuals. However, this may be suboptimal, since the distribution in the test set may be substantially different from that in the training set (a phenomenon called distribution drift). So far, quantification has mostly been addressed by learning a classifier optimized for individual classification and later adjusting the distribution it computes to compensate for its tendency to either under- or over-estimate the prevalence of the class. In this paper we propose instead to use a type of decision tree (quantification trees) optimized not for individual classification, but directly for quantification. Our experiments show that quantification trees are more accurate than existing state-of-the-art quantification methods, while retaining at the same time the simplicity and understandability of the decision tree framework. 
JF - 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013 UR - http://dx.doi.org/10.1109/ICDM.2013.122 ER - TY - CONF T1 - Anonymity: a Comparison between the Legal and Computer Science Perspectives. T2 - The 5th International Conference on Computers, Privacy, and Data Protection: “European Data Protection: Coming of Age” Y1 - 2012 A1 - S Mascetti A1 - Anna Monreale A1 - A Ricci A1 - A. Gerino AB - Privacy preservation has emerged as a major challenge in ICT. One possible solution for enforcing privacy is to guarantee anonymity. Indeed, according to international regulations, no restriction is applied to the handling of anonymous data. Consequently, in the past years the notion of anonymity has been extensively studied by two different communities: Law researchers and professionals that propose definitions of privacy regulations, and Computer Scientists attempting to provide technical solutions for enforcing the legal requirements. In this contribution we address the problem with an interdisciplinary approach, with the aim of encouraging reciprocal understanding and collaboration between researchers in the two areas. To achieve this, we compare the different notions of anonymity provided in the European data protection Law with the formal models proposed in Computer Science. This analysis allows us to identify the main similarities and differences between the two points of view, hence highlighting the need for a joint research effort. JF - The 5th International Conference on Computers, Privacy, and Data Protection: “European Data Protection: Coming of Age” ER - TY - CONF T1 - AUDIO: An Integrity Auditing Framework of Outlier-Mining-as-a-Service Systems. 
T2 - Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2012 Y1 - 2012 A1 - R.Liu A1 - Hui Wendy Wang A1 - Anna Monreale A1 - Dino Pedreschi A1 - Fosca Giannotti A1 - W Guo AB - Spurred by developments such as cloud computing, there has been considerable recent interest in the data-mining-as-a-service paradigm. Users lacking in expertise or computational resources can outsource their data and mining needs to a third-party service provider (server). Outsourcing, however, raises issues about result integrity: how can the data owner verify that the mining results returned by the server are correct? In this paper, we present AUDIO, an integrity auditing framework for the specific task of distance-based outlier mining outsourcing. It provides efficient and practical verification approaches to check both completeness and correctness of the mining results. The key idea of our approach is to insert a small amount of artificial tuples into the outsourced data; the artificial tuples will produce artificial outliers and non-outliers that do not exist in the original dataset. The server’s answer is verified by analyzing the presence of artificial outliers/non-outliers, obtaining a probabilistic guarantee of correctness and completeness of the mining result. Our empirical results show the effectiveness and efficiency of our method. 
JF - Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2012 ER - TY - CONF T1 - Classifying Trust/Distrust Relationships in Online Social Networks T2 - 2012 International Conference on Privacy, Security, Risk and Trust, {PASSAT} 2012, and 2012 International Conference on Social Computing, SocialCom 2012, Amsterdam, Netherlands, September 3-5, 2012 Y1 - 2012 A1 - Giacomo Bachi A1 - Michele Coscia A1 - Anna Monreale A1 - Fosca Giannotti AB - Online social networks are increasingly being used as places where communities gather to exchange information, form opinions, and collaborate in response to events. An aspect of this information exchange is how to determine if a source of social information can be trusted or not. Data mining literature addresses this problem. However, it usually employs social balance theories, by looking at small structures in complex networks known as triangles. This has proven effective in some cases, but it underperforms in the absence of context information about the relation and in more complex interactive structures. In this paper we address the problem of creating a framework for trust inference, able to infer the trust/distrust relationships in those relational environments that cannot be described by using the classical social balance theory. We do so by decomposing a trust network into its ego network components and mining the trust relationships on this ego network set, extending a well known graph mining algorithm. We test our framework on three public datasets describing trust relationships in the real world (from the social media Epinions, Slashdot and Wikipedia) and compare our results with the state of the art in trust inference, showing better performance where the social balance theory fails. 
JF - 2012 International Conference on Privacy, Security, Risk and Trust, {PASSAT} 2012, and 2012 International Conference on Social Computing, SocialCom 2012, Amsterdam, Netherlands, September 3-5, 2012 UR - http://dx.doi.org/10.1109/SocialCom-PASSAT.2012.115 ER - TY - CONF T1 - Injecting Discrimination and Privacy Awareness Into Pattern Discovery T2 - 12th {IEEE} International Conference on Data Mining Workshops, {ICDM} Workshops, Brussels, Belgium, December 10, 2012 Y1 - 2012 A1 - Sara Hajian A1 - Anna Monreale A1 - Dino Pedreschi A1 - Josep Domingo-Ferrer A1 - Fosca Giannotti AB - Data mining is gaining societal momentum due to the ever increasing availability of large amounts of human data, easily collected by a variety of sensing technologies. Data mining comes with unprecedented opportunities and risks: a deeper understanding of human behavior and how our society works is darkened by a greater chance of privacy intrusion and unfair discrimination based on the extracted patterns and profiles. Although methods independently addressing privacy or discrimination in data mining have been proposed in the literature, in this context we argue that privacy and discrimination risks should be tackled together, and we present a methodology for doing so while publishing frequent pattern mining results. We describe a combined pattern sanitization framework that yields both privacy and discrimination-protected patterns, while introducing reasonable (controlled) pattern distortion. 
JF - 12th {IEEE} International Conference on Data Mining Workshops, {ICDM} Workshops, Brussels, Belgium, December 10, 2012 UR - http://dx.doi.org/10.1109/ICDMW.2012.51 ER - TY - JOUR T1 - Multidimensional networks: foundations of structural analysis JF - World Wide Web Y1 - 2012 A1 - Michele Berlingerio A1 - Michele Coscia A1 - Fosca Giannotti A1 - Anna Monreale A1 - Dino Pedreschi AB - Complex networks have been receiving increasing attention from the scientific community, thanks also to the increasing availability of real-world network data. So far, network analysis has focused on the characterization and measurement of local and global properties of graphs, such as diameter, degree distribution, centrality, and so on. In recent years, the multidimensional nature of many real world networks has been pointed out, i.e. many networks containing multiple connections between any pair of nodes have been analyzed. Although the importance of analyzing this kind of network was recognized by previous works, a complete framework for multidimensional network analysis is still missing. Such a framework would enable analysts to study different phenomena, which can be either the generalization to the multidimensional setting of what happens in monodimensional networks, or a new class of phenomena induced by the additional degree of complexity that multidimensionality provides in real networks. The aim of this paper is then to give the basis for multidimensional network analysis: we present a solid repertoire of basic concepts and analytical measures, which take into account the general structure of multidimensional networks. We tested our framework on different real world multidimensional networks, showing the validity and the meaningfulness of the measures introduced, which are able to extract important and non-random information about complex phenomena in such networks. 
VL - Volume 15 / 2012 UR - http://www.springerlink.com/content/f774289854430410/abstract/ ER - TY - JOUR T1 - C-safety: a framework for the anonymization of semantic trajectories JF - Transactions on Data Privacy Y1 - 2011 A1 - Anna Monreale A1 - Roberto Trasarti A1 - Dino Pedreschi A1 - Chiara Renso A1 - Vania Bogorny AB - The increasing abundance of data about the trajectories of personal movement is opening new opportunities for analyzing and mining human mobility. However, new risks emerge since it opens new ways of intruding into personal privacy. Representing the personal movements as sequences of places visited by a person during her/his movements - semantic trajectory - poses great privacy threats. In this paper we propose a privacy model defining the attack model of semantic trajectory linking and a privacy notion, called c-safety based on a generalization of visited places based on a taxonomy. This method provides an upper bound to the probability of inferring that a given person, observed in a sequence of non-sensitive places, has also visited any sensitive location. Coherently with the privacy model, we propose an algorithm for transforming any dataset of semantic trajectories into a c-safe one. We report a study on two real-life GPS trajectory datasets to show how our algorithm preserves interesting quality/utility measures of the original trajectories, when mining semantic trajectories sequential pattern mining results. We also empirically measure how the probability that the attacker’s inference succeeds is much lower than the theoretical upper bound established. 
VL - 4 UR - http://dl.acm.org/citation.cfm?id=2019319&CFID=803961971&CFTOKEN=35994039 ER - TY - CONF T1 - Foundations of Multidimensional Network Analysis T2 - ASONAM Y1 - 2011 A1 - Michele Berlingerio A1 - Michele Coscia A1 - Fosca Giannotti A1 - Anna Monreale A1 - Dino Pedreschi AB - Complex networks have been receiving increasing attention from the scientific community, thanks also to the increasing availability of real-world network data. In recent years, the multidimensional nature of many real world networks has been pointed out, i.e. many networks containing multiple connections between any pair of nodes have been analyzed. Although the importance of analyzing this kind of network was recognized by previous works, a complete framework for multidimensional network analysis is still missing. Such a framework would enable analysts to study different phenomena, which can be either the generalization to the multidimensional setting of what happens in monodimensional networks, or a new class of phenomena induced by the additional degree of complexity that multidimensionality provides in real networks. The aim of this paper is then to give the basis for multidimensional network analysis: we develop a solid repertoire of basic concepts and analytical measures, which takes into account the general structure of multidimensional networks. We tested our framework on a real world multidimensional network, showing the validity and the meaningfulness of the measures introduced, which are able to extract important, nonrandom information about complex phenomena. JF - ASONAM ER - TY - CONF T1 - Privacy-preserving data mining from outsourced databases. T2 - the 3rd International Conference on Computers, Privacy, and Data Protection: An element of choice Y1 - 2011 A1 - Fosca Giannotti A1 - L.V.S. 
Lakshmanan A1 - Anna Monreale A1 - Dino Pedreschi A1 - Hui Wendy Wang AB - Spurred by developments such as cloud computing, there has been considerable recent interest in the paradigm of data mining-as-a-service: a company (data owner) lacking in expertise or computational resources can outsource its mining needs to a third party service provider (server). However, both the outsourced database and the knowledge extracted from it by data mining are considered private property of the data owner. To protect corporate privacy, the data owner transforms its data and ships it to the server, sends mining queries to the server, and recovers the true patterns from the extracted patterns received from the server. In this paper, we study the problem of outsourcing a data mining task within a corporate privacy-preserving framework. We propose a scheme for privacy-preserving outsourced mining which offers a formal protection against information disclosure, and show that the data owner can recover the correct data mining results efficiently. JF - the 3rd International Conference on Computers, Privacy, and Data Protection: An element of choice ER - TY - JOUR T1 - The pursuit of hubbiness: Analysis of hubs in large multidimensional networks JF - J. Comput. Science Y1 - 2011 A1 - Michele Berlingerio A1 - Michele Coscia A1 - Fosca Giannotti A1 - Anna Monreale A1 - Dino Pedreschi AB - Hubs are highly connected nodes within a network. In complex network analysis, hubs have been widely studied, and are at the basis of many tasks, such as web search and epidemic outbreak detection. In reality, networks are often multidimensional, i.e., there can exist multiple connections between any pair of nodes. In this setting, the concept of hub depends on the multiple dimensions of the network, whose interplay becomes crucial for the connectedness of a node. In this paper, we characterize multidimensional hubs. 
We consider the multidimensional generalization of the degree and introduce a new class of measures, that we call Dimension Relevance, aimed at analyzing the importance of different dimensions for the hubbiness of a node. We assess the meaningfulness of our measures by comparing them on real networks and null models, then we study the interplay among dimensions and their effect on node connectivity. Our findings show that: (i) multidimensional hubs do exist and their characterization yields interesting insights and (ii) it is possible to detect the most influential dimensions that cause the different hub behaviors. We demonstrate the usefulness of multidimensional analysis in three real world domains: detection of ambiguous query terms in a word–word query log network, outlier detection in a social network, and temporal analysis of behaviors in a co-authorship network. VL - 2 ER - TY - CONF T1 - As Time Goes by: Discovering Eras in Evolving Social Networks T2 - PAKDD (1) Y1 - 2010 A1 - Michele Berlingerio A1 - Michele Coscia A1 - Fosca Giannotti A1 - Anna Monreale A1 - Dino Pedreschi AB - Within the large body of research in complex network analysis, an important topic is the temporal evolution of networks. Existing approaches aim at analyzing the evolution on the global and the local scale, extracting properties of either the entire network or local patterns. In this paper, we focus instead on detecting clusters of temporal snapshots of a network, to be interpreted as eras of evolution. To this aim, we introduce a novel hierarchical clustering methodology, based on a dissimilarity measure (derived from the Jaccard coefficient) between two temporal snapshots of the network. We devise a framework to discover and browse the eras, either in a top-down or a bottom-up fashion, supporting the exploration of the evolution at any level of temporal resolution.
We show how our approach applies to real networks, by detecting eras in an evolving co-authorship graph extracted from a bibliographic dataset; we illustrate how the discovered temporal clustering highlights the crucial moments when the network had profound changes in its structure. Our approach is finally boosted by introducing a meaningful labeling of the obtained clusters, such as the characterizing topics of each discovered era, thus adding a semantic dimension to our analysis. JF - PAKDD (1) ER - TY - CONF T1 - Discovering Eras in Evolving Social Networks (Extended Abstract) T2 - SEBD Y1 - 2010 A1 - Michele Berlingerio A1 - Michele Coscia A1 - Fosca Giannotti A1 - Anna Monreale A1 - Dino Pedreschi JF - SEBD ER - TY - CONF T1 - Exploring Real Mobility Data with M-Atlas T2 - ECML/PKDD (3) Y1 - 2010 A1 - Roberto Trasarti A1 - S Rinzivillo A1 - Fabio Pinelli A1 - Mirco Nanni A1 - Anna Monreale A1 - Chiara Renso A1 - Dino Pedreschi A1 - Fosca Giannotti AB - Research on moving-object data analysis has been recently fostered by the widespread diffusion of new techniques and systems for monitoring, collecting and storing location aware data, generated by a wealth of technological infrastructures, such as GPS positioning and wireless networks. These have made available massive repositories of spatio-temporal data recording human mobile activities, that call for suitable analytical methods, capable of enabling the development of innovative, location-aware applications. JF - ECML/PKDD (3) ER - TY - Generic T1 - A Generalisation-based Approach to Anonymising Movement Data T2 - 13th AGILE conference on Geographic Information Science Y1 - 2010 A1 - Gennady Andrienko A1 - Natalia Andrienko A1 - Fosca Giannotti A1 - Anna Monreale A1 - Dino Pedreschi A1 - S Rinzivillo AB - The possibility to collect, store, disseminate, and analyze data about movements of people raises very serious privacy concerns, given the sensitivity of the information about personal positions. 
In particular, sensitive information about individuals can be uncovered with the use of data mining and visual analytics methods. In this paper we present a method for the generalization of trajectory data that can be adopted as the first step of a process to obtain k-anonymity in spatio-temporal datasets. We ran a preliminary set of experiments on a real-world trajectory dataset, demonstrating that this method of generalization of trajectories preserves the clustering analysis results. JF - 13th AGILE conference on Geographic Information Science UR - http://agile2010.dsi.uminho.pt/pen/ShortPapers_PDF%5C122_DOC.pdf ER - TY - CONF T1 - Location Prediction through Trajectory Pattern Mining (Extended Abstract) T2 - SEBD Y1 - 2010 A1 - Anna Monreale A1 - Fabio Pinelli A1 - Roberto Trasarti A1 - Fosca Giannotti JF - SEBD ER - TY - JOUR T1 - Movement Data Anonymity through Generalization JF - Transactions on Data Privacy Y1 - 2010 A1 - Anna Monreale A1 - Gennady Andrienko A1 - Natalia Andrienko A1 - Fosca Giannotti A1 - Dino Pedreschi A1 - S Rinzivillo A1 - Stefan Wrobel AB - Wireless networks and mobile devices, such as mobile phones and GPS receivers, sense and track the movements of people and vehicles, producing society-wide mobility databases. This is a challenging scenario for data analysis and mining. On the one hand, exciting opportunities arise out of discovering new knowledge about human mobile behavior, and thus fuel intelligent info-mobility applications. On the other hand, new privacy concerns arise when mobility data are published. The risk is particularly high for GPS trajectories, which represent movement of a very high precision and spatio-temporal resolution: the de-identification of such trajectories (i.e., forgetting the ID of their associated owners) is only a weak protection, as generally it is possible to re-identify a person by observing her routine movements.
In this paper we propose a method for achieving true anonymity in a dataset of published trajectories, by defining a transformation of the original GPS trajectories based on spatial generalization and k-anonymity. The proposed method offers a formal data protection safeguard, quantified as a theoretical upper bound to the probability of re-identification. We conduct a thorough study on a real-life GPS trajectory dataset, and provide strong empirical evidence that the proposed anonymity techniques achieve the conflicting goals of data utility and data privacy. In practice, the achieved anonymity protection is much stronger than the theoretical worst case, while the quality of the cluster analysis on the trajectory data is preserved. VL - 3 UR - http://www.tdp.cat/issues/abs.a045a10.php ER - TY - CONF T1 - Preserving privacy in semantic-rich trajectories of human mobility T2 - SPRINGL Y1 - 2010 A1 - Anna Monreale A1 - Roberto Trasarti A1 - Chiara Renso A1 - Dino Pedreschi A1 - Vania Bogorny AB - The increasing abundance of data about the trajectories of personal movement is opening up new opportunities for analyzing and mining human mobility, but new risks emerge since it opens new ways of intruding into personal privacy. Representing the personal movements as sequences of places visited by a person during her/his movements - semantic trajectory - poses even greater privacy threats w.r.t. raw geometric location data. In this paper we propose a privacy model defining the attack model of semantic trajectory linking, together with a privacy notion, called c-safety. This method provides an upper bound to the probability of inferring that a given person, observed in a sequence of nonsensitive places, has also stopped in any sensitive location. Coherently with the privacy model, we propose an algorithm for transforming any dataset of semantic trajectories into a c-safe one. 
We report a study on a real-life GPS trajectory dataset to show how our algorithm preserves interesting quality/utility measures of the original trajectories, such as sequential pattern mining results. JF - SPRINGL ER - TY - CONF T1 - Towards Discovery of Eras in Social Networks T2 - M3SN 2010 Workshop, in conjunction with ICDE2010 Y1 - 2010 A1 - Michele Berlingerio A1 - Michele Coscia A1 - Fosca Giannotti A1 - Anna Monreale A1 - Dino Pedreschi AB - In the last decades, much research has been devoted to topics related to Social Network Analysis. One important direction in this area is to analyze the temporal evolution of a network. So far, previous approaches analyzed this setting at both the global and the local level. In this paper, we focus on finding a way to detect temporal eras in an evolving network. We pose the basis for a general framework that aims at helping the analyst in browsing the temporal clusters both in a top-down and bottom-up way, exploring the network at any level of temporal detail. We show the effectiveness of our approach on real data, by applying our proposed methodology to a co-authorship network extracted from a bibliographic dataset. Our first results are encouraging, and open the way for the definition and implementation of a general framework for discovering eras in evolving social networks. JF - M3SN 2010 Workshop, in conjunction with ICDE2010 ER - TY - Generic T1 - Anonymous Sequences from Trajectory Data T2 - 17th Italian Symposium on Advanced Database Systems Y1 - 2009 A1 - Ruggero G.
Pensa A1 - Anna Monreale A1 - Fabio Pinelli A1 - Dino Pedreschi JF - 17th Italian Symposium on Advanced Database Systems CY - Camogli, Italy ER - TY - CONF T1 - Movement data anonymity through generalization T2 - Proceedings of the 2nd SIGSPATIAL ACM GIS 2009 International Workshop on Security and Privacy in GIS and LBS Y1 - 2009 A1 - Gennady Andrienko A1 - Natalia Andrienko A1 - Fosca Giannotti A1 - Anna Monreale A1 - Dino Pedreschi AB - In recent years, spatio-temporal and moving objects databases have gained considerable interest, due to the diffusion of mobile devices (e.g., mobile phones, RFID devices and GPS devices) and of new applications, where the discovery of consumable, concise, and applicable knowledge is the key step. Clearly, in these applications privacy is a concern, since models extracted from this kind of data can reveal the behavior of groups of individuals, thus compromising their privacy. Movement data present a new challenge for the privacy-preserving data mining community because of their spatial and temporal characteristics. In this position paper we briefly present an approach for the generalization of movement data that can be adopted for obtaining k-anonymity in spatio-temporal datasets; specifically, it can be used to realize a framework for publishing spatio-temporal data while preserving privacy. We ran a preliminary set of experiments on a real-world trajectory dataset, demonstrating that this method of generalization of trajectories preserves the clustering analysis results.
JF - Proceedings of the 2nd SIGSPATIAL ACM GIS 2009 International Workshop on Security and Privacy in GIS and LBS PB - ACM ER - TY - Generic T1 - WhereNext: a Location Predictor on Trajectory Pattern Mining T2 - 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining Y1 - 2009 A1 - Anna Monreale A1 - Fabio Pinelli A1 - Roberto Trasarti A1 - Fosca Giannotti AB - The pervasiveness of mobile devices and location based services is leading to an increasing volume of mobility data. This side effect provides the opportunity for innovative methods that analyse the behaviors of movements. In this paper we propose WhereNext, which is a method aimed at predicting with a certain level of accuracy the next location of a moving object. The prediction uses previously extracted movement patterns named Trajectory Patterns, which are a concise representation of behaviors of moving objects as sequences of regions frequently visited with a typical travel time. A decision tree, named T-pattern Tree, is built and evaluated with a formal training and test process. The tree is learned from the Trajectory Patterns that hold a certain area and it may be used as a predictor of the next location of a new trajectory finding the best matching path in the tree. Three different best matching methods to classify a new moving object are proposed and their impact on the quality of prediction is studied extensively. Using Trajectory Patterns as predictive rules has the following implications: (I) the learning depends on the movement of all available objects in a certain area instead of on the individual history of an object; (II) the prediction tree intrinsically contains the spatio-temporal properties that have emerged from the data and this allows us to define matching methods that strictly depend on the properties of such movements. In addition, we propose a set of other measures, that evaluate a priori the predictive power of a set of Trajectory Patterns.
These measures were tuned on a real life case study. Finally, an exhaustive set of experiments and results on the real dataset are presented. JF - 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining ER - TY - Generic T1 - Location prediction within the mobility data analysis environment Daedalus T2 - First International Workshop on Computational Transportation Science Y1 - 2008 A1 - Fabio Pinelli A1 - Anna Monreale A1 - Roberto Trasarti A1 - Fosca Giannotti AB - In this paper we propose a method to predict the next location of a moving object based on two recent results in the GeoPKDD project: DAEDALUS, a mobility data analysis environment and Trajectory Pattern, a sequential pattern mining algorithm with temporal annotation integrated in DAEDALUS. The first one is a DMQL environment for moving objects, where both data and patterns can be represented. The second one extracts movement patterns as sequences of movements between locations with typical travel times. This paper proposes a prediction method which uses the local models extracted by Trajectory Pattern to build a global model called Prediction Tree. The future location of a moving object is predicted by visiting the tree and calculating the best matching function. The integration within the DAEDALUS system supports an interactive construction of the predictor on top of a set of spatio-temporal patterns. Other proposals in the literature base the definition of prediction methods for future location of a moving object on previously extracted frequent patterns. They use the recent history of movements of the object itself and often use time only to order the events. Our work uses the movements of all moving objects in a certain area to learn a classifier built on the mined trajectory patterns, which are intrinsically equipped with temporal information.
JF - First International Workshop on Computational Transportation Science CY - Dublin, Ireland ER - TY - CONF T1 - Pattern-Preserving k-Anonymization of Sequences and its Application to Mobility Data Mining T2 - PiLBA Y1 - 2008 A1 - Ruggero G. Pensa A1 - Anna Monreale A1 - Fabio Pinelli A1 - Dino Pedreschi AB - Sequential pattern mining is a major research field in knowledge discovery and data mining. Thanks to the increasing availability of transaction data, it is now possible to provide new and improved services based on users’ and customers’ behavior. However, this puts the citizen’s privacy at risk. Thus, it is important to develop new privacy-preserving data mining techniques that do not alter the analysis results significantly. In this paper we propose a new approach for anonymizing sequential data by hiding infrequent, and thus potentially sensitive, subsequences. Our approach guarantees that the disclosed data are k-anonymous and preserve the quality of extracted patterns. An application to a real-world moving object database is presented, which shows the effectiveness of our approach also in complex contexts. JF - PiLBA UR - https://air.unimi.it/retrieve/handle/2434/52786/106397/ProceedingsPiLBA08.pdf#page=44 ER -