Big Data Sanitization Using Scalable In-Memory Frameworks






Data is being collected at a scale that was hard to imagine even a few decades ago. Data collection and analysis techniques have evolved alongside it and have enabled researchers to advance many fields, such as medical science. Although health data can have a huge impact on the success of future research, it is usually distributed among various stakeholders. Organizations need to share this data to help research move forward, but health data sharing is a regulated domain. Due to privacy concerns, the U.S. Department of Health and Human Services (HHS) has taken steps to protect the privacy of individuals by regulating data sharing through the Health Insurance Portability and Accountability Act of 1996 (HIPAA). HIPAA restricts publishers from sharing identifying information as well as any auxiliary information that can be used to re-identify records.
To make data sharing compliant with HIPAA, various data privacy protection techniques have evolved. Differential privacy techniques focus on maximizing query accuracy in statistical databases while minimizing the “risk” of record identification, whereas data anonymization allows the publisher to share the original data at a lower precision, e.g., sharing the attribute value of age as 25-35 instead of 30. These techniques are considered industry standards. Newer risk-based models determine the anonymization level of each record based on its hidden “risk” of re-identification. With the constantly increasing number of sanitization requests involving big data, sanitization algorithms need to be adapted for distributed computing frameworks. Frameworks like Hadoop-MapReduce achieve parallelism by distributing tasks across multiple machines and executing them in parallel. Apache Spark is an in-memory distributed framework that builds on the Hadoop-MapReduce model and supports data caching, making it a more suitable choice for iterative anonymization algorithms. This study focuses on developing distributed in-memory data sanitization techniques. To extend traditional k-anonymity methods, we implemented the Mondrian k-anonymization algorithm for Apache Spark. The Mondrian algorithm performs multidimensional partitioning cuts until the data cannot be divided further without violating the k-anonymity property. We also propose a one-pass anonymization algorithm based on locality-sensitive hashing (LSH), in which LSH functions are used to form clusters of size k and a summary statistic is computed for each cluster.
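As an illustration of the partitioning idea described above, the following is a minimal single-machine sketch in plain Python, not the Spark implementation developed in this study. It assumes numeric quasi-identifiers, a median cut, and a widest-raw-range heuristic for choosing the cut dimension; the function names `mondrian` and `generalize` and the sample records are hypothetical.

```python
from statistics import median


def mondrian(records, quasi_ids, k):
    """Split records recursively until no cut leaves both sides with >= k records."""
    # Assumption: try the attribute with the widest raw value range first.
    dims = sorted(
        quasi_ids,
        key=lambda q: max(r[q] for r in records) - min(r[q] for r in records),
        reverse=True,
    )
    for q in dims:
        cut = median(r[q] for r in records)
        left = [r for r in records if r[q] <= cut]
        right = [r for r in records if r[q] > cut]
        # A cut is allowed only if both sides still satisfy k-anonymity.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, quasi_ids, k) + mondrian(right, quasi_ids, k)
    # No allowable cut remains: this partition is one equivalence class.
    return [records]


def generalize(partition, quasi_ids):
    """Replace each quasi-identifier value with the min-max range of its partition."""
    ranges = {
        q: (min(r[q] for r in partition), max(r[q] for r in partition))
        for q in quasi_ids
    }
    return [
        {**r, **{q: f"{lo}-{hi}" for q, (lo, hi) in ranges.items()}}
        for r in partition
    ]


# Hypothetical sample data: six records with two quasi-identifiers.
records = [
    {"age": 25, "zip": 75080},
    {"age": 27, "zip": 75081},
    {"age": 30, "zip": 75082},
    {"age": 35, "zip": 75083},
    {"age": 41, "zip": 75080},
    {"age": 52, "zip": 75081},
]
quasi_ids = ["age", "zip"]
partitions = mondrian(records, quasi_ids, k=2)
released = [row for p in partitions for row in generalize(p, quasi_ids)]
```

Every partition returned by `mondrian` holds at least k records, so each released row shares its generalized age and zip ranges with at least k - 1 other rows. A distributed version would additionally have to decide how partitions are assigned to workers, which this sketch does not address.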
To support newer data anonymization methods, we implement an in-memory version of a risk-estimation-based anonymization algorithm that leverages a game-theoretic approach to decide the optimal generalization level for each record. We then propose a hybrid risk anonymization algorithm that uses LSH bucketing to minimize the number of times the risk estimation algorithm has to be executed. To support online sanitization, we propose an aspect-oriented approach for modifying the computation of Apache Spark RDDs at runtime. We show how an aspect can suppress identifier fields at runtime based on a predefined policy. With evolving functional requirements, such as within-dataset vs. within-population anonymization, centralized vs. distributed anonymization, and risk-based vs. strict k-anonymization, it is crucial to select the method that fits the requirement. This study offers different solutions that are suitable for different functional requirements. The analysis and comparison of the above methods would enable data publishers to make anonymization decisions that are efficient in computation cost.
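The policy-driven suppression mentioned above can be illustrated with a loose analogy in plain Python. This sketch does not use Spark or an aspect-oriented framework; a decorator stands in for the aspect's advice, and the policy set, the mask value, and the names `suppress_identifiers` and `compute_partition` are all hypothetical.

```python
from functools import wraps

# Hypothetical policy: field names treated as direct identifiers.
POLICY = {"name", "ssn"}


def suppress_identifiers(policy, mask="*"):
    """Wrap a record-producing function so identifier fields are masked on output."""
    def decorator(compute):
        @wraps(compute)
        def wrapper(*args, **kwargs):
            # Intercept each record as it is produced and mask policy fields.
            for record in compute(*args, **kwargs):
                yield {
                    field: (mask if field in policy else value)
                    for field, value in record.items()
                }
        return wrapper
    return decorator


@suppress_identifiers(POLICY)
def compute_partition(rows):
    """Stand-in for a computation that yields the records of one partition."""
    yield from rows


rows = [{"name": "Alice", "ssn": "123-45-6789", "age": 30}]
sanitized = list(compute_partition(rows))
```

The wrapped function itself is unchanged and unaware of the policy, which is the property the aspect-oriented design aims for: sanitization is applied around the existing computation rather than written into it.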



Big data, SPARK (Computer program language), Data protection, Hashing (Computer science)


Copyright ©2017 is held by the author. Digital access to this material is made possible by the Eugene McDermott Library. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.