Understanding and Mitigating Privacy Risks Raised by Record Linkage








Record linkage is the process of combining data that belong to the same entity but may lack a common identifier and are often dispersed across multiple repositories. Although record linkage has legitimate uses in data de-duplication and data integration, it poses a serious privacy risk: numerous demonstrative attacks show that dedicated adversaries can use the technique to infer sensitive information about their target entities. In this dissertation, we analyze the extent to which record linkage poses a privacy risk given the current state of data availability and data-sharing policies, and then present some insights on how to mitigate this risk. We start by discussing the “missing value problem” in record linkage. We analyze the impact of this problem on a few widely used “blocking methods,” which play an important role in the timely completion of any sizable record linkage task. By experimenting on real-world datasets, we provide guidance on choosing appropriate blocking methods and their parameters in the presence of missing values. Next, we discuss a cost-constrained multi-dataset linkage problem in which adversaries link only a subset of the available datasets to limit their purchasing cost while optimizing the expected utility, measured by the quality of the inferred sensitive information. We propose a few metadata-driven heuristics that adversaries could use to optimally choose the datasets for linkage. By simulating a few realistic scenarios for this multi-dataset linkage task, we analyze the efficacy of the proposed heuristics and the extent to which adversaries can accurately find sensitive information, and thereby quantify the privacy risk. Finally, we show how machine learning models could be used to predict undocumented personal attributes which, if included in the record linkage process, may further increase the privacy risk.
In particular, we use machine learning to reveal the real-world identities of online entities (e.g., Twitter users) with the help of an auxiliary data source that already contains their identities (e.g., a voter registration database). We train state-of-the-art machine learning models on the unstructured data (e.g., tweets) generated by the online entities to predict their personal attributes (e.g., gender, age, race, political orientation, etc.) and combine these predicted attributes with their given attributes (e.g., Twitter name, location, etc.) to “re-identify” them in the auxiliary data source. We analyze the severity of the re-identification risk raised by this technique and offer some insights on how to mitigate it. We believe our work will provide a deeper understanding of the privacy risk posed by record linkage and guide policymakers in their efforts to take preventive measures.
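
As an illustration of the “missing value problem” mentioned above, the following is a minimal sketch of standard blocking, in which only records that share a blocking key are compared. It is not the dissertation's implementation: the field names (`surname`, `zip`), the key definition, and the sample records are all hypothetical, chosen only to show how a missing key field can silently drop a true match from the candidate set.

```python
from collections import defaultdict


def blocking_key(record):
    """Hypothetical key: first letter of the surname plus the ZIP code.

    Returns None when either field is missing, so the record cannot be
    assigned to any block.
    """
    surname, zip_code = record.get("surname"), record.get("zip")
    if not surname or not zip_code:
        return None
    return (surname[0].lower(), zip_code)


def candidate_pairs(left, right):
    """Standard blocking: pair only records that share a blocking key."""
    blocks = defaultdict(list)
    for rec in right:
        key = blocking_key(rec)
        if key is not None:
            blocks[key].append(rec["id"])

    pairs = set()
    for rec in left:
        key = blocking_key(rec)
        if key is not None:
            for right_id in blocks.get(key, []):
                pairs.add((rec["id"], right_id))
    return pairs


# Illustrative records (assumed data); a2 and b2 describe the same person,
# but a2 is missing its ZIP code.
left = [
    {"id": "a1", "surname": "Nguyen", "zip": "75080"},
    {"id": "a2", "surname": "Patel", "zip": None},
]
right = [
    {"id": "b1", "surname": "Nguyen", "zip": "75080"},
    {"id": "b2", "surname": "Patel", "zip": "75252"},
]

pairs = candidate_pairs(left, right)
print(pairs)
```

In this sketch the pair (`a2`, `b2`) never reaches the comparison stage because `a2` has no usable key, which is the kind of effect that the choice of blocking method and its parameters has to account for.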



Twitter (Firm), Data integration (Computer science), Information services, Computer security, Voter registration, Heuristic