Determining the Impact of Missing Values on Blocking in Record Linkage

dc.contributor.ORCID: 0000-0001-6423-4533 (Kantarcioglu, M)
dc.contributor.author: Anindya, Imrul Chowdhury
dc.contributor.author: Kantarcioglu, Murat
dc.contributor.author: Malin, B.
dc.contributor.utdAuthor: Anindya, Imrul Chowdhury
dc.contributor.utdAuthor: Kantarcioglu, Murat
dc.description: Due to copyright restrictions, full-text access from Treasures at UT Dallas is restricted to current UTD affiliates (use the provided Link to Article).
dc.description.abstract: Record linkage is the process of integrating information from the same underlying entity across disparate data sets. This process, which is increasingly utilized to build accurate representations of individuals and organizations for a variety of applications, ranging from credit worthiness assessments to continuity of medical care, can be computationally intensive because it requires comparing large quantities of records over a range of attributes. To reduce the amount of computation in record linkage in big data settings, blocking methods, which are designed to limit the number of record pair comparisons that need to be performed, are critical for scaling up the record linkage process. These methods group together potential matches into blocks, often using a subset of attributes, before a final comparator function predicts which record pairs within the blocks correspond to matches. Yet data corruption and missing values adversely influence the performance of blocking methods (e.g., they may cause some matching records not to be placed in the same block). While there has been some investigation into the impact of missing values on general record linkage techniques (e.g., the comparator function), no study has addressed the impact of missing values on blocking methods. To address this issue, in this work, we systematically perform a detailed empirical analysis of the individual and joint impact of missing values and data corruption on different blocking methods using realistic data sets. Our results show that blocking approaches that do not depend on a single type of blocking attribute are more robust against missing values. In addition, our results indicate that blocking parameters must be chosen carefully for different blocking techniques. © Springer Nature Switzerland AG 2019.
dc.description.department: Erik Jonsson School of Engineering and Computer Science
dc.identifier.bibliographicCitation: Anindya, I. C., M. Kantarcioglu, and B. Malin. 2019. "Determining the impact of missing values on blocking in record linkage." Advances in Knowledge Discovery and Data Mining (Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11441: 262-274, doi: 10.1007/978-3-030-16142-2_21
dc.publisher: Springer Verlag
dc.relation.isPartOf: Advances in Knowledge Discovery and Data Mining
dc.rights: ©2019 Springer Nature Switzerland AG
dc.subject: Comparator circuits
dc.subject: Data mining
dc.subject: Record linkage
dc.title: Determining the Impact of Missing Values on Blocking in Record Linkage
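
The abstract notes that a missing value can keep two matching records out of the same block, and that blocking on more than one type of attribute is more robust. The following is a minimal illustrative sketch of that idea, not the authors' implementation or any of the blocking methods they evaluated. The records, field names (`surname`, `zip`), and helper functions are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key_fn):
    """Group record ids by a blocking key and return the candidate pairs."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        key = key_fn(rec)
        if key is not None:  # a record with a missing key joins no block
            blocks[key].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def surname_key(rec):
    # Hypothetical key: first three letters of the surname, lowercased.
    value = rec.get("surname")
    return value[:3].lower() if value else None


def zip_key(rec):
    # Hypothetical key: the full postal code.
    return rec.get("zip") or None


# Toy data: a1/b1 and a2/b2 are true matches; each b record lacks one field.
records = {
    "a1": {"surname": "Smith", "zip": "75080"},
    "b1": {"surname": None, "zip": "75080"},
    "a2": {"surname": "Jones", "zip": "10001"},
    "b2": {"surname": "Jones", "zip": None},
}

# Blocking on one attribute only misses the pair whose surname is missing.
single_pass = block_pairs(records, surname_key)

# Taking the union of two passes over different attributes recovers both pairs.
multi_pass = single_pass | block_pairs(records, zip_key)

print(sorted(single_pass))
print(sorted(multi_pass))
```

In this toy case the surname-only pass yields one candidate pair, while the union of the surname and postal-code passes yields both true matches, at the cost of generating candidates from two passes instead of one.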

