Determining the Impact of Missing Values on Blocking in Record Linkage

Anindya, Imrul Chowdhury; Kantarcioglu, Murat; Malin, B.

dc.contributor.ORCID	0000-0001-6423-4533 (Kantarcioglu, M)
dc.contributor.author	Anindya, Imrul Chowdhury
dc.contributor.author	Kantarcioglu, Murat
dc.contributor.author	Malin, B.
dc.contributor.utdAuthor	Anindya, Imrul Chowdhury
dc.contributor.utdAuthor	Kantarcioglu, Murat
dc.date.accessioned	2020-02-03T22:29:27Z
dc.date.available	2020-02-03T22:29:27Z
dc.date.issued	2019-03-20
dc.description	Due to copyright restrictions, full-text access from Treasures at UT Dallas is restricted to current UTD affiliates (use the provided Link to Article).
dc.description.abstract	Record linkage is the process of integrating information from the same underlying entity across disparate data sets. This process, which is increasingly utilized to build accurate representations of individuals and organizations for a variety of applications, ranging from creditworthiness assessments to continuity of medical care, can be computationally intensive because it requires comparing large quantities of records over a range of attributes. To reduce the amount of computation in record linkage in big data settings, blocking methods, which are designed to limit the number of record pair comparisons that need to be performed, are critical for scaling up the record linkage process. These methods group together potential matches into blocks, often using a subset of attributes, before a final comparator function predicts which record pairs within the blocks correspond to matches. Yet data corruption and missing values adversely influence the performance of blocking methods (e.g., they may cause some matching records not to be placed in the same block). While there has been some investigation into the impact of missing values on general record linkage techniques (e.g., the comparator function), no study has addressed the impact of missing values on blocking methods. To address this issue, in this work, we systematically perform a detailed empirical analysis of the individual and joint impact of missing values and data corruption on different blocking methods using realistic data sets. Our results show that blocking approaches that do not depend on one type of blocking attribute are more robust against missing values. In addition, our results indicate that blocking parameters must be chosen carefully for different blocking techniques. © Springer Nature Switzerland AG 2019.
dc.description.department	Erik Jonsson School of Engineering and Computer Science
dc.identifier	9783030161415
dc.identifier.bibliographicCitation	Anindya, I. C., M. Kantarcioglu, and B. Malin. 2019. "Determining the impact of missing values on blocking in record linkage." Advances in Knowledge Discovery and Data Mining (Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11441: 262-274, doi: 10.1007/978-3-030-16142-2_21
dc.identifier.uri	http://dx.doi.org/10.1007/978-3-030-16142-2_21
dc.identifier.uri	https://hdl.handle.net/10735.1/7220
dc.identifier.volume	11441
dc.language.iso	en
dc.publisher	Springer Verlag
dc.relation.isPartOf	Advances in Knowledge Discovery and Data Mining
dc.rights	©2019 Springer Nature Switzerland AG
dc.subject	Comparator circuits
dc.subject	Crime
dc.subject	Data mining
dc.subject	Record linkage
dc.title	Determining the Impact of Missing Values on Blocking in Record Linkage
dc.type.genre	article
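
The abstract's central claim, that blocking on a single attribute is fragile when that attribute is missing, can be illustrated with a minimal sketch. It is not the authors' method or data: the toy records, the `candidate_pairs` and `pairs_completeness` helpers, and the choice of surname and zip code as blocking keys are all assumptions made for illustration. The sketch uses plain standard blocking (exact key equality) and compares a single key against the union of two keys.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (id, surname, zip code); None marks a missing value.
RECORDS = [
    ("a1", "smith", "75080"),
    ("b1", None, "75080"),    # same person as a1, surname missing
    ("a2", "jones", "37203"),
    ("b2", "jones", None),    # same person as a2, zip code missing
]
TRUE_MATCHES = {("a1", "b1"), ("a2", "b2")}


def candidate_pairs(records, key_index):
    """Standard blocking on one attribute.

    Records whose blocking key is missing are placed in no block,
    so they generate no candidate pairs at all.
    """
    blocks = defaultdict(list)
    for rec in records:
        key = rec[key_index]
        if key is not None:
            blocks[key].append(rec[0])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def pairs_completeness(pairs, true_matches):
    """Fraction of true matches that survive blocking."""
    return len(pairs & true_matches) / len(true_matches)


by_surname = candidate_pairs(RECORDS, 1)
by_zip = candidate_pairs(RECORDS, 2)
either = by_surname | by_zip  # a pair is kept if any one key agrees

print(pairs_completeness(by_surname, TRUE_MATCHES))  # 0.5
print(pairs_completeness(by_zip, TRUE_MATCHES))      # 0.5
print(pairs_completeness(either, TRUE_MATCHES))      # 1.0
```

In this toy setting each single-key scheme loses one of the two true matches to a missing value, while the union of both keys recovers them. The union also produces more candidate pairs in general, which is the trade-off that blocking parameters have to balance.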

Files

Original bundle

Name: JECS-2796-260829.65-LINK.pdf
Size: 165.43 KB
Format: Adobe Portable Document Format
Description: Link to Article

Collections

Kantarcioglu, Murat