IncRDD: Incremental Updates for RDD in Apache Spark

Dodabelle Prakash, Prathish

dc.contributor.advisor	Wu, Weili
dc.creator	Dodabelle Prakash, Prathish
dc.creator.orcid	0000-0003-3516-8316
dc.date.accessioned	2017-06-26T11:51:57Z
dc.date.available	2017-06-26T11:51:57Z
dc.date.created	2017-05
dc.date.issued	2017-05
dc.date.submitted	May 2017
dc.date.updated	2017-06-26T11:51:57Z
dc.description.abstract	Data is constantly changing, and existing datasets routinely receive incremental updates. As the data evolves with new updates, the results of big data applications gradually become stale. The results must be refreshed efficiently after every update. Apache Spark is used to process multiple petabytes of data on clusters with thousands of nodes. The core abstraction of Spark is the RDD (Resilient Distributed Dataset), an immutable collection of elements. Because RDDs are immutable, Spark processes data in parallel, allows data reuse, and handles failures and stragglers efficiently. However, Spark lacks flexibility and efficiency when processing small incremental updates. In this thesis, the IncRDD framework is proposed for incremental processing of updates to existing data. IncRDD retains all the powerful features of Spark, including parallel processing, data reusability, and fault tolerance. New RDD operations are implemented to add new records, update existing records, and delete them. We introduce a new variant of Cuckoo hashing, Dual-CH Fast-Simple. Dual Cuckoo hashing uses two cuckoo hash tables. The first cuckoo table stores the records in every partition of a node. The second hash table implements structural sharing, which adds persistence, reuses previous versions, and avoids expensive recomputation. We evaluate IncRDD using incremental algorithms and provide experimental results showing a significant improvement in the performance of Incremental RDD.
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/10735.1/5417
dc.language.iso	en
dc.rights	Copyright ©2017 is held by the author. Digital access to this material is made possible by the Eugene McDermott Library. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
dc.subject	Big data
dc.subject	SPARK (Computer program language)
dc.subject	Electronic data processing
dc.subject	Parallel processing (Electronic computers)
dc.subject	Computer software—Reusability
dc.subject	Fault-tolerant computing
dc.subject	Hashing (Computer science)
dc.title	IncRDD: Incremental Updates for RDD in Apache Spark
dc.type	Thesis
dc.type.material	text
thesis.degree.department	Computer Science
thesis.degree.grantor	University of Texas at Dallas
thesis.degree.level	Masters
thesis.degree.name	MSCS
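
The abstract describes record storage built on cuckoo hashing, with operations to add, update, and delete records. As a rough illustration of that general technique only, the following is a minimal sketch of a textbook two-choice cuckoo hash table in Python. It is not the thesis's Dual-CH Fast-Simple variant and does not implement structural sharing or any Spark integration; the class name `CuckooTable`, the `max_kicks` parameter, and the rehash-by-doubling policy are all assumptions made for this example.

```python
import random


class CuckooTable:
    """Textbook two-choice cuckoo hash table (illustrative, hypothetical names)."""

    def __init__(self, capacity=8, max_kicks=32):
        self.capacity = capacity      # slots per side
        self.max_kicks = max_kicks    # evictions allowed before rehashing
        self.size = 0
        self._new_seeds()
        # Two slot arrays, one per hash function; a slot is (key, value) or None.
        self.slots = [[None] * capacity, [None] * capacity]

    def _new_seeds(self):
        self.seeds = (random.getrandbits(32), random.getrandbits(32))

    def _index(self, which, key):
        # Mixing a per-side seed with the key gives two different hash functions.
        return hash((self.seeds[which], key)) % self.capacity

    def get(self, key, default=None):
        # A key can only live in one of its two candidate slots.
        for which in (0, 1):
            entry = self.slots[which][self._index(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # Update in place if the key is already stored.
        for which in (0, 1):
            i = self._index(which, key)
            entry = self.slots[which][i]
            if entry is not None and entry[0] == key:
                self.slots[which][i] = (key, value)
                return
        self._insert_new((key, value))
        self.size += 1

    def _insert_new(self, entry):
        which = 0
        for _ in range(self.max_kicks):
            i = self._index(which, entry[0])
            # Place the entry and pick up whatever occupied the slot before.
            entry, self.slots[which][i] = self.slots[which][i], entry
            if entry is None:
                return
            which = 1 - which  # the evicted entry tries its slot on the other side
        self._rehash(entry)

    def _rehash(self, pending):
        # Too many evictions: double the capacity, reseed, and reinsert everything.
        entries = [e for side in self.slots for e in side if e is not None]
        entries.append(pending)
        self.capacity *= 2
        self._new_seeds()
        self.slots = [[None] * self.capacity, [None] * self.capacity]
        for e in entries:
            self._insert_new(e)

    def delete(self, key):
        for which in (0, 1):
            i = self._index(which, key)
            entry = self.slots[which][i]
            if entry is not None and entry[0] == key:
                self.slots[which][i] = None
                self.size -= 1
                return True
        return False
```

In this sketch, `put` covers both adding a new record and updating an existing one, and `delete` removes a record. Lookups and deletions inspect at most two slots, which is the property that makes cuckoo hashing attractive for small point updates.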

Files

Original bundle

Name: ETD-5608-7409.37.pdf
Size: 872.59 KB
Format: Adobe Portable Document Format

Collections

UTD Theses and Dissertations