
    Novel Class Detection and Cross-Lingual Duplicate Detection over Online Data Stream

    Dissertation (2.922Mb)
    Date
    2018-05
    Author
    Mustafa, Ahmad M.
    Abstract
    Data streams are continuous flows of data points. They are now common in domains such as e-commerce, education, health, security, and social networks. Their sheer volume and throughput make it a great challenge for the data mining community to extract useful knowledge from them. Data stream classification is the task of predicting the class labels of data instances using classification models trained on past labeled data. It has been a major research thrust for the past several years because of increasing demand in many business and security applications. Compared with traditional datasets, data streams exhibit several unique properties, such as infinite length, concept-drift, and concept-evolution. Concept-drift occurs when the underlying concept of the data changes over time, making previously trained models obsolete. Concept-evolution refers to the emergence of a new, or novel, class. Existing classification techniques that address concept-evolution and concept-drift require labeled data to detect stream changes, but labeled data is scarce and expensive. This dissertation addresses these challenges in a number of ways. We address infinite length, concept-drift, and concept-evolution by proposing an ensemble-based classification model that exploits unsupervised deep embeddings and non-parametric change point detection. We show empirically, on both synthetic and several benchmark data streams, that the proposed techniques outperform state-of-the-art techniques.

    The detection of duplicate news reports, those that cover the same event, plays an important role in condensing information about an event into an easily digestible format. Several methods of duplicate report detection have been developed. However, these existing methods either yield poor accuracy or are not effective when the reports are in different languages. We propose a novel unsupervised method to measure the similarity between news reports written in different languages. Experimental results on publicly available datasets of multi-lingual reports show that our approach efficiently and significantly reduces duplicate detection errors compared to state-of-the-art techniques.
    URI
    http://hdl.handle.net/10735.1/5863
    Collections
    • UTD Theses and Dissertations

    DSpace software copyright © 2002-2016  DuraSpace
    Contact Us | Send Feedback
    Theme by 
    Atmire NV
     

     
