Browsing by Author "Khan, Latifur"
Now showing 1 - 20 of 48
Item: A Big Data Framework for Unstructured Text Processing With Applications Towards Political Science and Healthcare (2021-12-01). Salam, Sayeed; Khan, Latifur; Hu, Yang; Bastani, Farokh B.; Kim, Dohyeong; Wu, Weili

Machine learning and deep neural networks have soared in popularity in recent years, allowing us to enhance many aspects of everyday life. While these methods are intuitive, they are heavily reliant on the dataset used to build the model. A high-quality dataset boosts the model's accuracy and validates the model's output in the context of a real-world scenario. Furthermore, continuous improvement of the dataset contributes to tuning the model in a time-consistent way and to mitigating temporal inconsistencies. However, preparing datasets, particularly for text domains, is difficult due to the inherently unstructured nature of the data and the use of multiple languages. Furthermore, the amount of text produced in the form of news articles or social media posts is massive, necessitating large-scale processing. The velocity at which new texts are produced demands an elastic and scalable system that can accommodate any surge of inputs while remaining resource efficient when idle. Texts are created in a variety of ways and must be preprocessed and analyzed in order to produce well-structured, consistent data. This can be accomplished through a well-defined domain-specific ontology (a rule-based approach) or through machine learning approaches. While rule-based systems can provide more precise information and are preferred in a variety of circumstances, they lack flexibility: the ontologies are often fixed and do not respond well to continuous changes in their respective domains. In this dissertation, we propose solutions to the challenges described above. First, we describe a scalable architecture for collecting news stories from around the world and applying a rule-based approach with the Conflict and Mediation Event Observations (CAMEO) ontology to generate political events. We present a summary of the generated dataset, along with some basic analysis, to demonstrate how it relates to real-world scenarios. We present techniques for dynamically adding information to the ontology using a mining approach that discovers new political actors, works as a recommender system, and retrieves more than 80% of the missing information, including political figures and their roles. We discuss an extended data processing system for processing articles published in several languages, with a focus on the translation methodologies and tools we developed. We demonstrate the efficacy of the coder in Spanish compared with English: with a translated knowledge base, the revised event coder was able to recognize 83% of the information found in equivalent events from English articles. For healthcare, we propose an alternative strategy in which we use several machine learning algorithms and social media posts, such as tweets, to extract the location and severity of Road Traffic Incidents (RTIs). We highlight a pipeline that goes from collecting tweets to summarizing the tweets related to an RTI. We also demonstrate how semi-automatic ontology learning can help determine severity and offer a simplified example in which 100% of the target rules were identified using an iterative technique.
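The rule-based coding step above pairs actor and verb dictionaries to produce "who-did-what-to-whom" records. As a flavor of the idea (not the dissertation's implementation), here is a minimal sketch with hypothetical toy dictionary entries:

```python
# Minimal sketch of dictionary-based "who-did-what-to-whom" event coding in the
# spirit of CAMEO; the actor/verb dictionaries below are hypothetical toy entries.
ACTORS = {"angela merkel": "DEUGOV", "united nations": "IGOUNO"}
VERBS = {"met with": "040 (Consult)", "accused": "112 (Accuse)"}

def code_event(sentence: str):
    """Return (source, event, target) if an actor-verb-actor pattern matches."""
    s = sentence.lower()
    for verb, code in VERBS.items():
        if verb not in s:
            continue
        left, right = s.split(verb, 1)
        src = next((c for a, c in ACTORS.items() if a in left), None)
        tgt = next((c for a, c in ACTORS.items() if a in right), None)
        if src and tgt:
            return (src, code, tgt)
    return None

print(code_event("Angela Merkel met with United Nations officials."))
# ('DEUGOV', '040 (Consult)', 'IGOUNO')
```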
Item: A Complex Task Scheduling Scheme for Big Data Platforms Based on Boolean Satisfiability Problem (Institute of Electrical and Electronics Engineers Inc.). Hong, H.; Khan, Latifur; Ayoade, Gbadebo G.; Shaohua, Z.; Yong, W.

In big data processing systems, the amount of data is increasing, and at the same time the real-time requirements of data processing and analysis are growing ever higher. Big data processing and analysis systems are therefore required to deliver better performance. Job scheduling plays an important role in improving overall system performance in big data processing frameworks. However, job scheduling is a difficult NP-hard problem, and many factors must be considered. For example, jobs have dependencies among stages, so we should not allocate resources to tasks that are not ready; sometimes there are also constraints between jobs. These factors challenge the scheduling performance of big data processing and analysis systems. In this paper, we try to solve the problem by translating it into the Boolean Satisfiability Problem (SAT), an exact method. SAT-based scheduling is not a new approach, but in the past it was mainly used to solve static scheduling problems. A dynamic scheduling system requires all problems to be solved within a limited time, which is a challenge for SAT encoding. In this paper, we build on a previous SAT solution to the Job Shop Scheduling Problem and adjust the algorithm to meet the requirements of a big data processing system. At the same time, we optimize the encoding approach and reduce the number of clauses, improving solving efficiency enough to meet the performance requirements. The experimental results show that the number of clauses is reduced by more than 30%, and the time the SAT solver takes to reach a solution can be reduced by more than 50%. To demonstrate its effectiveness, we have also implemented our new job scheduler in Apache Hadoop YARN and validated it there.
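For a sense of how stage dependencies become Boolean clauses, here is a toy encoding (not the paper's encoding) of a two-task dependency using the python-sat package; variable `v[(t, s)]` means task t runs in slot s:

```python
# Toy SAT encoding: task B must run strictly after task A finishes.
from itertools import combinations
from pysat.solvers import Glucose3

slots = 3
v = {("A", s): 1 + s for s in range(slots)}
v.update({("B", s): 1 + slots + s for s in range(slots)})

clauses = []
for t in ("A", "B"):
    lits = [v[(t, s)] for s in range(slots)]
    clauses.append(lits)                                      # at least one slot
    clauses += [[-a, -b] for a, b in combinations(lits, 2)]   # at most one slot
for sa in range(slots):                                       # dependency clauses
    for sb in range(sa + 1):
        clauses.append([-v[("A", sa)], -v[("B", sb)]])

with Glucose3(bootstrap_with=clauses) as solver:
    assert solver.solve()
    model = set(solver.get_model())
    print({t: s for (t, s), lit in v.items() if lit in model})  # e.g. {'A': 0, 'B': 1}
```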
Item: A Federated Learning Framework For Medical Data (2021-05-05). Halim, Sadaf Md; Khan, Latifur

This thesis investigates the federation of medical machine learning models. Medical data is by its very nature sensitive, which makes the free sharing of medical information difficult. This data often contains very useful information, and in this thesis we broadly explore the possibility of utilizing such data without transferring it directly. Instead, we explore the use of federated learning to transfer the knowledge from sensitive patient information as gradient updates and model parameters in order to better inform a variety of learning tasks. We also explore specific types of models, such as graph networks, which can often be a natural representation for Electronic Health Records (EHRs) collected at hospitals, and we investigate how federated learning might aggregate such information across hospitals. Lastly, we identify risks involved in a federated system, in the form of adversarial entities, and we show how they can be mitigated.
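A minimal sketch of the federated averaging idea the thesis builds on: clients (hospitals) contribute weights rather than records, and the server combines them. The weighting-by-sample-count scheme below is standard FedAvg, assumed rather than taken from the thesis:

```python
# FedAvg-style aggregation: average client weight arrays, weighted by each
# client's local sample count; raw patient records never leave the client.
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of a list of weight arrays."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

w_hospital_a = np.array([0.2, -1.0, 0.7])   # hypothetical local model weights
w_hospital_b = np.array([0.4, -0.8, 0.5])
print(fed_avg([w_hospital_a, w_hospital_b], [1000, 3000]))  # [0.35 -0.85 0.55]
```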
Item: Adversarial Machine Learning for Social Good (2022-08-01). Belavadi, Vibha Chandramouli; Kantarcioglu, Murat; Thuraisingham, Bhavani; Basu, Kanad; Khan, Latifur; Iyer, Rishabh

The deployment of machine learning (ML) techniques to automate critical decision-making in healthcare, employment, finance, and crime prevention has played a huge role in improving these systems. However, an incorrect decision can have potentially life-changing consequences, and deployed ML models are also highly vulnerable to test-time adversarial attacks. These ML models therefore need to be made trustworthy and reliable. This dissertation addresses these problems and presents work that makes models more robust, fair, and reliable using adversarial machine learning techniques. We start with the question of whether ML-based deception detection systems generalize to real-life scenarios. We perform experiments to examine whether multimodal aspects such as facial expressions, eye movements, and video cues can be used as deception detection features. We develop three different datasets based on real-life lying scenarios (e.g., lying for a reward, under duress, and speaking white lies) and use them for deception detection. We also study state-of-the-art deception detection systems and algorithms and try to extend them to our deception scenarios. We show that deception detection does not generalize to real-life scenarios, and that more subject-matter knowledge and better models are needed before such a claim can be made. We also address the use of adversarial examples to protect user security and privacy and to make black-box models more advantageous to the end user, specifically in the vision and tabular data domains. In the vision domain, we consider the problem of protecting a sensitive attribute (e.g., gender) using an adversarial artifact such as adversarial glasses in the image. In the tabular data domain, we provide adversarial recommendations that let a user change their loan application data so that a black-box loan application model moves them from bad credit to good credit. Thus, we demonstrate that we can craft adversarial examples that help end users protect their privacy and ensure they do not receive unfair treatment. Finally, we tackle the problem of multi-concept adversarial examples. We show that a state-of-the-art adversarial attack on a particular classifier, e.g., gender, reduces the accuracy of a different classifier, e.g., age, trained independently on the same data pool. By combining the loss functions of the attacked model and the protected model in our loss formulation, we show that we can create targeted adversarial attacks. These custom adversarial attacks not only successfully attack the target classifiers but also cause no drop in the accuracy of the other classifiers in the protected set.
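As a reference point for how adversarial examples are crafted, here is a sketch of the standard fast gradient sign method (FGSM) in PyTorch; the dissertation's multi-concept and artifact-based attacks are more elaborate than this baseline:

```python
# Standard FGSM: perturb the input by eps in the direction that increases
# the classification loss, then clamp back to the valid pixel range.
import torch

def fgsm(model, x, y, eps=0.03):
    """Return an adversarial copy of x for true labels y."""
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()
```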
Item: Bayesian Nonparametric Probabilistic Methods in Machine Learning (2018-12). Sahs, Justin C.; Thuraisingham, Bhavani; Khan, Latifur

Many aspects of modern science, business, and engineering have become data-centric, relying on tools from artificial intelligence and machine learning. Practitioners and researchers in these fields need tools that can incorporate observed data into rich models of uncertainty in order to make discoveries and predictions. One area of study that provides such models is the field of Bayesian nonparametrics. This dissertation is focused on furthering the development of this field. After reviewing the relevant background and surveying the field, we consider two areas of structured data:

- We first consider relational data that takes the form of a 2-dimensional array, such as social network data. We introduce a novel nonparametric model that takes advantage of a representation theorem about arrays whose column and row order is unimportant. We then develop an inference algorithm for this model and evaluate it experimentally.
- Second, we consider the classification of streaming data whose distribution evolves over time. We introduce a novel nonparametric model that finds and exploits a dynamic hierarchical structure underlying the data. We present an algorithm for inference in this model and show experimental results. We then extend our streaming model to handle the emergence of novel and recurrent classes, and evaluate the extended model experimentally.

Item: Big Data Sanitization Using Scalable In-Memory Frameworks (2017-05). Waikar, Kanchan P.; Kantarcioglu, Murat; Khan, Latifur; Thuraisingham, Bhavani

As more and more data is collected, it is growing beyond the scale humans could ever have imagined. Not only the data but also data collection and analysis techniques have evolved, enabling researchers to advance many fields such as medical science. Although health data can have a huge impact on the future success of research, such data is usually distributed among various stakeholders. Organizations need to share this data to help research move forward, but health data sharing is a regulated domain. Due to privacy concerns, the U.S. Department of Health and Human Services (HHS) has taken steps to ensure privacy protection of individuals by regulating data sharing through the Health Insurance Portability and Accountability Act of 1996 (HIPAA). HIPAA policy restricts publishers from sharing identification information as well as any auxiliary information that can be used for record re-identification. To make data sharing compliant with HIPAA policy, various data privacy protection techniques have evolved. Differential privacy techniques focus on maximizing query accuracy in statistical databases while minimizing the "risk" of record identification, whereas data anonymization allows the publisher to share the original data at lower precision, e.g., publishing an age of 30 as the range 25-35. These techniques are considered an industry standard. Newer risk-based models determine the anonymization level of a record based on its hidden "risk" of re-identification. With sanitization requests around big data constantly increasing, sanitization algorithms need to be adapted for distributed computing frameworks. Frameworks like Hadoop MapReduce achieve parallelism by distributing tasks to multiple machines and executing them in parallel. Apache Spark is a Hadoop-MapReduce-style in-memory distributed framework with support for data caching, making it a more suitable choice for iterative anonymization algorithms. This study focuses on developing distributed in-memory data sanitization techniques. To extend traditional k-anonymity methods, we implemented the Mondrian k-anonymization algorithm for Apache Spark. The Mondrian algorithm performs multidimensional partitioning cuts until the data cannot be divided further without violating the k-anonymity property. We propose a locality sensitive hashing (LSH) based one-pass anonymization algorithm in which we use LSH functions to form clusters of size k and find a summary statistic for each cluster. To support newer data anonymization methods, we implement an in-memory version of a risk-estimation-based anonymization algorithm that leverages a game-theoretic approach for deciding the optimal generalization level for each record. We then propose a hybrid risk anonymization algorithm that uses LSH bucketing to minimize the number of risk estimation executions. To support online sanitization, we propose an aspect-oriented approach for modifying the computation of an Apache Spark RDD at runtime, and show how an aspect can suppress an identifier field at runtime based on a predefined policy. With evolving functional requirements, such as within-dataset versus within-population anonymization, centralized versus distributed anonymization, and risk-based versus strict k-anonymization, it is crucial to select the method that correctly fits the requirement. This study offers different solutions suitable for different functional requirements, and the analysis and comparison of the above methods will enable data publishers to make computationally efficient anonymization decisions.
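A minimal sketch of the Mondrian idea on a single numeric quasi-identifier: cut at the median while every partition retains at least k records, then publish each value as its partition's range (e.g., age 30 as "25-35"). This is a simplification of the multidimensional algorithm described above:

```python
# One-attribute Mondrian-style partitioning: recursively split at the median
# while each partition keeps at least k records (k-anonymity is preserved).
def mondrian(ages, k):
    ages = sorted(ages)
    if len(ages) < 2 * k:          # a further cut would violate k-anonymity
        return [ages]
    mid = len(ages) // 2
    return mondrian(ages[:mid], k) + mondrian(ages[mid:], k)

for part in mondrian([21, 25, 28, 30, 33, 35, 41, 52], k=2):
    print(f"published range {part[0]}-{part[-1]} for records {part}")
```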
Item: Classified Enhancement Model for Big Data Storage Reliability Based on Boolean Satisfiability Problem (Springer New York LLC, 2019-05-11). Huang, H.; Khan, Latifur; Zhou, S.

Disk reliability is a serious problem in big data infrastructure environments. Although the reliability of disk drives has greatly improved over the past few years, they are still the most vulnerable core components in a server. When they fail, the result can be catastrophic: recovering data can take days, and sometimes data is lost forever, which is unacceptable for important data. XOR parity is a typical method for generating reliability syndromes and thus improving data reliability, yet in practice we find that data is still likely to be lost. In most storage systems, reliability improvements are achieved by allocating additional disks in Redundant Arrays of Independent Disks (RAID), which increases hardware costs and is therefore very difficult in cost-constrained environments. How to improve data integrity without raising hardware cost has consequently aroused much interest among big data researchers. The challenge is that when creating non-traditional RAID geometries, care must be taken to respect data dependence relationships to ensure that the new RAID strategy improves reliability, which is an NP-hard problem. In this paper, we present an approach that characterizes these challenges as high-dimensional variants of the n-queens problem, enabling practical solutions via the SAT solver MiniSat, and we use a greedy algorithm to analyze the queens' attack domains as a basis for reliability syndrome generation. A large number of experiments show that the proposed approach is feasible in software-defined data centers and that the algorithm's performance can meet the current requirements of the big data environment.
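Since the paper reduces RAID layout constraints to high-dimensional n-queens instances solved with MiniSat, the classic 2-D n-queens CNF gives a flavor of such an encoding; the sketch below uses python-sat's bundled MiniSat and is illustrative only:

```python
# Classic 2-D n-queens as CNF: one queen per row, at most one per column and
# per diagonal; solved with the MiniSat backend bundled in python-sat.
from itertools import combinations
from pysat.solvers import Minisat22

n = 6
def var(r, c): return r * n + c + 1

clauses = [[var(r, c) for c in range(n)] for r in range(n)]          # one per row
lines = ([[(r, c) for r in range(n)] for c in range(n)] +            # columns
         [[(r, c) for r in range(n) for c in range(n) if r - c == d]
          for d in range(-n + 1, n)] +                               # one diagonal family
         [[(r, c) for r in range(n) for c in range(n) if r + c == d]
          for d in range(2 * n - 1)])                                # the other family
for cells in lines:
    clauses += [[-var(*a), -var(*b)] for a, b in combinations(cells, 2)]

with Minisat22(bootstrap_with=clauses) as s:
    s.solve()
    queens = sorted(((v - 1) // n, (v - 1) % n) for v in s.get_model() if v > 0)
    print(queens)  # one valid placement, e.g. [(0, 1), (1, 3), (2, 5), ...]
```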
Item: Comparative Text Analysis on a Classification Task of Political Fake Statement Detection (2018-08). Amin, Shalin Anilkumar; Khan, Latifur

Automatic fake news detection is a very challenging problem, especially as a fraud/deception detection task, and it has significant real-world political and social impact. During the 2016 US presidential election, the world saw many such cases, so it is essential to address this socially relevant phenomenon. However, statistical approaches to combating fake news have been dramatically limited by the lack of a publicly available labeled dataset, especially one with political news headlines and their labels. Until now, most research has been done on news articles or headline-article pairs, but in this research we focus only on political news headlines/statements spoken by political candidates or posted on Facebook. This thesis explores different approaches to the fake news detection task using a wide variety of natural language processing techniques. These techniques include extracting linguistic features from the statement, assessing their predictive power through feature engineering and topic modeling, and determining the reputation of a speaker via a credit score, topic-speaker analysis, and word vectors. The overall approach is discussed using different classifiers, and an attempt at stance detection is discussed at the end.

Item: Decentralized IoT Data Management Using BlockChain and Trusted Execution Environment (Institute of Electrical and Electronics Engineers Inc.). Ayoade, Gbadebo; Karande, Vishal; Khan, Latifur; Hamlen, Kevin W.

Due to the centralization of authority in the management of data generated by IoT devices, there is a lack of transparency in how user data is shared among third-party entities. Given the growing adoption of blockchain technology, which provides decentralized management of assets such as currency, as seen in Bitcoin, we propose a decentralized data management system for IoT devices in which all data access permissions are enforced using smart contracts and the audit trail of data access is stored in the blockchain. With smart contract applications, multiple parties can specify rules to govern their interactions, and these rules are independently enforced in the blockchain without the need for a centralized system. We provide a framework that stores the hash of the data in the blockchain and stores the raw data in a secure storage platform using a trusted execution environment (TEE). In particular, we consider Intel SGX as part of the TEE to ensure data security and privacy for the sensitive parts of the application (code and data).

Item: Deep Learning Methods for Improving Event Extraction on Political and Social Science Studies (2022-05-01). Skorupa Parolin, Erick; Khan, Latifur; O, Kenneth K.; Wu, Weili; Bastani, Farokh B.; Brandt, Patrick T.

Political and social scholars increasingly rely on event coders, which are automated systems that extract structured event representations from news articles, in order to monitor, analyze, and predict conflicts and affairs involving political entities across the globe. However, the existing event coders rest on outdated pattern matching techniques, relying on large, manually maintained dictionaries composed of lexico-syntactic patterns designed for capturing conflict events. Apart from the high costs, time, and specialized knowledge required to update and expand such dictionaries, these techniques do not support event extraction over multilingual corpora. As a consequence, the application of existing systems often yields low-recall results and imposes limitations when working with sources coming from different countries and languages. In this dissertation, we propose deep-learning-based frameworks that obtain state-of-the-art results for extracting structured events from natural language text in the political and social sciences. We do so by exploring three main directions: (i) automatically extending the external dictionaries and knowledge bases utilized by current event coders through knowledge extraction techniques; (ii) formulating the event coding task as a classification problem and proposing a supervised deep learning model to solve it; and (iii) developing an innovative deep neural network design that combines state-of-the-art language representation models with multi-task learning to efficiently extract events in a structured format from multilingual corpora. We demonstrate the superiority of our approaches through extensive experiments on real-world multilingual corpora from the political science and conflict domains.

Item: Deep Neural Network Based Representation Learning and Modeling for Robust Speaker Recognition (2022-08-01). Xia, Wei; Hansen, John; Kumar, Golden; Busso, Carlos; Ouyang, Jessica; Khan, Latifur

Automatic Speaker Verification (ASV) involves determining a person's identity from audio streams, providing a natural and efficient way to perform biometric identity authentication. Being able to perform text-independent speaker verification, which does not require a fixed input text phrase, can significantly help verify or retrieve a target person. Speaker recognition has many applications today, including audio surveillance, computer access control, and voice authentication; smart home devices such as Google Home, Amazon Alexa, and Apple HomePod can also benefit from ASV for personalized voice applications. This dissertation addresses four related research problems. First, we investigate the impact of non-linear distortion based on waveform peak clipping on automatic speech-based systems. We begin by defining various forms of clipping and then explore their potential impact on practical speech systems and speech corpora. Next, we present an overview of audio quality assessment, illustrate the effect that clipping has on automated speaker recognition systems, and report findings of an investigation into the occurrence of clipping in a variety of datasets used by the speech community. Second, we provide an unsupervised Adversarial Discriminative Domain Adaptation (ADDA) method for speaker verification when training and testing data have mismatched conditions. ADDA needs just the source data and unlabeled target domain data in order to discover an asymmetric mapping that adapts the target domain feature encoder to the source domain. The experimental findings demonstrate that trained ADDA speaker embeddings perform well on speaker classification for the target domain data and are less sensitive to language shifts. In the third topic, a generalized global context modeling framework is proposed for speaker recognition. We first present a data-driven, attention-based global time-frequency context model, which can better capture long-range time-frequency dependencies and channel variances. It aims to combine the non-local block and the squeeze-and-excitation block to adaptively recalibrate the learned feature map and provide time-frequency attention to specific regions. Further, we propose a data-independent Discrete Cosine Transform (DCT) based global context model. A multi-DCT attention mechanism is presented to improve the modeling power with different DCT bases. We also use global context information to enhance important channels and recalibrate salient time-frequency locations by computing the similarity between the global context and local features. We show that the proposed global context modeling method can be easily incorporated into a CNN model with little additional computational cost and effectively improves speaker verification performance by a large margin. Lastly, in the fourth topic, we investigate the effects of reverberation and noise on self-supervised speaker verification. In order to normalize extrinsic variations of two random segments taken from one spoken utterance, a number of alternative training data augmentation methodologies are investigated. We systematically simulate different levels and types of reverberation and noise on the test data for performance comparison. The experiments show a clear correlation between microphone distance, reverberation time, signal-to-noise ratio, and the verification result. Taken collectively, these investigative studies contribute to a more comprehensive understanding of speaker recognition and advance algorithm robustness for real-world speaker systems.
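The clipping study in the first topic motivates a simple diagnostic: the fraction of samples pinned near the waveform's peak. Below is a rough, assumed indicator, not the dissertation's measurement procedure:

```python
# Rough clipping indicator: the fraction of samples at or above a threshold
# relative to the waveform's peak level; hard-clipped audio scores much higher.
import numpy as np

def clipping_ratio(wave, thresh=0.99):
    peak = np.max(np.abs(wave))
    return np.mean(np.abs(wave) >= thresh * peak)

t = np.linspace(0, 1, 16000)
clean = np.sin(2 * np.pi * 220 * t)
clipped = np.clip(1.5 * clean, -1.0, 1.0)   # simulate hard peak clipping
print(clipping_ratio(clean), clipping_ratio(clipped))  # small vs. much larger
```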
Item: Design and Development of Real-Time Big Data Analytics Frameworks (2017-12). Solaimani, M.; Khan, Latifur

Today, the most sophisticated technologies, such as the Internet of Things (IoT), autonomous driving, cloud computing, and data center consolidation, demand smarter IT infrastructure and real-time operations. They continuously generate large amounts of data, called "Big Data," reporting their operational activities. In response, we need advanced analytics frameworks that can capture, filter, and analyze data and make quick decisions in real time. The high volume, velocity, and variety of the data make this an overwhelming task for humans to perform in real time. Current state-of-the-art techniques such as advanced analytics, machine learning (ML), and natural language processing (NLP) can be utilized to handle heterogeneous Big Data, but most of these algorithms suffer from scalability issues and cannot meet real-time constraints. In this dissertation, we focus on two areas: anomaly detection on structured VMware performance data (e.g., CPU/memory usage metrics) and text mining for politics in unstructured text data. We have developed real-time distributed frameworks with ML and NLP techniques. For anomaly detection, we implemented an adaptive clustering technique to identify individual anomalies and a chi-square-based statistical technique to detect group anomalies in real time. For text mining, we developed SPEC, a real-time framework that captures online news articles in different languages from the web and annotates them using CoreNLP, PETRARCH, and the CAMEO dictionary to generate structured political events in a "who-did-what-to-whom" format. Later, we extended this framework to code atrocity events, machine-coded structured data containing perpetrators, actions, victims, and so on. Finally, we developed a novel, distributed, window-based political actor recommendation framework to discover and recommend new political actors along with their possible roles. We implemented these scalable distributed streaming frameworks with the Kafka message broker, unsupervised and supervised machine learning techniques, and Spark.
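The group-anomaly idea can be illustrated with an off-the-shelf chi-square test: compare the current window's histogram of a usage metric against a baseline window and flag low p-values. A sketch with made-up numbers, not the framework's actual statistic:

```python
# Chi-square check on a streaming window: compare the observed histogram of a
# usage metric against the expected (baseline) histogram and flag low p-values.
import numpy as np
from scipy.stats import chisquare

baseline = np.array([50, 30, 15, 5])    # baseline CPU-usage histogram (4 bins)
window = np.array([20, 25, 30, 25])     # current streaming window (same bins)
expected = baseline / baseline.sum() * window.sum()  # rescale to window total
stat, p = chisquare(f_obs=window, f_exp=expected)
if p < 0.01:
    print(f"group anomaly flagged (chi2={stat:.1f}, p={p:.2g})")
```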
Item: Design and Development of Scalable Analytics Frameworks with Applications in Blockchain Smart Contract Security and Political News Mining (2020-05). Bahojb Imani, Maryam; Khan, Latifur; Thuraisingham, Bhavani

Nowadays, high volumes of data are continuously generated at an unprecedented rate in domains such as e-commerce, education, health, security, and social networks. This is due to many technological advancements, including the Internet of Things (IoT), autonomous driving, the proliferation of cloud computing, data center consolidation, and the growth of smart devices; the term "big data" was coined to capture this emerging trend. The high volume, velocity, and variety of the data pose a great challenge for the data mining community in extracting useful knowledge, and in response we need scalable analytics frameworks for rapid data acquisition, filtering, and analysis. Current state-of-the-art techniques such as advanced analytics, machine learning (ML), and natural language processing (NLP) can be utilized to handle heterogeneous big data, yet most of these systems suffer from scalability issues. In this dissertation, we focus on the social science and blockchain areas: specifically, location extraction from unstructured political text data, vulnerability detection in blockchain smart contracts, and fault diagnosis in wind turbine vibration data. Regarding focus-location extraction, various tools exist to identify geolocation, but they fail at a granular level, mostly rely on external knowledge, and do not support most languages; we propose PROFILE, a novel scalable framework to extract the primary focus location from political news articles in different languages. Regarding blockchain, existing solutions rely on human experts to define features or rules to detect vulnerabilities, which often leads to many missed vulnerabilities and is inefficient at detecting new ones; we develop a novel scalable framework to detect vulnerabilities in smart contracts. Regarding fault diagnosis in wind turbines, real-time fault diagnosis over streaming vibration data from turbine gearboxes is still an outstanding challenge, and monitoring gearboxes in a wind farm with thousands of wind turbines requires massive computational power; we address these challenges with SAIL, a scalable real-time framework that captures wind turbine vibration data using a novel feature extraction method and predicts gearbox faults. We show empirically that the proposed techniques outperform state-of-the-art techniques in all three areas.

Item: Efficient Continual Learning Framework for Stream Mining (2022-05-01). Wang, Zhuoyi; Khan, Latifur; Hamlen, Kevin; Ma, Dongsheng Brian; Chen, Feng; Ruozzi, Nicholas

In recent times, deep-learning-based neural models have achieved excellent performance on several real-world tasks (e.g., object recognition, speech recognition, and machine translation). However, these achievements typically assume a closed, static environment. Unlike the human brain, which can learn and perform in a changing, evolving, dynamic setting with new tasks, current intelligent agents struggle to discover novel knowledge effectively and to incrementally learn new skills quickly and efficiently. The ability to learn and accumulate knowledge over a lifetime is an essential aspect of human intelligence, so enabling an agent to continually discover and learn sequentially from a non-stationary or online stream of data is significant for real-world research and applications. We consider a setting in which an infinite stream of data is sampled from a non-stationary distribution with a sequence of newly emerging tasks. The key challenge of the continual learning process is to automatically discover novel/unseen patterns in newly arriving tasks (compared with previous data) while reducing the forgetting of previously seen concepts, a problem from which current deep learning and machine learning models are well known to suffer. The contributions described in this dissertation address novel knowledge discovery, efficient incremental learning of new skills, and the mitigation of forgetting in deep learning algorithms. To approach these challenges in the continual learning scenario, we first describe a class-incremental learning setting in which each incoming task brings new classes to the agent, while previous tasks cannot be accessed, or can be accessed only in a limited way. We introduce the relevant background on existing technologies for the different issues that arise in the learning process, and then describe the frameworks we developed, which aim for high-level performance on each challenge. Our approach reserves specialist models for each goal, including the discovery, and subsequent incremental learning, of novel knowledge using a shared model with a limited, fixed capacity. Moreover, to account for privacy issues and memory constraints, we propose updating model parameters while accessing only statistics of previous data rather than the original data. As a result, forgetting of old concepts is reduced, and storing the original inputs can be avoided.
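The "statistics instead of raw data" idea at the end of the continual-learning item can be illustrated by keeping one running mean embedding per class and classifying by nearest class mean; a simplified sketch, not the dissertation's model:

```python
# Replay-free class-incremental sketch: store only a running mean embedding per
# class (a statistic), so old classes survive without keeping their raw inputs.
import numpy as np

class NearestMeanIncremental:
    def __init__(self):
        self.means, self.counts = {}, {}

    def update(self, label, embedding):
        n = self.counts.get(label, 0)
        mu = self.means.get(label, np.zeros_like(embedding))
        self.means[label] = (mu * n + embedding) / (n + 1)   # running mean only
        self.counts[label] = n + 1

    def predict(self, embedding):
        return min(self.means,
                   key=lambda c: np.linalg.norm(embedding - self.means[c]))

clf = NearestMeanIncremental()
clf.update("cat", np.array([1.0, 0.0]))
clf.update("dog", np.array([0.0, 1.0]))
print(clf.predict(np.array([0.9, 0.2])))  # cat
```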
Item: Enhancing Classification and Retrieval Performance by Mining Semantic Similarity Relation from Data (2021-02-12). Gao, Yang; Khan, Latifur

When describing unstructured data such as images and texts, humans often resort to similarity, defining the characteristics of the data in relative rather than absolute terms. A human can easily indicate the subtle differences between such data items, while completely describing a single instance of them is a challenging task. For example, in an image retrieval task, to determine whether two images depict the same object, humans may simply ignore differences in illumination, scaling, background, occlusion, and viewpoint and pay attention only to the object itself; on the other hand, describing an image with all of its information is hard and unnecessary. Cognitive evidence also suggests that we interpret objects by relating them to prototypical examples stored in our brain. Thus, similarity is a fundamental property of great importance in both classification and retrieval tasks. Metric learning is the process of determining a non-negative, symmetric, and subadditive distance function d(a, b) that aims to establish the similarity or dissimilarity between objects: it reduces the distance between similar objects and increases the distance between dissimilar objects. From the human's perspective, metric learning can be viewed as determining the function that best matches the user's interpretation of the similarity and dissimilarity relations between data items. In this dissertation, we explore the possibilities of enhancing classification and retrieval performance by mining semantic similarity relations in data via metric learning. Unfortunately, existing metric learning solutions have several drawbacks. First, most metric learning models have a fixed model capacity that cannot adapt to the input data. Second, existing online metric learning models learn a linear metric function, which limits the model's expressiveness. Third, they usually require a user-specified margin that is sensitive to the input data, and they ignore many failure cases during learning. To address these drawbacks, we propose OAHU, a novel online metric learning framework that automatically adjusts model capacity based on the input data, and introduce an Adaptive Bound Triplet Loss (ABTL) to avoid failure cases during learning. Furthermore, as an important subarea of classification, imbalanced classification is critical to the success of many real-world applications, but few existing solutions have considered utilizing data similarity to assist imbalanced learning. Based on this observation, we introduce SetConv, a novel framework that customizes the feature extraction process for each input sample by considering its semantic similarity relation to the minority class, alleviating the model bias toward the majority classes. We also incorporate metric/similarity learning into SIM, a novel open-world stream classifier, to handle classification over open-ended data distributions. Based on our research, we demonstrate that mining semantic similarity relations in data is critical to improving the performance of real-world classification and retrieval tasks.
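For context on what ABTL replaces, here is the standard fixed-margin triplet loss, whose hand-tuned margin is exactly the user-specified parameter criticized above:

```python
# Standard fixed-margin triplet loss: pull the anchor toward the positive and
# push it past the negative by at least `margin` (ABTL adapts these bounds).
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a, p, n = np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([1.0, 1.0])
print(triplet_loss(a, p, n))  # 0.0: the negative is already far enough away
```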
Item: Enhancing Cybersecurity with Encrypted Traffic Fingerprinting (2017-11-20). Al Naami, Khaled Mohammed; Khan, Latifur; Hamlen, Kevin W.

Recently, network traffic analysis and cyber deception have been increasingly used in various applications to protect people, information, and systems from major cyber threats. Network traffic fingerprinting is a traffic analysis attack that threatens web navigation privacy: a set of techniques used to discover patterns in the sequence of network packets generated while a user accesses different websites. Internet users, such as online activists or journalists, may wish to hide their identity and online activity to protect their privacy, and an anonymity network is typically utilized for this purpose. Anonymity networks such as Tor (The Onion Router) provide layers of data encryption, which poses a challenge to traffic analysis techniques. Traffic fingerprinting studies have employed various traffic analysis and statistical techniques over anonymity networks, most using a similar set of features, including packet size, packet direction, total packet count, and other summaries over the packets. Moreover, various defense mechanisms have been proposed to counteract these feature selection processes, thereby reducing prediction accuracy. In this dissertation, we address the aforementioned challenges and present a novel method to extract characteristics from encrypted traffic by utilizing data dependencies that occur over sequential transmissions of network packets. In addition, we explore the temporal nature of encrypted traffic and introduce an adaptive model that considers changes in data content over time. We not only consider traditional learning techniques for prediction, but also use semantic vector space models (VSMs) of language, in which each word (packet) is represented as a real-valued vector. We also introduce a novel defense algorithm to counter the traffic fingerprinting attack: the defense uses sampling and mathematical optimization techniques to morph packet sequences and destroy traffic flow dependency patterns. Cyber deception has been shown to be a key ingredient in cyber warfare. Cyber security deception is the methodology an organization follows to lure an adversary into a controlled and transparent environment for the purpose of protecting the organization, disinforming the attacker, and discovering zero-day threats. We extend our traffic fingerprinting work to the cyber deception domain and leverage recent advances in software deception to enhance intrusion detection systems by feeding attack traces back into machine learning classifiers. We present a feature-rich attack classification approach to extract security-relevant network- and system-level characteristics from production servers hosting enterprise web applications.
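The classic feature recipe mentioned above (packet sizes, directions, counts, bursts) can be sketched in a few lines; signed sizes encode direction (+ outgoing, - incoming). Toy values below, not real Tor traces:

```python
# Classic fingerprinting features from a flow of signed packet sizes:
# counts, total bytes, and the longest burst of same-direction packets.
def flow_features(packets):
    bursts, run = [], 1
    for prev, cur in zip(packets, packets[1:]):
        if (prev > 0) == (cur > 0):
            run += 1
        else:
            bursts.append(run)
            run = 1
    bursts.append(run)
    return {"n_packets": len(packets),
            "n_outgoing": sum(1 for p in packets if p > 0),
            "total_bytes": sum(abs(p) for p in packets),
            "max_burst": max(bursts)}

print(flow_features([+512, +1400, -1400, -1400, -900, +64]))
# {'n_packets': 6, 'n_outgoing': 3, 'total_bytes': 5676, 'max_burst': 3}
```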
Item: Enhancing Point Cloud Generation From Various Information Sources by Applying Geometry-aware Folding Operation (2022-05-01). Lin, Yu; Khan, Latifur; Fumagalli, Andrea; Ruozzi, Nicholas; D'Orazio, Vito; Du, Ding-Zhu

A plethora of cutting-edge computer vision and graphics applications, such as augmented reality (AR), virtual reality (VR), autonomous vehicles, and robotics, require rapid creation of, and access to, abundant 3D data. Among the various 3D data representations, e.g., RGB images, depth images, or voxel grids, the point cloud attracts considerable attention from the research community because it offers additional geometric, shape, and scale information compared with 2D images and demands fewer computational resources to process than other 3D representations such as voxel grids, octrees, or triangle meshes. Unfortunately, even with the increasing availability of 3D sensors, the size and variety of 3D point cloud datasets pale in comparison with the vast datasets available for other representations. Therefore, many applications would benefit from generating point clouds from other information sources. Point cloud generation is a subfield of 3D reconstruction that aims to generate a complete 3D object from other information sources. Conventional methods generally focus on 2D images and rely heavily on knowledge of multi-view geometry, yet multiple 2D views of a target 3D object are usually inaccessible in many real-world scenarios. Recent deep learning approaches, on the other hand, either dedicate themselves to 3D representations with regular structures, such as voxel grids and octrees, and thus suffer from resolution and scalability issues, or unconsciously ignore crucial 3D prior knowledge and lead to sub-optimal solutions. To address these drawbacks, this dissertation explores ways to improve point cloud generation by developing advanced folding operations and geometry-aware (3D-prior-aware) reconstruction networks. Specifically, we start with TDPNet, a novel point cloud generation framework that reconstructs complete point clouds by employing a hierarchical manifold decoder and a collection of latent 3D prototypes. Later, we find that applying the vanilla folding operation is insufficient for realistic reconstruction, and that using k-means centroids as the prototype features is unstable and lacks interpretability. Inspired by these observations, we further introduce a novel framework equipped with a collection of Learnable Shape Primitives (L-SHAP), which encode crucial 3D prior knowledge from training data through an additional folding operation. Furthermore, many applications would benefit if point clouds could be generated in a few-shot scenario; we tackle this problem with FSPG, a novel few-shot generation framework that simultaneously considers class-agnostic and class-specific 3D priors during generation. Finally, we observe that conventional folding operations are implemented with a simple shared MLP, which increases training difficulty and limits the network's modeling capability. To solve this problem, we incorporate the popular Transformer architecture into AttnFold, a novel attentional folding decoder, and introduce a Local Semantic Consistency (LSC) regularizer to further boost the model's capability. Based on our research, we demonstrate that learning flexible data-driven 3D priors and adopting advanced folding operations are effective for point cloud generation under different problem settings.
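A sketch of the vanilla folding operation the dissertation starts from and improves on: a fixed 2-D grid is concatenated with a shape codeword and deformed into 3-D points by a shared MLP (FoldingNet-style; the layer sizes here are illustrative, not the dissertation's architecture):

```python
# Vanilla folding operation: deform a fixed 2-D grid into a 3-D point cloud,
# conditioned on a shape codeword, via a shared MLP.
import torch
import torch.nn as nn

class Fold(nn.Module):
    def __init__(self, code_dim=512, grid_n=45):
        super().__init__()
        g = torch.linspace(-1, 1, grid_n)
        self.grid = torch.cartesian_prod(g, g)            # (grid_n^2, 2)
        self.mlp = nn.Sequential(nn.Linear(code_dim + 2, 256), nn.ReLU(),
                                 nn.Linear(256, 3))       # grid point -> 3-D point

    def forward(self, code):                              # code: (code_dim,)
        n = self.grid.shape[0]
        inp = torch.cat([code.expand(n, -1), self.grid], dim=1)
        return self.mlp(inp)                              # (n, 3) point cloud

cloud = Fold()(torch.randn(512))
print(cloud.shape)  # torch.Size([2025, 3])
```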
Item: Ensuring Integrity, Privacy, and Fairness for Machine Learning Using Trusted Execution Environments (December 2021). Asvadishirehjini, Aref; Kantarcioglu, Murat; Wallace, Robert; Thuraisingham, Bhavani; Khan, Latifur; Iyer, Rishabh Krishnan

In this day and age, numerous decision-making systems increasingly rely on machine learning (ML) and deep learning to deliver cutting-edge technologies to members of society. Due to potential security, privacy, and bias issues in these ML methods, end users currently cannot fully trust these systems with their private data or their prediction outcomes. For instance, in many cases it is not clear how an individual's medical record is being used to build tools for medical diagnosis. Is the data always encrypted at rest? When it is decrypted, is there a guarantee that only a trusted application can access the private data, eliminating potential misuse? Throughout this dissertation, solutions that leverage the security and integrity capabilities of hardware-assisted Trusted Execution Environments (TEEs) are proposed to make ML-based systems more reliable and trustworthy, so that end users can place greater trust in them. As a starting point, we first address the privacy and integrity issues of ML model training in the cloud. Training a deep learning model relying solely on a TEE is not very attractive to businesses that need to continuously train their models in a remote cloud setting, because special hardware such as graphics processing units (GPUs) is much more efficient at training ML models than CPU-based TEEs. We propose an integrity-preserving solution that combines TEEs and GPUs: the ML model is trained on the efficient GPU, while the TEE's capabilities are used to detect, with high probability, any deviation from the ML model learning protocol. Using our solution, we can ascertain (with high probability) that the model is trained with the correct training dataset, the correct training hyperparameters, and the correct code execution flow. Given an integrity-preserving training solution, we then focus on how to use the learned ML model privately and securely in practice. To provide privacy-preserving inference on sensitive data when the ML model owner and the data owner do not trust each other, the dissertation proposes a solution in which the inference task runs inside a TEE and the result is sent to the data owner(s). The most important benefit is that data owners can ensure their data will not be used for any other purpose, and that no information beyond the agreed model inference result is disclosed; we show the efficacy of this solution in the context of genomic data analysis. Next, we focus on the bias and unfairness embedded in certain ML models. It has been reported that ML models can treat certain subgroups unfairly, and such issues are hard to test for in deployments where both the ML model and its input data are sensitive (i.e., neither can be disclosed publicly for direct auditing). This dissertation proposes a privacy-preserving solution for fairness analytics using TEEs. In this setting, the model owner and the owner of the fairness test set do not trust each other and do not want their inputs disclosed. The goal is for a fairness analyst to test the quality and fairness of the model's outcomes with respect to a set of predefined minority groups or subgroups and to compare and contrast them with privileged group(s). In this way, models can be analyzed, and the analyst can shed light on potential latent biases in the ML model in a privacy-preserving manner. Even if an ML model is trained and deployed securely, data poisoning may leave hidden backdoors in the final model (referred to in the literature as trojan attacks). Finally, this dissertation develops novel techniques to detect such attacks: we design experiments that first create a multitude of models that carry a trojan and another set that do not, and then build classifiers to tell them apart. Our results show that ML models can be used to detect trojan attacks against other ML models.
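The fairness-analytics portion of the item reduces, at its simplest, to comparing outcome rates between privileged and minority groups inside the enclave; a minimal demographic-parity sketch with hypothetical decisions:

```python
# Demographic parity gap: difference in positive-prediction rate between the
# privileged group and everyone else (one simple fairness statistic).
import numpy as np

def demographic_parity_gap(preds, groups, privileged):
    preds, groups = np.asarray(preds), np.asarray(groups)
    rate_priv = preds[groups == privileged].mean()
    rate_rest = preds[groups != privileged].mean()
    return rate_priv - rate_rest

preds = [1, 1, 0, 1, 0, 0, 1, 0]                 # hypothetical model decisions
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_gap(preds, groups, privileged="a"))  # 0.5
```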
Item: Explainable AI Algorithms for Classification Tasks With Mixed Data (December 2022). Wang, Huaduo; Hayenga, Heather; Gupta, Gopal; Tamil, Lakshman; Salazar, Elmer; Nourani, Mehrdad; Khan, Latifur

Thanks to the great power of machine learning techniques, numerous applications have been created that have become an integral part of modern life. However, the decision-making processes of many of these machine-learning-based applications are questioned and criticized for their opacity to users, especially in critical tasks such as disease diagnosis, loan applications, and industrial robots. This opacity results from using statistical machine learning approaches that generate models which can be viewed as solutions to optimization problems minimizing loss or maximizing likelihood. Explainable Artificial Intelligence (XAI) models, or Explainable Machine Learning (XML) models, are machine-learned models whose decision-making or prediction-making process human users can understand. The main goals of XAI are to: 1) generate highly accurate models that are comprehensible to human users, and 2) explain a model's decision-making process to humans so that they can easily understand it, develop trust in it, and diagnose any potential problems. This dissertation presents the FOLD family of new explainable AI algorithms for classification tasks, which efficiently handle mixed (numerical and categorical) data without extra effort, i.e., without resorting to any special data encoding. These algorithms generate a set of default rules, represented as a stratified logic program, that serves as the predictive model. Due to their symbolic, logic-based nature, the models can be easily understood and modified by humans. The new algorithms are competitive in predictive performance with state-of-the-art machine learning algorithms such as XGBoost and multi-layer perceptrons (MLPs), while being an order of magnitude faster in execution. The FOLD-R++ algorithm is designed for binary classification problems, FOLD-RM for multi-category classification, and FOLD-LTR for ranking. FOLD-SE is a further improvement that provides scalable explainability: regardless of the size of the data, the generated model is represented with a small number of rules, resulting in improved human interpretability and explainability while maintaining excellent predictive performance. The rest of the thesis presents the FOLD family of algorithms and compares and contrasts them with state-of-the-art machine learning algorithms.
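The default rules FOLD produces follow the classic "default with exception" pattern, e.g., flies(X) :- bird(X), not penguin(X). A tiny sketch of evaluating one such rule with negation-as-failure over toy facts (illustrative, not FOLD output):

```python
# Evaluating a default rule with an exception under negation-as-failure:
# flies(X) :- bird(X), not penguin(X), over a toy fact base.
facts = {("bird", "tweety"), ("bird", "pingu"), ("penguin", "pingu")}

def holds(pred, x):
    return (pred, x) in facts

def flies(x):
    # default: birds fly; exception: penguins do not
    return holds("bird", x) and not holds("penguin", x)

print(flies("tweety"), flies("pingu"))  # True False
```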
Item: Exploration of Podcast Corpora, Summarization, and Search (2020-11-17). Perez, Mathew; Khan, Latifur

Podcasts have emerged as an increasingly ubiquitous form of media. This new medium carries several idiosyncrasies, such as multiple speakers, varying audio quality, and oscillating topics. As podcast consumption grows, so does the need for knowledge and algorithms applicable to this burgeoning data space. We focus on two useful data tasks, summarization and search, developing methods to tackle both problems and discussing how existing methods in both areas can be tailored to podcast data. Specifically, we use Spotify's podcast dataset, comprising episodes from their ever-growing database of podcasts, as a case study in the data space. We also explore this novel dataset, drawing several judgments and identifying patterns regarding the nature of podcast data. We conclude by considering future work and improvements as podcast data continues to grow and its analysis matures.