Browsing by Author "Wu, Weili"

A Big Data Framework for Unstructured Text Processing With Applications Towards Political Science and Healthcare
(2021-12-01) Salam, Sayeed; Khan, Latifur; Hu, Yang; Bastani, Farokh B.; Kim, Dohyeong; Wu, Weili
Machine learning and deep neural networks have soared in popularity in recent years, allowing us to enhance many aspects of everyday life. While these methods are intuitive, they are very reliant on the dataset being used to build the model. A high-quality dataset boosts the model’s accuracy and validates the model’s output in the context of a real-world scenario. Furthermore, continuous improvement of the dataset contributes to tuning the model in a time-consistent way and to mitigating temporal inconsistencies. However, preparing datasets, particularly for text domains, is difficult due to the inherently unstructured nature of the data and the use of multiple languages. Furthermore, the amount of text produced in the form of news articles or social media posts is massive, necessitating large-scale processing. The velocity at which new texts are produced demands an elastic and scalable system that can accommodate any surge of inputs while remaining resource efficient when not in use. Texts are created in a variety of ways and must be preprocessed and analyzed in order to provide well-structured, consistent data. This can be accomplished through the use of a well-defined domain-specific ontology (rule-based approach) or machine learning approaches. While rule-based systems can provide information that is more precise and are preferred in a variety of circumstances, they lack flexibility, as the ontologies are often fixed and do not respond well to continuous changes in their respective domains. We propose associated solutions to the challenges described above in this dissertation. First, we go over a scalable architecture for collecting news stories from around the world and utilizing a rule-based approach with the Conflict and Mediation Event Observation (CAMEO) ontology to generate political events. We present a summary of the generated dataset, as well as some basic analysis, to demonstrate how it relates to the real-world scenario.
We present techniques for dynamically adding information to the ontology using a mining approach for discovering new political actors that works as a recommender system and retrieves more than 80% of the missing information, including political figures and their roles. We discuss an extended data processing system for processing articles published in several languages, with a focus on translation methodologies and the tools developed. We demonstrate the efficacy of the coder in Spanish in comparison to English. When compared to equivalent events in English articles, the revised event coder with a translated knowledge base was able to recognize 83% of the information in Spanish. For healthcare, we propose an alternative strategy in which we use several machine learning algorithms and social media, such as tweets, to extract the location and severity of Road Traffic Incidents (RTI). We highlight a pipeline that goes from collecting tweets to summarizing related tweets for an RTI. We also demonstrate how semi-automatic ontology learning can be useful in determining severity and offer a simplified example in which 100% of the target rules were identified using an iterative technique.
A Random Algorithm for Profit Maximization in Online Social Networks
(Elsevier B.V.) Chen, T.; Liu, B.; Liu, W.; Fang, Q.; Yuan, Jing; Wu, Weili; 56851698 (Wu, W)
Given a social network G and a positive integer k, the influence maximization problem seeks k nodes in G that can influence the largest number of nodes. This problem has found important applications, and a large body of work has been devoted to identifying the few most influential users. However, most existing work focuses only on the diffusion of a single idea or product in social networks. In reality, one company may produce multiple kinds of products, and one user may also have multiple adoptions. For multiple kinds of products with different activation costs and profits, it is crucial for the company to distribute its limited budget among the products in order to achieve profit maximization. The Profit Maximization with Multiple Adoptions (PM²A) problem seeks a seed set within the budget that maximizes the overall profit. In this paper, a Randomized Modified Greedy (RMG) algorithm based on the Reverse Influence Sampling (RIS) technique is presented for the PM²A problem. It achieves a (1-1/e-ε)-approximate solution with high probability, which is also the best performance ratio for the PM²A problem. Comprehensive experiments on three real-world social networks are conducted, and the results demonstrate that our RMG algorithm outperforms the algorithm proposed in [16] and other heuristics in terms of profit maximization, and better allocates the budget. ©2019 Elsevier B.V.
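The Reverse Influence Sampling idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's RMG algorithm: it assumes a single product, unit seed costs, and the independent cascade model, and the function names (`sample_rr_set`, `ris_greedy`) and the `in_edges` input format are hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def sample_rr_set(nodes, in_edges, rng):
    """Sample one reverse-reachable (RR) set under the independent cascade model.

    in_edges[v] is a list of (u, p) pairs: edge u -> v is live with probability p.
    (This input format is an assumption of the sketch.)
    """
    root = rng.choice(nodes)
    rr, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for u, p in in_edges.get(v, []):
            # Walk edges backwards, keeping each one with its own probability.
            if u not in rr and rng.random() < p:
                rr.add(u)
                stack.append(u)
    return rr


def ris_greedy(nodes, in_edges, k, num_samples=2000, seed=0):
    """Pick k seeds by greedy maximum coverage over sampled RR sets."""
    rng = random.Random(seed)
    rr_sets = [sample_rr_set(nodes, in_edges, rng) for _ in range(num_samples)]

    # covers[u] = indices of the RR sets that contain node u.
    covers = defaultdict(set)
    for i, rr in enumerate(rr_sets):
        for u in rr:
            covers[u].add(i)

    seeds, covered = [], set()
    for _ in range(k):
        candidates = [u for u in nodes if u not in seeds]
        best = max(candidates, key=lambda u: len(covers[u] - covered))
        seeds.append(best)
        covered |= covers[best]

    # The fraction of covered RR sets, scaled by n, estimates the expected spread.
    estimated_spread = len(nodes) * len(covered) / num_samples
    return seeds, estimated_spread


if __name__ == "__main__":
    # Star graph: node 0 influences nodes 1..5 with probability 1.
    nodes = list(range(6))
    in_edges = {v: [(0, 1.0)] for v in range(1, 6)}
    print(ris_greedy(nodes, in_edges, k=1))
```

A budgeted, multi-product variant such as the one the abstract describes would replace the plain coverage count with a profit-per-cost criterion; that part is not sketched here.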
Breach-Free Sleep-Wakeup Scheduling for Barrier Coverage with Heterogeneous Wireless Sensors
(Institute of Electrical and Electronics Engineers Inc.) Zhang, Z.; Wu, Weili; Yuan, Jing; Du, Dingzhu; 288884264 (Du, D)
Barrier coverage plays a vital role in wireless sensor networks. Research on barrier coverage has mainly focused on lifetime maximization and on the critical conditions for achieving k-barrier coverage under various sensing models. When sensors are randomly deployed along the boundary of an area of interest, they may form several disjoint barrier covers. To maximize the lifetime of barrier coverage, those barrier covers need to be scheduled to avoid a security problem called a breach. In a heterogeneous wireless sensor network, given a set of barrier covers, each with a lifetime, we study the problem of finding a lifetime-maximizing subset with a breach-free sleep-wakeup schedule. We first prove that whether a given sleep-wakeup schedule is breach-free can be decided in polynomial time, but that, given a set of barrier covers, it is NP-complete to determine whether a breach-free schedule exists. Then, we show that the problem of finding a lifetime-maximizing breach-free schedule is equivalent to the maximum node-weighted path problem in a directed graph, and we design a parameterized algorithm. Experimental results show that our algorithm significantly outperforms the heuristics proposed in the literature.
Community Detection in Social Networks
(2016-12) Cui, Lei; Wu, Weili
Community structure is one of the essential properties of social networks. That is, the users can be divided into groups within which communications are dense and between which communications are sparse. This modular structure can disclose important cues, especially in online social networks, metabolic interaction networks, the WWW, wireless sensor networks and viral marketing, and can facilitate the creation, representation and transfer of knowledge and influence. Because there is no standard, exact definition of community structure in social networks, we take time complexity and effectiveness into account and design community detection algorithms adapted to various application situations. In this dissertation, based on existing popular community detection methods, several innovative methods and ideas are proposed, including Global Influence-based Maximum K-Community Partition (GI-MKCP), Degree-based Terminal-Set-Enhanced Community Detection (TSECD-D), Influence-based Modularity Maximization and Competitive Influence Maximization Game-based community detection. Because general community detection problems are NP-hard, our proposed algorithms run in polynomial time and approximate the optimal solution. More importantly, as validated by experiments on benchmark networks, the proposed algorithms generate high-quality community structures and outperform existing algorithms.
Content Spread and User Relations in Social Computing
(2019-08) Li, Yi; 0000-0002-0875-3475 (Li, Y); Wu, Weili
With the rapid growth of social media and the rise in popularity of social networks, content sharing and spreading have become the major activities of social media users. One valuable characteristic of social networks is their capability to let user-generated content circulate rapidly through the whole network and spread influence to others. Another characteristic is their openness to everyone. This enables not only news organizations and government agencies to post information, but also ordinary citizens to post from their own perspectives and experiences. In this way, users have access to more comprehensive and complicated information online. On the one hand, social networks offer users many valuable experiences. We can take advantage of social networks so that, for example, the spread of innovative ideas is maximized or the expectations of users are satisfied. On the other hand, we hope to take action against the negative effects that social networks bring to users, for example by limiting the spread of rumors and misinformation or by minimizing the negative influence on cybervictims. In this dissertation, we study several problems regarding both positive and negative content spread on social networks. First, we study the emerging problems of misinformation/rumor blocking and of minimizing the influence of cyberbullying on a specific user, based on the Independent Cascade diffusion model and its variant, the Competitive Independent Cascade model. We formulate these two problems as optimization problems and design algorithms with performance guarantees. Second, we propose a content spread maximization problem and formulate the problem from a marginal gain perspective. As the considered problems are all NP-hard, we focus on the analysis of approximation results. Third, because network structures change dynamically, we predict missing links and emerging links based on community structure.
Last, we study the correlations between user-generated content and users' roles in online discussion forums.
Cross-Domain Data Fusion for Disaster Detection
(2017-05) Ghosh, Smita; Wu, Weili
With the advancement of the internet and the World Wide Web, staying updated with current affairs has become very easy. Every recent news item or current event is just a few keystrokes away. The large number of domains coming into existence every day, whether search engines, news domains or social media domains, brings with it an abundance of information. This gives rise to two main questions. Is the information about a particular event from one domain enough? Is the information correct? The answer to the first question varies from person to person. One person might be satisfied with the result of querying one domain, while others might be curious to know what other domains have to offer for a given query. This leads to the need to summarize data from various domains. Summarization of data and high accuracy may not seem vital for a regular event, for instance, someone querying “Cold Play Concert in the US”. But they become important when someone queries “Earthquake in California”. In scenarios where people want to monitor a disaster, it is very useful to have information gathered from various sources and summarized in one place. Researchers all over the world have come up with cross-domain data fusion techniques for monitoring disasters. We introduce a dataflow for cross-domain data fusion that gathers raw data on current disasters from various sources, processes it, and aggregates it into a summarized table. This approach tries to lessen the need to move from one domain to another to obtain information about a particular event. It also tries to validate the summarized information on the premise that the more domains display the same information, the more accurate the data is likely to be. We evaluate the approach by the amount of relevant information obtained from different domains.
Deep Learning Methods for Improving Event Extraction on Political and Social Science Studies
(2022-05-01) Skorupa Parolin, Erick; Khan, Latifur; O, Kenneth K.; Wu, Weili; Bastani, Farokh B.; Brandt, Patrick T.
Political and social scholars increasingly rely on event coders, which are automated systems that extract structured event representations from news articles, in order to monitor, analyze and predict conflicts and affairs involving political entities across the globe. However, the existing event coders rest on outdated pattern matching techniques, relying on large manually maintained dictionaries composed of lexico-syntactic patterns designed for capturing conflict events. Apart from the high costs, time and specialized knowledge required to update and expand such dictionaries, these techniques do not support event extraction on multilingual corpus. As a consequence, the application of existing systems often yields low-recall results and imposes limitations when working with sources coming from different countries and languages. In this dissertation, we propose deep learning based frameworks to obtain state-of-the-art results for extracting structured events from natural language text in political and social sciences domains. We do so by exploring three main directions: (i) automatically extending the external dictionaries and knowledge bases utilized in the current event coders through knowledge extraction techniques; (ii) formulating the event coding task as a classification problem and proposing a supervised deep learning model to solve it; and (iii) developing an innovative deep neural network design by combining state-of-the-art language representation models with multi-task learning technique to efficiently extract events in a structured format from multilingual corpus. We demonstrate the superiority of our approaches through conducting extensive experiments on real-world multilingual corpora based on political science and conflict domains.
Healthcare Information Platform in AI Era
(2020-11-30) Hu, Yanke; Wu, Weili
Healthcare analytics has attracted increasing research interest as the volume of electronic health records (EHR) and medical image data has skyrocketed over the past decade [1]. EHR and lab reports contain rich text, visual, and time series information, such as a patient’s medical and diagnosis history, radiology images, etc., which is the major source for managing and predicting a patient’s health status. Meanwhile, Deep Learning [2] has greatly pushed forward the research frontier of computer vision, speech recognition, and natural language processing since its big success in the ImageNet 2012 competition [3]. There is increasing interest, from the combined effort of industry and academia, in applying state-of-the-art deep learning techniques to the healthcare industry. IBM [4, 5], Amazon [6, 7], and Google [77, 9, 10, 11] have all released healthcare information services that can provide early symptom warnings and diagnostic support and help make clinical decisions. Medical schools and healthcare institutes have also conducted extensive research with deep learning models on illness detection, physiological signal classification, mortality early warning detection, Intensive Care Unit (ICU) length of stay prediction, etc. [12, 13, 14, 15, 16, 17]. In this dissertation, we present two use cases of applying recent progress in Deep Learning to the healthcare domain: (1) Faster Healthcare Time Series Classification with Convolutional Feature Engineering, and (2) Deep Healthcare Pre-Trained Language Models on Mobile Devices. Our work not only has generated several top tier conference papers, but will also lay the foundation for the next generation healthcare information platform development in the US.
IncRDD: Incremental Updates for RDD in Apache Spark
(2017-05) Dodabelle Prakash, Prathish; Wu, Weili
Data is constantly changing. Today, existing data often receives incremental updates. As the data evolves with new updates, the results of big data applications gradually become stale. The results must be refreshed efficiently for every update. Apache Spark is used to process multiple petabytes of data on clusters with thousands of nodes. The core abstraction of Spark is the RDD (Resilient Distributed Dataset), which is an immutable collection of elements. Thanks to the immutability of RDDs, Spark processes data in parallel, permits data reuse, and handles failures and stragglers efficiently. But Spark lacks flexibility and efficiency in the incremental processing of small updates. In this thesis, the IncRDD framework is proposed for incremental processing of updates to existing data. IncRDD retains all the powerful features of Spark, including parallel processing, data reusability, and fault tolerance. New RDD operations are implemented to add new records, update existing records, and delete them. We introduce a new variant of Cuckoo hashing, Dual-CH Fast-Simple. Dual Cuckoo hashing uses two cuckoo hash tables. The first cuckoo table is used to store records in every partition of a node. The second hash table is used to implement structural sharing, which adds persistence, utilizes previous versions, and avoids expensive re-computation. We evaluate IncRDD using incremental algorithms and provide experimental results that show a significant improvement in the performance of Incremental RDD.
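For readers unfamiliar with the building block this thesis extends, the following is a minimal sketch of standard two-table cuckoo hashing with add, update, and delete operations. It is not the Dual-CH Fast-Simple variant and has no connection to Spark's API; the class name, the salted use of Python's built-in `hash`, and the grow-on-failure policy are all assumptions made for illustration.

```python
class CuckooHashTable:
    """Textbook-style cuckoo hash table with two arrays and two hash functions."""

    def __init__(self, capacity=8, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which, key):
        # Two hash functions derived from Python's hash by salting with the
        # table index (an illustrative choice, not a tuned hash family).
        return hash((which, key)) % self.capacity

    def get(self, key, default=None):
        # A key can only live in one of two slots, so lookups are O(1).
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # Update in place if the key is already stored.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        # Otherwise insert, evicting ("kicking") occupants to their other table.
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._slot(which, entry[0])
            evicted = self.tables[which][i]
            self.tables[which][i] = entry
            if evicted is None:
                return
            entry, which = evicted, 1 - which
        # Too many evictions: enlarge both tables and retry the pending entry.
        self._grow()
        self.put(entry[0], entry[1])

    def delete(self, key):
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = None
                return True
        return False

    def _grow(self):
        old = [e for table in self.tables for e in table if e is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k, v in old:
            self.put(k, v)


if __name__ == "__main__":
    table = CuckooHashTable()
    for record_id in range(100):
        table.put(record_id, f"record-{record_id}")
    table.put(7, "updated")   # update an existing record
    table.delete(8)           # delete a record
    print(table.get(7), table.get(8), table.get(99))
```

The structural-sharing role of the second table described in the abstract would need versioned entries on top of this; that part is not sketched.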
Influence Optimization Problems in Social Networks
(2020-05) Gu, Shuyang; Wu, Weili
Online social networks have developed and prospered over the last two decades. My dissertation focuses on the study of social influence. Several practical problems concerning social influence are formulated as optimization problems. First, users of online social networks such as Twitter and Instagram naturally tend to expand their social relationships. Thus, one important social network service is to suggest potential friends that a user might be interested in, which is called friend recommendation. Unlike friend recommendation, which is a passive way for a user to connect with a potential friend, my work tackles a different problem named active friending: an optimization problem about how to friend a person in a social network by taking advantage of social influence, increasing the acceptance probability by maximizing the influence of mutual friends. Second, the influence maximization problem has been studied extensively with the development of online social networks. Most existing work focuses on maximizing influence spread under the assumption that the number of influenced users determines the success of product promotion. However, the profit of some products, such as online games, depends on the interactions among users in addition to the number of users. We take both the number of active users and the user-to-user interactions into account and propose the interaction-aware influence maximization problem. Furthermore, due to the uncertainty in edge probability estimates in social networks, we propose the robust profit maximization problem, which seeks the best solution under the worst-case probability settings.
IoT Data Discovery and Learning
(2022-08-01) Tran, Trung Hieu; Yen, I-Ling; Bastani, Farokh; Cho, Kyeongjae; Khan, Latifur; Wu, Weili
The massive number of Internet-of-Things (IoT) devices creates a torrent of data. These data may be stored and hosted by nodes dispersed over the edge of the Internet, forming peer-to-peer (p2p) IoT database networks (IoT-DBNs) that can be dynamically discovered and used to enhance daily operations and solve real-world problems. The issues in making use of the massive amount of IoT data include how to discover the IoT data streams from the IoT-DBN and how to learn and extract useful knowledge from the discovered data to help cope with dynamically arising tasks. In this dissertation, we consider these two problems and develop solutions for them. First, we consider the IoT data discovery problem in growing IoT-DBNs. We show the benefits of p2p unstructured routing for IoT data discovery and point out the space efficiency issue that has been overlooked in keyword-based routing algorithms. As the first in the field, this work investigates routing table designs and various compression techniques to support effective and space-efficient IoT data discovery routing. Novel summarization algorithms are proposed, including alphabetical-based, hash-based, and meaning-based summarization and their corresponding coding schemes. We also consider routing table design to support summarization without degrading lookup efficiency for discovery query routing. To evaluate our approach, we collected 100K IoT data streams from various IoT resources and distributed them over a simulated Internet. Then, our data discovery routing with the summarization techniques is applied to handle discovery queries. The results show that our summarization solutions can reduce the routing table size by a factor of 20 to 30 with a 2-5% increase in latency compared with other peer-to-peer discovery routing algorithms. Our approach outperforms DHT-based approaches by a factor of 2 to 6 in latency and communication cost.
After IoT data discovery and retrieval, a prominent problem is how to learn from the data to address real-world tasks. Since different applications require different learning schemes, we choose to focus on one example application, the estimated time of arrival (ETA) problem, which is very important in intelligent transportation systems and has received a lot of attention recently. Though many tools exist for ETA, ETA for special vehicles, such as ambulances, fire engines, etc., is still challenging due to the scarcity or non-existence of data. To tackle it, we propose a deep transfer learning framework, TLETA, for the ETA of special vehicles. TLETA constructs cellular-level spatial-temporal knowledge for fine-grained extraction of driving patterns. The learning network contains transferable layers to support knowledge transfer between different categories of vehicles. Importantly, our transfer models only train the last layers to map the transferred knowledge, significantly reducing the training time to achieve real-time learning. We also introduce the inter-region transfer method to build a mapping function between vehicle domains within a region. The mapping functions of the top-k regions by spatial-temporal similarity are then used to construct the predictor in regions whose target data is unavailable. The experimental studies show that our model outperforms many state-of-the-art approaches in accuracy and training time.
Marginal Gains to Maximize Content Spread in Social Networks
(Institute of Electrical and Electronics Engineers Inc., 2019-05-06) Yang, W.; Ma, J.; Li, Y.; Yan, R.; Yuan, Jing; Wu, Weili; Li, D.; 56851698 (Wu, W)
The growing importance of social networks for sharing and spreading various contents is changing the way information diffuses. The extent to which social content can be diffused depends heavily on the size of the seed set and the connectivity of the network. If the seed set is predetermined, then the best way to maximize the content spread is to add connections among the users. Existing work shows the content spread maximization problem to be NP-hard. One of the difficulties in designing an effective and efficient algorithm for the content spread maximization problem is that the objective function we aim to maximize lacks submodularity. In our work, we formulate the content spread maximization problem from an incremental marginal gain perspective. Although the objective function we derive is not submodular, both submodular lower and upper bounds are constructed and proved. Therefore, we apply the sandwich framework and devise a marginal increment-based algorithm (MIS) that guarantees a data-dependent factor. Furthermore, a novel scalable content spread maximization algorithm, influence ranking and fast adjustment (IRFA), which is based on the influence ranking of a single node and fast adjustment with each boosting step in the network, is proposed. Through extensive experiments, we demonstrate that both the MIS and IRFA algorithms are effective and outperform other edge selection strategies.
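The sandwich framework referred to in this abstract can be sketched generically: run a greedy selection on the objective and on its lower and upper bounds, then keep whichever candidate scores best under the true objective. The sketch below is not the paper's MIS or IRFA algorithm; the toy set functions, the item names, and the helper names `greedy` and `sandwich` are hypothetical and chosen only so that the example is small enough to check by hand.

```python
def greedy(ground, k, func):
    """Pick up to k items, each time adding the item that maximizes func."""
    chosen = []
    for _ in range(k):
        rest = [x for x in ground if x not in chosen]
        if not rest:
            break
        best = max(rest, key=lambda x: func(chosen + [x]))
        chosen.append(best)
    return chosen


def sandwich(ground, k, f, lower, upper):
    """Run greedy on the lower bound, f itself, and the upper bound;
    return the candidate with the highest true objective value f."""
    candidates = [greedy(ground, k, g) for g in (lower, f, upper)]
    return max(candidates, key=f)


# Toy instance (illustrative only): items cover elements, and the true
# objective loses one unit when items "a" and "b" are chosen together.
COVER = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}


def coverage(selection):
    return len(set().union(*(COVER[x] for x in selection))) if selection else 0


def f(selection):
    penalty = 1 if {"a", "b"} <= set(selection) else 0
    return coverage(selection) - penalty


def upper(selection):
    return coverage(selection)          # f(S) <= upper(S) for every S


def lower(selection):
    return 0.5 * coverage(selection)    # lower(S) <= f(S) for every S here


if __name__ == "__main__":
    best = sandwich(["a", "b", "c"], 2, f, lower, upper)
    print(best, f(best))
```

In this toy instance the greedy runs on both bounds pick `a` and `b`, while the run on `f` itself picks `a` and `c`, which is the candidate the sandwich step returns.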
Maximisation of the Number of β-View Covered Targets in Visual Sensor Networks
(Inderscience Enterprises Ltd., 2019-03-24) Guo, L.; Li, D.; Wang, Y.; Zhang, Z.; Tong, Guangmo; Wu, Weili; Du, Dingzhu; 56851698 (Wu, W); 288884264 (Du, D)
In some applications using visual sensor networks (VSNs), the facing directions of targets are bounded. Therefore, existing full-view coverage (in which all the facing directions of a target constitute a disk) is not necessary. We propose a novel model called the β-view coverage model, through which only the necessary facing directions of a target are effectively viewed. This model uses far fewer cameras than the full-view coverage model. Based on the β-view coverage model, a new problem called the β-view covered target maximisation (BVCTM) problem is proposed to maximise the number of β-view covered targets given some fixed and freely rotatable camera sensors. We prove its NP-hardness and transform it into an equivalent Integer Linear Programming problem. In addition, a (1 - e⁻¹)-factor approximation algorithm and a camera-utility based greedy algorithm are given for this problem. Finally, we conduct many experiments and investigate the influence of many parameters on these two algorithms. © 2019 Inderscience Enterprises Ltd.
Near Optimal Social Promotion in Online Social Networks
(2018-12) Yuan, Jing; 0000-0001-6407-834X (YUAN, J); Wu, Weili
Due to the dramatic growth of the user population in the past decade, online social networks have become attractive channels for companies to launch marketing campaigns. We study several closely related problems in online social networks. We first study the problem of active friending, which can be considered a service provided by social platforms to stimulate user engagement. The idea is that a source user can name a target user and the social platform offers a friending strategy to maximize the probability of the target accepting the invitation from the source. We prove that the problem is NP-hard and propose a discrete semi-differential based algorithm with a guaranteed approximation ratio. As for the profit maximization problem, we consider the scenario in which profit is generated from group activities, captured by a hypergraph where hyperedges are introduced to represent the interactions among multiple users. We prove that the problem is NP-hard and develop an approximation algorithm along with an exchange-based technique to maximize the profit generated from group activities. In social advertising, the goal of online social platforms is to optimize the ad allocation with multiple advertisers. We assess the tradeoff between maximizing the revenue earned from each advertiser and reducing free-riding to ensure the truthfulness of advertisers. We prove that both the budgeted and unconstrained ad allocation problems are NP-hard and then develop constant factor approximation algorithms for both problems. We also study the adaptive version of the seed selection problem and the discount allocation problem. The goal is to find a limited set of highly influential seed users to whom proper discounts are assigned, such that the number of users who finally adopt the product is maximized. We propose a series of adaptive policies with bounded approximation ratios for both problems.
Optimization Problems for Maximizing Influence in Social Networks
(2020-04-21) Ghosh, Smita; Wu, Weili
Social networks have become very popular in the past decade. They started as platforms to stay connected with friends and family living in different parts of the world, but have evolved into much more, making Social Network Analysis (SNA) a very popular area of research. One popular problem under the umbrella of SNA is Influence Maximization (IM), which aims at selecting k initially influenced nodes (users) in a social network that will maximize the expected number of eventually influenced nodes (users) in the network. Influence maximization finds application in many domains, such as viral marketing, content maximization, epidemic control, virus eradication, rumor control and misinformation blocking. In this dissertation, we study several variants of the IM problem, such as Composed Influence Maximization, Group Influence Maximization, Profit Maximization in Groups and the Rumor Blocking Problem in Social Networks. We formulate objective functions for these problems and, as most of them are NP-hard, we focus on finding methods that ensure efficient estimation of these functions. The two main challenges we face are submodularity and scalability. To design efficient algorithms, we perform simulations with sampling techniques to improve the effectiveness of our solution approach.
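The IM objective defined in this abstract (choose k seeds to maximize the expected number of eventually influenced nodes) is commonly estimated by simulation. The following minimal sketch assumes the independent cascade diffusion model and a plain greedy seed selection; it is a generic illustration rather than any of the dissertation's algorithms, and the function names and the `out_edges` format are hypothetical.

```python
import random


def simulate_ic(out_edges, seeds, rng):
    """Run one independent cascade and return the number of activated nodes.

    out_edges[u] is a list of (v, p) pairs: u activates v with probability p.
    (This input format is an assumption of the sketch.)
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in out_edges.get(u, []):
                # Each newly active node gets one chance per out-neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def expected_spread(out_edges, seeds, runs=200, seed=0):
    """Monte Carlo estimate of the expected number of influenced nodes."""
    rng = random.Random(seed)
    return sum(simulate_ic(out_edges, seeds, rng) for _ in range(runs)) / runs


def greedy_im(nodes, out_edges, k, runs=200):
    """Greedily add the node with the largest estimated spread, k times."""
    seeds = []
    for _ in range(k):
        rest = [v for v in nodes if v not in seeds]
        best = max(rest, key=lambda v: expected_spread(out_edges, seeds + [v], runs))
        seeds.append(best)
    return seeds


if __name__ == "__main__":
    # Chain 0 -> 1 -> 2 with certain activation, plus an isolated node 3.
    nodes = [0, 1, 2, 3]
    out_edges = {0: [(1, 1.0)], 1: [(2, 1.0)]}
    print(greedy_im(nodes, out_edges, k=2))
```

Re-simulating for every candidate is what makes this approach slow on large graphs, which is one reason sampling-based estimators are of interest for scalability.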
Quality of Barrier Cover with Wireless Sensors
(Inderscience Enterprises Ltd, 2019-04-01) Wu, Weili; Zhang, Zhao; Gao, Chuangen; Du, Hai; Wang, Hua; Du, Dingzhu
A set of wireless sensors is called a barrier cover if they can monitor the boundary of an area so that an intruder cannot enter the area without being detected by some sensor. The quality of a barrier cover is the length of the shortest path along which an intruder can enter the area from outside. In this paper, we study four problems related to the quality of a barrier cover and give their computational complexity and algorithmic solutions.
Set Function Optimization
(Operations Research Society of China, 2018-12-17) Wu, Weili; Zhang, Z.; Du, Dingzhu
This article is an introduction to a recent development in optimization theory on set functions, nonsubmodular optimization. It covers two interesting results, the DS (difference of submodular) function decomposition and the sandwich theorem, together with the iterated sandwich method and data-dependent approximation. Some potential research problems are also mentioned. © 2018, The Author(s).
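The data-dependent approximation behind the sandwich idea can be written out in a few lines. This is a sketch under stated assumptions, not a quotation from the article: f is maximized under a cardinality constraint, L and U are monotone submodular with L(S) ≤ f(S) ≤ U(S) for every S, S_L and S_U are greedy solutions for L and U (each with the usual (1 - 1/e) guarantee), S^* is optimal for f, and S_sand is the best of the candidate solutions under f. All symbols are notation introduced here.

```latex
\begin{align*}
% Upper-bound side: greedy on U, then compare U with f at the optimum of f.
f(S_U) &= \frac{f(S_U)}{U(S_U)}\,U(S_U)
        \ge \frac{f(S_U)}{U(S_U)}\Bigl(1-\frac{1}{e}\Bigr)U(S^*)
        \ge \frac{f(S_U)}{U(S_U)}\Bigl(1-\frac{1}{e}\Bigr)f(S^*), \\
% Lower-bound side: greedy on L, evaluated against the optimum of f.
f(S_L) &\ge L(S_L) \ge \Bigl(1-\frac{1}{e}\Bigr)L(S^*)
        = \frac{L(S^*)}{f(S^*)}\Bigl(1-\frac{1}{e}\Bigr)f(S^*), \\
% Taking the better of the two candidates under f gives the combined bound.
f(S_{\mathrm{sand}}) &\ge \max\Bigl\{\frac{f(S_U)}{U(S_U)},\ \frac{L(S^*)}{f(S^*)}\Bigr\}
        \Bigl(1-\frac{1}{e}\Bigr)f(S^*).
\end{align*}
```

The first ratio can be computed from the returned solution, which is why the guarantee is called data-dependent; the second ratio involves the unknown optimum and serves only in the analysis.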
Several Practical Models and Their Approximate Solutions in Social Networks
(2020-12-08) Guo, Jianxiong; Du, Ding-Zhu; Wu, Weili
Online social platforms have developed thanks to the popularity of the Internet and have become the mainstream channel for daily communication as well as information spreading. The users and the relationships between users in these social platforms can be abstracted as a social graph (social network). The most typical application in social networks is viral marketing, which takes advantage of online advertisements to spread information to a larger audience in a short time. At the same time, we have to constrain the negative impact of misinformation spread. These tasks can be formulated as combinatorial optimization problems on a directed graph, such as influence maximization, profit maximization, and rumor blocking. Influence spread can be characterized by different diffusion models. However, the existing models cannot portray the colorful real world. In this dissertation, we propose a series of new diffusion models, including a complementary products model, a multi-feature diffusion model, a k-hop collaborate game model, and an influence balancing model, to adapt to some realistic applications in social networks, and we study their related algorithmic problems. Because of their NP-hardness, we focus on designing efficient approximation algorithms. To overcome the #P-hardness of computing the objective functions, we adopt reverse influence sampling techniques to improve efficiency without losing the approximation ratio.
Study in Big Data Harnessing and Related Problems
(2021-08-01T05:00:00.000Z) Jin, Rong; Wu, Weili; Ma, Dongsheng Brian; Khan, Latifur; Guo, Xiaohu; Bastani, Farokh B.
Social networks, such as Facebook and Twitter, have provided incredible opportunities for social communication between web users around the world. Social network analysis is an important problem in data harnessing. It helps summarize the interests and opinions of users (nodes), discover patterns in the interactions (edges) between users, and mine the events that take place on online platforms. The information obtained by analyzing social networks can be especially valuable for many applications. Typical examples include online advertisement targeting, viral marketing, personalized recommendation, health social media, social influence analysis, and citation network analysis. In this dissertation, we study two types of applications emerging from modern online social platforms from the viewpoint of social influence. One is the influence maximization (IM) problem in a discount-based online viral marketing scenario, which aims to maximize influence in the adoption of target products. The other is the online rumor source detection problem, in which the spread of misinformation is to be minimized and its source detected. We formulate them as set function optimization problems and design solutions with performance guarantees. In the study of set function optimization, a challenge arises from the submodularity of the objective function: some practical problems are neither submodular nor supermodular, so the existing greedy strategy cannot be applied directly to obtain a guaranteed approximate solution. To solve such non-submodular and non-supermodular problems, a method called DS decomposition has been considered, in which a given set function is represented as a difference between two submodular functions. Based on this method, we further study how to find a DS decomposition efficiently and effectively.
We then propose a generalized framework, made up of our novel algorithms in deterministic and randomized versions, for maximizing a DS decomposition, and we show their performance under various combinatorial settings. In addition, we discuss our findings on the role of the black box, which has been an important component in the study of computational complexity theory and has been used to establish the hardness of problems, and on its implied power and limitations in the study of data-driven computation for proving solutions to some computational problems.
Transformer Based Model for Political Spanish Text
(May 2023) Zawad, Niamat; Khan, Latifur; Bastani, Farokh B.; Wu, Weili
Conflict research is a subfield of political science that covers protests, riots, repression, genocide, criminal violence, etc. Conflict researchers are interested in tracking and analyzing conflict events. Because of the large number of conflicts happening across the globe, manually tracking and annotating them can be laborious, so researchers use language models to automate the process. While transformer-based language models have already been trained on English text, to the best of our knowledge no work has been done on training such models on Spanish text. Spanish is one of the most widely spoken languages in the world and is the medium in which many conflicts in Latin America are reported, so a model trained exclusively on Spanish text would hypothetically outperform models based on other languages. With this objective in mind, a domain-specific text corpus is first mined from various Spanish websites, and a BERT-based model is then trained from scratch on the corpus. The model is then evaluated on downstream tasks using available datasets to assess its practical application in conflict research. Finally, we evaluate several versions of BERT to compare their performance with that of our model.