A Generalization of the Community Detection Problem via Side Information
Abstract
Abstract
The standard community detection consists of observing a graph and detecting its node
labels. However, many practical problems offer further information about the individual node
labels, which we denote side information. For example, social networks such as Facebook
and Twitter have access to information other than the graph edges, such as age, gender, etc.
This dissertation aims to understand when and by how much can side information change
the fundamental limits of the community detection problem, devise efficient algorithms for
it, and study the asymptotic performance of these algorithms.
In the context of the binary symmetric and single-community stochastic block models for the
graph, we introduce a model for side information that conveniently captures the variation of
its quantity and quality as the size of the graph grows.
For the binary symmetric stochastic block model under side information, we characterize
tight necessary and sufficient conditions for exact recovery. An efficient, asymptotically
optimal two-stage algorithm is introduced for recovery. Furthermore, the analysis of phase
transition is extended to continuous-valued side information.
For the single community stochastic block model, we characterize tight necessary and sufficient conditions for weak and exact recovery. An efficient belief propagation algorithm under
side information is proposed. A weak recovery phase transition for belief propagation is characterized when both quality and quantity are fixed. When quality of side information varies
with graph size, sufficient conditions for weak recovery are established. Sufficient conditions
for belief propagation to achieve exact recovery are derived.
The extrinsic information transfer (EXIT) method is applied to the analysis of belief propagation in community detection with side information, providing insights such as the asymptotic threshold for weak recovery, the performance of belief propagation near the optimal
threshold, and the performance of belief propagation through the first few iterations.