David Gamarnik, Julia Gaudio
We consider the problem of estimating an unknown coordinate-wise monotone function given noisy measurements, known as the isotonic regression problem. Often, only a small subset of the features affects the output. This motivates the sparse isotonic regression setting, which we consider here. We provide an upper bound on the expected VC entropy of the space of sparse coordinate-wise monotone functions, and identify the regime of statistical consistency of our estimator. We also propose a linear program to recover the active coordinates, and provide theoretical recovery guarantees. We close with experiments on cancer classification, and show that our method significantly outperforms standard methods.
Julia Gaudio, Colin Sandon, Jiaming Xu, Dana Yang
This paper studies the problem of inferring a $k$-factor, specifically a spanning $k$-regular graph, planted within an Erdos--Renyi random graph $G(n,λ/n)$. We uncover an interesting "all-something-nothing" phase transition. Specifically, we show that as the average degree $λ$ surpasses the critical threshold of $1/k$, the inference problem undergoes a transition from almost exact recovery ("all" phase) to partial recovery ("something" phase). Moreover, as $λ$ tends to infinity, the accuracy of recovery diminishes to zero, leading to the onset of the "nothing" phase. This finding complements the recent result by Mossel, Niles-Weed, Sohn, Sun, and Zadik who established that for certain sufficiently dense graphs, the problem undergoes an "all-or-nothing" phase transition, jumping from near-perfect to near-zero recovery. In addition, we characterize the recovery accuracy of a linear-time iterative pruning algorithm and show that it achieves almost exact recovery when $λ< 1/k$. A key component of our analysis is a two-step cycle construction: we first build trees through local neighborhood exploration and then connect them by sprinkling using reserved edges. Interestingly, for proving impossibility of almost exact recovery, we construct $Θ(n)$ many small trees of size $Θ(1)$, whereas for establishing the algorithmic lower bound, a single large tree of size $Θ(\sqrt{n\log n})$ suffices.
Julia Gaudio, Miklós Z. Rácz, Anirudh Sridhar
We propose a simple and efficient local algorithm for graph isomorphism which succeeds for a large class of sparse graphs. This algorithm produces a low-depth canonical labeling, which is a labeling of the vertices of the graph that identifies its isomorphism class using vertices' local neighborhoods. Prior work by Czajka and Pandurangan showed that the degree profile of a vertex (i.e., the sorted list of the degrees of its neighbors) gives a canonical labeling with high probability when $n p_n = ω( \log^{4}(n) / \log \log n )$ (and $p_{n} \leq 1/2$); subsequently, Mossel and Ross showed that the same holds when $n p_n = ω( \log^{2}(n) )$. We first show that their analysis essentially cannot be improved: we prove that when $n p_n = o( \log^{2}(n) / (\log \log n)^{3} )$, with high probability there exist distinct vertices with isomorphic $2$-neighborhoods. Our first main result is a positive counterpart to this, showing that $3$-neighborhoods give a canonical labeling when $n p_n \geq (1+δ) \log n$ (and $p_n \leq 1/2$); this improves a recent result of Ding, Ma, Wu, and Xu, completing the picture above the connectivity threshold. Our second main result is a smoothed analysis of graph isomorphism, showing that for a large class of deterministic graphs, a small random perturbation ensures that $3$-neighborhoods give a canonical labeling with high probability. While the worst-case complexity of graph isomorphism is still unknown, this shows that graph isomorphism has polynomial smoothed complexity.
Ali Kaan Kurbanzade, Julia Gaudio
In typical applications of facility location problems, the location of demand is assumed to be an input to the problem. The demand may be fixed or dynamic, but ultimately outside the optimizers control. In contrast, there are settings, especially in humanitarian contexts, in which the optimizer decides where to locate a demand node. In this work, we introduce an optimization framework for joint facility and demand location. As examples of our general framework, we extend the well-known k-median and k-center problems into joint facility and demand location problems (JFDLP) and formulate them as integer programs. We propose a local search heuristic based on network flow. We apply our heuristic to a hurricane evacuation response case study. Our results demonstrate the challenging nature of these simultaneous optimization problems, especially when there are many potential locations. The local search heuristic is most promising when the the number of potential locations is large, while the number of facility and demand nodes to be located is small.
Julia Gaudio, Heming Liu
We consider the problem of exact community recovery in the Labeled Stochastic Block Model (LSBM) with $k$ communities, where each pair of vertices is associated with a label from the set $\{0,1, \dots, L\}$. A pair of vertices from communities $i,j$ is given label $\ell$ with probability $p_{ij}^{(\ell)}$, and the goal is to recover the community partition. We propose a simple spectral algorithm for exact community recovery, and show that it achieves the information-theoretic threshold in the logarithmic-degree regime, under the assumption that the eigenvalues of certain parameter matrices are distinct and nonzero. Our results generalize recent work of Dhara, Gaudio, Mossel, and Sandon (2023), who showed that a spectral algorithm achieves the information-theoretic threshold in the Censored SBM, which is equivalent to the LSBM with $L = 2$. Interestingly, their algorithm uses eigenvectors from two matrix representations of the graph, while our algorithm uses eigenvectors from $L$ matrices.
Julia Gaudio, Andrew Jin
Community detection is the problem of identifying dense communities in networks. Motivated by transitive behavior in social networks ("thy friend is my friend"), an emerging line of work considers spatially-embedded networks, which inherently produce graphs containing many triangles. In this paper, we consider the problem of exact label recovery in the Geometric Stochastic Block Model (GSBM), a model proposed by Baccelli and Sankararaman as the spatially-embedded analogue of the well-studied Stochastic Block Model. Under mild technical assumptions, we completely characterize the information-theoretic threshold for exact recovery, generalizing the earlier work of Gaudio, Niu, and Wei.
Julia Gaudio, Charlie K. Guan
This paper considers the problem of label recovery in random graphs and matrices. Motivated by transitive behavior in real-world networks (i.e., ``the friend of my friend is my friend''), a recent line of work considers spatially-embedded networks, which exhibit transitive behavior. In particular, the Geometric Hidden Community Model (GHCM), introduced by Gaudio, Guan, Niu, and Wei, models a network as a labeled Poisson point process where every pair of vertices is associated with a pairwise observation whose distribution depends on the labels and positions of the vertices. The GHCM is in turn a generalization of the Geometric SBM (proposed by Baccelli and Sankararaman). Gaudio et al. provided a threshold below which exact recovery is information-theoretically impossible. Above the threshold, they provided a linear-time algorithm that succeeds in exact recovery under a certain ``distinctness-of-distributions'' assumption, which they conjectured to be unnecessary. In this paper, we partially resolve the conjecture by showing that the threshold is indeed tight for the two-community GHCM. We provide a two-phase, linear-time algorithm that explores the spatial graph in a data-driven manner in Phase I to yield an almost exact labeling, which is refined to achieve exact recovery in Phase II. Our results extend achievability to geometric formulations of well-known inference problems, such as the planted dense subgraph problem and submatrix localization, in which the distinctness-of-distributions assumption does not hold.
Souvik Dhara, Julia Gaudio, Elchanan Mossel, Colin Sandon
Spectral algorithms are some of the main tools in optimization and inference problems on graphs. Typically, the graph is encoded as a matrix and eigenvectors and eigenvalues of the matrix are then used to solve the given graph problem. Spectral algorithms have been successfully used for graph partitioning, hidden clique recovery and graph coloring. In this paper, we study the power of spectral algorithms using two matrices in a graph partitioning problem. We use two different matrices resulting from two different encodings of the same graph and then combine the spectral information coming from these two matrices. We analyze a two-matrix spectral algorithm for the problem of identifying latent community structure in large random graphs. In particular, we consider the problem of recovering community assignments exactly in the censored stochastic block model, where each edge status is revealed independently with some probability. We show that spectral algorithms based on two matrices are optimal and succeed in recovering communities up to the information theoretic threshold. Further, we show that for most choices of the parameters, any spectral algorithm based on one matrix is suboptimal. The latter observation is in contrast to our prior works (2022a, 2022b) which showed that for the symmetric Stochastic Block Model and the Planted Dense Subgraph problem, a spectral algorithm based on one matrix achieves the information theoretic threshold. We additionally provide more general geometric conditions for the (sub)-optimality of spectral algorithms.
Julia Gaudio, Nirmit Joshi
We study the problem of exact community recovery in general, two-community block models, in the presence of node-attributed $side$ $information$. We allow for a very general side information channel for node attributes, and for pairwise (edge) observations, consider both Bernoulli and Gaussian matrix models, capturing the Stochastic Block Model, Submatrix Localization, and $\mathbb{Z}_2$-Synchronization as special cases. A recent work of Dreveton et al. 2024 characterized the information-theoretic limit of a very general exact recovery problem with side information. In this paper, we show algorithmic achievability in the above important cases by designing a simple but optimal spectral algorithm that incorporates side information (when present) along with the eigenvectors of the pairwise observation matrix. Using the powerful tool of entrywise eigenvector analysis of Abbe et al. 2020, we show that our spectral algorithm can mimic the so called $genie$-$aided$ $estimators$, where the $i^{\mathrm{th}}$ genie-aided estimator optimally computes the estimate of the $i^{\mathrm{th}}$ label, when all remaining labels are revealed by a genie. This perspective provides a unified understanding of the optimality of spectral algorithms for various exact recovery problems in a recent line of work.
Souvik Dhara, Julia Gaudio, Elchanan Mossel, Colin Sandon
Community detection is the problem of identifying community structure in graphs. Often the graph is modeled as a sample from the Stochastic Block Model, in which each vertex belongs to a community. The probability that two vertices are connected by an edge depends on the communities of those vertices. In this paper, we consider a model of {\em censored} community detection with two communities, where most of the data is missing as the status of only a small fraction of the potential edges is revealed. In this model, vertices in the same community are connected with probability $p$ while vertices in opposite communities are connected with probability $q$. The connectivity status of a given pair of vertices $\{u,v\}$ is revealed with probability $α$, independently across all pairs, where $α= \frac{t \log(n)}{n}$. We establish the information-theoretic threshold $t_c(p,q)$, such that no algorithm succeeds in recovering the communities exactly when $t < t_c(p,q)$. We show that when $t > t_c(p,q)$, a simple spectral algorithm based on a weighted, signed adjacency matrix succeeds in recovering the communities exactly. While spectral algorithms are shown to have near-optimal performance in the symmetric case, we show that they may fail in the asymmetric case where the connection probabilities inside the two communities are allowed to be different. In particular, we show the existence of a parameter regime where a simple two-phase algorithm succeeds but any algorithm based on the top two eigenvectors of the weighted, signed adjacency matrix fails.
Julia Gaudio, Elchanan Mossel
Graph shotgun assembly refers to the problem of reconstructing a graph from a collection of local neighborhoods. In this paper, we consider shotgun assembly of \ER random graphs $G(n, p_n)$, where $p_n = n^{-α}$ for $0 < α< 1$. We consider both reconstruction up to isomorphism as well as exact reconstruction (recovering the vertex labels as well as the structure). We show that given the collection of distance-$1$ neighborhoods, $G$ is exactly reconstructable for $0 < α< \frac{1}{3}$, but not reconstructable for $\frac{1}{2} < α< 1$. Given the collection of distance-$2$ neighborhoods, $G$ is exactly reconstructable for $α\in \left(0, \frac{1}{2}\right) \cup \left(\frac{1}{2}, \frac{3}{5}\right)$, but not reconstructable for $\frac{3}{4} < α< 1$.
Julia Gaudio, Nirmit Joshi
Community detection is a fundamental problem in network science. In this paper, we consider community detection in hypergraphs drawn from the $hypergraph$ $stochastic$ $block$ $model$ (HSBM), with a focus on exact community recovery. We study the performance of polynomial-time algorithms which operate on the $similarity$ $matrix$ $W$, where $W_{ij}$ reports the number of hyperedges containing both $i$ and $j$. Under this information model, while the precise information-theoretic limit is unknown, Kim, Bandeira, and Goemans derived a sharp threshold up to which the natural min-bisection estimator on $W$ succeeds. As min-bisection is NP-hard in the worst case, they additionally proposed a semidefinite programming (SDP) relaxation and conjectured that it achieves the same recovery threshold as the min-bisection estimator. In this paper, we confirm this conjecture. We also design a simple and highly efficient spectral algorithm with nearly linear runtime and show that it achieves the min-bisection threshold. Moreover, the spectral algorithm also succeeds in denser regimes and is considerably more efficient than previous approaches, establishing it as the method of choice. Our analysis of the spectral algorithm crucially relies on strong $entrywise$ bounds on the eigenvectors of $W$. Our bounds are inspired by the work of Abbe, Fan, Wang, and Zhong, who developed entrywise bounds for eigenvectors of symmetric matrices with independent entries. Despite the complex dependency structure in similarity matrices, we prove similar entrywise guarantees.
Julia Gaudio, Charlie Guan, Xiaochun Niu, Ermin Wei
In this paper, we propose a family of label recovery problems on weighted Euclidean random graphs. The vertices of a graph are embedded in $\mathbb{R}^d$ according to a Poisson point process, and are assigned to a discrete community label. Our goal is to infer the vertex labels, given edge weights whose distributions depend on the vertex labels as well as their geometric positions. Our general model provides a geometric extension of popular graph and matrix problems, including submatrix localization and $\mathbb{Z}_2$-synchronization, and includes the Geometric Stochastic Block Model (proposed by Sankararaman and Baccelli) as a special case. We study the fundamental limits of exact recovery of the vertex labels. Under a mild distinctness of distributions assumption, we determine the information-theoretic threshold for exact label recovery, in terms of a Chernoff-Hellinger divergence criterion. Impossibility of recovery below the threshold is proven by a unified analysis using a Cramér lower bound. Achievability above the threshold is proven via an efficient two-phase algorithm, where the first phase computes an almost-exact labeling through a local propagation scheme, while the second phase refines the labels. The information-theoretic threshold is dictated by the performance of the so-called genie estimator, which decodes the label of a single vertex given all the other labels. This shows that our proposed models exhibit the local-to-global amplification phenomenon.
David Gamarnik, Julia Gaudio
In a multi-index model with $k$ index vectors, the input variables are transformed by taking inner products with the index vectors. A transfer function $f: \mathbb{R}^k \to \mathbb{R}$ is applied to these inner products to generate the output. Thus, multi-index models are a generalization of linear models. In this paper, we consider monotone multi-index models. Namely, the transfer function is assumed to be coordinate-wise monotone. The monotone multi-index model therefore generalizes both linear regression and isotonic regression, which is the estimation of a coordinate-wise monotone function. We consider the case of nonnegative index vectors. We provide an algorithm based on integer programming for the estimation of monotone multi-index models, and provide guarantees on the $L_2$ loss of the estimated function relative to the ground truth.
Julia Gaudio, Yury Polyanskiy
This paper introduces the Attracting Random Walks model, which describes the dynamics of a system of particles on a graph with $n$ vertices. At each step, a single particle moves to an adjacent vertex (or stays at the current one) with probability proportional to the exponent of the number of other particles at a vertex. From an applied standpoint, the model captures the rich get richer phenomenon. We show that the Markov chain exhibits a phase transition in mixing time, as the parameter governing the attraction is varied. Namely, mixing time is $O(n\log n)$ when the temperature is sufficiently high and $\exp(Ω(n))$ when temperature is sufficiently low. When $\mathcal{G}$ is the complete graph, the model is a projection of the Potts model, whose mixing properties and the critical temperature have been known previously. However, for any other graph our model is non-reversible and does not seem to admit a simple Gibbsian description of a stationary distribution. Notably, we demonstrate existence of the dynamic phase transition without decomposing the stationary distribution into phases.
Julia Gaudio, Saurabh Amin, Patrick Jaillet
In this brief paper we find computable exponential convergence rates for a large class of stochastically ordered Markov processes. We extend the result of Lund, Meyn, and Tweedie (1996), who found exponential convergence rates for stochastically ordered Markov processes starting from a fixed initial state, by allowing for a random initial condition that is also stochastically ordered. Our bounds are formulated in terms of moment-generating functions of hitting times. To illustrate our result, we find an explicit exponential convergence rate for an M/M/1 queue beginning in equilibrium and then experiencing a change in its arrival or departure rates, a setting which has not been studied to our knowledge.
Julia Gaudio, Miklos Z. Racz, Anirudh Sridhar
We consider the problem of learning latent community structure from multiple correlated networks. We study edge-correlated stochastic block models with two balanced communities, focusing on the regime where the average degree is logarithmic in the number of vertices. Our main result derives the precise information-theoretic threshold for exact community recovery using multiple correlated graphs. This threshold captures the interplay between the community recovery and graph matching tasks. In particular, we uncover and characterize a region of the parameter space where exact community recovery is possible using multiple correlated graphs, even though (1) this is information-theoretically impossible using a single graph and (2) exact graph matching is also information-theoretically impossible. In this regime, we develop a novel algorithm that carefully synthesizes algorithms from the community recovery and graph matching literatures.
Souvik Dhara, Julia Gaudio, Elchanan Mossel, Colin Sandon
Spectral algorithms are an important building block in machine learning and graph algorithms. We are interested in studying when such algorithms can be applied directly to provide optimal solutions to inference tasks. Previous works by Abbe, Fan, Wang and Zhong (2020) and by Dhara, Gaudio, Mossel and Sandon (2022) showed the optimality for community detection in the Stochastic Block Model (SBM), as well as in a censored variant of the SBM. Here we show that this optimality is somewhat universal as it carries over to other planted substructures such as the planted dense subgraph problem and submatrix localization problem, as well as to a censored version of the planted dense subgraph problem.
Julia Gaudio, Patrick Jaillet
Let $X_1, X_2, \dots, X_n$ be independent uniform random variables on $[0,1]^2$. Let $L(X_1, \dots, X_n)$ be the length of the shortest Traveling Salesman tour through these points. It is known that there exists a constant $β$ such that $$\lim_{n \to \infty} \frac{L(X_1, \dots, X_n)}{\sqrt{n}} = β$$ almost surely (Beardwood 1959). The original analysis in (Beardwood 1959) showed that $β\geq 0.625$. Building upon an approach proposed in (Steinerberger 2015), we improve the lower bound to $β\geq 0.6277$.
Julia Gaudio, Colin Sandon, Jiaming Xu, Dana Yang
In this paper, we study the problem of finding a collection of planted cycles in an \ER random graph $G \sim \mathcal{G}(n, λ/n)$, in analogy to the famous Planted Clique Problem. When the cycles are planted on a uniformly random subset of $δn$ vertices, we show that almost-exact recovery (that is, recovering all but a vanishing fraction of planted-cycle edges as $n \to \infty$) is information-theoretically possible if $λ< \frac{1}{(\sqrt{2 δ} + \sqrt{1-δ})^2}$ and impossible if $λ> \frac{1}{(\sqrt{2 δ} + \sqrt{1-δ})^2}$. Moreover, despite the worst-case computational hardness of finding long cycles, we design a polynomial-time algorithm that attains almost exact recovery when $λ< \frac{1}{(\sqrt{2 δ} + \sqrt{1-δ})^2}$. This stands in stark contrast to the Planted Clique Problem, where a significant computational-statistical gap is widely conjectured.