Andrea Montanari
Let ${\boldsymbol A}\in{\mathbb R}^{n\times n}$ be a symmetric random matrix with independent and identically distributed Gaussian entries above the diagonal. We consider the problem of maximizing $\langle{\boldsymbol σ},{\boldsymbol A}{\boldsymbol σ}\rangle$ over binary vectors ${\boldsymbol σ}\in\{+1,-1\}^n$. In the language of statistical physics, this amounts to finding the ground state of the Sherrington-Kirkpatrick model of spin glasses. The asymptotic value of this optimization problem was characterized by Parisi via a celebrated variational principle, subsequently proved by Talagrand. We give an algorithm that, for any $\varepsilon>0$, outputs ${\boldsymbol σ}_*\in\{-1,+1\}^n$ such that $\langle{\boldsymbol σ}_*,{\boldsymbol A}{\boldsymbol σ}_*\rangle$ is at least $(1-\varepsilon)$ of the optimum value, with probability converging to one as $n\to\infty$. The algorithm's time complexity is $C(\varepsilon)\, n^2$. It is a message-passing algorithm, but the specific structure of its update rules is new. As a side result, we prove that, at (low) non-zero temperature, the algorithm constructs approximate solutions of the Thouless-Anderson-Palmer equations.
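A minimal numerical companion (ours, not the paper's message-passing algorithm): it rounds the leading eigenvector of ${\boldsymbol A}$ to $\pm 1$, then performs greedy single-spin flips, and compares the value reached with the Parisi prediction $2P_\star n^{3/2}$, $P_\star\approx 0.7632$. This simple baseline typically falls short of the Parisi value that the paper's algorithm provably approaches.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
G = rng.normal(size=(n, n))
A = (G + G.T) / np.sqrt(2)      # symmetric, i.i.d. N(0,1) entries above the diagonal

# Spectral initialization: round the leading eigenvector to +-1.
sigma = np.sign(np.linalg.eigh(A)[1][:, -1])
sigma[sigma == 0] = 1.0

# Greedy single-spin flips: flipping sigma_i changes <s, A s> by
# -4 sigma_i (A sigma)_i + 4 A_ii; take the best flip while its gain is positive.
field = A @ sigma
for _ in range(10 * n):
    gains = -4.0 * sigma * field + 4.0 * np.diag(A)
    i = int(np.argmax(gains))
    if gains[i] <= 0:
        break
    field -= 2.0 * sigma[i] * A[:, i]   # update A @ sigma after flipping spin i
    sigma[i] = -sigma[i]

print("achieved:", sigma @ A @ sigma / n**1.5, " Parisi prediction: ~1.526")
```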
Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari
We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layer neural networks around a random initialization: the random features model of Rahimi-Recht (RF) and the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n=\infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime in which $N=\infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + δ} \le N\le d^{\ell+1-δ}$ for small $δ> 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples satisfies $d^{\ell + δ} \le n \le d^{\ell +1-δ}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression, and the optimal prediction error is achieved for vanishing ridge regularization.
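A toy instantiation of the approximation-limited phenomenon (target function, activation, and sizes below are our own choices): with $d\le N\le d^2$, random features ridge regression fits the degree-1 component of the target but essentially none of the degree-2 component, so the test error stays near the power of the unfit quadratic part ($\approx 2$ here).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, N = 40, 20000, 320          # d <= N <= d^2: RF should capture only degree-1

def sphere(m):                     # uniform on the sphere of radius sqrt(d)
    x = rng.normal(size=(m, d))
    return np.sqrt(d) * x / np.linalg.norm(x, axis=1, keepdims=True)

w_star = np.ones(d) / np.sqrt(d)
f_star = lambda X: (X @ w_star) + ((X @ w_star) ** 2 - 1)   # degree-1 + degree-2

X_tr, X_te = sphere(n), sphere(4000)
y_tr = f_star(X_tr) + 0.1 * rng.normal(size=n)

W = rng.normal(size=(N, d)) / np.sqrt(d)          # random first-layer weights
relu = lambda Z: np.maximum(Z, 0.0)
Phi_tr, Phi_te = relu(X_tr @ W.T), relu(X_te @ W.T)

lam = 1e-3                                        # (near-)vanishing ridge penalty
a = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(N), Phi_tr.T @ y_tr)
mse = np.mean((Phi_te @ a - f_star(X_te)) ** 2)
print("test MSE:", mse, "  power of degree-2 part:",
      np.mean(((X_te @ w_star) ** 2 - 1) ** 2))
```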
Andrea Montanari, Daniel Reichman, Ofer Zeitouni
We consider the following detection problem: given a realization of a symmetric matrix ${\mathbf{X}}$ of dimension $n$, distinguish between the hypothesis that all upper triangular variables are i.i.d. Gaussian variables with mean 0 and variance $1$, and the hypothesis that ${\mathbf{X}}$ is the sum of such a matrix and an independent rank-one perturbation. This setup applies to the situation where, under the alternative, there is a planted principal submatrix ${\mathbf{B}}$ of size $L$ for which all upper triangular variables are i.i.d. Gaussian variables with mean $1$ and variance $1$, whereas all other upper triangular elements of ${\mathbf{X}}$ not in ${\mathbf{B}}$ are i.i.d. Gaussian variables with mean 0 and variance $1$. We refer to this as the `Gaussian hidden clique problem.' When $L=(1+ε)\sqrt{n}$ ($ε>0$), it is possible to solve this detection problem with probability $1-o_n(1)$ by computing the largest eigenvalue of ${\mathbf{X}}$. We prove that this condition is tight in the following sense: when $L<(1-ε)\sqrt{n}$, no algorithm that examines only the eigenvalues of ${\mathbf{X}}$ can detect the existence of a hidden Gaussian clique with error probability vanishing as $n\to\infty$. We prove this result as an immediate consequence of a more general result on rank-one perturbations of $k$-dimensional Gaussian tensors. In this context we establish a lower bound on the critical signal-to-noise ratio below which a rank-one signal cannot be detected.
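A simulation sketch of the spectral test (parameter values are ours): under the alternative with $L=(1+\varepsilon)\sqrt{n}$, the planted block pushes the top eigenvalue strictly above the bulk edge $2\sqrt{n}$, while under the null it sits at the edge.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 2000, 0.5
L = int((1 + eps) * np.sqrt(n))

def goe(n):
    G = rng.normal(size=(n, n))
    return (G + G.T) / np.sqrt(2)      # i.i.d. N(0,1) above the diagonal

X0 = goe(n)                            # null hypothesis
X1 = goe(n)                            # alternative: planted L x L block of mean 1
idx = rng.choice(n, size=L, replace=False)
X1[np.ix_(idx, idx)] += 1.0

lam_max = lambda X: np.linalg.eigvalsh(X)[-1]
thresh = 2 * np.sqrt(n) + 1.0          # bulk edge 2*sqrt(n) plus a small margin
print("null:", lam_max(X0), " planted:", lam_max(X1), " threshold:", thresh)
```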
Hamid Javadi, Andrea Montanari
Given a collection of data points, non-negative matrix factorization (NMF) suggests expressing them as convex combinations of a small set of `archetypes' with non-negative entries. This decomposition is unique only if the true archetypes are non-negative and sufficiently sparse (or the weights are sufficiently sparse), a regime that is captured by the separability condition and its generalizations. In this paper, we study an approach to NMF that can be traced back to the work of Cutler and Breiman (1994) and does not require the data to be separable, while providing a generally unique decomposition. We optimize the trade-off between two objectives: we minimize the distance of the data points from the convex envelope of the archetypes (which can be interpreted as an empirical risk), while minimizing the distance of the archetypes from the convex envelope of the data (which can be interpreted as a data-dependent regularization). The archetypal analysis method of (Cutler, Breiman, 1994) is recovered as the limiting case in which the last term is given infinite weight. We introduce a `uniqueness condition' on the data which is necessary for exactly recovering the archetypes from noiseless data. We prove that, under uniqueness (plus additional regularity conditions on the geometry of the archetypes), our estimator is robust. While our approach requires solving a non-convex optimization problem, we find that standard optimization methods succeed in finding good solutions both for real and synthetic data.
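A minimal sketch of the two-term objective (the notation and the inner optimization scheme below are ours): each hull distance is computed by projected gradient over simplex-constrained weights; one can then minimize `archetype_loss` over the archetypes `H` with any gradient-based method.

```python
import numpy as np

def proj_simplex(v):
    # Euclidean projection onto the probability simplex (standard sort-based rule).
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def dist_to_hull(points, hull_pts, iters=200):
    # Squared distance of each row of `points` to conv(hull_pts), by projected gradient.
    K = hull_pts.shape[0]
    W = np.full((points.shape[0], K), 1.0 / K)
    L = np.linalg.norm(hull_pts @ hull_pts.T, 2)   # gradient Lipschitz constant
    for _ in range(iters):
        grad = (W @ hull_pts - points) @ hull_pts.T
        W = np.apply_along_axis(proj_simplex, 1, W - grad / L)
    return np.sum((W @ hull_pts - points) ** 2, axis=1)

def archetype_loss(X, H, lam):
    # empirical risk + lam * data-dependent regularization, as in the paper's trade-off
    return dist_to_hull(X, H).sum() + lam * dist_to_hull(H, X).sum()

X = np.random.default_rng(1).normal(size=(200, 2))
H = X[np.random.default_rng(2).choice(200, size=4, replace=False)]
print(archetype_loss(X, H, lam=1.0))
```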
Andrea Montanari, Ricardo Restrepo, Prasad Tetali
Random instances of Constraint Satisfaction Problems (CSP's) appear to be hard for all known algorithms when the number of constraints per variable lies in a certain interval. Contributing to the general understanding of the structure of the solution space of a CSP in the satisfiable regime, we formulate a set of natural technical conditions on a large family of (random) CSP's, and prove bounds on the three most interesting thresholds for the density of such an ensemble: namely, the satisfiability threshold, the threshold for clustering of the solution space, and the threshold for an appropriate reconstruction problem on the CSP's. The bounds become asymptotically tight as the number of degrees of freedom in each clause diverges. The families are general enough to include commonly studied problems such as random instances of Not-All-Equal-SAT, k-XOR formulae, hypergraph 2-coloring, and graph k-coloring. An important new ingredient is a condition involving the Fourier expansion of clauses, which characterizes the class of problems with a similar threshold structure.
Brice Huang, Andrea Montanari, Huy Tuan Pham
We consider the problem of algorithmically sampling from the Gibbs measure of a mixed $p$-spin spherical spin glass. We give a polynomial-time algorithm that samples from the Gibbs measure up to vanishing total variation error, for any model whose mixture satisfies $$ξ''(s) < \frac{1}{(1-s)^2}, \qquad \forall s\in [0,1).$$ This includes the pure $p$-spin glasses above a critical temperature that is within an absolute ($p$-independent) constant of the so-called shattering phase transition. Our algorithm follows the algorithmic stochastic localization approach introduced in (Alaoui, Montanari, Sellke, 2022). A key step of this approach is to estimate the mean of a sequence of tilted measures. We produce an improved estimator for this task by identifying a suitable correction to the TAP fixed point selected by approximate message passing (AMP). As a consequence, we improve the algorithm's guarantee over previous work, from normalized Wasserstein to total variation error. In particular, the new algorithm and analysis open the way to performing inference about one-dimensional projections of the measure.
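A small numerical check of the admissibility condition, assuming the pure $p$-spin normalization $ξ(s)=β^2 s^p$ (our convention): it returns the largest $β$ for which $ξ''(s)<(1-s)^{-2}$ holds on all of $[0,1)$.

```python
import numpy as np

def max_beta_pure_p(p, grid=200000):
    # xi(s) = beta^2 s^p  =>  xi''(s) = beta^2 p (p-1) s^(p-2).
    # The condition xi''(s) < 1/(1-s)^2 on [0,1) holds iff
    # beta^2 < 1 / max_s [ p (p-1) s^(p-2) (1-s)^2 ].
    s = np.linspace(0.0, 1.0, grid, endpoint=False)[1:]
    return 1.0 / np.sqrt(np.max(p * (p - 1) * s ** (p - 2) * (1 - s) ** 2))

for p in [3, 4, 6, 10]:
    print(f"p = {p:2d}: condition holds for beta < {max_beta_pure_p(p):.4f}")
```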
Jose Bento, Andrea Montanari
We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it).
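A sketch of one standard low-complexity algorithm in this context, $\ell_1$-regularized logistic regression for neighborhood selection (the model, couplings, and regularization level are our choices, and samples come from Gibbs sampling rather than being exactly i.i.d.). At weak coupling it recovers the true neighbors; the paper's point is that such methods break down as long-range correlations develop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
p, beta, n = 10, 0.6, 5000          # Ising model on a cycle of p spins

J = np.zeros((p, p))
for i in range(p):
    J[i, (i + 1) % p] = J[(i + 1) % p, i] = beta

def gibbs_samples(n, burn=200, thin=10):
    s, out = rng.choice([-1.0, 1.0], size=p), []
    for t in range(burn + n * thin):
        i = rng.integers(p)
        h = J[i] @ s                               # local field at spin i
        s[i] = 1.0 if rng.random() < 1 / (1 + np.exp(-2 * h)) else -1.0
        if t >= burn and (t - burn) % thin == 0:
            out.append(s.copy())
    return np.array(out)

S = gibbs_samples(n)
# Regress spin 0 on the others; for the cycle its neighbors are spins 1 and p-1
# (columns 0 and p-2 after np.delete removes column 0).
clf = LogisticRegression(penalty="l1", C=0.5, solver="liblinear")
clf.fit(np.delete(S, 0, axis=1), S[:, 0])
print("recovered neighbor columns:", np.nonzero(np.abs(clf.coef_[0]) > 0.05)[0])
```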
Andrea Montanari, Ramji Venkataramanan
Consider the problem of estimating a low-rank matrix when its entries are perturbed by Gaussian noise. If the empirical distribution of the entries of the spikes is known, optimal estimators that exploit this knowledge can substantially outperform simple spectral approaches. Recent work characterizes the asymptotic accuracy of Bayes-optimal estimators in the high-dimensional limit. In this paper we present a practical algorithm that can achieve Bayes-optimal accuracy above the spectral threshold. A bold conjecture from statistical physics posits that no polynomial-time algorithm achieves optimal error below the same threshold (unless the best estimator is trivial). Our approach uses Approximate Message Passing (AMP) in conjunction with a spectral initialization. AMP algorithms have proved successful in a variety of statistical estimation tasks, and are amenable to exact asymptotic analysis via state evolution. Unfortunately, state evolution is uninformative when the algorithm is initialized near an unstable fixed point, as often happens in low-rank matrix estimation. We develop a new analysis of AMP that allows for spectral initializations. Our main theorem is general and applies beyond matrix estimation. However, we use it to derive detailed predictions for the problem of estimating a rank-one matrix in noise. Special cases of this problem are closely related---via universality arguments---to the network community detection problem for two asymmetric communities. For general rank-one models, we show that AMP can be used to construct confidence intervals and control false discovery rate. We provide illustrations of the general methodology by considering the cases of sparse low-rank matrices and of block-constant low-rank matrices with symmetric blocks (we refer to the latter as the `Gaussian Block Model').
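A minimal sketch of AMP with spectral initialization in the special case of a symmetric rank-one model with $\pm 1$ entries (a $\mathbb{Z}_2$-synchronization-like prior). The per-iteration calibration below is deliberately crude; the paper's algorithm and analysis are far more general.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam, T = 2000, 1.5, 20

# Rank-one model: A = (lam/n) v v^T + W, with W symmetric, W_ij ~ N(0, 1/n).
v = rng.choice([-1.0, 1.0], size=n)
W = rng.normal(size=(n, n)) / np.sqrt(2 * n); W = W + W.T
A = (lam / n) * np.outer(v, v) + W

# Spectral initialization: top eigenvector of A.  Starting AMP from zero would
# leave it at an unstable fixed point; this is the regime the paper analyzes.
x = np.sqrt(n) * np.linalg.eigh(A)[1][:, -1]

f = lambda y: np.tanh(lam * y)               # posterior-mean denoiser, +-1 prior
f_prev = np.zeros(n)
for t in range(T):
    fx = f(x)
    onsager = lam * np.mean(1.0 - fx ** 2)   # Onsager correction coefficient
    x, f_prev = A @ fx - onsager * f_prev, fx

print("overlap with planted vector:", abs(f(x) @ v) / n)   # v known only in simulation
```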
Gerard Ben Arous, Song Mei, Andrea Montanari, Mihai Nica
We consider the problem of estimating a large rank-one tensor ${\boldsymbol u}^{\otimes k}\in({\mathbb R}^{n})^{\otimes k}$, $k\ge 3$ in Gaussian noise. Earlier work characterized a critical signal-to-noise ratio $λ_{Bayes}= O(1)$ above which an ideal estimator achieves strictly positive correlation with the unknown vector of interest. Remarkably, no polynomial-time algorithm is known that achieves this goal unless $λ\ge C n^{(k-2)/4}$, and even powerful semidefinite programming relaxations appear to fail for $1\ll λ\ll n^{(k-2)/4}$. In order to elucidate this behavior, we consider the maximum likelihood estimator, which requires maximizing a degree-$k$ homogeneous polynomial over the unit sphere in $n$ dimensions. We compute the expected number of critical points and local maxima of this objective function and show that it is exponential in the dimension $n$, and give exact formulas for the exponential growth rate. We show that (for $λ$ larger than a constant) critical points are either very close to the unknown vector ${\boldsymbol u}$, or are confined in a band of width $Θ(λ^{-1/(k-1)})$ around the maximum circle that is orthogonal to ${\boldsymbol u}$. For local maxima, this band shrinks to be of size $Θ(λ^{-1/(k-2)})$. These `uninformative' local maxima are likely to cause the failure of optimization algorithms.
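A quick numerical illustration of the landscape phenomenon (the normalization and parameter values are ours, and tensor power iteration is only a stand-in for generic local optimization): from a random initialization the iteration gets stuck near the uninformative band, while a warm start converges to the planted vector.

```python
import numpy as np

rng = np.random.default_rng(9)
n, lam = 50, 5.0                                   # k = 3 tensor

u = rng.normal(size=n); u /= np.linalg.norm(u)
G = rng.normal(size=(n, n, n))                     # noise (symmetrization omitted)
T = lam * np.einsum('i,j,k->ijk', u, u, u) + G / np.sqrt(n)

def power_iteration(v, iters=500):
    # Local ascent for the ML objective <T, v x v x v> over the unit sphere.
    v = v / np.linalg.norm(v)
    for _ in range(iters):
        w = np.einsum('ijk,j,k->i', T, v, v)
        v = w / np.linalg.norm(w)
    return v

v_rand = power_iteration(rng.normal(size=n))             # cold (random) start
v_warm = power_iteration(u + 0.1 * rng.normal(size=n))   # warm start near u
print("overlap, random init:", abs(v_rand @ u))   # ~ n^{-1/2}: uninformative maximum
print("overlap, warm init  :", abs(v_warm @ u))   # ~ 1: informative maximum
```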
Andrea Montanari, Devavrat Shah
We present a deterministic approximation algorithm to compute the logarithm of the number of `good' truth assignments for a random k-satisfiability (k-SAT) formula in polynomial time (by `good' we mean assignments that violate only a small fraction of clauses). The relative error is bounded above by an arbitrarily small constant $\varepsilon$ with high probability, as long as the clause density (ratio of clauses to variables) satisfies $\alpha<\alpha_{u}(k) = 2k^{-1}\log k\,(1+o(1))$. The algorithm is based on the computation of marginal distributions via belief propagation and the use of an interpolation procedure. This scheme substitutes the traditional one, based on approximation of marginal probabilities via MCMC in conjunction with self-reduction, which is not easy to extend to the present problem. To establish our results, we derive $2k^{-1}\log k\,(1+o(1))$ as the threshold for uniqueness of the Gibbs distribution on satisfying assignments of random infinite tree k-SAT formulae, a result of interest in its own right.
Andrea Montanari, Rudiger Urbanke
These are the notes for a set of lectures delivered by the two authors at the Les Houches Summer School on `Complex Systems' in July 2006. They provide an introduction to the basic concepts in modern (probabilistic) coding theory, highlighting connections with statistical mechanics. We also stress common concepts with other disciplines dealing with similar problems that can be generically referred to as `large graphical models'. While most of the lectures are devoted to the classical channel coding problem over simple memoryless channels, we present a discussion of more complex channel models. We conclude with an overview of the main open challenges in the field.
Andrea Montanari, Ruediger Urbanke
We consider communication over a noisy network under randomized linear network coding. Possible error mechanisms include node or link failures, Byzantine behavior of nodes, and an over-estimate of the network min-cut. Building on the work of Koetter and Kschischang, we introduce a probabilistic model for errors. We compute the capacity of this channel and we define an error-correction scheme based on random sparse graphs and a low-complexity decoding algorithm. By optimizing over the code degree profile, we show that this construction achieves the channel capacity, with complexity that is jointly quadratic in the number of coded information bits and sublogarithmic in the error probability.
Andrea Montanari
Turbo codes are a very efficient method for communicating reliably through a noisy channel. A complete theoretical understanding of their effectiveness is, however, still lacking. In [1] they are mapped onto a class of disordered spin models. The analytical calculations concerning these models are reported here. We prove the existence of a no-error phase and compute its local stability threshold. As a byproduct, we gain some insight into the dynamics of the decoding algorithm.
Cyril Measson, Andrea Montanari, Tom Richardson, Rudiger Urbanke
We consider communication over memoryless channels using low-density parity-check code ensembles above the iterative (belief propagation) threshold. What is the computational complexity of decoding (i.e., of reconstructing all the typical input codewords for a given channel output) in this regime? We define an algorithm accomplishing this task and analyze its typical performance. The behavior of the new algorithm can be expressed in purely information-theoretical terms. Its analysis provides an alternative proof of the area theorem for the binary erasure channel. Finally, we explain how the area theorem is generalized to arbitrary memoryless channels. We note that the recently discovered relation between mutual information and minimum mean-square error is an instance of the area theorem in the setting of Gaussian channels.
Cyril Measson, Andrea Montanari, Ruediger Urbanke
There is a fundamental relationship between belief propagation and maximum a posteriori decoding. A decoding algorithm, which we call the Maxwell decoder, is introduced and provides a constructive description of this relationship. Both the algorithm itself and the analysis of the new decoder are reminiscent of the Maxwell construction in thermodynamics. This paper investigates in detail the case of transmission over the binary erasure channel, while the extension to general binary memoryless channels is discussed in a companion paper.
Andrea Montanari, Eliran Subag
We consider the problem of efficiently solving a system of $n$ non-linear equations in ${\mathbb R}^d$. Addressing Smale's 17th problem stated in 1998, we consider a setting whereby the $n$ equations are random homogeneous polynomials of arbitrary degrees. In the complex case and for $n= d-1$, Beltrán and Pardo proved the existence of an efficient randomized algorithm and Lairez recently showed it can be de-randomized to produce a deterministic efficient algorithm. Here we consider the real setting, to which previously developed methods do not apply. We describe a polynomial time algorithm that finds solutions (with high probability) for $n= d -O(\sqrt{d\log d})$ if the maximal degree is bounded by $d^2$ and for $n=d-1$ if the maximal degree is larger than $d^2$.
Andrea Montanari, Basil Saeed
Consider supervised learning from i.i.d. samples $\{({\boldsymbol x}_i,y_i)\}_{i\le n}$ where ${\boldsymbol x}_i \in\mathbb{R}^p$ are feature vectors and $y_i \in \mathbb{R}$ are labels. We study empirical risk minimization over a class of functions that are parameterized by $\mathsf{k} = O(1)$ vectors ${\boldsymbol θ}_1,\dots,{\boldsymbol θ}_{\mathsf k} \in \mathbb{R}^p$, and prove universality results both for the training and test error. Namely, under the proportional asymptotics $n,p\to\infty$, with $n/p = Θ(1)$, we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed, to leading order, under a simpler model in which the feature vectors ${\boldsymbol x}_i$ are replaced by Gaussian vectors ${\boldsymbol g}_i$ with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors ${\boldsymbol x}_i$ with independent entries. Our results do not make any of these assumptions. Our assumptions are general enough to include feature vectors ${\boldsymbol x}_i$ that are produced by randomized featurization maps. In particular, we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).
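A toy check of the universality statement (our own, much simpler, instantiation: ridge regression rather than a general non-convex loss): the training error with ReLU random features is compared with the training error under a Gaussian model matching their first two moments; the two agree up to finite-size fluctuations.

```python
import numpy as np

rng = np.random.default_rng(4)
d, p, n, lam = 60, 300, 450, 0.1      # proportional regime: n/p = 1.5

W = rng.normal(size=(p, d)) / np.sqrt(d)
rf = lambda m: np.maximum(rng.normal(size=(m, d)) @ W.T, 0.0)   # random features

# Gaussian equivalent: match the (estimated) mean and covariance of the features.
F = rf(20000)
mu, Sigma = F.mean(axis=0), np.cov(F, rowvar=False)
gauss = lambda m: rng.multivariate_normal(mu, Sigma, size=m)

theta0 = rng.normal(size=p) / np.sqrt(p)
def train_error(sampler):
    X = sampler(n)
    y = np.sign(X @ theta0 + 0.5 * rng.normal(size=n))
    th = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
    return np.mean((y - X @ th) ** 2)

print("RF features :", train_error(rf))
print("Gaussian eq.:", train_error(gauss))
```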
Andrea Montanari, Yiqiao Zhong, Kangjie Zhou
In the negative perceptron problem we are given $n$ data points $({\boldsymbol x}_i,y_i)$, where ${\boldsymbol x}_i$ is a $d$-dimensional vector and $y_i\in\{+1,-1\}$ is a binary label. The data are not linearly separable and hence we content ourselves with finding a linear classifier with the largest possible \emph{negative} margin. In other words, we want to find a unit norm vector ${\boldsymbol θ}$ that maximizes $\min_{i\le n}y_i\langle {\boldsymbol θ},{\boldsymbol x}_i\rangle$. This is a non-convex optimization problem (it is equivalent to finding a maximum norm vector in a polytope), and we study its typical properties under two random models for the data. We consider the proportional asymptotics in which $n,d\to \infty$ with $n/d\toδ$, and prove upper and lower bounds on the maximum margin $κ_{\text{s}}(δ)$ or -- equivalently -- on its inverse function $δ_{\text{s}}(κ)$. In other words, $δ_{\text{s}}(κ)$ is the overparametrization threshold: for $n/d\le δ_{\text{s}}(κ)-\varepsilon$ a classifier achieving vanishing training error exists with high probability, while for $n/d\ge δ_{\text{s}}(κ)+\varepsilon$ it does not. Our bounds on $δ_{\text{s}}(κ)$ match to the leading order as $κ\to -\infty$. We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold $δ_{\text{lin}}(κ)$. We observe a gap between the interpolation threshold $δ_{\text{s}}(κ)$ and the linear programming threshold $δ_{\text{lin}}(κ)$, raising the question of the behavior of other algorithms.
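A sketch of a linear programming heuristic for this problem (the normalization $\langle{\boldsymbol θ}_0,{\boldsymbol θ}\rangle=1$ below is our way of excluding the degenerate solution ${\boldsymbol θ}={\boldsymbol 0}$, and need not coincide with the paper's exact formulation): maximize $κ$ subject to $y_i\langle{\boldsymbol x}_i,{\boldsymbol θ}\rangle\ge κ$, then rescale ${\boldsymbol θ}$ to unit norm and report the achieved (negative) margin.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
d, delta = 100, 3.0                    # n/d = 3 > 2: data not linearly separable
n = int(delta * d)
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Variables (theta, kappa): maximize kappa s.t. -y_i x_i^T theta + kappa <= 0
# and <theta0, theta> = 1 for a fixed random unit vector theta0.
theta0 = rng.normal(size=d); theta0 /= np.linalg.norm(theta0)
c = np.zeros(d + 1); c[-1] = -1.0      # linprog minimizes, so use -kappa
A_ub = np.hstack([-y[:, None] * X, np.ones((n, 1))])
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
              A_eq=np.append(theta0, 0.0)[None, :], b_eq=[1.0],
              bounds=[(None, None)] * (d + 1), method="highs")
theta = res.x[:d] / np.linalg.norm(res.x[:d])
print("normalized margin:", np.min(y * (X @ theta)))   # negative for delta > 2
```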
Andrea Montanari, Viet Vu
Denoising diffusions sample from a probability distribution $μ$ in $\mathbb{R}^d$ by constructing a stochastic process $({\hat{\boldsymbol x}}_t:t\ge 0)$ in $\mathbb{R}^d$ such that ${\hat{\boldsymbol x}}_0$ is easy to sample, but the distribution of $\hat{\boldsymbol x}_T$ at large $T$ approximates $μ$. The drift ${\boldsymbol m}:\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^d$ of this diffusion process is learned by minimizing a score-matching objective. Is every probability distribution $μ$, for which sampling is tractable, also amenable to sampling via diffusions? We provide evidence to the contrary by studying a probability distribution $μ$ for which sampling is easy, but the drift of the diffusion process is intractable -- under a popular conjecture on information-computation gaps in statistical estimation. We show that there exist drifts that are superpolynomially close to the optimum value (among polynomial-time drifts) and yet yield samples whose distribution is very far from the target one.
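A toy example in the opposite (tractable) situation, to fix ideas: for $μ=\frac{1}{2}δ_{+1}+\frac{1}{2}δ_{-1}$ and the observation process $y_t=tx+B_t$, the drift ${\mathbb E}[x\,|\,y_t=y]=\tanh(y)$ is available in closed form, and simulating $\mathrm{d}y_t=\tanh(y_t)\,\mathrm{d}t+\mathrm{d}B_t$ samples from $μ$. The paper's point is precisely that for some tractable $μ$ no efficiently computable drift achieves this.

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_via_diffusion(T=8.0, dt=1e-2):
    # Euler discretization of dy_t = E[x | y_t] dt + dB_t, started at y_0 = 0.
    y = 0.0
    for _ in range(int(T / dt)):
        y += np.tanh(y) * dt + np.sqrt(dt) * rng.normal()
    return np.sign(y)        # y_t / t -> x, so sign(y_T) recovers the sample

draws = np.array([sample_via_diffusion() for _ in range(400)])
print("fraction of +1 samples:", np.mean(draws == 1.0))   # should be close to 1/2
```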
Andrea Montanari, Basil Saeed
Given data $\{({\boldsymbol x}_i,y_i): i\le n\}$, with ${\boldsymbol x}_i$ standard $d$-dimensional Gaussian feature vectors, and $y_i\in{\mathbb R}$ response variables, we study the general problem of learning a model parametrized by ${\boldsymbol θ}\in{\mathbb R}^d$, by minimizing a loss function that depends on ${\boldsymbol θ}$ via the one-dimensional projections ${\boldsymbol θ}^{\sf T}{\boldsymbol x}_i$. While previous work mostly dealt with convex losses, our approach allows for general (non-convex) losses, hence covering classical, yet poorly understood, examples such as the perceptron and non-convex robust regression. We use the Kac-Rice formula to control the asymptotics of the expected number of local minima of the empirical risk, under the proportional asymptotics $n,d\to\infty$, $n/d\toα>1$. Specifically, we prove a finite-dimensional variational formula for the exponential growth rate of the expected number of local minima. Further, we provide sufficient conditions under which the exponential growth rate vanishes and all empirical risk minimizers have the same asymptotic properties (in fact, we expect the minimizer to be unique in these circumstances). We refer to this phenomenon as `rate trivialization.' If the population risk has a unique minimizer, our sufficient condition for rate trivialization is typically verified when the samples/parameters ratio $α$ is larger than a suitable constant $α_{\star}$. Previous general results of this type required $n\ge Cd \log d$. We illustrate our results in the case of non-convex robust regression. Based on heuristic arguments and numerical simulations, we present a conjecture for the exact location of the trivialization phase transition $α_{\text{tr}}$.
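A crude numerical companion (entirely our own construction, in a toy dimension where the count is only suggestive): estimate the number of distinct local minima of a non-convex robust regression risk by random restarts, and watch it shrink as the samples/parameters ratio $α$ grows, consistent with trivialization.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
d = 8
theta_star = np.concatenate([[1.0], np.zeros(d - 1)])

def make_risk(alpha):
    n = int(alpha * d)
    X = rng.normal(size=(n, d))
    y = X @ theta_star + 0.3 * rng.normal(size=n)
    # Bounded non-convex loss rho(r) = r^2 / (1 + r^2) (Geman-McClure type).
    return lambda th: np.mean((y - X @ th) ** 2 / (1.0 + (y - X @ th) ** 2))

def count_minima(risk, restarts=200, tol=1e-2):
    found = []
    for _ in range(restarts):
        res = minimize(risk, 2.0 * rng.normal(size=d), method="BFGS")
        if res.success and not any(np.linalg.norm(res.x - z) < tol for z in found):
            found.append(res.x)
    return len(found)

for alpha in [2.0, 10.0, 50.0]:
    print(f"alpha = {alpha:5.1f}: distinct local minima = {count_minima(make_risk(alpha))}")
```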