Hamed Hassani, Shrinivas Kudekar, Or Ordentlich, Yury Polyanskiy, Rüdiger Urbanke
Consider a binary linear code of length $N$ and minimum distance $d_{\text{min}}$, transmission over the binary erasure channel with parameter $0 < \epsilon < 1$ or the binary symmetric channel with parameter $0 < \epsilon < \frac12$, and block-MAP decoding. It was shown by Tillich and Zemor that in this case the error probability of the block-MAP decoder transitions "quickly" from $\delta$ to $1-\delta$ for any $\delta>0$ if the minimum distance is large. In particular, the width of the transition is of order $O(1/\sqrt{d_{\text{min}}})$. We strengthen this result by showing that, under suitable conditions on the weight distribution of the code, the transition width can be as small as $\Theta(1/N^{\frac12-\kappa})$ for any $\kappa>0$, even if the minimum distance of the code is not linear in $N$. This condition applies, e.g., to Reed-Muller codes. Since $\Theta(1/N^{\frac12})$ is the smallest transition width possible for any code, we speak of "almost" optimal scaling. We emphasize that the width of the transition says nothing about its location; hence this result has no bearing on whether a code is capacity-achieving or not. As a second contribution, we present a new estimate on the derivative of the EXIT function, whose proof is based on the Blowing-Up Lemma.
S. Hamed Hassani, Kasra Alishahi, Rudiger Urbanke
Consider a binary-input memoryless output-symmetric channel $W$. Such a channel has a capacity, call it $I(W)$, and for any $R<I(W)$ and strictly positive constant $P_{\rm e}$ we know that we can construct a coding scheme that allows transmission at rate $R$ with an error probability not exceeding $P_{\rm e}$. Assume now that we let the rate $R$ tend to $I(W)$ and we ask how we have to "scale" the blocklength $N$ in order to keep the error probability fixed to $P_{\rm e}$. We refer to this as the "finite-length scaling" behavior. This question was addressed by Strassen as well as Polyanskiy, Poor and Verdu, and the result is that $N$ must grow at least as the square of the reciprocal of $I(W)-R$. Polar codes are optimal in the sense that they achieve capacity. In this paper, we ask to what degree they are also optimal in terms of their finite-length behavior. Our approach is based on analyzing the dynamics of the un-polarized channels. The main results of this paper can be summarized as follows. Consider the sum of Bhattacharyya parameters of sub-channels chosen (by the polar coding scheme) to transmit information. If we require this sum to be smaller than a given value $P_{\rm e}>0$, then the required block-length $N$ scales in terms of the rate $R < I(W)$ as $N \geq \frac{\alpha}{(I(W)-R)^{\underline{\mu}}}$, where $\alpha$ is a positive constant that depends on $P_{\rm e}$ and $I(W)$, and $\underline{\mu} = 3.579$. Also, we show that with the same requirement on the sum of Bhattacharyya parameters, the block-length scales in terms of the rate like $N \leq \frac{\beta}{(I(W)-R)^{\overline{\mu}}}$, where $\beta$ is a constant that depends on $P_{\rm e}$ and $I(W)$, and $\overline{\mu} = 6$.
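A minimal numeric illustration of the scaling law above (the constant $\alpha$ is a placeholder set to 1 here; the abstract only asserts it is a positive constant depending on $P_{\rm e}$ and $I(W)$):

```python
def min_blocklength(capacity: float, rate: float, mu: float, alpha: float = 1.0) -> float:
    """Lower bound on the blocklength implied by N >= alpha / (I(W) - R)^mu.
    alpha = 1 is an illustrative placeholder, not a value from the paper."""
    assert 0 < rate < capacity
    return alpha / (capacity - rate) ** mu

# Halving the gap to capacity multiplies the required blocklength by 2^mu:
n1 = min_blocklength(0.5, 0.40, mu=3.579)  # gap to capacity 0.1
n2 = min_blocklength(0.5, 0.45, mu=3.579)  # gap to capacity 0.05
ratio = n2 / n1                            # = 2 ** 3.579, roughly 12x
```

The take-away is the polynomial blow-up: with $\underline{\mu} = 3.579$, every halving of the gap $I(W)-R$ costs roughly a factor 12 in blocklength.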
S. Hamed Hassani, Rudiger Urbanke
Polar codes, invented by Arikan in 2009, are known to achieve the capacity of any binary-input memoryless output-symmetric channel. One of the few drawbacks of the original polar code construction is that it is not universal. This means that the code has to be tailored to the channel if we want to transmit close to capacity. We present two "polar-like" schemes which are capable of achieving the compound capacity of the whole class of binary-input memoryless output-symmetric channels with low complexity. Roughly speaking, for the first scheme we stack up $N$ polar blocks of length $N$ on top of each other but shift them with respect to each other so that they form a "staircase." Coding then across the columns of this staircase with a standard Reed-Solomon code, we can achieve the compound capacity using a standard successive decoder to process the rows (the polar codes) and in addition a standard Reed-Solomon erasure decoder to process the columns. Compared to standard polar codes this scheme has essentially the same complexity per bit but a block length which is larger by a factor $O(N \log_2(N)/\epsilon)$, where $\epsilon$ is the gap to capacity. For the second scheme we first show how to construct a true polar code which achieves the compound capacity for a finite number of channels. We achieve this by introducing special "polarization" steps which "align" the good indices for the various channels. We then show how to exploit the compactness of the space of binary-input memoryless output-symmetric channels to reduce the compound capacity problem for this class to a compound capacity problem for a finite set of channels. This scheme is similar in spirit to standard polar codes, but the price for universality is a considerably larger blocklength. We close with what we consider to be some interesting open problems.
S. Hamed Hassani, Nicolas Macris, Ryuhei Mori
Convolutional Low-Density Parity-Check (LDPC) ensembles have excellent performance. Their iterative decoding threshold increases with their average degree, or with the size of the coupling window in randomized constructions. In the latter case, as the window size grows, the Belief Propagation (BP) threshold attains the maximum-a-posteriori (MAP) threshold of the underlying ensemble. In this contribution we show that a similar phenomenon happens for the growth rate of coupled ensembles. Loosely speaking, we observe that as the coupling strength grows, the growth rate of the coupled ensemble comes close to the concave hull of the underlying ensemble's growth rate. For ensembles randomly coupled across a window the growth rate actually tends to the concave hull of the underlying one as the window size increases. Our observations are supported by calculations of the combinatorial growth rate as well as of the growth rate derived from the replica method. The observed concavity is a general feature of coupled mean field graphical models and is already present at the level of coupled Curie-Weiss models. There, the canonical free energy of the coupled system tends to the concave hull of the underlying one. As we explain, the behavior of the growth rate of coupled ensembles is exactly analogous.
S. Hamed Hassani, Ryuhei Mori, Toshiyuki Tanaka, Rudiger Urbanke
For a binary-input memoryless symmetric channel $W$, we consider the asymptotic behavior of the polarization process in the large block-length regime when transmission takes place over $W$. In particular, we study the asymptotics of the cumulative distribution $\mathbb{P}(Z_n \leq z)$, where $\{Z_n\}$ is the Bhattacharyya process defined from $W$, and its dependence on the rate of transmission. On the basis of this result, we characterize the asymptotic behavior, as well as its dependence on the rate, of the block error probability of polar codes using the successive cancellation decoder. This refines the original bounds by Arıkan and Telatar. Our results apply to general polar codes based on $\ell \times \ell$ kernel matrices. We also provide lower bounds on the block error probability of polar codes using the MAP decoder. The MAP lower bound and the successive cancellation upper bound coincide when $\ell=2$, but there is a gap for $\ell>2$.
S. Hamed Hassani, Kasra Alishahi, Rudiger Urbanke
We provide upper and lower bounds on the escape rate of the Bhattacharyya process corresponding to polar codes and transmission over the binary erasure channel. More precisely, we bound the exponent of the number of sub-channels whose Bhattacharyya constant falls in a fixed interval $[a,b]$. Mathematically this can be stated as bounding the limit $\lim_{n \to \infty} \frac{1}{n} \ln \mathbb{P}(Z_n \in [a,b])$, where $Z_n$ is the Bhattacharyya process. The quantity $\mathbb{P}(Z_n \in [a,b])$ represents the fraction of sub-channels that are still un-polarized at time $n$.
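For the BEC the Bhattacharyya process takes a particularly simple form, so $\mathbb{P}(Z_n \in [a,b])$ can be estimated directly by simulation. A sketch (the sample size and interval are illustrative choices):

```python
import random

def bec_fraction_unpolarized(eps: float, n: int, a: float, b: float,
                             samples: int = 100_000) -> float:
    """Monte Carlo estimate of P(Z_n in [a, b]) for the BEC Bhattacharyya
    process: Z_0 = eps and Z_{k+1} = Z_k^2 or 2*Z_k - Z_k^2, each chosen
    with probability 1/2 (these transforms are exact for the BEC)."""
    hits = 0
    for _ in range(samples):
        z = eps
        for _ in range(n):
            z = z * z if random.random() < 0.5 else 2 * z - z * z
        if a <= z <= b:
            hits += 1
    return hits / samples

random.seed(0)
# The un-polarized fraction shrinks as n grows:
f = [bec_fraction_unpolarized(0.5, n, 0.1, 0.9) for n in (4, 8, 12)]
```

Plotting $\frac{1}{n}\ln$ of such estimates against $n$ gives a numerical handle on the escape-rate exponent that the abstract's bounds constrain.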
S. Hamed Hassani, Rudiger Urbanke
We consider the asymptotic behavior of the polarization process for polar codes when the blocklength tends to infinity. In particular, we study the problem of asymptotic analysis of the cumulative distribution $\mathbb{P}(Z_n \leq z)$, where $Z_n=Z(W_n)$ is the Bhattacharyya process, and its dependence on the transmission rate $R$. We show that for a BMS channel $W$, for $R < I(W)$ we have $\lim_{n \to \infty} \mathbb{P} (Z_n \leq 2^{-2^{\frac{n}{2}+\sqrt{n} \frac{Q^{-1}(\frac{R}{I(W)})}{2} +o(\sqrt{n})}}) = R$ and for $R<1- I(W)$ we have $\lim_{n \to \infty} \mathbb{P} (Z_n \geq 1-2^{-2^{\frac{n}{2}+ \sqrt{n} \frac{Q^{-1}(\frac{R}{1-I(W)})}{2} +o(\sqrt{n})}}) = R$, where $Q(x)$ is the probability that a standard normal random variable exceeds $x$. As a result, if we denote by $\mathbb{P}_e ^{\text{SC}}(n,R)$ the probability of error using polar codes of block-length $N=2^n$ and rate $R<I(W)$ under successive cancellation decoding, then $\log(-\log(\mathbb{P}_e ^{\text{SC}}(n,R)))$ scales as $\frac{n}{2}+\sqrt{n}\frac{Q^{-1}(\frac{R}{I(W)})}{2}+ o(\sqrt{n})$. We also prove that the same result holds for the block error probability using the MAP decoder, i.e., for $\log(-\log(\mathbb{P}_e ^{\text{MAP}}(n,R)))$.
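The leading terms of the doubly-exponential error exponent above are easy to evaluate; a small sketch using the standard-library normal distribution (the $o(\sqrt{n})$ term is dropped, so this is only the asymptotic leading behavior):

```python
from statistics import NormalDist
import math

def sc_error_double_exponent(n: int, rate: float, capacity: float) -> float:
    """Leading terms of log(-log(P_e^SC)) for polar codes of length 2^n:
    n/2 + sqrt(n) * Q^{-1}(R / I(W)) / 2, ignoring the o(sqrt(n)) term.
    Q^{-1}(x) = Phi^{-1}(1 - x), the inverse standard-normal tail."""
    q_inv = NormalDist().inv_cdf(1.0 - rate / capacity)
    return n / 2 + math.sqrt(n) * q_inv / 2

# At R = I(W)/2 we have Q^{-1}(1/2) = 0, so the exponent is exactly n/2:
e = sc_error_double_exponent(16, 0.25, 0.5)  # -> 8.0
```

For rates closer to capacity, $Q^{-1}(R/I(W)) < 0$ and the second-order term eats into the $n/2$ exponent, quantifying the finite-length penalty.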
S. Hamed Hassani, Nicolas Macris, Rudiger Urbanke
We consider chains of random constraint satisfaction models that are spatially coupled across a finite window along the chain direction. We investigate their phase diagram at zero temperature using the survey propagation formalism and the interpolation method. We prove that the SAT-UNSAT phase transition threshold of an infinite chain is identical to the one of the individual standard model, and is therefore not affected by spatial coupling. We compute the survey propagation complexity using population dynamics as well as large degree approximations, and determine the survey propagation threshold. We find that a clustering phase survives coupling. However, as one increases the range of the coupling window, the survey propagation threshold increases and saturates towards the phase transition threshold. We also briefly discuss other aspects of the problem. Namely, the condensation threshold is not affected by coupling, but the dynamic threshold displays saturation towards the condensation one. All these features may provide a new avenue for obtaining better provable algorithmic lower bounds on phase transition thresholds of the individual standard model.
Ali Goli, S. Hamed Hassani, Rudiger Urbanke
We consider the problem of determining the trade-off between the rate and the block-length of polar codes for a given block error probability when we use the successive cancellation decoder. We take the sum of the Bhattacharyya parameters as a proxy for the block error probability, and show that there exists a universal parameter $\mu$ such that for any binary memoryless symmetric channel $W$ with capacity $I(W)$, reliable communication requires rates that satisfy $R < I(W) - \alpha N^{-\frac{1}{\mu}}$, where $\alpha$ is a positive constant and $N$ is the block-length. We provide lower bounds on $\mu$, namely $\mu \geq 3.553$, and we conjecture that indeed $\mu = 3.627$, the parameter for the binary erasure channel.
S. Hamed Hassani, Nicolas Macris, Rudiger Urbanke
The XOR-satisfiability (XORSAT) problem deals with a system of $n$ Boolean variables and $m$ clauses. Each clause is a linear Boolean equation (XOR) of a subset of the variables. A $K$-clause is a clause involving $K$ distinct variables. In the random $K$-XORSAT problem a formula is created by choosing $m$ $K$-clauses uniformly at random from the set of all possible clauses on $n$ variables. The set of solutions of a random formula exhibits various geometrical transitions as the ratio $\frac{m}{n}$ varies. We consider a {\em coupled} $K$-XORSAT ensemble, consisting of a chain of random XORSAT models that are spatially coupled across a finite window along the chain direction. We observe that the threshold saturation phenomenon takes place for this ensemble and we characterize various properties of the space of solutions of such coupled formulae.
S. Hamed Hassani, Nicolas Macris, Ruediger Urbanke
The excellent performance of convolutional low-density parity-check codes is the result of the spatial coupling of individual underlying codes across a window of growing size, but much smaller than the length of the individual codes. Remarkably, the belief-propagation threshold of the coupled ensemble is boosted to the maximum-a-posteriori one of the individual system. We investigate the generality of this phenomenon beyond coding theory: we couple general graphical models into a one-dimensional chain of large individual systems. For the latter we take the Curie-Weiss, random field Curie-Weiss, $K$-satisfiability, and $Q$-coloring models. We always find, based on analytical as well as numerical calculations, that the message passing thresholds of the coupled systems come very close to the static ones of the individual models. The remarkable properties of convolutional low-density parity-check codes are a manifestation of this very general phenomenon.
S. Hamed Hassani, Nicolas Macris, Ruediger Urbanke
We consider a collection of Curie-Weiss (CW) spin systems, possibly with a random field, each of which is placed along the positions of a one-dimensional chain. The CW systems are coupled together by a Kac-type interaction in the longitudinal direction of the chain and by an infinite range interaction in the direction transverse to the chain. Our motivations for studying this model come from recent findings in the theory of error correcting codes based on spatially coupled graphs. We find that, although much simpler than the codes, the model studied here already displays similar behaviors. We are interested in the van der Waals curve in a regime where the size of each Curie-Weiss model tends to infinity, and the length of the chain and range of the Kac interaction are large but finite. Below the critical temperature, and with appropriate boundary conditions, there appears a series of equilibrium states representing kink-like interfaces between the two equilibrium states of the individual system. The van der Waals curve oscillates periodically around the Maxwell plateau. These oscillations have a period inversely proportional to the chain length and an amplitude exponentially small in the range of the interaction; in other words the spinodal points of the chain model lie exponentially close to the phase transition threshold. The amplitude of the oscillations is closely related to a Peierls-Nabarro free energy barrier for the motion of the kink along the chain. Analogies to similar phenomena and their possible algorithmic significance for graphical models of interest in coding theory and theoretical computer science are pointed out.
Ramtin Pedarsani, S. Hamed Hassani, Ido Tal, Emre Telatar
We consider the problem of efficiently constructing polar codes over binary memoryless symmetric (BMS) channels. The complexity of designing polar codes via an exact evaluation of the polarized channels to find which ones are "good" appears to be exponential in the block length. In \cite{TV11}, Tal and Vardy show that if instead the evaluation is performed approximately, the construction has only linear complexity. In this paper, we follow this approach and present a framework where the algorithms of \cite{TV11} and new related algorithms can be analyzed for complexity and accuracy. We provide numerical and analytical results on the efficiency of such algorithms; in particular, we show that one can find all the "good" channels (except a vanishing fraction) with almost linear complexity in block-length (up to a polylogarithmic factor).
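The BEC is the one case where exact construction is already cheap, which makes it a useful reference point: the erasure probability of each polarized sub-channel can be tracked in closed form. A sketch (the threshold value is an illustrative choice):

```python
def bec_good_channels(eps: float, n: int, threshold: float = 1e-3):
    """Exact polar construction for the BEC: an erasure probability e splits
    into 2e - e^2 ("minus" transform) and e^2 ("plus" transform). Returns
    the indices of sub-channels whose erasure probability is below
    `threshold`. For general BMS channels the output alphabet doubles at
    every step, which is why approximate (quantized) evaluation is needed."""
    erasures = [eps]
    for _ in range(n):
        nxt = []
        for e in erasures:
            nxt.append(2 * e - e * e)  # degraded branch
            nxt.append(e * e)          # upgraded branch
        erasures = nxt
    return [i for i, e in enumerate(erasures) if e < threshold]

good = bec_good_channels(0.5, 10)  # block length N = 2^10 = 1024
# len(good) / 1024 approaches the capacity I(W) = 0.5 as n grows
```

The cost here is $O(N)$ erasure-probability updates; the algorithms analyzed in the paper recover comparable per-channel accuracy for arbitrary BMS channels at near-linear cost.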
S. Hamed Hassani, Rudiger Urbanke
Polar codes provably achieve the capacity of a wide array of channels under successive decoding. This assumes infinite precision arithmetic. Given the successive nature of the decoding algorithm, one might worry about the sensitivity of the performance to the precision of the computation. We show that even very coarsely quantized decoding algorithms lead to excellent performance. More concretely, we show that under successive decoding with an alphabet of cardinality only three, the decoder still has a threshold and this threshold is a sizable fraction of capacity. More generally, we show that if we are willing to transmit at a rate $\delta$ below capacity, then we need only $c \log(1/\delta)$ bits of precision, where $c$ is a universal constant.
Payam Delgosha, Hamed Hassani, Ramtin Pedarsani
It is well-known that machine learning models are vulnerable to small but cleverly-designed adversarial perturbations that can cause misclassification. While there has been major progress in designing attacks and defenses for various adversarial settings, many fundamental and theoretical problems are yet to be resolved. In this paper, we consider classification in the presence of $\ell_0$-bounded adversarial perturbations, a.k.a. sparse attacks. This setting is significantly different from other $\ell_p$-adversarial settings, with $p\geq 1$, as the $\ell_0$-ball is non-convex and highly non-smooth. Under the assumption that data is distributed according to the Gaussian mixture model, our goal is to characterize the optimal robust classifier and the corresponding robust classification error as well as a variety of trade-offs between robustness, accuracy, and the adversary's budget. To this end, we develop a novel classification algorithm called FilTrun that has two main modules: Filtration and Truncation. The key idea of our method is to first filter out the non-robust coordinates of the input and then apply a carefully-designed truncated inner product for classification. By analyzing the performance of FilTrun, we derive an upper bound on the optimal robust classification error. We also find a lower bound by designing a specific adversarial strategy that enables us to derive the corresponding robust classifier and its achieved error. For the case that the covariance matrix of the Gaussian mixtures is diagonal, we show that as the input's dimension gets large, the upper and lower bounds converge; i.e., we characterize the asymptotically-optimal robust classifier. Throughout, we discuss several examples that illustrate interesting behaviors, such as the existence of a phase transition in the adversary's budget that determines whether the effect of adversarial perturbation can be fully neutralized.
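A loose sketch of the two modules for a symmetric mixture with means $\pm\mu$; the filtering rule and both thresholds here are illustrative placeholders, not the paper's calibrated choices:

```python
def filtrun_sketch(x, mu, filter_thresh, trunc):
    """Illustrative sketch of Filtration + Truncation (not the paper's exact
    classifier). 1) Filtration: discard coordinates where |mu_i| is small,
    since an l_0 adversary can dominate those at little cost. 2) Truncation:
    cap each surviving coordinate's contribution to the inner product so
    that no single perturbed coordinate can decide the label."""
    score = 0.0
    for xi, mi in zip(x, mu):
        if abs(mi) < filter_thresh:                 # filtration module
            continue
        score += max(-trunc, min(trunc, xi * mi))   # truncation module
    return 1 if score >= 0 else -1

mu = [1.0, 1.0, 0.05, 1.0]
x = [0.9, 1.1, 5.0, 1.0]   # third coordinate adversarially inflated
label = filtrun_sketch(x, mu, filter_thresh=0.1, trunc=1.0)  # -> 1
```

The weak third coordinate is filtered out and the capped contributions keep the remaining perturbation bounded, so the sample is still classified with the $+\mu$ component.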
Arman Adibi, Aryan Mokhtari, Hamed Hassani
In this paper, we introduce a discrete variant of the meta-learning framework. Meta-learning aims at exploiting prior experience and data to improve performance on future tasks. By now, there exist numerous formulations for meta-learning in the continuous domain. Notably, the Model-Agnostic Meta-Learning (MAML) formulation views each task as a continuous optimization problem and based on prior data learns a suitable initialization that can be adapted to new, unseen tasks after a few simple gradient updates. Motivated by this terminology, we propose a novel meta-learning framework in the discrete domain where each task is equivalent to maximizing a set function under a cardinality constraint. Our approach aims at using prior data, i.e., previously visited tasks, to train a proper initial solution set that can be quickly adapted to a new task at a relatively low computational cost. This approach leads to (i) a personalized solution for each individual task, and (ii) significantly reduced computational cost at test time compared to the case where the solution is fully optimized once the new task is revealed. The training procedure is performed by solving a challenging discrete optimization problem for which we present deterministic and randomized algorithms. In the case where the tasks are monotone and submodular, we show strong theoretical guarantees for our proposed methods even though the training objective may not be submodular. We also demonstrate the effectiveness of our framework on two real-world problem instances where we observe that our methods lead to a significant reduction in computational complexity in solving the new tasks while incurring a small performance loss compared to when the tasks are fully optimized.
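The adaptation step above can be pictured with the classical greedy algorithm for cardinality-constrained maximization, warm-started from a meta-learned set. This plain greedy is a simplified stand-in for the paper's deterministic and randomized training algorithms:

```python
def greedy_max(f, ground, k, init=frozenset()):
    """Greedy maximization of a set function f under |S| <= k, optionally
    warm-started from an initial set `init` (the meta-learned solution in
    the paper's setting; plain greedy here is only an illustration)."""
    S = set(init)
    while len(S) < k:
        best = max((e for e in ground if e not in S),
                   key=lambda e: f(S | {e}) - f(S), default=None)
        if best is None:
            break
        S.add(best)
    return S

# Toy monotone submodular task: set coverage.
sets = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
cover = lambda S: len(set().union(*[sets[e] for e in S]))
sol = greedy_max(cover, sets.keys(), k=2)  # covers 3 ground elements
```

Starting from a good `init` shared across related tasks, only a few greedy additions are needed per new task, which is the source of the claimed test-time savings.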
Mohammad Fereydounian, Hamed Hassani, Mohammad Vahid Jamali, Hessam Mahdavifar
Low-capacity scenarios have become increasingly important in the technology of the Internet of Things (IoT) and the next generation of wireless networks. Such scenarios require efficient and reliable transmission over channels with an extremely small capacity. Within these constraints, the state-of-the-art coding techniques may not be directly applicable. Moreover, the prior work on the finite-length analysis of optimal channel coding provides inaccurate predictions of the limits in the low-capacity regime. In this paper, we study channel coding at low capacity from two perspectives: fundamental limits at finite length and code constructions. We first specify what a low-capacity regime means. We then characterize finite-length fundamental limits of channel coding in the low-capacity regime for various types of channels, including binary erasure channels (BECs), binary symmetric channels (BSCs), and additive white Gaussian noise (AWGN) channels. From the code construction perspective, we characterize the optimal number of repetitions for transmission over binary memoryless symmetric (BMS) channels, in terms of the code blocklength and the underlying channel capacity, such that the capacity loss due to the repetition is negligible. Furthermore, it is shown that capacity-achieving polar codes naturally adopt the aforementioned optimal number of repetitions.
Alexander Robey, Luiz F. O. Chamon, George J. Pappas, Hamed Hassani
Many of the successes of machine learning are based on minimizing an averaged loss function. However, it is well-known that this paradigm suffers from robustness issues that hinder its applicability in safety-critical domains. These issues are often addressed by training against worst-case perturbations of data, a technique known as adversarial training. Although empirically effective, adversarial training can be overly conservative, leading to unfavorable trade-offs between nominal performance and robustness. To address this, in this paper we propose a framework called probabilistic robustness that bridges the gap between the accurate, yet brittle average case and the robust, yet conservative worst case by enforcing robustness to most rather than to all perturbations. From a theoretical point of view, this framework overcomes the trade-offs between the performance and the sample-complexity of worst-case and average-case learning. From a practical point of view, we propose a novel algorithm based on risk-aware optimization that effectively balances average- and worst-case performance at a considerably lower computational cost relative to adversarial training. Our results on MNIST, CIFAR-10, and SVHN illustrate the advantages of this framework on the spectrum from average- to worst-case robustness.
Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ramtin Pedarsani
We consider the problem of decentralized consensus optimization, where the sum of $n$ smooth and strongly convex functions is minimized over $n$ distributed agents that form a connected network. In particular, we consider the case that the communicated local decision variables among nodes are quantized in order to alleviate the communication bottleneck in distributed optimization. We propose the Quantized Decentralized Gradient Descent (QDGD) algorithm, in which nodes update their local decision variables by combining the quantized information received from their neighbors with their local information. We prove that under standard strong convexity and smoothness assumptions for the objective function, QDGD achieves a vanishing mean solution error under customary conditions for quantizers. To the best of our knowledge, this is the first algorithm that achieves vanishing consensus error in the presence of quantization noise. Moreover, we provide simulation results that show tight agreement between our derived theoretical convergence rate and the numerical results.
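A simplified scalar sketch of one such round: an unbiased stochastic quantizer (the standard assumption placed on quantizers in this line of work) plus a mixing-and-gradient update. Step sizes, mixing weights, and the grid are illustrative constants, not the paper's tuned schedules:

```python
import random

def stochastic_round(x, step=0.05):
    """Unbiased stochastic quantizer onto the grid step*Z, i.e. E[Q(x)] = x
    (a customary condition on quantizers in this setting)."""
    lo = (x // step) * step
    p = (x - lo) / step
    return lo + step if random.random() < p else lo

def qdgd_step(x, grads, weights, lr, eps):
    """One simplified QDGD-style round: each node mixes the *quantized*
    states of its neighbors with its own state, then takes a local
    gradient step. `weights` is a doubly-stochastic mixing matrix."""
    q = [stochastic_round(xi) for xi in x]
    return [(1 - eps) * xi
            + eps * sum(weights[i][j] * q[j] for j in range(len(x)))
            - lr * grads[i]
            for i, xi in enumerate(x)]

random.seed(1)
W = [[0.5, 0.5], [0.5, 0.5]]   # two fully connected nodes
x = [0.0, 1.0]
targets = [0.0, 1.0]           # node i minimizes (x - targets[i])^2; optimum 0.5
for _ in range(200):
    g = [2 * (xi - c) for xi, c in zip(x, targets)]
    x = qdgd_step(x, g, W, lr=0.02, eps=0.5)
```

Despite exchanging only quantized states, both nodes hover near the global minimizer 0.5, with a residual error driven by the quantization noise and the fixed step sizes.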
Alkis Gotovos, Hamed Hassani, Andreas Krause, Stefanie Jegelka
We consider the problem of inference in discrete probabilistic models, that is, distributions over subsets of a finite ground set. These encompass a range of well-known models in machine learning, such as determinantal point processes and Ising models. Locally-moving Markov chain Monte Carlo algorithms, such as the Gibbs sampler, are commonly used for inference in such models, but their convergence is, at times, prohibitively slow. This is often caused by state-space bottlenecks that greatly hinder the movement of such samplers. We propose a novel sampling strategy that uses a specific mixture of product distributions to propose global moves and, thus, accelerate convergence. Furthermore, we show how to construct such a mixture using semigradient information. We illustrate the effectiveness of combining our sampler with existing ones, both theoretically on an example model, as well as practically on three models learned from real-world data sets.