Nika Haghtalab, Michael I. Jordan, Eric Zhao
We provide a unifying framework for the design and analysis of multicalibrated predictors. By placing the multicalibration problem in the general setting of multi-objective learning -- where learning guarantees must hold simultaneously over a set of distributions and loss functions -- we exploit connections to game dynamics to achieve state-of-the-art guarantees for a diverse set of multicalibration learning problems. In addition to shedding light on existing multicalibration guarantees and greatly simplifying their analysis, our approach also yields improved guarantees, such as obtaining stronger multicalibration conditions that scale with the square-root of group size and improving the complexity of $k$-class multicalibration by an exponential factor of $k$. Beyond multicalibration, we use these game dynamics to address emerging considerations in the study of group fairness and multi-distribution learning.
Alexandre Bouchard-Côté, Michael I. Jordan
Jul 26, 2012·q-bio.PE·PDF We address the problem of the joint statistical inference of phylogenetic trees and multiple sequence alignments from unaligned molecular sequences. This problem is generally formulated in terms of string-valued evolutionary processes along the branches of a phylogenetic tree. The classical evolutionary process, the TKF91 model, is a continuous-time Markov chain model comprised of insertion, deletion and substitution events. Unfortunately this model gives rise to an intractable computational problem---the computation of the marginal likelihood under the TKF91 model is exponential in the number of taxa. In this work, we present a new stochastic process, the Poisson Indel Process (PIP), in which the complexity of this computation is reduced to linear. The new model is closely related to the TKF91 model, differing only in its treatment of insertions, but the new model has a global characterization as a Poisson process on the phylogeny. Standard results for Poisson processes allow key computations to be decoupled, which yields the favorable computational profile of inference under the PIP model. We present illustrative experiments in which Bayesian inference under the PIP model is compared to separate inference of phylogenies and alignments.
Bin Shi, Simon S. Du, Michael I. Jordan, Weijie J. Su
Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms---Nesterov's accelerated gradient method for strongly convex functions (NAG-SC) and Polyak's heavy-ball method---we study an alternative limiting process that yields high-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG-SC and Polyak's heavy-ball method, but they allow the identification of a term that we refer to as "gradient correction" that is present in NAG-SC but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov's accelerated gradient method for (non-strongly) convex functions, uncovering a hitherto unknown result---that NAG-C minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-C, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG-C for smooth convex functions.
Puyudi Yang, Jianbo Chen, Cho-Jui Hsieh, Jane-Ling Wang, Michael I. Jordan
We present a probabilistic framework for studying adversarial attacks on discrete data. Based on this framework, we derive a perturbation-based method, Greedy Attack, and a scalable learning-based method, Gumbel Attack, that illustrate various tradeoffs in the design of attacks. We demonstrate the effectiveness of these methods using both quantitative metrics and human evaluation on various state-of-the-art models for text classification, including a word-based CNN, a character-based CNN and an LSTM. As as example of our results, we show that the accuracy of character-based convolutional networks drops to the level of random selection by modifying only five characters through Greedy Attack.
Tianyi Lin, Nhat Ho, Xi Chen, Marco Cuturi, Michael I. Jordan
We study the fixed-support Wasserstein barycenter problem (FS-WBP), which consists in computing the Wasserstein barycenter of $m$ discrete probability measures supported on a finite metric space of size $n$. We show first that the constraint matrix arising from the standard linear programming (LP) representation of the FS-WBP is \textit{not totally unimodular} when $m \geq 3$ and $n \geq 3$. This result resolves an open question pertaining to the relationship between the FS-WBP and the minimum-cost flow (MCF) problem since it proves that the FS-WBP in the standard LP form is not an MCF problem when $m \geq 3$ and $n \geq 3$. We also develop a provably fast \textit{deterministic} variant of the celebrated iterative Bregman projection (IBP) algorithm, named \textsc{FastIBP}, with a complexity bound of $\tilde{O}(mn^{7/3}\varepsilon^{-4/3})$, where $\varepsilon \in (0, 1)$ is the desired tolerance. This complexity bound is better than the best known complexity bound of $\tilde{O}(mn^2\varepsilon^{-2})$ for the IBP algorithm in terms of $\varepsilon$, and that of $\tilde{O}(mn^{5/2}\varepsilon^{-1})$ from accelerated alternating minimization algorithm or accelerated primal-dual adaptive gradient algorithm in terms of $n$. Finally, we conduct extensive experiments with both synthetic data and real images and demonstrate the favorable performance of the \textsc{FastIBP} algorithm in practice.
Ryan Giordano, Runjing Liu, Michael I. Jordan, Tamara Broderick
Bayesian models based on the Dirichlet process and other stick-breaking priors have been proposed as core ingredients for clustering, topic modeling, and other unsupervised learning tasks. However, due to the flexibility of these models, the consequences of prior choices can be opaque. And so prior specification can be relatively difficult. At the same time, prior choice can have a substantial effect on posterior inferences. Thus, considerations of robustness need to go hand in hand with nonparametric modeling. In the current paper, we tackle this challenge by exploiting the fact that variational Bayesian methods, in addition to having computational advantages in fitting complex nonparametric models, also yield sensitivities with respect to parametric and nonparametric aspects of Bayesian models. In particular, we demonstrate how to assess the sensitivity of conclusions to the choice of concentration parameter and stick-breaking distribution for inferences under Dirichlet process mixtures and related mixture models. We provide both theoretical and empirical support for our variational approach to Bayesian sensitivity analysis.
Tatjana Chavdarova, Tong Yang, Matteo Pagliardini, Michael I. Jordan
Yang et al. (2023) recently showed how to use first-order gradient methods to solve general variational inequalities (VIs) under a limiting assumption that analytic solutions of specific subproblems are available. In this paper, we circumvent this assumption via a warm-starting technique where we solve subproblems approximately and initialize variables with the approximate solution found at the previous iteration. We prove the convergence of this method and show that the gap function of the last iterate of the method decreases at a rate of $O(\frac{1}{\sqrt{K}})$ when the operator is $L$-Lipschitz and monotone. In numerical experiments, we show that this technique can converge much faster than its exact counterpart. Furthermore, for the cases when the inequality constraints are simple, we introduce an alternative variant of ACVI and establish its convergence under the same conditions. Finally, we relax the smoothness assumptions in Yang et al., yielding, to our knowledge, the first convergence result for VIs with general constraints that does not rely on the assumption that the operator is $L$-Lipschitz.
Samuel Horváth, Lihua Lei, Peter Richtárik, Michael I. Jordan
Adaptivity is an important yet under-studied property in modern optimization theory. The gap between the state-of-the-art theory and the current practice is striking in that algorithms with desirable theoretical guarantees typically involve drastically different settings of hyperparameters, such as step-size schemes and batch sizes, in different regimes. Despite the appealing theoretical results, such divisive strategies provide little, if any, insight to practitioners to select algorithms that work broadly without tweaking the hyperparameters. In this work, blending the "geometrization" technique introduced by Lei & Jordan 2016 and the \texttt{SARAH} algorithm of Nguyen et al., 2017, we propose the Geometrized \texttt{SARAH} algorithm for non-convex finite-sum and stochastic optimization. Our algorithm is proved to achieve adaptivity to both the magnitude of the target accuracy and the Polyak-Łojasiewicz (PL) constant if present. In addition, it achieves the best-available convergence rate for non-PL objectives simultaneously while outperforming existing algorithms for PL objectives.
Horia Mania, Michael I. Jordan, Benjamin Recht
While the identification of nonlinear dynamical systems is a fundamental building block of model-based reinforcement learning and feedback control, its sample complexity is only understood for systems that either have discrete states and actions or for systems that can be identified from data generated by i.i.d. random inputs. Nonetheless, many interesting dynamical systems have continuous states and actions and can only be identified through a judicious choice of inputs. Motivated by practical settings, we study a class of nonlinear dynamical systems whose state transitions depend linearly on a known feature embedding of state-action pairs. To estimate such systems in finite time identification methods must explore all directions in feature space. We propose an active learning approach that achieves this by repeating three steps: trajectory planning, trajectory tracking, and re-estimation of the system from all available data. We show that our method estimates nonlinear dynamical systems at a parametric rate, similar to the statistical rate of standard linear regression.
Lihua Lei, Michael I. Jordan
Stochastic-gradient-based optimization has been a core enabling methodology in applications to large-scale problems in machine learning and related areas. Despite the progress, the gap between theory and practice remains significant, with theoreticians pursuing mathematical optimality at a cost of obtaining specialized procedures in different regimes (e.g., modulus of strong convexity, magnitude of target accuracy, signal-to-noise ratio), and with practitioners not readily able to know which regime is appropriate to their problem, and seeking broadly applicable algorithms that are reasonably close to optimality. To bridge these perspectives it is necessary to study algorithms that are adaptive to different regimes. We present the stochastically controlled stochastic gradient (SCSG) method for composite convex finite-sum optimization problems and show that SCSG is adaptive to both strong convexity and target accuracy. The adaptivity is achieved by batch variance reduction with adaptive batch sizes and a novel technique, which we referred to as geometrization, which sets the length of each epoch as a geometric random variable. The algorithm achieves strictly better theoretical complexity than other existing adaptive algorithms, while the tuning parameters of the algorithm only depend on the smoothness parameter of the objective.
Ani Adhikari, John DeNero, Michael I. Jordan
The undergraduate data science curriculum at the University of California, Berkeley is anchored in five new courses that emphasize computational thinking, inferential thinking, and working on real-world problems. We believe that interleaving these elements within our core courses is essential to preparing students to engage in data-driven inquiry at the scale that contemporary scientific and industrial applications demand. This new curriculum is already reshaping the undergraduate experience at Berkeley, where these courses have become some of the most popular on campus and have led to a surging interest in a new undergraduate major and minor program in data science.
Wenshuo Guo, Mihaela Curmei, Serena Wang, Benjamin Recht, Michael I. Jordan
We study the problem of finding equilibrium strategies in multi-agent games with incomplete payoff information, where the payoff matrices are only known to the players up to some bounded uncertainty sets. In such games, an ex-post equilibrium characterizes equilibrium strategies that are robust to the payoff uncertainty. When the game is one-shot, we show that in zero-sum polymatrix games, an ex-post equilibrium can be computed efficiently using linear programming. We further extend the notion of ex-post equilibrium to stochastic games, where the game is played repeatedly in a sequence of stages and the transition dynamics are governed by an Markov decision process (MDP). We provide sufficient condition for the existence of an ex-post Markov perfect equilibrium (MPE). We show that under bounded payoff uncertainty, the value of any two-player zero-sum stochastic game can be computed up to a tight value interval using dynamic programming.
Tianyi Lin, Michael. I. Jordan
This paper settles an open and challenging question pertaining to the design of simple and optimal high-order methods for solving smooth and monotone variational inequalities (VIs). A VI involves finding $x^\star \in \mathcal{X}$ such that $\langle F(x), x - x^\star\rangle \geq 0$ for all $x \in \mathcal{X}$. We consider the setting in which $F$ is smooth with up to $(p-1)^{th}$-order derivatives. For $p = 2$, the cubic regularized Newton method was extended to VIs with a global rate of $O(ε^{-1})$. An improved rate of $O(ε^{-2/3}\log\log(1/ε))$ can be obtained via an alternative second-order method, but this method requires a nontrivial line-search procedure as an inner loop. Similarly, high-order methods based on line-search procedures have been shown to achieve a rate of $O(ε^{-2/(p+1)}\log\log(1/ε))$. As emphasized by Nesterov, however, such procedures do not necessarily imply practical applicability in large-scale applications, and it would be desirable to complement these results with a simple high-order VI method that retains the optimality of the more complex methods. We propose a $p^{th}$-order method that does \textit{not} require any line search procedure and provably converges to a weak solution at a rate of $O(ε^{-2/(p+1)})$. We prove that our $p^{th}$-order method is optimal in the monotone setting by establishing a matching lower bound under a generalized linear span assumption. Our method with restarting attains a linear rate for smooth and uniformly monotone VIs and a local superlinear rate for smooth and strongly monotone VIs. Our method also achieves a global rate of $O(ε^{-2/p})$ for solving smooth and nonmonotone VIs satisfying the Minty condition and when augmented with restarting it attains a global linear and local superlinear rate for smooth and nonmonotone VIs satisfying the uniform/strong Minty condition.
Jianbo Chen, Michael I. Jordan, Martin J. Wainwright
The goal of a decision-based adversarial attack on a trained model is to generate adversarial examples based solely on observing output labels returned by the targeted model. We develop HopSkipJumpAttack, a family of algorithms based on a novel estimate of the gradient direction using binary information at the decision boundary. The proposed family includes both untargeted and targeted attacks optimized for $\ell_2$ and $\ell_\infty$ similarity metrics respectively. Theoretical analysis is provided for the proposed algorithms and the gradient direction estimate. Experiments show HopSkipJumpAttack requires significantly fewer model queries than Boundary Attack. It also achieves competitive performance in attacking several widely-used defense mechanisms. (HopSkipJumpAttack was named Boundary Attack++ in a previous version of the preprint.)
Ryan Giordano, Michael I. Jordan, Tamara Broderick
Cross validation (CV) and the bootstrap are ubiquitous model-agnostic tools for assessing the error or variability of machine learning and statistical estimators. However, these methods require repeatedly re-fitting the model with different weighted versions of the original dataset, which can be prohibitively time-consuming. For sufficiently regular optimization problems the optimum depends smoothly on the data weights, and so the process of repeatedly re-fitting can be approximated with a Taylor series that can be often evaluated relatively quickly. The first-order approximation is known as the "infinitesimal jackknife" in the statistics literature and has been the subject of recent interest in machine learning for approximate CV. In this work, we consider high-order approximations, which we call the "higher-order infinitesimal jackknife" (HOIJ). Under mild regularity conditions, we provide a simple recursive procedure to compute approximations of all orders with finite-sample accuracy bounds. Additionally, we show that the HOIJ can be efficiently computed even in high dimensions using forward-mode automatic differentiation. We show that a linear approximation with bootstrap weights approximation is equivalent to those provided by asymptotic normal approximations. Consequently, the HOIJ opens up the possibility of enjoying higher-order accuracy properties of the bootstrap using local approximations. Consistency of the HOIJ for leave-one-out CV under different asymptotic regimes follows as corollaries from our finite-sample bounds under additional regularity assumptions. The generality of the computation and bounds motivate the name "higher-order Swiss Army infinitesimal jackknife."
Nika Haghtalab, Michael I. Jordan, Eric Zhao
Social and real-world considerations such as robustness, fairness, social welfare and multi-agent tradeoffs have given rise to multi-distribution learning paradigms, such as collaborative learning, group distributionally robust optimization, and fair federated learning. In each of these settings, a learner seeks to uniformly minimize its expected loss over $n$ predefined data distributions, while using as few samples as possible. In this paper, we establish the optimal sample complexity of these learning paradigms and give algorithms that meet this sample complexity. Importantly, our sample complexity bounds for multi-distribution learning exceed that of learning a single distribution by only an additive factor of $n \log(n) / ε^2$. This improves upon the best known sample complexity bounds for fair federated learning by Mohri et al. and collaborative learning by Nguyen and Zakynthinou by multiplicative factors of $n$ and $\log(n)/ε^3$, respectively. We also provide the first sample complexity bounds for the group DRO objective of Sagawa et al. To guarantee these optimal sample complexity bounds, our algorithms learn to sample from data distributions on demand. Our algorithm design and analysis are enabled by our extensions of online learning techniques for solving stochastic zero-sum games. In particular, we contribute stochastic variants of no-regret dynamics that can trade off between players' differing sampling costs.
Maxim Rabinovich, Elaine Angelino, Michael I. Jordan
Practitioners of Bayesian statistics have long depended on Markov chain Monte Carlo (MCMC) to obtain samples from intractable posterior distributions. Unfortunately, MCMC algorithms are typically serial, and do not scale to the large datasets typical of modern machine learning. The recently proposed consensus Monte Carlo algorithm removes this limitation by partitioning the data and drawing samples conditional on each partition in parallel (Scott et al, 2013). A fixed aggregation function then combines these samples, yielding approximate posterior samples. We introduce variational consensus Monte Carlo (VCMC), a variational Bayes algorithm that optimizes over aggregation functions to obtain samples from a distribution that better approximates the target. The resulting objective contains an intractable entropy term; we therefore derive a relaxation of the objective and show that the relaxed problem is blockwise concave under mild conditions. We illustrate the advantages of our algorithm on three inference tasks from the literature, demonstrating both the superior quality of the posterior approximation and the moderate overhead of the optimization step. Our algorithm achieves a relative error reduction (measured against serial MCMC) of up to 39% compared to consensus Monte Carlo on the task of estimating 300-dimensional probit regression parameter expectations; similarly, it achieves an error reduction of 92% on the task of estimating cluster comembership probabilities in a Gaussian mixture model with 8 components in 8 dimensions. Furthermore, these gains come at moderate cost compared to the runtime of serial MCMC, achieving near-ideal speedup in some instances.
Nilesh Tripuraneni, Nicolas Flammarion, Francis Bach, Michael I. Jordan
We consider the minimization of a function defined on a Riemannian manifold $\mathcal{M}$ accessible only through unbiased estimates of its gradients. We develop a geometric framework to transform a sequence of slowly converging iterates generated from stochastic gradient descent (SGD) on $\mathcal{M}$ to an averaged iterate sequence with a robust and fast $O(1/n)$ convergence rate. We then present an application of our framework to geodesically-strongly-convex (and possibly Euclidean non-convex) problems. Finally, we demonstrate how these ideas apply to the case of streaming $k$-PCA, where we show how to accelerate the slow rate of the randomized power method (without requiring knowledge of the eigengap) into a robust algorithm achieving the optimal rate of convergence.
Etienne Gauthier, Francis Bach, Michael I. Jordan
In many multi-agent systems, agents interact repeatedly and are expected to settle into equilibrium behavior over time. Yet in practice, behavior often drifts, and detecting such deviations in real time remains an open challenge. We introduce a sequential testing framework that monitors whether observed play in repeated games is consistent with equilibrium, without assuming a fixed sample size. Our approach builds on the e-value framework for safe anytime-valid inference: by "betting" against equilibrium, we construct a test supermartingale that accumulates evidence whenever observed payoffs systematically violate equilibrium conditions. This yields a statistically sound, interpretable measure of departure from equilibrium that can be monitored online. We also leverage Benjamini-Hochberg-type procedures to increase detection power in large games while rigorously controlling the false discovery rate. Our framework unifies the treatment of Nash, correlated, and coarse correlated equilibria, offering finite-time guarantees and a detailed analysis of detection times. Moreover, we extend our method to stochastic games, broadening its applicability beyond repeated-play settings.
Etienne Gauthier, Francis Bach, Michael I. Jordan
We introduce $\textit{Backward Conformal Prediction}$, a method that guarantees conformal coverage while providing flexible control over the size of prediction sets. Unlike standard conformal prediction, which fixes the coverage level and allows the conformal set size to vary, our approach defines a rule that constrains how prediction set sizes behave based on the observed data, and adapts the coverage level accordingly. Our method builds on two key foundations: (i) recent results by Gauthier et al. [2025] on post-hoc validity using e-values, which ensure marginal coverage of the form $\mathbb{P}(Y_{\rm test} \in \hat C_n^{\tildeα}(X_{\rm test})) \ge 1 - \mathbb{E}[\tildeα]$ up to a first-order Taylor approximation for any data-dependent miscoverage $\tildeα$, and (ii) a novel leave-one-out estimator $\hatα^{\rm LOO}$ of the marginal miscoverage $\mathbb{E}[\tildeα]$ based on the calibration set, ensuring that the theoretical guarantees remain computable in practice. This approach is particularly useful in applications where large prediction sets are impractical such as medical diagnosis. We provide theoretical results and empirical evidence supporting the validity of our method, demonstrating that it maintains computable coverage guarantees while ensuring interpretable, well-controlled prediction set sizes.