Pierfrancesco Beneventano, Blake Woodworth
We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize -- about $2/\textrm{sharpness}$. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the ``Edge of Stability'', which induces additional regularization by delaying convergence, and may have implications for training more complex models.
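As a concrete illustration of the setting described above, the following minimal sketch runs GD on a scalar depth-2 linear model $f(x) = a\,b\,x$ fit to a single datapoint. The loss, the initialization, and the stepsize are illustrative assumptions rather than the paper's exact choices; the printed value of $\eta \times$ sharpness shows how close the run ends up to the $2/\textrm{sharpness}$ regime.

```python
import numpy as np

# Minimal toy sketch (assumed setup, not the paper's exact one): a depth-2
# linear network with one input and one output, f(x) = a*b*x, fit to the
# single datapoint (x, y) = (1, 1), i.e. loss L(a, b) = (a*b - 1)^2 / 2.
def loss(a, b):
    return 0.5 * (a * b - 1.0) ** 2

def sharpness(a, b):
    # Largest eigenvalue of the 2x2 Hessian of L at (a, b).
    H = np.array([[b * b, 2 * a * b - 1.0],
                  [2 * a * b - 1.0, a * a]])
    return float(np.linalg.eigvalsh(H)[-1])

a, b = 1.5, 0.5   # unbalanced initialization (arbitrary illustrative choice)
eta = 0.6         # large stepsize; larger values approach the 2/sharpness regime
for t in range(100):
    r = a * b - 1.0
    a, b = a - eta * r * b, b - eta * r * a   # one full-batch GD step
print(f"loss={loss(a, b):.2e}, sharpness at solution={sharpness(a, b):.3f}, "
      f"eta*sharpness={eta * sharpness(a, b):.2f}")
```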
Pierfrancesco Beneventano, Andrea Pinto, Tomaso Poggio
We investigate the ability of deep neural networks to identify the support of the target function. Our findings reveal that mini-batch SGD effectively learns the support in the first layer of the network by shrinking to zero the weights associated with irrelevant components of the input. In contrast, we demonstrate that while vanilla GD also approximates the target function, it requires an explicit regularization term to learn the support in the first layer. We prove that this property of mini-batch SGD is due to a second-order implicit regularization effect which is proportional to $\eta/b$ (step size / batch size). Our results are not only further evidence that implicit regularization significantly shapes training dynamics, but they also shed light on the structure of the features that are learned by the network. Additionally, they suggest that smaller batches enhance feature interpretability and reduce dependency on initialization.
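A hedged sketch of the kind of experiment this abstract describes: a small ReLU network is trained on a target that depends only on the first two of ten input coordinates, once with mini-batch SGD and once with full-batch GD, and the per-coordinate norms of the first-layer weights are compared. The architecture, target, and hyperparameters below are hypothetical choices, not the paper's; the claim is only that, in such settings, the mini-batch run tends to shrink the columns attached to irrelevant coordinates.

```python
import torch

torch.manual_seed(0)
# Hypothetical setup: the target depends only on the first 2 of 10 inputs.
n, d, d_relevant = 512, 10, 2
X = torch.randn(n, d)
y = (X[:, :d_relevant] ** 2).sum(dim=1, keepdim=True)  # support = first 2 coordinates

def first_layer_column_norms(batch_size, eta=0.05, epochs=500):
    net = torch.nn.Sequential(torch.nn.Linear(d, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 1))
    opt = torch.optim.SGD(net.parameters(), lr=eta)
    for _ in range(epochs):
        perm = torch.randperm(n)                 # shuffle: SGD without replacement
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]
            loss = ((net(X[idx]) - y[idx]) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Norm of the first-layer weights attached to each input coordinate.
    return net[0].weight.detach().norm(dim=0)

print("mini-batch SGD:", first_layer_column_norms(batch_size=16))
print("full-batch GD :", first_layer_column_norms(batch_size=n))
```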
Pierfrancesco Beneventano, Patrick Cheridito, Arnulf Jentzen, Philippe von Wurstemberger
In this paper we develop new machinery to study the capacity of artificial neural networks (ANNs) to approximate high-dimensional functions without suffering from the curse of dimensionality. Specifically, we introduce a concept which we refer to as approximation spaces of artificial neural networks and we present several tools to handle these spaces. Roughly speaking, approximation spaces consist of sequences of functions which can, in a suitable way, be approximated by ANNs without the curse of dimensionality in the sense that the number of ANN parameters required to approximate a function of the sequence to an accuracy $\varepsilon > 0$ grows at most polynomially both in the reciprocal $1/\varepsilon$ of the required accuracy and in the dimension $d \in \mathbb{N} = \{1, 2, 3, \ldots \}$ of the function. We show that these approximation spaces are closed under various operations including linear combinations, formations of limits, and infinite compositions. To illustrate the utility of the machinery proposed in this paper, we employ the developed theory to prove that ANNs have the capacity to overcome the curse of dimensionality in the numerical approximation of certain first order transport partial differential equations (PDEs). We even prove that approximation spaces are closed under flows of first order transport PDEs.
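Schematically, and under a simplified parametrization (the constants $c, p, q$, the realization map $\mathcal{R}$, and the parameter count $\mathcal{P}$ are generic placeholders rather than the article's exact notation), the condition behind ``without suffering from the curse of dimensionality'' can be written as:

```latex
\exists\, c, p, q \in [0,\infty)\ \forall\, d \in \mathbb{N}\ \forall\, \varepsilon \in (0,1]\
\exists\, \text{ANN } \Phi_{d,\varepsilon} \colon \quad
\big\| f_d - \mathcal{R}(\Phi_{d,\varepsilon}) \big\| \le \varepsilon
\quad\text{and}\quad
\mathcal{P}(\Phi_{d,\varepsilon}) \le c\, d^{p}\, \varepsilon^{-q}.
```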
Pierfrancesco Beneventano
This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD). We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks. We analyze this algorithm in a more realistic regime than is typically considered in theoretical works on SGD: for instance, we allow the product of the learning rate and the Hessian to be $O(1)$, and we do not specify any model architecture, learning task, or loss (objective) function. Our core theoretical result is that optimizing with SGD without replacement is locally equivalent to taking an additional step on a novel regularizer. This implies that the expected trajectories of SGD without replacement can be decoupled into (i) following SGD with replacement (in which batches are sampled i.i.d.) along the directions of high curvature, and (ii) regularizing the trace of the noise covariance along the flat ones. As a consequence, SGD without replacement traverses flat areas and may escape saddles significantly faster than SGD with replacement. On several vision tasks, the novel regularizer penalizes a weighted trace of the Fisher information matrix, thus encouraging sparsity in the spectrum of the Hessian of the loss, in line with empirical observations from prior work. We also propose an explanation for why SGD, unlike GD, does not train at the edge of stability.
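The quantity said to be regularized along flat directions, the trace of the gradient-noise covariance, can be estimated at any point in parameter space from per-sample gradients. The sketch below does this for a toy linear model; the model, the data, and the unweighted form of the trace are illustrative assumptions and do not reproduce the paper's exact weighted regularizer.

```python
import torch

torch.manual_seed(0)
# Toy model and data (hypothetical; the abstract fixes no architecture or task).
X, y = torch.randn(64, 5), torch.randn(64, 1)
model = torch.nn.Linear(5, 1)
params = list(model.parameters())

def flat_grad(loss):
    return torch.cat([g.reshape(-1) for g in torch.autograd.grad(loss, params)])

# Per-sample gradients g_i of the squared loss at the current parameters.
grads = torch.stack([flat_grad(((model(X[i:i + 1]) - y[i:i + 1]) ** 2).mean())
                     for i in range(len(X))])
mean_grad = grads.mean(dim=0)

# Trace of the gradient-noise covariance: mean_i ||g_i - g_bar||^2.
# This unweighted trace only illustrates the quantity involved; the paper's
# regularizer is a weighted version acting along flat directions.
trace_cov = ((grads - mean_grad) ** 2).sum(dim=1).mean()
print(float(trace_cov))
```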
Arseniy Andreyev, Pierfrancesco Beneventano
Recent findings by Cohen et al., 2021, demonstrate that when training neural networks using full-batch gradient descent with a step size of $\eta$, the largest eigenvalue $\lambda_{\max}$ of the full-batch Hessian consistently stabilizes around $2/\eta$. These results have significant implications for convergence and generalization. This, however, is not the case for mini-batch optimization algorithms, limiting the broader applicability of these findings. We show that mini-batch Stochastic Gradient Descent (SGD) trains in a different regime, which we term the Edge of Stochastic Stability (EoSS). In this regime, what stabilizes at $2/\eta$ is Batch Sharpness: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence, $\lambda_{\max}$ -- which is generally smaller than Batch Sharpness -- is suppressed, aligning with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for the mathematical modeling of SGD trajectories.
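A sketch of how the quantity called Batch Sharpness can be estimated in practice: for random mini-batches, compute the mini-batch gradient $g_B$, the Hessian-vector product $H_B g_B$ via double backpropagation, and average the normalized directional curvature $g_B^\top H_B g_B / \|g_B\|^2$. The normalization and the estimator below are assumptions consistent with the description above, not necessarily the paper's exact definition.

```python
import torch

def batch_sharpness(model, loss_fn, X, y, batch_size, n_batches=50):
    """Estimate E_B[ g_B^T H_B g_B / ||g_B||^2 ] over random mini-batches,
    where g_B and H_B are the gradient and Hessian of the mini-batch loss at
    the current parameters.  The normalization is an illustrative choice."""
    params = [p for p in model.parameters() if p.requires_grad]
    values = []
    for _ in range(n_batches):
        idx = torch.randint(len(X), (batch_size,))
        loss = loss_fn(model(X[idx]), y[idx])
        g = torch.autograd.grad(loss, params, create_graph=True)
        g_flat = torch.cat([gi.reshape(-1) for gi in g])
        v = g_flat.detach()
        # Hessian-vector product H_B v via a second backward pass.
        Hv = torch.autograd.grad(g_flat @ v, params)
        Hv_flat = torch.cat([h.reshape(-1) for h in Hv])
        values.append((v @ Hv_flat) / (v @ v))
    return torch.stack(values).mean()
```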
Mahmoud Abdelmoneum, Pierfrancesco Beneventano, Tomaso Poggio
We present pAI/MSc, an open-source, customizable, modular multi-agent system for academic research workflows. Our goal is not autonomous scientific ideation, nor fully automated research. It is narrower and more practical: to reduce by orders of magnitude the human steering required to turn a specified hypothesis into a literature-grounded, mathematically established, experimentally supported, submission-oriented manuscript draft. pAI/MSc is built with a current emphasis on machine learning theory and adjacent quantitative fields.
Pierfrancesco Beneventano, Patrick Cheridito, Robin Graeber, Arnulf Jentzen, Benno Kuckuck
The purpose of this article is to develop machinery to study the capacity of deep neural networks (DNNs) to approximate high-dimensional functions. In particular, we show that DNNs have the expressive power to overcome the curse of dimensionality in the approximation of a large class of functions. More precisely, we prove that these functions can be approximated by DNNs on compact sets such that the number of parameters necessary to represent the approximating DNNs grows at most polynomially in the reciprocal $1/\varepsilon$ of the approximation accuracy $\varepsilon>0$ and in the input dimension $d\in \mathbb{N} =\{1,2,3,\dots\}$. To this end, we introduce certain approximation spaces, consisting of sequences of functions that can be efficiently approximated by DNNs. We then establish closure properties which we combine with known and new bounds on the number of parameters necessary to approximate locally Lipschitz continuous functions, maximum functions, and product functions by DNNs. The main result of this article demonstrates that DNNs have sufficient expressiveness to approximate certain sequences of functions which can be constructed by means of a finite number of compositions using locally Lipschitz continuous functions, maxima, and products without the curse of dimensionality.
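Schematically, writing $\mathcal{A}$ for such an approximation space of function sequences ($\mathcal{A}$ is a placeholder symbol, not the article's notation), the closure properties combined in the main result take the following form, with dimensions assumed compatible for the composition and with the precise local Lipschitz hypotheses omitted:

```latex
f, g \in \mathcal{A}
\;\;\Longrightarrow\;\;
f \circ g \in \mathcal{A},
\qquad
\max\{f, g\} \in \mathcal{A},
\qquad
f \cdot g \in \mathcal{A},
```

where membership in $\mathcal{A}$ means approximability by DNNs with at most $c\, d^{p}\, \varepsilon^{-q}$ parameters at accuracy $\varepsilon$.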
Federico Vittorio Cortesi, Giuseppe Iannone, Giulia Crippa, Tomaso Poggio, Pierfrancesco Beneventano
Neural networks applied to financial time series operate in a regime of underspecification, in which many distinct predictors achieve indistinguishable out-of-sample error. Using large-scale volatility forecasting for S\&P 500 stocks, we show that different model-training-pipeline pairs with identical test loss learn qualitatively different functions. Across architectures, predictive accuracy remains unchanged, yet optimizer choice reshapes non-linear response profiles and temporal dependence in different ways. These divergences have material consequences for decisions: volatility-ranked portfolios trace a near-vertical Sharpe-turnover frontier, with nearly $3\times$ dispersion in turnover at comparable Sharpe ratios. We conclude that in underspecified settings, optimization acts as a consequential source of inductive bias, and model evaluation should therefore extend beyond scalar loss to encompass functional and decision-level implications.
Arseniy Andreyev, Advikar Ananthkumar, Marc Walden, Tomaso Poggio, Pierfrancesco Beneventano
Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-\beta)/\eta$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+\beta)/\eta$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
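For context on the large-batch plateau, the classical linear-stability calculation for heavy-ball momentum along a quadratic direction of curvature $\lambda$ (a standard computation, not the paper's contribution, which concerns the stochastic small-batch threshold) reads:

```latex
v_{t+1} = \beta v_t - \eta \lambda x_t, \qquad x_{t+1} = x_t + v_{t+1}
\;\;\Longrightarrow\;\;
\begin{pmatrix} x_{t+1} \\ v_{t+1} \end{pmatrix}
=
\begin{pmatrix} 1 - \eta\lambda & \beta \\ -\eta\lambda & \beta \end{pmatrix}
\begin{pmatrix} x_t \\ v_t \end{pmatrix},
```

and the spectral radius of this update matrix is below one (for $0 \le \beta < 1$) exactly when $\lambda < 2(1+\beta)/\eta$, matching the large-batch plateau quoted above.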
Alessandro Morosini, Matea Gjika, Tomaso Poggio, Pierfrancesco Beneventano
Modern neural networks can achieve high accuracy while remaining poorly calibrated, producing confidence estimates that do not match empirical correctness. Yet calibration is often treated as a post-hoc attribute. We take a different perspective: we study calibration as a training-time phenomenon on small vision tasks, and ask whether calibrated solutions can be obtained reliably by intervening on the training procedure. We identify a tight coupling between calibration, curvature, and margins during training of deep networks under multiple gradient-based methods. Empirically, Expected Calibration Error (ECE) closely tracks curvature-based sharpness throughout optimization. Mathematically, we show that both ECE and Gauss--Newton curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the trajectory. Guided by this mechanism, we introduce a margin-aware training objective that explicitly targets robust-margin tails and local smoothness, yielding improved out-of-sample calibration across optimizers without sacrificing accuracy.
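The calibration metric tracked above, Expected Calibration Error, is conventionally estimated by binning predictions by confidence. The sketch below uses the standard equal-width binning estimator; the bin count and binning scheme are conventional choices and not necessarily those used in the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Standard binned ECE: weighted mean of |accuracy - confidence| over
    equally spaced confidence bins.

    probs:  (n, k) array of predicted class probabilities
    labels: (n,)   array of integer class labels
    """
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean()
                                       - confidences[in_bin].mean())
    return ece
```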
Yizhou Xu, Pierfrancesco Beneventano, Isaac Chuang, Liu Ziyin
A large body of theory and empirical work hypothesizes a connection between the flatness of a neural network's loss landscape during training and its performance. However, there is conflicting evidence regarding when SGD prefers flatter or sharper solutions during training. In this work, we partially but causally clarify the flatness-seeking behavior of SGD by identifying and exactly solving an analytically solvable model that exhibits both flattening and sharpening behavior during training. In this model, SGD training has no \textit{a priori} preference for flatness, only a preference for minimal gradient fluctuations. This leads to the insight that, at least within this model, it is the data distribution that uniquely determines the sharpness at convergence, and that a flat minimum is preferred if and only if the noise in the labels is isotropic across all output dimensions. When the label noise is anisotropic, the model instead prefers sharpness and can converge to an arbitrarily sharp solution, depending on the imbalance in the spectrum of the label noise. We reproduce this key insight in controlled settings with different model architectures such as MLPs, RNNs, and transformers.
Federico Bottino, Carlo Ferrero, Nicholas Dosio, Pierfrancesco Beneventano
Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content without distinguishing binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. We argue that the ceiling on organizational AI is not retrieval fidelity but \emph{epistemic} fidelity--the system's ability to represent commitment strength, contradiction status, and organizational ignorance as computable properties. We present OIDA, a framework that structures organizational knowledge as typed Knowledge Objects carrying an epistemic class, importance scores with class-specific decay, and signed contradiction edges. The Knowledge Gravity Engine maintains scores deterministically with proven convergence guarantees (sufficient condition: maximum degree $< 7$; empirically robust up to degree 43). OIDA introduces QUESTION-as-modeled-ignorance: a primitive with inverse decay that surfaces what an organization does \emph{not} know with increasing urgency--a mechanism absent from all surveyed systems. We describe the Epistemic Quality Score (EQS), a five-component evaluation methodology with an explicit circularity analysis. In a controlled comparison ($n{=}10$ response pairs), OIDA's RAG condition (3,868 tokens) achieves an EQS of 0.530 vs.\ 0.848 for a full-context baseline (108,687 tokens); the $28.1\times$ difference in token budget is the primary confound. The QUESTION mechanism is statistically validated (Fisher $p{=}0.0325$, OR$=21.0$). The formal properties are established; the decisive ablation at equal token budget (E4) is pre-registered and not yet run.
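Purely as an illustration of the data structure the abstract describes (with all field names, types, and update rules inferred from the abstract rather than taken from the OIDA implementation), a typed Knowledge Object might be sketched as follows:

```python
from dataclasses import dataclass, field
from enum import Enum

# Illustrative sketch only: every name below is inferred from the abstract,
# not taken from OIDA's actual code.
class EpistemicClass(Enum):
    DECISION = "decision"   # binding commitments
    CLAIM = "claim"         # assertions that may be contested
    FACT = "fact"           # settled knowledge
    QUESTION = "question"   # modeled ignorance (inverse decay)

@dataclass
class KnowledgeObject:
    text: str
    epistemic_class: EpistemicClass
    importance: float = 1.0       # maintained deterministically by the scoring engine
    decay_rate: float = 0.01      # class-specific; QUESTIONs decay inversely
    contradicts: list[tuple[str, int]] = field(default_factory=list)  # signed edges: (object id, +/-1)

    def step(self) -> None:
        """One deterministic score update (placeholder dynamics)."""
        if self.epistemic_class is EpistemicClass.QUESTION:
            self.importance *= (1.0 + self.decay_rate)   # unresolved questions gain urgency
        else:
            self.importance *= (1.0 - self.decay_rate)
```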