Gianlorenzo D'Angelo, Riccardo Michielan
We study the efficient generation of random graphs with a prescribed expected degree sequence, focusing on rank-1 inhomogeneous models in which vertices are assigned weights and edges are drawn independently with probabilities proportional to the product of endpoint weights. We adopt a temporal viewpoint, adding edges to the graph one at a time up to a fixed time horizon, and allowing for self-loops or duplicate edges in the first stage. Then, the simple projection of the resulting multigraph recovers exactly the simple Norros--Reittu random graph, whose expected degrees match the prescribed targets under mild conditions. Building on this representation, we develop an exact generator based on \textit{edge-arrivals} for expected-degree random graphs with running time $O(n+m)$, where $m$ is the number of generated edges, and hence proportional to the output size. This removes the vertex sorting used by widely-used fast generator algorithms based on \textit{edge-skipping} for rank-1 expected-degree models, whose sorting step leads to a total running time of $O(n \log n + m)$. In addition, our algorithm is simpler than those in the literature, easy to implement, and very flexible, thus opening the way to extensions to directed and temporal random graphs, generalizations to higher-order structures, and improvements through parallelization.
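As a rough illustration of the edge-arrival view described above, the following Python sketch draws a Poisson number of edge arrivals with endpoints picked proportionally to vertex weights, then takes the simple projection. The function names and the use of Knuth's Poisson sampler are illustrative choices, not the paper's implementation.

```python
import random
from bisect import bisect_left
from math import exp

def sample_poisson(lam, rng):
    """Knuth's multiplicative Poisson sampler (adequate for moderate lam)."""
    threshold = exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def norros_reittu_edge_arrivals(weights, rng):
    """Draw a Poisson(L/2) number of edge arrivals, L = sum of weights; each
    endpoint is chosen independently with probability proportional to its
    weight, so the copies of edge {i, j} are Poisson(w_i * w_j / L).  The
    simple projection discards self-loops and collapses duplicate edges."""
    L = float(sum(weights))
    cum, acc = [], 0.0
    for w in weights:
        acc += w
        cum.append(acc)
    def draw_vertex():
        # inverse-transform sampling over the cumulative weights
        return bisect_left(cum, rng.random() * L)
    multigraph = [(draw_vertex(), draw_vertex())
                  for _ in range(sample_poisson(L / 2, rng))]
    simple = {tuple(sorted(e)) for e in multigraph if e[0] != e[1]}
    return multigraph, simple
```

Each arrival costs $O(\log n)$ here because of the bisection; a table-based alias sampler would bring it to $O(1)$ per edge, matching the $O(n+m)$ bound.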
Gabriel Riffo, Leonard Schmitz
We introduce SignatureTensors.jl, a new package for computing signature tensors of paths in Julia. We present its core functionality and demonstrate its use through illustrative examples. The package is compatible with the computer algebra system OSCAR, enabling both exact and numerical computations with signatures.
Yifan Li, Giulia Guidi
In computational science and data analytics, many workloads involve irregular and sparse computations that are inherently difficult to optimize for modern hardware. A key kernel is Sparse General Matrix-Matrix Multiplication (SpGEMM), which underpins simulations, graph analytics, and machine learning applications. SpGEMM exhibits irregular memory access patterns and workload imbalance, making it challenging to achieve high performance on GPUs. Current GPU SpGEMM solutions typically rely on a two-pass workflow to address load imbalance and reduce memory access. The symbolic pass, which determines the number of output elements per row, accounts for roughly 28% of the total runtime on average. In this work, we question the necessity of exact symbolic computation and introduce an estimation-based SpGEMM workflow. Our approach replaces the costly symbolic step with lightweight HyperLogLog estimators, combined with an analysis strategy that dynamically selects the optimal workflow and guides accumulator configuration. In addition, we introduce a hybrid accumulator design, including a novel hash-based accumulator that leverages both shared and global memory. Our approach consistently outperforms leading GPU SpGEMM implementations across a wide range of both square and rectangular matrices, achieving speedups of 1.4x-2.8x on NVIDIA A100 and H100 architectures.
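The estimator that replaces the symbolic pass can be sketched with a toy HyperLogLog: each distinct column index of a row hashes to one of $m$ registers, which record the longest run of leading zeros; the harmonic mean of the registers yields a cardinality estimate. The hash (blake2b) and register count below are illustrative choices, not the paper's configuration.

```python
import hashlib

class HyperLogLog:
    """Toy HyperLogLog cardinality estimator with 2^b registers."""
    def __init__(self, b=8):
        self.b, self.m = b, 1 << b
        self.reg = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # standard bias constant

    def add(self, item):
        digest = hashlib.blake2b(repr(item).encode(), digest_size=4).digest()
        h = int.from_bytes(digest, "big")            # 32-bit hash
        j = h >> (32 - self.b)                       # first b bits: register
        rest = h & ((1 << (32 - self.b)) - 1)        # remaining 32-b bits
        rank = (32 - self.b) - rest.bit_length() + 1 # leading-zero rank
        self.reg[j] = max(self.reg[j], rank)

    def estimate(self):
        z = sum(2.0 ** -r for r in self.reg)
        return self.alpha * self.m * self.m / z
```

With 256 registers the standard error is about $1.04/\sqrt{256} \approx 6.5\%$, which is why an estimate can stand in for the exact per-row output count at a fraction of the cost.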
Matic Petrič, René Zander
Apr 20, 2026 · quant-ph
Block-encoding is a foundational technique in modern quantum algorithms, enabling the implementation of non-unitary operations by embedding them into larger unitary matrices. While theoretically powerful and essential for advanced protocols like Quantum Singular Value Transformation (QSVT) and Quantum Signal Processing (QSP), the generation of compilable implementations of block-encodings poses a formidable challenge. This work presents the BlockEncoding interface within the Eclipse Qrisp framework, establishing block-encodings as a high-level programming abstraction accessible to a broad scientific audience. Serving as both a technical framework introduction and a hands-on tutorial, this paper explicitly details key underlying concepts abstracted away by the interface, such as block-encoding construction and qubitization, and their practical integration into methods like the Childs-Kothari-Somma (CKS) algorithm. We outline the interface's software architecture, encompassing constructors, core utilities, arithmetic composition, and algorithmic applications such as matrix inversion, polynomial filtering, and Hamiltonian simulation. Through code examples, we demonstrate how this interface simplifies both the practical realization of advanced quantum algorithms and their associated resource estimation.
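The core idea of embedding a non-unitary matrix in a larger unitary can be shown with the textbook construction for a Hermitian $A$ with $\|A\| \le 1$: $U = \begin{pmatrix} A & S \\ S & -A \end{pmatrix}$ with $S = \sqrt{I - A^2}$ (a minimal numerical sketch, not Qrisp's internal construction).

```python
import numpy as np

def block_encode_hermitian(A):
    """Unitary block-encoding of a Hermitian matrix with spectral norm <= 1:
    the top-left block of the returned unitary equals A."""
    lam, V = np.linalg.eigh(A)
    # S = sqrt(I - A^2), computed via the eigendecomposition of A
    S = (V * np.sqrt(1.0 - lam ** 2)) @ V.conj().T
    return np.block([[A, S], [S, -A]])
```

Unitarity follows because $A$ and $S$ commute: $U U^\dagger$ has diagonal blocks $A^2 + (I - A^2) = I$ and vanishing off-diagonal blocks $AS - SA = 0$.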
Devin A. Matthews, Tze Meng Low, Margaret E. Myers, Devangi N. Parikh, Robert A. van de Geijn
We leverage highly successful prior projects sponsored by multiple NSF grants and gifts from industry: the BLAS-like Library Instantiation Software (BLIS) and the libflame efforts to lay the foundation for a new flexible framework by vertically integrating the dense linear and multi-linear (tensor) software stacks that are important to modern computing. This vertical integration will enable high-performance computations from node-level to massively-parallel, and across both CPU and GPU architectures. The effort builds on decades of experience by the research team turning fundamental research on the systematic derivation of algorithms (the NSF-sponsored FLAME project) into practical software for this domain, targeting single and multi-core (BLIS, TBLIS, and libflame), GPU-accelerated (SuperMatrix), and massively parallel (PLAPACK, Elemental, and ROTE) compute environments. This project will implement key linear algebra and tensor operations which highlight the flexibility and effectiveness of the new framework, and set the stage for further work in broadening functionality and integration into diverse scientific and machine learning software.
Kirill Brilliantov, Etienne Bamas, Emmanuel Abbé
We introduce a code-based challenge for automated, open-ended mathematical discovery based on the $k$-server conjecture, a central open problem in competitive analysis. The task is to discover a potential function satisfying a large graph-structured system of simple linear inequalities. The resulting evaluation procedure is sound but incomplete: any violated inequality definitively refutes a candidate, whereas satisfying all inequalities does not by itself constitute a proof of the corresponding conjecture's special case. Nevertheless, a candidate that passes all constraints would be strong evidence toward a valid proof and, to the best of our knowledge, no currently known potential achieves this under our formulation in the open $k=4$ circle case. As such, a successful candidate would already be an interesting contribution to the $k$-server conjecture, and could become a substantial theoretical result when paired with a full proof. Experiments on the resolved $k=3$ regime show that current agentic methods can solve nontrivial instances, and in the open $k=4$ regime they reduce the number of violations relative to existing potentials without fully resolving the task. Taken together, these results suggest that the task is challenging but plausibly within reach of current methods. Beyond its relevance to the $k$-server community, where the developed tooling enables researchers to test new hypotheses and potentially improve on the current record, the task also serves as a useful \emph{benchmark} for developing code-based discovery agents. In particular, our $k=3$ results show that it mitigates important limitations of existing open-ended code-based benchmarks, including early saturation and the weak separation between naive random baselines and more sophisticated methods.
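The sound-but-incomplete evaluation described above amounts to checking a candidate against every inequality and reporting violations; a tiny stand-in (with a generic dense system $A\phi \le b$ replacing the paper's graph-structured one) looks like:

```python
import numpy as np

def violated_rows(A, b, phi, tol=1e-9):
    """Return indices of violated inequalities in A @ phi <= b.
    A nonempty result definitively refutes the candidate phi; an empty
    result is evidence, not a proof."""
    residual = A @ phi - b
    return np.flatnonzero(residual > tol)
```

A discovery agent can use the returned indices directly as feedback, e.g. as the number of violations to minimize.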
Victor Vanthilt, Adwait Naravane, Chenqi Meng, Atsushi Ueda
We present TNRKit, an open-source Julia package for Tensor Network Renormalization (TNR) of two- and three-dimensional classical statistical models and Euclidean lattice field theories. Built on top of TensorKit, it provides a symmetry-aware framework for constructing tensor-network representations of partition functions and coarse-graining them using methods such as TRG, HOTRG, and LoopTNR. Beyond thermodynamic quantities, the package enables the extraction of universal conformal data -- including scaling dimensions and the central charge -- directly from fixed-point tensors. TNRKit is designed with both usability and extensibility in mind, offering a practical platform for applying, benchmarking, and developing modern tensor renormalization algorithms. This paper also serves as a self-contained introduction to the TNR framework.
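The simplest instance of the tensor-network representation of a partition function that TRG-style methods coarse-grain is the zero-field 1D Ising ring, where the network collapses to a transfer matrix. The sketch below (ours, not TNRKit code) validates $Z = \mathrm{Tr}\,T^N$ against a brute-force spin sum.

```python
import numpy as np
from itertools import product

def ising_ring_Z(beta, N):
    """Partition function of the zero-field 1D Ising ring, N sites, via the
    transfer matrix T[s, s'] = exp(beta * s * s')."""
    T = np.array([[np.exp(beta), np.exp(-beta)],
                  [np.exp(-beta), np.exp(beta)]])
    return np.trace(np.linalg.matrix_power(T, N))

def ising_ring_Z_brute(beta, N):
    """Brute-force sum over all 2^N spin configurations (for validation)."""
    Z = 0.0
    for spins in product([-1, 1], repeat=N):
        E = sum(spins[i] * spins[(i + 1) % N] for i in range(N))
        Z += np.exp(beta * E)
    return Z
```

In 2D and 3D the same contraction cannot be done exactly, which is where TRG, HOTRG, and LoopTNR come in as controlled truncations of the coarse-graining step.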
Yi-Shuai Niu, Shing-Tung Yau
Polylab is a MATLAB toolbox for multivariate polynomial scalars and polynomial matrices with a unified symbolic-numeric interface across CPU and GPU-oriented backends. The software exposes three aligned classes: MPOLY for CPU execution, MPOLY_GPU as a legacy GPU baseline, and MPOLY_HP as an improved GPU-oriented implementation. Across these backends, Polylab supports polynomial construction, algebraic manipulation, simplification, matrix operations, differentiation, Jacobian and Hessian construction, LaTeX export, CPU-side LaTeX reconstruction, backend conversion, and interoperability with YALMIP and SOSTOOLS. Versions 3.0 and 3.1 add two practically important extensions: explicit variable identity and naming for safe mixed-variable expression handling, and affine-normal direction computation via automatic differentiation, MF-logDet-Exact, and MF-logDet-Stochastic. The toolbox has already been used successfully in prior research applications, and Polylab Version 3.1 adds a new geometry-oriented computational layer on top of a mature polynomial modeling core. This article documents the architecture and user-facing interface of the software, organizes its functionality by workflow, presents representative MATLAB sessions with actual outputs, and reports reproducible benchmarks. The results show that MPOLY is the right default for lightweight interactive workloads, whereas MPOLY_HP becomes advantageous for reduction-heavy simplification and medium-to-large affine-normal computation; the stochastic log-determinant variant becomes attractive in larger sparse regimes under approximation-oriented parameter choices.
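The kind of sparse multivariate polynomial algebra such a toolbox provides (construction, multiplication, differentiation) can be sketched with a dictionary keyed by exponent tuples; this is an illustration of the operations, not Polylab's MPOLY storage format.

```python
def poly_mul(p, q):
    """Multiply sparse multivariate polynomials {exponent_tuple: coeff}."""
    r = {}
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            e = tuple(a + b for a, b in zip(e1, e2))
            r[e] = r.get(e, 0) + c1 * c2
    return {e: c for e, c in r.items() if c != 0}

def poly_diff(p, var):
    """Partial derivative with respect to variable index `var`."""
    r = {}
    for e, c in p.items():
        if e[var] > 0:
            de = list(e)
            de[var] -= 1
            r[tuple(de)] = r.get(tuple(de), 0) + c * e[var]
    return r
```

Jacobians and Hessians then reduce to repeated `poly_diff` calls over a vector or matrix of such dictionaries, which is the layout-level picture behind the toolbox's matrix operations.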
Yumeng He, Pavel Panchekha
Floating-point arithmetic is error-prone and unintuitive. Floating-point debuggers instrument programs to monitor floating-point arithmetic at run time and flag numerical issues. They estimate residues, i.e., the difference between actual floating-point and ideal real values, for every floating-point value in the program. Prior work explores various approaches for computing these residues accurately and efficiently. Unfortunately, the most efficient methods, based on "error-free transformations", have a high rate of false reports, while the most accurate methods, based on high-precision arithmetic, are very slow. This paper builds on error-free-transformations-based approaches and aims to improve their accuracy while preserving efficiency. To more accurately compute residues, this paper divides residue computation into two steps (rounding error computation and residue function evaluation) and shows how to perform each step accurately via careful improvements to the current state of the art. We evaluate on 44 large scientific computing workloads, focusing on the 14 benchmarks where prior tools produce false reports: our approach eliminates false reports on 10 benchmarks and substantially reduces them on the remaining 3 benchmarks. Moreover, complex numerical issues require additional care due to absorption, where two machine-precision residues cannot both be computed accurately in a single execution. This paper introduces residue override, which re-executes the program multiple times, computing different residues in different executions and assembling a final "patchwork" execution. We evaluate on 169 standard benchmarks drawn from numerical analysis papers and textbooks, requiring only 3.6 re-executions on average. Among 34 benchmarks with false reports in the initial run, residue override is triggered on 29 of them and reduces false reports on 25 of them, averaging 7.1 re-executions.
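The "error-free transformation" primitive underlying this line of work is Knuth's TwoSum, which recovers the exact rounding error of a floating-point addition (a standard textbook kernel, shown here as context rather than the paper's tool).

```python
from fractions import Fraction

def two_sum(a, b):
    """Knuth's TwoSum error-free transformation: returns (s, e) with
    s = fl(a + b) and a + b = s + e exactly in real arithmetic."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e
```

A debugger can propagate such error terms as residues at run time; the paper's contribution is computing those residues accurately in the two-step scheme and across re-executions.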
Daniel Arnström, Emilio Benenati, Giuseppe Belgioioso
We present \texttt{DR-DAQP}, an open-source solver for strongly monotone affine variational inequalities that combines Douglas-Rachford operator splitting with an active-set acceleration strategy. The key idea is to estimate the active set along the iterations to attempt a Newton-type correction. This step yields the exact AVI solution when the active set is correctly estimated, thus overcoming the asymptotic convergence limitation inherent in first-order methods. Moreover, we exploit warm-starting and pre-factorization of relevant matrices to further accelerate evaluation of the algorithm iterations. We prove convergence and establish conditions under which the algorithm terminates in finite time with the exact solution. Numerical experiments on randomly generated AVIs show that \texttt{DR-DAQP} is up to two orders of magnitude faster than the state-of-the-art solver \texttt{PATH}. On a game-theoretic MPC benchmark, \texttt{DR-DAQP} achieves solve times several orders of magnitude below those of the mixed-integer solver \texttt{NashOpt}. A high-performing C implementation is available at \texttt{https://github.com/darnstrom/daqp}, with easily accessible interfaces to Julia, MATLAB, and Python.
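The Douglas-Rachford splitting at the core of the solver can be sketched on a toy box-constrained QP (a special case of an AVI): the prox of the quadratic is a pre-factorizable linear solve, and the prox of the box indicator is a clip. This is a minimal sketch of the splitting, without the paper's active-set Newton correction.

```python
import numpy as np

def dr_box_qp(H, q, lo, hi, gamma=1.0, iters=300):
    """Douglas-Rachford splitting for min 0.5 x'Hx + q'x, lo <= x <= hi."""
    n = len(q)
    Minv = np.linalg.inv(np.eye(n) + gamma * H)  # "factor" once, reuse forever
    z = np.zeros(n)
    for _ in range(iters):
        x = Minv @ (z - gamma * q)        # prox of the quadratic term
        y = np.clip(2 * x - z, lo, hi)    # prox of the box indicator
        z = z + y - x                     # DR update
    return np.clip(x, lo, hi)
```

The slow tail of such first-order iterations is exactly what the Newton-type correction on the estimated active set removes: once the active set is right, one linear solve lands on the exact solution.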
Jeffrey D. Varner
Mar 31, 2026 · q-bio.QM
Mathematical models of natural and man-made systems often have many adjustable parameters that must be estimated from multiple, potentially conflicting datasets. Rather than reporting a single best-fit parameter vector, it is often more informative to generate an ensemble of parameter sets that collectively map out the trade-offs among competing objectives. This paper presents ParetoEnsembles.jl, an open-source Julia package that generates such ensembles using Pareto Optimal Ensemble Techniques (POETs), a simulated-annealing-based algorithm that requires no gradient information. The implementation corrects the original dominance relation from weak to strict Pareto dominance, reduces the per-iteration ranking cost from $O(n^2 m)$ to $O(nm)$ through an incremental update scheme, and adds multi-chain parallel execution for improved front coverage. We demonstrate the package on a cell-free gene expression model fitted to experimental data and a blood coagulation cascade model with ten estimated rate constants and three objectives. A controlled synthetic-data study reveals parameter identifiability structure, with individual rate constants off by several-fold yet model predictions accurate to 7%. A five-replicate coverage analysis confirms that timing features are reliably covered while peak amplitude is systematically overconfident. Validation against published experimental thrombin generation data demonstrates that the ensemble predicts held-out conditions to within 10% despite inherent model approximation error. By making ensemble generation lightweight and accessible, ParetoEnsembles.jl aims to lower the barrier to routine uncertainty characterization in mechanistic modeling.
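The dominance relation at issue can be illustrated in a few lines (terminology varies across the multiobjective literature; here "dominates" means no worse in every objective and strictly better in at least one, for minimization). The naive $O(n^2)$ front extraction below is what an incremental update scheme avoids recomputing from scratch each iteration.

```python
def dominates(u, v):
    """u Pareto-dominates v (minimization): no worse everywhere,
    strictly better somewhere."""
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))

def pareto_front(points):
    """Naive O(n^2) Pareto front: keep points dominated by no other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]
```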
Silvia Bertoluzza
We present BioNetFlux, an open-source Python framework for the numerical simulation of coupled systems of partial differential equations (PDEs) on one-dimensional multi-arc networks by the Hybridized Discontinuous Galerkin method. Its design targets biological transport phenomena on graph-like geometries that arise naturally in microfluidic organ-on-chip (OoC) devices, vascular networks, and in-vitro cell-migration assays.
Shota Kawakami, Daisuke Takahashi
Modern processors deliver higher throughput for lower-precision arithmetic than for higher-precision arithmetic. For matrix multiplication, the Ozaki scheme exploits this performance gap by splitting the inputs into lower-precision components and delegating the computation to optimized lower-precision routines. However, no similar approach exists for the fast Fourier transform (FFT). Here, we propose a method that computes target-precision FFTs using lower-precision FFTs by applying the Ozaki scheme to the cyclic convolution in the Bluestein FFT. The split component convolutions are computed exactly using the number theoretic transform (NTT), an FFT over a finite field, instead of floating-point FFTs, combined with the Chinese remainder theorem. We introduce an upper bound on the number of splits and an NTT-domain accumulation strategy to reduce the NTT call count. As a concrete implementation, we implement a double-precision FFT using 32-bit NTTs and confirm reduced relative error compared with FFTs based on FFTW and Triple-Single precision arithmetic, with stable error across FFT lengths, at most 96 NTT calls, or 64 NTT calls with NTT-domain accumulation. On an Intel Xeon Platinum 8468 for lengths $n=2^{10}$-$2^{18}$, the execution time is approximately 107-1315$\times$ that of FFTW's double-precision FFT, with NTTs accounting for approximately 80% of the total time.
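The NTT building block, an FFT over a finite field, yields bit-exact cyclic convolutions, which is what makes the split-component convolutions exact. A minimal radix-2 sketch over the standard NTT-friendly prime $998244353 = 119 \cdot 2^{23} + 1$ (an illustration of the primitive, not the paper's 32-bit NTT or its CRT/splitting pipeline):

```python
MOD = 998244353   # NTT-friendly prime: 119 * 2^23 + 1
ROOT = 3          # primitive root modulo MOD

def ntt(a, invert=False):
    """Iterative radix-2 number theoretic transform over GF(MOD).
    Length of `a` must be a power of two."""
    n = len(a)
    a = a[:]
    j = 0                              # bit-reversal permutation
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:                 # butterflies, smallest stride first
        w = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w = pow(w, MOD - 2, MOD)   # modular inverse via Fermat
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * wn % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                wn = wn * w % MOD
        length <<= 1
    if invert:
        inv_n = pow(n, MOD - 2, MOD)
        a = [x * inv_n % MOD for x in a]
    return a

def exact_cyclic_convolution(x, y):
    """Exact cyclic convolution over GF(MOD) via forward/inverse NTT."""
    X, Y = ntt(x), ntt(y)
    return ntt([u * v % MOD for u, v in zip(X, Y)], invert=True)
```

Because every value stays in the finite field, the result carries no rounding error; recovering integers larger than MOD is what the Chinese remainder theorem over several primes handles.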
Dinesh Kumar, Jeffrey Donatelli
Model-Based Iterative Reconstruction (MBIR) is important because direct methods, such as Filtered Back-Projection (FBP), can introduce significant noise and artifacts in sparse-angle tomography, especially for time-evolving samples. Although MBIR produces high-quality reconstructions through prior-informed optimization, its computational cost has traditionally limited its broader adoption. In previous work, we addressed this limitation by expressing the Radon transform and its adjoint using non-uniform fast Fourier transforms (NUFFTs), reducing computational complexity relative to conventional projection-based methods. We further accelerated computation by employing a multi-GPU system for parallel processing. In this work, we further accelerate our Fourier-domain framework, by introducing four main strategies: (1) a reformulation of the MBIR forward and adjoint operators that exploits their multi-level Toeplitz structure for efficient Fourier-domain computation; (2) an improved initialization strategy that uses back-projected data filtered with a standard ramp filter as the starting estimate; (3) a hierarchical multi-resolution reconstruction approach that first solves the problem on coarse grids and progressively transitions to finer grids using Lanczos interpolation; and (4) a distributed-memory implementation using MPI that enables near-linear scaling on large high-performance computing (HPC) systems. Together, these innovations significantly reduce iteration counts, improve parallel efficiency, and make high-quality MBIR reconstruction practical for large-scale tomographic imaging. These advances open the door to near-real-time MBIR for applications such as in situ, in operando, and time-evolving experiments.
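The payoff of Toeplitz structure is that a matrix-vector product costs $O(n \log n)$ instead of $O(n^2)$: embed the Toeplitz matrix in a circulant of twice the size and multiply in the Fourier domain. The one-level sketch below illustrates the principle behind the (multi-level) reformulation of the MBIR operators.

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, x):
    """Multiply an n x n Toeplitz matrix (given by its first column and first
    row, with first_col[0] == first_row[0]) by a vector via a 2n-point
    circulant embedding and the FFT."""
    n = len(x)
    # first column of the circulant embedding: [c_0..c_{n-1}, 0, r_{n-1}..r_1]
    c = np.concatenate([first_col, [0.0], first_row[1:][::-1]])
    fx = np.fft.fft(np.concatenate([x, np.zeros(n)]))
    y = np.fft.ifft(np.fft.fft(c) * fx)[:n]
    return y.real
```

A multi-level Toeplitz operator applies the same embedding along each dimension, so the forward and adjoint operators reduce to elementwise products against a precomputed kernel spectrum.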
Jian Zhou
We present a large-scale computational study combining arbitrary-precision arithmetic, sequence acceleration, and the PSLQ integer relation algorithm to discover exact closed-form expressions for fundamental constants arising in asymptotic analysis. We compute the Stokes multipliers C_M of the one-dimensional anharmonic oscillators H = p^2/2 + x^2/2 + g x^{2M} for M = 2, 3, ..., 11, extracting 17-30 significant digits from up to 1200 perturbation coefficients computed at 300-digit working precision. The computational pipeline consists of three stages: (i) Rayleigh-Schrodinger recursion in the harmonic oscillator basis, (ii) Richardson extrapolation of order 40-100 to accelerate convergence of ratio sequences, and (iii) PSLQ searches over bases of Gamma-function values and algebraic numbers. This pipeline discovers three new exact identities: C_3^2 pi^4 = 32, C_5^4 Gamma(1/4)^4 pi^5 = 2^{12} 3^2, and C_7^6 Gamma(1/3)^9 pi^6 = 2^{20} 3^3, in addition to confirming the known C_2^2 pi^3 = 6. Equally significant is a negative result: exhaustive PSLQ searches at 30-digit precision with coefficient bounds up to 2000 find no closed form for C_4, strongly suggesting the x^8 case introduces a genuinely new transcendental number. A number-theoretic pattern emerges: closed-form existence correlates with Euler's totient function phi(M-1)/2, which counts algebraically independent Gamma-function transcendentals at denominator M-1. We formulate conjectures connecting computational constant recognition to classical number theory, and provide all code and data for full reproducibility.
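Stage (ii) of the pipeline, Richardson extrapolation, can be shown on a classic toy target: the partial sums of $\sum 1/i^2 \to \pi^2/6$ converge with $O(1/k)$ error, and extrapolation of order $N$ kills the first $N$ correction terms. The formula below is the standard sequence-extrapolation identity, not the paper's order-40-100 production code.

```python
from math import factorial, pi

def richardson(seq_fn, n, N):
    """Richardson extrapolation of order N applied at index n to a sequence
    A(k) = L + c1/k + c2/k^2 + ...; eliminates the first N correction terms."""
    total = 0.0
    for j in range(N + 1):
        k = n + j
        total += ((-1) ** (j + N) * k ** N * seq_fn(k)
                  / (factorial(j) * factorial(N - j)))
    return total

def partial_basel(k):
    """Partial sums of sum 1/i^2, converging to pi^2/6 with O(1/k) error."""
    return sum(1.0 / i ** 2 for i in range(1, k + 1))
```

With only 17 terms (n = 10, N = 6) this already yields roughly eight digits of $\pi^2/6$, which is the kind of digit extraction that feeds the PSLQ searches; at the paper's precision levels the same scheme runs in 300-digit arithmetic.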
Mohamed Amine Bergach
We present an optimized Fast Fourier Transform (FFT) implementation for Apple Silicon GPUs, achieving 138.45~GFLOPS for $N\!=\!4096$ complex single-precision transforms -- a 29\% improvement over Apple's highly optimized vDSP/Accelerate baseline (107~GFLOPS). Our approach is grounded in a \emph{two-tier local memory model} that formally characterizes the Apple GPU's 208~KiB register file as the primary data-resident tier and the 32~KiB threadgroup memory as an exchange-only tier, extending the decomposition framework established in a 2015 PhD thesis on Intel integrated GPU FFT for radar processing. We implement and evaluate radix-4 and radix-8 split-radix Stockham kernels in Metal Shading Language (MSL), demonstrating that the radix-8 decimation-in-time butterfly with 512 threads yields the best performance. We further present the first investigation of Apple's \texttt{simdgroup\_matrix} 8$\times$8 hardware MMA for FFT butterfly computation and report the counter-intuitive finding that on Apple GPU, threadgroup memory barriers are inexpensive ($\sim$2 cycles) while scattered threadgroup access patterns are the true bottleneck. Our multi-size implementation supports $N\!=\!256$ through $N\!=\!16384$ using a four-step decomposition for sizes exceeding the 32~KiB threadgroup memory limit. All kernels are validated against vDSP reference outputs.
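The Stockham autosort structure referenced above avoids the separate bit-reversal pass by ping-ponging between two buffers, writing each stage's outputs already in sorted order. A scalar radix-2 Python sketch of that skeleton (the Metal kernels use radix-4/8 and fused stages):

```python
from cmath import exp, pi

def stockham_fft(x):
    """Radix-2 Stockham autosort FFT (decimation in time, no bit reversal).
    Length must be a power of two."""
    n = len(x)
    a, b = [complex(v) for v in x], [0j] * n
    l, m = n // 2, 1
    while l >= 1:
        for j in range(l):
            w = exp(-2j * pi * j * m / n)        # twiddle for this block
            for k in range(m):
                c0 = a[k + j * m]
                c1 = a[k + j * m + l * m]
                b[k + 2 * j * m] = c0 + c1       # outputs land pre-sorted
                b[k + 2 * j * m + m] = w * (c0 - c1)
        a, b = b, a                              # ping-pong buffers
        l //= 2
        m *= 2
    return a
```

On a GPU the two buffers map naturally onto register tiles with an exchange tier in between, which is why the autosort variant fits the two-tier local memory model described in the abstract.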
Esmail Abdul Fattah, Elias Krainski, Havard Rue
Bayesian inference often relies on Markov chain Monte Carlo (MCMC) methods, which are particularly needed for non-Gaussian data families. When dealing with complex hierarchical models, the MCMC approach can be computationally demanding in workflows that require repeated model fitting or when working with models of large dimensions with limited hardware resources. Integrated Nested Laplace Approximations (INLA) offer a deterministic alternative for models with non-Gaussian data that belong to the class of latent Gaussian models (LGMs), yielding accurate approximations to posterior marginals in many applied settings. The INLA method was implemented in C as a standalone program, inla, that is widely used in R through the INLA package. This paper introduces PyINLA, a dedicated Python package that provides a Pythonic interface directly to the inla program. Therefore, PyINLA enables specifying LGMs, running INLA-based inference, and accessing posterior summaries directly from Python while leveraging the established INLA implementation. We describe the package design and illustrate its use on representative models, including generalized linear mixed models, time series forecasting, disease mapping, and geostatistical prediction, demonstrating how deterministic Bayesian inference can be performed in Python using INLA in a way that integrates naturally with common scientific computing workflows.
Maik Punke, Marco Salvalaglio
We present a MATLAB-based framework for two- and three-dimensional fast Fourier transforms on multiple GPUs for large-scale numerical simulations using the pseudo-spectral Fourier method. The software implements two complementary multi-GPU strategies that overcome single-GPU memory limitations and accelerate spectral solvers. This approach is motivated by and applied to phase-field crystal (PFC) models, which are governed by tenth-order partial differential equations, require fine spatial resolution, and are typically formulated in periodic domains. Our resulting numerical framework achieves significant speedups, approximately sixfold for standard PFC simulations and up to sixtyfold for multiphysics extensions, compared to a purely CPU-based implementation running on hundreds of cores.
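The pseudo-spectral Fourier method at the heart of such solvers diagonalizes constant-coefficient spatial operators in $k$-space, so linear steps become elementwise multiplications between FFTs. A minimal 1D example on the heat equation $u_t = u_{xx}$ (PFC models replace $-k^2$ with a tenth-order polynomial in $k$, but the pattern is identical; this is an illustration, not the toolbox's multi-GPU code):

```python
import numpy as np

def spectral_heat_step(u, dt, Lx=2 * np.pi):
    """One exact time step of u_t = u_xx on a periodic domain: transform,
    apply the exact decay factor exp(-k^2 dt) mode by mode, transform back."""
    n = len(u)
    k = 2 * np.pi * np.fft.fftfreq(n, d=Lx / n)  # integer wavenumbers
    return np.real(np.fft.ifft(np.fft.fft(u) * np.exp(-k ** 2 * dt)))
```

Multi-GPU strategies then amount to distributing exactly these forward/backward transforms, which is where the FFT dominates the cost of each time step.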
J. M. Alonso, J. Sastre, J. Ibáñez, E. Defez
A method for evaluating matrix polynomials has recently been developed that requires one fewer matrix product ($1M$) than the Paterson--Stockmeyer (PS) method. Since the computational cost for large-scale matrices is asymptotically determined by the number of matrix products, this reduction directly affects the total execution time. However, the coefficients in these optimized formulas emerge as solutions to systems of nonlinear polynomial equations, resulting in multiple potential solution sets. An inappropriate selection of these coefficients can lead to numerical instability in floating-point arithmetic. This paper presents a systematic framework and a MATLAB implementation, MatrixPolEval1, used to obtain and validate stable coefficient sets for matrix polynomials of degrees $m \in \{8, 10, 12\}$ and above. The framework introduces structural variants to maintain stability even when the original configuration fails to yield a robust solution. The provided tool identifies stable coefficient sets using variable precision arithmetic (VPA) and provides a reliability indicator for expected accuracy. Numerical experiments on polynomials arising in applications, including the matrix exponential and geometric series, show that the framework achieves the $1M$ saving while maintaining numerical accuracy comparable to the PS method.
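For context, the Paterson--Stockmeyer baseline evaluates $p(A) = \sum_i c_i A^i$ with roughly $2\sqrt{m}$ matrix products: precompute $A^1,\dots,A^s$ once, then run Horner over blocks in $A^s$. A NumPy sketch of the baseline (the $1M$-saving formulas of the paper are not reproduced here):

```python
import numpy as np

def paterson_stockmeyer(coeffs, A):
    """Evaluate p(A) = sum_i coeffs[i] * A^i with the Paterson--Stockmeyer
    scheme: precompute A^1..A^s, then block-Horner in A^s."""
    n = A.shape[0]
    m = len(coeffs) - 1                          # polynomial degree
    s = max(1, int(np.ceil(np.sqrt(m + 1))))     # block size ~ sqrt(m)
    powers = [np.eye(n)]                         # powers[i] = A^i, i = 0..s
    for _ in range(s):
        powers.append(powers[-1] @ A)
    result = np.zeros_like(A, dtype=float)
    # highest coefficient block first, one product by A^s per block
    for chunk_start in range((m // s) * s, -1, -s):
        block = np.zeros_like(A, dtype=float)
        for i in range(chunk_start, min(chunk_start + s, m + 1)):
            block += coeffs[i] * powers[i - chunk_start]
        result = result @ powers[s] + block
    return result
```

Each saved matrix product matters because, for large $n$, the $O(n^3)$ products dwarf the $O(n^2)$ scaled additions; that is the asymptotic argument behind the $1M$ reduction.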
Julian Bellavita, Lorenzo Pichetti, Thomas Pasquali, Flavio Vella, Giulia Guidi
The multiplication of two sparse matrices, known as SpGEMM, is a key kernel in scientific computing and large-scale data analytics, underpinning graph algorithms, machine learning, simulations, and computational biology, where sparsity is often highly unstructured. The unstructured sparsity makes achieving high performance challenging because it limits both memory efficiency and scalability. In distributed memory, the cost of exchanging and merging partial products across nodes further constrains performance. These issues are exacerbated on modern heterogeneous supercomputers with deep, hierarchical GPU interconnects. Current SpGEMM implementations overlook the gap between intra-node and inter-node bandwidth, resulting in unnecessary data movement and synchronization that fail to fully exploit the fast intra-node interconnect. To address these challenges, we introduce Trident, a hierarchy-aware 2D distributed SpGEMM algorithm that uses communication-avoiding techniques and asynchronous communication to exploit the hierarchical and heterogeneous architecture of modern supercomputing interconnects. Central to Trident is the novel trident partitioning scheme, which enables hierarchy-aware decomposition and reduces internode communication by leveraging the higher bandwidth between GPUs within a node compared to across nodes. Here, we evaluate Trident on unstructured matrices, achieving up to $2.38\times$ speedup over a 2D SpGEMM with a corresponding geometric mean speedup of $1.54\times$. Trident reduces internode communication volume by up to $2\times$ on NERSC's Perlmutter supercomputer. Furthermore, we demonstrate the effectiveness of Trident in speeding up Markov Clustering, achieving up to $2\times$ speedup compared to competing strategies.