Spencer Wadsworth, Nabin Koirala, Nicole Landi, Ofer Harel
Detecting shared neural activity from functional magnetic resonance imaging (fMRI) across individuals exposed to the same stimulus can reveal synchronous brain responses, functional roles of regions, and potential clinical biomarkers. Intersubject correlation (ISC) is the main method for identifying voxelwise shared responses and per-subject variability, but it relies on heavy data summarization and thousands of regional tests, leading to poor uncertainty quantification and multiple testing issues. ISC also does not directly estimate a shared neural response (SNR) function. We propose a model-based alternative, applicable to both task-based and naturalistic fMRI, that simultaneously identifies spatial regions of shared activity and estimates the SNR function. The model combines sparse Gaussian process estimation of the response function with a Bayesian sparsity prior, inspired by the horseshoe prior, to detect voxel activation. A spatially structured extension encourages neighboring voxels to exhibit similar activation patterns. We examine the model's properties, evaluate performance via simulations, and analyze two real-world fMRI datasets, one task-based and one naturalistic. The Bayesian framework provides principled uncertainty quantification for the shared response function and shows improved activation detection and response estimation compared to standard approaches. Model fits demonstrate comparable or superior performance relative to ISC, and the framework opens avenues for clinical applications.
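As a point of reference for the sparsity component, the standard horseshoe prior (the abstract says the paper's prior is only inspired by it, so the exact form may differ) places, for each voxel effect $\beta_v$,
\[
\beta_v \mid \lambda_v, \tau \sim \mathcal{N}(0, \lambda_v^2 \tau^2), \qquad \lambda_v \sim \mathrm{C}^{+}(0, 1), \qquad \tau \sim \mathrm{C}^{+}(0, \tau_0),
\]
where $\mathrm{C}^{+}$ denotes the half-Cauchy distribution: the global scale $\tau$ shrinks most voxel effects toward zero, while the heavy-tailed local scales $\lambda_v$ allow truly active voxels to escape shrinkage.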
Ragnhild Laursen, Marta Pelizzola, Lasse Maretty, Asger Hobolth
Mutational signatures describe the pattern of mutations across the different mutation types. Each mutation type is determined by a base substitution and the flanking nucleotides to the left and right of that substitution. Due to the widespread interest in mutational signatures, several efforts have been devoted to developing methods for robust and stable signature estimation. Here, we combine various extensions of the standard framework for estimating mutational signatures. These extensions include (a) incorporating mutational opportunities into the analysis, (b) allowing for extended sequence contexts, (c) using the Negative Binomial model, and (d) parametrizing the signatures. We show that the combination of these four extensions gives very robust and reliable mutational signatures. In particular, we highlight the importance of including mutational opportunities and parametrizing the signatures when the mutation types describe an extended sequence context with two or three flanking nucleotides on each side of the base substitution.
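For orientation, the standard signature model is a count factorization; a schematic of how extensions (a) and (c) enter it (our notation; the paper's exact parametrization may differ) is
\[
M_{gj} \sim \mathrm{NB}(\mu_{gj}, \alpha), \qquad \mu_{gj} = O_{gj} \sum_{k=1}^{K} E_{gk} S_{kj},
\]
where $M_{gj}$ is the count of mutation type $j$ in sample $g$, $O_{gj}$ is the mutational opportunity (how often that mutation type could have occurred), $E$ holds exposures, $S$ holds signatures, and the Negative Binomial dispersion $\alpha$ relaxes the Poisson assumption; parametrizing $S$, extension (d), reduces the number of free parameters when extended contexts inflate the number of mutation types.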
C. J. R. Murphy-Barltrop, J. Richards, B. Poschlod, A. Sasse, J. Zscheischler
Concurrent floods and concurrent droughts in nearby catchments pose challenges to risk assessment and water management. Climate change is affecting extreme high and low discharge, but the complex interplay between changes in individual catchments and changes in the dependence across catchments makes it difficult to provide accurate assessments of the occurrence probabilities of concurrent extremes. In this work, we use a contemporary statistical deep learning model (the deep SPAR framework) to capture concurrent river floods and droughts in four catchments in the Upper Danube basin, based on discharge simulated by a hydrological model driven with large-ensemble climate model output. The statistical model accurately captures the multivariate extremes of the simulated discharge, which we assess by making use of the large available sample size. We subsequently use our statistical model to study changes in the joint tail behaviour of discharge over time, finding that both compound flooding and drought-like conditions become increasingly likely towards the end of the 21st century under a high-emission scenario. In particular, our results highlight that changes in the dependence structure of extremes contribute strongly to the detected changes, an aspect that would be difficult to capture with traditional approaches. This work paves the way for highly flexible, general inference on compound extremes in hydrological applications, and demonstrates key advantages of using statistical deep learning in this setting.
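For context, the SPAR (semi-parametric angular-radial) construction underlying the deep framework decomposes a multivariate variable into an angle and a radius and models the radial tail conditionally on the angle, roughly
\[
X = R\,W, \quad R = \|X\|, \quad W = X / \|X\|, \qquad \Pr(R > u(w) + r \mid W = w) \approx \bigl(1 + \xi(w)\, r / \sigma(w)\bigr)_{+}^{-1/\xi(w)},
\]
i.e., a generalized Pareto tail whose threshold $u(w)$, scale $\sigma(w)$, and shape $\xi(w)$ vary with direction; in the deep variant these functions are represented by neural networks, which provides the flexibility needed to track changing dependence across catchments.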
Luisa Ferrari, Maria Franco Villoria, Garritt L. Page, Alex Laini
Clustering multivariate binary data is of interest in many scientific fields, including ecology, biomedicine, and social policy. Beyond heuristic clustering algorithms, such data can be modelled using multivariate Bernoulli mixture models. Many Bayesian implementations of these models involve a trade-off between computational efficiency and full posterior inference. We instead propose a Bayesian approach that delivers both. The method fixes the total number of components to a large value and employs an asymmetric Dirichlet prior on the mixture weights. The asymmetric Dirichlet hyperparameters are elicited using the popular Penalized Complexity prior framework, which gives users an intuitive way to inform the induced distribution of the number of clusters. An efficient MCMC algorithm is then developed to fit the model. Simulations and real-world applications demonstrate that the method is competitive with existing alternatives and can outperform them in certain settings. The proposal is illustrated on an ecological dataset on the presence-absence of species across multiple sites, where cluster-specific parameters are modelled on the basis of environmental conditions. Overall, the proposed method provides a computationally efficient, fully Bayesian, and interpretable framework for clustering multivariate binary data, with potential applications across diverse scientific domains.
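A sketch of the overfitted-mixture device described above, with the number of components $K$ deliberately large (the Penalized Complexity elicitation of the hyperparameters is the paper's contribution and is not reproduced here):
\[
y_{ij} \mid z_i = k \sim \mathrm{Bernoulli}(\theta_{kj}), \qquad z_i \mid w \sim \mathrm{Categorical}(w_1, \dots, w_K), \qquad w \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K),
\]
where small, asymmetric hyperparameters $\alpha_k$ push redundant components toward negligible weight, so the number of occupied clusters is inferred from the data rather than fixed in advance.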
Mario Francisco-Fernández, Andrea Meilán-Vila
Spatial orientation is a fundamental cognitive skill that relies on sensory information to update perceived direction. Understanding how sensory conditions influence directional accuracy is important for both cognitive science and the design of assistive technologies. We analyze experimental data in which blind, low-vision, and sighted participants performed spatial updating tasks under five sensory conditions, with signed angular error as the response. To model these data, we propose a nonparametric circular regression framework that accommodates both continuous and categorical predictors via a product-kernel estimator. Bandwidth selection is crucial in this setting, yet developing practical data-driven methods remains challenging. We derive asymptotic bias and variance expressions for the estimator, though these results do not directly lead to a feasible plug-in bandwidth selector. To address this, we develop a bootstrap bandwidth selection criterion tailored to the cosine loss and compare it with cross-validation and rule-of-thumb approaches in simulation studies. Applied to the spatial updating data, the proposed framework reveals nonlinear, condition-specific patterns and quantifies uncertainty via simultaneous bootstrap confidence bands. Across the scenarios considered, the proposed bootstrap selector achieves a favorable bias-variance trade-off and yields stable inference relative to the competing methods. An implementation is available in the R package circMixedReg.
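For orientation, a circular Nadaraya-Watson form that a product-kernel estimator of this kind typically takes (a generic sketch; the paper's estimator may differ in detail) is
\[
\hat m(x) = \operatorname{atan2}\Bigl( \sum_{i=1}^{n} W_i(x) \sin \Theta_i, \; \sum_{i=1}^{n} W_i(x) \cos \Theta_i \Bigr),
\]
where $\Theta_i$ is the angular response (here, signed angular error) and the weight $W_i(x)$ multiplies continuous kernels for the continuous predictors with discrete kernels (e.g., of Aitchison-Aitken type) for the categorical ones; the cosine loss mentioned above measures angular discrepancy as $1 - \cos(\Theta - \hat m)$.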
Keita Fukuyama, Yukiko Mori, Tomohiro Kuroda, Hiroaki Kikuchi
Differential privacy (DP) is a mathematical framework that guarantees individual privacy; however, systematic evaluation of its impact on statistical utility in survival analyses remains limited. In this study, we systematically evaluated the impact of DP mechanisms (the Laplace mechanism and Randomized Response) with data-driven clipping bounds on the Cox proportional hazards model, using 5 clinical datasets ($n = 168$--$6{,}524$), 15 levels of $\varepsilon$ (0.1--1000), and $B = 1{,}000$ Monte Carlo iterations. The data-driven clipping bounds used here are observed min/max values and therefore do not provide formal $\varepsilon$-DP guarantees; the results represent an optimistic lower bound on utility degradation under formal DP. We compared three types of input perturbation (covariates only, all inputs, and the discrete-time model) with output perturbation (dfbeta-based sensitivity), using loss of significance rate (LSR), C-index, and coefficient bias as metrics. At standard DP levels ($\varepsilon \leq 1$), approximately 90% (90--94%) of the significant covariates lost significance, even in the largest dataset ($n = 6{,}524$), and predictive performance approached random levels (test C-index $\approx 0.5$) under many conditions. Among the input perturbation approaches, perturbing only the covariates preserved the risk-set structure and achieved the best recovery, whereas output perturbation (dfbeta-based sensitivity) maintained near-baseline performance at $\varepsilon \geq 5$. At $n \approx 3{,}000$, significance recovered rapidly at $\varepsilon = 3$--10; in practice, however, $\varepsilon \geq 10$ (for predictive performance) to $\varepsilon \geq 30$--60 (for significance preservation) is required. In the moderate-to-high $\varepsilon$ range, false-positive rates increased for variables whose baseline $p$-values were near the significance threshold.
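A minimal sketch of the covariate-only input perturbation with data-driven clipping, as we understand the setup (the function name and the equal per-column budget split are our assumptions, not the paper's code). As the abstract itself warns, using the observed min/max as bounds leaks information, so this does not carry a formal $\varepsilon$-DP guarantee:

    import numpy as np

    def perturb_covariates(X, epsilon, seed=None):
        """Laplace-perturb each covariate column, clipping to observed bounds.

        Data-driven bounds (observed min/max) set the sensitivity, which is
        optimistic: formal epsilon-DP would need data-independent bounds.
        The budget is split equally across the d columns (an assumption).
        """
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        n, d = X.shape
        eps_col = epsilon / d                      # naive budget split
        Xp = np.empty_like(X)
        for j in range(d):
            lo, hi = X[:, j].min(), X[:, j].max()  # data-driven clipping bounds
            sens = hi - lo                          # per-record sensitivity
            noise = rng.laplace(0.0, sens / eps_col, size=n)
            Xp[:, j] = np.clip(X[:, j] + noise, lo, hi)
        return Xp

The perturbed covariates can then be passed to an ordinary Cox fit, which is what lets this input-perturbation variant preserve the risk-set structure.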
Rajius Idzalika, Muhammad Rheza Muztahid, Radityo Eko Prasojo
Timely population displacement estimates are critical for humanitarian response during disasters, but traditional surveys and field assessments are slow. Mobile phone data enables near real-time tracking, yet existing approaches apply uniform displacement definitions regardless of individual mobility patterns, misclassifying regular commuters as displaced. We present a methodological framework addressing this through three innovations: (1) mobility profile classification distinguishing local residents from commuter types, (2) context-aware between-municipality displacement detection accounting for expected location by user type and day of week, and (3) operational uncertainty bounds derived from the baseline coefficient of variation with a disaster adjustment factor, intended for humanitarian decision support rather than formal statistical inference. The framework produces three complementary metrics scaled to population with uncertainty bounds: displacement rates, origin-destination flows, and return dynamics. An Aparri case study following Super Typhoon Nando (2025, Philippines) applies the framework to vendor-provided daily locations from Globe Telecom. Context-aware detection reduced estimated between-municipality displacement by 1.6-2.7 percentage points on weekdays relative to naive methods, a difference attributable to the commuter exception but not independently validated. The method captures between-municipality displacement only; within-municipality evacuation falls outside its scope. The single-case demonstration establishes proof of concept, and external validity requires application across multiple events and locations. The framework provides humanitarian actors with operational displacement information while preserving individual privacy through aggregation.
Markus Johannes Maier, Matthias Scherer
Parametric insurance contracts translate index measurements into compensation for policyholders' losses using predefined payment schemes. These schemes need to be designed carefully to keep basis risk, i.e., the disparity between payouts and true damages, small. Previous research has motivated the use of conditional expectiles as payment schemes, whose compensation is impacted by the policyholder's potentially unknown attitude towards basis risk. To alleviate this model uncertainty and to investigate the impact of (hidden) influencing factors, we characterize the existence and uniqueness of the optimal basis risk weighting in a utility-maximization framework through a set of boundary conditions. In the absence of an optimal solution, we provide comparisons to the utility of no insurance and of full indemnity coverage. We establish a link between location-scale distributions and the separability of conditional expectiles' derivatives, thus improving the understanding of these statistical functionals. A simulation study on parametric hurricane insurance visualizes our results, investigates the influence of premium loading and risk aversion on the optimal weighting, and discusses the challenge of (spatial) loss dependence.
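For completeness, the conditional expectile underlying these payment schemes is the minimizer of an asymmetrically weighted squared loss,
\[
e_\tau(Y \mid X = x) = \operatorname*{arg\,min}_{\theta \in \mathbb{R}} \; \mathbb{E}\bigl[\, |\tau - \mathbf{1}\{Y \le \theta\}| \, (Y - \theta)^2 \mid X = x \bigr],
\]
so the weighting $\tau \in (0,1)$ encodes the policyholder's asymmetric attitude toward over- versus under-compensation, which is precisely the basis risk attitude the paper treats as potentially unknown.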
Alex Iosevich, Vishal Gupta
It is well-known in industrial data science that large values of real-life time series tend to be structured and often follow concrete and visible patterns. In this paper, we use ideas from additive combinatorics and discrete Fourier analysis to give this heuristic a mathematical foundation. Our main tool is the Fourier ratio, a complexity measure previously used in compressed sensing, combined with a generalized version of Chang's lemma from additive combinatorics. Together, these yield a precise prediction: when the Fourier ratio of a time series is small, the set of its largest values can be additively generated by a very small set using only $\{-1,0,1\}$ coefficients. We test this prediction on US inflation data and Delhi climate data, both in their original form and after mean-centering. The numerical results confirm the predicted structure: a generating set of size $4$--$7$ suffices to span large spectra containing dozens of points, even when the Fourier ratio is large enough that our theoretical bounds become loose. These findings provide a rigorous explanation for why extreme values in real-world data are information-rich and structurally significant.
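To fix notation for the structural claim above: for a signal $f$ on $\mathbb{Z}_N$ with discrete Fourier transform $\hat f$, the $\rho$-large spectrum is commonly defined as
\[
\mathrm{Spec}_\rho(f) = \bigl\{ \xi \in \mathbb{Z}_N : |\hat f(\xi)| \ge \rho \, \|f\|_1 \bigr\},
\]
and the classical Chang lemma bounds the additive dimension of this set, for the indicator of a set of density $\alpha$, by $O(\rho^{-2} \log(1/\alpha))$; as we read the abstract, the generalized version used here, combined with a small Fourier ratio, is what forces the set of largest values of the series itself to be additively generated by a small set with $\{-1, 0, 1\}$ coefficients.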
Marios Papamichalis, Regina Ruane
We examine how legal infrastructure organizes eviction in Philadelphia. Using 755,004 Philadelphia landlord--tenant court records filed from 1969 to 2022, we show that eviction filing is most strongly concentrated among plaintiff-side attorneys. In a typical year, the 10 most active plaintiff attorneys, about 3-4% of active plaintiff attorneys, handle 82.0% of represented cases. Filing is also highly routinized: it is largely same-plaintiff filing, concentrated at the same addresses, and reproduced through recurring plaintiff-attorney-property combinations. Eviction, in short, is organized through repeat actors and repeat places. Specialist plaintiff-side counsel changes how cases are handled inside that system. When plaintiffs adopt specialist counsel, filings rise and repeated use of the same addresses increases, although those filing-margin shifts appear to reflect broader reorganization around counsel entry. In stronger within-plaintiff and within-plaintiff-property comparisons, specialist counsel is associated with fewer judgments by agreement, a lower fee share, and much less lockout-trigger language, with weaker evidence for default and downstream enforcement. That structure extends into the courtroom. Court is not a neutral stage: judges and repeated lawyer pairings shape default, agreement, enforcement, and settlement terms. Overall, eviction is organized through a concentrated plaintiff-side bar, repeat places, structured courtroom relationships, and the production of contracts and debt inside court.
Vishnu Teja Kunde, Alessandro Mirri, Jean-Francois Chamberland, Enrico Paolini
Approximate Message Passing (AMP) is a general framework for iterative algorithms, originally developed for compressed sensing and later extended to a wide range of high-dimensional inference problems. Although recent work has advanced matrix AMP, complex AMP, and AMP for non-separable functions independently, a unified state evolution theory for complex AMP with non-separable denoisers has been lacking. This article fills that gap by establishing state evolution in the setting of complex, non-separable denoising functions. The proposed approach constructs an augmented real-valued system that lifts the problem to a higher-dimensional space, then recovers the complex domain through a many-to-one canonical transformation. Under this construction, the Onsager correction naturally involves Wirtinger derivatives, and the resulting state evolution reduces to scalar complex recursions despite the non-separable structure of the denoisers. The framework extends to the matrix-valued setting, accommodating multiple feature vectors simultaneously. This generalization enables AMP to exploit joint structural constraints, such as simultaneous group and element sparsity, in complex-valued recovery problems. The complex sparse group least absolute shrinkage and selection operator (LASSO) serves as a key instantiation, motivated by preamble detection in Orthogonal Time-Frequency Space (OTFS)-based unsourced random access. Numerical experiments confirm that state evolution accurately predicts performance and show that complex non-separable denoising can produce significant gains over separable and real-valued alternatives.
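For context, the real-valued, separable AMP recursion for $y = Ax + w$ (with $A \in \mathbb{R}^{m \times n}$ and $\delta = m/n$) that the article generalizes reads
\[
x^{t+1} = \eta_t\bigl(x^t + A^{*} z^t\bigr), \qquad z^t = y - A x^t + \frac{1}{\delta}\, z^{t-1} \bigl\langle \eta_{t-1}'\bigl(x^{t-1} + A^{*} z^{t-1}\bigr) \bigr\rangle,
\]
where $\langle \cdot \rangle$ denotes an empirical average and the final term is the Onsager correction; in the complex, non-separable setting treated here, the derivative inside the Onsager term becomes a Wirtinger derivative of a vector-valued denoiser, and state evolution tracks an effective noise level $\tau_t$ through scalar complex recursions.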
Koen van Arem, Jakob Söhl, Mirjam Bruinsma, Geurt Jongbloed
The recent growth in data availability in football has increased the risk of incorrect use of data-driven models, making guidelines on their validation and application necessary. The Expected Threat (xT) model is an accessible option for football organizations that start building in-house methods, yet little is known about how to assess its quality. The aim of this study is twofold: to examine how the model error depends on the number of game states and the number of training points, and to translate these results into guidelines for constructing and applying the model. Using the Markov chain underlying the model, we perform theoretical analyses and simulations to study the model error. These show that the model error is approximately log-normally distributed for a specified number of training points and game states. Additionally, we combine the simulations with expert consultation to establish the model error beyond which player evaluations based on the Expected Threat model become unreliable for scouting applications. From this, we derive rules of thumb to ensure the quality of an Expected Threat model before application, and we illustrate through an example how a validated model can be applied in practice. Because the approach generalizes to Expected Possession Value models, this paper illustrates a framework to systematically quantify model quality, despite the ground truth being unobservable in football analytics.
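The underlying Markov-chain recursion, in the form the xT model is usually written after binning the pitch into game states $s$, is
\[
xT(s) = p_{\mathrm{shot}}(s)\, p_{\mathrm{goal}}(s) + p_{\mathrm{move}}(s) \sum_{s'} T(s \to s')\, xT(s'),
\]
where $T(s \to s')$ are move-transition probabilities estimated from event data; since every probability in this fixed-point equation is estimated from finitely many actions, the model error studied in the paper is driven by the number of game states relative to the number of training points.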
Thomas Schincariol
Understanding how conflict events spread over time and space is crucial for predicting and mitigating future violence. However, progress in this area has been limited by the lack of methods capable of capturing the intricate, dynamic patterns of conflict diffusion; untangling these complex trends requires flexible models. This study addresses this gap by analyzing spatio-temporal conflict fatality data using an innovative approach that transforms the data into three-dimensional patterns at the PRIO-GRID level. We adapt a shape-based model called ShapeFinder: by applying the Earth Mover's Distance (EMD) algorithm, we detect and classify these patterns, allowing us to compare and match patterns with high adaptive capacity in all dimensions. Using similar historical patterns, we generate predictions of conflict fatalities and compare these with forecasts from the ViEWS ensemble model, a leading benchmark. Our findings demonstrate that recognizing and analyzing conflict diffusion patterns significantly improves predictive accuracy, outperforming the benchmark model. This research contributes to the study of conflict dynamics by introducing a novel pattern recognition framework that enhances the analysis of spatio-temporal data and offers practical applications for early warning systems.
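A minimal one-dimensional illustration of EMD-based pattern matching of the kind described (the actual ShapeFinder operates on three-dimensional spatio-temporal patterns; the function and names here are ours, and all windows are assumed to have the same length):

    import numpy as np
    from scipy.stats import wasserstein_distance

    def nearest_patterns(query, history, k=5):
        """Rank historical fatality-pattern windows by EMD to a query window."""
        t = np.arange(len(query))
        q = np.asarray(query, dtype=float) + 1e-9   # avoid zero total mass
        scored = []
        for i, h in enumerate(history):
            h = np.asarray(h, dtype=float) + 1e-9
            # EMD between the two temporal mass distributions
            d = wasserstein_distance(t, t, u_weights=q, v_weights=h)
            scored.append((d, i))
        return sorted(scored)[:k]

The matched historical windows can then be used to extrapolate how the query pattern is likely to continue, which is the basic forecasting idea behind the similar-pattern approach.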
Sidi Wang, Kelley Kidwell, Bo Huang, Satrajit Roychoudhury
Overall survival (OS) is the gold standard for assessing patient benefit and cost-effectiveness of new cancer drugs. However, it is often difficult to use OS as the primary endpoint in randomized clinical trials (RCTs) for patients with metastatic cancer, for multiple reasons. In recent years, progression-free survival (PFS) has increasingly been used as the primary endpoint in metastatic cancer RCTs to accelerate development. However, regulatory authorities often seek mature OS data for approval. It is therefore critical to determine the target time at which OS data are expected to be mature enough for reliable statistical inference. Motivated by an advanced renal cell carcinoma (RCC) clinical trial, we develop and investigate different prediction models that leverage information from disease progression to improve prediction of the target OS time. We propose a multivariate joint modeling approach considering components of progression and OS, and extend three models commonly used for association so that they can be used for OS prediction. To the best of our knowledge, this is the first comprehensive statistical study exploring the prediction of OS using different levels of information on disease progression and illustrating these models on a real, complex dataset. Our findings have significant implications for OS prediction.
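One classical association structure that such joint models build on links a longitudinal progression process $m_i(t)$ to the OS hazard (a generic sketch; the paper's multivariate specification is richer):
\[
h_i(t \mid \mathcal{M}_i(t)) = h_0(t) \exp\bigl( \gamma^{\top} w_i + \alpha\, m_i(t) \bigr),
\]
where $h_0$ is a baseline hazard, $w_i$ are baseline covariates, $\mathcal{M}_i(t)$ is the progression history up to $t$, and $\alpha$ measures how strongly current progression status drives OS risk; predicting the OS maturity time then amounts to simulating event times forward under the fitted model.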
Sara Antonijevic, Danielle Sitalo, Brani Vidakovic
Incomplete reporting of diagnostic accuracy data remains a persistent problem in medical research. In many studies, only part of the 2x2 diagnostic table is reported, leaving denominators for diseased and non-diseased groups unknown and preventing direct calculation of sensitivity, specificity, predictive values, and related operating characteristics. To address this limitation, we develop hierarchical Bayesian models for reconstructing incomplete 2x2 diagnostic tables from such partial information. Two motivating scenarios are considered: one in which only a single test-outcome row is observed, and another in which true positives, false positives, and the total sample size are reported but the remaining cells are missing. The proposed models are illustrated on a benchmark breast MRI study with complete counts, treated as partially observed in order to assess reconstruction performance under controlled missingness. The framework yields posterior inference for the missing cell counts and associated diagnostic measures, together with uncertainty quantification in weakly identified settings.
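A sketch of the kind of hierarchical binomial structure such a reconstruction can use, for the scenario in which TP, FP, and the total $N$ are observed (our notation; the paper's priors and hierarchy may differ):
\[
n_D \sim \mathrm{Binomial}(N, \pi), \qquad \mathrm{TP} \sim \mathrm{Binomial}(n_D, Se), \qquad \mathrm{FP} \sim \mathrm{Binomial}(N - n_D, 1 - Sp),
\]
with Beta priors on the prevalence $\pi$, sensitivity $Se$, and specificity $Sp$; the missing cells $\mathrm{FN} = n_D - \mathrm{TP}$ and $\mathrm{TN} = (N - n_D) - \mathrm{FP}$, and all derived diagnostic measures, then inherit posterior distributions that honestly reflect the weak identification.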
Aninda Bhattacharya, Chris J. Dent, Amy L. Wilson, Gabriele C. Hegerl
Extreme weather events during peak winter periods drive resource adequacy risk in Great Britain (GB), with weather sensitivity of the supply-demand balance increasing through additional electric heating and wind generation. This work develops an approach to time-shifting weather within the peak season by adjusting the relevant terms in a statistical model for demand. This allows a more complete consideration of the security-of-supply consequences of a weather series, as there will be relevant conditions where demand is suppressed because the weather occurred at a weekend or during the Christmas holiday. Results on a GB example show that consideration of this counterfactual is indeed important: specifically, winter 2010-11 can either be the most severe in the dataset or insignificant within the resource adequacy model, depending on the alignment of day-of-week with the weather series. Statistical interpretation of the shift model is discussed; it is straightforward for alignment of day-of-week with weather, assuming that all seven alignments are equiprobable, but more subtle for shifting weather in and out of Christmas, as there is no natural maximum on the realistic length of shift, and too large a shift may be physically unrealistic. It is likely that in all systems, assessment of a weather year's severity is incomplete without such consideration of the day-of-week effect; whether longer shifts of weather with respect to date need to be considered will depend on the presence of a major holiday (such as Christmas in GB) in the peak season.
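Schematically (our notation, not the paper's), the shift construction can be written for an additive demand model as
\[
D_t^{(k)} = g(c_t) + h(w_{t+k}) + \varepsilon_t,
\]
where $g$ captures calendar effects (day of week, the Christmas holiday), $h$ captures weather sensitivity, and the shift $k$ moves the weather series against a fixed calendar; a weather year's severity for resource adequacy is then assessed across the admissible values of $k$.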
Giuseppe De Luca, Paolo Li Donni
This report describes the SHARELIFE-MI project, which aims to generate multiple imputations for missing values in the life-course data collected in SHARELIFE Waves 3 and 7. The SHARELIFE study reconstructs individual life histories through retrospective questions covering key biographical domains such as partnerships, fertility, employment, and residence. As in the regular SHARE waves, item nonresponse represents an important source of nonsampling error, particularly for monetary variables, which require conversions across multiple currencies and long time periods. We document the preliminary data recoding and harmonization steps, as well as the design, specification, and implementation of an imputation model based on the fully conditional specification approach. Finally, we assess the internal and external validity of the resulting imputations through comparisons with the observed data, with alternative nonresponse adjustments based on inverse propensity weighting, and with external benchmarks from the regular SHARE waves.
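Schematically, the fully conditional specification approach cycles through the incomplete variables, drawing each from a conditional model given all the others; at iteration $t$ and for variable $j$,
\[
\theta_j^{(t)} \sim p\bigl(\theta_j \mid X_j^{\mathrm{obs}}, X_{-j}^{(t)}\bigr), \qquad X_j^{\mathrm{mis},(t)} \sim p\bigl(X_j \mid X_{-j}^{(t)}, \theta_j^{(t)}\bigr),
\]
iterated to convergence and repeated independently to produce the multiple imputations; the substantive work documented in the report lies in specifying these conditional models for the retrospective life-course variables.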
Andreas Morr, Maya Ben-Yami, Brian Groenke, Christof Schötz, Alessandro Cotronei, Eirik Myrvoll-Nilsen, Sebastian Bathiany, Martin Rypdal, Niklas Boers
Ditlevsen and Ditlevsen [Nature Communications, 2023] (DD23 hereafter) propose a statistical framework to estimate the timing of a potential collapse of the Atlantic Meridional Overturning Circulation (AMOC) based on extrapolating information from observed sea-surface temperature (SST) variability. By fitting a stochastic one-dimensional fold-bifurcation model to an SST-based fingerprint of the AMOC using Maximum Likelihood Estimation (MLE), they conclude that a collapse is most likely to occur in the middle of the 21st century, with a reported 95% confidence interval covering the time span from 2037 to 2109. Given the profound implications of such a claim for both climate and society, it is essential to thoroughly test the robustness of this result, to critically assess the underlying assumptions and uncertainties, and to estimate the extent to which the reported confidence interval reflects the true limits of current knowledge. Here we examine the sensitivity of DD23's results and argue that four types of uncertainty are insufficiently explored in their analysis: (i) structural uncertainty associated with the assumed low-order bifurcation model, (ii) statistical uncertainty in their model fit, (iii) uncertainty in the representativeness of SST-based fingerprints as proxies for the high-dimensional AMOC dynamics, and (iv) uncertainty in the underlying data, arising from non-stationary observational coverage and dataset preprocessing. Using synthetic experiments and a systematic analysis of alternative fingerprints and observational products, we show that the tipping times estimated by DD23 are highly sensitive to the uncertainties listed above, and extend several millennia into the future when these uncertainties are thoroughly propagated.
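The low-order model in question is, up to reparametrization, the normal form of a fold (saddle-node) bifurcation driven toward its critical point (a schematic version; DD23's exact parametrization differs in detail):
\[
\mathrm{d}X_t = \bigl( \lambda(t) + X_t^2 \bigr)\, \mathrm{d}t + \sigma\, \mathrm{d}B_t,
\]
where a stable state exists only while $\lambda(t) < 0$ and vanishes as $\lambda(t)$ crosses zero; the estimated tipping time is the time at which the fitted drift of $\lambda(t)$ is extrapolated to reach criticality, so each of the four uncertainty sources listed above enters that extrapolation directly.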
Jonas Bauer, Christiane Fuchs, Tamara Schamberger
Football fans frequently exhibit pronounced emotional and physiological reactions during high-stakes matches. However, the temporal dynamics of this football fever are rarely modeled as a latent process. Using intensive longitudinal data from Arminia Bielefeld supporters who wore smartwatches during the 2025 German Football Association (DFB) Cup final, we investigate how football fever unfolds. The devices recorded heart rate, stress level, and related indicators at short intervals, allowing us to construct a latent variable for football fever and model its dynamics. We specify a time-dependent structural equation model with latent growth components and autoregressive effects to capture both overall trends and short-term carry-over effects in fans' physiological responses. Results are aggregated across multiple imputations of missing measurements, and model fit is evaluated using adjustments for the high data dimensionality. The results show that football fever follows a V-shaped trajectory: high at kick-off, followed by a steady decline until renewed arousal in the second half, with substantial between-fan heterogeneity in both baseline level and temporal dynamics. Our findings demonstrate that football fever can be adequately represented as a latent variable using structural equation modeling and captured with wearable technology data. This highlights the importance of accounting for temporal dependence when studying dynamic emotional phenomena, e.g., in sports spectatorship.
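The model class described combines a measurement model for the latent fever variable with a growth-plus-autoregression structural part; a generic sketch (the paper's exact specification, e.g. the shape of the trend, may differ) is
\[
y_{it} = \nu + \Lambda\, \eta_{it} + \varepsilon_{it}, \qquad \eta_{it} = \alpha_i + \beta_i\, t + \rho\, \eta_{i,t-1} + \zeta_{it},
\]
where $y_{it}$ stacks heart rate, stress level, and related indicators for fan $i$ at time $t$, $\eta_{it}$ is the latent football fever, the random intercepts and slopes $(\alpha_i, \beta_i)$ carry the between-fan heterogeneity, and $\rho$ captures the short-term carry-over.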
Dylan J. Morris, Lauren Kennedy, Andrew J. Black
Infectious disease dynamics operate across multiple biological scales, with within-host viral dynamics being a key driver of between-host transmission. However, while models that explicitly link these scales exist, none have been developed with statistical inference as a primary goal. In this paper we propose a multiscale model that jointly captures heterogeneous individual-level viral load trajectories and stochastic household transmission, and we develop efficient inference methods to fit it to data. Since full joint inference is computationally difficult, we employ a cut approach that passes information from the within-host to the between-host model but not vice versa. This enables the viral load data to inform transmission parameters such as infection times and symptom onset thresholds. We evaluate the framework on simulated household outbreak data, assessing parameter recovery, computational efficiency, and the effect of viral load sampling frequency on inference quality. Parameter recovery is unbiased when the viral loads are sampled frequently enough. When sampling is sparse, some bias is introduced, but incorporating external viral load data can mitigate this.
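The cut approach mentioned above replaces the full joint posterior with a sequential object in which information flows only from the within-host data $y_W$ to the between-host parameters:
\[
\pi_{\mathrm{cut}}(\theta_W, \theta_B \mid y) = \pi(\theta_W \mid y_W)\, \pi(\theta_B \mid \theta_W, y_B),
\]
so the fitted viral load trajectories inform the transmission model (e.g., infection times and symptom onset thresholds), while feedback from the household transmission data into the within-host parameters is deliberately blocked, keeping inference modular and computationally tractable.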