František Bartoš, Suzanne Hoogeveen, Alexandra Sarafoglou, Samuel Pawel
Empirical claims often rely on one population, design, and analysis. Many-analysts, multiverse, and robustness studies expose how results can vary across plausible analytic choices. Synthesizing these results, however, is nontrivial because all of them are computed from the same dataset. We introduce single-dataset meta-analysis, a weighted-likelihood approach that incorporates the information in the dataset at most once. It prevents the overconfident inferences that would arise if a standard meta-analysis were applied to the data. Single-dataset meta-analysis yields meta-analytic point and interval estimates of the average effect across analytic approaches and of between-analyst heterogeneity, and can be supplemented with classical and Bayesian hypothesis tests. Both the common-effect and random-effects versions of the model can be estimated with standard meta-analytic software after small input adjustments. We demonstrate the method by applying it to the many-analysts study on racial bias in soccer, the many-analysts study of marital status and cardiovascular disease, and the multiverse study on technology use and well-being. The results show how single-dataset meta-analysis complements the qualitative evaluation of many-analysts and multiverse studies.
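A minimal R sketch of what such an input adjustment might look like with standard software (here metafor). The assumption that the weighted likelihood amounts to down-weighting each of the $K$ analyst results by $1/K$, i.e., inflating standard errors by $\sqrt{K}$, is ours for illustration and not the paper's exact specification; the data are made up.

```r
# Sketch only: one way a "small input adjustment" could implement the weighted
# likelihood is to down-weight each of the K analyst results by 1/K, i.e., to
# inflate the standard errors by sqrt(K). This adjustment is an assumption for
# illustration, not the paper's exact specification.
library(metafor)

yi  <- c(0.12, 0.05, 0.20, 0.09, 0.15)  # hypothetical effect estimates from K analysts
sei <- c(0.04, 0.05, 0.06, 0.04, 0.05)  # their standard errors
K   <- length(yi)

naive  <- rma(yi = yi, sei = sei, method = "REML")            # treats analyses as independent
single <- rma(yi = yi, sei = sei * sqrt(K), method = "REML")  # information used at most once

summary(naive)   # overconfident: the dataset is effectively counted K times
summary(single)  # wider intervals reflecting a single shared dataset
```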
František Bartoš, Eric-Jan Wagenmakers
Bayesian inference is on the rise, partly because it allows researchers to quantify parameter uncertainty, evaluate evidence for competing hypotheses, account for model uncertainty, and seamlessly update knowledge as information accumulates. All of these advantages apply to the meta-analytic setting; however, advanced Bayesian meta-analytic methodology is often restricted to researchers with programming experience. To make these tools available to a wider audience, we implemented state-of-the-art Bayesian meta-analysis methods in the Meta-Analysis module of JASP, a free and open-source statistical software package (https://jasp-stats.org/). The module allows researchers to conduct Bayesian estimation, hypothesis testing, and model averaging with models such as meta-regression, multilevel meta-analysis, and publication-bias-adjusted meta-analysis. Results can be interpreted using forest plots, bubble plots, and estimated marginal means. This manuscript provides an overview of the Bayesian meta-analysis tools available in JASP and demonstrates how the software enables researchers of all technical backgrounds to perform advanced Bayesian meta-analyses.
František Bartoš, Eric-Jan Wagenmakers, Christiaan H. Vinkers, Kees P. J. Braun, Willem M. Otte
The delayed and incomplete availability of historical findings and the lack of integrative, user-friendly software hamper the reliable interpretation of new clinical data. We developed a free, open, and user-friendly clinical trial aggregation program that combines a large and representative sample of existing trial data with the latest classical and Bayesian meta-analytic models, including clear output visualizations. Our software is of particular interest for (post-graduate) educational programs (e.g., medicine, epidemiology) and global health initiatives. We demonstrate the database, interface, and plot functionality with a recent randomized controlled trial on effective epileptic seizure reduction in children treated for a parasitic brain infection. The single-trial data are placed into context, and we show how to interpret new results against existing knowledge instantaneously. Our program will be especially useful to those working on contextualizing medical findings. It may help advance global clinical progress as efficiently and openly as possible and stimulate further bridging of clinical data with the latest biostatistical models.
František Bartoš, Patrícia Martinková
Inter-rater reliability (IRR) is a commonly used tool for assessing the quality of ratings from multiple raters. However, applicant selection procedures based on ratings from multiple raters usually result in a binary outcome: the applicant is either selected or not. This final outcome is not considered in IRR, which instead focuses on the ratings of the individual subjects or objects. We outline the connection between the ratings' measurement model (used for IRR) and a binary classification framework. We develop a simple way of approximating the probability of correctly selecting the best applicants, which allows us to compute error probabilities of the selection procedure (i.e., the false positive and false negative rate) or their lower bounds. We draw connections between inter-rater reliability and binary classification metrics, showing that the latter depend solely on the IRR coefficient and the proportion of selected applicants. We assess the performance of the approximation in a simulation study and apply it in an example comparing the reliability of multiple grant peer review selection procedures. We also discuss possible uses of the explored connections in other contexts, such as educational testing, psychological assessment, and health-related measurement, and we implement the computations in the IRR2FPR R package.
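A rough sketch in R, not the IRR2FPR implementation: under an assumed bivariate-normal measurement model in which observed and true scores correlate at $\sqrt{\mathrm{IRR}}$ and the top proportion $q$ of applicants is selected, the error probabilities follow from a single bivariate-normal probability.

```r
# Rough sketch, not the IRR2FPR implementation: assume true and observed (mean)
# ratings are bivariate normal with correlation sqrt(IRR), and that the top
# proportion q of applicants by observed rating is selected.
library(mvtnorm)

irr <- 0.60   # inter-rater reliability of the rating used for selection
q   <- 0.20   # proportion of applicants selected

rho <- sqrt(irr)        # implied correlation between observed and true scores
cut <- qnorm(1 - q)     # selection threshold on both (standardized) scales

# P(selected AND truly among the top q)
p_both <- pmvnorm(lower = c(cut, cut), upper = c(Inf, Inf),
                  mean  = c(0, 0),
                  corr  = matrix(c(1, rho, rho, 1), nrow = 2))[1]

fpr <- 1 - p_both / q   # P(not truly top q | selected)
fnr <- 1 - p_both / q   # P(not selected | truly top q); equal here because both margins use q
c(false_positive_rate = fpr, false_negative_rate = fnr)
```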
Samuel Pawel, František Bartoš, Björn S. Siepe, Anna Lohmann
Simulation studies are commonly used in methodological research for the empirical evaluation of data analysis methods. They generate artificial data sets under specified mechanisms and compare the performance of methods across conditions. However, simulation repetitions do not always produce valid outputs, e.g., due to non-convergence or other algorithmic failures. This phenomenon complicates the interpretation of results, especially when its occurrence differs between methods and conditions. Despite the potentially serious consequences of such "missingness", quantitative data on its prevalence and specific guidance on how to deal with it are currently limited. To address this gap, we reviewed 482 simulation studies published in various methodological journals and systematically assessed the prevalence and handling of missingness. We found that only 23% (111/482) of the reviewed simulation studies mention missingness, with even fewer reporting its frequency (92/482 = 19%) or how it was handled (67/482 = 14%). We propose a classification of missingness and possible solutions. We give various recommendations, most notably to always quantify and report missingness, even if none was observed, to align missingness handling with study goals, and to share code and data for reproduction and reanalysis. Using a case study on publication bias adjustment methods, we illustrate common pitfalls and solutions.
František Bartoš, Eric-Jan Wagenmakers, Maarten Marsman, Don van den Bergh
Bayes factor sensitivity analysis examines how the evidence for one hypothesis over another depends on the prior distribution. In complex models, the standard approach refits the model at each hyper-parameter value, and the total computational cost scales linearly in the grid size. We propose a method that recovers the entire sensitivity curve from a single additional model fit. The key identity decomposes the Bayes factor at any hyper-parameter value $γ_x$ into an ``anchor'' Bayes factor at a fixed reference $γ_0$ and a Savage--Dickey density ratio in an extended model that places a hyper-prior on $γ$. Once this extended model is fit, the Bayes factor at any $γ_x$ follows from the anchor value and a ratio of two posterior density ordinates. To approximate this ratio, we employ the importance-weighted marginal density estimator (IWMDE). Because the sensitivity parameter enters the model only through the prior distribution on the model parameters, the data likelihood cancels in the IWMDE, reducing it to a simple ratio of prior density evaluations on the MCMC draws, without any additional likelihood computation. The resulting estimator is fast, remains accurate even with small MCMC samples, and substantially outperforms kernel density estimation across the full sensitivity range. The method extends naturally to simultaneous sensitivity over multiple hyper-parameters and to Bayesian model averaging. We illustrate it on a univariate Bayesian $t$-test with exact Bayes factors for validation, a bivariate informed $t$-test, and a Bayesian model-averaged meta-analysis, obtaining accurate sensitivity curves at a fraction of the brute-force cost.
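The decomposition described above can be sketched as follows (notation is ours; the abstract states the identity only verbally). Writing $\pi(\gamma)$ for the hyper-prior in the extended model $\mathcal{M}_{\mathrm{ext}}$:

```latex
% Anchor decomposition via the Savage--Dickey density ratio in the extended
% model; the notation (gamma_0, gamma_x, pi, M_ext) is ours.
\[
  \mathrm{BF}_{10}(\gamma_x)
  \;=\;
  \mathrm{BF}_{10}(\gamma_0)
  \times
  \frac{p(\gamma_x \mid y, \mathcal{M}_{\mathrm{ext}}) \,/\, \pi(\gamma_x)}
       {p(\gamma_0 \mid y, \mathcal{M}_{\mathrm{ext}}) \,/\, \pi(\gamma_0)} .
\]
```

With a uniform hyper-prior the prior ordinates cancel, so the sensitivity curve reduces to the anchor Bayes factor times a ratio of two posterior density ordinates, which the IWMDE estimates directly from the MCMC draws of the single extended-model fit.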
František Bartoš, Quentin F. Gronau, Bram Timmers, Willem M. Otte, Alexander Ly, Eric-Jan Wagenmakers
We outline a Bayesian model-averaged meta-analysis for standardized mean differences in order to quantify evidence for both treatment effectiveness $δ$ and across-study heterogeneity $τ$. We construct four competing models by orthogonally combining two present-absent assumptions, one for the treatment effect and one for across-study heterogeneity. To inform the choice of prior distributions for the model parameters, we used 50% of the Cochrane Database of Systematic Reviews to specify rival prior distributions for $δ$ and $τ$. The relative predictive performance of the competing models and rival prior distributions was assessed using the remaining 50% of the Cochrane Database. On average, $\mathcal{H}_1^r$ -- the model that assumes the presence of a treatment effect as well as across-study heterogeneity -- outpredicted the other models, but not by a large margin. Within $\mathcal{H}_1^r$, predictive adequacy was relatively constant across the rival prior distributions. We propose specific empirical prior distributions, both for the field in general and for each of 46 specific medical subdisciplines. An example from oral health demonstrates how the proposed prior distributions can be used to conduct a Bayesian model-averaged meta-analysis in the open-source software R and JASP. The preregistered analysis plan is available at https://osf.io/zs3df/.
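A sketch of how such an analysis might be run in R with the RoBMA package (one possible implementation; the data and prior settings below are illustrative placeholders rather than the empirical prior distributions proposed in the paper):

```r
# Illustrative sketch with the RoBMA package; the data and the prior settings
# are placeholders, NOT the empirical prior distributions proposed in the paper.
library(RoBMA)

d  <- c(0.25, 0.10, 0.40, 0.05)   # hypothetical standardized mean differences
se <- c(0.10, 0.12, 0.15, 0.11)   # their standard errors

fit <- RoBMA(d = d, se = se,
             priors_effect        = prior("normal",   parameters = list(mean = 0, sd = 1)),
             priors_heterogeneity = prior("invgamma", parameters = list(shape = 1, scale = 0.15)),
             priors_bias          = NULL,  # no publication-bias models: plain model averaging
             seed = 1)

summary(fit)  # inclusion Bayes factors for the effect and heterogeneity, model-averaged estimates
```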
František Bartoš, Eric-Jan Wagenmakers, Wolfgang Viechtbauer
Meta-analyses play a crucial part in empirical science, enabling researchers to synthesize evidence across studies and draw more precise and generalizable conclusions. Despite their importance, access to advanced meta-analytic methodology is often limited to scientists and students with considerable expertise in computer programming. To lower the barrier for adoption, we have developed the Meta-Analysis module in JASP (https://jasp-stats.org/), a free and open-source software for statistical analyses. The module offers standard and advanced meta-analytic techniques through an easy-to-use graphical user interface (GUI), allowing researchers with diverse technical backgrounds to conduct state-of-the-art analyses. This manuscript presents an overview of the meta-analytic tools implemented in the module and showcases how JASP supports a meta-analytic practice that is rigorous, relevant, and reproducible.
František Bartoš, Samuel Pawel, Eric-Jan Wagenmakers
Null hypothesis significance testing (NHST) is the dominant approach for evaluating results from randomized controlled trials. Whereas NHST comes with long-run error rate guarantees, its main inferential tool -- the $p$-value -- is only an indirect measure of evidence against the null hypothesis. The main reason is that the $p$-value is based on the assumption that the null hypothesis is true, whereas the likelihood of the data under any alternative hypothesis is ignored. If the goal is to quantify how much evidence the data provide for or against the null hypothesis, it is unavoidable that an alternative hypothesis be specified (Goodman & Royall, 1988). Paradoxes arise when researchers interpret $p$-values as evidence. For instance, results that are surprising under the null may be equally surprising under a plausible alternative hypothesis, such that a $p=.045$ result (`reject the null') does not make the null any less plausible than it was before. Hence, $p$-values have been argued to overestimate the evidence against the null hypothesis. Conversely, statistically non-significant results (i.e., $p>.05$) may nevertheless provide some evidence in favor of the alternative hypothesis. It is therefore crucial for researchers to know when statistical significance and evidence collide, and this requires that a direct measure of evidence be computed and presented alongside the traditional $p$-value.
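A toy calculation (our own illustration, not an example from the paper) makes the point concrete: assuming a normal likelihood for the standardized effect and a diffuse normal prior under the alternative, a just-significant $p = .045$ yields a Bayes factor near 1, here even mild support for the null.

```r
# Toy illustration (ours, not from the paper): a two-sided p = .045 corresponds
# to z of about 2.00. Under H0 the standardized effect delta is 0; under a
# diffuse alternative, delta ~ N(0, 1). With n observations, z | delta is
# approximately N(delta * sqrt(n), 1), so marginally z ~ N(0, 1 + n) under H1.
n <- 100
z <- qnorm(1 - 0.045 / 2)   # ~ 2.00

bf01 <- dnorm(z, mean = 0, sd = 1) / dnorm(z, mean = 0, sd = sqrt(1 + n))
bf01  # ~ 1.4: this "significant" result does not make the null less plausible
```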
František Bartoš, Frederik Aust, Julia M. Haaf
We provide an overview of Bayesian estimation, hypothesis testing, and model averaging and illustrate how they benefit parametric survival analysis. We contrast the Bayesian framework with the currently dominant frequentist approach and highlight its advantages, such as the seamless incorporation of historical data, continuous monitoring of evidence, and incorporation of uncertainty about the true data-generating process. We illustrate the Bayesian approaches on an example data set from a colon cancer trial. In a simulation study, we compare Bayesian parametric survival analysis with frequentist models using AIC/BIC model selection in fixed-n and sequential designs. In the example data set, the Bayesian framework provided evidence for the absence of a positive treatment effect on disease-free survival in patients with resected colon cancer. Furthermore, the Bayesian sequential analysis would have terminated the trial 10.3 months earlier than the standard frequentist analysis. In the simulation study with sequential designs, the Bayesian framework on average reached a decision in almost half the time required by the frequentist counterparts, while maintaining the same power and an appropriate false-positive rate. Under model misspecification, the Bayesian framework yielded a higher false-negative rate than the frequentist counterparts, resulting in a higher proportion of undecided trials. In fixed-n designs, the Bayesian framework showed slightly higher power, slightly elevated error rates, and lower bias and RMSE when estimating treatment effects in small samples. We have made the analytic approach readily available in the RoBSA R package. The outlined Bayesian framework provides several benefits when applied to parametric survival analyses: it uses data more efficiently, can greatly shorten clinical trials, and provides a richer set of inferences.
František Bartoš, Ulrich Schimmack
Publication bias undermines meta-analytic inference, yet visual diagnostics for detecting model misfit due to publication bias are lacking. We propose the z-curve plot, an absolute model-fit diagnostic focused on publication bias. The z-curve plot overlays the model-implied posterior predictive distribution of $z$-statistics on the observed distribution of $z$-statistics and enables direct comparison across candidate meta-analytic models (e.g., random-effects meta-analysis, selection models, PET-PEESE, and RoBMA). Models that approximate the data well show minimal discrepancy between the observed and predicted distribution of $z$-statistics, whereas poor-fitting models show pronounced discrepancies. Discontinuities at significance thresholds or at zero provide visual evidence of publication bias; models that account for this bias track these discontinuities in the observed distribution. We further extrapolate the estimated publication-bias-adjusted models to a setting without publication bias and obtain the corresponding posterior predictive distribution. From this extrapolation, we derive three summaries: the expected discovery rate (EDR), the false discovery risk (FDR), and the expected number of missing studies (N missing). We demonstrate the visualization and its interpretation using simulated datasets and four meta-analyses spanning different degrees of publication bias. The method is implemented in the RoBMA R package.
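A simplified plug-in sketch of the core idea in R (not the RoBMA posterior predictive implementation; the effect sizes are made up): a fitted random-effects model with mean $\hatμ$ and heterogeneity $\hatτ^2$ implies $z_i \sim N(\hatμ/se_i,\, 1 + \hatτ^2/se_i^2)$, which can be overlaid on the observed $z$-statistics.

```r
# Simplified plug-in sketch (not the RoBMA posterior predictive implementation);
# effect sizes are made up. A random-effects fit with mean mu and heterogeneity
# tau^2 implies z_i = y_i / se_i ~ N(mu / se_i, 1 + tau^2 / se_i^2).
library(metafor)

yi  <- c(0.31, 0.25, 0.44, 0.18, 0.52, 0.29)
sei <- c(0.12, 0.10, 0.18, 0.09, 0.20, 0.11)

fit  <- rma(yi = yi, sei = sei, method = "REML")
mu   <- coef(fit)
tau2 <- fit$tau2

z_obs  <- yi / sei
z_grid <- seq(-2, 8, length.out = 400)
z_pred <- rowMeans(sapply(seq_along(sei), function(i)
  dnorm(z_grid, mean = mu / sei[i], sd = sqrt(1 + tau2 / sei[i]^2))))

hist(z_obs, freq = FALSE, breaks = 10, xlim = range(z_grid),
     main = "Observed vs. model-implied z-statistics", xlab = "z")
lines(z_grid, z_pred, lwd = 2)      # model-implied distribution
abline(v = qnorm(0.975), lty = 2)   # threshold where selection would leave a discontinuity
```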
František Bartoš, Maximilian Maier, Eric-Jan Wagenmakers, Franziska Nippold, Hristos Doucouliagos, John P. A. Ioannidis, Willem M. Otte, Martina Sladekova, Teshome K. Deresssa, Stephan B. Bruns, Daniele Fanelli, T. D. Stanley
Publication selection bias undermines the systematic accumulation of evidence. To assess the extent of this problem, we survey over 68,000 meta-analyses containing over 700,000 effect size estimates from medicine (67,386/597,699), environmental sciences (199/12,707), psychology (605/23,563), and economics (327/91,421). Our results indicate that meta-analyses in economics are the most severely contaminated by publication selection bias, closely followed by meta-analyses in environmental sciences and psychology, whereas meta-analyses in medicine are contaminated the least. After adjusting for publication selection bias, the median probability of the presence of an effect decreased from 99.9% to 29.7% in economics, from 98.9% to 55.7% in psychology, from 99.8% to 70.7% in environmental sciences, and from 38.0% to 29.7% in medicine. The median absolute effect sizes (in terms of standardized mean differences) decreased from d = 0.20 to d = 0.07 in economics, from d = 0.37 to d = 0.26 in psychology, from d = 0.62 to d = 0.43 in environmental sciences, and from d = 0.24 to d = 0.13 in medicine.
František Bartoš, Willem M. Otte, Quentin F. Gronau, Bram Timmers, Alexander Ly, Eric-Jan Wagenmakers
Bayesian model-averaged meta-analysis allows quantification of evidence for both treatment effectiveness $μ$ and across-study heterogeneity $τ$. We use the Cochrane Database of Systematic Reviews to develop discipline-wide empirical prior distributions for $μ$ and $τ$ for meta-analyses of binary and time-to-event clinical trial outcomes. First, we use 50% of the database to estimate the parameters of the candidate parametric families. Second, we use the remaining 50% of the database to select the best-performing parametric families and to explore essential assumptions about the presence or absence of treatment effectiveness and across-study heterogeneity in real data. We find that most meta-analyses of binary outcomes are more consistent with the absence of the meta-analytic effect or heterogeneity, while meta-analyses of time-to-event outcomes are more consistent with their presence. Finally, we use the complete database -- with close to half a million trial outcomes -- to propose specific empirical prior distributions, both for the field in general and for specific medical subdisciplines. An example from acute respiratory infections demonstrates how the proposed prior distributions can be used to conduct a Bayesian model-averaged meta-analysis in the open-source software R and JASP.
Ulrich Schimmack, František Bartoš
The influential claim that most published results are false raised concerns about the trustworthiness and integrity of science. Since then, numerous attempts to examine the rate of false-positive results have failed to settle this question empirically. Here we propose a new way to estimate the false positive risk and apply the method to the results of (randomized) clinical trials in top medical journals. Contrary to claims that most published results are false, we find that the traditional significance criterion of $α = .05$ produces a false positive risk of 13%. Adjusting $α$ to .01 lowers the false positive risk to less than 5%. However, our method does provide clear evidence of publication bias that leads to inflated effect size estimates. These results provide a solid empirical foundation for evaluations of the trustworthiness of medical research.
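One way to make the link between discovery rates and false positive risk concrete is Soric's (1989) bound, which the authors' z-curve framework uses to convert an estimated expected discovery rate (EDR) into a maximum false discovery risk; whether the paper's estimate is computed exactly this way is our assumption, and the numbers below are purely illustrative.

```latex
% Soric's (1989) bound converting an (estimated) discovery rate into a maximum
% false discovery risk at level alpha; the numbers are purely illustrative.
\[
  \widehat{\mathrm{FDR}}_{\max}
  = \left(\frac{1}{\widehat{\mathrm{EDR}}} - 1\right)\frac{\alpha}{1-\alpha},
  \qquad\text{e.g.,}\quad
  \widehat{\mathrm{EDR}} = 0.30,\ \alpha = .05
  \;\Rightarrow\;
  \widehat{\mathrm{FDR}}_{\max} \approx 0.12 .
\]
```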
František Bartoš, Eric-Jan Wagenmakers
A staple of Bayesian model comparison and hypothesis testing, Bayes factors are often used to quantify the relative predictive performance of two rival hypotheses. The computation of Bayes factors can be challenging, however, which has contributed to the popularity of convenient approximations such as the BIC. Unfortunately, these approximations can fail in the case of informed prior distributions. Here we address this problem by outlining an approximation to informed Bayes factors for a focal parameter $θ$. The approximation is computationally simple and requires only the maximum likelihood estimate $\hatθ$ and its standard error. The approximation uses an estimated likelihood of $θ$ and assumes that the posterior distribution for $θ$ is unaffected by the choice of prior distribution for the nuisance parameters. The resulting Bayes factor for the null hypothesis $\mathcal{H}_0: θ= θ_0$ versus the alternative hypothesis $\mathcal{H}_1: θ\sim g(θ)$ is then easily obtained using the Savage--Dickey density ratio. Three real-data examples highlight the speed and accuracy of the approximation compared to bridge sampling and Laplace's method. The proposed approximation facilitates Bayesian reanalyses of standard frequentist results, encourages the application of Bayesian tests with informed priors, and alleviates the computational challenges that often frustrate both Bayesian sensitivity analyses and Bayes factor design analyses. The approximation deteriorates with small sample sizes and when the posterior distribution of the focal parameter is substantially influenced by the prior distributions on the nuisance parameters. The proposed methodology may also be used to approximate the posterior distribution for $θ$ under $\mathcal{H}_1$.
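A minimal numerical sketch of the described approximation in R (the estimate, standard error, and informed prior below are our illustrative choices):

```r
# Minimal numerical sketch of the approximation (estimate, standard error, and
# informed prior are illustrative choices): approximate the likelihood of the
# focal parameter theta by N(theta_hat, se^2), combine it with the informed
# prior g(theta) under H1, and apply the Savage--Dickey density ratio at theta_0.
theta_hat <- 0.30   # maximum likelihood estimate (hypothetical)
se        <- 0.12   # its standard error (hypothetical)
theta_0   <- 0      # value tested under H0

g   <- function(theta) dnorm(theta, mean = 0.35, sd = 0.10)    # informed prior under H1
lik <- function(theta) dnorm(theta_hat, mean = theta, sd = se) # estimated likelihood

marg  <- integrate(function(theta) lik(theta) * g(theta), -Inf, Inf)$value
post0 <- lik(theta_0) * g(theta_0) / marg   # posterior ordinate at theta_0 under H1

bf01 <- post0 / g(theta_0)   # Savage--Dickey: BF_01 = posterior / prior ordinate at theta_0
bf01
```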
František Bartoš, Alexandra Sarafoglou, Henrik R. Godmann, Amir Sahrani, David Klein Leunk, Pierre Y. Gui, David Voss, Kaleem Ullah, Malte J. Zoubek, Franziska Nippold, Frederik Aust, Felipe F. Vieira, Chris-Gabriel Islam, Anton J. Zoubek, Sara Shabani, Jonas Petter, Ingeborg B. Roos, Adam Finnemann, Aaron B. Lob, Madlen F. Hoffstadt, Jason Nak, Jill de Ron, Koen Derks, Karoline Huth, Sjoerd Terpstra, Thomas Bastelica, Magda Matetovici, Vincent L. Ott, Andreea S. Zetea, Katharina Karnbach, Michelle C. Donzallaz, Arne John, Roy M. Moore, Franziska Assion, Riet van Bork, Theresa E. Leidinger, Xiaochang Zhao, Adrian Karami Motaghi, Ting Pan, Hannah Armstrong, Tianqi Peng, Mara Bialas, Joyce Y. -C. Pang, Bohan Fu, Shujun Yang, Xiaoyi Lin, Dana Sleiffer, Miklos Bognar, Balazs Aczel, Eric-Jan Wagenmakers
Many people have flipped coins, but few have stopped to ponder the statistical and physical intricacies of the process. We collected $350{,}757$ coin flips to test the counterintuitive prediction from a physics model of human coin tossing developed by Diaconis, Holmes, and Montgomery (DHM; 2007). The model asserts that when people flip an ordinary coin, it tends to land on the same side it started -- DHM estimated the probability of a same-side outcome to be about 51\%. Our data lend strong support to this precise prediction: the coins landed on the same side more often than not, $\text{Pr}(\text{same side}) = 0.508$, 95\% credible interval (CI) [$0.506$, $0.509$], $\text{BF}_{\text{same-side bias}} = 2359$. Furthermore, the data revealed considerable between-people variation in the degree of this same-side bias. Our data also confirmed the generic prediction that when people flip an ordinary coin -- with the initial side-up randomly determined -- it is equally likely to land heads or tails: $\text{Pr}(\text{heads}) = 0.500$, 95\% CI [$0.498$, $0.502$], $\text{BF}_{\text{heads-tails bias}} = 0.182$. Moreover, this lack of heads-tails bias does not appear to vary across coins. Additional analyses revealed that the within-people same-side bias decreased as more coins were flipped, an effect consistent with the possibility that practice makes people flip coins in a less wobbly fashion. Our data therefore provide strong evidence that when some (but not all) people flip a fair coin, it tends to land on the same side it started.
František Bartoš, Samuel Pawel, Björn S. Siepe
Simulation studies are widely used to evaluate statistical methods. However, new methods are often introduced and evaluated using data-generating mechanisms (DGMs) devised by the same authors. This coupling creates misaligned incentives, e.g., the need to demonstrate the superiority of new methods, potentially compromising the neutrality of simulation studies. Furthermore, results of simulation studies are often difficult to compare due to differences in DGMs, competing methods, and performance measures. This fragmentation can lead to conflicting conclusions, hinder methodological progress, and delay the adoption of effective methods. To address these challenges, we introduce the concept of living synthetic benchmarks. The key idea is to disentangle method and simulation study development and continuously update the benchmark whenever a new DGM, method, or performance measure becomes available. This separation benefits the neutrality of method evaluation, emphasizes the development of both methods and DGMs, and enables systematic comparisons. In this paper, we outline a blueprint for building and maintaining such benchmarks, discuss the technical and organizational challenges of implementation, and demonstrate feasibility with a prototype benchmark for publication bias adjustment methods. We conclude that living synthetic benchmarks have the potential to foster neutral, reproducible, and cumulative evaluation of methods, benefiting both method developers and users.
Nikola Sekulovski, František Bartoš, Don van den Bergh, Giuseppe Arena, Henrik R. Godmann, Vipasha Goyal, Julius M. Pfadt, Maarten Marsman, Adrian E. Raftery
Model uncertainty is a central challenge in statistical models for binary outcomes such as logistic regression, arising when it is unclear which predictors should be included in the model. Many methods have been proposed to address this issue for logistic regression, but their relative performance under realistic conditions remains poorly understood. We therefore conducted a preregistered, simulation-based comparison of 28 established methods for variable selection and inference under model uncertainty, using 11 empirical datasets spanning a range of sample sizes and numbers of predictors, in cases both with and without separation. We found that Bayesian model averaging (BMA) methods based on $g$-priors, particularly $g = \max(n, p^2)$, show the strongest overall performance when separation is absent. When separation occurs, penalized likelihood approaches, especially the LASSO, provide the most stable results, while BMA with the local empirical Bayes (EB-local) prior is competitive in both situations. These findings offer practical guidance for applied researchers on how to effectively address model uncertainty in logistic regression in modern empirical and machine learning research.
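An illustrative sketch in R of the two headline approaches on simulated data (our own code, not the study's; in particular, the assumption that BAS's g.prior() can be passed to bas.glm() should be checked against the package documentation):

```r
# Illustrative sketch (ours, not the study's code): BMA for logistic regression
# with a g-prior via BAS, and the LASSO via glmnet, on simulated data. That
# g.prior() can be supplied to bas.glm() is an assumption to verify against the
# BAS documentation.
library(BAS)
library(glmnet)

set.seed(1)
n <- 200; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(0.8 * X[, 1] - 0.5 * X[, 2]))
dat <- data.frame(y = y, X)

# Bayesian model averaging with g = max(n, p^2)
fit_bma <- bas.glm(y ~ ., data = dat, family = binomial(),
                   betaprior = g.prior(max(n, p^2)),
                   modelprior = uniform())
summary(fit_bma)   # posterior inclusion probabilities, top models

# LASSO with cross-validated penalty as the penalized-likelihood alternative
fit_lasso <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(fit_lasso, s = "lambda.min")
```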
Patrícia Martinková, František Bartoš, Marek Brabec
Inter-rater reliability (IRR), which is a prerequisite of high-quality ratings and assessments, may be affected by contextual variables such as the rater's or ratee's gender, major, or experience. Identifying such sources of heterogeneity in IRR is important for implementing policies that can decrease measurement error and increase IRR by focusing on the most relevant subgroups. In this study, we propose a flexible approach for assessing IRR in cases of heterogeneity due to covariates by directly modeling differences in variance components. We use Bayes factors to select the best-performing model, and we suggest Bayesian model averaging as an alternative approach for obtaining IRR and variance component estimates, allowing us to account for model uncertainty. We use inclusion Bayes factors, which consider the whole model space, to provide evidence for or against differences in variance components due to covariates. The proposed method is compared with other Bayesian and frequentist approaches in a simulation study, and we demonstrate its superiority in some situations. Finally, we provide real-data examples from grant proposal peer review, demonstrating the usefulness of this method and its flexibility in generalizing to more complex designs.