What happens when we correct for publication bias?

There is a recent paper by Leif D. Nelson, Uri Simonsohn, and Joseph P. Simmons on this topic; here is the abstract:

Journals tend to publish only statistically significant evidence, creating a scientific record that markedly overstates the size of effects. We provide a new tool that arrives at unbiased effect size estimates while fully ignoring the unpublished record. It capitalizes on the fact that the distribution of significant p-values, p-curve, is a function of the true underlying effect. Researchers armed with only the sample sizes and p-values of the published findings can fully correct for publication bias. We demonstrate the use of p-curve by reassessing the evidence for the impact of “choice overload” from the Psychology literature, and the impact of minimum wage on unemployment from the Economics literature.

When it comes to both the choice overload effect and the minimum wage, correcting for publication bias implies a lack of significance in the overall tenor of the results.  In passing I am not sure the minimum wage is the best example here, since a “no result” paper on that question seems to me entirely publishable these days and indeed for some while.

For the pointer I thank Kevin Lewis.  And Kevin Drum adds comment.


This paper does something similar for a large number of economic fields. I uploaded the central tables here:


The use of "scientific" in the context of economics is highly questionable, given that economics rejects nature as a constraint. Psychology is a field that is clearly subject to publication bias, e.g., the Milgram experiments are known for just one of the experiments instead of all of them.

The inherent bias in economic thinking that prevents it from being a science is best illustrated by economists rejecting entropy as a constraint on economics.

The paper, I think, is doing the following: if it finds that a bunch of papers reached statistical significance by the barest margin, it penalizes those results on the grounds that, by chance, there must be a bunch of unpublished papers that found no statistical significance (since the significance was borderline). Pretty clever, and akin to those studies that find a 50% increase in some disease if you do X, where you read further and find it's 3 out of 10 million cases rather than 2 out of 10 million (a 50% increase, but hardly overwhelming).
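That logic can be sketched in a quick simulation (a toy illustration of the p-curve idea, not the authors' actual estimator; the effect sizes and sample sizes below are made up). Under a true null, the p-values that clear the .05 bar are uniformly distributed, so borderline values are as common as tiny ones; under a real effect, the curve piles up near zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def significant_pvalues(effect_size, n=20, n_studies=20000, alpha=0.05):
    """Simulate many two-sample t-tests and keep only the
    'published' (significant) p-values."""
    pvals = []
    for _ in range(n_studies):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(effect_size, 1.0, n)
        p = stats.ttest_ind(treated, control).pvalue
        if p < alpha:
            pvals.append(p)
    return np.array(pvals)

# Share of significant p-values below .025: about half under the
# null (flat p-curve), much more than half under a true effect
# (right-skewed p-curve).
null_share = (significant_pvalues(0.0) < 0.025).mean()
true_share = (significant_pvalues(0.5) < 0.025).mean()
print(null_share, true_share)
```

A surplus of barely-significant p-values relative to that uniform benchmark is what flags publication bias or selective reporting.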

This may even be backwards!... The real publication bias is towards papers that find something new or overturn conventional wisdom. In most cases that means a positive finding (e.g., "GMO food causes cancer"), but in the case of the minimum wage a finding of no impact would have been more newsworthy than finding an impact (why even bother publishing the latter? It's conventional wisdom).

John List pointed out in an interview a few years ago that the very popular theory of Stereotype Threat appears to be largely due to publication bias. Nobody wants to hear that you couldn't replicate Stereotype Threat, as List, Levitt, and Fryer couldn't, so only the studies that come up with a politically correct finding get published.

Having accurate p-values is predicated on having the right standard errors, and using the right ones often makes your p-values larger, i.e. less significant. I suspect very significant p-values are often a symptom of the wrong standard errors, so I doubt it is possible to fully correct for publication bias with sample sizes and p-values alone.
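The standard-errors point can be illustrated with a made-up example (my own setup, not from the comment): assign a placebo "treatment" at the cluster level, add a within-cluster correlated error, and test with a naive t-test that assumes independent observations. The nominal 5% test then rejects far more often than 5%, even though there is no effect at all.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def naive_rejection_rate(n_sims=2000, clusters=20, per_cluster=25, alpha=0.05):
    """No true effect; treatment assigned by cluster; errors are
    correlated within clusters. The naive t-test ignores the clustering."""
    rejections = 0
    for _ in range(n_sims):
        # exactly half the clusters get the placebo treatment
        treated_cluster = rng.permutation(clusters) < clusters // 2
        treat = np.repeat(treated_cluster, per_cluster)
        cluster_effect = np.repeat(rng.normal(0, 1, clusters), per_cluster)
        y = cluster_effect + rng.normal(0, 1, clusters * per_cluster)
        p = stats.ttest_ind(y[treat], y[~treat]).pvalue
        rejections += p < alpha
    return rejections / n_sims

print(naive_rejection_rate())  # far above the nominal 5%
```

Cluster-robust standard errors would restore the 5% rate here, which is the sense in which "the right ones" make p-values grow.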

Relatedly, there is a tendency for p-values to pool toward .05 from both directions. From the high side, this is largely a result of focusing on subpopulation treatment effects. This also means that better studies, whose findings are more likely to be true, need not have stronger p-values.

Finally, a p-value is just the probability of seeing data at least this extreme when there is no effect. But as Andrew Gelman points out, "no effect" is rarely the reasonable counterfactual: in many cases we ran the experiment because theory and related experiments suggest there should be an effect. The question is often instead about effect magnitudes, in which case even significant p-values don't tell you much about whether the effect was measured properly. That is in part a criticism of classical hypothesis testing and not just this approach.

Comments for this post are closed