Why Most Published Research Findings are False

Writing in PLoS Medicine, John Ioannidis says:

There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims. However, this should not be surprising. It can be proven that most claimed research findings are false.

Ioannidis presents a Bayesian analysis of the problem which most people will find utterly confusing. Here’s the idea in a diagram.

Suppose there are 1000 possible hypotheses to be tested. There are an infinite number of false hypotheses about the world and only a finite number of true hypotheses so we should expect that most hypotheses are false. Let us assume that of every 1000 hypotheses 200 are true and 800 false.

It is inevitable in a statistical study that some false hypotheses are accepted as true. In fact, at the conventional 5% significance level, standard statistical practice implies that about 5% of false hypotheses will be accepted as true. Thus, out of the 800 false hypotheses, 40 will be accepted as “true,” i.e. statistically significant.

It is also inevitable in a statistical study that we will fail to accept some true hypotheses. (Yes, I do know that a proper statistician would say “fail to reject the null when the null is in fact false,” but that is ugly.) It’s hard to say what the probability is of not finding evidence for a true hypothesis, because it depends on a variety of factors such as the sample size, but let’s say that of every 200 true hypotheses we will correctly identify 120, or 60%. Putting this together, we find that of every 160 (120+40) hypotheses for which there is statistically significant evidence, only 120 will in fact be true, a rate of 75%.
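The arithmetic above is easy to check in a few lines of Python. This is only a sketch of the example's numbers; the function and parameter names are my own labels, not Ioannidis's notation.

```python
def share_true_among_significant(n_hypotheses, share_true, detection_rate, false_positive_rate):
    """Fraction of statistically significant results that are in fact true."""
    n_true = n_hypotheses * share_true            # 200 in the example
    n_false = n_hypotheses - n_true               # 800 in the example
    true_hits = n_true * detection_rate           # true hypotheses correctly identified
    false_hits = n_false * false_positive_rate    # false hypotheses accepted as "true"
    return true_hits / (true_hits + false_hits)


# Numbers from the example: 1000 hypotheses, 20% true,
# 60% of true hypotheses detected, 5% of false ones accepted.
rate = share_true_among_significant(1000, 0.20, 0.60, 0.05)
print(f"{rate:.0%} of significant results are true")
```

Changing any of the four inputs shows how sensitive the 75% figure is to the assumptions behind it.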

(By the way, the multiplying factors in the diagram are for those who wish to compare with Ioannidis’s notation.)

Ioannidis says most published research findings are false. This is plausible in his field of medicine, where it is easy to imagine that there are more than 800 false hypotheses out of 1000. In medicine, there is hardly any theory to exclude a hypothesis from being tested. Want to avoid colon cancer? Let’s see if an apple a day keeps the doctor away. No? What about a serving of bananas? Let’s try vitamin C and don’t forget red wine. Studies in medicine also have notoriously small sample sizes. Lots of studies that make the NYTimes involve fewer than 50 people, which reduces the probability that you will accept a true hypothesis and raises the probability that the typical study is false.
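To see how quickly things deteriorate, here is a rough sketch that reruns the same bookkeeping with other inputs. The 5% false positive rate comes from the example; the alternative shares of true hypotheses and the detection rates ("power") are illustrative assumptions, not estimates for medicine or any other field.

```python
ALPHA = 0.05  # share of false hypotheses that come out "significant"


def share_true(prior, power, alpha=ALPHA):
    """Fraction of significant findings that are true.

    prior: share of tested hypotheses that are true (hypothetical input)
    power: share of true hypotheses that are detected (hypothetical input)
    """
    true_hits = prior * power
    false_hits = (1 - prior) * alpha
    return true_hits / (true_hits + false_hits)


def break_even_prior(power, alpha=ALPHA):
    """Share of true hypotheses at which exactly half of significant findings are true."""
    return alpha / (alpha + power)


for prior in (0.20, 0.10, 0.05):
    for power in (0.60, 0.20):
        print(f"true share {prior:.0%}, power {power:.0%}: "
              f"{share_true(prior, power):.0%} of findings true")
```

With a detection rate of 20%, the example's 200 true hypotheses out of 1000 is exactly the break-even point: half of the significant findings are false, and any less favorable mix makes false findings the majority.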

So economics does OK on the main factors in the diagram, but there are other effects which also reduce the probability that the typical result is true, and economics has no advantages on these (see the extension).

Sadly, things get really bad when lots of researchers are chasing the same set of hypotheses. Indeed, the larger the number of researchers the more likely the average result is to be false! The easiest way to see this is to note that when we have lots of researchers every true hypothesis will be found to be true but eventually so will every false hypothesis. Thus, as the number of researchers increases, the probability that a given result is true goes to the probability in the population, in my example 200/1000 or 20 percent.
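A rough way to see the convergence, under the simplifying assumption (mine) that each of n researchers tests every hypothesis independently and that a hypothesis counts as "found" once at least one of them gets a significant result. The default numbers are those of the example; the function name is just a label.

```python
def share_true_with_n_researchers(n, prior=0.20, power=0.60, alpha=0.05):
    """Fraction of 'found' hypotheses that are true when n independent
    researchers each test every hypothesis (an illustrative assumption)."""
    found_if_true = 1 - (1 - power) ** n    # at least one researcher detects a true hypothesis
    found_if_false = 1 - (1 - alpha) ** n   # at least one researcher gets a false positive
    true_found = prior * found_if_true
    false_found = (1 - prior) * found_if_false
    return true_found / (true_found + false_found)


for n in (1, 2, 5, 10, 50, 500):
    share = share_true_with_n_researchers(n)
    print(f"{n:>3} researchers: {share:.1%} of found hypotheses are true")
```

With one researcher the share is the 75% computed earlier; as n grows, both true and false hypotheses are almost surely "found" by someone, and the share falls toward the 20% in the population.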

A meta-analysis will go some way to fixing the last problem, so the point is not that knowledge declines with the number of researchers but rather that with lots of researchers every crackpot theory will have at least one scientific study that it can cite in its support.

The meta-analysis approach, however, will work well only if the results that are published reflect the results that are discovered. But editors and referees (and authors too) like results which reject the null, i.e. they want to see a theory that is supported, not a paper that says we tried this and this and found nothing (which seems like an admission of failure).
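A small simulation sketches why selective publication undermines a meta-analysis. Everything here is an assumption made for illustration: the true effect is zero, each study's estimate is a standard normal draw (measured in standard errors), and only estimates above a one-sided 5% cutoff of 1.645 get published.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the run is repeatable

TRUE_EFFECT = 0.0    # assumption: the null is in fact true
CUTOFF = 1.645       # one-sided 5% critical value, in standard errors
N_STUDIES = 10_000   # hypothetical number of studies run

# Each study's estimate is the true effect plus standard normal noise.
estimates = [rng.gauss(TRUE_EFFECT, 1.0) for _ in range(N_STUDIES)]

# Only "significant" estimates in the favored direction reach print.
published = [z for z in estimates if z > CUTOFF]

print(f"studies run:        {len(estimates)}")
print(f"studies published:  {len(published)}")
print(f"mean, all studies:  {fmean(estimates):+.2f}")
print(f"mean, published:    {fmean(published):+.2f}")
```

Averaging every study recovers an effect near zero, while averaging only the published ones necessarily gives a value above the cutoff, even though there is nothing to find.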

Brad DeLong and Kevin Lang wrote a classic paper suggesting that one of the few times that journals will accept a paper that fails to reject the null is when the evidence against the null is strong (and thus failing to reject the null is considered surprising and important). DeLong and Lang show that this can result in a paradox. Taken on its own, a paper which fails to reject the null provides evidence in favor of the null, i.e. against the alternative hypothesis and so should increase the probability that a rational person thinks the null is true. But when a rational person takes into account the selection effect, the fact that the only time papers which fail to reject the null are published is when the evidence against the null is strong, the publication of a paper failing to reject the null can cause him to increase his belief in the alternative theory!

What can be done about these problems? (Some cribbed straight from Ioannidis and some my own suggestions.)

1) In evaluating any study, try to take into account the amount of background noise. That is, remember that the more hypotheses that are tested and the less selection that goes into choosing them, the more likely it is that you are looking at noise.

2) Bigger samples are better. (But note that even big samples won’t solve the problems of observational studies, which is a whole other problem.)

3) Small effects are to be distrusted.

4) Multiple sources and types of evidence are desirable.

5) Evaluate literatures not individual papers.

6) Trust empirical papers which test other people’s theories more than empirical papers which test the author’s theory.

7) As an editor or referee, don’t reject papers that fail to reject the null.