How good is published academic research?

Bayer halts nearly two-thirds of its target-validation projects because in-house experimental findings fail to match up with published literature claims, finds a first-of-a-kind analysis on data irreproducibility.

An unspoken industry rule alleges that at least 50% of published studies from academic laboratories cannot be repeated in an industrial setting, wrote venture capitalist Bruce Booth in a recent blog post. A first-of-a-kind analysis of Bayer’s internal efforts to validate ‘new drug target’ claims now not only supports this view but suggests that 50% may be an underestimate; the company’s in-house experimental data do not match literature claims in 65% of target-validation projects, leading to project discontinuation.

“People take for granted what they see published,” says John Ioannidis, an expert on data reproducibility at Stanford University School of Medicine in California, USA. “But this and other studies are raising deep questions about whether we can really believe the literature, or whether we have to go back and do everything on our own.”

For the non-peer-reviewed analysis, Khusru Asadullah, Head of Target Discovery at Bayer, and his colleagues looked back at 67 target-validation projects, covering the majority of Bayer’s work in oncology, women’s health and cardiovascular medicine over the past 4 years. Of these, results from internal experiments matched up with the published findings in only 14 projects, but were highly inconsistent in 43 (in a further 10 projects, claims were rated as mostly reproducible, partially reproducible or not applicable; see the article online). “We came up with some shocking examples of discrepancies between published data and our own data,” says Asadullah. These included failures to reproduce over-expression of certain genes in specific tumour types, and decreased cell proliferation via functional inhibition of a target using RNA interference.

There is more here.  And here:

The unspoken rule is that at least 50% of the studies published even in top-tier academic journals – Science, Nature, Cell, PNAS, etc. – can’t be repeated with the same conclusions by an industrial lab. In particular, key animal models often don’t reproduce. This 50% failure rate isn’t a data-free assertion: it’s backed up by dozens of experienced R&D professionals who’ve participated in the (re)testing of academic findings.

For the pointer I thank Michelle Dawson.


I guess I'm surprised that this is considered surprising. The examples given all involve difficult, recently developed technologies and I don't think it's surprising to anyone in biotech that the data generated are fraught with problems. I'm a bioinformatics analyst, and would say that data from studies of this type are, on the whole, highly valued despite the high rate of duds, because the positive results can be of such huge economic value.

It is not obvious how to improve the reproducibility rate on individual cutting-edge studies. I don't think it's incompetence or wishful thinking that results fall flat the second time. Biology is just plain hard.

I don't buy it. Publication bias seems like a much, much more plausible answer.

Lots of things are hard to do. The problem is not with biology research itself but with inadequately evaluating the research. If half or more of published findings are not reproducible, then one would think the review and acceptance process is not doing what it is supposed to do.

How, as a reviewer, would you increase the standards without throwing out correct and useful results in the process? Much of this non-reproducibility is due to the 0.05 standard cutoff for P-values, which is not very stringent. But this lack of stringency is deliberate. Especially in genomics, even if you perform an experiment to test a hypothesis that is already known to be true, improving the p-value of the results beyond this threshold can be VERY expensive. It is more rational to accept a high level of false results and do more experiments with the same money, than it is to spend more money on fewer experiments in order to increase their reproducibility. You get more science done by working with weaker results and accepting that they might lead to a dead end.

Concerning the publication bias comment above: yes, that is definitely part of it, and science would be greatly improved by changing the publication model to a "registry" system where experiments are accepted for publication before the results are produced. But you can't just wave a magic wand and make that happen.

I see the problem as one of unrealistic expectations about how science operates. From my perspective, reproducing 50% of the results is actually pretty good, very much a "glass half full" situation.

It depends what is meant by a lot of these words, but the post claims they tried to do the exact same experiment and still got a different result. That doesn't sound like difficulty or statistics.

Your "registry" idea is interesting, and is kind of how the grant system is 'thought' to work. Except that how it really works is whether or not you publish papers.

If 50% of publications aren't reproducible, in order for them to be real, usable "science" someone already has to spend more money trying to reproduce them. A lot more! You are just shifting costs around to the private sector industries that need to use the results - and since they have much less incentive to actually publish what they find, that's probably less efficient in aggregate than just doing things properly with publicly funded research.

If a proposed solution involves doubling the money put into an experiment in order to improve statistical significance, you are very often better off scientifically if that money is spent on an initial trial followed by an independent replicate at a different institution. Which is basically what is happening here.

There are definitely ways the usefulness of published results could be improved. I would not describe it as "low hanging fruit" but it is doable. But getting to a dramatically lower non-reproducibility rate (say less than 10%) may not even be possible and seems less likely to be efficient. Especially considering that most research is not worth following up on even if it is correct.

I get the feeling a lot of people want science to hand them unambiguously certain facts in an easily digestible form. It just doesn't work like that. It's an unreasonable expectation to begin with.

This is a very long-winded way of saying "the perfect is the enemy of the good".

heh, I knew Ioannidis was going to be referenced the moment I saw this headline! :)

Is there any way to create an atmosphere where academics are encouraged to competitively prove other people's papers wrong? I mean, I know lots of people who do it as a fun hobby, but certainly no one gets grant money to go disprove other people's stuff. There may be a general consensus within the field that a given paper is full of crap, but it seems like a case of the headline being in 72 point font and the retraction of it is printed on page 4B -- the average person truly believes that if a peer reviewed paper says something in its abstract, it's the truth.

Secondly, is there a way to incentivize this kind of behavior, and publicize the results -- without undermining everyone's faith (I hesitate to use the term) in actually true scientific things? I feel that part of the reason that people do believe (for better or worse) in the value of capital-s Science is because so little in-fighting and competition is displayed -- despite the fact that I'd say that bickering and competition enhances and reinforces the validity of scientific results, rather than the opposite.

There is already an atmosphere that incentivizes correct, reproducible results: industry.

Oh, I totally agree, and my comment was a reflection of my desire to see research move that way. My question is: "how?"

Sorry for the flippant answer; I don't really have an answer to your question. Except perhaps make funding more results-based in some way. But that is little more than a restatement of the problem.

Your last paragraph is interesting. I think lots of institutions face this choice between hiding weaknesses and addressing weaknesses. The latter option is painful but necessary. The former is easier, so it is usually taken. It works for a while, then leads to a big collapse.

right--US doctors prescribing millions of $ of procedures which are *subsequently* demonstrated to be useless (or harmful!) in the industry of medicine being a great case in point

Do 50% or more of medications turn out to be useful? I don't think so. So yes, industry incentivizes success more than academia.

sigh..."incentivizing success" is how you generate bogus results to begin with. let's see how far back i have to go to find an instance of this...oh wait, one day:
why does this story sound familiar? an initial study, *not* double-blind controlled and with small n, miraculously demonstrates a positive effect for a treatment. doctors in industry with your coveted "success incentivization" don't wait for the large, double-blind controlled study, jump on the bandwagon and dump stents in *thousands* of people. then the double-blind comes out and, lo and behold, it's worse than the prior treatment! having potentially harmed thousands of people and wasted tons of money, we're back at square one. Yes, let's take industry's example.

I have a Ph.D. in physics with, at this point, 13 published articles in peer reviewed journals so I mean this in the best possible way :-). Why should we want people to trust capital-s Science when 2/3rds of published peer-reviewed results are wrong? Or, better question, how is anyone to distinguish capital-s Science from garbage? I'm, at some small level, an insider and I already believe that:

1. Economics has awarded at least one Nobel Prize to a theory that is (probably) wrong.

2. Climate "Science" should consider itself lucky if "only" 2/3rds of its published results are wrong.

You're missing the point: this is about reproducing results, not the validity of the theories. The results of these bio papers aren't even necessarily rigged as multiple people have already pointed this out with comments relating to selection. The easy answer to your question is to change the mindset about the publication of negative vs. positive results. Your bullet points are non sequiturs at best and don't do much to dispel the stereotype of physicists thinking they know everything about every other field as well as their own.

I agree with the publication issue. Publish reproducing or contradicting papers with as much vigor as the originals (at least for the first five or so) and give extra credit to them at tenure/promotion time (as well as giving extra credit to papers that have been reproduced by others and cutting credit to those that have been debunked).

As for publications, 6 of them are in Actuarial/Risk Management Journals and 7 in Physics. I do think, from my experience in both fields, that "wrong" physics results get weeded out quickly. Not necessarily because reproductions and contradictions are published, but because the next experiments build on previous ones and if the previous ones don't hold up, the next researchers know it pretty quickly. In the AS/RM field, research doesn't accumulate in quite the same way, IMHO.

Regarding Eric's comment above, I like the maxim that any academic discipline having the word "science" in its name is not one.

This is absolutely publication bias, and it's a huge effect in fields like biomedicine, where experiments are relatively easy and quick, and nobody publishes all the boring studies that show no effect. I think the data themselves are most likely fine - it's just that people don't understand statistics. 3 sigma effects appear and disappear all the time.

That's what one of Ioannidis' studies showed, if I remember correctly. They just took a bunch of papers and looked at the data in the back -- and then compared it against the claims in the paper, and found that a large percentage of them simply didn't line up. The papers' conclusions may have indeed been true, but their data didn't justify their claims. Just as you say, they didn't know stats.

Indeed... in at least one lab I knew of at a top-notch university, a researcher had asked a bio-statistics group to collaborate with her and analyze her data.
Statistician: "Your data does not show any statistically significant results."
Researcher response: "Can you redo the analysis to show [what I'm trying to prove]?"
Statistician: "Your data does not show any statistically significant results."
Researcher response: "Okay, I'll do it myself" [publishes in Nature]

I agree with Sam. As a molecular & cell biology Ph.D. with about 20 years experience in academic basic research through the ranks (only to provide some context), I can confirm that the very unscientific bias against confirming the null hypothesis in published work is toxic. I've only been able to pull it off once (because the hypothesis being tested had become assumed dogma at that point, so the reviewers thought it was significant enough for publication). Regardless of the validity of the hypothesis or the quality of the experiments and data analysis, most PIs reflexively abandon negative results without even trying to publish them. Unfortunately, this results in continued redundant testing of the same hypothesis, since the findings are never published for others to learn from.

Worse than that. It results in 20 redundant tests, one of which shows a valid result with a 95% confidence level, and which is then published.
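[A quick simulation makes that arithmetic concrete. This is a purely illustrative Python sketch, not anyone's actual analysis: if 20 labs independently test the same true-null hypothesis at the conventional 0.05 level, the chance that at least one of them gets a publishable "significant" result is about 64%.]

```python
import random

random.seed(42)

TRIALS = 100_000   # Monte Carlo repetitions
LABS = 20          # independent labs testing the same true-null hypothesis
ALPHA = 0.05       # conventional significance cutoff

# Under a true null, a well-calibrated p-value is uniform on [0, 1],
# so each lab "succeeds" (p < 0.05) with probability 0.05.
hits = sum(
    any(random.random() < ALPHA for _ in range(LABS))
    for _ in range(TRIALS)
)

print(f"analytic : {1 - (1 - ALPHA) ** LABS:.3f}")   # 1 - 0.95^20
print(f"simulated: {hits / TRIALS:.3f}")
```

[If only the one "hit" is written up, the literature records a 95%-confidence finding that is pure noise.]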

In the age of the internet, why wouldn't scientists disclose failures so others do not have to repeat the hypothesis testing? You learn from failure.

Could someone fund a website to report failures, or give credit for a discovery based on someone's reported failure which narrowed the scope and cost of research? Psychic rewards motivate people.

My daughter, who majored in organic chemistry at CalTech, disliked research because it consisted of mostly failed experiments. I tried to convince her that what she really did was discovery, which is neutral as to success or failure, and that discovery was what it was all about. She became a pathologist because, as she said, it is either cancerous or not. Both answers are worthy of knowing.

On the matter of organic chemistry, when I was working on my Ph.D. I needed a material the synthesis of which was claimed in a prestigious chemistry journal. When I tried to duplicate the synthesis, I discovered not only no yield of the desired material, but that the required materials would not even form a homogeneous solution in which to do the reaction. I eventually managed to make the desired compound by using a co-solvent that enabled all of the components to contact each other in a homogeneous solution, but the yield was tiny.

You should often discard negative results. Most of the time, even. It's difficult to prove a true null even with a positive control to the same standard that a positive result is held to.

It's more likely you're not doing it right than that it doesn't exist.

Cool story, Bro!

How many empirical papers in economics are reproducible?

How willing are authors to share their data, particularly when acquiring the data was costly, time consuming, or proprietary?

Would they share their statistical code for others to vet for accuracy?

Or are we going to base economic policies on unreproducible results from unseen data like we've done for climate change policies?

Public Finance Review has actually started a policy of explicitly asking for papers that reproduce others' research. Some other journals, such as the Journal of Applied Econometrics, will not accept your paper unless you supply the data and computer code.

While many journals have such a policy (AER for instance), the exceptions sound like the gov't claiming things involve "national security." Just one example: a recent paper could not disclose its data because it had proprietary value - the data was cellphone bills for college students from 5 years earlier!

I have tried to reproduce many results of peer-reviewed and not peer-reviewed studies. Authors are quite reluctant to provide their data. Often they say "it is publicly available, get it yourself." But when I get it myself I cannot match their results. Not only is this inexcusable in peer-reviewed publications, it should be inexcusable for think tanks that are lobbying public policy. I'd like to see regulators say "we will ignore your work unless you provide the data."

I actually filed testimony for one client that permitted me to post the data on my website. It remains one of my proudest moments.

Good examples Dale.

When I gathered data for my Master's thesis, it was all publicly available data, but I spent countless hours downloading it piecemeal and putting it in usable form. I would have given the data to a journal, but not someone else trying to enhance or criticize my work or conduct unrelated analysis. Data aggregators make a lot of money putting public data in convenient form. They can add value and make it proprietary with their own forecasts. Sometimes you have to pay for data.

So I understand why researchers are reluctant to share, but if it guides public policy, it should be made public.

This result is right out of Sellke, Bayarri & Berger (1999), assuming that the p-values of the relevant studies were about 0.05: "Suppose it is known that, a priori, about 50% of the drugs tested have a negligible effect. (This is actually quite a neutral assumption; in some scenarios this percentage is likely to be much higher.) Then: of the Di for which the p-value ≈ 0.05, at least 23% (and typically close to 50%) will have negligible effect." Seems about right.

No substitute for empiricism, especially with the state of peer review being what it is.

Alternate title: How good is Bayer's in-house lab?

If I had to guess, I suspect it's better than the lone grad student who is holding everything together with bubble gum and duct tape.

except of course that if you were familiar with Ioannidis' work you'd know that of the many categories of problems he identifies, a tremendous one is industry-funded research, especially industry-funded research where doctors with financial interests in the company are putting their names on the company's work without being involved in the work at all

Comparing average industry to average academia is different from comparing Bayer to average academia.

In this case, Bayer has a *huge* incentive to make the experiments work. The fact that they *can't* is pretty telling about the quality of published academic research.

Yikes. Do you realize that "published research" *includes* massive quantities of industry-funded, sometimes ghost-written, biomedical research into the effectiveness of drug targets, which includes by default the "huge incentive" for positive results you've referenced? It is a massive misconception that this only includes federally-funded grad students toiling away in a corner. Calling it "academic research" doesn't change industry's tremendous role in it, at least in drug research.

Gee, for an economics blog so concerned with incentives, moral hazard, blablabla, it's patently absurd to see people blindly take the word of companies that are actively involved with a financial stake in much of the "published research" that's on the books.

There is nothing wrong with 2/3s of all papers being wrong; I assume we want to get new, interesting findings into print quickly. That means that folks are going to claim that a lot of things are true that aren't. I further presume that we want new, interesting findings independently replicated and tested. By replicated, I mean doing the same thing the original authors did and getting the same result, doing the same thing they did but with a different/better data set and getting the same result, doing what they should have done with the data and different/better data and getting the same result. The problem is that research which replicates and tests interesting findings is damn hard to publish. This is crazy if what we want is to advance knowledge, but it's what we have.

There's nothing wrong with 2/3rds of papers being wrong when it's all "academic." But when the safety of your drug or the efficacy of some social policy depends on the conclusions of those papers, they had damned well better be correct, or we had better choose not to make decisions based on those findings.

The beauty of basing your academic career on "natural experiments" as so many of MR's favorite young economists do, is that no one can debunk you this way. Must mean that "natural experiment"-based economics research is really robust, right?

Well, what I was going to say is that this result is for the subset of research that is actually worth repeating. Scary.

It's robust to the point that you can convince readers that your instruments are valid. Your quip about debunking is ridiculous; plenty of papers are rejected due to instruments that are not plausibly exogenous or are otherwise weak.

I know everyone wants to jump on publication bias as the root cause, but I'd have to say the more likely culprit is Excel errors. I write custom statistical modeling software (so I'll be programming directly in Java, for example, b/c the data set doesn't fit in memory or even on a single machine). Anyway, after years of doing it I can tell you that writing large test suites with carefully selected inputs to check whether the output is valid is the only way to have any reason to believe the code matches the equations you're trying to implement. Even when most of the functions are in an existing library (as is the case with spreadsheets or SAS), and you're simply applying function A to the data set, then function B to the result, a lot can go wrong. And the most sinister part is that you get a reasonable-looking answer that's totally wrong. People think they're smart and meticulous, so their spreadsheets must be correct, but the people I've known who are the best at applying statistical software to data have always had an attitude that they suck at it. Their pessimism makes their results believable.
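[To illustrate the known-answer-test idea in that comment (a generic Python sketch, not the commenter's actual Java code; `weighted_mean` is a made-up example function): the defense against "reasonable-looking but totally wrong" output is to feed the pipeline inputs simple enough to verify by hand.]

```python
import statistics

def weighted_mean(values, weights):
    """A small helper of the kind that silently goes wrong in spreadsheets:
    easy to write, easy to botch subtly (e.g. forgetting to divide by the
    total weight)."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Known-answer tests: inputs chosen so the right output is obvious.
assert weighted_mean([1, 2, 3], [1, 1, 1]) == statistics.mean([1, 2, 3])
assert weighted_mean([10, 0], [3, 1]) == 7.5   # (10*3 + 0*1) / 4
assert weighted_mean([5], [2]) == 5            # one point is its own mean
```

[A suite of such hand-checkable cases is cheap insurance; a function that passes them can still be wrong, but one that fails them is wrong for certain.]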

More generally, I think everyone has a pet issue to which they want to attribute all of the problem, when in fact Ioannidis has never claimed there's one bogeyman on whom we can dump all blame. There is publication bias. There are conflicts of interest from industry-funded research. Ghost-written articles. Inappropriate use of statistical models when the assumptions aren't justified, not correcting for multiple comparisons, not properly normalizing the data before the stats, etc. Simply setting our p-value threshold too leniently given the rate of unpublished negative results. All are issues with demonstrable impact.

ok, I'm a scientist too. Biomedical research in my case (currently cancer).
Generalizing to all biomed research is... simply not appropriate.

Now, the "inside rules" of how working (biomed) scientists think about things:
(1) something really novel is quite often wrong. The journals Nature, Science, and Cell put a large emphasis on "new and cool" - hence, yes, more articles (by %) will be wrong.
(2) If you go to speciality journals, especially third-tier, most of the work will be correct. The techniques are well-established, authors are expected to do a lot of experiments, especially control experiments, technical experimental/data analysis problems are well understood, the peer reviewers are going to be tough. Like a mid-range restaurant. Food has to be ok or it will die.
(3) Low but not appalling journals (like 5th tier) - some papers will be solid, some will be obviously bad.
(4) Below the 6th tier or so: nearly all papers are mostly junk. The ones that are not mostly junk are boring beyond belief.

How does science deal with this?
Let's say paper X in Nature claims finding Y. And now no one in the field believes Y. You will see that paper X and finding Y will simply be ignored after a while (not in review articles or referenced). Or simply referenced by the deadly words ("While a large number of studies have demonstrated not-Y, a single report claimed Y, but further evidence has not been supplied for idea Y." - or words like this.)
Really, it is not hard to figure these things out...

Advisor meeting n:
me: "I could show Y"
Advisor: "that's not exciting enough, everyone already knows that"

Advisor meeting n+1:
me: "I could show Y+1"
Advisor: "no one would buy it"

They need an "Office Space" for academia, but no one would believe it.

It's clear (to me at least) that quite a few of the Climate Change hysterics are crooks. Nature and Science seem to carry their work quite happily. I should be astonished if none of the people whose results Bayer has examined are crooks. I suspect that this can be understood by considering incentives.

Right. This is how it is typically done: 10 clueless graduate students who were never properly trained produce ten different results that conflict with each other. The professor cherry-picks the ones that suit "the model" and exerts pressure to deliver confirmatory results. They of course get delivered and, eventually, the paper is published (sexy results! revolutionary findings!). When the results turn out to be bullshit (or even when the data are fabricated), the PI bears no responsibility - he simply trusted his students. I've seen this play out countless times.

If a large part of what's published is bullshit, how then is the indisputably great progress made? Here is how: hundreds produce BS like this. But they are reasonably smart guys and they sincerely try to come up with reasonable models that reasonably account for the available data. On balance, the hundreds of papers do give an indication of which data are likely to be correct and which are not. This allows the next iteration of sand-castle models and semi-competent experimentation. Eventually, the truth shakes out of this extremely inefficient process.

Very similar to the article in the Atlantic:
Lies, Damned Lies, and Medical Science

I authored > 130 biomedical journal papers and am editor of several medical journals. I agree with other comments here that 50% false positives in the literature is pretty good. I suspect it is < 10% in such almost impossible to falsify fields as economics and climate science.
Snark aside, I would say:
a) Research is really hard
b) Peer review can weed out the papers for such red flags as 1) methods do not make sense, especially poor design or statistical weakness 2) conclusions are not supported by the results 3) researchers are obviously unaware of the state of the art in their field, casting doubt on their quality 4) underreported conflicts of interest.
c) Peer review can also weed out papers when methods+results+discussion do not make sense
d) But no reviewer can look over the shoulder of the scientist doing the research. If a scientist makes up results, it is extremely hard to pick up in peer review.

Unfortunately, even more government regulation of research, now to reduce false-positive research reports, will be called for.

Problems with 3) (researchers unaware of the state of the art) ... first of all, could be irrelevant. You can do novel research using twenty year old technology; there is a lot to look at. Second of all, can be used as an excuse for extracting cites.

I really don't understand why we worry about this at all right now. Isn't unreproducible science just basically like a broken window or some costly EPA regulation in an liquidity trap?

Readers should consider what is meant when Bayer says that they couldn't reproduce a finding. There's a high chance that they didn't do exactly what the original authors did, but rather something similar enough that they assumed it wouldn't make a difference. Unfortunately, as I've learned from years of research, even the most minuscule difference can give you completely different results. Did you use the Qiagen kit or the Zymogen kit to purify your DNA? They both should work and do the same thing, but sometimes you get a different result because one of the kits is subtly different in a way that they don't report or list as "proprietary". The fact of the matter is that you have a system in which you imagine there to be 1-3 variables which you are controlling, while in reality there are 500 variables in the form of hundreds of proteins and signaling molecules, environmental conditions, etc. Add all of that to poor-quality statistics, and those numbers seem reasonable. It's unfortunate, but biology is unbelievably, mind-numbingly complex in ways that I can't even begin to explain. Take all the complexity of genetics, physical chemistry, evolution, organic chemistry, molecular biology, biochemistry, and physiology, add them all together, and then square it, and you start to get an idea.

I'd like to second what ad*m said. When you look at the "top" journals, they are primarily interested in novel findings. They have a stringent peer review process, don't get me wrong, but they are specifically looking for new or strange findings. If you look at a specialty journal, like the Journal of General Physiology (top notch Physiology journal), the work is correct and usually thorough.

This is the most nuanced and correct interpretation of this finding. It is refreshing to see it expressed so well rather than the typical knee jerk reaction "Most published research is crap".

There does BTW exist at least one journal I'm aware of that reproduces results before publishing them: Organic Syntheses (organic chemistry journal as the name implies). Every procedure is actually run by a checker to verify that it works before it gets published. Unfortunately this is not practical in most fields, but for some types of research it might be a workable model.

Yep, Org Syn was on my mind, too.

I also wonder how many of the results that Bayer is trying to reproduce are from India or China. They produce some excellent work, but (at least in chemistry, the only area I'm competent to judge) in the lower tiers there's an enormous amount of really shoddy publications. Fabricated results and the like.

@DK I love your summary. Most science results sort of come out of a messy process full of human foibles and flaws, with brief moments of brilliance. The key to understanding science is that, just like in all human endeavours, most scientists are just about OK at what they do and a lot of them aren't very good at it at all (and the rest are pure geniuses). And for a lot of what they do, there are simply no easy ways of finding the right answers. This is true of all professions. You have to have a certain level of skills and smarts to become a lawyer, doctor or even a politician. But it is amazing how many of them still manage to be complete morons. And sometimes geniuses get turned into morons by the intractability of the problem they're dealing with. Yet the whole system appears to by and large work. It is indisputable that through this messy process we got medicine that actually saves more lives than it takes (only about 50% of what doctors do these days is complete quackery). But it is important to remember that this system can also sustain fundamental errors for decades or longer.

Funnily enough, social sciences and humanities are no different in this respect. It is surprising how much consensus there is on the basic facts, how much solid replicable research there is and how much tendentious nonsense it's swaddled in.
