Results Free Review

If researchers test a hundred hypotheses, 5% will come up “statistically significant” even when the true effect in every case is zero. Unfortunately, the 5% of papers with statistically significant results are more likely to be published, especially as these results may seem novel, surprising, or unexpected; this is the problem of publication bias.

A potentially simple and yet powerful way to mitigate publication bias is for journals to commit to publish manuscripts without any knowledge of the actual findings. Authors might submit sophisticated research designs that serve as a registration of what they intend to do. Or they might submit already completed studies for which any mention of results is expunged from the submitted manuscript. Reviewers would carefully analyze the theory and research design of the article. If they found that the theoretical contribution was justifiably large and the design an appropriate test of the theoretical logic, then reviewers could recommend publication regardless of the final outcome of the research.

In a new paper (from which the above is quoted) the editors of a special issue of Comparative Political Studies report on an experiment using results-free review. Results-free review worked well. The referees spent a lot of time and effort thinking about theory and research design and the type of institutional and area-specific knowledge that would be necessary to make the results compelling. The quality of the submitted papers was high.

What the editors found, however, was that the demand for “significant” results was very strong and difficult to shake.

It seems especially difficult for referees and authors alike to accept that null findings might mean that a theory has been proved to be unhelpful for explaining some phenomenon, as opposed to being the result of mechanical problems with how the hypothesis was tested (low power, poor measures, etc.). Making this distinction, of course, is exactly the main benefit of results-free peer review. Perhaps the single most compelling argument in favor of results-free peer review is that it allows for findings of non-relationships. Yet, our reviewers pushed back against making such calls. They appeared reluctant to endorse manuscripts in which null findings were possible, or if so, to interpret those null results as evidence against the existence of a hypothesized relationship. For some reviewers, this was a source of some consternation: Reviewing manuscripts without results made them aware of how they were making decisions based on the strength of findings, and also how much easier it was to feel “excited” by strong findings. This question even led to debate among the special issue editors: What are the standards for publishing a null finding?

I’ve seen this aversion to null results. In my paper with Goldschlag on regulation and dynamism, we find that regulation does not much influence standard measures of dynamism. It’s been very hard for reviewers to accept this result, and I don’t think it’s simply because some referees believe strongly that regulation reduces dynamism. I think referees would be more likely to accept the exact same paper if the results were either negative or positive. That’s unscientific (indeed, we should expect that most results are null results, so this should give us, if anything, even more confidence in the paper!) but, as the above indicates, it’s a very common reaction that null results indicate something is amiss.

Here, by the way, are the three papers reviewed before the results were tabulated. I suspect that some of these papers would not have been accepted at this journal under a standard refereeing system but that all of these papers are of above average quality.

The Effects of Authoritarian Iconography: An Experimental Test finds “no meaningful evidence that authoritarian iconography increases political compliance or support for the Emirati regime.”

Can Politicians Police Themselves? “Taking advantage of a randomized natural experiment embedded in Brazil’s State Audit Courts, we study how variation in the appointment mechanisms for choosing auditors affects political accountability. We show that auditors appointed under few constraints by elected officials punish lawbreaking politicians—particularly co-partisans—at lower rates than bureaucrats insulated from political influence. In addition, we find that even when executives are heavily constrained in their appointment of auditors by meritocratic and professional requirements, auditors still exhibit a pro-politician bias in decision making. Our results suggest that removing bias requires a level of insulation from politics rare among institutions of horizontal accountability.”

Banners, Barricades, and Bombs tests “competing theories about how we should expect the use of tactics with varying degrees of extremeness—including demonstrations, occupations, and bombings—to influence public opinion. We find that respondents are less likely to think the government should negotiate with organizations that use the tactic of bombing when compared with demonstrations or occupations. However, depending on the outcome variable and baseline category used in the analysis, we find mixed support for whether respondents think organizations that use bombings should receive less once negotiations begin. The results of this article are generally consistent with the theoretical and policy-based arguments centering around how governments should not negotiate with organizations that engage in violent activity commonly associated with terrorist organizations.”

Addendum: See also Robin Hanson’s earlier post on conclusion free review.


There is an easy solution to so much of this: The funding agencies need to step up.

First, CORs/GORs/POs/PMs and, to an extent, SROs in conjunction with panelists, AIBS reviewers, and similar: they all review study designs and experimental designs. Some of them do so in an ongoing sense, others only really at the start of the grant or contract. Regardless, most agencies request periodic and closeout reports. Many of these reports are also found on DTIC and other grey-literature databases.

Without committing the researchers to full publication, the officers administering the grants already require relatively full reporting; they could commit it to the grey literature, including design and results, significant and publishable or not. The published literature becomes a polished extract or abridgment of the full 'literature' (basically highlights) which, in some sense, it already is. This allows researchers the flexibility to do pilot experiments and to get methods working, but it should substantially increase the reporting of final but somewhat unexciting results. Those interested only in the highlights (like an organic chemist reading synthetic biology) can start from the glossy journals and work their way progressively into the weeds. Those interested in meta-analysis can canvass the full reports.

Remember, remember! The fifth of August!

You're doing a very convincing impression.

No, I am not. It is totally different. He is a man of party; I am a man of nation. He is a divider; I am a unifier.

Re: "If researchers test a hundred hypotheses, 5% will come up “statistically significant” even when the true effect in every case is zero."

Oh, really. Prove it.

To make it easy, I have the bowl of 100 research studies you can draw your sample from. Studies I have selected to include in the bowl are from A. Einstein, a Mr. Newton, and a number of Nobel prize winners.

Of that hundred that I select, you are telling me that 5% will come up statistically significant even when the true effect in every case is zero.

Alex is trying to say: Take a hundred hypotheses whose effect size is zero. When you test them, 5% will show significance at the 5% level.

It's written in a roundabout way, but it's pretty much tautologically true.
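Dave's restatement is easy to check by simulation. Here's a minimal Python sketch (the z-test with known variance, the seed, and the sample sizes are my own illustrative assumptions, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tests, n = 10_000, 50
crit = 1.959964  # two-sided 5% critical value for a z-test

# Every "hypothesis" is truly null: all data come from N(0, 1).
data = rng.normal(0.0, 1.0, size=(n_tests, n))
z = data.mean(axis=1) * np.sqrt(n)   # z-statistic with known sigma = 1
rate = np.mean(np.abs(z) > crit)
print(rate)  # lands near 0.05 despite every true effect being zero
```

With 10,000 simulated studies the rejection rate comes out near 5%, which is exactly the point: the 5% false-positive rate is baked into the significance threshold itself.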

Well, if one accepts that the humanities are full of tautologies.

The physical sciences tend to face other challenges than worrying about whether testable reality is tautological or not.

But they also suffer from the same problem that Alex's post is getting at.

Statistics does not belong to the humanities.

Dave's interpretation is correct.

Just link to the XKCD:

Good idea!

I would have phrased this differently. My phrasing might be completely wrong, so please tell me if I'm off in the weeds.

I would have said: Papers generally use a 95% confidence interval, which implies there's a 95% chance that the reported result is accurate. And therefore, 5% have a reported result that is inaccurate.

With the bias for publishing positive results there definitely is Not "a 95% chance that the reported result is accurate." THAT is the problem.

I don't think that's the correct interpretation. P-values indicate, roughly, the probability of observing the relationship that you got in an experiment even if the relationship does not actually exist.

E.g. I hypothesize that conducting voodoo incantations over a coin and asking the gods to favor heads should produce a coin that comes up heads far more than is normally the case for a fair coin. Regardless of the reality behind my hypothesis, though, there will always be a small chance that even a fair coin completely unaffected by the non-existent gods will still come up, say, 6 heads in a row (.5^6 or 1.5625%). The normal inference being made is that if I observe 6 heads coming up in a row and I can describe a plausible causal mechanism, there is evidence for my hypothesis insofar as coming up with 6 heads in a row is a pretty unlikely phenomenon otherwise. This would appear as a low P-value.

But it isn't *impossibly* unlikely. Therefore, the more experiments you conduct the more risk you run of coming up with an example of spurious positive results despite the hypothesis being BS. Not a problem if you're running the same or very similar experiments again and again, since at a meta-analysis level the aggregate result will show that most results are negative. But if you run few similar experiments and are biased towards finding the positive results more meaningful (and who isn't, really?), then the occasional positive result will get treated as very meaningful evidence.
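Both points in the voodoo-coin example above reduce to a few lines of arithmetic. This Python sketch (the choice of test counts is illustrative) computes the 6-heads p-value and the chance of at least one spurious positive across repeated true-null tests:

```python
# p-value for the coin example: 6 heads out of 6 flips of a fair coin.
p_six_heads = 0.5 ** 6
print(p_six_heads)  # 0.015625, the ~1.56% mentioned above

# Multiple-testing risk: probability of at least one spurious
# "significant" result across m independent tests of true nulls.
def p_any_false_positive(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(m, round(p_any_false_positive(m), 3))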

p < 0.05 makes it way too easy to find something significant. In particle physics they use 3 sigma to 5 sigma to report a discovery. 3 sigma is p < 0.003.
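The sigma-to-p conversion mentioned here is just the normal tail probability; a small Python sketch using the standard complementary-error-function identity (the specific sigma values are the conventional ones, chosen for illustration):

```python
import math

def two_sided_p(sigma):
    """Two-sided normal tail probability beyond +/- sigma standard deviations."""
    return math.erfc(sigma / math.sqrt(2))

for s in (1.96, 3.0, 5.0):
    print(f"{s} sigma -> p = {two_sided_p(s):.2e}")
```

At 1.96 sigma this recovers the familiar p = 0.05; at 3 sigma p is about 0.0027 (consistent with the "p < 0.003" above), and a 5 sigma discovery threshold corresponds to roughly a 1-in-1.7-million chance result.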

Tautological statements are always true.

So, what Alex meant to say is that if you test the SAME hypothesis which showed zero effect 100 times, 5% will come up significant. That is quite different than the statement that if researchers test 100 hypotheses, 5% will come up statistically significant even when the true effect is zero.

Ordinarily I would be ninety-five percent confident, but because it is a tautology, I am now 100% confident of my statement and not confident of Alex's unless he changes his language.

I think the opening paragraph is great - pithy, straightforward, and simply explains a little-known but critical issue. And I'm not exactly a Tabarrok brown-noser here.

Really? I thought the first sentence was a complete mess.

Alex said:

"If researchers test a hundred hypotheses, 5% will come up “statistically significant” even when the true effect in every case is zero."

This sentence is poorly constructed and not necessarily true. It's clearer and correct to say:
"If researchers perform a hundred hypothesis tests, and in each case, the null hypothesis is true, 5% of the tests will report p-values less than 0.05."

Or rather, “If researchers perform a hundred hypothesis tests, and in each case, the null hypothesis is true, you should expect approximately 5% of the tests will report p-values less than 0.05 anyway.”

But does the 'results-free review' really have the potential to solve the problem? Even if that makes it possible to publish null results, researchers will still know that papers with 'novel, surprising, unexpected' results are more likely to become widely known and cited and advance their careers (especially as compared to published papers with null findings). Which means that those researchers are still likely, as before, to submit only their significant results and toss the others. And the reviewers, in turn, would have good reason to believe, even without seeing the results, that they're likely significant (else the researcher probably wouldn't have bothered submitting the paper at all).

In theory, under some versions of this design you're committing to pre-registration, which is to say, publishing whatever you've found before you've actually tried running it. Of course you could just lie I suppose, but I suspect the average researcher is basically honest even if a disturbing and non-zero proportion are not.

"But does the ‘results-free review’ ... solve the problem?"

...we don't even know if that "Results-free review worked well," as claimed above. (Where's the conclusive statistical proof that it worked well?)

Blame the statistics profession, which falsely promoted 'statistics' to researchers as a method to filter 95% absolute certainty from randomness.

"the primary emphasis [in statistical analyses] should be on the size of the treatment effects and their precision; too often we find a statement regarding “significance,” while the treatment and control means are not even presented. Nearly all statisticians are calling for estimates of effect size and associated precision, rather than test statistics, P-values, and “significance.”"

A quote from Burnham & Anderson's Model Selection and Multi-Model Inference. Here's another one, from Borenstein & Hedges' Introduction to Meta-Analysis:

"the [narrative] review will often focus on the question of whether or not the body of evidence allows us to reject the null hypothesis. There is no good mechanism for discussing the magnitude of the effect. By contrast, the meta-analytic approaches discussed in this volume allow us to compute an estimate of the effect size for each study, and these effect sizes fall at the core of the analysis. This is important because the effect size is what we care about. […] The p-value can tell us only that the effect is not zero, and to report simply that the effect is not zero is to miss the point."

A lot of people who teach statistics are surely over-emphasizing hypothesis-testing approaches and 'significance levels' and stuff like that, but to blame the statistics profession seems off the mark to me. Competent statisticians know better, and I'd gather they're usually some of the people who are most annoyed about the way statistical tools tend to be misused in various ways.

Personally, I see no problem if just positive results are published. The trick here is in the eyes and mind of readers. In theory, readers of scientific articles should analyze with a skeptical attitude. Even if published research is biased toward positive results, the eyes should be looking for potential problems.

Also, I see a potential perverse incentive. If failures become publishable, a researcher can double the number of articles published per year. Under present conditions, where the quality of a researcher is measured by the number of publications, producing articles of negative results helps to increase the personal performance metric. This is similar to a student asking a professor for points on an exam for correct reasoning but a wrong answer. It's a good idea for PhD school, where students are encouraged, but students need to stop using training wheels at some point.

...but the design undergoes peer review in this instance, and either the design is good or it isn't, and I think a lot of people would argue that the designs of lots of published papers are terrible but they get published anyway because of interesting results, even if those results would not have appeared under a better design. I think a system that forces better designs and publishes null results might actually lead to fewer papers being published, so questionable are the methods of the average study.

Also, the idea that only statistically significant answers are "right" is troubling at best.

...."we find that regulation does not much influence standard measures of dynamism."

I'm surprised there aren't more comments here screaming about this.

More? There are none. Want to kick it off? Do you think regulation does influence dynamism? Do you think the standard measures aren't up to the task? What do you expect the screaming to be about?

The definition of industry dynamism that I first heard is deviation of industry sales from a trend line. Boom-and-bust industries are by this definition more dynamic than industries where sales are static. Another definition of dynamism that I've seen is the rate of entry and exit into an industry. Industries with lots of startups and lots of failures are by this definition are more dynamic than industries where the same companies survive year after year.

One plausible purpose of regulation is to smooth out chaos to prevent disruption for consumers or for owners/management. If dynamism is defined here as smoother sales, then I wouldn't be as surprised that regulation fails to stop a rolling tide of up-and-down sales variance. However, for the second definition of entry/exit, I would expect that some of regulation is meant to create barriers of entry that keep potential disruptors out. (Consider Uber.) In the second case, a finding of no relationship would support the idea that regulation is not suppressing entrepreneurial spirit or propping up failed companies. For some observers, that result might lead to screaming.

Without first explaining the "boom bust" pattern, it's not at all clear that it should be called dynamic rather than simply cyclical or seasonal. The same holds for entry and failure. I think restaurants are one of the industries with lots of new entries and lots of failures, but on that basis alone I would not call the restaurant industry particularly dynamic, as I think the sales might be fairly static (once adjusted for external factors like the general state of the economy).

That said, I suspect the issue is more terminology and clearly defining the terms and then crafting the analysis to target the narrowly defined inquiry.

Which standard measures were used in the paper? Have you read it?

A "null" result is an uncertain result (the 95% conf intervals are on both sides of zero, which is usually the null hypothesis of the size of the effect). An uncertain result is not very helpful. Hence the preference for certainty.

Although I like the idea of results free review.

Well... one null result on its own, yes. Numerous null results in well-designed studies testing the same hypothesis in various settings might make us suspicious of a non-null finding on the same hypothesis, though.

This one starts at school. Kids should get credit for the experimental procedure, not the results.

Actually, both - especially if it can be shown from their lab notes where they made their errors, that being a major part of scientific lab work.

But this has nothing to do with what Prof. Tabarrok is writing about, as that has nothing to do with actual physical science, and instead is referring to research in the humanities.

I didn't learn anything from this post, and I'm not sure if I agree with your argument or not. Thought you should know.

Obligatory xkcd:

"I used to think correlation implied causation. Then I took a statistics class. Now I don't."
"Sounds like the class helped."
"Well, maybe."

This too

+95% Confidence

That comic is spot on! Xkcd, much like Dilbert, often seems to really nail specific situations.

Christ that's longwinded. AT explained it better.

Physicists are all over this. That's why it takes something like a 5 sigma event to be significant: to weed out the meta model-selection or reporting bias described here. In quantitative finance there are also ways to weed out model selection bias by essentially testing a wide universe of trading strategies, Monte Carlo, bootstrapping, etc. If social science academics did this I suspect the presses would go quiet (including economists, by the way). Most real-world phenomena do not have an easy-to-model explanation! Polls are not science. Cross-sectional analysis of entirely different economies over 5 years is not science. Reduced-dimension models like IS/LM are not science. Climate modelling is not science. Grad-student artificial scenario play games are not science. The multiverse and inflation is not science. Of the three papers above, only the second one has any real hope of meaning anything IMO, as it's a somewhat controlled experiment.

+1. The word "science" has been taken for a rough ride

Actually economists use bootstrapping all the time to construct error bars. But in practice standard metrics often replicate the bootstrapped error bars pretty well.
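The claim that bootstrapped error bars often match the standard analytic ones is easy to illustrate with a quick sketch (the exponential sample, the seed, and the replication counts here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200)  # a skewed sample, for illustration

# Bootstrap standard error of the mean: resample with replacement,
# recompute the statistic, take the spread of the replicates.
boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(5000)])
boot_se = boot_means.std(ddof=1)

# The "standard metric": the analytic standard error of the mean.
analytic_se = x.std(ddof=1) / np.sqrt(x.size)
print(boot_se, analytic_se)  # typically very close, as the comment says
```

For simple statistics like a mean the two agree closely; the bootstrap earns its keep for estimators whose sampling distribution has no convenient closed form.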

'Climate modelling is not science.'

Not particularly good science, that is true. For example, compared to real-time satellite data, all the climate models involving Arctic sea ice extent have been horribly flawed, as none of them has come even close to accurately predicting what is currently happening with Arctic sea ice.

A real embarrassment for climate scientists, one should add; their predictions of sea ice extent in 2025 or 2050 are pretty much scrap paper these days, as their models have proven useless when compared to actual empirical data.

Something I have been pointing out for years, but oddly, those who criticize climate science for its flaws never mention one of its most glaring ongoing failures when compared to actual data.

"A real embarrassment for climate scientists,"

Agreed, they've also failed to accurately predict the trend in Antarctic Ice.

Here's an article on the various predictions for Arctic ice extents:

(2009) "Article by then Senator John Kerry in which he claims:

“The truth is that the threat we face is not an abstract concern for the future. It is already upon us and its effects are being felt worldwide, right now. Scientists project that the Arctic will be ice-free in the summer of 2013. Not in 2050, but four years from now.”"

Another option is to look at data on sea level change: 3.4 +/- 4 mm/yr. The happy scenario is that sea level increases only 30 cm in a century. However, there are two additional issues: a) subsidence in unconsolidated sediments (New Orleans, New York, Bangkok, Jakarta, Tokyo), and b) the rate of sea level increase departing from the linear rate.

Arctic ice is interesting, but even more interesting is the sea knocking at your front door.

Yes, those are good points. But it's a problem that often gets exaggerated. Coastal cities should enact local building codes that encourage and promote a reasonable safety factor regarding height above ground and keeping clear of low areas.

Let's face it, a rise of 1 foot over the next 100 years is hardly catastrophic.

I'm sure the residents of Kiribati, Maldives, Fiji, Palau, Micronesia, Cape Verde, Solomon Islands, and Tangier Island, Virginia will be relieved to know that a foot of sea level rise is no threat.

"all the climate models involving Arctic sea ice extent have been horribly flawed"

OK, on that basis, economics is not a science. All economic models are far worse in their ability to predict anything, even the past.

I.e., the theory of supply and demand is just totally false, clearly. We know that because economic models did not predict the crash in real estate prices in 1987 or 2007, and prices are just a matter of supply and demand. Living costs are just supply and demand.

If you can't predict the economic cycles with economic models, then every economic theory is false.

We see this kind of knock on so-called social sciences every day. What exactly is the point of restating it ad nauseam? If you don't want to call it science, don't. Call it economics; social psychology; sociology, or whatever. These pursuits will never yield models with perfect predictability. But, so what? Should we just throw our hands up and say "it's just too difficult"? That is not going to happen, in large part because the questions are interesting. Get over it and spare us the nausea. BTW, finance is really, really not much better. The same methods are used in economics frequently.

The point is how valid the conclusion of the study is, not whether it is an interesting question. Lots of dubious science is hyped and sold to the public, and individual and public policy are based on it even though the conclusions are suspect or even clearly bogus: dietary studies, climate science (especially paleo-climatology, but also the forward-looking models), macro-economics (almost all of it), etc. The point is to understand the error bars of the conclusions, and a large part of the error bars are the model selection and reporting biases. It's fine to study any of this even though it's hard, but don't abandon well-established principles of science just because your field can't measure up. "We don't really know the answer, the system is too complicated" is a fine answer to many questions.

It seems we largely agree. Where we disagree is on the extent to which well established principles of the scientific method (as distinct from science, which is what we're left with after all the rubbish is disposed of over time) are being abandoned. I'm all for accurate reporting. Ditto for not overselling findings. Pointing out error, or potential for it, in a precise way is far more productive than 'it's not science'. It is precisely because the error bars are so wide here that folks that want to sell books and gain influence and prestige among the powerful are able to. Still, there are many serious minded hard working folks getting their hands dirty for the right reasons and progressing these fields who do try to be accurate and not hype it up. Maybe there is a misallocation of talent to these fields, but that is another matter.

Agreed! This is a well expressed version of the main point.

There is a spectrum of "main science" (let's say "error bars" one dimensionally describe it) but the public can't tell the difference. And somehow that infects the main practitioners, no?

'If researchers test a hundred hypotheses, 5% will come up “statistically significant” even when the true effect in every case is zero.'

One commenter has already mocked one aspect of this, but really, do you think that the hypothesis is completely disconnected from how it is tested? Or that the testing, if done well, won't actually demonstrate that any single hypothesis was incorrect - that being part of the testing process, to not only prove but to disprove the hypothesis.

At least in reference to actual physical science.

'then reviewers could recommend publication regardless of the final outcome of the research'

Again, this would not seem to apply to the physical sciences, actually (note, this is from the first cite), where if an experiment designed to demonstrate X instead demonstrates Y, it is obvious that the paper's value is in the final outcome of the research, regardless of the decision of the reviewers beforehand.

'A special issue of Comparative Political Studies report on an experiment using results-free review'

Well, if we are talking about something other than the physical sciences, sure, you can make up whatever you can get away with.

'indeed, we should expect that most results are null results so this should give us, if anything, even more confidence in the paper'

It is sad to see a humanities scholar try to talk about their own work using the framework of physical science.

this is basically Rudyard Kipling's "If": If you can meet with Triumph and Disaster And treat those two impostors just the same...

Or Circa 2016:

If you can meet with Trump and Disaster And treat those two impostors just the same

I lived in Las Vegas in 1972. I learned this lesson well. I watched many a gambler win a few times in a row against the statistical odds, assume they were on a roll, bet the farm, and lose their ass. The old joke is that state-run lotteries are a tax paid by mathematically challenged citizens. If you don't understand statistics you will be fooled by statistics. And if you are dishonest and want to fool some of the people all of the time and all of the people some of the time, then statistics was made for that purpose.

Not all null results are equal. A null results with a tight confidence interval for an important question should be published, and this doesn't necessarily require results-free review. Top journals are starting to do this (the recent QJE null result on wealth shocks and health comes to mind), but certainly not often enough. In an ideal world, a few null results on your CV give you more credibility when you do find that surprising significant result.

Precisely. A project I'm most proud of was the publication of such a result in a reasonably good journal. It wouldn't have happened were it not for the good name of a coauthor, but I was pleasantly surprised it was published at all. I hope it is a trend that continues.

Am I the only one bothered by the use of "theory" instead of "hypothesis"?

Einstein had two papers on relativity theory and no research results.

He offered a number of hypotheses to test the theory. Many more have been made and continue to be made. The Hawking series "Genius" had lay people go through guided observations and then be guided to hypotheses which they then tested. For example, hypothesis: your speed changes how fast your time passes. And then two clocks were synced and one was moved up to a mountaintop, where it obviously moves at a higher speed. The result was that the clock moving faster ran slower than the clock that stayed at the foot of the mountain, as compared 24 hours later.

We know that relativity does not apply in all cases, we just don't know the boundaries. The interesting hypotheses are those that propose relativity theory does not apply. Even more interesting are those that propose neither relativity nor quantum mechanics nor standard model apply.

The theories are accepted as useful because they produce hypotheses that produce predicted results in tests.

I think the problem with Alex's paper, from his description, is that it casts too wide a net. I am skeptical of generic claims that regulation is either "good" or "bad" and also of claims that increasing entrepreneurship is "good" or "bad". Regulation can increase entrepreneurship merely by creating opportunities for businesses to help others deal with the regulation. Some regulations might impede the growth of firms beyond a certain size and thus create openings for entrepreneurs; others may increase barriers to entry and favor larger firms.

The virus (hypothesis "testing") unleashed on Science by the accursed frequentists (and exploited by Keynesians) is as hard to detect as it is deadly (when it comes to making good judgments). One feature that makes it especially stealthy is its exploitation of the widespread belief that complex systems like cancers and markets can sensibly be modeled like games of chance.

The moment it dawned on me that the premise of "randomness" in epidemiological studies was in fact just a way to cover up how useless most studies were came when I was preparing an expert epi for his trial testimony in a wrongful death case. I wanted to attack the other side's epi studies by attacking their premises, one of which was that doctors and their diagnostic expertise were randomly sprinkled about Texas, so that there was no need to worry about a chronic lymphocytic leukemia (CLL) cluster being due to a local preference for CLL over the alternate diagnosis of non-Hodgkin's lymphoma. I suspected that a prominent hematologist, who'd been a guest speaker at the CLL cluster county's medical association meeting years before and who was an expert on CLL, had indeed strongly influenced the local pathologists. I was really fired up about it, but then my avuncular expert said (paraphrasing) "Thanatos, if we look hard at the things they call random, they'll look at the things we call random, and the only premise left for either side will be post hoc ergo propter hoc; and the really dirty secret is that we probably can't even say which things came pre- and which things came post."

Let's face it, folks: most of what has been called science in recent decades is nothing but a Keynesian jobs program for all the unneeded science-credentialed workers churned out by colleges and universities. All they've managed to accomplish is digging millions of holes, and we're now at the stage of paying to have all of those holes filled back in. What comes next, now that the public is on to the ruse?

There is a big difference between "science" and "philosophy of science". Results-free review demands from reviewers a more philosophical than scientific approach.

This is the way financial audits are supposedly conducted. The auditors determine ahead of time what their sample size will be and write down what actions to take *before* the results are tabulated. At least that was what my recent Advanced Auditing course stated, and the instructor (a very experienced auditor) claimed that usually that is the way things are done.

I won't ever forget a graduate seminar in Real Estate Finance where a tenured professor presented a linear regression model with just over 120 independent variables. Unsurprisingly, he found six were significant at the 95% level, which is almost exactly the number you'd expect by chance alone from 120 noise variables. I was a graduate student at the time and this professor was bringing in lots of grant money, so I kept my mouth shut, but I wonder to this day how much money he brought in selling this "fancy" six-factor model.
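The arithmetic behind that anecdote is easy to check by simulation. The sketch below (hypothetical numbers: 500 observations, all-noise data) regresses a response on each of 120 pure-noise predictors; with every true coefficient exactly zero, roughly 120 × 0.05 = 6 of them should still clear the conventional significance bar:

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_vars = 500, 120

# All-noise design: 120 candidate predictors and a response,
# with every true effect exactly zero.
X = rng.standard_normal((n_obs, n_vars))
y = rng.standard_normal(n_obs)

# Correlation of each predictor with the response, converted to a
# t statistic; |t| > ~1.96 counts as "significant at the 95% level".
r = (X - X.mean(0)).T @ (y - y.mean()) / (n_obs * X.std(0) * y.std())
t = r * np.sqrt((n_obs - 2) / (1 - r**2))
significant = int((np.abs(t) > 1.96).sum())
print(significant)  # around 6 by chance alone
```

The exact count varies with the random seed, but it hovers around six, which is the whole point: a handful of "significant" factors is exactly what noise delivers for free.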

"Hobbes was right"

> 5% will come up “statistically significant” even when the true effect in every case is zero

This is scoffing at a misuse of statistics without, IMO, really understanding things much better than the offenders. In social science (and pretty much anywhere else one might use stats to test hypotheses) the true effect is almost never zero. It may be incredibly small, uninterestingly or unmeasurably small, but it's not going to be zero. The concern about declaring a result statistically significant when the true effect is zero is a non-problem; at least in the social sciences this never, ever happens. So it's a red herring to criticize naive use of hypothesis tests for getting this basically imaginary situation wrong.

The hypothesis being tested furthermore is not just going to say an effect is zero, but also assert other things (e.g. that a distribution is exactly normal). That just reinforces the picture: we can almost certainly _know_, as much as we know anything in existence, that such a hypothesis will be false. (Maybe it is "approximately" true, or close enough to be very useful for certain purposes, but at the end of the day: it is not TRUE). So we complain about a procedure that might "incorrectly" tell us that something isn't true (and in fact, it's not). Why exactly is that a big problem for science? (The answer is that the procedure, and misuses of it, are huge problems for science but NOT for this reason!)
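The commenter's argument can be made concrete. In the sketch below (hypothetical numbers throughout), the true effect is a trivial 0.001; with a million observations, a plain t-test still rejects the "effect is exactly zero" null decisively, illustrating that rejecting a point null tells you little when no real-world effect is exactly zero:

```python
import numpy as np

rng = np.random.default_rng(7)

# A negligible but nonzero true effect: mean 0.001, not 0.
# With enough data, the test flags it as "significant" anyway.
n = 1_000_000
x = rng.normal(loc=0.001, scale=0.1, size=n)
t = x.mean() / (x.std(ddof=1) / np.sqrt(n))
print(t)  # far beyond the 1.96 cutoff
```

Statistical significance here says nothing about whether the effect is large enough to matter; it only confirms the sample was big enough to detect a departure from an exactly-zero null that was never plausible to begin with.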

One thing physicists do that would translate well to this problem is try to place bounds on the size of the effect. So instead of searching for dark matter and not finding any, the conclusion becomes "the possible density of XXX species of dark matter is no greater than Y." And then they follow up when technology improves to the point where they can get a better bound on Y.
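That physics-style move, reporting a bound instead of a bare null, is mechanically simple. A minimal sketch, assuming a hypothetical measurement of 1,000 samples whose true effect is zero: instead of reporting "no effect found", report a one-sided 95% upper confidence bound on how large the effect could plausibly be:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical measurement: 1000 noisy samples, true effect zero.
samples = rng.normal(loc=0.0, scale=1.0, size=1000)
est = samples.mean()
se = samples.std(ddof=1) / np.sqrt(len(samples))

# One-sided 95% upper bound: estimate + 1.645 standard errors.
upper = est + 1.645 * se
print(f"effect <= {upper:.3f} at 95% confidence")
```

A follow-up experiment with more data or a better instrument shrinks the standard error and tightens the bound, which is exactly the "improve Y over time" dynamic the comment describes.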

A well designed experiment should still be publishable if it gets a negative result. One thing I always wonder about in the fields that have replication problems is whether the experiments were designed around getting a positive result. There was a (fake) study a few years ago relating dirty train stations to negative views of immigrants, and I remember wondering how the authors ever expected to publish if they got a negative result. As it turned out, the results were fake, but to my eye the whole concept was fake. It was never designed to advance knowledge, it was just designed to be a headline.

The real problem is not with *publishing* negative results. It's with other researchers *reading and interacting* with negative results.

If you're measuring something that is intrinsically useful to know, the value of that measurement is really important. People actually care about whether dark matter exists and what its composition might be. That's why a negative result is publishable -- excluding some theories is just as important as supporting others.

The way some fields seem to work is that everybody comes up with their own hypothesis, designs their own tests, and publishes papers proving they were right all along. A null result means no paper, because all papers must prove the author was right. People find statistical flaws in the last step and blame it on the journals. But the problem is the whole chain, not the last step!

If you really want reform, you should start by drawing a distinction between theory and experiment. Experimenters are expected to be agnostic about theories, and it's seen as poor form to test your own hypotheses. Experiments are designed to measure the size of effects, and improving the bounds on an existing measurement without changing the value is seen as a publishable result. That's real reform. Arguing about p-values is just sweeping the problem under a different rug.
