Psychology journal bans significance testing

This is perhaps the first real crack in the wall for the almost-universal use of the null hypothesis significance testing procedure (NHSTP). The journal, Basic and Applied Social Psychology (BASP), has banned the use of NHSTP and related statistical procedures from their journal. They had previously stated that the use of these statistical methods was no longer required but could optionally be included. Now they have proceeded to a full ban.

The type of analysis being banned is often called a frequentist analysis, and we have been highly critical in the pages of SBM of overreliance on such methods. This is the iconic p-value, where p < 0.05 is generally considered statistically significant.

There is more here, with further interesting points in the piece, via Mark Thorson.


If no inferential statistics are reported, how is the reader to judge the reliability of a study's findings? Will the authors write things like "we obtained an effect size of 0.3, and we personally feel that this is a robust finding, and hope that the reader agrees"? How are these studies to be used in meta-analyses -- I guess other researchers will have to calculate p values or confidence intervals based on descriptive stats.

It's an interesting experiment from the journal, but it seems that things like pre-registration, requirement for larger samples, and the mandatory publication of raw data would do much more for the reliability of science.

It sounds like they're also pushing for larger sample sizes. Also, Bayesian methods have not been banned (nor are they required). Rather, it seems that if you're using a Bayesian method it comes down to whether or not the editors like your priors.

Of course, the journal would be free to push for larger sample sizes without implementing this rule.

Those seem like almost entirely unrelated goals, though. It's not clear that one need entail the other.

"If no inferential statistics are reported, how is the reader to judge the reliability of a study’s findings?"

From the link within the linked article the push is to use confidence intervals. It's also a well done video:

The journal in question has banned confidence intervals, too.

The core issue is rather:

“If FALSE inferential statistics are reported, how is the reader to judge the reliability of a study’s findings?”

Most published scientific inferential research (especially in psychology and biomedicine) cannot be replicated/validated.

And the entire hypothesis-testing process assumes we can learn facts/truth by rejecting straw-man null hypotheses, with a simple true/false approach to research, without gray areas. Scientists and the public have been conditioned to believe a research conclusion is TRUE ... if it is judged statistically significant (by any standard) and published by a reputable journal.

But statistical significance can be tortured out of any set of random data, and the original detailed data set and methodology are rarely available for serious replication attempts.

This little wall crack will become a Babel Tower demolition.

"the mandatory publication of raw data would do much more for the reliability of science."

This would be nice in certain respects, but it would seriously disincentivize the development of data in the first place.

It's not 1823 anymore. You don't sit in a lab and build a usable dataset in an afternoon of tabletop experimentation. People invest years and a lot of money creating data they can use to discover things; it's reasonable to let them have some monopoly access to what they've built.

Maybe lockbox it and publish after three years? Five years?

I completely disagree with this. People invest time to create data which they then milk for years, often producing wrong and/or misleading results. In a perfect world, I might go along with granting them monopoly rights to their data. But in this world, where the analysis is often (usually) done for money or influence, protecting their data should not be tolerated. My respect for journals and policy-makers would grow enormously if they simply said they will not publish or pay attention to studies that do not release their data. No need for legal or regulatory oversight - simply voluntary behavior would go a long way toward addressing the abuses. I am tired of seeing economists hide behind "proprietary" data and then believe that people should pay attention to their purported "results."

It's a tradeoff. Require instant publishing of data and code, and you'll get less science. Some kinds of problems just won't get worked on.

I understand there have been some terrible manipulations that hide behind proprietary data. But five years is enough to get out a few papers, while not being so far away that the reputational damage from discovered fraud is some remote future.

Journals should require all source code and data be delivered to reviewers, who should compile and run the thing as part of their process, and then archive it for a period. We'd get most of the benefit of full openness while preserving the incentive to create data.

Can you give an example? Where a lot of money / years are devoted to data generation? Who paid for that years of research? Presumably somebody like NSF / NHS / NIH / DoE / some trust or foundation?

Do these bodies care about retaining their monopoly rights on data? On earning rents from the data? At the expense of having unvalidated, very likely wrong results?

Is data generation the goal or merely a means to the goal? If NSF paid for the data generation as the means to get (say) a better understanding of a disease, obviously having many people work simultaneously on that dataset must be a positive outcome?

Actually, nobody is even asking them to reveal their full dataset. But at least show us enough raw data to critically and independently examine the conclusions you claim.

Well, to pick an obvious example, NASA data (even things as simple as pictures on occasion) are generally embargoed for a year to allow principal investigators time to write papers. Otherwise, there's no incentive to be the PI. That's not exactly the same, but it's pretty similar.

I do a lot of work using data I need to buy and can't produce, even if I want to. Other people spent years and millions of dollars building the datasets and businesses around them. If I couldn't use it, I couldn't solve the problems I need to solve.

Is your argument really that the data is valueless? That's the whole point, isn't it? It's hard to make, and it's really useful. Require people give it away and you'll get less of it.

Data should be open unless it compromises defense or was funded by investment from pharma/oil/food/etc. companies. Actually, the military does not publish anything even after a long time, and companies apply for a patent instead of writing an article for a journal. The clients of scientific journals are precisely the people who should publish their data, since they are paid by the non-military parts of governments and NGOs. I understand that raw data is not published because reading an article more than 8-10 pages long is unproductive and boring. But, in the age of low storage costs, the PDF article can simply link to a data repository.

Indeed, public data repositories are not an internet-age thing; they existed 40 years ago.

Who paid?

I'd hope every public funded project would become public good.

If you are a private venture, not trying to convince me of anything, keep it all.

If you are a private venture, trying to convince me of something, data might help.

We will get less published! But the quality will be better.

Why not separate the data from the science? A market solution would be to license data to researchers or for GOV to pay for data sets which become open source. Am I being naive here?

"Am I being naive here?"

A little bit. In some instances, that amounts to buying a company and continuing to operate it. That's not generally something the government does. It would also lead some people to start to doubt the data quality.

"Require instant publishing of data and code, and you’ll get less science."

You'll get less BAD science.

Aren't the lion's share of published scientific results performed by academics who work (nominally) for state universities or private non-profits? It's not as though they are investing their own money to do the research.

They're investing years of their careers, and grant money they could be allocating to other purposes.

@Lord Action

I'm ok with embargoes. You take your time to exploit your data monopoly in secret. All we ask is that when you release your conclusions to the world we want to see the supporting raw data. I think that's fair. (It's a different question whether NASA granting in house researchers a monopoly is a good use of public funds.)

My point is *not* that the data is valueless. But that in most cases the data derives its value from the conclusions it generates. And whoever funded the data generation mostly funded it for those conclusions.

Releasing the data once it is generated gets the funder both more prolific analysis & a better quality of conclusions due to transparency & validation.

If you want to own something, work with private funds. Works in all other areas of the economy, right?


Then I don't think we're far apart. I'd much prefer release of data and code, but I think some sort of delay is necessary or it would be stifling.

Even with a delay a policy like this will preclude research that really requires the data stay private. You often encounter that in finance, for example, where trading records or credit card data would be impossible to effectively de-identify, and people won't give you access without legal assurance the data won't get out.

And again, what are you supposed to do about data you have to buy? Publicly funded research just can't buy data?

@Lord Action

Why a delay? I say, Journals shouldn't even start the review process before data / code are submitted. And both should be posted online simultaneously with the actual paper.

Trading records & credit card transaction data can be scrubbed & anonymized. I think. Don't they release medical records similarly too (though shit happens!)? In any case I'm OK with some exceptions being made for such sensitive data. (Though even if we did take a hard line, I'm not sure it'd be a huge loss to the world.)

What sort of data do you end up buying? I don't know much about such areas. Can you elaborate?

If there's no delay, you get one paper and one shot with your data and code. If there's a delay, you might get a few years of productivity.

Trading data might be able to be scrubbed if you're talking about GM equity, but not if you're talking about something thinly traded. I can often fill in the blanks and figure out counterparties even when I only have one side of the trading, for example.

I buy a lot of finance stuff. Prices. Models of structured finance things. Performance data for financial instruments. There are firms that deal in this stuff, and have staff and software that put it all together, and they don't want to give it away to the public - it's their business. They've negotiated agreements with broker dealers; they've written software to parse regulatory filings; they audit and correct mistakes.

I really do agree that data and code should be submitted as part of the review if at all possible. I'd have to think NDAs would usually make that acceptable, even in the privacy or proprietary situations we're discussing.

I'd prefer that data and code be made public after some delay. I argue the delay is desirable, as it incents the creation of novel data and code. And I say "prefer" rather than "demand" because I think sometimes it just won't work, and that that doesn't completely devalue the sort of research that can't be so public.

I'm sorry, that's total crap. Academics publish their findings anyway. Their "monopoly" over the raw data does nothing but hurt the credibility of those findings. The reason raw data is rarely seen and almost never useful is our collective sloppiness.

We put things in binary files that can only be read by outdated versions of MATLAB, and if we ever manage to open them, we just find two arrays of numbers named "b1" and "q". There is no real incentive to do better, because there are seven other steps of processing that people would still need to take.

There are serious technical barriers to proper data publishing, and they have been around in various forms since well before 1823. With the right software, education and culture we can probably fix it -- but we are only just beginning to recognise the problem.


I don't see the big technical barriers. Unless your data is really complex or huge.

Dumping your data in a commonly used format or even better a CSV or tab delimited text file and some intelligent commenting would go a long way.

The point we are at right now I see tons of low hanging fruit.

You're assuming "data" means "a table with some numbers in it" when what's often important (for example in climate papers) is how you get from raw data to the database you actually model with. In between you have code that takes you from 6,500 binaries, text files, scribbled notes, etc, cleans all that in a variety of ways, and brings it together into something you can work with. That's where a lot of the complexity (and code, and room for doubt) actually lies. This is normal today, in all sorts of fields.

If we had Rahul's suggestion, we wouldn't have climate science. With today's world, we have climate science but a massive credibility problem (that nobody inside the club seems that worried about). With three or five year delay, we'd have science, and the credibility concerns would be addressed. It would be great to see what's actually true.

Okay, okay, let's take a turn for the political here: how much has the Party of Science doctrine watered down the veracity (for lack of a better word) of science? Let me take a step backwards. If "monopoly over the raw data does nothing but hurt the credibility of their findings", but a large subset of the population is willing to accept anything that comes out of a journal, how much does obscuring data really affect public perception of the research?

Good for psychologists. Not only do they recognize the limitations of p values, but they also encourage both preliminary and confirmatory research. Economists use a strict, yet vaguely defined standard of rigor and robustness. There is reluctance to publish important preliminary work that is not econometrically sophisticated. Yet there is also reluctance to publish confirmatory or supplementary results that either confirm or modify previous research. It's like saying that if one published paper showed using powerful new methods that certain drugs clearly benefited women with breast cancer, the journal might reject a followup paper as uninteresting that showed that the benefits were different for older and younger women, or potentially not robust to which climate the women lived in. Of course we don't see that in medicine, but the analogous publishing issue is in fact quite common for the top econ journals.

Or they just don't want to be accountable for robustness. Occam's Razor.

"This is also often described as torturing the data until it confesses. In a 2009 systematic review, 33.7% of scientists surveyed admitted to engaging in questionable research practices – such as those that result in p-hacking. The temptation is simply too great, and the rationalizations too easy – I'll just keep collecting data until it wanders randomly over the 0.05 p-value level, and then stop."
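The strategy the quote describes is easy to simulate. Here's a minimal stdlib sketch of "collect until significant" (a one-sample z-test on data with a known unit variance; the batch size, number of looks, and trial count are illustrative, not from the article): even though the null is true, peeking after every batch and stopping at p < 0.05 inflates the false-positive rate well above the nominal 5%.

```python
import math
import random

def peeking_experiment(rng, batch=10, max_batches=20):
    """Add a batch, test, stop as soon as p < 0.05. The null is true throughout."""
    data = []
    for _ in range(max_batches):
        data.extend(rng.gauss(0.0, 1.0) for _ in range(batch))  # no real effect
        n = len(data)
        z = (sum(data) / n) * math.sqrt(n)       # z-test, known unit variance
        p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value
        if p < 0.05:
            return True                           # "significant" -- stop and publish
    return False

rng = random.Random(0)
trials = 2000
false_hits = sum(peeking_experiment(rng) for _ in range(trials))
print(f"nominal alpha: 0.05, realized false-positive rate: {false_hits / trials:.3f}")
```

With twenty looks at the data, the realized rate lands in the vicinity of 20-25% rather than 5%, which is the whole mechanism behind this flavor of p-hacking.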

I agree with the problem, but not the solution.

Banning people from publishing useful information just because it is sometimes used incorrectly is counterproductive and doesn't get at the root of the problem.

Researchers will still conduct dubious research and exaggerate how important it is. They will just use a word other than "significant" to describe it, and readers will have fewer tools with which to judge the strength of the work.

Agree. An "effect size" is actually worse than the point estimate / standard error / p-value trilogy; it conveys less information. Ridiculous, and thankfully I didn't end up doing psych.

Whether NHST yields useful information is itself quite suspect, isn't it?

The usefulness of the p < 0.05 cutoff point is highly debatable. If they had simply banned this, and the arbitrary designation of certain results as "significant", I would have had no objection. This makes sense, actually.

But they banned not only the cutoff point, but all calculation of p values and confidence intervals. This is insane. In response to worries that too many results are just random noise masquerading as meaningful findings, they have banned the major statistical tests that help one differentiate between the two (except for some Bayesian analysis, at their discretion).

Even without an arbitrary cutoff what about fishing? I wonder why they are not insisting on pre-registration though.

Fishing is a problem, but banning these tests doesn't solve that. It just reduces the information readers have to evaluate the quality of the work.

So because some people do bad statistics, they've decided to ban some of the most commonly used tools in statistics?

There's a reason confidence intervals and p-values are so widely accepted and have been the standard in academic research for decades. They're excellent summaries of uncertainty about point estimates and conclusions.

I've been somewhat skeptical when the disreputable Mr. Sailer has offered the opinion that the published oeuvre of social psychology is a remainder bin of junk and trivia. Maybe he had a hunch that there was a systematic problem in the commanding heights of the subdiscipline.

Mr. Sailer thinks that an awful lot of society is junk.

Soc Psych has brought us a 250:1 ideological bias and the explanation that it's just because "Conservatives are stupid", and that's according to the top echelon of Soc Psych. The whole field is an exercise in confirmation bias.

The Austrians at GMU will be throwing a party to celebrate the news!

Two points: first, BASP is a middle-of-the-road journal, certainly not a leader in psychology.

Second, this is a statistical joke, but I don't think the journal editors are in on it. Basically they want p < .00X rather than p < .05, but they won't tell you what X is.

"However, BASP will require strong descriptive statistics, including effect sizes. We also encourage the presentation of frequency or distributional data when this is feasible. Finally, we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem."

p = f( effect size, data distribution, sample size )
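To make that relationship concrete, here's a small stdlib sketch: holding the observed effect size and the distribution fixed, the p-value is driven by sample size alone (a two-sample z-test with known unit variance; the d = 0.3 effect and the group sizes are illustrative assumptions, not numbers from the article).

```python
import math

def p_from_effect(d, n):
    """Two-sided p for an observed standardized difference d, n subjects per group."""
    z = d / math.sqrt(2.0 / n)                # standard error of a mean difference
    return math.erfc(abs(z) / math.sqrt(2))   # normal-approximation p-value

# The identical observed effect crosses the 0.05 line purely by growing n.
for n in (10, 50, 200, 1000):
    print(f"n = {n:5d}  ->  p = {p_from_effect(0.3, n):.4f}")
```

The same d = 0.3 is "nonsignificant" at n = 10 and overwhelmingly "significant" at n = 1000, which is why the journal's push for larger samples and stable descriptive statistics is entangled with the p-value it just banned.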

Here is a nice website that I use to explain this phenomenon.

And this blurb:

Another problem with the p-value is that it is not highly replicable. This is demonstrated nicely by Geoff Cumming as illustrated with a video. He shows, using computer simulation, that if one study achieves a p-value of 0.05, this does not predict that an exact replication will also yield the same p-value. Using the p-value as the final arbiter of whether or not to accept or reject the null hypothesis is therefore highly unreliable.
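Cumming's demonstration is easy to reproduce with nothing but the standard library. A sketch, assuming a two-sample z-test with a true effect of half a standard deviation and 32 subjects per group (those numbers are illustrative choices, not taken from his video): exact replications of the same well-powered experiment scatter their p-values from far below 0.05 to far above it.

```python
import math
import random

def two_sample_p(n, effect, rng):
    """One simulated experiment: n per group, true effect in SD units, unit variance."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(effect, 1.0) for _ in range(n)]
    diff = sum(b) / n - sum(a) / n
    z = diff / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value

rng = random.Random(42)
ps = sorted(two_sample_p(32, 0.5, rng) for _ in range(25))  # 25 exact replications
print(f"min p = {ps[0]:.4f}, median p = {ps[12]:.4f}, max p = {ps[-1]:.4f}")
print("share significant at 0.05:", sum(p < 0.05 for p in ps) / len(ps))
```

At these settings the test has roughly 50% power, so about half the replications clear the 0.05 bar and half don't - the "dance of the p-values".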

That quote reflects a fundamental misunderstanding of what a p-value is. It is a random value dependent on the sampled data. When you take a different sample, you will get different results. If you get the same p-value every time, you're either doing it wrong or your example is trivial (e.g., your sample is the entire population, or your outcomes are deterministic).

The 0.05 threshold is of course arbitrary. That a random value doesn't achieve a particular threshold with 100% probability doesn't mean that the approach to derive that value is flawed. The notion that "0.05" has magical properties for p-values is absurd on its face, which is why the actual p-value should be reported rather than simply whether or not it meets some threshold.

Have you seen those graphs that collected p-values from published papers & show the spike right at the threshold?

Oh, I'm well aware that academic publications have huge false discovery problems. Benjamini and Hochberg is a good place to start if you're interested in ways to address this sort of thing. But that's not at all related to the blurb you quoted, which fundamentally misunderstands what p-values are.
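For the curious, the Benjamini-Hochberg step-up procedure mentioned above is short enough to sketch in a few lines. It controls the false discovery rate at level q across a family of m tests, rather than banning p-values outright (the p-values in the example below are made up for illustration).

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return the (sorted) indices of hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ranks 1..m by p-value
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:  # step-up criterion
            k = rank                  # remember the largest passing rank
    return sorted(order[:k])          # reject everything up to that rank

# Example: two strong signals, two borderline values, a batch of noise.
ps = [0.001, 0.008, 0.039, 0.041, 0.20, 0.35, 0.52, 0.74]
print(benjamini_hochberg(ps, q=0.05))  # -> [0, 1]
```

Note that 0.039 and 0.041 would each be "significant" in isolation at 0.05, but the procedure correctly declines to reject them once the whole family of eight tests is accounted for.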

@mavery - thx, I took a course in statistics but am not a practitioner. If you read this, perhaps you can comment on why confidence limits are considered better than p-values. I have an intuitive understanding why, but am interested on what an expert thinks.


They're not "better" necessarily, they just talk about different things. Confidence intervals (in the traditional, non-Bayesian sense) are also prone to misinterpretation, since intuitive interpretation of them ("There's a 95% chance that the real mean is in this interval!" and "There's a [p-value] probability that the null hypothesis is true!") is both simpler than the actual meaning and is also closer to what we'd want the values to actually be talking about. Confidence intervals are nice because they give you a sense of the magnitude of the uncertainty surrounding your estimate. P-values on the other hand tell you how strong your evidence against a particular conclusion is.

If the hypothesis test is well defined and intuitive, this can be very useful. (e.g., "Does the new drug improve survival rates at five years after treatment?") A marginal p-value would lead me to ask for further testing, whereas a very small p-value would make the new drug much more appealing.

Now, that said, you'd still want to look at exactly what those survival rates meant. Maybe it's a difference between 0.03% and 0.06%. Sure, you've doubled the probability, but it's still pretty small. So you can't just take the p-value and ignore the context surrounding it.
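To make the contrast concrete, here is a stdlib sketch computing both summaries from the same (made-up) point estimate and standard error, using the usual normal approximations:

```python
import math

est, se = 0.30, 0.12                      # hypothetical estimate and its std. error
z = est / se
p = math.erfc(abs(z) / math.sqrt(2))      # two-sided p vs. a null of zero effect
ci = (est - 1.96 * se, est + 1.96 * se)   # 95% confidence interval

print(f"p = {p:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# The CI reports the magnitude and its uncertainty; the p-value only says
# how surprising the data would be under the null of no effect.
```

Both numbers come from the identical estimate-and-error pair, which is why neither is strictly "better": they answer different questions about the same evidence.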

Instead of inexpensive blanket rules like this, journals should earn their outsize fees by hiring staff statisticians to review papers (given that peers in most fields aren't equipped to do so.)

Likewise, universities should fire some deanlets and deanlings and replace them with staff statisticians -- either in the library or in individual departments.


But maybe we don't need to fire those deanlets. I think we need an incentive for researchers to cross borders. Most colleges do have a Math Department...

The problem is that these papers aren't presenting novel stats problems that would qualify as research to an academic statistician. That's why I think you need paid, dedicated, professionals to assist.

From the article:

"However, the p-value was never meant to be the sole measure of whether or not a particular hypothesis is true."

I was taught that hypotheses are either falsified or not falsified, but never "true". Kinda like how people are found guilty or not guilty, but never found "innocent".

On the other hand, a friend who is a developmental psychologist says that social psych people are seen as the "soft white underbelly" of the psych world, the bottom of the food chain, not taken too seriously by other sub-disciplines. Maybe this validates that notion.

Sounds like a job for differential privacy!

When I first studied probability and statistics everything got enormously easier when I realised that much of the difficulty was not mathematical. Instead it stemmed from the appallingly constipated, or even confused, writing that the field seemed plagued with. As an example, consider that link's definition "The p value is the probability to obtain an effect equal to or more extreme than the one observed presuming the null hypothesis of no effect is true." Ugh! Really, students should hurl things at any lecturer who expresses himself so badly.

So would you like to reword it for us?

Seriously. If you can find a clearer way of expressing that thought that doesn't admit incorrect interpretation, I'd love to hear it. I try to go with things like, "How unlikely is it that your result could happen due to random chance?" but that's not a formal definition like the one you stated. Formal definitions tend to be complex. I'm not sure how physicists formally define, say, motion, but I would imagine it sounds equally obtuse.

In the case of p-values, I suspect the rampant misunderstandings are because their definition answers a question that no one is really ever asking.

Hence people just make a mental jump & assume that a p-value means whatever they want it to mean.

What the devil is "the probability to obtain"? That's just not English. Why does he shoehorn into the sentence the meaning of a null hypothesis? That should be in a preamble. Why the babyish "presuming the null hypothesis ... is true" rather than 'assuming the truth of the null hypothesis'? Et bloody cetera. It's just dreadful writing. It's also incomplete, in that he needs a separate sentence to explain what he means by "more extreme than".

My thoughts on publishing reform:

(1) All papers declare clearly whether they are exploratory or not.

(2) Papers that claim to be non-exploratory must have their study goals & methodology pre-registered.

(3) Funding agencies dedicate a portion of their budgets for replication studies. If a study is worth funding the results are worth replicating. Replication is non-glamorous work so funding might compensate.

(4) Authors must post all raw data online *before* the paper gets accepted for publication (with limited exceptions)

(5) Rather than just p-values, authors to be pushed to add some metric of importance / real world significance / impact / cost-benefit / economics etc. to their papers.

(6) A Journal should have an independent statistical review of articles possibly by a staff statistician

(7) Publish the names of reviewers along with an article's authors.

Not a bad wish list, but I would certainly like to see a cost benefit analysis of implementing it.

Also, to #4 I would add that any algorithm or math used in the analysis of said data must be fully detailed. Ergo, no "here's the data and here's the result of my computer model, but the model itself is proprietary." I'd probably just call that the No Black Box rule.

This ban is kind of a double-edged knife. It's like banning curse words at church: yep, it's fine and the expected behavior, but the ban doesn't speak well of churchgoers. Is shame the best tool to make people change the way they behave? Psychologists are weird.

Why single out psychologists? Have doctors, pharmacists, economists, biologists etc. voluntarily stopped cursing....oops I mean using p-values?

Well, the day an economics journal bans the use of p-values they'll join the psychologists.

This is lazy editing/refereeing. Just reject papers that don't interpret the p-values appropriately.

I just looked this journal up. Gasp!

Who would read such dreck? There seem to be more than a hundred similar journals published each quarter on psychology, social psychology or sociology. Too many.

I defy anyone to find a worthwhile article in this waste of wood pulp.

So BASP has morphed into the "readers digest". What's next? Basic and Applied National Enquirer?
