Psychology journal bans significance testing

February 26, 2015 at 7:43 am in Science

This is perhaps the first real crack in the wall for the almost-universal use of the null hypothesis significance testing procedure (NHSTP). The journal, Basic and Applied Social Psychology (BASP), has banned the use of NHSTP and related statistical procedures from their journal. They had previously stated that use of these statistical methods was no longer required but could optionally be included. Now they have proceeded to a full ban.

The type of analysis being banned is often called a frequentist analysis, and we have been highly critical in the pages of SBM of overreliance on such methods. This is the approach behind the iconic p-value, where p < 0.05 is generally considered to be statistically significant.

There is more here, with further interesting points in the piece, via Mark Thorson.

1 P February 26, 2015 at 7:55 am

If no inferential statistics are reported, how is the reader to judge the reliability of a study’s findings? Will the authors write things like “we obtained an effect size of 0.3, and we personally feel that this is a robust finding, and hope that the reader agrees”? How are these studies to be used in meta-analyses — I guess other researchers will have to calculate p values or confidence intervals based on descriptive stats.

It’s an interesting experiment from the journal, but it seems that things like pre-registration, requirement for larger samples, and the mandatory publication of raw data would do much more for the reliability of science.
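On the point above about reconstructing inferential statistics from descriptive ones: when a paper reports group means, standard deviations, and sample sizes, a reader or meta-analyst can recover a p-value and a confidence interval directly. A minimal sketch, with invented numbers, using scipy:

```python
# Sketch: recovering inferential statistics from descriptive stats alone.
# The means, SDs, and sample sizes below are hypothetical, not from any actual study.
import math
from scipy import stats

m1, sd1, n1 = 5.2, 1.1, 40   # group 1: mean, SD, sample size
m2, sd2, n2 = 4.6, 1.3, 40   # group 2

# Welch's t-test computed directly from the summary statistics
t, p = stats.ttest_ind_from_stats(m1, sd1, n1, m2, sd2, n2, equal_var=False)

# Approximate 95% CI for the difference in means (normal approximation)
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
diff = m1 - m2
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"t = {t:.2f}, p = {p:.4f}, 95% CI for the difference = ({ci[0]:.2f}, {ci[1]:.2f})")
```

So the ban is more of an inconvenience than a wall: anyone with the descriptives can put the banned numbers back.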

2 Nylund February 26, 2015 at 8:21 am

It sounds like they’re also pushing for larger sample sizes. Also, Bayesian methods have not been banned (nor are they required). Rather, it seems that if you’re using a Bayesian method it comes down to whether or not the editors like your priors.

3 dan1111 February 26, 2015 at 8:24 am

Of course, the journal would be free to push for larger sample sizes without implementing this rule.

4 Colin February 26, 2015 at 1:23 pm

Those seem like almost entirely unrelated goals, though. It’s not clear that one need entail the other.

5 JWatts February 26, 2015 at 10:18 am

“If no inferential statistics are reported, how is the reader to judge the reliability of a study’s findings?”

From the link within the linked article, the push is to use confidence intervals. It's also a well-done video: https://www.youtube.com/watch?v=ez4DgdurRPg

6 P February 26, 2015 at 10:39 am

The journal in question has banned confidence intervals, too.

7 barrillo February 26, 2015 at 12:26 pm

The core issue is rather:

“If FALSE inferential statistics are reported, how is the reader to judge the reliability of a study’s findings?”

Most published scientific inferential research (especially psychology & bio-med) cannot be replicated/validated.

And the entire hypothesis-testing process assumes we can learn facts/truth by rejecting straw-man null hypotheses, with a simple true/false approach to research, without gray areas. Scientists and the public have been conditioned to believe a research conclusion is TRUE … if it is judged statistically significant (by any standard) and published by a reputable journal.

But statistical significance can be tortured out of any set of random data, and the original detailed data set & methodology are rarely available for serious replication attempts.

This little wall crack will become a Babel Tower demolition.

8 Lord Action February 26, 2015 at 10:50 am

“the mandatory publication of raw data would do much more for the reliability of science.”

This would be nice in certain respects, but it would seriously disincentivize the development of data in the first place.

It’s not 1823 anymore. You don’t sit in a lab and build a usable dataset in an afternoon of tabletop experimentation. People invest years and a lot of money creating data they can use to discover things; it’s reasonable to let them have some monopoly access to what they’ve built.

Maybe lockbox it and publish after three years? Five years?

9 Dale February 26, 2015 at 11:16 am

I completely disagree with this. People invest time to create data which they then milk for years, often producing wrong and/or misleading results. In a perfect world, I might go along with granting them monopoly rights to their data. But in this world, where the analysis is often (usually) done for money or influence, protecting their data should not be tolerated. My respect for journals and policy-makers would grow enormously if they simply said they will not publish or pay attention to studies that do not release their data. No need for legal or regulatory oversight – simply voluntary behavior would go a long way to addressing the abuses. I am tired of seeing economists hide behind “proprietary” data and then believe that people should pay attention to their purported “results.”

10 Lord Action February 26, 2015 at 11:29 am

It’s a tradeoff. Require instant publishing of data and code, and you’ll get less science. Some kinds of problems just won’t get worked on.

I understand there have been some terrible manipulations that hide behind proprietary data. But five years is enough to get out a few papers, while not being so far away that the reputational damage from discovered fraud is some remote future.

Journals should require all source code and data be delivered to reviewers, who should compile and run the thing as part of their process, and then archive it for a period. We’d get most of the benefit of full openness while preserving the incentive to create data.

11 Rahul February 26, 2015 at 11:49 am

Can you give an example? Where a lot of money / years are devoted to data generation? Who paid for those years of research? Presumably somebody like NSF / NHS / NIH / DoE / some trust or foundation?

Do these bodies care about retaining their monopoly rights on data? On earning rents from the data? At the expense of having unvalidated, very likely wrong results?

Is data generation the goal or merely a means to the goal? If NSF paid for the data generation as the means to get (say) a better understanding of a disease, obviously having many people work simultaneously on that dataset must be a positive outcome?

Actually, nobody is even asking them to reveal their full dataset. But at least show us enough raw data to critically & independently examine the conclusions you claim.

12 Lord Action February 26, 2015 at 12:12 pm

Well, to pick an obvious example, NASA data (even things as simple as pictures on occasion) are generally embargoed for a year to allow principal investigators time to write papers. Otherwise, there’s no incentive to be the PI. That’s not exactly the same, but it’s pretty similar.

I do a lot of work using data I need to buy and can’t produce, even if I want to. Other people spent years and millions of dollars building the datasets and businesses around them. If I couldn’t use it, I couldn’t solve the problems I need to solve.

Is your argument really that the data is valueless? That’s the whole point, isn’t it? It’s hard to make, and it’s really useful. Require people give it away and you’ll get less of it.

13 Axa February 26, 2015 at 12:14 pm

Data should be open unless it compromises defense or was funded by investment from pharma/oil/food/etc. companies. Actually, the military does not publish anything even after a long time, and companies apply for a patent instead of writing an article for a journal. The clients of scientific journals are precisely the people who should publish their data, since they are paid by the non-military parts of governments and by NGOs. I understand that raw data is not published because reading an article more than 8-10 pages long is unproductive and boring. But in the age of low storage costs, the PDF article can just link to a data repository.

Indeed, public data repositories are not an internet-age thing; they existed 40 years ago: http://www.geosociety.org/pubs/drpint.htm

14 john February 26, 2015 at 1:22 pm

Who paid?

I’d hope every public funded project would become public good.

If you are a private venture, not trying to convince me of anything, keep it all.

If you are a private venture, trying to convince me of something, data might help.

15 Pshrnk February 26, 2015 at 1:32 pm

We will get less published! But the quality will be better.

16 Thomas February 27, 2015 at 12:02 am

Why not separate the data from the science? A market solution would be to license data to researchers or for GOV to pay for data sets which become open source. Am I being naive here?

17 Lord Action February 27, 2015 at 9:20 am

“Am I being naive here?”

A little bit. In some instances, that amounts to buying a company and continuing to operate it. That’s not generally something the government does. It would also lead some people to start to doubt the data quality.

18 eddie March 1, 2015 at 10:02 am

“Require instant publishing of data and code, and you’ll get less science.”

You’ll get less BAD science.

19 Hopaulius February 26, 2015 at 11:47 am

Isn't the lion's share of published scientific results produced by academics who work (nominally) for state universities or private non-profits? It's not as though they are investing their own money to do the research.

20 Lord Action February 26, 2015 at 12:14 pm

They’re investing years of their careers, and grant money they could be allocating to other purposes.

21 Rahul February 26, 2015 at 12:34 pm

@Lord Action

I’m ok with embargoes. You take your time to exploit your data monopoly in secret. All we ask is that when you release your conclusions to the world we want to see the supporting raw data. I think that’s fair. (It’s a different question whether NASA granting in house researchers a monopoly is a good use of public funds.)

My point is *not* that the data is valueless. But that in most cases the data derives its value from the conclusions it generates. And whoever funded the data generation mostly funded it for those conclusions.

Releasing the data once it is generated gets the funder both more prolific analysis & a better quality of conclusions due to transparency & validation.

22 john February 26, 2015 at 1:23 pm

If you want to own something, work with private funds. Works in all other areas of the economy, right?

23 Lord Action February 26, 2015 at 1:31 pm

@Rahul

Then I don’t think we’re far apart. I’d much prefer release of data and code, but I think some sort of delay is necessary or it would be stifling.

Even with a delay a policy like this will preclude research that really requires the data stay private. You often encounter that in finance, for example, where trading records or credit card data would be impossible to effectively de-identify, and people won’t give you access without legal assurance the data won’t get out.

And again, what are you supposed to do about data you have to buy? Publicly funded research just can’t buy data?

24 Rahul February 26, 2015 at 1:54 pm

@Lord Action

Why a delay? I say journals shouldn't even start the review process before data / code are submitted. And both should be posted online simultaneously with the actual paper.

Trading records & credit card transaction data can be scrubbed & anonymized. I think. Don't they release medical records similarly too (though shit happens!)? In any case I'm OK with some exceptions being made for such sensitive data. (Though even if we did take a hard line, I'm not sure it'd be a huge loss to the world.)

What sort of data do you end up buying? I don’t know much about such areas. Can you elaborate?

25 Lord Action February 26, 2015 at 2:16 pm

If there’s no delay, you get one paper and one shot with your data and code. If there’s a delay, you might get a few years of productivity.

Trading data might be able to be scrubbed if you’re talking about GM equity, but not if you’re talking about something thinly traded. I can often fill in the blanks and figure out counterparties even when I only have one side of the trading, for example.

I buy a lot of finance stuff. Prices. Models of structured finance things. Performance data for financial instruments. There are firms that deal in this stuff, and have staff and software that put it all together, and they don’t want to give it away to the public – it’s their business. They’ve negotiated agreements with broker dealers; they’ve written software to parse regulatory filings; they audit and correct mistakes.

26 Lord Action February 26, 2015 at 2:22 pm

I really do agree that data and code should be submitted as part of the review if at all possible. I’d have to think NDAs would usually make that acceptable, even in the privacy or proprietary situations we’re discussing.

I’d prefer that data and code be made public after some delay. I argue the delay is desirable, as it incents the creation of novel data and code. And I say “prefer” rather than “demand” because I think sometimes it just won’t work, and that that doesn’t completely devalue the sort of research that can’t be so public.

27 Adrian Ratnapala February 26, 2015 at 6:01 pm

I'm sorry, that's total crap. Academics publish their findings anyway. Their "monopoly" over the raw data does nothing but hurt the credibility of those findings. The reason raw data is rarely seen and almost never useful is our collective sloppiness.

We put things in binary files that can only be read by outdated versions of MATLAB, and if we ever manage to open them, we just find two arrays of numbers named "b1" and "q". There is no real incentive for doing better, because there are seven other steps of processing that people would still need to take.

There are serious technical barriers to proper data publishing, and they have been around in various forms since well before 1823. With the right software, education and culture we can probably fix it — but we are only just beginning to recognise the problem.

28 Rahul February 26, 2015 at 7:25 pm

@Adrian:

I don’t see the big technical barriers. Unless your data is really complex or huge.

Dumping your data in a commonly used format, or better yet a CSV or tab-delimited text file, with some intelligent commenting would go a long way.

At the point we are at right now, I see tons of low-hanging fruit.

29 Lord Action February 27, 2015 at 9:30 am

You’re assuming “data” means “a table with some numbers in it” when what’s often important (for example in climate papers) is how you get from raw data to the database you actually model with. In between you have code that takes you from 6,500 binaries, text files, scribbled notes, etc, cleans all that in a variety of ways, and brings it together into something you can work with. That’s where a lot of the complexity (and code, and room for doubt) actually lies. This is normal today, in all sorts of fields.

If we had Rahul's suggestion, we wouldn't have climate science. In today's world, we have climate science but a massive credibility problem (that nobody inside the club seems that worried about). With a three- or five-year delay, we'd have the science, and the credibility concerns would be addressed. It would be great to see what's actually true.

30 Thomas February 27, 2015 at 12:16 am

Okay, okay, let's take a turn for the political here: how much has the Party of Science doctrine watered down the veracity (for lack of a better word) of science? Let me take a step backwards: if "monopoly over the raw data does nothing but hurt the credibility of their findings," but a large subset of the population admires a willingness to accept anything that comes out of a journal, how much does obscuring data really affect public perception of the research?

31 wiki February 26, 2015 at 7:56 am

Good for psychologists. Not only do they recognize the limitations of p values, but they also encourage both preliminary and confirmatory research. Economists use a strict yet vaguely defined standard of rigor and robustness. There is reluctance to publish important preliminary work that is not econometrically sophisticated. Yet there is also reluctance to publish confirmatory or supplementary results that either confirm or modify previous research. It's as if, after one published paper showed using powerful new methods that certain drugs clearly benefited women with breast cancer, a journal might reject as uninteresting a followup paper showing that the benefits were different for older and younger women, or potentially not robust to which climate the women lived in. Of course we don't see that in medicine, but the analogous publishing issue is in fact quite common for the top econ journals.

32 TMC February 26, 2015 at 8:18 am

Or they just don’t want to be accountable for robustness. Occam’s Razor.

33 Party of Science February 26, 2015 at 8:17 am

“This is also often described as torturing the data until it confesses. In a 2009 systematic review, 33.7% of scientists surveyed admitted to engaging in questionable research practices – such as those that result in p-hacking. The temptation is simply too great, and the rationalizations too easy – I’ll just keep collecting data until it wanders randomly over the 0.05 p-value level, and then stop.”
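That optional-stopping tactic is easy to demonstrate by simulation. A rough sketch (my own illustration, not from the quoted review; the sample sizes and number of simulated studies are arbitrary): generate data under a true null, peek at the p-value after every few new observations, and stop as soon as it dips below .05. The false-positive rate ends up far above the nominal 5%.

```python
# Sketch: optional stopping ("peek after every new batch, stop when p < .05").
# The data are pure noise, so every "significant" result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, start_n, max_n, step = 2000, 10, 100, 5
false_positives = 0

for _ in range(n_studies):
    data = list(rng.normal(0, 1, start_n))    # the null is true: the mean really is 0
    while True:
        p = stats.ttest_1samp(data, 0).pvalue
        if p < 0.05:                          # peek, and stop as soon as it "works"
            false_positives += 1
            break
        if len(data) >= max_n:
            break
        data.extend(rng.normal(0, 1, step))   # collect a few more observations and peek again

print(f"False-positive rate with optional stopping: {false_positives / n_studies:.2%}")
# Typically well above the nominal 5% level.
```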

34 dan1111 February 26, 2015 at 8:22 am

I agree with the problem, but not the solution.

Banning people from publishing useful information just because it is sometimes used incorrectly is counterproductive and doesn't get at the root of the problem.

Researchers will still conduct dubious research and exaggerate how important it is. They will just use a word other than “significant” to describe it, and readers will have fewer tools with which to judge the strength of the work.

35 whatsthat February 26, 2015 at 10:50 am

Agree. An "effect size" is actually worse than the point estimate / standard error / p-value trilogy; it gives less information. Ridiculous, and thankfully I didn't end up doing psych.

36 Rahul February 26, 2015 at 11:58 am

Whether NHST yields useful information is itself quite suspect, isn’t it?

37 dan1111 February 26, 2015 at 12:15 pm

The usefulness of the p < 0.05 cutoff point is highly debatable. If they had simply banned this, and the arbitrary designation of certain results as "significant", I would have had no objection. This makes sense, actually.

But they banned not only the cutoff point, but all calculation of p values and confidence intervals. This is insane. In response to worries that too many results are just random noise masquerading as meaningful findings, they have banned the major statistical tests that help one differentiate between the two (except for some Bayesian analysis, at their discretion).

38 Rahul February 26, 2015 at 12:22 pm

Even without an arbitrary cutoff what about fishing? I wonder why they are not insisting on pre-registration though.

39 dan1111 February 27, 2015 at 1:29 am

Fishing is a problem, but banning these tests doesn’t solve that. It just reduces the information readers have to evaluate the quality of the work.

40 mavery February 26, 2015 at 12:48 pm

So because some people do bad statistics, they’ve decided to ban some of the most commonly used tools in statistics?

There’s a reason confidence intervals and p-values are so widely accepted and have been the standard in academic research for decades. They’re excellent summaries of uncertainty about point estimates and conclusions.

41 Art Deco February 26, 2015 at 9:04 am

I’ve been somewhat skeptical when the disreputable Mr. Sailer has offered the opinion that the published oeuvre of social psychology is a remainder bin of junk and trivia. Maybe he had a hunch that there was a systematic problem in the commanding heights of the subdiscipline.

42 Rahul February 26, 2015 at 12:00 pm

Mr. Sailer thinks that an awful lot of society is junk.

43 Thomas February 27, 2015 at 12:23 am

Soc Psych has brought us a 250:1 ideological bias and the explanation that it’s just because “Conservatives are stupid”, and that’s according to the top echelon of Soc Psych. The whole field is an exercise in confirmation bias.

44 Euripides February 26, 2015 at 9:04 am

The Austrians at GMU will be throwing a party to celebrate the news!

45 KevinH February 26, 2015 at 9:18 am

Two points: first, BASP is a middle-of-the-road journal. Certainly not a leader in Psychology ( http://www.eigenfactor.org/rankings.php?bsearch=Basic+and+Applied+Social+Psychology&searchby=journal&orderby=eigenfactor )

Second, this is a statistical joke, but I don't think the journal editors are in on it. Basically they want p < .00X rather than p < .05, but they won't tell you what X is.

"However, BASP will require strong descriptive statistics, including effect sizes. We also encourage the presentation of frequency or distributional data when this is feasible. Finally, we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem.”

p = f( effect size, data distribution, sample size )
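To put rough numbers on the joke (a back-of-the-envelope sketch of my own, assuming a two-sample t-test with equal group sizes and a hypothetical standardized effect size): for any fixed effect size, demanding strong descriptives plus a large sample implicitly demands a small p.

```python
# Sketch: for a two-sample t-test with equal group sizes n,
# t ≈ d * sqrt(n / 2), so a fixed effect size d plus a large enough n
# implies an arbitrarily small p-value.
import math
from scipy import stats

d = 0.3  # hypothetical standardized effect size (Cohen's d)
for n in (20, 50, 100, 500, 1000):       # per-group sample size
    t = d * math.sqrt(n / 2)
    df = 2 * n - 2
    p = 2 * stats.t.sf(t, df)            # two-sided p-value
    print(f"n per group = {n:4d}  ->  p = {p:.4g}")
```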

46 Ray Lopez February 26, 2015 at 9:26 am

Here is a nice website that I use to explain this phenomenon.

http://www.jerrydallal.com/LHSP/multtest.htm

And this blurb:

Another problem with the p-value is that it is not highly replicable. This is demonstrated nicely by Geoff Cumming as illustrated with a video. He shows, using computer simulation, that if one study achieves a p-value of 0.05, this does not predict that an exact replication will also yield the same p-value. Using the p-value as the final arbiter of whether or not to accept or reject the null hypothesis is therefore highly unreliable.
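Cumming's demonstration is easy to reproduce. A quick sketch (the true effect size and group size here are my own assumptions, not necessarily his): run exact replications of the same experiment and watch the p-value bounce around.

```python
# Sketch: repeated exact replications of one experiment, showing how much
# the p-value varies from replication to replication.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n = 0.5, 32              # hypothetical true effect and per-group sample size

p_values = []
for _ in range(25):                   # 25 exact replications
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_effect, 1.0, n)
    p_values.append(stats.ttest_ind(treatment, control).pvalue)

print(" ".join(f"{p:.3f}" for p in sorted(p_values)))
# The same experiment yields p-values ranging from well below .001 to well above .05.
```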

47 mavery February 26, 2015 at 12:54 pm

That quote reflects a fundamental misunderstanding of what a p-value is. It is a random value dependent on the sampled data. When you take a different sample, you will get different results. If you get the same p-value every time, you’re either doing it wrong or your example is trivial. (eg, your sample is the entire population, or your outcomes are deterministic.)

The 0.05 threshold is of course arbitrary. That a random value doesn’t achieve a particular threshold with 100% probability doesn’t mean that the approach to derive that value is flawed. The notion that “0.05” has magical properties for p-values is absurd on its face, which is why the actual p-value should be reported rather than simply whether or not it meets some threshold.
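One concrete way to see that the p-value is a random quantity: when the null hypothesis is actually true, the p-value is (approximately) uniformly distributed between 0 and 1, so there is no particular value it "should" take on any given sample. A quick check, as a sketch:

```python
# Sketch: when the null hypothesis is true, p-values are roughly uniform on [0, 1].
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p_values = np.array([stats.ttest_1samp(rng.normal(0, 1, 30), 0).pvalue
                     for _ in range(10_000)])

# About 5% land below 0.05, about 50% below 0.5, and so on.
print(f"P(p < 0.05) = {np.mean(p_values < 0.05):.3f}")
print(f"P(p < 0.50) = {np.mean(p_values < 0.50):.3f}")
```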

48 Rahul February 26, 2015 at 1:14 pm

Have you seen those graphs that collect p-values from published papers & show the spike right at the threshold?

49 mavery February 26, 2015 at 1:39 pm

Oh, I'm well aware that academic publications have huge false discovery problems. Benjamini and Hochberg is a good place to start if you're interested in ways to address this sort of thing. But that's not at all related to the blurb you quoted, which fundamentally misunderstands what p-values are.
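For the curious, Benjamini–Hochberg is a simple step-up procedure applied to a set of p-values; a minimal sketch (the p-values at the end are invented for illustration):

```python
# Sketch of the Benjamini-Hochberg step-up procedure for controlling
# the false discovery rate at level q across m hypothesis tests.
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean array marking which hypotheses are rejected."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m       # BH critical values q*i/m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])             # largest i with p_(i) <= q*i/m
        rejected[order[: k + 1]] = True            # reject that p-value and all smaller ones
    return rejected

# Hypothetical p-values from 8 tests:
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.740, 0.900]))
```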

50 Ray Lopez February 26, 2015 at 9:19 pm

@mavery – thx, I took a course in statistics but am not a practitioner. If you read this, perhaps you can comment on why confidence limits are considered better than p-values. I have an intuitive understanding why, but am interested on what an expert thinks.

51 mavery February 27, 2015 at 8:55 am

Ray-

They're not "better" necessarily; they just talk about different things. Confidence intervals (in the traditional, non-Bayesian sense) are also prone to misinterpretation, since the intuitive readings ("There's a 95% chance that the real mean is in this interval!" and "There's a [p-value] probability that the null hypothesis is true!") are both simpler than the actual meaning and closer to what we'd want the values to actually be talking about. Confidence intervals are nice because they give you a sense of the magnitude of the uncertainty surrounding your estimate. P-values, on the other hand, tell you how strong your evidence against a particular conclusion is.

If the hypothesis test is well defined and intuitive, this can be very useful. (e.g., “Does the new drug improve survival rates at five years after treatment?”) A marginal p-value would lead me to ask for further testing, whereas a very small p-value would make the new drug much more appealing.

Now, that said, you’d still want to look at exactly what those survival rates meant. Maybe it’s a difference between 0.03% and 0.06%. Sure, you’ve doubled the probability, but it’s still pretty small. So you can’t just take the p-value and ignore the context surrounding it.
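Putting hypothetical numbers on that last point (the counts are invented, chosen only to match the 0.03% vs 0.06% figures): with a large enough trial, that doubling is overwhelmingly "significant" even though the absolute benefit stays tiny.

```python
# Sketch: a tiny absolute difference (0.03% vs 0.06%) can be highly statistically
# significant in a large enough trial. All numbers are invented for illustration.
import math
from scipy import stats

n1 = n2 = 1_000_000          # hypothetical patients per arm
x1, x2 = 300, 600            # events: 0.03% vs 0.06%

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_value = 2 * stats.norm.sf(abs(z))   # two-proportion z-test, two-sided

print(f"absolute difference = {p2 - p1:.4%}, z = {z:.1f}, p = {p_value:.1e}")
# The p-value is vanishingly small, but the benefit is still only ~0.03 percentage points.
```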

52 prof February 26, 2015 at 10:41 am

Instead of inexpensive blanket rules like this, journals should earn their outsize fees by hiring staff statisticians to review papers (given that peers in most fields aren't equipped to do so).

Likewise, universities should fire some deanlets and deanlings and replace them with staff statisticians — either in the library or in individual departments.

53 Nate February 27, 2015 at 5:47 am

+1

But maybe we don’t need to fire those deanlets. I think we need an incentive for researchers to cross borders. Most colleges do have a Math Department…

54 prof February 27, 2015 at 8:58 am

The problem is that these papers aren’t presenting novel stats problems that would qualify as research to an academic statistician. That’s why I think you need paid, dedicated, professionals to assist.

55 bartman February 26, 2015 at 11:16 am

From the article:

“However, the p-value was never meant to be the sole measure of whether or not a particular hypothesis is true.”

I was taught that hypotheses are either falsified or not falsified, but are never "true". Kinda like how people are found guilty or not guilty, but never found "innocent".

On the other hand, a friend who is a developmental psychologist says that social psych people are seen as the “soft white underbelly” of the psych world, the bottom of the food chain, not taken too seriously by other sub-disciplines. Maybe this validates that notion.

56 FG February 26, 2015 at 11:24 am

Sounds like a job for differential privacy!

57 dearieme February 26, 2015 at 11:52 am

When I first studied probability and statistics everything got enormously easier when I realised that much of the difficulty was not mathematical. Instead it stemmed from the appallingly constipated, or even confused, writing that the field seemed plagued with. As an example, consider that link’s definition “The p value is the probability to obtain an effect equal to or more extreme than the one observed presuming the null hypothesis of no effect is true.” Ugh! Really, students should hurl things at any lecturer who expresses himself so badly.

58 Ricardo February 26, 2015 at 12:08 pm

So would you like to reword it for us?

59 mavery February 26, 2015 at 12:58 pm

Seriously. If you can find a clearer way of expressing that thought that doesn't admit incorrect interpretation, I'd love to hear it. I try to go with things like, "How unlikely is it that your result could happen due to random chance?" but that's not a formal definition like the one you stated. Formal definitions tend to be complex. I'm not sure how physicists formally define, say, motion, but I would imagine it sounds equally obtuse.

60 Rahul February 26, 2015 at 1:25 pm

In the case of p-values, I suspect the rampant misunderstandings are because their definition answers a question that no one is really ever asking.

Hence people just make a mental jump & assume that a p-value means whatever they want it to mean.

61 dearieme February 26, 2015 at 3:09 pm

What the devil is “the probability to obtain”? That’s just not English. Why does he shoehorn into the sentence the meaning of a null hypothesis? That should be in a preamble. Why the babyish “presuming the null hypothesis … is true” rather than ‘assuming the truth of the null hypothesis’? Et bloody cetera. It’s just dreadful writing. It’s also incomplete, in that he needs a separate sentence to explain what he means by “more extreme than”.
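For what it's worth, the definition being argued over is probably easier to parse as a formula than as a sentence. One cleaner statement, for a two-sided test with test statistic T and observed value t_obs (my paraphrase, not the linked author's wording):

```latex
p \;=\; \Pr\bigl( |T| \ge |t_{\mathrm{obs}}| \;\big|\; H_0 \text{ is true} \bigr)
```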

62 Rahul February 26, 2015 at 11:56 am

My thoughts on publishing reform:

(1) All papers declare clearly whether they are exploratory or not.

(2) Papers that claim to be non-exploratory must have their study goals & methodology pre-registered.

(3) Funding agencies dedicate a portion of their budgets for replication studies. If a study is worth funding the results are worth replicating. Replication is non-glamorous work so funding might compensate.

(4) Authors must post all raw data online *before* the paper gets accepted for publication (with limited exceptions)

(5) Rather than just p-values, authors should be pushed to add some metric of importance / real-world significance / impact / cost-benefit / economics etc. to their papers.

(6) A journal should have an independent statistical review of articles, possibly by a staff statistician.

(7) Publish the names of reviewers along with an article’s authors.

63 JWatts February 26, 2015 at 1:07 pm

Not a bad wish list, but I would certainly like to see a cost benefit analysis of implementing it.

Also, to #4 I would add that any algorithm or math used in the analysis of said data must be fully detailed. Ergo, no "here's the data and here's the result of my computer model, but the model itself is proprietary." I'd probably just call that the No Black Box rule.

64 Axa February 26, 2015 at 12:33 pm

This ban is kind of a double-edged knife. It's like banning curse words at church: yep, it's fine and the expected behavior, but the ban doesn't speak well of the church-attending people. Is shame the best tool to make people change the way they behave? Psychologists are weird.

65 Rahul February 26, 2015 at 12:36 pm

Why single out psychologists? Have doctors, pharmacists, economists, biologists etc. voluntarily stopped cursing….oops I mean using p-values?

66 Axa February 26, 2015 at 1:30 pm

Well, the day an economics journal bans the use of p-values they’ll join the psychologists.

67 Mike February 26, 2015 at 4:14 pm

This is lazy editing/refereeing. Just reject papers that don’t interpret the p-values appropriately.

68 Pat Boyle February 26, 2015 at 7:29 pm

I just looked this journal up. Gasp!

Who would read such dreck? There seem to be more than a hundred similar journals published each quarter on psychology, social psychology or sociology. Too many.

I defy anyone to find a worthwhile article in this waste of wood pulp.

69 mesaman February 27, 2015 at 2:58 pm

So BASP has morphed into the "Reader's Digest". What's next? Basic and Applied National Enquirer?

70 JackHastings March 2, 2015 at 3:45 am
