Science News has a good piece by Tom Siegfried on statistical significance and what it means. Siegfried covers a lot of ground, including Ioannidis's argument ("Why Most Published Research Findings Are False"), oomph versus statistical significance, and the meaning of the p-value. On the latter point, Siegfried writes:

Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct).

He then explains why a 5% level of significance doesn't mean that there is a 95% chance that the result could *not* have happened by chance.
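
To see what that 5 percent refers to, here is a minimal simulation sketch (not from Siegfried's article). It assumes a simple two-sided z-test with a known standard deviation; the sample sizes, the seed, and the helper name `two_sided_p` are illustrative choices. Every simulated experiment is drawn with no real effect, so roughly 5% of them should still come out "significant" at the .05 level.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = 0, assuming sigma is known."""
    z = mean(sample) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(42)          # fixed seed so the run is repeatable
n_experiments, n_obs = 10_000, 30

# Every experiment is generated under the null: the true mean is exactly 0.
p_values = [
    two_sided_p([rng.gauss(0, 1) for _ in range(n_obs)])
    for _ in range(n_experiments)
]

share_significant = sum(p < 0.05 for p in p_values) / n_experiments
print(f"share of null experiments with p < .05: {share_significant:.3f}")
```

The share says how often the null alone produces a result this extreme. It says nothing directly about how likely the null is given the data, which is the misreading Siegfried warns against.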

All of this is correct but there is another more common error that Siegfried does not address. Suppose that a researcher runs a regression and gets a coefficient on some variable of interest of 5.2 and a p-value of .001. In explaining his or her results the researcher says "an effect of this size would happen by chance alone only 0.1% of the time." Now that sounds very impressive but it is also misleading.

In economics and most of the social sciences what a p-value of .001 really means is that, *assuming everything else in the model is correctly specified*, the probability that such a result could have happened by chance is only 0.1%. It is easy to find a result that is statistically significant at the .001 level in one regression but not at all statistically significant in another regression with small changes such as the inclusion of an additional variable. Indeed, not only can statistical significance disappear, the coefficient on the variable can change size and even sign!
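
A small simulation can make the point concrete. This is only a sketch on made-up data: the data-generating process, the variable names `x` and `z`, and the helper `ols` are assumptions for illustration, and it relies on NumPy. The true effect of `x` on `y` is negative, but `x` is correlated with a control `z` that has a large positive effect. Leave `z` out and the coefficient on `x` is positive and highly significant; add that one variable and it is negative and highly significant.

```python
import numpy as np


def ols(X, y):
    """OLS with an intercept; returns (coefficients, t-statistics)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof                 # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, beta / se


rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)                           # the control we may omit
x = 0.8 * z + 0.6 * rng.normal(size=n)           # x is correlated with z
y = -1.0 * x + 3.0 * z + rng.normal(size=n)      # true effect of x is -1

b_short, t_short = ols(x, y)                         # y on x only
b_long, t_long = ols(np.column_stack([x, z]), y)     # y on x and z

print(f"without z: coef on x = {b_short[1]:+.2f}, t = {t_short[1]:+.1f}")
print(f"with z:    coef on x = {b_long[1]:+.2f}, t = {t_long[1]:+.1f}")
```

Both regressions report a tiny p-value for `x`. Only one of them is correctly specified, and nothing in the p-value tells you which.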

A highly statistically significant result does not tell you that a result is *robust*. It is not even the case that more statistically significant results are more likely to be robust.

Now go back to Siegfried's explanation for the p-value. Notice that he writes "Correctly phrased, *experimental data* yielding a P value of .05…" Almost everything of importance is buried in those words "experimental data." In the social sciences we rarely have experimental data. Indeed, even "experimental data" is not quite right; *truly randomized data* might be a better term because even so-called experimental data can involve attrition bias or other problems that make it less than truly random.

Thus, the problem with the p-value is not so much that people misinterpret it but rather that the conditions for the p-value to mean what people think it means are really quite restrictive and difficult to achieve.

**Addendum**: Andrew Gelman has a roundup of other comments on Siegfried's piece.

Alex

Can you think of a plausible example where you would get a “coefficient on some variable of interest of 5.2 and a p value of .001”, then “simply include one additional variable”, and have the coefficient “change… sign”?

Even this isn’t quite right I think. The way we were taught in grad school, the right way to say it in the ideal case would be, “Either the null hypothesis is false, or we just observed something which happens less than 0.1% of the time.” An alternative way of saying it would be, “an _estimated_ effect of this size would happen by chance alone only 0.1% of the time _if the null hypothesis were true_.”

The temptation to make Bayesian statements using just the likelihood function is too great. In order to make Bayesian statements one has to specify a prior, not just the likelihood. There’s a whole set of techniques to do this, but classical hypothesis testing is not one of them.

“@SJ: A regression of age on income adding education or experience.”

I think you’d need to be more specific than that.

When most applied economists look at p-values they are really making a Bayesian interpretation with a flat prior (which is inadmissible, but if the sample is large, may not be such a bad approximation). Thus a p-value of 1% is understood really as, “given this model, the probability that this coefficient is larger than 0 is 99%”. Which is a meaningful, interesting and pretty strong statement, if you like the model.
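
Here is a minimal sketch of that flat-prior reading, under assumptions the comment does not spell out: a normal likelihood with a known standard error, so that the posterior for the coefficient is normal and centered on the estimate. The numbers `b_hat` and `se` are hypothetical. Under those assumptions the posterior probability that the coefficient is positive equals one minus the one-sided p-value.

```python
from statistics import NormalDist

b_hat, se = 5.2, 2.0            # hypothetical estimate and standard error

# Classical side: p-values for H0: beta = 0 under a normal approximation.
z = b_hat / se
one_sided_p = 1 - NormalDist().cdf(z)
two_sided_p = 2 * one_sided_p

# Flat-prior Bayesian side: posterior for beta is Normal(b_hat, se).
posterior = NormalDist(mu=b_hat, sigma=se)
prob_positive = 1 - posterior.cdf(0)

print(f"one-sided p = {one_sided_p:.4f}, two-sided p = {two_sided_p:.4f}")
print(f"P(beta > 0 | data, flat prior) = {prob_positive:.4f}")
```

Note that the match is with the one-sided p-value. With the two-sided p-values most regression software reports, a p-value of 1% corresponds to a posterior probability of 99.5% in this setup, and all of it is still conditional on the model.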

I like your statement on robustness. I think this is what empirical work should be mostly about. In fact, I am more than willing to trade-off statistically insignificant results in many of the regressions if the coefficient turns out to be robust to changes in the model and in sub-samples. There should be a way to formalize this trade-off (econometricians, any help here???)

In the long run, that’s how we evaluate our theories. If, in paper after paper, regression after regression, wage and education are correlated with a stable coefficient, we come to accept that this is a fact about how wages are set.

There is always some guy at the meeting that questions a grad student’s statistics. I don’t know statistics but I always get the feeling they don’t either. They just don’t know they don’t.

Wouldn’t it be funny if the ascension of the requirement of statistics turned out to make people read more into results than they otherwise would?

So does this mean that green does not really fight cancer and extend life? So does this mean that Quaker Oatmeal does not really prevent heart attack and cancer and lower blood pressure?

http://www.quakeroats.com/oats-do-more/for-your-health/fiber-and-whole-grains/the-health-benefits-of-whole-grains.aspx

I would bet that it does but it at least puts any such findings in doubt. I am amused by the Quaker Oats ads.

“@SJ: Well, I can imagine that the correlation between personal income and age is positive and quite significant. However, income rises with age because older people are on average more educated and more experienced. Were you to find an uneducated and inexperienced 60 year old, I would say that (s)he on average makes less than an equivalent 20 year old. The latter can make a decent wage in construction or meat packing, the former cannot. So the correlation of age with income corrected for education and experience is probably *negative*, hence the switching of the OLS coefficient’s sign.”

Well, JSK failed the test. How about you, Alex?

@SJ, I have seen highly statistically significant results change sign with small changes on many occasions. I think if you ask any econometrician they will tell you the same thing. You ask for a plausible example. JSK’s story is pretty good but in some ways the search for a plausible example obscures what is actually going on. Imagine you have 50 variables in a regression (50 is not a large number in this context). The coefficient on each of those variables comes from a complex expression which takes into account the influence of every other variable which themselves take into account the influence of every other variable. Imagine that a and d are uncorrelated. Nevertheless, if you have a,b,c, in the regression and you add d that can change a because it changes b and c. It’s not surprising in complex systems like this that small changes can have big effects. Not that this always happens but it happens way more than a p-value of .001 might suggest.

@MichaelG, absolutely. Assuming everything else in the model is “correctly specified” includes functional form (how the variables enter) and distribution. Sometimes we can rely on CLT but not always.

@SJ: Are you the guy Andrew talked about?

“There is always some guy at the meeting that questions a grad student’s statistics. I don’t know statistics but I always get the feeling they don’t either. They just don’t know they don’t.”

Grow up. If you’re interested in failing people for tests, become a high school teacher.

Alex,

Let me be nasty and claim that you are making this up:

“I have seen highly statistically significant results change sign with small changes on many occasions.”

I am an empirical micro guy and am running regressions for a living (ok, and teaching). I cannot remember a single instance in which I went from a positive coefficient with a p-value of 0.001 to a negative one with the inclusion of one single additional control. Looking at your CV, you are running many fewer regressions than I do, yet you claim to have seen many such cases. Sorry, not believable.

Oh, and 50 variables in a regression? What regressions in economics have 50 variables (unless you mean fixed effects)? And you even claim that “50 variables isn’t large in this context”? That would presumably mean there are plenty with 100 or 150?

I enjoy your writing a lot, but I am getting the impression you are just making stuff up here.

I think the better way to address robustness is to require (except under exceptional circumstances) that published research make its data publicly available – in a common (e.g., text) form. While a number of journals purport to do that today, in reality, little of the interesting data can be found. I keep running across datasets that are withheld due to proprietary concerns. For example, a recent article (AER, I believe) on cellphone plan adoption couldn’t reveal the data because it was competitively sensitive. The data was 6 years old on a bunch of college students – let’s get real! This is as bad as the Federal government marking documents as sensitive.

Even when data can be gotten publicly, it is hard to replicate the exact dataset that somebody uses. As a profession, economists should be mandating that replication be made easy and transparent. Absent this, I think robustness will remain elusive.

@commenterlein: then you don’t know statistics. If you leave an important explanatory variable out of a regression, the estimators of all other coefficients are biased (with a few exceptions). An unexpected sign on a coefficient or a change in sign with the addition of a new variable is a telltale sign of omitted variable bias.
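
The bias described here has a standard textbook form. As a sketch in notation that is not in the comment itself (the two-regressor setup and the symbols $\beta_1$, $\beta_2$, $\delta_1$ are assumptions for illustration), suppose the true model is

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u, \qquad \operatorname{Cov}(x_1, u) = \operatorname{Cov}(x_2, u) = 0,
$$

but $x_2$ is left out of the regression. The coefficient $\tilde{\beta}_1$ from regressing $y$ on $x_1$ alone then converges to

$$
\operatorname{plim} \tilde{\beta}_1 = \beta_1 + \beta_2 \delta_1, \qquad \delta_1 = \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)}.
$$

The estimate has the wrong sign whenever $\beta_2 \delta_1$ has the opposite sign to $\beta_1$ and is larger in magnitude. The "few exceptions" are the cases where $\beta_2 = 0$ or $\delta_1 = 0$, so that the extra term vanishes.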

Regarding the level of significance, .05 is a common convention and we do a disservice by repetitively using it. Alpha is the probability of a Type I Error. Beta is the probability of a Type II Error. For a given sample size, one increases when the other decreases. Beta is based on unknown population parameters so it cannot be calculated. The selection of alpha should be done a priori by weighing the relative costs of a type I and type II error. We produce the pvalue in research so a reader can make a statistical decision FOR ANY chosen alpha. If you have reason to believe alpha should be smaller, then the pvalue gives you the result for your smaller alpha.
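
The trade-off between alpha and beta can be sketched numerically, with one caveat: since beta depends on the unknown true effect, the code below simply assumes one. The effect size, standard deviation, sample size, and the helper name `type_ii_error` are all hypothetical, and the test is a one-sided z-test chosen for simplicity.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_ii_error(alpha, effect, sigma, n):
    """Beta for a one-sided z-test of H0: mean = 0 against mean > 0.

    The true effect is unknown in practice; here it is assumed.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)     # rejection threshold
    shift = effect * n ** 0.5 / sigma          # mean of z under the alternative
    return STD_NORMAL.cdf(z_crit - shift)      # P(fail to reject | effect is real)


alphas = [0.10, 0.05, 0.01, 0.001]
betas = [type_ii_error(a, effect=0.3, sigma=1.0, n=50) for a in alphas]

for a, b in zip(alphas, betas):
    print(f"alpha = {a:<6} beta = {b:.3f}  power = {1 - b:.3f}")
```

Holding the sample size fixed, each tightening of alpha raises beta, which is why the choice of alpha should reflect the relative costs of the two errors.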

The worst thing is that data from most of the surveys we see every day don’t even publish a standard error so we can determine whether the changes are significantly different from zero. We take it on faith and the media runs with it, contributing to statistical innumeracy.

Ioannidis writes: “There are an infinite number of false hypotheses about the world and only a finite number of true hypotheses so we should expect that most hypotheses are false. Let us assume that of every 1000 hypotheses 200 are true and 800 false.”

Why should there only be a finite number of true hypotheses? He doesn’t specify how he is formulating his hypotheses, but it would seem that in a simple logical system you could come up with at least a countably-infinite and probably an uncountably-infinite set of true hypotheses.

I don’t disagree with the rest of his article, which seems like another version of the Prosecutor’s Paradox: http://en.wikipedia.org/wiki/Prosecutor's_fallacy
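
The arithmetic behind the 200-true, 800-false example can be sketched in a few lines. The 5% significance level is the conventional threshold discussed in the post; the power of 0.6 is purely an assumption for illustration and is not taken from the quoted passage.

```python
n_true, n_false = 200, 800      # from the quoted example
alpha = 0.05                    # conventional significance level
power = 0.6                     # ASSUMED chance a true hypothesis tests significant

false_positives = alpha * n_false     # false hypotheses that pass the test
true_positives = power * n_true       # true hypotheses that pass the test
significant = false_positives + true_positives

share_false = false_positives / significant
print(f"significant results: {significant:.0f}")
print(f"share of significant results that are false: {share_false:.2%}")
```

With these inputs a quarter of the "significant" findings are false even though every test was run at the 5% level, which is the base-rate logic of the prosecutor's fallacy.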

James

Quick comments:

(1) a very small effect can be highly statistically significant

(2) following #1, one has to look at the size of the effect, not just the stats to evaluate “importance”

(3) To give Alex credit here, yes, the situation is very different in most experimental science (biomed science work in particular). In graduate school, I was scooped on multiple projects by people that I had never heard of or met… and that were using slightly different experimental conditions… because their results *matched* mine. In other words, we agreed entirely without having ever met and while using different experimental systems etc. Welcome to science!

(4) following #3, I always emphasize to my students the importance of replication (as in multiple publications from different labs agreeing) and, more importantly, replication using very different experimental approaches = different assumptions/limitations. And replication “over time” – as in 10 years later, somebody still reaches the same conclusion based on experiments.

(5) based on #3 and #4 above, you can see why there has been a lot less emphasis on stats in (non-human) biomed work. If you disagree with me… do the experiment in your own lab! This is often doable for a quite small investment of time/energy/money.

Sorry if this got slightly off-topic…
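
Points (1) and (2) can be illustrated with a short calculation. This is a sketch under simplifying assumptions: a z-test with known standard deviation, and an estimated effect that happens to equal a hypothetical true effect of one hundredth of a standard deviation. The helper name `two_sided_p` is made up for this example.

```python
import math


def two_sided_p(effect, sigma, n):
    """Two-sided z-test p-value if the estimated mean equals `effect`."""
    z = effect / (sigma / math.sqrt(n))
    # erfc keeps precision in the far tail, where 1 - cdf would round to 0
    return math.erfc(abs(z) / math.sqrt(2))


tiny_effect = 0.01              # hypothetical: 1% of a standard deviation

p_small_n = two_sided_p(tiny_effect, sigma=1.0, n=100)
p_large_n = two_sided_p(tiny_effect, sigma=1.0, n=1_000_000)

print(f"n = 100:        p = {p_small_n:.3f}")
print(f"n = 1,000,000:  p = {p_large_n:.2e}")
```

The effect is the same size in both rows. Only the sample size changes, so the p-value by itself says nothing about whether the effect is big enough to matter.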

It’s used to determine whether the outcome of an experiment is the result of a relationship between specific factors or due to chance. It is commonly used in the medical field to test drugs and vaccines and to determine causal factors of disease.

Slope parameters that shift sign (even with a p-value below 0.01) should be rare, but are not unfathomable (especially given the type of raw “data” that go into modeling in recent years). Model misspecifications that lead to such results (examples include, but are not limited to, those mentioned by POWinCa) and violations of model assumptions are relatively easy to detect through standard diagnostic procedures (available in decent stat texts). Perhaps economics journals should require diagnostic test results as a condition for publishing articles (not to mention that stat courses should also cover the subject matter adequately before releasing students into the world of modeling).

When a statistic is significant, it simply means that you are very sure that the statistic is reliable. It doesn’t mean the finding is important or that it has any decision-making utility.

so 5% level of significance doesn’t mean that there is a 95% chance that the result could not have happened by chance? interesting read nonetheless.

This news is very great, sometime a big number like 95% is useless and 5% usually is everything.

We see some of this in psychological testing. For example, in IQ testing, we may assess someone with a gap of, say, 10 points between their verbal IQ and their performance (i.e. nonverbal) IQ. While this gap may be statistically significant (that is, a gap this large occurs rarely enough to reach statistical significance), it doesn’t mean much in terms of someone’s intellectual functioning. What is of far more concern is whether a gap between the scores approaches what is called clinical significance – at that point, we may see certain issues resulting from a gap that large (say, roughly, 20+ points difference off the top of my head).
