The Meaning of Statistical Significance

Science News has a good piece by Tom Siegfried on statistical significance and what it means. Siegfried covers a lot of ground, including Ioannidis's argument in "Why Most Published Research Findings Are False," oomph versus statistical significance, and the meaning of the p-value. On the latter point, Siegfried writes:

Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct).

He then explains why a 5% level of significance doesn't mean that there is a 95% chance that the result could not have happened by chance.
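The definition Siegfried gives can be checked with a small simulation. The sketch below is illustrative only: it assumes two groups drawn from the same normal distribution with known unit variance (so the no-difference hypothesis is true by construction), uses a simple two-sample z-test, and all names and sample sizes are made up for the example.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

n_sims = 10_000   # number of simulated experiments (arbitrary choice)
n = 50            # observations per group (arbitrary choice)

# Both groups come from the same distribution, so no real effect exists.
a = rng.normal(size=(n_sims, n))
b = rng.normal(size=(n_sims, n))

# Two-sample z statistic, assuming the variance of each observation is known to be 1.
z = (a.mean(axis=1) - b.mean(axis=1)) / math.sqrt(2.0 / n)

# Two-sided p-value: probability of a result at least this extreme under the null.
p_values = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

share_below_05 = float(np.mean(p_values < 0.05))
print(f"Share of null experiments with p < .05: {share_below_05:.3f}")
```

When the null is true and the test is correctly specified, the share should come out close to 0.05. That is all the p-value promises: it describes how often such data arise when there is no effect, not how likely it is that there is no effect.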

All of this is correct, but there is another, more common error that Siegfried does not address. Suppose that a researcher runs a regression and gets a coefficient on some variable of interest of 5.2 and a p-value of .001. In explaining his or her results the researcher says "an effect of this size would happen by chance alone only 0.1% of the time." Now that sounds very impressive, but it is also misleading.

In economics and most of the social sciences, what a p-value of .001 really means is that, assuming everything else in the model is correctly specified, the probability that such a result could have happened by chance is only 0.1%. It is easy to find a result that is statistically significant at the .001 level in one regression but not at all statistically significant in another regression with small changes, such as the inclusion of an additional variable. Indeed, not only can statistical significance disappear, the coefficient can change size and even sign!
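Here is a minimal sketch of how a coefficient can flip sign when one variable is added. The data are simulated and everything is an assumption made for illustration: the variable names, the coefficients in the data-generating process, and the use of a normal approximation to the t distribution for the p-values.

```python
import math
import numpy as np

def ols(y, X):
    """Return OLS coefficients and two-sided p-values (normal approximation)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)            # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # classical OLS covariance matrix
    se = np.sqrt(np.diag(cov))
    t_stats = beta / se
    p = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])
    return beta, p

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data: z drives both x and y; the true effect of x on y is -1.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = -1.0 * x + 3.0 * z + rng.normal(size=n)

const = np.ones(n)
beta_short, p_short = ols(y, np.column_stack([const, x]))      # z omitted
beta_long, p_long = ols(y, np.column_stack([const, x, z]))     # z included

print(f"Without z: coefficient on x = {beta_short[1]:+.2f}, p = {p_short[1]:.2g}")
print(f"With z:    coefficient on x = {beta_long[1]:+.2f}, p = {p_long[1]:.2g}")
```

In this setup the regression that omits z reports a positive, highly significant coefficient on x, while the regression that includes z reports a negative, highly significant one. Both p-values are tiny, yet at most one specification can be right, and the p-value alone cannot say which.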

A highly statistically significant result does not tell you that a result is robust. It is not even the case that more statistically significant results are more likely to be robust.

Now go back to Siegfried's explanation for the p-value. Notice that he writes "Correctly phrased, experimental data yielding a P value of .05…" Almost everything of importance is buried in those words "experimental data." In the social sciences we rarely have experimental data. Indeed, even "experimental data" is not quite right – truly randomized data might be a better term because even so-called experimental data can involve attrition bias or other problems that make it less than truly random.

Thus, the problem with the p-value is not so much that people misinterpret it but rather that the conditions for the p-value to mean what people think it means are quite restrictive and difficult to achieve.

Addendum: Andrew Gelman has a roundup of other comments on Siegfried's piece.