Small samples mean statistically significant results should usually be ignored

Size really matters: prior to the era of large genome-wide association studies, the large effect sizes reported in small initial genetic studies often dwindled towards zero (that is, an odds ratio of one) as more samples were studied. Adapted from Ioannidis et al. 2001, Nat Genet 29:306-309.

Genomes Unzipped: In October of 1992, genetics researchers published a potentially groundbreaking finding in Nature: a genetic variant in the angiotensin-converting enzyme (ACE) gene appeared to modify an individual's risk of having a heart attack. This finding was notable at the time for the size of the study, which involved a total of over 500 individuals from four cohorts, and for the effect size of the identified variant: in a population initially identified as low-risk for heart attack, the variant had an odds ratio of over 3 (with a corresponding p-value less than 0.0001).

Readers familiar with the history of medical association studies will be unsurprised by what happened over the next few years: initial excitement (this same polymorphism was associated with diabetes! And longevity!) was followed by inconclusive replication studies and, ultimately, disappointment. In 2000, 8 years after the initial report, a large study involving over 5,000 cases and controls found absolutely no detectable effect of the ACE polymorphism on heart attack risk.

The ACE story is not unique, either to that polymorphism or to medical genetics; the problem is common to most fields of empirical science. If the sample size is small, then statistically significant results must have big effect sizes. Combine this with a publication bias toward statistically significant results, plenty of opportunities to subset the data in various ways, and lots of researchers looking at lots of data, and the result is diminishing effects with increasing confidence, as beautifully shown in the figure.
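This winner's-curse dynamic is easy to simulate. A sketch with made-up numbers (not the ACE data; `mean_published_effect` is my own name): give many two-group studies the same small true effect, keep only the "significant" ones, and compare the published effect across sample sizes.

```python
import math
import random

random.seed(0)

def mean_published_effect(n_per_group, true_effect=0.1, n_studies=2000):
    """Simulate many two-group studies of a small true effect (0.1 SD)
    and average the observed effects among the 'significant' ones --
    the ones that would make it into print."""
    se = math.sqrt(2.0 / n_per_group)  # SE of the standardized mean difference
    published = []
    for _ in range(n_studies):
        observed = random.gauss(true_effect, se)  # true effect + sampling noise
        if abs(observed) > 1.96 * se:             # two-sided p < 0.05
            published.append(observed)
    return sum(published) / len(published)

small = mean_published_effect(n_per_group=20)
large = mean_published_effect(n_per_group=2000)
print(f"mean published effect, n=20 per group:   {small:.2f}")
print(f"mean published effect, n=2000 per group: {large:.2f}")
```

The small studies that clear the significance bar report effects several times the truth; the large ones report roughly the true 0.1. That is the figure in miniature.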

For more see my post explaining Why Most Published Research Findings are False and Andrew Gelman’s paper on the statistical challenges of estimating small effects.

Addendum: Chris Blattman does his part to reduce bias. Will journal editors follow suit?


Does this hold only for genetic studies, though?

Seth Roberts seems to me to be arguing in just the opposite direction, favoring small samples for general health and wellness experimentation. If I understand his arguments correctly, he thinks small samples are better because an effect has to be so big to reach significance at all.

This phenomenon, and the underlying statistical principles, have been known for MANY years by baseball researchers, who have to keep explaining over and over again why the data about "clutch hitting" are meaningless. When you have a thousand totally random samples, about ten of them will exhibit [false positive] significance at the .01 level.
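That arithmetic is easy to check with a quick sketch of a simulation (hypothetical "clutch hitting" records drawn from pure noise, each tested against the two-sided normal critical value for the .01 level):

```python
import random

random.seed(1)

# 1000 "clutch hitting" samples drawn from pure noise, each tested at .01.
false_positives = 0
for _ in range(1000):
    xs = [random.gauss(0, 1) for _ in range(50)]   # one null sample of 50
    z = (sum(xs) / len(xs)) * 50 ** 0.5            # z = mean / (sigma/sqrt(n))
    if abs(z) > 2.576:                             # two-sided alpha = .01
        false_positives += 1
print(false_positives)   # on the order of ten
```

Every one of those hits would look like a "clutch hitter" if you went looking for one.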

Very good point about sample size.
Even worse is when you are not analyzing data, but computer models of predicted future data.
And even worse is when you only get paid if the results come out a certain way.

It's incredible to think this happens in real life, but the incandescent light bulb ban says hi.

If you just read the methods and results sections, I think the problem is lessened. I wonder if peer review helps or hurts this.

This is a little misleading. Yes, it is easier to get extreme results in small samples. And precisely for this reason, your results have to be more extreme than they would have to be in a large sample to be statistically significant. Unless you are arguing that asymptotic distributional assumptions are easier to swallow in a larger sample (e.g. if you are relying on the central limit theorem)...

Aren't we typically testing "the effect is positive"? You should not infer "the odds ratio is three". If you use small sample estimates as point estimates then I guess your point is true, but you shouldn't be doing that in the first place.

Good point. There is likely an economizing of implications from the available data. You are basically saying "we didn't rule out an effect" and people want to read it as "there is an effect." The former justifies a larger study, the latter justifies pre-conception.

If every study actually followed correct experimental procedure and there were no publication bias, you should treat 95% confidence with n = 10 the same as 95% confidence with n = 10,000. But almost no one follows correct experimental procedure and so the reported confidences are wrong. The reported confidences will typically be more wrong for smaller samples.

Great post -- and link to past post.

To fully believe this result, you have to believe that there are no big unpublished effects in the world, since such effects would show up in small studies. This is then just a version of the hundred-dollar-bill-in-the-street story. But it only applies where people have been looking. In a novel data set, and where there is little surprise, it will not necessarily be true. Suppose I say that ERA has a big effect on salary for major league pitchers. I'm going to find a huge effect even in a very small sample, and the effect won't diminish (on average) as I increase the sample size. It's just not very surprising.

Isn't statistical significance designed to compensate for sample sizes? Small samples therefore need larger effects to be deemed significant.

Or are we arguing that significance testing is somehow flawed, in a way that makes "significant" not really significant for small samples? If so, why not modify significance testing procedures?
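Significance testing does compensate, and the flip side is the point of the post: the smallest effect a small study can ever declare significant is enormous, so the significant small-study effects are the inflated ones. A minimal sketch under a normal approximation (the function name is mine):

```python
import math

def min_significant_effect(n_per_group, z=1.96):
    """Smallest observed standardized mean difference (Cohen's d) that
    reaches two-sided p < .05 in a two-group comparison, using the normal
    approximation: the SE of the difference is sqrt(2/n) in SD units."""
    return z * math.sqrt(2.0 / n_per_group)

for n in (10, 50, 250, 2500):
    print(f"n = {n:>4} per group: observed d must exceed "
          f"{min_significant_effect(n):.2f}")
```

With 10 per group, only observed effects of roughly 0.88 standard deviations or more can reach significance; with 2500 per group, effects under 0.06 can. Small studies can only ever "find" huge effects.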

The way I understood it, problems only arise when non-practitioners interpret "significance" in ways it was never meant to be interpreted. Anyway, it was always a dizzyingly confusing topic.

One assumption is that all the results are presented: a totally objective process where the hypothesis is tested and the results are reported. Now, imagine taking 5000 samples and getting no effect, but then chopping off one end until you get down to 500 that do show an effect. That's not exactly what they do, but it kinda sorta is. Each result is true, but the conclusion is wrong. Meta-analysis will make this worse.
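The "chopping" can be simulated directly. A hypothetical sketch (my own helper names): draw 5000 pure-noise observations, test ten disjoint subgroups of 500 each at the 5% level, and declare victory if any subgroup comes out "significant".

```python
import math
import random

random.seed(3)

def z_p(xs):
    """Two-sided normal-approximation p-value for H0: mean = 0, sigma = 1."""
    z = abs(sum(xs) / len(xs)) * math.sqrt(len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def subgroup_hunt(n_total=5000, n_sub=500, alpha=0.05):
    """Draw null data, then test ten disjoint subgroups of 500 and
    report whether any one of them looks 'significant'."""
    data = [random.gauss(0, 1) for _ in range(n_total)]
    return any(z_p(data[i:i + n_sub]) < alpha for i in range(0, n_total, n_sub))

hit_rate = sum(subgroup_hunt() for _ in range(200)) / 200
print(f"fraction of null datasets with a 'significant' subgroup: {hit_rate:.2f}")
```

With ten shots at a 5% threshold, roughly 1 - 0.95^10, about 40% of effect-free datasets yield a publishable subgroup. Each individual test is valid; the conclusion is not.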

Yes, but what's the way around this? Any ideas?

Blattman provides a very good example of the main way to improve the situation: negative results have to get out in the open somehow, one way or the other. Really any way possible, although the optimal approach would probably be for the results to enter the equation via the journals already in existence. That would be a lot easier if journal editors were more willing to play along and actually publish results like those (which is why AT mentions them; editors/publishers are very important in this discussion), instead of rejecting many such studies because, 'well, they didn't really find anything, did they?'. Researchers need publications, and if they know they have to jump through a lot of hoops to hunt down p-values, then that's what they'll spend their time doing (instead of perhaps focusing more on important negative results).

I'm also sympathetic to the argument that scientists should just stop contributing to biased publications, but one editor matters a lot more than one scientist here.

A "problem" in genetics is that one performs thousands of statistical tests simultaneously. The p-values should be adjusted accordingly through some Bonferroni-type correction.

If I take 500 individuals (250 with a disease and 250 without) and I test every single gene I can observe, it is very likely that one will give a very low p-value. But the threshold I should compare it to in order to call it significant will depend on how many genes I tested in total.

Similar situations arise in high-frequency finance, where I can test 10,000 stock pairs for cointegration or mean reversion. Each test will give me a p-value, but the more tests I perform, the more likely it is that some will pass. I cannot compare to the standard 1% or 5% anymore.

And as Kahneman points out in his book, extreme combinations are more likely in small samples. Therefore multiple tests on small samples can yield very unreliable statistics. Each test might account for the small sample size, but it will not account for the fact that another 999 hypothesis tests were run.
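A minimal sketch of the correction (simulated null "genes"; the helper names are mine): at the raw 5% threshold, dozens of noise genes look associated with the disease, while the Bonferroni threshold of 0.05/1000 flags essentially none.

```python
import math
import random

random.seed(2)

def z_p(xs):
    """Two-sided normal-approximation p-value for H0: mean = 0, sigma = 1."""
    z = abs(sum(xs) / len(xs)) * math.sqrt(len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# 1000 "genes", none of which actually does anything (pure noise).
n_genes = 1000
pvals = [z_p([random.gauss(0, 1) for _ in range(30)]) for _ in range(n_genes)]

raw_hits = sum(p < 0.05 for p in pvals)                   # ~50 false discoveries
bonferroni_hits = sum(p < 0.05 / n_genes for p in pvals)  # expect roughly 0
print(raw_hits, bonferroni_hits)
```

Bonferroni controls the chance of even one false positive across all 1000 tests, at the cost of power; that trade-off is why it matters that each test here also used only 30 samples.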

"Each test will give me a p-value, but the more tests I perform the more likely it is that some will pass."

Precisely. There's a great XKCD comic on this phenomenon:

Bayesians say that p-values are meaningless and should not be used to prove or disprove anything.

The raw data not being available, we can still do reasonable inference from this:
"... people with two G-copies came across better than their peers, regardless of gender. Of the ten most trusted listeners, six were double G-carriers, while nine of the ten least trusted listeners had at least one A-copy."

What we should estimate is the distribution of the probability that a person with two copies of G is trusted (T), given the data above. Since we don't know the number of G-people, we must average our estimate of this probability over all prior assumptions that the total number of G-people is from 6 to 14 (9 people are A-type). It turns out the distribution of P(T|G) is pretty wide. Its mode is around 0.5, but the mass is concentrated at values greater than 0.5. In fact, the probability that P(T|G) is greater than 0.5 is 0.69. That's all we know. Literally.

My patients want to know whether the treatment I am about to give them works or not. What do you suggest I tell them if the probability that P(Works|Treatment) > 0.5 is 0.69? Should the FDA allow the treatment to be marketed?

Saying 'it's complicated' does not benefit my patients in any way, nor the malpractice lawyers.

What's the standard of care (hehehe)?

ad*m, I am totally with you. Consider this: a small sample of 23 patients shows that there is 0.69 chance that the next patient will benefit from the treatment and 0.31 that she will not. If she does she lives, if she does not she dies of the disease, perhaps a bit faster - due to the treatment. You tell her these numbers and let her decide (and pay).

If life were uncomplicated, we would have already received reports about this. Life is complicated, and p-values do not help at all.

The problem here is that no one's reporting negative results. If folks were allowed to publish p < 0.5 as "more likely to be effective than not" without being required to publish regardless of the result, then an apparent 69% likelihood of success is meaningless. Not all drug approval requires particularly high confidence, but the minimum the FDA requires for drug approval, even with small-sample oncology stuff, is 0.03-0.05. That doesn't mean you can't attempt the treatment in the context of a clinical trial, though. Lots of medicine is also used off-label on the basis of word of mouth between doctors, but then you wind up with something like Avastin being used to treat macular degeneration and actually making people go blind.

Zach, 0.03 is as wrong as 0.5, 0.05 or 0.0005. It's a fallacious number that comes from nowhere and means nothing. FDA process is scientifically flawed and needs to be replaced with sound scientific - as opposed to statistical - procedures.

It's amazing that the big journals haven't started to require solid experimental design. It's even worse in fields with less relevance to medicine. I'd guess that most p-values in Science/Nature were calculated after looking at various histograms generated from one experiment. In an ideal world, I'd require that a data analysis plan predating the experiment be kept on file, ideally in a single planned-experiment depot to avoid fraud from submitting different plans to different journals. Any submission not conforming to that standard would be evaluated for scientific merit and published on the condition that key findings be reproduced by an independent lab, with exceptions rarely allowed.

This isn't limited to experimental science, either. Here's an analysis that convinced just about every pop-statistician blogger that Iran's elections were rigged -

Alex, What implications do you draw from this data for the FDA approval process? Perhaps you could comment on Avastin as an example.
