The Danger of Reusing Natural Experiments

I recently wrote a post, Short Selling Reduces Crashes, about a paper that used an unusual randomized experiment by the SEC, Regulation SHO (which temporarily lifted short-sale constraints for randomly designated stocks), as a natural experiment. A correspondent writes to ask whether I was aware that Regulation SHO has been used by more than fifty other studies to test a variety of hypotheses. I was not! The problem is obvious. If the same experiment is used multiple times, we should be imposing multiple-hypothesis-testing standards to avoid the green jelly bean problem, otherwise known as the false positive problem. Heath, Ringgenberg, Samadi and Werner make this point and test for false positives in the extant literature:

Natural experiments have become an important tool for identifying the causal relationships between variables. While the use of natural experiments has increased the credibility of empirical economics in many dimensions (Angrist & Pischke, 2010), we show that the repeated reuse of a natural experiment significantly increases the number of false discoveries. As a result, the reuse of natural experiments, without correcting for multiple testing, is undermining the credibility of empirical research.

.. To demonstrate the practical importance of the issues we raise, we examine two extensively studied real-world examples: business combination laws and Regulation SHO. Combined, these two natural experiments have been used in well over 100 different academic studies. We re-evaluate 46 outcome variables that were found to be significantly affected by these experiments, using common data frequency and observation window. Our analysis suggests that many of the existing findings in these studies may be false positives.
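
To see the arithmetic behind the false positive problem, here is a minimal simulation sketch (my own illustration, not the procedure in Heath et al.): one randomized "experiment" is reused to test fifty unrelated outcomes, every one of which is pure noise by construction. The sample size and the number of outcomes are arbitrary assumptions.

```python
# A minimal sketch (illustrative parameters, not Reg SHO data): one randomly
# assigned "treatment" is reused to test many unrelated outcomes, all of
# which are pure noise, so every rejection is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_stocks, n_outcomes, n_sims, alpha = 2000, 50, 500, 0.05

hits_per_sim, any_hit, any_hit_bonf = [], 0, 0
for _ in range(n_sims):
    treated = rng.random(n_stocks) < 0.5                     # random pilot designation
    outcomes = rng.standard_normal((n_outcomes, n_stocks))   # 50 noise outcomes
    pvals = np.array([stats.ttest_ind(y[treated], y[~treated]).pvalue
                      for y in outcomes])
    hits_per_sim.append((pvals < alpha).sum())
    any_hit += (pvals < alpha).any()
    any_hit_bonf += (pvals < alpha / n_outcomes).any()       # Bonferroni-adjusted

print(f"average 'significant' outcomes per experiment: {np.mean(hits_per_sim):.1f}")
print(f"experiments with at least one false finding:   {any_hit / n_sims:.0%}")
print(f"...after a Bonferroni correction:              {any_hit_bonf / n_sims:.0%}")
```

With fifty independent tests at the 5% level, roughly 1 - 0.95^50 ≈ 92% of such experiments produce at least one spurious "finding"; a Bonferroni correction pulls that back to about 5%.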

There is a second, more subtle problem. If more than one of the effects is real, it calls into question the exclusion restriction. To identify the effect of X on Y1 we need to assume that X influences Y1 along only one path. But if X also influences Y2, that suggests that there might be multiple paths from X to Y1. Morck and Yeung made this point many years ago, likening the reuse of the same instrumental variables to a tragedy of the commons.
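
A toy simulation (my own, with made-up parameters) shows why this matters for identification: if the experiment Z also moves a second outcome Y2 that feeds into Y1, an IV/Wald estimate that attributes the entire reduced-form effect of Z on Y1 to the channel of interest recovers a sizeable "effect" even when the true effect is zero.

```python
# A toy violation of the exclusion restriction (all parameters invented for
# illustration). Z is the natural experiment, S the channel we hope to
# isolate, Y2 a second channel that Z also moves, Y1 the outcome.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta_true = 0.0                                   # true effect of S on Y1

Z = (rng.random(n) < 0.5).astype(float)           # random treatment assignment
S = 1.0 * Z + rng.standard_normal(n)              # first stage: Z shifts S
Y2 = 0.8 * Z + rng.standard_normal(n)             # second path: Z also shifts Y2
Y1 = beta_true * S + 0.5 * Y2 + rng.standard_normal(n)

# Wald/IV estimate that assumes Z affects Y1 only through S:
iv_est = np.cov(Z, Y1)[0, 1] / np.cov(Z, S)[0, 1]
print(f"true effect of S on Y1:                  {beta_true:.2f}")
print(f"IV estimate attributing everything to S: {iv_est:.2f}")   # about 0.4
```

The bias is exactly the second path (0.8 × 0.5) divided by the first stage (1.0), so the estimate lands near 0.4 even though S does nothing.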

Solving these problems is made especially difficult because they are collective action problems with a time dimension. A referee who sees a paper roll the dice multiple times may demand corrections for multiple hypothesis testing and tests of the exclusion restriction. But if the problem is that there are many papers, each running a single test, the burden on the referee to know the literature is much larger. Moreover, do we give the first and second papers a pass and demand multiple-hypothesis corrections only for the 100th paper? That seems odd, although in practice it is what happens, since more original papers can get published with weaker methods (collider bias!).

As I wrote in Why Most Published Research Findings are False we need to address these problems with a variety of approaches:

1) In evaluating any study, try to take into account the amount of background noise. That is, remember that the more hypotheses that are tested, and the less selection [this is one reason why theory is important: it strengthens selection, AT] that goes into choosing hypotheses, the more likely it is that you are looking at noise.

2) Bigger samples are better. (But note that even big samples won’t help solve the problems of observational studies, which are a whole other issue.)

3) Small effects are to be distrusted.

4) Multiple sources and types of evidence are desirable.

5) Evaluate literatures not individual papers.

6) Trust empirical papers which test other people’s theories more than empirical papers which test the author’s theory.

7) As an editor or referee, don’t reject papers that fail to reject the null.

Comments

"But if X also influences Y2 that suggests that there might be multiple paths from X to Y1". Can you give an example that makes this clear why this is an issue?

A futures market may or may not reduce crashes, but it does reduce volatility (of supply/demand as well as price). On the other hand, a large short position by a high-profile person could affect the market. To paraphrase the marketing slogan for EF Hutton: when Ray Dalio shorts, people listen. No, I'm not suggesting that Dalio is intentionally manipulating markets in which he has a short position, but such behavior is not uncommon, and the Supreme Court has been making it more and more difficult to punish market manipulators. Why would a conservative Court make it more difficult to punish someone who is intentionally manipulating markets and thereby reducing faith/confidence in markets?

The "green jelly bean" cartoon reminded me of a seminar I attended. The presenter, a tenured professor in his fifties, presented a linear regression with 120 (!) explanatory variables. He applied the usual 95% confidence interval and concluded that six of them were significant.

To this day I'm not sure which is more amazing - the fact that he attached any significance to his finding, or that he was able to come up with 120 variables to test.
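
For what it's worth, 0.05 × 120 = 6, which is exactly what chance predicts. A quick simulation (made-up data, nothing to do with the actual seminar) confirms the arithmetic:

```python
# A quick check of the arithmetic above (simulated data, not the seminar's):
# regress a pure-noise outcome on 120 pure-noise regressors and count how
# many coefficients clear the 5% significance threshold.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_obs, n_vars = 1000, 120

X = rng.standard_normal((n_obs, n_vars))   # 120 explanatory variables, all noise
y = rng.standard_normal(n_obs)             # outcome unrelated to any of them

fit = sm.OLS(y, sm.add_constant(X)).fit()
n_sig = int((fit.pvalues[1:] < 0.05).sum())   # skip the intercept
print(f"'significant' coefficients out of {n_vars}: {n_sig}")   # about 0.05 * 120 = 6
```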

Kudos to Alex for linking to the XKCD cartoon!

.. the larger issue is "How" the sages at SEC objectively determine the proper Federal interventions in short-sale markets (and markets in general).

'the larger issue is "How" the sages at SEC objectively determine the proper Federal interventions in short-sale markets'

Who knows whether that is a larger issue, but it has absolutely nothing to do with this quote, starting in the second line of what Prof. Tabarrok posted: '(which temporarily lifted short-sale constraints for randomly designated stocks).'

Of course, one can always quibble how objectively the randomness was determined.

"3) Small effects are to be distrusted."

There are actually good reasons to distrust large effects, or at the very least to think that large published effects overstate the true effect. This is the case when the statistical techniques have low power and there is publication bias toward papers reporting a "statistically significant", large, or surprising effect. In that situation, the papers that get published will be biased toward studies that overstate the true effect.
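
A small simulation sketch (illustrative numbers only) of that point: with a tiny true effect and underpowered studies, the estimates that happen to cross the significance threshold, and therefore get "published", are several times larger than the truth on average.

```python
# A sketch of the winner's curse under publication bias (all settings are
# invented for illustration): a small true effect, low-powered studies, and
# "publication" only of results that are significant at the 5% level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect, n_per_arm, n_studies = 0.1, 50, 20_000   # power is roughly 8%

estimates, pvalues = [], []
for _ in range(n_studies):
    control = rng.standard_normal(n_per_arm)
    treatment = true_effect + rng.standard_normal(n_per_arm)
    estimates.append(treatment.mean() - control.mean())
    pvalues.append(stats.ttest_ind(treatment, control).pvalue)

estimates, pvalues = np.array(estimates), np.array(pvalues)
published = estimates[pvalues < 0.05]      # only "significant" studies get written up
print(f"true effect:                        {true_effect:.2f}")
print(f"mean estimate, all studies:         {estimates.mean():.2f}")
print(f"mean estimate, 'published' studies: {published.mean():.2f}")   # roughly 4x the truth
```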

Alex has responded commendably to an appropriate challenge to the reliability of his findings. I hope the responsible readers of this blog will follow his example.

"I hope the responsible readers of this blog will follow his example."

Are there responsible readers of this blog?

This is a good and important observation, and as Alex says there may not be a viable solution; the best we can do is take the partial steps he describes.

It is indeed a common property or tragedy of the commons situation. But I can't think of a viable way to ration access to the data (maybe try a price mechanism, but is researchers' ability to pay a good measure of how valuable their insights are?). Especially with public datasets such as the ICPSR, GSS, federal longitudinal surveys, etc., which were designed to be made available to researchers rather than rationed to just the lucky or wealthy few. More researchers ==> more insights, albeit at the cost of more Type I errors.

Several comments on Alex's earlier post cited Bazzi and Clemens's "Blunt Instruments" paper, which seems to have preceded Morck and Yeung's paper by two years.

Editors of the top-tier psych journals have made it clear that they don't want to (and logistically can't) publish "failure to replicate" articles unless they also contain something positive about something else. So articles will be rejected not because they fail to replicate, but because they don't report positive and "striking" findings.
It appears that the root problem is the competition to publish in the top journals. That probably won't change, because human attention is too limited. Roughly speaking, the ranking of the journal (among other things) tells us what's worth paying attention to.
Another problem is that there are too many sub-par journals and too many grad students seeking plum tenure-track jobs. The solution is to give no one tenure. Everyone must be contingent, with contracts dependent on enrollments; only then will the problem go away. (Yeah yeah, academic freedom: feel free to say or publish whatever you want, at your own expense. You can be as free as you can afford to be.)

There is surely a risk that using a single dataset to draw many inferences leads to overfitting for any or all of those inferences. But this, it seems to me, has nothing at all to do with "adjusting for multiple hypotheses." After all, there is an infinite number of studies one might do with any particular database. Is the implication that no t-stat is high enough, whether the other hypotheses are actually run or not?
