Interpreting Statistical Evidence

Betsey Stevenson & Justin Wolfers offer six principles to separate lies from statistics:

1. Focus on how robust a finding is, meaning that different ways of looking at the evidence point to the same conclusion.

In "Why Most Published Research Findings are False" I offered a slightly different version of the same idea:

Evaluate literatures not individual papers.

S&W's second principle:

2. Data mavens often make a big deal of their results being statistically significant, which is a statement that it’s unlikely their findings simply reflect chance. Don’t confuse this with something actually mattering. With huge data sets, almost everything is statistically significant. On the flip side, tests of statistical significance sometimes tell us that the evidence is weak, rather than that an effect is nonexistent.

That’s correct, but there is another point worth making. Tests of statistical significance are all conditional on the estimated model being the correct model. Results that should happen only 5% of the time by chance can happen much more often once we take into account model uncertainty, not just parameter uncertainty.
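A toy simulation (invented numbers, not from any cited study) illustrates the model-uncertainty point: a nominally 5% t-test rejects a true null far more often than 5% when the model's i.i.d.-error assumption is wrong, here because the errors are autocorrelated.

```python
# Hypothetical sketch: nominal 5% tests over-reject when the fitted model
# is misspecified (the t-test assumes i.i.d. errors; these are AR(1)).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, rejections = 100, 2000, 0
x = np.arange(n, dtype=float)  # a trending regressor

for _ in range(reps):
    # AR(1) errors: each period's shock carries over into the next
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = 0.9 * e[t - 1] + rng.normal()
    y = e  # the true slope on x is exactly zero
    if stats.linregress(x, y).pvalue < 0.05:  # t-test assumes independence
        rejections += 1

print(f"nominal 5% test rejected {100 * rejections / reps:.0f}% of true nulls")
```

The advertised 5% false-positive rate is only correct conditional on the model; here the realized rate is several times higher.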

3. Be wary of scholars using high-powered statistical techniques as a bludgeon to silence critics who are not specialists. If the author can’t explain what they’re doing in terms you can understand, then you shouldn’t be convinced.

I am mostly in agreement, but S&W and I are partial to natural experiments and similar methods, which generally can be explained to the lay public. Other econometricians (say, of the Heckman school) do work that is much more difficult to follow without significant background; while remaining wary, I also wouldn't reject that kind of work out of hand.

4.  Don’t fall into the trap of thinking about an empirical finding as “right” or “wrong.” At best, data provide an imperfect guide. Evidence should always shift your thinking on an issue; the question is how far.

Yes, be Bayesian. See Bryan Caplan’s post on the Card-Krueger minimum wage study for a nice example.

5. Don’t mistake correlation for causation.

Does anyone still do this? I know the answer is yes.  I often find, however, that the opposite problem is more common among relatively sophisticated readers–they know that correlation isn’t causation but they don’t always appreciate that economists know this and have developed sophisticated approaches to disentangling the two. Most of the effort in a typical empirical paper in economics is spent on this issue.

6. Always ask “so what?” …The “so what” question is about moving beyond the internal validity of a finding to asking about its external usefulness.

Good advice, although I also frequently run across the opposite problem: thinking, for example, that a study done in 2001 doesn't tell us anything about 2013.

Here, from my earlier post, are my rules for evaluating statistical studies:

1) In evaluating any study, try to take into account the amount of background noise. That is, remember that the more hypotheses that are tested, and the less selection that goes into choosing hypotheses, the more likely it is that you are looking at noise.

2) Bigger samples are better. (But note that even big samples won't help solve the problems of observational studies, which are a whole other issue.)

3) Small effects are to be distrusted.

4) Multiple sources and types of evidence are desirable.

5) Evaluate literatures not individual papers.

6)  Trust empirical papers which test other people’s theories more than empirical papers which test the author’s theory.

7)  As an editor or referee, don’t reject papers that fail to reject the null.
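Rule 1 can be illustrated with a quick simulation (a hypothetical sketch, not from the post): test enough true-null hypotheses and "significant" findings appear by chance alone, at roughly the alpha rate.

```python
# Hypothetical illustration of background noise: 1000 tests where no real
# effect exists still produce roughly 5% "significant" results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_obs, false_positives = 1000, 50, 0

for _ in range(n_tests):
    a = rng.normal(size=n_obs)  # two groups drawn from the very
    b = rng.normal(size=n_obs)  # same distribution: no real effect
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} true nulls came out 'significant'")
```

This is why the number of hypotheses tested, and how they were selected, matters for how seriously to take any single significant result.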


'Don’t mistake correlation for causation,' but never forget that correlation is generally how causation is discovered or the measure used for its proof.

Exactly. Correlation (linear or nonlinear) is necessary but not sufficient for causation.

What surprises me is not how often people claim that correlation is causation, but how frequently critics claim that people showing correlation are implying causation. "Correlation is not causation" is usually a sign to me that the speaker is trying to sound intelligent, but really isn't.

Several studies that show correlation from different angles can be strongly suggestive of causation. The usual caveats apply. Often in decision making one must jump to conclusions.

One aspect I never see enough of is a discussion of the costs of type 1 and type 2 errors. Researchers focus so much on the former they often completely forget about the latter. It is unobservable, but not beyond a measure of control. Lots of high powered tests with low power. But also a lack of understanding that the results of the test obviate one error or the other.


The one question I never got a good answer for was, "Correlation is not causation, so what is?" What are better alternatives to prove causation (in social sciences particularly)?

I know of Granger causality and path analysis, but they never seem to be used in practical work. So what do we have, other than @Willit's "correlation from different angles"?

Just off the top of my head: simultaneous equation models, instrumental variables, propensity scoring, difference-in-differences, Heckit models, structural equation models ...

SW and Alex both make good points. As a researcher (or reader) what you'd like to see is the simple data showing an effect, and if someone raises their hand and says "correlation is not causation" you trot out your fancy econometric model that corrects for endogeneity and say "look, the result still holds".

Less desirable is when the simple data do not show the effect, but when you apply the fancy model you do see it. That can easily be a publishable result, but readers with some justification will be more wary of your conclusion. (And vice versa, when the simple data show an effect but the fancy model says the effect is illusory. The fancy model may well lead us to the correct conclusion, but readers will have some skepticism.)

"Several studies that show correlation from different angles can be strongly suggestive of causation."

Or, in social science settings, they can be strongly suggestive of a strong selection effect. A paper that Alex Tabarrok co-wrote is a perfect example. He shows that there is a positive correlation between the number of police in an area and the crime rate. Yet he shows this same correlation flips to being negative once you introduce a natural experiment where police presence is somewhat arbitrarily (pseudo-randomly) varied across districts and across time.

So in the real world, the "selection effect" (communities station police in high-crime areas where they are needed the most) out-weighs the fact that police do, in fact, deter a fair amount of crime when they patrol the streets. A careful study such as Klick and Tabarrok (2005) allows us to quantify both effects.

Natural experiments and RCTs are a fairly popular solution to the problem of disentangling causality from selection effects.
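The selection-effect story can be sketched with made-up numbers (a hedged toy model, not the Klick and Tabarrok data): districts with more latent crime get more police, so the raw correlation between police and crime is positive, yet a pseudo-random "surge" in patrols reveals the true crime-reducing effect.

```python
# Hypothetical simulation of selection vs. causation (all numbers invented).
import numpy as np

rng = np.random.default_rng(7)
n = 5000
baseline = rng.exponential(10.0, n)        # latent crime propensity by district
police = baseline + rng.normal(0, 1, n)    # endogenous: police follow crime
surge = rng.integers(0, 2, n) * 5.0        # pseudo-random extra patrols
crime = baseline - 0.3 * (police + surge) + rng.normal(0, 1, n)

raw_corr = np.corrcoef(police, crime)[0, 1]
effect = crime[surge > 0].mean() - crime[surge == 0].mean()
print(f"raw corr(police, crime): {raw_corr:.2f}")      # positive: selection
print(f"effect of the exogenous surge: {effect:.2f}")  # negative: deterrence
```

The naive correlation and the experimental contrast have opposite signs, which is exactly the pattern the natural-experiment literature exploits.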

"Thank God for confirmation bias; otherwise, we’d never know when correlation was causation." – Pascal-Emmanuel Gobry

"With huge data sets, almost everything is statistically significant."

Is this really true? Don't tests of significance adjust for the data set size?

I'm no fan of p-values but does this particular allegation have merit?

Yes, SW are correct. One can almost say that p-values are a measure of sample size!

No, p-values are a function of sample size with a negative first derivative. The calculated p-value is incorrect if the model is misspecified or if the distribution is not well behaved.

In time series estimation, people often forget that the future is part of the population. Data from the planet Vulcan are part of the population.

In other words, your sample size isn't as large as you think it is, and your model is not as correct as you think it is.

I'm not an expert statistician. What does research say about changing parameters?

first derivative of what? can you explain? thank you

Wait a minute.

If something is truly insignificant, then it should be insignificant in the "large" sample.

The chances of discovering this should increase with the size of the sample.

*Everything* cannot be significant with a large sample.

It is true that not *everything* is significant in a large sample, but it is also true that any effect, however trivial, will eventually be significant in a large enough sample. It is the statistical significance of trivial results that people worry about. In the social sciences, since it is generally true that everything depends in at least some small degree on everything else, nulls which presuppose an effect of zero are almost surely false and will almost surely be shown to be false in a large enough sample.

It is also true that p values will wander around even if the point null is true. Thus, there is almost surely some sample size at which the p value will wander over to a "significant" part of the probability space. This isn't much of a problem in the real world unless you're cheating, but since a lot of people cheat, it's a smallish problem.
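The "wandering p-value" point can be simulated (a hypothetical illustration): under a true null, peeking at the p-value as observations accumulate will find a "significant" moment far more often than the advertised 5%, which is exactly the cheating risk described above.

```python
# Hypothetical optional-stopping simulation: p-values wander under a true null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
trials, peeked_significant = 200, 0

for _ in range(trials):
    data = rng.normal(size=1000)  # the true mean is exactly zero
    # peek every 20 observations; stop at the first p < 0.05
    for n in range(20, 1001, 20):
        if stats.ttest_1samp(data[:n], 0.0).pvalue < 0.05:
            peeked_significant += 1
            break

print(f"{100 * peeked_significant / trials:.0f}% of true nulls looked "
      "'significant' at some peek, versus the advertised 5%")
```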


Isn't that only true if you go fishing for hypotheses? The right statement seems, to me: "Given a large enough sample you will always find some feature at a statistically significant level."

I don't think the chance of finding any given hypothesis statistically significant increases as sample size increases, so long as you don't selectively keep increasing the sample size till you get significance.

If your data set was the entire population, every finding would be significant because it reflects actual reality. You are 100% confident of the findings. Large data sets are more likely to be representative of the population so the significance is going to be higher.

Yes, but is that a bug or a feature? If so, the test is indeed doing a good job isn't it?

Both bug and feature.

Example: suppose there are 200 supermarkets and we have a sample of 20 of them. We can then look at a sales effect and we might say that, STATISTICALLY, a 10% change in Cheerios sales was significant relative to the sampling error, even though a 5% change might be what is of interest from a PRACTICAL standpoint.

Now let's suppose you have data from all 200 supermarkets. A difference of 1 unit is STATISTICALLY significant because the sampling error is zero, even though it is of no practical importance.

This is a feature in terms of accuracy, but it forces us to be explicit about what size effects really matter in our context, something that a reliance on statistical significance allows us to avoid.
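The supermarket point can be sketched numerically (all figures invented): the same trivial lift is invisible in a small sample but decisively "significant" in a huge one, which says nothing about whether the lift matters in practice.

```python
# Hypothetical sketch of statistical vs. practical significance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lift = 1.0  # a 1-unit lift on a mean of 100: practically trivial

def lift_pvalue(n):
    control = rng.normal(100, 15, n)
    treated = rng.normal(100 + lift, 15, n)
    return stats.ttest_ind(treated, control).pvalue

p_small = lift_pvalue(20)       # typically well above 0.05
p_large = lift_pvalue(20_000)   # essentially always below 0.05
print(f"n = 20:     p = {p_small:.3f}")
print(f"n = 20000:  p = {p_large:.2g}")
```

The test is doing its job in both cases; it is the reader's job to ask whether a 1-unit lift is worth anything.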

Not so! You are assuming 100% VALIDITY AND RELIABILITY for the instrument you are measuring with.

You are also assuming no bugs in your Excel spreadsheet :-)

That begs the question. Of course, when you sample the entire population or a large portion thereof with a correctly specified model, the probabilities of a type 1 or type 2 error are minuscule.

The problem is that even huge data sets can come from weird distributions or ones with changing parameters, models are often misspecified, and data quality is an issue. The true p-value is often much higher than advertised. It isn't necessarily true that large data sets are more representative of the population.

"when you sample the entire population or a large portion thereof with a correctly specified model"

You don't need a regression, the mean values are enough.

Regression analysis is only needed when you have a sample and you don't know what the correctly specified model is.

Not true. Multivariate models are just as relevant, regardless of whether you are analyzing the "universe" of people.

@veobaum: I meant the conditional mean values, apologies.


The issue with large sample sizes is that you have more power to reject the null hypothesis, so if your model is slightly wrong, you are more likely to reject it. And it can be wrong because there is an effect, or it can be wrong because it is an imperfect descriptor of reality.

I want to take issue with:

> 3. Be wary of scholars using high-powered statistical techniques as a bludgeon to silence critics who are not specialists. If the author can’t explain what they’re doing in terms you can understand, then you shouldn’t be convinced.

If an author can't explain in terms *you* can understand: well, people have a huge variety of backgrounds, and they differ in the effort they are willing to put in to understand an argument that confirms or refutes their pre-existing beliefs. As written, this is a rule to excuse ignorance.

As for 3, you are missing something important. I was taught a long time ago that if you are learning something, you can show that you understand it by your ability to explain it to someone else.

Let me illustrate. I have watched the global warming debate, reading both sides. My prejudices, if any, have come from my experience in the application of policy, where green has essentially been equated with fraud. I'm not a meteorologist, so I have no training or knowledge to judge either side of the issue.

I have a good friend who was a professor of meteorology at an important US university. We trudge through swamps together photographing birds and bugs, and during these treks I ask him questions, and he explains them like he would to a freshman. I asked him about global warming. The models, the methods of measuring and detecting are very complex, and probably not his specific field of research, but he said something that explains it very simply: if you have a system represented in a series of mathematical equations, and you have a constant that is always leaning in one direction, no matter what the other influences and factors are, your results will eventually tend in the direction to which the constant leans. So the effects of CO2 concentrations in the thermodynamic transfers that make up weather and climate, if understood, will make a difference over time.

Whether I'm convinced or not is beside the point. He summed up the issue very simply, and looking at the issue from that standpoint helps me understand it.

Statistics can be used to show almost anything, especially the more complex and opaque the techniques. If it doesn't make sense, it doesn't make sense, no matter how adept someone is at twiddling numbers.

I agree, but as I commented below I don't think there's much information in that test. It also favors those with good sales skills.

How 'bout 'forget consensus'

#3 is insane. It implies that a person who doesn't understand standard errors shouldn't trust a researcher using a t test. If SW and Heckman each write a paper or a series of papers on the same topic, are we really thinking the former will do a better job? At some point, the onus is on readers to learn statistics. But this is impolite for economists to point out. Specialist knowledge is important.


"It implies that a person who doesn’t understand standard errors shouldn’t trust a researcher using a t test".

On what basis should you trust a researcher whose work you can't understand? You can't evaluate it yourself, so you can only accept it based on trusting in the researcher or the community that does understand it. The less understandable research is, the smaller the group of people who can evaluate it themselves, and the more it must be accepted by everyone else on the authority of that community.

My advice for someone like that is to learn statistics, or to take his or her own opinions on social science, policy, etc. way less seriously. By analogy, I don't know how to work with mains electricity, so I don't do the electrical work in my house. The cost of my ignorance is that I may get ripped off when I hire electricians. So maybe I'll learn more about electrical work at some point. I would also add that there are better ways to spend a life than forming strong opinions on "important" topics.

I think yours is a simplistic, somewhat black-and-white position. There are many of us whose lives (i.e., careers) require us to make decisions based on studies that have statistical components. Now, we do know some statistics, but obviously not as much as a professional statistician.

Usually I can safely ignore your "go learn the underlying method" advice since the luxury of time is absent in a lot of projects. At times like these, ignoring a study that used esoteric methods is sound advice.

I think the onus is on statisticians to use methods that their target audience can grok. An "Aircraft wing fatigue-failure analysis" does not have to be understandable to an English Lit. graduate. But if it uses statistical methods that an Aeronautical Engineer cannot easily grasp, the statistician is in iffy territory. Your analysis is likely to end up in the trash can, and it probably deserves it.

I appreciate the comment.

When it comes to policy, I disagree 100 percent.

When it comes to business, I'm sympathetic. Time is an important constraint. As long as management realizes the trade-offs, ignoring some studies due to lack of tools is a reasonable thing to do.

"But if it uses statistical methods that an Aeronautical Engineer cannot easily grasp, the statistician is in iffy territory. Your analysis is likely to end up in the trash can and it probably deserves it."

Precisely this.

@JWatts, it depends on the problem.


Why should policy be any different? Is time not a luxury even there? Are there not trade-offs in policy?

Admittedly policy might often be slower than private enterprise in reacting to situations. But that's a bug not a feature.


The reason I make the distinction is that policy mistakes are long-lasting and potentially very costly to everyone in a society. If a policy is going to last several decades, spending, say, a decade ruling out the worst policies seems a good use of our time. The potential gains from improving methodology in policy, in my view at least, far exceed the costs.

What about the cost of delaying policy decisions? No policy is a policy, in some sense.

Imperfect policy may sometimes be better than waiting for 20 more studies. In any case, I fail to see why "improved methodology" must necessarily require greater complexity or esoteric methods.

Thanks for the response.

I agree. Not doing anything is a policy with an attendant set of costs and benefits.

To my eye, I just don't think we'd cause that much harm by doing things more slowly. Do we really think the way we do things now is that great? Bad policy hurts people directly by causing harm. It also hurts people indirectly by diverting money away from other, better programs. I really worry about how long bad policies may live. That's why I think the benefits of improving methodology exceed the costs.

On the arguments for and against more complex methods, I think it's just better for me to point you to this symposium on econometrics. Clearly I'm more sympathetic to Keane's paper, as well as Nevo and Whinston's paper. The issues deserve paper-length treatment.

Of course readers of research should learn. And of course research won't be understood by everybody.

The point, though, as I take it, is that trying to explain one's findings clearly to the broadest possible audience is a mark of a good researcher. Excessive use of jargon that few can understand is a warning sign of bad research. If people have good results, they want as many people as possible to understand those results. If they have bad results, they can take refuge in the fact that few will understand and realize it.

Sure, I agree. But I don't think this signal, i.e. lack of jargon, contains as much information as you think it does. It provides only a very coarse test. And it really doesn't help the average person's understanding of the deep methodological issues. Unfortunately, these deep issues are really important for learning the "right" things from the data.

Trust but verify? And consider the difference between faith and belief. I don't really believe anything that I read in a paper (or a newspaper) but I can act as if I did believe it.

I don't think I follow. Can you give some examples?

From my perspective as a researcher in science, very few papers are intended to stand on their own. Their result is the jumping-off point for someone else's next paper. So I don't have to trust that paper to design an experiment based on the results. Preferably, my experiment will yield useful results whether their results are correct or not.

Thanks for providing another example.

Is there a difference between what you do and what non-scientists do? Non-scientists don't have methodology to rely on. Furthermore, to apply Alex's recommendation on evaluating literatures, you need to understand methodology to evaluate literatures.

It depends on how much you need to believe the results.

Sorry, I'm just not getting you on this point. Maybe a better question to ask is "How you think voters or policymakers should proceed?"

"Be wary of scholars using high-powered statistical techniques as a bludgeon to silence critics who are not specialists. If the author can’t explain what they’re doing in terms you can understand, then you shouldn’t be convinced."

Seems like a good litmus test for elitism. Achieving the goal of the second sentence might be the crucial aspect of my (technical) job. A lotta people can crunch numbers.

Be wary of bloggers who use value laden language to make points. There is a time and place for almost every statistical technique ... the true skill is not how to use them but when.

As usual, I don't quite grok your jargon, tho I suspect you may be saying something interesting.

I'm struck by all the bristling over number three. I'm pretty sure it doesn't say not to use sophisticated statistical techniques, just don't hide in them. I advise the bristlers to quickly move on and imbibe number four, particularly as it pertains to abstruse results.

Here's an application of his first sentence in 3: I felt his quip was out of line but fully in line with his priors. Economists fight with estimators and identification strategies and often miss the bigger picture. I agree with his deeper point but am not comfortable with what it encourages in debates. But whatever, to each his own...the profession and the public gets what it asks for.

Wrong link; here's his tweet. Again, just explaining my reaction; everyone can attack these debates as they like.

It was hard to tell from the tweets of that day what Wolfers actually thinks about time series analysis. He expressed problems with causal ordering, but others noted that the original analysis tried both causal ordering setups. A few other people tried to help him out, but an uncharitable skeptic might have gotten the (probably inaccurate) perception that Wolfers isn't super familiar with the SVAR literature. He even claimed that causal ordering techniques have not been considered credible for "decades," which seemed odd. There were certainly problems with the blog post he was criticizing, of course, but others mentioned those problems--not Wolfers. He just rejected the analysis for being too complicated without explaining why we should consider causal timing arguments any more complicated or show-off-y or hand-wavy than any other identification arguments.

In short, again, SW #3 isn't very useful as it is stated.

"Be wary of bloggers who use value laden language to make points. There is a time and place for almost every statistical technique … the true skill is not how to use them but when."

So I should be wary of your post? Because the phrase "value laden language to make points" is, in itself, an example of using value-laden language to make points. ;)

Furthermore, the complaint in #3 is not about using statistical techniques; it's about not being able to explain them and hiding behind the jargon. If an educated individual can't adequately explain why they used a particular technique without resorting to technical jargon, it is, in my experience, an indication that they either don't think they should be bothered with explanations or that they don't actually understand it completely themselves and are hiding behind techno-jargon. It always sets off a red flag in my mind when an engineer in a meeting starts laying the jargon on thick and isn't explaining the matter at a fundamental level.

Of course you should be wary of what I say ... much more so than what Stevenson and Wolfers (and other bloggers) say. There are good reasons we listen to well-regarded experts. But I think wariness needs to be paired with openness, in particular openness to statistical techniques and empirical methods that you don't use yourself. Sometimes when you don't understand the specialist, it's you, the listener, who has more work to do (before you call them out as a fraud). I have seen too many economists completely dismiss another's analysis simply because of the estimator or data set, and it concerns me.

More to the post: I agree with every single point in both of these lists, but I am not sure how successful we are going to be in educating the public about interpreting empirical results. I still think the profession needs more replication, more meta-studies (not involving the key authors), and more cross-method validation of key findings. We may not agree as a group, but it is extremely hard to canvass the literature and figure out the range of beliefs (even with a PhD), and it is unreasonable to expect a journalist or Congressional staffer to do that for us.

"There is a time and place for almost every statistical technique"

I'm a bit skeptical of this. I'll offer an imperfect analogy: when an engineer designs something, there's a standard set of pipe sizes, wire gauges, flange thicknesses, etc. No doubt I could get a little more efficiency by going with a non-standard size, but we resist that temptation unless there are very compelling reasons. If I ever saw a blueprint where the designer insisted that it'll only work with one precise pipe size and no other, that's a very non-robust design and I'd be leery.

So also with statistics (I think). If the effect you want to demonstrate only shows up after using a particular esoteric technique I'm very skeptical of that effect itself. Note, esoteric not to a lay audience, but to the target of your statistical analysis (policy planners, engineers whatever)

Sorry, with all due respect, this is such a strange way to go about things. Take ordinary least squares (OLS, i.e. plain vanilla regression) and instrumental variables (IV). I think most people think they understand OLS. I'm willing to bet that almost all of those people find IV esoteric.

The problem is that in almost all applications of OLS in economics, it's easy to think of plausible stories for why a regressor will be correlated with the error term in a linear model. A natural approach is to look to IV. With a valid instrument, the IV estimate may differ from the OLS estimate in both sign and magnitude. Suppose further that we have a whole literature on a question, following Alex, using many different instruments consistently finds that there is a big difference between the OLS and IV results, and that IV provides a consistent set of estimates. Most economists, and perhaps statisticians as well, would be inclined to believe the collection of IV results for both economic and statistical reasons.

Your approach suggests that people who don't understand IV should view these results with skepticism. If that is the case, I think it is evidence that there's something wrong with your approach.

I'm sure there are plenty of analogies from engineering in a similar spirit.
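The OLS-vs-IV contrast can be sketched with simulated data (invented numbers; a minimal Wald-style IV estimate rather than full 2SLS): a confounder biases OLS, while a valid instrument recovers the true coefficient.

```python
# Hypothetical endogeneity simulation: OLS is biased, IV is not.
import numpy as np

rng = np.random.default_rng(2024)
n, beta = 100_000, 1.0

z = rng.normal(size=n)             # instrument: shifts x but not the error
u = rng.normal(size=n)             # confounder driving both x and y
x = z + u + rng.normal(size=n)     # endogenous regressor
y = beta * x + u + rng.normal(size=n)

ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # biased upward by the confounder
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # simple IV (Wald) estimate

print(f"true beta = {beta}, OLS = {ols:.2f}, IV = {iv:.2f}")
```

In this setup OLS converges to something well above the true beta while the IV estimate lands near it, which is the pattern the comment describes a whole literature checking for.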

I am not going to disagree with you (entirely). In my last year of courses in grad school, I had an advanced econometrics field course at 8:30am and a survey methods class at 6pm. Other than being very tired at the end of those days, I was always conflicted. In the morning I learned state-of-the-art ways to isolate variation, and at night I learned how biased data often were. And not in a random, white-noise kind of way: like, seriously messed up. And you know what they say: garbage in, garbage out.

I have since learned that sometimes you have to work with suboptimal data (you can't always fix it at collection or find better data) and you need estimators to help you filter. Look at the literature on trends in consumption inequality: the key question (up, down, same) often comes down to the filter on some incomplete, error-ridden data. It is an important enough question that I am willing to look through many different statistical filters and try to discern a pattern of results. If I could do it off summary statistics, then awesome, but the world is not always so easy to parse. Simple is often better than complex, but not always.

SW #3 is a pretty hard pointer to interpret. I tried to pin down Prof Wolfers on this on Twitter, but he won't be more specific. It's hard to interpret it as anything but, "Any statistical technique that Justin Wolfers uses is good; anything more complicated is bad." Economists differ widely in the methods they learn; those who don't learn and use time series, for example, have strong incentives to convince everyone that VAR is statistical hogwash. Commenter "student" above seems to have it right: pretty much any statistical method is difficult for some people to understand, so we can't all be expected to dumb things down to meet this rule.

On a more general note, these kinds of lists are pretty useful. Given the role of R&R in political dialogue, it's useful to also keep politics in mind. It's hard enough to know what to think about intra-discipline dialogue; now that some of this is carried out in the blogosphere, it is even harder.

+1: It’s hard to interpret it as anything but, “Any statistical technique that Justin Wolfers uses is good; anything more complicated is bad.” That's precisely it.

It is more that it is difficult to take any one of these pointers by itself. If someone doesn't show you their source data, you don't trust it as much, just as you don't trust a statistical black box as much if they don't explain what is under the hood. Not that you don't trust it at all, because you are supposed to be Bayesian and not really trust anything anyway.

Number 3 is silly. Some empirical questions are amenable to relatively simple econometric techniques while others are not. Natural or quasi-natural experiments are often not possible, and so underlying causal relationships can only be uncovered with more complicated methods.

Number 2 should just be restated as don't confuse statistical significance with economic significance.

outsider = structural? just curious.

If one wanted to be uncharitable, you might even say that Wolfers' research topics are chosen in part because they can be tested fairly simply.

Be wary (or possibly ignore) anybody's empirical study if they won't publicly provide the data they used.

+3, though I would add "and if they won't publicly provide their techniques used to derive their analysis from the data".

If you use a model, the model and data set should be an appendix to your research paper.

....and the code. Too often people hide their code, so it's never worth anyone's time to recode it all.

We should be able to run your code with minimal effort and then go "Aha, Here's a wrong sign. The answer should have been + 5% not - 5%."

Then, according to the recent blog post here, you'll be ignoring 98% of economics papers, right? And this is understandable to me because the creation of the data is the hard part in science, and you want the statistical massaging to be as trivial as feasible.

If you are doing your own research, it seems to me, you have a network of effects, and various literatures hold places at each of the nodes. If a better one wins the place of a lesser one, that is fine, but if some of your nodes are completely vacant, that is probably worse than a non-ideal occupant.

You could safely ignore 98% of the papers in any field.

It's figuring out which 98% that's hard.

But shouldn't there be a signalling effect here? If a researcher is willing to supply full data and methods, shouldn't that be a signal of quality?

If a researcher is willing to supply full data and methods, shouldn’t that be a signal of quality?

You would think so, but there are a whole lot of quoted/referenced papers that don't supply data or methods.

The problem is that some data sets are only semi-publicly available, e.g. because of privacy issues. And to make matters worse, some government agencies don't even provide you with the data; you have to give the agency your piece of software, and they will run it on their computer and only give you the output. Does that make the research less important or worth less? Should these studies not be carried out?

With more and more disaggregate data available, researchers will have to face the conflict of transparency vs. privacy.

@Anon: That's correct.

I wouldn't say the research shouldn't be carried out - everybody can decide for themselves. But I do think intelligent consumers will ignore research where the data is not made available. I can't state it any more strongly - IGNORE. Unless we do that, we leave open the door for innocent mistakes and much worse. Also, since so much economic research is done on behalf of somebody's attempt to influence public policy, there should be a requirement that data be made available. Actually this requires no real changes other than policymakers making it clear that they will ignore any studies that do not release their data. The fact that they won't do so reveals a lot about what is wrong with both the policy making environment and the research environment (in academia and elsewhere).

I would add that I have been told numerous times that the data is from public sources so I should get it myself. Such an answer is totally inadequate - while much of the research data does come from public sources (such as Census, BLS, etc.) it is almost always processed in order to analyze - there is missing data, choices are made about how to aggregate data, choices are made about how to treat time series, etc. Without having the exact data that is used, it is virtually impossible to replicate their results.

There are reasonable restrictions on data release. How much confidentiality risk do we want to expose, say, health insurance beneficiaries to in releasing claims data? There are trade-offs. Whether we're making the right ones is an open question.

There are fairly good ways of anonymising data. In most cases, even the researchers will not see the data before it is scrubbed. Ergo, very minimal additional effort is needed to safely release the data set.
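For the curious, scrubbing can be quite mechanical. A rough sketch, with entirely made-up records and field names (nothing specific to any agency's actual process): drop direct identifiers, replace the ID with a salted hash, and coarsen quasi-identifiers like age.

```python
import hashlib

# Hypothetical raw records; the field names are made up for illustration.
raw = [
    {"name": "A. Smith", "ssn": "123-45-6789", "age": 63, "claim_usd": 1200},
    {"name": "B. Jones", "ssn": "987-65-4321", "age": 47, "claim_usd": 300},
]

SALT = b"keep-this-out-of-the-release"  # blocks simple dictionary attacks

def scrub(record):
    """Drop direct identifiers, pseudonymise the SSN, coarsen the age."""
    pid = hashlib.sha256(SALT + record["ssn"].encode()).hexdigest()[:12]
    return {
        "pid": pid,                                   # stable but unlinkable ID
        "age_band": f"{record['age'] // 10 * 10}s",   # 63 -> "60s"
        "claim_usd": record["claim_usd"],
    }

released = [scrub(r) for r in raw]
print(released)
```

The salted hash keeps records linkable across releases without exposing the SSN, and banding the age is exactly the kind of coarsening that addresses the demographic-plus-geographic re-identification worry raised above.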

You are being too charitable to private-dataset studies. The motivation is less often confidentiality than other motives, or simply laziness or convention.

Yes, there are tradeoffs but I'm skeptical we are anywhere near the optimum.

There is no health research where the identity of the individuals is necessary. That information can easily be stripped away and the data released. Claims that there is a confidentiality risk are bogus (I admit that there may be an exception to this, but I believe it would be rare indeed). There are very few tradeoffs - only the tradeoff between having your analysis open to examination and the potential to get it published without the possibility of replication. This is a real tradeoff for the researcher, but not for the consuming public.

So I don't know the details, but I've been told by a reviewing statistician in a response to an application that the concern is that the combination of certain demographic and geographic variables with claims data allows for an "unacceptable" risk of loss of confidentiality. Assuming this is true, this makes me think the older the population, the higher the risk of losing confidentiality (e.g. older waves of the HRS).

The private sector has a thousand times as much data as the government, and growing, which you can use only by providing your query and lots of cash - these companies spend money collecting data and pay for it by charging to use the data. They act as the intermediaries between competitors who have common interest in pooling industry data, say lobbying, as well as in working with complementary businesses.

I'm sure behavioral economists are working with this data, but my guess is they view their results as proprietary and don't publish. I imagine some of their clients would love to see their applied theory published so they had more consultants to use at a lower price.

This is exactly the sort of reason private data sets should be frowned upon. I knew of a company in about the same business @mulp describes. They used to selectively release some large, quite meaty datasets to a researcher at a university. He then published papers from the data that were somewhat interesting, but the data itself was never released.

It was quid pro quo. The company got essentially free advertising to attract paying clients. The researcher got some sensational results. Nobody could ever independently verify the findings (no public data set), but I doubt either the researcher or the firm cared.

Overall I think science doesn't gain much from these devil's bargains. There is some benefit, but the costs outweigh it. I think a firm stance of no publication without open data would not hurt our understanding much.

I think number 3 is silly. "Simple" techniques require a lot of assumptions. Is it easy to explain to a lay person what the Inverse Mills Ratio is, why unobserved heterogeneity can cause bias in a duration model, much less how you correct it? Not at all. Does that mean you should ignore the problem and just do something simple and wrong instead? I hope not!
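To be fair, the quantity itself is trivial to compute; it's the *why* that's hard to explain to a lay person. A quick sketch of the inverse Mills ratio, phi(z)/Phi(z), the correction term that shows up in Heckman's two-step selection estimator:

```python
from math import erf, exp, pi, sqrt

def inverse_mills(z: float) -> float:
    """phi(z) / Phi(z): standard normal pdf over cdf, the correction
    term in Heckman's two-step selection estimator."""
    pdf = exp(-z * z / 2) / sqrt(2 * pi)
    cdf = 0.5 * (1 + erf(z / sqrt(2)))
    return pdf / cdf

# At z = 0 the ratio is pdf(0)/0.5 = 0.3989/0.5, about 0.798.
print(inverse_mills(0.0))
```

Three lines of arithmetic; the hard part is explaining why including this ratio as a regressor corrects for non-random selection into the sample.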

Regarding correlation and causation, I think the problem most often overlooked is that a lack of correlation does not imply a lack of causation. You can have causation without correlation.
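A textbook illustration of that last point, using simulated data: make y completely determined by x, but symmetrically, and the linear correlation comes out essentially zero.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)   # symmetric around zero
y = x ** 2                     # y is entirely caused by x

r = np.corrcoef(x, y)[0, 1]
print(f"correlation = {r:.3f}")
# r is near zero even though y is a deterministic function of x:
# linear correlation misses the nonlinear causal link.
```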

I think #3 is getting a bum rap. It's not saying that academics shouldn't write papers about the Inverse Mills Ratio. It's saying that if you, as a reader, come across a paper that supports your priors and which relies on the Inverse Mills Ratio, which you don't understand, you can't use that paper to walk around trumpeting that you've been proven right. Because you really don't understand what the heck he's talking about.

Bear in mind that 5% is not holy scripture but wholly arbitrary, chosen by custom, rather than as a guide to whether, in the particular situation under study, there is an effect worth considering.

What about separating studies of the physical world versus studies that rely on "human" data (i.e. studies measuring temperature changes vs. changes in GDP)?

I set a much higher bar for quantitative studies of human structures, since the data can become flawed for many reasons.

Many good points and links here. The one piece I think is missing is that the world sometimes changes in big, big ways that statistical models are ill-equipped to anticipate. For example: black and grey swans, and supercycles of decades or longer such as the current debt supercycle.

Consider that the macro events that have dwarfed everything else recently – the mortgage debt and housing price cycles – weren’t flagged by econometric modelers anywhere (to my knowledge). And yet, these cycles were so far reaching that they rendered useless many statistical results estimated in the rising debt leg of the cycle.

This is where it’s important to take the broadest possible view and use your brain instead of your econometrics software, notwithstanding Stevenson’s and Wolfers’s point about our flawed intuition.

I agree that we all get things wrong at times or even most of the time, but I’ll suggest that the ones who get the most important things right are looking at the whole forest, not stuck in trees contemplating their p values.

Many times the "meta" issues of model selection are most important. In subjects with political implications, there is often a search for the right model to support the desired conclusion, or the results of equally valid models with different conclusions go unreported.

"3. Be wary of scholars using high-powered statistical techniques as a bludgeon to silence critics who are not specialists. If the author can’t explain what they’re doing in terms you can understand, then you shouldn’t be convinced."

This is exactly the trick Steven D. Levitt pulled when he and I debated his popular abortion-cut-crime theory in Slate in 1999. I pointed out that if he had done a simple reality check of looking at 14-17 year old homicide offending rates year by year, he would have seen that the first cohort born after legalized abortion had homicide rates triple those of the last cohort born before legalization. His response was, in effect: well, I did a complex study of all 50 states and you just looked in a simple fashion at national data, so I win.

And, hey, it worked great for him, and he rode it to becoming a celebrity six years later. Of course, six months after "Freakonomics" hit the bestseller charts, Christopher Foote and Christopher Goetz demonstrated that Levitt had messed up his statistical programming, which was why his state-level analysis couldn't be reconciled with the national-level analysis. But even that didn't hurt the Freakonomics brand much.
