Everyone in development economics should read this paper

It is by Eva Vivalt and is called “How Much Can We Generalize from Impact Evaluations?” (pdf).  The abstract is here:

Impact evaluations aim to predict the future, but they are rooted in particular contexts and results may not generalize across settings. I founded an organization to systematically collect and synthesize impact evaluation results on a wide variety of interventions in development. These data allow me to answer this and other questions across a wide variety of interventions. I examine whether results predict each other and whether variance in results can be explained by program characteristics, such as who is implementing them, where they are being implemented, the scale of the program, and what methods are used. I find that when regressing an estimate on the hierarchical Bayesian meta-analysis result formed from all other studies on the same intervention-outcome combination, the result is significant with a coefficient of 0.6-0.7, though the R-squared is very low. The program implementer is the main source of heterogeneity in results, with government-implemented programs faring worse than and being poorly predicted by the smaller studies typically implemented by academic/NGO research teams, even controlling for sample size. I then turn to examine specification searching and publication bias, issues which could affect generalizability and are also important for research credibility. I demonstrate that these biases are quite small; nevertheless, to address them, I discuss a mathematical correction that could be applied, before showing that randomized controlled trials (RCTs) are less prone to this type of bias and exploiting them as a robustness check.
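The headline exercise, regressing each study's estimate on the pooled result from all other studies of the same intervention-outcome combination, can be sketched with simulated data. This is a hedged illustration, not Vivalt's actual method: it substitutes standard DerSimonian-Laird random-effects pooling for her hierarchical Bayesian model, and all numbers (20 combinations, 10 studies each, the effect and error sizes) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dl_pool(effects, variances):
    """DerSimonian-Laird random-effects pooled mean: a textbook
    meta-analysis estimator standing in for the paper's
    hierarchical Bayesian pooling."""
    w = 1.0 / variances
    fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed) ** 2)          # heterogeneity statistic
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)   # between-study variance
    w_re = 1.0 / (variances + tau2)
    return np.sum(w_re * effects) / np.sum(w_re)

# Simulate 20 intervention-outcome combinations, 10 studies each.
groups, per = 20, 10
mu = rng.normal(0.2, 0.3, groups)  # true mean effect per combination
ests, loos = [], []
for g in range(groups):
    theta = mu[g] + rng.normal(0, 0.15, per)  # cross-study heterogeneity
    se = rng.uniform(0.05, 0.2, per)
    est = theta + rng.normal(0, se)           # sampling error
    for i in range(per):
        # Leave-one-out pooled estimate from the other 9 studies.
        ests.append(est[i])
        loos.append(dl_pool(np.delete(est, i), np.delete(se**2, i)))

slope, intercept = np.polyfit(loos, ests, 1)
r2 = np.corrcoef(loos, ests)[0, 1] ** 2
print(f"slope = {slope:.2f}, R^2 = {r2:.2f}")
```

With cross-study heterogeneity in the simulation, the slope comes out below one and the R-squared well below one: other studies of the same intervention-outcome do predict a new estimate, but imperfectly, which is the generalizability point the abstract is making.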

Eva is on the job market from Berkeley this year; her home page is here. Here is her paper “Peacekeepers Help, Governments Hinder” (pdf). Here is her extended bio.


It is difficult to prove beyond a shred of doubt, using statistical analysis of aggregated population data, that a program caused specific outputs. But data on actual program outputs, plus knowledge of baselines and of how time use and economic strategies develop in the society and economy, can make it possible to produce what are likely to be better estimates with less theoretical statistical tools.

This is an interesting study. But I think it side-steps the real questions:

i) When do results apply beyond the sample studied (i.e., the set of people experimented on)?

To answer this you need to a) really understand the relevant domain (e.g., health systems in developing countries) and understand (and quantify) how the context influences the intervention, and b) understand and quantify the mechanisms involved in scaling up. Both a) and b) will vary for every intervention. Much greater focus could be placed on understanding the covariates and the treatment interaction terms (not just the lip service currently paid to them), and not in a reduced-form statistical sense (as done in this paper) but in terms of the underlying economics.

ii) How many studies have examined whether an intervention's result holds in the same context? Perhaps, for example, in the same region, but in an area not studied.
This would control for many of the other factors. Sometimes the results are just randomness. Sometimes they reflect the particular circumstances, at a particular time, in a particular place; in another period, the results could be quite different. So how do these results generalize over time?

iii) Aside from the relative irrelevance of scaling/equilibrium effects, studies implemented by NGOs/academics are correlated along another key dimension. The design, implementation, and interpretation of these RCTs are strongly influenced by a handful of economists (who reside in Cambridge, MA). Their program gets implemented by a "herd of independent minds" comprised of their students and other followers in academia and international organizations. There are a handful of original ideas that bounce around this echo chamber. Many involve some kind of cognitive bias (e.g., procrastination bias). In the last few years, social networks are the topic du jour. Consequently, it is not surprising that the academia/NGO RCT results are better aligned. This would be consistent with authors mucking around with design, implementation, or analysis to get results that conform to the priors (i.e., the intuitions and expectations) of the relevant journal referees and editors.

iv) Acemoglu and Deaton are obviously gifted economists, but they have been wrong about a lot of things when it comes to development (at a minimum, they are guilty of over-selling their own work and theories). But an argument that both have made, and that has received surprisingly little attention, is the one involving complementarities. At its core, development isn't about fixing targeted weaknesses. There are no silver bullets. Systems need to improve. Governments need to improve. The types of solutions that RCTs study are transient and superficial. Temporary fixes make donors (taxpayers, Gates, etc.) feel good. The real solutions are harder to implement and even harder to quantify. At best, and with considerable resources, an RCT can address 3 or 4 treatments at one time. In fact, in many places there are several dozen treatments that are needed (and that are in fact even going on, but willfully ignored in many RCT studies). Yes, I too wish the world were simple and uni-dimensional. It ain't.

Vivalt has much difficulty with plain English.

Her translated remarks read as:

Predictions can be wrong.

I know somebody who collects some other people's predictions -- I used their stuff to see which predictions were correct and perhaps why.

I think correct predictions are mostly about how people will act to get future results.

But it's still difficult to predict the future.



One thing one can learn is how to design programs so that an impact evaluation (ideally a cost-benefit analysis) can be built into every one.

I think it is increasingly common to advocate that a program is not ready to move forward until it has been designed in a way conducive to applying management techniques that make it possible to identify what appears to be working, what needs work, etc. This will add to management and data costs in the short run, but ask just about any accountant in any major firm in the world what they think of systems designed without management and accounting controls in mind, then apply that logic to softer goals like "improving local governance." You need to define indicators, which is more difficult when "$ in the bank" or "$ of operating capital" is not the variable of interest. Unlike in physics, the definition of meaningful variables can, and should, lead to different metrics in different places, which adds to the difficulty of designing every program in the "optimal" way for monitoring and evaluation.

Can you say Hawthorne effect? I knew you could.

Here is a great post by S. Galiani from Maryland on this issue:
