Program evaluation often involves generalizing internally valid site-specific estimates to a different target population. While many analyses have tested the assumptions required for internal validity (e.g., LaLonde 1986), there has been little empirical assessment of external validity in any context because identical treatments are rarely evaluated in multiple sites. This paper examines a remarkable series of 14 energy conservation field experiments run by a company called Opower, involving 550,000 households in different cities across the U.S. We first show that treatment effect heterogeneity across sites is both statistically and economically significant. We then show that Opower partners are selected on partner-level characteristics that are correlated with the treatment effect. This “partner selection bias” implies that replications with additional partners have not given an unbiased estimate of the distribution of treatment effects in non-partner sites. We augment these results with evidence from a different context, showing that partner microfinance institutions (MFIs) that carry out randomized experiments are selected on observable characteristics from the global pool of MFIs. Finally, we propose two simple, suggestive tests of external validity that can be used in the absence of data from many sites: comparison of observable sample and target site characteristics, and an F-test of heterogeneous treatment effects across “sub-sites” within a site.
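The second suggested test can be sketched as follows. This is a minimal, hypothetical implementation (not the paper's own code): it compares a restricted regression with a single pooled treatment effect against an unrestricted regression with a separate treatment effect per sub-site, and forms the standard F-statistic from the two residual sums of squares. The function name, the four sub-sites, and the simulated effect sizes are all illustrative assumptions.

```python
import numpy as np

def subsite_heterogeneity_ftest(y, treat, subsite):
    """F-test of equal treatment effects across sub-sites.

    Restricted model:   y on sub-site dummies plus one pooled treatment effect.
    Unrestricted model: y on sub-site dummies plus a separate effect per sub-site.
    Returns (F, q, df_resid), where q is the number of equality restrictions.
    """
    y = np.asarray(y, dtype=float)
    treat = np.asarray(treat, dtype=float)
    subsite = np.asarray(subsite)
    sites = np.unique(subsite)
    D = np.column_stack([(subsite == s).astype(float) for s in sites])
    X_r = np.column_stack([D, treat])               # pooled treatment effect
    X_u = np.column_stack([D, D * treat[:, None]])  # site-specific effects

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return resid @ resid

    q = len(sites) - 1                   # restrictions: S effects forced equal
    df_resid = len(y) - X_u.shape[1]
    F = ((rss(X_r) - rss(X_u)) / q) / (rss(X_u) / df_resid)
    return F, q, df_resid

# Simulated illustration (hypothetical data, not the Opower experiments):
rng = np.random.default_rng(0)
n = 2000
subsite = rng.integers(0, 4, size=n)            # 4 sub-sites
treat = rng.integers(0, 2, size=n).astype(float)
effects = np.array([0.0, 0.5, 1.0, 2.0])        # heterogeneous true effects
y_het = effects[subsite] * treat + rng.normal(size=n)
y_hom = 1.0 * treat + rng.normal(size=n)        # homogeneous true effect

F_het, q, df = subsite_heterogeneity_ftest(y_het, treat, subsite)
F_hom, _, _ = subsite_heterogeneity_ftest(y_hom, treat, subsite)
```

A large F-statistic (judged against an F distribution with q and df_resid degrees of freedom) is evidence of within-site heterogeneity, which in turn cautions against extrapolating the pooled estimate to a different target population.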