What is the point of replication?

No experiment can ever be replicated exactly, so each attempted replication must assume that the things which differ don’t matter. The more things we can plausibly assume don’t matter, and the more important those things are, the stronger the original study. Chemistry students have done the same experiments for hundreds of years, and that’s useful because we can plausibly assume that who conducts the experiment and when doesn’t matter. The recent brouhaha between Nosek et al. and Gilbert et al. illustrates a weaker case.

In their critique of Nosek et al., Gilbert et al. say that some of Nosek et al.’s replications failed because the replication differed from the original study. For example:

An original study that asked Israelis to imagine the consequences of military service was replicated by asking Americans to imagine the consequences of a honeymoon

Now that sounds like two very different studies, but Nosek provides important context. The original study in question wasn’t about military service or honeymoons; it was about the conditions that promote reconciliation between victims and injurers after an injustice has been committed. The original study asked Israelis what they would do and how they would feel about a specific injustice. Namely, you and a co-worker have been working on a project for a long time, but just before submission you are called away for reserve duty [male]/maternity leave [female]. Your co-worker takes credit for all the work and gets promoted, while you later get demoted. The study then went on to ask questions about the conditions necessary for reconciliation. The reserve duty/maternity leave bit was just the story element needed to explain the situation, not the focus of the study.

Nosek et al. tried to replicate the study in the United States, where being called up for reserve duty is less common than in Israel and where being demoted for taking maternity leave could raise legal issues, so they substituted ‘had to leave for honeymoon’. Everything else was the same. One of the original authors approved the new design.

Nosek et al. were not able to replicate the original findings. Is this because they didn’t replicate the study or because the study failed to replicate? Gilbert et al. say Nosek et al. failed to replicate the study.

In my view, Gilbert et al. are caught on the horns of a dilemma. If the studies don’t replicate, they aren’t interesting, and if the studies replicate but only under extremely precise conditions, they also aren’t interesting. We are interested in general features of the human condition, not in descriptions of the choices that 75 female and 19 male Israeli students made at a particular point in time. Moreover, if changes in wording matter, then surely so does the fact that the original study was on Israelis in 2008 and the replication used Americans in 2013 (a lot has changed over these years!), and so must a hundred other differences. But if so, what’s the point?

Hat tip: Andrew Gelman, who has more to say.