One common response to yesterday’s post, What is the Probability of a Nuclear War?, was to claim that probability cannot be assigned to “unique” events. That’s an odd response. Do such respondents really believe that the probability of a nuclear war was not higher during the Cuban Missile Crisis than immediately afterwards when a hotline was established and the Partial Nuclear Test Ban Treaty signed?
Claiming that probability cannot be assigned to unique events seems more like an excuse to ignore best estimates than a credible epistemic position. Moreover, the claim that probability cannot be assigned to “unique” events is testable, as Philip Tetlock points out in an excellent 80,000 Hours Podcast with Robert Wiblin.
I mean, you take that objection, which you hear repeatedly from extremely smart people that these events are unique and you can’t put probabilities on them, you take that objection and you say, “Okay, let’s take all the events that the smart people say are unique and let’s put them in a set and let’s call that set allegedly unique events. Now let’s see if people can make forecasts within that set of allegedly unique events and if they can, if they can make meaningful probability judgments of these allegedly unique events, maybe the allegedly unique events aren’t so unique after all, maybe there is some recurrence component.” And that is indeed the finding that when you take the set of allegedly unique events, hundreds of allegedly unique events, you find that the best forecasters make pretty well calibrated forecasts fairly reliably over time and don’t regress too much toward the mean.
In other words, since an allegedly unique event either happens or it doesn’t, it is difficult to claim that any single probability estimate was better than another. But when we look at many forecasts, each of an allegedly unique event, we find that some people get more of them right than others. Moreover, the individuals who get more events right approach these questions using a set of techniques and tools that can be replicated and used to improve other forecasters. Here’s a summary from Mellers, Tetlock, Baker, Friedman and Zeckhauser:
In recent years, IARPA (the Intelligence Advanced Research Project Activity), the research wing of the U.S. Intelligence Community, has attempted to learn how to better predict the likelihoods of unique events. From 2011 to 2015, IARPA sponsored a project called ACE, comprising four massive geopolitical forecasting tournaments conducted over the span of four years. The goal of ACE was to discover the best possible ways of eliciting beliefs from crowds and optimally aggregating them. Questions ranged from pandemics and global leadership changes to international negotiations and economic shifts. An example question, released on September 9, 2011, asked, “Who will be inaugurated as President of Russia in 2012?”…The Good Judgment Project studied over a million forecasts provided by thousands of volunteers who attached numerical probabilities to such events (Mellers, Ungar, Baron, Ramos, Gurcay, et al., 2014; Tetlock, Mellers, Rohrbaugh, & Chen, 2014).
In the ACE tournaments, IARPA defined predictive success using a metric called the Brier scoring rule (the squared deviation between forecasts and outcomes, where outcomes are 0 and 1 for the non-occurrence and occurrence of events, respectively; Brier, 1950). Consider the question, “Will Bashar al-Assad be ousted from Syria’s presidency by the end of 2016?” Outcomes were binary; Assad either stays or he is ousted. Suppose a forecaster predicts that Assad has a 60% chance of staying and a 40% chance of being ousted. If, at the end of 2016, Assad remains in power, the participant’s Brier score would be [(1-.60)^2 + (0-.40)^2] = 0.16. If Assad is ousted, the forecaster’s score is [(0 -.60)^2 + (1 -.40)^2] = 0.36. With Brier scores, lower values are better, and zero is a perfect score.
…The Good Judgment Project won the ACE tournaments by a wide margin each year by being faster than the competition at finding ways to push probabilities toward 0 for things that did not happen and toward 1 for things that did happen. Five drivers of accuracy accounted for Good Judgment’s success. They were identifying, training, teaming, and tracking good forecasters, as well as optimally aggregating predictions. (Mellers, et al., 2014; Mellers, Mellers, Stone, Atanasov, Rohrbaugh, Metz, et al., 2015a; Mellers, Stone, Murray, Minster, Rohrbaugh, et al., 2015b).
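For readers who want to check the Brier arithmetic in the Assad example themselves, here is a minimal Python sketch. The function name `brier_score` and the (stays, ousted) ordering are my own choices for illustration, not anything taken from the ACE tournaments, and the 90% forecaster is a hypothetical added only to show the effect of pushing probabilities toward 0 and 1.

```python
def brier_score(forecast, outcome):
    """Sum of squared deviations between forecast probabilities and 0/1 outcomes.

    `forecast` and `outcome` are equal-length sequences, one entry per
    possible answer to the question. Lower is better; 0 is perfect.
    """
    if len(forecast) != len(outcome):
        raise ValueError("forecast and outcome must have the same length")
    return sum((o - f) ** 2 for f, o in zip(forecast, outcome))


# Order of entries (an assumption for this sketch): (Assad stays, Assad ousted)
forecast = [0.60, 0.40]

score_if_stays = brier_score(forecast, [1, 0])   # (1-.60)^2 + (0-.40)^2
score_if_ousted = brier_score(forecast, [0, 1])  # (0-.60)^2 + (1-.40)^2

# Hypothetical, more confident forecaster who also leaned toward "stays"
confident_if_stays = brier_score([0.90, 0.10], [1, 0])

print(round(score_if_stays, 2))      # 0.32
print(round(score_if_ousted, 2))     # 0.72
print(round(confident_if_stays, 2))  # 0.02
```

Evaluated as written, the two-term expressions come to 0.32 and 0.72. The quoted 0.16 and 0.36 are the values of a single squared term, (1-.60)^2 and (0-.60)^2, which is the one-term form of the score often used for binary questions. Either way the ranking is the same: the forecaster who put more probability on what actually happened gets the lower score.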