Fabio Rojas on Twitter as an electoral predictor

It turns out that what people say on Twitter or Facebook is a very good indicator of how they will vote.

How good? In a paper to be presented Monday, co-authors Joseph DiGrazia, Karissa McKelvey, Johan Bollen and I show that Twitter discussions are an unusually good predictor of U.S. House elections. Using a massive archive of billions of randomly sampled tweets stored at Indiana University, we extracted 542,969 tweets that mention a Democratic or Republican candidate for Congress in 2010. For each congressional district, we computed the percentage of tweets that mentioned these candidates. We found a strong correlation between a candidate’s “tweet share” and the final two-party vote share, especially when we account for a district’s economic, racial and gender profile. In the 2010 data, our Twitter data predicted the winner in 404 out of 406 competitive races.
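As a rough illustration of the "tweet share" idea described above, here is a minimal sketch. It assumes the share is computed for one party (Republican here) as a fraction of all candidate-mentioning tweets in a district, and it uses made-up toy data; the district labels, counts, vote shares, and the function names `tweet_share` and `pearson` are all hypothetical and not taken from the paper. It also leaves out the demographic controls the authors mention.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: one (district, party of the mentioned candidate) pair per tweet.
tweets = [
    ("IN-09", "R"), ("IN-09", "R"), ("IN-09", "D"),
    ("OH-01", "D"), ("OH-01", "R"), ("OH-01", "R"), ("OH-01", "R"),
    ("CA-11", "D"), ("CA-11", "D"), ("CA-11", "D"), ("CA-11", "R"),
]

# Hypothetical Republican share of the two-party vote in each district.
vote_share = {"IN-09": 0.55, "OH-01": 0.60, "CA-11": 0.40}


def tweet_share(tweets, party="R"):
    """Fraction of each district's candidate tweets that mention `party`."""
    total = Counter(district for district, _ in tweets)
    mentions = Counter(district for district, p in tweets if p == party)
    return {district: mentions[district] / total[district] for district in total}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


shares = tweet_share(tweets)
districts = sorted(shares)
r = pearson([shares[d] for d in districts], [vote_share[d] for d in districts])
print(shares)
print(f"correlation between tweet share and vote share: {r:.3f}")
```

With only three invented districts the correlation says nothing about real elections; the point is just to show what is being correlated with what.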

There is more here.


Is it just me, or is 404 out of 406 a suspiciously high level of successful "prediction"? I'm willing to bet that their algorithm wouldn't be nearly as good when applied to the actual future.

+1 Sounds like over-fitting to me. Crucial phrase is "especially when we account for a district’s economic, racial and gender profile."

Give a big-data guy enough degrees of freedom and he'll get perfect post hoc predictive power for just about any phenomenon.

Of the 406 House races, how many were competitive in 2010? Missing 2 of 406 sounds pretty good, but what if the competitive races only numbered in the teens?

Agreed ... it is well known that incumbents win most of the time in Congress ... what matters are those elections that are contested or in which no incumbent is running for re-election.
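The baseline point can be made concrete with a small sketch. It compares the hit rate of a tweet-based pick against the naive rule "the incumbent's party wins". The race records and the `accuracy` helper are hypothetical, invented only to show the comparison; they are not the 2010 data.

```python
# Hypothetical race records: incumbent's party, party leading in tweet share,
# and the party that actually won. Not real election data.
races = [
    {"district": "A", "incumbent": "R", "tweet_leader": "R", "winner": "R"},
    {"district": "B", "incumbent": "D", "tweet_leader": "D", "winner": "D"},
    {"district": "C", "incumbent": "D", "tweet_leader": "R", "winner": "R"},
    {"district": "D", "incumbent": "R", "tweet_leader": "D", "winner": "R"},
    {"district": "E", "incumbent": "D", "tweet_leader": "D", "winner": "D"},
]


def accuracy(races, predictor_key):
    """Share of races in which the given field matches the actual winner."""
    hits = sum(1 for race in races if race[predictor_key] == race["winner"])
    return hits / len(races)


tweet_accuracy = accuracy(races, "tweet_leader")
baseline_accuracy = accuracy(races, "incumbent")
print(f"tweet-share pick:   {tweet_accuracy:.0%}")
print(f"incumbent baseline: {baseline_accuracy:.0%}")
```

In this toy set both rules get 4 of 5 races right, which is the worry in a nutshell: a headline hit rate only means something once it is set against what "pick the incumbent" would have scored on the same races.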

They say they've controlled for "incumbency, district partisanship, media coverage of the race, time, and demographic variables" but the proof of the pudding, etc.


Where's that ROC/PR curve?


I predict Goodhart's Law in 3,2,1.....

There will be at least some people who reason that if more Twitter mentions indicate the winner, then getting more Twitter mentions will make them win.

Is there a question of correlation and causation?

Perhaps manipulation of these media influences voting rather than representing people's preferences.
