Fabio Rojas on Twitter as an electoral predictor

Posted on August 12, 2013 at 9:49 am in Data Source, Political Science, Web/Tech

It turns out that what people say on Twitter or Facebook is a very good indicator of how they will vote.

How good? In a paper to be presented Monday, co-authors Joseph DiGrazia, Karissa McKelvey, Johan Bollen and I show that Twitter discussions are an unusually good predictor of U.S. House elections. Using a massive archive of billions of randomly sampled tweets stored at Indiana University, we extracted 542,969 tweets that mention a Democratic or Republican candidate for Congress in 2010. For each congressional district, we computed the percentage of tweets that mentioned these candidates. We found a strong correlation between a candidate’s “tweet share” and the final two-party vote share, especially when we account for a district’s economic, racial and gender profile. In the 2010 data, our Twitter data predicted the winner in 404 out of 406 competitive races.
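To make the "tweet share" idea concrete, here is a minimal sketch of the calculation the excerpt describes: one party's share of candidate mentions per district, compared with its two-party vote share. The district labels and counts are invented for illustration and are not from the paper, tracking the Democratic share is an arbitrary choice, and the paper's actual model also adds district controls that are omitted here.

```python
from math import sqrt

# Hypothetical toy counts, invented for illustration; NOT data from the paper.
# district: (dem_tweets, rep_tweets, dem_votes, rep_votes)
districts = {
    "A": (120, 80, 52000, 48000),
    "B": (30, 90, 40000, 60000),
    "C": (200, 50, 70000, 30000),
    "D": (60, 60, 45000, 55000),
}


def share(first, second):
    """Fraction of the two-way total that belongs to the first count."""
    return first / (first + second)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Democratic share of tweets and of the two-party vote, district by district.
tweet_shares = [share(dt, rt) for dt, rt, _, _ in districts.values()]
vote_shares = [share(dv, rv) for _, _, dv, rv in districts.values()]

r = pearson(tweet_shares, vote_shares)
print(f"correlation between tweet share and vote share: {r:.2f}")
```

A high correlation on four made-up districts proves nothing, of course; the point is only to show what is being measured.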

There is more here.

1 dan1111 August 12, 2013 at 10:34 am

Is it just me, or is 404 out of 406 a suspiciously high level of successful “prediction”? I’m willing to bet that their algorithm wouldn’t be nearly as good when applied to the actual future.

2 Rahul August 12, 2013 at 12:38 pm

+1 Sounds like over-fitting to me. Crucial phrase is “especially when we account for a district’s economic, racial and gender profile.”

Give a big-data guy enough degrees of freedom and he’ll get perfect post hoc predictive power for just about any phenomenon.
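A rough sketch of that worry, on invented numbers that have nothing to do with the election data: a model with as many free parameters as data points (here an interpolating polynomial, chosen purely for illustration) fits the past perfectly and then falls apart on the next observation, while a plain straight-line fit does not.

```python
# Hypothetical toy data: y is roughly x, plus alternating noise of +1 / -1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 0.0, 3.0, 2.0, 5.0, 4.0]
test_x, test_y = 6.0, 7.0  # held-out point that follows the same pattern


def lagrange_predict(xs, ys, x):
    """Evaluate the polynomial that passes exactly through every (xs, ys) point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def linear_fit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Six parameters for six points: the in-sample fit is perfect by construction.
in_sample_error = max(
    abs(lagrange_predict(train_x, train_y, x) - y) for x, y in zip(train_x, train_y)
)

# Compare both models on the single held-out point.
slope, intercept = linear_fit(train_x, train_y)
flexible_test_error = abs(lagrange_predict(train_x, train_y, test_x) - test_y)
simple_test_error = abs(slope * test_x + intercept - test_y)

print(f"flexible model, in-sample error: {in_sample_error:.3f}")
print(f"flexible model, held-out error:  {flexible_test_error:.1f}")
print(f"straight line, held-out error:   {simple_test_error:.1f}")
```

Whether the paper's controls amount to this kind of over-fitting is exactly what a held-out election would settle.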

3 bluto August 12, 2013 at 11:13 am

Of the 406 House races, how many were competitive in 2010? Missing 2 of 406 sounds pretty good, but what if the competitive races only numbered in the teens?

4 Enrique August 12, 2013 at 12:55 pm

Agreed … it is well-known that incumbents win most of the time in Congress … what matters are those elections that are contested or in which there is no incumbent running for re-election
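One way to frame the comparison being asked for, as a sketch on made-up races (the tuples below are hypothetical, not 2010 results): score the tweet-based pick against an "incumbent always wins" baseline, and separately on open seats where no such baseline exists.

```python
# Hypothetical toy races, invented for illustration; NOT the 2010 results.
# Each race: (has_incumbent, incumbent_party, tweet_leader, winner)
races = [
    (True, "D", "D", "D"),
    (True, "R", "R", "R"),
    (True, "D", "D", "D"),
    (True, "R", "D", "R"),
    (True, "D", "D", "R"),
    (False, None, "R", "R"),
    (False, None, "D", "R"),
]


def accuracy(predictions, outcomes):
    """Share of races where the predicted party matches the actual winner."""
    hits = sum(p == o for p, o in zip(predictions, outcomes))
    return hits / len(outcomes)


with_incumbent = [race for race in races if race[0]]
open_seats = [race for race in races if not race[0]]

# Baseline: always pick the incumbent's party.
baseline_acc = accuracy(
    [race[1] for race in with_incumbent], [race[3] for race in with_incumbent]
)
# Tweet-based pick: whichever party leads in tweet mentions.
tweet_acc = accuracy(
    [race[2] for race in with_incumbent], [race[3] for race in with_incumbent]
)
tweet_acc_open = accuracy(
    [race[2] for race in open_seats], [race[3] for race in open_seats]
)

print(f"incumbent baseline: {baseline_acc:.0%}")
print(f"tweet leader, incumbent races: {tweet_acc:.0%}")
print(f"tweet leader, open seats: {tweet_acc_open:.0%}")
```

A headline hit rate only means something once it is set beside the baseline and broken out for the races that were actually in doubt.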

5 Kelvin August 13, 2013 at 2:48 pm

They say they’ve controlled for “incumbency, district partisanship, media coverage of the race, time, and demographic variables” but the proof of the pudding, etc.


6 asdfasdfasdf August 12, 2013 at 11:37 am

Where’s that ROC/PR curve?

7 Overfitting August 12, 2013 at 12:33 pm


8 Tim Worstall August 12, 2013 at 12:47 pm

I predict Goodhart’s Law in 3,2,1…..

There will be at least some people who reason that if more Twitter mentions indicate the winner, then getting more Twitter mentions will make them win.

9 aaron August 15, 2013 at 6:48 am

Is there a question of correlation and causation?

Perhaps manipulation of these media influences voting rather than representing people’s preferences.

