Big Data+Small Bias

Among experts it’s well understood that “big data” doesn’t solve problems of bias. But how much should one trust an estimate from a big but possibly biased data set compared to a much smaller random sample? In “Statistical paradises and paradoxes in big data,” Xiao-Li Meng provides some answers that are shocking, even to experts.

Meng gives the following example. Suppose you want to estimate who will win the 2016 US Presidential election. You ask 2.3 million potential voters whether they are likely to vote for Trump or not. The sample is in all ways demographically representative of the US voting population, but potential Trump voters are a tiny bit less likely to answer the question, just .001 less likely (note that they don’t lie; they just don’t answer).

You also have a random sample of voters. Here “random” doesn’t simply mean chosen at random (the 2.3 million are also chosen at random); it means that Trump voters are as likely to answer as other voters. Your random sample is of size n.

How big does n have to be for you to prefer (in the sense of having a smaller mean squared error) the random sample to the 2.3 million “big data” sample? Stop. Take a guess….

The answer is…here. Which is to say that your 2.3 million “big data” sample is no better than a random sample of that number minus 1!

On the one hand, this illustrates the tremendous value of a random sample; on the other, it shows how difficult it is in the social sciences to produce a truly random sample.
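
To get a feel for the orders of magnitude, here is a rough back-of-the-envelope sketch in Python. It is not Meng’s calculation. It assumes (all hypothetical) that true support is split 50/50, that non-Trump voters answer with probability 0.010 and Trump voters with probability 0.009 (one possible reading of “.001 less likely”), and that the big sample is so large that its error is essentially all bias. The function names are mine.

```python
def respondent_share(theta, r_trump, r_other):
    """Expected share of Trump supporters among those who answer.

    theta   : true share of Trump supporters in the population
    r_trump : probability that a Trump supporter answers (hypothetical)
    r_other : probability that anyone else answers (hypothetical)
    """
    trump = theta * r_trump
    other = (1 - theta) * r_other
    return trump / (trump + other)


def break_even_srs_size(theta, r_trump, r_other):
    """Size of a simple random sample whose variance, theta*(1-theta)/n,
    equals the squared bias of the big sample.

    Ignores the big sample's own sampling variance and any
    finite-population correction, so treat the result as approximate.
    """
    bias = respondent_share(theta, r_trump, r_other) - theta
    return theta * (1 - theta) / bias ** 2


# Hypothetical inputs, chosen only for illustration.
theta = 0.5
r_other = 0.010
r_trump = 0.009

n_break_even = break_even_srs_size(theta, r_trump, r_other)
print(round(n_break_even))
```

Under these assumptions the break-even size comes out at about 361: a few hundred respondents, not a few million.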

Meng goes on to show that the mathematics of random sampling fools us because it seems to deliver so much from so little. The logic of random sampling implies that you only need a small sample to learn a lot about a big population, and if the population is much bigger you only need a slightly larger sample. For example, you only need a slightly larger random sample to learn about the Chinese population than about the US population. When the sample is biased, however, not only do you need a much larger sample, you need it to be large relative to the total population. A sample of 2.3 million sounds big, but it isn’t big relative to the US population, which is what matters in the presence of bias.
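
The contrast can be sketched in a few lines of Python. As I understand Meng’s result, a biased sample of size n from a population of size N is worth roughly f/(1-f) × 1/ρ² random draws, where f = n/N and ρ is the correlation between answering and the answer itself. The population figures below are rough round numbers and ρ = 0.005 is a hypothetical value picked for illustration; the function names are mine.

```python
def srs_margin_of_error(n, N, p=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample of
    size n from a population of size N, with finite-population correction."""
    fpc = (N - n) / (N - 1)
    return z * (p * (1 - p) / n * fpc) ** 0.5


def effective_sample_size(n, N, rho):
    """Approximate size of the simple random sample that matches the mean
    squared error of a biased sample of size n, given the correlation rho
    between responding and the response."""
    f = n / N  # fraction of the population that was sampled
    return f / (1 - f) / rho ** 2


N_US = 330_000_000       # rough, illustrative population figure
N_CHINA = 1_400_000_000  # rough, illustrative population figure

# Random sampling: population size barely matters.
moe_us = srs_margin_of_error(1_000, N_US)
moe_china = srs_margin_of_error(1_000, N_CHINA)

# Biased sampling: the same 2.3 million is worth far less in a bigger population.
neff_us = effective_sample_size(2_300_000, N_US, rho=0.005)
neff_china = effective_sample_size(2_300_000, N_CHINA, rho=0.005)

print(moe_us, moe_china)
print(round(neff_us), round(neff_china))
```

With these made-up inputs, a random sample of 1,000 gives essentially the same margin of error in both countries, while the biased 2.3 million is worth roughly 280 random respondents in the smaller population and only about 66 in the larger one.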

A more positive way of thinking about this, at least for economists, is that what is truly valuable about big data is that there are many more opportunities to find random “natural experiments” within the data. If we have a sample of 2.3 million, for example, we can throw out huge amounts of data using an instrumental variable and still have a much better estimate than from a simple OLS regression.
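
As a toy illustration of that last point, here is a small simulation in Python. Everything in it is made up: an unobserved confounder u biases the simple regression of y on x, while an instrument z (assumed to move x but to be unrelated to u) recovers the true effect even though it uses only the part of the variation in x that comes from z. This is a sketch of the general idea, not anything from Meng’s paper.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def simulate(n, beta=1.0, seed=0):
    """Generate (z, x, y) where the true effect of x on y is beta.

    u is an unobserved confounder that affects both x and y;
    z is an instrument that affects x only.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    y = [beta * xi + 2 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
    return z, x, y


z, x, y = simulate(20_000)

# OLS slope: uses all the variation in x, including the confounded part.
ols = cov(x, y) / cov(x, x)

# IV (Wald) estimate: uses only the variation in x that is driven by z.
iv = cov(z, y) / cov(z, x)

print(ols, iv)
```

In this setup only about a third of the variance in x comes from the instrument, so the IV estimate discards most of the variation, yet it lands near the true effect of 1 while OLS is pulled well above it.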