Big Data+Small Bias << Small Data+Zero Bias

Among experts it’s well understood that “big data” doesn’t solve problems of bias. But how much should one trust an estimate from a big but possibly biased data set compared to a much smaller random sample? In Statistical paradises and paradoxes in big data, Xiao-Li Meng provides some answers which are shocking, even to experts.

Meng gives the following example. Suppose you want to estimate who will win the 2016 US Presidential election. You ask 2.3 million potential voters whether they are likely to vote for Trump or not. The sample is in all ways demographically representative of the US voting population, but potential Trump voters are a tiny bit less likely to answer the question, just .001 less likely (note that they don't lie, they simply don't answer).

You also have a random sample of voters, where random doesn't simply mean chosen at random (the 2.3 million were also chosen at random) but random in the sense that Trump voters are as likely to answer as other voters. Your random sample is of size n.

How big does n have to be for you to prefer (in the sense of having a smaller mean squared error) the random sample to the 2.3 million “big data” sample? Stop. Take a guess….

The answer is about 400. Which is to say that your 2.3 million "big data" sample is no better than a random sample of roughly 400 voters!

On the one hand, this illustrates the tremendous value of a random sample; on the other, it shows how difficult it is in the social sciences to produce a truly random sample.

Meng goes on to show that the mathematics of random sampling fools us because it seems to deliver so much from so little. The logic of random sampling implies that you only need a small sample to learn a lot about a big population, and if the population is much bigger you only need a slightly larger sample. For example, you need only a slightly larger random sample to learn about the Chinese population than about the US population. When the sample is biased, however, not only do you need a much larger sample, you need it to be large relative to the total population. A sample of 2.3 million sounds big, but it isn't big relative to the US population, which is what matters in the presence of bias.
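
For readers who want to check the arithmetic, here is a minimal sketch in Python based on Meng's identity, which writes the error of a non-random sample mean as rho * sqrt((1 - f)/f) * sigma, where f = n/N is the fraction of the population captured and rho is the "data defect correlation." The population size (roughly 231 million eligible voters) and the values of rho below are illustrative assumptions, not numbers taken from the post; the comments below discuss what value of rho the paper actually uses.

    # A minimal sketch, not the paper's own code. Equating Meng's error,
    # rho * sqrt((1 - f) / f) * sigma, with the standard error sigma / sqrt(n_eff)
    # of a simple random sample gives an effective sample size
    # n_eff = f / ((1 - f) * rho**2).
    N = 231_000_000          # assumed 2016 US voting-eligible population
    n = 2_300_000            # the "big data" sample
    f = n / N                # fraction of the population covered (~1%)

    for rho in (0.005, 0.001):
        n_eff = f / ((1 - f) * rho ** 2)
        print(f"rho = {rho}: effective sample size of roughly {n_eff:,.0f}")

    # rho = 0.005 (the value commenters below attribute to the paper) gives ~400;
    # rho = 0.001 gives ~10,000 -- either way, a tiny fraction of 2.3 million.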

A more positive way of thinking about this, at least for economists, is that what is truly valuable about big data is that there are many more opportunities to find random “natural experiments” within the data. If we have a sample of 2.3 million, for example, we can throw out huge amounts of data using an instrumental variable and still have a much better estimate than from a simple OLS regression.

Comments

Is Data, Big or otherwise, the real bottleneck to better decisions, or is it political will to challenge America's plutocratic masters?
Two countries, two systems: As we all know, in Flint, the largest city and seat of Genesee County, Michigan, lots of families were forced to drink lead-laden water because they are poor. Meanwhile, as soon as it was revealed that a popular Brazilian beer brand was poisoned, the Federal government at once banned the poisoned beverage and closed the plant down.

The two biggest problems in using polling data to determine the outcome are too small a sample size and questions intended to return a specific response.

That is true.

This is basically a meaningless example. It is a fundamental problem of random sampling that it cannot possibly catch every single element, a problem compensated by the fact that random sampling requires considerably less effort.

Thus, we hold elections, which definitively avoid the problem of sample size, while incurring the costs associated with noting every voter's preference.

Trump is president because the election was not a random sample.

But he was elected by a biased sampling method.

He got 63 million votes. Nearly 74 million voted for anybody but Trump.

However, the sampling effectively discarded millions of California votes for Clinton.

This result is not that surprising, because it's measuring the ability to measure an average, which is about the simplest thing you can do with data. Small random samples are already very good at this.
Big data isn't about being able to estimate simple stuff more accurately, it's about being able to capture much more complex nonlinear relationships like the relationship between an image file and the names of the stuff in the image.

"it's measuring the ability to measure an average, which is about the simplest thing you can do with data."

To do that accurately, we would need to measure the ability to measure the ability to measure an average. That, however, is impossible unless we beforehand measure the ability to measure the ability to measure the ability to measure an average. And so on. It would be a Sisyphean task if ever there was one.

I have never thought about that.

And clearly I cannot drink the wine in front of you

> It would be a Sisyphean task if ever there was one.

On the other hand there is the Zeno paradox of using the same logic to argue that Achilles could not outrun the tortoise, https://en.wikipedia.org/wiki/Zeno%27s_paradoxes#Achilles_and_the_tortoise

Blindly asserting only one side of the truth will get you nowhere. If the averages of successive large samples converge, then there could be a practical answer or a 95% confidence interval.

Agreed. Maybe this is really a story about people trying to sprinkle some big data credibility on standard analysis that doesn’t need it.

The other problem with big data is people thinking they can skip understanding the problem first. Maybe you can if you have Google-size data, but most datasets are too small for that type of analysis.

My immediate intuition was that this result was way off, and indeed in the example from the paper Trump voters are less likely to respond at a rate of -.005, not -.001. For -.001 you would need a much larger random sample.

Indeed. And even then, it seems curious. Using just the standard formulas of variance for a binomial distribution, sqrt(p*(1-p)/n), the RMSE of a poll with 400 respondents on a ~50% outcome is ~2.5 percentage points. For a very large sample (i.e. millions) sampling variance becomes second-order, so the RMSE is driven by biased sampling. I'm not entirely sure what r = -0.005 means - the article calls it a "data defect correlation" and says "it is not directly estimable", but I suspect it *isn't* the same as them being probability r less likely to answer (relative to a response probability of 100%). Quite obviously, whatever r means, to get a RMSE of ~2.5%, it needs to bias the vote shares by a similar magnitude.

+1, r cannot mean what Alex says. Can someone explain this?

"If we have a sample of 2.3 million, for example, we can throw out huge amounts of data using an instrumental variable and still have a much better estimate than from a simple OLS regression."

I think Alex is over-optimistic. The odds of finding a valid and strong (note that a lot of applied econometrics papers go wrong by choosing weak instruments) instrumental variable are probably about as good as those of drawing a small but truly random sample of the US population. This is on top of the inevitable measurement error in the variables of interest.

I'd love to see more posts from Alex (who is optimistic and glowing about causal identification) engage with critics like Andrew Gelman (e.g. this recent post: https://statmodeling.stat.columbia.edu/2020/01/13/how-to-get-out-of-the-credulity-rut-regression-discontinuity-edition-getting-beyond-whack-a-mole/). Maybe Alex could have a paper of the week that he highlights and give reasons why we should or should not believe its results. My guess is that a few months of that exercise would lessen Alex's optimism about the general ability to identify things in large data sets using IV (or whatever flavor of identification you want). That's not to say that you can't learn anything using those techniques; it's more that it's much harder to find the appropriate situations in which to apply them than is widely thought.

Sidenote: I would love it if economists would start posting notebooks of their code with more data analysis than what is done in the papers. Forget literature review sections (save those for the paper); just lay out how the data are analyzed and why specific adjustments are made. I say to economists, "Publish biographies of your analyses rather than just their final form."

I'm not sure I understand the problem setup.

Let's say that 51% of the population favor R and 49% favor D. Let's say that 50% of the people who favor D respond to the survey. Does the premise of ".001 less likely to answer" mean that 49.9% of the people who favor R respond to the survey?

If that's what it means, then I calculate that 24.5% of people are D & responders while 25.449% of people are R & responders. So 50.95% of responders favor R. If 10,001 people respond to the poll (and about twice that many are asked), then there is a 97.13% chance that the poll result will have more people favoring R than D. If there was no difference in response rates, then we'd need 9,025 people responding to get the same chance.

This is a wildly different ratio of sample sizes than the numbers in the post. It's answering a different question, but I don't know where the vast difference is coming from.
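
For what it's worth, here is a small Python sketch of that calculation, using the same assumptions as the comment above (51/49 split, D response rate 50%, R response rate 49.9%, normal approximation):

    from math import sqrt
    from statistics import NormalDist

    p_R, p_D = 0.51, 0.49            # assumed true preferences
    resp_R, resp_D = 0.499, 0.500    # response rates: R voters 0.1 pp less likely

    # Share of respondents favoring R under the biased response rates (~50.95%)
    share_R = p_R * resp_R / (p_R * resp_R + p_D * resp_D)

    def prob_R_ahead(p, n):
        """P(sample proportion > 50%) with n respondents, normal approximation."""
        se = sqrt(p * (1 - p) / n)
        return 1 - NormalDist().cdf((0.5 - p) / se)

    print(round(share_R, 4))                        # ~0.5095
    print(round(prob_R_ahead(share_R, 10_001), 4))  # ~0.9713 with the biased responses
    print(round(prob_R_ahead(0.51, 9_025), 4))      # ~0.9713 with unbiased responses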

Ohhhhh, so no more links to articles that use surveys from Amazon Mechanical Turk. Strong bias toward the young and unemployed.

This is not false, but it is not news. From the internet:

"The presidential election of 1936 pitted Alfred Landon, the Republican governor of Kansas, against the incumbent President, Franklin D. Roosevelt. The year 1936 marked the end of the Great Depression, and economic issues such as unemployment and government spending were the dominant themes of the campaign. The Literary Digest was one of the most respected magazines of the time and had a history of accurately predicting the winners of presidential elections that dated back to 1916. For the 1936 election, the Literary Digest prediction was that Landon would get 57% of the vote against Roosevelt's 43% (these are the statistics that the poll measured). The actual results of the election were 62% for Roosevelt against 38% for Landon (these were the parameters the poll was trying to measure). The sampling error in the Literary Digest poll was a whopping 19%, the largest ever in a major public opinion poll. Practically all of the sampling error was the result of sample bias.

The irony of the situation was that the Literary Digest poll was also one of the largest and most expensive polls ever conducted, with a sample size of around 2.4 million people! At the same time the Literary Digest was making its fateful mistake, George Gallup was able to predict a victory for Roosevelt using a much smaller sample of about 50,000 people."

Almost exactly the same story, with similar numbers (a 2.4 million biased sample). Except this one actually happened, which makes it more striking. And this 1936 story is considered the beginning of modern polling, where serious pollsters try to find the most unbiased sample possible, even if it means throwing away a lot of data that was expensive to collect in the first place.

Also, the fact that in the example considered by the author a very small bias in a big sample would have been enough to make the poll predict the wrong winner is of course due to Trump's margin of victory being extremely thin in three key states, as everyone knows.

In short, I am always in favor of recalling mathematical truths, but it should be made clear that these ideas are well known to anyone working in polling.

So to be clear, I don't think this is "shocking, even to experts".

Joël, stop talking to yourself!

It’s ok to talk to yourself. Rayward has been doing it for years and Thiago holds multiple conversations with himself in this very thread.

He's done that at home, too. The family is bored to death with his pronunciamentos.

I am not sure I understand the math. For the standard error to be comparable to a bias of 0.001 you need a sample of approximately 1/0.001^2=10^6, which is not that different from 2.3 million and is a far cry from 400. Am I missing something?

I agree with your math. Alex appears to have used the wrong value (the paper uses r = 0.005, not r = 0.001). However, that only moderates the problem slightly. I don't think the meaning of r matches the explanation Alex gives (i.e. Trump supporters respond to the poll with probability 1-r). It seems to mean something else, which renders the whole exercise much less impressive. (See my detailed post below for further details)

What is three orders of magnitude between friends? :)

I wish Alex would correct his math though.

All of this is well understood, especially by the polling companies themselves, who desperately wanted you to believe Hillary was going to win in 2016 (as well as the journos who propagated this lie).

It's certainly not hard to skew your sample so that 55% of the people you poll are Dem voters. And then voila, Hillary Landslide Imminent. And of course when she loses, you just blame Russia Russia Russia.

The final paragraph seems to defeat the purpose of the complaint.

Big data is data. The technological innovation of "big" data is that it is possible to crunch more of it at low cost. In the past this was not even an option for most researchers. Maybe the government could do it. Or a supercomputing center. Now, anyone with a mid-priced workstation, or a cloud instance, can crunch as much as they want.

That's neat, but for some reason there has been a push-back that "that's bad!"

Any amount of data has to be understood in terms of noise and bias. Same as it always was. So what was this drama all about? Complaint from people who don't want to learn new tools?

Some people simply think it's fun to sit idly by and criticize people who are making a difference in the world.

"It's certainly not hard to skew your sample so that 55% of the people you poll are Dem voters."

So voters were skewed 51.1% Clinton, 48.9% Trump?

I was curious, so I went to see what researchers are building themselves these days:

"Here are the main hardware specs:

CPU — 2 Intel Xeon SP Gold 5217’s, 8 core / 16 thread each @ 3.0Ghz

RAM — 192GB DDR4–2933 MHz ECC

Storage — 1.0TB SSD M.2 PCIe Drive

GPU — 2 NVIDIA Quadro RTX 8000’s, 48GB VRAM each"

https://towardsdatascience.com/nvidias-new-data-science-workstation-a-review-and-benchmark-e451db600551

Neat, right?

You can get away with a graphics card in the sub 1k range. Depends on how big your model is and how big your data is.

48GB of VRAM is definitely where that card makes a difference.

If your model weighs a GB or two and your training data is a few hundred GB, your typical 8GB card is going to have to train on a lot of batches per epoch, and loading/unloading all that data to the card is a big bottleneck.

As it happens, I have been watching YouTube PC build videos lately. One pattern of construction is to use cheap Chinese motherboards and recycled Xeon processors. (It seems that as previous generations become power inefficient they are cycled out of server farms and end up on the secondary market.) So if anybody wanted to, it might be possible to build a sort of rat-rod big data workstation using cheap Chinese parts and second-hand Xeons.

AliExpress link

I suspect something is wrong with the interpretation/math here.

Using just the standard formulas of variance for a binomial distribution, sqrt(p*(1-p)/n), the RMSE of a poll with 400 respondents on a ~50% outcome is ~2.5 percentage points.

For a very large sample (i.e. millions) sampling variance becomes second-order, so the RMSE is driven by biased sampling. (To be precise, it is sqrt(bias^2 + p(1-p)/n).) If we interpret r as "instead of responding with probability 100%, Trump voters respond with probability (1-r)", the bias is ~0.125 percentage points, and the RMSE using n = 2.3 million is just under 1.3 percentage points. (Of course, if r is the reduction in response rates in percentage points, relative to a much lower base response rate, then anything goes. E.g. if Clinton supporters respond at rate 3% and Trump supporters at 2.5%, polls will be horribly biased. But it seems more useful to focus on the relative non-response rate, as this is the important quantity / better reflects the magnitude of the sampling bias).

Assuming my math is correct, this suggests either the article is wrong (presumably not) or r means something else. The article calls r a "data defect correlation" and says "it is not directly estimable". This sounds rather different than r being the excess non-response rate (relative to a response probability of 100%).

Quite obviously, whatever r means, to get a RMSE of ~2.5%, it needs to bias the vote shares by a similar magnitude. That isn't a small bias. There's a reason that reputable opinion polls tend to have samples of ~1-2k, rather than 400 (eg see here: https://www.realclearpolitics.com/epolls/other/president_trump_job_approval-6179.html). Adding sample size only helps with sampling variance not bias, and a single n = 400 poll is pretty noisy - hence we tend to see bigger polls that at least *might* (if unbiased) have a high chance of getting pretty close to the truth.

Sorry, typo above: the RMSE using n = 2.3 million is just under 0.13 percentage points (not 1.3). Aka, the RMSE is indeed almost all coming from the bias, not the sampling variation at that scale.
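
To make that arithmetic concrete, a small sketch under the same "Trump voters respond with probability 1 - r" reading (which, as the reply below notes, is probably not what the paper means by r):

    from math import sqrt

    def rmse(bias, p, n):
        """RMSE of a proportion estimate: sqrt(bias^2 + sampling variance)."""
        return sqrt(bias ** 2 + p * (1 - p) / n)

    p, r = 0.5, 0.005
    # If Trump voters respond with probability 1 - r and everyone else responds,
    # the expected respondent share falls slightly below the true 50%.
    biased_share = p * (1 - r) / (p * (1 - r) + (1 - p))
    bias = p - biased_share                 # ~0.00125, i.e. about 0.125 pp

    print(rmse(0.0, p, 400))                # ~0.025: unbiased random poll of 400
    print(rmse(bias, p, 2_300_000))         # ~0.0013: the biased "big data" sample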

r does indeed mean something else. The 0.001 number in the paper is p( respond to survey | not Trump voter ) - p( respond to survey | Trump voter ).

That is, strictly speaking, saying "potential Trump voters are a tiny bit less likely to answer the question". It is also very far from "instead of responding with probability 100%, Trump voters respond with probability (1-r)", which is how you and I and (I would guess) most of the other people reading this initially interpreted it. It's not even "potential Trump voters are 0.1% less likely to answer the question *if they are surveyed*".

Instead it's 0.1% of the 60 million people who voted for Trump. That's 60,000 people. If your sample of size 2.3 million is short by 60,000 Trump voters, the result will be off by about 2.5%.
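
A quick check of that arithmetic (the reading of the 0.001 figure here is the commenter's, not necessarily the paper's):

    trump_voters = 60_000_000
    missing = 0.001 * trump_voters    # 60,000 Trump voters who don't respond
    sample = 2_300_000
    print(missing / sample)           # ~0.026, i.e. roughly 2.5 percentage points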

I think I would summarize that as Big Data+Big Bias << Small Data+Zero Bias.

Really cool paper!

My favorite result is the author showing that it's not only the absolute number of observations you have that matters but also their relation to the population size (N). I've seen so many analyses that completely ignore N and make appeals to asymptotics that don't necessarily apply in the presence of bias (even when they acknowledge likely bias!).

I certainly don't always know the best way to incorporate N, but I'd love to see more papers giving applied examples about how to do it.

"Assume the data are unbiased" is rather like "assume a tin opener".

Unbiased data is much harder to obtain than "very slightly biased" data, some might even argue it's impossible.

It's well known in the med and psych fields that the sample size need not be big (i.e., millions, or even thousands) to be able to test a hypothesis. Which is to say, most of the people who routinely complain "sample size too small, results don't mean anything" usually don't know what they are talking about.

The big data + tiny bias approach has so far worked quite well. Even in this 2016 Trump example, the polls were pretty accurate when it came down to the share of the popular vote.

My issue with the recent "biased big data" narrative is that interested parties (especially those who influence policy) are using it to push back against uses of big data, and in the end we get the worst of both worlds, i.e., small data with even bigger bias.

I thought that the point of big data wasn't so much to do with sample sizes as with the number of data points per subject. For example, basketball coaching uses a gazillion data points to find and understand previously unseen dimensions of the game.
In the election example given, I would have thought that big data's main value is in analyzing a wide range of data on people (demographics, consumer habits, credit ratings, opinions on issues, etc.) to reduce potential biases in any single data point, and even to predict who might say they're voting for X but end up voting for Y.

I think it would be clearer if the blog post just talked about "large sample size" rather than "big data."

This didn't seem right, so I ran a simulation using the parameters described here. Indeed, with those parameters, the MSE for the big data sample is 30x lower than for the small data sample, and the big data sample is more accurate 84% of the time.

Intuitively, this is pretty obvious. With the 0.1% bias, assuming a 50% national Trump preference, you will find a 49.9% mean Trump preference.

A representative poll of 400 people is going to be unbiased, so the expected value is exactly 50%. But the margin of error is going to be way higher than 0.1%, so the big data sample is going to have smaller MSE in the vast majority of cases.

I doubt the paper is wrong, so probably they've structured the example to tilt the playing field toward the representative sample. Don't throw the baby out with the bathwater so fast!
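
For anyone who wants to rerun something like this, here is one way such a simulation might look. All parameters, and especially how the 0.1% bias enters, are illustrative assumptions (here it simply shifts the big sample's expected result to 49.9%); other readings of the bias give very different MSE ratios, which is what the rest of the thread is arguing about.

    import numpy as np

    rng = np.random.default_rng(0)

    p_true, bias = 0.5, 0.001        # assumed true share and bias of the big sample
    N_big, n_small, trials = 2_300_000, 400, 10_000

    # Big sample: drawn around a slightly biased share; small sample: unbiased
    big_est = rng.binomial(N_big, p_true - bias, trials) / N_big
    small_est = rng.binomial(n_small, p_true, trials) / n_small

    big_err = big_est - p_true
    small_err = small_est - p_true
    print("MSE, big sample  :", np.mean(big_err ** 2))
    print("MSE, small sample:", np.mean(small_err ** 2))
    print("share of trials where the big sample is closer:",
          np.mean(np.abs(big_err) < np.abs(small_err)))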

My only issue here is how do you know before the fact that your sample is really unbiased? This sounds easy, but it's not always so straightforward. There is always some mechanism that is used to access the respondent, whether that's a fixed line phone, a website, a mobile, face to face (and if so, where?), or whatever it is. And often these mechanisms introduce a source of bias despite our best efforts.

Suppose I run a survey, and I work hard to think of all possible causes of biases before the fact. I run the survey, I get an outcome, and I make a prediction over the population. But if an unforeseen factor has biased my survey, my conclusions are out, and I'm not going to know until later. I then muck up the prediction, and then repeat the process.

Each time my prediction is out, I'm discovering a new bias. Maybe. Or I'm just discovering the 5% of the time that my random sample accidentally included a few too many of one subset. I can change my survey methodology to respond to my newly identified bias, or I can keep it unchanged and argue that it was just a fluke. And now there's a bias/variance trade-off when I'm updating my methodology for the next time.

This framing is a bit disingenuous, though.

The key issue is that the 2.3M estimate will be inconsistent, so getting more and more people will just get it wrong in a more and more precise way. So as soon as the standard error of the random sample is smaller than the bias, the bigger sample is toast.

The only way for the bigger sample to recover is to have the whole population, so the bias disappears.
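
A rough way to see where that crossover happens for a proportion near 50% (illustrative only): the random sample wins once its standard error drops below the big sample's bias.

    p = 0.5
    for bias in (0.025, 0.005, 0.001):
        # The random sample's MSE beats the bias once p*(1-p)/n < bias^2
        n_star = p * (1 - p) / bias ** 2
        print(f"bias of {bias:.3f}: a random sample of about {n_star:,.0f} is enough")

Which bias is the right one to plug in is, of course, exactly what the rest of the thread is arguing about.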
