What’s the source of the problem? The Census Bureau purposely messes with the microdata a little, to protect the identity of each individual. For instance, if they recode a 37-year-old expat Aussie living in Philadelphia as a 36-year-old, then it’s harder for you to look me up in the microdata, which protects my privacy. In order to make sure the data still give accurate estimates, it is important that they also recode a 36-year-old with similar characteristics as being 37. This gives you the gist of some of their “disclosure avoidance procedures.” While it may all sound a bit odd, if these procedures are done properly, the data will yield accurate estimates, while also protecting my identity. So far, so good.
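To make the swapping idea concrete, here is a minimal sketch in Python. It is not the Census Bureau's actual procedure, which is not described here. The record fields, the matching rule (same sex and city, ages one year apart), and the function name `swap_adjacent_ages` are all assumptions made up for illustration.

```python
import random
from collections import Counter


def swap_adjacent_ages(records, rng):
    """Toy disclosure avoidance: swap ages between look-alike records.

    Hypothetical rule: two records are "similar" if they share sex and
    city, and they are swapped only when their ages differ by one year.
    """
    masked = [dict(r) for r in records]  # copy, so the input is untouched
    groups = {}
    for i, r in enumerate(masked):
        groups.setdefault((r["sex"], r["city"]), []).append(i)
    for idxs in groups.values():
        rng.shuffle(idxs)
        # Pair up records at random within each group.
        for a, b in zip(idxs[0::2], idxs[1::2]):
            if abs(masked[a]["age"] - masked[b]["age"]) == 1:
                masked[a]["age"], masked[b]["age"] = (
                    masked[b]["age"],
                    masked[a]["age"],
                )
    return masked


rng = random.Random(0)
# Made-up microdata, not real Census records.
original = [
    {
        "sex": rng.choice("MF"),
        "city": rng.choice(["Philadelphia", "Boston"]),
        "age": rng.randint(30, 40),
    }
    for _ in range(1000)
]
masked = swap_adjacent_ages(original, rng)

# Individual records changed, but the age counts did not.
changed = sum(o["age"] != m["age"] for o, m in zip(original, masked))
same_counts = Counter(r["age"] for r in original) == Counter(
    r["age"] for r in masked
)
```

Because every 37-year-old recoded as 36 is matched by a 36-year-old recoded as 37, the number of people at each age is exactly what it was before, even though you can no longer be sure any single record shows a true age.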
But the problem arose because of a programming error in how the Census Bureau ran these procedures. The right response is obvious: fix the programs, and publish corrected data. Unfortunately, the Census Bureau has refused to correct the data.
The problem also runs a bit deeper. If the mistake were just the one shown in the above graph, it would be easy to simply re-scale the estimates so that there are no longer too many, say, 85-year-old men – just weight them down a bit. But it turns out that the same coding error also messes up the correlation between age and employment, or age and marital status (and, the authors suspect, possibly other correlations as well). When you break several correlations like this, there’s no easy statistical fix.
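A toy example shows why re-weighting can't rescue the correlations. The numbers and the corruption mechanism below are invented for illustration (some 65-year-olds miscoded as 85); they are not the actual error or the researchers' data.

```python
def make_records(age, n, n_employed):
    """Build n records of one age, the first n_employed of them employed."""
    return [{"age": age, "employed": i < n_employed} for i in range(n)]


def count(records, age, weights):
    """Weighted number of people recorded at a given age."""
    return sum(w for r, w in zip(records, weights) if r["age"] == age)


def employment_rate(records, age, weights):
    """Weighted share of people at a given age who are employed."""
    employed = sum(
        w for r, w in zip(records, weights) if r["age"] == age and r["employed"]
    )
    return employed / count(records, age, weights)


# Hypothetical truth: 1000 people aged 65 (40% employed),
# 100 people aged 85 (5% employed).
truth = make_records(65, 1000, 400) + make_records(85, 100, 5)

# Hypothetical coding error: 50 of the 65-year-olds (20 of them employed)
# are recorded as 85.
corrupted = [dict(r) for r in truth]
aged_65 = [r for r in corrupted if r["age"] == 65]
movers = [r for r in aged_65 if r["employed"]][:20]
movers += [r for r in aged_65 if not r["employed"]][:30]
for r in movers:
    r["age"] = 85

ones_truth = [1.0] * len(truth)
ones = [1.0] * len(corrupted)

# The "easy fix": weight each age group back to its true size.
scale = {
    age: count(truth, age, ones_truth) / count(corrupted, age, ones)
    for age in (65, 85)
}
weights = [scale[r["age"]] for r in corrupted]

fixed_count_85 = count(corrupted, 85, weights)          # back to the true 100
true_rate_85 = employment_rate(truth, 85, ones_truth)   # 5 / 100
raw_rate_85 = employment_rate(corrupted, 85, ones)      # 25 / 150
fixed_rate_85 = employment_rate(corrupted, 85, weights) # still 25 / 150
```

Weighting the 85-year-olds down brings their head count back into line, but everyone in that age group gets the same weight, so the share who appear to be employed doesn't budge. The relationship between age and employment stays broken.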
Worse still, the researchers find that related problems afflict the microdata released for other major data sources. All told, they’ve found similar errors in:
- The 2000 Decennial Census.
- The American Community Survey, which is the annual “mini-census” (errors exist in 2003-2006, but not 2001-2002, or 2007-2008).
- The Current Population Survey, which generates our main labor force statistics (errors exist for 2004-2009).
These microdata have been used in literally thousands of studies and countless policy discussions.