Data mining is not infinitely powerful

To say the least:

Suppose that asset pricing factors are just data mined noise. How much data mining is required to produce the more than 300 factors documented by academics? This short paper shows that, if 10,000 academics generate 1 factor every minute, it takes 15 million years of full-time data mining. This absurd conclusion comes from rigorously pursuing the data mining theory and applying it to data. To fit the fat right tail of published t-stats, a pure data mining model implies that the probability of publishing t-stats < 6.0 is ridiculously small, and thus it takes a ridiculous amount of mining to publish a single t-stat. These results show that the data mining alone cannot explain the zoo of asset pricing factors.

That is from a new paper by Andrew Y. Chen at the Fed.
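For a sense of scale, here is a rough back-of-the-envelope count of how many candidate factors the abstract's scenario would churn through. The three headline numbers come from the abstract; the two readings of "full-time" (round the clock versus 40-hour weeks) are assumptions made here for illustration, not something the paper is known to specify.

```python
# Back-of-the-envelope count of the mining effort quoted in the abstract.
ACADEMICS = 10_000          # from the abstract
FACTORS_PER_MINUTE = 1      # from the abstract, per academic
YEARS = 15_000_000          # from the abstract


def total_factors_tried(minutes_per_year: int) -> int:
    """Total candidate factors generated over the whole period."""
    return ACADEMICS * FACTORS_PER_MINUTE * minutes_per_year * YEARS


# Two hypothetical readings of "full-time" (assumptions, not from the paper).
ROUND_THE_CLOCK = 60 * 24 * 365   # 525,600 minutes per year
WORKING_HOURS = 60 * 40 * 52      # 124,800 minutes per year

for label, minutes in [("round the clock", ROUND_THE_CLOCK),
                       ("40-hour weeks", WORKING_HOURS)]:
    print(f"{label}: about {total_factors_tried(minutes):.2e} factors tried")
```

Either way the count is on the order of 10^16 candidate factors, which is the sense in which the abstract calls its own conclusion absurd.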


Not sure what metric the author is measuring, but I recall reading in some finance paper that to get three-sigma certainty, you need something like 300 to 600 years of stock market data. This paper is probably echoing that theme.
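One way to make sense of a figure like that, as a rough sketch only: if annual returns are independent with a fixed Sharpe ratio, the t-statistic on the mean return grows roughly like Sharpe ratio times the square root of the number of years. The function name and the Sharpe ratios below are hypothetical inputs chosen for illustration, not numbers taken from the paper or from the comment.

```python
def years_needed(t_target: float, annual_sharpe: float) -> float:
    """Years of i.i.d. annual data needed for the mean return to reach
    t_target, using the approximation t = Sharpe ratio * sqrt(years)."""
    return (t_target / annual_sharpe) ** 2


# Hypothetical annual Sharpe ratios (assumptions for illustration).
for sharpe in (0.40, 0.17, 0.12):
    years = years_needed(3.0, sharpe)
    print(f"Sharpe {sharpe:.2f}: about {years:.0f} years for t = 3")
```

Under these assumptions, only a weak signal (an annual Sharpe ratio well below 0.2) needs centuries of data to reach three sigma; a stronger one gets there in decades.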

OK I read the paper, and I figured out where the author is going and how he's wrong. He's saying that if you randomly select variables that probably have no meaning, like "the price of butter in Bangladesh predicting US stock market returns", and then 'data mine' to fit Bangladesh butter prices to US stock prices, you would generally need an astronomically large amount of data (though strictly speaking, as in Borges' Library of Babel story, eventually you will, mathematically, find a fit). Similarly, just finding these so-called "factors" (variables that might explain stock prices) is difficult and requires 10k researchers working eight hours a day and so on, if they pick these factors randomly (i.e., just trawling the data randomly to find factors).

What the researcher got wrong is "Bayesian Probability": he failed to account for it, which is the famous "prosecutor's fallacy" as it's known in law. The academics don't just pick random factors and then data mine. Rather, they choose well-known factors like P/E ratios, GDP, interest rates, investor confidence, profits, etc., and then fit the data to these well-known factors. If the data fits, the paper is published; otherwise not. Usually these N-factor models, like Roll's N-factor, or the CAPM (the three-factor test, the five-factor test and so on), succeed in the in-sample data (they show a fit) but fail in out-of-sample data, so it's called "data mining" in a pejorative sense when they fail out of sample. It's not just randomly picking variables and randomly trying to assign them to data. That would be like expecting monkeys to type Shakespeare all in one go. Rather, it's like having monkeys pick four letters from a box, and the monkeys that pick known four-letter words are published as genius monkeys while the rest of the monkeys are ignored. The researcher knows that four letters that make up a known word are not hard to pick, hence the data mining is 'reasonable'.
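The "publish the monkeys that spell a word" mechanism can be sketched with a small simulation. This is an illustration under stated assumptions, not the paper's model: every factor here is pure Gaussian noise, the sample sizes and the |t| > 2 publication bar are made-up parameters, and `simulate` is a hypothetical helper.

```python
import random
import statistics


def t_stat(sample):
    """t-statistic for the hypothesis that the sample mean is zero."""
    n = len(sample)
    return statistics.mean(sample) / (statistics.stdev(sample) / n ** 0.5)


def simulate(n_factors=2000, n_months=120, threshold=2.0, seed=0):
    """Return (in-sample t, out-of-sample t) for noise factors that clear the bar."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_factors):
        # Pure noise: the factor's true mean return is zero in both periods.
        in_sample = [rng.gauss(0.0, 1.0) for _ in range(n_months)]
        out_sample = [rng.gauss(0.0, 1.0) for _ in range(n_months)]
        t_in = t_stat(in_sample)
        if abs(t_in) > threshold:  # only "significant" fits get published
            published.append((t_in, t_stat(out_sample)))
    return published


published = simulate()
share = len(published) / 2000
mean_in = statistics.mean(abs(t) for t, _ in published)
mean_out = statistics.mean(abs(t) for _, t in published)
print(f"published share: {share:.1%}")
print(f"mean |t| in sample: {mean_in:.2f}, out of sample: {mean_out:.2f}")
```

In a setup like this, a few percent of noise factors clear the bar in sample and then look unremarkable out of sample. Note that this sketch only covers t-stats just above 2; it says nothing about the fat right tail of much larger t-stats that the abstract emphasizes.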

I think you correctly identify the paper's essential flaw: his basic premise stated in simplified form is, "a large t statistic cannot be produced by data mining given our time and computing constraints." I am skeptical that this is a useful definition. Rather, data mining is more like believing "a good model is one that fits the statistics", i.e. producing models to quote-unquote 'predict' in-sample statistics with no attempt to understand the question of Why?

Andrew Gelman should have a comment!

If you were going to put quotes around ‘predict’ why did you also write out ‘quote-unquote’? Or were you actually saying ‘’predict’’?

I don't know. I just did. Do I need a reason? Don't be too normal please.

What the author appears to be doing is debunking the current trend of using data mining to predict stock returns. What trend? The trend in which hedge and other investment funds are replacing analysts with computer engineers and quants. The trend was started years ago by a few small investment funds (one located in Charlottesville with so much computer hardware it's stored in a large warehouse), whose outsized returns led to a few copycats and culminated in lots of analysts seeking a new line of work. I can appreciate why data mining appeals, as it supposedly removes the human factors (mostly derived from bias) from picking stocks and replaces those factors with a mathematical certainty that derives from, well, mathematical certainty, not bias. If you believe that, I have a blockchain I will sell you.

As I understand it, The Data Mining Theory is an abstract one, as this article says, about pulling information from a pile of unstructured data.

That is an NP problem isn't it, and only solvable in exponential time?

So of course with the worst-case data set, "the financial markets" or something, "exponential time to solve" is going to be very long.

As other comments note, testing theories against the database will be much more tractable. And I would call this date in mining still, in a general sense.

And if anyone is going to let a general data miner run on general data ... yeah, it would have to be a pretty constrained set of data at this point.

"date in mining" is of course "data mining" still.

In other words most practical data mining probably does not conform to "a pure data mining model."

For Big Data, you more or less want a theory generator (human, automated or hybrid) and then a theory tester to run against the data.

Even then, theories might take a long time to test.

Minor clarification from someone who does this for a living. We mine the data for possible patterns (so many spurious ones) that might explain behaviors of interest. Then we form hypotheses to test via analytics or machine learning as appropriate.

Data mining, the search for patterns that appear to correlate to an interesting behavior or result, can be one input to your theory generator.

While there's a lot of interest in automating generation (feature engineering), so far, the most refined results come from a deep understanding of the domain, from someone with the ability to explain, hell, rationalize why some darn thing works.

I think the author is not debunking data mining. He is maintaining that the factors identified in some studies of stock returns are legitimate correlations. His argument is that if no such correlations existed, then it is unlikely so many such correlations would have been discovered by data mining.

Yup, that's it! Thank you! My apologies for the confusing terminology: in academic finance data mining has very negative connotations. I'll improve the terminology in the next draft.
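The argument summarized above leans on how rare very large t-stats are when there is nothing to find. As a rough illustration only, the sketch below uses a plain standard-normal null, which is an assumption made here; the paper fits its own model to the distribution of published t-stats, and this does not reproduce it. The function name `two_sided_tail` is hypothetical.

```python
import math


def two_sided_tail(t: float) -> float:
    """P(|Z| > t) for a standard normal Z."""
    return math.erfc(t / math.sqrt(2))


for t in (2.0, 4.0, 6.0):
    p = two_sided_tail(t)
    print(f"|t| > {t}: p = {p:.2e}, about {1 / p:,.0f} noise factors per hit")
```

Under that simple null, clearing |t| > 2 takes a couple of dozen tries, while clearing |t| > 6 takes hundreds of millions, which gives a feel for why a thick right tail of published t-stats is hard to attribute to noise alone.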

Therefore, in conclusion, the normal DSGE model predicts it takes 50-100 years for markets to determine true interest rates in the same population used by these researchers: a minimum of one full generation going through its life cycle, per the life-cycle hypothesis.
