The excellent Susan Athey addresses that question on Quora, here is one excerpt:

Machine learning is a broad term; I’m going to use it fairly narrowly here. Within machine learning, there are two branches, supervised and unsupervised machine learning. Supervised machine learning typically entails using a set of “features” or “covariates” (x’s) to predict an outcome (y). There are a variety of ML methods, such as LASSO (see Victor Chernozhukov (MIT) and coauthors who have brought this into economics), random forest, regression trees, support vector machines, etc. One common feature of many ML methods is that they use cross-validation to select model complexity; that is, they repeatedly estimate a model on part of the data and then test it on another part, and they find the “complexity penalty term” that fits the data best in terms of mean-squared error of the prediction (the squared difference between the model prediction and the actual outcome). In much of cross-sectional econometrics, the tradition has been that the researcher specifies one model and then checks “robustness” by looking at 2 or 3 alternatives. I believe that regularization and systematic model selection will become a standard part of empirical practice in economics as we more frequently encounter datasets with many covariates, and also as we see the advantages of being systematic about model selection.

…in general ML prediction models are built on a premise that is fundamentally at odds with a lot of social science work on causal inference. The foundation of supervised ML methods is that model selection (cross-validation) is carried out to optimize goodness of fit on a test sample. A model is good if and only if it predicts well. Yet, a cornerstone of introductory econometrics is that prediction is not causal inference, and indeed a classic economic example is that in many economic datasets, price and quantity are positively correlated. Firms set prices higher in high-income cities where consumers buy more; they raise prices in anticipation of times of peak demand. A large body of econometric research seeks to REDUCE the goodness of fit of a model in order to estimate the causal effect of, say, changing prices. If prices and quantities are positively correlated in the data, any model that estimates the true causal effect (quantity goes down if you change price) will not do as good a job fitting the data. The place where the econometric model with a causal estimate would do better is at fitting what happens if the firm actually changes prices at a given point in time—at doing counterfactual predictions when the world changes. Techniques like instrumental variables seek to use only some of the information that is in the data – the “clean” or “exogenous” or “experiment-like” variation in price—sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions about changing price. This type of model has not received almost any attention in ML.
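The cross-validation procedure Athey describes — repeatedly fitting on part of the data, scoring on the held-out part, and choosing the complexity penalty that minimizes test mean-squared error — can be sketched in a few lines. This is a hypothetical toy (a single-feature ridge penalty on made-up data), not anything from her answer:

```python
import random

random.seed(0)

# Toy data: y = 2x + noise, no intercept.
xs = [random.gauss(0, 1) for _ in range(60)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

def ridge_slope(x, y, lam):
    # Closed-form ridge estimate for one feature: sum(xy) / (sum(x^2) + lam).
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def cv_mse(lam, k=5):
    # k-fold cross-validation: fit on k-1 folds, score MSE on the held-out fold.
    n, fold = len(xs), len(xs) // k
    total = 0.0
    for i in range(k):
        held = set(range(i * fold, (i + 1) * fold))
        tr_x = [xs[j] for j in range(n) if j not in held]
        tr_y = [ys[j] for j in range(n) if j not in held]
        b = ridge_slope(tr_x, tr_y, lam)
        total += sum((ys[j] - b * xs[j]) ** 2 for j in held) / len(held)
    return total / k

# The "complexity penalty term" is picked by its cross-validated error.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=cv_mse)
slope = ridge_slope(xs, ys, best_lam)
print(best_lam, round(slope, 3))
```

The same pattern — a grid of penalties scored by held-out error — is what LASSO, regression trees, and the other methods she lists automate at scale.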

The answer is interesting, though difficult, throughout. Here are various Susan Athey writings, on machine learning. Here are other Susan Athey answers on Quora, recommended. Here is her answer on whether machine learning is “just prediction.”

First. A lame comment.

I’ve only been reading Quora for several months, and irregularly at that. Perhaps I’ve subscribed to low-quality subject areas, but the questions are usually incredibly dumb; the answers, though, are often interesting and informative. Fairly often an actual expert or even a celebrity will answer.

But I hadn’t seen something truly substantive on Quora until this set of Q&A with Susan Athey. Even the questions were good and some of her answers were really good, in particular this machine learning one.

But they always told me that there are no stupid questions

There are only revealing questions.

My impression of Quora is that it is basically Yahoo answers, if Yahoo Answers had started as a closed academic community and then gone mass market.

Yesterday I was looking at some of Athey’s papers on random forests and classification trees. I first encountered her name in graduate school when I read (and barely understood) her work on comparative statics. In every topic she goes into, she makes a fairly deep contribution. Impressive. Is there any other economist in that “class”? Why is she writing on Quora?

I agree her papers are quite interesting 🙂

The missing word is generalization. Reducing the ‘goodness of fit’ in the specific case gives a model that fits over a broader range of parameters – but less well over any specific set – unless the amount of data can be increased vastly, so that increases in model complexity can be supported.

I sometimes explain ML to people as just applied statistics, where the formula has enough sophistication to do its job without any additional input. (This isn’t quite correct; it ignores hyperparameter tuning and the intelligence that went into picking a particular ML algorithm.)

But it seems to me that empirical economics will benefit greatly from using more sophisticated statistical techniques, and more importantly, more sophisticated statistical validation tests. (Cross-validation, for example, is much more informative than null hypothesis significance testing, even though the two don’t directly compete.)

Right, I think cross-validation should simply be standard practice in almost all statistical work. Other approaches such as datamining may not be as helpful, but still might help with robustness checks.

I am having trouble seeing applications. Yes, a computer can process a big set of employment data, and “learn” a relationship to minimum wage, but that relationship is internally the machine state of X elements, cells, trees, whatever.

How do you publish? With a public virtual machine? How do you describe? Is the machine state reducible to a human narrative?

From Leo Breiman’s [originator of CART and Random Forests] paper Statistical Modeling: The Two Cultures.

Breiman talks of the “Data Modeling” culture and the “Algorithmic Modeling” culture (decision trees, neural nets, which can often become Machine Learning).

There is an “asymptotic convergence of the CART algorithm to the Bayes risk”.

“Unfortunately, in prediction, accuracy and simplicity (interpretability) are in conflict. For instance, linear regression gives a fairly interpretable picture of the y, x relation, but its accuracy is usually less than that of the less interpretable neural nets.”

“While trees rate an A+ on interpretability, they are good, but not great, predictors.”

“Random Forests are A+ predictors.

… But their mechanism for producing a prediction is difficult to understand.

…Accuracy generally requires more complex prediction methods.

Simple and interpretable functions do not make the most accurate predictors.

Using complex predictors may be unpleasant, but the soundest path is to go for predictive accuracy first.”
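Breiman’s claim that accuracy requires more complex methods is easiest to see in the variance-reduction logic behind random forests: many mediocre but roughly independent predictors, combined by majority vote, beat any one of them. A toy simulation, assuming fully independent votes (real trees are only partially decorrelated, so this overstates the gain):

```python
import random

random.seed(1)

P_CORRECT = 0.65   # each weak predictor is right 65% of the time
M = 101            # ensemble size
TRIALS = 2000

def single_accuracy(trials=TRIALS):
    # One weak predictor on its own.
    return sum(random.random() < P_CORRECT for _ in range(trials)) / trials

def ensemble_accuracy(trials=TRIALS):
    # Majority vote over M independent weak predictors.
    correct = 0
    for _ in range(trials):
        votes = sum(random.random() < P_CORRECT for _ in range(M))
        correct += votes > M // 2
    return correct / trials

one = single_accuracy()
many = ensemble_accuracy()
print(round(one, 3), round(many, 3))
```

The ensemble is far more accurate than any member, but — exactly as Breiman says — its 101 internal votes are much harder to narrate than a single tree.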

The foundation of supervised ML methods is that model selection (cross-validation) is carried out to optimize goodness of fit on a test sample. A model is good if and only if it predicts well. Yet, a cornerstone of introductory econometrics is that prediction is not causal inference, and indeed a classic economic example is that in many economic datasets, price and quantity are positively correlated.

I’m not sure if I see the contradiction. Does ML claim that prediction is causal inference?

Firms set prices higher in high-income cities where consumers buy more; they raise prices in anticipation of times of peak demand. A large body of econometric research seeks to REDUCE the goodness of fit of a model in order to estimate the causal effect of, say, changing prices. If prices and quantities are positively correlated in the data, any model that estimates the true causal effect (quantity goes down if you change price) will not do as good a job fitting the data. The place where the econometric model with a causal estimate would do better is at fitting what happens if the firm actually changes prices at a given point in time—at doing counterfactual predictions when the world changes. Techniques like instrumental variables seek to use only some of the information that is in the data – the “clean” or “exogenous” or “experiment-like” variation in price—sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions about changing price. This type of model has not received almost any attention in ML.

Not sure if I understand this either. Is it saying that because the causal effect is small compared to the correlations due to unrelated factors, it is hard to get the model to fit it? So you fit the model to the broad correlations and look for places where the model breaks? It strikes me that this is a problem of selecting an inappropriate model that can’t capture the higher-order behavior you want. Or just not applying it in a way that captures what you’re looking for. Can’t you just normalize the data against local cost-of-living indexes or something to that effect? Transform the data in some way that makes the relationships you’re looking for pop out more.
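The price/quantity point can be made concrete with a simulation: a demand shock pushes price and quantity up together, so a naive regression gets the wrong sign, while an instrument (a cost shifter that moves price but not demand) recovers the true negative causal effect. This is a hypothetical sketch with made-up parameters, not anything from Athey’s answer:

```python
import random

random.seed(2)
n = 20000

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

# z: cost shifter (the instrument) -- moves price but not demand.
# u: demand shock -- firms partly anticipate it and raise prices when demand is high.
z = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 3) for _ in range(n)]
p = [zi + 0.5 * ui for zi, ui in zip(z, u)]     # price
q = [-1.0 * pi + ui for pi, ui in zip(p, u)]    # true causal slope is -1

ols_slope = cov(p, q) / cov(p, p)   # biased upward: absorbs the demand shock
iv_slope = cov(z, q) / cov(z, p)    # instrumental-variables (Wald) estimate
print(round(ols_slope, 2), round(iv_slope, 2))
```

The OLS slope comes out positive — the better in-sample fit — while the IV estimate, which uses only the “clean” variation from the cost shifter, lands near the true -1. That is the sense in which the causal model sacrifices goodness of fit.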

Say you train with ML to recognize dogs and cats. Do you run a test, rounding dogs’ ears more, to see if they become cats?

I think (at my level of knowing) that the internal state, and the “why it works,” is very much a black box in ML. Yes, you can have ML recognize repeat offenders, but I know of no reason why tweaking the data to get a new result does not just corrupt the state of the machine. Dogs become cats.

Transforming the data isn’t “tweaking” it in the sense you mean.

What I’m saying is: suppose you have a pattern inscribed on a curved surface that you’re trying to read. If you know the shape of the curved surface, you can subtract that out, so that you’re just looking at the pattern as if it were on a flat surface.
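The “subtract out the known surface” idea is just detrending: remove a known (or separately estimated) trend, and the residual is the pattern of interest. A minimal sketch on made-up data:

```python
import math

# Signal = known curved "surface" (a quadratic trend) + the pattern of interest.
def trend(t):
    return 0.01 * t * t

def pattern(t):
    return math.sin(t / 3.0)

signal = [trend(t) + pattern(t) for t in range(100)]

# Knowing the surface's shape, subtract it out; only the pattern remains.
residual = [s - trend(t) for t, s in enumerate(signal)]
err = max(abs(r - pattern(t)) for t, r in enumerate(residual))
print(err)
```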

The work of (computer scientist) Judea Pearl is all about causality as opposed to probabilistic prediction. Causal inference is an active research area in machine learning, see for example the work at the MPI Tuebingen in Germany: http://ei.is.tuebingen.mpg.de/research_groups/causal-inference-group

So I’m not sure why Susan Athey does not discuss this part of ML. Especially since Judea Pearl’s work is just massive.
