Profiting from Machine Learning in the NBA Draft

by Tyler Cowen on May 5, 2014 at 2:16 pm in Data Source, Sports | Permalink

There is a new paper by Philip Maymin:

I project historical NCAA college basketball performance to subsequent NBA performance for prospects using modern machine learning techniques without snooping bias. I find that the projections would have helped improve the drafting decisions of virtually every team: over the past ten years, teams forfeited an average of about $90,000,000 in lost productivity that could have been theirs had they followed the recommendations of the model. I provide team-by-team breakdowns of who should have been drafted instead, as well as team summaries of lost profit, and draft order comparison. Far from being just another input in making decisions, when used properly, advanced draft analytics can effectively be an additional revenue source in a team’s business model.

Note these are “partial equilibrium” estimates: given rivalry, not every team could improve its drafting to this extent at the same time.
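The paper does not disclose the model beyond calling it “custom,” so purely as an illustration, here is a minimal sketch of what projecting NCAA box-score stats onto later NBA Wins Produced with a time-based split (the “no snooping bias” claim) could look like. The data file, column names, and the gradient-boosting choice are all assumptions, not the paper’s actual method.

```python
# Hypothetical sketch only: project NCAA box-score stats onto subsequent NBA
# Wins Produced, training solely on earlier draft classes to avoid snooping.
# The file, column names, and model choice are illustrative; the paper's
# actual "custom machine learning" model is not disclosed.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("prospects.csv")  # one row per drafted NCAA player
features = ["pts_per40", "reb_per40", "ast_per40", "ts_pct", "age"]

# Time-based split: fit on drafts before 2009, evaluate on 2009 onward,
# so no information from the evaluation years leaks into the model.
train = df[df["draft_year"] < 2009]
test = df[df["draft_year"] >= 2009].copy()

model = GradientBoostingRegressor(random_state=0)
model.fit(train[features], train["nba_wins_produced"])
test["projected_wins"] = model.predict(test[features])

# Within each draft class, rank prospects by projected wins; a team "should
# have drafted" the highest-ranked NCAA player still on the board at its pick.
test["model_rank"] = test.groupby("draft_year")["projected_wins"].rank(ascending=False)
print(test.sort_values(["draft_year", "model_rank"])
          [["player", "draft_year", "projected_wins", "model_rank"]].head(10))
```

The key discipline is that the model is fit only on draft classes that precede the ones it is evaluated on, so no future information enters the projections.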

prior_approval May 5, 2014 at 2:19 pm

And the injury factor?

Or is this like the Dow average, where a company is silently dropped, so the numbers appear consistent to the unobservant?

Philip Maymin May 5, 2014 at 4:17 pm

To be conservative, I treat injuries as zero wins produced. Not silently dropped.

Ray Lopez May 5, 2014 at 2:20 pm

Isn’t this a version of M. Lewis’ Moneyball, which according to a Slate article is more story-telling than science? Hence I doubt the “$90M” figure.

Cliff May 5, 2014 at 2:34 pm

Wow, you reject both advanced statistical analysis in sports and HBD? Step away from your “hot 20yo Filipino gf” and read some books or something

Jason Y May 5, 2014 at 3:08 pm

Ray isn’t being unreasonable. If you have special insight into sports you can make lots of money betting on them, and I can’t help but notice that most sports analytics experts don’t make lots of money — their models absolutely suck. So, what gives? I also notice that known long-term winners don’t use advanced statistical techniques, e.g., haralabob. Interesting, right? I wonder what’s going on.

If you’re going to style yourself a noticer, try noticing more than just how ethnic groups differ.

Ray Lopez May 6, 2014 at 1:03 am

Haralabos, a Greek name, yep. “When Voulgaris was 18, he took a gap year between high school and college. First he traveled to Greece, visiting the hardscrabble villages — Argos, Tripoli — where his parents were born and raised before they immigrated, in their 20s, to Canada. (Voulgaris’ legal first name is Haralabos.) ”

And notice he made most if not all of his fortune “by his wits” (without a machine), though you can argue it was luck.

As for economists’ models, they often backfit the data perfectly but have no predictive power, like those stock market models that explain past performance perfectly (“beta,” or the “three-factor model”) but cannot predict the future very well.
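A toy numerical illustration of the backfitting Ray describes, using simulated data: a very flexible model fits the past almost perfectly yet predicts the future far worse than a simple one.

```python
# Toy illustration: a model flexible enough to "predict past performance
# perfectly" can still fail out of sample. Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.3, size=x.size)   # true relationship is linear

x_train, y_train = x[:15], y[:15]
x_test, y_test = x[15:], y[15:]

# A degree-12 polynomial nearly interpolates the 15 training points...
overfit = np.polyfit(x_train, y_train, deg=12)
# ...while a simple line does not, but generalizes far better.
line = np.polyfit(x_train, y_train, deg=1)

def mse(coeffs, xs, ys):
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

print("in-sample MSE:     overfit =", mse(overfit, x_train, y_train),
      " line =", mse(line, x_train, y_train))
print("out-of-sample MSE: overfit =", mse(overfit, x_test, y_test),
      " line =", mse(line, x_test, y_test))
```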

My hot 20 yo gf is cleaning the bathroom now. She can cook too!

Cliff May 5, 2014 at 9:24 pm

Having been a professional gambler, I probably have more insight on this subject than you. The world is full of betting syndicates that rely on advanced statistics, so your objection is baseless. Many people interested in sports stats are hobbyists who love the sport and want a job in the sport, not to run an illegal betting syndicate in Asia.

mkt May 6, 2014 at 12:37 am

An irrelevant objection. It’s like saying that unless an economist is making a killing in stock markets or derivatives markets, his or her models are no good. That’s a silly way to evaluate an economic model, and a silly way to evaluate a basketball model.

In particular, if Maymin’s model gives teams better advice about who to draft (admittedly, from his article we have almost no ability to judge whether this is so, beyond his say-so), then that’s a model which would be useful to a team, but provides no betting opportunities.

chuck martel May 5, 2014 at 2:26 pm

Looks like this guy has a for-sure job offer coming from the Minnesota Timberwolves.

Mike Bibby May 5, 2014 at 2:36 pm

An interesting idea, but Wages of Wins’ Wins Produced is the worst stat he could possibly have chosen to use. Just glancing over the list, you can see his model would have recommended picking Carmelo Anthony before LeBron, Mike Conley before Kevin Durant, Michael Beasley over Derrick Rose, Derrick Williams over Kyrie Irving, and countless other absurdities.

Philip Maymin May 5, 2014 at 4:18 pm

Carmelo had NCAA stats, LeBron did not.

Matt May 5, 2014 at 4:26 pm

So, you’re saying it would have been worthless during the period when high school players had to be evaluated, and also for foreign players? A pretty significant limitation.

Philip Maymin May 5, 2014 at 4:49 pm

Indeed. The model can only choose from NCAA players while actual teams can choose from NCAA, high school, and international. Despite this informational disadvantage, the model still substantially outperforms.

dead serious May 5, 2014 at 8:09 pm

Except when it doesn’t. Choosing Carmelo Anthony over LeBron is a franchise-killing decision.

TV May 13, 2014 at 1:11 pm

Hey dead serious, hindsight bias much?

Matt May 5, 2014 at 2:43 pm

Like Mike (the real Mike Bibby, I wonder?) I can’t help but note the large number of obviously wrong results, including the absurd over-valuing of Mario Chalmers, among others. Chalmers is a useful player, but his value must be getting pushed up here by having played with LeBron James, etc. (Also, any model that tells you to take T.J. Ford over Dwyane Wade has some serious shortcomings.) Maybe it can be improved, but I think the claims need to be dialed back a bit for now.

zubin May 5, 2014 at 2:44 pm

this is fairly useless.

he says he uses “modern machine learning techniques” but doesn’t say a word about the actual model other than that it’s “custom.”

Unsuspecting social scientists beware

Jonathan May 5, 2014 at 2:45 pm

Since when does a blogger have to be responsible for the actual methodology? Clearly this is fun and interesting, although Jeff Withey does seem to get picked by a lot of teams quite often!

Zach May 5, 2014 at 3:35 pm

Looking over the team-by-team comparisons of drafts vs. should-have-drafteds, I notice that many names are repeated for multiple teams. Jeff Withey, for example, is on the 2013 should-have-drafted list for three or four teams.

So if I follow the methodology correctly, a valuable player who falls in the draft can show up on the should-have-drafted list many times, but on the drafted list only once. Meanwhile, a valuable player who is drafted earlier than the model predicts appears only once, on the drafted list.

The basic lack of symmetry in the accounting system makes me distrust the conclusions.

niko May 5, 2014 at 4:05 pm

I think you’re right Zach.

Philip Maymin May 5, 2014 at 4:20 pm

On a per-team basis, no player is drafted, or should have been drafted, more than once.

Philip Maymin May 5, 2014 at 4:19 pm

This is exactly why it is a partial equilibrium result.

mkt May 6, 2014 at 12:31 am

That is indeed a reason for doubting the $90M figure, but Tyler already mentioned that they are “partial equilibrium” estimates — an even more precise way to describe it is that the calculations are done using Cournot assumptions, i.e. the rivals’ strategies are assumed to be fixed. This does indeed create a lack of symmetry, but I don’t see an obvious alternative way to measure the effect of sub-optimal draft choices.
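A minimal sketch, with invented numbers, of the fixed-rivals accounting mkt describes: each team’s counterfactual pick is recomputed while every other team’s actual pick is held fixed, which is also why a player who fell in the draft can show up on several teams’ should-have-drafted lists.

```python
# Sketch of the Cournot-style counterfactual: recompute each team's "should
# have drafted" pick with all rivals' actual picks held fixed. Players and
# projected values are invented; only the accounting logic is illustrated.
actual_picks = [("Team A", "Player 1", 5.0),
                ("Team B", "Player 2", 3.0),
                ("Team C", "Player 3", 1.0)]
projections = {"Player 1": 5.0, "Player 2": 3.0,
               "Player 3": 1.0, "Player 4": 8.0}  # Player 4 fell in the draft

for i, (team, pick, value) in enumerate(actual_picks):
    # Players taken by rivals *before* this pick are off the board;
    # everyone else is treated as still available to this team.
    taken_earlier = {p for _, p, _ in actual_picks[:i]}
    available = {p: v for p, v in projections.items() if p not in taken_earlier}
    best = max(available, key=available.get)
    lost = max(available[best] - value, 0.0)
    print(f"{team}: took {pick} ({value} wins), model prefers {best} "
          f"({available[best]} wins), forfeited {lost} wins")
```

In this toy run Player 4 is the recommended pick for all three teams, which shows the asymmetry Zach points out: the forfeited totals cannot all be realized at once.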

That doesn’t mean that Maymin’s model is correct or useful — since all we know about it is “machine learning” there’s very little that we can do to evaluate it. But Maymin’s done interesting NBA research before (and presented it at the Sloan Sports Analytics Conference) so I wouldn’t reject it out of hand.

Konstantinm May 5, 2014 at 6:04 pm

Hard to do much with this given the total lack of methodology, and the reliance on Wins Produced (which is pretty much universally rejected these days, even among the box-score statistics community, without even getting into the RPM debate).

I like the idea, however.

LP May 5, 2014 at 9:34 pm

Agreed. Saying “I used machine learning” without any potential for methodological critique or replication is ridiculous. This doesn’t deserve to be hosted on SSRN.

Will P May 5, 2014 at 7:58 pm

Was the training data set the same as, or different from, the test data set?

triclops41 May 5, 2014 at 10:57 pm

Excellent question

Cliff May 6, 2014 at 12:55 am

I think it’s safe to assume the latter, since otherwise the guy would be an idiot and his model would be completely worthless.

EnlightenedDuck May 6, 2014 at 8:40 am

@Cliff – Given that he provides no information about the method he used (see the comments above; “custom machine learning” doesn’t count), I think, by your own justification, it is safer to assume the former.

@Tyler Cowen – I’m seriously disappointed that you posted this. Oh well…your rate of interesting posts still has me here every morning:)

Philip Maymin May 6, 2014 at 9:17 am

Different

