Statistics-free sports prediction

March 22, 2016 at 7:58 am in Data Source, Sports, Web/Tech

From Alexander Dubbs:

We use a simple machine learning model, logistically-weighted regularized linear least squares regression, in order to predict baseball, basketball, football, and hockey games. We do so using only the thirty-year record of which visiting teams played which home teams, on what date, and what the final score was. No real “statistics” are used. The method works best in basketball, likely because it is high-scoring and has long seasons. It works better in football and hockey than in baseball, but in baseball the predictions are closer to a theoretical optimum. The football predictions, while good, can in principle be made much better, and the hockey predictions can be made somewhat better. These findings tell us that in basketball, most statistics are subsumed by the scores of the games, whereas in baseball, football, and hockey, further study of game and player statistics is necessary to predict games as well as can be done.
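For the curious, here is one rough reading of what a scores-only model of this kind could look like: a ridge-regularized least-squares fit of score margins on home/away team indicators, with older games down-weighted logistically. This is a sketch under assumptions, not the authors' actual specification; the team codes, decay constants, and penalty below are illustrative.

# Sketch of a scores-only predictor: ridge-regularized least squares on
# home/away team indicators, with older games down-weighted logistically.
# Team codes, decay constants, and the penalty are illustrative guesses,
# not the paper's actual choices.
import numpy as np

games = [
    # (days_ago, home_team, away_team, home_score, away_score)
    (3,  "BOS", "NYK", 105, 98),
    (10, "NYK", "CHI", 101, 99),
    (30, "CHI", "BOS", 95, 110),
]
teams = sorted({t for g in games for t in (g[1], g[2])})
idx = {t: i for i, t in enumerate(teams)}

X, y, w = [], [], []
for days_ago, home, away, home_pts, away_pts in games:
    row = np.zeros(len(teams) + 1)
    row[idx[home]] = 1.0            # home team's strength enters positively
    row[idx[away]] = -1.0           # away team's strength enters negatively
    row[-1] = 1.0                   # shared home-advantage term
    X.append(row)
    y.append(home_pts - away_pts)   # target: the score margin
    w.append(1.0 / (1.0 + np.exp((days_ago - 365) / 120)))  # logistic age weight

X, y, W = np.array(X), np.array(y), np.diag(w)
lam = 1.0                           # ridge regularization strength
beta = np.linalg.solve(X.T @ W @ X + lam * np.eye(X.shape[1]), X.T @ W @ y)

def predict_margin(home, away):
    """Predicted home-minus-away margin; positive means pick the home team."""
    return beta[idx[home]] - beta[idx[away]] + beta[-1]

A positive predicted margin picks the home team; the only inputs are the (date, home team, away team, score) rows the abstract describes.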

That is an almost Hayekian result, and I wonder what the people at 538 will think of it.

For the pointer I thank Agustin Lebron.

1 Alex Godofsky March 22, 2016 at 8:29 am


machine learning

logistically-weighted regularized linear least squares regression

2 Vaniver March 22, 2016 at 10:19 am

I thought that was bizarre too on first reading, but the rest makes it clear–they don’t mean “statistics” the branch of mathematics, they mean statistics as a sports fan would think of them. All they have is the final scores, who’s the home team, and the time-series order, but they don’t have any details about who played in what game or who had the ball for how long and so on.

3 Daniel Weber March 22, 2016 at 11:40 am

Having the final score sure sounds like a statistic as a sports fan would think of it.

I was thinking that they somehow were getting good predictions just from scheduling information, and not from any results of play.

4 bjk March 22, 2016 at 8:34 am

Can somebody explain what this means? Thanks

5 elcapitan March 22, 2016 at 8:43 am

I think what they mean by ‘statistics’ is the kind of detailed statistics that are available, for example, for baseball, where you have runs per player and so on, which basically let you build a complex multi-dimensional predictive model.

Technically, the method they used is just predictive statistics (logistic regression), but using only scores and home-vs.-visitor data.

I don’t find it very surprising that you can make such predictions with a certain level of confidence, simply because “rich teams” (as in “Moneyball”) will win more often than poor teams, and therefore past wins are predictive of future wins. It all depends on what margin of error you are willing to accept.

6 Ryan March 22, 2016 at 10:20 am

Moneyball is about a poor team that uses a different set of statistics to identify undervalued players and win vs. “rich teams”. Not what you are implying.

7 The Original D March 22, 2016 at 4:05 pm

Their use of different stats made them an outlier. So perhaps the model would have more variance when predicting outcomes of Oakland games, but do better with non-Oakland games.

Though nowadays more teams use Moneyball tactics so it might not work as well.

Also, in basketball one of the reasons Golden State is reputed to be so successful is their obsession with player health. So much so that they sit some of their stars when playing in Denver because they found that the altitude can affect a player’s performance for up to two weeks afterwards. I wonder how good the model is at predicting outcomes in Denver?

8 gadsen purchase March 22, 2016 at 4:32 pm

The bigger problem for baseball is that it’s the only sport where the fundamentals change dramatically game by game. The starting pitcher isn’t determined by home/away status; it’s determined by whether the best pitcher is available (has had 4+ days since his last start). If not, the second-best pitcher goes, and so on…

Also, Moneyball is about constructing a team given monetary constraints, so I don’t think it’s even relevant to this. “Which visiting teams played which home teams, on what date, and what the final score was”: all that measures is what happened after things were locked in.

9 middyeek March 24, 2016 at 10:41 am

But the main reason for their success, and nothing could be more obvious, is that Steph Curry is the best basketball player in the world.

10 Careless March 22, 2016 at 2:55 pm

Football, hockey, and basketball all have relevant salary caps and don’t vary all that much in what teams spend

11 gadsen purchase March 22, 2016 at 4:33 pm

30 years of hockey data includes a lot of non-salary-cap years.

12 Axa March 22, 2016 at 8:45 am

It seems the core of this method is:

“it (the model) measures the ability of teams over games instead of players over possessions, and it takes into account home-field advantage. We intentionally limit our data use to the date, home and visiting teams, and score of each game, and we compare our predictions to a theoretically near-optimal indicator. Doing so tells us what statistical information is contained just in the scores, and whether what are commonly referred to as “statistics” have real predictive power.”

In short, historical scores and home-visitor data tell you a lot about basketball outcomes. In a certain way, this makes the game boring.

13 Ray Lopez March 22, 2016 at 11:07 am

Just guessing from the abstract, I think what they did was ‘back fit’ data going back 30 years to see which ‘dummy variables’ are most likely to predict the outcome.

Simple example (making this up): suppose the Green Bay Packers almost always win at home (90%) against conference teams, but occasionally (25%) lose at home against non-conference teams. Using ‘machine learning’ you would feed this into a linear equation with various coefficients, so if the Chicago Bears were playing the Packers in Green Bay, WI, you would know the Packers win 90% of the time at home, but if it were the New Orleans Saints visiting Lambeau Field, the Packers would win only 75% of the time. This is done for all teams and tested against data going back 30 years, finding the best fit (‘least squares’) for each team’s linear equation.

In short, much ado about nothing.

14 Ray Lopez March 22, 2016 at 11:12 am

Of course the “least squares” would find the coefficient values for each dummy variable. In the example above, for Green Bay, the dummy “HOME_FIELD_CONFERENCE_OPPONENT” would be 0.90 and the dummy “HOME_FIELD_NON_CONFERENCE_OPPONENT” would be 0.75. Likewise for other dummy variables, as many as you want, all back-tested with “machine learning” to see whether they are significant dummies or not.

I’m making this up but it’s plausible.
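To make that guess concrete, a toy version of the back-fit Ray describes might look like the following. This is just the commenter's speculation turned into code, not the paper's method; the conference map and game results are invented.

# Toy version of the 'dummy variable' back-fit described above: tabulate each
# team's historical home win rate against conference vs. non-conference
# opponents. The results and conference map are made up for illustration.
from collections import defaultdict

same_conference = {("GB", "CHI"): True, ("GB", "NO"): False}  # assumed pairings
results = [  # (home, away, home_won)
    ("GB", "CHI", True), ("GB", "CHI", True),
    ("GB", "NO", False), ("GB", "NO", True),
]

wins, games = defaultdict(int), defaultdict(int)
for home, away, home_won in results:
    key = (home, "CONF" if same_conference[(home, away)] else "NON_CONF")
    games[key] += 1
    wins[key] += int(home_won)

home_win_rate = {k: wins[k] / games[k] for k in games}
print(home_win_rate)  # {('GB', 'CONF'): 1.0, ('GB', 'NON_CONF'): 0.5}

A least-squares fit would estimate coefficients like these jointly, with the regularization keeping rarely observed dummies from being overfit.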

15 Philip March 22, 2016 at 8:52 am

Predicting winners is not that hard. Might be useful to compare to a naive predictor as well as the “optimal” one. How about no regressions, just predict based on each team’s prior season’s win percentage?

The hard part is predicting net of Vegas lines. That would be interesting. Do they win by more than expected?
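A baseline of the kind Philip suggests takes only a few lines. This is a sketch; the prior-season win percentages here are invented.

# Naive baseline: pick whichever team had the better win percentage last
# season, breaking ties in favor of the home team. Numbers are invented.
prior_win_pct = {"BOS": 0.598, "NYK": 0.390, "CHI": 0.512}

def naive_pick(home, away):
    """Predicted winner using only last season's win percentage."""
    return home if prior_win_pct[home] >= prior_win_pct[away] else away

print(naive_pick("NYK", "BOS"))  # -> BOS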

16 Ryan Cousineau March 22, 2016 at 2:01 pm

This x1000. Much as nobody would care about a stock model that underperforms the S&P 500, a prediction model that underperforms the Vegas line is essentially valueless.

17 anon March 22, 2016 at 9:07 am

This strikes me as back to the future. Surely least squares on team wins was the sort of thing done with the earliest home computers, before approaches grew in complexity.

18 msl March 22, 2016 at 10:16 am

Yes, but not logistically-weighted least squares!

19 Urstoff March 22, 2016 at 9:34 am

Yeah, but how does it do against the spread?

20 Brian Donohue March 22, 2016 at 1:26 pm

You can bet odds rather than spread. In which case, being right 60% of the time might not be enough.

21 LR March 22, 2016 at 9:39 am

Basketball and football seem to load more heavily on skill, which should make them easier to predict. I.e., you don’t have baseball teams winning 90% of their games…

22 gadsen purchase March 22, 2016 at 4:37 pm

what would football team records look like if they had a 32 game schedule?

23 (Not That) Bill O'Reilly March 22, 2016 at 9:58 am

We do so using only the thirty-year record of which visiting teams played which home teams, on what date, and what the final score was.

Perhaps I am misunderstanding the model a bit, but relying on a thirty-year record (a period long enough for rosters to turn over 10-20 times) seems likely to end up capturing a lot of franchise effect. That is, it will be easier to predict matchups in sports where the same franchises are more regularly at the top of the standings. Considering basketball seems to have some of the least parity of the Big Four sports (I don’t follow closely, but that’s my understanding), it’s not surprising that a 30-year record of box scores would have more predictive power in basketball than in other sports.

24 Hadur March 22, 2016 at 10:43 am

“The Patriots are consistently good at football”

bbl, going to get this sentence published in an academic journal

25 Doug March 22, 2016 at 10:47 am

Jeff Sagarin has been doing this for years using linear least squares regression. You can see all of his “power ratings” at

What is less clear from the paper is whether the method can predict outcomes better than Vegas odds.

26 mobile March 22, 2016 at 3:05 pm

I used Sagarin’s college basketball ratings to fill out my bracket (the “RECENT” method) and I cruised to a commanding lead in the first round of my family NCAA pool. It didn’t tell me Michigan State was going to lose in the first round, but it did tell me not to advance them past the Elite Eight.

27 Flannery Bro'Connor March 22, 2016 at 11:02 am

Tyler –

What does “Hayekian” mean? If I referred to something as “Cowenian” what would that mean?

28 ricardo March 22, 2016 at 11:13 am

The analogy is with Hayek’s view of the price system. Prices summarise relevant information (e.g. I don’t need to know whether rainfall was high or low this year; I just need to look at the price of wheat) in the same way that all the relevant sports ‘statistics’ are summarised by the game scores.

29 Enrique March 22, 2016 at 3:16 pm

Great description/summary of “Hayekian” theory, but what would “Cowenian” mean?

30 Flannery Bro'Connor March 22, 2016 at 3:29 pm

Thank you Ricardo. Strange that the life and work of a Nobel-prize-winning economist can be synecdochized into four words of common sense.

31 mkt42 March 23, 2016 at 3:26 am

Most economic research, indeed most social science research, can be summarized in concise common sense terms. Search theory, rational expectations, public choice, externalities, etc. There are only a few models or theories that are counter-intuitive: comparative advantage, Coase Theorem, maybe a few more.

32 Lord Action March 22, 2016 at 11:08 am

This doesn’t seem surprising. Basketball is a series of a large number of essentially independent plays with almost no room for strategy. It should basically come down to the athleticism of the top couple of players who have the ball most of the time.

Football and hockey are strategically deep and have large teams so it’s hard for an individual to dominate. Today, it’s basically quarterbacks and goalies who have out-sized influence.

I have to admit I’m a little surprised by the result for baseball, which I would have put in between (football and hockey) and basketball. I wonder if it’s an artifact of pitcher rotations. Maybe if you treated (team, pitcher) combinations as the thing you were trying to predict you’d do better.

33 mpowell March 22, 2016 at 11:46 am

Baseball conclusion also surprised me. Is there too much noise in the single game result? I am fairly confident that underlying statistics will do the most to improve prediction for baseball games since they are individually very powerful predictors, but I don’t see why this result would come out of the study as described.

34 Lord Action March 22, 2016 at 3:04 pm

My guess for baseball is that PedroMartinez,RedSox is effectively a different team from TimWakefield,RedSox.

Agreed on baseball statistics: it’s relatively easy to measure very important aspects of individual performance, like batting.

35 gadsen purchase March 22, 2016 at 4:58 pm

Let’s quantify this in a snapshot for a random year with both pitchers: 1998. (I’m going to round a decent amount.)

The top three Red Sox pitchers:

Pedro Martinez leaves the average 1998 game at the start of the 8th inning, having given up 2.25 runs.

Tim Wakefield leaves halfway through the 7th inning, having given up 3.3 runs (which undercounts, since I’m using earned runs and presumably a knuckleballer will give up more unearned runs).

Bret Saberhagen leaves with 2 outs in the 6th inning, having given up 2.48 runs.

That’s a huge difference between Pedro and the others. Your #1 pitcher essentially gives you an extra 1 1/3 “guaranteed” scoreless innings, and going from Pedro to Wakefield costs you a run or more. Or: the team as a whole averaged 4.5 runs allowed per game (the figures above used earned runs), so “generic Boston” allows 4.5 runs, and starting Pedro instead of Wakefield saves at least 22% of the runs allowed per game.

36 Albert M Passy March 22, 2016 at 11:23 am

I too sort of “object” to the title. Maybe “Team-” or “player-statistics-free”? Awkward, but much more accurate.

37 Ed March 22, 2016 at 11:54 am

The type of machine learning described is essentially statistical, but the only statistic being used is the score. It would be surprising if more information (individual player statistics, etc.) did not make for better predictions.

38 tm March 22, 2016 at 12:06 pm

That’s the exact same info that 538 uses for their NFL Elo rankings and predictions; they beat the spread by a bit, but not enough to cover the vigorish.

39 Mark B March 22, 2016 at 12:57 pm

Yup, I would have written “minimal model” and written more about what I left out (e.g. individual player statistics/data).
So yes, poorly worded.
And yes, a model that captures 95% of the variance is often far, far less useful than one that captures even just a little more (like 98%).
Now for some sarcasm, a bit unfair:
At the end of last year, I made the bold prediction that Golden State would be a top team this year and that the Cavs would be a top team in the East. And my model just uses winning percentage…

40 Ryan Cousineau March 22, 2016 at 2:08 pm

(Where naive models like this one get most interesting is when they are used as a comparator to live humans. A few years ago, a Canucks fan posited that the hockey team’s management was doing “worse than a potato,” the potato representing a naive drafting model (always draft the available forward with the most points from the most important Canadian junior league). The case was compelling that the team did, in fact, do a worse job of drafting than the potato.)

41 Plucky March 22, 2016 at 3:19 pm

The baseball result is fairly obvious given the inputs. The single most important player in any particular game is the starting pitcher, but that player rotates on a schedule. Composition of “the team” changes substantially day-to-day.

42 The Original D March 22, 2016 at 4:09 pm

As a Broncos fan I was disappointed when 538 picked the Panthers in the Super Bowl. But having watched all the Broncos games since, oh, 1999, I felt like this year’s edition was a tougher team that had overcome more adversity to get to the Super Bowl, and that would prove to be the X-factor. Plus, GM John Elway had said that lack of team toughness was the reason he fired the head coach last year.

43 John Mansfield March 22, 2016 at 4:38 pm

Predictions close to the “theoretical optimum”? Does the theory have a name?

44 byomtov March 22, 2016 at 6:40 pm

I also wonder what this means.

45 mkt42 March 23, 2016 at 3:47 am

Meh. Using scores of past games to predict future games. The only innovative thing I see is that they are using a relatively new machine learning technique, RLSC (regularized least squares classification), which may indeed yield better predictions than older techniques, but I’d be astounded if it were more than a marginal step forward. And most new techniques are clearly superior in only a fraction of their possible applications; other techniques will give superior results in other applications. In other words, there’s no guaranteed best crystal ball.
