Baseball umpires are not so great, and older umpires are much worse

This deep-dive analysis demonstrated that MLB umpires make certain incorrect calls at least 20 percent of the time, or one in every five calls. Research results revealed clear two-strike bias and pronounced strike zone blind spots. Less-experienced younger umpires in their prime routinely outperformed veterans, and umpires selected in recent World Series were not the best performers. Results showed a declining but still unacceptably high BCR score, but on a positive note, only a marginal inter-inning call inconsistency.

The most likely mistakes are made at the top of the strike zone. And older umpires really are worse:

Based on the research, professional umpires, similar to professional baseball players, have a standard peak. The study revealed that home plate umpires who made the Top 10 MLB performance list (2008-2018) had an average of 2.7 years of experience, and averaged 33 years of age with a BCR of 8.94 percent. None of these top performers had more than five years of experience or were older than 37…

In contrast to the overall top performers, research uncovered that umpires on the Bottom 10 MLB performance list (2008-2018) had an average experience level of 20.6 years, were 56.1 years of age, and had an average BCR of 13.96 percent. This group’s error rate was a staggering 56 percent higher than the top 10 MLB performers. Umpire Jerry Layne, with 29 years on the job and at age 61, sported the highest BCR, 14.18 percent. This performance research clearly indicates that more experience and age does not necessarily produce the best umpires.
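The "56 percent higher" figure can be reproduced from the two average BCRs quoted above. A quick check, using only those two numbers (the variable names are illustrative, not from the study):

```python
# Average Bad Call Ratios (percent) as quoted from the study.
top10_bcr = 8.94      # Top 10 list, 2008-2018
bottom10_bcr = 13.96  # Bottom 10 list, 2008-2018

# Relative increase: how much higher the Bottom 10 rate is, as a share of the Top 10 rate.
relative_increase = (bottom10_bcr - top10_bcr) / top10_bcr * 100

# Absolute gap, in percentage points.
absolute_gap = bottom10_bcr - top10_bcr

print(f"Relative increase: {relative_increase:.1f}%")
print(f"Absolute gap: {absolute_gap:.2f} percentage points")
```

Note that the 56 percent is a relative difference; in absolute terms the gap between the two groups is about five percentage points.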

Here is the full story, written by Mark T. Williams, who also did the data work, via the excellent John Chamberlain.

Comments

Bring back the American League umpire ball protectors, those bulky things umps held up and hid behind.

Old guys suck. Replace them with computers. Then America will be great again.

Very similar to federal judges; often make wrong calls and have a bias.

... Abolish umpires if the existing MLB ball/strike video technology is as good as claimed:

"For this research, we looked at game data from Baseball Savant, MLB.com, and Retrosheet. The time period chosen, the most recent 11 baseball regular seasons (2008-2018), presented nearly four million called pitches. Similar to players, MLB umpires were assigned numbers, so that games behind the plate could be easily tracked. All active umpires were included in this performance study, and their ability to accurately call balls and strikes was closely observed. All 30 major league parks are outfitted with triangulated tracking cameras that follow baseballs from the pitcher’s hand to across home plate. Ball location can be tracked up to 50 times during each pitch and accuracy is claimed to be within one inch."

Replace both umpires and cameras with Google algorithms, which will accurately predict the outcome of each pitch before it is thrown...

Considering that we have a generation that seems infatuated with virtual reality and contrived reality TV, and would prefer gaming on a console to playing in the dirt, the algo approach to sports actually scares the bejeez out of me.

Agree. Once the technology is proven, both batters and pitchers would accept the call. Who makes safe or out calls at home, though?

Given the trend... a video replay official.

IMHO, it's a trend in the wrong direction. Blown calls are a huge part of what makes sports memorable.

Tyler earlier (Thursday, April 11 "Assorted Links") linked to a summary of this study which included the following quote:

"MLB home plate umpires make incorrect calls at least 20% of the time – one in every five calls. In the 2018 season, MLB umpires made 34,246 incorrect ball and strike calls for an average of 14 per game, or 1.6 per inning. Last season, 55 games – 2.2% of the total played – ended with an incorrect call."

I questioned that in the comments because it did not seem consistent with the number of pitches thrown in a game. A couple of other commenters correctly pointed out to me that not all pitches require a judgement call by the plate umpire on whether it is a ball or strike---foul balls, hit balls, etc.

But now it seems I may have been on to something. Note that in this summary of the study the language has been changed. It is no longer that plate umpires "make incorrect calls at least 20 percent of the time", but now that umpires "make *certain* incorrect calls at least 20 percent of the time." It does not appear to me that *certain* is there merely to distinguish plate umpires from other umpires but to distinguish certain situational calls made solely by plate umpires.

Perhaps I am again overlooking something but if the *worst* home plate umpire has a "bad call ratio" (BCR) of only 13.96 percent, even that worst umpire is not making bad calls 20 percent of the time. The tables accompanying this latest summary seem to support a conclusion that the ratio is only in regard to the pitches that actually require a judgement call on balls and strikes.

It is disappointing that such a statistic is so poorly explained and misleading and yet garners all the headline attention.

I'm sure that home plate umpires make a lot of "mistakes" and that perhaps a sophisticated camera system might do a better job. However, one little detail in this latest summary also got my attention:

"For this research, we looked at game data from Baseball Savant, MLB.com, and Retrosheet. The time period chosen, the most recent 11 baseball regular seasons (2008-2018), presented nearly four million called pitches....All 30 major league parks are outfitted with triangulated tracking cameras that follow baseballs from the pitcher’s hand to across home plate. Ball location can be tracked up to 50 times during each pitch and accuracy is claimed to be within one inch. Statcast, a MLB subsidiary, is at the center of this system..."

OK, so accuracy "is *claimed* (by whom, the manufacturer?) to be within one inch". Bob Feller has been attributed with the observation that "Baseball is only a game, a game of inches and a lot of luck". Now, given that the machines are *claimed* to have an accuracy of within an inch, how many of those machine readings were actually wrong? Why do these researchers claim the umps were simply wrong without allowing for the margin of error of the machine on calls that came within an inch of the strike zone? This seems more of a hit job on plate umpires than an objective academic study.
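One way to make the margin-of-error objection concrete is to grade a call only when the tracked location is farther from the zone boundary than the claimed measurement error. The sketch below is a hypothetical illustration, not the study's method: the function names, the rectangular zone, the top and bottom heights, and the choice to ignore the ball's radius are all assumptions. Only the one-inch tolerance (from the quoted accuracy claim) and the 17-inch plate width are taken as given.

```python
import math

def distance_to_zone_edge(x, z, left, right, bottom, top):
    """Return (inside, distance in inches to the nearest zone boundary)."""
    inside = left <= x <= right and bottom <= z <= top
    if inside:
        # Inside the rectangle: the nearest edge is the smallest of the four gaps.
        dist = min(x - left, right - x, z - bottom, top - z)
    else:
        # Outside: distance to the rectangle, per axis, then combined.
        dx = max(left - x, 0.0, x - right)
        dz = max(bottom - z, 0.0, z - top)
        dist = math.hypot(dx, dz)
    return inside, dist

def grade_call(called_strike, x, z, zone, tolerance=1.0):
    """Grade an umpire's call, abstaining when the pitch is within the tolerance of the edge."""
    inside, dist = distance_to_zone_edge(x, z, *zone)
    if dist <= tolerance:
        return "too close to judge"
    return "correct" if called_strike == inside else "incorrect"

# Hypothetical zone in inches: plate is 17 wide (centered at x=0);
# bottom and top heights are made up for one imaginary batter.
zone = (-8.5, 8.5, 18.0, 42.0)

print(grade_call(True, 9.0, 30.0, zone))   # half an inch off the plate
print(grade_call(True, 12.0, 30.0, zone))  # well outside
print(grade_call(True, 0.0, 30.0, zone))   # middle of the zone
```

Under this kind of rule, a strike call on a pitch half an inch off the plate would be set aside rather than counted as an error, which would lower any measured bad call ratio.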

We have great ways to know, like seeing how errors increase on 2 strikes. Say what you will of the machine's accuracy, but it sure doesn't care about the strike count.

Yeah, difficult pitches are difficult to hit and difficult to tell whether they are going to be a ball or strike. The official definition of the strike zone shows how hard that can be -- the top is located from "the midpoint between the top of the shoulders and the top of the uniform pants" and the bottom "at the hollow beneath the knee cap."

There have been a number of articles over the last year complaining that computers aren't necessarily able to identify the strike zone accurately:

"Isn’t that ironic? Until MLB comes up with a machine-comprehensible definition of the top and bottom of the strike zone, machines will need the assistance of humans to define the strike zone for the machines."

https://tht.fangraphs.com/the-physics-of-roboump/

No doubt Bill Veeck would roster a double amputee

I read the other link as well, and umpires do not make mistakes 20% of the time (even when we don't deal with the error in the technology's measurement). It's closer to 10% of the time, at most (this has improved since 2008 as well, thanks to training particularly at the bottom of the zone). It's not clear what that author was talking about on the 20% mistake rate.

There was an HBO special on some of the mistake rates a year ago or so with Tobias Moskowitz, who has done a lot of work with umpire data. In that, it was reported as 33%, but this error rate was specific to pitches on the edge of the strike zone (hardest to call) and did not deal with the likelihood that 33% is overestimated probably due to measurement error. So in thinking about all pitches (including easier to call ones), the error rate is substantially lower, as I noted above.

"(this has improved since 2008 as well, thanks to training particularly at the bottom of the zone). "

Yes, but have you considered the possibility that some of that "improvement" as measured by the cameras is actually the result of the camera tracking technology getting better since 2008 so that it is the tracking that has improved and not the umps? (Note for pedants: I did say *some*).

Hi Vivian,

Good question. Short answer is yes. I was quite concerned about that early on in looking at the data. But I spent a good bit of time talking to some physicists and engineers who work closely with the technology, and they noted that it has largely remained the same (I exclude the bit of 2007 data because it was a time when machines were being better calibrated).

However, moving across the old PITCHf/x technology and the newer Statcast (Trackman) technology could be problematic in comparing velocity, movement, location, etc. So I generally stick to 2008 - 2014 or 2015 (PITCHf/x data available) or 2016-2018 (Statcast data available). The change in technology used is an important point that anyone working with the data should be aware of.

Additional evidence that it's the umpires, rather than the system, is that if we use the variation in changes to accuracy/strike zone size, we actually see that umpires were likely responsible for some 20-40% of the decline in run scoring from 2009 to 2014. Further, the improvement to the zone is specific to certain areas that - through news and other sources - we are somewhat aware that the league wanted specific improvements on (outside pitches, especially to lefties, and the bottom of the strike zone).

To clarify. This 4 million pitch-call study (2008-2018) focused on the performance of home plate umpires. MLB data and MLB provided strike-zone was used. This research uncovered that for certain calls, home plate umpires made incorrect calls at least 20 percent of the time. Yes, this study focused only on pitches called and ignored those that did not require judgement. Typically, each game, home plate umpires make judgement on over 50 percent of pitches thrown. Results demonstrated a 2-strike bias, umpires calling a true ball a strike 29 percent of the time. During the 2018 season, umpires made 34,246 incorrect ball and strike calls. This Bad Call Ratio (BCR) was over 9 percent. The older the umpire, typically, the higher the BCR.
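The figures quoted in this thread can be cross-checked with some rough arithmetic. The sketch below uses the study's numbers (34,246 incorrect calls in 2018, nearly four million called pitches over 11 seasons) plus two assumptions that are not from the study: a 2,430-game regular season (30 teams times 162 games, divided by two) and nine innings per game.

```python
# Figures quoted from the study.
incorrect_calls_2018 = 34_246
called_pitches_all_seasons = 4_000_000  # "nearly four million", so a rounded-up figure
seasons = 11                            # 2008-2018

# Assumptions for this sketch (not stated in the study).
games_per_season = 30 * 162 // 2        # 2,430 regular-season games
innings_per_game = 9                    # ignores extra innings and shortened games

per_game = incorrect_calls_2018 / games_per_season
per_inning = per_game / innings_per_game

# Average called pitches per season, then the implied 2018 bad call ratio.
called_per_season = called_pitches_all_seasons / seasons
implied_bcr = incorrect_calls_2018 / called_per_season * 100

print(f"Incorrect calls per game:   {per_game:.1f}")
print(f"Incorrect calls per inning: {per_inning:.2f}")
print(f"Implied BCR for 2018:       {implied_bcr:.1f}%")
```

On these assumptions the per-game and per-inning figures match the quoted 14 and 1.6, and the implied ratio lands a little above 9 percent, consistent with the clarification and well short of 20 percent.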

Thanks for the clarification, Mark. This error rate is much more consistent with the large body of academic and non-academic work on this issue.

Based on the description of the “2-strike bias” it seems you’re implying strikes are called more often in 2-strike counts, which would be counter to about 10 years of literature on this question. So I would be curious as to why you seem to find something so different, or if we are misunderstanding what you mean by bias (and its direction).

As I noted in another comment, I have a hopefully comprehensive bibliography of academic work on umpire accuracy and bias if you’re interested.

Great question. The 2-strike bias mentioned in our study was that once the count is reached by the batter, wrong strike calls increase and correct strike calls decline. Our finding is supported in previous research. One that comes to mind is in the 2017 study link provided. https://community.fangraphs.com/the-2016-strike-zone-and-the-umpires-who-control-it/ It’s on Table 5. Interestingly enough, based on a smaller data set, this study’s correct strike call percentage was even lower than in our study. I hope this helps in clarifying. Best.

I think your labeling makes it sort of hard to interpret. Indeed, the number of *true strikes* actually called strikes by the umpire decreases in 0-2 counts. In other words, your increase in "wrong strike calls" is an increase in incorrectly called balls (pitches in the zone being called balls more often). This is the interpretation that is consistent with past work, and what is reported in that Fangraphs article. Correct ball rates increase in these counts because the umpires trade off correct strike rates for correct ball rates (think of true strike rates as sensitivity, and true ball rates as specificity, and they're trading false positives for false negatives).
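The sensitivity/specificity framing can be made concrete with a small confusion-matrix calculation. The counts below are invented purely for illustration (they are not from any study); "strike" is treated as the positive class, so a true strike called a ball is a false negative and a true ball called a strike is a false positive.

```python
def call_rates(strikes_called_strike, strikes_called_ball,
               balls_called_ball, balls_called_strike):
    """Return (sensitivity, specificity, accuracy), treating 'strike' as the positive class."""
    true_strikes = strikes_called_strike + strikes_called_ball
    true_balls = balls_called_ball + balls_called_strike
    sensitivity = strikes_called_strike / true_strikes  # correct strike rate
    specificity = balls_called_ball / true_balls        # correct ball rate
    accuracy = (strikes_called_strike + balls_called_ball) / (true_strikes + true_balls)
    return sensitivity, specificity, accuracy

# Hypothetical counts, made up for this example only.
early_count = call_rates(88, 12, 180, 20)       # wider called zone
two_strike_count = call_rates(75, 25, 192, 8)   # shrunken called zone

for label, (sens, spec, acc) in [("early count", early_count),
                                 ("two strikes", two_strike_count)]:
    print(f"{label}: sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.3f}")
```

In this made-up example, shrinking the called zone with two strikes lowers sensitivity from 0.88 to 0.75 while raising specificity from 0.90 to 0.96: fewer true balls are called strikes, at the cost of more true strikes being called balls.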

Off the top of my head, there are 3 academic papers looking at this same effect (expanding zone in 3-0 counts, shrinking zone in 0-2 counts, etc., including my own paper at MDE, Kim & King (2014, Man. Sci.), and Green & Daniels (2018)) and in the book Scorecasting (Moskowitz & Wertheim, 2011). One of the papers shows that using some Bayes, umpires are changing their prior expectation such that decreasing the number of strike calls (calling more pitches balls) in 0-2 and 1-2 counts actually *improves* their overall accuracy, given the expected error rates (see Green & Daniels SSRN paper called Bayesian Instinct). There's also a non-academic look at this here:

https://www.baseballprospectus.com/news/article/28513/prospectus-feature-umpires-arent-compassionate-theyre-bayesian/

On an admittedly self-serving note, you might also be interested in my 2017 paper at Labour Economics that exhibits much of the age/experience effects you've discussed (particularly related to heterogeneous improvement rates over this period thanks to new training and evaluation) as well as DJ Hunter at J. Quant. Analysis in Sport showing similar results on which umpires tend to be the most accurate and consistent.

Thank you as well for this clarification. It confirms that my original suspicions were correct.

I also suspect that my comment may have been the reason that in your second published article you quietly changed that "umpires make incorrect calls at least 20 percent of the time" to "umpires make *certain* incorrect calls at least 20 percent of the time". The original quote was wrong and the second quote highly misleading even with the revised language. You owe it to yourself, the umpires you called out and academic and journalistic integrity to make the necessary corrections and clarifications in an update to those articles. It might also be a good idea to write Tyler Cowen an e-mail asking that he publish your clarification in a new post to this blog. I would have a lot more respect for you and your profession if those steps were taken.

Viv

We find the following sentence in the linked article "Moreover, MLB umpires have a pronounced biased[sic], greatly increasing the odds, on a two-strike count, that a true ball will incorrectly be called a strike."

Can this really be true? I would have sworn they had a bias against strikes on a 2-strike count, so as not to be decisive.

You are correct. This is the opposite of what a very large literature (academic and not) has actually found. The strike zone shrinks considerably in 2 strike counts, and grows considerably in 3 ball counts. Recent (very cool) work has also considered that this isn’t impact aversion, but Bayesian decisionmaking by umps that actually maximizes accuracy for them.

Are star pitchers given a wider strike zone than middling pitchers? Seems so. Is it because of bias or expectations (i.e., the umpire expects the star to hit the corners and the middling pitcher to miss)? "The most likely mistakes are made at the top of the strike zone." Not sure if the mistakes support a higher strike zone, but I suspect that's the case. Could it be that umpires, older umpires especially, want to see a wider (i.e., higher) strike zone? When I was involved in youth baseball, a recurring issue with umpires was that they had a wider (i.e., higher) strike zone. As coaches, we objected, not so much because of the issue of fairness but training. We wanted the young pitchers to locate their pitches low and the hitters to learn how to hit low pitches, because the older they became, the more likely that's what would be expected of them. My main role was to teach hitting, and I would spend hours in the batting cage with the young players, teaching them the mechanics of hitting line drives, and not trying to "lift" the ball for home runs. Umpires with a high strike zone would undo much of what I would teach the players. I don't watch enough MLB to know the current trends, about location of pitches and umpire bias, but from my experience in youth baseball, umpires are not the hitter's best friend even if they subconsciously think they are doing them a favor. [An aside, the top of the strike zone is not altogether clear: According to rule 2.00 of the Major League Baseball rule book, a strike zone is defined as "that area over home plate the upper limit of which is a horizontal line at the midpoint between the top of the shoulders and the top of the uniform pants. . . ." https://www.businessinsider.com/mlb-strike-zone-2014-9]

I know nothing about baseball umpiring. How much technical assistance (video, etc) do baseball umpires get during play? (At the top level, cricket umpires get lots of video and audio assistance.)

The worst call in the history of Detroit baseball was made by a 55-year-old umpire with 23 years of experience at the time:
https://en.wikipedia.org/wiki/Armando_Galarraga%27s_near-perfect_game

Baseball is a sport that could easily use technology to improve officiating. I'm thinking of putting sensors on the bases, using electronic eyes for foul lines and enhanced reality eyeware for home plate umpires. The officials could then focus on moving the game along, which would also be an improvement.

This could easily be done for tennis.

I thought this part was interesting:

Research results demonstrate that umpires in certain circumstances overwhelmingly favored the pitcher over the batter. For a batter with a two-strike count, umpires were twice as likely to call a true ball a strike (29 percent of the time) than when the count was lower (15 percent). These error rates have declined since 2008 (35.20 percent), but still are too high. During the 2018 season, this two-strike count error rate was 21.50 percent and repeated 2,107 times. The impact of constant miscalls include overinflated pitcher strikeout percentages and suppressed batting averages.

Anecdotally, it seems like the opposite happens quite a bit, too, where if the pitcher gets a quick 0-2 count, the umpires will start to favor the hitter, and the strike zone will magically shrink for a few pitches, until the count goes 2-2 or thereabouts. Maybe that doesn't happen as often as I seem to think, though.

Devil's Advocate view: the umpires' supposed bad calls actually add entertainment value to the game by forcing hitters to be more aggressive at the plate, rather than stand there and collect walks and foul balls. The BCR is a feature, not a bug.

It seems to me that this comparison doesn't really do much for us; there are so many variables changing from pitch to pitch. For example, catchers' ability to 'frame' a pitch has become celebrated in the analytics community, but is far from understood. Additionally, there is a path dependent element to each pitch call - did the umpire have a bad interaction in previous at-bats with this batter, or with this pitcher? Did the manager come out and yell at the umpire the at-bat before and therefore alter the umpire's bias in one direction or the other? This descriptive analysis definitely highlights that there is an accuracy problem, but to truly understand it, we'd need to hold some of these framing and path dependent variables constant. I'm not saying that replacing old umpires wouldn't fix the problem, but this analysis fails to provide much support that it would.

Whether "correct" calls are all that important in a game that's played by fallible humans is one thing. The age of the umpires is another, more obvious issue, and not just in baseball. The creaky fossils that officiate NFL games, who, unlike baseball umpires, aren't actual professionals, should have been replaced by younger men dedicated to the business years ago. Aged officials are a detriment to all sports. NHL hockey is less guilty in this respect in that referees and linesmen must be exceptional skaters, just as good as the players themselves. NHL referees average age 40, which means that at times they are younger than some of the players.

Most of these findings have been shown in the economics, management, and operations research literature already, along with many public analysts sharing these results for years. The specific age/experience result is prominently in my own paper at Labour Economics, a result that's been in a public working paper since 2014.

For anyone interested in umpire research, I've attempted to put together a comprehensive list of academic papers related to this group of professionals. Includes answers to many of the questions in the comments here (training effects/technical assistance, veteran/better pitchers get larger zones, home field advantage, bottom/top zone differences, changes over time, effects on game outcomes, etc.):

https://www.brianmmills.com/umpire-research.html

And for the feasibility of a robo ump:

https://www.baseballprospectus.com/news/article/37347/robo-strike-zone-not-simple-think/

"Baseball umpires are not so great, and older umpires are much worse"
What are these umpires not great at? And like, what is the point of this whole "baseball" thing anyway?

Headlines like this make me very wary indeed of the maximize sustainable economic growth ethic! Yes on the margin a little bit more makes our present day lives better. But what does the General Equilibrium look like when we start maximizing this or that variable? Compounding is in fact like magic but I am way more concerned about things like the Sonnenschein-Mantel-Debreu theorem, the theory of the second best, and the A-Prime/C-prime "theorem" idea of McCloskey. If growth is low, then there will be more years during which future humans live somewhat like we do now, and this should not be considered a tragedy.

I am curious if Tyler has comment on things like this paper:

Can intergenerational equity be operationalized?
WILLIAM R. ZAME (2007) TE
A long Utilitarian tradition has the ideal of equal regard for all individuals, both those now living and those yet to be born. The literature formalizes this ideal as asking for a preference relation on the space of infinite utility streams that is complete, transitive, invariant to finite permutations, and respects the Pareto ordering; an ethical preference relation, for short. This paper argues that operationalizing this ideal is problematic. Most simply, every ethical preference relation has the property that almost all (in the sense of outer measure) pairs of utility streams are indifferent. Even if we abandon completeness and respect for the Pareto ordering, every irreflexive preference relation that is invariant to finite permutations has the property that almost all pairs of utility streams are incomparable (not strictly ranked). Moreover, no ethical preference relation is measurable. As a consequence, the existence of an ethical preference relation is independent of the axioms used in almost all of formal economics and all of classical analysis. Finally, even if an ethical preference relation exists, it cannot be “explicitly described.” These results have implications for game theory, for macroeconomics, and for economic development.

A lot of comments on both the linked article and here suggest that the technology isn't good enough to accurately call balls and strikes. This is not true. "Umpire assist" has existed since the early 2000's and has gotten more accurate over the years: from 3" originally to less than 1" today.

To answer one question, it takes the rules for the size of the strike zone into account, tailored for each batter (the top of the zone is correctly determined, for example). The cameras are re-calibrated before each game, which takes about five minutes. I believe there are two tracking cameras high in the stands. There may be others near the dugouts (there was a famous incident where Curt Schilling smashed one of those because he didn't like the calls he was getting -- he had to pay for it). There's also a pitcher's view camera in deep center field (which is mostly there to show the umpires the thing works). The biggest risk is that the per-batter calibration might not be performed accurately, but it's fairly easy to do correctly. Umpires train on recordings from this system.

As the article explicitly states, no one is trying to get rid of the umpires, as they do far more than just call balls and strikes. The goal is to use the tracking technology to assist the umpire in making correct pitch calls.

I know in American baseball you don't gotta have wa, but is it possible hating on the umpire enhances a team's esprit de corps?

Indeed, historical discussions I’ve read talk about how, early in the 20th century, owners actively encouraged fans to “abuse” umpires because they knew it was good entertainment for them.

Right. I was thinking of the players, but fans too. I know this idea would be anathema to extreme baseball purists, who would I expect prefer perfect calling (I am thinking of the one in my family, for whom sentimental notions of sport, appeals to nostalgia and history, don't operate at all despite his total knowledge of same), but if you eliminate the umpire, the fans might turn (more of) their ire on their own team.

How much of this decline can be explained by (correctable?) vision?

Tough to say, but when tracking tech was introduced for umpire training in 2009 - although younger umpires were slightly better and improved more quickly using it - ALL umpires improved considerably relative to the rulebook strike zone. In other words, older umpires were able to improve their accuracy at relatively high rates. This provides some evidence that vision still seems to work quite well (though I'm sure it has an effect on the overall rates).

Maybe older umps can't crouch low enough to see the strike zone for smaller stature hitters. MLB mandated knee repairs might be in order

Interesting point on turning their ire to the team (no scapegoat).

Worth noting that perfect calling is also something that would have to be considered carefully. The fuzzy calls at the edges of the zone (especially the corners) going directly to perfectly called would probably make the strike zone much larger. This could have real undesired impacts on gameplay (which we know also happened from 2008 to 2014, as umpires were given training technology and lowered the bottom of the zone considerably). I think they'll eventually get there with the technology, certainly at least assisted, but it's not quite as easy as a lot of folks want to think due to these other game-level effects that can happen (plus error rates, etc.).

What if the variability of umpires (repeatability and reproducibility) was an integral part of the game?

Technology has ruined so much in our culture and civilization with its false promise of certainty and infallibility. It's a vanishing point people, we are crawling up our own bungholes.

The FIRST thing every child learns at recess is that sports are about 10% playing and 90% arguing about the play.

Agree. Baseball is (for me at least) about entertainment, not optimization.

Indeed. It's supposed to be about fun, humanity, unpredictability etc and etc. And then you are supposed to get back to your regularly scheduled life.

In other words, if you are still furious about the Saints non-call, you are doing it wrong.

This guy gets it

Like I didn't already know that getting older makes you worse at everything.

This is a clear case of a labor union protecting poor performers, mainly by resisting changes to the rules. The TV networks have had the tech to have machines call the balls and strikes for decades now, and they would do a better and fairer job.

Good to know there's data to back up my habit of screaming at umps.
