Let’s face it, it’s the only reason we remember Tony Batista.
In 2004, in his final full season as a major-leaguer, T-Bats drove in 110 runs for the Montreal Expos, despite a putrid .272 OBP. Although he was, arguably, the worst everyday player in the majors in ’04, he was hardly the worst player to ever drive in 100 runs (see Ruben Sierra, 1993), nor was 110 the highest RBI total ever amassed by a replacement-level player (see Joe Carter, 1990). However, for some reason, Tony Batista became a sabermetric icon, our favorite cause célèbre when we rage, rage against the RBI.
You’ve heard it before. RBIs are just neat round numbers and context. Given the opportunity to hit behind a couple of on-base machines like Brad Wilkerson and Jose Vidro, anybody could drive in 100 runs. But just because a blind squirrel gets a nut every once in a while, that doesn’t mean he should bat cleanup.
In the wake of T-Bats’ glorious season, the sabermetric cause was moving from its grassroots mail-order infancy to full-blown mainstream phenomenon, buoyed by New York Times Bestsellers, championship GMs, and senior columnists. When a broadcaster spouted a flurry of “traditionals” – batting average, homers, RBI, wins, saves – to make his point, basements full of fantasy addicts looked up from their digital almanacs and replied in unison: Bleh.
Give me OBP, give me OPS, give me IPO, give me WPA, give me K/BB; just don’t give me RBI! If you’re going to give me RBI, Mr. McCarver, I’d rather you gave me nothing.
And then came WAR.
The concept was ratified by the sabermetric Godfather, Bill James, who’d created Win Shares according to a similar ideology in 2002. It was a neoclassical economist’s wet dream, like baseball GDP: an elegant equation which accounted for all the sport’s diverse variables and yielded a single number roughly reducible to the oldest and most hallowed statistic of them all, the win. Hallelujah.
Wins Above Replacement is a beautiful idea. Euclidean grace in a quantum world. A simple answer, not only for age-old baseball conundrums like “Mantle or DiMaggio?”, but also a formula for unprecedented comparisons like “Rickey Henderson v. Johnny Bench” and “Roy Halladay v. Alex Rodriguez”.
There’s only one problem. It doesn’t work.
At least, not yet. Not in the fantastically straightforward way we try to use it. The idea is so good, so clarifying – like democracy or the rational market – that we really, really want it to work, and we’re willing to suspend our disbelief just a little while longer in the hope that it might. Because it’d be so great to know with statistical certainty that Albert Pujols was worth $200 million, that we really couldn’t win that pennant without Andy Pettitte, that Jacoby Ellsbury is definitely the AL MVP, and that Ben Zobrist is exactly 9.3% better than Adrian Gonzalez. Darn that dream.
The cruel irony, the I-could’ve-had-Sean-Doolittle-and-all-I-got-was-stupid-Barry-Zito irony, is that the problem with WAR is the same as the problem with RBI. It frequently measures context as much as performance. Especially when used to evaluate single seasons, it doesn’t sufficiently account for the inevitable variations in opportunity and environment.
A few weeks back I critiqued Steve Berthiaume’s analysis of Curtis Granderson’s defense by looking at some inconsistencies in the way Ultimate Zone Rating (the defensive metric associated with FanGraphs’ WAR) assesses outfielders. Mark Simon of the ESPN Stats & Info Blog followed up with a very interesting review of specific plays which have adversely affected Granderson’s low ratings in 2011. While Simon isn’t looking at UZR specifically, he does point out that most defensive metrics do not account for positioning and that half a dozen plays can cause sizable shifts in the aggregate numbers when we’re dealing with less than a season’s worth of data.
I’m not the only one who’s noticed that UZR frequently yields suspicious results in small samples, at Fenway, and when several good outfielders are playing alongside one another. I do, however, want to expand upon my claim that outfield UZR is substantively affected by flyball rates.
In the Granderson article I pointed out that the teams in each league which rank highest in outfield UZR for 2011 – Boston and Arizona – also ranked #1 in their league in FB%. This remains true. However, this is obviously not sufficient proof of correlation, for a couple of reasons. Not only is there a high possibility of coincidence in any single example, but both the D-Backs and Red Sox feature several outfielders traditionally regarded highly by both sabermetricians and scouts. For anybody who’s watched them consistently, it would be pretty hard to argue that the trio of Gerardo Parra, Chris Young, and Justin Upton isn’t among the best in the major leagues, no matter who’s on the mound.
So, I looked back at all teams that finished at the extremes of the flyball scale since 2003. I do not claim that there is a perfect or, in the parlance of economics, a “strong” correlation. That is, a team with a 35% flyball rate wouldn’t have a dramatic disadvantage in OF UZR compared to one at 38%. There is, however, significant evidence that pitching staffs with extreme batted ball tendencies can dramatically affect their outfielders’ UZR numbers. (I defined these extremes as upward of 40% at the high end and below 33% at the low end.)
Average OF UZR for FB% > 40.0: 10.1
Average OF UZR for FB% < 33.0: -10.6
Of the sixteen teams at the high end of the range, five finished #1 in their league in OF UZR. Of the 21 teams at the low end, only five finished with a UZR north of zero.
From these I would point to some interesting pieces of anecdotal evidence:
The 2010 Giants and their 40.7 FB% led the majors in outfield UZR by a substantial margin (40.7 to 31.6), despite the fact that they gave more than 1100 innings to Pat Burrell and Aubrey Huff, lead-footed former DHs who nonetheless somehow finished with positive UZRs for the season.
The 2007 Cubs had an exceptional 44.3 OF UZR in a season where they handed most of the innings to Alfonso Soriano, Jacque Jones, and Cliff Floyd, all of whom substantially outperformed their career numbers with some help from a Chicago staff that sent 40.6% of batted balls in their direction.
On the other side, the ’05 Cardinals, despite featuring some premier outfield talent in Jim Edmonds, Larry Walker, Reggie Sanders, and So Taguchi, finished with a -6.1 OF UZR, thanks to a pitching staff that put only 29.7% of batted balls in the air.
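For anyone who wants to rerun the comparison, here is a minimal sketch of the bucketing behind the averages quoted earlier. It uses only the three team-seasons named in the examples above, so it will not reproduce the full-sample figures of 10.1 and -10.6. The function and variable names are my own illustration, not anything published by FanGraphs.

```python
from statistics import mean

# (team-season, FB%, outfield UZR), exactly as quoted in the examples above.
# This is NOT the full 2003-2011 sample, just the three anecdotes.
TEAM_SEASONS = [
    ("2010 Giants", 40.7, 40.7),
    ("2007 Cubs", 40.6, 44.3),
    ("2005 Cardinals", 29.7, -6.1),
]

HIGH_FB = 40.0  # "upward of 40%" counts as an extreme flyball staff
LOW_FB = 33.0   # "below 33%" counts as an extreme groundball staff


def bucket(fb_pct):
    """Label a staff's flyball rate as high, low, or middle."""
    if fb_pct > HIGH_FB:
        return "high"
    if fb_pct < LOW_FB:
        return "low"
    return "middle"


def average_uzr_by_bucket(rows):
    """Group team-seasons by flyball bucket and average their OF UZR."""
    groups = {}
    for _team, fb_pct, uzr in rows:
        groups.setdefault(bucket(fb_pct), []).append(uzr)
    return {name: mean(values) for name, values in groups.items()}


averages = average_uzr_by_bucket(TEAM_SEASONS)
for name, value in sorted(averages.items()):
    print(f"{name}: {value:+.1f}")
```

Feeding the same function every team-season since 2003 would be the way to check the full-sample averages, assuming you have the FB% and OF UZR columns on hand.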
The difference between 30% and 40% can easily be several hundred plays, so when you consider Simon’s point about the significance of even a handful of mistakes in a few months of play, you can see what kind of advantage those extra opportunities provide.
This is not to say that UZR is useless, just that it is unreliable in single-season increments, and that unreliability is passed on to WAR, which we habitually use and misuse when discussing single seasons and partial seasons.
I can’t play several positions. (or “The Adam Dunn Effect”)
WAR’s move to the mainstream is deeply tied to the rising popularity of FanGraphs. One of the first of its “unlikely results” to spark considerable conversation was Ben Zobrist leading AL batters (and finishing behind only Albert Pujols and Zack Greinke overall) in 2009. Zobrist had a breakout season which was impressive by any measure, but his WAR was given a major boost by his defense (only Franklin Gutierrez and Nyjer Morgan got a greater advantage from fielding).
On one level, this seemed legit. Zobrist appeared at every position on the diamond in ’09 and over the years has proven himself to be an above-average defender at second base and in right field. Managers have long lauded the value of versatility and lavished praise on players like Zobrist, Mark DeRosa, and Placido Polanco, who play several key positions well and also swing decent sticks. Zobrist’s season looked like evidence of their wisdom.
But while it isn’t much of a stretch to believe that Zobrist’s glove was worth a couple wins to the Rays in 2009, try selling this: According to WAR, in 2011, Carlos Lee has had as much defensive value as Troy Tulowitzki.
There are two types of utilitymen: those who are given the job because they play many positions well and those who are given it because they play no position well. As yet, WAR struggles to distinguish between the two. It reads Houston’s inability to decide where Lee hurts them least as evidence of Lee’s versatility. It suggests that Howie Kendrick’s defense at second base has gone from average to exceptional since Mike Scioscia started giving him more starts in left field.
UZR results get weirder the smaller the sample gets. The utility player may log a thousand innings in total, thus suggesting his UZR is somewhat more reliable, but what actually happens is that several hyper-unreliable samples of a few hundred innings or less are bundled together like toxic mortgages and rated AAA.
WAR Hates Sluggers
One of the things for which advanced stats should be applauded is the extent to which they’ve decreased the fetishizing of the home run and raised awareness of all-around contributions. Jonah Keri and Dave Dameshek debated the relative merits of Willie Stargell and Tim Raines this week, largely based on the fact they had identical career WAR totals. Dustin Pedroia has a real shot at his second MVP, despite the fact that his “traditionals” (.309 AVG, 85 R, 18 HR, 74 RBI, 25 SB) are basically the same as Melky Cabrera’s (.303, 83, 17, 79, 17).
However, one can’t help but notice that a cross-section of the most intimidating hitters in the game are treated with relative disdain by the metric. It doesn’t like them because they play first base or left field (or DH), which aren’t scarcity positions. It doesn’t like that they are fat and slow.
While I understand that everybody would love to have Chase Utley or Troy Tulowitzki, a middle-of-the-order hitter who makes big contributions in the field and on the basepaths, as well as at the plate, the fact remains: building a lineup without a slugger (or two) is like building a mall with seven Sunglass Huts and no department stores. A few sluggers are swift, slender middle-infielders. Most of them aren’t. To paraphrase Reggie, there are lots of drinks and precious few straws. If you get left without one, no amount of Range Factor, WHIP, or baserunning acumen can save your season. Just ask the Padres, or the Mariners.
We’ve struggled to understand and statistically represent the effect hitters have on one another. Would Nyjer Morgan be hitting .306 if he weren’t batting directly in front of Ryan Braun and Prince Fielder? (WAR suggests, by the way, that Morgan has been more valuable on a per-game basis than Fielder.) Morgan is taking free passes this season at only about half his career rate. Has he become less patient? (On the other side of things, Adrian Gonzalez’s career OPS is fifty points higher when the pitcher is throwing from the stretch. He’s enjoyed that situation in 52% of his plate appearances in 2011.)
While I admit the difficulty of building a model that accounts for the effect a pairing like Braun/Fielder or Pujols/Holliday has on the rest of the lineup, this is one area in which I find the conventional wisdom to be irrefutable. While I applaud WAR (and other metrics) for aiding in our appreciation of defense and baserunning, it’s beyond asinine to conclude that Ellsbury is twice as valuable as Fielder. Too often WAR is used as a means of comparing oranges to apples. One of the things that makes baseball great is the diversity of the fruit basket. WAR gives incredible weight to the scarcity of shortstops, but no weight to the scarcity of pitcher-intimidating, strategy-altering cleanup hitters, which I see as a form of reverse discrimination.
These are not the last of the problems. WAR evaluates catching using only the ability to control the running game. There is abundant evidence that certain park factors have not been sufficiently accounted for. I’m not arguing, however, that WAR should be completely discounted. As yet, it is probably as good a singular statistic as is widely available. But WAR is not a debate-ending statistic, especially for single seasons. Even WAR’s adherents, like Dave Cameron, generally admit the margin of error is at least 15%. When we stubbornly suggest that a 0.5 WAR difference means anything, we are grossly exaggerating the statistic’s accuracy, even according to its creators. It remains true that any reasoned discussion of an individual’s contributions still requires analysis of the various components that go into WAR, as well as several that don’t, and, as such, subjectivity reigns.
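To put the margin-of-error point in concrete terms, here is a rough sketch. It assumes the 15% figure can be read as a proportional band around each player’s WAR, which is my simplification rather than anything FanGraphs specifies, and the WAR values in the example are hypothetical, not any real player’s.

```python
def war_interval(war, margin=0.15):
    """Return the (low, high) band around a WAR value.

    Assumption: the "15%" margin of error is treated as a simple
    proportional band, war +/- 15% of war.
    """
    spread = abs(war) * margin
    return (war - spread, war + spread)


def distinguishable(war_a, war_b, margin=0.15):
    """True only if the two bands do not overlap at all."""
    low_a, high_a = war_interval(war_a, margin)
    low_b, high_b = war_interval(war_b, margin)
    return high_a < low_b or high_b < low_a


# Hypothetical seasons: two players half a win apart, and two players
# three wins apart.
print(distinguishable(6.0, 6.5))  # half-win gap
print(distinguishable(3.0, 6.0))  # three-win gap
```

Under that reading, a 6.0 and a 6.5 have bands that overlap heavily, so the half-win gap tells us nothing, while a 3.0 and a 6.0 stay clearly apart.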
Statistical elegance is elusive. Variables get short shrift or go unaccounted for entirely. Results yield unintended consequences. Misunderstood data is misrepresented and polemicized. In the words of Tolstoy: WAR makes fools of us all.
EDITOR UPDATE (9/7/11): For Hippeaux’s reply, please click here. It’s certainly worth reading and addresses many of the thoughts, issues, concerns, debates mentioned in the comments below. Brien, too, has a follow-up that you might want to check out. Thanks. -J