Long ago I realized Wine Advocate and Wine Spectator couldn't tell me if I would personally like a wine. Nothing against the idea of a 100-point rating; it's just a subjective measure from somebody with different tastes.
But advanced baseball stats were different. I was an early adopter of VORP and OPS+, and I figured that if the methods of calculation of FRAA were over my head, that was my fault, not the statisticians'.
A stat called xFIP first raised doubts for me. xFIP became my equivalent of an overripe, undrinkable 98-point wine -- it taught me to doubt.
xFIP is supposed to stand for Expected Fielding Independent Pitching. It doesn't show what a pitcher has done, like ERA, or predict what he will do (I like PECOTA and understand that probability is not certainty). What xFIP does is say what that pitcher would have done in a parallel universe where he pitched the same games for a different team.
That hug is after Matt Cain's perfect game
Matt Cain is the ace on the current World Champions. He was the No. 2 starter on the World Champions in 2010. While salaries aren't a great measure of effectiveness, they are a marketplace indicator, and Matt Cain is now the highest-paid right-handed pitcher in baseball.
And yet, baseball stats geeks have been saying for years that he's really not so good because his xFIP isn't good. Not like, for example, Edwin Jackson or Joe Blanton, who have had better xFIPs than Cain for the past three seasons. There's not a real baseball team on the planet that would trade Cain for either of those guys. But in the parallel world, Jackson's a great pitcher and Cain is meh.
It's a lot like saying, as Wine Advocate has for years, that great Burgundies that have been shown for decades to age well and are sought by restaurateurs the world over aren't actually as good as some super-rich Cabernet with an expensive label that some new winery just made.
I've been down on xFIP for a while, but where I finally have broken completely from "advanced" baseball stats is on WAR.
WAR is supposed to stand for Wins Above Replacement. It's supposed to compare all players in baseball in every position and every era, so you can end arguments by saying Barry Bonds in 2001 was slightly better than Babe Ruth in 1927 (12.2 to 12.1). That's appealing: with WAR, you can compare the value of shortstops to relief pitchers and not assume it's apples vs. oranges (sweeter vs. more refreshing). With so much money riding on baseball contract decisions, teams like a stat like WAR that might allow them to stop overpaying less valuable players.
WAR has so completely taken over arguments between baseball geeks that you never see any other stats anymore. There are just two problems with WAR:
1) Nobody who quotes it knows where it comes from, and
2) Like xFIP, it appears to be based on a parallel universe in which the actual results of games didn't happen.
First, point 1. Like the quarterback rating in football, WAR is calculated using an algorithm of other stats. And like quarterback rating (which is more mysterious to me), nobody can define that algorithm. ESPN's David Schoenfield, whose job includes trying to explain advanced stats to the SOX ROCK! baseball crowd, wrote more than 1,700 words about WAR in July. That's the length of two newspaper columns. Yet he doesn't give the algorithm. And only by reading closely did I learn that there are actually two competing versions of WAR, one from Baseball-Reference and the other from FanGraphs.
Just like Wine Advocate and Wine Spectator. Each one claims to be definitive. Neither really explains WHY low-acid, non-terroir-driven wines get 97 points while more complex wines that go better with food get 87. They just announce: NewCult Cabernet, 96.
It's hard, mentally, to dispute a definitive-looking number. But I can prove that WAR isn't accurate.
Keep in mind that WAR stands for Wins Above Replacement. A replacement-level player is not an average player; he's the guy that a team would pick up to play a position if their starter was injured. It's a good concept, because even below-average players have value. I'd rather see Barry Zito (WAR -0.3) start than a random AAA pitcher or a pitcher grabbed off waivers, though WAR states that I'd be slightly better off with the randomness.
Here's the problem. A team of all replacement players would be expected to win about 50 games and lose 112. That's how the algorithm of WAR is set. So if WAR measured what actually happened, then you could find out how many games a team won by adding up all its players' WAR and adding that total to 50.
Sometimes this works:
Detroit Tigers 37.1 WAR; 88 wins
Boston Red Sox 19 WAR; 69 wins
Oakland Athletics 42.5 WAR; 94 wins
And sometimes it doesn't:
San Francisco Giants 34.2 WAR; 94 wins
Baltimore Orioles 33.3 WAR; 93 wins
(I used baseball-reference's figures for WAR)
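For anyone who wants to redo that arithmetic, here is a minimal Python sketch. It assumes only what is stated above: the 50-win replacement baseline and the five team WAR totals and win counts listed. The names `REPLACEMENT_WINS`, `TEAMS` and `war_gap` are made up for illustration and do not come from Baseball-Reference or any stats library.

```python
# Baseline quoted in the post: a team made up entirely of
# replacement players is expected to win about 50 games.
REPLACEMENT_WINS = 50

# (team WAR, actual wins), copied from the figures listed in the post.
TEAMS = {
    "Detroit Tigers": (37.1, 88),
    "Boston Red Sox": (19.0, 69),
    "Oakland Athletics": (42.5, 94),
    "San Francisco Giants": (34.2, 94),
    "Baltimore Orioles": (33.3, 93),
}


def war_gap(team_war, actual_wins, baseline=REPLACEMENT_WINS):
    """Actual wins minus the wins implied by baseline + team WAR."""
    return round(actual_wins - (baseline + team_war), 1)


for team, (war, wins) in TEAMS.items():
    implied = REPLACEMENT_WINS + war
    print(f"{team}: implied {implied:.1f}, actual {wins}, "
          f"gap {war_gap(war, wins):+.1f}")
```

With these inputs the first three teams land within two wins of the WAR-implied total, while the Giants and Orioles each finish nearly ten wins above it.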
You're not REALLY winning, are you?
My disdain for this stat didn't happen at season's end, though. I'm from Baltimore; I root for all teams in orange and black. (Go Halloween!) The Orioles were surprising this year. Nobody, including me, predicted them to be good. I get that. Analysts were saying in May, June, July, etc., that the O's would fade. I get that. Predictions fail in all areas, as Facebook stock buyers and President Kerry can attest.
What irritated me was the redefinition of reality: not what the Orioles would do, but the dismissal of what they had done. When they were tied for first place in late September, some baseball analysts and plenty of smug camp followers were still saying that because of their run differential, the Orioles had not been and were not actually a good team. ESPN's Keith Law, whom I enjoy reading, wrote that while the Orioles were in the playoffs, which is pretty much the definition of a good team to me.
See the parallel to wine ratings? How can rosé or unoaked Sauvignon Blanc be great? It doesn't get over 90 points. Maybe it's had a nice little niche in summer, but great? Doesn't have the numbers. The greatest unoaked Sauvignon Blanc could never be better than an ordinary Napa Cab, right? Not according to the numbers.
This is not to say that WAR -- and wine ratings -- aren't interesting and even somewhat useful. WAR isn't definitive, though, and its supporters act as if it is, and say that anyone who disagrees (see the tiresome Mike Trout vs. Miguel Cabrera for MVP debate) is stupid.
The revolution against wine ratings has come for a number of reasons. But at its emotional core, it comes from a realization like the one I had about WAR. These numbers, they might mean something, but they don't mean everything. A wine might get only 85 points from an authority, and yet it can be the best wine available for what you actually want or need. Another wine might get 98 points and yet it just isn't that good.
I've always had faith in my own judgment; it was never shocking to me that subjective ratings might not be universal. But I'm not like most consumers. It took the WAR revelation for me to experience what an ordinary consumer might feel the first time he or she realizes wine ratings aren't definitive.
I want to type that it felt great. For the Orioles' and Giants' evaluations, sure, it did. I'm probably at the Giants' victory parade while you read this, cheering on postseason hero Barry Zito (WAR -0.3).
But it also felt chaotic. My understanding of baseball is now less orderly. I might not be able to use the same shortcuts and will have to devote more time and thought if I really care about which player is better. I have to decide how important solving such a question is to me. I can't just say "That guy, WAR 5.3, he's better."
I can live with more uncertainty in talking about baseball. But who wants uncertainty in every aspect of their life? Sometimes people don't want to think deeply about a decision.
When I swore off WAR, not only did I understand how it felt to reject the authority of wine ratings -- I also understood how it felt to have some regret for the passing of a more easily understood world.
And then I went to the victory parade. Uncertainty rules!
Follow me on Twitter: @wblakegray and like The Gray Report on Facebook.
This reminds me of a discussion I had with my pal Leo McCloskey many years ago when he was first developing Enologix. In the original iteration, intensity of extract won every time, so there was no way to say that a Pinot Noir could ever equal a Cabernet using this metric. Leo solved the problem easily and elegantly by dividing the wines into "style" categories, which are more or less families of wine based on historically or traditionally accepted levels of extract. One of the ongoing and frustrating aspects of the 100-point scale is the lack of something similar.
This is a very creative post. Even though I don't know sh(i)t about baseball stats, I appreciate the creative linkage here suggested.
Thanks Patrick.
I'm not going to bury my head in the sand and go back to exalting RBI and pitcher wins (Barry Zito had 15!). But I think OPS+ and ERA+ may be as advanced as stats can get without becoming subjective, so I recommend those.