Tuesday, January 30, 2007

Using Team DYJS

Following the collapse of the Yankees in the 2006 Division Series, there was much discussion over at my favorite Yankees site (Bronx Banter) about What Went Wrong. More than a few fans began to rebel against what is considered the “new wave” of team construction: signing players with high on-base percentages, focusing on scoring as many runs as possible, and limiting traditional tactics such as stolen bases and sacrifice bunts. These fans were not at all satisfied with the response of the “statheads” to postseason performance, typified by Billy Beane’s comment: “My shit doesn’t work in the playoffs.”

Indeed, research conducted by Baseball Prospectus has produced reams of material showing the statistical correlations between regular season wins and on-base percentage, slugging percentage, and other more complicated stats. It was these correlations that led to the philosophies of team building made infamous by Moneyball. But, as in all statistics, sample size matters. The long-term trends of a 162-game schedule cannot be compressed into the do-or-die environment of a best-of-five series.

Baseball Prospectus also investigated this, using a system they developed to measure post-season success. Simply put, a team gets maximum points if it wins all eleven games it plays on the way to a World Series title, and minimum points for getting swept out of the Division Series. Using this formula, they tried to correlate post-season success with the metrics known to correlate well with regular season success: runs scored, runs allowed, on-base percentage, pitchers’ strikeout rate, and many more. What they found is that not one metric correlated with post-season success in any meaningful way. Perhaps the post-season really is a crapshoot.

Or is it? Another poster (who also has an excellent blog) and I started to toss around the idea that just because the long-term trends do not apply does not mean that there are no trends at all. Offensive production surely is not tied to winning in October: the 2006 Yankees scored 933 runs, but went 21 straight innings without scoring a run in the Division Series. So the question is this: in terms of post-season success, does having a consistent offense matter more than having a productive one? I have no formal training in statistics (which should change in the next year or so), but to me the way to find out was to investigate the standard deviation of runs scored on a game-to-game basis. Looking at this graphically, here’s the 2006 Yankees:

The graph is vaguely bell-shaped, but there are some clear outliers: the Yankees scored exactly 3 runs in a game significantly more often than a normal distribution would predict, and allowed exactly 2 runs a lot as well. Overall, the Yankees averaged 5.72 runs/game, with a standard deviation of 3.68. How does this compare to the team which knocked the Yanks from the playoffs? The Tigers averaged fewer runs per game (5.07), with a smaller standard deviation (3.39). Here’s the Tigers’ run distribution:
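For readers who want to reproduce this kind of summary from their own game logs, here is a minimal sketch in Python. The function name `run_summary` and the ten-game sample are made up for illustration, and the choice of the population standard deviation (`pstdev`) is an assumption; the figures quoted above may have been computed with the sample version instead.

```python
from collections import Counter
from statistics import mean, pstdev


def run_summary(runs_per_game):
    """Return (mean, population standard deviation, {runs: number of games})."""
    return mean(runs_per_game), pstdev(runs_per_game), Counter(runs_per_game)


# Made-up ten-game sample, NOT real 2006 data.
sample = [3, 7, 0, 5, 12, 3, 4, 6, 2, 8]
avg, sd, dist = run_summary(sample)

print(f"{avg:.2f} runs/game, standard deviation {sd:.2f}")

# Crude text histogram: one '#' per game with that run total.
for runs in sorted(dist):
    print(f"{runs:>2} runs: {'#' * dist[runs]}")
```

Feeding in a full season of run totals, one entry per game, gives the same kind of distribution the graphs here are meant to show.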

So is it true that the Tigers had a more “consistent” offense, and that’s why they were better equipped to win in the small-sample environment of a Division Series? Well, I’m not sure. As I said, I’m not a statistician, and I would welcome the assistance of one with this data. In the meantime, this looks like a perfect use for Did Your Job Stat. As I hinted in the comments to Sam’s original post, DYJS originated not from individual performance but from that of a team. How often did the Yankees’ offense “do their job” and score at least five runs in a game? How often did the Yankees’ pitching “do their job” and limit the opposition to four or fewer runs?
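The two questions above translate directly into a pair of rates. The sketch below is only an illustration: the function name `team_dyjs`, the tuple layout, and the five sample games are hypothetical, and this is not the parsing program mentioned in this post. The thresholds (at least five runs scored, four or fewer allowed) are the ones stated above.

```python
def team_dyjs(games, offense_target=5, pitching_limit=4):
    """Team DYJS rates from a list of (runs_scored, runs_allowed) tuples.

    Returns (offense_rate, pitching_rate):
      offense_rate  = share of games with runs_scored >= offense_target
      pitching_rate = share of games with runs_allowed <= pitching_limit
    """
    if not games:
        raise ValueError("need at least one game")
    offense = sum(1 for scored, _ in games if scored >= offense_target)
    pitching = sum(1 for _, allowed in games if allowed <= pitching_limit)
    return offense / len(games), pitching / len(games)


# Made-up five-game sample, NOT real 2006 data.
sample_games = [(6, 2), (3, 4), (5, 7), (1, 0), (9, 3)]
off_rate, pit_rate = team_dyjs(sample_games)
print(f"Offense did its job in {off_rate:.0%} of games")
print(f"Pitching did its job in {pit_rate:.0%} of games")
```

Keeping the thresholds as parameters makes it easy to test whether the five-run and four-run cutoffs are the right ones for a given season.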

Thanks to retrosheet.org and a parsing program written by Sam, I was able to collect this data for all MLB teams from the 2006 season. The results are very, very interesting and I will not try to say it all in one post. However, you can view the summarized data here. In the coming days and weeks I’ll try to dissect this data (as well as historical data from previous seasons) to determine what it is that DYJS can tell us about a team’s success in the regular season and the post season.


SOB21 said...

The more runs a team scores, the greater the standard deviation. This is the largest single reason for the Yankees having such a large standard deviation.

Second, the greater the advantage/disadvantage afforded in scoring runs by a team's home ballpark, the greater the standard deviation. Look at the Rockies. I like the concept, but think you've drawn conclusions that aren't necessarily true.

Matt said...

That is an excellent point. Using my data I found that the correlation between a team's runs/game and the standard deviation of its runs scored is 0.71, which is quite high. That's interesting, because while runs/game is well correlated with wins (duh), standard deviation is not.

As for your other point, I went to ESPN.com and found the park factors for all 30 stadiums in 2006. Based on these, the correlation between a team's home park factor and its runs scored per game is r = -0.09. The correlation between park factor and standard deviation of runs scored is even weaker, at r = 0.02. So while the Rockies have a hitters' park and a high deviation, they do not appear to be indicative of a trend.
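For anyone who wants to repeat this kind of check, a correlation like the ones quoted here can be computed with the standard Pearson formula. This is a minimal sketch, assuming Pearson's r is the measure in question; the function name `pearson_r` and the numbers in the sample lists are made up for illustration and are not the ESPN park factors or the 2006 results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations, and the two deviation norms.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / (norm_x * norm_y)


# Made-up values for five hypothetical teams, NOT real data.
park_factor = [0.92, 1.05, 1.00, 1.15, 0.97]
runs_std_dev = [3.1, 3.4, 3.0, 3.6, 3.3]
print(f"r = {pearson_r(park_factor, runs_std_dev):.2f}")
```

With one value per team (30 in all), an r near zero means park factor tells you almost nothing about how spread out a team's run totals are.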

Thanks for your comments.

SOB21 said...

Good call on checking the correlation with park factor - I would've thought that, since the home and away games essentially create two bell-shaped curves for runs scored, the further apart they are, the greater the standard deviation.

It's interesting that you got a negative correlation between park factor and runs scored - it doesn't make much intuitive sense. Although, when I think about it, a counterintuitive result seems more likely because when a home team is up after the eighth, they don't bat in the ninth, meaning about 11% fewer ABs on average in a win. With that in mind, it makes some sense - whatever advantage is afforded by having a beneficial home park is negated by losing that extra frame.


I hope you enjoy getting your formal training - you seem to be the kind of person who will find it interesting.