Monday, September 07, 2009

Congrats to Ross Ohlendorf (and belatedly to A.J. Burnett)

On Saturday, Ross Ohlendorf pitched an immaculate seventh inning against the Cardinals. You can watch it here. I'd have to defer to mehmattski on this, but I would bet that it's the first time that an Immaculate Inning has been thrown where every strikeout was a dropped third strike.

On June 20th, AJ Burnett threw an immaculate inning of his own. You would think that a blog written by a Yankees fan and a Marlins fan who also follows Former Marlins, would've noticed when a Former Marlin throws an Immaculate Inning for the Yankees, against the Marlins. To our 6 readers and AJ Burnett, we wholeheartedly apologize for not posting about it when it happened.

Tuesday, April 28, 2009

Immaculate Inning: Daniel Bard

We at Immaculate Inning take a lot of pride in chronicling the rare feat which gives our blog its name; that is, striking out three batters in an inning using just nine pitches. It has only happened 41 times in major league history, but unfortunately we have no idea how common the feat is in the minor leagues. A few years ago it came to our attention that Chris Mason twirled an Immaculate Inning in a AA game. The immaculate inning fires take a while to get stoked with these minor league games, but we are proud to recount the feat of Red Sox prospect Daniel Bard. Major hat-tip to the Projo Sox Blog for bringing the performance to my attention.

Daniel Bard is a 23-year old righthander pitching for the Pawtucket Sox in AAA. He finished last season at AA, retiring 20 of his final 23 batters; that domination has continued into this season, as he sports a 1.69 ERA and has struck out 18 in 10.3 innings. One of those innings, and three of those strikeouts, came against the Rochester Red Wings on April 22. Some video of the immaculate inning can be found here.

The batters were Jason Pridie, Matt Tolbert, and Luke Hughes, the first three batters in the Rochester order. All of them are career minor leaguers, although the 23 year old Hughes appears to be a legit prospect. Pridie appears to go down swinging on three straight fastballs right down the pipe. Tolbert follows the advice of a nearby heckler ("swing!") and misses at three more fastballs from Bard. Some kind of offspeed pitch (curveball) is taken for a strike by Hughes before he swings way late on two fastballs, the second around his eyes. Bard, meanwhile, looks to be rather bored with AAA pitching, and should expect a call up to the majors sometime this season. Regardless of his future, Bard has solidified a place in history with his immaculate inning, and we offer him the highest congratulations!

Tuesday, March 24, 2009

Sweet Sixteen Predictions by Simulation

Now that I've taken a day to recover from watching some 40+ hours of basketball over the weekend, let's revisit the predictions made by my NCAA Tournament Simulator. Here's a link to the bracket that I picked based on the highest number of average wins in the tournament. As you can see, the picks did pretty well, landing in the 72nd percentile overall on ESPN. Thirteen of the Sweet Sixteen teams were picked correctly, and the bracket lost zero Elite Eight teams over the first weekend of play. The three most notable exceptions were West Virginia, UCLA and Wake Forest. The simulation could not have taken into account how absolutely uninspired these teams would play. It also missed Western Kentucky's win over Illinois, since the simulation didn't know about the injury to Chester Frazier.

West Virginia did replace Michigan State in the Most Likely Elite Eight according to the one million simulations. How likely was the first round overall? I wrote a script to count the number of times the simulation predicted the exact first round results in each region:

West = YES! 41733 times!
Midwest = YES! 2325 times!
East = YES! 84894 times!
South = YES! 13648 times!
Overall = Nope. 0 matches.
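The counting script itself isn't shown in the post, so here is a minimal sketch of how such a tally could work. It assumes (hypothetically) that each simulation is stored as a mapping from region name to a tuple of first-round winners; the function name and the toy data are made up for illustration.

```python
from collections import Counter


def count_region_matches(simulations, actual):
    """Count, per region, how many simulations reproduced the actual
    first-round winners, plus how many got every region right at once."""
    per_region = Counter()
    all_regions = 0
    for sim in simulations:
        # Regions where this simulation matched the real results exactly.
        hits = [region for region, winners in actual.items()
                if sim[region] == winners]
        per_region.update(hits)
        if len(hits) == len(actual):
            all_regions += 1
    return per_region, all_regions


# Toy data (made up): two regions, two first-round games each.
simulations = [
    {"West": ("A", "C"), "East": ("E", "G")},
    {"West": ("A", "C"), "East": ("F", "G")},
    {"West": ("B", "C"), "East": ("E", "G")},
]
actual = {"West": ("A", "C"), "East": ("E", "H")}

per_region, overall = count_region_matches(simulations, actual)
for region in actual:
    print(f"{region} = {per_region[region]} matches")
print(f"Overall = {overall} matches")
```

Comparing whole tuples keeps the check strict: a region only counts if every one of its first-round winners matches.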

Upsets of Wake Forest, Utah, and West Virginia at the same time in the Midwest region rarely occurred in the same simulation, and when they did, that simulation did not get one of the other regions correct. In fact, in my pool of 1 million simulations, just 66 produced the correct first round results in three of the four regions. It seems that even if I could have entered all one million simulations, it would not be enough to win Yahoo's Perfect Bracket $1 million. Oh well.

So what do the Pomeroy ratings tell us about the Sweet Sixteen and beyond? To answer that I have two different approaches. One is to simply report the results of the final simulation from Sunday night, the results of which can be found in the data and graphs in this post. Those results are based on the Pythagorean Winning Percentages posted before the first round of the tournament. Four days and forty-eight games (not counting NIT games) later, the rankings are a bit different. How does the added information enhance or suppress the national title chances of each team left in the tournament?

Elite Eight Chances (Click for Chart)
Final Four Chances (Click for Chart)
Championship Game Chances (Click for Chart)
National Title Chances (Click for Chart)

Basically, the inclusion of all the statistics from the tournament games has improved the chances of Connecticut and Memphis winning the national championship, and hurt the chances for nearly everyone else. For Thursday and Friday's games, the teams that most improved were Connecticut (+8.2%), Villanova (+5.5%), North Carolina (+4.5%), and Kansas (+4%). Predictably, the teams that were most hurt by the newer statistics were the immediate opponents of those four teams. UNC-Gonzaga has gone from a tossup (51%-49%) to a more solid favoring of the top seed (55%-45%). The closest game of the Sweet Sixteen now projects to be Oklahoma-Syracuse, with the third-seeded Orange winning 52% of the time.

In the Final Four, Connecticut has actually seen its chances decrease, due to a much higher proportion allocated to Memphis and Missouri, but the Huskies still win the West region in 35% of the one million simulations. From the Midwest, Louisville is still the favorite with a slight edge over Kansas; Michigan State saw a drop in their chances with the inclusion of the new stats. The South is just as open as it was to start the tournament, but Syracuse maintains a healthy advantage, followed by Oklahoma. There is then a huge dropoff between those two and North Carolina and Gonzaga. Finally, the East regional still projects a showdown between Pittsburgh and Duke, with the Blue Devils given an ever so slight edge (29.00% to 28.28% for Pitt).

The updated stats say that the national title game is less likely to have a representative from the East region, compared with pre-Tourney stats. This is because the four remaining South regional teams all improved their title-game chances, while Duke had the biggest drop of all the teams (from 17.01% to 14.82%). The other half of the title game is still most likely to come from the West, which had Connecticut, Memphis, and Missouri all increase their chances with the inclusion of new stats.

It has, so far, been a tournament small on upsets. The simulator predicts that this trend will continue, with one small exception (#3 Syracuse over #2 Oklahoma), although many of the games project to be very close. One thing that could be improved in the model is the log5 predictions for teams with such similar Pythagorean Winning Percentages. This is one of the things I will be taking a look at in the offseason. In the meantime, it's only two more days until things get kicked off in Glendale, Arizona. Hooray basketball!

Friday, March 20, 2009

Progression of Final Four Chances

At the Immaculate Inning we've been playing all week with different ways to present data generated by our NCAA Tournament simulation. Here is what I feel is the most dynamic view of things: after every set of games on both Thursday and Friday, I set the probability of the losing teams to "0" and re-ran the simulation. I've then graphed the final four chances of every team, by section. You can see the results below (also click here to view the whole source spreadsheet):



You can click on each tab to view the chart for each region. There are lots of interesting trends in each region. The re-simulations are based on the Pomeroy rankings from Thursday, and do not take into account statistics from the first round games themselves.
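To make the re-simulation idea concrete, here is a rough sketch of zeroing out a loser's rating and re-running. The four-team bracket, the ratings, and the function names are all made up for illustration; this is not the actual simulator code. Under log5, a team rated 0 can never win, so the eliminated team simply drops out of every later round.

```python
import random
from collections import Counter


def log5(p_a, p_b):
    """Chance that a team rated p_a beats a team rated p_b (log5 method)."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)


def title_odds(bracket, ratings, n_sims=50_000, seed=2009):
    """Simulate a single-elimination bracket n_sims times and
    return each team's share of titles."""
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_sims):
        field = list(bracket)
        while len(field) > 1:
            # Adjacent teams in the list play each other; winners advance.
            field = [
                a if rng.random() < log5(ratings[a], ratings[b]) else b
                for a, b in zip(field[0::2], field[1::2])
            ]
        titles[field[0]] += 1
    return {team: titles[team] / n_sims for team in bracket}


# Hypothetical bracket: A plays B, C plays D, winners meet for the title.
bracket = ["A", "B", "C", "D"]
ratings = {"A": 0.95, "B": 0.60, "C": 0.90, "D": 0.70}  # made-up ratings

before = title_odds(bracket, ratings)

# Suppose D upsets C: set the loser's rating to 0 and re-run.
# (Two zero-rated teams must never meet, or log5 would divide by zero.)
after = title_odds(bracket, {**ratings, "C": 0.0})

for team in bracket:
    print(f"{team}: {before[team]:.1%} -> {after[team]:.1%}")
```

In this toy setup, team A's title chances rise without A playing a game, because its toughest potential opponent is gone. That is the same "other teams losing sends waves through the simulations" effect described in the regional notes.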

South Regional: The story here is the Final Four chances of #1 seed North Carolina. Note that these statistics do not take into account Ty Lawson's injury, and yet UNC has had their chances drop over the last two days. More specifically, they've stayed in about the same place, while four other teams have passed them in Final Four chances. Oklahoma is now the odds-on favorite, their chances jumping tremendously with the upset of Clemson elsewhere in the bracket. That is the story throughout; teams rarely improve their own Final Four chances with a win. Instead it's other teams losing that sends waves through the simulations. Three of the four remaining teams in the bottom half of the regional have a better final four shot than UNC, as does Gonzaga in the top half. Arizona State, meanwhile, has climbed from sixth to second in terms of Final Four chances, because they are favored in their matchup with Syracuse (52%-48%).

East Regional: Not much movement going on here, just some strengthening of chances for the favorites as the upsets just don't come. Remember, Wisconsin was heavily favored over FSU in the simulation, so the Seminoles' overtime loss doesn't have much effect on the rest of the regional. Basically, Wisconsin is now at 9%, having added FSU's original 4% to the Badgers' own 5% chances. Among the remaining teams, Texas has the worst chances, since they could have to go through Duke, UCLA, and Pitt (the top 3 teams, statistically), to make it to Detroit. Pittsburgh has the best chance of winning their second round game over Oklahoma St, while the Xavier-Wisconsin game should prove to be the closest of the second round.

Midwest Regional: One of the biggest jumps of the first round was in this regional-- Louisville is no longer the favorite, but instead Kansas wins the Midwest 25% of the time. This only had a little bit to do with Kansas' win over North Dakota St. As you can see from the graph, the gigantic jump came at 5 PM, when simulation favorite West Virginia went down in an uninspiring performance against Dayton. Louisville, the #1 seed, also reached a Final Four chance of 25% by the end of the day, thanks to the upset of Wake Forest by Cleveland State (an upset we predicted in this post). There seem to be two types of games in the second round-- Kansas and Louisville are 80% favorites, while Michigan State and Arizona are favored at the 60-65% rate. If a Sweet Sixteen berth for a low-seeded, mid-major team is your definition of "Cinderella," then Cleveland State's 38% chance of beating Arizona is the best slipper bet.

West Regional: Not much going on here, because there haven't been that many upsets. Our model did not see Maryland taking out Cal, but clearly they are a different team than the one which put up very mediocre numbers throughout the season. If Maryland can click their offense to the tune of 1.23 points/possession like against Cal, Memphis is going to be in for a long day. Purdue vs Washington is a coinflip (50.9% to 49.1%) and should be a very good game, while Missouri could have a tough time with Marquette.

Final Four Picture: There are tossups in pretty much every regional now, with Louisville-Kansas joining Pittsburgh-Duke and Memphis-Connecticut in the two-dog races. The South regional is as open as ever, and sees Oklahoma as the most likely representative. A UConn-Pittsburgh final still seems to be the most likely, while UConn and Memphis are the only teams winning more than 8% of the time (both are over 11%). This dynamic should change considerably after this weekend; currently the only major change was the elimination of West Virginia. Gonzaga and Kansas have slipped past Duke and are the fourth and fifth most likely championship teams.

So that's where we stand after the first thirty-two games of the 2009 NCAA tournament. Tomorrow and Sunday I'll be updating frequently with the chances of each team's advancement, and I will follow next week with a new simulation from the Sweet-Sixteen onwards! Till then, may your brackets be less busted than mine!

ACC Teams in NCAAT: Day 2



Above is a real time progression of the Final Four chances and the number of average wins for the seven ACC teams using my NCAA Tournament simulation. Yesterday there were a number of interesting trends, including the downward trend of Carolina's Final Four chances despite crushing Radford earlier in the day. In fact, they are no longer the favorite to win the South regional. Maryland improved their average number of wins from 0.44 to 1.15, despite the fact that they now have one actual win. This reflects the 15% chance that they will beat Memphis on Saturday.

As we enter Day 2, it will be interesting to see the chances of Boston College, Wake Forest, and Florida St before and after they play their games. It will also be interesting to follow the progression of Duke and Carolina's chances as the number of upsets increases. My next update will be after the 12 PM games. I don't expect there to be much effect on the ACC teams, but some upsets could send waves through other teams' chances (for example, if Stephen F. Austin upset Syracuse, it would solidify Oklahoma as the South regional favorite).

Thursday, March 19, 2009

ACC Teams in NCAAT: Real Time Chances

Throughout the day, I'm going to re-simulate the NCAAT as each team loses. Then, I am going to plot for each ACC team, their chances of making the Final Four. The first update should be around 2:30 Eastern, and will definitely have implications for North Carolina. Check back here often to see your team's chances change, in real time*!

Starting Chances:

North Carolina (#1 South): 15.28%
Duke (#2 East): 17.61%
Wake Forest (#4 Midwest): 9.72%
Florida State (#5 East): 4.62%
Boston College (#7 Midwest): 1.61%
Maryland (#10 West): 0.90%

*What the hell does "real time" mean anyway? As opposed to fake time? How would I update in fake time, anyway?

Update 1: 3:18 PM

Just ran a new simulation, taking into account the results of the 12 PM games. Three games, and already one pretty large upset, although you wouldn't tell it from the seeds. In the initial simulation, BYU beat Texas A&M 63% of the time. As you can see, there is not much change for the ACC teams. Click on the other tabs to see a handy progression chart for Final Four chances and for Average Wins. The biggest positive effects seem to be on the chances of UConn and Texas A&M making the Final Four (up 3-4% each), while no teams dipped all that much. The next update will be around 5 PM with the results of the 2:30 games, which will have a much bigger impact on the ACC teams, since two of them are playing...

Important note: the Pythagorean Win Percentages used to make this simulation are different from the ones used Sunday. I mistakenly did not save the original rankings, and the new rankings take into account adjustments based on the NIT results... if an NIT team played well, all of their opponents will have better adjusted stats. That is the reason why teams like Wake had their chances change from pre-tourney. I think the rest of the first round I will use today's statistics, rather than have them adjust each time.

Update #2: 12:52 AM

The results of Day 1 of the NCAA tournament are final. There were some exciting finishes in the first sixteen games, and by the seeds only one true upset. However, by the statistics there were some fairly unlikely results; BYU was favored 2-to-1 over Texas A&M, and Maryland was a 3-to-1 underdog against California. But as they say, that's why they play the games. Overall the "average wins" bracket was 12 for 16 (75%) on the first day, and lost zero teams beyond the second round. One of the games was as close to a coin flip as one can probably get; Butler beat LSU in a slim 50.62% of the original simulations.

For the ACC, the major changes are obviously for Clemson, upset by a hot shooting Michigan team, and Maryland, whose one actual win only improves their "average wins" score by 0.74! In terms of Final Four probability, both Duke and Carolina saw their chances decrease throughout the day, despite winning. This is because while both teams were heavily favored to win their games, the teams in their way were not as heavily favored. In those matchups where Duke was playing Minnesota and American on the way to the Elite Eight, Duke would be the heavy favorite; those matchups are now impossible in the simulation.

Overall, the team with the biggest "bump" today was Memphis, which rose to an 11.73% chance of winning it all, thanks to the Maryland upset. Connecticut also benefited from the Texas A&M "on paper" upset, rising to 11.14%. Those two teams now sit at a combined 50% chance to win the West region; it doesn't look promising for the challengers there.

The biggest story is probably that North Carolina is no longer favored to win the South regional. Oklahoma's win over Morgan St., and a very favorable matchup against Michigan in the second round, raised Oklahoma to a 22.93% chance to make the Final Four. This is exactly the sort of thing we were looking for with these predictions-- how the matchups dictate who has the best chances to survive and advance. This will probably change dramatically tomorrow, especially if Syracuse and Arizona St. hold serve in the rest of Oklahoma's bracket. Certainly something to keep an eye on.

Immaculate Inning Bracket

My NCAA tournament simulations have been the most popular thing I've ever done on Immaculate Inning. With the tournament starting in one hour, I thought I'd get my personal picks out there. First of all, here is the tournament, selected simply by picking the team with the most average wins in the tournament (click to enlarge):



But there's more to March Madness than simply statistics. Here is what I call the "Educated Intuition" bracket. It resembles the simulation bracket because I used those to educate my decisions. However, I overruled the bracket in several key matchups. Plus, I always have to have one bracket where Duke wins it all!



I'll be coming back to Tournament Simulations and breakdowns throughout the weekend and into next week. Thanks for visiting Immaculate Inning for your tourney prognostication needs!

Tuesday, March 17, 2009

Upset Special!

Hello again, welcome back to Immaculate Inning as we continue our week-long dive into the NCAA tournament, simulation style. In case you missed the posts, I've simulated the tournament one million times, and I've pulled from the data the most likely championship games and final fours. The link to the almighty spreadsheet is here.

This time I'm going to take a look much earlier in the tournament, as we fast approach the most exciting weekend of the sports year. Everybody loves a Cinderella, and everyone wants to brag about how they picked the upsets that filled the perfect brackets at work on Monday. This is going to be different from upset analysis you may have seen elsewhere, such as AccuScore, which simulates individual games 10,000 times. I've simulated the result of each game in the tournament once, then repeated that one million times. That number of simulations allows me to use statistical power that not even the flashy WhatifSports can match.
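For readers who want to see the difference in code, here is a minimal sketch of simulating a whole bracket once and repeating that many times, so every later-round matchup depends on who actually advanced. The four-team field, the ratings, and the function names are invented for illustration, and the run count is kept small so it finishes quickly; the real simulation used the full bracket and one million runs.

```python
import random
from collections import Counter


def log5(p_a, p_b):
    """Chance that a team rated p_a beats a team rated p_b (log5 method)."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)


def simulate_tournament(bracket, ratings, rng):
    """Play one single-elimination tournament; return the winners of each round."""
    field = list(bracket)
    rounds = []
    while len(field) > 1:
        winners = []
        # Adjacent teams in the list play each other.
        for a, b in zip(field[0::2], field[1::2]):
            a_wins = rng.random() < log5(ratings[a], ratings[b])
            winners.append(a if a_wins else b)
        rounds.append(winners)
        field = winners
    return rounds


# Made-up field: A plays B, C plays D, winners meet for the title.
bracket = ["A", "B", "C", "D"]
ratings = {"A": 0.95, "B": 0.60, "C": 0.85, "D": 0.80}

rng = random.Random(2009)
n_sims = 20_000
titles = Counter(
    simulate_tournament(bracket, ratings, rng)[-1][0] for _ in range(n_sims)
)
for team, count in titles.most_common():
    print(f"{team}: {count / n_sims:.1%}")
```

Because each run keeps the full list of round-by-round winners, the same output can later be searched for specific Final Fours or title games, which a game-by-game simulation cannot provide.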

First, let's look at the upsets that are matters of probability; the efficiency ratings say, point blank, that the lower seed should be favored to win.

Upset Special #1: #10 Southern California (65.5%) over #7 Boston College (34.5%). The Trojans have the highest percentage of winning the first round game for any double-digit seed, and they might not have even been in the tournament if it weren't for capturing the Pac-10 tournament title. Both teams are strong on the offensive glass and weak on the defensive glass, and both teams don't take very many threes. This game could be a bruiser in the paint. One trouble spot for a USC upset potential is their poor free-throw ability; in a close game, Boston College has a clear edge there.

Upset Special #2: #12 Wisconsin (53.1%) over #5 Florida State (46.9%). As an avid fan of nearly all ACC teams when it comes to the tournament, this one hurts. The Seminoles enter the big dance as one of the hottest teams in the nation, knocking off (an admittedly wounded) North Carolina on the way to a runner-up finish in the ACC Tournament. Toney Douglas is exactly the kind of player that can go off in a big tournament and carry his team a long way. Wisconsin, meanwhile, is plodding (59.9 possessions ranks 334th out of 344 Division I teams) and mistake-free (#5 in turnovers/possession and #6 in steals/possession in the nation on offense). The Badgers also failed to win twenty games and have no one particularly scary. This is one where I personally would have a hard time following my own simulation, but the Seminoles won just 0.82 games on average, by far the worst among the #5 seeds.

In terms of pure upsets predicted by the simulations, that's it for the first round. In general, if we were grading the committee based upon how well they matched higher seeded teams with higher Pomeroy efficiency ratings, they did pretty well. However, there are quite a few games that are "too close for comfort," when taking the seeds into account.

TCFC #1: #3 Kansas (80.7%) vs #14 North Dakota St (19.35%). NDSU, in their first tournament in their first year of eligibility, is a favorite upset pick among statheads like myself. The numbers were prettier a few weeks ago, but the Thundar (really? Thundar?) put up a pretty good offense for a minor-conference team. They can shoot lights out (40.2%, 10th in the nation), and Kansas hasn't defended the 3 very effectively this season. They also protect the ball pretty well (14th in turnovers/possession), while Kansas does not (244th). Bill Self's squad could be in trouble with this one.

TCFC #2: Dueling #13 seeds-- Mississippi St (23.8%) and Cleveland St (24.9%) both have much higher chances of knocking off their respective 4-seeds (Wake Forest and Washington). While the SEC champs would make for a nice story, the clear media favorite would be Cleveland St, a team which upset Butler in the Horizon League final to make the tournament. The Spiders won't spook anyone offensively, but they have a defense that is among the nation's best at taking the ball away. Washington, meanwhile, are in the middle of the pack in taking care of the ball, and their size should be more than enough to take care of Cleveland St. If I were the Huskies, I wouldn't be sleeping easy about a 1-in-4 chance of losing, however.

As for Wake Forest, I think we're noticing a trend; my simulation hates ACC teams not named Duke or Carolina. The other team not mentioned yet is Maryland, and my simulation has Maryland winning the fewest average games of any 10 seed, although they have a better shot at winning their opening round game than Michigan does, barely (35%). The folks filling out their bracket on ESPN disagree strongly, favoring Maryland over Cal 2-to-1.

Most casual bracket-fillers will lose interest after their brackets are busted by sometime Sunday evening; but the one who picks the correct surprise Sweet Sixteen teams is going to be the one bragging come Monday morning. So which low-seeded teams have the best chance to be standing after this weekend? These teams showed up in the Sweet Sixteen in at least ten percent of the simulations:

Wisconsin (#12 E): 26.5%
Southern California (#10 MW): 26.3%
Arizona (#12 MW): 17.9%
Michigan (#10 S): 10.9%
Minnesota (#10 E): 10.3%

I think it would be wise to be cautious about picking these #10 seeds to win two games this weekend. To see why, consider what the simulation was doing: picking at random (weighted by expected winning percentage) the winner of each game. So in some number of trials, the #2 seeds fell in the first round (Robert Morris and Morgan St. each won 8% of the time, for example). In those scenarios in which the #15 and #10 teams both won, the #10 seed is going to be a heavy favorite in the second round game. This inflates the chances of a #10 team making it through the second round; only a little bit has to do with the ability of the #10 seed to beat the #2 seed, by far the more likely opponent.
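The arithmetic behind that caution can be sketched as follows. The 65.5% first-round figure, the 26.3% Sweet Sixteen figure, and the roughly 8% chance of a #15 upset come from the post; the 90% chance of the #10 seed beating a #15 seed is purely an assumption for illustration, and the function names are made up. Since the simulation draws each game independently, the two-win chance splits into a "faced the #2 seed" path and a "faced the #15 seed" path.

```python
def two_win_chance(p_first, p_15_upset, p_vs_2, p_vs_15):
    """Chance a #10 seed wins two games: win the opener, then beat
    whichever of the #2 or #15 seed came through the other game."""
    return p_first * ((1 - p_15_upset) * p_vs_2 + p_15_upset * p_vs_15)


def implied_p_vs_2(p_two_wins, p_first, p_15_upset, p_vs_15):
    """Back out the head-to-head chance against the #2 seed from the
    overall two-win chance (the inverse of two_win_chance)."""
    return (p_two_wins / p_first - p_15_upset * p_vs_15) / (1 - p_15_upset)


p_first = 0.655      # USC over Boston College, from the post
p_two_wins = 0.263   # USC reaching the Sweet Sixteen, from the post
p_15_upset = 0.08    # #15 seed wins its opener, from the post
p_vs_15 = 0.90       # ASSUMPTION: #10 seed beating a #15 seed

p_vs_2 = implied_p_vs_2(p_two_wins, p_first, p_15_upset, p_vs_15)
via_15 = p_first * p_15_upset * p_vs_15 / p_two_wins

print(f"Implied chance of beating the #2 seed head-to-head: {p_vs_2:.1%}")
print(f"Share of Sweet Sixteen trips that came via the #15 seed: {via_15:.1%}")
```

Changing the assumed `p_vs_15` shifts both numbers, so the output is only as good as that guess.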

This is not the same with the #12 seed "Cinderellas" (not that major conference teams could ever count as such). Their upset win pits them, at worst, against a similarly-seeded #13 seed. Their high percentage really does suggest good matchups.

To finish, I present the best chances of winning two games this weekend, by seed:

1 seed: Louisville (80.23%)
2 seed: Memphis (83.98%)
3 seed: Missouri (60.50%)
4 seed: Gonzaga (68.66%)
5 seed: Purdue (47.16%)
6 seed: UCLA (54.30%)
7 seed: Clemson (34.66%)
8 seed: Brigham Young (24.29%)
9 seed: Tennessee (13.42%)
10 seed: Southern California (26.29%)
11 seed: Temple (9.05%)
12 seed: Wisconsin (26.51%)
13 seed: Cleveland St. (8.33%)
14 seed: North Dakota St. (4.01%)
15 seed: Robert Morris (1.62%)
16 seed: East Tennessee St. (1.09%)... yes, they have a 6% shot at beating Pittsburgh....

The Most Likely Final Four

Sorry that it has taken so long since my last post, I know that the masses are in need of more data, and help filling out their brackets. I have been working on a Python script to parse the massive amounts of data I produced with my 1 million NCAA tournament simulations. Essentially, what resulted is a data file containing the winners of each game in a single simulation; that file is 611 MB, if you were wondering. What I have done is pull out from that massive file the most common Final Fours and the most common Championship games, which I will present in a minute.
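The parsing script isn't reproduced here, but a sketch of the general approach might look like this. The file layout is an assumption: one simulation per line, winners comma-separated in game order. The function name and toy data are likewise hypothetical. Reading line by line with a `Counter` avoids loading a 611 MB file into memory all at once.

```python
import io
from collections import Counter


def most_common_groups(lines, start, stop, top=5):
    """Tally the winners found in columns start:stop of each line and
    return the `top` most frequent groups with their share of all lines."""
    counts = Counter()
    total = 0
    for line in lines:
        winners = line.strip().split(",")
        if len(winners) < stop:
            continue  # skip blank or truncated lines
        counts[tuple(winners[start:stop])] += 1
        total += 1
    return [(group, n / total) for group, n in counts.most_common(top)]


# Toy stand-in for the real output file: a four-team tournament where each
# line holds the two semifinal winners followed by the champion.
toy_file = io.StringIO("A,C,A\nA,D,D\nA,C,C\nB,C,C\n")

# Columns 0-1 are the title-game participants in this toy layout.
for matchup, share in most_common_groups(toy_file, 0, 2):
    print(" / ".join(matchup), f"{share:.0%}")
```

With a real file, passing an open file handle in place of `toy_file` would work the same way, with `start` and `stop` set to whichever columns hold the Final Four or title-game teams.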

Yesterday was the most successful day in Immaculate Inning history, with over 740 unique visitors, most of you coming from BallHype.com. I want to take a minute and point out some differences between what you'll find here and what other sites are producing. First, I noticed this article by the Wages of Wins Journal-- they do basically what I did for the ACC tournament, using both Pomeroy and Sagarin ratings. It's important to remember that the data on that site is discrete probabilities multiplied against each other; it's impossible to know how the winner of one game will affect the rest of the tournament.

Next, we have Joel Sokol of Georgia Tech, who uses a logistic regression model, based solely on margin of victory, to rank every team in Division I. He selects his bracket by picking the team that ranks higher, and according to his analysis, this method outperforms every other major bracket-picking method, whether it's seeds, ESPN's experts, or Sagarin rankings. That's pretty impressive, but once again, his choices do not take into account the effect of upsets on a single tournament.

Finally, there's a competing NCAA tourney simulation by Upon Further Review. There are two main differences between that simulation and mine. First, and perhaps most important; he doesn't show his work. A cursory look at the rest of the website shows a predilection for Basketball Prospectus, so perhaps we can assume he used efficiency ratings, but we just don't know. The second difference is that his is just 1,000 simulations. I'll admit that it doesn't seem obvious at first why having 1,000 times more simulations is necessarily better, other than the novelty of seeing Alabama State winning the tournament one or two times. I'm hoping to convince folks that the one million simulations really are better, because I can produce results like these: (click here to view the full spreadsheet)

The Most Likely Championship Game: Connecticut vs Pittsburgh

I searched my simulation output file for the winners of the initial final four matchups-- the championship game participants. There were 840 different matchups in the one million simulations. The championship games appearing in at least 1% (10,000) of the simulations, in order of decreasing likelihood:

Connecticut / Pittsburgh : 2.21%
Memphis / Pittsburgh : 1.86%
Louisville / Pittsburgh : 1.71%
Connecticut / Duke : 1.66%
Connecticut / North Carolina : 1.59%
Memphis / Duke : 1.43%
Memphis / North Carolina : 1.33%
Connecticut / Gonzaga : 1.31%
Louisville / Duke : 1.29%
Connecticut / Oklahoma : 1.28%
Connecticut / Syracuse : 1.27%
Louisville / North Carolina : 1.26%
Connecticut / Arizona St. : 1.22%
Connecticut / UCLA : 1.22%
Memphis / Gonzaga : 1.12%
West Virginia / Pittsburgh : 1.11%
Memphis / Syracuse : 1.09%
Memphis / Oklahoma : 1.09%
Louisville / Gonzaga : 1.02%
Memphis / UCLA : 1.02%
Memphis / Arizona St. : 1.02%

I'm fairly confident that a simulation of only 1,000 tournaments would be unable to separate the occurrence of one game versus another with any kind of power. As you can see, the first three most likely Championship Games include Pittsburgh. UCLA and Arizona St, both six seeds, are the lowest seeds commonly making an appearance in these most likely title game matchups. The left side of the bracket, representing the West/Midwest half of the tournament, appears a lot more stable than the right side; with one exception (West Virginia), just three teams are represented: Louisville, Connecticut, and Memphis. The right side of the bracket, meanwhile, has a lot more variability, with three teams from the East and five from the South each making an appearance in the likely title games list.
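A quick way to check that intuition is the standard error of a proportion, sqrt(p(1-p)/n). The sketch below uses the top two title-game frequencies from the list above (2.21% and 1.86%) and compares the gap between them to the sampling noise at 1,000 and at one million simulations. It is a back-of-the-envelope check under a simple binomial assumption, not a formal power analysis.

```python
import math


def standard_error(p, n):
    """Standard error of an observed proportion p over n independent trials."""
    return math.sqrt(p * (1 - p) / n)


p_top, p_second = 0.0221, 0.0186  # Connecticut/Pittsburgh vs Memphis/Pittsburgh
gap = p_top - p_second

for n in (1_000, 1_000_000):
    se = standard_error(p_top, n)
    print(f"n={n:>9,}: standard error {se:.4%}, "
          f"gap is {gap / se:.1f} standard errors")
```

At 1,000 simulations the noise in a single frequency is larger than the gap between the top two matchups, while at one million the gap is many standard errors wide.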

In case you're worried about my arbitrary cutoff of 1%, the next three most common championship games all featured Louisville (vs Syracuse, Oklahoma, and UCLA), followed by a Michigan St-Pittsburgh matchup and yet another Louisville game (vs Arizona St). Following a unique matchup between Purdue and Pittsburgh at 0.90%, there is a sharp dropoff in the frequency. The first 25 or so matchups are clearly the most common, and therefore the most likely. I suppose it means that if you are looking for a sure thing, Pittsburgh is a good bet to make the title game. However, if you're looking for a sleeper (not a #1 or #2 seed) to make the title game, it would be better to replace Pittsburgh with UCLA, Arizona St, or Gonzaga, because low seeds making the title game out of the West and Midwest is just not likely.

The Most Likely Final Four: Connecticut, Louisville, Pittsburgh, Oklahoma

As a Duke fan, I was saddened that Duke did not represent the East region in the most likely final four. However, I am overjoyed that the only #1 seed not to be there is North Carolina...
The power of the #1 seeds was actually quite strong-- the first five most likely brackets, representing nearly 1 percent of all simulations, featured UConn, Louisville, and Pittsburgh (one of which also included North Carolina). Anyway, there are 26,790 unique final fours in the simulation, 6,134 of which appear only once. Only 2,434 Final Fours occurred more than 100 times (0.01 percent). The most likely final four, listed above, occurred 2009 times (how's that for symmetry), or 0.2 percent.

Once again, the top heavy nature of the West region was clear; it was not until the 42nd most common final four that the West representative was not Connecticut or Memphis (it was Purdue). The first nine most common final fours list Louisville as the Midwest champ, and some sprinklings of West Virginia and Michigan State follow until the 37th most likely final four, which features Kansas. In the East, Pitt did capture those first five spots, and most of the top 20 (replaced by Duke in five of them, then UCLA in the 21st most likely final four). The first team to come out of the East that was not Pitt, Duke, or UCLA was Xavier in the 48th most likely Final Four. Finally, the South is just as wide open as we've been advertising, with five different teams in the first five most likely scenarios!

What does all of this mean for you, humble bracket filler? It means that under the most common bracket pool rules (more points for late round games than early round), someone is going to win the pool by picking the correct South regional winner. The other regions are fairly top-heavy with just a few likely options, but the South is where the money is at. These breakdowns don't really point to a favorite in the five-team cluster, although the initial simulation calls North Carolina the favorite.

It is a bit strange to note that Memphis is neither in the most likely title game, nor the most likely Final Four. They were a slight favorite to win the tournament in the initial simulation, just beating out UConn. I suppose you could say that whoever wins the West regional should be the odds-on favorite to capture the title!

Xenod and I are working on expanding the search through the simulation to incorporate the Elite Eight and Sweet Sixteen. I'm not sure if 1 million simulations is enough to tease apart the variance at those levels, but we will try. I'll also take a look at first and second round matchups from a different perspective. Stay tuned for all the tourney simulations you can handle, right here at Immaculate Inning!

Sunday, March 15, 2009

NCAA Tournament Predictions Using Simulations

It's been a crazy Championship Week across the NCAA, and parity ruled supreme across the land, leaving many college basketball fans scratching their heads as they attempt to fill out their brackets. Well, we at the Immaculate Inning have a treat for you: a complete breakdown of the recently released NCAA bracket based on the log5 prediction system and Ken Pomeroy's efficiency ratings. I did this for the ACC tournament by painstakingly filling out an Excel spreadsheet and running the numbers essentially by hand. This time was a bit different.

The Method. Briefly, this simulation takes in each team's "Expected Winning Percentage," calculated by taking the number of points a team scores and allows and transforming them into a win percentage. Instead of using raw scoring figures, I'm using the metrics invented by Ken Pomeroy, which take the tempo of a game out of the equation: we're dealing with how efficient a team's offense or defense is. Next, using the log5 prediction method (linked above), we can calculate how often a team with a given win percentage is likely to beat another team with a given win percentage. For example, a team with a .600 win percentage is projected to beat a team with a .400 win percentage 69.2% of the time.
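For readers who want to check that arithmetic, here is a minimal Python sketch of the log5 formula as it is usually stated. The function name `log5` is just a label chosen for illustration, not something taken from the simulation script.

```python
def log5(p_a, p_b):
    """Probability that team A beats team B under the log5 method.

    p_a and p_b are each team's expected winning percentage, between
    0 and 1. The formula is undefined when both are 0 or both are 1.
    """
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)


# The example from the text: a .600 team against a .400 team.
print(round(log5(0.600, 0.400), 3))  # 0.692
```

Two evenly matched teams come out at exactly 50%, and swapping the arguments gives the complementary probability.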

How well a team does in the NCAA tournament is affected by three things: how good a team is, how good their opponents are, and how likely it is to see a particular opponent. So while Louisiana State may salivate at the possibility of playing Radford in the second round of the tournament, it's just not likely to happen. For the ACC tournament, I calculated discrete probabilities for each matchup. This is where I've done things a bit differently. I have created a computer simulation (a script in the Python language, thanks to Xenod for guidance and helpful tips) for the NCAA tournament, and then I run it a bunch of times. The outcome of each game is random, weighted by the expected winning percentage of each team. The result is not just another table of log5 projections, but is the result of 1 million simulated NCAA tournaments. It's how the tournament looks, "on paper."
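A minimal sketch of this kind of simulation might look like the following. It is not the actual script: the function names, the four-team field, and the ratings are all invented for illustration, and it assumes the field has a power-of-two number of teams listed in bracket order.

```python
import random
from collections import Counter


def log5(p_a, p_b):
    """Chance that a team rated p_a beats a team rated p_b."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)


def play_tournament(bracket, ratings, rng):
    """Play one single-elimination tournament.

    `bracket` lists teams in bracket order (neighbors meet first) and
    must have a power-of-two length. Returns (champion, wins per team).
    """
    wins = Counter()
    alive = list(bracket)
    while len(alive) > 1:
        survivors = []
        for a, b in zip(alive[0::2], alive[1::2]):
            # Weighted coin flip: team a advances with its log5 probability.
            winner = a if rng.random() < log5(ratings[a], ratings[b]) else b
            wins[winner] += 1
            survivors.append(winner)
        alive = survivors
    return alive[0], wins


def simulate(bracket, ratings, n_sims, seed=2009):
    """Repeat the tournament n_sims times and tally titles and wins."""
    rng = random.Random(seed)
    titles, total_wins = Counter(), Counter()
    for _ in range(n_sims):
        champion, wins = play_tournament(bracket, ratings, rng)
        titles[champion] += 1
        total_wins.update(wins)
    return titles, total_wins


# Invented four-team field; the ratings are placeholders, not real data.
bracket = ["Team A", "Team B", "Team C", "Team D"]
ratings = {"Team A": 0.95, "Team B": 0.60, "Team C": 0.80, "Team D": 0.70}

n_sims = 10_000
titles, total_wins = simulate(bracket, ratings, n_sims)
for team in bracket:
    print(team, titles[team] / n_sims, total_wins[team] / n_sims)
```

Because each round's opponents are whoever survived the previous round, a team's title chances automatically account for how likely it is to face each possible opponent, with no need to work out those matchup probabilities by hand.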

So how did your favorite team fare in my simulations? Take a look at the spreadsheet below to find out! (It can also be accessed here for your sorting pleasure.)



The spreadsheet has tabs for each region; they are currently sorted by the "4" column, which is the chance that a given team will win "at least 4" games. That is, it is the chance that a team will win its region, advancing to the Final Four in Detroit. The other columns are similar, recording the percentage chance a team will win at least that many games. The exception is the "All Teams" tab, which is sorted by "Average Wins." This is the average number of wins a team accrued across the 1 million simulations. It ranges from Memphis (2.81 wins) to Chattanooga (0.02 wins).
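To make the column definitions concrete, here is a small Python sketch of how "at least N wins" percentages and Average Wins could be computed from one team's simulated win totals. The `summarize` helper and the ten win totals are made up for illustration; they are not taken from the spreadsheet.

```python
def summarize(win_counts, max_wins=6):
    """Summarize one team's win totals across many simulated tournaments.

    Returns a dict mapping k to the percentage of simulations with at
    least k wins, plus the average number of wins per simulation.
    """
    n = len(win_counts)
    at_least = {
        k: 100.0 * sum(1 for w in win_counts if w >= k) / n
        for k in range(1, max_wins + 1)
    }
    average = sum(win_counts) / n
    return at_least, average


# Invented win totals for one team over ten simulated tournaments.
win_counts = [0, 1, 1, 2, 4, 6, 0, 3, 0, 1]

at_least, average = summarize(win_counts)
print(at_least[4])  # 20.0 (won the region in 2 of 10 simulations)
print(average)      # 1.8
```

A handy sanity check: adding up the "at least" columns (as fractions rather than percentages) gives the average wins, since a team with w wins is counted once in each of the columns 1 through w.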

Now that all the data is out there, what does it mean? I believe this data can tell us a great deal about how the tournament was set up by the committee, and who has the "hardest" and "easiest" roads to the Final Four and the national title. To begin, the finding that Memphis has not only the largest number of average wins, but also the highest chances of winning the title, is not surprising. Pomeroy's ratings place Memphis squarely atop the nation, led by an amazing team defense. John Calipari's team continues to get little respect nationally despite three straight regional final appearances. The statistics say there is a high probability they will make it four straight.

The South region has the most parity, with five teams winning four games (and the region) at least 10% of the time. Interestingly, top-seed North Carolina ranks third in Final Four appearances from this group, behind Oklahoma and Syracuse. However, if North Carolina does survive the region, they have by far the highest number of national titles (5.68%) from the South region.

Five teams made the national championship game in at least ten percent of the simulations: Connecticut, Louisville, Pittsburgh, Memphis, and Duke. Obviously, only one of the UConn-Memphis and Pitt-Duke pairs can make the title game, but I think it speaks to the lower overall level of performance from teams in the West and East regionals. Indeed, in those regions, the #1 and #2 seeds accounted for the regions' champion more than 40% of the time, while in the Midwest, Louisville and Michigan State came close (39.9%). The South, meanwhile, lags far behind: the champ was either UNC or Oklahoma just 32% of the time.

Among the lower seeds, three of the #6 seeds stand out as having higher than average chances of going to Detroit. West Virginia, ranked highly by Pomeroy all season, is the highest non-1-or-2 seed in terms of Final Four percentage, at 15.74%. Their first round matchup against Dayton ranks as one of the least upset-prone games of the first round. How West Virginia fares in this tournament is perhaps a test case for the Pomeroy method-- how important are wins and losses, really, when you play pretty well in all those losses?

A similar case is UCLA, given a 6 seed in the East region despite having one of the best offenses in the country, statistically speaking. While their opening round game against VCU is no joke (and this Duke fan would know about that), they have the greatest chances of an Elite Eight appearance other than Duke and Pitt in this region. A third six-seed with high hopes could be Arizona State, in the apparently wide open South regional. Should ASU get past a tough Temple matchup in the first round, my simulator likes their chances against Syracuse. Marquette is the odd six seed out in the simulations, with the lowest number of 1-win and 2-win simulations for six seeds.

I will have much more on these simulations in the coming days, eventually culminating in The Immaculate Inning Most Likely Bracket-- which of the 9 quintillion possible brackets would Pomeroy's efficiency rating tell us to fill out?

If you have any suggestions on what kind of data analysis to do, how to improve the method, or if you'd like a copy of my Tournament Simulation script, comment here or shoot me an e-mail at mehmattski AT gmail DOT com. March Madness baby!