Tuesday, March 24, 2009

Sweet Sixteen Predictions by Simulation

Now that I've taken a day to recover from watching some 40+ hours of basketball over the weekend, let's revisit the predictions made by my NCAA Tournament Simulator. Here's a link to the bracket that I picked based on the highest number of average wins in the tournament. As you can see, the picks did pretty well, landing in the 72nd percentile overall on ESPN. Thirteen of the Sweet Sixteen teams were picked correctly, and the bracket lost zero Elite Eight teams over the first weekend of play. The three most notable exceptions were West Virginia, UCLA and Wake Forest. The simulation could not have taken into account how absolutely uninspired these teams' play would be. It also missed Western Kentucky over Illinois, since the simulation didn't know about the injury to Chester Frazier.

West Virginia did replace Michigan State in the Most Likely Elite Eight according to the one million simulations. How likely was the first round overall? I wrote a script to count the number of times the simulation predicted the exact first round results in each region:

West = YES! 41733 times!
Midwest = YES! 2325 times!
East = YES! 84894 times!
South = YES! 13648 times!
Overall = Nope. 0 matches.

Upsets of Wake Forest, Utah, and West Virginia at the same time in the Midwest region rarely occurred in the same simulation, and when they did, that simulation did not get one of the other regions correct. In fact, in my pool of 1 million simulations, just 66 produced the correct first round results in three of the four regions. It seems that even if I could have entered all one million simulations, it would not be enough to win Yahoo's Perfect Bracket $1 million. Oh well.
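For readers curious how a count like this could be done, here is a minimal sketch. It is not the original script: the data layout (one dict per simulation, mapping a region name to a tuple of first-round winners) and the function name count_exact_matches are assumptions made for illustration, and the teams are placeholders.

```python
from collections import Counter

def count_exact_matches(simulations, actual):
    """Tally how often each region's first round was predicted exactly.

    simulations: iterable of dicts, region name -> tuple of winners (assumed layout)
    actual:      dict, region name -> tuple of the real winners
    """
    per_region = Counter()       # region -> number of simulations that matched it
    regions_correct = Counter()  # k -> number of simulations with exactly k regions right
    for sim in simulations:
        hits = [region for region, winners in actual.items() if sim[region] == winners]
        per_region.update(hits)
        regions_correct[len(hits)] += 1
    return per_region, regions_correct

# Toy data: two made-up regions with two first-round games each.
actual = {"East": ("A", "C"), "West": ("E", "G")}
simulations = [
    {"East": ("A", "C"), "West": ("E", "H")},
    {"East": ("B", "C"), "West": ("E", "G")},
    {"East": ("A", "C"), "West": ("E", "G")},
]
per_region, regions_correct = count_exact_matches(simulations, actual)
print(per_region)
print(regions_correct)
```

The second counter is what answers questions like "how many simulations got three of the four regions right": it is simply the entry for k = 3.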

So what do the Pomeroy ratings tell us about the Sweet Sixteen and beyond? To answer that I have two different approaches. One is to simply report the results of the final simulation from Sunday night, the results of which can be found in the data and graphs in this post. Those results are based on the Pythagorean Winning Percentages posted before the first round of the tournament. Four days and forty-eight games (not counting NIT games) later, the rankings are a bit different. How does the added information enhance or suppress the national title chances of each team left in the tournament?

Elite Eight Chances (Click for Chart)
Final Four Chances (Click for Chart)
Championship Game Chances (Click for Chart)
National Title Chances (Click for Chart)

Basically, the inclusion of all the statistics from the tournament games has improved the chances of Connecticut and Memphis winning the national championship, and hurt the chances for nearly everyone else. For Thursday and Friday's games, the teams that most improved were Connecticut (+8.2%), Villanova (+5.5%), North Carolina (+4.5%), and Kansas (+4%). Predictably, the teams that were most hurt by the newer statistics were the immediate opponents of those four teams. UNC-Gonzaga has gone from a tossup (51%-49%) to a more solid favoring of the top seed (55%-45%). The closest game of the Sweet Sixteen now projects to be Oklahoma-Syracuse, with the third-seeded Orange winning 52% of the time.

In the Final Four, Connecticut has actually seen its chances decrease, due to a much higher proportion allocated to Memphis and Missouri, but the Huskies still win the West region in 35% of the one million simulations. From the Midwest, Louisville is still the favorite with a slight edge over Kansas; Michigan State saw a drop in their chances with the inclusion of the new stats. The South is just as open as it was to start the tournament, but Syracuse maintains a healthy advantage, followed by Oklahoma. There is then a huge dropoff between those two and North Carolina and Gonzaga. Finally, the East regional still projects a showdown between Pittsburgh and Duke, with the Blue Devils given an ever so slight edge (29.00% to 28.28% for Pitt).

The updated stats say that the national title game is less likely to have a representative from the East region, compared with pre-Tourney stats. This is because the four remaining South regional teams all improved their title-game chances, while Duke had the biggest drop of all the teams (from 17.01% to 14.82%). The other half of the title game is still most likely to come from the West, which had Connecticut, Memphis, and Missouri all increase their chances with the inclusion of new stats.

It has, so far, been a tournament small on upsets. The simulator predicts that this trend will continue, with one small exception (#3 Syracuse over #2 Oklahoma), although many of the games project to be very close. One thing that could be improved in the model is the log5 predictions for teams with such similar Pythagorean Winning Percentages. This is one of the things I will be taking a look at in the offseason. In the meantime, it's only two more days until things get kicked off in Glendale, Arizona. Hooray basketball!

Friday, March 20, 2009

Progression of Final Four Chances

At the Immaculate Inning we've been playing all week with different ways to present data generated by our NCAA Tournament simulation. Here is what I feel is the most dynamic view of things: after every set of games on both Thursday and Friday, I set the probability of the losing teams to "0" and re-ran the simulation. I've then graphed the final four chances of every team, by section. You can see the results below (also click here to view the whole source spreadsheet):

You can click on each tab to view the chart for each region. There are lots of interesting trends in each region. The re-simulations are based on the Pomeroy rankings from Thursday, and do not take into account statistics from the first round games themselves.

South Regional: The story here is the Final Four chances of #1 seed North Carolina. Note that these statistics do not take into account Ty Lawson's injury, and yet UNC has had their chances drop over the last two days. More specifically, they've stayed in about the same place, while four other teams have passed them in Final Four chances. Oklahoma is now the odds-on favorite, their chances jumping tremendously with the upset of Clemson elsewhere in the bracket. That is the story throughout; teams rarely improve their own Final Four chances with a win. Instead it's other teams losing that sends waves through the simulations. Three of the four remaining teams in the bottom half of the regional have a better final four shot than UNC, as does Gonzaga in the top half. Arizona State, meanwhile, has climbed from sixth to second in terms of Final Four chances, because they are favored in their matchup with Syracuse (52%-48%).

East Regional: Not much movement going on here, just some strengthening of chances for the favorites as the upsets just don't come. Remember, Wisconsin was heavily favored over FSU in the simulation, so the Seminoles' overtime loss doesn't have much effect on the rest of the regional. Basically, Wisconsin is now at 9%, having added FSU's original 4% to the Badgers' own 5% chances. Among the remaining teams, Texas has the worst chances, since they could have to go through Duke, UCLA, and Pitt (the top 3 teams, statistically), to make it to Detroit. Pittsburgh has the best chance of winning their second round game over Oklahoma St, while the Xavier-Wisconsin game should prove to be the closest of the second round.

Midwest Regional: One of the biggest jumps of the first round was in this regional-- Louisville is no longer the favorite, but instead Kansas wins the Midwest 25% of the time. This only had a little bit to do with Kansas' win over North Dakota St. As you can see from the graph, the gigantic jump came at 5 PM, when simulation favorite West Virginia went down in an uninspiring performance against Dayton. Louisville, the #1 seed, also reached a Final Four chance of 25% by the end of the day, thanks to the upset of Wake Forest by Cleveland State (an upset we predicted in this post). There seem to be two types of games in the second round-- Kansas and Louisville are 80% favorites, while Michigan State and Arizona are favored at the 60-65% rate. If a Sweet Sixteen berth for a low-seeded, mid-major team is your definition of "Cinderella," then Cleveland State's 38% chance of beating Arizona is the best slipper bet.

West Regional: Not much going on here, because there haven't been that many upsets. Our model did not see Maryland taking out Cal, but clearly they are a different team than the one which put up very mediocre numbers throughout the season. If Maryland can click their offense to the tune of 1.23 points/possession like against Cal, Memphis is going to be in for a long day. Purdue vs Washington is a coinflip (50.9% to 49.1%) and should be a very good game, while Missouri could have a tough time with Marquette.

Final Four Picture: There are tossups in pretty much every regional now, with Louisville-Kansas joining Pittsburgh-Duke and Memphis-Connecticut in the two-dog races. The South regional is as open as ever, and sees Oklahoma as the most likely representative. A UConn-Pittsburgh final still seems to be the most likely, while UConn and Memphis are the only teams winning more than 8% of the time (both are over 11%). This dynamic should change considerably after this weekend; currently the only major change was the elimination of West Virginia. Gonzaga and Kansas have slipped past Duke and are the fourth and fifth most likely championship teams.

So that's where we stand after the first thirty-two games of the 2009 NCAA tournament. Tomorrow and Sunday I'll be updating frequently with the chances of each team's advancement, and I will follow next week with a new simulation from the Sweet-Sixteen onwards! Till then, may your brackets be less busted than mine!

ACC Teams in NCAAT: Day 2

Above is a real time progression of the Final Four chances and the number of average wins for the seven ACC teams using my NCAA Tournament simulation. Yesterday there were a number of interesting trends, including the downward trend of Carolina's Final Four chances despite crushing Radford earlier in the day. In fact, they are no longer the favorite to win the South regional. Maryland improved their average number of wins from 0.44 to 1.15, despite the fact that they now have one actual win. This reflects the 15% chance that they will beat Memphis on Saturday.

As we enter Day 2, it will be interesting to see the chances of Boston College, Wake Forest, and Florida St before and after they play their games. It will also be interesting to follow the progression of Duke and Carolina's chances as the number of upsets increases. My next update will be after the 12 PM games. I don't expect there to be much effect on the ACC teams, but some upsets could send waves through other teams' chances (for example, if Stephen F. Austin upset Syracuse, it would solidify Oklahoma as the South regional favorite).

Thursday, March 19, 2009

ACC Teams in NCAAT: Real Time Chances

Throughout the day, I'm going to re-simulate the NCAAT as each team loses. Then, I am going to plot for each ACC team, their chances of making the Final Four. The first update should be around 2:30 Eastern, and will definitely have implications for North Carolina. Check back here often to see your team's chances change, in real time*!

Starting Chances:

North Carolina (#1 South): 15.28%
Duke (#2 East): 17.61%
Wake Forest (#4 Midwest): 9.72%
Florida State (#5 East): 4.62%
Boston College (#7 Midwest): 1.61%
Maryland (#10 West): 0.90%

*What the hell does "real time" mean anyway? As opposed to fake time? How would I update in fake time, anyway?

Update 1: 3:18 PM

Just ran a new simulation, taking into account the results of the 12 PM games. Three games, and already one pretty large upset, although you wouldn't tell it from the seeds. In the initial simulation, BYU beat Texas A&M 63% of the time. As you can see, there is not much change for the ACC teams. Click on the other tabs to see a handy progression chart for Final Four chances and for Average Wins. The biggest positive effects seem to be on the chances of UConn and Texas A&M making the Final Four (up 3-4% each), while no teams dipped all that much. The next update will be around 5 PM with the results of the 2:30 games, which will have a much bigger impact on the ACC teams, since two of them are playing...

Important note: the Pythagorean Win Percentages used to make this simulation are different from the ones used Sunday. I mistakenly did not save the original rankings, and the new rankings take into account adjustments based on the NIT results... if an NIT team played well, all of their opponents will have better adjusted stats. That is the reason why teams like Wake had their chances change from pre-tourney. I think the rest of the first round I will use today's statistics, rather than have them adjust each time.

Update #2: 12:52 AM

The results of Day 1 of the NCAA tournament are final. There were some exciting finishes in the first sixteen games, and by the seeds only one true upset. However, by the statistics there were some fairly unlikely results; BYU was favored 2-to-1 over Texas A&M, and Maryland was a 3-to-1 underdog against California. But as they say, that's why they play the games. Overall the "average wins" bracket was 12 for 16 (75%) on the first day, and lost zero teams beyond the second round. One of the games was as close to a coin flip as one can probably get; Butler beat LSU in a slim 50.62% of the original simulations.

For the ACC, the major changes are obviously for Clemson, upset by a hot shooting Michigan team, and Maryland, whose one actual win only improves their "average wins" score by 0.74! In terms of Final Four probability, both Duke and Carolina saw their chances decrease throughout the day, despite winning. This is because while both teams were heavily favored to win their games, the teams in their way were not as heavily favored. In those matchups where Duke was playing Minnesota and American on the way to the Elite Eight, Duke would be the heavy favorite; those matchups are now impossible in the simulation.

Overall, the team with the biggest "bump" today was Memphis, which rose to an 11.73% chance of winning it all, thanks to the Maryland upset. Connecticut also benefited from the Texas A&M "on paper" upset, rising to 11.14%. Those two teams now sit at a combined 50% chance to win the West region; it doesn't look promising for the challengers there.

The biggest story is probably that North Carolina is no longer favored to win the South regional. Oklahoma's win over Morgan St, along with a very favorable matchup against Michigan in the second round, raised the Sooners to a 22.93% chance to make the Final Four. This is exactly the sort of thing we were looking for with these predictions-- how the matchups dictate who has the best chances to survive and advance. This will probably change dramatically tomorrow, especially if Syracuse and Arizona St. hold serve in the rest of Oklahoma's bracket. Certainly something to keep an eye on.

Immaculate Inning Bracket

My NCAA tournament simulations have been the most popular thing I've ever done on Immaculate Inning. With the tournament starting in one hour, I thought I'd get my personal picks out there. First of all, here is the tournament, selected simply by picking the team with the most average wins in the tournament (click to enlarge):

But there's more to March Madness than simply statistics. Here is what I call the "Educated Intuition" bracket. It resembles the simulation bracket because I used the simulations to educate my decisions. However, I overruled the bracket in several key matchups. Plus, I always have to have one bracket where Duke wins it all!

I'll be coming back to Tournament Simulations and breakdowns throughout the weekend and into next week. Thanks for visiting Immaculate Inning for your tourney prognostication needs!

Tuesday, March 17, 2009

Upset Special!

Hello again, welcome back to Immaculate Inning as we continue our week-long dive into the NCAA tournament, simulation style. In case you missed the posts, I've simulated the tournament one million times, and I've pulled from the data the most likely championship games and final fours. The all-mighty spreadsheet is linked here.

This time I'm going to take a look much earlier in the tournament, as we fast approach the most exciting weekend of the sports year. Everybody loves a Cinderella, and everyone wants to brag about how they picked the upsets that filled the perfect brackets at work on Monday. This is going to be different from upset analysis you may have seen elsewhere, such as AccuScore, which simulates individual games 10,000 times. I've simulated the result of each game in the tournament once, then repeated that one million times. That number of simulations allows me to use statistical power that not even the flashy WhatifSports can match.

First, let's look at the upsets that are matters of probability; the efficiency ratings say, point blank, that the lower seed should be favored to win.

Upset Special #1: #10 Southern California (65.5%) over #7 Boston College (34.5%). The Trojans have the highest percentage of winning the first round game for any double-digit seed, and they might not have even been in the tournament if it weren't for capturing the Pac-10 tournament title. Both teams are strong on the offensive glass and weak on the defensive glass, and both teams don't take very many threes. This game could be a bruiser in the paint. One trouble spot for a USC upset potential is their poor free-throw ability; in a close game, Boston College has a clear edge there.

Upset Special #2: #12 Wisconsin (53.1%) over #5 Florida State (46.9%). As an avid fan of nearly all ACC teams when it comes to the tournament, this one hurts. The Seminoles enter the big dance as one of the hottest teams in the nation, knocking off (an admittedly wounded) North Carolina on the way to a runner-up finish in the ACC Tournament. Toney Douglas is exactly the kind of player that can go off in a big tournament and carry his team a long way. Wisconsin, meanwhile, is plodding (59.9 possessions ranks 334th out of 344 Division I teams) and mistake-free (#5 in turnovers/possession and #6 in steals/possession in the nation on offense). The Badgers also failed to win twenty games and have no one particularly scary. This is one where I personally would have a hard time following my own simulation, but Florida State won just 0.82 games on average, by far the worst among the #5 seeds.

In terms of pure upsets predicted by the simulations, that's it for the first round. In general, if we were grading the committee based upon how well they matched higher seeded teams with higher Pomeroy efficiency ratings, they did pretty well. However, there are quite a few games that are "too close for comfort," when taking the seeds into account.

TCFC #1: #3 Kansas (80.7%) vs #14 North Dakota St (19.35%). NDSU, in their first tournament in their first year of eligibility, is a favorite upset pick among statheads like myself. The numbers were prettier a few weeks ago, but the Thundar (really? Thundar?) put up a pretty good offense for a minor-conference team. They can shoot lights out (40.2%, 10th in the nation), and Kansas hasn't defended the 3 very effectively this season. They also protect the ball pretty well (14th in turnovers/possession), while Kansas does not (244th). Bill Self's squad could be in trouble with this one.

TCFC #2: Dueling #13 seeds-- Mississippi St (23.8%) and Cleveland St (24.9%) both have much higher chances than their seeds suggest of knocking off their respective 4-seeds (Wake Forest and Washington). While the SEC champs would make for a nice story, the clear media favorite would be Cleveland St, a team which upset Butler in the Horizon League final to make the tournament. The Spiders won't spook anyone offensively, but they have a defense that is among the nation's best at taking the ball away. Washington, meanwhile, are in the middle of the pack in taking care of the ball, and their size should be more than enough to take care of Cleveland St. If I were the Huskies, I wouldn't be sleeping easy about a 1-in-4 chance of losing, however.

As for Wake Forest, I think we're noticing a trend; my simulation hates ACC teams not named Duke or Carolina. The other team not mentioned yet is Maryland, and my simulation has Maryland winning the fewest average games of any 10 seed, although they have a better shot at winning their opening round game than Michigan does, barely (35%). The folks filling out their bracket on ESPN disagree strongly, favoring Maryland over Cal 2-to-1.

Most casual bracket-fillers will lose interest after their brackets are busted by sometime Sunday evening; but the one who picks the correct surprise Sweet Sixteen teams is going to be the one bragging come Monday morning. So which low-seeded teams have the best chance to be standing after this weekend? These teams showed up in the Sweet Sixteen in at least ten percent of the simulations:

Wisconsin (#12 E): 26.5%
Southern California (#10 MW): 26.3%
Arizona (#12 MW): 17.9%
Michigan (#10 S): 10.9%
Minnesota (#10 E): 10.3%

I think it would be wise to be cautious about picking these #10 seeds to win two games this weekend. To see why, consider what the simulation was doing: picking at random (weighted by expected winning percentage) the winner of each game. So in some number of trials, the #2 seeds fell in the first round (Robert Morris and Morgan St. each won 8% of the time, for example). In those scenarios in which the #15 and #10 teams both won, the #10 seed is going to be a heavy favorite in the second round game. This inflates the chances of a #10 team making it to the Sweet Sixteen; only a little bit has to do with the ability of the #10 seed to beat the #2 seed, by far the more likely opponent.

This is not the same with the #12 seed "Cinderellas" (not that major conference teams could ever count as such). Their upset win pits them, at worst, against a similarly-seeded #13 seed. Their high percentage really does suggest good matchups.
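The #10 seed argument can be written out as a small calculation. This is only a sketch: the function name and the two second-round probabilities below (0.35 against the #2 seed, 0.90 against the #15 seed) are made-up illustrative numbers, not values from the simulation. The 65.5% first-round figure and the 8% upset rate for the #15 seed are the ones quoted above.

```python
def sweet_sixteen_chance(p_win_first, p_two_seed_wins, p_beat_two, p_beat_fifteen):
    """Chance that a #10 seed wins two games, split by second-round opponent.

    Assumes the two first-round games are independent, which is how a
    game-by-game random simulation treats them.
    """
    p_fifteen_wins = 1.0 - p_two_seed_wins
    via_two_seed = p_two_seed_wins * p_beat_two          # face the #2 seed and win
    via_fifteen_seed = p_fifteen_wins * p_beat_fifteen   # face the #15 seed and win
    total = p_win_first * (via_two_seed + via_fifteen_seed)
    share_from_upset = via_fifteen_seed / (via_two_seed + via_fifteen_seed)
    return total, share_from_upset

total, share_from_upset = sweet_sixteen_chance(
    p_win_first=0.655,     # quoted above for Southern California
    p_two_seed_wins=0.92,  # the #15 seed won about 8% of the time
    p_beat_two=0.35,       # hypothetical
    p_beat_fifteen=0.90,   # hypothetical
)
print(f"Sweet Sixteen chance: {total:.1%}")
print(f"Share owed to the #15 seed's upset: {share_from_upset:.1%}")
```

Under these assumed inputs, a path that happens only 8% of the time accounts for a noticeably larger share of the #10 seed's Sweet Sixteen appearances, which is the inflation described above.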

To finish, I present the best chances of winning two games this weekend, by seed:

1 seed: Louisville (80.23%)
2 seed: Memphis (83.98%)
3 seed: Missouri (60.50%)
4 seed: Gonzaga (68.66%)
5 seed: Purdue (47.16%)
6 seed: UCLA (54.30%)
7 seed: Clemson (34.66%)
8 seed: Brigham Young (24.29%)
9 seed: Tennessee (13.42%)
10 seed: Southern California (26.29%)
11 seed: Temple (9.05%)
12 seed: Wisconsin (26.51%)
13 seed: Cleveland St. (8.33%)
14 seed: North Dakota St. (4.01%)
15 seed: Robert Morris (1.62%)
16 seed: East Tennessee St. (1.09%)... yes, they have a 6% shot at beating Pittsburgh....

The Most Likely Final Four

Sorry that it has taken so long since my last post, I know that the masses are in need of more data, and help filling out their brackets. I have been working on a Python script to parse the massive amounts of data I produced with my 1 million NCAA tournament simulations. Essentially, what resulted is a data file containing the winners of each game in a single simulation; that file is 611 MB, if you were wondering. What I have done is pull out from that massive file the most common Final Fours and the most common Championship games, which I will present in a minute.

Yesterday was the most successful day in Immaculate Inning history, with over 740 unique visitors, most of you coming from BallHype.com. I want to take a minute and point out some differences between what you'll find here and what other sites are producing. First, I noticed this article by the Wages of Wins Journal-- they do basically what I did for the ACC tournament, using both Pomeroy and Sagarin ratings. It's important to remember that the data on that site is discrete probabilities multiplied against each other; it's impossible to know how the winner of one game will affect the rest of the tournament.

Next, we have Joel Sokol of Georgia Tech, who uses a logistic regression model, based solely on margin of victory, to rank every team in Division I. He selects his bracket by picking the team that ranks higher, and according to his analysis, this method outperforms every other major bracket-picking method, whether it's seeds, ESPN's experts, or Sagarin rankings. That's pretty impressive, but once again, his choices do not take into account the effect of upsets on a single tournament.

Finally, there's a competing NCAA tourney simulation by Upon Further Review. There are two main differences between that simulation and mine. First, and perhaps most important: he doesn't show his work. A cursory look at the rest of the website shows a predilection for Basketball Prospectus, so perhaps we can assume he used efficiency ratings, but we just don't know. The second difference is that his is just 1,000 simulations. I'll admit that it doesn't seem obvious at first why having 1,000 times more simulations is necessarily better, other than the novelty of seeing Alabama State winning the tournament one or two times. I'm hoping to convince folks that the one million simulations really are better, because I can produce results like these: (click here to view the full spreadsheet)

The Most Likely Championship Game: Connecticut vs Pittsburgh

I searched my simulation output file for the winners of the initial final four matchups-- the championship game participants. There were 840 different matchups in the one million simulations. The championship games appearing in at least 1% (10,000) of the simulations, in order of decreasing likelihood:

Connecticut / Pittsburgh : 2.21%
Memphis / Pittsburgh : 1.86%
Louisville / Pittsburgh : 1.71%
Connecticut / Duke : 1.66%
Connecticut / North Carolina : 1.59%
Memphis / Duke : 1.43%
Memphis / North Carolina : 1.33%
Connecticut / Gonzaga : 1.31%
Louisville / Duke : 1.29%
Connecticut / Oklahoma : 1.28%
Connecticut / Syracuse : 1.27%
Louisville / North Carolina : 1.26%
Connecticut / Arizona St. : 1.22%
Connecticut / UCLA : 1.22%
Memphis / Gonzaga : 1.12%
West Virginia / Pittsburgh : 1.11%
Memphis / Syracuse : 1.09%
Memphis / Oklahoma : 1.09%
Louisville / Gonzaga : 1.02%
Memphis / UCLA : 1.02%
Memphis / Arizona St. : 1.02%

I'm fairly confident that a simulation of only 1,000 tournaments would be unable to separate the occurrence of one game versus another with any kind of power. As you can see, the first three most likely Championship Games include Pittsburgh. UCLA and Arizona St, both six seeds, are the lowest seeds commonly making an appearance in these most likely title game matchups. The left side of the bracket, representing the West/Midwest half of the tournament, appears a lot more stable than the right side; with one exception (WV), just three teams are represented: Louisville, Connecticut, and Memphis. The right side of the bracket, meanwhile, has a lot more variability, with three teams from the East and four from the South each making an appearance in the likely title games list.
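One rough way to check that claim is the standard error of a proportion, sqrt(p(1-p)/n), which treats each simulated tournament as an independent trial. The sketch below applies it to the top two matchups in the list (2.21% and 1.86%); the function name is just for illustration.

```python
import math

def proportion_standard_error(p, n):
    """Binomial standard error of an observed frequency p over n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)

p_top = 0.0221           # Connecticut / Pittsburgh
p_second = 0.0186        # Memphis / Pittsburgh
gap = p_top - p_second   # difference we would like to resolve

se_thousand = proportion_standard_error(p_top, 1_000)
se_million = proportion_standard_error(p_top, 1_000_000)

print(f"gap between top two matchups: {gap:.4%}")
print(f"standard error, 1,000 sims:     {se_thousand:.4%}")
print(f"standard error, 1,000,000 sims: {se_million:.4%}")
```

With 1,000 simulations the sampling noise on a single matchup is larger than the gap between first and second place, so their order would be close to arbitrary. With one million simulations the noise is many times smaller than the gap.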

In case you're worried about my arbitrary cutoff of 1%, the next three most common championship games all featured Louisville (vs Syracuse, Oklahoma, and UCLA), followed by a Michigan St-Pittsburgh matchup and yet another Louisville game (vs Arizona St). Following a unique matchup between Purdue and Pittsburgh at 0.90%, there is a sharp dropoff in the frequency. The first 25 or so matchups are clearly the most common, and therefore the most likely. I suppose it means that if you are looking for a sure thing, Pittsburgh is a good bet to make the title game. However, if you're looking for a sleeper (not a #1 or #2 seed) to make the title game, it would be better to replace Pittsburgh with UCLA, Arizona St, or Gonzaga, because low seeds making the title game out of the West and Midwest is just not likely.

The Most Likely Final Four: Connecticut, Louisville, Pittsburgh, Oklahoma

As a Duke fan, I was saddened that Duke did not represent the East region in the most likely final four. However, I am overjoyed that the only #1 seed not to be there is North Carolina...
The power of the #1 seeds was actually quite strong-- the first five most likely brackets, representing nearly 1 percent of all simulations, featured UConn, Louisville, and Pittsburgh (one of which also included North Carolina). Anyway, there are 26,790 unique final fours in the simulation, 6,134 of which appear only once. Only 2,434 Final Fours occurred more than 100 times (0.01 percent). The most likely final four, listed above, occurred 2009 times (how's that for symmetry), or 0.2 percent.
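Tallies like these are straightforward to produce once each simulation has been reduced to its four regional champions. Here is a minimal sketch using Python's collections.Counter. The in-memory layout (one tuple of West, Midwest, East, South champions per simulation) is an assumption, not the format of the actual 611 MB output file, and the four sample rows are invented.

```python
from collections import Counter

# Each row: (West, Midwest, East, South) champions from one simulated tournament.
# These four rows are made-up sample data.
simulations = [
    ("Connecticut", "Louisville", "Pittsburgh", "Oklahoma"),
    ("Memphis", "Louisville", "Pittsburgh", "North Carolina"),
    ("Connecticut", "Louisville", "Pittsburgh", "Oklahoma"),
    ("Connecticut", "Kansas", "Duke", "Syracuse"),
]

final_fours = Counter(simulations)

unique = len(final_fours)                                          # distinct Final Fours
seen_once = sum(1 for count in final_fours.values() if count == 1)  # appeared exactly once
most_common = final_fours.most_common(1)[0]                         # (final four, count)

print(unique, seen_once)
print(most_common)
```

The same pattern works for championship games: build the counter from two-team tuples instead of four-team tuples.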

Once again, the top heavy nature of the West region was clear; it was not until the 42nd most common final four that the West representative was not Connecticut or Memphis (it was Purdue). The first nine most common final fours list Louisville as the Midwest champ, and some sprinklings of West Virginia and Michigan State follow until the 37th most likely final four, which features Kansas. In the East, Pitt did capture those first five spots, and most of the top 20 (replaced by Duke in five of them, then UCLA in the 21st most likely final four). The first team to come out of the East that was not Pitt, Duke, or UCLA was Xavier in the 48th most likely Final Four. Finally, the South is just as wide open as we've been advertising, with five different teams in the first five most likely scenarios!

What does all of this mean for you, humble bracket filler? It means that under the most common bracket pool rules (more points for late round games than early round), someone is going to win the pool by picking the correct South regional winner. The other regions are fairly top-heavy with just a few likely options, but the South is where the money is at. These breakdowns don't really point to a favorite in the five-team cluster, although the initial simulation calls North Carolina the favorite.

It is a bit strange to note that Memphis is neither in the most likely title game, nor the most likely Final Four. They were a slight favorite to win the tournament in the initial simulation, just beating out UConn. I suppose you could say that whoever wins the West regional should be the odds-on favorite to capture the title!

Xenod and I are working on expanding the search through the simulation to incorporate the Elite Eight and Sweet Sixteen. I'm not sure if 1 million is enough to tease apart the variance at those levels, but we will try. I'll also take a look at first and second round matchups from a different perspective. Stay tuned to all the tourney simulations you can handle, right here at Immaculate Inning!

Sunday, March 15, 2009

NCAA Tournament Predictions Using Simulations

If you're looking for 2010 NCAA Tournament Simulations, you can find Immaculate Inning's One Million Simulations right here!

It's been a crazy Championship Week across the NCAA, and parity ruled supreme across the land, leaving many college basketball fans scratching their heads as they attempt to fill out their brackets. Well, we at the Immaculate Inning have a treat for you: a complete breakdown of the recently released NCAA bracket based on the log5 prediction system and Ken Pomeroy's efficiency ratings. I did this for the ACC tournament by painstakingly filling out an Excel spreadsheet and running the numbers essentially by hand. This time was a bit different.

The Method. Briefly, this simulation takes in the "Expected Winning Percentage" calculated by taking the number of points a team scores and allows and transforming it into a win percentage. Instead of using raw scoring figures, I'm using the metrics invented by Ken Pomeroy, which take the tempo of a game out of the equation- we're dealing with how efficient a team's offense or defense is. Next, using the log5 prediction method (linked above), we can calculate how often a team with a given winning percentage is likely to beat another team with a given win percentage. For example, a team with a .600 win percentage is projected to beat a team with a .400 win percentage 69.2% of the time.
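To make the two steps concrete, here is a short sketch of both formulas. The log5 function is the standard formula and reproduces the .600 vs .400 example above. The Pythagorean function takes the exponent as a required argument because the exact exponent used for these ratings is not stated here; the efficiency numbers and the exponent of 10 in the example call are placeholders, not real team data.

```python
def pythagorean_win_pct(off_eff, def_eff, exponent):
    """Expected winning percentage from offensive and defensive efficiency.

    off_eff and def_eff are points scored and allowed per 100 possessions.
    The exponent is left to the caller; its value here is an assumption.
    """
    scored = off_eff ** exponent
    allowed = def_eff ** exponent
    return scored / (scored + allowed)

def log5(a, b):
    """Probability that a team with win percentage a beats a team with win percentage b."""
    return a * (1.0 - b) / (a * (1.0 - b) + b * (1.0 - a))

# The worked example from the text: .600 team against a .400 team.
print(f"{log5(0.600, 0.400):.1%}")

# Placeholder efficiencies and exponent, purely for illustration.
print(f"{pythagorean_win_pct(110.0, 100.0, 10):.3f}")
```

Two quick sanity checks on log5: evenly matched teams come out at exactly 50%, and the two teams' chances against each other always add up to 100%.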

How well a team does in the NCAA tournament is affected by three things: how good a team is, how good their opponents are, and how likely it is to see a particular opponent. So while Louisiana State may salivate at the possibility of playing Radford in the second round of the tournament, it's just not likely to happen. For the ACC tournament, I calculated discrete probabilities for each matchup. This is where I've done things a bit differently. I have created a computer simulation (a script in the Python language, thanks to Xenod for guidance and helpful tips) for the NCAA tournament, and then I run it a bunch of times. The outcome of each game is random, weighted by the expected winning percentage of each team. The result is not just another table of log5 projections, but is the result of 1 million simulated NCAA tournaments. It's how the tournament looks, "on paper."
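To give a flavor of the approach, here is a stripped-down sketch of this kind of simulation. It is not the actual script: the four-team bracket, the team names, and the ratings are all made up, and it assumes the team list is in bracket order with a power-of-two number of teams.

```python
import random
from collections import Counter

def log5(p_a, p_b):
    """Chance that a team with win percentage p_a beats one with p_b."""
    return (p_a * (1 - p_b)) / (p_a * (1 - p_b) + p_b * (1 - p_a))

def simulate_bracket(teams, ratings, rng):
    """Play one single-elimination bracket; return (wins per team, champion)."""
    wins = Counter()
    alive = list(teams)  # bracket order: adjacent entries meet each round
    while len(alive) > 1:
        next_round = []
        for a, b in zip(alive[0::2], alive[1::2]):
            # Each game is a weighted coin flip based on the log5 probability.
            winner = a if rng.random() < log5(ratings[a], ratings[b]) else b
            wins[winner] += 1
            next_round.append(winner)
        alive = next_round
    return wins, alive[0]

def run_simulations(teams, ratings, n_sims, seed=2009):
    """Repeat the bracket n_sims times; return average wins and title share."""
    rng = random.Random(seed)
    total_wins = Counter()
    titles = Counter()
    for _ in range(n_sims):
        wins, champion = simulate_bracket(teams, ratings, rng)
        total_wins.update(wins)
        titles[champion] += 1
    average_wins = {t: total_wins[t] / n_sims for t in teams}
    title_pct = {t: titles[t] / n_sims for t in teams}
    return average_wins, title_pct

# Hypothetical teams and expected winning percentages, for illustration only.
teams = ["Team A", "Team B", "Team C", "Team D"]
ratings = {"Team A": 0.95, "Team B": 0.60, "Team C": 0.85, "Team D": 0.80}

average_wins, title_pct = run_simulations(teams, ratings, n_sims=10_000)
for team in sorted(teams, key=average_wins.get, reverse=True):
    print(f"{team}: {average_wins[team]:.2f} average wins, {title_pct[team]:.1%} titles")
```

Because opponents in later rounds are themselves decided by earlier coin flips, the "how likely it is to see a particular opponent" part takes care of itself.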

So how did your favorite team fare in my simulations? Take a look at the spreadsheet below to find out! (It can also be accessed here for your sorting pleasure.)

The spreadsheet has tabs for each region; they are currently sorted by the "4" column, which is the chance that a given team will win "at least 4" games. That is, it is the chance that a team will win its region, advancing to the Final Four in Detroit. The other columns are similar, recording the percentage chance a team will win at least that many games. The exception is the "All Teams" tab, which is sorted by "Average Wins." This is the average number of wins a team accrued across the 1 million simulations. It ranges from Memphis (2.81 wins) to Chattanooga (0.02 wins).
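The "at least N wins" columns and the "Average Wins" column are two views of the same data: adding up a team's "at least 1" through "at least 6" chances gives its average wins. Here is a small sketch of that relationship. The win counts are made up, and the helper name is hypothetical rather than something from the real script.

```python
def at_least_columns(win_counts, max_wins=6):
    """Share of simulations in which a team won at least k games, for k = 1..max_wins."""
    n = len(win_counts)
    return [sum(1 for w in win_counts if w >= k) / n for k in range(1, max_wins + 1)]

# Made-up results for one team across ten simulated tournaments.
win_counts = [0, 1, 1, 2, 4, 0, 3, 6, 1, 2]

columns = at_least_columns(win_counts)
average_wins = sum(win_counts) / len(win_counts)

print(columns)       # chance of at least 1, 2, ..., 6 wins
print(average_wins)  # matches the sum of the columns, up to rounding
```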

Now that all the data is out there, what does it mean? I believe this data can tell us a great deal about how the tournament was set up by the committee, and who has the "hardest" and "easiest" roads to the Final Four and the national title. To begin, the finding that Memphis has not only the largest number of average wins, but also the highest chances of winning the title, is not surprising. Pomeroy's ratings place Memphis squarely atop the nation, led by an amazing team defense. John Calipari's team continues to get little respect nationally despite three straight regional final appearances. The statistics say there is a high probability they will make it four straight.

The South region has the most parity, with five teams winning four games (and the region) at least 10% of the time. Interestingly, top-seed North Carolina ranks third in Final Four appearances from this group, behind Oklahoma and Syracuse. However, if North Carolina does survive the region, they have by far the highest share of national titles (5.68%) of any team from the South region.

Six teams made the national championship game in at least ten percent of the simulations: Connecticut, Louisville, Pittsburgh, Memphis, and Duke. Obviously, only one of the UConn-Memphis and Pitt-Duke pairs can make the title game, but I think it speaks to the lower overall level of performance from teams in the West and East regionals. Indeed, in those regions, the #1 and #2 seeds accounted for the region's champion more than 40% of the time, while in the Midwest, Louisville and Michigan State came close (39.9%). The South, meanwhile, lags far behind: the champ was either UNC or Oklahoma just 32% of the time.

Among the lower seeds, three of the #6 seeds stand out as having higher than average chances of going to Detroit. West Virginia, ranked highly by Pomeroy all season, is the highest non-1-or-2 seed in terms of Final Four percentage, at 15.74%. Their first round matchup against Dayton ranks as one of the least upset-prone games of the first round. How West Virginia fares in this tournament is perhaps a test case for the Pomeroy method-- how important are wins and losses, really, when you play pretty well in all those losses?

A similar case is UCLA, given a 6 seed in the East region despite having one of the best offenses in the country, statistically speaking. While their opening round game against VCU is no joke (and this Duke fan would know about that), they have the greatest chances of an Elite Eight appearance other than Duke and Pitt in this region. A third six-seed with high hopes could be Arizona State, in the apparently wide open South regional. Should ASU get past a tough Temple matchup in the first round, my simulator likes their chances against either Syracuse. Marquette is the odd six seed out in the simulations, with the lowest number of 1 win and 2 win simulations for six seeds.

I will have much more on these simulations in the coming days, eventually culminating in The Immaculate Inning Most Likely Bracket-- which of the roughly 9 quintillion possible brackets would Pomeroy's efficiency ratings tell us to fill out?

If you have any suggestions on what kind of data analysis to do, how to improve the method, or if you'd like a copy of my Tournament Simulation script, comment here or shoot me an e-mail at mehmattski AT gmail DOT com. March Madness baby!

Friday, March 13, 2009

Updated ACC Tourney Probabilities

With the first six games in the ACC Tournament complete, let's revisit the log5 predictions, which are based on tempo-free efficiency ratings accrued in ACC games only:

These are up to date following FSU's escape of Georgia Tech in the second afternoon game. As you can see, North Carolina has increased their chances of winning the tournament to better than 50/50. Duke's tournament chances have actually gone down, caused by no longer having the possibility of playing Virginia. Both of tonight's quarterfinal games have a similar 4-to-1 advantage for favorites Duke and Wake Forest. Maryland doesn't have much of a chance of winning the tournament, but should they pull the upset tonight, would that be enough to get the ACC a seventh team in Teh Dance?

The other story lines remaining in the ACC tournament are all about seedings. Carolina probably locked up their #1 seed with a win, considering that they're actually still playing, unlike UConn, Pitt, and Oklahoma. The ACC results are not in a vacuum; the seedings of Duke and Wake are heavily influenced by the results of the other tournaments. For example, someone upsetting Memphis, or Louisville capturing the Big East tournament, would have top-seed implications.

Stay tuned to Immaculate Inning for all your March Madness projection needs. We've got a big project in the works to unveil late Sunday or early Monday. NCAA Hoops- Awesome!

Wednesday, March 11, 2009

ACC Conference Play: Devourer of Stats

A few weeks ago, I made a very critical post about the 2008-2009 Duke team. Having come off of a very poor stretch, the once promising Blue Devils seemed to be succumbing to conference play, with disastrous consequences. I concluded that Duke's pounding of non-conference foes was clouding our view of their standing, statistically speaking. Pomeroy's rankings simply cannot account for the evolution of a team throughout a season; they treat a November blowout win the same as a February blowout win. And conventional wisdom would treat the latter as more indicative of a team's chances in March.

With the ACC season complete I thought I'd take one final look at Duke's performance between conference and non-conference play. The result is not pretty:

In red are all the categories in which Duke is performing worse in ACC play compared to out of conference play (this includes 2009 games against Davidson, Georgetown, and St. John's). With the exception of turnovers on offense, Duke is not playing as well. But clearly, the level of play in the ACC must affect all teams. So I then tallied up every team's tempo-free performances. Rather than post another spreadsheet, the results can be found here. Some major points:

1) Nearly every team saw both their offensive and defensive efficiencies drop when they were playing against ACC opponents. In fact, nearly every cell in the "Difference" part of my spreadsheet is colored red, meaning teams were also worse in other statistical categories. This probably makes sense, since the ACC is ranked the #1 conference by Pomeroy, and the #1 conference by Sagarin.

2) Overall, offense was less affected than defense. During ACC play, the conference teams averaged an efficiency of 104.5. Compared to the national average (100.1), it means that the ACC has an offense-heavy atmosphere. It would take further analysis to prove this point, but I believe this could have an effect similar to the "ballpark effect" in baseball; if the Oakland A's hit 300 home runs as a team, it would be more impressive than if the Colorado Rockies did it. By analogy I am suggesting that having a good offense in the ACC is not as impressive as having a good defense. This points a praising finger squarely at teams like FSU and Duke, the only teams to have defensive efficiency ratings below 100 during conference play.

3) Florida State is a major exception. While everyone else's offensive efficiency was dropping, Florida State actually improved their offensive efficiency in conference play. A large part of this comes from another category-- turnover rate. Along with Duke, the Seminoles are one of two teams to improve their turnover rate on offense against ACC foes. Their effective field goal percentage and offensive rebounding rate were also less affected by conference play than most.

4) NC State is probably the biggest culprit of Cupcake Syndrome. The Wolfpack's offensive efficiency dropped by 8.4 points in ACC play (the worst drop in the conference), and their defensive efficiency also dropped, by 18.0 points. They were an average team until January, and simply not a very good team in conference play.

5) There is no evidence for the notion that "The ACC refs call more fouls than the rest of the nation." Collectively, the ACC teams went to the charity stripe during 35.4% of their possessions during league play, compared to 41.4% of possessions in non-conference play. While free throw rate is not a perfect proxy for the number of fouls called, it is obvious that the ACC refs aren't as whistle happy as some would have you believe. On the other hand, during non-conference play, the opponents of ACC teams went to the free throw line in just 30.0% of possessions. There is certainly a connection between level of play and the number of fouls called; bad teams have bad defensive positioning and would tend to be whistled more often.

6) Continuing on the foul theme, Duke was near the top in free throw rate both in conference (39.9%, 4th) and out of conference (46.2%, 3rd), but that is by no means fuel for the DukeGetsAllTheCalls morons. In fact, every team (including Duke) saw their opponents go to the free throw line more frequently during ACC play, except one. That would be the Carolina Tar Heels, who inexplicably allowed free throws on 4% fewer possessions, compared to out of conference play. I'm not suggesting conspiracy; it's probably due to their Swiss cheese approach to half-court defense...

7) Wake Forest's defensive woes may be a bit misleading. Sure, they saw the biggest drop in defensive efficiency (19 points) of any team in the league. But, during league play they still have the best defensive effective field goal percentage, and the best defensive rebounding rate, of any team in the ACC. In this case, I'm guessing the problem was a cupcake pre-conference schedule (ranked 275th by Pomeroy), rather than some exposure by better competition.

8) Finally, we return to the most overanalyzed team in the country: Duke. It's amusing to me that any casual college basketball fan in the country right now can point to seven different reasons why the Blue Devils are not poised for greatness: they lack depth, they can't stop quick guards, they can't stop an inside presence, they don't play enough zone, they don't adapt in-game, ad nauseam. I wonder if those fans can note weaknesses so easily in other top 10 teams? Still, even I was receptive to this line of thinking a few weeks ago. But my comparison is clear: Duke is in the middle of the pack when it comes to their statistics being "affected" somehow by non-conference play.

In fact, contrary to my conclusions a few weeks ago, Duke's defense is one of the least affected by ACC play. On offense, Duke turns the ball over less frequently than any other ACC team, and has respectable rebounding numbers for a team with "no inside presence." The lesson: stop making judgments in a vacuum; statistics can be misleading if they are not in a relative context.

Monday, March 09, 2009

2009 ACC Tournament Predictions

Some of the hardest days as a sports fan come during early March; the worst of all are the four days between Selection Sunday and the first full day of NCAA tournament games. For this ACC fan, it is equally hard to bear the four days between the Duke-Carolina rematch and the start of the ACC tournament. Sure, there are plenty of actual games between now and then, but few of them actually matter, save the random upset of a top 25 mid-major and the corresponding bubble implications. To pass the time, I repeated an exercise I completed two years ago this week: predictions for the ACC tournament using the log5 method.

There will no doubt be predictions using Ken Pomeroy's rating system, all over the internet. (Here's one simple example.) I want to do something different; how do the predictions change, based on whether I use:

1) Winning Percentage
2) Raw Points Scored/Allowed
3) Pomeroy's Rankings (Full Season)
4) Raw Efficiency (ACC Games Only)

What follows are four Google spreadsheets tallying the information. Each sheet has three tabs. The first shows the calculated winning percentage for each team; for tests 2 through 4, my formula follows Ken Pomeroy's: PF^11.5/(PF^11.5+PA^11.5). The next tab shows the chances that the team in a given column will beat the team listed in a row, using the "log5" formula, discussed here. Finally, mindful of the ACC Tournament Bracket, the third tab predicts each team's chances to advance to the Quarterfinals, Semifinals, Finals, and their chances of being 2009 ACC Champion. Let's start with raw winning percentage.
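For readers who would rather see the three tabs as code than as spreadsheets, here is a rough Python sketch. It is an illustration, not the spreadsheet logic itself: the team names and scoring numbers are invented, and the bracket helper assumes a simple power-of-two field in bracket order, so it does not handle the byes in the real 12-team ACC bracket.

```python
def pythagorean_pct(points_for, points_against, exponent=11.5):
    """Tab 1: expected winning percentage, PF^11.5 / (PF^11.5 + PA^11.5)."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)

def log5(p_a, p_b):
    """Tab 2: chance that a team with win percentage p_a beats one with p_b."""
    return (p_a * (1 - p_b)) / (p_a * (1 - p_b) + p_b * (1 - p_a))

def advancement_probabilities(teams, ratings):
    """Tab 3: exact chance each team wins round 1, round 2, ... of the bracket."""
    reach = {t: 1.0 for t in teams}   # chance of having won every game so far
    result = {t: [] for t in teams}
    block = 1                         # size of each team's sub-bracket so far
    while block < len(teams):
        new_reach = {}
        for i, team in enumerate(teams):
            # Possible opponents this round come from the neighboring sub-bracket.
            start = ((i // block) ^ 1) * block
            opponents = teams[start:start + block]
            p_win = sum(reach[o] * log5(ratings[team], ratings[o]) for o in opponents)
            new_reach[team] = reach[team] * p_win
            result[team].append(new_reach[team])
        reach = new_reach
        block *= 2
    return result

# Invented (points scored, points allowed) pairs, for illustration only.
scoring = {"Team A": (80, 68), "Team B": (72, 71), "Team C": (76, 70), "Team D": (74, 69)}
teams = list(scoring)
ratings = {t: pythagorean_pct(pf, pa) for t, (pf, pa) in scoring.items()}

odds = advancement_probabilities(teams, ratings)
for team in teams:
    print(team, [f"{p:.1%}" for p in odds[team]])
```

The last number printed for each team is its chance of winning the whole bracket, and those chances add up to 100% across the field.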

So you can see that Duke has an .806 winning percentage, a 31.6% chance of beating UNC, and a 15.6% chance of winning the ACC tournament. Of course, winning percentage is kind of silly, because blowouts and squeakers count exactly the same. For this reason many baseball stat-heads turned to Pythagorean Win Percentage, which calculates a team's likely winning percentage given how much they score and how much they allow. This can be applied to basketball as well, with the following result:

Some pretty big changes already. First off, Duke has vaulted above Wake and is now favored to make the finals against a still-overwhelmingly-favored UNC team. The middle of the pack has changed considerably; Miami has doubled their chances, while Clemson has had theirs halved. But we know that the Pythagorean Winning Percentage is flawed, too; Baseball Prospectus also follows what they call "Third Order Wins." By this they mean that how much offense or defense a team produces is not as important as the context in which the points were scored. To put it in 2009 terms, which team has the better offense:

VMI-- Points/Game: 93.8 Possessions/Game: 81.2
Duke- Points/Game: 78.7 Possessions/Game: 70.1

It is true that VMI scores 15 more points per contest than the Blue Devils; they are the most prolific scorers in the nation. However, VMI plays at the fastest tempo in the country, getting over 11 possessions more per game than Duke. Tempo also varies from opponent to opponent, so the number of possessions can swing widely from one game to the next. So, a fair comparison of offenses requires looking not at a team's raw scoring numbers, but at how efficiently a team scores in the possessions it gets. Per possession, the 15-point gap shrinks dramatically: the figures above work out to about 1.16 points per possession for VMI and 1.12 for Duke, and that is before accounting for the quality of the defenses each team faced.

So what if we were to predict the results of the ACC tournament using Offensive and Defensive Efficiency, as provided by Ken Pomeroy? For this run I will also take each team's schedule into account by using Pomeroy's "Adjusted" efficiency ratings; teams are penalized if they run up high efficiencies against bottom feeding teams. The results are provided in an earlier link, but I'm showing my work:

While the chances of favorite UNC have remained largely the same, the effect of tempo-free statistics and the schedule have boosted Duke's chances by 5%. Most of this comes from an ever-increasing chance of beating Wake Forest on a neutral court: from 46% using just win percentage to 56% with Pythagoras to 62% tempo-free.

Frequently, when I use these tempo-free statistics, some folks are not convinced. They think that the adjustments for schedule made by Pomeroy are not enough, and that teams are different in league play than they were playing non-league foes before the new year. In addition, the ACC tournament is taking place between only ACC teams, so shouldn't statistics within the ACC matter more? On the other hand, the ACC no longer has a balanced schedule; for example, Boston College played Duke once (at home) while they played #12 seed Georgia Tech twice. I have not attempted to adjust for schedule here, so these are raw efficiency numbers:

The most striking result is that the top three teams (UNC, Duke, Wake) have had their chances all go down, relative to the full-season Pomeroy ratings. These extra chances have been split among a few teams. Clemson's title chances went up by 3 percentage points. Florida State, whose defense has improved tremendously since the clock ticked to 2009, have doubled their title chances (as have Boston College).

NCAA Tournament Implications:
1) The 8-9 game is not the closest of the first round. That distinction belongs to NCSU vs Maryland, according to all four metrics. That is not a good matchup for anyone who thinks that Maryland is still on the bubble.
2) Virginia Tech is pretty screwed. Like Maryland, they are a 7-9 ACC team, and the committee doesn't usually take kindly to a sub-.500 conference record. They are probably out of the tournament picture unless they make it deep, and the statistics say it's not probable at all.
3) The final 7-9 team, Miami, has to avoid a collapse against Virginia Tech, and then they face 2-to-1 odds against in the matchup with Wake Forest. Should they prevail, would the committee consider what then would be a 20-win ACC team?
4) Statistically, the top three seeds are very heavy favorites for the semifinals, with Duke and UNC more likely to be there than Wake. Should Duke win the two games as expected, would they still have to beat Wake Forest to get a #2 seed in the NCAAT? Certainly, the Deacons probably need to win the ACC tournament to get their own #2 seed.
5) Clemson and Florida State should both be solidly into the NCAA tournament, but they are playing for favorable seedings. By the ACC numbers and the overall Pomeroy ratings, Clemson is favored in a matchup with Florida State, and the Tigers are more likely to knock off UNC.
6) Spreadsheets are fun!