Tuesday, March 17, 2009

The Most Likely Final Four

Sorry that it has taken so long since my last post; I know that the masses are in need of more data and help filling out their brackets. I have been working on a Python script to parse the massive amounts of data I produced with my 1 million NCAA tournament simulations. Essentially, what resulted is a data file containing the winner of every game in each simulation; that file is 611 MB, if you were wondering. What I have done is pull out from that massive file the most common Final Fours and the most common Championship games, which I will present in a minute.

Yesterday was the most successful day in Immaculate Inning history, with over 740 unique visitors, most of you coming from BallHype.com. I want to take a minute and point out some differences between what you'll find here and what other sites are producing. First, I noticed this article by the Wages of Wins Journal; they do basically what I did for the ACC tournament, using both Pomeroy and Sagarin ratings. It's important to remember that the numbers on that site are individual game probabilities multiplied together, so it's impossible to see how the winner of one game will affect the rest of the tournament.

Next, we have Joel Sokol of Georgia Tech, who uses a logistic regression model, based solely on margin of victory, to rank every team in Division I. He selects his bracket by picking the team that ranks higher, and according to his analysis, this method outperforms every other major bracket-picking method, whether it's seeds, ESPN's experts, or Sagarin rankings. That's pretty impressive, but once again, his choices do not take into account the effect of upsets on a single tournament.

Finally, there's a competing NCAA tourney simulation by Upon Further Review. There are two main differences between that simulation and mine. First, and perhaps most important: he doesn't show his work. A cursory look at the rest of the website shows a predilection for Basketball Prospectus, so perhaps we can assume he used efficiency ratings, but we just don't know. The second difference is that his is just 1,000 simulations. I'll admit that it doesn't seem obvious at first why having 1,000 times as many simulations is necessarily better, other than the novelty of seeing Alabama State win the tournament one or two times. I'm hoping to convince folks that the one million simulations really are better, because I can produce results like these: (click here to view the full spreadsheet)

The Most Likely Championship Game: Connecticut vs Pittsburgh

I searched my simulation output file for the winners of the two Final Four games, that is, the championship game participants. There were 840 different matchups in the one million simulations. The championship games appearing in at least 1% (10,000) of the simulations, in order of decreasing likelihood:

Connecticut / Pittsburgh : 2.21%
Memphis / Pittsburgh : 1.86%
Louisville / Pittsburgh : 1.71%
Connecticut / Duke : 1.66%
Connecticut / North Carolina : 1.59%
Memphis / Duke : 1.43%
Memphis / North Carolina : 1.33%
Connecticut / Gonzaga : 1.31%
Louisville / Duke : 1.29%
Connecticut / Oklahoma : 1.28%
Connecticut / Syracuse : 1.27%
Louisville / North Carolina : 1.26%
Connecticut / Arizona St. : 1.22%
Connecticut / UCLA : 1.22%
Memphis / Gonzaga : 1.12%
West Virginia / Pittsburgh : 1.11%
Memphis / Syracuse : 1.09%
Memphis / Oklahoma : 1.09%
Louisville / Gonzaga : 1.02%
Memphis / UCLA : 1.02%
Memphis / Arizona St. : 1.02%
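
For anyone curious how a tally like this can be produced, here is a minimal Python sketch. It is not the actual parsing script: it assumes, purely for illustration, that each simulation has already been reduced to a pair of finalist names, and the function name `common_matchups` and the sample data are made up.

```python
from collections import Counter

def common_matchups(finalists, cutoff=0.01):
    """Return (matchup, share) pairs seen in at least `cutoff` of simulations.

    `finalists` is an iterable of (left_team, right_team) tuples, one per
    simulated tournament. This layout is an assumption for illustration,
    not the real format of the simulation output file.
    """
    counts = Counter(finalists)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() yields pairs in order of decreasing count
    return [(pair, n / total) for pair, n in counts.most_common()
            if n / total >= cutoff]

# Tiny made-up sample, NOT real simulation results
sample = ([("Connecticut", "Pittsburgh")] * 3
          + [("Memphis", "Pittsburgh")] * 2
          + [("Louisville", "Duke")])

for (left, right), share in common_matchups(sample, cutoff=0.2):
    print(f"{left} / {right} : {share:.2%}")
```

The same counting approach would work for Final Fours by keying on a four-team tuple instead of a pair.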

I'm fairly confident that a simulation of only 1,000 tournaments would be unable to separate the frequency of one matchup from another with any kind of statistical power. As you can see, the three most likely Championship Games all include Pittsburgh. UCLA and Arizona St, both six seeds, are the lowest seeds making an appearance in these most likely title game matchups. The left side of the bracket, representing the West/Midwest half of the tournament, appears a lot more stable than the right side; with one exception (West Virginia), just three teams are represented: Louisville, Connecticut, and Memphis. The right side of the bracket, meanwhile, has a lot more variability, with three teams from the East and five from the South each making an appearance in the likely title games list.
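
To put a rough number on the sample-size point, one can treat each simulated tournament as an independent trial and use the usual binomial standard error, sqrt(p(1-p)/n), for an observed share p. The sketch below is a back-of-the-envelope check, not part of the simulation itself; the function name is made up, and independence between simulations is an assumption.

```python
import math

def share_std_error(p, n):
    """Binomial standard error of an observed share p over n independent trials."""
    return math.sqrt(p * (1 - p) / n)

# Shares of the two most common title games from the list above
top, second = 0.0221, 0.0186
gap = top - second

for n in (1_000, 1_000_000):
    se = share_std_error(top, n)
    print(f"n={n:>9,}: standard error {se:.4%}, "
          f"gap is {gap / se:.1f} standard errors")
```

With 1,000 simulations the standard error on the top share is roughly 0.46 percentage points, larger than the 0.35-point gap between the top two matchups. With one million it shrinks to about 0.015 points, so the ordering of the top of the list is much easier to trust.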

In case you're worried about my arbitrary cutoff of 1%, the next three most common championship games all featured Louisville (vs Syracuse, Oklahoma, and UCLA), followed by a Michigan St-Pittsburgh matchup and yet another Louisville game (vs Arizona St). Following a unique matchup between Purdue and Pittsburgh at 0.90%, there is a sharp dropoff in the frequency. The first 25 or so matchups are clearly the most common, and therefore the most likely. I suppose it means that if you are looking for a sure thing, Pittsburgh is a good bet to make the title game. However, if you're looking for a sleeper (not a #1 or #2 seed) to make the title game, it would be better to replace Pittsburgh with UCLA, Arizona St, or Gonzaga, because low seeds making the title game out of the West and Midwest is just not likely.

The Most Likely Final Four: Connecticut, Louisville, Pittsburgh, Oklahoma

As a Duke fan, I was saddened that Duke did not represent the East region in the most likely Final Four. However, I am overjoyed that the only #1 seed not to be there is North Carolina...
The power of the #1 seeds was actually quite strong: the five most likely Final Fours, representing nearly 1 percent of all simulations, all featured UConn, Louisville, and Pittsburgh (one of them also included North Carolina). Anyway, there are 26,790 unique Final Fours in the simulation, 6,134 of which appear only once. Only 2,434 Final Fours occurred more than 100 times (0.01 percent). The most likely Final Four, listed above, occurred 2009 times (how's that for symmetry), or 0.2 percent.

Once again, the top-heavy nature of the West region was clear; it was not until the 42nd most common Final Four that the West representative was not Connecticut or Memphis (it was Purdue). The nine most common Final Fours list Louisville as the Midwest champ, and some sprinklings of West Virginia and Michigan State follow until the 37th most likely Final Four, which features Kansas. In the East, Pitt did capture those first five spots, and most of the top 20 (replaced by Duke in five of them, then UCLA in the 21st most likely Final Four). The first team to come out of the East that was not Pitt, Duke, or UCLA was Xavier in the 48th most likely Final Four. Finally, the South is just as wide open as we've been advertising, with five different teams in the first five most likely scenarios!

What does all of this mean for you, humble bracket filler? It means that under the most common bracket pool rules (more points for late-round games than early-round ones), someone is going to win the pool by picking the correct South regional winner. The other regions are fairly top-heavy with just a few likely options, but the South is where the money is. These breakdowns don't really point to a favorite in the five-team cluster, although the initial simulation calls North Carolina the favorite.

It is a bit strange to note that Memphis is neither in the most likely title game, nor the most likely Final Four. They were a slight favorite to win the tournament in the initial simulation, just beating out UConn. I suppose you could say that whoever wins the West regional should be the odds-on favorite to capture the title!

Xenod and I are working on expanding the search through the simulation to incorporate the Elite Eight and Sweet Sixteen. I'm not sure if 1 million is enough to tease apart the variance at those levels, but we will try. I'll also take a look at first and second round matchups from a different perspective. Stay tuned to all the tourney simulations you can handle, right here at Immaculate Inning!
