Predicting the winners and losers of March Madness is such a daunting challenge that it attracts math nerds like Starfleet voyagers lining up at Comic-Con. Statisticians, economists, Silicon Valley coders, the PhD quants at hedge funds and gambling syndicates: They’ve all tried to “solve” the outcome of the annual college basketball tournament’s 63 matchups.
“Every kid who takes a mathematical modeling class and who’s a college basketball fan, the first thing they want to do is predict the NCAA tournament,” says Ken Pomeroy, a former meteorologist who has become arguably the foremost college basketball numbers guru. His famous KenPom ratings measure the strength of all 351 NCAA Division I basketball teams using an old-school regression technique known as “least squares,” which analyzes statistical variances in teams’ past performances and helps predict the winners in two-team matchups.
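For readers who want to see the idea in miniature, here is a rough sketch of a least-squares rating fit. It is not Pomeroy's actual formula, which the article does not spell out. It assumes the only input is a list of past games with point margins, and it finds ratings whose differences come as close as possible (in squared error) to those margins. The function name, the input format, and the sample scores are all made up for illustration.

```python
def least_squares_ratings(games, sweeps=200):
    """Fit ratings minimizing sum((r[winner] - r[loser] - margin) ** 2).

    games: list of (winner, loser, margin) tuples (hypothetical input format).
    Uses coordinate descent: each team's rating is repeatedly reset to the
    value that minimizes the squared error with every other rating held fixed.
    """
    schedule = {}
    for winner, loser, margin in games:
        schedule.setdefault(winner, []).append((loser, margin))
        schedule.setdefault(loser, []).append((winner, -margin))

    ratings = {team: 0.0 for team in schedule}
    for _ in range(sweeps):
        for team, played in schedule.items():
            # Best rating for this team: average of (opponent rating + margin).
            ratings[team] = sum(ratings[opp] + m for opp, m in played) / len(played)

    # Ratings are only defined up to a constant, so center them on zero.
    mean = sum(ratings.values()) / len(ratings)
    return {team: r - mean for team, r in ratings.items()}


# Made-up results: A beat B by 10, B beat C by 5, A beat C by 15.
ratings = least_squares_ratings([("A", "B", 10), ("B", "C", 5), ("A", "C", 15)])
predicted_margin = ratings["A"] - ratings["C"]  # model's expected A-over-C margin
```

In this toy version, the predicted margin for any matchup is simply the gap between the two teams' ratings.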
But to generate entire brackets is to tangle not just with the randomness of the game itself, but with the randomness of your betting pool—the lucky guesses made by all the people you’re competing against to predict the greatest number of winners. Microsoft researchers have unleashed their machine-learning engine Bing Predicts on March Madness forecasts, and several independent researchers, such as the chief data scientist of a big defense consultant, have used neural networks to entwine discrete predictive models into “ensembles” that spit out probabilities. But some of the most intense March Madness research is being done by David Hess. He’s a 36-year-old with degrees in neuroscience from Johns Hopkins and NYU who’s also from Kansas, and is thus “a huge college basketball fan.” In 2011 he went to work at a sports prediction site called Team Rankings, where he set out to build a tool to produce optimized NCAA tournament brackets for paying customers.
After experimenting with different statistical models, including a so-called upset algorithm that somehow augurs underdog victories, Hess settled on what’s known as an evolutionary algorithm that relies on machine learning. Hess begins by rating the relative strength of all the competitors. Once the NCAA on Sunday announces the seedings—a ranking of the teams in the tournament—the model uses that data, along with probabilistic information from betting markets, to spit out a batch of probable results. That, however, isn’t enough. A second model scrapes data from ESPN and Yahoo, where millions of people submit their picks for public consumption, and generates a simulated pool of opponents’ brackets.
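The article does not say how betting-market information is turned into probabilities, so what follows is only one plausible sketch. It assumes the input is a pair of American moneyline odds for a single game, converts each to the bookmaker's implied probability, rescales the two so they sum to 1, and then replays the game many times. The function names and the sample odds are hypothetical.

```python
import random


def implied_probability(moneyline):
    """Convert American moneyline odds to the bookmaker's implied probability."""
    if moneyline < 0:                      # favorite, e.g. -150
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)         # underdog, e.g. +130


def fair_probabilities(line_a, line_b):
    """Rescale both implied probabilities so they sum to 1 (removes the margin)."""
    p_a, p_b = implied_probability(line_a), implied_probability(line_b)
    total = p_a + p_b
    return p_a / total, p_b / total


def simulate_game(p_a, trials=10_000, seed=0):
    """Replay one matchup `trials` times; return the share won by team A."""
    rng = random.Random(seed)
    return sum(rng.random() < p_a for _ in range(trials)) / trials


# Made-up odds: team A is a -150 favorite, team B a +130 underdog.
p_a, p_b = fair_probabilities(-150, 130)
share_won_by_a = simulate_game(p_a)
```

Chaining simulated games round by round would yield the kind of "batch of probable results" described above. The crowd-picks model would need its own data source and is not sketched here.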
At this point, the evolutionary algorithm takes over. It obtains a semirandom sample of brackets from the 9.2 quintillion (that’s 9 million trillion!) possible combinations of picks, and pits them against a series of simulated tournament results and a series of simulated pools. It runs, in essence, a simulation based on two other simulations. The algorithm plucks out the brackets that achieve the highest winning percentages and then does what makes it evolutionary: It “mutates” or “mates” the brackets to produce “offspring” outcomes. The software repeats this process through 300 or so generations and halts the evolution when it detects no room for improvement.
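A toy version of that loop might look like the sketch below. It is not Team Rankings' code. It assumes a bracket can be stored as one bit per game (63 bits for a 64-team field, which is where the 2^63, or roughly 9.2 quintillion, possibilities come from), that pools use a doubling points-per-round scoring scheme, and that opponents pick stochastically according to team strength. It runs a fixed number of generations rather than detecting when improvement has stalled. Every name and parameter here is hypothetical.

```python
import random


def play(bits, teams):
    """Decode one bit per game (0 = first team advances) into winners by round."""
    rounds, alive, i = [], list(teams), 0
    while len(alive) > 1:
        winners = []
        for a, b in zip(alive[0::2], alive[1::2]):
            winners.append(b if bits[i] else a)
            i += 1
        rounds.append(winners)
        alive = winners
    return rounds


def random_rounds(teams, strength, rng):
    """Simulate a tournament (or an opponent's picks) weighted by strength."""
    rounds, alive = [], list(teams)
    while len(alive) > 1:
        winners = [a if rng.random() < strength[a] / (strength[a] + strength[b]) else b
                   for a, b in zip(alive[0::2], alive[1::2])]
        rounds.append(winners)
        alive = winners
    return rounds


def score(picks, result):
    """Assumed scoring: 1 point per correct pick in round 1, doubling each round."""
    return sum((2 ** r) * sum(p == w for p, w in zip(pr, rr))
               for r, (pr, rr) in enumerate(zip(picks, result)))


def fitness(bits, teams, outcomes, pools):
    """Share of simulated (tournament, pool) pairs this bracket wins outright."""
    picks = play(bits, teams)
    wins = sum(all(score(picks, result) > score(opp, result) for opp in pool)
               for result, pool in zip(outcomes, pools))
    return wins / len(outcomes)


def evolve(teams, strength, generations=30, population=40,
           sims=100, pool_size=5, seed=0):
    rng = random.Random(seed)
    n_games = len(teams) - 1
    outcomes = [random_rounds(teams, strength, rng) for _ in range(sims)]
    pools = [[random_rounds(teams, strength, rng) for _ in range(pool_size)]
             for _ in range(sims)]
    pop = [[rng.randint(0, 1) for _ in range(n_games)] for _ in range(population)]
    for _ in range(generations):
        # Keep the top quarter, then refill the population with their offspring.
        pop.sort(key=lambda b: fitness(b, teams, outcomes, pools), reverse=True)
        parents = pop[: population // 4]
        children = []
        while len(parents) + len(children) < population:
            mom, dad = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(mom, dad)]  # "mate"
            child[rng.randrange(n_games)] ^= 1                    # "mutate"
            children.append(child)
        pop = parents + children
    best = max(pop, key=lambda b: fitness(b, teams, outcomes, pools))
    return best, fitness(best, teams, outcomes, pools)


# Made-up eight-team field with arbitrary strengths.
teams = list("ABCDEFGH")
strength = {t: 8 - i for i, t in enumerate(teams)}
best_bits, win_rate = evolve(teams, strength, generations=5, population=12,
                             sims=30, pool_size=3, seed=1)
```

Because the surviving parents carry over unchanged, the best win rate in the population never drops from one generation to the next. A production system would presumably add a stopping rule and far larger simulations.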
Starting Sunday night, 18 Amazon servers used by Team Rankings will spin for more than 24 hours, and Hess’ crew will pull a few all-nighters. “I think we find the global optimum solution the majority of the time,” he says, and recent results bear that out: A Team Rankings analysis shows that people who paid $39 for its optimized bracket last year were 4.5 times more likely to win a prize in their pools than those without an algorithmic edge. However, he’s quick to caution that no machine will ever be able to predict upsets. “Even if you were omniscient and could know the true odds of a thing happening,” Hess says, no bracket based on those true odds would be guaranteed to win any given March Madness pool. In betting and basketball, there are no sure things.
This article appears in the March issue.