Note: The code for this project can be found in my GitHub repo. It wasn’t originally intended to be open-sourced, so I apologize for the lack of documentation and the hard-coded directory structure 🙂
My friend Matt is a Daily Fantasy Sports (DFS) junkie. He doesn’t have a math background but has heard the stories of guys using magic algorithms that tip the odds in their favor while they laugh their way to the bank. Naturally, he was curious. I’m not a huge sports fan, but it was a cool enough problem that he got me interested and we started playing around with it.
If you’re not familiar, baseball DFS works like this: on a given night several teams are playing, and you draft a fake team from among those players and are then awarded points based on their real-world performance. Although the roster and scoring details depend on the platform you’re playing on, you’ll have to pick something like 2 starting pitchers, 3 outfielders, 1 first baseman, 1 second baseman, 1 shortstop, etc. The caveat is that each player is assigned a salary and you only have a certain amount of cash to spend, so you can’t just buy the best players at every position. Pitchers are scored for things like strikeouts, and everyone else is scored on offensive outcomes, like singles and doubles. There are a variety of contest types you can join, but in the ones we played, those in the top 50% were paid from the entry fee pool of those in the bottom 50% (minus a platform fee). So if there are 1,000 entrants, you’re paid the same whether you’re in 1st place or 450th place.
It turns out that this is a variant of something called the Knapsack Problem. Briefly: you have a sack that can hold a given volume of stuff, there are many valuable items to choose from, each with its own size and intrinsic value, and your goal is to choose the combination of items that maximizes total value subject to the constraint on maximum volume. In our case we’re trying to choose a collection of players that maximizes projected DFS points subject to a salary constraint. In addition to the salary constraint, we also have a constraint on the number and types of players we need.
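To make the analogy concrete, here is a minimal sketch of the classic 0/1 knapsack solved by dynamic programming. The item values, weights, and capacity are made-up numbers for illustration (the DFS version adds positional constraints on top of this):

```python
def knapsack(values, weights, capacity):
    """Classic 0/1 knapsack via dynamic programming: best[c] holds the
    maximum value achievable with total weight <= c using the items
    processed so far."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Iterate capacity downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

# Toy example: 4 items, capacity 8. Best pick is items 1 and 2 (value 17).
print(knapsack(values=[10, 7, 12, 4], weights=[5, 3, 6, 2], capacity=8))  # → 17
```

Exact dynamic programming works at this scale, but once positional requirements and (later) correlated, random player values enter the picture, heuristic search becomes more attractive.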
I had never heard of the knapsack problem, but it looked to me like just another combinatorial optimization problem, similar to the Traveling Salesman Problem. Since I had been working with Ant Colony Optimization (ACO), I took a stab at adapting that approach in Python. The pipeline starts by scraping data from a website (rotogrinders.com) that publishes projections of each player’s DFS points according to a model they developed. Those projections are what I would try to maximize with ACO. The site also listed each player’s salary for the platform I was using, so it was a great data source.
Once I had a DataFrame with columns `Player`, `Position`, `Salary`, and `Projected_Points`, I would run the following algorithm:
- Randomly choose a position that still needs to be filled
- Filter the DataFrame to candidate players at that position whose salary fits within the remaining budget
- Assign each candidate a probability proportional to the product of a greedy factor (`Projected_Points`) and a “pheromone” factor that is influenced by previous ant behavior
- Scale each of these two factors by a tunable exponent, which sets the degree to which “ants” are influenced by greedy information versus previous ant behavior
- Randomly choose among the candidates using these probabilities as weights
- If a full roster is completed, compute its total projected points and deposit a “pheromone” trail on each selected player in proportion to that total
- Apply global pheromone evaporation by multiplying all pheromone levels by a constant < 1
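The steps above can be sketched roughly as follows. This is a simplified toy, not the actual project code: the player pool, salaries, projections, roster shape (2 pitchers, 1 outfielder), budget, and all the tuning constants are invented for illustration:

```python
import random
import pandas as pd

# Invented player pool in the DataFrame shape described above.
players = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Position": ["P", "P", "P", "OF", "OF", "OF"],
    "Salary": [9000, 7000, 5000, 4000, 3500, 3000],
    "Projected_Points": [18.0, 15.0, 9.0, 10.0, 8.0, 6.0],
})

ROSTER = {"P": 2, "OF": 1}   # toy roster requirements
BUDGET = 20000
ALPHA, BETA = 1.0, 2.0       # pheromone vs. greedy exponents (tunable)
RHO = 0.9                    # global evaporation factor (< 1)

pheromone = {name: 1.0 for name in players["Player"]}

def build_roster(rng):
    """One ant builds a roster using the weighted-random selection rule."""
    chosen, remaining, needs = [], BUDGET, dict(ROSTER)
    while needs:
        pos = rng.choice(list(needs))                    # random unfilled position
        cand = players[(players["Position"] == pos)
                       & (players["Salary"] <= remaining)
                       & (~players["Player"].isin(chosen))]
        if cand.empty:
            return None, 0.0                             # dead end; discard this ant
        # Weight ∝ pheromone^ALPHA * greedy^BETA
        weights = [pheromone[p] ** ALPHA * pts ** BETA
                   for p, pts in zip(cand["Player"], cand["Projected_Points"])]
        row = cand.iloc[rng.choices(range(len(cand)), weights=weights)[0]]
        chosen.append(row["Player"])
        remaining -= row["Salary"]
        needs[pos] -= 1
        if needs[pos] == 0:
            del needs[pos]
    total = players[players["Player"].isin(chosen)]["Projected_Points"].sum()
    return chosen, total

rng = random.Random(0)
best = (None, 0.0)
for _ in range(200):                     # number of ants
    roster, total = build_roster(rng)
    if roster:
        for p in roster:                 # deposit pheromone ∝ total points
            pheromone[p] += total / 100.0
        if total > best[1]:
            best = (roster, total)
    for p in pheromone:                  # global evaporation
        pheromone[p] *= RHO
print(best)
```

The key design choice is the `ALPHA`/`BETA` pair: pushing `BETA` up makes the ants nearly greedy, while pushing `ALPHA` up makes them herd toward whatever earlier ants found, so the balance controls exploration versus exploitation.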
The main idea is that each choice balances local, greedy information with global information. So even if someone looks like a bad choice because a different player has more projected points, they might be a great addition to the roster, because the salary saved at this position can be spent at another position to greater effect.
Although I made tweaks along the way and ultimately built a system that could take my preferences into account, the results exceeded my expectations. Specifically, I beat the median score 46% of the time. That’s not enough to win money, but it’s interesting that someone like me, with little-to-no baseball expertise, can build an algorithm that performs about evenly with humans who are confident enough in their choices to bet money.
Ultimately, the performance of my system can only be as good as the player projections. If I wanted to do this with aspirations of making money, I would definitely want to build my own projected-points model so that I could at least quantify the uncertainty.
Another problem is that intra-team player performance is correlated. If a pitcher is having a terrible night, the entire opposing lineup has elevated potential. Further, if you have great hitters around you, there are more chances for you to score points on RBIs or to come home once you’re on base. A more successful hitting team also means more batters coming up to the plate, which translates to more chances.
Presumably the projected points model takes some of these things into consideration, but it’s a fundamentally different optimization problem when the values are not only random variables, but correlated. For instance, it’s a common strategy to “stack” your roster such that you have multiple players from a single real-life team. The idea is that even if you wouldn’t normally pick a certain player based on their individual merit, you might on this night because they’re playing some bad team and will likely score many runs, which means a lot of DFS points are going to be spread out on that team. It works both ways, of course, and a poorly performing team makes it hard for an individual to get many points.
So does it make more sense to leverage the correlation by stacking players from teams that have a high probability of winning, or should you choose players that reduce correlation and therefore reduce risk of collective under-performance?
The DFS players that make the most money seem to do so by taking a combination approach: stack individual teams, but generate many different rosters and therefore participate in many different contests. It’s worth noting that other contest types are scored differently and in many cases the payout depends on your relative rank. The answer to the above question might depend on the details of contest payout as well as risk-appetite.
The next step is to think about what optimization means in this two-stage system: optimizing a single roster, then optimizing a collection of rosters. How can you leverage stochastic uncertainty and correlation in player performance so that you’re choosing the collection of rosters with the best chance of performing? My current implementation cannot generate many variations on a roster without manual intervention because it has no concept of stochastic performance; it simply treats the projected-points estimate as the value of each candidate.
If one had a model that produced probability distributions for individual players, taking intra-team correlations into account, then one could sample from it to simulate an outcome and run ACO against those sampled performance values. Repeating this over many samples amounts to a Monte Carlo simulation approach.
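As a hedged sketch of that sampling idea: draw correlated point totals for teammates from a multivariate normal, where each draw represents one simulated night that an optimizer (such as the ACO above) could then be run against. The means, spreads, and the single shared correlation value below are all invented for illustration; a real model would estimate them from data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented projections for three teammates: mean points and spread.
means = np.array([10.0, 8.0, 6.0])
sds = np.array([4.0, 3.5, 3.0])
corr = 0.5  # assumed pairwise intra-team correlation

# Covariance matrix: shared correlation off-diagonal, unit on the diagonal,
# scaled by the players' standard deviations.
C = np.full((3, 3), corr)
np.fill_diagonal(C, 1.0)
cov = np.outer(sds, sds) * C

# Each row is one simulated "night" of correlated player outcomes.
samples = rng.multivariate_normal(means, cov, size=10000)

print(samples.mean(axis=0))                       # close to the projected means
print(np.corrcoef(samples, rowvar=False)[0, 1])   # close to the assumed 0.5
```

In a full system you would run the roster optimizer once per sampled night and then study which rosters perform well across the whole distribution, rather than against a single point estimate.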
A solution to this problem likely has value outside the limited world of DFS. Consider, for instance, the task of choosing a portfolio of investments: there, too, you may want to both take advantage of big upswings when correlations are in your favor and simultaneously hedge against risk by diversifying to break correlations.