An Open Letter to Open Letters

Dear Open Letters,

Although you likely won’t read this, since Open Letters are almost never read by their stated targets, I’m writing it anyway because I’m really just looking for a pedestal to shout from.  Not because I feel particularly strongly, or because I think any actual change will result, but because it makes me feel good about myself to take zero risk while pretending to make a bold, public gesture.

This isn’t about you, Open Letters, this is about me.  You see, I need to feel important.  However, it’s hard work to actually be important.  I’d rather pen a pointless letter that stakes out an incredibly uncontroversial opinion and then write it as if it were aggressively brave.  Then I will unleash it on social media, because social media asks the absolute minimum of others to show their support:  a movement of a finger by mere millimeters.  But this inane, inconsequential gesture is enough to stroke my ego, and maybe even the egos on the other end of the tapping fingers.  It validates that the trivial amount of effort was worth it, and I’ll sleep better knowing I can say, “Well, at least I raised awareness.”  SJW 101 stuff.

Consider what you actually are, Open Letters:  the world’s most efficient ego-stroking machine.  Just consider: if it only takes a person 5 minutes to read this, but it goes viral to 10M people, that’s 833,333 man-hours.  That’s like 20,833 people working full time for a week, all for me.  And what did I have to do?  Spend 15 minutes at a keyboard.  So lest you think it’s about you and your content, Open Letters, be assured that it’s not.  You work for me.

Sincerely,

A lazy blogger.

How Ants Can Pick your Fantasy Baseball Team

Note:  The code for this project can be found in my GitHub repo.  This wasn’t originally intended to be open-sourced, so I apologize for the lack of documentation and the hard-coded directory structure 🙂

My friend Matt is a Daily Fantasy Sports (DFS) junkie.  He doesn’t have a math background but has heard the stories of guys using magic algorithms that tip the odds in their favor while they laugh all the way to the bank.  Naturally, he was curious.  I’m not a huge sports fan, but it was a cool enough problem that he got me interested and we started playing around with it.

If you’re not familiar, baseball DFS works like this:  on a given night several teams are playing, and you draft a fake team from among those players and are then awarded points based on their real-world performance.  Although the roster and scoring details depend on the platform you’re playing on, you’ll have to pick something like:  2 starting pitchers, 3 outfielders, 1 first baseman, 1 second baseman, 1 shortstop, etc.  The caveat is that each player is assigned a salary and you only have a certain amount of cash to spend, so you can’t just buy the best player at every position.  Pitchers are scored for things like strikeouts and everyone else is scored on offensive outcomes, like singles and doubles.  There are a variety of contest types you can join, but the ones we played were such that those in the top 50% were paid from the entry fee pool of those in the bottom 50% (minus a platform fee).  So if there are 1,000 people, you’re paid the same whether you’re in 1st place or 450th place.

Problem

It turns out that this is a variant of something called the Knapsack Problem.  Briefly, the Knapsack Problem is such that you have a sack that can hold a given volume of stuff and there are many valuable items to choose from, each with a different size and intrinsic value, and your goal is to choose the combination of items that maximizes value subject to the constraint on maximum volume.  In our case we’re trying to choose a collection of players that maximizes projected DFS points subject to a salary constraint.  In addition to the salary constraint, we also have a constraint on the number and types of players we need.
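
As a toy illustration, the classic single-constraint knapsack can be solved exactly with dynamic programming; the player names and numbers below are made up, and note that this ignores the positional constraints that make the real DFS problem harder:

def knapsack(items, capacity):
    # items: list of (name, size, value); capacity: max total size
    best = [0] * (capacity + 1)                  # best[c] = max value at capacity c
    picks = [[] for _ in range(capacity + 1)]
    for name, size, value in items:
        for c in range(capacity, size - 1, -1): # backwards so each item is used once
            if best[c - size] + value > best[c]:
                best[c] = best[c - size] + value
                picks[c] = picks[c - size] + [name]
    return best[capacity], picks[capacity]

# Hypothetical players: (name, salary in $1000s, projected points)
players = [("Ace", 9, 30), ("Bob", 6, 14), ("Cal", 5, 16), ("Dan", 5, 13)]
print(knapsack(players, 15))   # -> (46, ['Ace', 'Cal'])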

Approach

I had never heard of the knapsack problem, but it looked to me like just another combinatorial optimization problem, similar to the Traveling Salesman Problem.  Since I had been working with Ant Colony Optimization (ACO), I took a stab at adapting that approach in Python.  The pipeline starts by scraping data from a website that publishes projections of each player’s DFS points according to a model they had developed.  That is what I would try to maximize with ACO.  This site also listed the salary of each player for the platform I was using, so it was a great data source (rotogrinders.com).

Once I had a DataFrame with columns Player, Position, Salary, and Projected_Points, I would run the following algorithm (a rough sketch in code follows the list):

  1. Randomly choose a roster position that still needs to be filled
  2. Filter the DataFrame to only include candidate players at that position whose salary fits within the remaining budget
  3. Calculate a probability for each candidate proportional to the product of a greedy factor (Projected_Points) and a “pheromone” factor that is influenced by previous ant behavior
    1. These two factors are each raised to tunable exponents that set the degree to which “ants” are influenced by greedy information versus previous ant behavior
  4. Randomly choose among the candidates using these probabilities as weights
  5. If a full roster was created, compute the total projected points and deposit a “pheromone” trail on each selected player by an amount proportional to that total
  6. Apply global pheromone evaporation by multiplying all levels by a constant < 1

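Here is a minimal sketch of one ant building a roster this way (my actual code differs in the details; the column names, slot template, and constants are assumptions):

import numpy as np
import pandas as pd

ALPHA, BETA = 1.0, 2.0   # tunable exponents: pheromone vs. greedy influence

def build_roster(df, slots, budget, rng):
    # df columns: Player, Position, Salary, Projected_Points, Pheromone
    # slots: dict mapping position -> number of players still needed
    roster, remaining = [], budget
    needs = dict(slots)
    while any(needs.values()):
        pos = rng.choice([p for p, n in needs.items() if n > 0])   # step 1
        cand = df[(df.Position == pos) & (df.Salary <= remaining)  # step 2
                  & (~df.Player.isin(roster))]
        if cand.empty:
            return None                                            # dead end; discard this ant
        w = cand.Pheromone**ALPHA * cand.Projected_Points**BETA    # step 3
        pick = df.loc[rng.choice(cand.index, p=(w / w.sum()).to_numpy())]  # step 4
        roster.append(pick.Player)
        remaining -= pick.Salary
        needs[pos] -= 1
    return roster

def update_pheromone(df, roster, total_points, rho=0.9, q=0.001):
    df["Pheromone"] *= rho                                             # step 6: evaporation
    df.loc[df.Player.isin(roster), "Pheromone"] += q * total_points    # step 5: deposit
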
The main idea is that you’re making a choice that balances local, greedy information with global information.  So even if a player looks like a bad choice because a different player at the same position has more projected points, they might be a great addition to the roster because the salary savings at this position can be used at another position with greater effect.

Results

Although I made tweaks along the way and ultimately built a system that could take my preferences into account, the results exceeded my expectations.  Specifically, I beat the median score 46% of the time.  That’s not enough to win money, but it’s interesting that someone like me, with little-to-no baseball expertise, can build an algorithm that performs roughly on par with humans who are confident enough in their choices to bet money.

Nuances

Ultimately, the performance of my system can only be as good as the player projections.  If I wanted to do this with aspirations of making money, I would definitely want to build my own projected-points model so that I could at least quantify the uncertainty.

Another problem is that intra-team player performance is correlated.  If a pitcher is having a terrible night, the entire opposing team has elevated potential.  Further, if you have great hitters around you, there are more chances for you to score points on RBIs or to make it home once you’re on base.  Also, a more successful hitting team means more people coming up to the plate, which translates to more chances.

Presumably the projected points model takes some of these things into consideration, but it’s a fundamentally different optimization problem when the values are not only random variables, but correlated.  For instance, it’s a common strategy to “stack” your roster such that you have multiple players from a single real-life team.  The idea is that even if you wouldn’t normally pick a certain player based on their individual merit, you might on this night because they’re playing some bad team and will likely score many runs, which means a lot of DFS points are going to be spread out on that team.  It works both ways, of course, and a poorly performing team makes it hard for an individual to get many points.

Next Steps

So does it make more sense to leverage the correlation by stacking players from teams that have a high probability of winning, or should you choose players that reduce correlation and therefore reduce risk of collective under-performance?

The DFS players that make the most money seem to do so by taking a combination approach:  stack individual teams, but generate many different rosters and therefore participate in many different contests.  It’s worth noting that other contest types are scored differently and in many cases the payout depends on your relative rank.  The answer to the above question might depend on the details of contest payout as well as risk-appetite.

The next steps are thinking about what optimization means in this two-stage system of single-roster optimization and then optimization of a collection of rosters.  How can you leverage stochastic uncertainty and correlation in player performance such that you’re choosing a collection of rosters that have the best chance of performing?  My current implementation cannot generate many variations on a roster without manual intervention because it has no concept of stochastic performance; it merely assumes the projected points estimate is the value of the candidate.

If one had a model that produced probability distributions for individual players that took intra-team correlations into account, then one could sample from it to simulate a night’s outcomes and run ACO using those simulated performance values.  This is essentially a Monte Carlo simulation approach.
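
As a rough sketch of that idea (the mean vector, covariance matrix, and run_aco function below are all placeholders for models I haven’t built):

import numpy as np

rng = np.random.default_rng(0)

def sample_rosters(mu, cov, n_sims, run_aco):
    # mu[i]: expected DFS points for player i; cov: intra-team correlations
    rosters = []
    for _ in range(n_sims):
        simulated = rng.multivariate_normal(mu, cov)  # one simulated night
        rosters.append(run_aco(simulated))            # optimize against that draw
    return rosters   # a diverse collection of rosters to enter across contests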

Broader Connections

A solution to this problem likely has value outside the limited world of DFS.  For instance, consider the task of choosing a portfolio of investments.  There too, you may want to both take advantage of big upswings when correlations are in your favor and simultaneously hedge against risk by diversifying to break correlations.

The Bayesian Approach to Linear Regression

I’ve long been interested in the Bayesian approach to statistical modeling and machine learning.  I recently dug deep into an example in Bishop’s text to re-create his results and paraphrase my understanding of it.  Much like how you would want an online learning system to work, this example shows in detail how each data point updates the posterior distribution, which then becomes the next iteration’s prior.  In turn, you can watch the uncertainty in the model’s estimates shrink.

What I like most about the Bayesian treatment is that you get a full probability distribution over your model parameters, not just a point estimate.  This allows you to ensemble many different models, each weighted by its respective probability.  The result is a far more robust estimate than the common Maximum Likelihood approach.
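
For concreteness, here is a minimal version of the conjugate sequential update for the linear-Gaussian case (these are the standard textbook formulas; the numbers and variable names are mine, not Bishop’s):

import numpy as np

def posterior_update(m0, S0, phi, t, beta):
    # One Bayesian linear regression update with known noise precision beta.
    # m0, S0: prior mean/covariance of the weights; phi: feature vector
    # for the new point; t: its target value.
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * np.outer(phi, phi))  # posterior covariance
    mN = SN @ (S0_inv @ m0 + beta * phi * t)                # posterior mean
    return mN, SN

# Fit y = w0 + w1*x one (made-up) point at a time; the prior is zero-mean
alpha, beta = 2.0, 25.0
m, S = np.zeros(2), np.eye(2) / alpha
for x, y in [(0.5, 0.3), (-0.8, -0.7)]:
    m, S = posterior_update(m, S, np.array([1.0, x]), y, beta)
print(m, S)   # each point shrinks S, i.e. the model's uncertainty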

If you find the treatment useful, feel free to fork the GitHub code and play with it yourself.  Find the notebook here.

Bayesian estimates within +/- 1 stdev from the mean using only two data points

Deriving the Normal Distribution, Part 2

Intro

In Part 1 of this series we set out to learn why the Gaussian/Normal distribution is such a pervasive choice for modeling random variables.  We claim that this distribution is the least biased choice possible when the only information we have is the average and variance of outcomes from a population.  Our strategy was to establish a metric of uncertainty and then maximize this metric while simultaneously satisfying the constraints imposed by the known average and variance.

Read more

Deriving the Normal Distribution, Part 1

Intro

When dealing with probabilities or building mathematical models, the de facto distribution is the normal/Gaussian.  The bell curve.  But where does this come from?  Why is it a better choice than any other distribution that centers on the average?  It turns out that the Gaussian distribution is the least biased choice if the only information you have is 1) the average and 2) the variance of a set of outcomes.

To get to this result, we will first establish a measure of our uncertainty.  This will be useful because we will then (Part 2) choose a distribution to model our process that both maximizes this uncertainty function and simultaneously satisfies the constraints of our average/variance knowledge.

Criteria for a Measure of Uncertainty

How do you measure the amount of uncertainty in a probability distribution?  We’ll start by establishing a few characteristics that would be required for a useful uncertainty function.  From these we will derive the functional form that meets the criteria.

Suppose we have the possible outcomes of the process encoded by values represented by (x_1, ... , x_n). For instance, if it were a 6-sided die, the values would be 1, 2, 3, 4, 5, and 6. One way to mathematically model the process of rolling the die would be to assign probabilities to each of the possible outcomes.  In the case of a fair die, each probability is equal to 1/6.

We assign a corresponding probability (p_1, ... , p_n) to each outcome (x_1, ... , x_n), and these probabilities represent our knowledge of the process. Let’s call our unknown uncertainty function “H”.  We expect H to depend on our probabilities, since we are trying to quantify our uncertainty in our knowledge and it’s our probabilities that represent that knowledge: H=H(p_1, ..., p_n)

So which criteria does H have to meet so that it is useful and in line with our intentions?

  1. It must be a continuous function of the p_i.  This means a small change in one of the probabilities can’t give a huge change in our knowledge state.
  2. In the case that all outcomes are equally probable (as in the case of a fair 6-sided die), the uncertainty should increase as the number of possible outcomes increases.  In other words, the uncertainty function monotonically increases with the number of equiprobable choices.
    1. This just means that if you have more choices, you’re more uncertain than if you had fewer choices.  Guessing the outcome of a coin-flip is easier than picking the lottery numbers.
  3. No matter how you count the probabilities, you should arrive at the same value for the same state of knowledge.
    1. For instance, let’s say that instead of having to guess the outcome of a fair six-sided die roll directly, you first guess whether it’s in the set \{1,2,3\} or \{4,5,6\} and then guess the final number from the smaller list.  The first choice has two equal outcomes, so your probability of success is 1/2.  The second choice has three equal outcomes, so your probability of success is 1/3.  Your total probability of success is the product of these two, which is 1/6.  Essentially this is another way of modeling the exact same process.  The requirement above just means that no matter how we model a process and break it up into sub-decisions, we should get one consistent value for H, whatever it ends up being.

The Uncertainty Function

It turns out, the function that satisfies these criteria is H(p_1,...,p_n)=-K\sum p_i \log p_i where K is a positive constant that determines the units.  If you’re interested in the derivation, see the Deriving the Uncertainty Function section below.

An Example

Let’s use a simple case where there are two outcomes with probabilities p and q = 1-p.  The uncertainty becomes

H(p, 1-p)=-K[p\log p + (1-p)\log(1-p)]

= -K[p\log\frac{p}{1-p}+\log(1-p)]

Plotting this function using K=1 and a logarithm of base-2, the following plot is generated with units of bits on the y-axis.

[Figure: the binary entropy function H(p) in bits, which is 0 at p=0 and p=1 and peaks at 1 bit when p=1/2]

How does this fit our intuition?  If we are certain in the outcome (i.e. p=0 or p=1) then our uncertainty function H=0.  We have no uncertainty.  If, on the other hand, p=q=\frac{1}{2}, then our function is maximized at H=1 bit.  As we would expect, we are maximally uncertain when the two outcomes are equally probable.
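
These endpoints are easy to verify numerically; a tiny snippet using K=1 and base-2 logarithms:

import numpy as np

def H(probs, K=1.0):
    # Shannon uncertainty in bits; terms with p=0 contribute nothing
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -K * np.sum(p * np.log2(p))

print(H([0.5, 0.5]))   # 1.0 bit: maximally uncertain
print(H([1.0, 0.0]))   # 0.0 bits: certain outcome
print(H([0.9, 0.1]))   # ~0.469 bits: somewhere in between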

Deriving the Uncertainty Function

Using the criteria stated above, let’s now derive our uncertainty function.

Uncertainty Function of Equally Probable Outcomes

Let’s assume we are dealing with equally probable events at first, like a fair die.  We’ll simplify our notation so that instead of writing the uncertainty for n equally probable events as H(\frac{1}{n},\frac{1}{n},...,\frac{1}{n}), we will just use A(n).

According to requirement 3 above, we can deduce that to decompose a choice from n = s^m equally likely possibilities, you can transform it to be m choices from s equally likely possibilities:  A(s^m) = mA(s).  In other words, if you have n = 9 = 3^2 equally likely choices, you can decompose that into two subsequent choices, each from three possibilities.  This decomposition is similar to what we did in the sub-point of requirement 3.

Now, for any specified number t^n, we can find an m to satisfy the following inequality: s^m\leq t^n\leq s^{m+1}.  So if t^n = 3^2 = 9, we could choose s^m = 2^3 = 8 and then s^{m+1} = 2^4 = 16, which satisfies the inequality.

By taking the logarithms of each term, we get m\log(s) \leq n\log(t) \leq (m+1)\log(s).  Dividing by n\log(s), we get \frac{m}{n} \leq \frac{\log(t)}{\log(s)} \leq \frac{m}{n} + \frac{1}{n}.  Subtracting \frac{m}{n} from each term, we end up showing: 0 \leq \frac{\log(t)}{\log(s)} - \frac{m}{n} \leq \frac{1}{n}.  And since n can be made arbitrarily large, |\frac{\log(t)}{\log(s)} - \frac{m}{n}| < \epsilon for any \epsilon > 0.  So what?

Since requirement 2 says that our uncertainty function is monotonically increasing, the original inequality and subsequent forms of it carry over:  A(s^m) \leq A(t^n) \leq A(s^{m+1}).  Using the decomposition rule just derived: mA(s) \leq nA(t) \leq (m+1)A(s).  Continuing with the same division and subtraction steps we just undertook, we end up with the relation:  |\frac{A(t)}{A(s)} - \frac{m}{n}| < \epsilon.

These inequalities essentially state the following:  The ratios of \frac{A(t)}{A(s)} and \frac{\log(t)}{\log(s)} are both within a distance of \epsilon from the quantity \frac{m}{n}.  That means the furthest they can be from one another is 2\epsilon.  Or translating that to an inequality, |\frac{A(t)}{A(s)}-\frac{\log(t)}{\log(s)}| < 2\epsilon.

So what does this tell us?  We have shown that the ratio of our unknown function values, \frac{A(t)}{A(s)}, is arbitrarily close to the ratio of known values, \frac{\log(t)}{\log(s)}.  This leads one naturally to conclude that the unknown function is the known function up to a multiplicative constant.  How else could they be arbitrarily close for any s and t?  Therefore:

A(t) = K\log(t).

Uncertainty of General Probability Distributions

So that’s the function that quantifies the uncertainty of a process with t equally probable outcomes.  But what about processes that don’t have equally probable outcomes?  Using requirement 3, we see that the way to decompose a process is to take the uncertainty of the first choice and add the uncertainties of the subsequent choices, each weighted by its probability of occurrence.  So in the game from the example in requirement 3, the uncertainty of choosing 1 of 6 equally probable sides is the same as the sum of the uncertainties of the two-step process: 1) choosing one of the two subgroups \{1,2,3\} or \{4,5,6\}, and 2) choosing among the three elements of each subgroup, weighted by the probability of landing in that subgroup in the first place.

A(6) = H(\frac{1}{2},\frac{1}{2})+\frac{1}{2}H(\frac{1}{3},\frac{1}{3}, \frac{1}{3})+\frac{1}{2}H(\frac{1}{3},\frac{1}{3}, \frac{1}{3})

Now imagine a process with n =\sum n_i outcomes, where n_i is the relative occurrence of the i^{th} outcome.  For instance, a 6-sided die with one green side, one red, two blue, and two yellow.  The probabilities for each outcome are then straightforward:  p_i = \frac{n_i}{\sum n_i}.  This also means that if you know you are on outcome i, this is like choosing from n_i equally probable events, since there are n_i occurrences of i.  To break down this process of n =\sum n_i outcomes we have A(\sum n_i) = H(p_1,...,p_n) + \sum p_i A(n_i).  Let’s solve this for our unknown uncertainty function H(p_1,...,p_n).

H(p_1,...,p_n)=A(\sum n_i)-\sum p_i A(n_i)

= \sum p_i A(\sum n_i)- \sum p_i A(n_i) = \sum p_i [A(\sum n_i) - A(n_i)]

= K\sum p_i [\log(\sum n_i)-\log(n_i)]

= K\sum p_i \log\frac{\sum n_i}{n_i}

= K\sum p_i \log\frac{1}{p_i}

= -K\sum p_i \log p_i

The only things we did were recognize that, because the probabilities sum to one, a quantity weighted by the probabilities and summed is equal to that quantity itself (line 2), and substitute the equation for A(n) that we found above (line 3).

So, finally, we have solved our problem.  The uncertainty associated with a process that’s characterized by the set of probabilities \{p_i\} is given by the function

H(p_1,...,p_n)=-K\sum p_i \log p_i

where K is a positive constant.
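
As a sanity check, the die decomposition from requirement 3 holds numerically for this H (reusing the H function from the earlier snippet, with K=1 and base-2 logs):

# A(6): uncertainty of six equiprobable outcomes, computed directly
direct = H([1/6] * 6)
# Two-stage version: pick {1,2,3} vs {4,5,6}, then one of three
two_stage = H([0.5, 0.5]) + 0.5 * H([1/3] * 3) + 0.5 * H([1/3] * 3)
print(direct, two_stage)   # both ~2.585 bits, i.e. log2(6)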

Conclusion

Our original goal was to find the least biased distribution to use to model a process when we only have knowledge of the average and variance of a sub-population.  To do this we started by coming up with a metric for measuring our uncertainty when given a set of probabilities.  In Part 2 of this treatment we will maximize this uncertainty function while using our knowledge of average/variance as constraints on the probabilities.  In this way we work backwards from this uncertainty function and ultimately arrive at a functional form of our least-biased probability distribution.

References

This derivation and treatment follows almost exactly that found in Claude Shannon’s original work, A Mathematical Theory of Communication.  The image above was found on the Wikipedia page on information entropy and can be found here.

The Entropy of First Names

I was recently working on a coding challenge for a job related to analysis of the first names of people born in the USA.  The data set given was from the Social Security Administration and can be downloaded here.  There is a text file for each state, and every name entry has the associated gender, year, state, and count of people born that year with the stated name.  For example, here are the first few lines of the Alaska file:

AK,F,1910,Mary,14
AK,F,1910,Annie,12
AK,F,1910,Anna,10

While looking for inspiration for this challenge I found an interesting article that referenced the entropy of baby names.  Entropy measures the amount of information contained in a choice from a distribution.  I decided to explore the entropy concept further and the results are below.
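
As a flavor of the computation, here is a minimal sketch of the name-distribution entropy for one state file (the file name follows the SSA convention above; treat the details as assumptions rather than my challenge code):

import numpy as np
import pandas as pd

cols = ["state", "gender", "year", "name", "count"]
df = pd.read_csv("AK.TXT", names=cols)   # one state file, formatted as above

# Probability of each name among all recorded births in the file
p = df.groupby("name")["count"].sum()
p = p / p.sum()
entropy_bits = -np.sum(p * np.log2(p))
print(entropy_bits)   # higher entropy = more diverse naming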

Read more

Ants Solving the Traveling Salesman Problem

I have been working on building the stochastic optimization algorithm inspired by ant foraging called the Ant System, as described in Marco Dorigo’s paper, “The Ant System: Optimization by a colony of cooperating agents.”  Code for running it yourself using Python 3 can be found in my Ant System GitHub repo.  It is being used to find solutions to the Traveling Salesman Problem on the 30-city “Oliver 30” data set.  Initial results are below.  The general question is: given a map of cities, what is the shortest route that visits every city exactly once?  This has obvious applications to logistics and to routing traffic in optimal ways.

Read more

Hacking for Good

I have recently been going to the OpenDataSTL MeetUp group.  They are the local brigade of Code for America and aim to provide an outlet for people who have technical chops and are interested in serving their community in ways that go beyond manual labor.  The National Day of Civic Hacking is June 6th, and to prepare, OpenDataSTL has crammed May with cool events to raise awareness and get stuff done.  Things like:

  • Open Source Brewathon (results tasting on May 27th)
    • This event’s goal is to teach the community about the power of open source and version control, but via beer instead of software.  A handful of brewers started with an open brew recipe, made their modifications, and have concocted something new.  Come taste the delicious variety and use your (beer-lubricated) imagination on how this might be powerful for software.
  • Volunteer Fair (May 29th)
    • Imagine a career fair, but with non-profits looking for technical people like you to do interesting tasks.  This is for people that want to help, but in ways closer aligned to their professional abilities.
  • Hackathon (May 29th – 31st)
    • Once you have a great idea from the Volunteer Fair, come build it with other talented people at the hackathon.  Network, build relationships, and do good.

There are several other things going on earlier in the month.  Go to buildforstl.org to keep up to date.

Simulating Ant Foraging

I’ve recently been building some tools in Python and getting experience with PyGame as a means of visualizing moving objects.  You may have seen my post on a 2D platform game with my dog Bruce or on 2D circular collision detection.  The motivation for working on these fun but somewhat trivial projects is that they provide the framework for research into a more interesting application:  how ants solve problems by performing a type of distributed computing.

A very interesting article that makes similar connections can be found on Nautilus.  The idea is that ants aren’t very smart individually, but when many of them interact together, the colony can solve problems the individual cannot.  This is indicative of a more general phenomenon in complex systems: a bunch of relatively simple things interacting, without a central conductor, to do more complex things.  For instance:  the human brain; a school of fish; human civilization; the immune system; cells.  Understanding so-called “emergence” will be key to building true artificial intelligence.

My interest is in applying statistical physics and information theory to understand how information is communicated and processed among individuals to solve this type of problem for the colony.  The first step, however, is building a simulation environment that can get virtual ants to wiggle on a screen and tell you when they’ve found food.  This relies heavily on the 2D circular collision detection software mentioned above.  Currently, there is no interaction between ants so there is no distributed computation taking place, merely blind wandering.  This should be the null benchmark to which the quality of other solutions is compared.  Stay tuned!
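
As a stripped-down sketch of that null benchmark, minus the PyGame drawing (everything here is a toy stand-in for the real simulation):

import numpy as np

rng = np.random.default_rng()

def blind_forage(food_xy, food_radius=5.0, step=1.0, max_steps=100_000):
    # Random-walk ant starting at the origin; returns the number of steps
    # until it blunders within food_radius of the food, or None if it never does
    pos = np.zeros(2)
    for t in range(max_steps):
        theta = rng.uniform(0, 2 * np.pi)   # pick a random heading each step
        pos += step * np.array([np.cos(theta), np.sin(theta)])
        if np.linalg.norm(pos - food_xy) < food_radius:
            return t
    return None

print(blind_forage(np.array([30.0, 40.0])))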

2D Collision Detection and Handling with Python

I built a Python package for detecting and handling collisions of circles in 2D.  The code defines a class for the objects and handles the velocity updates using basic elastic collision mechanics.  The code can be downloaded from my GitHub repository, and an example output using PyGame as a display is shown below.  For details on the physics, along with a discussion of how this problem relates to Quantum Mechanics, keep reading!
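
The heart of the velocity update is the standard 2D elastic-collision formula; a minimal equal-mass version (which mirrors, but is not necessarily identical to, the code in the repo):

import numpy as np

def elastic_bounce(x1, v1, x2, v2):
    # Post-collision velocities for two equal-mass circles in 2D.
    # Velocity changes only along the line of centers; momentum and
    # kinetic energy are both conserved.
    d = x1 - x2
    k = np.dot(v1 - v2, d) / np.dot(d, d)
    return v1 - k * d, v2 + k * d

# Head-on collision: equal masses simply exchange velocities
v1p, v2p = elastic_bounce(np.array([0.0, 0.0]), np.array([1.0, 0.0]),
                          np.array([2.0, 0.0]), np.array([-1.0, 0.0]))
print(v1p, v2p)   # -> [-1. 0.] [1. 0.]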

Read more
