Category Archives: Project

Scraping Flight Data with Python

Note:  The code for this project can be found in this github repo.

I have been building a new project that requires the prices of flights.  I looked for APIs but couldn’t find any that were free.  However, I had a great experience using BeautifulSoup when scraping baseball data for my Fantasy Baseball auto-roster algorithm project, so I decided to try to get my own data that way.  To start, I chose a single airline:  southwest.  In this post I’ll step you through a couple basics, including how to scrape when you need to submit form data.

Read more

How to Use Google Analytics and Tag Manager with WordPress

If you have a WordPress blog you might want to integrate Google Analytics (GA).  Instead of doing it directly with a plugin, however, consider integrating it by using Google’s Tag Manager (GTM) service.  GTM gives you far more flexibility in tracking and will make your Google Analytics experience much more rich.  I’ll give you a brief run down and a quick example to get you started.

Let’s say you’ve written a blog and want to find which articles people are finding most interesting.  You decide that instead of having all of the text displayed for every post, instead you’ll write one paragraph and then include a “Read More” button that expands the post [This is a standard option in the WordPress post editor].  Then, you want to track clicks on any of these “Read More” buttons.  This means you need two things:  1) a piece of code running on the page that will signal when a click event on the right objects happens and 2) a destination to record this data.  GTM helps you build/deploy the listener and GA will record the signal.

First, go to both analytics.google.com and tagmanager.google.com to create free accounts if you haven’t already.  You’ll need to follow the instructions and type in your domain.  Let’s start by building the GA goal.

Click Admin -> Goals -> New Goal.  In Goal Setup, choose Custom instead of a Template.  In Goal Description, name it ReadMoreButton and select “Event” for the Type.  In Goal Details we’ll fill out the following information:  Category Equals to ReadMore; Action Equals to Click; Label Equals to ReadMore; Value Equals to 1.  We’ve just defined a goal and it’s defined by these event details.  We will use the same information in defining the signal that GTM will send to establish the hand-shake between the two services.

Now go to GTM.  First, you need to deploy general Google Analytics to your site.  (Note:  If you previously had a GA WordPress plugin installed, deactivate it as this will take its place).  In your Workspace, click Add A New Tag. Click the tag icon in the Tag Configuration card and select Universal Analytics.  It will ask for your Tracking ID from GA.  To get that, navigate to GA and click Admin -> Tracking Info -> Tracking Code.  Copy and paste the text that looks like “UA-XXXXXXXX”.  Leave Track Type to Page View and then click the icon in Triggers to select “All Pages”.  Change the title of the Tag in the upper left corner of the screen to “ga-integration” and then save your changes.

This creates the code that needs to be embedded in all of your web pages that will allow Google Analytics to work on your blog.  Instead of manually doing that, download the “DuracellTomi’s Google Tag Manager for WordPress” plugin.  Once downloaded, activate and then just paste in your GTM id that looks like: “GTM-XXXXXX”.

Let’s pause and test the general GA integration.  Open up a tab that has your GA dashboard and navigate to Reporting -> Real Time -> Overview.  It should show a 0 because we never deployed our GTM tags/triggers.  Open up another tab that has your GTM workspace.  Next to the “Publish” button, expand the arrow and choose “Preview”.  Now, open up a third tab and navigate to your website.  If everything is working correctly you should have a GTM toolbar on the bottom of your page that contains Tags Fired On This Page: ga-integration.  Your GA dashboard should now display a number > 0, representing you and whoever else might be there.

Now let’s create a tag/trigger for our Read More event.  Let’s first create the tag in GTM by clicking Tags -> New.  Give it a title of “ReadMoreTag” in the upper left corner.  As before, in Tag Configuration choose Universal Analytics and add your Tracking ID.  However, this time change Track Type to Event.  Then, fill out Category/Action/Label/Value in exactly the same manner as we did when creating the goal in GA.  This will result in GTM sending a signal to GA that has all the information GA needs to recognize this event as our goal.  Leave everything else as the defaults.  Also, leave “Triggering” blank for a minute and save.

We ignore triggering for a moment because, by default, several options are not available to us.  To enable them, navigate on the sidebar to Variables and click Configure in the Built-In Variables section. I enabled everything in the Clicks and Forms section, though not all are needed for this tutorial.

To create our tag’s trigger, click Triggers -> New and then the icon.  From the menu choose Just Links from the Click category.  Under “This trigger fires on” select “Some Link Clicks”.  This is where we will filter to only Read More links.  In my case I filled this out to look like:  Click classes equals more-link.  This translates to: only trigger this tag if the link that they clicked on has the CSS class of “more-link”.  To find your CSS class via the Chrome browser, right click on a Read More link and choose Inspect to view the html code on your page.  The code for mine looks like this:

<a href=”http://zakjost.com/2016/01/deriving-the-normal-distribution-part-2/#more-257″ class=”more-link”><span>Read more</span></a>

Note that the class is equal to “more-link”, but yours might be different.  Fill in the relevant CSS class information for the filter, give your trigger a name and save it.

Finally, we need to assign this trigger to your tag.  Click Tags -> ReadMoreTag and then select our new trigger in the Triggering section.

To test, you should now go into Preview Mode and navigate to your site to test that the triggers fire.  A quick tip:  instead of clicking the Read More links normally, which refreshes the page and erases the GTA toolbar information, do a Cmd+click or Ctrl+click to open a new tab.  This should preserve the data in the existing tab toolbar.  If everything was done correctly you should see the toolbar say “Tags Fired On This Page: ga-integration, ReadMoreTag” after you Cmd+click a Read More link.

And lastly, if you return to GA and go to Reporting -> Real Time -> Conversions, you should see the ReadMoreButton Goal we created increment to a 1.  Now you will know every time someone clicks a Read More button on your blog.  Remember to turn off Preview Mode and Publish your work once you’re ready to deploy for real.  Good luck!

How Ants Can Pick your Fantasy Baseball Team

Note:  The code for this project can be found in my github repo.  This wasn’t originally intended to be open-sourced, so I apologize for lack of documentation and hard-coded directory structure 🙂 

My friend Matt is a Daily Fantasy Sports (DFS) junky.  He doesn’t have a math background but has heard the stories of guys using magic algorithms that tip the odds in their favor while they laugh their way to the bank.  Naturally, he was curious.  I’m not a huge sports fan, but it was a cool enough problem that he got me interested and we started playing around with it.

If you’re not familiar, baseball DFS works like this:  on a given night several teams are playing and you draft a fake team among those players and are then awarded points based on their real-world performance.  Although the roster and scoring details depend on the platform you’re playing on, you’ll have to pick something like:  2 starting pitchers, 3 outfielders, 1 first basement, 1 second basemen, 1 short stop…etc.  The caveat is that each player is assigned a salary and you only have a certain amount of cash to spend, so you can’t just buy the best players in every position.  Pitchers are scored for things like strikeouts and everyone else is scored on offensive outcomes, like singles and doubles.  There are a variety of contest types you can join but the ones we played were such that those in the top 50% were paid from the entry fee pool of those in the bottom 50% (minus a platform fee).  So if there’s 1000 people, you’re paid the same whether you’re in 1st place or 450th place.

Problem

It turns out that this is a variant of something called the Knapsack Problem.  Briefly, the Knapsack Problem is such that you have a sack that can hold a given volume of stuff and there are many valuable items to choose from, each with a different size and intrinsic value, and your goal is to choose the combination of items that maximizes value subject to the constraint on maximum volume.  In our case we’re trying to choose a collection of players that maximizes projected DFS points subject to a salary constraint.  In addition to the salary constraint, we also have a constraint on the number and types of players we need.

Approach

I had never heard of the knapsack problem but it looked to me like just another combinatorial optimization problem, similar to the Traveling Salesman problem.  Since I had been working with Ant Colony Optimization (ACO) I took a stab at adopting that approach in Python.  The pipeline starts by scraping data from a website that has projections on each player’s DFS points according to some model they had developed.  That is what I would try to maximize with ACO.  This site also listed the salary of each player for the platform I was using, so it was a great data source (rotogrinders.com).

Once I had a DataFrame with Player, Position, Salary, Projected_Points, I would run the following algorithm:

  1. Randomly choose a position that needs filled
  2. Filter the DataFrame to only include candidate players in that position whose salary was also beneath the remaining budget
  3. Calculate probabilities for each candidate that was proportional to a value derived from multiplying a greedy factor (Projected_Points) and a “pheromone” factor that is influenced by previous ant behavior.
    1. These two factors are each scaled by tunable exponents that thereby set the degree to which “ants” are influenced by greedy information or previous ant behavior
  4. Randomly choose among the candidates using these probabilities as weights
  5. If a full roster was created, get the total Projected Points and add a “pheromone” trail to each selected player by an amount that’s proportional to total points.
  6. Apply global pheromone evaporation by multiplying all levels by a constant < 1.

The main idea is that you’re making a choice that balances local, greedy information with global information.  So even if someone looks like a bad choice because a different player has more projected points, they might be a great addition to the roster because salary savings on this position can be used in another position with greater effect.

Results

Although I made tweaks along the way and ultimately built a system that could take my preferences into account, the results were beyond my expectations.  Specifically, I beat the median score 46% of the time.  That’s not enough to win money, but it’s interesting that someone like me, with little-to-no baseball expertise, can build an algorithm that performs about evenly with humans that are confident enough in their choices to bet money.

Nuances

Ultimately, the performance of my system can only be as good as the player projections.  If I were wanting to do this with aspirations of making money, I would definitely want to build my own projected player points model so that I could at least quantify the uncertainty.

Another problem is that intra-team player performance is correlated.  If a pitcher is having a terrible night, the entire opposing team has an elevated potential.  Further, if you have great hitters around you, there’s more chances for you to score points on RBIs or to get to home once you’re on base.  Also, a more successful hitting team means more people coming up to the plate, which translates to more chances.

Presumably the projected points model takes some of these things into consideration, but it’s a fundamentally different optimization problem when the values are not only random variables, but correlated.  For instance, it’s a common strategy to “stack” your roster such that you have multiple players from a single real-life team.  The idea is that even if you wouldn’t normally pick a certain player based on their individual merit, you might on this night because they’re playing some bad team and will likely score many runs, which means a lot of DFS points are going to be spread out on that team.  It works both ways, of course, and a poorly performing team makes it hard for an individual to get many points.

Next Steps

So does it make more sense to leverage the correlation by stacking players from teams that have a high probability of winning, or should you choose players that reduce correlation and therefore reduce risk of collective under-performance?

The DFS players that make the most money seem to do so by taking a combination approach:  stack individual teams, but generate many different rosters and therefore participate in many different contests.  It’s worth noting that other contest types are scored differently and in many cases the payout depends on your relative rank.  The answer to the above question might depend on the details of contest payout as well as risk-appetite.

The next steps are thinking about what optimization means in this two-stage system of single-roster optimization and then optimization of a collection of rosters.  How can you leverage stochastic uncertainty and correlation in player performance such that you’re choosing a collection of rosters that have the best chance of performing?  My current implementation cannot generate many variations on a roster without manual intervention because it has no concept of stochastic performance; it merely assumes the projected points estimate is the value of the candidate.

If one had a model that produced probability distributions for individual players that took intra-team correlations into account, then this could be sampled from to simulate an outcome, from which ACO could be run using these assumed performance values.  This is a sort of Markov-Chain Monte Carlo (MCMC) approach.

Broader Connections

A solution of this problem likely has value outside of the limited world of DFS.  For instance, consider the task of choosing a portfolio of investments.  Similarly, you may want to both take advantage of big upswings when correlations are in your favor and simultaneously hedge against risk by diversifying to break correlations.

The Bayesian Approach to Linear Regression

I’ve long been interested in the Bayesian approach to statistical modeling and machine learning.  I recently dug in deep to an example in Bishop’s text to re-create his results and paraphrase my understanding of it.  Much like how you would want an online learning system to work, this work shows in detail how each data point updates the posterior distribution and becomes the next iteration’s prior.  In turn, you can observe the uncertainty in the model’s estimate improving.

What I like most about the Bayesian treatment is that you get a full probability distribution for your model parameters, not just a point estimate of its value.  This allows you to ensemble many different models, each of which is weighted by its respective probability.  This is a far more robust estimate than the common Maximum Likelihood approach.

If you find the treatment useful, feel free to fork the GitHub code and play with it yourself.  Find the notebook here.

Bayesian estimates within +/- 1 stdev from the mean using only two data points

Bayesian estimates within +/- 1 stdev from the mean using only two data points

The Entropy of First Names

I was recently working on a coding challenge for a job related to analysis of the first names of people born in the USA.  The data set given was from the Social Security Administration and can be downloaded here.  There is a text file for each state and for every name entry there is the associated gender, year, state, and count of people born that year with the stated name.  For example, here are the first few lines of the Arkansas file:

AK,F,1910,Mary,14
AK,F,1910,Annie,12
AK,F,1910,Anna,10

While looking for inspiration for this challenge I found an interesting article that referenced the entropy of baby names.  Entropy measures the amount of information contained in a choice from a distribution.  I decided to explore the entropy concept further and the results are below.

Read more

Ants Solving the Traveling Salesman Problem

I have been working on building the stochastic optimization algorithm inspired by ant foraging called the Ant System as described in Marco Dorigo’s paper, “The Ant System: Optimization by a colony of cooperating agents.”  Code for running it yourself using Python 3 can be found in my Ant System github repo.  It is being used to find solutions to the Traveling Salesman problem on the 30-city “Oliver 30” data set.  Initial results below.  The general question is: given a map of cities, what is the minimum path to take such that you visit all cities precisely one time?  This has obvious applications to logistics or routing traffic in optimal ways.

Read more

Simulating Ant Foraging

I’ve recently been building some tools in Python and getting experience with PyGame as a means to visualizing moving objects.  You may have seen my post on a 2D platform game with my dog Bruce or 2D circular collision detection.  The motivation for working on these fun but somewhat trivial projects is that they enable the framework for research into a more interesting application:  how ants solve problems by performing a type of distributed computing.

A very interesting article that makes similar connections can be found on Nautilus.  The idea is that ants aren’t very smart individually, but when many of them interact together the colony can solve problems the individual cannot.  This is indicative of a more general phenomena in complex systems–how a bunch of relatively simple things can interact without a central conductor to do more complex things.  For instance:  the human brain; a school of fish; human civilization; the immune system; cells.  Understanding so-called “emergence” will be key to building true artificial intelligence.

My interest is in applying statistical physics and information theory to understand how information is communicated and processed among individuals to solve this type of problem for the colony.  The first step, however, is building a simulation environment that can get virtual ants to wiggle on a screen and tell you when they’ve found food.  This relies heavily on the 2D circular collision detection software mentioned above.  Currently, there is no interaction between ants so there is no distributed computation taking place, merely blind wandering.  This should be the null benchmark to which the quality of other solutions is compared.  Stay tuned!

2D Collision Detection and Handling with Python

I built a python package for detecting and handling collisions of circles in 2D.  The code defines a class for the objects and handles the velocity updates using basic elastic collision mechanics.  The code can be downloaded from my GitHub repository and an example output using Pygame as a display is shown below.  For details on the physics along with a discussion of how this problem relates to Quantum Mechanics, keep reading!

Read more

Using Python/PyGame to build a platform game

I have been playing around with the book Learning Python with Raspberry Pi.  Chapter 5 steps the reader through creating a Mario Brothers-type 2D platform game.  Using that to start I modified it to include my dog Bruce doing his favorite thing in the world:  chasing a ball.  A short video is below.  You can get the python code to modify/play on GitHub.

 

Learning D3.js

I’ve been learning how to program in R via Coursera and became interested in visualization.  I know very little about html/CSS but stumbled upon some d3.js tutorials at Scott Murray’s site that are very enjoyable and practical.  After a few hours I generated some random data and was able to make this dynamic visualization.  Notice that each time you refresh the page the data are different (after you run the animation).  This is because your browser is running the javascript and hence random number generation for the data set.  This also works on mobile devices.  Now that I have some of the machinery, hopefully it can be put to use to make less trivial examples.  A current project is exploring structure in an e-mail system based on To/From and time stamps.  Visualizing the dynamics of an e-mail chain through a network could be entertaining if nothing else!

1 2