Category Archives: Programming

Getting Started with Python for Data Analysis

KDnuggets Silver Blog, July 2017

A friend recently asked this, and I thought it might benefit others if published here. This is for someone new to Python who wants the easiest path from zero to one.

  1. Download the Python 3.X version of the Anaconda distribution for your operating system here. You will avoid a lot of install-related headaches by choosing this pre-bundled distribution. It comes with most of the important data analysis packages pre-installed.
  2. Once you have it installed, test to make sure that the default Python interpreter is the one you’ve just installed. This is important because your system may already have a version of Python installed, but it won’t have all the good stuff in the Anaconda bundle, so you need to make sure the new one is the default. On Mac/Linux this might mean typing which python in the terminal, or you can just run the Python interpreter and make sure the version matches what you downloaded. If all went well, the installer will have handled this; if not, you’ll need to stop here and fix it.
  3. Issue the jupyter notebook command in your shell. This should open a browser window. If not, open a browser and navigate to http://localhost:8888. Once there, create a new Python notebook.
  4. Go to the kernels section of www.kaggle.com and filter to Python kernels. These are mostly Jupyter notebooks of other people doing analysis or building models on data sets that are freely available on Kaggle’s website. Look for titles with terms like EDA (Exploratory Data Analysis), as opposed to those building predictive models. Find one that’s interesting and start recreating it in your notebook.

Note: You’ll find that when you try to recreate some of these analyses, you get import errors. This is likely because the authors installed packages that are not bundled in the Anaconda distribution. You’ll eventually need to learn how to interact with the conda package manager, and this will be one of many rabbit holes you’ll eventually go down. Usually it’s as easy as conda install <package_name>, but you’ll need to find the right package name, and sometimes you’ll need to specify other details. Other times you’ll need to use pip install <other_package_name>, but you’ll learn all that later.

High Level Library Summary

Here’s a quick summary of the important libraries you’ll interact with frequently; a minimal sketch of how they fit together follows the list.

  • NumPy: has a lot of the core functionality for scientific computing. Under the hood it calls C-compiled code, so it is much faster than the same functions written in pure Python. Not the most user-friendly.
  • SciPy: similar to NumPy, but has more means for sampling from distributions, calculating test statistics, etc.
  • Matplotlib: the main plotting framework. A necessary evil.
  • Seaborn: import it after Matplotlib and it will make your plots a lot prettier by default. It also has its own functionality, but I find the coolest stuff runs too slow.
  • Pandas: mostly a thin wrapper around NumPy/SciPy to make them more user-friendly. Ideal for interacting with tables of data, which it calls a DataFrame. It also has wrappers around plotting functionality to enable quick plotting while avoiding the complications of MPL. I use Pandas more than anything for manipulating data.
  • Scikit-learn: Has a lot of supervised and unsupervised machine learning algorithms. Also has many metrics for doing model selection and a nice preprocessing library for doing things like Principal Component Analysis or encoding categorical variables.
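
Here’s a minimal sketch of how these libraries typically work together; the column names and random data below are made up for illustration.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

sns.set()  # prettier Matplotlib defaults

# Build a Pandas DataFrame on top of a NumPy array of fake data
df = pd.DataFrame(np.random.randn(100, 2), columns=["height", "weight"])

# Quick plot via the Pandas wrapper around Matplotlib
df.plot(kind="scatter", x="height", y="weight")
plt.show()

# Scikit-learn preprocessing, e.g. standardizing features
scaled = StandardScaler().fit_transform(df)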

Quick Tips

  1. When in a Jupyter notebook, put a question mark in front of any object before running the cell and it will open up the documentation for it. This is really handy when you’ve forgotten the details of what the function you’re trying to call is expecting you to pass. e.g. ?my_dataframe.apply will explain the apply method of the pandas.DataFrame object, represented here by my_dataframe (see the short example after this list).
  2. You will likely always need to refer to the documentation for whatever library you’re using, so just keep it open in your browser. There are just too many optional arguments and nuances to memorize.
  3. When it comes to the inevitable task of troubleshooting, Stack Overflow probably has the answer.
  4. Accept the fact that you’ll be doing things you don’t fully understand for a while, or you’ll get bogged down by details that aren’t that important. Some day you’ll probably need to understand virtual environments, and it’s really not that hard, but there are many detours like that which add unnecessary pain for someone getting started.
  5. Read other people’s code. It’s the best way to learn conventions and best practices. That’s where the Kaggle kernels really help. GitHub also supports the display of jupyter notebooks in the browser, so there are tons of examples on the internet.
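
As an example of tip 1, here’s what a notebook cell might look like (the DataFrame name is arbitrary, and the ? syntax is an IPython/Jupyter feature, not plain Python):

import pandas as pd

my_dataframe = pd.DataFrame({"a": [1, 2, 3]})

# Running this cell opens the documentation pane for DataFrame.apply
?my_dataframe.apply
# the trailing form, my_dataframe.apply?, works too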

A Python Data Science Workflow using Remote Servers

A good toolset and workflow can be a big productivity gain.  Unless you’re working exclusively on your personal computer for the whole pipeline, development workflows can quickly get messy.  I recently found a Python IDE that has great support for working on remote servers, along with many other nice features:  PyCharm.  I’ll briefly walk through some of the basics that have been a great relief to me.

Read more

Scraping Flight Data with Python

Note:  The code for this project can be found in this github repo.

I have been building a new project that requires the prices of flights.  I looked for APIs but couldn’t find any that were free.  However, I had a great experience using BeautifulSoup when scraping baseball data for my Fantasy Baseball auto-roster algorithm project, so I decided to try to get my own data that way.  To start, I chose a single airline:  Southwest.  In this post I’ll step you through a couple of basics, including how to scrape when you need to submit form data.
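
Here’s the general shape of a form-submission scrape, as a hedged sketch: the URL, form field names, and CSS class below are all made up for illustration; the real values come from inspecting the airline’s search form in your browser’s developer tools.

import requests
from bs4 import BeautifulSoup

# Hypothetical endpoint and form fields; replace with the real ones
# found by inspecting the search form's HTML.
url = "https://www.example.com/flight-search"
form_data = {
    "originAirport": "STL",
    "destinationAirport": "MDW",
    "departureDate": "2017-08-01",
}

response = requests.post(url, data=form_data)
soup = BeautifulSoup(response.text, "html.parser")

# Pull out whatever elements hold the prices (class name is made up)
for fare in soup.find_all("span", class_="fare-price"):
    print(fare.get_text(strip=True))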

Read more

How Ants Can Pick your Fantasy Baseball Team

Note:  The code for this project can be found in my github repo.  This wasn’t originally intended to be open-sourced, so I apologize for lack of documentation and hard-coded directory structure 🙂 

My friend Matt is a Daily Fantasy Sports (DFS) junkie.  He doesn’t have a math background but has heard the stories of guys using magic algorithms that tip the odds in their favor while they laugh their way to the bank.  Naturally, he was curious.  I’m not a huge sports fan, but it was a cool enough problem that he got me interested and we started playing around with it.

If you’re not familiar, baseball DFS works like this:  on a given night several teams are playing, you draft a fake team from among those players, and you are then awarded points based on their real-world performance.  Although the roster and scoring details depend on the platform you’re playing on, you’ll have to pick something like:  2 starting pitchers, 3 outfielders, 1 first baseman, 1 second baseman, 1 shortstop, etc.  The caveat is that each player is assigned a salary and you only have a certain amount of cash to spend, so you can’t just buy the best players in every position.  Pitchers are scored for things like strikeouts, and everyone else is scored on offensive outcomes, like singles and doubles.  There are a variety of contest types you can join, but the ones we played were such that those in the top 50% were paid from the entry fee pool of those in the bottom 50% (minus a platform fee).  So if there are 1000 people, you’re paid the same whether you’re in 1st place or 450th place.

Problem

It turns out that this is a variant of something called the Knapsack Problem.  Briefly, the Knapsack Problem goes like this: you have a sack that can hold a given volume of stuff, there are many valuable items to choose from, each with a different size and intrinsic value, and your goal is to choose the combination of items that maximizes value subject to the constraint on maximum volume.  In our case we’re trying to choose a collection of players that maximizes projected DFS points subject to a salary constraint.  In addition to the salary constraint, we also have a constraint on the number and types of players we need.
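
To make the analogy concrete, here’s a toy version of the roster knapsack solved by brute force.  The salaries and point projections are made up, and the real problem adds positional constraints on top of the salary cap.

from itertools import combinations

# Toy knapsack: pick 3 players under a $10,000 salary cap,
# maximizing projected points (all numbers are made up).
players = {
    "A": (4000, 12.1), "B": (3500, 10.4), "C": (2800, 9.0),
    "D": (2500, 7.2),  "E": (1900, 5.5),
}

best = max(
    (combo for combo in combinations(players, 3)
     if sum(players[p][0] for p in combo) <= 10000),
    key=lambda combo: sum(players[p][1] for p in combo),
)
print(best)  # ('A', 'B', 'D'): the highest-scoring affordable trio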

Approach

I had never heard of the knapsack problem, but it looked to me like just another combinatorial optimization problem, similar to the Traveling Salesman Problem.  Since I had been working with Ant Colony Optimization (ACO), I took a stab at adapting that approach in Python.  The pipeline starts by scraping data from a website (rotogrinders.com) that publishes projections of each player’s DFS points according to a model they had developed.  Those projections are what I would try to maximize with ACO.  The site also listed the salary of each player for the platform I was using, so it was a great data source.

Once I had a DataFrame with Player, Position, Salary, Projected_Points, I would run the following algorithm (a minimal sketch of the selection and evaporation steps follows the list):

  1. Randomly choose a position that needs to be filled.
  2. Filter the DataFrame to only include candidate players in that position whose salary also fits within the remaining budget.
  3. Calculate probabilities for each candidate proportional to a value derived by multiplying a greedy factor (Projected_Points) and a “pheromone” factor that is influenced by previous ant behavior.
    1. These two factors are each scaled by tunable exponents that set the degree to which “ants” are influenced by greedy information or previous ant behavior.
  4. Randomly choose among the candidates using these probabilities as weights.
  5. If a full roster was created, get the total Projected_Points and add a “pheromone” trail to each selected player by an amount proportional to total points.
  6. Apply global pheromone evaporation by multiplying all levels by a constant < 1.
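
Here is a minimal sketch of steps 3, 4, and 6, assuming candidates is the filtered DataFrame from step 2 and pheromone is a dict mapping player names to trail levels.  The function and variable names are mine, not the repo’s, and alpha/beta are the tunable exponents mentioned above.

import numpy as np

def choose_player(candidates, pheromone, alpha=1.0, beta=1.0, rng=np.random):
    # Steps 3-4: weight each candidate by a greedy factor (projected
    # points) and a pheromone factor, each scaled by a tunable exponent,
    # then sample one player using the normalized weights.
    greedy = candidates["Projected_Points"].values ** beta
    trail = np.array([pheromone[p] for p in candidates["Player"]]) ** alpha
    weights = greedy * trail
    return rng.choice(candidates["Player"].values, p=weights / weights.sum())

def evaporate(pheromone, rho=0.9):
    # Step 6: global evaporation, multiplying all levels by a constant < 1
    return {player: rho * level for player, level in pheromone.items()}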

The main idea is that you’re making a choice that balances local, greedy information with global information.  So even if someone looks like a bad choice because a different player has more projected points, they might be a great addition to the roster because salary savings on this position can be used in another position with greater effect.

Results

Although I made tweaks along the way and ultimately built a system that could take my preferences into account, the results were beyond my expectations.  Specifically, I beat the median score 46% of the time.  That’s not enough to win money, but it’s interesting that someone like me, with little-to-no baseball expertise, can build an algorithm that performs about evenly with humans who are confident enough in their choices to bet money.

Nuances

Ultimately, the performance of my system can only be as good as the player projections.  If I wanted to do this with aspirations of making money, I would definitely want to build my own projected-points model so that I could at least quantify the uncertainty.

Another problem is that intra-team player performance is correlated.  If a pitcher is having a terrible night, the entire opposing team has an elevated potential.  Further, if you have great hitters around you, there are more chances for you to score points on RBIs or to get home once you’re on base.  Also, a more successful hitting team means more people coming up to the plate, which translates to more chances.

Presumably the projected points model takes some of these things into consideration, but it’s a fundamentally different optimization problem when the values are not only random variables, but correlated.  For instance, it’s a common strategy to “stack” your roster such that you have multiple players from a single real-life team.  The idea is that even if you wouldn’t normally pick a certain player based on their individual merit, you might on this night because they’re playing some bad team and will likely score many runs, which means a lot of DFS points are going to be spread out on that team.  It works both ways, of course, and a poorly performing team makes it hard for an individual to get many points.

Next Steps

So does it make more sense to leverage the correlation by stacking players from teams that have a high probability of winning, or should you choose players that reduce correlation and therefore reduce risk of collective under-performance?

The DFS players that make the most money seem to do so by taking a combination approach:  stack individual teams, but generate many different rosters and therefore participate in many different contests.  It’s worth noting that other contest types are scored differently and in many cases the payout depends on your relative rank.  The answer to the above question might depend on the details of contest payout as well as risk-appetite.

The next steps are thinking about what optimization means in this two-stage system of single-roster optimization and then optimization of a collection of rosters.  How can you leverage stochastic uncertainty and correlation in player performance such that you’re choosing a collection of rosters that have the best chance of performing?  My current implementation cannot generate many variations on a roster without manual intervention because it has no concept of stochastic performance; it merely assumes the projected points estimate is the value of the candidate.

If one had a model that produced probability distributions for individual players that took intra-team correlations into account, then this could be sampled from to simulate an outcome, from which ACO could be run using these assumed performance values.  This is a sort of Markov-Chain Monte Carlo (MCMC) approach.
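
As a sketch of that idea, assume you had a model producing a mean vector mu of projected points and a covariance matrix cov encoding intra-team correlations; each simulated outcome can then be fed to the optimizer.  Note that run_aco below is a hypothetical stand-in for the ACO routine sketched earlier, not an existing function.

import numpy as np

def sample_rosters(mu, cov, n_samples=100, seed=None):
    # Each draw simulates one correlated night of player performance;
    # running ACO on each draw yields a collection of candidate rosters.
    rng = np.random.default_rng(seed)
    rosters = []
    for _ in range(n_samples):
        simulated_points = rng.multivariate_normal(mu, cov)
        rosters.append(run_aco(simulated_points))  # run_aco: hypothetical ACO wrapper
    return rosters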

Broader Connections

A solution of this problem likely has value outside of the limited world of DFS.  For instance, consider the task of choosing a portfolio of investments.  Similarly, you may want to both take advantage of big upswings when correlations are in your favor and simultaneously hedge against risk by diversifying to break correlations.

The Bayesian Approach to Linear Regression

I’ve long been interested in the Bayesian approach to statistical modeling and machine learning.  I recently dug deep into an example in Bishop’s text to re-create his results and paraphrase my understanding of it.  Much like how you would want an online learning system to work, this work shows in detail how each data point updates the posterior distribution, which becomes the next iteration’s prior.  In turn, you can observe the uncertainty in the model’s estimate improving.

What I like most about the Bayesian treatment is that you get a full probability distribution over your model parameters, not just a point estimate of their values.  This allows you to ensemble many different models, each of which is weighted by its respective probability.  This is a far more robust estimate than the common Maximum Likelihood approach.
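
For the conjugate Gaussian case that Bishop works through, the posterior update has a closed form.  Here’s a minimal sketch, assuming a zero-mean isotropic Gaussian prior with precision alpha and known noise precision beta; the design matrix phi holds one row of basis-function values per data point, and the default alpha/beta are illustrative.

import numpy as np

def posterior(phi, t, alpha=2.0, beta=25.0):
    # Closed-form Gaussian posterior over the weights:
    # S_N^{-1} = alpha*I + beta * Phi^T Phi,  m_N = beta * S_N Phi^T t
    S_N = np.linalg.inv(alpha * np.eye(phi.shape[1]) + beta * phi.T @ phi)
    m_N = beta * S_N @ phi.T @ t
    return m_N, S_N

# Sequential learning: recompute with the rows seen so far after each new
# point; the posterior after n points acts as the prior for point n+1.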

If you find the treatment useful, feel free to fork the GitHub code and play with it yourself.  Find the notebook here.

Bayesian estimates within +/- 1 stdev from the mean using only two data points

The Entropy of First Names

I was recently working on a coding challenge for a job related to analysis of the first names of people born in the USA.  The data set given was from the Social Security Administration and can be downloaded here.  There is a text file for each state, and for every name entry there is the associated gender, year, state, and count of people born that year with the stated name.  For example, here are the first few lines of the Alaska file:

AK,F,1910,Mary,14
AK,F,1910,Annie,12
AK,F,1910,Anna,10

While looking for inspiration for this challenge I found an interesting article that referenced the entropy of baby names.  Entropy measures the amount of information contained in a choice from a distribution.  I decided to explore the entropy concept further and the results are below.
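
Recall that for a distribution over names with probabilities p_i, the Shannon entropy is H = -sum_i p_i * log2(p_i).  Here’s a short sketch computing it for one year, assuming the Alaska file is named AK.TXT and formatted as shown above:

import math
from collections import Counter

counts = Counter()
with open("AK.TXT") as f:
    for line in f:
        state, gender, year, name, count = line.strip().split(",")
        if year == "1910":
            counts[name] += int(count)

total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print("Entropy of 1910 Alaska names: {:.2f} bits".format(entropy))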

Read more

Ants Solving the Traveling Salesman Problem

I have been working on building the stochastic optimization algorithm inspired by ant foraging, called the Ant System, as described in Marco Dorigo’s paper, “The Ant System: Optimization by a colony of cooperating agents.”  Code for running it yourself using Python 3 can be found in my Ant System github repo.  It is being used to find solutions to the Traveling Salesman Problem on the 30-city “Oliver 30” data set; initial results are below.  The general question is: given a map of cities, what is the shortest path that visits every city exactly once?  This has obvious applications to logistics or routing traffic in optimal ways.
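
The heart of the Ant System is the city-transition rule: an ant at city i moves to an unvisited city j with probability proportional to tau[i, j]**alpha * (1/dist[i, j])**beta, where tau is the pheromone matrix and dist the distance matrix.  Here’s a minimal sketch with my own naming, not necessarily the repo’s:

import numpy as np

def next_city(current, unvisited, tau, dist, alpha=1.0, beta=5.0, rng=np.random):
    # Weight each unvisited city by pheromone (tau) and inverse distance,
    # each raised to its tunable exponent, then sample proportionally.
    unvisited = list(unvisited)
    weights = np.array([
        tau[current, j] ** alpha * (1.0 / dist[current, j]) ** beta
        for j in unvisited
    ])
    return rng.choice(unvisited, p=weights / weights.sum())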

Read more

Hacking for Good

I have recently been going to the OpenDataSTL MeetUp group.  They are the local brigade of Code for America and aim to provide an outlet for people who have technical chops and are interested in serving their community in ways that go beyond manual labor.  The National Day of Civic Hacking is June 6th, and to prepare, OpenDataSTL has crammed May with cool events to raise awareness and get stuff done.  Things like:

  • Open Source Brewathon (results tasting on May 27th)
    • This event’s goal is to teach the community about the power of open source and version control, but via beer instead of software.  A handful of brewers started with an open brew recipe, made their modifications, and have concocted something new.  Come taste the delicious variety and use your (beer-lubricated) imagination on how this might be powerful for software.
  • Volunteer Fair (May 29th)
    • Imagine a career fair, but with non-profits looking for technical people like you to do interesting tasks.  This is for people that want to help, but in ways closer aligned to their professional abilities.
  • Hackathon (May 29th – 31st)
    • Once you have a great idea from the Volunteer Fair, come build it with other talented people at the hackathon.  Network, build relationships, and do good.

There are several other things going on earlier in the month.  Go to buildforstl.org to keep up to date.

Simulating Ant Foraging

I’ve recently been building some tools in Python and getting experience with PyGame as a means of visualizing moving objects.  You may have seen my post on a 2D platform game with my dog Bruce or on 2D circular collision detection.  The motivation for working on these fun but somewhat trivial projects is that they enable the framework for research into a more interesting application:  how ants solve problems by performing a type of distributed computing.

A very interesting article that makes similar connections can be found on Nautilus.  The idea is that ants aren’t very smart individually, but when many of them interact together the colony can solve problems the individual cannot.  This is indicative of a more general phenomenon in complex systems: how a bunch of relatively simple things can interact without a central conductor to do more complex things.  For instance:  the human brain; a school of fish; human civilization; the immune system; cells.  Understanding so-called “emergence” will be key to building true artificial intelligence.

My interest is in applying statistical physics and information theory to understand how information is communicated and processed among individuals to solve this type of problem for the colony.  The first step, however, is building a simulation environment that can get virtual ants to wiggle on a screen and tell you when they’ve found food.  This relies heavily on the 2D circular collision detection software mentioned above.  Currently, there is no interaction between ants, so there is no distributed computation taking place, merely blind wandering.  This should be the null benchmark against which the quality of other solutions is compared.  Stay tuned!

2D Collision Detection and Handling with Python

I built a Python package for detecting and handling collisions of circles in 2D.  The code defines a class for the objects and handles the velocity updates using basic elastic collision mechanics.  The code can be downloaded from my GitHub repository, and an example output using Pygame as a display is shown below.  For details on the physics, along with a discussion of how this problem relates to Quantum Mechanics, keep reading!
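
The core of the velocity update is standard elastic collision mechanics: only the velocity components along the line of centers change.  Here’s a minimal sketch; the function and variable names are mine, not necessarily the package’s.

import numpy as np

def elastic_collision(x1, v1, x2, v2, m1=1.0, m2=1.0):
    # Positions x and velocities v are 2D numpy arrays.  Project the
    # relative velocity onto the line of centers and exchange momentum
    # along that line only; tangential components are unchanged.
    d = x1 - x2
    scale = (v1 - v2) @ d / (d @ d)
    v1_new = v1 - (2 * m2 / (m1 + m2)) * scale * d
    v2_new = v2 + (2 * m1 / (m1 + m2)) * scale * d
    return v1_new, v2_new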

Read more
