Tag Archives: Python

Getting Started with Python for Data Analysis

KDnuggets Silver Blog, July 2017

A friend recently asked about this, and I thought it might benefit others if I published the answer here. This is for someone new to Python who wants the easiest path from zero to one.

  1. Download the Python 3.X version of the Anaconda distribution for your operating system here. You will avoid a lot of install-related headaches by choosing this pre-bundled distribution. It comes with most of the important data analysis packages pre-installed.
  2. Once you have it installed, test that the default Python interpreter is the one you’ve just installed. This is important because your system may already have a version of Python installed, but it won’t have all the good stuff in the Anaconda bundle, so you need to make sure the new one is the default. On Mac/Linux this might mean typing which python in the terminal. Or you can just run the Python interpreter and make sure the version matches what you downloaded. If all went well, the installer will have taken care of this. If not, you’ll need to stop here and fix it.
  3. Issue the jupyter notebook command in your shell. This should open a browser window. If not, open a browser and navigate to http://localhost:8888. Once there, create a new Python notebook.
  4. Go to the kernels section of www.kaggle.com and filter to Python kernels. These are mostly Jupyter notebooks of other people doing analysis or building models on data sets that are freely available on Kaggle’s website. Look for titles with things like EDA (Exploratory Data Analysis), as opposed to those building predictive models. Find one that’s interesting and start recreating it in your notebook.

Note: You’ll find that when you try to recreate some of these analyses that you get import errors. This is likely because they’ve installed packages that are not bundled in the Anaconda distribution. You’ll eventually need to learn how to interact with the conda package manager and this will be one of many rabbit holes you’ll eventually go down. Usually it’s as easy as conda install <package_name> but you’ll need to find the right package name and sometimes you’ll need to specify other details. And other times you’ll need to use pip install <other_package_name>, but you’ll learn all that later.

High Level Library Summary

Here’s a quick summary of the important libraries you’ll interact with frequently.

  • NumPy: has a lot of the core functionality for scientific computing. Under the hood it calls C-compiled code, so it is much faster than the same functions written in pure Python. Not the most user-friendly.
  • SciPy: similar to NumPy, but with more tools for sampling from distributions, calculating test statistics, etc.
  • Matplotlib: the main plotting framework. A necessary evil.
  • Seaborn: import it after Matplotlib and it will make your plots a lot prettier by default. It also has its own functionality, but I find the coolest stuff runs too slow.
  • Pandas: mostly a thin wrapper around NumPy/SciPy that makes them more user-friendly. Ideal for interacting with tables of data, which it calls a DataFrame. It also wraps plotting functionality so you can make quick plots while avoiding the complications of Matplotlib. I use Pandas more than anything for manipulating data.
  • Scikit-learn: has a lot of supervised and unsupervised machine learning algorithms. It also has many metrics for model selection and a nice preprocessing library for things like Principal Component Analysis or encoding categorical variables.
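A minimal sketch of how a couple of these libraries fit together (the numbers are toy data, nothing project-specific):

```python
import numpy as np
import pandas as pd

# NumPy: fast, vectorized math on arrays (C under the hood)
arr = np.array([1.0, 2.0, 3.0, 4.0])
print(arr.mean())  # 2.5

# Pandas: a DataFrame is a labeled table built on top of NumPy arrays
df = pd.DataFrame({"x": arr, "y": arr ** 2})
print(df["y"].max())  # 16.0

# Pandas also wraps Matplotlib, so a quick plot is one line, e.g.:
# df.plot(x="x", y="y")
```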

Quick Tips

  1. When in a Jupyter notebook, put a question mark in front of any object before running the cell and it will open up the documentation for it. This is really handy when you’ve forgotten the details of what the function you’re trying to call expects you to pass. e.g. ?my_dataframe.apply will explain the apply method of the pandas.DataFrame object, represented here by my_dataframe.
  2. You will likely always need to refer to the documentation for whatever library you’re using, so just keep it open in your browser. There are just too many optional arguments and nuances to memorize.
  3. When it comes to the inevitable task of troubleshooting, Stack Overflow probably has the answer.
  4. Accept the fact that you’ll be doing things you don’t fully understand for a while, or you’ll get bogged down by details that aren’t that important. Some day you’ll probably need to understand virtual environments, and it’s really not that hard, but there are many detours like that which add unnecessary pain for someone getting started.
  5. Read other people’s code. It’s the best way to learn conventions and best practices. That’s where the Kaggle kernels really help. GitHub also renders Jupyter notebooks in the browser, so there are tons of examples on the internet.

Scraping Flight Data with Python

Note:  The code for this project can be found in this GitHub repo.

I have been building a new project that requires the prices of flights.  I looked for APIs but couldn’t find any that were free.  However, I had a great experience using BeautifulSoup when scraping baseball data for my Fantasy Baseball auto-roster algorithm project, so I decided to try to get my own data that way.  To start, I chose a single airline:  Southwest.  In this post I’ll step you through a couple of basics, including how to scrape when you need to submit form data.
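As a rough sketch of the pattern:  you POST the search form, then parse prices out of the response.  Everything here is hypothetical (the URL, form field names, and CSS classes come from inspecting the real search form in your browser’s developer tools); the live request is commented out and replaced with a canned snippet so the example runs offline.

```python
from bs4 import BeautifulSoup
# import requests  # needed for the live version

# Hypothetical form fields -- find the real names by inspecting the
# airline's search form in your browser's developer tools:
# form_data = {"origin": "AUS", "destination": "LAX", "date": "2017-08-01"}
# html = requests.post("https://www.example-airline.com/search", data=form_data).text

# Canned response (made-up markup) so the example runs offline:
html = """
<div class="fare"><span class="price">$98</span></div>
<div class="fare"><span class="price">$152</span></div>
"""
soup = BeautifulSoup(html, "html.parser")
prices = [tag.get_text() for tag in soup.select(".fare .price")]
print(prices)  # ['$98', '$152']
```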

Read more

How Ants Can Pick your Fantasy Baseball Team

Note:  The code for this project can be found in my GitHub repo.  This wasn’t originally intended to be open-sourced, so I apologize for the lack of documentation and the hard-coded directory structure 🙂

My friend Matt is a Daily Fantasy Sports (DFS) junky.  He doesn’t have a math background but has heard the stories of guys using magic algorithms that tip the odds in their favor while they laugh their way to the bank.  Naturally, he was curious.  I’m not a huge sports fan, but it was a cool enough problem that he got me interested and we started playing around with it.

If you’re not familiar, baseball DFS works like this:  on a given night several teams are playing, and you draft a fake team from among those players and are then awarded points based on their real-world performance.  Although the roster and scoring details depend on the platform you’re playing on, you’ll have to pick something like:  2 starting pitchers, 3 outfielders, 1 first baseman, 1 second baseman, 1 shortstop, etc.  The caveat is that each player is assigned a salary and you only have a certain amount of cash to spend, so you can’t just buy the best players in every position.  Pitchers are scored for things like strikeouts, and everyone else is scored on offensive outcomes, like singles and doubles.  There are a variety of contest types you can join, but the ones we played were such that those in the top 50% were paid from the entry fee pool of those in the bottom 50% (minus a platform fee).  So if there are 1000 people, you’re paid the same whether you’re in 1st place or 450th place.


It turns out that this is a variant of something called the Knapsack Problem.  Briefly:  you have a sack that can hold a given volume of stuff, there are many valuable items to choose from, each with a different size and intrinsic value, and your goal is to choose the combination of items that maximizes value subject to the constraint on maximum volume.  In our case we’re trying to choose a collection of players that maximizes projected DFS points subject to a salary constraint.  In addition to the salary constraint, we also have a constraint on the number and types of players we need.
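A toy version of the roster knapsack (made-up players, salaries, and projections) can be solved by brute force; a real slate has far too many combinations to enumerate, which is what motivates a heuristic:

```python
from itertools import combinations

# Made-up player pool: name -> (salary, projected DFS points)
players = {
    "A": (5000, 30.0),
    "B": (6000, 35.0),
    "C": (4000, 22.0),
    "D": (3000, 18.0),
}
SALARY_CAP, ROSTER_SIZE = 10_000, 2

# Enumerate every roster under the cap and keep the highest-scoring one
legal = [c for c in combinations(players, ROSTER_SIZE)
         if sum(players[p][0] for p in c) <= SALARY_CAP]
best = max(legal, key=lambda c: sum(players[p][1] for p in c))
print(best)  # ('B', 'C')
```

Note that the best roster isn’t simply the two highest-projected players (A and B); their combined salary busts the cap, which is exactly the knapsack tension.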


I had never heard of the knapsack problem but it looked to me like just another combinatorial optimization problem, similar to the Traveling Salesman problem.  Since I had been working with Ant Colony Optimization (ACO) I took a stab at adopting that approach in Python.  The pipeline starts by scraping data from a website that has projections on each player’s DFS points according to some model they had developed.  That is what I would try to maximize with ACO.  This site also listed the salary of each player for the platform I was using, so it was a great data source (rotogrinders.com).

Once I had a DataFrame with Player, Position, Salary, Projected_Points, I would run the following algorithm:

  1. Randomly choose a position that still needs to be filled
  2. Filter the DataFrame to only include candidate players in that position whose salary is also beneath the remaining budget
  3. Calculate a probability for each candidate that is proportional to the product of a greedy factor (Projected_Points) and a “pheromone” factor that is influenced by previous ant behavior.
    1. These two factors are each scaled by tunable exponents that thereby set the degree to which “ants” are influenced by greedy information or previous ant behavior
  4. Randomly choose among the candidates using these probabilities as weights
  5. If a full roster was created, get the total Projected Points and add a “pheromone” trail to each selected player by an amount that’s proportional to total points.
  6. Apply global pheromone evaporation by multiplying all levels by a constant < 1.

The main idea is that you’re making a choice that balances local, greedy information with global information.  So even if someone looks like a bad choice because a different player has more projected points, they might be a great addition to the roster because salary savings on this position can be used in another position with greater effect.
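The loop above can be sketched in a few dozen lines.  Everything here is a made-up miniature:  one pitcher slot and one outfielder slot, four fictional players, and arbitrary values for the exponents, deposit rate, and evaporation constant.

```python
import random

ALPHA, BETA = 1.0, 2.0          # pheromone vs. greedy exponents (tunable)
DEPOSIT, EVAPORATION = 0.01, 0.9

players = {  # name -> (position, salary, projected points); made up
    "P1": ("P", 9000, 40.0), "P2": ("P", 7000, 33.0),
    "O1": ("OF", 4000, 12.0), "O2": ("OF", 3500, 10.0),
}
pheromone = {name: 1.0 for name in players}
SLOTS, BUDGET = ["P", "OF"], 12_000

def build_roster(rng):
    budget, roster = BUDGET, []
    order = SLOTS[:]
    rng.shuffle(order)                       # 1. random position order
    for pos in order:
        # 2. filter by position and remaining budget
        cands = [n for n, (p, sal, _) in players.items()
                 if p == pos and sal <= budget and n not in roster]
        if not cands:                        # dead end: over budget
            return None
        # 3. weight = pheromone^alpha * projected_points^beta
        weights = [pheromone[n] ** ALPHA * players[n][2] ** BETA
                   for n in cands]
        pick = rng.choices(cands, weights=weights)[0]   # 4. weighted draw
        roster.append(pick)
        budget -= players[pick][1]
    return roster

rng = random.Random(0)
for _ in range(100):                         # each pass is one "ant"
    roster = build_roster(rng)
    if roster:                               # 5. deposit pheromone
        total = sum(players[n][2] for n in roster)
        for n in roster:
            pheromone[n] += DEPOSIT * total
    for n in pheromone:                      # 6. global evaporation
        pheromone[n] *= EVAPORATION

print(sorted(pheromone, key=pheromone.get, reverse=True))
```

In this miniature, the expensive ace P1 leaves no budget for any outfielder, so ants that grab him greedily fail to fill a roster and deposit no pheromone; over many ants the cheaper P2 accumulates trail, which is the salary-savings effect described above.  The ALPHA/BETA balance controls how strongly ants follow earlier ants versus the raw projections.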


Although I made tweaks along the way and ultimately built a system that could take my preferences into account, the results were beyond my expectations.  Specifically, I beat the median score 46% of the time.  That’s not enough to win money, but it’s interesting that someone like me, with little-to-no baseball expertise, can build an algorithm that performs about evenly with humans that are confident enough in their choices to bet money.


Ultimately, the performance of my system can only be as good as the player projections.  If I wanted to do this with aspirations of making money, I would definitely want to build my own projected-points model so that I could at least quantify the uncertainty.

Another problem is that intra-team player performance is correlated.  If a pitcher is having a terrible night, the entire opposing team has elevated potential.  Further, if you have great hitters around you, there are more chances for you to score points on RBIs or to get home once you’re on base.  Also, a more successful hitting team means more people coming up to the plate, which translates to more chances.

Presumably the projected points model takes some of these things into consideration, but it’s a fundamentally different optimization problem when the values are not only random variables, but correlated.  For instance, it’s a common strategy to “stack” your roster such that you have multiple players from a single real-life team.  The idea is that even if you wouldn’t normally pick a certain player based on their individual merit, you might on this night because they’re playing some bad team and will likely score many runs, which means a lot of DFS points are going to be spread out on that team.  It works both ways, of course, and a poorly performing team makes it hard for an individual to get many points.

Next Steps

So does it make more sense to leverage the correlation by stacking players from teams that have a high probability of winning, or should you choose players that reduce correlation and therefore reduce risk of collective under-performance?

The DFS players that make the most money seem to do so by taking a combination approach:  stack individual teams, but generate many different rosters and therefore participate in many different contests.  It’s worth noting that other contest types are scored differently and in many cases the payout depends on your relative rank.  The answer to the above question might depend on the details of contest payout as well as risk-appetite.

The next steps are thinking about what optimization means in this two-stage system of single-roster optimization and then optimization of a collection of rosters.  How can you leverage stochastic uncertainty and correlation in player performance such that you’re choosing a collection of rosters that have the best chance of performing?  My current implementation cannot generate many variations on a roster without manual intervention because it has no concept of stochastic performance; it merely assumes the projected points estimate is the value of the candidate.

If one had a model that produced probability distributions for individual players that took intra-team correlations into account, then it could be sampled from to simulate an outcome, and ACO could be run using those assumed performance values.  This is a sort of Monte Carlo simulation approach.
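A minimal sketch of that sampling step, with a made-up mean vector and covariance matrix standing in for a real projection model (in a full pipeline each draw would feed one ACO run):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up projections: players 0 and 1 share a team (positively
# correlated); player 2 is on another team (independent of both).
mean = np.array([30.0, 12.0, 10.0])
cov = np.array([[25.0, 12.0, 0.0],
                [12.0,  9.0, 0.0],
                [ 0.0,  0.0, 9.0]])

rosters = {"stacked": [0, 1], "diversified": [0, 2]}
draws = rng.multivariate_normal(mean, cov, size=10_000)

totals = {name: draws[:, idx].sum(axis=1) for name, idx in rosters.items()}
for name, t in totals.items():
    print(f"{name}: mean={t.mean():.1f} std={t.std():.1f}")
```

The stacked roster shows a wider spread of total points than the diversified one, which is exactly the trade-off in the stacking question above:  correlation amplifies both the upside and the risk of collective under-performance.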

Broader Connections

A solution to this problem likely has value outside the limited world of DFS.  For instance, consider the task of choosing a portfolio of investments.  Similarly, you may want to both take advantage of big upswings when correlations are in your favor and simultaneously hedge against risk by diversifying to break correlations.

Simulating Ant Foraging

I’ve recently been building some tools in Python and getting experience with PyGame as a means of visualizing moving objects.  You may have seen my post on a 2D platform game with my dog Bruce or on 2D circular collision detection.  The motivation for working on these fun but somewhat trivial projects is that they provide the framework for research into a more interesting application:  how ants solve problems by performing a type of distributed computing.

A very interesting article that makes similar connections can be found on Nautilus.  The idea is that ants aren’t very smart individually, but when many of them interact the colony can solve problems the individual cannot.  This is indicative of a more general phenomenon in complex systems:  how a bunch of relatively simple things can interact, without a central conductor, to do more complex things.  For instance:  the human brain; a school of fish; human civilization; the immune system; cells.  Understanding so-called “emergence” will be key to building true artificial intelligence.

My interest is in applying statistical physics and information theory to understand how information is communicated and processed among individuals to solve this type of problem for the colony.  The first step, however, is building a simulation environment that can get virtual ants to wiggle on a screen and tell you when they’ve found food.  This relies heavily on the 2D circular collision detection software mentioned above.  Currently, there is no interaction between ants so there is no distributed computation taking place, merely blind wandering.  This should be the null benchmark to which the quality of other solutions is compared.  Stay tuned!
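The blind-wandering benchmark can be sketched in a few lines:  non-interacting ants take independent random steps until one stumbles onto the food.  Grid size, colony size, and food location below are arbitrary, and there’s no PyGame display, just the count of steps to first discovery.

```python
import random

# Toy "null benchmark": independent random walkers on a grid, no
# communication, stop when any ant lands on the food cell.
GRID, FOOD, N_ANTS = 20, (15, 15), 10

def steps_to_find_food(rng, max_steps=100_000):
    ants = [(GRID // 2, GRID // 2) for _ in range(N_ANTS)]  # start at nest
    for step in range(1, max_steps + 1):
        moved = []
        for x, y in ants:
            dx, dy = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
            # clamp to the grid so ants stay on screen
            moved.append((min(max(x + dx, 0), GRID - 1),
                          min(max(y + dy, 0), GRID - 1)))
        ants = moved
        if FOOD in ants:
            return step
    return None  # food never found within max_steps

n_steps = steps_to_find_food(random.Random(1))
print(n_steps)
```

Any solution that uses communication between ants should beat this number of steps on average; that comparison is the point of the benchmark.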

2D Collision Detection and Handling with Python

I built a Python package for detecting and handling collisions of circles in 2D.  The code defines a class for the objects and handles the velocity updates using basic elastic collision mechanics.  The code can be downloaded from my GitHub repository, and an example output using PyGame as a display is shown below.  For details on the physics, along with a discussion of how this problem relates to Quantum Mechanics, keep reading!
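The core of the velocity update can be sketched like this (not the package’s actual API, and assuming equal masses for simplicity):  in an elastic collision the velocity components along the line of centers swap, while the tangential components are unchanged.

```python
import math

def collide(p1, v1, p2, v2):
    """Elastic collision of two equal-mass circles.

    p = (x, y) position, v = (vx, vy) velocity; returns (new_v1, new_v2).
    """
    # Unit normal along the line of centers
    nx, ny = p2[0] - p1[0], p2[1] - p1[1]
    dist = math.hypot(nx, ny)
    nx, ny = nx / dist, ny / dist
    # Velocity components along the normal
    a1 = v1[0] * nx + v1[1] * ny
    a2 = v2[0] * nx + v2[1] * ny
    # Equal masses: the normal components simply swap
    new_v1 = (v1[0] + (a2 - a1) * nx, v1[1] + (a2 - a1) * ny)
    new_v2 = (v2[0] + (a1 - a2) * nx, v2[1] + (a1 - a2) * ny)
    return new_v1, new_v2

# Head-on collision along the x axis: the velocities exchange completely
print(collide((0, 0), (1, 0), (2, 0), (-1, 0)))
# ((-1.0, 0.0), (1.0, 0.0))
```

This swap conserves both momentum and kinetic energy, which is what makes the collision "elastic."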

Read more

Using Python/PyGame to build a platform game

I have been playing around with the book Learning Python with Raspberry Pi.  Chapter 5 steps the reader through creating a Mario Brothers-type 2D platform game.  Using that as a starting point, I modified it to include my dog Bruce doing his favorite thing in the world:  chasing a ball.  A short video is below.  You can get the Python code to modify/play on GitHub.