
Getting Started with Python for Data Analysis

KDnuggets Silver Blog, July 2017

A friend recently asked me about this, and I thought it might benefit others if published here. This is for someone new to Python who wants the easiest path from zero to one.

  1. Download the Python 3.X version of the Anaconda distribution for your operating system here. You will avoid a lot of install-related headaches by choosing this pre-bundled distribution. It comes with most of the important data analysis packages pre-installed.
  2. Once you have it installed, test to make sure that the default Python interpreter is the one you’ve just installed. This is important because your system may already have a version of Python installed, but it won’t have all the good stuff in the Anaconda bundle, so you need to make sure the new one is the default. On Mac/Linux this might mean typing which python in the terminal. Or you can just run the Python interpreter and make sure the version matches what you downloaded (a quick in-Python check is shown after this list). If all went well, the installer will have taken care of this for you. If not, you’ll need to stop here and fix it.
  3. Issue the jupyter notebook command in your shell. This should open a browser window. If not, open a browser and navigate to http://localhost:8888. Once there, create a new Python notebook.
  4. Go to the kernels section of www.kaggle.com and filter to Python kernels. These are mostly jupyter notebooks of other people doing analysis or building models on data sets that are freely available on Kaggle’s website. Look for titles with things like EDA (Exploratory Data Analysis), as opposed to those building predictive models. Find one that’s interesting and start recreating it in your notebook.
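
For step 2, a quick sanity check you can run from inside Python itself (the paths will differ on your machine, but the executable should live inside your Anaconda install):

```python
import sys

print(sys.version)      # should report the Python 3.x version you just downloaded
print(sys.executable)   # this path should point inside your Anaconda installation
```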

Note: You’ll find that when you try to recreate some of these analyses that you get import errors. This is likely because they’ve installed packages that are not bundled in the Anaconda distribution. You’ll eventually need to learn how to interact with the conda package manager and this will be one of many rabbit holes you’ll eventually go down. Usually it’s as easy as conda install <package_name> but you’ll need to find the right package name and sometimes you’ll need to specify other details. And other times you’ll need to use pip install <other_package_name>, but you’ll learn all that later.

High Level Library Summary

Here’s a quick summary of the important libraries you’ll interact with frequently; a short example tying them together follows the list.

  • NumPy: provides a lot of the core functionality for scientific computing. Under the hood it calls compiled C code, so it’s much faster than the same functions written in pure Python. Not the most user-friendly.
  • SciPy: similar to NumPy, but with more tools for things like sampling from distributions and calculating test statistics.
  • Matplotlib: the main plotting framework. A necessary evil.
  • Seaborn: import it after Matplotlib and it will make your plots a lot prettier by default. It also has its own plotting functionality, but I find the coolest stuff runs too slowly.
  • Pandas: mostly a thin wrapper around NumPy/SciPy that makes them more user-friendly. Ideal for interacting with tables of data, which it calls a DataFrame. Also has wrappers around plotting functionality that enable quick plots while avoiding the complications of Matplotlib. I use Pandas more than anything else for manipulating data.
  • Scikit-learn: has a lot of supervised and unsupervised machine learning algorithms. Also has many metrics for model selection and a nice preprocessing library for things like Principal Component Analysis or encoding categorical variables.
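
To make the list concrete, here is a minimal sketch that touches each of these libraries on a throwaway table of data (the column names and numbers are made up for the example):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler

sns.set()                                               # apply seaborn's nicer default styling

# a toy DataFrame: one integer column and one random walk
df = pd.DataFrame({"x": np.arange(50),
                   "y": np.random.randn(50).cumsum()})

print(df.describe())                                    # quick Pandas summary statistics
print(stats.ttest_1samp(df["y"], 0.0))                  # a SciPy test statistic
scaled = StandardScaler().fit_transform(df)             # scikit-learn preprocessing (zero mean, unit variance)
print(scaled[:3])

df.plot(x="x", y="y")                                   # Pandas' thin wrapper around Matplotlib
plt.show()
```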

Quick Tips

  1. When in a jupyter notebook, put a question mark in front of any object before running the cell and it will open up the documentation for it. This is really handy when you’ve forgotten the details of what the function you’re trying to call expects you to pass. E.g. ?my_dataframe.apply will explain the apply method of the pandas.DataFrame object, represented here by my_dataframe.
  2. You will likely always need to refer to the documentation for whatever library you’re using, so just keep it open in your browser. There are just too many optional arguments and nuances.
  3. When it comes to the inevitable task of troubleshooting, Stack Overflow probably has the answer.
  4. Accept the fact that you’ll be doing things you don’t fully understand for a while, or you’ll get bogged down by details that aren’t that important. Someday you’ll probably need to understand virtual environments, and it’s really not that hard, but there are many detours like that which add unnecessary pain for someone getting started.
  5. Read other people’s code. It’s the best way to learn conventions and best practices. That’s where the Kaggle kernels really help. GitHub also supports the display of jupyter notebooks in the browser, so there are tons of examples on the internet.

Pattern Recognition and Machine Learning

A quick book recommendation for those trying to strengthen their machine learning knowledge: Christopher Bishop’s Pattern Recognition and Machine Learning. I have found it incredibly helpful in adding depth to sources like Andrew Ng’s Coursera class. It’s apparently a staple in the Computer Science world, but I was never exposed to it, having come from physics.

The topics you would expect are there with great depth and clarity.  However, it is also focused on providing the Bayesian perspective.  If you’re new to Bayesian ideas applied to Machine Learning this is an excellent text.  It does a great job of both contrasting frequentist vs Bayesian approaches and showing their connections.  It should be on every Data Scientist’s shelf.

How Ants Can Pick your Fantasy Baseball Team

Note: The code for this project can be found in my GitHub repo. This wasn’t originally intended to be open-sourced, so I apologize for the lack of documentation and the hard-coded directory structure 🙂

My friend Matt is a Daily Fantasy Sports (DFS) junkie. He doesn’t have a math background, but he has heard the stories of guys using magic algorithms that tip the odds in their favor while they laugh their way to the bank. Naturally, he was curious. I’m not a huge sports fan, but it was a cool enough problem that he got me interested and we started playing around with it.

If you’re not familiar, baseball DFS works like this: on a given night several teams are playing, and you draft a fake team from among those players and are then awarded points based on their real-world performance. Although the roster and scoring details depend on the platform you’re playing on, you’ll have to pick something like: 2 starting pitchers, 3 outfielders, 1 first baseman, 1 second baseman, 1 shortstop, etc. The caveat is that each player is assigned a salary and you only have a certain amount of cash to spend, so you can’t just buy the best players at every position. Pitchers are scored for things like strikeouts, and everyone else is scored on offensive outcomes, like singles and doubles. There are a variety of contest types you can join, but the ones we played were such that those in the top 50% were paid from the entry fee pool of those in the bottom 50% (minus a platform fee). So if there are 1,000 people, you’re paid the same whether you’re in 1st place or 450th place.

Problem

It turns out that this is a variant of something called the Knapsack Problem.  Briefly, the Knapsack Problem is such that you have a sack that can hold a given volume of stuff and there are many valuable items to choose from, each with a different size and intrinsic value, and your goal is to choose the combination of items that maximizes value subject to the constraint on maximum volume.  In our case we’re trying to choose a collection of players that maximizes projected DFS points subject to a salary constraint.  In addition to the salary constraint, we also have a constraint on the number and types of players we need.
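
To make the analogy concrete (the symbols here are introduced just for illustration): let x_i \in \{0, 1\} indicate whether player i is drafted, p_i be their projected points, s_i their salary, S the salary cap, and n_k the number of required players at position k. The problem is then to maximize \sum_i p_i x_i subject to \sum_i s_i x_i \leq S and \sum_{i \in \text{position } k} x_i = n_k for every position k.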

Approach

I had never heard of the knapsack problem, but it looked to me like just another combinatorial optimization problem, similar to the Traveling Salesman Problem. Since I had been working with Ant Colony Optimization (ACO), I took a stab at adapting that approach in Python. The pipeline starts by scraping data from a website (rotogrinders.com) that publishes projections of each player’s DFS points according to a model they had developed. That is what I would try to maximize with ACO. The site also listed the salary of each player for the platform I was using, so it was a great data source.

Once I had a DataFrame with the columns Player, Position, Salary, and Projected_Points, I would run the following algorithm:

  1. Randomly choose a position that needs to be filled
  2. Filter the DataFrame to include only candidate players at that position whose salary is also beneath the remaining budget
  3. Calculate a probability for each candidate proportional to the product of a greedy factor (Projected_Points) and a “pheromone” factor that is influenced by previous ant behavior.
    1. These two factors are each scaled by tunable exponents that set the degree to which “ants” are influenced by greedy information versus previous ant behavior
  4. Randomly choose among the candidates using these probabilities as weights
  5. If a full roster was created, compute the total Projected_Points and deposit a “pheromone” trail on each selected player in an amount proportional to that total.
  6. Apply global pheromone evaporation by multiplying all levels by a constant < 1.

The main idea is that you’re making a choice that balances local, greedy information with global information. So even if someone looks like a bad choice because a different player has more projected points, they might be a great addition to the roster because the salary savings at this position can be spent at another position to greater effect.
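
A stripped-down sketch of this procedure in Python (the real code is in the repo linked above; the constants, column names, and helper names here are made up for illustration):

```python
import numpy as np
import pandas as pd

ALPHA, BETA = 1.0, 2.0      # tunable exponents: pheromone vs. greedy influence
RHO = 0.9                   # global evaporation factor (< 1)

def pick_player(candidates, pheromone):
    """Weight each candidate by pheromone**ALPHA * projected_points**BETA and sample one."""
    weights = (pheromone.loc[candidates.index] ** ALPHA) * \
              (candidates["Projected_Points"] ** BETA)
    probs = weights / weights.sum()
    return np.random.choice(candidates.index, p=probs)

def build_roster(players, pheromone, slots, budget):
    """One 'ant' fills the roster slot by slot, respecting the salary cap.

    players:   DataFrame with Position, Salary, Projected_Points columns
    pheromone: Series of positive values indexed like players (e.g. all ones to start)
    slots:     list of required positions, e.g. ["P", "P", "OF", "OF", "OF", "1B", ...]
    """
    roster, remaining = [], budget
    for pos in np.random.permutation(slots):                     # fill slots in random order
        cands = players[(players["Position"] == pos) &
                        (players["Salary"] <= remaining) &
                        (~players.index.isin(roster))]
        if cands.empty:
            return []                                            # dead end; discard this ant
        idx = pick_player(cands, pheromone)
        roster.append(idx)
        remaining -= players.loc[idx, "Salary"]
    return roster

def update_pheromone(pheromone, roster, players):
    """Deposit pheromone proportional to the roster's total projected points, then evaporate."""
    pheromone = pheromone.copy()
    pheromone.loc[roster] += players.loc[roster, "Projected_Points"].sum()
    return pheromone * RHO
```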

Results

Although I made tweaks along the way and ultimately built a system that could take my preferences into account, the results were beyond my expectations. Specifically, I beat the median score 46% of the time. That’s not enough to win money, but it’s interesting that someone like me, with little-to-no baseball expertise, can build an algorithm that performs about evenly with humans who are confident enough in their choices to bet money.

Nuances

Ultimately, the performance of my system can only be as good as the player projections. If I wanted to do this with aspirations of making money, I would definitely build my own projected-points model so that I could at least quantify the uncertainty.

Another problem is that intra-team player performance is correlated. If a pitcher is having a terrible night, the entire opposing lineup has elevated potential. Further, if you have great hitters around you, there are more chances for you to score points on RBIs or to be driven home once you’re on base. Also, a more successful hitting team means more people coming up to the plate, which translates to more chances.

Presumably the projected points model takes some of these things into consideration, but it’s a fundamentally different optimization problem when the values are not only random variables, but correlated.  For instance, it’s a common strategy to “stack” your roster such that you have multiple players from a single real-life team.  The idea is that even if you wouldn’t normally pick a certain player based on their individual merit, you might on this night because they’re playing some bad team and will likely score many runs, which means a lot of DFS points are going to be spread out on that team.  It works both ways, of course, and a poorly performing team makes it hard for an individual to get many points.

Next Steps

So does it make more sense to leverage the correlation by stacking players from teams that have a high probability of winning, or should you choose players that reduce correlation and therefore reduce risk of collective under-performance?

The DFS players that make the most money seem to do so by taking a combination approach:  stack individual teams, but generate many different rosters and therefore participate in many different contests.  It’s worth noting that other contest types are scored differently and in many cases the payout depends on your relative rank.  The answer to the above question might depend on the details of contest payout as well as risk-appetite.

The next steps are thinking about what optimization means in this two-stage system of single-roster optimization and then optimization of a collection of rosters.  How can you leverage stochastic uncertainty and correlation in player performance such that you’re choosing a collection of rosters that have the best chance of performing?  My current implementation cannot generate many variations on a roster without manual intervention because it has no concept of stochastic performance; it merely assumes the projected points estimate is the value of the candidate.

If one had a model that produced probability distributions for individual players and took intra-team correlations into account, then it could be sampled from to simulate an outcome, and ACO could be run using those simulated performance values. This is essentially a Monte Carlo simulation approach: sample an outcome, optimize against it, and repeat.
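
A sketch of that sampling idea, purely illustrative: the multivariate normal, the inputs mean_points and cov, and the optimize_roster callable are all assumptions standing in for a real correlation-aware projection model and the ACO optimizer described above.

```python
import numpy as np

def sample_rosters(mean_points, cov, optimize_roster, n_sims=500):
    """Simulate many correlated 'nights' and optimize a roster against each one."""
    rosters = []
    for _ in range(n_sims):
        simulated = np.random.multivariate_normal(mean_points, cov)  # one simulated night of player outcomes
        rosters.append(optimize_roster(simulated))                   # optimize a roster against that outcome
    return rosters   # rosters that recur frequently are candidates for the final portfolio
```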

Broader Connections

A solution to this problem likely has value outside the limited world of DFS. For instance, consider the task of choosing a portfolio of investments. Similarly, you may want to both take advantage of big upswings when correlations are in your favor and simultaneously hedge against risk by diversifying to break correlations.

The Bayesian Approach to Linear Regression

I’ve long been interested in the Bayesian approach to statistical modeling and machine learning. I recently dug deep into an example in Bishop’s text to re-create his results and paraphrase my understanding of it. Much like how you would want an online learning system to work, this example shows in detail how each data point updates the posterior distribution, which then becomes the next iteration’s prior. In turn, you can watch the uncertainty in the model’s estimate shrink.

What I like most about the Bayesian treatment is that you get a full probability distribution over your model parameters, not just a point estimate of their values. This lets you ensemble many different models, each weighted by its respective probability, which gives a far more robust estimate than the common Maximum Likelihood approach.
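
A minimal sketch of that sequential update for a straight-line model, following the conjugate Gaussian treatment in Bishop (the prior precision alpha and the noise precision beta are assumed known here, and the variable names are mine, not from the notebook):

```python
import numpy as np

alpha, beta = 2.0, 25.0        # prior precision and (assumed known) noise precision

def phi(x):
    """Design vector for a straight-line model y = w0 + w1 * x."""
    return np.array([1.0, x])

def update(m, S, x, t):
    """Fold one observation (x, t) into the Gaussian posterior over the weights."""
    p = phi(x)
    S_inv = np.linalg.inv(S)
    S_new = np.linalg.inv(S_inv + beta * np.outer(p, p))   # posterior precision grows with data
    m_new = S_new @ (S_inv @ m + beta * p * t)              # posterior mean shifts toward the data
    return m_new, S_new

# start from the prior: zero mean, isotropic covariance 1/alpha
m, S = np.zeros(2), np.eye(2) / alpha

# feed toy data one point at a time; each posterior becomes the next prior
for x, t in [(0.3, 0.5), (0.8, -0.1)]:
    m, S = update(m, S, x, t)

print("posterior mean:", m)
print("posterior covariance:\n", S)
```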

If you find the treatment useful, feel free to fork the GitHub code and play with it yourself.  Find the notebook here.

Bayesian estimates within +/- 1 stdev from the mean using only two data points

Ants Solving the Traveling Salesman Problem

I have been working on building the stochastic optimization algorithm inspired by ant foraging, called the Ant System, as described in Marco Dorigo’s paper, “The Ant System: Optimization by a colony of cooperating agents.”  Code for running it yourself with Python 3 can be found in my Ant System GitHub repo.  It is being used to find solutions to the Traveling Salesman Problem on the 30-city “Oliver 30” data set.  The general question is: given a map of cities, what is the shortest route that visits every city exactly once (and returns to the start)?  This has obvious applications to logistics and to routing traffic in optimal ways.
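
The core of the algorithm is the transition rule from Dorigo’s paper: an ant sitting at city i moves to a not-yet-visited city j with probability p_{ij} = \frac{[\tau_{ij}]^\alpha [\eta_{ij}]^\beta}{\sum_{l \in allowed} [\tau_{il}]^\alpha [\eta_{il}]^\beta}, where \tau_{ij} is the pheromone on edge (i, j), \eta_{ij} = 1/d_{ij} is the “visibility” (inverse distance), and the exponents \alpha and \beta set the balance between collective memory and greedy, distance-based information.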


Machine Learning Practice on Breast Cancer Data

I found an open data set on the UCI Machine Learning Repository concerning breast cancer and thought I would use it as a way to practice machine learning approaches.  Briefly, the training data indicate whether each instance was benign or malignant, along with the values of a variety of features, e.g. Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, etc.

Please go to my GitHub Breast Cancer Repository for the data set and code.

Approach Outline

The data set has 9 features, two classes, and about 700 data points.  I split it so that my model was trained on the first 350 data points and the remainder were used to check the accuracy.

Our prediction equation is a sigmoid and is interpreted as the probability that the tumor is malignant given the input features.  The input to the sigmoid is a sum of each feature multiplied by a parameter value, i.e. \vec{\theta}\cdot \vec{X}, where \vec{\theta} is the parameter vector [\theta_0, \theta_1, \theta_2, ..., \theta_9] and \vec{X} is the feature vector of a given data point i:  [1, x_{1i}, x_{2i}, x_{3i}, ..., x_{9i}].  Explicitly, the input for a set of features \vec{X} is \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_9x_9, and the output is the probability of being malignant under the chosen model parameters (\vec{\theta}).  My program predicts that the tumor is malignant whenever the output is greater than or equal to 0.5 (i.e. the probability is at least 50%).

The cost function is such that if the sigmoid predicts a 1 (meaning the probability the tumor is malignant is 100%) and the real classification is benign, then there is an infinite cost.  The same is true if it predicts a 0 (meaning the probability of being malignant is 0%) and the real classification is malignant.  The cost equation ends up being J = -\frac{1}{m}\sum_{i=1}^m \left[ y_i \log(h_i) + (1-y_i)\log(1-h_i) \right], where m is the number of data points, h_i is the output of the prediction sigmoid for data point i, and y_i is its classification of 0 or 1 (benign/malignant).  The idea is that if you’re 100% confident and end up being wrong, you incur a very large cost.

The job is then to find the parameter vector \vec{\theta} that minimizes this cost function.  Gradient descent is one choice, but for the first iteration I tried a minimization function built into Octave, ‘fminunc’, with 400 iterations.
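
As a sketch of the same computation in Python (the original used Octave’s ‘fminunc’; here scipy.optimize.minimize plays that role, and the synthetic data and variable names are mine, not the real data-loading code):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta) for logistic regression."""
    h = sigmoid(X @ theta)                       # predicted probability of malignant
    eps = 1e-12                                  # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Tiny synthetic stand-in for the real data: 9 features scaled 1-10, plus a leading column of ones.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.integers(1, 11, size=(100, 9))])
y = (X[:, 1] > 5).astype(float)                  # fake labels driven by the first feature

result = minimize(cost, np.zeros(10), args=(X, y), method="BFGS")   # 'fminunc' analogue
theta = result.x
predictions = (sigmoid(X @ theta) >= 0.5).astype(float)             # threshold at 50%
print("training accuracy:", (predictions == y).mean())
```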

Initial Results

‘fminunc’ returned a parameter vector \vec{\theta} that minimized the cost function specified above.  I then used the sigmoid with these values of \vec{\theta} to make predictions on the remaining ~350 data points.  For each data point, the feature vector \vec{X} was dotted with the parameter vector and fed as the input to the sigmoid function.  If this number was greater than or equal to 0.5, the model predicted the tumor was malignant.

The prediction accuracy ends up being 98%!

Gradient Descent

Next I used a Gradient Descent (GD) minimization algorithm instead of the built-in ‘fminunc’ function to compare results.  The results when using GD were very sensitive to the learning rate parameter \alpha.  With \alpha = 0.1 things remained pretty stable and gave reasonable accuracy, but more iterations were needed to achieve it.  The accuracy ended up exceeding that of ‘fminunc’ very slightly, with the best value being 98.6% compared to 98.2% for ‘fminunc’.  Essentially it classified one more of the data points correctly.

For \alpha = 0.5 the transient behavior became unstable, as the log functions began returning infinities.  The algorithm ignored these results and continued to minimize, and the output would end up being similar to cases where this wasn’t an issue, but the descent was not a smooth march towards the minimum.  See the figures below for a comparison between the methods and an example of an unstable cost output as a function of the number of iterations.
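
For comparison, a minimal batch gradient-descent version (again a sketch in Python, not the original Octave code), where the learning rate \alpha is the knob discussed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=400):
    """Batch gradient descent on the logistic-regression cost."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta -= alpha * (X.T @ (h - y)) / m     # gradient of the cross-entropy cost
    return theta
```

With a larger \alpha and unscaled 1-10 features, the early updates can overshoot badly, which is consistent with the instability described above.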

What the Model Tells Us

Once we have parameters that give us high prediction accuracy, what can we learn from them?  If you look at the sigmoid function, it’s clear that when the input argument is positive the output tends towards 1, and when it’s negative it tends towards 0.  The more positive/negative the input is, the closer to 1/0 the output is.  So instead of writing out the sigmoid function, let’s just look at the input and see the conditions that make it positive or negative to garner some intuition.

The input is just \vec{\theta}\cdot \vec{X} = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_9x_9.  The machine learning algorithm outputs the following parameter vector:  \vec{\theta} = [ -8.2, 0.52, -0.10, 0.42, 0.27, 0.11, 0.41, 0.12, 0.09, 0.31 ].  Recall that \vec{X} is our vector of features (with x_0 = 1 so that we can have a constant \theta_0 term).  The description of each feature is given below:

#  Attribute:  Domain

   0. Constant:  1 for all
   1. Clump Thickness:  1 – 10
   2. Uniformity of Cell Size:  1 – 10
   3. Uniformity of Cell Shape:  1 – 10
   4. Marginal Adhesion:  1 – 10
   5. Single Epithelial Cell Size:  1 – 10
   6. Bare Nuclei:  1 – 10
   7. Bland Chromatin:  1 – 10
   8. Normal Nucleoli:  1 – 10
   9. Mitoses:   1 – 10

Now, the first parameter, which merely multiplies the constant 1, is equal to -8.2.  This means that in the absence of any other feature information, it’s most likely that the tumor is benign, not malignant.  Good news!  The next highest parameter value is 0.52, which multiplies the Clump Thickness; this feature seems to be the most strongly correlated with being malignant.  The third parameter value is less than zero, indicating that the higher the Uniformity of Cell Size, the more negative the argument becomes and therefore the more likely the tumor is to be benign.

Next Steps

Next we’ll see how the prediction accuracy changes when using a more complicated feature set (e.g. including combinations of features like x_1x_2x_4^3).
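
One way to build such feature combinations is shown below (a sketch using scikit-learn’s PolynomialFeatures, which is not what the original Octave code used; the random matrix stands in for the 9 raw features):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.randint(1, 11, size=(5, 9))             # stand-in for the 9 raw features
poly = PolynomialFeatures(degree=3, include_bias=True)
X_expanded = poly.fit_transform(X)                     # adds products like x1*x2 and powers like x4**3
print(X_expanded.shape)                                # many more columns than the original 9
# degree=5 would also include higher-order terms such as x1*x2*x4**3
```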

Prediction accuracy vs. number of iterations

Cost vs. number of iterations

Acknowledgments

Data came from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.

Citations

1. O. L. Mangasarian and W. H. Wolberg: “Cancer diagnosis via linear programming”, SIAM News, Volume 23, Number 5, September 1990, pp. 1 & 18.

2. W. H. Wolberg and O. L. Mangasarian: “Multisurface method of pattern separation for medical diagnosis applied to breast cytology”, Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp. 9193-9196.

3. O. L. Mangasarian, R. Setiono, and W. H. Wolberg: “Pattern recognition via linear programming: Theory and application to medical diagnosis”, in: “Large-scale numerical optimization”, Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp. 22-30.

4. K. P. Bennett and O. L. Mangasarian: “Robust linear programming discrimination of two linearly inseparable sets”, Optimization Methods and Software 1, 1992, pp. 23-34 (Gordon & Breach Science Publishers).