Getting Started with Python for Data Analysis

KDnuggets Silver Blog, July 2017

A friend recently asked this, and I thought it might benefit others if published here. This is for someone new to Python who wants the easiest path from zero to one.

  1. Download the Python 3.X version of the Anaconda distribution for your operating system here. You will avoid a lot of install-related headaches by choosing this pre-bundled distribution. It comes with most of the important data analysis packages pre-installed.
  2. Once you have it installed, test to make sure that the default Python interpreter is the one you've just installed. This is important because your system may already have a version of Python installed, but it won't have all the good stuff in the Anaconda bundle, so you need to make sure the new one is the default. On Mac/Linux this might mean typing which python in the terminal. Or you can just run the Python interpreter and make sure the version matches what you downloaded (see the quick check after this list). If all went well, the installer will have taken care of this. If not, you'll need to stop here and fix it.
  3. Issue the jupyter notebook command in your shell. This should open a browser window. If not, open a browser and navigate to http://localhost:8888. Once there, create a new Python notebook.
  4. Go to the kernels section of www.kaggle.com and filter to Python kernels. These are mostly jupyter notebooks of other people doing analysis or building models on data sets that are freely available on Kaggle’s website. Look for titles with things like EDA (Exploratory Data Analysis), as opposed to those building predictive models. Find one that’s interesting and start recreating it in your notebook.
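If you want a quick sanity check that your notebook is really using the Anaconda install (step 2 above), something like this works; the exact paths and version numbers will of course differ on your machine:

import sys
print(sys.executable)   # should point inside your Anaconda directory
print(sys.version)      # should match the version you downloaded

# These all ship with Anaconda, so the imports should just work.
import numpy, scipy, pandas, matplotlib, sklearn
for pkg in (numpy, scipy, pandas, matplotlib, sklearn):
    print(pkg.__name__, pkg.__version__)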

Note: You'll find that when you try to recreate some of these analyses, you get import errors. This is likely because the author installed packages that are not bundled in the Anaconda distribution. You'll eventually need to learn how to interact with the conda package manager, and this is one of the many rabbit holes you'll go down. Usually it's as easy as conda install <package_name>, but you'll need to find the right package name and sometimes you'll need to specify other details. Other times you'll need to use pip install <other_package_name>, but you'll learn all that later.

High Level Library Summary

Here’s a quick summary of the important libraries you’ll interact with frequently.

  • NumPy: has a lot of the core functionality for scientific computing. Under the hood it calls C-compiled code, so it's much faster than the same functions written in pure Python. Not the most user-friendly.
  • SciPy: similar to NumPy, but has more tools for sampling from distributions, calculating test statistics, etc.
  • MatPlotLib: The main plotting framework. A necessary evil.
  • Seaborn: import it after MatPlotLib and it will make your plots a lot prettier by default. It also has its own functionality, but I find the coolest stuff runs too slowly.
  • Pandas: mostly a thin wrapper around NumPy/SciPy that makes them more user-friendly. It's ideal for interacting with tables of data, which it calls DataFrames, and it also wraps plotting functionality so you can plot quickly while avoiding the complications of MPL (see the short example after this list). I use Pandas more than anything for manipulating data.
  • Scikit-learn: Has a lot of supervised and unsupervised machine learning algorithms. Also has many metrics for doing model selection and a nice preprocessing library for doing things like Principal Component Analysis or encoding categorical variables.
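To give a feel for how these libraries fit together, here's a minimal sketch (the data and column names are made up) that builds a small Pandas DataFrame from NumPy arrays, makes a quick plot, and runs a scikit-learn PCA:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

sns.set()  # apply Seaborn's prettier default plot style

# A made-up table of data: 100 rows, 3 columns
df = pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"])

print(df.describe())                    # quick summary statistics
df.plot(kind="scatter", x="a", y="b")   # Pandas' thin wrapper around MatPlotLib
plt.show()

pca = PCA(n_components=2)               # Principal Component Analysis from scikit-learn
reduced = pca.fit_transform(df.values)
print(reduced.shape)                    # (100, 2)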

Quick Tips

  1. When in a jupyter notebook, put a question mark in front of any object before running the cell and it will open up the documentation for it. This is really handy when you've forgotten the details of what the function you're trying to call expects you to pass. e.g. ?my_dataframe.apply will explain the apply method of the pandas.DataFrame object, represented here by my_dataframe (see the snippet after this list).
  2. You will likely always need to refer to the documentation for whatever library you're using, so just keep it open in your browser. There are just too many optional arguments and nuances.
  3. When it comes to the inevitable task of troubleshooting, stackoverflow probably has the answer.
  4. Accept the fact that you'll be doing things you don't fully understand for a while, or you'll get bogged down by details that aren't that important. Some day you'll probably need to understand virtual environments, and it's really not that hard, but there are many detours like that which add unnecessary pain for someone getting started.
  5. Read other people’s code. It’s the best way to learn conventions and best practices. That’s where the Kaggle kernels really help. GitHub also supports the display of jupyter notebooks in the browser, so there are tons of examples on the internet.
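As an example of tip 1, in a notebook cell the question-mark trick looks like this (my_dataframe is just a stand-in for whatever DataFrame you happen to be working with, and the ? syntax only works in jupyter/IPython, not in a plain Python script):

import pandas as pd

my_dataframe = pd.DataFrame({"x": [1, 2, 3]})

?my_dataframe.apply    # opens the docstring for DataFrame.apply
??my_dataframe.apply   # two question marks also show the source, where available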

“Estimation and Inference of Heterogeneous Treatment Effects using Random Forests” paper review

I’ve recently been reading a lot of technical papers and thought it would be nice to summarize them in less formal language than academic papers. I may or may not do this more than once.

I. Motivation

This paper is about trying to do causal analysis using Random Forests (RF). RF are very popular for building classification or regression predictive models, but it’s not trivial to make classical statistical claims about the results. For instance, what are your confidence intervals? How do you get p-values?

Furthermore, this paper wants to make claims about causal impact, or the effect of a treatment. For instance, what's the impact of college on income? This is hard to do for many reasons, but most fundamentally you don't have the data you'd actually need: for each individual, what happened both when they went to college and when they did not. This is impossible of course, because in reality the individual either went to college or they did not, and you don't know what would have happened in the other situation — i.e. the counterfactual. This way of framing the problem is called the "potential outcomes framework" and essentially supposes that each person has multiple potential outcomes depending on whether or not they received the treatment.
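In the standard potential-outcomes notation (which, as I recall, is roughly what the paper uses), each person i has covariates X_i, a treatment indicator W_i (1 = went to college, 0 = did not), and two potential outcomes, of which we only ever observe one:

Y_i^{obs} = Y_i(W_i), \qquad \tau(x) = \mathbb{E}\left[ Y_i(1) - Y_i(0) \mid X_i = x \right]

Here \tau(x) is the (possibly heterogeneous) treatment effect the paper is trying to estimate.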

A key assumption in this paper, and in causal estimation techniques of this type generally, is unconfoundedness. This means that once you control for the variables of interest, whether or not a person received the treatment is random. This enables us to treat nearby points as mini randomized experiments. In the case of a drug trial you can randomly assign treatment, so this isn't a problem, but if you're analyzing observational data the treatments are already assigned — a person went to college or they didn't.
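In the notation above, unconfoundedness says the potential outcomes are independent of the treatment assignment once you condition on the covariates:

\{ Y_i(0), Y_i(1) \} \perp W_i \mid X_i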

For unconfoundedness to hold, you would need to choose and measure all of the variables that affect the treatment assignment. For the causal impact of a 4-year degree on income, one of the covariates you'd likely want to choose is family income, since whether or not someone goes to college is probably correlated with their family's income. Age might be another, since an 18-year-old is a lot more likely to go to college than a 50-year-old. The idea is that when you look at two neighboring points in your family-income/age space, the decision of whether or not to go to college should be random. This is valuable because then you can take the income of the no-college person, subtract it from that of the college person, and get an estimate of the impact of college at that point in the family-income/age space.

But this is hard, because you might forget an important variable, or maybe you just don’t have the data for it. So it’s an assumption that surely isn’t exactly true, but it might be true enough to give useful answers.

Assuming you have the right covariates, the authors use a Random Forest to split the data into self-similar groups. The Random Forest is an adaptive nearest-neighbor method in that it decides for you which portions of the space are similar, whereas most nearest-neighbor techniques treat all distances equally. They then add constraints so that the resultant leaves contain a minimum number of both treated and untreated observations, and the causal impact for each leaf can be calculated by subtracting the group averages as if it were a randomized experiment. Once this is done for the individual trees, the estimates from each tree can be averaged. They call this implementation a Causal Forest.
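Concretely, within a single leaf L of one of these trees, the estimate is just the difference of the two group means, something like

\hat{\tau}(L) = \frac{1}{\lvert \{ i : W_i = 1,\, X_i \in L \} \rvert} \sum_{\{ i : W_i = 1,\, X_i \in L \}} Y_i \;-\; \frac{1}{\lvert \{ i : W_i = 0,\, X_i \in L \} \rvert} \sum_{\{ i : W_i = 0,\, X_i \in L \}} Y_i

and the Causal Forest's estimate at a point x is the average of these leaf estimates over all the trees whose leaf contains x.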

II. What’s new here

As mentioned above, to make robust statistical claims in traditional statistics, you need things like p-values and confidence intervals. This requires knowing the asymptotic sampling distribution of the statistic. In this case, that means we need to know what the distribution would look like if we sampled the average treatment effect from the Random Forest estimator over an infinite number of trees/data. If we have that, then we can say things like: "the average treatment effect is 0.3 and the likelihood that this came from random chance is less than 0.1%."
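The headline result is of this form: the estimator is asymptotically Gaussian and centered on the truth, so the usual confidence-interval machinery applies,

\frac{\hat{\tau}(x) - \tau(x)}{\sqrt{\operatorname{Var}[\hat{\tau}(x)]}} \;\Rightarrow\; \mathcal{N}(0, 1), \qquad \text{CI:}\;\; \hat{\tau}(x) \pm 1.96\,\sqrt{\hat{V}(x)},

where \hat{V}(x) is an estimate of the variance, such as the infinitesimal jackknife discussed below.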

The built-in assumption here is that we can derive properties for the asymptotic/infinite data case and apply them to our real-world case of finite samples, but that’s often the case in traditional statistics.

Prior work has been done to enable asymptotic analysis of Random Forests, but this paper establishes the constraints needed to apply those results to their Causal Forests. One of the constraints requires "honest trees", for which they present two growing algorithms.

III. Approach

The full proof is very complicated and I won’t attempt to recreate it here, but I’ll briefly outline some constraints and what they enable.

First, an asymptotic theory for traditional Random Forests is recreated from previous work, showing that the estimator is asymptotically Gaussian and centered on the true value, meaning it's asymptotically unbiased. They also mention a technique called the infinitesimal jackknife for estimating the variance, which includes a finite-sample correction.
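For reference, the infinitesimal jackknife variance estimate with the finite-sample correction is, if I'm reading it correctly, roughly of the form

\hat{V}_{IJ}(x) = \frac{n-1}{n} \left( \frac{n}{n-s} \right)^{2} \sum_{i=1}^{n} \operatorname{Cov}_b\!\left[ \hat{\tau}_b(x),\, N_{bi} \right]^{2},

where the covariance is taken over the trees b in the forest, N_{bi} counts how many times observation i appears in the subsample used to grow tree b, and s is the subsample size.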

The authors are able to leverage this previous work by introducing the concept of the "honest tree". The basic idea is that you cannot use the outcome variable both to place the splits and to estimate the average impact — you have to choose one or the other. They present two ways to do this. The first is the double-sample tree, where you split your data in two: one half for estimating the impact and the other half for placing the splits. The splitting criterion is to minimize the MSE of the outcome variable. In the double-sample case it might seem you're throwing away half your data, but this condition applies to a single tree, and the Random Forest samples a new training set for each tree, so you'll end up using all of the data for both splitting and estimation.
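To make the double-sample idea concrete, here's a rough, simplified sketch of a single honest tree in Python (I'm using scikit-learn's DecisionTreeRegressor for the splitting step; the paper's actual splitting rule, subsampling scheme, and regularity conditions differ in the details):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def honest_tree_estimate(X, y, w, x_test, min_leaf=10, seed=0):
    """One double-sample 'honest' tree: one half of the data places the splits,
    the other half estimates the treatment effect in the leaf containing x_test.
    X: 2-D covariates, y: outcomes, w: 0/1 treatment indicators, x_test: 1-D point."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(y))
    split_idx, est_idx = idx[: len(y) // 2], idx[len(y) // 2:]

    # Half 1: place the splits by regressing the outcome on the covariates (MSE criterion).
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf)
    tree.fit(X[split_idx], y[split_idx])

    # Half 2: estimate the effect in the test point's leaf as (treated mean - control mean).
    leaf_ids = tree.apply(X[est_idx])
    test_leaf = tree.apply(x_test.reshape(1, -1))[0]
    in_leaf = leaf_ids == test_leaf
    treated = in_leaf & (w[est_idx] == 1)
    control = in_leaf & (w[est_idx] == 0)
    if treated.sum() == 0 or control.sum() == 0:
        return np.nan  # the real algorithm constrains splits so both classes are present
    return y[est_idx][treated].mean() - y[est_idx][control].mean()

A Causal Forest would then average this estimate over many such trees, each grown on its own random subsample.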

The other method is to grow "propensity trees", which are classification trees that aim to predict the treatment class instead of the outcome; the outcome variable is then only used for estimating the impact within each leaf. They impose a stopping criterion so that you stop splitting in order to maintain a minimum number of each treatment class in any leaf. This is necessary so that you have outcomes to compare when estimating the effect.

By using honest trees and relying on assumptions of unconfoundedness and treatment-class overlap within leaves, they're able to slightly modify the traditional analysis and obtain the same unbiased, Gaussian asymptotic results.

IV. Experimental Results

Their baseline is the k-NN algorithm. They create simulation experiments with known conditions that mimic common problems and then apply the Causal Forest and k-NN methods.

The first experiment holds the true treatment effect at zero for all x, but establishes a correlation between the outcome and the treatment assignment, thereby testing the ability of the algorithm to correct for the covariates and eliminate the bias. This is like having the algorithm automatically figure out that age and family income are important, as well as how to split them up. They run the experiment many times while varying the dimension of the training data. They report MSE and coverage, i.e. how often the true value falls within the 95% confidence interval of the estimator. The Causal Forest had an order-of-magnitude improvement in MSE over 10-NN and a factor-of-5 improvement over 100-NN. CF maintained ~0.95 coverage up to 10 dimensions and then began to degrade; 10-NN maintained reasonable coverage in the 0.9 range, and 100-NN performed very poorly. It's worth noting the confidence intervals were much wider for k-NN than for CF, which makes the improved coverage that much more impressive.

The 2nd experiment had constant main effect and propensity, but the true treatment effect depended on only two covariates. They then scaled the number of irrelevant covariates to understand the algorithm's ability to find this heterogeneous treatment effect in the presence of irrelevant covariates. Surprisingly, CF did better at higher dimension than at low dimension. They explain this by noting that the variance of the forest depends on the correlation between trees, and suggest that the correlation between trees, and therefore the ensemble variance, is reduced at higher dimension. The results are similar to experiment 1 in that the MSE is much better or at least on par, with more consistent coverage as the dimension scales.

It was noted that the confidence intervals' coverage begins to degrade at the edge of the feature space, particularly for high dimension. This is explained as a situation dominated by bias that would disappear in the asymptotic limit of infinite data. It is noted that bias at the boundaries is typical of trees specifically, and of nearest-neighbor non-parametric estimators generally.

V. Discussion

Although causal analysis is not my expertise, it seems this is a nice advancement for nearest-neighbor methods with the assumption of unconfoundedness. The dramatic improvement in MSE while maintaining nominal coverage is impressive.

I found several aspects of the paper confusing, however, specifically those related to the splitting criteria of the trees. In the case of propensity trees, they're training a classifier to separate the treatment classes, yet they're also requiring heterogeneity of classes in each leaf, which is directly opposed to the split criterion.

Similarly, in the double-sample framework they're splitting to minimize the outcome MSE, which groups points with similar outcome values. But the entire point is that, after separating the points by treatment class, the outcomes differ, and that difference is the average treatment effect. Once again the splitting criterion seems opposed to the end goal. On this point they reference a paper (Athey and Imbens [2016]) that may contain clarification.

Finally, there’s a remark that I don’t understand, but sounds troubling.

Remark 4. (testing at many points) We note that it is not in general possible to construct causal trees that are regular in the sense of Definition 4b for all x simultaneously…. In practice, if we want to build a causal tree that can be used to predict at many test points, we may need to assign different trees to be valid for different test points. Then, when predicting at a specific x, we treat the set of trees that were assigned to be valid at that x as the relevant forest and apply Theorem 11 to it. (p. 19)

I’m not sure if this is operational overhead, or something more fundamental.

What do you think?

The Limitations of AI from the Horse’s Mouth

I wrote an opinion piece a long time ago on the absurd amount of hype around AI, both optimistic and pessimistic.  But I'm just a guy on the internet.  Andrew Ng, on the other hand, is a world-renowned expert on all things AI.  Not only does he teach the famous Stanford Machine Learning class on Coursera, he was also the founding lead of the Google Brain team and currently leads Baidu's AI team.

His article is similar to his teaching: concise, clear, and enlightening.  Give it a read:  What Artificial Intelligence Can and Can’t Do Right Now.

A Python Data Science Workflow using Remote Servers

A good toolset and workflow can be a big productivity gain.  Unless you're working exclusively on your personal computer for the whole pipeline, development workflows can quickly get messy.  I recently found a Python IDE that has great support for working on remote servers, along with many other nice features:  PyCharm.  I'll briefly walk through some of the basics that have been a great relief to me.

Read more

Scraping Flight Data with Python

Note:  The code for this project can be found in this github repo.

I have been building a new project that requires the prices of flights.  I looked for APIs but couldn't find any that were free.  However, I had a great experience using BeautifulSoup when scraping baseball data for my Fantasy Baseball auto-roster algorithm project, so I decided to try to get my own data that way.  To start, I chose a single airline: Southwest.  In this post I'll step you through a couple of basics, including how to scrape when you need to submit form data.
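For a flavor of what "submitting form data" looks like in code, here's a rough sketch using requests + BeautifulSoup; the URL and form field names below are placeholders (every site, Southwest included, names its fields differently, and you have to discover them by inspecting the search form in your browser):

import requests
from bs4 import BeautifulSoup

# Hypothetical endpoint and form fields -- inspect the real search form to find yours.
url = "https://www.example.com/flight-search"
form_data = {
    "origin": "AUS",
    "destination": "SFO",
    "departure_date": "2017-08-01",
}

response = requests.post(url, data=form_data)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# Also hypothetical: pull out whatever elements hold the prices on the real results page.
for price in soup.select(".price"):
    print(price.get_text(strip=True))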

Read more

New Metallica review

Metallica is a very polarizing band.  When I was 12 it was unquestionably cool to be a Metallica fan.  Then…Napster.  Almost overnight it became severely uncool to like Metallica.  In the public consciousness they quickly became the rich guys that were past their prime, complaining that they'd only make $10 million instead of $100 million.  Former loyal fans were less likely to come to their defense, perhaps because many felt betrayed by their evolving sound, which was a significant departure from what made them famous to begin with.  And although it's almost unfathomable upon reflection, they caught a lot of shit for cutting their hair.  Seriously, it was a big deal in the 90s.

I think there's at least one more reason that helps explain the vitriol seen in the comment section of every article pertaining to Metallica.  The band has been around so long that many original fans are now much older, with tastes that have significantly changed.  They got real jobs, had kids, and just don't listen to metal.  That's fine, but they mistake their nostalgia for youth as evidence that Metallica has significantly changed.  Is Death Magnetic's "My Apocalypse" really that different from Master of Puppets' "Battery"?  Sure, there are differences, but it's definitely in the ballpark.  People seem to get really creative in finding reasons to hate Metallica.  The big criticism of the totally excellent Death Magnetic was that they used too much compression in the mix.  That was the first and last time I had heard anyone except recording engineers even talk about compression, much less use it as a reason to not like a record.  Exhausting.

But despite the band being able to do no right in the public's eye, they've continued to deliver uncompromising, excellent material, and Hardwired…To Self-Destruct is no exception.  Similar to Death Magnetic, they have cherry-picked the best aspects of the many sonic eras the band has created, resulting in a highly polished behemoth of creative output.  Although there are many examples of the high-speed thrash metal riffs and guitar harmonies that define their early material, they have brought in more elements from their later material on this album.

Specifically, the slower groove that is perhaps unique to 90's Metallica.  Think Load's "2 x 4", Reload's "Carpe Diem Baby", or the black album's "Sad But True".  "Now That We're Dead" would be at home on any of those albums, but the punctuation of the more complex prog parts would have made it stand out.  "Dream No More" and "Am I Savage?" are perfect examples of dropping into the groove of that era and making the most of Hetfield's singing ability that didn't seem to appear until Load.

Another album I find myself making many connections to is Disc 1 of their cover album, Garage Inc.  The sort of Irish-inspired sound of "Whiskey In the Jar" shows up in a few places.  Riffs similar to their Mercyful Fate tribute are sprinkled throughout.  And most pleasing to me, the Sabbath groove of "Sabbra Cadabra" is front and center on my favorite album track:  "ManUNkind".

I love riffs that are both in an odd time signature and can still groove hard.  This song hangs in a time signature I still haven’t figured out, but is executed with a skill that allows the listener to bob their head uninterrupted.  Like Clutch, Isis, or Tool at their best, on this track in particular Metallica strikes that elusive balance of technically challenging and extremely listenable.

Aside from this song having all the elements I personally like, I think it also represents something very important: a truly original sound.  It is incredibly difficult to create an original vibe.  Metallica did it in the 80's with their combination of orchestral melodies and punishing riffs that came to be known as the best example of thrash.  They miraculously did it again in the 90's by creating a sound that I don't think has ever been coherently named or recreated by any band since.  I don't know what genre "Ain't My Bitch" is, but it's not thrash and I've never heard anything like it before or since.  And while you can argue that putting together these elements in different combinations is original, I think the accomplishment of "ManUNkind" is something different.  It would not fit any other album in their catalog, which is a hallmark of its originality.

But generally, Death Magnetic sounded like a backlash against those saying they didn’t have the technical chops that helped define their legendary albums.  It’s as if much of the 90’s was spent trying to convince themselves that they could do more than just thrash, but then a decade later they had to remind themselves that they still could.  Death Magnetic sounded like the answer to the question, “Remind me again why we’re avoiding our thrash history?”

Hardwired…To Self-Destruct, on the other hand, sounds like the band has finally gotten comfortable with what they created in the 90's.  They seem at peace with not only living in the shadow of their own legend, but with the various twists and turns they made trying to escape it.  Instead of being confined by their legend, pressured to either recreate or avoid it, they are finally able to just ignore it and make the best album they can, by their own definition.  Hardwired sounds like an album created without the burden of constraint.  It marks a new chapter in the long, awesome book of Metallica and I'm already looking forward to the next one…

Experimenting with Content

I currently don't know the scope of this blog.  Forgive me for experimenting with different topics as I find my public voice.  But I love music too much not to experiment with writing about it.  Ideally I'll find a nice intersection between technical and art topics, but if they stay fragmented then I'll split them into different sites.  In the meantime I'll work to tag the posts in such a way that you can filter out what you don't want to see.

If you have an opinion one way or the other, feel free to reach out and tell me all about it.

How to Use Google Analytics and Tag Manager with WordPress

If you have a WordPress blog you might want to integrate Google Analytics (GA).  Instead of doing it directly with a plugin, however, consider integrating it by using Google's Tag Manager (GTM) service.  GTM gives you far more flexibility in tracking and will make your Google Analytics experience much richer.  I'll give you a brief rundown and a quick example to get you started.

Let's say you've written a blog and want to find out which articles people find most interesting.  You decide that instead of displaying the full text of every post, you'll write one paragraph and then include a "Read More" button that expands the post [this is a standard option in the WordPress post editor].  Then, you want to track clicks on any of these "Read More" buttons.  This means you need two things:  1) a piece of code running on the page that will signal when a click event on the right objects happens, and 2) a destination to record this data.  GTM helps you build/deploy the listener and GA will record the signal.

First, go to both analytics.google.com and tagmanager.google.com to create free accounts if you haven’t already.  You’ll need to follow the instructions and type in your domain.  Let’s start by building the GA goal.

Click Admin -> Goals -> New Goal.  In Goal Setup, choose Custom instead of a Template.  In Goal Description, name it ReadMoreButton and select "Event" for the Type.  In Goal Details we'll fill out the following information:  Category Equals to ReadMore; Action Equals to Click; Label Equals to ReadMore; Value Equals to 1.  We've just created a goal, and it's identified by these event details.  We will use the same information when defining the signal that GTM sends, which establishes the handshake between the two services.

Now go to GTM.  First, you need to deploy general Google Analytics to your site.  (Note:  If you previously had a GA WordPress plugin installed, deactivate it, as this will take its place.)  In your Workspace, click Add A New Tag.  Click the tag icon in the Tag Configuration card and select Universal Analytics.  It will ask for your Tracking ID from GA.  To get that, navigate to GA and click Admin -> Tracking Info -> Tracking Code.  Copy and paste the text that looks like "UA-XXXXXXXX".  Leave Track Type as Page View and then click the icon in Triggers to select "All Pages".  Change the title of the tag in the upper left corner of the screen to "ga-integration" and then save your changes.

This creates the code that needs to be embedded in all of your web pages for Google Analytics to work on your blog.  Instead of doing that manually, download the "DuracellTomi's Google Tag Manager for WordPress" plugin.  Once downloaded, activate it and then just paste in your GTM ID, which looks like "GTM-XXXXXX".

Let's pause and test the general GA integration.  Open up a tab with your GA dashboard and navigate to Reporting -> Real Time -> Overview.  It should show a 0 because we haven't yet deployed our GTM tags/triggers.  Open up another tab with your GTM workspace.  Next to the "Publish" button, expand the arrow and choose "Preview".  Now, open up a third tab and navigate to your website.  If everything is working correctly you should see a GTM toolbar on the bottom of your page that contains Tags Fired On This Page: ga-integration.  Your GA dashboard should now display a number > 0, representing you and whoever else might be there.

Now let’s create a tag/trigger for our Read More event.  Let’s first create the tag in GTM by clicking Tags -> New.  Give it a title of “ReadMoreTag” in the upper left corner.  As before, in Tag Configuration choose Universal Analytics and add your Tracking ID.  However, this time change Track Type to Event.  Then, fill out Category/Action/Label/Value in exactly the same manner as we did when creating the goal in GA.  This will result in GTM sending a signal to GA that has all the information GA needs to recognize this event as our goal.  Leave everything else as the defaults.  Also, leave “Triggering” blank for a minute and save.

We ignore triggering for a moment because, by default, several options are not available to us.  To enable them, navigate on the sidebar to Variables and click Configure in the Built-In Variables section. I enabled everything in the Clicks and Forms section, though not all are needed for this tutorial.

To create our tag’s trigger, click Triggers -> New and then the icon.  From the menu choose Just Links from the Click category.  Under “This trigger fires on” select “Some Link Clicks”.  This is where we will filter to only Read More links.  In my case I filled this out to look like:  Click classes equals more-link.  This translates to: only trigger this tag if the link that they clicked on has the CSS class of “more-link”.  To find your CSS class via the Chrome browser, right click on a Read More link and choose Inspect to view the html code on your page.  The code for mine looks like this:

<a href="http://zakjost.com/2016/01/deriving-the-normal-distribution-part-2/#more-257" class="more-link"><span>Read more</span></a>

Note that the class is equal to “more-link”, but yours might be different.  Fill in the relevant CSS class information for the filter, give your trigger a name and save it.

Finally, we need to assign this trigger to your tag.  Click Tags -> ReadMoreTag and then select our new trigger in the Triggering section.

To test, you should now go into Preview Mode and navigate to your site to check that the triggers fire.  A quick tip:  instead of clicking the Read More links normally, which refreshes the page and erases the GTM toolbar information, do a Cmd+click or Ctrl+click to open a new tab.  This should preserve the data in the existing tab's toolbar.  If everything was done correctly you should see the toolbar say "Tags Fired On This Page: ga-integration, ReadMoreTag" after you Cmd+click a Read More link.

And lastly, if you return to GA and go to Reporting -> Real Time -> Conversions, you should see the ReadMoreButton Goal we created increment to a 1.  Now you will know every time someone clicks a Read More button on your blog.  Remember to turn off Preview Mode and Publish your work once you’re ready to deploy for real.  Good luck!

Pattern Recognition and Machine Learning

A quick book recommendation for those trying to strengthen their machine learning knowledge:  Christopher Bishop's Pattern Recognition and Machine Learning.  I have found it incredibly helpful in adding depth to sources like Andrew Ng's Coursera class.  It's apparently a staple in the Computer Science world, but I was never exposed to it, having come from physics.

The topics you would expect are there with great depth and clarity.  However, it is also focused on providing the Bayesian perspective.  If you’re new to Bayesian ideas applied to Machine Learning this is an excellent text.  It does a great job of both contrasting frequentist vs Bayesian approaches and showing their connections.  It should be on every Data Scientist’s shelf.
