Pattern Recognition and Machine Learning

A quick book recommendation for those trying to strengthen their machine learning knowledge:  Christopher Bishop’s Pattern Recognition and Machine Learning.  I have found it incredibly helpful in adding depth to sources like Andrew Ng’s Coursera class.  It’s apparently a staple in the Computer Science world, but I was never exposed having come from physics.

The topics you would expect are there with great depth and clarity.  However, it is also focused on providing the Bayesian perspective.  If you’re new to Bayesian ideas applied to Machine Learning this is an excellent text.  It does a great job of both contrasting frequentist vs Bayesian approaches and showing their connections.  It should be on every Data Scientist’s shelf.

Ellipse Detection

I enjoy learning new algorithms and am interested in computer vision.  It turned out that my company had a use for recognizing small circles in images and defect maps so I thought I’d use it as an opportunity to learn something.  You can find all the code and example data files you need at my GitHub repository on ellipse detection.

Background:

The standard way of finding the signature defect pattern was by human observation.  It’s obviously advantageous to automate this task both for rigorous statistics and for the sake of freeing up peoples’ time.  Further, more robustly understanding the timing and frequency of this defect might allow us to learn something about the root cause.

Method:

I first wanted to prototype a solution in the easiest case, which is when dealing with a defect map with a list of $(x,y)$ coordinates.  I found a paper by Xie, et. al. that reduces the task of finding an ellipse to that of fitting a single parameter by making a few assumptions about the points and using a voting scheme on a one-dimensional accumulator array.  This is probably overkill since my defects are largely circular, but I wanted to understand how they reduced the complexity of the problem and get practice incorporating algorithms from the literature.

The trick to reduce the problem from finding 5 parameters $\{x_0, y_0, \alpha, a, b\}$ (center point, orientation,  length of major/minor axes) down to one is to simply assume the two chosen points are the two endpoints of the major axis.  If this is the case then $\{x_0, y_0, a, \alpha \}$ can all be calculated and only b needs to be voted upon.

So you pick two points and calculate the 4 geometric terms.  Next, iterate over all other points and calculate a minor axis for each and then increment the vote for that choice of minor axis.  Once you’ve iterated over all other points for that pair, find the minor axis that was voted for the most.  If the number of votes is above your threshold, you’ve found an ellipse!  This act of voting in a parameter space is called a Hough transform.

But as they say, the devil’s in the details.  It’s up to you to put in rules about minimum distances to consider and the appropriate value for the vote threshold.

Results:

I found I had far too many points than I needed to get a reasonable fit, so I did some filtering.  The below image shows the original points (blue), the remaining points after filtering (red), and the final calculated ellipse (solid blue line).  Better accuracy can be obtained by changing the resolution of the bins used for grouping the axes, but I was more interested in stability than an accurate fit.

Method criticisms:

Although I was able to get it to work, I had to deviate somewhat from the paper.  For instance, I was not able to get good results by merely deleting points as soon as any type of ellipse was found.  The reason is that a constant value for the minimum vote would sometimes be too high and miss ellipses or would be too low and allow many nearby ellipses fitting the same points and I would sometimes exclude a possible good ellipse by removing points after a bad ellipse was “found”.  I ended up keeping all points in the defect list even if they led to a “found ellipse” because it’s possible that another ellipse using those same points may be a better fit.  I found it better to leave the points in, find all the suggested ellipses, and then choose the best ones by ruling out nearby alternatives with lower vote numbers.

Conclusion:

The prototype is successful in finding ellipses with the small file set I tested, but there is a lot of room for improvement in tuning the various parameters and rules to strike a balance between speed and accuracy.