I found an open data set on the Machine Learning Repository concerning Breast Cancer and thought I would use it as a way to practice machine learning approaches. Briefly, the training data indicate whether the instance was Benign or Malignant and the values of a variety of features: i.e. Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape…etc.

Please go to my GitHub Breast Cancer Repository for the data set and code.

#### Approach Outline

The data set had 9 features, two classifications and about 700 data points. I split it so that the training of my model happened on the first 350 data points and the remainder were used for checking the accuracy.

Our prediction equation is a sigmoid and is interpreted as the probability that the tumor is malignant based on the input features. The input to the sigmoid is a sum of each feature multiplied by a parameter value, i.e. where is the parameter vector and is the feature vector of a given data point i: . Explicitly, the input for a set of features is: and the output is the probability of being malignant with the chosen model parameters (). My program says if the output is greater than or equal to 0.5 (meaning the probability is greater than 50%), then the prediction is that the tumor is malignant.

The cost function is such that if the sigmoid predicts a 1 (meaning the probability the tumor is malignant is 100%) and the* real* classification is that it’s benign, then there is an infinite cost. The same is true if it predicts a 0 (meaning the probability of being malignant is 0%) and the* real* classification is malignant. The cost equation ends up being: where m is the number of data points, h is the output of the prediction sigmoid and y is the classification of 0 or 1 (benign/malignant). The idea is that if you’re 100% confident and end up being wrong, you have a very large cost.

The job is then to find the vector theta that minimizes this cost function. The gradient descent method is one choice, but for the first iteration I tried a minimization function built into Octave ‘fminunc’ with 400 iterations.

#### Initial Results

The ‘fminunc’ returned a vector of parameters theta that minimized the cost function specified above. I then used the sigmoid with these values for to make predictions on the remaining ~350 data points. So for each data point there are a set of features given by that were dotted with my parameter vector and fed as the input to the sigmoid function. If this number was greater than or equal to 0.5, it predicted the tumor was malignant.

The prediction accuracy ends up being 98%!

#### Gradient Descent

Next I used a Gradient Descent (GD) minimization algorithm instead of the built-in ‘fminunc’ feature to compare results. The results when using GD were very sensitive to the learning rate parameter . If I used things remained pretty stable and gave reasonable accuracy, but took more iterations to achieve. The accuracy ended up exceeding that of fminunc very slightly with the best value being 98.6% compared to fminun’s 98.2%. Essentially it classified one more of the data points correctly.

For the transient behavior resulted became unstable as log functions began returning infinities. The algorithm ignored these results and continued to minimize and the output would end up being similar to cases where this wasn’t an issue, but the descent was not a smooth march towards the minimum. See the figures below for a comparison between the methods and an example of an unstable cost output as a function of the number of iterations.

#### What the Model Tells Us

Once we have parameters that give us high accuracy in prediction, what can we learn from it? If you look at the sigmoid function it’s clear that when the input argument is positive, the answer tends towards 1 and when it’s negative it tends towards 0. The more positive/negative the input is, the closer to 1/0 the output is. So instead of writing out the sigmoid function, let’s just look at the input and see the conditions that make it positive or negative to garner some intuition.

The input is just . The machine learning algorithm outputs the following for the parameter vector: . Recall that is our vector of features (with so that we can have a term). The description of each feature is given below:

# Attribute: Domain

0. Constant: 1 for all

1. Clump Thickness: 1 – 10

2. Uniformity of Cell Size: 1 – 10

3. Uniformity of Cell Shape: 1 – 10

4. Marginal Adhesion: 1 – 10

5. Single Epithelial Cell Size: 1 – 10

6. Bare Nuclei: 1 – 10

7. Bland Chromatin: 1 – 10

8. Normal Nucleoli: 1 – 10

9. Mitoses: 1 – 10

Now, the first parameter, which merely multiplies the constant 1, is equal to -8.2. This means that in the absence of any other feature information, it’s most likely that the tumor is benign, not malignant. Good news! The next highest parameter value is 0.52, which multiplies the Clump Thickness. This feature seems to be the most strongly correlated with being malignant. The third parameter value is less than zero, indicating the higher the Uniformity of Cell Shape, the more negative the argument becomes and therefore the more likely to be benign.

#### Next Steps

Next we’ll see how the prediction accuracy changes when using more a complicated feature set (e.g. including combinations of features like ).

#### Acknowledgments

Data came from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.

#### Citations

1. O. L. Mangasarian and W. H. Wolberg: “Cancer diagnosis via linear

programming”, SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

2. William H. Wolberg and O.L. Mangasarian: “Multisurface method of

pattern separation for medical diagnosis applied to breast cytology”,

Proceedings of the National Academy of Sciences, U.S.A., Volume 87,

December 1990, pp 9193-9196.

3. O. L. Mangasarian, R. Setiono, and W.H. Wolberg: “Pattern recognition

via linear programming: Theory and application to medical diagnosis”,

in: “Large-scale numerical optimization”, Thomas F. Coleman and Yuying

Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.

4. K. P. Bennett & O. L. Mangasarian: “Robust linear programming

discrimination of two linearly inseparable sets”, Optimization Methods

and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).