Support Vector Machines are another classifier that allows us to build a non-linear decision boundary, much like multinomial logistic regression, but without having to manually choose non-linear transformations of the features ($x_1^2$, $x_1 x_2$, etc.).
The main concepts behind how an SVM works are the Optimal Margin Classifier and Kernels.
Optimal Margin Classifier
For now we assume that the dataset is linearly separable, meaning there exists a linear boundary that separates the examples.
Functional Margin
-
Recap on how binary classification works in logistic regression
The classifier predicts $1$ if $h_\theta(x) \geq 0.5$, and $0$ otherwise. This is because $h_\theta(x) = g(\theta^T x) \geq 0.5$ exactly when $\theta^T x \geq 0$.
In other words, this means that if $y^{(i)} = 1$, we hope that $\theta^T x^{(i)} \gg 0$; if $y^{(i)} = 0$, we hope that $\theta^T x^{(i)} \ll 0$.
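A minimal NumPy sketch of this rule, with made-up parameter values purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters and sample (the first entry of x is the intercept term).
theta = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 0.8])

# h(x) >= 0.5 exactly when theta^T x >= 0, so the two checks always agree.
print(sigmoid(theta @ x) >= 0.5)  # True
print(theta @ x >= 0)             # True
```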
Functional margin of the hyperplane defined by $(w, b)$ with respect to $(x^{(i)}, y^{(i)})$
The parameters $(w, b)$ define a hyperplane, which is just a higher-dimensional line that separates the samples.
We define the functional margin as follows:
$$\hat{\gamma}^{(i)} = y^{(i)} \left( w^T x^{(i)} + b \right)$$
As before for logistic regression (here the labels are $y^{(i)} \in \{-1, +1\}$):
- If $y^{(i)} = 1$, we want $w^T x^{(i)} + b \geq 0$ and hope that $w^T x^{(i)} + b \gg 0$;
- If $y^{(i)} = -1$, we want $w^T x^{(i)} + b \leq 0$ and hope that $w^T x^{(i)} + b \ll 0$.
And this means that we want $y^{(i)} \left( w^T x^{(i)} + b \right) \geq 0$ and hope that $y^{(i)} \left( w^T x^{(i)} + b \right) \gg 0$.
We can define the functional margin with respect to the entire training set as the worst functional margin over the samples:
$$\hat{\gamma} = \min_{i=1,\dots,m} \hat{\gamma}^{(i)}$$
There is a problem here: it is possible to “cheat” and change the functional margin, without influencing the decision boundary, by simply multiplying both parameters $w$ and $b$ by the same factor.
The way to solve this problem is to normalize the length of the parameters. One way of doing this is to force the magnitude of the vector $w$ to be $1$ ($\|w\| = 1$), or to replace $(w, b)$ with $\left( \frac{w}{\|w\|}, \frac{b}{\|w\|} \right)$.
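A minimal NumPy sketch of the functional margin, with made-up parameters and a single sample, showing how rescaling $(w, b)$ inflates it without moving the boundary:

```python
import numpy as np

def functional_margin(w, b, x, y):
    # Functional margin of the hyperplane (w, b) w.r.t. the sample (x, y), y in {-1, +1}.
    return y * (w @ x + b)

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 0.2])
y = 1

print(functional_margin(w, b, x, y))            # 2.3
print(functional_margin(10 * w, 10 * b, x, y))  # 23.0 -- 10x larger, same decision boundary
```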
Geometric Margin
The geometric margin describes how far the decision boundary stands apart from the examples. A decision boundary that separates the points well, meaning it does not come very close to some training samples, has a larger geometric margin and is therefore better.
More formally, the geometric margin is the Euclidean distance between the training sample coordinates and the line (or hyperplane) that represents the decision boundary.
We formally define the geometric margin with respect to a particular training sample as:
$$\gamma^{(i)} = \frac{y^{(i)} \left( w^T x^{(i)} + b \right)}{\|w\|}$$
We can define the geometric margin with respect to the entire training set as the worst geometric margin w.r.t. a particular training sample:
$$\gamma = \min_{i=1,\dots,m} \gamma^{(i)}$$
We can now see that the geometric margin and the functional margin are related:
$$\gamma^{(i)} = \frac{\hat{\gamma}^{(i)}}{\|w\|}$$
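Continuing the same hedged sketch, the geometric margin is the functional margin divided by $\|w\|$, so it is unaffected by the rescaling trick above:

```python
import numpy as np

def geometric_margin(w, b, x, y):
    # Signed Euclidean distance between the sample and the hyperplane (w, b).
    return y * (w @ x + b) / np.linalg.norm(w)

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 0.2])
y = 1

print(geometric_margin(w, b, x, y))            # ~1.03
print(geometric_margin(10 * w, 10 * b, x, y))  # identical: the rescaling cancels out
```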
An optimal margin classifier aims to find the parameters $(w, b)$ that maximise the geometric margin $\gamma$.
In other words, this means to find:
$$\max_{\gamma, w, b} \ \gamma \quad \text{subject to} \quad \frac{y^{(i)} \left( w^T x^{(i)} + b \right)}{\|w\|} \geq \gamma \quad \forall i$$
Written like this, the problem cannot be solved with algorithms like gradient descent or similar, but it can be rewritten as an equivalent convex problem that can be optimized with one of those algorithms:
$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \geq 1 \quad \forall i$$
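As a hedged sketch of what an optimal margin classifier computes in practice, one can use scikit-learn's SVC with a linear kernel and a very large C to approximate the hard-margin case on an assumed toy separable dataset:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy separable dataset (assumption: two well-separated blobs).
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.8, random_state=0)

# A very large C approximates the hard-margin optimal margin classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
# In the rescaled formulation the worst functional margin is 1,
# so the geometric margin is simply 1 / ||w||.
print("geometric margin ~", 1 / np.linalg.norm(w))
```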
Kernels
Let’s suppose that the parameter $w$ can be expressed as a linear combination of the training examples:
$$w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$$
The Representer Theorem proves that this is possible without losing any performance.
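A minimal sketch of why this matters, with made-up examples and coefficients: once $w$ is a linear combination of the training examples, predictions only need inner products between examples, which is exactly the quantity a kernel function $K(x^{(i)}, x)$ can replace.

```python
import numpy as np

# Hypothetical training set, labels in {-1, +1}, and coefficients alpha
# (all made up just to illustrate the identity).
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])
y = np.array([1, -1, 1])
alpha = np.array([0.2, 0.7, 0.1])
b = -0.3
x_new = np.array([0.4, 1.5])

# Explicit w as a linear combination of the training examples...
w = (alpha * y) @ X

# ...gives the same prediction as working only with inner products <x_i, x_new>,
# which is the quantity a kernel K(x_i, x_new) would compute implicitly.
print(w @ x_new + b)
print((alpha * y) @ (X @ x_new) + b)
```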
-
Assignments
-
First Presentation
Of course we want our model to have a high number of correct predictions, but what we want even more is for the model to have the least amount of false negatives. This is because in a sensitive environment such as healthcare, we don’t want to predict that a person won’t have a stroke when they probably will. This means that the model should have a high recall value, without of course decreasing the accuracy.
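A small hedged example of why recall, rather than accuracy alone, captures this goal (made-up labels, using scikit-learn's metrics):

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up ground truth and predictions (1 = stroke, 0 = no stroke).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # two false negatives

print(accuracy_score(y_true, y_pred))  # 0.8  -- looks fine
print(recall_score(y_true, y_pred))    # 0.33 -- exposes the missed strokes
```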
-
Final Presentation
Original
Since it was a simple classification problem, we started by creating a Logistic Regression based model. Instead of using a black-box model, we tried to implement it from scratch in order to have more control over its execution. We used gradient descent to reduce the cost and finally saw how the metrics changed as we varied the threshold. Results were good, but we wanted to try to get higher performance using other approaches. So we also implemented GDA (from scratch as well), K-Nearest Neighbors (with an efficient selection of the K value), Support Vector Machine and Random Forest. The last three were done using SkLearn.
Putting it all together, KNN with k=18 gives us the best accuracy with a lower recall. Logistic Regression works well too: choosing a threshold of 0.55 we managed to get a larger recall than KNN with a slightly lower accuracy. Random Forest (with a max depth of 5) and the Support Vector Classifier have more or less the same trend as Logistic Regression. Concerning our goal, GDA has the best performance since it has a high accuracy and a low number of false negatives.
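A hedged sketch of the threshold sweep described above; the original model was implemented from scratch, so this uses scikit-learn's LogisticRegression on a synthetic stand-in dataset purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

# Synthetic stand-in for the stroke dataset (assumption: binary, slightly imbalanced).
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Sweep the decision threshold and watch the accuracy/recall trade-off.
for t in (0.3, 0.5, 0.55, 0.7):
    pred = (proba >= t).astype(int)
    print(t, accuracy_score(y, pred), recall_score(y, pred))
```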
Reduced
Since it was a simple classification problem, we started with Logistic Regression. We implemented it from scratch in order to have more control over its execution, seeing how the metrics changed as we varied the threshold.
We then tried other methods like Gaussian Discriminant Analysis (from scratch as well), K-Nearest Neighbors (with an optimal selection of the K value), Support Vector Machine and Random Forest, the latter three using the SKLearn library.
In conclusion, we found that KNN with k=18 and RFC have the best accuracy but a low recall. Logistic Regression with a threshold of 0.55 and Support Vector Machine have a higher recall with slightly lower accuracy, but concerning our goal, GDA (in particular Quadratic Discriminant Analysis) has the highest recall while maintaining high accuracy.