Support Vector Machines (SVMs) are classifiers that can build a non-linear decision boundary, much like multinomial logistic regression, but without having to manually choose transformations of the features (e.g. $x_1^2$, $x_1 x_2$, $\sqrt{x_1}$, etc.).
The main concepts behind how an SVM works are Optimal Margin Classifiers and Kernels.
Optimal Margin Classifier
For now we assume that the dataset is separable, meaning there exists a linear boundary that separates the examples.
While at its core the SVM is a linear classifier, it can become a non-linear classifier via the kernel trick: it still finds a linear separator, but in a higher-dimensional transformed space, which allows the model to capture non-linear relationships in the original space.
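As a rough, self-contained illustration of this idea (my own toy example, not from the original note): with a hand-picked explicit feature map $\phi$, data that is not linearly separable in the original space becomes linearly separable in a higher-dimensional space. Kernels, introduced later, aim to get this effect without ever computing $\phi$ explicitly.

```python
import numpy as np

# Toy data: class +1 inside a circle of radius 1, class -1 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.sum(X**2, axis=1) < 1.0, 1, -1)

# No straight line in R^2 separates these classes, but after the
# (hand-picked) feature map phi(x) = (x1, x2, x1^2 + x2^2) the classes
# are separated by the linear boundary z3 = 1 in R^3.
def phi(X):
    return np.column_stack([X[:, 0], X[:, 1], np.sum(X**2, axis=1)])

Z = phi(X)
w, b = np.array([0.0, 0.0, 1.0]), -1.0        # linear separator in feature space
predictions = np.where(Z @ w + b < 0, 1, -1)  # inside the circle -> +1
print("separable in feature space:", np.all(predictions == y))
```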
Functional Margin
Recap on how binary classification works in logistic regression
The classifier predicts $1$ if $h_\theta(x) \geq 0.5$, and $0$ otherwise. This is because $h_\theta(x) = g(\theta^T x)$, and $g(\theta^T x) \geq 0.5$ exactly when $\theta^T x \geq 0$.
In other words, this means that if $y^{(i)} = 1$, we hope that $\theta^T x^{(i)} \gg 0$; if $y^{(i)} = 0$, we hope that $\theta^T x^{(i)} \ll 0$.
Functional margin of the hyperplane defined by $(w, b)$ with respect to $(x^{(i)}, y^{(i)})$
The parameters $(w, b)$ define a hyperplane, which is just the higher-dimensional analogue of a line, that separates the samples. From here on the labels are taken as $y^{(i)} \in \{-1, +1\}$.
We define the functional margin as follows:
$$\hat{\gamma}^{(i)} = y^{(i)} \left( w^T x^{(i)} + b \right)$$
As before for logistic regression:
- If $y^{(i)} = 1$, we want $w^T x^{(i)} + b > 0$ and hope that $w^T x^{(i)} + b \gg 0$;
- If $y^{(i)} = -1$, we want $w^T x^{(i)} + b < 0$ and hope that $w^T x^{(i)} + b \ll 0$.

And this means that we want $y^{(i)} \left( w^T x^{(i)} + b \right) > 0$ and hope that $\hat{\gamma}^{(i)} = y^{(i)} \left( w^T x^{(i)} + b \right) \gg 0$.
We can define the functional margin with respect to the entire training set as:
$$\hat{\gamma} = \min_{i=1,\dots,m} \hat{\gamma}^{(i)}$$
There is a problem here: it is possible to “cheat” and change the functional margin, without influencing the decision boundary, by just multiplying both parameters $w$ and $b$ by the same positive factor. For example, replacing $(w, b)$ with $(2w, 2b)$ doubles every $\hat{\gamma}^{(i)}$ but leaves the hyperplane $w^T x + b = 0$ unchanged.
A way to solve this problem is to normalize the length of the parameters, for example by forcing the magnitude of the vector $w$ to be $1$ ($\|w\| = 1$), or by replacing $(w, b)$ with $\left( \frac{w}{\|w\|}, \frac{b}{\|w\|} \right)$.
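A small numeric sketch of the scaling “cheat” and of the normalization fix (toy data and parameter values chosen only for illustration):

```python
import numpy as np

# Toy separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

def functional_margin(w, b, X, y):
    # Per-sample functional margins y_i * (w^T x_i + b); the margin of the
    # whole training set is the smallest one.
    return np.min(y * (X @ w + b))

w, b = np.array([1.0, 1.0]), 0.0
print(functional_margin(w, b, X, y))            # 3.0
print(functional_margin(10 * w, 10 * b, X, y))  # 10x larger, same boundary

# Normalizing (w, b) by ||w|| makes the margin scale-invariant
# (this is exactly the geometric margin defined below).
def normalized_margin(w, b, X, y):
    norm = np.linalg.norm(w)
    return functional_margin(w / norm, b / norm, X, y)

print(normalized_margin(w, b, X, y))
print(normalized_margin(10 * w, 10 * b, X, y))  # unchanged
```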
Geometric Margin
The geometric margin describes how far the decision boundary stands apart from the examples. A decision boundary that separates the points well, meaning it doesn’t come very close to any training sample, has a larger geometric margin and is therefore better.
More formally, the geometric margin is the Euclidean distance between a training sample’s coordinates and the line (or hyperplane) that represents the decision boundary.
We formally define the geometric margin with respect to a particular training sample as:
$$\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right)$$
We can define the geometric margin with respect to the entire training set as the worst geometric margin w.r.t. a particular training sample:
$$\gamma = \min_{i=1,\dots,m} \gamma^{(i)}$$
We can now see that the geometric margin and the functional margin are related:
$$\gamma^{(i)} = \frac{\hat{\gamma}^{(i)}}{\|w\|}$$
An optimal margin classifier aims to find the parameters $(w, b)$ that maximise the geometric margin $\gamma$.
In other words, this means to find:
$$\max_{\gamma, w, b} \ \gamma \quad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \geq \gamma \ \ \forall i, \qquad \|w\| = 1$$
This problem, written like this, cannot be solved with algorithms like gradient descent or similar (the constraint $\|w\| = 1$ is not convex), but it can be rewritten as an equivalent convex problem that standard optimizers can handle:
$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \geq 1 \ \ \forall i$$
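As a minimal sketch of the rewritten problem in practice, assuming the cvxpy convex-optimization library and the same kind of toy separable data as above (my own illustration; any QP solver would do):

```python
import numpy as np
import cvxpy as cp

# Toy separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n_features = X.shape[1]
w = cp.Variable(n_features)
b = cp.Variable()

# min (1/2) ||w||^2  s.t.  y_i * (w^T x_i + b) >= 1 for every training sample.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("geometric margin =", 1 / np.linalg.norm(w.value))
```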
Kernels
Let’s suppose that the parameter $w$ can be expressed as a linear combination of the training examples:
$$w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$$
The representer theorem proves that this is possible without losing any performance.
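To make this concrete, here is a tiny check (with toy data and placeholder coefficients of my own, not values from an actual SVM training run) that once $w$ has this form, the decision value $w^T x + b$ depends on the inputs only through inner products between examples, which is exactly what a kernel can replace:

```python
import numpy as np

# With w = sum_i alpha_i * y_i * x_i, the decision value w^T x + b only
# needs inner products <x_i, x>.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
alpha = np.array([0.1, 0.2, 0.1, 0.2])   # hypothetical coefficients
b = 0.0

x_new = np.array([1.0, 0.5])

# Direct computation with the explicit w...
w = (alpha * y) @ X
direct = w @ x_new + b

# ...equals the computation that only uses inner products between examples.
via_inner_products = np.sum(alpha * y * (X @ x_new)) + b
print(np.isclose(direct, via_inner_products))   # True
```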
tags:#ai/machine-learning