Free AI web copilot to create summaries, insights and extended knowledge, download it at here

Abstract

om/understanding-support-vector-machine-part-1-lagrange-multipliers-5c24a52ffc5e">support vectors,</a> we have already seen (please check to understand the mathematical formulation) that the maximization depends only on the dot products of support vectors,</p><figure id="aef0"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*vy4RBK-70-4PeEKvM_w3bQ.png"><figcaption></figcaption></figure><p id="3c52">not only that, as the decision rule also depends on the dot product of support vector and a new sample</p><figure id="b155"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*z3Dse6PlT0EZ5uMl8nu3fg.png"><figcaption></figcaption></figure><p id="c604">That means if we use a mapping function that maps our data into a higher dimensional space, then, the maximization and decision rule will depend on the dot products of the mapping function for different samples, as below -</p><figure id="f737"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Fu34Ex4Y7in3rVHBwbE_gw.png"><figcaption></figcaption></figure><p id="c6e2">And Voila!! If we have a function <i>K</i> defined as below</p><figure id="577b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*vZtZjOKpXnTW9fInXGdVBQ.png"><figcaption></figcaption></figure><p id="cb99">then we just need to know <i>K</i> and not the mapping function itself. This function is known as <b>Kernel function</b> and it reduces the complexity of finding the mapping function. So, <b>Kernel function defines inner product in the transformed space.</b></p><p id="a088">Let us look at some of the most used kernel functions</p><figure id="eb04"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*V_jH1lc4XOUxxVCPn-13cQ.png"><figcaption></figcaption></figure><p id="aeca">I have used Radial Basis Function kernel to plot figure 2, where mapping from 2D space to 3D space indeed helps us in classification.</p><p id="62e8">Apart from this predefined kernels, <i>what conditions determine which functions can be considered as Kernels ? </i>This is given by <b>Mercer’s theorem.</b> First condition is rather trivial i.e. the <b>Kernel function must be symmetric.</b> As a Kernel function is <i>dot product (inner product)</i> of the mapping function we can write as below —</p><figure id="ba4c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*3wyI43_K9ulob19IE5tI6Q.png"><figcaption></figcaption></figure><p id="b075">Mercer theorem guides us to the necessary and sufficient condition for a function to be Kernel function. One way to understand the theorem is —</p><figure id="5de2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*WU4T1PMct7_FxeOHw7Scpw.png"><figcaption></figcaption></figure><p id="e24b"><b>In other words, in a finite input space, if the Kernel matrix (also known as Gram matrix) is <i>positive semi-definite</i> then, the matrix element i.e. the function K can be a kernel function. </b>So the Gram matrix merges all the information necessary for the learning algorithm, the data points and the mapping function fused into the inner product.</p><p id="b139">Let’s see an example of finding the <b>mapping function from the kernel function</b> and here we will use Gaussian kernel function</p><figure id="6ee5"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*dvBDsHAwe7qkGiPfsPGdoA.png"><figcaption></figcaption></figure><h2 id="5153">Tuning Parameter</h2><p id="fe77">Since we have discussed about the non-linear kernels and specially Gaussian kernel (or RBF kernel), I will finish the post with intuitive understanding for one of the tuning parameters in SVM —<i> gamma.</i></p><p id="3bd6">Looking at the RBF kernel we see that it depends on the Euclidean distance between two points, i.e. if two vectors are closer then this term is small. As the variance is always positive, this means for closer vectors, the RBF kernel is widely spread than the farther vectors. When <i>gamma </i>parameter is high the value of kernel function will be less, even for two close by samples, and, this may cause complicated decision boundary or give rise to over-fitting. You can read more on this in a <a href="https://towardsdatascience.com/visualizing-support-vector-machine-decision-boundary-69e7591dacea">different post

Options

</a> by me.</p><figure id="b8fe"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*B8e0TE2rTx8gdOH1FA1rXg.png"><figcaption>Example of over-fitting and complex decision boundary with high values of gamma. [Image Courtesy : <a href="https://chrisalbon.com/machine_learning/support_vector_machines/svc_parameters_using_rbf_kernel/">Chris Albon]</a></figcaption></figure><p id="1771">Before we finish this discussion let’s recall what have we learn so far in this chapter —</p><ol><li>Mapping data points from low dimensional space to a higher dimensional space can make it possible to apply SVM even for non-linear data sample.</li><li>We don’t need to know the mapping function itself, as long as we know the Kernel function ; <b><i>Kernel Trick</i></b></li><li>Condition for a function to be considered as kernel function; <i>Positive semi-definite Gram matrix.</i></li><li>Types of kernels that are used most.</li><li>How the tuning parameter gamma can lead to over fitting or bias in RBF kernel.</li></ol><p id="d802">Hope you have enjoyed the post and, on the next chapter we will see some machine learning examples using SVM algorithm.</p><p id="cc69"><b>The complete math behind the SVM algorithm was discussed here:</b></p><div id="9b0b" class="link-block"> <a href="https://towardsdatascience.com/understanding-support-vector-machine-part-1-lagrange-multipliers-5c24a52ffc5e"> <div> <div> <h2>Support Vector Machine: Complete Theory</h2> <div><h3>Decision Rule for SVM</h3></div> <div><p>towardsdatascience.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*i7dAZe3aSZP6KNHhgk0GVg.png)"></div> </div> </div> </a> </div><p id="f63f">For more on plotting and understanding support vector machine decision boundary, please check this post:</p><div id="5c49" class="link-block"> <a href="https://towardsdatascience.com/visualizing-support-vector-machine-decision-boundary-69e7591dacea"> <div> <div> <h2>Principal Component Analysis and SVM in a Pipeline with Python</h2> <div><h3>Pipeline, GridSearchCV and Contour Plot</h3></div> <div><p>towardsdatascience.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*lSGgwEqxl2jtIpudB9-Mbg.png)"></div> </div> </div> </a> </div><p id="7f53">Stay Strong ! Cheers !!</p><p id="72b6"><b><i>If you’re interested in further fundamental machine learning concepts and more, you can consider joining Medium using <a href="https://saptashwa.medium.com/membership">My Link</a>. You won’t pay anything extra but I’ll get a tiny commission. Appreciate you all!!</i></b></p><div id="9d5a" class="link-block"> <a href="https://medium.com/subscribe/@saptashwa?source=publishing_settings-------------------------------------"> <div> <div> <h2>Get an email whenever Saptashwa Bhattacharyya publishes.</h2> <div><h3>Get an email whenever Saptashwa Bhattacharyya publishes. By signing up, you will create a Medium account if you don't…</h3></div> <div><p>medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*-dlsxxogRQ7LRBbq)"></div> </div> </div> </a> </div><h2 id="bc57">References:</h2><ol><li>“The Elements of Statistical Learning”; Hastie, T. et.al. ; <a href="https://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics-ebook/dp/B00475AS2E">amazon link</a></li><li>“Properties of Kernel ”; Berkley <a href="https://people.eecs.berkeley.edu/~jordan/kernels/0521813972c03_p47-84.pdf">University; pdf</a>.</li><li>“Kernels”; MIT <a href="https://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/lecture-notes/MIT15_097S12_lec13.pdf">lecture-notes</a>.</li></ol></article></body>

Support Vector Machine: Kernel Trick; Mercer’s Theorem

Understanding SVM Series : Part 2

Prerequisite: 1. Knowledge of Support vector machine algorithm which I have discussed in the previous post. 2. Some basic knowledge of algebra.

In the 1st part of this series, from the mathematical formulation of support vectors, we have found two important concepts of SVM, which are

SVM problem is a constrained minimization problem and we have learned to use Lagrange Multiplier method to deal with this.
To find the widest road between different samples we just need to consider dot products of support vectors and the samples.

In the previous post of SVM, I took a simple example of samples that are linearly separable to demonstrate Support Vector Classifiers. What happens if we have a sample set like below ?

Blue and Red samples all over the place !!!! At least it seems like (Source: Author)

This plot is generated using the in built make_circlesdataset of sklearn.

import numpy as np 
import sklearn 
import matplotlib.pyplot as plt
from sklearn.datasets.samples_generator import make_circles

X,y = make_circles(90, factor=0.2, noise=0.1) 
#noise = standard deviation of Gaussian noise added in data. 
#factor = scale factor between the two circles

plt.scatter(X[:,0],X[:,1], c=y, s=50, cmap='seismic')
plt.show()

As you can understand that it is impossible to draw a line in this 2d plot which could separate the blue samples from the red ones. Is it still possible for us to apply SVM algorithm ?

What if we give this 2d space a bit of shaking and the red samples fly off the plane and appears separated from the blue samples ? Take a look !

After ‘shaking’ the data samples now it seems there is a great possibility of classification! Find the code in my github (Source: Author)

Well it seems shaking the data samples indeed worked, and now we can easily draw a plane (instead of line as we have used before) to classify these samples. So what actually happened during shaking ?

When we don’t have linear separable set of training data like the example above (in real life scenario most of the data-set are quite complicated), the Kernel trick comes handy. The idea is mapping the non-linear separable data-set into a higher dimensional space where we can find a hyperplane that can separate the samples.

So it is all about finding the mapping function that transforms the 2D input space into a 3D output space. Or is it ? And what the heck is Kernel Trick ?

From the previous post about support vectors, we have already seen (please check to understand the mathematical formulation) that the maximization depends only on the dot products of support vectors,

not only that, as the decision rule also depends on the dot product of support vector and a new sample

That means if we use a mapping function that maps our data into a higher dimensional space, then, the maximization and decision rule will depend on the dot products of the mapping function for different samples, as below -

And Voila!! If we have a function K defined as below

then we just need to know K and not the mapping function itself. This function is known as Kernel function and it reduces the complexity of finding the mapping function. So, Kernel function defines inner product in the transformed space.

Let us look at some of the most used kernel functions

I have used Radial Basis Function kernel to plot figure 2, where mapping from 2D space to 3D space indeed helps us in classification.

Apart from this predefined kernels, what conditions determine which functions can be considered as Kernels ? This is given by Mercer’s theorem. First condition is rather trivial i.e. the Kernel function must be symmetric. As a Kernel function is dot product (inner product) of the mapping function we can write as below —

Mercer theorem guides us to the necessary and sufficient condition for a function to be Kernel function. One way to understand the theorem is —

In other words, in a finite input space, if the Kernel matrix (also known as Gram matrix) is positive semi-definite then, the matrix element i.e. the function K can be a kernel function. So the Gram matrix merges all the information necessary for the learning algorithm, the data points and the mapping function fused into the inner product.

Let’s see an example of finding the mapping function from the kernel function and here we will use Gaussian kernel function

Tuning Parameter

Since we have discussed about the non-linear kernels and specially Gaussian kernel (or RBF kernel), I will finish the post with intuitive understanding for one of the tuning parameters in SVM — gamma.

Looking at the RBF kernel we see that it depends on the Euclidean distance between two points, i.e. if two vectors are closer then this term is small. As the variance is always positive, this means for closer vectors, the RBF kernel is widely spread than the farther vectors. When gamma parameter is high the value of kernel function will be less, even for two close by samples, and, this may cause complicated decision boundary or give rise to over-fitting. You can read more on this in a different post by me.

Example of over-fitting and complex decision boundary with high values of gamma. [Image Courtesy : Chris Albon]

Before we finish this discussion let’s recall what have we learn so far in this chapter —

Mapping data points from low dimensional space to a higher dimensional space can make it possible to apply SVM even for non-linear data sample.
We don’t need to know the mapping function itself, as long as we know the Kernel function ; Kernel Trick
Condition for a function to be considered as kernel function; Positive semi-definite Gram matrix.
Types of kernels that are used most.
How the tuning parameter gamma can lead to over fitting or bias in RBF kernel.

Hope you have enjoyed the post and, on the next chapter we will see some machine learning examples using SVM algorithm.

The complete math behind the SVM algorithm was discussed here:

Support Vector Machine: Complete Theory

Decision Rule for SVM

towardsdatascience.com

For more on plotting and understanding support vector machine decision boundary, please check this post:

Principal Component Analysis and SVM in a Pipeline with Python

Pipeline, GridSearchCV and Contour Plot

towardsdatascience.com

Stay Strong ! Cheers !!

If you’re interested in further fundamental machine learning concepts and more, you can consider joining Medium using My Link. You won’t pay anything extra but I’ll get a tiny commission. Appreciate you all!!

Get an email whenever Saptashwa Bhattacharyya publishes.

Get an email whenever Saptashwa Bhattacharyya publishes. By signing up, you will create a Medium account if you don't…

medium.com

References:

“The Elements of Statistical Learning”; Hastie, T. et.al. ; amazon link
“Properties of Kernel ”; Berkley University; pdf.
“Kernels”; MIT lecture-notes.