avatarChristopher Chung

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

3475

Abstract

InJowXV">See maths explained here.</a></p><figure id="a471"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*poluhJC7FK2DaWGbBWvwew.png"><figcaption>SVM Decision Rule 1 — Constraint (Samples on the Boundaries)</figcaption></figure><h2 id="c745">3. Decision Rule — Maximum Width</h2><p id="d485">Let’s say we have the vector<b><i> </i></b><i>x</i>+<b> </b>on the gutter of positive boundary, and the vector <i>x-</i> on the gutter of negative boundary. The<b> </b><i>x+</i> minus<b> </b><i>x-</i> stands for the directional force from the negative vector <i>x-</i> to the positive vector <i>x+</i>. If we perform dot product on this directional force with the unit vector of <i>w</i> which is perpendicular to the decision boundary, then this becomes the width between negative and positive boundaries. Note that <i>w</i> is normal vector and ||<i>w</i>|| is the magnitude of w. <a href="https://www.mathcha.io/editor/XPgE9hqLS21T87r230UDzpE2xuJYBBk8InJowXV">See maths explained here.</a></p><figure id="75ce"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*-vMK58QbdtRb8VuD9Nfvtg.png"><figcaption></figcaption></figure><p id="9fb4">We basically maximise this width to distinctively separate negative and positive data points. This can be simplified as below. The last form squares the magnitude of <i>w</i> and divides it by 2 for mathematical convenience.</p><figure id="387e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*rqgxUEiuoPA0XovJIE71kg.png"><figcaption>SVM Decision Rule 2— Max Width</figcaption></figure><h2 id="7e41">3. Constrained Optimisation — Find Max Width with a Constraint</h2><p id="797c">The <b>Lagrangian equation</b> can be used to solve a constrained optimisation problem. If the constraint changes by one unit, then the maximum value of the objective function reduces by λ. The equation is generally used to find maxima or minima of the objective function given the constraints.</p><ul><li>L(x, λ) = f(x)- λ g(x)</li><li><i>f(x): objective function</i></li><li><i>g(x): constraint</i></li><li><i>λ: Lagrangian</i></li></ul><p id="fd6a">Earlier we mentioned that SVM takes the widest street approach to find the maximum width between positive and negative boundaries. This problem can be described using the Lagrangian equation with the objective function and constraint defined as below.</p><figure id="aa87"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*rqgxUEiuoPA0XovJIE71kg.png"><figcaption>Objective Function: f(x) — Max Width</figcaption></figure><figure id="ee98"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*poluhJC7FK2DaWGbBWvwew.png"><figcaption>Constraint Function: g(x) — Samples on the Boundaries</figcaption></figure><p id="6b58">In summary, the Lagrangian minimises the objective function (eventually maximise the width between positive and negative boundaries) given the constraint that samples are support vectors on the gutters.</p><figure id="c2ba"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*xa4IS_UIA4ekqZHIBUgDgg.png"><figcaption>Find Max Width with Lagrangian</figcaption></figure><p id="6da8">After finding the derivatives with respect to <i>w</i> and <i>b</i> from the equation above, it can be simplified as below. Since <i>y i</i> and <i>y j</i> are labels or response variables, the equation can be simply minimised by <b>maximising the dot product of vector <i>x i</i> and <i>x j</i></b>. In other word

Options

s, maximisation of width <b>all depends on summing dot products of pairs of supports vectors (on the gutters) </b>while drawing boundary lines.</p><figure id="a51f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*7Iwbzfq9J20lqyWQD6l-OQ.png"><figcaption></figcaption></figure><p id="f88a">Furthermore, unknown vector <i>u </i>is determined whether it is placed on the positive side of decision boundary depending on dot products of support vector <i>x </i>and <i>u</i>.</p><figure id="343f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*yQ4NGBRKrEtKT4i6ZcBoAA.png"><figcaption></figcaption></figure><h2 id="38ff">4. Kernel Trick</h2><p id="3e9d">In the linear problem, SVM can easily draw a decision boundary to group samples into multiple classes. However, if data points cannot be separated with linear slices then data points can be transformed before drawing decision boundaries, which is called Kernel Trick.</p><figure id="9d63"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*gjyh3VvhB3taXnph3Dj17g.png"><figcaption></figcaption></figure><p id="9374">In the above, non-linear SVM becomes linear SVM problem after transforming with Kernel Trick. Kernels basically <b>map the problem from input space to a new higher-dimensional space (called the feature space) 𝜙<i>(x)</i></b> by doing a non-linear transformation using a special function called the Kernel. Then a linear model is used to separate data points in the feature space. The linear model in the feature space corresponds to a non-linear model in the input space.</p><figure id="16a7"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*wp4GV7zRyzd_zty5FH5ZMA.png"><figcaption>Map Input Space (Non-Linear) to Feature Space (Linear)</figcaption></figure><p id="a379">The SVM basic rule can be expressed as below in the feature space. The equation below is when the magnitude of <i>w </i>is replaced with linear sum of <i>a, y </i>and<i> x. <a href="https://www.mathcha.io/editor/XPgE9hqLS21T87r230UDzpE2xuJYBBk8InJowXV"></a></i><a href="https://www.mathcha.io/editor/XPgE9hqLS21T87r230UDzpE2xuJYBBk8InJowXV">See maths explained here.</a> The beauty of using Kernel is the original equation does not change since Kernel transformation is abstracted in phi <b>𝜙</b>.</p><figure id="114e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*xk5ATS1chFKotiAMyKivIA.png"><figcaption></figcaption></figure><p id="4978">Here are examples of Kernel functions. Usually, you can start with the simplest version of transformation and gradually model with more and more advanced kernel functions to avoid overfitting.</p><figure id="6f1f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*caq4F5JjyiL1H9lqgb-fuA.png"><figcaption>Examples of Kernel Functions</figcaption></figure><p id="b99c">Now that we cover how SVM solves the classification problem while drawing boundaries, we will build a model with sample dataset in the next post.</p><p id="c99b"><i>References:</i></p><ul><li><a href="http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html"><i>Cornell Lecture: SVM</i></a></li><li><a href="https://www.youtube.com/watch?v=_PwhiWxHK8o"><i>MIT Open Course: SVM</i></a></li><li><a href="https://www.youtube.com/watch?v=bM4_AstaBZo"><i>SVM (The Math): Data Science Concepts</i></a></li><li><a href="https://www.youtube.com/watch?v=tEx-iqUX9Z4"><i>Kernels in SVM</i></a></li></ul></article></body>

How SVM constructs boundaries? Math explained.

Mathematical concepts behind how SVM can separate data points in high dimensional space

Sydney Harbour Bridge, Australia — We live in a beautiful 3 dimensional world with the 4th as time

Support vector machines were first introduced by Vladmir Vapnik and his colleagues at Bell Labs in 1992. However, many are not aware that basics of support vector machines were already developed in 1960s with his PhD thesis at Moscow University. Over decades, SVM has been highly preferred by many since it uses less computational resources while allowing data scientists to achieve notable accuracy. Not to mention that it solves both classification and regression problems.

1. Basic Concept

SVM can solve linear and non-linear problems and work well for many practical business problems. The principle idea of SVM is straight forward. The learning model draws a line which separates data points into multiple classes. In a binary problem, this decision boundary takes the widest street approach maximising the distance to the closest data points from each class.

In vector calculus, the dot product measures ‘how much’ one vector lies along another, and tells you the amount of force going in the direction of the displacement, or in the direction of another vector.

For instance, we have the unknown vector u and normal vector w which is perpendicular to the decision boundary. The dot product of w·u denotes the amount of force by u going in the direction of vector w. In this regard, if unknown vector u locates on the positive side of boundary, it can be described as below with the constant b.

SVM Basic Rule

The samples locating above the boundary that classifies positive samples (+1) or below the boundary that classifies negative samples (-1) can be accordingly expressed.

2. Decision Rule — Constraint

When a decision boundary is determined, the positive and negative boundaries should be drawn in a way that the closest samples from each group maximise the width, and thus those samples are placed on each group’s boundary.

This rule will become a constraint to find the maximum width of boundaries. Given that y is +1 for positive samples and -1 for negative samples, both equations above can express sample x on the gutter of positive or negative boundary by multiplying y on both sides of equations. They are also known as support vectors. See maths explained here.

SVM Decision Rule 1 — Constraint (Samples on the Boundaries)

3. Decision Rule — Maximum Width

Let’s say we have the vector x+ on the gutter of positive boundary, and the vector x- on the gutter of negative boundary. The x+ minus x- stands for the directional force from the negative vector x- to the positive vector x+. If we perform dot product on this directional force with the unit vector of w which is perpendicular to the decision boundary, then this becomes the width between negative and positive boundaries. Note that w is normal vector and ||w|| is the magnitude of w. See maths explained here.

We basically maximise this width to distinctively separate negative and positive data points. This can be simplified as below. The last form squares the magnitude of w and divides it by 2 for mathematical convenience.

SVM Decision Rule 2— Max Width

3. Constrained Optimisation — Find Max Width with a Constraint

The Lagrangian equation can be used to solve a constrained optimisation problem. If the constraint changes by one unit, then the maximum value of the objective function reduces by λ. The equation is generally used to find maxima or minima of the objective function given the constraints.

  • L(x, λ) = f(x)- λ g(x)
  • f(x): objective function
  • g(x): constraint
  • λ: Lagrangian

Earlier we mentioned that SVM takes the widest street approach to find the maximum width between positive and negative boundaries. This problem can be described using the Lagrangian equation with the objective function and constraint defined as below.

Objective Function: f(x) — Max Width
Constraint Function: g(x) — Samples on the Boundaries

In summary, the Lagrangian minimises the objective function (eventually maximise the width between positive and negative boundaries) given the constraint that samples are support vectors on the gutters.

Find Max Width with Lagrangian

After finding the derivatives with respect to w and b from the equation above, it can be simplified as below. Since y i and y j are labels or response variables, the equation can be simply minimised by maximising the dot product of vector x i and x j. In other words, maximisation of width all depends on summing dot products of pairs of supports vectors (on the gutters) while drawing boundary lines.

Furthermore, unknown vector u is determined whether it is placed on the positive side of decision boundary depending on dot products of support vector x and u.

4. Kernel Trick

In the linear problem, SVM can easily draw a decision boundary to group samples into multiple classes. However, if data points cannot be separated with linear slices then data points can be transformed before drawing decision boundaries, which is called Kernel Trick.

In the above, non-linear SVM becomes linear SVM problem after transforming with Kernel Trick. Kernels basically map the problem from input space to a new higher-dimensional space (called the feature space) 𝜙(x) by doing a non-linear transformation using a special function called the Kernel. Then a linear model is used to separate data points in the feature space. The linear model in the feature space corresponds to a non-linear model in the input space.

Map Input Space (Non-Linear) to Feature Space (Linear)

The SVM basic rule can be expressed as below in the feature space. The equation below is when the magnitude of w is replaced with linear sum of a, y and x. See maths explained here. The beauty of using Kernel is the original equation does not change since Kernel transformation is abstracted in phi 𝜙.

Here are examples of Kernel functions. Usually, you can start with the simplest version of transformation and gradually model with more and more advanced kernel functions to avoid overfitting.

Examples of Kernel Functions

Now that we cover how SVM solves the classification problem while drawing boundaries, we will build a model with sample dataset in the next post.

References:

Data Science
Machine Learning
Support Vector Machine
Svm
Supervised Learning
Recommended from ReadMedium