Rohit Pandey

Summary

The web content provides an in-depth exploration of Support Vector Machines (SVM) using optimization and Lagrange multipliers to understand the geometric and mathematical underpinnings of SVMs in classification problems.

Abstract

The article delves into the mechanics of Support Vector Machines (SVM), a supervised machine learning algorithm used for classification and regression tasks. It begins with an overview of SVMs, describing how they separate labeled points in an n-dimensional space with a hyperplane. The author then discusses the optimization problem that SVMs solve to maximize the margin between classes, introducing the concept of Lagrange multipliers to handle the constraints involved. The use of Lagrange multipliers is further elaborated to provide insights into the role of support vectors, which are the data points closest to the separating hyperplane. The article also illustrates the SVM optimization process with a simple two-dimensional classification problem and extends the analysis to a scenario with a third point, demonstrating how the optimal separating hyperplane changes with the addition of new data points. Finally, the author employs Buchberger's algorithm to solve the resulting system of polynomial equations, showcasing the practical application of these mathematical tools in machine learning.

Opinions

  • The author emphasizes the importance of understanding the optimization problem behind SVMs and the role of Lagrange multipliers in solving it.
  • There is a clear appreciation for the geometric interpretation of SVMs, as it helps in visualizing and understanding the concept of support vectors and the margin.
  • The article suggests that the insights provided by the method of Lagrange multipliers are crucial for grasping the behavior of SVMs in classification problems.
  • The author advocates for the use of Buchberger's algorithm and the sympy library in Python to tackle the complex system of polynomial equations that arise from SVM optimization problems.
  • The author's approach to explaining SVMs through detailed mathematical derivations and practical examples indicates a preference for a rigorous and methodical understanding of machine learning algorithms.

SVM: An optimization problem

Drawing lines with Lagrange

1. Overview

This blog will explore the mechanics of support vector machines. First, let's get a 100-miles-per-hour overview of the previous article in this series (I highly encourage you to glance through it before reading this one). Basically, we're given some points in a d-dimensional space, where each point has a binary label, and we want to separate the two classes with a hyperplane. Denote any point in this space by x. Then, any hyperplane can be represented as: w^T x + b = 0. Let's lay out some terminology (a small numeric example follows the list).

  1. x^i: The ith point in the d-dimensional space referenced above. It is a vector of length d with real entries (x^i ∈ R^d).
  2. t^i: The binary label of the ith point. It can be +1 or -1.
  3. w: The vector of coefficients multiplying x in the equation of the separating hyperplane.
  4. b: The constant term in the equation of the separating hyperplane.
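
To make the notation concrete, here is a tiny numeric illustration (my own sketch, not from the original post): in two dimensions, a particular w and b pick out a line, and the sign of w^T x + b tells us which side of it a point falls on. The numbers anticipate the example in section 3.

```python
import numpy as np

# In d = 2 dimensions, w = (1, 1) and b = 0 give the hyperplane x + y = 0.
w = np.array([1.0, 1.0])
b = 0.0

# The two points used in section 3, one on each side of the hyperplane.
points = np.array([[1.0, 1.0], [-1.0, -1.0]])
print(np.sign(points @ w + b))  # [ 1. -1.] matches the labels t^0 = 1, t^1 = -1
```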

In the previous blog, we derived the optimization problem which, if solved, gives us the w and b describing the separating plane that maximizes the “margin”, i.e., the distance of the closest point from the plane (we'll continue our equation numbering from there; γ was a dummy variable). Note that there is one inequality constraint per data point.

Eqn (4): SVM optimization problem:

max_{γ, w, b} γ
s.t. t^i (w^T x^i + b) ≥ γ for all i, and ||w|| = 1

We then did some ninjitsu to get rid of even the γ and reduce to the following optimization problem:

Eq (7): Simplified SVM optimization problem:

min_{w, b} (1/2) ||w||²
s.t. t^i (w^T x^i + b) ≥ 1 for all i

In this blog, let’s look into what insights the method of Lagrange multipliers for solving constrained optimization problems like these can provide about support vector machines.

2. Solution using Lagrange multipliers

In the previous blog of this series, we obtained two constrained optimization problems (equations (4) and (7) above) that can be used to obtain the plane that maximizes the margin. There is a general method for solving optimization problems with constraints (the method of Lagrange multipliers). To keep things focused, we’ll just state the recipe here and use it to excavate insights pertaining to the SVM problem. As for why this recipe works, read this blog where Lagrange multipliers are covered in detail. If we have a general optimization problem,

Eq (8): General constrained optimization problem:

min_w f(w)
s.t. g_i(w) ≥ 0 for i = 1, …, k
h_i(w) = 0 for i = 1, …, l

Step 1: Define the Lagrangian function:

Eq (9): The Lagrangian:

L(w, α, β) = f(w) - Σ_{i=1}^{k} α_i g_i(w) - Σ_{i=1}^{l} β_i h_i(w)

Where α_i and β_i are additional variables called the “Lagrange multipliers”. The multipliers corresponding to the inequalities, α_i, must be ≥ 0, while those corresponding to the equalities, β_i, can be any real numbers. Again, some visual intuition for why this is so is provided here. Then, the conditions that must be satisfied in order for a w to be the optimum (called the KKT conditions) are:

Eq (10): KKT conditions for finding a solution to the constrained optimization problem:

(10-a) ∂L/∂w = 0
(10-b) g_i(w) ≥ 0 for all i
(10-c) h_i(w) = 0 for all i
(10-d) α_i ≥ 0 for all i
(10-e) α_i g_i(w) = 0 for all i

Equation 10-e is called the complementarity condition and ensures that if an inequality constraint is not “tight” (g_i(w) > 0 rather than = 0), then the Lagrange multiplier corresponding to that constraint has to be equal to zero. Turns out, this is the case because the Lagrange multipliers can be interpreted as specifying how hard an inequality constraint is being “pushed against” at the optimum point. If the constraint is not even tight (active), we aren't pushing against it at all at the solution and so the corresponding Lagrange multiplier, α_i, is 0. Equations 10-b and 10-c are pretty trivial since they simply state that the constraints of the original optimization problem should be satisfied at the optimal point (almost a tautology).
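
To see the complementarity condition in action, here is a toy one-dimensional example of my own (not from the article): minimize f(w) = w² subject to a single constraint g(w) = w - c ≥ 0, solved directly from the KKT conditions with sympy.

```python
# Minimize w**2 subject to w - c >= 0. The Lagrangian is w**2 - alpha*(w - c).
from sympy import symbols, solve

w, alpha = symbols('w alpha', real=True)

for c in [1, -1]:
    stationarity = 2*w - alpha        # condition 10-a: dL/dw = 0
    complementarity = alpha*(w - c)   # condition 10-e: alpha * g(w) = 0
    sols = solve([stationarity, complementarity], [w, alpha], dict=True)
    # Keep only solutions that also satisfy 10-b (g(w) >= 0) and 10-d (alpha >= 0).
    feasible = [s for s in sols if s[w] - c >= 0 and s[alpha] >= 0]
    print(c, feasible)

# c = 1:  the constraint is tight at the optimum (w = 1) and alpha = 2 > 0.
# c = -1: the optimum w = 0 leaves the constraint loose, forcing alpha = 0.
```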

2.1. Observation about SVM

Let's get back now to support vector machines. In equations (4) and (7), we specified an inequality constraint for each of the points in terms of their perpendicular distance from the separating line (margins). The point with the minimum distance from the line (margin) will be the one whose constraint is an equality. If there are multiple points that share this minimum distance, they will all have their constraints per equations (4) or (7) become equalities. Such points are called “support vectors” since they “support” the line in between them (as we will see). From the geometry of the problem, it is easy to see that there have to be at least two support vectors (points that share the minimum distance from the line and thus have “tight” constraints), one with a positive label and one with a negative label. Assume that this is not the case and there is only one point with the minimum distance, d. Without loss of generality, we can assume that this point has a positive label. Then the closest point with a negative label lies on the other side of the line at a distance d+δd, with δd>0. It is possible to move the line a distance of δd/2 along the w vector towards the negative point and increase the minimum margin by that same δd/2 (and now, both the closest positive and closest negative points become support vectors). Which means the line we started with was a false prophet; it couldn't really have been the optimal-margin line, since we easily improved on its margin.

But this relied entirely on the geometric interpretation of the problem. If we had instead been given just the optimization problems (4) or (7) (we'll assume we know how to get one from the other), could we have reached the same conclusion? For the problem in equation (4), the Lagrangian as defined in equation (9) becomes (writing the maximization of γ as the minimization of -γ):

L(γ, w, b, α, β) = -γ - Σ_i α_i [t^i (w^T x^i + b) - γ] - β (||w|| - 1)

Taking the derivative with respect to γ and setting it to zero, we get:

Eq (11): Σ_i α_i = 1

And taking the derivative with respect to b:

Σ_i α_i t^i = 0

If we consider {I} to be the set of positive labels and {J} the set of negative labels, we can re-write the above equation as:

Eq (12): Σ_{i ∈ I} α_i = Σ_{j ∈ J} α_j

Equations (11) and (12), along with the fact that all the α's are ≥ 0, imply that there must be at least one non-zero α_i in each of the positive and negative classes. And since α_i represents how “tight” the constraint corresponding to the ith point is (with 0 meaning not tight at all), it means there must be at least two points, one from each of the two classes, with their constraints active and hence possessing the minimum margin (across the points).

2.2 What does “support vector” mean anyway?

In the previous section, we formulated the Lagrangian for the system given in equation (4) and took the derivative with respect to γ. Now, let's form the Lagrangian for the formulation given by equation (7), since this is much simpler:

Eq (13): Lagrangian of equation (7):

L(w, b, α) = (1/2) ||w||² - Σ_i α_i [t^i (w^T x^i + b) - 1]

Taking the derivative with respect to w as per 10-a and setting to zero we obtain:

Eq (14): The optimal w:

w = Σ_i α_i t^i x^i

Like before, every point will have an inequality constraint it corresponds to and so also a Lagrange multiplier, α_i. Also, apart from the points that have the minimum possible distance from the separating line (for which the constraints in equations (4) or (7) are active), all others have their α_i’s equal to zero (since the constraints are not active).

From equation (14), we see that such points (those with α_i = 0) contribute nothing to the Lagrangian and hence to the w of the optimal line. It is similarly easy to see that they don't affect the b of the optimal line either. So, only the points that are closest to the line (and hence have their inequality constraints become equalities) matter in defining it. That is why such points are called “support vectors”. There are generally only a handful of them and yet they support the separating plane between them.
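
As a quick numerical check (my own, using scikit-learn, which the article does not use), we can fit a linear SVM with a very large C to approximate the hard-margin problem and confirm that w is recovered from the support vectors alone, per equation (14).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]])
t = np.array([1, -1, 1])

# A very large C approximates the hard-margin problem discussed here.
clf = SVC(kernel='linear', C=1e6).fit(X, t)

print(clf.support_vectors_)   # only the points closest to the separating line
# dual_coef_ holds alpha_i * t^i for the support vectors, so equation (14) gives:
print(clf.dual_coef_ @ clf.support_vectors_)  # should match clf.coef_
print(clf.coef_)
```

Since (3, 3) is farther from the separating line than (1, 1), it should not appear among the support vectors, which is exactly the point made above.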

3. Simplest classification problem

In this section, we will consider a very simple classification problem that captures the essence of how this optimization behaves. So that we can visualize what's going on, we make the feature space two-dimensional. Let's put two points on it, (1,1) with a positive label (green) and (-1,-1) with a negative label (red), like so:

It's quite clear that the best line for separating these two points is the purple line given by: x+y=0.

Now let’s see how the Math we have studied so far tells us what we already know about this problem. Why do this? So that tomorrow it can tell us something we don’t know.

Since we have t⁰=1 and t¹=-1, we get from equation (12), α_0 = α_1 = α. Plugging this into equation (14) (which is a vector equation), we get w_0 = w_1 = 2α. Hence we immediately get that the line must have equal coefficients for x and y. We also know that since there are only two points, the Lagrange multipliers corresponding to them must be > 0 and the inequality constraints corresponding to them must become “tight” (equalities; see the last line of section 2.1). This means:

w_0 + w_1 + b - 1 = 4α + b - 1 = 0
w_0 + w_1 - b - 1 = 4α - b - 1 = 0

Subtracting the two equations, we get b=0. So, the separating plane, in this case, is the line: x+y=0, as expected.
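
The same two tight constraints can be handed to sympy as a sanity check (my own verification, not part of the original post):

```python
from sympy import symbols, solve

alpha, b = symbols('alpha b')
# Tight constraints for (1, 1) and (-1, -1), with w_0 = w_1 = 2*alpha:
print(solve([4*alpha + b - 1, 4*alpha - b - 1], [alpha, b]))
# {alpha: 1/4, b: 0}  ->  w_0 = w_1 = 1/2, i.e. the line x + y = 0
```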

3.1. Adding a third floating point

To make the problem more interesting and cover a range of possible types of SVM behaviors, let’s add a third floating point. Since (1,1) and (-1,-1) lie on the line y-x=0, let’s have this third point lie on this line as well. Let’s put it at x=y=u. Also, let’s give this point a positive label (just like the green (1,1) point).

As long as u>1, the plane passes through the origin. But when u drops below 1 (while staying above -1), the (u,u) point becomes the support vector and drives the separating purple plane downward.

Now, the intuition about support vectors tells us:

  1. If u>1, the optimal SVM line doesn’t change since the support vectors are still (1,1) and (-1,-1).
  2. If u∈ (-1,1), the SVM line moves along with u, since the support vector now switches from the point (1,1) to (u,u).
  3. If u<-1, the points become un-separable and there is no solution to the SVM optimization problems (4) or (7) (they become infeasible).

Let's see how the Lagrange multipliers can help us reach this same conclusion. From equation (14),

Eq (15): w_0 = w_1 = α_0 + α_1 + u α_2

Also, taking the derivative of equation (13) with respect to b and setting it to zero, we get:

Σ_i α_i t^i = 0

For our problem, this translates to the following (because the first and third points, (1,1) and (u,u), belong to the positive class and the second point, (-1,-1), belongs to the negative class):

Eq (16): α_0 - α_1 + α_2 = 0

Next, equations 10-b simply imply that the inequalities should be satisfied. For our problem, we get three inequalities (one per data point), all specified by (per equation (7)):

t^i (w^T x^i + b) - 1 ≥ 0

For our three data points we get:

w_0 + w_1 + b - 1 ≥ 0
w_0 + w_1 - b - 1 ≥ 0
u w_0 + u w_1 + b - 1 ≥ 0

But from equation (15) we know that w_0 = w_1; call this common value w. Further, the second point is the only one in the negative class, so the inequality corresponding to it must be an equality. Using this and introducing new slack variables, k_0 and k_2, to convert the other two inequalities into equalities (the squares ensure the three inequalities above are still ≥ 0):

Eq (17):
2w + b - 1 = k_0²
2w - b - 1 = 0
2uw + b - 1 = k_2²

And finally, we have the complementarity conditions (per equation 10-e) for the two inequalities that might not be tight:

Eq (18):
α_0 k_0² = 0
α_2 k_2² = 0

From the second line of equation (17) we get: b = 2w - 1. Further, since we require α_0 ≥ 0 and α_2 ≥ 0 (condition 10-d), let's replace them with α_0² and α_2² so the non-negativity is automatic. From equations (15) and (16) we then get:

Eq (19): w = 2α_0² + (1+u) α_2²

Substituting b = 2w - 1 into the first of equation (17),

Eq (20): 4w - 2 = k_0²

and into the third of equation (17),

Eq (21): 2(u+1)w - 2 = k_2²

Now, equations (18) through (21) are hard to solve by hand. Thankfully, there is a general framework for solving systems of polynomial equations called “Buchberger's algorithm”, and the equations described above are basically a system of polynomial equations. I wrote a detailed blog on Buchberger's algorithm for solving systems of polynomial equations here. It can be used to simplify the system of equations in terms of the variables we're interested in (the simplified form is called the “Groebner basis”). And this algorithm is implemented in the Python library sympy. Let's see how it works.
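
Here is a minimal sympy sketch of that computation (an illustrative reconstruction, not necessarily the exact code from the original post), built from equations (18) through (21), with equation (19) supplying w in terms of the multipliers:

```python
from sympy import symbols, groebner

# Five polynomial equations in six variables (b has already been eliminated
# via b = 2w - 1, and the alphas have been squared as described above).
w, u, a0, a2, k0, k2 = symbols('w u alpha_0 alpha_2 k_0 k_2')

eqs = [
    w - 2*a0**2 - (1 + u)*a2**2,  # Eq (19)
    4*w - 2 - k0**2,              # Eq (20)
    2*(u + 1)*w - 2 - k2**2,      # Eq (21)
    a0**2 * k0**2,                # Eq (18), complementarity for the point (1, 1)
    a2**2 * k2**2,                # Eq (18), complementarity for the point (u, u)
]

# Lexicographic order: sympy tries to express the trailing elements of the
# Groebner basis in terms of the trailing variables only (here k_0 and k_2).
gb = groebner(eqs, w, u, a0, a2, k0, k2, order='lex')
for poly in gb:
    print(poly)
```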

The order of the variables in the code above is important since it tells sympy their “importance”. It tries to have the equations at the end of the Groebner basis expressed in terms of the variables from the end. In this case, we had six variables but only five equations, and since k_0 and k_2 were the last two variables, the last equation of the basis is expressed in terms of them alone (if there were six equations, the last equation would be in terms of k_2 alone). That last equation implies k_0 k_2 = 0, so at least one of them must be zero. However, we know that both of them can't be zero (in general), since that would mean the constraints corresponding to (1,1) and (u,u) are both tight, i.e., they are both at the minimal distance from the line, which is only possible if u=1. The fact that one or the other must be 0 makes sense since exactly one of (1,1) or (u,u) must be closest to the separating line, making the corresponding k equal to 0.

Doing a similar exercise, but with the last equation of the basis expressed in terms of u and k_0, we find that it implies either k_0 = 0 or

k_0² = 2(1-u)/(1+u)

Similarly, extracting the equation in terms of k_2 and u, we find that it implies either k_2 = 0 or

k_2² = u - 1

This means that if u>1, then we must have k_0=0, since the other possibility would make it imaginary. Consequently, k_2 can't be 0 and becomes (u-1)^0.5. In other words, the constraint corresponding to (1,1) becomes an equality and the one corresponding to (u,u) stays loose (a strict inequality). This makes sense since if u>1, (1,1) is the point closer to the hyperplane.

And similarly, if u<1, k_2 is forced to become 0 and consequently, k_0 is forced to take on a positive value. We see the two points, (u,u) and (1,1), switching the role of being the support vector as u transitions from less than 1 to greater than 1. If u<-1, on the other hand, it is impossible to find real values of k_0 and k_2 satisfying the equations, and hence the system has no real solution (the points are no longer separable).
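
A quick numerical check of the three regimes, using the closed forms just derived (my own check, not from the original post):

```python
# k_0**2 = 2*(1 - u)/(1 + u) on the branch where (u, u) is the support vector,
# and k_2**2 = u - 1 on the branch where (1, 1) is the support vector.
for u in [2.0, 0.0, -2.0]:
    k0_sq = 2*(1 - u)/(1 + u)
    k2_sq = u - 1
    print(u, k0_sq, k2_sq)

# u =  2.0: k0_sq < 0, k2_sq > 0  ->  k_0 = 0 branch, (1, 1) is the support vector
# u =  0.0: k0_sq > 0, k2_sq < 0  ->  k_2 = 0 branch, (u, u) is the support vector
# u = -2.0: both negative         ->  no real solution, the points are not separable
```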
