SVM: An optimization problem
Drawing lines with Lagrange
1. Overview
This blog will explore the mechanics of support vector machines. First, a quick, 100-miles-per-hour recap of the previous article (I highly encourage you to glance through it before reading this one). Basically, we’re given some points in a d-dimensional space, where each point has a binary label, and we want to separate them with a hyperplane. Denote any point in this space by x. Then, any hyperplane can be represented as: w^T x + b = 0. Let’s lay out some terminology.
- x^i: The i-th point in the d-dimensional space referenced above. It is a vector of length d whose elements are all real numbers (x^i ∈ R^d).
- t^i: The binary label of this i-th point. It can be +1 or -1.
- w: The vector of coefficients of x for the hyperplane separating the space into two regions.
- b: The constant term of that hyperplane.
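With this terminology, classifying a new point x just amounts to checking which side of the hyperplane it falls on, i.e. the sign of w^T x + b. A minimal sketch, with an arbitrary made-up w and b purely for illustration:

```python
import numpy as np

# An arbitrary hyperplane in 2-D, purely for illustration.
w = np.array([1.0, 1.0])   # coefficients of the hyperplane
b = 0.0                    # constant term

def classify(x):
    """Predict the label (+1 or -1) from the side of w^T x + b = 0 the point x falls on."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([2.0, 3.0])))    # +1: the point is on the positive side of x + y = 0
print(classify(np.array([-1.0, -2.0])))  # -1: the point is on the negative side
```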
In the previous blog, we derived the optimization problem which, if solved, gives us the w and b describing the separating plane that maximizes the “margin”, i.e. the distance of the closest point from the plane (we’ll continue our equation numbering from there; γ was a dummy variable standing for that margin). Note that there is one inequality constraint per data point.
max_{γ, w, b} γ
s.t. t^i (w^T x^i + b) / ||w|| ≥ γ for every point i        (4)
We then did some ninjutsu to get rid of even the γ and reduce it to the following optimization problem:
min_{w, b} (1/2) ||w||²
s.t. t^i (w^T x^i + b) ≥ 1 for every point i        (7)
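To make this concrete, here is a minimal sketch that solves the problem in equation (7) numerically for a tiny, made-up 2-D dataset, using scipy’s general-purpose SLSQP optimizer (a real SVM solver exploits the structure of this problem much better, but this is enough to see the optimization in action):

```python
import numpy as np
from scipy.optimize import minimize

# A tiny, made-up, linearly separable dataset with labels +1 / -1.
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, -0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):
    w = v[:2]                  # v packs (w_0, w_1, b)
    return 0.5 * np.dot(w, w)  # (1/2) ||w||^2

# One inequality constraint per data point: t^i (w^T x^i + b) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda v, i=i: t[i] * (np.dot(v[:2], X[i]) + v[2]) - 1.0}
    for i in range(len(X))
]

# Start from a feasible guess (w = (1, 1), b = 0) and let SLSQP do the rest.
result = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), method="SLSQP", constraints=constraints)
w, b = result.x[:2], result.x[2]
print("w =", w, ", b =", b)  # the maximum-margin line w^T x + b = 0 for this toy dataset
```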
In this blog, let’s look into what insights the method of Lagrange multipliers for solving constrained optimization problems like these can provide about support vector machines.
2. Solution using Lagrange multipliers
In the previous blog of this series, we obtained two constrained optimization problems (equations (4) and (7) above) that can be used to obtain the plane that maximizes the margin. There is a general method for solving optimization problems with constraints (the method of Lagrange multipliers). To keep things focused, we’ll just state the recipe here and use it to excavate insights pertaining to the SVM problem. As for why this recipe works, read this blog where Lagrange multipliers are covered in detail. If we have a general optimization problem,
min_w f(w)
s.t. g_i(w) ≥ 0 for i = 1, …, m
     h_i(w) = 0 for i = 1, …, p        (8)
Step 1: Define the Lagrangian function:
L(w, α, β) = f(w) - Σ_i α_i g_i(w) + Σ_i β_i h_i(w)        (9)
Where α_i and β_i are additional variables called the “Lagrange multipliers”. The multipliers corresponding to the inequalities, α_i, must be ≥ 0, while those corresponding to the equalities, β_i, can be any real numbers. Again, some visual intuition for why this is so is provided here. Then, the conditions that must be satisfied in order for a w to be the optimum (called the KKT conditions) are:
∂L/∂w = 0        (10-a)
g_i(w) ≥ 0 for all i        (10-b)
h_i(w) = 0 for all i        (10-c)
α_i ≥ 0 for all i        (10-d)
α_i g_i(w) = 0 for all i        (10-e)
Equation 10-e is called the complementarity condition and ensures that if an inequality constraint is not “tight” (g_i(w) > 0 and not = 0), then the Lagrange multiplier corresponding to that constraint has to be equal to zero. Turns out, this is the case because the Lagrange multipliers can be interpreted as specifying how hard an inequality constraint is being “pushed against” at the optimum point. If the constraint is not even tight (active), we aren’t pushing against it at all at the solution and so, the corresponding Lagrange multiplier, α_i = 0. Equations 10-b and 10-c are pretty trivial since they simply state that the constraints of the original optimization problem should be satisfied at the optimal point (almost a tautology).
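To see the complementarity condition in action, here is a one-dimensional toy problem (not an SVM, purely an illustration) worked through the KKT conditions:

```latex
\begin{align*}
&\text{Toy problem: } \min_w \; w^2 \quad \text{s.t. } g(w) = w - 1 \ge 0,
\qquad L(w, \alpha) = w^2 - \alpha\,(w - 1) \\
&\text{(10-a): } 2w - \alpha = 0, \qquad \text{(10-e): } \alpha\,(w - 1) = 0 \\
&\text{The unconstrained minimum } w = 0 \text{ violates } g(w) \ge 0,
\text{ so the constraint is tight: } w = 1,\ \alpha = 2 > 0. \\
&\text{With } g(w) = w + 1 \ge 0 \text{ instead, } w = 0 \text{ is feasible, } g(0) = 1 > 0
\text{ is not tight, and (10-e) forces } \alpha = 0.
\end{align*}
```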
2.1. Observation about SVM
Let’s get back now to support vector machines. In equations (4) and (7), we specified an inequality constraint for each of the points in terms of their perpendicular distance from the separating line (their margins). The point with the minimum distance from the line (the minimum margin) will be the one whose constraint is an equality. If there are multiple points that share this minimum distance, they will all have their constraints per equations (4) or (7) become equalities. Such points are called “support vectors” since they “support” the line in between them (as we will see). From the geometry of the problem, it is easy to see that there have to be at least two support vectors (points that share the minimum distance from the line and thus have “tight” constraints), one with a positive label and one with a negative label. Assume that this is not the case and there is only one point with the minimum distance, d. Without loss of generality, we can assume that this point has a positive label. Then, the closest point with a negative label lies on the other side of the line at some distance d + δd. It is then possible to move the line a distance of δd/2 along the w vector towards the negative point, which increases the minimum margin from d to d + δd/2 (and now the closest positive and closest negative points both become support vectors). Which means the line we started with was a false prophet; it couldn’t really have been the optimal-margin line, since we easily improved on its margin.
But this relied entirely on the geometric interpretation of the problem. If we had instead been given just the optimization problems (4) or (7) (we’ll assume we know how to get one from the other), could we have reached the same conclusion? For the problem in equation (4) (written as the minimization of -γ so that it fits the recipe above), the Lagrangian as defined in equation (9) becomes:
L(w, b, γ, α) = -γ - Σ_i α_i [t^i (w^T x^i + b) / ||w|| - γ]
Taking the derivative with respect to γ and setting it to zero, we get:
Σ_i α_i = 1        (11)
And taking the derivative with respect to b and setting it to zero:
Σ_i α_i t^i = 0
If we consider {I} to be the set of points with positive labels and {J} the set of points with negative labels, we can re-write the above equation as:
Σ_{i ∈ I} α_i = Σ_{j ∈ J} α_j        (12)
Equations (11) and (12), along with the fact that all the α’s are ≥ 0, imply that there must be at least one non-zero α_i in each of the positive and negative classes (by (11) the α’s sum to 1, so by (12) each class’s α’s must sum to 1/2). And since a non-zero α_i means the constraint corresponding to the i-th point is “tight” (active), there must be at least one point from each of the two classes (so at least two points overall) whose constraints are active and which therefore possess the minimum margin (across the points).
2.2. What does “support vector” mean anyway?
In the previous section, we formulated the Lagrangian for the problem given in equation (4) and took derivatives with respect to γ and b. Now, let’s form the Lagrangian for the formulation given by equation (7), since it is much simpler:
L(w, b, α) = (1/2) ||w||² - Σ_i α_i [t^i (w^T x^i + b) - 1]        (13)
Taking the derivative with respect to w as per 10-a and setting to zero we obtain:
w = Σ_i α_i t^i x^i        (14)
As before, every point corresponds to one inequality constraint and hence has a Lagrange multiplier, α_i. Also, apart from the points that have the minimum possible distance from the separating line (for which the constraints in equations (4) or (7) are active), all other points have their α_i’s equal to zero, since their constraints are not active (this is just the complementarity condition, 10-e).
From equation (14), we see that such points (those with α_i = 0) contribute nothing to the sum and hence to the w of the optimal line. It is similarly easy to see that they don’t affect the b of the optimal line either. So, only the points that are closest to the line (and hence have their inequality constraints become equalities) matter in defining it. That is why such points are called “support vectors”. There are generally only a handful of them and yet, they support the separating plane between them.
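This is easy to check numerically. Here is a minimal sketch (assuming scikit-learn is available; the made-up dataset and the large C, used to approximate the hard-margin problem, are arbitrary choices for illustration). In scikit-learn’s SVC, dual_coef_ holds exactly the products α_i t^i for the support vectors, so equation (14) can be verified directly:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated, made-up clusters with labels +1 / -1.
X = np.vstack([rng.normal([2, 2], 0.5, size=(20, 2)),
               rng.normal([-2, -2], 0.5, size=(20, 2))])
t = np.array([1] * 20 + [-1] * 20)

# A very large C approximates the hard-margin SVM discussed in this post.
clf = SVC(kernel="linear", C=1e6).fit(X, t)

print("number of support vectors:", len(clf.support_vectors_))  # just a handful out of 40 points
# Equation (14): w = sum_i alpha_i t^i x^i, where only the support vectors have alpha_i > 0.
w_from_support_vectors = clf.dual_coef_ @ clf.support_vectors_
print("w rebuilt from support vectors:", w_from_support_vectors)
print("w reported by the solver:      ", clf.coef_)  # matches the line above
```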
3. Simplest classification problem
In this section, we will consider a very simple classification problem that captures the essence of how this optimization behaves. So that we can visualize what’s going on, we make the feature space two-dimensional. Let’s put two points on it, (1,1) with a positive label (green) and (-1,-1) with a negative label (red), like so:

It’s quite clear that the best boundary for separating these two points is the purple line given by: x + y = 0.
Now let’s see how the Math we have studied so far tells us what we already know about this problem. Why do this? So that tomorrow it can tell us something we don’t know.
Since we have t^0 = 1 and t^1 = -1, we get from equation (12), α_0 = α_1 = α. Plugging this into equation (14) (which is a vector equation), we get w_0 = w_1 = 2α. Hence we immediately get that the line must have equal coefficients for x and y. We also know that since there are only two points, the Lagrange multipliers corresponding to them must be > 0 and the inequality constraints corresponding to them must become “tight” (equalities; see the last line of section 2.1). This means:
4α + b = 1  (tight constraint for the point (1,1))
4α - b = 1  (tight constraint for the point (-1,-1))
Subtracting the two equations, we get b=0. So, the separating plane, in this case, is the line: x+y=0, as expected.
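We can let sympy (which we’ll use again at the end of this post) confirm this little hand calculation; this is just a sanity check of the two tight-constraint equations above:

```python
from sympy import symbols, solve

alpha, b = symbols("alpha b")

# Tight constraints for the two points, with w_0 = w_1 = 2*alpha substituted:
#   point (1, 1),   label +1:  4*alpha + b = 1
#   point (-1, -1), label -1:  4*alpha - b = 1
solution = solve([4 * alpha + b - 1, 4 * alpha - b - 1], [alpha, b], dict=True)
print(solution)  # alpha = 1/4 and b = 0, i.e. w_0 = w_1 = 1/2 and the separating line is x + y = 0
```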
3.1. Adding a third floating point
To make the problem more interesting and cover a range of possible types of SVM behaviors, let’s add a third floating point. Since (1,1) and (-1,-1) lie on the line y-x=0, let’s have this third point lie on this line as well. Let’s put it at x=y=u. Also, let’s give this point a positive label (just like the green (1,1) point).

Now, the intuition about support vectors tells us the following (a quick numerical check of the first two cases appears right after the list):
- If u>1, the optimal SVM line doesn’t change since the support vectors are still (1,1) and (-1,-1).
- If u∈ (-1,1), the SVM line moves along with u, since the support vector now switches from the point (1,1) to (u,u).
- If u < -1, the points become inseparable and there is no solution to the SVM optimization problems (4) or (7) (they become infeasible).
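Before deriving this with the Lagrange multiplier machinery, here is the promised numerical check; a minimal sketch assuming scikit-learn, with a large C approximating the hard-margin problem (the third, inseparable case is skipped since the hard-margin problem then has no solution):

```python
import numpy as np
from sklearn.svm import SVC

def fit_three_points(u):
    """Fit a (nearly) hard-margin linear SVM on (1,1)+, (-1,-1)-, (u,u)+ and report the line."""
    X = np.array([[1.0, 1.0], [-1.0, -1.0], [u, u]])
    t = np.array([1, -1, 1])
    clf = SVC(kernel="linear", C=1e6).fit(X, t)
    print(f"u = {u}: w = {clf.coef_[0].round(3)}, b = {clf.intercept_[0]:.3f}, "
          f"support vectors = {clf.support_vectors_.tolist()}")

fit_three_points(3.0)  # u > 1: expect w ~ (0.5, 0.5), b ~ 0 (the line x + y = 0); support vectors (1,1) and (-1,-1)
fit_three_points(0.0)  # u in (-1, 1): the line moves; support vectors are now (u,u) and (-1,-1)
```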
Let’s see how the Lagrange multipliers can help us reach this same conclusion. From equation (14),
w = α_0 (1, 1) + α_1 (1, 1) + α_2 (u, u), so w_0 = w_1 = α_0 + α_1 + α_2 u        (15)
Also, taking the derivative of the Lagrangian in equation (13) with respect to b and setting it to zero, we get:
Σ_i α_i t^i = 0
And for our problem, this translates to α_0 - α_1 + α_2 = 0, because the first and third points, (1,1) and (u,u), belong to the positive class while the second point, (-1,-1), belongs to the negative class:
α_0 - α_1 + α_2 = 0        (16)
Next, condition 10-b simply requires that the inequality constraints be satisfied. For our problem, we get three inequalities (one per data point), all of the form (per equation (7)):
t^i (w^T x^i + b) - 1 ≥ 0
For our three data points, these become:
w_0 + w_1 + b - 1 ≥ 0
w_0 + w_1 - b - 1 ≥ 0
u w_0 + u w_1 + b - 1 ≥ 0
But from equation (15) we know that w_0 = w_1; call this common value w. Further, the second point is the only one in the negative class, so the inequality corresponding to it must be an equality (per the conclusion of section 2.1). Using this, and introducing new slack variables k_0 and k_2 to convert the other two inequalities into equalities (using their squares, k_0² and k_2², ensures the slack terms are non-negative, so the original inequalities still hold), we get:
2w + b - 1 = k_0²
2w - b - 1 = 0
2uw + b - 1 = k_2²        (17)
And finally, we have the complementarity conditions:
α_0 k_0² = 0,   α_2 k_2² = 0        (18)
From the second of equations (17) we get b = 2w - 1. Further, since we require α_0 ≥ 0 and α_2 ≥ 0, let’s replace them with α_0² and α_2². Then, substituting α_1 = α_0² + α_2² (from equation (16)) into equation (15), we get:
w = 2α_0² + α_2²(1 + u)        (19)
Substituting b = 2w - 1 into the first of equations (17):
4w - 2 = k_0²        (20)
and into the third of equations (17):
2w(1 + u) - 2 = k_2²        (21)
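As a quick sanity check, we can already hand this system to sympy’s general-purpose solver for one concrete value of u (u = 3 is an arbitrary choice with u > 1, and the squared multipliers α_0², α_2² are substituted into the complementarity conditions):

```python
from sympy import symbols, solve

w, a0, a2, k0, k2 = symbols("w a0 a2 k0 k2")
u = 3  # a concrete value with u > 1, chosen just for this sanity check

eqs = [
    a0**2 * k0**2,                    # complementarity for the first point, eq (18), with alpha_0 -> a0**2
    a2**2 * k2**2,                    # complementarity for the third point, eq (18), with alpha_2 -> a2**2
    w - 2 * a0**2 - a2**2 * (1 + u),  # eq (19)
    4 * w - 2 - k0**2,                # eq (20)
    2 * w * (1 + u) - 2 - k2**2,      # eq (21)
]

solutions = solve(eqs, [w, a0, a2, k0, k2], dict=True)
# Keep only the real solutions (the physically meaningful ones) and look at w:
real_w = {s[w] for s in solutions if all(val.is_real for val in s.values())}
print(real_w)  # {1/2}: for u = 3 the optimal line is still x + y = 0, as the intuition predicted
```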
Now, equations (18) through (21) are hard to solve by hand, especially with u left as a symbol. Thankfully, they are basically a system of polynomial equations, and there is a general framework for solving such systems: Buchberger’s algorithm (I wrote a detailed blog on it here). It can be used to simplify the system in terms of the variables we’re interested in (the simplified form is called a “Groebner basis”). And this algorithm is implemented in the Python library, sympy. Let’s see how it works.