Courage to Learn ML: Demystifying L1 & L2 Regularization (part 2)
Unlocking the Intuition Behind L1 Sparsity with Lagrange multipliers
Welcome back to ‘Courage to Learn ML: Demystifying L1 & L2 Regularization,’ Part Two. In our previous discussion, we explored the benefits of smaller coefficients and the means to attain them through weight penalization techniques. Now, in this follow-up, our mentor and learner will delve even deeper into the realm of L1 and L2 regularization.
If you’ve been pondering questions like these, you’re in the right place:
- What’s the reason behind the names L1 and L2 regularization?
- How do we interpret the classic L1 and L2 regularization graph?
- What are Lagrange multipliers, and how can we understand them intuitively?
- Applying Lagrange multipliers to comprehend L1 sparsity.
Your engagement — likes, comments, and follows — does more than just boost morale; it powers our journey of discovery! So, let’s dive in.
Why are they called L1 and L2 regularization?
The names L1 and L2 regularization come directly from the concept of Lp norms. Lp norms represent different ways to calculate the distance from a point to the origin in a space. For instance, the L1 norm, also known as Manhattan distance, calculates the distance using the absolute values of the coordinates, like ∣x∣+∣y∣. On the other hand, the L2 norm, or Euclidean distance, calculates it as the square root of the sum of the squared values, which is sqrt(x² + y²).
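The two formulas are easy to check in a couple of lines. Here's a small sketch with a hypothetical 2-D weight vector (the numbers are purely illustrative):

```python
import numpy as np

# A hypothetical 2-D weight vector, chosen only for illustration.
w = np.array([3.0, -4.0])

# L1 norm (Manhattan distance): sum of absolute values, |x| + |y|.
l1 = np.sum(np.abs(w))        # |3| + |-4| = 7

# L2 norm (Euclidean distance): sqrt of the sum of squares, sqrt(x² + y²).
l2 = np.sqrt(np.sum(w ** 2))  # sqrt(9 + 16) = 5

print(l1, l2)
```

Note how the same vector has a larger L1 norm than L2 norm; for any vector, the L1 norm is always at least as large as the L2 norm.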
In the context of regularization in machine learning, these norms are used to create penalty terms that are added to the loss function. You can think of Lp regularization as measuring the total distance of the model’s weights from the origin in a high-dimensional space. The choice of norm affects the nature of this penalty: the L1 norm tends to make some coefficients zero, effectively selecting more important features, while the L2 norm shrinks the coefficients towards zero, ensuring no single feature disproportionately influences the model.
Therefore, L1 and L2 regularization are named after these mathematical norms — L1 norm and L2 norm — due to the way they apply their respective distance calculations as penalties to the model’s weights. This helps in controlling overfitting.
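To make "norms used as penalty terms" concrete, here is a minimal sketch of a penalized loss function. The function name, toy data, and λ values are my own assumptions for illustration, not a library API:

```python
import numpy as np

# Toy data, purely illustrative.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5])

def penalized_loss(w, X, y, lam, norm="l2"):
    """Plain MSE loss plus an L1 or L2 weight penalty (a sketch, not a library API)."""
    mse = np.mean((X @ w - y) ** 2)
    if norm == "l1":
        penalty = lam * np.sum(np.abs(w))  # L1: lam * sum(|w_i|)
    else:
        penalty = lam * np.sum(w ** 2)     # L2: lam * sum(w_i^2)
    return mse + penalty

base = penalized_loss(w, X, y, lam=0.0)               # unpenalized MSE
l1_loss = penalized_loss(w, X, y, lam=0.1, norm="l1")
l2_loss = penalized_loss(w, X, y, lam=0.1, norm="l2")
```

With λ = 0 the penalty vanishes and we recover the ordinary loss; a larger λ charges the model more for the "distance" of its weights from the origin.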
I often come across the graph below when studying L1 and L2 regularization, but I find it quite challenging to interpret. Could you help clarify what it represents?
Alright, let’s unpack this graph step-by-step. To start, it’s essential to understand what its different elements signify. Imagine our loss function is defined by just two weights, w1 and w2 (in the graph we use beta instead of w, but they represent the same concept). The axes of the graph represent these weights we aim to optimize.
Without any weight penalties, our goal is to find w1 and w2 values that minimize our loss function. You can visualize this function’s landscape as a valley or basin, illustrated in the graph by the elliptical contours.
Now, let’s delve into the penalties. The L1 norm appears as a diamond shape: the set of points where the Manhattan distance of w1 and w2 from the origin is constant. The L2 norm forms a circle: the set of points where the sum of squared weights is constant.
The center of the elliptical contour indicates the global minimum of the objective function, where we find our ideal weights. The centers of the L1 and L2 shapes (diamond and circle) sit at the origin, where all weights are zero, highlighting the minimal weight penalty scenario. As we increase the penalty term’s intensity, the model’s weights gravitate closer to zero. This graph is a visual guide to understanding these dynamics and the impact of penalties on the weights.
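The claim that stronger penalties pull the weights toward the origin can be verified directly for L2, which has a closed-form solution. This sketch uses synthetic data of my own invention and checks that the weight vector's norm shrinks as λ grows:

```python
import numpy as np

# Synthetic regression data, invented for illustration: true weights are (2, -3).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

def ridge_weights(lam):
    # Closed-form L2-penalized least squares: (X^T X + lam * I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# As lam increases, the fitted weight vector is pulled toward the origin.
norms = [float(np.linalg.norm(ridge_weights(lam))) for lam in (0.0, 10.0, 100.0, 1000.0)]
print(norms)
```

Each successive norm is smaller than the last: the stronger the penalty, the closer the solution sits to the center of the circle in the graph.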
Understood. So… the graph shows a space defined by the weights, and the two distinct shapes illustrate the objective function and the penalty, respectively. How should we interpret the intersection point labeled w*? What does this point signify?
To understand the above graph as a whole, it’s essential to grasp the concept of Lagrange multipliers, a key tool in optimization. Lagrange multipliers aid in finding the optimal points (maximum or minimum) of a function within certain constraints.
Imagine you’re hiking up a mountain with the goal of reaching the peak. There are various paths, but due to safety, you’re required to stay on a designated safe path. Here, reaching the peak represents the optimization problem, and the safe path symbolizes the constraints.
Mathematically, suppose you have a function f(x, y) to optimize. This optimization must adhere to a constraint, represented by another function g(x, y) = 0.
In the ‘Lagrange Multipliers 2D’ graph from Wikipedia, the blue contours represent f(x, y) (the mountain’s landscape), and the red curve indicates the constraint. The point where the two touch, although not the peak of f(x, y) overall, represents the optimal solution under the given constraint. Lagrange multipliers solve this by merging the objective function with its constraints. In other words, Lagrange multipliers help you find this point more easily.
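A tiny worked example may help. This toy problem is my own (not from the graph): maximize f(x, y) = x + y on the unit circle g(x, y) = x² + y² − 1 = 0. Setting ∇f = λ∇g gives 1 = 2λx and 1 = 2λy, so x = y, and the constraint then forces x = y = 1/√2. We can confirm this numerically by scanning points along the constraint:

```python
import numpy as np

# Toy problem: maximize f(x, y) = x + y subject to x^2 + y^2 = 1.
# Lagrange conditions (grad f = lam * grad g) predict x = y = 1/sqrt(2).

# Numerical check: parametrize the constraint circle and scan it.
theta = np.linspace(0.0, 2.0 * np.pi, 200001)
x, y = np.cos(theta), np.sin(theta)
i = np.argmax(x + y)  # best point found on the constraint

print(x[i], y[i])  # both close to 1/sqrt(2) ≈ 0.7071
```

The brute-force scan lands on the same point the Lagrange conditions predict, which is exactly the "tangent point" intuition: at the optimum, the contour of f just touches the constraint curve.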
So, if we circle back to that L1 and L2 graph, are you suggesting that the diamond and circle shapes represent constraints? And does that mean the spot where they touch the loss contour, that tangent point, is essentially the sweet spot for optimizing the loss within those constraints?
Correct! The L1 and L2 regularization techniques can indeed be visualized as imposing constraints in the form of a diamond and a circle, respectively. So the graph helps us understand how these regularization methods impact the optimization of a function, typically the loss function in machine learning models.
- L1 Regularization (Diamond Shape): The L1 norm creates a diamond-shaped constraint. This shape is characterized by its sharp corners along the axes. When the optimization process (like gradient descent) seeks the point that minimizes the loss function while staying within this diamond, it’s more likely to hit these corners. At these corners, one of the weights (parameters of the model) becomes zero while others remain non-zero. This property of the L1 norm leads to sparsity in the model parameters, meaning some weights are exactly zero. This sparsity is useful for feature selection, as it effectively removes some features from the model.
- L2 Regularization (Circle Shape): On the other hand, the L2 norm creates a circular-shaped constraint. The smooth, round nature of the circle means that the optimization process is less likely to find solutions at the axes where weights are zero. Instead, the L2 norm tends to shrink the weights uniformly without necessarily driving any to zero. This controls the model complexity by preventing weights from becoming too large, thereby helping to avoid overfitting. However, unlike the L1 norm, it doesn’t lead to sparsity in the model parameters.
Keep in mind, Lagrange multipliers aren’t the only method to grasp L1 and L2 regularizations. Let’s take a break here, and I’ll address more of your questions in our next installment. See you soon!
Other posts in this series:
- Courage to Learn ML: Demystifying L1 & L2 Regularization (part 1)
- Courage to Learn ML: Demystifying L1 & L2 Regularization (part 3)
If you liked the article, you can find me on LinkedIn.