Amy Ma

Summary

The article "Courage to Learn ML: Demystifying L1 & L2 Regularization (part 2)" delves into the intuition behind L1 and L2 regularization, explaining their names, graphical representations, and the role of Lagrange multipliers in understanding L1 sparsity.

Abstract

This piece is the second part of a series that aims to clarify the concepts of L1 and L2 regularization in machine learning. It begins by welcoming readers back to the discussion on regularization techniques that penalize weights to prevent overfitting. The article addresses common questions about the origins of the terms L1 and L2, which are derived from Lp norms that measure distances in space. It explains that L1 regularization, also known as Lasso, induces sparsity by driving some coefficients to zero, effectively performing feature selection. In contrast, L2 regularization, or Ridge, shrinks coefficients towards zero without setting any to zero, thus controlling model complexity. The article also interprets the classic L1 and L2 regularization graph, illustrating how these methods impose constraints on the optimization of a function. It uses the concept of Lagrange multipliers to explain how these regularization techniques find the optimal point within the constraints, leading to a better understanding of model optimization in machine learning.

Opinions

  • The author emphasizes the importance of reader engagement, stating that likes, comments, and follows fuel their journey of discovery.
  • The article suggests that understanding the graphical representation of L1 and L2 regularization is challenging but essential for grasping their impact on model weights.
  • The use of Lagrange multipliers is presented as a key tool for optimizing functions with constraints, which is crucial for applying L1 and L2 regularization effectively.
  • The author believes that the intersection point (w*) in the graph represents the optimal solution under the given constraints imposed by regularization.
  • It is implied that the choice between L1 and L2 regularization should be based on the specific needs of the model, such as feature selection (L1) or avoiding overfitting (L2).
  • The article concludes by acknowledging that Lagrange multipliers are not the sole method for understanding regularization and promises to address further questions in the next installment of the series.

Courage to Learn ML: Demystifying L1 & L2 Regularization (part 2)

Unlocking the Intuition Behind L1 Sparsity with Lagrange multipliers

Welcome back to ‘Courage to Learn ML: Demystifying L1 & L2 Regularization,’ Part Two. In our previous discussion, we explored the benefits of smaller coefficients and the means to attain them through weight penalization techniques. Now, in this follow-up, our mentor and learner will delve even deeper into the realm of L1 and L2 regularization.

If you’ve been pondering questions like these, you’re in the right place:

  • What’s the reason behind the names L1 and L2 regularization?
  • How do we interpret the classic L1 and L2 regularization graph?
  • What are Lagrange multipliers, and how can we understand them intuitively?
  • Applying Lagrange multipliers to comprehend L1 sparsity.

Your engagement — likes, comments, and follows — does more than just boost morale; it powers our journey of discovery! So, let’s dive in.

Photo by Aarón Blanco Tejedor on Unsplash

Why are they called L1 and L2 regularization?

The names L1 and L2 regularization come directly from the concept of Lp norms. Lp norms represent different ways to calculate the distance from a point to the origin in a space. For instance, the L1 norm, also known as Manhattan distance, calculates the distance using the absolute values of the coordinates, like |x| + |y|. On the other hand, the L2 norm, or Euclidean distance, calculates it as the square root of the sum of the squared values, sqrt(x² + y²).

In the context of regularization in machine learning, these norms are used to create penalty terms that are added to the loss function. You can think of Lp regularization as measuring the total distance of the model’s weights from the origin in a high-dimensional space. The choice of norm affects the nature of this penalty: the L1 norm tends to make some coefficients zero, effectively selecting more important features, while the L2 norm shrinks the coefficients towards zero, ensuring no single feature disproportionately influences the model.

Therefore, L1 and L2 regularization are named after these mathematical norms — L1 norm and L2 norm — due to the way they apply their respective distance calculations as penalties to the model’s weights. This helps in controlling overfitting.
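To make the connection concrete, here is a minimal NumPy sketch (the weight values, the placeholder loss, and the penalty strength `lam` are made up purely for illustration) that computes the two norms of a weight vector and adds them to a loss as penalty terms. Note that in practice the L2 penalty is usually the squared norm.

```python
import numpy as np

# Hypothetical weight vector of a model with five parameters (illustrative values).
w = np.array([0.5, -1.2, 0.0, 3.0, -0.7])

# L1 norm (Manhattan distance from the origin): sum of absolute values.
l1_norm = np.sum(np.abs(w))        # same as np.linalg.norm(w, ord=1)

# L2 norm (Euclidean distance from the origin): square root of the sum of squares.
l2_norm = np.sqrt(np.sum(w ** 2))  # same as np.linalg.norm(w, ord=2)

# As penalties, these norms are scaled by a strength lambda and added to the loss.
lam = 0.1
data_loss = 2.3                               # placeholder for the unregularized loss
loss_l1 = data_loss + lam * l1_norm           # Lasso-style penalty
loss_l2 = data_loss + lam * np.sum(w ** 2)    # Ridge-style penalty (squared L2)

print(l1_norm, l2_norm, loss_l1, loss_l2)
```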

I often come across the graph below when studying L1 and L2 regularization, but I find it quite challenging to interpret. Could you help clarify what it represents?

The traditional yet perplexing L1 and L2 regularization graph found in textbooks. Source: https://commons.wikimedia.org/wiki/File:Regularization.jpg

Alright, let’s unpack this graph step-by-step. To start, it’s essential to understand what its different elements signify. Imagine our loss function is defined by just two weights, w1 and w2 (in the graph we use beta instead of w, but they represent the same concept). The axes of the graph represent these weights we aim to optimize.

Without any weight penalties, our goal is to find w1 and w2 values that minimize our loss function. You can visualize this function’s landscape as a valley or basin, illustrated in the graph by the elliptical contours.

Now, let’s delve into the penalties. The L1 norm, shown as a diamond shape, essentially measures the Manhattan distance of w1 and w2 from the origin. The L2 norm forms a circle, representing the sum of squared weights.

The center of the elliptical contours indicates the global minimum of the objective function, where we find our ideal weights. The centers of the L1 and L2 shapes (diamond and circle) sit at the origin, where all weights are zero, representing the minimal weight penalty scenario. As we increase the penalty term’s intensity, the model’s weights gravitate closer to zero. This graph is a visual guide to understanding these dynamics and the impact of penalties on the weights.
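If you’d like to reproduce the picture yourself, here is a small matplotlib sketch. The quadratic loss, the location of its minimum, and the size of the constraint regions are arbitrary choices for illustration; it simply draws elliptical loss contours alongside a unit L1 diamond and a unit L2 circle.

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over the two weights (called beta1, beta2 in the textbook figure).
w1, w2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))

# A made-up quadratic loss whose minimum sits away from the origin,
# standing in for the elliptical contours of the unregularized objective.
loss = (w1 - 1.2) ** 2 + 2.5 * (w2 - 0.9) ** 2

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (norm_vals, title) in zip(
    axes,
    [(np.abs(w1) + np.abs(w2), "L1 constraint (diamond)"),
     (w1 ** 2 + w2 ** 2, "L2 constraint (circle)")],
):
    ax.contour(w1, w2, loss, levels=8, cmap="Blues")           # loss landscape
    ax.contour(w1, w2, norm_vals, levels=[1.0], colors="red")  # constraint boundary
    ax.set_title(title)
    ax.set_xlabel("w1")
axes[0].set_ylabel("w2")
plt.tight_layout()
plt.show()
```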

Understood. So… the graph shows a space whose axes are the weights, and the two distinct shapes illustrate the objective function and the penalty, respectively. How should we interpret the intersection point labeled w*? What does this point signify?

To understand the above graph as a whole, it’s essential to grasp the concept of Lagrange multipliers, a key tool in optimization. Lagrange multipliers aid in finding the optimal points (maximum or minimum) of a function within certain constraints.

Imagine you’re hiking up a mountain with the goal of reaching the peak. There are various paths, but due to safety, you’re required to stay on a designated safe path. Here, reaching the peak represents the optimization problem, and the safe path symbolizes the constraints.

Mathematically, suppose you have a function f(x, y) to optimize. This optimization must adhere to a constraint, represented by another function g(x, y) = 0.

In the ‘Lagrange Multipliers 2D’ graph from Wikipedia, the blue contours represent f(x, y) (the mountain’s landscape), and the red curve indicates the constraint. The point where the two touch, although not the peak of f(x, y) itself, represents the optimal solution under the given constraint. Lagrange multipliers solve this by merging the objective function with its constraint. In other words, Lagrange multipliers help you find this point more easily.
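As a concrete toy example (the functions below are my own choice, not taken from the article), here is a SymPy sketch that applies Lagrange multipliers to maximize f(x, y) = x + y on the unit circle. Solving for the stationary points of the Lagrangian recovers exactly the tangency point the graph describes.

```python
import sympy as sp

# A toy version of the hike: maximize f(x, y) = x + y (how high you get)
# subject to the "safe path" constraint g(x, y) = x**2 + y**2 - 1 = 0.
x, y, lam = sp.symbols("x y lam", real=True)
f = x + y
g = x ** 2 + y ** 2 - 1

# Lagrangian: merge the objective with the constraint via the multiplier lam.
L = f - lam * g

# Stationary points: set the gradient of L with respect to x, y, lam to zero.
solutions = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(solutions)
# Two candidate points at (±sqrt(2)/2, ±sqrt(2)/2); the maximum of f on the circle
# is at (sqrt(2)/2, sqrt(2)/2), the tangency point between f's contours and the circle.
```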

So, if we circle back to that L1 and L2 graph, are you suggesting that the diamond and circle shapes represent constraints? And does that mean the spot where they intersect, that tangent point, is essentially the sweet spot for optimizing f(x, y) (here, minimizing the loss) within those constraints?

Correct! The L1 and L2 regularization techniques can indeed be visualized as imposing constraints in the form of a diamond and a circle, respectively. So the graph helps us understand how these regularization methods impact the optimization of a function, typically the loss function in machine learning models.

A better illustration of L2 (left) and L1 (right) regularization. Source: https://www.researchgate.net/figure/Parameter-norm-penalties-L2-norm-regularization-left-and-L1-norm-regularization_fig2_355020694
  1. L1 Regularization (Diamond Shape): The L1 norm creates a diamond-shaped constraint. This shape is characterized by its sharp corners along the axes. When the optimization process (like gradient descent) seeks the point that minimizes the loss function while staying within this diamond, it’s more likely to hit these corners. At these corners, one of the weights (parameters of the model) becomes zero while others remain non-zero. This property of the L1 norm leads to sparsity in the model parameters, meaning some weights are exactly zero. This sparsity is useful for feature selection, as it effectively removes some features from the model.
  2. L2 Regularization (Circle Shape): On the other hand, the L2 norm creates a circular-shaped constraint. The smooth, round nature of the circle means that the optimization process is less likely to find solutions at the axes where weights are zero. Instead, the L2 norm tends to shrink the weights uniformly without necessarily driving any to zero. This controls the model complexity by preventing weights from becoming too large, thereby helping to avoid overfitting. However, unlike the L1 norm, it doesn’t lead to sparsity in the model parameters.
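A quick way to see this difference in practice is to fit Lasso (L1) and Ridge (L2) on the same synthetic data and count the zeroed coefficients. The sketch below uses scikit-learn with arbitrary settings (20 features, 5 informative, alpha = 1.0) chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only a few of the 20 features are truly informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# The regularization strength alpha is an arbitrary illustrative value.
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso (L1) typically drives many coefficients exactly to zero (sparsity),
# while Ridge (L2) only shrinks them, leaving essentially all of them non-zero.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0), "out of 20")
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0), "out of 20")
```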

Keep in mind, Lagrange multipliers aren’t the only method to grasp L1 and L2 regularizations. Let’s take a break here, and I’ll address more of your questions in our next installment. See you soon!

Other posts in this series:

If you liked the article, you can find me on LinkedIn.

Machine Learning
Data Scientist Interview
Data Science
Deep Learning
Courage To Learn Ml