L1 regularization and L2 regularization intuitive explanation

Summary

L1 and L2 regularization are techniques used in machine learning to prevent overfitting and improve how well a model generalizes. L1 tends to produce sparse solutions, which helps with feature selection, while L2 yields non-sparse solutions and keeps the weights small.

Abstract

L1 and L2 regularization are common methods in machine learning that add a penalty term to the loss function, which helps prevent overfitting and improves how well the model generalizes. L1 regularization, the penalty used in lasso regression, adds the sum of the absolute values of the coefficients, while L2 regularization, the penalty used in ridge regression, adds the sum of their squares. L1 tends to produce sparse solutions and is helpful for feature selection, while L2 yields non-sparse solutions and keeps the weights small. The best way to understand the effect of each is to construct a simple hypothetical scenario with two duplicate or highly correlated features whose unregularized weights are 2 and 4. In this scenario, L1 regularization can eliminate one duplicate and keep the other with a weight of 6 without changing the predictions or the penalty, while L2 regularization keeps both features with weights of 3, which minimizes the penalty without changing the loss. It is best to use L1 during the exploration/experiment stage to select the features that matter, and then use L2 to build models with lower variance.

Bullet points

L1 and L2 regularization are methods in machine learning that add a penalty term to the loss function.
L1 regularization is the penalty used in lasso regression, and L2 regularization is the penalty used in ridge regression.
L1 regularization adds the sum of the absolute values of the coefficients as a penalty term, while L2 regularization adds the sum of the squared coefficients as a penalty term.
L1 regularization tends to produce sparse solutions and is helpful for feature selection, while L2 regularization yields non-sparse solutions and keeps the weights small.
The best way to understand the effect of L1 and L2 regularization is to construct a simple hypothetical scenario where we have duplicate or highly correlated features.
In this scenario, L1 regularization can eliminate duplicate features and keep just one of them with the weight of 6 without impacting the result or the penalty, while L2 regularization keeps both features with weights of 3 to minimize the penalty while not impacting the loss.
It is best to use L1 during the exploration/experiment stage and select the features that matter, and then use L2 to provide better models with lower variance.

L1 regularization and L2 regularization intuitive explanation

L1 and L2 are well-known and common forms of regularization. While I was explaining them to a colleague recently, it occurred to me that it might not be intuitive to many people why they have the effects they do on the model weights. In this blog we go over the intuition behind the effect each has on the final model weights.

Starting with the definition:

L1 and L2 regularization are methods in machine learning that add a penalty term to the loss function. L1 regularization is the penalty used in lasso regression, and L2 regularization is the penalty used in ridge regression.

L1 regularization adds the sum of the absolute values of the coefficients as a penalty term. L2 regularization adds the sum of the squared coefficients as a penalty term.
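
Written out, the two penalized objectives look roughly like this. The symbols are assumptions for illustration: Loss is whatever unregularized loss the model uses, c_i are the coefficients, and λ is a regularization strength that the text above does not name.

```latex
\begin{align*}
\text{L1 (lasso):} \quad & J_{1}(c) = \mathrm{Loss}(c) + \lambda \sum_{i} |c_i| \\
\text{L2 (ridge):} \quad & J_{2}(c) = \mathrm{Loss}(c) + \lambda \sum_{i} c_i^{2}
\end{align*}
```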

Intuition on the effect:

The best way to understand how L1 and L2 affect ML models is to construct a simple hypothetical scenario with two duplicate (or highly correlated) features A and B and a simple linear model y = c1 A + c2 B.

Suppose that, without any regularization, the fitted weights are c1 = 2 and c2 = 4.
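
A small sketch of why duplicate features leave the model free to move weight around. The data values are made up for illustration; the only thing that matters is that B is an exact copy of A, so the predictions depend only on the sum of the two weights.

```python
# Hypothetical data: feature B is an exact copy of feature A.
A = [1.0, 2.0, 3.0, 4.0]
B = list(A)


def predict(c1, c2):
    """Linear model y = c1*A + c2*B, evaluated row by row."""
    return [c1 * a + c2 * b for a, b in zip(A, B)]


# Three weight pairs that all sum to 6.
baseline = predict(2, 4)
even_split = predict(3, 3)
single_feature = predict(6, 0)

print(baseline)                     # [6.0, 12.0, 18.0, 24.0]
print(even_split == baseline)       # True
print(single_feature == baseline)   # True
```

Since all three weight pairs give identical predictions, the loss cannot tell them apart, and only the penalty term decides between them.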

Now add L2 regularization. The initial penalty is 2² + 4² = 20. Because A and B are duplicates, the loss depends only on the total weight c1 + c2, so the model must keep a combined weight of 6 on these features to leave the loss unchanged. The cheapest way to do that is to keep both features with a weight of 3 each (total weight 6), which gives a penalty of 3² + 3² = 18 (lower than the 20 before).
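
The L2 arithmetic can be checked with a quick scan over ways of splitting a total weight of 6. This is only a sketch: the 0.5 grid is an arbitrary choice, and the regularization strength is left out because it scales every candidate equally.

```python
def l2_penalty(weights):
    """Sum of squared weights (regularization strength omitted)."""
    return sum(w * w for w in weights)


TOTAL = 6  # combined weight the two duplicate features must carry

# Every split c1 + c2 = 6 on a 0.5 grid, from (0, 6) to (6, 0).
splits = [(k / 2, TOTAL - k / 2) for k in range(0, 2 * TOTAL + 1)]
best_split = min(splits, key=l2_penalty)

print(l2_penalty((2, 4)))                   # 20
print(best_split, l2_penalty(best_split))   # (3.0, 3.0) 18.0
print(l2_penalty((6, 0)))                   # 36
```

Putting all the weight on one feature is the most expensive option under L2, which is why it prefers to spread the weight evenly.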

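For L1 regularization, the initial penalty is |2| + |4| = 6. The model can eliminate one duplicate feature and keep just the other with a weight of 6 without changing the predictions or the penalty, since |6| + |0| = 6 as well. Unlike L2, L1 gives no reward for spreading the weight across both features.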
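
The same kind of scan for L1, again as a sketch with the regularization strength omitted. It shows that every non-negative split of the total weight costs the same, so dropping one feature is free as far as the penalty is concerned.

```python
def l1_penalty(weights):
    """Sum of absolute weights (regularization strength omitted)."""
    return sum(abs(w) for w in weights)


TOTAL = 6  # combined weight the two duplicate features must carry

# Every integer split c1 + c2 = 6 with non-negative weights.
splits = [(k, TOTAL - k) for k in range(TOTAL + 1)]
penalties = {split: l1_penalty(split) for split in splits}

print(penalties[(2, 4)], penalties[(3, 3)], penalties[(6, 0)])  # 6 6 6
print(set(penalties.values()))                                  # {6}
print(l1_penalty((8, -2)))  # 10: weights of opposite sign cost more
```

Note that with exactly duplicated features L1 is indifferent between these splits; the sparse choice (6, 0) is one of several equally cheap options rather than a strictly better one.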

Hence the effect of L2 is to keep the model simple by preventing the weights from getting too big, but it ends up keeping the duplicate features, while L1 can provide feature reduction, that is, sparse solutions.
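
To see the L2 behaviour emerge from actual training rather than arithmetic, here is a minimal gradient-descent sketch. Everything specific in it is an assumption for illustration: the toy data, the mean-squared-error loss, the regularization strength LAM, the learning rate LR, and the step count. It starts from the unregularized weights 2 and 4 used in the example.

```python
# Hypothetical data: B duplicates A, and the target is y = 6 * A.
A = [1.0, 2.0, 3.0]
B = list(A)
y = [6.0 * a for a in A]

LAM = 0.1      # regularization strength (assumed)
LR = 0.01      # learning rate (assumed)
STEPS = 20000  # number of gradient steps (assumed)

c1, c2 = 2.0, 4.0  # start from the unregularized weights in the example
n = len(A)
for _ in range(STEPS):
    residuals = [yi - (c1 * a + c2 * b) for a, b, yi in zip(A, B, y)]
    # Gradient of: mean squared error + LAM * (c1**2 + c2**2).
    grad_c1 = -2 / n * sum(r * a for r, a in zip(residuals, A)) + 2 * LAM * c1
    grad_c2 = -2 / n * sum(r * b for r, b in zip(residuals, B)) + 2 * LAM * c2
    c1, c2 = c1 - LR * grad_c1, c2 - LR * grad_c2

# Both weights end up equal, each a little under 3.
print(round(c1, 3), round(c2, 3))
```

The weights equalize because the data-fit part of the gradient is identical for both features, so only the penalty acts on their difference. They land slightly below 3 because, in a real fit, L2 also trades a little loss for smaller weights, whereas the hand calculation above holds the loss fixed.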

Practical application:

It's best to use L1 during the exploration/experiment stage to select the features that matter. Once you settle on the features, L2 typically provides better models with lower variance.
