An intuitive explanation of L1 and L2 regularization
L1 and L2 are well-known, common forms of regularization. While I was explaining them to a colleague recently, it occurred to me that it may not be intuitive to many people why they have the effects they do on the model weights. In this post we go over the intuition behind the effect each has on the final model weights.
Starting with the definition:
L1 and L2 regularization are methods in machine learning that add a penalty term to the loss function. Applied to linear regression, L1 regularization is known as lasso regression and L2 regularization as ridge regression.
L1 regularization adds the sum of the absolute values of the coefficients as a penalty term. L2 regularization adds the sum of the squared coefficients as a penalty term.
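As a minimal sketch of these two definitions, the penalties can be written as plain Python functions. The function names and the `lam` strength parameter are illustrative assumptions, not taken from any particular library.

```python
def l1_penalty(weights, lam=1.0):
    """lam * sum of the absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam=1.0):
    """lam * sum of the squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam=1.0, kind="l2"):
    """Add the chosen penalty to an already computed data loss."""
    if kind == "l1":
        return data_loss + l1_penalty(weights, lam)
    return data_loss + l2_penalty(weights, lam)


weights = [2.0, 4.0]
print(l1_penalty(weights))  # 6.0
print(l2_penalty(weights))  # 20.0
```

The larger `lam` is, the more the penalty counts relative to the data loss.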
Effect:
L1 regularization tends to produce sparse solutions, where some weights become exactly zero, and is helpful for feature selection. L2 regularization yields non-sparse solutions: it shrinks the weights toward zero without eliminating them, which helps keep the model simple.
Intuition on the effect:
The best way to understand how L1 and L2 affect ML models is to construct a simple hypothetical scenario with two duplicate (or highly correlated) features A and B and a simple linear model y = c1 A + c2 B.
Suppose that without any regularization the weights are c1 = 2 and c2 = 4.
Now add L2 regularization. The initial penalty term is 2² + 4² = 20. Because A and B are duplicates, the prediction depends only on the sum c1 + c2, so to minimize the penalty without affecting the loss, the model still needs to give a total weight of 6 to these duplicate features. The best option is to keep both features with a weight of 3 each (total weight is still 6), which gives a penalty of 3² + 3² = 18 (lower than the 20 before).
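The arithmetic above can be checked with a short sweep over ways of splitting the total weight of 6 between the two features. This is only a sketch of the toy example; the 0.5 step size and the names are arbitrary choices.

```python
def l2_penalty(c1, c2):
    return c1 ** 2 + c2 ** 2


TOTAL = 6.0  # combined weight the duplicate features must carry

# Try splits c1 = 0.0, 0.5, ..., 6.0 with c2 = TOTAL - c1.
splits = [(i * 0.5, TOTAL - i * 0.5) for i in range(13)]
for c1, c2 in splits:
    print(f"c1={c1:.1f}  c2={c2:.1f}  L2 penalty={l2_penalty(c1, c2):.2f}")

best = min(splits, key=lambda s: l2_penalty(*s))
print(best, l2_penalty(*best))  # (3.0, 3.0) 18.0
```

The penalty is highest at the two extremes (6, 0) and (0, 6), where it is 36, and lowest at the even split.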
For L1 regularization, the initial penalty is 2 + 4 = 6. The model can eliminate one duplicate feature and keep just the other with a weight of 6 without affecting the result or the penalty, which stays at 6. Under L2 the same move would raise the penalty to 6² = 36.
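The same kind of sweep shows why L1 has no objection to dropping one of the duplicates: every non-negative split of the total weight costs the same. As before, this is a sketch of the toy example, with hypothetical names.

```python
def l1_penalty(c1, c2):
    return abs(c1) + abs(c2)


TOTAL = 6.0  # combined weight the duplicate features must carry

# Non-negative splits: (0, 6), (1, 5), ..., (6, 0).
splits = [(float(i), TOTAL - i) for i in range(7)]
penalties = {split: l1_penalty(*split) for split in splits}
for (c1, c2), penalty in penalties.items():
    print(f"c1={c1:.1f}  c2={c2:.1f}  L1 penalty={penalty:.1f}")

print(set(penalties.values()))  # {6.0}
```

Splits with opposite signs, such as (8, -2), do cost more under L1 (10 in that case), so the tie holds only among weights of the same sign.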
Hence the effect of L2 is to keep the model simple by preventing the weights from getting too big, but it ends up keeping the duplicate features, while L1 can provide feature reduction / sparse solutions.
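To see the L2 behaviour emerge from an actual fit rather than from hand arithmetic, here is a small gradient descent sketch on made-up data where B is an exact copy of A and y = 6 A. The data, the penalty strength `LAM`, the step size `LR`, and the scaling of the loss are all assumptions chosen for illustration.

```python
# Toy data: feature B is an exact copy of feature A, and y = 6 * A.
A = [1.0, 2.0, 3.0, 4.0]
B = list(A)
y = [6.0 * a for a in A]
n = len(A)

LAM = 0.1   # penalty strength (assumed)
LR = 0.05   # gradient descent step size (assumed)

c1, c2 = 2.0, 4.0  # start from the unregularized weights of the example
for _ in range(5000):
    residuals = [yi - (c1 * a + c2 * b) for a, b, yi in zip(A, B, y)]
    # Gradient of (1/2n) * sum(residual^2) + (LAM/2) * (c1^2 + c2^2)
    g1 = -sum(r * a for r, a in zip(residuals, A)) / n + LAM * c1
    g2 = -sum(r * b for r, b in zip(residuals, B)) / n + LAM * c2
    c1, c2 = c1 - LR * g1, c2 - LR * g2

print(round(c1, 3), round(c2, 3))  # both close to 2.98
```

Starting from the uneven weights 2 and 4, the two weights converge to the same value, slightly below 3 because the penalty also shrinks their total a little. Neither weight is driven to zero, so both duplicate features stay in the model.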
Practical application:
It's best to use L1 during the exploration/experimentation stage to select the features that matter. Once you settle on the features, L2 tends to give better models with lower variance.




