Regularization and Sparsity

Aathreya Kadambi
March 25, 2025

One of the most interesting differences between $L^1$ and $L^2$ when used as regularization terms is that $L^1$ induces sparsity, while $L^2$ doesn't. Why is this the case?

We already saw an intuitive explanation for this in lecture: the geometric perspective. If we draw out $L^1$, it looks like this:

[Figure: a contour of the $L^1$ norm, a diamond centered at the origin, with gradient directions drawn in blue.]

This is one contour of $L^1$, so we can imagine the gradient being perpendicular to these lines (see the blue curves). Since the gradient then points in a direction angled 45 degrees in the $xy$-plane (or 135, or 225, or 315), any point on this contour will flow in a direction that makes it hit an axis before it hits the origin. On the other hand, for $L^2$, it looks like this:

[Figure: a contour of the $L^2$ norm, a circle centered at the origin.]

so that the gradients point towards the origin, and points will not hit the axes before reaching the origin.

When we hit an axis, we are setting a coordinate to zero. Since $L^1$ regularization sets many coordinates to zero, we say that it "induces sparsity".
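To make this picture concrete, here is a minimal sketch of the flow under each regularizer alone (no loss term yet). The starting point, step size, and the zero-clipping (soft-threshold) form of the $L^1$ step are my own choices for illustration:

```python
import numpy as np

# Start both flows from the same point and step only on the regularizer.
theta0 = np.array([1.0, 0.25])
eta = 0.01   # step size (arbitrary)
steps = 50

# L1 flow: the gradient is sign(theta), so every coordinate shrinks at the
# same constant rate per step; the smaller coordinate hits its axis first.
# (Clipping at zero -- a soft-threshold step -- keeps the iterate from
# oscillating across the axis.)
theta_l1 = theta0.copy()
for _ in range(steps):
    theta_l1 = np.sign(theta_l1) * np.maximum(np.abs(theta_l1) - eta, 0.0)

# Squared-L2 flow: the gradient is 2*theta, so both coordinates shrink
# proportionally and neither ever reaches zero exactly.
theta_l2 = theta0.copy()
for _ in range(steps):
    theta_l2 = (1 - 2 * eta) * theta_l2

print("L1 flow:", theta_l1)   # second coordinate is exactly 0
print("L2 flow:", theta_l2)   # both coordinates still nonzero
```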

The above explanation is great, but it is cheating a bit. It fails to consider the impact of the main loss function! In most cases, we won't just have a regularization term without an actual loss term. There is actually another graphical explanation that accounts for this, which Arnav showed in class. It can also be found in the book Elements of Statistical Learning and in this StackExchange post.
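For a quick empirical version of this with a real data-fitting term, one can compare lasso ($L^1$-penalized least squares) and ridge ($L^2$-penalized least squares) fits. This sketch uses scikit-learn with made-up synthetic data and an arbitrary penalty strength, so the exact numbers will vary, but the lasso coefficients contain exact zeros while the ridge coefficients typically do not:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
true_theta = np.zeros(d)
true_theta[:3] = [2.0, -1.5, 1.0]      # only 3 informative features
y = X @ true_theta + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)     # L1-penalized least squares
ridge = Ridge(alpha=0.1).fit(X, y)     # L2-penalized least squares

print("lasso zeros:", np.sum(lasso.coef_ == 0), "of", d)
print("ridge zeros:", np.sum(ridge.coef_ == 0), "of", d)
```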

We can also derive this more algebraically. Suppose we have some vector of data $x$, parameters $\theta$, and a loss given by:

$$L(x,\theta) = f(x,\theta) + \lambda \|\theta\|_{L^2}^2$$
One can then ask: what happens as we decrease $L$? In theory, at a minimum of $L$, we must have:
$$0 = \nabla_\theta L(x,\theta) = \nabla_\theta f(x,\theta) + \nabla_\theta \left(\lambda \|\theta\|_{L^2}^2\right) = \nabla_\theta f(x,\theta) + 2\lambda\theta$$
and now we see that:
$$\theta = -\dfrac{1}{2\lambda}\nabla_\theta f(x,\theta) \quad\Rightarrow\quad \theta_i = -\dfrac{1}{2\lambda}\partial_{\theta_i} f(x,\theta)$$
In other words, we expect the gradient of $f$ to be proportional to $\theta$. But this equation also tells us another story: the right-hand side is proportional to the direction of descent, so the direction of descent of $f$ points in the same direction as $\theta$. In other words, $L^2$ balances the size of $\theta_i$ against minimizing $f$, rather than inducing sparsity. This is a similar picture to the one we saw above, and is certainly an interesting point of discussion.
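As a sanity check of this stationarity condition, here is a small sketch that assumes a concrete quadratic loss $f(x,\theta) = \|A\theta - b\|^2$ (my choice, purely for illustration). The ridge minimizer has a closed form, and it satisfies $\theta = -\frac{1}{2\lambda}\nabla_\theta f(x,\theta)$ up to numerical precision:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 5))
b = rng.normal(size=30)
lam = 0.5

# f(theta) = ||A theta - b||^2, so grad f = 2 A^T (A theta - b).
# Minimizing f + lam * ||theta||^2 gives the closed-form ridge solution:
theta_star = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)
grad_f = 2 * A.T @ (A @ theta_star - b)

# At the minimum, theta should equal -grad_f / (2 * lam).
print(np.allclose(theta_star, -grad_f / (2 * lam)))  # True
```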

On the other hand, if we had a loss with $L^1$,

$$L(x,\theta) = f(x,\theta) + \lambda \|\theta\|_{L^1}$$
so that at a minimum,
$$0 = \nabla_\theta L(x,\theta) = \nabla_\theta f(x,\theta) + \nabla_\theta \left(\lambda \|\theta\|_{L^1}\right) = \nabla_\theta f(x,\theta) + \lambda\,\mathrm{sign}(\theta)$$
where $\mathrm{sign}(\theta)_i$ is $-1$ for $\theta_i < 0$ and $1$ for $\theta_i > 0$. Note that $\theta_i = 0$ is a point where the norm is not differentiable, so this logic doesn't apply there. We can handle the three cases separately. If $\theta_i < 0$, then it must be that
$$\partial_{\theta_i} f(x,\theta) = \lambda.$$
If $\theta_i > 0$, it must be that
$$\partial_{\theta_i} f(x,\theta) = -\lambda.$$

In the case of regression, $\partial_{\theta_i} f(x,\theta) = x_i^2\theta_i + 2x_iy_i$ (why?). If $0 > \theta_i$, then

$$0 > \theta_i = \dfrac{\lambda - 2x_iy_i}{x_i^2} \quad\Rightarrow\quad \dfrac{\lambda}{2} < x_iy_i$$
and if $0 < \theta_i$, then
$$0 < \theta_i = \dfrac{-\lambda - 2x_iy_i}{x_i^2} \quad\Rightarrow\quad -\dfrac{\lambda}{2} > x_iy_i.$$
Overall, $\theta_i$ being nonzero implies that $|x_iy_i| > \dfrac{\lambda}{2}$. In other words, as we keep increasing $\lambda$, once $\lambda \ge 2|x_iy_i|$, $\theta_i$ will become zero. This is sparsity.
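To see the threshold numerically, the sketch below solves this one-coordinate problem in closed form, assuming the per-coordinate loss $f_i(\theta_i) = \frac{1}{2}x_i^2\theta_i^2 + 2x_iy_i\theta_i$ (a hypothetical choice whose derivative matches the one used above). The minimizer is a soft-threshold, and it is exactly zero once $\lambda \ge 2|x_iy_i|$:

```python
def lasso_coordinate(x_i, y_i, lam):
    """Minimize 0.5*x_i**2*t**2 + 2*x_i*y_i*t + lam*|t| over t.

    The stationarity conditions derived above give a soft-threshold:
    t is nonzero only when |x_i * y_i| > lam / 2.
    """
    a = 2 * x_i * y_i
    if a > lam:
        return (lam - a) / x_i**2    # the theta_i < 0 branch
    if a < -lam:
        return (-lam - a) / x_i**2   # the theta_i > 0 branch
    return 0.0                       # otherwise theta_i is exactly zero

x_i, y_i = 1.5, 0.8                  # made-up numbers, so 2*|x_i*y_i| = 2.4
for lam in [0.5, 1.0, 2.0, 2.4, 3.0]:
    print(f"lambda = {lam:3.1f} -> theta_i = {lasso_coordinate(x_i, y_i, lam):+.3f}")
```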
