Regularization and Sparsity
Aathreya Kadambi
March 25, 2025
One of the most interesting differences between L1 and L2 when used as regularization terms is that L1 induces sparsity, while L2 doesn’t. Why is this the case?
We already saw an intuitive explanation for why in lecture: the geometric perspective. If we draw out a contour of L1, it looks like this (a diamond):
This is one contour of L1, so we can imagine the gradient being perpendicular to these lines (see the blue curves). Since the gradient then points in a direction angled 45 degrees in the xy-plane (or 135, 225, or 315), any point on this contour will flow in a direction that makes it hit an axis before it reaches the origin. On the other hand, a contour of L2 looks like this (a circle):
so the gradients point towards the origin, and points will not hit the axes before they reach the origin.
When we hit axes, we are setting a coordinate to zero. As we set many coordinates to zero with L1 regularization, we say that we “induce sparsity”.
The above explanation is great, but it is cheating a bit. It fails to consider the impact of the main loss function! In most cases, we won’t just have a regularization term without an actual loss term. There is actually another graphical explanation of this, which Arnav showed in class. It can also be found in the book Elements of Statistical Learning and in this StackExchange post.
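Before getting into the algebra, here is a quick empirical illustration of the claim itself. This is a small sketch of my own, not from the lecture or the book: the synthetic data, the penalty strength alpha, and all variable names are just illustrative. It fits the same regression problem with an L1 penalty (Lasso) and an L2 penalty (Ridge) and counts how many learned coefficients come out exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem where only 4 of 20 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
theta_true = np.zeros(20)
theta_true[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ theta_true + 0.5 * rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1-penalized least squares
ridge = Ridge(alpha=0.5).fit(X, y)   # L2-penalized least squares

# L1 typically zeroes out most of the 16 irrelevant coefficients;
# L2 shrinks them, but they stay (small and) nonzero.
print("exact zeros with L1:", int(np.sum(lasso.coef_ == 0.0)))
print("exact zeros with L2:", int(np.sum(ridge.coef_ == 0.0)))
```

On a run like this, the L1 fit usually zeroes out most of the irrelevant coefficients, while the L2 fit leaves every coefficient nonzero.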
We can also derive this more algebraically. Suppose we have some vector of data x, parameters θ, and a loss given by:
$$L(x,\theta) = f(x,\theta) + \lambda\|\theta\|_{L^2}^2.$$
One can then ask: what happens as we decrease $L$? In theory, at the minimum of $L$, we must have:
$$0 = \nabla_\theta L(x,\theta) = \nabla_\theta f(x,\theta) + \nabla_\theta\!\left(\lambda\|\theta\|_{L^2}^2\right) = \nabla_\theta f(x,\theta) + 2\lambda\theta,$$
and now we see that:
$$\theta = -\frac{1}{2\lambda}\nabla_\theta f(x,\theta) \quad\Longrightarrow\quad \theta_i = -\frac{1}{2\lambda}\,\partial_{\theta_i} f(x,\theta).$$
In other words, we expect the gradient of $f$ to be proportional to $\theta$. But this equation also tells us another story: the right-hand side is proportional to the direction of descent, so the direction of descent of $f$ is in the same direction as $\theta$. In other words, L2 balances the size of each $\theta_i$ against minimizing $f$, rather than inducing sparsity. This is a similar sort of picture to what we saw above, and is certainly a point of interesting discussion.
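As a quick numerical check of this stationarity condition (again a sketch of my own, assuming the concrete least-squares loss $f(x,\theta) = \|X\theta - y\|^2$ with made-up data; X, y, and lam below are illustrative), the closed-form minimizer of the L2-regularized loss really does satisfy $\theta = -\frac{1}{2\lambda}\nabla_\theta f(x,\theta)$, and none of its coordinates come out exactly zero:

```python
import numpy as np

# Ridge check: L(theta) = ||X theta - y||^2 + lam * ||theta||_2^2
# has the closed-form minimizer theta = (X^T X + lam I)^{-1} X^T y.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 3.0

theta = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

grad_f = 2 * X.T @ (X @ theta - y)              # gradient of the smooth part at the minimum
print(np.allclose(theta, -grad_f / (2 * lam)))  # True: theta equals -grad_f / (2 lam)
print(theta)                                    # shrunk toward zero, but no exact zeros
```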
On the other hand, suppose we had a loss with the L1 norm,
$$L(x,\theta) = f(x,\theta) + \lambda\|\theta\|_{L^1},$$
so that at a minimum,
$$0 = \nabla_\theta L(x,\theta) = \nabla_\theta f(x,\theta) + \nabla_\theta\!\left(\lambda\|\theta\|_{L^1}\right) = \nabla_\theta f(x,\theta) + \lambda\,\operatorname{sign}(\theta),$$
where $\operatorname{sign}(\theta)_i$ is $-1$ for $\theta_i < 0$ and $1$ for $\theta_i > 0$. Note that $|\theta_i|$ is not differentiable at $\theta_i = 0$, so this logic doesn’t apply there. We can handle the three cases separately. If $\theta_i < 0$, then it must be that
$$\partial_{\theta_i} f(x,\theta) = \lambda.$$
If $\theta_i > 0$, it must be that
$$\partial_{\theta_i} f(x,\theta) = -\lambda.$$
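We can watch these conditions play out numerically with another small sketch of my own (not from the lecture): run proximal gradient descent (ISTA) on an L1-regularized least-squares problem, check that many coordinates of the solution are exactly zero, and check that at the nonzero coordinates $\partial_{\theta_i} f$ really does equal $\mp\lambda$. The data and variable names are illustrative.

```python
import numpy as np

# ISTA for L(theta) = f(theta) + lam * ||theta||_1 with f(theta) = ||X theta - y||^2.
# The proximal (soft-thresholding) step is what produces exact zeros.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
theta_true = np.zeros(10)
theta_true[:3] = [2.0, -1.5, 1.0]               # only 3 informative coordinates
y = X @ theta_true + 0.1 * rng.normal(size=50)

lam = 20.0
step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)    # 1 / Lipschitz constant of grad f

theta = np.zeros(10)
for _ in range(5000):
    grad_f = 2 * X.T @ (X @ theta - y)          # gradient of the smooth part
    z = theta - step * grad_f                   # plain gradient step on f
    theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold

grad_f = 2 * X.T @ (X @ theta - y)
nonzero = np.abs(theta) > 1e-10
print("theta:", np.round(theta, 3))             # most coordinates are exactly 0
print("partial derivatives of f at nonzero coordinates:", np.round(grad_f[nonzero], 3))
print("they should equal -lam * sign(theta_i):", -lam * np.sign(theta[nonzero]))
```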
In the case of regression, $\partial_{\theta_i} f(x,\theta) = x_i^2\theta_i + 2x_iy_i$ (why?). If $\theta_i < 0$, then
$$0 > \theta_i = \frac{\lambda - 2x_iy_i}{x_i^2} \quad\Longrightarrow\quad \frac{\lambda}{2} < x_iy_i,$$
and if $\theta_i > 0$, then
$$0 < \theta_i = \frac{-\lambda - 2x_iy_i}{x_i^2} \quad\Longrightarrow\quad -\frac{\lambda}{2} > x_iy_i.$$
Overall, $\theta_i$ being nonzero implies that $|x_iy_i| > \frac{\lambda}{2}$. In other words, as we keep increasing $\lambda$, once $\lambda \geq 2|x_iy_i|$, $\theta_i$ becomes zero. This is sparsity.
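As a final sanity check (again my own sketch): take one choice of $f$ whose partial derivative matches the formula used above, namely $f_i(\theta_i) = \frac{1}{2}x_i^2\theta_i^2 + 2x_iy_i\theta_i$, minimize $f_i(\theta_i) + \lambda|\theta_i|$ over a fine grid, and watch the minimizer snap to exactly zero once $\lambda \geq 2|x_iy_i|$. The numbers x_i, y_i and the grid are arbitrary.

```python
import numpy as np

# One f consistent with the partial derivative above: f_i(t) = 0.5*x_i^2*t^2 + 2*x_i*y_i*t,
# so d/dt f_i(t) = x_i^2*t + 2*x_i*y_i. Minimize f_i(t) + lam*|t| on a grid for growing lam.
x_i, y_i = 1.5, -0.8
threshold = 2 * abs(x_i * y_i)                  # the predicted cutoff, 2|x_i y_i| = 2.4
t = np.linspace(-5, 5, 200001)                  # fine grid that contains t = 0 exactly

for lam in [0.5 * threshold, 0.9 * threshold, 1.0 * threshold, 2.0 * threshold]:
    objective = 0.5 * x_i**2 * t**2 + 2 * x_i * y_i * t + lam * np.abs(t)
    t_star = t[np.argmin(objective)]
    print(f"lam = {lam:4.2f}  ->  minimizer = {t_star:+.4f}")
# The minimizer is nonzero for lam below 2|x_i y_i| and exactly zero at and above it.
```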