Sparsity is a very useful property of some Machine Learning algorithms. Such an algorithm yields a sparse result when, among all the coefficients that describe the model, only a small number are non-zero. This is typically associated with interesting properties such as fast evaluation of the model (see the reduced set methods for obtaining sparse kernel expansions), fast optimization (e.g. in SVM, many algorithmic approaches exploit this fact), statistical robustness (sparsity is usually associated to good statistical performance), or other computational advantages (e.g. ability to compute full regularization paths, for example in LASSO-style regression).
However I have not seen a clear explanation of this phenomenon. My feeling (I have no proof but it seems intuitively reasonable) is that sparsity is related to the regularity of the criterion to be optimized.
More precisely, the less regular the optimization criterion, the more sparse the solution may end up being.
The idea is that, for sparsity to occur, the value 0 has to play a special role, hence something unusual has to happen at the value 0. This something can be a discontinuity of the criterion or of one of its derivatives.
If the criterion is discontinuous at 0 for some variables, the solutions might get "stuck" in this value (provided it is a local miminum of course). If instead, the criterion is continuous but has a derivative which is discontinuous at 0, it means that the criterion is V-shaped at 0, so that solutions might be "trapped" at this point. If we continue the reasoning, we see that the "attraction" of the point 0 is less and less effective as the regularity increases. When the function is twice differentiable everywhere, there is not any reason for the solution to be "trapped" at 0 rather than ending up somewhere else.
This reasoning partly explains the sparsity of SVMs. Indeed, the standard L1-SVM (hinge loss) have a discontinuous criterion, while for L2-SVM (squared hinge loss), the criterion has a discontinuous derivative and finally, for the LS-SVM (squared loss), the criterion is twice differentiable. It turns out that the most sparse is the L1 version and then the L2 version, while for LS-SVM there is no sparsity at all.
The same reasoning applies when one compares penalized least squares regression: when the penalization is the L2-norm of the weights, there is no sparsity, while with the L1-norm, the sparsity occurs, and for the L0-norm there is even more sparsity.
I am wondering whether there is any mathematical treatment of these issues anywhere in the Machine Learning litterature. If anyone has a pointer, please let me know.

hi! i read with pleasure your weblog. i am not sure i understood your intuition on sparsity, but a paper i recently read which seems to go in this direction is an old paper by federico girosi:
An equivalence between sparse approximation and support vector machines
i think that the interesting point is that one have some insight on the role played by loss function and penalty term with respect to sparsity.
Posted by: lorenzo | November 10, 2005 at 09:48 PM
Hello,
If I followed, you ask "Why sparse model have good properties?" and on the other hand your intuition is "unregularity of optimised criteria implies sparsity".
So even if your intuition is good, you don't answer to the question, do you ? The question is justed shifted to "why unregularity of criteria implies good popreties". (in paticular statitical robusteness) ???
From another point of view, would it be correct to link sparcity with short description lenth, simplicty, Occam razor and then good generalisation ???
Posted by: pierre | November 17, 2005 at 02:58 PM
Encouraging sparsity increases bias, and hence reduces the variance. This is a good thing to do if you think that in your parametrization setting most parameter to 0 won't hurt you too much.
Andrew Ng showed in his "rotational invariance" paper that if a fixed number of features is needed to predict well, the sample complexity of L1 regularized logistic regression grows logarithmically in the number of irrelevant features, but L2 or any other rotationally invariant regularization scheme grows at least linearly. http://ai.stanford.edu/~ang/papers/icml04-l1l2.pdf
There's an issue of parametrization dependence: encouraging sparsity in one parametrization may be a good thing while in a reparametrized version of the problem be terrible.
So this raises the question -- how can we come up with parametrizations where sparse solutions are not very far from best solution?
We could let the "best solution" be chosen adversarially. Then for discrete distributions, the uniform distribution minimizes worst loss, so it makes sense that we pick a parametrization where parameter vector=0 corresponds to uniform distribution. But what will the rest of parametrization look like?
Truncated series expansions are a kind of "sparse approximations" for functions, so perhaps there are results in approximation theory that answer this?
Posted by: Yaroslav Bulatov | November 26, 2005 at 09:30 PM
Hi Olivier, I am not sure to understand your intuition but I have a question. I was wondering if the Kernel Projection Machine (NIPS'04) is a sparse algorithm or not.
Posted by: zwald | December 15, 2005 at 10:28 AM
Bonjour Olivier,
It does seem to me that your intuition is more than vaguely related to some (non-ML) works by Mila Nikolova (curr. in ENS Cachan), in particular:
Local strong homogeneity of a regularized estimator, SIAM Journal on Applied Mathematics, vol. 61, no. 2, 2000, pp. 633-658
http://www.cmla.ens-cachan.fr/Utilisateurs/nikolova/Hom.pdf
where she shows that in image reconstruction problems, the fact that a nonsmooth at zero penalty function (regularization term) is used is instrumental in obtaining reconstructed signals that display, in her terminology, strongly homogeneous zones (of which neighbouring constant values is a typical example).
Posted by: Olivier Cappé | December 21, 2005 at 10:34 AM
Seems fairly a long time discussion. I like your intuition. Well, my question is, since computing exponential is not that expensive (I mean computing kernel functions, e.g. in the case of Gaussian), how can sparsity be useful other than balancing the bias and variance?
Posted by: Benjamin Guo | May 22, 2008 at 05:34 PM