Sparsity is a very useful property of some Machine Learning algorithms. Such an algorithm yields a sparse result when, among all the coefficients that describe the model, only a small number are non-zero. Sparsity is typically associated with interesting properties such as fast evaluation of the model (see the reduced set methods for obtaining sparse kernel expansions), fast optimization (e.g. in SVMs, many algorithmic approaches exploit this fact), statistical robustness (sparsity is usually associated with good statistical performance), and other computational advantages (e.g. the ability to compute full regularization paths, as in LASSO-style regression).

However, I have not seen a clear explanation of this phenomenon. My feeling (I have no proof, but it seems intuitively reasonable) is that sparsity is related to the regularity of the criterion being optimized.

More precisely, the less regular the optimization criterion, the sparser the solution may end up being.

The idea is that, for sparsity to occur, the value 0 has to play a special role, hence something unusual has to happen at the value 0. This something can be a discontinuity of the criterion or of one of its derivatives.

If the criterion is discontinuous at 0 for some variables, the solutions might get "stuck" at this value (provided it is a local minimum, of course). If instead the criterion is continuous but has a derivative that is discontinuous at 0, then the criterion is V-shaped at 0, so that solutions might be "trapped" at this point. Continuing the reasoning, we see that the "attraction" of the point 0 becomes weaker as the regularity increases. When the function is twice differentiable everywhere, there is no reason for the solution to be "trapped" at 0 rather than ending up somewhere else.
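To make the V-shape intuition concrete, here is a small one-dimensional sketch (the penalty weight and target values are made up for the illustration) comparing a criterion with a kink at 0 against a smooth counterpart, by brute-force minimization over a fine grid:

```python
import numpy as np

lam = 1.0                               # hypothetical regularization strength
grid = np.arange(-3000, 3001) / 1000.0  # step 0.001, contains w = 0 exactly

def argmin_on_grid(f):
    """Brute-force minimizer of f over the grid."""
    return grid[np.argmin(f(grid))]

results = []
for a in [-0.8, -0.3, 0.0, 0.5, 0.9]:
    # V-shaped at 0 (kink) vs twice differentiable everywhere.
    w_v  = argmin_on_grid(lambda w: 0.5 * (w - a) ** 2 + lam * np.abs(w))
    w_sm = argmin_on_grid(lambda w: 0.5 * (w - a) ** 2 + 0.5 * lam * w ** 2)
    results.append((a, w_v, w_sm))

# The kink traps the minimizer at exactly 0 for every |a| <= lam; the
# smooth penalty merely shrinks the minimizer towards 0 (w = a / (1 + lam)).
for a, w_v, w_sm in results:
    print(a, w_v, w_sm)
```

The nondifferentiable criterion yields a whole interval of targets a whose minimizer sits exactly at 0, whereas the smooth criterion reaches 0 only when a = 0 itself.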

This reasoning partly explains the sparsity of SVMs. Indeed, the criterion of the standard L1-SVM (hinge loss) has a discontinuous derivative, the criterion of the L2-SVM (squared hinge loss) is differentiable but has a discontinuous second derivative, and the criterion of the LS-SVM (squared loss) is twice differentiable everywhere. It turns out that the L1 version is the sparsest, followed by the L2 version, while the LS-SVM exhibits no sparsity at all.
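One concrete face of this difference can be sketched as follows (toy inputs, labels, and a fixed linear model, all made up for the example): under the hinge loss, every example with margin above 1 contributes exactly zero (sub)gradient and drops out of the kernel expansion, whereas under the squared loss essentially every example contributes.

```python
import numpy as np

# Hypothetical toy data: inputs X, labels y, and a fixed linear model w.
X = np.array([[2.0, 0.3], [0.5, -1.0], [-1.5, 0.2],
              [0.2, 0.8], [3.0, -0.4], [-0.1, 1.2]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0])
w = np.array([1.0, 0.0])

margins = y * (X @ w)  # m_i = y_i <w, x_i> -> [2.0, 0.5, 1.5, -0.2, 3.0, -0.1]

# Hinge loss max(0, 1 - m): the gradient contribution vanishes for every
# example with margin m > 1, so those examples drop out of the expansion.
hinge_active = margins < 1.0
# Squared loss (1 - m)^2: the gradient 2(m - 1) vanishes only when m == 1,
# so (almost) every example contributes to the solution.
squared_active = margins != 1.0

print(int(hinge_active.sum()), int(squared_active.sum()))  # 3 6
```

Only the examples that still "push" on the criterion end up as support vectors, which is why the flat region of the hinge loss translates into sparsity of the expansion.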

The same reasoning applies when comparing penalized least squares regression: when the penalty is the L2-norm of the weights, there is no sparsity; with the L1-norm, sparsity occurs; and with the L0-norm, there is even more sparsity.
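For a single coefficient, min_w ½(w − a)² + pen(w) has a well-known closed-form solution under each of the three penalties, which makes the comparison concrete. The sketch below applies the three rules to a made-up coefficient vector and counts the exact zeros (the penalty weight and the values of a are illustrative):

```python
import numpy as np

lam = 1.0                            # hypothetical penalty weight
a = np.array([3.0, 1.2, 0.5, -0.2])  # hypothetical unpenalized solutions

# L2 penalty 0.5*lam*w^2: pure shrinkage, never exactly zero.
ridge = a / (1.0 + lam)
# L1 penalty lam*|w|: soft thresholding, exact zeros for |a| <= lam.
lasso = np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)
# L0 penalty lam*1[w != 0]: hard thresholding, keep a only if 0.5*a^2 > lam.
l0 = np.where(0.5 * a ** 2 > lam, a, 0.0)

for name, sol in [("L2", ridge), ("L1", lasso), ("L0", l0)]:
    print(name, sol, "zeros:", int(np.count_nonzero(sol == 0.0)))
```

The L2 penalty shrinks every coefficient but zeroes none, the L1 penalty zeroes the small coefficients exactly, and the L0 penalty zeroes even more of them, matching the ordering above.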

I am wondering whether there is any mathematical treatment of these issues anywhere in the Machine Learning literature. If anyone has a pointer, please let me know.

Hi! I read your weblog with pleasure. I am not sure I understood your intuition on sparsity, but a paper I recently read that seems to go in this direction is an old paper by Federico Girosi:

An equivalence between sparse approximation and support vector machines

I think the interesting point is that one gets some insight into the role played by the loss function and the penalty term with respect to sparsity.

Posted by: lorenzo | November 10, 2005 at 09:48 PM

Hello,

If I followed, you ask "Why do sparse models have good properties?" and on the other hand your intuition is "irregularity of the optimized criterion implies sparsity".

So even if your intuition is right, you don't answer the question, do you? The question is just shifted to "why does irregularity of the criterion imply good properties?" (in particular statistical robustness).

From another point of view, would it be correct to link sparsity with short description length, simplicity, Occam's razor, and hence good generalization?

Posted by: pierre | November 17, 2005 at 02:58 PM

Encouraging sparsity increases bias and hence reduces variance. This is a good thing to do if you think that, in your parametrization, setting most parameters to 0 won't hurt you too much.

Andrew Ng showed in his "rotational invariance" paper that if a fixed number of features suffices to predict well, the sample complexity of L1-regularized logistic regression grows logarithmically in the number of irrelevant features, whereas for L2 or any other rotationally invariant regularization scheme it grows at least linearly. http://ai.stanford.edu/~ang/papers/icml04-l1l2.pdf

There's an issue of parametrization dependence: encouraging sparsity may be a good thing in one parametrization but terrible in a reparametrized version of the problem.

So this raises the question: how can we come up with parametrizations where sparse solutions are not very far from the best solution?

We could let the "best solution" be chosen adversarially. Then, for discrete distributions, the uniform distribution minimizes the worst-case loss, so it makes sense to pick a parametrization where the parameter vector 0 corresponds to the uniform distribution. But what will the rest of the parametrization look like?

Truncated series expansions are a kind of "sparse approximation" for functions, so perhaps there are results in approximation theory that answer this?

Posted by: Yaroslav Bulatov | November 26, 2005 at 09:30 PM

Hi Olivier, I am not sure I understand your intuition, but I have a question: I was wondering whether the Kernel Projection Machine (NIPS'04) is a sparse algorithm or not.

Posted by: zwald | December 15, 2005 at 10:28 AM

Hello Olivier,

It does seem to me that your intuition is more than vaguely related to some (non-ML) work by Mila Nikolova (currently at ENS Cachan), in particular:

Local strong homogeneity of a regularized estimator, SIAM Journal on Applied Mathematics, vol. 61, no. 2, 2000, pp. 633-658

http://www.cmla.ens-cachan.fr/Utilisateurs/nikolova/Hom.pdf

where she shows that in image reconstruction problems, using a penalty function (regularization term) that is nonsmooth at zero is instrumental in obtaining reconstructed signals that display, in her terminology, strongly homogeneous zones (of which neighbouring constant values are a typical example).

Posted by: Olivier Cappé | December 21, 2005 at 10:34 AM

This seems to be quite a long-running discussion. I like your intuition. Well, my question is: since computing exponentials is not that expensive (I mean computing kernel functions, e.g. in the Gaussian case), how can sparsity be useful other than for balancing bias and variance?

Posted by: Benjamin Guo | May 22, 2008 at 05:34 PM

Insightful. I think you should take a look at the compressive sampling / compressed sensing area. I got a quick introduction recently, and sparsity of a given signal is a central concept there (of the signal/vector itself, or of some transformation of it). It looks like sparsity is quite common.

Apart from that, non-linearity and sparsity are common traits of natural systems. There is a tendency to rule out weak alternatives quickly and converge to a few results. Linearity is like an undecided mind, never able to converge. There is a research group on frugal heuristics (I think).

Posted by: Account Deleted | January 02, 2011 at 01:30 PM
