Machine Learning Thoughts

Some thoughts about philosophical, theoretical and practical aspects of Machine Learning.

Blink

I have just started reading "Blink: The Power of Thinking Without Thinking" by Malcolm Gladwell. Like Freakonomics, this book has sold very well in the US, so I was curious about it.
Overall, it is fun to read, although a bit unorganized. But what is especially striking for me is the main claim that humans can reason unconsciously.
More precisely, there are many situations (and the book gives a large number of surprising yet convincing examples) where humans are able to perform difficult "classification" tasks unexpectedly fast. For example, some art experts are able to tell apart genuine sculptures from fake ones virtually in a blink. Even more surprising: they are completely unable to explain what makes them think a specific sculpture is fake!

It thus seems (and there are plenty of psychological studies about this) that, with enough training, humans are able to learn very difficult classification tasks (in the classical Machine Learning sense), including tasks that are not natural.

Let me try to explain what is new here.
We know that humans are very good at learning certain classification tasks: young children can classify objects from images very easily and get a much better performance than any computer to date.
Also we know that once this has been learned, the actual classification of any new image can be done in a few milliseconds.
Hence, with enough training, the brain is able to perform this complicated task very easily, without requiring any conscious reasoning to take place.

However, I used to think that the tasks we can learn easily are those for which we have a sufficiently strong prior encoded into our genes. In other words, I thought that the ability to learn visual classification tasks was the result of a long natural evolution (which provides us with the appropriate pre-wiring, or the prior in Bayesian terms) combined with a short period of adaptation (similar to computing the posterior in Bayesian terms again).
What is new to me in this book is the following: we can be trained to perform tasks that have nothing to do with evolutionary constraints, and this training can be performed unconsciously (without any explicit or conscious reasoning). An example of this phenomenon is given in the book: a tennis coach once realized that he could predict whether a player would miss his serve just before the player hit the ball. However, he was unable to explain why and how he could do so!

This may show that our brain hosts a powerful learning engine (with a powerful feature extractor to isolate the relevant information) that does not even require our attention to be triggered and that can deal with many different learning tasks.
Of course this raises the question of the prior: we know that there is no universally best learning algorithm, only algorithms that are better adapted to particular learning problems. In other words, we can only learn the problems that have a large enough weight under the prior, which means it is hard to be good at many different tasks simultaneously.
Why is it that the prior encoded in our brain allows us to learn such useless tasks as telling whether a tennis player will miss his serve? And why is this prior not more "peaked" around the tasks that are really useful for our survival?
I guess this book is related to a lot of interesting cognitive science problems but it also revived my interest in human learning and its relationship to Machine Learning...

May 04, 2006 in General, Philosophy | Permalink | Comments (11) | TrackBack (0)

Extracting Information from People

It seems natural that the goal of any good Machine Learning algorithm should be to extract information from the available data.
However, when you are faced with practical problems, this is not enough. More precisely, data by itself does not hold the solution. One needs "prior knowledge" or "domain knowledge".
So far, nothing new.

But what is important is how to actually get and use this knowledge, and this is very rarely addressed or even mentioned!
My point here is that building efficient algorithms should mean building algorithms that can extract and make maximum use of this knowledge!
To achieve this, here are some possible directions:

  • A first step is probably to think about what are the natural "knowledge bits" one may have about a problem and how to formalize them. For example, it can be knowledge about how the data was collected, what the features mean, what kind of errors can be made in the data collection,...
  • A second step is to provide simple but versatile tools to encode prior knowledge: this can be done off-line, for example when using a probabilistic framework one can allow the probability distributions to be customized, or on-line (i.e. interactively) with a trial-and-error procedure (based on cross-validation or on expert validation).
  • There is also a possibility to go one level higher: knowledge is often gained by integrating very diverse sources of information. Humans (as learning systems) are never isolated: all the problems they can solve have some relationship to their environment. So ideally our systems should be able to integrate several sources and have some sort of meta-learning capability, rather than starting from scratch and focusing only on the dataset at hand every time a new dataset is to be used.

All the above explains the title of my post, and to be more precise, I even tend to think that research efforts should be focused on knowledge extraction from experts rather than from data!!!

Finally, I would like to give examples of such an extraction (we are not talking about sitting experts in a chair with electrodes connected into their brains! but just about providing software that can interact a bit with them).
Below is a (non-exhaustive) list of what can be learned from the user by a learning system:

  • Implicit knowledge (when the data is collected and put in a database)
    • data representation: the way the data is represented (the features that are used to represent the objects) already brings a lot of information and often a problem is solved once the appropriate representation has been found.
    • setting of the problem: the way the problem is set up (i.e. the choice of which variables are the inputs and which are the outputs, the choice of the samples...) also brings information.
  • Basic information (when the analysis starts)
    • choice of features: choosing the right features, ignoring those that are irrelevant...
    • choice of samples: choosing a representative subset, filtering...
    • choice of an algorithm
    • choice of parameters for this algorithm
  • Structural knowledge (usually incorporated in the algorithm design phase)
    • design of kernels, prior distribution
    • design of the algorithm
    • invariances
    • causal structures
  • Interactive knowledge: all the above can be repeated by iteratively trying various options. Each trial can be validated using data (cross-validation) or expertise (judging the plausibility of the built model).

As a final remark, let me just mention that the interactive mode is often used (although not explicitly) by practitioners who try several different algorithms and take the one that seems the best (on a validation set). Of course this gives rise to the risk of overfitting, especially because the information brought by the interaction is very limited. Indeed, it simply amounts to the validation error, which cannot be considered as knowledge: this kind of interaction simply brings in more data (the validation data) rather than more knowledge.
It would probably be interesting to formalize these notions a bit better...
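
The overfitting risk mentioned above can be illustrated with a small simulation. This is only a sketch under artificial assumptions: the labels are fair coin flips and every "algorithm" is a random guesser, so each one has a true error of 0.5. The function name and the sizes (200 candidates, 50 validation points) are hypothetical choices made for the illustration.

```python
import random

def best_of_k_validation_error(k, n_val, rng):
    """Validation error of the best of k random-guess classifiers.

    Labels are fair coin flips, so every classifier has a true error of 0.5;
    anything lower on the validation set is selection optimism.
    """
    labels = [rng.randint(0, 1) for _ in range(n_val)]
    best = 1.0
    for _ in range(k):
        preds = [rng.randint(0, 1) for _ in range(n_val)]
        err = sum(p != y for p, y in zip(preds, labels)) / n_val
        best = min(best, err)
    return best

rng = random.Random(0)
one = best_of_k_validation_error(1, 50, rng)     # a single candidate
many = best_of_k_validation_error(200, 50, rng)  # best of 200 candidates
print("validation error, 1 candidate:   ", one)
print("validation error, best of 200:   ", many)
```

The best-of-200 figure looks clearly better than chance even though nothing has been learned, which is the sense in which the validation set only contributes a little more data and no knowledge.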

April 18, 2006 in Data Mining, Machine Learning, Philosophy | Permalink | Comments (2) | TrackBack (1)

Freakonomics

I have just started reading this book: "Freakonomics" by Steven D. Levitt and Stephen J. Dubner. Although Levitt is a famous economist, this book is not about Economics. It has generated a lot of interest because it gives a very original, simple and friendly view about what pragmatic economics could be.

More precisely, the authors give a lot of real-world examples of questions one may ask about everyday life (society, politics, education,...) and how a proper and careful analysis of the available data can provide (sometimes surprising) answers to these questions. In my opinion, this book is very valuable to people interested in practical aspects of data analysis. Indeed, I see it more as a hands-on approach to data analysis than as a new approach to Economics.


What is most important to me is that all the examples given in the book point to the same conclusion: in order to extract relevant information from data, one needs to think a lot about how the data was collected, what the data means, and precisely which questions are to be answered using this data. This may seem disappointing to many Machine Learning researchers who tend to think that a good algorithm can solve most practical learning problems, but when it comes to actually helping people solve a practical problem (in a real-world situation), a good algorithm alone is never enough. One needs a lot of careful, rigorous, time-consuming and possibly boring investigations that cannot be automated and require significant knowledge and understanding of the data and what this data is about.

Anyway, this book is fun to read and I recommend it!

March 15, 2006 in Data Mining | Permalink | Comments (3) | TrackBack (0)

Machine Learning Blogs

Blogging is becoming increasingly popular, including among Machine Learning researchers.
Here are some interesting blogs about ML:

  • Yet Another Machine Learning Blog (Pierre Dangauthier)
  • Machine Learning Devotee (Mahdi Shafiei)
  • Business Intelligence, Data Mining & Machine Learning (José Carlos Cortizo Pérez)
  • Predict This! (Tilmann Bruckhaus)
  • Statistical Modeling, Causal Inference, and Social Science (Andrew Gelman)
  • Natural Language Processing (Hal Daumé)
  • MaLi @ backprop.net (Robert Wall)

Some other blogs deal with related topics (although less directly connected):

  • Decision Science News
  • Intelligent Machines (Damien François)
  • Enterprise Decision Management (James Taylor)

February 14, 2006 in Machine Learning | Permalink | Comments (2) | TrackBack (0)

Be Rational

Inductive inference usually proceeds in two steps. The first one consists of constructing a set of assumptions that summarize the knowledge one has about a phenomenon of interest prior to observing instances of this phenomenon. The second one consists of actually observing these instances and deriving new knowledge from this observation.

A possible question is: what principle may guide each of these steps?
A possible answer is: be as rational as possible. In other words, try to avoid inconsistencies.

Regarding the second step, it is sometimes possible to formulate the problem as a purely deductive one. Indeed, the question is "given such assumptions and given such data, what can I deduce?". For example, in a probabilistic framework, one would have a prior distribution and observations and would aim at obtaining an updated distribution. The rational way of doing this is to apply Bayes' rule.
In other settings, when the assumptions are not formulated in a probabilistic language, or when the objective is to optimize some sort of worst-case performance, other rules could be used.
The point is that once the objective is clearly and formally specified, rationality naturally leads to the solution via pure deduction.
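
As a minimal sketch of this deductive step in the probabilistic case, here is Bayes' rule applied to a finite set of hypotheses. The coin example (three candidate biases, a uniform prior, three heads and one tail) is a hypothetical illustration, not something taken from a specific application.

```python
def bayes_update(prior, likelihoods):
    """Posterior over hypotheses: prior times likelihood, renormalized."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: v / evidence for h, v in unnormalized.items()}

# Hypothetical assumptions: the coin has one of three biases, equally likely a priori.
biases = [0.25, 0.5, 0.75]
prior = {b: 1 / 3 for b in biases}

# Hypothetical observations: 3 heads and 1 tail.
heads, tails = 3, 1
likelihoods = {b: b ** heads * (1 - b) ** tails for b in biases}

posterior = bayes_update(prior, likelihoods)
for b in biases:
    print(f"P(bias={b} | data) = {posterior[b]:.3f}")
```

Once the assumptions (prior and likelihood) are written down, nothing is left to choose: the posterior follows mechanically.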

Regarding the first one (constructing the assumptions), the situation is less obvious. There are guiding principles though, which again rely on rationality.
One such principle is the principle of symmetry: if there is no reason to prefer one side of a coin to the other (or to assume that the two faces have different properties), simply consider them equally probable. A more elaborate version of this principle is the principle of maximum entropy: when choosing a prior distribution over a set of possibilities, choose, among the ones that are consistent with your prior beliefs, the one with maximum entropy.
Finally, there is also the principle of simplicity (Occam's razor), which suggests giving more prior weight to simple hypotheses than to complex ones.
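
To see how the symmetry principle falls out of maximum entropy in the simplest case, one can compare the Shannon entropy of a few candidate priors for a coin. The three candidates below are hypothetical and there are no constraints beyond normalization, so this is only a sketch of the idea.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution given as a list."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Hypothetical candidate priors over (heads, tails).
candidates = {
    "uniform": [0.5, 0.5],
    "slightly biased": [0.6, 0.4],
    "strongly biased": [0.9, 0.1],
}

for name, p in candidates.items():
    print(f"{name}: {entropy(p):.3f} bits")

best = max(candidates, key=lambda name: entropy(candidates[name]))
print("maximum entropy choice:", best)
```

With additional constraints (for example a known mean), the maximum entropy choice would generally no longer be uniform, but the selection rule stays the same.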

However, none of these principles can be justified in a formal way. One can surely construct settings where applying one specific principle is the "best" thing to do, but this is somewhat artificial and does not provide a justification.

Instead of proving things, I guess the best thing to do is to provide recommendations. One such recommendation is "be rational", or in other words, try to take into account every piece of evidence you may have before observing the data and to do this in a way that does not lead to contradictions and does not expose you to more risk than you are willing to accept. So in a way, inferences should take into account both your knowledge and your uncertainty and be calibrated according to what you are willing to lose if you fail.
I like the idea that performing an inference is like horse race gambling: you try to get as much information as you can about the horses, but you know there will always be some missing piece of information. Even if gambling is somewhat irrational, when you have no choice but to do it, better do it in the most rational way!

January 30, 2006 in Machine Learning, Philosophy | Permalink | Comments (4) | TrackBack (0)

More is Less

When we try to understand the causes of a phenomenon, there is a natural tendency to think that the more variables we measure, the more likely we are to identify the real cause.
This is generally true if there is no constraint on the number of experiments we can perform, but as soon as we work with a limited sample, adding more variables may lead to less accurate models.
More precisely, suppose we have the ability to perform, say, 100 independent experiments, and for each of these experiments we measure d "input" variables and one output variable and try to build a model to predict the output from the inputs. Then the larger d is, the more difficult it might be to build the model. There is some kind of optimal value of d: on the one hand, if one has measured too few variables, one may miss important information; on the other hand, if one has measured too many variables, it becomes impossible to distinguish between "true" correlations and distortions caused by the imperfection of the sampling mechanism (for example, it may happen that on these specific 100 experiments, one input variable that has nothing to do with the output is coincidentally correlated with the output).
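
The coincidental correlations mentioned above are easy to reproduce. The following sketch assumes that the output and all d inputs are independent standard Gaussian noise (so no input has anything to do with the output) and reports the largest absolute sample correlation found among the inputs for 100 experiments. The helper names are hypothetical.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def max_spurious_correlation(n, d, rng):
    """Largest |correlation| between the output and d pure-noise inputs."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    best = 0.0
    for _ in range(d):
        x = [rng.gauss(0, 1) for _ in range(n)]
        best = max(best, abs(correlation(x, y)))
    return best

rng = random.Random(0)
for d in (1, 10, 100, 1000):
    print(d, round(max_spurious_correlation(100, d, rng), 3))
```

With only 100 experiments, a single sample correlation fluctuates by roughly 0.1 around zero, so among a thousand irrelevant variables some will typically look noticeably correlated with the output.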

In order to study this phenomenon more precisely, one can imagine a framework where, in addition to obtaining the experiments by sampling iid from an arbitrary distribution, one considers that the variables themselves are obtained from a sampling process (also iid from some distribution). This is similar in spirit to the framework recently proposed by Krupka and Tishby, except that in their framework, only the variables (or features) are sampled, and not the examples. Note that the assumption that one samples the variables in an iid fashion does not mean that the variables are necessarily independent in the classical sense of having independent values. One should imagine an infinite matrix representing all possible measurements on all possible experiments for a given problem: rows would be experiments, columns would be measurements.
The framework consists in assuming that one randomly picks n rows and d columns from this matrix (possibly picking the same row or column several times to comply with the iid assumption).
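
A minimal sketch of this sampling scheme, with a large finite matrix standing in for the infinite one. The uniform sampling distribution, the matrix size and the function name `sample_nd` are all assumptions made for the illustration; the framework itself allows any distribution over rows and columns.

```python
import random

def sample_nd(matrix, n, d, rng):
    """Draw n rows and d columns iid (with replacement), return the n x d sample."""
    rows = [rng.randrange(len(matrix)) for _ in range(n)]
    cols = [rng.randrange(len(matrix[0])) for _ in range(d)]
    return [[matrix[i][j] for j in cols] for i in rows]

# Hypothetical stand-in for the infinite matrix:
# 1000 possible experiments (rows) x 500 possible measurements (columns).
rng = random.Random(0)
big = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(1000)]

sample = sample_nd(big, n=100, d=20, rng=rng)
print(len(sample), "experiments x", len(sample[0]), "variables")
```

Because rows and columns are drawn with replacement, the same experiment or the same variable can appear more than once in the sample, exactly as the iid assumption requires.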

Now the question is: can you obtain a bound on the error of a given learning algorithm when trained on such an (n,d) sample as a function of n and d?

This is probably still too vague to be answered, and one probably needs to put restrictions on the functions that are allowed. For example, a reasonable first goal would be to study the case where the target function is a (possibly countably infinite) linear combination of the variables (i.e. columns of that infinite matrix).
Intuitively, one would expect that the result depends on "how similar" the columns of the matrix are, and "how wide-spread" the coefficients of the linear combination are: you need to collect enough variables with high coefficients in the final linear combination, but not too many variables with low or zero coefficient.

January 25, 2006 in Machine Learning | Permalink | Comments (1) | TrackBack (0)

Bayesian brain

In a recent issue of The Economist, there is a very nice article (see here) about how everyday reasoning can be compared to Bayesian inference.
This article is based on a recent paper by Griffiths and Tenenbaum (see here). They asked several people questions such as "How long do you think a man who is xx years old will live?". It turns out that the answers matched very well with those which would have been obtained by applying Bayes' rule. Even more, they tried this with several different types of questions, for which the implicit priors are very different (Gaussian, Erlang or power-law distributions) and in all cases, the intuitive answers given by people had the right form (in terms of distribution).

What they conclude from this is that the way people intuitively reason about the world is quite similar to applying Bayesian inference.
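
To make the kind of computation involved concrete, here is a rough sketch of a Bayesian answer to the lifespan question. It is not taken from the paper: it assumes that the age at which we meet the person is uniformly distributed over their lifespan (so the likelihood of age t given a total lifespan is 1/total), it uses a hypothetical Gaussian prior with mean 75 and standard deviation 16 years, it caps lifespans at 150 years, and it reports the posterior median as the prediction.

```python
import math

def posterior_median_total(t, prior_pdf, max_total=150):
    """Posterior median of the total duration, given that t units have elapsed.

    Assumes p(t | total) = 1 / total for t <= total (we observe the process
    at a uniformly random point of its duration), on an integer grid.
    """
    totals = range(max(1, math.ceil(t)), max_total + 1)
    weights = [prior_pdf(total) / total for total in totals]
    half = sum(weights) / 2
    running = 0.0
    for total, w in zip(totals, weights):
        running += w
        if running >= half:
            return total

def gaussian_prior(total, mean=75.0, sd=16.0):
    # Hypothetical, unnormalized prior over lifespans (in years).
    return math.exp(-0.5 * ((total - mean) / sd) ** 2)

for age in (20, 40, 60, 80):
    print(age, "->", posterior_median_total(age, gaussian_prior))
```

Swapping `gaussian_prior` for an Erlang or power-law density changes the shape of the predictions as a function of the observed age, which is the kind of difference the study looked for in people's answers.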

What is intriguing is that the article in The Economist tries to see in this a proof that the Bayesian point of view dominates the frequentist one. Also, in their paper, Griffiths and Tenenbaum use the term "optimal" when they talk about Bayes' rule. I think this is very misleading and inaccurate.

Indeed, the only conclusion one should draw from this study is that the way people naturally make inferences about events in the world is very much rational. This confirms what has been observed many times before: our intuitive notion of rationality matches the rules of the calculus of probabilities very well.
But this is no surprise because these rules were designed in order to be intuitively rational (what else?). What is interesting is that rationality leads necessarily to these rules and no other, but this has been known for years.

I do not see what this study has to do with the debate between Bayesians and frequentists. First of all, there is no real opposition between these points of view. Indeed, they lead to the same rules for combining probabilities; the only difference is in the meaning that is associated with these probabilities. So this debate is mostly philosophical and should not interfere with cognitive science studies, nor (even less) with machine learning.

January 24, 2006 in Machine Learning, Philosophy | Permalink | Comments (8) | TrackBack (0)

When does sparsity occur?

Sparsity is a very useful property of some Machine Learning algorithms. Such an algorithm yields a sparse result when, among all the coefficients that describe the model, only a small number are non-zero. This is typically associated with interesting properties such as fast evaluation of the model (see the reduced set methods for obtaining sparse kernel expansions), fast optimization (e.g. in SVM, many algorithmic approaches exploit this fact), statistical robustness (sparsity is usually associated with good statistical performance), or other computational advantages (e.g. the ability to compute full regularization paths, for example in LASSO-style regression).

However I have not seen a clear explanation of this phenomenon. My feeling (I have no proof but it seems intuitively reasonable) is that sparsity is related to the regularity of the criterion to be optimized.
More precisely, the less regular the optimization criterion, the more sparse the solution may end up being.

The idea is that, for sparsity to occur, the value 0 has to play a special role, hence something unusual has to happen at the value 0. This something can be a discontinuity of the criterion or of one of its derivatives.

If the criterion is discontinuous at 0 for some variables, the solutions might get "stuck" at this value (provided it is a local minimum of course). If instead the criterion is continuous but has a derivative which is discontinuous at 0, it means that the criterion is V-shaped at 0, so that solutions might be "trapped" at this point. If we continue the reasoning, we see that the "attraction" of the point 0 is less and less effective as the regularity increases. When the function is twice differentiable everywhere, there is no reason for the solution to be "trapped" at 0 rather than ending up somewhere else.

This reasoning partly explains the sparsity of SVMs. Indeed, the standard L1-SVM (hinge loss) has a criterion whose derivative is discontinuous, the L2-SVM (squared hinge loss) has a criterion whose second derivative is discontinuous, and finally, for the LS-SVM (squared loss), the criterion is twice differentiable everywhere. It turns out that the most sparse is the L1 version and then the L2 version, while for LS-SVM there is no sparsity at all.

The same reasoning applies when one compares penalized least squares regression methods: when the penalty is the squared L2-norm of the weights, there is no sparsity, while with the L1-norm sparsity occurs, and with the L0-norm there is even more sparsity.
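
The penalized least squares comparison can be checked by hand in one dimension. The sketch below minimizes 0.5 * (w - z)^2 + penalty(w) for a single weight w, where z stands for the unpenalized solution; the closed forms follow from setting the (sub)derivative to zero. The values of z and of the penalty strength lam are hypothetical.

```python
import math

def ridge_1d(z, lam):
    # Penalty lam * w**2 (smooth at 0): plain shrinkage, never exactly zero for z != 0.
    return z / (1.0 + 2.0 * lam)

def lasso_1d(z, lam):
    # Penalty lam * |w| (kink at 0): soft thresholding, exactly zero when |z| <= lam.
    return math.copysign(max(abs(z) - lam, 0.0), z)

def l0_1d(z, lam):
    # Penalty lam * [w != 0] (jump at 0): keep z only if it pays for the penalty.
    return z if 0.5 * z * z > lam else 0.0

zs = [-2.0, -0.6, -0.1, 0.05, 0.3, 0.8, 1.5]  # hypothetical unpenalized weights
lam = 0.5                                      # hypothetical penalty strength

solvers = {"L2": ridge_1d, "L1": lasso_1d, "L0": l0_1d}
counts = {}
for name, solver in solvers.items():
    solution = [solver(z, lam) for z in zs]
    counts[name] = sum(w == 0.0 for w in solution)
    print(name, [round(w, 3) for w in solution], "zeros:", counts[name])
```

The smooth penalty only rescales the weights, the kinked penalty sets the small ones exactly to zero, and the discontinuous penalty (for this value of lam) zeroes even more of them, which matches the intuition that less regularity at 0 means more sparsity.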

I am wondering whether there is any mathematical treatment of these issues anywhere in the Machine Learning literature. If anyone has a pointer, please let me know.

November 08, 2005 in Machine Learning, Theory | Permalink | Comments (32) | TrackBack (0)

Can a computer think?

I recently came across the webpage of Jeffrey Shallit, a very impressive computer scientist, and I saw he gave a talk on a topic that can be of interest to people reading this blog: Can a Computer Think?
The slides are well documented and comprehensive, and he also has a reading list associated with this talk on his website.
What I especially like about this talk is that it gives an interesting historical perspective, showing how many people had predicted that computers would achieve some task in the near future, and that none of these predictions were correct.

Also of interest is the quote by Hofstadter who essentially says that "intelligence" is what computers cannot do. Indeed, once computers can do something, we start to think that it does not require intelligence.

But if the definition of "thinking" is very controversial, it might be a better choice to focus on simpler things like "learning" and ask the question "Can a computer learn?".
Of course, ML researchers are exactly after that, and to some extent, it is clear that computers can learn.

However, if we try to define more precisely what learning is, there are several issues. In particular, there are at least three levels at which we can define the learning phenomenon:

  1. Low-level: Ability to adapt to a (changing) environment
  2. Medium-level: Ability to perform a task or to improve at performing a task without being taught explicitly (by practice or imitation)
  3. High-level: Ability to infer general laws from particular instances (induction)

The first level is somewhat "unconscious" and is something that could be said of most animals.
The second level is also something many animals can do.
The third level is more "conceptual" and seems to require some "thinking". But this is not necessarily an exclusively human ability: indeed, when a dog learns that bringing back the stick will earn him a pat, this is also some kind of induction.

I am not sure the above distinction really makes sense and it might be impossible to say which form of learning actually occurs in a specific situation.
However, computers have clearly demonstrated all of them, at least in a very simple way.

November 03, 2005 in General, Philosophy | Permalink | Comments (47) | TrackBack (0)

Can the study of human learning help?

The question is whether, in order to make progress toward building learning machines, it is necessary to study the only available examples of such machines we have so far: animal brains (and more specifically human ones).
People who would answer no to this question often cite the example of planes: the first successes for building flying machines were obtained when people stopped trying to imitate birds. So it seems that understanding how Nature solved the problem may not always help. One reason is that animal brains were not designed only to solve the learning problem, just as birds were not designed to solve the flying problem. It is only a byproduct of an evolution that was mainly geared towards survival and adaptation to certain environmental conditions.

Despite these considerations, there have been many attempts to build bridges between the study of natural and artificial learning. For example, people working on artificial neural networks or on genetic algorithms were never ashamed of using biological findings as a source of inspiration.

My feeling is that it is fine to be interested both in artificial and natural learning provided the following is accepted:

  • First of all it is not necessary to start from natural learning to develop a theory of artificial learning, and there is no need for this theory to explain the specifics of natural learning.
  • Second, it is important to abstract away from natural learning in order to formulate precisely what learning means.
  • However, any source of inspiration is good, especially when one gets stuck, so why not look for inspiration in natural learning?

I used to think that the work being done in cognitive psychology was too specifically human to be of any interest for people working on learning theory. Also I was very suspicious about whatever was said to be "biologically inspired" or "cognitively inspired". However, the remarkable efforts that some cognitive psychologists have made to abstract concepts suggest that I might have been too critical.

In a recent interview, Tom Mitchell, a famous ML researcher, expressed similar views:

[The interviewer] - Learning the brain’s algorithms for doing things is very difficult, and is not very well understood as yet. Do you ever find it frustrating trying to get computers to learn things that we ourselves don’t know the inner workings of?
[Tom Mitchell] - That’s actually a very interesting observation—I actually don’t get frustrated by that—why? I don’t know!
Maybe it’s odd, but it’s true that much of the work in machine learning—how to get computers to learn—has been kind of unguided by anything we know about human learning. It just grew up on its own—“ok, how would we engineer this system to look at a lot of data and discover regularities?”—so people engineered those instead of looking at how humans do it and then trying to duplicate it. But recently, because I’ve been looking at the brain, I’ve been starting to learn more about what people know about human learning—and it’s very different. For example, when we humans learn, a big part of what determines whether we succeed or not is all about motivation. And there’s nothing in machine learning algorithms that even remotely corresponds to motivation. So it’s just a very different phenomenon…maybe in 10 years we’ll understand it better, but right now, the two are very different.

Also, I recently heard about the work of Alison Gopnik, who studied the way children learn causes. She draws some interesting connections between the causal structure that children infer and graphical models such as Bayesian networks. She also explains that the way children learn is very "multivariate" in the sense that they try many things at a time and extract causal relationships easily from multidimensional observations. In other words, it is not necessary for them to act on one knob at a time to understand how a machine works.

October 30, 2005 | Permalink | Comments (7) | TrackBack (0)
