Machine Learning Thoughts

Some thoughts about philosophical, theoretical and practical aspects of Machine Learning.

Favorite Links

  • Publications
  • Homepage

Categories

  • Artificial Intelligence
  • Data Mining
  • General
  • Links
  • Machine Learning
  • Personal
  • Pertinence
  • Philosophy
  • search engine
  • Theory

Related blogs

  • Sam Cook
  • Group blog
  • Grant Ingersoll
  • Hal Daume III
  • ?Notes
  • Fernando Diaz
  • Matthew Hurst
    Director of Science and Innovation, Nielsen BuzzMetrics; co-creator of BlogPulse.
  • Daniel Lemire
  • Leonid Kontorovich
  • Cognitive Daily

Archives

  • February 2007
  • November 2006
  • September 2006
  • June 2006
  • May 2006
  • April 2006
  • March 2006
  • February 2006
  • January 2006
  • November 2005

Favorite Books

  • Advanced Lectures on Machine Learning: MLSS 2003 (Olivier Bousquet, Ulrike von Luxburg, Gunnar Rätsch eds)
  • Algorithmic Learning in a Random World (Vladimir Vovk, Alex Gammerman, Glenn Shafer)
  • Probability and Finance: It's Only a Game! (Glenn Shafer, Vladimir Vovk)

Machine Learning (Theory)


The Failure of AI

In the early days of AI, scientists thought they would be able to build an intelligent computer by the end of the 20th century. This raised various fears about computers eventually taking over the world and about human beings being replaced by robots.
Not only has this not happened yet, but we are very far from it!
What is even worse is that we are now following a somewhat opposite trend. There are many tasks at which humans are far better than computers, but instead of trying to build better algorithms for these tasks, people are now trying to find ways to make better use of human intelligence, or rather to automate this usage!

Two examples of this: Luis von Ahn's "Artificial artificial intelligence" and Amazon's "Mechanical Turk".

Luis von Ahn has designed a couple of internet games whose purpose is to make players perform useful tasks such as labeling images. Amazon is taking this to an industrial scale (although no longer as a game) by allowing people to design programs which include calls to web services that are actually executed by paid humans (e.g. your program calls "translate(text)", the text is sent to someone who translates it, and the result is returned)!
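
To make this concrete, here is a minimal sketch (in Python) of what such a "human-powered" web service call could look like from the programmer's side. The endpoint URL and the field names are hypothetical, for illustration only; this is not Amazon's actual Mechanical Turk API.

    # Hypothetical sketch: a function call whose answer is produced by a paid human.
    # The endpoint URL and the JSON fields are made up for illustration.
    import requests

    HUMAN_TASK_ENDPOINT = "https://example.com/human-tasks"  # hypothetical service

    def translate(text, target_language="en"):
        """Submit a translation task that a human worker will complete."""
        task = {"type": "translation", "payload": text, "target": target_language}
        response = requests.post(HUMAN_TASK_ENDPOINT, json=task, timeout=600)
        response.raise_for_status()
        # From the program's point of view this is just another web service call,
        # even though a person produced the answer.
        return response.json()["result"]

    print(translate("Le chat dort sur le tapis."))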

Update (21/12/2006): I found a related article in the Boston Globe, citing other examples of the same kind, such as Mozes Mob, a cell phone Q&A service powered by humans.

November 30, 2006 in Artificial Intelligence, Data Mining, Machine Learning | Permalink | Comments (45) | TrackBack (1)

Extracting Information from People

It seems natural that the goal of any good Machine Learning algorithm should be to extract information from the available data.
However, when you are faced with practical problems, this is not enough. More precisely, data by itself does not hold the solution. One needs "prior knowledge" or "domain knowledge".
So far, nothing new.

But what is important is how to actually get and use this knowledge, and this is very rarely addressed or even mentioned!
My point here is that building efficient algorithms should mean building algorithms that can extract and make maximum use of this knowledge!
To achieve this, here are some possible directions:

  • A first step is probably to think about which natural "knowledge bits" one may have about a problem and how to formalize them. For example, it can be knowledge about how the data was collected, what the features mean, what kind of errors can be made in the data collection, and so on.
  • A second step is to provide simple but versatile tools to encode prior knowledge. This can be done off-line (for example, in a probabilistic framework one can allow the probability distributions to be customized; a small sketch follows this list) or on-line (i.e. interactively), with a trial-and-error procedure based on cross-validation or on expert validation.
  • There is also the possibility of going one level higher: knowledge is often gained by integrating very diverse sources of information, and humans (as learning systems) are never isolated: every problem they can solve has some relationship to their environment. So ideally our systems should be able to integrate several sources and have some sort of meta-learning capability, rather than starting from scratch every time a new dataset is to be used and focusing only on this specific dataset.
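
As an illustration of the second point, here is a minimal sketch of encoding an expert's belief as a prior distribution that then gets updated by the data. The scenario and the numbers are hypothetical; the point is only that the expert's knowledge enters through the choice of the prior.

    # Minimal sketch: encoding expert knowledge as a customizable prior.
    # Scenario and numbers are hypothetical.
    from scipy import stats

    # Expert belief: the defect rate is around 2%, with moderate confidence.
    # Encoded as a Beta(2, 98) prior (mean 0.02).
    prior_alpha, prior_beta = 2.0, 98.0

    # Observed data: 3 defects among 50 inspected items.
    defects, inspected = 3, 50

    # Bayesian update: the posterior is again a Beta distribution.
    posterior = stats.beta(prior_alpha + defects, prior_beta + (inspected - defects))

    print("posterior mean defect rate:", posterior.mean())
    print("95% credible interval:", posterior.interval(0.95))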

All of the above explains the title of this post; to be more precise, I even tend to think that research efforts should be focused on knowledge extraction from experts rather than from data!

Finally, I would like to give examples of such an extraction (we are not talking about sitting experts in a chair with electrodes connected to their brains, but just about providing software that can interact a bit with them).
Below is a (non-exhaustive) list of what can be learned from the user by a learning system:

  • Implicit knowledge (when the data is collected and put in a database)
    • data representation: the way the data is represented (the features that are used to describe the objects) already brings a lot of information, and often a problem is solved once the appropriate representation has been found.
    • setting of the problem: the way the problem is set up (i.e. the choice of which variables are the inputs and which are the outputs, the choice of the samples...) also brings information.
  • Basic information (when the analysis starts)
    • choice of features: choosing the right features, ignoring those that are irrelevant...
    • choice of samples: choosing a representative subset, filtering...
    • choice of an algorithm
    • choice of parameters for this algorithm
  • Structural knowledge (usually incorporated in the algorithm design phase)
    • design of kernels or prior distributions (see the kernel sketch after this list)
    • design of the algorithm
    • invariances
    • causal structures
  • Interactive knowledge: all the above can be repeated by iteratively trying various options. Each trial can be validated using data (cross-validation) or expertise (judging the plausibility of the built model).
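
To illustrate the "structural knowledge" items above, here is a minimal sketch (a toy example of mine, using scikit-learn) where the knowledge that a signal is roughly periodic is encoded directly in the kernel of a Gaussian process.

    # Minimal sketch: encoding structural knowledge (known periodicity) in a kernel.
    # The data is synthetic; the point is only where the knowledge goes.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import ExpSineSquared, WhiteKernel

    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(0, 10, 40)).reshape(-1, 1)
    y = np.sin(2 * np.pi * X.ravel() / 2.5) + 0.1 * rng.randn(40)

    # Domain knowledge: the phenomenon repeats itself with a period of about 2.5.
    kernel = ExpSineSquared(length_scale=1.0, periodicity=2.5) + WhiteKernel(noise_level=0.01)
    model = GaussianProcessRegressor(kernel=kernel).fit(X, y)

    # The model extrapolates periodically beyond the observed range, which a
    # generic kernel chosen from the data alone would not guarantee.
    print(model.predict(np.array([[12.0]])))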

As a final remark, let me just mention that this interactive mode is often used (although not explicitly) by practitioners who try several different algorithms and take the one that seems best (on a validation set). Of course this gives rise to a risk of overfitting, especially because the information brought by the interaction is very limited. Indeed, it simply amounts to the validation error, which cannot be considered knowledge: this kind of interaction simply brings in more data (the validation data) rather than more knowledge.
It would probably be interesting to formalize these notions a bit better...
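
To make this "try several algorithms, keep the best" loop concrete, here is a minimal sketch on a toy scikit-learn dataset. As noted above, the cross-validation scores that drive the choice are just more data, and reusing them too aggressively amounts to overfitting the selection.

    # Minimal sketch of the usual "try several algorithms, keep the best" loop.
    # Toy dataset; the selection itself can overfit if repeated too often.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    candidates = {
        "logistic regression": LogisticRegression(max_iter=5000),
        "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "SVM (RBF)": SVC(),
    }

    # Each trial is "validated using data"; no real knowledge is injected.
    scores = {name: cross_val_score(model, X, y, cv=5).mean()
              for name, model in candidates.items()}
    print(scores)
    print("selected:", max(scores, key=scores.get))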

April 18, 2006 in Data Mining, Machine Learning, Philosophy | Permalink | Comments (2) | TrackBack (1)

Freakonomics

I have just started reading this book: "Freakonomics" by Steven D. Levitt and Stephen J. Dubner. Although Levitt is a famous economist, this book is not about Economics. It has generated a lot of interest because it gives a very original, simple and friendly view of what pragmatic economics could be.

More precisely, the authors give a lot of real-world examples of questions one may ask about everyday life (society, politics, education,...) and how a proper and careful analysis of the available data can provide (sometimes surprising) answers to these questions. In my opinion, this book is very valuable to people interested in practical aspects of data analysis. Indeed, I see it more as a hands-on approach to data analysis than as a new approach to Economics.


What is most important to me is that all the examples given in the book combine to show that, in order to extract relevant information from data, one needs to think a lot about how the data was collected, what the data means, and precisely which questions are to be answered using this data. This may seem disappointing to many Machine Learning researchers, who tend to think that a good algorithm can solve most practical learning problems, but when it comes to actually helping people solve a practical problem (in a real-world situation), this is never the case. One needs a lot of careful, rigorous, time-consuming and possibly boring investigation that cannot be automated and requires significant knowledge and understanding of the data and of what this data is about.

Anyway, this book is fun to read and I recommend it!

March 15, 2006 in Data Mining | Permalink | Comments (3) | TrackBack (0)

Building models: what for?

A large amount of the effort in Machine Learning research is devoted to building predictive models. This means trying to infer, from labeled examples, a model that can later be used for making predictions on new instances. This problem of building models for prediction is relatively well understood, although only partially solved today. But there are plenty of other reasons for building models, and this may drive a large part of the future research in this field.

For example, my current work is to investigate how one can build models for understanding, monitoring and controlling complex systems or processes. Of course, this is not new as such, but I want to emphasize here that this is an area people in Machine Learning have seldom studied, although ML techniques could very well be adapted to yield better solutions than existing approaches.

Let us leave understanding aside (as this is a very debatable issue and should be discussed separately) and focus on monitoring and control. There are plenty of methods for monitoring and controlling systems or processes. Many rely on models that are built from knowledge, some rely on statistical models (built from observations). The problem is that models built from knowledge are usually very sophisticated and specialized, while statistical models are very simple and generic. There thus seems to be a gap in between.
Machine Learning is very advanced in terms of automatically building sophisticated models from observations while incorporating prior knowledge. The issue of trading off the complexity of models against their statistical properties (e.g. overfitting) has been thoroughly investigated in this field.
As a result, ML is particularly suited for providing new ways of building sophisticated, yet reliable models.

However, ML researchers have focused their efforts on prediction error, whereas when you need to control a process, prediction error is not what matters.
To make my point more precise, let me give an example. Assume you are trying to cook a nice steak. Depending on the thickness of the piece of meat, you might have to grill it for a longer or shorter time. Your goal is to have a steak that tastes good (tender, not cold, not burnt...), and for that you cannot modify the thickness of the meat but can only act on the time you leave it on the grill.
Of course, if you have never done it before, it is likely that you will fail to get a good steak the first few times, but after a while you will have a good model of the right cooking time for a given thickness.

To translate this into ML terms, your input variables are the steak thickness (S) and the grill time (G), and your output is the taste (T).
If you try to purely optimize the prediction error, you would look for a function f(S,G) that approximates T, in the sense that |f(S,G) - T| should be small on average over the distribution of (S,G).
Hence you need to be able to accurately predict the taste for all values (within the observation range) of the pair (S,G).

However, in order to solve your problem, it is sufficient to find, for each value of S, a value of G that guarantees that T is good. This means that you do not need to build a full model (for all pairs (S,G)) but you only need a model which allows you to determine one good grill time and can be wrong otherwise.
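
Here is a minimal sketch of this difference on synthetic "steak" data (the taste function and the numbers are invented): a regression model is fit to predict taste from thickness and grill time, but for control we only query it to pick a grill time for a given thickness, so the model can be wrong elsewhere without hurting us.

    # Minimal sketch: using a predictive model only to choose a control input.
    # Synthetic data: taste peaks when the grill time matches the thickness.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(0)
    S = rng.uniform(1.0, 5.0, 300)              # steak thickness (cm)
    G = rng.uniform(1.0, 15.0, 300)             # grill time (minutes)
    T = -((G - 3.0 * S) ** 2) + rng.randn(300)  # taste, best near G = 3 * S

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(np.column_stack([S, G]), T)

    # Control: for a given thickness, search over grill times and keep the best.
    thickness = 2.5
    grid = np.linspace(1.0, 15.0, 200)
    pred = model.predict(np.column_stack([np.full_like(grid, thickness), grid]))
    print("chosen grill time:", grid[np.argmax(pred)])  # only the argmax matters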

This shows that by rephrasing the objective from prediction to control, you get a different problem, so finding the most predictive model might not be the best thing to do.
In recent years, ML researchers have started looking at slightly different loss functions (than the simple prediction error) and settings. My feeling is that this will continue and possibly drift towards loss functions that correspond to control (since many real-life problems are control problems rather than prediction ones).

October 08, 2005 in Data Mining, Machine Learning, Pertinence | Permalink | Comments (1) | TrackBack (0)

Scientific Names and Their Relationship

This is a follow-up to this post and to the comments made about my previous post.

I have used the idea of Rudi Cilibrasi and Paul Vitanyi (see their preprint here) of extracting a "semantic" distance between terms from Google page counts.
So I ran the following experiment: I did Google searches for the terms Statistics, Statistical, Data Analysis, Data Mining and Machine Learning.
The individual page counts came as follows:

Statistics 573000000
Statistical 158000000
Data Analysis 38600000
Data Mining 17500000
Machine Learning 5250000

Of course Statistics is the most common. One reason is that it is not exclusively associated with a scientific discipline.
Then, I used the so-called "Normalized Google Distance" (NGD) to assess the relationship between these terms. Here is the result:

                    Statistics   Statistical   Data Analysis   Data Mining   Machine Learning
Statistics            0.00          0.53            0.73           0.93            0.82
Statistical           0.53          0.00            0.49           0.72            0.65
Data Analysis         0.73          0.49            0.00           0.54            0.61
Data Mining           0.93          0.72            0.54           0.00            0.39
Machine Learning      0.82          0.65            0.61           0.39            0.00

It is interesting to notice that Machine Learning is more related to Statistics than is Data Mining. Also, Machine Learning and Data Mining are very close to each other, while Data Analysis is closer to Statistics.
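
For reference, the Normalized Google Distance is computed from page counts as NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(.) denotes page counts and N is the (estimated) total number of indexed pages. Here is a minimal sketch; the joint count and the value of N below are placeholders for illustration, not actual query results.

    # Minimal sketch of the Normalized Google Distance (Cilibrasi & Vitanyi).
    # The joint count and N below are placeholders, not actual search results.
    from math import log

    N = 8e9  # assumed total number of indexed pages (rough placeholder)

    def ngd(fx, fy, fxy, n=N):
        # Normalized Google Distance from individual and joint page counts.
        return (max(log(fx), log(fy)) - log(fxy)) / (log(n) - min(log(fx), log(fy)))

    # Individual counts from the list above; the joint count is hypothetical.
    f_dm, f_ml, f_dm_ml = 17500000, 5250000, 2000000
    print(round(ngd(f_dm, f_ml, f_dm_ml), 2))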

I also tried to compute the NGD between these terms and the term "Company" in order to see which one had penetrated the corporate world the most. The results were not very enlightening. However, if you look at the page counts for the phrase "Statistics Company", or replace Statistics by the other terms, you get the following:

Statistics Company 23100
Data Mining Company 10500
Data Analysis Company 623
Statistical Company 163
Machine Learning Company 140

Interestingly, Machine Learning is very seldom associated with "company", while Data Mining companies seem to abound.

Another possible measure is the number of ads you get when you do these searches. I noticed that the searches "Data Mining" and "Data Mining Company" return an incredible number of ads for data mining software and companies (many more than statistics or machine learning).

So there is still a lot to be done in order to make Machine Learning better recognized...

September 19, 2005 in Data Mining, General, Machine Learning | Permalink | Comments (0) | TrackBack (0)

Data in the Corporate World

Dealing with data is becoming an important part of the job of most large companies. Within this area, tasks such as storing and managing data are now well mastered, so that the key capability becomes the analysis, or leveraging, of this data.
Hence Data Mining is becoming an increasingly important concern for most hi-tech companies. This trend is illustrated by the recent creation of the CDO (Chief Data Officer) title at Yahoo! (see here), which has been given to a former Data Mining researcher.
Another indication of this trend can be found in the educational domain: most computer science departments in the big US universities now offer courses in Data Mining or Machine Learning.
These terms are also now known to people far removed from the scientific world.

So it seems that Machine Learning is no longer an obscure research field, but more and more a popular technological domain.

September 14, 2005 in Data Mining, General, Machine Learning, Pertinence | Permalink | Comments (6) | TrackBack (0)

Is there an optimal learning algorithm?

To someone outside the learning community, it may seem that researchers spend their time looking for THE optimal learning algorithm, that is, a completely generic algorithm which would beat all the others. One may even think that such an algorithm would be so sophisticated that researchers can only approach it incrementally, which would explain why progress is relatively slow in this area.
However, this is a serious misinterpretation of what is going on in this research field.

First of all, there is no such thing as an optimal learning algorithm, and my point here is to explain why this is so.
There are at least three possible explanations and I will go from the most informal to the most formal one:

  1. Any algorithm has some bias: given a data sample, a learning algorithm typically builds a function (or a model) that agrees (to a certain degree) with this data and that is able to make predictions for new data, that is, it extrapolates the data. However, for each data sample there are infinitely many ways to extrapolate it, and each learning algorithm does it in its own way. The bias of a learning algorithm can be thought of as the way this algorithm ranks the possible functions. The point is that in order to build a function, one needs a way to decide which function to pick among all the functions that (at least partially) agree with the data. So the question is whether there could exist some sort of optimal ranking of functions, or optimal way to decide which function to pick. The problem is that, given a learning problem characterized by the function to be learned, one can always construct an algorithm that performs optimally, simply by choosing a ranking that puts this particular function first. So there is always an optimal algorithm for each problem, but this optimal algorithm will necessarily be sub-optimal on other problems. Roughly speaking, there is no way to have good performance simultaneously on all problems (the sketch after this list shows two biases extrapolating the same data differently).
  2. There exist several results that make this more precise, in particular the so-called No Free Lunch (NFL) theorem. This theorem essentially says that if you consider all possible learning problems, all learning algorithms have the same performance on average over these problems. As a consequence, a learning algorithm that performs well on some problems will necessarily perform poorly on others to balance this. Thus there cannot exist an optimal and universal learning algorithm.
  3. A more involved version of the NFL theorem is the Slow Rate Theorem, which states that for any learning algorithm, if you choose any sequence of numbers that converges to zero, there exists a learning problem such that the algorithm's generalization error on that problem will converge to zero more slowly than the chosen sequence (as the sample size increases). In other words, a learning algorithm may converge to the optimal solution arbitrarily slowly and can have arbitrarily poor performance for any fixed sample size.
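
As a small illustration of point 1, here is a sketch (toy data, scikit-learn) where two algorithms with different biases agree on the training points but extrapolate quite differently; neither ranking of functions is "the right one" in general.

    # Minimal sketch: two learning algorithms, same training data, different bias.
    # Both fit the training points, but they extrapolate in different ways.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor

    X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_train = np.array([0.0, 1.0, 2.0, 3.0])

    linear = LinearRegression().fit(X_train, y_train)                   # bias: straight lines
    nearest = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)  # bias: copy the closest point

    X_new = np.array([[10.0]])
    print("linear model:", linear.predict(X_new))   # continues the trend: 10.0
    print("1-NN model:  ", nearest.predict(X_new))  # sticks to the nearest label: 3.0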

All this implies that there cannot be an algorithm that is both universal (that can learn any problem) and optimal (that performs better than the others on all problems).
So the only thing we can hope for is an algorithm that has good properties for a restricted set of problems. This is not so bad, however, since you can assume that the problems you encounter in the real world are somehow well behaved and do not span the space of all possible problems.

As a conclusion, what Machine Learning researchers do is not to look for THE optimal algorithm, but to look for a learning algorithm that is optimal for a small set of learning problems, namely the "real-world problems".

August 11, 2005 in Data Mining, Machine Learning, Philosophy | Permalink | Comments (3) | TrackBack (0)