

Ben Rubinstein

Hi Olivier, great blog! I've also noticed this phenomenon.

At the same time, however, I also feel that laypeople (users of Machine Learning, business people, politicians, other scientists/academics) are much more aware of the terms "Data Mining" and "Statistics" than of "Machine Learning". Sure, one can point to distinguishing features (e.g. association rule mining and hypothesis testing are usually covered solely under the auspices of data mining and statistics, respectively). But by and large the three areas are pretty similar in the problems they tackle and the approaches they favour. At least, I don't think fundamental differences explain the discrepancy.

Furthermore, I get the impression that the public has a significantly better intuition about what "Data Mining" means than they do for "Statistics". And when I mention "Machine Learning", I am regularly met with blank stares. "Learning to use computers?" is an uncommon but recurring response.

Is this a case of good vs. bad naming? Or are "data mining" researchers more involved in industry? Whatever the reason, I'm always happy to explain what I do and how Machine Learning relates to Statistics. I find that people are always happy to listen.

Olivier Bousquet

Thanks for the comment (the first one on this blog!!).

I agree with you: Machine Learning is still pretty much unknown to the average person. There is still a lot to do to get this field more attention from the outside world.

Regarding the issue of naming, I tend to be pessimistic about the possibility of finding a good name for the domain. Recently, I tried to give some definitions (see the July 2005 archive) of Machine Learning, Data Mining and Statistics. However, my overall feeling is that names are dangerous because they either carry more meaning than they should, are misused, or are misleading. As an example, statistics (in the sense of statistical methods of data analysis) has a very specific interpretation in the corporate world: it refers to a set of standard methods for analyzing data, essentially what you would learn in a first course on (classical) statistics: univariate, parametric, under normality assumptions...
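To make that corporate meaning of "statistics" concrete, here is a minimal sketch (assuming Python with NumPy and SciPy, and made-up data) of the kind of univariate, parametric, normality-assuming analysis meant:

```python
# Sketch of "classical statistics" as it is often understood in the
# corporate world: a univariate, parametric test under normality
# assumptions. The data here is synthetic, purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=100.0, scale=10.0, size=50)  # e.g. a baseline metric
after = rng.normal(loc=105.0, scale=10.0, size=50)   # e.g. after some change

# Two-sample t-test: assumes normally distributed data, compares the means.
t_stat, p_value = stats.ttest_ind(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Nothing here goes beyond a first statistics course, which is precisely the point being made above.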

I think that this issue of naming is really fun to explore, and very important for the success of the field: I do believe that it is important to have a good name if you want to get grants, people's attention, media coverage and so on...

In my next post I will give some quantitative results about the relationship between the names...


Hi Olivier!

I just discovered your blog, thanks for letting us know about your thoughts, that's very helpful for us students.

I certainly agree with your analysis: information systems seem to be ripe for us to start applying interesting methods. As a researcher who handles real-life data, which sectors do you think are the most competitive in this respect? That is, sectors that have now completed their transition to modern data warehouses and which really face challenges?

I would spontaneously think of credit-scoring companies, banks & hedge funds for trading, and of course software companies.

What may come before that, according to you? What comes next, and in what order? Pharmas? Retail/supermarkets, etc.?

Finally, do you think the use of machine learning/data mining might boom in some sectors in the coming years? That is, sectors which might be completely reshaped by machine learning and which may just be finishing their information-system transition? [I'm not asking for business ideas here ;)]



Olivier Bousquet

Hi Marco,

This is a very interesting question. Although it looks like you are trying to get crucial (and possibly confidential) business information from me ;) I will try to answer as accurately as possible.

There are actually many ways to approach the question. Here are two possible rephrased versions:
1) Which real-life problems are difficult enough but mature enough to be interesting to ML researchers, and are likely to attract a lot of attention from them in the next couple of years?
2) Which problems are likely to generate a lot of business for Machine Learning companies (such as Pertinence ;)?

Regarding the first question, it seems that in the ML community, there is a general feeling that simple classification problems are more or less solved, and the interesting problems are those that involve completely different types of data (biological sequences, graphs, relational data, brain imaging data, time-series, differential equations...) or different settings (ranking, semi-supervised, reinforcement learning, collaboration, networks of sensors...)

However, I think that many basic problems remain only partially understood. Also, looking at the kinds of problems that occur in the real world (not the so-called "real-world problems" of the ML community, but those that companies are actually faced with), there is a lot of room for scientific exploration even though the data they involve may seem pretty basic. I will come back to this later.

Regarding the second question, there are several things to keep in mind. The good problems are those that are mission-critical for businesses and of which one part (often very small, although possibly unavoidable) can be solved by ML techniques.
But even in such problems, ML may make the difference only if, as you point out, the data exists in an appropriate form (e.g. inside a clean and comprehensive database). And even then, in many problems what matters is not finding the best possible solution, but getting quickly to a reasonably good one (the so-called 80-20 trade-off). In such problems, a very simple engineering approach is likely to yield a perfectly acceptable solution, even though a good ML researcher could reach much better accuracy with enough time and work.

So the problems that are critical, involve proper databases, and cannot be solved to any reasonable degree by simple techniques, are those that one should consider.
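The 80-20 point can be sketched numerically. In this toy comparison (assuming Python with scikit-learn and a synthetic dataset, purely for illustration), a plain logistic regression typically lands within a few accuracy points of a heavier, more carefully engineered model:

```python
# Toy illustration of the 80-20 trade-off: a simple baseline often gets
# most of the way to the accuracy of a more sophisticated model.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The "simple engineering approach": an off-the-shelf linear model.
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# The "ML researcher's" heavier model.
fancy = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print(f"baseline accuracy: {baseline.score(X_te, y_te):.3f}")
print(f"boosted  accuracy: {fancy.score(X_te, y_te):.3f}")
```

Whether the remaining gap is worth closing depends on the business, which is exactly the argument made above.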

I realize that I am not answering your question (I did not tell which are the right domains), but my feeling is that it is more a matter of finding the right type of problem than the right domain.

In any case, there is a huge gap between the state of the art in Machine Learning research and what people in companies use when they analyze data.
But the main reason is not that people in companies do not understand ML; it is rather that in most real-world problems, the goal is not just to build an accurate model, but to understand which actions have to be taken in order to reach a desired objective.


Thanks for these insights.

I feel pretty much the same way about the limitations and challenges offered by "true real-world problems": few sectors are really interested in getting a cutting-edge prediction; rather, they want to secure or consolidate an additional block in their decision chain.

Reaching an 80% performance level seems to be enough for most companies, while we as researchers are specifically interested in scraping out a few more percent. Hence my comment on finance/hedge funds, for whom raw performance is crucial because it keeps translating into more $$ up to the last percent. And I'm still wondering who else, apart from pharmas & biotechs, would benefit from more clever ML techniques in the future...

Manufacturers of electronics which, once miniaturized enough, will seek cleverer interfaces to keep an edge on the competition?

Software companies dedicated to programming better robots once the mechanics are completely worked out?

Public administrations having to screen millions of measurements (for public health & security reasons)?

Insurance companies that try to do better than actuarial techniques? (This is crucial, since a lot of money is involved too...)

I'm really wondering which sectors may pull most of the research effort in the coming years, beyond the ones we're used to (Google et al.)... that is, where the surprise would come from.

Another issue I'm wondering about is how frequently companies update their statistical tools.

I guess many software companies are still living on the neural-network architectures they implemented in the '80s and early '90s, that some blue chips still use old regression tools the same way as decades ago, and I feel that stochastic calculus got solidly installed in banks in the '90s and is here to last a little while longer, just because people in banks are used to it.

I'm wondering whether and when ML techniques such as kernel methods will take hold in industry, and how that could happen.

As you point out, I feel that handling tricky objects and their multiple modalities might be quite a promising leap forward, once it is well understood... I think this is pretty much the advantage of kernel methods over competitors, but it needs further work. Do you think that I am biased on this point, as a researcher working in that field, and that kernel methods will not seem any more effective to industry practitioners than SAS logistic regression macros?
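A toy version of that kernel-methods-versus-logistic-regression comparison can be sketched as follows (assuming Python with scikit-learn and synthetic data; the boundary is deliberately non-linear so the kernel's advantage shows):

```python
# Toy illustration of what a kernel method buys: on concentric-circle
# data a linear model fails, while an RBF-kernel SVM succeeds because
# the kernel implicitly maps the data into a space where the classes
# become linearly separable.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

linear = LogisticRegression().fit(X, y)
kernel = SVC(kernel="rbf", gamma="scale").fit(X, y)

print(f"logistic regression accuracy: {linear.score(X, y):.3f}")  # near chance
print(f"RBF-kernel SVM accuracy:      {kernel.score(X, y):.3f}")
```

On "basic" tabular data with roughly linear structure, of course, the gap largely disappears, which is where the SAS-macro comparison has real bite.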

PS: I'm staying in academia for a little while longer, so I won't use any of this to run a start-up, I promise :P

