This is a follow-up to this post and to the comments made about my previous post.
I have used the idea of Rudi Cilibrasi and Paul Vitanyi (see their preprint here) of extracting a "semantic" distance between terms from Google page counts.
So I ran the following experiment: I did Google searches for the terms Statistics, Statistical, Data Analysis, Data Mining and Machine Learning.
The individual page counts came out as follows:
Statistics | 573000000 |
Statistical | 158000000 |
Data Analysis | 38600000 |
Data Mining | 17500000 |
Machine Learning | 5250000 |
Of course, Statistics is the most common. One reason is that the word is not exclusively associated with a scientific discipline.
Then, I used the so-called "Normalized Google Distance" (NGD) to assess the relationship between these terms. Here is the result:
Term | Statistics | Statistical | Data Analysis | Data Mining | Machine Learning |
Statistics | 0.00 | 0.53 | 0.73 | 0.93 | 0.82 |
Statistical | 0.53 | 0.00 | 0.49 | 0.72 | 0.65 |
Data Analysis | 0.73 | 0.49 | 0.00 | 0.54 | 0.61 |
Data Mining | 0.93 | 0.72 | 0.54 | 0.00 | 0.39 |
Machine Learning | 0.82 | 0.65 | 0.61 | 0.39 | 0.00 |
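For readers who want to reproduce a table like the one above, here is a minimal Python sketch of the NGD formula as defined by Cilibrasi and Vitanyi: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) and f(y) are the individual page counts, f(x, y) is the page count for both terms together, and N is the total number of pages indexed. The function and argument names are my own choices, and since the joint counts and the value of N behind the table are not listed in this post, the calls below use toy numbers only as sanity checks.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts.

    fx, fy : page counts for each term on its own
    fxy    : page count for pages containing both terms
    n      : total number of indexed pages (normalizing constant)
    """
    if min(fx, fy, fxy) <= 0:
        raise ValueError("all page counts must be positive")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Toy sanity checks (made-up numbers, not Google counts):
# two terms that always appear together are at distance 0 ...
same = ngd(1000, 1000, 1000, 1_000_000)
# ... and two terms that co-occur only as often as chance would
# predict (fx * fy / n pages) are at distance 1.
independent = ngd(1000, 1000, 1, 1_000_000)
print(same, independent)
```

A smaller NGD means the two terms tend to appear on the same pages, so the low values in the table indicate closely related terms.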
It is interesting to notice that Machine Learning is more related to Statistics than Data Mining is (a distance of 0.82 versus 0.93; smaller means closer). Also, Machine Learning and Data Mining are very close to each other (0.39), while Data Analysis is closer to Statistics (0.73) than either of them.
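These readings can be checked mechanically. The short Python sketch below copies the NGD table from this post and looks up, for each term, the other term at the smallest distance. The helper name nearest is just an illustrative choice.

```python
terms = ["Statistics", "Statistical", "Data Analysis",
         "Data Mining", "Machine Learning"]

# NGD values copied from the table in this post (rows and columns in the
# same order as `terms`).
ngd_table = [
    [0.00, 0.53, 0.73, 0.93, 0.82],
    [0.53, 0.00, 0.49, 0.72, 0.65],
    [0.73, 0.49, 0.00, 0.54, 0.61],
    [0.93, 0.72, 0.54, 0.00, 0.39],
    [0.82, 0.65, 0.61, 0.39, 0.00],
]


def distance(a, b):
    """Look up the tabulated NGD between two terms."""
    return ngd_table[terms.index(a)][terms.index(b)]


def nearest(term):
    """Return the other term with the smallest NGD to `term`."""
    others = [t for t in terms if t != term]
    return min(others, key=lambda t: distance(term, t))


for t in terms:
    print(f"{t:>16} -> {nearest(t)}")
```

With the values above, Data Mining and Machine Learning are each other's nearest neighbours, and Statistical sits between Statistics and Data Analysis.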
I also tried to compute the NGD between these terms and the term "Company" in order to see which one had penetrated the corporate world the most. The results were not very enlightening. However, if you look at the page counts for the phrase "Statistics Company", and then replace Statistics with each of the other terms, you get the following:
Statistics Company | 23100 |
Data Mining Company | 10500 |
Data Analysis Company | 623 |
Statistical Company | 163 |
Machine Learning Company | 140 |
Interestingly, Machine Learning is very seldom associated with "Company", while Data Mining companies seem to abound.
Another possible measure is the number of ads you get when you do these searches. I noticed that the searches "Data Mining" and "Data Mining Company" return an incredible number of ads for data mining software and companies (many more than the searches for statistics or machine learning).
So there is still a lot to be done in order to make Machine Learning better recognized...