Talking about names (see previous post), here is an attempt to define and distinguish several names that are sometimes used interchangeably: Statistics, Data Mining, Machine Learning.
If one were to put these under a common name, one could think of "Information Sciences" as a reasonable candidate, but let us treat them separately first:
- Statistics: formally, Statistics is the inverse of Probability. Probability theory is about computing the probability of events given the model, whereas Statistics is about inferring the model from the observation of events. Events are typically described by data, so Statistics is about building models from data. One can also find this more general definition: "Statistics is the part of mathematics that deals with collecting, organizing, and analyzing data".
- Data Mining: the goal here is to "extract information from (large) databases". This requires defining both what is meant by information and what is meant by the extraction process. Possible answers (see e.g. this paper by Jerome Friedman for others) are as follows: "Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (U. Fayyad), "Data Mining is the process of extracting previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions" (A. Zekulin). Data Mining can be considered a sub-area of Descriptive Statistics (although this is probably restrictive) with emphasis on:
  - "understandability" of the produced results
  - algorithmic issues
  - ability to handle "large" databases
  - potential use of the produced results for decision-making
- Machine Learning: this refers to the study of the learning phenomenon, which can be defined as "the ability of a machine to improve its performance based on previous results". The connection with the above fields is that "previous results" usually means data, hence this other definition: "Subspecialty of artificial intelligence concerned with developing methods for software to learn from experience or extract knowledge from examples in a database". Machine Learning largely overlaps with Statistics in the sense that both deal with the analysis of data, but it considers issues that are largely ignored in Statistics, such as the algorithmic complexity of computational implementations. Also, Machine Learning includes the study of other forms of learning that cannot be directly cast as a problem of building a model from a database. Examples are on-line learning, active learning, and reinforcement learning.
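To make the Probability/Statistics distinction above a bit more concrete, here is a small sketch on a coin-flip model. It is only an illustration under my own assumptions: the function names (`prob_k_heads`, `estimate_p`, `online_estimate`), the value `true_p = 0.3` and the sample size are all made up for the example. The first function goes from model to events (Probability), the second from events to model (Statistics), and the third updates the estimate one observation at a time, which hints at what on-line learning looks like when there is no fixed database to start from.

```python
import random
from math import comb


def prob_k_heads(n, p, k):
    """Probability: the model (n flips, bias p) is known;
    compute the probability of the event 'exactly k heads'."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def estimate_p(flips):
    """Statistics: the events (observed flips) are known;
    infer the model parameter p (here, the maximum-likelihood estimate)."""
    return sum(flips) / len(flips)


def online_estimate(flips):
    """On-line flavour: refine the estimate after each new observation,
    without ever storing the whole data set."""
    p_hat = 0.0
    for n, x in enumerate(flips, start=1):
        p_hat += (x - p_hat) / n  # incremental mean update
    return p_hat


# Hypothetical experiment: the "true" bias is chosen only for illustration.
rng = random.Random(0)
true_p = 0.3
flips = [1 if rng.random() < true_p else 0 for _ in range(10_000)]

p_fair_5_of_10 = prob_k_heads(10, 0.5, 5)  # model -> probability of an event
p_hat = estimate_p(flips)                  # events -> estimated model
p_hat_online = online_estimate(flips)      # same estimate, built incrementally
```

With enough flips, `p_hat` should land close to `true_p`, and the on-line version should agree with the batch one up to rounding. The same few lines of arithmetic serve all three purposes; only the goal changes.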
The goal here is to emphasize the distinction in the spirit of these different fields, while showing their connections.
I have tried to define the above domains in terms of their goals and not in terms of the techniques they developed. Indeed, very often, these domains are compared in terms of the tools they produced. My opinion is that this is meaningless since:
- the same algorithms may be used for different goals
- many algorithms were (re)discovered independently in each field
It is thus much more interesting to look at their goals, or at the types of problems they aim to solve, than at the set of tools they encompass.
Also, I do not like to use names in a discriminating way ("What you are doing is not XYZ, it is ABC") because it deepens the gaps between scientific domains, and nothing is more detrimental to science than the lack of communication between domains. However, I like to think about what a name means, because this usually leads to thinking about the goal of your research, and this is always a good thing to do...