To someone outside the machine learning community, it may seem that researchers spend their time looking for THE optimal learning algorithm, that is, a completely generic algorithm that would beat all the others. One might even think that such an algorithm is so sophisticated that researchers can only approach it incrementally, which would explain why progress in this area is relatively slow.
However, this is a serious misinterpretation of what is going on in this research field.
First of all, there is no such thing as an optimal learning algorithm, and my point here is to explain why this is so.
There are at least three possible explanations, and I will go from the most informal to the most formal:
- Any algorithm has some bias: given a data sample, a learning algorithm typically builds a function (or a model) that agrees (to a certain degree) with this data and that can make predictions for new data; that is, it extrapolates the data. However, for each data sample there are infinitely many ways to extrapolate it, and each learning algorithm does so in its own way. The bias of a learning algorithm can be thought of as the way it ranks the possible functions: in order to build a function, it needs a way to decide which one to pick among all the functions that (at least partially) agree with the data. So the question is whether there could exist some sort of optimal ranking of functions, or optimal way to decide which function to pick. The problem is that, given a learning problem characterized by the function to be learned, one can always construct an algorithm that performs optimally on it, simply by choosing a ranking that puts this particular function first. There is thus an optimal algorithm for each problem, but that algorithm will necessarily be sub-optimal on other problems. So, roughly speaking, there is no way to perform well on all problems simultaneously.
- There exist several results that make this more precise, in particular the so-called No Free Lunch (NFL) theorem. This theorem essentially says that, averaged over all possible learning problems, every learning algorithm has the same performance. As a consequence, a learning algorithm that performs well on some problems must perform poorly on others to balance this out. Thus there cannot exist an optimal and universal learning algorithm.
- A more involved version of the NFL theorem is the slow-rate theorem, which states that for any learning algorithm and any sequence of numbers converging to zero, there exists a learning problem on which the algorithm's generalization error converges to zero more slowly than the chosen sequence (as the sample size increases). In other words, a learning algorithm may converge to the optimal solution arbitrarily slowly, and for any fixed sample size it can have arbitrarily poor performance.
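The NFL argument can be checked empirically in a toy setting. The sketch below (my own illustrative construction, not from any particular paper) uses binary labels on a five-point input space: a learner sees the labels of points 0-2 and must predict the labels of the unseen points 3-4. Averaged over all 2^5 possible target functions, any fixed learner, no matter how clever, gets exactly 50% accuracy on the unseen points, since its prediction cannot depend on labels it has never seen.

```python
# Toy empirical check of the No Free Lunch idea (illustrative setup):
# inputs are {0,...,4}, labels are binary, we train on points 0-2 and
# evaluate on the unseen points 3-4, averaging over ALL possible targets.
from itertools import product

TRAIN, TEST = [0, 1, 2], [3, 4]

def majority_learner(train_labels):
    # Predicts the majority training label for every unseen point.
    guess = int(sum(train_labels) > len(train_labels) / 2)
    return lambda x: guess

def zero_learner(train_labels):
    # Ignores the data entirely and always predicts 0.
    return lambda x: 0

def average_test_accuracy(learner):
    # Accuracy on the unseen points, averaged over all 2^5 target functions.
    accs = []
    for labels in product([0, 1], repeat=5):  # every possible "problem"
        predict = learner([labels[i] for i in TRAIN])
        correct = sum(predict(x) == labels[x] for x in TEST)
        accs.append(correct / len(TEST))
    return sum(accs) / len(accs)

print(average_test_accuracy(majority_learner))  # 0.5
print(average_test_accuracy(zero_learner))      # 0.5
```

Both the data-driven learner and the trivial one average exactly 0.5, which is the NFL theorem in miniature: whatever the majority learner gains on targets that happen to match its bias, it loses on the targets that contradict it.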
All this implies that there cannot be an algorithm that is both universal (able to learn any problem) and optimal (performing better than all others on every problem).
So the best we can hope for is an algorithm with good properties on a restricted set of problems. This is not so bad, however, since one can assume that the problems encountered in the real world are somewhat well-behaved and do not span the space of all possible problems.
In conclusion, what Machine Learning researchers do is not look for THE optimal algorithm, but look for learning algorithms that are optimal for a small set of learning problems, namely the "real-world problems".