I would like to come back to the tight relationship between optimization and learning.
Of course, there is much to say about this, but let us focus on one specific example: randomization.
Randomization is an old and well-known method for solving difficult (e.g. combinatorial) problems. Indeed, when one is faced with a complex problem where there are many combinations to try out, and very little structure in the search space, it is often convenient to introduce some randomness in the search.
As it turns out, you can often prove that a "stupid" random method will perform as well as a sophisticated one in terms of how close you can get to the optimal solution as a function of the computation time spent.
Let me give you some simple example: consider the d-dimensional unit cube and assume you are given a set of k d-dimensional hyperrectangles defined by the coordinates of their borders (2d such borders, hence 2d numbers characterize each hyperrectangle) that lie inside the unit cube (the numbers are all between 0 and 1).
The problem is to determine the volume left out by these hyperrectangles.
Put in simple words, you have a cubic space from which you cut out (possibly overlapping) rectangular pieces and you want to determine the volume left out after you cut k such pieces out.
This seems like a simple problem but solving it exactly requires heavy computations: you essentially need to determine all intersections between hyperrectangles and the intersections between these intersections and so on... Of course this can be done, but will require exponential time in d.
This means that it is practically impossible as soon as the dimension gets bigger than say 20.
So, instead of trying to solve it exactly, one can try to approximate the solution with the following simple randomized method:
you imagine throwing a stone in the cubic space repeatedly and counting how many times it hits a hyperrectangle. The fraction of such hits will very quickly converge to the solution of the problem.
More precisely, imagine you can sample points in the d-dimensional unit cube independently and uniformly. For each such point it is easy to determine whether it is inside a hyperrectangle (it takes 2dk comparisons at most) and thus if one samples n such points, with a computation time 2dkn, one gets an estimate of the volume v which is (with large probability) at least within epsilon of v.
The size of this epsilon is roughly n^{-1/2} so that with 10000 draws you get 1% precision in your estimation, and this is independent of the dimension.
So you can have, in high dimensional spaces, a very cheap solution for solving the problem (which takes time linear in the dimension and not exponential, at the expense of yielding an approximate solution).
This is actually a very common phenomenon which has been largely exploited in solving many high dimensional problems.
Now the connection to learning: the randomized method proposed above is a learning algorithm. Indeed, you are trying to "learn" the fact that a point in the unit cube is covered or not by one of the hyperrectangles. It is a very simple learning problem since one is only interested in the volume of the empty space and not in a full description of this empty space, but the techniques that are required are basically similar. More precisely, the estimation of the error in the approximation involves computations of the same type as those used in learning theory for deriving bounds on the generalization error of learning algorithms.
Of course this was a very simple use of learning and there are plenty of ways to make more sophisticated uses of learning in solving complex problems. The message is that statistical learning theory can be useful for analyzing algorithms for solving complex problems (and not only for analyzing learning algorithms as such).

Hi,
Your example is very interesting to me. Often, I have considered the relation between ML and Optimization as this: we can formulate the learning task as optimizing an objective function. Would you mention other interesting examples?
thanks
Posted by: Reza | November 09, 2005 at 03:49 AM
It all ends up with a common point, and that makes it looks accurate.
Posted by: scaling | July 05, 2011 at 10:15 AM