When we try to understand the causes of a phenomenon, there is a natural tendency to think that the more variables we measure, the more likely we are to identify the real cause.
This is generally true if there is no constraint on the number of experiments we can perform, but as soon as we work with a limited sample, adding more variables may lead to less accurate models.
More precisely, suppose we can perform, say, 100 independent experiments, and for each of them we measure d "input" variables and one output variable, then try to build a model that predicts the output from the inputs. The larger d is, the more difficult building that model might be. There is some kind of optimal value of d: on the one hand, if one has measured too few variables, one may miss important information; on the other hand, if one has measured too many, it becomes impossible to distinguish "true" correlations from distortions caused by the imperfection of the sampling mechanism (for example, it may happen that on these specific 100 experiments, an input variable that has nothing to do with the output is incidentally correlated with it).
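To make the parenthetical example concrete, here is a small simulation (my own sketch, not from the post): the output is pure noise, completely unrelated to every input, yet with n = 100 experiments the strongest apparent correlation among the d measured variables keeps growing as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # number of experiments

# The output has nothing to do with any input, yet some input will
# appear to "explain" it once enough inputs have been measured.
y = rng.standard_normal(n)

for d in (10, 100, 1_000, 10_000):
    X = rng.standard_normal((n, d))  # d irrelevant input variables
    # Pearson correlation of each input variable with the output
    corr = (X - X.mean(0)).T @ (y - y.mean()) / (n * X.std(0) * y.std())
    print(f"d = {d:6d}   max |correlation| = {np.abs(corr).max():.2f}")
```

The largest of these spurious correlations grows roughly like sqrt(2 log(d) / n), so with n fixed it looks more and more convincing as d increases, even though every input is independent of the output.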
In order to study this phenomenon more precisely, one can imagine a framework where, in addition to obtaining the experiments by sampling iid from an arbitrary distribution, the variables themselves are obtained from a sampling process (also iid from some distribution). This is similar in spirit to the framework recently proposed by Krupka and Tishby, except that in their framework only the variables (or features) are sampled, and not the examples. Note that the assumption that one samples the variables in an iid fashion does not mean that the variables are necessarily independent in the classical sense of having independent values. One should imagine an infinite matrix representing all possible measurements on all possible experiments for a given problem: rows would be experiments, columns would be measurements.
The framework consists in assuming that one randomly picks n rows and d columns of this matrix (possibly picking the same row or column several times, to comply with the iid assumption).
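Here is a minimal sketch of that sampling process, with a large but finite random matrix standing in for the infinite one (the sizes and the Gaussian entries are placeholders of my own, not part of the framework):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the infinite matrix of all possible measurements on all
# possible experiments: rows are experiments, columns are measurements.
N_ROWS, N_COLS = 10_000, 500
full_matrix = rng.standard_normal((N_ROWS, N_COLS))

def sample_n_d(n, d):
    """Pick n rows and d columns iid uniformly, with replacement
    (so the same row or column may be picked several times)."""
    rows = rng.integers(0, N_ROWS, size=n)  # which experiments we ran
    cols = rng.integers(0, N_COLS, size=d)  # which variables we measured
    return full_matrix[np.ix_(rows, cols)]

sample = sample_n_d(n=100, d=20)  # the (n, d) sample seen by the learner
```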
Now the question is: can one obtain a bound on the error of a given learning algorithm trained on such an (n, d) sample, as a function of n and d?
This is probably still too vague to be answered, and one would need to put restrictions on the functions that are allowed. For example, a reasonable first goal would be to study the case where the target function is a (possibly countably infinite) linear combination of the variables (i.e. of the columns of that infinite matrix).
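To fix ideas (the notation here is mine, not from the post), write M(i, j) for the entry of the infinite matrix at experiment i and variable j; the target would then be something like

$$ Y(i) \;=\; \sum_{j \ge 1} \alpha_j \, M(i, j), $$

presumably with some summability condition on the coefficients (e.g. $\sum_j \alpha_j^2 < \infty$) so that the sum makes sense, while the learner only ever sees the n-by-d submatrix produced by the row and column sampling described above.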
Intuitively, one would expect the answer to depend on "how similar" the columns of the matrix are and on "how spread out" the coefficients of the linear combination are: one needs to collect enough of the variables with high coefficients in the final linear combination, but not too many variables with low or zero coefficients.