GENERALIZATION AND OPTIMIZATION
Combining Multiple Learners
This unit discusses models composed of multiple learners that complement
each other, so that by combining them, higher accuracy can be attained.
Rationale
In any application, several learning algorithms can be used, and with certain
algorithms there are hyperparameters that affect the final learner. For example,
in a classification setting, we can use a parametric classifier or a multilayer
perceptron, and with a multilayer perceptron the number of hidden units must
also be decided. The No Free Lunch Theorem states that
there is no single learning algorithm that in any domain always induces the most
accurate learner. The usual approach is to try many and choose the one that
performs the best on a separate validation set. Each learning algorithm dictates a
certain model that comes with a set of assumptions. This inductive bias leads to
error if the assumptions do not hold for the data. Learning is an ill-posed
problem and with finite data, each algorithm converges to a different solution
and fails under different circumstances. The performance of a learner may be
fine-tuned to get the highest possible accuracy on a validation set, but this
fine-tuning is a complex task, and still there are instances on which even the best
learner is not accurate enough. The idea is that there may be another learner that
is accurate on these. By suitably combining multiple base-learners, accuracy can
be improved. Recently with computation and memory getting cheaper, such
systems composed of multiple learners have become popular.
There are basically two questions here:
1. How to generate base-learners that complement each other?
2. How to combine the outputs of base-learners for maximum accuracy?
The discussion in this unit will answer these two related questions.
Note that model combination is not a trick that always increases accuracy. It
does, however, always increase the time and space complexity of training and
testing, and unless the base-learners are trained carefully and their decisions
combined smartly, we will only pay for this extra complexity without any
significant gain in accuracy.
Generating Diverse Learners
Since there is no point in combining learners that always make similar
decisions, the aim is to find a set of diverse learners that differ in their
decisions so that they complement each other. At the same time, there cannot be
a gain in overall success unless the learners are accurate, at least in their
domains of expertise. The topics below discuss different ways to achieve this.
Different Algorithms
Different learning algorithms can be used to train different base-learners.
Different algorithms make different assumptions about the data and lead to
different classifiers. For example, one base-learner may be parametric and
another may be non-parametric. When we decide on a single algorithm, we give
emphasis to a single method and ignore all others. By combining multiple
learners based on multiple algorithms, we free ourselves from committing to a
single method and no longer need to put all our eggs in one basket.
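As an illustration, the sketch below combines a parametric base-learner (Gaussian naive Bayes) with non-parametric ones (k-nearest neighbor and a decision tree) by majority vote. It is only a minimal, non-authoritative example; it assumes scikit-learn is available and uses a synthetic dataset in place of a real problem.

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB           # parametric base-learner
from sklearn.neighbors import KNeighborsClassifier   # non-parametric base-learner
from sklearn.tree import DecisionTreeClassifier      # a third, different bias
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real classification problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each base-learner makes different assumptions about the data; the ensemble
# takes a majority vote over their class predictions.
ensemble = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ],
    voting="hard",
)
print("ensemble accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())

With voting="hard" each base-learner casts one vote for a class; voting="soft" would instead average the estimated class probabilities.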
Different Hyperparameters
The same learning algorithm can be used with different hyperparameters. Examples
are the number of hidden units in a multilayer perceptron, k in k-nearest
neighbor, error threshold in decision trees, the kernel function in support vector
machines, and so forth. With a Gaussian parametric classifier, whether the
covariance matrices are shared or not is a hyperparameter. If the optimization
algorithm uses an iterative procedure such as gradient descent whose final state
depends on the initial state, such as in backpropagation with multilayer
perceptrons, the initial state, for example, the initial weights, is another
hyperparameter. When multiple base-learners are trained with different
hyperparameter values, the decision averages over this factor and reduces
variance, and therefore error.
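A minimal sketch of this idea, again assuming scikit-learn and a synthetic dataset: the same multilayer perceptron algorithm is trained with different numbers of hidden units and different initial weights (random seeds), and the estimated posterior probabilities are averaged.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One base-learner per hyperparameter setting: (hidden units, initial weights).
learners = [
    MLPClassifier(hidden_layer_sizes=(h,), random_state=seed,
                  max_iter=1000).fit(X_tr, y_tr)
    for h, seed in [(5, 0), (10, 1), (20, 2)]
]

# Averaging the posterior estimates over these settings reduces variance.
avg_proba = np.mean([m.predict_proba(X_te) for m in learners], axis=0)
y_pred = avg_proba.argmax(axis=1)   # labels are 0/1, so the column index is the class
print("averaged-MLP accuracy:", (y_pred == y_te).mean())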
Different Input Representations
Separate base-learners may be using different representations of the same input
object or event, making it possible to integrate different types of
sensors/measurements/modalities. Different representations make different
characteristics explicit allowing better identification. In many applications, there
are multiple sources of information, and it is desirable to use all of these data to
extract more information and achieve higher accuracy in prediction.
For example, in speech recognition, to recognize the uttered words, in addition
to the acoustic input, video images of the speaker’s lips can also be used as the
words are spoken. This is similar to sensor fusion, where the data from different
sensors are integrated to extract more information for a specific application. The
simplest approach is to concatenate all data vectors and treat the result as one
large vector from a single source, but this is not theoretically appropriate
since this corresponds to modeling data as sampled from one multivariate
statistical distribution. Moreover, larger input dimensionalities make the
systems more complex and require larger samples for the estimators to be
accurate. A better approach is to make separate predictions from the different
sources using separate base-learners and then combine their predictions. Even if there is a
single input representation, by choosing random subsets from it, we can have
classifiers using different input features; this is called the random subspace
method (Ho 1998). This has the effect that different learners will look at the
same problem from different points of view and will be more robust; it also helps
reduce the curse of dimensionality because each learner works with fewer input dimensions.
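The following is a rough sketch of the random subspace method under the same assumptions (scikit-learn, synthetic data, illustrative parameters): each decision tree sees only a random subset of the input features, and the trees' class votes are combined.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

n_learners, n_sub = 10, 10          # 10 trees, each seeing 10 of the 30 features
subspaces, learners = [], []
for _ in range(n_learners):
    feats = rng.choice(X.shape[1], size=n_sub, replace=False)  # random subspace
    subspaces.append(feats)
    learners.append(DecisionTreeClassifier(random_state=0).fit(X_tr[:, feats], y_tr))

# Majority vote over the base-learners, each predicting from its own subspace.
votes = np.array([m.predict(X_te[:, f]) for m, f in zip(learners, subspaces)])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)     # labels are 0/1
print("random subspace accuracy:", (y_pred == y_te).mean())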
Different Training Sets
Another possibility is to train different base-learners by different subsets of the
training set. This can be done randomly by drawing random training sets from
the given sample; this is called bagging. Or, the learners can be trained serially
so that instances on which the preceding base-learners are not accurate are given
more emphasis in training later base-learners; examples are boosting and
cascading, which actively try to generate complementary learners, instead of
leaving this to chance.
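A minimal sketch of bagging along the same lines (scikit-learn, synthetic data, illustrative parameters): each base-learner is trained on a bootstrap sample drawn with replacement from the training set, and the predictions are combined by majority vote. Boosting and cascading, which reweight or filter the training instances serially, are not shown here.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

learners = []
for _ in range(15):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))   # bootstrap sample
    learners.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Combine the base-learners' predictions by majority vote (labels are 0/1).
votes = np.array([m.predict(X_te) for m in learners])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("bagging accuracy:", (y_pred == y_te).mean())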
The partitioning of the training sample can also be done based on locality in the
input space so that each base-learner is trained on instances in a certain local
part of the input space. Similarly, it is possible to define the main task in terms
of a number of subtasks to be implemented by the base-learners, as is done by
error-correcting output codes.
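As a brief, non-authoritative illustration of the error-correcting output code idea, the sketch below uses scikit-learn's OutputCodeClassifier on the Iris dataset (the dataset and parameters are only illustrative): each base-learner is trained on one binary subtask defined by a column of the code matrix, and a test instance is assigned to the class whose codeword is closest to the base-learners' outputs.

from sklearn.datasets import load_iris
from sklearn.multiclass import OutputCodeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# code_size controls how many binary subtasks (base-learners) are created
# relative to the number of classes.
ecoc = OutputCodeClassifier(DecisionTreeClassifier(max_depth=3, random_state=0),
                            code_size=2.0, random_state=0)
print("ECOC accuracy:", cross_val_score(ecoc, X, y, cv=5).mean())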
Diversity vs. Accuracy
One important point is that when generating multiple base-learners, we want them
to be reasonably accurate, but we do not require them to be very accurate
individually, so they need not be optimized separately for best accuracy. The
base-learners are chosen not for their accuracy, but for their simplicity.
What we do require, however, is that the base-learners be diverse, that is,
accurate on different instances, each specializing in a subdomain of the problem.
What matters is the final accuracy when the base-learners are combined, rather
than the accuracies of the individual base-learners we start with. For example,
suppose we have a classifier that is 80 percent accurate. When deciding on a
second classifier, we do not care much about its overall accuracy; what matters
is how accurate it is on the instances that the first classifier misclassifies.