The RDSE Algorithm
When experimenting with other neural network paradigms, the most
consistent and annoying problem was getting the algorithm to converge to
a solution consistently. All the gradient descent based algorithms
have difficulties with local minima even when simulated annealing techniques
are introduced. Annealing techniques involve a large number of extra
parameters which require domain specific settings and do not entirely alleviate
the problem. This means that a large number of runs need to be undertaken
just to get an algorithm to learn, before any parameter optimisation and
real experimentation can begin. RDSE is a hybrid algorithm which
incorporates a number of the techniques used in different existing paradigms
to construct networks using the new neuronal model. As will be shown,
this paradigm overcomes many of the problems associated with these existing
methodologies.
Description
The RDSE algorithm builds and trains one neuron at a time, freezing
the weights before continuing. Each neuron is trained to minimise
an entropy measure (Quinlan, 1986) using a Directed Random Search technique
(Baba, 1989).
The entropy measure above, is used to ascertain how well the
training examples have been partitioned by one layer of neurons and to
enable the supervision of new neurons as they are trained. Each new
neuron in a layer must attempt to partition as many unpartitioned examples
as possible in to decision regions with class purity and thus improve on
the previous function value. Training stops when the neuron can no
longer improve on this value and has therefore hit an error minima (or
maxima). Once this entropy measure has been reduced to approximately
zero all the training vectors have been successfully partitioned and so
a new layer of neurons is begun.
The algorithm commences by building a single neuron in the first
layer, fully connected to the input units, with small (in the range [-0.5
.. +0.5]) randomly initialised weights.
Training a Neuron by Random Directed Search
From an initial position (set of weights), a random step is taken,
the size of which is governed by a self adjusting variance parameter.
If the new position has a lower entropy than the initial (a good step),
then the weights are updated (the new position becomes the current position)
and the variance parameter increases. If the new position has a greater
or equal entropy to the previous, then the step is ignored and the variance
decreases. When a good step is taken a momentum vector is also updated
by a small proportion of the step. This vector is added to consequent
random steps and so biases the search in directions of decreasing entropy.
Alone this technique would have a tendency to get stuck in local minima
as each time an unsuccessful step is taken, the variance is reduced.
This would result in making shorter and shorter searches. In order
to escape a local minima a relatively large step must be taken. This
can be achieved by performing a number of searches for a good step (when
one is found the minima will have been escaped).
A search consists of a number of unsuccessful random steps with
a decaying variance. Once the variance hits a minimum value the initial
variance is made larger and more searches are performed. This ensures
that the system is more and more likely to escape a local minima.
Once a set number1 of searches has been performed the algorithm infers
that a global minima has been found. The steepness parameter for
the trained neuron is now calculated according to the criteria set in the
introduction.
Calculating the Steepness Parameter
The neuron is required to produce a high output [0.5..1], for
the nearest positive example, to the hyperplane, a low output [0..0.5],
for the nearest negative example and a smooth curve in between. This
can be achieved by altering the steepness such that at the average distance
of the two closest data points of differing class from the dividing hyperplane,
a set output is attained, e.g. 0.9 on the positive side and 0.1 on the
other. The distance of a data point to a hyperplane is proportional
to the neurons activation x on being presented with that data point. So
the average distance of the two closest points is proportional to the average
of the magnitudes of the smallest positive and negative activations, xave,
found when passing through the training data.
The New Layer Criterion
If the layer of neurons has reduced the entropy to approximately
zero, then all the examples have been partitioned into volumes which only
contain a single class (a convex hull, see Fig. 2). When this occurs
a new layer is started which is fully connected to the previous layer.
If the entropy measure is not sufficiently close to zero a new neuron is
added to the current layer and trained in the same way as above.
A Maximum Number of Layers
Any training examples fed through the network so far, which lie within
the same convex hulls, will be approximately mapped to the same point (see
Fig. 2) in the pattern space of the new layer neuron. The new neurons
can be trained to partition these points of higher abstraction in the same
way as the first layer until they have again been partitioned according
to class. In other paving algorithms such as CID3 (Cios, 1992), this
process carries on until the pattern space at a layer is linearly separable
and so can be partitioned using a single neuron. This can lead to
many layers of neurons being created (often over ten ) which leads to slow
training times and an overly large number of neurons being created.
The situation can be improved by halting the process after a pre-specified
number of layers have been created (usually three).
At the last layer of neurons all the points have been partitioned
into volumes containing a single class of data. A layer of terminal
neurons is now added consisting of one recognising terminal neuron for
each volume, connected to only those neurons which implement the volume
boundaries. Every data point inside the convex hull will be mapped to the
same point in the terminal neuron's pattern space, so the weights on the
connections to this neuron can easily be calculated and no training is
necessary. Finally an output layer of neurons is added, one for each
class of data , and connected to the terminal neurons which handle data
points of the same class. Any example presented to the network, will
always be mapped to a single convex hull in the last layer, so at any time
only a single terminal neuron will fire. This makes the problem of
training the output neurons trivial.
Download
the RDSE Algorithm - the algorithm is coded in C and requires a command
file to supply parameters
such as filenames, learning parameters and advanced options.
Download
an example command file - the command files main parameters are described
below
training file name, test file, weights file (output)
0
results file (report)
number input units, max file width, max file length, positive class
value in data
successful steps before variance increases, bad steps before variance
decreases, learning rate
intelligent weight initialisation (1/0), fully connected network/inter
layer connections (1/0), random pattern presentation (1/0),
plot probability map (1/0){2D inputs only}, full report (1/0), 0,0
max searches to global minimum, min searches, variance expansion, variance
contraction, initial variance,
required sigmoidal output, initial weight range, threshold
max levels, entropy threshold for each level {max levels +1}