The RDSE Algorithm

 When experimenting with other neural network paradigms, the most consistent and annoying problem was getting the algorithm to converge to a solution consistently.  All the gradient descent based algorithms have difficulties with local minima even when simulated annealing techniques are introduced.  Annealing techniques involve a large number of extra parameters which require domain specific settings and do not entirely alleviate the problem.  This means that a large number of runs need to be undertaken just to get an algorithm to learn, before any parameter optimisation and real experimentation can begin.  RDSE is a hybrid algorithm which incorporates a number of the techniques used in different existing paradigms to construct networks using the new neuronal model.  As will be shown, this paradigm overcomes many of the problems associated with these existing methodologies.
Description
 The RDSE algorithm builds and trains one neuron at a time, freezing the weights before continuing.  Each neuron is trained to minimise an entropy measure (Quinlan, 1986) using a Directed Random Search technique (Baba, 1989).
 
 The entropy measure above, is used to ascertain how well the training examples have been partitioned by one layer of neurons and to enable the supervision of new neurons as they are trained.  Each new neuron in a layer must attempt to partition as many unpartitioned examples as possible in to decision regions with class purity and thus improve on the previous function value.  Training stops when the neuron can no longer improve on this value and has therefore hit an error minima (or maxima).  Once this entropy measure has been reduced to approximately zero all the training vectors have been successfully partitioned and so a new layer of neurons is begun.
 The algorithm commences by building a single neuron in the first layer, fully connected to the input units, with small (in the range [-0.5 .. +0.5]) randomly initialised weights.

Training a Neuron by Random Directed Search

 From an initial position (set of weights), a random step is taken, the size of which is governed by a self adjusting variance parameter.  If the new position has a lower entropy than the initial (a good step), then the weights are updated (the new position becomes the current position) and the variance parameter increases.  If the new position has a greater or equal entropy to the previous, then the step is ignored and the variance decreases.  When a good step is taken a momentum vector is also updated by a small proportion of the step.  This vector is added to consequent random steps and so biases the search in directions of decreasing entropy.  Alone this technique would have a tendency to get stuck in local minima as each time an unsuccessful step is taken, the variance is reduced.  This would result in making shorter and shorter searches.  In order to escape a local minima a relatively large step must be taken.  This can be achieved by performing a number of searches for a good step (when one is found the minima will have been escaped).
 A search consists of a number of unsuccessful random steps with a decaying variance.  Once the variance hits a minimum value the initial variance is made larger and more searches are performed.  This ensures that the system is more and more likely to escape a local minima.  Once a set number1 of searches has been performed the algorithm infers that a global minima has been found.  The steepness parameter for the trained neuron is now calculated according to the criteria set in the introduction.
Calculating the Steepness Parameter
 The neuron is required to produce a high output [0.5..1], for the nearest positive example, to the hyperplane, a low output [0..0.5], for the nearest negative example and a smooth curve in between.  This can be achieved by altering the steepness such that at the average distance of the two closest data points of differing class from the dividing hyperplane, a set output is attained, e.g. 0.9 on the positive side and 0.1 on the other.  The distance of a data point to a hyperplane is proportional to the neurons activation x on being presented with that data point. So the average distance of the two closest points is proportional to the average of the magnitudes of the smallest positive and negative activations, xave, found when passing through the training data.

The New Layer Criterion

 If the layer of neurons has reduced the entropy to approximately zero, then all the examples have been partitioned into volumes which only contain a single class (a convex hull, see Fig. 2).  When this occurs a new layer is started which is fully connected to the previous layer.  If the entropy measure is not sufficiently close to zero a new neuron is added to the current layer and trained in the same way as above.

A Maximum Number of Layers

 Any training examples fed through the network so far, which lie within the same convex hulls, will be approximately mapped to the same point (see Fig. 2) in the pattern space of the new layer neuron.  The new neurons can be trained to partition these points of higher abstraction in the same way as the first layer until they have again been partitioned according to class.  In other paving algorithms such as CID3 (Cios, 1992), this process carries on until the pattern space at a layer is linearly separable and so can be partitioned using a single neuron.  This can lead to many layers of neurons being created (often over ten ) which leads to slow training times and an overly large number of neurons being created.  The situation can be improved by halting the process after a pre-specified number of layers have been created (usually three).
 At the last layer of neurons all the points have been partitioned into volumes containing a single class of data.  A layer of terminal neurons is now added consisting of one recognising terminal neuron for each volume, connected to only those neurons which implement the volume boundaries. Every data point inside the convex hull will be mapped to the same point in the terminal neuron's pattern space, so the weights on the connections to this neuron can easily be calculated and no training is necessary.  Finally an output layer of neurons is added, one for each class of data , and connected to the terminal neurons which handle data points of the same class.  Any example presented to the network, will always be mapped to a single convex hull in the last layer, so at any time only a single terminal neuron will fire.  This makes the problem of training the output neurons trivial.
 
 Download the RDSE Algorithm - the algorithm is coded in C and requires a command file to supply parameters
such as filenames, learning parameters and advanced options.
Download an example command file - the command files main parameters are described below

training file name, test file, weights file (output)
0
results file (report)
number input units, max file width, max file length, positive class value in data
successful steps before variance increases, bad steps before variance decreases, learning rate
intelligent weight initialisation (1/0), fully connected network/inter layer connections (1/0), random pattern presentation (1/0),
plot probability map (1/0){2D inputs only}, full report (1/0), 0,0
max searches to global minimum, min searches, variance expansion, variance contraction, initial variance,
required sigmoidal output, initial weight range, threshold
max levels, entropy threshold for each level {max levels +1}