2. THEORY
2.1 Expert
Network
2.1.1 Architecture
I have implemented a feed-forward, layered, neural
network, as described by Pratt [8].
This type of network consists of a number of inputs, outputs, and hidden
nodes, organised into layers. The nodes
only connect to nodes in the layer below.
Figure 2.1 shows a network with an input layer, an output layer, and two
hidden layers. The nodes have full
connectivity (with only the leftmost connections shown for clarity).

Figure 2.1: Feed-forward, layered, network architecture
Each node has associated with it a number of inputs,
a set of weights corresponding to those inputs, and a threshold. The output of the node is calculated by a
function of the net weighted input into the node, as shown in figure 2.2.
(2.1)
(2.2)
Figure 2.2: Node activation.
The node function will be the Sigmoid function,
(2.3)
where T is a small positive constant. This is a standard function used in neural
networks, and a description of it can be found in many texts, such as [7] or
[8].
2.1.2 Training
The network will be trained using the
Backpropagation algorithm. This is as
described by Pratt [8]. This algorithm
works as follows. First the inputs into
the network are set, and the activations of the nodes allowed to propagate
through the network to produce an output.
The errors of the output nodes are then calculated using,
(2.4)
where
is the error of
(output) node
given pattern
,
is the desired output
of the node, and
is the actual output
of the node.
These errors are then propagated back through the
network by being used to calculate the errors of the hidden nodes,
(2.5)
where
is the weight of the
connection from node
to node
.
The weights of all nodes are adjusted using,
(2.6)
where
is a small positive
constant, and
is the differentiated
function acting on the net output of node
given pattern
.
Finally, the thresholds of the nodes are adjusted
using,
(2.7)
where
is the threshold of
node
.
As this algorithm requires the differentiation of the function in order to calculate the weight and threshold changes, it becomes inconvenient if different functions are used in different nodes. To allow this, if no differentiated function is supplied to the Backpropagation method, it becomes,
(2.8)
2.1.3 Structural Operations
In order to breed networks, a number of methods were implemented to allow structural alteration. These methods enable the following,
1. Addition/removal of a layer.
2. Addition/removal of a node.
3. Addition/removal of a connection.
4. Setting of a node’s connection weights.
5. Setting of a node’s function.
6. Setting of a node’s connectivity pattern.
7. Setting of a node’s threshold.
8. Activation/deactivation of a node.
There is also a range of methods that return the structural parameters of the network.
2.2 Modular Network
2.2.1 Architecture
The modular network implemented is described by Haykin [1] (also see [2]). It consists of a number of experts that are governed by a gating network. Given an input pattern, the gating network produces a decision as to which experts are most capable of producing a correct output. A weight is produced corresponding to each expert, which scales its output according to its ability. As mentioned above, the experts are neural networks, although they could be any learning model. The architecture of the modular network is shown below, in figure 2.3.

å
(2.9)
Figure
2.3: Modular network architecture
The gating network consists of a number of so-called
“Softmax” nodes. This is due to the
function they use, called a Softmax function.
This was devised in 1990 by Bridle.
The network consists of a single layer of softmax nodes fully connected
to the input layer. Figure 2.4 shows
this network (with only the connections of the first node shown for clarity).

Figure 2.4: The gating network
Each of these nodes corresponds to an expert, and
produces the weight for that expert.
The Softmax function is,
,
(2.10)
where
is the weighted input
into node
,
,
(2.11)
2.2.2 Training
Two algorithms have been implemented. The first is very simple and will not be
explained here. The second is based on
the Winner-takes-all procedure, described by Jacobs, Jordan, and Barto
[2]. This is a more complex
procedure. The algorithm works as
follows. First, the modular network is
presented with an input pattern. The
error of the network is then calculated using,
(2.12)
where
is the error of the
network on pattern
,
is the desired output
of the network, and
is the actual output.
This error is taken to be the current performance of
the network. Next, it needs to be
determined whether the performance of the network has significantly improved or
not. This is done by comparing the
networks current performance with the networks past performance on the same
input pattern. The past performance of
the network is calculated iteratively using,
(2.13)
where
is the measure of
past performance at time step
on pattern
, and
is a constant,
, that determines how rapidly past values of
are forgotten.
The comparison that evaluates true if the network
has improved is,
(2.14)
where
is a factor that
determines how much the performance must have increased for the network to be
considered improved.
The errors of the experts then need to be determined
by calculating their error using,
![]()
(2.15)
where
is the error of
expert
on pattern
,
is the error of
output
(as calculated in the
Backpropagation algorithm described in section 2.1.2), and N is the total
number of outputs.
The expert with the least error is taken to be the
winner, and all others are taken to be the losers. If the network has improved, then the desired output of the
softmax node corresponding to the winning expert is set to 1, and the desired
outputs of the softmax nodes corresponding to the all other experts are set to
0. The weights of the softmax nodes are
then adjusted to bring the output of the nodes closer to their desired outputs,
using
(2.16)
where
is calculated in the same way as the experts. On the other hand, if the network has not
improved, then the desired outputs are all set to bring the outputs of the
nodes closer to a neutral value.
This neutral value is
, where
is the number of
experts.
Finally, the desired outputs of the experts are
adjusted proportionally to the weight produced by the corresponding softmax
nodes. This is to influence the
training of each expert according to it’s performance. The desired outputs are adjusted as follows,
(2.17)
where
is the altered
desired output of expert
on pattern
.
For a discussion of the effects of this algorithm,
please see [2].
2.3 Genetic
Algorithm
2.3.1 General Method
The general method is described by Mitchell [7]. The
following description is as implemented in this project. You start with an initial population of
neural networks. The method describes
the program cycle that creates new populations of networks. The cycle is simple, as the majority of work
is done by four algorithms employed by the general method. These algorithms are as follows,
1.
Fitness
evaluator: This evaluates the fitness of each population member.
2.
Selection
algorithm: This selects a member from the population. The fitness of the member is commonly used to determine
selection.
3.
Crossover
operation: This takes a set of population members, called the “parents”, and
produces “child” members, consisting of various properties taken from the
parents.
4.
Mutation
algorithm: This takes a population member and applies some kind of mutation to
its properties.
The networks in the initial population are
initialised to have a random number of layers, and a random number of nodes per
layer, both parameters set by the user.
The program cycle is as follows,
1. Use the fitness evaluator to assign a fitness to each member in