2. THEORY

 

 

2.1 Expert Network

 

2.1.1 Architecture

 

I have implemented a feed-forward, layered, neural network, as described by Pratt [8].  This type of network consists of a number of inputs, outputs, and hidden nodes, organised into layers.  The nodes only connect to nodes in the layer below.  Figure 2.1 shows a network with an input layer, an output layer, and two hidden layers.  The nodes have full connectivity (with only the leftmost connections shown for clarity). 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


             Figure 2.1: Feed-forward, layered, network architecture

 

 

Each node has associated with it a number of inputs, a set of weights corresponding to those inputs, and a threshold.  The output of the node is calculated by a function of the net weighted input into the node, as shown in figure 2.2.

 

                           

 

                                                                                                             

                                                                                               

 

 

 

 

 

       (2.1)

 

                  (2.2)

 

                Figure 2.2: Node activation.                                   

 

 

The node function will be the Sigmoid function,

 

                                   (2.3)

 

where T is a small positive constant.  This is a standard function used in neural networks, and a description of it can be found in many texts, such as [7] or [8].

 

2.1.2 Training

 

The network will be trained using the Backpropagation algorithm.  This is as described by Pratt [8].  This algorithm works as follows.  First the inputs into the network are set, and the activations of the nodes allowed to propagate through the network to produce an output.  The errors of the output nodes are then calculated using,  

 

                                    (2.4)

 

where  is the error of (output) node  given pattern ,  is the desired output of the node, and  is the actual output of the node.     

These errors are then propagated back through the network by being used to calculate the errors of the hidden nodes,  

 

                                   (2.5)

 

where  is the weight of the connection from node  to node .

The weights of all nodes are adjusted using, 

 

                                                                            (2.6)

 

where  is a small positive constant, and   is the differentiated function acting on the net output of node given pattern .

 

Finally, the thresholds of the nodes are adjusted using,

 

                                                                                    (2.7)

 

where  is the threshold of node .

 

As this algorithm requires the differentiation of the function in order to calculate the weight and threshold changes, it becomes inconvenient if different functions are used in different nodes.  To allow this, if no differentiated function is supplied to the Backpropagation method, it becomes,  

 

                                                   (2.8)

 

2.1.3 Structural Operations

 

In order to breed networks, a number of methods were implemented to allow structural alteration.  These methods enable the following,

 

1.      Addition/removal of a layer.

2.      Addition/removal of a node.

3.      Addition/removal of a connection.

4.      Setting of a node’s connection weights.

5.      Setting of a node’s function.

6.      Setting of a node’s connectivity pattern.

7.      Setting of a node’s threshold.

8.      Activation/deactivation of a node.

 

There is also a range of methods that return the structural parameters of the network.

 

 

2.2 Modular Network

 

2.2.1 Architecture

 

The modular network implemented is described by Haykin [1] (also see [2]).  It consists of a number of experts that are governed by a gating network.  Given an input pattern, the gating network produces a decision as to which experts are most capable of producing a correct output.  A weight is produced corresponding to each expert, which scales its output according to its ability.  As mentioned above, the experts are neural networks, although they could be any learning model.  The architecture of the modular network is shown below, in figure 2.3.

 

 

 


                                                                                                           

 

å

 
                                                                                                                       

        

 

                                                                

 

 

 

                                                         (2.9)

 

 

 

 

       Figure 2.3: Modular network architecture

 

 

The gating network consists of a number of so-called “Softmax” nodes.  This is due to the function they use, called a Softmax function.  This was devised in 1990 by Bridle.  The network consists of a single layer of softmax nodes fully connected to the input layer.  Figure 2.4 shows this network (with only the connections of the first node shown for clarity).

 

 

 


                                           

                                                                                                           

                                           

 

                                                                       

                   

                                           

 

                                                                       

 

 

     Figure 2.4: The gating network

 

 

Each of these nodes corresponds to an expert, and produces the weight for that expert. 

 

The Softmax function is,

 

                                                ,                             (2.10)

 

where  is the weighted input into node ,

 

                                                ,                                 (2.11)

 

2.2.2 Training

 

Two algorithms have been implemented.  The first is very simple and will not be explained here.  The second is based on the Winner-takes-all procedure, described by Jacobs, Jordan, and Barto [2].  This is a more complex procedure.  The algorithm works as follows.  First, the modular network is presented with an input pattern.  The error of the network is then calculated using,

 

                                                                              (2.12)

 

where  is the error of the network on pattern ,  is the desired output of the network, and  is the actual output.

 

This error is taken to be the current performance of the network.  Next, it needs to be determined whether the performance of the network has significantly improved or not.  This is done by comparing the networks current performance with the networks past performance on the same input pattern.  The past performance of the network is calculated iteratively using,

 

                                                                    (2.13)

 

where  is the measure of past performance at time step  on pattern , and  is a constant, , that determines how rapidly past values of  are forgotten.

 

The comparison that evaluates true if the network has improved is,

 

                                                                                           (2.14)

 

where  is a factor that determines how much the performance must have increased for the network to be considered improved. 

 

The errors of the experts then need to be determined by calculating their error using,

                       

                                                                                                    (2.15)

 

where  is the error of expert  on pattern ,  is the error of output  (as calculated in the Backpropagation algorithm described in section 2.1.2), and N is the total number of outputs.

 

The expert with the least error is taken to be the winner, and all others are taken to be the losers.  If the network has improved, then the desired output of the softmax node corresponding to the winning expert is set to 1, and the desired outputs of the softmax nodes corresponding to the all other experts are set to 0.  The weights of the softmax nodes are then adjusted to bring the output of the nodes closer to their desired outputs, using

 

                                                                       (2.16)

 

where is calculated in the same way as the experts.  On the other hand, if the network has not improved, then the desired outputs are all set to bring the outputs of the nodes closer to a neutral value. 

 

This neutral value is , where  is the number of experts.

 

Finally, the desired outputs of the experts are adjusted proportionally to the weight produced by the corresponding softmax nodes.  This is to influence the training of each expert according to it’s performance.  The desired outputs are adjusted as follows,

 

                                                                                   (2.17)

 

where  is the altered desired output of expert  on pattern .

 

For a discussion of the effects of this algorithm, please see [2].

 

 

2.3 Genetic Algorithm

 

2.3.1 General Method

 

The general method is described by Mitchell [7]. The following description is as implemented in this project.  You start with an initial population of neural networks.  The method describes the program cycle that creates new populations of networks.  The cycle is simple, as the majority of work is done by four algorithms employed by the general method.  These algorithms are as follows,

 

1.      Fitness evaluator: This evaluates the fitness of each population member.

2.      Selection algorithm: This selects a member from the population.  The fitness of the member is commonly used to determine selection.

3.      Crossover operation: This takes a set of population members, called the “parents”, and produces “child” members, consisting of various properties taken from the parents.

4.      Mutation algorithm: This takes a population member and applies some kind of mutation to its properties.

 

The networks in the initial population are initialised to have a random number of layers, and a random number of nodes per layer, both parameters set by the user.

 

The program cycle is as follows,

 

1.      Use the fitness evaluator to assign a fitness to each member in