4. TESTING

 

 

The systems must be tested for two reasons: 

 

1.      Correctness: To make sure all algorithms are working correctly.

2.      Performance: To check the effectiveness of the algorithms.

 

Correctness tests were carried out throughout the production of the systems.  These were necessary to ensure that the classes were working correctly, before being used or extended by other classes.  The correctness testing was first carried out using the pre-compiled tests, provided in the testerPackage package.  Later, the interfaces could be used to allow flexible testing of the complete systems.  All systems have been tested for correctness, and shown to be reliable.

 

Some performance tests were carried out, giving a first indication as to the success of the systems.  However, the running time of the systems is very high.  Only a small amount of tests were able to be ran.  Further testing is necessary to produce conclusive results.

 

 

4.1    Choice of Task

 

The choice of task is important for both correctness and performance testing.  It must allow all aspects of the systems to be tested.  In other words, all algorithms must be invoked, and a range of behaviours made possible.  This leads to a number of requirements.

 

1. The task must be sufficiently complex.  This is to ensure that the systems have to adapt to perform the task, rather than being initially capable of performing it. 

 

2. The task must have some modularity to it.  This means that within the task, there exist sub-tasks, for which if solutions are found, the complete task can be performed.

 

 3. It should be a tried and tested task, for which a capable network structure is known.  This is for use verifying that good structures are being produced by the genetic algorithm, and the modular breeder.

 

 

4.2    What-Where Vision Task

 

This task fulfils all of the requirements listed above. It consists of an object-recognition task (the “what” task), and a spatial localisation task (the “where” task).  The definition of the what-where task, and the data used to define it, are from a program (and the notes therein) written by my project supervisor, Jonathan Shapiro.  It is similar to the task used by Jacobs, Jordan, and Barto (1991) [2], and the similarity should allow suitable expert network structures to be inferred. 

 

4.2.1        The Task

 

The input is a set of 64 binary numbers.  This represents an 8 x 8 matrix.  The matrix holds values of 0, except for where a pattern is present.  The “what” sub-task asks to distinguish the two following patterns,

 

1 1       from     1 0

1 1                   0 1

 

The “where” sub-task is to identify which of the four quadrants in the input matrix the pattern is. 

 

There are three output bits.  The first gives the solution to the “what” sub-task – 0 indicates the first pattern, and 1 indicates the second pattern.  The last two are for the 4 possibilities of the “where” sub-task.

 

The outputs of the neural networks are real numbers.  An output of <0.5 will represents a 0, and an output of 0.5 will represent a 1.

 

4.2.2        The Suitable Structures

 

I believe that the two following networks are the equivalent to the ones presented in [2].  One network is smaller than the other.

 

The smaller network has a single hidden layer containing 6 nodes, with full-connectivity.  The larger network has 12 hidden nodes rather than 6.  If the results obtained in [2] provide an accurate indication, this should exhibit a faster learning response.

 

The modular network will consist of three experts, based on the ones used in [2].  The first two are the networks described above.  The third will have no hidden layers, and so all inputs will connect to all outputs. 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


                        Figure 4.1: The suitable network structures

 

 

 The network architectures presented above must be tested, and verified to be suitable.  Once they have been verified they can be used as a basis for comparison, with both the structures generated by the genetic algorithm, and the modular network breeder.

 

4.2.2 Crosstalk

 

A problem affecting network performance when faced with a modular task, is crosstalk [2].  There are two types, temporal and spatial.  Temporal crosstalk occurs when a network is trained to perform different tasks, at different times.  Spatial crosstalk occurs when a network is trained to perform different tasks at the same time.  We are only concerned with spacial crosstalk, as it’s a problem present in the what-where task used.  The modular network should be more resilient to both types of crosstalk.  This is due to its ability to choose which experts are most suitable, and train them in proportion to their error by adjusting their desired output (see section 2.2.2, equation 2.17).

 

 

4.3    Results

 

All systems have been tested on the what-where problem described previously. 

 

1.      The neural networks were trained on the task.  This serves to verify that the network structures presented in section 4.2.2 can perform adequately. 

2.      A modular network was similarly trained.  This provides a comparison with the performance of a normal neural network.

2.      The genetic algorithm was tested.  The tests show what changes the populations went through.   

3.      The modular breeder was tested.  This serves to show whether suitable network structures were being produced, and if so, how the system performs.

 

50 patterns are presented to all systems for training.  The number of epochs (or cycles) of a system is plotted against the errors of the system.  This gives a good indication of the learning speed.  The errors were calculated using different methods, as described in chapter 2, and so the graphs will have different properties.  This does not matter, as we are only interested in the number of epochs taken to reach a steady state, not how that state was reached.  In order to test a system, it is presented with another 50 different patterns.  The percentage of correct answers gives the generalisation ability.

 

 The test parameters are provided in appendix C.

 

4.3.1        Neural Network Tests

 

Both neural networks structures (described in section 4.2.2) were tested.

 

1)      Test ww1-network-2: Network 1 is tested.

 

The error settled on 0 in 500 epochs.  See figure 4.2.  The network got 68% of the testing patterns correct.

 

2)      Test ww1-network-3: Network 2 is tested.

 

The error reduced to 0 within 400 epochs.  See figure 4.2.  The network only got 62% of testing patterns correct. 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

                        Network 1                                                           Network 2

 

Figure 4.2: Learning rates of the networks (difference error)

 

 

These tests have shown that both networks can learn to perform the task, but display poor ability to generalise.  The larger network has a faster learning rate, as expected, but surprisingly is less able to generalise.

 

4.3.2 Modular Network Test

 

1)      Test ww1-modular-4:

 

 

 

 

 

 

 

 

 

 

 

 

 

            Figure 4.3: Learning rate of the modular network (sum-squared error)

 

 

The error settles at ~425 epochs.  See figure 4.3.  The generalisation ability is considerably better, with 82% of the testing patterns correct.

 

The system displays a faster learning speed than the smaller of the two neural networks, but not the larger.  The good generalisation of the network indicates that the modular network is more robust when faced with crosstalk than the neural networks, as expected.

 

4.3.2        Genetic Algorithm Tests

 

Three tests were performed.  The second two were performed to try to improve on the results of the first.  The random initialisation of networks was prevented from creating any of the competent network structures (as suggested in section 4.2.2).  All three of the tests had to be terminated before they had finished.  After ~12 hours they were assumed not to be working adequately.  As the program was terminated early, the performance on the testing patterns could not be determined.

 

1)      Test ww1-breeder-1:

 

Terminated after 30 generations.  It uses LinearEvaluator as the fitness evaluator, and MutateAll as the mutation algorithm

 

The fitness seemed to be tending toward a value of 0.14.  See figure 4.4.  However, at around generation 19 it suddenly dropped.   This can be explained by looking at the average size of the networks generated, shown in figure 4.5.

 

It seems that the average size dropped suddenly on generation 18.  The average size in generation 19 is 1, which means there must have been only one population member containing a hidden layer with a node in it.  No structures were capable of performing the task, leading to the drop in fitness.  The increase in the sizes generated from generation 20 onwards, follows the increase in fitness.  The test was terminated before desired fitness could be reached.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


   Figure 4.4: Changes in best fitness                       Figure 4.5: Average sizes of networks

 

 

2)      Test ww1-breeder-4:

 

Terminated after 30 generations.  This changes the fitness evaluator to SquareEvaluator to attempt to select better members.  The mutation algorithm has been changed to MutateNodes, to try and prevent such large structural changes.  The mutation percentage has been increased to allow a number of (smaller) structural variations.  Finally, the fitness threshold has been reduced in an attempt to allow the system to finish naturally.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


   Figure 4.6: Change in best fitness                       Figure 4.7: Average sizes of networks.

 

 

The fitness of the best network rises steadily from generation 10 onwards, shown in figure 4.6.  A look at the average network sizes (figure 4.7) show that they decreased in size, until from generation 10 onwards there were no networks present containing any hidden nodes. 

 

No suitable network structures were produced by the three genetic algorithm tests.  The fitness was shown to rise in both tests.  It is possible that suitable structures would have been produced had the systems continued running.  However, they involved different changes in network size – test 1 showed the sizes increasing, while test 2 showed them decreasing.  The sharper rise of fitness in test 1 compared to the test 2 indicates that network structures averaging a size of 4 are most appropriate.  There is a possibility that the network fitness improved due to only the network training.

 

4.3.3        Modular Breeder

 

Three tests were run.  The output from test 3 is supplied as an example in appendix B.

 

1)  Test ww1-modbreeder-3:

 

A suitable modular network structure was returned after only two cycles.  The patterns were split into 3 sub-sets.  The genetic algorithms did not need to make any new populations of experts.  This indicates that some of the experts in the initial population must have already had suitable structures, and the system simply picked them out.

 

 
 

 

Cycle 1:

number of pattern sub-sets produced = 3

number of patterns in each sub-set  = 18, 26, 6

 

Cycle 2:

number of pattern sub-sets produced = 3

number of patterns in each sub-set = 27, 10, 13

 

The final network is correct in 74% of the testing patterns. 

It consists of 3 experts, each with no hidden layers.

 
 

 

 


 

 

 

 

 

 

 

 

 

 

 

            Figure 4.8: Modular breeder results

 

 

Each modular network in the system is trained for 100 epochs per cycle. The genetic algorithm trains for 100 epochs per generation.  So the system learnt the task on the equivalent of 300 epochs per modular network.  This is less than required for the modular network tested above in section 4.3.2, but the generalisation of the network is worse.

 

2)  Test ww1-modbreeder-4:

 

This test returned a suitable network after only a single cycle.  The task was not partitioned.  Expert 3 from the first modular network in the population was best on all inputs patterns, and so only a single genetic algorithm was started.  Perhaps because of the increased difficulty of performing the whole task, the genetic algorithm took 2 generations until a suitable expert network was returned. 

 

 

Cycle 1:

number of pattern sub-sets produced = 1

 

The final network is correct on 82% of the testing patterns.

 It consists of a single expert, with no hidden layers.

 
 

 

 

 

 

 

 

 

 


            Figure 4.9: Modular breeder results

 

 

Each modular network was trained in the equivalent of 300 epochs.  The final network learns faster, and can generalise better than the modular network in section 4.3.2. 

 

3)  Test ww1-modbreeder-6

 

This test ran for two cycles.  In the first cycle, each of the two genetic algorithms       needed two generations to reduce the expert error by the desired amount.  In the second cycle, only a single generation was needed, before an expert with the desired error was produced.  The task was partitioned into 2 near equal sized sets.

 

 

Cycle 1:

number of pattern sub-sets produced = 2

number of patterns in each sub-set = 26, 25

 

Cycle 2:

number of pattern sub-set produced = 2

number of patterns in each sub-set = 27, 23

 

The final network is correct on 84% of the testing patterns. 

It consists of two experts, each containing no hidden layers.

 
 

 

 

 

 

 

 

 

 

 

 

 

 

 


            Figure 4.10: Modular breeder results

 

 

The final network got 84% of the testing patterns correct.  It consisted of two experts, each containing no hidden layers.

 

The network was trained in the equivalent of 500 epochs.  While this number of epochs is large, the generalisation ability is the best yet. 

 

The modular networks produced by the tests can perform as well as, or better than, the structures presented in section 4.2.2.  It’s worth noting that the best results were produced by the third test that partitioned the tasks into two near equal sets.

 

All three of the modular networks returned consist of experts with no hidden layers.  This implies,

1.      The what-where task specified can be partitioned into sub-tasks that only require encoding of the inputs, and no processing layer.  

2.      The networks theorised in section 4.2.2 are not good structures.  Whilst the tests showed that they can perform the tasks, they are larger than necessary.

3.      The second genetic algorithm test may have been producing the best structures (i.e. networks with no hidden layers).

 

The production of networks containing no hidden nodes is a surprise.  Further investigation is required to determine whether it is possible for such structures to perform the task.