4. TESTING
The systems must be tested for two reasons:
1.
Correctness:
To make sure all algorithms are working correctly.
2.
Performance:
To check the effectiveness of the algorithms.
Correctness tests were carried out throughout the production
of the systems. These were necessary to
ensure that the classes were working correctly, before being used or extended
by other classes. The correctness
testing was first carried out using the pre-compiled tests, provided in the
testerPackage package. Later, the
interfaces could be used to allow flexible testing of the complete
systems. All systems have been tested
for correctness, and shown to be reliable.
Some performance tests were carried out, giving a
first indication as to the success of the systems. However, the running time of the systems is very high. Only a small amount of tests were able to be
ran. Further testing is necessary to
produce conclusive results.
4.1 Choice of Task
The choice of task is important for both correctness
and performance testing. It must allow
all aspects of the systems to be tested.
In other words, all algorithms must be invoked, and a range of
behaviours made possible. This leads to
a number of requirements.
1. The task must be sufficiently complex. This is to ensure that the systems have to
adapt to perform the task, rather than being initially capable of performing
it.
2. The task must have some modularity to it. This means that within the task, there exist
sub-tasks, for which if solutions are found, the complete task can be
performed.
3. It should
be a tried and tested task, for which a capable network structure is
known. This is for use verifying that
good structures are being produced by the genetic algorithm, and the modular
breeder.
4.2 What-Where Vision Task
This task fulfils all of the requirements listed
above. It consists of an object-recognition task (the “what” task), and a
spatial localisation task (the “where” task).
The definition of the what-where task, and the data used to define it,
are from a program (and the notes therein) written by my project supervisor,
Jonathan Shapiro. It is similar to the
task used by Jacobs, Jordan, and Barto (1991) [2], and the similarity should
allow suitable expert network structures to be inferred.
4.2.1
The
Task
The input is a set of 64 binary numbers. This represents an 8 x 8 matrix. The matrix holds values of 0, except for
where a pattern is present. The “what”
sub-task asks to distinguish the two following patterns,
1 1 from
1 0
1 1 0 1
The “where” sub-task is to identify which of the
four quadrants in the input matrix the pattern is.
There are three output bits. The first gives the solution to the “what”
sub-task – 0 indicates the first pattern, and 1 indicates the second
pattern. The last two are for the 4
possibilities of the “where” sub-task.
The outputs of the neural networks are real
numbers. An output of <0.5 will
represents a 0, and an output of
0.5 will represent a 1.
4.2.2
The
Suitable Structures
I believe that the two following networks are the
equivalent to the ones presented in [2].
One network is smaller than the other.
The smaller network has a single hidden layer
containing 6 nodes, with full-connectivity.
The larger network has 12 hidden nodes rather than 6. If the results obtained in [2] provide an
accurate indication, this should exhibit a faster learning response.
The modular network will consist of three experts,
based on the ones used in [2]. The
first two are the networks described above.
The third will have no hidden layers, and so all inputs will connect to
all outputs.

Figure
4.1: The suitable network structures
The network
architectures presented above must be tested, and verified to be suitable. Once they have been verified they can be
used as a basis for comparison, with both the structures generated by the
genetic algorithm, and the modular network breeder.
4.2.2 Crosstalk
A problem affecting network performance when faced
with a modular task, is crosstalk [2].
There are two types, temporal and spatial. Temporal crosstalk occurs when a network is trained to perform different
tasks, at different times. Spatial
crosstalk occurs when a network is trained to perform different tasks at the
same time. We are only concerned with
spacial crosstalk, as it’s a problem present in the what-where task used. The modular network should be more resilient
to both types of crosstalk. This is due
to its ability to choose which experts are most suitable, and train them in
proportion to their error by adjusting their desired output (see section 2.2.2,
equation 2.17).
4.3 Results
All systems have been tested on the what-where
problem described previously.
1.
The
neural networks were trained on the task.
This serves to verify that the network structures presented in section
4.2.2 can perform adequately.
2.
A
modular network was similarly trained.
This provides a comparison with the performance of a normal neural
network.
2.
The
genetic algorithm was tested. The tests
show what changes the populations went through.
3.
The
modular breeder was tested. This serves
to show whether suitable network structures were being produced, and if so, how
the system performs.
50 patterns are presented to all systems for
training. The number of epochs (or
cycles) of a system is plotted against the errors of the system. This gives a good indication of the learning
speed. The errors were calculated using
different methods, as described in chapter 2, and so the graphs will have
different properties. This does not
matter, as we are only interested in the number of epochs taken to reach a
steady state, not how that state was reached.
In order to test a system, it is presented with another 50 different
patterns. The percentage of correct
answers gives the generalisation ability.
The test
parameters are provided in appendix C.
4.3.1
Neural
Network Tests
Both neural networks structures (described in
section 4.2.2) were tested.
1)
Test
ww1-network-2: Network 1 is tested.
The error settled on 0 in
500 epochs. See figure 4.2. The network got 68% of the testing patterns
correct.
2)
Test
ww1-network-3: Network 2 is tested.
The error reduced to 0
within 400 epochs. See figure 4.2. The network only got 62% of testing patterns
correct.
Network
1 Network 2
Figure 4.2: Learning rates
of the networks (difference error)
These tests have shown that both networks can learn
to perform the task, but display poor ability to generalise. The larger network has a faster learning rate,
as expected, but surprisingly is less able to generalise.
4.3.2 Modular Network Test
1)
Test ww1-modular-4:
Figure
4.3: Learning rate of the modular network (sum-squared error)
The error settles at ~425 epochs. See figure 4.3. The generalisation ability is considerably better, with 82% of
the testing patterns correct.
The system displays a faster learning speed than the
smaller of the two neural networks, but not the larger. The good generalisation of the network
indicates that the modular network is more robust when faced with crosstalk
than the neural networks, as expected.
4.3.2
Genetic
Algorithm Tests
Three tests were performed. The second two were performed to try to
improve on the results of the first. The
random initialisation of networks was prevented from creating any of the
competent network structures (as suggested in section 4.2.2). All three of the tests had to be terminated
before they had finished. After ~12
hours they were assumed not to be working adequately. As the program was terminated early, the performance on the
testing patterns could not be determined.
1)
Test
ww1-breeder-1:
Terminated after 30
generations. It uses LinearEvaluator as
the fitness evaluator, and MutateAll as the mutation algorithm
The fitness seemed to be
tending toward a value of 0.14. See
figure 4.4. However, at around
generation 19 it suddenly dropped.
This can be explained by looking at the average size of the networks
generated, shown in figure 4.5.
It seems that the average
size dropped suddenly on generation 18.
The average size in generation 19 is 1, which means there must have been
only one population member containing a hidden layer with a node in it. No structures were capable of performing the
task, leading to the drop in fitness.
The increase in the sizes generated from generation 20 onwards, follows
the increase in fitness. The test was
terminated before desired fitness could be reached.
Figure
4.4: Changes in best fitness Figure 4.5: Average sizes of networks
2)
Test
ww1-breeder-4:
Terminated after 30
generations. This changes the fitness
evaluator to SquareEvaluator to attempt to select better members. The mutation algorithm has been changed to
MutateNodes, to try and prevent such large structural changes. The mutation percentage has been increased
to allow a number of (smaller) structural variations. Finally, the fitness threshold has been reduced in an attempt to
allow the system to finish naturally.
Figure 4.6: Change in best fitness Figure
4.7: Average sizes of networks.
The
fitness of the best network rises steadily from generation 10 onwards, shown in
figure 4.6. A look at the average
network sizes (figure 4.7) show that they decreased in size, until from
generation 10 onwards there were no networks present containing any hidden
nodes.
No suitable network structures were produced by the
three genetic algorithm tests. The
fitness was shown to rise in both tests.
It is possible that suitable structures would have been produced had the
systems continued running. However,
they involved different changes in network size – test 1 showed the sizes
increasing, while test 2 showed them decreasing. The sharper rise of fitness in test 1 compared to the test 2
indicates that network structures averaging a size of 4 are most
appropriate. There is a possibility
that the network fitness improved due to only the network training.
4.3.3
Modular
Breeder
Three tests were run. The output from test 3 is supplied as an example in appendix B.
1) Test
ww1-modbreeder-3:
A suitable modular network
structure was returned after only two cycles.
The patterns were split into 3 sub-sets. The genetic algorithms did not need to make any new populations
of experts. This indicates that some of
the experts in the initial population must have already had suitable
structures, and the system simply picked them out.
Cycle 1: number of pattern
sub-sets produced = 3 number of patterns in
each sub-set = 18, 26, 6 Cycle 2: number of pattern
sub-sets produced = 3 number of patterns in
each sub-set = 27, 10, 13 The final network is
correct in 74% of the testing patterns.
It
consists of 3 experts, each with no hidden layers.
Figure
4.8: Modular breeder results
Each modular network in the
system is trained for 100 epochs per cycle. The genetic algorithm trains for
100 epochs per generation. So the
system learnt the task on the equivalent of 300 epochs per modular network. This is less than required for the modular
network tested above in section 4.3.2, but the generalisation of the network is
worse.
2) Test
ww1-modbreeder-4:
This test returned a
suitable network after only a single cycle.
The task was not partitioned. Expert
3 from the first modular network in the population was best on all inputs
patterns, and so only a single genetic algorithm was started. Perhaps because of the increased difficulty
of performing the whole task, the genetic algorithm took 2 generations until a
suitable expert network was returned.
Cycle 1: number of pattern
sub-sets produced = 1 The
final network is correct on 82% of the testing patterns. It consists of a single expert, with no
hidden layers.
Figure
4.9: Modular breeder results
Each modular network was
trained in the equivalent of 300 epochs.
The final network learns faster, and can generalise better than the
modular network in section 4.3.2.
3) Test
ww1-modbreeder-6
This test ran for two cycles. In the first cycle, each of the two genetic
algorithms needed two generations
to reduce the expert error by the desired amount. In the second cycle, only a single generation was needed, before
an expert with the desired error was produced.
The task was partitioned into 2 near equal sized sets.
Cycle 1: number of pattern
sub-sets produced = 2 number of patterns in
each sub-set = 26, 25 Cycle 2: number of pattern
sub-set produced = 2 number of patterns in
each sub-set = 27, 23 The
final network is correct on 84% of the testing patterns. It
consists of two experts, each containing no hidden layers.
Figure
4.10: Modular breeder results
The final network got 84% of
the testing patterns correct. It consisted
of two experts, each containing no hidden layers.
The network was trained in
the equivalent of 500 epochs. While
this number of epochs is large, the generalisation ability is the best
yet.
The modular networks produced by the tests can perform
as well as, or better than, the structures presented in section 4.2.2. It’s worth noting that the best results were
produced by the third test that partitioned the tasks into two near equal sets.
All three of the modular networks returned consist
of experts with no hidden layers. This
implies,
1.
The
what-where task specified can be partitioned into sub-tasks that only require
encoding of the inputs, and no processing layer.
2.
The
networks theorised in section 4.2.2 are not good structures. Whilst the tests showed that they can
perform the tasks, they are larger than necessary.
3.
The
second genetic algorithm test may have been producing the best structures (i.e.
networks with no hidden layers).
The production of networks containing no hidden
nodes is a surprise. Further
investigation is required to determine whether it is possible for such
structures to perform the task.