F
In practice, the nodes of most feedforward nets are partitioned into layers (sets of nodes), and the layers can be numbered so that the nodes in each layer are connected only to nodes in the next layer, that is, the layer with the next higher number. Commonly, successive layers are totally interconnected: each node in the earlier layer is connected to every node in the next layer.
The first layer has no input connections, so it consists of input units and is termed the input layer (yellow nodes in the diagram below).
The last layer has no output connections, so it consists of output units and is termed the output layer (maroon nodes in the diagram below).
The layers between the input and output layers are termed hidden layers, and consist of hidden units (light blue nodes and brown nodes in the diagram below).
When the net is operating, the activations of non-input neurons are computed using each neuron's activation function.
Feedforward network. All connections (arrows) are in one direction; there are no cycles of activation flow (cyclic subgraphs). Each colour identifies a different layer in the network. Layers 1 and 2 are fully interconnected, and so are layers 3 and 4. Layers 2 and 3 are only partly interconnected.
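The layer-by-layer computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logistic (sigmoid) activation function, the bias terms, the weight values, and the names sigmoid, forward and layers are all hypothetical choices, not taken from this entry, and the sketch assumes totally interconnected layers.

```python
import math

def sigmoid(x):
    # One common choice of activation function (an assumption here).
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, inputs, activation=sigmoid):
    """Propagate activations from the input layer to the output layer.

    layers is a list of (weights, biases) pairs, one per non-input layer.
    weights[j][i] is the weight on the connection from node i of the
    earlier layer to node j of the next layer.
    """
    activations = list(inputs)  # input units simply hold the input values
    for weights, biases in layers:
        # Each node sums its weighted inputs, then applies its activation function.
        activations = [
            activation(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

# Hypothetical net: 2 input units, 3 hidden units, 1 output unit.
layers = [
    ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.2, 0.5]], [0.05]),
]
output = forward(layers, [1.0, 0.0])
```

Because activation only ever flows from one layer to the next, a single pass over the list of layers is enough; no node is visited twice.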
See also symbolic learning algorithms.
G
Moreover, with large complex sets of training patterns, it is likely that some errors may occur, either in the inputs or in the outputs. In that case, and again particularly in the later parts of the learning process, it is likely that backprop will be contorting the weights so as to fit precisely around training patterns that are actually erroneous! This phenomenon is known as over-fitting.
This problem can to some extent be avoided by stopping learning early. How does one tell when to stop? One method is to partition the training patterns into two sets (assuming that there are enough of them). The larger part of the training patterns, say 80% of them, chosen at random, forms the training set, and the remaining 20% are referred to as the test set (a held-out set used in this way is often called a validation set). Every now and again during training, one measures the performance of the current set of weights on the test set. One normally finds that the error on the training set drops monotonically (that's what a gradient descent algorithm is supposed to do, after all). However, error on the test set (which will be larger, per pattern, than the error on the training set) will fall at first, then start to rise as the algorithm begins to overtrain. Best generalization performance is gained by stopping the algorithm at the point where error on the test set starts to rise.
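The stopping procedure just described might be sketched as follows. Everything named here is a hypothetical placeholder: train_one_epoch and heldout_error stand for whatever learning step and error measure are in use, and the patience parameter (how many consecutive checks without improvement to tolerate before stopping) is an added assumption, since held-out error can fluctuate and rarely rises cleanly.

```python
import random

def split_patterns(patterns, train_fraction=0.8, seed=0):
    """Randomly partition the patterns, e.g. 80% training and 20% held out."""
    shuffled = list(patterns)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_with_early_stopping(train_one_epoch, heldout_error,
                              max_epochs=1000, check_every=1, patience=5):
    """Stop once error on the held-out set has stopped improving.

    train_one_epoch() is assumed to update the weights and return a
    snapshot (copy) of them; heldout_error(weights) is assumed to return
    the error of those weights on the held-out patterns.
    """
    best_error = float("inf")
    best_weights = None
    checks_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        weights = train_one_epoch()
        if epoch % check_every != 0:
            continue  # only measure held-out error "every now and again"
        error = heldout_error(weights)
        if error < best_error:
            best_error, best_weights = error, weights
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break  # held-out error has been rising: stop training
    # Return the weights from the point where held-out error was lowest.
    return best_weights, best_error
```

Returning the best weights seen, rather than the final ones, gives the effect of having stopped at the point where held-out error began to rise.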
Δw_ji(n) = α Δw_ji(n-1) + η δ_j(n) y_i(n)
in the notation of Haykin's text (Neural Networks: A Comprehensive Foundation), where n indexes the weight update, η is the learning rate, δj is the local gradient (error term) of neuron j, and yi is the output of neuron i. The constant α is termed the momentum constant and can be adjusted to achieve the best effect. The second summand corresponds to the standard delta rule, while the first summand says "add α × the previous change to this weight."
This new rule is called the generalized delta rule. The effect is that if the basic delta rule would be consistently pushing a weight in the same direction, then it gradually gathers "momentum" in that direction.
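A short sketch may make the momentum effect concrete. It applies the rule "new change = α × previous change + η × δ × y" repeatedly to a single weight; the function name momentum_update and the particular values of η, α, δ and y are illustrative assumptions only.

```python
def momentum_update(prev_change, delta_j, y_i, eta=0.1, alpha=0.5):
    """One weight change: alpha * previous change + eta * delta_j * y_i."""
    return alpha * prev_change + eta * delta_j * y_i

# Suppose the basic delta rule keeps pushing the weight in the same
# direction (constant delta_j * y_i = 1.0, a made-up value).
changes = []
change = 0.0
for _ in range(20):
    change = momentum_update(change, delta_j=1.0, y_i=1.0)
    changes.append(change)

# Without momentum every change would be eta * 1.0 = 0.1. With momentum
# the changes grow (0.1, 0.15, 0.175, ...) toward eta / (1 - alpha) = 0.2.
```

If the sign of δ × y alternated instead, successive terms would partly cancel, so momentum damps oscillation as well as speeding up consistent movement.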
When an artificial neural network learning algorithm causes the weights of the net to change, it will do so in such a way that the current point on the error surface will descend into a valley of the error surface, in a direction that corresponds to the steepest (downhill) gradient or slope at the current point on the error surface. For this reason, backprop is said to be a gradient descent method, and to perform gradient descent in weight space.
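As a minimal sketch of gradient descent in weight space, consider a made-up two-weight error surface E(w1, w2) = (w1 - 1)^2 + 2 (w2 + 2)^2, chosen here only because its gradient is easy to write down. In a real net, backprop is what supplies the gradient; the function names and the step size eta below are assumptions for illustration.

```python
def error(w):
    # Hypothetical bowl-shaped error surface with its minimum at (1, -2).
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

def gradient(w):
    # Partial derivatives of the error with respect to each weight.
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)]

def gradient_descent(w, eta=0.1, steps=100):
    """Repeatedly step in the direction of steepest descent (minus the gradient)."""
    w = list(w)
    errors = [error(w)]
    for _ in range(steps):
        g = gradient(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
        errors.append(error(w))
    return w, errors

final_w, errors = gradient_descent([4.0, 3.0])
# final_w ends up very close to the bottom of the valley at (1, -2).
```

On a surface with a single valley like this one, the error falls at every step; on the surfaces met in practice there may be several valleys, and descent finds only the one it starts above.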
See also local minimum.