S
sequence prediction tasks
Feedforward networks are fine for classifying objects, but their units (as distinct from their weights) have no memory of previous inputs. Consequently they are unable to cope with sequence prediction tasks - tasks like predicting, given a sequence of sunspot activity counts, what the sunspot activity for the next time period will be, or financial prediction tasks (e.g. given share prices for the last n days, and presumably other economic data, predict tomorrow's share price).
Simple recurrent nets can tackle tasks like this, because they do have a kind of memory for recording information derived from activation values from previous time steps.
This article is included for general interest - sequence prediction tasks are not part of the syllabus of COMP9414 Artificial Intelligence.
sigmoidal nonlinearity
Another name for the logistic function and certain related functions (such as tanh(x)). Sigmoidal functions are a type of squashing function. They are so called because sigma is the Greek letter "s", and the logistic function looks somewhat like a sloping letter "s" when graphed.
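For illustration, the logistic function is easy to compute directly (a minimal sketch; the sample points are chosen arbitrarily):

```python
import math

def logistic(x):
    """Logistic sigmoid: squashes any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Both functions trace a sloping "s": nearly flat at the extremes,
# steepest around x = 0. tanh squashes into (-1, 1) instead of (0, 1).
for x in (-5.0, 0.0, 5.0):
    print(x, round(logistic(x), 4), round(math.tanh(x), 4))
```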
simple recurrent network
A simple recurrent network is like a feedforward network with an input layer, an output layer, and a single hidden layer, except that there is a further group of units called state units or context units. There is one state unit for each hidden unit. The activation of a state unit in time step n is the same as that of the corresponding hidden unit in time step n–1. That is, the state unit activations are copies of the hidden unit activations from the previous time step. Each state unit is also connected to each hidden unit by a trainable weight - the direction of this connection is from the state unit to the hidden unit.
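One time step of such a network can be sketched as follows (an illustration only: the layer sizes are made up, the weights are random rather than trained, and numpy is used for the matrix arithmetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 1  # illustrative layer sizes

# Trainable weights (random here, for illustration)
W_in = rng.standard_normal((n_hidden, n_in))        # input -> hidden
W_state = rng.standard_normal((n_hidden, n_hidden)) # state -> hidden
W_out = rng.standard_normal((n_out, n_hidden))      # hidden -> output

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

state = np.zeros(n_hidden)  # state units start with no memory

def step(x, state):
    """One time step: the hidden units see the current input AND the
    previous hidden activations (held in the state units); the state
    units are then overwritten with a copy of the new hidden activations."""
    hidden = logistic(W_in @ x + W_state @ state)
    output = logistic(W_out @ hidden)
    return output, hidden.copy()  # the copy, not a trained mapping, updates the state

for x in [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]:
    y, state = step(x, state)
```

Note that the second input is processed with a non-zero state, so the output depends on the first input as well - this is the network's "kind of memory".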
SRN diagram
Simple recurrent networks can learn sequence prediction tasks.
See also recurrent networks.
This article is included for general interest - simple recurrent networks are not part of the syllabus of COMP9414 Artificial Intelligence.
splitting criterion in ID3
The point of the ID3 algorithm is to decide the best attribute, out of those not already used, on which to split the training instances that are classified to a particular branch node.
The algorithm, in outline, is as follows:
if all the instances belong to a single class, there is nothing to do (except create a leaf node labelled with the name of that class).
otherwise, for each attribute that has not already been used, calculate the information gain that would be obtained by using that attribute on the particular set of instances classified to this branch node.
use the attribute with the greatest information gain.
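The outline above can be sketched in Python (a sketch under assumptions: `information_gain` is assumed to be supplied, computed as described below, and instances are assumed to be represented as dicts of attribute values):

```python
def choose_split(instances, labels, unused_attributes, information_gain):
    """Decide the splitting attribute at one branch node.
    Returns None when all instances belong to a single class (make a
    leaf node instead), else the unused attribute with greatest gain."""
    if len(set(labels)) == 1:
        return None  # pure node: nothing to do but label a leaf
    return max(unused_attributes,
               key=lambda a: information_gain(instances, labels, a))
```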
This leaves the question of how to calculate the information gain associated with using a particular attribute A. Suppose that there are k classes C1, C2, ..., Ck, and that of the N instances classified to this node, I1 belong to class C1, I2 belong to class C2, ..., and Ik belong to class Ck.
Let p1 = I1/N, p2 = I2/N, ..., and pk = Ik/N.
The initial entropy E at this node is:
–p1log2(p1) –p2log2(p2) ... –pklog2(pk).
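This formula is easy to compute directly (a small sketch; class distributions are given as lists of counts, and terms with pi = 0 are taken to contribute 0, by the usual convention):

```python
import math

def entropy(counts):
    """Entropy -sum p_i log2(p_i) of a class distribution given as counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Two equally likely classes give the maximum entropy of 1 bit:
print(entropy([4, 4]))       # prints 1.0
# A pure node (all one class) has entropy 0:
print(entropy([5, 0]) == 0)  # prints True
```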
Now split the instances on each value of the chosen attribute A. Suppose that there are r attribute values for A, namely a1, a2, ..., ar.
For a particular value aj, say, suppose that there are Jj,1 instances in class C1, Jj,2 instances in class C2, ..., and Jj,k instances in class Ck, for a total of Jj instances having attribute value aj.
Let qj,1 = Jj,1/Jj, qj,2 = Jj,2/Jj, ..., and qj,k = Jj,k/Jj.
The entropy Ej associated with this attribute value aj is:
–qj,1log2(qj,1) –qj,2log2(qj,2) ... –qj,klog2(qj,k).
Now compute:
E – ((J1/N).E1 + (J2/N).E2 + ... + (Jr/N).Er).
This is the information gain for attribute A.
Note that Jj/N is the estimated probability that an instance classified to this node will have value aj for attribute A. Thus we are weighting the entropy estimates Ej by the estimated probability that an instance has the associated attribute value.
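Putting the pieces together, the gain calculation can be sketched as follows (the representation - instances as dicts of attribute values, with a parallel list of class labels - is chosen here purely for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """-sum p_i log2(p_i) over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, labels, attribute):
    """E minus the (Jj/N)-weighted average of the subset entropies Ej,
    where subset j holds the instances with value aj for `attribute`."""
    n = len(labels)
    subsets = {}
    for inst, label in zip(instances, labels):
        subsets.setdefault(inst[attribute], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder
```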
In terms of the example used in the lecture notes (see also the calculations in the lecture notes), k = 2 since there are just two classes, positive and negative. I1 = 4, I2 = 3, and N = 7, so p1 = 4/7 and p2 = 3/7, and E = –p1log2(p1) – p2log2(p2) = –(4/7)×log2(4/7) – (3/7)×log2(3/7). The first attribute A considered is size, and the first value of size considered, large, corresponds to a1, so J1,1 = 2 = J1,2, and J1 = 4. Thus q1,1 = J1,1/J1 = 2/4 = ½, q1,2 = J1,2/J1 = 2/4 = ½, and E1 = –q1,1log2(q1,1) – q1,2log2(q1,2) = –½×log2(½) – ½×log2(½) = 1.
Similarly E2 = 1 and J2 = 2 (size = small), and E3 = 0 and J3 = 1 (size = medium) so the final information gain,
E – ((J1/N).E1 + (J2/N).E2 + ... + (Jr/N).Er)
= E – ((4/7)×E1 + (2/7)×E2 + (1/7)×E3)
which turned out to be about 0.13 in the example.
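The 0.13 figure can be checked directly (the subset class counts below are read off the J and E values quoted above; the single medium instance is assumed positive so that the class totals come to 4 and 3):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# (positive, negative) counts per value of size:
# large -> (2, 2), small -> (1, 1), medium -> (1, 0)
subsets = [(2, 2), (1, 1), (1, 0)]
N = 7
E = entropy([4, 3])                                  # initial entropy, about 0.985
gain = E - sum(sum(s) / N * entropy(s) for s in subsets)
print(round(gain, 2))  # prints 0.13
```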
squashing function
One of several functions, including sigmoid and step functions (see threshold), used to transform the total net input to a neuron to obtain the final output of the neuron.
stopping criterion in backprop
Possible stopping criteria in error backpropagation learning include:
total error of the network falling below some predetermined level;
a certain number of epochs having been completed;
Combinations of the two (e.g. whichever of the two occurs first) and other stopping conditions are possible. See the reference by Haykin (Neural networks: a comprehensive foundation, p. 153) for more details.
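A combined criterion of this kind can be sketched as follows (the names are hypothetical: `train_one_epoch` stands for one backpropagation pass over the training set, and is assumed to return the network's total error for that epoch):

```python
MAX_EPOCHS = 1000      # criterion 2: epoch budget
ERROR_TARGET = 0.01    # criterion 1: acceptable total error

def train(network, data, train_one_epoch):
    """Train until the total error falls below ERROR_TARGET or
    MAX_EPOCHS epochs have been completed, whichever occurs first."""
    for epoch in range(1, MAX_EPOCHS + 1):
        total_error = train_one_epoch(network, data)
        if total_error < ERROR_TARGET:
            return epoch, total_error   # stopped on the error criterion
    return MAX_EPOCHS, total_error      # stopped on the epoch criterion
```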
supervised learning
Supervised learning is a kind of machine learning where the learning algorithm is provided with a set of inputs along with the corresponding correct outputs. Learning involves the algorithm comparing its current actual output with the correct or target outputs, so that it knows what its error is, and can modify its behaviour accordingly.
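The compare-and-correct loop can be illustrated with the simplest possible case: a single linear unit trained by the delta rule (the example chosen here is illustrative only, not from the lecture notes):

```python
def train_linear_unit(examples, lr=0.1, epochs=100):
    """examples: list of (inputs, target) pairs. On each example the
    unit compares its actual output with the supplied target, and
    nudges its weights in proportion to the error."""
    n = len(examples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, target in examples:
            actual = sum(wi * xi for wi, xi in zip(w, x)) + b
            error = target - actual   # the supervision signal
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# Learn y = 2*x from a few labelled examples:
w, b = train_linear_unit([([0.0], 0.0), ([1.0], 2.0), ([2.0], 4.0)])
# w[0] ends up close to 2.0 and b close to 0.0
```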
Contrast unsupervised learning.
symbolic learning algorithms
Symbolic learning algorithms learn concepts by constructing a symbolic expression (such as a decision tree, as in ID3, or a set of rules, as in Aq) that describes a class (or classes) of objects. Many such systems work with representations equivalent to first order logic.
Such learning algorithms have the advantage that their internal representations and their final output can be inspected and relatively easily understood by humans.
See also function approximation algorithms.