3. Learning

In the previous lecture, we discussed the biological foundations of neural computation: single neuron models, connecting single neuron behaviour with network models, spiking neural networks, and computational neuroscience. In the present lecture, we introduce the statistical foundations of neural computation = the artificial foundations of neural computation.

Artificial Neural Networks rest on two pillars:
• Biological foundations (neuroscience)
• Artificial foundations (statistics, mathematics)

Duck: it can swim (but not like a fish), fly (but not like a bird), and walk (in a funny way).

Topics: pattern recognition, clusters, the statistical approach, statistical learning.

Learning (training from a data set, adaptation): change the weights, or the interactions between neurons, according to examples and previous knowledge. The purpose of learning is to minimize
• the training errors on the learning data: the learning error, and
• the prediction errors on new, unseen data: the generalization error.
(A numerical sketch of these two errors appears at the end of this section.)

The neuroscience basis of learning remains elusive, although we have seen some progress (see the references in the previous lecture). Statistical learning is the artificial, reasonable way of training and prediction.

LEARNING: extracting principles from a data set.
• Supervised learning: there is a teacher, telling you where to go.
• Unsupervised learning: no teacher; the system learns by itself.
• Reinforcement learning: there is a critic, telling you only whether you are wrong or correct.

We will concentrate on the first two. Reinforcement learning can be found in the books by Haykin or Hertz et al., or in Sutton, R.S., and Barto, A.G. (1998) Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

Pattern recognition (classification): a special case of learning

The simplest case: f(x) = 1 or -1 for x in X (the set of objects we intend to separate).

Example: X is a bunch of faces and x a single face, with

f(x) = 1 if x is male, f(x) = -1 if x is female.

[Figure: two face images with their labels f = 1 and f = -1.]

Pattern: as opposed to chaos; it is an entity, vaguely defined, that could be given a name. Examples:
• a fingerprint image,
• a handwritten word,
• a human face,
• a speech signal,
• an iris pattern, etc.
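To close this overview, here is the promised minimal sketch of learning error versus generalization error, using the simplest classifier f(x) = ±1. The feature values, labels, and threshold below are all invented for illustration; they are not from the lecture.

```python
# Minimal sketch: a threshold classifier f(x) = +1 or -1 on a single
# feature, with its error on the training set (learning error) and on
# unseen data (generalization error). All numbers are made up.

# (feature value, label); label +1 = "male", -1 = "female" in the toy face example
train = [(0.0, 1), (2.0, 1), (5.0, 1), (25.0, -1), (30.0, -1), (28.0, 1)]
test  = [(1.0, 1), (4.0, 1), (27.0, -1), (32.0, -1), (14.0, -1)]

def f(x, theta=15.0):
    """Decision function: class +1 if the feature lies below the threshold."""
    return 1 if x < theta else -1

def error_rate(data):
    """Fraction of examples the classifier gets wrong."""
    wrong = sum(1 for x, label in data if f(x) != label)
    return wrong / len(data)

print("learning error      :", error_rate(train))  # error on the training data
print("generalization error:", error_rate(test))   # error on new, unseen data
```

Training minimizes the first number; what we actually care about in practice is the second.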
Given a pattern, its recognition may take one of two forms:
a. supervised classification (discriminant analysis), in which the input pattern is identified as a member of a predefined class, or
b. unsupervised classification (e.g. clustering), in which the pattern is assigned to a hitherto unknown class.

Unsupervised classification will be introduced in later lectures.

Pattern recognition is the process of assigning patterns to one of a number of classes:

pattern space (data) -> feature extraction -> feature space -> classification -> decision space

Example: from a face image x, feature extraction yields y = hair length (say y = 0 or y = 30 cm); classification then decides: short hair = male, long hair = female.

Feature extraction is a very fundamental issue. For example, when we recognize a face, which features do we use? The eye pattern, the geometric outline, etc.

Two approaches:
• Statistical approach
• Clusters: template matching

In two steps:
1. Find a discriminant function in terms of certain features.
2. Make a decision in terms of the discriminant function.

Discriminant function: a function used to decide on class membership.

Cluster: patterns of a class should be grouped or clustered together in pattern or feature space if the decision space is to be partitioned:
• objects near together must be similar,
• objects far apart must be dissimilar,
• distance measures: the choice of distance becomes important as the basis of classification.

Once a distance is given, the pattern recognition is accomplished.

[Figure: the two classes clustered along the hair-length axis.]

Distance metrics (different distances will be employed later). To be a valid metric of the distance between two objects in an abstract space W, a distance metric must satisfy the following conditions:
• d(x,y) >= 0 (nonnegativity)
• d(x,x) = 0 (reflexivity)
• d(x,y) = d(y,x) (symmetry)
• d(x,y) <= d(x,z) + d(z,y) (triangle inequality)

We will encounter different distances later, for example the relative entropy (a distance from information theory). A code sketch of the three metrics below appears at the end of this section.

Hamming distance. For x = {x_i} and y = {y_i},

d_H(x, y) = \sum_i |x_i - y_i|,

the sum of the absolute differences between the elements of the two vectors x and y. It is most often used to compare binary vectors (binary pixel figures, black-and-white figures), e.g.

d_H([1 0 0 1 1 1 0 1], [1 1 0 1 0 0 1 1]) = 4

(the element-wise differences are (0 1 0 0 1 1 1 0)).

Euclidean distance. For x = {x_i} and y = {y_i},

d(x, y) = \left[ \sum_i (x_i - y_i)^2 \right]^{1/2}.

The most widely used distance, easy to calculate.

Minkowski distance. For x = {x_i} and y = {y_i},

d(x, y) = \left[ \sum_i |x_i - y_i|^r \right]^{1/r}, \quad r > 0

(strictly, the triangle inequality requires r >= 1).

Statistical approach: consider hair length again, with class distribution densities p1(x) and p2(x). If p1(x) > p2(x), then x is in class one; otherwise it is in class two. The discriminant function is given by p1(x) = p2(x).

Now the problem of statistical pattern recognition is reduced to estimating the probability densities for given data {x} and {y}. In general there are two approaches:
• Parametric methods
• Nonparametric methods

Parametric methods assume knowledge of the underlying probability density distribution p(x). Advantage: we need only adjust the parameters of the distribution to obtain the best fit; according to the central limit theorem, we can assume in many cases that the distribution is Gaussian (see below). Disadvantage: if the assumption is wrong, performance will be poor in terms of misclassification; however, if a crude classification is acceptable, then this can be OK.
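Here is the promised sketch of the three distance metrics in Python. The function names are mine, not from the lecture; the example at the end is the lecture's own Hamming computation.

```python
# Sketch of the Hamming, Euclidean and Minkowski distances defined above.

def hamming(x, y):
    """d_H(x, y) = sum_i |x_i - y_i|; the usual choice for binary vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """d(x, y) = [sum_i (x_i - y_i)^2]^(1/2)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def minkowski(x, y, r):
    """d(x, y) = [sum_i |x_i - y_i|^r]^(1/r).
    r = 1 recovers the Hamming (city-block) distance, r = 2 the Euclidean one."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

# The lecture's Hamming example:
x = [1, 0, 0, 1, 1, 1, 0, 1]
y = [1, 1, 0, 1, 0, 0, 1, 1]
print(hamming(x, y))       # 4
print(minkowski(x, y, 1))  # 4.0, the same distance as a special case of Minkowski
```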
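And a sketch of the parametric approach itself: fit a one-dimensional Gaussian to each class by estimating its mean and variance, then classify a new observation x by comparing the fitted densities p1(x) and p2(x). The hair-length samples are invented for illustration; this shows the idea, not code from the lecture.

```python
import math

# Parametric approach, 1-D sketch: assume each class is Gaussian, estimate
# (mean, variance) from the data, then classify by comparing densities.
# The hair-length samples below are made up for illustration.

class1 = [0.0, 1.0, 2.0, 3.0, 1.5]       # hair lengths, toy "male" class
class2 = [25.0, 30.0, 28.0, 35.0, 27.0]  # hair lengths, toy "female" class

def fit_gaussian(data):
    """Estimate the mean m and variance s2 of a univariate Gaussian."""
    m = sum(data) / len(data)
    s2 = sum((x - m) ** 2 for x in data) / len(data)
    return m, s2

def density(x, m, s2):
    """p(x) = (2*pi*s2)^(-1/2) * exp(-(x - m)^2 / (2*s2))."""
    return math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

m1, v1 = fit_gaussian(class1)
m2, v2 = fit_gaussian(class2)

x = 5.0                                  # a new observation
p1, p2 = density(x, m1, v1), density(x, m2, v2)
print("class one" if p1 > p2 else "class two")  # "class one" for these numbers
```

The decision boundary is exactly the discriminant function p1(x) = p2(x) described above; the Gaussian density used here is the one defined next.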
Normal (Gaussian) probability distribution

A common assumption is that the density distribution is normal. For a single variable X,

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - m)^2}{2\sigma^2} \right)

with mean E X = m and variance E(X - E X)^2 = \sigma^2.

For multiple dimensions,

p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - m)^T \Sigma^{-1} (x - m) \right)

where x = (x_1, \ldots, x_n)^T is the feature vector, m = (m_1, \ldots, m_n)^T is the mean vector, and \Sigma = (\sigma_{ij}) is the n×n covariance matrix, with

\sigma_{ij} = E[(X_i - m_i)(X_j - m_j)],

the covariance between X_i and X_j. \Sigma is symmetric; |\Sigma| is the determinant of \Sigma and \Sigma^{-1} is its inverse.

[Figure: contours of a multivariate Gaussian density.]

Mahalanobis distance:

d(x, m) = (x - m)^T \Sigma^{-1} (x - m)

A contour of constant Mahalanobis distance, d(x, m) = c, is an ellipse whose principal axes lie along the eigenvectors u_1, u_2 of \Sigma, with lengths governed by the corresponding eigenvalues \lambda_1, \lambda_2.

Topic: the Hebbian learning rule

The Hebbian learning rule is local: it involves only the two neurons concerned, independent of all other variables. We will return to the Hebbian learning rule later in the course, in PCA learning. There are other possible ways of learning demonstrated in experiments (see the Nature Neuroscience references in the previous lecture).

Biological learning vs. statistical learning

Biological learning: the Hebbian learning rule. "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." This is cooperation between the two neurons A and B.

In mathematical terms, with w(t) the weight between the two neurons at time t,

w(t+1) = w(t) + \eta r_A r_B

where r_A and r_B are the activities of neurons A and B, and \eta is the learning rate.
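As a minimal numerical sketch of this update rule: the weight grows whenever the pre- and postsynaptic activities r_A and r_B are jointly high. The activity values and learning rate below are invented for illustration.

```python
# Sketch of the Hebbian update w(t+1) = w(t) + eta * rA * rB.
# Activities and learning rate are made-up numbers for illustration.

eta = 0.1   # learning rate
w = 0.0     # initial weight from neuron A to neuron B

# Joint activities (rA, rB) over a few time steps. The rule is local:
# only these two quantities enter the update.
activity = [(1.0, 1.0), (1.0, 0.0), (0.5, 1.0), (1.0, 1.0)]

for rA, rB in activity:
    w = w + eta * rA * rB  # w(t+1) = w(t) + eta * rA * rB
    print(f"rA={rA}, rB={rB} -> w={w:.3f}")

# Note: with positive activities the weight can only grow, so the plain
# Hebbian rule is unstable without some form of normalization, a point
# we will meet again when the rule reappears in PCA learning.
```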