Advantages of NNs
1. High computation rates provided by massive parallelism.
2. A great degree of robustness.
3. Adaptation of connection weights based on current results, so as to improve performance.
4. Non-parametric operation.

The Artificial Neuron
The basic computational element of a NN is called the artificial neuron (or neuron, or node).

[Figure: a general neuron model. Inputs X1 ... Xn arrive over weights W1 ... Wn; together with the bias φ they feed the basis fn u(i), whose value passes through the activation fn f( ) to give the output sent to other neurons.]

The net input is represented by a net fn u(i), which is also called the basis fn; the neuron activation is given by the activation fn f( ). The bias acts exactly as a weight on a connection from a unit whose activation is always 1. The simplest node sums the weighted inputs and passes the result through a non-linearity, i.e.

   Y = f( ∑ Xi Wi + φ )

The non-linear fn f, or activation fn, is typically the same for the neurons in any particular layer of a neural net. Some common activation fns are the step, ramp, and sigmoid functions.

The basis fn is represented as u(w, x), where w stands for the weight matrix and x for the input vector. The basis fn has two common forms:
1. Linear basis fn (LBF): a hyperplane-type fn. This is a 1st-order linear basis fn; the net value is a linear combination of the inputs:
   ui(W, X) = ∑ (j = 1..n) Wij Xj
2. Radial basis fn (RBF): a hypersphere-type fn. This involves a 2nd-order non-linear basis fn; the net value represents the distance to a reference pattern:
   ui(W, X) = ∑ (j = 1..n) (Xj − Wij)²

Common activation fns
1. Step fn: converts the net input (a continuously valued variable) to a binary (0 or 1) or bipolar (1 or −1) output signal. The binary step fn is also known as the threshold fn or hard limiter. This type of fn is used in both supervised and unsupervised learning; its main advantage is its simplicity.
2. Threshold logic unit (TLU), or ramp fn:
   f(x) = −1   for x ≤ −X0
   f(x) = x    for −X0 < x < X0
   f(x) = 1    for x ≥ X0
3. Sigmoid fns (S-shaped curves) are useful activation fns. The logistic fn and the hyperbolic tangent fn are the most common.
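The neuron model above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the notes; the function names are invented, and the step activation is assumed to use threshold 0.

```python
def step(x, theta=0.0):
    """Binary step (hard limiter): 1 if the net input reaches the threshold, else 0."""
    return 1 if x >= theta else 0

def lbf(w, x):
    """Linear basis fn: u = sum_j w_j * x_j (hyperplane-type net input)."""
    return sum(wj * xj for wj, xj in zip(w, x))

def rbf(w, x):
    """Radial basis fn: u = sum_j (x_j - w_j)**2 (squared distance to reference pattern w)."""
    return sum((xj - wj) ** 2 for wj, xj in zip(w, x))

def neuron(x, w, bias=0.0, basis=lbf, activation=step):
    """y = f(u(w, x) + bias); the bias acts as a weight from a unit fixed at 1."""
    return activation(basis(w, x) + bias)

y = neuron([1.0, 0.5], [0.4, -0.6], bias=0.1)   # u = 0.4 - 0.3 + 0.1 = 0.2 -> step -> 1
```

Swapping `basis=rbf` or a different `activation` gives the other neuron variants described above without changing the surrounding code.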
They are especially useful for NNs trained by back propagation, because of the simple relationship between the value of the fn at a point and the value of the derivative at that point (i.e. it has a simple derivative). The logistic fn, with a range from 0 to 1, is often used for NNs in which the desired output values either are binary or are in the range (interval) between 0 and 1:

   f(x) = 1 / (1 + e^(−σx))
   f′(x) = σ f(x) [1 − f(x)]

[Figure: the binary sigmoid for σ = 1 and σ = 3; the larger σ gives the steeper curve.]

The logistic fn is also called the squashing fn; σ is the slope parameter.
N.B. The slope parameter is used to modify the steepness of the sigmoid, so that the sigmoid fn achieves a particular desired value for a given value of x.

The bipolar sigmoid is often used when the desired range of output values is between −1 and 1:

   f(x) = (1 − e^(−σx)) / (1 + e^(−σx))
   f′(x) = (σ/2) (1 + f(x)) (1 − f(x))

The bipolar sigmoid is similar to the logistic fn, but it is symmetric about the origin. The sigmoid fns' characteristics are that they are continuous, differentiable, and monotonically non-decreasing; hence the sigmoid fns are used in the implementation of the back propagation algorithm (which requires that the fn be everywhere differentiable).

The hyperbolic tangent is the bipolar sigmoid with σ = 2:

   f(x) = (1 − e^(−2x)) / (1 + e^(−2x))   and   f′(x) = (1 + f(x)) (1 − f(x))

Note: the f′ part of learning scales the error to force a stronger correction when the net input is near the rapid rise in the sigmoid curve, i.e. the peak of f′ is in the same position as the rise in the sigmoid curve.

Taxonomy of NNs
There are two phases in neural information processing:
1. Learning phase: a training data set is used to determine the weight parameters that define the neural model. NNs learn by adaptively updating the weights that characterize the strength of the connections; the adaptation may also involve altering the pattern of connections. The weights are updated according to the information extracted from the training patterns. Usually the optimal weights are obtained by optimizing a criterion fn,
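The sigmoid fns and their derivative identities can be written out directly (an illustrative sketch; the function names are invented):

```python
import math

def logistic(x, sigma=1.0):
    """Binary sigmoid, range (0, 1); sigma is the slope parameter."""
    return 1.0 / (1.0 + math.exp(-sigma * x))

def logistic_deriv(x, sigma=1.0):
    """f'(x) = sigma * f(x) * (1 - f(x))."""
    fx = logistic(x, sigma)
    return sigma * fx * (1.0 - fx)

def bipolar(x, sigma=1.0):
    """Bipolar sigmoid, range (-1, 1); symmetric about the origin."""
    e = math.exp(-sigma * x)
    return (1.0 - e) / (1.0 + e)

def bipolar_deriv(x, sigma=1.0):
    """f'(x) = (sigma / 2) * (1 + f(x)) * (1 - f(x))."""
    fx = bipolar(x, sigma)
    return (sigma / 2.0) * (1.0 + fx) * (1.0 - fx)

# the bipolar sigmoid with sigma = 2 is exactly the hyperbolic tangent
matches_tanh = abs(bipolar(0.7, sigma=2.0) - math.tanh(0.7)) < 1e-12
```

The derivative identities are what make these fns convenient for back propagation: f′ is computed from the already-available value of f, with no extra exponentials.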
e.g. to minimize the least-square error between the teacher value and the actual output value.
2. Retrieving phase: the trained neural model is used in this phase to process real patterns and yield classification results; the computed neuron values represent the desired output.

Supervised and unsupervised networks
NNs are commonly categorized in terms of their training algorithms:
- Fixed-weight networks
- Unsupervised networks
- Supervised networks
There is no learning required for the fixed-weight nets, so the learning model is either supervised or unsupervised.

Supervised networks require the pairing of each input vector with a target vector representing the desired output; put together, these are called a training pair. Usually a network is trained over a number of such training pairs: an input vector is applied, the actual output is compared with the target, and the weights are adjusted to reduce the difference.

Basic Delta Rule
This rule is also known as the Least Mean Squares (LMS) or Widrow-Hoff rule. The basic delta rule is given by

   Δwij = α (tj − yj) xi

where
   tj = the teacher value (or desired value) for unit j,
   yj = the actual output of unit j,
   α = the learning rate.
In other words,

   Δwij = α δ xi,   where δ = tj − yj,

i.e. the difference between the desired (target) output and the actual output. The delta rule modifies the weights appropriately for both continuous and binary inputs and outputs.

The delta rule minimizes the squares of the differences between the actual and the desired output values. The squared error for a particular training pattern is

   E = ∑j (tj − yj)²

where E is a fn of all the weights. The gradient of E is a vector consisting of the partial derivatives of E with respect to each of the weights. This vector gives the direction of most rapid increase in E; the opposite direction gives the direction of most rapid decrease in the error. The error can therefore be reduced most rapidly by adjusting the weight wij in the direction of −∂E/∂wij.
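One delta-rule update for a single linear unit can be sketched as follows (illustrative Python, not from the notes; the training pair, α, and the iteration count are arbitrary assumptions):

```python
def delta_rule_step(w, x, t, alpha):
    """One LMS / Widrow-Hoff update: y = w.x, delta = t - y, w_i += alpha * delta * x_i."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    delta = t - y
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
for _ in range(100):                      # repeated presentations of one training pair
    w = delta_rule_step(w, [1.0, 2.0], t=1.0, alpha=0.1)
# the unit's output w.x converges toward the target t = 1.0
```

Each step moves the weights along −∂E/∂wij for the squared error of that pattern, which is exactly the gradient-descent reading of the rule given above.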
Generalized delta rule
In a multilayer network, the determination of the error is a recursive process which starts with the output units; the error is back-propagated toward the input units. Therefore the rule is called error back propagation (BP):

   Δwij = α δj xi,   δj = (tj − yj) f′(y_inj),   where y_inj = ∑i wij xi + b

Back Propagation
Backpropagation is the generalization of the Widrow-Hoff learning rule to multiple-layer networks and nonlinear differentiable transfer functions. Input vectors and the corresponding target vectors are used to train a network until it can approximate a function, associate input vectors with specific output vectors, or classify input vectors in an appropriate way, as defined by the designer. Networks with biases, a sigmoid layer, and a linear output layer are capable of approximating any function with a finite number of discontinuities.

Standard backpropagation is a gradient descent algorithm, as is the Widrow-Hoff learning rule, in which the network weights are moved along the negative of the gradient of the performance function. The term backpropagation refers to the manner in which the gradient is computed for nonlinear multilayer networks. There are a number of variations on the basic algorithm that are based on other standard optimization techniques, such as conjugate gradient and Newton methods.

Properly trained backpropagation networks tend to give reasonable answers when presented with inputs that they have never seen. Typically, a new input leads to an output similar to the correct output for those training input vectors that are similar to the new input. This generalization property makes it possible to train a network on a representative set of input/target pairs and get good results without training the network on all possible input/output pairs. Two features of neural network software are designed to improve network generalization: regularization and early stopping.
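For a single sigmoid output unit, the generalized delta rule reduces to the sketch below (illustrative Python; the training pattern, α, and the iteration count are arbitrary assumptions, and the bias is updated as a weight from a unit fixed at 1):

```python
import math

def logistic(x):
    """Binary sigmoid f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def generalized_delta_step(w, b, x, t, alpha):
    """One gradient-descent step for a sigmoid unit:
    y_in = sum(w_i x_i) + b, y = f(y_in), delta = (t - y) f'(y_in)."""
    y_in = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = logistic(y_in)
    delta = (t - y) * y * (1.0 - y)        # f'(y_in) = f(y_in)(1 - f(y_in))
    w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
    b = b + alpha * delta                  # bias treated as a weight from a unit fixed at 1
    return w, b

w, b = [0.0, 0.0], 0.0
for _ in range(2000):
    w, b = generalized_delta_step(w, b, [1.0, 2.0], t=0.9, alpha=0.5)
# the unit's output approaches the target 0.9
```

The only difference from the basic delta rule above is the extra factor f′(y_in); the multilayer recursion for hidden units is spelled out in the algorithm that follows.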
The popularity of this algorithm lies in the network's flexibility in capturing hidden features in the application data. It overcomes the limitations of the simple perceptron, which can represent linearly separable fns only; the BP algorithm provides a general learning rule for the more powerful feedforward multilayer networks.

The basis for the algorithm is the optimization technique known as gradient descent, i.e. during learning, each of the connection wts is adjusted by an amount proportional to the gradient of a global measure of the network's error at the present location.

The algorithm involves three stages:
1. Feedforward of the input patterns;
2. Back propagation of the associated error;
3. Adjustment of the wts.
The algorithm may be used with one or several hidden layers.

Algorithm
Step 0: For a network with one hidden layer, initialize wts and biases to small random values. Set the learning rate α such that 0.1 < α ≤ 1.

[Figure: the network layout. Input units X1 ... Xn (with bias unit x0 = 1), hidden units z1 ... zp, output units y1 ... ym.]

Step 1: While the stopping condition is false, do steps 2 to 7.
Step 2: For each training pattern, do steps 3 to 6.

Feedforward
Step 3:
a. For each hidden unit zj, the net input is computed as follows:
   zj_in = ∑i xi wij   (where x0 supplies the bias)
The output signal is computed:
   zj = f(zj_in)
The output signal is sent to all units in the layer above.
b. For each output unit yk, the weighted input is:
   yk_in = ∑j zj vjk
and the output signal is yk = f(yk_in).

Back Propagation of Error
Step 4: For each output unit yk (k = 1, 2, ..., m):
a. Compute its error term (or delta term):
   δk = (tk − yk) f′(yk_in)
where f′(yk_in) is the 1st derivative of the activation fn.
b. Compute its wt correction term, to update the relevant wt later:
   Δvjk = α δk zj
c. Send δk to the units in the layer below.
Step 5: For each hidden unit zj (j = 1, 2, ..., p):
a. Sum up its delta inputs (from the layer above):
   δj_in = ∑ (k = 1..m) δk vjk
b. Calculate its error term δj:
   δj = δj_in f′(zj_in)
c.
Calculate its wt correction term, to update the relevant wt later:
   Δwij = α δj xi

Update weights
Step 6:
a. Each output unit yk updates its weights:
   vjk(new) = vjk(old) + Δvjk
b. Each hidden unit zj updates its wts:
   wij(new) = wij(old) + Δwij
Step 7: Test the stopping condition.

The BP algorithm may be used for a NN with an arbitrary no. of hidden layers. The rule always has the form

   Δwpq = α δ(output) v(input)

where input and output refer to the two ends p and q of the connection concerned, and v stands for the appropriate input-end activation: that of a hidden unit or a real input. Note that the meaning of δ differs for the last layer from all the other (hidden) layers.

Notes
1. The BP network consists of at least one hidden layer.
2. Each layer is fully connected to the next layer.
3. The no. of neurons in the hidden layer varies and depends on the particular problem.
4. The activation fn used is a sigmoid fn.
5. The commonly used error fn is
   E = 1/2 ∑ (µ, k) (tk − yk)²
where µ is the pattern no., i.e. the objective is to minimize the squared error between the teacher response and the actual response.

Convergence
When a network is trained successfully, it produces correct answers more and more often as training progresses. The root mean squared (RMS) error is usually calculated to reflect the degree to which learning has taken place. This measure reflects how close the network is to getting the correct answers. As the network learns, its RMS error decreases, and generally an RMS value below 0.1 indicates that a network has learned its training set. The network gets closer and closer to the target values incrementally with each step, hence it is possible to define a cutoff point at which the network's output is said to match the target values. Convergence is the process whereby the RMS value for the network approaches 0.
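Steps 0 to 7 can be sketched as a short Python program for one hidden layer. This is an illustrative sketch, not the notes' code: the XOR training set, α, the epoch count, and the ±0.5 initial-weight range are all assumptions, and each weight row carries its bias as a trailing weight from a unit fixed at 1.

```python
import math
import random

def f(x):
    """Binary sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def train_bp(patterns, n_in, n_hidden, n_out, alpha=0.5, epochs=5000, seed=0):
    """Steps 0-7 for one hidden layer; patterns is a list of (x, t) pairs."""
    rnd = random.Random(seed)
    # Step 0: initialize wts and biases to small random values
    w = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    v = [[rnd.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):                                   # Step 1
        for x, t in patterns:                                 # Step 2
            xb = list(x) + [1.0]                              # bias input x0 = 1
            # Step 3: feedforward
            z = [f(sum(wi * xi for wi, xi in zip(wj, xb))) for wj in w]
            zb = z + [1.0]
            y = [f(sum(vj * zj for vj, zj in zip(vk, zb))) for vk in v]
            # Step 4: output error terms, delta_k = (t_k - y_k) f'(y_in_k)
            dk = [(tk - yk) * yk * (1.0 - yk) for tk, yk in zip(t, y)]
            # Step 5: hidden error terms, delta_j = (sum_k delta_k v_jk) f'(z_in_j)
            dj = [sum(dk[k] * v[k][j] for k in range(n_out)) * z[j] * (1.0 - z[j])
                  for j in range(n_hidden)]
            # Step 6: weight updates
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    v[k][j] += alpha * dk[k] * zb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w[j][i] += alpha * dj[j] * xb[i]
    return w, v                                               # Step 7: fixed epoch budget here

def run(w, v, x):
    """Feedforward pass with trained weights."""
    zb = [f(sum(wi * xi for wi, xi in zip(wj, list(x) + [1.0]))) for wj in w] + [1.0]
    return [f(sum(vj * zj for vj, zj in zip(vk, zb))) for vk in v]

# XOR is not linearly separable, so it needs the hidden layer
xor = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
w, v = train_bp(xor, n_in=2, n_hidden=4, n_out=1)
outputs = [run(w, v, x)[0] for x, _ in xor]
```

The stopping condition here is simply a fixed epoch budget; a real run would test the RMS error discussed below. Training can occasionally stall in a local minimum, in which case restarting with a different seed (different initial wts) usually helps.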
However, convergence is not always easy to achieve, because:
1. The process may take an exceedingly long time.
2. The network may get stuck in a local minimum and stop learning altogether. A local minimum is surrounded by higher values, and the network usually does not leave a local minimum on its own.

[Figure: an error surface in two dimensions (E vs. w1), showing a local minimum near w1(old) and the global minimum near w1(new).]

Special steps may be used to get out of a local minimum:
1. Change the learning rate.
2. Change the no. of hidden units.
3. Add small random values to the wts.
4. Start the learning process again with different initial wts.

Generalization
This is the ability of the net to produce reasonable responses to input patterns that are similar, but not identical, to the training patterns. A balance between memorization and generalization is usually desired. It is common for the generalization performance of a neural network to become sub-optimal if training is allowed to continue indefinitely, as the model overfits (or memorizes) the training data.

The most commonly used technique to avoid overtraining is to monitor generalization performance, as training proceeds, on an independent data set (or test set); training is terminated when generalization performance fails to improve as training continues. This technique (called cross validation) has some disadvantages: it can be wasteful when training data is limited, and it can be very noisy between different test data sets, hence providing only an approximate indication of when to stop. Another technique is to restrict the size or interconnection density of the NN, thus restricting performance on the training set; one can also alter the network complexity by changing the no. of hidden nodes while maintaining full connectivity between layers, or both.
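The stop-when-generalization-stops-improving procedure can be sketched as a small loop. This is a hedged sketch: `train_epoch` and `validate` are hypothetical caller-supplied callbacks, and the `patience` window is an assumption, since the notes only say to stop when performance on the independent set fails to improve.

```python
def rms(errors):
    """Root mean squared error over a list of per-output errors."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

def train_with_early_stopping(train_epoch, validate, max_epochs=1000, patience=10):
    """train_epoch() runs one training epoch; validate() returns the error list
    on the independent test set. Stop when the validation RMS has not improved
    for `patience` consecutive epochs, and report the best value seen."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()
        e = rms(validate())
        if e < best:
            best, best_epoch = e, epoch
        elif epoch - best_epoch >= patience:     # no improvement within the window
            break
    return best
```

Tracking the best validation error rather than the latest one keeps a single noisy epoch from stopping training too early, which addresses the noisiness caveat mentioned above.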
Example
[Figure: the example network. Inputs X1 ... X5, hidden units H1 and H2, and an output layer; the weights used below are read from the figure.]

Input vector = (1.1, 2.4, 3.2, 5.1, 3.9)
Target = (0.52, 0.25, 0.75, 0.97)
Let the weights between hidden unit H and the output layer be vjk = (0.85, 0.62, −0.1, 0.21).
The weights entering unit H in the hidden layer are V = (−0.33, 0.07, −0.45, 0.13, 0.37), with bias b = 0.679.

   n = −0.33×1.1 + 0.07×2.4 − 0.45×3.2 + 0.13×5.1 + 0.37×3.9 + 0.679 (b)
   a = 1 / (1 + e^(−n)) = 0.7595   (the output of unit H)

We need the computed output vector; assume:
   Computed actual output = (0.61, 0.41, 0.57, 0.53)
   Target (desired) = (0.52, 0.25, 0.75, 0.97)
   Error = T − O = (−0.09, −0.16, 0.18, 0.44)

   δk = (tk − yk) f′(yk_in), where f′(yk_in) = yk (1 − yk) for the binary sigmoid
   δk (error-term vector) = (−0.02, −0.04, 0.04, 0.11)

Let α = 0.2. The corrections to the weights between the hidden layer and the output layer are:
   Δvjk = α δk zj = 0.2 × (−0.02, −0.04, 0.04, 0.11) × 0.7595 = (−0.003, −0.006, 0.006, 0.017)
   vjk(new) = vjk(old) + Δvjk = (0.85, 0.62, −0.1, 0.21) + (−0.003, −0.006, 0.006, 0.017)

For the hidden unit:
   δj_in = ∑k δk vjk = (0.85 × −0.02) + (0.62 × −0.04) + (−0.1 × 0.04) + (0.21 × 0.11)
   δj = δj_in f′(zj_in) = δj_in × 0.7595 × (1 − 0.7595) = −0.0041

Let α = 0.15 and take xi = x3 = 3.2:
   Δwij = α δj xi = 0.15 × 3.2 × (−0.0041) = −0.002
   wij(new) = wij(old) + Δwij = −0.45 + (−0.002) = −0.452
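The example's arithmetic can be checked with a short script; the rounded results reproduce the numbers above (variable names are illustrative).

```python
import math

def f(x):
    """Binary sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

x = [1.1, 2.4, 3.2, 5.1, 3.9]                # input vector
w_h = [-0.33, 0.07, -0.45, 0.13, 0.37]       # weights into hidden unit H
b_h = 0.679                                  # bias of H

n = sum(wi * xi for wi, xi in zip(w_h, x)) + b_h
a = f(n)                                     # output of H, ~0.7595

v = [0.85, 0.62, -0.10, 0.21]                # hidden-to-output weights
y = [0.61, 0.41, 0.57, 0.53]                 # assumed computed outputs
t = [0.52, 0.25, 0.75, 0.97]                 # targets

dk = [(tk - yk) * yk * (1.0 - yk) for tk, yk in zip(t, y)]   # output delta terms
dv = [0.2 * d * a for d in dk]                               # alpha = 0.2

dj_in = sum(d * vk for d, vk in zip(dk, v))
dj = dj_in * a * (1.0 - a)      # hidden delta; the notes get -0.0041 with rounded deltas
dw = 0.15 * dj * x[2]           # alpha = 0.15, input x3 = 3.2
w_new = w_h[2] + dw             # ~ -0.452
```

Because the script keeps full precision rather than rounding each δk to two decimals, its intermediate values differ from the hand calculation in the third decimal place, but the updated weight still rounds to −0.452.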