Neural Networks – a Model-Free Analysis of Data?
K. M. Graczyk, IFT, Uniwersytet Wrocławski, Poland

Why Neural Networks?
• Inspired by C. Giunti (Torino) – PDFs by Neural Networks
• Papers of Forte et al. (JHEP 0205:062,2002; JHEP 0503:080,2005; JHEP 0703:039,2007; Nucl. Phys. B809:163,2009)
• A kind of model-independent way of fitting data and computing the associated uncertainty
• Learn, Implement, Publish (LIP rule) – cooperation with R. Sulej (IPJ, Warszawa) and P. Płoński (Politechnika Warszawska)
• NetMaker – GrANNet ;) my own C++ library

Road map
• Artificial Neural Networks (NN) – the idea
• Feed-forward NN
• PDFs by NN
• Bayesian statistics
• Bayesian approach to NN
• GrANNet

Inspired by Nature
• The human brain consists of around 10^11 neurons, which are highly interconnected through around 10^15 connections.

Applications
• Function approximation, or regression analysis, including time-series prediction, fitness approximation and modelling.
• Classification, including pattern and sequence recognition, novelty detection and sequential decision making.
• Data processing, including filtering, clustering, blind source separation and compression.
• Robotics, including directing manipulators and computer numerical control.

Artificial Neural Network
[Figure: the simplest example of a feed-forward network – input layer, hidden layer, output (targets t_1, t_2, t_3); the i-th perceptron sums the inputs i_1, i_2, i_3 weighted by the matrix w_ij, applies a threshold and an activation function, and produces its output.]

Activation functions
• Sigmoid: g(x) = \frac{1}{1 + e^{-x}}
• tanh(x)
• Heaviside step function θ(x) – a 0 or 1 signal
• Linear
Below the threshold θ the signal is weakened; above it the signal is amplified.
[Figure: plots of the sigmoid and tanh(x) activation functions.]

Architecture
• A 3-layer network with two hidden layers: 1:2:1:1 (a minimal C++ sketch of its forward pass is given below).
• Number of parameters: 2+2+1 weights plus 1+2+1 biases, #par = 9.
• Bias neurons are used instead of thresholds.
[Figure: signal flow through one unit, F(x) vs x, for a linear and a symmetric sigmoid activation function.]

Neural Networks – Function Approximation
• The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. for the sigmoidal functions. (Wikipedia.org)
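To make the feed-forward idea concrete, here is a minimal sketch of the forward pass of the 1:2:1:1 architecture from the slides above, assuming sigmoid hidden units, a linear output and bias neurons. It is illustrative only: the weight values are arbitrary and this is not the NetMaker/GrANNet code.

```cpp
// Minimal feed-forward pass for a 1:2:1:1 network with bias neurons
// (2+2+1 weights plus 1+2+1 biases -> 9 parameters, as on the slide).
// Illustrative sketch only -- not the GrANNet/NetMaker implementation.
#include <cmath>
#include <cstdio>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Parameters packed as
// [w_h1, w_h2, b_h1, b_h2, v1, v2, b_g, u, b_out]  (9 values, chosen arbitrarily)
double network_1_2_1_1(double x, const double w[9]) {
    double h1 = sigmoid(w[0] * x + w[2]);               // first hidden layer (2 units)
    double h2 = sigmoid(w[1] * x + w[3]);
    double g  = sigmoid(w[4] * h1 + w[5] * h2 + w[6]);  // second hidden layer (1 unit)
    return w[7] * g + w[8];                             // linear output unit
}

int main() {
    const double w[9] = {1.5, -2.0, 0.1, 0.3, 2.0, 1.0, -0.5, 1.2, 0.0};
    for (double x = -2.0; x <= 2.0; x += 1.0)
        std::printf("F(%+.1f) = %.4f\n", x, network_1_2_1_1(x, w));
    return 0;
}
```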
The network is a map from one vector space to another, e.g.
(Q^2, x) \to F_2(Q^2, x; w_{ij}), \qquad (Q^2, \varepsilon) \to \sigma(Q^2, \varepsilon; w_{ij}).
[Figure: F_2 as a function of x and Q^2.]

Supervised Learning
• Propose the error function – in principle any continuous function which has a global minimum.
• Motivated by statistics: the standard error function, χ², etc.
• Consider a set of data.
• Train the given NN by showing it the data → minimize the error function – back-propagation algorithms.
• An iterative procedure which fixes the weights.

Learning Algorithms
• Gradient algorithms
  – Gradient descent
  – RPROP (Riedmiller & Braun)
  – Conjugate gradients
• Algorithms that look at the curvature
  – QuickProp (Fahlman)
  – Levenberg–Marquardt (Hessian)
  – Newtonian method (Hessian)
• Monte Carlo algorithms (based on Markov chain algorithms)

Overfitting
• More complex models describe the data better but lose generality – the bias–variance trade-off.
• Overfitting → large values of the weights.
• Compare with a test set (it must be twice as large as the original set).
• Regularization → an additional penalty term in the error function:
  \tilde{E}_D = E_D + \alpha E_W, \qquad E_W = \frac{1}{2}\sum_{i=1}^{W} w_i^2, \qquad \alpha – the decay rate.
• Weight decay: \frac{dw}{dt} \propto -\frac{\partial \tilde{E}_D}{\partial w}; in the absence of data the weights decay as w(t) = w(0)\exp(-\alpha t).

What about physics?
[Diagram: Nature → observation/measurements → data. The data are still more precise than the theory (PDFs, nonperturbative QCD, QED). Most models are given directly by the physics plus some general constraints; the statistical model has free parameters. The idea: a model-independent, nonparametric statistical analysis of the data, including the uncertainty of the predictions.]

Fitting data with Artificial Neural Networks
'The goal of the network training is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process which generates the data.'
C. Bishop, 'Neural Networks for Pattern Recognition'

Parton Distribution Functions with NN
Some method is needed, but…
[Figure: F_2 as a function of x and Q^2.]

Parton Distribution Functions
S. Forte, L. Garrido, J. I. Latorre and A. Piccione, JHEP 0205 (2002) 062
• A kind of model-independent analysis of the data.
• Construction of the probability density P[G(Q^2)] in the space of the structure functions.
  – In practice only one neural-network architecture is used.
• Probability density in the space of parameters of one particular NN.

But in reality Forte et al. did:
• Generate Monte Carlo pseudo-data (the idea comes from W. T. Giele and S. Keller).
• Train N_rep neural networks, one for each set of N_dat pseudo-data points (a schematic sketch of this procedure is given after the criticism slide below).
• The N_rep trained neural networks provide a representation of the probability measure in the space of the structure functions → uncertainties and correlations.
[Figures: fits with 10, 100 and 1000 replicas; training length on 30 data points – too short, long enough, and too long (overfitting).]

My criticism
• Does the simultaneous use of artificial data and a χ² error function overestimate the uncertainty?
• Other NN architectures are not discussed.
• Problems with overfitting (a test set is needed).
• A relatively simple approach compared with the present techniques in NN computing.
• The uncertainty of the model predictions should be generated by the probability distribution obtained for the model rather than by the data itself.
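As a concrete illustration of the replica strategy described above (and criticized on the last slide), the sketch below generates N_rep pseudo-data sets by Gaussian smearing within the quoted errors, ignoring correlations, and propagates them to a central value and uncertainty. The per-replica "fit" is a trivial weighted average standing in for the training of one neural network per replica, and the data points are invented for the example.

```cpp
// Sketch of the Monte Carlo replica procedure of Forte et al. (JHEP 0205 (2002) 062):
// generate N_rep pseudo-data sets by smearing the data within errors, fit each
// replica, and read the mean and spread of the fits as central value and uncertainty.
// The per-replica "fit" below is a weighted average -- a stand-in for training
// one neural network per replica; the data are illustrative only.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

struct Point { double x, f, sigma; };   // (x_i, F2_i, sigma_i)

int main() {
    std::vector<Point> data = {{0.1, 0.35, 0.02}, {0.3, 0.30, 0.02}, {0.5, 0.22, 0.03}};
    const int Nrep = 1000;

    std::mt19937 gen(12345);
    std::normal_distribution<double> gauss(0.0, 1.0);

    std::vector<double> fit(Nrep);
    for (int k = 0; k < Nrep; ++k) {
        double num = 0.0, den = 0.0;
        for (const Point& p : data) {
            // pseudo-data replica: F_i^(k) = F_i + sigma_i * r,  r ~ N(0,1)
            double f_rep = p.f + p.sigma * gauss(gen);
            num += f_rep / (p.sigma * p.sigma);   // stand-in fit: weighted mean
            den += 1.0   / (p.sigma * p.sigma);
        }
        fit[k] = num / den;
    }

    // estimators over the ensemble of replica fits
    double mean = 0.0, var = 0.0;
    for (double y : fit) mean += y;
    mean /= Nrep;
    for (double y : fit) var += (y - mean) * (y - mean);
    var /= Nrep;

    std::printf("central value = %.4f, uncertainty = %.4f\n", mean, std::sqrt(var));
    return 0;
}
```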
GraNNet – why?
• I stole some ideas from the FANN C++ library; it is easy to use.
• User-defined error function (any you wish).
• Easy access to units and their weights.
• Several ways of initializing a network of a given architecture.
• Bayesian learning.
• Main objects:
  – Classes: NeuralNetwork, Unit
  – Learning algorithms: so far QuickProp, Rprop+, Rprop-, iRprop-, iRprop+, …
  – Network response uncertainty (based on the Hessian)
  – Some simple restarting and stopping solutions

Structure of GraNNet
• Libraries:
  – Unit class
  – Neural_Network class
  – Activation (activation and error function structures)
  – Learning algorithms: RProp+, RProp-, iRProp+, iRProp-, QuickProp, BackProp
  – generatormt
  – TNT inverse matrix package

Bayesian Approach
'Common sense reduced to calculations.'

Bayesian Framework for BackProp NN (MacKay, Bishop, …)
• Objective criteria for comparing alternative network solutions, in particular with different architectures.
• Objective criteria for setting the decay rate α.
• Objective choice of the regularizing function E_W.
• Comparing with test data is not required.

Notation and Conventions
• t_i – data point (target), a vector
• x_i – input, a vector
• y(x_i) – network response
• D = {(t_1, x_1), (t_2, x_2), …, (t_N, x_N)} – data set
• N – number of data points
• W – number of weights

Model Classification
• A collection of models H_1, H_2, …, H_k.
• We believe that the models are classified by P(H_1), P(H_2), …, P(H_k) (summing to 1).
• After observing the data D → Bayes' rule:
  P(H_i | D) = \frac{P(D | H_i)\, P(H_i)}{P(D)},
  where P(D | H_i) is the probability of D given H_i and P(D) is the normalizing constant.
• Usually at the beginning P(H_1) = P(H_2) = … = P(H_k).

Single Model Statistics
• Assume that model H_i is the correct one.
• The neural network A with weights w is considered.
• Task 1: assuming some prior probability of w, construct the posterior after including the data.
• Task 2: consider the space of hypotheses and construct the evidence for them.
• Posterior = Likelihood × Prior / Evidence:
  P(w | D, A_i) = \frac{P(D | w, A_i)\, P(w | A_i)}{P(D | A_i)}, \qquad
  P(D | A_i) = \int P(D | w, A_i)\, P(w | A_i)\, dw, \qquad
  P(A_i | D) \propto P(D | A_i)\, P(A_i).

Hierarchy
  P(w | D, \alpha, A) = \frac{P(D | w, \alpha, A)\, P(w | \alpha, A)}{P(D | \alpha, A)}, \qquad
  P(\alpha | D, A) = \frac{P(D | \alpha, A)\, P(\alpha | A)}{P(D | A)}, \qquad
  P(A | D) = \frac{P(D | A)\, P(A)}{P(D)}.

Constructing the prior and posterior functions
• Assume
  E_D = \frac{1}{2}\sum_{i=1}^{N} \frac{[y(x_i, w) - t(x_i)]^2}{\sigma_i^2}, \qquad
  E_W = \frac{1}{2}\sum_{i} w_i^2, \qquad
  S = \beta E_D + \alpha E_W.
• Likelihood:
  P(D | w, \beta, A) = \frac{\exp(-\beta E_D)}{Z_D(\beta)}, \qquad
  Z_D(\beta) = \int d^N t\, \exp(-\beta E_D) = \left(\frac{2\pi}{\beta}\right)^{N/2} \prod_{i=1}^{N} \sigma_i.
• Prior:
  P(w | \alpha, A) = \frac{\exp(-\alpha E_W)}{Z_W(\alpha)}, \qquad
  Z_W(\alpha) = \int d^W w\, \exp(-\alpha E_W) = \left(\frac{2\pi}{\alpha}\right)^{W/2}.
• Posterior – the weight distribution!
  P(w | D, \alpha, \beta, A) = \frac{P(D | w, \beta)\, P(w | \alpha)}{P(D | \alpha, \beta)} = \frac{\exp(-S(w))}{Z_M(\alpha, \beta)}, \qquad
  Z_M(\alpha, \beta) = \int d^W w\, \exp(-\beta E_D - \alpha E_W);
  the posterior is maximal at w_MP.
[Figure: examples of the prior P(w) and of the posterior probability distribution.]

Computing the posterior
• Expand around the maximum:
  S(w) \approx S(w_{MP}) + \frac{1}{2}\Delta w^T A\, \Delta w.
• Hessian:
  A_{kl} = \nabla_k \nabla_l S
         = \beta \sum_{i=1}^{N} \frac{1}{\sigma_i^2}\left[\nabla_k y_i\, \nabla_l y_i + (y_i - t(x_i))\, \nabla_k \nabla_l y_i\right] + \alpha\,\delta_{kl}
         \approx \beta \sum_{i=1}^{N} \frac{\nabla_k y_i\, \nabla_l y_i}{\sigma_i^2} + \alpha\,\delta_{kl},
  Z_M \approx (2\pi)^{W/2}\, |A|^{-1/2}\, \exp(-S(w_{MP})).
• Uncertainty of the network response (see the numerical sketch below):
  \sigma_x^2 = \frac{1}{Z_M}\int d^W w\, [y(w, x) - \langle y(x)\rangle]^2 \exp(-S(w))
             \approx \nabla y(w_{MP}, x)^T\, A^{-1}\, \nabla y(w_{MP}, x),
  where A^{-1} is the covariance matrix.

How to fix the proper α?
  p(w | D, A) = \int d\alpha\, p(w | \alpha, D, A)\, p(\alpha | D, A).
Two ideas:
• Evidence approximation (MacKay):
  – find w_MP and α_MP;
  – p(w | D, A) \approx p(w | \alpha_{MP}, D, A) \int d\alpha\, p(\alpha | D, A) = p(w | \alpha_{MP}, D, A), if p(α | D, A) is sharply peaked!
• Hierarchical approach: perform the integrals over α analytically.
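A minimal numerical illustration of the Hessian-based uncertainty from the 'Computing the posterior' slide: the Gauss–Newton approximation of A is built from the gradients of the response, inverted, and contracted with the gradient at a new point. To keep the inversion trivial the model here is linear with only W = 2 parameters; all numbers are made up for the example and the code is not part of GraNNet.

```cpp
// Sketch of the Hessian-based uncertainty of the network response:
//   sigma_y^2(x) = grad y(w_MP, x)^T  A^{-1}  grad y(w_MP, x),
// with the Gauss-Newton approximation A ~ beta * sum_i g_i g_i^T / sigma_i^2 + alpha * I.
// Toy model y(x; w) = w0 + w1*x with W = 2, so A is a 2x2 matrix invertible by hand.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // data abscissas and errors; alpha and beta as fixed hyperparameters (illustrative)
    std::vector<double> x     = {0.1, 0.3, 0.5, 0.7};
    std::vector<double> sigma = {0.02, 0.02, 0.03, 0.03};
    const double alpha = 0.5, beta = 1.0;

    // A_kl = beta * sum_i (1/sigma_i^2) d_k y_i d_l y_i + alpha * delta_kl,
    // with dy/dw0 = 1 and dy/dw1 = x_i for the linear toy model.
    double A00 = alpha, A01 = 0.0, A11 = alpha;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double wgt = beta / (sigma[i] * sigma[i]);
        A00 += wgt;                 // g0 * g0 = 1
        A01 += wgt * x[i];          // g0 * g1 = x_i
        A11 += wgt * x[i] * x[i];   // g1 * g1 = x_i^2
    }

    // 2x2 inverse: the covariance matrix A^{-1}
    double det = A00 * A11 - A01 * A01;
    double C00 = A11 / det, C01 = -A01 / det, C11 = A00 / det;

    // uncertainty of the response at a new point x0, gradient g = (1, x0)
    double x0  = 0.4;
    double var = C00 + 2.0 * C01 * x0 + C11 * x0 * x0;   // g^T A^{-1} g
    std::printf("sigma_y(%.2f) = %.4f\n", x0, std::sqrt(var));
    return 0;
}
```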
Getting α_MP
  p(\alpha | D) = \frac{p(D | \alpha)\, p(\alpha)}{p(D)}, \qquad
  p(D | \alpha) = \int p(D | w, \alpha)\, p(w | \alpha)\, dw = \frac{Z_M(\alpha)}{Z_D\, Z_W(\alpha)},
  \frac{d}{d\alpha}\log p(D | \alpha) = 0 \;\Rightarrow\; 2\alpha E_W^{MP} = \gamma, \qquad
  \gamma = \sum_{i=1}^{W} \frac{\lambda_i}{\lambda_i + \alpha} = W - \sum_{i=1}^{W} \frac{\alpha}{\lambda_i + \alpha},
  where the λ_i are the eigenvalues of the Hessian of βE_D and γ is the effective number of well-determined parameters.
• Iterative procedure during the training: \alpha \leftarrow \gamma / (2 E_W) (a sketch of this iteration is given at the end).

Bayesian Model Comparison – the Occam Factor
  P(A_i | D) \propto P(D | A_i)\, P(A_i),
  P(D | A_i) = \int p(D | w, A_i)\, p(w | A_i)\, dw \approx p(D | w_{MP}, A_i)\, p(w_{MP} | A_i)\, \Delta w_{\rm posterior};
  if p(w_{MP} | A_i) \approx 1/\Delta w_{\rm prior}, then
  P(D | A_i) \approx p(D | w_{MP}, A_i) \times \frac{\Delta w_{\rm posterior}}{\Delta w_{\rm prior}}
  = best-fit likelihood × Occam factor, \qquad
  \Delta w_{\rm posterior} = \frac{(2\pi)^{W/2}}{\sqrt{\det A}}.
• The log of the Occam factor measures the amount of information we gain about the model once the data have arrived.
• Complex models: a large accessible prior phase space, of which the posterior occupies only a small fraction → a small Occam factor (a strong penalty).
• Simple models: a small accessible prior phase space → a large Occam factor (a weak penalty).

Evidence
  \ln p(D | A) \approx -\beta E_D^{MP} - \alpha E_W^{MP} - \frac{1}{2}\ln\det A
  + \frac{W}{2}\ln\alpha + \frac{N}{2}\ln\beta - \sum_{i=1}^{N}\ln\sigma_i + \ln(2^M M!).
• -\beta E_D^{MP} – the misfit of the interpolant to the data.
• \ln(2^M M!) – the symmetry factor for a network with M hidden units: permuting the hidden units, or changing the sign of the weights of a tanh(·) unit, leaves the network response unchanged.
• The remaining terms – the Occam factor, i.e. the penalty term.

[Figures: example fits of F_2(x, Q^2) with 1-2-1 and 1-3-1 networks and the corresponding 'Occam hill' of the evidence; in some fits the 1-2-1 network is preferred by the data, but overall the 1-3-1 network seems to be preferred by the data.]
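Finally, a sketch of the iterative evidence-approximation update for the decay rate α from the 'Getting α_MP' slide. The eigenvalues λ_i and the value of E_W below are illustrative placeholders; in a real training they are recomputed from the network after every re-estimation of α.

```cpp
// Sketch of the evidence-approximation update for the decay rate alpha (MacKay):
//   gamma     = sum_i lambda_i / (lambda_i + alpha)   (well-determined parameters)
//   alpha_new = gamma / (2 * E_W)
// iterated during training. The eigenvalues of beta * Hessian(E_D) and E_W are
// placeholders; in practice they come from the current w_MP of the network.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> lambda = {120.0, 45.0, 8.0, 0.7, 0.05};  // eigenvalues (illustrative)
    double E_W   = 2.5;    // value of the regularizer at the current w_MP (illustrative)
    double alpha = 1.0;    // starting decay rate

    for (int it = 0; it < 20; ++it) {
        double gamma = 0.0;
        for (double l : lambda) gamma += l / (l + alpha);  // effective number of parameters
        alpha = gamma / (2.0 * E_W);                       // re-estimate the decay rate
        std::printf("iter %2d: gamma = %.3f, alpha = %.4f\n", it, gamma, alpha);
    }
    return 0;
}
```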