Neural Networks – model-independent data analysis?
K. M. Graczyk
IFT, Uniwersytet Wrocławski, Poland
Why Neural Networks?
• Inspired by C. Giunti (Torino)
  – PDFs by Neural Network
  – Papers of Forte et al. (JHEP 0205:062, 2002; JHEP 0503:080, 2005; JHEP 0703:039, 2007; Nucl. Phys. B809:1-63, 2009)
• A kind of model-independent way of fitting data and computing the associated uncertainty
• Learn, Implement, Publish (the LIP rule)
  – Cooperation with R. Sulej (IPJ, Warszawa) and P. Płoński (Politechnika Warszawska)
• NetMaker
• GrANNet ;) my own C++ library
Road map
• Artificial Neural Networks (NN) – the idea
• Feed-forward NN
• PDFs by NN
• Bayesian statistics
• Bayesian approach to NN
• GrANNet
Inspired by Nature
The human brain consists of around 10^11 neurons, which are highly interconnected with around 10^15 connections.
Applications
• Function approximation, or regression analysis, including time-series prediction, fitness approximation, and modeling.
• Classification, including pattern and sequence recognition, novelty detection, and sequential decision making.
• Data processing, including filtering, clustering, blind source separation, and compression.
• Robotics, including directing manipulators and computer numerical control.
Artificial Neural Network
[Diagram: feed-forward network – input layer, hidden layer, output (target)]

The simplest example – linear activation functions – the network reduces to a matrix acting on the input (a minimal code sketch follows below):

\begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} =
\begin{pmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{pmatrix}
\begin{pmatrix} i_1 \\ i_2 \end{pmatrix}

[Diagram: the i-th perceptron – inputs, weights, summing junction, threshold, activation function, output]
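As a rough illustration of the slide above, here is a minimal feed-forward pass in C++ (an illustrative sketch of the idea, not code from GrANNet or NetMaker; all names are mine): a layer multiplies the input by a weight matrix, adds a bias, and applies an activation function; with linear activations it reduces to the matrix example.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Minimal feed-forward layer: out_k = g( sum_j w[k][j] * in_j + bias_k )
static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

static std::vector<double> forward(const std::vector<std::vector<double>>& w,
                                   const std::vector<double>& bias,
                                   const std::vector<double>& in,
                                   bool linear = false) {
    std::vector<double> out(w.size(), 0.0);
    for (std::size_t k = 0; k < w.size(); ++k) {
        double s = bias[k];
        for (std::size_t j = 0; j < in.size(); ++j) s += w[k][j] * in[j];
        out[k] = linear ? s : sigmoid(s);   // linear activations reproduce the matrix example
    }
    return out;
}

int main() {
    // 2 inputs -> 3 outputs with linear activations: exactly the 3x2 matrix of the slide
    std::vector<std::vector<double>> w = {{0.1, 0.2}, {0.3, 0.4}, {0.5, 0.6}};
    std::vector<double> bias = {0.0, 0.0, 0.0};
    std::vector<double> t = forward(w, bias, {1.0, 2.0}, /*linear=*/true);
    std::printf("t = (%.2f, %.2f, %.2f)\n", t[0], t[1], t[2]);
}
```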
Activation functions

Sigmoid:
g(x) = \frac{1}{1 + e^{-x}}

[Plot: sigmoid and tanh(x); below the threshold q_th the signal is weaker (suppressed), above it the signal is amplified]

• Heaviside step function θ(x): a 0-or-1 signal
• sigmoid function
• tanh(x)
• linear
(A small code sketch follows.)
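The activation functions listed above can be written down directly; the following is a small illustrative C++ sketch (function names are mine):

```cpp
#include <cmath>
#include <cstdio>

// Illustrative activation functions from the list above (names are mine, not library API).
double heaviside(double x)   { return x >= 0.0 ? 1.0 : 0.0; }   // 0-or-1 signal
double sigmoid(double x)     { return 1.0 / (1.0 + std::exp(-x)); }
double sym_sigmoid(double x) { return std::tanh(x); }            // "symmetric sigmoid"
double linear(double x)      { return x; }

int main() {
    for (double x = -4.0; x <= 4.0; x += 2.0)
        std::printf("x=%5.1f  theta=%.0f  sigmoid=%.3f  tanh=%.3f  linear=%.1f\n",
                    x, heaviside(x), sigmoid(x), sym_sigmoid(x), linear(x));
}
```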
Architecture
• A 3-layer network (two hidden layers): 1:2:1:1
• Parameters: 2+2+1 weights plus 2+1+1 biases, #par = 9
• Bias neurons instead of thresholds: each bias neuron feeds a constant signal of one
[Diagram: network output F(x) vs x; linear and symmetric sigmoid activation functions]
Neural Networks – Function Approximation
• The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. for the sigmoidal functions. (Wikipedia.org) The explicit one-hidden-layer form is written out below.
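For concreteness, the one-hidden-layer approximator referred to in the theorem can be written in the standard textbook form (not shown explicitly on the slides):

F(x) = \sum_{j=1}^{M} v_j \, g\!\left( \sum_k w_{jk} x_k + b_j \right) + c,

where g is a sigmoidal activation function; the theorem says that, for M large enough, F can approximate any continuous function on a compact domain arbitrarily well.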
[Diagram: a network with inputs (Q², x) representing the structure function F₂(Q², x; w_ij), and a network with inputs (Q², ε) representing σ(Q², ε; w_ij)]
A neural network is a map from one vector space to another.
Supervised Learning
• Propose an error function
  – in principle any continuous function which has a global minimum
  – motivated by statistics: the standard error function, chi², etc.
• Consider a set of data
• Train the given NN by showing it the data → minimize the error function (see the sketch below)
  – back-propagation algorithms
  – an iterative procedure which fixes the weights
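A minimal sketch of the chi²-type error function mentioned above (my own illustrative code, not the author's library implementation):

```cpp
#include <cstddef>
#include <vector>

// Chi^2-type error: E_D = 1/2 * sum_i ((y(x_i; w) - t_i) / sigma_i)^2
// y: network predictions, t: measured values, sigma: experimental uncertainties.
double chi2_error(const std::vector<double>& y,
                  const std::vector<double>& t,
                  const std::vector<double>& sigma) {
    double e = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        const double r = (y[i] - t[i]) / sigma[i];
        e += 0.5 * r * r;
    }
    return e;
}
```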
Learning Algorithms
• Gradient algorithms
  – Gradient descent
  – RPROP (Riedmiller & Braun)
  – Conjugate gradients
• Algorithms that look at the curvature
  – QuickProp (Fahlman)
  – Levenberg-Marquardt (Hessian)
  – Newton's method (Hessian)
• Monte Carlo algorithms (based on Markov chain methods)
(A sketch of an Rprop-style update follows below.)
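The following is a sketch of a Rprop-style update of the kind listed above (a generic textbook version with the iRprop- sign-reset, not the NetMaker or GrANNet implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Rprop-style update: the step size is adapted per weight from the sign of the gradient.
struct Rprop {
    double eta_plus = 1.2, eta_minus = 0.5, step_min = 1e-6, step_max = 50.0;
    std::vector<double> step, prev_grad;

    void update(std::vector<double>& w, std::vector<double> grad) {
        if (step.empty()) { step.assign(w.size(), 0.1); prev_grad.assign(w.size(), 0.0); }
        for (std::size_t i = 0; i < w.size(); ++i) {
            const double s = prev_grad[i] * grad[i];
            if (s > 0.0) {
                step[i] = std::min(step[i] * eta_plus, step_max);     // same sign: grow step
            } else if (s < 0.0) {
                step[i] = std::max(step[i] * eta_minus, step_min);    // sign change: shrink step
                grad[i] = 0.0;                                        // iRprop- variant: skip this update
            }
            if (grad[i] > 0.0)      w[i] -= step[i];
            else if (grad[i] < 0.0) w[i] += step[i];
            prev_grad[i] = grad[i];
        }
    }
};
```

Because the update uses only the sign of the gradient, it is insensitive to the gradient's magnitude, which is one reason these algorithms are popular for batch training of feed-forward networks.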
Overfitting
• More complex models describe the data better but lose generality
  – the bias-variance tradeoff
• Overfitting → large values of the weights
• Compare with the test set (it must be twice as large as the original training set)
• Regularization → an additional penalty term in the error function (a sketch follows below):

E_D \to E_D + \alpha E_W = E_D + \frac{\alpha}{2} \sum_{i=1}^{W} w_i^2, \qquad \alpha = \text{decay rate}

In the absence of data,

\frac{dw}{dt} = -\nabla E_D - \alpha w \;\to\; w(t) = w(0)\, e^{-\alpha t},

so the penalty drives the weights exponentially to zero.
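A minimal sketch of how the weight-decay penalty enters a gradient step (illustrative only): the gradient of αE_W is simply αw, so each update also pulls the weights toward zero, which is exactly the exponential decay quoted above when no data term is present.

```cpp
#include <cstddef>
#include <vector>

// Gradient-descent step with the weight-decay penalty alpha*E_W added to E_D:
// dE/dw_i = dE_D/dw_i + alpha * w_i, so every step also shrinks the weights toward zero.
void gd_step_with_decay(std::vector<double>& w,
                        const std::vector<double>& grad_ED,
                        double alpha, double learning_rate) {
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] -= learning_rate * (grad_ED[i] + alpha * w[i]);
}
```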
What about physics?
Data are still more precise than theory (e.g. PDFs).
[Diagram: Nature → observation/measurements → data. The physics is given directly by the data plus some general constraints. Most models (nonperturbative QCD, QED) contain free parameters. A model-independent (nonparametric) statistical analysis: statistical model → data → idea → uncertainty of the predictions.]
Fitting data with Artificial Neural Networks
'The goal of the network training is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process which generates the data.'
C. Bishop, 'Neural Networks for Pattern Recognition'
Parton Distribution Functions with NN
Some method, but…
[Diagram: a network with inputs (Q², x) giving F₂]
Parton Distribution Functions – S. Forte, L. Garrido, J. I. Latorre and A. Piccione, JHEP 0205 (2002) 062
• A kind of model-independent analysis of the data
• Construction of the probability density P[G(Q²)] in the space of the structure functions
  – in practice, only one neural-network architecture
• The probability density in the space of parameters of one particular NN
But what Forte et al. actually did:
• Generate Monte Carlo pseudo-data (the idea comes from W. T. Giele and S. Keller)
• Train N_rep neural networks, one for each set of N_dat pseudo-data points
• The N_rep trained neural networks provide a representation of the probability measure in the space of the structure functions
(A sketch of the replica procedure follows below.)
[Plots: uncertainty and correlation estimated from 10, 100 and 1000 replicas; training length on 30 data points – too short, long enough, too long (overfitting)]
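A minimal sketch of the replica procedure described above (illustrative code with my own names, not the code of Forte et al.): each pseudo-data replica shifts the measured values by Gaussian noise scaled by the experimental uncertainties; after training one network per replica, the mean and spread of the network responses estimate the central value and its uncertainty.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Generate N_rep pseudo-data replicas: t_i^(k) = t_i + sigma_i * r_i^(k), with r ~ N(0,1).
std::vector<std::vector<double>> make_replicas(const std::vector<double>& t,
                                               const std::vector<double>& sigma,
                                               int n_rep, unsigned seed = 12345) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<std::vector<double>> reps(n_rep, std::vector<double>(t.size()));
    for (int k = 0; k < n_rep; ++k)
        for (std::size_t i = 0; i < t.size(); ++i)
            reps[k][i] = t[i] + sigma[i] * gauss(gen);
    return reps;
}

// After training one network per replica, the mean and variance of the N_rep
// network responses y^(k)(x) at a point x estimate the central value and its uncertainty.
void replica_mean_and_error(const std::vector<double>& responses,
                            double& mean, double& error) {
    mean = 0.0;
    for (double y : responses) mean += y;
    mean /= responses.size();
    double var = 0.0;
    for (double y : responses) var += (y - mean) * (y - mean);
    error = std::sqrt(var / responses.size());
}
```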
My criticism
• Does the simultaneous use of artificial data and a chi² error function overestimate the uncertainty?
• Other NN architectures are not discussed.
• Problems with overfitting (a test set is needed).
• A relatively simple approach compared with present techniques in NN computing.
• The uncertainty of the model predictions should be generated by the probability distribution obtained for the model rather than by the data itself.
GraNNet – why?
• I stole some ideas from FANN
• A C++ library, easy to use
• User-defined error function (any you wish; an illustrative sketch follows after the next slide)
• Easy access to units and their weights
• Several ways of initializing a network of a given architecture
• Bayesian learning
• Main objects:
  – classes: NeuralNetwork, Unit
  – learning algorithms: so far QuickProp, Rprop+, Rprop-, iRprop-, iRprop+, …
  – network response uncertainty (based on the Hessian)
  – some simple restarting and stopping solutions
Structure of GraNNet
• Libraries:
  – Unit class
  – Neural_Network class
  – Activation (activation and error function structures)
  – Learning algorithms: RProp+, RProp-, iRProp+, iRProp-, QuickProp, BackProp
  – generatormt
  – TNT inverse matrix package
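To illustrate the 'user-defined error function' idea from the previous slide, here is a hypothetical design sketch in C++; it is my own illustration, not the actual GraNNet interface, and every name below, apart from the regularized error S = E_D + αE_W itself, is invented:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a user-supplied error function (not the GraNNet API):
// the training loop only needs an object it can call on (responses, targets, sigmas, weights).
struct ErrorFunction {
    virtual double operator()(const std::vector<double>& y,               // network responses
                              const std::vector<double>& t,               // targets
                              const std::vector<double>& sigma,           // uncertainties
                              const std::vector<double>& w) const = 0;    // network weights
    virtual ~ErrorFunction() = default;
};

// Regularized chi^2 error S = E_D + alpha * E_W used throughout these slides.
struct RegularizedChi2 : ErrorFunction {
    double alpha;
    explicit RegularizedChi2(double a) : alpha(a) {}
    double operator()(const std::vector<double>& y, const std::vector<double>& t,
                      const std::vector<double>& sigma,
                      const std::vector<double>& w) const override {
        double ed = 0.0, ew = 0.0;
        for (std::size_t i = 0; i < y.size(); ++i) {
            const double r = (y[i] - t[i]) / sigma[i];
            ed += 0.5 * r * r;
        }
        for (double wi : w) ew += 0.5 * wi * wi;
        return ed + alpha * ew;
    }
};
```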
Bayesian Approach
'Common sense reduced to calculations'

Bayesian framework for back-propagation NN (MacKay, Bishop, …):
• Objective criteria for comparing alternative network solutions, in particular with different architectures
• Objective criteria for setting the decay rate α
• Objective choice of the regularizing function E_W
• Comparison with test data is not required.
Notation and Conventions
• t_i – data point (vector)
• x_i – input (vector)
• y(x_i) – network response
• D = {(t_1, x_1), (t_2, x_2), …, (t_N, x_N)} – the data set
• N – number of data points
• W – number of weights
Model Classification
• A collection of models: H_1, H_2, …, H_k
• We believe the models are classified by prior probabilities P(H_1), P(H_2), …, P(H_k) (summing to 1)
• Usually at the beginning P(H_1) = P(H_2) = … = P(H_k)
• After observing the data D, apply Bayes' rule:

P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{P(D)},

where P(D | H_i) is the probability of D given H_i and P(D) is the normalizing constant.
Single Model Statistics
• Assume that model H_i is the correct one
• The neural network A with weights w is considered
• Task 1: assuming some prior probability of w, construct the posterior after including the data
• Task 2: consider the space of hypotheses and construct the evidence for them

\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}

P(w \mid D, A_i) = \frac{P(D \mid w, A_i)\, P(w \mid A_i)}{P(D \mid A_i)}, \qquad
P(D \mid A_i) = \int P(D \mid w, A_i)\, P(w \mid A_i)\, dw, \qquad
P(A_i \mid D) \propto P(D \mid A_i)\, P(A_i)
Hierarchy

P(w \mid D, \alpha, A) = \frac{P(D \mid w, \alpha, A)\, P(w \mid \alpha, A)}{P(D \mid \alpha, A)}

P(\alpha \mid D, A) = \frac{P(D \mid \alpha, A)\, P(\alpha \mid A)}{P(D \mid A)}

P(A \mid D) = \frac{P(D \mid A)\, P(A)}{P(D)}
Constructing the prior and posterior functions
Assume α = constant.

E_D = \frac{1}{2} \sum_i \left( \frac{y(x_i, w) - t(x_i)}{\sigma_i} \right)^2, \qquad
E_W = \frac{1}{2} \sum_i w_i^2, \qquad
S = E_D + \alpha E_W

Likelihood:
P(D \mid w, A) = \frac{\exp(-E_D)}{Z_D}, \qquad
Z_D = \int d^N t\, \exp(-E_D) = (2\pi)^{N/2} \prod_{i=1}^{N} \sigma_i

Prior:
P(w \mid \alpha, A) = \frac{\exp(-\alpha E_W)}{Z_W(\alpha)}, \qquad
Z_W(\alpha) = \int d^W w\, \exp(-\alpha E_W) = \left( \frac{2\pi}{\alpha} \right)^{W/2}

Posterior (the weight distribution!):
P(w \mid D, \alpha, A) = \frac{P(D \mid w, \alpha)\, P(w \mid \alpha)}{P(D \mid \alpha)} = \frac{\exp(-S)}{Z_M(\alpha)}, \qquad
Z_M(\alpha) = \int d^W w\, \exp(-(E_D + \alpha E_W))

[Plot: Gaussian prior P(w) centered at w = 0; posterior probability peaked at w_MP]
Computing the Posterior
Expand around the most probable weights w_MP:

S(w) \simeq S(w_{MP}) + \frac{1}{2}\, \Delta w^T A\, \Delta w

with the Hessian

A_{kl} = \nabla_k \nabla_l S
       = \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \left[ \nabla_k y_i \nabla_l y_i + (y_i - t(x_i))\, \nabla_l \nabla_k y_i \right] + \alpha \delta_{kl}
 \simeq \sum_{i=1}^{N} \frac{\nabla_k y_i \nabla_l y_i}{\sigma_i^2} + \alpha \delta_{kl}.

Then

Z_M \simeq (2\pi)^{W/2} |A|^{-1/2} \exp(-S(w_{MP}))

and the variance of the network response follows from the covariance matrix (see the sketch below):

\sigma_x^2 = \int dw\, \left[ y(w, x) - \bar y(x) \right]^2 \exp(-S(w)) \simeq \nabla y(w_{MP}, x)^T A^{-1} \nabla y(w_{MP}, x).
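A minimal sketch of the Hessian-based response uncertainty above (my own illustrative code, not the GraNNet/TNT implementation): solve A x = ∇y for x and take σ² = ∇yᵀ x.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hessian-based uncertainty of the network response:
//   sigma^2 = g^T A^{-1} g,  with A the Hessian of S at w_MP and g = grad_w y(w_MP, x).
// Computed by solving A x = g (Gaussian elimination with partial pivoting), then g . x.
double response_variance(std::vector<std::vector<double>> A, const std::vector<double>& g) {
    const std::size_t n = g.size();
    std::vector<double> rhs = g;                       // keep g for the final dot product
    for (std::size_t k = 0; k < n; ++k) {              // forward elimination
        std::size_t piv = k;
        for (std::size_t r = k + 1; r < n; ++r)
            if (std::fabs(A[r][k]) > std::fabs(A[piv][k])) piv = r;
        std::swap(A[k], A[piv]);
        std::swap(rhs[k], rhs[piv]);
        for (std::size_t r = k + 1; r < n; ++r) {
            const double f = A[r][k] / A[k][k];
            for (std::size_t c = k; c < n; ++c) A[r][c] -= f * A[k][c];
            rhs[r] -= f * rhs[k];
        }
    }
    std::vector<double> x(n);
    for (std::size_t k = n; k-- > 0; ) {               // back substitution
        double s = rhs[k];
        for (std::size_t c = k + 1; c < n; ++c) s -= A[k][c] * x[c];
        x[k] = s / A[k][k];
    }
    double var = 0.0;                                  // sigma^2 = g . x
    for (std::size_t i = 0; i < n; ++i) var += g[i] * x[i];
    return var;
}
```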
How do we fix a proper α?

p(w \mid D, A) = \int d\alpha\, p(w \mid \alpha, D, A)\, p(\alpha \mid D, A)

Two ideas:
• Evidence approximation (MacKay): find w_MP and α_MP
• Hierarchical approach: perform the integrals over α analytically

In the evidence approximation,

p(w \mid D, A) \simeq p(w \mid \alpha_{MP}, D, A) \int d\alpha\, p(\alpha \mid D, A) = p(w \mid \alpha_{MP}, D, A),

valid if p(α | D, A) is sharply peaked.
Getting α_MP

p(\alpha \mid D) = \frac{p(D \mid \alpha)\, p(\alpha)}{p(D)}, \qquad
p(D \mid \alpha) = \int p(D \mid w, \alpha)\, p(w \mid \alpha)\, dw = \frac{Z_M(\alpha)}{Z_D\, Z_W(\alpha)}

Setting \frac{d}{d\alpha} \log p(D \mid \alpha) = 0 gives

2\alpha E_W^{MP} = W - \sum_{i=1}^{W} \frac{\alpha}{\lambda_i + \alpha} \equiv \gamma, \qquad
\alpha_{new} = \frac{\gamma}{2 E_W},

where γ is the effective number of well-determined parameters and λ_i are the eigenvalues of the Hessian of E_D. This is an iterative procedure performed during training (see the sketch below).
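A minimal sketch of the evidence-approximation update for α described above (illustrative, assuming the eigenvalues λ_i of the data Hessian at w_MP are available):

```cpp
#include <vector>

// One evidence-approximation update of the decay rate alpha:
//   gamma     = sum_i lambda_i / (lambda_i + alpha)   (effective number of parameters)
//   alpha_new = gamma / (2 * E_W)
// lambda: eigenvalues of the Hessian of E_D at w_MP; ew: current value of E_W.
double update_alpha(const std::vector<double>& lambda, double ew, double alpha) {
    double gamma = 0.0;
    for (double l : lambda) gamma += l / (l + alpha);
    return gamma / (2.0 * ew);
}
// In practice this update is interleaved with the weight training: retrain w at fixed
// alpha, recompute E_W and the eigenvalues, update alpha, and iterate until convergence.
```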
Bayesian Model Comparison – the Occam Factor

P(A_i \mid D) \propto P(D \mid A_i)\, P(A_i) \propto P(D \mid A_i)

P(D \mid A_i) = \int p(D \mid w, A_i)\, p(w \mid A_i)\, dw \simeq p(D \mid w_{MP}, A_i)\, p(w_{MP} \mid A_i)\, \Delta w_{posterior}

If p(w_{MP} \mid A_i) \simeq 1/\Delta w_{prior}, then

P(D \mid A_i) \simeq \underbrace{p(D \mid w_{MP}, A_i)}_{\text{best-fit likelihood}} \times \underbrace{\frac{\Delta w_{posterior}}{\Delta w_{prior}}}_{\text{Occam factor}}
\simeq p(D \mid w_{MP}, A_i)\, p(w_{MP} \mid A_i)\, \frac{(2\pi)^{W/2}}{\sqrt{\det A}}.

• The log of the Occam factor ≈ the amount of information we gain once the data have arrived.
• Large Occam factor → complex models: larger accessible phase space (larger range of the posterior).
• Small Occam factor → simple models: small accessible phase space (smaller range of the posterior).
Evidence

\ln p(D \mid A) = -\alpha E_W^{MP} - E_D^{MP} - \frac{1}{2}\ln\det A + \frac{W}{2}\ln\alpha - \frac{N}{2}\ln 2\pi - \sum_{i=1}^{N}\ln\sigma_i + \ln g

• E_D^{MP}: the misfit of the interpolant to the data
• -\frac{1}{2}\ln\det A + \frac{W}{2}\ln\alpha: the Occam factor, a penalty term
• g = 2^M M!: the symmetry factor of a network with M hidden units (sign flips and permutations of the hidden units give equivalent solutions)
[Plots: F₂(Q², x) network fits; tanh(·) activations; solutions related by a change of the weight signs; the "Occam hill" of the evidence]
• Network 1-2-1 preferred by the data
• Network 1-3-1 preferred by the data
• The 1-3-1 network seems to be preferred by the data