Some Current Statistical
Considerations in Particle Physics
Byron P. Roe
Department of Physics
University of Michigan
Ann Arbor, MI 48109
• Preliminaries
• Nuisance Variables
• Modern Classification Methods (especially
boosting and related methods).
• Try to determine a
parameter λ given a
measurement x
• For each λ draw a line so that the
probability that x is within the
limits is 90%
• The probability of a result
falling in the region is 90%
• Given an x, then for 90%
of experiments λ is in that
region. This is the
Neyman Construction
Frequentist and Bayesian
• This construction is “frequentist”; no
probability is assigned to λ, a physical parameter
• Bayesian point of view: probability refers
to the state of knowledge of the parameter, and λ
can have a probability
• Formerly a “war” between the two views.
• People starting to realize each side has
some merits and uses; war abating
• At a given λ, 90% of the time x will fall in the region,
but do you want 5% on each side, or 8% on the
lower side and 2% on the upper?
• Useful ordering principle introduced into physics
by Feldman and Cousins: choose the region to
have the largest values of
R = (likelihood of this λ given x) / (best likelihood of any physical λ given x)
• Always gives a region; goes automatically from
limits to regions
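The ordering principle above can be sketched for the simplest case. This is a minimal illustration, assuming a Poisson counting experiment with a known mean background b and signal mean λ ≥ 0. The function names, the grid step and the cutoff `n_max` are hypothetical choices, and the raw construction shown here omits the adjustments used in published tables, so the numbers may differ slightly from quoted values.

```python
import math

def poisson_pmf(n, mean):
    """Poisson probability of n events for a given mean."""
    if mean == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))

def fc_accept(lam, b, cl=0.90, n_max=50):
    """Acceptance region in n for signal mean lam, ordered by the likelihood ratio R."""
    ranked = []
    for n in range(n_max + 1):
        lam_best = max(0.0, n - b)            # best physical signal mean for this n
        p = poisson_pmf(n, lam + b)
        p_best = poisson_pmf(n, lam_best + b)
        ranked.append((p / p_best, n, p))
    ranked.sort(reverse=True)                 # largest R first
    accepted, total = set(), 0.0
    for _, n, p in ranked:
        accepted.add(n)
        total += p
        if total >= cl:
            break
    return accepted

def fc_interval(n_obs, b, cl=0.90, lam_max=10.0, step=0.01):
    """Smallest and largest grid values of lam whose acceptance region contains n_obs."""
    grid = [i * step for i in range(int(lam_max / step) + 1)]
    inside = [lam for lam in grid if n_obs in fc_accept(lam, b, cl)]
    return min(inside), max(inside)

if __name__ == "__main__":
    print(fc_interval(0, 0.0))   # no background, 0 events observed
    print(fc_interval(0, 2.83))  # 0 events observed on an expected background of 2.83
```

With zero observed events the lower end is 0, so the construction returns an upper limit; with more observed events it returns a two-sided region without any change to the procedure.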
But is it new?
• Feldman and Cousins soon realized this
was a standard statistical technique
described in a text (Kendall and Stuart)
• Physicists have, in the past, often ignored
statistical literature to the detriment of both
physicists and statisticians
• In recent years, helped by conferences on
statistics in physics since 2000, there has
been more and more cooperation
Nuisance Parameters
• Experiments may depend on background,
efficiency… which are not the targets of
the experiment, but are needed to get to
the physical parameter λ
• These are called nuisance parameters
• The expectation values may be well
known or have an appreciable uncertainty.
Problem with Feldman Cousins
• The KARMEN experiment reported results
in 1999. It was checking the LSND experiment.
• Background was known to be 2.83 ± 0.13
• They observed 0 events and, using FC, set a
90% CL upper limit on λ of 1.1
• Common sense: 0 observed events means 0 signal events, for which the 90% CL limit is 2.3
• FC ordering is P given the data, BUT the 90% CL
is an overall P, not P given the data
Attempts to Improve Estimate
• With a statistician, Michael Woodroofe, I tried
different methods
• Suppose we try a Bayesian method, taking a uniform prior
(initial) probability for λ.
• Obtain a credible limit (Bayesian equivalent of CL)
• Go back to the frequentist view and look at coverage
• Coverage is quite close to frequentist and, with a small
modification, very close in almost all regions, but the method
gives 2.5 for the KARMEN limit, close to the desired 2.3
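As a rough illustration of the Bayesian step, the sketch below computes a credible upper limit for a Poisson signal mean with a known background and a plain uniform prior on λ ≥ 0. It is not the modified method mentioned above (which gave 2.5); with an unmodified uniform prior and zero observed events the limit is ln 10 ≈ 2.30 whatever the background. The function names are hypothetical.

```python
import math

def poisson_cdf(n, mean):
    """P(N <= n) for a Poisson distribution with the given mean."""
    term = math.exp(-mean)
    total = term
    for k in range(1, n + 1):
        term *= mean / k
        total += term
    return total

def bayes_upper_limit(n_obs, b, cl=0.90):
    """Credible upper limit on the signal mean for a uniform prior on lambda >= 0."""
    def tail(lam):
        # Posterior probability that the signal mean exceeds lam
        return poisson_cdf(n_obs, lam + b) / poisson_cdf(n_obs, b)

    lo, hi = 0.0, 1.0
    while tail(hi) > 1.0 - cl:      # expand until the limit is bracketed
        hi *= 2.0
    for _ in range(100):            # bisection
        mid = 0.5 * (lo + hi)
        if tail(mid) > 1.0 - cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print(bayes_upper_limit(0, 2.83))  # 0 events observed, background 2.83
```

The coverage check described above would then repeat this calculation for many simulated experiments at fixed λ and count how often the limit contains the true value.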
Nuisance Variables with Significant Uncertainty
• Can draw a 90% CL
region for joint
probability of λ and b
(nuisance par.)
• Project onto λ axis
and take extreme
values for CL
• Safe, but often
grossly over-covers
Nuisance Parameters 2
• 1. Integrate over the nuisance parameter b using the measured
probability of b. Introduces a Bayesian concept for b.
Tends to over-cover, though there are also claims of under-coverage
• 2. Suppose the maximum likelihood solution has values Λ, B.
Suppose, given a λ, the maximum likelihood solution for
b is b_λ. Consider:
$R = \mathrm{likelihood}(x \mid \lambda, b_\lambda) / \mathrm{likelihood}(x \mid \Lambda, B)$
(Essentially the FC ratio)
Let $G_{\lambda,b} = \mathrm{prob}_{\lambda,b}(R > C) = \mathrm{CL}$
Approximate: $G_{\lambda,b} \approx G_{\lambda,b_\lambda}$
Nuisance Parameters 3
• Use this and make a full Neyman construction.
Good coverage for a number of examples, OR…
• Assume -2 ln R approximately follows a χ² distribution.
(It does asymptotically.) This is the method of MINOS in
MINUIT. Coverage good in some recent cases.
Clipping required for nuisance parameters far
from expected values.
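The χ² approximation can be sketched for a toy model. The example assumes n events drawn from a Poisson distribution with mean λ + b, and a background b constrained by a Gaussian measurement b0 ± σ; both the model and the function names are illustrative assumptions, and this is not MINUIT code. For each λ the background is profiled analytically, and the interval is the set of λ with -2 ln R below 2.706, the 90% point of χ² with one degree of freedom.

```python
import math

def log_like(n, lam, b, b0, sigma):
    """Log-likelihood up to a constant: Poisson(n | lam + b) times Gauss(b0 | b, sigma)."""
    s = lam + b
    poisson_term = (n * math.log(s) if n > 0 else 0.0) - s
    return poisson_term - 0.5 * ((b - b0) / sigma) ** 2

def profiled_b(n, lam, b0, sigma):
    """Background value that maximizes the likelihood at fixed lam, kept non-negative."""
    c = sigma ** 2 - lam - b0
    s = 0.5 * (-c + math.sqrt(c * c + 4.0 * n * sigma ** 2))  # root of the score equation in s = lam + b
    return max(s - lam, 0.0)

def profile_interval(n, b0, sigma, threshold=2.706, lam_max=40.0, step=0.01):
    """Grid values of lam for which -2 ln R <= threshold; returns (low, high)."""
    grid = [i * step for i in range(int(lam_max / step) + 1)]
    prof = [log_like(n, lam, profiled_b(n, lam, b0, sigma), b0, sigma) for lam in grid]
    best = max(prof)
    inside = [lam for lam, ll in zip(grid, prof) if 2.0 * (best - ll) <= threshold]
    return inside[0], inside[-1]

if __name__ == "__main__":
    print(profile_interval(10, 3.0, 0.1))  # background well known
    print(profile_interval(10, 3.0, 1.5))  # background poorly known: wider interval
```

Replacing the fixed threshold by a value of C found from simulation at each (λ, b_λ) would turn this into the full Neyman construction mentioned above.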
Data Classification
• Given a set of events to separate into
signal and background and some partial
knowledge in the form of set of particle
identification (PID) variables.
• Make a series of cuts on PID variables—
often inefficient.
• Neural net. Invented by John Hopfield as
a method the brain might use to learn.
• Newer methods—boosting,…
Neural Nets and Modern Methods
• Use a training sample of events for which you know
which are signal and which are background.
• Practice an algorithm on this set, updating it and trying to
find best discrimination.
• Need a second, unbiased set to test the result on, the test
sample
• If the test set was used to determine parameters or
the stopping point of the algorithm, need a third set, the verification
sample
• Results here for testing samples. Verification samples in
our tests gave essentially same results.
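A minimal sketch of the three-sample bookkeeping follows. The split fractions, the seed and the function name are hypothetical; the only point is that the three samples are disjoint and that the verification sample is never touched while tuning.

```python
import random

def split_samples(events, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle events and split them into training, test and verification samples."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_test = int(fractions[1] * len(shuffled))
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    verification = shuffled[n_train + n_test:]   # everything that is left
    return training, test, verification

if __name__ == "__main__":
    training, test, verification = split_samples(range(1000))
    print(len(training), len(test), len(verification))
```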
Neural Network Structure
Combine the features
in a non-linear way to
a “hidden layer” and
then to a “final layer”
Use a training set to find
the best weights w_ik to
distinguish signal and
background
• Neural nets and most modern methods
use PID variables in complicated nonlinear ways. Intuition is somewhat difficult
• However, they are often much more
efficient than cuts and are used more and
more
• I will not discuss neural nets further, but
will discuss modern methods, namely boosting
Boosted Decision Trees
• What is a decision tree?
• What is “boosting the decision trees”?
• Two algorithms for boosting.
Decision Tree
• Go through all PID
variables and find the best
variable and value to split
the events
• For each of the two
subsets repeat the
process
• Proceeding in this way a
tree is built.
• Ending nodes are called
leaves
Select Signal and Background
• Assume an equal weight of signal and
background training events.
• If more than ½ of the weight of a leaf
corresponds to signal, it is a signal leaf;
otherwise it is a background leaf.
• Signal events on a background leaf or
background events on a signal leaf are
misclassified
Criterion for “Best” Split
• Purity, P, is the fraction of the weight of a
leaf due to signal events.
• Gini = (total weight of the leaf) × P(1 - P). Note that Gini is 0 for all signal or all
background
• The criterion is to minimize Gini_left + Gini_right
of the two children from a parent node
Criterion for Next Branch to Split
• Pick the branch to maximize the change in Gini:
Criterion = Gini_parent - Gini_right-child - Gini_left-child
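The split search can be sketched as follows. The code assumes the weighted Gini index W·P(1 - P), where W is the total weight on a node and P its purity, and represents each event as a hypothetical tuple (variables, label, weight) with label +1 for signal and -1 for background. It scans every variable and every cut between adjacent values and returns the split with the largest decrease in Gini.

```python
def gini(signal_weight, background_weight):
    """Weighted Gini index W * P * (1 - P); zero for a pure node."""
    total = signal_weight + background_weight
    if total == 0.0:
        return 0.0
    purity = signal_weight / total
    return total * purity * (1.0 - purity)

def best_split(events):
    """Return (gain, variable index, cut value) maximizing Gini_parent - Gini_left - Gini_right."""
    sig_total = sum(w for _, y, w in events if y == 1)
    bkg_total = sum(w for _, y, w in events if y == -1)
    parent = gini(sig_total, bkg_total)
    best = (0.0, None, None)
    for j in range(len(events[0][0])):
        ordered = sorted(events, key=lambda e: e[0][j])
        sig_left = bkg_left = 0.0
        for k in range(len(ordered) - 1):
            x, y, w = ordered[k]
            if y == 1:
                sig_left += w
            else:
                bkg_left += w
            x_next = ordered[k + 1][0][j]
            if x[j] == x_next:
                continue                      # cannot cut between equal values
            gain = (parent - gini(sig_left, bkg_left)
                    - gini(sig_total - sig_left, bkg_total - bkg_left))
            if gain > best[0]:
                best = (gain, j, 0.5 * (x[j] + x_next))
    return best

if __name__ == "__main__":
    events = [((0.1, 5.0), 1, 1.0), ((0.2, 1.0), 1, 1.0),
              ((0.8, 4.0), -1, 1.0), ((0.9, 2.0), -1, 1.0)]
    print(best_split(events))  # variable 0 separates signal from background
```

Growing a tree then amounts to applying `best_split` repeatedly to whichever existing leaf offers the largest gain, until the chosen number of leaves is reached.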
Decision Trees
• This is a decision tree
• They have been known for some time, but
often are unstable; a small change in the
training sample can produce a large
change in the tree
• Give the training
events misclassified
under this procedure
a higher weight.
• Continuing, build
perhaps 1000 trees
and do a weighted
average of the results
(1 if signal leaf, -1 if
background leaf).
Two Commonly used Algorithms for
changing weights
• 1. AdaBoost
• 2. Epsilon boost (shrinkage)
• x_i = set of particle ID variables for event i
• y_i = 1 if event i is signal, -1 if background
• T_m(x_i) = 1 if event i lands on a signal leaf of
tree m and -1 if the event lands on a
background leaf.
• Define err_m = (weight wrong)/(total weight)
$\alpha_m = \beta \log\left(\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right)$
Increase weight for misidentified events:
$w_i \to w_i \exp(\alpha_m)$
Scoring events with AdaBoost
• Renormalize weights
$w_i \to w_i / \sum_{i=1}^{N} w_i$
• Score by summing over trees
$T(x) = \sum_{m=1}^{N_{\mathrm{tree}}} \alpha_m T_m(x)$
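The AdaBoost bookkeeping for one tree can be sketched as below. The tree itself is taken as given: the sketch only needs a flag per event saying whether the tree misclassified it. Function names are hypothetical, and the code assumes 0 < err_m < 1 (a tree that is perfect or always wrong would need special handling).

```python
import math

def adaboost_update(weights, misclassified, beta=0.5):
    """One AdaBoost step: return (alpha_m, renormalized weights)."""
    total = sum(weights)
    err = sum(w for w, bad in zip(weights, misclassified) if bad) / total
    alpha = beta * math.log((1.0 - err) / err)
    boosted = [w * math.exp(alpha) if bad else w
               for w, bad in zip(weights, misclassified)]
    norm = sum(boosted)
    return alpha, [w / norm for w in boosted]

def adaboost_score(alphas, tree_outputs):
    """Score of one event: sum over trees of alpha_m * T_m(x), with T_m(x) = +1 or -1."""
    return sum(a * t for a, t in zip(alphas, tree_outputs))

if __name__ == "__main__":
    alpha, new_weights = adaboost_update([0.25] * 4, [True, False, False, False], beta=1.0)
    print(alpha, new_weights)
    print(adaboost_score([1.0, 0.5], [1, -1]))
```

With beta = 1 the misclassified events carry half of the total weight after the update, which is why the next tree is forced to concentrate on them.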
eBoost (shrinkage)
• After tree m, change weight of misclassified
events, typical e ~0.01 (0.03). For
misclassfied events:
wi  wi exp(2e )
• Renormalize weights
wi  wi /  wi
i 1
• Score by summing over trees
T ( x)   e Tm ( x)
m 1
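The corresponding ε-Boost step is simpler, since the weight change is the same fixed factor for every tree. As before, this is a sketch with hypothetical function names that takes the per-event misclassification flags as input.

```python
import math

def epsilon_boost_update(weights, misclassified, eps=0.01):
    """One epsilon-Boost step: multiply misclassified weights by exp(2*eps), then renormalize."""
    factor = math.exp(2.0 * eps)
    boosted = [w * factor if bad else w for w, bad in zip(weights, misclassified)]
    norm = sum(boosted)
    return [w / norm for w in boosted]

def epsilon_boost_score(tree_outputs, eps=0.01):
    """Score of one event: eps times the sum of the tree outputs T_m(x) = +1 or -1."""
    return eps * sum(tree_outputs)

if __name__ == "__main__":
    print(epsilon_boost_update([0.5, 0.5], [True, False], eps=0.03))
    print(epsilon_boost_score([1, 1, -1], eps=0.01))
```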
[Figure: unweighted and weighted misclassified event rate vs. number of trees]
Comparison of methods
• ε-Boost changes weights a little at a time
• Let y = 1 for signal, -1 for background, T = score
summed over trees
• AdaBoost can be shown to try to optimize
each change of weights. The expectation of exp(-yT) is
minimized
• The optimum value is
T = ½ log(odds), for the probability that y is 1 given x
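The half-log-odds result follows from a short calculation, sketched here under the assumption that the quantity being minimized is the expectation of exp(-yT) at fixed x, with p denoting the probability that y = 1 given x (the symbol p is introduced only for this sketch).

```latex
\begin{align*}
E\!\left[e^{-yT}\mid x\right] &= p\,e^{-T} + (1-p)\,e^{T},\\
\frac{\partial}{\partial T}\,E\!\left[e^{-yT}\mid x\right] &= -p\,e^{-T} + (1-p)\,e^{T} = 0
\;\Longrightarrow\; e^{2T} = \frac{p}{1-p},\\
T &= \tfrac{1}{2}\,\ln\frac{p}{1-p}.
\end{align*}
```

The second derivative, p e^{-T} + (1-p) e^{T}, is positive, so this stationary point is a minimum.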
Tests of Boosting Parameters
• 45 leaves seemed to work well for our application
• 1000 trees was sufficient (or over-sufficient).
• AdaBoost with β about 0.5 and ε-Boost with ε about 0.03
worked well, although small changes made little
difference
• For other applications these numbers may need
adjustment
• For MiniBooNE, around 50-100 variables are needed for best
results. Too many variables degrades performance.
• Relative ratio = const. × (fraction bkrd kept)/
(fraction signal kept).
Smaller is better!
Effects of Number of Leaves and
Number of Trees
Smaller is better! R = c × (frac. bkrd kept)/(frac. sig. kept)
Number of feature variables in
boosting
• In recent trials we have used 182 variables.
Boosting worked well.
• However, by looking at the frequency with which
each variable was used as a splitting variable, it
was possible to reduce the number to 86 without
loss of sensitivity. Several methods for choosing
variables were tried, but this worked as well as
any
• After using the frequency of use as a splitting
variable, some further improvement may be
obtained by looking at the correlations between
variables
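The frequency-of-use ranking can be sketched as a simple count. The input format is an assumption made for illustration: each tree is represented by the list of variable indices it used at its splitting nodes, and the function names are hypothetical.

```python
from collections import Counter

def rank_variables_by_use(split_variables_per_tree):
    """Variable indices ordered from most to least frequently used as a splitting variable."""
    counts = Counter(v for tree in split_variables_per_tree for v in tree)
    return [v for v, _ in counts.most_common()]

def keep_most_used(split_variables_per_tree, n_keep):
    """The n_keep most frequently used splitting variables."""
    return rank_variables_by_use(split_variables_per_tree)[:n_keep]

if __name__ == "__main__":
    trees = [[0, 3, 3], [3, 1], [0, 3]]   # toy record of splits in three trees
    print(rank_variables_by_use(trees))
    print(keep_most_used(trees, 2))
```

After retraining on the reduced list, the sensitivity should be compared with the full-variable result before any variable is dropped for good.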
Effect of Number of PID Variables
Comparison of Boosting and ANN
• Relative ratio here is
ANN bkrd
kept/Boosting bkrd
kept. Greater than one
implies boosting wins!
• A. All types of
background events.
Red is 21 and black is
52 training var.
• B. Bkrd is π⁰ events.
Red is 22 and black is
52 training variables
[Figure axis: percent ν_e CCQE kept]
• For either boosting or ANN, it is important to
know how robust the method is, i.e. will small
changes in the model produce large changes in
output?
• In MiniBooNE this is handled by generating
many sets of events with parameters varied by
about 1σ and checking on the differences. This
is not complete, but, so far, the selections look
quite robust for boosting.
How did the sensitivities change
with a new optical model?
• In Nov. 04, a new, much changed optical model
of the detector was introduced for making MC
events
• The reconstruction tunings needed to be
changed to optimize fits for this model
• Using the SAME PID variables as for the old
model:
• For a fixed background contamination of π⁰
events, the fraction of signal kept dropped by 8.3%
for boosting and dropped by 21.4% for ANN
• For ANN one needs to set temperature, hidden
layer size, learning rate… There are lots of
parameters to tune.
• For ANN if one
a. Multiplies a variable by a constant,
var(17) → 2·var(17)
b. Switches two variables
c. Puts a variable in twice
The result is very likely to change.
For Boosting
• Only a few parameters, and once set they have
been stable for all calculations within our
experiment
• If a variable x is replaced by y = f(x) with f monotonic (x1 > x2 implies y1 > y2),
then the results are identical, as they
depend only on the ordering of values.
• Putting variables in twice or changing the
order of variables has no effect.
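The invariance under monotonic transformations can be checked directly on a single split. The sketch assumes equal event weights and a Gini-based cut on one variable; `best_partition` and `impurity` are hypothetical helper names. Because the cut search only looks at the ordering of the values, replacing x by exp(x) leaves the selected events unchanged.

```python
import math

def impurity(n_signal, n_background):
    """Gini index W * P * (1 - P) for equal-weight events, written as s*b/(s+b)."""
    return n_signal * n_background / (n_signal + n_background)

def best_partition(values, labels):
    """Indices of the events on the low side of the best single cut on one variable."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_sig = sum(1 for y in labels if y == 1)
    n_bkg = len(labels) - n_sig
    best_gini, best_left = None, None
    sig_left = bkg_left = 0
    for k in range(len(order) - 1):
        if labels[order[k]] == 1:
            sig_left += 1
        else:
            bkg_left += 1
        g = impurity(sig_left, bkg_left) + impurity(n_sig - sig_left, n_bkg - bkg_left)
        if best_gini is None or g < best_gini:
            best_gini, best_left = g, frozenset(order[:k + 1])
    return best_left

if __name__ == "__main__":
    x = [0.3, 2.5, 1.1, 4.0, 0.7, 3.2]
    y = [1, -1, 1, -1, 1, -1]
    print(best_partition(x, y) == best_partition([math.exp(v) for v in x], y))
```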
Tests of Boosting Variants
• None clearly better than AdaBoost or
ε-Boost
• I will not go over most, except Random Forests
• For Random Forest, one uses only a random
fraction of the events (WITH replacement) per
tree and only a random fraction of the variables
per node. NO boosting is used, just many
trees. Each tree should go to completion: every
node very small or pure signal or background
• Our UM programs weren’t designed well for this
many leaves, and better results (Narsky) have
been obtained, but not better than boosting.
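The two random choices that define a Random Forest can be sketched as below. The fractions used in the example are arbitrary illustrations, not values from these studies, and the function names are hypothetical; building the trees themselves is left out.

```python
import random

def bootstrap_sample(events, fraction, rng):
    """Draw a random fraction of the events WITH replacement, for one tree."""
    n_draw = int(fraction * len(events))
    return [rng.choice(events) for _ in range(n_draw)]

def random_variable_subset(n_variables, fraction, rng):
    """Choose the random subset of variable indices to consider at one node."""
    n_keep = max(1, int(fraction * n_variables))
    return sorted(rng.sample(range(n_variables), n_keep))

if __name__ == "__main__":
    rng = random.Random(1)
    events = list(range(100))                      # stand-ins for real events
    print(len(bootstrap_sample(events, 0.5, rng)))
    print(random_variable_subset(52, 0.25, rng))
```

Each tree gets its own bootstrap sample, while a fresh variable subset is drawn at every node; the forest score is then an unweighted average over trees.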
Can Convergence Speed be
Improved?
• Removing correlations between variables
• Random Forest WHEN combined with
boosting
• Softening the step function scoring:
y = (2*purity - 1); score = sign(y)*sqrt(|y|).
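The two scoring functions can be compared with a few lines of code. This is a direct transcription of the formula above, with hypothetical function names; `purity` is the signal fraction of the weight on the leaf where the event lands.

```python
import math

def step_score(purity):
    """Step-function scoring: +1 on a signal leaf, -1 on a background leaf."""
    return 1.0 if purity > 0.5 else -1.0

def soft_score(purity):
    """Soft scoring: y = 2*purity - 1, score = sign(y) * sqrt(|y|)."""
    y = 2.0 * purity - 1.0
    return math.copysign(math.sqrt(abs(y)), y)

if __name__ == "__main__":
    for purity in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(purity, step_score(purity), soft_score(purity))
```

A leaf that is only marginally signal-like contributes little under soft scoring, whereas the step function gives it the same vote as a pure leaf.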
Soft Scoring and Step Function
Performance of AdaBoost with Step
Function and Soft Scoring Function
Conclusions for Nuisance Variables
• Likelihood ratio methods seem very useful as an
organizing principle with or without nuisance
variables
• Some problems in extreme cases where the data are
much smaller than expected
• Several tools for handling nuisance variables
were described. The method using approx.
likelihood to construct Neyman region seems to
have good performance.
Conclusions For Classification
• Boosting is very robust. Given a sufficient number of
leaves and trees AdaBoost or EpsilonBoost reaches an
optimum level, which is not bettered by any variant tried.
• Boosting was better than ANN in our tests by a factor of 1.2-1.8.
• There are ways (such as the smooth scoring function) to
increase convergence speed in some cases.
• Several techniques can be used for weeding variables.
Examining the frequency with which a given variable is
used works reasonably well.
• Downloads in FORTRAN or C++ available at:
Adaboost Output for Training and
Test Samples
The MiniBooNE Collaboration
40-ft-diameter tank of mineral oil, surrounded by about 1280
photomultipliers. Both Cherenkov and scintillation light.
Geometrical shape and timing distinguish events
Numerical Results from sfitter (a
second reconstruction program)
• Extensive attempt to find best variables for
ANN and for boosting starting from about
3000 candidates
• Train against π⁰ and related
backgrounds: 22 ANN variables and 50
boosting variables
• For the region near 50% of signal kept,
the ratio of ANN to boosting background
was about 1.2
• Post-Fitting is an attempt to reweight the
trees when summing tree scores after all
the trees are made
• Two attempts produced only a very
modest (few %), if any, gain.
