Status of SAM

R in High Energy Physics
(A somewhat personal account)
Adam Lyon
Fermi National Accelerator Laboratory
Computing Division - DØ Experiment
PHYSTAT Workshop on Statistical Software
MSU - March, 2004
Outline:
Some background
Why is R interesting to us?
Some non-analysis examples
Using R in HEP
Some thoughts on where this can go
Some background on me…

Graduate student on DØ (400 person Fermilab experiment)
 Marc Paterno and I were some of the first to use C++ for analysis at DØ (days of PAW)
 … and the first at DØ to use Bayesian statistics for limit calculation

Postdoc on CLEO (Cornell) (200 person experiment)
 Used PAW, ROOT & Mathematica for several analyses
 Involved in experiment's transition to C++
A. Lyon (FNAL/DØCA) – 2004

Back to DØ (now 700 person Fermilab experiment) as an associate scientist in Computing Division
 Used R for non-HEP analysis applications
 Pondering (with Marc Paterno and Jim Kowalkowski, also of FNAL/CD) how R can be made useful in HEP analyses
First use of R
Marc (C++ & Statistics expert & trouble maker) came across R and showed it to Jim and me.
 Looked neat, but we didn't have any reason to use it until…


Monitoring of DØ's Data Handling System
DØ has 601 Terabytes of data on tape
SAM (DØ & CDF joint project) is our
• File storage system (knows where all files live)
• File delivery system (gets those files to you worldwide)
• File cataloging system (stores meta-data for file cataloging)
• Analysis bookkeeping system (remembers what you did)
Data Handling at DØ

SAM typically delivers ~150 TB of data to users per month

No monitoring except for huge dumps of text and log files

It's perhaps a 0th generation GRID

Monitoring was sorely needed -- lots of things can go wrong

SAM is a very complicated system

Usage statistics were needed for future planning and discovery of bottlenecks
samTV
Monitoring with R

Turn a text file like this (from parsing big log files):
station procId  time       event           fromStation  dur
cabsrv1 2983599 1074593577 OpenFile        enstore      9343
cabsrv1 2983599 1074604748 RequestNextFile NA           11171
cabsrv1 2983599 1074609598 OpenFile        enstore      4850
cabsrv1 2983599 1074620392 RequestNextFile NA           10794
cabsrv1 2983599 1074620392 OpenFile        fnal-cabsrv1 0
cabsrv1 2983599 1074631505 RequestNextFile NA           11113
cab     3085189 1076666381 OpenFile        cab          415
cab     3085189 1076673379 RequestNextFile NA           6998
cab     3085189 1076673379 OpenFile        cab          0
cab     3085189 1076680426 RequestNextFile NA           7047
cab     3085189 1076680753 OpenFile        enstore      327
cab     3085189 1076687836 RequestNextFile NA           7083
cab     3085189 1076687836 OpenFile        enstore      0
cab     3085189 1076694821 RequestNextFile NA           6985
cab     3085189 1076695114 OpenFile        cab          293
cab     3085189 1076702701 RequestNextFile NA           7587
cab     3085189 1076702701 OpenFile        enstore      0
cab     3085189 1076710021 RequestNextFile NA           7320
cab     3085189 1076710021 OpenFile        enstore      0
cab     3085189 1076717651 RequestNextFile NA           7630
cab     3085189 1076717651 OpenFile        cab          0
(705,000 more lines like the above!)
Into plots like this…

R code:
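library(lattice)
d = read.table("data.dat", header=TRUE)
# Keep only the OpenFile rows; dur is in seconds
w = d[ d$event=="OpenFile", ]
w$min = w$dur/60.0
# One box-and-whisker panel per station, wait time by file source
bwplot( fromStation ~ min | station,
        data=w, subset=(min<60),
        xlab="Minutes",
        main="Wait Time for …" )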
Box and Whisker Plots
Why is R interesting to us?


Seems to be the "State of the Art" in statistics
Enormous library of user contributed add-on packages
 Huge number of statistical tests, fitting, smoothing, …
 More advanced stuff too: genetic algorithms, support vector machines, kriging (would have been useful for my thesis!)
 Advanced graphics based on William Cleveland's Visualizing Data
 SQL (MySQL, Oracle, SQLite, Postgres, ODBC), XML
 Hooks to COM and CORBA
 Interfaces for Python, Perl, Tk (GUIs), Java
Pretty easy interface to C, C++, Fortran
Some nice conveniences (R can save its state)
It's multiplatform
It's free!
The R Language
"Not unlike S"
 The author of S (John Chambers) received the 1998 ACM Software System Award:

The ACM's citation notes that Dr. Chambers' work "will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers."
(http://www.acm.org/announcements/ss99.html)

I guess he did good!
Why is the R/S Language Interesting?

"Programming with Data"
 The fundamental purpose of the language (as I see it) is to provide general tools for efficient data manipulation and analysis while allowing extensions to those tools to be programmed easily.
 Has a specific purpose. You wouldn't write your online data acquisition system in R/S. But analyzing output from online monitoring is certainly a good task for it.

R/S is a functional language
 vectorized functions, apply, lazy evaluation

R/S is an object oriented language (but with a functional bent)
 Functions with the same name are dispatched based on argument types (has notions of inheritance and other OO features)

Is R/S ideal? Don't know, but we've been very surprised by how some complicated tasks can be accomplished with astonishingly simple code
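
The language features named above can be seen in a few lines of base R. This is only an illustrative sketch: the numbers and the function names `first` and `describe` are made up for the example and do not come from the talk.

```r
# Vectorized arithmetic: the division is applied to every element at once
dur  = c(9343, 4850, 0, 415)          # seconds
mins = dur / 60

# apply family: run an anonymous function over each element of a list
waits     = list(cab = c(415, 0, 293), cabsrv1 = c(9343, 4850))
mean.wait = sapply(waits, function(v) mean(v) / 60)

# Lazy evaluation: 'unused' is never evaluated, so stop() is never called
first   = function(used, unused) used
lazy.ok = first(1, stop("never evaluated"))

# S3 dispatch: the method is chosen from the class of the argument
describe            = function(x, ...) UseMethod("describe")
describe.default    = function(x, ...) "something else"
describe.numeric    = function(x, ...) paste("numeric of length", length(x))
describe.data.frame = function(x, ...) paste("data frame with", nrow(x), "rows")

describe(c(1, 2, 3))
describe(data.frame(a = 1:2))
describe("text")
```

The three `describe` calls share one name but run three different functions, which is the dispatch-on-argument-type behaviour described on the slide.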
Some non-analysis examples

samTV: Plot the mean wait times by file source for each SAM station
> nrow(w)
[1] 399135
> w[1:2,]
  station  procId       time    event fromStation  dur   min
1 cabsrv1 2983599 1074593577 OpenFile     enstore 9343 155.7
2 cabsrv1 2983599 1074609598 OpenFile     enstore 4850  80.8
> # Mean wait in minutes for each (station, source) pair; took 2.2 seconds
> w.means = aggregate(w$min, list(station=w$station,
                                  src=w$fromStation), mean)
> w.means[1:2,]
  station src           x
1     cab cab 6.861695109
2 cabsrv1 cab 8.171100917
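
The whole chain (read the parsed log, keep the OpenFile rows, convert to minutes, average per station and source) can be tried on a handful of rows. The sketch below uses a few of the sample log lines shown earlier in the talk, pasted inline so it runs without a data file; the variable name `log.text` is an assumption for the example.

```r
# A few rows of the parsed log, inline instead of reading "data.dat"
log.text = "station procId time event fromStation dur
cabsrv1 2983599 1074593577 OpenFile enstore 9343
cabsrv1 2983599 1074604748 RequestNextFile NA 11171
cabsrv1 2983599 1074609598 OpenFile enstore 4850
cab 3085189 1076666381 OpenFile cab 415
cab 3085189 1076673379 RequestNextFile NA 6998
cab 3085189 1076673379 OpenFile cab 0
cab 3085189 1076680753 OpenFile enstore 327
cab 3085189 1076687836 OpenFile enstore 0
cab 3085189 1076695114 OpenFile cab 293"

d = read.table(text = log.text, header = TRUE, stringsAsFactors = FALSE)

# Keep the file-open events and express the wait in minutes
w     = d[d$event == "OpenFile", ]
w$min = w$dur / 60

# Aggregating a vector gives a result column named 'x', one row per group
w.means = aggregate(w$min, list(station = w$station, src = w$fromStation), mean)
print(w.means)
```

With only nine rows the means are not meaningful; the point is the shape of the result, one row per station and source with the mean in column `x`, which is what the dotplot call uses.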
samTV cont'd
> dotplot(src ~ x | station,
          data=w.means,
          scales=list(cex=1.3),
          main=list("Mean Process Wait Times", cex=1.5),
          xlab=list("Wait time (minutes)", cex=1.5),
          cex=1.7,
          par.strip.text=list(cex=1.7) )
Non-analysis Examples

We’ve found R to be great for slogging through text files and
database query results to make extremely useful and pretty plots
Non-analysis applications

Performance of DB server middleware (Marc Paterno)

Data transfer speed vs. data size for two different servers

Fit to model of startup time plus constant throughput
modpollux = nls( speed ~ alpha*(1 - alpha*beta/(alpha*beta + mb)),
                 data=client[pollux,], start=c(alpha=2.0, beta=0.50),
                 trace=TRUE )
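
The fitted formula follows from the "startup time plus constant throughput" model. If a transfer of mb megabytes takes a fixed startup time beta plus mb/alpha at throughput alpha, the observed speed is mb/(beta + mb/alpha) = alpha*mb/(alpha*beta + mb) = alpha*(1 - alpha*beta/(alpha*beta + mb)). The sketch below checks this on synthetic data; the data, the noise level and the true parameter values are invented for illustration and are not the measurements from the talk.

```r
set.seed(1)

# Invented "true" parameters: throughput alpha, startup time beta
alpha.true = 2.0
beta.true  = 0.5

# Synthetic transfers: size in mb, speed from the model plus a little noise
mb     = seq(0.1, 50, length.out = 200)
speed  = alpha.true * mb / (alpha.true * beta.true + mb) +
         rnorm(length(mb), sd = 0.02)
client = data.frame(mb = mb, speed = speed)

# Same model formula as on the slide, fitted by nonlinear least squares
fit = nls(speed ~ alpha * (1 - alpha * beta / (alpha * beta + mb)),
          data = client, start = c(alpha = 1.5, beta = 0.3))
est = coef(fit)
print(est)
```

For small transfers the startup time dominates and the speed is low; for large transfers the speed approaches alpha, which is why the curve flattens out.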
What have we learned so far?

There seems to be an "R way"
 Do it the functional way!
 Use the apply commands and vectorized functions instead of for loops
 Higher order functions

One of R's strengths is its user contributions
 but this means some functionality is repeated (e.g. three histogram functions -- albeit each serves a slightly different purpose)

The learning curve is long (R can do lots!)
 But there are extensive manuals, online documentation, and published books and papers
R in HEP

We are aware of no one using R, or any other statistical package, in the HEP community. Why?
 Our needs are quite specific and…
 My Postdoc supervisor (Ed Thorndike): "Trust no one"
 "Or at least trust no one outside of HEP"
 With very few exceptions, all of our scientific software tools are written within the community. Many people write their own, reinventing lots of wheels
 Most are unaware of tools from the statistics community and how they could apply to us
 Many of us (including me) have little to no formal statistical training and had no exposure to statistical tools (e.g. SAS, SPSS, MATLAB, R)
R in HEP

Maybe this is changing, a little
Root, the most widely used HEP analysis tool, has TGraphSmooth, which implements the Loess smoother (R functions translated into C++)
Software is getting more complicated (we are doing lots more than just whipping up quick and dirty Fortran). Some realization that we can't do it all ourselves (e.g. databases, SAM uses consultants)
But problem: our datasets tend to be huge
HEP datasets and R

R seems to want to hold everything in memory
 (recently discovered externalVector; haven't tried it yet)

In HEP, we typically run successive skims to reduce the data size (601 TB down to 100s of Meg or a few Gig)
 Hard trade offs between size and utility of skims
 Usually skims are output to a more convenient format (e.g. Root files)
 For example, I use a 4th generation skim with 412 variables and 232K rows (1.9 Gig)
 Even our last stage skims are probably too large for R
 Efficient handling of large datasets is one reason why Root is very successful
Three strategies for reading HEP data in R

Realize that I don't need all 412 variables for all rows in memory at the same time
 In fact usually concentrate on just a few variables at a time
 Perform even further event requirements

1. If data is small enough, bring it into R
2. If can reduce data to something R can hold, bring that subset of data into R -- have the full power of R
    Perhaps this means using that data for a while, and loading a new set to tackle another aspect of the problem
3. If can't even do above, then have some R apparatus to read in data one row at a time and update an R object (e.g. histograms) [But you don't get the full power of R]
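
Strategy 3 can be sketched in base R without any Root interface: stream the data in small pieces and keep only a running histogram in memory. Everything below is an assumption made for illustration (a plain-text file standing in for the skim, the column names `pt` and `eta`, the cut, the chunk size of 100 rows); it is not the apparatus the talk has in mind for Root files.

```r
set.seed(2)

# Stand-in for a large skim: a text file with two made-up variables
path = tempfile(fileext = ".dat")
full = data.frame(pt = rexp(1000, rate = 1/20), eta = runif(1000, -2.5, 2.5))
write.table(full, path, row.names = FALSE, quote = FALSE)

# The only object that stays in memory: fixed bins and their running counts
breaks = seq(0, 400, by = 5)
counts = numeric(length(breaks) - 1)

con  = file(path, open = "r")
cols = strsplit(readLines(con, n = 1), " ")[[1]]      # header line
repeat {
  lines = readLines(con, n = 100)                     # next 100 rows
  if (length(lines) == 0) break
  chunk = read.table(text = lines, col.names = cols)
  keep  = chunk$pt[abs(chunk$eta) < 1.05]             # event requirement
  # Bin the surviving values and add them to the running histogram
  counts = counts + tabulate(findInterval(keep, breaks),
                             nbins = length(breaks) - 1)
}
close(con)
```

Only `counts` grows with the analysis, never the data itself, which is the trade the slide describes: memory stays small, but the rows are gone once they have been binned.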
Reading Root files into R

Do it the R way!
root.apply("myTree", "myFile.root", myFunction)

C++ and R code written evenings of one weekend (my wife was out of town, dog was asleep)

You supply an R function that receives an entry from your Root file (as a list).
 Function can make requirements on the data and return nothing (NULL) if the entry fails
 Function returns a new list of variables to pass to R. Can be new derived variables not in the Root entry

Return of root.apply is a data frame (an R database)
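
The calling convention can be mimicked in pure R to show what the user-supplied function sees and what comes back. The sketch below is a hypothetical stand-in, not the root.apply implementation: `entries.apply`, the toy `entries` list and `selectLeading` are all invented names, and a plain R list plays the role of the Root tree.

```r
# Hypothetical stand-in for root.apply: apply 'fun' to each entry, drop the
# entries for which it returned NULL, and bind the rest into a data frame
entries.apply = function(entries, fun) {
  rows = lapply(entries, fun)
  rows = rows[!vapply(rows, is.null, logical(1))]
  do.call(rbind, lapply(rows, as.data.frame))
}

# Toy "tree": each entry is a list of branches, each branch a list of variables
entries = list(
  list(EM = list(pt = c(40, 30)),    MET = list(met = 5.1)),
  list(EM = list(pt = c(12)),        MET = list(met = 22.0)),
  list(EM = list(pt = c(55, 28, 9)), MET = list(met = 3.3))
)

# User function: require two electrons, return derived variables as a list
selectLeading = function(entry) {
  pt = entry$EM$pt
  if (length(pt) < 2) return(NULL)          # entry fails the requirement
  o = order(-pt)
  list(e1.pt = pt[o[1]], e2.pt = pt[o[2]], met = entry$MET$met)
}

d = entries.apply(entries, selectLeading)
print(d)
```

The second entry has only one electron, so it is dropped and `d` has two rows with columns `e1.pt`, `e2.pt` and `met`.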
Example -- selecting dielectrons
# Select events with two good electrons
# Only the EM and MET branches are needed
# (slide callouts are shown in [brackets])
#
selectDiE = function(entry) {                   # [R function definition]

  # Make dataframe of electron data
  es = as.data.frame( entry$EM ) ; attach(es)   # [Turn electron data into an R data frame]

  # Make the requirements for a good electron
  goodECuts = ( id == 10 | abs(id) == 11 ) &    # [Apply cuts to electrons.
              pt > 25.0 & emfrac > 0.9 &        #  Returns a boolean vector (T,T,F,T,F)]
              fiducial==1

  # If nothing passed, then stop
  if ( ! any(goodECuts) ) return(NULL)          # [Cut entry if nothing passed]

  # Get electron etas
  etas = abs(eta)

  # Make the requirements for good etas
  goodEtaCuts = etas < 1.05 |
                ( etas > 1.7 & etas < 2.3 )

  # If no electrons had a good eta, then stop
  if ( ! any(goodEtaCuts) ) return(NULL)

  # Get the list of electrons meeting all cuts
  goodEs = goodEtaCuts & goodECuts              # [Join (AND) the requirements]

  # Now require that at least two electrons pass
  goodEsDF = es[goodEs,]                        # [Make new data frame with passing electrons.
  if ( nrow(goodEsDF) < 2 ) return(NULL)        #  Are there 2 or more?]

  ########## Construct the return list
  # Get the ordering for the electrons
  goodElectronsOrder = order( -goodEsDF$pt )

  e1 = goodEsDF[ goodElectronsOrder[[1]], ]
  names(e1) <- paste("e1", names(e1), sep=".")

  e2 = goodEsDF[ goodElectronsOrder[[2]], ]
  names(e2) <- paste("e2", names(e2), sep=".")

  # Return
  return ( c(as.list(e1), as.list(e2), entry$MET ) )
}
Example -- selecting dielectrons
# Select events with two good electrons
# Only the EM and MET branches are needed
#
selectDiE = function(entry) {

  # Make dataframe of electron data
  es = as.data.frame( entry$EM )

  # Make the requirements for a good electron
  goodECuts = ( es$id == 10 | abs(es$id) == 11 ) &
              es$pt > 25.0 & es$emfrac > 0.9 &
              es$fiducial==1

  # If nothing passed, then stop
  if ( ! any(goodECuts) ) return(NULL)

  # Get electron etas
  etas = abs(es$eta)

  # Make the requirements for good etas
  goodEtaCuts = etas < 1.05 |
                ( etas > 1.7 & etas < 2.3 )

  # If no electrons had a good eta, then stop
  if ( ! any(goodEtaCuts) ) return(NULL)

  # Get the list of electrons meeting all cuts
  goodEs = goodEtaCuts & goodECuts

  # Now require that at least two electrons pass
  goodEsDF = es[goodEs,]
  if ( nrow(goodEsDF) < 2 ) return(NULL)

  ########## Construct the return list
  # Get the ordering for the electrons
  goodElectronsOrder = order( -goodEsDF$pt )

  e1 = goodEsDF[ goodElectronsOrder[[1]], ]
  names(e1) <- paste("e1", names(e1), sep=".")

  e2 = goodEsDF[ goodElectronsOrder[[2]], ]
  names(e2) <- paste("e2", names(e2), sep=".")

  # Return
  return ( c(as.list(e1), as.list(e2), entry$MET ) )
}
Analyzing dielectrons
d = root.apply("Global",
"mydata.root",
selectDiE)

Handed back a data frame with the variables I wanted.

Can now attack this data with the full power of R
Dielectrons
> d = root.apply(…)
> given.met = equal.counts(d$met, number=4, overlap=0.1)
> summary(given.met)

Intervals:
         min       max count
1 0.05888367  4.103577  3044
2 3.84661865  6.498108  3045
3 6.21868896  9.914124  3046
4 9.42181396 88.125061  3043

Ovrlap between adjacent intervals:
[1] 307 306 308
> xyplot(e2.pt ~ e1.pt | given.met, data=d)
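
The interval table above shows what "equal counts with overlap" means: four intervals holding roughly the same number of events, with neighbours sharing about 10% of their points. The sketch below reproduces that bookkeeping with base R's `co.intervals` from the graphics package, which, to my understanding, computes the same kind of intervals that lattice's `equal.counts` turns into a shingle. The `met` values are randomly generated stand-ins, not the talk's data.

```r
set.seed(4)

# Made-up stand-in for d$met
met = rexp(1000, rate = 1/6)

# Same arguments as the equal.counts call: 4 intervals, 10% overlap
iv = co.intervals(met, number = 4, overlap = 0.1)

# Number of points inside each interval
counts = apply(iv, 1, function(r) sum(met >= r[1] & met <= r[2]))

# Number of points shared by each pair of neighbouring intervals
shared = sapply(1:3, function(i) sum(met >= iv[i + 1, 1] & met <= iv[i, 2]))

print(cbind(min = iv[, 1], max = iv[, 2], count = counts))
print(shared)
```

With n points, each interval holds about n/3.7 of them for these settings, and each overlap about a tenth of that, which matches the proportions in the summary on the slide (about 3044 per interval, about 307 shared).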
Extracting signal and background from data


(From Marc Paterno)
Given a data sample, extract the amount of signal and background
 Bump fitting
 A common HEP problem

Try a MC example
1. Generate data based on a signal distribution (Breit-Wigner [Cauchy] of mass and width) and a background distribution (1/(a+b*x)^3)
2. Fit this data with the background and signal distributions, but with unknown parameters
Bump Fitting

Generate the background distribution
 bf returns a function that, when given a uniform random variable on [0,1), returns a value drawn from the background distribution with parameters a and b
 rbackground generates the distribution for n values
 Clever use of higher order functions and vectorized functions
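bf = function(a,b) {
  # Inverse CDF of the background density 2*a*a*b/(a+b*x)^3, x >= 0:
  # u -> (a/b)*(1/sqrt(1-u) - 1), rearranged to avoid cancellation near u = 0
  function(x) {
    temp=1-x;
    x*(a/b)/(temp+sqrt(temp))
  }
}
rbackground = function(n, a, b)
{
  transform = bf(a,b);
  transform(runif(n))
}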
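
The transform used for the background can be checked against the density. Assuming the normalized background density is 2a²b/(a+bx)³ for x ≥ 0 (the form of dbackground in the fit later in the talk), its CDF is F(x) = 1 - (a/(a+bx))², and solving u = F(x) gives x = (a/b)(1/sqrt(1-u) - 1), which equals (a/b)·u/((1-u) + sqrt(1-u)). The sketch below verifies this numerically; `pbackground` and `qbackground` are names introduced here for illustration.

```r
set.seed(3)
a = 100; b = 2

# Density, CDF and inverse CDF of the assumed background distribution
dbackground = function(x, a, b) { d = a + b*x; 2*a*a*b/(d*d*d) }
pbackground = function(x, a, b) 1 - (a/(a + b*x))^2
qbackground = function(u, a, b) (a/b) * u / ((1 - u) + sqrt(1 - u))

# The density integrates to one
norm = integrate(function(x) dbackground(x, a, b), 0, Inf)$value

# The inverse CDF undoes the CDF
u    = c(0, 0.1, 0.5, 0.9, 0.99)
back = pbackground(qbackground(u, a, b), a, b)

# Inverse-transform sampling: uniform numbers in, background values out
x = qbackground(runif(20000), a, b)

# Largest gap between the empirical CDF of the sample and the analytic CDF
xs      = sort(x)
max.gap = max(abs(seq_along(xs)/length(xs) - pbackground(xs, a, b)))
print(c(norm = norm, max.gap = max.gap))
```

A small `max.gap` means the generated sample follows the intended density, so it is a quick sanity check to run before fitting.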
Bump Fitting

Generate the signal
 Generate n random Breit-Wigner values
 Require that distribution be positive and less than max. Throw away values that fail
 Recursively call function to make up the amount that was lost

Make the data
 Join the signal and background into one distribution

rsignal = function(n, mass, width, max) {
  temp = rcauchy(n,mass,width);
  temp = temp[temp > 0 & temp < max];
  num.more = n - length(temp);
  if (num.more > 0) {
    more = rsignal(num.more, mass, width, max);
    temp = append(temp, more);
  }
  temp
}
rexperiment = function (nsig, mass, width, nback, a, b) {
  append(rsignal(nsig, mass, width, a*b/2),
         rbackground(nback, a, b))
}
Bump fitting


Use an unbinned maximum likelihood fitter (fitdistr from MASS)
Profiling with Rprof led to a significantly faster fit (replaced ^ with explicit multiplication)

library(MASS)
dbackground = function(x, a, b) {
  d = a+b*x
  2*a*a*b/(d*d*d)
}
mydistr = function(x, f, m, s, a, b) {
  (1-f)*dbackground(x,a,b) + f*dcauchy(x,m,s)
}
fres2 = fitdistr(data, densfun=mydistr,
                 start=list(f = FRAC, m=40.0, s=3.0, a=100, b=2.))
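
An unbinned maximum likelihood fit of this kind can also be written directly with base R's `optim`. The sketch below is a simplified variant and not the fit from the talk: it minimizes the negative log-likelihood itself instead of calling fitdistr, it leaves the signal untruncated, and it fits the single background parameter k = b/a, since 2a²b/(a+bx)³ can be rewritten as 2k/(1+kx)³. The sample sizes, starting values and true parameters are invented for the example.

```r
set.seed(42)

# Background density in terms of k = b/a (zero for x < 0)
dbackground = function(x, k) ifelse(x >= 0, 2*k/(1 + k*pmax(x, 0))^3, 0)
# Inverse-CDF sampler for the same density
rbackground = function(n, k) { u = runif(n); (u/k)/((1 - u) + sqrt(1 - u)) }

# Toy experiment: 1000 signal events (mass 40, width 3) on 4000 background events
x = c(rcauchy(1000, location = 40, scale = 3), rbackground(4000, k = 0.02))

# Negative log-likelihood of the mixture. The parameters are transformed so
# that f stays in (0,1) and s, k stay positive: p = (logit f, m, log s, log k)
nll = function(p) {
  f = plogis(p[1]); m = p[2]; s = exp(p[3]); k = exp(p[4])
  -sum(log(f*dcauchy(x, m, s) + (1 - f)*dbackground(x, k)))
}

start = c(qlogis(0.3), 38, log(4), log(0.03))
fit   = optim(start, nll, control = list(maxit = 2000, parscale = c(1, 10, 1, 1)))

# Back-transform to the physical parameters
est = c(f = plogis(fit$par[1]), m = fit$par[2],
        s = exp(fit$par[3]),    k = exp(fit$par[4]))
print(est)
```

The true signal fraction in this toy is 1000/5000 = 0.2. Transforming the parameters keeps the mixture density positive everywhere the optimizer looks, so the logarithm never sees a negative number.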
Bump Fitting






True distribution total
Histogram is generated data
Signal fit
Background fit
Total fit
Bottom plot is of residuals (true-fit)
What are we considering next?

Continue to learn more about R

Further develop the "Three Strategies"

Explore doing a physics analysis in R

Summary

We explore using R, a statistical analysis package from the statistics community, in an HEP environment

R has already proven useful for analyzing monitoring and benchmarking data

We have ideas on how R can be used to read large datasets

We've done some "proof of principle" studies of physics analysis with R

As we learn more about R, we expect to be more surprised at its capabilities
Options for R and Root Interfacing (after discussions)

No interest from R community in non-I/O functions of Root

In order of work required:
1) R and Root remain separate -- use the more appropriate tool for the task. Use text files to communicate between the two if necessary.
2) Root loads R's math and low level statistical libraries as shared objects
 Minimalist approach for some functionality
 Some access to the math and statistics C code functions from R
 These C functions take basic C types, so no translation necessary
 But: no upper level functions written in the R language available
3) R and Root remain separate, but:
 R package to read Root Trees directly into R data frames.
 Still use best tool for particular task
 Now easier to get HEP data into R
4) Allow calling of selected high level R functions from within Root
 Root runs the R interpreter
 Translation is necessary
  R functions: understand Root objects
  Root: understand R return objects
 Expose only some R functions
  may reduce amount of translation
More Advanced Integration Options
5) R prompt from the Root prompt
 Root runs the R interpreter
 R needs seamless knowledge of objects in current Root session
 At end of R session, new R variables translated into Root objects
 Translation for all types of Root variables into R and all types of R variables returned to Root.
 A major undertaking
6) Root prompt from within R
 Harder than 5: R is C but Root is C++
 I don't see much interest in this

Things get interesting starting at 3)

I have a version 0.0.1 prototype for reading Root trees into R.
 Required for all options above 3.
 I'll try to work on this as time permits

Both Root and R interface to Python
 Translate with Python as intermediary?
 Not sure if that's performant enough