R in High Energy Physics (a somewhat personal account)
Adam Lyon
Fermi National Accelerator Laboratory, Computing Division - DØ Experiment
PHYSTAT Workshop on Statistical Software, MSU - March, 2004

Outline:
Some background
Why is R interesting to us?
Some non-analysis examples
Using R in HEP
Some thoughts on where this can go

Some background on me…
Graduate student on DØ (400-person Fermilab experiment)
Marc Paterno and I were among the first to use C++ for analysis at DØ (in the days of PAW)
… and the first at DØ to use Bayesian statistics for limit calculations
Postdoc on CLEO (Cornell) (200-person experiment)
Used PAW, ROOT & Mathematica for several analyses
Involved in the experiment's transition to C++
A. Lyon (FNAL/DØCA) – 2004
Back to DØ (now a 700-person Fermilab experiment) as an associate scientist in the Computing Division
Used R for non-HEP analysis applications
Pondering (with Marc Paterno and Jim Kowalkowski, also of FNAL/CD) how R can be made useful in HEP analyses

First use of R
Marc (C++ & statistics expert & troublemaker) came across R and showed it to Jim and me. It looked neat, but we didn't have any reason to use it until…
Monitoring of DØ's data handling system
DØ has 601 terabytes of data on tape
SAM (a DØ & CDF joint project) is our
• File storage system (knows where all files live)
• File delivery system (gets those files to you worldwide)
• File cataloging system (stores meta-data for file cataloging)
• Analysis bookkeeping system (remembers what you did)

Data Handling at DØ
SAM typically delivers ~150 TB of data to users per month
There was no monitoring except for huge dumps of text and log files
It's perhaps a 0th-generation GRID
Monitoring was sorely needed -- lots of things can go wrong
SAM is a very complicated system
Usage statistics were needed for future planning and discovery of bottlenecks

samTV
Monitoring with R
Turn a text file like this (from parsing big log files):

station  procId        time           event  fromStation    dur
cabsrv1 2983599  1074593577        OpenFile      enstore   9343
cabsrv1 2983599  1074604748 RequestNextFile           NA  11171
cabsrv1 2983599  1074609598        OpenFile      enstore   4850
cabsrv1 2983599  1074620392 RequestNextFile           NA  10794
cabsrv1 2983599  1074620392        OpenFile fnal-cabsrv1      0
cabsrv1 2983599  1074631505 RequestNextFile           NA  11113
cab     3085189  1076666381        OpenFile          cab    415
cab     3085189  1076673379 RequestNextFile           NA   6998
cab     3085189  1076673379        OpenFile          cab      0
cab     3085189  1076680426 RequestNextFile           NA   7047
cab     3085189  1076680753        OpenFile      enstore    327
cab     3085189  1076687836 RequestNextFile           NA   7083
cab     3085189  1076687836        OpenFile      enstore      0
cab     3085189  1076694821 RequestNextFile           NA   6985
cab     3085189  1076695114        OpenFile          cab    293
cab     3085189  1076702701 RequestNextFile           NA   7587
cab     3085189  1076702701        OpenFile      enstore      0
cab     3085189  1076710021 RequestNextFile           NA   7320
cab     3085189  1076710021        OpenFile      enstore      0
cab     3085189  1076717651 RequestNextFile           NA   7630
cab     3085189  1076717651        OpenFile          cab      0
(705,000 more lines like the above!)

Into plots like this…
R code:
    library(lattice)
    d = read.table("data.dat", header=T)
    w = d[ d$event=="OpenFile", ]
    w$min = w$dur/60.0
    bwplot( fromStation ~ min | station, data=w, subset=(min<60),
            xlab="Minutes", main="Wait Time for …" )

Box and Whisker Plots

Why is R interesting to us?
Seems to be the "state of the art" in statistics
Enormous library of user-contributed add-on packages
Huge number of statistical tests, fitting, smoothing, …
More advanced stuff too: genetic algorithms, support vector machines, kriging (would have been useful for my thesis!)
Advanced graphics based on William Cleveland's Visualizing Data
SQL (MySQL, Oracle, SQLite, Postgres, ODBC), XML
Hooks to COM and CORBA
Interfaces for Python, Perl, Tk (GUIs), Java
Pretty easy interface to C, C++, Fortran
Some nice conveniences (R can save its state)
It's multiplatform
It's free!

The R Language
"Not unlike S"
The author of S (John Chambers) received the 1998 ACM Software System Award: the ACM's citation notes that Dr. Chambers' work "will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers." (http://www.acm.org/announcements/ss99.html)
I guess he did good!

Why is the R/S Language Interesting?
"Programming with Data"
The fundamental purpose of the language (as I see it) is to provide general tools for efficient data manipulation and analysis, while allowing extensions to those tools to be programmed easily.
It has a specific purpose: you wouldn't write your online data acquisition system in R/S, but analyzing the output of online monitoring is certainly a good task for it.
R/S is a functional language: vectorized functions, apply, lazy evaluation
R/S is an object-oriented language (but with a functional bent): functions with the same name are dispatched based on argument types (it has notions of inheritance and other OO features)
Is R/S ideal?
Don't know, but we've been very surprised by how some complicated tasks can be accomplished with astonishingly simple code

Some non-analysis examples
samTV: Plot the mean wait times by file source for each SAM station

    > nrow(w)
    [1] 399135
    > w[1:2,]
      station  procId       time    event fromStation  dur   min
    1 cabsrv1 2983599 1074593577 OpenFile     enstore 9343 155.7
    2 cabsrv1 2983599 1074609598 OpenFile     enstore 4850  80.8
    > w.means = aggregate(w$min, list(station=w$station, src=w$fromStation), mean)   # 2.2 seconds
    > w.means[1:2,]
      station src           x
    1     cab cab 6.861695109
    2 cabsrv1 cab 8.171100917

samTV cont'd
    > dotplot(src ~ x | station, data=w.means,
              scales=list(cex=1.3),
              main=list("Mean Process Wait Times", cex=1.5),
              xlab=list("Wait time (minutes)", cex=1.5),
              cex=1.7, par.strip.text=list(cex=1.7) )

Non-analysis Examples
We've found R to be great for slogging through text files and database query results to make extremely useful and pretty plots

Non-analysis applications
Performance of DB server middleware (Marc Paterno)
Data transfer speed vs. data size for two different servers
Fit to a model of startup time plus constant throughput

    modpollux = nls( speed ~ alpha*(1 - alpha*beta/(alpha*beta + mb)),
                     data=client[pollux,],
                     start=c(alpha=2.0, beta=0.50), trace=T )

What have we learned so far?
There seems to be an "R way": do it the functional way! Use the apply commands and vectorized functions instead of for loops; use higher-order functions
One of R's strengths is its user contributions, but this means some functionality is repeated (e.g. three histogram functions -- albeit each serves a slightly different purpose)
The learning curve is long (R can do lots!), but there are extensive manuals, online documentation, and published books and papers

R in HEP
We are aware of no one using R, or any other statistical package, in the HEP community. Why?
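Not from the talk, but a minimal sketch of the "R way" just described, using made-up wait-time numbers: the loop fills elements one at a time, while the vectorized form plus tapply does the whole job in two lines.

```r
# Made-up wait times (seconds) for two stations -- illustrative only
dur     <- c(9343, 4850, 415, 327, 293)
station <- c("cabsrv1", "cabsrv1", "cab", "cab", "cab")

# The loop way: element-by-element, with explicit bookkeeping
mins <- numeric(length(dur))
for (i in seq_along(dur)) mins[i] <- dur[i] / 60

# The R way: operate on the whole vector at once,
# and let tapply do the per-station grouping
mins  <- dur / 60
means <- tapply(mins, station, mean)
```

The same pattern (a vectorized transformation followed by a grouped summary) is what the samTV aggregate call above does on 399,135 rows.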
Our needs are quite specific and…
My postdoc supervisor (Ed Thorndike): "Trust no one" -- "or at least trust no one outside of HEP"
With very few exceptions, all of our scientific software tools are written within the community
Many people write their own, reinventing lots of wheels
Most are unaware of tools from the statistics community and how they could apply to us
Many of us (including me) have little to no formal statistical training and had no exposure to statistical tools (e.g. SAS, SPSS, MATLAB, R)

R in HEP
Maybe this is changing, a little
Root, the most widely used HEP analysis tool, has TGraphSmooth, which implements the Loess smoother (R functions translated into C++)
Software is getting more complicated (we are doing lots more than just whipping up quick and dirty Fortran)
Some realization that we can't do it all ourselves (e.g. databases; SAM uses consultants)
But there is a problem: our datasets tend to be huge

HEP datasets and R
R seems to want to hold everything in memory (recently discovered externalVector; haven't tried it yet)
In HEP, we typically run successive skims to reduce the data size (601 TB down to 100s of megabytes or a few gigabytes)
Hard trade-offs between size and utility of skims
Usually skims are output to a more convenient format (e.g. Root files)
For example, I use a 4th-generation skim with 412 variables and 232K rows (1.9 gigabytes)
Even our last-stage skims are probably too large for R
Efficient handling of large datasets is one reason why Root is very successful

Three strategies for reading HEP data in R
Realize that I don't need all 412 variables for all rows in memory at the same time
In fact, I usually concentrate on just a few variables at a time
Perform even further event requirements
1. If the data is small enough, bring it into R
2.
If the data can be reduced to something R can hold, bring that subset of the data into R -- you then have the full power of R; perhaps this means using that data for a while, then loading a new set to tackle another aspect of the problem
3. If you can't even do the above, then have some R apparatus read the data one row at a time and update an R object (e.g. histograms) [but you don't get the full power of R]

Reading Root files into R
Do it the R way!
    root.apply("myTree", "myFile.root", myFunction)
The C++ and R code were written over the evenings of one weekend (my wife was out of town, the dog was asleep)
You supply an R function that receives an entry from your Root file (as a list)
The function can make requirements on the data, returning nothing if they fail
The function returns a new list of variables to pass back to R; these can be new derived variables not in the Root entry
The return value of root.apply is a data frame (an R database)
Example -- selecting dielectrons
    # Select events with two good electrons
    # Only the EM and MET branches are needed
    selectDiE = function(entry) {
        # Make a data frame of the electron data
        es = as.data.frame( entry$EM )

        # Make the requirements for a good electron
        goodECuts = ( es$id == 10 | abs(es$id) == 11 ) &
                    es$pt > 25.0 & es$emfrac > 0.9 & es$fiducial == 1

        # If nothing passed, then stop
        if ( ! any(goodECuts) ) return(NULL)

        # Get the electron etas
        etas = abs(es$eta)

        # Make the requirements for good etas
        goodEtaCuts = etas < 1.05 | ( etas > 1.7 & etas < 2.3 )

        # If no electrons had a good eta, then stop
        if ( ! any(goodEtaCuts) ) return(NULL)

        # Get the list of electrons meeting all cuts
        goodEs = goodEtaCuts & goodECuts

        # Now require that at least two electrons pass
        goodEsDF = es[goodEs,]
        if ( nrow(goodEsDF) < 2 ) return(NULL)

        ########## Construct the return list
        # Get the ordering of the electrons (descending pt)
        goodElectronsOrder = order( -goodEsDF$pt )

        e1 = goodEsDF[ goodElectronsOrder[[1]], ]
        names(e1) <- paste("e1", names(e1), sep=".")

        e2 = goodEsDF[ goodElectronsOrder[[2]], ]
        names(e2) <- paste("e2", names(e2), sep=".")

        # Return
        return( c(as.list(e1), as.list(e2), entry$MET) )
    }

Analyzing dielectrons
    d = root.apply("Global", "mydata.root", selectDiE)
We're handed back a data frame with the variables we wanted. We can now attack this data with the full power of R.
Dielectrons
    > d = root.apply(…)
    > given.met = equal.count(d$met, number=4, overlap=0.1)
    > summary(given.met)

    Intervals:
             min       max count
    1 0.05888367  4.103577  3044
    2 3.84661865  6.498108  3045
    3 6.21868896  9.914124  3046
    4 9.42181396 88.125061  3043

    Overlap between adjacent intervals:
    [1] 307 306 308

    > xyplot(e2.pt ~ e1.pt | given.met, data=d)

Extracting signal and background from data (from Marc Paterno)
Given a data sample, extract the amount of signal and background
Bump fitting -- a common HEP problem
Try a MC example:
1. Generate data based on a signal distribution (Breit-Wigner [Cauchy] with a mass and width) and a background distribution (1/(a+b*x)^3)
2. Fit this data with the background and signal distributions, but with unknown parameters

Bump Fitting
Generate the background distribution
bf returns a function that, when given a uniform random variable on [0,1), returns a value from the background distribution with parameters a and b
rbackground generates the distribution for n values
Clever use of higher-order functions and vectorized functions

    bf = function(a, b) {
        function(x) {
            temp = 1 - x
            (a/b)*(1/sqrt(temp) - 1)
        }
    }

    rbackground = function(n, a, b) {
        transform = bf(a, b)
        transform(runif(n))
    }

Bump Fitting
Generate the signal

    rsignal = function(n, mass, width, max) {
        # Generate n random Breit-Wigner values
        temp = rcauchy(n, mass, width)

        # Require that the distribution be positive and less than max;
        # throw away values that fail
        temp = temp[temp > 0 & temp < max]

        # Recursively call the function to make up the amount that was lost
        num.more = n - length(temp)
        if (num.more > 0) {
            more = rsignal(num.more, mass, width, max)
            temp = append(temp, more)
        }
        temp
    }

Make the data: join the signal and background into one distribution
    rexperiment = function(nsig, mass, width, nback, a, b) {
        append(rsignal(nsig, mass, width, a*b/2),
               rbackground(nback, a, b))
    }

Bump fitting
Use an unbinned maximum likelihood fitter (fitdistr, from MASS)
Profiling with Rprof led to a significant speedup of the fit (replacing ^ with multiplication)

    dbackground = function(x, a, b) {
        d = a + b*x
        2*a*a*b/(d*d*d)
    }

    mydistr = function(x, f, m, s, a, b) {
        (1-f)*dbackground(x, a, b) + f*dcauchy(x, m, s)
    }

    fres2 = fitdistr(data, densfun=mydistr,
                     start=list(f=FRAC, m=40.0, s=3.0, a=100, b=2.))

Bump Fitting
[Plot: the histogram is the generated data, overlaid with the true total distribution and the fitted signal, background, and total curves; the bottom plot shows the residuals (true - fit)]

What are we considering next?
Continue to learn more about R
Further develop the "Three Strategies"
Explore doing a physics analysis in R

Summary
We explored using R, a statistical analysis package from the statistics community, in an HEP environment
R has already proven useful for analyzing monitoring and benchmarking data
We have ideas on how R can be used to read large datasets
We've done some "proof of principle" studies of physics analysis with R
As we learn more about R, we expect to be more surprised at its capabilities

Options for R and Root Interfacing
(After discussions, there is no interest from the R community in the non-I/O functions of Root)
In order of work required:
1) R and Root remain separate -- use the more appropriate tool for the task. Use text files to communicate between the two if necessary.
2) Root loads R's math and low-level statistical libraries as shared objects
A minimalist approach for some functionality
Some access to R's math and statistics C functions
These C functions take basic C types, so no translation is necessary
But: no upper-level functions written in the R language are available
3) R and Root remain separate, but with an R package to read Root Trees directly into R data frames
Still use the best tool for a particular task
Now easier to get HEP data into R
4) Allow calling of selected high-level R functions from within Root
Root runs the R interpreter; translation is necessary
R functions must understand Root objects; Root must understand R return objects
Exposing only some R functions may reduce the amount of translation

More Advanced Integration Options
5) An R prompt from the Root prompt
R needs seamless knowledge of the objects in the current Root session
At the end of the R session, new R variables are translated into Root objects
Root runs the R interpreter
Translation for all types of Root variables into R, and for all types of R variables returned to Root -- a major undertaking
6) A Root prompt from within R
Harder than 5: R is C but Root is C++
I don't see much interest in this

Things get interesting starting at 3)
I have a version 0.0.1 prototype for reading Root trees into R -- required for options 3 and above. I'll try to work on this as time permits.
Both Root and R interface to Python -- translate with Python as an intermediary? Not sure whether that would perform well enough.
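Strategy 3 from the "Three strategies" slides -- read the data a row (or chunk) at a time and update an R object such as a histogram -- can be sketched in miniature. This is not from the talk; the file name, column name, and chunk size are made up, and a small demo file is written first so the sketch is self-contained. Only fixed-size bin counts live in memory, never the whole table.

```r
# Demo input: a one-column whitespace table with header "met" (hypothetical)
set.seed(1)
writeLines(c("met", format(runif(25000, 0, 100))), "bigfile.dat")

# Fixed histogram bins; only these counts are kept in memory
breaks <- seq(0, 100, by = 2)
counts <- numeric(length(breaks) - 1)

con <- file("bigfile.dat", open = "r")
header <- scan(con, what = "", nlines = 1, quiet = TRUE)
repeat {
    # Read the next chunk from the open connection; read.table errors
    # at end of file, which we turn into NULL to stop the loop
    chunk <- tryCatch(read.table(con, col.names = header, nrows = 10000),
                      error = function(e) NULL)
    if (is.null(chunk)) break
    met <- chunk$met
    met <- met[met >= 0 & met <= 100]     # keep only values inside the bins
    if (length(met))
        counts <- counts + hist(met, breaks = breaks, plot = FALSE)$counts
}
close(con)
sum(counts)   # total number of binned rows
```

A real version would wrap a Root-file reader (e.g. something like root.apply) instead of read.table, but the accumulation pattern is the same.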