Introduction to the ROOT framework http://root.cern.ch Do-Son school on Advanced Computing and GRID Technologies for Research Institute of Information Technology, VAST, Hanoi René Brun CERN What's ROOT? René Brun 2 ROOT Application Domains Data Analysis & Visualization Data Storage: Local, Network René Brun 3 ROOT in a Nutshell The ROOT system is an Object Oriented framework for large scale data handling applications. It is written in C++. Provides, among others, an efficient data storage and access system designed to support structured data sets (PetaBytes) a query system to extract data from these data sets a C++ interpreter advanced statistical analysis algorithms (multi dimensional histogramming, fitting, minimization and cluster finding) scientific visualization tools with 2D and 3D graphics an advanced Graphical User Interface The user interacts with ROOT via a graphical user interface, the command line or scripts The command and scripting language is C++, thanks to the embedded CINT C++ interpreter, and large scripts can be compiled and dynamically loaded. A Python shell is also provided. René Brun 4 ROOT: An Open Source Project The project was started in 1995. The project is developed as a collaboration between: Full time developers: • 11 people full time at CERN (PH/SFT) • +4 developers at Fermilab/USA, Protvino , JINR/Dubna (Russia) Large number of part-time contributors (155 in CREDITS file) A long list of users giving feedback, comments, bug fixes and many small contributions • 2400 registered to RootForum • 10,000 posts per year An Open Source Project, source available under the LGPL license René Brun 5 More information... http://root.cern.ch Download Documentation Tutorials Online Help Mailing list Forum René Brun 6 Install ROOT Download & extract ROOT v5.16/00 from http://root.cern.ch/root/Version516.html Set of precompiled versions and source code Set environment export ROOTSYS=<path-to-installation> export PATH=$ROOTSYS/bin:$PATH export LD_LIBRARY_PATH=$ROOTSYS/lib:$LD_LIBRARY_PATH export DYLD_LIBRARY_PATH=$ROOTSYS/lib:$DYLD_LIBRARY_PATH (MacOS X only) Compile source code (if chosen) René Brun cd $ROOTSYS ./configure make More information at http://root.cern.ch/root/Install.html 7 ROOT Library Structure ROOT libraries are a layered structure CORE classes always required: support for RTTI, basic I/O and interpreter Optional libraries loaded when needed. Separation between data objects and the high level classes acting on these objects. Example: a batch job uses only the histogram library, no need to link histogram painter library Shared libraries reduce the application size and link time Mix and match, build your own application Reduce dependencies via plug-in mechanism René Brun 8 Three User Interfaces { // example // macro ... } René Brun GUI windows, buttons, menus Command line CINT (C++ interpreter), Python, Ruby,… Macros, applications, libraries (C++ compiler and interpreter) 9 CINT Interpreter CINT in ROOT CINT is used in ROOT: As command line interpreter As script interpreter To generate class dictionaries To generate function/method calling stubs Signals/Slots with the GUI The command line, script and programming language become the same Large scripts can be compiled for optimal performance René Brun 11 Compiled versus Interpreted Why compile? Faster execution, CINT has limitations… Why interpret? Faster Edit → Run → Check result → Edit cycles ("rapid prototyping"). Scripting is sometimes just easier. Are Makefiles dead? No! if you build/compile a very large application Yes! ACLiC is even platform independent! René Brun 12 Running Code To run function mycode() in file mycode.C: root [0] .x mycode.C Equivalent: load file and run function: root [1] .L mycode.C root [2] mycode() All of CINT's commands (help): root [3] .h René Brun 13 Running Code Macro: file that is interpreted by CINT (.x) int mymacro(int value) { int ret = 42; ret += value; return ret; } Execute with .x mymacro.C(42) René Brun 14 Unnamed Macros No functions, just statements { float ret = 0.42; return sin(ret); } Execute with .x mymacro.C No functions thus no arguments Recommend named macros! Compiler prefers it, too… René Brun 15 Named Vs. Unnamed Named: function scope mymacro.C: void mymacro() { int val = 42; } unnamed: global unnamed.C: { int val = 42; } Back at the prompt: root [] val Error: Symbol val is not defined René Brun root [] val (int)42 16 Survivors: Heap Objects obj local to function, inaccessible from outside: void mymacro() { MyClass obj; } Instead: create on heap, pass to outside: MyClass* mymacro() { MyClass* pObj = new MyClass(); return pObj; } pObj gone – but MyClass still where pObj pointed to! Returned by mymacro() René Brun 17 Running Code – Libraries "Library": compiled code, shared library CINT can call its functions! Build a library: ACLiC! "+" instead of Makefile (Automatic Compiler of Libraries for CINT) CINT knows all its functions / types: .x something.C(42)+ something(42) René Brun 18 My first session root root [0] 344+76.8 (const double)4.20800000000000010e+002 root [1] float x=89.7; root [2] float y=567.8; root [3] x+sqrt(y) (double)1.13528550991510710e+002 root [4] float z = x+2*sqrt(y/6); root [5] z (float)1.09155929565429690e+002 root [6] .q root See file $HOME/.root_hist root [0] try up and down arrows René Brun 19 My second session root root [0] .x session2.C for N=100000, sum= 45908.6 root [1] sum (double)4.59085828512453370e+004 Root [2] r.Rndm() (Double_t)8.29029321670533560e-001 root [3] .q session2.C { int N = 100000; TRandom r; double sum = 0; for (int i=0;i<N;i++) { sum += sin(r.Rndm()); } printf("for N=%d, sum= %g\n",N,sum); unnamed macro executes in global scope } René Brun 20 My third session root root [0] .x session3.C for N=100000, sum= 45908.6 root [1] sum Error: Symbol sum is not defined in current scope *** Interpreter error recovered *** Root [2] .x session3.C(1000) for N=1000, sum= 460.311 root [3] .q session3.C Named macro Normal C++ scope rules René Brun void session3 (int N=100000) { TRandom r; double sum = 0; for (int i=0;i<N;i++) { sum += sin(r.Rndm()); } printf("for N=%d, sum= %g\n",N,sum); } 21 My third session with ACLIC root [0] gROOT->Time(); root [1] .x session4.C(10000000) for N=10000000, sum= 4.59765e+006 Real time 0:00:06, CP time 6.890 root [2] .x session4.C+(10000000) for N=10000000, sum= 4.59765e+006 Real time 0:00:09, CP time 1.062 root [3] session4(10000000) for N=10000000, sum= 4.59765e+006 Real time 0:00:01, CP time 1.052 root [4] .q File session4.C Automatically compiled and linked by the native compiler. Must be C++ compliant René Brun session4.C #include “TRandom.h” void session4 (int N) { TRandom r; double sum = 0; for (int i=0;i<N;i++) { sum += sin(r.Rndm()); } printf("for N=%d, sum= %g\n",N,sum); } 22 Macros with more than one function root [0] .x session5.C >session5.log root [1] .q root [0] .L session5.C root [1] session5(100); >session5.log root [2] session5b(3) sum(0) = 0 sum(1) = 1 sum(2) = 3 root [3] .q .x session5.C executes the function session5 in session5.C use gROOT->ProcessLine to execute a macro from a macro or from compiled code René Brun session5.C void session5(int N=100) { session5a(N); session5b(N); gROOT->ProcessLine(“.x session4.C+(1000)”); } void session5a(int N) { for (int i=0;i<N;i++) { printf("sqrt(%d) = %g\n",i,sqrt(i)); } } void session5b(int N) { double sum = 0; for (int i=0;i<N;i++) { sum += i; printf("sum(%d) = %g\n",i,sum); } } 23 Running a script in batch mode root –b –q myscript.C myscript.C is executed by the interpreter root –b –q myscript.C(42) same as above, but giving an argument root –b –q myscript.C+ myscript.C is compiled via ACLIC with the native compiler and linked with the executable. root –b –q myscript.C++g(42) same as above. Script compiled in debug mode and an argument given. René Brun 24 Calling interpreter in a macro All CINT commands can be excuted from an interpreted macro or from compiled code. gROOT->ProcessLine(“.x myscript.C”); René Brun 25 ROOT Tutorials & Doc See Users Guide (pdf) at See online Reference Manual at Execute tutorials at $ROOTSYS/tutorials 293 tutorials in the following directories foam gui matrix net sql geom fft graphics io quadp ruby unuran René Brun spectrum xml physics gl pythia tree hist mlp splot image pyroot thread fit graphs math 26 Generating a Dictionary MyClass.h MyClass.cxx rootcint –f MyDict.cxx –c MyClass.h MyDict.h MyDict.cxx compile and link libMyClass.so René Brun 27 HTML René Brun 28 28 Math TMVA SPlot René Brun 29 Graphics & GUI TPad: main graphics container Root > TLine line(.1,.9,.6,.6) Root > line.Draw() Root > TText text(.5,.2,”Hello”) Root > text.Draw() Hello The Draw function adds the object to the list of primitives of the current pad. If no pad exists, a pad is automatically created with a default range [0,1]. When the pad needs to be drawn or redrawn, the object Paint function is called. René Brun Only objects deriving from TObject may be drawn in a pad Root Objects or User objects 31 Basic Primitives TLine TArrow TEllipse TButton TBox TDiamond TCurvyLine TText TPave TMarker TPavesText TCrown TPaveLabel TCurlyArc TPolyLine TLatex René Brun 32 Full LateX support on screen and postscript latex3.C Formula or diagrams can be edited with the mouse Feynman.C TCurlyArc TCurlyLine TWavyLine and other building blocks for Feynmann diagrams René Brun 33 Graphs TGraphErrors(n,x,y,ex,ey) TGraph(n,x,y) gerrors2.C TCutG(n,x,y) TMultiGraph TGraphAsymmErrors(n,x,y,exl,exh,eyl,eyh) René Brun 34 Graphics examples René Brun 35 René Brun 36 René Brun 37 René Brun 38 Graphics (2D-3D) TH3 TGLParametric “LEGO” René Brun “SURF” TF3 39 ASImage: Image processor René Brun 40 GUI (Graphical User Interface) René Brun 41 Canvas tool bar/menus/help René Brun 42 Object editor René Brun Click on any object to show its editor 43 TBrowser: a dynamic browser René Brun see $ROOTSYS/test/RootIDE 44 TBrowser Editor René Brun 45 GUI C++ code generator When pressing ctrl+S on any widget it is saved as a C++ macro file thanks to the SavePrimitive methods implemented in all GUI classes. The generated macro can be edited and then executed via CINT Executing the macro restores the complete original GUI as well as all created signal/slot connections in a global way root [0] .x example.C // transient frame TGTransientFrame *frame2 = new TGTransientFrame(gClient>GetRoot(),760,590); // group frame TGGroupFrame *frame3 = new TGGroupFrame(frame2,"curve"); TGRadioButton *frame4 = new TGRadioButton(frame3,"gaus",10); frame3->AddFrame(frame4); René Brun 46 The GUI Builder The GUI builder provides GUI tools for developing user interfaces based on the ROOT GUI classes. It includes over 30 advanced widgets and an automatic C++ code generator. René Brun 47 More GUI Examples $ROOTSYS/tutorials/gui $ROOTSYS/test/RootShower $ROOTSYS/test/RootIDE René Brun 48 Geometry The GEOMetry package is used to model very complex detectors (LHC). It includes -a visualization system -a navigator (where am I, distances, overlaps, etc) René Brun 49 OpenGL see $ROOTSYS/tutorials/geom René Brun 50 Peak Finder + Deconvolutions TSpectrum René Brun 51 Fitters Minuit Fumili LinearFitter RobustFitter Roofit René Brun 52 Histograms Analysis result: often a histogram Value (x) vs. count (y) Menu: View / Editor René Brun 53 Fitting Analysis result: often a fit based on a histogram René Brun 54 Fit Fit = optimization in parameters, e.g. Gaussian For Gaussian: f ( x ) [ 0] e ( x [1]) 2 2[ 2 ]2 [0] = "Constant" [1] = "Mean" [2] = "Sigma" / "Width" Objective: choose parameters [0], [1], [2] to get function as close as possible to histogram René Brun 55 Fit Panel To fit a histogram: right click histogram, "Fit Panel" Straightforward interface for fitting! René Brun 56 Roofit: a powerful fitting framework René Brun see $ROOTSYS/tutorials/fit/RoofitDemo.C 57 Input/Output Saving Data Streaming, Reflection, TFile, Schema Evolution René Brun 59 Saving Objects – the Issues long aLong = 42; WARNING: platform dependent! Storing aLong: How many bytes? Valid: 0x0000002a 0x000000000000002a 0x0000000000000000000000000000002a which byte contains 42, which 0? Valid: char[] {42, 0, 0, 0}: little endian, e.g. Intel char[] {0, 0, 0, 42}: big endian, e.g. PowerPC René Brun 60 Platform Data Types Fundamental data types: size is platform dependent Store "int" on platform A 0x000000000000002a Read back on platform B – incompatible! Data loss, size differences, etc: 0x000000000000002a René Brun 61 ROOT Basic Data Types Solution: ROOT typedefs Signed Char_t Unsigned UChar_t Short_t UShort_t 2 Int_t UInt_t 4 Long64_t ULong64_t 8 Double32_t René Brun sizeof [bytes] 1 float on disk, double in RAM 62 Object Oriented Concepts Class: the description of a “thing” in the system Object: instance of a class TObject Methods: functions for a class IsA Members: a “has a” relationship to the class. Inheritance: an “is a” relationship to the class. René Brun Event HasA Jets HasA HasA Tracks HasA Momentum HasA EvNum HasA Charge Segments 63 63 Saving Objects – the Issues class TMyClass { float fFloat; Long64_t fLong; }; TMyClass anObject; WARNING: platform dependent! Storing anObject: need to know its members + base classes need to know "where they are" René Brun 64 Reflection Need type description (aka reflection) 1. types, sizes, members TMyClass is a class. class TMyClass { float fFloat; Long64_t fLong; }; Members: "fFloat", type float, size 4 bytes "fLong", type Long64_t, size 8 bytes René Brun 65 Reflection René Brun – 10 – 8 – 6 – 4 – 2 – 0 TMyClass Memory Address Need type description 1. types, sizes, members 2. offsets in memory class TMyClass { float fFloat; – 16 Long64_t fLong; – 14 }; – 12 fLong PADDING fFloat "fFloat" is at offset 0 "fLong" is at offset 8 66 I/O Using Reflection – 16 – 14 – 12 – 10 – 8 – 6 – 4 – 2 – 0 René Brun 2a 00 00 00... TMyClass Memory Address members memory re-order fLong ...00 00 00 2a PADDING fFloat 67 C++ Is Not Java Lesson: need reflection! Where from? Java: get data members with Class.forName("MyClass").getFields() C++: get data members with – oops. Not part of C++. -- discussions for next C++ René Brun 68 Reflection For C++ Program parses header files, e.g. Funky.h: rootcint –f Dict.cxx Funky.h LinkDef.h Collects reflection data for types requested in Linkdef.h Stores it in Dict.cxx (dictionary file) Compile Dict.cxx, link, load: C++ with reflection! René Brun 69 Reflection By Selection LinkDef.h syntax: #pragma #pragma #pragma #pragma René Brun link link link link C++ C++ C++ C++ class MyClass+; typedef MyType_t; function MyFunc(int); enum MyEnum; 70 Why So Complicated? Simply use ACLiC: .L MyCode.cxx+ Will create library with dictionary of all types in MyCode.cxx automatically! René Brun 71 Saving Objects: TFile ROOT stores objects in TFiles: TFile* f = new TFile("afile.root", "NEW"); TFile behaves like file system: f->mkdir("dir"); TFile has a current directory: f->cd("dir"); René Brun 72 Saving Objects, Really Given a TFile: TFile* f = new TFile("afile.root", "RECREATE"); Write an object deriving from TObject: object->Write("optionalName"); "optionalName" or TObject::GetName() Write any object (with dictionary): f->WriteObject(object, "name"); René Brun 73 "Where Is My Histogram?" TFile owns histograms, graphs, trees (due to historical reasons): TFile* f = new TFile("myfile.root"); TH1F* h = new TH1F("h","h",10,0.,1.); h->Write(); Canvas* c = new TCanvas(); c->Write(); delete f; h automatically deleted: owned by file. c still there. René Brun 74 TFile as Directory: TKeys One TKey per object: named directory entry knows what it points to like inodes in file system Each TFile directory is collection of TKeys TFile "myfile.root" René Brun TKey: "hist1", TH1F TKey: "list", TList TKey: "hist2", TH2F TDirectory: "subdir" TKey: "hist1", TH2D 75 TKeys Usage Efficient for hundreds of objects Inefficient for millions: lookup ≥ log(n) sizeof(TKey) on 64 bit machine: 160 bytes Better storage / access methods available, René Brun 76 Risks With I/O Physicists can loop a lot: For each particle collision For each particle created For each detector module Do something. Physicists can loose a lot: Run for hours… Crash. Everything lost. René Brun 77 Name Cycles Create snapshots regularly: MyObject;1 MyObject;2 MyObject;3 … MyObject Write() does not replace but append! but see documentation TObject::Write() René Brun 78 The "I" Of I/O Reading is simple: TFile* f = new TFile("myfile.root"); TH1F* h = 0; f->GetObject("h", h); h->Draw(); delete f; Remember: TFile owns histograms! file gone, histogram gone! René Brun 79 Ownership And TFiles Separate TFile and histograms: TFile* f = new TFile("myfile.root"); TH1F* h = 0; TH1::AddDirectory(kFALSE); f->GetObject("h", h); h->Draw(); delete f; … and h will stay around. René Brun 80 Changing Class:a Problem Things change: class TMyClass { float fFloat; Long64_t fLong; }; René Brun 81 Changing Class: a Problem Things change: class TMyClass { double fFloat; Long64_t fLong; }; Inconsistent reflection data! Objects written with old version cannot be read! René Brun 82 Solution: Schema Evolution Support changes: 1. removed members: just skip data file.root Long64_t fLong; float fFloat; RAM ignore float fFloat; 2. added members: just default-initialize them (whatever the default constructor does) TMyClass(): fLong(0) file.root float fFloat; René Brun RAM Long64_t fLong; float fFloat; 83 Solution: Schema Evolution Support changes: 3. members change types file.root Long64_t fLong; float fFloat; RAM Long64_t fLong; double fFloat; RAM (temporary) float DUMMY; René Brun 84 Supported Type Changes Schema Evolution supports: float ↔ double, int ↔ long, etc (and pointers) TYPE* ↔ std::vector<TYPE> CLASS* ↔ TClonesArray Adding / removing base class equivalent to adding / removing data member René Brun 85 Storing Reflection Big Question: file == memory class layout? Need to store reflection to file, see TFile::ShowStreamerInfo(): StreamerInfo for class: TH1F, version=1 TH1 BASE offset= 0 type= 0 TArrayF BASE offset= 0 type= 0 StreamerInfo for class: TH1, TNamed BASE offset= 0 ... Int_t fNcells offset= 0 TAxis fXaxis offset= 0 René Brun version=5 type=67 type= 3 type=61 86 Class Version One TStreamerInfo for all objects of the same class layout in a file Can have multiple versions in same file Use version number to identify layout: class TMyClass: public TObject { public: TMyClass(): fLong(0), fFloat(0.) {} virtual ~TMyClass() {} ... ClassDef(TMyClass,1); // example class }; René Brun 87 Random Facts On ROOT I/O We know exactly what's in a TFile – need no library to look at data! ROOT files are zipped Combine contents of TFiles with $ROOTSYS/bin/hadd Can even open TFile("http://myserver.com/afile.root") including read-what-you-need! Nice viewer for TFile: new TBrowser René Brun 88 I/O Buffer Object in Memory Streamer: No need for transient / persistent classes sockets Net File http Web File XML XML File SQL DataBase Local File on disk René Brun 89 TFile / TDirectory A TFile object may be divided in a hierarchy of directories, like a Unix file system. Two I/O modes are supported Key-mode (TKey). An object is identified by a name (key), like files in a Unix directory. OK to support up to a few thousand objects, like histograms, geometries, mag fields, etc. TTree-mode to store event data, when the number of events may be millions, billions. René Brun 90 Self-describing files Dictionary for persistent classes written to the file. ROOT files can be read by foreign readers Support for Backward and Forward compatibility Files created in 2001 must be readable in 2015 Classes (data objects) for all objects in a file can be regenerated via TFile::MakeProject Root >TFile f(“demo.root”); Root > f.MakeProject(“dir”,”*”,”new++”); René Brun 91 Example of key mode void keywrite() { TFile f(“keymode.root”,”new”); TH1F h(“hist”,”test”,100,-3,3); h.FillRandom(“gaus”,1000); h.Write() } void keyRead() { TFile f(“keymode.root”); TH1F *h = (TH1F*)f.Get(“hist”); h.Draw(); } René Brun 92 A Root file pippa.root with two levels of directories René Brun Objects in directory /pippa/DM/CJ eg: /pippa/DM/CJ/h15 93 LHC: How Much Data? Level 1 Rate (Hz) 106 High Level-1 Trigger (1 MHz) LHCB 105 1 billion people surfing the Web 104 HERA-B ATLAS CMS KLOE CDF II CDF 103 102 104 René Brun High No. Channels High Bandwidth (500 Gbit/s) High Data Archive (5 PetaBytes/year) 10 Gbits/s in Data base H1 ZEUS NA49 UA1 LEP 105 106 ALICE STAR 107 Event Size (bytes) 94 ROOT Collection Classes Or when one million TKeys are not enough… Collection Classes The ROOT collections are polymorphic containers that hold pointers to TObjects, so: They can only hold objects that inherit from TObject They return pointers to TObjects, that have to be cast back to the correct subclass René Brun 96 Types Of Collections TCollection TSeqCollection TList TSortedList THashTable TOrdCollection THashList TMap TObjArray TBtree TClonesArray The inheritance hierarchy of the primary collection classes René Brun 97 Collections (cont) Here is a subset of collections supported by ROOT: TObjArray TClonesArray THashTable THashList TMap Templated containers, e.g. STL (std::list etc) René Brun 98 TObjArray A TObjArray is a collection which supports traditional array semantics via the overloading of operator[]. Objects can be directly accessed via an index. The array expands automatically when objects are added. René Brun 99 TClonesArray Array of identical classes (“clones”). Designed for repetitive data analysis tasks: same type of objects created and deleted many times. No comparable class in STL! The internal data structure of a TClonesArray René Brun 100 Traditional Arrays Very large number of new and delete calls in large loops like this (N(100000) x N(10000) times new/delete): N(100000) TObjArray a(10000); while (TEvent *ev = (TEvent *)next()) { for (int i = 0; i < ev->Ntracks; ++i) { a[i] = new TTrack(x,y,z,...); ... } ... a.Delete(); N(10000) } René Brun 101 You better use a TClonesArray which reduces the number of new/delete calls to only N(10000): N(100000) TClonesArray a("TTrack", 10000); while (TEvent *ev = (TEvent *)next()) { for (int i = 0; i < ev->Ntracks; ++i) { new(a[i]) TTrack(x,y,z,...); N(10000) ... } ... a.Delete(); } Pair of new/delete calls cost about 4 μs NN(109) new/deletes will save about 1 hour. René Brun 102 THashTable - THashList THashTable puts objects in one of several of its short lists. Which list to pick is defined by TObject::Hash(). Access much faster than map or list, but no order defined. Nothing similar in STL. THashList consists of a THashTable for fast access plus a TList to define an order of the objects. René Brun 103 TMap TMap implements an associative array of (key,value) pairs using a THashTable for efficient retrieval (therefore TMap does not conserve the order of the entries). René Brun 104 ROOT Trees Why Trees ? As seen previously, any object deriving from TObject can be written to a file with an associated key with object.Write() However each key has an overhead in the directory structure in memory (up to 160 bytes). Thus, Object.Write() is very convenient for simple objects like histograms, but not for saving one object per event. Using TTrees, the overhead in memory is in general less than 4 bytes per entry! René Brun 106 Why Trees ? Extremely efficient write once, read many ("WORM") Designed to store >109 (HEP events) with same data structure Load just a subset of the objects to optimize memory usage Trees allow fast direct and random access to any entry (sequential access is the best) René Brun 107 Building ROOT Trees Overview of Trees Branches 5 steps to build a TTree René Brun 108 Memory <--> Tree Each Node is a branch in the Tree Memory 0 T.GetEntry(6) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 T.Fill() T René Brun 109 ROOT I/O -- Split/Cluster Tree version Tree entries Streamer Branches Tree in memory File René Brun 110 Writing/Reading a Tree class Event : public Something Header { fHeader; Event.h std::list<Vertex*> fVertices; std::vector<Track> fTracks; TOF Write.C Calor fTOF; *fCalor; Read.C } main() { main() { Event *event = 0; Event *event = 0; TFile f(“demo.root”,”recreate”); TFile f(“demo.root”); int split = 99; TTree *T = (TTree*)f.Get”T”); //maximum split TTree *T = new TTree(“T”,”demo Tree”); T->SetBranchAddress(“event”,&event); T->Branch(“event”,&event,split); Long64_t N = T->GetEntries(); for (int ev=0;ev<1000;ev++) { for (Long64_t ev=0;ev<N;ev++) { event = new Event(…); T->GetEntry(ev); T->Fill(); // do something with event delete event; } } } t->AutoSave(); } René Brun 111 Tree structure René Brun 112 Browsing a TTree with TBrowser 8 leaves of branch Electrons A double-click to histogram the leaf 8 Branches of T René Brun 113 The TTreeViewer René Brun 114 Tree structure Branches: directories Leaves: data containers Can read a subset of all branches – speeds up considerably the data analysis processes Tree layout can be optimized for data analysis The class for a branch is called TBranch Variables on TBranch are called leaf (yes TLeaf) Branches of the same TTree can be written to separate files René Brun 115 Five Steps to Build a Tree Steps: 1. Create a TFile 2. Create a TTree 3. Add TBranch to the TTree 4. Fill the tree 5. Write the file René Brun 116 Step 1: Create a TFile Object Trees can be huge need file for swapping filled entries TFile *hfile = new TFile("AFile.root"); René Brun 117 Step 2: Create a TTree Object The TTree constructor: Tree name (e.g. "myTree") Tree title TTree *tree = new TTree("myTree","A Tree"); René Brun 118 Step 3: Adding a Branch Branch name Address of the pointer to the object Event *event = new Event(); myTree->Branch("EventBranch", &event); René Brun 119 Splitting a Branch Setting the split level (default = 99) Split level = 0 Split level = 99 tree->Branch("EvBr", &event, 64000, 0 ); René Brun 120 Splitting (real example) Split level = 0 René Brun Split level = 99 121 Splitting Creates one branch per member – recursively Allows to browse objects that are stored in trees, even without their library Makes same members consecutive, e.g. for object with position in X, Y, Z, and energy E, all X are consecutive, then come Y, then Z, then E. A lot higher zip efficiency! Fine grained branches allow fain-grained I/O - read only members that are needed, instead of full object Supports STL containers, too! René Brun 122 Performance Considerations A split branch is: Faster to read - the type does not have to be read each time Slower to write due to the large number of buffers René Brun 123 Step 4: Fill the Tree Assign values to the object contained in each branch TTree::Fill() creates a new entry in the tree: snapshot of values of branches’ objects Create a for loop for (int e=0;e<100000;++e) { event->Generate(e); // fill event myTree->Fill(); // fill the tree } René Brun 124 Step 5: Write Tree To File myTree->Write(); René Brun 125 Example macro { Event *myEvent = new Event(); TFile f("mytree.root"); TTree *t = new TTree("myTree","A Tree"); t->Branch("SplitEvent", &myEvent); for (int e=0;e<100000;++e) { myEvent->Generate(); t->Fill(); } t->Write(); } René Brun 126 Reading a TTree Looking at a tree How to read a tree Trees, friends and chains René Brun 127 Looking at the Tree TTree::Print() shows the data layout root [] TFile f("AFile.root") root [] myTree->Print(); ****************************************************************************** *Tree :myTree : A ROOT tree * *Entries : 10 : Total = 867935 bytes File Size = 390138 * * : : Tree compression factor = 2.72 * ****************************************************************************** *Branch :EventBranch * *Entries : 10 : BranchElement (see below) * *............................................................................* *Br 0 :fUniqueID : * *Entries : 10 : Total Size= 698 bytes One basket in memory * *Baskets : 0 : Basket Size= 64000 bytes Compression= 1.00 * *............................................................................* … … René Brun 128 Looking at the Tree TTree::Scan("leaf:leaf:….") shows the values root [] myTree->Scan("fNseg:fNtrack"); > scan.txt root [] myTree->Scan("fEvtHdr.fDate:fNtrack:fPx:fPy","", "colsize=13 precision=3 col=13:7::15.10"); ****************************************************************************** * Row * Instance * fEvtHdr.fDate * fNtrack * fPx * fPy * ****************************************************************************** * 0 * 0 * 960312 * 594 * 2.07 * 1.459911346 * * 0 * 1 * 960312 * 594 * 0.903 * -0.4093382061 * * 0 * 2 * 960312 * 594 * 0.696 * 0.3913401663 * * 0 * 3 * 960312 * 594 * -0.638 * 1.244356871 * * 0 * 4 * 960312 * 594 * -0.556 * -0.7361358404 * * 0 * 5 * 960312 * 594 * -1.57 * -0.3049036264 * * 0 * 6 * 960312 * 594 * 0.0425 * -1.006743073 * * 0 * 7 * 960312 * 594 * -0.6 * -1.895804524 * René Brun 129 TTree Selection Syntax MyTree->Scan(); Prints the first 8 variables of the tree. MyTree->Scan("*"); Prints all the variables of the tree. Select specific variables: MyTree->Scan("var1:var2:var3"); Prints the values of var1, var2 and var3. A selection can be applied in the second argument: MyTree->Scan("var1:var2:var3", "var1==0"); Prints the values of var1, var2 and var3 for the entries where var1 is exactly 0. René Brun 130 Looking at the Tree TTree::Show(entry_number) shows the values for one entry root [] myTree->Show(0); ======> EVENT:0 EventBranch = NULL fUniqueID = 0 fBits = 50331648 [...] fNtrack = 594 fNseg = 5964 [...] fEvtHdr.fRun = 200 [...] fTracks.fPx = 2.066806, 0.903484, 0.695610, 0.637773,... fTracks.fPy = 1.459911, -0.409338, 0.391340, 1.244357,... René Brun 131 How To Read a TTree Example: 1. Open the Tfile TFile f("tree4.root") OR 2. Get the TTree TTree *t4=0; f.GetObject("t4",t4) René Brun 132 How to Read a TTree 3. Create a variable pointing to the data root [] Event *event = 0; 4. Associate a branch with the variable: root [] t4->SetBranchAddress("event_split", &event); 5. Read one entry in the TTree root [] t4->GetEntry(0) root [] event->GetTracks()->First()->Dump() ==> Dumping object at: 0x0763aad0, name=Track, class=Track fPx 0.651241 X component of the momentum fPy 1.02466 Y component of the momentum fPz 1.2141 Z component of the momentum [...] René Brun 133 Example macro { Event *ev = 0; TFile f("mytree.root"); TTree *myTree = (TTree*)f->Get("myTree"); myTree->SetBranchAddress("SplitEvent",&ev); for (int e=0;e<100000;++e) { myTree->GetEntry(e); ev->Analyse(); } } René Brun 134 Chains of Trees A TChain is a collection of Trees. Same semantics for TChains and TTrees root > .x h1chain.C root > chain.Process(“h1analysis.C”) { //creates a TChain to be used by the h1analysis.C class //the symbol H1 must point to a directory where the H1 data sets //have been installed TChain chain("h42"); chain.Add("$H1/dstarmb.root"); chain.Add("$H1/dstarp1a.root"); chain.Add("$H1/dstarp1b.root"); chain.Add("$H1/dstarp2.root"); } René Brun 135 Data Volume & Organisation 100MB 1 TTree 1GB 1 10GB 100GB 5 50 1TB 500 10TB 100TB 5000 50000 1PB TChain A TFile typically contains 1 TTree A TChain is a collection of TTrees or/and TChains A TChain is typically the result of a query to a file catalog René Brun 136 Friends of Trees - Adding new branches to existing tree without touching it, i.e.: myTree->AddFriend("ft1", "friend.root") - Unrestricted access to the friend's data via virtual branch of original tree René Brun 137 Tree Friends 01 23 45 67 89 10 11 12 13 14 15 16 17 18 René Brun 0 1 2 3 4 5 6 7 8 Entry # 8 01 23 45 67 89 10 11 12 13 14 15 16 17 18 9 10 11 12 13 14 15 16 17 18 Public Public User read read Write 138 Tree Friends Processing time independent of the number of friends unlike table joins in RDBMS Collaboration-wide public read Analysis group protected user private x TFile f1("tree1.root"); tree.AddFriend("tree2", "tree2.root") tree.AddFriend("tree3", "tree3.root"); tree.Draw("x:a", "k<c"); tree.Draw("x:tree2.x", "sqrt(p)<b"); René Brun 139 Summary: Reading Trees TTree is one of the most powerful collections available for HEP Extremely efficient for huge number of data sets with identical layout Very easy to look at TTree - use TBrowser! Write once, read many (WORM) ideal for experiments' data Still: extensible, users can add their own tree as friend René Brun 140 Analyzing Trees Selectors, Proxies, PROOF René Brun 141 Recap TTree efficient storage and access for huge amounts of structured data Allows selective access of data TTree knows its layout Almost all HEP analyses based on TTree René Brun 142 TTree Data Access Data access via TTree / TBranch is complex Lots of code identical for all TTrees: getting tree from file, setting up branches, entry loop Lots of code completely defined by TTree: branch names, variable types, branch ↔ variable mapping Need to enable / disable branches "by hand" Want common interface to allow generic "TTree based analysis infrastructure" René Brun 143 TTree Proxy The solution: write analysis code using branch names as variables auto-declare variables from branch structure read branches on-demand René Brun 144 Branch vs. Proxy Take a simple branch: class ABranch { int a; }; ABranch* b=0; tree->Branch("abranch", &b); Access via proxy in pia.C: double pia() { return sqrt(abranch.a * TMath::Pi()); } René Brun 145 Proxy Details double pia() { return sqrt(abranch.a * TMath::Pi()); } Analysis script somename.C must contain somename() Put #includes into somename.h Return type must convert to double René Brun 146 Proxy Advantages double pia() { return sqrt(abranch.a * TMath::Pi()); } Very efficient: only reads leaf "a" Can use arbitrary C++ Leaf behaves like a variable Uses meta-information stored with TTree: branch names types contained in branches / leaves and their members (unrolling) René Brun 147 Simple Proxy Analysis TTree::Draw() runs simple analysis: TFile* f = new TFile("tree.root"); TTree* tree = 0; f->GetObject("MyTree", tree); tree->Draw("pia.C+"); Compiles pia.C Calls it for each event Fills histogram named "htemp" with return value René Brun 148 Behind The Proxy Scene TTree::Draw("pia.C+") creates helper class generatedSel: public TSelector { #include "pia.C" ... }; pia.C gets #included inside generatedSel! Its data members are named like leaves, wrap access to Tree data Can also generate Proxy via tree->MakeProxy("MyProxy", "pia.C") René Brun 149 TSelector generatedSel: public TSelector TSelector is base for event-based analysis: 1. setup TSelector::Begin() 2. analyze an entry TSelector::Process(Long64_t entry) 3. draw / save histograms TSelector::Terminate() René Brun 150 Extending The Proxy Given Proxy generated for script.C (like pia.C): looks for and calls void Bool_t void script_Begin(TTree* tree) script_Process(Long64_t entry) script_Terminate() Correspond to TSelector's functions Can be used to set up complete analysis: fill several histograms, control event loop René Brun 151 Proxy And TClonesArray TClonesArray as branch: class TGap { float ore; }; TClonesArray* b = new TClonesArray("TGap"); tree->Branch("gap", &b); Cannot loop over entries and return each: double singapore() { return sin(gap.ore[???]); } René Brun 152 Proxy And TClonesArray TClonesArray as branch: class TGap { float ore; }; TClonesArray* b = new TClonesArray("TGap"); tree->Branch("gap", &b); Implement it as part of pia_Process() instead: Bool_t pia_Process(Long64_t entry) { for (int i=0; i<gap.GetEntries(); ++i) hSingapore->Fill(sin(gap.ore[i])); return kTRUE; } René Brun 153 Extending The Proxy, Example Need to declare hSingapore somewhere! But pia.C is #included inside generatedSel, so: TH1* hSingapore; void pia_Begin() { hSingapore = new TH1F(...); } Bool_t pia_Process(Long64_t entry) { hSingapore->Fill(...); } void pia_Terminate() { hSingapore->Draw(); } René Brun 154 ROOT on the GRID(s) Many ways to use ROOT on a GRID Single laptop Multi-Core desktop Local area cluster Cluster on a GRID center Clusters on GRID(s) GRID: the idea Exploit resources available on the Net: You take and you give. Not a new idea: Seti, Condor, etc BUT: Can we make it transparent? What type of work? Problem is more complex than a simple CPU distribution task WAN data access Data management Authentication, time & space sharing Disk and CPU quotas René Brun 156 The HEP GRID 100,000 computers 5,000 physicists in 1000 locations in 1000 locations René Brun 157 GRID: Users profile Few big users submitting many long jobs (Monte Carlo, reconstruction) They want to run many jobs in one month Many users submitting many short jobs (physics analysis) They want to run many jobs in one hour or less René Brun 158 Big Users Monte Carlo jobs (one hour one day) Each job generates one file (1 GigaByte) Reconstruction job (10 minutes -> one hour) Input from the MC job or copied from a storage center Output (< input) is staged back to a storage center Success rate (90%). If the job fails you resubmit it. For several years, GRID projects focused effort on big users. René Brun 159 Small Users Scenario 1: submit one batch job to the GRID. It runs somewhere with varying response times. Scenario 2: Use a splitter to submit many batch jobs to process many data sets (eg CRAB, Ganga, Alien). Output data sets are merged automatically. Success rate < 90%. You see the final results only when the last job has been received and all results merged. Scenario 3: Use PROOF (automatic splitter and merger). Success rate close to 100%. You can see intermediate feedback objects like histograms. You run from an interactive ROOT session. René Brun 160 Scenario 1 & 2: PROS Job level parallelism. Conventional model. Nothing to change in user program. Initialisation phase Loop on events Termination Same job can run on laptop or on a GRID job. René Brun 161 Scenario 1 & 2: CONS(1) Long tail in the jobs wall clock time distribution. René Brun 162 Scenario 1 & 2: CONS(2) Can only merge output after a timecut. More data movement (input & output) Cannot have interactive feedback Two consecutive queries will produce different results (problem with rare events) Will use only one core on a multicore laptop or GRID node. Hard to control priorities and quotas. René Brun 163 Scenario 3: PROS Predictive response. Event level parallelism. Workers terminate at the same time Process moved to data as much as possible, otherwise use network. Interactive feedback Can view running queries in different ROOT sessions. Can take advantage of multicore cpus René Brun 164 Scenario 3: CONS Event level parallelism. User code must follow a template: the TSelector API. Good for a local area cluster. More difficult to put in place in a GRID collection of local area clusters. Interactive schedulers, priority managers must be put in place. Debugging a problem slightly more difficult. René Brun 165 GRID & Parallelism: 1 The user application splits the problem in N subtasks. Each task is submitted to a GRID node (minimal input, minimal output). The GRID task can run synchronously or asynchronously. If the task fails or timeout, it can be resubmitted to another node. One of the first and simplest use of the GRID, but not many applications in HEP. René Brun 166 GRID & Parallelism: 2 The typical case of Monte Carlo or reconstruction in HEP. It requires massive data transfers between the main computing centers. This activity has concentrated so far a very large fraction of the GRID projects and budgets. It has been an essential step to foster coordination between hundreds of sites, improve international network bandwidths and robustness. René Brun 167 GRID & Parallelism: 3 Distributed data analysis will be a major challenge for the coming experiments. This is the area with thousands of people running many different styles of queries, most of the time in a chaotic way. The main challenges: Access to millions of data sets (eg 500 TeraBytes) Best match between execution and data location Distributing/compiling/linking users code(a few thousand lines) with experiment large libraries (a few million lines of code). Simplicity of use Real time response Robustness. René Brun 168 GRID & Parallelism: 3a Currently 2 different & competing directions for distributed data analysis. Batch solution using the existing GRID infrastructure for Monte Carlo and reconstruction programs. A front-end program partitions the problem to analyze ND data sets on NP processors. Interactive solution PROOF. Each query is parallelized with an optimum match of execution and data location. René Brun 169 Parallelism: Where ? Multi-Core CPU laptop/desktop 2(2007) 32(2012?) Network of desktops Local Cluster with multi-core CPUs GRID(s) René Brun 170