Muon Lifetime Analysis

David W. Hertzog and David M. Webber
University of Illinois at Urbana-Champaign
For the MuLan collaboration [1]

March 7, 2016

ABSTRACT

Muon decay probes the weak force and provides the most precise determination of the Fermi Constant—the "strength" of the weak interaction. The Fermi Constant, GF, is one of the three key inputs to the Standard Model of particle physics. The MuLan Collaboration at the Paul Scherrer Institute is a UIUC-led experiment that aims to improve the determination of GF to sub-part-per-million precision, a 20-fold improvement compared to all previous world efforts. An initial commissioning run of the experiment has been completed and published, with the analysis based on local UIUC Nuclear Physics Laboratory resources. Since then, a high-statistics run in 2006 has been completed, yielding more than 67 TB of raw data, which has been uploaded into the NCSA Mass Storage System. We have developed and tested efficient analysis codes using, in part, a Development Allocation Award from NCSA. In this proposal, we request the computing time necessary to process the large data set with the goal of extracting the physics result in a timely manner. We describe the science, the methodology, and the computing requirements to achieve this goal.

1 Overview

In this document we provide the motivation and justification for a new Medium Resource Allocation on the Tungsten cluster at NCSA to be used to carry out a highly sophisticated physics analysis that will lead to the most precise determination of one of the key parameters of the Standard Model of particle physics—the Fermi coupling constant. We are requesting 100 kSUs, 25 TB of project disk space, and 100 TB on the mass storage system. The allocation is sufficient to complete the analysis of the data obtained by us in 2006, a dataset representing the decays of more than 10^12 muons, as explained below.

Defining the parameters of the Standard Model is a high priority of both particle and nuclear physics. Important advances come from efforts to produce new particles directly at colliders and, alternatively, from precision low-energy measurements, which test the predictions of the theory or which measure the input parameters. The Muon Lifetime Analysis (MuLan) experiment aims to measure the positive muon lifetime to 1 ppm precision, thus determining the Fermi coupling constant, GF, to 0.5 ppm. The Fermi constant represents the "strength" of the Weak Force, one of the four forces in nature. The muon lifetime measurement provides, by far, the most precise method of establishing this fundamental constant. A 1 ppm goal represents a 27-fold improvement compared to any single previous experiment, the most recent of which was carried out more than 20 years ago. The experiment follows logically from our highly successful muon g-2 [i] experience, another high-precision measurement that also relied on NCSA processing to extract physics from raw data. The scientific case for MuLan is motivated as part of a world effort to determine to the fullest extent possible the fundamental parameters of standard electroweak theory.

[1] Senior personnel: R. M. Carey, P. T. Debevec, K. Giovanetti, F. Gray, T. P. Gorringe, D. W. Hertzog, P. Kammel, B. Lauss, K. R. Lynch, J. P. Miller, Q. Peng, B. L. Roberts, V. Tishchenko, D. M. Webber, P. Winter. Collaborating institutions: Boston, Illinois, Kentucky, Regis, Paul Scherrer Institut.
Figure 1. Diagram of the MuLan detector with several elements removed to show the AK-3 target. Muons enter through a vacuum beampipe and stop in the AK-3. Two example decay trajectories are shown.

Figure 2. A subset of the Fall 2004 data illustrating the rate of decays during the accumulation and measuring periods. The 2004 data clearly show the time structure of the experiment. To reduce data volume in 2006, only decays during the measurement period were recorded.

Very briefly, the experiment, which takes place at the Paul Scherrer Institute in Switzerland, uses a high-intensity, artificially chopped muon beam, a nearly hermetic decay detector, a high-precision clock system, and a fast DAQ environment. After five years of hardware development, a "small" data set was obtained in 2004, which was analyzed locally at Illinois; the results were published in July 2007 in Physical Review Letters [ii]. In 2006, the hardware was completed and the first large data set was obtained (10^12 events, 67 TB of data recorded). Since then we have developed, using our local resources, the coding and optimization needed to process these data. A development allocation from NCSA has permitted testing of performance aspects of our code. It is now time to enter full production analysis to extract the physics; hence the timing of this proposal. MuLan is supported by our main NSF umbrella grant, NSF PHY 06-01067, and the experiment hardware was largely funded by a special Major Research Instrumentation grant, NSF 00-79735, "Collaborative Research: The MuLan Project—Development of instrumentation for a new high-precision determination of the Fermi coupling constant". Hertzog received a Guggenheim Fellowship to help commission the experiment. At present, we are operating analysis tests at NCSA with Development Allocation PHY060034N.

2 The MuLan Experiment

2.1 Motivation

The Fermi constant GF is best determined from the muon lifetime, τμ, in which context it is often termed Gμ. The relationship is

    1/τμ = (GF^2 mμ^5 / 192 π^3) (1 + Δq),

where the parameter Δq accounts for QED radiative corrections. Prior to our recent intermediate result, the combined world-average experimental knowledge of τμ from many experiments had an uncertainty of 18 ppm. However, the uncertainty on GF was dominated by uncalculated two-loop radiative corrections prior to the publication of a series of papers by van Ritbergen and Stuart [iii], in which the authors reduced the theoretical error to less than 0.3 ppm. They further detail how electroweak corrections to muon decay, which can be separated in a natural way from those involving QED alone, are contained in the term Δr in the relation

    GF / √2 = g^2 / (8 MW^2) × (1 + Δr),

which defines the Fermi constant in terms of the Standard Model weak coupling constant g. The work is summarized by giving an unambiguous prescription that relates α (at low q^2), the muon lifetime, and the Z mass to renormalized Standard Model parameters. As an example, these parameters can be used to predict [iv] the top-quark mass. Improvements in GF and MZ (or related electroweak parameters), and in the measured mt, will be useful in future Standard Model tests.
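For orientation (this numerical aside is ours, not part of the proposal's analysis chain), the relation above can be inverted directly. A minimal Python sketch, using an illustrative lifetime value and neglecting the correction term Δq:

    # Minimal sketch: tree-level extraction of G_F from a muon lifetime value,
    # working in natural units and neglecting the QED correction (1 + Delta q).
    import math

    hbar_GeV_s = 6.582119e-25      # hbar in GeV*s
    m_mu_GeV   = 0.1056584         # muon mass
    tau_mu_s   = 2.1969811e-6      # illustrative lifetime value in seconds

    tau_natural = tau_mu_s / hbar_GeV_s                    # lifetime in GeV^-1
    GF_tree = math.sqrt(192.0 * math.pi**3 / (tau_natural * m_mu_GeV**5))
    print("G_F (tree level) = %.4e GeV^-2" % GF_tree)      # ~1.164e-5 GeV^-2
    # The per-mil-level difference from the accepted value reflects the neglected
    # (1 + Delta q) radiative correction discussed above.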
A high-precision measurement of τμ is also used in the analysis of the physics from our other current muon experiment, MuCap. There, we measure the negative muon lifetime in an ultra-pure, high-pressure hydrogen gas Time Projection Chamber to determine the muon capture rate (and, from that, the fundamental induced pseudoscalar coupling, gP). The MuCap experiment used NCSA resources to complete analysis of its first large data set and published the results in the same July 2007 issue of Physical Review Letters [v] as MuLan's intermediate result.

2.2 Detector

The MuLan experiment is simple. A continuous beam of low-energy muons is directed to stop in a thin target during a "beam-on" accumulation period lasting approximately 5 μs (2.3 τμ). The muon beam is then "switched off" and decays are recorded during a 22-μs (10 τμ) measuring period by a surrounding detector. This cycle is repeated until more than 10^12 decays are recorded. As an example, a 10-MHz dc beam delivers, on average, 50 muons to the target during the accumulation period. Of those, 20 remain at the beginning of the measuring period. The total cycle lasts 27 μs; thus the effective beam rate exceeds 700 kHz, nearly 40 times greater than in traditional "one-at-a-time" experiments. Figure 2 shows this pattern from a small subset of the data accumulated in Fall 2004.

During the measuring interval, the decay positrons are recorded by a multi-segmented, nearly hermetic spectrometer (Figure 1). The geometry features 170 independent scintillator tile pairs, with each element read out by a photomultiplier tube (PMT) whose signal is sampled at 450 MHz by a dedicated waveform digitizer (WFD). The time and energy deposited in each tile are derived from the signal shape. Decay time histograms are constructed from coincident hits and are then fit to extract the lifetime.

The design of the experiment is driven by systematic error considerations. Primary concerns are related to multi-particle pileup, muon spin precession, time dependence of detector gains or electronic thresholds, and backgrounds. Pileup is minimized by the segmentation of the detector, the relatively low peak rate per element (3 kHz), and the double-hit resolution enabled by recording the pulse waveform in both tile elements for each event. Uncontrolled precession of the stopped, polarized muon ensemble can cause a change in the acceptance of the detectors during the measuring interval, which occurs because the emitted positron rate is correlated with the direction of the muon spin. Residual polarization is minimized by using a high-internal-field metal alloy, Arnokrome-3 (AK-3), that dephases the spins during the accumulation period, which is long compared to the spin precession period. Finally, the detector features front-back matched, symmetric segments, where the sum of elements is nearly immune to a change in spin direction.

2.3 Data Acquisition

Each tile element of the segmented detector is read out independently by a dedicated set of electronics, which makes the setup well suited for parallelization. Geometrically opposite tile pairs are read out by each four-channel WFD. When the analog signal from a PMT rises above a preconfigured threshold, the waveform digitizer for that channel writes out 24 samples of the waveform at 2.2 ns intervals. If the waveform is still above threshold after 24 samples, the waveform digitizer retriggers, writing out another 24 samples. The waveform digitizer also records an offset time since the beginning of the current beam cycle, or "fill," and a fill number.
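As a rough illustration of the kind of record this produces (the field names below are hypothetical and do not reflect the actual MIDAS bank layout), a hit can be viewed as one or more contiguous 24-sample blocks that are joined before fitting; the later combination into "islands" is described in Section 3.2:

    # Illustrative sketch only: a possible in-memory view of one WFD block and of
    # joining retriggered blocks. It assumes one ADC sample per clock tick.
    from dataclasses import dataclass
    from typing import List

    SAMPLES_PER_BLOCK = 24          # samples written per trigger

    @dataclass
    class WfdBlock:
        channel: int                # detector tile / WFD channel
        fill: int                   # beam cycle ("fill") number
        offset_ticks: int           # clock ticks since the start of the fill
        adc: List[int]              # 24 ADC samples

    def join_retriggers(blocks):
        """Concatenate consecutive blocks from the same channel and fill whose
        offsets are contiguous (i.e. the WFD retriggered)."""
        blocks = sorted(blocks, key=lambda b: (b.channel, b.fill, b.offset_ticks))
        joined = []
        for b in blocks:
            last = joined[-1] if joined else None
            if (last and last.channel == b.channel and last.fill == b.fill
                    and b.offset_ticks == last.offset_ticks + len(last.adc)):
                last.adc = last.adc + b.adc           # contiguous retrigger: extend
            else:
                joined.append(WfdBlock(b.channel, b.fill, b.offset_ticks, list(b.adc)))
        return joined

    # Example: two retriggered blocks from one channel merge into a 48-sample waveform.
    blocks = [WfdBlock(5, 17, 100, [0] * SAMPLES_PER_BLOCK),
              WfdBlock(5, 17, 100 + SAMPLES_PER_BLOCK, [0] * SAMPLES_PER_BLOCK)]
    print(len(join_retriggers(blocks)[0].adc))        # -> 48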
The set of hits in all detectors over a period of 5000 fills comprises a data structure called a "midas event," and three such events are present in the readout sequence at a time. From newest to oldest, the first event consists of waveform data being collected by the WFDs, the second is being read from the WFD FIFOs into the frontend computer's memory, and the third is being compressed using open-source zlib, with an MD5 checksum. The six compressed events from the frontend computers are assembled by the backend computer, and two copies are written to LTO3 tape. There are approximately 430 midas events in each 2 GB data file. The typical overall data rate seen by the backend was 25 MB/s, with a DAQ livetime of 90%.

2.4 Status

The MuLan experiment was proposed in 1999. The MuLan detector was completed in 2003, and a test run was carried out that same year. In 2004, a physics run was taken with preliminary instrumentation. The 2004 dataset was analyzed locally at the Nuclear Physics Lab and resulted in a thesis and a publication that reduced the uncertainty on the world average of the muon lifetime by a factor of 1.9. In 2005, the full complement of electronics was installed and another shakedown run was carried out. In the fall of 2006, the first full physics production run was completed, during which over 10^12 muon decays were recorded. The transfer of these data to NCSA's mass storage system was completed in February 2007. Since then, David Webber has been working on a complete version of the production analysis code, and that code is ready for final testing on a significant fraction of the data.

In the summer of 2007, another dataset of equivalent size to the 2006 dataset was obtained. The muon stopping target was changed from AK-3, with its high internal field, to quartz plus an external magnet. This represents a significant systematic test—the two methods should agree on the result if the experiment is properly understood. The transfer of part of the 2007 data into NCSA is underway. The analysis effort for the 2007 dataset is being supervised by a University of Kentucky postdoc within our collaboration. The proposal presented here is aimed at the resources we believe will be required to analyze the 2006 dataset. During this effort, we will learn more about efficient processing and plan to submit, separately, a request or renewal for a fine-tuned allocation sufficient to complete analysis of the 2007 data.

3 Data Analysis and Computing Methodology

The MuLan data consist of digitized waveforms from the detector photomultiplier tubes (PMTs). A one ppm measurement of the muon lifetime requires over 10^12 positron "hits" in our detector. Each "hit" triggers both an inner and an outer tile, giving a total of 56 bytes per hit. When backgrounds and systematic runs are included, the size of the 2006 dataset becomes 67.44 TB in approximately 50,000 data files. Unlike most particle physics experiments, in which choice events are filtered from background, the majority of our data will be included in the final result. The analysis will reduce the size of the raw data in stages while processing it into meaningful physics results.

The original proposal, submitted in 1999, called for the online analysis of the data—onboard processors were to identify "hits" and histogram them in local memory. However, developing and testing such an unforgiving algorithm would first require a data set on which offline analysis of event-by-event data could be carried out repeatedly. In 2004, we recorded all the individual information about detector hits—a dataset nearly 100 times smaller than the final proposal goal.
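Referring back to the dataset figures quoted at the start of this section, a rough bookkeeping sketch (illustrative arithmetic only; exact numbers come from our run records):

    # Rough bookkeeping for the 2006 dataset sizes quoted above (illustration only).
    hits          = 1.0e12        # positron hits needed for a 1 ppm lifetime fit
    bytes_per_hit = 56            # inner + outer tile records per hit
    total_tb      = 67.44         # full dataset, including backgrounds and systematics
    n_files       = 50_000

    print("physics hits alone : %.0f TB" % (hits * bytes_per_hit / 1e12))   # ~56 TB
    print("average file size  : %.2f GB" % (total_tb * 1e3 / n_files))      # ~1.35 GB (files up to 2 GB)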
Our experience in analyzing the 2004 data required replaying the tapes many times as we learned more about systematic issues and tested data self-consistency. For example, subtle data-ordering and nonlinear timing issues were discovered in the 2004 data, originating from the preliminary electronics that were used. These issues could not have been discovered without replaying the full statistical power of the data and performing detailed self-consistency checks. Other systematic issues must be tested in the 2006 and 2007 datasets, using the full dataset to look for any subtle inconsistencies. Based on advances in storage technology and capacity, we decided to store the 2006 data for reprocessing rather than relying solely on the online analysis—a scientific decision that burdens us with a very challenging analysis of that large dataset.

Local laboratory resources are more than a factor of 10 too small to permit us to process these data, let alone store them in a convenient manner. In 2001, we had a similar challenge with the muon g-2 experimental data and, using NCSA systems, we were able to process the raw data into manageable histogram data sets, which were then analyzed using our local machines. The g-2 experience was, in fact, very positive. Based on these considerations, we are requesting a medium resource allocation on Tungsten to perform the data analysis of the 2006 MuLan dataset. The analysis is divided into the following stages, with combined requirements for all stages given in the next section.

Stage 0: The total size of the 2006 dataset is 67.44 TB. The upload of this dataset into NCSA's mass storage system was completed in early February 2007. Since then, David Webber has written robust and efficient code to process the data using a development allocation that expires on January 30, 2008.

Stage 1: This stage is the most CPU intensive, and it also requires high bandwidth between scratch/project space and the compute nodes. The purpose of this stage is to fit each of the waveforms in the raw data with pulse templates, producing fitted times and amplitudes for each pulse in the detector. Our current best effort at code optimization processes one 2 GB raw data file into a 1 GB ROOT [2] tree file in one hour. If bandwidth is not an issue, we can process our entire dataset with 100 CPUs running continuously for two weeks. Realistically, bandwidth into and out of NCSA's mass storage system will constrain the speed of the analysis. We expect the data will be processed in 1-2 months. We expect to perform this stage 1-2 times.

[2] ROOT is an object-oriented data analysis framework for nuclear and particle physics; http://root.cern.ch

Stage 2: After all the raw data files have been processed into physically meaningful tree files, the tree files can be reprocessed many times to construct histograms from which we can fit the muon lifetime. Each histogram file is typically a few tens of megabytes and takes 5-10 minutes to produce from a 1 GB tree file, depending on the histograms being constructed. We expect to perform this stage several times per stage 1 analysis.

Stage 3: The histogrammed files can be summed together to create summaries of each run condition. This summary of the dataset contains the most physically meaningful information from which conclusions about the run can be drawn. These summary histogram files will be transferred from NCSA to NPL for local analysis. We will perform this stage once per stage 2 analysis.
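To make stages 2 and 3 concrete, a schematic sketch follows (illustrative only; the production code reads ROOT tree files, applies quality cuts, and builds many different histograms):

    # Schematic sketch of stages 2 and 3 (illustration, not production code).
    import numpy as np

    NBINS, T_MAX_US = 1000, 22.0      # example binning over the measuring period

    def stage2_histogram(hit_times_us):
        """Stage 2: histogram the fitted decay times from one tree file."""
        counts, _ = np.histogram(hit_times_us, bins=NBINS, range=(0.0, T_MAX_US))
        return counts

    def stage3_sum(histograms):
        """Stage 3: sum the per-file histograms for one run condition."""
        return np.sum(histograms, axis=0)

    # Toy usage with simulated exponential decay times (2.197 us lifetime):
    rng = np.random.default_rng(0)
    files = [rng.exponential(2.197, size=100_000) for _ in range(5)]
    summary = stage3_sum([stage2_histogram(f) for f in files])
    print(summary.sum(), "decays histogrammed")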
3.1 Project Requirements

Tape Data Storage (100 TB): 67.44 TB of raw data are already on NCSA's mass storage system. Although the current version of the ROOT tree output files is frugal in the data recorded, we are investigating the possibility of using 16-bit floats instead of 32-bit floats for the fitted times and amplitudes of each pulse. We estimate that using 16-bit floats will have negligible impact on our precision while dramatically reducing the size of the output tree files. If no improvements are made to the output file format, we will need an additional 32 TB of space on the mass storage system to store tree files after analysis cuts. If the tree files are regenerated due to further improvements in the production code, the tree files on the mass storage system will be replaced. Therefore, we will need about 100 TB in the mass storage system.

Disk Data Storage (25 TB): Our data set is too large to store entirely on disk at once, so it will be broken into groups for processing. Raw data files will be staged to global scratch for processing and then deleted. Tree files will be stored intermediately in project space and long-term in the mass storage system. Having many tree files available on a project disk at once will reduce our usage of the mass storage system and speed up stage 2 of our analysis. We request 25 TB of project space, but we can function with less.

CPU Hours (100 kSU): One 2 GB data file can be processed into a 1 GB tree file in one hour on a single Tungsten CPU. After the tree file has been generated, it takes 5-10 minutes to generate a set of histograms from the tree file. What we learn from our first production pass over the whole dataset may lead us to do another production pass. We request enough CPU time to do two production passes and several histogramming passes. At 34 kSUs per production pass, with the histogramming passes considered equivalent to another production pass, this comes to 100 kSUs.

System: Tungsten is the logical choice for our analysis due to its large local disk and fast access to the local mass storage system. Tungsten has fewer cores per node than Abe, making it a sensible choice for high-throughput computing.

3.2 Production Code Overview

Figure 3: A double pulse (red) is fit with two template pulses (blue).

The production code processes raw data files containing digitized pulse waveforms into an output file in ROOT tree format containing fitted times and amplitudes for individual pulses. After the raw data pass checksum and basic data-quality checks, individual blocks of 24 ADC samples are combined into "islands." Each island typically contains one pulse, but approximately one in one thousand contains two or more pulses. A pulse template is fit to the island using a simple parabolic minimization routine to find the pulse time, and the residuals are inspected for additional pulses. If additional pulses are found, a more sophisticated routine fits multiple pulses to the island using a multidimensional minimization over all pulse times. In both cases, the pulse amplitude, pedestal, and goodness of fit are calculated exactly for given pulse times, reducing the space in which the minimizer operates from three parameters per pulse to one. If the minimization fails, the amplitude-weighted time is recorded for the island. The production code scales as O(N), where N is the number of data blocks in the input data file.
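A minimal sketch of the single-pulse fitting idea follows (our illustration, not the collaboration's code; it assumes a simple analytic template shape, whereas the real template is measured, and it omits the multi-pulse machinery). For each trial pulse time, the amplitude and pedestal are obtained analytically by linear least squares, so the nonlinear search runs over the time alone and is refined with a parabolic step:

    # Illustrative single-pulse template fit with analytic amplitude/pedestal.
    import numpy as np

    def template(t_rel):
        """Hypothetical normalized pulse shape (the real template is measured)."""
        return np.where(t_rel > 0, (t_rel / 2.0) * np.exp(1.0 - t_rel / 2.0), 0.0)

    def chi2_at_time(samples, t0):
        """For a trial pulse time t0, solve samples ~ A*template + P analytically."""
        n = len(samples)
        f = template(np.arange(n) - t0)
        design = np.column_stack([f, np.ones(n)])
        (amp, ped), *_ = np.linalg.lstsq(design, samples, rcond=None)
        resid = samples - (amp * f + ped)
        return float(np.sum(resid ** 2))

    def fit_pulse_time(samples, steps=24):
        """Coarse scan of chi^2(t) followed by one parabolic refinement."""
        ts = np.linspace(0.0, len(samples) - 1.0, steps)
        chi2s = np.array([chi2_at_time(samples, t) for t in ts])
        i = int(np.clip(np.argmin(chi2s), 1, steps - 2))
        y0, y1, y2 = chi2s[i - 1], chi2s[i], chi2s[i + 1]
        denom = y0 - 2.0 * y1 + y2
        frac = 0.5 * (y0 - y2) / denom if denom != 0.0 else 0.0
        return ts[i] + frac * (ts[1] - ts[0])

    # Toy usage: a simulated single pulse at time 9.3 on a pedestal of 50 counts.
    rng = np.random.default_rng(1)
    wave = 50.0 + 130.0 * template(np.arange(24) - 9.3) + rng.normal(0.0, 1.0, 24)
    print("fitted pulse time:", round(fit_pulse_time(wave), 2))   # close to 9.3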
3.3 Production Code Optimization

As shown by the psrun output in the appendix below, our code's performance is unchanged by running one or two processes per node. Running twenty processes simultaneously on 10 nodes does not increase the processing time. Overall, our analysis code is trivially parallelizable, scaling as 1/P, where P is the number of CPUs. Our code will be bandwidth-limited, depending on the speed of the connection to the mass storage system and to local network storage.

Considerable effort has been spent finding an efficient and correct pulse-fitting algorithm. Optimizations include a fast parabolic minimization routine to fit one pulse, the analytic computation of pulse amplitudes and pedestals to reduce the search space of the fitter, and well-tuned thresholds for the intelligent pulse-finding routines that control the addition and removal of fitted pulse templates from a digitized island. The production code has been designed to pass once through the data with minimal computation and interruption in the data stream.

One of the most dangerous systematic errors in our experiment is the possibility of mistaking two pulses for one, which artificially decreases our counting rate more at early times than at late times and thereby systematically increases the apparent lifetime. Our correction for this effect is more effective for smaller deadtimes, motivating us to perform a detailed fit on each of the pulse waveforms. However, there is a tradeoff in terms of processing time: the smaller our imposed deadtime window, the more CPU hours it takes to process a run (Figures 4 and 5; a toy numerical illustration of this rate-dependent loss follows Section 3.5).

Figure 4: An example of the resolving power of the double-pulse finder. A waveform is simulated with two pulses. One pulse is always at time 12 and amplitude 130, and a second pulse is placed on the island with varying time and amplitude. This plot shows the number of pulses found on an island vs. the time and height of the second pulse.

3.4 Batch Jobs

Since individual runs take a short time to process, we found that we can most efficiently process runs by leasing a node from the batch queuing system for H hours and processing 2*H runs on the node during that time. Multiple nodes are requested individually from the batch queue. We are willing to work with NCSA staff on an alternate job management system if recommended, but this method has worked well for our collaborators in the MuCap experiment, who have similar data-processing needs.

3.5 Local Computing

The Nuclear Physics Lab (NPL) maintains a small cluster consisting of several desktop machines of various speeds and ten analysis nodes with two Xeon processors each. The MuLan collaboration also has two data hosts in the NPL cluster with 4.0 TB available for MuLan analysis. The NPL cluster is well equipped to analyze the histograms produced at NCSA in stage 2 of our analysis, but it is a factor of 10 smaller than necessary to do a production pass through the data.

Figure 5: Seconds required to process a midas event versus the imposed deadtime of the fit. The knee beginning at an imposed deadtime of 3 clock ticks motivates us to set the deadtime of the code at 3 for the most efficient processing.
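Referring back to the pileup systematic of Section 3.3, the following toy (our illustration only, with an exaggerated per-fill rate so the effect is visible at small statistics) shows why hits lost to an imposed deadtime deplete early decay times more than late ones, which is what biases an uncorrected lifetime fit upward:

    # Toy pileup illustration: hits closer together than the deadtime are merged,
    # and the loss is largest at early times, where the instantaneous rate is highest.
    import numpy as np

    TAU_US, DEADTIME_US = 2.197, 0.010      # muon lifetime; illustrative deadtime
    N_PER_FILL, N_FILLS = 20, 100_000       # exaggerated rate for a visible effect

    rng = np.random.default_rng(2)
    early_lost = late_lost = early_tot = late_tot = 0
    for _ in range(N_FILLS):
        t = np.sort(rng.exponential(TAU_US, size=N_PER_FILL))   # decay times in one fill
        kept = np.ones(len(t), dtype=bool)
        kept[1:] = np.diff(t) > DEADTIME_US     # a hit too close to its predecessor is lost
        early = t < TAU_US                      # "early" = within one lifetime
        early_tot += early.sum();               late_tot += (~early).sum()
        early_lost += (early & ~kept).sum();    late_lost += ((~early) & ~kept).sum()

    print("fractional loss at early times: %.4f" % (early_lost / early_tot))
    print("fractional loss at late  times: %.4f" % (late_lost / late_tot))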
4 Project Team Qualifications

David Hertzog has supervised many successful high-precision measurements. Among them, the g-2 experiment performed its production data analysis at NCSA, making the UIUC group's analysis very competitive compared to a parallel analysis effort based at Brookhaven National Lab. The publications from the high-profile g-2 experiment have received over 1300 citations and continue to attract attention worldwide. More recently, the NPL-based analysis of the 2004 MuLan dataset, centered at Illinois, generated a publication and a thesis that reduced the uncertainty on the muon lifetime by a factor of 1.9. Our MuCap experiment has also used NCSA to process a large, high-precision dataset; the results from this experiment have recently been published and led to an Illinois Ph.D. thesis. Graduate student David Webber has been a part of the MuLan experiment since 2003, and a precision measurement of the muon lifetime based on the analysis of the 2006 dataset will be his doctoral thesis. He has a strong background in computational physics and has developed the majority of the code that will be used in the present effort.

5 Appendix: Abridged Psrun Counting Output

5.1 One analysis process per node

PerfSuite Hardware Performance Summary Report

Index  Description                                               Counter Value
===============================================================================
  1    Branch instructions                                        975055082186
  2    Conditional branch instructions mispredicted                23218259029
  3    Conditional branch instructions not taken                  410537368347
  4    Conditional branch instructions correctly predicted        952274759172
  5    Conditional branch instructions taken                      565197215206
  6    Floating point operations                                  987503920266
  7    Data translation lookaside buffer misses                     2595201050
  8    Instruction translation lookaside buffer misses              1137907984
  9    Total translation lookaside buffer misses                    3728105262
 10    Total cycles                                             11272605686664
 11    Instructions completed                                    7340759462154
 12    Vector/SIMD instructions                                   373680866067
 13    Level 2 total cache accesses                               173697376475
 14    Level 2 total cache hits                                   171028884023
 15    Level 1 instruction cache accesses                       11703550995602
 16    Level 1 instruction cache misses                            11672030885
 17    Level 2 cache misses                                         2873671351
 18    Cycles stalled on any resource                             607573779562
 19    Load instructions                                         2215355870555
 20    Load/store instructions completed                         3263299881190
 21    Store instructions                                        1042734369922
 22    Level 3 total cache accesses                                 3453700980
 23    Level 3 total cache hits                                      638303324
 24    Level 3 cache misses                                         2826755800

Statistics
===============================================================================
Counting domain........................................................ user
Multiplexed............................................................ yes
Floating point operations per cycle.................................... 0.088
Vector instructions per cycle.......................................... 0.033
Floating point operations per graduated instruction.................... 0.135
Vector instructions per graduated instruction.......................... 0.051
Graduated instructions per cycle....................................... 0.651
Graduated instructions per level 1 instruction cache miss.............. 628.919
Loads completed per translation lookaside buffer miss.................. 594.231
Stores completed per translation lookaside buffer miss................. 279.696
Loads/stores completed per translation lookaside buffer miss........... 875.324
Cycles per translation lookaside buffer miss........................... 3023.682
% cycles stalled on any resource....................................... 5.390
Graduated loads and stores per cycle................................... 0.289
Graduated loads and stores per floating point operation................ 3.305
Graduated loads and stores per floating point operation................ 3.299
Mispredicted branches per correctly predicted branch................... 0.024
Level 1 cache miss ratio (instruction)................................. 0.001
Bandwidth used to level 2 cache (MB/s)................................. 52.037
Bandwidth used to level 3 cache (MB/s)................................. 51.188
MFLOPS (cycles)........................................................ 279.407
MFLOPS (wall clock).................................................... 270.834
MVOPS (cycles)......................................................... 105.730
MVOPS (wall clock)..................................................... 102.486
MIPS (cycles).......................................................... 2077.016
MIPS (wall clock)...................................................... 2013.285
CPU time (seconds)..................................................... 3534.282
Wall clock time (seconds).............................................. 3646.160
% CPU utilization...................................................... 96.932

5.2 Two analysis processes per node

Only one analysis process is shown.

PerfSuite Hardware Performance Summary Report

Index  Description                                               Counter Value
===============================================================================
  1    Branch instructions                                        976687416998
  2    Conditional branch instructions mispredicted                23153183125
  3    Conditional branch instructions not taken                  410515726951
  4    Conditional branch instructions correctly predicted        950462977201
  5    Conditional branch instructions taken                      562658107966
  6    Floating point operations                                  989184519355
  7    Data translation lookaside buffer misses                     2624071047
  8    Instruction translation lookaside buffer misses              1158605390
  9    Total translation lookaside buffer misses                    3762313779
 10    Total cycles                                             11301275033972
 11    Instructions completed                                    7323821542718
 12    Vector/SIMD instructions                                   371901173205
 13    Level 2 total cache accesses                               174291089478
 14    Level 2 total cache hits                                   172160013158
 15    Level 1 instruction cache accesses                       11712388095912
 16    Level 1 instruction cache misses                            11727391534
 17    Level 2 cache misses                                         2884305828
 18    Cycles stalled on any resource                             610794679537
 19    Load instructions                                         2207330448798
 20    Load/store instructions completed                         3245516499360
 21    Store instructions                                        1038037912583
 22    Level 3 total cache accesses                                 3418236659
 23    Level 3 total cache hits                                      670506547
 24    Level 3 cache misses                                         2785143316

Statistics
===============================================================================
Counting domain........................................................ user
Multiplexed............................................................ yes
Floating point operations per cycle.................................... 0.088
Vector instructions per cycle.......................................... 0.033
Floating point operations per graduated instruction.................... 0.135
Vector instructions per graduated instruction.......................... 0.051
Graduated instructions per cycle....................................... 0.648
Graduated instructions per level 1 instruction cache miss.............. 624.506
Loads completed per translation lookaside buffer miss.................. 586.695
Stores completed per translation lookaside buffer miss................. 275.904
Loads/stores completed per translation lookaside buffer miss........... 862.638
Cycles per translation lookaside buffer miss........................... 3003.810
% cycles stalled on any resource....................................... 5.405
Graduated loads and stores per cycle................................... 0.287
Graduated loads and stores per floating point operation................ 3.281
Graduated loads and stores per floating point operation................ 3.281
Mispredicted branches per correctly predicted branch................... 0.024
Level 1 cache miss ratio (instruction)................................. 0.001
Bandwidth used to level 2 cache (MB/s)................................. 52.097
Bandwidth used to level 3 cache (MB/s)................................. 50.306
MFLOPS (cycles)........................................................ 279.171
MFLOPS (wall clock).................................................... 269.189
MVOPS (cycles)......................................................... 104.959
MVOPS (wall clock)..................................................... 101.206
MIPS (cycles).......................................................... 2066.951
MIPS (wall clock)...................................................... 1993.048
CPU time (seconds)..................................................... 3543.297
Wall clock time (seconds).............................................. 3674.685
% CPU utilization...................................................... 96.425

[i] G. W. Bennett et al., "Final Report of the Muon E821 Anomalous Magnetic Moment Measurement at BNL," Phys. Rev. D73, 072003 (2006).
[ii] D. B. Chitwood et al., "Improved Measurement of the Positive Muon Lifetime and Determination of the Fermi Constant," Phys. Rev. Lett. 99, 032001 (2007).
Stuart, “On the Precise Determination of the Fermi Coupling Constant from the Muon Lifetime,” Nucl.Phys. B564, 343-390 (2000). iv J. Alcaraz, D. Abbaneo, P. Antilogus, T. Behnke, D. Bloch, A. Blondel, D. G. Charlton, R. Clare, P. Clarke, S. Dutta, M. Elsing, S. Ganguli, M. W. Grunewald, A. Gurtu, K. Hamacher, C. Hawkes, M. Hildreth, R. W. L. Jones, W. Lohmann, T. Kawamoto, Y. Khokhlov, C. M Mariotti, M. Martinez, E. Migliore, K. Monig, M. Morii, A. Nippe, A. Olshevsky, Ch. Paus, M. Pepe-Altarelli, B. Pietrzyk, G. Quast, P. Renton, D. Reid, D. Schlatter, R. Sobie, R. Tenchini, F. Teubert, L. Tomalin, P. S. Wells, G. Crawford, D. Falciai, B. Schumm, D. Su, and E. Torrence. The LEP Electroweak Working Group and the SLD Heavy Flavor Group, “A combination of preliminary LEP and SLD electroweak measurements and constraints on the standard model,” International Conference on High Energy Physics, Warsaw, 1996, CERN Report No. LEPEWWG/96-02., SLD Physics Note 52, July 30, 1996. http://citeseer.ist.psu.edu/338756.html v V.A. Andreev et al. “Measurement of the rate of muon capture in hydrogen gas and determination of the proton's pseudoscalar coupling g(P),” Phys.Rev.Lett.99, 032002 (2007). i 10