Availability Task Force Report, ‘Appendix 1’ Outline
Availability Task Force, November 5, 2009, KEK

1) Introduction and Goal

This appendix is the preliminary report of the availability task force. The task force was formed to find a reasonable way to make SB2009 meet the availability requirements of the ILC. Those requirements are unchanged from those used in the RDR: there should be 9 months of scheduled running and 3 months of scheduled downtime for maintenance and upgrades, and the total unscheduled downtime should be less than 25% of the scheduled runtime. In our simulation we allow only 15% unscheduled downtime; the other 10% is held as contingency for things that were missed in the simulation or for major design or QA problems.

So far the task force has concentrated on the use of a single tunnel for the linac with either the DRFS or the Klystron Cluster RF system. This is the biggest cost saving in SB2009 and the biggest availability concern, and that study is relatively complete. The other changes included in SB2009 have had a more cursory examination and look acceptable, but more work is needed.

We took a three-pronged approach to the problem:

1. We refined the design of the RF systems to enhance their availability. This included making some components redundant, ensuring that the DRFS klystrons could be replaced rapidly during a scheduled maintenance day, and changing the run schedule from one three-month downtime to two one-month downtimes, with a 24-hour preventive maintenance period scheduled every two weeks during the run.
2. We gathered more information on the reliability of components (mean time between failures, MTBF).
3. We integrated all this information into AVAILSIM, the program which simulates breakdowns and repairs of the ILC, to learn how much downtime there was and what caused it. We then adjusted some inputs (the length of the preventive maintenance period, the energy overhead, and longer MTBFs) to achieve the required availability.

The availability task force has studied only the direct effects of the SB2009 changes on availability. Other effects of the change to a single tunnel, such as fire safety; the need for space to install extra equipment in the accelerator tunnel; cost; installation logistics; radiation shielding of electronics and the effect of residual single event upsets; and debugging of subtle electronics problems without simultaneous access to the electronics and the beam, have been or will need to be studied by other groups.

Note that the important result of our work is not the availability we achieved but rather the design that was necessary to achieve it; the target availability was set as a requirement before we started work. That said, Section 2 gives the results of the simulations, Section 3 describes the basic machine design we modeled, with emphasis on availability, Section 4 describes many of the ingredients that were needed to achieve the desired availability, Section 5 gives some details about how the simulation was done, Section 6 describes some outstanding issues, and Section 7 summarizes the conclusions.

a. Availability requirements for the ILC
b. Evaluate the viability of proposed designs using estimation and simulation
c. Adapt designs to provide a valid model through an iterative process
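To put the schedule and downtime budget above in concrete terms, here is a minimal back-of-the-envelope sketch in Python. The 730 hours-per-month figure is an assumption made purely for illustration; the percentages and the maintenance cadence are those stated above.

```python
# Back-of-the-envelope downtime budget for the run plan described above.
# ASSUMPTION: an average month of 730 hours (8760 h / 12), for illustration only.
HOURS_PER_MONTH = 730

scheduled_run = 9 * HOURS_PER_MONTH        # 9 months of scheduled running
unscheduled_cap = 0.25 * scheduled_run     # requirement: < 25% unscheduled downtime
simulated_budget = 0.15 * scheduled_run    # what the simulation is allowed to use
contingency = 0.10 * scheduled_run         # held back for missed items, design/QA problems

# One 24-hour preventive-maintenance (PM) day every two weeks during the run:
pm_days = scheduled_run / (14 * 24)        # roughly 19-20 PM days per year

print(f"scheduled running:      {scheduled_run:6.0f} h/year")
print(f"unscheduled cap (25%):  {unscheduled_cap:6.0f} h")
print(f"simulated budget (15%): {simulated_budget:6.0f} h")
print(f"contingency (10%):      {contingency:6.0f} h")
print(f"PM days per year:       {pm_days:6.1f}")
```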
2) Results

Our results are preliminary for several reasons. Due to time constraints we have not updated the inputs to AVAILSIM to the full SB2009 design: while the linac modeling is fairly accurate, we still have 6 km DRs, and the injectors are not in the BDS tunnel for most of the simulation runs. There are also several details we are not pleased with and hope to improve, such as the cryoplants and AC power disruptions being the largest causes of downtime, or the long recovery times needed after a scheduled preventive maintenance day. Nevertheless, we believe the model is accurate enough that the differences in availability between the various machine configurations we studied are meaningful.

Our main result is shown in Figure 1, which shows, for four different RF schemes, the simulated unscheduled downtime as a function of the main linac energy overhead. This is a very important figure, so it is worth making very clear what is meant by it.

Figure 1. The unscheduled downtime as a function of the main linac energy overhead.

As an example of the meaning of the horizontal axis: for a 250 GeV design energy and a 10% energy overhead, the beam energy, if everything were working perfectly, would be 275 GeV. The plot is mainly relevant when we are trying to run at the full design energy. If for physics reasons we are running at lower energies, then the effective energy overhead is much greater, and the downtime due to inadequately working RF systems would be much less than shown.
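The bookkeeping behind the horizontal axis can be made concrete with a short sketch. The 600-unit linac below is purely hypothetical (a round number, not a design value); only the idea that N identical RF units together provide E_design * (1 + overhead) is taken from the text.

```python
import math

def tolerable_rf_failures(n_units: int, overhead: float) -> int:
    """How many RF units may fail before the design energy is unreachable.

    Assumes n_units identical units that together provide
    E_design * (1 + overhead), each contributing equally.
    """
    # Design energy is reachable while working_units >= n_units / (1 + overhead).
    # The small epsilon guards against floating-point round-off at exact ratios.
    working_needed = math.ceil(n_units / (1.0 + overhead) - 1e-9)
    return n_units - working_needed

# HYPOTHETICAL unit count, for illustration only.
for overhead in (0.00, 0.04, 0.10, 0.20):
    n = tolerable_rf_failures(600, overhead)
    print(f"{overhead:4.0%} overhead -> {n:3d} of 600 units may fail")
```

At zero overhead a single failure already prevents running at the design energy, which is why the unscheduled downtime grows as the overhead shrinks.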
Four different accelerator configurations were simulated, resulting in the four lines shown. The general trend for each line is that as the energy overhead is decreased, the unscheduled downtime increases. This is expected, since fewer acceleration-related items need to fail before the design beam energy cannot be reached and the accelerator must be shut down for unscheduled repairs.

Note that the three one-tunnel linac designs all have the same unscheduled downtime at 20% energy overhead. This overhead is so large that failures of components that reduce the energy cause no downtime. The residual 14.5% downtime is caused by things like magnet power supplies, controls and AC power disruptions. For illustrative purposes, a horizontal line is drawn 1% above the points at 20% energy overhead. It represents what the total unscheduled downtime would be if 1% downtime were caused by broken components (e.g. klystrons or cavity tuners) preventing the design beam energy from being reached. One can then see how much energy overhead each design needs to avoid causing this much downtime.

At high energy overhead, the one-tunnel designs have about 1% more downtime than the two-tunnel design. This represents a doubling of the downtime caused by linac components and a 50% increase for the RTML components that share the linac tunnel. The increase is expected, since a typical mean time to repair for a component in a support tunnel is 1-2 hours; when that component is in the accelerator tunnel (as in the one-tunnel designs), an extra hour is added to allow radiation to cool down, and a second hour is added to allow the accelerator to be secured, turned back on and the magnets standardized. Note that in studies done for the RDR, the total downtime doubled when we went from two tunnels to one. What saves us from that fate in this study is that only the linac was changed from two tunnels to one; in all other areas, devices like power supplies are modeled as accessible while the beam is on.

The one-tunnel 10 MW design degrades fastest with decreasing energy overhead, probably due to the 40,000 and 50,000 hour MTBFs assumed for the klystron and modulator. DRFS does better, probably due to the redundant modulator and the 120,000 hour klystron MTBF assumed. KlyClus does better still, thanks to the ability to repair klystrons and modulators while running and to the internal redundancy built into a cluster: a single klystron or modulator can die and the cluster can still output its design RF power.

The conclusion to be drawn from Figure 1 is that either approach (Klystron Cluster or Distributed RF System) would give adequate availability under the present assumptions. The Distributed RF System requires about 1.5% more energy overhead than the Klystron Cluster scheme to give the same availability, all other assumptions being equal. This small effect may well be compensated by other, non-availability-related issues. With the component failure rates and operating models assumed today, the unscheduled time lost to integrating luminosity with a single main linac tunnel is only 1% more than for the two-tunnel RDR design, given reasonable energy overheads. Note that all non-linac areas were modeled with support equipment accessible with the beam on.

Figure 2 and Figure 3 show in detail which areas of the accelerator and which technical systems caused the downtime in a typical simulation run (KlyClus with 4% energy overhead). Note that the unscheduled downtime is distributed over many areas and technical systems; no one thing dominates. In particular, the linacs and the long transport lines of the RTML cause 20% of the total downtime of 15%, i.e. 0.20 * 0.15 = 3% of the scheduled running time. Note that the cryo plants and site power are the technical systems that cause the most downtime. Both are modeled as single components in the simulation, with availabilities based on aggressive estimates from experts familiar with the performance of such systems at existing labs. Both are worth further study.

Figure 2 shows the distribution of the downtime by area of the accelerator for a typical simulation run (KlyClus with 4% energy overhead). The downtime fractions shown are percentages of the total downtime of about 15%, so the actual downtime caused by the cryo plants is 19% of 15% = 2.8%. The IP does not really cause 13% downtime; that is actually the excess time spent recovering from scheduled maintenance days. Figure 3 shows the distribution of the downtime by technical system for the same run; again, the fractions shown are percentages of the total downtime of about 15%.
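Because the figures quote fractions of the total unscheduled downtime rather than fractions of the scheduled running time, the conversion used in the numbers above is worth making explicit. A one-line helper, using only values quoted in this section:

```python
TOTAL_UNSCHEDULED = 0.15   # total unscheduled downtime, as a fraction of scheduled time

def share_of_scheduled(share_of_downtime: float) -> float:
    """Convert a Figure 2/3 slice into a fraction of the scheduled running time."""
    return share_of_downtime * TOTAL_UNSCHEDULED

print(f"linacs + RTML: {share_of_scheduled(0.20):.2%}")   # 3.00% of scheduled time
print(f"cryo plants:   {share_of_scheduled(0.19):.2%}")   # 2.85%, quoted above as 2.8%
```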
As explained in the introduction, our goal was to find design ingredients that achieve the desired availability. This section shows that we have achieved that goal; the next two sections describe the ingredients used to attain it. It will be seen that this requires construction and operations methods different from those typically employed for HEP accelerators. Methods used by light sources are needed, component mean times between failures must be comparable to the best achieved at any accelerator (some are factors of more than 10 better than those achieved at major HEP labs), and preventive maintenance is crucial, as is built-in redundancy.

a. Either HLRF scheme can be viable in a single main linac tunnel configuration
b. RF overhead (figure 1: unscheduled downtime vs. energy overhead)
   i. 4% (recommended)
   ii. cost impact
c. Availability projections – performance of the complex (figure 2 a/b: downtime by section / system)
d. Recommendations
   i. Managing component and system reliability
   ii. Adapting the project life cycle to realize hoped-for performance

3) Background – changes applied to the Reference Design

a. Motivation

Improvements proposed to the Reference Design are expected to affect the estimated accelerator availability. To estimate the availability of the new design, both the basic machine model and the assumptions that support the simulation needed review. Through this process, the availability of the complex was also re-examined and re-optimized.

b. Machine model

The changes foreseen for the new baseline are described in (). The part of the new baseline with the greatest impact on availability is the removal of the support tunnel for the main linac and RTML. A single-tunnel variant, in which no support tunnel space was allocated to any ILC subsystem, was studied during the development of the Reference Design, and its estimated availability was shown to be poor []. The configuration studied for this proposal allows unobstructed access to source, DR and BDS support equipment; only the main linac and RTML support tunnel is removed.

c. Technology
   i. HLRF – KCS availability design
   ii. HLRF – DRFS availability design (figure 3: redundancy diagram(?))

d. Management model
   i. Adaptations made to better suit ‘modern’ (light source and ‘factory’) accelerator operation and maintenance
   ii. scheduled downtime
   iii. industrial performance for off-the-shelf components
   iv. rejecting reliance on ‘opportunistic’ work

Availability estimates made for the Reference Design were based on experience at labs with high energy colliders; experience with component performance and with operations and maintenance practice was included in the analysis. For further development of the ILC design, we intend to take advantage of the knowledge gained in recent years at light sources and ‘factories’ such as the B-Factories. In general, experience at these installations has been much better than at older, more established facilities. An important part of that improvement is due to changes in managing component lifetime performance and maintenance: these typically involve a more rigid, industry-like quality assurance process, with a stronger reliance on scheduled downtime and less emphasis on ‘opportunistic maintenance’, a practice which includes a substantial ad hoc, rapid-response maintenance and repair effort. Additionally, the analysis assumes the use of industrial standard-practice system designs and components (section ) that have good reliability performance.

4) Key ingredients and their impact

a. Component mean time between failures
   i. Sources of MTBF information and hoped-for improvements
   ii. Klystrons, modulators, magnets/power supplies
   iii. Cryomodules
   iv. Electronics
   v. Industrial ‘standard’ components:
      1. plumbing and related instrumentation
      2. electrical
b. Preventive maintenance
c. Global systems – electrical power and cryo-plant
d. Operations scenarios – recovery from downtime and from scheduled maintenance

5) Simulations

The simulation results described above were produced with the program AVAILSIM, developed by Tom Himel over the past several years. It is a Monte Carlo simulation of the ILC that randomly breaks components, checks whether the set of presently broken components prevents the ILC from generating luminosity, and if so schedules repairs. A description of this program was given at the 2007 Particle Accelerator Conference [i], and some of its features are described below.

Each component fails at a random time drawn from an exponential distribution determined by its MTBF.
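AVAILSIM itself is not reproduced here, but the failure-generation step just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the task force's code: the component list is hypothetical (the klystron and modulator MTBFs echo the 40,000 and 50,000 hour figures quoted in Section 2; the other values are invented), and repairs are treated as instantaneous so each component simply renews.

```python
import heapq
import random

# HYPOTHETICAL component list: (name, MTBF in hours). The klystron and modulator
# values echo Section 2; the remaining values are invented for illustration.
COMPONENTS = [
    ("klystron",   40_000),
    ("modulator",  50_000),
    ("dr_kicker", 100_000),
    ("magnet_ps", 200_000),
]

def simulate_failures(run_hours, seed=0):
    """Yield (time, component) failure events in time order over one run.

    Each component fails at a random time drawn from an exponential
    distribution whose mean is its MTBF; after a failure the component is
    assumed instantly repaired and its next failure time is drawn anew.
    """
    rng = random.Random(seed)
    mtbf = dict(COMPONENTS)
    # Priority queue of pending (failure_time, name) events.
    events = [(rng.expovariate(1.0 / m), name) for name, m in COMPONENTS]
    heapq.heapify(events)
    while events:
        t, name = heapq.heappop(events)
        if t > run_hours:
            break
        yield t, name
        heapq.heappush(events, (t + rng.expovariate(1.0 / mtbf[name]), name))

# One 9-month run (~6570 h); with MTBFs this long, most runs see few failures.
for t, name in simulate_failures(run_hours=6570):
    print(f"t = {t:7.1f} h: {name} failed")
```

The real simulation, of course, tracks thousands of components and, as described next, decides for each failure whether the machine can keep running in a degraded state or must stop for repairs.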
When a component fails, the accelerator is degraded in some fashion. A klystron failure in the main linac simply reduces the energy overhead, and the accelerator keeps running until this overhead is reduced to zero. Similarly, there are 21 DR kickers where only 20 are needed, so only the second failure causes downtime. Other components, such as most magnet power supplies, cause an immediate downtime for their repair. Each component can be specified as hot-swappable (meaning it can be replaced without further degrading the accelerator), repairable without accessing the accelerator tunnel, or repairable only with an access to the accelerator tunnel. A klystron that is not in the accelerator tunnel is an example of a hot-swappable device.

Without doubt, the downtime planning is the most complicated part of the simulation. This should come as no surprise to anyone who has participated in the planning of a repair day; it is even harder in the simulation because computers do not get a gestalt of the situation the way humans do. Briefly, the simulation determines which parameter (e.g. e- linac energy overhead, e+ DR extraction kicker strength, or luminosity) was degraded too much and plans to fix the things that degrade that parameter. Based on the required repairs, it calculates how long the downtime must be to repair the necessary items. It then schedules other items for repair, allowing the downtime to be extended by as much as 50 to 100%. Some other issues must also be taken into account:

• If an access to the accelerator tunnel is required, one hour is allowed for prompt radiation to decay before entry. One hour is also allowed for locking up, turning on and standardizing power supplies.

• The number of people in the accelerator tunnel can be limited, to reduce the chaos of tracking who is in the tunnel. For the present work we have not limited this number. We have estimated it, and it is not unreasonably large, but we still need to use the simulation to limit the number and see how that affects the results.

• The accelerator runs a total of 9 months a year, with two one-month shutdowns for major maintenance and upgrades. In addition, every two weeks there is a 24-hour scheduled downtime for preventive maintenance. This includes work which is explicitly simulated, like replacing DRFS klystrons which have died in the previous two weeks, and work which is not explicitly simulated but which may be needed to attain the long MTBFs, like replacing pumps which are vibrating because their bearings are going bad.

• Recovery from a downtime is not instantaneous. Things break when they are turned off for the downtime, and the problems are discovered only when they are turned back on. People make mistakes in hardware and software improvements, and these must be found and corrected. Temperatures change and the ground moves, so the beam must be re-tuned. Rather than trying to model downtime recovery procedures in detail, AVAILSIM simply assumes that the time it takes to get good beam out of a region of the accelerator is, on average, proportional to the time that region was without beam. The constant of proportionality was typically 10% per region, except for the DRs and the interaction region, for which 20% was used, and simple transport lines, for which 5% was used. Altogether, this recovery time on average slightly more than doubles the downtime due to a repair.
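The recovery rule in the last bullet, combined with the access overheads in the first, can be written down directly. A minimal sketch: the proportionality constants are those quoted above, while the region labels and the 8-hour repair are illustrative.

```python
# Recovery model described above: the time to get good beam back out of a
# region is proportional to the time that region was without beam. Constants
# are the ones quoted in the text; region labels are illustrative.
RECOVERY_FRACTION = {
    "damping_ring":       0.20,  # DRs and interaction region: 20%
    "interaction_region": 0.20,
    "linac":              0.10,  # typical regions: 10%
    "transport_line":     0.05,  # simple transport lines: 5%
}

RADIATION_COOLDOWN = 1.0   # hours for prompt radiation to decay before entry
LOCKUP_AND_RESTORE = 1.0   # hours to lock up, turn on, standardize supplies

def total_downtime(region, repair_hours, tunnel_access=True):
    """Repair time plus access overheads plus the region's recovery time."""
    beam_off = repair_hours
    if tunnel_access:
        beam_off += RADIATION_COOLDOWN + LOCKUP_AND_RESTORE
    return beam_off + RECOVERY_FRACTION[region] * beam_off

# Example: an 8-hour in-tunnel repair upstream of the damping rings.
print(f"{total_downtime('damping_ring', 8.0):.1f} h total")  # 10 h off + 2 h recovery
```

Note that a single 20% constant understates the total effect: a long downtime leaves every downstream region without beam, so the per-region recovery times accumulate, which is presumably how recovery comes to slightly more than double the repair downtime on average.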
The accelerator we simulated is not exactly the SB2009 design, but we expect it is close enough that the simulation results are meaningful. We will continue our studies and add the remaining SB2009 changes in the near future. The simulated ILC had the following features:

1. The linac could be in one or two tunnels.
2. Low power (half the RDR number of bunches and half the RF power).
3. Three RF systems were simulated for the main linac: RDR, KlyClus and DRFS. Short linacs (injectors etc.) were always simulated with the RDR RF scheme.
4. Two 6 km DRs in the same tunnel near the IR. (Going to 3 km would decrease the component count and slightly improve the availability for all simulation runs.)
5. RTML transport lines in the linac tunnels.
6. Injectors in their own separate tunnels. (Putting them in the BDS tunnel will probably slightly decrease the availability for all simulation runs.)
7. The e+ source uses an undulator at the end of the linac.
8. There is an e+ keep-alive source of adequate intensity to do machine development and to tune up during a downtime recovery.
9. Injectors, RTML turn-around, DRs and BDS have all power supplies and controls accessible with beam on. (Pre-RDR one- vs. two-tunnel studies had these inaccessible in the one-tunnel case.)

a. Machine models and segmentation
b. Operations model
   i. Response to failed equipment
   ii. Machine development
   iii. Recovery
c. Maintenance model
   i. Limiting the number of repair personnel needed during scheduled maintenance

6) Outstanding issues and concerns

Availability studies generally require input on technology, system layout, and operations and maintenance models. It is critically important, therefore, to make sure that the results of these studies are well documented and accessible, so that they provide guidance for ongoing and future R&D activities. The following outstanding availability issues and concerns have been identified. They are important for both the Reference Design and the proposed baseline.

a. Integration with sub-system designers

Component MTBF and MTTR performance requirements have been assigned and tabulated as a by-product of the availability analysis. In some cases the performance criteria needed represent an advance beyond what is routinely achieved in large colliders; in those cases, industrial best-practice guidelines or light-source performance has been assumed. For both, added design effort, compared to large colliders, has to be applied.

b. Estimating impact on cost

Improved MTBF performance has a cost increment. In simple terms, higher quality off-the-shelf components with adequate reliability performance will cost more.

c. Improving recovery times

The simulation includes a ‘luminosity recovery process’ model, whereby the time to recover full-spec performance in each subsystem is proportional to the time without beam (due, for example, to an upstream component failure). Completed simulation results indicate that this has a substantial impact, amounting to about half of the total simulated downtime. The proportionality constants used are based on observations at several facilities, and some degree of variability is seen. A consistent approach to improving system availability includes effort on both component MTBF and the associated recovery process.

d. Safety – impact on operations and maintenance model

Each intervention for maintenance, including preventive activities and incident response, must be subject to a fully integrated safety analysis. The simulation includes several aspects of this, such as radiation cool-down wait times and an allowance for proper preparation and entry procedures. It is reasonable to expect additional constraints.
e. Accessibility – equipment transport

For repairs within the beamline enclosure, the availability model includes an estimate of the time to move equipment from a storage area to the access-way. With a more detailed underground equipment layout, the model will be updated to provide a more realistic estimate of transit times.

f. Radiation effects

Radiation effects may reduce electronic component MTBFs substantially. This is important because the new baseline layout does not provide nearby support-equipment enclosure space for the main linac and RTML. Several accelerator labs have experience with electronics located near the beamline, notably the PEP-II B-Factory and (soon) the LHC. The availability simulation assumes that the electronics is protected from radiation and does not take possible reduced performance into account.

g. Initial commissioning

None of the studies includes an estimate of initial commissioning effects.

h. Staffing levels

The model shows, correctly, the balance between component failure rate and the total time to replace or repair. An important ingredient of that balance is the estimate of the staffing level, which determines how many maintenance activities can proceed in parallel. Because component failures are statistical in nature, the simulation can provide a good estimate of the required staffing levels.

i. MTBF data

The estimates are based on reliability data collected from labs and from the industrial literature. In many cases representative estimates are difficult to obtain, so the analysis has been based on data from a single institute, or is simply an estimate. There are several reasons why valid MTBF estimates are difficult to obtain, notably the tendency of organizations to group failures according to who in the organization is responsible for repairs of a given subsystem.

j. Positron system

7) Conclusions

[i] Proceedings of PAC 2007, Albuquerque, New Mexico, pp. 1966-1969.