OSG–doc–1054
June 30, 2011
www.opensciencegrid.org
Report to the National Science Foundation
June 2011
Miron Livny (University of Wisconsin), PI, Technical Director
Ruth Pordes (Fermilab), Co-PI, Executive Director
Kent Blackburn (Caltech), Co-PI, Council co-Chair
Paul Avery (University of Florida), Co-PI, Council co-Chair
Table of Contents

1. Executive Summary
   1.1 What is Open Science Grid?
   1.2 Usage of Open Science Grid
   1.3 Science enabled by Open Science Grid
   1.4 Technical achievements in 2010-2011
   1.5 Preparing for the Future
2. Contributions to Science
   2.1 ATLAS
   2.2 CMS
   2.3 LIGO
   2.4 ALICE
   2.5 D0 at Tevatron
   2.6 CDF at Tevatron
   2.7 Nuclear physics
   2.8 Intensity Frontier at Fermilab
   2.9 Astrophysics
   2.10 Structural Biology
      2.10.1 Wide Search Molecular Replacement Workflow
      2.10.2 DEN Workflow
      2.10.3 OSG Infrastructure
      2.10.4 Outreach: Facilitating Access to Cyberinfrastructure
   2.11 Computer Science Research
3. Development of the OSG Distributed Infrastructure
   3.1 Usage of the OSG Facility
   3.2 Middleware/Software
   3.3 Operations
   3.4 Integration and Site Coordination
   3.5 Campus Grids
   3.6 VO and User Support
   3.7 Security
   3.8 Content Management
   3.9 Metrics and Measurements
   3.10 Extending Science Applications
      3.10.1 Scalability, Reliability, and Usability
      3.10.2 Workload Management System
   3.11 OSG Technology Planning
   3.12 Challenges facing OSG
4. Satellite Projects, Partners, and Collaborations
   4.1 CI Team Engagements
   4.2 Condor
      4.2.1 Release Condor
      4.2.2 Support Condor
   4.3 High Throughput Parallel Computing
   4.4 Advanced Network Initiative (ANI) Testing
   4.5 ExTENCI
   4.6 Corral WMS
   4.7 OSG Summer School
   4.8 Science enabled by HCC
   4.9 Virtualization and Clouds
   4.10 Magellan
   4.11 Internet2 Joint Activities
   4.12 ESNET Joint Activities
5. Training, Outreach and Dissemination
   5.1 Training
   5.2 Outreach Activities
   5.3 Internet dissemination
6. Cooperative Agreement Performance
Sections of this report were provided by the scientific members of the OSG Council, OSG PIs and Co-PIs, and OSG staff and partners. Paul Avery and Chander Sehgal acted as the editors.
The scope of this report is: an initial summary of the goals and accomplishments of the OSG; a summary of the OSG-related accomplishments of each of the major scientific contributors and beneficiaries; progress in each of the technical areas of the distributed infrastructure and services; synergistic and beneficial contributions of OSG's important satellites (specifically, contributing projects) and partnerships; a summary of the training, outreach, and dissemination activities; and documentation of Cooperative Agreement performance. The appendices give more detailed information on the publications from, and the usage of OSG by, each of the major scientific communities.
1. Executive Summary

1.1 What is Open Science Grid?
Open Science Grid (OSG) is a large-scale collaboration that is advancing scientific knowledge through high performance computing and data analysis by operating and evolving a cross-domain, nationally distributed cyber-infrastructure (Figure 1).
Meeting the strict demands of the scientific community has not only led OSG to actively drive the frontiers of High Throughput Computing (HTC) and massively Distributed Computing, it has also led to the development of a production-quality facility. OSG's distributed facility, composed of laboratory, campus, and community resources, is designed to meet the current and future needs of scientific operations at all scales. It provides a broad range of common services and support, a software platform, and a set of operational principles that organize and support scientific users and resources via the mechanism of Virtual Organizations (VOs).
The OSG program consists of a Consortium of contributing communities (users, resource administrators, and software providers) and a funded project. The OSG project is jointly funded, until early 2012, by the Department of Energy SciDAC-2 program and the National Science Foundation.
Figure 1: Sites in the OSG Facility
OSG does not own the computing, storage, or network resources used by the scientific community; these resources are contributed by the community, organized by the OSG facility, and governed by the OSG Member Consortium. OSG resources are summarized in Table 1.
Table 1: OSG computing resources

Number of Grid-interfaced processing resources on the production infrastructure: 131
Number of Grid-interfaced data storage resources on the production infrastructure: 61
Number of Campus Infrastructures interfaced to the OSG: 9 (GridUNESP, Clemson, FermiGrid, Purdue, Wisconsin, Buffalo, Nebraska, Oklahoma, SBGrid)
Number of National Grids interoperating with the OSG: 3 (EGI, NDGF, TeraGrid)
Number of processing resources on the integration testing infrastructure: 28
Number of Grid-interfaced data storage resources on the integration testing infrastructure: 11
Number of cores accessible to the OSG infrastructure: ~70,000
Size of disk storage accessible to the OSG infrastructure: ~29 Petabytes
CPU wall clock usage of the OSG infrastructure: average of 56,000 CPU-days/day during May 2011

1.2 Usage of Open Science Grid
High Throughput Computing technology created and incorporated by the OSG and its contributing partners has now advanced to the point that scientific users (VOs) are utilizing more simultaneous resources than ever before. Typical VOs now utilize between 15 and 20 resources, with some routinely using as many as 40–45 simultaneous resources. The transition to a pilot-based job submission system, which was broadly deployed in the OSG last year, has enabled VOs to quickly scale to a much higher utilization level than ever before. For example, SBGrid went from struggling to achieve ~1000 simultaneous jobs to being able to scale to over ~4000 simultaneous jobs in less than a month. The overall usage of OSG has again increased by more than 50% this past year and continues to grow at a steady rate (Figure 2). Utilization by each stakeholder varies depending on its needs during any particular interval. Overall use of the facility for the 12-month period ending June 2011 was 424M hours, compared to 270M hours for the previous 12-month period ending June 2010; detailed usage plots can be found in the attached document on Production on Open Science Grid. During normal, stable operations, OSG now provides over 1.4M CPU wall-clock hours a day (~56,000 CPU-days per day), with peaks occasionally exceeding 1.6M hours a day; approximately 400K–500K opportunistic hours (~30%) are available on a daily basis for resource sharing. Based on transfer accounting, we measure approximately 0.6 PetaBytes of data movement (both intra- and inter-site) on a daily basis, with peaks of 1.2 PetaBytes per day. Of this, we estimate 25% is GridFTP transfers between sites and the rest is via LAN protocols.
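For readers who want to relate these daily figures to one another, the short calculation below converts the wall-clock hours quoted in this paragraph into CPU-days and an opportunistic share. It is only an illustrative sketch using the numbers above, not an OSG accounting tool.

```python
# Illustrative arithmetic only: relates the daily wall-clock figures quoted above.
# The values come from the text of this report; this is not an accounting script.

HOURS_PER_DAY = 24.0

daily_wall_hours = 1.4e6                 # ~1.4M CPU wall-clock hours delivered per day
opportunistic_hours = (4.0e5, 5.0e5)     # ~400K-500K hours/day available opportunistically

cpu_days_per_day = daily_wall_hours / HOURS_PER_DAY
print("Equivalent CPU-days per day: %.0f" % cpu_days_per_day)   # close to the ~56,000 quoted

for h in opportunistic_hours:
    # Opportunistic share of the daily total, roughly 29-36%, i.e. the ~30% quoted above
    print("Opportunistic share: %.0f%%" % (100.0 * h / daily_wall_hours))
```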
Figure 2: OSG Usage (hours/month) from July 2007 to June 2011
The number of non-HEP CPU-hours (Figure 3) is regularly greater than 1 million CPU-hours per week (average 1.1M), even with the LHC actively producing data. LIGO averaged approximately 80K hours/day, and non-physics use now averages 85K hours/day (9% of the total).
Figure 3: OSG non-HEP weekly usage from June 2010 to June 2011. LIGO (shown in red) is the
largest non-HEP contributor.
1.3 Science enabled by Open Science Grid

Published, peer-reviewed papers are one easily measurable metric of the science enabled by the Open Science Grid. The following table shows the publications in the past 12 months:
Table 2: Science Publications in 2010-2011 Resulting from OSG Usage

VO | # pubs | Type of Science | In Addition
ALICE | 4 | US LHC |
ATLAS | 38 | US LHC |
CDF | 43 | Tevatron Run II |
CIGI | 3 | Geographic Information System research |
CMS | 57 | US LHC |
DES | 1 | Astrophysics |
D0 | 23 | Tevatron Run II | 8 accepted & 12 submitted
Engage | 20 | Mathematics modeling, virtualization research, molecular dynamics, protein folding | 3 submitted
GLOW | 19 | Neutrino physics, genomics, chemical modeling, molecular dynamics, proteomics, biology |
Grid UNESP | 10 | Cosmic ray physics, genomics |
HCC | 1 | Protein modelling |
IceCube | 7 | Neutrino Physics |
LIGO | 11 | Gravitational Wave Physics |
Mini-Boone | 1 | Neutrino Physics |
MINOS | 7 | Neutrino Physics |
NYSGRID | 1 | Molecular dynamics |
SBGRID | 3 | Structural Biology |
STAR | 9 | Nuclear Physics |
OSG & DHTC Research | 2 | Computer Science |
Total | 260 | |
Publication counts, however, capture only part of the science enabled by OSG. Infrastructure and service providers such as OSG enable and improve scientific discoveries resulting from theoretical models and experimentally acquired data by providing an important computational "middle layer". The availability of large and easily accessible computational resources provides researchers with powerful opportunities to examine data in unusual ways and explore new scientific possibilities. These benefits are reflected in the US LHC statement of "reduced time to
publication” – as stated in our recent proposal: “This global shared computing infrastructure of
unprecedented scale and throughput has facilitated a transformation in the delivery of results
from the most advanced experimental facility in the world – enabling the public presentation of
results days to weeks after the data is acquired rather than the months to years it took in the recent past.”
We thus pick out some illustrative enabling features of the OSG contributions over the past year:

 The LHC experiments all had an extremely productive 12 months, and the rapid turnaround from beam to publications has been a tremendous success all round and in all areas.
 OSG enabled US ATLAS and US CMS to develop and deploy mechanisms that have supported productive use of over 40 U.S. university Tier-3 sites over the past year.
 The OSG has enabled US LHC to modify their data distribution and analysis models, while maintaining production services, to address the needs of the experiments. For example, US ATLAS designed and deployed new data placement models without perturbing production throughput, and OSG has continued to interoperate well with the Worldwide LHC Computing Grid while a new Computing Element service (CREAM) was tested and deployed in Europe.
 Ramp-up of ALICE production usage on the ALICE USA OSG sites.
 LIGO has continued to make significant use of the OSG for Einstein@Home production and is ramping up its use of OSG services for other analyses.
 The Tevatron D0 and CDF experiments use the OSG facility for a large fraction of their simulation and analysis processing. Both experiments now use OSG job management services to submit their jobs seamlessly across both the European (EGI) and OSG infrastructures.
 This year has seen an increase in science enabled by OSG that is carried out and supported locally at existing and "new" campus distributed infrastructures.
 Beyond the physics communities, notable users include 1) the structural biology group at Harvard Medical School, 2) groups using GLOW, the Holland Computing Center, and NYSGrid, and 3) mathematics research at the University of Colorado.

The structural biology community, supported by the SBGrid project and VO, has significantly increased its use of OSG and published results in the areas of protein structure modeling and prediction. The Harvard Medical School paper was published in Nature, and the methodologies and portal were published in PNAS.
1.4 Technical achievements in 2010-2011
The technical program of work in the OSG project is defined by an annual WBS created by the
area coordinators and managed and tracked by the project manager. The past 12 months saw a
redistribution of staff for the last year of the currently funded project. Table 3 shows the distribution of OSG staff by area and institution:
Table 3: Distribution of OSG staff
Area of Work
OSG Technical Director
Software Tools Group
Production Coordination
LIGO Specific Requests
Software
0.5
0.5
0.5
1.25
6.6
Operations
Integration and Sites
VO Coordination
Engagement
5.5
4.05
0.45
1.3
Campus Grids
Security
Training and Content Management
Extensions
Scalability, Reliability and Usability
Workload Management Systems
support
Internet2
Consortium + Project Coordination
Metrics
Communications and Education
0.75
2.35
1.7
Project Manager
0.85
Total
Contribution1
OSG Staff
1.0
0.17
0.25
1.15
0.1
1.35
0.2
0
0.75
0.4
2.2
Institutions
(lead institution first)
UW Madison
UW Madison, FNAL
U Chicago
Caltech
UW Madison, Fermilab,
LBNL
Indiana U, Fermilab, UCSD
U Chicago, Caltech, LBNL
UCSD, Fermilab
RENCI, ISI, Fermilab,
UW Madison
UChicago, UNL
Fermilab, NCSA, UCSD
Caltech, Fermilab, Indiana U
UCSD, BNL
UCSD
BNL, UCSD
1.08
0.5
1
31.83
0.27
0.2
Internet2
Fermilab, Caltech, UFlorida
0.05
UNebraska
Fermilab, UFlorida, UW
Madison
Fermilab
5.35
1 Contribution included in the WBS. Many other contributions come from external projects including the science communities, software developers, and resource administrators.

By the end of 2010 the OSG staff had decreased by 2 FTEs due to staff departures. More than three quarters of the OSG staff directly support the operations and software for the ongoing stakeholder production and applications; the remaining quarter mainly engages new user communities, extends and proves software and capabilities, and provides management and communications.
The 2010 WBS defines more than 400 tasks, including both ongoing operational tasks and activities to put in place specific tools, capabilities, and extensions. The area coordinators update the WBS quarterly. As new requests are made to the project, the Executive Team prioritizes them against the existing tasks in the WBS. Additions are then made to the WBS to reflect the activities accepted into the work program; some tasks are dropped, and the deliverable dates of other tasks are adjusted according to the new priorities. In FY10 the WBS was 85% accomplished (about the same as in FY09).
In the past 12 months, the main technical achievements that directly support science include:
 Operation, facility, software, and consulting services for LHC data distribution, production, analysis, and simulation as the LHC accelerator came online and the full machinery of the worldwide distributed facility was used "under fire" to deliver understanding of the detectors, physics results, and publications.
 Sustained support for and response to operational issues, and inclusion and distribution of new software and client tools, in support of LHC operations. Improved "ticket-exchange" and "critical problem response" technologies and procedures were put in place, and well exercised, between the WLCG, EGEE/EGI operations services, and the US ATLAS and US CMS support processes.
 Agreed-upon Service Level Agreements were put in place for all operational services provided by OSG (including operation of the WMS pilot submission system), along with a draft of an SLA with ESNET for the CA service.
 OSG carried out a "prove-in" of reliable critical services (e.g., BDII) for the LHC and operated services at levels that meet or exceed the needs of the experiments. In addition, OSG worked with the WLCG to improve the reliability of the BDII-published information for the LHC experiments. This effort included robustness tests of the production infrastructure against failures and outages, and validation of the information by the OSG as well as the WLCG.
 Success in simulated data production for the STAR experiment using virtual machine-based software images on grid and cloud resources.
 Significant improvements in LIGO's ability to use the OSG infrastructure, including adapting Einstein@Home for Condor-G submission (a minimal Condor-G submit sketch appears after this list), resulting in a greater than 5x increase in the use of OSG by Einstein@Home; support for GlideinWMS submission of LIGO analysis applications; and testing of the Binary Inspiral application and ramp-up of the LIGO Pulsar application running across more than 5 OSG sites.
 Delivery of VDT components in "native packaging" for use on the LIGO Data Grid, the OSG Worker Node Client, and other specific components that are high priority for the stakeholders, e.g., glexec for US ATLAS.
 Two significant publications from the structural biology community (SBGrid) based on production running across many OSG sites, as well as a rise in multi-user access through the SBGrid portal software.
 Entry of ALICE USA into full OSG participation for the experiment's US sites, following a successful evaluation activity. This includes WLCG reporting and accounting through the OSG services.
 Sustained and more effective support for further Geant4 validation runs.
 Ongoing support for IceCube and GlueX, and initial engagement with LSST and NEES.
 Increased opportunistic cycles provided to OSG users by our collaborators in Brazil and at Clemson.
 Extension of the campus communities and of locally supported research through the increasingly active and effective campus community at the Holland Computing Center at the University of Nebraska, and support for multi-core applications for the University of Wisconsin-Madison GLOW community.
 Security program activities that continue to improve our defenses and our capabilities for incident detection and response, via review of our procedures by peer grids and adoption of new tools and procedures.
 Better understanding of the role and interfaces of Satellite projects as part of the larger OSG Consortium contributions, and increased technical and educational collaboration with TeraGrid through the NSF-funded joint OSG–TeraGrid effort, ExTENCI, which began in August 2010 (see Section 4.5), and through the joint OSG-TG summer student program.
 Contributions to the WLCG in the areas of the new job execution service (CREAM), use of pilot-based job management technologies, and interfaces to commercial (EC2) and scientific (Magellan, FutureGrid) cloud services.
 Extensive support for US ATLAS and US CMS Tier-3 sites in security vulnerability testing.
 Packaging and testing of XROOTD for the US ATLAS and US CMS Tier-3 sites in collaboration with the XROOTD development project located at SLAC and CERN.
 Improvements (e.g., native packaging of software components, improved documentation, support for "storage only" sites) to reduce the "barrier to entry" to participate as part of the OSG infrastructure. We have held training schools for site administrators and a storage forum as part of the support activities.
 Contributions to the PanDA and GlideinWMS workload management software that have helped improve capability and supported broader adoption of these systems within the experiments, as well as reuse of these technologies by other communities. Configuration of PanDA for the Integration Test Bed automated testing and validation system.
 Support for a central WMS "pilot factory" for more than 5 VOs, and reduction in the barrier for new communities to run production across OSG.
 Continued collaboration with ESnet and Internet2 on perfSONAR and Identity Management.
 Improvement and validation of the collaborative workspace documentation for all OSG areas of work.
 A successful summer school for 17 students and OSG staff mentors, as well as successful educational schools in South and Central America.
 Continuation of excellence in e-publishing through the collaboration with the European Grid Infrastructure, represented by the International Science Grid This Week electronic newsletter (www.isgtw.org). The number of readers continues to grow.
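As a concrete illustration of the Condor-G submission path mentioned in the Einstein@Home item above, the sketch below generates a minimal grid-universe submit description. The gatekeeper hostname, executable, and file names are placeholders, not actual LIGO or OSG endpoints, and a real production workflow adds many more attributes.

```python
# Minimal sketch of a Condor-G (grid universe) submit description of the kind used
# to route jobs to an OSG gatekeeper. All hostnames and file names are placeholders.

submit_description = """\
universe      = grid
grid_resource = gt2 gatekeeper.example.edu/jobmanager-condor
executable    = run_analysis.sh
arguments     = segment_0001
output        = job.out
error         = job.err
log           = job.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
"""

with open("analysis.submit", "w") as f:
    f.write(submit_description)

# The job would then be submitted with: condor_submit analysis.submit
```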
In summary, OSG continues to demonstrate that national cyber infrastructure based on federation of distributed resources can effectively meet the needs of researchers and scientists.
1.5 Preparing for the Future
The OSG project's one-year extension request to the DOE SciDAC-2 program, to enable continuation of support for HEP and NP operations and production until March 2012, was accepted.
In March 2011 we submitted a proposal for OSG "2011-2016" to NSF. The vision covers: sustaining the infrastructure, services, and software; extending the capabilities and capacities of the services and software; and expanding the reach to include new resource types – shared intra-campus infrastructures, commercial and scientific clouds, multi-core compute nodes – and new user communities, such as the NEESComm and LSST programs, that are in the early stages of testing the benefit of OSG to their science.
We held a review of the new OSG proposal in January 2011 with external reviewers invited by the Executive Director. The reviewers were the head of the WLCG project, Ian Bird; the two XD proposal PIs, John Towns and Richard Moore; and senior ESnet manager Bill Johnston. This review was very valuable. Among the outcomes was a better definition of the scope of OSG's services, as shown in Figure 4.
Figure 4: Diagram showing the relationship between users, services and resources
We submitted a proposal to the DOE ASCR SciDAC-3 Institute call for the Institute for Distributed High Throughput Computing (InDHTC), where about half of the program will contribute directly to the OSG program and the other half will extend the research and development of DHTC technologies to a broader set of the DOE scientific communities.
The following documents have been developed as input to planning OSG's future programs:
 National CI and the Campuses (OSG-939)
 Requirements and Principles for the Future of OSG (OSG-938)
 OSG Interface to Satellite Proposals/Projects (OSG-913)
 OSG Architecture (OSG-966)
 Report from the Workshops on Distributed Computing, Multidisciplinary Science, and the NSF's Scientific Software Innovation Institutes Program (OSG-1002)
In further preparation for the future, we have put in place a revised organization (Figure 5) for year 6 of the current project and for the future work. Lothar Bauerdick was endorsed as the new Associate Executive Director of the OSG, with particular responsibility to liaise with other "docked" projects, such as InDHTC, that deliver to the OSG core program of work. The Council Co-chairmanship passed from Kent Blackburn to Rick Snider as part of the preparations for the future of OSG. A revision of the management plan was published.
Figure 5: OSG Consortium and Project org charts
OSG will become a Service Provider (SP) to the XSEDE project – the next phase of the
TeraGrid project – that starts on July 1, 2011. OSG will have representation on the XSEDE SP
Forum, participation in the services offered to an SP, and interoperability with other SPs. The
Technical Director and PI, Miron Livny, is proposed as the SP representative, with the Production Coordinator, Dan Fraser, as the alternate. OSG and XSEDE will have the following “connection points”:
 Between the Campus activity and the Training Education Outreach Service Campus Champions program to facilitate the creation and distribution of materials, training, education and communications related to the resources and services available from OSG;
 Between User Support and the Advanced User Support Services for applications using OSG and in particular making use of both OSG and XSEDE.
The OSG Security Officer will coordinate and collaborate with the XD security program in:
 Common approaches to the risk and software vulnerability assessment together with common personnel at NCSA (Adam Slagell).
 Common approaches to the Federated ID management services together with common personnel at NCSA (Jim Basney).
 Coordination of policy development together with the European community (Jim Marsteller, Jim Barlow).
 Coordination of response and mitigations in the case of security incidents.
Through synergistic activities at the OSG Indiana University Grid Operations Center and the NCSA Operations Activity, OSG and XSEDE will support failover of the operations services when local failures occur.
2. Contributions to Science

2.1 ATLAS
The ATLAS collaboration, consisting of 174 institutes from 38 countries, completed construction of the ATLAS detector at the LHC, and began first colliding-beam data taking in late 2009.
The 44 institutions of U.S. ATLAS made major and unique contributions to the construction of
the ATLAS detector, provided critical support for the ATLAS computing and software program
and detector operations, and contributed significantly to physics analysis, results, and papers
published.
Experience gained during the first year of ATLAS data taking gives us confidence that the grid-based computing model has sufficient flexibility to process, reprocess, distill, disseminate, and analyze ATLAS data in a way that uses both computing and manpower resources efficiently. The computing facilities in the U.S. are based on the Open Science Grid (OSG) middleware stack and currently provide a total of 150k HEP-SPEC06 of processing power and 15 PB of disk space. The Tier-1 center at Brookhaven National Laboratory and the five Tier-2 centers, located at eight universities (Boston University, Harvard University, Indiana University, Michigan State University, University of Chicago, University of Michigan, University of Oklahoma, and University of Texas at Arlington) and at SLAC, have contributed to the worldwide computing effort at the expected level (23% of the total). Time-critical reprocessing tasks were completed at the Tier-1 center within the foreseen time limits, while the Tier-2 centers were widely used for centrally managed production and user analysis.
In the ATLAS computing resource projections for 2011-2013, the requirements for the year 2012 were initially based on an LHC shutdown in 2012 for consolidation of the superconducting magnet splices, and requested for that year only modest increases in computing resources above the level required in 2011, for instance for storage of increased samples of simulated data. The ATLAS resource model foresaw at that time a larger increment in resource requirements from 2012 to 2013, when LHC operations would resume, than from 2011 to 2012. Following the annual LHC
workshop in Chamonix early this year, the experiments and CERN management decided to
change the LHC schedule to continue operations in 2012, and to initiate the shutdown in 2013.
Consequently, ATLAS computing resource requirements for 2012 will be higher than estimated
a year ago. Incremental resources that were originally not expected to be needed until data-taking in 2013 will now be required for data-taking in 2012, calling for a re-profiling of resource requirements.
The computing resource requirements are based upon an assumed factor of ~2 increase, which
requires improvements beyond present simulation-based projections. ATLAS has initiated a
high-priority program to identify and implement improvements to reduce these increases with as
little impact as possible on its physics capability. Improvements have already been implemented
that reduce overall reconstruction time by ~20%, via a combination of code optimization, reduced efficiency for tracks from converted photons, and an increase in pT threshold from 100
MeV to 400 MeV for charged particle tracking. Reduction in ESD event size by ~30% has also
already been achieved via removal of information that can be recomputed, with some loss of precision, from remaining quantities. Work on further improvements in reconstruction time and in
raw, ESD, and AOD event sizes is ongoing. As explained below, the ATLAS data distribution model has also been adapted in order to manage larger event sizes within pledged 2011 resources.
Nonetheless, in light of the significant impact of anticipated increases in pileup, it is not foreseen
that a trigger rate much higher than 200 Hz can be handled within planned resources. Consequently, the computing resource requirements continue to be based upon a 200-Hz trigger output
rate.
The significant increases in data volume due to pileup (and beneficial increases in trigger rate)
provide a substantial challenge to fit within the foreseen 2011 computing resources. To address
this challenge, ATLAS has conducted a careful review of the use of the different data formats,
especially the large ESD event format (1.5 MB/event in 2010 data taking) and developed a new
model for data distribution for 2011 and 2012. ESD will be dropped as a disk-resident format for
unselected data events, and restricted to small sub-samples of ESD for specific calibration purposes (here referred to as “dESD”, derived-ESD samples), limited to the same total data volume
as the total AOD size. Whereas in the past, multiple replicas of two generations of ESD were
disk-resident, a single copy of RAW data will now be retained on disk. This copy of disk-resident RAW data will provide more than AOD information for all events, but will require significantly less disk space than was formerly used by ESD data. There is even the prospect of further
gains in size with compression of the RAW data, currently being investigated. Although some
data analysis operations will not be as convenient or timely under this new plan, the impact of
restricted access to ESD is deemed to be outweighed by the associated reductions in disk space
requirements.
In the new model, to address the challenge posed by larger events from increased pileup, the new
data distribution model for 2011/12 will significantly reduce the number of replicas of AOD sets
and dESD sets at Tier-2 centers. The number of replicas of AOD data from the current version of
processing across the worldwide ATLAS computing facility will be reduced from 10 to 8 and,
for AOD from the previous version, from 10 to 2. The number of replicas of dESD data will be
reduced from 10 to 4. Although the accessibility of data sets, particularly AOD, will be reduced,
these changes are necessary in 2011 in order to fit within pledged resources, and are deemed
preferable to drastic cuts in event size that would affect physics analysis more. It is hoped that
the movement of analysis to further derived data samples (e.g. n-tuples) and the dynamic data
placement mechanism outlined below will somewhat mitigate the reduction in AOD and dESD
accessibility.
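As a purely illustrative calculation of why these replica reductions matter, the sketch below compares the relative Tier-2 disk footprint before and after the change, under the simplifying assumption (not taken from the report) that one full AOD copy and one full dESD set occupy comparable volumes.

```python
# Illustrative estimate only: relative Tier-2 disk footprint before and after the
# replica-count changes described above. It assumes, purely for illustration, that
# one full AOD copy and one full dESD set occupy comparable volumes (V); real
# savings depend on the actual per-format sizes.

V = 1.0  # arbitrary unit: volume of one full AOD copy (dESD set assumed similar)

old_footprint = 10 * V + 10 * V + 10 * V   # current AOD, previous AOD, dESD: 10 replicas each
new_footprint = 8 * V + 2 * V + 4 * V      # reduced to 8, 2, and 4 replicas respectively

print("Relative footprint: %.0f%% of the old model" % (100 * new_footprint / old_footprint))
# -> roughly half the disk space under the stated assumption
```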
As to data placement, operational experience has shown that pre-placement of physics datasets at
Tier-2 sites via a predefined number of replicas per cloud, as in the 2010 data distribution model,
is not optimal regarding efficient use of CPU and disk space. In that model, it is difficult to optimize the use of CPU resources because, in ATLAS analysis, jobs go where their specific input
data reside. Thus, if a site is unlucky, it could have many datasets that are less in demand and its
CPU resources will be underutilized. To mitigate this issue, ATLAS experimented in 2010 with a
dynamic data placement mechanism that increases the number of replicas of data within, and
across, clouds upon demand. This mechanism works well, and can potentially keep Tier-2 disks
full with the most useful data. It improved CPU usage and decreased network bandwidth demands. It is also expected to improve scalability to meet future requirements, when analysis on a
variety of data samples from different years will vary widely. Consequently, this mechanism,
PanDA Dynamic Data Placement (PD2P, further explained below), is now deployed in all Tier-2
clouds. Its algorithm is still being tuned, and the detailed performance gains are not yet quantified precisely.
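To make the mechanism concrete, the sketch below illustrates the kind of demand-driven replication decision described above. The function names, dataset attributes, and thresholds are hypothetical and are not taken from the PanDA/PD2P code base, which is considerably richer.

```python
# Simplified sketch of a demand-driven ("PD2P-style") replication decision:
# replicate a dataset to an additional Tier-2 only once analysis jobs actually
# request it, and only while the candidate site has free space.

def should_replicate(dataset, candidate_site, existing_replicas, min_free_tb=10.0):
    """Return True if 'dataset' should gain a replica at 'candidate_site'."""
    if candidate_site in existing_replicas:
        return False                              # already resident there
    if dataset["pending_analysis_jobs"] == 0:
        return False                              # no demand yet: do not pre-place
    if site_free_space_tb(candidate_site) < dataset["size_tb"] + min_free_tb:
        return False                              # candidate's cache is (nearly) full
    return True

def site_free_space_tb(site):
    # Placeholder: a real system would query the site's space-reporting service.
    return 50.0

# Hypothetical request: a dataset with queued analysis jobs and one existing replica.
request = {"name": "data11_example.AOD", "size_tb": 3.2, "pending_analysis_jobs": 120}
print(should_replicate(request, "MWT2", existing_replicas={"BNL"}))   # -> True
```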
Following the short run in late 2009, LHC collider operations resumed in March 2010. By early December 2010 the ATLAS collaboration had taken 1.2 billion events from proton-proton collisions and more than 200 million events from heavy-ion (HI) collisions. The total RAW (unprocessed) data volume taken at the ATLAS detector in 2010 amounts to almost 2 PB (1.6 PB pp and 0.3 PB HI data). Following a four-month break, LHC operations for the experiments resumed in March 2011, and ATLAS has since collected another 780 TB of RAW data. While the RAW data was directly replicated to all ten ATLAS Tier-1 centers according to their MoU share (the U.S. receives and archives 23% of the total), the derived data was, after prompt reconstruction at the Tier-0 center, distributed to regional Tier-1 centers for group and user analysis and further distributed to the regional Tier-2 centers. Following significant improvements incorporated into the reconstruction code and the availability of improved calibration data, re-reconstruction of all data validated for physics so far was conducted at the Tier-1 centers, while users analyzed the data using resources at the Tier-1 site, all Tier-2 centers, and their own institutional computing facilities. As the amount of initial data was small, we observed a spike in resource usage at the higher levels of the facility, with users running data-reduction steps followed by transfers of the derived, condensed data products to the compute servers they use for interactive analysis; this resulted in reduced utilization of grid resources for a few months until LHC operations resumed in March 2011.
Figure 6: Integrated luminosity as delivered by the LHC and as measured by ATLAS

Figure 7: Volume of RAW and derived data accumulated by ATLAS since the start of 2011 data taking (March 2011). Note: while ESDs are produced at the Tier-0, they are not archived.
Centrally managed Monte Carlo production, collision data reconstruction, and user analysis are ongoing, with some 70,000 concurrent jobs worldwide, of which 19,000 are running on the combined U.S. ATLAS Tier-1 and Tier-2 resources.
Figure 8: OSG CPU hours (124M total) used by ATLAS over 12 months.
On average, the U.S. ATLAS facility contributes 30% of worldwide analysis-related data access. The number of user jobs submitted by the worldwide ATLAS community and brokered by PanDA to U.S. sites has reached an average of 1.2 million per month, peaking occasionally at more than 2 million jobs per month. Over the course of the reporting period more than 17 million user analysis jobs were completed successfully; of these, 11 million ran at the U.S. ATLAS Tier-2 centers and 6 million at the Tier-1 center.
Figure 9: Weekly number of Production and Analysis jobs in the U.S. managed by PanDA
Based on ATLAS' data distribution model, which foresees multiple replicas of the same datasets within regions such as the U.S., a significant problem was observed shortly after 7 TeV data taking started in March 2010: sharply increasing integrated luminosity produced an avalanche of new data. In particular, disk storage at Tier-2 sites filled up rapidly, and a solution had to be found to accommodate the data required for analysis. Based on job statistics that include information about data usage patterns, it was found that only a relatively small fraction of the programmatically replicated data was actually accessed. U.S. ATLAS, in agreement with ATLAS computing management, consequently decided to change the Tier-2 distribution model such that only datasets requested by analysis jobs are replicated. Programmatic replication of large amounts of ESDs was stopped; now only datasets (of all categories) that are explicitly requested by analysis jobs are replicated from the Tier-1 center at BNL to the Tier-2 centers in the U.S. Since June 2010, when the initial version of a PanDA-steered dynamic data placement (PD2P) system was deployed, we have observed a healthy growth of the data volume on disk and no longer face situations where actually needed datasets cannot be accommodated.
Figure 10: Cumulative evolution for DATADISK at Tier-2 centers in the U.S.
Figure 10 clearly shows the exponential growth of the disk space utilization in April and May
2010 up to the point in June when the dynamic data placement system was introduced. Since then the data volume on disk has remained almost constant, despite the exponential growth of integrated luminosity and of the data volume of interest for analysis. Meanwhile the usage of PD2P, which is fully
transparent to users, was extended to all regions that provide computing resources to ATLAS.
Disk capacities as provisioned at the Tier-2 centers have evolved from a kind of archival storage
to a well-managed caching system. As a result of recent discussions it was decided to further develop the system by including the Tier-1 centers and evolve the distribution model such that it is
no longer based on a strict hierarchy but allows direct access across all present (ATLAS cloud)
hierarchy levels. Also part of a future model is remote access to files and fractions thereof, rather
than having to rely on formal dataset subscriptions via the ATLAS distributed data management
system prior to getting access to the data. The initial implementation of such a system will be
based on xrootd, which is a storage management solution that has existed for quite some time
and is locally used by high-energy and nuclear physics communities. Based on a CMS proposal
xrootd can be further exploited in the context of geographically distributed (federated) storage
systems where the software provides a scalable mechanism for global data discovery, presenting
a variety of different storage management systems, as they are deployed at sites, to the application as if they were the same.
With PD2P, caching is performed at the dataset level, a coarse-grained level. PD2P caching relies on predicting future (re)use of cached data; the data cache triggered by initial usage is not itself used unless subsequent reuse takes place at the cache site. Nonetheless PD2P represents a large improvement in terms of disk usage efficiency and manageability over policy-based pre-placement.
The federated xrootd system will extend ATLAS’ use of caching in new directions. In addition to
providing a means of transparently accessing data that is not local to the site but in a remote
xrootd storage area, it will also support local caching of data that is remotely accessed at the file
level, such that e.g. a Tier-3 center builds up a local cache of data files being used at the site. The
caching in this scheme increases the granularity from the dataset level of PD2P to the file level.
ATLAS intends to implement and evaluate a still more fine-grained approach to caching, below
the file level. The approach takes advantage of a ROOT-based caching mechanism as well as recent efficiency gains in ROOT I/O implemented by the ROOT team that minimize the number of transactions with storage during data read operations, which, particularly over the Wide Area Network (WAN), are very expensive in terms of latency. It also utilizes development work performed by CERN-IT on a custom xrootd server which operates on the client side to direct ROOT I/O requests to remote xrootd storage, transparently caching at the block level data that is retrieved over the WAN and passed on to the application. Subsequent local use of the data hits the cache rather than the WAN. This benefits not only the latency seen by a client utilizing cached data, but also the source site, which is freed from the need to serve already-delivered data. In addition, caching obviously saves network capacity.
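A minimal PyROOT sketch of this access pattern, reading a remote file over xrootd with ROOT's TTreeCache enabled, is shown below. The storage URL and tree name are placeholders, and ROOT built with xrootd support is assumed; the block-level client-side caching described above would sit transparently underneath such reads.

```python
# Minimal PyROOT sketch of WAN-friendly remote reads over xrootd with ROOT's
# TTreeCache enabled. The URL and tree name are placeholders.
import ROOT

f = ROOT.TFile.Open("root://xrootd.example.org//atlas/dataset/file.root")
tree = f.Get("CollectionTree")          # placeholder tree name

tree.SetCacheSize(30 * 1024 * 1024)     # 30 MB TTreeCache: prefetch baskets in bulk
tree.AddBranchToCache("*", True)        # cache all branches read by this job

for i in range(tree.GetEntries()):
    tree.GetEntry(i)                    # reads are served from the cache after prefetch

f.Close()
```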
Deriving benefit from fine-grained caching depends upon re-use of the cache. As one approach
to maximizing re-use, PanDA’s existing mechanism for brokering jobs to worker nodes on the
basis of data affinity will be applied to this case, such that jobs are preferentially brokered to
sites which have run jobs utilizing the same input files.
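The brokering idea can be sketched as follows; this is only an illustration of the data-affinity heuristic described above, with hypothetical names, and is not PanDA's actual brokerage code.

```python
# Simplified sketch of brokering by data affinity: prefer the site whose local
# cache already holds the largest fraction of a job's input files.

def choose_site(job_input_files, site_caches):
    """site_caches maps site name -> set of files known to be cached there."""
    def cached_fraction(site):
        cached = site_caches[site] & set(job_input_files)
        return len(cached) / float(len(job_input_files))
    # Highest cached fraction wins; ties fall back to alphabetical order.
    return max(sorted(site_caches), key=cached_fraction)

caches = {
    "SiteA": {"f1.root", "f2.root"},
    "SiteB": {"f2.root", "f3.root", "f4.root"},
}
print(choose_site(["f2.root", "f3.root"], caches))   # -> SiteB
```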
Non-PanDA-based applications using data at the cache site will also automatically benefit from the cache. The approach will integrate well with the federated xrootd system; it adds an automatic local caching capability to the federation. It may also be of interest in the context of serving data to applications running in commercial clouds, where the expense of data import and in-cloud storage could make fine-grained caching efficiencies valuable.
Once integrated into the OSG Distributed High Throughput Computing (DHTC) services, some
or all of the previously described capabilities will be available to the entire spectrum of scientific
communities served by the OSG. This will be accomplished through the Virtual Data Toolkit
(VDT) software infrastructure.
To support users running data analysis, ATLAS has built a powerful system for computing activities on top of three major grid infrastructures: the Open Science Grid (OSG) in the U.S., EGI in Europe and Asia, and ARC in the Nordic countries. As expected, with data finally arriving, physicists need dedicated resources for analysis activities. In contrast to the existing grid infrastructure, there is a strong need to provide users with data control and quasi-interactive, high-performance data access. U.S. ATLAS, in close collaboration with OSG, has designed and implemented a Tier-3 solution that is targeted to provide efficient and manageable analysis computing at each
member institution. For most of the ~40 sites in the U.S., only a small fraction of a physicist's or student's time can be devoted to computing support. Transformative technologies have therefore been chosen and integrated with the existing ATLAS tools. The result is a site that is substantially simpler to maintain and that is essentially operated by client tools and extensive use of caching technologies. The most promising technologies ATLAS is using include xrootd for distributed storage management and the CERN Virtual Machine File System (CVMFS) for ATLAS software distribution and conditions data access.
Open Science Grid has organized monthly Tier-3 liaison meetings between several members of the OSG facilities, U.S. ATLAS, and U.S. CMS. During these meetings, topics discussed include cluster management, site configuration, site security, storage technology, site design, and experiment-specific Tier-3 requirements.
U.S. ATLAS (contributing to ATLAS as a whole) relies extensively on services and software
provided by OSG, as well as on processes and support systems that have been developed and
implemented by OSG. OSG has become essential for the operation of the worldwide distributed
ATLAS computing facility and the OSG efforts have aided the integration with WLCG partners
in Europe and Asia. The derived components and procedures have become the basis for support
and operation covering the interoperation between OSG, EGI, and other grid sites relevant to
ATLAS data analysis. OSG provides software components that are interoperable with European
ATLAS grid sites, including selected components from the gLite middleware stack such as client
utilities, and LHC Computing Grid File Catalog (LFC).
It is vital to the ATLAS collaboration that the present level of service continues uninterrupted for
the foreseeable future, and that all of the services and support structures upon which U.S. ATLAS relies today are properly maintained and have a clear continuation strategy.
The Blueprint working group within OSG, tasked with developing a coherent middleware architecture, has made significant progress that benefits ATLAS. Important ingredients include "native packaging" of middleware components, which can now be deployed as well-managed RPMs instead of a collection of Pacman modules, and the CREAM Computing Element (CE), which addresses scaling and resilience issues observed with the Globus Toolkit 2-based technology ATLAS has used so far.
Middleware deployment support provides an essential and complex function for U.S. ATLAS
facilities. For example, support for testing, certifying and building a foundational middleware for
production and distributed analysis activities is a continuing requirement, as is the need for coordination of the roll out, deployment, debugging and support for the middleware services. In addition, some level of preproduction deployment testing has been shown to be indispensable. This
testing is currently supported through the OSG Integration Test Bed (ITB) providing the underlying grid infrastructure at several sites along with a dedicated test instance of PanDA, the ATLAS
Production and Distributed Analysis system. These elements implement the essential function of
validation processes that accompany incorporation of new and new versions of grid middleware
services into the Virtual Data Toolkit (VDT), which provides a coherent OSG software component repository. U.S. ATLAS relies on the VDT and OSG packaging, installation, and configuration processes to provide a well-documented and easily deployable OSG software stack.
U.S. ATLAS greatly benefits from OSG’s Gratia accounting services, as well as the information
services and probes that provide statistical data about facility resource usage and site information
passed to the application layer and to WLCG for review of compliance with MoU agreements.
An essential component of grid operations is operational security coordination. The coordinator
provided by OSG has good contacts with security representatives at the U.S. ATLAS Tier-1 center and Tier-2 sites and is closely connected to experts representing grid computing resources
outside the U.S. Thanks to activities initiated and coordinated by OSG a strong operational security community is in place in the U.S., ensuring that security problems are well coordinated
across the distributed infrastructure.
In the area of middleware extensions, U.S. ATLAS continued to benefit from the OSG’s support
for and involvement in the U.S. ATLAS-developed distributed processing and analysis system
(PanDA) layered over the OSG’s job management, storage management, security and
information system middleware and services. PanDA provides a uniform interface and utilization
model for the experiment's exploitation of the grid, extending across OSG, EGI and Nordugrid. It
is the basis for distributed analysis and production ATLAS-wide, and is also used by OSG as a
WMS available to OSG VOs, as well as a PanDA based service for OSG Integrated Testbed
(ITB) test job submission, monitoring and automation. This year the OSG’s WMS extensions
program continued to provide the effort and expertise on PanDA security that has been essential
to establish and maintain PanDA’s validation as a secure system deployable in production on the
grids. In particular PanDA’s glexec-based pilot security system developed in this program was
brought to production readiness through continued testing in the U.S. and Europe throughout the
year.
Another important extension activity during the past year was in WMS monitoring software and
information systems. During the year ATLAS and U.S. ATLAS continued the process of
merging the PanDA/US monitoring effort with CERN-based monitoring efforts, together with
the ATLAS Grid Information System (AGIS) that integrates ATLAS-specific information with
the grid information systems. The agreed common approach utilizes a python apache service
serving json-formatted monitoring data to rich jQuery-based clients. A prototype PanDA
monitoring infrastructure based on this approach was initiated last year and further developed
this year; elements of it are now integrated in production PanDA monitoring and the
development approach and architecture has been validated as our evolution path. In light of
Oracle scaling limitations first seen last year, a program investigating alternative back end DB
technologies was begun this year, particularly for the deep archive of job and file data. This
archive shows the most severe scaling limitations and has access patterns amenable to so-called
‘noSQL’ storage approaches, in particular to the highly scalable key-value pair based systems
such as Cassandra and Hive that have emerged as open source software from web behemoths
such as Google, Amazon and Facebook. Cassandra has been adopted as the basis for a prototype
job data archive. During the year a Cassandra testbed was established at BNL, schema designs
were developed and tested, and a full year of PanDA job data was published to the system for
performance evaluations, which look very promising. Results will be presented at a WLCG
database workshop at CERN in June 2011.
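To illustrate the serving pattern referred to above (a Python service behind Apache returning JSON-formatted monitoring data to jQuery-based clients), the following is a minimal sketch of a WSGI application. The payload and names are fabricated for illustration and do not reflect the actual PanDA monitoring schema or deployment.

```python
# Minimal sketch of the serving pattern described above: a Python WSGI application
# (deployable behind Apache/mod_wsgi) returning job-summary data as JSON for a
# JavaScript client to render. The payload is a hard-coded illustration only.
import json

def application(environ, start_response):
    summary = {
        "site": "ANALY_EXAMPLE",          # placeholder site name
        "jobs": {"running": 1250, "queued": 430, "failed": 17},
    }
    body = json.dumps(summary).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]

if __name__ == "__main__":
    # Stand-alone test server for local development only.
    from wsgiref.simple_server import make_server
    make_server("localhost", 8080, application).serve_forever()
```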
2.2 CMS
During 2010 CMS transitioned from commissioning the detector to producing its first physics results across the entire range of physics topics in the CMS physics program. Several tens of scientific papers have been published in peer-reviewed journals, including Physical Review Letters, Physics Letters B, the European Physical Journal C, and the Journal of High Energy Physics.
Among the published results are already some first surprises, like the first "Observation of Long-Range, Near-Side Angular Correlations in Proton-Proton Collisions"; others are searches for new physics that already exceed the sensitivity reached by previous generations of experiments; still others, including the first observation of top pair production at the LHC, are major milestones that measure the cross sections for the dominant Standard Model background processes to much of the ongoing, as well as future, new physics searches.
By the middle of June 2011, the LHC had delivered more than 1/fb of integrated luminosity. CMS thus expects to present results at EPS at the end of July 2011 based on this data sample, a factor of 30 more data than the published 2010 results were based on. This should make for a very exciting summer conference season.
Computing has proven to be the enabling technology it was designed to be, providing an agile environment for scientific discovery. U.S. CMS resources available via the Open Science Grid have been particularly important to the scientific output of the CMS experiment. The seven Tier-2 sites are among the ten most heavily used Tier-2 sites globally, as shown in Figure XX. Figure YY shows the number of pending jobs versus time. During the peak usage times, the U.S. sites also have both the most running and the most pending jobs, indicating that we are at this point resource limited during peak times. In terms of organized production, which includes reprocessing as well as simulation, Figure ZZ shows that several of the U.S. Tier-2s support the same number of running jobs as some of the global Tier-1 centers.
With regard to data transfer volume, more than 5 PB (3 PB) of data was received by (sourced from) the U.S. Tier-2 centers during this period. In comparison, the FNAL Tier-1 center received (sourced) slightly less than 5 PB (7 PB) of data during the same period. To put this in perspective, the U.S. Tier-1 and Tier-2s together sent and received more data than all non-U.S. Tier-1s combined.
The U.S. leadership position within CMS, as indicated by these metrics, is attributable to the superior reliability and agility of U.S. sites. We host a complete copy of all core data samples distributed across the seven US Tier-2 sites, and due to the excellent performance of the storage infrastructures, we are able to refresh data quickly. It is thus not uncommon that data becomes available first at U.S. sites, attracting time-critical data analysis to those sites.
The Open Science Grid has been a significant contributor to this success by providing critical computing infrastructure, operations, and security services. These contributions have allowed U.S. CMS to focus experiment resources on being prepared for analysis and data processing, by saving effort in areas provided by OSG. OSG provides a common set of computing infrastructure services on top of which CMS, with development effort from the U.S., has been able to build a reliable processing and analysis framework that runs on the Tier-1 facility at Fermilab, the project-supported Tier-2 university computing centers, and opportunistic Tier-3 centers at universities. There are currently 27 Tier-3 centers registered with the CMS data grid in the U.S.; 20 of them provide additional simulation and analysis resources via the OSG. The remainder are universities that receive CMS data via the CMS data grid, using an OSG storage element API, but do not (yet) make any CPU cycles available to the general community. OSG and US CMS work closely together to ensure that these Tier-3 centers are fully integrated into the globally distributed computing system that CMS science depends on.
In addition to common interfaces, OSG provides the packaging, configuration, and support of the storage services. Since the beginning of OSG, the operation of storage at the Tier-2 centers has improved steadily in reliability and performance. OSG plays a crucial role here for CMS by operating a clearinghouse and point of contact between the developers and the sites that deploy and operate this technology. In addition, OSG fills gaps left open by the developers in the areas of integration, testing, and tools that ease operations.
OSG has also been crucial in ensuring that U.S. interests are addressed in the WLCG. The U.S. represents a large fraction of the collaboration, both in participants and in capacity, but a small fraction of the sites that make up the WLCG. OSG is able to provide a common infrastructure for operations, including support tickets, accounting, availability monitoring, interoperability, and documentation. Now that CMS is taking data, sustainable security models and regular accounting of available and used resources are crucial. The common accounting and security infrastructure and the personnel provided by OSG represent significant benefits to the experiment, with the teams at Fermilab and the University of Nebraska providing development and operations support, including the reporting and validation of accounting information between OSG and the WLCG.
In addition to these general statements, we would like to point to two specific developments that have become increasingly important to CMS within the last year. Within the last two to three years, OSG developed the concept of “Satellite projects” and the notion of an “ecosystem” of independent technology projects that enhance the overall national computing infrastructure in close collaboration with OSG. CMS is starting to benefit from this concept, as it has stimulated close collaboration with computer scientists on a range of issues including 100 Gbps networking, workload management, cloud computing and virtualization, and High Throughput Parallel Computing, which we expect will lead to multi-core scheduling as the dominant paradigm for CMS in a few years’ time. The existence of OSG as a “collaboratory” allows us to explore these important technology directions in ways that are much more cost effective, and more likely to be successful, than if we were pursuing these new technologies in a narrow, CMS-specific context.
Finally, within the last year, we have seen increasing adoption of technologies and services originally developed for CMS. Most intriguing is the deployment of glideinWMS as an OSG service,
adopted by a diverse set of customers including structural biology, nuclear physics, applied
mathematics, chemistry, astrophysics, and CMS data analysis. A single instance of this service is
jointly operated by OSG and CMS at UCSD for the benefit of all of these communities. OSG
developed a Service Level Agreement that is now being reviewed for possible adoption in Europe as well. Additional instances are operated at FNAL for the Tevatron Run II experiments, MINOS, and CMS data reprocessing at Tier-1 centers.
Figure 11: OSG CPU hours (115M total) used by CMS over 12 months, color-coded by facility.
Figure 12: Average number of running analysis jobs per week by CMS worldwide
Figure 13: Average number of pending analysis jobs per week by CMS worldwide
Figure 14: Average number of running production jobs per week by CMS worldwide
2.3 LIGO
LIGO continues to leverage the Open Science Grid for opportunistic computing cycles associated with its grid-based Einstein@Home application, known as Einstein@OSG. This application is one of several in use for an “all-sky” search for gravitational waves of a periodic nature attributed to elliptically deformed pulsars. Such a search requires enormous computational resources to fully exploit the science content available within LIGO’s vast datasets. Volunteer and opportunistic computing based on BOINC (the Berkeley Open Infrastructure for Network Computing) has been leveraged to utilize as many computing resources worldwide as possible. Since the Einstein@OSG code was ported to the Open Science Grid a little over two years ago, steady advances in code performance, reliability, and overall deployment onto the Open Science Grid have been demonstrated. OSG has routinely ranked in the top five computational providers for this LIGO analysis worldwide. This year the Open Science Grid has provided more than 18 million CPU-hours toward this search for pulsar signals.
Figure 15: Opportunistic usage of the OSG by LIGO’s grid based Einstein@Home application for
the current year.
This year has also seen development effort on a variation of the search for gravitational waves from pulsars, with the porting of the “PowerFlux” application onto the Open Science Grid. This is also a broadband search, but it uses a power-averaging scheme to cover a large region of the sky over a broad frequency band more quickly. Its computational needs are smaller than those of the Einstein@Home application, at the expense of lower signal resolution. The code is currently being wrapped to provide better monitoring in a grid environment, where remote login is not supported.
One of the most promising sources of gravitational waves for LIGO is the inspiral of a binary system of compact black holes and/or neutron stars, which emits gravitational radiation leading to the ultimate coalescence of the pair. The binary inspiral data analyses typically involve working with tens of terabytes of data in a single workflow. Collaborating with the Pegasus Workflow Planner developers at USC-ISI, LIGO continues to identify changes to both Pegasus and the binary inspiral workflow codes to use the OSG and its emerging storage technology more efficiently, since data must be moved from LIGO archives to storage resources near the worker nodes on OSG sites.
One area of intense focus this year has been the understanding, and integration into workflows, of the Storage Resource Management (SRM) technologies used at OSG Storage Element (SE) sites to house the large volumes of data used by the binary inspiral workflows, so that worker nodes running the binary inspiral codes can access the LIGO data effectively. The SRM-based Storage Element established on the LIGO Caltech OSG integration testbed site is being used as a development and test platform to get this effort under way without impacting OSG production facilities. Using Pegasus for the workflow planning, DAGs for the binary inspiral data analysis using of order ten terabytes of LIGO data have been run successfully on three production sites.
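To illustrate how such a workflow is expressed for Pegasus planning, the following is a minimal sketch only, written against the Pegasus 3.x DAX3 Python API; the file names, archive URL, site handle, and executable name are hypothetical stand-ins, and the real binary-inspiral DAX generators are part of the LIGO analysis pipelines.

    # Minimal sketch of an abstract workflow (DAX) for Pegasus planning, assuming the
    # Pegasus 3.x DAX3 Python API.  File names, the archive URL, the site handle, and
    # the executable name are hypothetical.
    from Pegasus.DAX3 import ADAG, Job, File, Link, PFN

    dax = ADAG("inspiral-sketch")

    # One input frame file; its registered physical location lets Pegasus stage it
    # from the archive to storage near the worker nodes at plan time.
    frame = File("H-H1_EXAMPLE-968000000-4096.gwf")
    frame.addPFN(PFN("gsiftp://ligo-archive.example.org/frames/H-H1_EXAMPLE-968000000-4096.gwf",
                     "ligo-archive"))
    dax.addFile(frame)

    triggers = File("inspiral-triggers.xml")

    job = Job("inspiral_analysis")   # executable name resolved via the transformation catalog
    job.addArguments("--frame", frame, "--output", triggers)
    job.uses(frame, link=Link.INPUT)
    job.uses(triggers, link=Link.OUTPUT, transfer=True)
    dax.addJob(job)

    with open("inspiral-sketch.dax", "w") as f:
        dax.writeXML(f)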
Performance studies this year suggest that glide-in technologies can greatly reduce the total run time of these large workflows, which comprise tens of thousands of jobs. Pegasus, in conjunction with its Corral glide-in features, has produced further gains in the ability to port and effectively run a complex LIGO data analysis workflow, originally designed for the LIGO Data Grid, on the Open Science Grid; the two environments are sufficiently similar to make this possible, but sufficiently different to require detailed investigation and development to reach the desired science-driven goals.
LIGO continues working closely with the OSG Security team, DOE Grids, and ESnet to evaluate
the implications of its requirements on authentication and authorization within its own LIGO Data Grid user community and how these requirements map onto the security model of the OSG.
2.4 ALICE
The ALICE experiment at the LHC relies on a mature grid framework, AliEn, to provide computing resources in a production environment for the simulation, reconstruction and analysis of
physics data. Developed by the ALICE Collaboration, the framework is fully operational with
sites deployed at ALICE and WLCG Grid facilities worldwide. During 2010, the ALICE USA collaboration deployed significant compute and storage resources in the US, anchored by new Tier-2 centers at LBNL/NERSC and LLNL. These resources, accessible via the AliEn grid framework, are being integrated with OSG to provide accounting and monitoring information to ALICE and the WLCG while allowing unused cycles to be used by other nuclear physics groups.
In early 2010, the ALICE USA Collaboration’s Computing plan was formally adopted. The plan
specifies resource deployments at both the existing NERSC/PDSF cluster at LBNL and the
LLNL/LC facility, and operational milestones for meeting ALICE USA’s required computing
contributions to the ALICE experiment. A centerpiece of the plan is the integration of these resources with the OSG in order to leverage OSG capabilities for accessing and monitoring distributed compute resources. Milestones for this work included: completion of more extensive
scale-tests of the AliEn-OSG interface to ensure stable operations at full ALICE production
rates, establishment of operational OSG resources at both facilities, and activation of OSG reporting of resource usage by ALICE to the WLCG. During this past year, with the support of
OSG personnel, we have met most of the goals set forth in the computing plan.
NERSC/PDSF has operated as an OSG facility for several years and was the target site for the
initial development and testing of an AliEn-OSG interface. With new hardware deployed for
ALICE on PDSF by July 2010, a new set of scaling tests was carried out, demonstrating that the AliEn-OSG interface can sustain the job submission rates and steady-state job occupancy required by the ALICE team. Since about mid-July 2010, ALICE has run production at PDSF through the OSG CE interface with a steady job concurrency of about 300 or more jobs, consistent with the computing plan.
Figure 16: Cumulative cpu-hours delivered since August 2010 from the NERSC/PDSF and
LLNL/LC facilities to ALICE as measured by OSG accounting records. The LLNL/LC facility,
operational since September of 2010, was integrated with OSG accounting in December of 2010.
During the fall of 2010, a small OSG-ALICE task force was reinstated to facilitate further integration with OSG. Work in the group focused on the ALICE need for resource utilization to be reported by OSG to the WLCG. This work has included performing crosschecks on the accounting records reported by the PDSF OSG site, as well as developing additional tools needed to deploy OSG usage reporting at the LLNL/LC facility. As a result of these efforts, the LLNL/LC facility began sending job information to OSG in December 2010, and both facilities now report accounting records to the WLCG as part of normal OSG operations. Cumulative CPU-hours recorded by OSG from the two sites are shown in Figure 16 and illustrate the initial deployment of resource reporting at LLNL/LC in December.
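As an illustration only of the kind of crosscheck performed, a minimal comparison of monthly wall-time totals might look like the sketch below; the file names, CSV layout, and tolerance are hypothetical and are not the actual Gratia or WLCG report formats.

    # Illustrative only: compare monthly wall-hour totals reported through OSG accounting
    # with the corresponding WLCG report and flag months that disagree beyond a tolerance.
    # Input file names and the CSV layout (month,wall_hours) are assumptions.
    import csv

    def load_totals(path):
        """Read a CSV of (month, wall_hours) into a dict keyed by month."""
        with open(path, newline="") as f:
            return {row["month"]: float(row["wall_hours"]) for row in csv.DictReader(f)}

    def crosscheck(osg_csv, wlcg_csv, tolerance=0.05):
        osg, wlcg = load_totals(osg_csv), load_totals(wlcg_csv)
        for month in sorted(set(osg) | set(wlcg)):
            a, b = osg.get(month, 0.0), wlcg.get(month, 0.0)
            if max(a, b) > 0 and abs(a - b) / max(a, b) > tolerance:
                print("%s: OSG=%.0f WLCG=%.0f differ by more than %.0f%%"
                      % (month, a, b, 100 * tolerance))

    # Example usage (hypothetical files): crosscheck("pdsf_osg_2010.csv", "wlcg_report_2010.csv")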
ALICE continues to work with OSG on several issues that target optimal use of both ALICE and OSG-accessible resources. The ALICE site at LLNL/LC is currently working on a full OSG installation that will eventually include operation on the SLURM batch system preferred at LLNL/LC, which OSG does not currently support, and will allow other OSG VOs opportunistic access to those resources. In addition, the OSG software team is working on the integration of the CREAM-CE as an option in the OSG software stack. The ALICE USA team plans to participate in the integration tests of this software, since such an option would allow ALICE USA facilities to run with a job-submission and monitoring model identical to that of their European counterparts. Finally, the ALICE USA computing effort is evaluating modifications to the AliEn workflow that will allow ALICE to make opportunistic use of other OSG resources. We expect these efforts to continue over the next year.
2.5 D0 at Tevatron
The D0 experiment continues to rely heavily on OSG infrastructure and resources to meet its computing demands. D0 has successfully used OSG resources for many years and plans to continue this very successful relationship into the foreseeable future. This usage has resulted in a tremendous science publication record (Figure 17), including contributions to improved limits on the Higgs mass exclusion, as shown in Figure 18. D0 produced 47 different results for the Winter/Spring 2011 conferences; see http://www-d0.fnal.gov/Run2Physics/ResultsWinter2011.html
Figure 17: Number of publications from the D0 experiment versus year.
Figure 18: The latest combined D0 and CDF results on the observed and expected 95% confidence level upper limits on the ratio to the Standard Model cross section, as a function of the Higgs mass.
All D0 Monte Carlo simulation is generated at remote sites, with OSG continuing to be a major contributor. During the past year, OSG sites simulated approximately 600 million events for D0 (almost 100 million more than were produced in the previous year), approximately one third of all production. The rate of production has nearly leveled off over the past year, as almost all major sources of inefficiency have been resolved and D0 continues to use OSG resources very efficiently. Changes in job preemption policy at numerous sites, the continued use of automated job submission, and the use of resource selection have allowed D0 to use OSG resources opportunistically to produce large samples of Monte Carlo events efficiently. D0 continues to use approximately 30 OSG sites regularly in its Monte Carlo production. Figure 19 shows a snapshot of idle and running Monte Carlo jobs on OSG on a typical day, showing that many different sites are being utilized by D0.
The total number of D0 OSG MC events produced over the past several years has exceeded 1.6
billion events (Figure 20).
Over the past year, the average number of Monte Carlo events produced per week by OSG has remained approximately constant. Since we use the computing resources opportunistically, it is notable that, on average, we can maintain an approximately constant rate of nearly 15 million MC events/week (Figure 21). In one week of April 2011, 26 million events were produced, a record for D0 MC production. Any dips in OSG production are now due only to D0 switching to new software releases, which temporarily stops our requests to OSG. Over the past year D0 has been able to obtain the necessary opportunistic resources to meet our Monte Carlo needs even though the LHC also has high demand; we have achieved this by continuing to improve our efficiency and by adding additional resources when they become available. The Tevatron program will end in September 2011, by which time D0 is expected to have accumulated nearly 11 fb-1 of data. It will take many years to analyze this huge data set, and Monte Carlo production will continue to be in high demand, so although the Tevatron accelerator will shut down, D0 will continue to need OSG resources for many more years.
D0 OSG MC jobs: current job distribution over all OSG queues (from Condor)

Running   Idle    Site
  324        0    antaeus.hpcc.ttu.edu
   30      392    ce.grid.unesp.br
   35       65    ce01.cmsaf.mit.edu
    1        5    cit-gatekeeper.ultralight.org
    1      303    cmsgrid01.hep.wisc.edu
    2     1685    cmsosgce3.fnal.gov
  831       48    condor1.oscer.ou.edu
  478        0    d0cabosg1.fnal.gov
 1001      563    d0cabosg2.fnal.gov
  371        0    fermigridosg1.fnal.gov
   31      445    ff-grid.unl.edu
   38      445    ff-grid3.unl.edu
  180      311    fnpcosg1.fnal.gov
   10       53    gk01.atlas-swt2.org
   10      133    gk04.swt2.uta.edu
   50        0    gluskap.phys.uconn.edu
   15       90    gridgk01.racf.bnl.gov
    1      118    gridgk02.racf.bnl.gov
  456      457    msu-osg.aglt2.org
   49      109    nys1.cac.cornell.edu
    2        0    osg-ce.sprace.org.br
   16       68    osg-gk.mwt2.org
   17      290    osg-gw-2.t2.ucsd.edu
   12        0    osg.rcac.purdue.edu
   64      187    osg1.loni.org
   81      332    ouhep0.nhn.ou.edu
   19      164    pg.ihepa.ufl.edu
    5      280    red.unl.edu
  210        0    tier3-atlas1.bellarmine.edu
    4        0    umiss001.hep.olemiss.edu

Total running jobs: 4344
Total idle jobs: 6543
Figure 19: Current D0 jobs distribution from Condor
Figure 20: Cumulative number of D0 MC events generated by OSG during the past year.
Figure 21: Number of D0 MC events generated per week by OSG during the past year. Although
D0 uses opportunistic computing for its Monte Carlo production, a constant rate of nearly 15 million events/week has been achieved.
Two years ago, D0 was first able to use LCG resources at a significant level to produce Monte Carlo events, primarily because LCG began to use some of the infrastructure developed by OSG. Because LCG was able to adopt this OSG infrastructure easily, D0 can produce a significant number of Monte Carlo events on LCG. Since the OSG infrastructure is robust, LCG production has been very steady at approximately 5 million events/week, giving nearly 250 million Monte Carlo events from LCG during the past year. The ability of other grids to use OSG infrastructure has proved to be very beneficial.
The primary processing of D0 data continues to be run using OSG infrastructure. One of the most important goals of the experiment is to have the primary processing keep up with the rate of data collection; this is critical so that the experiment can quickly find any problems in the data and avoid building up a backlog. Typically D0 keeps up with primary processing by reconstructing 6-8 million events/day (Figure 22). However, when the accelerator collides at very high luminosities, it is difficult to keep up using our standard resources. Since the computing farm and the analysis farm share the same infrastructure, D0 is able to move analysis computing nodes to primary processing to improve its daily processing of data, as it has done on more than one occasion. This flexibility is a tremendous asset and allows D0 to use its computing resources efficiently. Over the past year D0 has reconstructed nearly 1.6 billion events on OSG facilities. To achieve such high throughput, much work has been done to improve the efficiency of primary processing. In almost all cases, only 1-2 job submissions are needed to complete a job, even though the jobs can take several days to finish (see Figure 23).
Figure 22: Cumulative daily production of D0 data events processed by OSG infrastructure for data collected after the 2010 shutdown. The flat areas correspond to times when the accelerator/detector was down for maintenance so no events needed to be processed.
OSG resources continue to allow D0 to meet its computing requirements in both Monte Carlo production and data processing. This has directly contributed to D0 publishing 31 papers in 2010/2011 (with 13 additional papers submitted for publication); see http://www-d0.fnal.gov/d0_publications/.
Figure 23: Submission statistics for D0 primary processing in May 2011. In almost all cases, only 1-2 job submissions are required to complete a job, even though jobs can run for several days.
2.6 CDF at Tevatron
The CDF experiment produced 42 new results for Summer 2010 followed by a further 37 new
results for Winter 2011, using OSG infrastructure and resources. Included in these results was
the 95% CL exclusion of a Standard Model Higgs boson with mass between 158 and 168 GeV/c2
(Figure 24).
Figure 24: Upper limit plot of recent CDF search for the Standard Model Higgs, March 2011 (not
2010 as labeled)
OSG resources support the work of graduate students, who are producing one thesis every other week, and the collaboration as a whole, which is submitting publications of new physics results at a rate of more than one per week. Forty-one publications were submitted in CY10 and 24 in the first half of CY11. A total of 560 million Monte Carlo events were produced by CDF in the last year, and most of this processing took place on OSG resources. CDF also used OSG infrastructure and resources to support the processing of raw data events. A major reprocessing has been under way to increase the b-tagging efficiency for improved sensitivity to a low-mass Higgs. The production output from this and normal processing was 6.6 billion reconstructed events over the last year, and in the same period 11.7 billion ntuple events were created. Detailed numbers of events and data volumes are given in Table 4 (total data since 2000) and Table 5 (data taken in the year to June 2011).
Table 4: CDF data collection since 2000

Data Type      Volume (TB)   # Events (M)   # Files
Raw Data       2061          13807          2366975
Production     3224          21397          2770627
MC             989           6632           1142643
Stripped-Prd   105           876            98728
MC Ntuple      508           7485           454747
Total          7946          118926         7761262
Table 5: CDF data collection for the year to June 2011

Data Type      Data Volume (TB)   # Events (M)   # Files
Raw Data       388                2354           433569
Production     1198               6622           805109
MC             111                561            125535
Stripped-Prd   16                 85             12226
Ntuple         435                12670          292934
MC Ntuple      141                1763           120289
Total          2288               49292          1832551
The OSG provides computing resources for the collaboration through two portals. The first, the North American Grid portal (NAmGrid), covers MC generation in an environment where software must be ported to the site and only Kerberos- or grid-authenticated access to remote storage is available for output. The second portal, CDFGrid, provides an environment that allows full access to all CDF software libraries and methods for data handling.
CDF operates the pilot-based Workload Management System (glideinWMS) as the submission method to remote OSG sites. Figure 25 shows the number of running jobs on NAmGrid and demonstrates steady usage of the facilities, while Figure 26, a plot of the queued requests, shows that there is large demand. CDF MC production is submitted to NAmGrid, where it makes use of OSG resources at CMS, CDF, and general-purpose Fermilab facilities, as well as at MIT.
A large resource provided by Korea at KISTI is in operation as a major Monte Carlo production resource with a high-speed connection to Fermilab for storage of the output. It also provides a cache that allows the data handling functionality to be exploited. The system was commissioned, and 10 TB of raw data were processed using SAM data handling with KISTI in the NAmGrid portal. Problems in commissioning were handled with great speed by the OSG team through the “campfire” room and through weekly VO meetings, and lessons learned to make commissioning and debugging easier were analyzed by the OSG group. KISTI is run as part of NAmGrid for MC processing when not being used for reprocessing.
Figure 25: Running jobs on NAmGrid
Figure 26: Waiting CDF jobs on NAmGrid, showing large demand.
Plots of running jobs and queued requests on CDFGrid are shown in Figure 27 and Figure 28. The very high demand for CDFGrid resources in the period leading up to the 2011 winter conference season is particularly noteworthy, with queues exceeding 100,000 jobs. The high use has continued since then, corresponding both to reprocessing activity and to analysis of the reprocessed data as it has become available. The reduced capacity at the start of the period, in early summer 2010, was due to an allocation of 15% of the CDFGrid resources for testing with SLF5. This testing period ended with the end of the summer conference season in August 2010, at which point all of the CDFGrid and NAmGrid resources were upgraded to SLF5 and this became the default. CDF raw data processing, ntupling, and user analysis have now been converted to SLF5.
Figure 27: Running CDF jobs on CDFGrid
Figure 28: Waiting CDF jobs on CDFGrid
In a new development that went live in June 2011, LCG sites in Europe that have been
providing resources for CDF are now being made available through a new portal, EuroGrid,
which uses OSG functionality through the UCSD glidein factory. CDF is very excited by the
prospect of being able to access these sites with the user protection afforded by the glidein system.
A number of issues affecting operational stability and efficiency have been pointed out in the past. Those that remain, along with solutions or requests for further OSG development, are noted here.
• Service level and security: Since April 2009, Fermilab has had a new protocol for upgrading Linux kernels with security updates. While the main core services can be handled with a rolling reboot, the data handling services still require approximately quarterly draining of queues for up to 3 days prior to reboots.
• Opportunistic computing/efficient resource usage: The preemption policy has not been revisited, and CDF has not tried to include any new sites because of issues that arose when commissioning KISTI. Monitoring showed the KISTI site to be healthy while glideins from glideinWMS were being “swallowed,” leaving an operational cleanup issue. This is being addressed by OSG.
• Management of database resources: Monte Carlo production placed a large load on the CDF database server from queries that could be cached. An effort to reduce this load was launched, and most queries were modified to use a Frontier server with Apache caching. This avoids the resource-management problem, provided Frontier servers are deployed with each site installation (a conceptual sketch of this caching pattern follows this list).
• Disk space on worker nodes: As CDF has recently moved to writing large (2-8 GB) files, the space available to running jobs on worker nodes has become an issue. KISTI was able to upgrade all of its worker nodes in June 2011 to provide 20 GB per slot, and OSG should take this into account in the recommendations it makes for site design.
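The following conceptual sketch illustrates the caching pattern behind the Frontier approach mentioned above: a read-only database query is expressed as an HTTP request so that a site-local cache shared by the worker nodes can answer repeated queries instead of the central database. The servlet URL, proxy host, parameter name, and query below are hypothetical and are not the real Frontier protocol or CDF schema.

    # Conceptual sketch only: a cacheable database query routed through a site HTTP cache.
    # The servlet URL, proxy host, parameter name, and SQL below are hypothetical.
    import urllib.parse
    import urllib.request

    FRONTIER_URL = "http://frontier.example.org:8000/cdf/Frontier"    # hypothetical servlet
    SITE_PROXY = {"http": "http://squid.example-site.edu:3128"}       # hypothetical site cache

    def cached_query(sql):
        """Fetch a read-only query result via the shared site proxy cache."""
        url = FRONTIER_URL + "?" + urllib.parse.urlencode({"query": sql})
        opener = urllib.request.build_opener(urllib.request.ProxyHandler(SITE_PROXY))
        with opener.open(url) as resp:
            return resp.read()

    # Many worker nodes issuing the identical request hit the cache, not the database.
    payload = cached_query("SELECT * FROM calib_runs WHERE run_number = 285970")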
The usage of OSG by CDF has been fruitful, and the ability to add large new resources such as KISTI, as well as more moderate resources, within a single job-submission framework has been extremely useful for CDF. The collaboration has produced significant new results in the last year through the processing of huge data volumes, and significant consolidation of the tools has occurred. In the next year, the collaboration looks forward to a bold computing effort in the push to see evidence for the Higgs boson, a task that will require further innovation in data handling and significant computing resources to reprocess the large quantities of Monte Carlo and data needed to achieve the desired improvements in tagging efficiencies. We look forward to another year with high publication rates and interesting discoveries.
2.7 Nuclear physics
STAR’s tenth year of data taking has brought new levels of data challenges, with the most recent year’s data matching the integrated data of the previous decade. Now operating at the petabyte scale, data mining and production have reached their maximum potential. Over 10 years of running, the RHIC/STAR program has seen data rates grow by two orders of magnitude, yet data production has kept pace and data analysis and science productivity have remained strong. In 2010, the RHIC program and Brookhaven National Laboratory earned recognition as number 1 for hadron collider research (http://sciencewatch.com/ana/st/hadron/institutions/).
To face the data challenge effectively, all raw simulations had previously been migrated to Grid-based operations. This year the migration has been expanded, with a noticeable shift toward the use of Cloud resources wherever possible. While Cloud resources had been of interest to STAR as early as 2007, our previous years’ reports noted multiple tests and a first trial usage of Cloud resources (Nimbus) in 2008/2009 at the approach of a major conference, absorbing additional workload stemming from a last-minute request. This mode of operation has continued, as the Cloud approach allows STAR to run what our collaboration has not been able to perform on Grid resources due to technical limitations (harvesting resources on the fly has been debated at length within STAR and judged an unreachable ideal for an experiment equipped with a complex software stack).
Grid usage remains restricted either to opportunistic use of resources for event generator-based production (a self-contained program that is easily assembled) or to non-opportunistic, dedicated site usage with a pre-installed software stack maintained by a local crew, which allows STAR’s complex workflows to run. Cloud resources, coupled with virtualization technology, permit relatively easy deployment of the full STAR software stack within the VM, allowing large simulation requests to be accommodated. Even more relevant for STAR’s future, recent tests successfully demonstrated that larger-scale real-data reconstruction is easily feasible. Cloud activities and development remain (with some exceptions) outside the scope and program of work of the Open Science Grid; one massive simulation exercise was partly supported by the ExTENCI satellite project.
STAR had planned to also run on and further test the GLOW resources after an initial, successfully reported usage via a Condor/VM mechanism. However, several alternative resources and approaches offered themselves. The Clemson model in particular appeared to allow faster convergence on, and delivery of, a needed simulation production in support of the Spin program component of RHIC/STAR. At a sustained scale of 1,000 jobs (peaking at 1,500 jobs) for three weeks, STAR clearly demonstrated that a full-fledged Monte Carlo simulation followed by a full detector-response simulation and track reconstruction was not only possible on the Cloud but of large benefit to our user community. With over 12 billion PYTHIA events generated, this production represented the largest PYTHIA event sample ever generated in our community. The usage of Cloud resources in this case expanded STAR’s resource capacity by 25% (compared to the resources available at BNL/RCF) and, for a typical student’s work, allowed a year-long wait for science results to be reduced to a few weeks. Typically, a given user at the RCF can claim about 50 job slots (the facility being shared by many users), while in this exploitation of Cloud resources all 1,000 slots were dedicated to a single task and one student. The sample represented a four-order-of-magnitude increase in statistics compared to other studies made in STAR, with a near-total elimination of the statistical uncertainties that would have reduced the significance of model interpretations. The results were presented at the Spin 2010 conference, where unambiguous agreement between our data and the simulation was shown. It is noteworthy that the resources were gathered in an opportunistic manner, as seen in Figure 29. We would like to acknowledge the help of our colleagues from Clemson, partly funded by the ExTENCI project.
Figure 29: Number of machines available to STAR (red), working machines (green), and idle nodes (blue) during opportunistic resource gathering at Clemson University. Within this period, the overlap of the red and green curves demonstrates that the submission mechanism allows immediate harvesting of resources as they become available.
An overview of STAR’s Cloud efforts and usage was presented at the OSG All-Hands meeting in March 2010 (see “Status of STAR’s use of Virtualization and Clouds”) and at the International Symposium on Grid Computing 2010 (“STAR’s Cloud/VM Adventures”). A further overview of activities was given at the ATLAS data challenge workshop held at BNL that same month, and finally a summary presentation was given at the CHEP 2010 conference in Taiwan in October (“When STAR Meets the Clouds – Virtualization & Grid Experience”). Based on usage trends and progress with Cloud usage and scalability, we project that 2011 will see workflows of the order of 10 to 100k jobs sustained as routine operation (see Figure 30).
Figure 30: Summary of our Cloud usage as a function of date. The rapid progression of the exploitation and usage indicates that a 10,000-job scale may be within reach in 2011.
From BNL we steered Grid-based simulation productions (essentially running on our NERSC resources); in total, STAR produced 4.8 million events, representing 254,200 CPU-hours of processing time, using the standard OSG/Grid infrastructure. During our usage of the NERSC resources we re-enabled the SRM data-transfer delegation mechanism, which allows a job to terminate and pass to a third-party service (SRM) the task of transferring the data back to the Tier-0 center, BNL. We had previously used this mechanism but had not integrated it into our regular workflow, since the network allowed immediate Globus-based file transfer with no significant additional time added to the workflow. However, due to performance issues with our storage cache at BNL (outside of STAR’s control and purview), the transfers were recently found, at times, to add a significant overhead to the total job time (a 41% impact). The use of a 0.5 TB cache on the NERSC side together with the SRM delegation mechanism mitigated these delays.
In addition to NERSC, large simulation event generations were performed on the CMS/MIT site for the study of the prompt photon cross section and double spin asymmetry. Forty-three million raw PYTHIA events were generated, of which 300 thousand were passed to GEANT as part of a cross-section/pre-selection speed-up (event filtering at generation), a mechanism designed in STAR to cope with large and statistically challenging simulations; cross-section-based calculations nevertheless require generating in a non-restrictive phase space and counting both the events that pass the filter and those that are rejected. Additionally, 20 billion PYTHIA events (1 million filtered and kept) were processed at that facility. The total resource usage was equivalent to about 100,000 CPU-hours over a total period of two months.
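For illustration, the bookkeeping behind filtering at generation is simple; the event counts below are taken from the text, while the cross-section value is a placeholder rather than a STAR number.

    # Illustrative arithmetic for event filtering at generation: the open phase space is
    # generated, the filter decision is counted for every event, and only the kept events
    # go on to the expensive GEANT step.  The cross section used here is a placeholder.
    n_generated = 20_000_000_000      # PYTHIA events thrown in the unrestricted phase space
    n_kept      = 1_000_000           # events passing the filter and sent on to GEANT

    filter_efficiency = n_kept / n_generated
    print("filter efficiency: %.2e" % filter_efficiency)

    # Because the denominator (all generated events) is counted explicitly, cross sections
    # are still normalized to the full phase space: the filtered sample corresponds to an
    # equivalent luminosity of n_generated / sigma for a process of cross section sigma.
    sigma_pb = 1.0e7                  # hypothetical process cross section, in pb
    equivalent_luminosity = n_generated / sigma_pb
    print("equivalent integrated luminosity: %.0f pb^-1" % equivalent_luminosity)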
STAR has also begun to test the resources provided by the Magellan project at NERSC, and aims to push a fraction of its raw datasets to the Magellan Cloud for immediate processing via a hybrid Cloud/Grid approach (a standard Globus gatekeeper will be used, as well as data transfer tools), while the virtual-machine capability will be leveraged to provision the resources with the most recent STAR software stack. The goal of this exercise is to provide fast-lane processing of data for the Spin working group, with events processed in near real time. While near real-time processing is already practiced in STAR, the run-support data production known as “FastOffline” currently uses local BNL/RCF resources and passes over a sample of the data only once. The use of Cloud resources would allow yet another workflow in support of the experiment’s scientific goals to be outsourced. This processing is also planned to be iterative, with each pass using more accurate calibration constants. We expect thereby to shorten the publication cycle of results from the proton+proton 500 GeV Run 11 data by a year. During the Clemson exercise, STAR designed a scalable database access approach which we will also use for this exercise: in essence, leveraging the capability of our database API, a “snapshot” of the database is created and uploaded to the virtual machine image, and a local database service is started. The need for a remote network connection is then removed, as is the possibility of thousands of processes overstressing the RHIC/BNL database servers. A fully ready database factory is available for exploitation. Final preparations of the workflow are under discussion; if successful, this modus operandi will represent a dramatic shift in the data processing capabilities of STAR, as raw data production will no longer be constrained to dedicated resources but will be possible on widely distributed Cloud-based resources.
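For illustration, the sketch below captures the “database snapshot inside the VM” pattern described above. It assumes a MySQL-backed snapshot, and the paths, port, and environment variable are hypothetical placeholders rather than the actual STAR configuration.

    # Conceptual sketch only: start a private database server on a read-only snapshot
    # shipped inside the VM image, so jobs never contact the central BNL servers.
    # The datadir path, port, and environment variable below are assumptions.
    import os
    import subprocess
    import time

    SNAPSHOT_DATADIR = "/opt/star/db-snapshot"   # hypothetical location inside the VM image
    LOCAL_PORT = 3316                            # hypothetical private port

    def start_local_db():
        """Launch mysqld on the snapshot and point jobs at localhost."""
        proc = subprocess.Popen([
            "mysqld_safe",
            "--datadir=%s" % SNAPSHOT_DATADIR,
            "--port=%d" % LOCAL_PORT,
            "--socket=/tmp/star-db.sock",
        ])
        time.sleep(10)  # crude wait; a production setup would poll until the socket is ready
        # Jobs read this (hypothetical) variable instead of the central server name.
        os.environ["STAR_DB_SERVER"] = "localhost:%d" % LOCAL_PORT
        return proc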
The OSG infrastructure has been heavily used to transfer and redistribute our datasets from the Tier-0 center (BNL) to our other facilities. Notably, the NERSC/PDSF center holds full sets of analysis-ready data (known as micro-DSTs) for the Year 9 data and, in the run-up to the Quark Matter 2011 conference (http://qm2011.in2p3.fr/node/12), we plan to make the Year 10 data available as well, allowing user analysis to be spread over multiple facilities (Tier-2 centers in STAR typically transfer only subsets of the data, targeting local analysis needs). Up to 7 TB of data can be transferred per day, and over 150 TB of data were transferred from BNL to PDSF in 2010.
As a collaborative effort between BNL and the Prague institution, STAR is in the process of deploying a data placement planner tool in support of its data redistribution and production strategy. The planner reasons about where the data should be taken from and moved to in order to achieve the fastest possible plan, whether the plan is for data placement or for a data production and processing turnaround. To get a baseline estimate of the transfer speed limit between BNL and PDSF, we have reassessed the link speed; the expected transfer profile is given in Figure 31. We expect this activity to reach completion by mid-2011.
Figure 31: Maximum transfer speed between BNL and the NERSC facility. The maximum is consistent with a point-to-point 1 Gb/s link.
All STAR physics publications acknowledge the resources provided by the OSG.
2.8 Intensity Frontier at Fermilab
MINOS use of OSG for data analysis increased in the last year to 5.6 million core-hours in 1 million submitted jobs, with over 4 million managed file transfers.
This computing resource, combined with 180 TB of dedicated BlueArc (NFS mounted) file storage, has allowed MINOS to move ahead with traditional and advanced analysis techniques.
MINOS uses a few hundred cores of offsite computing at collaborating universities for occasional Monte Carlo generation. These computing resources are critical as the experiment continues to
pursue the challenging analyses of anti-neutrino disappearance and nu-e appearance.
MINOS has just published the antineutrino disappearance result (arXiv:1104.0344, accepted by Phys. Rev. Lett. on 26 May 2011).
The MINERvA experiment started taking data in FY11 and is using OSG for all production reconstruction and most analysis activities. Activity is ramping up, with 1 million core-hours used in the last year. The NOvA near detector prototype is being commissioned, and NOvA has started using OSG resources for important simulation and code development work.
Other future experiments such as LBNE, Mu2e and g-2 are making use of OSG resources at
Fermilab, mainly opportunistically.
2.9 Astrophysics
The Dark Energy Survey (DES) used approximately 20,000 hours of OSG resources during the
period July 2010 – June 2011 to generate simulated images of galaxies and stars on the sky as
would be observed by the survey. We produced about 2 TB of simulated images, consisting of
both science and calibration data, for a set of galaxy cluster simulations for the DES weak
lensing working group, a set of 60 nights of supernova simulations for the DES supernova working group, and a set of 7 so-called “Gold Standard Night” (GSN) simulation data sets generated
to enable quick turnaround and debugging of the DES Data Management (DESDM) processing
pipelines as part of DES Data Challenge 6 (DC6). When produced as planned during Summer
2011, the full DC6 simulations will consist of 2600 mock science images, covering some 200
square degrees of the sky, along with nearly another 1000 calibration images needed for data
processing. Each 1-GB-sized DES image is produced by a single job on OSG and simulates the
300,000 galaxies and stars on the sky covered in a single 3-square-degree pointing of the DES
camera. The processed simulated data are also being actively used by the DES science working
groups for development and testing of their science analysis codes. Figure 32 shows an example
color composite image of the sky derived from these DES simulations.
Figure 32: Example simulated color composite image of the sky, here covering just a very small area compared to the full 5000 deg2 of sky that will be observed by the Dark Energy Survey. Most of
the objects seen in the image are simulated stars and galaxies. Note in particular the rich galaxy
cluster at the upper right, consisting of the many orange-red objects, which are galaxies that are
members of the cluster. The red, green, and blue streaks are cosmic rays, and have those colors as
they each appear in only one of the separate red, green, and blue images used to make this color
composite image.
2.10 Structural Biology
The SBGrid Consortium, operating from Harvard Medical School in Boston, supports the software needs of ~150 structural biology research laboratories, mostly in the US. The SBGrid Virtual Organization (SBGrid VO) extends the initiative to support the most demanding structural biology applications on resources of the Open Science Grid. Support by the SBGrid VO is extended to all structural biology groups, regardless of their participation in the Consortium. Within the last 12 months we have significantly increased our participation in the Open Science Grid, in terms of both utilization and engagement. Specifically:
• We have launched and successfully maintained a GlideinWMS grid gateway at Harvard Medical School. The gateway communicates with the Glidein Factory at UCSD and dispatches computing jobs to several computing centers across the US. This new infrastructure has allowed us to handle the increased computing workload reliably. Within the last 12 months our VO supported ~6 million CPU hours on the Open Science Grid, and we rank as the number 10 Virtual Organization in terms of overall utilization.
• SBGrid completed development of the Wide Search Molecular Replacement workflow. The paper describing its scientific impact was recently published in PNAS. Another paper presenting the underlying computing technology was presented at the 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers, co-located with SC10, the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
• The WS-MR portal was made publicly available in November 2010. Since its release we have supported 35 users. The majority of users were from US academic institutions (e.g., Yale University, Harvard, WUSTL, University of Tennessee, University of Massachusetts, Stanford, Immune Disease Institute, Cornell University, Caltech), but international research groups utilized the portal as well (including research groups from Canada, Germany, Australia, and Taiwan).
• We continue planning for integration of the central biomedical cluster at Harvard Medical School with Open Science Grid resources. The cluster has recently been funded for expansion from 1000 to 7000 cores (NIH S10 award), and the first phase of the upgrade is being completed in December.
• Our VO organized the Open Science Grid All-Hands Meeting, which took place in Boston in March 2011. We prepared the preliminary program agenda and participated in several planning discussions.
• We successfully maintained a specialized MPI cluster at the Immune Disease Institute (a Harvard Medical School affiliate) to support highly scalable molecular dynamics computations. A long-term molecular dynamics simulation was recently completed on this cluster and will complement a crystal structure that was recently determined in collaboration with the Walker laboratory at HMS (Nature, in press). The resource is also available to other structural biology groups in the Boston area.
• We hosted the OSG All-Hands meeting in March 2011 (see Section 5.2).
2.10.1 Wide Search Molecular Replacement Workflow
We have previously reported on our progress with a specialized macromolecular structure determination method, Wide Search Molecular Replacement (WSMR). This grid workflow has been fully implemented, and we have shown that the method can complement other structural biology techniques in a wide array of challenging cases. Results for this project have been published in PNAS (Stokes-Rees and Sliz, PNAS, December 2010), as well as at the 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers. After publication in PNAS, the portal was released to the structural biology community, and close to 100 users have utilized the infrastructure.
Figure 33: iSGTW article on an SBGrid science result
2.10.2 DEN Workflow
We have developed a second portal application that takes advantage of the OSG infrastructure (Figure 34). The application facilitates refinement of low-resolution X-ray structures.
Figure 34: Job submission interface for DEN computations
2.10.3 OSG Infrastructure
In the recent award cycle, SBGrid OSG utilization increased from 531,116 to 6,057,170 CPU hours (Figure 35). In terms of utilization we are the number nine Virtual Organization of the Open Science Grid (US ATLAS and CMS are the top two). Since December 2010 we have supported 73 WSMR runs on the Open Science Grid, submitted by 51 users from 47 institutions. Those WSMR production computations consumed over 3M CPU hours. The remaining hours were consumed by our team while validating the WSMR approach, by DEN computations, and by other minor workflows submitted to OSG by members of our community.
To accommodate the increased load we migrated our GlideIn scheduler infrastructure; the Open Science Grid support team was instrumental in accommodating this transition. We are now working with the Harvard Medical School Research Information Technology Group to integrate new cluster resources with the Open Science Grid.
Figure 35: SBGrid usage (in green)
2.10.4 Outreach: Facilitating Access to Cyberinfrastructure
The SBGrid VO continued to provide computing assistance to members of the biomedical community in Boston who are interested in utilizing the Open Science Grid for their research. In addition, we have also been supporting the European WeNMR grid project; our Virtual Organization issues security certificates for U.S. scientists who require access to WeNMR resources.
2.11 Computer Science Research
As the scale of OSG increases, its contribution as a laboratory for applied research in the deployment and extension of distributed high-throughput computing technologies remains active: OSG performs scalability tests with new components such as the CREAM CE from Europe and new development versions of Condor, and integrates components to build the end-to-end solutions needed by the stakeholders.
The OSG Security Team continues its work in the applied research area, starting with analysis of
the trust model in collaboration with Bart Miller’s team at the University of Wisconsin.
3. Development of the OSG Distributed Infrastructure
3.1 Usage of the OSG Facility
The OSG facility provides the platform that enables production by the science stakeholders; this includes operational capabilities, security, software, integration, testing, packaging, and documentation, as well as VO and user support. Scientists who use the OSG demand stability more than anything else, and we are continuing our operational focus on providing stable and dependable production-level capabilities.
The stakeholders continued to perform record amounts of processing. The two largest experiments, ATLAS and CMS, after performing a series of data processing exercises last year that thoroughly vetted the end-to-end architecture, are steadily setting computational usage records on almost a weekly basis. The OSG infrastructure has demonstrated that it is up to the challenge and continues to meet the needs of the stakeholders. Currently over 0.6 petabytes of data are transferred every day and more than 4 million jobs complete each week.
Figure 36: OSG facility usage vs. time for the past 12 months, broken down by VO
During the last year, the usage of OSG resources by VOs increased by ~50% from about 6M
hours per week to consistently more than 9M hours per week; additional detail is provided in the
attachment entitled “Production on the OSG.” OSG provides an infrastructure that supports a
broad scope of scientific research activities, including the major physics collaborations, biological sciences, nanoscience, applied mathematics, engineering, and computer science. Most of the
current usage continues to be in the area of physics, but non-physics use of OSG is an important area, with current usage of approximately 700K hours per week (averaged over the last year) spread over 17 VOs.
Figure 37: OSG facility usage vs. time for the past 12 months, broken down by Site.
(“Other” represents the summation of all other “smaller” sites)
With over 100 sites, the production provided on OSG resources continues to grow; the usage varies depending on the needs of the stakeholders. During normal operations, OSG provides more
than 1.4 M CPU wall clock hours a day with peaks occasionally exceeding 1.6 M CPU wall
clock hours a day; between 400K and 500K opportunistic wall clock hours are available on a daily basis for resource sharing.
The Tier-3 sites that we heavily invested in last year now perform steadily, just like many of the
Tier-2 sites. This is notable since many of their administrators do not have formal computer science training and thus special frameworks were developed to provide effective and productive
environments and support.
In summary, OSG has demonstrated that it is meeting the needs of the US CMS and US ATLAS stakeholders at all Tier-1, Tier-2, and Tier-3 sites. OSG is successfully managing the increase in job submissions and data movement as LHC data taking continues into 2011, and OSG continues to actively support and meet the needs of a growing set of non-LHC science communities that are increasing their reliance on OSG.
3.2 Middleware/Software
In the last year, our efforts to provide a stable and reliable production platform have continued
and we have focused on support and incremental, production-quality upgrades. In particular, we
have focused on support for native packaging, the Tier-3 sites, and storage systems.
As in all software distributions, significant effort must be applied to ongoing support and
maintenance. We have focused on continual, incremental support of our existing software stack
release, OSG 1.2. Between June 2010 and June 2011, we released 10 minor updates. These included regular software updates, security patches, bug fixes, and new software features. We will
not review all of the details of these releases here, but instead wish to emphasize that we have
invested significant effort in keeping all of the software up to date so that OSG’s stakeholders
can focus less on the software and more on their science; this software maintenance consumes
roughly 50% of the overall OSG software effort.
There have been several software updates and events in the last year that are worthy of deeper
discussion. As background, the OSG software stack is based on the VDT grid software distribution. The VDT is grid-agnostic and used by several grid projects including OSG, WLCG, and
BestGrid. The OSG software stack is the VDT with the addition of OSG-specific configuration.
• We have continued to focus our efforts on providing the OSG software stack as so-called “native packages” (e.g., RPMs on Red Hat Enterprise Linux). With the release of OSG 1.2, we have pushed the packaging abilities of our infrastructure (based on Pacman) as far as we can. While our established users are willing to use Pacman, there has been steady pressure to package software in a way that is more similar to how they get software from their OS vendors. With the emergence of Tier-3s, this effort has become even more important because system administrators at Tier-3s are often less experienced and have less time to devote to managing their OSG sites. We have wanted to support native packages for some time but have not had the effort to do so, due to other priorities; it has now become clear that we must do this. We initially focused on the needs of the LIGO experiment, and in April 2010 we shipped them a complete set of native packages for both CentOS 5 and Debian 5 (which have different packaging systems); these are now in production. More recently, we have provided Xrootd as RPMs for ATLAS Tier-3 sites. We are now planning to drop future development of our Pacman-based packages and do all future work as native packages, initially RPMs for Red Hat Enterprise Linux-compatible systems.
• Recently we have been focusing on the needs of the ATLAS and CMS Tier-3 sites, and in particular on Tier-3 support for new storage solutions. In the last year, we have improved our packaging, testing, and releasing of BeStMan, Xrootd, and Hadoop, which form a large part of our set of storage solutions, and we have released several iterations of each.
• We have emphasized improvements to our storage solutions in OSG. This is partly for the Tier-3 effort mentioned in the previous item, but also for broader use in OSG. For example, we have created new testbeds for Xrootd and Hadoop and expanded our test suite to ensure that the storage software we support and release is well tested and understood internally. We hold monthly joint meetings with the Xrootd developers and ATLAS to make sure that we understand how development is proceeding and what changes are needed. We have also provided a new tool (Pigeon Tools) to help our users monitor and understand the deployed storage systems in OSG. We have recently tested and released into production the new BeStMan2.
• We are currently preparing a major software addition: ATLAS has requested the addition of the gLite CREAM software, which has functionality similar to the Globus GRAM software; that is, it handles jobs submitted to a site. Our evaluation of this package shows that it is likely to scale quite well. We are far along in our work to integrate it into the OSG Software Stack and have provided an early test release to ATLAS, which is likely to be our first user. This test release was packaged with Pacman; in keeping with our increased focus on native packaging, our next release will be as RPMs.
• We made improvements to our grid monitoring software, RSV, to make it significantly easier for OSG sites to deploy and configure; this was released in February 2011.
• We have been evaluating various data management tools for the VOs that use OSG public storage. The goal is to find a reliable transfer service that allows users to perform bulk data transfers to and from their home institutions, submission nodes, and OSG sites. Thus far we have evaluated iRODS, the Bulk Data Mover tool, and Globus Online. Our evaluation shows that Globus Online is a potential match for our base requirements.
• We have worked hard on outreach through several venues:
o We conducted an in-person Storage Forum in September 2010 at the University of Chicago, to help us better understand the needs of our users and to connect them directly with storage experts.
o We participated in various schools and workshops (the first OSG Summer School, the OSG Site Administrator Forum, and the Brazil/OSG Grid School) to assist with training on grid technologies, particularly storage technologies. We will be hosting the second OSG Summer School at the end of June 2011.
o We participated in the OSG content management initiatives to significantly improve the OSG technical documentation.
The VDT continues to be used by several external collaborators. WLCG uses portions of VDT
(particularly Condor, Globus, UberFTP, and MyProxy). The VDT team maintains close contact
with WLCG via the OSG Software Coordinator's engagement with the gLite Engineering Management Team. We are also in close contact with gLite's successor, the European Middleware Initiative (EMI). TeraGrid and OSG continue to maintain a base level of interoperability by sharing a
code base for Globus, which is a release of Globus patched for OSG and TeraGrid’s needs. The
VDT software and storage coordinators are members of the WLCG Technical Forum, which is
addressing ongoing problems, needs and evolution of the WLCG infrastructure during LHC data
taking.
3.3
Operations
The OSG Operations team provides the central point of operational support for the Open Science Grid and coordinates various distributed OSG services. OSG Operations
publishes real time monitoring information about OSG resources, supports users, developers and
system administrators, maintains critical grid infrastructure services, provides incident response,
and acts as a communication hub. The primary goals of the OSG Operations group are: supporting and strengthening the autonomous OSG resources, building operational relationships with
peering grids, providing reliable grid infrastructure services, ensuring timely action and tracking
of operational issues, and assuring quick response to security incidents. In the last year, OSG
Operations continued to provide the OSG with a reliable facility infrastructure while at the same
time improving services to offer more robust tools to the OSG stakeholders.
OSG Operations is actively supporting the US LHC and we continue to refine and improve our
capabilities for these stakeholders. We have supported the additional load of LHC data-taking by increasing the number of support staff and implementing an ITIL-based (Information Technology Infrastructure Library) change management procedure. As OSG Operations supports
the LHC data-taking phase, we have set high expectations for service reliability and stability of
existing and new services.
During the last year, OSG Operations continued to provide and improve tools and services
for the OSG:

Ticket Exchange mechanisms were updated as the WLCG GGUS system, ATLAS RT system, and the VDT RT system continue to evolve in their local environment. A community
JIRA instance was opened for the use of all OSG groups for software task tracking and project management. New automated ticket exchanges with a new FNAL ticket system and this
new JIRA instance are being evaluated to provide even more possibilities for workflow
tracking between OSG collaborators.

The OSG Operations Support Desk continues to respond to ~160 tickets monthly.
Figure 38: Monthly Ticket Count (August 2010 - May 2011)

The change management plan has evolved with experience to ensure minimal impact on production work during service maintenance windows.

The BDII (Berkeley Database Information Index), which is critical to CMS and ATLAS production, has been at 99.94% availability and 99.95% reliability during the preceding 12
months; see Figure 39.
Figure 39: BDII Availability (June 1, 2010 to May 31, 2011)

A working group was created to deploy a WLCG Top Level BDII for use by USCMS and
USATLAS. The Blueprint group has approved the working group’s recommendations and
testing for this new deployment is ongoing with an implementation into production expected
in fall 2011.

The MyOSG system was ported to MyEGEE, MyEGI, and MyWLCG. MyOSG allows administrative, monitoring, information, validation and accounting services to be displayed
within a single user-defined interface.

Using Apache ActiveMQ messaging, we have provided WLCG with availability and reliability metrics.

The public ticket interface to OSG issues was continually updated to add requested features
aimed at meeting the needs of the OSG users.

We have completed the SLAs for all Operational services, including the new OSG Glide-In
Factory delegated to UCSD.
We also continued our efforts to improve service availability by completing several hardware and service upgrades:
1. We overcame network bandwidth issues experienced as a result of operational services being
behind institutional firewalls.
2. We continued moving metric services from non-production Nebraska facilities to production
hosting facilities at Indiana University.
3. Monitoring of OSG resources at the CERN BDII was implemented to allow end-to-end information system data flow to be tracked and alarms to be raised when necessary.
4. Operations continued to research virtual machine, high availability, and load balancing technologies in an attempt to bring even higher reliability standards to operational services.
Table 6: OSG SLA Availability and Actual Availability (12 Months Beginning June 1, 2010)
Operational Service     SLA Defined Availability Level     Actual Availability Achieved
BDII (Information)      99.00%                             99.94%
OIM (Registration)      97.00%                             99.89%
RSV (Monitoring)        97.00%                             99.72%
Software Cache          97.00%                             99.75%
Ticket Exchange         97.00%                             99.93%
MyOSG                   99.00%                             99.89%
TWiki                   97.00%                             99.90%
Service reliability for OSG services remains excellent and we now gather metrics that quantify
the reliability of these services with respect to the requirements provided in the Service Level
Agreements (SLAs); see Table 6. SLAs have been finalized for all OSG hosted services. Regular
(pre-announced) release schedules for all OSG services have been implemented to enhance user
testing and regularity of software release cycles for OSG Operations provided services. It is the
goal of OSG Operations to provide excellent support and stable distributed core services that the
OSG community can continue to rely upon and to decrease the possibility of unexpected events
interfering with user workflow.
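To make the SLA figures in Table 6 concrete, the short calculation below converts the availability percentages into allowed versus actual downtime per year; it assumes a 365-day year and is illustrative only.

    # Allowed vs. achieved downtime implied by the availability figures in
    # Table 6, assuming a 365-day (8,760-hour) year.
    HOURS_PER_YEAR = 365 * 24

    services = {
        # service name: (SLA availability, achieved availability)
        "BDII":            (0.9900, 0.9994),
        "OIM":             (0.9700, 0.9989),
        "RSV":             (0.9700, 0.9972),
        "Software Cache":  (0.9700, 0.9975),
        "Ticket Exchange": (0.9700, 0.9993),
        "MyOSG":           (0.9900, 0.9989),
        "TWiki":           (0.9700, 0.9990),
    }

    for name, (sla, actual) in services.items():
        allowed = (1 - sla) * HOURS_PER_YEAR
        used = (1 - actual) * HOURS_PER_YEAR
        print("%-15s allowed %6.1f h/yr, actual %5.1f h/yr" % (name, allowed, used))

For the BDII, for example, the 99.00% SLA allows roughly 88 hours of downtime per year, while the achieved 99.94% availability corresponds to only about 5 hours.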
3.4
Integration and Site Coordination
The OSG Integration and Sites Coordination activity continues to play a central role in helping
improve the quality of grid software releases prior to deployment on the production OSG, and in
helping sites deploy and operate OSG services thereby achieving greater success in production.
To help achieve these goals, we continued to operate the Validation (VTB) and Integration Test
Beds (ITB) in support of updates to the OSG software stack that include compute and storage
element services. In the past year we have deployed a new integration and validation cluster to
test the many compute elements (and associated job managers) and storage systems.
As a major improvement to our testing productivity, we have used the automated ITB job submission system to validate releases regularly. The “ITB Robot” is an automated testing and validation system for the ITB which has a suite of test jobs that are executed through the pilot-based
Panda workflow system; its main components are indicated in Figure 40. The test jobs can be of
any type and flavor; the current set includes simple ‘hello world’ test jobs, jobs that are CPUintensive, and jobs that exercise access to/from the associated storage element of the site. Importantly, ITB site administrators are provided a command line tool they can use to inject jobs
aimed for their site into the system, and then monitor the results using the full monitoring
framework (pilot and Condor-G logs, job metadata, etc) for debugging and validation at the joblevel. In addition, a web site to provide reports and graphs detailing testing activities as well as
results on multiple sites for various time periods was developed; this allows ITB site administrators to schedule tests to run on their site and allows them to view the daily testing that will run on
their ITB resources. In the future, we envision that additional workloads will be executed by the
system, simulating components of VO workloads. We are planning work on the “ITB robot”
involving migration to autofactory and improving the web interface to give better control to site
administrators over tests run on their sites and to provide better reporting of test results.
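The shape of a typical ITB Robot test payload can be sketched as follows. This is an illustrative stand-in rather than the actual test suite, and the storage check at the end is a placeholder for the site-appropriate transfer command.

    # Illustrative shape of an ITB test payload: report where the job landed,
    # burn some CPU, and touch local scratch space. The storage-element check
    # is a placeholder for the site-appropriate transfer command.
    import socket
    import subprocess
    import time

    print("running on", socket.gethostname())

    t0 = time.time()
    x = 0
    for i in range(5000000):        # simple CPU-intensive section
        x += i * i
    print("cpu section took %.1f s" % (time.time() - t0))

    # Placeholder storage check; a real test would copy a file to/from the
    # site's storage element using its configured tools.
    subprocess.call(["ls", "-ld", "/tmp"])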
Figure 40: The automated ITB testing environment ("the ITB Robot") showing the Panda
server, the ITB Site (compute element and worker node), client submit host and Django
webserver.
The ITB resources were augmented by nine Dell R610 servers and two Power Vault MD1200
storage shelves. One of the hosts has a 10G network interface so the ITB can be included in
high-bandwidth testing applications. Four of the servers are deployed as KVM (hypervisor) servers to host virtual machine instances. The other systems were used as compute nodes and to provide resources for testing virtual machine-based jobs as well as whole machine jobs. The KVM
servers currently host approximately 34 virtual machines serving as compute elements, dCache,
XRootd, and Hadoop based storage elements, GUMS (Grid User Management System, for sitelevel authentication and account mapping), and ancillary services for the ITB cluster (Condor
and PBS job managers, Ganglia for cluster monitoring, etc.).
These new hardware resources have allowed the Integration group to provide coverage over a
wider variety of software configurations. The ITB now provides coverage for all the major types
of storage elements used on the OSG. Previously, the ITB only tested minimal configurations of
dCache and XRootd/Bestman; however, with the new hardware, dCache, XRootd/Bestman, and
Hadoop/Bestman storage elements can be tested using tests that exercise replication and which
have larger space requirements. In addition, the ITB can run tests that require larger clusters and
more simultaneous jobs.
With the new hardware resources available, the ITB group has explored virtualization of various
components of the OSG services. So far, compute element (CE), storage element (SE), and authorization components (GUMS) have been virtualized and their performance characterized. The
work and experience gained in this effort is now being documented so that it can be used by the
wider OSG community.
Finally, in terms of site and end-user support, we continue to interact with the community of OSG sites using the persistent chat room (“Campfire”) that has now been in regular operation for nearly 30 months. We offer three-hour sessions at least three days a week where OSG core Integration or Sites support staff are available to discuss issues, troubleshoot problems, or simply “chat”
regarding OSG specific issues; these sessions are archived and searchable. The Site Administrators Workshop in August 2010 and the “Talk with the experts” session during the OSG all hands
meeting in March 2011 are other examples of constructive ways to build a community and serve
OSG users. We continue to hold monthly teleconferences for all sites which feature a focus topic
(usually a guest presenter) and open discussion. We are also using the new ITB cluster resources to support prototyping of Tier-3 facilities and to test instructions for installing common Tier-3 services. In OSG we have seen many new Tier-3 facilities in the past year, often with administrators who are not UNIX computing professionals but post-docs or students working part time on their facility. It has therefore been important to continue, within the virtualized Tier-3 cluster, the support and testing of services and installation techniques similar to those being developed by the ATLAS and CMS communities.
3.5
Campus Grids
An emerging effort is the introduction of distributed HTC (DHTC) infrastructures onto campuses. The DHTC principles (diverse resources, dependability, autonomy and mutual trust) that
OSG advances and implements at a national level apply equally well to a campus environment.
The goal of the Campus DHTC infrastructure activity is to translate this natural fit into wide local deployment of high-throughput computing capabilities at the nation’s campuses,
bringing locally operated DHTC services to production in support of faculty and students as well
as enabling integration with TeraGrid, the OSG, and other cyber infrastructures. Deploying
DHTC capabilities onto campuses carries a strong value proposition for both the campus and the
OSG. Intra-campus sharing of computing resources enhances scientific competitiveness and, when interfaced to the OSG production infrastructure, increases national computational throughput.
To accomplish this, we have defined a comprehensive approach that aims to eliminate key barriers to the adoption of HTC technologies by small research groups on our campuses.
1. Support for local campus identity management services removes the need for the researchers
to fetch and maintain additional security credentials such as grid certificates.
2. An integrated software package that moves beyond current cookbook models that require
campus IT teams to download and integrate multiple software components. This package
does not require root privileges and thus can be easily installed by a campus researcher.
3. Coordinated education, training and documentation activities and materials that cover the potential, best practices and technical details of DHTC technologies.
4. A campus job submission point capable of routing jobs to multiple heterogeneous batch scheduling systems (e.g., Condor, LSF, PBS). Previously existing campus DHTC models required that all resources be managed by the same scheduler.
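To make the submission-interface point concrete, the sketch below shows the kind of standard Condor submit description a campus researcher would hand to condor_submit at such a submission point; the file names are hypothetical, and the routing to LSF, PBS, or other local schedulers is configured behind this interface and is not shown.

    # Sketch of the submit-side view at a campus submission point: a standard
    # Condor submit description handed to condor_submit. File names are
    # hypothetical; routing to other local schedulers happens behind this
    # interface and is not shown.
    import subprocess
    import textwrap

    submit_description = textwrap.dedent("""\
        universe   = vanilla
        executable = analyze.sh
        arguments  = input_$(Process).dat
        output     = job_$(Cluster).$(Process).out
        error      = job_$(Cluster).$(Process).err
        log        = analyze.log
        queue 10
        """)

    with open("analyze.sub", "w") as f:
        f.write(submit_description)

    subprocess.call(["condor_submit", "analyze.sub"])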
The job submission interfaces presented to campus faculty and students will be identical to those
already in use today on the OSG production infrastructure. As they begin harnessing the local
campus resources through these interfaces they will be offered, once they obtain a grid credential, seamless access to OSG production grid resources. This forms a natural seedbed of a new
generation of computational scientists who can expand their local computing environment into
the national CI.
We have begun working with campus communities at Clemson, Nebraska, Notre Dame, Purdue
and Wisconsin-Madison on a prototyping effort that includes enabling the formation of local
HTC partnerships and dynamic access to shared local (intra-campus) and remote (inter-campus)
resources using campus identities.
3.6
VO and User Support
The goal of the VO and User Support area is to enable science and research communities from
their initial introduction to the OSG to production usage of the DHTC services and to provide
ongoing support for existing communities’ evolving needs. The User Support area assists established VOs that require help as they evolve their deployment of resources and their use of services and software, and as they identify usability issues. The team works to understand the communities’ needs, agree on common objectives, and provide technical guidance on how to adapt their resources and applications to participate in and run in a DHTC environment. We support users in integrating and using the existing software and help to resolve issues and problems; and,
as needed, we work with the software and technology investigation areas to address any gaps in
needed capabilities.
In support of existing VOs, we have continued to provide forums that provide:

in-depth coverage of VO and OSG at-large issues

community building through shared experiences and learning

expeditious resolution of operational issues

knowledge building through tutorials and presentations on new technology
As an outcome of the VO forum, VOs identify areas of common concern that then serve as inputs to the OSG work program planning process; some recent examples include: mechanisms to
manage and benefit from public storage; accounting discrepancies; early issues with adopting
pilot-based work flow management environments; and, the need for real-time job status monitoring.
In the last year we have documented our process for supporting new science communities and
we are currently using these processes and implementing improvements, as needed. Using these
processes in support of new communities and new initiatives by current OSG VOs, we have enabled the following work programs:

The Large Synoptic Survey Telescope (LSST) collaboration has undergone a second phase of
application integration in OSG, targeting image simulation as a pilot application. This phase
used 40,000 CPU hours per day for a month, with peaks of 100,000, and has helped the LSST community understand the potential of using OSG. As a follow-up to the experience with
this work, the project manager of the LSST Data Management group has requested a plan to
create a stand-alone LSST VO as well as the porting to OSG of data management production
workflows and of the data analysis framework.

The Network for Earthquake Engineering Simulation (NEES) has formed a stand-alone VO
after the successful proof-of-principle integration of the OpenSEES application, a widely
used earthquake simulation framework in the community. The application was run submit-
ting jobs from the NEESHub portal at Purdue and from a standard OSG submission site. The
community is currently undertaking a production-demo phase, scaling up by two orders of magnitude the number of jobs (~4,000) and the amount of data (~5 TB) handled in OSG.

The Dark Energy Survey (DES) has requested OSG expertise to improve the efficiency of
their data handling system (DAF). In collaboration with the Globus Online (GO) team, DES
and OSG have produced a GO-integrated prototype of DAF. This prototype showed a 27%
improvement in data transfer times and highlighted the need for better GO interfaces for verifying data placement.

SBGrid, based at Harvard Medical School, is extending its pilot job system to support the
Italian-based worldwide e-Infrastructure for NMR and structural biology (WeNMR) communities. In parallel, SBGrid continues to provide broad access to OSG by other life science and
structural biology researchers in the Boston area.

The GEANT4 Collaboration’s EGI-based biannual validation testing production runs were
made more efficient and expanded onto the OSG; this helped improve the quality of the
toolkit’s releases for MINOS, ATLAS, CMS, and LHCb.

The Cyber-Infrastructure Campus Champions from XSEDE have partnered with the OSG
Campus Grids initiative to broaden user access to computing resources. OSG User Support
has assisted XSEDE staff in running proof-of-principle jobs on OSG and identifying credentials accessible by CICC and trusted on OSG.
We are currently planning to extend our work to support users whose workflows span both
DHTC and HPC resources – such as those accessible through the TeraGrid/XD and DOE Science leadership class machines. We will work with the XD advanced user support group in helping users span this mix of resources.
3.7
Security
The Security team continued its focus on operational security, identity management, security policies, adopting useful security software, and disseminating security knowledge and awareness
within OSG.
During 2010, OSG conducted a series of workshops on identity management that collected requirements from VOs as well as technical requirements from security experts; the workshops concluded that improving the usability of the identity infrastructure was a major desire. During the
first half of 2011, we undertook a pilot study where we integrated the OSG infrastructure with
members of the InCommon federation (Clemson University in particular). We used the CILogon
CA as a bridge between US universities and OSG, and were able to leverage university-assigned user identities in OSG. We worked with the Clemson University IT department and demonstrated that
OSG users from Clemson University can access OSG resources without acquiring additional
identities. We prepared a risk assessment of our pilot study and collected feedback from the OSG
member institutions. The plan is to continue our work in this area during the next phase of OSG.
Currently, the majority of InCommon members have no accreditation that is compatible with
IGTF policies. Among our next steps is to propose a new authentication and identity vetting profile at IGTF that would be compatible with InCommon member practices. In the meantime,
many of the DOE National laboratories such as Fermilab and LBNL have joined InCommon and
we are exploring how these laboratories can gain special IGTF accreditation such that they can
integrate with the OSG identity infrastructure. In addition, the OSG security team strongly sup-
ported the CILogon Silver CA for its IGTF accreditation. As a voting relying party member of
IGTF, OSG voted in favor of CILogon CA accreditation and also encouraged other IGTF members to do the same by explaining the benefits to the US national cyber infrastructure.
On operational security, we had a very successful year with no major security incidents. However, we addressed a large number (>10) of kernel-level and root-level software vulnerabilities.
We had one attack reported in our community due to these vulnerabilities, but the attacker did
not target the grid infrastructure and only compromised the local systems. We spent a significant
amount of effort detecting vulnerable OSG sites and helping them apply the needed patches.
Many OSG sites had varying institutional policies on their patching practices and this required
some negotiation with the WLCG security team, since WLCG requested a short patching period (7 days) from all sites. We negotiated agreements with the WLCG security team to provide
schedule extensions based on each site’s special circumstances (e.g. delayed sites are asked to
apply a temporary fix), which required one-on-one work with the affected OSG sites. The vulnerability monitoring tool Pakiti, which we adopted in March 2010, proved very useful during this time by helping us detect the patch levels of OSG sites subscribed to this service.
We continued exchanging incident and vulnerability information across EGI/WLCG, TeraGrid
and OSG security teams. This collaboration has provided significant benefit for our community
since we are able to receive and understand ongoing attacks and vulnerabilities and thus take
timely precautions against those threats. In order to ensure our incident preparedness, we tested
and updated the security contact information of every OSG site and VO and ensured that each
site and VO has two people registered as security contacts in the OSG database.
We transitioned the OSG DOEGrids Registration Authority (RA) duty to members of the OSG
operations team at Indiana University. The RA handles all types of certificate issues, including
certificate approval, renewal, revocation, and troubleshooting. This transition proved to be beneficial on several fronts: we created detailed documentation of all RA processes; we tested the
documentation while training new back-up personnel; and, we gained a team of back-up personnel instead of relying on a single person acting as an RA.
We conducted an incident drill (security service challenge) where we tested OSG sites’ ability to
respond to a mock incident. This drill had been coordinated between EGI and OSG security
teams based on a request from WLCG; it covered ATLAS sites and aimed to test the Panda job
submission workflow. The drill was completed successfully and we are currently evaluating the
results.
In the last year, the DOEGrids CA experienced two short-lived service outages lasting less than 2
days each. In addition, the Japanese and Korean CAs had longer-term (around a month) outages
due to the earthquake in Japan. Although these outages did not have a serious impact on OSG
operations, they created added awareness, leading to updates of our business continuity and contingency plans for the OSG identity infrastructure. We documented various scenarios describing how to use alternative Certificate Authorities in the event of such emergencies.
As documented in the OSG Security Plan, we performed our annual Security Test and Controls
which are designed to measure the security of OSG core services. We are now preparing a report
to the OSG Executive Team that contains our recommendations and findings from the audits. We
plan to repeat these controls each year and evaluate whether earlier recommendations were
adopted and are beneficial to OSG security. Each year OSG services have shown continuous
improvement as assessed by these tests and controls. Based on our findings, we update the
OSG Risk Assessment document each year.
Due to the deprecation of the SHA-1 and MD5 hashing algorithms, IGTF decided to start using
SHA-2 hashes in certificates. OSG generated a test cache for the new certificates in ITB and
tested them against our software. We modified the needed software in the VDT stack and the
ITB testing of this effort has been completed. Migration of OSG sites and VOs to the new
certificate cache is currently in progress.
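As an illustration of the kind of check involved in the SHA-2 migration, the sketch below inspects a certificate's signature hash algorithm using the present-day Python cryptography library; the file name is a placeholder, and the library choice is ours for illustration, not part of the VDT work described above.

    # Check which hash algorithm signed a certificate (e.g. 'sha1' vs 'sha256').
    # Uses the present-day 'cryptography' package; the file name is a placeholder.
    from cryptography import x509
    from cryptography.hazmat.backends import default_backend

    with open("usercert.pem", "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read(), default_backend())

    print("subject:  ", cert.subject)
    print("hash used:", cert.signature_hash_algorithm.name)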
During the March 2011 OSG All Hands meeting, we performed our annual site and VO security
training. This year, instead of presentations, we prepared and sent two sets of questionnaires to
the security contacts before the meeting; and then during the AHM, we discussed their responses
to the questions. We intend to continue this kind of hands-on survey and questionnaire and to generate a FAQ database that security contacts can review online for training.
3.8
Content Management
A project was begun in late 2009 to improve the collaborative documentation process and the
organization and content of OSG documentation. The project was undertaken after an initial
study using interviews of users and providers of OSG documentation and a subsequent analysis
of the OSG documentation system. Implementation began in 2010: an overall coordinator was assigned and the documentation was divided into nine areas, each with a responsible
owner.
Based on feedback from the stakeholders, we developed an improved process for producing and
reviewing documents, including new document standards (with tools to simplify application of
the standards) and templates defined to make the documents more consistent for readers. The
new content management process includes: 1) ownership of each document; 2) a formal review
of new and modified documents; and, 3) testing of procedural documents. A different person
fills each role, and all of them collaborate to produce the final document for release. Several
new tools were created to facilitate the management and writing of documents in this collaborative and geographically dispersed environment.
The team had several areas of focus: user and storage documentation and documentation specific
to bringing up new Tier-3 sites. By the end of 2010, the team, with collaborators in many of the
OSG VOs, had:
 Reduced the number of documents by 1/3 by combining documents with similar content and
making use of an “include mechanism” for common sections in multiple documents.

Produced a new navigation system that is being completed at this time.

Provided reviewed documentation specific to establishing Tier-3 sites.

Revamped documentation of the storage technologies in OSG.

Improved and released 77% of the documents overall.
We have made great overall progress, but two document areas (CE and VOs) are still delayed because of staffing issues that are currently being addressed.
In the first quarter of 2011, the prototype navigation was implemented in close cooperation with
OSG operations using a development instance of the OSG Twiki. Unlike the existing navigation,
the improved navigation allows the user to search the content more easily and provides documents organized by user role and by technology of interest. The prototype is being transitioned gradually
into production in incremental steps to ease the transition from the old navigation to the new for
users of the production Twiki. Once this milestone has been achieved the document process will
change its current focus from improving individual documents to improving the navigation for
readers in different roles such as users, system administrators, and VO managers. This process
will test the connectivity and improve the usability of a set of documents when carrying out typical OSG role-based user tasks. In the second half of 2011, the document process will be integrated into the OSG release process to ensure that software and services provided by the OSG are in
sync with the corresponding documentation.
3.9
Metrics and Measurements
OSG Metrics and Measurements strive to give OSG management, VOs, and external entities
quantitative details of the OSG Consortium’s growth and performance throughout its lifetime.
The focus for the recent year has been maintenance of the technical aspects and analysis of new
high-level metrics.
The OSG Display (http://display.grid.iu.edu), developed last year, is meant to give technically savvy members of the public a high-level, focused view of the consortium's highlights and a feel for the services that OSG provides. This display continues to be a very visible aspect of Metrics and is used often at public displays and at several OSG sites.
Most metrics services and automated reports have been transitioned to OSG Operations at GOC
(the exceptions are discussed later); the goal remains to have operations run all reports by the
start of FY12. These reports have proven to be useful in catching subtle issues in WLCG availability or accounting, and are reviewed daily by both the LHC VOs and OSG staff. Metrics staff
continues to produce and manually verify monthly reports for the eJOT and OSG homepage; we
believe this additional manual layer is an important validation prior to larger use. OSG Metrics
performs periodic reviews of the validity of automated processes and audits the documentation
for manual processes.
The Metrics area continues to coordinate WLCG-related reporting efforts. The installed capacity
reporting put in place last year has been operating for a year without technical issues. Based on
an ATLAS request, we investigated the effect of hyper-threading on the site normalization used
by the WLCG for accounting. From those results, we started manually computing ATLAS normalizations, and are working on a technical solution to restore automated computations.
In FY11, we have increased efforts to employ metrics data in assessing key strategic capabilities
of the OSG facility and project overall. An important focus has been to examine factors driving
high throughput computing performance. An example of this was an investigation into wall-time
efficiency which relates to the capabilities of sites within the OSG to efficiently process data, and
to the ability of VOs to appropriately design their workflows. In Figure 41 the wall-time efficiency (CPU time/wall time) derived from the Gratia accounting service is plotted versus the
number of completed jobs per day for the full year 2010. The plot compares two VOs with very different efficiency characteristics and usage patterns. In each case the VOs submit a mix of
jobs – each has a heterogeneous set of users doing both CPU intensive workloads (such as Monte
Carlo simulation) and data analysis (which can be IO intensive resulting naturally in lower efficiency) which contributes to the spread of points.
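The metric itself is straightforward; the sketch below computes per-VO wall-time efficiency (CPU time divided by wall time) from accounting-style records. The record layout is illustrative and does not reproduce the actual Gratia schema.

    # Per-VO wall-time efficiency (CPU time / wall time) from accounting-style
    # records; the record layout is illustrative, not the Gratia schema.
    records = [
        {"vo": "VOa", "cpu_seconds": 3400, "wall_seconds": 3600},
        {"vo": "VOa", "cpu_seconds": 1200, "wall_seconds": 3600},  # I/O-bound analysis job
        {"vo": "VOb", "cpu_seconds": 3550, "wall_seconds": 3600},
    ]

    totals = {}
    for r in records:
        cpu, wall = totals.get(r["vo"], (0, 0))
        totals[r["vo"]] = (cpu + r["cpu_seconds"], wall + r["wall_seconds"])

    for vo, (cpu, wall) in sorted(totals.items()):
        print("%s: wall-time efficiency = %.2f" % (vo, cpu / wall))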
Figure 41: Comparison of the wall-time efficiency for two VOs
Figure 42: Comparison of the walltime efficiency for 17 OSG sites serving VOa for 2010
Furthermore, there are dependencies on the computing site, as displayed in Figure 42 which
shows the cumulative wall time efficiency for all of 2010 for the computing sites selected by
VOa. Here we see significant variation with three sites in particular showing comparatively
weak performance. These facilities have very significant differences in profile, varying in scale
(amount of storage and CPU job slots available), storage systems (dCache, HDFS, etc.) and data
access protocol (staging to local disk on the compute node or streaming from network attached
storage pools). There are also wide-area network affects when input datasets need to be prestaged to the site as part of the workflow. The VOs know in detail the most significant factors
that come into play; the lessons garnered are generally useful for all VOs in OSG, and these are
discussed in a variety of forums within the OSG community.
The OSG metrics web application framework allows users to browse metrics data, and staff to
create reports for OSG management; this package is stable and is now well-packaged and installable as an RPM. Reports have continuously evolved and been added based upon management requests and new Gratia data. As a continuation of work mostly completed in 2010, we have replaced hand-maintained data or custom databases with data sources such as GOC’s OIM or Gratia, although the “field of science” report mapping individuals to science fields is an ongoing
concern. The scripts generating nightly plots for the OSG homepage have been significantly
cleaned up, and are being prepared to transition to operations.
OSG Metrics continues its successful collaboration with the external Gratia development project
in the face of substantial effort reductions in the Gratia project. Using OSG feedback, Gratia
has improved accounting for pilot jobs, HTPC jobs, and collecting storage and batch system data
from the BDII. OSG Metrics has contributed to the storage probes, the batch system probes, and
code for the next collector release. The scalability improvements to Gratia put in place last year
based on analysis of the load resulting from the LHC turn-on have proven to be sufficient for the
current year.
3.10
Extending Science Applications
In addition to operating a facility, the OSG includes a program of work that extends the support
of Science Applications in terms of both the complexity and the scale of the applications that can be effectively run on the infrastructure. We solicit input from the scientific user community concerning both operational experience with the deployed infrastructure and
extensions to the functionality of that infrastructure. We identify limitations, and address those
with our stakeholders in the science community. In the last year, the high level focus has been
threefold: 1) improve the scalability, reliability, and usability as well as our understanding thereof; 2) improve the usability of our Work Load Management systems to enable broader adoption
by non-HEP user communities; and 3) evaluate new technologies, such as xrootd, for adoption
by the OSG user communities.
3.10.1 Scalability, Reliability, and Usability
As the scale of the hardware that is accessible via the OSG increases, we need to continuously
assure that the performance of the middleware is adequate to meet the demands. There were
three major goals in this area for the last year and they were met via a close collaboration between developers, user communities, and OSG.
1. At the job submission client level, the CMS-stated goal of 40,000 jobs running
simultaneously and 1M jobs run per day from a single client installation has been achieved
and exceeded, with a peak of 60,000 running jobs demonstrated. The job submission client
goals were met in collaboration with CMS, Condor, and DISUN, using glideinWMS. This
was done in a controlled environment, using the “overlay grid” for large scale testing on top
of the production infrastructure provided by CMS. To achieve this goal, the Condor
architecture was modified to allow for port multiplexing and asynchronous matchmaking.
There have also been several smaller improvements in disk transaction rates and memory
consumption to lower the hardware cost of achieving this goal. The glideinWMS is also used
for production activities in over ten scientific communities, the biggest being CMS, D0, CDF
and HCC, where the job success rate has consistently been above the 95% mark.
2. At the storage level, the present goal is to have 100 Hz file-handling rates with thousands of clients accessing the same storage area at the same time, while delivering at least 1 Gbit/s aggregate data throughput (see the back-of-the-envelope numbers after this list). The two versions of BeStMan SRM, based on two different Java container technologies, Globus and Jetty, both running on top of HadoopFS, have been shown to achieve this goal. They can also handle on the order of 1M files at once, with directories containing up to 50K files. There was no major progress on the performance of the dCache-based SRM, which did not exceed 10 Hz in our tests.
3. At the functionality level, this year’s goal was to evaluate and facilitate the adoption of new
streaming storage technology to complement the file based solution provided by SRM. This
was requested by CMS to enable efficient remote event processing. The chosen technology is
based on the xrootd protocol and implemented by Scalla. A Scalla instance has been tested using the above-mentioned overlay grid and has shown great performance with a modest number of clients, but fails catastrophically at higher scales. The test results have been
propagated back to the developers and a fix is expected soon.
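The back-of-the-envelope numbers behind the scalability targets above can be worked out in a few lines; the figures come directly from the stated goals (1M jobs per day, 100 Hz file handling, 1 Gbit/s aggregate throughput).

    # Back-of-the-envelope numbers behind the stated targets.
    jobs_per_day = 1000000
    print("1M jobs/day is about %.1f job completions per second, sustained"
          % (jobs_per_day / 86400.0))

    file_rate_hz = 100          # target SRM file-handling rate
    throughput_bps = 1e9        # 1 Gbit/s aggregate throughput target
    mb_per_file = throughput_bps / file_rate_hz / 8 / 1e6
    print("Average payload needed to reach 1 Gbit/s at 100 Hz: %.2f MB per file"
          % mb_per_file)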
In addition, we have continued to work on a number of lower impact objectives:
1. We have been involved in testing the client implementation of alternative Grid gatekeepers,
in particular CREAM and GRAM5 since several OSG VOs need access to sites that use such
technologies. The primary client in OSG is Condor-G, so it has been extensively tested and
all problems reported to the Condor development team, who fixed all of them.
2. We have been involved in tuning the performance of SRM implementations by performing
configuration sweeps and measuring the performance at each point.
3. We continuously evaluate new versions of OSG software packages, with the aim both of discovering any bugs specific to the OSG use cases and of comparing the scalability
and reliability characteristics against the previous release.
4. We have been involved with other OSG area coordinators in reviewing and improving the
user documentation. The resulting improvements are expected to increase the usability of
OSG for both the users and the resource providers.
5. We provide support to OSG VOs that express interest in using glideinWMS, by providing
advice on deployment options as well as helping with high level operational issues. This has
proven to be very effective, as several VOs adopted the OSG-operated glideinWMS factory
this year with excellent results for their production scaling.
6. We have improved the package containing a framework for using Grid resources for
performing consistent scalability tests against centralized services, like CEs and SRMs. The
intent of this package is to quickly “certify” the performance characteristics of new
middleware, a new site, or deployment on new hardware, by using thousands of clients
instead of one. Using Grid resources allows us to achieve this, but requires additional
synchronization mechanisms to perform in a reliable and repeatable manner.
3.10.2 Workload Management System
OSG actively supports its stakeholders in the broad deployment of a flexible set of software tools
and services for efficient, scalable and secure distribution of workload among OSG sites. Both
major Workload Management Systems supported by OSG – glideinWMS and Panda – saw significant usage growth in the last year.
The Panda system continued its stable operation as a crucial element of the ATLAS experiment
at the LHC and saw more than a two-fold increase in workload over the past 12 months, with the number of jobs processed daily peaking at 840,000. This growth brought with it a new set of challenges in the area of scalability of Panda databases and related information systems, such as
monitoring. These challenges were met with an effort aimed at optimization of queries against
the existing Oracle RDBMS, and also with R&D leading to the application of novel “noSQL”
technologies, already proven in industry, to Panda data storage solutions. A series of tests performed at CERN and at Brookhaven National Laboratory determined the optimal design of the storage data structures and the hardware configuration for a high-performance Cassandra DB cluster, which will be used to handle archived data currently accessed through Oracle, achieving better scalability and throughput going forward.
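As an illustration of the archival pattern being considered, the sketch below uses the Python Cassandra driver with a purely hypothetical keyspace and table layout; it is not the schema chosen in the tests described above.

    # Hypothetical keyspace/table for archived job records, using the Python
    # Cassandra driver; names and schema are illustrative only.
    from cassandra.cluster import Cluster

    cluster = Cluster(["cassandra-node1.example.org"])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS panda_archive
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS panda_archive.jobs_by_day (
            day text, job_id bigint, site text, status text, wall_seconds int,
            PRIMARY KEY (day, job_id)
        )
    """)

    session.execute(
        "INSERT INTO panda_archive.jobs_by_day (day, job_id, site, status, wall_seconds)"
        " VALUES (%s, %s, %s, %s, %s)",
        ("2011-05-01", 123456789, "BNL_ATLAS", "finished", 5400),
    )
    row = session.execute(
        "SELECT count(*) FROM panda_archive.jobs_by_day WHERE day = %s",
        ("2011-05-01",),
    ).one()
    print("jobs archived for 2011-05-01:", row.count)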
We continued our collaboration with the BNL research group active in the Daya Bay and LBNE
(DUSEL) experiments, providing them with increased and transparent access to large resource
pools (at BNL) via Panda job submission. As a result, in recent months the researchers have been
taking advantage of a throughput more than an order of magnitude higher than before and are
now submitting a steady stream of jobs daily.
The GlideinWMS system saw a steady and significant increase in volume over the past 12 months,
averaging roughly 100,000 jobs per day. To increase the impact of this technology, OSG now
operates a glidein factory as a core service for the benefit of multiple VOs and in the last year
this has enabled another ~6 VOs to adopt this job submission framework. Based on feedback
from OSG VOs, the glideinWMS project undertook a development effort to further improve the
system, and enhancements were made in areas such as monitoring, network communications,
gLExec integration and diagnostics, improved handling of glide-ins, multi-core capability, packaging and installation, and others.
As evidenced by the increase in volume, scope, and diversity of the workload handled globally by the OSG-supported Workload Management Systems, this program continues to be extremely important for the science community. These systems have proven themselves to be key enabling factors for LHC projects such as ATLAS and CMS. It is equally important that OSG continues to draw new entrants from diverse areas of research, who receive significant benefit by leveraging the stable and proven environment for resource access, job submission, and monitoring tools created and maintained by OSG.
3.11
OSG Technology Planning
The OSG Technology Planning area provides the OSG with a mechanism for medium- to long-term technology planning. This is accomplished via two sub-groups:
Blueprint: The blueprint sub-group records the conceptual principles of the OSG and focuses on
the long-term evolution of the OSG. The Blueprint group meets approximately quarterly and,
under the direction of the OSG Technical Director, updates the "Blueprint Document" to reflect
our understanding of the basic principles, definitions, and the broad outlines of how the elements fit together.
Investigations: To manage the influx of new external technologies - while keeping to the OSG
principles - this sub-group conducts investigations to understand the concepts, functionality, and im-
pact of external technologies. The goal is to identify technologies that are potentially disruptive
in the medium-term of 12-24 months and give the OSG recommendations on whether and how to
adopt them. This sub-group was added in late FY11 and is viewed as a key work area to enable
the evolution of OSG.
Starting in January 2011, the Blueprint group undertook a major update of the OSG Blueprint
Document to make it more focused on the principles of OSG, and less on the technology implementation. It also now recognizes that the OSG, as an organization, has several grids and currently two reference implementations (the Production Grid and a Campus Grid).
The blueprint meetings have identified several areas of investigation for the future. Examples
include:
1. Integrating virtualization-based resources into the OSG alongside batch-based resources.
2. An alternate system for uploading site state information to the central operations databases.
3. Easy-to-operate data movement for smaller organizations.
The investigation sub-group has begun work on the virtualization task, and expects to release a
report at the end of July 2011. To engage and inform the wider community in this aspect of the
OSG, a blog was recently started to record the area’s activities (http://osgtech.blogspot.com).
3.12
Challenges facing OSG
The current challenge facing OSG is how to sustain, improve and extend our support for our existing communities, while engaging with and expanding our support for additional communities
that can benefit from our services, software and expertise. We continue to face the challenge of
providing the most effective value to the wider community, as well as contributing to the XD,
SciDAC-3 and Exascale programs. We have learned that it is not trivial to transfer capabilities
from large international collaborations to a smaller scientific community. We have identified
some of these challenges and how we would propose to address them in our plans (and proposal)
for the next 5 years to sustain and extend the OSG:
1. The heterogeneity of the resource environment of OSG makes it difficult for smaller communities to operate successfully at a significant scale. We made a breakthrough in this area in
2010 by offering a job submission service that creates community-specific overlay batch systems across OSG sites, thus providing meta-scheduler functionality across the entire infrastructure. Today, six different communities share a single instance of this service, providing
an economy of scale and centralizing the support across these communities. We will thus offer this service as a core feature of OSG to all communities in the next 5 years.
2. The complexity of the grid certificate-based authorization infrastructure presents a non-negligible barrier to entry for smaller communities. During the last year, we saw multiple
promising approaches emerging in different contexts: LIGO is pioneering a more integrated
approach that includes federated identity management through Shibboleth, and the OSG
campus infrastructure group has developed a prototype in which local identity management
mechanisms are extended to a regional cross–campus infrastructure spanning campuses in six
Midwest cities.
3. While the LHC communities are moving PetaBytes worldwide, smaller communities find it
exceedingly difficult to manage terabytes. Over the last 5 years we have seen this gap in capability grow rather than shrink. The first step in reversing this trend is the “Any Data, Any-
time, Anywhere” initiative, a collaboration of OSG, WLCG, US ATLAS, and US CMS computing communities. The initiative aims to reduce the problem from one of moving data
around within the OSG fabric to one of getting the data onto the OSG fabric in the first place;
once data is anywhere on the fabric it would be accessible remotely from everywhere on the
fabric.
4.
Satellite Projects, Partners, and Collaborations
The OSG coordinates with and leverages the work of many other projects, institutions, and scientific teams that collaborate with OSG in different ways. This coordination varies from reliance
on external project collaborations to develop software that will be included in the VDT and deployed on OSG to maintaining relationships with other projects where there is a mutual benefit
because of common software, common user projects, or expertise in areas of high throughput or
high performance computing.
Projects are Satellite Projects if they meet the following criteria:

OSG was involved in the planning process and there was communication and coordination
between the proposal’s PI and OSG Executive Team before submission to agencies.

OSG commits support for the proposal and/or future collaborative action within the OSG
project.

The project agrees to be considered an OSG Satellite project.
Satellite Projects are independent projects with their own project management, reporting, and
accountability to their program sponsors; the OSG core project does not provide oversight for the
execution of the satellite project’s work program. OSG does have a close working relationship
with Satellite Projects and a member of our leadership serves as an interface to these projects.
Current OSG Satellite Projects are:

CI-TEAM: Embedded Immersive Engagement for Cyber infrastructure (EIE4CI) to establish
the Engage VO, engagement program, and explore the use of OSG techniques for catalyzing
campus grids.

ExTENCI: Extending Science Through Enhanced National Cyberinfrastructure is a new project that began in August 2010. It jointly serves OSG and TeraGrid by providing mechanisms for running applications on both architectures.

High Throughput Parallel Computing (HTPC) on OSG resources for an emerging class of applications in which large ensembles (hundreds to thousands) of modestly parallel (4- to ~64-way) jobs are run.

Application testing over the ESNet 100-Gigabit network prototype, using the storage and compute end-points supplied by the Magellan cloud computing testbeds at ANL and NERSC.

CorralWMS to enable user access to provisioned resources and “just-in-time” available resources, integrated for a single workload. It builds on previous work on OSG’s GlideinWMS
and Corral, a provisioning tool used to complement the Pegasus WMS used on TeraGrid.

VOSS: “Delegating Organizational Work to Virtual Organization Technologies: Beyond the
Communications Paradigm” (OCI funded, NSF 0838383) studies how OSG functions as a
collaboration.
We maintain a partner relationship with many other organizations related to grid infrastructures, other high performance computing infrastructures, international consortia, and certain
projects that operate in the broad space of high throughput or high performance computing.
These collaborations include:
 Community Driven Improvement of Globus Software (CDIGS)
 Condor
 Cyber Infrastructure Logon service (CILogon)
 European Grid Initiative (EGI)
 European Middleware Initiative (EMI)
 Energy Sciences Network (ESNet)
 FutureGrid study of grids and clouds
 Galois, an R&D company in the area of security for computer networks
 Globus
 Colombian National Grid (GridColombia)
 São Paulo State University's statewide, multi-campus computational grid (GridUNESP)
 Internet2
 Magellan
 Network for Earthquake Engineering Simulation (NEES)
 National (UK) Grid Service (NGS)
 NYSGrid
 Open Grid Forum (OGF)
 Pegasus workflow management
 TeraGrid
 WLCG
The OSG is supported by many institutions including:
 Boston University
 Brookhaven National Laboratory
 California Institute of Technology
 Clemson University
 Columbia University
 Fermi National Accelerator Laboratory
 Harvard University (Medical School)
 Indiana University
 Information Sciences Institute (USC)
 Lawrence Berkeley National Laboratory
 Purdue University
 Renaissance Computing Institute
 Stanford Linear Accelerator Center (SLAC)
 University of California San Diego
 University of Chicago
 University of Florida
 University of Illinois Urbana Champaign/NCSA
 University of Nebraska – Lincoln
 University of Wisconsin, Madison
 University of Buffalo (council)
Selected Satellite Projects and Partnerships and their work with OSG are described below.
4.1
CI Team Engagements
In April 2008, PI McGee was awarded an OSG Satellite Project (Embedded Immersive Engagement for Cyber infrastructure – EIE4CI) from the CI-TEAM program to establish the Engage
VO, an engagement program, and, with co-PI Goasguen, explore the use of OSG techniques for catalyzing campus grids. This effort is now in a no-cost-extension period, and the annual report on these activities was submitted to NSF in May 2011 (award number 0753335). Included with
the annual reports is an independent assessment of the Engage CI-TEAM effort conducted by the
VOSS sponsored Scientific Software Ecosystems Research Project3.
The EIE4CI team has been successful in engaging a significant and diverse group of researchers
across scientific domains and universities. The infrastructure and expertise that we have employed have enabled rapid uptake of the OSG national CI by researchers with a need for scientific
computing that fits the model of national scale distributed HTC. Engaging university campuses
in the dialog of organizing local campus Cyber infrastructure using the tooling and techniques
developed by the Open Science Grid has proven more challenging as reported in a January 2010
workshop4. During this reporting period, the Engage VO and associated hosted infrastructure have increasingly been relied upon to bring new users and larger communities onto OSG. In addition to the staff at RENCI who are sponsored by the satellite award, core OSG staff (e.g. FNAL
staff working with NEES) and other OSG satellite projects (e.g. ExTENCI) are engaging new
users and communities, building upon the experiences of this CI-TEAM award, and depending
upon the EIE4CI managed infrastructure. During the remaining time of the EIE4CI no-cost-extension period, we will begin discussions with the OSG core program to make contingency
plans for the case in which these activities do not receive continued funding to engage new users
and maintain the enabling infrastructure, processes, and Engage-VO community that has come
together. Figure 43 below shows the EIE4CI cumulative usage throughout the entire project period, clearly indicating increased growth and demand for OSG.
3
http://www.renci.org/wp-content/pub/techreports/TR-10-05.pdf
4
https://twiki.grid.iu.edu/bin/view/CampusGrids/WorkingMeetingFermilab
Figure 43: Cumulative Hours by Facility throughout the Entire EIE4CI Program
Engage-VO use of OSG during the reporting period is depicted in Figure 44 and represents a
number of science domains and projects including: Biochemistry (Zhao), theoretical physics
(Bass, Peterson, Bhattacharya, Coleman-Smith, Qin), Mechanical Engineering (Ratnaswamy),
Earthquake Engineering (Espinosa, Barbosa), RCSB Protein Data Bank (Prlic), and Oceanography (Thayre).
Figure 44: CPU hours per engaged user during reporting period
We note that all usage by Engage staff depicted here is directly related to assisting users, and not
related to any computational work of the Engage staff themselves. This typically involves running jobs on behalf of users for the first time or after significant changes to test wrapper scripts
and probe how the distributed infrastructure will react to the particular user codes. In an effort to
increase the available cycles for Engage VO users, RENCI has made available two clusters to the
Engage community, including opportunistic access to the 11TFlop BlueRidge system (with GPGPU nodes being used by an EIE4CI molecular dynamics user5) and some experimental large
memory nodes (two nodes of 32 cores and 1TB of system memory).
4.2
Condor
The OSG software platform includes Condor, a high throughput computing system developed by
the Condor Project. Condor can manage local clusters of computers, dedicated or cycle scavenged from desktops or other resources, and can manage jobs running on both local clusters and
delegated to remote sites via Condor itself, Globus, CREAM, and other systems.
The Condor Team released updated versions of the Condor software and provided support for
that software. That support frequently involved improvements to the Condor software, which
would be tested and released.
4.2.1 Release Condor
This activity consisted of the ongoing work required to have regular new releases of Condor.
Creating quality Condor releases at regular intervals required significant effort. New releases
fixed known bugs; supported new operating system releases (porting); supported new versions of
dependent system software and hardware; underwent a rigorous quality assurance and development lifecycle process (consisting of a strict source code management, release process, and regression testing); and received updates to the documentation.
Previously, release engineering work was done by various team members on a rotating basis. This reduced the efficiency of the work and made long-term release engineering projects difficult. We have moved to having two staff members partially dedicated to release engineering, allowing
us to improve our process.
From July 2010 through early June 2011, we have made a number of improvements to our release engineering. We added more platforms on which we do hourly builds and testing, enabling
earlier bug detection. In collaboration with Red Hat, we added caching of external libraries to
speed up individual builds. We enhanced our build and test dashboard to quickly identify flawed
commits. We began running our most recent development release to power the Metronome software in the NMI Build and Test Lab6, providing an additional real and complex environment to
help identify problems earlier in the development series. These benefits combine to allow earlier
and more bug-free releases of Condor. Catching bugs earlier means fewer stable releases are
necessary, increasing stability and ease of administration for end users.
Over the last year the Condor team made 7 releases of Condor, with at least one more planned
before the end of June 2011. During this time the Condor team created and code-reviewed 84
publicly documented bug fixes. Condor ports were maintained and released for Windows, Mac
OS X, and 6 different Linux distributions. These ports were tested across 30 different operating system and architecture combinations. Recent work testing binary compatibility allows us to support and deliver fewer different packages while supporting more operating system distributions than before.
5 http://osglog.wordpress.com/2011/02/02/amber11-pmemd-for-nvidia-gpgpu/
6 See http://nmi.cs.wisc.edu/ for more about the NMI facility and Metronome.
We continued to invest effort improving our automated test suite to find bugs before our users
do, and continued our efforts to better leverage the NMI Build and Test facility and Metronome
framework. The number of automated builds we perform via NMI averages over 80 per day.
This allows us to better meet our release schedules by alerting us to problems in the code or a
port as early as possible. We currently perform more than 30,000 tests per day on the current
Condor source code snapshot (see Figure 45).
Figure 45: Number of daily automated regression tests performed on the Condor source
Support and release for Condor is non-trivial, not only because of the nature of distributed computing but also because the Condor codebase is very active. On average over the past year, the Condor project each month:
• Released a new version of Condor to the public
• Performed over 170 commits to the codebase (see Figure 46)
• Modified over 350 source code files
• Changed over 8,500 lines of code (Condor source code written at UW-Madison now sits at about 922,422 lines of code)
• Compiled about 2,500 builds of the code for testing purposes
• Ran about 930,000 regression tests (both functional and unit)
Figure 46: Number of code commits made per month to Condor source repository
The 7.4 stable series continued to be popular, with over 7,800 downloads since July 2010, bringing lifetime downloads from the Condor homepage alone to over 24,000. The new 7.6 stable series was first released in April 2011 and already has over 1,900 downloads, replacing 7.4 as the
most popular download. The stable series contains only bug fixes and ports, while the development series contains new functionality. Throughout 2003, the project averaged over 400 downloads of the software each month; in the past year, that number has grown to over 1,000 downloads each month as shown in Figure 47.
Figure 47: Stacked graph depicting number of downloads per month per Condor version
Note these numbers exclude the increasingly popular downloads of Condor from other third-party distributors, including the VDT (Open Science Grid software stack), and copies of Condor bundled with popular Linux distributions such as Red Hat MRG, Ubuntu, and Fedora Linux; as a result, Condor use grows more quickly than these limited numbers suggest.
Several OSG collaborators wish to compile the Condor sources themselves. While Condor has been open source since 2003, it could be difficult for external groups to build and to interoperate with natively installed packages. We collaborated with Red Hat to overhaul Condor’s build system so that it works well for both Condor team members and external groups.
Condor team members have also been working with the Debian Project to improve the quality of
Condor’s Debian packages. This also includes work to move Condor to dynamic linking of all
possible libraries. Dynamic linking will reduce package sizes, speeding releases and downloads
for users. This work is a required step on the way toward eventual addition of Condor into the
official Debian repository. Once Condor is available from the official Debian repository, this will
provide the most convenient possible packaging for users of Debian-based systems.
Besides the Condor software itself (covered in the section below), the project maintained the Condor web sites at http://www.condorproject.org and http://wiki.condorproject.org, which contain links to the Condor software, documentation, research papers, technical documents, and presentations. The project maintained numerous public project-related email lists such as condor-users (see Figure 49), condor-devel, and their associated on-line archives. The project also maintained a publicly accessible version control system allowing access to in-development releases, a public issue tracking system, and a public wiki of documentation on using, configuring, maintaining, and releasing Condor. For users anticipating bug fixes or new features, we have placed an increased focus on ensuring that release information is associated with issues in our public tracking system, allowing users to identify which version of Condor contains which changes.
A primary publication for the Condor team is the Condor Manual which currently stands at over
1,000 pages. In the past year, the manual has been kept up to date with changes and additions in
the stable 7.4 series, the development 7.5 series, the new stable 7.6 series, and in preparation for
the imminent 7.7.0 release; the most recent stable release is the Condor Version 7.6.1 Manual7.
4.2.2 Support Condor
Users received support directly from project developers by sending questions or bug reports to the condor-admin@cs.wisc.edu email address. All incoming support email was tracked by an email-based ticket tracking system running on servers at UW-Madison. From July 2010 through early June 2011, over 1,500 email messages were exchanged between the project and users towards resolving 430 support incidents (see Figure 48).
Figure 48: Number of tracked incidents and support emails exchanged per month
Support is also provided on various online forums, bug tracking systems, and mailing lists. For example, approximately 10% of the hundreds of messages posted each month on the condor-users email list originated from Condor staff; see Figure 49.
7 http://www.cs.wisc.edu/condor/manual/
Figure 49: The condor-users email list receives hundreds of messages per month; project staff regularly contributes to the discussion.
The Condor team provides on-going support through regular phone conferences and face-to-face
meetings with OSG collaborations that use Condor in complex or mission-critical settings. This
includes monthly meetings with USCMS and FermiGrid, weekly teleconferences with ATLAS,
and biweekly teleconferences with LIGO. Over the last year Condor’s email support system has
been used to manage 21 issues with LIGO, resolving 9 of them. The Condor team also uses a
web system to track ongoing issues. This web system is tracking 27 issues associated with ATLAS, of which 19 are resolved, 107 issues associated with CMS, of which 70 are resolved, 74
tickets for LIGO, of which 29 are resolved, and 33 for other OSG users, of which 18 are resolved.
Support work for LIGO has included hardening and adding functionality, helping to debug problems in their cluster, and fixing bugs uncovered by the work of LIGO and other OSG groups.
Work with ATLAS revealed bugs in Condor-G, which have been fixed. For groups including
ATLAS and LIGO the Condor team provides advice, configuration support, and debugging support to meet policy, security, and scalability needs.
The Condor team has crafted specific tests as part of our support for OSG associated groups. For
example, LIGO's work pushes the limits of Condor DAGMan, so specific extremely large scale
tests have been added and are regularly run to ensure that new releases will continue to meet LIGO's needs. The Condor team has developed and performed multiple stress tests to ensure that
Condor-G's ability to manage large GRAM and CREAM job submissions will scale to meet ATLAS needs.
As part of our community support for Condor, we organize and lead an annual workshop called
Condor Week. This past year’s Condor Week 2011 was held on May 2-6, 2011 at the Wisconsin
Institute for Discovery Town Center and Teaching Labs8. Condor Week 2011 was held in conjunction with the Paradyn Project. There were 147 attendees, with 110 attendees at solely the
Condor events. There were 44 speakers at Condor Week over four days. There were attendees
from industry, government and universities. Industry attendees included representatives from
Cisco and DreamWorks Animation. Government attendees included representatives from
Brookhaven National Laboratory and Fermi National Accelerator Laboratory. University representatives included attendees from the University of Notre Dame, the University of Southern California, and the University of Nebraska-Lincoln. There were also representatives from universities and companies in countries around the world, including Australia, Spain, Germany, and Italy.
4.3 High Throughput Parallel Computing
With the advent of 8-, 16-, and soon 32-core commodity CPU systems, OSG stakeholders have shown an increased interest in computing that combines small-scale parallel applications with large-scale high throughput capabilities, i.e., ensembles of independent jobs, each using 8 to 64 tightly coupled processes. The OSG “HTPC” program is funded through a separate NSF grant to evolve the technologies, engage new users, and support the deployment and use of these applications.
HTPC is an emerging paradigm that combines the benefits of High Throughput Computing with small-scale parallel computing. One immediate benefit is that parallel HTPC jobs are far more portable than most parallel jobs, since they do not depend on the nuances of parallel library software versions and machine-specific hardware interconnects. For HTPC, parallel libraries are packaged and shipped along with the job. This pattern allows for two additional benefits: first, there are no restrictions on the method of parallelization, which can be MPI, OpenMP, Linda, or any other approach; second, the libraries can be optimized for on-processor communication so that these jobs run optimally on multi-core hardware.
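To make the packaging pattern concrete, the sketch below (written in Python purely for illustration) builds and submits a Condor description for an 8-way whole-node job that carries its own MPI runtime with it. The wrapper script, file names, and the whole-machine ClassAd attributes are assumptions standing in for site-specific HTPC conventions; they are not standard Condor keywords or the program's actual recipe.

import subprocess
import textwrap

# Illustrative sketch only: an 8-way HTPC job that ships its own MPI runtime.
submit_description = textwrap.dedent("""\
    universe                = vanilla
    executable              = run_mpi_app.sh
    arguments               = 8
    # Ship the application and a matching MPI runtime, so the job does not
    # depend on library versions or interconnects at the execution site.
    transfer_input_files    = my_app, mpi_runtime.tar.gz
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    # Ask for a whole multi-core node; these two attributes are assumed HTPC
    # site conventions, not built-in Condor submit commands.
    +RequiresWholeMachine   = True
    requirements            = (CAN_RUN_WHOLE_MACHINE =?= True)
    output                  = htpc_$(Cluster).out
    error                   = htpc_$(Cluster).err
    log                     = htpc_$(Cluster).log
    queue
    """)

with open("htpc_job.submit", "w") as handle:
    handle.write(submit_description)

# Hand the description to Condor; assumes condor_submit is on the PATH.
subprocess.check_call(["condor_submit", "htpc_job.submit"])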
The work advanced significantly this year as the groundwork was laid for using the OSG Glidein mechanism to submit jobs. The implication is that users will soon be able to submit and manage HTPC jobs as easily as they do ordinary HTC jobs via GlideinWMS.
Recently, efforts on this project have focused on:
• Documenting how to submit applications to the HTPC environment.
• Extending the OSG information services to advertise support for HTPC jobs.
• Extending Glide-in technology so that users can use this powerful mechanism to submit and manage HTPC jobs. All production job submissions now utilize Glide-in technology.
To date, applications from the fields of chemistry, weather, and computational engineering have been run across 7 sites – Oklahoma, Clemson, Purdue, Wisconsin, Nebraska, UCSD and the CMS Tier-1 at Fermilab. These represent new applications on the OSG. In effect, HTPC is making the OSG infrastructure available to new classes of researchers for whom the OSG was previously not an option.
8 http://www.cs.wisc.edu/condor/CondorWeek2011
Both ATLAS and CMS are actively working to take advantage of HTPC. As they continue to utilize it, HTPC capabilities will be rolled out more broadly across the Tier-2 sites.
We have logged nearly 15M hours since the first HTPC jobs ran in November of 2009. The work is being watched closely by the HTC communities, who are interested in taking advantage of multi-core hardware without adding a dependency on MPI. We have also learned that the HTPC model is key to enabling scientists who need to use GPUs, which appears to be an emerging paradigm for the OSG in the near future.
4.4 Advanced Network Initiative (ANI) Testing
The ANI project's objective is to prepare typical OSG data transfer applications for the emergence of 100Gbps WAN networking. This is accomplished in close collaboration with OSG, both contributing to OSG by testing OSG software and benefitting from OSG. Thus we look to shrink the time between 100Gbps network links becoming available and the OSG science stakeholders being able to benefit from their capacity. This project is focusing on the following areas:
• Creation of an easy-to-operate “load generator” that can generate traffic between Storage Systems on the Open Science Grid (a rough sketch of such a generator follows this list).
• Instrumenting the existing and emerging production software stack in OSG to allow benchmarking.
• Benchmarking the production software stack, identifying existing bottlenecks, and working with the external software developers on improving the performance of this stack.
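A minimal sketch of such a load generator, written in Python for illustration, is given below: it simply loops GridFTP transfers between two storage endpoints using the standard globus-url-copy client and reports the elapsed time of each transfer. The endpoint URLs, file name, and concurrency settings are placeholders, and this is not a description of the actual OSG load generator.

import subprocess
import time

# Placeholder endpoints; real tests would use OSG storage element URLs.
SRC = "gsiftp://source-se.example.org/data/testfile_1GB"
DST = "gsiftp://dest-se.example.org/scratch/loadgen/testfile_1GB"

TRANSFERS = 10          # number of back-to-back transfers to run
PARALLEL_STREAMS = 4    # parallel TCP data streams per transfer

for i in range(TRANSFERS):
    start = time.time()
    # globus-url-copy is the standard GridFTP command-line client;
    # -p requests multiple parallel data streams.
    subprocess.check_call(
        ["globus-url-copy", "-p", str(PARALLEL_STREAMS), SRC, DST])
    elapsed = time.time() - start
    print("transfer %d finished in %.1f s" % (i + 1, elapsed))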
The following is a summary of achievements related to the ANI project that we accomplished in 2010.
• Architecture design of OSG/HEP Application with 100Gb/s connection
Through intensive discussion with other groups in the ANI project and OSG colleagues, a high-level design of the OSG/HEP application architecture was documented that is consistent with both the envisioned ANI and the LHC data grid architectures. This was documented in Specification of OSG Use Cases and Requirements for the 100 Gb/s Network Connection (OSG-doc-1008).
• Building hardware test platform
We built a test cluster at UCSD to conduct various tests involving the core technologies for the OSG/HEP applications for the ANI project. The cluster has all the necessary components to function as an SE, including BeStMan, HDFS, GUMS, GridFTP, and FUSE, and is also used for other types of data transfer tools, e.g. Fast Data Transfer (FDT). This test platform has been used for the transaction rate tests described in OSG-doc-1004, as well as the 2009 and 2010 Supercomputing bandwidth challenges.
• Validating Hadoop Distributed File System
We previously validated the Hadoop Distributed File System (HDFS) as a key technology of an OSG Storage Element (SE). We gave a presentation on this at the International Symposium on Grid Computing (ISGC) 2010 with the title: Roadmap for Applying Hadoop distributed file system in Scientific Grid Computing.
• Test of Scalability of Storage Resource Manager (SRM) System
We conducted transaction rate scalability tests of two different versions of the BeStMan SRM. These two versions are based on two different Java container technologies, Globus and Jetty. Our tests contributed to the release of the new Jetty-based BeStMan-2 by improving the configuration and documenting the corresponding performance as compared with the previously available Globus-based implementation on the same hardware. We worked closely with various parties, including the OSG Storage group for Hadoop distribution and packaging and the BeStMan development team at LBNL, and used the new glidein-based scalability tool from OSG Scalability. The results of the BeStMan tests were documented in Measurement of BeStMan Scalability (OSG-doc-1004). The use of glideinWMS and glideinTester is documented in Use of Late-binding technology for Workload Management System in CMS (OSG-doc-937).
• Test of WAN data transfer tools and participation at Supercomputing 2010
Various WAN data transfers have been tested between UCSD and other CMS Tier-2 sites. We presented our results on networking configuration, storage architecture, and data transfer at the 18th International Conference on Computing in High Energy Physics (CHEP) 2010 under the title: Study of WAN Data Transfer with Hadoop-based SE.
• Detailed test plan for ANI
We contributed to the development of the ANI testing plan. The present plan assumes that two types of application tests will be run. In both cases, NERSC will function as the source, and one of either ANL or ORNL will function as the sink of data. In the first test, data will be copied from storage at NERSC straight to memory at ANL and consumed there by applications running on the Magellan hardware at ANI. No storage at ANL is involved in this test. In the second test, data will be copied to storage at ORNL.
4.5 ExTENCI
The goal of the Extending Science Through Enhanced National Cyberinfrastructure (ExTENCI) project is to develop and provide production-quality enhancements to the national cyberinfrastructure that enable specific science applications to more easily use both the Open Science Grid (OSG) and TeraGrid (TG), or that broaden access to a capability for both TG and OSG users.
ExTENCI is a combination of four technology components, each run as a separate project with one or more science projects targeted as the first users. The components and sciences are: Workflow & Client Tools – SCEC and Protein Modeling; Distributed File System – CMS/ATLAS; Virtual Machines – STAR and CMS; and Job Submission Paradigms – Cactus Applications (EnKF and Coastal Modeling).
Although they are independent projects, there are linkages between them. For example, experiments in the Workflow & Client Tools project are planning to use the Distributed File System and Job Submission Paradigms capabilities. The CMS experiment is actively investigating the use of the Distributed File System as well as Virtual Machines. After 10 months, significant progress has been made by each of the projects.
Workflow & Client Tools has been working in two main areas: earthquake hazard prediction (from the SCEC project) and protein structure analysis. We have demonstrated a method using the Swift workflow engine that can obtain large numbers (~2000) of nodes from OSG Engage Virtual Organization (VO) sites, built and generalized a framework that allows this to be done, and demonstrated it with another application, glass material modeling. This has allowed running instances of straightforward applications (protein and glass modeling) on OSG as well as on TG. The SCEC application has been run on both OSG and TG; but in this case, the intense data transfer needs required reformulating the application to limit data transfers to cases where the amount of computation to be done on the data is sufficiently large. In all applications, current work is on optimizing the methods for determining how best to divide the work between OSG and TG.
Distributed File System has set up the hardware at three sites, installed the Kerberos security infrastructure, added Kerberos to the Lustre/WAN file system, and installed the Lustre server
software at two sites (as planned) and Lustre client software at the two sites, and is now doing
performance testing between sites. Some performance issues have been identified and tests of
possible solutions are in progress. Science application testing was started but is delayed until the
performance issues are resolved.
Virtual Machines has deployed hypervisors on a large scale at Clemson (OSG) and Purdue (TG)
and verified that the STAR VM can be run on OSG and TG and that the CMS VM runs on TG.
We have tested and documented Condor-G’s support of cloud services, created a report on tools
used to author VMs and manage libraries of such machines, and have cataloged VM distribution
mechanisms in preparation for developing tools to assist users in these areas.
Job Submission Paradigms has completed the overall design and has implemented the SAGA
Condor-G adaptor that is now being tested by sending jobs to both TG and OSG. Verification of
operation and performance of a science application is awaiting completion of this testing.
Below is a graph of usage by sciences now able to use OSG via new ExTENCI capabilities. The first phase of the work is complete, and testing will resume once the problems discovered have been resolved.
Figure 50: ExTENCI Usage of OSG
Another important benefit of ExTENCI has been to provide an opportunity for increased joint
activity and collaboration between OSG and TG. Each of the projects has fostered sharing of
information, joint problem solving, and combined teams to better serve users across both environments, and thus paves the way to OSG becoming a Service Provider in XSEDE.
4.6 Corral WMS
Under the NSF OCI-0943725 award, the University of Southern California, Fermi National Accelerator Laboratory, and the University of California San Diego have been working on the CorralWMS integration project, which provides an interface to resource provisioning across national as well as campus cyberinfrastructures. Software initially developed under the name Corral now extends the capabilities of glideinWMS. The resulting product, glideinWMS, is one of the main workload management systems on OSG and enables many major user communities to efficiently use the available OSG resources. The current users are the particle physics experiments CMS, CDF, and GlueX; the structural biology research communities SBGrid and NEBioGrid; the nanotechnology research community NanoHUB; the engineering community NEES; the campus grid communities of the University of Nebraska (HCC), the University of Wisconsin (GLOW), and the University of California San Diego (UCSDGrid); the Northwest Indiana Computational Grid (NWICG); the Southern California Earthquake Center (SCEC); and the OSG Engage VO.
Corral, a tool developed to complement the Pegasus Workflow Management System, was recently built to meet the needs of workflow-based applications running on the TeraGrid. It is being
used today by the Southern California Earthquake Center (SCEC) CyberShake application. In a
period of 10 days in May 2009, SCEC used Corral to provision a total of 33,600 cores and used
them to execute 50 workflows, each containing approximately 800,000 application tasks, which
corresponded to 852,120 individual jobs executed on the TeraGrid Ranger system. The 50-fold
reduction from the number of workflow tasks to the number of jobs is due to job-clustering features within Pegasus designed to improve overall performance for workflows with short duration
tasks.
GlideinWMS was initially developed to meet the needs of the CMS (Compact Muon Solenoid)
experiment at the Large Hadron Collider (LHC) at CERN. It generalizes a Condor glidein system
developed for CDF (The Collider Detector at Fermilab) first deployed for production in 2003. It
has been in production across the Worldwide LHC Computing Grid (WLCG), with major contributions from the Open Science Grid (OSG) in support of CMS for the past two years, and has
recently been adopted for user analysis. Over those two years, CMS alone has used glideinWMS
to consume more than 29 million CPU hours, and has had over 15,000 concurrently running jobs.
The integrated CorralWMS system, which will retain the glideinWMS product name, includes a new version of Corral as a frontend. It provides a robust and scalable resource provisioning service that supports a broad set of domain application workflow and workload execution environments. The system enables workflows to run across local and distributed computing resources, the major national cyberinfrastructure providers (Open Science Grid and TeraGrid), as well as emerging commercial and community cloud environments. The Corral frontend handles the end-user interface and the user credentials, and determines when new resources need to be provisioned. Corral then communicates the requirements to the glideinWMS factory, and the factory performs the actual provisioning.
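The code below is a purely conceptual sketch of the kind of decision the frontend makes when deciding whether to ask the factory for more glideins; none of the function or variable names correspond to the actual Corral or glideinWMS code or their APIs.

def glideins_needed(idle_jobs, running_glideins, pending_glideins,
                    jobs_per_glidein=1, max_pending=100):
    """Estimate how many additional glideins to request from the factory."""
    # Resources already running or on their way cover roughly this many jobs.
    covered = (running_glideins + pending_glideins) * jobs_per_glidein
    deficit = max(0, idle_jobs - covered)
    wanted = -(-deficit // jobs_per_glidein)  # ceiling division
    # Never flood the factory: cap the number of outstanding requests.
    return min(wanted, max(0, max_pending - pending_glideins))


# Example: 500 idle user jobs, 120 glideins running, 30 still pending.
print(glideins_needed(idle_jobs=500, running_glideins=120, pending_glideins=30))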
Communities using the Corral frontend on OSG include SCEC and LIGO. The SCEC workflows described above start off with a couple of large earthquake simulation MPI jobs, and are followed by a large set of serial jobs to determine how different sites in the Los Angeles region would be affected by the simulated earthquake. SCEC has been a long-time Corral user and requirements driver, and has been using Corral in production for runs on TeraGrid. As a demonstration of how CorralWMS can be used across cyberinfrastructures, a SCEC workflow was planned to execute MPI jobs on TeraGrid and serial workloads on OSG using the glideinWMS system. Four such runs were completed successfully.
The CorralWMS project supports the OSG-LIGO Taskforce effort, whose mission is to enable LIGO Inspiral workflows to perform better in the OSG environment. One problem LIGO had was that short tasks in the workflows were competing with a large number of long-running LIGO Einstein@Home jobs. When submitted to the same OSG site, the workflows were essentially starved. With the glideinWMS glideins, the workflow jobs are now on a more equal footing, as the glideins retain their resources longer and during their lifetime can service many workflow jobs. The glideinWMS glideins also helped when running multiple workflows at the same time, as job priorities were used to overlap data staging jobs and compute jobs. This was not possible when relying on the remote Grid site’s batch system to handle scheduling rather than the glideinWMS scheduler. Recently, LIGO has also used the system to support additional LIGO workflows and scientists. One of those, a workflow called Powerflux, has fewer data dependencies than the Inspiral workflow; because of that, the workflow is easier to spread out across a large number of computational resources at the same time, which makes the workload a great fit for OSG.
The CorralWMS system also contributes to glideinWMS factory development. During the year, many new features have been implemented. It is now possible to pass project information between the frontend and the factory, and for the frontend to specify that large groups of glideins should be submitted as one grid job. These features are mainly intended to make the system work better with TeraGrid, but they are also required to make workloads run across the infrastructures. The UCSD group is operating a production glidein factory gathering resources from hundreds of Grid sites all over the world, open to several user communities. Compared to last year, the number of communities served has more than doubled. The glidein factory is now regularly providing around 10,000 CPUs for them to use.
Figure 51: CPUs provided by the glidein factory
Each community operates its own frontend, with most of the frontend instances being located
outside UCSD. The amount of administration work needed by each of the frontend operators is
minimal. Most of the operational work stems from problems developing on any of the many Grid
sites providing the resources, which is handled almost entirely by the glidein factory operators,
thus effectively shielding the frontend administrators.
The CorralWMS project has recently contributed sessions to the OSG Grid Schools in Madison
and Sao Paulo, and gave a presentation at Condor Week 2011.
4.7 OSG Summer School
OSG is committed to educating students and researchers in the use of Distributed High-Throughput Computing to enable scientific computation. Our staff provides proactive support in workshops and collaborative efforts to help define, improve, and evolve the US national cyberinfrastructure. As part of this support, we were awarded support to conduct a four-day school in high-throughput computing (HTC) in July 2010 at the University of Wisconsin–Madison. This was a joint proposal with TeraGrid, and all seventeen students who participated in the OSG summer school also participated in the TeraGrid 2010 conference. In June 2011, we are offering this school again, expanding it to include twenty-five students.
An important focus of the school is direct, guided hands-on experience with HTC; students learn how to use HTC in a campus environment as well as for large-scale computations with OSG. The
students receive a variety of experiences including running basic computations, large workflows,
scientific applications, and pilot-based infrastructures as well as how to use storage technologies
to cope with large quantities of data. The students get access to multiple OSG sites so that they
can experience scaling of computations. They use exactly the same tools and techniques that
OSG users currently use, but we emphasize the underlying principles so they can apply what
they learned more generally.
In both years of this summer school, students came from a wide variety of scientific disciplines:
physics, biology, GIS, computer science, and more. To provide students with the best learning
opportunities, we recruit instructors from staff who are experienced with the relevant technologies to develop, teach, and support the school. Most instructors are OSG staff, and a few come
from other projects: TeraGrid, Globus, the Middleware Security and Testing research project,
and the Condor Project.
A highlight of the school is the HTC Showcase, in which local scientists give lectures about their
experiences with HTC. They showcase how the use of HTC expanded not only the amount of
science they could do, but the kinds of science they could do.
During and after the school, we ask students to evaluate their experiences. Last year, an independent researcher examined these evaluations. While this examination is too detailed to include
here, an extract of the summary said:
Overall, the majority of the participants were very happy with how the conference went, and raved about the accommodations and organization of it. The majority were also very happy with the instructors and the presentations and activities within the sessions.
The OSG Summer School has web pages for each year, including curricula and materials:
https://twiki.grid.iu.edu/bin/view/Education/OSGSummerSchool2010
https://twiki.grid.iu.edu/bin/view/Education/OSGSummerSchool2011
In addition, an article about the school appeared in International Science Grid This Week:
http://www.isgtw.org/?pid=1002696
After each school is completed, we assign OSG staff as mentors to the students, and they make
regular contact with the students so they can provide help and deepen students’ participation in
the broader HTC community. Students are paired with staff based on factors such as common
interests, organizational memberships, and geography. Many of these students make use of this
mentorship program to help integrate HTC into their research.
4.8 Science enabled by HCC
The Holland Computing Center (HCC) of the University of Nebraska (NU) began to use the OSG to support its researchers during 2010, and usage is continuing and growing. HCC’s mission is to meet the computational needs of NU scientists, and it does so by offering a variety of resources. In the past year, the OSG has served as one of the resources available to our scientists, leveraging the expertise from the OSG personnel and from running the CMS Tier2 center at the site. New applications from the humanities and fine arts have been deployed over the last year, as described in more detail below.
Support for running jobs on the OSG is provided by our general-purpose applications specialists, and we have only one graduate student dedicated to using the OSG.
To distribute jobs, we primarily utilize the Condor-based GlideinWMS system originally developed by CDF and CMS (now independently supported by the NSF). This provides a user experience identical to using the Condor batch system, greatly lowering the barrier to entry for users. HCC’s GlideinWMS installation has been used as a submission point for other VOs, particularly the LSST collaboration with OSG Engage. The GlideinWMS work led a CS graduate student to write his master’s thesis on grid technologies for local campus grids and campus bridging. In particular, he created the campus grid factory, a local version of the pilot job factory used by GlideinWMS.
For data management, we leverage the available high-speed network and Hadoop-based storage
at Nebraska. For most workflows, data is moved to and from Hadoop, sometimes resulting in
multiple terabytes of data movement a day.
In the last year and a half, 6 teams of scientists and the 2 new teams mentioned above have run
over 15 million hours on the OSG; see Figure 52. This is about 10% of our active research teams
at HCC, but over 20% of HCC computing (only counting computing done opportunistically at
remote sites). Less precise figures are available for data movement, but it is estimated to be
about 50TB total.
Figure 52: Monthly wall hours per facility
Applications we have run on the OSG in the past year are:
• TreeSearch, a platform for the brute-force discovery of graphs with particular structures. This was written by a Mathematics doctoral candidate and accumulated 4.6 million wall hours of runtime. Without OSG, HCC would not have been able to provide a sufficient number of hours for this student to finish this work.
• DaliLite, a biochemistry application. This was a one-off processing run needed because reviewers of a paper requested more statistics. It was moved from a small lab cluster to the OSG in a matter of days and accumulated over 120 thousand runtime hours. Without OSG, the scientists would not have been able to complete their paper in time.
• OMSSA, an application used by a researcher at the medical school. This was another example where a researcher who had never used HCC clusters discovered he needed a huge amount of CPU time in order to make a paper deadline in less than 2 weeks. As with DaliLite, the researcher would not have made the deadline running only on HCC resources.
• Interview Truth Table Generator for Boolean models, developed by a mathematical biology research group at the University of Nebraska-Omaha, is a Java-based tool to generate Boolean truth tables from user data supplied via a web portal. Depending on input size, as a single process the tool required several days to complete one table. After a small modification to the source code, the Condor glidein mechanism was utilized to deploy jobs to both HCC and external resources, enabling a significant reduction in runtime. For one particular case with a table of 67.1 million entries, runtime was reduced from approximately 4 days to about 1 hour.
• CPASS, another biochemistry application. This was our first web application converted to the grid; Condor allows a user to submit a burst of jobs which first fill the local clusters, then migrate out to the grid if excess capacity is needed.
• AutoDock and CS-ROSETTA, two smaller-scale biochemistry applications brought to HCC and converted to the grid. The work done with these was essential for a UNL graduate student to finish his degree work.
• Digital Media Generation: Jeff Thompson, assistant professor of New Genres and Digital Arts in the Department of Art & Art History, produces work investigating uses of high-powered computing for the creation of visual and sonic artworks. A current project, titled Every Nokia Tune and begun in March 2011, uses the Open Science Grid to visualize all 6 billion unique combinations of the Nokia Tune ringtone, the most ubiquitous music in the world, heard ca. 20,000 times per second. The final visualization, if printed, would stretch for 42 kilometers. As with scientific projects, the scale that HCC offers for artistic projects will spur innovation in the field of digital arts; while taking only an afternoon on the OSG, “Every Nokia Tune” would take approximately 5,700 years if computed on a desktop machine, making the project impossible there.
• Digital Research in the Humanities: This work at the Center for Digital Research in the Humanities (CDRH) with Brian Pytlik-Zillig is substantially focused on text analysis in general, and on n-gram (pattern sequence) studies in particular. Using Pytlik-Zillig’s TokenX software, which UNL offers as a free licensed download, CDRH constructs data sets of use to researchers in a variety of disciplines not limited to the humanities. N-gram data sets make it possible for scholars to observe and count patterns of words, parts of speech, and other sequences over time, by genre, and so on. TokenX works serially to extract n-gram data from groups of texts (text corpora) and to create structures that align the data in traditional tables where each column represents a text and every row signifies a specific n-gram, with counts of actual occurrences. Separate databases are constructed for n-grams (1-grams, 2-grams, up through 5-grams) that occur more than once. Serial n-gram processing begins to show scalability problems when there are more than about three hundred texts in a given corpus. A full TokenX n-gram extraction for a corpus of 300 novel-length texts takes more than a month to process. With a surge in the digitization of texts, larger and larger groupings of texts are possible and desirable. In the next six months, we hope to perform n-gram extractions on a corpus of thirty thousand texts representing approximately 8 gigabytes of content. With serial extraction, construction of the n-gram database would take more than eight years to complete. We are therefore working to leverage the existing TokenX code base to take advantage of parallel processing over a computer network, with the desired throughput to be measured in weeks or months rather than years; a sketch of the per-text step in such a parallel extraction follows this list.
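The sketch below, written in Python for illustration, shows what the per-text step of such a parallel extraction could look like: one independent grid job counts the 1- through 5-grams of a single text and writes out those occurring more than once, leaving a later merge step to assemble the text-by-n-gram table. It is a simplified stand-in rather than the TokenX code, and the whitespace tokenization is an assumption.

import json
import sys
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count all 1- through max_n-grams that occur in a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    text_file = sys.argv[1]              # one text per grid job
    with open(text_file) as handle:
        tokens = handle.read().split()   # TokenX does real tokenization
    counts = ngram_counts(tokens)
    # Keep only n-grams occurring more than once, as the CDRH databases do.
    repeated = {gram: c for gram, c in counts.items() if c > 1}
    json.dump(repeated, sys.stdout)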
HCC has gone from almost zero local usage of the OSG in 2009 to millions of CPU hours per year. This has been done without local OSG funding for user support (there are other OSG personnel and one student who do share expertise). The OSG is an important part of our “toolbox” of solutions for Nebraska scientists. The OSG is not a curiosity or a toy for HCC, but something we depend on not only to offload jobs, but to support science, and now other scholarly research, which could not have been completed with HCC resources alone.
4.9 Virtualization and Clouds
OSG staff and contributors continue to explore the technologies and services needed to support virtualization and scientific and commercial clouds. Focused work is happening, and is reported, as part of the ExTENCI satellite project, with the new technology area within OSG providing the connection to OSG. This has included extended testing and initial use of GlideinWMS access through EC2 interfaces to Amazon and the DOE Magellan resource at NERSC; the US ATLAS, HCC and LIGO VOs have used this resource to date. Small samples of CMS Monte Carlo have been run on these infrastructures.
4.10 Magellan
OSG communities continue to access the DOE Magellan resources at the National Energy Research Scientific Computing Center. In the course of enabling use of Magellan, the OSG client
tools have been extended.
• Condor’s Amazon EC2 interface was generalized to work with third-party clouds that support the EC2 specification, including Magellan (an illustrative submit sketch follows this list).
• GlideinWMS was extended to enable submission to the Amazon and Magellan clouds using the generic Condor framework. This enables users to transparently use the same Condor execution environment on the OSG and the cloud.
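As an illustration of what such a submission looks like, the sketch below (Python, for consistency with the other examples) writes a grid-universe submit description aimed at an EC2-compatible endpoint and hands it to condor_submit. The endpoint URL, AMI identifier, instance type, and credential file paths are placeholders, and the keyword names follow the Condor manual's EC2 grid type, whose exact name has varied across Condor versions; treat this as an assumption-laden sketch rather than the configuration used on Magellan.

import subprocess
import textwrap

# Illustrative sketch only; none of these values refer to real endpoints.
submit_description = textwrap.dedent("""\
    universe              = grid
    # Any service that speaks the EC2 interface can be named here.
    grid_resource         = ec2 https://cloud.example.org:8773/services/Cloud
    executable            = cloud_worker_vm
    ec2_ami_id            = ami-00000000
    ec2_instance_type     = m1.large
    # Paths to files holding the (placeholder) access and secret keys.
    ec2_access_key_id     = /home/user/ec2.access
    ec2_secret_access_key = /home/user/ec2.secret
    log                   = cloud_$(Cluster).log
    queue
    """)

with open("cloud_vm.submit", "w") as handle:
    handle.write(submit_description)

subprocess.check_call(["condor_submit", "cloud_vm.submit"])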
The fixes for Condor and GlideinWMS were made by their respective developers, but were motivated by the testing coordinated by the OSG. The OSG and Magellan hold monthly meetings to coordinate our efforts for cloud submission. The Magellan team at NERSC has installed a generic OSG Compute Element on Magellan that opportunistic OSG VOs have successfully utilized. The accounted use of the cloud resources at NERSC is: HCC VO (user Derek Weitzel), 103,091 CPU hours; LIGO VO (user Robert Engel), 425,480 CPU hours.
4.11 Internet2 Joint Activities
Internet2 is an advanced networking consortium led by the research and education community. An exceptional partnership spanning U.S. and international institutions that are leaders in the worlds of research, academia, industry, and government, Internet2 is developing breakthrough cyberinfrastructure technologies that support the most exacting applications of today and spark the most essential innovations of tomorrow. Internet2 regularly collaborates with members of the research community, including the OSG and its associated user community, to develop tools and services and to advance the use and awareness of cutting-edge networking technology.
One of Internet2’s many areas of focus involves the monitoring and management of network
traffic, with an emphasis on identification and correction of network performance problems
through software and personalized training. Performance problems can be challenging for users
and operators alike – the success of bulk data movement and remote collaboration often requires
flawless network performance, free of defects that may induce packet loss or designs that fail to
deal with congestion. Advanced tools, developed by Internet2 and partners, often mitigate these
problems and improve the overall grid computing experience for end users from scientific communities.
Over the past year, Internet2 has focused on designing and constructing software to consume performance metrics from the distributed monitoring infrastructure present at many OSG-affiliated end sites. Network performance metrics, including bandwidth, latency, packet loss, and link utilization, are valuable indicators for discovering potential problem areas as they happen, rather than at a later point in time. Using these early warning components through a reporting framework, such as OSG Gratia and RSV, we provide sites a valuable record of performance over time and allow them an instant view into potential network complications. Internet2 engineers, in collaboration with the USATLAS Virtual Organization (VO) and developers from Brookhaven National Laboratory, have made these probes available through the perfSONAR-PS software framework. This, in conjunction with the development of several graphical interfaces designed to give operators and end users a high-level overview of network performance across the VO, will provide a valuable lens into operational concerns in the OSG community.
Internet2 continues to work with OSG software developers to update the advanced network diagnostic tools already included in the VDT software package. These client applications allow
VDT users to verify the network performance between end site locations and perfSONAR-PS
based servers deployed on campus, regional, backbone, and exchange point networks by allowing on-demand diagnostic tests to be run. The tools enable OSG site administrators and end users
to test any individual compute or storage element in the OSG environment thereby reducing the
time it takes to diagnose performance problems. They allow site administrators to more quickly
determine if a performance problem is due to network specific problems, host configuration issues, or application behavior. The VDT installation has proven to be a particularly useful addition for debugging problems directly related to the storage and compute portions of the infrastructure, particularly when these components are located deep within the core of a network and
may experience connectivity that travels through many interim locations.
In addition to deploying client tools via the VDT, Internet2 staff, working with partners in the
US and internationally, have continued to support and enhance the pS Performance Toolkit, a
complete Linux distribution that contains fully configured performance tools and targeted graphical interfaces to monitor network performance. When used as a “Live CD”, the tools are instantly available and can be deployed anywhere within a network; the alternative method involves dedicating a specific measurement point and installing directly to the hard disk of the target machine. Either use case allows OSG site-admins to quickly stand up a perfSONAR-PS
based server to support end users. These perfSONAR-PS hosts automatically register their existence in a distributed global database, making it easy to find new servers as they become available. All perfSONAR-PS software is maintained in a repository that allows for security updates
and software enhancements; this is normally transparent to the operators and end users. It is also
possible to adopt specific packages from these software storehouses to customize the installation
to the needs of the deploying site.
These measurement servers provide two important functions for OSG site administrators. First, they provide an end point for the client tools deployed via the VDT package: OSG users and site administrators can run on-demand tests to begin troubleshooting performance problems. Second, they host regularly scheduled tests between peer sites, allowing a site to continuously monitor the network performance between itself and the peer sites of interest. The USATLAS community has deployed perfSONAR-PS hosts and is currently using them to monitor network performance between the Tier-1, Tier-2, and a growing number of Tier-3 sites. Internet2 attends weekly USATLAS calls to provide on-going support of these deployments, and has released regular bug fixes. Finally, on-demand testing and regular monitoring can be performed to both peer sites and the Internet2 or ESnet backbone networks using either the client tools or the perfSONAR-PS servers.
The development of tools and strategies to address network performance problems at the site level is one aspect of Internet2’s mission. Training services, particularly in the installation and use of these tools to diagnose network behavior, are a vital part of Internet2’s role in the R&E networking community. In the past year Internet2 has participated in several OSG site-admin workshops and the annual OSG all-hands meeting, and interacted directly with the LHC community to determine how the tools are being used and what improvements are required. Internet2 has provided hands-on training in the use of the client tools, including the command syntax and interpreting the test results. Internet2 has also provided training in the setup and configuration of the perfSONAR-PS server, allowing site administrators to quickly bring up their servers. Finally, Internet2 staff have participated in several troubleshooting exercises; this includes running tests, interpreting the test results, and guiding OSG site admins through the troubleshooting process.
4.12 ESNET Joint Activities
OSG depends on ESnet for the network fabric over which data is transferred and for the ESnet DOE Grids Certificate Authority for the issuing of the X.509 digital identity certificates which are a key component of the OSG authorization methods. ESnet is part of the collaboration delivering and supporting the perfSONAR tools that are now in the VDT distribution. In addition, OSG staff and stakeholders make significant use of ESnet’s collaborative tools for telephone and video meetings.
ESnet and OSG cooperate on a service to provide X.509 digital certificates for both users and hosts of the organizations participating in the Open Science Grid. ESnet runs the DOEGrids Certification Authority and other infrastructure, while the OSG coordinates the registration process with all the participating organizations. Table 7 below lists the number of certificates issued to these organizations over the past few years, for all organizations which have received at least one host certificate in 2011.
Table 7: DOEGrid Certificate Issuance History

                  2006        2007        2008        2009        2010        2011
Organization   User  Host  User  Host  User  Host  User  Host  User  Host  User  Host
Total           296  1716  1122  4307  1013  6169  1103  7533  1163  8178   496  3091
CMS              28   288   197   744   153  1390   215  2401   264  2490    94  1083
FNAL            107   966   224  1591   175  2250   194  2589   147  3044    68   882
BNL              11    32    36   807    31   151    18  1369    16  1303     7   370
ATLAS            39   124   219   381   234  1524   267   331   263   362   113   275
OSGVO            37    73   100   267   106   204    51   221    49   158    27   173
LIGO             32   157   118   317   100   389    97   249    88   264    38    84
OSGEDU            0     0   110     6    71     6    77    76   142   238    58    57
DZERO             2    28     6    59     2    52     4    51     0    65     0    30
ENGAGE            1     0    30     8    63    25    71    35    54    50    15    18
DOSAR             3     4     0    14     1    19     5    48     6    32     3    17
LCG              11    10    17    38     4    50     8    28    12    29     5    17
MIS               2     1     8     2     6    15     2    35     4    23     3    11
NYSGRID           0     0    12     5    17    32    18    27     5    17     6    10
SBGRID            0     0     3     3     9    15    17     7    30     9    16    10
GLOW              4     8    10    21    15    20    13    12    22     9     4     9
SLAC              5    16     9    25     7    10     9    22     7     9     3     8
ALICE             0     0     0     0     0     0    11     0    26     2    16     6
GLUEX             0     0     0     0     0     0     2     4     3    34     1     5
NWICG             8     2    14     8    11     6    18     7     9    16     5     5
NEBIOGRID         0     0     0     0     0     0     1     0     5     6     0     5
PHENIX            6     7     7     6     4     4     2     6     1     5     0     5
SGVO              0     0     0     0     0     0     0     0     0     0    10     5
CIGI              0     0     2     5     4     7     0    13     2     8     0     3
ICECUBE           0     0     0     0     0     0     3     2     7     4     1     2
DAYABAY           0     0     0     0     0     0     0     0     1     1     3     1
In addition to this production service, the ESnet ATF group and OSG interact on a number of activities and development projects. The most significant of these concerns the Science Identity Federation (SIF), where ESnet is playing a leading role in supporting the DOE science laboratories in joining the InCommon Federation and using federated identity providers and services. ESnet has negotiated an agreement with InCommon under which the DOE ASCR Office funds the membership dues for the laboratories, and provides assistance to each laboratory in the process of joining InCommon. At this point, several laboratories have completed this process and a few have set up or are in the process of deploying Shibboleth Identity Providers and Service Providers. We also partner as members of the identity management accreditation bodies in the Americas (TAGPMA) and globally (the International Grid Trust Federation, IGTF).
ESnet and OSG have been working on the next revision of the DOEGrids CA infrastructure. This is primarily a deployment change in the DOEGrids CA components, distributing them around the country to provide a much more reliable and resilient infrastructure while maintaining the essential service characteristics. However, a new RA service is being added, allowing the different service communities of the DOEGrids CA to be separated from each other and enabling significant improvements in flexibility and integration with each service community's work flow for user and resource certification. OSG and ESnet are also implementing contingency plans to make certificates and CA/RA operations more robust and reliable through replication, monitoring, coordinated testing of changes, and defined service performance objectives. As part of this initiative, we completed a working draft of a Service Level Agreement between the DOEGrids CA and OSG which is currently being circulated for management approvals.
5. Training, Outreach and Dissemination
5.1 Training
The OSG Training program brings domain scientists and computer scientists together to provide
a rich training ground for the engagement of students, faculty, and researchers in learning the
OSG infrastructure, applying it to their discipline and contributing to its development. During the
last year, OSG sponsored and conducted various training events. Training organized and delivered by OSG in the last year is identified in the following table:
Workshop                          Length    Location                Month
OSG Summer School                 4 days    Madison, Wisconsin      July, 2010
Site Administrators Workshop      2 days    Nashville, TN           Aug., 2010
OSG Storage Forum                 2 days    Chicago, IL             Sept., 2010
South American Grid Workshop      5 days    Sao Paulo, Brazil       Dec., 2010
OSG Summer School                 4 days    Madison, Wisconsin      June, 2011
The GridUNESP training workshop was held the first week of December in Sao Paulo, Brazil.
Around 40 students attended from multiple science disciplines and 23 institutions. The school,
focused on end-users’ needs, taught students step by step how to join the Grid community, adapt
their scientific applications to the new technology and use it efficiently. Also, students learned
how to run large-scale computing applications, first locally and then using the OSG. Two open
discussions took place, along with one Q&A session with Ruth Pordes and Miron Livny and
hands-on exercises; other OSG staff who lectured included Dan Fraser, Tanya Levshina, Zack
Miller, and Igor Sfiligoi. Joel Snow lectured on the experience of the D0 high-energy physics
experiment using OSG resources, enabling attendees to see how all the concepts and techniques
they learned during the week have been applied by a real experiment. A working meeting between the GOC’s Rob Quick and those developing and deploying a local grid operation center
allowed the sharing of ideas and experiences. Two special lectures by invited local speakers, one
on high performance computing and one on data storage, completed the school. Some comments
from students:
1. “The OSG School in Sao Paulo was very good. The Grid concepts were presented in a theoretical and practical way, with a focus on the main goal of any Grid infrastructure: ‘collaboration.’” - Charles Rodamilans, PhD student of Computer Engineering
2. “… the most impressive course that I have attended, not because of the size of the Open Science Grid, but because of the ability and motivation that instructors demonstrated throughout the course. They were informative and technically knowledgeable.” - Eriksen Costa, Software architect
Figure 53: Students at the Sao Paulo School
3. “The school was fantastic! … It gave me a broadened view of what a grid is, what it is
useful for and how I can use it.” - Griselda Garrido, neuroscience researcher
Other training activities included the organization of the Site Administrators workshop held in Nashville, TN in August 2010; support for VO-focused events such as the US CMS Tier 3 workshop; and the storage forum held in September 2010, which focused on communication of storage technologies and implementations at the sites. For the Site Administrators workshop, experts from around the Consortium contributed to course-ware tutorials that participants used during the event. Some forty site administrators, several brand new to OSG, participated and were impressed by the many experts on hand, which included not only the instructors but also seasoned OSG site administrators, all of whom were eager to help.
Members of the Condor team contributing to OSG participated in the joint CHAIN/EPIKH
School for Grid Site Administrators in China in May 2011, and demonstrated interoperation with
the European gLite middleware9.
OSG staff also participated as keynote speakers, instructors, and/or presenters at venues this year
as detailed in the following table:
Venue                                Length    Location               Month
Oklahoma Supercomputing Symposium    2 days    Norman, Oklahoma       Oct., 2010
EGI User Forum                       1 day     Vilnius, Lithuania     May 2011
5.2 Outreach Activities
We present a selection of the activities in the past year:
• The NSF Task Force on Campus Bridging
• HPC Best Practices Workshop
• Workshops on Distributed Computing, Multidisciplinary Science, and the NSF’s Scientific Software Innovation Institutes Program
• Chair of the Network for Earthquake Engineering Simulation Cyber infrastructure subcommittee of the Project Advisory Committee
• Member of the DOE Knowledge Base requirements group
• Continued co-editorship of the highly successful International Science Grid This Week newsletter, www.isgtw.org (see next section)
9 http://agenda.ct.infn.it/conferenceOtherViews.py?view=standard&confId=474
The Open Science Grid (OSG) achievements benefit from engaged communities of scientists from many disciplines, all collaborating in an effort to build, maintain, and use the US national distributed infrastructure. The continued success of the Open Science Grid depends on such sustained efforts to ensure the longer-term growth in the scale of the resources and the number of scientific users, to increase the functionality, usability, and robustness of the middleware, and to educate and train the future workforce. In support of this, the Open Science Grid Computer Science Student Fellowship provides one year of funding to a graduate or undergraduate student in Computer Science. The fellowship provides funding for up to 20 hours a week for computer science research of value both to the OSG and to the broader community. The research should be associated with an existing Virtual Organization already working with the OSG. The Fellowship is offered to the institutions that are funded as part of the OSG project at the time of application and delivery of the research. For 2010-2011, Derek Weitzel was the recipient of this support at the University of Nebraska – Lincoln.
During the OSG Fellowship, Derek Weitzel assisted in user support and campus grid development. He developed an initial prototype of the NEES workflow running on the OSG, and adapted the Cactus programming environment for execution on the OSG. Derek developed an OSG Job Visualization showing job movement across the OSG for display at the Supercomputing 2010 conference. He developed the Campus Factory, an open source application that creates a transparent interface between clusters on the Campus Grid. He also co-wrote the CHEP paper "Enabling Campus Grids with Open Science Grid Technology" and completed his Master's thesis, "Campus Grids: A Framework to Facilitate Resource Sharing"10.
In March of 2011 SBGrid leadership organized the annual Open Science Grid All-Hands Meeting. The event was well attended and received very good reviews from all participants (Figure
54).
10 https://osg-docdb.opensciencegrid.org:440/cgi-bin/ShowDocument?docid=1052
Figure 54: OSG All-Hands Meeting, held at Harvard Medical School
5.3 Internet dissemination
OSG co-sponsors the weekly online publication, International Science Grid This Week
(http://www.isgtw.org/), in collaboration with the European project e-Science Talk. The publication, which has published 228 issues as of 10 June 2011, is very well received, with subscribers
totaling approximately 8,000, an increase of about 30% in the last year. Combined with support
from NSF and DOE, OSG’s sponsorship has enabled the retention of a full-time US editor and
the launch in 1Q-2011 of a new website with more modern, user-friendly features.
ISGTW has won the trust of its readers in part because it is a joint publication that does not limit
itself to OSG-related content. The result is that the OSG content that is included is given more
weight by readers, making it a useful tool for dissemination.
The OSG also maintains a website, http://www.opensciencegrid.org. The site provides information and guidance for stakeholders and new users of the OSG. The front page features research and technology highlights, headlines linking to relevant news stories, and links to the archive of the OSG Newsletter. Visitors to the website can learn how to get started with grid computing, read about accomplishments of the OSG and its users, peruse a list of scientific papers
that have resulted from the OSG, view live grid status and usage statistics, and much more.
6. Cooperative Agreement Performance
OSG has put in place processes, activities, and reports that meet the terms of the Cooperative
Agreement and Management Plan:
• The Joint Oversight Team meetings are conducted, as scheduled by DOE and NSF, via phone to hear about OSG progress, status, and concerns. Follow-up items are reviewed and addressed by OSG, as needed.
• Two intermediate progress reports were submitted to NSF in February and June of 2007.
• In February 2008, a DOE annual report was submitted.
• In July 2008, an annual report was submitted to NSF.
• In December 2008, a DOE annual report was submitted.
• In June 2009, an annual report was submitted to NSF.
• In January 2010, a DOE annual report was submitted.
• In June 2010, an annual report was submitted to NSF.
• In December 2010, an annual report was submitted to DOE.
• In June 2011, this annual report was submitted to NSF.
As requested by DOE and NSF, OSG staff provides pro-active support in workshops and collaborative efforts to help define, improve, and evolve the US national cyber-infrastructure.