OSG–doc–1054
June 30, 2011
www.opensciencegrid.org
Report to the National Science Foundation
June 2011
Miron Livny (University of Wisconsin), PI, Technical Director
Ruth Pordes (Fermilab), Co-PI, Executive Director
Kent Blackburn (Caltech), Co-PI, Council co-Chair
Paul Avery (University of Florida), Co-PI, Council co-Chair
Table of Contents

1. Executive Summary
   1.1 What is Open Science Grid?
   1.2 Usage of Open Science Grid
   1.3 Science enabled by Open Science Grid
   1.4 Technical achievements in 2010-2011
   1.5 Preparing for the Future
2. Contributions to Science
   2.1 ATLAS
   2.2 CMS
   2.3 LIGO
   2.4 ALICE
   2.5 D0 at Tevatron
   2.6 CDF at Tevatron
   2.7 Nuclear physics
   2.8 Intensity Frontier at Fermilab
   2.9 Astrophysics
   2.10 Structural Biology
      2.10.1 Wide Search Molecular Replacement Workflow
      2.10.2 DEN Workflow
      2.10.3 OSG Infrastructure
      2.10.4 Outreach: Facilitating Access to Cyberinfrastructure
   2.11 Computer Science Research
3. Development of the OSG Distributed Infrastructure
   3.1 Usage of the OSG Facility
   3.2 Middleware/Software
   3.3 Operations
   3.4 Integration and Site Coordination
   3.5 Campus Grids
   3.6 VO and User Support
   3.7 Security
   3.8 Content Management
   3.9 Metrics and Measurements
   3.10 Extending Science Applications
      3.10.1 Scalability, Reliability, and Usability
      3.10.2 Workload Management System
   3.11 OSG Technology Planning
   3.12 Challenges facing OSG
4. Satellite Projects, Partners, and Collaborations
   4.1 CI Team Engagements
   4.2 Condor
      4.2.1 Release Condor
      4.2.2 Support Condor
   4.3 High Throughput Parallel Computing
   4.4 Advanced Network Initiative (ANI) Testing
   4.5 ExTENCI
   4.6 Corral WMS
   4.7 OSG Summer School
   4.8 Science enabled by HCC
   4.9 Virtualization and Clouds
   4.10 Magellan
   4.11 Internet2 Joint Activities
   4.12 ESNET Joint Activities
5. Training, Outreach and Dissemination
   5.1 Training
   5.2 Outreach Activities
   5.3 Internet dissemination
6. Cooperative Agreement Performance
Sections of this report were provided by the scientific members of the OSG Council, OSG PIs and Co-PIs, and OSG staff and partners. Paul Avery and Chander Sehgal acted as the editors.
The scope of this report is: an initial summary of the goals and accomplishments of the OSG; a summary of the OSG-related accomplishments of each of the major scientific contributors and beneficiaries; progress in each of the technical areas of the distributed infrastructure and services; synergistic and beneficial contributions of OSG's important satellites (specifically, contributing projects) and partnerships; a summary of the training, outreach, and dissemination activities; and documentation of Cooperative Agreement performance. The appendices give more detailed information on the publications from, and the usage of OSG by, each of the major scientific communities.
1. Executive Summary

1.1 What is Open Science Grid?
Open Science Grid (OSG) is a large-scale collaboration that is advancing scientific knowledge through high performance computing and data analysis by operating and evolving a cross-domain, nationally distributed cyber-infrastructure (Figure 1).
Meeting the strict demands of the scientific community has not only led OSG to actively drive the frontiers of High Throughput Computing (HTC) and massively Distributed Computing, it has also led to the development of a production-quality facility. OSG's distributed facility, composed of laboratory, campus, and community resources, is designed to meet the current and future needs of scientific operations at all scales. It provides a broad range of common services and support, a software platform, and a set of operational principles that organize and support scientific users and resources via the mechanism of Virtual Organizations (VOs).
The OSG program consists of a Consortium of contributing communities (users, resource administrators, and software providers) and a funded project. The OSG project is jointly funded, until early 2012, by the Department of Energy SciDAC-2 program and the National Science Foundation.
Figure 1: Sites in the OSG Facility
OSG does not own the computing, storage, or network resources used by the scientific community; these resources are contributed by the community, organized by the OSG facility, and governed by the OSG Member Consortium. OSG resources are summarized in Table 1.
Table 1: OSG computing resources

Number of Grid-interfaced processing resources on the production infrastructure: 131
Number of Grid-interfaced data storage resources on the production infrastructure: 61
Number of Campus Infrastructures interfaced to the OSG: 9 (GridUNESP, Clemson, FermiGrid, Purdue, Wisconsin, Buffalo, Nebraska, Oklahoma, SBGrid)
Number of National Grids interoperating with the OSG: 3 (EGI, NDGF, TeraGrid)
Number of processing resources on the integration testing infrastructure: 28
Number of Grid-interfaced data storage resources on the integration testing infrastructure: 11
Number of cores accessible to the OSG infrastructure: ~70,000
Size of disk storage accessible to the OSG infrastructure: ~29 Petabytes
CPU wall clock usage of the OSG infrastructure: average of 56,000 CPU-days/day during May 2011

1.2 Usage of Open Science Grid
High Throughput Computing technology created and incorporated by the OSG and its contributing partners has now advanced to the point that scientific users (VOs) are utilizing more simultaneous resources than ever before. Typical VOs now utilize between 15 and 20 resources, with some routinely using as many as 40–45 simultaneous resources. The transition to a pilot-based job submission system, which was broadly deployed in the OSG last year, has enabled VOs to quickly scale to a much higher utilization level than ever before. For example, SBGrid went from struggling to achieve ~1000 simultaneous jobs to being able to scale to over ~4000 simultaneous jobs in less than a month. The overall usage of OSG has again increased by more than 50% this past year and continues to grow at a steady rate (Figure 2). Utilization by each stakeholder varies depending on its needs during any particular interval. Overall use of the facility for the 12-month period ending June 2011 was 424M hours, compared to 270M hours for the previous 12-month period ending June 2010; detailed usage plots can be found in the attached document on Production on Open Science Grid. During normal, stable operations, OSG now provides over 1.4M CPU wall-clock hours a day (~56,000 CPU-days per day), with peaks occasionally exceeding 1.6M hours a day; approximately 400K–500K opportunistic hours (~30%) are available on a daily basis for resource sharing. Based on transfer accounting, we measure approximately 0.6 PetaBytes of data movement (both intra- and inter-site) on a daily basis, with peaks of 1.2 PetaBytes per day. Of this, we estimate 25% is GridFTP transfers between sites and the rest is via LAN protocols.
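For readers who want to relate these daily figures to one another, the short calculation below converts the wall-clock hours quoted in this paragraph into CPU-days and an opportunistic share. It is only an illustrative sketch using the numbers above, not an OSG accounting tool.

```python
# Illustrative arithmetic only: relates the daily wall-clock figures quoted above.
# The values come from the text of this report; this is not an accounting script.

HOURS_PER_DAY = 24.0

daily_wall_hours = 1.4e6                 # ~1.4M CPU wall-clock hours delivered per day
opportunistic_hours = (4.0e5, 5.0e5)     # ~400K-500K hours/day available opportunistically

cpu_days_per_day = daily_wall_hours / HOURS_PER_DAY
print("Equivalent CPU-days per day: %.0f" % cpu_days_per_day)   # close to the ~56,000 quoted

for h in opportunistic_hours:
    # Opportunistic share of the daily total, roughly 29-36%, i.e. the ~30% quoted above
    print("Opportunistic share: %.0f%%" % (100.0 * h / daily_wall_hours))
```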
Figure 2: OSG Usage (hours/month) from July 2007 to June 2011
The number of non-HEP CPU-hours (Figure 3) is regularly greater than 1 million CPU-hours per week (average 1.1M), even with the LHC actively producing data. LIGO averaged approximately 80K hours/day, and non-physics use now averages 85K hours/day (9% of the total).
Figure 3: OSG non-HEP weekly usage from June 2010 to June 2011. LIGO (shown in red) is the
largest non-HEP contributor.
1.3 Science enabled by Open Science Grid

Published, peer-reviewed papers are one easily measurable metric of the science enabled by the Open Science Grid. The following table shows the publications in the past 12 months:
Table 2: Science Publications in 2010-2011 Resulting from OSG Usage

VO | # pubs | Type of Science | In Addition
ALICE | 4 | US LHC |
ATLAS | 38 | US LHC |
CDF | 43 | Tevatron Run II |
CIGI | 3 | Geographic Information System research |
CMS | 57 | US LHC |
DES | 1 | Astrophysics |
D0 | 23 | Tevatron Run II | 8 accepted & 12 submitted
Engage | 20 | Mathematics modeling, virtualization research, molecular dynamics, protein folding | 3 submitted
GLOW | 19 | Neutrino physics, genomics, chemical modeling, molecular dynamics, proteomics, biology |
Grid UNESP | 10 | Cosmic ray physics, genomics |
HCC | 1 | Protein modelling |
IceCube | 7 | Neutrino Physics |
LIGO | 11 | Gravitational Wave Physics |
Mini-Boone | 1 | Neutrino Physics |
MINOS | 7 | Neutrino Physics |
NYSGRID | 1 | Molecular dynamics |
SBGRID | 3 | Structural Biology |
STAR | 9 | Nuclear Physics |
OSG & DHTC Research | 2 | Computer Science |
Total | 260 | |
Publication counts, however, capture only part of the science enabled by OSG. Infrastructure and service providers such as OSG enable and improve scientific discoveries resulting from theoretical models and experimentally acquired data by providing an important computational "middle layer". The availability of large and easily accessible computational resources provides researchers with powerful opportunities to examine data in unusual ways and explore new scientific possibilities. These benefits are reflected in the US LHC statement of "reduced time to
publication” – as stated in our recent proposal: “This global shared computing infrastructure of
unprecedented scale and throughput has facilitated a transformation in the delivery of results
from the most advanced experimental facility in the world – enabling the public presentation of
results days to weeks after the data is acquired rather than the months to years it took in the recent past.”
We thus pick out some illustrative enabling features of the OSG contributions over the past year:

 The LHC experiments all had an extremely productive 12 months, and the rapid turnaround from beam to publications has been a tremendous success all round and in all areas.
 OSG enabled US ATLAS and US CMS to develop and deploy mechanisms that have supported productive use of over 40 U.S. university Tier-3 sites over the past year.
 The OSG has enabled US LHC to modify their data distribution and analysis models, while maintaining production services, to address the needs of the experiments. For example, US ATLAS designed and deployed new data placement models without perturbing production throughput, and OSG has continued to interoperate well with the Worldwide LHC Computing Grid while a new Computing Element service (CREAM) was tested and deployed in Europe.
 Ramp-up of ALICE production usage on the ALICE USA OSG sites.
 LIGO has continued to make significant use of the OSG for Einstein@Home production and is ramping up its use of OSG services for other analyses.
 The Tevatron D0 and CDF experiments use the OSG facility for a large fraction of their simulation and analysis processing. Both experiments now use OSG job management services to submit their jobs seamlessly across both the European (EGI) and OSG infrastructures.
 This year has seen an increase in science enabled by OSG that is carried out and supported locally at existing and "new" campus distributed infrastructures.
 Beyond the physics communities, notable users include 1) the structural biology group at Harvard Medical School, 2) groups using GLOW, the Holland Computing Center, and NYSGrid, and 3) mathematics research at the University of Colorado.

The structural biology community, supported by the SBGrid project and VO, has significantly increased its use of OSG and published results in the areas of protein structure modeling and prediction. The Harvard Medical School paper was published in Nature, and the methodologies and portal were published in PNAS.
1.4 Technical achievements in 2010-2011
The technical program of work in the OSG project is defined by an annual WBS created by the
area coordinators and managed and tracked by the project manager. The past 12 months saw a
redistribution of staff for the last year of the currently funded project. Table 3 shows the distribution of OSG staff by area and institution:
Table 3: Distribution of OSG staff
Area of Work
OSG Technical Director
Software Tools Group
Production Coordination
LIGO Specific Requests
Software
0.5
0.5
0.5
1.25
6.6
Operations
Integration and Sites
VO Coordination
Engagement
5.5
4.05
0.45
1.3
Campus Grids
Security
Training and Content Management
Extensions
Scalability, Reliability and Usability
Workload Management Systems
support
Internet2
Consortium + Project Coordination
Metrics
Communications and Education
0.75
2.35
1.7
Project Manager
0.85
Total
Contribution1
OSG Staff
1.0
0.17
0.25
1.15
0.1
1.35
0.2
0
0.75
0.4
2.2
Institutions
(lead institution first)
UW Madison
UW Madison, FNAL
U Chicago
Caltech
UW Madison, Fermilab,
LBNL
Indiana U, Fermilab, UCSD
U Chicago, Caltech, LBNL
UCSD, Fermilab
RENCI, ISI, Fermilab,
UW Madison
UChicago, UNL
Fermilab, NCSA, UCSD
Caltech, Fermilab, Indiana U
UCSD, BNL
UCSD
BNL, UCSD
1.08
0.5
1
31.83
0.27
0.2
Internet2
Fermilab, Caltech, UFlorida
0.05
UNebraska
Fermilab, UFlorida, UW
Madison
Fermilab
5.35
1 Contribution included in the WBS. Many other contributions come from external projects including the science communities, software developers, and resource administrators.

By the end of 2010 the OSG staff had decreased by 2 FTEs due to staff departures. More than three quarters of the OSG staff directly support the operations and software for the ongoing stakeholder production and applications; the remaining quarter mainly engages new user communities, extends and proves software and capabilities, and provides management and communications.
The 2010 WBS defines more than 400 tasks, including both ongoing operational tasks and activities to put in place specific tools, capabilities, and extensions. The area coordinators update the WBS quarterly. As new requests are made to the project, the Executive Team prioritizes them against the existing tasks in the WBS. Additions are then made to the WBS to reflect the activities accepted into the work program; some tasks are dropped, and the deliverable dates of other tasks are adjusted according to the new priorities. In FY10 the WBS was 85% accomplished (about the same as in FY09).
In the past 12 months, the main technical achievements that directly support science include:
 Operation, facility, software, and consulting services for LHC data distribution, production, analysis, and simulation as the LHC accelerator came online and the full machinery of the worldwide distributed facility was used "under fire" to deliver understanding of the detectors, physics results, and publications.
 Sustained support for and response to operational issues, and inclusion and distribution of new software and client tools, in support of LHC operations. Improved "ticket-exchange" and "critical problem response" technologies and procedures were put in place, and well exercised, between the WLCG, EGEE/EGI operations services, and the US ATLAS and US CMS support processes.
 Agreed-upon Service Level Agreements were put in place for all operational services provided by OSG (including operation of the WMS pilot submission system), along with a draft of an SLA with ESNET for the CA service.
 OSG carried out a "prove-in" of reliable critical services (e.g., BDII) for the LHC and operated services at levels that meet or exceed the needs of the experiments. In addition, OSG worked with the WLCG to improve the reliability of the BDII-published information for the LHC experiments. This effort included robustness tests of the production infrastructure against failures and outages, and validation of the information by the OSG as well as the WLCG.
 Success in simulated data production for the STAR experiment using virtual machine-based software images on grid and cloud resources.
 Significant improvements in LIGO's ability to use the OSG infrastructure, including adapting Einstein@Home for Condor-G submission (a minimal Condor-G submit sketch appears after this list), resulting in a greater than 5x increase in the use of OSG by Einstein@Home; support for GlideinWMS submission of LIGO analysis applications; and testing of the Binary Inspiral application and ramp-up of the LIGO Pulsar application running across more than 5 OSG sites.
 Delivery of VDT components in "native packaging" for use on the LIGO Data Grid, the OSG Worker Node Client, and other specific components that are high priority for the stakeholders, e.g., glexec for US ATLAS.
 Two significant publications from the structural biology community (SBGrid) based on production running across many OSG sites, as well as a rise in multi-user access through the SBGrid portal software.
 Entry of ALICE USA into full OSG participation for the experiment's US sites, following a successful evaluation activity. This includes WLCG reporting and accounting through the OSG services.
 Sustained and more effective support for further Geant4 validation runs.
 Ongoing support for IceCube and GlueX, and initial engagement with LSST and NEES.
 Increased opportunistic cycles provided to OSG users by our collaborators in Brazil and at Clemson.
 Extension of the campus communities and of locally supported research through the increasingly active and effective campus community at the Holland Computing Center at the University of Nebraska, and support for multi-core applications for the University of Wisconsin-Madison GLOW community.
 Security program activities that continue to improve our defenses and our capabilities for incident detection and response, via review of our procedures by peer grids and adoption of new tools and procedures.
 Better understanding of the role and interfaces of Satellite projects as part of the larger OSG Consortium contributions, and increased technical and educational collaboration with TeraGrid through the NSF-funded joint OSG–TeraGrid effort, ExTENCI, which began in August 2010 (see Section 4.5), and through the joint OSG-TG summer student program.
 Contributions to the WLCG in the areas of the new job execution service (CREAM), use of pilot-based job management technologies, and interfaces to commercial (EC2) and scientific (Magellan, FutureGrid) cloud services.
 Extensive support for US ATLAS and US CMS Tier-3 sites in security vulnerability testing.
 Packaging and testing of XROOTD for the US ATLAS and US CMS Tier-3 sites in collaboration with the XROOTD development project located at SLAC and CERN.
 Improvements (e.g., native packaging of software components, improved documentation, support for "storage only" sites) to reduce the "barrier to entry" to participate as part of the OSG infrastructure. We have held training schools for site administrators and a storage forum as part of the support activities.
 Contributions to the PanDA and GlideinWMS workload management software that have helped improve capability and supported broader adoption of these systems within the experiments, as well as reuse of these technologies by other communities. Configuration of PanDA for the Integration Test Bed automated testing and validation system.
 Support for a central WMS "pilot factory" for more than 5 VOs, and reduction in the barrier for new communities to run production across OSG.
 Continued collaboration with ESnet and Internet2 on perfSONAR and Identity Management.
 Improvement and validation of the collaborative workspace documentation for all OSG areas of work.
 A successful summer school for 17 students and OSG staff mentors, as well as successful educational schools in South and Central America.
 Continuation of excellence in e-publishing through the collaboration with the European Grid Infrastructure, represented by the International Science Grid This Week electronic newsletter (www.isgtw.org). The number of readers continues to grow.
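As a concrete illustration of the Condor-G submission path mentioned in the Einstein@Home item above, the sketch below generates a minimal grid-universe submit description. The gatekeeper hostname, executable, and file names are placeholders, not actual LIGO or OSG endpoints, and a real production workflow adds many more attributes.

```python
# Minimal sketch of a Condor-G (grid universe) submit description of the kind used
# to route jobs to an OSG gatekeeper. All hostnames and file names are placeholders.

submit_description = """\
universe      = grid
grid_resource = gt2 gatekeeper.example.edu/jobmanager-condor
executable    = run_analysis.sh
arguments     = segment_0001
output        = job.out
error         = job.err
log           = job.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
"""

with open("analysis.submit", "w") as f:
    f.write(submit_description)

# The job would then be submitted with: condor_submit analysis.submit
```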
In summary, OSG continues to demonstrate that national cyber infrastructure based on federation of distributed resources can effectively meet the needs of researchers and scientists.
1.5 Preparing for the Future
The OSG project's one-year extension request to the DOE SciDAC-2 program, to enable continuation of support for HEP and NP operations and production until March 2012, was accepted.
In March 2011 we submitted a proposal for OSG "2011-2016" to NSF. The vision covers: sustaining the infrastructure, services, and software; extending the capabilities and capacities of the services and software; and expanding the reach to include new resource types – shared intra-campus infrastructures, commercial and scientific clouds, multi-core compute nodes – and new user communities, such as the NEESComm and LSST programs, that are in the early stages of testing the benefit of OSG to their science.
We held a review of the new OSG proposal in January 2011 with external reviewers invited by the Executive Director. The reviewers were the head of the WLCG project, Ian Bird; the two XD proposal PIs, John Towns and Richard Moore; and senior ESnet manager Bill Johnston. This review was very valuable. Among the outcomes was a better definition of the scope of OSG's services, as shown in Figure 4.
Figure 4: Diagram showing the relationship between users, services and resources
We submitted a proposal to the DOE ASCR SciDAC-3 Institute call for the Institute for Distributed High Throughput Computing (InDHTC), where about half of the program will contribute directly to the OSG program and the other half will extend the research and development of DHTC technologies to a broader set of the DOE scientific communities.
The following documents have been developed as input to planning OSG's future programs:
 National CI and the Campuses (OSG-939)
 Requirements and Principles for the Future of OSG (OSG-938)
 OSG Interface to Satellite Proposals/Projects (OSG-913)
 OSG Architecture (OSG-966)
 Report from the Workshops on Distributed Computing, Multidisciplinary Science, and the NSF's Scientific Software Innovation Institutes Program (OSG-1002)
In further preparation for the future, we have put in place a revised organization (Figure 5) for year 6 of the current project and for the future work. Lothar Bauerdick was endorsed as the new Associate Executive Director of the OSG, with particular responsibility to liaise with other "docked" projects, such as InDHTC, that deliver to the OSG core program of work. The Council Co-chairmanship passed from Kent Blackburn to Rick Snider as part of the preparations for the future of OSG. A revision of the management plan was published.
Figure 5: OSG Consortium and Project org charts
OSG will become a Service Provider (SP) to the XSEDE project – the next phase of the
TeraGrid project – that starts on July 1, 2011. OSG will have representation on the XSEDE SP
Forum, participation in the services offered to an SP, and interoperability with other SPs. The
Technical Director and PI, Miron Livny, is proposed as the SP representative, with the Production Coordinator, Dan Fraser, as the alternate. OSG and XSEDE will have the following “connection points”:
 Between the Campus activity and the Training Education Outreach Service Campus Champions program to facilitate the creation and distribution of materials, training, education and communications related to the resources and services available from OSG;
 Between User Support and the Advanced User Support Services for applications using OSG and in particular making use of both OSG and XSEDE.
The OSG Security Officer will coordinate and collaborate with the XD security program in:
 Common approaches to the risk and software vulnerability assessment together with common personnel at NCSA (Adam Slagell).
 Common approaches to the Federated ID management services together with common personnel at NCSA (Jim Basney).
 Coordination of policy development together with the European community (Jim Marsteller, Jim Barlow).
 Coordination of response and mitigations in the case of security incidents.
Through synergistic activities at the OSG Indiana University Grid Operations Center and the NCSA Operations Activity, OSG and XSEDE will support failover of the operations services when local failures occur.
2. Contributions to Science

2.1 ATLAS
The ATLAS collaboration, consisting of 174 institutes from 38 countries, completed construction of the ATLAS detector at the LHC, and began first colliding-beam data taking in late 2009.
The 44 institutions of U.S. ATLAS made major and unique contributions to the construction of
the ATLAS detector, provided critical support for the ATLAS computing and software program
and detector operations, and contributed significantly to physics analysis, results, and papers
published.
Experience gained during the first year of ATLAS data taking gives us confidence that the grid-based computing model has sufficient flexibility to process, reprocess, distill, disseminate, and analyze ATLAS data in a way that uses both computing and manpower resources efficiently. The computing facilities in the U.S. are based on the Open Science Grid (OSG) middleware stack and currently provide a total of 150k HEP-SPEC06 of processing power and 15 PB of disk space. The Tier-1 center at Brookhaven National Laboratory and the five Tier-2 centers, located at eight universities (Boston University, Harvard University, Indiana University, Michigan State University, University of Chicago, University of Michigan, University of Oklahoma, and University of Texas at Arlington) and at SLAC, have contributed to the worldwide computing effort at the expected level (23% of the total). Time-critical reprocessing tasks were completed at the Tier-1 center within the foreseen time limits, while the Tier-2 centers were widely used for centrally managed production and user analysis.
In the ATLAS computing resource projections for 2011-2013, the requirements for the year 2012 were initially based on an LHC shutdown in 2012 for consolidation of the superconducting magnet splices, and requested for that year only modest increases in computing resources above the level required in 2011, for instance for storage of increased samples of simulated data. The ATLAS resource model foresaw at that time a larger increment in resource requirements from 2012 to 2013, when LHC operations would resume, than from 2011 to 2012. Following the annual LHC
workshop in Chamonix early this year, the experiments and CERN management decided to
change the LHC schedule to continue operations in 2012, and to initiate the shutdown in 2013.
Consequently, ATLAS computing resource requirements for 2012 will be higher than estimated
a year ago. Incremental resources that were originally not expected to be needed until data-taking in 2013 will now be required for data-taking in 2012, calling for a re-profiling of resource requirements.
The computing resource requirements are based upon an assumed factor of ~2 increase, which
requires improvements beyond present simulation-based projections. ATLAS has initiated a
high-priority program to identify and implement improvements to reduce these increases with as
little impact as possible on its physics capability. Improvements have already been implemented
that reduce overall reconstruction time by ~20%, via a combination of code optimization, reduced efficiency for tracks from converted photons, and an increase in pT threshold from 100
MeV to 400 MeV for charged particle tracking. Reduction in ESD event size by ~30% has also
already been achieved via removal of information that can be recomputed, with some loss of precision, from remaining quantities. Work on further improvements in reconstruction time and in
raw, ESD, and AOD event sizes is ongoing. As explained below, the ATLAS data distribution model has also been adapted in order to manage larger event sizes within pledged 2011 resources.
Nonetheless, in light of the significant impact of anticipated increases in pileup, it is not foreseen
that a trigger rate much higher than 200 Hz can be handled within planned resources. Consequently, the computing resource requirements continue to be based upon a 200-Hz trigger output
rate.
The significant increases in data volume due to pileup (and beneficial increases in trigger rate)
provide a substantial challenge to fit within the foreseen 2011 computing resources. To address
this challenge, ATLAS has conducted a careful review of the use of the different data formats,
especially the large ESD event format (1.5 MB/event in 2010 data taking) and developed a new
model for data distribution for 2011 and 2012. ESD will be dropped as a disk-resident format for
unselected data events, and restricted to small sub-samples of ESD for specific calibration purposes (here referred to as “dESD”, derived-ESD samples), limited to the same total data volume
as the total AOD size. Whereas in the past, multiple replicas of two generations of ESD were
disk-resident, a single copy of RAW data will now be retained on disk. This copy of disk-resident RAW data will provide more than AOD information for all events, but will require significantly less disk space than was formerly used by ESD data. There is even the prospect of further
gains in size with compression of the RAW data, currently being investigated. Although some
data analysis operations will not be as convenient or timely under this new plan, the impact of
restricted access to ESD is deemed to be outweighed by the associated reductions in disk space
requirements.
In the new model, to address the challenge posed by larger events from increased pileup, the new
data distribution model for 2011/12 will significantly reduce the number of replicas of AOD sets
and dESD sets at Tier-2 centers. The number of replicas of AOD data from the current version of
processing across the worldwide ATLAS computing facility will be reduced from 10 to 8 and,
for AOD from the previous version, from 10 to 2. The number of replicas of dESD data will be
reduced from 10 to 4. Although the accessibility of data sets, particularly AOD, will be reduced,
these changes are necessary in 2011 in order to fit within pledged resources, and are deemed
preferable to drastic cuts in event size that would affect physics analysis more. It is hoped that
the movement of analysis to further derived data samples (e.g. n-tuples) and the dynamic data
placement mechanism outlined below will somewhat mitigate the reduction in AOD and dESD
accessibility.
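As a purely illustrative calculation of why these replica reductions matter, the sketch below compares the relative Tier-2 disk footprint before and after the change, under the simplifying assumption (not taken from the report) that one full AOD copy and one full dESD set occupy comparable volumes.

```python
# Illustrative estimate only: relative Tier-2 disk footprint before and after the
# replica-count changes described above. It assumes, purely for illustration, that
# one full AOD copy and one full dESD set occupy comparable volumes (V); real
# savings depend on the actual per-format sizes.

V = 1.0  # arbitrary unit: volume of one full AOD copy (dESD set assumed similar)

old_footprint = 10 * V + 10 * V + 10 * V   # current AOD, previous AOD, dESD: 10 replicas each
new_footprint = 8 * V + 2 * V + 4 * V      # reduced to 8, 2, and 4 replicas respectively

print("Relative footprint: %.0f%% of the old model" % (100 * new_footprint / old_footprint))
# -> roughly half the disk space under the stated assumption
```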
As to data placement, operational experience has shown that pre-placement of physics datasets at
Tier-2 sites via a predefined number of replicas per cloud, as in the 2010 data distribution model,
is not optimal regarding efficient use of CPU and disk space. In that model, it is difficult to optimize the use of CPU resources because, in ATLAS analysis, jobs go where their specific input
data reside. Thus, if a site is unlucky, it could have many datasets that are less in demand and its
CPU resources will be underutilized. To mitigate this issue, ATLAS experimented in 2010 with a
dynamic data placement mechanism that increases the number of replicas of data within, and
across, clouds upon demand. This mechanism works well, and can potentially keep Tier-2 disks
full with the most useful data. It improved CPU usage and decreased network bandwidth demands. It is also expected to improve scalability to meet future requirements, when analysis on a
variety of data samples from different years will vary widely. Consequently, this mechanism,
PanDA Dynamic Data Placement (PD2P, further explained below), is now deployed in all Tier-2
clouds. Its algorithm is still being tuned, and the detailed performance gains are not yet quantified precisely.
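To make the mechanism concrete, the sketch below illustrates the kind of demand-driven replication decision described above. The function names, dataset attributes, and thresholds are hypothetical and are not taken from the PanDA/PD2P code base, which is considerably richer.

```python
# Simplified sketch of a demand-driven ("PD2P-style") replication decision:
# replicate a dataset to an additional Tier-2 only once analysis jobs actually
# request it, and only while the candidate site has free space.

def should_replicate(dataset, candidate_site, existing_replicas, min_free_tb=10.0):
    """Return True if 'dataset' should gain a replica at 'candidate_site'."""
    if candidate_site in existing_replicas:
        return False                              # already resident there
    if dataset["pending_analysis_jobs"] == 0:
        return False                              # no demand yet: do not pre-place
    if site_free_space_tb(candidate_site) < dataset["size_tb"] + min_free_tb:
        return False                              # candidate's cache is (nearly) full
    return True

def site_free_space_tb(site):
    # Placeholder: a real system would query the site's space-reporting service.
    return 50.0

# Hypothetical request: a dataset with queued analysis jobs and one existing replica.
request = {"name": "data11_example.AOD", "size_tb": 3.2, "pending_analysis_jobs": 120}
print(should_replicate(request, "MWT2", existing_replicas={"BNL"}))   # -> True
```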
Following the short run in late 2009, LHC collider operations resumed in March 2010. By early December 2010 the ATLAS collaboration had taken 1.2 billion events from proton-proton collisions and more than 200 million events from heavy-ion (HI) collisions. The total RAW (unprocessed) data volume taken at the ATLAS detector in 2010 amounts to almost 2 PB (1.6 PB pp and 0.3 PB HI data). Following a four-month break, LHC operations for the experiments resumed in March 2011, and ATLAS has since collected another 780 TB of RAW data. While the RAW data was directly replicated to all ten ATLAS Tier-1 centers according to their MoU share (the U.S. receives and archives 23% of the total), the derived data was, after prompt reconstruction at the Tier-0 center, distributed to regional Tier-1 centers for group and user analysis and further distributed to the regional Tier-2 centers. Following significant improvements incorporated into the reconstruction code and the availability of improved calibration data, re-reconstruction of all data validated for physics so far was conducted at the Tier-1 centers, while users analyzed the data using resources at the Tier-1 site, all Tier-2 centers, and their own institutional computing facilities. As the amount of initial data was small, we observed a spike in resource usage at the higher levels of the facility, with users running data-reduction steps followed by transfers of the derived, condensed data products to the compute servers they use for interactive analysis; this resulted in reduced utilization of grid resources for a few months until LHC operations resumed in March 2011.
Figure 6: Integrated luminosity as delivered by the LHC and as measured by ATLAS

Figure 7: Volume of RAW and derived data accumulated by ATLAS since the start of 2011 data taking (March 2011). Note: while ESDs are produced at the Tier-0, they are not archived.
Centrally managed Monte Carlo production, collision data reconstruction, and user analysis are ongoing, with some 70,000 concurrent jobs worldwide, of which 19,000 are running on the combined U.S. ATLAS Tier-1 and Tier-2 resources.
Figure 8: OSG CPU hours (124M total) used by ATLAS over 12 months.
On average, the U.S. ATLAS facility contributes 30% of worldwide analysis-related data access. The number of user jobs submitted by the worldwide ATLAS community and brokered by PanDA to U.S. sites has reached an average of 1.2 million per month, peaking occasionally at more than 2 million jobs per month. Over the course of the reporting period more than 17 million user analysis jobs were completed successfully; of these, 11 million ran at the U.S. ATLAS Tier-2 centers and 6 million at the Tier-1 center.
Figure 9: Weekly number of Production and Analysis jobs in the U.S. managed by PanDA
Based on ATLAS' data distribution model, which foresees multiple replicas of the same datasets within regions such as the U.S., a significant problem was observed shortly after 7 TeV data taking started in March 2010: sharply increasing integrated luminosity produced an avalanche of new data. In particular, disk storage at Tier-2 sites filled up rapidly, and a solution had to be found to accommodate the data required for analysis. Based on job statistics that include information about data usage patterns, it was found that only a relatively small fraction of the programmatically replicated data was actually accessed. U.S. ATLAS, in agreement with ATLAS computing management, consequently decided to change the Tier-2 distribution model such that only datasets requested by analysis jobs are replicated. Programmatic replication of large amounts of ESDs was stopped; now only datasets (of all categories) that are explicitly requested by analysis jobs are replicated from the Tier-1 center at BNL to the Tier-2 centers in the U.S. Since June 2010, when the initial version of a PanDA-steered dynamic data placement (PD2P) system was deployed, we have observed a healthy growth of the data volume on disk and no longer face situations where actually needed datasets cannot be accommodated.
Figure 10: Cumulative evolution for DATADISK at Tier-2 centers in the U.S.
Figure 10 clearly shows the exponential growth of the disk space utilization in April and May
2010 up to the point in June when the dynamic data placement system was introduced. Since then the data volume on disk has remained almost constant, despite the exponential growth of integrated luminosity and of the data volume of interest for analysis. Meanwhile the usage of PD2P, which is fully
transparent to users, was extended to all regions that provide computing resources to ATLAS.
Disk capacities as provisioned at the Tier-2 centers have evolved from a kind of archival storage
to a well-managed caching system. As a result of recent discussions it was decided to further develop the system by including the Tier-1 centers and evolve the distribution model such that it is
no longer based on a strict hierarchy but allows direct access across all present (ATLAS cloud)
hierarchy levels. Also part of a future model is remote access to files and fractions thereof, rather
than having to rely on formal dataset subscriptions via the ATLAS distributed data management
system prior to getting access to the data. The initial implementation of such a system will be
based on xrootd, which is a storage management solution that has existed for quite some time
and is locally used by high-energy and nuclear physics communities. Based on a CMS proposal
xrootd can be further exploited in the context of geographically distributed (federated) storage
systems where the software provides a scalable mechanism for global data discovery, presenting
a variety of different storage management systems, as they are deployed at sites, to the application as if they were the same.
With PD2P, caching is performed at the dataset level, a coarse-grained level. PD2P caching relies on predicting future (re)use of cached data; the data cache triggered by initial usage is not itself used unless subsequent reuse takes place at the cache site. Nonetheless PD2P represents a large improvement in terms of disk usage efficiency and manageability over policy-based pre-placement.
The federated xrootd system will extend ATLAS’ use of caching in new directions. In addition to
providing a means of transparently accessing data that is not local to the site but in a remote
xrootd storage area, it will also support local caching of data that is remotely accessed at the file
level, such that e.g. a Tier-3 center builds up a local cache of data files being used at the site. The
caching in this scheme increases the granularity from the dataset level of PD2P to the file level.
ATLAS intends to implement and evaluate a still more fine-grained approach to caching, below
the file level. The approach takes advantage of a ROOT-based caching mechanism as well as recent efficiency gains in ROOT I/O implemented by the ROOT team that minimize the number of transactions with storage during data read operations, which, particularly over the Wide Area Network (WAN), are very expensive in terms of latency. It also utilizes development work performed by CERN-IT on a custom xrootd server which operates on the client side to direct ROOT I/O requests to remote xrootd storage, transparently caching at the block level data that is retrieved over the WAN and passed on to the application. Subsequent local use of the data hits the cache rather than the WAN. This benefits not only the latency seen by a client utilizing cached data, but also the source site, which is freed from the need to serve already-delivered data. In addition, caching obviously saves network capacity.
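A minimal PyROOT sketch of this access pattern, reading a remote file over xrootd with ROOT's TTreeCache enabled, is shown below. The storage URL and tree name are placeholders, and ROOT built with xrootd support is assumed; the block-level client-side caching described above would sit transparently underneath such reads.

```python
# Minimal PyROOT sketch of WAN-friendly remote reads over xrootd with ROOT's
# TTreeCache enabled. The URL and tree name are placeholders.
import ROOT

f = ROOT.TFile.Open("root://xrootd.example.org//atlas/dataset/file.root")
tree = f.Get("CollectionTree")          # placeholder tree name

tree.SetCacheSize(30 * 1024 * 1024)     # 30 MB TTreeCache: prefetch baskets in bulk
tree.AddBranchToCache("*", True)        # cache all branches read by this job

for i in range(tree.GetEntries()):
    tree.GetEntry(i)                    # reads are served from the cache after prefetch

f.Close()
```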
Deriving benefit from fine-grained caching depends upon re-use of the cache. As one approach
to maximizing re-use, PanDA’s existing mechanism for brokering jobs to worker nodes on the
basis of data affinity will be applied to this case, such that jobs are preferentially brokered to
sites which have run jobs utilizing the same input files.
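The brokering idea can be sketched as follows; this is only an illustration of the data-affinity heuristic described above, with hypothetical names, and is not PanDA's actual brokerage code.

```python
# Simplified sketch of brokering by data affinity: prefer the site whose local
# cache already holds the largest fraction of a job's input files.

def choose_site(job_input_files, site_caches):
    """site_caches maps site name -> set of files known to be cached there."""
    def cached_fraction(site):
        cached = site_caches[site] & set(job_input_files)
        return len(cached) / float(len(job_input_files))
    # Highest cached fraction wins; ties fall back to alphabetical order.
    return max(sorted(site_caches), key=cached_fraction)

caches = {
    "SiteA": {"f1.root", "f2.root"},
    "SiteB": {"f2.root", "f3.root", "f4.root"},
}
print(choose_site(["f2.root", "f3.root"], caches))   # -> SiteB
```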
Non-PanDA-based applications using data at the cache site will also automatically benefit from the cache. The approach will integrate well with the federated xrootd system; it adds an automatic local caching capability to the federation. It may also be of interest in the context of serving data to applications running in commercial clouds, where the expense of data import and in-cloud storage could make fine-grained caching efficiencies valuable.
Once integrated into the OSG Distributed High Throughput Computing (DHTC) services, some
or all of the previously described capabilities will be available to the entire spectrum of scientific
communities served by the OSG. This will be accomplished through the Virtual Data Toolkit
(VDT) software infrastructure.
To support users running data analysis, ATLAS has built a powerful system for computing activities on top of three major grid infrastructures: the Open Science Grid (OSG) in the U.S., EGI in Europe and Asia, and ARC in the Nordic countries. As expected, with data finally arriving, physicists need dedicated resources for analysis activities. In contrast to the existing grid infrastructure, there is a strong need to provide users with data control and quasi-interactive, high-performance data access. U.S. ATLAS, in close collaboration with OSG, has designed and implemented a Tier-3 solution that is targeted to provide efficient and manageable analysis computing at each
member institution. For most of the ~40 sites in the U.S., only a small fraction of a physicist's or student's time can be devoted to computing support. Transformative technologies have therefore been chosen and integrated with the existing ATLAS tools. The result is a site that is substantially simpler to maintain and that is essentially operated by client tools and extensive use of caching technologies. The most promising technologies ATLAS is using include xrootd for distributed storage management and the CERN Virtual Machine File System (CVMFS) for ATLAS software distribution and conditions data access.
Open Science Grid has organized monthly Tier-3 liaison meetings between several members of the OSG facilities, U.S. ATLAS, and U.S. CMS. During these meetings, topics discussed include cluster management, site configuration, site security, storage technology, site design, and experiment-specific Tier-3 requirements.
U.S. ATLAS (contributing to ATLAS as a whole) relies extensively on services and software
provided by OSG, as well as on processes and support systems that have been developed and
implemented by OSG. OSG has become essential for the operation of the worldwide distributed
ATLAS computing facility and the OSG efforts have aided the integration with WLCG partners
in Europe and Asia. The derived components and procedures have become the basis for support
and operation covering the interoperation between OSG, EGI, and other grid sites relevant to
ATLAS data analysis. OSG provides software components that are interoperable with European
ATLAS grid sites, including selected components from the gLite middleware stack such as client
utilities, and LHC Computing Grid File Catalog (LFC).
It is vital to the ATLAS collaboration that the present level of service continues uninterrupted for
the foreseeable future, and that all of the services and support structures upon which U.S. ATLAS relies today are properly maintained and have a clear continuation strategy.
The Blueprint working group within OSG, tasked with developing a coherent middleware architecture, has made significant progress that benefits ATLAS. Important ingredients include "native packaging" of middleware components, which can now be deployed as well-managed RPMs instead of a collection of Pacman modules, and the CREAM Computing Element (CE), which addresses scaling and resilience issues observed with the Globus Toolkit 2-based technology ATLAS has used so far.
Middleware deployment support provides an essential and complex function for U.S. ATLAS
facilities. For example, support for testing, certifying and building a foundational middleware for
production and distributed analysis activities is a continuing requirement, as is the need for coordination of the roll out, deployment, debugging and support for the middleware services. In addition, some level of preproduction deployment testing has been shown to be indispensable. This
testing is currently supported through the OSG Integration Test Bed (ITB) providing the underlying grid infrastructure at several sites along with a dedicated test instance of PanDA, the ATLAS
Production and Distributed Analysis system. These elements implement the essential function of
validation processes that accompany incorporation of new and new versions of grid middleware
services into the Virtual Data Toolkit (VDT), which provides a coherent OSG software component repository. U.S. ATLAS relies on the VDT and OSG packaging, installation, and configuration processes to provide a well-documented and easily deployable OSG software stack.
U.S. ATLAS greatly benefits from OSG’s Gratia accounting services, as well as the information
services and probes that provide statistical data about facility resource usage and site information
passed to the application layer and to WLCG for review of compliance with MoU agreements.
An essential component of grid operations is operational security coordination. The coordinator
provided by OSG has good contacts with security representatives at the U.S. ATLAS Tier-1 center and Tier-2 sites and is closely connected to experts representing grid computing resources
outside the U.S. Thanks to activities initiated and coordinated by OSG a strong operational security community is in place in the U.S., ensuring that security problems are well coordinated
across the distributed infrastructure.
In the area of middleware extensions, U.S. ATLAS continued to benefit from the OSG’s support
for and involvement in the U.S. ATLAS-developed distributed processing and analysis system
(PanDA) layered over the OSG’s job management, storage management, security and
information system middleware and services. PanDA provides a uniform interface and utilization
model for the experiment's exploitation of the grid, extending across OSG, EGI and Nordugrid. It
is the basis for distributed analysis and production ATLAS-wide, and is also used by OSG as a
WMS available to OSG VOs, as well as a PanDA based service for OSG Integrated Testbed
(ITB) test job submission, monitoring and automation. This year the OSG’s WMS extensions
program continued to provide the effort and expertise on PanDA security that has been essential
to establish and maintain PanDA’s validation as a secure system deployable in production on the
grids. In particular PanDA’s glexec-based pilot security system developed in this program was
brought to production readiness through continued testing in the U.S. and Europe throughout the
year.
Another important extension activity during the past year was in WMS monitoring software and
information systems. During the year ATLAS and U.S. ATLAS continued the process of
merging the PanDA/US monitoring effort with CERN-based monitoring efforts, together with
the ATLAS Grid Information System (AGIS) that integrates ATLAS-specific information with
the grid information systems. The agreed common approach utilizes a python apache service
serving json-formatted monitoring data to rich jQuery-based clients. A prototype PanDA
monitoring infrastructure based on this approach was initiated last year and further developed
this year; elements of it are now integrated in production PanDA monitoring and the
development approach and architecture has been validated as our evolution path. In light of
Oracle scaling limitations first seen last year, a program investigating alternative back end DB
technologies was begun this year, particularly for the deep archive of job and file data. This
archive shows the most severe scaling limitations and has access patterns amenable to so-called
‘noSQL’ storage approaches, in particular to the highly scalable key-value pair based systems
such as Cassandra and Hive that have emerged as open source software from web behemoths
such as Google, Amazon and Facebook. Cassandra has been adopted as the basis for a prototype
job data archive. During the year a Cassandra testbed was established at BNL, schema designs
were developed and tested, and a full year of PanDA job data was published to the system for
performance evaluations, which look very promising. Results will be presented at a WLCG
database workshop at CERN in June 2011.
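To illustrate the serving pattern referred to above (a Python service behind Apache returning JSON-formatted monitoring data to jQuery-based clients), the following is a minimal sketch of a WSGI application. The payload and names are fabricated for illustration and do not reflect the actual PanDA monitoring schema or deployment.

```python
# Minimal sketch of the serving pattern described above: a Python WSGI application
# (deployable behind Apache/mod_wsgi) returning job-summary data as JSON for a
# JavaScript client to render. The payload is a hard-coded illustration only.
import json

def application(environ, start_response):
    summary = {
        "site": "ANALY_EXAMPLE",          # placeholder site name
        "jobs": {"running": 1250, "queued": 430, "failed": 17},
    }
    body = json.dumps(summary).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]

if __name__ == "__main__":
    # Stand-alone test server for local development only.
    from wsgiref.simple_server import make_server
    make_server("localhost", 8080, application).serve_forever()
```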
2.2 CMS
During 2010 CMS transitioned from commissioning the detector to producing its first physics results across the entire range of physics topics in the CMS physics program. Several tens of scientific papers have been published in peer-reviewed journals, including Physical Review Letters, Physics Letters B, the European Physical Journal C, and the Journal of High Energy Physics.
Among the published results are already some first surprises, like the first "Observation of Long-Range, Near-Side Angular Correlations in Proton-Proton Collisions"; others are searches for new physics that already exceed the sensitivity reached by previous generations of experiments; still others, including the first observation of top pair production at the LHC, are major milestones that measure the cross sections for the dominant Standard Model background processes to much of the ongoing, as well as future, new physics searches.
By the middle of June 2011, the LHC had delivered more than 1/fb of integrated luminosity. CMS thus expects to present results at EPS at the end of July 2011 based on this data sample, a factor of 30 more data than the published 2010 results were based on. This should make for a very exciting summer conference season.
Computing has proven to be the enabling technology it was designed to be, providing an agile environment for scientific discovery. U.S. CMS resources available via the Open Science Grid have been particularly important to the scientific output of the CMS experiment. The seven Tier-2 sites are among the ten most heavily used Tier-2 sites globally, as shown in Figure XX. Figure YY shows the number of pending jobs versus time. During the peak usage times, the U.S. sites also have both the most running and the most pending jobs, indicating that we are at this point resource limited during peak times. In terms of organized production, which includes reprocessing as well as simulation, Figure ZZ shows that several of the U.S. Tier-2s support the same number of running jobs as some of the global Tier-1 centers.
With regard to data transfer volume, more than 5 PB (3 PB) of data was received by (sourced from) the U.S. Tier-2 centers during this period. In comparison, the FNAL Tier-1 center received (sourced) slightly less than 5 PB (7 PB) of data during the same period. To put this in perspective, the U.S. Tier-1 and Tier-2s together sent and received more data than all non-U.S. Tier-1s combined.
The U.S. leadership position within CMS, as indicated by these metrics, is attributable to the superior reliability and agility of U.S. sites. We host a complete copy of all core data samples distributed across the seven US Tier-2 sites, and due to the excellent performance of the storage infrastructures, we are able to refresh data quickly. It is thus not uncommon that data becomes available first at U.S. sites, attracting time-critical data analysis to those sites.
The Open Science Grid has been a significant contributor to this success by providing critical computing infrastructure, operations, and security services. These contributions have allowed U.S. CMS to focus experiment resources on being prepared for analysis and data processing, by saving effort in areas provided by OSG. OSG provides a common set of computing infrastructure services on top of which CMS, with development effort from the U.S., has been able to build a reliable processing and analysis framework that runs on the Tier-1 facility at Fermilab, the project-supported Tier-2 university computing centers, and opportunistic Tier-3 centers at universities. There are currently 27 Tier-3 centers registered with the CMS data grid in the U.S.; 20 of them provide additional simulation and analysis resources via the OSG. The remainder are universities that receive CMS data via the CMS data grid, using an OSG storage element API, but do not (yet) make any CPU cycles available to the general community. OSG and US CMS work closely together to ensure that these Tier-3 centers are fully integrated into the globally distributed computing system that CMS science depends on.
In addition to common interfaces, OSG provides the packaging, configuration, and support of the storage services. Since the beginning of OSG, the operation of storage at the Tier-2 centers has improved steadily in reliability and performance. OSG plays a crucial role here for CMS by operating a clearinghouse and point of contact between the developers and the sites that deploy and operate this technology. In addition, OSG fills gaps left open by the developers in the areas of integration, testing, and tools that ease operations.
OSG has also been crucial in ensuring that U.S. interests are addressed in the WLCG. The U.S. represents a large fraction of the collaboration, both in participants and in capacity, but a small fraction of the sites that make up the WLCG. OSG is able to provide a common infrastructure for operations, including support tickets, accounting, availability monitoring, interoperability, and documentation. Now that CMS is taking data, sustainable security models and regular accounting of available and used resources are crucial. The common accounting and security infrastructure and the personnel provided by OSG represent significant benefits to the experiment, with the teams at Fermilab and the University of Nebraska providing development and operations support, including the reporting and validation of accounting information between OSG and the WLCG.
In addition to these general statements, we would like to point to two specific developments that have become increasingly important to CMS within the last year. Within the last two to three years, OSG developed the concept of “Satellite projects” and the notion of an “ecosystem” of independent technology projects that enhance the overall national computing infrastructure in close collaboration with OSG. CMS is starting to benefit from this concept, as it has stimulated close collaboration with computer scientists on a range of issues including 100 Gbps networking, workload management, cloud computing and virtualization, and High Throughput Parallel Computing, which we expect will lead to multi-core scheduling as the dominant paradigm for CMS in a few years’ time. The existence of OSG as a “collaboratory” allows us to explore these important technology directions in ways that are much more cost effective, and more likely to be successful, than if we were pursuing these new technologies in a narrow, CMS-specific context.
Finally, within the last year, we have seen increasing adoption of technologies and services originally developed for CMS. Most intriguing is the deployment of glideinWMS as an OSG service,
adopted by a diverse set of customers including structural biology, nuclear physics, applied
mathematics, chemistry, astrophysics, and CMS data analysis. A single instance of this service is
jointly operated by OSG and CMS at UCSD for the benefit of all of these communities. OSG
developed a Service Level Agreement that is now being reviewed for possible adoption in Europe as well. Additional instances are operated at FNAL for the Tevatron Run II experiments, MINOS, and CMS data reprocessing at Tier-1 centers.
Figure 11: OSG CPU hours (115M total) used by CMS over 12 months, color-coded by facility.
Figure 12: Average number of running analysis jobs per week by CMS worldwide
Figure 13: Average number of pending analysis jobs per week by CMS worldwide
Figure 14: Average number of running production jobs per week by CMS worldwide
2.3 LIGO
LIGO continues to leverage the Open Science Grid for opportunistic computing cycles associated with its grid-based Einstein@Home application, known as Einstein@OSG. This application is one of several in use for an “all-sky” search for gravitational waves of a periodic nature attributed to elliptically deformed pulsars. Such a search requires enormous computational resources to fully exploit the science content available within LIGO’s vast datasets. Volunteer and opportunistic computing based on BOINC (the Berkeley Open Infrastructure for Network Computing) has been leveraged to utilize as many computing resources worldwide as possible. Since the Einstein@OSG code was ported to the Open Science Grid a little over two years ago, steady advances in code performance, reliability, and overall deployment onto the Open Science Grid have been demonstrated. OSG has routinely ranked in the top five computational providers for this LIGO analysis worldwide. This year the Open Science Grid has provided more than 18 million CPU-hours toward this search for pulsar signals.
Figure 15: Opportunistic usage of the OSG by LIGO’s grid based Einstein@Home application for
the current year.
This year has also seen development effort on a variation of the search for gravitational waves from pulsars, with the porting of the “PowerFlux” application onto the Open Science Grid. This is also a broadband search, but it uses a power-averaging scheme to cover a large region of the sky over a broad frequency band more quickly. Its computational needs are smaller than those of the Einstein@Home application, at the expense of lower signal resolution. The code is currently being wrapped to provide better monitoring in a grid environment, where remote login is not supported.
One of the most promising sources of gravitational waves for LIGO is the inspiral of a binary system of compact black holes and/or neutron stars, which emits gravitational radiation leading to the ultimate coalescence of the pair. The binary inspiral data analyses typically involve working with tens of terabytes of data in a single workflow. Collaborating with the Pegasus Workflow Planner developers at USC-ISI, LIGO continues to identify changes to both Pegasus and the binary inspiral workflow codes to use the OSG and its emerging storage technology more efficiently, since data must be moved from LIGO archives to storage resources near the worker nodes on OSG sites.
One area of intense focus this year has been the understanding, and integration into workflows, of the Storage Resource Management (SRM) technologies used at OSG Storage Element (SE) sites to house the large volumes of data used by the binary inspiral workflows, so that worker nodes running the binary inspiral codes can access the LIGO data effectively. The SRM-based Storage Element established on the LIGO Caltech OSG integration testbed site is being used as a development and test platform to get this effort under way without impacting OSG production facilities. Using Pegasus for the workflow planning, DAGs for the binary inspiral data analysis using of order ten terabytes of LIGO data have been run successfully on three production sites.
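To illustrate how such a workflow is expressed for Pegasus planning, the following is a minimal sketch only, written against the Pegasus 3.x DAX3 Python API; the file names, archive URL, site handle, and executable name are hypothetical stand-ins, and the real binary-inspiral DAX generators are part of the LIGO analysis pipelines.

    # Minimal sketch of an abstract workflow (DAX) for Pegasus planning, assuming the
    # Pegasus 3.x DAX3 Python API.  File names, the archive URL, the site handle, and
    # the executable name are hypothetical.
    from Pegasus.DAX3 import ADAG, Job, File, Link, PFN

    dax = ADAG("inspiral-sketch")

    # One input frame file; its registered physical location lets Pegasus stage it
    # from the archive to storage near the worker nodes at plan time.
    frame = File("H-H1_EXAMPLE-968000000-4096.gwf")
    frame.addPFN(PFN("gsiftp://ligo-archive.example.org/frames/H-H1_EXAMPLE-968000000-4096.gwf",
                     "ligo-archive"))
    dax.addFile(frame)

    triggers = File("inspiral-triggers.xml")

    job = Job("inspiral_analysis")   # executable name resolved via the transformation catalog
    job.addArguments("--frame", frame, "--output", triggers)
    job.uses(frame, link=Link.INPUT)
    job.uses(triggers, link=Link.OUTPUT, transfer=True)
    dax.addJob(job)

    with open("inspiral-sketch.dax", "w") as f:
        dax.writeXML(f)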
Performance studies this year suggest that glide-in technologies can greatly reduce the total run time of these large workflows, which comprise tens of thousands of jobs. Pegasus, in conjunction with its Corral glide-in features, has produced further gains in the ability to port and effectively run a complex LIGO data analysis workflow, originally designed for the LIGO Data Grid, on the Open Science Grid; the two environments are sufficiently similar to make this possible, but sufficiently different to require detailed investigation and development to reach the desired science-driven goals.
LIGO continues working closely with the OSG Security team, DOE Grids, and ESnet to evaluate
the implications of its requirements on authentication and authorization within its own LIGO Data Grid user community and how these requirements map onto the security model of the OSG.
2.4 ALICE
The ALICE experiment at the LHC relies on a mature grid framework, AliEn, to provide computing resources in a production environment for the simulation, reconstruction and analysis of
physics data. Developed by the ALICE Collaboration, the framework is fully operational with
sites deployed at ALICE and WLCG Grid facilities worldwide. During 2010, the ALICE USA collaboration deployed significant compute and storage resources in the US, anchored by new Tier-2 centers at LBNL/NERSC and LLNL. These resources, accessible via the AliEn grid framework, are being integrated with OSG to provide accounting and monitoring information to ALICE and the WLCG while allowing unused cycles to be used by other nuclear physics groups.
In early 2010, the ALICE USA Collaboration’s Computing plan was formally adopted. The plan
specifies resource deployments at both the existing NERSC/PDSF cluster at LBNL and the
LLNL/LC facility, and operational milestones for meeting ALICE USA’s required computing
contributions to the ALICE experiment. A centerpiece of the plan is the integration of these resources with the OSG in order to leverage OSG capabilities for accessing and monitoring distributed compute resources. Milestones for this work included: completion of more extensive
scale-tests of the AliEn-OSG interface to ensure stable operations at full ALICE production
rates, establishment of operational OSG resources at both facilities, and activation of OSG reporting of resource usage by ALICE to the WLCG. During this past year, with the support of
OSG personnel, we have met most of the goals set forth in the computing plan.
NERSC/PDSF has operated as an OSG facility for several years and was the target site for the
initial development and testing of an AliEn-OSG interface. With new hardware deployed for
ALICE on PDSF by July 2010, a new set of scaling tests was carried out, demonstrating that the AliEn-OSG interface can sustain the job submission rates and steady-state job occupancy required by the ALICE team. Since about mid-July 2010, ALICE has run production at PDSF through the OSG CE interface with a steady job concurrency of about 300 or more jobs, consistent with the computing plan.
Figure 16: Cumulative cpu-hours delivered since August 2010 from the NERSC/PDSF and
LLNL/LC facilities to ALICE as measured by OSG accounting records. The LLNL/LC facility,
operational since September of 2010, was integrated with OSG accounting in December of 2010.
During the fall of 2010, a small OSG-ALICE task force was reinstated to facilitate further integration with OSG. Work in the group focused on the ALICE need for resource utilization to be reported by OSG to the WLCG. This work has included performing crosschecks on the accounting records reported by the PDSF OSG site, as well as developing additional tools needed to deploy OSG usage reporting at the LLNL/LC facility. As a result of these efforts, the LLNL/LC facility began sending job information to OSG in December 2010, and both facilities now report accounting records to the WLCG as part of normal OSG operations. Cumulative CPU-hours recorded by OSG from the two sites are shown in Figure 16 and illustrate the initial deployment of resource reporting at LLNL/LC in December.
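As an illustration only of the kind of crosscheck performed, a minimal comparison of monthly wall-time totals might look like the sketch below; the file names, CSV layout, and tolerance are hypothetical and are not the actual Gratia or WLCG report formats.

    # Illustrative only: compare monthly wall-hour totals reported through OSG accounting
    # with the corresponding WLCG report and flag months that disagree beyond a tolerance.
    # Input file names and the CSV layout (month,wall_hours) are assumptions.
    import csv

    def load_totals(path):
        """Read a CSV of (month, wall_hours) into a dict keyed by month."""
        with open(path, newline="") as f:
            return {row["month"]: float(row["wall_hours"]) for row in csv.DictReader(f)}

    def crosscheck(osg_csv, wlcg_csv, tolerance=0.05):
        osg, wlcg = load_totals(osg_csv), load_totals(wlcg_csv)
        for month in sorted(set(osg) | set(wlcg)):
            a, b = osg.get(month, 0.0), wlcg.get(month, 0.0)
            if max(a, b) > 0 and abs(a - b) / max(a, b) > tolerance:
                print("%s: OSG=%.0f WLCG=%.0f differ by more than %.0f%%"
                      % (month, a, b, 100 * tolerance))

    # Example usage (hypothetical files): crosscheck("pdsf_osg_2010.csv", "wlcg_report_2010.csv")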
ALICE continues to work with OSG on several issues that target optimal use of both ALICE and OSG-accessible resources. The ALICE site at LLNL/LC is currently working on a full OSG installation that will eventually include operation on the SLURM batch system preferred at LLNL/LC, which OSG does not currently support, and will allow other OSG VOs opportunistic access to those resources. In addition, the OSG software team is working on the integration of the CREAM-CE as an option in the OSG software stack. The ALICE USA team plans to participate in the integration tests of this software, since such an option would allow ALICE USA facilities to run with a job-submission and monitoring model identical to that of their European counterparts. Finally, the ALICE USA computing effort is evaluating modifications to the AliEn workflow that will allow ALICE to make opportunistic use of other OSG resources. We expect these efforts to continue over the next year.
2.5 D0 at Tevatron
The D0 experiment continues to rely heavily on OSG infrastructure and resources to meet its computing demands. D0 has successfully used OSG resources for many years and plans to continue this very successful relationship into the foreseeable future. This usage has resulted in a tremendous science publication record (Figure 17), including contributions to improved limits on the Higgs mass exclusion, as shown in Figure 18. D0 produced 47 different results for the Winter/Spring 2011 conferences; see http://www-d0.fnal.gov/Run2Physics/ResultsWinter2011.html
Figure 17: Number of publications from the D0 experiment versus year.
Figure 18: The latest combined D0 and CDF results on the observed and expected 95% confidence level upper limits on the ratio to the Standard Model cross section, as a function of the Higgs mass.
All D0 Monte Carlo simulation is generated at remote sites, with OSG continuing to be a major contributor. During the past year, OSG sites simulated approximately 600 million events for D0 (almost 100 million more than were produced in the previous year), approximately one third of all production. The rate of production has nearly leveled off over the past year, as almost all major sources of inefficiency have been resolved and D0 continues to use OSG resources very efficiently. Changes in job preemption policy at numerous sites, the continued use of automated job submission, and the use of resource selection have allowed D0 to use OSG resources opportunistically to produce large samples of Monte Carlo events efficiently. D0 continues to use approximately 30 OSG sites regularly in its Monte Carlo production. Figure 19 shows a snapshot of idle and running Monte Carlo jobs on OSG on a typical day, showing that many different sites are being utilized by D0.
The total number of D0 OSG MC events produced over the past several years has exceeded 1.6
billion events (Figure 20).
Over the past year, the average number of Monte Carlo events produced per week by OSG has remained approximately constant. Since we use the computing resources opportunistically, it is notable that, on average, we can maintain an approximately constant rate of nearly 15 million MC events/week (Figure 21). In one week of April 2011, 26 million events were produced, a record for D0 MC production. Any dips in OSG production are now due only to D0 switching to new software releases, which temporarily stops our requests to OSG. Over the past year D0 has been able to obtain the necessary opportunistic resources to meet our Monte Carlo needs even though the LHC also has high demand; we have achieved this by continuing to improve our efficiency and by adding additional resources when they become available. The Tevatron program will end in September 2011, by which time D0 is expected to have accumulated nearly 11 fb-1 of data. It will take many years to analyze this huge data set, and Monte Carlo production will continue to be in high demand, so although the Tevatron accelerator will shut down, D0 will continue to need OSG resources for many more years.
D0 OSG MC jobs: current job distribution over all OSG queues (from Condor)

Running   Idle    Site
  324        0    antaeus.hpcc.ttu.edu
   30      392    ce.grid.unesp.br
   35       65    ce01.cmsaf.mit.edu
    1        5    cit-gatekeeper.ultralight.org
    1      303    cmsgrid01.hep.wisc.edu
    2     1685    cmsosgce3.fnal.gov
  831       48    condor1.oscer.ou.edu
  478        0    d0cabosg1.fnal.gov
 1001      563    d0cabosg2.fnal.gov
  371        0    fermigridosg1.fnal.gov
   31      445    ff-grid.unl.edu
   38      445    ff-grid3.unl.edu
  180      311    fnpcosg1.fnal.gov
   10       53    gk01.atlas-swt2.org
   10      133    gk04.swt2.uta.edu
   50        0    gluskap.phys.uconn.edu
   15       90    gridgk01.racf.bnl.gov
    1      118    gridgk02.racf.bnl.gov
  456      457    msu-osg.aglt2.org
   49      109    nys1.cac.cornell.edu
    2        0    osg-ce.sprace.org.br
   16       68    osg-gk.mwt2.org
   17      290    osg-gw-2.t2.ucsd.edu
   12        0    osg.rcac.purdue.edu
   64      187    osg1.loni.org
   81      332    ouhep0.nhn.ou.edu
   19      164    pg.ihepa.ufl.edu
    5      280    red.unl.edu
  210        0    tier3-atlas1.bellarmine.edu
    4        0    umiss001.hep.olemiss.edu

Total running jobs: 4344
Total idle jobs: 6543
Figure 19: Current D0 jobs distribution from Condor
Figure 20: Cumulative number of D0 MC events generated by OSG during the past year.
Figure 21: Number of D0 MC events generated per week by OSG during the past year. Although
D0 uses opportunistic computing for its Monte Carlo production, a constant rate of nearly 15 million events/week has been achieved.
Two years ago, D0 was first able to use LCG resources at a significant level to produce Monte Carlo events, primarily because LCG began to use some of the infrastructure developed by OSG. Because LCG was able to adopt this OSG infrastructure easily, D0 can produce a significant number of Monte Carlo events on LCG. Since the OSG infrastructure is robust, LCG production has been very steady at approximately 5 million events/week, giving nearly 250 million Monte Carlo events from LCG during the past year. The ability of other grids to use OSG infrastructure has proved to be very beneficial.
The primary processing of D0 data continues to be run using OSG infrastructure. One of the most important goals of the experiment is to have the primary processing keep up with the rate of data collection; this is critical so that the experiment can quickly find any problems in the data and avoid building up a backlog. Typically D0 keeps up with primary processing by reconstructing 6-8 million events/day (Figure 22). However, when the accelerator collides at very high luminosities, it is difficult to keep up using our standard resources. Since the computing farm and the analysis farm share the same infrastructure, D0 is able to move analysis computing nodes to primary processing to improve its daily processing of data, as it has done on more than one occasion. This flexibility is a tremendous asset and allows D0 to use its computing resources efficiently. Over the past year D0 has reconstructed nearly 1.6 billion events on OSG facilities. To achieve such high throughput, much work has been done to improve the efficiency of primary processing. In almost all cases, only 1-2 job submissions are needed to complete a job, even though the jobs can take several days to finish (see Figure 23).
Figure 22: Cumulative daily production of D0 data events processed by OSG infrastructure for data collected after the 2010 shutdown. The flat areas correspond to times when the accelerator/detector was down for maintenance so no events needed to be processed.
OSG resources continue to allow D0 to meet its computing requirements in both Monte Carlo production and data processing. This has directly contributed to D0 publishing 31 papers in 2010/2011 (with 13 additional papers submitted for publication); see http://www-d0.fnal.gov/d0_publications/.
Figure 23: Submission statistics for D0 primary processing in May 2011. In almost all cases, only 1-2 job submissions are required to complete a job, even though jobs can run for several days.
2.6 CDF at Tevatron
The CDF experiment produced 42 new results for Summer 2010 followed by a further 37 new
results for Winter 2011, using OSG infrastructure and resources. Included in these results was
the 95% CL exclusion of a Standard Model Higgs boson with mass between 158 and 168 GeV/c2
(Figure 24).
Figure 24: Upper limit plot of recent CDF search for the Standard Model Higgs, March 2011 (not
2010 as labeled)
OSG resources support the work of graduate students, who are producing one thesis every other week, and the collaboration as a whole, which is submitting publications of new physics results at a rate of more than one per week. Forty-one publications were submitted in CY10 and 24 in the first half of CY11. A total of 560 million Monte Carlo events were produced by CDF in the last year, and most of this processing took place on OSG resources. CDF also used OSG infrastructure and resources to support the processing of raw data events. A major reprocessing has been under way to increase the b-tagging efficiency for improved sensitivity to a low-mass Higgs. The production output from this and normal processing was 6.6 billion reconstructed events over the last year, and in the same period 11.7 billion ntuple events were created. Detailed numbers of events and data volumes are given in Table 4 (total data since 2000) and Table 5 (data taken in the year to June 2011).
Table 4: CDF data collection since 2000

Data Type      Volume (TB)   # Events (M)   # Files
Raw Data       2061          13807          2366975
Production     3224          21397          2770627
MC             989           6632           1142643
Stripped-Prd   105           876            98728
MC Ntuple      508           7485           454747
Total          7946          118926         7761262
Table 5: CDF data collection for the year to June 2011

Data Type      Data Volume (TB)   # Events (M)   # Files
Raw Data       388                2354           433569
Production     1198               6622           805109
MC             111                561            125535
Stripped-Prd   16                 85             12226
Ntuple         435                12670          292934
MC Ntuple      141                1763           120289
Total          2288               49292          1832551
The OSG provides computing resources for the collaboration through two portals. The first, the North American Grid portal (NAmGrid), covers MC generation in an environment where software must be ported to the site and only Kerberos- or grid-authenticated access to remote storage is available for output. The second portal, CDFGrid, provides an environment that allows full access to all CDF software libraries and methods for data handling.
CDF operates the pilot-based Workload Management System (glideinWMS) as the submission method to remote OSG sites. Figure 25 shows the number of running jobs on NAmGrid and demonstrates steady usage of the facilities, while Figure 26, a plot of the queued requests, shows that there is large demand. CDF MC production is submitted to NAmGrid, where it makes use of OSG resources at CMS, CDF, and general-purpose Fermilab facilities, as well as at MIT.
A large resource provided by Korea at KISTI is in operation as a major Monte Carlo production resource with a high-speed connection to Fermilab for storage of the output. It also provides a cache that allows the data handling functionality to be exploited. The system was commissioned, and 10 TB of raw data were processed using SAM data handling with KISTI in the NAmGrid portal. Problems in commissioning were handled with great speed by the OSG team through the “campfire” room and through weekly VO meetings, and lessons learned to make commissioning and debugging easier were analyzed by the OSG group. KISTI is run as part of NAmGrid for MC processing when not being used for reprocessing.
Figure 25: Running jobs on NAmGrid
Figure 26: Waiting CDF jobs on NAmGrid, showing large demand.
Plots of running jobs and queued requests on CDFGrid are shown in Figure 27 and Figure 28. The very high demand for CDFGrid resources in the period leading up to the 2011 winter conference season is particularly noteworthy, with queues exceeding 100,000 jobs. The high use has continued since then, corresponding both to reprocessing activity and to analysis of the reprocessed data as it has become available. The reduced capacity at the start of the period, in early summer 2010, was due to an allocation of 15% of the CDFGrid resources for testing with SLF5. This testing period ended with the end of the summer conference season in August 2010, at which point all of the CDFGrid and NAmGrid resources were upgraded to SLF5 and this became the default. CDF raw data processing, ntupling, and user analysis have now been converted to SLF5.
Figure 27: Running CDF jobs on CDFGrid
Figure 28: Waiting CDF jobs on CDFGrid
In a new development that went live in June 2011, LCG sites in Europe that have been
providing resources for CDF are now being made available through a new portal, EuroGrid,
which uses OSG functionality through the UCSD glidein factory. CDF is very excited by the
prospect of being able to access these sites with the user protection afforded by the glidein system.
A number of issues affecting operational stability and efficiency have been pointed out in the past. Those that remain, along with solutions or requests for further OSG development, are noted here.
• Service level and security: Since April 2009, Fermilab has had a new protocol for upgrading Linux kernels with security updates. While the main core services can be handled with a rolling reboot, the data handling services still require approximately quarterly draining of queues for up to 3 days prior to reboots.
• Opportunistic computing/efficient resource usage: The preemption policy has not been revisited, and CDF has not tried to include any new sites because of issues that arose when commissioning KISTI. Monitoring showed the KISTI site to be healthy while glideins from glideinWMS were being “swallowed,” leaving an operational cleanup issue. This is being addressed by OSG.
• Management of database resources: Monte Carlo production placed a large load on the CDF database server from queries that could be cached. An effort to reduce this load was launched, and most queries were modified to use a Frontier server with Apache caching. This avoids the resource-management problem, provided Frontier servers are deployed with each site installation (a conceptual sketch of this caching pattern follows this list).
• Disk space on worker nodes: As CDF has recently moved to writing large (2-8 GB) files, the space available to running jobs on worker nodes has become an issue. KISTI was able to upgrade all of its worker nodes in June 2011 to provide 20 GB per slot, and OSG should take this into account in the recommendations it makes for site design.
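The following conceptual sketch illustrates the caching pattern behind the Frontier approach mentioned above: a read-only database query is expressed as an HTTP request so that a site-local cache shared by the worker nodes can answer repeated queries instead of the central database. The servlet URL, proxy host, parameter name, and query below are hypothetical and are not the real Frontier protocol or CDF schema.

    # Conceptual sketch only: a cacheable database query routed through a site HTTP cache.
    # The servlet URL, proxy host, parameter name, and SQL below are hypothetical.
    import urllib.parse
    import urllib.request

    FRONTIER_URL = "http://frontier.example.org:8000/cdf/Frontier"    # hypothetical servlet
    SITE_PROXY = {"http": "http://squid.example-site.edu:3128"}       # hypothetical site cache

    def cached_query(sql):
        """Fetch a read-only query result via the shared site proxy cache."""
        url = FRONTIER_URL + "?" + urllib.parse.urlencode({"query": sql})
        opener = urllib.request.build_opener(urllib.request.ProxyHandler(SITE_PROXY))
        with opener.open(url) as resp:
            return resp.read()

    # Many worker nodes issuing the identical request hit the cache, not the database.
    payload = cached_query("SELECT * FROM calib_runs WHERE run_number = 285970")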
The usage of OSG by CDF has been fruitful, and the ability to add large new resources such as KISTI, as well as more moderate resources, within a single job-submission framework has been extremely useful for CDF. The collaboration has produced significant new results in the last year through the processing of huge data volumes, and significant consolidation of the tools has occurred. In the next year, the collaboration looks forward to a bold computing effort in the push to see evidence for the Higgs boson, a task that will require further innovation in data handling and significant computing resources to reprocess the large quantities of Monte Carlo and data needed to achieve the desired improvements in tagging efficiencies. We look forward to another year with high publication rates and interesting discoveries.
2.7 Nuclear physics
STAR’s tenth year of data taking has brought new levels of data challenges, with the most recent year’s data matching the integrated data of the previous decade. Now operating at the petabyte scale, data mining and production have reached their maximum potential. Over 10 years of running, the RHIC/STAR program has seen data rates grow by two orders of magnitude, yet data production has kept pace and data analysis and science productivity have remained strong. In 2010, the RHIC program and Brookhaven National Laboratory earned recognition as number 1 for hadron collider research (http://sciencewatch.com/ana/st/hadron/institutions/).
To face the data challenge effectively, all raw simulations had previously been migrated to Grid-based operations. This year the migration has been expanded, with a noticeable shift toward the use of Cloud resources wherever possible. While Cloud resources had been of interest to STAR as early as 2007, our previous years’ reports noted multiple tests and a first trial usage of Cloud resources (Nimbus) in 2008/2009 at the approach of a major conference, absorbing additional workload stemming from a last-minute request. This mode of operation has continued, as the Cloud approach allows STAR to run what our collaboration has not been able to perform on Grid resources due to technical limitations (harvesting resources on the fly has been debated at length within STAR and judged an unreachable ideal for an experiment equipped with a complex software stack).
Grid usage remains restricted either to opportunistic use of resources for event generator-based production (a self-contained program that is easily assembled) or to non-opportunistic, dedicated site usage with a pre-installed software stack maintained by a local crew, which allows STAR’s complex workflows to run. Cloud resources, coupled with virtualization technology, permit relatively easy deployment of the full STAR software stack within the VM, allowing large simulation requests to be accommodated. Even more relevant for STAR’s future, recent tests successfully demonstrated that larger-scale real-data reconstruction is easily feasible. Cloud activities and development remain (with some exceptions) outside the scope and program of work of the Open Science Grid; one massive simulation exercise was partly supported by the ExTENCI satellite project.
STAR had planned to also run on and further test the GLOW resources after an initial, successfully reported usage via a Condor/VM mechanism. However, several alternative resources and approaches offered themselves. The Clemson model in particular appeared to allow faster convergence on, and delivery of, a needed simulation production in support of the Spin program component of RHIC/STAR. At a sustained scale of 1,000 jobs (peaking at 1,500 jobs) for three weeks, STAR clearly demonstrated that a full-fledged Monte Carlo simulation followed by a full detector-response simulation and track reconstruction was not only possible on the Cloud but of large benefit to our user community. With over 12 billion PYTHIA events generated, this production represented the largest PYTHIA event sample ever generated in our community. The usage of Cloud resources in this case expanded STAR’s resource capacity by 25% (compared to the resources available at BNL/RCF) and, for a typical student’s work, allowed a year-long wait for science results to be reduced to a few weeks. Typically, a given user at the RCF can claim about 50 job slots (the facility being shared by many users), while in this exploitation of Cloud resources all 1,000 slots were dedicated to a single task and one student. The sample represented a four-order-of-magnitude increase in statistics compared to other studies made in STAR, with a near-total elimination of the statistical uncertainties that would have reduced the significance of model interpretations. The results were presented at the Spin 2010 conference, where unambiguous agreement between our data and the simulation was shown. It is noteworthy that the resources were gathered in an opportunistic manner, as seen in Figure 29. We would like to acknowledge the help of our colleagues from Clemson, partly funded by the ExTENCI project.
Figure 29: Number of machines available to STAR (red), working machines (green), and idle nodes (blue) during opportunistic resource gathering at Clemson University. Within this period, the overlap of the red and green curves demonstrates that the submission mechanism allows immediate harvesting of resources as they become available.
An overview of STAR’s Cloud efforts and usage was presented at the OSG All-Hands meeting in March 2010 (see “Status of STAR’s use of Virtualization and Clouds”) and at the International Symposium on Grid Computing 2010 (“STAR’s Cloud/VM Adventures”). A further overview of activities was given at the ATLAS data challenge workshop held at BNL that same month, and finally a summary presentation was given at the CHEP 2010 conference in Taiwan in October (“When STAR Meets the Clouds – Virtualization & Grid Experience”). Based on usage trends and progress with Cloud usage and scalability, we project that 2011 will see workflows of the order of 10 to 100k jobs sustained as routine operation (see Figure 30).
Figure 30: Summary of our Cloud usage as a function of date. The rapid progression of the exploitation and usage indicates that a 10,000-job scale may be within reach in 2011.
From BNL we steered Grid-based simulation productions (essentially running on our NERSC resources); in total, STAR produced 4.8 million events, representing 254,200 CPU-hours of processing time, using the standard OSG/Grid infrastructure. During our usage of the NERSC resources we re-enabled the SRM data-transfer delegation mechanism, which allows a job to terminate and pass to a third-party service (SRM) the task of transferring the data back to the Tier-0 center, BNL. We had previously used this mechanism but had not integrated it into our regular workflow, since the network allowed immediate Globus-based file transfer with no significant additional time added to the workflow. However, due to performance issues with our storage cache at BNL (outside of STAR’s control and purview), the transfers were recently found, at times, to add a significant overhead to the total job time (a 41% impact). The use of a 0.5 TB cache on the NERSC side together with the SRM delegation mechanism mitigated these delays.
In addition to NERSC, large simulation event generations were performed on the CMS/MIT site for the study of the prompt photon cross section and double spin asymmetry. Forty-three million raw PYTHIA events were generated, of which 300 thousand were passed to GEANT as part of a cross-section/pre-selection speed-up (event filtering at generation), a mechanism designed in STAR to cope with large and statistically challenging simulations; cross-section-based calculations nevertheless require generating in a non-restrictive phase space and counting both the events that pass the filter and those that are rejected. Additionally, 20 billion PYTHIA events (1 million filtered and kept) were processed at that facility. The total resource usage was equivalent to about 100,000 CPU-hours over a total period of two months.
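For illustration, the bookkeeping behind filtering at generation is simple; the event counts below are taken from the text, while the cross-section value is a placeholder rather than a STAR number.

    # Illustrative arithmetic for event filtering at generation: the open phase space is
    # generated, the filter decision is counted for every event, and only the kept events
    # go on to the expensive GEANT step.  The cross section used here is a placeholder.
    n_generated = 20_000_000_000      # PYTHIA events thrown in the unrestricted phase space
    n_kept      = 1_000_000           # events passing the filter and sent on to GEANT

    filter_efficiency = n_kept / n_generated
    print("filter efficiency: %.2e" % filter_efficiency)

    # Because the denominator (all generated events) is counted explicitly, cross sections
    # are still normalized to the full phase space: the filtered sample corresponds to an
    # equivalent luminosity of n_generated / sigma for a process of cross section sigma.
    sigma_pb = 1.0e7                  # hypothetical process cross section, in pb
    equivalent_luminosity = n_generated / sigma_pb
    print("equivalent integrated luminosity: %.0f pb^-1" % equivalent_luminosity)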
STAR has also begun to test the resources provided by the Magellan project at NERSC, and aims to push a fraction of its raw datasets to the Magellan Cloud for immediate processing via a hybrid Cloud/Grid approach (a standard Globus gatekeeper will be used, as well as data transfer tools), while the virtual-machine capability will be leveraged to provision the resources with the most recent STAR software stack. The goal of this exercise is to provide fast-lane processing of data for the Spin working group, with events processed in near real time. While near real-time processing is already practiced in STAR, the run-support data production known as “FastOffline” currently uses local BNL/RCF resources and passes over a sample of the data only once. The use of Cloud resources would allow yet another workflow in support of the experiment’s scientific goals to be outsourced. This processing is also planned to be iterative, with each pass using more accurate calibration constants. We expect thereby to shorten the publication cycle of results from the proton+proton 500 GeV Run 11 data by a year. During the Clemson exercise, STAR designed a scalable database access approach which we will also use for this exercise: in essence, leveraging the capability of our database API, a “snapshot” of the database is created and uploaded to the virtual machine image, and a local database service is started. The need for a remote network connection is then removed, as is the possibility of thousands of processes overstressing the RHIC/BNL database servers. A fully ready database factory is available for exploitation. Final preparations of the workflow are under discussion; if successful, this modus operandi will represent a dramatic shift in the data processing capabilities of STAR, as raw data production will no longer be constrained to dedicated resources but will be possible on widely distributed Cloud-based resources.
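For illustration, the sketch below captures the “database snapshot inside the VM” pattern described above. It assumes a MySQL-backed snapshot, and the paths, port, and environment variable are hypothetical placeholders rather than the actual STAR configuration.

    # Conceptual sketch only: start a private database server on a read-only snapshot
    # shipped inside the VM image, so jobs never contact the central BNL servers.
    # The datadir path, port, and environment variable below are assumptions.
    import os
    import subprocess
    import time

    SNAPSHOT_DATADIR = "/opt/star/db-snapshot"   # hypothetical location inside the VM image
    LOCAL_PORT = 3316                            # hypothetical private port

    def start_local_db():
        """Launch mysqld on the snapshot and point jobs at localhost."""
        proc = subprocess.Popen([
            "mysqld_safe",
            "--datadir=%s" % SNAPSHOT_DATADIR,
            "--port=%d" % LOCAL_PORT,
            "--socket=/tmp/star-db.sock",
        ])
        time.sleep(10)  # crude wait; a production setup would poll until the socket is ready
        # Jobs read this (hypothetical) variable instead of the central server name.
        os.environ["STAR_DB_SERVER"] = "localhost:%d" % LOCAL_PORT
        return proc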
The OSG infrastructure has been heavily used to transfer and redistribute our datasets from the Tier-0 center (BNL) to our other facilities. Notably, the NERSC/PDSF center holds full sets of analysis-ready data (known as micro-DSTs) for the Year 9 data and, in the run-up to the Quark Matter 2011 conference (http://qm2011.in2p3.fr/node/12), we plan to make the Year 10 data available as well, allowing user analysis to be spread over multiple facilities (Tier-2 centers in STAR typically transfer only subsets of the data, targeting local analysis needs). Up to 7 TB of data can be transferred per day, and over 150 TB of data were transferred from BNL to PDSF in 2010.
As a collaborative effort between BNL and the Prague institution, STAR is in the process of deploying a data placement planner tool in support of its data redistribution and production strategy. The planner reasons about where the data should be taken from and moved to in order to achieve the fastest possible plan, whether the plan is for data placement or for a data production and processing turnaround. To get a baseline estimate of the transfer speed limit between BNL and PDSF, we have reassessed the link speed; the expected transfer profile is given in Figure 31. We expect this activity to reach completion by mid-2011.
Figure 31: Maximum transfer speed between BNL and the NERSC facility. The maximum is consistent with a point-to-point 1 Gb/s link.
All STAR physics publications acknowledge the resources provided by the OSG.
2.8 Intensity Frontier at Fermilab
MINOS use of OSG for data analysis increased in the last year to 5.6 million core-hours in 1 million submitted jobs, with over 4 million managed file transfers.
This computing resource, combined with 180 TB of dedicated BlueArc (NFS mounted) file storage, has allowed MINOS to move ahead with traditional and advanced analysis techniques.
MINOS uses a few hundred cores of offsite computing at collaborating universities for occasional Monte Carlo generation. These computing resources are critical as the experiment continues to
pursue the challenging analyses of anti-neutrino disappearance and nu-e appearance.
MINOS has just published the antineutrino disappearance result (arXiv:1104.0344, accepted by Phys. Rev. Lett. on 26 May 2011).
The MINERvA experiment started taking data in FY11 and is using OSG for all production reconstruction and most analysis activities. Activity is ramping up, with 1 million core-hours used in the last year. The NOvA near detector prototype is being commissioned, and NOvA has started using OSG resources for important simulation and code development work.
Other future experiments such as LBNE, Mu2e and g-2 are making use of OSG resources at
Fermilab, mainly opportunistically.
2.9 Astrophysics
The Dark Energy Survey (DES) used approximately 20,000 hours of OSG resources during the
period July 2010 – June 2011 to generate simulated images of galaxies and stars on the sky as
would be observed by the survey. We produced about 2 TB of simulated images, consisting of
both science and calibration data, for a set of galaxy cluster simulations for the DES weak
lensing working group, a set of 60 nights of supernova simulations for the DES supernova working group, and a set of 7 so-called “Gold Standard Night” (GSN) simulation data sets generated
to enable quick turnaround and debugging of the DES Data Management (DESDM) processing
pipelines as part of DES Data Challenge 6 (DC6). When produced as planned during Summer
2011, the full DC6 simulations will consist of 2600 mock science images, covering some 200
square degrees of the sky, along with nearly another 1000 calibration images needed for data
processing. Each 1-GB-sized DES image is produced by a single job on OSG and simulates the
300,000 galaxies and stars on the sky covered in a single 3-square-degree pointing of the DES
camera. The processed simulated data are also being actively used by the DES science working
groups for development and testing of their science analysis codes. Figure 32 shows an example
color composite image of the sky derived from these DES simulations.
Figure 32: Example simulated color composite image of the sky, here covering just a very small area compared to the full 5000 deg2 of sky that will be observed by the Dark Energy Survey. Most of
the objects seen in the image are simulated stars and galaxies. Note in particular the rich galaxy
cluster at the upper right, consisting of the many orange-red objects, which are galaxies that are
members of the cluster. The red, green, and blue streaks are cosmic rays, and have those colors as
they each appear in only one of the separate red, green, and blue images used to make this color
composite image.
2.10 Structural Biology
The SBGrid Consortium, operating from Harvard Medical School in Boston, supports the software needs of ~150 structural biology research laboratories, mostly in the US. The SBGrid Virtual Organization (SBGrid VO) extends the initiative to support the most demanding structural biology applications on resources of the Open Science Grid. Support by the SBGrid VO is extended to all structural biology groups, regardless of their participation in the Consortium. Within the last 12 months we have significantly increased our participation in the Open Science Grid, in terms of both utilization and engagement. Specifically:
• We have launched and successfully maintained a GlideinWMS grid gateway at Harvard Medical School. The gateway communicates with the Glidein Factory at UCSD and dispatches computing jobs to several computing centers across the US. This new infrastructure has allowed us to handle the increased computing workload reliably. Within the last 12 months our VO supported ~6 million CPU hours on the Open Science Grid, and we rank as the number 10 Virtual Organization in terms of overall utilization.
• SBGrid completed development of the Wide Search Molecular Replacement workflow. The paper describing its scientific impact was recently published in PNAS. Another paper presenting the underlying computing technology was presented at the 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers, co-located with SC10, the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
• The WS-MR portal was made publicly available in November 2010. Since its release we have supported 35 users. The majority of users were from US academic institutions (e.g., Yale University, Harvard, WUSTL, University of Tennessee, University of Massachusetts, Stanford, Immune Disease Institute, Cornell University, Caltech), but international research groups utilized the portal as well (including research groups from Canada, Germany, Australia, and Taiwan).
• We continue planning for integration of the central biomedical cluster at Harvard Medical School with Open Science Grid resources. The cluster has recently been funded for expansion from 1000 to 7000 cores (NIH S10 award), and the first phase of the upgrade is being completed in December.
• Our VO organized the Open Science Grid All-Hands Meeting, which took place in Boston in March 2011. We prepared the preliminary program agenda and participated in several planning discussions.
• We successfully maintained a specialized MPI cluster at the Immune Disease Institute (a Harvard Medical School affiliate) to support highly scalable molecular dynamics computations. A long-term molecular dynamics simulation was recently completed on this cluster and will complement a crystal structure that was recently determined in collaboration with the Walker laboratory at HMS (Nature, in press). The resource is also available to other structural biology groups in the Boston area.
• We hosted the OSG All-Hands meeting in March 2011 (see Section 5.2).
2.10.1 Wide Search Molecular Replacement Workflow
We have previously reported on our progress with a specialized macromolecular structure determination method, Wide Search Molecular Replacement (WSMR). This grid workflow has been fully implemented, and we have shown that the method can complement other structural biology techniques in a wide array of challenging cases. Results for this project have been published in PNAS (Stokes-Rees and Sliz, PNAS, December 2010), as well as at the 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers. After publication in PNAS, the portal was released to the structural biology community, and close to 100 users have utilized the infrastructure.
Figure 33: iSGTW article on an SBGrid science result
2.10.2 DEN Workflow
We have developed a second portal application that takes advantage of the OSG infrastructure (Figure 34). The application facilitates refinement of low-resolution X-ray structures.
Figure 34: Job submission interface for DEN computations
2.10.3 OSG Infrastructure
In the recent award cycle, SBGrid OSG utilization increased from 531,116 to 6,057,170 CPU hours (Figure 35). In terms of utilization we are the number nine Virtual Organization of the Open Science Grid (US ATLAS and CMS are the top two). Since December 2010 we have supported 73 WSMR runs on the Open Science Grid, submitted by 51 users from 47 institutions. Those WSMR production computations consumed over 3M CPU hours. The remaining hours were consumed by our team while validating the WSMR approach, by DEN computations, and by other minor workflows submitted to OSG by members of our community.
To accommodate the increased load we migrated our GlideIn scheduler infrastructure; the Open Science Grid support team was instrumental in accommodating this transition. We are now working with the Harvard Medical School Research Information Technology Group to integrate new cluster resources with the Open Science Grid.
Figure 35: SBGrid usage (in green)
2.10.4 Outreach: Facilitating Access to Cyberinfrastructure
The SBGrid VO continued to provide computing assistance to members of the biomedical community in Boston who are interested in utilizing the Open Science Grid for their research. In addition, we have also been supporting the European WeNMR grid project; our Virtual Organization issues security certificates for U.S. scientists who require access to WeNMR resources.
2.11 Computer Science Research
As the scale of OSG increases, its contribution as a laboratory for applied research in the deployment and extension of distributed high-throughput computing technologies remains active: OSG performs scalability tests with new components such as the CREAM CE from Europe and new development versions of Condor, and integrates components to build the end-to-end solutions needed by the stakeholders.
The OSG Security Team continues its work in the applied research area, starting with analysis of
the trust model in collaboration with Bart Miller’s team at the University of Wisconsin.
3. Development of the OSG Distributed Infrastructure
3.1 Usage of the OSG Facility
The OSG facility provides the platform that enables production by the science stakeholders; this includes operational capabilities, security, software, integration, testing, packaging, and documentation, as well as VO and user support. Scientists who use the OSG demand stability more than anything else, and we are continuing our operational focus on providing stable and dependable production-level capabilities.
The stakeholders continued to perform record amounts of processing. The two largest experiments, ATLAS and CMS, after performing a series of data processing exercises last year that thoroughly vetted the end-to-end architecture, are steadily setting computational usage records on almost a weekly basis. The OSG infrastructure has demonstrated that it is up to the challenge and continues to meet the needs of the stakeholders. Currently over 0.6 petabytes of data are transferred every day and more than 4 million jobs complete each week.
Figure 36: OSG facility usage vs. time for the past 12 months, broken down by VO
During the last year, the usage of OSG resources by VOs increased by ~50% from about 6M
hours per week to consistently more than 9M hours per week; additional detail is provided in the
attachment entitled “Production on the OSG.” OSG provides an infrastructure that supports a
broad scope of scientific research activities, including the major physics collaborations, biological sciences, nanoscience, applied mathematics, engineering, and computer science. Most of the
current usage continues to be in the area of physics, but non-physics use of OSG is an important area, with current usage of approximately 700K hours per week (averaged over the last year) spread over 17 VOs.
Figure 37: OSG facility usage vs. time for the past 12 months, broken down by Site.
(“Other” represents the summation of all other “smaller” sites)
With over 100 sites, the production provided on OSG resources continues to grow; the usage varies depending on the needs of the stakeholders. During normal operations, OSG provides more
than 1.4 M CPU wall clock hours a day with peaks occasionally exceeding 1.6 M CPU wall
clock hours a day; between 400K and 500K opportunistic wall clock hours are available on a daily basis for resource sharing.
The Tier-3 sites that we heavily invested in last year now perform steadily, just like many of the
Tier-2 sites. This is notable since many of their administrators do not have formal computer science training and thus special frameworks were developed to provide effective and productive
environments and support.
In summary, OSG has demonstrated that it is meeting the needs of the US CMS and US ATLAS stakeholders at all Tier-1, Tier-2, and Tier-3 sites. OSG is successfully managing the increase in job submissions and data movement as LHC data taking continues into 2011, and OSG continues to actively support and meet the needs of a growing set of non-LHC science communities that are increasing their reliance on OSG.
3.2 Middleware/Software
In the last year, our efforts to provide a stable and reliable production platform have continued
and we have focused on support and incremental, production-quality upgrades. In particular, we
have focused on support for native packaging, the Tier-3 sites, and storage systems.
As in all software distributions, significant effort must be applied to ongoing support and
maintenance. We have focused on continual, incremental support of our existing software stack
release, OSG 1.2. Between June 2010 and June 2011, we released 10 minor updates. These included regular software updates, security patches, bug fixes, and new software features. We will
not review all of the details of these releases here, but instead wish to emphasize that we have
invested significant effort in keeping all of the software up to date so that OSG’s stakeholders
can focus less on the software and more on their science; this software maintenance consumes
roughly 50% of the overall OSG software effort.
There have been several software updates and events in the last year that are worthy of deeper
discussion. As background, the OSG software stack is based on the VDT grid software distribution. The VDT is grid-agnostic and used by several grid projects including OSG, WLCG, and
BestGrid. The OSG software stack is the VDT with the addition of OSG-specific configuration.
• We have continued to focus our efforts on providing the OSG software stack as so-called “native packages” (e.g., RPMs on Red Hat Enterprise Linux). With the release of OSG 1.2, we have pushed the packaging abilities of our infrastructure (based on Pacman) as far as we can. While our established users are willing to use Pacman, there has been steady pressure to package software in a way that is more similar to how they get software from their OS vendors. With the emergence of Tier-3s, this effort has become even more important because system administrators at Tier-3s are often less experienced and have less time to devote to managing their OSG sites. We have wanted to support native packages for some time but have not had the effort to do so, due to other priorities; it has now become clear that we must do this. We initially focused on the needs of the LIGO experiment, and in April 2010 we shipped them a complete set of native packages for both CentOS 5 and Debian 5 (which have different packaging systems); these are now in production. More recently, we have provided Xrootd as RPMs for ATLAS Tier-3 sites. We are now planning to drop future development of our Pacman-based packages and do all future work as native packages, initially RPMs for Red Hat Enterprise Linux-compatible systems.
• Recently we have been focusing on the needs of the ATLAS and CMS Tier-3 sites, and in particular on Tier-3 support for new storage solutions. In the last year, we have improved our packaging, testing, and releasing of BeStMan, Xrootd, and Hadoop, which form a large part of our set of storage solutions, and we have released several iterations of each.
• We have emphasized improvements to our storage solutions in OSG. This is partly for the Tier-3 effort mentioned in the previous item, but also for broader use in OSG. For example, we have created new testbeds for Xrootd and Hadoop and expanded our test suite to ensure that the storage software we support and release is well tested and understood internally. We hold monthly joint meetings with the Xrootd developers and ATLAS to make sure that we understand how development is proceeding and what changes are needed. We have also provided a new tool (Pigeon Tools) to help our users monitor and understand the deployed storage systems in OSG. We have recently tested and released into production the new BeStMan2.
• We are currently preparing a major software addition: ATLAS has requested the addition of the gLite CREAM software, which has functionality similar to the Globus GRAM software; that is, it handles jobs submitted to a site. Our evaluation of this package shows that it is likely to scale quite well. We are far along in our work to integrate it into the OSG Software Stack and have provided an early test release to ATLAS, which is likely to be our first user. This test release was packaged with Pacman; in keeping with our increased focus on native packaging, our next release will be as RPMs.
• We made improvements to our grid monitoring software, RSV, to make it significantly easier for OSG sites to deploy and configure; this was released in February 2011.
• We have been evaluating various data management tools for the VOs that use OSG public storage. The goal is to find a reliable transfer service that allows users to perform bulk data transfers to and from their home institutions, submission nodes, and OSG sites. Thus far we have evaluated iRODS, the Bulk Data Mover tool, and Globus Online. Our evaluation shows that Globus Online is a potential match for our base requirements.
• We have worked hard on outreach through several venues:
o We conducted an in-person Storage Forum in September 2010 at the University of Chicago, to help us better understand the needs of our users and to connect them directly with storage experts.
o We participated in various schools and workshops (the first OSG Summer School, the OSG Site Administrator Forum, and the Brazil/OSG Grid School) to assist with training on grid technologies, particularly storage technologies. We will be hosting the second OSG Summer School at the end of June 2011.
o We participated in the OSG content management initiatives to significantly improve the OSG technical documentation.
The VDT continues to be used by several external collaborators. WLCG uses portions of VDT
(particularly Condor, Globus, UberFTP, and MyProxy). The VDT team maintains close contact
with WLCG via the OSG Software Coordinator's engagement with the gLite Engineering Management Team. We are also in close contact with gLite's successor, the European Middleware Initiative (EMI). TeraGrid and OSG continue to maintain a base level of interoperability by sharing a
code base for Globus, which is a release of Globus patched for OSG and TeraGrid’s needs. The
VDT software and storage coordinators are members of the WLCG Technical Forum, which is
addressing ongoing problems, needs and evolution of the WLCG infrastructure during LHC data
taking.
3.3
Operations
The OSG Operations team provides the central point of operational support for the Open Science Grid and coordinates various distributed OSG services. OSG Operations
publishes real time monitoring information about OSG resources, supports users, developers and
system administrators, maintains critical grid infrastructure services, provides incident response,
and acts as a communication hub. The primary goals of the OSG Operations group are: supporting and strengthening the autonomous OSG resources, building operational relationships with
peering grids, providing reliable grid infrastructure services, ensuring timely action and tracking
of operational issues, and assuring quick response to security incidents. In the last year, OSG
Operations continued to provide the OSG with a reliable facility infrastructure while at the same
time improving services to offer more robust tools to the OSG stakeholders.
OSG Operations is actively supporting the US LHC and we continue to refine and improve our
capabilities for these stakeholders. We have supported the additional load of LHC data-taking by increasing the number of support staff and implementing an ITIL-based (Information Technology Infrastructure Library) change management procedure. As OSG Operations supports
the LHC data-taking phase, we have set high expectations for service reliability and stability of
existing and new services.
During the last year, OSG Operations continued to provide and improve tools and services
for the OSG:

Ticket Exchange mechanisms were updated as the WLCG GGUS system, ATLAS RT system, and the VDT RT system continue to evolve in their local environment. A community
JIRA instance was opened for the use of all OSG groups for software task tracking and project management. New automated ticket exchanges with a new FNAL ticket system and this
new JIRA instance are being evaluated to provide even more possibilities for workflow
tracking between OSG collaborators.

The OSG Operations Support Desk continues to respond to ~160 tickets monthly.
Figure 38: Monthly Ticket Count (August 2010 - May 2011)

The change management plan has evolved with experience to ensure minimal impact on production work during service maintenance windows.

The BDII (Berkeley Database Information Index), which is critical to CMS and ATLAS production, has been at 99.94% availability and 99.95% reliability during the preceding 12
months; see Figure 39.
Figure 39: BDII Availability (June 1, 2010 to May 31, 2011)

A working group was created to deploy a WLCG Top Level BDII for use by USCMS and
USATLAS. The Blueprint group has approved the working group’s recommendations and
testing for this new deployment is ongoing with an implementation into production expected
in fall 2011.

The MyOSG system was ported to MyEGEE, MyEGI, and MyWLCG. MyOSG allows administrative, monitoring, information, validation and accounting services to be displayed
within a single user-defined interface.

Using Apache ActiveMQ messaging, we have provided WLCG with availability and reliability metrics.

The public ticket interface to OSG issues was continually updated to add requested features
aimed at meeting the needs of the OSG users.

We have completed the SLAs for all Operational services, including the new OSG Glide-In
Factory delegated to UCSD.
We also continued our efforts to improve service availability by completing several hardware and service upgrades:
1. We overcame network bandwidth issues experienced as a result of operational services being
behind institutional firewalls.
2. We continued moving metric services from non-production Nebraska facilities to production
hosting facilities at Indiana University.
3. Monitoring of OSG resources at the CERN BDII was implemented to allow end-to-end information system data flow to be tracked and alarms to be raised when necessary.
4. Operations continued to research virtual machine, high availability, and load balancing technologies in an attempt to bring even higher reliability standards to operational services.
Table 6: OSG SLA Availability and Actual Availability (12 Months Beginning June 1, 2010)
Operational Service     SLA Defined Availability Level     Actual Availability Achieved
BDII (Information)      99.00%                             99.94%
OIM (Registration)      97.00%                             99.89%
RSV (Monitoring)        97.00%                             99.72%
Software Cache          97.00%                             99.75%
Ticket Exchange         97.00%                             99.93%
MyOSG                   99.00%                             99.89%
TWiki                   97.00%                             99.90%
Service reliability for OSG services remains excellent and we now gather metrics that quantify
the reliability of these services with respect to the requirements provided in the Service Level
Agreements (SLAs); see Table 6. SLAs have been finalized for all OSG hosted services. Regular
(pre-announced) release schedules for all OSG services have been implemented to enhance user
testing and regularity of software release cycles for OSG Operations provided services. It is the
goal of OSG Operations to provide excellent support and stable distributed core services that the
OSG community can continue to rely upon and to decrease the possibility of unexpected events
interfering with user workflow.
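To make the SLA figures in Table 6 concrete, the short calculation below converts the availability percentages into allowed versus actual downtime per year; it assumes a 365-day year and is illustrative only.

    # Allowed vs. achieved downtime implied by the availability figures in
    # Table 6, assuming a 365-day (8,760-hour) year.
    HOURS_PER_YEAR = 365 * 24

    services = {
        # service name: (SLA availability, achieved availability)
        "BDII":            (0.9900, 0.9994),
        "OIM":             (0.9700, 0.9989),
        "RSV":             (0.9700, 0.9972),
        "Software Cache":  (0.9700, 0.9975),
        "Ticket Exchange": (0.9700, 0.9993),
        "MyOSG":           (0.9900, 0.9989),
        "TWiki":           (0.9700, 0.9990),
    }

    for name, (sla, actual) in services.items():
        allowed = (1 - sla) * HOURS_PER_YEAR
        used = (1 - actual) * HOURS_PER_YEAR
        print("%-15s allowed %6.1f h/yr, actual %5.1f h/yr" % (name, allowed, used))

For the BDII, for example, the 99.00% SLA allows roughly 88 hours of downtime per year, while the achieved 99.94% availability corresponds to only about 5 hours.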
3.4
Integration and Site Coordination
The OSG Integration and Sites Coordination activity continues to play a central role in helping
improve the quality of grid software releases prior to deployment on the production OSG, and in
helping sites deploy and operate OSG services thereby achieving greater success in production.
To help achieve these goals, we continued to operate the Validation (VTB) and Integration Test
Beds (ITB) in support of updates to the OSG software stack that include compute and storage
element services. In the past year we have deployed a new integration and validation cluster to
test the many compute elements (and associated job managers) and storage systems.
As a major improvement to our testing productivity, we have used the automated ITB job submission system to validate releases regularly. The “ITB Robot” is an automated testing and validation system for the ITB which has a suite of test jobs that are executed through the pilot-based
Panda workflow system; its main components are indicated in Figure 40. The test jobs can be of
any type and flavor; the current set includes simple ‘hello world’ test jobs, jobs that are CPUintensive, and jobs that exercise access to/from the associated storage element of the site. Importantly, ITB site administrators are provided a command line tool they can use to inject jobs
aimed for their site into the system, and then monitor the results using the full monitoring
framework (pilot and Condor-G logs, job metadata, etc) for debugging and validation at the joblevel. In addition, a web site to provide reports and graphs detailing testing activities as well as
results on multiple sites for various time periods was developed; this allows ITB site administrators to schedule tests to run on their site and allows them to view the daily testing that will run on
their ITB resources. In the future, we envision that additional workloads will be executed by the
system, simulating components of VO workloads. We are planning work on the “ITB robot”
involving migration to autofactory and improving the web interface to give better control to site
administrators over tests run on their sites and to provide better reporting of test results.
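The shape of a typical ITB Robot test payload can be sketched as follows. This is an illustrative stand-in rather than the actual test suite, and the storage check at the end is a placeholder for the site-appropriate transfer command.

    # Illustrative shape of an ITB test payload: report where the job landed,
    # burn some CPU, and touch local scratch space. The storage-element check
    # is a placeholder for the site-appropriate transfer command.
    import socket
    import subprocess
    import time

    print("running on", socket.gethostname())

    t0 = time.time()
    x = 0
    for i in range(5000000):        # simple CPU-intensive section
        x += i * i
    print("cpu section took %.1f s" % (time.time() - t0))

    # Placeholder storage check; a real test would copy a file to/from the
    # site's storage element using its configured tools.
    subprocess.call(["ls", "-ld", "/tmp"])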
Figure 40: The automated ITB testing environment ("the ITB Robot") showing the Panda
server, the ITB Site (compute element and worker node), client submit host and Django
webserver.
The ITB resources were augmented by nine Dell R610 servers and two Power Vault MD1200
storage shelves. One of the hosts has a 10G network interface so the ITB can be included in
high-bandwidth testing applications. Four of the servers are deployed as KVM (hypervisor) servers to host virtual machine instances. The other systems were used as compute nodes and to provide resources for testing virtual machine-based jobs as well as whole machine jobs. The KVM
servers currently host approximately 34 virtual machines serving as compute elements, dCache,
XRootd, and Hadoop based storage elements, GUMS (Grid User Management System, for sitelevel authentication and account mapping), and ancillary services for the ITB cluster (Condor
and PBS job managers, Ganglia for cluster monitoring, etc.).
These new hardware resources have allowed the Integration group to provide coverage over a
wider variety of software configurations. The ITB now provides coverage for all the major types
of storage elements used on the OSG. Previously, the ITB only tested minimal configurations of
dCache and XRootd/Bestman; however, with the new hardware, dCache, XRootd/Bestman, and
Hadoop/Bestman storage elements can be tested using tests that exercise replication and which
have larger space requirements. In addition, the ITB can run tests that require larger clusters and
more simultaneous jobs.
With the new hardware resources available, the ITB group has explored virtualization of various
components of the OSG services. So far, compute element (CE), storage element (SE), and authorization components (GUMS) have been virtualized and their performance characterized. The
work and experience gained in this effort is now being documented so that it can be used by the
wider OSG community.
Finally, in terms of site and end-user support, we continue to interact with the community of OSG sites using the persistent chat room (“Campfire”) that has now been in regular operation for nearly 30 months. We offer three-hour sessions at least three days a week where OSG core Integration or Sites support staff are available to discuss issues, troubleshoot problems, or simply “chat”
regarding OSG specific issues; these sessions are archived and searchable. The Site Administrators Workshop in August 2010 and the “Talk with the experts” session during the OSG all hands
meeting in March 2011 are other examples of constructive ways to build a community and serve
OSG users. We continue to hold monthly teleconferences for all sites which feature a focus topic
(usually a guest presenter) and open discussion. We are also using the new ITB cluster resources to support prototyping of Tier-3 facilities and to test instructions for installing common Tier-3 services. In OSG we have seen many new Tier-3 facilities in the past year, often with administrators who are not UNIX computing professionals but post-docs or students working part time on their facility. It has therefore been important to continue, within the virtualized Tier-3 cluster, the support and testing of services and installation techniques similar to those being developed by the ATLAS and CMS communities.
3.5
Campus Grids
An emerging effort is the introduction of distributed HTC (DHTC) infrastructures onto campuses. The DHTC principles (diverse resources, dependability, autonomy and mutual trust) that
OSG advances and implements at a national level apply equally well to a campus environment.
The goal of the Campus DHTC infrastructure activity is to translate this natural fit into wide local deployment of high-throughput computing capabilities at the nation’s campuses,
bringing locally operated DHTC services to production in support of faculty and students as well
as enabling integration with TeraGrid, the OSG, and other cyber infrastructures. Deploying
DHTC capabilities onto campuses carries a strong value proposition for both the campus and the
OSG. Intra-campus sharing of computing resources enhances scientific competitiveness and, when interfaced to the OSG production infrastructure, increases national computational throughput.
To accomplish this, we have defined a comprehensive approach that aims to eliminate key barriers to the adoption of HTC technologies by small research groups on our campuses.
1. Support for local campus identity management services removes the need for the researchers
to fetch and maintain additional security credentials such as grid certificates.
2. An integrated software package that moves beyond current cookbook models that require
campus IT teams to download and integrate multiple software components. This package
does not require root privileges and thus can be easily installed by a campus researcher.
3. Coordinated education, training and documentation activities and materials that cover the potential, best practices and technical details of DHTC technologies.
4. A campus job submission point capable of routing jobs to multiple heterogeneous batch scheduling systems (e.g., Condor, LSF, PBS). Previously existing campus DHTC models required that all resources be managed by the same scheduler.
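To make the submission-interface point concrete, the sketch below shows the kind of standard Condor submit description a campus researcher would hand to condor_submit at such a submission point; the file names are hypothetical, and the routing to LSF, PBS, or other local schedulers is configured behind this interface and is not shown.

    # Sketch of the submit-side view at a campus submission point: a standard
    # Condor submit description handed to condor_submit. File names are
    # hypothetical; routing to other local schedulers happens behind this
    # interface and is not shown.
    import subprocess
    import textwrap

    submit_description = textwrap.dedent("""\
        universe   = vanilla
        executable = analyze.sh
        arguments  = input_$(Process).dat
        output     = job_$(Cluster).$(Process).out
        error      = job_$(Cluster).$(Process).err
        log        = analyze.log
        queue 10
        """)

    with open("analyze.sub", "w") as f:
        f.write(submit_description)

    subprocess.call(["condor_submit", "analyze.sub"])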
The job submission interfaces presented to campus faculty and students will be identical to those
already in use today on the OSG production infrastructure. As they begin harnessing the local
campus resources through these interfaces they will be offered, once they obtain a grid credential, seamless access to OSG production grid resources. This forms a natural seedbed of a new
generation of computational scientists who can expand their local computing environment into
the national CI.
We have begun working with campus communities at Clemson, Nebraska, Notre Dame, Purdue
and Wisconsin-Madison on a prototyping effort that includes enabling the formation of local
HTC partnerships and dynamic access to shared local (intra-campus) and remote (inter-campus)
resources using campus identities.
3.6
VO and User Support
The goal of the VO and User Support area is to enable science and research communities from
their initial introduction to the OSG to production usage of the DHTC services and to provide
ongoing support for existing communities’ evolving needs. The User Support area assists established VOs that require help as they evolve their deployment of resources and their use of services and software, and as they identify usability issues. The team works to understand the communities’ needs, agree on common objectives, and provide technical guidance on how to adapt their resources and applications to participate in and run in a DHTC environment. We support users in integrating and using the existing software and help to resolve issues and problems; and,
as needed, we work with the software and technology investigation areas to address any gaps in
needed capabilities.
In support of existing VOs, we have continued to provide forums that provide:

in-depth coverage of VO and OSG at-large issues

community building through shared experiences and learning

expeditious resolution of operational issues

knowledge building through tutorials and presentations on new technology
As an outcome of the VO forum, VOs identify areas of common concern that then serve as inputs to the OSG work program planning process; some recent examples include: mechanisms to
manage and benefit from public storage; accounting discrepancies; early issues with adopting
pilot-based work flow management environments; and, the need for real-time job status monitoring.
In the last year we have documented our process for supporting new science communities and
we are currently using these processes and implementing improvements, as needed. Using these
processes in support of new communities and new initiatives by current OSG VOs, we have enabled the following work programs:

The Large Synoptic Survey Telescope (LSST) collaboration has undergone a second phase of
application integration in OSG, targeting image simulation as a pilot application. This phase
used 40,000 CPU hours per day for a month, with peaks of 100,000, and has helped the LSST community understand the potential of using OSG. As a follow-up to the experience with
this work, the project manager of the LSST Data Management group has requested a plan to
create a stand-alone LSST VO as well as the porting to OSG of data management production
workflows and of the data analysis framework.

The Network for Earthquake Engineering Simulation (NEES) has formed a stand-alone VO
after the successful proof-of-principle integration of the OpenSEES application, a widely
used earthquake simulation framework in the community. The application was run submit-
ting jobs from the NEESHub portal at Purdue and from a standard OSG submission site. The
community is currently undertaking a production-demo phase, scaling up by two orders of magnitude the number of jobs (~4,000) and the amount of data (~5 TB) handled in OSG.

The Dark Energy Survey (DES) has requested OSG expertise to improve the efficiency of
their data handling system (DAF). In collaboration with the Globus Online (GO) team, DES
and OSG have produced a GO-integrated prototype of DAF. This prototype showed a 27%
improvement in data transfer times and highlighted the need for better GO interfaces for verifying data placement.

SBGrid, based at Harvard Medical School, is extending its pilot job system to support the
Italian-based worldwide e-Infrastructure for NMR and structural biology (WeNMR) communities. In parallel, SBGrid continues to provide broad access to OSG by other life science and
structural biology researchers in the Boston area.

The GEANT4 Collaboration’s EGI-based biannual validation testing production runs were
made more efficient and expanded onto the OSG; this helped improve the quality of the
toolkit’s releases for MINOS, ATLAS, CMS, and LHCb.

The Cyber-Infrastructure Campus Champions from XSEDE have partnered with the OSG
Campus Grids initiative to broaden user access to computing resources. OSG User Support
has assisted XSEDE staff in running proof-of-principle jobs on OSG and identifying credentials accessible by CICC and trusted on OSG.
We are currently planning to extend our work to support users whose workflows span both
DHTC and HPC resources – such as those accessible through the TeraGrid/XD and DOE Science leadership class machines. We will work with the XD advanced user support group in helping users span this mix of resources.
3.7
Security
The Security team continued its focus on operational security, identity management, security policies, adopting useful security software, and disseminating security knowledge and awareness
within OSG.
During 2010, OSG conducted a series of workshops on identity management that collected requirements from VOs as well as technical requirements from security experts; the workshops concluded that improving the usability of the identity infrastructure was a major desire. During the
first half of 2011, we undertook a pilot study where we integrated the OSG infrastructure with
members of the InCommon federation (Clemson University in particular). We used the CILogon
CA as a bridge between US universities and OSG, and were able to leverage university-assigned user identities in OSG. We worked with the Clemson University IT department and demonstrated that
OSG users from Clemson University can access OSG resources without acquiring additional
identities. We prepared a risk assessment of our pilot study and collected feedback from the OSG
member institutions. The plan is to continue our work in this area during the next phase of OSG.
Currently, the majority of InCommon members have no accreditation that is compatible with
IGTF policies. Among our next steps is to propose a new authentication and identity vetting profile at IGTF that would be compatible with InCommon member practices. In the meantime,
many of the DOE National laboratories such as Fermilab and LBNL have joined InCommon and
we are exploring how these laboratories can gain special IGTF accreditation such that they can
integrate with the OSG identity infrastructure. In addition, the OSG security team strongly sup-
ported the CILogon Silver CA for its IGTF accreditation. As a voting relying party member of
IGTF, OSG voted in favor of CILogon CA accreditation and also encouraged other IGTF members to do the same by explaining the benefits to the US national cyber infrastructure.
On operational security, we had a very successful year with no major security incidents. However, we addressed a large number (>10) of kernel-level and root-level software vulnerabilities.
We had one attack reported in our community due to these vulnerabilities, but the attacker did
not target the grid infrastructure and only compromised the local systems. We spent a significant
amount of effort detecting vulnerable OSG sites and helping them apply the needed patches.
Many OSG sites had varying institutional policies on their patching practices and this required
some negotiation with the WLCG security team, since WLCG requested a short patching period (7 days) from all sites. We negotiated agreements with the WLCG security team to provide
schedule extensions based on each site’s special circumstances (e.g. delayed sites are asked to
apply a temporary fix), which required one-on-one work with the affected OSG sites. The vulnerability monitoring tool Pakiti, which we adopted in March 2010, proved very useful during this time by helping us detect the patch levels of OSG sites subscribed to this service.
We continued exchanging incident and vulnerability information across EGI/WLCG, TeraGrid
and OSG security teams. This collaboration has provided significant benefit for our community
since we are able to receive and understand ongoing attacks and vulnerabilities and thus take
timely precautions against those threats. In order to ensure our incident preparedness, we tested
and updated the security contact information of every OSG site and VO and ensured that each
site and VO has two people registered as security contacts in the OSG database.
We transitioned the OSG DOEGrids Registration Authority (RA) duty to members of the OSG
operations team at Indiana University. The RA handles all types of certificate issues, including
certificate approval, renewal, revocation, and troubleshooting. This transition proved to be beneficial on several fronts: we created detailed documentation of all RA processes; we tested the
documentation while training new back-up personnel; and, we gained a team of back-up personnel instead of relying on a single person acting as an RA.
We conducted an incident drill (security service challenge) where we tested OSG sites’ ability to
respond to a mock incident. This drill had been coordinated between EGI and OSG security
teams based on a request from WLCG; it covered ATLAS sites and aimed to test the Panda job
submission workflow. The drill was completed successfully and we are currently evaluating the
results.
In the last year, the DOEGrids CA experienced two short-lived service outages lasting less than 2
days each. In addition, the Japanese and Korean CAs had longer-term (around a month) outages
due to the earthquake in Japan. Although these outages did not have a serious impact on OSG
operations, they created added awareness, leading to updates of our business continuity and contingency plans for the OSG identity infrastructure. We documented various scenarios describing how to use alternative Certificate Authorities in the event of such emergencies.
As documented in the OSG Security Plan, we performed our annual Security Test and Controls
which are designed to measure the security of OSG core services. We are now preparing a report
to the OSG Executive Team that contains our recommendations and findings from the audits. We
plan to repeat these controls each year and evaluate whether earlier recommendations were
adopted and are beneficial to OSG security. Each year OSG services have shown continuous
improvement as assessed by these tests and controls. Based on our findings, we update the
OSG Risk Assessment document each year.
Due to the deprecation of the SHA-1 and MD5 hashing algorithms, IGTF decided to start using
SHA-2 hashes in certificates. OSG generated a test cache for the new certificates in ITB and
tested them against our software. We modified the needed software in the VDT stack and the
ITB testing of this effort has been completed. Migration of OSG sites and VOs to the new
certificate cache is currently in progress.
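As an illustration of the kind of check involved in the SHA-2 migration, the sketch below inspects a certificate's signature hash algorithm using the present-day Python cryptography library; the file name is a placeholder, and the library choice is ours for illustration, not part of the VDT work described above.

    # Check which hash algorithm signed a certificate (e.g. 'sha1' vs 'sha256').
    # Uses the present-day 'cryptography' package; the file name is a placeholder.
    from cryptography import x509
    from cryptography.hazmat.backends import default_backend

    with open("usercert.pem", "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read(), default_backend())

    print("subject:  ", cert.subject)
    print("hash used:", cert.signature_hash_algorithm.name)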
During the March 2011 OSG All Hands meeting, we performed our annual site and VO security
training. This year, instead of presentations, we prepared and sent two sets of questionnaires to
the security contacts before the meeting; and then during the AHM, we discussed their responses
to the questions. We intend to continue this kind of hands-on survey and questionnaire and to generate a FAQ database that security contacts can review online for training.
3.8
Content Management
A project was begun in late 2009 to improve the collaborative documentation process and the
organization and content of OSG documentation. The project was undertaken after an initial
study using interviews of users and providers of OSG documentation and a subsequent analysis
of the OSG documentation system. Implementation began in 2010: an overall coordinator was assigned and the documentation was divided into nine areas, each with a responsible
owner.
Based on feedback from the stakeholders, we developed an improved process for producing and
reviewing documents, including new document standards (with tools to simplify application of
the standards) and templates defined to make the documents more consistent for readers. The
new content management process includes: 1) ownership of each document; 2) a formal review
of new and modified documents; and, 3) testing of procedural documents. A different person
fills each role, and all of them collaborate to produce the final document for release. Several
new tools were created to facilitate the management and writing of documents in this collaborative and geographically dispersed environment.
The team had several areas of focus: user and storage documentation and documentation specific
to bringing up new Tier-3 sites. By the end of 2010, the team, with collaborators in many of the
OSG VOs, had:
 Reduced the number of documents by 1/3 by combining documents with similar content and
making use of an “include mechanism” for common sections in multiple documents.

Produced a new navigation system that is being completed at this time.

Provided reviewed documentation specific to establishing Tier-3 sites.

Revamped documentation of the storage technologies in OSG.

Improved and released 77% of the documents overall.
We have made great overall progress, but two document areas (CE and VOs) are still delayed because of staffing issues that are currently being addressed.
In the first quarter of 2011, the prototype navigation was implemented in close cooperation with
OSG operations using a development instance of the OSG Twiki. Unlike the existing navigation,
the improved navigation allows the user to search the content more easily and provides documents organized by user role and by technology of interest. The prototype is being transitioned gradually
into production in incremental steps to ease the transition from the old navigation to the new for
users of the production Twiki. Once this milestone has been achieved the document process will
change its current focus from improving individual documents to improving the navigation for
readers in different roles such as users, system administrators, and VO managers. This process
will test the connectivity and improve the usability of a set of documents when carrying out typical OSG role-based user tasks. In the second half of 2011, the document process will be integrated into the OSG release process to ensure that software and services provided by the OSG are in
sync with the corresponding documentation.
3.9
Metrics and Measurements
OSG Metrics and Measurements strive to give OSG management, VOs, and external entities
quantitative details of the OSG Consortium’s growth and performance throughout its lifetime.
The focus for the recent year has been maintenance of the technical aspects and analysis of new
high-level metrics.
The OSG Display (http://display.grid.iu.edu), developed last year, is meant to give technically savvy members of the public a high-level, focused view of the consortium's highlights and a feel for the services that OSG provides. This display continues to be a very visible aspect of Metrics and is used often at public displays and at several OSG sites.
Most metrics services and automated reports have been transitioned to OSG Operations at GOC
(the exceptions are discussed later); the goal remains to have operations run all reports by the
start of FY12. These reports have proven to be useful in catching subtle issues in WLCG availability or accounting, and are reviewed daily by both the LHC VOs and OSG staff. Metrics staff
continues to produce and manually verify monthly reports for the eJOT and OSG homepage; we
believe this additional manual layer is an important validation prior to larger use. OSG Metrics
performs periodic reviews of the validity of automated processes and audits the documentation
for manual processes.
The Metrics area continues to coordinate WLCG-related reporting efforts. The installed capacity
reporting put in place last year has been operating for a year without technical issues. Based on
an ATLAS request, we investigated the effect of hyper-threading on the site normalization used
by the WLCG for accounting. From those results, we started manually computing ATLAS normalizations, and are working on a technical solution to restore automated computations.
In FY11, we have increased efforts to employ metrics data in assessing key strategic capabilities
of the OSG facility and project overall. An important focus has been to examine factors driving
high throughput computing performance. An example of this was an investigation into wall-time
efficiency which relates to the capabilities of sites within the OSG to efficiently process data, and
to the ability of VOs to appropriately design their workflows. In Figure 41 the wall-time efficiency (CPU time/wall time) derived from the Gratia accounting service is plotted versus the
number of completed jobs per day for the full year 2010. The plot compares two VOs with very different efficiency characteristics and usage patterns. In each case the VOs submit a mix of
jobs – each has a heterogeneous set of users doing both CPU intensive workloads (such as Monte
Carlo simulation) and data analysis (which can be IO intensive resulting naturally in lower efficiency) which contributes to the spread of points.
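The metric itself is straightforward; the sketch below computes per-VO wall-time efficiency (CPU time divided by wall time) from accounting-style records. The record layout is illustrative and does not reproduce the actual Gratia schema.

    # Per-VO wall-time efficiency (CPU time / wall time) from accounting-style
    # records; the record layout is illustrative, not the Gratia schema.
    records = [
        {"vo": "VOa", "cpu_seconds": 3400, "wall_seconds": 3600},
        {"vo": "VOa", "cpu_seconds": 1200, "wall_seconds": 3600},  # I/O-bound analysis job
        {"vo": "VOb", "cpu_seconds": 3550, "wall_seconds": 3600},
    ]

    totals = {}
    for r in records:
        cpu, wall = totals.get(r["vo"], (0, 0))
        totals[r["vo"]] = (cpu + r["cpu_seconds"], wall + r["wall_seconds"])

    for vo, (cpu, wall) in sorted(totals.items()):
        print("%s: wall-time efficiency = %.2f" % (vo, cpu / wall))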
Figure 41: Comparison of the wall-time efficiency for two VOs
Figure 42: Comparison of the walltime efficiency for 17 OSG sites serving VOa for 2010
Furthermore, there are dependencies on the computing site, as displayed in Figure 42 which
shows the cumulative wall time efficiency for all of 2010 for the computing sites selected by
VOa. Here we see significant variation with three sites in particular showing comparatively
weak performance. These facilities have very significant differences in profile, varying in scale
(amount of storage and CPU job slots available), storage systems (dCache, HDFS, etc.) and data
access protocol (staging to local disk on the compute node or streaming from network attached
storage pools). There are also wide-area network affects when input datasets need to be prestaged to the site as part of the workflow. The VOs know in detail the most significant factors
that come into play; the lessons garnered are generally useful for all VOs in OSG, and these are
discussed in a variety of forums within the OSG community.
The OSG metrics web application framework allows users to browse metrics data, and staff to
create reports for OSG management; this package is stable and is now well-packaged and installable as an RPM. Reports have continuously evolved and been added based upon management requests and new Gratia data. As a continuation of work mostly completed in 2010, we have replaced hand-maintained data or custom databases with data sources such as GOC’s OIM or Gratia, although the “field of science” report mapping individuals to science fields is an ongoing
concern. The scripts generating nightly plots for the OSG homepage have been significantly
cleaned up, and are being prepared to transition to operations.
OSG Metrics continues its successful collaboration with the external Gratia development project
in the face of substantial effort reductions in the Gratia project. Using OSG feedback, Gratia
has improved accounting for pilot jobs, HTPC jobs, and collecting storage and batch system data
from the BDII. OSG Metrics has contributed to the storage probes, the batch system probes, and
code for the next collector release. The scalability improvements to Gratia put in place last year
based on analysis of the load resulting from the LHC turn-on have proven to be sufficient for the
current year.
3.10
Extending Science Applications
In addition to operating a facility, the OSG includes a program of work that extends the support
of Science Applications in terms of both the complexity and the scale of the applications that can be effectively run on the infrastructure. We solicit input from the scientific user community concerning both operational experience with the deployed infrastructure and
extensions to the functionality of that infrastructure. We identify limitations, and address those
with our stakeholders in the science community. In the last year, the high level focus has been
threefold: 1) improve the scalability, reliability, and usability as well as our understanding thereof; 2) improve the usability of our Work Load Management systems to enable broader adoption
by non-HEP user communities; and 3) evaluate new technologies, such as xrootd, for adoption
by the OSG user communities.
3.10.1 Scalability, Reliability, and Usability
As the scale of the hardware that is accessible via the OSG increases, we need to continuously
assure that the performance of the middleware is adequate to meet the demands. There were
three major goals in this area for the last year and they were met via a close collaboration between developers, user communities, and OSG.
1. At the job submission client level, the CMS-stated goal of 40,000 jobs running
simultaneously and 1M jobs run per day from a single client installation has been achieved
and exceeded, with a peak of 60,000 running jobs demonstrated. The job submission client
goals were met in collaboration with CMS, Condor, and DISUN, using glideinWMS. This
was done in a controlled environment, using the “overlay grid” for large scale testing on top
of the production infrastructure provided by CMS. To achieve this goal, the Condor
architecture was modified to allow for port multiplexing and asynchronous matchmaking.
There have also been several smaller improvements in disk transaction rates and memory
consumption to lower the hardware cost of achieving this goal. The glideinWMS is also used
for production activities in over ten scientific communities, the biggest being CMS, D0, CDF
and HCC, where the job success rate has consistently been above the 95% mark.
2. At the storage level, the present goal is to have 100 Hz file-handling rates with thousands of clients accessing the same storage area at the same time, while delivering at least 1 Gbit/s aggregate data throughput (see the back-of-the-envelope numbers after this list). The two versions of BeStMan SRM, based on two different Java container technologies, Globus and Jetty, both running on top of HadoopFS, have been shown to achieve this goal. They can also handle on the order of 1M files at once, with directories containing up to 50K files. There was no major progress on the performance of the dCache-based SRM, which did not exceed 10 Hz in our tests.
3. At the functionality level, this year’s goal was to evaluate and facilitate the adoption of new
streaming storage technology to complement the file based solution provided by SRM. This
was requested by CMS to enable efficient remote event processing. The chosen technology is
based on the xrootd protocol and implemented by Scalla. A Scalla instance has been tested using the above-mentioned overlay grid and has shown great performance with a modest number of clients, but fails catastrophically at higher scales. The test results have been
propagated back to the developers and a fix is expected soon.
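The back-of-the-envelope numbers behind the scalability targets above can be worked out in a few lines; the figures come directly from the stated goals (1M jobs per day, 100 Hz file handling, 1 Gbit/s aggregate throughput).

    # Back-of-the-envelope numbers behind the stated targets.
    jobs_per_day = 1000000
    print("1M jobs/day is about %.1f job completions per second, sustained"
          % (jobs_per_day / 86400.0))

    file_rate_hz = 100          # target SRM file-handling rate
    throughput_bps = 1e9        # 1 Gbit/s aggregate throughput target
    mb_per_file = throughput_bps / file_rate_hz / 8 / 1e6
    print("Average payload needed to reach 1 Gbit/s at 100 Hz: %.2f MB per file"
          % mb_per_file)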
In addition, we have continued to work on a number of lower impact objectives:
1. We have been involved in testing the client implementation of alternative Grid gatekeepers,
in particular CREAM and GRAM5 since several OSG VOs need access to sites that use such
technologies. The primary client in OSG is Condor-G, so it has been extensively tested and
all problems reported to the Condor development team, who fixed all of them.
2. We have been involved in tuning the performance of SRM implementations by performing
configuration sweeps and measuring the performance at each point.
3. We continuously evaluate new versions of OSG software packages, with the aim both of discovering any bugs specific to the OSG use cases and of comparing the scalability
and reliability characteristics against the previous release.
4. We have been involved with other OSG area coordinators in reviewing and improving the
user documentation. The resulting improvements are expected to increase the usability of
OSG for both the users and the resource providers.
5. We provide support to OSG VOs that express interest in using glideinWMS, by providing
advice on deployment options as well as helping with high level operational issues. This has
proven to be very effective, as several VOs adopted the OSG-operated glideinWMS factory
this year with excellent results for their production scaling.
6. We have improved the package containing a framework for using Grid resources for
performing consistent scalability tests against centralized services, like CEs and SRMs. The
intent of this package is to quickly “certify” the performance characteristics of new
middleware, a new site, or deployment on new hardware, by using thousands of clients
instead of one. Using Grid resources allows us to achieve this, but requires additional
synchronization mechanisms to perform in a reliable and repeatable manner.
3.10.2 Workload Management System
OSG actively supports its stakeholders in the broad deployment of a flexible set of software tools
and services for efficient, scalable and secure distribution of workload among OSG sites. Both
major Workload Management Systems supported by OSG – glideinWMS and Panda – saw significant usage growth in the last year.
The Panda system continued its stable operation as a crucial element of the ATLAS experiment
at the LHC and saw more than a two-fold increase in workload over the past 12 months, with the number of jobs processed daily peaking at 840,000. This growth brought with it a new set of challenges in the area of scalability of Panda databases and related information systems, such as
monitoring. These challenges were met with an effort aimed at optimization of queries against
the existing Oracle RDBMS, and also with R&D leading to the application of novel “noSQL”
technologies, already proven in industry, to Panda data storage solutions. A series of tests performed at CERN and at Brookhaven National Laboratory determined the optimal design of the storage data structures and the hardware configuration for a high-performance Cassandra DB cluster, which will be used to handle archived data currently accessed through Oracle, achieving better scalability and throughput going forward.
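As an illustration of the archival pattern being considered, the sketch below uses the Python Cassandra driver with a purely hypothetical keyspace and table layout; it is not the schema chosen in the tests described above.

    # Hypothetical keyspace/table for archived job records, using the Python
    # Cassandra driver; names and schema are illustrative only.
    from cassandra.cluster import Cluster

    cluster = Cluster(["cassandra-node1.example.org"])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS panda_archive
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS panda_archive.jobs_by_day (
            day text, job_id bigint, site text, status text, wall_seconds int,
            PRIMARY KEY (day, job_id)
        )
    """)

    session.execute(
        "INSERT INTO panda_archive.jobs_by_day (day, job_id, site, status, wall_seconds)"
        " VALUES (%s, %s, %s, %s, %s)",
        ("2011-05-01", 123456789, "BNL_ATLAS", "finished", 5400),
    )
    row = session.execute(
        "SELECT count(*) FROM panda_archive.jobs_by_day WHERE day = %s",
        ("2011-05-01",),
    ).one()
    print("jobs archived for 2011-05-01:", row.count)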
We continued our collaboration with the BNL research group active in the Daya Bay and LBNE
(DUSEL) experiments, providing them with increased and transparent access to large resource
pools (at BNL) via Panda job submission. As a result, in recent months the researchers have been
taking advantage of a throughput more than an order of magnitude higher than before and are
now submitting a steady stream of jobs daily.
The GlideinWMS system saw a steady and significant increase in volume over the past 12 months,
averaging roughly 100,000 jobs per day. To increase the impact of this technology, OSG now
operates a glidein factory as a core service for the benefit of multiple VOs and in the last year
this has enabled another ~6 VOs to adopt this job submission framework. Based on feedback
from OSG VOs, the glideinWMS project undertook a development effort to further improve the
system, and enhancements were made in areas such as monitoring, network communications,
gLExec integration and diagnostics, improved handling of glide-ins, multi-core capability, packaging and installation, and others.
As evidenced by the increase in volume, scope, and diversity of the workload handled globally by the OSG-supported Workload Management Systems, this program continues to be extremely important for the science community. These systems have proven themselves to be key enabling factors for LHC projects such as ATLAS and CMS. It is equally important that OSG continues to draw new entrants from diverse areas of research, who receive significant benefit by leveraging the stable and proven environment for resource access, job submission, and monitoring tools created and maintained by OSG.
3.11
OSG Technology Planning
The OSG Technology Planning area provides the OSG with a mechanism for medium- to long-term technology planning. This is accomplished via two sub-groups:
Blueprint: The blueprint sub-group records the conceptual principles of the OSG and focuses on
the long-term evolution of the OSG. The Blueprint group meets approximately quarterly and,
under the direction of the OSG Technical Director, updates the "Blueprint Document" to reflect
our understanding of the basic principles, definitions, and the broad outlines of how the elements fit together.
Investigations: To manage the influx of new external technologies - while keeping to the OSG
principles - this sub-group conducts investigations to understand the concepts, functionality, and im-
pact of external technologies. The goal is to identify technologies that are potentially disruptive
in the medium-term of 12-24 months and give the OSG recommendations on whether and how to
adopt them. This sub-group was added in late FY11 and is viewed as a key work area to enable
the evolution of OSG.
Starting in January 2011, the Blueprint group undertook a major update of the OSG Blueprint
Document to make it more focused on the principles of OSG, and less on the technology implementation. It also now recognizes that the OSG, as an organization, has several grids and currently two reference implementations (the Production Grid and a Campus Grid).
The blueprint meetings have identified several areas of investigation for the future. Examples
include:
1. Integrating virtualization-based resources into the OSG alongside batch-based resources.
2. An alternate system for uploading site state information to the central operations databases.
3. Easy-to-operate data movement for smaller organizations.
The investigation sub-group has begun work on the virtualization task, and expects to release a
report at the end of July 2011. To engage and inform the wider community in this aspect of the
OSG, a blog was recently started to record the area’s activities (http://osgtech.blogspot.com).
3.12
Challenges facing OSG
The current challenge facing OSG is how to sustain, improve and extend our support for our existing communities, while engaging with and expanding our support for additional communities
that can benefit from our services, software and expertise. We continue to face the challenge of
providing the most effective value to the wider community, as well as contributing to the XD,
SciDAC-3 and Exascale programs. We have learned that it is not trivial to transfer capabilities
from large international collaborations to a smaller scientific community. We have identified
some of these challenges and how we would propose to address them in our plans (and proposal)
for the next 5 years to sustain and extend the OSG:
1. The heterogeneity of the resource environment of OSG makes it difficult for smaller communities to operate successfully at a significant scale. We made a breakthrough in this area in
2010 by offering a job submission service that creates community-specific overlay batch systems across OSG sites, thus providing meta-scheduler functionality across the entire infrastructure. Today, six different communities share a single instance of this service, providing
an economy of scale and centralizing the support across these communities. We will thus offer this service as a core feature of OSG to all communities in the next 5 years.
2. The complexity of the grid certificate-based authorization infrastructure presents a non-negligible barrier to entry for smaller communities. During the last year, we saw multiple
promising approaches emerging in different contexts: LIGO is pioneering a more integrated
approach that includes federated identity management through Shibboleth, and the OSG
campus infrastructure group has developed a prototype in which local identity management
mechanisms are extended to a regional cross–campus infrastructure spanning campuses in six
Midwest cities.
3. While the LHC communities are moving PetaBytes worldwide, smaller communities find it
exceedingly difficult to manage terabytes. Over the last 5 years we have seen this gap in capability grow rather than shrink. The first step in reversing this trend is the “Any Data, Any-
time, Anywhere” initiative, a collaboration of OSG, WLCG, US ATLAS, and US CMS computing communities. The initiative aims to reduce the problem from one of moving data
around within the OSG fabric to one of getting the data onto the OSG fabric in the first place;
once data is anywhere on the fabric it would be accessible remotely from everywhere on the
fabric.
4.
Satellite Projects, Partners, and Collaborations
The OSG coordinates with and leverages the work of many other projects, institutions, and scientific teams that collaborate with OSG in different ways. This coordination varies from reliance
on external project collaborations to develop software that will be included in the VDT and deployed on OSG to maintaining relationships with other projects where there is a mutual benefit
because of common software, common user projects, or expertise in areas of high throughput or
high performance computing.
Projects are Satellite Projects if they meet the following criteria:

OSG was involved in the planning process and there was communication and coordination
between the proposal’s PI and OSG Executive Team before submission to agencies.

OSG commits support for the proposal and/or future collaborative action within the OSG
project.

The project agrees to be considered an OSG Satellite project.
Satellite Projects are independent projects with their own project management, reporting, and
accountability to their program sponsors; the OSG core project does not provide oversight for the
execution of the satellite project’s work program. OSG does have a close working relationship
with Satellite Projects and a member of our leadership serves as an interface to these projects.
Current OSG Satellite Projects are:

CI-TEAM: Embedded Immersive Engagement for Cyber infrastructure (EIE4CI) to establish
the Engage VO, engagement program, and explore the use of OSG techniques for catalyzing
campus grids.

ExTENCI: Extending Science Through Enhanced National Cyberinfrastructure is a new project that began in August 2010. It jointly serves OSG and TeraGrid by providing mechanisms for running applications on both architectures.

High Throughput Parallel Computing (HTPC) on OSG resources for an emerging class of applications in which large ensembles (hundreds to thousands) of modestly parallel (4- to ~64-way) jobs are run.

Application testing over the ESNet 100-Gigabit network prototype, using the storage and compute end-points supplied by the Magellan cloud computing testbeds at ANL and NERSC.

CorralWMS to enable user access to provisioned resources and “just-in-time” available resources, integrated for a single workload. It builds on previous work on OSG’s GlideinWMS
and Corral, a provisioning tool used to complement the Pegasus WMS used on TeraGrid.

VOSS: “Delegating Organizational Work to Virtual Organization Technologies: Beyond the
Communications Paradigm” (OCI funded, NSF 0838383) studies how OSG functions as a
collaboration.
We maintain a partner relationship with many other organizations related to grid infrastructures, other high performance computing infrastructures, international consortia, and certain
projects that operate in the broad space of high throughput or high performance computing.
These collaborations include:
 Community Driven Improvement of Globus Software (CDIGS)
 Condor
 Cyber Infrastructure Logon service (CILogon)
 European Grid Initiative (EGI)
 European Middleware Initiative (EMI)
 Energy Sciences Network (ESNet)
 FutureGrid study of grids and clouds
 Galois, an R&D company in the area of security for computer networks
 Globus
 Colombian National Grid (GridColombia)
 São Paulo State University's statewide, multi-campus computational grid (GridUNESP)
 Internet2
 Magellan
 Network for Earthquake Engineering Simulation (NEES)
 National (UK) Grid Service (NGS)
 NYSGrid
 Open Grid Forum (OGF)
 Pegasus workflow management
 TeraGrid
 WLCG
The OSG is supported by many institutions including:
 Boston University
 Brookhaven National Laboratory
 California Institute of Technology
 Clemson University
 Columbia University
 Fermi National Accelerator Laboratory
 Harvard University (Medical School)
 Indiana University
 Information Sciences Institute (USC)
 Lawrence Berkeley National Laboratory
 Purdue University
 Renaissance Computing Institute
 Stanford Linear Accelerator Center (SLAC)
 University of California San Diego
 University of Chicago
 University of Florida
 University of Illinois Urbana Champaign/NCSA
 University of Nebraska – Lincoln
 University of Wisconsin, Madison
 University of Buffalo (council)
Selected Satellite Projects and Partnerships and their work with OSG are described below.
4.1
CI Team Engagements
In April 2008, PI McGee was awarded an OSG Satellite Project (Embedded Immersive Engagement for Cyber infrastructure – EIE4CI) from the CI-TEAM program to establish the Engage
VO, an engagement program, and, with co-PI Goasguen, explore the use of OSG techniques for catalyzing campus grids. This effort is now in a no-cost-extension period, and the annual report on these activities was submitted to NSF in May 2011 (award number 0753335). Included with
the annual reports is an independent assessment of the Engage CI-TEAM effort conducted by the
VOSS sponsored Scientific Software Ecosystems Research Project3.
The EIE4CI team has been successful in engaging a significant and diverse group of researchers
across scientific domains and universities. The infrastructure and expertise that we have employed have enabled rapid uptake of the OSG national CI by researchers with a need for scientific
computing that fits the model of national scale distributed HTC. Engaging university campuses
in the dialog of organizing local campus Cyber infrastructure using the tooling and techniques
developed by the Open Science Grid has proven more challenging as reported in a January 2010
workshop4. During this reporting period, the Engage VO and associated hosted infrastructure have increasingly been relied upon to bring new users and larger communities onto OSG. In addition to the staff at RENCI who are sponsored by the satellite award, core OSG staff (e.g. FNAL
staff working with NEES) and other OSG satellite projects (e.g. ExTENCI) are engaging new
users and communities, building upon the experiences of this CI-TEAM award, and depending
upon the EIE4CI managed infrastructure. During the remaining time of the EIE4CI no-cost-extension period, we will begin discussions with the OSG core program to make contingency
plans for the case in which these activities do not receive continued funding to engage new users
and maintain the enabling infrastructure, processes, and Engage-VO community that has come
together. Figure 43 below shows the EIE4CI cumulative usage throughout the entire project period, clearly indicating increased growth and demand for OSG.
3
http://www.renci.org/wp-content/pub/techreports/TR-10-05.pdf
4
https://twiki.grid.iu.edu/bin/view/CampusGrids/WorkingMeetingFermilab
Figure 43: Cumulative Hours by Facility throughout the Entire EIE4CI Program
Engage-VO use of OSG during the reporting period is depicted in Figure 44 and represents a
number of science domains and projects including: Biochemistry (Zhao), theoretical physics
(Bass, Peterson, Bhattacharya, Coleman-Smith, Qin), Mechanical Engineering (Ratnaswamy),
Earthquake Engineering (Espinosa, Barbosa), RCSB Protein Data Bank (Prlic), and Oceanography (Thayre).
Figure 44: CPU hours per engaged user during reporting period
We note that all usage by Engage staff depicted here is directly related to assisting users, and not
related to any computational work of the Engage staff themselves. This typically involves running jobs on behalf of users for the first time or after significant changes to test wrapper scripts
and probe how the distributed infrastructure will react to the particular user codes. In an effort to
increase the available cycles for Engage VO users, RENCI has made available two clusters to the
Engage community, including opportunistic access to the 11TFlop BlueRidge system (with GPGPU nodes being used by an EIE4CI molecular dynamics user5) and some experimental large
memory nodes (two nodes of 32 cores and 1TB of system memory).
4.2
Condor
The OSG software platform includes Condor, a high throughput computing system developed by
the Condor Project. Condor can manage local clusters of computers, dedicated or cycle scavenged from desktops or other resources, and can manage jobs running on both local clusters and
delegated to remote sites via Condor itself, Globus, CREAM, and other systems.
The Condor Team released updated versions of the Condor software and provided support for
that software. That support frequently involved improvements to the Condor software, which
would be tested and released.
4.2.1 Release Condor
This activity consisted of the ongoing work required to have regular new releases of Condor.
Creating quality Condor releases at regular intervals required significant effort. New releases
fixed known bugs; supported new operating system releases (porting); supported new versions of
dependent system software and hardware; underwent a rigorous quality assurance and development lifecycle process (consisting of a strict source code management, release process, and regression testing); and received updates to the documentation.
Previously, release engineering work was done by various team members on a rotating basis. This reduced the efficiency of the work and made long-term release engineering projects difficult. We have moved to having two staff members partially dedicated to release engineering, allowing
us to improve our process.
From July 2010 through early June 2011, we have made a number of improvements to our release engineering. We added more platforms on which we do hourly builds and testing, enabling
earlier bug detection. In collaboration with Red Hat, we added caching of external libraries to
speed up individual builds. We enhanced our build and test dashboard to quickly identify flawed
commits. We began running our most recent development release to power the Metronome software in the NMI Build and Test Lab6, providing an additional real and complex environment to
help identify problems earlier in the development series. These benefits combine to allow earlier
and more bug-free releases of Condor. Catching bugs earlier means fewer stable releases are
necessary, increasing stability and ease of administration for end users.
Over the last year the Condor team made 7 releases of Condor, with at least one more planned
before the end of June 2011. During this time the Condor team created and code-reviewed 84
publicly documented bug fixes. Condor ports were maintained and released for Windows, Mac
OS X, and 6 different Linux distributions. These ports were tested across 30 different operating system and architecture combinations. Recent work testing binary compatibility allows us to support and deliver fewer different packages while supporting more operating system distributions than before.
5 http://osglog.wordpress.com/2011/02/02/amber11-pmemd-for-nvidia-gpgpu/
6 See http://nmi.cs.wisc.edu/ for more about the NMI facility and Metronome.
We continued to invest effort improving our automated test suite to find bugs before our users
do, and continued our efforts to better leverage the NMI Build and Test facility and Metronome
framework. The number of automated builds we perform via NMI averages over 80 per day.
This allows us to better meet our release schedules by alerting us to problems in the code or a
port as early as possible. We currently perform more than 30,000 tests per day on the current
Condor source code snapshot (see Figure 45).
Figure 45: Number of daily automated regression tests performed on the Condor source
Support and release for Condor is non-trivial, not only because of the nature of distributed computing but also because the Condor codebase is very active. On average over the past year, the Condor project each month:
• Released a new version of Condor to the public
• Performed over 170 commits to the codebase (see Figure 46)
• Modified over 350 source code files
• Changed over 8,500 lines of code (Condor source code written at UW-Madison now sits at about 922,422 lines of code)
• Compiled about 2,500 builds of the code for testing purposes
• Ran about 930,000 regression tests (both functional and unit)
Figure 46: Number of code commits made per month to Condor source repository
The 7.4 stable series continued to be popular, with over 7,800 downloads since July 2010, bringing lifetime downloads from the Condor homepage alone to over 24,000. The new 7.6 stable series was first released in April 2011 and already has over 1,900 downloads, replacing 7.4 as the
most popular download. The stable series contains only bug fixes and ports, while the development series contains new functionality. Throughout 2003, the project averaged over 400 downloads of the software each month; in the past year, that number has grown to over 1,000 downloads each month as shown in Figure 47.
Figure 47: Stacked graph depicting number of downloads per month per Condor version
Note these numbers exclude the increasingly popular downloads of Condor from other third-party distributors, including the VDT (Open Science Grid software stack), and copies of Condor bundled with popular Linux distributions such as Red Hat MRG, Ubuntu, and Fedora Linux; as a result, Condor use grows more quickly than these limited numbers suggest.
Several OSG collaborators wish to compile the Condor sources themselves. While Condor has been open source since 2003, it could be difficult for external groups to build and to interoperate with natively installed packages. We collaborated with Red Hat to overhaul Condor’s build system so that it works well for both Condor team members and external groups.
Condor team members have also been working with the Debian Project to improve the quality of
Condor’s Debian packages. This also includes work to move Condor to dynamic linking of all
possible libraries. Dynamic linking will reduce package sizes, speeding releases and downloads
for users. This work is a required step on the way toward eventual addition of Condor into the
official Debian repository. Once Condor is available from the official Debian repository, this will
provide the most convenient possible packaging for users of Debian-based systems.
Besides the Condor software itself (covered in the section below), the project maintained the Condor web sites at http://www.condorproject.org and http://wiki.condorproject.org, which contain links to the Condor software, documentation, research papers, technical documents, and presentations. The project maintained numerous public project-related email lists such as condor-users (see Figure 49), condor-devel, and their associated on-line archives. The project also maintained a publicly accessible version control system allowing access to in-development releases, a public issue tracking system, and a public wiki of documentation on using, configuring, maintaining, and releasing Condor. For users anticipating bug fixes or new features, we have placed an increased focus on ensuring that release information is associated with issues in our public tracking system, allowing users to identify which version of Condor contains which changes.
A primary publication for the Condor team is the Condor Manual which currently stands at over
1,000 pages. In the past year, the manual has been kept up to date with changes and additions in
the stable 7.4 series, the development 7.5 series, the new stable 7.6 series, and in preparation for
the imminent 7.7.0 release; the most recent stable release is the Condor Version 7.6.1 Manual7.
4.2.2 Support Condor
Users received support directly from project developers by sending questions or bug reports to the condor-admin@cs.wisc.edu email address. All incoming support email was tracked by an email-based ticket tracking system running on servers at UW-Madison. From July 2010 through early June 2011, over 1,500 email messages were exchanged between the project and users towards resolving 430 support incidents (see Figure 48).
Figure 48: Number of tracked incidents and support emails exchanged per month
Support is also provided on various online forums, bug tracking systems, and mailing lists. For example, approximately 10% of the hundreds of messages posted each month on the condor-users email list originated from Condor staff; see Figure 49.
7 http://www.cs.wisc.edu/condor/manual/
Figure 49: The condor-users email list receives hundreds of messages per month; project staff regularly contributes to the discussion.
The Condor team provides on-going support through regular phone conferences and face-to-face
meetings with OSG collaborations that use Condor in complex or mission-critical settings. This
includes monthly meetings with USCMS and FermiGrid, weekly teleconferences with ATLAS,
and biweekly teleconferences with LIGO. Over the last year Condor’s email support system has
been used to manage 21 issues with LIGO, resolving 9 of them. The Condor team also uses a
web system to track ongoing issues. This web system is tracking 27 issues associated with ATLAS, of which 19 are resolved, 107 issues associated with CMS, of which 70 are resolved, 74
tickets for LIGO, of which 29 are resolved, and 33 for other OSG users, of which 18 are resolved.
Support work for LIGO has included hardening and adding functionality, helping to debug problems in their cluster, and fixing bugs uncovered by the work of LIGO and other OSG groups.
Work with ATLAS revealed bugs in Condor-G, which have been fixed. For groups including
ATLAS and LIGO the Condor team provides advice, configuration support, and debugging support to meet policy, security, and scalability needs.
The Condor team has crafted specific tests as part of our support for OSG associated groups. For
example, LIGO's work pushes the limits of Condor DAGMan, so specific extremely large scale
tests have been added and are regularly run to ensure that new releases will continue to meet LIGO's needs. The Condor team has developed and performed multiple stress tests to ensure that
Condor-G's ability to manage large GRAM and CREAM job submissions will scale to meet ATLAS needs.
As part of our community support for Condor, we organize and lead an annual workshop called
Condor Week. This past year’s Condor Week 2011 was held on May 2-6, 2011 at the Wisconsin
Institute for Discovery Town Center and Teaching Labs8. Condor Week 2011 was held in conjunction with the Paradyn Project. There were 147 attendees, with 110 attendees at solely the
Condor events. There were 44 speakers at Condor Week over four days. There were attendees
from industry, government and universities. Industry attendees included representatives from
Cisco and DreamWorks Animation. Government attendees included representatives from
Brookhaven National Laboratory and Fermi National Accelerator Laboratory. University representatives included attendees from the University of Notre Dame, the University of Southern California, and the University of Nebraska-Lincoln. There were also representatives from universities and companies in countries around the world, including Australia, Spain, Germany, and Italy.
4.3 High Throughput Parallel Computing
With the advent of 8-, 16-, and soon 32-core commodity CPU systems, OSG stakeholders have shown an increased interest in computing that combines small-scale parallel applications with large-scale high throughput capabilities, i.e., ensembles of independent jobs, each using 8 to 64 tightly coupled processes. The OSG “HTPC” program is funded through a separate NSF grant to evolve the technologies, engage new users, and support the deployment and use of these applications.
HTPC is an emerging paradigm that combines the benefits of High Throughput Computing with small-scale parallel computing. One immediate benefit is that parallel HTPC jobs are far more portable than most parallel jobs, since they do not depend on the nuances of parallel library software versions and machine-specific hardware interconnects. For HTPC, parallel libraries are packaged and shipped along with the job. This pattern allows for two additional benefits: first, there are no restrictions on the method of parallelization, which can be MPI, OpenMP, Linda, or any other approach; second, the libraries can be optimized for on-processor communication so that these jobs run optimally on multi-core hardware.
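To make the packaging pattern concrete, the sketch below (written in Python purely for illustration) builds and submits a Condor description for an 8-way whole-node job that carries its own MPI runtime with it. The wrapper script, file names, and the whole-machine ClassAd attributes are assumptions standing in for site-specific HTPC conventions; they are not standard Condor keywords or the program's actual recipe.

import subprocess
import textwrap

# Illustrative sketch only: an 8-way HTPC job that ships its own MPI runtime.
submit_description = textwrap.dedent("""\
    universe                = vanilla
    executable              = run_mpi_app.sh
    arguments               = 8
    # Ship the application and a matching MPI runtime, so the job does not
    # depend on library versions or interconnects at the execution site.
    transfer_input_files    = my_app, mpi_runtime.tar.gz
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    # Ask for a whole multi-core node; these two attributes are assumed HTPC
    # site conventions, not built-in Condor submit commands.
    +RequiresWholeMachine   = True
    requirements            = (CAN_RUN_WHOLE_MACHINE =?= True)
    output                  = htpc_$(Cluster).out
    error                   = htpc_$(Cluster).err
    log                     = htpc_$(Cluster).log
    queue
    """)

with open("htpc_job.submit", "w") as handle:
    handle.write(submit_description)

# Hand the description to Condor; assumes condor_submit is on the PATH.
subprocess.check_call(["condor_submit", "htpc_job.submit"])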
The work advanced significantly this year as the groundwork was laid for using the OSG Glidein mechanism to submit jobs. The implication is that users will soon be able to submit and manage HTPC jobs as easily as they do ordinary HTC jobs via GlideinWMS.
Recently, efforts on this project have focused on:
• Documenting how to submit applications to the HTPC environment.
• Extending the OSG information services to advertise support for HTPC jobs.
• Extending Glide-in technology so that users can use this powerful mechanism to submit and manage HTPC jobs. All production job submissions now utilize Glide-in technology.
To date, applications from the fields of chemistry, weather, and computational engineering have been run across 7 sites – Oklahoma, Clemson, Purdue, Wisconsin, Nebraska, UCSD and the CMS Tier-1 at Fermilab. These represent new applications on the OSG. In effect, HTPC is making the OSG infrastructure available to new classes of researchers for whom the OSG was previously not an option.
8 http://www.cs.wisc.edu/condor/CondorWeek2011
Both ATLAS and CMS are actively working to take advantage of HTPC. As they continue to utilize it, HTPC capabilities will be rolled out more broadly across the Tier-2 sites.
We have logged nearly 15M hours since the first HTPC jobs ran in November of 2009. The work is being watched closely by the HTC communities, who are interested in taking advantage of multi-core hardware without adding a dependency on MPI. We have also learned that the HTPC model is key to enabling scientists who need to use GPUs, which appears to be an emerging paradigm for the OSG in the near future.
4.4 Advanced Network Initiative (ANI) Testing
The ANI project's objective is to prepare typical OSG data transfer applications for the emergence of 100Gbps WAN networking. This is accomplished in close collaboration with OSG, both contributing to OSG by testing OSG software and benefitting from OSG. Thus we look to shrink the time between 100Gbps network links becoming available and the OSG science stakeholders being able to benefit from their capacity. This project is focusing on the following areas:
• Creation of an easy-to-operate “load generator” that can generate traffic between Storage Systems on the Open Science Grid (a rough sketch of such a generator follows this list).
• Instrumenting the existing and emerging production software stack in OSG to allow benchmarking.
• Benchmarking the production software stack, identifying existing bottlenecks, and working with the external software developers on improving the performance of this stack.
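A minimal sketch of such a load generator, written in Python for illustration, is given below: it simply loops GridFTP transfers between two storage endpoints using the standard globus-url-copy client and reports the elapsed time of each transfer. The endpoint URLs, file name, and concurrency settings are placeholders, and this is not a description of the actual OSG load generator.

import subprocess
import time

# Placeholder endpoints; real tests would use OSG storage element URLs.
SRC = "gsiftp://source-se.example.org/data/testfile_1GB"
DST = "gsiftp://dest-se.example.org/scratch/loadgen/testfile_1GB"

TRANSFERS = 10          # number of back-to-back transfers to run
PARALLEL_STREAMS = 4    # parallel TCP data streams per transfer

for i in range(TRANSFERS):
    start = time.time()
    # globus-url-copy is the standard GridFTP command-line client;
    # -p requests multiple parallel data streams.
    subprocess.check_call(
        ["globus-url-copy", "-p", str(PARALLEL_STREAMS), SRC, DST])
    elapsed = time.time() - start
    print("transfer %d finished in %.1f s" % (i + 1, elapsed))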
The following is a summary of achievements related to the ANI project that we accomplished in 2010.
• Architecture design of OSG/HEP Application with 100Gb/s connection
Through intensive discussion with other groups in the ANI project and OSG colleagues, a high-level design of the OSG/HEP application architecture was documented that is consistent with both the envisioned ANI and the LHC data grid architectures. This was documented in Specification of OSG Use Cases and Requirements for the 100 Gb/s Network Connection (OSG-doc-1008).
• Building hardware test platform
We built a test cluster at UCSD to conduct various tests involving the core technologies for the OSG/HEP applications for the ANI project. The cluster has all the necessary components to function as an SE, including BeStMan, HDFS, GUMS, GridFTP, and FUSE, and is also used for other types of data transfer tools, e.g. Fast Data Transfer (FDT). This test platform has been used for the transaction rate tests described in OSG-doc-1004, as well as the 2009 and 2010 Supercomputing bandwidth challenges.
• Validating Hadoop Distributed File System
We previously validated the Hadoop Distributed File System (HDFS) as a key technology of an OSG Storage Element (SE). We gave a presentation on this at the International Symposium on Grid Computing (ISGC) 2010 with the title: Roadmap for Applying Hadoop distributed file system in Scientific Grid Computing.
• Test of Scalability of Storage Resource Manager (SRM) System
We conducted transaction rate scalability tests of two different versions of the BeStMan SRM. These two versions are based on two different Java container technologies, Globus and Jetty. Our tests contributed to the release of the new Jetty-based BeStMan-2 by improving the configuration and documenting the corresponding performance as compared with the previously available Globus-based implementation on the same hardware. We worked closely with various parties, including the OSG Storage group for Hadoop distribution and packaging and the BeStMan development team at LBNL, and used the new glidein-based scalability tool from OSG Scalability. The results of the BeStMan tests were documented in Measurement of BeStMan Scalability (OSG-doc-1004). The use of glideinWMS and glideinTester is documented in Use of Late-binding technology for Workload Management System in CMS (OSG-doc-937).
• Test of WAN data transfer tools and participation at Supercomputing 2010
Various WAN data transfers have been tested between UCSD and other CMS Tier-2 sites. We presented our results on networking configuration, storage architecture, and data transfer at the 18th International Conference on Computing in High Energy Physics (CHEP) 2010 under the title: Study of WAN Data Transfer with Hadoop-based SE.
• Detailed test plan for ANI
We contributed to the development of the ANI testing plan. The present plan assumes that two types of application tests will be run. In both cases, NERSC will function as the source, and one of either ANL or ORNL will function as the sink of data. In the first test, data will be copied from storage at NERSC straight to memory at ANL and consumed there by applications running on the Magellan hardware at ANI. No storage at ANL is involved in this test. In the second test, data will be copied to storage at ORNL.
4.5 ExTENCI
The goal of the Extending Science Through Enhanced National Cyberinfrastructure (ExTENCI) project is to develop and provide production-quality enhancements to the national cyberinfrastructure that enable specific science applications to more easily use both the Open Science Grid (OSG) and TeraGrid (TG), or that broaden access to a capability for both TG and OSG users.
ExTENCI is a combination of four technology components, each run as a separate project with one or more science projects targeted as the first users. The components and sciences are: Workflow & Client Tools – SCEC and Protein Modeling; Distributed File System – CMS/ATLAS; Virtual Machines – STAR and CMS; and Job Submission Paradigms – Cactus Applications (EnKF and Coastal Modeling).
Although they are independent projects, there are linkages between them. For example, experiments in the Workflow & Client Tools project are planning to use the Distributed File System and Job Submission Paradigms capabilities. The CMS experiment is actively investigating the use of the Distributed File System as well as Virtual Machines. After 10 months, significant progress has been made by each of the projects.
Workflow & Client Tools has been working in two main areas: earthquake hazard prediction (from the SCEC project) and protein structure analysis. We have demonstrated a method using the Swift workflow engine that can obtain large numbers (~2000) of nodes from OSG Engage Virtual Organization (VO) sites, built and generalized a framework that allows this to be done, and demonstrated it with another application, glass material modeling. This has allowed running instances of straightforward applications (protein and glass modeling) on OSG as well as on TG. The SCEC application has been run on both OSG and TG; but in this case, the intense data transfer needs required reformulating the application to limit data transfers to cases where the amount of computation to be done on the data is sufficiently large. In all applications, current work is on optimizing the methods for determining how best to divide the work between OSG and TG.
Distributed File System has set up the hardware at three sites, installed the Kerberos security infrastructure, added Kerberos to the Lustre/WAN file system, and installed the Lustre server
software at two sites (as planned) and Lustre client software at the two sites, and is now doing
performance testing between sites. Some performance issues have been identified and tests of
possible solutions are in progress. Science application testing was started but is delayed until the
performance issues are resolved.
Virtual Machines has deployed hypervisors on a large scale at Clemson (OSG) and Purdue (TG)
and verified that the STAR VM can be run on OSG and TG and that the CMS VM runs on TG.
We have tested and documented Condor-G’s support of cloud services, created a report on tools
used to author VMs and manage libraries of such machines, and have cataloged VM distribution
mechanisms in preparation for developing tools to assist users in these areas.
Job Submission Paradigms has completed the overall design and has implemented the SAGA
Condor-G adaptor that is now being tested by sending jobs to both TG and OSG. Verification of
operation and performance of a science application is awaiting completion of this testing.
Below is a graph of usage by sciences now able to use OSG via new ExTENCI capabilities. The first phase of the work is complete, and testing will resume once the problems discovered have been resolved.
Figure 50: ExTENCI Usage of OSG
Another important benefit of ExTENCI has been to provide an opportunity for increased joint
activity and collaboration between OSG and TG. Each of the projects has fostered sharing of
information, joint problem solving, and combined teams to better serve users across both environments, and thus paves the way to OSG becoming a Service Provider in XSEDE.
4.6 Corral WMS
Under the NSF OCI-0943725 award, the University of Southern California, Fermi National Accelerator Laboratory, and the University of California San Diego have been working on the CorralWMS integration project, which provides an interface to resource provisioning across national as well as campus cyberinfrastructures. Software initially developed under the name Corral now extends the capabilities of glideinWMS. The resulting product, glideinWMS, is one of the main workload management systems on OSG and enables many major user communities to efficiently use the available OSG resources. The current users are the particle physics experiments CMS, CDF, and GlueX; the structural biology research communities SBGrid and NEBioGrid; the nanotechnology research community NanoHUB; the engineering community NEES; the campus grid communities of the University of Nebraska (HCC), the University of Wisconsin (GLOW), and the University of California San Diego (UCSDGrid); the Northwest Indiana Computational Grid (NWICG); the Southern California Earthquake Center (SCEC); and the OSG Engage VO.
Corral, a tool developed to complement the Pegasus Workflow Management System, was recently built to meet the needs of workflow-based applications running on the TeraGrid. It is being
used today by the Southern California Earthquake Center (SCEC) CyberShake application. In a
period of 10 days in May 2009, SCEC used Corral to provision a total of 33,600 cores and used
them to execute 50 workflows, each containing approximately 800,000 application tasks, which
corresponded to 852,120 individual jobs executed on the TeraGrid Ranger system. The 50-fold
reduction from the number of workflow tasks to the number of jobs is due to job-clustering features within Pegasus designed to improve overall performance for workflows with short duration
tasks.
GlideinWMS was initially developed to meet the needs of the CMS (Compact Muon Solenoid)
experiment at the Large Hadron Collider (LHC) at CERN. It generalizes a Condor glidein system
developed for CDF (The Collider Detector at Fermilab) first deployed for production in 2003. It
has been in production across the Worldwide LHC Computing Grid (WLCG), with major contributions from the Open Science Grid (OSG) in support of CMS for the past two years, and has
recently been adopted for user analysis. Over those two years, CMS alone has used glideinWMS
to consume more than 29 million CPU hours, and has had over 15,000 concurrently running jobs.
The integrated CorralWMS system, which will retain the glideinWMS product name, includes a new version of Corral as a frontend. It provides a robust and scalable resource provisioning service that supports a broad set of domain application workflow and workload execution environments. The system enables workflows to run across local and distributed computing resources, the major national cyberinfrastructure providers (Open Science Grid and TeraGrid), as well as emerging commercial and community cloud environments. The Corral frontend handles the end-user interface and the user credentials, and determines when new resources need to be provisioned. Corral then communicates the requirements to the glideinWMS factory, and the factory performs the actual provisioning.
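The code below is a purely conceptual sketch of the kind of decision the frontend makes when deciding whether to ask the factory for more glideins; none of the function or variable names correspond to the actual Corral or glideinWMS code or their APIs.

def glideins_needed(idle_jobs, running_glideins, pending_glideins,
                    jobs_per_glidein=1, max_pending=100):
    """Estimate how many additional glideins to request from the factory."""
    # Resources already running or on their way cover roughly this many jobs.
    covered = (running_glideins + pending_glideins) * jobs_per_glidein
    deficit = max(0, idle_jobs - covered)
    wanted = -(-deficit // jobs_per_glidein)  # ceiling division
    # Never flood the factory: cap the number of outstanding requests.
    return min(wanted, max(0, max_pending - pending_glideins))


# Example: 500 idle user jobs, 120 glideins running, 30 still pending.
print(glideins_needed(idle_jobs=500, running_glideins=120, pending_glideins=30))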
Communities using the Corral frontend on OSG include SCEC and LIGO. The SCEC workflows described above start off with a couple of large earthquake simulation MPI jobs, and are followed by a large set of serial jobs to determine how different sites in the Los Angeles region would be affected by the simulated earthquake. SCEC has been a long-time Corral user and requirements driver, and has been using Corral in production for runs on TeraGrid. As a demonstration of how CorralWMS can be used across cyberinfrastructures, a SCEC workflow was planned to execute MPI jobs on TeraGrid and serial workloads on OSG using the glideinWMS system. Four such runs were completed successfully.
The CorralWMS project supports the OSG-LIGO Taskforce effort, whose mission is to enable LIGO Inspiral workflows to perform better in the OSG environment. One problem LIGO had was that short tasks in the workflows were competing with a large number of long-running LIGO Einstein@Home jobs. When submitted to the same OSG site, the workflows were essentially starved. With the glideinWMS glideins, the workflow jobs are now on a more equal footing, as the glideins retain their resources longer and during their lifetime can service many workflow jobs. The glideinWMS glideins also helped when running multiple workflows at the same time, as job priorities were used to overlap data staging jobs and compute jobs. This was not possible when relying on the remote Grid site’s batch system to handle scheduling rather than the glideinWMS scheduler. Recently, LIGO has also used the system to support additional LIGO workflows and scientists. One of those, a workflow called Powerflux, has fewer data dependencies than the Inspiral workflow; because of that, the workflow is easier to spread out across a large number of computational resources at the same time, which makes the workload a great fit for OSG.
The CorralWMS system also contributes to glideinWMS factory development. During the year, many new features have been implemented. It is now possible to pass project information between the frontend and the factory, and for the frontend to specify that large groups of glideins should be submitted as one grid job. These features are mainly intended to make the system work better with TeraGrid, but they are also required to make workloads run across the infrastructures. The UCSD group is operating a production glidein factory gathering resources from hundreds of Grid sites all over the world, open to several user communities. Compared to last year, the number of communities served has more than doubled. The glidein factory is now regularly providing around 10,000 CPUs for them to use.
Figure 51: CPUs provided by the glidein factory
Each community operates its own frontend, with most of the frontend instances being located
outside UCSD. The amount of administration work needed by each of the frontend operators is
minimal. Most of the operational work stems from problems developing on any of the many Grid
sites providing the resources, which is handled almost entirely by the glidein factory operators,
thus effectively shielding the frontend administrators.
The CorralWMS project has recently contributed sessions to the OSG Grid Schools in Madison
and Sao Paulo, and gave a presentation at Condor Week 2011.
4.7 OSG Summer School
OSG is committed to educating students and researchers in the use of Distributed High-Throughput Computing to enable scientific computation. Our staff provides proactive support in workshops and collaborative efforts to help define, improve, and evolve the US national cyberinfrastructure. As part of this support, we were awarded support to conduct a four-day school in high-throughput computing (HTC) in July 2010 at the University of Wisconsin–Madison. This was a joint proposal with TeraGrid, and all seventeen students who participated in the OSG summer school also participated in the TeraGrid 2010 conference. In June 2011, we are offering this school again, expanding it to include twenty-five students.
An important focus of the school is direct, guided hands-on experience with HTC; students learn how to use HTC in a campus environment as well as for large-scale computations with OSG. The
students receive a variety of experiences including running basic computations, large workflows,
scientific applications, and pilot-based infrastructures as well as how to use storage technologies
to cope with large quantities of data. The students get access to multiple OSG sites so that they
can experience scaling of computations. They use exactly the same tools and techniques that
OSG users currently use, but we emphasize the underlying principles so they can apply what
they learned more generally.
In both years of this summer school, students came from a wide variety of scientific disciplines:
physics, biology, GIS, computer science, and more. To provide students with the best learning
opportunities, we recruit instructors from staff who are experienced with the relevant technologies to develop, teach, and support the school. Most instructors are OSG staff, and a few come
from other projects: TeraGrid, Globus, the Middleware Security and Testing research project,
and the Condor Project.
A highlight of the school is the HTC Showcase, in which local scientists give lectures about their
experiences with HTC. They showcase how the use of HTC expanded not only the amount of
science they could do, but the kinds of science they could do.
During and after the school, we ask students to evaluate their experiences. Last year, an independent researcher examined these evaluations. While this examination is too detailed to include
here, an extract of the summary said:
Overall, the majority of the participants were very happy with how the conference went, and raved about the accommodations and organization of it. The majority were also very happy with the instructors and the presentations and activities within the sessions.
The OSG Summer School has web pages for each year, including curricula and materials:
https://twiki.grid.iu.edu/bin/view/Education/OSGSummerSchool2010
https://twiki.grid.iu.edu/bin/view/Education/OSGSummerSchool2011
In addition, an article about the school appeared in International Science Grid This Week:
http://www.isgtw.org/?pid=1002696
After each school is completed, we assign OSG staff as mentors to the students, and they make
regular contact with the students so they can provide help and deepen students’ participation in
the broader HTC community. Students are paired with staff based on factors such as common
interests, organizational memberships, and geography. Many of these students make use of this
mentorship program to help integrate HTC into their research.
4.8 Science enabled by HCC
The Holland Computing Center (HCC) of the University of Nebraska (NU) began to use the OSG to support its researchers during 2010, and usage is continuing and growing. HCC’s mission is to meet the computational needs of NU scientists, and it does so by offering a variety of resources. In the past year, the OSG has served as one of the resources available to our scientists, leveraging the expertise from the OSG personnel and from running the CMS Tier2 center at the site. New applications from the humanities and fine arts have been deployed over the last year, as described in more detail below.
Support for running jobs on the OSG is provided by our general-purpose applications specialists, and we have only one graduate student dedicated to using the OSG.
To distribute jobs, we primarily utilize the Condor-based GlideinWMS system originally developed by CDF and CMS (now independently supported by the NSF). This provides a user experience identical to using the Condor batch system, greatly lowering the barrier to entry for users. HCC’s GlideinWMS installation has been used as a submission point for other VOs, particularly the LSST collaboration with OSG Engage. The GlideinWMS work led a CS graduate student to write his master’s thesis on grid technologies for local campus grids and campus bridging. In particular, he created the campus grid factory, a local version of the pilot job factory used by GlideinWMS.
For data management, we leverage the available high-speed network and Hadoop-based storage
at Nebraska. For most workflows, data is moved to and from Hadoop, sometimes resulting in
multiple terabytes of data movement a day.
In the last year and a half, 6 teams of scientists and the 2 new teams mentioned above have run
over 15 million hours on the OSG; see Figure 52. This is about 10% of our active research teams
at HCC, but over 20% of HCC computing (only counting computing done opportunistically at
remote sites). Less precise figures are available for data movement, but it is estimated to be
about 50TB total.
Figure 52: Monthly wall hours per facility
Applications we have run on the OSG in the past year are:
• TreeSearch, a platform for the brute-force discovery of graphs with particular structures. This was written by a Mathematics doctoral candidate and accumulated 4.6 million wall hours of runtime. Without OSG, HCC would not have been able to provide a sufficient number of hours for this student to finish this work.
• DaliLite, a biochemistry application. This was a one-off processing run needed because reviewers of a paper requested more statistics. It was moved from a small lab cluster to the OSG in a matter of days and accumulated over 120 thousand runtime hours. Without OSG, the scientists would not have been able to complete their paper in time.
• OMSSA, an application used by a researcher at the medical school. This was another example where a researcher who had never used HCC clusters discovered he needed a huge amount of CPU time in order to make a paper deadline in less than 2 weeks. As with DaliLite, the researcher would not have made the deadline running only on HCC resources.
• Interview Truth Table Generator for Boolean models, developed by a mathematical biology research group at the University of Nebraska-Omaha, is a Java-based tool to generate Boolean truth tables from user data supplied via a web portal. Depending on input size, as a single process the tool required several days to complete one table. After a small modification to the source code, the Condor glidein mechanism was utilized to deploy jobs to both HCC and external resources, enabling a significant reduction in runtime. For one particular case with a table of 67.1 million entries, runtime was reduced from approximately 4 days to about 1 hour.
• CPASS, another biochemistry application. This was our first web application converted to the grid; Condor allows a user to submit a burst of jobs which first fill the local clusters, then migrate out to the grid if excess capacity is needed.
• AutoDock and CS-ROSETTA, two smaller-scale biochemistry applications brought to HCC and converted to the grid. The work done with these was essential for a UNL graduate student to finish his degree work.
• Digital Media Generation: Jeff Thompson, assistant professor of New Genres and Digital Arts in the Department of Art & Art History, produces work investigating uses of high-powered computing for the creation of visual and sonic artworks. A current project, titled Every Nokia Tune and begun in March 2011, uses the Open Science Grid to visualize all 6 billion unique combinations of the Nokia Tune ringtone, the most ubiquitous music in the world, heard ca. 20,000 times per second. The final visualization, if printed, would stretch for 42 kilometers. As with scientific projects, the scale that HCC offers for artistic projects will spur innovation in the field of digital arts; while taking only an afternoon on the OSG, “Every Nokia Tune” would take approximately 5,700 years if computed on a desktop machine, making the project impossible there.
• Digital Research in the Humanities: This work at the Center for Digital Research in the Humanities (CDRH) with Brian Pytlik-Zillig is substantially focused on text analysis in general, and on n-gram (pattern sequence) studies in particular. Using Pytlik-Zillig’s TokenX software, which UNL offers as a free licensed download, CDRH constructs data sets of use to researchers in a variety of disciplines not limited to the humanities. N-gram data sets make it possible for scholars to observe and count patterns of words, parts of speech, and other sequences over time, by genre, and so on. TokenX works serially to extract n-gram data from groups of texts (text corpora) and to create structures that align the data in traditional tables where each column represents a text and every row signifies a specific n-gram, with counts of actual occurrences. Separate databases are constructed for n-grams (1-grams, 2-grams, up through 5-grams) that occur more than once. Serial n-gram processing begins to show scalability problems when there are more than about three hundred texts in a given corpus. A full TokenX n-gram extraction for a corpus of 300 novel-length texts takes more than a month to process. With a surge in the digitization of texts, larger and larger groupings of texts are possible and desirable. In the next six months, we hope to perform n-gram extractions on a corpus of thirty thousand texts representing approximately 8 gigabytes of content. With serial extraction, construction of the n-gram database would take more than eight years to complete. We are therefore working to leverage the existing TokenX code base to take advantage of parallel processing over a computer network, with the desired throughput to be measured in weeks or months rather than years; a sketch of the per-text step in such a parallel extraction follows this list.
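The sketch below, written in Python for illustration, shows what the per-text step of such a parallel extraction could look like: one independent grid job counts the 1- through 5-grams of a single text and writes out those occurring more than once, leaving a later merge step to assemble the text-by-n-gram table. It is a simplified stand-in rather than the TokenX code, and the whitespace tokenization is an assumption.

import json
import sys
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count all 1- through max_n-grams that occur in a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    text_file = sys.argv[1]              # one text per grid job
    with open(text_file) as handle:
        tokens = handle.read().split()   # TokenX does real tokenization
    counts = ngram_counts(tokens)
    # Keep only n-grams occurring more than once, as the CDRH databases do.
    repeated = {gram: c for gram, c in counts.items() if c > 1}
    json.dump(repeated, sys.stdout)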
HCC has gone from almost zero local usage of the OSG in 2009 to millions of CPU hours per year. This has been done without local OSG funding for user support (there are other OSG personnel and one student who do share expertise). The OSG is an important part of our “toolbox” of solutions for Nebraska scientists. The OSG is not a curiosity or a toy for HCC, but something we depend on not only to offload jobs, but to support science, and now other scholarly research, which could not have been completed with HCC resources alone.
4.9 Virtualization and Clouds
OSG staff and contributors continue to explore the technologies and services needed to support virtualization and scientific and commercial clouds. Focused work is happening, and is reported, as part of the ExTENCI satellite project, with the new technology area within OSG providing the connection to OSG. This has included extended testing and initial use of GlideinWMS access through EC2 interfaces to Amazon and the DOE Magellan resource at NERSC; the US ATLAS, HCC and LIGO VOs have used this resource to date. Small samples of CMS Monte Carlo have been run on these infrastructures.
4.10 Magellan
OSG communities continue to access the DOE Magellan resources at the National Energy Research Scientific Computing Center. In the course of enabling use of Magellan, the OSG client
tools have been extended.
• Condor’s Amazon EC2 interface was generalized to work with third-party clouds that support the EC2 specification, including Magellan (an illustrative submit sketch follows this list).
• GlideinWMS was extended to enable submission to the Amazon and Magellan clouds using the generic Condor framework. This enables users to transparently use the same Condor execution environment on the OSG and the cloud.
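As an illustration of what such a submission looks like, the sketch below (Python, for consistency with the other examples) writes a grid-universe submit description aimed at an EC2-compatible endpoint and hands it to condor_submit. The endpoint URL, AMI identifier, instance type, and credential file paths are placeholders, and the keyword names follow the Condor manual's EC2 grid type, whose exact name has varied across Condor versions; treat this as an assumption-laden sketch rather than the configuration used on Magellan.

import subprocess
import textwrap

# Illustrative sketch only; none of these values refer to real endpoints.
submit_description = textwrap.dedent("""\
    universe              = grid
    # Any service that speaks the EC2 interface can be named here.
    grid_resource         = ec2 https://cloud.example.org:8773/services/Cloud
    executable            = cloud_worker_vm
    ec2_ami_id            = ami-00000000
    ec2_instance_type     = m1.large
    # Paths to files holding the (placeholder) access and secret keys.
    ec2_access_key_id     = /home/user/ec2.access
    ec2_secret_access_key = /home/user/ec2.secret
    log                   = cloud_$(Cluster).log
    queue
    """)

with open("cloud_vm.submit", "w") as handle:
    handle.write(submit_description)

subprocess.check_call(["condor_submit", "cloud_vm.submit"])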
The fixes for Condor and GlideinWMS were made by their respective developers, but were motivated by the testing coordinated by the OSG. The OSG and Magellan hold monthly meetings to coordinate our efforts for cloud submission. The Magellan team at NERSC has installed a generic OSG Compute Element on Magellan that opportunistic OSG VOs have successfully utilized. The accounted use of the cloud resources at NERSC is: HCC VO (user Derek Weitzel), 103,091 CPU hours; LIGO VO (user Robert Engel), 425,480 CPU hours.
4.11 Internet2 Joint Activities
Internet2 is an advanced networking consortium led by the research and education community. An exceptional partnership spanning U.S. and international institutions that are leaders in the worlds of research, academia, industry, and government, Internet2 is developing breakthrough cyberinfrastructure technologies that support the most exacting applications of today and spark the most essential innovations of tomorrow. Internet2 regularly collaborates with members of the research community, including the OSG and its associated user community, to develop tools and services and to advance the use and awareness of cutting-edge networking technology.
One of Internet2’s many areas of focus involves the monitoring and management of network
traffic, with an emphasis on identification and correction of network performance problems
through software and personalized training. Performance problems can be challenging for users
and operators alike – the success of bulk data movement and remote collaboration often requires
flawless network performance, free of defects that may induce packet loss or designs that fail to
deal with congestion. Advanced tools, developed by Internet2 and partners, often mitigate these
problems and improve the overall grid computing experience for end users from scientific communities.
Over the past year, Internet2 has focused on designing and constructing software to consume performance metrics from the distributed monitoring infrastructure present at many OSG-affiliated end sites. Network performance metrics, including bandwidth, latency, packet loss, and link utilization, are valuable indicators for discovering potential problem areas as they happen, rather than at a later point in time. Using these early warning components through a reporting framework, such as OSG Gratia and RSV, we provide sites a valuable record of performance over time and allow them an instant view into potential network complications. Internet2 engineers, in collaboration with the USATLAS Virtual Organization (VO) and developers from Brookhaven National Laboratory, have made these probes available through the perfSONAR-PS software framework. This, in conjunction with the development of several graphical interfaces designed to give operators and end users a high-level overview of network performance across the VO, will provide a valuable lens into operational concerns in the OSG community.
Internet2 continues to work with OSG software developers to update the advanced network diagnostic tools already included in the VDT software package. These client applications allow
VDT users to verify the network performance between end site locations and perfSONAR-PS
based servers deployed on campus, regional, backbone, and exchange point networks by allowing on-demand diagnostic tests to be run. The tools enable OSG site administrators and end users
to test any individual compute or storage element in the OSG environment thereby reducing the
time it takes to diagnose performance problems. They allow site administrators to more quickly
determine if a performance problem is due to network specific problems, host configuration issues, or application behavior. The VDT installation has proven to be a particularly useful addition for debugging problems directly related to the storage and compute portions of the infrastructure, particularly when these components are located deep within the core of a network and
may experience connectivity that travels through many interim locations.
In addition to deploying client tools via the VDT, Internet2 staff, working with partners in the
US and internationally, have continued to support and enhance the pS Performance Toolkit, a
complete Linux distribution that contains fully configured performance tools and targeted graphical interfaces to monitor network performance. When used as a “Live CD”, the tools are instantly available and can be deployed anywhere within a network; the alternative method involves dedicating a specific measurement point and installing directly to the hard disk of the target machine. Either use case allows OSG site-admins to quickly stand up a perfSONAR-PS
based server to support end users. These perfSONAR-PS hosts automatically register their existence in a distributed global database, making it easy to find new servers as they become available. All perfSONAR-PS software is maintained in a repository that allows for security updates
and software enhancements; this is normally transparent to the operators and end users. It is also
possible to adopt specific packages from these software storehouses to customize the installation
to the needs of the deploying site.
These measurement servers provide two important functions for OSG site administrators. First, they provide an end point for the client tools deployed via the VDT package: OSG users and site administrators can run on-demand tests to begin troubleshooting performance problems. Second, they host regularly scheduled tests between peer sites, allowing a site to continuously monitor the network performance between itself and the peer sites of interest. The USATLAS community has deployed perfSONAR-PS hosts and is currently using them to monitor network performance between the Tier-1, Tier-2, and a growing number of Tier-3 sites. Internet2 attends weekly USATLAS calls to provide on-going support of these deployments, and has released regular bug fixes. Finally, on-demand testing and regular monitoring can be performed to both peer sites and the Internet2 or ESnet backbone networks using either the client tools or the perfSONAR-PS servers.
The development of tools and strategies to address network performance problems at the site level is one aspect of Internet2’s mission. Training services, particularly in the installation and use of these tools to diagnose network behavior, are a vital part of Internet2’s role in the R&E networking community. In the past year Internet2 has participated in several OSG site-admin workshops and the annual OSG all-hands meeting, and interacted directly with the LHC community to determine how the tools are being used and what improvements are required. Internet2 has provided hands-on training in the use of the client tools, including the command syntax and interpreting the test results. Internet2 has also provided training in the setup and configuration of the perfSONAR-PS server, allowing site administrators to quickly bring up their servers. Finally, Internet2 staff have participated in several troubleshooting exercises; this includes running tests, interpreting the test results, and guiding OSG site admins through the troubleshooting process.
4.12 ESNET Joint Activities
OSG depends on ESnet for the network fabric over which data is transferred and for the ESnet DOE Grids Certificate Authority for the issuing of the X.509 digital identity certificates which are a key component of the OSG authorization methods. ESnet is part of the collaboration delivering and supporting the perfSONAR tools that are now in the VDT distribution. In addition, OSG staff and stakeholders make significant use of ESnet’s collaborative tools for telephone and video meetings.
ESnet and OSG cooperate on a service to provide X.509 digital certificates for both users and hosts of the organizations participating in the Open Science Grid. ESnet runs the DOEGrids Certification Authority and other infrastructure, while the OSG coordinates the registration process with all the participating organizations. Table 7 below lists the number of certificates issued to these organizations over the past few years, for all organizations which have received at least one host certificate in 2011.
Table 7: DOEGrid Certificate Issuance History

                  2006        2007        2008        2009        2010        2011
Organization   User  Host  User  Host  User  Host  User  Host  User  Host  User  Host
Total           296  1716  1122  4307  1013  6169  1103  7533  1163  8178   496  3091
CMS              28   288   197   744   153  1390   215  2401   264  2490    94  1083
FNAL            107   966   224  1591   175  2250   194  2589   147  3044    68   882
BNL              11    32    36   807    31   151    18  1369    16  1303     7   370
ATLAS            39   124   219   381   234  1524   267   331   263   362   113   275
OSGVO            37    73   100   267   106   204    51   221    49   158    27   173
LIGO             32   157   118   317   100   389    97   249    88   264    38    84
OSGEDU            0     0   110     6    71     6    77    76   142   238    58    57
DZERO             2    28     6    59     2    52     4    51     0    65     0    30
ENGAGE            1     0    30     8    63    25    71    35    54    50    15    18
DOSAR             3     4     0    14     1    19     5    48     6    32     3    17
LCG              11    10    17    38     4    50     8    28    12    29     5    17
MIS               2     1     8     2     6    15     2    35     4    23     3    11
NYSGRID           0     0    12     5    17    32    18    27     5    17     6    10
SBGRID            0     0     3     3     9    15    17     7    30     9    16    10
GLOW              4     8    10    21    15    20    13    12    22     9     4     9
SLAC              5    16     9    25     7    10     9    22     7     9     3     8
ALICE             0     0     0     0     0     0    11     0    26     2    16     6
GLUEX             0     0     0     0     0     0     2     4     3    34     1     5
NWICG             8     2    14     8    11     6    18     7     9    16     5     5
NEBIOGRID         0     0     0     0     0     0     1     0     5     6     0     5
PHENIX            6     7     7     6     4     4     2     6     1     5     0     5
SGVO              0     0     0     0     0     0     0     0     0     0    10     5
CIGI              0     0     2     5     4     7     0    13     2     8     0     3
ICECUBE           0     0     0     0     0     0     3     2     7     4     1     2
DAYABAY           0     0     0     0     0     0     0     0     1     1     3     1
In addition to this production service, the ESnet ATF group and OSG interact on a number of activities and development projects. The most significant of these concerns the Science Identity Federation (SIF), where ESnet is playing a leading role in supporting the DOE science laboratories in joining the InCommon Federation and using federated identity providers and services. ESnet has negotiated an agreement with InCommon under which the DOE ASCR Office funds the membership dues for the laboratories, and provides assistance to each laboratory in the process of joining InCommon. At this point, several laboratories have completed this process and a few have set up or are in the process of deploying Shibboleth Identity Providers and Service Providers. We also partner as members of the identity management accreditation bodies in the Americas (TAGPMA) and globally (the International Grid Trust Federation, IGTF).
ESnet and OSG have been working on the next revision of the DOEGrids CA infrastructure. This is primarily a deployment change in the DOEGrids CA components, distributing them around the country to provide a much more reliable and resilient infrastructure while maintaining the essential service characteristics. However, a new RA service is being added, allowing the different service communities of the DOEGrids CA to be separated from each other and enabling significant improvements in flexibility and integration with each service community's work flow for user and resource certification. OSG and ESnet are also implementing contingency plans to make certificates and CA/RA operations more robust and reliable through replication, monitoring, coordinated testing of changes, and defined service performance objectives. As part of this initiative, we completed a working draft of a Service Level Agreement between the DOEGrids CA and OSG which is currently being circulated for management approvals.
5. Training, Outreach and Dissemination
5.1 Training
The OSG Training program brings domain scientists and computer scientists together to provide
a rich training ground for the engagement of students, faculty, and researchers in learning the
OSG infrastructure, applying it to their discipline and contributing to its development. During the
last year, OSG sponsored and conducted various training events. Training organized and delivered by OSG in the last year is identified in the following table:
Workshop                          Length    Location                Month
OSG Summer School                 4 days    Madison, Wisconsin      July, 2010
Site Administrators Workshop      2 days    Nashville, TN           Aug., 2010
OSG Storage Forum                 2 days    Chicago, IL             Sept., 2010
South American Grid Workshop      5 days    Sao Paulo, Brazil       Dec., 2010
OSG Summer School                 4 days    Madison, Wisconsin      June, 2011
The GridUNESP training workshop was held the first week of December in Sao Paulo, Brazil.
Around 40 students attended from multiple science disciplines and 23 institutions. The school,
focused on end-users’ needs, taught students step by step how to join the Grid community, adapt
their scientific applications to the new technology and use it efficiently. Also, students learned
how to run large-scale computing applications, first locally and then using the OSG. Two open
discussions took place, along with one Q&A session with Ruth Pordes and Miron Livny and
hands-on exercises; other OSG staff who lectured included Dan Fraser, Tanya Levshina, Zack
Miller, and Igor Sfiligoi. Joel Snow lectured on the experience of the D0 high-energy physics
experiment using OSG resources, enabling attendees to see how all the concepts and techniques
they learned during the week have been applied by a real experiment. A working meeting between the GOC’s Rob Quick and those developing and deploying a local grid operation center
allowed the sharing of ideas and experiences. Two special lectures by invited local speakers, one
on high performance computing and one on data storage, completed the school. Some comments
from students:
1. “The OSG School in Sao Paulo was very good. The Grid concepts were presented in a theoretical and practical way, with a focus on the main goal of any Grid infrastructure: ‘collaboration.’” - Charles Rodamilans, PhD student of Computer Engineering
2. “… the most impressive course that I have attended, not because of the size of the Open Science Grid, but because of the ability and motivation that instructors demonstrated throughout the course. They were informative and technically knowledgeable.” - Eriksen Costa, Software architect
Figure 53: Students at the Sao Paulo School
3. “The school was fantastic! … It gave me a broadened view of what a grid is, what it is
useful for and how I can use it.” - Griselda Garrido, neuroscience researcher
Other training activities included the organization of the Site Administrators workshop held in Nashville, TN in August 2010; support for VO-focused events such as the US CMS Tier 3 workshop; and the storage forum held in September 2010, which focused on communication of storage technologies and implementations at the sites. For the Site Administrators workshop, experts from around the Consortium contributed to course-ware tutorials that participants used during the event. Some forty site administrators, several brand new to OSG, participated and were impressed by the many experts on hand, which included not only the instructors but also seasoned OSG site administrators, all of whom were eager to help.
Members of the Condor team contributing to OSG participated in the joint CHAIN/EPIKH
School for Grid Site Administrators in China in May 2011, and demonstrated interoperation with
the European gLite middleware9.
OSG staff also participated as keynote speakers, instructors, and/or presenters at venues this year
as detailed in the following table:
Venue                                Length    Location               Month
Oklahoma Supercomputing Symposium    2 days    Norman, Oklahoma       Oct., 2010
EGI User Forum                       1 day     Vilnius, Lithuania     May 2011
5.2 Outreach Activities
We present a selection of the activities in the past year:
• The NSF Task Force on Campus Bridging
• HPC Best Practices Workshop
• Workshops on Distributed Computing, Multidisciplinary Science, and the NSF’s Scientific Software Innovation Institutes Program
• Chair of the Network for Earthquake Engineering Simulation Cyber infrastructure subcommittee of the Project Advisory Committee
• Member of the DOE Knowledge Base requirements group
• Continued co-editorship of the highly successful International Science Grid This Week newsletter, www.isgtw.org (see next section)
9 http://agenda.ct.infn.it/conferenceOtherViews.py?view=standard&confId=474
The Open Science Grid (OSG) achievements benefit from engaged communities of scientists from many disciplines, all collaborating in an effort to build, maintain, and use the US national distributed infrastructure. The continued success of the Open Science Grid depends on such sustained efforts to ensure the longer-term growth in the scale of the resources and the number of scientific users, to increase the functionality, usability, and robustness of the middleware, and to educate and train the future workforce. In support of this, the Open Science Grid Computer Science Student Fellowship provides one year of funding to a graduate or undergraduate student in Computer Science. The fellowship provides funding for up to 20 hours a week for computer science research of value both to the OSG and to the broader community. The research should be associated with an existing Virtual Organization already working with the OSG. The Fellowship is offered to the institutions that are funded as part of the OSG project at the time of application and delivery of the research. For 2010-2011, Derek Weitzel was the recipient of this support at the University of Nebraska – Lincoln.
During the OSG Fellowship, Derek Weitzel assisted in user support and campus grid development. He developed an initial prototype of the NEES workflow running on the OSG, and adapted the Cactus programming environment for execution on the OSG. Derek developed an OSG Job Visualization showing job movement across the OSG for display at the Supercomputing 2010 conference. He developed the Campus Factory, an open source application that creates a transparent interface between clusters on the Campus Grid. He also co-wrote the CHEP paper "Enabling Campus Grids with Open Science Grid Technology" and completed his Master's thesis, "Campus Grids: A Framework to Facilitate Resource Sharing"10.
In March of 2011 SBGrid leadership organized the annual Open Science Grid All-Hands Meeting. The event was well attended and received very good reviews from all participants (Figure
54).
10 https://osg-docdb.opensciencegrid.org:440/cgi-bin/ShowDocument?docid=1052
Figure 54: OSG All-Hands Meeting, held at Harvard Medical School
5.3 Internet dissemination
OSG co-sponsors the weekly online publication, International Science Grid This Week
(http://www.isgtw.org/), in collaboration with the European project e-Science Talk. The publication, which has published 228 issues as of 10 June 2011, is very well received, with subscribers
totaling approximately 8,000, an increase of about 30% in the last year. Combined with support
from NSF and DOE, OSG’s sponsorship has enabled the retention of a full-time US editor and
the launch in 1Q-2011 of a new website with more modern, user-friendly features.
ISGTW has won the trust of its readers in part because it is a joint publication that does not limit
itself to OSG-related content. The result is that the OSG content that is included is given more
weight by readers, making it a useful tool for dissemination.
The OSG also maintains a website, http://www.opensciencegrid.org. The site provides information and guidance for stakeholders and new users of the OSG. The front page features research and technology highlights, headlines linking to relevant news stories, and links to the archive of the OSG Newsletter. Visitors to the website can learn how to get started with grid computing, read about accomplishments of the OSG and its users, peruse a list of scientific papers
that have resulted from the OSG, view live grid status and usage statistics, and much more.
6. Cooperative Agreement Performance
OSG has put in place processes, activities, and reports that meet the terms of the Cooperative
Agreement and Management Plan:
• The Joint Oversight Team meetings are conducted, as scheduled by DOE and NSF, via phone to hear about OSG progress, status, and concerns. Follow-up items are reviewed and addressed by OSG, as needed.
• Two intermediate progress reports were submitted to NSF in February and June of 2007.
• In February 2008, a DOE annual report was submitted.
• In July 2008, an annual report was submitted to NSF.
• In December 2008, a DOE annual report was submitted.
• In June 2009, an annual report was submitted to NSF.
• In January 2010, a DOE annual report was submitted.
• In June 2010, an annual report was submitted to NSF.
• In December 2010, an annual report was submitted to DOE.
• In June 2011, this annual report was submitted to NSF.
As requested by DOE and NSF, OSG staff provides pro-active support in workshops and collaborative efforts to help define, improve, and evolve the US national cyber-infrastructure.