LHC Experiments and the PACI
A Partnership for Global Data Analysis
Harvey B. Newman, Caltech
Advisory Panel on CyberInfrastructure
National Science Foundation
November 29, 2001
http://l3www.cern.ch/~newman/LHCGridsPACI.ppt
Global Data Grid Challenge
“Global scientific communities, served by networks
with bandwidths varying by orders of magnitude,
need to perform computationally demanding
analyses of geographically distributed datasets
that will grow by at least 3 orders of magnitude
over the next decade, from the 100 Terabyte to
the 100 Petabyte scale [from 2000 to 2007]”
The Large Hadron Collider (2006-)
 The Next-generation Particle Collider
 The largest superconducting installation in the world
 Bunch-bunch collisions at 40 MHz, each generating ~20 interactions
 Only one in a trillion may lead
to a major physics discovery
 Real-time data filtering:
Petabytes per second to Gigabytes
per second
 Accumulated data of many
Petabytes/Year
Large data samples explored and analyzed by thousands
of globally dispersed scientists, in hundreds of teams
Four LHC Experiments: The
Petabyte to Exabyte Challenge
ATLAS, CMS, ALICE, LHCb
Higgs + New particles; Quark-Gluon Plasma; CP Violation
Data stored: ~40 Petabytes/Year and UP (2007); CPU: 0.30 Petaflops and UP
0.1 to 1 Exabyte (1 EB = 10^18 Bytes) (~2012?) for the LHC Experiments
Evidence for the
Higgs at LEP at
M~115 GeV
The LEP Program
Has Now Ended
LHC: Higgs Decay into 4 muons
1000X LEP Data Rate
(+30 minimum bias events)
All charged tracks with pt > 2 GeV
Reconstructed tracks with pt > 25 GeV
10^9 events/sec, selectivity: 1 in 10^13 (1 person in a thousand world populations)
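As a rough, illustrative check of the data-reduction figures above (the ~1 MB per-event size and the ~100 Hz accepted rate below are assumed round numbers, not values from the slide):

# Back-of-the-envelope trigger arithmetic; event size and accepted rate are
# assumed round numbers for illustration only.
collision_rate_hz = 1e9      # ~10^9 events/sec produced at the LHC
event_size_bytes  = 1e6      # assume ~1 MB of raw data per event
accepted_rate_hz  = 1e2      # assume ~100 events/sec kept by the online filter

raw_rate    = collision_rate_hz * event_size_bytes   # ~1e15 B/s (~PByte/sec)
stored_rate = accepted_rate_hz * event_size_bytes    # ~1e8  B/s (sub-GByte/sec)
per_year    = stored_rate * 3.15e7                   # ~3e15 B   (a few PB/year)
print(f"raw {raw_rate:.0e} B/s -> stored {stored_rate:.0e} B/s -> {per_year:.0e} B/yr")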
LHC Data Grid Hierarchy
CERN/Outside Resource Ratio ~1:2; Tier0 : (Σ Tier1) : (Σ Tier2) ~1:1:1
 Online System (Experiment): ~PByte/sec off the detector; ~100-400 MBytes/sec into Tier 0 +1
 Tier 0 +1 (CERN): 700k SI95; ~1 PB Disk; Tape Robot
 Tier 1 centers (~2.5 Gbits/sec links): e.g. IN2P3 Center, INFN Center, RAL Center, FNAL (200k SI95; 600 TB)
 Tier 2 centers (~2.5 Gbps links to Tier 1)
 Tier 3 (100 - 1000 Mbits/sec): Institutes, ~0.25 TIPS each, with physics data cache
 Tier 4: Workstations
Physicists work on analysis “channels”; each institute has ~10 physicists working on one or more channels
TeraGrid: NCSA, ANL, SDSC, Caltech
A Preview of the Grid Hierarchy and Networks of the LHC Era
[Network map: StarLight Int’l Optical Peering Point (see www.startap.net); Starlight / NW Univ; Multiple Carrier Hubs; Ill Inst of Tech; Univ of Chicago; UIC; ANL; NCSA/UIUC; Indianapolis (Abilene NOC); Urbana; Chicago; Pasadena; San Diego. Links: Abilene OC-48 (2.5 Gb/s); Multiple 10 GbE (Qwest); Multiple 10 GbE (I-WIRE Dark Fiber)]
 Solid lines in place and/or available in 2001
 Dashed I-WIRE lines planned for Summer 2002
Source: Charlie Catlett, Argonne
Current Grid Challenges: Resource
Discovery, Co-Scheduling, Transparency
Discovery and Efficient Co-Scheduling of Computing,
Data Handling, and Network Resources
 Effective, Consistent Replica Management
 Virtual Data: Recomputation Versus Data Transport
Decisions
Reduction of Complexity In a “Petascale” World
 “GA3”: Global Authentication, Authorization, Allocation
 VDT: Transparent Access to Results
(and Data When Necessary)
 Location Independence of the User Analysis, Grid,
and Grid-Development Environments
 Seamless Multi-Step Data Processing and Analysis:
DAGMan (Wisc), MOP+IMPALA(FNAL)
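As a minimal illustration of the multi-step processing that DAGMan-style tools automate, the sketch below runs hypothetical production steps in dependency order (the step names and placeholder commands are not the actual CMS tool chain):

# Minimal sketch of multi-step processing in dependency order (what a DAGMan
# description captures); step names and commands are hypothetical.
from graphlib import TopologicalSorter
import subprocess

steps = {
    "simulate":    [],               # CMSIM-style event generation
    "digitize":    ["simulate"],     # add pile-up, detector response
    "reconstruct": ["digitize"],     # ORCA-style reconstruction
    "transfer":    ["reconstruct"],  # ship output to mass storage
}
commands = {name: ["echo", f"running {name}"] for name in steps}  # placeholders

for step in TopologicalSorter(steps).static_order():
    subprocess.run(commands[step], check=True)   # stop the chain on failure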
CMS Production: Event Simulation and Reconstruction
[Status table: production steps — Simulation; Digitization, No PU; Digitization with PU (GDMP); Common Prod. tools (IMPALA) — versus sites]
Sites: CERN, FNAL, Moscow, INFN, Caltech, UCSD, UFL, Imperial College, Bristol, Wisconsin, IN2P3, Helsinki
Status per site and step: Fully operational / In progress / Not Op.
“Grid-Enabled” Automated
US CMS TeraGrid Seamless
Prototype
 Caltech/Wisconsin Condor/NCSA Production
 Simple Job Launch from Caltech
 Authentication Using Globus Security Infrastructure (GSI)
 Resources Identified Using Globus Information
Infrastructure (GIS)
 CMSIM Jobs (Batches of 100, 12-14 Hours, 100 GB Output)
Sent to the Wisconsin Condor Flock Using Condor-G
 Output Files Automatically Stored in NCSA Unitree (Gridftp)
 ORCA Phase: Read-in and Process Jobs at NCSA
 Output Files Automatically Stored in NCSA Unitree
 Future: Multiple CMS Sites; Storage in Caltech HPSS Also,
Using GDMP (With LBNL’s HRM).
 Animated Flow Diagram of the DTF Prototype:
http://cmsdoc.cern.ch/~wisniew/infrastructure.html
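A rough sketch of the submission path described above follows; the hostnames, file names and wrapper script are hypothetical, and the submit-file keywords reflect generic Condor-G usage of that era rather than the exact prototype scripts:

# Rough sketch of the Condor-G + GridFTP flow; hostnames and paths are hypothetical.
import subprocess, textwrap

submit = textwrap.dedent("""\
    universe        = globus
    globusscheduler = condor.example.wisc.edu/jobmanager-condor
    executable      = run_cmsim.sh
    arguments       = --events 100
    output          = cmsim_$(Cluster).out
    error           = cmsim_$(Cluster).err
    log             = cmsim.log
    queue 100
""")
open("cmsim.sub", "w").write(submit)

# Submit to the remote Condor flock (a valid GSI proxy is assumed).
subprocess.run(["condor_submit", "cmsim.sub"], check=True)

# After the batch finishes, stage the output into mass storage over GridFTP.
subprocess.run(["globus-url-copy",
                "file:///scratch/cmsim_output.tar",
                "gsiftp://unitree.example.ncsa.edu/cms/cmsim_output.tar"],
               check=True)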
Baseline BW for the US-CERN Link:
HENP Transatlantic WG (DOE+NSF)
Transoceanic networking integrated with the TeraGrid, Abilene, regional nets and continental network infrastructures in the US, Europe, Asia, and South America
Planned US-CERN link bandwidth (Mbps): FY2001: 310; FY2002: 622; FY2003: 1250; FY2004: 2500; FY2005: 5000; FY2006: 10000
US-CERN Plans: 155 Mbps to 2 X 155 Mbps this Year;
622 Mbps in April 2002;
DataTAG 2.5 Gbps Research Link in Summer 2002;
10 Gbps Research Link in ~2003
Transatlantic Net WG (HN, L. Price)
Bandwidth Requirements [*]
             2001     2002     2003     2004     2005     2006
CMS           100      200      300      600      800     2500
ATLAS          50      100      300      600      800     2500
BaBar         300      600     1100     1600     2300     3000
CDF           100      300      400     2000     3000     6000
D0            400     1600     2400     3200     6400     8000
BTeV           20       40      100      200      300      500
DESY          100      180      210      240      270      300
CERN BW   155-310      622     1250     2500     5000    10000
[*] Installed BW. Maximum Link Occupancy 50% Assumed
The Network Challenge is Shared by Both Next- and
Present Generation Experiments
Internet2 HENP Networking WG [*]
Mission
 To help ensure that the required
 National and international network infrastructures
 Standardized tools and facilities for high-performance, end-to-end monitoring and tracking, and
 Collaborative systems
 are developed and deployed in a timely manner,
and used effectively to meet the needs of the US LHC
and other major HENP Programs, as well as the
general needs of our scientific community.
 To carry out these developments in a way that is
broadly applicable across many fields, within and
beyond the scientific community
 [*] Co-Chairs: S. McKee (Michigan), H. Newman (Caltech);
With thanks to R. Gardner and J. Williams (Indiana)
Grid R&D: Focal Areas for
NPACI/HENP Partnership
Development of Grid-Enabled User Analysis Environments
 CLARENS (+IGUANA) Project for Portable Grid-Enabled
Event Visualization, Data Processing and Analysis
 Object Integration: backed by an ORDBMS, and
File-Level Virtual Data Catalogs
Simulation Toolsets for Systems Modeling, Optimization
 For example: the MONARC System
Globally Scalable Agent-Based Realtime Information
Marshalling Systems
 To face the next-generation challenge of Dynamic
Global Grid design and operations
 Self-learning (e.g. SONN) optimization
 Simulation (Now-Casting) enhanced: to monitor, track and
forward predict site, network and global system state
1-10 Gbps Networking development and global deployment
 Work with the TeraGrid, STARLIGHT, Abilene, the iVDGL
GGGOC, HENP Internet2 WG, Internet2 E2E, and DataTAG
Global Collaboratory Development: e.g. VRVS, Access Grid
CLARENS: a Data Analysis
Portal to the Grid: Steenberg (Caltech)
 A highly functional graphical interface, Grid-enabling the working environment for “non-specialist” physicists’ data analysis
 Clarens consists of a server communicating with various clients via the commodity XML-RPC protocol. This ensures implementation independence.
 The server is implemented in C++ to give access to the CMS OO analysis toolkit.
The server will provide a remote API to Grid tools:
 Security services provided by the Grid (GSI)
 The Virtual Data Toolkit: Object collection access
 Data movement between Tier centers using GSI-FTP
 CMS analysis software (ORCA/COBRA)
Current prototype is running on the Caltech Proto-Tier2
More information at http://heppc22.hep.caltech.edu,
along with a web-based demo
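Because Clarens speaks commodity XML-RPC, a client can in principle be as thin as the sketch below; the port and method names are illustrative placeholders, not the actual Clarens API:

# Illustrative XML-RPC client for a Clarens-style analysis server.
# The endpoint and method names are hypothetical, not the real Clarens API.
import xmlrpc.client

server = xmlrpc.client.ServerProxy("http://heppc22.hep.caltech.edu:8080/clarens")

# A thin client only marshals requests; the heavy lifting (ORCA/COBRA access,
# GSI-authenticated data movement) stays on the server side.
datasets  = server.list_datasets("/cms/higgs/4mu")       # hypothetical method
histogram = server.plot("/cms/higgs/4mu", "muon_pt")     # hypothetical method
print(len(datasets), "datasets; histogram bins:", len(histogram))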
Modeling and Simulation:
MONARC System
 Modelling and understanding current systems, their performance and limitations, is essential for the design of future large-scale distributed processing systems.
 The simulation program developed within the MONARC (Models Of Networked Analysis at Regional Centres) project is based on a process-oriented approach to discrete event simulation. It is built on Java(TM) technology and provides a realistic modelling tool for such large-scale distributed systems.
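To illustrate the process-oriented discrete event simulation approach (in Python here for brevity; MONARC itself is written in Java), a toy model of jobs queueing for a regional centre's CPU pool reduces to an event list ordered by simulated time; all rates below are invented:

# Toy process-oriented discrete event simulation (not MONARC itself):
# jobs arrive at a regional centre and queue for a fixed pool of CPUs.
import heapq, random

random.seed(1)
N_CPUS, N_JOBS = 25, 200

t_arr, events = 0.0, []
for i in range(N_JOBS):
    t_arr += random.expovariate(1 / 60.0)        # ~1 job/min mean interarrival
    events.append((t_arr, "arrival", i))
heapq.heapify(events)

free_cpus, waiting, finished = N_CPUS, [], 0
while events:
    t, kind, job = heapq.heappop(events)
    if kind == "arrival":
        waiting.append(job)
    else:                                        # "done": a CPU frees up
        free_cpus += 1
        finished += 1
    while free_cpus and waiting:                 # dispatch queued jobs
        free_cpus -= 1
        nxt = waiting.pop(0)
        service = random.expovariate(1 / 600.0)  # ~10 min mean job length
        heapq.heappush(events, (t + service, "done", nxt))

print("jobs completed:", finished)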
SIMULATION of Complex Distributed Systems
MONARC SONN: 3 Regional Centres Learning to Export Jobs (Day 9)
[Simulation snapshot, Day = 9: CERN (30 CPUs) <E> = 0.83; CALTECH (25 CPUs) <E> = 0.73; NUST (20 CPUs) <E> = 0.66; links 1 MB/s, 150 ms RTT]
Maximizing US-CERN TCP
Throughput (S.Ravot, Caltech)
TCP Protocol Study: Limits
 We determined precisely:
 The parameters which limit the throughput over a high-BW, long-delay (170 msec) network
 How to avoid intrinsic limits and unnecessary packet loss
Methods Used to Improve TCP
 Linux kernel programming in order
to tune TCP parameters
 We modified the TCP algorithm
 A Linux patch will soon be available
[Congestion window behaviour of a TCP connection over the transatlantic line:
1) A packet is lost; losses occur when the cwnd is larger than 3.5 MByte
2) Fast Recovery (a temporary state to repair the loss); a new loss occurs
3) Back to slow start (Fast Recovery could not repair the loss; the lost packet is detected by timeout, so the connection goes back to slow start with cwnd = 2 MSS)]
[Chart: TCP performance between CERN and Caltech — throughput (Mbps) per connection, without tuning vs. with the SSTHRESH parameter tuned]
Result: the current state of the art for reproducible throughput:
 Reproducible 125 Mbps between CERN and Caltech/CACR
 135 Mbps between CERN and Chicago
Status: Ready for tests at higher BW (622 Mbps) in Spring 2002
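The tuning above amounts to matching TCP's window to the bandwidth-delay product of the path; the sketch below computes that product for the planned 622 Mbps link at ~170 ms RTT and requests matching socket buffers, as a generic application-level illustration rather than the actual kernel patch:

# Sizing TCP buffers to the bandwidth-delay product (generic illustration,
# not the Caltech/CERN kernel modification described above).
import socket

target_rate_bps = 622e6      # planned 622 Mbps link
rtt_s = 0.170                # ~170 ms CERN-Caltech round-trip time
bdp_bytes = int(target_rate_bps * rtt_s / 8)   # ~13 MB must be "in flight"
print(f"bandwidth-delay product: {bdp_bytes / 1e6:.1f} MB")

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask for send/receive buffers at least as large as the BDP; the kernel may
# cap these unless system-wide TCP memory limits are raised as well.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)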
Agent-Based Distributed System: JINI Prototype (Caltech/Pakistan)
 Includes “Station Servers” (static) that host mobile “Dynamic Services”
 Servers are interconnected dynamically to form a fabric in which mobile agents travel, with a payload of physics analysis tasks
 Prototype is highly flexible and robust against network outages
 Amenable to deployment on leading edge and future portable devices (WAP, iAppliances, etc.)
 “The” system for the travelling physicist
 The design and studies with this prototype use the MONARC Simulator, and build on SONN studies
 See http://home.cern.ch/clegrand/lia/
[Diagram: Station Servers joining the fabric through Lookup Discovery, registering with Lookup Services]
Globally Scalable Monitoring Service
[Architecture diagram: metrics are gathered by Push & Pull using rsh & ssh existing scripts and snmp; components include the RC, Farm Monitor, Monitor Service, Component Factory, GUI marshaling, Code Transport, and RMI data access; a Farm Monitor Client (or other service) locates Farm Monitors through Lookup Services, found via Discovery and a Proxy]
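As a minimal sketch of the push & pull idea behind such a monitoring service (the registry URL and metric set are hypothetical, and this stands in for, rather than reproduces, the JINI/RMI design shown above):

# Minimal sketch of a push & pull monitoring agent (hypothetical names;
# not the JINI/RMI-based service shown in the diagram).
import json, os, time, urllib.request

def collect_metrics():
    """Pull local farm metrics; here just the 1-minute load average."""
    load1, _, _ = os.getloadavg()
    return {"host": os.uname().nodename, "load1": load1, "ts": time.time()}

def push(metrics, registry_url="http://monitor.example.org/report"):  # hypothetical
    """Push the metrics to a central registry as JSON."""
    req = urllib.request.Request(registry_url,
                                 data=json.dumps(metrics).encode(),
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)

if __name__ == "__main__":
    while True:                      # a farm-side agent loops forever
        push(collect_metrics())
        time.sleep(60)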
Examples
 GLAST meeting: 10 participants connected via VRVS (and 16 participants in audio only)
VRVS: 7300 hosts; 4300 registered users in 58 countries; 34 reflectors, 7 in Internet2; annual growth 250%
US CMS will use the CDF/KEK remote control room concept for Fermilab Run II
as a starting point. However, we will (1) expand the scope to encompass a US
based physics group and US LHC accelerator tasks, and (2) extend the concept
to a Global Collaboratory for realtime data acquisition + analysis
Next Round Grid Challenges: Global Workflow
Monitoring, Management, and Optimization
 Workflow Management, Balancing Policy Versus
Moment-to-moment Capability to Complete Tasks
 Balance High Levels of Usage of Limited Resources
Against Better Turnaround Times for Priority Jobs
 Goal-Oriented; According to (Yet to be Developed)
Metrics
 Maintaining a Global View of Resources and System State
 Global System Monitoring, Modeling, Quasi-realtime
simulation; feedback on the Macro- and MicroScales
 Adaptive Learning: new paradigms for execution
optimization and Decision Support (eventually
automated)
 Grid-enabled User Environments
PACI, TeraGrid and HENP
 The scale, complexity and global extent of the LHC Data Analysis problem is unprecedented
 The solution of the problem, using globally distributed Grids, is mission-critical for frontier science and engineering
 HENP has a tradition of deploying new highly functional systems (and sometimes new technologies) to meet its technical and ultimately its scientific needs
 HENP problems are mostly “embarrassingly” parallel, but potentially “overwhelming” in their data- and network-intensiveness
 HENP/Computer Science synergy has increased dramatically over the last two years, focused on Data Grids
 Successful collaborations in GriPhyN, PPDG, EU Data Grid
 The TeraGrid (present and future) and its development program is scoped at an appropriate level of depth and diversity
 to tackle the LHC and other “Petascale” problems, over a 5-year time span
 matched to the LHC time schedule, with full operations in 2007
Some Extra
Slides Follow
Computing Challenges:
LHC Example
 Geographical dispersion: of people and resources
 Complexity: the detector and the LHC environment
 Scale:
Tens of Petabytes per year of data
5000+ Physicists
250+ Institutes
60+ Countries
Major challenges associated with:
Communication and collaboration at a distance
Network-distributed computing and data resources
Remote software development and physics analysis
R&D: New Forms of Distributed Systems: Data Grids
Why Worldwide Computing?
Regional Center Concept Goals
 Managed, fair-shared access for Physicists everywhere
 Maximize total funding resources while meeting the
total computing and data handling needs
 Balance proximity of datasets to large central resources,
against regional resources under more local control
 Tier-N Model
 Efficient network use: higher throughput on short paths
 Local > regional > national > international
 Utilizing all intellectual resources, in several time zones
 CERN, national labs, universities, remote sites
 Involving physicists and students at their home institutions
 Greater flexibility to pursue different physics interests,
priorities, and resource allocation strategies by region
 And/or by Common Interests (physics topics,
subdetectors,…)
 Manage the System’s Complexity
 Partitioning facility tasks, to manage and focus resources
HENP Related Data Grid
Projects
Funded Projects
 PPDG I        USA   DOE     $2M              1999-2001
 GriPhyN       USA   NSF     $11.9M + $1.6M   2000-2005
 EU DataGrid   EU    EC      €10M             2001-2004
 PPDG II (CP)  USA   DOE     $9.5M            2001-2004
 iVDGL         USA   NSF     $13.7M + $2M     2001-2006
 DataTAG       EU    EC      €4M              2002-2004
About to be Funded Project
 GridPP*       UK    PPARC   >$15M?           2001-2004
Many national projects of interest to HENP
 Initiatives in US, UK, Italy, France, NL, Germany, Japan, …
 EU networking initiatives (Géant, SURFNet)
 US Distributed Terascale Facility:
($53M, 12 TFL, 40 Gb/s network)
* = in final stages of approval
Network Progress and
Issues for Major Experiments
 Network backbones are advancing rapidly to the 10 Gbps
range: “Gbps” end-to-end data flows will soon be in demand
 These advances are likely to have a profound impact
on the major physics Experiments’ Computing Models
 We need to work on the technical and political network issues
 Share technical knowledge of TCP: Windows,
Multiple Streams, OS kernel issues; Provide User Toolset
 Getting higher bandwidth to regions outside W. Europe and
US: China, Russia, Pakistan, India, Brazil, Chile, Turkey, etc.
 Even to enable their collaboration
 Advanced integrated applications, such as Data Grids, rely on
seamless “transparent” operation of our LANs and WANs
 With reliable, quantifiable (monitored), high performance
 Networks need to become part of the Grid(s) design
 New paradigms of network and system monitoring
and use need to be developed, in the Grid context
Grid-Related R&D Projects in CMS:
Caltech, FNAL, UCSD, UWisc, UFl
 Installation, Configuration and Deployment of Prototype
Tier2 Centers at Caltech/UCSD and Florida
 Large Scale Automated Distributed Simulation Production
DTF “TeraGrid” (Micro-)Prototype: CIT, Wisconsin
Condor, NCSA
Distributed MOnte Carlo Production (MOP): FNAL
 “MONARC” Distributed Systems Modeling;
Simulation system applications to Grid Hierarchy
management
Site configurations, analysis model, workload
Applications to strategy development; e.g. inter-site
load balancing using a “Self Organizing Neural Net”
(SONN)
Agent-based System Architecture for Distributed
Dynamic Services
 Grid-Enabled Object Oriented Data Analysis
MONARC Simulation System Validation
[Validation: CMS ProtoTier1 production farm at FNAL vs. the CMS farm at CERN; measurement (mean measured value ~48 MB/s) compared with simulation; Muon <0.90> and Jet <0.52> workloads]
MONARC SONN: 3 Regional Centres Learning to Export Jobs (Day 0)
[Simulation snapshot, Day = 0: CERN (30 CPUs), CALTECH (25 CPUs), NUST (20 CPUs); links 1 MB/s, 150 ms RTT]
US CMS Remote Control Room
For LHC
[SC2001 demo setup: a “Tag” database of ~140,000 small objects on the client; full event databases of ~40,000 and ~100,000 large objects on the two Tier2 servers; client requests trigger parallel tuned GSI FTP transfers of the matching large objects]
Bandwidth Greedy Grid-enabled Object Collection Analysis for
Particle Physics (SC2001 Demo)
Julian Bunn, Ian Fisk, Koen Holtman, Harvey Newman, James Patton
The object of this demo is to show grid-supported interactive physics analysis on a set of 144,000 physics events.
Initially we start out with 144,000 small Tag objects, one for each event, on the Denver client machine. We also
have 144,000 LARGE objects, containing full event data, divided over the two tier2 servers.
 Using local Tag event database, user plots event parameters of interest
 User selects subset of events to be fetched for further analysis
 Lists of matching events sent to Caltech and San Diego
 Tier2 servers begin sorting through databases extracting required events
 For each required event, a new large virtual object is materialized in the server-side cache; this object contains all the tracks in the event.
 The database files containing the new objects are sent to the client using Globus FTP, the client adds them to its local cache
of large objects
 The user can now plot event parameters not available in the Tag
 Future requests take advantage of previously cached large objects in the client
http://pcbunn.cacr.caltech.edu/Tier2/Tier2_Overall_JJB.htm
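The client-side flow of the demo, fetching only the large objects that are not already cached, reduces to something like the sketch below; the function names and the selection predicate are hypothetical stand-ins for the Tag query and the Globus FTP transfer:

# Sketch of the demo's client-side flow: select events from the local Tag
# store, then fetch only the uncached large event objects from the Tier2
# servers. All names here are hypothetical stand-ins.

local_cache = {}                 # event_id -> large event object

def select_events(tags, predicate):
    """Plot/select on the small Tag objects held locally."""
    return [t["event_id"] for t in tags if predicate(t)]

def fetch_large_objects(event_ids, fetch_remote):
    """Materialize and transfer only what is missing from the local cache."""
    missing = [e for e in event_ids if e not in local_cache]
    for event_id, obj in fetch_remote(missing):   # e.g. Globus FTP under the hood
        local_cache[event_id] = obj
    return [local_cache[e] for e in event_ids]

# Example usage (names hypothetical): pick high-pT muon candidates, then
# analyse the full track lists once the large objects are local.
# selected = select_events(tag_db, lambda t: t["muon_pt"] > 25.0)
# events   = fetch_large_objects(selected, tier2_fetch)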