HTC in Research & Education
Miron Livny
Computer Sciences Department
University of Wisconsin-Madison
miron@cs.wisc.edu
Claims for “benefits” provided by
Distributed Processing Systems
• High Availability and Reliability
• High System Performance
• Ease of Modular and Incremental Growth
• Automatic Load and Resource Sharing
• Good Response to Temporary Overloads
• Easy Expansion in Capacity and/or Function
“What is a Distributed Data Processing System?”, P.H. Enslow, Computer, January 1978
2
Democratization of Computing:
You do not need to be a super-person to do super-computing
3
[Workflow diagram] Searching for small RNA candidates in a kingdom: 45 CPU days. Starting from NCBI FTP genome files (.ffn, .fna, .ptt, .gbk), the pipeline runs IGRExtract3, RNAMotif, FindTerm, TransTerm, BLAST, QRNA, Patser, FFN_parse, sRNAPredict and sRNA_Annotate over intergenic regions, known sRNAs, riboswitches, TFBS matrices and flanking ORFs (conservation, terminators, homology, secondary-structure conservation, TFBSs, paralogy, synteny) to produce annotated candidate sRNA-encoding genes.
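A rough illustration of how such a multi-stage pipeline decomposes into batches of independent HTC jobs; the stage names below come from the figure, but the dependency edges are assumptions made purely for illustration, not the workflow's actual dataflow.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map for a pipeline of this shape. The stage names come
# from the figure; the edges are assumptions made purely for illustration.
stages = {
    "IGRExtract3": set(),                 # extract intergenic regions from genome files
    "RNAMotif": {"IGRExtract3"},
    "FindTerm": {"IGRExtract3"},
    "TransTerm": {"IGRExtract3"},
    "BLAST": {"IGRExtract3"},
    "sRNAPredict": {"RNAMotif", "FindTerm", "TransTerm", "BLAST"},
    "sRNA_Annotate": {"sRNAPredict"},
}

# Group stages into "waves": every stage in a wave has its inputs ready, so its
# (typically many) per-genome tasks can be farmed out as independent HTC jobs.
ts = TopologicalSorter(stages)
ts.prepare()
while ts.is_active():
    wave = sorted(ts.get_ready())
    print("run in parallel:", wave)
    ts.done(*wave)
```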
4
Education and Training
› Computer Science – develop and implement novel HTC technologies (horizontal)
› Domain Sciences – develop and implement end-to-end HTC capabilities that are fully integrated in the scientific discovery process (vertical)
› Experimental methods – develop and implement a curriculum that harnesses HTC capabilities to teach how to use modeling and numerical data to answer scientific questions
› System Management – develop and implement a curriculum that uses HTC resources to teach how to build, deploy, maintain and operate distributed systems
5
“As we look to hire new graduates, both at the undergraduate and graduate levels, we find that in most cases people are coming in with a good, solid core computer science traditional education ... but not a great, broad-based education in all the kinds of computing that are near and dear to our business.”
Ron Brachman
Vice President of Worldwide Research Operations, Yahoo
6
Yahoo! Inc., a leading global Internet company, today
announced that it will be the first in the industry to
launch an open source program aimed at advancing
the research and development of systems software
for distributed computing. Yahoo’s program is
intended to leverage its leadership in Hadoop, an
open source distributed computing sub-project of
the Apache Software Foundation, to enable
researchers to modify and evaluate the systems
software running on a 4,000-processor
supercomputer provided by Yahoo. Unlike other
companies and traditional supercomputing centers,
which focus on providing users with computers for
running applications and for coursework, Yahoo’s
program focuses on pushing the boundaries of large-scale systems software research.
7
1986-2006
Celebrating
20 years since we
first installed Condor
in our CS department
8
Integrating Linux Technology with Condor
Kim van der Riet
Principal Software Engineer
What will Red Hat be doing?
• Red Hat will be investing in the Condor project locally in Madison, WI, in addition to driving work required in upstream and related projects. This work will include:
  – Engineering on Condor features & infrastructure
    • Should result in tighter integration with related technologies
    • Tighter kernel integration
  – Information transfer between the Condor team and Red Hat engineers working on things like Messaging, Virtualization, etc.
  – Creating and packaging Condor components for Linux distributions
  – Support for Condor packaged in RH distributions
• All work goes back to upstream communities, so this partnership will benefit all.
Shameless plug: If you want to be involved, Red Hat is hiring...
10
IBM Systems and Technology Group
High Throughput Computing
on Blue Gene
IBM Rochester: Amanda Peters, Tom Budnik
With contributions from:
IBM Rochester: Mike Mundy, Greg Stewart, Pat McCarthy
IBM Watson Research: Alan King, Jim Sexton
UW-Madison Condor: Greg Thain, Miron Livny, Todd Tannenbaum
© 2007 IBM Corporation
IBM Systems and Technology Group
Condor and IBM Blue Gene Collaboration
• Both IBM and Condor teams engaged in adapting code to bring Condor and Blue Gene technologies together
• Initial Collaboration (Blue Gene/L)
  – Prototype/research Condor running HTC workloads on Blue Gene/L
    • Condor developed dispatcher/launcher running HTC jobs
    • Prototype work for Condor being performed on Rochester On-Demand Center Blue Gene system
• Mid-term Collaboration (Blue Gene/L)
  – Condor supports HPC workloads along with HTC workloads on Blue Gene/L
• Long-term Collaboration (Next Generation Blue Gene)
  – I/O Node exploitation with Condor
  – Partner in design of HTC services for Next Generation Blue Gene
    • Standardized launcher, boot/allocation services, job submission/tracking via database, etc.
  – Study ways to automatically switch between HTC/HPC workloads on a partition
  – Data persistence (persisting data in memory across executables)
    • Data affinity scheduling
  – Petascale environment issues
12
© 2007 IBM Corporation
The Grid: Blueprint for a New
Computing Infrastructure
Edited by Ian Foster and Carl Kesselman
July 1998, 701 pages.
The grid promises to fundamentally change the way we
think about and use computing. This infrastructure will
connect multiple regional and national computational
grids, creating a universal source of pervasive
and dependable computing power that
supports dramatically new classes of applications. The
Grid provides a clear vision of what computational
grids are, why we need them, who will use them, and
how they will be programmed.
13
“ … We claim that these mechanisms, although
originally developed in the context of a cluster
of workstations, are also applicable to
computational grids. In addition to the required
flexibility of services in these grids, a very
important concern is that the system be robust
enough to run in “production mode” continuously
even in the face of component failures. … “
Miron Livny & Rajesh Raman, "High Throughput Resource
Management", in “The Grid: Blueprint for
a New Computing Infrastructure”.
14
The search for SUSY*
Sanjay Padhi is a UW Chancellor Fellow who is working in the group of Prof. Sau Lan Wu located at CERN (Geneva).
Using Condor technologies he established a “grid access point” in his office at CERN.
Through this access point he managed to harness, in 3 months (12/05-2/06), more than 500 CPU years from the LHC Computing Grid (LCG), the Open Science Grid (OSG), the Grid Laboratory Of Wisconsin (GLOW) and local group-owned desktop resources.
*Super-Symmetry
17
High Throughput Computing
We first introduced the distinction between
High Performance Computing (HPC) and High
Throughput Computing (HTC) in a seminar at
the NASA Goddard Flight Center in July of
1996 and a month later at the European
Laboratory for Particle Physics (CERN). In
June of 1997 HPCWire published an
interview on High Throughput Computing.
18
Why HTC?
For many experimental scientists, scientific
progress and quality of research are strongly
linked to computing throughput. In other words,
they are less concerned about instantaneous
computing power. Instead, what matters to them
is the amount of computing they can harness over
a month or a year --- they measure computing
power in units of scenarios per day, wind patterns
per week, instruction sets per month, or crystal
configurations per year.
19
High Throughput Computing
is a
24-7-365
activity
FLOPY ≠ (60*60*24*7*52)*FLOPS
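The point, as I read the relation above, is that the floating-point operations actually delivered over a year fall short of the peak rate multiplied by the seconds in a year, because machines go down and sit idle. A toy calculation with invented availability and utilization factors:

```python
# Seconds in a 52-week year, the constant on the slide: 60*60*24*7*52
SECONDS_PER_YEAR = 60 * 60 * 24 * 7 * 52          # = 31,449,600

peak_flops   = 1e9    # nominal sustained rate of one machine (assumed)
availability = 0.92   # fraction of the year the machine is actually up (assumed)
utilization  = 0.70   # fraction of up-time spent on useful jobs (assumed)

ideal_flopy  = SECONDS_PER_YEAR * peak_flops                  # what peak speed promises
actual_flopy = ideal_flopy * availability * utilization       # what a year really delivers

print(f"ideal : {ideal_flopy:.3e} FLOP/year")
print(f"actual: {actual_flopy:.3e} FLOP/year")
```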
20
High Throughput
Computing
Miron Livny
Computer Sciences
University of Wisconsin-Madison
{miron@cs.wisc.edu}
Customers of HTC
• Most HTC applications follow the Master-Worker paradigm, where a group of workers executes a loosely coupled heap of tasks controlled by one or more masters.
  – Job Level: tens to thousands of independent jobs
  – Task Level: a parallel application (PVM, MPI-2) consisting of a small group of master processes and tens to hundreds of worker processes
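A minimal sketch of the Master-Worker pattern described above, using a plain Python process pool rather than any Condor machinery; the task body and counts are placeholders.

```python
from multiprocessing import Pool

def work(task: int) -> int:
    """One loosely coupled task; workers know nothing about each other."""
    return task * task  # placeholder computation standing in for real work

if __name__ == "__main__":
    tasks = range(1_000)              # the "heap" of independent tasks
    with Pool(processes=8) as pool:   # the group of workers
        # The master hands out tasks and gathers results as workers free up.
        results = pool.map(work, tasks, chunksize=25)
    print(sum(results))
```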
22
The Challenge
Turn large collections of existing
distributively owned computing
resources into effective High
Throughput Computing Environments
Minimize Wait while Idle
23
Obstacles to HTC
• Ownership Distribution (Sociology)
• Size and Uncertainties (Robustness)
• Technology Evolution (Portability)
• Physical Distribution (Technology)
24
Sociology
Make owners (& system administrators) happy.
• Give owners full control over
  – when and by whom private resources are used for HTC
  – the impact of HTC on private Quality of Service
  – membership and information on HTC-related activities
• No changes to existing software, and make it easy
  – to install, configure, monitor, and maintain
happy owners → more resources → higher throughput
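To make the "full owner control" point concrete, here is a toy model (not Condor's actual policy language): the owner supplies predicates over machine state that decide when guest HTC work may start and when it must get out of the way. The field names and thresholds are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class MachineState:
    keyboard_idle_minutes: float   # time since the owner last touched the machine
    owner_load: float              # CPU load generated by the owner's own work

# Owner-supplied policy: guest HTC jobs may start only on an otherwise idle machine.
def may_start(state: MachineState) -> bool:
    return state.keyboard_idle_minutes >= 15 and state.owner_load < 0.3

# ... and must get out of the way as soon as the owner comes back.
def must_vacate(state: MachineState) -> bool:
    return state.keyboard_idle_minutes < 1 or state.owner_load > 0.5

print(may_start(MachineState(keyboard_idle_minutes=45, owner_load=0.1)))    # True
print(must_vacate(MachineState(keyboard_idle_minutes=0.5, owner_load=0.8))) # True
```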
25
Sociology
• Owners look for a verifiable contract with the HTC environment that spells out the rules of engagement.
• System administrators do not like weird distributed applications that have the potential of interfering with the happiness of their interactive users.
26
Robustness
To be effective, an HTC environment must run as a 24-7-365 operation.
• Customers count on it.
• Debugging and fault isolation can be very time-consuming processes.
• In a large distributed system, everything that might go wrong will go wrong.
robust system → less down time → higher throughput
27
Portability
To be effective, the HTC software must run on and support the latest and greatest hardware and software.
• Owners select hardware and software according to their needs and tradeoffs.
• Customers expect it to be there.
• Application developers expect only a few (if any) changes to their applications.
portability → more platforms → higher throughput
28
Technology
An HTC environment is a large, dynamic and evolving Distributed System:
• Autonomous and heterogeneous resources
• Remote file access
• Authentication
• Local and wide-area networking
29
Robust and Portable Mechanisms
Hold The Key To
High Throughput Computing
Policies play only a secondary role in HTC
30
Leads to a
“bottom up”
approach to building
and operating
distributed systems
31
My jobs should run …
› … on my laptop if it is not connected to the network
› … on my group resources if my certificate expired
› … on my campus resources if the meta-scheduler is down
› … on my national resources if the trans-Atlantic link was cut by a submarine
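One way to read the list above is as graceful degradation: placement falls through to whatever tier is still reachable. A minimal sketch, with made-up tier names and availability checks:

```python
def pick_resource(tiers):
    """Return the name of the first tier whose availability check passes."""
    for name, available in tiers:
        if available():
            return name
    raise RuntimeError("no resources available anywhere")

# Ordered tiers with placeholder availability checks, invented for illustration.
tiers = [
    ("national", lambda: False),   # e.g. the trans-Atlantic link was cut
    ("campus",   lambda: False),   # e.g. the meta-scheduler is down
    ("group",    lambda: False),   # e.g. my certificate expired
    ("laptop",   lambda: True),    # my own machine is still here
]
print(pick_resource(tiers))        # -> laptop; everything else is unreachable
```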
32
The
Open Science Grid
(OSG)
Miron Livny - OSG PI & Facility Coordinator,
Computer Sciences Department
University of Wisconsin-Madison
Supported by the Department of Energy Office of Science SciDAC-2 program from the High Energy Physics, Nuclear
Physics and Advanced Software and Computing Research programs, and the National Science Foundation Math and
Physical Sciences, Office of CyberInfrastructure and Office of International Science and Engineering Directorates.
The Evolution of the OSG
[Timeline, 1999-2009] GriPhyN (NSF), iVDGL (NSF), PPDG (DOE) and the DOE Science Grid lead through Trillium / Grid3 to the OSG (DOE+NSF); LIGO moves from preparation to operation and the LHC from construction and preparation to operations, alongside the European Grid + Worldwide LHC Computing Grid and campus and regional grids.
34
The Open Science Grid vision
Transform processing- and data-intensive science through a cross-domain, self-managed, national distributed cyber-infrastructure that brings together campus and community infrastructure and facilitates the needs of Virtual Organizations (VOs) at all scales
35
D0 Data Re-Processing
[Charts: total events and OSG CPU-hours per week, weeks 1-23 of 2007, y-axis up to 160,000] 12 sites contributed up to 1000 jobs/day.
Chart legend (contributing sites): CIT_CMS_T2, FNAL_DZEROOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, MIT_CMS, MWT2_IU, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-RCAC, SPRACE, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE.
Totals: 2M CPU hours, 286M events, 286K jobs on OSG, 48TB input data, 22TB output data.
36
The Three Cornerstones
National, Campus, and Community need to be harmonized into a well-integrated whole.
37
OSG challenges
• Develop the organizational and management structure of a consortium that drives such a Cyber Infrastructure
• Develop the organizational and management structure for the project that builds, operates and evolves such a Cyber Infrastructure
• Maintain and evolve a software stack capable of offering powerful and dependable capabilities that meet the science objectives of the NSF and DOE scientific communities
• Operate and evolve a dependable and well-managed distributed facility
38
6,400 CPUs available
Campus Condor pool backfills idle nodes in PBS clusters: provided 5.5 million CPU-hours in 2006, all from idle nodes in clusters.
Use on TeraGrid: 2.4 million hours in 2006 spent building a database of hypothetical zeolite structures; in 2007, 5.5 million hours allocated to TG.
http://www.cs.wisc.edu/condor/PCW2007/presentations/cheeseman_Purdue_Condor_Week_2007.ppt
Clemson Campus Condor Pool
• Machines in 27 different locations on campus
• ~1,700 job slots
• >1.8M hours served in 6 months
• Users from Industrial and Chemical Engineering, and Economics
• Fast ramp-up of usage
• Accessible to the OSG through a gateway
40
Grid Laboratory of Wisconsin
2003 initiative funded by NSF (MRI)/UW at $1.5M; second phase funded in 2007 by NSF (MRI)/UW at $1.5M.
Six Initial GLOW Sites
• Computational Genomics, Chemistry
• Amanda, Ice-cube, Physics/Space Science
• High Energy Physics/CMS, Physics
• Materials by Design, Chemical Engineering
• Radiation Therapy, Medical Physics
• Computer Science
Diverse users with different deadlines and usage
patterns.
41
GLOW Usage between 2004-01-31 and 2007-11-08
Over 35M CPU hours served!
[Pie chart] Atlas 20%, ChemE 18%, LMCG 18%, CMS 17%, other 13%, IceCube 5%, MedPhysics 4%, CS 2%, CMPhysics 1%, MultiScalar 1%, Plasma Physics 1%
42
The next 20 years
We all came to this meeting because we
believe in the value of HTC and are aware of
the challenges we face in offering
researchers and educators dependable HTC
capabilities.
We all agree that HTC is not just about
technologies but is also very much about
people – users, developers, administrators,
accountants, operators, policy makers, …
43