Purdue Campus Grid
Preston Smith
psmith@purdue.edu
Condor Week 2006
April 24, 2006
Overview
• RCAC
– Community Clusters
• Grids at Purdue
– Campus
– Regional
• NWICG
– National
• OSG
• CMS Tier-2
• NanoHUB
• Teragrid
• Future Work
Purdue’s RCAC
• Rosen Center for Advanced Computing
– Division of Information Technology at Purdue
(ITaP)
– Wide variety of systems: shared memory and
clusters
• 352 CPU IBM SP
• Five 24-processor Sun F6800s, Two
56-processor Sun E10ks
• Five Linux clusters
Linux clusters in RCAC
• Recycled clusters
– Systems retired from student labs
– Nearly 1000 nodes of single-CPU PIII, P4,
and 2-CPU Athlon MP and EM64T Xeons for
general use by Purdue researchers
Community Clusters
• Federate resources at a low level
• Individual researchers buy sets of nodes
that are federated into larger clusters
– Enables larger clusters than any one scientist
could support on their own
– Leverage central staff and infrastructure
• No need to sacrifice a grad student to be a
sysadmin!
Community Clusters
Macbeth
– 126 dual-Opteron nodes, 1.8 GHz (~1 Tflops)
– 4-16 GB RAM
– Infiniband; GigE for IP traffic
– 7 owners (ME, Biology, HEP Theory)
Lear
– 512 dual-Xeon (64-bit) nodes, 3.2 GHz (6.4 Tflops)
– 4 GB and 6 GB RAM
– GigE
– 6 owners (EEx2, CMS, Provost, VPR, Teragrid)
Hamlet
– 308 dual-Xeon nodes, 3.06-3.2 GHz (3.6 Tflops)
– 2 GB and 4 GB RAM
– GigE, Infiniband
– 5 owners (EAS, BIOx2, CMS, EE)
Community Clusters
• Primarily scheduled with PBS
– Contributing researchers are assigned a
queue that can run as many “slots” as they
have contributed (a qmgr sketch follows this
slide).
• Condor co-schedules alongside PBS
– When PBS is not running a job, a node is fair
game for Condor!
• But Condor work is subject to preemption if PBS
assigns work to the node.
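A minimal sketch of how a contributed-nodes cap might look in PBS, assuming
Torque-style qmgr syntax; the queue name, node count, and user list below are
hypothetical, not RCAC's actual settings:

# Hypothetical queue for a group that contributed 32 nodes
qmgr -c "create queue smith_lab queue_type=execution"
qmgr -c "set queue smith_lab resources_max.nodect = 32"
qmgr -c "set queue smith_lab acl_user_enable = true"
qmgr -c "set queue smith_lab acl_users = psmith"
qmgr -c "set queue smith_lab enabled = true"
qmgr -c "set queue smith_lab started = true"

The prologue/epilogue scripts and START expression that hand idle nodes to
Condor appear on the appendix slides at the end of this deck.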
Condor on Community Clusters
• All in all, Condor joins together 4 clusters
(~2500 CPU) within RCAC.
Grids at Purdue - Campus
• Instructional computing group manages a
1300-node Windows Condor pool to
support instruction.
– Mostly used by computer graphics classes for
rendering animations
• Maya, etc.
– Work in progress to connect Windows pool
with RCAC pools.
Grids at Purdue - Campus
• Condor pools around campus
– Physics department: 100 nodes, flocked
– Envision Center: 48 nodes, flocked (flocking
config sketched after this slide)
• Potential collaborations
– Libraries: ~200 nodes on Windows terminals
– Colleges of Engineering: 400 nodes in
existing pool
• Or any department interested in sharing
cycles!
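A minimal sketch of the flocking setup behind these pools, assuming hypothetical
host names; the actual RCAC and departmental machine names differ:

# On the departmental submit machine: try the RCAC pool when local
# machines are busy
FLOCK_TO = rcac-cm.example.purdue.edu

# On the RCAC central manager and execute nodes: accept flocked jobs
# from the departmental schedd
FLOCK_FROM = physics-submit.example.purdue.edu
HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), $(FLOCK_FROM)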
Grids at Purdue - Regional
• Northwest Indiana
Computational Grid
– Purdue West Lafayette
– Purdue Calumet
– Notre Dame
– Argonne Labs
• Condor pools available to
NWICG today.
• Partnership with OSG?
Open Science Grid
• Purdue active in Open Science Grid
– CMS Tier-2 Center
– NanoHUB
– OSG/Teragrid
Interoperability
• Campus Condor pools accessible to OSG
– Condor provides access to extra, non-dedicated
cycles for CMS and is becoming the preferred
interface for non-CMS VOs.
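A minimal sketch of how an OSG user might reach these pools through the site's
Globus gatekeeper with Condor-G; the gatekeeper host name and executable are
hypothetical:

# Hypothetical Condor-G submit description
universe      = grid
grid_resource = gt2 osg-gw.example.purdue.edu/jobmanager-condor
executable    = analyze.sh
output        = analyze.out
error         = analyze.err
log           = analyze.log
queue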
CMS Tier-2 - Condor
– MC production from UW-HEP ran this spring
on RCAC Condor pools.
• Processed about 23% of the entire production.
• High rates of preemption, but that’s expected!
– 2006 will see dedicated Condor worker nodes
added to the Tier-2, alongside the existing PBS
clusters.
• Condor running on resilient dCache nodes.
NanoHUB
[Architecture diagram: nanoHUB VO science gateway and workspaces;
middleware, grid, and VM layers; campus grids (Purdue, GLOW) for
capability computing; virtual backends and virtual clusters with
VIOLIN for capacity computing of research apps]
Teragrid
• Teragrid Resource Provider
• Resources offered to Teragrid
– Lear cluster
– Condor pools
– Data collections
Teragrid
• Two current projects active in Condor
pools via Teragrid allocations
– Database of Hypothetical Zeolite Structures
– CDF Electroweak MC Simulation
• Condor-G Glide-in
• Great exercise in OSG/TG Interoperability
– Identifying other potential users
Teragrid
• TeraDRE - Distributed
Rendering on the Teragrid
– Globus, Condor, and IBRIX FusionFS enable
Purdue’s Teragrid site to serve as a render
farm (submit sketch after this slide)
• Maya and other renderers
available
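A minimal sketch of a frame-per-job render submission on such a farm; the
wrapper script and frame count are hypothetical, not TeraDRE's actual
interface:

# Hypothetical Condor submit description: one job per frame
universe    = vanilla
executable  = render_frame.sh
arguments   = $(Process)
output      = frame_$(Process).out
error       = frame_$(Process).err
log         = render.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 300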
Grid Interoperability
[Diagram: the “Lear” cluster]
Grid Interoperability
• Tier-2 to Tier-2 connectivity via dedicated
Teragrid WAN (UCSD->Purdue)
• Aggregating resources at a low level makes
interoperability easier!
– OSG stack available to TG users and vice
versa
• “Bouncer” Globus job forwarder
Future of Condor at Purdue
• Add resources
– Continue growth around campus
• RCAC
• Other departments
• Add Condor capabilities to resources
– Teragrid data portal adding on-demand processing
with Condor now
• Federation
– Aggregate Condor pools with other institutions?
Condor at Purdue
• Questions?
PBS/Condor Interaction
PBS Prologue
# Prevent new Condor jobs and push any existing ones off
#
# Flip the runtime config flag that the START expression checks
/opt/condor/bin/condor_config_val -rset -startd \
    PBSRunning=True > /dev/null
/opt/condor/sbin/condor_reconfig -startd > /dev/null

# If the startd still reports claimed slots, evict the Condor job
if ( condor_status -claimed -direct $(hostname) 2>/dev/null \
     | grep -q Machines )
then
    condor_vacate > /dev/null
    sleep 5
fi
PBS/Condor Interaction
PBS Epilogue
# Allow Condor jobs again now that the PBS job has finished
/opt/condor/bin/condor_config_val -rset -startd \
    PBSRunning=False > /dev/null
/opt/condor/sbin/condor_reconfig -startd > /dev/null
Condor START Expression in condor_config.local
# Default: no PBS job until the prologue says otherwise
PBSRunning = False
# Only start jobs if PBS is not currently running a job
PURDUE_RCAC_START_NOPBS = ( $(PBSRunning) == False )
START = $(START) && $(PURDUE_RCAC_START_NOPBS)
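A small optional addition, assuming the 2006-era STARTD_EXPRS knob: advertising
the flag in the startd ClassAd makes it easy to see which nodes are currently
closed to Condor:

# Publish PBSRunning in the startd ClassAd
STARTD_EXPRS = $(STARTD_EXPRS), PBSRunning
# Then, for example:
#   condor_status -constraint 'PBSRunning == True'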