Windows Condor Pool at Clemson University
Sebastien Goasguen
School of Computing and Clemson Computing Information Technology (CCIT)
May 2nd 2007
Condor Week 2007
Clemson basics
• Public land grant institution founded in 1889
• ~13,000 undergrads and ~4,500 grads
• ~1,300 faculty members
• New commitment to computing as the backbone of research and teaching
Clemson Computing DNA
• Clemson has put computing at the core of its mission
– New CIO: Jim Bottum
– New CTO: Jim Pepin
– New School of Computing: searches in progress for a school director and three division leaders, followed a couple of months later by six assistant professors
• Building traditional HPC from scratch
– No prior involvement in HPC support
– No trained staff for either system administration or application support
• Infrastructure and hardware are there or coming
– 20,000 sq ft of raised floor, new power coming straight from the nuclear plant at 3 cents per kWh
– 10 Gbps connection being worked on, NLR 6 miles away from
machine room. ~$1.5M SCLR approved by board of trustees last week
– Above 10 Tflops in the works through various sources. Automotive
research center (Michelin, BMW…), Faculty community cluster and
Provost support
• All hands on deck to build the CI campus of tomorrow
Building CI at Clemson
• The Fabric layer
– HPC resources -> Clusters
– Campus Grid -> Condor
– Sharing of resources -> OSG
• The middleware layer
– Deploy service interfaces to our resources -> WS
– Increase identity management capabilities for sharing -> GridShib
• The application layer
– New environments for students
– New environments for researchers
• -> “Portal”, “Gateways”, desktop applications, other…
• A social layer
– Raising awareness on campus
– Fulfilling expectations of faculty
Teaching CI
Where to start?
• No HPC resources
• No expertise in HPC or grids
• “What works, is reliable, and is free?”
• “Let’s build a Windows Condor pool, and let’s join OSG”
Randy Martin, David Atkinson, Matt Rector
Results
• Built a pool of ~1,000 machines and got usage within 4 months
• Learned Condor installation, administration, and debugging
• The experience improved our management of the Windows machines
– More efficient lab image distribution. The current in-house distribution method takes days to push image changes to all PCs. Machine ads also need to be specified in the image.
– Eliminates the need for 2 am image refreshes on each lab PC.
• Outreach to the whole campus
• Got familiar with grid software and operation; used VDT
• Attended Condor Week
Details
• 1,085 Windows machines and 2 Linux machines (the central manager and an OSG gatekeeper); Condor reports 1,563 slots
• 845 maintained by CCIT
• 241 from other campus departments
• >50 locations
• From 1 to 112 machines per location
• Student housing, labs, library, coffee shop
• Mary Beth Kurz, first Condor user at Clemson:
– March: 215,000 hours, ~110,000 jobs
– April: 110,000 hours, ~44,000 jobs
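As a rough reading of the usage figures above, the average job length can be estimated by dividing hours by jobs. This is only a sketch: it assumes the hours are the total compute hours consumed by the listed jobs, and it treats the approximate job counts as exact. All names in the code are illustrative.

```python
# Usage figures quoted on the slide (job counts are approximate there).
usage = {
    "March": {"hours": 215_000, "jobs": 110_000},
    "April": {"hours": 110_000, "jobs": 44_000},
}


def hours_per_job(hours, jobs):
    """Average hours per job, assuming 'hours' covers exactly these jobs."""
    return hours / jobs


averages = {month: hours_per_job(**u) for month, u in usage.items()}

for month, avg in averages.items():
    print(f"{month}: about {avg:.2f} hours per job")
```

Under these assumptions the average job ran for roughly two to two and a half hours.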
The world before Condor
• 1800 input files
• 3 alternative genetic algorithm designs
• 50 replicates desired
• Estimated running time on a 3.2 GHz machine with 1 GB RAM: 241 days
Slides from Dr. Kurz
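To put this workload in perspective, the number of runs is the product of the three counts above. The sketch below rests on two assumptions not stated on the slide: that the 241-day estimate covers the full set of runs executed one after another on a single machine, and that jobs scale ideally across machines with no scheduling or file-transfer overhead. The concurrency of 200 is the figure reported later in the talk.

```python
# Figures from the slide.
input_files = 1800
designs = 3
replicates = 50
estimated_serial_days = 241

# Every input file is run with every design, replicated 50 times.
total_runs = input_files * designs * replicates

# Assumption: the 241 days covers all runs, executed serially.
seconds_per_run = estimated_serial_days * 24 * 3600 / total_runs


def wall_clock_days(serial_days, concurrent_jobs):
    """Ideal wall-clock time when the work is spread over concurrent jobs."""
    return serial_days / concurrent_jobs


print(f"total runs: {total_runs}")
print(f"average seconds per run: {seconds_per_run:.1f}")
print(f"days with 200 concurrent jobs: {wall_clock_days(estimated_serial_days, 200):.2f}")
```

Under these assumptions, eight months of serial computing shrinks to a little over a day once 200 jobs run at a time.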
First submit file attempt
Monday noon-ish
• Used the documentation and examples on the Wisconsin Condor site and created:
Universe = vanilla
Executable = main.exe
log = re.log
output = out.$(Process).out
arguments = 1 llllll-0
Queue
• Forgot to specify Windows and Intel in the requirements, and also to transfer the output back (thanks, David Atkinson)
• Got a single submit file to run 2 specific input files by mid-afternoon Tuesday
Slides from Dr. Kurz
Tuesday 6 pm – submitted 1800 jobs in a cluster
Universe = vanilla
Executable = MainCondor.exe
requirements = (Arch == "INTEL") && (OpSys == "WINNT51")
should_transfer_files = YES
transfer_input_files = InputData/input$(Process).ft
when_to_transfer_output = ON_EXIT
log = run_1/re_1.log
output = run_1/re_1.stdout
error = run_1/re_1.err
transfer_output_remaps = "1.out = run_1/opt1output$(Process).out"
arguments = 1 input$(Process)
queue 1800
• Only 200 ran at a time at first, but that eventually got resolved
Slides from Dr. Kurz
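The submit file above relies on the $(Process) macro: with "queue 1800", Condor numbers the jobs in the cluster from 0 to 1799 and substitutes that number wherever $(Process) appears. The Python sketch below only mimics that substitution to show which files each job touches. It does not call Condor, the helper names are hypothetical, and it handles only the exact spelling $(Process).

```python
def expand(template, process):
    """Mimic Condor's $(Process) substitution for one job (illustration only)."""
    return template.replace("$(Process)", str(process))


def job_files(process):
    """Per-job values taken from the macro-bearing lines of the submit file."""
    return {
        "transfer_input_files": expand("InputData/input$(Process).ft", process),
        "arguments": expand("1 input$(Process)", process),
        "remapped_output": expand("run_1/opt1output$(Process).out", process),
    }


# "queue 1800" yields process numbers 0 through 1799.
jobs = [job_files(p) for p in range(1800)]

print(jobs[0])
print(jobs[-1])
```

Each job therefore reads its own input file and writes its own remapped output file, which is what lets one submit file describe all 1800 jobs.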
Wednesday afternoon: Love notes
Slides from Dr. Kurz
Type of jobs being worked on
• GROMACS, molecular dynamics
• Bio text mining for the Medline database (Java universe)
• Cygwin
• OSG jobs
• Matlab / Octave in nanoHUB
• GLOW
• LIDAR data analysis, FDTD, Neural Networks
“Currently, we are conducting large text mining on the Condor. The Medline database of U.S. National Library of Medicine includes over 17 million citations of life science journals for biomedical articles back to 1950s. Our research focuses on mining relationship between ten thousands genes, chemicals and hundreds of diseases from Medline database. The Condor provides us a platform for the quick, parallel search the Medline database.”
Dr. Feng Luo, School of Computing
Open Science Grid
• Join the national infrastructure
• Use the national infrastructure
• Contribute resources (hardware and human)
• Ease of installation through VDT
Firewall Issues
• A couple of years ago, after Blaster and similar worms, Clemson put every machine behind a firewall.
• Globus ephemeral ports closed
– Cannot send Globus jobs from my desktop
[Diagram labels: desktop, ssh, osggate, Condor-C]
nanoHUB Internals
[Architecture diagram; labels:]
• VNC redirect
• Sessions managed by InVIGO-Lite
• Static set of VMs
• Local Virtual Machines
• Condor-C submit
• VIOLIN Virtual Cluster
• PBS submit
• Gateway machine
• Initializes trusted proxy
• Condor-G submit
• Globus-enabled resources, GT2 or GT4 WSRF
Evolution of Science Gateways for Virtual Organizations
[Figure; labels: SSH - Direct Access; gsissh and/or Web Services; Web Services; Fully Service Oriented Architecture/Semantic Grid Technologies; Remote Resources; Information Age; Interaction Age; Least interactions; Social interactions; Social immersion]
Next-generation: Socially immersive science gateways
Work being led by Prof. Madhavan with CCIT collaboration
Conclusions
• Clemson has made computing a priority
• Condor is the first “CI” project at Clemson
• OSG is a close second
• Condor has already had an impact on Clemson researchers
• Clemson hopes to contribute to the community
• NSF seems happy…
• Thanks to the Condor team!
• Acknowledgements: Randy Martin, David Atkinson, Matt Rector, Mike Gossett, John Minor, Matt Saltzmann, Mary-Beth Kurz, Feng Luo