What Makes the Early Birds Get the Worm? Industrial Drivers/Solutions for Grid/Condor

Jason Stowe
Why HTC? What are “Enterprise’s”
requirements for Grid?
Based upon
first-hand experience
What is an ‘Enterprise’?
What motivates them?
Includes
Research/Gov’t/Academic
environments?
What about Money?
Doesn’t industry spend more on large
computations?
Remember our community
Then just look at Top 10 or 50 of the
largest computing sites…
Many/most are not at companies
So, it's *Any* Organization Using
Condor with…
Demanding Users
Organization =>
Groups of Demanding Users
Purchased
Computer Capacity
Need
Computation
Done.
Easily.
On a Deadline.
In-House,
Third Party Applications
without modification
in order to do their jobs
What drives
Enterprise Condor Users?
Cost?
Important, Certainly
Condor works well here,
more $$$ for Hardware
Reliability
Uptime
Fault Tolerance
Disaster Recovery
Condor provides
High Availability,
Fault-tolerant Design
20+ Years old, so
it is very stable
Users have varying
levels of tolerance for downtime
Depends upon the application
Commercial endeavors have tighter
Service Level Agreements
What else about SLAs?
SLAs = Latency/Throughput
Some Applications Require
High-throughput
Many longer jobs, none lost,
even with failures
Example: Movies,
lots of frames,
don't miss any
Some Require low-latency
Many short jobs, need
fast response
Missing some is fine,
we'll decide about retries
Example: Trader pricings,
fast turnaround times
Throughput and latency
are related
Overall utilization ~
scheduling latency as % of
compute time
=> Decreasing latency improves
utilization (overall goodput)
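The relation above can be sketched numerically. This is a minimal sketch, assuming utilization is modeled as compute time divided by compute time plus per-job scheduling latency; the function name and the sample numbers are illustrative assumptions, not measurements from any pool.

```python
def utilization(compute_seconds: float, scheduling_latency_seconds: float) -> float:
    """Fraction of a slot's wall-clock time spent on useful work (assumed model)."""
    total = compute_seconds + scheduling_latency_seconds
    if total <= 0:
        raise ValueError("total time must be positive")
    return compute_seconds / total


# Hypothetical workloads: a long render frame and a short pricing job.
long_job = utilization(3600.0, 30.0)       # latency is a small % of compute time
short_job = utilization(10.0, 30.0)        # latency dominates the short job
short_job_fast = utilization(10.0, 1.0)    # same job, lower scheduling latency

print(f"long job:            {long_job:.3f}")
print(f"short job:           {short_job:.3f}")
print(f"short job, fast sched: {short_job_fast:.3f}")
```

Under this model, the same scheduling latency barely affects long jobs but dominates short ones, which is why low-latency workloads are the ones that benefit most from faster scheduling.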
Critical aspect:
Missing Job Tolerance
The Bottom Line =>
Productivity
Employee productivity
Easy integration of
custom and 3rd party
applications
Condor does well at these
Cycle Computing
Training/Consultation
on Condor
Developers want APIs
to submit work
SOAP/DRMAA
Not only people
productivity
Also resource
utilization
Budgets = looking at how much of the
resources is 'used'
Cycle-stealing, fair-share
become drivers
Corporations do
share internally
Condor fits perfectly
into this scenario
On-demand Computing is
appearing as a driver
CycleCloud™
On-demand
Condor pools
in Beta (contact me)
Competitive Advantage:
Computation Options
Want any competitive
advantage they can get:
Hiring and Computing
Be able to use any compute
technology available
Want Computing Options
for Advantage:
Windows/Linux/x86/PS3/etc.
Avoid “Vendor Lock-in”
Because that causes
spending lots of
Competitive Advantage:
Scale
Entities' computing usage
constantly increases
10 slots become 100,
become several hundred
Thanks to Moore’s Law
With numerous sites
having X000s of slots
Condor can grow with
the installation
Competitive Advantage =>
Scheduling
Project/User Priorities
AccountingGroups enable priority
for departments/projects/etc.
Quotas for
minimum/maximum
capacity
Need Flexible control of how
resources are used
Condor ClassAds and
policies are the best at this
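To picture what minimum/maximum capacity quotas mean for groups sharing a pool, here is a minimal sketch. It is not Condor's negotiation algorithm; the `Group` class, the `allocate` function, the group names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    min_slots: int  # guaranteed minimum capacity
    max_slots: int  # ceiling on capacity
    demand: int     # slots the group's waiting jobs could use right now


def allocate(groups: list[Group], total_slots: int) -> dict[str, int]:
    """Illustrative two-pass split of a pool under min/max quotas."""
    # Pass 1: satisfy guaranteed minimums, never exceeding actual demand.
    alloc = {g.name: min(g.min_slots, g.demand) for g in groups}
    remaining = total_slots - sum(alloc.values())
    if remaining < 0:
        raise ValueError("minimum quotas exceed pool size")
    # Pass 2: hand out leftover slots one at a time, round-robin, up to each cap.
    progress = True
    while remaining > 0 and progress:
        progress = False
        for g in groups:
            cap = min(g.max_slots, g.demand)
            if remaining > 0 and alloc[g.name] < cap:
                alloc[g.name] += 1
                remaining -= 1
                progress = True
    return alloc


# Hypothetical departments sharing a 100-slot pool.
groups = [
    Group("rendering", min_slots=40, max_slots=80, demand=100),
    Group("pricing", min_slots=20, max_slots=30, demand=25),
    Group("research", min_slots=0, max_slots=50, demand=10),
]
result = allocate(groups, total_slots=100)
print(result)
```

Each group receives its guaranteed minimum first; idle capacity then flows to whichever groups still have demand, bounded by their maximum.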
Competitive Advantage =>
Virtualization
Checkpoint Any Job
Security on Workstations
Condor 6.9
Competitive Advantages =>
Easy Management
Ability to manage configuration
Analyze productivity across
many resources
Usage reporting and
visualization
Accountability
Audit changes, Authorization
Monitoring of Machine/Condor
Condor provides CmdLine Tools
Community provides support
Grids generate Lots of Data
But
Need Productivity Solutions
In Administration,
Provisioning, and Analysis
Need Analysis, Management,
Training Solutions
Cycle Computing
• CycleServer™ Management
• Consulting and Advice
• Implementation
• Training
Cycle has Experience with
Tens of Condor Pools,
5000+ machines, X0000 slots
Attempting to implement policies
for sharing or scaling?
Consulting for:
Policy Creation, Best Practices
Improvements in Various
Environments:
20x in Negotiation Performance
10x in Scheduler Capacity
Need help creating software
pipelines or pools?
Implementation for:
Pool Setup,
Software Pipeline Development,
Low-latency scheduling
Several Hundred Servers up and
running in 1.5 days
Want to bring Users/System
Administrators up to speed
on Condor?
Training Classes for:
Condor Administration
Architecture Best Practices
Submission Best Practices
Command Lines
Issue Diagnosis
Do you need coverage for
Condor issues with variable SLAs?
Condor/CycleServer Support:
24/7 Phone (SLAs in the hours)
E-mail based (SLAs in days)
Need Management for
Condor Pools to enable
Analysis, Reporting, and
Auditing?
CycleServer™
Web-based GUI for
Condor Management
CycleServer™ Overview
– Web and Command-line based management
system for multiple Condor pools
– Built to run in Java Servlet Containers (e.g.
Tomcat, WebLogic, etc.)
– Data persistence to Oracle or PostgreSQL
– Uses XSLT transforms for HTML
presentation, so page layout/look can be
configured
CycleServer™ Overview
• Configuration Management and Machine Movement
• Auditing for Config Changes and Condor Commands
• Job Status and Diagnostic Information
• Pool Administration w/o logging in to machines
• Authorization/Permissions for Actions on Grid
• Usage Monitoring, Capacity Planning, Reporting
• Easy Pool Status notifications and viewing
• Easy shutdown of Machines for maintenance
Thank you NeSC.
Demo. Questions?
http://www.cyclecomputing.com
jstowe @ cyclecomputing.com