Performing Large Science Experiments on Azure: Pitfalls and Solutions
Wei Lu, Jared Jackson, Jaliya Ekanayake, Roger Barga, Nelson Araujo
Microsoft eXtreme Computing Group
CloudCom 2010, Indianapolis, IN
Windows Azure
[Platform diagram: applications run on Azure's Storage and Compute services, all managed by the Azure Fabric]
Suggested Application Model
• Use queues for reliable messaging; to scale, add more of either role
• Web Role (ASP.NET, WCF, IIS, etc.) puts work in the queue
• Worker Role (main() { … }) gets work from the queue and does the work
• Benefits: decouples the system, absorbs bursts, resilient to instance failures, easy to scale
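A minimal sketch of the worker-role loop, in Python with a hypothetical queue client (get_message/delete_message and the message shape are illustrative assumptions, not a specific Azure SDK):

import time

def do_work(payload):
    # Placeholder for the real task logic (hypothetical).
    print("processing", payload)

def worker_loop(queue, visibility_timeout=120):
    while True:
        # A de-queued message turns invisible for visibility_timeout
        # seconds; if this worker crashes before deleting it, the
        # message reappears and another instance retries it.
        msg = queue.get_message(visibility_timeout=visibility_timeout)
        if msg is None:
            time.sleep(5)             # queue empty: back off, poll again
            continue
        do_work(msg.content)          # do the work the message describes
        queue.delete_message(msg)     # delete only after success

Deleting the message only after the work succeeds is what makes the pattern resilient to instance failure.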
Azure Queue
• Communication channel between instances
  – Messages in the queue are reliable and durable
  – 7-day maximum message lifetime
• Fault-tolerance mechanism
  – A de-queued message becomes visible again after its visibilityTimeout if it is not deleted
  – 2-hour maximum visibilityTimeout
  – Requires idempotent processing (see the sketch below)
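Because a message can be delivered more than once (it reappears after its visibilityTimeout), task processing must be idempotent. A sketch of one way to guard against duplicates, using a task status table; every name here is hypothetical:

def run_task(payload):
    # Placeholder for the actual computation (hypothetical).
    return "result of task %s" % payload["task_id"]

def save_output(task_id, result):
    # Placeholder for a blob write; must be safe to repeat (hypothetical).
    pass

def process_message(msg, task_table, queue):
    task_id = msg.content["task_id"]
    if task_table.get(task_id) == "done":
        queue.delete_message(msg)     # duplicate delivery: just drop it
        return
    save_output(task_id, run_task(msg.content))
    task_table.put(task_id, "done")   # record completion before deleting
    queue.delete_message(msg)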
AzureBLAST
• BLAST (Basic Local Alignment Search Tool)
  – One of the most widely used tools in bioinformatics
  – Identifies similarity between bio-sequences
• BLAST is highly computation-intensive
  – Large number of pairwise alignment operations
  – The size of sequence databases has been growing exponentially
• Two choices for running large BLAST jobs
  – Build a local cluster
  – Submit jobs to NCBI or EBI (long job queuing times)
• BLAST is easy to parallelize via query segmentation
[Figure: a splitting task fans the query out to many parallel BLAST tasks, whose outputs a merging task combines]
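A sketch of query segmentation, assuming the queries arrive as a FASTA file (records start with ">"); the partitioning scheme here is illustrative:

def split_fasta(path, n_partitions):
    # Collect the records; a FASTA record begins at a '>' header line.
    records, current = [], []
    with open(path) as f:
        for line in f:
            if line.startswith(">") and current:
                records.append("".join(current))
                current = []
            current.append(line)
        if current:
            records.append("".join(current))
    # Round-robin records into partitions to balance their counts.
    parts = [[] for _ in range(n_partitions)]
    for i, record in enumerate(records):
        parts[i % n_partitions].append(record)
    for i, part in enumerate(parts):
        with open("partition_%d.fasta" % i, "w") as out:
            out.writelines(part)

Each partition file then becomes one independent BLAST task, and a merging task concatenates the per-partition outputs.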
AzureBLAST Architecture
• Web Role: web portal plus a web service for job registration
• Job Management Role: job scheduler; the job registry is kept in an Azure Table
• Worker instances pull BLAST tasks from a global dispatch queue
• A database-updating role refreshes the NCBI databases
• Azure Blob storage holds the BLAST databases, temporary data, etc.
All-by-All BLAST Experiment
• "All-by-all" query: compare the database against itself
  – Discovers homologs: inter-relationships of known protein sequences
• Large protein database (4.2 GB)
  – 9,865,668 sequences in total
  – In theory, 100 billion sequence comparisons!
• Performance estimation
  – Would require 14 CPU-years
  – One of the biggest BLAST jobs we know of
Our Solution
• Allocated 3,776 weighted instances
  – 475 extra-large instances
  – From three datacenters: US South Central, West Europe, and North Europe
• Divided the 10 million sequences into several segments
  – Each segment is submitted to one datacenter as one job
  – Each segment consists of smaller partitions (a sketch follows)
• The job ultimately took two weeks
  – Total size of all outputs: ~230 GB
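A sketch of the two-level division (one segment per datacenter job, fixed-size partitions within a segment); the scheme and names are illustrative, not the exact one used:

def divide(sequences, n_segments, partition_size):
    # Level 1: split the sequence list into segments, one job each.
    seg_len = (len(sequences) + n_segments - 1) // n_segments
    segments = [sequences[i:i + seg_len]
                for i in range(0, len(sequences), seg_len)]
    # Level 2: split each segment into partitions, one task each.
    jobs = []
    for seg in segments:
        jobs.append([seg[i:i + partition_size]
                     for i in range(0, len(seg), partition_size)])
    return jobs   # submit jobs[k] to datacenter k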
Understanding Azure by Analyzing Logs
• A normal log pairs each "Executing" entry with a matching "done" entry:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it takes 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it takes 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it takes 17.27 mins
• Otherwise, something is wrong (e.g., a lost task): here task 251774 starts but never finishes:
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it takes 82 mins
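Lost tasks like 251774 can be found mechanically: any task that logs "Executing" but never logs a matching "is done" line. A short Python sketch over the log format shown above:

import re

START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def find_lost_tasks(log_lines):
    started, finished = set(), set()
    for line in log_lines:
        m = START.search(line)
        if m:
            started.add(m.group(1))
        m = DONE.search(line)
        if m:
            finished.add(m.group(1))
    return started - finished   # {'251774'} for the log above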
Challenges & Pitfalls
• Failures
• Instance idle time
• Limitations of the current Azure Queue
• Performance/cost estimation
• Minimizing the need for programming
Case Study 1
North Europe datacenter; 34,265 tasks processed in total
• Node replacement: avoid using the machine name in your program
• Almost one day of delay: try not to orchestrate instances with tight synchronization (e.g., barriers)
Case Study 2
North Europe datacenter; 34,256 tasks processed in total
• All 62 nodes lost tasks and then came back in a group-by-group fashion: this is the update domain at work
  – ~6 nodes per group, ~30 minutes apart
Case Study 3
West Europe datacenter; 30,976 tasks completed before the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: this is the fault domain at work
Challenges & Pitfalls
• Failures
  – Failures are to be expected, yet unpredictable
  – Design with failure in mind
  – Most are automatically recovered by the cloud
• Instance idle time
• Limitations of the current Azure Queue
• Performance/cost estimation
• Minimizing the need for programming
Challenges & Pitfalls
• Failures
• Instance idle time
  – Gap time between two jobs
  – Diversity of workloads
  – Load imbalance
• Limitations of the current Azure Queue
• Performance/cost estimation
• Minimizing the need for programming
Load Imbalance
North Europe datacenter; 2,058 tasks
• Two days of very low system throughput due to some long-tail tasks
• Task 56823 needed 8 hours to complete; it was re-executed by 8 nodes due to the 2-hour maximum visibilityTimeout of a message
Challenges & Pitfalls
• Failures
• Instance idle time
• Limitations of the current Azure Queue
  – 2-hour maximum visibilityTimeout: each individual task has to finish within 2 hours
  – 7-day maximum message lifetime: the entire experiment has to finish in less than 7 days
• Performance/cost estimation
• Minimizing the need for programming
Challenges & Pitfalls
• Failures
• Instance idle time
• Limitations of the current Azure Queue
• Performance/cost estimation
  – The better you understand your application, the more money you can save
  – BLAST alone has about 20 arguments
  – VM size matters
• Minimizing the need for programming
Cirrus: A Parameter-Sweeping Service on Azure
• Web Role: web portal plus a web service for job registration
• Job Manager Role: job scheduler, scaling engine, parametric engine, and sampling filter
• Worker instances pull tasks from a dispatch queue
• Job state is kept in an Azure Table; data lives in Azure Blob storage
Job Manager Role: Job Definition
• Declarative job definition
  – Derived from Nimrod
  – Each job can have a prolog, commands, and parameters
  – Azure-related operators: AzureCopy, AzureMount, SelectBlobs
  – Plus a job configuration
• Minimizes the programming needed to run legacy binaries on Azure
  – BLAST
  – Bayesian-network machine learning
  – Image rendering
• Example (below): the parametric engine expands %partition% once per blob under the partitions/ prefix, producing one task per partition
<job name="blast">
  <prolog>
    azurecopy http://.../uniref.fasta uniref.fasta
  </prolog>
  <cmd>
    azurecopy %partition% input
    blastall.exe -p blastp -d uniref.fasta -i input -o output
    azurecopy output %partition%.out
  </cmd>
  <parameter name="partition">
    <selectBlobs>
      <prefix>partitions/</prefix>
    </selectBlobs>
  </parameter>
  <configure>
    <minInstances>2</minInstances>
    <maxInstances>4</maxInstances>
    <shutdownWhenDone>true</shutdownWhenDone>
    <sampling>true</sampling>
  </configure>
</job>
Job Manager Role: Dynamic Scaling
• Scaling in/out for each individual job
  – Fits within the [min, max] window specified in the job configuration
• Synchronous scaling: tasks are dispatched only after the scaling operation is done
• Asynchronous scaling: task execution and the scaling operation proceed simultaneously (see the sketch after this list)
• Scale in when load imbalance happens
• Scale in when no new jobs arrive for a period of time
  – Or if the job is configured as "shutdown-when-done", usually used for a reducing job
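A minimal sketch of such a policy; set_instance_count and the job fields are hypothetical names, not the Azure management API:

def adjust_instances(job, current, pending_tasks, set_instance_count):
    if pending_tasks > current and current < job.max_instances:
        # Scale out asynchronously: request more instances but keep
        # dispatching tasks while the new ones spin up.
        set_instance_count(min(job.max_instances, pending_tasks),
                           wait=False)
    elif pending_tasks == 0 and current > job.min_instances:
        # Scale in synchronously: stop dispatching, let in-flight work
        # drain, then shrink, so running tasks are not killed.
        set_instance_count(job.min_instances, wait=True)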
Job Pause-Reconfig-Resume
• Each job maintains a task status table
  – Checkpoint by snapshotting the task table
  – A task can be incomplete
  – Works around the 7-day / 2-hour limitations
• Handle exceptions optimistically
  – Ignore the exceptions
  – Retry incomplete tasks with a reduced number of instances
  – Minimize the cost of failures
• Handles load imbalance
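A sketch of the pause/resume cycle around the task status table (all names hypothetical):

def pause(job, task_table, snapshot_store):
    # Checkpoint: snapshot the task table so the job can be killed,
    # reconfigured, and resumed later.
    snapshot_store.save(job.id, task_table.dump())

def resume(job, task_table, queue):
    # Re-enqueue only the tasks not yet marked done; fresh messages
    # restart the 7-day lifetime and 2-hour visibility clocks.
    incomplete = [t for t, status in task_table.items()
                  if status != "done"]
    for task_id in incomplete:
        queue.put({"task_id": task_id})
    return len(incomplete)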
Job Manager Role: Performance Estimation by Sampling
• Observation-based approach
  – Randomly sample the parameter space with sampling ratio α
• Only dispatch the sampled tasks
  – Scale in to only n′ instances to save cost
• Assuming a uniform distribution of task times, the estimate scales the observed sample time by 1/α (reconstruction below)
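The formula itself did not survive extraction; a plausible reconstruction from the stated assumptions (sampling ratio α, wall-clock sample time T_sample on n′ instances, full run on n instances):

\[
T_{\text{full}} \;\approx\; \frac{T_{\text{sample}} \cdot n'}{\alpha \cdot n}
\]

i.e., scale the sampled instance-time up by 1/α, then divide by the full run's instance count. Plugging in the evaluation numbers below (α = 0.02, 18 min on 2 instances, n = 16) gives ≈ 113 minutes, close to the measured 2-hour complete run.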
Evaluation: Sampling
• A complete BLAST run takes 2 hours with 16 instances
• A 2%-sampling run achieving 96% accuracy takes only about 18 minutes with 2 instances
• The overall cost of the sampling run is only 1.8% of the complete run
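As an arithmetic check on the cost claim: the sampling run consumes 18 min × 2 instances = 36 instance-minutes, versus 120 min × 16 instances = 1,920 instance-minutes for the complete run:

\[
\frac{18 \times 2}{120 \times 16} = \frac{36}{1920} \approx 1.9\%
\]

consistent with the reported ~1.8% figure.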
Evaluation: Dynamic Scaling
• Scaling out
  – Synchronous operation: stalls all instances for 80 minutes
  – Asynchronous operation: existing instances keep working while new instances join in 20-80 minutes
  – The scaled-out 16-instance run is 1.4x faster
• Scaling in
  – Synchronous operation: finished in 3 minutes; Azure randomly picks the instances to shut down
  – Asynchronous operation: caused random message loss and may lead to more instance idle time
• Best practices
  – Scale out asynchronously
  – Scale in synchronously
Conclusion
• Running large-scale parameter-sweeping experiments on Azure
• Identified pitfalls
  – Design with failure in mind (most failures are recoverable)
  – Watch out for instance idle time
  – Understand your application to save cost
  – Minimize the need for programming
• Our parameter-sweeping solution
  – Declarative job definition
  – Dynamic scaling
  – Job pause-reconfig-resume pattern
  – Performance estimation by sampling
Download