Ensembl Compute Grid issues James Cuff Informatics Systems Group

Wellcome Trust Sanger Institute
The Problem
• Find all the genes and syntenic regions between species
• 4.3Gb DNA, 480,000 sequences (human)
• 16-odd million traces (mouse)
• 7x coverage of mouse (3Gb)
[let's not even talk about fugu, zebrafish, mosquito, rat etc.]
• 16+ analysis types
– Automatic submission of jobs
– Dependencies between them
– Track their progress
– Retry failed jobs
– Need access to large file based databases
– Store output
– It must be easy to include new analysis types
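The requirements above (automatic submission, dependencies, progress tracking, retries) can be illustrated with a minimal sketch. This is not the Ensembl pipeline code, which the slides describe as Perl and MySQL; it is a small in-memory Python illustration, and `run_pipeline`, `_attempt`, and `max_retries` are hypothetical names chosen for the example.

```python
from collections import deque


def _attempt(job, max_retries):
    """Run one job, retrying on failure; return 'done' or 'failed'."""
    for _ in range(max_retries + 1):
        try:
            job()
            return "done"
        except Exception:
            continue  # retry until the attempts are used up
    return "failed"


def run_pipeline(jobs, deps, max_retries=2):
    """Run jobs in dependency order and track their status.

    jobs: dict mapping job name -> zero-argument callable (raises on failure)
    deps: dict mapping job name -> list of job names that must finish first
    Returns a dict mapping job name -> 'done', 'failed', or 'skipped'.
    """
    status = {}
    pending = deque(jobs)
    while pending:
        progressed = False
        for _ in range(len(pending)):
            name = pending.popleft()
            needed = deps.get(name, [])
            if any(status.get(d) in ("failed", "skipped") for d in needed):
                status[name] = "skipped"  # a prerequisite did not succeed
                progressed = True
            elif all(status.get(d) == "done" for d in needed):
                status[name] = _attempt(jobs[name], max_retries)
                progressed = True
            else:
                pending.append(name)  # prerequisites not finished yet
        if not progressed:
            # Remaining jobs wait on a cycle or an unknown job name.
            for name in pending:
                status[name] = "skipped"
            break
    return status
```

Adding a new analysis type in this sketch means adding one entry to `jobs` and, if needed, one entry to `deps`; the tracking logic does not change.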
e! - System overview
• Based on MySQL relational database and Perl
• Submission/tracking independent of the data
• Reusable analysis components
– Standalone objects
– Database aware objects
• Simple interfaces for rapid development
• Open Source, Open Standards…
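The split between standalone and database-aware objects might look roughly like the sketch below. It is an illustration only: the real system is Perl on MySQL, whereas this uses Python with the standard-library `sqlite3` module as a stand-in, and the class names, table names, and the GC-content analysis are all hypothetical, not the Ensembl schema or API.

```python
import sqlite3


class GCContent:
    """Standalone analysis object: takes a sequence, knows no database."""

    def run(self, sequence):
        seq = sequence.upper()
        if not seq:
            return 0.0
        return (seq.count("G") + seq.count("C")) / len(seq)


class DBAwareAnalysis:
    """Database-aware object: fetches input, runs the analysis, stores output."""

    def __init__(self, conn, analysis):
        self.conn = conn
        self.analysis = analysis

    def run(self, seq_id):
        row = self.conn.execute(
            "SELECT dna FROM sequence WHERE id = ?", (seq_id,)
        ).fetchone()
        if row is None:
            raise KeyError(seq_id)
        value = self.analysis.run(row[0])
        self.conn.execute(
            "INSERT INTO result (seq_id, value) VALUES (?, ?)", (seq_id, value)
        )
        self.conn.commit()
        return value


def make_demo_db():
    """Build a tiny in-memory database (hypothetical tables, demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequence (id INTEGER PRIMARY KEY, dna TEXT)")
    conn.execute("CREATE TABLE result (seq_id INTEGER, value REAL)")
    conn.execute("INSERT INTO sequence (id, dna) VALUES (1, 'GGCCAATT')")
    conn.commit()
    return conn
```

Because the standalone object never touches the database, it can be tested and reused on its own, which fits the slide's point about reusable components and rapid development.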
The Compute
• >1142 hosts (1,200 CPUs) in one LSF cluster
• 360 Alpha DS10Ls, 1GB, 60GB, 467MHz
• 768 Intel RLX blades, 1GB, 80GB, 800MHz
• 6 x ES45, 8 x ES40, 667 / 1000 MHz, 8-16GB
• 10+ TB of Fibre Channel storage
Typical CPU usage (last week)
768 nodes for >1 day
~ 2 years of CPU
Sustaining I/O and CPU is totally non-trivial
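The "~ 2 years of CPU" figure follows from simple arithmetic, sketched below. It assumes one CPU per node, which the slides do not state explicitly; `cpu_years` is a hypothetical helper name.

```python
DAYS_PER_YEAR = 365.25


def cpu_years(nodes, days, cpus_per_node=1):
    """Convert a farm run into CPU-years (assumes every CPU is busy throughout)."""
    return nodes * cpus_per_node * days / DAYS_PER_YEAR


# 768 nodes busy for one day is about 2.1 CPU-years.
one_day_on_farm = cpu_years(768, 1)
```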
[Diagram: Sanger compute and network layout. Labels in order of appearance:
PFAM; GS320 32-way, 128GB mem.; Sanger Compute; GS320 32-way, 128GB mem.; Humgen; 8 x ES45;
High throughput Farm; 768 RLX nodes; Oracle Cluster; 6 x DS20; 2 x ES40; 360 DS10 Alpha;
Ensembl cluster; 8 x ES40, 6 x ES40; Contingency cluster; backup engines + storage; SAN; Backup/mirrors;
Large scale assembly, sequencing & trace data; 19 x ES40, 4 x DS20; 8 x ES40 + 2 x DS20; SAN attached tape silos;
Informatics Development; 5 x ES40; SAN attached tape libraries; Pathogen; 15 x ES40;
User X at Institute Y; The ‘Internet’; Firewall DMZ; Internal Router;
Mail-hub, local ftp, secure login, Aceserver, Dial-in hubs;
Cancer Project; X-linked disease; 4 x ES40; 4Tb disk; Front-end Compute Servers; Desktop devices;
Extranet Web Cluster; 2 x ES40; 0.5Tb disk; Ensembl web Blast services; 12 ES40 + 6TB storage]
Whitehead Collaboration
(the problem)
• blastn all by all comparison:
– WI and TIGR Human BAC ends (800k entries) against
Human entries in Genbank (5.7GB)
• Existing pipeline from WI
• Java JDBC / Oracle pipeline based on XML
• Tight 2 week time frame (as always!)
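An all-by-all comparison of this size is normally split into many independent farm jobs. The slides do not say how the work was divided, so the sketch below is only one plausible scheme: it cuts the 800k query entries into fixed-size chunks, each of which would be searched against the full target set. The chunk size of 1,000 and the name `chunk_ranges` are assumptions made for illustration.

```python
def chunk_ranges(total_entries, chunk_size):
    """Yield half-open (start, end) index ranges that cover all entries."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_entries, chunk_size):
        yield start, min(start + chunk_size, total_entries)


# 800,000 BAC-end entries in chunks of 1,000 gives 800 independent jobs.
bac_end_jobs = list(chunk_ranges(800_000, 1_000))
```

Larger chunks mean fewer, longer jobs, which reduces scheduler and script-creation overhead at the cost of coarser retries when a job fails.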
Whitehead Collaboration
(the solution?)
• ssh / scp / ftp access to Sanger and WI systems…
• 2 weeks to set up and run:
– Oracle instance
– Set up user account, familiarisation with system
– Oracle dumps, copy ddl and input results
– Total data size: 21GB I/O
– System failures (recovery)
– A great many telephone / e-mail discussions ☺
• Only took 2 days total compute on just 360 nodes…
Computational Farms
(and their equivalent of foot and mouth)
• NFS/CIFS/AFS (network share) meltdown
– Creation of batch scripts (100,000’s of jobs – some take < 1min)
– Reading NFS-mounted binaries
– Reading NFS-mounted data files
– Writing output to NFS-mounted directories
• MySQL / Oracle meltdown
– Too many simultaneous connections
– Queries blocking each other
• LSF mbatchd meltdown (DRM failure in general)
– Broken code in general – both developer and sysadmin error
• Even when you are supposed to… “Know what you are doing…”
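The slides name "too many simultaneous connections" as a failure mode but do not describe a remedy. One common mitigation, shown here purely as an illustrative sketch and not as what was done at Sanger, is to cap how many jobs may hold a database connection at once. `ConnectionLimiter` is a hypothetical class; it guards a code region with a semaphore rather than wrapping any real MySQL or Oracle client, and it only limits threads within one process.

```python
import threading


class ConnectionLimiter:
    """Context manager that allows at most max_connections concurrent holders."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.active = 0  # holders currently inside the guarded region
        self.peak = 0    # highest number of simultaneous holders seen

    def __enter__(self):
        self._slots.acquire()  # blocks while all slots are taken
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self.active -= 1
        self._slots.release()
        return False  # do not swallow exceptions from the guarded block
```

A worker would open and use its database connection inside `with limiter:`, so that excess workers wait for a slot instead of overwhelming the server.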
External CPU and Data Collaborations
(How would an ‘ideal world GRID’ help?)
• Rapid data distribution to and from SI and external site?
• Zero to little setup time?
• ‘Direct’ connections to remote Oracle/MySQL instances
at Sanger (i.e. via replication)?
• No need for local account [shell] access?
• Single ‘system image’ – e.g. no need to find out where
java/perl/binaries live, how the queues work etc.?
MySQL – remote access
DS20, 250GB Alpha in DMZ with Ensembl data
From Cisco firewall logs,
1st Oct 2001 to 1st Oct 2002:
– 159,251 port 3306 TCP connections
– Corresponds to 1,016 unique hosts
– 348 hosts with more than 10 connections
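To put the firewall figures in perspective, the short calculation below derives average rates from the numbers on the slide. The only assumption is that the logging window (1st Oct 2001 to 1st Oct 2002) is counted as 365 days.

```python
connections = 159_251   # port 3306 TCP connections in the log window
unique_hosts = 1_016    # distinct remote hosts
heavy_hosts = 348       # hosts with more than 10 connections
days = 365              # 1 Oct 2001 to 1 Oct 2002

per_day = connections / days               # about 436 connections per day
per_host = connections / unique_hosts      # about 157 connections per host
heavy_fraction = heavy_hosts / unique_hosts  # about 34% of hosts
```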