Ensembl Compute Grid issues
James Cuff
Informatics Systems Group, Wellcome Trust Sanger Institute

The Problem
• Find all the genes and syntenic regions between species
• 4.3 Gb of DNA, 480,000 sequences (human)
• 16-odd million traces (mouse)
• 7x coverage of mouse (3 Gb)
• [let's not even talk about fugu, zebrafish, mosquito, rat etc.]
• 16+ analysis types:
  – Automatic submission of jobs
  – Dependencies between them
  – Track their progress
  – Retry failed jobs
  – Need access to large file-based databases
  – Store output
  – It must be easy to include new analysis types

e! - System overview
• Based on a MySQL relational database and Perl
• Submission/tracking independent of the data
• Reusable analysis components:
  – Standalone objects
  – Database-aware objects
• Simple interfaces for rapid development
• Open Source, Open Standards…

The Compute
• >1,142 hosts (1,200 CPUs) in one LSF cluster
• 360 Alpha DS10Ls: 1 GB RAM, 60 GB disk, 467 MHz
• 768 Intel RLX blades: 1 GB RAM, 80 GB disk, 800 MHz
• 6x ES45, 8x ES40: 667/1000 MHz, 8-16 GB RAM
• 10+ TB of Fibre Channel storage

Typical CPU usage (last week)
[Chart: farm CPU utilisation over the past week]
• 768 nodes for >1 day ≈ 2 years of CPU
• Sustaining that I/O and CPU load is totally non-trivial

Sanger Compute
[Diagram: Sanger compute and network infrastructure. Main components: PFAM (GS320 32-way, 128 GB memory); Humgen (8x ES45); high-throughput farm / Ensembl cluster (768 RLX nodes, 360 DS10 Alphas); Oracle cluster (6x DS20, 2x ES40); contingency cluster (8x ES40 + 6x ES40); backup engines and storage (8x ES40 + 2x DS20, SAN-attached tape silos, SAN backup/mirrors); Informatics Development (5x ES40, SAN-attached tape libraries); large-scale assembly, sequencing & trace data (19x ES40, 4x DS20); Pathogen (15x ES40); Cancer Project / X-linked disease (4x ES40, 4 TB disk); extranet web cluster (2x ES40, 0.5 TB disk); Ensembl web and Blast services (12x ES40 + 6 TB storage); front-end compute servers and desktop devices; mail hub, local ftp, secure login, Aceserver and dial-in hubs; all behind a firewall, DMZ and internal router, reached by external users ("User X at Institute Y") over the Internet.]

Whitehead Collaboration (the problem)
• blastn all-by-all comparison:
  – WI and TIGR human BAC ends (800k entries) against human entries in GenBank (5.7 GB)
• Existing pipeline from WI
• Java JDBC / Oracle pipeline based on XML
• Tight 2-week time frame (as always!)
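For concreteness, a minimal sketch of how a comparison like this gets farmed out with LSF: split the query FASTA into chunks and submit one blastn job per chunk with bsub. This is illustrative only, not the actual WI/Sanger pipeline; the file paths, queue name, chunk size and database location are all assumptions.

```perl
#!/usr/bin/env perl
# Illustrative sketch only: chunk a large FASTA query set and submit one
# blastn job per chunk to LSF. Paths, chunk size, queue name and database
# location are hypothetical placeholders, not the real pipeline config.
use strict;
use warnings;

my $query_fasta = 'bac_ends.fa';          # assumed: 800k BAC-end sequences
my $database    = '/data/genbank_human';  # assumed: pre-formatted BLAST db on local disk
my $chunk_size  = 500;                    # sequences per farm job (illustrative)
my $chunk_dir   = 'chunks';
my $out_dir     = 'results';

mkdir $chunk_dir unless -d $chunk_dir;
mkdir $out_dir   unless -d $out_dir;

# Split the query FASTA into chunks of $chunk_size sequences.
open my $in, '<', $query_fasta or die "Cannot read $query_fasta: $!";
my ($chunk, $count, $out) = (0, 0, undef);
while (my $line = <$in>) {
    if ($line =~ /^>/ && $count++ % $chunk_size == 0) {
        close $out if $out;
        $chunk++;
        open $out, '>', "$chunk_dir/chunk_$chunk.fa"
            or die "Cannot write chunk $chunk: $!";
    }
    print {$out} $line if $out;
}
close $out if $out;
close $in;

# One LSF job per chunk; each job writes its own output file rather than
# hammering a shared NFS directory from hundreds of nodes at once.
for my $i (1 .. $chunk) {
    my $cmd = "blastall -p blastn -d $database "
            . "-i $chunk_dir/chunk_$i.fa -o $out_dir/chunk_$i.blast";
    system('bsub', '-q', 'normal',
           '-o', "$out_dir/chunk_$i.lsf.out",
           '-e', "$out_dir/chunk_$i.lsf.err",
           $cmd) == 0 or warn "bsub failed for chunk $i\n";
}
```

The chunk size is the important knob: hundreds of thousands of sub-minute jobs is exactly the NFS and LSF mbatchd meltdown scenario described on the "Computational Farms" slide below.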
Whitehead Collaboration (the solution?)
• ssh / scp / ftp access to Sanger and WI systems…
• 2 weeks to set up and run:
  – Oracle instance
  – Set up user account, familiarisation with the system
  – Oracle dumps, copying DDL and input/results
  – Total data size: 21 GB
  – I/O system failures (and recovery)
  – A great many telephone / e-mail discussions
• The compute itself only took 2 days in total, on just 360 nodes…

Computational Farms (and their equivalent of foot and mouth)
• NFS/CIFS/AFS (network share) meltdown:
  – Creation of batch scripts (100,000s of jobs, some taking < 1 min)
  – Reading NFS-mounted binaries
  – Reading NFS-mounted data files
  – Writing output to NFS-mounted directories
• MySQL / Oracle meltdown:
  – Too many simultaneous connections
  – Queries blocking each other
• LSF mbatchd meltdown (DRM failure in general):
  – Broken code in general – both developer and sysadmin error
• Even when you are supposed to… "know what you are doing…"

External CPU and Data Collaborations (how would an 'ideal world' GRID help?)
• Rapid data distribution to and from the Sanger Institute and an external site?
• Zero to little setup time?
• 'Direct' connections to remote Oracle/MySQL instances at Sanger (i.e. via replication)?
• No need for local account [shell] access?
• A single 'system image' – e.g. no need to find out where java/perl/binaries live, how the queues work, etc.?
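To make the third point concrete, the kind of 'direct' access the slide asks for is what a collaborator gets from a plain Perl DBI connection to a read-only replicated MySQL instance, with no shell account at Sanger. A minimal sketch follows; the hostname, database name, credentials and the queried table are all placeholders, not a real Sanger service endpoint.

```perl
#!/usr/bin/env perl
# Illustrative sketch only: remote read-only access to a replicated MySQL
# instance via Perl DBI. Host, port, database, user and table names below
# are assumptions for the example, not a documented service.
use strict;
use warnings;
use DBI;

my $dsn = 'DBI:mysql:database=ensembl_core;host=db.example.org;port=3306';
my $dbh = DBI->connect($dsn, 'anonymous', '', { RaiseError => 1 })
    or die $DBI::errstr;

# A trivial query to prove the connection works end to end.
my $sth = $dbh->prepare('SELECT COUNT(*) FROM gene');
$sth->execute;
my ($gene_count) = $sth->fetchrow_array;
print "genes: $gene_count\n";

$dbh->disconnect;
```

Run from the collaborator's own site, this replaces the account setup, Oracle dumps and scp shuffling that consumed most of the two weeks in the Whitehead collaboration above.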