Ensembl Compute Grid issues
James Cuff
Informatics Systems Group
Wellcome Trust Sanger Institute

The Problem
• Find all the genes and syntenic regions between species
• 4.3Gb of DNA, 480,000 sequences (human)
• 16-odd million traces (mouse)
• 7x coverage of mouse (3Gb)
• [let's not even talk about fugu, zebrafish, mosquito, rat etc.]
• 16+ analysis types, so the pipeline must:
  – Submit jobs automatically
  – Handle dependencies between them
  – Track their progress
  – Retry failed jobs
  – Have access to large file-based databases
  – Store output
  – Make it easy to include new analysis types

e! - System overview
• Based on a MySQL relational database and Perl
• Submission/tracking independent of the data
• Reusable analysis components
  – Standalone objects
  – Database-aware objects
• Simple interfaces for rapid development
• Open Source, Open Standards…

The Compute
• >1,142 hosts (1,200 CPUs) in one LSF cluster
• 360 Alpha DS10Ls: 1GB RAM, 60GB disk, 467MHz
• 768 Intel RLX blades: 1GB RAM, 80GB disk, 800MHz
• 6x ES45 and 8x ES40: 667/1000MHz, 8-16GB RAM
• 10+ TB of Fibre Channel storage

Typical CPU usage (last week)
[Chart: cluster CPU utilisation over the last week]
• 768 nodes for >1 day ~ 2 years of CPU
• Sustaining that I/O and CPU load is totally non-trivial

[Diagram: Sanger site compute and network architecture – PFAM and Sanger Compute GS320s (32-way, 128GB memory); Humgen 8x ES45; high-throughput farm (768 RLX nodes + 360 DS10 Alphas); Ensembl cluster (8x ES40, 6x ES40); Oracle cluster (6x DS20, 2x ES40); contingency cluster with backup engines and storage; large-scale assembly, sequencing and trace data clusters (19x ES40, 4x DS20; 8x ES40 + 2x DS20); Informatics Development (5x ES40); Pathogen (15x ES40); Cancer Project and X-linked disease clusters (4x ES40, 4Tb disk); SAN-attached tape silos/libraries and backup mirrors; firewall/DMZ separating the Internet from the internal network (mail hub, local ftp, secure login, Aceserver, dial-in hubs); extranet web cluster (2x ES40, 0.5Tb disk) and Ensembl web/Blast services (12x ES40 + 6TB storage)]

Whitehead Collaboration (the problem)
• blastn all-by-all comparison:
  – WI and TIGR human BAC ends (800k entries) against human entries in GenBank (5.7GB)
• Existing pipeline from WI
• Java JDBC / Oracle pipeline based on XML
• Tight 2-week time frame (as always!)
• (a minimal submission sketch follows this slide)
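To make the submit-and-track pattern concrete, here is a minimal sketch in the pipeline's own Perl/MySQL/LSF idiom: each chunk of a large blastn run is recorded in a tracking table and handed to LSF with bsub. The tracking table layout, database host, credentials, paths, queue name and chunk naming are all illustrative assumptions, not the real Ensembl pipeline schema.

    #!/usr/bin/env perl
    # Sketch of the submit-and-track pattern: one LSF job per query
    # chunk, each recorded in a MySQL tracking table so failed jobs
    # can be found and retried later. All names here are assumed.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=pipeline;host=pipedb',
                           'pipeuser', 'secret', { RaiseError => 1 });

    # the 800k BAC ends, pre-split into fasta chunks of ~1,000 entries
    my @chunks = glob('/data/bac_ends/chunk_*.fa');

    my $track = $dbh->prepare(
        'INSERT INTO job (input_file, status) VALUES (?, "SUBMITTED")');

    for my $chunk (@chunks) {
        # legacy NCBI blastall syntax: -p program, -d database, -i query
        my $cmd = "blastall -p blastn -d /data/genbank_human -i $chunk "
                . "-o $chunk.out";
        # -q picks the queue, -o captures LSF's per-job log
        if (system('bsub', '-q', 'normal', '-o', "$chunk.lsf", $cmd) == 0) {
            $track->execute($chunk);
        } else {
            warn "bsub failed for $chunk\n";
        }
    }

    $dbh->disconnect;

A retry pass would then simply select the rows whose status never reached DONE and resubmit them; that is the point of keeping tracking in the database rather than in the batch scripts themselves.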
Whitehead Collaboration (the solution?)
• ssh / scp / ftp access to Sanger and WI systems…
• 2 weeks to set up and run:
  – Oracle instance
  – Set up user account, familiarisation with the system
  – Oracle dumps; copy DDL and input results
  – Total data size: 21GB
  – I/O system failures (and recovery)
  – A great many telephone / e-mail discussions ☺
• Only took 2 days of total compute on just 360 nodes…

Computational Farms (and their equivalent of foot and mouth)
• NFS/CIFS/AFS (network share) meltdown
  – Creation of batch scripts (100,000s of jobs – some take <1 min)
  – Reading NFS-mounted binaries
  – Reading NFS-mounted data files
  – Writing output to NFS-mounted directories
• MySQL / Oracle meltdown
  – Too many simultaneous connections
  – Queries blocking each other
• LSF mbatchd meltdown (DRM failure in general)
  – Broken code in general – both developer and sysadmin error
• Even when you are supposed to… "Know what you are doing…"

External CPU and Data Collaborations (How would an 'ideal world GRID' help?)
• Rapid data distribution to and from SI and external sites?
• Zero to little setup time?
• 'Direct' connections to remote Oracle/MySQL instances at Sanger (i.e. via replication)?
• No need for local account [shell] access?
• Single 'system image' – e.g. no need to find out where java/perl/binaries live, how the queues work, etc.?

MySQL – remote access
• DS20 Alpha with 250GB disk in the DMZ, serving Ensembl data
• From Cisco firewall logs, 1st Oct 2001 to 1st Oct 2002:
  – 159,251 port 3306 TCP connections
  – corresponding to 1,016 unique hosts
  – 348 hosts with more than 10 connections
• (a minimal client sketch follows below)
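For scale, talking to such a DMZ server needs nothing beyond a stock MySQL client. Below is a minimal Perl DBI sketch of a remote client; the hostname, credentials, database and table names are placeholders for illustration, not the actual machine described in the slide.

    #!/usr/bin/env perl
    # Sketch of a remote client hitting an Ensembl-style MySQL server
    # on port 3306. Host, user, database and table names are assumed.
    use strict;
    use warnings;
    use DBI;

    my $dsn = 'DBI:mysql:database=homo_sapiens_core;'
            . 'host=ensembldb.example.org;port=3306';
    my $dbh = DBI->connect($dsn, 'anonymous', '', { RaiseError => 1 });

    my ($n) = $dbh->selectrow_array('SELECT COUNT(*) FROM gene');
    print "genes in core database: $n\n";

    $dbh->disconnect;

Every connection of this kind shows up as one port 3306 TCP entry in the firewall logs, which is what makes the usage figures above so easy to count.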