BGI Bioinformatics

An Overview of the Bioinformatics Platform Based on HPC in BGI

Jiandong Sun
Genomics & Bioinformatics Institute, CAS
Beijing Genomics Institute


Outline

- A brief introduction of BGI
- Bioinformatics on Genomics
  - Sequencing Data Management
  - Assembly and Annotation
  - Comparative Genomics
  - EST/cDNA pipeline
- Bioinformatics on Proteomics
  - Data Collection and Management Module
  - LC/MS/MS Data Analysis Module
  - Protein 3D Structure Prediction Module
  - Proteomics Database Module
- Acknowledgements


A brief introduction of BGI

- Genomics & Bioinformatics Institute, CAS / Beijing Genomics Institute
- Founded: 9 September 1999, 9:09 am
- Locations: Beijing, Hangzhou
- 4 directors
- 8 board members
- More than 400 people


4 Major Departments: Genome Sequencing

80 MegaBACE sequencers

Every day:
- 60,000 reactions
- 300,000,000 bp sequenced
- 15 GB of raw data files


4 Major Departments: Proteomics

2 sets of 2D-PAGE
4 mass spectrometers

Every day:
- 35 2D-PAGE images
- 6500 proteins identified by LC/MS/MS
- 3500 proteins identified by MALDI-TOF

Projects:
- Rice, ThermoBacteria
- Cancer, Snake Venom


4 Major Departments: Drug Screening

- Just beginning
- Setting up a cell line environment for screening
- Focus on traditional Chinese medicine
- 600 compounds have been extracted from traditional Chinese medicines


4 Major Departments: Bioinformatics (I)

- 63 people
- Average age: 24

[Pie chart: staff background (computer, biology, other); shares shown are 46%, 37%, and 17%]


4 Major Departments: Bioinformatics (II)

High performance computers:
- Dawning 3000 (176 CPUs, 96 GB, 2 TB)
- Sun Microsystems E10k (64 CPUs, 64 GB, 10 TB)
- IBM P690 (32 CPUs, 256 GB, 5 TB)


4 Major Departments: Bioinformatics (III)

- Genomics area: assembly, gene annotation, comparative genomics, SNP, software development, algorithm research, repeat analysis, ...
- Proteomics area: 2D-PAGE image analysis, MS data analysis, relating proteomics to genomics, protein 3D structure, metabolic pathways, ...
- Biology databases: genomics database, proteomics database, systems biology, ...


Major Projects

- Human Genome (1%)
- Rice Genome
- Pig Genome
- Thermoanaerobacter tengcongensis Genome
- Spirulina Genome
- ...


Publications

- Three cover papers
- 5 papers with a total SCI impact factor of 72.5 in 2.5 years


Affiliations

- National High Performance Computing Center
- SUN Center of Excellence for Bioinformatics
- Genomics & Bioinformatics Institute (Beijing, Hangzhou)
- BGI-DAWNING Joint Bioinformatics Institute
- Sister Center of Whitehead Genome Center / MIT


The Bioinformatics Platform on Genomics

- LIMS: sequencing data collection and management
  - collection, quality reports, backup, basecalling, vector masking, ...
- Assembly: whole-genome shotgun sequencing data assembly
  - ThermoBacteria, HGP, Rice Genome, Pig Genome
  - A new strategy and new software (Genome Research)
- Annotation: gene finding, regulatory factors, alternative splicing, repeats, ...
  - GC3 codon bias in plants (Genome Research)
- Comparative genomics: rice
- EST/cDNA pipeline: clustering, assembly if needed, alignment, ...


The Integrated Bioinformatics Platform on Proteomics

Structure: Data Collection and Management Module

- Detects data generation and collects the data automatically
- Bar code control to avoid misnamed samples
- Different kinds of data go into different raw databases
- Automatic backup at a set time
- Logs and software status are published on the intranet in real time
- Daily reports and summaries are emailed to the manager automatically

[Diagram labels: Sample Preparation; 2D-PAGE; MALDI-TOF; LC MS/MS; Data; Automatic Data Collection System; 2D-PAGE DataBase; MALDI-TOF DataBase; LC MS/MS DataBase; Super Computer (Sun10000, Dawn3000)]


Structure: LC/MS/MS Data Analysis Module

- Database search engine
  - Comprehensive data sources: public databases, EST assemblies, self-sequenced genomes, self-developed gene data, ...
  - New algorithm that clusters spectra by peak pattern to pick sample spectra
  - Accurate pattern matching
- De novo sequencing
  - Dynamic programming
  - Ion recognition algorithm
- Web-based interface
- Task profile and schedule design

[Flowchart labels: De novo method (algorithm for de novo sequencing; find new gene); Database-related method; LC MS/MS DataBase; EST assembly data (by ourselves); annotation results (public tools); protein homology; Nr; other protein sequences; other genomics data; MS DataBase construction system; predicted protein sequences; candidate protein sequences; MS DataBase; MS DataBase search engine; sequence matching tools; sequence matching (match / not match); unknown protein?; identified protein sequence; validate database search result]


Structure: Protein 3D Structure Prediction Module

- Based on PROSPECT
  - PROtein Structure Prediction and Evaluation Computer Toolkit
  - One of the top 6 performers in the CASP4 contest
  - Parallelized on Dawning 3000
  - 80 sequences of fewer than 500 amino acids each can be processed at the same time
- Task queue management
- Automatic notification when a prediction finishes
- Results displayed in a browser


Structure: Proteomics Database Module

- Fast Internet connection to public databases
- Self-generated data
  - 2D-PAGE image database
  - LC/MS/MS database, including analysis results
  - Peptide and theoretical CID spectra pattern database
  - Protein 3D structure database
  - Integrated genomics database
  - Metabolic pathway database


A “Private” Grid System

- Computing service: HPCs
- Data management: integrated database
- Applications: self-developed on HPCs
- Network: fast connection between Beijing and Hangzhou

Basic foundation of e-Science


Acknowledgements
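

Supplementary sketch: de novo sequencing by dynamic programming

The LC/MS/MS data analysis module lists de novo sequencing with dynamic programming, but the slides do not describe the algorithm itself. The toy example below is only an illustration of the general idea, not BGI's method. It assumes an idealized spectrum that has already been reduced to neutral prefix masses (including 0 and the full peptide mass); the function names `prefix_masses` and `de_novo`, the tolerance value, and the scoring rule (longest residue chain) are all hypothetical choices made for this sketch.

```python
# Monoisotopic residue masses in daltons. Isoleucine (I) is omitted on
# purpose: it has the same mass as leucine (L), so L stands for both.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def prefix_masses(peptide):
    """Return the idealized prefix masses 0, m1, m1+m2, ... of a peptide."""
    masses = [0.0]
    for aa in peptide:
        masses.append(masses[-1] + RESIDUE_MASS[aa])
    return masses


def de_novo(peaks, tol=0.01):
    """Read a peptide from prefix-mass peaks with dynamic programming.

    best[j] holds (chain length, previous peak index, residue) for the
    longest chain of single-residue steps from the smallest peak to peak j.
    Returns None when no chain reaches the largest peak.
    """
    peaks = sorted(peaks)
    best = [None] * len(peaks)
    best[0] = (0, None, "")
    for j in range(1, len(peaks)):
        for i in range(j):
            if best[i] is None:
                continue  # peak i is not reachable from the start
            delta = peaks[j] - peaks[i]
            for aa, mass in RESIDUE_MASS.items():
                if abs(delta - mass) <= tol:
                    candidate = (best[i][0] + 1, i, aa)
                    if best[j] is None or candidate[0] > best[j][0]:
                        best[j] = candidate
                    break  # one residue per mass difference is enough here
    if best[-1] is None:
        return None
    # Walk the back pointers from the last peak to recover the sequence.
    residues = []
    j = len(peaks) - 1
    while best[j][1] is not None:
        residues.append(best[j][2])
        j = best[j][1]
    return "".join(reversed(residues))


if __name__ == "__main__":
    spectrum = prefix_masses("PEPTK") + [150.0, 300.5]  # two noise peaks
    print(de_novo(spectrum))
```

Real spectra contain m/z values for several ion types, charge states, and missing peaks, which is presumably where the ion recognition step listed on the same slide comes in; this sketch skips all of that and returns None as soon as one prefix peak is missing.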