An Overview of the Bioinformatics Platform Based on HPC in BGI

Jiandong Sun
Genomics & Bioinformatics Institute CAS
Beijing Genomics Institute
Outline
- A brief introduction of BGI
- Bioinformatics on Genomics
  - Sequencing Data Management
  - Assembly and Annotation
  - Comparative Genomics
  - EST/cDNA pipeline
- Bioinformatics on Proteomics
  - Data Collection and Management Module
  - LC/MS/MS Data Analysis Module
  - Protein 3D Structure Prediction Module
  - Proteomics Database Module
- Acknowledgements
A brief introduction of BGI
- Genomics & Bioinformatics Institute, CAS
- Beijing Genomics Institute
- Founded: 9 September 1999, 9:09 am
Sites: Beijing, Hangzhou
- 4 directors
- 8 board members
- More than 400 people
4 Major Departments ---- Genome Sequencing
- 80 MegaBACE sequencers
- Every day:
  - 60,000 reactions
  - 300,000,000 bp sequenced
  - 15 GB of raw data files
4 Major Departments ---- Proteomics
- 2 sets of 2D-PAGE
- 4 mass spectrometers
- Every day:
  - 35 2D-PAGE images
  - 6,500 proteins identified by LC/MS/MS
  - 3,500 proteins identified by MALDI-TOF
- Projects:
  - Rice, ThermoBacteria
  - Cancer, snake venom
4 Major Departments ---- Drug Screening
- Just beginning
- Setting up a cell-line environment for screening
- Focus on traditional Chinese medicine
- 600 compounds have been extracted from traditional Chinese medicines
4 Major Departments ---- Bioinformatics ( I )
- 63 people
- Average age: 24
- [Pie chart: staff by background (computer, biology, other), with slices of 46%, 37%, and 17%]
4 Major Departments ---- Bioinformatics ( II )
High-performance computers:
- Dawning 3000 (176 CPUs, 96 GB, 2 TB)
- Sun Microsystems E10K (64 CPUs, 64 GB, 10 TB)
- IBM P690 (32 CPUs, 256 GB, 5 TB)
4 Major Departments ---- Bioinformatics ( III )
- Genomics area: assembly, gene annotation, comparative genomics, SNP analysis, software development, algorithm research, repeat analysis, ...
- Proteomics area: 2D-PAGE image analysis, MS data analysis, relating proteomics data to genomics data, protein 3D structure, metabolic pathways, ...
- Biology databases: genomics database, proteomics database, systems biology, ...
Major Projects:
- Human Genome (1%)
- Rice Genome
- Pig Genome
- Thermoanaerobacter tengcongensis Genome
- Spirulina Genome
- ...
- Three cover papers
- 5 papers with a total SCI impact factor of 72.5 in 2.5 years
- National High Performance Computing Center
- SUN Center of Excellence for Bioinformatics
  Genomics & Bioinformatics Institute (Beijing, Hangzhou)
- BGI-DAWNING Joint Bioinformatics Institute
- Sister center of the Whitehead Genome Center / MIT
The Bioinformatics Platform on Genomics
- LIMS: sequencing data collection and management
  - Collection, quality reports, backup, base calling, vector masking, ...
- Assembly: whole-genome shotgun sequencing data assembly
  - ThermoBacteria, HGP, Rice Genome, Pig Genome
  - A new strategy and new software (Genome Research)
- Annotation: gene finding, regulatory factors, alternative splicing, repeats, ...
  - GC3 codon bias in plants (Genome Research)
- Comparative genomics: rice
- EST/cDNA pipeline: clustering, assembly if needed, alignment, ...
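
The GC3 codon bias listed under Annotation refers to GC content at the third codon position. As a minimal illustrative sketch (the function name `gc3_content` and the example sequence are hypothetical and not taken from the BGI pipeline), it could be computed for an in-frame coding sequence like this:

```python
def gc3_content(cds: str) -> float:
    """Return the fraction of codons whose third base is G or C.

    Assumes `cds` is an in-frame coding sequence; trailing bases that do
    not form a complete codon are ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3      # drop an incomplete last codon
    third_bases = seq[2:usable:3]         # bases at codon position 3
    if not third_bases:
        raise ValueError("sequence contains no complete codon")
    gc = sum(base in "GC" for base in third_bases)
    return gc / len(third_bases)


if __name__ == "__main__":
    # Hypothetical example: codons ATG GCC AAG TAA have third bases G, C, G, A.
    print(gc3_content("ATGGCCAAGTAA"))
```

A genome-scale analysis would apply the same calculation per gene and compare the resulting distributions between species.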
The Integrated Bioinformatics Platform on Proteomics
Structure ---- Data Collection and Management Module
- Detects newly generated data and collects it automatically
- Bar-code control to avoid misnamed samples
- Different data types are stored in different raw databases
- Automatic backup at scheduled times
- Logs and software status are published on the intranet in real time
- Daily reports and summaries are emailed to the manager automatically
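
The collection steps above can be sketched roughly as follows. This is an assumption-laden illustration, not the BGI system: the file-name convention `<BARCODE>_<instrument>.<ext>`, the bar-code pattern, and the store directory names are all hypothetical.

```python
import re
import shutil
from pathlib import Path

# Hypothetical conventions, for illustration only: file names look like
# "<BARCODE>_<instrument>.raw", and each instrument has its own raw store.
BARCODE_RE = re.compile(r"^[A-Z]{2}\d{6}$")
RAW_STORES = {
    "2dpage": "raw_2dpage",
    "malditof": "raw_malditof",
    "lcmsms": "raw_lcmsms",
}


def route_file(path: Path, store_root: Path) -> Path:
    """Copy one instrument file into the raw store for its instrument type."""
    barcode, _, instrument = path.stem.partition("_")
    if not BARCODE_RE.match(barcode):
        raise ValueError(f"bad bar code in file name: {path.name}")
    if instrument not in RAW_STORES:
        raise ValueError(f"unknown instrument in file name: {path.name}")
    target_dir = store_root / RAW_STORES[instrument]
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(path, target_dir / path.name))


def collect(incoming: Path, store_root: Path) -> tuple[list[Path], list[str]]:
    """Route every file in `incoming`; return (stored paths, rejection messages)."""
    stored, rejected = [], []
    for path in sorted(p for p in incoming.iterdir() if p.is_file()):
        try:
            stored.append(route_file(path, store_root))
        except ValueError as err:
            rejected.append(str(err))
    return stored, rejected
```

The rejection messages returned by `collect` are the kind of information that could feed an intranet log or a daily report; scheduling and email delivery are left out of this sketch.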
[Diagram: data flow on the proteomics platform. Labels: Sample Preparation; 2D-PAGE, MALDI-TOF, LC MS/MS; Data; Automatic Data Collection System; 2D-PAGE database, MALDI-TOF database, LC MS/MS database; Super Computer (Sun10000, Dawn3000)]
Structure ---- LC/MS/MS data analysis Module
- Database search engine
  - Comprehensive data sources: public databases, EST assemblies, self-sequenced genomes, self-developed gene sets, ...
  - New algorithm that clusters spectra by peak pattern to pick sample spectra
  - Accurate pattern matching
- De novo sequencing
  - Dynamic programming
  - Ion recognition algorithm
- Web-based interface
- Task profile and schedule design
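
The slides do not describe how spectra are clustered by peak pattern. As a generic stand-in (not the BGI algorithm), the sketch below bins peaks by m/z, compares spectra by cosine similarity, and greedily groups similar spectra so that one sample spectrum per cluster can be searched. The bin width, the threshold, and all function names are assumptions.

```python
import math
from collections import defaultdict

Spectrum = list[tuple[float, float]]  # (m/z, intensity) pairs


def bin_peaks(spectrum: Spectrum, bin_width: float = 1.0) -> dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins."""
    binned: dict[int, float] = defaultdict(float)
    for mz, intensity in spectrum:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine(a: dict[int, float], b: dict[int, float]) -> float:
    """Cosine similarity between two binned spectra."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_spectra(spectra: list[Spectrum], threshold: float = 0.8) -> list[list[int]]:
    """Greedy clustering: each spectrum joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one."""
    binned = [bin_peaks(s) for s in spectra]
    clusters: list[list[int]] = []
    for index, vector in enumerate(binned):
        for members in clusters:
            if cosine(binned[members[0]], vector) >= threshold:
                members.append(index)
                break
        else:
            clusters.append([index])
    return clusters
```

The first index in each returned cluster can serve as the sample spectrum submitted to the database search, which reduces the number of searches when many spectra come from the same peptide.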
[Flowchart: protein identification from the LC MS/MS database. Labels:
- De novo method: algorithm for de novo sequencing; find new genes
- Database-related method: EST assembly data; annotation data (produced in-house); annotation results (public tools); protein homology; Nr; other protein sequences; other genomics data
- MS database construction system; predicted protein sequences; MS database
- MS database search engine; candidate proteins; sequence match tools; sequence matching
- Outcomes: match (identified protein sequence; validate database search result) or no match (unknown protein?)]
Structure ---- Protein 3D structure Prediction Module
- Based on PROSPECT
  - PROtein Structure Prediction and Evaluation Computer Toolkit
  - One of the top 6 performers in the CASP4 contest
  - Parallelized on the Dawning 3000
  - 80 sequences of fewer than 500 amino acids each can be processed at the same time
- Task queue management
- Automatic notification when a prediction finishes
- Results displayed in a browser
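
The capacity figures above (80 sequences at a time, each under 500 amino acids) suggest a simple batching rule for the task queue. The sketch below is only an illustration of that rule; `make_batches` and its input format are hypothetical, and it does not call PROSPECT itself.

```python
from collections import deque
from typing import Iterator

MAX_LENGTH = 500   # from the slide: sequences shorter than 500 amino acids
BATCH_SIZE = 80    # from the slide: 80 sequences at the same time


def make_batches(sequences: dict[str, str]) -> Iterator[list[str]]:
    """Yield lists of sequence IDs, at most BATCH_SIZE per batch.

    Sequences of MAX_LENGTH residues or more are skipped here; how the
    real system handled longer sequences is not stated in the slides.
    """
    queue = deque(
        seq_id for seq_id, residues in sequences.items() if len(residues) < MAX_LENGTH
    )
    while queue:
        yield [queue.popleft() for _ in range(min(BATCH_SIZE, len(queue)))]
```

Each yielded batch could then be handed to the parallel prediction job, with a notification sent when the batch completes.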
Structure ---- Proteomics Database Module
- Fast Internet connection to public databases
- Self-generated data
  - 2D-PAGE image database
  - LC/MS/MS database, including analysis results
  - Peptide and theoretical CID spectra pattern database
  - Protein 3D structure database
  - Integrated genomics database
  - Metabolic pathway database
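
To illustrate what an entry in a peptide and theoretical CID spectra pattern database might contain, the sketch below computes singly charged b- and y-ion m/z values for an unmodified peptide from standard monoisotopic residue masses. The function name `fragment_ions` is hypothetical, and the slides do not say which ion types or charge states the BGI database stored.

```python
# Monoisotopic residue masses in daltons (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of a proton


def fragment_ions(peptide: str) -> tuple[list[float], list[float]]:
    """Return (b_ions, y_ions) as singly charged m/z values.

    For each backbone cleavage site, b_ions[i] covers the N-terminal
    i + 1 residues and y_ions[i] covers the complementary C-terminal part.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide.upper()]
    b_ions, y_ions = [], []
    for cut in range(1, len(masses)):
        b_ions.append(sum(masses[:cut]) + PROTON)
        y_ions.append(sum(masses[cut:]) + WATER + PROTON)
    return b_ions, y_ions
```

For the tripeptide `GAS`, for example, the function returns two b ions and two y ions, one pair per cleavage site.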
A “Private” Grid System
- Computing service: HPCs
- Data management: integrated database
- Applications: self-developed on HPCs
- Network: fast connection between Beijing and Hangzhou
Basic foundation of e-Science
Acknowledgements