Finding Patterns in Protein Sequence and Structure

advertisement

Bioinformatics as an integrative science

Jaap Heringa

Faculty of Sciences

Faculty of Earth and Life Sciences

Integrative Bioinformatics Institute VU (IBIVU) heringa@cs.vu.nl, www.cs.vu.nl/~ibivu, Tel. +31-20-4447649

Gathering knowledge

• Anatomy, architecture

Rembrandt,

1632

• Dynamics, mechanics

Newton,

1726

• Informatics

(Cybernetics – Wiener, 1948)

(Cybernetics has been defined as the science of control in machines and animals, and hence it applies to technological, animal and environmental systems)

• Genomics, bioinformatics

Bioinformatics

Chemistry

Mathematics

Statistics

Biology

Molecular biology

Bioinformatics

Computer

Science

Informatics

Medicine

Physics

Bioinformatics

“Studying informational processes in biological systems”

(Hogeweg Utrecht; early 1970s)

“Information technology applied to the management and analysis of biological data”

(Attwood and Parry-Smith)

Applying algorithms and mathematical formalisms in biology (genomics) USA started but now everywhere

Taking care of the computational infrastructure and data management everywhere

Is a supporting science everywhere

The Human Genome -- 26 June 2000

Genomics

genome transcriptome proteome metabolome physiome

Dinner discussion: Integrative Bioinformatics & Genomics VU

A gene codes for a protein

DNA transcription

CCTGAGCCAACTATTGATGAA

4-letter alphabet

CCU GAG CCA ACU AUU GAU GAA mRNA translation

Protein P E P T I D E

20-letter alphabet

Humans have spliced genes…

DNA makes RNA makes Protein

Remarks

•Proteins can use different combinations of exons => alternative splicing

•The human factor VIII gene (whose mutations cause hemophilia A) is spread over ~186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its

25 introns range in size from 207 to 32,400 bp. The complete gene is thus ~9 kb of exon and ~177 kb of intron.

•The biggest human gene yet is for dystrophin . It has

> 30 exons and is spread over 2.4 million bp.

•Single Nucleotide Polymorphism (SNP) data important for health

Microarray with about

20K genes…

Proteomics

• X-ray crystallography

• NMR

• Mass spectrometry data

Structural genomics : solving and categorising all existing protein folds (3D structures)

• Protein-protein interactions

• Protein-ligand interactions (drug design)

Metabolic networks

Glycolysis and

Gluconeogenesis

Kegg database

(Japan)

Physiome

• Metabolomics + all other little things in the cell

• Ions, protons, etc.

Algorithms in bioinformatics

• string algorithms

• dynamic programming

• machine learning (NN, k-NN, SVM, GA, ..)

• Markov chain models

• hidden Markov models

• Markov Chain Monte Carlo (MCMC) algorithms

• stochastic context free grammars

• EM algorithms

• Gibbs sampling

• clustering

• tree algorithms

• text analysis

• hybrid/combinatorial techniques and more…

Free University initiatives

Integrative Bioinformatics Institute VU (IBIVU)

•Centre for Research on BioComplex

Systems (CRBCS) – Systems Biology

•Centre for Neurobiology and

Cognitive Research (CNCR)

•VU Medical Centre (Microarray,

CGH data)

IBIVU supporting Dutch initiatives

•BioRange: Pan-Dutch bioinformatics proposal

(65M Euro)

•Centre for Medical Systems Biology (Leiden,

A’dam, R’dam)

•Ecogenomics (A’dam, Wageningen, Nat. Inst.

For Health and Environment (RIVM))

•BioASP: streamline/stimulate bioinformatics teaching across The Netherlands

Dutch Centres of Excellence

• Cancer Genomics Consortium [DCGP]

Center for Biosystem Genomics [CBSG], focuses on plant genomics (potato, tomato)

Kluyver Centre for Genomics of Industrial

Fermentation [Kluyver]

Center for Medical Systems Biology [CMSB], focuses on multifactorial disease

Netherlands Proteomics Centre for proteomics as an emerging horizontal genomics discipline

Dutch academic/industrial initiatives

Nutrigenomics exploration into the prevention and care of nutrional inroads in vascular disease, diabetes, hypertension and obesity

• Interaction between the immune system and food; a functional genomics approach to celiac disease

• Mechanisms of life-threatening virus disease and new leads for treatment and vaccines

• Genomics of host – respiratory virus interactions : towards novel intervention strategies;

• Ecogenomics: Functioning of ecosystems targeted at sustainable environmentally friendly and healthy products

(ecology, toxicology and sustainable innovation)

Technology development

Research questions

Ecogenomics

Assessing the Living Soil

In vitro In situ

Biol. response array

Metagenome array

Bio-informatics

-

Technology platform

Ecotoxicology

Life-support functions soil

PROJECT

Coordinator

EPIDEMIOLOGY

Dorret

Boomsma(VU/mc)*

Cornelia van Duyn(EMC)

SYSTEMS BIOLOGY

Jan vd Greef (TNO/UL)*

Cor Verweij (VU/mc)

TECHNOLOGY

Huub de Groot (UL)

MODEL SYSTEMS

Rune Frants (LUMC)

Medical Systems Biology

Task force

Populations

Genotyping

Arraying

Proteomics

Metabolomics

Molecular interactions

In vivo imaging

Mouse / Rat

Zebrafish

Drosophila

Yeast

Cells, vaccines

EMC : van Duyn, Hofman, Oostra

VU/mc : Boomsma, Boers, Dijkmans, Heine, Hoogendijk, van der

Knaap, Meier, Pena, Pinedo

LUMC : Slagboom, Bertina, Breedveld, Breuning, Cornelisse,

Devilee, vDissel, Ferrari, Huizinga, Roosendaal, Roos, van der

Velde, Westendorp, Zitman

LUMC : Slagboom, Sandkuijl, den Dunnen

EMC : Oostra, Heutink

VU/mc : Boomsma, (Heutink)

LUMC: den Dunnen, Boer, Fodde

VU/mc: Verweij, Ylstra, Brakenhoff

EMC: Oostra

LUMC: Koning, Deelder, den Dunnen, van der Maarel

UL: Overkleeft, Abrahams,

VU/mc: Smit, Li, van Kooyk

UL: Verduijn Lunel, van de Geer, Verheij

TNO: van der Greef, Havekes, te Koppele

VU/mc: Jakobs

UL: Abrahams, Brouwer, IJzerman, van Boom

LUMC: Tanke, Raap, Deelder, den Dunnen

VU/mc: Leurs, Irth

UL: de Groot, Kok

LUMC: Reiber, van Buchem, de Roos, Poelmann, Lowick

VU/mc: Witter, Bal, Lammertsma

EMC: van Duyn, van Swieten

EMC: Oostra

LUMC : Verbeek, Fodde, deKloet, Verrijzer, Noordermeer,

Mullenders

TNO: Havekes;

UL: Spaink, Brouwer, Schmidt

CLINICAL

APPLICATIONS

Cornelis Melief (LUMC)

Viral

Methodologies,

Pharmaceuticals

VU/mc: van Kooiyk; Meijer,Pinedo

LUMC: Spaan, Wiertz, Hoeben

VU/mc: Gerritsen, Curiel

UL: IJzerman, Mulder, van Boom

LUMC: Huizinga, Breedveld, Breuning, van Deutekom, Ferrari,

Fodde, Frants, Jukema, de Kloet, Ottenhoff, van der Velde, Zitman

VU/mc: Maassen, Dijkmans, Leurs, Meijer, Pinedo

EMC : Stricker

CENTRAL PROJECT

Coordinator / Elements

DATA INTEGRATION,

ANALYSIS AND

LOGISTICS

NN

Central Information

Management

TFBI / BIG-VU / EBB

- Rosetta Resolver®

- LIMS integration/

/interfacing

Biostatistics van Houwelingen, Eijlers,

Boer, Sandkuijl (LUMC); van der Vaart, de Gunst, Boers

(VU/mc); Houwing-

Duiistermaat (EMC), van de

Geer (UL)

Bioinformatics

Boer, Svensson, Gorbalenya

(LUMC), Heringa, vBeek

(VU/mc), Stijnen, van der Lei,

Mons (EMC), Kok (UL)

BioASP Interface ism: Vriend/Tellegen

GRID – Virtual Laboratory

NWO- BMI FLEXwork van Ommen, Boer,

Svensson ism: - Stiekema (Wag)

- Herzberger (Ams)

- Vriend (Nijm)

Integrative bioinformatics

Data integration

• Integrate data sources

• Integrate methods

• Integrate data through method integration (biological model)

Bioinformatics tool

Data

Algorithm

Biological

Interpretation

(model)

tool

Bioinformatics

“Nothing in Biology makes sense except in the light of evolution” (Theodosius

Dobzhansky (1900-1975))

“Nothing in Bioinformatics makes sense except in the light of Biology”

Pair-wise sequence alignment

(more than just string matching)

Global dynamic programming

MDAGSTVILCFVG

S

T

I

L

M

D

A

A

C

G

S

Search matrix

Evolution

Amino Acid Exchange

Matrix

MDAGSTVILCFVG-

MDAAST-ILC--GS

Integrative bioinformatics

Data integration

Data

Algorithm

Biological

Interpretation

(model)

tool

Integrative bioinformatics

Data integration

Data 1 Data 2 Data 3

Integrative bioinformatics

Data integration

Data 1 Data 2 Data 3

Algorithm 1

Biological

Interpretation

(model) 1

Algorithm 2

Biological

Interpretation

(model) 2

Algorithm 3

Biological

Interpretation

(model) 3

tool

Integrative bioinformatics

Data integration

“The solution includes an infrastructure or data pipeline involving:

•a general portal

•virtual lab technology (virtual LIMS)

•‘ petabase

’ data handling facilities

•methods, software and ‘tools’ to integrate data and extract knowledge from data in the user domain.

This infrastructure calls for

•a central facilitation unit providing large storage and computing facilities to run central software packages with user interfaces”

•Could Gridlab do this?

Integrating Primary and Predicted

Secondary Structure data for

Multiple Alignment

Using secondary structure in multiple alignment

“Structure more conserved than sequence”

•10 years SS prediction method development: Q3 += 3%

•10 years MA method development: difference in Q3 can be >30%

Protein structure hierarchical levels

PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)

VHLTPEEKSAVTALWGKVNVDE

VGGEALGRLLVVYPWTQRFFE

SFGDLSTPDAVMGNPKVKAHG

KKVLGAFSDGLAHLDNLKGTFA

TLSELHCDKLHVDPENFRLLGN

VLVCVLAHHFGKEFTPPVQAAY

QKVVAGVANALAHKYH

QUATERNARY STRUCTURE (oligomers)

TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels

PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)

VHLTPEEKSAVTALWGKVNVDE

VGGEALGRLLVVYPWTQRFFE

SFGDLSTPDAVMGNPKVKAHG

KKVLGAFSDGLAHLDNLKGTFA

TLSELHCDKLHVDPENFRLLGN

VLVCVLAHHFGKEFTPPVQAAY

QKVVAGVANALAHKYH

QUATERNARY STRUCTURE (oligomers)

TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels

PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)

VHLTPEEKSAVTALWGKVNVDE

VGGEALGRLLVVYPWTQRFFE

SFGDLSTPDAVMGNPKVKAHG

KKVLGAFSDGLAHLDNLKGTFA

TLSELHCDKLHVDPENFRLLGN

VLVCVLAHHFGKEFTPPVQAAY

QKVVAGVANALAHKYH

QUATERNARY STRUCTURE (oligomers)

TERTIARY STRUCTURE (fold)

Flavodoxin-cheY: Praline alignment (prepro=1500)

1fx1 -P KALIV YGSTTG NT EYTAETIARQLA NAG-Y EVDSR DA ASVEAGGLFEG FD LVLL GCSTWGDDSI------ELQDDF IPLF DSLE ETGAQGR KVACF

FLAV_DESDE MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf

FLAV_DESVH MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf

FLAV_DESSA MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf

FLAV_DESGI MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf

2fcr --K IGIFF STSTG NT TEVADFIGKTL GA---KADAP ID VDDVTDPQALKD YD LLFLGAP TWNTG----ADTERSGTS WDEFLYD KLPEVDMKDL PVAIF

FLAV_AZOVI -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf

FLAV_ENTAG MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf

FLAV_ANASP SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf

FLAV_ECOLI -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf

4fxn -M K -IVYW SGTG NT EKMAELIAKGIIE SG-KDV NTIN VSDVN I DELL N ED ILILGC SAMGDEVL-------EESE FEPFI EEI S-TKISGK KVALF

FLAV_MEGEL MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf

FLAV_CLOAB -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf

3chy ADKELK FLVV DDF STMRRIVRNLLKEL GFN--N VEEA ED GVDALNKLQA GGYG FVI --SD WNMPNM----------D GLELL KTIRA DGAMSALP VLM

T

1fx1 GCG DS-SY-EYFC GA VDAIEEKLK NLGAEIVQD---------------------G LRID GD--PRAA RDDIVGWAHDVRG AI--------

FLAV_DESDE ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------

FLAV_DESVH GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------

FLAV_DESSA GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD---------------------SLKIDGD--PE--RDEIVSwGSGIADKI--------

FLAV_DESGI GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS---------------------SLKIDGE--PD--SAEVLDwAREVLARV--------

2fcr GLG DAEGYPDNFCD A IEEIHDCFAK QGAKPVGFSNPDDYDYEESKS-VRDGKFLG LPLD MVNDQIP MEKRVAGWVEAVVSET GV------

FLAV_AZOVI GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--

FLAV_ENTAG GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------

FLAV_ANASP GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------

FLAV_ECOLI GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA

4fxn G -----SY-GWGDG KWMRDFEERMNG YGCVVVET---------------------P LIVQ NE--PDEA EQDCIEFGKKIA NI---------

FLAV_MEGEL G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNA-PECKElGEAAAKA---------

FLAV_CLOAB STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF-----------

3chy VT AEAK K -ENIIAA --------AQ AGAS------------------------GYVV -----KPFT AATLEEKLNKIFEKL GM------

G

Iteration 0 SP= 136944.00 AvSP= 10.675 SId= 4009 AvSId= 0.313

Flavodoxin-cheY NJ tree

Secondary structure-induced alignment iteration

Flavodoxin-cheY multiple alignment/ secondary structure iteration cheY SSEs

3chy-AA SEQUENCE|| AA |ADKELK FLVV DDF STMRRIVRNLLKELG FNN VEEA ED GVDALNKLQA GGYG FVISD WNMP|

3chy-ITERATION-0|| PHD | EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HH HEEE |

3chy-ITERATION-1|| PHD | EEEEEEEE

3chy-ITERATION-2|| PHD | EEEEEEEE

HHHHHHHHHHHHHHH

HHHHHHHHHHHHHH

HHHHHHHH EEEEEE

HHHHHHHHH EEEEEE

|

|

3chy-ITERATION-3|| PHD | EEEEEEEE

3chy-ITERATION-4|| PHD | EEEEEEEE

HHHHHHHHHHHHHH

HHHHHHHHHHHHHH

EEE HHHHHH

HHHHHHH

EEEEE

EEEEE

3chy-ITERATION-5|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE

3chy-ITERATION-6|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE

|

|

|

|

3chy-ITERATION-7|| PHD | EEEEEEEE HHHHHHHHHHHHHH

3chy-ITERATION-8|| PHD | EEEEEEEE HHHHHHHHHHHHHH

3chy-ITERATION-9|| PHD | EEEEEEEE HHHHHHHHHHHHHH

EEE HHHHHH EEEEE

HHHHHHH EEEEEE

HHHHHHHHHH EEEEE

|

|

|

3chy-AA SEQUENCE|| AA |NMD GLELLKTIRA DGAMSALP VLMVT AEAK KENIIAAAQAG AS GYVV KPFT AATLEEKLNKIFEKL GM|

3chy-ITERATION-0|| PHD | HHHHHH EEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH |

3chy-ITERATION-1|| PHD | HHHHHH EEEEEE HHH HHHHHHHHHHHHHHHHHH

3chy-ITERATION-2|| PHD | HHHHHH EEEEEE HHHHHHHHHHHHHHHHHH

EEE HHHHHHHHHHHHHH |

EEE HHHHHHHHHHHHHH |

3chy-ITERATION-3|| PHD | HHHHHHHHHHHH

3chy-ITERATION-4|| PHD | HHHHH

3chy-ITERATION-5|| PHD | HHHHHHHH

3chy-ITERATION-6|| PHD | HHHHHHHH

3chy-ITERATION-7|| PHD | HHHHHHHH

3chy-ITERATION-8|| PHD | HHHHHHHH

3chy-ITERATION-9|| PHD | HHHHHHHH

HHHHHHHHHHHHHHHHHH

EEEEE HHHHHHHHHHHHHHHHH

EEEEE HHHHHHHHHHHHHHHH

EEEEE HHHHHHHHHHHHHHH

EEE

EEE

EEE

EEEE

HHHHHHHHHHHHHH

HHHHHHHHHHHHHH

HHHHHHHHHHHHHH

HHHHHHHHHHHHHH

|

|

|

EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |

EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

|

Integrating secondary structure prediction and multiple alignment

• Low key example

• But difficult

• How to scale up?

• Need new formalisms and technology

Integrating protein multiple alignment, secondary and tertiary structure prediction to predict structural domains in sequence data

SnapDRAGON

Richard A. George

George R.A. and Heringa, J. (2002) J. Mol. Biol ., 316, 839-851.

The DEATH Domain

Present in a variety of Eukaryotic proteins involved with cell death.

• Six helices enclose a tightly packed hydrophobic core.

Some DEATH domains form homotypic and heterotypic dimers.

Structural domain organisation can be nasty…

Pyruvate kinase

Phosphotransferase b barrel regulatory domain a/b barrel catalytic substrate binding domain a/b nucleotide binding domain

1 continuous + 2 discontinuous domains

Protein structure hierarchical levels

PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)

VHLTPEEKSAVTALWGKVNVDE

VGGEALGRLLVVYPWTQRFFE

SFGDLSTPDAVMGNPKVKAHG

KKVLGAFSDGLAHLDNLKGTFA

TLSELHCDKLHVDPENFRLLGN

VLVCVLAHHFGKEFTPPVQAAY

QKVVAGVANALAHKYH

QUATERNARY STRUCTURE

TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels

PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)

VHLTPEEKSAVTALWGKVNVDE

VGGEALGRLLVVYPWTQRFFE

SFGDLSTPDAVMGNPKVKAHG

KKVLGAFSDGLAHLDNLKGTFA

TLSELHCDKLHVDPENFRLLGN

VLVCVLAHHFGKEFTPPVQAAY

QKVVAGVANALAHKYH

QUATERNARY STRUCTURE

TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels

PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)

VHLTPEEKSAVTALWGKVNVDE

VGGEALGRLLVVYPWTQRFFE

SFGDLSTPDAVMGNPKVKAHG

KKVLGAFSDGLAHLDNLKGTFA

TLSELHCDKLHVDPENFRLLGN

VLVCVLAHHFGKEFTPPVQAAY

QKVVAGVANALAHKYH

QUATERNARY STRUCTURE

TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels

PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)

VHLTPEEKSAVTALWGKVNVDE

VGGEALGRLLVVYPWTQRFFE

SFGDLSTPDAVMGNPKVKAHG

KKVLGAFSDGLAHLDNLKGTFA

TLSELHCDKLHVDPENFRLLGN

VLVCVLAHHFGKEFTPPVQAAY

QKVVAGVANALAHKYH

QUATERNARY STRUCTURE

TERTIARY STRUCTURE (fold)

N

C a distance matrix

SnapDRAGON

Target matrix

3

N

N N

100 randomised initial matrices

100 predictions

N

Multiple alignment

Predicted secondary structure

CCHHHCCEEE

Input data

•The C a distance matrix is divided into smaller clusters.

•Seperately, each cluster is embedded into a local centroid.

•The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.

1

2

3

SnapDRAGON

Domains in structures assigned using method by Taylor (1997)

Domain boundary positions of each model against sequence

Summed and Smoothed Boundaries

(Biased window protocol)

SnapDRAGON

•Predicting domain boundaries for single average size protein could take hours on

128-node cluster computer with simplified significance testing.

•How to do scale up to structural genomics? 30,000 human proteins of 1 hr each gives 3.5 years..

What we still cannot do well

• “Give us sequence, we do rest” failed so far; e.g., number of human genes

• Gene prediction bad, RNA genes missed

• Protein structure/function prediction unsolved; we have no clue about function of 50% of human genes

• No theory of gene regulation

• Cannot well predict post-translational modification

• Many (database) solutions not generic

• We have no E=mc 2 so need to keep all data

• Integrating methods and data

• Understand biologically

Future Bioinformatics Research

Topics

• Integration of knowledge

– We have some formalisms (ontologies, distributed databases) but we need to develop many completely new formalisms and new technologies beyond what we have now

Conclusions

• Getting important integrative

Bioinformatics/Systems Biology applications onto the Grid through Gridlab can be significant

• Bioinformatics and genomics are getting clinical. Gridlab could play an important role

The end. Thanks

Future Bioinformatics Research

Topics

Keywords morning session

• Integration of knowledge

– Information transfer from one object to another

– What are the rules

– From genotype to phenoypes, current algorithms and ontologies not sufficient

– Biological interpretation needs context

– DB maintenance is dynamic process, most info is static

– Need resources

– Environment should allow student to make method in 3 hours

• Genomics

– Identifying genetic elements still bad

– Collect easy primary biological facts

– Gene pred, struct pred, functional all unsolved

– Genetic “parts” list is uncomplete and scanty

– Many omics “unknowme”

• Genomics

– Hypothesis driven versus systematic approaches

– Need databases,algorithms, biol knowledge

– Data structures not suitable for complexity

– Solutions such as Ensembl not generic

– Need technologies beyond ontologies

– Need new formalisms to be able to do “vertical genomics”

• Systems Biology

– Very promising area

• Health

• Pharmaceuticals

• Biotechnology

• Environment

• (Medical) Systems Biology

– Diego di Bernardo

– Ilias Jakovidis

– Very promising area

Summary

• How can Europe regain ground

Hans Werner Mewes

• DNA contains all

• Identifying genetic elements still bad

• Collect easy primary biological facts

• Gene pred, struct pred, functional all unsolved

• Genetic “parts” list is uncomplete and scanty

• Many omics “unknowme”

• Hypothesis driven versus systematic approaches

• Need databases,algorithms, biol knowledge

• Data structures not suitable for complexity

• Solutions such as Ensembl not generic

• Need technologies beyond ontologies

• Information transfer from one object to another

• What are the rules

• From genotype to phenoypes, current algorithms not sufficient

• Biological interpretation needs context

• DB maintenance is dynamic process, most info is static

• Need resources

• Environment should allow student to make method in 3 hours

Diego di Bernardo

• TIGEM: disease genes

• Bioinformatics and comp biol not at a par

• 81 of genome “genomics&databases” and 19%

“genomics&algorithms

• Important topics: regulation, network, digital signal processing HMMs

• Problems : algorithms not biological and no experimental verification

• Bioinformatics helps design biological experimnents

• Richard Durbin: value of physics and engineering

• Computational tools for discovery of novel objects

Ilias Jakovidis

• Medical informatics

• Health telematics

• eHealth

• Medical ontologies didn’t help Paul

Schofield at all (tried with NCBI-big mess)

• Middleware includes ontologies so covers biology (IBM!)

• Language engineering

• Natural language in medicine, computerize medical community

• Biomedical informatics: applications in healthcare, how to get to clinical?

• Synergy between medicine and biology informatics

• Alphonso, med will dominate, lot of money with unclear methods

• Medical info has worked coherently, how can we do that? How can we change?

• Mewes: Bioinf has achieved usage, not med. Bioinf is entering cliniques.

Gunnar von Heijne

• Databases should be funded

• Start problem for 5 years: and then what?

• With infrastructure this problem is less, so funding is relatively OK.

• Technology development should not become dominant

• Most biologist are small scale hypothesis driven

• Marketing problem

• From 19 bioinf nethods , 15 are European in genomics

• Validation is not always key (Alfonso)

• EMBOSS project European wide, for algorithm driven research. EMBOSS is longstanding. But could not get funding from EC (no funding category)

Alfonso Valencia

• Often, 1 bioinformatician for everything

• Need of integration/collaboration

– Social, technical barriers

• People should realise that Bioinf and

Bioinformaticians are very different

• Integrated (med) system

– Underfunded (1 postdoc)

– Difficult to develop

– lack of standards and repositories

– Difficult to interact with biologist

– All these things essential

• 3-4 good bioinf groups in Spain

• Make virtual institute for bioinformatics

• There are few large groups with national funding

• There are few large groups with European funding

• There are many small groups with weak institutional funding

• Create framework valid for biology

• Interaction reduces overhead

• System access for biologists, point to the right expert

• Create new science beyond current needs

• This does not compete with basic needs

• Support strong European areas (eg. protein interaction)

• Bioinformatics is a new discipline

• Who solves the problem, who is interested in solving it, and not always who qualifies to solving it (engineers,..)

• Example “information extraction in molecular biology”: after years no real progress made.

• Systems Biology: what to do and how (no linear path), but we have opportunity to develop knowing

• Experimental validation: methods debug databases. Many proteins (90%?) have never seen an experiment

• Should bioinf talk to biology or vice versa?

Jean-Marie Claverie

• 1951 first protein sequence (insulin)

• Field has come of age, so outsiders shouldn’t tell us what to do and how

• Bioinformatics is part of the foundation

• Clear difference in application of informatics or bioinformatics

• Future will be different

• Give us sequence, we do rest failed! Number of human genes is example.

• Gene finding: standard genes good, RNA genes missed, no theory of gene transcription

• We have no E=mc 2 so need to keep data

• Computational biology is same as systems biology

• Good integration: E. Coli Bioinf-project, find all genes in small bacterium. Inclusive project. Now good consortium.

• Bioinformatics becomes invisible for biologists

(Blast).

Howard Bilofsky

• PRISM forum

• Provide challenges for (bio)informatics

• Drives Bioinf,omics,.. techniques

Download