Jaap Heringa
Faculty of Sciences
Faculty of Earth and Life Sciences
Integrative Bioinformatics Institute VU (IBIVU) heringa@cs.vu.nl, www.cs.vu.nl/~ibivu, Tel. +31-20-4447649
• Anatomy, architecture
Rembrandt,
1632
• Dynamics, mechanics
Newton,
1726
• Informatics
(Cybernetics – Wiener, 1948)
(Cybernetics has been defined as the science of control in machines and animals, and hence it applies to technological, animal and environmental systems)
• Genomics, bioinformatics
Chemistry
Mathematics
Statistics
Biology
Molecular biology
Bioinformatics
Computer
Science
Informatics
Medicine
Physics
“Studying informational processes in biological systems”
(Hogeweg Utrecht; early 1970s)
“Information technology applied to the management and analysis of biological data”
(Attwood and Parry-Smith)
Applying algorithms and mathematical formalisms in biology (genomics) USA started but now everywhere
Taking care of the computational infrastructure and data management everywhere
Is a supporting science everywhere
The Human Genome -- 26 June 2000
genome transcriptome proteome metabolome physiome
Dinner discussion: Integrative Bioinformatics & Genomics VU
A gene codes for a protein
DNA transcription
CCTGAGCCAACTATTGATGAA
4-letter alphabet
CCU GAG CCA ACU AUU GAU GAA mRNA translation
Protein P E P T I D E
20-letter alphabet
Humans have spliced genes…
•Proteins can use different combinations of exons => alternative splicing
•The human factor VIII gene (whose mutations cause hemophilia A) is spread over ~186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its
25 introns range in size from 207 to 32,400 bp. The complete gene is thus ~9 kb of exon and ~177 kb of intron.
•The biggest human gene yet is for dystrophin . It has
> 30 exons and is spread over 2.4 million bp.
•Single Nucleotide Polymorphism (SNP) data important for health
Microarray with about
20K genes…
• X-ray crystallography
• NMR
• Mass spectrometry data
•
Structural genomics : solving and categorising all existing protein folds (3D structures)
• Protein-protein interactions
• Protein-ligand interactions (drug design)
Kegg database
(Japan)
• Metabolomics + all other little things in the cell
• Ions, protons, etc.
• string algorithms
• dynamic programming
• machine learning (NN, k-NN, SVM, GA, ..)
• Markov chain models
• hidden Markov models
• Markov Chain Monte Carlo (MCMC) algorithms
• stochastic context free grammars
• EM algorithms
• Gibbs sampling
• clustering
• tree algorithms
• text analysis
• hybrid/combinatorial techniques and more…
Integrative Bioinformatics Institute VU (IBIVU)
•Centre for Research on BioComplex
Systems (CRBCS) – Systems Biology
•Centre for Neurobiology and
Cognitive Research (CNCR)
•VU Medical Centre (Microarray,
CGH data)
•BioRange: Pan-Dutch bioinformatics proposal
(65M Euro)
•Centre for Medical Systems Biology (Leiden,
A’dam, R’dam)
•Ecogenomics (A’dam, Wageningen, Nat. Inst.
For Health and Environment (RIVM))
•BioASP: streamline/stimulate bioinformatics teaching across The Netherlands
• Cancer Genomics Consortium [DCGP]
•
Center for Biosystem Genomics [CBSG], focuses on plant genomics (potato, tomato)
•
Kluyver Centre for Genomics of Industrial
Fermentation [Kluyver]
•
Center for Medical Systems Biology [CMSB], focuses on multifactorial disease
•
Netherlands Proteomics Centre for proteomics as an emerging horizontal genomics discipline
•
Nutrigenomics exploration into the prevention and care of nutrional inroads in vascular disease, diabetes, hypertension and obesity
• Interaction between the immune system and food; a functional genomics approach to celiac disease
• Mechanisms of life-threatening virus disease and new leads for treatment and vaccines
• Genomics of host – respiratory virus interactions : towards novel intervention strategies;
• Ecogenomics: Functioning of ecosystems targeted at sustainable environmentally friendly and healthy products
(ecology, toxicology and sustainable innovation)
Technology development
Research questions
Ecogenomics
Assessing the Living Soil
In vitro In situ
Biol. response array
Metagenome array
Bio-informatics
-
Technology platform
Ecotoxicology
Life-support functions soil
PROJECT
Coordinator
EPIDEMIOLOGY
Dorret
Boomsma(VU/mc)*
Cornelia van Duyn(EMC)
SYSTEMS BIOLOGY
Jan vd Greef (TNO/UL)*
Cor Verweij (VU/mc)
TECHNOLOGY
Huub de Groot (UL)
MODEL SYSTEMS
Rune Frants (LUMC)
Task force
Populations
Genotyping
Arraying
Proteomics
Metabolomics
Molecular interactions
In vivo imaging
Mouse / Rat
Zebrafish
Drosophila
Yeast
Cells, vaccines
EMC : van Duyn, Hofman, Oostra
VU/mc : Boomsma, Boers, Dijkmans, Heine, Hoogendijk, van der
Knaap, Meier, Pena, Pinedo
LUMC : Slagboom, Bertina, Breedveld, Breuning, Cornelisse,
Devilee, vDissel, Ferrari, Huizinga, Roosendaal, Roos, van der
Velde, Westendorp, Zitman
LUMC : Slagboom, Sandkuijl, den Dunnen
EMC : Oostra, Heutink
VU/mc : Boomsma, (Heutink)
LUMC: den Dunnen, Boer, Fodde
VU/mc: Verweij, Ylstra, Brakenhoff
EMC: Oostra
LUMC: Koning, Deelder, den Dunnen, van der Maarel
UL: Overkleeft, Abrahams,
VU/mc: Smit, Li, van Kooyk
UL: Verduijn Lunel, van de Geer, Verheij
TNO: van der Greef, Havekes, te Koppele
VU/mc: Jakobs
UL: Abrahams, Brouwer, IJzerman, van Boom
LUMC: Tanke, Raap, Deelder, den Dunnen
VU/mc: Leurs, Irth
UL: de Groot, Kok
LUMC: Reiber, van Buchem, de Roos, Poelmann, Lowick
VU/mc: Witter, Bal, Lammertsma
EMC: van Duyn, van Swieten
EMC: Oostra
LUMC : Verbeek, Fodde, deKloet, Verrijzer, Noordermeer,
Mullenders
TNO: Havekes;
UL: Spaink, Brouwer, Schmidt
CLINICAL
APPLICATIONS
Cornelis Melief (LUMC)
Viral
Methodologies,
Pharmaceuticals
VU/mc: van Kooiyk; Meijer,Pinedo
LUMC: Spaan, Wiertz, Hoeben
VU/mc: Gerritsen, Curiel
UL: IJzerman, Mulder, van Boom
LUMC: Huizinga, Breedveld, Breuning, van Deutekom, Ferrari,
Fodde, Frants, Jukema, de Kloet, Ottenhoff, van der Velde, Zitman
VU/mc: Maassen, Dijkmans, Leurs, Meijer, Pinedo
EMC : Stricker
CENTRAL PROJECT
Coordinator / Elements
DATA INTEGRATION,
ANALYSIS AND
LOGISTICS
NN
Central Information
Management
TFBI / BIG-VU / EBB
- Rosetta Resolver®
- LIMS integration/
/interfacing
Biostatistics van Houwelingen, Eijlers,
Boer, Sandkuijl (LUMC); van der Vaart, de Gunst, Boers
(VU/mc); Houwing-
Duiistermaat (EMC), van de
Geer (UL)
Bioinformatics
Boer, Svensson, Gorbalenya
(LUMC), Heringa, vBeek
(VU/mc), Stijnen, van der Lei,
Mons (EMC), Kok (UL)
BioASP Interface ism: Vriend/Tellegen
GRID – Virtual Laboratory
NWO- BMI FLEXwork van Ommen, Boer,
Svensson ism: - Stiekema (Wag)
- Herzberger (Ams)
- Vriend (Nijm)
• Integrate data sources
• Integrate methods
Data
Algorithm
Biological
Interpretation
(model)
“Nothing in Biology makes sense except in the light of evolution” (Theodosius
Dobzhansky (1900-1975))
“Nothing in Bioinformatics makes sense except in the light of Biology”
(more than just string matching)
Global dynamic programming
MDAGSTVILCFVG
S
T
I
L
M
D
A
A
C
G
S
Search matrix
Amino Acid Exchange
Matrix
MDAGSTVILCFVG-
MDAAST-ILC--GS
Data integration
Data
Algorithm
Biological
Interpretation
(model)
Data integration
Data 1 Data 2 Data 3
Data integration
Data 1 Data 2 Data 3
Algorithm 1
Biological
Interpretation
(model) 1
Algorithm 2
Biological
Interpretation
(model) 2
Algorithm 3
Biological
Interpretation
(model) 3
Data integration
“The solution includes an infrastructure or data pipeline involving:
•a general portal
•virtual lab technology (virtual LIMS)
•‘ petabase
’ data handling facilities
•methods, software and ‘tools’ to integrate data and extract knowledge from data in the user domain.
This infrastructure calls for
•a central facilitation unit providing large storage and computing facilities to run central software packages with user interfaces”
•10 years SS prediction method development: Q3 += 3%
•10 years MA method development: difference in Q3 can be >30%
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
QUATERNARY STRUCTURE (oligomers)
TERTIARY STRUCTURE (fold)
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
QUATERNARY STRUCTURE (oligomers)
TERTIARY STRUCTURE (fold)
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
QUATERNARY STRUCTURE (oligomers)
TERTIARY STRUCTURE (fold)
Flavodoxin-cheY: Praline alignment (prepro=1500)
1fx1 -P KALIV YGSTTG NT EYTAETIARQLA NAG-Y EVDSR DA ASVEAGGLFEG FD LVLL GCSTWGDDSI------ELQDDF IPLF DSLE ETGAQGR KVACF
FLAV_DESDE MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
FLAV_DESVH MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
FLAV_DESSA MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
FLAV_DESGI MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
2fcr --K IGIFF STSTG NT TEVADFIGKTL GA---KADAP ID VDDVTDPQALKD YD LLFLGAP TWNTG----ADTERSGTS WDEFLYD KLPEVDMKDL PVAIF
FLAV_AZOVI -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
FLAV_ENTAG MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
FLAV_ANASP SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf
FLAV_ECOLI -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf
4fxn -M K -IVYW SGTG NT EKMAELIAKGIIE SG-KDV NTIN VSDVN I DELL N ED ILILGC SAMGDEVL-------EESE FEPFI EEI S-TKISGK KVALF
FLAV_MEGEL MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf
FLAV_CLOAB -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf
3chy ADKELK FLVV DDF STMRRIVRNLLKEL GFN--N VEEA ED GVDALNKLQA GGYG FVI --SD WNMPNM----------D GLELL KTIRA DGAMSALP VLM
T
1fx1 GCG DS-SY-EYFC GA VDAIEEKLK NLGAEIVQD---------------------G LRID GD--PRAA RDDIVGWAHDVRG AI--------
FLAV_DESDE ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
FLAV_DESVH GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------
FLAV_DESSA GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD---------------------SLKIDGD--PE--RDEIVSwGSGIADKI--------
FLAV_DESGI GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS---------------------SLKIDGE--PD--SAEVLDwAREVLARV--------
2fcr GLG DAEGYPDNFCD A IEEIHDCFAK QGAKPVGFSNPDDYDYEESKS-VRDGKFLG LPLD MVNDQIP MEKRVAGWVEAVVSET GV------
FLAV_AZOVI GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
FLAV_ENTAG GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------
FLAV_ANASP GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------
FLAV_ECOLI GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
4fxn G -----SY-GWGDG KWMRDFEERMNG YGCVVVET---------------------P LIVQ NE--PDEA EQDCIEFGKKIA NI---------
FLAV_MEGEL G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNA-PECKElGEAAAKA---------
FLAV_CLOAB STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF-----------
3chy VT AEAK K -ENIIAA --------AQ AGAS------------------------GYVV -----KPFT AATLEEKLNKIFEKL GM------
G
Iteration 0 SP= 136944.00 AvSP= 10.675 SId= 4009 AvSId= 0.313
Flavodoxin-cheY multiple alignment/ secondary structure iteration cheY SSEs
3chy-AA SEQUENCE|| AA |ADKELK FLVV DDF STMRRIVRNLLKELG FNN VEEA ED GVDALNKLQA GGYG FVISD WNMP|
3chy-ITERATION-0|| PHD | EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HH HEEE |
3chy-ITERATION-1|| PHD | EEEEEEEE
3chy-ITERATION-2|| PHD | EEEEEEEE
HHHHHHHHHHHHHHH
HHHHHHHHHHHHHH
HHHHHHHH EEEEEE
HHHHHHHHH EEEEEE
|
|
3chy-ITERATION-3|| PHD | EEEEEEEE
3chy-ITERATION-4|| PHD | EEEEEEEE
HHHHHHHHHHHHHH
HHHHHHHHHHHHHH
EEE HHHHHH
HHHHHHH
EEEEE
EEEEE
3chy-ITERATION-5|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE
3chy-ITERATION-6|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE
|
|
|
|
3chy-ITERATION-7|| PHD | EEEEEEEE HHHHHHHHHHHHHH
3chy-ITERATION-8|| PHD | EEEEEEEE HHHHHHHHHHHHHH
3chy-ITERATION-9|| PHD | EEEEEEEE HHHHHHHHHHHHHH
EEE HHHHHH EEEEE
HHHHHHH EEEEEE
HHHHHHHHHH EEEEE
|
|
|
3chy-AA SEQUENCE|| AA |NMD GLELLKTIRA DGAMSALP VLMVT AEAK KENIIAAAQAG AS GYVV KPFT AATLEEKLNKIFEKL GM|
3chy-ITERATION-0|| PHD | HHHHHH EEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH |
3chy-ITERATION-1|| PHD | HHHHHH EEEEEE HHH HHHHHHHHHHHHHHHHHH
3chy-ITERATION-2|| PHD | HHHHHH EEEEEE HHHHHHHHHHHHHHHHHH
EEE HHHHHHHHHHHHHH |
EEE HHHHHHHHHHHHHH |
3chy-ITERATION-3|| PHD | HHHHHHHHHHHH
3chy-ITERATION-4|| PHD | HHHHH
3chy-ITERATION-5|| PHD | HHHHHHHH
3chy-ITERATION-6|| PHD | HHHHHHHH
3chy-ITERATION-7|| PHD | HHHHHHHH
3chy-ITERATION-8|| PHD | HHHHHHHH
3chy-ITERATION-9|| PHD | HHHHHHHH
HHHHHHHHHHHHHHHHHH
EEEEE HHHHHHHHHHHHHHHHH
EEEEE HHHHHHHHHHHHHHHH
EEEEE HHHHHHHHHHHHHHH
EEE
EEE
EEE
EEEE
HHHHHHHHHHHHHH
HHHHHHHHHHHHHH
HHHHHHHHHHHHHH
HHHHHHHHHHHHHH
|
|
|
EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |
EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
|
• Low key example
• But difficult
• How to scale up?
• Need new formalisms and technology
Richard A. George
George R.A. and Heringa, J. (2002) J. Mol. Biol ., 316, 839-851.
The DEATH Domain
•
Present in a variety of Eukaryotic proteins involved with cell death.
• Six helices enclose a tightly packed hydrophobic core.
•
Some DEATH domains form homotypic and heterotypic dimers.
Structural domain organisation can be nasty…
Pyruvate kinase
Phosphotransferase b barrel regulatory domain a/b barrel catalytic substrate binding domain a/b nucleotide binding domain
1 continuous + 2 discontinuous domains
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
QUATERNARY STRUCTURE
TERTIARY STRUCTURE (fold)
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
QUATERNARY STRUCTURE
TERTIARY STRUCTURE (fold)
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
QUATERNARY STRUCTURE
TERTIARY STRUCTURE (fold)
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence) SECONDARY STRUCTURE (helices, strands)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
QUATERNARY STRUCTURE
TERTIARY STRUCTURE (fold)
N
C a distance matrix
Target matrix
3
N
N N
100 randomised initial matrices
100 predictions
N
Multiple alignment
Predicted secondary structure
CCHHHCCEEE
Input data
•The C a distance matrix is divided into smaller clusters.
•Seperately, each cluster is embedded into a local centroid.
•The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.
Domains in structures assigned using method by Taylor (1997)
Domain boundary positions of each model against sequence
Summed and Smoothed Boundaries
(Biased window protocol)
• “Give us sequence, we do rest” failed so far; e.g., number of human genes
• Gene prediction bad, RNA genes missed
• Protein structure/function prediction unsolved; we have no clue about function of 50% of human genes
• No theory of gene regulation
• Cannot well predict post-translational modification
• Many (database) solutions not generic
• We have no E=mc 2 so need to keep all data
• Integrating methods and data
• Understand biologically
• Integration of knowledge
– We have some formalisms (ontologies, distributed databases) but we need to develop many completely new formalisms and new technologies beyond what we have now
• Getting important integrative
Bioinformatics/Systems Biology applications onto the Grid through Gridlab can be significant
• Bioinformatics and genomics are getting clinical. Gridlab could play an important role
Keywords morning session
• Integration of knowledge
– Information transfer from one object to another
– What are the rules
– From genotype to phenoypes, current algorithms and ontologies not sufficient
– Biological interpretation needs context
– DB maintenance is dynamic process, most info is static
– Need resources
– Environment should allow student to make method in 3 hours
• Genomics
– Identifying genetic elements still bad
– Collect easy primary biological facts
– Gene pred, struct pred, functional all unsolved
– Genetic “parts” list is uncomplete and scanty
– Many omics “unknowme”
• Genomics
– Hypothesis driven versus systematic approaches
– Need databases,algorithms, biol knowledge
– Data structures not suitable for complexity
– Solutions such as Ensembl not generic
– Need technologies beyond ontologies
– Need new formalisms to be able to do “vertical genomics”
• Systems Biology
– Very promising area
• Health
• Pharmaceuticals
• Biotechnology
• Environment
• (Medical) Systems Biology
– Diego di Bernardo
– Ilias Jakovidis
– Very promising area
• How can Europe regain ground
• DNA contains all
• Identifying genetic elements still bad
• Collect easy primary biological facts
• Gene pred, struct pred, functional all unsolved
• Genetic “parts” list is uncomplete and scanty
• Many omics “unknowme”
• Hypothesis driven versus systematic approaches
• Need databases,algorithms, biol knowledge
• Data structures not suitable for complexity
• Solutions such as Ensembl not generic
• Need technologies beyond ontologies
• Information transfer from one object to another
• What are the rules
• From genotype to phenoypes, current algorithms not sufficient
• Biological interpretation needs context
• DB maintenance is dynamic process, most info is static
• Need resources
• Environment should allow student to make method in 3 hours
• TIGEM: disease genes
• Bioinformatics and comp biol not at a par
• 81 of genome “genomics&databases” and 19%
“genomics&algorithms
• Important topics: regulation, network, digital signal processing HMMs
• Problems : algorithms not biological and no experimental verification
• Bioinformatics helps design biological experimnents
• Richard Durbin: value of physics and engineering
• Computational tools for discovery of novel objects
• Medical informatics
• Health telematics
• eHealth
• Medical ontologies didn’t help Paul
Schofield at all (tried with NCBI-big mess)
• Middleware includes ontologies so covers biology (IBM!)
• Language engineering
• Natural language in medicine, computerize medical community
• Biomedical informatics: applications in healthcare, how to get to clinical?
• Synergy between medicine and biology informatics
• Alphonso, med will dominate, lot of money with unclear methods
• Medical info has worked coherently, how can we do that? How can we change?
• Mewes: Bioinf has achieved usage, not med. Bioinf is entering cliniques.
• Databases should be funded
• Start problem for 5 years: and then what?
• With infrastructure this problem is less, so funding is relatively OK.
• Technology development should not become dominant
• Most biologist are small scale hypothesis driven
• Marketing problem
• From 19 bioinf nethods , 15 are European in genomics
• Validation is not always key (Alfonso)
• EMBOSS project European wide, for algorithm driven research. EMBOSS is longstanding. But could not get funding from EC (no funding category)
• Often, 1 bioinformatician for everything
• Need of integration/collaboration
– Social, technical barriers
• People should realise that Bioinf and
Bioinformaticians are very different
• Integrated (med) system
– Underfunded (1 postdoc)
– Difficult to develop
– lack of standards and repositories
– Difficult to interact with biologist
– All these things essential
• 3-4 good bioinf groups in Spain
• Make virtual institute for bioinformatics
• There are few large groups with national funding
• There are few large groups with European funding
• There are many small groups with weak institutional funding
• Create framework valid for biology
• Interaction reduces overhead
• System access for biologists, point to the right expert
• Create new science beyond current needs
• This does not compete with basic needs
• Support strong European areas (eg. protein interaction)
• Bioinformatics is a new discipline
• Who solves the problem, who is interested in solving it, and not always who qualifies to solving it (engineers,..)
• Example “information extraction in molecular biology”: after years no real progress made.
• Systems Biology: what to do and how (no linear path), but we have opportunity to develop knowing
• Experimental validation: methods debug databases. Many proteins (90%?) have never seen an experiment
• Should bioinf talk to biology or vice versa?
• 1951 first protein sequence (insulin)
• Field has come of age, so outsiders shouldn’t tell us what to do and how
• Bioinformatics is part of the foundation
• Clear difference in application of informatics or bioinformatics
• Future will be different
• Give us sequence, we do rest failed! Number of human genes is example.
• Gene finding: standard genes good, RNA genes missed, no theory of gene transcription
• We have no E=mc 2 so need to keep data
• Computational biology is same as systems biology
• Good integration: E. Coli Bioinf-project, find all genes in small bacterium. Inclusive project. Now good consortium.
• Bioinformatics becomes invisible for biologists
(Blast).
• PRISM forum
• Provide challenges for (bio)informatics
• Drives Bioinf,omics,.. techniques