E-science and Systems Biology - A Revolution in the Life Sciences? Chris Rawlings

Head of Department of Biomathematics and Bioinformatics, Rothamsted Research

http://www.rothamsted.ac.uk/bab

chris.rawlings@bbsrc.ac.uk

Outline

Rothamsted Research

Systems Biology, Bioinformatics

Integrating Data

Text Mining to Support Database Curation

Systems Modelling

What are the issues?

Rothamsted Origins

Rothamsted Research

Largest agricultural and crop science research institute in UK

Research started in 1853

400 Staff

Funding: BBSRC (55%)

Others: Defra, EU, Industry

Sir Henry Gilbert and Sir John Bennet Lawes

The classical experiments

Rothamsted Soil Archive

[Figure: soil archive time series. Dioxins and furans (mg/kg) against year, 1840-1980; lead (mg/kg) against year, 1955-1990.]

New Approaches – High throughput science in agriculture research

Rothamsted’s Five Research Centres

The impacts of climate change on agriculture and approaches to its mitigation

Development of arable crops with improved resource use, performance, yield and end-use quality

The vital functions performed by soils and agricultural ecosystems

Effective and lasting approaches to reducing the impacts of pest and disease

Use of informatics, mathematics and statistics to derive added value from large volumes of complex noisy data. (E-science)

Research style

Mixture of basic and applied research

Translational research important

BBSRC->Defra->farmers->processors

Strongly interdisciplinary

Plant, insect and microbial molecular and cell biology, plant and insect ecology, soil science, chemistry, physics, mathematics, statistics, bioinformatics

Increasing use of molecular biological approaches to understanding:

Interactions between plants and their pests and pathogens including disease resistance

Biological diversity in above and below ground ecosystems

The mechanisms controlling the productivity of crop plants and their responses to biotic and abiotic stress

Example Systems

Plant-pathogen interactions

Managing disease resistance in crops

Understanding how pathogens evolve to overcome host defence

Interactions between plant, pest and biological control mechanisms (or agricultural practices)

Signalling between plant, pests and beneficial insects or plants

Chemical ecology – natural methods of pest control

Interplay between crop plant, nutritional or disease status and weather

Impact of climate change

Role of soil microbes interacting with plant roots and soil chemistry

Production/sequestering of greenhouse gases

Systems at a range of scales

Scale

Modelling approaches

Landscape Geostatistical methods and wavelets

Deterministic and stochastic models

Systems models (Petri Nets)

Ecosystem

Field

Plant

Cell

Pathway

Signalling and Metabolic Pathways

Systems Biology

Systems Biology - Two Definitions

Systems Biology: Emphasizes Systems Approaches

Predictive modelling

Multi-scale

Up-scaling, down-scaling

From genes and biochemical pathways to whole organism behaviour

Collaborations between biologists, mathematicians, engineers and physicists

Systems Biology: Emphasizes Integration

Holistic approach

Anti-reductionism

Whole genomes

Comparative analysis

High throughput technologies

‘omics

Data integration

Ideal Situation

Modelling/simulation

Clear goals

Study has a design

Expressive models

Computing power

Experimentalists

Simulation

Validation

Prediction

Revision

Reliable technology

Reproducible biology

Adequate resources

Quality data

Adequate resolution

High throughput

Experimental platforms

Common Requirements

Modelling/simulation

Clear goals

Study has a design

Expressive models

Simulation

Validation

Prediction

Revision

Expertise throughout project to define and structure development

Experimentalists

Reliable technology

Reproducible biology

Adequate resources

Quality data

Adequate resolution

High throughput

Experimental platforms

Additional

Data for model validation

Bioinformatics and E-science

The use and development of computer systems for the analysis and management of biological data

Underpins genomics and the use of high throughput molecular biology

Key component in systems biology

GENOMICS

BIOINFORMATICS

Data volume is not the only important factor

By comparison with other domains, the volume of data is not that great

The real challenges are:

The interrelatedness of all these data

The complexity of the dependencies

The incompleteness of the data

Interrelatedness of databases indexed by SRS in ‘96

Complexity of interactions

Biomathematics and Bioinformatics at Rothamsted

Integrate data from multiple biological sources and develop tools to analyse and interpret results

Exploit mathematics and computational sciences to develop methods for detection of subtle signals in complex and noisy datasets

Develop predictive systems models of plants and their interactions with pathogens and the environment at a variety of scales

Validate and apply the models to support the development of sustainable agricultural practices

Access to Data is Key Requirement for Integrative Systems Biology

Data integration platform - ONDEX

Semantic integration

Visualisation

Text mining

Data Integration

ONDEX system

http://ondex.sourceforge.net

Key features:

Treats all data as components in a graph of concepts linked by edges with defined semantics

All information is a network

Ontologies provide key to linking across information types

Specialist treatment of text and sequence information

Client server architecture

Recent version exploits emerging GRID technologies to enable open access to ONDEX-integrated data resources

ONDEX principles

Everything is a network (protein interactions, metabolic pathways) …

… in which the nodes and edges have different properties (ontologies)

Main idea

Simple graphs

Concepts/Entities = Nodes

Relations = Edges

Protein binds Cofactor

Protein binds Protein

Substrate binds Enzyme

Enzyme catalyses Product

Best analogy is a map

Think of it as layers which can be combined in different ways to answer particular questions
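The "graph of concepts" and "layers" ideas can be sketched in a few lines of Python. This is an illustration only, not ONDEX code: the class `ConceptGraph`, the concept identifiers (`proteinA`, `cofactorX`, and so on) and the layer names are all hypothetical.

```python
class ConceptGraph:
    """Toy graph: typed concepts (nodes) linked by typed relations (edges)."""

    def __init__(self):
        self.concepts = {}   # concept id -> concept type
        self.relations = []  # (source, relation type, target, layer)

    def add_concept(self, concept_id, concept_type):
        self.concepts[concept_id] = concept_type

    def add_relation(self, source, relation_type, target, layer):
        # Both ends of an edge must already exist as concepts.
        for concept_id in (source, target):
            if concept_id not in self.concepts:
                raise KeyError(f"unknown concept: {concept_id}")
        self.relations.append((source, relation_type, target, layer))

    def neighbours(self, concept_id, layers):
        """Concepts linked to concept_id, using only the chosen layers."""
        found = set()
        for source, relation_type, target, layer in self.relations:
            if layer not in layers:
                continue
            if source == concept_id:
                found.add((relation_type, target))
            elif target == concept_id:
                found.add((relation_type, source))
        return found


graph = ConceptGraph()
graph.add_concept("proteinA", "Protein")
graph.add_concept("proteinB", "Protein")
graph.add_concept("cofactorX", "Cofactor")
graph.add_concept("productY", "Product")

# Hypothetical layers standing in for two integrated data sources.
graph.add_relation("proteinA", "binds", "cofactorX", layer="interactions")
graph.add_relation("proteinA", "binds", "proteinB", layer="interactions")
graph.add_relation("proteinA", "catalyses", "productY", layer="pathways")

interactions_only = graph.neighbours("proteinA", {"interactions"})
combined = graph.neighbours("proteinA", {"interactions", "pathways"})
print(sorted(interactions_only))
print(sorted(combined))
```

Selecting a different set of layers changes which neighbours a query returns, which is the sense in which layers are combined to answer a particular question.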

Integrated Analysis of ‘Omics Data

ONDEX for Gene Expression

Use integrated information to help provide biological context/explanation for the pattern of up/down regulated genes

Sequence

Transcription Factors (TRANSFAC)

Gene Ontologies (GO)

ONDEX Data Integration

Biochemical Pathways (KEGG, AraCyc)

Enzyme Reactions (BRENDA)

Gene Expression Data

Could be other ‘omic data

Parsers available for 14 data formats:

KEGG, AraCyc, MetaCyc, BRENDA, Cell Ontology, OBO Ontologies, Drastic, Enzyme Commission, MeSH, TRANSFAC, TRANSPATH, Human disease ontology, mouse pathology

Pilot Study

Gene Expression Analysis

Parani, M., et al. (2004) Microarray analysis of nitric oxide responsive transcripts in Arabidopsis. Plant Biotechnology Journal, 2, 359-366.

Published study of NO signalling (stress)

List of statistically significant differentially regulated genes

Re-interpret in context of integrated data relating to plant signalling mechanisms

Graph Visualisation & Analysis

Pilot Study

Arabidopsis data with 120 “novel” genes

New observations not in original paper made because of access to integrated data:

Provided annotation to 50 “novels”

An important “unspotted” gene (a TF)

Drought stress

Jasmonic acid biosynthesis

Köhler, J., Baumbach, J., Taubert, J., Specht, M., Skusa, A., Rueegg, A., Rawlings, C., Verrier, P. and Philippi, S. (2006) Graph-based analysis and visualization of experimental results with ONDEX. Bioinformatics 22(11):1383-90.

Text Mining for Database Curation

Database of genes from plant fungal pathogens

Validated by gene disruption experiments

Extended to other pathogens

Research question - use of text mining to improve search for additional genes

Supplement manual methods

Pathogen Host Interactions Database

To fight pathogens one can a) reduce pathogenicity or b) increase resistance in hosts

First version of PHI-base

Curated experimentally validated genes that result in loss of infection function

Generic for any pathogens and hosts (not only fungi and plants)

Why have a database

Support analysis of experimental results

Identify key pathogen genes and families across species

How are the genes related?

Pathway analysis

Starting point for fungicide/drug target identification

Original Curation Process

Papers

Curator

Original situation

Post-doc and PhD Student curators

Simple literature search terms

Read abstracts to select relevant articles

Read paper to abstract detailed information

Time consuming

Potential for missing genes

Free text, no controlled vocab

No links to other databases

Not scalable

Capture in spreadsheet not suitable for DB

Text Mining to Support Curation

Papers

Text mining

Curator(s)

Web Frontend

Relational Database (PostgreSQL)

PHI-base Database

Principles

Interoperability with external data sources

Use controlled vocabularies, ontologies, taxonomies

Link out to external data sources

Use stable accession numbers so other data sources can link to PHI-base

Text Mining Results

Compared with manual curators – trying to recreate same content

3 Concept groups: gene symbols, pathogens and hosts

Precision 41% (41 / 100 extracted abstracts)

(60 different genes, 7 new genes)

Recall 70% (104 / 150 extracted abstracts)

Mixed results:

Reduced recall and precision – but not that bad for first attempts with simple term co-occurrence
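A minimal sketch of what "simple term co-occurrence" over three concept groups might look like, together with the precision and recall calculation. The term lists and the two abstracts are invented for illustration, and the substring matching is deliberately naive; nothing here is the actual PHI-base pipeline.

```python
def cooccurs(abstract, gene_terms, pathogen_terms, host_terms):
    """True if the abstract mentions at least one term from every group."""
    text = abstract.lower()
    groups = (gene_terms, pathogen_terms, host_terms)
    return all(any(term.lower() in text for term in group) for group in groups)


def precision(relevant_retrieved, retrieved):
    """Fraction of retrieved abstracts that were relevant."""
    return relevant_retrieved / retrieved


def recall(relevant_retrieved, relevant_total):
    """Fraction of relevant abstracts that were retrieved."""
    return relevant_retrieved / relevant_total


# Hypothetical term lists, one per concept group.
genes = ["geneA", "geneB"]
pathogens = ["fungus X"]
hosts = ["crop Y"]

hit = cooccurs("Disruption of geneA in fungus X reduces infection of crop Y.",
               genes, pathogens, hosts)
miss = cooccurs("Growth of fungus X on artificial media.",
                genes, pathogens, hosts)
print(hit, miss)

# Counts as given on the slide.
print(precision(41, 100))
print(round(recall(104, 150), 2))
```

With the counts from the slide, 41 / 100 gives 0.41 and 104 / 150 gives about 0.69, which the slide reports as 70%.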

Found new genes

Combined manual and text mining

Current status

Collaboration with National Centre for Text Mining

More advanced text mining methods

Improve precision and recall

Data extraction

Extend Web front end to support curation

Grow curator community

Improve content (further funding)

Modelling Plant Biochemical Systems

Many groups in RRes study complex signalling and metabolic pathways

Create mutant plants

Single targeted gene knocked out

Phenotype not always easy to predict

Develop predictive biochemical systems models

Formalise pathways and biological hypothesis

Use to predict phenotype from model

Biological pathways represented as Petri nets
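A minimal sketch of how a pathway can be written as a Petri net and a knock-out simulated by removing a transition. The place and transition names (`precursor`, `active`, `inactive`, `synthesise`, `deactivate`) and the token counts are hypothetical and are not taken from the gibberellin model; always firing the first enabled transition is a simplification chosen to keep the run deterministic.

```python
class PetriNet:
    """Places hold tokens; a transition consumes and produces tokens."""

    def __init__(self, marking):
        self.marking = dict(marking)  # place -> token count
        self.transitions = {}         # name -> (inputs, outputs)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (dict(inputs), dict(outputs))

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) >= n for p, n in inputs.items())

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition not enabled: {name}")
        inputs, outputs = self.transitions[name]
        for place, n in inputs.items():
            self.marking[place] -= n
        for place, n in outputs.items():
            self.marking[place] = self.marking.get(place, 0) + n

    def run(self, max_steps):
        """Fire the first enabled transition until none is enabled."""
        for _ in range(max_steps):
            ready = [t for t in self.transitions if self.enabled(t)]
            if not ready:
                break
            self.fire(ready[0])
        return self.marking


def build_net(knocked_out=()):
    # Hypothetical two-step pathway: precursor -> active -> inactive.
    net = PetriNet({"precursor": 5, "active": 0, "inactive": 0})
    steps = {
        "synthesise": ({"precursor": 1}, {"active": 1}),
        "deactivate": ({"active": 1}, {"inactive": 1}),
    }
    for name, (inputs, outputs) in steps.items():
        if name not in knocked_out:
            net.add_transition(name, inputs, outputs)
    return net


wild_type = build_net().run(max_steps=100)
mutant = build_net(knocked_out={"deactivate"}).run(max_steps=100)
print(wild_type)
print(mutant)
```

In this toy net the knock-out leaves all tokens in `active`, whereas the wild type moves them on to `inactive`; the predicted "phenotype" is simply the final marking.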

Gibberellin biosynthesis

Columns: Experiment (Control, GA 2-ox) | Petri Net (Control, GA 2-ox) | ODE (Control, GA 2-ox)

GA 4 99 16 205 9

GA 3 34 89 5861 453

GA 38 3 3 1 0.3

GA 47 6 8 4439 0.8

GA 2 3 3 20 18

GA 19 111 2 115 102

103-201 4061 17.8-30.7 7584.6 249.2 0.00052

0.73 2.3 0.6 1.4 98 0.00052

What Characterises Systems Biology Research

Access to wide variety of data from many different sources

Wide variety of data analysis methods for different types of data

Combine and interpret data

Create structured quantitative model of system:

Mathematical – differential equations

Computational – Petri nets, Pi Calculus

Validate quantitative dynamic behaviour of model by simulation
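As an illustration of the differential-equation route, the sketch below simulates a hypothetical two-step pathway A → B → C with mass-action kinetics, using a fixed-step Euler scheme. The species, rate constants and step size are assumptions made for the example; a real model would normally use a proper ODE solver.

```python
import math


def simulate(k1, k2, a0, t_end, dt):
    """Euler integration of dA/dt = -k1*A, dB/dt = k1*A - k2*B, dC/dt = k2*B."""
    a, b, c = a0, 0.0, 0.0
    steps = int(round(t_end / dt))
    trajectory = [(0.0, a, b, c)]
    for i in range(1, steps + 1):
        flux1 = k1 * a  # rate of A -> B
        flux2 = k2 * b  # rate of B -> C
        a -= flux1 * dt
        b += (flux1 - flux2) * dt
        c += flux2 * dt
        trajectory.append((i * dt, a, b, c))
    return trajectory


# Hypothetical parameters.
trajectory = simulate(k1=1.0, k2=0.5, a0=100.0, t_end=5.0, dt=0.01)
t_final, a_final, b_final, c_final = trajectory[-1]

# Two simple checks on the dynamic behaviour: total mass is conserved,
# and A follows the known analytical decay a0 * exp(-k1 * t).
total = a_final + b_final + c_final
analytic_a = 100.0 * math.exp(-1.0 * 5.0)
print(total, a_final, analytic_a)
```

Comparing the simulated trajectory against a conservation law or a known analytical solution is one inexpensive way to check a model's dynamic behaviour before comparing it with experimental data.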

What Systems Biology Requires

Open access to life science databases

Challenge: number and variety

Access to scientific literature and especially the quantitative information embedded there

Reaction rates, time course information etc

Particular Challenges

Integrating data to facilitate analysis and interpretation

Identification and extraction of relevant information from scientific literature

Currently manually intensive and requires moderate domain expertise

Finding all the information necessary to parameterise highly complex models

Parameter estimation methods for under-determined models
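A toy illustration of what "under-determined" means in practice. The model, parameter values and candidate grid are all hypothetical. A species is made at rate `k_syn` and degraded at rate `k_deg`, so its steady state depends only on the ratio of the two; a steady-state measurement alone cannot separate them, but one additional time-course point can.

```python
import math


def level(k_syn, k_deg, t):
    """Solution of dx/dt = k_syn - k_deg * x with x(0) = 0."""
    return (k_syn / k_deg) * (1.0 - math.exp(-k_deg * t))


def steady_state(k_syn, k_deg):
    return k_syn / k_deg


# "Observations" generated from hypothetical true parameters.
true_k_syn, true_k_deg = 2.0, 1.0
observed_steady = steady_state(true_k_syn, true_k_deg)
observed_at_t1 = level(true_k_syn, true_k_deg, 1.0)

candidates = [0.5, 1.0, 2.0, 4.0]

# Every (k_syn, k_deg) pair with the right ratio fits the steady state.
steady_fits = [
    (ks, kd)
    for ks in candidates
    for kd in candidates
    if abs(steady_state(ks, kd) - observed_steady) < 1e-9
]

# Adding one time-course point picks out a single pair.
best = min(steady_fits, key=lambda p: (level(p[0], p[1], 1.0) - observed_at_t1) ** 2)
print(steady_fits)
print(best)
```

This is why quantitative details such as reaction rates and time courses matter for parameterising models, and not only end-point measurements.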

Issues

Public databanks capture high volume data

Generally low “value” until high volume

Exception - protein structure database

Increasing number of databases that synthesize richer views

Database equivalent of review

E.g. KEGG (Kyoto Encyclopedia of Genes and Genomes), EBI Genome Reviews database

No general solution for small-volume, high-value interpreted data, such as the supplementary data lodged with journal publishers

Data in Online Publications

Poor links between additional data and text – for data mining

Information in other presentation forms – graphs, tables

Images

E-science and Systems Biology what is different

Highly dependent on 3rd party “public” data

Open access is vital

Even for primary data producer in lab – interpretation in context of 3rd party data is essential

Rapid change in methods with higher sensitivity and throughput makes (some) information ephemeral

E-science = Ephemeral-science?

Cheaper to run experiment again

E.g. gene expression

Peer-reviewed literature is important, but the needs of e-science are different

Online publication model (2 column PDF) unsatisfactory

More structure / improved information extraction

Methods/protocols/metadata

Publications more for scientific career development than as a true record of scientific progress?

Evolution – not Revolution

Acknowledgements

Funding – BBSRC

Rothamsted Colleagues

Jacob Koehler

Rainer Winnenburg

Jan Taubert

Tully Yates

Peter Hedden

Andy Phillips

Kim Hammond-Kosack

Martin Urban

Thomas Baldwin