1 Introduction to bioinformatics

Introduction
Bioinformatics (Bachelor Cel & Gen/ Biochemistry)
1 INTRODUCTION TO BIOINFORMATICS
Table of contents:
1 Introduction to bioinformatics
1.1. Foreword to bioinformatics
1.2. Bioinformatics formal definition
1.3. Driving force for bioinformatics
1.3.1 Advance in molecular biology
1.4. Different subfields in bioinformatics research
1.5. Structural genomics
1.5.1 Overview
1.5.2 Biological application: genome sequencing
1.6. Comparative genomics
1.6.1 Overview
1.6.2 Biological application 1: comparative genomics, genome evolution (Y. Van de Peer)
1.6.3 Biological application 2: metagenomics (G. Venter)
1.7. Functional genomics & Systems Biology
1.7.1 Systems biology
1.7.2 Synthetic biology from an engineering point of view: rational design
Updated 20/09/2013
Master file introduction
Kathleen Marchal Dept of Plant Biotechnology and Bioinformatics UGent/ M2S KU Leuven
1.1. Foreword to bioinformatics
Bioinformatics, although a relatively recent term, has existed for more than 400 years. After all, Galileo wrote that “the book of nature is written in the language of mathematics!”.
The use of mathematical models to explain biological phenomena and to analyze data is certainly not new. Until recently, however, it was common practice only in certain subdomains of biology (e.g. population genetics, phylogeny, molecular modeling).
Major technological innovations in molecular biology in the early 1990s changed this thoroughly. High-throughput technologies (genomics, transcriptomics, proteomics, metabolomics) make it possible to map the DNA sequence of entire genomes in a very short time, to analyze the expression of thousands of genes or proteins in an organism, to assess the nature and concentration of all metabolites, and to identify the interactions between these different genetic entities. This has led to an unparalleled data explosion. An Excel spreadsheet no longer suffices to analyze these data; an interdisciplinary approach is required.
This data explosion has also drastically broadened “biological” thinking (also called the new biology). The ultimate goal of molecular biology, “gaining insight into the functioning and evolution of organisms”, has remained the same. The way to reach this goal has changed. Until a few years ago, functional molecular biology studied genes, proteins and other molecules one by one, as isolated entities. The new technologies now place the function of a gene in a global context, namely as part of a complex regulatory network. From this new perspective the organism is regarded as a system that interacts with its environment. Its behavior is determined by the complex dynamic interactions between genes, proteins and metabolites at the level of the regulatory network. Moreover, because data are available for different model organisms, cellular mechanisms can be compared between organisms.
Fig. An organism represented as a system that interacts with its environment. Through its regulatory networks, an organism continuously adapts to changing environmental signals. These adaptations result in altered behavior or an altered phenotype. The regulatory networks can be regarded as the biological signal-processing systems.
Traditional studies of biological systems were largely descriptive. The systems approach to biology, however, implies a thorough quantitative and integrated analysis of complex data. Under the influence of this new trend the term "bioinformatics" emerged (first used around 1993), and high-throughput functional molecular biology became part of “systems biology”.
Like molecular biology, systems biology and bioinformatics are research domains with many subdisciplines (structural, functional, comparative bioinformatics).
Bioinformatics questions arise from biology. The computational sciences provide an arsenal of standard algorithms and principles. Both must be brought together in a meaningful way, taking into account the specific properties of the algorithm used as well as those of the biological problem. Reconciling algorithms from the exact sciences with experimental data from stochastic biological systems is the main challenge here. For that reason, solving a biological problem computationally can easily take a few years of research, but it leads to valuable results that in some cases surpass traditional biological research.
The future of bioinformatics
Bioinformatics is therefore not a “hype”. As molecular biological technology evolves, it will continue to gain importance. The most successful molecular biology laboratories will undoubtedly be those that steer their “wet lab” research using the predictions of advanced computational research. The future of both molecular biology and bioinformatics lies in developing research in which the boundary between the “wet lab” and the computational aspect fades.
Aim of the bioinformatics course
The aim of the course is twofold:
The first and probably most important aim is to convince you that bioinformatics is an essential part of your curriculum. With a number of examples and achievements from the field I hope to convince you that ‘bioinformatics’ and ‘systems biology’ will change our lives and our thinking. The molecular biologist of the 21st century will not only have well-developed biological knowledge, but must also be familiar with important principles from mathematics, statistics and information technology. Such integration of biological insight with analytical and problem-solving thinking is characteristic of bioinformatics.
A second aim of the course is to familiarize you with the use of bioinformatics tools. The field of bioinformatics is, however, very broad and expanding rapidly. It is therefore impossible to cover all tools and topics. We will discuss a number of important and widely used examples. It is important that you realize that meaningful bioinformatics is more than just applying tools: it is essential to also understand the underlying mathematical principles of these tools and at the same time gain insight into the data-generation protocols and the biological complexity. This implies that bioinformatics is more than an aid to molecular biological research; it is a fully fledged research domain in its own right.
K. Marchal
Important messages you should realize after having followed the course:
How can bioinformatics change the world?
Answer: Bioinformatics has numerous application domains. It will, for instance, revolutionize the
medical field by making personalized medicine possible. Now that everyone's genome (the blueprint
of life) can (and will) be sequenced, we can start investigating why some people are susceptible
to certain diseases while others are not, and why a certain treatment works for some and not for
others. In agriculture, we can for instance study why certain
plants are more resistant to drought than others, which is important in these days of global warming
and climate change. Bioinformatics can also address more fundamental questions in evolutionary
biology, such as whether Neanderthals and the ancestor of modern humans ever had sex (the answer
is yes), questions that can only be addressed with bioinformatics or computational biology.
What is the most important skill of a bioinformatician?
Answer: Well, I think first of all you need to be a generalist rather than a specialist. You need to know a bit of everything but nothing in too much detail (that can even be disadvantageous, I think). To give an example: wet lab scientists typically have a very detailed view of biology. Biological systems have randomly evolved into emergent complex systems that cannot be captured in a few rules. There are more exceptions than fixed rules in biology. Engineers, on the other hand, model systems, and these models depend on predefined rules. As a bioinformatician you need to keep both parties happy: a good formalization of a biological question should reduce the problem to a model that is mathematically tractable but that still captures the intricacies the biologist is interested in. Finding the right assumptions and simplifications builds on this generic knowledge. This generic knowledge is also key to the scientific intuition you need to have as a bioinformatician. As was already mentioned: with bioinformatics we can solve research questions that could not be addressed before. There is so much data out there that, when you integrate it all, you can tackle research questions that go far beyond what was accessible to, or could be dreamt of by, a single person or even a single lab. The difficulty often lies in defining these novel research questions or hypotheses that no one has ever thought of before. This again requires very good interdisciplinary knowledge of how the data were generated, what type of information they contain, how they can be integrated, etc.
If bioinformatics becomes so prominent and is referred to as ‘the new biology’, how will this affect the more classical wet lab science?
Answer: Of course without data there is no bioinformatics. But there is indeed a tendency for data generation to become increasingly robotized or outsourced. As a consequence, wet lab scientists have more time left to spend on the design of their experiments and will be confronted at a much earlier stage with the analysis of their data and the problems related to it. What do you hope to get out of your data, how will you synthesize all these data, what is the hypothesis you want to formulate, and so on? So rather than focusing on a single gene, they will need to start thinking more globally, looking at the bigger picture, and that is what the term ‘new biology’ refers to. This is now often considered the problem of the bioinformatician, but obviously the wet lab scientist of the future will have to adopt at least some of those skills. So the distinction between a bioinformatician and a wet lab scientist (systems biologist) will become fuzzier. In the coming decades, we expect that about one third of the people in the life sciences will be bioinformaticians or will at least use some sort of bioinformatics in their research. However, although genome hackers and number crunchers can learn a lot from the loads of data generated, wet lab work will always be necessary. Bioinformatics is also often about making predictions, but of course these will still need to be validated in the lab. On the other hand, for some specific fields such as evolutionary research, bioinformatics is often sufficient or even the only way to obtain results.
Taken from an interview in the framework of N2N (bioinformatics speerpunt at UGent)
1.2. Bioinformatics formal definition
Bioinformatics is an interdisciplinary research area at the interface between biological and
computational sciences. Although the term 'Bioinformatics' is not really well-defined, you could say
that this scientific field deals with the computational management of all kinds of molecular
biological information. Most of the bioinformatics work that is being done deals with either
analyzing biological data, or with the organization of biological information.
As a consequence of the large amount of data produced in the field of molecular biology, most of
the current bioinformatics projects deal with structural and functional aspects of genes and proteins.
1.3. Driving force for bioinformatics
1.3.1 Advance in molecular biology
Traditional genetics and molecular biology have been directed toward understanding the role of a particular gene or protein in a molecular biological process. A gene is sequenced to predict its function or to manipulate its activity or expression. Traditional molecular biology thus focused on single genes. With the advent of novel molecular biological techniques such as genome-scale sequencing, large-scale expression analysis (gene and protein expression, microarrays, 2D-electrophoresis, mass spectrometry), large-scale identification of protein-protein interactions (yeast two-hybrid, protein chips) or protein-DNA interactions (chromatin immunoprecipitation), the scale of molecular biology has changed. One no longer focuses on a single gene; instead, many genes or proteins are analyzed simultaneously (i.e. at high-throughput level: transcriptomics, translatomics, interactomics, metabolomics). This novel approach offers advantages: one can study the function or the expression of a gene in the global context of the cell. A gene does not act on its own: it is always embedded in a larger network (systems biology). These holistic approaches allow a better understanding of fundamental molecular biological processes.
On the other hand, high-throughput approaches pose several novel challenges to molecular biology: the analysis of such large-scale data is no longer trivial. Simple spreadsheet analyses, for instance in Excel, are no longer sufficient, and more advanced data mining procedures become necessary. Another urgent problem is how to store and organize all the information.
There is, in fact, an inseparable relationship between the experimental and the computational
aspects.
• On the one hand, data resulting from high-throughput experimentation require intensive computational interpretation and evaluation.
• On the other hand, computational methods produce predictions that remain uncertain until they are reviewed and confirmed through experiments.
1.4. Different subfields in bioinformatics research
This close merging of molecular biology and computational biology has given rise to new research fields and applications. Each of these research fields requires a specific kind of bioinformatics expertise.
Three main fields can be distinguished:
• Structural genomics
  o Input: raw sequence data
  o Goal: annotation
  o Bioinformatics tools: genome assembly, gene/promoter/intron prediction
• Comparative genomics
  o Input: annotated genomes
  o Goal: annotation, evolutionary genomics
  o Bioinformatics tools: sequence alignment, tree construction
• Functional genomics
  o Input: experimental information
  o Goal: function assignment, systems biology
  o Bioinformatics tools: microarray analysis, network reconstruction, data integration
Note that the field of molecular dynamics and protein modeling is not covered in this course.
Fig. The three subfields of bioinformatics: structural genomics/annotation; comparative genomics/evolutionary biology; functional genomics/systems biology.
For some purposes, different subfields have to be combined, i.e. the distinction is not always as clear-cut as it seems.
For instance, for genome annotation:
As genomes are collected they need to be annotated. This means that we have
• to identify the location of the genes on the genome (structural annotation),
• to assign a function to each of the potential genes (functional annotation).
In structural annotation, the question to be answered is 'where are the genes?' One needs to localize the gene elements on the sequence (chromosome) and find the coding sequences, intergenic sequences, exon/intron boundaries, promoters, 5'UTR and 3'UTR regions, and so on.
In functional annotation, one tries to get information on the function of genes. Often, it is possible
to get hints on the biochemical function of the gene products by finding homologs in protein
databases or by studying the biochemical characteristics of the gene (proteome, transcriptome
analysis).
In the following, each of the bioinformatics subfields will be briefly described and illustrated with a
biological case study.
1.5. Structural genomics
1.5.1 Overview
Structural genomics is based on raw sequence data. The first step in structural genomics consists of assembling raw sequence fragments into contigs or whole genomes. The complexity of the assembly process depends on the sequencing technique used. Two major sequencing approaches, top down sequencing and shotgun sequencing, are described below.
For more information see also:
http://www.genomenewsnetwork.org/articles/06_00/sequence_primer.shtml
http://www.bio.davidson.edu/courses/genomics/method/shotgun.html
In a second step the genetic entities need to be located on the genome (structural annotation).
SEQUENCING AND GENOME ASSEMBLY
1.1.5.1.1 Top down sequencing
The first genome sequencing approach, “top down”, is based on the known order of DNA fragments.
To sequence larger molecules such as human chromosomes:
1) Individual chromosomes are broken into random fragments of approximately 150 kb.
2) These fragments are then cloned into BACs (vectors).
3) In an intensive but largely automated laboratory procedure, the resulting library is screened for clusters of fragments, called contigs, which have overlapping or common sequences. These contigs are then joined to produce an integrated physical map of the genome based on the order of the BACs. Once the correct map has been identified, unique overlapping clones are chosen for sequencing.
4) However, these clones are too large for direct sequencing. One procedure for sequencing them is to subclone them further into smaller fragments of a size suitable for sequencing (500 bp).
5) From the DNA sequences of approximately 500 bp, genome sequences are assembled using the fragment order on the physical map as a guide.
Fig. Top down sequencing: 1. genome fragmentation; 2. BAC library; 3. physical map; 4. subclone library; 5. genome assembly.
This method of creating physical maps of genomes and then using these maps to guide the sequencing was used by the public Human Genome Consortium to create a draft of the human genome. This carefully crafted but laborious procedure was designed to produce a sequence of the human genome based on a top down approach, at each stage using the physical map to guide the placement of sequences (Lander, Nature 2001). The reasoning behind this strategy was to avoid problems caused by sequence repeats, which might otherwise confound obtaining the correct genome sequence.
1.1.5.1.2 Shotgun sequencing
A contrasting “bottom-up” method has been devised, in which the genome sequence is derived solely from overlaps in large numbers of random sequences, without using a physical map as a guide. This alternative method, called shotgun sequencing, attempts to assemble a linear map from subclone sequences without knowing their order on the chromosome. Contigs are assembled based on the alignment of all possible sequence pairs in the computer. This method is now routinely used to sequence microbial genomes and the cloned fragments of larger clones (see also metagenomics).
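The idea of assembling contigs purely from pairwise overlaps can be sketched in a few lines of Python. This is a toy greedy assembler, not the algorithm used by any real assembly program: the function names, the `min_len` threshold and the example reads are assumptions made for illustration, reads are assumed to be error-free and to come from one strand, and containment of one read inside another is not handled specially.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_a, best_b = 0, None, None
        # Compare all ordered pairs, as in "alignment of all possible pairs".
        for a, b in permutations(contigs, 2):
            size = overlap(a, b, min_len)
            if size > best_len:
                best_len, best_a, best_b = size, a, b
        if best_len == 0:
            break  # no overlaps left: the remaining contigs stay separate
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])
    return contigs


# Three hypothetical reads sampled from the sequence ATGGCGTACGTTAGC.
reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
print(greedy_assemble(reads))
```

Because the greedy choice only looks at overlap length, two copies of a repeat are indistinguishable to it, which is exactly why repeats can confound shotgun assembly when no physical map is available.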
Fig. Shotgun sequencing: 1. genome fragmentation; 2. library; 3. sequences; 4. genome assembly.
The shotgun method was used by Celera Genomics to sequence the human genome (Venter, Science 2001; http://www.jcvi.org/). There has since been controversy as to whether the Venter group's draft of the human genome depended significantly on their use of the public data, or whether it was derived purely from the overlaps in a highly redundant set of fragments by automatic computational methods (shotgun method).
Fuss about the public versus the commercial effort (Lander versus Venter):
http://www.dnalc.org/view/15326-Analysis-in-public-and-private-Human-Genome-Projects-EricLander-.html
Nowadays, for large genomes, a combination of top down and bottom-up sequencing is used, as
illustrated below.
STRUCTURAL ANNOTATION
Once the genome is assembled, structural elements such as the location of genes, introns, exons, splice sites, promoters, repeated elements etc. need to be predicted in the genome (the structural analysis).
Distinct gene prediction algorithms have been developed. Methods for ab initio gene prediction are based on supervised machine learning techniques (1). The model (e.g. a hidden Markov model or a neural network) is trained on a set of known genes (or promoters or introns) and subsequently used to predict the location of unknown genes (or promoters or introns) in an organism. Features (properties in the genome) that are extracted from the training set and that thus help to recognize genes are, for instance, specific codon usage (which differs between coding and non-coding regions), splice site recognition sites (when predicting splice sites), etc. Because of differences in codon usage and splice junctions between organisms, a model must be trained for each novel genome (see chapter “gene prediction”).
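How codon usage can serve as a trained feature may be illustrated with a deliberately simplified sketch. It is not a hidden Markov model or a neural network: it only estimates codon frequencies from labelled training sequences and scores a candidate region with a log-odds ratio. All names, the pseudocount and the tiny training sequences are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]


def codons(seq: str) -> list[str]:
    """Split a sequence into consecutive in-frame codons (trailing bases dropped)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_frequencies(seqs: list[str], pseudocount: float = 1.0) -> dict[str, float]:
    """Estimate codon frequencies from training sequences (with pseudocounts)."""
    counts = Counter(c for s in seqs for c in codons(s))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(seq: str, coding: dict[str, float], noncoding: dict[str, float]) -> float:
    """Sum of log-odds per codon; positive values favour 'coding'."""
    return sum(
        math.log(coding[c] / noncoding[c])
        for c in codons(seq)
        if c in coding  # skip codons containing ambiguous bases such as N
    )


# Hypothetical training data: one "known gene" and one "non-coding" region.
coding_model = codon_frequencies(["ATGGCTGCTGCTGCTTAA"])
noncoding_model = codon_frequencies(["ATATATATATATATATAT"])
print(coding_score("GCTGCTGCT", coding_model, noncoding_model) > 0)
```

Since the frequencies are learned from the training set, a model trained on one organism would score another organism's genes poorly if their codon usage differs, which is the reason given above for retraining per genome.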
Once the complete genome is known, genome maps can be constructed that indicate the position of each gene on the genome. Comparing the gene maps of different organisms allows the identification of translocations or other chromosome rearrangements (important in cancer research).
Fig. genome map of the bacterium A. tumefaciens
1.5.2 Biological application: genome sequencing
THE NUMBER OF FULLY SEQUENCED GENOMES INCREASES AT AN UNPRECEDENTED PACE
The first bacterial genome to be sequenced was that of Haemophilus influenzae (sequenced by the TIGR institute (http://www.tigr.org) in 1995). The success of sequencing this genome in a relatively short time heralded the sequencing of a large number of additional prokaryotic organisms. To date the genomes of 96 of these species have been sequenced, among which the model organisms E. coli and B. subtilis.
Later on, eukaryotic genomes were sequenced. In 2001 draft sequences of the human genome were published by two distinct research groups in parallel: a commercial group, Celera, and an academic sequencing consortium (including the Sanger Centre). Nowadays the sequences of several eukaryotic model organisms have been determined and the number of sequences is steadily increasing.
Microbial genomes
http://www.ncbi.nlm.nih.gov/genomes/static/micr.html
Genome resources at ncbi:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
Vertebrate genomes:
http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=7227
http://www.ncbi.nlm.nih.gov/genome/guide/human/
http://www.ncbi.nlm.nih.gov/genome/guide/mouse/index.html
http://www.ncbi.nlm.nih.gov/genome/guide/rat/index.html
http://www.ncbi.nlm.nih.gov/genome/guide/zebrafish/index.html
http://www.ensembl.org/
yeast genomes
http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?chr=scerevisiae.inf
Plant genomes:
http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/Resources_1.html#arab
From Nature Reviews Genetics 4, 251-262 (2003);
GENOME SEQUENCES AT INTERSPECIES LEVEL
Why are these genomes useful? Two examples of the use of genome sequences are described below.
With all these genome sequences at hand, it even becomes possible to study our own evolution.
In September 2005, an international team published the genome of our closest relative, the chimpanzee. With the human genome already in hand, researchers could begin to line up chimp and human DNA and examine, one by one, the 40 million evolutionary events that separate them from us. The genome data confirm our close kinship with chimps: we differ by only about 1% in the nucleotide bases that can be aligned between our two species, and the average protein differs by less than two amino acids.
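A figure such as "about 1% difference in the bases that can be aligned" is a ratio over aligned columns only. The sketch below shows that calculation on two already-aligned sequences; the function name and the example strings are made up for illustration, and gaps are written as "-".

```python
def percent_difference(aligned_a: str, aligned_b: str) -> float:
    """Percentage of mismatching bases among columns where both sequences have a base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Columns with a gap in either sequence cannot be compared and are skipped.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no aligned bases to compare")
    mismatches = sum(x != y for x, y in pairs)
    return 100.0 * mismatches / len(pairs)


# Hypothetical alignment: 10 comparable columns, 1 mismatch, 1 gap column ignored.
print(percent_difference("ACGTA-CGTAC", "ACGTATCGTAT"))
```

Note that insertions and deletions are excluded from this percentage by construction, so it understates the total sequence divergence between two genomes.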
Given the dramatic behavioral and developmental differences that have arisen since their
divergence from a common ancestor 6-7 million years ago, the question therefore arises of how
these phenotypic differences are reflected at the genome sequence level.
Recent studies have shown that mainly genes involved in smell and hearing are significantly
different between humans and chimpanzees.
Also changes in gene regulatory binding sequences (promoters, enhancers, and silencers) are likely
to have contributed to divergence between humans and chimps. Using a comparative approach, it
has been shown that regulatory binding sites lost in human but still present in chimp are located in
specific genomic regions and are associated with genes involved in sensory perception.
In this figure, two examples are given of regulatory binding sites that changed between human and
chimp. Note how small differences in sequences can have such large phenotypic influences.
GENOME SEQUENCES AT THE LEVEL OF THE INDIVIDUAL
In the early days most genomes were sequenced by the classical Sanger sequencing approach (see figure below), but nowadays next-generation sequencing (NGS) methodology has taken over. Developments in nanotechnology in particular have given rise to novel technologies for sequencing and synthesizing DNA. Next-generation sequencing has the ability to process millions of sequence reads in parallel rather than 96 at a time. All NGS platforms share a common technological feature: massively parallel sequencing of clonally amplified or single DNA molecules that are spatially separated in a flow cell (for a recent review see Metzker, M.L. (2010) Nature Reviews Genetics 11:31-46) (see figure below). This design is a paradigm shift from that of Sanger sequencing, which is based on the electrophoretic separation of chain-termination products produced in individual sequencing reactions.
Fig. Sanger sequencing methodology: The dideoxynucleotide termination DNA sequencing
technology invented by Fred Sanger and colleagues in 1977, formed the basis for DNA sequencing
from its inception through 2004. Originally based on radioactive labeling, the method was
automated by the use of fluorescent labeling coupled with excitation and detection on dedicated
instruments, with fragment separation by slab gel and ultimately by capillary gel electrophoresis.
Overview of next generation sequencing technologies.
The breakthroughs in these technologies are unprecedented and follow Moore's law. As far as sequencing technology is concerned, it is to be expected that within a few years we will have the 1000 dollar genome, which allows the genome of a human to be sequenced within a few hours for 1000 dollars.
(For comparison: the human genome is 3.4 Gb = 3.4 billion base pairs and has 20000-25000 genes.)
As a result, recent sequencing projects have started to focus on sequencing different individuals of the same species (e.g. the 1000 Genomes Project, http://www.1000genomes.org/). This has been made possible by the lower sequencing cost of the next-generation sequencing approaches. It opens novel perspectives for, amongst others, personalized medicine, sequence-based trait selection and evolution experiments.
The ‘1000 genomes project’
Recent improvements in sequencing technology ("next-gen" sequencing platforms) have sharply
reduced the cost of sequencing. The 1000 Genomes Project is the first project to sequence the
genomes of a large number of people, to provide a comprehensive resource on human genetic
variation. As with other major human genome reference projects, data from the 1000 Genomes
Project will be made available quickly to the worldwide scientific community through freely
accessible public databases.
The goal of the 1000 Genomes Project is to find the genetic variants that have frequencies of at
least 1% in the populations studied. This goal can be attained by sequencing many individuals
lightly. To sequence a person's genome, many copies of the DNA are broken into short pieces and
each piece is sequenced. The many copies of DNA mean that the DNA pieces are more-or-less
randomly distributed across the genome. The pieces are then aligned to the reference sequence and
joined together. To find the complete genomic sequence of one person with current sequencing
platforms requires sequencing that person's DNA the equivalent of about 28 times (called 28X). If
the amount of sequence done is only an average of once across the genome (1X), then much of the
sequence will be missed, because some genomic locations will be covered by several pieces while
others will have none. The deeper the sequencing coverage, the more of the genome will be covered
at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely that both
chromosomes at a location will be included. In addition, deeper coverage is particularly useful for
detecting structural variants, and allows sequencing errors to be corrected.
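The relation between sequencing depth and the fraction of the genome that is actually read can be approximated with a simple Poisson model. This is an idealisation added here for illustration, not the Project's own method: it assumes reads land uniformly at random, so a position is missed at depth c with probability e^(-c), and it assumes reads split evenly between the two chromosome copies. The function names are hypothetical.

```python
import math

def fraction_covered(depth: float) -> float:
    """Expected fraction of positions read at least once (Poisson assumption)."""
    return 1.0 - math.exp(-depth)

def both_copies_covered(depth: float) -> float:
    """Probability that both chromosome copies at a site are each read at least once.

    Assumes the total depth is shared equally by the two copies (depth / 2 each).
    """
    per_copy = depth / 2.0
    return (1.0 - math.exp(-per_copy)) ** 2

for depth in (1, 4, 28):
    print(f"{depth:>2}X: covered at least once = {fraction_covered(depth):.4f}, "
          f"both copies seen = {both_copies_covered(depth):.4f}")
```

Under these assumptions roughly a third of the genome is missed at 1X, while at 28X essentially every position is expected to be read on both copies.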
Sequencing is still too expensive to deeply sequence the many samples being studied for this
project. However, any particular region of the genome generally contains a limited number of
haplotypes. Data can be combined across many samples to allow efficient detection of most of the
variants in a region. The Project currently plans to sequence each sample to about 4X coverage; at
this depth sequencing cannot provide the complete genotype of each sample, but should allow the
detection of most variants with frequencies as low as 1%. Combining the data from 2500 samples
should allow highly accurate estimation (imputation) of the variants and genotypes for each sample
that were not seen directly by the light sequencing.
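Why light sequencing of many samples still finds variants at 1% frequency can be sketched with the same Poisson reasoning. The sketch below is a back-of-the-envelope illustration under stated assumptions, not the Project's actual variant-calling procedure: each chromosome copy is read at half the sample depth, the number of copies carrying the variant is fixed at its expected value, and a variant counts as detected once a minimum number of reads (the hypothetical parameter `min_reads`) carry it.

```python
import math

def prob_variant_seen(n_samples: int, allele_freq: float,
                      depth: float, min_reads: int = 1) -> float:
    """P(at least `min_reads` reads carry the variant, pooled over all samples)."""
    carrier_copies = 2 * n_samples * allele_freq   # expected chromosome copies with the variant
    mean_reads = carrier_copies * depth / 2.0      # each copy is read at depth / 2 on average
    # Poisson probability of observing fewer than min_reads variant reads
    below = sum(math.exp(-mean_reads) * mean_reads ** k / math.factorial(k)
                for k in range(min_reads))
    return 1.0 - below

# One heterozygous sample at 4X versus 2500 samples at 4X with a 1% variant
single = prob_variant_seen(n_samples=1, allele_freq=0.5, depth=4, min_reads=3)
pooled = prob_variant_seen(n_samples=2500, allele_freq=0.01, depth=4, min_reads=3)
print(f"single sample: {single:.3f}, pooled: {pooled:.6f}")
```

In this simplified picture a single 4X sample often fails to show a heterozygous variant in three or more reads, whereas pooling 2500 samples yields about 100 variant-carrying reads on average, which makes detection almost certain.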
[Definition of haplotype:
A haplotype in genetics is a combination of alleles (DNA sequences) at adjacent locations (loci) on
the chromosome that are transmitted together. A haplotype may be one locus, several loci, or an
entire chromosome depending on the number of recombination events that have occurred between a
given set of loci.]
The data now available to scientists contains 99% of all genetic variants that occur in the
populations studied, down to the level of rare variations that only occur in 1 out of every 100
people. "The whole point of this resource is that we're moving to a point where individuals are
being sequenced in clinical settings and what you want to do there is sift through the variants you
find in an individual and interpret them," said Professor Gil McVean of Oxford University, a lead
author for the study.
The information will be pored over by thousands of researchers, who will analyse and interpret the
DNA variations between people in a bid to work out which ones are implicated in disease. In
addition to the DNA sequences, the 1,000 Genomes Project has stored cell samples from all the
people it has sequenced, to allow future scientific projects to look at the biological effect of the
DNA variations they might want to study. One way to link DNA variations to disease is a genome-wide association study (GWAS).
Genome wide association studies (GWAS): Any two human genomes differ in millions of
different ways. There are small variations in the individual nucleotides of the genomes (SNPs) as
well as many larger variations, such as deletions, insertions and copy number variations. Any of
these may cause alterations in an individual's traits, or phenotype, which can be anything from
disease risk to physical properties such as height. In a genetic association study one asks if the allele
of a genetic variant is found more often than expected in individuals with the phenotype of interest
(e.g. with the disease being studied).
Figure: Overview of a genome-wide association study, from W. Gregory Feero et al., 2010, The New England
Journal of Medicine.
The most common approach of GWA studies is the case-control setup which compares two large
groups of individuals, one healthy control group and one case group affected by a disease. All
individuals in each group are genotyped for the majority of common known SNPs. The exact
number of SNPs depends on the genotyping technology, but is typically one million or more. For
each of these SNPs it is then investigated if the allele frequency is significantly altered between the
case and the control group. In such setups, the fundamental unit for reporting effect sizes is the odds
ratio. The odds ratio is the ratio between two odds, which in the context of GWA
studies are the odds of carrying a specific allele in the case group and the
odds of carrying the same allele in the control group. When the allele frequency in
the case group is much higher than in the control group, the odds ratio will be higher than 1, and
vice versa for lower allele frequency. Additionally, a P-value for the significance of the odds ratio is
typically calculated. Finding odds ratios that are significantly different from 1 is the objective of the
GWA study because this shows that a SNP is associated with disease.
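The case-control comparison for a single SNP can be sketched as follows. This is a minimal illustration, not a full GWAS pipeline: the allele counts are invented, the function name is hypothetical, and the P-value comes from a plain Pearson chi-square test on the 2x2 table of allele counts (one degree of freedom, no continuity correction), which is only one of several tests used in practice.

```python
import math

def allelic_association(case_alt: int, case_ref: int,
                        ctrl_alt: int, ctrl_ref: int):
    """Return (odds_ratio, p_value) for a 2x2 table of allele counts."""
    # Odds of carrying the allele in cases divided by the odds in controls
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    # Pearson chi-square statistic for a 2x2 table
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / (
        (case_alt + case_ref) * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, p_value

# Invented example: 1000 cases and 1000 controls, i.e. 2000 alleles per group
odds_ratio, p_value = allelic_association(600, 1400, 500, 1500)
print(f"odds ratio = {odds_ratio:.3f}, P = {p_value:.2e}")
```

Because a GWA study repeats this test for about a million SNPs, the significance threshold applied to each individual P-value has to be far stricter than the usual 0.05.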
There are several variations to this case-control approach. A common alternative to case-control
GWA studies is the analysis of quantitative phenotypic data, e.g. height or biomarker concentrations
or even gene expression.
In addition to the calculation of association, it is common to take several variables into account that
could potentially confound the results. Sex and age are common examples of this. Moreover, it is
also known that many genetic variations are associated with the geographical and historical
populations in which the mutations first arose. Because of this association, studies must take
account of the geographical and ethnic background of participants by controlling for what is
called population stratification.
Key to the application of GWA studies were the International HapMap Project and the 1000
Genomes Project, which made it possible to identify the majority of the common SNPs that are customarily
interrogated in a GWA study.
1.6. Comparative genomics
1.6.1 Overview
The basic idea of comparative genomics is the comparison of sequences between genomes.
Sequence alignment methodologies form the basic tools for comparative genomics (BLAST, ClustalW,
Needleman-Wunsch, Markov models, …) (see chapter sequence alignment).
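As a taste of what such alignment tools compute, the sketch below scores a global alignment of two sequences by dynamic programming in the style of Needleman-Wunsch. It is a simplified illustration only: the scoring values (match +1, mismatch -1, linear gap penalty -2) are arbitrary assumptions, and only the optimal score is returned, not the alignment itself.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of sequences a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b costs i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("GATTACA", "GATTCA"))
```

Keeping only two rows of the score matrix is enough when just the score is needed; recovering the alignment would require storing the full matrix and tracing back through it.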
1) Comparative genomics can be used to aid or validate gene predictions: Since gene prediction
methods based on sequence features only (ab initio gene prediction) are only partially accurate,
gene identification is facilitated by high-throughput sequencing of partial cDNA copies of
expressed genes (called expressed sequence tags or ESTs). Presence of ESTs confirms that the
predicted gene is transcribed. More thorough sequencing of full-length cDNA clones may be
necessary to confirm the structure of genes. Gene prediction methods that take into account not only
sequence features (codon usage, intron-exon recognition sites) but also sequence homology (with
ESTs, cDNAs, proteins) are called extrinsic gene finding methods (and are in fact a combination of
structural and comparative genomics). An example is the GeneWise method discussed in chapter
gene prediction.
2) Homology based annotation: The amino acid sequence of proteins encoded by the predicted
genes can be used as a query sequence in a database similarity search. A match of a predicted
protein sequence to one or more database sequences not only serves to validate the gene prediction,
but also can give indications on the function of the gene.
Not all genes will give hits in database searches. Some proteins might be unique for a certain
organism or might not have been characterized before. In such cases it might also be important to
search for characteristic domains (conserved amino acid patterns that can be aligned) that represent
a structural fold or a biochemical feature (see chapter pattern searches).
3) Another important goal of comparative genomics is the study of protein families i.e., all proteins
are compared in two or more proteomes. Orthologs are genes that are so highly conserved by
sequence in different genomes that the proteins they encode are strongly predicted to have the same
structure and function and to have arisen from a common ancestor through speciation. Paralogs
have arisen by duplication events and might have a distinct function (see chapter comparative
genomics). Highly similar proteins (both orthologs and paralogs) form protein families. They can
be identified by reciprocal blast searches or by cluster analysis (see COGs). In related organisms
both gene content of the genome and gene order on the chromosome are likely to be conserved. As
the relatedness between organisms decreases, local groups of genes remain clustered, but
chromosomal rearrangements move the clusters to other locations (see chapter comparative
genomics). Evolutionary modeling at genome level can therefore include the following analyses: 1)
the prediction of chromosomal rearrangements, 2) the analysis of duplications at the level of protein
domain, gene, chromosome or full genome, and 3) the search for horizontal transfer between separate
organisms.
4) Phylogenetic footprinting is still another application of comparative genomics. Currently we
have very limited information about regulatory elements especially in complex eukaryotes.
Comparison of orthologous chromosomal regions in reasonably distantly related species should lead
to identification of common regulatory elements (see chapter motif detection).
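The reciprocal BLAST search mentioned under point 3 can be sketched in a few lines. The example assumes the BLAST searches have already been run in both directions and reduced to a score per (query, subject) pair; the protein identifiers and scores are invented, and the function names are hypothetical. Two proteins are reported as a candidate ortholog pair when each is the other's best-scoring hit.

```python
def best_hits(scores: dict) -> dict:
    """Map every query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score, e.g. a BLAST bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b: dict, b_vs_a: dict) -> list:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Invented scores for proteins of genome A searched against genome B, and vice versa
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 180, ("a3", "b2"): 200}
b_vs_a = {("b1", "a1"): 245, ("b2", "a3"): 198, ("b2", "a2"): 175}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here a2 also hits b2, but b2 prefers a3, so a2 is left without a reciprocal partner; such cases often point to paralogs created by duplication.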
1.6.2 Biological application 1: comparative genomics, genome evolution (Y. Van de Peer)
STUDYING GENOME EVOLUTION
Fossil records of plant evolution
The extant (or modern) angiosperms did not appear until the Early Cretaceous (145–125 Mya),
when the final combination of these three angiosperm features occurred, as supported by evidence
from micro- and macrofossils and clear documentation of all of the major lines of flowering
plants. This diversification of angiosperms occurred during a period (the Aptian, 125–112 Mya;
Figure 1) when their pollen and megafossils were rare components of terrestrial floras and species
diversity was low. Angiosperm fossils show a dramatic increase in diversity between the Albian
(112–99.6 Mya) and the Cenomanian (99.6–93.5 Mya) at a global scale (Figure 1).
The angiosperm radiation yielded species with new growth architectures and new ecological roles.
Early angiosperms had small flowers with a limited number of parts that were probably pollinated
by a variety of insect taxa but specialized for none. Accordingly, Cenomanian flowers do not yet
provide strong evidence for specialization of pollination syndromes. However, by the Turonian
(93.5–89.3 Mya), flowering plants had a wide variety of features that are, in extant species, closely
associated with several types of specialized insect pollination and with high species diversity within
angiosperm subclades. The evolution of larger seed size in many angiosperm lineages during the
early Cenozoic (from 65 Mya) indicates that animal-mediated dispersal and shade-tolerant
life-history strategies became important at that time. In summary, fossils with affinities to diverse angiosperm lineages, including
monocots, are all found in Early Cretaceous floras. However, the question remains why this was
such a decisive time in the evolution of plants. Could whole-genome duplication events have played a
key role in the origin of angiosperms and their morphological and ecological diversification?
Figure: Evolution of angiosperms based on fossil records. Labels recovered from the figure: extant (or
modern) angiosperms (130 Mya); dramatic increase in diversity (Aptian); unspecialised flowering
plants; specialised flowering plants.
Why was this such a decisive time in plant evolution?
Evidence:
* Many angiosperms have experienced one or more episodes of polyploidy in their ancestry.
  o Duplicated genes and genomes can provide the raw material for evolutionary
    diversification and the functional divergence of duplicated genes.
  o The dates of the duplication events correspond to time periods of large expansion in
    angiosperms, as recorded in the fossil record.
* Genes involved in transcriptional regulation and signal transduction have been preferentially
retained following genome duplications. Similarly, developmental genes have been
observed to be retained following genome duplications, particularly following the two oldest
events (1R and 2R). Few regulatory and developmental gene duplicates appear to have
survived small-scale duplication events. Their rapid loss can be explained by the fact that
transcription factors and genes involved in signal transduction tend to show a high dosage
effect in multicellular eukaryotes. The expression of a wide range of genes regulated by
these proteins shows major perturbations when only one regulatory component is duplicated,
rather than all components that govern a certain pathway. Furthermore, the fact that transcription
factors and kinases are often active as protein complexes and must be present in
stoichiometric quantities for their correct functioning is congruent with their high retention
rate following whole-genome rather than small-scale duplication events. Regulatory and
developmental genes are thought to have been of primordial importance for the evolution of
morphological complexity in plants and animals.
* Genes involved in secondary metabolism or responses to biotic stimuli, such as pathogen
attack, tend to be preserved regardless of the mode of duplication. Because plants are sessile
organisms, secondary metabolite pathways, as well as genes governing responses to biotic
stimuli, are crucial to the development of survival strategies against herbivores, insects,
snails and plant pathogens. Additionally, in angiosperms, anthocyanins and other secondary
metabolites give rise to colourful and scented flowers that attract pollen- and nectar-collecting animals. Thus, secondary metabolite diversification might have led to more
efficient seed dispersal (compared with wind pollination, which is widespread in most seed
plants) and might have provided new possibilities for reproductive isolation and the
elevation of speciation rates. The finding that genes involved in secondary metabolism and
responses to biotic stimuli are also strongly retained following (continuously occurring)
small-scale gene duplications might reflect the continuous interaction between plants and
animals, fungi or plant pathogens imposing a constant need for adaptation.
* By contrast, genes involved in responses to abiotic stress, such as drought, cold and salinity,
appear to have been only moderately retained after small-scale gene duplication events [17],
indicating that they might have been required at more specific times in evolution, such as
during major environmental changes or adaptation to new niches. Interestingly, 1R and 2R
might have occurred during a period of increased tectonic activity linked to highly elevated
atmospheric CO2 levels.
Bioinformatics methods used
This work belongs to the field of comparative genomics and phylogeny. The methodologies are mainly based on sequence alignment
and phylogenetic tree construction.
1.6.3 Biological application 2: metagenomics (G. Venter)
METAGENOMICS: DNA SEQUENCING OF ENVIRONMENTAL SAMPLES
Nature Reviews Genetics 6, 805-814 (2005); doi:10.1038/nrg1709
Although genomics has classically focused on pure, easy-to-obtain samples, such as microbes that
grow readily in culture or large animals and plants, these organisms represent only a fraction of the
living or once-living organisms of interest. Many species are difficult to study in isolation because
they fail to grow in laboratory culture, depend on other organisms for critical processes, or have
become extinct. Methods that are based on DNA sequencing circumvent these obstacles, as DNA
can be isolated directly from living or dead cells in various contexts. Such methods have led to the
emergence of a new field, which is referred to as metagenomics.
* DNA sequencing can provide insights into organisms that are difficult to study because they
are inaccessible by conventional methods such as laboratory culture. Examples include
organisms that exist only in tight association with other organisms, including various obligate
symbionts and pathogens, members of natural microbial consortia and an extinct cave bear.
* Isolation and sequencing of DNA from mixed communities of organisms (metagenomics) has
revealed surprising insights into diversity and evolution.
* Partially assembled or unassembled genomic sequence from complex microbial communities
has revealed the existence of novel and environment-specific genes.
The application of high-throughput shotgun sequencing to environmental samples has recently
provided global views of those communities not obtainable from 16S rRNA or BAC clone-sequencing
surveys. The sequence data have also posed challenges to genome assembly, which
suggests that complex communities will demand enormous sequencing expenditure for the
assembly of even the most predominant members.
However, for metagenomic data, this complete assembly may not always be necessary or feasible.
Determining the proteins encoded by a community, rather than the types of organisms producing
them, suggests a means to distinguish samples on the basis of the functions selected for by the local
environment and reveals insights into features of that environment.
For instance, examination of higher-order processes reveals known differences in energy
production (e.g., photosynthesis in the oligotrophic waters of the Sargasso Sea and starch and
sucrose metabolism in soil) or in population density and interspecies communication
(overrepresentation of conjugation systems, plasmids, and antibiotic biosynthesis in soil) (Fig. 4,
lower left). The predicted metaproteome, based on fragmented sequence data, is sufficient to
identify functional fingerprints that can provide insight into the environments from which microbial
communities originate. Information derived from extension of the comparative metagenomic
analyses performed here could be used to predict features of the sampled environments such as
energy sources or even pollution levels.
Metagenomics databases are currently being set up, for instance:
http://www.megx.net/index.php?navi=EasyGenomes
EXAMPLE G. VENTER SARGASSO SEA
Boston (04/16/04)—This Spring, J. Craig Venter is sailing around the French Polynesian Islands
scooping up bucketfuls (figuratively) of seawater in an ambitious voyage to sample microbial
genomes found in the world's oceans. His 95-foot yacht, Sorcerer II, has been outfitted with all
manner of technical equipment to accommodate the task, as well as a few surfboards should that
opportunity arise.
Venter and colleagues report finding 1.2 million genes, including almost 70,000 entirely novel
genes, from an estimated 1,800 genomic species, including 148 novel bacterial phylotypes. This
diversity is staggering and to a large extent unexpected. "We chose the Sargasso seas because it was
supposed to be a marine desert," says Venter wryly. "The assumption was low diversity there
because of the extremely low nutrients." His team sequenced a total of 1.045 billion base pairs of
non-redundant sequence. At the height of the work, "over 100 million letters of genetic code were
sequenced every 24 hours." The results have been deposited in GenBank. You can go and search for
them.
PALEOGENOMICS
Mammoth genome
A very recent application is the use of metagenomics approaches to sequence the mammoth
genome:
* Usually mitochondrial genomes are sequenced from extinct species, as mitochondrial DNA is
abundantly present in eukaryotic cells and thus easier to sequence. In permafrost settings, theoretical
calculations predict DNA fragment survival up to 1 million years (11, 12). When preserved
in such conditions, sequencing of genomic DNA is still possible.
* 1 g of bone was used to extract DNA, which was subsequently used for library construction
and sequencing with a technology that had recently become available (13, 19).
* The mammalian fraction dominated the identifiable fraction of the metagenome.
* Nonvertebrate eukaryotic and prokaryotic species occur at approximately equal ratios, with a
paucity of fungal species and nematodes.
* Hits against grass species outnumber the ones from Brassicales by a ratio of 3:1, which
could be indicative of ancient pastures on which the mammoth is believed to have grazed.
From Poinar et al., Science 2006.
Ancient salt crystals
Bacteria have been found associated with a variety of ancient samples; however, few studies
are generally accepted due to questions about sample quality and contamination. In 1995, Cano and Borucki
isolated a strain of Bacillus sphaericus from an extinct bee trapped in 25-30 million-year-old amber. More recently, a report about the isolation of a 250 million-year-old halotolerant
bacterium from a primary salt crystal has been published. Halite crystals from the dissolution pipe
at the 569 m level of the Salado Formation were taken for sampling. A fluid volume of 9 µl was
recovered from an inclusion in the crystal and inoculated into two different media: casein-derived
amino acids medium (CAS) and glycerol-acetate medium (GA). Only the CAS enrichment yielded
the bacterium, designated 2-9-3.
Once the bacterium has been isolated, the next step in the research is to establish its taxonomic
classification. Two important genotypic markers widely used in recent bacterial taxonomy are the
16S rRNA gene sequence data and DNA-DNA hybridization data. Many researchers reported the
correlation between 16S rRNA gene sequence similarity values and genomic DNA relatedness. It
has been proposed that phenotypically related bacterial strains showing 70% or greater genomic
DNA relatedness constitute a single bacterial species. In contrast, those having <70% but >20%
similarity are considered to be different species within a genus.
These analyses showed that the organism was most similar to Bacillus marismortui (99%
similarity S) and Virgibacillus pantothenticus (97.5% S). Phylogenetic analysis showed that isolate
2-9-3 is part of a distinct lineage within the larger Bacillus cluster.
Additional info
Metagenomics and industrial applications. Nat Rev Microbiol. 2005 Jun;3(6):510-6. Review.
The metagenomics of soil. Nat Rev Microbiol. 2005 Jun;3(6):470-8. Review.
http://www.megx.net/index.php?navi=EasyGenomes
http://www.bio-itworld.com/news/041604_report4889.html
1.7. Functional genomics & Systems Biology
1.7.1 Systems biology
Systems biology is a field that originated in the early 1990s: it stems from ‘molecular biology’ but reflects a novel
holistic way of thinking: understanding complex biological phenomena in their entirety. In systems
biology a cell is considered as a system that interacts with its environment. It receives dynamically
changing environmental cues and transduces these signals into the observed behavior (phenotype or
dynamically changing physiological responses). This signal transduction is mediated by the
regulatory network (below). Genetic entities (proteins), located at the top of a regulatory cascade, are
activated by external cues. They further transduce the signal downstream in the cascade via protein-protein interactions, chemical modifications of intermediate proteins, etc., into transcriptional
activation and subsequent translation. Ultimately, these processes turn the genetic code into
functional entities, the proteins. The action of regulatory networks determines how well cells can
react or adapt to novel conditions. This signaling network in a cell can be compared with the
electronic circuitry on a microchip. It also consists of individual components (often called
modules). Systems biology is the science that tries to decode the design principles of biological
systems. It can be used for both fundamental and applied purposes. A typical example of a
fundamental application is the domain of evolutionary systems biology which has as a goal
studying the impact of network rewiring on adaptive behavior and organism evolution.
Figure: The cell as a signal transduction system. The signaling circuitry can be considered as
having a modular composition, in which each part forms an individual functional unit.
1.7.2 Synthetic biology from an engineering point of view: rational design
The major difference between ‘genetic engineering/biotechnology’ and ‘synthetic biology’ is
a novel mindset: the idea of rational design. Synthetic biology relies on the identification,
reuse or adaptation of existing parts of systems to construct reduced systems tailored to an aim
whose starting assumptions might be very different from those of the natural system. The idea of
using parts stems from the parallel between electronic circuits and biological systems. Each
component of the system can be seen as an individual transistor. By combining the different
signaling components a circuitry can be designed that has operational characteristics or that gives
rise to functionalities which do not occur as such in nature.
The premise of synthetic biology is thus built on the modularity of signal transduction pathways.
This modularity allows artificial biological systems to be constructed and synthesized by
combining “microchip design principles” with libraries of molecular modules to obtain a desired
microbial functionality. According to this vision, synthetic biology should be able to rely on a list
of standardized parts (amino acids, bases, proteins, genes, circuits, cells, etc.) whose properties have
been characterized quantitatively and on software modeling tools that would help put parts
together to create a new biological function.
The idea behind the MIT ‘Registry of Standard Biological Parts’ (http://parts.mit.edu) is that, as
more libraries of parts are constructed, and provided that all these parts are well documented
and standardized, one can eventually select an appropriate part directly from the library. The
tedious step of making a mutant library or synthesizing all possible sequences and characterizing
their input and output characteristics can then be omitted.
Currently the Registry is a collection of ~3200 genetic parts that can be mixed and matched to build
synthetic biology devices and systems. Founded in 2003 at MIT, the Registry is part of the
Synthetic Biology community's efforts to make biology easier to engineer. It provides a resource of
available genetic parts to iGEM teams and academic labs.
Current challenges in synthetic biology:
The premise of synthetic biology is built on the modularity of signal transduction pathways.
Artificial biological systems are synthesized by combining parts with desired functionalities and
kinetic behavior, as predicted by a model-based design. However, the generation of parts with
the proper characteristics is still very laborious and ad hoc (large libraries are made randomly, see
figure below, and all parts within such a library need to be characterized experimentally).
A fundamental systems understanding of how regulation or a certain kinetic behavior is encoded
could further rationalize the design of modules and contribute to better standardization (that is, an
understanding of the key features in the primary sequence that drive a specific expression behavior: motifs, motif
spacing, nucleosome positioning, etc.).
Modeling is the key to both systems and synthetic biology (see figure below). For systems biology
modeling aims at getting a fundamental understanding of the host cellular behavior, while for
synthetic biology ‘model-based design’ is used to determine the circuit topology and its parameters
subject to predefined design requirements. Such design requirements should not only consider
desired input/output characteristics (linear, oscillating behavior, bistability), but also take into
account the ease with which certain parts can be manipulated in the lab. A challenging task is
making design models that determine design parameters conditioned on systems properties of the
global cellular system.
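What a design model with a bistable input/output characteristic can look like is illustrated by the toy simulation below. It is a sketch under strong assumptions and is not taken from this course: two genes repress each other (a circuit often called a genetic toggle switch), their protein levels u and v are expressed in dimensionless units, and the parameters alpha (maximal synthesis rate) and n (cooperativity) are chosen arbitrarily. The equations are integrated with a simple Euler scheme.

```python
def simulate_toggle(u0: float, v0: float, alpha: float = 10.0, n: float = 2.0,
                    dt: float = 0.01, steps: int = 5000):
    """Euler integration of two mutually repressing genes.

    du/dt = alpha / (1 + v**n) - u
    dv/dt = alpha / (1 + u**n) - v
    """
    u, v = u0, v0
    for _ in range(steps):
        du = alpha / (1.0 + v ** n) - u   # synthesis repressed by v, first-order decay
        dv = alpha / (1.0 + u ** n) - v   # synthesis repressed by u, first-order decay
        u, v = u + dt * du, v + dt * dv
    return u, v

# Same circuit and parameters, two different starting points
print(simulate_toggle(5.0, 1.0))  # settles with u high and v low
print(simulate_toggle(1.0, 5.0))  # settles with u low and v high
```

The two runs end in different steady states although the circuit is identical, which is what bistability means. In model-based design the parameters would be tuned until such a requirement is met, before any part is built in the lab.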
HETEROGENEOUS EXPERIMENTAL DATA SOURCES
HOW WILL WE ANALYZE THE DATA SIMULTANEOUSLY?