Genome Sequence Acquisition

Bioc 300: Bioinformatics
www.geneticsplace.com
Goals of the Course
•Understand Methods and Research Questions
•Analyze Real Data
•Engage in a Realistic Learning Environment
•Utilize Online Databases
•Appreciate Complexity of Research Systems
•Integrate Different Types of Information
•Reconsider Cells as Intracellular Ecosystems
•Integrate Bioinformatics with Biology
What is bioinformatics?
"Bioinformatics is the term coined for the new field that
merges biology, computer science, and information
technology to manage and analyze the data, with the
ultimate goal of understanding and modeling living
systems."
Genomics and Its Impact on Medicine and Society - A 2001 Primer U.S.
Department of Energy Human Genome Program
Bioinformatics also represents a paradigm shift for
molecular biology. Instead of taking a reductionist
approach, the sub-disciplines of bioinformatics are more
expansionist: they attempt to study the entire
complement of a particular cellular molecule or process.
The “omics” revolution

Genomics:
The study of the entire DNA complement of an organism
Genome Sequence Information
Basic Research
•Acquiring Sequence
•Human Genome Draft
•Evolution
Applied Research
•Identification of Biological Unknowns
•Biomedical Research
Genomic Variations
Ecology
•Tracking Ivory Sales
•Diatoms and Global Warming
Human Variations
•SNPs
•Disease Analysis
Ethics
•GMOs
•Genetic Testing
DNA Microarrays
Basic Research
•Introduction to Method
•Data Analysis
Applied Research
•Cancer
•Pharmacogenomics
The “omics” revolution

Proteomics
The study of the entire set of proteins in a particular cell type
Proteomics
Cellular Roles
Protein-Protein Interactions
permission from Benno Schwikowski
permission from Stan Fields
Identification and Quantification
permission from Stan Fields
The “omics” revolution

Transcriptomics
The study of all mRNA transcripts in a particular cell type

Metabolomics
The study of all metabolites in a particular cell type

Glycomics
The study of all polysaccharides in a particular cell type

Variomics
The study of all possible drug targets in a particular cell type
Genomic Circuits
Single Gene Circuit
Toggle Switches
www.bio.davidson.edu/courses/genomics/circuits.html
Integrated Circuits
Sequencing of Whole Genomes
Three Phases of Genome Sequencing:
1) Preliminary sequencing
2) Finishing
3) Annotating
Preliminary sequencing
1970s
Maxam-Gilbert sequencing (chemical cleavage)
Sanger sequencing (dideoxy method)

Autorad
You could sequence hundreds of bases per day!
Genomics “took off” with automated sequencing
1990s
Leroy Hood made modifications to dideoxy sequencing:

ddNTPs were coupled to fluorescent dyes (instead of radioactivity)
DNA fragments were separated via capillary gel electrophoresis
Sequence was read by lasers, and the data were recorded directly into a computer
Now, instead of an autorad, we have a chromatogram (a “chromat”)!
The newest DNA sequencers can determine millions
of bases of sequence in a day!
The increasing ease of obtaining sequence data
has led to exponential growth of GenBank, the
main repository of sequence data, which is housed
at the National Library of Medicine at NIH.
Growth of GenBank
Sequencing Entire Organisms
Before the 1990s, sequencing was
somewhat haphazard. Depending on the
researcher, different pieces of different
organisms’ genomes had been sequenced.
No concerted effort had been made to
sequence the entire genome of an organism.
HUGO changed all of that. Its mission was
to sequence the human genome, as well as the
genomes of a number of model organisms.
While small genomes could be sequenced
directly, larger genomes were first mapped out.
Mapping large genomes
Sequencers needed some reference sequences to
know what part of a genome they were dealing with.
STSs - sequence-tagged sites
These are defined by a pair of PCR primers that
amplify only one segment of a genome (i.e., a unique
sequence).
ESTs - expressed sequence tags
These are short sequences of cDNA that indicate
where genes are located within the genome.
Now genomes could be cut into pieces,
sequenced, and the pieces reassembled.
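The STS definition above (a primer pair that amplifies exactly one segment) can be checked in silico. The Python sketch below is illustrative only: the function names, the `max_len` cutoff, and the toy sequences are assumptions, and it considers only exact primer matches on the top strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def amplicons(genome: str, fwd: str, rev: str, max_len: int = 1000) -> list[tuple[int, int]]:
    """Return (start, end) for every segment flanked by the forward primer
    and the binding site of the reverse primer, up to max_len bases long."""
    genome, fwd = genome.upper(), fwd.upper()
    rev_site = reverse_complement(rev)  # how the reverse primer's site reads on the top strand
    hits = []
    start = genome.find(fwd)
    while start != -1:
        end = genome.find(rev_site, start + len(fwd))
        while end != -1 and end + len(rev_site) - start <= max_len:
            hits.append((start, end + len(rev_site)))
            end = genome.find(rev_site, end + 1)
        start = genome.find(fwd, start + 1)
    return hits


def is_sts(genome: str, fwd: str, rev: str, max_len: int = 1000) -> bool:
    """A primer pair defines an STS only if it yields exactly one amplicon."""
    return len(amplicons(genome, fwd, rev, max_len)) == 1


# Toy example (made-up sequence): one copy of the site is an STS, two copies are not.
toy = "TTT" + "ACGTAC" + "CCCCC" + "GGATTC" + "TTT"
print(is_sts(toy, "ACGTAC", "GAATCC"))
print(is_sts(toy * 2, "ACGTAC", "GAATCC"))
```

A real check would also allow mismatches and search both strands; this version only shows why uniqueness is the defining property.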
Cutting up genomes
Vectors designed to carry large pieces of DNA include:
BACs - bacterial artificial chromosomes - can carry
about 150 kb of insert
YACs - yeast artificial chromosomes - can carry up
to 1.5 Mb of insert
BACs or YACs containing overlapping DNA can be
assembled into contiguous overlapping fragments.
“Shotgun” sequencing
While HUGO was busy mapping large genomes and
sequencing some small genomes, Craig Venter
founded TIGR.
TIGR took a completely different approach. Instead
of mapping a genome, they simply cut it into
thousands of pieces, sequenced the pieces, and
reassembled the data using overlapping fragments.
It was TIGR, not HUGO, that produced the
world’s first completed genome of a free-living
organism in 1995: Haemophilus influenzae.
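The shotgun idea (cut, sequence, reassemble from overlaps) can be illustrated with a toy greedy assembler. This is a minimal sketch under strong assumptions: reads are error-free, all come from the same strand, and no read is contained inside another. The function names and the `min_len` parameter are hypothetical, and this is not the assembler TIGR used.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b
    (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.
    Sequences that never overlap are returned as separate contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

Greedy merging is easy to follow but can be misled by repeats, which is one reason repetitive regions are hard to assemble.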
Finishing a Genomic Sequence




A “finished” sequence is defined as one that
contains no more than 1 error in 10,000 bases.
Finishing a sequence involves aligning a number
of preliminary sequences and correcting any
inconsistencies.
Overlapping segments are combined into larger
assemblies of contiguous DNA (contigs).
If contigs do not overlap, a gap remains in the
sequence.
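As a rough illustration of "aligning a number of preliminary sequences and correcting any inconsistencies", the sketch below takes reads that are already aligned and padded with `-`, then calls each position by majority vote. The padding convention, the `N` symbol for uncovered positions, and the function name are assumptions; real finishing also weighs base-quality scores.

```python
from collections import Counter


def consensus(aligned_reads: list[str], gap: str = "-") -> str:
    """Majority-vote consensus over reads padded to equal length.
    Positions covered by no read are reported as 'N'."""
    length = len(aligned_reads[0])
    if any(len(r) != length for r in aligned_reads):
        raise ValueError("reads must be padded to the same length")
    out = []
    for column in zip(*aligned_reads):
        bases = [b for b in column if b != gap]
        # most_common(1) picks the most frequent base; ties go to the base seen first
        out.append(Counter(bases).most_common(1)[0][0] if bases else "N")
    return "".join(out)


reads = [
    "ATGGCGTA----",
    "--GGCCTACC--",  # disagrees with the other two reads at one position
    "----CGTACCTG",
]
print(consensus(reads))  # ATGGCGTACCTG
```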
Finishing continued




The human “draft” sequence, published in 2001,
contained 147,821 gaps.
The “finished” sequence, published in 2004,
contained 341 gaps.
A gap usually contains highly repetitive DNA
that complicates attempts to clone and sequence it.
Finishing is a very expensive process, so many
genomes have not been finished.
Annotating Genomes



Annotation involves the identification of
functionally important sections of a genome.
This includes, but is not limited to, making an
educated guess about what kind of protein is
encoded by a given coding sequence.
Annotation is performed using various computer
programs.
Locating genes within a genome
Process is different in prokaryotes vs. eukaryotes
Prokaryotes contain ORFs with no introns and very
little intergenic sequence.
 Eukaryotes contain introns, complex promoters, and
enhancers
Introns range between 70 and 30,000 bp
One eukaryotic gene can encode more than one
protein via alternative splicing mechanisms
Eukaryotes also contain pseudogenes, ORFs which have
been rendered nonfunctional by mutation
Mammalian genomes contain about 23% pseudogenes
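For the prokaryotic case (ORFs with no introns), a naive ORF scan is easy to sketch. The Python below looks for ATG-to-stop stretches in all six reading frames. It is a simplified illustration, not how GeneMark, GenScan, or Glimmer work; the `min_codons` threshold and function names are assumptions, and it would not handle eukaryotic introns.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[str, int, str]]:
    """Return (strand, frame, orf) for each ATG...stop stretch.
    The codon count includes both the start and the stop codon."""
    orfs = []
    top = seq.upper()
    for strand, s in (("+", top), ("-", reverse_complement(top))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))  # [('+', 2, 'ATGAAATTTTAG')]
```

In practice `min_codons` would be set much higher (for example 100) to reduce chance ORFs.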

Tools for gene hunting



GeneMark - originally created for prokaryotes but
adapted for some model eukaryotes
GenScan - accepts up to 1 million bp of sequence
online, more if downloaded
Glimmer & GlimmerM - developed by TIGR,
accepts up to 200 kb online, more if downloaded
Once a genome is annotated…


One can use a genome browser to locate specific
loci on specific chromosomes
One can then use resources such as GeneCards to
find out more about a specific gene
Progress of Genome Sequencing
Sequenced Euk. Genomes
Yeast
Drosophila
C. elegans
Arabidopsis
Mosquito
Human
Mouse
Rat
Chicken
Dog
Zebrafish
Euk. Genomes in Progress
Xenopus
Cow
Cat
Horse
Kangaroo
Honey Bee
Turkey
Lobster
Bat
Hedgehog
and others…
Tools had to be developed to make sense of the
flood of genomic data being produced
Genomic Search Engines include:
 BLAST - searches sequence information, either
nucleotide (BLASTn) or protein (BLASTp)
 BLAST2 - aligns two sequences, checking similarity
 Entrez - searches databases for textual information
 PubMed - searches scientific literature for text
 ORF finder - finds Open Reading Frames (candidate genes)
 PREDATOR - predicts secondary structure of proteins
 ExPASy - analysis of protein sequence and structure
as well as 2D gel information
Calculating E (expect) values
E-values measure the “significance” of a
match; the smaller the E-value, the better.
E-values are calculated using:
1) S, the bit score, a measure of the similarity
between the hit and the query
2) m, the length of the query
3) n, the size of the database
$E = m n \, 2^{-S}$
So, how do you get the bit score?
S is calculated from the raw score, R
$R = aI + bX - cO - dG$
where I is the # of identities, X is the # of mismatched nucleotides, O is the # of gaps, and G is the # of spaces in the gaps.
a, b, c, and d are the rewards and penalties for
each of these variables.
The defaults for a, b, c, and d are set at
1, -3, 5, and 2, respectively.
These values can be changed on the “Other advanced”
line.
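As a worked example of the raw score formula, the short Python sketch below plugs in the default rewards and penalties listed above. The function name and the sample alignment counts are made up for illustration.

```python
def raw_score(identities: int, mismatches: int, gaps: int, gap_spaces: int,
              a: int = 1, b: int = -3, c: int = 5, d: int = 2) -> int:
    """R = aI + bX - cO - dG, with the defaults given on the slide."""
    return a * identities + b * mismatches - c * gaps - d * gap_spaces


# Hypothetical alignment: 50 identities, 2 mismatches, 1 gap spanning 3 spaces.
# R = 1*50 + (-3)*2 - 5*1 - 2*3 = 50 - 6 - 5 - 6 = 33
print(raw_score(50, 2, 1, 3))  # 33
```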
Now that we have a raw score, the bit score
can be obtained by normalizing the data:
$S = (\lambda R - \ln K) / \ln 2$
(where $\lambda$ and K are the normalizing parameters)
These parameters are printed at the bottom of a
BLAST report.
Normalization enables a direct comparison of E-values and bit scores, even if the reward and penalty
variables have been changed by the user.
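The chain from raw score to bit score to E-value can be written out directly. In this sketch, `lam` stands for the normalizing parameter lambda and `K` for the second normalizing parameter. The numeric values used in the example call are placeholders chosen for illustration; the real ones should be read from the bottom of the BLAST report.

```python
import math


def bit_score(raw: float, lam: float, K: float) -> float:
    """S = (lambda * R - ln K) / ln 2"""
    return (lam * raw - math.log(K)) / math.log(2)


def e_value(m: int, n: int, S: float) -> float:
    """E = m * n * 2^(-S), with m = query length and n = database size."""
    return m * n * 2.0 ** (-S)


# Placeholder parameters, for illustration only.
S = bit_score(raw=33, lam=1.37, K=0.711)
print(round(S, 1), e_value(m=500, n=1_000_000, S=S))

# Each additional bit of score halves the expected number of chance hits.
print(e_value(100, 1000, 10), e_value(100, 1000, 11))
```

Because E depends on m and n, the same bit score yields a larger E-value against a larger database.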
More databases of interest:
 SwissProt - protein sequence database
 PDB - contains protein structural information
 OMIM - catalogs human disease genes
 TIGR - many searchable genomes, esp. bacterial ones
 GeneCards - genomic, proteomic and phenotypic info.
 UniGene - catalogs human ESTs
 Human map viewer - shows chromosomal location of
genes
Protein structure and function
For most researchers, the final goal of genomic
research is not the genomic data itself but an
understanding of the proteins encoded by a genome.
Steps to determining protein structure and function:
 Find ORFs, or coding sequences (CDSs)
 Translate ORFs
 Is this a known protein? If not, find protein
orthologs, similar proteins in different species
 Check if 3D structure has been determined
 Predict hydropathy using a Kyte-Doolittle plot
 Predict secondary structure of your protein
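The hydropathy step in the list above can be sketched as a sliding-window average over the Kyte-Doolittle hydropathy scale. The window size of 9 and the function name are assumptions for illustration (longer windows are typically used when looking for membrane-spanning segments), and the example peptide is made up.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 9) -> list[float]:
    """Mean hydropathy of each window of residues along the protein.
    Positive values are hydrophobic, negative values hydrophilic."""
    scores = [KD[aa] for aa in protein.upper()]
    if window > len(scores):
        return []
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


# Made-up peptide: a hydrophobic stretch flanked by charged residues.
profile = hydropathy_profile("KKDDEILLVVAAIILLKKDDE", window=9)
print([round(x, 2) for x in profile])
```

Plotting the profile against window position gives the familiar Kyte-Doolittle plot.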
What do we mean by function?
The term “function” is too simplistic and is
somewhat outdated. A consortium called
“Gene Ontology” decided that a complete
description of function must include not only
“why?” but also “what?” and “where?”



Why = biological process. The objective toward
which this protein contributes.
What = molecular function. The biochemical
activity that the protein accomplishes.
Where = cellular component. The location of
protein activity.
One example:
isocitrate dehydrogenase (IDH)
OMIM - IDH3A
 COG - functional categories, dendrograms,
isoforms - distinct genes encoding similar proteins
 Enzyme Commission, “EC” numbers
 Swiss-Prot
 Phylogenetic trees
rooted vs. unrooted

Terms used to describe phylogeny
paralogs - genes which arose from a common ancestral
gene within one species (isoforms)
orthologs - genes from two organisms which arose
from a common ancestral gene
synteny - genetic loci located on the same chromosome
(or multiple genetic loci from different species
which are located on a chromosomal region of
common ancestry)
homology - sequences which are similar due to a
common evolutionary origin
similarity or identity - terms used to describe sequences
without regard to evolutionary relationships

Searching for related proteins
PSI-BLAST allows one to search outward in a
spiraling pattern from a central starting point.
First iteration- finds proteins with similar sequences.
Second iteration- can be performed using a consensus
sequence computed from your first iteration.
More iterations can be performed as desired.
Or, one can choose a species and perform another first
iteration using the results of the original search.
This approach can be used to annotate ORFs
from a newly sequenced genome
Alternative Splicing
60% of human genes produce more than 1 mRNA
Only about 22% of genes in C. elegans fit into this category
Epigenetic Control
It is not just the coding regions which matter.
Methylation, such as that found in heterochromatin
and CpG islands, also plays a role in gene expression.
At any given time, there are 400,000 mC in a
given cell. Since there are about 100
different human cell types, this totals 40
million methylation events in our methylome.
Nonmammalian animals lack this form of
epigenetic control.
The # of CpG islands correlates with
the # of genes on a chromosome
CpGs are usually associated with genes
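One common way to flag a candidate CpG island is to compare the observed number of CpG dinucleotides with the number expected from the C and G content. The thresholds below (length of at least 200 bp, GC content above 50%, observed/expected above 0.6) are widely used criteria but are not stated on the slides, so treat them and the function names as assumptions.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC content, observed/expected CpG ratio) for a sequence.
    Expected CpG count = (#C * #G) / length."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the length, GC-content, and obs/exp thresholds described above."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_ratio


print(cpg_stats("ACGT"))                  # (0.5, 4.0)
print(looks_like_cpg_island("CG" * 100))  # True
print(looks_like_cpg_island("AT" * 100))  # False
```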
Imprinting
About 20 mammalian genes are known to be
methylated during gametogenesis in either the
paternal or maternal copy.
Imprinting may represent a “genetic tug-of-war” between male and female interests.
For example, the insulin-like growth factor 2,
Igf2, is expressed only from the paternal allele. Igf2
promotes the growth of the developing embryo.
The expression of its receptor, Igf2r, is
controlled by the maternally inherited allele.
Expression of Paternal Allele of Igf2 in embryo and
placenta
How does silencing work?
What is the effect of a loss of imprinting?

Loss of Igf2 imprinting can lead to colorectal
cancer and Beckwith-Wiedemann Syndrome

There is a cluster of CpG islands in an
insulator region near Igf2
CTCF is a protein which only binds to
unmethylated DNA.
17/20 tumor samples taken from cancer
patients were found to be hypermethylated
in this region.
What about the rest of our genome?
Since only 1-2% of our genome is coding
sequence what does the rest do?
A majority of our DNA is repetitive sequence
 There are 5 classes of repetitive sequence:
1) transposon derived
2) pseudogenes
3) simple repeats such as VNTRs
4) segmental duplications
5) heterochromatic regions

The first category alone accounts for 45% of our
genome!
Transposons
Transposons fall into 4 categories:
1) SINEs, short interspersed elements, such as Alu,
comprise 13% of our genome
These may help a cell cope with stress: RNA
produced from them binds to an inhibitor of
translation.
2) LINEs, long interspersed elements, comprise 21%
of our genome
3) LTR retrotransposons comprise 8% of our genome
4) Other DNA transposons comprise 3% of our genome
More Transposon Facts

About 50 genes appear to be derived from
transposons, including RAG1 and RAG2,
necessary for antibody diversity.
The X chromosome has the highest
concentration of transposons; one 525 kb
section is 89% transposon-derived.
The Y chromosome has the highest
concentration of LINEs; it is the most gene-poor
of the chromosomes and probably
tolerates insertions well.