Overview and manual annotation of a eukaryotic gene

Module 6. Genome annotation: Overview and manual annotation of a eukaryotic gene
Background
The remaining modules focus on how genes are characterized in a newly sequenced genome, a
process called structural gene annotation. Concepts explored include the cell, the molecular basis of
heredity, and evolution. Analytical tools will be used to explore the fine structure of genes in the flag
rockfish Sebastes rubrivinctus, the connection between gene structure and cellular function,
and the connection between function and evolutionary conservation of gene sequences. Genomic
studies provide a comprehensive catalog of the basic genetic information that underpins the
structures and functions responsible for an organism’s survival, evolution, and interactions with other
organisms of the same or different species. New genomes are being sequenced at an increasing rate,
leaving vast quantities of orphaned data that can be explored in authentic research experiences.
In this investigation, participants will be given five raw segments of DNA (i.e. genomic scaffolds) from
the Sebastes rubrivinctus genome project. To inform their gene annotations, participants will
search for evidence available online in the form of similar proteins and RNAs from closely related
organisms, as well as humans. This “extrinsic” evidence will be combined with “intrinsic” signals in the
DNA itself (signals that direct the cellular apparatus through transcription and translation) to devise
gene models from raw DNA. Participants will answer a basic question: what does a Sebastes
rubrivinctus gene look like? The answer may depend on the extrinsic evidence each participant is able to
find. Participants will perform structural gene annotation by hand, use a gene annotation pipeline to
perform structural gene annotation, assign a putative function to the gene, describe how the gene is
important to the organism’s development, reproduction, and/or survival, and examine alignments of
gene sections from many different vertebrates using the UCSC Genome Browser.
This lab manual assumes that each participant has some familiarity with cell and molecular biology and
access to a computer and the World Wide Web. The background information covered is not meant to be
an exhaustive treatment of these topics but rather reviews the specific information necessary to
perform and understand the exercises.
Review of Some Basic Molecular Biology
The “Central Dogma of Biology” describes information flow in biological systems. It states that DNA
makes RNA, which makes proteins. Transcription is the process of turning DNA into RNA. Translation is
the process of turning RNA into proteins. We will focus our analysis on predicting protein coding genes
from raw genome sequence in a marine fish that has recently been sequenced. Only messenger RNA
(mRNA) genes make, or code for, proteins. Ribosomal genes are transcribed into ribosomal RNAs
(rRNA), transfer RNA genes produce tRNA molecules, and many other RNA genes do not code for
proteins. Most DNA is found in the nucleus, where it is transcribed; the resulting mRNA is then exported
from the nucleus to the cytoplasm for translation.
The flag rockfish Sebastes rubrivinctus is a member of a diverse marine fish assemblage with an
estimated 102 species native to the west coast of North America. Rockfishes of the genus Sebastes
support important commercial and recreational fisheries on the west coast of North America and are
the dominant assemblage on most cold temperate reefs. These live-bearers have a low intrinsic rate of
population increase and highly sporadic recruitment, releasing large numbers of pelagic larvae into a
variable coastal environment. Their slow growth rates render them vulnerable to overfishing.
Uncertainty in the success of any particular year class has favored an evolutionary strategy whereby
some species of the Sebastes genus have extremely long lifespans and do not show signs of aging
(negligible senescence), while others demonstrate typical aging patterns and have short lifespans. S.
rubrivinctus has a maximum lifespan of only 18 years, but it is closely related to S. nigrocinctus, which
has a maximum age of at least 116 years. Researchers expect to gain insight into the genetic
mechanism for negligible senescence by sequencing and comparing the genomes of these two species.
Transcription and Eukaryotic Gene Structure
Most cell functions involve chemical reactions. Food molecules taken into cells react to provide the chemical
constituents needed to synthesize other molecules. Both breakdown and synthesis are made possible by a
large set of protein catalysts, called enzymes. The breakdown of some of the food molecules enables the cell to
store energy in specific chemicals that are used to carry out the many functions of the cell. Cells store and use
information to guide their functions. The genetic information stored in DNA is used to direct the synthesis of
thousands of proteins that each cell requires. Cell functions are regulated. Regulation occurs both through
changes in the activity of the functions performed by proteins and through the selective expression of
individual genes. This regulation allows cells to respond to their environment and to control and coordinate cell
growth and division. (National Science Education Standards pg. 184)
In all organisms, the instructions for specifying the characteristics of organisms are carried in DNA, a large
polymer formed from subunits of four kinds (A, G, C, and T). The chemical and structural properties of DNA
explain how the genetic information that underlies heredity is both encoded in genes (as a string of molecular
“letters”) and replicated (by a templating mechanism). Each DNA molecule in a cell forms a single
chromosome. (National Science Education Standards pg. 185)
The structure of a gene dictates how cellular proteins will interact with it to transcribe a messenger
RNA and translate that mRNA into a protein. Some of the important details of DNA and a gene are
listed below:
• Each cell contains the same genome sequence, but different genes are transcribed into mRNAs
by different cells and by the same cells at different times.
• DNA is double stranded, and each strand has a 5’ end and a 3’ end, pronounced “five prime” and
“three prime.” It is read from 5’ to 3’, and the two strands run in opposite directions: the 5’ end of
one strand pairs with the 3’ end of its partner.
5' ATGGCGT TGCCATA CCCGCAT CCCTGAT 3'
3' TACCGCA ACGGTAT GGGCGTA GGGACTA 5'
• Eukaryotic genes have several different parts that orchestrate the cellular processes of
transcription and translation (see Figure below):
o Promoter
o Five Prime UnTranslated Region (5’ UTR)
o Coding sequence(s)
o Intron(s)
o Three Prime UnTranslated Region (3’ UTR)
• The promoter is a region of DNA, up to a few hundred bp long, immediately upstream of the
transcription start point. Transcription factors (proteins) bind to it and recruit RNA
polymerase for transcription.
• Exons are the sections of transcribed DNA that remain in the mature mRNA, which is exported
from the nucleus after introns have been removed and the ends of the transcript stabilized. A gene may
have one exon or several exons separated by intervening sequences called introns. The
beginning of the first exon and the end of the last exon are typically untranslated.
• The Five Prime Untranslated Region (5’ UTR) is the part of the mature mRNA immediately
upstream of the coding sequence. The corresponding DNA may contain introns. Before translation can
start, the ribosome binds to the modified 5’ end of the 5’ UTR after export to the cytoplasm.
• Coding sequences (CDS) are the sections of exons that are ultimately translated by
ribosomes into amino acid sequences once the mature mRNA is exported to the cytoplasm. The
protein-coding region of DNA begins with the start codon ATG and ends with one of the stop codons
TAG, TGA, or TAA. Note that the process of transcription copies Ts (thymines) as Us (uracils).
• Introns are non-coding regions between exons. Introns usually begin with GT and end with AG (or
the reverse complements, when the gene lies on the opposite strand), which is known as the “GT/AG rule”.
They are excised from pre-mRNAs in the nucleus by proteins that recognize these sequences. Their
excision is part of the pre-mRNA processing step that also includes adding a 5’ cap (a modified guanine)
and a 3’ poly-A tail (a long stretch of As) to the mRNA for message stability.
• The Three Prime Untranslated Region (3’ UTR) is the region of the gene after the stop codon.
Once it is transcribed, a poly-A tail is added that helps to stabilize the mRNA. These
regions sometimes have binding sites for microRNAs that, when bound, mark the mRNA for
breakdown.
• Intergenic DNA is the DNA between genes. It still contains recognizable DNA elements:
o Enhancers and silencers are sequence elements in the DNA that are more distant from
the transcription start point than promoters yet still regulate transcription by
determining what kinds of molecules will be able to bind to the DNA. “Regulation”
refers to turning transcription “up or down”.
o Short tandem repeats (a.k.a. microsatellites) are stretches of repeated nucleotide
motifs from one to 30 bp long. For example, the sequence ACACACACACACAC contains the
motif AC repeated seven times.
o Dispersed repeats are repeats that do not occur in tandem. These are often ancient
viral DNA elements that have incorporated themselves into other organisms’ genomes,
have made copies of themselves over the course of evolutionary time (tens of millions of
years), and have lost their ability to become full viruses and infect their host. Dispersed
repeats can also be found within introns of genes and must be identified to keep gene
finders from characterizing them as exons belonging to native genes.
o Tandem repetitive DNA can serve important purposes, such as protecting the ends of
chromosomes (telomeres) and serving as binding sites during cell division at
the constriction point of a chromosome (the centromere).
o Most genome sequences are also contaminated by some bacterial and human sequences.
This illustrates a major point of scientific research: researchers must be constantly
“on guard” for a myriad of problems that obscure the truth.
• One gene can encode different proteins in different cell types by including or excluding different
exons in a process known as alternative splicing. Different cell types are formed during development
of the organism in a process called differentiation. After differentiation, cell types have
different proteins present that direct the cell to use the DNA in a way that suits the function of
the cell. Exons are combined in different ways to form alternative mRNA products.
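The GT/AG rule and alternative splicing described in the list above can be illustrated with a minimal Python sketch. The function name `splice`, the `(start, end)` intron format, and the toy sequence are all assumptions made for illustration; they are not part of any annotation tool, and the sketch ignores UTRs, the 5’ cap, and the poly-A tail.

```python
def splice(gene_dna, introns):
    """Remove introns from coding-strand DNA and return the mature mRNA sequence.

    `introns` is a list of (start, end) pairs, 0-based with `end` exclusive
    (a hypothetical format chosen for this sketch).
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        intron = gene_dna[start:end]
        # GT/AG rule: introns usually begin with GT and end with AG.
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"intron {start}-{end} does not follow the GT/AG rule")
        exons.append(gene_dna[position:start])  # exon upstream of this intron
        position = end
    exons.append(gene_dna[position:])  # last exon
    # Transcription copies Ts as Us.
    return "".join(exons).replace("T", "U")


# Toy gene (not real data): three exons separated by two introns.
toy_gene = "ATGGCC" + "GTAAGTCCAG" + "AAGGAC" + "GTTTTAG" + "TTCTGA"

# Splice out both introns: exons 1, 2, and 3 are joined.
full_mrna = splice(toy_gene, [(6, 16), (22, 29)])

# Alternative splicing: skip exon 2 by removing everything from the start
# of intron 1 to the end of intron 2 as a single unit.
skipped_mrna = splice(toy_gene, [(6, 29)])

print(full_mrna)     # AUGGCCAAGGACUUCUGA
print(skipped_mrna)  # AUGGCCUUCUGA
```

Both products come from the same DNA; only the choice of which segments to remove differs, which is the essence of alternative splicing.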
How DNA becomes a protein
DNA gene (5’ to 3’): Promoter | Exon 1 [5’ UTR | CDS 1: ATG…] | Intron [GT … AG] | Exon 2 [CDS 2: …TGA | 3’ UTR]
Transcription
Pre-mRNA: Exon 1 [AUG…] | Intron [GU … AG] | Exon 2 […UGA]
RNA processing: removal of introns, addition of 5’ cap and poly-A tail
Mature mRNA: 5’ cap | AUG… …UGA | poly-A tail (AAAAAAAAAAAAAAAAAAAAAA)
Translation
Protein
Translation
mRNA codes for amino acids (the building blocks of proteins) in stretches of three nucleotides called
codons. Each codon specifies either an amino acid or a stop signal that ends
translation. tRNAs carry specific amino acids and interact with ribosomes to deliver the specified
amino acid to the growing chain of amino acids (called a polypeptide). The translation of nucleotide to
amino acid sequences follows a standard genetic code (below).
Genetic Code. The first base of each codon is given at the left, the second base across the top, and the third base at the right.
        U              C              A              G
U   UUU Phe (F)    UCU Ser (S)    UAU Tyr (Y)    UGU Cys (C)    U
    UUC Phe (F)    UCC Ser (S)    UAC Tyr (Y)    UGC Cys (C)    C
    UUA Leu (L)    UCA Ser (S)    UAA Stop       UGA Stop       A
    UUG Leu (L)    UCG Ser (S)    UAG Stop       UGG Trp (W)    G
C   CUU Leu (L)    CCU Pro (P)    CAU His (H)    CGU Arg (R)    U
    CUC Leu (L)    CCC Pro (P)    CAC His (H)    CGC Arg (R)    C
    CUA Leu (L)    CCA Pro (P)    CAA Gln (Q)    CGA Arg (R)    A
    CUG Leu (L)    CCG Pro (P)    CAG Gln (Q)    CGG Arg (R)    G
A   AUU Ile (I)    ACU Thr (T)    AAU Asn (N)    AGU Ser (S)    U
    AUC Ile (I)    ACC Thr (T)    AAC Asn (N)    AGC Ser (S)    C
    AUA Ile (I)    ACA Thr (T)    AAA Lys (K)    AGA Arg (R)    A
    AUG Met (M)    ACG Thr (T)    AAG Lys (K)    AGG Arg (R)    G
G   GUU Val (V)    GCU Ala (A)    GAU Asp (D)    GGU Gly (G)    U
    GUC Val (V)    GCC Ala (A)    GAC Asp (D)    GGC Gly (G)    C
    GUA Val (V)    GCA Ala (A)    GAA Glu (E)    GGA Gly (G)    A
    GUG Val (V)    GCG Ala (A)    GAG Glu (E)    GGG Gly (G)    G
Use the genetic code to translate the following DNA: ATG TTG CGA TGA
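The codon lookup described in this section can also be sketched in Python. The names `CODON_TABLE` and `translate` are hypothetical, the table is built from the standard genetic code, and the example deliberately uses a different sequence from the exercise above so that the exercise is left to the reader.

```python
BASES = "UCAG"
# One-letter amino acids in table order (first, second, third base each
# cycling through U, C, A, G); "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(sequence):
    """Translate a coding sequence (DNA or RNA, spaces allowed) into one-letter amino acids."""
    # Transcription copies Ts as Us, so convert DNA letters to RNA letters first.
    rna = sequence.upper().replace(" ", "").replace("T", "U")
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":  # a stop codon ends translation and adds no amino acid
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATG GCC AAG TAA"))  # MAK (Met-Ala-Lys, then stop)
```

The sketch assumes the sequence starts in frame at the first codon; it does not search for the start codon.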
Gene Annotation
Long stretches of sequence that translate into amino acids without a stop codon (known as open reading
frames), start codons, intron/exon boundaries, and other gene-specific features provide clues as to where
genes are located in eukaryotic genomes. Finding genes in prokaryotic genomes is much easier because
there are no introns and there is little intergenic DNA. For eukaryotes, annotating tens of thousands of
genes by hand is an extremely time-consuming process, in particular because the DNA has to be examined
on both strands (forwards and backwards) and from each of the three possible starting points (reading
frames) on each strand. Initial gene prediction is now accomplished by computers, but the predictions
still need to be hand-checked for quality by biologists for genes of highest
interest.
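To make the idea of reading DNA "forwards and backwards, and for different starting points" concrete, here is a minimal Python sketch that scans all six reading frames for open reading frames. The function names, the `min_codons` threshold, and the toy sequences are assumptions for illustration only. The sketch ignores introns, so in a eukaryotic genome its output is only one clue among several and not a gene model.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the partner strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=3):
    """Return (strand, frame, start, end, sequence) for each ATG-to-stop ORF.

    Positions are 0-based on the strand being read, with `end` exclusive.
    `min_codons` counts the stop codon and is a hypothetical cutoff.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):  # three starting points per strand
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence (not real data) with one short ORF on the forward strand.
print(find_orfs("CCATGGCCAAGTAACC"))
# [('+', 2, 2, 14, 'ATGGCCAAGTAA')]

# The same ORF is found on the "-" strand when the partner strand is supplied.
print(find_orfs("GGTTACTTGGCCATGG"))
# [('-', 2, 2, 14, 'ATGGCCAAGTAA')]
```

Real gene finders combine ORF scans like this with splice-site signals and extrinsic evidence, which is why hand-checking remains necessary.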
Goals

Understand requisite background information on the cell and the molecular basis of heredity.
GCAT-SEEK sequencing requirements
None
Computer/program requirements for data analysis
None
Protocols
None
Assessment
The following is the nucleotide sequence of the human β-globin gene. Gene regions are indicated as
follows: Untranscribed, Transcribed but not translated, Transcribed and Translated, Conserved
Sequence. Nucleotide positions in the DNA are shown in the right-hand margin.
ATATCTTAGA GGGAGGGCTG AGGGTTTGAA GTCCAACTCC TAAGCCAGTG    1-50
CCAGAAGAGC CAAGGACAGG TACGGCTGTC ATCACTTAGA CCTCACCCTG    51-100
TGGAGCCACA CCCTAGGGTT GGCCAATCTA CTCCCAGGAG CAGGGAGGGC    101-150
AGGAGCCAGG GCTGGGCATA AAAGTCAGGG CAGAGCCATC TATTGCTTAC    151-200
ATTTGCTTCT GACACAACTG TGTTCACTAG CAACCTCAAA CAGACACCAT    201-250
GGTGCACCTG ACTCCTGAGG AGAAGTCTGC CGTTACTGCC CTGTGGGGCA    251-300
AGGTGAACGT GGATGAAGTT GGTGGTGAGG CCCTGGGCAG GTTGGTATCA    301-350
AGGTTACAAG ACAGGTTTAA GGAGACCAAT AGAAACTGGG CATGTGGAGA    351-400
CAGAGAAGAC TCTTGGGTTT CTGATAGGCA CTGACTCTCT CTGCCTATTG    401-450
GTCTATTTTC CCACCCTTAG GCTGCTGGTG GTCTACCCTT GGACCCAGAG    451-500
GTTCTTTGAG TCCTTTGGGG ATCTGTCCAC TCCTGATGCT GTTATGGGCA    501-550
ACCCTAAGGT GAAGGCTCAT GGCAAGAAAG TGCTCGGTGC CTTTAGTGAT    551-600
GGCCTGGCTC ACCTGGACAA CCTCAAGGGC ACCTTTGCCA CACTGAGTGA    601-650
GCTGCACTGT GACAAGCTGC ACGTGGATCC TGAGAACTTC AGGGTGAGTC    651-700
TATGGGACCC TTGATGTTTT CTTTCCCCTT CTTTTCTATG GTTAAGTTCA    701-750
TGTCATAGGA AGGGGAGAAG TAACAGGGTA CAGTTTAGAA TGGGAAACAG    751-800
ACGAATGATT GCATCAGTGT GGAAGTCTCA GGATCGTTTT AGTTTCTTTT    801-850
ATTTGCTGTT CATAACAATT GTTTTCTTTT GTTTATTCTT GCTTTCTTTT    851-900
TTTTTCTTCT CCGCAATTTT TACTATTATA CTTAATGCCT TAACATTGTG    901-950
TATAACAAAA GGAAATATCT CTGAGATACA TTAAGTAACT TAAAAAAAAA    951-1000
CTTACACAGT CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT    1001-1050
TTGCATATTC ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA    1051-1100
CATAATCATT ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG    1101-1150
TACACATATT GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA    1151-1200
TGCTTTCTTC TTTTAATATA CTTTTTGTTT ATCTTATTTC TAATACTTTC    1201-1250
CCTAATCTCT TTCTTTCAGG GCAATAATGA TACAATGTAT CATGCCTCTT    1251-1300
TGCACCATTC TAAAGAATAA CAGTGATAAT TTCTGGGTTA AGGCAATAGC    1301-1350
AATATTTCTG CATATAAATA TTTCTGCATA TAAATTGTAA CTGATGTAAG    1351-1400
AGGTTTCATA TTGCTAATAG CAGCTACAAT CCAGCTACCA TTCTGCTTTT    1401-1450
ATTTTATGGT TGGGATAAGG CTGGATTATT CTGAGTCCAA GCTAGGCCCT    1451-1500
TTTGCTAATC ATGTTCATAC CTCTTATCTT CCTCCCACAG CTCCTGGGCA    1501-1550
ACGTGCTGGT CTGTGTGCTG GCCCATCACT TTGGCAAAGA ATTCATCCCA    1551-1600
CCAGTGCAGG CTGCCTATCA GAAAGTGGTG GCTGGTGTGG CTAATGCCCT    1601-1650
GGCCCACAAG TATCACTAAG CTCGCTTTCT TGCTGTCCAA TTTCTATTAA    1651-1700
AGGTTCCTTT GTTCCCTAAG TCCAACTACT AAACTGGGGG ATATTATGAA    1701-1750
GGGCCTTGAG CATCTGGATT CTGCCTAATA AAAAACATTT ATTTTCATTG    1751-1800
CAATGATGTA TTTAAATTAT TTCTGAATAT TTTACTAAAA AGGGAATGTG    1801-1850
GGAGGTCAGT GCATTTAAAA CATAAAGAAA TGATGAGCTG TTCAAACCTT    1851-1900
GGGAAAATAC ACTATATCTT AAACTCCATG AAAGAA                   1901-1936
1. Without looking, redraw the sketch of a eukaryotic gene, pre-mRNA, and mature mRNA in the space
below. Include one intron. Then correct your own work.
Gene
Pre-mRNA
Mature mRNA
2. Circle and label the following in the sequence of β-globin (opposite page) and in the sketch of the gene above.
A. Transcription start point
B. Transcription end point
C. Translation start point
D. Translation end point
3. Explain what evidence (i.e. signals and annotations from the DNA sequence) you used to support your
answers to (2).
4. Based on the sequence information provided above, draw a sketch of the human β-globin gene below,
labeling the promoter, 5’ UTR, coding sequences, introns, and 3’ UTR. Make sizes roughly proportional to the
sequence itself.
5. What would be the first three and the last three amino acids produced from the mRNA transcribed from this
gene? Note: a stop codon does not produce an amino acid.
6. Is a mutation more likely to disrupt the function of the protein produced if it occurs in an intron or
coding sequence? Explain.
Time line of module
One hour of lecture.
Discussion topics for class
See assessment
Relevant background lecture topics include basic molecular biology (replication, transcription,
translation), genome structure, fine structure of a gene, regulation of genes, and cited literature.
References
Further Reading: Any genetics textbook