Student Edition - Lycoming College

Decoding the rockfish genome: An introduction to modern genomics and marine biology
Target Audience: AP high school biology students or lower division undergraduate college students.
Vince Buonaccorsi
Juniata College
OUTLINE
Introduction
Lab 1: How do eukaryotic genes work?
Lab 2: What can comparison of DNA sequences tell us about evolution?
Lab 3: How can I use bioinformatics to annotate genes?
Lab 4: What does a typical rockfish gene look like?
Introduction
This investigation focuses on how genes are found in a newly sequenced genome. This process is called
structural gene annotation. Concepts explored include the cell, the molecular basis of heredity, and
evolution. Students will use online analytical tools to explore the fine structure of genes in the flag
rockfish Sebastes rubrivinctus, the connection between gene structure and cellular functions,
and the connection between function and evolutionary conservation of gene sequences. Genomics and
bioinformatics are dynamic fields well-suited for capturing the imagination of students in inquiry-driven
classroom efforts. Genomic studies provide a comprehensive catalog of the basic genetic information in a
system that underpins the structures and functions responsible for an organism’s survival, evolution, and
interactions with other organisms of the same or different species. New genomes are being sequenced
at an increasing rate, leaving vast quantities of orphaned data that can be explored in authentic research
experiences.
In this investigation, students will be given raw segments of DNA (i.e. genomic scaffolds) from the
Sebastes rubrivinctus genome project. In order to inform their gene annotations, students will search
for evidence available online in the form of similar proteins and RNAs from closely related organisms, as
well as humans. This “extrinsic” evidence will be combined with “intrinsic” signals in the DNA itself
(signals that direct the cellular apparatus through transcription and translation) to devise gene models
from raw DNA. Students will be answering the basic question: what does a Sebastes rubrivinctus gene
look like? The answer may depend on the extrinsic evidence the student is able to find. Students will
perform structural gene annotation by hand, examine alignments of gene sections from many different
vertebrates using the UCSC genome browser, use a gene annotation pipeline to perform structural gene
annotation, assign a putative function to the gene, and describe how the gene is important to the
organism’s development, reproduction, and/or survival.
This example highlights some elements of 9-12 National Science Education Content Standards A and C,
Science as Inquiry, and Life Science. Students will obtain the means necessary to perform and
understand scientific inquiry. The primary life science standards covered include: the cell, the molecular
basis of heredity, and biological evolution. This lab manual assumes that each student has some
familiarity with cell and molecular biology and access to a computer and the World Wide Web. The
background information covered is not meant to be an exhaustive treatment of these topics, but rather,
reviews the specific information necessary to perform and understand the exercises.
Lab 1. How do eukaryotic genes work?
Goal: To give a basic understanding of genes and their functions.
Review of Some Basic Molecular Biology:
The “Central Dogma of Biology” describes information flow in biological systems. It states that DNA
makes RNA, which makes proteins. Transcription is the process of turning DNA into RNA. Translation is
the process of turning RNA into proteins. We will focus our analysis on predicting protein coding genes
from raw genome sequence in a marine fish that has recently been sequenced. Only messenger RNA
(mRNA) genes make, or code for, proteins. Ribosomal genes are transcribed into ribosomal RNAs
(rRNA), transfer RNA genes produce tRNA molecules, and many other RNA genes do not code for
proteins. Most DNA is found in the nucleus, where it is transcribed; the resulting mRNA is then exported from the
nucleus to the cytoplasm for translation.
The flag rockfish Sebastes rubrivinctus is a member of a diverse marine fish assemblage with an
estimated 102 species native to the west coast of North America. Rockfishes of the genus Sebastes
support important commercial and recreational fisheries on the west coast of North America and are
the dominant assemblage on most cold temperate reefs. These live-bearers have a low intrinsic rate of
population increase and highly sporadic recruitment, releasing large numbers of pelagic larvae into a
variable coastal environment. Their slow growth rates render them vulnerable to overfishing.
Uncertainty in the success of any particular year class has favored an evolutionary strategy whereby
some species of the Sebastes genus have extremely long lifespans and do not show signs of aging
(negligible senescence), while others demonstrate typical aging patterns and have short lifespans. S.
rubrivinctus has a maximum lifespan of only 18 years, but it is closely related to S. nigrocinctus, which
has a maximum age of at least 116 years. Researchers are expecting to gain insight into the genetic
mechanism for negligible senescence by sequencing and comparing the genomes of these two species.
Transcription and Eukaryotic Gene Structure:
Most cell functions involve chemical reactions. Food molecules taken into cells react to provide the chemical
constituents needed to synthesize other molecules. Both breakdown and synthesis are made possible by a
large set of protein catalysts, called enzymes. The breakdown of some of the food molecules enables the cell to
store energy in specific chemicals that are used to carry out the many functions of the cell. Cells store and use
information to guide their functions. The genetic information stored in DNA is used to direct the synthesis of
thousands of proteins that each cell requires. Cell functions are regulated. Regulation occurs both through
changes in the activity of the functions performed by proteins and through the selective expression of
individual genes. This regulation allows cells to respond to their environment and to control and coordinate cell
growth and division. (National Science Education Standards pg. 184)
In all organisms, the instructions for specifying the characteristics of organisms are carried in DNA, a large
polymer formed from subunits of four kinds (A, G, C, and T). The chemical and structural properties of DNA
explain how the genetic information that underlies heredity is both encoded in genes (as a string of molecular
“letters”) and replicated (by a templating mechanism). Each DNA molecule in a cell forms a single
chromosome. (National Science Education Standards pg. 185)
The structure of a gene dictates how cellular proteins will interact with it to transcribe a messenger
RNA and translate that mRNA into a protein. Some of the important details of DNA and a gene are
listed below:
• Each cell contains the same genome sequence, but different genes are transcribed into mRNAs
by different cells and by the same cells at different times.
• DNA is double stranded, and each strand has 5’ and 3’ ends, pronounced “five prime” and “three prime.” It is
read from 5’ to 3’, and the two strands run in opposite directions: the 5’ end of one is paired with the 3’ end of its
partner.
5' ATGGCGT TGCCATA CCCGCAT CCCTGAT 3'
3' TACCGCA ACGGTAT GGGCGTA GGGACTA 5'
• Eukaryotic genes have several different parts that orchestrate the cellular processes of
transcription and translation (see Figure below):
o Promoter
o Five Prime UnTranslated Region (5’ UTR)
o Coding sequence(s)
o Intron(s)
o Three Prime UnTranslated Region (3’ UTR)
• The Promoter is a region of DNA, up to a few hundred bp long, immediately upstream of the
transcription start point. Transcription factors (proteins) bind to it and recruit RNA
polymerases for transcription.
• Exons are the sections of a gene whose transcribed copies remain in the mature mRNA that is
exported from the nucleus, after introns have been removed and the ends of the transcript
stabilized. A gene may have one exon or several exons separated by intervening sequences called
introns. The beginning of the first exon and the end of the last exon are untranslated.
• The Five Prime Untranslated Region (5’ UTR) is the part of the mature mRNA immediately
upstream of the coding sequence. In the gene, this region may be interrupted by introns. Before
translation can start, the ribosome binds to the modified 5’ end of the 5’ UTR after export to the cytoplasm.
• Coding sequences (CDS) of DNA are ultimately the sections of exons that are translated by
ribosomes into amino acid sequences once the mature mRNA is exported to the cytoplasm. The
protein coding region of DNA begins with the start codon ATG and ends in one of the stop codons TAG,
TGA, or TAA. Note that the process of transcription copies Ts (thymines) as Us (uracils).
• Introns are non-coding regions between exons. Introns usually begin with GT and end in AG (or
the reverse complements on the opposite strand) in what is known as the “GT/AG rule”. They are excised from pre-mRNAs
in the nucleus by proteins that recognize these sequences. Their excision is part of the pre-mRNA
processing step that also includes adding a 5’ cap (a modified guanine) to the mRNA and a
3’ poly-A tail (a long stretch of As) for message stability.
• The Three Prime Untranslated Region (3’ UTR) is the region of DNA after the stop codon in a
gene. Once it is transcribed, a poly-A tail is added that helps to stabilize the mRNA. These
regions sometimes have binding sites for microRNAs that, when present, signal the mRNA for
breakdown.
• Intergenic DNA is the DNA between genes. There are still recognizable DNA elements in
intergenic DNA:
o Enhancers and silencers are sequence elements in the DNA that are more distant from
the transcription start point than promoters yet still regulate transcription by
determining what kinds of molecules will be able to bind to the DNA. “Regulation”
refers to turning transcription “up or down”.
o Short tandem repeats (a.k.a. microsatellites) are stretches of repeated nucleotide
motifs from one to 30 bp. For example, the sequence ACACACACACACAC contains the
motif AC repeated seven times.
o Dispersed repeats are repeats that do not occur in tandem. These are often ancient
viral DNA elements that have incorporated themselves into other organisms’ genomes,
have made copies of themselves over the course of evolutionary time (10s of millions of
years), and have lost their ability to become full viruses and infect their host. Dispersed
repeats can also be found within introns of genes and must be identified to keep gene
finders from characterizing them as exons belonging to native genes.
o Tandem repetitive DNA can serve important purposes, such as protecting the ends of
chromosomes (telomeres) and serving as binding sites during cell division at
the constriction point of a chromosome (centromere).
o Most genome sequences are also inadvertently contaminated by bacterial and human genes.
This illustrates a major point of scientific research: researchers must be constantly
“on guard” for a myriad of problems that obscure the truth.
• One gene can encode different proteins in different cell types by including or excluding different
exons in a process known as alternative splicing. Different cell types are formed in a process
called differentiation, during development of the organism. After differentiation, cell types have
different proteins present that direct the cell to use the DNA in a way that suits the function of
the cell. Different combinations of exons are joined to form alternative mRNA products.
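Two ideas from the list above, strand complementarity and short tandem repeats, can be tried out in a few lines of Python. This is only a sketch for exploration and is not part of the original lab: the function names reverse_complement and count_tandem_repeat are made up for this example, and the sequences are the ones shown in the text.

```python
# Hypothetical helper names; the sequences are the examples used in the text.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(dna: str) -> str:
    """Return the partner strand, written in its own 5' to 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))

def count_tandem_repeat(dna: str, motif: str) -> int:
    """Longest run of back-to-back copies of motif found anywhere in dna."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    for start in range(len(dna)):
        copies = 0
        pos = start
        while dna.startswith(motif, pos):
            copies += 1
            pos += len(motif)
        best = max(best, copies)
    return best

top_strand = "ATGGCGTTGCCATACCCGCATCCCTGAT"   # top strand from the text, 5' to 3'
print(reverse_complement(top_strand))          # bottom strand, read 5' to 3'
print(count_tandem_repeat("ACACACACACACAC", "AC"))
```

Reading the first printed line backwards gives the bottom strand exactly as it is drawn in the text (3' to 5').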
Figure: How DNA becomes a protein
DNA gene (5’ to 3’): Promoter, Exon 1 (5’ UTR and CDS 1, beginning ATG…), Intron (GT…AG), Exon 2 (CDS 2, ending …TGA, and 3’ UTR)
Transcription
Pre-mRNA: AUG…, intron (GU…AG), …UGA
RNA processing: removal of introns, addition of 5’ cap and poly-A tail
Mature mRNA: 5’ cap, AUG… …UGA, poly-A tail (AAAAAAAAAAAAAAAAAAAAAA)
Translation
Protein
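The flow from gene to mature mRNA can also be followed with a toy example. The sketch below is an illustration only: the exon and intron sequences are invented (they are not rockfish or human data), the promoter is left out because it is not transcribed, and the function names transcribe and splice are hypothetical.

```python
# Toy sequences invented for illustration; they are not real gene data.
exon1 = "ACCATGGCC"       # 5' UTR (ACC) + first part of the CDS (ATG GCC)
intron = "GTAAGTTTTCAG"   # begins with GT and ends with AG
exon2 = "AAATGAGCTT"      # rest of the CDS (AAA TGA) + 3' UTR (GCTT)

gene = exon1 + intron + exon2

def transcribe(dna: str) -> str:
    """Write the coding-strand sequence as RNA (every T becomes U)."""
    return dna.replace("T", "U")

def splice(pre_mrna: str, intron_start: int, intron_end: int) -> str:
    """Remove one intron given its 0-based start and end-exclusive positions."""
    removed = pre_mrna[intron_start:intron_end]
    if not (removed.startswith("GU") and removed.endswith("AG")):
        raise ValueError("intron does not follow the GT/AG rule")
    return pre_mrna[:intron_start] + pre_mrna[intron_end:]

pre_mrna = transcribe(gene)
mature = splice(pre_mrna, len(exon1), len(exon1) + len(intron))
# A poly-A tail is added at the 3' end; the 5' cap is a modified base and is not spelled out here.
mature_with_tail = mature + "A" * 20

print("pre-mRNA:   ", pre_mrna)
print("mature mRNA:", mature_with_tail)
```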
Translation:
mRNA codes for amino acids (the building blocks of proteins) in stretches of three nucleotides called
codons. Each codon specifies either an amino acid or a stop signal that tells the ribosome
to stop translation. tRNAs carry specific amino acids and interact with ribosomes to deliver the specified
amino acid to the growing chain of amino acids (called a polypeptide). The translation of nucleotide to
amino acid sequences follows a standard genetic code (below).
Genetic Code. The first base of each codon is shown at the left, the second base across the top, and the third base at the right.
      U             C             A             G
U     UUU Phe (F)   UCU Ser (S)   UAU Tyr (Y)   UGU Cys (C)   U
      UUC Phe (F)   UCC Ser (S)   UAC Tyr (Y)   UGC Cys (C)   C
      UUA Leu (L)   UCA Ser (S)   UAA Stop      UGA Stop      A
      UUG Leu (L)   UCG Ser (S)   UAG Stop      UGG Trp (W)   G
C     CUU Leu (L)   CCU Pro (P)   CAU His (H)   CGU Arg (R)   U
      CUC Leu (L)   CCC Pro (P)   CAC His (H)   CGC Arg (R)   C
      CUA Leu (L)   CCA Pro (P)   CAA Gln (Q)   CGA Arg (R)   A
      CUG Leu (L)   CCG Pro (P)   CAG Gln (Q)   CGG Arg (R)   G
A     AUU Ile (I)   ACU Thr (T)   AAU Asn (N)   AGU Ser (S)   U
      AUC Ile (I)   ACC Thr (T)   AAC Asn (N)   AGC Ser (S)   C
      AUA Ile (I)   ACA Thr (T)   AAA Lys (K)   AGA Arg (R)   A
      AUG Met (M)   ACG Thr (T)   AAG Lys (K)   AGG Arg (R)   G
G     GUU Val (V)   GCU Ala (A)   GAU Asp (D)   GGU Gly (G)   U
      GUC Val (V)   GCC Ala (A)   GAC Asp (D)   GGC Gly (G)   C
      GUA Val (V)   GCA Ala (A)   GAA Glu (E)   GGA Gly (G)   A
      GUG Val (V)   GCG Ala (A)   GAG Glu (E)   GGG Gly (G)   G
Use the genetic code to translate the following DNA: ATG TTG CGA TGA
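After translating the sequence above by hand, the same lookup can be automated. The sketch below is one possible way to do it, not a prescribed method: the names CODON_TABLE and translate are made up for this example, and the 64-letter string simply lists the standard genetic code in T, C, A, G order. The demonstration uses a different sequence from the exercise so that the hand translation is still worth doing.

```python
# Standard genetic code written compactly: bases ordered T, C, A, G at each
# codon position, with "*" marking a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(sequence: str) -> str:
    """Translate DNA (or mRNA) codon by codon, stopping at the first stop codon."""
    dna = sequence.upper().replace(" ", "").replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATG GCC AAA TAA"))   # one-letter codes for Met-Ala-Lys
```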
Gene Annotation:
Long stretches of RNA that translate into amino acids without a stop codon (known as open reading
frames), start codons, intron/exon boundaries, and other gene-specific features provide clues as to where
genes are located in eukaryotic genomes. Finding genes in prokaryotic genomes is much easier because
there are no introns and there is little intergenic DNA. For eukaryotes, annotating tens of thousands of
genes by hand is an extremely time-consuming process, in particular because DNA has to be examined
forwards and backwards, and from different starting points. Initial gene prediction is now accomplished
by computers, but the predictions still need to be hand-checked for quality by biologists for genes of highest
interest.
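To see why DNA must be read "forwards and backwards, and from different starting points," consider a simple scan of all six reading frames. The sketch below is a deliberately naive illustration, not a real gene finder: it assumes an open reading frame starts at ATG and runs to the first in-frame stop codon, it ignores introns entirely (which is exactly why eukaryotic genes are harder to find), and the function name find_orfs and the test sequence are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(dna: str, min_codons: int = 3):
    """Scan three frames on each strand for ATG...stop stretches.

    Returns tuples (strand, frame, start, end, sequence). Positions are
    0-based, end-exclusive, and measured on the strand being read.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first start codon in this frame
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, seq[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs

# Invented test sequence: one short ORF sits in the third forward frame.
for orf in find_orfs("CCATGAAACCCTAGG"):
    print(orf)
```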
Day 1 Worksheet Gene Annotation Lab
Name: ________________________
The following is the nucleotide sequence of the human β-globin gene. Gene regions are indicated as
follows: Untranscribed, Transcribed but not translated, Transcribed and Translated, Conserved
Sequence. Nucleotide positions in the DNA are shown in the right hand margin.
ATATCTTAGA GGGAGGGCTG AGGGTTTGAA GTCCAACTCC TAAGCCAGTG   1-50
CCAGAAGAGC CAAGGACAGG TACGGCTGTC ATCACTTAGA CCTCACCCTG   51-100
TGGAGCCACA CCCTAGGGTT GGCCAATCTA CTCCCAGGAG CAGGGAGGGC   101-150
AGGAGCCAGG GCTGGGCATA AAAGTCAGGG CAGAGCCATC TATTGCTTAC   151-200
ATTTGCTTCT GACACAACTG TGTTCACTAG CAACCTCAAA CAGACACCAT   201-250
GGTGCACCTG ACTCCTGAGG AGAAGTCTGC CGTTACTGCC CTGTGGGGCA   251-300
AGGTGAACGT GGATGAAGTT GGTGGTGAGG CCCTGGGCAG GTTGGTATCA   301-350
AGGTTACAAG ACAGGTTTAA GGAGACCAAT AGAAACTGGG CATGTGGAGA   351-400
CAGAGAAGAC TCTTGGGTTT CTGATAGGCA CTGACTCTCT CTGCCTATTG   401-450
GTCTATTTTC CCACCCTTAG GCTGCTGGTG GTCTACCCTT GGACCCAGAG   451-500
GTTCTTTGAG TCCTTTGGGG ATCTGTCCAC TCCTGATGCT GTTATGGGCA   501-550
ACCCTAAGGT GAAGGCTCAT GGCAAGAAAG TGCTCGGTGC CTTTAGTGAT   551-600
GGCCTGGCTC ACCTGGACAA CCTCAAGGGC ACCTTTGCCA CACTGAGTGA   601-650
GCTGCACTGT GACAAGCTGC ACGTGGATCC TGAGAACTTC AGGGTGAGTC   651-700
TATGGGACCC TTGATGTTTT CTTTCCCCTT CTTTTCTATG GTTAAGTTCA   701-750
TGTCATAGGA AGGGGAGAAG TAACAGGGTA CAGTTTAGAA TGGGAAACAG   751-800
ACGAATGATT GCATCAGTGT GGAAGTCTCA GGATCGTTTT AGTTTCTTTT   801-850
ATTTGCTGTT CATAACAATT GTTTTCTTTT GTTTATTCTT GCTTTCTTTT   851-900
TTTTTCTTCT CCGCAATTTT TACTATTATA CTTAATGCCT TAACATTGTG   901-950
TATAACAAAA GGAAATATCT CTGAGATACA TTAAGTAACT TAAAAAAAAA   951-1000
CTTACACAGT CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT   1001-1050
TTGCATATTC ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA   1051-1100
CATAATCATT ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG   1101-1150
TACACATATT GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA   1151-1200
TGCTTTCTTC TTTTAATATA CTTTTTGTTT ATCTTATTTC TAATACTTTC   1201-1250
CCTAATCTCT TTCTTTCAGG GCAATAATGA TACAATGTAT CATGCCTCTT   1251-1300
TGCACCATTC TAAAGAATAA CAGTGATAAT TTCTGGGTTA AGGCAATAGC   1301-1350
AATATTTCTG CATATAAATA TTTCTGCATA TAAATTGTAA CTGATGTAAG   1351-1400
AGGTTTCATA TTGCTAATAG CAGCTACAAT CCAGCTACCA TTCTGCTTTT   1401-1450
ATTTTATGGT TGGGATAAGG CTGGATTATT CTGAGTCCAA GCTAGGCCCT   1451-1500
TTTGCTAATC ATGTTCATAC CTCTTATCTT CCTCCCACAG CTCCTGGGCA   1501-1550
ACGTGCTGGT CTGTGTGCTG GCCCATCACT TTGGCAAAGA ATTCATCCCA   1551-1600
CCAGTGCAGG CTGCCTATCA GAAAGTGGTG GCTGGTGTGG CTAATGCCCT   1601-1650
GGCCCACAAG TATCACTAAG CTCGCTTTCT TGCTGTCCAA TTTCTATTAA   1651-1700
AGGTTCCTTT GTTCCCTAAG TCCAACTACT AAACTGGGGG ATATTATGAA   1701-1750
GGGCCTTGAG CATCTGGATT CTGCCTAATA AAAAACATTT ATTTTCATTG   1751-1800
CAATGATGTA TTTAAATTAT TTCTGAATAT TTTACTAAAA AGGGAATGTG   1801-1850
GGAGGTCAGT GCATTTAAAA CATAAAGAAA TGATGAGCTG TTCAAACCTT   1851-1900
GGGAAAATAC ACTATATCTT AAACTCCATG AAAGAA                   1901-1936
1. Without looking, redraw the sketch of a eukaryotic gene, pre-mRNA, and mature mRNA in the space
below. Then correct your own work.
Gene
Pre-mRNA
Mature mRNA
2. Circle and label the following in the sequence of β-globin (opposite page) and in the sketch of the gene above.
A. Transcription start point
B. Transcription end point
C. Translation start point
D. Translation end point
3. Explain what evidence (i.e. signals in the DNA sequence) you used to support your answers to (2).
4. Based on the sequence information provided above, draw a sketch of the human β-globin gene below
labeling the promoter, 5’UTR, coding sequences, introns, and 3’UTR. Make sizes roughly proportional to the
sequence itself.
5. What would be the first three and the last three amino acids produced from the mRNA transcribed from this
gene? Note: a stop codon does not produce an amino acid.
6. Is a mutation more likely to disrupt the function of the protein produced if it occurs in an intron or
coding sequence? Explain.
Lab 2. What can comparison of DNA sequences tell us about evolution?
Goal: To give an understanding of how DNA sheds light on evolution.
Species evolve over time. Evolution is the consequence of the interactions of 1) the potential for a species
to increase its numbers, 2) the genetic variability of offspring due to mutation and recombination of genes,
3) a finite supply of the resources required for life, and 4) the ensuing selection by the environment of
those offspring better able to survive and leave offspring. The great diversity of organisms is the result of
more than 3.5 billion years of evolution that has filled every available niche with life forms. Natural
selection and its evolutionary consequences provide a scientific explanation for the fossil record of ancient
life forms as well as for the striking molecular similarities observed among the diverse species of living
organisms. The millions of different species of plants, animals, and microorganisms that live on earth
today are related by descent from common ancestors. Biological classifications are based on how
organisms are related. Organisms are classified into a hierarchy of groups and subgroups based on
similarities which reflect their evolutionary relationships.
National Science Education Standards, p. 185
Whole genome sequences from many different species have been aligned by researchers. Now that
you’ve gained some experience in understanding the fine structure of a gene, let’s see what the same
gene, beta-globin, looks like when compared among many different species. Do you think it is possible
for mutations at a single gene to reveal phylogenetic relationships among vertebrates?
1. Go to the UCSC genome browser web page at genome.ucsc.edu, and select Genomes.
2. Select the clade “Mammal”, genome “Human”, clear all text from “position or search term,” and
enter the abbreviation for human beta globin “HBB.”
3. You will get a list of search results, choose HBB: Homo Sapiens Hemoglobin Beta. You will now
see a close up view of the beta-globin gene in humans. Some sections on your screen may look
different than below, but the top should be similar. Find the browser navigation tools, exact
location of the gene on chromosome 11, the sketch of the gene, and miscellaneous information
tracks below the gene sketch.
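[Screenshot: UCSC genome browser view of HBB, with callouts for the navigation tools, the exact location shown, the location on the chromosome, the view of the gene, and the miscellaneous information tracks.]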
4. The beta globin gene is in the reverse orientation in this view of the chromosome (arrows on the
sketch point to the left). Before looking deeper, scroll down and reverse your orientation of the
gene so that it matches the orientation of the gene from the hand-annotation exercise you
performed earlier. (Lab 1 Question 4)
5. Scroll up to the picture and see if you can reconcile the genome browser’s gene sketch with your
hand sketch of this gene. How are UTRs, coding sequences, and introns depicted in the
browser? Roughly redraw the sketch below, and label the major pieces.
6. Scroll down to the Comparative Genomics Track controls and adjust the settings so that
“conservation” is set to “full.” This will give your browser the most expanded view of the
alignment among many different species, which allows you to see how similar (i.e. conserved)
each nucleotide is among many different vertebrates. Then hit “refresh” and scroll back up to
view the changes.
7. Under “Multiz alignment of 46 (your number might differ) species” you will see vertical bars
corresponding to each nucleotide location in the genome. The taller the bars, the more
conserved the DNA sequence is at that location in pairwise comparison of the human versus the
species identified on the left of the screen.
8. To complete the lab, continue from this point to answer the questions on the “Evolution
Worksheet” below.
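The idea behind a conservation bar can be imitated with a small calculation. The sketch below is a simplified illustration and not the browser's actual method (the browser computes its tracks with its own scoring): for each column of an alignment it counts what percentage of the other species carry the same base as the human sequence. The species names are real, but the short aligned sequences and the function name percent_identity_to_reference are invented for this example.

```python
# Invented toy alignment; each sequence has the same length (one column per base).
alignment = {
    "human":     "GTGAGT",
    "chimp":     "GTGAGT",
    "mouse":     "GTGAGC",
    "zebrafish": "GTAAGA",
}

def percent_identity_to_reference(alignment, reference="human"):
    """For each column, percent of the other species matching the reference base."""
    ref_seq = alignment[reference]
    others = [seq for name, seq in alignment.items() if name != reference]
    scores = []
    for column, ref_base in enumerate(ref_seq):
        matches = sum(1 for seq in others if seq[column] == ref_base)
        scores.append(100.0 * matches / len(others))
    return scores

scores = percent_identity_to_reference(alignment)
for column, score in enumerate(scores, start=1):
    # A crude text "bar": one # per 10 percent identity.
    print(f"{column:2d} {alignment['human'][column - 1]} {'#' * round(score / 10)} {score:.0f}%")
```

Counting matches this way is also one approach to the "percent similar" questions on the worksheet, using the sequences shown in your own browser window.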
Evolution Worksheet
Name: ________________________
1. Which species on your screen looks most similar to the human sequence? Does that make
phylogenetic sense? Explain.
2. Does it look like some regions of the gene are more conserved than others?
a. What evidence supports your answer?
b. Which regions of the gene appear more conserved?
c. Why might that be the case?
3. Zoom in all the way to the “base level” from chromosome 11:5,247,971-5,248,171. This view shows
the location of the first exon/intron boundary.
a. Does the intron follow the GT/AG “rule”? Explain.
b. Based on the species you see in the browser viewer, what percent similar are the sequences
in each of the first four nucleotides of the intron?
c. What could explain the variation in conservation you see in (b)?
4. Write down the amino acid sequence for the six amino acids before the intron begins.
a. Are more phylogenetically similar species, like mammals, more similar to each other than they
are to fishes?
b. At what % of sites are all species in the group identical? What about mammals? Humans vs.
fish?
c. What could explain this?
d. Do you think this pattern is something unique to beta globin, or is it a general property of these
species’ genomes? Support your idea by going to the HBD (Homo sapiens hemoglobin delta)
gene and repeating the analysis below.
Lab 3: How can I use bioinformatics to annotate genes?
Goal: To give an understanding of how raw data and research help us predict genes in a genome.
Bioinformatics:
Bioinformatics is a branch of biological science that uses computers to analyze biological data.
Bioinformatics normally deals with very large datasets (sometimes in excess of 500 GB) that would be
nearly impossible to generate and manage without the use of supercomputers. Recent advances in
sequencing technology are allowing more genomes to be sequenced each year. However, it takes ten
times as long to find genes and describe their functions (i.e. annotate a genome) as it does to
sequence a genome. As a result, there is a widening gap between sequenced genomes and annotated,
searchable genomes in publicly available databases.
The Maker Annotation Pipeline (Note: the website does not work anymore)
“Maker” is a genome annotation pipeline that seeks to close this gap. A pipeline is a series of programs
working together, much as you follow the different steps in a protocol to dissect a frog and label its
organs. Maker was optimized for use on non-model organisms. Much has been learned about biology
in the last fifty years from model species like the fruit fly (Drosophila spp.), nematode worm (C. elegans),
baker’s yeast (Saccharomyces cerevisiae), and bacteria (Escherichia coli) that have small genomes, are
easy to raise, have short life cycles, show a readily observable connection between genotype and
phenotype, and can be easily manipulated genetically. However, because sequencing costs have
dramatically decreased in the last decade, genomes from species with interesting phenotypes are now
being sequenced at increasing rates. The Maker pipeline combines two kinds of information to make
structural gene annotations from raw DNA: i) intrinsic signals in the organism’s DNA, which are found by
ab-initio (from scratch) gene predictors, and ii) extrinsic evidence, which is evidence supplied to Maker
based on similarity of genomic regions to other organisms’ mRNA (often in the form of expressed sequence
tags, or ESTs) and protein sequences.
[Figure: The Maker pipeline. The assembled genome is passed through RepeatMasker to give a masked genome. Intrinsic signals come from gene predictors (rockfish-trained SNAP and Augustus). Extrinsic evidence comes from proteins and ESTs, with Exonerate used to improve alignment quality. MAKER annotations are those supported by both intrinsic and extrinsic evidence.]
The first step of the multi-step Maker gene annotation pipeline involves finding repetitive DNA and
labeling (i.e. “masking”) it in the genomic DNA using the program RepeatMasker. Repeat masking is
important in order to prevent inserted viral exons from being counted as fish exons. A repeat library
was developed specifically from the Sebastes rubrivinctus genome. The ab-initio gene predictor “SNAP”
detects the intrinsic signals in the DNA and uses them to model genes. Because “what a gene looks like” differs
significantly among genomes, gene finders must be “trained” to know what the signals look like in each
new species sequenced. The program BLAST (Basic Local Alignment Search Tool) is used to find similarity
of genome sequence to public mRNA and protein sequences. There are different kinds of BLAST
searches depending on what kind of sequences (DNA, RNA, protein) are being compared. The program
Exonerate helps to polish up gene annotations, since “local” alignments of sequences end wherever
similarity between sequences begins to decrease. The final annotations predicted by Maker are those
that are supported by both kinds of information, intrinsic signals and extrinsic evidence.
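The principle of keeping only predictions backed by both kinds of information can be pictured as an overlap test between intervals on the scaffold. The sketch below illustrates that idea only; it is not how Maker is actually implemented, and the coordinates, the 50% threshold, and the function names overlap and supported_exons are all assumptions made for this example.

```python
# Hypothetical scaffold coordinates (1-based, inclusive); not real Maker output.
ab_initio_exons = [(100, 250), (400, 520), (900, 980)]   # intrinsic: gene predictor
evidence_alignments = [(90, 240), (410, 515)]            # extrinsic: protein/EST hits

def overlap(a, b):
    """Number of bases shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def supported_exons(predictions, evidence, min_fraction=0.5):
    """Keep predicted exons covered at least min_fraction by a single evidence hit."""
    kept = []
    for exon in predictions:
        length = exon[1] - exon[0] + 1
        best = max((overlap(exon, hit) for hit in evidence), default=0)
        if best / length >= min_fraction:
            kept.append(exon)
    return kept

print(supported_exons(ab_initio_exons, evidence_alignments))
```

In this toy case the third predicted exon has no protein or EST support and is dropped, which mirrors why the evidence files you collect can change the final gene model.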
The object of this exercise is to use Maker, a cutting-edge research tool, to predict genes from a
small section of the Sebastes rubrivinctus genome using current methodologies.
1) Retrieve Scaffold folder from public drive.
2) Go to http://blast.ncbi.nlm.nih.gov/
3) Click on the link to BLASTx.
4) Upload the scaffold.
5) Change the database to UniProtKB/Swiss-prot (swissprot).
6) Click the BLAST button. BLAST will take several minutes to run. It will search the Swissprot database
for matches to your query based on local sequence similarity.
7) Select all of the sequences, then click “Get selected sequences.”
8) Select view as fasta and select the first 10 results.
9) Copy the results to a file and save as ProteinEvidence.fasta.
Thought Question: What protein appears most in the search results? Do a quick internet search: what
does this protein seem to do?
10) Go to http://blast.ncbi.nlm.nih.gov/
11) Click on the link to tBLASTx.
12) Upload the scaffold.
13) Change the database to expressed sequence tags (EST). Enter “Sebastes” in the Organism line.
14) Click the BLAST button. This may take several minutes.
15) Select 10 of the sequences, 2 from each “column” of alignments in the picture at the top. Clicking on
any of the lines under the long bar will take you straight to that entry, check the box. Then click “Get
Selected Sequences”.
16) Select view as fasta and select the first 15 results.
17) Copy the results to a file and save as ESTEvidence.fasta.
18) Go to http://derringer.genetics.utah.edu/cgi-bin/MWAS/maker.cgi
19) Click new guest account. Remember to write your guest number down.
20) Click on the manage files link next to new job.
21) Upload the scaffold file, the protein evidence, the EST evidence as FASTA files.
22) Upload the HMM file (on public drive) as a SNAP HMM file
23) Go back to the new jobs tab.
24) Upload the Sequence file in the “Choose a genome fasta file” menu.
25) Upload the EST file as ESTs from a related organism.
26) Upload the Protein file.
27) Upload the SNAP file.
28) Set “Consider single exon EST evidence when generating annotations” to yes.
29) Click “Add Job to Queue”.
30) Once Maker has finished, click the icon in the view results tab.
32) Click the “View in Apollo” button.
33) Select “open with Java Web Start Launcher”.
34) Select run to launch Apollo.
35) Right click the Maker annotated gene and select “Sequence”.
36) Select Peptide Sequence and highlight the sequence.
37) Go to http://blast.ncbi.nlm.nih.gov/
38) Select protein BLAST.
39) Paste in the sequence, change the database to Uniprot/Swissprot, and click BLAST.
40) Determine the identity of the gene based on the most similar, significant BLAST hit. The E-value
of the “query” sequence you entered against a database “subject” is the number of hits at least this
good that would be expected by chance alone, so the smaller the E-value, the less likely it is that the
hit was random. Values less than 10 x 10^-10 are considered reliable indicators of significant
similarity.
41) Complete the Worksheet below to finish the lab.
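If you save many BLAST hits as a tab-separated hit table, the E-value cutoff from step 40 can be applied automatically. The sketch below assumes the 12-column tabular layout used by BLAST's tabular output (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); check that your downloaded file matches this layout before relying on it. The rows, the subject names, and the function name significant_hits are invented for illustration.

```python
import io

# Invented example rows in the 12-column tabular layout; subject IDs are placeholders.
HIT_TABLE = """\
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
query1\tsubjectA\t88.5\t200\t23\t0\t1\t200\t5\t204\t3e-120\t350
query1\tsubjectB\t41.0\t100\t59\t0\t10\t109\t1\t100\t2e-08\t55.1
query1\tsubjectC\t30.2\t60\t42\t0\t20\t79\t3\t62\t0.5\t28.0
"""

def significant_hits(handle, max_evalue=10e-10):
    """Return (subject, E-value, bit score) for hits below the cutoff, best first."""
    hits = []
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue                               # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        evalue = float(fields[10])                 # column 11 is the E-value
        if evalue < max_evalue:
            hits.append((fields[1], evalue, float(fields[11])))
    return sorted(hits, key=lambda hit: hit[1])

for subject, evalue, bitscore in significant_hits(io.StringIO(HIT_TABLE)):
    print(subject, evalue, bitscore)
```

The default cutoff mirrors the threshold quoted in step 40; to read a real file, pass an open file handle in place of the io.StringIO object.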
Want more? Have the students annotate the genes with and without the ab initio gene finder, the EST
evidence, or the protein evidence to see which most strongly affects the resulting protein sequence.
Measure percent overlap of different gene annotations to see how similar they are.
Gene Annotation Statistics Worksheet
Name:_____________________________
1) Paste a screen shot of your Apollo result in the space below:
2) How many genes were annotated by MAKER in the scaffold?
3) Do any of the predicted genes appear complete, that is, beginning with ATG and ending with
a stop codon? If so, which?
4) How many exons are in each gene supported by MAKER?
5) How long are the introns on average, and how many are there per gene?
6) Do all introns follow the GT/AG rule?
7) For which genes were UTRs predicted? Name the genes and UTR types.
8) Sometimes assembly of genomes from many small pieces results in chimeric sequences (Recall from
mythology that a chimera is a monster comprised of pieces of different animals). Do your blast results
suggest that the gene MAKER predicted is such a monster?
9) Describe the protein from the strongest protein Blast hit. Can you assign a putative function to the
gene based on this information?
10) Research how the gene is important to the organism’s development, reproduction, and/or survival.
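For the intron questions on the worksheet above, the arithmetic can be checked with a short script once you have written down exon coordinates. The sketch below is only an aid under stated assumptions: the scaffold and exon positions are invented (0-based, end-exclusive), the gene is assumed to lie on the forward strand, and the function names introns_from_exons and follows_gt_ag are hypothetical.

```python
# Invented scaffold built from labeled pieces so the coordinates are easy to verify.
parts = ["CC", "ATGGCC", "GTAAGTTTTCAG", "AAACCC", "GTCCCTTTAG", "GGTTGA", "CC"]
scaffold = "".join(parts)
exons = [(2, 8), (20, 26), (36, 42)]   # 0-based start, end-exclusive

def introns_from_exons(exons):
    """Introns are the gaps between consecutive exons of one gene."""
    ordered = sorted(exons)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]

def follows_gt_ag(scaffold, intron):
    """True if the intron begins with GT and ends with AG on the forward strand."""
    start, end = intron
    return scaffold[start:start + 2] == "GT" and scaffold[end - 2:end] == "AG"

introns = introns_from_exons(exons)
lengths = [end - start for start, end in introns]

print("exons:", len(exons), "introns:", len(introns))
print("mean intron length:", sum(lengths) / len(lengths))
for intron in introns:
    print(intron, "GT/AG" if follows_gt_ag(scaffold, intron) else "non-canonical")
```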
Lab 4. What does a typical rockfish gene look like?
Goal: Use skills learned in previous labs and classes to complete a gene annotation using an
annotation pipeline.
Exercise: Each student should take a scaffold and repeat the annotation process above. There are 50
random scaffolds provided in Appendix D (Ask your instructor for the file location). Two students
should independently annotate the same scaffold. Each individual should complete the worksheet with
their scaffold. Compare your results to another student who annotated the same scaffold.
Presentation: Each pair of students with the same scaffold will give a 15-minute presentation on
their annotations. Discuss your worksheet results, annotations, the features of those annotations, what
the predicted genes are and why they are important. Be sure to support your points with citations and
evidence and to include tables and figures where appropriate. The scaffolds may contain single exon
genes, multi exon genes, one gene per scaffold, more than one gene per scaffold, or simply no genes.
As a group discuss what the general features of all of the scaffolds say about eukaryotic gene structure.
Can you classify different kinds of rockfish genes based on the results you’ve discovered?
Epilogue
As of the writing of this manual, scaffolds containing age-related genes have not yet been identified in S.
rubrivinctus or in its sister species, which lives 10X longer. Contact the author to ask whether these are now
available: buonaccorsi@juniata.edu. There are hundreds of candidate aging genes that are found in
both humans and rockfishes. Students could annotate the same gene from the two species to see if
there are differences that represent a “smoking gun” that might explain negligible senescence in the
tiger rockfish!