Decoding the rockfish genome: An introduction to modern genomics and marine biology

Target Audience: AP high school biology students or lower division undergraduate college students.

Vince Buonaccorsi
Juniata College

OUTLINE
Introduction
Lab 1: How do eukaryotic genes work?
Lab 2: What can comparison of DNA sequences tell us about evolution?
Lab 3: How can I use bioinformatics to annotate genes?
Lab 4: What does a typical rockfish gene look like?

Introduction

This investigation focuses on how genes are found in a newly sequenced genome. This process is called structural gene annotation. Concepts explored include the cell, the molecular basis of heredity, and evolution. Students will use online analytical tools to explore the fine structure of genes in the flag rockfish Sebastes rubrivinctus, the connection between gene structure and cellular functions, and the connection between function and evolutionary conservation of gene sequences.

Genomics and bioinformatics are dynamic fields well suited to capturing the imagination of students in inquiry-driven classroom efforts. Genomic studies provide a comprehensive catalog of the basic genetic information that underpins the structures and functions responsible for an organism’s survival, evolution, and interactions with other organisms of the same or different species. New genomes are being sequenced at an increasing rate, leaving vast quantities of orphaned data that can be explored in authentic research experiences.

In this investigation, students will be given raw segments of DNA (i.e. genomic scaffolds) from the Sebastes rubrivinctus genome project. In order to inform their gene annotations, students will search for evidence available online in the form of similar proteins and RNAs from closely related organisms, as well as humans.
This “extrinsic” evidence will be combined with “intrinsic” signals in the DNA itself (signals that direct the cellular apparatus through transcription and translation) to devise gene models from raw DNA. Students will be answering the basic question: what does a Sebastes rubrivinctus gene look like? The answer may depend on the extrinsic evidence the student is able to find. Students will perform structural gene annotation by hand, examine alignments of gene sections from many different vertebrates using the UCSC genome browser, use a gene annotation pipeline to perform structural gene annotation, assign a putative function to the gene, and describe how the gene is important to the organism’s development, reproduction, and/or survival.

This example highlights some elements of the 9-12 National Science Education Content Standards A (Science as Inquiry) and C (Life Science). Students will obtain the means necessary to perform and understand scientific inquiry. The primary life science standards covered include: the cell, the molecular basis of heredity, and biological evolution.

This lab manual assumes that each student has some familiarity with cell and molecular biology and access to a computer and the World Wide Web. The background information covered is not meant to be an exhaustive treatment of these topics, but rather reviews the specific information necessary to perform and understand the exercises.

Lab 1. How do eukaryotic genes work?

Goal: To give a basic understanding of genes and their functions.

Review of Some Basic Molecular Biology:

The “Central Dogma of Biology” describes information flow in biological systems. It states that DNA makes RNA, which makes proteins. Transcription is the process of copying a DNA sequence into RNA. Translation is the process of using an RNA sequence to build a protein. We will focus our analysis on predicting protein coding genes from raw genome sequence in a marine fish that has recently been sequenced.
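The information flow described above can be sketched in a few lines of Python. This is only an illustration, not part of the lab protocol: the function names are made up for this example, the input is assumed to be the coding (sense) strand of a gene with no introns, and the codon table is the standard genetic code.

```python
# Standard genetic code, with bases in the conventional U, C, A, G order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA that matches the coding strand (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first base and stop at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# A made-up 12-nucleotide coding sequence (hypothetical, not from rockfish).
mrna = transcribe("ATGGCCAAGTAA")
print(mrna)             # AUGGCCAAGUAA
print(translate(mrna))  # MAK
```

The one-letter output (M, A, K) uses the same amino acid abbreviations as the genetic code table later in this lab.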
Only messenger RNA (mRNA) genes make, or code for, proteins. Ribosomal genes are transcribed into ribosomal RNAs (rRNA), transfer RNA genes produce tRNA molecules, and many other RNA genes do not code for proteins. Most DNA is found in the nucleus, where it is transcribed; the resulting mRNA is then exported from the nucleus to the cytoplasm for translation.

The flag rockfish Sebastes rubrivinctus is a member of a diverse marine fish assemblage with an estimated 102 species native to the west coast of North America. Rockfishes of the genus Sebastes support important commercial and recreational fisheries on the west coast of North America and are the dominant assemblage on most cold temperate reefs. These live-bearers have a low intrinsic rate of population increase and highly sporadic recruitment, releasing large numbers of pelagic larvae into a variable coastal environment. Their slow growth rates render them vulnerable to overfishing. Uncertainty in the success of any particular year class has favored an evolutionary strategy whereby some species of the genus Sebastes have extremely long lifespans and do not show signs of aging (negligible senescence), while others demonstrate typical aging patterns and have short lifespans. S. rubrivinctus has a maximum lifespan of only 18 years, but it is closely related to S. nigrocinctus, which has a maximum age of at least 116 years. Researchers expect to gain insight into the genetic mechanism for negligible senescence by sequencing and comparing the genomes of these two species.

Transcription and Eukaryotic Gene Structure:

Most cell functions involve chemical reactions. Food molecules taken into cells react to provide the chemical constituents needed to synthesize other molecules. Both breakdown and synthesis are made possible by a large set of protein catalysts, called enzymes.
The breakdown of some of the food molecules enables the cell to store energy in specific chemicals that are used to carry out the many functions of the cell. Cells store and use information to guide their functions. The genetic information stored in DNA is used to direct the synthesis of thousands of proteins that each cell requires. Cell functions are regulated. Regulation occurs both through changes in the activity of the functions performed by proteins and through the selective expression of individual genes. This regulation allows cells to respond to their environment and to control and coordinate cell growth and division. (National Science Education Standards pg. 184)

In all organisms, the instructions for specifying the characteristics of organisms are carried in DNA, a large polymer formed from subunits of four kinds (A, G, C, and T). The chemical and structural properties of DNA explain how the genetic information that underlies heredity is both encoded in genes (as a string of molecular “letters”) and replicated (by a templating mechanism). Each DNA molecule in a cell forms a single chromosome. (National Science Education Standards pg. 185)

The structure of a gene dictates how cellular proteins will interact with it to transcribe a messenger RNA and translate that mRNA into a protein. Some of the important details of DNA and a gene are listed below:

Each cell in an organism contains the same genome sequence, but different genes are transcribed into mRNAs by different cells and by the same cells at different times.

DNA is double stranded, and each strand has a 5’ end and a 3’ end, pronounced “five prime” and “three prime.” Each strand is read from 5’ to 3’, and the two strands run in opposite directions: the 5’ end of one strand pairs with the 3’ end of its partner.
5' ATGGCGT TGCCATA CCCGCAT CCCTGAT 3'
3' TACCGCA ACGGTAT GGGCGTA GGGACTA 5'

Eukaryotic genes have several different parts that orchestrate the cellular processes of transcription and translation (see Figure below):
o Promoter
o Five Prime UnTranslated Region (5’ UTR)
o Coding sequence(s)
o Intron(s)
o Three Prime UnTranslated Region (3’ UTR)

The Promoter is a region of DNA, up to a few hundred bp long, immediately upstream of the transcription start point. Transcription factors (proteins) bind to it and recruit RNA polymerase for transcription.

Exons are the sections of transcribed DNA that are exported from the nucleus after introns have been removed and the ends of the transcript stabilized to form the mature mRNA. A gene may have one exon or several exons separated by intervening sequences called introns. The beginning of the first exon and the end of the last exon are untranslated.

The Five Prime Untranslated Region (5’ UTR) is the part of the mature mRNA immediately upstream of the coding sequence. It may contain introns. After the mRNA is exported to the cytoplasm, the ribosome binds to the modified 5’ end of the 5’ UTR before translation starts.

Coding sequences (CDS) of DNA are ultimately the sections of exons that are translated by ribosomes into amino acid sequences once the mature mRNA is exported to the cytoplasm. The protein coding region of DNA begins with the start codon ATG and ends with one of the stop codons TAG, TGA, or TAA. Note that RNA contains U (uracil) where DNA contains T (thymine), so in the mRNA these codons read AUG, UAG, UGA, and UAA.

Introns are non-coding regions between exons. Introns usually begin with GT and end in AG (or the reverse complements, when the gene lies on the opposite strand) in what is known as the “GT/AG rule”. They are excised from pre-mRNAs in the nucleus by proteins that recognize these sequences. Their excision is part of the pre-mRNA processing step that also includes adding a 5’ cap (a modified guanine) and a 3’ poly-A tail (a long stretch of As) to the mRNA for message stability.
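The strand and signal rules above are simple enough to check by computer, which is essentially what gene finders do on a much larger scale. The sketch below is only an illustration under simplifying assumptions: the function names and the short intron sequence are invented for this example, sequences contain only A, C, G, and T, and real introns and coding sequences are far longer.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAG", "TGA", "TAA"}


def reverse_complement(dna):
    """Return the partner strand, written in its own 5' to 3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def follows_gt_ag_rule(intron):
    """True if the intron begins with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def has_start_and_stop(cds):
    """True if a spliced coding sequence starts with ATG, ends with a stop
    codon, and has a length that is a whole number of codons."""
    cds = cds.upper()
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


# First block of the two-stranded example above: the bottom strand,
# read 5' to 3', is the reverse complement of the top strand.
print(reverse_complement("ATGGCGT"))  # ACGCCAT

# A made-up intron on the forward strand obeys the rule directly.
print(follows_gt_ag_rule("GTAAGTCTTTCAG"))  # True

# The same intron seen from the other strand looks like CT...AC, so it has
# to be reverse complemented before the rule is applied.
other_strand = "CTGAAAGACTTAC"
print(follows_gt_ag_rule(other_strand))                      # False
print(follows_gt_ag_rule(reverse_complement(other_strand)))  # True
```

This is why the DNA has to be examined in both directions when annotating by hand: a gene on the opposite strand shows all of its signals as reverse complements.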
The Three Prime Untranslated Region (3’ UTR) is the region of DNA after the stop codon in a gene. Once it is transcribed, a poly-A tail is added that helps to stabilize the mRNA. These regions sometimes have binding sites for microRNAs that, when bound, mark the mRNA for breakdown.

Intergenic DNA is the DNA between genes. There are still recognizable DNA elements in intergenic DNA:
o Enhancers and silencers are sequence elements in the DNA that are more distant from the transcription start point than promoters yet still regulate transcription by determining what kind of molecules will be able to bind to the DNA. “Regulation” refers to turning transcription “up or down”.
o Short tandem repeats (a.k.a. microsatellites) are stretches of repeated nucleotide motifs from one to 30 bp. For example, the sequence ACACACACACACAC contains the motif AC repeated seven times.
o Dispersed repeats are repeats that do not occur in tandem. These are often ancient viral DNA elements that have incorporated themselves into other organisms’ genomes, have made copies of themselves over the course of evolutionary time (10s of millions of years), and have lost their ability to become full viruses and infect their host. Dispersed repeats can also be found within introns of genes and must be identified to keep gene finders from characterizing them as exons belonging to native genes.
o Tandem repetitive DNA can serve important purposes such as protecting the ends of chromosomes (telomeres) and serving as binding sites during cell division at the constriction point of a chromosome (the centromere).
o Most genome sequences are also erroneously contaminated by bacterial and human genes. This illustrates a major point of scientific research: researchers must be constantly “on guard” for a myriad of problems that obscure the truth.

One gene can encode different proteins in different cell types by including or excluding different exons in a process known as alternative splicing.
Different cell types are formed in a process called differentiation, during development of the organism. After differentiation, cell types have different proteins present that direct the cell to use the DNA in a way that suits the function of the cell. Exons are mixed to form alternative mRNA products.

Figure: How DNA becomes a protein. The DNA of a gene (promoter; exon 1 with the 5’ UTR and CDS 1 beginning at ATG; an intron running from GT to AG; exon 2 with CDS 2 ending at TGA and the 3’ UTR) is transcribed into pre-mRNA. RNA processing removes the introns and adds the 5’ cap and poly-A tail to give the mature mRNA (AUG… …UGA), which is translated into protein.

Translation:

mRNA codes for amino acids (the building blocks of proteins) in stretches of three nucleotides called codons. Each codon specifies either an amino acid or a stop signal that ends translation. tRNAs carry specific amino acids and interact with ribosomes to deliver the specified amino acid to the growing chain of amino acids (called a polypeptide). The translation of nucleotide to amino acid sequences follows a standard genetic code (below).

Genetic Code. (First base at left, second base across the top, third base at right.)

        U             C             A             G
U   UUU Phe (F)   UCU Ser (S)   UAU Tyr (Y)   UGU Cys (C)   U
    UUC Phe (F)   UCC Ser (S)   UAC Tyr (Y)   UGC Cys (C)   C
    UUA Leu (L)   UCA Ser (S)   UAA Stop      UGA Stop      A
    UUG Leu (L)   UCG Ser (S)   UAG Stop      UGG Trp (W)   G
C   CUU Leu (L)   CCU Pro (P)   CAU His (H)   CGU Arg (R)   U
    CUC Leu (L)   CCC Pro (P)   CAC His (H)   CGC Arg (R)   C
    CUA Leu (L)   CCA Pro (P)   CAA Gln (Q)   CGA Arg (R)   A
    CUG Leu (L)   CCG Pro (P)   CAG Gln (Q)   CGG Arg (R)   G
A   AUU Ile (I)   ACU Thr (T)   AAU Asn (N)   AGU Ser (S)   U
    AUC Ile (I)   ACC Thr (T)   AAC Asn (N)   AGC Ser (S)   C
    AUA Ile (I)   ACA Thr (T)   AAA Lys (K)   AGA Arg (R)   A
    AUG Met (M)   ACG Thr (T)   AAG Lys (K)   AGG Arg (R)   G
G   GUU Val (V)   GCU Ala (A)   GAU Asp (D)   GGU Gly (G)   U
    GUC Val (V)   GCC Ala (A)   GAC Asp (D)   GGC Gly (G)   C
    GUA Val (V)   GCA Ala (A)   GAA Glu (E)   GGA Gly (G)   A
    GUG Val (V)   GCG Ala (A)   GAG Glu (E)   GGG Gly (G)   G

Use the genetic code to translate the following DNA: ATG TTG CGA TGA

Gene Annotation:

Long stretches of sequence that can be translated into amino acids without encountering a stop codon (known as open reading frames), start codons, intron/exon boundaries, and other gene specific features form clues as to where genes are located in eukaryotic genomes. Finding genes in prokaryotic genomes is much easier because there are no introns and there is little intergenic DNA. For eukaryotes, annotating tens of thousands of genes by hand is an extremely time consuming process, in particular because the DNA has to be examined on both strands (forwards and backwards) and from each of the possible starting points (reading frames). Initial gene prediction is now accomplished by computers, but for genes of highest interest the predictions still need to be checked by hand by biologists.

Day 1 Worksheet
Gene Annotation Lab
Name: ________________________

The following is the nucleotide sequence of the human β-globin gene. Gene regions are indicated as follows: Untranscribed, Transcribed but not translated, Transcribed and Translated, Conserved Sequence. Nucleotide positions in the DNA are shown in the right hand margin.
ATATCTTAGA CCAGAAGAGC TGGAGCCACA AGGAGCCAGG ATTTGCTTCT GGTGCACCTG AGGTGAACGT AGGTTACAAG CAGAGAAGAC GTCTATTTTC GTTCTTTGAG ACCCTAAGGT GGCCTGGCTC GCTGCACTGT TATGGGACCC TGTCATAGGA ACGAATGATT ATTTGCTGTT TTTTTCTTCT TATAACAAAA GGGAGGGCTG CAAGGACAGG CCCTAGGGTT GCTGGGCATA GACACAACTG ACTCCTGAGG GGATGAAGTT ACAGGTTTAA TCTTGGGTTT CCACCCTTAG TCCTTTGGGG GAAGGCTCAT ACCTGGACAA GACAAGCTGC TTGATGTTTT AGGGGAGAAG GCATCAGTGT CATAACAATT CCGCAATTTT GGAAATATCT AGGGTTTGAA TACGGCTGTC GGCCAATCTA AAAGTCAGGG TGTTCACTAG AGAAGTCTGC GGTGGTGAGG GGAGACCAAT CTGATAGGCA GCTGCTGGTG ATCTGTCCAC GGCAAGAAAG CCTCAAGGGC ACGTGGATCC CTTTCCCCTT TAACAGGGTA GGAAGTCTCA GTTTTCTTTT TACTATTATA CTGAGATACA GTCCAACTCC ATCACTTAGA CTCCCAGGAG CAGAGCCATC CAACCTCAAA CGTTACTGCC CCCTGGGCAG AGAAACTGGG CTGACTCTCT GTCTACCCTT TCCTGATGCT TGCTCGGTGC ACCTTTGCCA TGAGAACTTC CTTTTCTATG CAGTTTAGAA GGATCGTTTT GTTTATTCTT CTTAATGCCT TTAAGTAACT TAAGCCAGTG CCTCACCCTG CAGGGAGGGC TATTGCTTAC CAGACACCAT CTGTGGGGCA GTTGGTATCA CATGTGGAGA CTGCCTATTG GGACCCAGAG GTTATGGGCA CTTTAGTGAT CACTGAGTGA AGGGTGAGTC GTTAAGTTCA TGGGAAACAG AGTTTCTTTT GCTTTCTTTT TAACATTGTG TAAAAAAAAA 1-50 51-100 101-150 151-200 201-250 251-300 301-350 351-400 401-450 451-500 501-550 551-600 601-650 651-700 701-750 751-800 801-850 851-900 901-950 951-1000 CTTACACAGT TTGCATATTC CATAATCATT TACACATATT TGCTTTCTTC CCTAATCTCT TGCACCATTC AATATTTCTG AGGTTTCATA ATTTTATGGT TTTGCTAATC ACGTGCTGGT CCAGTGCAGG GGCCCACAAG AGGTTCCTTT GGGCCTTGAG CAATGATGTA GGAGGTCAGT GGGAAAATAC CTGCCTAGTA ATAATCTCCC ATACATATTT GACCAAATCA TTTTAATATA TTCTTTCAGG TAAAGAATAA CATATAAATA TTGCTAATAG TGGGATAAGG ATGTTCATAC CTGTGTGCTG CTGCCTATCA TATCACTAAG GTTCCCTAAG CATCTGGATT TTTAAATTAT GCATTTAAAA ACTATATCTT CATTACTATT TACTTTATTT ATGGGTTAAA GGGTAATTTT CTTTTTGTTT GCAATAATGA CAGTGATAAT TTTCTGCATA CAGCTACAAT CTGGATTATT CTCTTATCTT GCCCATCACT GAAAGTGGTG CTCGCTTTCT TCCAACTACT CTGCCTAATA TTCTGAATAT CATAAAGAAA AAACTCCATG TGGAATATAT TCTTTTATTT GTGTAATGTT GCATTTGTAA ATCTTATTTC TACAATGTAT TTCTGGGTTA TAAATTGTAA CCAGCTACCA CTGAGTCCAA 
CCTCCCACAG TTGGCAAAGA GCTGGTGTGG TGCTGTCCAA AAACTGGGGG AAAAACATTT TTTACTAAAA TGATGAGCTG AAAGAA GTGTGCTTAT TTAATTGATA TTAATATGTG TTTTAAAAAA TAATACTTTC CATGCCTCTT AGGCAATAGC CTGATGTAAG TTCTGCTTTT GCTAGGCCCT CTCCTGGGCA ATTCATCCCA CTAATGCCCT TTTCTATTAA ATATTATGAA ATTTTCATTG AGGGAATGTG TTCAAACCTT 1001-1050 1051-1100 1101-1150 1151-1200 1201-1250 1251-1300 1301-1350 1351-1400 1401-1450 1451-1500 1501-1550 1551-1600 1601-1650 1651-1700 1701-1750 1751-1800 1801-1850 1851-1900 1901-1936

1. Without looking, redraw the sketch of a eukaryotic gene, pre-mRNA, and mature mRNA in the space below. Then correct your own work.

Gene

Pre-mRNA

Mature mRNA

2. Circle and label the following in the sequence of β-globin (above) and in the sketch of the gene above.
A. Transcription start point
B. Transcription end point
C. Translation start point
D. Translation end point

3. Explain what evidence (i.e. signals in the DNA sequence) you used to support your answers to (2).

4. Based on the sequence information provided above, draw a sketch of the human β-globin gene below, labeling the promoter, 5’ UTR, coding sequences, introns, and 3’ UTR. Make sizes roughly proportional to the sequence itself.

5. What would be the first three and the last three amino acids produced from the mRNA transcribed from this gene? Note: a stop codon does not produce an amino acid.

6. Is a mutation more likely to disrupt the function of the protein produced if it occurs in an intron or coding sequence? Explain.

Lab 2. What can comparison of DNA sequences tell us about evolution?

Goal: To give an understanding of how DNA sheds light on evolution.

Species evolve over time.
Evolution is the consequence of the interactions of 1) the potential for a species to increase its numbers, 2) the genetic variability of offspring due to mutation and recombination of genes, 3) a finite supply of the resources required for life, and 4) the ensuing selection by the environment of those offspring better able to survive and leave offspring. The great diversity of organisms is the result of more than 3.5 billion years of evolution that has filled every available niche with life forms. Natural selection and its evolutionary consequences provide a scientific explanation for the fossil record of ancient life forms as well as for the striking molecular similarities observed among the diverse species of living organisms. The millions of different species of plants, animals, and microorganisms that live on earth today are related by descent from common ancestors. Biological classifications are based on how organisms are related. Organisms are classified into a hierarchy of groups and subgroups based on similarities which reflect their evolutionary relationships. (National Science Education Standards pg. 185)

Whole genome sequences from many different species have been aligned by researchers. Now that you’ve gained some experience in understanding the fine structure of a gene, let’s see what the same gene, beta-globin, looks like when compared among many different species. Do you think it is possible for mutations at a single gene to reveal phylogenetic relationships among vertebrates?

1. Go to the UCSC genome browser web page at genome.ucsc.edu, and select Genomes.
2. Select the clade “Mammal”, genome “Human”, clear all text from “position or search term,” and enter the abbreviation for human beta globin “HBB.”
3. You will get a list of search results; choose HBB: Homo Sapiens Hemoglobin Beta. You will now see a close up view of the beta-globin gene in humans. Some sections on your screen may look different than below, but the top should be similar.
Find the browser navigation tools, exact location of the gene on chromosome 11, the sketch of the gene, and miscellaneous information tracks below the gene sketch.

[Figure: UCSC genome browser view of HBB, labeled with: Navigation tools; Exact location shown; Location on chromosome; View of gene; Miscellaneous Information Tracks.]

4. The beta globin gene is in the reverse orientation in this view of the chromosome (arrows on the sketch point to the left). Before looking deeper, scroll down and reverse your orientation of the gene so that it matches the orientation of the gene from the hand-annotation exercise you performed earlier (Lab 1 Question 4).
5. Scroll up to the picture and see if you can reconcile the genome browser’s gene sketch with your hand sketch of this gene. How are UTRs, coding sequences, and introns depicted in the browser? Roughly redraw the sketch below, and label the major pieces.
6. Scroll down to the Comparative Genomics Track controls and adjust the settings so that “conservation” is set to “full.” This will give your browser the most expanded view of the alignment among many different species, which allows you to see how similar (i.e. conserved) each nucleotide is among many different vertebrates. Then hit “refresh” and scroll back up to view the changes.
7. Under “Multiz alignment of 46 (your number might differ) species” you will see vertical bars corresponding to each nucleotide location in the genome. The taller the bars, the more conserved the DNA sequence is at that location in pairwise comparison of the human versus the species identified on the left of the screen.
8. To complete the lab, continue from this point to answer the questions on the “Evolution Worksheet” below.

Evolution Worksheet
Name: ________________________

1. Which species on your screen looks most similar to the human sequence? Does that make phylogenetic sense? Explain.
2. Does it look like some regions of the gene are more conserved than others?
a. What evidence supports your answer?
b. Which regions of the gene appear more conserved?
c. Why might that be the case?
3. Zoom in all the way to the “base level” from chromosome 11:5,247,971-5,248,171. This view shows the location of the first exon/intron boundary.
a. Does the intron follow the GT/AG “rule”? Explain.
b. Based on the species you see in the browser viewer, what percent similar are the sequences in each of the first four nucleotides of the intron?
c. What could explain the variation in conservation you see in (b)?
4. Write down the amino acid sequence for the six amino acids before the intron begins.
a. Are more phylogenetically similar species, like mammals, more similar to each other than they are to fishes?
b. At what % of sites are all species in the group identical? What about mammals? Humans vs. fish?
c. What could explain this?
d. Do you think this pattern is something unique to beta globin, or is it a general property of these species’ genomes? Support your idea by going to the HBD (Homo sapiens hemoglobin delta) gene and repeating the analysis above.

Lab 3: How can I use bioinformatics to annotate genes?

Goal: To give an understanding of how raw data and research help us predict genes in a genome.

Bioinformatics:

Bioinformatics is a branch of biological science involved in using computers to analyze biological data. Bioinformatics normally deals with very large datasets (sometimes in excess of 500 GB) that would be nearly impossible to generate and manage without the use of supercomputers. Recent advances in sequencing technology are allowing more genomes to be sequenced each year. However, it takes ten times as long to find genes and describe their functions (i.e. annotate a genome) as it does to sequence a genome. As a result, there is a widening gap between sequenced genomes and annotated, searchable genomes in publicly available databases.

The Maker Annotation Pipeline (Note: the website does not work anymore.)

“Maker” is a genome annotation pipeline that seeks to close this gap.
A pipeline is a series of programs working together, much like following the different steps in a protocol to dissect a frog and label its organs. Maker was optimized for use on non-model organisms. Much has been learned about biology in the last fifty years from model species like the fruit fly (Drosophila spp.), nematode worm (Caenorhabditis elegans), baker’s yeast (Saccharomyces cerevisiae), and bacteria (Escherichia coli) that have small genomes, are easy to raise, have short life cycles, show an easily observed connection between genotype and phenotype, and can be easily manipulated genetically. However, because sequencing costs have dramatically decreased in the last decade, genomes from species with interesting phenotypes are now being sequenced at increasing rates.

The Maker pipeline combines two kinds of information to make structural gene annotations from raw DNA: i) intrinsic signals in the organism’s DNA, which are found by ab-initio (from scratch) gene predictors, and ii) extrinsic evidence, which is evidence supplied to Maker based on similarity of genomic regions to other organisms’ mRNA sequences (often available as expressed sequence tags, or ESTs) and protein sequences.

[Figure: The Maker pipeline. Assembled Genome, then RepeatMasker, then Masked Genome. Intrinsic Signals: Gene Predictor (rockfish trained SNAP and Augustus). Extrinsic Evidence: Proteins and ESTs, with Exonerate to improve alignment quality. MAKER Annotations: supported by both Intrinsic and Extrinsic Evidence.]

The first step of the multi-step Maker gene annotation pipeline involves finding repetitive DNA and labeling (i.e. “masking”) it in the genomic DNA using the program RepeatMasker. Repeat masking is important in order to prevent inserted viral exons from being counted as fish exons. A repeat library was developed specifically from the Sebastes rubrivinctus genome. The ab-initio gene predictor “SNAP” can detect the intrinsic signals in the DNA and use them to model genes.
Because “what a gene looks like” differs significantly among genomes, gene finders must be “trained” to know what the signals look like in each new species sequenced. The program BLAST (Basic Local Alignment Search Tool) is used to find similarity of genome sequence to public mRNA and protein sequences. There are different kinds of BLAST searches depending on what kind of sequences (DNA, RNA, protein) are being compared. The program Exonerate helps to polish up gene annotations, since “local” alignments of sequences end wherever similarity between sequences begins to decrease. The final annotations predicted by Maker are those that are supported by both kinds of information, intrinsic signals and extrinsic evidence.

The object of this exercise is to use Maker, a cutting edge research tool, to predict genes from a small section of the Sebastes rubrivinctus genome using current methodologies.

1) Retrieve Scaffold folder from public drive.
2) Go to http://blast.ncbi.nlm.nih.gov/
3) Click on the link to BLASTx.
4) Upload the scaffold.
5) Change the database to UniProtKB/Swiss-prot (swissprot).
6) Click the BLAST button. BLAST will take several minutes to run. It will search the Swissprot database for matches to your query based on local sequence similarity.
7) Select all of the sequences, then click “Get selected sequences”.
8) Select view as fasta and select the first 10 results.
9) Copy the results to a file and save as ProteinEvidence.fasta.

Thought Question: What protein appears most in the search results? Do a quick internet search. What does this protein seem to do?

10) Go to http://blast.ncbi.nlm.nih.gov/
11) Click on the link to tBLASTx.
12) Upload the scaffold.
13) Change the database to expressed sequence tags (EST). Enter “Sebastes” in the Organism line.
14) Click the BLAST button. This may take several minutes.
15) Select 10 of the sequences, 2 from each “column” of alignments in the picture at the top. Clicking on any of the lines under the long bar will take you straight to that entry; check the box. Then click “Get Selected Sequences”.
16) Select view as fasta and select the first 15 results.
17) Copy the results to a file and save as ESTEvidence.fasta.
18) Go to http://derringer.genetics.utah.edu/cgi-bin/MWAS/maker.cgi
19) Click new guest account. Remember to write your guest number down.
20) Click on the manage files link next to new job.
21) Upload the scaffold file, the protein evidence, and the EST evidence as FASTA files.
22) Upload the HMM file (on public drive) as a SNAP HMM file.
23) Go back to the new jobs tab.
24) Upload the Sequence file in the “Choose a genome fasta file” menu.
25) Upload the EST file as ESTs from a related organism.
26) Upload the Protein file.
27) Upload the SNAP file.
28) Set “Consider single exon EST evidence when generating annotations” to yes.
29) Click “Add Job to Queue”.
30) Once Maker has finished, click the icon in the view results tab.
32) Click the “View in Apollo” button.
33) Select “open with Java Web Start Launcher”.
34) Select run to launch Apollo.
35) Right click the Maker annotated gene and select “Sequence”.
36) Select Peptide Sequence and highlight the sequence.
37) Go to http://blast.ncbi.nlm.nih.gov/
38) Select protein BLAST.
39) Paste in the sequence, change the database to UniProtKB/Swiss-prot (swissprot), and click BLAST.
40) Determine the identity of the gene based on the most similar, significant BLAST hit. The E-value of the “query” sequence you entered against a database “subject” measures how likely it is that the hit arose by chance: it is the number of equally good hits expected at random in a database of that size, so smaller is better. Values less than 10 x 10^-10 are considered reliable indicators of significant similarity.
41) Complete the Worksheet below to finish the lab.

Want more? Have the students annotate the genes with and without the ab initio gene finder, the EST evidence, or the protein evidence to see which most strongly affects the resulting protein sequence.
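Comparisons between annotation runs can be made quantitative by measuring how much two annotations of the same scaffold overlap. The sketch below is one possible way to do this, not a Maker or Apollo feature: it assumes each annotation has been written down by hand as a list of exon (start, end) coordinates on the scaffold (1-based, inclusive), and it defines overlap as the bases shared by both annotations divided by the bases covered by either one. The function names and coordinates are invented for the example.

```python
def covered_positions(exons):
    """Return the set of scaffold positions covered by a list of exons."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))  # end is inclusive
    return positions


def percent_overlap(annotation_a, annotation_b):
    """Shared bases as a percentage of all bases covered by either annotation."""
    a = covered_positions(annotation_a)
    b = covered_positions(annotation_b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


# Hypothetical exon coordinates for the same gene from two Maker runs.
with_est_evidence = [(101, 200), (301, 400)]
without_est_evidence = [(101, 200), (321, 400)]

print(percent_overlap(with_est_evidence, without_est_evidence))  # 90.0
```

Using sets of positions keeps the code short, which is fine for scaffold-sized examples; other definitions of overlap (for example, relative to only one of the two annotations) would give different numbers, so state which one you used.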
Measure percent overlap of different gene annotations to see how similar they are.

Gene Annotation Statistics Worksheet
Name:_____________________________

1) Paste a screen shot of your Apollo result in the space below:
2) How many genes were annotated by MAKER in the scaffold?
3) Of the predicted genes, if any, which appear complete, that is, beginning with ATG and ending with a stop codon?
4) How many exons are in each gene supported by MAKER?
5) How long are the introns on average, and how many are there per gene?
6) Do all introns follow the GT/AG rule?
7) For which genes were UTRs predicted? Name the genes and UTR types.
8) Sometimes assembly of genomes from many small pieces results in chimeric sequences (recall from mythology that a chimera is a monster composed of pieces of different animals). Do your BLAST results suggest that the gene MAKER predicted is such a monster?
9) Describe the protein from the strongest protein BLAST hit. Can you assign a putative function to the gene based on this information?
10) Research how the gene is important to the organism’s development, reproduction, and/or survival.

Lab 4. What does a typical rockfish gene look like?

Goal: Use skills learned in previous labs and classes to complete a gene annotation using an annotation pipeline.

Exercise: Each student should take a scaffold and repeat the annotation process above. There are 50 random scaffolds provided in Appendix D (ask your instructor for the file location). Two students should independently annotate the same scaffold. Each individual should complete the worksheet with their scaffold. Compare your results to another student who annotated the same scaffold.

Presentation: Each pair of students with the same scaffold will give a 15-minute presentation on their annotations. Discuss your worksheet results, annotations, the features of those annotations, what the predicted genes are and why they are important.
Be sure to support your points with citations and evidence and to include tables and figures where appropriate. The scaffolds may contain single exon genes, multi exon genes, one gene per scaffold, more than one gene per scaffold, or simply no genes. As a group, discuss what the general features of all of the scaffolds say about eukaryotic gene structure. Can you classify different kinds of rockfish genes based on the results you’ve discovered?

Epilogue

As of the writing of this manual, scaffolds containing age-related genes have not yet been identified in S. rubrivinctus or in its sister species S. nigrocinctus, which lives more than six times as long (see Lab 1). Contact the author to see if these are now available: buonaccorsi@juniata.edu. There are hundreds of candidate aging genes that are found in both humans and rockfishes. Students could annotate the same gene from the two species to see if there are differences that represent a “smoking gun” that might explain negligible senescence in the tiger rockfish (S. nigrocinctus)!