Module 6. Genome annotation: Overview and manual annotation of a eukaryotic gene

Background

The remaining modules focus on how genes are characterized in a newly sequenced genome. This process is called structural gene annotation. Concepts explored include the cell, the molecular basis of heredity, and evolution. Analytical tools will be used to explore the fine structure of genes in the flag rockfish Sebastes rubrivinctus, the connection between gene structure and cellular functions, and the connection between function and evolutionary conservation of gene sequences.

Genomic studies provide a comprehensive catalog of the basic genetic information that underpins the structures and functions responsible for an organism’s survival, evolution, and interactions with other organisms of the same or different species. New genomes are being sequenced at an increasing rate, leaving vast quantities of orphaned data that can be explored in authentic research experiences.

In this investigation, participants will be given five raw segments of DNA (i.e. genomic scaffolds) from the Sebastes rubrivinctus genome project. To inform their gene annotations, participants will search for evidence available online in the form of similar proteins and RNAs from closely related organisms, as well as humans. This “extrinsic” evidence will be combined with “intrinsic” signals in the DNA itself (signals that direct the cellular apparatus through transcription and translation) to devise gene models from raw DNA. Participants will be answering the basic question: what does a Sebastes rubrivinctus gene look like? The answer may depend on the extrinsic evidence the participant is able to find.
Participants will perform structural gene annotation by hand, use a gene annotation pipeline to perform structural gene annotation, assign a putative function to the gene, describe how the gene is important to the organism’s development, reproduction, and/or survival, and examine alignments of gene sections from many different vertebrates using the UCSC Genome Browser.

This lab manual assumes that each participant has some familiarity with cell and molecular biology and access to a computer and the World Wide Web. The background information covered is not meant to be an exhaustive treatment of these topics; rather, it reviews the specific information necessary to perform and understand the exercises.

Review of Some Basic Molecular Biology

The “Central Dogma of Biology” describes information flow in biological systems. It states that DNA makes RNA, which makes proteins. Transcription is the process of copying DNA into RNA. Translation is the process of decoding RNA into protein. We will focus our analysis on predicting protein-coding genes from raw genome sequence in a marine fish that has recently been sequenced. Only messenger RNA (mRNA) genes code for proteins. Ribosomal genes are transcribed into ribosomal RNAs (rRNA), transfer RNA genes produce tRNA molecules, and many other RNA genes do not code for proteins. In eukaryotes, most DNA is found in the nucleus, where it is transcribed; the resulting mRNA is exported from the nucleus to the cytoplasm for translation.

The flag rockfish Sebastes rubrivinctus is a member of a diverse marine fish assemblage with an estimated 102 species native to the west coast of North America. Rockfishes of the genus Sebastes support important commercial and recreational fisheries on the west coast of North America and are the dominant assemblage on most cold temperate reefs.
These live-bearers have a low intrinsic rate of population increase and highly sporadic recruitment, releasing large numbers of pelagic larvae into a variable coastal environment. Their slow growth rates render them vulnerable to overfishing. Uncertainty in the success of any particular year class has favored an evolutionary strategy whereby some species of the genus Sebastes have extremely long lifespans and do not show signs of aging (negligible senescence), while others demonstrate typical aging patterns and have short lifespans. S. rubrivinctus has a maximum lifespan of only 18 years, but it is closely related to S. nigrocinctus, which has a maximum age of at least 116 years. Researchers expect to gain insight into the genetic mechanism for negligible senescence by sequencing and comparing the genomes of these two species.

Transcription and Eukaryotic Gene Structure

Most cell functions involve chemical reactions. Food molecules taken into cells react to provide the chemical constituents needed to synthesize other molecules. Both breakdown and synthesis are made possible by a large set of protein catalysts, called enzymes. The breakdown of some of the food molecules enables the cell to store energy in specific chemicals that are used to carry out the many functions of the cell.

Cells store and use information to guide their functions. The genetic information stored in DNA is used to direct the synthesis of thousands of proteins that each cell requires.

Cell functions are regulated. Regulation occurs both through changes in the activity of the functions performed by proteins and through the selective expression of individual genes. This regulation allows cells to respond to their environment and to control and coordinate cell growth and division. (National Science Education Standards pg.
184)

In all organisms, the instructions for specifying the characteristics of organisms are carried in DNA, a large polymer formed from subunits of four kinds (A, G, C, and T). The chemical and structural properties of DNA explain how the genetic information that underlies heredity is both encoded in genes (as a string of molecular “letters”) and replicated (by a templating mechanism). Each DNA molecule in a cell forms a single chromosome. (National Science Education Standards pg. 185)

The structure of a gene dictates how cellular proteins will interact with it to transcribe a messenger RNA and translate that mRNA into a protein. Some of the important details of DNA and a gene are listed below:

Each cell contains the same genome sequence, but different genes are transcribed into mRNAs by different cells and by the same cells at different times.

DNA is double stranded, and each strand has 5’ and 3’ ends, pronounced “five prime” and “three prime.” It is read from 5’ to 3’, and the two strands run in opposite directions: the 5’ end of one strand lies opposite the 3’ end of its partner.

5' ATGGCGT TGCCATA CCCGCAT CCCTGAT 3'
3' TACCGCA ACGGTAT GGGCGTA GGGACTA 5'

Eukaryotic genes have several different parts that orchestrate the cellular processes of transcription and translation (see Figure below):
o Promoter
o Five Prime UnTranslated Region (5’ UTR)
o Coding sequence(s)
o Intron(s)
o Three Prime UnTranslated Region (3’ UTR)

The Promoter is a region of DNA, up to a few hundred bp long, immediately upstream of the transcription start point. Transcription factors (proteins) bind to it and recruit RNA polymerases for transcription.

Exons are the sections of transcribed DNA that remain in the mature mRNA, which is exported from the nucleus after introns have been removed and the ends of the transcript stabilized. A gene may have one exon or several exons separated by intervening sequences called introns. The beginning of the first exon and the end of the last exon are typically untranslated.
The Five Prime Untranslated Region (5’ UTR) is the part of the mature mRNA immediately upstream of the coding sequence. In the genomic DNA, this region may be interrupted by introns. Before translation can start, the ribosome binds to the modified 5’ end of the 5’ UTR after export to the cytoplasm.

Coding sequences (CDS) are the sections of exons that are ultimately translated by ribosomes into amino acid sequences once the mature mRNA is exported to the cytoplasm. The protein-coding region of DNA begins with the start codon ATG and ends with one of the stop codons TAG, TGA, or TAA. Note that the process of transcription copies Ts (thymines) as Us (uracils).

Introns are non-coding regions between exons. Introns usually begin with GT and end in AG (or, for a gene on the opposite strand, the reverse complements, so that the intron appears to begin with CT and end in AC), in what is known as the “GT/AG rule”. They are excised from pre-mRNAs in the nucleus by proteins that recognize these sequences. Their excision is part of the pre-mRNA processing step that also includes adding a 5’ cap (a modified guanine) and a 3’ poly-A tail (a long stretch of As) to the mRNA for message stability.

The Three Prime Untranslated Region (3’ UTR) is the region of DNA after the stop codon in a gene. Once it is transcribed, a poly-A tail is added that helps to stabilize the mRNA. These regions sometimes have binding sites for microRNAs that, when present, mark the mRNA for breakdown.

Intergenic DNA is the DNA between genes. There are still recognizable DNA elements in intergenic DNA:
o Enhancers and silencers are sequence elements in the DNA that are more distant from the transcription start point than promoters yet still regulate transcription by determining what kind of molecules will be able to bind to the DNA. “Regulation” refers to turning transcription “up or down”.
o Short tandem repeats (a.k.a. microsatellites) are stretches of repeated nucleotide motifs from one to 30 bp. For example, the sequence ACACACACACACAC contains the motif AC repeated seven times.
o Dispersed repeats are repeats that do not occur in tandem.
These are often ancient viral DNA elements that have incorporated themselves into other organisms’ genomes, have made copies of themselves over the course of evolutionary time (tens of millions of years), and have lost their ability to become full viruses and infect their host. Dispersed repeats can also be found within introns of genes and must be identified to keep gene finders from characterizing them as exons belonging to native genes.
o Tandem repetitive DNA can serve important purposes, such as protecting the ends of chromosomes (telomeres) and serving as binding sites during cell division near the constriction point of a chromosome (the centromere).
o Most genome sequences are also contaminated with bacterial and human DNA. This illustrates a major point of scientific research: researchers must be constantly “on-guard” for a myriad of problems that obscure the truth.

One gene can encode different proteins in different cell types by including or excluding different exons in a process known as alternative splicing. Different cell types are formed in a process called differentiation during development of the organism. After differentiation, cell types have different proteins present that direct the cell to use the DNA in a way that suits the function of the cell. Exons are combined in different ways to form alternative mRNA products.

[Figure: How DNA becomes a protein. DNA (gene): promoter; exon 1 containing the 5’ UTR and CDS 1 (ATG…); intron (GT…AG); exon 2 containing CDS 2 (…TGA) and the 3’ UTR. Transcription produces the pre-mRNA (AUG… GU…AG …UGA). RNA processing (removal of introns, addition of 5’ cap and poly-A tail) produces the mature mRNA (5’ cap, AUG… …UGA, poly-A tail). Translation produces the protein.]

Translation

mRNA codes for amino acids (the building blocks of proteins) in stretches of three nucleotides called codons. Each codon specifies either an amino acid or a stop signal that ends translation.
tRNAs carry specific amino acids and interact with ribosomes to deliver the specified amino acid to the growing chain of amino acids (called a polypeptide). The translation of nucleotide to amino acid sequences follows a standard genetic code (below).

Genetic Code.

First   Second base                                               Third
base    U             C             A             G               base
U       UUU Phe (F)   UCU Ser (S)   UAU Tyr (Y)   UGU Cys (C)     U
        UUC Phe (F)   UCC Ser (S)   UAC Tyr (Y)   UGC Cys (C)     C
        UUA Leu (L)   UCA Ser (S)   UAA Stop      UGA Stop        A
        UUG Leu (L)   UCG Ser (S)   UAG Stop      UGG Trp (W)     G
C       CUU Leu (L)   CCU Pro (P)   CAU His (H)   CGU Arg (R)     U
        CUC Leu (L)   CCC Pro (P)   CAC His (H)   CGC Arg (R)     C
        CUA Leu (L)   CCA Pro (P)   CAA Gln (Q)   CGA Arg (R)     A
        CUG Leu (L)   CCG Pro (P)   CAG Gln (Q)   CGG Arg (R)     G
A       AUU Ile (I)   ACU Thr (T)   AAU Asn (N)   AGU Ser (S)     U
        AUC Ile (I)   ACC Thr (T)   AAC Asn (N)   AGC Ser (S)     C
        AUA Ile (I)   ACA Thr (T)   AAA Lys (K)   AGA Arg (R)     A
        AUG Met (M)   ACG Thr (T)   AAG Lys (K)   AGG Arg (R)     G
G       GUU Val (V)   GCU Ala (A)   GAU Asp (D)   GGU Gly (G)     U
        GUC Val (V)   GCC Ala (A)   GAC Asp (D)   GGC Gly (G)     C
        GUA Val (V)   GCA Ala (A)   GAA Glu (E)   GGA Gly (G)     A
        GUG Val (V)   GCG Ala (A)   GAG Glu (E)   GGG Gly (G)     G

Use the genetic code to translate the following DNA: ATG TTG CGA TGA

Gene Annotation

Long stretches of RNA that translate into amino acids without a stop codon (known as open reading frames), start codons, intron/exon boundaries, and other gene-specific features provide clues as to where genes are located in eukaryotic genomes. Finding genes in prokaryotic genomes is much easier because there are no introns and there is little intergenic DNA. For eukaryotes, annotating tens of thousands of genes by hand is an extremely time-consuming process, in particular because DNA has to be examined forwards and backwards, and from different starting points. Initial gene prediction is now accomplished by computers, but the predictions for genes of highest interest still need to be hand-checked for quality by biologists.

Goals

Understand requisite background information on the cell and the molecular basis of heredity.
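The search described under Gene Annotation, in which DNA is read forwards and backwards and from different starting points, can be sketched in a few lines of Python. This is only a minimal illustration, not the annotation pipeline used in this manual. The function names (reverse_complement, translate, find_orfs) and the min_codons parameter are hypothetical. The input is assumed to contain only uppercase A, C, G, and T, and introns are ignored, so the sketch finds only uninterrupted ATG-to-stop reading frames. The example input is the top strand of the double-stranded DNA shown earlier in this module.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code written in DNA letters; codons are enumerated in
# TCAG order for the first, second, and third base. "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the partner strand, written 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def find_orfs(dna, min_codons=3):
    """List (strand, frame, protein) for each ATG-to-stop ORF in six frames.

    Hypothetical helper: ORFs that run off the end of the sequence without
    reaching a stop codon are not reported.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
            start = None
            for idx, codon in enumerate(codons):
                if start is None and codon == "ATG":
                    start = idx  # first start codon opens the ORF
                elif start is not None and CODON_TABLE[codon] == "*":
                    protein = "".join(CODON_TABLE[c] for c in codons[start:idx])
                    if len(protein) >= min_codons:
                        orfs.append((strand, frame, protein))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    top_strand = "ATGGCGTTGCCATACCCGCATCCCTGAT"
    print(reverse_complement(top_strand))
    print(find_orfs(top_strand))
```

Because the sketch ignores introns, it behaves more like a prokaryotic gene finder. In a eukaryotic scaffold, a coding sequence split across exons would appear as several shorter open reading frames, which is one reason the extrinsic evidence described in the Background is needed.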
GCAT-SEEK sequencing requirements
None

Computer/program requirements for data analysis
None

Protocols
None

Assessment

The following is the nucleotide sequence of the human β-globin gene. Gene regions are indicated as follows: Untranscribed, Transcribed but not translated, Transcribed and Translated, Conserved Sequence. Nucleotide positions in the DNA are shown in the right hand margin.

ATATCTTAGA GGGAGGGCTG AGGGTTTGAA GTCCAACTCC TAAGCCAGTG  1-50
CCAGAAGAGC CAAGGACAGG TACGGCTGTC ATCACTTAGA CCTCACCCTG  51-100
TGGAGCCACA CCCTAGGGTT GGCCAATCTA CTCCCAGGAG CAGGGAGGGC  101-150
AGGAGCCAGG GCTGGGCATA AAAGTCAGGG CAGAGCCATC TATTGCTTAC  151-200
ATTTGCTTCT GACACAACTG TGTTCACTAG CAACCTCAAA CAGACACCAT  201-250
GGTGCACCTG ACTCCTGAGG AGAAGTCTGC CGTTACTGCC CTGTGGGGCA  251-300
AGGTGAACGT GGATGAAGTT GGTGGTGAGG CCCTGGGCAG GTTGGTATCA  301-350
AGGTTACAAG ACAGGTTTAA GGAGACCAAT AGAAACTGGG CATGTGGAGA  351-400
CAGAGAAGAC TCTTGGGTTT CTGATAGGCA CTGACTCTCT CTGCCTATTG  401-450
GTCTATTTTC CCACCCTTAG GCTGCTGGTG GTCTACCCTT GGACCCAGAG  451-500
GTTCTTTGAG TCCTTTGGGG ATCTGTCCAC TCCTGATGCT GTTATGGGCA  501-550
ACCCTAAGGT GAAGGCTCAT GGCAAGAAAG TGCTCGGTGC CTTTAGTGAT  551-600
GGCCTGGCTC ACCTGGACAA CCTCAAGGGC ACCTTTGCCA CACTGAGTGA  601-650
GCTGCACTGT GACAAGCTGC ACGTGGATCC TGAGAACTTC AGGGTGAGTC  651-700
TATGGGACCC TTGATGTTTT CTTTCCCCTT CTTTTCTATG GTTAAGTTCA  701-750
TGTCATAGGA AGGGGAGAAG TAACAGGGTA CAGTTTAGAA TGGGAAACAG  751-800
ACGAATGATT GCATCAGTGT GGAAGTCTCA GGATCGTTTT AGTTTCTTTT  801-850
ATTTGCTGTT CATAACAATT GTTTTCTTTT GTTTATTCTT GCTTTCTTTT  851-900
TTTTTCTTCT CCGCAATTTT TACTATTATA CTTAATGCCT TAACATTGTG  901-950
TATAACAAAA GGAAATATCT CTGAGATACA TTAAGTAACT TAAAAAAAAA  951-1000
CTTACACAGT CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT  1001-1050
TTGCATATTC ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA  1051-1100
CATAATCATT ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG  1101-1150
TACACATATT GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA  1151-1200
TGCTTTCTTC TTTTAATATA CTTTTTGTTT ATCTTATTTC TAATACTTTC  1201-1250
CCTAATCTCT TTCTTTCAGG GCAATAATGA TACAATGTAT CATGCCTCTT  1251-1300
TGCACCATTC TAAAGAATAA CAGTGATAAT TTCTGGGTTA AGGCAATAGC  1301-1350
AATATTTCTG CATATAAATA TTTCTGCATA TAAATTGTAA CTGATGTAAG  1351-1400
AGGTTTCATA TTGCTAATAG CAGCTACAAT CCAGCTACCA TTCTGCTTTT  1401-1450
ATTTTATGGT TGGGATAAGG CTGGATTATT CTGAGTCCAA GCTAGGCCCT  1451-1500
TTTGCTAATC ATGTTCATAC CTCTTATCTT CCTCCCACAG CTCCTGGGCA  1501-1550
ACGTGCTGGT CTGTGTGCTG GCCCATCACT TTGGCAAAGA ATTCATCCCA  1551-1600
CCAGTGCAGG CTGCCTATCA GAAAGTGGTG GCTGGTGTGG CTAATGCCCT  1601-1650
GGCCCACAAG TATCACTAAG CTCGCTTTCT TGCTGTCCAA TTTCTATTAA  1651-1700
AGGTTCCTTT GTTCCCTAAG TCCAACTACT AAACTGGGGG ATATTATGAA  1701-1750
GGGCCTTGAG CATCTGGATT CTGCCTAATA AAAAACATTT ATTTTCATTG  1751-1800
CAATGATGTA TTTAAATTAT TTCTGAATAT TTTACTAAAA AGGGAATGTG  1801-1850
GGAGGTCAGT GCATTTAAAA CATAAAGAAA TGATGAGCTG TTCAAACCTT  1851-1900
GGGAAAATAC ACTATATCTT AAACTCCATG AAAGAA                 1901-1936

1. Without looking, redraw the sketch of a eukaryotic gene, pre-mRNA, and mature mRNA in the space below. Include one intron. Then correct your own work.

Gene

Pre-mRNA

Mature mRNA

2. Circle and label the following in the sequence of β-globin (above) and in the sketch of the gene above.
A. Transcription start point
B. Transcription end point
C. Translation start point
D. Translation end point

3. Explain what evidence (i.e. signals and annotations from the DNA sequence) you used to support your answers to (2).

4. Based on the sequence information provided above, draw a sketch of the human β-globin gene below, labeling the promoter, 5’ UTR, coding sequences, introns, and 3’ UTR. Make sizes roughly proportional to the sequence itself.

5. What would be the first three and the last three amino acids produced from the mRNA transcribed from this gene? Note: a stop codon does not produce an amino acid.

6. Is a mutation more likely to disrupt the function of the protein produced if it occurs in an intron or in a coding sequence? Explain.

Time line of module
One hour of lecture.
Discussion topics for class
See assessment

Relevant background lecture topics include basic molecular biology (replication, transcription, translation), genome structure, fine structure of a gene, regulation of genes, and cited literature.

References

Further Reading: Any genetics textbook