Lecture of Principles of gene engineering 2008.5.5
Dr. Jin-Mei Lai bio2028@mails.fju.edu.tw

Various methods can be employed to identify and locate the genes that reside within the genome.
1. Sequence inspection
2. cDNA comparison

Genes that code for proteins comprise an open reading frame ("ORF"):
initiation codon (ATG) ..... termination codon (TAA, TAG, or TGA)
* The average ORF length
E. coli : 317 codons
Yeast : 483 codons
The search for a gene can be thought of as a scan for an initiation codon and a termination codon that are separated by at least 100 codons.

Finding genes in prokaryotes is easy.
--- Just translate the DNA sequence in all 6 reading frames.
The ORFs (regions starting with ATG and ending in an in-frame stop codon) will be at least 300 bases in length, while random reading frames will be dotted with stop codons at the rate of about 3 stop codons every 64 codons (roughly one every 21 codons).
XXXXXATG.....(3X).....TGAXXXXX

Finding genes in eukaryotes is harder.
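For reference, the intron-free six-frame scan described for prokaryotes can be sketched as follows. It is this simple form that breaks down in eukaryotes, where introns interrupt the reading frame. This is a minimal sketch, not a production gene finder: the function names (`find_orfs`, `reverse_complement`) and the `min_codons` parameter are illustrative, and the scan only considers ATG starts and the three standard stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Scan all 6 reading frames for ATG ... in-frame stop.

    Returns a list of (strand, frame, start, end, n_codons) tuples.
    start/end are 0-based offsets on the scanned strand; end is exclusive
    and includes the stop codon. n_codons counts ATG up to, but not
    including, the stop codon. min_codons=100 follows the threshold in
    the lecture; it is a parameter, not a fixed rule.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG of the open frame
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None  # look for the next ATG in this frame
    return orfs
```

The 100-codon threshold works because, with 3 of the 64 codons being stops, the chance that 100 consecutive random codons contain no stop is (61/64)^100, which is under 1%.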
ORF; coding sequence
* exon-intron junction: GT-AG rule (GU-AG in mRNA)
Exon/GU-intron-AG/exon
5’-AG/GUAAGU-intron-YNCURAC-YnNAG/G-3’
Y is either pyrimidine, Yn denotes a string of about nine pyrimidines, R is either purine, A is a special A that participates in forming a branched splicing intermediate, and N is any base.

Splicing mechanism, spliceosome

Splicing mechanism

The spliceosome performs 2 main functions:
~ recognition of intron/exon boundaries
~ removal of introns and joining of exons.
Made of 5 small nuclear ribonucleoproteins (snRNPs). Each snRNP is composed of a single U-rich small nuclear RNA and multiple proteins.

Several computer programs are available for the identification of ORFs. (species specific)
~ not only on the basis of initiator and terminator codons, but also codon bias, intron-exon boundaries and transcriptional control elements (e.g. the TATA box)
However, the sequences can be quite variable!!
Alternative approach: use previously identified genes as a guide, and try to assign the new sequence as similar (homologous) to an existing gene.
Exceptions: pseudogenes (generally non-transcribed genomic DNA with a high degree of sequence similarity to a real gene)

* What is codon bias?
Codon bias is the probability that a given codon will be used to code for an amino acid over a different codon which codes for the same amino acid.
Ex. ~ genes that are always expressed at a high rate should have a different codon bias than those genes that are always expressed at a low rate.

Definition of Homolog, Ortholog and Paralog
Homolog
A gene related to a second gene by descent from a common ancestral DNA sequence. The term homolog may apply to the relationship between genes separated by the event of speciation (see ortholog) or to the relationship between genes separated by the event of genetic duplication (see paralog).
Ortholog
Orthologs are genes in different species that evolved from a common ancestral gene by speciation.
Normally, orthologs retain the same function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.
Paralog
Paralogs are genes related by duplication within a genome. Orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions, even if these are related to the original one.

The transcriptional control elements
Bacteria have one RNA polymerase that transcribes all of their genes. (holoenzyme: α2ββ’ωσ)
RNA polymerase binds to the promoter (defined as the region of DNA recognized by the polymerase); promoters have a similar nucleotide composition.

Eukaryotes have multiple RNA polymerases, which are specialized for specific gene families.
Like bacterial polymerase, eukaryotic polymerases bind to promoters.
These promoters are much more complex and are composed of an RNA polymerase binding site and DNA binding sites for regulatory proteins (do not use operator terminology for eukaryotes; instead they have transcription elements or "boxes").
Transcription factors bind to transcription elements.

TATA BOX & Initiator Regions
The TATA box and initiator region help RNA polymerase start transcription at the correct site.
Model eukaryote promoter (TE = Transcription element):
multiple upstream elements (TE4, TE3, TE2, TE1) - TATA box - initiator region (mRNA start site) - GENE

RNA transcript cleavage and poly-adenylation are directed by a polyA signal within the RNA.

Improved Definition of PAS (PolyA Site) signals.
How does the polyA machinery tell a true cleavage site from a random AATAAA?
What other signals help dictate use of specific sites in certain conditions?
Upstream Sequence Element: "enhancer element"; mostly found in viral sequences
Downstream Sequence Element: "constitutive"; poorly defined; mutations tolerated

2. cDNA comparison
~ The simplest way to identify a gene within a segment of genomic DNA is to compare the sequence to a copy of the corresponding cDNA.
Through the hybridization of genomic DNA fragments to mRNA separated on an agarose gel (northern blotting).
Through the comparison with databases of sequenced cDNA fragments (ESTs).

Expressed Sequence Tags (ESTs)
~ are small pieces of cDNA sequence (usually 200 to 500 bases long) that are generated by sequencing either one or both ends of an expressed gene. (sequenced only once!)
High error rate (>1%), mainly frameshifts and insertions/deletions
Redundant sampling of 5’ and 3’ ends
Large number in public databases
(Figure: 5’ ESTs and 3’ ESTs aligned along an mRNA) EST lengths vary due to varying polymerase activity.
60-80% of human genes are represented in dbEST (human)

Expressed Sequence Tag (EST)
Partial cDNA sequences of genes expressed in different tissues
Tissues -> mRNA -> cDNA -> EST (5’ partial sequencing, 3’ partial sequencing)

EST data are fragmented, but there is lots of it!
(Figure labels: ESTs, mRNA, exon, intron, genome)
* Database of ESTs continues to grow rapidly; a database of all genes and/or all gene transcripts does not yet exist.
* ESTs do not have to be checked for sequencing errors, as mistakes do not prevent identification of the gene from which the EST was derived using similarity searches.
* ESTs database (http://www.ncbi.nlm.nih.gov/dbEST ) (contains over 12 million sequences from different organisms, including 4.5 million human sequences)

Expressed Sequence Tags (ESTs)
Why EST sequencing?
Systematic sampling of the transcribed portion of the genome ("transcriptome")
Provides "sequence tags" allowing unique identification of genes (e.g. for SAGE)
Provides experimental evidence for the positions of exons.
Provides regions coding for potentially new proteins.
Provides clones for DNA microarrays.
Deposits readable part of sequence in database by sequencing various cDNA libraries (>2,000), prepared from various tissues and cell lines, using directional cloning.
Systematic effort to make libraries from cancerous tissue: CGAP project (NCI).
Most cDNA libraries are managed by the IMAGE consortium.
But many tissues are still not sampled, and quality is very uneven.

Strategy for gene discovery by using EST
Discrimination between coding and non-coding sequences.
Cluster EST sequences to identify candidate transcripts.
Assemble to increase length and reduce redundancy.
Detection of beginning and end of coding regions.
Find coding regions and reading frame, and correct frameshift errors.
Use deduced protein sequences as searchable database (TrEST).
(Figure: this is a gene with 10 ESTs associated. The cluster size is 10.)
The ORESTES project: to obtain EST sequences from the under-represented, often coding, central portions of mRNAs, resulting in obtaining many novel genes.

Importance of alternative splicing
* Does the number of genes account for the complexity in humans?
Humans: ~25,000 genes
C. elegans: ~19,500 genes
Arabidopsis: ~27,000 genes
* How common is alternative splicing?
35% -- 59% of human genes are affected by alternative splicing.
If only 2 splice variants per affected gene: minimum of 33,750 (25,000 x 1.35) - maximum of 39,750 (25,000 x 1.59) unique transcripts/proteins

Splicing of immature mRNA
Constitutive splicing: all exons are joined together in the order in which they occur in the heterogeneous nuclear RNA.
Alternative splicing: the production of two or more distinct mRNAs from RNA transcripts having the same sequence, via the use of different exons.

Discovery of Alternative Splicing
First discovered with an immunoglobulin heavy chain gene (D. Baltimore et al.)
Alternative splicing gives two forms of the protein with different C-termini
(Figure: mouse immunoglobulin μ heavy chain gene (via C-terminus). S - signal peptide; V - variable region; C - constant region; green - membrane anchor; red - untranslated region; yellow - end of coding region for secreted form)

Regulation of Alternative splicing
Sex determination in Drosophila involves 3 regulatory genes that are differentially spliced in females versus males; 2 of them affect alternative splicing
1. Sxl (sex-lethal) - promotes alternative splicing of tra (exon 2 is skipped) and of its own (exon 3 is skipped) pre-mRNA
2. Tra - promotes alternative splicing of dsx (last 2 exons are excluded)
3. Dsx (double-sex) - alternatively spliced form of dsx needed to maintain female state

Alternative splicing in Drosophila maintains the female state.
Sxl and Tra are SR proteins!
Tra binds exon 4 in dsx mRNA, causing it to be retained in mature mRNA.

Finding Potential Splice Sites using ESTs

After the genome sequencing projects were finished, it was realized that only a fraction of the genes had been previously characterized.
Two methods are currently used to assign the function of a gene based only on its sequence.
Similarity searches
~ many genes that encode proteins with the same function in different organisms will be similar.
Experimental gene assignment
~ the phenotype of the disrupted mutant or gene knockout can be assessed in order to attempt to identify the natural function of the wild-type gene.

Similarity searches
~ usually performed using amino acid sequence.
The amino acid sequence of the galactokinase from one organism shares similarity to the galactokinase from another organism.

Experimental gene assignment
In experimental organisms, such as E. coli or yeast, one of the most popular ways of ascribing a function to an unknown gene is to make a gene knockout.
~ the phenotype of the gene knockout can be assessed to identify the natural function of the wild-type gene.
Ex. yeast gene SNU17 shows little similarity to other proteins when compared using database searches. However, a yeast strain knocked out for SNU17 shows a slow-growth phenotype and is defective in pre-mRNA splicing, indicating that the protein is involved in the splicing process.
Alternative approach: overproduce a protein.

Chapter 7. Definitions
Mutation: a change in the nucleic acid sequence (bases) of an organism’s genetic material (a change in the genetic material of an organism).
Directed mutagenesis: a change in the nucleic acid sequence (or genetic material) of an organism at a specific predetermined location.

Silent mutations
Most amino acids are encoded by several different codons. For example, if the third base in the TCT codon for serine is changed to any one of the other three bases, serine will still be encoded. Such mutations are said to be silent.
Missense mutations
With a missense mutation, the new nucleotide alters the codon so as to produce an altered amino acid in the protein product.
Nonsense mutations
The mutation generates a STOP codon (TAA, TAG, or TGA).
Frameshift mutations
Insertions or deletions of one or a few nucleotides (not a multiple of three) in a coding sequence. Usually very detrimental.

In vitro mutagenesis, Why?
We want to determine how DNA and/or encoded proteins function in the intact entity (virus, bacterium, cell, animal etc.)
The most direct way to find out what a gene or protein does is to find out what happens when it is missing or mutated.
Study mutants that lack the gene/protein or express an altered version of it - determine which biological processes are altered in the mutants.

In Vitro Mutagenesis
At its most simplistic, in vitro mutagenesis allows us to change the base sequence of a DNA segment or gene.
Mutations can be localized or general, random or targeted.
Less specific methods of mutagenesis are used to analyze regulatory regions of genes.
More specific methods of mutagenesis are used to understand the contribution of individual amino acids, or groups of amino acids, to the structure and function of the target protein.
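The mutation classes defined above (silent, missense, nonsense, frameshift) can be illustrated with a short Python sketch. It assumes the standard genetic code and a sense codon as the reference; the function names `classify_substitution` and `is_frameshift` are illustrative and do not come from the lecture.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change as 'silent', 'nonsense' or 'missense'.

    Assumes ref_codon is a sense codon (not a stop) and that both
    codons are written as DNA (T, not U).
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"       # same amino acid is still encoded
    if alt_aa == "*":
        return "nonsense"     # the change creates TAA, TAG or TGA
    return "missense"         # a different amino acid is encoded


def is_frameshift(indel_length):
    """An insertion/deletion shifts the frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0
```

For the serine example, `classify_substitution("TCT", "TCC")` reports a silent change, whereas changing the middle base of TCA to give TAA is classified as nonsense.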
Both the less specific and the more specific methods generate mutants in vitro, without phenotypic selection.

Directed Mutagenesis
Directed mutagenesis can be done using:
~ M13 DNA (using primer extension mutagenesis)
~ Dut-/ung- strand selection
~ Cassette mutagenesis
~ PCR based mutagenesis
~ QuikChange® mutagenesis
~ Random mutations
* Ala scanning, Charged to Ala scanning mutagenesis
* Doped cassette mutagenesis
* Error-prone PCR

Directed Mutagenesis Using M13 (Figure 7.1)
The procedure involves:
The gene of interest is inserted into the ds form of the M13 bacteriophage. (M13 has an ssDNA genome and replicates via a dsDNA intermediate.)
The ssDNA is isolated from the M13 phage.
The ssDNA is mixed with an excess of the synthetic oligonucleotide. The oligo is complementary to the area of the cloned gene except for the one nucleotide to be changed.
The oligo anneals to the ssDNA.
The oligo acts as a primer for DNA synthesis using the M13 DNA as a template and the Klenow fragment of DNA polymerase I.
T4 DNA ligase is used to ligate the ends of the newly synthesized DNA.
The newly synthesized M13 DNA is transformed into E. coli.
Because DNA replicates semiconservatively, half the cells should have the mutant gene.
Mutant plaques are identified by DNA hybridization using the oligo as probe.
However, there are a number of drawbacks:
1. The DNA that is to be mutated needs to be cloned into the M13 genome.
2. The efficiency of the mutagenesis procedure is quite low.
3. The newly synthesized DNA will not be methylated, so the mismatch is liable to be repaired back to the parental sequence by the mismatch repair system of E. coli.
4. The final screening procedure is slow and cumbersome.

Enrichment for identifying the Mutant Plaques (I)
Phosphorothioate strand selection
~ a phosphorothioate nucleotide contains a phosphorus-sulphur linkage in place of a phosphorus-oxygen group
DNA containing phosphorothioate linkages is resistant to cleavage by certain restriction enzymes.
(Figure: mutation efficiency)

Enrichment for identifying the Mutant Plaques (II)
One strategy has been to introduce the M13 vector carrying the desired gene into an E. coli strain with 2 defective enzymes:
A defective form of dUTPase (dut). Cells with defective dUTPase have elevated levels of dUTP, which is incorporated into the DNA, often replacing dTTP.
A defective Uracil N-glycosylase (ung). Uracil N-glycosylase is the enzyme that removes uracil that has been incorporated into DNA during replication.

Enrichment for identifying the Mutant Plaques (II)
The procedure involves:
The desired gene is cloned into the M13 vector.
The M13 vector with the desired gene is transformed into an E. coli dut-/ung- strain, which produces ssDNA with T replaced by U.
The mutagenic oligonucleotide is annealed and a second strand is synthesized.
Addition of T4 ligase.
The dsDNA is transformed into a wild-type E. coli strain, which will use Uracil N-glycosylase to remove the uracil that was incorporated into the DNA. Therefore the original DNA strand is degraded and only the mutant strand remains.
In this way the number of plaques with the mutant gene is greatly increased.

Cassette mutagenesis
~ relies on the presence of two restriction enzyme recognition sites flanking the DNA that is to be mutated.
(Figure label: synthetic cassette containing the desired mutations & overhanging sequences for the ligation)
1. It requires two restriction enzyme recognition sequences to flank the DNA that is to be mutated.
2. Oligonucleotides are difficult to synthesize accurately above about 70 nts in length.

PCR based mutagenesis
PCR can be used to:
~ Enrich for the mutant gene
~ Avoid using the M13 vector
The procedure involves:
The target gene is cloned into an E. coli plasmid.
4 specific oligos are added to the PCR reaction.
~ 2 primers are complementary to the target.
~ The other primers are complementary to the target gene except for the nucleotide that is targeted for change.
Two-step PCR is used to introduce mutations into the middle of an amplified DNA fragment.
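The four-oligo layout used in PCR based mutagenesis can be sketched in Python as below. This is an illustration under simplifying assumptions, not a primer-design tool: the function name `design_mutagenic_primers`, the `flank` and `outer_len` parameters, and the returned dictionary keys are all hypothetical, and the sketch ignores melting temperature, GC content and secondary structure.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def design_mutagenic_primers(target, pos, new_base, flank=10, outer_len=12):
    """Sketch the 4 oligos for a single-base change at 0-based position pos.

    outer_fwd / outer_rev match the ends of the target exactly.
    mut_fwd / mut_rev are complementary to each other and match the
    target except for the one nucleotide that is to be changed.
    """
    target = target.upper()
    new_base = new_base.upper()
    if new_base not in "ACGT" or len(new_base) != 1:
        raise ValueError("new_base must be one of A, C, G, T")
    if not flank <= pos < len(target) - flank:
        raise ValueError("mutation is too close to the end of the target")
    if target[pos] == new_base:
        raise ValueError("new_base is identical to the existing base")

    mutant = target[:pos] + new_base + target[pos + 1:]
    mut_fwd = mutant[pos - flank:pos + flank + 1]  # mismatch sits in the centre
    return {
        "outer_fwd": target[:outer_len],
        "outer_rev": reverse_complement(target[-outer_len:]),
        "mut_fwd": mut_fwd,
        "mut_rev": reverse_complement(mut_fwd),
        "mutant": mutant,  # expected sequence of the final product
    }
```

In the two-step scheme, `outer_fwd` with `mut_rev` and `mut_fwd` with `outer_rev` would give two first-round fragments that overlap across the mutated site; the outer primers alone would then amplify the full-length mutant fragment in the second step.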
QuikChange® Mutagenesis
~ uses the power of PCR to introduce mutations directly into plasmid DNA
~ alleviates the need for additional cloning steps.
* DpnI: cleaves 5’ - G(CH3-A)TC - 3’ / 3’ - CT(A-CH3)G - 5’ only when the adenine is methylated
~ the parental plasmid, prepared from a dam+ strain, has been methylated by Dam methylase and is cleaved by DpnI
~ the newly synthesized DNA will not be methylated, and consequently will not be cleaved by the restriction enzyme.
Very rapid (3~4 h) and highly efficient (~80%) at producing mutant plasmid DNA.

Creating random mutations in specific genes. Why?
It is not always possible to know which amino acids of a protein should be altered, or what they should be altered to.
This type of approach requires that a screen is available for identifying protein function.
Some systematic approaches:
Alanine scanning mutagenesis:
~ can identify a.a. side chains that are important for protein function, with the premise that the presence of Ala will not perturb the overall structure of the protein and will only eliminate a.a. side chain interactions.
Charged to alanine scanning mutagenesis:
~ most proteins contain a hydrophobic core with charged residues on the outside surface of the protein, which may participate in, for example, protein-protein interactions.

Two approaches are commonly used for the creation of random mutations within genes:
(1) Doped cassette mutagenesis:
~ like conventional cassette mutagenesis; however, the oligonucleotides do not encode a unique sequence. (They are libraries of oligonucleotides.)
* An appropriate level of doping can be controlled.

Two approaches are commonly used for the creation of random mutations within genes:
(2) Error-prone PCR:
~ exploits the lack of a 3’-5’ exonuclease proofreading activity in Taq DNA polymerase.
The error rate of Taq DNA polymerase can be increased by altering a variety of the PCR reaction conditions.
* Transitions > Transversions

Advantages and Disadvantages of Random Mutagenesis
What are some of the advantages of directed mutagenesis?
Advantages of random mutagenesis:
* Many different mutants encoding a wide variety of proteins are generated.
* Detailed information regarding function of particular amino acids is not necessary.
Disadvantages of random mutagenesis:
* Many mutants have to be assayed to determine which proteins have the desired properties.
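Why many mutants have to be assayed becomes clearer with a toy simulation of error-prone PCR, in which copies pick up scattered substitutions and transitions outnumber transversions. The sketch below makes loud assumptions: the function name `error_prone_copy` is hypothetical, and the `error_rate` and `transition_fraction` values are arbitrary illustration parameters, not measured properties of Taq DNA polymerase.

```python
import random

# Transition partners: purine <-> purine, pyrimidine <-> pyrimidine.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def error_prone_copy(seq, error_rate, transition_fraction=0.7, rng=None):
    """Return a copy of seq with random base substitutions.

    error_rate: per-base probability of a substitution (illustrative).
    transition_fraction: share of substitutions that are transitions
    (an assumed value, chosen only so that transitions > transversions).
    """
    rng = rng or random.Random()
    copied = []
    for base in seq.upper():
        if base in TRANSITION and rng.random() < error_rate:
            if rng.random() < transition_fraction:
                base = TRANSITION[base]
            else:
                # Transversion: one of the two bases of the other chemical class.
                base = rng.choice(
                    [b for b in "ACGT" if b not in (base, TRANSITION[base])]
                )
        copied.append(base)
    return "".join(copied)
```

Calling the function repeatedly on the same template, for example `[error_prone_copy(gene, 0.01) for _ in range(100)]` with `gene` being any DNA string, gives a small library of variants. Most copies of a short gene carry no change or only one change, which is one reason a screen for protein function is needed.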