1. ANALYSIS OF GENETIC INFORMATION II
   a. Three key events radically transformed the field of genetics:
      i. 1860s: Mendel's fundamental principles of genetics
      ii. 1953: Watson & Crick, DNA structure
      iii. 1990: Start and continuation of the Human Genome Project
   b. Goal of the Human Genome Project: to _________________ and analyze the human genome in conjunction with the genomes of several model organisms.
   c. What would they look at? A haploid human genome consisting of _______ _______________ ______________________________ that would represent 99.9% of the information contained in a diploid set of chromosomes.
      i. The 24 DNA molecules contain a total of: ______________________ nucleotides
      ii. Each molecule ranges in size from: _______________________________________.
2. GENOMICS
   a. GENOMICS: The study of whole genomes.
   b. GENOME: The sum total of genetic information in a particular cell or organism.
   c. GENOME PROJECT: A large-scale, often multi-laboratory effort required to sequence a complex genome.
   d. 2001: First ________________________________ of sequence of human genome. Problems:
      1. _________________________________________________________
      2. __________________________________________________________
   e. 2003: An accurate sequence covering ________________ of genome completed 2 years ahead of schedule.
   f. 2006: Published finished human genome sequence with greater than 99% coverage and 99.99% accuracy.
   g. 2009: Whole genome sequences had been completed for ____________ distinct species, as well as identification of 2359 ________________ causing different _________________ disease.
   h. 2013: Whole genome sequences were completed for greater than _______________ distinct species.
3. Creating Genomic Libraries
   a. Each colony present on the agar plates in this image contains a different recombinant plasmid with a different fragment of the human genome.
   b. IMPORTANT NOTE: Small fragments of human genomic DNA cannot reproduce themselves in a cell.
      A "vector" must be used to manage the DNA. Why?
      i. Vector: a vehicle for introducing transgenes into living cells.
      ii. The vector together with the inserted piece of foreign DNA (DNA from two different origins) is a recombinant DNA molecule.
   c. Why #1: DNA sequences must be present that promote replication. DNA fragments can lack the regulatory sequences that tell the cell what to do with this piece of DNA.
   d. Why #2: There must be a method by which the vector signals its presence in the cell by conferring a detectable property on the host cell (example: blue or white colonies).
   e. RECOMBINANT DNA MOLECULES:
      i. Both the human DNA and the vector DNA (bacterial in this image) have been cut with the same restriction enzyme.
Bio 110 010 student Genetic Analysis II Beavers Page 1 of 9
      ii. "Sticky ends" produced by using the same restriction enzyme allow for complementary base pairing – regardless of the origin of each fragment of DNA.
      iii. The simplest vector is the plasmid – it can be more useful if it has been engineered to have more than one recognition site for more than one restriction enzyme.
      iv. Each plasmid should have an origin of replication (ORI) permitting the vector to replicate independently from the bacterium's chromosomal DNA.
      v. Each plasmid should include a gene for antibiotic resistance to allow for selection.
      vi. The plasmid should be able to be purified from the bacteria for study of the DNA (we do this with a mini-prep).
      vii. Larger vectors include artificial chromosomes:
         1. BAC (bacterial artificial chromosome) holds a DNA fragment (insert) up to 300 kb.
         2. YAC (yeast artificial chromosome) holds a DNA fragment (insert) up to 2000 kb (2 Mb).
4. ONE MORE LOOK AT SANGER SEQUENCING
   a. Clones (colonies) present in a genomic library on an agar plate represent different human DNA fragments. Their arrangement on the plate gives no indication of their relative order in the genome.
   b. Each insert (DNA fragment) present in vectors must be SEQUENCED.
   c. Sequencing of the human genome was accomplished using the original method developed by Fred Sanger in the 1970s, once it had been automated.
   d. Sequencing is based on hybridization – the natural tendency of complementary single-stranded molecules of DNA or RNA to base pair and form double helices.
   e. Key requirements for sequencing:
      i. DNA polymerase (enzyme) to catalyze DNA replication.
      ii. A template (a single strand of DNA).
      iii. Deoxyribonucleotide triphosphates (dATP, dCTP, dGTP and dTTP) as building blocks for the new strand.
      iv. A primer (complementary to part of the template) that provides the free 3' end to which DNA polymerase can attach new nucleotides. The sequence must be known to produce a primer. The sequences of the fragments inserted in the vectors are usually not known; using one strand of the vector's known sequence as the template for designing the primer solves the issue.
   f. Mechanism of sequencing
      i. DNA polymerase produces a series of single-stranded fragments using the unknown fragment as a template.
      ii. Each fragment produced differs in length from the next by a SINGLE NUCLEOTIDE.
      iii. The graduated set of fragments is called a nested array.
      iv. Each fragment is identified by relative length and one of four terminating nucleotides.
         1. The fragments are produced using normal deoxyribonucleotide triphosphates and four dideoxyribonucleotide triphosphates: ddATP, ddCTP, ddGTP and ddTTP.
         2. These dideoxyribonucleotide triphosphates LACK A 3'-OH (hydroxyl group), so they terminate the growing strand whenever they are added.
         3. The four dideoxyribonucleotide triphosphates are each labeled with a different-colored fluorescent dye.
      v. DNA polymerase adds nucleotides to the millions of replicating strands and continues with each strand until a dideoxyribonucleotide triphosphate (ddNTP – the "N" stands for any of the four) terminates replication.
      vi. The mixture of variable-sized fragments is run through polyacrylamide gel electrophoresis under conditions that separate fragments differing by one nucleotide.
      vii. A detector transmits information about the fluorescent signals of the DNA fragments to a computer, which interprets the signals as a series of different-colored peaks representing nucleotides that are COMPLEMENTARY to the original template fragment. This output is referred to as a READ.
         1. READ: in a single DNA sequencing run, a digital file of the sequence of As, Cs, Gs and Ts comprising the newly synthesized DNA.
5. SANGER SEQUENCING AUTOMATED
6. HUMAN GENOME PROJECT - Maps
   a. Started with a genome-wide linkage map, then a physical map, ending with a sequence map.
   b. Linkage maps, which we have reviewed previously, depict the distances BETWEEN loci as well as the order in which they occur in the organism. This technique can map a small number of loci in a relatively small region of the genome. The terms linkage map and genetic map can be used interchangeably when talking about maps produced through analyses of recombination frequencies.
   c. Researchers have expanded on techniques to produce physical and sequence maps.
   d. Physical map: a map of locations of identifiable landmarks on DNA, for example, restriction enzyme cutting sites or genes.
   e. Human genome lowest-resolution physical map: banding pattern on the 24 different chromosomes.
   f. A physical map is a constellation of overlapping DNA fragments that are ordered and oriented and span each of the chromosomes in a genome.
   g. They are the molecular counterpart of linkage maps.
   h. Unlike linkage maps, which use recombination frequencies to map, physical maps are based on direct analysis of genomic DNA.
      i. They chart the actual base pairs (bp), kilobases (kb) or megabases (Mb) that define or separate a locus from its neighbor.
      ii. Humans: on average, 1 cM (or m.u.) of linkage map distance ≈ 1 Mb of physical map distance.
      iii. 1 cM = 1% recombination frequency
      iv. 1 Mb = 1 million nucleotides
   i. Short-range physical maps, produced with multiple restriction enzymes and probed for genes or markers, are just a smaller-scale version of how researchers produce physical maps of chromosomes that average hundreds of thousands of base pairs.
7. KARYOTYPE – Maps
   a. KARYOTYPE: (lowest-resolution physical map) the visual description of the complete set of chromosomes in one cell of an organism. Idiogram: black and white diagram of the chromosomes converted from the light and dark bands observed under the microscope.
   b. REVIEW if needed:
   c. We learned previously that chromosomes at metaphase of mitosis can be stained with a Giemsa dye and viewed under a light microscope.
   d. The regions of dark and light, called bands and interbands, are used by cytogeneticists as: landmarks.
   e. The landmarks: distinguish each homologous chromosome pair from other pairs.
   f. Karyotype analysis produces: low-resolution physical maps that locate where on the chromosome you might find:
      i. Cloned genes
      ii. Markers: an identifiable physical location on a chromosome, whose inheritance can be monitored. Markers can be expressed regions of DNA (genes) or any segment of DNA with variant forms that can be followed.
   g. This visual description is termed a karyotype.
   h. Autosomes are numbered in order of descending length.
   i. Each band is numbered starting at the centromere and moving out along each arm (p and q) toward the telomere.
   j. Banding resolution can increase as staining techniques improve.
   k. See chromosome 7 at 3 levels of resolution.
      i. Cells to be examined must be capable of growth and rapid division in culture – most accessible – white blood cells, specifically T-lymphocytes.
      ii. Higher levels of resolution are achieved by different staining methods and timing of staining.
      iii. G-banding or R-banding is utilized on chromosomes at prophase or prometaphase when they are still in a relatively uncondensed state.
         Ideal for subtle structural abnormalities in the chromosome.
      iv. Standard banding: ~450 total bands
      v. Prometaphase banding: ~550-850 bands or more
8. SPECTRAL KARYOTYPING (SKY) - maps
   a. Specialized application of FISH (fluorescent in situ hybridization).
   b. FISH: a physical mapping approach that uses fluorescent tags to detect hybridization of nucleic acid probes with chromosomes.
9. FISH - maps: map making and site of interest detection.
   a. FISH can show the location of a particular DNA sequence within the genome by hybridizing a single fluorescent probe to a chromosome.
   b. Some example uses: mapping, detecting deletions or additions, even extra whole chromosomes.
10. HIGH RESOLUTION PHYSICAL MAPPING
   a. The ultimate goal of high-resolution physical mapping is the generation of one large contig for each chromosome.
   b. CONTIG (from the word contiguous): a set of 2 or more overlapping cloned DNA fragments that together cover an UNINTERRUPTED stretch of the genome.
   c. Why not just read the DNA from one end to the other?
   d. SEQUENCE ASSEMBLY (a real challenge to researchers): the compilation of THOUSANDS or MILLIONS of independent DNA sequence reads (i.e., sequence data) into a set of contigs and scaffolds.
   e. How do they build up all of the individual segments of DNA into a consensus sequence?
   f. CONSENSUS SEQUENCE: The nucleotide sequence of a segment of DNA that is in agreement with most sequence reads of the same segment from different individuals.
   g. Overlapping contigs can produce sequence maps: maps that show the order of nucleotides in a cloned piece of DNA.
   h. PROBLEMS:
      i. The length of segments is not the only problem to overcome. As in all experimental observation, automated sequencing machines do not always give perfectly accurate sequence reads, and the error rate is not constant.
      ii. To ensure accuracy, genome projects obtain multiple independent sequence reads of each base pair of a genome.
         Tenfold coverage (10x) helps ensure that chance errors in the reads do not give a false reconstruction of the consensus sequence.
11. WHAT STRATEGIES WERE EMPLOYED TO GET SEQUENCE MAPS?
   a. Two strategies:
      1. Hierarchical Strategy
      2. Whole Genome Shotgun Sequencing
12. HIERARCHICAL SHOTGUN SEQUENCING STRATEGY – Map first, sequence later.
   a. SHOTGUN: sequencing approach in which the overlapping insert fragments to be sequenced have been randomly generated in one of three ways:
      i. SOURCE OF DNA: BACs, genome sonication (shearing DNA with sound), restriction digest
         1. Produce a genomic BAC library. (Bacterial Artificial Chromosome, based on the F fertility plasmid in bacteria.)
         2. Develop a map of overlapping BAC clones.
            a. Large clone contigs are screened for similarities: restriction enzyme recognition sites, short tandem repeats or STSs (STS: a one-of-a-kind marker that tags a position along the DNA molecule). Organize a minimal tiling path (minimally overlapping regions).
         3. Produce shotgun clones.
            a. Choose a BAC insert to be sequenced and shear it into ~2 kb fragments.
         4. Sequence.
13. WHOLE-GENOME SHOTGUN SEQUENCING STRATEGY – Sequence first, map later.
   1. Shear DNA 3 times to construct fragments of 3 different sizes and produce 3 libraries.
   2. Sequence.
   3. Assemble into maps, using sequence reads to build contigs and contigs to build scaffolds.
   4. Use unique sequence overlaps found in sequence reads to build CONTIGS.
   5. Paired-end reads can be used to span gaps to order and orient CONTIGS into SCAFFOLDS.
14. WHOLE-GENOME SHOTGUN SEQUENCING
15. Celera's three different sizes of generated fragments provided spatial information about the clones
16. CHALLENGES
   a. SEQUENCING ERROR – all machines make errors
   b. DISTINGUISHING SEQUENCE ERROR FROM POLYMORPHISMS
      i. Polymorphism: variant of a gene or any genomic DNA sequence that has two or more alleles.
   c. REPEATED SEQUENCES – where do they belong?
   d. UNCLONABLE DNA CANNOT BE SEQUENCED (heterochromatic)
      i. A high proportion of the DNA located in regions of constitutive heterochromatin consists of long stretches of simple repetitive sequences like SSRs (simple sequence repeats). In addition, heterochromatic regions are often repositories for many transposable elements.
17. Sequenced yes. Understood? Not completely.
   a. Which DNA sequences once ordered correspond to genes? Centromeres? Telomeres? Transposable elements?
   b. Clues for the location of genes – ORFs and locating transcribed regions:
      1. Open reading frames (ORFs): stretches of nucleotides that have a reading frame of triplets uninterrupted by a stop codon.
      2. Any sequence of DNA can have 6 reading frames.
      3. Other information is necessary to verify.
   c. All genes are transcribed into RNA – even if they will never be translated.
      i. Works well for RNAs that are abundant – like rRNAs.
      ii. Less effective for RNAs that are relatively rare in a cell – like mRNAs.
      iii. Once either is obtained – copy into DNA to study (cDNA).
   d. Annotation of the genome: analyzing which sequences of DNA do which tasks.
18. GENETIC VARIATION
   a. Genomes of Watson, Venter and an unknown Chinese man reveal in total more than 5.6 million single nucleotide differences from the "standard" human genome.
   b. No standard human genome length.
   c. Until DNA could be evaluated on a molecular basis, all wild-type individuals were presumed to have the same alleles.
   d. It was found that wild-type individuals of the same species could produce variant forms of proteins, encoded by variant alleles.
   e. This is the origin of the term polymorphic:
   f. Polymorphic: a locus with two or more distinct alleles in a population.
   g. Genetic variants: describes alleles of a polymorphic locus.
   h. Polymorphism: variant of a gene or noncoding region that has two or more alleles. Molecular geneticists use this term to describe a variant of a locus within a population of organisms that has two or more alleles.
      Population geneticists reserve the term for variants at a locus where two or more alleles are present at a frequency of 1% or greater; for example, to describe the alternative forms of a gene that has more than one wild-type allele.
   i. Locus: any location (gene or not, single base pair or millions of base pairs) in the genome that is defined by chromosomal coordinates, regardless of biological function.
   j. In light of the broadened view of locus, we must recognize that an allele of any locus is a variation in the DNA sequence itself, even if it has no impact on the expression of any trait. Functional or not, this makes no difference in the manner that a locus is transmitted from one generation to the next.
   k. Researchers can use nonfunctional loci as genetic markers to identify, locate, isolate, and follow the transmission of nearby genes.
   l. Anonymous DNA polymorphisms: differences in genomic DNA sequence with no effect on gene function.
19. SINGLE NUCLEOTIDE POLYMORPHISMS (SNPs)
   a. SNPs: simplest and most useful class of genetic variant.
   b. SNPs defined: single nucleotide polymorphisms. Particular base positions in the genome where alternative letters of the DNA alphabet commonly distinguish some people from others.
   c. SNPs occur due to: a mistake in DNA replication, a mutagenic chemical, or radiation.
   d. Account for most of the total variation that exists between human genomes.
   e. Occur on average once every 1000 bases.
   f. Derived allele: an allele that arises through mutation.
   g. Ancestral allele: allele carried by the last common ancestor of two species.
20. RESTRICTION SITE-ALTERING SNPs DETECTED BY SOUTHERN BLOT OR PCR
   a. Because alleles of a SNP locus are well-defined, single-base changes in DNA sequence, they can be distinguished by a variety of molecular methods:
      i. Restriction enzyme digest
      ii. Gel electrophoresis
      iii. Southern blotting
      iv. PCR
      v. Allele-specific oligonucleotide hybridization: allows short hybridization probes to distinguish single base mismatches under the correct experimental conditions.
      vi. DNA microarrays – which allow labs to detect SNP alleles at over 1 million locations (loci) for a few hundred dollars.
21. InDels or DIPs (Deletion-Insertion polymorphisms)
   a. Genetic variation can be caused by addition or loss of DNA.
   b. This would be the deletion, duplication or insertion of genetic material in chromosomes.
   c. It can range from the loss or gain of one base pair all the way to the loss or gain of multiple megabases (millions of bases).
   d. The second most common form of genetic variation in the human genome is represented by InDels or DIPs (deletion-insertion polymorphisms).
   e. DIPs (InDels): short insertions or deletions of genetic material. Range in length from one base pair to hundreds of base pairs.
   f. Can cause a frameshift mutation if the number of nucleotides inserted or deleted is not a multiple of 3.
22. SSRs: Simple Sequence Repeats (Microsatellites)
   a. SSRs (Simple Sequence Repeats) have more of a size differential and can be easily detected by PCR and gel electrophoresis.
   b. SSRs: sequences of one to a few bases that are REPEATED in tandem fewer than 10 to more than 100 times.
   c. Also called STRP: short tandem repeat polymorphism
   d. Human genomes, as well as those of other complex organisms, are loaded with loci defined by simple sequence repeats.
   e. DETECTION BY PCR AND GEL ELECTROPHORESIS:
23. SSRs, PCR and DNA Fingerprinting
   a. Using the product rule for independent assortment: the likelihood that any two random individuals share exactly the same combination of two alleles at a particular SSR locus is 10%.
   b. Same combination of alleles at a second SSR locus as well? 0.10 x 0.10 = 0.01 (1%)
   c. How about at the 13 positions used for DNA fingerprinting? (0.10)^13 = 10^-13, or 1 chance in 10 trillion – right now the earth only has approximately 7 billion human beings.
   d. FBI maintains CODIS – a database of DNA fingerprints using the same 13 SSR loci.
   e. All 50 states mandate collection of DNA fingerprint data from felons convicted of certain crimes, such as sexual offenses. Also includes missing persons.
24. COPY NUMBER VARIANTS (CNVs), COPY NUMBER POLYMORPHISMS (CNPs)
   a. CNVs or CNPs (the name depends on frequency of occurrence in a population) are a category of genetic variation arising from LARGE regions of duplication or deletion.
   b. Length of a CNV (a large block of genetic material with a variable repeat number) is 10 bp to 1 Mb per repeat.
   c. People possess variation in the number and type of olfactory receptor genes they have.
   d. Unequal crossing over is a potential cause of CNVs.
25. Minisatellites - DNA Fingerprinting with Restriction Enzymes
   a. Many definitions for minisatellites: repeats having a unit size in the range of 500 bp to 20 kb.
   b. Perhaps best generally defined as a VNTR (variable number tandem repeat): a type of DNA polymorphism created by a tandem arrangement of multiple copies of short DNA sequences.
   c. The POWER of minisatellites: particular minisatellite sequences often occur at a small number of different genomic loci. (Microsatellites are at thousands of locations in the genome.)
   d. Using restriction enzyme digest, gel electrophoresis, and Southern blot hybridization, researchers can look simultaneously at allelic variation at multiple UNLINKED loci.
26. PREIMPLANTATION GENETIC DIAGNOSIS
   a. A technique that allows couples to establish the genotype of their embryos: harvested eggs are fertilized in vitro with the partner's sperm, and at approximately the 6-10 cell stage, one cell is removed from each viable embryo for testing.
27. DNA MICROARRAY
   a. A DNA array is a large set of DNA fragments displayed on a solid support (like a glass chip the size of a microscope slide cover).
   b. The goal: to analyze THE LEVELS OF EXPRESSION of the individual genes that are represented on the chip using the DNA samples being hybridized.
   c. Possible uses: (not an exhaustive list!!)
   d. Compare tissue- and cell type-specific gene expression.
   e. Compare developmental stage-specific gene expression.
   f. Compare gene expression during differentiation or even tumorigenesis.
   g. Analyze two different cDNAs taken from one kind of cell at DIFFERENT phases of the cell cycle.
   h. Analyze inducible gene expression (how cells respond to environmental changes: hormones, heat shock, chemicals).
   i. The list goes on and on…
28. What if it were abnormalities detected in ultrasound?
   a. Signature genomics.
29. WGS vs. WES and Massively Parallel Sequencing
   a. WGS: whole-genome sequencing
   b. WES: whole-exome sequencing: sequencing of only genomic DNA corresponding to exons.
   c. Exome: the portion of a genome corresponding to exons; in humans, the exome is approximately 2% of the genome.
   d. Based on Sanger sequencing but with some new additions:
      i. Individual DNA molecules being synthesized by DNA polymerase are anchored in one place.
      ii. These methods control base addition temporarily so that each base can be identified before the next base is added.
      iii. In some systems, the sensitivity of detection is so high that a single molecule of DNA can be monitored without the need for cloning or PCR amplification steps.
      iv. SEQUENCING MACHINES ARE ABLE TO RECORD THE SUCCESSIVE ADDITION OF NUCLEOTIDES TO EACH OF THE MILLIONS OF GROWING DNA MOLECULES IN REAL TIME.
      v. How?
         1. Millions of fragments of single-stranded genomic DNA, to which poly-A has been enzymatically added at the 3' end, are hybridized to oligo-dT molecules attached to the surface of a special microarray called a flowcell.
         2. Using the genomic fragment as a template and the oligo-dT as a PRIMER, DNA polymerase synthesizes new DNA from nucleotides carrying colored, base-specific fluorescent tags. These nucleotides are also blocked at their 3' ends so that only one nucleotide can be added at a time. This chemical block is reversible.
         3. After a high-resolution camera photographs the fluorescence, chemicals applied to the flowcell remove the tag and blocking group from the just-added nucleotide.
         4. Each subsequent cycle begins by infusing the flowcell with a new dose of tagged nucleotides and polymerase, and is followed by an iteration of step 3.
         5. The sequencing machine takes about 100 pictures that record a sequence of colored flashes at each of millions of spots on a flowcell where a single DNA molecule is being synthesized.
         6. The machine's computer rearranges the data into millions of short sequence reads of about 100 nucleotides, and then assembles the genome sequence.
         7. Wow.
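The cycle described in the steps above (add one blocked, tagged nucleotide at every spot; photograph; remove tag and block; repeat) can be sketched as a toy simulation. This is only a minimal Python illustration, not the software a real sequencing machine runs. The function names, the color assigned to each base, and the convention that each template string is written in the order the polymerase encounters it are all assumptions made for the example.

```python
# Toy model of sequencing by synthesis with reversible terminators.
# All names and the color scheme below are hypothetical, chosen for illustration.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
TAG_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}  # arbitrary colors
COLOR_TO_BASE = {color: base for base, color in TAG_COLOR.items()}


def sequence_by_synthesis(templates, cycles):
    """Return one 'photo' per cycle: the list of tag colors seen at each spot.

    Each template is assumed to be written in the order the polymerase reads it.
    Because the 3' block allows exactly one addition per cycle, cycle k reports
    the nucleotide paired with template position k.
    """
    photos = []
    for cycle in range(cycles):
        photo = []
        for template in templates:
            if cycle < len(template):
                added = COMPLEMENT[template[cycle]]  # base added opposite the template
                photo.append(TAG_COLOR[added])       # camera records its tag color
            else:
                photo.append(None)                   # template exhausted: no flash
        photos.append(photo)
        # Tag and blocking group are removed here, so the next cycle can proceed.
    return photos


def photos_to_reads(photos):
    """Rearrange per-cycle photos into one sequence read per spot."""
    n_spots = len(photos[0]) if photos else 0
    reads = []
    for spot in range(n_spots):
        colors = [photo[spot] for photo in photos]
        reads.append("".join(COLOR_TO_BASE[c] for c in colors if c is not None))
    return reads


if __name__ == "__main__":
    photos = sequence_by_synthesis(["TACG", "GGA"], cycles=4)
    print(photos_to_reads(photos))
```

Each read comes out as the complement of its template, which matches the idea that a read records the newly synthesized strand. A real machine does this at millions of spots for about 100 cycles and must also deal with imperfect signals, which this sketch ignores.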