BioSci D145 Lecture #4
• Bruce Blumberg (blumberg@uci.edu)
  – 4103 Nat Sci 2 - office hours Tu, Th 3:30-5:00 (or by appointment)
  – phone 824-8573
• TA – Ron Leavitt (rleavitt@uci.edu)
  – 4351 Nat Sci 2, 824-6873
  – office hours M 2:30-3:30 4206 Nat Sci 2
• check e-mail daily for announcements, etc.
• Updated lectures will be posted on web pages after lecture
  – http://blumberg.bio.uci.edu/biod145-w2016
  – http://blumberg-lab.bio.uci.edu/biod145-w2016
  – Last year’s midterm is now posted.
  – Term paper outlines due Thursday (1/28) by midnight.
  – No office hours on Thursday 1/28

BioSci D145 lecture 4 page 1 ©copyright Bruce Blumberg 2004-2016. All rights reserved

Term paper outline
• Title of your proposal
• A paragraph introducing your topic and explaining why it is important; i.e., what impact will the knowledge gained have.
  – Why should any funding agency give you money to pursue this research?
    • NIH now requires a statement of human health relevance for all grant applications
    • NSF wants to know the intellectual merit of your proposed research and its broader impacts
• Present your hypothesis
  – A supposition or conjecture put forth to account for known facts; esp. in the sciences, a provisional supposition from which to draw conclusions that shall be in accordance with known facts, and which serves as a starting-point for further investigation by which it may be proved or disproved and the true theory arrived at.
• Enumerate 2-3 specific aims in the form of questions that test your hypothesis
  – At least one of these aims needs to have a strong “whole genome” component
    • Genomics, transcriptomics, proteomics, metabolomics, etc.

DNA Sequence analysis
• Complete DNA sequence (all nts both strands, no gaps)
  – complete sequence is desirable but takes time
    • how long depends on size and strategy employed
  – which strategy to use depends on various factors
    • how large is the clone?
      – cDNA? genomic?
    • how fast is sequence required?
• sequencing strategies
  – Small-scale (not whole genome)
    • primer walking
    • cloning and sequencing of restriction fragments
    • progressive deletions
      – Bidirectional, unidirectional
  – Genome sequencing – nearly always shotgun sequencing
    • whole genome (traditional vs. nextgen)
    • with mapping
      – map first (C. elegans)
      – map as you go (many)

DNA Sequence analysis (contd)
• Primer walking - walk from the ends with oligonucleotides
  – sequence, back up ~50 nt from end, make a primer and continue
• Why back up?
  – Need to see overlap to be sure about the sequence you are reading

DNA Sequence analysis (contd)
• Primer walking (contd)
  – advantages
    • very simple
    • no possibility of losing bits of DNA, as can happen with
      – restriction mapping
      – deletion methods
    • no restriction map needed
    • best choice for short DNA
  – disadvantages
    • slowest method
      – about a week between sequencing runs
    • oligos are not free (and not reusable)
    • not feasible for large sequences
  – applications
    • cDNA sequencing when time is not critical
    • targeted sequencing
      – verification
      – closing gaps in sequences
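
The primer-walking arithmetic above (read, back up ~50 nt, prime again, wait about a week) can be sketched in a few lines. This is only a back-of-the-envelope model: the 700 nt usable read length is an assumed default, the function names are hypothetical, and real projects also need the second strand and repeat reads.

```python
import math


def primer_walk_reads(clone_len, read_len=700, backup=50):
    """Reads needed to cover one strand of a clone by walking from one end.

    The first read covers read_len bases; each later read starts `backup`
    bases before the end of the previous one, so it adds read_len - backup
    new bases. read_len=700 is an assumption, backup=50 follows the slide.
    """
    if backup >= read_len:
        raise ValueError("backup must be smaller than the read length")
    if clone_len <= read_len:
        return 1
    step = read_len - backup
    return 1 + math.ceil((clone_len - read_len) / step)


def rounds_from_both_ends(clone_len, read_len=700, backup=50):
    """Sequential rounds needed when walking inward from both ends at once.

    The two walks are required to overlap by at least `backup` bases where
    they meet. Each round costs roughly one week according to the slide.
    """
    if backup >= read_len:
        raise ValueError("backup must be smaller than the read length")
    step = read_len - backup
    covered_per_end = read_len
    rounds = 1
    while 2 * covered_per_end < clone_len + backup:
        covered_per_end += step
        rounds += 1
    return rounds


if __name__ == "__main__":
    for size in (1000, 3000, 10000):
        print(size, primer_walk_reads(size), rounds_from_both_ends(size))
```

Under these assumptions the number of rounds grows linearly with clone length, which is why the slide calls primer walking the slowest method and unsuitable for large sequences.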

DNA Sequence analysis (contd)
• Cloning and sequencing of restriction fragments
  – once the most popular method
    • make a restriction map, subclone fragments
    • sequence
  – advantages
    • straightforward
    • directed approach
    • can go quickly
    • cloned fragments often useful otherwise
      – RNase protection, nuclease mapping, in situ hybridization
  – disadvantages
    • possible to lose small fragments
      – must run high quality analytical gels
    • depends on quality of restriction map
      – mistaken mapping -> wrong sequence
    • restriction site availability
  – applications
    • sequencing small cDNAs
    • isolating regions to close gaps

DNA Sequence analysis (contd)
• nested deletion strategies - sequential deletions from one end of the clone
• Exonuclease III-mediated deletion
  – cut with polylinker enzyme
    • protect ends
      – 3’ overhang
      – phosphorothioate
  – cut with enzyme between first cut and the insert
    • can’t leave 3’ overhang
  – timed digestions with Exo III
  – stop reactions, blunt ends
  – ligate and size select recombinants
  – sequence
  – advantages
    • unidirectional
    • processivity of enzyme gives nested deletions

DNA Sequence analysis (contd)
• Exonuclease III-mediated deletion (contd)
  – disadvantages
    • need two unique restriction sites flanking insert on each side
    • best used successively to get > 10 kb total deletions
    • may not get complete overlaps of sequences
      – fill in with restriction fragments or oligos
  – applications
    • method of choice for moderate size sequencing projects
      – cDNAs
      – genomic clones
    • good for closing larger gaps
• Small-scale sequence analysis – how is it practiced today?
  – Primer walking
  – ExoIII-mediated deletion with primer walking

Genome sequencing
• The problem
  – Genome sizes for most eukaryotes are large (10^8-10^9 bp)
  – High quality sequences only about 600-800 bp per run
    • Nextgen sequencing is ~75-400 bp
• The solution
  – Break genome into lots of bits and sequence them all
  – Reassemble with computer
• The benefit
  – Rapid increase in information about genome size, gene comparisons, etc.
• The cost
  – 3 x 10^9 bp (human haploid genome) ÷ 600 bp/reaction = 5 x 10^6 reactions for 1x coverage!
  – Need both strands (x2), need overlaps and need to be sure of sequences
  – ~10^7-10^8 reactions/runs required for a human-sized genome
  – About $1-2 per reaction these days, ~$8 commercially.

Genome sequencing (contd)
• Shotgun sequencing NOT invented by Craig Venter
  – Messing 1981 first description of shotgun sequencing
  – Sanger lab developed current methods in 1983
  – approach
    • blast genome into small chunks
    • clone these chunks
      – 3-5 kb, 8 kb plasmid
      – 40 kb fosmid (jumps repetitive sequences)
    • sequence + assemble by computer
  – A priori difficulties
    • how to get nice uniform distribution
    • how to assemble fragments
    • what to do about repeats?
    • how to minimize sequence redundancy?

Genome sequencing (contd)
• “Mate pairs” (figure)

Genome sequencing (contd)
• Shotgun sequencing (contd)
  – How to minimize sequence redundancy?
    • Best way to minimize redundancy is to map before you start
      – C. elegans was done this way - when the sequence was finished, it was FINISHED
        » mapping took almost 10 years
      – mapping much too tedious and nonprofitable for Celera
        » who cares about redundancy, let’s sequence and make $$
        » There is scientific value to draft genomes, too.
    • why does redundancy matter?
      – Finished sequence today costs about $0.50/base
      – Note that at 10x, 99.995% coverage leaves at least 150 kb of the human genome unsequenced

Genome sequencing (contd)
  – Mapping by hybridization
  – Mapping by fingerprinting

Traditional (map first) vs STC (map as you go along) mapping (figure)

DNA sequence analysis
• Landmarks in DNA sequencing
  – Sanger, Nicklen and Coulson. Sequencing with chain terminating inhibitors. Proc. Natl. Acad. Sci. 74, 5463-5467 (1977).
  – Sanger, F. et al. The nucleotide sequence of bacteriophage ΦX174. J Mol Biol 125, 225-46 (1978).
  – Sutcliffe, J. G. Complete nucleotide sequence of the Escherichia coli plasmid pBR322. Cold Spring Harb Symp Quant Biol 43, 77-90 (1979).
  – Sanger et al. Nucleotide sequence of bacteriophage lambda DNA. J Mol Biol 162, 729-73 (1982).
  – Messing, J., Crea, R. & Seeburg, P. H. A system for shotgun DNA sequencing. Nucl Acids Res 9, 309-21 (1981).
  – Anderson, S. et al. Sequence and organization of the human mitochondrial genome. Nature 290, 457-65 (1981).
  – Deininger, P. L. Random subcloning of sonicated DNA: application to shotgun DNA sequence analysis. Anal Biochem 129, 216-23 (1983).
  – Baer et al. DNA sequence and expression of the B95-8 Epstein-Barr virus genome. Nature 310, 207-11 (1984). (189 kb)
  – Innis et al. DNA sequencing with Taq DNA polymerase and direct sequencing of PCR-amplified DNA. Proc. Natl. Acad. Sci. 85, 9436-9440 (1988).

DNA sequence analysis (contd)
• Landmarks in DNA sequencing (contd)
  – 1995 - Haemophilus influenzae (1.83 Mb)
    • first bacterium sequenced, human pathogen
  – 1995 - Mycoplasma genitalium (0.58 Mb)
    • smallest free living organism
  – 1996 - Saccharomyces cerevisiae genome (13 Mb)
  – 1996 - Methanococcus jannaschii (1.66 Mb)
    • first Archaebacterium
  – 1997 - Escherichia coli (4.6 Mb)
  – 1997 - Bacillus subtilis (4.2 Mb)
  – 1997 - Borrelia burgdorferi (1.44 Mb)
    • Lyme disease
  – 1997 - Archaeoglobus fulgidus (2.18 Mb)
    • first sulfur metabolizing bacterium
  – 1997 - Helicobacter pylori (1.66 Mb)
    • first bacterium proven to cause cancer

DNA sequence analysis (contd)
• Landmarks in DNA sequencing (contd)
  – 1998 - Treponema pallidum (1.14 Mb)
  – 1998 - Caenorhabditis elegans genome (97 Mb)
  – 1999 - Deinococcus radiodurans (3.28 Mb)
    • resistant to radiation, starvation, ox stress
  – 2000 - Drosophila melanogaster (120 Mb)
  – 2000 - Arabidopsis thaliana (115 Mb)
  – 2001 - Escherichia coli O157:H7 (4.1 Mb)
    • Pathogenic variant of E. coli
  – 2001 – draft Human “genome”
  – 2002 – mouse genome
  – 2002 – Ciona intestinalis
    • Primitive chordate
  – 2003 – “complete” human genome
  – 2004 – rat genome
  – 2006 – Human “genome” complete sequence of all chromosomes
  – Many more genomes underway, check JGI, Sanger and other web sites

The human genome
• On Feb 12, 2001, Celera and the Human Genome Project published “draft” human genome sequences
  – Celera -> 39,114 genes
  – Ensembl -> 29,691 genes
  – Consensus from all sources ~30K
• Number of genes
  – C. elegans – 19,000
  – Arabidopsis - 25,000
• Predictions had been from 50-140k human genes
  – What’s up with that?
  – Are we only slightly more complicated than a weed?
  – How can we possibly get a human with fewer than 2x the number of genes in C. elegans?
  – Implications?
    • UNRAVELING THE DNA MYTH: The spurious foundation of genetic engineering, Barry Commoner, Harper’s Magazine, Feb 2002

The human genome
• The answer
  – Gene sets don’t overlap completely (duh)
  – Floor is 42K
  – 130,056 UniGene clusters in build #236 (from EST and mRNA sequencing)
  – http://www.ncbi.nlm.nih.gov/unigene
  – Up from 123,459 in 2013 (85,793, 105,680, 128,826, 123,891 in previous years) (“final” count = 42113)
• Important questions to be answered about what constitutes a “gene”
  – Crick genes? DNA-RNA-protein
  – How about RNAs?
  – miRNAs?
  – Antisense transcripts?
  – lncRNAs?

Genome sequencing (contd)
  – Whole genome shotgun sequencing (Celera)
    • premise is that rapid generation of draft sequence is valuable
    • why bother trying to clone and sequence difficult regions?
      – Basically just forget regions of repetitive DNA - not cost effective
    • using this approach, genomes are rarely completely finished
      – rule of thumb is that it takes at least as long to finish the last 5% as it took to get the first 95%
    • problems
      – sequence may never be as complete as that of C. elegans
      – much redundant sequence with many sparse regions and lots of gaps
      – Fragment assembly for regions of highly repetitive DNA is dubious at best
      – “Finished” fly and human genomes lack more than a few already characterized genes

Genome sequencing (contd)
• Knowing what we know now – how to approach a large new genome?
  – Xenopus tropicalis 1.7 Gb (about ½ human)
  – BAC end sequencing
  – Whole genome shotgun
  – HAPPY mapping and radiation hybrid mapping to order scaffolds
  – Gaps closed with BACs
  – 8.5x coverage (but > 9000 scaffolds for 18 chromosomes)
  – Finishing now in process
    • But how “finished” will it be?
• 2016 update – now version 9.0
  – FINALLY integrated BAC end sequences
  – Integrated genetic map
  – 50% of contigs > 72 kb
  – Xenopus laevis – v9.1
    • >90% of genome in chromosomal scaffolds
    • 2 “subgenomes” fully characterized.
    • annotation remains a big challenge.

Functional Genomics - Analysis of gene function on a whole genome basis
• Genome projects
  – DNA sequencing
  – Human genome, mouse, rat, Drosophila, C. elegans “finished”
  – model organisms progressing rapidly
  – Lots of new genes, but many lack known function
• Functional genomics
  – Identification of gene functions
    • associate functions with new genes coming from genome projects
    • function of genes identified from characterizing diseases or mutants
  – Identification of genes by their function
    • discovery of new genes

Methods of profiling gene expression – large scale to whole genome
• What are the possibilities?
  – Array – micro or macro
  – Sequence sampling (EST generation)
  – SAGE – serial analysis of gene expression
  – Massively parallel signature sequencing (RNA-seq, Illumina, 454)
• DNA microarray analysis was, until recently, the totally dominant method
  – Two basic flavors
    • Spotted (spot DNA onto support)
      – cDNA microarrays
      – Oligonucleotide arrays
      – Moderately expensive
    • Synthesized (use photolithography to synthesize oligos onto silicon or other suitable support)
      – Affymetrix GeneChips dominate
      – VERY expensive
  – Both are in wide use and suitable for whole genome analysis

Spotted arrays
• Source material is prepared
  – cDNAs are PCR amplified OR
  – Oligonucleotides synthesized
• Spotted onto treated glass slides
• RNA prepared from 2 sources
  – Test and control
• Labeled probes prepared from RNAs
  – Incorporate label directly
  – Or incorporate modified NTP and label later
  – Or chemically label mRNA directly
• Hybridize, wash, scan slide
• Express as ratio of one channel to the other after processing

DNA microarray types
• Stanford type microarrayer
  – http://cmgm.stanford.edu/pbrown/mguide/index.html
• Printing method
  – Reminiscent of fountain pen

Strategy to identify RAR target genes
• Treat with agonist (TTNPB) or antagonist (AGN193109)
• Harvest st 18
• Poly A+ RNA (each sample)
• Amino-allyl labeled 1st strand cDNA (each sample)
• Label with Alexa Fluor 555 (cy3) or Alexa Fluor 647 (cy5)
• Probe microarrays
• Identify upregulated and downregulated genes

DNA microarray
• Statistical analysis of output
  – VERY IMPORTANT!
• Replicates are very important
• Preprocessing of data is needed
  – To remove spurious signals

DNA microarray
• Advantages
  – Custom arrays possible and affordable
  – Ratio of fluorescence is robust and reproducible
• Disadvantages
  – Availability of chips
  – Expense of production on your own
  – Technical details in preparation

Affymetrix GeneChips
• High density arrays are synthesized directly on support
  – 4 masks required per cycle -> 100 masks per chip (25-mers)
  – Pentium IV requires about 30 masks
  – G.P. Li in Engineering directs a UCI facility that can make just about anything using photolithography

Affymetrix GeneChips
• Streptavidin/phycoerythrin (figure)

Affymetrix GeneChips
  – Each gene is represented by a series of oligonucleotide pairs
    • One perfect match
    • One with a single mismatch
  – Only hybridization to the perfect match but not the mismatch is considered to be real
  – Gene is considered “detected” if > ½ of oligo pairs are positive
  – Number of pairs depends on organism and how well characterized array behavior is
    • Human uses 8 pairs
    • Xenopus uses 16 pairs

Affymetrix GeneChips
• Result is in single color
  – Always need two chips – control and experimental for each condition
  – Also need replicates for each condition
  – For diverse biological samples (e.g., humans) 10 replicates required!
  – For less diverse samples (cell lines) probably 5 replicates needed
• Advantages
  – Commercially available
  – Standardized
• Disadvantages
  – About $700 to buy, probe and process each chip (at UCI)!
    • About $500 elsewhere
  – May not be available for your organism of interest
  – No ability to compare probes directly on the same chip
    • Must rely on technology

DNA microarrays
• What are they good for?
  – Identifying genes expressed in one condition vs. another
    • One tissue vs. another (heart vs. liver)
    • Tissue vs. tumor (liver vs. hepatocarcinoma)
    • In response to a treatment (e.g., RA)
    • In response to disease (e.g., after viral infection)
  – Building expression profiles
    • Tissues
    • Cancers
    • Developmental stages
    • Expressed genes
  – Identifying organisms in food
    • Array can identify which animals are present in a mix
    • http://www.dnavision.com/files/FOODIDBrosh%20En.pdf

DNA microarrays
• What are they good for? (contd)
  – Response of animal to drugs or chemicals
    • Toxicogenomics
    • Pharmacogenomics
  – Diagnostics
    • SNP analysis to identify disease loci
    • Specific testing for known diseases

DNA microarrays
• What are the limitations of microarray technology? What sorts of factors might confound the experiment?
  – Signal intensity (or signal/noise)
    • Improved dyes, label uniformly
  – Biological variation (samples are inherently different)
    • Sufficient # of replicates is key
    • keep individuals separate
  – Not all mRNAs will be present at sufficient levels to detect
    • Amplification, but beware of bias
  – Good statistical analysis is required
    • Bayesian statistics are best (Pierre Baldi is local expert)
      – calculating the probability of a new event on the basis of earlier probability estimates which have been derived from empiric data
      – i.e., don’t assume random distribution in datasets, calculate probability based on real data
      – Bayesian approach great for small number of replicates, converges on t-test at high number of replicates
    • http://cybert.microarray.ics.uci.edu/
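
The point that a Bayesian approach helps with few replicates and converges on the t-test with many can be illustrated with a small sketch. This is a generic variance-shrinkage illustration, not the algorithm used by the Cyber-T site: the weighted-average formula, the function names, and the default `prior_weight` are all assumptions made for the example, and `prior_var` stands in for a background variance that would have to be estimated from real data (for instance from other genes of similar intensity).

```python
import math
from statistics import mean, variance


def regularized_variance(sample_var, n, prior_var, prior_weight):
    """Weighted average of a prior (background) variance and the sample variance.

    prior_weight acts like a number of pseudo-observations: with few real
    replicates the prior dominates, with many replicates the data dominate.
    """
    return (prior_weight * prior_var + (n - 1) * sample_var) / (prior_weight + n - 1)


def regularized_t(control, treated, prior_var, prior_weight=10):
    """Unequal-variance t statistic computed with shrunken per-group variances."""
    n1, n2 = len(control), len(treated)
    v1 = regularized_variance(variance(control), n1, prior_var, prior_weight)
    v2 = regularized_variance(variance(treated), n2, prior_var, prior_weight)
    return (mean(treated) - mean(control)) / math.sqrt(v1 / n1 + v2 / n2)


def plain_t(control, treated):
    """Ordinary unequal-variance t statistic (prior_weight = 0 removes the prior)."""
    return regularized_t(control, treated, prior_var=0.0, prior_weight=0)


if __name__ == "__main__":
    # Two replicates per group that happen to agree closely: the plain t
    # statistic is inflated by the tiny sample variance, the regularized one
    # is tempered by the assumed background variance.
    control = [1.00, 1.01]
    treated = [1.20, 1.21]
    print(plain_t(control, treated))
    print(regularized_t(control, treated, prior_var=0.04))
```

With only two or three replicates a gene can show a very small sample variance by chance, which inflates the ordinary t statistic; pulling the variance toward a background value damps that effect, and the pull fades as the number of replicates grows.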

Other methods of transcriptome analysis - parallel
• Microarray was once the dominant method
  – Direct RNA sequencing methods are rapidly displacing microarrays
  – SAGE (serial analysis of gene expression)
    • NanoString nCounter is modern implementation
    • Very short sequences
  – RNAseq
    • Directly sequence large numbers of RNAs
    • Longer sequences
• SAGE
  – Relies on generating many very short sequences and matching these to the genome
  – 10 bp = short SAGE
  – 17 bp = “long” SAGE

Other methods of transcriptome analysis - parallel
• SAGE (continued)
  – What is the obvious shortcoming of this method?
  – Sequences may not be unique and could be difficult to map to the genome

Other methods of transcriptome analysis - parallel
• RNA seq – Ali Mortazavi is local expert
  – Use of massively parallel sequencing allows precise quantitation of transcripts
  – Also allows discovery of rare splice forms
  – Discovery of unexpected transcripts
  – Main problem is in mapping sequence calls to genome
    • Sequencing has 1-2% errors which can make mapping to genome fail
    • or induce “in silico cross-hybridization”
      – Mapping to incorrect genomic location

Microarray vs. RNAseq
• Microarray
  – Assumes you know all the transcripts that are expressed in the organism/tissue of interest
  – Any sequence you did not know was expressed will not be there.
    • except whole genome tiling arrays
  – Detection limit issues
    • Signal-noise ratio
  – Well validated, expression analysis can be quantitative
    • Not usually performed quantitatively
• RNAseq
  – No assumption re transcripts but best with genome sequence.
  – Works less well without
  – Can discover novel sequences or new splice forms not yet characterized (if you have genome)
  – Detection limits are not a problem – can detect small #
  – Getting better, expression analysis is quantitative with read depth ≥ 20 x 10^6 mapped reads
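
The SAGE shortcoming raised earlier (10 bp and 17 bp tags may not be unique in the genome) can be made concrete with a rough calculation. The sketch below assumes a genome of random sequence with equal base frequencies, which real genomes are not, so repeats make the true situation worse. The 3 x 10^9 bp genome size follows the human figure used in the genome sequencing slides, and the function names are hypothetical.

```python
def possible_tags(tag_len):
    """Number of distinct sequences of a given length over the alphabet A, C, G, T."""
    return 4 ** tag_len


def expected_chance_hits(tag_len, genome_size=3e9, strands=2):
    """Expected number of places one specific tag occurs by chance.

    Assumes a random genome with equal base frequencies and counts both
    strands; this is an idealization used only for illustration.
    """
    return strands * genome_size / possible_tags(tag_len)


if __name__ == "__main__":
    for tag_len in (10, 17):
        print(tag_len, possible_tags(tag_len), expected_chance_hits(tag_len))
```

Under this idealized model a 10 bp tag is expected to occur by chance at thousands of positions in a human-sized genome, while a 17 bp tag is expected at fewer than one, which is the motivation for “long” SAGE and for the longer reads used in RNAseq.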