Today’s Topics
  Computer Science
    Enabled by Computing: Decoding the Human Genome
  Upcoming
    Review for Final Exam

CPS 001 41.1

Enabled by Computers
  Things we now take for granted are possible only because of computing.
  Several examples (most mentioned before):
  1. Modern Camera Zoom Lens
  2. Certain Space Missions: e.g., “Sling Shot” paths
  3. Medical Imaging
     o CAT scans (Nobel Prize!)
     o Other imaging procedures: PET, …
  4. Designing and Manufacturing a modern Computer
  5. Communications (error checking, compression, …)
  6. Decoding the Human Genome

The Human Genome
  Each cell contains a Nucleus
  The human Nucleus contains 24 distinct Chromosomes (22 autosomes plus X and Y)
  Chromosomes (composed of DNA) collectively include 20-25 thousand Genes
     o E.g., Chromosome 5 includes 5923 genes
  Chromosomes are composed of, collectively, 3.5 Gbp (3,500,000,000 base pairs)
  Good Diagram of DNA: http://www.accessexcellence.org/RC/VL/GG/dna2.html

The Human Genome
  Makeup: The Double Helix (DNA)
  3.5 Gbp
     o (how big a number can an int hold?)
  Bases denoted by letters A, C, G, T
     o Adenine, Cytosine, Guanine, Thymine
  Each strand of DNA (in each of our cells) is approx. 6 feet long!
     o (packed into a volume approx. 0.0004 inches across)
  Letters printed as a string, 1 mm apart, would be almost 1900 miles long

How to Read (Sequence) DNA?
  Look at the following strings
  Assume we didn’t know the alphabet
  Can we reconstruct the alphabet from these fragments?
     A AB ABCDE ABCDEF BCDEF CDEFGH FGHIJ GHIJK GHIJKL IJKLMN KLMNO
     LMNOP MNOPQR OPQRST PQRST QRSTU STUVWX UVWXY UVWXYZ VWXYZ YZ Z
  If we assume each letter is used only once, we can match on a single character
     o ABCDEF + FGHIJ yields ABCDEFGHIJ
  If uncertain whether letters repeat, may require a longer overlap
     o IJKLMN + MNOPQR yields IJKLMNOPQR
  Can reconstruct the complete alphabet from the fragments

Reconstruction from DNA fragments
  Problem is more difficult
     o Only 4 characters: A C G T
     o All kinds of repetition in the sequence
  Need larger overlap: how large?
     o Depends on the kind of repetition we find
  Look at an example with a sequence much longer than the alphabet
     o Fragments shown come from chopping up three identical copies of the sequence
     o Breaks at “random” points

Reconstruction from DNA sequence
  Look at the following fragments (from 3 originals)
     AAGATGGTTCATTCT  ACGGGCGGTGTTGGAGCAGA  AGAGCT  AGGTATATTGAGGAAG
     ATTGT  CAAGTAAAAGGA  CATTGTCAAGTAAAAG  CCAACTAGTCAGCACTAC
     CCAACTAGTCAGCACTACAT  CGGGCGGTGTTGGAGC  CTGCAATTTCTG  GAAGGTATAT
     GACTTGGGTA  GCTCTGCAATTTCTG  GCTGGGG  GCTGGGGA  GCTGGGGACGGGCGGTGT
     TAGTCAGCACTA  TCTGCAATTTCTGCCAAC  TGAGGAAGAAGA  TGAGGAAGAAGATGGTTCA
     TGGAGCAGAGC  TGGTTCATTCTGACTTGGGTA  TGTCAAGTAAAAGGAAGGTATAT
     TTCTGACTTGGGTA
  Identify overlaps to reconstruct
     TCTGCAATTTCTGCCAACTAGTCAGCACTACAT
     AGGTATATTGAGGAAGAAGATGGTTCA
  Eventually can get the original sequence (one sequence, wrapped across two lines)
     GCTGGGGACGGGCGGTGTTGGAGCAGAGCTCTGCAATTTCTGCCAACTAGTCAGCACTA
     CATTGTCAAGTAAAAGGAAGGTATATTGAGGAAGAAGATGGTTCATTCTGACTTGGGTA

The Real World
  Have looked at toy problems: back to reality
  Why the obsession with fragments?
     o String lengths are huge: 3 × 10^9
     o If we can sequence (read) a fragment, why not just do the whole thing?
  Automatic Sequencers Available
     o Limited to lengths on the order of 1000 bases from an end
     o (Can sequence a whole strand if it is short enough)
  Thus the use of the Shotgun Method of Sequencing

Shotgun Sequencing
  Strand much too long for automatic sequencing
  Randomly cut the strands into small pieces (~5 Kbp)
  Make many identical copies of these pieces
  Each of these small pieces is sequenced to produce reads
  What’s left is a Data Processing Problem
     o Need to reconstruct the original DNA strand by matching ends
     o If random reads match nicely, work can be completed
  There may be problems!

Shotgun Problems
  Gaps
     o Due to the random nature of shearing strands, there may be gaps in the sequence
     o (Maybe all pieces broke at the same place)
     o May need to repeat the process for some sub-areas to fill gaps
  Repeats
     o Long repeats may make matching ambiguous
     o Need extra long fragments with ends sequenced
        - Can tell how many repeats “fit” in
        - Also can bridge gaps that have resisted sequencing
  Sequencing Errors
     o Automatic sequencing is error prone: need multiple passes

The Computations Required
  Appears to be a Simple String Matching Problem
  Remember the int indexOf(String) method

     String a, b;
     ...                      // input or compute data
     int pos = a.indexOf(b);  // index of first occurrence of b in a, or -1 if absent

  pos tells where in a the string b is located
  Combined with use of String substring(int, int), can check for overlap in the ends of strings
  Effectively “slide” ends over each other looking for a match

The Computations Required
  Seems simple enough in principle, but…
  Large numbers involved make the task daunting
  E.g., must compare each read to every other read
  For N reads, this involves N^2 compares.
  Wouldn’t seem bad except when we calculate N
     o N = 3 × 10^9 / 10^3 = 3 × 10^6 (genome length divided by approx. size of a read)
     o N^2 is ~ 9 × 10^12 compares
     o That’s only for 1 times coverage (need more!)
  Each compare also involves up to N^2 char compares!
     o (where N here is the length of the string, not the number of reads)

The Computations Required
  Previous analysis is naïve
  + Can do better by grouping things
     o Like matching “words” rather than “letters”
  Whole process is not in an error-free environment
  - Other problems not considered make things much more complex
     o Maybe strings that match at 99% of positions must be considered a match
  Many good computer scientists and mathematicians involved

Interesting Competition
  BAC to BAC Sequencing
  Public Human Genome Project (1988 - )
     o Many cooperating laboratories, worldwide
  Started much earlier than the competition
     o Started with more primitive technologies
  Top-down approach using bacterial artificial chromosomes (BACs)
     o Builds framework (scaffolding) first
     o Then fills in details

Interesting Competition
  Whole Genome Shotgun Sequencing
  Celera Genomics (private: Craig Venter, Eugene Myers)
  Later start (1998 - ), “finished” at same time
     o Benefited from much improved technology
        - Sequencers much better
        - Longer strands, better accuracy
        - Faster computers
  Shotgun from the top down
  Use three sizes of fragments (1 Mbp, 50 Kbp, 10 Kbp)
     o Can use longer pieces to deal with repeats
  Everything done in parallel.

Interesting Competition
  Whole Genome Shotgun method appears to have won
     o Much controversy at first
     o Hybrid methods
  Job just beginning!
     o Need to find out what in the Genome affects what in practice
     o Much is labeled “junk” DNA because it doesn’t seem to affect anything. Is that the last word?
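
Looking back at “The Computations Required”: the idea of sliding the end of one string over the start of another, and the count of N^2 read-to-read compares, can be sketched in Java as below. This is only a minimal illustration, not the method used by any real assembler. The class name OverlapSketch, the methods overlap, merge and pairwiseCompares, and the minLen parameter are all hypothetical names introduced here. The sketch assumes exact matches only (no sequencing errors) and a fixed minimum overlap length.

```java
public class OverlapSketch {

    // Length of the longest suffix of a that is also a prefix of b.
    // Returns 0 if no overlap of at least minLen characters exists.
    // (minLen is a hypothetical knob: larger values guard against chance matches.)
    static int overlap(String a, String b, int minLen) {
        int max = Math.min(a.length(), b.length());
        for (int len = max; len >= Math.max(minLen, 1); len--) {
            // "Slide" the start of b over the end of a, longest overlap first.
            if (a.endsWith(b.substring(0, len))) {
                return len;
            }
        }
        return 0;
    }

    // Joins a and b on their overlap; returns null if they do not overlap enough.
    static String merge(String a, String b, int minLen) {
        int len = overlap(a, b, minLen);
        return len == 0 ? null : a + b.substring(len);
    }

    // Read-to-read compares for one-times coverage: N^2 with N = genome / read.
    // Uses long because these values do not fit in an int (max 2,147,483,647).
    static long pairwiseCompares(long genomeLength, long readLength) {
        long n = genomeLength / readLength;
        return n * n;
    }

    public static void main(String[] args) {
        // Alphabet examples from the slides.
        System.out.println(merge("ABCDEF", "FGHIJ", 1));   // ABCDEFGHIJ
        System.out.println(merge("IJKLMN", "MNOPQR", 2));  // IJKLMNOPQR
        // A one-letter overlap is rejected when at least two are required.
        System.out.println(merge("ABCDEF", "FGHIJ", 2));   // null

        // Two DNA fragments from the slides.
        System.out.println(merge("TCTGCAATTTCTGCCAAC", "CCAACTAGTCAGCACTACAT", 3));
        // TCTGCAATTTCTGCCAACTAGTCAGCACTACAT

        // 3 * 10^9 bases, reads of about 10^3 bases.
        System.out.println(pairwiseCompares(3_000_000_000L, 1_000L)); // 9000000000000
    }
}
```

The search runs from the longest possible overlap downward so that the first hit is the largest one. With DNA's four-letter alphabet a short overlap can easily occur by chance, which is why the minimum overlap is passed in rather than fixed at 1.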