Today’s Topics
Computer Science
Enabled by Computing:
Decoding the Human Genome
Upcoming
Review for Final Exam
CPS 001
41.1
Enabled by Computers

Things we now take for granted are possible only
because of computing

Several Examples (most mentioned before)
1. Modern Camera Zoom Lens
2. Certain Space Missions: e.g., “Sling Shot” paths
3. Medical Imaging
o CAT scans (Nobel Prize!)
o Other imaging procedures: PET, …
4. Designing and Manufacturing a modern Computer
5. Communications (error checking, compression, …)
6. Decoding the Human Genome
The Human Genome

Each cell contains
o Nucleus

The human Nucleus contains
o 24 Chromosomes

Chromosomes (composed of DNA), collectively
include
o 20-25 thousand Genes
o E.g., Chromosome 5 includes 5923 genes

Chromosomes are composed of, collectively,
o 3.5 Gbp (3,500,000,000 base pairs)

Good Diagram of DNA
http://www.accessexcellence.org/RC/VL/GG/dna2.html
The Human Genome

Makeup: The Double Helix - DNA

3.5 Gbp
o (how big a number can an int hold?)

Bases denoted by letters A, C, G, T
o Adenine, Cytosine, Guanine, Thymine

Each strand of DNA (in each of our cells) is approx. 6 feet
long!
o (packed into a volume approx. 0.0004 inches across)

Letters printed as a string 1 mm apart would be almost 1900 miles
long
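
The question about int above can be checked directly. The following is a minimal sketch, assuming Java (the language of the String methods used later in these slides); the class and method names are illustrative only and not part of the course material.

```java
class GenomeSize {
    // Base-pair count quoted on the slide. The L suffix is required because
    // the literal is too large to be an int.
    static final long BASE_PAIRS = 3_500_000_000L;

    // True if the value can be stored in a Java int without overflow.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println("Largest int: " + Integer.MAX_VALUE);      // 2147483647
        System.out.println("Base pairs:  " + BASE_PAIRS);             // 3500000000
        System.out.println("Fits in int? " + fitsInInt(BASE_PAIRS));  // false
    }
}
```

So a position in the genome needs a long rather than an int.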
How to Read (Sequence) DNA?

Look at the following strings

Assume we didn’t know the alphabet

Can we reconstruct the alphabet from these fragments?
A AB ABCDE ABCDEF BCDEF CDEFGH FGHIJ GHIJK
GHIJKL IJKLMN KLMNO LMNOP MNOPQR OPQRST PQRST
QRSTU STUVWX UVWXY UVWXYZ VWXYZ YZ Z

If we assume each letter is used only once, can match on a
single character:
ABCDEF + FGHIJ yields ABCDEFGHIJ

If unsure whether letters repeat, may require a longer overlap:
IJKLMN + MNOPQR yields IJKLMNOPQR

Can reconstruct Complete Alphabet from fragments
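
The reconstruction can be tried out in code. The sketch below is one simple way to do it, not necessarily how real assemblers work: a greedy merge that first drops fragments contained in another fragment and then repeatedly joins the pair with the longest end overlap. The class name, method names, and the minOverlap parameter are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FragmentAssembler {
    // The alphabet fragments listed on the slide.
    static final List<String> SLIDE_FRAGMENTS = Arrays.asList(
        "A", "AB", "ABCDE", "ABCDEF", "BCDEF", "CDEFGH", "FGHIJ", "GHIJK",
        "GHIJKL", "IJKLMN", "KLMNO", "LMNOP", "MNOPQR", "OPQRST", "PQRST",
        "QRSTU", "STUVWX", "UVWXY", "UVWXYZ", "VWXYZ", "YZ", "Z");

    // Length of the longest suffix of a that is also a prefix of b.
    static int overlap(String a, String b) {
        for (int k = Math.min(a.length(), b.length()); k > 0; k--) {
            if (a.endsWith(b.substring(0, k))) {
                return k;
            }
        }
        return 0;
    }

    // Removes every fragment that occurs inside another fragment.
    static void removeContained(List<String> pool) {
        for (int i = pool.size() - 1; i >= 0; i--) {
            for (int j = 0; j < pool.size(); j++) {
                if (i != j && pool.get(j).contains(pool.get(i))) {
                    pool.remove(i);
                    break;
                }
            }
        }
    }

    // Greedy assembly: merge the best-overlapping pair until no pair
    // overlaps by at least minOverlap characters.
    static List<String> assemble(List<String> fragments, int minOverlap) {
        List<String> pool = new ArrayList<>();
        for (String f : fragments) {
            if (!pool.contains(f)) {
                pool.add(f); // ignore exact duplicates
            }
        }
        while (true) {
            removeContained(pool);
            int bestI = -1, bestJ = -1, best = 0;
            for (int i = 0; i < pool.size(); i++) {
                for (int j = 0; j < pool.size(); j++) {
                    if (i == j) continue;
                    int k = overlap(pool.get(i), pool.get(j));
                    if (k >= minOverlap && k > best) {
                        best = k;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            if (bestI < 0) {
                return pool; // nothing left to merge
            }
            String merged = pool.get(bestI) + pool.get(bestJ).substring(best);
            pool.remove(Math.max(bestI, bestJ)); // remove higher index first
            pool.remove(Math.min(bestI, bestJ));
            pool.add(merged);
        }
    }

    public static void main(String[] args) {
        // Each letter occurs once, so a single-character overlap is enough.
        System.out.println(assemble(SLIDE_FRAGMENTS, 1)); // [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
    }
}
```

A one-character overlap is safe here only because no letter repeats; with repeated letters, minOverlap would have to be raised.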
Reconstruction from DNA fragments

Problem is more difficult

Only 4 characters: A C G T

All kinds of repetition in the sequence

Need larger overlap – how large?
o Depends on kind of repetition we find

Look at example with a sequence much longer than
alphabet

Fragments shown come from chopping up three identical
copies of the sequence

Breaks at “random” points
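
To see what chopping up identical copies at “random” points produces, the process can be simulated. This is a hedged sketch: the class name, the method names, the piece-length bounds, and the seed are all assumptions chosen for illustration, and the demo sequence is just the first 30 letters of the example sequence used on the next slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class Shearing {
    // Cuts one copy of the sequence into consecutive pieces. Each length is
    // drawn uniformly from minLen..maxLen; the last piece may be shorter.
    static List<String> shear(String sequence, int minLen, int maxLen, Random rng) {
        if (minLen < 1 || maxLen < minLen) {
            throw new IllegalArgumentException("need 1 <= minLen <= maxLen");
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < sequence.length()) {
            int len = minLen + rng.nextInt(maxLen - minLen + 1);
            int end = Math.min(sequence.length(), start + len);
            pieces.add(sequence.substring(start, end));
            start = end;
        }
        return pieces;
    }

    // Shears several identical copies independently, so the break points
    // differ from copy to copy and the pieces overlap each other.
    static List<String> shearCopies(String sequence, int copies,
                                    int minLen, int maxLen, long seed) {
        Random rng = new Random(seed);
        List<String> all = new ArrayList<>();
        for (int c = 0; c < copies; c++) {
            all.addAll(shear(sequence, minLen, maxLen, rng));
        }
        return all;
    }

    public static void main(String[] args) {
        String sequence = "GCTGGGGACGGGCGGTGTTGGAGCAGAGCT";
        for (String piece : shearCopies(sequence, 3, 5, 12, 42L)) {
            System.out.println(piece);
        }
    }
}
```

Because each copy breaks at different points, a piece from one copy usually spans a break in another copy, which is what makes overlap matching possible.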
Reconstruction from DNA sequence

Look at following fragments (from 3 originals)
AAGATGGTTCATTCT ACGGGCGGTGTTGGAGCAGA AGAGCT
AGGTATATTGAGGAAG ATTGT CAAGTAAAAGGA CATTGTCAAGTAAAAG
CCAACTAGTCAGCACTAC CCAACTAGTCAGCACTACAT CGGGCGGTGTTGGAGC
CTGCAATTTCTG GAAGGTATAT GACTTGGGTA GCTCTGCAATTTCTG
GCTGGGG GCTGGGGA GCTGGGGACGGGCGGTGT TAGTCAGCACTA
TCTGCAATTTCTGCCAAC TGAGGAAGAAGA TGAGGAAGAAGATGGTTCA
TGGAGCAGAGC TGGTTCATTCTGACTTGGGTA
TGTCAAGTAAAAGGAAGGTATAT TTCTGACTTGGGTA

Identify Overlaps to reconstruct
TCTGCAATTTCTGCCAACTAGTCAGCACTACAT
AGGTATATTGAGGAAGAAGATGGTTCA

Eventually can get original sequence
GCTGGGGACGGGCGGTGTTGGAGCAGAGCTCTGCAATTTCTGCCAACTAGTCAGCACTA
CATTGTCAAGTAAAAGGAAGGTATATTGAGGAAGAAGATGGTTCATTCTGACTTGGGTA
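
A reconstruction like this can be sanity-checked mechanically. The sketch below (class and method names are illustrative assumptions) verifies two things about the example: every fragment occurs somewhere in the reconstructed sequence, and the fragment lengths add up to three times the sequence length, as expected when three copies are chopped up.

```java
import java.util.Arrays;
import java.util.List;

class ReconstructionCheck {
    // The reconstructed sequence from the slide (printed there on two lines).
    static final String SEQUENCE =
        "GCTGGGGACGGGCGGTGTTGGAGCAGAGCTCTGCAATTTCTGCCAACTAGTCAGCACTA"
        + "CATTGTCAAGTAAAAGGAAGGTATATTGAGGAAGAAGATGGTTCATTCTGACTTGGGTA";

    // The fragments from the slide.
    static final List<String> READS = Arrays.asList(
        "AAGATGGTTCATTCT", "ACGGGCGGTGTTGGAGCAGA", "AGAGCT",
        "AGGTATATTGAGGAAG", "ATTGT", "CAAGTAAAAGGA", "CATTGTCAAGTAAAAG",
        "CCAACTAGTCAGCACTAC", "CCAACTAGTCAGCACTACAT", "CGGGCGGTGTTGGAGC",
        "CTGCAATTTCTG", "GAAGGTATAT", "GACTTGGGTA", "GCTCTGCAATTTCTG",
        "GCTGGGG", "GCTGGGGA", "GCTGGGGACGGGCGGTGT", "TAGTCAGCACTA",
        "TCTGCAATTTCTGCCAAC", "TGAGGAAGAAGA", "TGAGGAAGAAGATGGTTCA",
        "TGGAGCAGAGC", "TGGTTCATTCTGACTTGGGTA",
        "TGTCAAGTAAAAGGAAGGTATAT", "TTCTGACTTGGGTA");

    // True if every read occurs somewhere in the sequence.
    static boolean allReadsPresent(String sequence, List<String> reads) {
        for (String read : reads) {
            if (sequence.indexOf(read) < 0) {
                return false;
            }
        }
        return true;
    }

    // Sum of the lengths of all reads.
    static int totalLength(List<String> reads) {
        int total = 0;
        for (String read : reads) {
            total += read.length();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("Sequence length:   " + SEQUENCE.length());                // 118
        System.out.println("Total read length: " + totalLength(READS));               // 354
        System.out.println("All reads present: " + allReadsPresent(SEQUENCE, READS)); // true
    }
}
```

Passing this check does not prove the reconstruction is the only possible one; it only shows that it is consistent with the fragments.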
The Real World

Have looked at toy problems: back to reality
o String lengths are huge: (3 * 10^9)

Why the obsession with fragments?
o If we can sequence (read) a fragment, why not just do the
whole thing?

Automatic Sequencers Available
o Limited to lengths of the order of 1000 from end
o (Can sequence whole strand if short enough)

Thus the use of the Shotgun Method of Sequencing
Shotgun Sequencing

Strand much too long for automatic sequencing

Randomly cut them into small pieces (~5 Kbp)

Make many identical copies of these strands

Each of these small pieces is sequenced to produce
reads

What’s left is a Data Processing Problem


Need to reconstruct original DNA strand by matching ends

If random reads match nicely, work can be completed
There may be problems!
Shotgun Problems


Gaps

Due to random nature of shearing strands, there may be gaps in
the sequence

(Maybe all pieces broke at same place)

May need to repeat for some sub-areas to fill gaps
Repeats

Long repeats may make matching ambiguous

Need extra long fragments with ends sequenced
o Can tell how many repeats “fit” in
o Also can bridge gaps that have resisted sequencing

Sequencing Errors

Automatic sequencing is error prone – need multiple passes
The Computations Required

Appears to be a Simple String Matching Problem

Remember the int indexOf(String) method:
   String a, b;
   ...   // input or compute data
   int pos = a.indexOf(b);
pos tells where in a the string b is located (-1 if b does not occur in a)
Combined with use of
   String substring(int, int)
can check for overlap in the ends of strings
Effectively “slide” ends over each other for match
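
The “sliding” idea can be sketched with exactly those two String methods. This is an illustration under stated assumptions, not a definitive implementation: the class name EndOverlap, the method names, and the minOverlap parameter are made up here. The demo merges two fragments from the earlier DNA example.

```java
class EndOverlap {
    // Slides the end of left over the start of right, trying the longest
    // candidate first. Returns the length of the longest overlap that is at
    // least minOverlap characters long, or 0 if there is none.
    static int overlapLength(String left, String right, int minOverlap) {
        int max = Math.min(left.length(), right.length());
        for (int k = max; k >= Math.max(1, minOverlap); k--) {
            String tail = left.substring(left.length() - k, left.length());
            String head = right.substring(0, k);
            if (tail.equals(head)) {
                return k;
            }
        }
        return 0;
    }

    // Joins right onto the end of left. Returns left unchanged if right lies
    // entirely inside it, and null if the ends do not overlap enough.
    static String merge(String left, String right, int minOverlap) {
        if (left.indexOf(right) >= 0) {
            return left;
        }
        int k = overlapLength(left, right, minOverlap);
        return k == 0 ? null : left + right.substring(k, right.length());
    }

    public static void main(String[] args) {
        String left = "TCTGCAATTTCTGCCAAC";
        String right = "CCAACTAGTCAGCACTACAT";
        System.out.println(overlapLength(left, right, 3)); // 5 (the shared CCAAC)
        System.out.println(merge(left, right, 3));         // TCTGCAATTTCTGCCAACTAGTCAGCACTACAT
    }
}
```

Trying the longest overlap first is a design choice for this sketch; with repetitive DNA, a long accidental match is less likely than a short one, but neither is guaranteed to be the true join.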



The Computations Required



Seems simple enough in principle, but…
Large numbers involved make task daunting
E.g., must compare each read to every other read

For N reads, involves N^2 compares.

Wouldn’t seem bad except when we calculate N
o N = 3*10^9 / 10^3 = 3*10^6 (genome length divided by approx. size of read)
o N^2 is ~ 9*10^12 compares
o That’s only for 1 times coverage (need more!)

Each compare also involves up to L^2 char compares!
(where L is the length of a read)
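
The arithmetic above can be reproduced in a few lines. The sketch below uses the slide's figures (3 * 10^9 bases, reads of about 10^3); the class and method names are assumptions for illustration. Note that all quantities need long, and the final product is close to the largest value a long can hold.

```java
class CompareCount {
    // Number of reads needed to cover the genome once.
    static long readCount(long genomeLength, long readLength) {
        return genomeLength / readLength;
    }

    // Comparing every read with every other read: about N^2 compares.
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    static long pairwiseCompares(long reads) {
        return Math.multiplyExact(reads, reads);
    }

    public static void main(String[] args) {
        long genomeLength = 3_000_000_000L; // 3 * 10^9 bases, from the slide
        long readLength = 1_000L;           // approx. size of a read, from the slide

        long n = readCount(genomeLength, readLength);
        long compares = pairwiseCompares(n);
        // Up to readLength^2 character compares per pair of reads.
        long charCompares = Math.multiplyExact(compares, readLength * readLength);

        System.out.println("N             = " + n);            // 3000000
        System.out.println("N^2 compares  = " + compares);     // 9000000000000
        System.out.println("char compares = " + charCompares); // 9000000000000000000
    }
}
```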
The Computations Required

Previous analysis is naïve

+ Can do better by grouping things
o Like matching “words” rather than “letters”
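
One way to read “words rather than letters” is to index every read by its first k letters, so that candidate partners are found by lookup instead of by comparing against every other read. The sketch below is an assumption about what such grouping could look like, not the method actually used by the sequencing projects; the class name, method names, and the word length k are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordIndex {
    // Maps each k-letter word to the indices of the reads that START with it.
    static Map<String, List<Integer>> indexByPrefix(List<String> reads, int k) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < reads.size(); i++) {
            String read = reads.get(i);
            if (read.length() >= k) {
                index.computeIfAbsent(read.substring(0, k), w -> new ArrayList<>()).add(i);
            }
        }
        return index;
    }

    // Indices of reads whose start overlaps the end of reads.get(i) by at
    // least k letters. Only reads found through the index are verified.
    static List<Integer> candidatesFollowing(List<String> reads, int i, int k,
                                             Map<String, List<Integer>> index) {
        List<Integer> result = new ArrayList<>();
        String read = reads.get(i);
        for (int p = 0; p + k <= read.length(); p++) {
            String word = read.substring(p, p + k);
            for (int j : index.getOrDefault(word, Collections.emptyList())) {
                // Verify: the rest of this read from p on must begin read j.
                if (j != i && reads.get(j).startsWith(read.substring(p))
                        && !result.contains(j)) {
                    result.add(j);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> reads = Arrays.asList(
            "TCTGCAATTTCTGCCAAC", "CCAACTAGTCAGCACTACAT", "TAGTCAGCACTA", "GCTGGGG");
        Map<String, List<Integer>> index = indexByPrefix(reads, 4);
        System.out.println(candidatesFollowing(reads, 0, 4, index)); // [1]
    }
}
```

The price of this shortcut is that overlaps shorter than k letters are never found.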


Whole process is not in an error-free environment


- Other problems not considered make things much more
complex
Maybe string matches that match at 99% of positions must
be considered a match
Many good computer scientists and mathematicians
involved
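
The idea of accepting a match at 99% of positions can be sketched as follows. This is a simplified illustration: it compares two equal-length strings position by position, so it tolerates misread letters but not inserted or deleted ones. The class name, method names, and the threshold parameter are assumptions.

```java
class ApproximateMatch {
    // Fraction of positions at which two equal-length strings agree.
    static double identity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        if (a.isEmpty()) {
            return 1.0;
        }
        int same = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / a.length();
    }

    // True if the strings agree at no less than the given fraction of positions.
    static boolean matches(String a, String b, double threshold) {
        return identity(a, b) >= threshold;
    }

    public static void main(String[] args) {
        String a = "ACGTACGTAC";
        String b = "ACGTACGTAT"; // differs in the last position only
        System.out.println(identity(a, b));      // 0.9
        System.out.println(matches(a, b, 0.99)); // false: one error in ten is too many
    }
}
```

In use, the two strings would be the overlapping end of one read and the overlapping start of another, after the overlap length has been chosen.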
Interesting Competition

BAC to BAC Sequencing

Public Human Genome Project (1988 - )
o Many cooperating laboratories, world wide

Started much earlier than competition
o Started with more primitive technologies

Top down approach using bacterial artificial chromosomes
(BACs)

Builds framework (scaffolding) first

Then fills in details
Interesting Competition

Whole Genome Shotgun Sequencing

Celera Genomics (private: Craig Venter, Eugene Myers)

Later start (1998 - ), “finished” at same time
o Benefited from much improved technology
   - Sequencers much better
   - Longer strands, better accuracy
   - Faster computers

Shotgun from the top down

Use three sizes of fragments (1 Mbp, 50 Kbp, 10 Kbp)
o Can use longer pieces to deal with repeats

Everything done in parallel.
Interesting Competition

Whole Genome Shotgun method appears to have won

Much controversy at first

Hybrid methods

Job just beginning!

Need to find out what in Genome affects what in
practice

Much is labeled “junk” DNA because it doesn’t seem to
affect anything.

Is that the last word?