MBG305_LS_05

Applied Bioinformatics
Week 5
Topics
• Cleaning of Nucleotide Sequences
• Assembly of Nucleotide Reads
Theoretical Part I
• DNA sequencing
• Next generation sequencing
• Cleaning nucleotide sequences
DNA Sequencing
• Sanger Method
– Please explain
• Other methods
– Too many to discuss
– http://en.wikipedia.org/wiki/DNA_sequencing
Shotgun Sequencing
• Many short (~700 nt) sequences
• Human genome sequencing project
– Finished?
• How can you make sense of these sequences?
• Contrast:
– Genome walking
Next Generation Sequencing
• Increases the throughput of sequencing
– More sequence per time
– Not more sequence per read (still around 500)
• Many commercial platforms available
– 454 pyrosequencing
– Illumina (Solexa) sequencing
– ...
• Price is dropping
– Whole genomes in a day
– http://www.1000genomes.org/
454
Pyrosequencing
http://genepool.bio.ed.ac.uk/
Illumina sequencing
http://seqanswers.com/forums/showthread.php?t=21
Where Is Your DNA From?
• Did you just clone and sequence it?
• Did you send a sample to a company?
• Did you find the sequence in a database?
• Better make sure it is correct and clean
Vector Contaminations
• Long DNA pieces are fragmented and cloned into vectors before sequencing.
• This usually causes some amount of vector to be sequenced along with the insert.
image: Wikipedia
Adapter Contaminations
• Long DNA pieces are fragmented and adapter sequences are ligated to both ends of the fragments before sequencing.
• This causes adapters to be sequenced along with the desired sequence.
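As an illustration of what removing an adapter involves, here is a minimal sketch that trims an adapter from the 3' end of a read using exact matching only. The function name `trim_adapter`, the `min_overlap` parameter, and the adapter sequence are hypothetical; real cleaning tools also tolerate sequencing errors, which this sketch does not.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a full or partial 3' adapter from a read (exact matches only)."""
    # Case 1: the complete adapter occurs in the read; cut from its start.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends inside the adapter, so the end of the read
    # equals the beginning of the adapter. Try the longest overlap first.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read


ADAPTER = "GGCCTTAA"  # made-up adapter, for illustration only
print(trim_adapter("ACGTACGTGGCCTTAATT", ADAPTER))  # ACGTACGT
print(trim_adapter("ACGTACGTGGCC", ADAPTER))        # ACGTACGT
```

The `min_overlap` threshold is a trade-off: a small value removes more partial adapters but also clips reads that end in the same few bases by chance.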
Contaminations Cause Misassembly
• One important outcome of not removing contaminations from genomic sequences is that they cause misassembly of sequences.
Cleaning Contaminations
• Several approaches and tools to clean vector contaminations from genomic sequences have been developed.
• Most of them rely on a reference vector library, including:
– LUCY, LUCY2
– SeqTrim
– DeconSeq
– TagCleaner
– cross_match
– SeqClean
– VecScreen
Problem Definition
• A vector is a circular DNA sequence.
• Once vectors are linearized in reference libraries, vector contaminations around the linearization point can no longer be detected and cleaned by currently available tools.
UniVec
• A vector library by NCBI
• Problems:
– Has complete sequences for only 8 vectors, although full-length sequences for the rest are also available in public databases.
– Only these 8 vectors are appended to themselves by 49 nt to overcome the circularization problem.
– Some vectors are divided into partitions for no apparent reason.
– Some adapter sequences are appended to themselves as well, whereas others are not.
Previous Solution
• Not designed for entire libraries
• Proposes cutting the first 60 nucleotides from the start of a vector sequence and pasting them to the end using a simple text editor
• No longer has an available implementation
Y.-A. Chen, C.-C. Lin, C.-D. Wang, H.-B. Wu, and P.-I. Hwang, “An optimized procedure greatly improves EST vector contamination removal,” 2007.
Our Solution
• Appending all vector sequences in a reference library (or a subset filtered by the user) to themselves, either in full or by their first n nucleotides (n chosen by the user)
• As customizable as possible, but still efficient with a single click
• Has a GUI for target users
Our Solution
• Possible customisations
– Removing appended segments that are already present in the library
– Filtering the sequences by a keyword in their definition lines and/or by length
– Virtual circularization
• Appending sequences to themselves by their first n nucleotides
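Virtual circularization can be sketched in a few lines. This is a minimal illustration and not the tool described on these slides: the function name `circularize` and the toy vector sequence are hypothetical. It shows why a read that spans the linearization point is missed in the linear library entry but found once the first n nucleotides are appended.

```python
from typing import Optional


def circularize(seq: str, n: Optional[int] = None) -> str:
    """Append the first n nucleotides of seq to its end.

    With n=None (or n >= len(seq)) the whole sequence is appended to itself.
    """
    if n is None or n >= len(seq):
        return seq + seq
    return seq + seq[:n]


vector = "AAACCCGGGTTT"     # toy circular vector, linearized at an arbitrary point
junction_read = "GTTTAAAC"  # contamination that spans the linearization point

print(junction_read in vector)                                         # False
print(junction_read in circularize(vector, n=len(junction_read) - 1))  # True
```

Appending at least (read length - 1) nucleotides is enough to cover every contamination of that length that crosses the junction.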
Efficiency of Our Method
• Datasets:
– Every 600th EST
– P. somniferum EST
– Artificial Data
• Vector Libraries
– rawUV
– cleanUV
– appUV
The Percentage of Sequences Cleaned
Dataset              rawUV    cleanUV    appUV
Every 600th EST      31.00    30.94      31.79
P. somniferum EST    17.26    17.26      18.03
Artificial Data      87.50    75.00     100.00
The Percentage of Nucleotides Cleaned
Dataset              rawUV    cleanUV    appUV
Every 600th EST       2.86     2.85       2.90
P. somniferum EST     0.45     0.45       0.47
Artificial Data      15.35    15.35      19.93
Theoretical Part I
• Mind Mapping
• Break 10 min
Practical Part I
Screening for Vector seqs
• www.ncbi.nlm.nih.gov/VecScreen
• Get the U87251 sequence (FASTA)
– What is this number?
– Enter the sequence and run the analysis
• What do you see as a result?
– Would you continue with the experiment?
– Would you discard the sequence?
Sequencing
• Since we cannot do any sequencing here, we have to prepare a simulation
1. Select a nucleotide sequence of about 15000 bases
2. Copy and paste that sequence into Word
– 3 times
– Separated by empty lines
Sequencing
3. Arbitrarily add line breaks to the resulting document
– At least 30 (minimum 10 per copy)
– Spread out throughout the sequence
4. Add a FASTA definition line after each line break
– Use >Copy-N-Fragment-X as a template for the definition line
• Ensure that the overall number of characters is less than 50000
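The manual steps above could also be scripted. The following is a minimal sketch under stated assumptions: the function names `simulate_reads` and `to_fasta` are hypothetical, random cut positions stand in for the arbitrarily placed line breaks, and the first fragment of each copy also receives a definition line so that the output is valid FASTA.

```python
import random


def simulate_reads(sequence, copies=3, breaks_per_copy=10, seed=0):
    """Cut each copy of the sequence at random positions into fragments."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    records = []
    for copy in range(1, copies + 1):
        # Choose distinct cut positions strictly inside the sequence.
        cuts = sorted(rng.sample(range(1, len(sequence)), breaks_per_copy))
        bounds = [0] + cuts + [len(sequence)]
        for frag, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1):
            records.append((f"Copy-{copy}-Fragment-{frag}", sequence[start:end]))
    return records


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "\n".join(f">{name}\n{seq}" for name, seq in records) + "\n"


template = "ACGT" * 3750  # placeholder 15000-base sequence; use a real one instead
fasta_text = to_fasta(simulate_reads(template))
print(len(fasta_text) < 50000)  # check the character limit
```

A repetitive placeholder such as `"ACGT" * 3750` is only useful for testing the script. For the assembly exercise, substitute a real nucleotide sequence.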
Practical Part I
• 10 min break
Theoretical Part II
• Sequence Assembly
Assembling Sequences
• Shotgun sequencing
– Sequence fragments
– Find overlapping fragments
– Build contiguous sequences (contig)
– Assemble into whole genomes
• Genetic and physical maps
– Help orient fragments and contigs
• Problems with repetitive sequences
Sequence Tagged Sites
• Physical map
• Up to 200 bp long
• Unique for a region of the genome
• STS reference map
– Map to assemble BAC/ PAC clones
– Repeat process to map contigs to clones
Sequence Tagged Site
Sequence Tagged Site
Endonuclease Site
Chromosome
The restriction enzyme should digest the DNA into approximately 200 kb long fragments
Fragments with STS
Up to 700 kb!
If it fits into a plasmid
(Up to 10 kb)
Shortest Chromosome (21)
47 Mb -> 250 BACs
1 BAC -> 10 – 50 Plasmids / Cosmids
Plasmid / Cosmid
Polymerase Chain Reaction will
lead predominantly to:
Primer
Use several nucleases
EcoRI
BamHI
HindIII
Target ~ 1000 nucleotides
Restriction
Sequence with degenerate primers?
or subclone and sequence
Sequencing
Clone01: ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT
Clone02: TGTGTAGCTAGCTGCGGCGCTAGGATAGGCATCTAGCTATCGGACTCTGTG
...
Clone20: GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT
...
>Clone01
ACCGACTACGATCGCACTCAGCATCGCGA
TCCGATACGTAGCTAGCTAGCT
>Clone02
TGTGTAGCTAGCTGCGGCGCTAGGATAGG
CATCTAGCTATCGGACTCTGTG
...
>Clone20
GTAGTACGTGCTAGCTACGTACGTACGAT
CGTACGTAGTACCGACTACGAT
...
[Figure: all-vs-all comparison matrix of the clones (01, 02, 03, ..., 20, ..., 27), scored with Smith-Waterman or a more specialized algorithm. The diagonal is 0 and the upper triangle is labelled SYMMETRIC, so only the lower triangle holds scores (e.g. 02 vs 01: -15; 03 vs 01: 5; 03 vs 02: 12). Selected cells are annotated "Check here as well".]
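The all-vs-all comparison named on this slide could be sketched as follows. This is a minimal Smith-Waterman local alignment score with linear gap penalties. The scoring values (match 1, mismatch -1, gap -2) and the function names are assumptions, so the numbers it produces will not reproduce the scores shown on the slide. Because the score is symmetric, only one triangle of the matrix is computed.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def all_vs_all(clones):
    """Score every unordered pair of clones once."""
    names = sorted(clones)
    return {
        (x, y): smith_waterman_score(clones[x], clones[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


clones = {
    "Clone01": "ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT",
    "Clone02": "TGTGTAGCTAGCTGCGGCGCTAGGATAGGCATCTAGCTATCGGACTCTGTG",
    "Clone20": "GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT",
}
for pair, score in all_vs_all(clones).items():
    print(pair, score)
```

For n clones this needs n(n-1)/2 alignments, each quadratic in read length, which is why assemblers rely on faster, more specialized overlap detection.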
Clone20 GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT
                                               ||||||||||||
Clone01                                        ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT
The last 12 nucleotides of Clone20 match the first 12 nucleotides of Clone01.
Check here as well: the other candidate overlaps are examined in the same way.
[Figure: overlapping clones tiled into a contig. Not proportional.]
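The overlap between Clone20 and Clone01 can be found and merged programmatically. This is a minimal sketch using exact suffix-prefix matching. The function names `overlap_length` and `merge` and the `min_len` threshold are hypothetical, and real assemblers tolerate mismatches and also consider reverse complements.

```python
def overlap_length(left, right, min_len=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge(left, right, min_len=3):
    """Join two reads at their overlap, or return None if they do not overlap."""
    k = overlap_length(left, right, min_len)
    return left + right[k:] if k else None


clone01 = "ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT"
clone20 = "GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT"

print(overlap_length(clone20, clone01))  # 12
print(len(merge(clone20, clone01)))      # 90 (51 + 51 - 12)
```

Repeating this merge step greedily over the best remaining overlaps yields contigs. Repetitive sequences break this simple approach because one read end can match several others equally well.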
Chromosome
For each plasmid the BAC and therefore the position on the chromosome is known
Sequencing all plasmids will give the complete sequence of the genome
!Caution!
Highly simplified
Why?
What does coverage mean?
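One common way to quantify coverage is the average number of times each base of the target is sequenced, that is, the total number of sequenced bases divided by the length of the target. A minimal sketch, with a hypothetical function name and made-up read counts in the second example:

```python
def coverage(read_lengths, genome_length):
    """Average coverage: total sequenced bases divided by target length."""
    return sum(read_lengths) / genome_length


# The simulation from Practical Part I: 3 copies of a 15000-base sequence.
print(coverage([15000, 15000, 15000], 15000))  # 3.0

# Made-up example: 1000 reads of 700 nt from a 100000-base target.
print(coverage([700] * 1000, 100000))  # 7.0
```

This is an average only. Individual positions can be covered more or less often, or not at all.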
Assembling Software
• As you just saw, assembling sequences is computationally expensive
• Therefore most assembly software is not offered as an online service, but it is often freely available for download
Theoretical Part II
• Mind mapping
• 10 min break
Practical Part II
Restriction Maps
• You sent a sample for sequencing. You might want to check whether the returned sequence makes sense
• What is a restriction map?
• www.restrictionmapper.org
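To make the idea of a restriction map concrete, here is a minimal sketch that lists where the recognition sites of the three enzymes named earlier (EcoRI, BamHI, HindIII) occur in a sequence. The function name `restriction_map` and the test sequence are hypothetical. Reported positions are the 1-based start of each recognition site, not the exact cut position. Since all three sites are palindromic, scanning one strand is sufficient.

```python
import re

# Recognition sequences of the three enzymes.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def restriction_map(seq, sites=SITES):
    """Return {enzyme: [1-based start positions of its recognition site]}."""
    seq = seq.upper()
    return {
        # The lookahead (?=...) also reports overlapping occurrences.
        name: [m.start() + 1 for m in re.finditer(f"(?={site})", seq)]
        for name, site in sites.items()
    }


example = "AAGAATTCTTGGATCCAAGCTTGAATTC"  # made-up test sequence
print(restriction_map(example))
# {'EcoRI': [3, 23], 'BamHI': [11], 'HindIII': [17]}
```

Comparing the predicted sites with the fragment pattern of an actual digest is one way to check whether a returned sequence is plausible.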
CAP3 Assembly
• Go to: http://pbil.univ-lyon1.fr/cap3.php
• Use the sequences you prepared earlier to assemble them with CAP3
• Analyze the results
– Did you get a full, correct assembly?