Basics of high-throughput sequencing

Institute for Computational Biomedicine
Olivier Elemento, PhD
TA: Jenny Giannopoulou, PhD
CSHL High Throughput Data Analysis Workshop, June 2012
Plan
1. What high-throughput sequencing is used for
2. Illumina technology
3. Primary data analysis (alignment, QC)
4. Read formats
5. Secondary Analysis (mutation calling, transcript level quantification, etc)
6. Read data visualization
7. Useful R/BioC packages
8. Challenges and evolution of sequencing and its analysis
1. What high-throughput sequencing is used for
Full genome sequencing
Targeted sequencing
Exome sequencing
DNA methylation profiling
Bisulfite treatment: mC → C, C → U
After PCR: C → C, U → T
RNA-seq
ChIP-seq
(Figure labels: DNA, transcription factor of interest, antibody)
High-throughput mapping of chromatin interactions (HiC)
Elemento lab (more on this next week)
And many others
• Gene fusion detection
• Translational profiling (which mRNAs localize
to ribosomes)
• Small/miRNA sequencing
• Bacterial communities
• Protein-RNA interactions (PAR-CLIP, HITS-CLIP)
• …
2. Illumina technology
Illumina SBS Technology
Reversible Terminator Chemistry Foundation
(Figure: workflow from DNA (0.1-1.0 ug) through sample preparation, single molecule array, cluster growth, sequencing cycles 1-9 (T G C T A C G A T …), image acquisition and base calling)
http://seqanswers.com/forums/showthread.php?t=21
http://www.illumina.com/technology/sequencing_technology.ilmn
© Illumina, Inc.
Single-end vs paired-end sequencing
What comes out of the machine:
short reads in fastq format
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
CTCCTGGAAAACGCTTTGGTAGATTTGGCCAGGAGCTTTCTTTTATGTAAATTG
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
[^^cedeefee`cghhhfcRX`_gfghf^bZbecg^eeb[caef`ef^a_`eXa
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
TCCANCCATGGCAAATTCCATGGCACCGTCAAGGCTGAGAACGGGAAGCTTGTC
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
ab_eBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
TACAAGTGCAGCATCAAGGAGCGAATGCTCTACTCCAGCTGCAAGAGCCGCCTC
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
_[_ceeec[^eeghdffffhh^efh_egfhfgeec_fbafhhhhd`caegfheh
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
GAAGGAGAGAAGGGGAGGAGGGCGGGGGGCACCTACTACATCGCCCTCCACATC
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
\^_accceg`gga`f[fgcb`Ucgfaa_LVV^[bbbbbRWW`W^Y[_[^bbbbb
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
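QS to int
In R:
as.integer(charToRaw('e')) - 33   # Sanger / Illumina 1.8+ encoding (Phred+33)
as.integer(charToRaw('e')) - 64   # Illumina 1.3-1.7 encoding (Phred+64), which matches the B-to-i characters in the reads shown here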
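A minimal Python sketch of the same conversion, applied to a whole quality string. The offset is an assumption that depends on the pipeline version: 33 for Sanger / Illumina 1.8+ files, 64 for Illumina 1.3-1.7 files. The function names are illustrative only.

```python
def quality_to_int(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


# Offset 33 (Sanger / Illumina 1.8+)
print(quality_to_int("e", offset=33))
# Offset 64 (Illumina 1.3-1.7); 'B' decodes to a very low score
print(quality_to_int("ab_eB", offset=64))
```

Checking which offset gives scores in a plausible range (roughly 0 to 41) is a quick way to guess the encoding of an unknown file.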
Paired-end sequencing
s_8_1_sequence.txt.gz
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
CTCCTGGAAAACGCTTTGGTAGATTTGGCCAGGAGCTTTCTTTTATGTAAATTG
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
[^^cedeefee`cghhhfcRX`_gfghf^bZbecg^eeb[caef`ef^a_`eXa
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
TCCANCCATGGCAAATTCCATGGCACCGTCAAGGCTGAGAACGGGAAGCTTGTC
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
ab_eBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
TACAAGTGCAGCATCAAGGAGCGAATGCTCTACTCCAGCTGCAAGAGCCGCCTC
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
_[_ceeec[^eeghdffffhh^efh_egfhfgeec_fbafhhhhd`caegfheh
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
GAAGGAGAGAAGGGGAGGAGGGCGGGGGGCACCTACTACATCGCCCTCCACATC
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
\^_accceg`gga`f[fgcb`Ucgfaa_LVV^[bbbbbRWW`W^Y[_[^bbbbb
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
GTGGCCGATTCCTGAGCTGTGTTTGAGGAGAGGGCGGAGTGCCATCTGGGTAGC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
aa_eeeeegggggihhiiifgeghfeghbgcghifiidg^dbgggeeeee`dcd
…
s_8_2_sequence.txt.gz
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2
GGCATATTTAACAGCATTGAACAGAATTCTGTGTCCTGTAAAAAAATTAGCTTA
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2
a__aaa`ce`cgcffdf_acda^ea]befffbeged`g[a`e_caaac]cb`gb
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2
TTGAGGCTGTTGTCATACTTCTCATGGTTCACACCCATGACGAACATGGGGGCG
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2
a__eeeeeggegefhhhiiihhhhhiieghhhghhiiffhiififhhiihegic
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/2
CGGGGTGCACCTCGTCGTAGAGGAACTCTGCCGTCAGCTCTGCCCCATCGCCAA
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/2
^__ee__cge`cghghhfgddgfgi]ehhfffff^ec[beegidffhhfhadba
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/2
CTTAGTCTCAGTTTTCCTCCAGCAGCCTGAGGAAACTCAAAGGCACAGTTCCCA
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/2
_abeaaacg^g^eghhhhgafghhdfghfedeghfiiicfbgdHYagfeecggf
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/2
TAGGCTCAAAGTCTAACGCCAATCCCGAACCTGGGCATCTGTACACACACACAC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/2
abbeceeegggcghiihiihhhhiifhiiiiihiiiiiiihegh`eggfebfhg
…
Illumina sequencing using HiSeq2000
• Previously: GAIIx: ~30M reads per lane, 8 lanes (1QC)
• Now: HiSeq2000 + TruSeq v3: 200M reads per lane, 8-16 lanes (1-2QC) in parallel
• Multiplexing: attach barcode, mix samples, sequence, identify and remove barcode
Full Genome Sequencing using
Illumina technology
• ~$4-6K reagent with Illumina (storage+analysis
costs not included)
• Exercise: you want to sequence 1 human
genome at 100X coverage; how many lanes ?
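One way to work through the exercise, as a rough Python sketch. The genome size (~3.1 Gb) and the 100 bp read length are assumptions not given on the slide, and the calculation optimistically treats every read as usable.

```python
import math


def lanes_needed(genome_size_bp, coverage, reads_per_lane, read_length_bp):
    """Lanes needed so that total sequenced bases >= coverage * genome size."""
    bases_needed = genome_size_bp * coverage
    bases_per_lane = reads_per_lane * read_length_bp
    return math.ceil(bases_needed / bases_per_lane)


# Assumed values: 3.1 Gb genome, 200M reads per lane, 100 bp reads
print(lanes_needed(3_100_000_000, 100, 200_000_000, 100))
```

In practice the number goes up once unmapped reads and PCR duplicates are discounted.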
QC for Illumina (part 1)
(Figure: template strand with 3’ and 5’ ends during sequencing)
3. Primary data analysis (alignment, QC)
Read alignment programs
• BWA (Burrows-Wheeler Aligner)
– http://bio-bwa.sourceforge.net/
– Fast, accurate, can find (short) indels
– Allow 1-3 mismatches by default
– Can also align longer 454 reads
• Bowtie
– http://bowtie-bio.sourceforge.net/index.shtml
– Ultrafast, accurate, newest version finds indels too
– Allow 1-3 mismatches by default
– Integrated into TopHat (splice aligner)
• Others: Eland, Maq, SOAP, etc
BWA tutorial (for aligning single-end reads to genome)
• Get genome, e.g., from UCSC
– http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
• Combine into 1 file
– tar zvfx chromFa.tar.gz
– cat *.fa > wg.fa
• Index the genome
– bwa index -p hg19bwaidx -a bwtsw wg.fa
• Align
– bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
• Convert to SAM format
– bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam
Aligning paired-end reads
• Align the two files separately
– bwa aln -t 4 hg19bwaidx s_3_1_sequence.txt.gz > s_3_1_sequence.txt.bwa
– bwa aln -t 4 hg19bwaidx s_3_2_sequence.txt.gz > s_3_2_sequence.txt.bwa
• Convert to SAM format
– bwa sampe hg19bwaidx s_3_1_sequence.txt.bwa s_3_2_sequence.txt.bwa s_3_1_sequence.txt.gz s_3_2_sequence.txt.gz > s_3_sequence.txt.sam
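A minimal Python sketch that assembles the paired-end commands so that each mate file gets its own intermediate alignment file and bwa sampe receives both mates in order. The helper names are hypothetical; the bwa subcommands and the -t option are the ones used on the slide. Nothing is executed unless run_commands is called.

```python
import subprocess


def _intermediate_name(fastq):
    """s_3_1_sequence.txt.gz -> s_3_1_sequence.txt.bwa (naming used on the slide)."""
    base = fastq[:-3] if fastq.endswith(".gz") else fastq
    return base + ".bwa"


def bwa_paired_end_commands(index, fastq1, fastq2, out_sam, threads=4):
    """Return (argv, stdout_file) pairs: two 'bwa aln' runs, then 'bwa sampe'."""
    aln1, aln2 = _intermediate_name(fastq1), _intermediate_name(fastq2)
    return [
        (["bwa", "aln", "-t", str(threads), index, fastq1], aln1),
        (["bwa", "aln", "-t", str(threads), index, fastq2], aln2),
        (["bwa", "sampe", index, aln1, aln2, fastq1, fastq2], out_sam),
    ]


def run_commands(commands):
    """Run each command, redirecting its standard output to the given file."""
    for argv, stdout_file in commands:
        with open(stdout_file, "wb") as handle:
            subprocess.run(argv, stdout=handle, check=True)


for argv, stdout_file in bwa_paired_end_commands(
        "hg19bwaidx", "s_3_1_sequence.txt.gz", "s_3_2_sequence.txt.gz",
        "s_3_sequence.txt.sam"):
    print(" ".join(argv), ">", stdout_file)
```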
TopHat (spliced alignment)
• Download genome index
– ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/hg18.ebwt.zip
• Align paired-end reads (-r = expected inner distance between mates, D ~ 100bp)
– tophat -r 100 -p 4 -o outdir/ hg18 s_1_1_sequence.txt s_1_2_sequence.txt
Trapnell et al, 2009
Basic QC
• Fraction of mapped reads
• How many unique mappers ?
• Fraction of clonal reads (PCR duplicates)
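The three QC numbers can be estimated directly from SAM records, as in this rough Python sketch. It assumes uniqueness is reported through an XT:A:U tag (BWA) or NH:i:1 tag (TopHat), and it treats any mapped read that shares chromosome, position and strand with an earlier read as clonal; real tools use more careful rules.

```python
def basic_qc(sam_lines):
    """Return mapped, unique and clonal fractions from an iterable of SAM lines."""
    total = mapped = unique = clonal = 0
    seen = set()
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.split()
        flag = int(fields[1])
        total += 1
        if flag & 4:                      # 0x4 = read unmapped
            continue
        mapped += 1
        tags = fields[11:]
        if "XT:A:U" in tags or "NH:i:1" in tags:
            unique += 1
        key = (fields[2], int(fields[3]), bool(flag & 16))  # chrom, pos, strand
        if key in seen:
            clonal += 1
        seen.add(key)
    return {
        "mapped_fraction": mapped / total if total else 0.0,
        "unique_fraction": unique / mapped if mapped else 0.0,
        "clonal_fraction": clonal / mapped if mapped else 0.0,
    }


example = [
    "r1 16 chr1 100 37 5M * 0 0 ACGTA IIIII XT:A:U",
    "r2 16 chr1 100 37 5M * 0 0 ACGTA IIIII XT:A:U",
    "r3 0 chr1 200 0 5M * 0 0 ACGTA IIIII XT:A:R",
    "r4 4 * 0 0 * * 0 0 ACGTA IIIII",
]
print(basic_qc(example))
```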
4. Read formats
Read formats
• SAM/BAM
• Eland/Eland Export
SAM format
DH1608P1_0130:6:1103:10579:166379#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG eb`XXYbZdadee^ceV]X][ccTcc^ebeeceeeeWbeeeeeeeceeaee XX:Z:NM_017871,32 NM:i:0 MD:Z:51
DH1608P1_0130:6:1102:3415:150915#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGGGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBac]bbbceedaeddeZceeea_ba_\_eeeeeeedaeeee XX:Z:NM_017871,32 NM:i:1 MD:Z:5T45
DH1608P1_0130:6:1102:13118:62644#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGTGCCTCGGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBBBBBBBBBBB`XTbSa`cffegdggeccbeeffdeggggg XX:Z:NM_017871,32 NM:i:2 MD:Z:7A3T39
DH1608P1_0130:6:1203:3012:157120#TTAGGC 16 chr1 1249826 25 51M * 0 0 AAGGCCGTGACTCTGATCTCAGCCCTCGTCTCCGCCGCGCTCCCGGACCCG BBBBBBBB^`QWZZ]UXYSZSTFRU]Z__SO[adcc[acdV\`Y]YWY][_ XX:Z:NM_017871,34 NM:i:3 MD:Z:4G17G1A26
DH1608P1_0130:6:2206:4445:12756#TTAGGC 16 chr1 1246336 25 1M3487N50M * 0 0 CCAAAGGGTGTGACTCTGATCTCGGGCATCGTCTCCGCCGCGCTCCCGGAC BBBBBBBBBBBBBBBBBBBBBBBB`YdddYdc\cacaNddddcdddaeeee XX:Z:NM_017871,37 NM:i:3 MD:Z:2C5C14A27
DH1608P1_0130:6:2203:7903:43788#TTAGGC 16 chr1 1246336 37 1M3487N50M * 0 0 CCCAAGGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGAC adbe[fbcbccb_cb^cb^^c^edgegggggdfggefffgfbfggggegeg XX:Z:NM_017871,37 NM:i:0 MD:Z:51
CIGAR string, e.g. 5M3487N46M = 5bp-long aligned block, 3487bp skipped in the reference (e.g. an intron), 46bp-long aligned block
MD tag, e.g. MD:Z:4T46 = 4 matches, 1 mismatch (T in the reference), 46 matches
XT tag, e.g. XT:A:U = unique mapper; XT:A:R = more than 1 high-scoring match
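A short Python sketch of reading these two fields. The function names are illustrative; the operation letters follow the SAM specification, and md_mismatches expects only the value part of the tag (without the MD:Z: prefix).

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Reference bases covered by the alignment (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def md_mismatches(md_value):
    """Reference bases at mismatched positions; deletions (^ACG) are skipped."""
    tokens = re.findall(r"\d+|\^[A-Z]+|[A-Z]", md_value)
    return [t for t in tokens if t.isalpha()]


print(parse_cigar("1M3487N50M"))
print(reference_span("1M3487N50M"))
print(md_mismatches("7A3T39"))
```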
Paired-end SAM
D3B4KKQ1_0161:8:2206:11080:31374#CTTGTA 83 chr1 4481348 255 51M = 4481165 0 TTAGATGCATTTTCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAG hiiiiiiihihhdhghggdiiihihffihhheihihhhgggggeeeeebbb NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2206:8294:192062#CTTGTA 147 chr1 4481355 255 51M = 4481284 0 CATTTTCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACAC efehffhgfdiihhhhhihghiiihfhihdhiihgghigefggeeeeebbb NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2204:6985:145082#CTTGTA 147 chr1 4481360 255 51M = 4481202 0 TCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTA ghfhgihihghgihgiiiifiiiiihhhhfifhihhiigggeeceeeea__ NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2205:15014:60805#CTTGTA 83 chr1 4481360 255 51M = 4481238 0 TCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTA hihheiihiiiiiiiiiiiiiiiiiifhiefhiiiiiigggggeceeebba NM:i:0 NH:i:1
D3B4KKQ1_0161:8:1105:17802:25847#CTTGTA 83 chr1 4481362 255 51M = 4481198 0 TTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTAAT gheiiiihhhiiiiiiiiiihiiiiiihgfiiiiiiiigeggceeeeebb_ NM:i:0 NH:i:1
D3B4KKQ1_0161:8:1208:2232:73719#CTTGTA 147 chr1 4481366 255 51M = 4481277 0 CATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTAATTGTA fhghiiiiiiiiiiiiiiiiiiihghiihiiiiihgggegfggeeeeebbb NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2104:18142:93861#CTTGTA 83 chr1 4481367 255 51M = 4481198 0 ATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTAATTGTAT ihghiiiheiiiiihhihfhifgghhhhfgfhiggge_ggggeeeeee_bb NM:i:0 NH:i:1
NM = edit distance
NH = number of alignments for that read
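The second column (83, 147, or 16 in the single-end example) is a bitwise flag. A small Python sketch for decoding it follows; the bit values come from the SAM specification, while the short names are chosen here for readability.

```python
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
]


def decode_flag(flag):
    """List the names of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


for flag in (16, 83, 147):
    print(flag, decode_flag(flag))
```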
BAM format
• Compressed, indexable version of SAM
• Can be uploaded to UCSC Genome Browser
SAMtools
• http://samtools.sourceforge.net/
• Convert SAM to BAM
– samtools view -bS file.sam > file.bam
• Sort BAM file
– samtools sort file.bam file.sorted # (will create file.sorted.bam)
• Index BAM file
– samtools index file.sorted.bam
• Convert BAM to SAM
– samtools view file.bam > file.sam
• Rsamtools
• http://www.bioconductor.org/packages/2.6/bioc/html/Rsamtools.html
SAMtools
• Get alignment statistics
– samtools flagstat pairendfile.bam
149923886 in total
0 QC failure
0 duplicates
124520915 mapped (83.06%)
149923886 paired in sequencing
74961943 read1
74961943 read2
120504218 properly paired (80.38%)
121586068 with itself and mate mapped
2934847 singletons (1.96%)
482748 with mate mapped to a different chr
143256 with mate mapped to a different chr (mapQ>=5)
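The percentages in this output are each count divided by the total number of reads. A quick Python check using the counts above (the dictionary keys are just labels chosen here):

```python
def percent(part, total):
    """Percentage of total, rounded to two decimals as in the flagstat output."""
    return round(100.0 * part / total, 2)


total_reads = 149923886
counts = {
    "mapped": 124520915,
    "properly paired": 120504218,
    "singletons": 2934847,
}
for label, count in counts.items():
    print(label, percent(count, total_reads))
```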
SAMtools
• Get pileup
– samtools pileup file.sorted.bam
chrom pos ref depth read bases / base qualities
chr1 1156 T 26 tTttTTTtTttTttTtTtTTGTTTTT ggggeggggg^Vgf_fggggJceb_g
chr1 1157 T 26 tTttTTTtTttTttTtTtTTTTTTTT ggggfggggg[RgfNfgfgg`ed^]f
chr1 1158 G 26 g$GggGGGgGggGggGgGgGGGGGGGG gggg_ggggg[Ugfddgggga_eW\c
chr1 1159 A 25 AaaAAAaAaaAaaAaAaAAAAAAAA gggaefggg_Xgf_fggggadd]Zg
chr1 1160 A 25 AaaAAAaAaaAaaAaAaAAAAAAAA ggefggggdNVgbZbgggg`ee[\g
chr1 1161 C 25 C$c$c$CCCcCccCccCcCcCCCCCCCC gfgfggfggYYgeadgggg`ea^\g
chr1 1162 C 23 C$CCcCccCccCcCcCCCCCCCC^FC fgggge_`gf_dgggge_e]_gg
chr1 1163 T 22 T$T$tTttTttTtTtTTTTTTTTT ggffg\Rgf_dggeggde]_cg
chr1 1164 C 20 cCccCccCcCcCCCCCCCCC ggg`[gf_dggggg\d[]fg
chr1 1165 A 22 a$AaaAaaAaAaAAAAAAAAA^FA^FA ged_]ggadffgggecX^ggfg
chr1 1166 G 21 G$g$g$GggGgGgGGGGGGGGGGG ggc`gfWfggfggcaSdggfe
chr1 1167 C 19 CccCcCcCCCCCCCCCCC^FC agg\dgggggbZUdfgfgg
chr1 1168 T 19 TttTtTtTTTTTTTTTTTT eggcbfgfgg_cXdegfgg
chr1 1169 T 19 TttTtTtTTTTTTTTTTTT aggccggdggccZdggfgf
chr1 1170 T 19 TttTtTtTTTTTTTTTTTT `gfcfgggggccUcggcgg
chr1 1171 A 19 AaaAaAaAAAAAAAAAAAA ege_fgggggcc[aggcgg
chr1 1172 A 19 A$aaAaAaAAAAAAAAAAAA XggLfggfggdeM_ggagg
chr1 1173 G 18 g$gGgGgGGGGGGGGGGGG gf\fgggggcfPcggegg
chr1 1174 A 17 a$AaAaAAAAAAAAAAAA fce[gggg_eL]ggfdf
chr1 1175 A 16 A$aAaAAAAAAAAAAAA dfggfggdfS[ggegg
^ = start of read at that position (the character after ^ encodes the mapping quality)
$ = end of read at that position
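A rough Python sketch of reading one pileup base string. It assumes the simple notation shown here (letters for bases, upper case for forward-strand reads and lower case for reverse-strand reads, ^ plus one mapping-quality character, and $); indel and reference-match (. and ,) notations are not handled.

```python
def summarize_pileup_bases(bases):
    """Count strand-specific bases, read starts and read ends in a pileup string."""
    forward = reverse = starts = ends = 0
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":            # read start; skip the mapping-quality character
            starts += 1
            i += 2
            continue
        if ch == "$":            # the preceding base was the last one of its read
            ends += 1
        elif ch.isalpha():
            if ch.isupper():
                forward += 1
            else:
                reverse += 1
        i += 1
    return {"depth": forward + reverse, "forward": forward,
            "reverse": reverse, "starts": starts, "ends": ends}


print(summarize_pileup_bases("A$aaA^FA"))
```

The depth computed this way should agree with the fourth pileup column.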
SAMtools
• Removing clonal reads
– Multiple reads that map to the same position with the same orientation are usually considered PCR duplicates
– For mutation detection (less important for RNA-seq), need to collapse them into 1 read (e.g. read with highest quality score)
– samtools rmdup -s file.bam file_noclonal.bam
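The collapsing idea can be illustrated with a few lines of Python. This is a sketch of the principle only, not a description of how samtools rmdup is implemented; the tuple layout is an assumption made for the example.

```python
def collapse_clonal_reads(reads):
    """Keep one read per (chrom, pos, strand): the one with the highest summed quality.

    reads: iterable of (name, chrom, pos, is_reverse, qualities) tuples,
    where qualities is a list of integer base qualities.
    """
    best = {}
    for read in reads:
        name, chrom, pos, is_reverse, qualities = read
        key = (chrom, pos, is_reverse)
        if key not in best or sum(qualities) > sum(best[key][4]):
            best[key] = read
    return list(best.values())


reads = [
    ("r1", "chr1", 100, False, [30, 30, 30]),
    ("r2", "chr1", 100, False, [35, 35, 35]),   # clonal with r1, higher quality
    ("r3", "chr1", 100, True, [20, 20, 20]),    # same position, other strand
]
print([r[0] for r in collapse_clonal_reads(reads)])
```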
5. Secondary Analysis (transcript level quantification, mutation calling)
RPKM
Reads per kilobase of transcript per million reads
• R: Count how many reads map to a transcript
• K: Divide by ( length of transcript / 1,000 )
• M: Divide by (total number of mapped reads in
sample / 1,000,000 )
Cufflinks uses FPKM (same as RPKM, F=fragment, for paired-end reads)
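The three steps translate into a one-line formula. A minimal Python sketch with made-up numbers:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    per_kilobase = read_count / (transcript_length_bp / 1_000)
    return per_kilobase / (total_mapped_reads / 1_000_000)


# Hypothetical example: 500 reads on a 2 kb transcript, 20M mapped reads in the sample
print(rpkm(500, 2_000, 20_000_000))
```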
Cufflinks
cufflinks -p 4 -o outdir/ s_1_sequence.txt.sorted.bam
Trapnell et al, 2010
http://genes.mit.edu/burgelab/miso/
http://www.broadinstitute.org/software/scripture/
Detecting Single Nucleotide Variations
(SNVs)
Short read
AAAATACGCGTATTCTCCCAAAACAATATC
TCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGAT
Reference Human Genome (hg18)
Short read
AAAATACGCCTATTCTCCCAAAACAATATC
TCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGAT
Reference Human Genome (hg18)
Short read
AAAATACGCCTATTCTCCCATAACAATATC
TCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGAT
Reference Human Genome (hg18)
Sequencing has high error rate
Mismatch = real variation OR sequencing error
Short read
AAAATACGCCTATTCTCCCAAAACAATATC
TCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGAT
Reference Human Genome (hg18)
Typical mismatch rate of entire datasets = 0.5-2% (errors >> real variations)
Single Nucleotide Variation
chr2, pos=85623221 bp
Single Nucleotide Variation
chr14, pos=35859525 bp
Single Nucleotide Variation
chr1, pos=220952447
Cancer mutations
All cells in tumor have
heterozygous mutation
A fraction of cells have
heterozygous mutation
Loss of heterozygosity due to
loss of genetic material
The error/mismatch rate is not
uniform across read length
(Figure: mismatch rate along the read)
Popular SNV calling programs
• GATK
– http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
• VarScan
– http://varscan.sourceforge.net/
SNVseeqer: Single Nucleotide Variation detection from deep sequencing data
(Figure: N reads covering the considered position on the genome, each read i with its own error probability p_i; k reads carry the mutation)
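Is k greater than expected by chance, given error rates p_i ?
S_Z = Z_1 + \dots + Z_N, where Z_i = 1 if read i shows a mismatch due to a sequencing error (probability p_i)
P(S_Z = k) = \left\{ \prod_{i=1}^{N} (1 - p_i) \right\} \sum_{i_1 < \dots < i_k} w_{i_1} \cdots w_{i_k}, \quad w_i = p_i / (1 - p_i)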
The Poisson-Binomial distribution
Wacker et al, 2012; Jiang et al, 2012
Chen & Liu, 1997
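To make the question concrete, the Poisson-Binomial distribution can be computed with a simple dynamic-programming recursion over reads, as in this Python sketch. This is one standard way to evaluate the distribution and is not necessarily the algorithm used by SNVseeqer; the error rates in the example are invented.

```python
def poisson_binomial_pmf(error_rates):
    """pmf[k] = probability that exactly k of the reads are erroneous."""
    pmf = [1.0]
    for p in error_rates:
        updated = [0.0] * (len(pmf) + 1)
        for k, prob in enumerate(pmf):
            updated[k] += prob * (1 - p)     # this read is correct
            updated[k + 1] += prob * p       # this read is an error
        pmf = updated
    return pmf


def tail_probability(k, error_rates):
    """P(S >= k): chance of seeing k or more mismatches from errors alone."""
    return sum(poisson_binomial_pmf(error_rates)[k:])


# Hypothetical per-read error rates at one position, 3 reads with the mutation
rates = [0.01, 0.02, 0.005, 0.01, 0.03, 0.01, 0.02, 0.01]
print(tail_probability(3, rates))
```

A small tail probability suggests that k mismatches are unlikely to be explained by sequencing errors alone.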
Indel calling
• Complicated because indels often occur within
microsatellite regions, eg CACACACA
– CA--CACACA as good as CACA--CACA, CACACA--CA
• Since reads are aligned independently, local
realignment is needed
• DINDEL (used in 1000 Genomes Project)
http://www.sanger.ac.uk/resources/software/dindel/
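The microsatellite ambiguity is easy to demonstrate: deleting 2 bp at many different positions of a CA repeat gives the same read. A small Python sketch (function names are illustrative):

```python
def delete_at(sequence, start, length):
    """Return the sequence with `length` bases removed starting at 0-based `start`."""
    return sequence[:start] + sequence[start + length:]


def equivalent_deletion_starts(reference, read, length):
    """All deletion start positions in the reference that reproduce the read."""
    return [start for start in range(len(reference) - length + 1)
            if delete_at(reference, start, length) == read]


reference = "CACACACA"
read = "CACACA"
print(equivalent_deletion_starts(reference, read, 2))
```

Because every placement is equally good, aligners working read by read can report the same indel at different coordinates, which is why local realignment is needed.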
Variant annotation
• Variants can be either mutation or (more often) polymorphism.
dbSNP catalogs all known polymorphisms
• Missense, nonsense, intron, 3’UTR, 5’UTR, etc
– SeattleSNP http://pga.gs.washington.edu/
• Severity of missense mutations
– PolyPhen http://genetics.bwh.harvard.edu/pph2/
– Mutation Assessor http://mutationassessor.org/
• GATK for variant annotation
http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
• Cross-species conservation
6. Read data visualization
SAMtools
samtools tview file.sorted.bam wg.fa
UCSC Genome Browser
• Upload BAM file to genome browser or make
it accessible to UCSC from your own web page
Integrative Genomics Viewer (IGV)
Read densities
(Figure: read count along the genome)
Wiggle files for Genome Browser
variableStep chrom=chr1 span=10
1471 0.3
1481 0.6
1491 0.6
1501 0.6
1511 0.6
1521 0.6
1531 1.1
1541 1.7
1551 1.9
1561 2.1
1571 2.5
1581 2.8
1591 3.2
1601 3.9
1611 3.9
1621 4.5
1631 4.8
1641 4.2
1651 3.9
1661 3.8
1671 3.2
1681 2.4
1691 1.9
1701 1.4
1711 1.3
1721 0.8
1871 1.4
1881 4.9
1891 9.1
1901 9.7
1911 10.7
1921 11.2
1931 12.3
http://genome.ucsc.edu/goldenPath/help/wiggle.html
http://genome.ucsc.edu/goldenPath/help/bigWig.html
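A minimal Python sketch of producing variableStep lines like the ones above from per-base read depth. The averaging over windows of `span` bases and the skipping of empty windows are choices made for this example, and the input depths are invented; a final window shorter than `span` is averaged over the bases available.

```python
def coverage_to_wiggle(chrom, first_pos, depths, span=10):
    """Return variableStep wiggle lines with the mean depth of each window."""
    lines = [f"variableStep chrom={chrom} span={span}"]
    for offset in range(0, len(depths), span):
        window = depths[offset:offset + span]
        mean_depth = sum(window) / len(window)
        if mean_depth > 0:                      # omit windows without reads
            lines.append(f"{first_pos + offset} {mean_depth:.1f}")
    return lines


depths = [0] * 10 + [3] * 10 + [6] * 5 + [0] * 5
print("\n".join(coverage_to_wiggle("chr1", 1471, depths)))
```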
7. Bioconductor packages for high-throughput sequencing
BioC packages
• IRanges
http://bioconductor.org/packages/release/bioc/html/IRanges.html
• Rsamtools
http://bioconductor.org/packages/2.7/bioc/html/Rsamtools.html
• ShortRead
http://bioconductor.org/packages/release/bioc/html/ShortRead.html
• rtracklayer
http://bioconductor.org/packages/2.8/bioc/html/rtracklayer.html
• BSgenome
And many more
SAMtools, Unix programs and R/BioC
• Rsamtools
• Unix commands can be run in R
system("samtools rmdup -s file.bam file_noclonal.bam")
http://manuals.bioinformatics.ucr.edu/home/ht-seq
8. Challenges and evolution of sequencing and its analysis
Storage is becoming a real problem
Kahn, 2011, Science
Sequencing is becoming faster
Reads are becoming longer
PacBio
How do you interpret sequencing
data in a clinical context ?
Data integration
ChIP-seq for BCL6, BCOR, SMRT,
H3K79me2, H3K4me1, H3K4me3,
H3K27Ac, H3K9Ac, H3K27me3, and DNA
methylation (HELP) in LY1 cells
HiC
Integrative
statistical model
Predictions /
Mechanisms
Experiments
ChIP-seq / siRNA etc
The end
• ole2001@med.cornell.edu
• eug2002@med.cornell.edu