atccgtatcacggtca-cagatcagtccagt

Bioinformatics at Molecular Epidemiology
- new tools for identifying indels in sequencing data
Kai Ye
k.ye@lumc.nl
Data collection for osteoarthritis, cardiovascular disease and longevity
• Serum parameters
• Cellular characteristics (biobank)
• Skin ageing
• Glycosylation
• Metabonomic
• Transcriptomic
• Genetic (GWAS/sequence)
• Epigenetic
• Data Integration
[Figure: chromatogram, FLU signal (mV) versus time (min, 0 to 80), with peaks labeled N-Acetylglucosamine, Galactose, Mannose, Sialic acid and Fucose]
Genetic & Epigenetic analyses
Joost Kok
Erik vd Akker
Kai Ye
Statistical analysis
Metabonomic analysis
About me
• 1995 – 2003: B.S. and M.S. in biology and pharmaceutical science
• 2004 – 2008: PhD cum laude at Leiden University. Thesis title: Novel algorithms for protein sequence analysis
• 2008 – 2009: Postdoc at the European Bioinformatics Institute, collaborating with scientists at the Sanger Institute
• Currently assistant professor at MolEpi
A Pindel approach for identifying indels in Next-Gen sequencing data
• Paired-end reads in Next-gen sequencing
• Indel detection algorithms
• Pindel
• Cancer genome project
• 1000 genomes project
Paired-end reads in Next Generation sequencing
~ insert size
Mapping paired-end reads
SNP
• CNVs: copy number variations
• INDELs: insertions and deletions
• SVs: structural variations
Gapped alignment for small indels
indel
ATCCGTATCACGGTCA-CAGATCAGTCCAGT
ATCCGTATCACGGTCAGCAGATCAGTCCAGT
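The gapped alignment above can be read off programmatically. The following is a minimal sketch, assuming the two rows are already aligned to equal length and that `-` marks a gap; the function name `find_gaps` and the row labels `top` and `bottom` are illustrative only, since the slide does not state which row is the read and which is the reference.

```python
def find_gaps(top, bottom):
    """Return (start_index, gapped_row, bases_opposite_gap) for each gap run."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    events = []
    i = 0
    while i < len(top):
        if top[i] == "-" or bottom[i] == "-":
            # Decide which row carries the gap, then extend over the whole run.
            gapped, other, label = (
                (top, bottom, "top") if top[i] == "-" else (bottom, top, "bottom")
            )
            j = i
            while j < len(gapped) and gapped[j] == "-":
                j += 1
            events.append((i, label, other[i:j]))
            i = j
        else:
            i += 1
    return events


row1 = "ATCCGTATCACGGTCA-CAGATCAGTCCAGT"
row2 = "ATCCGTATCACGGTCAGCAGATCAGTCCAGT"
print(find_gaps(row1, row2))  # [(16, 'top', 'G')]
```

If the top row is taken to be the read and the bottom row the reference, this gap would correspond to a 1-base deletion of G in the sample.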
Read-depth for CNVs
Read-pair approach for SVs
[Figure: three panels comparing read-pair mapping between Sample and Reference: No Indel, Deletion, Insertion]
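The read-pair idea in these three panels can be sketched as a comparison between the distance at which a pair maps on the reference and the expected insert size. This is a minimal illustration, not a production caller: the function name `classify_pair` and the `tolerance` parameter are assumptions introduced here, and real tools derive the tolerance from the observed insert-size distribution.

```python
def classify_pair(mapped_distance, expected_insert, tolerance):
    """Classify one read pair by its mapped distance on the reference.

    A pair that maps farther apart than expected suggests the sample lacks
    sequence present in the reference (deletion); a pair that maps closer
    together suggests extra sequence in the sample (insertion).
    """
    if mapped_distance > expected_insert + tolerance:
        return "deletion"
    if mapped_distance < expected_insert - tolerance:
        return "insertion"
    return "no indel"


print(classify_pair(500, 300, 50))  # deletion
print(classify_pair(200, 300, 50))  # insertion
print(classify_pair(310, 300, 50))  # no indel
```

Because the signal is only as sharp as the insert-size distribution, this approach suits larger events; small indels fall inside the tolerance band.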
Mapping paired-end reads
SNP or small indel
• read-pairs
• read-depth
Pindel: Deletions
test
ref
1 base to 1 million bases
Pindel: Deletions
ref
Anchor
08 April 2015
Pindel: Deletions
2 x average distance
ref
Anchor
Expected maximum deletion size + read length (36)
Pindel: Deletions
sample
reference
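The two search windows on these slides can be sketched as a split-read search: one end of the unmapped read is looked for within 2 x the average distance from the anchor, and the other end within the expected maximum deletion size plus the read length. The code below is a simplified illustration under stated assumptions, not Pindel's implementation: it uses a naive exact substring search, assumes the read is already oriented on the forward strand downstream of the anchor, and every name in it (`find_deletion`, `avg_insert`, `max_deletion`, `min_seed`) is hypothetical.

```python
def find_deletion(reference, read, anchor_pos, avg_insert, max_deletion, min_seed=4):
    """Naive split-read search; returns (deletion_start, deletion_length) or None."""
    read_len = len(read)
    # Window 1: the near end of the read is expected within
    # 2 x the average insert size of the anchor.
    w1_start = anchor_pos
    w1_end = min(len(reference), anchor_pos + 2 * avg_insert)
    # Try every split point, longest near-end fragment first.
    for split in range(read_len - min_seed, min_seed - 1, -1):
        near, far = read[:split], read[split:]
        near_pos = reference.find(near, w1_start, w1_end)
        if near_pos < 0:
            continue
        if reference.find(near, near_pos + 1, w1_end) >= 0:
            continue  # near fragment is ambiguous inside window 1
        near_end = near_pos + split
        # Window 2: maximum deletion size + read length past the near fragment.
        w2_end = min(len(reference), near_end + max_deletion + read_len)
        far_pos = reference.find(far, near_end + 1, w2_end)
        if far_pos >= 0:
            return near_end, far_pos - near_end
    return None


# Toy reference with a 10-base segment missing from the read.
left_flank, deleted, right_flank = "TTCAGCCTAA", "CGTGGATCCA", "TGCAAGTCGG"
reference = "GATTACAGGC" + left_flank + deleted + right_flank + "ACCTTGAGCA"
read = left_flank + right_flank
print(find_deletion(reference, read, anchor_pos=0, avg_insert=15, max_deletion=15))
# (20, 10)
```

Restricting both searches to windows derived from the anchor is what keeps the split-read search tractable: each fragment is matched against a short stretch of reference rather than the whole genome.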
African male: NA18507
• Bentley et al., Nature 2008
• 135 Gb of sequence
• ~4 billion paired 35-base reads
• After preprocessing: 56,161,333 pairs of one-end mapped reads
• Pindel
– 142,908 insertions of 1-16 bp
– 162,068 deletions of 1 bp to 10 kb
Deletion size distribution
Applications
• Cancer genome project
• 1000 genomes project
Cancer genome
• COLO-829 cells
• Normal: ~30x paired-end 100 bp reads
• Tumor: ~40x paired-end 100 bp reads
• Search for somatic (tumor-specific) indels
1000 Genomes Project
• Pilot 1: 180 people from 3 major geographic groups (YRI, CEU, and CHB plus JPT) at low coverage (~4x)
• Pilot 2: the genomes of two families (CEU and YRI; both parents and an adult child) at deep coverage (20x per genome)
• Pilot 3: sequencing the coding regions (exons) of 1,000 genes in 1,000 people at deep coverage (20x)
www.ebi.ac.uk/~kye/pindel
k.ye@lumc.nl