GPU and machine learning solutions for comparative genomics

Usman Roshan
Department of Computer Science
New Jersey Institute of Technology
Outline
• Talk centered around problem of mapping DNA
sequences to genome, analysis, and applications
• Prediction of chronic lymphocytic leukemia with
whole exome sequences and machine learning
– Data processing
– Results
• Graphics Processing Unit program for mapping
divergent reads to genomes and applications on
real data
– Overview of program
– Results on simulated and real data
Disease risk prediction
• Prediction of disease risk
with genome wide
association studies has
yielded low accuracy for
most diseases.
• Family history is competitive
in most cases except for
cancer (Do et al., PLoS
Genetics, 2012)
Disease risk prediction
• Our own studies have shown limited accuracy
with various machine learning methods
– Univariate and multivariate feature selection
– Multiple kernel learning
• What accuracy can we achieve with machine
learning methods applied to variants detected
from whole exome data?
Chronic lymphocytic leukemia prediction with
exome sequences and machine learning
• We selected exome sequences of chronic
lymphocytic leukemia from dbGaP: the largest such
dataset at the time of download (August 2013), with
186 cases and 169 controls
• Case and control prediction accuracy with
genetic variants was unknown
• The same dataset was previously studied in Wang et
al., NEJM, 2011, which reports new associated genes
but no risk prediction
What is whole exome data?
Human genome sequence
Introns
Coding regions
Exons
Illumina 76bp short reads (exome data).
In practice flanking regions are also sequenced
and so some intronic regions are included.
Obtain structural variants (1)
Human genome reference sequence
Short reads are
aligned to human
genome
• Data of size 3.2 terabytes at 140X coverage
• Mapped to human genome reference with
BWA MEM (popular short read mapper)
Obtain structural variants (2)
Human genome
reference
ACCAG
ACCAG
ACCAG
ACCAG
ACCCG
ACCCG
ACCCG
Short reads from a
Single individual
Heterozygous SNP A/C
ATTGA
ATT--A
ATT--A
ATT--A
ATTGA
ATTGA
ATTGA
Heterozygous indel
• Obtained SNPs and indels from the alignments
for each individual
Obtain structural variants (3)
Genotypes
       A/C   C/G
C0     AA    CC
C1     AC    CG
C2     AA    GG
Co1    AC    CG
Co2    CC    CG

Numerically encoded
       A/C   C/G
C0     0     0
C1     1     1
C2     0     2
Co1    1     1
Co2    2     1
• Combine variants from different individuals to form a data matrix
• Each row is a case or control and each column is a variant
• 180 cases and 155 controls after excluding very large files and
problematic datasets
• 545,721 SNPs and indels (530,129 SNPs, 15,592 indels)
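The numeric encoding in the small table above can be sketched as counting copies of one allele per genotype. This is a minimal sketch, not the pipeline's actual code: the function name, the dictionaries, and the choice of which allele is counted (C for A/C, G for C/G) are assumptions inferred from the example values.

```python
def encode_genotype(genotype, counted_allele):
    """Count copies (0, 1, or 2) of the counted allele in a two-letter genotype."""
    return sum(1 for allele in genotype if allele == counted_allele)

# Assumption: the counted allele is the second one named for each variant,
# which reproduces the 0/1/2 values in the example table.
counted = {"A/C": "C", "C/G": "G"}

# Genotypes from the example: C0..C2 and Co1..Co2 are the row labels on the slide.
genotypes = {
    "C0":  {"A/C": "AA", "C/G": "CC"},
    "C1":  {"A/C": "AC", "C/G": "CG"},
    "C2":  {"A/C": "AA", "C/G": "GG"},
    "Co1": {"A/C": "AC", "C/G": "CG"},
    "Co2": {"A/C": "CC", "C/G": "CG"},
}

variants = list(counted)  # column order of the data matrix
matrix = {
    individual: [encode_genotype(calls[v], counted[v]) for v in variants]
    for individual, calls in genotypes.items()
}

for individual, row in matrix.items():
    print(individual, row)
```

Each row of `matrix` is one individual and each position in the row is one variant, which is the layout the cross-validation study works on.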
Perform cross-validation study
Full dataset: each row is a case or control individual and each
column is a variant (SNP or indel), for example rows 00120... and
02221...
[Figure: rows of the full dataset divided into training data and
validation data]
1. Split rows randomly into
training and validation sets
(90:10 ratio).
2. Rank all variants on the training
data.
3. Learn a support vector
machine classifier on the training
data with the top k ranked
variants.
4. Predict case and control on
the validation data.
5. Compute the error and repeat
100 times.
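The five steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the study's code: the talk does not say which SVM implementation or kernel was used, so a linear SVM trained by Pegasos-style stochastic subgradient descent is assumed here, variants are ranked by absolute Pearson correlation with the label (one of the rankings the talk compares), and all function names and the toy data are hypothetical.

```python
import random

def pearson_abs(column, labels):
    """Absolute Pearson correlation between one variant column and the labels."""
    n = len(labels)
    mx, my = sum(column) / n, sum(labels) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(column, labels))
    vx = sum((x - mx) ** 2 for x in column)
    vy = sum((y - my) ** 2 for y in labels)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no ranking information
    return abs(cov) / (vx * vy) ** 0.5

def rank_variants(X, y):
    """Column indices sorted from most to least correlated with the labels."""
    scores = [pearson_abs([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])

def train_linear_svm(X, y, lam=0.01, steps=1000, rng=None):
    """Assumed trainer: Pegasos-style subgradient descent on the hinge loss."""
    rng = rng or random.Random(0)
    w = [0.0] * len(X[0])
    for t in range(1, steps + 1):
        i = rng.randrange(len(X))
        eta = 1.0 / (lam * t)
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        w = [(1.0 - 1.0 / t) * wj for wj in w]      # shrink (regularization)
        if margin < 1:                               # hinge loss is active
            w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

def cross_validate(X, y, k, splits=100, seed=0):
    """Repeat the 90:10 split, rank, train, predict cycle; return accuracies."""
    rng = random.Random(seed)
    n_val = max(1, len(X) // 10)
    accuracies = []
    for _ in range(splits):
        idx = list(range(len(X)))
        rng.shuffle(idx)                             # step 1: random split
        val, train = idx[:n_val], idx[n_val:]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        top = rank_variants(X_train, y_train)[:k]    # step 2: training rows only

        def features(row):
            # center genotypes 0/1/2 to -1/0/1 and append a constant bias term
            return [row[j] - 1 for j in top] + [1.0]

        w = train_linear_svm([features(r) for r in X_train], y_train, rng=rng)  # step 3
        correct = sum(predict(w, features(X[i])) == y[i] for i in val)          # step 4
        accuracies.append(correct / len(val))        # step 5
    return accuracies

# Hypothetical toy data: 20 cases (+1), 20 controls (-1), 3 informative
# variants followed by 7 uninformative ones.
labels = [1] * 20 + [-1] * 20
data = [[2 if lab == 1 else 0] * 3 + [(i + j) % 3 for j in range(7)]
        for i, lab in enumerate(labels)]
accuracies = cross_validate(data, labels, k=3)
print(sum(accuracies) / len(accuracies))
```

The point of the sketch is the order of operations: the ranking is recomputed inside every split from the training rows alone, so the validation rows never influence which variants are selected.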
Variant ranking
Before ranking
       F0   F1   F2
C0     1    2    0
C1     1    2    1
C2     1    2    2
Co1    1    0    1
Co2    2    0    0

Rank features (columns reordered by rank)
       F1   F2   F0
C0     2    0    1
C1     2    1    1
C2     2    2    1
Co1    0    1    1
Co2    0    0    2
Different feature rankings
• Correlation coefficients between rankings on
SNPs

              Pearson   F-score
F-score       0.99
Chi-square    0.37      0.37
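For reference, a chi-square ranking can be sketched as scoring each variant by the chi-square statistic of its genotype-by-class contingency table and sorting by that score. This is a minimal sketch assuming the standard Pearson chi-square statistic; the talk does not give the exact formulation used, and the variant names and values below are hypothetical.

```python
from collections import Counter

def chi_square(column, labels):
    """Pearson chi-square statistic of the genotype-by-class contingency table."""
    n = len(labels)
    observed = Counter(zip(column, labels))   # counts per (genotype, class) cell
    genotype_totals = Counter(column)
    class_totals = Counter(labels)
    statistic = 0.0
    for genotype, g_total in genotype_totals.items():
        for cls, c_total in class_totals.items():
            expected = g_total * c_total / n  # count expected under independence
            statistic += (observed[(genotype, cls)] - expected) ** 2 / expected
    return statistic

# Hypothetical example: 3 cases (1) and 3 controls (0), three variants.
labels = [1, 1, 1, 0, 0, 0]
variants = {
    "v_a": [2, 2, 2, 0, 0, 0],   # genotype separates cases from controls
    "v_b": [0, 1, 2, 0, 1, 2],   # genotype independent of class
    "v_c": [1, 1, 2, 0, 0, 1],   # partial association
}

scores = {name: chi_square(col, labels) for name, col in variants.items()}
ranking = sorted(scores, key=lambda name: -scores[name])
print(ranking)
```

A variant whose genotypes are distributed identically in cases and controls scores 0, and the score grows as the genotype distribution differs more between the two classes.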
Risk prediction with chi-square ranked SNPs
[Figure: Accuracy of Top Ranked Features. Accuracy (y-axis) against
number of top ranked SNPs, top10 to top1000 (x-axis), for three variant
sets: snps, snps_indel, and indel]
Mean accuracy of 85.7% with top
60 ranked SNPs (across 100 splits)
• Mean accuracy with significant SNPs only is 81%, which is significantly lower (Wilcoxon rank test, p-value = 10^-14)
• Significant SNPs are on chromosome 14 in the IGH gene; predictive SNPs are on chromosomes 2, 14, and 15
in introns and exons of IGK, IGH, and LOC642131.
• One predictive SNP has mutations only in case individuals. Previously reported genes are not significant.
Principal component analysis of SNP data
[Figure: two PCA plots of PC1 (x-axis) against PC2 (y-axis)]
– PCA plot of all 530,129 SNPs
– PCA plot of top 60 chi-square ranked SNPs
Summary
• Our predictive model could be used for prognosis, but
replication in a different sample is required
first.
• Better alignments may yield more predictive
variants. NextGenMap has a better mapping
rate than BWA but is much slower
• Would our pipeline work for other cancers?
Mapping divergent short reads to
genomes
Human genome reference sequence
Short reads are
aligned to human
genome
• Recall the problem of mapping short reads to genomes
• Methods based on hash tables and the Burrows-Wheeler transform are fast,
but accuracy falls quickly as divergence increases
• High performance Smith-Waterman implementations like CUDASW++ and
SSW take a long time to finish (even for bacterial genome mapping)
• Our objective: align divergent reads faster than Smith-Waterman and
more accurately than hash tables and the Burrows-Wheeler transform.
MaxSSmap algorithm
Input: Whole genome and a short read
Genome fragments of same length
Thread 0
Thread 1
Thread 2
Thread 3
Thread 4
Thread 5
• Thread number i maps the read to fragment i.
• Threads run in parallel on a GPU (or CPU with many
cores)
• We also account for junctions between fragments
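The fragment-per-thread idea can be sketched sequentially in Python. This is an illustration only, not MaxSSmap itself: the slide does not give the scoring scheme, so +1 for a match and -1 for a mismatch are assumed, the maximum scoring subsequence is computed with Kadane's algorithm, gaps are ignored, and junctions are handled by letting an alignment that starts in one fragment run into the next. All function names are hypothetical, and an ordinary loop stands in for the parallel GPU threads.

```python
def max_scoring_subsequence(scores):
    """Kadane's algorithm: largest sum over any contiguous run of scores."""
    best = current = 0
    for s in scores:
        current = max(0, current + s)
        best = max(best, current)
    return best

def map_read_to_fragment(genome, start, length, read, match=1, mismatch=-1):
    """Work of one thread: try every start position inside one fragment.

    The read may extend past the fragment end into the next fragment,
    which is how this sketch covers junctions (an assumption).
    """
    best_score, best_pos = -1, None
    last_start = len(genome) - len(read) + 1
    for pos in range(start, min(start + length, last_start)):
        scores = [match if genome[pos + i] == read[i] else mismatch
                  for i in range(len(read))]
        score = max_scoring_subsequence(scores)
        if score > best_score:
            best_score, best_pos = score, pos
    return best_score, best_pos

def map_read(genome, read, fragment_length):
    """Run every 'thread' and keep the best (score, position) overall."""
    results = [map_read_to_fragment(genome, start, fragment_length, read)
               for start in range(0, len(genome), fragment_length)]
    return max(results, key=lambda r: r[0])

# Hypothetical example: the read comes from position 10 with one mismatch,
# and its alignment crosses the junction between fragments 8..15 and 16..23.
genome = "T" * 10 + "ACGTACGGATCC" + "T" * 10
read = "ACGTACGCATCC"
print(map_read(genome, read, fragment_length=8))
```

Because each fragment is scored independently, the per-fragment calls have no shared state, which is what makes the one-thread-per-fragment layout on a GPU possible.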
Experimental study
Genome sequence
Align reads with
NextGenMap
Some reads are not
mapped due to
mismatches and gaps.
We realign them with
MaxSSmap and Smith-Waterman
Simulation study
Reads with 30% divergence, with gaps

Program                        % correct (% incorrect)   Time (mins)
BWA (multicore)                0.5 (0)                   0.4
NextGenMap (GPU)               19 (0.4)                  2.1
NextGenMap+MaxSSmap_fast       82 (2.9)                  162
NextGenMap+MaxSSmap            90.5 (3.5)                222
NextGenMap+CUDASW++            92.5 (1.6)                1528

• Simulated 1 million 251 bp E. coli reads with Stampy and
aligned them to the E. coli genome (approximately 4.6 million base
pairs). We know the true positions of the reads.
• Shown above is the percentage of reads that were correctly
mapped by each program (incorrect in parentheses)
Ancient DNA mapping
• Aligned 100,000 76 bp ancient horse DNA reads to the horse
genome (approximately 2.3 billion base pairs) and measured the
number of reads that were mapped.
• Shown above is the percentage of reads that were mapped by
each program
• MaxSSmap alignments contain 39% mismatches on
average
Mapping paired reads
Genome sequence
Reads come in pairs.
We align them with
NextGenMap and expect
them to be mapped
within 500 base pairs
We realign pairs
1. where both reads are
mapped farther than
500 base pairs apart
2. where at least one
read in the pair is
unmapped
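The two realignment conditions can be sketched as a small classifier over mapped positions. This is a sketch under assumptions: positions are taken to be coordinates on the same reference sequence, an unmapped read is represented by `None`, and the function names are hypothetical.

```python
def classify_pair(pos1, pos2, max_distance=500):
    """Label a read pair from the mapped positions of its two reads."""
    if pos1 is None or pos2 is None:
        return "unmapped"        # at least one read in the pair did not map
    if abs(pos1 - pos2) <= max_distance:
        return "concordant"      # mapped within 500 base pairs of each other
    return "discordant"          # both mapped, but too far apart

def needs_realignment(pos1, pos2, max_distance=500):
    """Pairs that are discordant or contain an unmapped read are realigned."""
    return classify_pair(pos1, pos2, max_distance) != "concordant"

def percent_concordant(pairs, max_distance=500):
    """Percentage of pairs whose mapped positions are within max_distance."""
    concordant = sum(classify_pair(a, b, max_distance) == "concordant"
                     for a, b in pairs)
    return 100.0 * concordant / len(pairs)

# Hypothetical mapped positions for four read pairs.
pairs = [(1000, 1300), (1000, 9000), (None, 4200), (700, 250)]
to_realign = [pair for pair in pairs if needs_realignment(*pair)]
print(to_realign, percent_concordant(pairs))
```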
Realigning paired reads to human
genome
• Aligned 100,000 101 bp paired reads from NA18278 in 1000 Genomes to the human
genome reference (3 billion base pairs).
• Shown here is the percentage of paired reads whose mapped positions are within
500 base pairs (also known as concordant reads).
• In MaxSSmap we realign discordant reads from NextGenMap as well.
• MaxSSmap alignments have 19% mismatches on average
• Variant detection not performed yet
Summary
• Better accuracy and mapping rate than
NextGenMap and BWA
• Runtime for large genomes still very high
relative to NextGenMap but faster than Smith-Waterman
(speedup increases with number of
reads).
• More analysis needed on real data
Software and acknowledgements
• Our software, data, and publications can be
found at http://www.cs.njit.edu/usman
• Students: Bharati Jhadev, Nihir Patel, and Turki
Turki
• Dennis R. Livesay for GPU cluster at University of
North Carolina at Charlotte and Shahriar Afkhami
for GPU machine at NJIT
• NJIT system admins David Perel, Kevin Walsh, and
Gedaliah Wolosh for high performance
computing support and storage of genomic data.
References
• Turki Turki and Usman Roshan, MaxSSmap: A
GPU program for mapping divergent short
reads to genomes with the maximum scoring
subsequence (submitted)
• Bharati Jhadav, Nihir Patel, and Usman
Roshan, Prediction of chronic lymphocytic
leukemia with exome sequences and machine
learning (in preparation for submission)
Thank you!
• Questions….