Data Analysis Introduction

The Barcode of Life
Integrating machine learning techniques for
species prediction and discovery
www.barcodinglife.com
Welcome to the Meeting
•Barcode of Life: a great opportunity to contribute to a
fast-growing area of research.
•Some questions to keep in mind for the
afternoon…
Open Questions
•Species discovery vs. prediction
•Data structure, missing data, sample sizes
•New visualization tools for both discovery and
prediction
•Confidence measures for species discovery and
individual specimen assignments
– controlling the number of false discoveries
– power of detection
Barcoding Data – A first look at the data.
DIMACS BOL Data Analysis Working Group meeting
September 26, 2005
Rebecka Jornsten
Department of Statistics, Rutgers University
http://www.stat.rutgers.edu/~rebecka/DIMACSBOL/DimacsMeetingDATA/
Thanks to Kerri-Ann Norton
Outline
•Data Structure and Data Retrieval
•Sequencing and Base Calling
•Distance metrics – Sequence information
•Clustering
•Classification
•Open Questions – Discussion
Questions to think about are highlighted in red.
What do the data look like?
Sequencing
•Peak finding
•Deconvolution
•Denoising
•Normalization
•Base calling
•Quality assessment
(ABI base caller, Phred)
Sample Data
• www.barcodinglife.com
• Leptasterias data – six-rayed sea stars
• Astraptes data – skipper butterflies
• Collembola data – springtails
Sample Data
• Leptasterias data – six-rayed sea stars
5 species, 21 specimens
Sample sizes 3-7
Sequence length 1644
• Astraptes data – skipper butterflies
12 species, 451 specimens
Sample sizes 3-96, 8 with more than 20
Sequence length 594
• Collembola data – springtails
18 species, 54 specimens
Sample sizes 1-5
Sequence length 635
Sequence Information
•We can compute the information content for
each nucleotide.
•Is there a lot of variability between species at
locus j?
•Is there a lot of variability within a species at
locus j?
•Are the same loci discriminating between
multiple species?
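The questions above can be made concrete with a small sketch: per-locus Shannon entropy measures variability at locus j, and the mutual information between the base at locus j and the species label measures how well that locus discriminates between species. The function names and the toy sequences below are illustrative assumptions, not the meeting's data or code.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Shannon entropy (in bits) of a list of symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum((c / n) * log2(n / c) for c in counts.values())

def locus_entropy(seqs, j):
    """Entropy of the bases observed at locus j across aligned sequences."""
    return entropy([s[j] for s in seqs])

def locus_mutual_information(seqs, labels, j):
    """I(base at locus j; species) = H(base) - H(base | species)."""
    bases = [s[j] for s in seqs]
    n = len(seqs)
    h_conditional = 0.0
    for species in set(labels):
        group = [b for b, lab in zip(bases, labels) if lab == species]
        h_conditional += len(group) / n * entropy(group)
    return entropy(bases) - h_conditional

# Toy aligned sequences (hypothetical, not the Astraptes data).
seqs = ["acgt", "acgt", "atgt", "atgc"]
labels = ["A", "A", "B", "B"]

for j in range(len(seqs[0])):
    print(j, locus_entropy(seqs, j), locus_mutual_information(seqs, labels, j))
```

In this toy example, locus 1 is constant within each species but differs between them, so it carries the most information about the species label; locus 3 varies only within species B and is less informative.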
Astraptes:
[Figure: within-species entropy for the 9 species with 20+ specimens; annotation: “pure”]
[Figure: mutual information of each locus for the 9 species]
[Figure: pairwise mutual information; panels: 10 vs. 11, 10 vs. 12, 2 vs. 10, 2 vs. 12]
Distance Metric
•To group the specimens in an unsupervised
fashion we need to come up with a distance
metric.
•Without prior information of which loci are
informative, we compute distances using the
entire sequence (for Astraptes 594 bases)
•The 0-1 distance metric is the most commonly
used
•However, some bases are ‘uncalled’, usually
denoted by a letter other than a, c, g, t
•How should we take this into account?
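One possible answer to the question above, sketched under an assumed convention: skip every position where either sequence has an uncalled base, and divide the number of mismatches by the number of positions actually compared. The names `masked_01_distance` and `distance_matrix` are hypothetical; other conventions (for example, counting an uncalled base as half a mismatch) are equally defensible.

```python
CALLED = set("acgt")

def masked_01_distance(s, t):
    """Fraction of mismatches over positions where both bases are called.

    Positions where either sequence has an uncalled base (any letter
    other than a, c, g, t) are skipped. This is an assumed convention.
    """
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(s.lower(), t.lower()):
        if a in CALLED and b in CALLED:
            compared += 1
            mismatches += (a != b)
    if compared == 0:
        return float("nan")  # no position could be compared
    return mismatches / compared

def distance_matrix(seqs):
    """Symmetric matrix of pairwise masked 0-1 distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = masked_01_distance(seqs[i], seqs[j])
    return d

# Toy example: 'n' marks an uncalled base.
print(masked_01_distance("acgt", "aagn"))  # one mismatch over three compared positions
```

Normalizing by the number of compared positions keeps distances comparable between specimens with different numbers of uncalled bases, at the cost of noisier estimates when few positions remain.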
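[Figure: hierarchical clustering, Astraptes]
[Figure: PAM clustering, Astraptes]
[Figure: hierarchical clustering, Collembola]
Another example: Leptasterias – 5 species, 21 specimens
[Figure: panels “All groups” and “Groups 3 vs. 4”]
[Figure: hierarchical clustering, Leptasterias; annotations: Group 1, Group 2, “Problem…”]
[Figure: PAM clustering, Leptasterias]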
Selecting the number of clusters via silhouette width, CV, etc.
leads to combining species 3 and 4: these data do
not support a separate species.
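To illustrate how silhouette width can pick the number of clusters, here is a minimal sketch using a naive average-linkage hierarchical clustering on a precomputed distance matrix. It is not the analysis used for the slides: the helper names are hypothetical, the toy points are made up, and PAM is not implemented here.

```python
def average_linkage(d, k):
    """Naive agglomerative (average-linkage) clustering into k clusters.

    d is a symmetric distance matrix (list of lists).
    Returns one cluster label per item.
    """
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(d[i][j] for i in clusters[a] for j in clusters[b])
                dist = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    labels = [0] * len(d)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def mean_silhouette(d, labels):
    """Average silhouette width for k >= 2; singletons contribute 0."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        own = [d[i][j] for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue
        a = sum(own) / len(own)
        others = {}
        for j in range(n):
            if labels[j] != labels[i]:
                others.setdefault(labels[j], []).append(d[i][j])
        b = min(sum(v) / len(v) for v in others.values())
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Toy 1-D points forming three well-separated groups (hypothetical data).
points = [0, 1, 2, 50, 51, 90, 91]
d = [[abs(p - q) for q in points] for p in points]
scores = {k: mean_silhouette(d, average_linkage(d, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

With real barcode data, `d` would be a matrix of pairwise sequence distances. When two species are poorly separated, the silhouette criterion tends to prefer the solution that merges them.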
Classification
•A classifier that in principle closely resembles the
hierarchical clustering approach is k-nearest neighbors (kNN).
Leave-one-out cross-validation:
On the Leptasterias data, 1-2 specimens are
misallocated by this classifier.
Both of these specimens are in group 4 (and are
labeled as 3 by the classifier).
Cross-validation shows that one observation is
labeled as 4 only if it is in the training set, and otherwise as 3.
The other mislabeled observation fails in 15 out of 20
training scenarios. Might both of these specimens have
been mislabeled?
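A minimal sketch of kNN with leave-one-out cross-validation, assuming aligned sequences and the plain 0-1 (Hamming) distance. The function names and the toy specimens are hypothetical; ties between neighbors are broken by training order, which a real analysis would want to handle more carefully.

```python
from collections import Counter

def hamming(s, t):
    """Number of positions at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(s, t))

def knn_predict(train_seqs, train_labels, query, k=1):
    """Majority label among the k training sequences closest to query."""
    order = sorted(range(len(train_seqs)),
                   key=lambda i: hamming(train_seqs[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_errors(seqs, labels, k=1):
    """Indices of specimens misallocated when each is held out in turn."""
    wrong = []
    for i in range(len(seqs)):
        train_seqs = seqs[:i] + seqs[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if knn_predict(train_seqs, train_labels, seqs[i], k) != labels[i]:
            wrong.append(i)
    return wrong

# Toy specimens (hypothetical): the last one is labeled s2 but sits
# closer to the s1 sequences.
seqs = ["aaaaaa", "aaaaat", "tttttt", "ttttta", "aaatta"]
labels = ["s1", "s1", "s2", "s2", "s2"]
print(loo_errors(seqs, labels, k=1))
```

A specimen that is classified correctly only when it is itself in the training set, as described above, shows up here as an index returned by `loo_errors`.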
Classification
•A simple alternative is to use a centroid-based
classifier.
•Assign each new specimen to the species whose consensus
sequence it is closest to.
•We can match specimens to a consensus sequence
based on the 0-1 distance, or
•use the position weights of each base in the
consensus sequence.
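The consensus-sequence idea can be sketched as follows. The position weights are taken here to be the per-position base frequencies within each species, and the weighted similarity is the mean frequency of the query's bases; both are assumptions about what the slide intends, and all names are hypothetical.

```python
from collections import Counter

def position_weights(seqs):
    """Per-position base frequencies for one species (aligned sequences)."""
    n = len(seqs)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*seqs)
    ]

def consensus(weights):
    """Most frequent base at each position."""
    return "".join(max(w, key=w.get) for w in weights)

def weighted_similarity(weights, query):
    """Mean frequency, within the species, of the query's base at each position."""
    return sum(w.get(b, 0.0) for w, b in zip(weights, query)) / len(query)

def classify(profiles, query):
    """Return the best-scoring species and the score for every species."""
    scores = {sp: weighted_similarity(w, query) for sp, w in profiles.items()}
    return max(scores, key=scores.get), scores

# Toy training data (hypothetical).
training = {
    "s1": ["acgt", "acgt", "acga"],
    "s2": ["ttgt", "ttgt", "ttga"],
}
profiles = {sp: position_weights(seqs) for sp, seqs in training.items()}
predicted, scores = classify(profiles, "acga")
print(consensus(profiles["s1"]), predicted, scores)
```

Using frequencies rather than a hard 0-1 match against the consensus lets a base that is common but not dominant within a species still count partially in its favor.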
Classification
•On the Leptasterias data, the consensus sequence
(CS) based classifier makes 2 errors (LOO CV)
•“Vote of confidence” = weighted 0-1 distance to CS
Classification
•Relative voting (RV) strength illustrates that species 3 and 4
are difficult to separate, and the misallocated specimens are
associated with low relative votes
$$\mathrm{RV} = \frac{s_{(1)} - s_{(2)}}{s_{(2)}}$$
where $s_{(1)}$ is the largest weighted similarity over the species and $s_{(2)}$ is the second largest.
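Reading the RV expression above as (best similarity minus runner-up similarity) divided by the runner-up similarity, a small sketch of the computation looks like this. The function name and the example scores are hypothetical.

```python
def relative_vote(similarities):
    """(best - runner_up) / runner_up for a dict of species -> similarity."""
    ranked = sorted(similarities.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need similarities for at least two species")
    best, runner_up = ranked[0], ranked[1]
    if runner_up == 0:
        raise ValueError("runner-up similarity is zero; RV is undefined")
    return (best - runner_up) / runner_up

# Hypothetical weighted similarities for two specimens.
clear_case = {"sp1": 0.90, "sp2": 0.30}
close_case = {"sp1": 0.30, "sp3": 0.90, "sp4": 0.88}
print(relative_vote(clear_case), relative_vote(close_case))
```

A relative vote near zero means the two best-matching species are almost indistinguishable for that specimen, which is the situation described for species 3 and 4.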
Base calling is not perfect: errors are made, and there are
programs (e.g. Phred) that can analyze the ABI traces and
assign confidence measures to each base.
An interesting question is: can we obtain similar error
rates for species prediction and discovery with smaller
sample sizes if quality measures are incorporated into the
analysis?
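As a starting point for experimenting with this question: a Phred quality score Q corresponds to an estimated base-calling error probability p through Q = -10 log10(p). The sketch below converts scores to probabilities and uses them to down-weight mismatches at low-quality positions. The weighting scheme, weight = (1 - p_s)(1 - p_t), is purely an illustrative assumption, not a method from the slides.

```python
def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def quality_weighted_distance(s, t, qual_s, qual_t):
    """Mismatch fraction with each position weighted by call reliability.

    The weight at a position is the probability that both calls are
    correct, (1 - p_s) * (1 - p_t). Hypothetical scheme for illustration.
    """
    weighted_mismatch = 0.0
    total_weight = 0.0
    for a, b, qs, qt in zip(s, t, qual_s, qual_t):
        w = (1 - phred_to_error_prob(qs)) * (1 - phred_to_error_prob(qt))
        total_weight += w
        if a != b:
            weighted_mismatch += w
    return weighted_mismatch / total_weight

# Toy example: the only mismatch falls on a low-quality position.
s, t = "acgt", "acga"
quals = [40, 40, 40, 5]
weighted = quality_weighted_distance(s, t, quals, quals)
print(phred_to_error_prob(20), weighted)
```

In the toy example the unweighted 0-1 distance would be 0.25, while the quality-weighted distance is smaller because the single mismatch occurs where both calls are unreliable.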
Before the Discussion Session
•Try out some clustering techniques on the sample
data
•Number of uncalled bases?
•Length of sequences?
•Sample sizes – effect on clustering?
•Sample sizes – effect on classification?