Introductory Description for BestBet 2.0

The Matlab© program above is called BestBet because the title is both
descriptive and memorable. The core idea of BestBet follows the reasoning of
Expectation Maximization (EM) [1, 2]: it iteratively pursues the most complete
possible estimate of the proportions of the organisms present in the samples.
This algorithm works well without additional data, such as a database of
constituent organisms together with their associated restriction fragment
lengths. In cases where such a database is available, however, the program is
even more effective.
To see the core idea, suppose that we look at the data and see that a specific
cutoff, x1, has a plurality over all other restriction fragment lengths for the
first enzyme. Similarly, suppose that x2 has a plurality for the second enzyme,
and that x3 is the most abundant fragment length for the third enzyme.
Now, considering this information, which triple of fragment lengths would you
pick as most likely to give this information, that is, to have the largest
expectation of creating this data set?
Obviously, the best bet would be to assume that some of the samples contain an
organism with restriction cutoffs equal to (x1, x2, x3). We do not have a good
way to estimate the proportion of this organism immediately, so we construct a
learning model in the spirit of EM.
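The choice of the best-bet triple described above can be illustrated with a short Python sketch. It is not the BestBet Matlab code: the function name `best_bet_triple`, the dictionary layout (one mapping from fragment length to pooled relative abundance per enzyme), and the numbers are all hypothetical and chosen only for illustration.

```python
def best_bet_triple(profiles):
    """Return (x1, x2, x3): the plurality fragment length for each enzyme.

    `profiles` is an assumed layout: a list of three dicts, one per enzyme,
    each mapping fragment length -> pooled relative abundance.
    """
    return tuple(max(profile, key=profile.get) for profile in profiles)


# Made-up abundances for three enzymes (illustrative only).
profiles = [
    {112: 0.40, 230: 0.35, 305: 0.25},
    {90: 0.20, 150: 0.55, 410: 0.25},
    {75: 0.30, 188: 0.10, 260: 0.60},
]
triple = best_bet_triple(profiles)
print(triple)
```

With these made-up profiles, the plurality peaks are 112, 150, and 260, so that triple is the first bet.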
We make the minimal assumption that at least some small amount of an organism
with fragment lengths (x1, x2, x3) exists in some of the samples. We then
remove a small amount of this organism from the data; that is, we remove a
fixed small percentage of each of the fragment lengths x1, x2, and x3 from each
sample in which it occurs.
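The removal step can be sketched in Python as follows. This is only one reading of the description, not the BestBet Matlab code. Several choices here are assumptions that the text does not state: each enzyme profile is normalized to sum to 1, the "fixed small percentage" is a fixed amount `step` of that total, the removed amount is credited to the triple as its estimated proportion, and the loop stops when the current triple can no longer be removed from any sample. All function names are hypothetical.

```python
from collections import defaultdict


def pooled_plurality(samples):
    """Most abundant fragment length per enzyme, pooled over all samples."""
    triple = []
    for enzyme in range(3):
        totals = defaultdict(float)
        for sample in samples:
            for length, amount in sample[enzyme].items():
                totals[length] += amount
        triple.append(max(totals, key=totals.get))
    return tuple(triple)


def remove_step(samples, triple, step, estimates):
    """Remove up to `step` of the triple from each sample containing all three peaks."""
    removed = 0.0
    for i, sample in enumerate(samples):
        available = min(sample[e].get(x, 0.0) for e, x in enumerate(triple))
        amount = min(step, available)  # never remove more than is present
        if amount > 0.0:
            for e, x in enumerate(triple):
                sample[e][x] -= amount
            estimates[i][triple] += amount  # assumed bookkeeping of proportions
            removed += amount
    return removed


def best_bet(samples, step=0.01, max_iter=100_000):
    """Repeat: pick the plurality triple, remove a little, credit the estimate."""
    work = [[dict(profile) for profile in sample] for sample in samples]  # copy input
    estimates = [defaultdict(float) for _ in work]
    for _ in range(max_iter):
        triple = pooled_plurality(work)
        if remove_step(work, triple, step, estimates) == 0.0:
            break  # assumed stopping rule: nothing left to remove for this triple
    return [dict(e) for e in estimates]


# One synthetic sample mixing two made-up organisms at 70% and 30%.
sample = [
    {100: 0.7, 110: 0.3},
    {200: 0.7, 210: 0.3},
    {300: 0.7, 310: 0.3},
]
estimates = best_bet([sample])
```

On this synthetic sample, the loop credits approximately 0.7 to the triple (100, 200, 300) and approximately 0.3 to (110, 210, 310), up to floating-point rounding.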
Results will be reported in detail in the methods paper that supports and
dovetails with Treusch et al. [3], but in the case of samples from the BATS
location taken over roughly half a decade, BestBet was quite successful at
picking out the top 3 to 5 organisms from each sample.
Finally, to apply BestBet to the case of a [fairly] complete database of
organisms and their triples of fragment lengths, run the program as before, but
allow only triples (x1, x2, x3) from the database to be considered.
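A minimal Python sketch of the database-restricted choice follows. The text does not say how candidate triples from the database are ranked, so the scoring rule here is an assumption: each triple is scored by its weakest peak, which is the most that could be removed in one step. The names `best_database_triple` and `support` and all numbers are hypothetical, and this is not the BestBet Matlab code.

```python
def best_database_triple(profiles, database):
    """Pick the database triple with the strongest support in the data.

    `profiles`: three dicts (one per enzyme), fragment length -> abundance.
    `database`: iterable of known (x1, x2, x3) triples.
    Returns None if no database triple has all three peaks present.
    """
    def support(triple):
        # Assumed score: the smallest of the three peak abundances.
        return min(p.get(x, 0.0) for p, x in zip(profiles, triple))

    best = max(database, key=support)
    return best if support(best) > 0.0 else None


# Made-up abundances and a made-up database of known organisms.
profiles = [
    {112: 0.40, 230: 0.35, 305: 0.25},
    {90: 0.20, 150: 0.55, 410: 0.25},
    {75: 0.30, 188: 0.10, 260: 0.60},
]
database = [(230, 90, 75), (305, 410, 188)]
choice = best_database_triple(profiles, database)
print(choice)
```

Here the unrestricted plurality triple (112, 150, 260) is not in the made-up database, so the choice falls to (230, 90, 75), the database entry whose weakest peak is largest.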
1. Arthur Dempster, Nan Laird, and Donald Rubin. "Maximum likelihood from
incomplete data via the EM algorithm". Journal of the Royal Statistical Society,
Series B, 39(1):1–38, 1977.
2. Dinov, ID. "Expectation Maximization and Mixture Modeling Tutorial".
California Digital Library, Statistics Online Computational Resource, Paper
EM_MM, http://repositories.cdlib.org/socr/EM_MM, December 9, 2008.
3. Alexander H. Treusch, Kevin L. Vergin, Liam A. Finlay, Michael G. Donatz,
Robert M. Burton, Craig A. Carlson, and Stephen J. Giovannoni. "Seasonality and
Vertical Structure of an Ocean Gyre". Submitted, January 20, ISME Journal.