MBVL_IM_BI_Oct2015 - Tetherless World Constellation

Intellectual Merit
There are five main objectives in our research plan that are applicable to Intellectual Merit: 1) developing
data access and computational infrastructure for the Marine Biodiversity Virtual Lab (MBVL); 2)
generating derived data products; 3) developing predictive biodiversity modeling infrastructure; 4)
producing traceable product workflows; and 5) developing a knowledge base for biodiversity indicators
and implementation of linked data standards. Here we provide additional technical detail about the
proposed approach for two significant computational challenges, one a component of Objective 2
(biodiversity data products) and the other of Objective 3 (predictive biodiversity models).
Objective 2 - The derived data products will include time-series of operational taxonomic unit (OTU) /
oligotype presence/absence data and phytoplankton species abundance data for biodiversity indicators
Computational challenge #1: Applying emerging and novel image analysis and classification techniques
to the challenge of characterizing taxa, community structure, and interactions
Emerging image analysis techniques such as convolutional neural networks (CNN) have demonstrated
significant improvements in image recognition tasks across a broad range of image types and use cases.
CNNs are a promising fit for IFCB image analysis because the large volume of hand-labeled IFCB
images (over 3 million) enables the development of suitably high-volume training sets. In addition, the
microfluidic architecture of the IFCB imaging process results in consistent observational conditions
(lighting, focus, and to some extent orientation) that mitigate the risk that classification accuracy will be
impaired by the large variety of phytoplankton morphology, lifecycle stages, and interactions. Initial work
to evaluate this technique (Orenstein et al. 2015) suggests that CNNs compare favorably to the existing
IFCB image classification approach. In addition, CNNs may enable new capabilities such as automated
identification of complex interactions such as predation and parasitic interactions that have already been
observed in IFCB data using manual approaches.
In the MBVL we will 1) implement the existing IFCB image classification pipeline, consisting of hand-coded image segmentation and feature extraction paired with a random forest classifier, and 2) adapt and
train CNNs to perform enhanced image recognition and classification. Both approaches will produce
taxonomically-resolved, high-volume time series data that can be used alongside other observational
variables for ecosystem modeling.
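As an illustration of how per-image classifier output becomes a time series, the following is a minimal sketch in Python. It assumes each classified image is available as a (timestamp, taxon label) pair; the function name `daily_taxon_counts`, the record layout, and the placeholder labels are hypothetical and do not describe the existing IFCB pipeline. Converting counts to abundance would also require the sampled volume, which the sketch ignores.

```python
from collections import Counter, defaultdict
from datetime import datetime


def daily_taxon_counts(records):
    """Aggregate (ISO timestamp, taxon label) pairs into per-day counts per taxon."""
    counts = defaultdict(Counter)
    for timestamp, taxon in records:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        counts[day][taxon] += 1
    # Sort by day so the result reads as a time series.
    return {day: dict(per_taxon) for day, per_taxon in sorted(counts.items())}


# Hypothetical classifier output: one (timestamp, label) pair per image.
records = [
    ("2015-07-01T08:15:00", "taxon_A"),
    ("2015-07-01T09:40:00", "taxon_A"),
    ("2015-07-01T11:05:00", "taxon_B"),
    ("2015-07-02T07:30:00", "taxon_A"),
]
series = daily_taxon_counts(records)
```

The same aggregation applies whether the labels come from the random forest classifier or from a CNN, which is why both approaches can feed the same downstream models.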
Some of the risks associated with re-engineering the image processing pipeline are familiar from previous
work. For example, detecting rare taxa for which few training examples exist is challenging for any
supervised method, but this can be mitigated with judicious training set design, e.g., targeted boosting of
rare taxa examples (Orenstein et al. 2015). Other risks are more difficult to quantify, such as CNNs being
potentially unwieldy and requiring attention to details such as image pre-processing steps and training
procedures in order to avoid commonly-encountered pitfalls in applying them to large-scale classification
problems. Our strategy for mitigating these potential risks is for MBVL to support the existing IFCB
image classification pipeline, which is already producing taxonomic time series of usable quality. This
will allow us to proceed with model development immediately while simultaneously working on
developing the new CNN-based image classification approach.
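One possible reading of the rare-taxa mitigation mentioned above is random oversampling of under-represented classes in the training set. The sketch below shows that idea only; it is an assumption on our part and not necessarily the procedure used by Orenstein et al. (2015). The function name `oversample_rare`, the `min_per_class` parameter, and the example identifiers are hypothetical.

```python
import random
from collections import defaultdict


def oversample_rare(examples, min_per_class, seed=0):
    """Duplicate examples of rare classes until each class has at least min_per_class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for image_id, label in examples:
        by_label[label].append((image_id, label))

    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        shortfall = min_per_class - len(items)
        if shortfall > 0:
            # Resample with replacement from the few examples that exist.
            balanced.extend(rng.choice(items) for _ in range(shortfall))
    rng.shuffle(balanced)
    return balanced


# Hypothetical training set: five images of a common taxon, one of a rare taxon.
examples = [("img%d" % i, "common") for i in range(5)] + [("img_r", "rare")]
balanced = oversample_rare(examples, min_per_class=3)
```

Duplicated images add no new information, so in practice oversampling of this kind is often combined with image augmentation; that step is outside the scope of this sketch.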
Objective 3 – Enable development of diagnostic and predictive models and their integration with
biodiversity data products, with infrastructure design focused on specific science-driven use cases
Computational challenge #2: Applying a range of promising model types to biodiversity time series to
enable understanding and prediction of future change, noting that the two proposed use cases are very
different from a modelling perspective: empirical (more inductive) vs. mechanistic (more deductive).
Several innovative computational analysis techniques have recently demonstrated early successes in
modeling aspects of structure and function in ecosystems, and the MBVL will enable their application to
emerging hypotheses from our existing research. Novel aspects include applications in the context of
rapidly changing and complex marine systems (contrasted with investigations of a spatially-resolved
human microbiome snapshot, for instance), requirements to integrate heterogeneous products / indicators
to adequately characterize marine biodiversity, and focus on addressing hypotheses and making
predictions related to temporal change in systems where time lags and temporal succession are important
(i.e., characterization and prediction of a non-static system of interacting components).
Within the MBVL infrastructure, we propose to incorporate a set of analytic approaches that fall into two
general categories: 1) relatively established model types that range from statistical (e.g., generalized linear
models, random forest, and other regression models) to mechanistic (e.g., reaction-diffusion, population
dynamic and other mechanistic models traditional to ecology) and 2) emerging modeling strategies for
complex systems of interacting components, including such approaches as extended local similarity
analysis (eLSA; Li et al. 201) and ensemble methods based on multiple similarity metrics to produce co-occurrence/co-exclusion networks (Faust et al. 2012; Lima-Mendez et al. 2015), and mixture models. A
distinct advantage of these emerging approaches is that they are amenable to heterogeneous inputs (e.g.,
OTUs, image-based taxa, environmental parameters).
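To make the notion of time-lagged association concrete, the sketch below scans a range of lags and reports the one with the strongest Pearson correlation between two series. This is a deliberately simplified stand-in and not the eLSA algorithm itself, which uses a local similarity score with permutation-based or approximate significance testing. The function names and the example values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(x, y, max_lag):
    """Return (lag, r) with the largest |r|, comparing x[t] with y[t + lag]."""
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[:len(x) - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[:len(y) + lag]
        if len(xs) < 3:
            continue  # too few overlapping points to be meaningful
        r = pearson(xs, ys)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Hypothetical series: the second follows the first with a two-step delay.
host = [1, 3, 7, 4, 2, 1, 1, 2]
parasite = [0, 0, 1, 3, 7, 4, 2, 1]
lag, r = best_lag(host, parasite, max_lag=3)
```

Because the inputs are plain numeric sequences, the same scan applies equally to OTU frequencies, image-based taxon counts, or environmental parameters.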
A notable risk in achieving this computational challenge is associated with the dependence on successful
completion of Objective 2 and computational challenge #1. One way that we will mitigate this risk is by
handling both established and emerging analytics in challenge #1. If emerging approaches for
computational challenge #1 prove too ambitious to incorporate, then a less capable but still useful
outcome from Objective 2 can be met through established analytic techniques that are low risk to embed
in the MBVL infrastructure in such a way that they will provide reproducible, traceable, provenance-aware, “living” data/analysis products that facilitate challenge #2. Additional risk in achieving Objective
3 and the associated computational challenges arises because analytics to
address time series (lags, succession, etc.) do not yet exist for some approaches (e.g., Faust et al. 2012).
Validation of more complex models and assessment of over-fitting become difficult and may not be
robust enough for model selection. Certain emerging approaches (e.g., eLSA) already have time series
analytics developed, however, and those can be implemented preferentially if necessary to
mitigate risk.
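One common way to assess over-fitting in time series models is rolling-origin (forward-chaining) validation, in which a model is always trained on earlier observations and evaluated on later ones. The sketch below only generates the index splits; it is an illustrative assumption about how validation could be organized, not a committed design, and the function name and parameters are hypothetical.

```python
def rolling_origin_splits(n_points, min_train, horizon):
    """Return (train_indices, test_indices) pairs that never test on the past."""
    splits = []
    end = min_train
    while end + horizon <= n_points:
        train = list(range(end))                 # everything observed so far
        test = list(range(end, end + horizon))   # the next block of time points
        splits.append((train, test))
        end += horizon
    return splits


# Hypothetical series of 10 time points, at least 4 for training, 2 per test block.
splits = rolling_origin_splits(n_points=10, min_train=4, horizon=2)
```

Comparing error on the training indices with error on the test indices across splits gives a rough indication of over-fitting that respects temporal ordering.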
Use case example requiring computational challenges 1 and 2 to be met
The role of parasitic interactions in regulating marine plankton diversity and community structure appears
to be greater than previously understood (e.g., Lima-Mendez et al. 2015, Peacock et al. 2014). The
implications of these kinds of structuring ecological interactions for biodiversity and ecosystem change
cannot be deciphered with a single type of observational data or modeling approach. For instance, recent
analyses of IFCB data suggest that phytoplankton-parasite interactions are important on the New England
shelf and are strongly influenced by rapidly-changing ocean temperatures, but approaches involving only
IFCB data cannot resolve critical aspects of this multi-faceted system, such as identification and detection of
pre-infection parasite stages. Genetic sequence analysis, on the other hand, provides information about
taxonomic identity and occurrence of parasites regardless of whether they are actively infecting their host.
MBVL will enable researchers to use the array of computational approaches described above to integrate
time series of host (from IFCB data products), parasites (from VAMPS data products) and environmental
factors, to evaluate patterns of co-occurrence among these interacting species (including time lags and
other successional patterns), and construct and test models that further understanding of biodiversity
dynamics and enable predictions about change in community structure and ecosystem function as rapid
climate change continues to impact northeast US coastal waters.
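A first practical step in integrating host, parasite, and environmental time series is aligning them on the dates they share. The following is a minimal sketch, assuming each series is a mapping from ISO date string to value; the function name `align_on_common_dates` and all example values are hypothetical and do not reflect actual IFCB or VAMPS data formats.

```python
def align_on_common_dates(**series):
    """Keep only dates present in every series; return (dates, aligned values by name)."""
    if not series:
        return [], {}
    common = set.intersection(*(set(values) for values in series.values()))
    dates = sorted(common)  # ISO date strings sort chronologically
    aligned = {name: [values[d] for d in dates] for name, values in series.items()}
    return dates, aligned


# Hypothetical inputs keyed by sampling date.
host = {"2015-07-01": 12, "2015-07-02": 15, "2015-07-03": 9}
parasite = {"2015-07-02": 3, "2015-07-03": 5, "2015-07-04": 4}
temperature = {"2015-07-01": 18.2, "2015-07-02": 18.9, "2015-07-03": 19.4}

dates, aligned = align_on_common_dates(
    host=host, parasite=parasite, temperature=temperature
)
```

An inner join like this discards dates missing from any source; interpolation or coarser time bins may be preferable when sampling schedules differ, and that choice would depend on the use case.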
Common approaches for data types that differ in taxonomic, genotypic, phenotypic, spatial and
temporal properties
In designing our research plan, we specifically selected two types of observational data that differ widely
(gene sequences vs. cell images), yet each contains fundamental information about diversity. A primary
challenge in addressing the need for biodiversity indicators and models is the need to access and interpret
heterogeneous data types. The Virtual Laboratory will accommodate critical differences in the work flows
required to produce appropriate diversity data products (e.g., time series of OTU frequency or time series
of image-based biomass grouped by morphological taxa). At the same time, it will support derivation of
integrated products with common basis; for example, once the derived data from VAMPS represents
species-level richness and the derived data from IFCB represents species-level richness, then we can
determine a combined species-level richness for the base of the food web as a composite indicator.
Importantly, as addressed in the computational challenges discussed above, the MBVL modelling
framework will include analytic approaches that accommodate disparate inputs, even going beyond
biological diversity to include environmental or habitat properties. A strength of the common MBVL
infrastructure elements (generalizable product workflow and traceability methodologies and modelling) is
that it will provide a scalable solution for future expansion to other biodiversity data types (e.g.,
zooplankton taxa, fishery stock assessments, observer survey results for marine mammals and birds).
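The combined species-level richness described above can be sketched as a set union per time point. This assumes species names from the two sources have already been harmonized to a common taxonomy so that the same species is not counted twice; the function name `combined_richness` and the placeholder species identifiers are hypothetical.

```python
def combined_richness(vamps_species, ifcb_species):
    """Per-date count of distinct species observed by either source."""
    dates = sorted(set(vamps_species) | set(ifcb_species))
    return {
        d: len(set(vamps_species.get(d, ())) | set(ifcb_species.get(d, ())))
        for d in dates
    }


# Hypothetical species lists per sampling date from each data source.
vamps_species = {"2015-07-01": {"sp_A", "sp_B"}, "2015-07-02": {"sp_A"}}
ifcb_species = {"2015-07-01": {"sp_B", "sp_C"}}

richness = combined_richness(vamps_species, ifcb_species)
```

Taking the union rather than the sum avoids double counting species detected by both sequencing and imaging.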
Faust, K., J. F. Sathirapongsasuti, J. Izard, N. Segata, D. Gevers, J. Raes, C. Huttenhower. 2012.
Microbial co-occurrence relationships in the human microbiome. PLOS Comput. Biol. 8, e1002606
Xia, L.C., D. Ai, J. Cram, J.A. Fuhrman, F. Sun. 2013. Efficient statistical significance approximation for
local association analysis of high-throughput time series data. Bioinformatics. 29: 230-237
Xia, L.C., J.A. Steele, J.A. Cram, Z.G. Cardon, S.L. Simmons, J.J. Vallino, J.A. Fuhrman, F. Sun. 2011.
Extended local similarity analysis (eLSA) of microbial community and other time series data with
replicates. BMC Systems Biology. 5:S15
Lima-Mendez et al. 2015. Determinants of community structure in the global plankton interactome.
Science. 348: 6237. 1262073
Orenstein, E.C., O. Beijbom, E.E. Peacock, H.M. Sosik. 2015. WHOI-Plankton: A large scale fine grained
visual recognition benchmark data set for plankton classification. Proc. IEEE Computer Society
Conference on Computer Vision and Pattern Recognition. 2 pp.
Broader Impacts, including outreach and inclusion of under-represented groups, are integral to
consideration of funding for all NSF proposals.
There are five main objectives in our research plan that are applicable to Broader Impacts: 1) developing
data access and computational infrastructure for the Marine Biodiversity Virtual Lab (MBVL); 2)
developing predictive biodiversity modeling infrastructure; 3) producing traceable product workflows; 4)
developing a knowledge base for biodiversity indicators and implementation of linked data standards; and
5) the explicit outreach and broader impacts of the MBVL.
Herein we indicate how the first four objectives target the explicit outreach and impacts we propose. All
investigators will foster engagement of under-represented minorities in ocean science, bioinformatics,
and computer science by sponsoring research projects of undergraduates engaged in the Woods Hole
Partnership Education Program (PEP; http://www.woodsholediversity.org/pep/about.html) and the
Biological Discovery in Woods Hole REU program (http://www.mbl.edu/education/othereducational/reu_details/). PEP, a summer science intern program launched in 2009 and designed to
promote diversity, targets undergraduates majoring in the natural sciences, engineering, or mathematics.
Students in the program take course work, carry out 6-10 week research projects, and have various other
opportunities for engagement in science (seminars, field trips, at-sea training, etc.). The REU program
targets underrepresented minorities and students from small colleges lacking research opportunities in
biology. In addition to a 10-week laboratory research experience, students participate in field trips and
attend weekly course meetings, seminars and luncheons that explore a wide range of topics (e.g., graduate
school application, ethics, career paths) to encourage the students to prepare for and pursue a career in
biological sciences. The proposed CyberSEES research will attract students for projects at the intersection
of ocean science, bioinformatics, and computer science, and all co-PIs will solicit and review candidate
students for this opportunity. In addition to acting as research advisors for individual student projects, PIs
will provide seminars to the group and the greater student population, further opening up outreach and
influencing a broader range of students. Through a combination of the WHOI Summer Student
Fellowship, PEP, Woods Hole REU programs, and Rensselaer High School programs, we will sponsor at
least three undergraduate research interns each summer of this project, with a recruitment focus and
preference on under-represented minorities, drawn from both the Woods Hole and Rensselaer student populations.
The Rensselaer Research Experience for High School Students has had an increasingly strong focus on
computer science in science application settings and on minorities and under-served communities. This
summer (2015; not known at the time of proposal submission) 4 students are resident in the Tetherless
World Constellation (3 are minorities). Two students are working on the Jefferson Project
(http://news.rpi.edu/content/2013/06/27/new-project-aims-make-new-york’s-lake-george-“smartest-lake”world), and two are working on computational aspects of the Deep Carbon Observatory
(http://www.deepcarbon.net). Both projects exemplify the mix of environmental and computer science in
this CyberSEES proposal. The MBVL will be a very attractive draw for the outstanding students who
apply, and the PI will dedicate two student positions per year to the MBVL (these are paid for externally)
and fund interactions (travel/accommodation) with the Woods Hole-based students.
In each year of the project, WHOI PIs and RPI PI (Fox; WHOI adjunct scientist) will participate in one of
two specific annual lecture series (for each year of the project, i.e., 6 in total) aimed at making cutting-edge
science accessible and appealing to broader audiences:
Science Made Public Lecture Series, an annual summertime series of publicly accessible talks by
scientists and engineers sponsored by the WHOI Exhibit Center; designed for a lay audience, the series is
open to all and overlaps with the heavy tourist season in Woods Hole Village. Our project will provide an
opportunity to make computer science accessible through interactive engagement with web-based access
to information about intriguing marine organisms.
The Summer Lecture Series, designed for undergraduates participating in research internships in
Woods Hole; the series provides an introduction to the breadth of oceanographic research, with talks
aimed to be as interactive as possible in order to motivate students to participate and ask questions. Our
project will provide a novel element emphasizing the challenges and opportunities of “big data” and
computational innovation in marine biodiversity research.
In regard to curriculum and coursework delivered by the investigator team, we re-iterate the following.
Additional links between this project and undergraduate and graduate education will include
incorporation of the MBVL tools into the Marine Biodiversity and Conservation SEA Semester
(http://www.sea.edu/voyages/caribbean_latespring_studyabroadprogram) at the Sea Education
Association in Woods Hole, the MBL Semester in Environmental Science (co-PI Mark Welch), and into
MIT/WHOI Joint Program courses (co-PIs Sosik and Beaulieu). Because this curriculum has been delivered
over many years, it provides concrete opportunities for participation in workshops, as well as student mentoring in direct
use of the tools we will be developing. Models and workflows developed by the project will be presented by
Sosik and/or Beaulieu in the MBL summer course Strategies and Techniques for Analysis of Microbial
Population Structures (co-directed by Mark Welch), which trains ~60 graduate students, postdocs, and
independent investigators every year. Similarly, Rensselaer’s offerings in Data Science and Data
Analytics (both taught by PI Fox with frequent input from co-I Beaulieu) will feature the MBVL models
and workflows as a central resource for student projects and engagement.