1. Intellectual Merit

There are five main objectives in our research plan that are applicable to Intellectual Merit: 1) developing data access and computational infrastructure for the Marine Biodiversity Virtual Lab (MBVL); 2) generating derived data products; 3) developing predictive biodiversity modeling infrastructure; 4) producing traceable product workflows; and 5) developing a knowledge base for biodiversity indicators and implementation of linked data standards. Here we provide additional technical detail about the proposed approach for two significant computational challenges, one a component of Objective 2 (biodiversity data products) and the other of Objective 3 (predictive biodiversity models).

Objective 2 - The derived data products will include time series of operational taxonomic unit (OTU) / oligotype presence/absence data and phytoplankton species abundance data for biodiversity indicators

Computational challenge #1: Applying emerging and novel image analysis and classification techniques to the challenge of characterizing taxa, community structure, and interactions

Emerging image analysis techniques such as convolutional neural networks (CNNs) have demonstrated significant improvements in image recognition tasks across a broad range of image types and use cases. CNNs are a promising fit for IFCB image analysis because the large volume of hand-labeled IFCB images (over 3 million) enables the development of suitably high-volume training sets. In addition, the microfluidic architecture of the IFCB imaging process results in consistent observational conditions (lighting, focus, and to some extent orientation) that mitigate the risk that classification accuracy will be impaired by the large variety of phytoplankton morphology, lifecycle stages, and interactions. Initial work to evaluate this technique (Orenstein et al. 2015) suggests that CNNs compare favorably to the existing IFCB image classification approach.
In addition, CNNs may enable new capabilities such as automated identification of complex interactions, including predation and parasitism, that have already been observed in IFCB data using manual approaches. In the MBVL we will 1) implement the existing IFCB image classification pipeline, consisting of hand-coded image segmentation and feature extraction paired with a random forest classifier, and 2) adapt and train CNNs to perform enhanced image recognition and classification. Both approaches will produce taxonomically resolved, high-volume time series data that can be used alongside other observational variables for ecosystem modeling.

Some of the risks associated with re-engineering the image processing pipeline are familiar from previous work. For example, detecting rare taxa for which few training examples exist is challenging for any supervised method, and this can be mitigated with judicious training set design, e.g., targeted boosting of rare taxa examples (Orenstein et al. 2015). Other risks are more difficult to quantify: CNNs can be unwieldy and require attention to details such as image pre-processing steps and training procedures in order to avoid commonly encountered pitfalls in applying them to large-scale classification problems. Our strategy for mitigating these potential risks is for MBVL to support the existing IFCB image classification pipeline, which is already producing taxonomic time series of usable quality. This will allow us to proceed with model development immediately while simultaneously developing the new CNN-based image classification approach.
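As an illustration of the training set design mentioned above, one simple reading of "targeted boosting of rare taxa examples" is to resample under-represented classes until each reaches a minimum count. The sketch below assumes that reading; it is not necessarily the procedure used by Orenstein et al. (2015), and the function name, the `min_per_class` parameter, and the taxon labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample_rare_taxa(examples, min_per_class, seed=0):
    """Return a new training list in which every taxon has at least
    `min_per_class` examples, resampling rare taxa with replacement.

    `examples` is a list of (image_id, taxon_label) pairs.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed keeps the resampling reproducible

    # Group image identifiers by their taxon label.
    by_label = defaultdict(list)
    for image_id, label in examples:
        by_label[label].append(image_id)

    # Start from the original examples and top up any rare taxon.
    balanced = list(examples)
    for label, ids in by_label.items():
        shortfall = min_per_class - len(ids)
        if shortfall > 0:
            balanced.extend((rng.choice(ids), label) for _ in range(shortfall))

    rng.shuffle(balanced)
    return balanced


# Made-up example: one abundant taxon and one rare taxon.
training = [(f"img{i}", "diatom") for i in range(50)]
training += [("rare1", "parasite"), ("rare2", "parasite")]

balanced = oversample_rare_taxa(training, min_per_class=10)
counts = Counter(label for _, label in balanced)
print(counts["diatom"], counts["parasite"])  # 50 10
```

In practice, repeated images would usually be paired with augmentation (e.g., rotations), and classifier performance would still be evaluated on data that retain the natural class frequencies.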
Objective 3 - Enable development of diagnostic and predictive models and their integration with biodiversity data products, with infrastructure design focused on specific science-driven use cases

Computational challenge #2: Applying a range of promising model types to biodiversity time series to enable understanding and prediction of future change, noting that the two proposed use cases are very different from a modelling perspective: empirical (more inductive) vs. mechanistic (more deductive).

Several innovative computational analysis techniques have recently demonstrated early successes in modeling aspects of structure and function in ecosystems, and the MBVL will enable their application to emerging hypotheses from our existing research. Novel aspects include applications in the context of rapidly changing and complex marine systems (contrasted with investigations of a spatially resolved human microbiome snapshot, for instance), requirements to integrate heterogeneous products / indicators to adequately characterize marine biodiversity, and a focus on addressing hypotheses and making predictions related to temporal change in systems where time lags and temporal succession are important (i.e., characterization and prediction of a non-static system of interacting components).

Within the MBVL infrastructure, we propose to incorporate a set of analytic approaches that fall into two general categories: 1) relatively established model types that range from statistical (e.g., generalized linear models, random forest, and other regression models) to mechanistic (e.g., reaction-diffusion, population dynamic and other mechanistic models traditional to ecology) and 2) emerging modeling strategies for complex systems of interacting components, including such approaches as extended local similarity analysis (eLSA; Li et al. 201) and ensemble methods based on multiple similarity metrics to produce co-occurrence/co-exclusion networks (Faust et al. 2012; Lima-Mendez et al.
2015), and mixture models. A distinct advantage of these emerging approaches is that they are amenable to heterogeneous inputs (e.g., OTUs, image-based taxa, environmental parameters).

A notable risk in meeting this computational challenge is its dependence on successful completion of Objective 2 and computational challenge #1. One way that we will mitigate this risk is by handling both established and emerging analytics in challenge #1. If emerging approaches for computational challenge #1 prove too ambitious to incorporate, then a less capable but still useful outcome from Objective 2 can be met through established analytic techniques that are low risk to embed in the MBVL infrastructure in such a way that they will provide reproducible, traceable, provenance-aware, “living” data/analysis products that facilitate challenge #2. Additional risks in achieving Objective 3 and the associated computational challenges stem from the fact that analytics to address time series (lags, succession, etc.) do not yet exist for some approaches (e.g., Faust et al. 2012). Validation of more complex models and assessment of over-fitting become difficult and may not be robust enough for model selection. However, certain emerging approaches (e.g., eLSA) already have time series analytics developed, and those can be implemented preferentially if necessary to mitigate risk.

Use case example requiring computational challenges 1 and 2 to be met

The role of parasitic interactions in regulating marine plankton diversity and community structure appears to be greater than previously understood (e.g., Lima-Mendez et al. 2015, Peacock et al. 2014). The implications of these kinds of structuring ecological interactions for biodiversity and ecosystem change cannot be deciphered with a single type of observational data or modeling approach.
For instance, recent analyses of IFCB data suggest that phytoplankton-parasite interactions are important on the New England shelf and are strongly influenced by rapidly changing ocean temperatures, but approaches involving only IFCB data cannot resolve critical aspects of this multi-faceted system, such as identification and detection of pre-infection parasite stages. Genetic sequence analysis, on the other hand, provides information about taxonomic identity and occurrence of parasites regardless of whether they are actively infecting their host. MBVL will enable researchers to use the array of computational approaches described above to integrate time series of hosts (from IFCB data products), parasites (from VAMPS data products), and environmental factors; to evaluate patterns of co-occurrence among these interacting species (including time lags and other successional patterns); and to construct and test models that further understanding of biodiversity dynamics and enable predictions about change in community structure and ecosystem function as rapid climate change continues to impact northeast US coastal waters.

Common approaches for data types that differ in taxonomic, genotypic, phenotypic, spatial and temporal properties

In designing our research plan, we specifically selected two types of observational data that differ widely (gene sequences vs. cell images), yet each contains fundamental information about diversity. A primary challenge in addressing the need for biodiversity indicators and models is the need to access and interpret heterogeneous data types. The Virtual Laboratory will accommodate critical differences in the workflows required to produce appropriate diversity data products (e.g., time series of OTU frequency or time series of image-based biomass grouped by morphological taxa).
At the same time, it will support derivation of integrated products with a common basis; for example, once the derived data from VAMPS and the derived data from IFCB each represent species-level richness, we can determine a combined species-level richness for the base of the food web as a composite indicator. Importantly, as addressed in the computational challenges discussed above, the MBVL modelling framework will include analytic approaches that accommodate disparate inputs, even going beyond biological diversity to include environmental or habitat properties. A strength of the common MBVL infrastructure elements (generalizable product workflow and traceability methodologies and modelling) is that they will provide a scalable solution for future expansion to other biodiversity data types (e.g., zooplankton taxa, fishery stock assessments, observer survey results for marine mammals and birds).

References

Faust, K., J. F. Sathirapongsasuti, J. Izard, N. Segata, D. Gevers, J. Raes, C. Huttenhower. 2012. Microbial co-occurrence relationships in the human microbiome. PLOS Comput. Biol. 8, e1002606

Xia, L.C., D. Ai, J. Cram, J.A. Fuhrman, F. Sun. 2013. Efficient statistical significance approximation for local association analysis of high-throughput time series data. Bioinformatics. 29: 230-237

Xia, L.C., J.A. Steele, J.A. Cram, Z.G. Cardon, S.L. Simmons, J.J. Vallino, J.A. Fuhrman, F. Sun. 2011. Extended local similarity analysis (eLSA) of microbial community and other time series data with replicates. BMC Systems Biology. 5:S15

Lima-Mendez et al. 2015. Determinants of community structure in the global plankton interactome. Science. 348: 6237. 1262073

Orenstein, E.C., O. Beijbom, E.E. Peacock, H.M. Sosik. 2015. WHOI-Plankton: A large scale fine grained visual recognition benchmark data set for plankton classification. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2 pp.

2. Broader Impacts

Broader impacts, including outreach and inclusion of under-represented groups, are integral to consideration of funding for all NSF proposals. There are five main objectives in our research plan that are applicable to Broader Impacts: 1) developing data access and computational infrastructure for the Marine Biodiversity Virtual Lab (MBVL); 2) developing predictive biodiversity modeling infrastructure; 3) producing traceable product workflows; 4) developing a knowledge base for biodiversity indicators and implementation of linked data standards; and 5) the explicit outreach and broader impacts of the MBVL. Herein we indicate how the first four objectives target the explicit outreach and impacts we propose.

All investigators will foster engagement of under-represented minorities in ocean science, bioinformatics, and computer science by sponsoring research projects of undergraduates engaged in the Woods Hole Partnership Education Program (PEP; http://www.woodsholediversity.org/pep/about.html) and the Biological Discovery in Woods Hole REU program (http://www.mbl.edu/education/othereducational/reu_details/). PEP, a summer science intern program launched in 2009 and designed to promote diversity, targets undergraduates majoring in the natural sciences, engineering, or mathematics. Students in the program take course work, carry out 6-10 week research projects, and have various other opportunities for engagement in science (seminars, field trips, at-sea training, etc.). The REU program targets under-represented minorities and students from small colleges lacking research opportunities in biology. In addition to a 10-week laboratory research experience, students participate in field trips and attend weekly course meetings, seminars and luncheons that explore a wide range of topics (e.g., graduate school application, ethics, career paths) to encourage the students to prepare for and pursue a career in biological sciences.
The proposed CyberSEES research will attract students for projects at the intersection of ocean science, bioinformatics, and computer science, and all co-PIs will solicit and review candidate students for this opportunity. In addition to acting as research advisors for individual student projects, PIs will provide seminars to the group and the greater student population, extending outreach to a broader range of students. Through a combination of the WHOI Summer Student Fellowship, PEP, Woods Hole REU programs, and Rensselaer High School programs, we will sponsor at least three undergraduate research interns each summer of this project, with a recruitment focus on and preference for under-represented minorities, drawn from both the Woods Hole and Rensselaer student candidates.

The Rensselaer Research Experience for High School Students has had an increasingly strong focus on computer science in science application settings and on minorities and under-served communities. This summer (2015; not known at the time of proposal submission) four students are resident in the Tetherless World Constellation (three are minorities). Two students are working on the Jefferson Project (http://news.rpi.edu/content/2013/06/27/new-project-aims-make-new-york’s-lake-george-“smartest-lake”world), and two are working on computational aspects of the Deep Carbon Observatory (http://www.deepcarbon.net). Both projects exemplify the mix of environmental and computer science in this CyberSEES proposal. The MBVL will be a very attractive draw for the outstanding students who apply, and the PI will dedicate two student positions per year to the MBVL (these are paid for externally) and fund interactions (travel/accommodation) with the Woods Hole-based students.

In each year of the project, the WHOI PIs and the RPI PI (Fox; WHOI adjunct scientist) will participate in one of two specific annual lecture series (for each year of the project, i.e.
6 in total) aimed at making cutting-edge science accessible and appealing to broader audiences:

Science Made Public Lecture Series, an annual summertime series of publicly accessible talks by scientists and engineers sponsored by the WHOI Exhibit Center; designed for a lay audience, the series is open to all and overlaps with the heavy tourist season in Woods Hole Village. Our project will provide an opportunity to make computer science accessible through interactive engagement with web-based access to information about intriguing marine organisms.

The Summer Lecture Series, designed for undergraduates participating in research internships in Woods Hole; the series provides an introduction to the breadth of oceanographic research, with talks aimed to be as interactive as possible in order to motivate students to participate and ask questions. Our project will provide a novel element emphasizing the challenges and opportunities of “big data” and computational innovation in marine biodiversity research.

In regard to curriculum and coursework delivered by the investigator team, we reiterate the following. Additional links between this project and undergraduate and graduate education will include incorporation of the MBVL tools into the Marine Biodiversity and Conservation SEA Semester (http://www.sea.edu/voyages/caribbean_latespring_studyabroadprogram) at the Sea Education Association in Woods Hole, the MBL Semester in Environmental Science (co-PI Mark Welch), and MIT/WHOI Joint Program courses (co-PIs Sosik and Beaulieu). Since this curriculum has been delivered over many years, it provides concrete opportunities for participation in workshops, as well as student mentoring in direct use of the tools we will be developing.
Models and workflows developed by the project will be presented by Sosik and/or Beaulieu in the MBL summer course Strategies and Techniques for Analysis of Microbial Population Structures (co-directed by Mark Welch), which trains ~60 graduate students, postdocs, and independent investigators every year. Similarly, Rensselaer’s offerings in Data Science and Data Analytics (both taught by PI Fox with frequent input from co-I Beaulieu) will feature the MBVL models and workflows as a central resource for student projects and engagement.