TRANSFORMING MORPHOLOGICAL SYSTEMATICS FROM DESKTOP TO WEB APPLICATIONS: DEVELOPMENT OF THE ONLINE WORKSPACE MorphoBank.org

Project Description

Papers cited in “Results from prior NSF support” are marked with * in References Cited.

I. Results from Prior NSF Support

PI-O’Leary (DEB-9985847, $189,998 [including supplements], 1997-2003): “Collaborative: Cetacean phylogeny: A Reconciliation of Fossil and Neontological Data and the Importance of Taxonomic Sampling.” Comprehensive analysis of morphology for the phylogeny of whales; one of the largest morphological phylogenetic analyses produced for mammals (> 600 characters). 5 papers published, 2 in review (including a monograph with 121 full-page original illustrations [Fig. 1], 341 ms pages), 4 abstracts; construction of MorphoBank (online database/web application for morphological systematics), www.morphobank.org. Broader Impact.- MorphoBank reviewed in articles in Science*, Bioscience* and Trends in Ecology and Evolution*; Training: 1 female postdoc, 2 female undergraduates, 1 minority undergraduate, 1 high school student who became the first African American to win First Place in Zoology, Intel Science Fair; 1 mentorship award to PI. PI conducted 1 international meeting, gave 10 invited scientific talks and 2 public lectures, and spoke at 11 invited scientific workshops. Collaborated with SUNY-C-STEP program to train minority undergraduates.

PI-O’Leary (Co-PI, D. Krause, EAR-0116517, $227,934, 2001-2005): “Acquisition of Instruments and Technical Support for an Interdepartmental Fossil Preparation Laboratory.” Constructed fossil preparation laboratory with two technicians serving ten paleobiologists at Stony Brook University, and ten other U.S. and international collaborators. For the grant as a whole: 12 papers; 23 abstracts; two M.Sc. theses, one website.
Broader Impact.- Training: Lab used by students, specifically 2 graduate, 4 undergraduate; development of new paleontology course for Women in Science and Engineering program; fossil casts distributed to international repositories.

PI-O’Leary (Doctoral Dissertation Improvement grant for R. V. Hill, DEB-0206533, $8,723, 2002-2004): “Comparative Anatomy and Evolution of Osteoderms in the Amniote Integument.” Conflicting hypotheses of turtle relationships evaluated using the largest morphological data set compiled for Amniota. 3 papers and 3 abstracts. Broader Impact.- Collections: data deposited on MorphoBank.org.

Co-PI Ferguson: “SUNY Louis Stokes Alliance for Minority Participation” (HRD 9623931, 11/96 - 10/01 and 0114756, 11/01 - 10/06): Training: education program increased underrepresented minority enrollment in science and engineering undergraduate majors by 157%, and science bachelor's degrees by 63%, at Stony Brook.

Co-PI Baru (put Chaitan information in here)

Co-PI Lin is a beginning investigator.

II. Morphological (Phenotype-Based) Phylogenetics: Desktop to Web Applications

How did the phenotype evolve across the Tree of Life over the past 4.5 billion years of Earth history? What is the range of shapes in teeth that have evolved in parallel? What was the body shape of an extinct species known only from a skull, and how can we predict this from a phylogenetic tree based on morphology? These are all questions that can be addressed using the contemporary methods of morphological (phenotype-based) systematics. To address these questions, phylogeneticists (systematists) are increasingly working collaboratively (e.g., NSF - Assembling the Tree of Life [ATOL] Program) and need to access and integrate as much data as possible for their work. Advances in developmental biology also rely on documentation of a wide variety of phenotypic data and an understanding of how variation in shape evolved.
There is a particular need to share labeled images, because morphology is best communicated in labeled images rather than in text descriptions alone. Phylogenetics and developmental biology both require faster and simpler image sharing technologies, and this creates a growing need for open source web applications. Such applications are fundamental to enabling collaboration and data sharing, and to facilitating the growth of morphological data collection, which has lagged behind collection of molecular sequences for phylogenetics.

The Importance of Phylogenetics (Systematics)

Phylogenetics is the reconstruction of the interrelationships among all (fossil and living) species: building the Tree of Life, and answering such questions as who is more closely related to whom. In 2005, Science magazine identified the creation of a consensus Tree of Life as one of the 125 most important questions facing science (Seife, 2005). Research (including the NSF ATOL workshops report; Cracraft et al. [2004]) has repeatedly emphasized the importance of using raw data from the phenotype (morphology) in addition to molecular sequences to tackle this large scale scientific problem. To reconstruct a phylogenetic tree we use algorithms that employ optimality criteria (e.g., parsimony, maximum likelihood) to explain data assembled into matrices. These matrices contain comparative data about species: data that have been coded as 0, 1, 2, etc. to describe characters and character states. A character might be “Wing: present (0); absent (1)”, or simply the presence or absence of a given nucleotide. These are called “homology statements” because they are hypotheses that a similarity in a group of organisms is present because it was inherited from the common ancestor of those organisms. Depending on the problem being studied, a matrix might contain molecular data or phenotypic data, or both.
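To make the structure of such a matrix concrete, the following is a minimal Python sketch of characters, numbered states, and coded cells. It is an illustration only: the taxon names, the second character, and the helper functions are invented here and are not taken from any MorphoBank project, and "?" is assumed to mark an unknown state, as is common in Nexus-style matrices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Character:
    """A character and its numbered states (together, a homology statement)."""
    name: str
    states: tuple  # states[0] is coded "0", states[1] is coded "1", ...


# Hypothetical characters, for illustration only.
CHARACTERS = [
    Character("Wing", ("present", "absent")),
    Character("Tail", ("short", "long", "absent")),
]

# Rows are taxa; each row holds one coded state per character.
MATRIX = {
    "Taxon_A": ["0", "1"],
    "Taxon_B": ["1", "?"],  # "?" = state unknown (e.g., incomplete fossil)
    "Taxon_C": ["0", "2"],
}


def describe(taxon: str, char_index: int) -> str:
    """Translate a coded cell back into its human-readable statement."""
    character = CHARACTERS[char_index]
    code = MATRIX[taxon][char_index]
    if code == "?":
        return f"{character.name}: unknown"
    return f"{character.name}: {character.states[int(code)]}"


def validate(matrix, characters) -> list:
    """Return (taxon, character, code) for every cell with an undefined code."""
    problems = []
    for taxon, row in matrix.items():
        for i, code in enumerate(row):
            defined = code.isdigit() and int(code) < len(characters[i].states)
            if code != "?" and not defined:
                problems.append((taxon, characters[i].name, code))
    return problems
```

Calling `describe("Taxon_A", 0)` recovers the statement behind the code, and `validate` flags cells whose code does not correspond to any defined state, which is the kind of consistency check a matrix editor would need before export.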
Phenotypic data may include anatomical data, behavioral data, or physiological data, to give a few examples. Molecular data lend themselves most readily to codification in a matrix as 0s, 1s, etc. because they are already discrete entities (e.g., A, C, G, T), whereas morphological data can be harder to codify into separate states. Theoretical work argues, however, that morphological data are very important for phylogenetics (e.g., Wiens, 2004; Smith and Turner, 2005); thus it is important to systematics as a whole to enable phenotypic data collection that is efficient and repeatable.

The number of digital libraries of phenotypic data has grown extensively over the last ten years. These digital libraries provide virtual collections that systematists can draw on to populate matrices with images over the web. They are organized around the following features: (1) types of media (e.g., Digimorph for CT scans), (2) museum collections (e.g., new Scripps-UCSD project, Paleoportal accessible museum collections), (3) groups of species (e.g., Flybase, Fishbase, PEET projects, ATOL projects, model organisms), (4) published texts (e.g., JSTOR, AMNH series), (5) spatial and temporal data for species (e.g., Paleobiology Database, others?), (6) developmental biology (e.g., NCBI? examples). These are important fundamental resources for raw data for systematists working with the phenotype of living and fossil organisms. The growth of digital media and the importance of these digital libraries underscore a broad change in anatomy-based research, which has used the web to expand access to specimens and text information. These new resources have helped transform the contemporary concept of monography. As noted by Dettai et al. (2004), the growth of matrices is limited by a lack of tools for morphological systematics.
Although morphological work is enhanced by the immediate availability of images (media) that document characters and character states, desktop programs for phylogenetic systematics (e.g., Phylip, Mesquite) have relatively limited capabilities for manipulating and displaying images. MorphoBank 2.0 was created to provide the cyberinfrastructure to link these digital images directly to phylogenetic matrices.

Morphological Systematics as Team-Based Research

Through the ATOL program, NSF has placed new emphasis on the importance of collaborative, team-based phylogenetics research. As evidence of this, morphological matrices have been growing in numbers of both taxa and characters (O’Leary and Gatesy, in review, > 600 characters; another example). Sharing of images among collaborative team members on phylogenetics projects can greatly clarify concepts of homology among team members. Web applications that promote remote collaborations are appearing in many forms, and these are transforming the nature of data sharing in both scientific and nonscientific work communities. Some widely available examples of generic collaborative tools include Google Spreadsheets [spreadsheets.google.com], Wikipedia [http://en.wikipedia.org/wiki/Main_Page] and a number of collaborative wikis [http://www.pcmag.com/article2/0,4149,1402872,00.asp], which allow text and document sharing. More sophisticated scientific data sharing areas built for specific projects, such as the GEON portal [www.geonweb.org], permit joining and querying of relational databases. Like the users of these other web applications, morphological phylogeneticists can benefit greatly from tools that grant access to data from anywhere in the world, particularly when these tools are designed to meet the specific needs of working phylogeneticists/systematists.
Such an application would enable team members with internet access to look at data in a matrix simultaneously, grant data/image access to reviewers of unpublished data, facilitate sharing of images of homology statements without swapping folders of images through ftp or email, and always provide the ability to download nexus files of matrices and images readable by widely used desktop programs (e.g., Mesquite).

What is MorphoBank.org?

MorphoBank.org was conceived, designed and deployed with the goal of providing the kind of on-line, location-independent collaboration space required by large scale collaborative projects investigating phylogeny. MorphoBank.org is an open source, platform-independent, online database and collaborative workspace for phylogenetics. The tool in its present state was constructed with minimal funding by a domain scientist in close collaboration with a software engineer. The driving design metric was to provide the maximum useful functionality for collaboration among working systematists at the lowest possible cost in terms of effort, computational resource demands, and maintenance. The first versions of MorphoBank.org were built with seed funds from NSF to PI-O’Leary to sponsor a workshop and a second grant from NOAA (<$35,000 direct costs). MorphoBank.org currently contains over 2,600 images and more than fifteen active projects in preparation. Investigators of XX ATOL projects have expressed interest in depositing their data on the site. Milestones in the development of the site include the first published paleontological matrix with images (Hill, 2005; fossil amniotes) and a soon to be released phylogenetic matrix that will have over 1,000 images documenting 631 characters for 70 taxa (O’Leary and Gatesy, in review; fossil mammals).
MorphoBank.org differs from existing digital library initiatives

Several other initiatives that use the web to database morphology are underway (e.g., Digimorph, MorphBank, Miranker ATOL project, Electronic Field guides). Some of the most prominent of these have formed a collaborative group with plans for growing interoperability in the future. Unlike these projects, which are conceived primarily as repositories for organizing and archiving images and metadata, the primary function of MorphoBank is to provide a web application with tools for creating and analyzing data in a collaborative workspace. The tools implemented by MorphoBank are easily adaptable to whatever interoperability standards are adopted by this collaborative group, and MorphoBank will allow users of those resources to easily work with their data to produce matrices and speed their time to publication. The partnership that we have initiated with them (Letter: Riccardi) provides an important foundation for establishing the interoperability required by practicing systematists in MorphoBank, while leveraging the database development efforts in the collaboration.

Current (MorphoBank v 2.0) Features

At present, MorphoBank.org allows users to conduct web-based management of comparative anatomical data of fossil and living organisms for scientific research and education. MorphoBank.org is structured around the concept of “Projects,” which simply means a collection of related images and metadata in use by a scientist or team of scientists. The MorphoBank user interface was built with user feedback from phylogeneticists, and other informatics teams working on databasing aspects of morphology have recognized its high quality for phylogenetics (Letter: Riccardi). Features of the current implementation include:

Registration: Registered users access MorphoBank via a password-protected login. Once registered, a project owner can designate the workgroup members who are to have access to the project.
Image Uploading: Registered users can upload and catalogue multimedia submissions (2D and 3D images, e.g., drawings, photos, and CT scans). The uploaded images can be accessed by all group members, and project collaborators can add annotations on these media. Along with images, MorphoBank captures and records a variety of metadata such as author of submission, related publications, critical commentary, names of species and higher taxa, and descriptions of characters. The user is able to add and edit these labels and properties.

Figure 2. Screen shot from MorphoBank.org (2.0) showing image upload/media viewing area.

Dynamic Character Matrix Creation/Editing: For phylogenetic research, MorphoBank displays dynamic phylogenetic matrices of morphological characters with labeled character information (homology statements). This is accomplished over the web, in real time, by autonomous teams of researchers building and editing phylogenetic matrices with affiliated annotated image data, or simply sharing images with annotations (Figure 3). Nexus file uploads can also be used to create matrices. Users have complete control over naming, and can add/edit taxa, characters, states, etc.

Figure 3. Screen shots of matrix editor from MorphoBank.org (2.0).

Searching: MorphoBank’s search engine is capable of searching all aspects of a project’s data and returning taxonomic records, specimen data, characters, media and matrices, any of which can be downloaded. The search engine implements Boolean operators, exclusion, wildcards, stemming, spell correction and parenthetical grouping. The search returns images and metadata from all projects that have been passed to the MorphoBank.org public archive by investigators, as well as any unpublished projects to which a registered user is currently contributing.

Annotations: The usefulness of images to researchers is greatly enhanced by linking text to them, which can be done in MorphoBank.
More specifically for phylogenetics, MorphoBank can link and display an image affiliated with a taxon-character-character state intersection. In MorphoBank, images and annotations (such as the Tables listed above) are also stored separately and can be retrieved separately. Information about an image (author, date, size, original format), as well as embedded descriptive and technical metadata in IPTC, EXIF, XMP format, is captured at the time of upload.

Publication: Once a phylogeneticist (or team) is ready to make work available to the public in open storage (e.g., coinciding with the appearance of a published paper), the images, matrices and annotations are permanently archived and searchable by the public on MorphoBank.org. The database is particularly helpful for comparing hypotheses of homology, which are central to the phylogenetics of fossil or living taxa. Once a project is published, the site search engine can make connections between like-named entities in formerly separate projects.

MorphoBank v 2.0 Architecture: The MorphoBank.org (2.0) web application follows a standard 3-tier architecture model. The site’s view and logic layers are implemented in PHP. The choice of PHP allowed us to implement in a lightweight environment that is conducive to rapid development and deployment, consistent with our minimal frills, maximal functionality aesthetic. The PHP language provides many useful features for data access and sharing, but avoids the very large overhead required by enterprise web applications built on platforms such as Java Struts/EJB [ref]. General web site functionality is accessible via standard web browsers (cf. Figure 2). Highly interactive user interfaces, such as the matrix editor (Figure 3) and image viewer/annotation tool, however, are implemented as “Rich Internet Applications” with ActionScript (Adobe Flash v8) and Java. The image viewer loads images very quickly (typically < 1 sec), with potential for even more improvement.
We have used ActionScript for the development of complex web-based user interfaces in Macromedia Flash, and JavaScript for browser-based “rich” user interfaces. The web-based clients for the database were built using Java and Flash (ActionScript); the latter provides very lightweight and responsive browser-based image manipulation. MorphoBank.org uses an Apache open source web server and ImageMagick image processing software, which can manipulate images in over 90 formats (including TIFF, JPEG and PNG).

The database tier of MorphoBank.org consists of a relational database for metadata, which links to stored image files. The logic model at the logic layer is mapped into the database. The mapping is managed by tools, so that the logic model can be extended without redesigning the database schema. The relational database is implemented in MySQL, a widely deployed open source relational database management system that uses Structured Query Language [SQL]. The relational database stores all aspects of the hosted projects. The primary tables in the database are: Taxonomic names; Characters (e.g., anatomical descriptive phrases) and associated character states; Specimens (including collection information, voucher number and taxonomic name, as well as other Darwin Core 2 compatible fields); Media (still imagery, video, sound); Matrices (taxa and characters united by character state assertions). MorphoBank.org allows annotations on these entities. These entities are each contained within independent Project Workspaces. Naming collisions between projects do not occur because each project is self-contained. As noted above, we support a number of media formats: for still imagery, we store the original uploaded file, JPEGs in several sizes, and a TilePic-format version, equivalent in resolution to the original, that is used for our pan-and-zoom image viewer.
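The relationships among these primary tables can be sketched as follows. This is a simplified illustration and not the MorphoBank schema: the table and column names are hypothetical, specimens and annotations are omitted, and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained. It shows how scoping taxa to a project avoids naming collisions, and how a matrix cell ties a taxon and a character to a state and, optionally, an image.

```python
import sqlite3

# Hypothetical, simplified schema; the production system uses MySQL.
SCHEMA = """
CREATE TABLE projects   (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE taxa       (id INTEGER PRIMARY KEY,
                         project_id INTEGER NOT NULL REFERENCES projects(id),
                         name TEXT NOT NULL,
                         UNIQUE (project_id, name));
CREATE TABLE characters (id INTEGER PRIMARY KEY,
                         project_id INTEGER NOT NULL REFERENCES projects(id),
                         name TEXT NOT NULL);
CREATE TABLE states     (id INTEGER PRIMARY KEY,
                         character_id INTEGER NOT NULL REFERENCES characters(id),
                         code INTEGER NOT NULL,
                         label TEXT NOT NULL);
CREATE TABLE media      (id INTEGER PRIMARY KEY,
                         project_id INTEGER NOT NULL REFERENCES projects(id),
                         path TEXT NOT NULL);
CREATE TABLE cells      (taxon_id INTEGER NOT NULL REFERENCES taxa(id),
                         character_id INTEGER NOT NULL REFERENCES characters(id),
                         state_id INTEGER NOT NULL REFERENCES states(id),
                         media_id INTEGER REFERENCES media(id),
                         PRIMARY KEY (taxon_id, character_id));
"""


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database holding two tiny, invented projects."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO projects (id, title) VALUES (?, ?)",
                   [(1, "Project one"), (2, "Project two")])
    # The same taxon name appears in both projects without colliding.
    db.executemany("INSERT INTO taxa (id, project_id, name) VALUES (?, ?, ?)",
                   [(1, 1, "Taxon_A"), (2, 2, "Taxon_A")])
    db.execute("INSERT INTO characters (id, project_id, name) VALUES (1, 1, 'Wing')")
    db.executemany(
        "INSERT INTO states (id, character_id, code, label) VALUES (?, ?, ?, ?)",
        [(1, 1, 0, "present"), (2, 1, 1, "absent")])
    db.execute("INSERT INTO media (id, project_id, path) "
               "VALUES (1, 1, 'media/1/original.tif')")
    db.execute("INSERT INTO cells (taxon_id, character_id, state_id, media_id) "
               "VALUES (1, 1, 1, 1)")
    return db


def cell_report(db: sqlite3.Connection, project_id: int) -> list:
    """List every scored cell in one project with its state and image path."""
    rows = db.execute(
        """
        SELECT t.name, c.name, s.code, s.label, m.path
        FROM cells
        JOIN taxa t       ON t.id = cells.taxon_id
        JOIN characters c ON c.id = cells.character_id
        JOIN states s     ON s.id = cells.state_id
        LEFT JOIN media m ON m.id = cells.media_id
        WHERE t.project_id = ?
        ORDER BY t.name, c.name
        """,
        (project_id,),
    )
    return rows.fetchall()
```

The `LEFT JOIN` on media reflects that a cell can be scored without an image attached; the per-project uniqueness constraint on taxon names is one simple way to keep each workspace self-contained.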
The MorphoBank.org database incorporates structures for various domain specific entities as well as a configurable metadata schema [http://www.morphobank.org/doc/schema_latest.pdf] that supports mapping to widely used schemas like Darwin Core 2 [http://darwincore.calacademy.org] and Dublin Core [http://dublincore.org/]. Entities that are invariant across all use cases, such as taxonomic names, characters and matrices, are implemented as domain-specific entities. Other types of data are stored as project-specific metadata with an explicit mapping to a project-defined standard.

Hardware. MorphoBank.org is currently run on an IBM Blade Server (HS20) with dual 3.2 GHz processors on a RedHat Linux ES Server operating system (Letter: Eisenberg).

III. Prototype to Community Tool: MorphoBank.org v 3.0

The primary objectives of MorphoBank are 1) to create a web application that enables the association of images with matrices, and thereby increase the repeatability, and ultimately the efficiency, of phylogenetic work on the phenotype, and 2) to make this web application accessible to all teams of collaborating morphological systematists who desire access. In Section II, we reported the completion of the first objective: MorphoBank 2.0 provides a functioning proof-of-concept prototype that allows teams of researchers to store, share, and annotate images, and to create, upload (in nexus format), edit, store, and download matrices for use in tree inference software. Growth of the site (see XX letters from groups wishing to use the site) indicates that the tools provided by MorphoBank are useful and meet a need within the community. However, demands on the site are beginning to exceed what PI O’Leary can supervise as a single investigator. Moreover, with the growing user base, MorphoBank has an increasing need for new functionalities, and for improved database functionalities.
In order to meet the needs of the growing user population, we propose to take steps to make this infrastructure available to as many biologists as possible. In Section III, we describe plans for the evolution of MorphoBank into an online infrastructure resource that can be scaled to an entire community of users, and that provides increased connectivity to resources that exist elsewhere in the community. To assemble the skill sets required for this proposal, we propose to expand the MorphoBank team to include Dr. Chaitan Baru, Head of Data and Knowledge Systems at SDSC, and Dr. Kai Lin, Staff Scientist currently working on the data integration team of the GEON Portal [www.geonportal.org]. This group will provide the capabilities and experience to ensure that the database effort continues in a way that will serve the needs of the anticipated growth in user base, and to assist with integration of remote metadata and data resources. The SDSC team will provide an effective complement to the domain science and development experience contributed by MorphoBank’s creators at Stony Brook.

Specific deliverables for the requested funding period are:
Increased capacity for a large user community.
Improved interoperability with databases, portals and grids.
Integration with developing online taxonomic authorities.
Linking to public access online literature (pdfs) - mining data archives for images for phylogenetic research (SDSC).
Integration with other online community software/services.
Hosting multispecies ontologies of phylogenetic characters.
Development of sophisticated drawing, editing and measurement tools.
Enhanced documentation of the site.
User feedback based modifications to the site, optimization of web applications.
Formation of a committee responsible for the site.
Provision for the sustainability of the site at the close of funding.
Figure 4 provides a cartoon of the proposed expansion of MorphoBank, showing the existing technologies, those that will be added during the current funding, and those that will be provided by existing community projects, as accessed through the MorphoBank web application. Our key metric of success for the project is to bring several large morphological matrices online by the end of three years: specific exemplar projects from across the Tree of Life that present different computer programming challenges. These exemplar projects (see Section IV below) will help us develop MorphoBank into a tool that can transform the way morphological data are collected for systematics by making data deposit for morphology more uniform. Our strategy for achieving these goals is detailed below.

Figure 4. MorphoBank 2.0 (mustard), and plans for MorphoBank 3.0 (yellow) in relation to other community projects.

Specific Project Objectives

1. Increased Capacity for a Large User Community. MorphoBank 2.0 is served from a small cluster at Stony Brook. This has been adequate for the prototyping phase, but more computing power will be required as we expand the user base. We have already established a development account at SDSC Data Central, which allows us to serve a MorphoBank mirror site on the SDSC production hardware. This hardware includes online disk, MySQL, DB2 and Oracle space, and an IBM p690 as database server, and is served through the SDSC WebFarm, which will provide scalable access to the site via the internet. The business logic at the server will be implemented mainly as web services, which provide independent functions to the web tier. At the same time, these functions can be easily used and integrated by other applications to serve a larger community. The presence of the mirror will provide both increased capacity and increased reliability of all services.

2. Improved interoperability with databases, portals and grids.
In adding to the functionality of MorphoBank, we plan to increase the palette of data and tools available to investigators. For data access and retrieval, this can be accomplished in a simple way through link-outs from, and link-ins to, MorphoBank. Some specific examples where this would be helpful are: 1) Investigators researching a systematics problem in MorphoBank.org often have temporal or spatial data for the taxa under consideration, and these are relevant to the science under investigation. 2) A user has deposited images in another site (e.g., Digimorph, MorphBank) and may wish to affiliate these directly with a cell in a matrix on MorphoBank.org (Fig. 3). The use of link-outs would permit the user to access the images in the other repositories and upload them into their project area in MorphoBank. Alternatively, the user may simply establish a URI link to that image, and as long as the other resource is publicly available and functioning (and not password protected), the image will appear in their matrix. 3) A user searching in Paleoportal may wish to be alerted by a search that there is a matrix of fossils available in MorphoBank; this can be achieved most simply by a link-out to MorphoBank. MorphoBank.org has already established a collaboration with NCBI such that MorphoBank.org data can be retrieved in searches of NCBI (Letter: Scott Federhen). MorphoBank can accept both images and image URIs, i.e., references to images on other sites can be saved in MorphoBank. At the same time, all the resources in MorphoBank are exported as URIs for linking from other sites. Therefore we can provide Paleoportal with link-out information as well, so that the user will be made aware of the opportunity to examine images in MorphoBank. Each of these is an example of the importance of interoperability among sites. Interoperability will allow us to remain committed to providing a unique tool kit for users while avoiding duplication of the functions of other web services.
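The two ways of attaching an image to a matrix cell described above, a local upload or a saved URI pointing at another repository, can be sketched as a small data structure. This is an illustration under stated assumptions: the `MediaRef` class, the `cell_uri` helper, and the URL path layout are hypothetical and do not describe MorphoBank's actual URI scheme. Only the Python standard library is used, and no network request is made.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Site root; the paths built from it below are hypothetical.
BASE_URL = "http://www.morphobank.org"


@dataclass
class MediaRef:
    """An image attached to a matrix cell: a local upload or a remote link."""
    local_id: Optional[int] = None
    remote_uri: Optional[str] = None

    def __post_init__(self):
        # Exactly one of the two sources must be given.
        if (self.local_id is None) == (self.remote_uri is None):
            raise ValueError("give exactly one of local_id or remote_uri")
        if self.remote_uri is not None:
            parts = urlparse(self.remote_uri)
            # Only plain web links can be displayed inside a matrix cell.
            if parts.scheme not in ("http", "https") or not parts.netloc:
                raise ValueError(f"not a usable web URI: {self.remote_uri!r}")

    def display_uri(self) -> str:
        """URI the matrix viewer would load for this image."""
        if self.remote_uri is not None:
            return self.remote_uri
        return f"{BASE_URL}/media/{self.local_id}"


def cell_uri(project_id: int, taxon_id: int, character_id: int) -> str:
    """Exported link-in address for one matrix cell (hypothetical layout)."""
    return f"{BASE_URL}/project/{project_id}/cell/{taxon_id}/{character_id}"
```

Storing only the URI keeps the image under the control of its home repository, at the cost noted in the text: the image appears in the matrix only while the remote resource remains public and reachable.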
In addition to data services, MorphoBank will provide users with software tools that are not created or maintained by MorphoBank, but which are available to the community as published Web Services. As an exemplar of this capability, we will work with the CIPRES team at SDSC to incorporate their tree inference software package as a web service. This is a natural partnership between MorphoBank, which has the ability to upload and download matrices as Nexus files, and CIPRES, whose software consumes Nexus files, infers trees by Parsimony, Maximum Likelihood, and Bayesian techniques, exports the trees in Nexus format, and provides tree viewing tools. The CIPRES team has committed to co-develop this capability with MorphoBank (Letter: Miller) as a test of the ability of the CIPRES software to be ported to new applications. An added advantage of this collaboration is that as the community develops a new XML format to replace Nexus, we can partner with the CIPRES team to incorporate this new exchange language into our tools.

3. Integration with developing online taxonomic authorities. A growing number of online taxonomic authorities (extant: uBio [http://www.ubio.org/] and ITIS [www.itis.usda.gov/], extinct: PBDB [http://paleodb.org/cgi-bin/bridge.pl]) serve as important reference systems for consistent use of taxonomic names. Taxonomic names are being regularly entered into MorphoBank.org for fossil and living organisms by experts on various groups. Our user-focused approach has taught us that within individual projects, scientists require complete flexibility in constructing their taxa, and for this reason, MorphoBank does not impose a controlled vocabulary upon individual scientists. However, when the work is completed, it is important to be able to map the results of an experiment back to the world of existing naming authorities.
Thus, interoperability between taxonomic authorities and MorphoBank.org must be preserved to aid the development and entry of accurate information in all sites, and the availability of accurate information to the public. This standardization also applies to institution codes and eventually to LSIDs. Tools to assemble naming authorities under a single web application have already been described [Page, RDM (2005). A Taxonomic Search Engine: Federating taxonomic databases using web services, BMC Bioinformatics 2005, 6:48 doi:10.1186/1471-2105-6-48], as have tools for mapping trees onto each other [cf. Critical Points for Interactive Schema Matching, with Guilian Wang, Young-Kwang Nam, and Kai Lin. Technical Report CS2004-0779, UCSD Department of Computer Science, 31 January 2004; Patrick Ziegler, Christoph Kiefer, Christoph Sturm, Klaus R. Dittrich, and Abraham Bernstein: Generic Similarity Detection in Ontologies with the SOQA-SimPack Toolkit (Demo Paper). To appear in: 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD 2006), Chicago, USA, June 26-29]. MorphoBank 3.0 will use AJAX technology to seamlessly integrate these online resources into MorphoBank. More concretely, when users enter the genus and species, the information is sent to the MorphoBank server without reloading the page; the matching taxa are fetched on-the-fly from these online resources by the MorphoBank server and sent back to the browser to automatically populate the other fields of the web form. In this way source information is captured directly from the naming authorities: initially uBio [http://www.ubio.org/], ITIS [www.itis.usda.gov/], and PBDB, with others, such as Index Fungorum [http://www.indexfungorum.org/Names/Names.asp] and the Animal Diversity Web [http://animaldiversity.ummz.umich.edu/site/index.html], to be added as required.
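The server side of this lookup can be sketched as follows. This is a minimal illustration written in Python rather than the site's PHP, and it makes several assumptions: the two "authorities" are in-memory stand-ins with invented records, not calls to the real uBio, ITIS, or PBDB services, whose interfaces are not described here; the function names and the JSON layout are hypothetical; and the rule that the first authority to supply a field wins is just one possible merging policy.

```python
import json
from typing import Dict, Optional


# Stand-ins for remote naming authorities. A real deployment would query each
# service over the network; these records are invented for illustration.
def lookup_authority_a(genus: str, species: str) -> Optional[Dict[str, str]]:
    records = {("Examplus", "fictus"): {"family": "Examplidae",
                                        "author": "Doe, 1900"}}
    return records.get((genus, species))


def lookup_authority_b(genus: str, species: str) -> Optional[Dict[str, str]]:
    records = {("Examplus", "fictus"): {"family": "Examplidae",
                                        "order": "Exampliformes"}}
    return records.get((genus, species))


# Authorities are consulted in this order.
AUTHORITIES = [("authority_a", lookup_authority_a),
               ("authority_b", lookup_authority_b)]


def suggest_fields(genus: str, species: str) -> str:
    """Build the JSON reply that an asynchronous form request would receive."""
    genus = genus.strip().capitalize()
    species = species.strip().lower()
    fields: Dict[str, str] = {}
    sources: Dict[str, str] = {}
    for name, lookup in AUTHORITIES:
        record = lookup(genus, species)
        if record is None:
            continue
        for key, value in record.items():
            # First authority to supply a field wins; later ones do not overwrite.
            if key not in fields:
                fields[key] = value
                sources[key] = name
    return json.dumps({"genus": genus, "species": species,
                       "fields": fields, "sources": sources}, sort_keys=True)
```

Recording which authority supplied each field, as the `sources` entry does, is what would let a project creator see where their own entries differ from a published name server and override a value when the authority is not granular enough.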
The project area for MorphoBank 3.0 will provide users with the option of populating their taxa descriptor fields with information from these services, and with the ability to modify it if the published name servers do not provide sufficient granularity. The important benefit of this service is that each project creator will be able to see instantly where their work differs from existing authorities, and to make corrections in a very precise manner.

4. Linking to public access online literature. Mining data archives for images for phylogenetic research. Linking images to homology statements that can then be shared among colleagues is one of the most powerful tools introduced by MorphoBank.org (Figure 2). In many cases, a relevant image may already be published in a natural history journal. Unless appropriate permission is granted, MorphoBank.org asks authors not to link images that are copyrighted to public access matrices. However, as journals become open-access online, either because they have made the decision to do so (e.g., American Museum publications: Bulletin, Novitates) or because they have always been fully electronic and open access (e.g., Palaeontologia Electronica; currently at least 198 biology, and 43 geology journals), MorphoBank.org can offer authors the opportunity to affiliate images from those publications with a cell. The addition of this capability to MorphoBank can be accomplished in phases. In the first stage, search tools can be created to identify relevant articles. Links to searches at NCBI/PubMed are simple to implement, and can provide click-through paths to the articles and images required by investigators. Further click-through can produce the relevant article as permitted by the journal’s policy and the users’ institutional licensing agreements.
If requested by users, it will also be possible to populate MorphoBank fields like view and specimen number, to provide literature references according to NLM standards for literature citations, and to associate the linkout with the journal article or NLM abstract. According to user input/requirements, we can also investigate further information extraction tools to search for appropriate literature images. Tools for extracting taxon names [Koning, D., Sarkar, I. N., and Thomas Moritz, (2005) “TAXONGRAB: Extracting Taxonomic Names From Text” Biodiversity Informatics, 2, 79-82] and for ontology-based literature mining [Muller, H.M., Kenny, E. E., Sternberg, P.W. (2004) “Textpresso: An Ontology-based information retrieval and extraction system for biological literature” PLoS Biol 2:e309] have been reported, and could be implemented for MorphoBank. To affiliate an image in an online PDF file, we can use the same steps to let the user point to the online PDF file, extract the images from it, and populate MorphoBank with the selected image.
5. Hosting multispecies ontologies of phylogenetic characters (Stony Brook and SDSC). Many practicing morphological phylogeneticists are not currently using controlled vocabularies (ontologies) to describe their characters and character states. By contrast, controlled vocabularies are becoming very important for describing phenotypic features identified in developmental research. However, new research efforts are starting to introduce ontologies into phylogenetic research, including two ATOL projects (Cypriniformes, Spiders) that are drafting their characters as multispecies ontologies. Inclusion of these projects in MorphoBank.org provides both an example to other research projects wishing to follow this pattern and a basis for future interoperability with sites like GMOD that are databasing developmental information.
6. Development of sophisticated drawing, editing and measurement tools (Stony Brook/SDSC).
Just as it is hard to imagine an anatomical atlas conveying much information if it did not have labels, the labels introduced on MorphoBank.org greatly enhance the usefulness of an illustration and the clarity with which an investigator can communicate what he/she means by a homology statement. We can develop a more powerful annotation tool that allows users to put annotations on selected points or areas of an image rather than on the whole image.
7. User feedback based modifications to the site, optimization of web applications (Stony Brook). Site users have requested a number of changes that can only be implemented with more funds for programming time. They are:
a. Enhanced versioning capabilities and the ability to rewind work to earlier stages or single investigators. Perhaps one of the least attended to issues in database management is data/document versioning (cf. www.cs.ucl.ac.uk/staff/btagger/LitReview.pdf and references therein). While some solutions to this issue have recently appeared in the context of information lifecycle management (ILM) [e.g., http://www.abrevity.com/abrevity_fdm_ds.pdf], we are not aware of any comparable solutions that are appropriate for individual academic scale projects. At present, both MorphoBank staff and the visualization team at SDSC [http://vis.sdsc.edu/research/cancer.html] are (fortuitously) pursuing solutions to this problem for annotation of visualization images, and both sites have (again fortuitously) adopted essentially identical architectures for the sharing and presenting of digital images. The issue of multiple annotations of individual images is already a solved problem within these two programs, and the specification of a versioning solution is under active investigation at SDSC.
b. Improved navigation of the thousands of cells that appear in large matrices.
c. Enabling a format for visitors to add images/comments to existing matrices. This emulates a blog capability.
This is yours, but I can say that the SDSC group has tools to do this kind of work.
d. Moving matrix tools to the cells, to optimize workflow and the speed with which a user can switch from one tool to another.
e. Improving aspects of the bug-reporting system to make it immediately accessible during login. MorphoBank 3.0 will utilize the FogBugz reporting/feature request system [www.fogcreek.com/FogBugz], which is inexpensive and provides the ability to store bugs in a database, to assign bugs to a distributed workforce and track them, and to correspond easily about reported issues and feature requests. This system has been used successfully for bug tracking and project management at SDSC for several years.
f. Clearinghouse for active projects - with both a public and a private face, including details of last login. This will be important for the committee supervising the site (to encourage projects to move towards public access eventually).
8. Formation of a committee responsible for the site (Stony Brook). Following the example of the Paleobiology Database, we plan to develop a committee of involved systematists to establish rules of conduct for the site and rotating leadership. The Paleobiology Database has established working guidelines that have been in place successfully for several years. The existence of such a committee will increase our visibility within the community, and strengthen our ties with the community. It will also be a helpful source of resources and ideas about how MorphoBank can be sustained once the funding provided under this proposal has expired.
9. Enhanced Usability/Documentation of the site. We have already been extremely successful with computer science undergraduates at Stony Brook University, who have been selected through the C-STEP program for minority science education (supervised by co-PI Ferguson, see below), in building documentation for the site under the close supervision of PI O’Leary and Kaufman (Senior Personnel).
These internships have resulted in online screenshot-based movies that are tutorials for new users (and have been tested on new users). We plan to continue these internships as a way of introducing young computer scientists to this work and to keep information about the site features current. However, Cherri Pancake, usability expert at Oregon State University (personal communication), has advised: “if a web application requires a manual to use, it will fail to attract users.” While MorphoBank 2.0 has been designed for a committed group of users who will tolerate initial usability issues, we are committed to minimizing the barrier to entry for use of the site, and to actively soliciting advice from new users on MorphoBank’s usability. One way to address this issue is to recruit undergraduate interns to learn to use the application as part of a class assignment to construct a particular matrix, and to require their feedback on which parts of the application are difficult to learn, and which are easy. This feedback can be passed directly back to the developers for adjustments in the look and feel of the web site. Where are the movies? Is there a place we can view them? Because it is hard to figure it out solo at this time.
10. Provide for the sustainability of the site at the close of funding. Sustainability is among the most pressing issues for projects that develop data integration and sharing techniques. Research programs and their funding are by definition transitory, yet their work product is often of enduring value. Nowhere is this more true than for the activity of collecting and annotating morphological images. The sustainability question can be divided into two specific areas for MorphoBank: 1) how to preserve data access and 2) how to permit growth of the resource. Data Access: To preserve data access, both the web application and the data it accesses must be sustained.
The history and design aesthetic of MorphoBank are oriented towards making this as economical as possible: we have consistently aimed for useful but very low cost functionality. While the work proposed here would bring about significant improvements in functionality, the fundamental architecture of the web application will remain simple and easy to sustain. PI O’Leary is committed to sustaining the basic functionalities of the MorphoBank web application, as it has become a tool of substantial importance in her own work. The creation and persistence of the tool over the past 5 years with minimal funding testifies to the credibility of this claim. In addition to maintaining the web application, data access over the long term means navigating migrations of the application to new servers, upgrades in OS and RDBMS, and protecting against loss of data through storage media events. We are currently taking advantage of the SDSC “Data Central” resource as a key element of the sustainability strategy. This program provides allocations (awarded competitively, on an annual basis) for on-line disc and database resources to serve large data collections through MySQL, DB2, and Oracle RDBMS. We have already received a development allocation from this program to begin our work. The allocation will give us the opportunity to create a stable mirror of MorphoBank on production hardware at SDSC, which will be backed up on a nightly basis. Under the allocation award, overhead costs of equipment, maintenance, and server upgrades will be provided by SDSC. The staff at SDSC will also provide 24/7 attention to application uptime, and provide advice and guidance in the event that migration to DB2 or Oracle is required. SDSC staff will keep the OS and RDBMS current as part of the allocation process, and will notify the user community of upcoming migrations.
We will continue to submit annual allocation requests under this program as long as it remains a viable method for attaining access to disc and database resources, as well as the accompanying support and expertise. Expert advice from SDSC will include timelines for upgrades, and known issues with upcoming changes. This contact will keep the cost of migrations to a minimum. In the event that the allocation procedure becomes inaccessible, we will migrate to a distributed data storage model if our resources are not sufficient to meet demand. A strategy using the SDSC Storage Resource Broker to accomplish this is described in the Resource Growth section below. Resource Growth: To maintain relevance in a rapidly expanding domain, it must be possible to 1) upgrade existing analytical tools and add new ones, and 2) expand the storage capacity of the resource. The resources requested for this proposal are modest; however, they are sufficient to produce a well-documented web application with published APIs. This will allow the developer community to contribute new tools to MorphoBank through the proposed funding period and beyond. This design pattern is currently being used successfully in sustaining software developed for other projects, notably the CIPRES [www.phylo.org] and GEON [http://www.geongrid.org/] projects. The current trajectory for MorphoBank is well suited to the model of open contributions from the community. Development is ongoing in PHP, a stable, mature language that is well suited for managing visual images and for rapid development but does not incur the enormous overhead imposed by enterprise Java systems like Java Struts [http://struts.apache.org/] and EJB [http://java.sun.com/products/ejb/]. The MorphoBank architecture is designed to incorporate loosely coupled web services, which will allow other developers to expose their tools within MorphoBank (see point 2 above).
In this model new services will not be constrained to PHP, but can be created in the language of the developer’s choice and exposed as Web Services. The collaboration of the MorphoBank group with the CIPRES project (see letter: Miller) will serve as a test case for deploying loosely coupled services within the MorphoBank application. The scalability of MorphoBank storage resources is another key issue in sustaining the program. We do not anticipate outgrowing the resources available through the SDSC Data Allocation protocol during the proposed funding period. Our current RDB use is less than 50 MB, and the use of file space for images is well below 1 TB. However, it is quite possible that much larger storage resources will be required going forward, if MorphoBank becomes highly subscribed. The solution to this issue is to maintain the RDB (which will be of modest size) in a single location, and to adopt a distributed storage resource solution to handle the image files (which account for most of the storage required by MorphoBank). The distributed model will allow the storage costs to be borne by individual projects without imposing the difficult challenge of providing a fee-for-service recharge system. The Storage Resource Broker (SRB; www.sdsc.edu/srb) is an ideal solution for this part of the sustainability problem. The SRB is a mature software package under active development. It provides its users with the ability to transparently share data across distributed heterogeneous data resources. It accomplishes this through a single user sign-on and a single logical file hierarchy for all remote resources. The SRB is an attractive solution because it would allow each MorphoBank user to store images on their own resource (or resources), and still access them through MorphoBank. The SRB can be used to connect metadata stored in the central RDB to the remote image files in a transparent way (i.e.
the user will not be aware they are leaving the MorphoBank domain to fetch the data). Thus, projects that have huge storage and archiving budgets can share their images within the MorphoBank Web Application even if their images cannot be housed within the MorphoBank storage area. In the event that all MorphoBank storage is filled, this model will still allow the resource to scale to new users, although it admittedly also returns responsibility for back-up and uptime to the data providers. IV. Targeted collaborations: Exemplars for General Problems Systematists are collecting some of the largest datasets of images of morphology as they build increasingly large phenotype-based matrices to examine different aspects of the Tree of Life. Examples include several of the AToL projects currently underway: Squamates (XX images), Beetles (XX images), Spiders (XX images) along with other projects on dinosaurs (XX images), mammals (XX images), fungi (XX images). Below we describe specific projects and collaborators that we have identified as examples that fit the research problems identified above. We have targeted collaborations with particular phylogenetics research programs on different clades. This accomplishes two goals: (1) solution of each particular problem provides the community with a general solution, and (2) this allows us to clearly specify minimum deliverable data entry into MorphoBank.org by the time the proposed project is completed. Of course, investigators outside this select group will be encouraged to use MorphoBank as well. Achieving Mesquite-MorphoBank interoperability. Project: Spider AToL (letters: Ramirez, Wheeler, Wayne?). The spider AToL project will have amassed over xx images for xx taxa by completion of the morphological data collection (xx% already collected). These data are currently organized in the program Mesquite in a new version built for the project that is scheduled for public release. 
We propose to develop a query language that will enable individuals using Mesquite to upload a matrix with images to MorphoBank.org and to download one from it.
Offline editor. Most investigators are currently working on the desktop only, often using Mesquite. Mesquite has some image viewing capabilities, but these are limited compared to those in MorphoBank: they offer no zoom and pan features and no labels, and images cannot be viewed simultaneously by a team. This is a software development project that would have to be done in collaboration with Wayne, because no one except him can figure out how to add modules to Mesquite. It would be a job the CIPRES team could handle, but not without funding.
Linking MorphoBank to online taxonomic authorities (ITIS, UBIO, PBDB). Projects: Squamate ATOL, Spider ATOL, Mammal AToL project (under review currently). This activity will provide immediate connection between information housed in taxonomic web lexicons (e.g., common name) and data amassed by scientists for systematics projects (see item 3 above). We may also include a direct link to IUCN. UBio is seeking insect collaborations. Taxon searches will provide results from all providers, indicate any matches, and map the matches with existing tools. In addition, the global tree of life from the Tree of Life Web site can be downloaded and served along with taxonomic names (http://tolweb.org/tree/home.pages/downloadtree.html).
Retrieving digital library data in a matrix cell. Projects: Squamate AToL project (letter: Kearney).
Making MorphoBank a back end for the Beetle ATOL project. Need to speak to David Maddison about this - not sure he is on board.
Integration of molecular workspaces with morphological workspaces. Exemplar: Fungus AToL project (letters: Hibbet). The fungal AToL project has developed the WASABI collaborative workspace for handling molecular sequences.
The WASABI web software is considered one of the most developed of the AToL projects, but does not currently handle morphology. Since the fungal AToL project plans to include morphological data, we plan to have MorphoBank.org house the fungal AToL project’s morphological data, to collaborate on shared software functionalities, and to achieve good integration of morphological character and sequence data.
Dynamic matrix data linkage for journals. Exemplar: Mammal AToL project (pending) and future submissions to Palaeontologia Electronica. David Polly and Palaeontologia Electronica - dynamic matrix support linked through morphological phylogenies there.
Linking to specific images in online pdf files. Exemplar: AMNH online publications series (Novitates, Bulletin, a full public resource) linked to O’Leary’s whale matrix, Mammal AToL matrices, and the Spider ATOL project. Need to be able to link from cells to exact figures available in online pdfs of publications. It should be possible to mount a search from within a cell, label that image within MorphoBank, and save and display it as a cell thumbnail.
Dynamic link to Paleobiology Database (PBDB). Exemplars: Student Claeson: fossil rays; Curry Rogers and Wilson: Sauropod dinosaurs. Specimens and taxa (two tables in MorphoBank.org), particularly fossils, have associated data that are not currently housed in MorphoBank.org. These include tables for temporal and geographic occurrences. When someone deposits a matrix in MorphoBank.org, we would like that to create the option - through a link, not through a table in MorphoBank.org - to specify PBDB data such as time and place. This would then result in a link between the two databases such that a search in the PBDB would point investigators to current work in MorphoBank.org and a search in MorphoBank would tell a user if temporal and spatial data are also available for the taxa in the PBDB.
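The two-way lookup described for the PBDB link can be sketched as follows, in Python and purely for illustration. The dictionaries below are hypothetical in-memory stand-ins for the two databases, holding fictitious entries; the normalize and cross_search helpers are likewise assumptions, and no real MorphoBank or PBDB interface is implied. The point of the sketch is that neither side copies the other's tables: a search on a shared key (here, the taxonomic name) simply reports what each database holds.

```python
from typing import Dict, List


def normalize(name: str) -> str:
    """Collapse case and whitespace so both databases share one lookup key."""
    return " ".join(name.lower().split())


def cross_search(taxon_name: str,
                 morphobank: Dict[str, List[str]],
                 pbdb: Dict[str, List[dict]]) -> dict:
    """Report what each database holds for a taxon, without copying any data."""
    key = normalize(taxon_name)
    matrices = morphobank.get(key, [])
    occurrences = pbdb.get(key, [])
    return {
        "taxon": key,
        "morphobank_matrices": matrices,
        "pbdb_occurrences": occurrences,
        # True only when a link between the two records could be offered.
        "linked": bool(matrices) and bool(occurrences),
    }


# Fictitious stand-ins: matrix titles on one side, time/place records on the other.
morphobank_index = {
    "examplus demo": ["Hypothetical matrix A"],
    "examplus solo": ["Hypothetical matrix B"],
}
pbdb_index = {
    "examplus demo": [{"interval": "placeholder interval",
                       "locality": "placeholder locality"}],
}

hit = cross_search("  Examplus   DEMO ", morphobank_index, pbdb_index)
miss = cross_search("Examplus solo", morphobank_index, pbdb_index)
print(hit["linked"], miss["linked"])
```

The same function serves a search started from either site, since the result lists the holdings of both; specimen number could be added as a second shared key in the same way.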
This would be built first around the stated exemplar cases in which the investigator actually enters data into both databases herself and then expanded to a dynamic search of data potentially entered by different researchers. Searches in each database of the common tables (taxonomic name, specimen number) would return information on both databases. Also, it is important to enter specimen data and repository information according to standards established by PBDB, Paleoportal, and GBIF.
Hosting a morphological matrix with characters and character states that are ontology-based. Exemplar: Zebrafish ATOL -check (Letter: Mabee) Spider AToL (letter: Ramirez/Wheeler). A perusal of major systematics journals (e.g., Systematic Biology, Cladistics, Journal of Vertebrate Paleontology) indicates that practicing morphological systematists are not currently using ontologies in defining characters and character states. Certain AToL projects, such as the two mentioned here, are creating ontologies from their characters and/or by working closely with existing ontologies that have been developed for model organisms. Currently MorphoBank.org does not contain a mechanism to search and store information in accordance with a known ontology. These exemplar projects will give us the opportunity to develop tools to allow ontology-based naming and searching of character states (this will be an important functionality for other projects as well). We will adapt existing tools to read and interpret ontologies in OWL (and OBOL if necessary) format in the context of the MorphoBank application.
Addition of new features, Standardization of repository names, specimen numbers, and bibliographic references with other databases. Exemplar problem: all projects listed. As online libraries become the norm rather than the exception, it is important that our growing database follow standards that they have developed.
V. Broader Impact
Outreach links (I don’t understand this section, but will be happy to add if you can explain) These I think are more services to the broader biological community than they are core activities in systematics. Please feel free to add things here. I will also develop this more.
1. Link to Animal Diversity Web.
2. Link to Tree of Life project? Maddison
3. Improved Google searches?
4. Links to common names - Fishbase has this. SDSC
5. Link to IUCN Red List. Search a taxon on MorphoBank, pulls up links to Red List too? Search Red List and it tells you if there are current studies on MorphoBank. SDSC
Promotion of Teaching, Training and Learning. In this project, undergraduate students will be full participants in the biological and computer science research. O’Leary has budgeted research stipends for students to work in her lab consistently during this project. Specific deliverables for the students involved include: co-authorship of talks and papers, opportunities for public speaking, student prizes on campus and nationally, and inclusion in national scientific meetings. The undergraduate students will be offered opportunities to travel to national meetings with O’Leary, and Stony Brook University will seek funds for the students’ travel. The SDSC group will apply for Research Experiences for Undergraduates (REU) funds from the allocation at the SDSC, and students will be recruited from the UCSD Academic Enrichment Program’s Faculty Mentorship Program (FMP) and the California Alliance for Minority Participation in Science, Engineering and Mathematics Program (CAMP) (http://aep.ucsd.edu/default2.htm).
Broadening the Participation of Underrepresented Minorities. At present underrepresented minorities are less than full participants in the professional scientific fields of systematics, evolutionary biology and vertebrate paleontology as well as computer science.
Stony Brook University is ideally positioned to make a difference in minority participation in these sciences for three reasons: (1) it is a state school with a large minority undergraduate population, (2) it has strong Department of Anatomical Sciences research programs in these areas, and (3) it has an active network of federal and state programs that serve underrepresented minority students in STEM (science, technology, engineering and mathematics) majors from secondary school to the professoriate. The Carnegie Foundation has recognized Stony Brook (student population ~ 21,000) as a “Type I Research” university, the foundation’s highest classification and a reflection of the university’s diversity of research and its large number of doctoral candidates. In this project we propose to link our research to specific campus programs targeted at bringing minorities into the sciences. They are: the NSF SUNY Alliance for Graduate Education and the Professoriate (AGEP), the SUNY Louis Stokes Alliance for Minority Participation (LSAMP), the New York State Science and Technology Entry Program (STEP), and the Collegiate Science and Technology Entry Program (C-STEP). As noted above, O’Leary has already independently mentored minority high school students in science research (O’Leary Biographical Sketch), and one of these students has won international awards for his science projects. O’Leary and Ferguson have recently started a new collaboration between the STEP programs and the research of the Department of Anatomical Sciences. A central part of the minority outreach planned is targeting undergraduate research to minority populations through SUNY LSAMP and C-STEP.
In addition to the outreach efforts at Stony Brook, the SDSC group will participate in outreach activities by recruiting undergraduates in underrepresented groups through the Faculty Mentorship Program (FMP) and the California Alliance for Minority Participation in Science, Engineering and Mathematics Program (CAMP) (http://aep.ucsd.edu/default2.htm). These students will be recruited to experiment with the web application, to report usability issues, to experiment with creating matrices in the web application, and to collaborate with the Mammalian Diversity Web to create educational demonstrations for undergraduate students.
Enhancing Infrastructure for Research and Education. The proposed research is highly collaborative both in the development that is proposed, and in the use of the resulting tools. We have reported in the text, and in the appendices, letters of commitment from XX projects distributed across at least YY sites. The collaborations cross boundaries of domain, from taxonomy to biodiversity to paleontology to systematics to computer science and software engineering. The project will bring together groups that have developed independently, and help to bring them into a unique community of systematics informatics. The product of the work will provide a new infrastructure in a production environment, thus contributing to the ability of morphological researchers to conduct their work.
Dissemination of results and enhancing understanding. We will help to build a web presence for undergraduate and late high school students that will be disseminated through our relationships with the Animal Diversity Web and the AMNH (others).
Benefits to Society. I can’t think of anything for this area.