MorphoBank: An online workspace for morphological systematics

TRANSFORMING MORPHOLOGICAL SYSTEMATICS FROM DESKTOP TO WEB
APPLICATIONS: DEVELOPMENT OF THE ONLINE WORKSPACE MorphoBank.org
Project Description
Papers cited in “Results from prior NSF support” marked with * in References Cited.
I. Results from Prior NSF Support
PI-O’Leary (DEB-9985847, $189,998 [including supplements], 1997-2003):
“Collaborative: Cetacean phylogeny: A Reconciliation of Fossil and Neontological Data and the
Importance of Taxonomic Sampling.” Comprehensive analysis of morphology for the phylogeny
of whales; one of the largest morphological phylogenetic analyses produced for mammals (> 600
characters). 5 papers published, 2 in review (including a monograph with 121 full-page original
illustrations [Fig. 1] and 341 ms pages), 4 abstracts; construction of MorphoBank (online
database/web application for morphological systematics), www.morphobank.org. Broader
Impact.- MorphoBank reviewed in articles in Science*, Bioscience* and Trends in Ecology and
Evolution*; Training: 1 female postdoc, 2 female undergraduates, 1 minority undergraduate, 1
high school student who became first African American to win First Place in Zoology,
Intel Science Fair; 1 mentorship award to PI. PI conducted: 1 international meeting; 10 invited
scientific talks, 2 public lectures, and spoke at 11 invited scientific workshops. Collaborated with
SUNY-C-STEP program to train minority undergraduates.
PI-O’Leary (Co-PI, D. Krause, EAR-0116517, $227,934, 2001-2005): “Acquisition of
Instruments and Technical Support for an Interdepartmental Fossil Preparation Laboratory.”
Constructed fossil preparation laboratory with two technicians serving ten paleobiologists at
Stony Brook University, and ten other U.S. and international collaborators. For the grant as a
whole: 12 papers; 23 abstracts; two M.Sc. theses; one website. Broader Impact.-Training: Lab
used by students, specifically 2 graduate, 4 undergraduate; development of new paleontology
course for Women in Science and Engineering program; fossil casts distributed to international
repositories.
PI-O’Leary (Doctoral Dissertation Improvement grant for R. V. Hill, DEB-0206533,
$8,723, 2002-2004): “Comparative Anatomy and Evolution of Osteoderms in the Amniote
Integument.” Conflicting hypotheses of turtle relationships evaluated using the largest
morphological data set compiled for Amniota. 3 papers and 3 abstracts. Broader
Impact.-Collections: data deposited on MorphoBank.org.
Co-PI Ferguson: “SUNY Louis Stokes Alliance for Minority Participation” (HRD 9623931,
11/96 - 10/01 and 0114756, 11/01 - 10/06): Training: education program increased by 157%
underrepresented minority enrollment in science and engineering undergraduate majors and by
63% science bachelor's degrees at Stony Brook.
Co-PI Baru (put Chaitan information in here)
Co-PI Lin is a beginning investigator.
II. Morphological (Phenotype-Based) Phylogenetics: Desktop to Web
Applications
How did the phenotype evolve across the Tree of Life over the past 4.5 billion years of
Earth history? What is the range of shapes in teeth that have evolved in parallel? What was the
body shape of an extinct species known only from a skull, and how can we predict this from a
phylogenetic tree based on morphology? These are all questions that can be addressed
using the contemporary methods of morphological (phenotype-based) systematics. To
address these questions, phylogeneticists (systematists) are increasingly working collaboratively
(e.g., NSF - Assembling the Tree of Life [ATOL] Program) and need to access and integrate as
much data as possible for their work. Advances in developmental biology also rely on
documentation of a wide variety of phenotypic data and an understanding of how variation in
shape evolved. There is a particular need to share labeled images, because morphology is best
communicated in labeled images rather than in text descriptions alone. The need for faster and
simpler image sharing technologies in phylogenetics and developmental biology means there is
a growing need for the development of open source web applications. Such applications are
fundamental to enabling collaboration and data sharing, and to facilitating the growth of
morphological data collection, which has lagged behind collection of molecular sequences for
phylogenetics.
The Importance of Phylogenetics (Systematics)
Phylogenetics is reconstructing the interrelationships among all (fossil and living)
species, building the Tree of Life, and answering such questions as who is more closely related
to whom. In 2005, Science magazine identified the creation of a consensus Tree of Life as
one of the 125 most important questions facing science (Seife, 2005). Research (including
the NSF ATOL workshops report; Cracraft et al. [2004]) has repeatedly emphasized the
importance of using raw data from the phenotype (morphology) in addition to molecular
sequences to tackle this large scale scientific problem.
To reconstruct a phylogenetic tree we use algorithms that employ optimality criteria (e.g.,
parsimony, maximum likelihood) to explain data assembled into matrices. These matrices
contain comparative data about species: data that have been coded as 0, 1, 2, etc. to describe
characters and character states. A character might be “Wing: present (0); absent (1)”, or simply
the presence or absence of a given nucleotide. These are called “homology statements”
because they are hypotheses that a similarity in a group of organisms is present because it was
inherited from the common ancestor of those organisms. Depending on the problem being
studied, a matrix might contain molecular data or phenotypic data, or both. Phenotypic data may
include anatomical data, behavioral data, or physiological data, to give a few examples.
Molecular data lend themselves most readily to codification in a matrix as 0s, 1s, etc. because
they are already discrete entities (e.g., A, C, G, T), whereas morphological data can be harder to
codify into separate states. Theoretical work argues, however, that morphological data are
very important for phylogenetics (e.g., Wiens, 2004; Smith and Turner, 2005); thus it is
important to systematics as a whole to enable phenotypic data collection that is efficient
and repeatable.
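As an illustration only, the coding scheme described above can be sketched in a few lines of Python. The taxon names, the second character and the helper `to_nexus` are hypothetical; the output follows the general layout of a Nexus DATA block and is not MorphoBank's own export code.

```python
# Hypothetical characters: each has a name and an ordered list of states.
characters = [
    ("Wing", ["present", "absent"]),        # state 0 = present, 1 = absent
    ("Tail", ["long", "short", "absent"]),  # illustrative only
]

# Hypothetical matrix: taxon -> one state code per character; None = not scored.
matrix = {
    "Taxon_A": [0, 2],
    "Taxon_B": [1, None],
}


def to_nexus(characters, matrix):
    """Serialize a small discrete-character matrix as a Nexus DATA block.

    Assumes at most ten states per character, so each code is one digit.
    """
    n_states = max(len(states) for _, states in characters)
    symbols = "".join(str(i) for i in range(n_states))
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={len(characters)};",
        f'  FORMAT DATATYPE=STANDARD MISSING=? SYMBOLS="{symbols}";',
        "  MATRIX",
    ]
    for taxon, codes in matrix.items():
        if len(codes) != len(characters):
            raise ValueError(f"{taxon}: expected {len(characters)} codes")
        row = "".join("?" if code is None else str(code) for code in codes)
        lines.append(f"  {taxon} {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines)


nexus_text = to_nexus(characters, matrix)
print(nexus_text)
```

Each row of the matrix is one taxon, each column one character, and each cell one coded homology statement; an unscored cell is written as "?".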
The number of digital libraries of phenotypic data has grown extensively over the last ten
years. These digital libraries provide virtual collections that systematists can draw on to populate
matrices with images over the web. These digital libraries are organized around the following
features: (1) types of media (e.g., Digimorph for CT scans), (2) museum collections (e.g., new
Scripps-UCSD project, Paleoportal accessible museum collections), (3) groups of species (e.g.,
Flybase, Fishbase, PEET projects, ATOL projects, model organisms), (4) published texts (e.g.,
JSTOR, AMNH series), (5) spatial and temporal data for species (e.g., Paleobiology Database,
others?), (6) developmental biology (e.g., NCBI? examples).
These are important fundamental resources for raw data for systematists working with
the phenotype of living and fossil organisms. The growth of digital media and the importance of
these digital libraries underscore a broad change in anatomy-based research that has used the
web to expand access to specimens and text information. These new resources have helped
transform the contemporary concept of monography.
As noted by Dettai et al. (2004) the growth of matrices is limited by a lack of tools for
morphological systematics. Although morphological work is enhanced by the immediate
availability of images (media) that document characters and character states, desktop programs
for phylogenetic systematics (e.g., Phylip, Mesquite) have relatively limited capabilities for
manipulating and displaying images. MorphoBank 2.0 was created to provide the
cyberinfrastructure to link these digital images directly to phylogenetic matrices.
Morphological Systematics as Team-Based Research
Through the AToL program, NSF has placed new emphasis on the importance of
collaborative, team-based phylogenetics research. As evidence of this, morphological matrices
have been growing in numbers of both taxa and characters (O’Leary and Gatesy, in review, >
600 characters; another example). Sharing of images among collaborative team members on
phylogenetics projects can greatly clarify concepts of homology among team members.
Web applications that promote remote collaborations are appearing in many forms, and
these are transforming the nature of data sharing in both scientific and nonscientific work
communities. Some examples of generic collaborative tools available widely include Google
Spreadsheets [spreadsheets.google.com], Wikipedia [http://en.wikipedia.org/wiki/Main_Page]
and a number of collaborative wikis [http://www.pcmag.com/article2/0,4149,1402872,00.asp],
which allow text and document sharing. More sophisticated scientific data sharing areas built for
specific projects such as the GEON portal [www.geonweb.org] permit joining and querying of
relational databases.
Like the users of these other web applications, morphological phylogeneticists can
benefit greatly from tools that grant access to data from anywhere in the world, particularly when
this access provides tools that are uniquely designed to meet the specific needs of working
phylogeneticists/systematists. Such an application would enable team members with internet
access to look at data in a matrix simultaneously, grant data/image access to reviewers of
unpublished data, facilitate sharing of images of homology statements without swapping folders
of images through FTP or email, and always provide the ability to download Nexus files of matrices
and images readable by widely used desktop programs (e.g., Mesquite).
What is MorphoBank.org?
MorphoBank.org was conceived, designed and deployed with the goal of providing the
kind of online, location-independent collaboration space required by large scale collaborative
projects investigating phylogeny. MorphoBank.org is an open source, platform-independent,
online database and collaborative workspace for phylogenetics. The tool in its present state was
constructed with minimal funding by a domain scientist in close collaboration with a software
engineer with the driving design metric of providing the maximum useful functionality for
collaboration among working systematists at the lowest possible cost in terms of effort,
computational resource demands, and maintenance.
The first versions of MorphoBank.org were built with seed funds from NSF to PI-O’Leary
to sponsor a workshop and a second grant from NOAA (<$35,000 direct costs).
MorphoBank.org currently contains over 2,600 images and more than fifteen active
projects-in-preparation. Investigators of XX ATOL projects have expressed interest in depositing
their data on the site. Milestones in the development of the site include the first published
paleontological matrix with images (Hill, 2005; fossil amniotes) and a soon-to-be-released
phylogenetic matrix that will have over 1,000 images documenting 631 characters for 70 taxa
(O’Leary and Gatesy, in review; fossil mammals).
MorphoBank.org differs from existing digital library initiatives
Several other initiatives that use the web to database morphology are underway (e.g.,
Digimorph, MorphBank, Miranker ATOL project, Electronic Field guides). Some of the most
prominent of these have formed a collaborative group with plans for growing interoperability in
the future. Unlike these projects, which are conceived primarily as repositories for organizing
and archiving images and metadata, the primary function of MorphoBank is to provide a web
application with tools for creating data and analyzing existing data in a collaborative
workspace. The tools implemented by MorphoBank are easily adaptable to whatever
interoperability standards are adopted by this collaborative group, and MorphoBank will allow users
of those resources to easily work with their data to produce matrices and speed their time to
publication. The partnership that we have initiated with them (Letter: Riccardi) provides an
important foundation for establishing the interoperability required by practicing systematists in
MorphoBank, while leveraging the database development efforts in the collaboration.
Current (MorphoBank v 2.0) Features
At present, MorphoBank.org allows users to conduct web-based management of
comparative anatomical data of fossil and living organisms for scientific research and education.
MorphoBank.org is structured around the concept of “Projects,” which simply means a collection
of related images and metadata in use by a scientist or team of scientists. The MorphoBank user
interface was built with feedback from phylogeneticists, and other informatics teams working on
databasing aspects of morphology have recognized its high quality for phylogenetics
(Letter: Riccardi). Features of the current implementation include:
Registration: registered users access MorphoBank via a password-protected login. Once
registered, a project owner can designate the workgroup members who are to have access to
the project.
Image Uploading: Registered users can upload and catalogue multimedia submissions (2D and
3D images, e.g., drawings, photos, and CT scans). The uploaded images can be accessed by all
group members, and project collaborators can add annotations on these media. Along with
images, MorphoBank captures and records a variety of metadata such as author of submission,
related publications, critical commentary, names of species and higher taxa, and descriptions of
characters. The user is able to add and edit these labels and properties.
Dynamic Character Matrix Creation/Editing: For phylogenetic research, MorphoBank displays
dynamic phylogenetic matrices of morphological characters with labeled character information
(homology statements). This is accomplished over the web - in real time - by autonomous
teams of researchers building and editing phylogenetic matrices with affiliated annotated image
data, or simply sharing images with annotations (Figure 3). Nexus file uploads are also used to
create matrices. Users have complete control over naming, and can add/edit taxa, characters,
states, etc.
Figure 2. Screen shot from
MorphoBank.org (2.0) showing image
upload/ media viewing area.
Searching. MorphoBank’s search engine is capable of searching all aspects of a project’s data
and returning taxonomic records, specimen data, characters, media and matrices, any of which
can be downloaded. The search engine implements Boolean operators, exclusion, wildcards,
stemming, spell correction and parenthetical grouping. The search returns images and metadata
from all projects that have been released to the MorphoBank.org public archive by investigators,
as well as any unpublished projects to which a registered user is currently contributing.
Figure 3. Screen shots of matrix editor from MorphoBank.org (2.0).
Annotations. The usefulness of images to researchers is greatly enhanced by linking text to
them, which can be done in MorphoBank. More specifically for phylogenetics, MorphoBank can
link and display an image affiliated with a taxon-character-character state intersection. In
MorphoBank, images and annotations (such as the Tables listed above) are also stored
separately and can be retrieved separately. Information about an image (author, date, size,
original format), as well as embedded descriptive and technical metadata in IPTC, EXIF, XMP
format, is captured at the time of upload.
Publication. Once a phylogeneticist (or team) is ready to make work available to the public in
open storage (e.g., coinciding with the appearance of a published paper), the images, matrices
and annotations are permanently archived and searchable by the public on MorphoBank.org.
The database is particularly helpful for comparing hypotheses of homology, which are central to
the phylogenetics of fossil or living taxa. Once a project is published, the site search engine can
make connections between like-named entities in formerly separate projects.
MorphoBank v 2.0 Architecture:
The MorphoBank.org (2.0) web application follows a standard 3-tier architecture model.
The site’s view and logic layers are implemented in PHP. The choice of PHP allowed us to
implement in a lightweight environment that is conducive to rapid development and deployment,
consistent with our minimal frills, maximal functionality aesthetic. The PHP language provides
many useful features for data access and sharing, but avoids the very large overhead required
by the enterprise web applications built on platforms such as Java Struts/EJB [ref].
General web site functionality is accessible via standard web browsers (cf. Figure 2).
Highly interactive user interfaces, such as the matrix editor (Figure 3) and image viewer/
annotation tool, however, are implemented as “Rich Internet Applications” with ActionScript
(Adobe Flash v8) and Java. This application loads images very quickly (typically < 1 sec), with
potential for even more improvement. We have used ActionScript for the development of
complex web-based user interfaces in Macromedia Flash, and JavaScript for browser-based
“rich” user interfaces. The web-based clients for the database were built using Java and Flash
(ActionScript); the latter provides very lightweight and responsive browser-based image
manipulation. MorphoBank.org uses the Apache open source web server and ImageMagick
image processing software, which can manipulate images in over 90 formats (including TIFF,
JPEG and PNG).
The database tier of MorphoBank.org consists of a relational database for metadata,
which links to stored image files. The logic model at the logic layer is mapped into the database.
The mapping is managed by tools, so the logic model can be extended without
redesigning the database schema. The relational database is implemented in MySQL, a widely
deployed open source relational database management system that uses Structured Query
Language [SQL]. The relational database stores all aspects of the hosted projects. The primary
tables in the database are: Taxonomic names; Characters (e.g., anatomical descriptive
phrases) and associated character states; Specimens (including collection information, voucher
number and taxonomic name, as well as other Darwin Core 2 compatible fields); Media (still
imagery, video, sound); Matrices (taxa and characters united by character state assertions).
MorphoBank.org allows annotations on these entities. These entities are each contained within
independent Project Workspaces. Naming collisions between projects do not occur because
each project is self-contained. As noted above, we support a number of media formats: for still
imagery, we store the original uploaded file, JPEGs in several sizes, and a TilePic-format
version equivalent in resolution to the original used for our pan-and-zoom image viewer.
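The relationships among the primary tables can be sketched as follows. This is a minimal illustration, not the MorphoBank schema: every table and column name is hypothetical, and Python's built-in SQLite module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE taxa (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL REFERENCES projects(id),
        name TEXT NOT NULL,
        UNIQUE (project_id, name)  -- names need only be unique within a project
    );
    CREATE TABLE characters (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL REFERENCES projects(id),
        name TEXT NOT NULL
    );
    CREATE TABLE character_states (
        id INTEGER PRIMARY KEY,
        character_id INTEGER NOT NULL REFERENCES characters(id),
        code INTEGER NOT NULL,
        label TEXT NOT NULL
    );
    CREATE TABLE media (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL REFERENCES projects(id),
        file_path TEXT NOT NULL
    );
    -- One row per taxon-by-character cell, optionally linked to an image.
    CREATE TABLE matrix_cells (
        taxon_id INTEGER NOT NULL REFERENCES taxa(id),
        character_id INTEGER NOT NULL REFERENCES characters(id),
        state_id INTEGER NOT NULL REFERENCES character_states(id),
        media_id INTEGER REFERENCES media(id),
        PRIMARY KEY (taxon_id, character_id)
    );
    """
)

conn.executemany("INSERT INTO projects VALUES (?, ?)",
                 [(1, "Project A"), (2, "Project B")])
# The same taxon name appears in two projects without colliding.
conn.executemany("INSERT INTO taxa VALUES (?, ?, ?)",
                 [(1, 1, "Taxon_A"), (2, 2, "Taxon_A")])
conn.execute("INSERT INTO characters VALUES (1, 1, 'Wing')")
conn.executemany("INSERT INTO character_states VALUES (?, ?, ?, ?)",
                 [(1, 1, 0, "present"), (2, 1, 1, "absent")])
conn.execute("INSERT INTO media VALUES (1, 1, 'media/wing_a.jpg')")
conn.execute("INSERT INTO matrix_cells VALUES (1, 1, 1, 1)")


def cells_for_project(project_id):
    """Return (taxon, character, state label, image path) for one project."""
    return conn.execute(
        """
        SELECT t.name, c.name, s.label, m.file_path
        FROM matrix_cells AS mc
        JOIN taxa AS t ON t.id = mc.taxon_id
        JOIN characters AS c ON c.id = mc.character_id
        JOIN character_states AS s ON s.id = mc.state_id
        LEFT JOIN media AS m ON m.id = mc.media_id
        WHERE t.project_id = ?
        ORDER BY t.name
        """,
        (project_id,),
    ).fetchall()


rows = cells_for_project(1)
print(rows)
```

Scoping every taxon, character and media record to a project is one simple way to obtain the self-contained workspaces described above; the `LEFT JOIN` keeps cells that have no image attached.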
The MorphoBank.org database incorporates structures for various domain-specific
entities as well as a configurable metadata schema
[http://www.morphobank.org/doc/schema_latest.pdf] that supports mapping to widely used
schemas like Darwin Core 2 [http://darwincore.calacademy.org] and Dublin Core
[http://dublincore.org/]. Entities such as taxonomic names, characters and matrices that are
invariant across all use cases are implemented using domain-specific entities. Other types of
data are stored as project-specific metadata with an explicit mapping to a project-defined
standard.
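The explicit mapping of project-specific metadata to a standard can be pictured with a short sketch. The project field names and the helper `export_record` are hypothetical, and the target term names are written in the style of Darwin Core and Dublin Core for illustration only; they should be checked against the published schemas before use.

```python
# Hypothetical project-defined fields mapped to (schema, term) pairs.
# Term names are illustrative and not verified against either standard.
FIELD_MAP = {
    "species_name": ("DarwinCore", "ScientificName"),
    "voucher_number": ("DarwinCore", "CatalogNumber"),
    "photographer": ("DublinCore", "creator"),
}


def export_record(record, schema):
    """Return the part of a project record that maps onto one target schema."""
    out = {}
    for field, value in record.items():
        target = FIELD_MAP.get(field)
        if target is not None and target[0] == schema:
            out[target[1]] = value
    return out


record = {
    "species_name": "Examplus primus",  # made-up taxon name
    "voucher_number": "X-0001",
    "photographer": "J. Doe",
    "lab_note": "left side",            # unmapped field stays project-local
}
darwin = export_record(record, "DarwinCore")
dublin = export_record(record, "DublinCore")
print(darwin, dublin)
```

Fields without a mapping remain available inside the project but are simply omitted from the exported record.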
Hardware. MorphoBank.org is currently run on an IBM Blade Server (HS20) with dual 3.2 GHz
processors on a RedHat Linux ES Server operating system (Letter: Eisenberg).
III. Prototype to Community Tool: MorphoBank.org v 3.0
The primary objectives of MorphoBank are 1) to create a web application that enables
the association of images with matrices, and thereby increase the repeatability, and ultimately
the efficiency, of phylogenetic work on the phenotype, and 2) to make this web application
accessible to all teams of collaborating morphological systematists who desire access. In
Section II, we reported the completion of the first objective: MorphoBank 2.0 provides a
functioning proof-of-concept prototype that allows teams of researchers to store, share,
and annotate images, and to create, upload (in Nexus format), edit, store, and download
matrices for use in tree inference software. Growth of the site (see XX letters from groups
wishing to use the site) indicates that the tools provided by MorphoBank are useful, and meet a
need within the community. However, demands on the site are beginning to exceed what PI
O’Leary can supervise as a single investigator. Moreover, with the growing user base,
MorphoBank has an increasing need for new functionalities, and for improved database
functionalities. In order to meet the needs of the growing user population, we propose to take
steps to make this infrastructure available to as many biologists as possible.
In Section III, we describe plans for the evolution of MorphoBank into an online
infrastructure resource that can be scaled to an entire community of users, and that provides
increased connectivity to resources that exist elsewhere in the community. To assemble the skill
sets required for this proposal, we propose to expand the MorphoBank team to include Dr.
Chaitan Baru, Head of Data and Knowledge Systems at SDSC, and Dr. Kai Lin, Staff Scientist
currently working on the data integration team of the GEON Portal [www.geonportal.org]. This
group will provide the capabilities and experience to ensure that the database effort continues in
a way that will serve the needs of the anticipated growth in user base, and to assist with
integration of remote metadata and data resources. The SDSC team will provide an effective
complement to the domain science and development experience contributed by MorphoBank’s
creators at Stony Brook. Specific deliverables for the requested funding period are:
• Increased capacity for a large user community
• Improved interoperability with databases, portals and grids
• Integration with developing online taxonomic authorities
• Linking to public access online literature (PDFs): mining data archives for images
for phylogenetic research (SDSC)
• Integration with other online community software/services
• Hosting multispecies ontologies of phylogenetic characters
• Development of sophisticated drawing, editing and measurement tools
• Enhanced documentation of the site
• User feedback based modifications to the site, optimization of web applications
• Formation of a committee responsible for the site
• Provision for the sustainability of the site at the close of funding
Figure 4 provides a cartoon of the proposed expansion of MorphoBank, showing the existing
technologies, those that will be added during the current funding, and those that will be provided
by existing community projects, as accessed through the MorphoBank web application.
Our key metric of success for the project is to bring several large morphological matrices
online by the end of three years: specific exemplar projects from across the Tree of Life that
present different computer programming challenges. These exemplar projects (see Section IV
below) will help us develop MorphoBank into a tool that can transform the way morphological
data are collected for systematics by making data deposit for morphology more uniform. Our
strategy for achieving these goals is detailed below.
Figure 4. MorphoBank 2.0 (mustard), and plans for MorphoBank 3.0 (yellow) in relation to other
community projects.
Specific Project Objectives
1. Increased Capacity for a Large User Community. MorphoBank 2.0 is served from a small
cluster at Stony Brook. This has been adequate for the prototyping phase, but more computing
power will be required as we expand the user base. We have already established a development
account at SDSC data central, which allows us to serve a MorphoBank mirror site on the
SDSC production hardware. This hardware includes online disk, MySQL, DB2 and Oracle space, and
an IBM p690 as database server, and is served through the SDSC WebFarm, which will provide
scalable access to the site via the internet. The business logic at the server will be implemented
mainly as web services, which provide independent functions to the web tier. At the same time,
these functions can be easily used and integrated by other applications to serve a bigger community.
The presence of the mirror will provide both increased capacity and increased reliability of all
services.
2. Improved interoperability with databases, portals and grids. In adding to the functionality
of MorphoBank, we plan to increase the palette of data and tools available to investigators. For
data access and retrieval, this can be accomplished in a simple way through link-outs from and
link-ins to MorphoBank. Some specific examples where this would be helpful are: 1) investigators
researching a systematics problem in MorphoBank.org often have temporal or spatial data for
the taxa under consideration and these are relevant to the science under investigation. 2) A user
has deposited images in another site (e.g., Digimorph, MorphBank) and may wish to affiliate
these directly with a cell in a matrix on MorphoBank.org (Fig. 3). The use of linkouts would permit
the user to access the images in the other repositories, and upload them into their project area
in MorphoBank. Alternatively, the user may simply establish a URI link to that image, and as
long as the other resource is publicly available and functioning (and not password protected), the
image will appear in their matrix. 3) A user searching in Paleoportal may wish to be alerted by a
search that there is a matrix of fossils available in MorphoBank; this can be achieved most
simply by a linkout to MorphoBank. MorphoBank.org has already established a collaboration
with NCBI such that MorphoBank.org data can be retrieved in searches of NCBI (Letter: Scott
Federhen). MorphoBank can accept images and image URIs, i.e., references to
images on other sites can be saved in MorphoBank. At the same time, all resources in
MorphoBank are exported as URIs for linking from other sites. Therefore we can provide
Paleoportal with linkout information as well, so that the user will be made aware of the
opportunity to examine images in MorphoBank. Each of these is an example of the importance
of interoperability among sites and will allow us to remain committed to providing a unique tool
kit for users, but to use interoperability to avoid duplicating the functions of other web services.
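The two ways of attaching an image to a matrix cell, and the export of MorphoBank resources as URIs, can be sketched as below. The class `CellMedia`, the function `linkout`, the URL pattern and the example.org addresses are all hypothetical and are used only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class CellMedia:
    """Image attached to one taxon-by-character cell (hypothetical structure)."""

    taxon: str
    character: str
    local_path: Optional[str] = None    # image uploaded into the project area
    external_uri: Optional[str] = None  # image left in another repository

    def __post_init__(self):
        # Exactly one of the two sources must be given.
        if (self.local_path is None) == (self.external_uri is None):
            raise ValueError("give exactly one of local_path or external_uri")
        if self.external_uri is not None:
            parts = urlparse(self.external_uri)
            if parts.scheme not in ("http", "https") or not parts.netloc:
                raise ValueError(f"not an absolute web URI: {self.external_uri}")


def linkout(base_url: str, project_id: int, media_id: int) -> str:
    """Build a URI that another site could link to (made-up URL pattern)."""
    return f"{base_url.rstrip('/')}/project/{project_id}/media/{media_id}"


uploaded = CellMedia("Taxon_A", "Wing", local_path="media/wing_a.jpg")
linked = CellMedia("Taxon_A", "Tail",
                   external_uri="https://example.org/images/tail_a.jpg")
media_uri = linkout("https://example.org/", 7, 42)
print(media_uri)
```

A linked image depends on the other repository staying publicly reachable, which is why the text above notes that the image appears only while that resource is available and not password protected.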
In addition to data services, MorphoBank will provide users with software tools that are
not created or maintained by MorphoBank, but which are available to the community as
published Web Services. As an exemplar of this capability, we will work with the CIPRES team
at SDSC to incorporate their tree inference software package as a web service. This is a natural
partnership between MorphoBank, which has the ability to upload and download matrices as Nexus
files, and CIPRES, whose software consumes Nexus files, infers trees by Parsimony,
Maximum Likelihood, and Bayesian techniques, exports the trees in Nexus format, and provides
tree viewing tools. The CIPRES team has committed to co-develop this capability with
MorphoBank (Letter: Miller) as a test of the ability of the CIPRES software to be ported to new
applications. An added advantage of this collaboration is that as the community develops a new
XML format to replace Nexus, we can partner with the CIPRES team to incorporate this new
exchange language into our tools.
3. Integration with developing online taxonomic authorities. A growing number of online
taxonomic authorities (extant: UBIO [http://www.ubio.org/] and ITIS [www.itis.usda.gov/], extinct:
PBDB [http://paleodb.org/cgi-bin/bridge.pl]) serve as important reference systems for consistent
use of taxonomic names. Taxonomic names are being regularly entered into MorphoBank.org
for fossil and living organisms by experts on various groups. Our user-focused approach has
taught us that within individual projects, scientists require complete flexibility in constructing their
taxa, and for this reason, MorphoBank does not impose a controlled vocabulary upon the
individual scientists. However, when the work is completed, it is important to be able to map the
results of an experiment back to the world of existing naming authorities. Thus, interoperability
between MorphoBank.org and taxonomic authorities must be preserved to aid
the development and entry of accurate information in all sites, and the availability of accurate
information to the public. This standardization also applies to institution codes and eventually to
LSIDs.
Tools to assemble naming authorities under a single web application have already been
described [Page, RDM (2005). A Taxonomic Search Engine: Federating taxonomic databases
using web services, BMC Bioinformatics 2005, 6:48 doi:10.1186/1471-2105-6-48], as have tools
for mapping trees onto each other [cf. Critical Points for Interactive Schema Matching, with
Guilian Wang, Young-Kwang Nam, and Kai Lin. Technical Report CS2004-0779, UCSD
Department of Computer Science, 31 January 2004; Patrick Ziegler, Christoph Kiefer, Christoph
Sturm, Klaus R. Dittrich, and Abraham Bernstein: Generic Similarity Detection in Ontologies with
the SOQA-SimPack Toolkit (Demo Paper). To appear in: 2006 ACM SIGMOD International
Conference on Management of Data (SIGMOD 2006), Chicago, USA, June 26-29]. MorphoBank
3.0 will use AJAX technology to seamlessly integrate these online resources into MorphoBank.
More concretely, when users enter the genus and species, the information is sent to the
MorphoBank server without reloading the page; the matching taxa are fetched on the fly from
these online resources by MorphoBank and sent back to the browser to
automatically populate the other fields of the web form. In this way source information is
captured directly from the naming authorities (specifically UBio [http://www.ubio.org/], ITIS
[www.itis.usda.gov/], and PBDB initially; others, such as Index Fungorum
[http://www.indexfungorum.org/Names/Names.asp] and the Animal Diversity Web
[http://animaldiversity.ummz.umich.edu/site/index.html], will be added as required). The project
area for MorphoBank 3.0 will provide users with the option of populating their taxa descriptor
fields with information from these services, and with the ability to modify it if the published name
servers do not provide sufficient granularity. The important benefit of this service is that each
project creator will be able to see instantly where their work differs from existing
authorities, and to make corrections in a very precise manner.
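The server side of this lookup can be sketched as follows. The sketch makes no calls to uBio, ITIS or PBDB: the `AUTHORITIES` dictionary is a stub with made-up records, and the function names are hypothetical. It shows only the shape of the exchange, in which the server gathers records for a name, returns them as JSON for the browser to fill into the form, and reports where a project's own entries differ.

```python
import json

# Stub records standing in for remote naming authorities (entirely made up).
AUTHORITIES = {
    "authority_one": {
        "Examplus primus": {"family": "Examplidae", "author": "Doe, 1900"},
    },
    "authority_two": {
        "Examplus primus": {"family": "Exemplaridae", "author": "Doe, 1900"},
    },
}


def lookup(genus, species):
    """Collect the records every stub authority holds for a binomial name."""
    name = f"{genus.strip().capitalize()} {species.strip().lower()}"
    hits = {src: recs[name] for src, recs in AUTHORITIES.items() if name in recs}
    return {"name": name, "authorities": hits}


def differences(project_fields, lookup_result):
    """List (authority, field, project value, authority value) mismatches."""
    diffs = []
    for src, record in sorted(lookup_result["authorities"].items()):
        for field, value in sorted(record.items()):
            if field in project_fields and project_fields[field] != value:
                diffs.append((src, field, project_fields[field], value))
    return diffs


def handle_request(genus, species):
    """Return the JSON body an asynchronous browser request would receive."""
    return json.dumps(lookup(genus, species), sort_keys=True)


result = lookup("examplus", "PRIMUS")
mismatches = differences({"family": "Examplidae"}, result)
print(handle_request("examplus", "PRIMUS"))
print(mismatches)
```

Because the project's own values are kept and only compared against the authorities, the scientist retains full control over naming inside the project while still seeing each point of disagreement.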
4. Linking to public access online literature. Mining data archives for images for
phylogenetic research. Linking images to homology statements that can then be shared among
colleagues is one of the most powerful tools introduced by MorphoBank.org (Figure 2). In many
cases, a relevant image may already be published in a natural history journal. Unless
appropriate permission is granted, MorphoBank.org asks authors not to link images that are
copyrighted to public access matrices. However, as journals become open-access online either
because they have made the decision to do so (e.g., American Museum publications: Bulletin,
Novitates) or because they have always been fully electronic and open access (e.g.,
Palaeontologia Electronica; currently at least 198 biology, and 43 geology journals),
MorphoBank.org can offer authors the opportunity to affiliate images from those publications with
a cell.
The addition of this capability to MorphoBank can be accomplished in phases. In the first
stage, search tools can be created to identify relevant articles. Links to searches at
NCBI/PubMed can be implemented easily, with click-through paths provided to the articles and
images required by investigators. Further click-through can produce the relevant article as
permitted by the journal’s policy and the user’s institutional licensing agreements. If requested by
users, it will also be possible to populate MorphoBank fields such as view and specimen number, to
provide literature references according to NLM standards for literature citations, and to associate
the linkout to the journal article or NLM abstract.
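A first-stage search link of this kind could be built as in the sketch below. It assumes the public PubMed web search address and a single `term` query parameter; the helper name and the example search terms are hypothetical.

```python
from urllib.parse import urlencode

# Assumed address of the public PubMed search page.
PUBMED_SEARCH = "https://pubmed.ncbi.nlm.nih.gov/"


def pubmed_search_link(taxon: str, structure: str = "") -> str:
    """Build a click-through search link from a taxon name and an optional anatomical term."""
    terms = " ".join(part for part in (taxon, structure) if part)
    # urlencode escapes spaces and punctuation so the link is safe to embed in a page.
    return PUBMED_SEARCH + "?" + urlencode({"term": terms})


print(pubmed_search_link("Basilosaurus", "skull"))
```

Because the link is generated from values already stored with a matrix cell (taxon, character), no additional data entry would be required from the investigator.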
According to user input/requirements, we can also investigate more information
extraction tools to search for appropriate literature images. Tools for extracting taxon names
[Koning, D., Sarkar, I. N., and Moritz, T. (2005) “TAXONGRAB: Extracting Taxonomic
Names From Text” Biodiversity Informatics, 2, 79-82] and for ontology-based literature mining
[Muller, H.M., Kenny, E. E., Sternberg, P.W. (2004) “Textpresso: An Ontology-based information
retrieval and extraction system for biological literature” PLoS Biol 2:e309] have been reported,
and could be implemented for MorphoBank. To affiliate an image in an online PDF file, we can
use the same steps to let the user point to the online PDF file, and then extract the images from
the PDF file, and populate MorphoBank with the selected image.
5. Hosting multispecies ontologies of phylogenetic characters (Stony Brook and SDSC).
Many practicing morphological phylogeneticists are not currently using controlled vocabularies
(ontologies) to describe their characters and character states. By contrast, controlled
vocabularies are becoming very important for describing phenotypic features identified in
developmental research. However, new research efforts are starting to introduce ontologies into
phylogenetic research, including two ATOL projects (Cypriniformes, Spiders) that are drafting
their characters as multispecies ontologies. Inclusion of these projects in MorphoBank.org
provides both an example to other research projects wishing to follow this pattern and also
provides a basis for future interoperability between sites like GMOD that are databasing
developmental information.
6. Development of sophisticated drawing, editing and measurement tools (Stony
Brook/SDSC). Just as it is hard to imagine an anatomical atlas conveying much information if it
did not have labels, the labels introduced on MorphoBank.org greatly enhance the usefulness of
an illustration and the clarity with which an investigator can communicate what he/she means by
a homology statement. We can develop a more powerful annotation tool that allows users to place
annotations on selected points or areas of an image rather than on the whole image.
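One possible data structure for point and area annotations is sketched below. All names are hypothetical, and the choice to store coordinates as fractions of image width and height is an assumption made so that labels remain valid when an image is zoomed or resized.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Label:
    text: str
    # Position as fractions of image width and height (0.0 to 1.0).
    x: float
    y: float
    # A width and height of zero means the label marks a single point.
    width: float = 0.0
    height: float = 0.0

    def contains(self, px: float, py: float, tolerance: float = 0.01) -> bool:
        """True if the point (px, py) falls on this label, within a small tolerance."""
        return (
            self.x - tolerance <= px <= self.x + self.width + tolerance
            and self.y - tolerance <= py <= self.y + self.height + tolerance
        )


def labels_at(labels: List[Label], px: float, py: float) -> List[str]:
    """Return the text of every label covering the given position."""
    return [label.text for label in labels if label.contains(px, py)]


labels = [
    Label("foramen magnum", 0.50, 0.80),               # point label
    Label("zygomatic arch", 0.10, 0.40, 0.30, 0.10),   # area label
]
print(labels_at(labels, 0.20, 0.45))
```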
7. User feedback based modifications to the site, optimization of web applications
(Stony Brook). Site users have requested a number of changes that can only be implemented
with more funds for programming time. They are:
a. Enhanced versioning capabilities and the ability to rewind work to earlier stages or
single investigators. Perhaps one of the least attended to issues in database
management is the issue of data/document versioning (cf.
www.cs.ucl.ac.uk/staff/btagger/LitReview.pdf and references therein). While some
solutions to this issue have recently appeared in the context of information lifecycle
management (ILM) [e.g., http://www.abrevity.com/abrevity_fdm_ds.pdf], we are not
aware of any comparable solutions that have been reported that are appropriate for
individual academic scale projects. At present, both MorphoBank staff, and the
visualization team at SDSC [http://vis.sdsc.edu/research/cancer.html] are (fortuitously)
pursuing solutions to this problem for annotation of visualization images, and both sites
have (again fortuitously) adopted essentially identical architectures for the sharing and
presenting of digital images. The issue of multiple annotations of individual images is
already a solved problem within these two programs, and the specification of a
versioning solution is under active investigation at SDSC.
b. Improved navigation of the thousands of cells that appear in large matrices.
c. Enabling a format for visitors to add images/comments to existing matrices. This
emulates a blog capability. This is yours, but I can say that the SDSC group has tools
to do this kind of work.
d. Moving matrix tools to the cells, to optimize workflow and speed with which a user can
switch from one tool to another.
e. Improving aspects of the bug-reporting system to make it immediately accessible
during login. MorphoBank 3.0 will utilize the FogBugz reporting/feature request system
[www.fogcreek.com/FogBugz], which is inexpensive, and provides the ability to store
bugs in a database, assign and track bugs to a distributed workforce, and to
correspond easily with reports of issues and feature requests. This system has been
used successfully for bug tracking and project management at SDSC for several years.
f. Clearinghouse for active projects - with both a public and a private face, including
details of last login. This will be important for the committee supervising the site (to
encourage projects to move towards public access eventually).
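The versioning request in item (a) could be approached with an append-only history per matrix cell, sketched below. This is an illustration under stated assumptions, not the specification under investigation at SDSC: the class and method names are hypothetical, and rewinding is modeled as recording a copy of an earlier value so that no history is lost.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Version:
    number: int
    investigator: str
    value: str


@dataclass
class CellHistory:
    versions: List[Version] = field(default_factory=list)

    def record(self, investigator: str, value: str) -> Version:
        """Append a new version; earlier versions are never overwritten."""
        version = Version(len(self.versions) + 1, investigator, value)
        self.versions.append(version)
        return version

    def current(self) -> Optional[str]:
        return self.versions[-1].value if self.versions else None

    def rewind(self, number: int, investigator: str) -> Version:
        """Restore an earlier value by recording it again as the newest version."""
        if not 1 <= number <= len(self.versions):
            raise ValueError(f"no version {number}")
        return self.record(investigator, self.versions[number - 1].value)

    def by_investigator(self, investigator: str) -> List[Version]:
        """List the versions contributed by a single investigator."""
        return [v for v in self.versions if v.investigator == investigator]


history = CellHistory()
history.record("investigator_a", "0")
history.record("investigator_b", "1")
history.rewind(1, "investigator_a")
print(history.current(), len(history.versions))
```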
8. Formation of a committee responsible for the site (Stony Brook). Following the
example of the Paleobiology Database, we plan to develop a committee of involved systematists
to establish rules of conduct for the site and rotating leadership. The Paleobiology Database has
established working guidelines that have been in place successfully for several years. The
existence of such a committee will increase our visibility within the community, and strengthen
our ties with the community. It will also be a helpful source of resources and ideas about how
MorphoBank can be sustained once the funding provided under this proposal has expired.
9. Enhanced Usability/Documentation of the site. We have already been extremely
successful with computer science undergraduates at Stony Brook University, who have been
selected through the C-STEP program for minority science education (supervised by co-PI
Ferguson, see below) in building documentation for the site under the close supervision of
PI-O’Leary and Kaufman (Senior Personnel). These internships have resulted in online screenshot-based
movies that are tutorials for new users (and have been tested on new users). We plan to
continue these internships as a way of introducing young computer scientists to this work and to
keep information about the site features current. However, Cherri Pancake, usability expert at
Oregon State University (personal communication) has advised: “if a web application requires a
manual to use, it will fail to attract users.” While MorphoBank 2.0 has been designed for a
committed group of users who will tolerate initial usability issues, we are committed to minimizing
the barrier to entry for use of the site, and to actively soliciting advice from new users on
MorphoBank’s usability. One way to address this issue is to recruit undergraduate interns to
learn to use the application as part of a class assignment to construct a particular matrix, and to
require their feedback on which parts of the application are difficult to learn, and which are easy.
This feedback can be translated directly back to the developers for adjustments in the look and
feel of the web site. Where are the movies? Is there a place we can view them? Because it
is hard to figure it out solo at this time.
10. Provide for the sustainability of the site at the close of funding. Sustainability is among
the most pressing issues for projects that develop data integration and sharing techniques.
Research programs and their funding are by definition transitory, yet their work product is often
of enduring value. Nowhere is this more true than for the activity of collecting and annotating
morphological images. The sustainability question can be divided into two specific areas for
MorphoBank: 1) how to preserve data access and 2) how to permit growth of the resource.
Data Access: To preserve data access, both the web application and the data it accesses must
be sustained. The history and design aesthetic of MorphoBank is oriented towards making this
as economical as possible: we have consistently aimed for useful but very low cost functionality.
While the work proposed here would bring about significant improvements in functionality, the
fundamental architecture of the web application will remain simple and easy to sustain. PI
O’Leary is committed to sustaining the basic functionalities of the MorphoBank web
application, as it has become a tool of substantial importance in her own work. The creation and
persistence of the tool over the past 5 years with minimal funding testifies to the credibility of this
claim.
In addition to maintaining the Web application, data access over the long term means
navigating migrations of the application to new servers, upgrades in OS and RDBMS, and
protecting against loss of data through storage media events. We are currently taking advantage
of the SDSC “Data Central” resource as a key element of the sustainability strategy. This
program provides allocations (awarded competitively, on an annual basis) for on-line disc and
database resources to serve large data collections through MySQL, DB2, and Oracle RDBMS.
We have already received a development allocation from this program to begin our work. The
allocation will give us the opportunity to create a stable mirror of MorphoBank on production
hardware at SDSC which will be backed up on a nightly basis. Under the allocation award,
overhead costs of equipment, maintenance, and server upgrades will be provided by SDSC. The
staff at SDSC will also provide 24/7 attention to application uptime, and provide advice and
guidance in the event that migration to DB2 or Oracle is required. SDSC staff will keep the OS
and RDBMS current as part of the allocation process, and will notify the user community of
upcoming migrations.
We will continue to submit annual requests for allocations under this program
as long as it remains a viable method for attaining access to disc and database resources, as
well as the accompanying support and expertise. Expert advice from SDSC will include timelines
for upgrades, and known issues with upcoming changes. This contact will keep the cost of
migrations to a minimum. In the event that the allocation procedure becomes inaccessible, we
will migrate to a distributed data storage model if our resources are not sufficient to meet
demand. A strategy using the SDSC Storage Resource Broker to accomplish this is described in
the Resource Growth section below.
Resource Growth: To maintain relevance in a rapidly expanding domain, it must be possible to
1) upgrade existing analytical tools and add new ones, and 2) expand the storage capacity of the
resource.
The resources requested for this proposal are modest; however, they are sufficient to
produce a well-documented web application with published APIs. This will allow the developer
community to contribute new tools to MorphoBank through the proposed funding period and
beyond. This design pattern is currently being used successfully in sustaining software
developed for other projects, notably the CIPRES [www.phylo.org] and GEON projects
[http://www.geongrid.org/].
The current trajectory for MorphoBank is well suited to the model of open contributions
from the community. Development is ongoing in PHP, a stable, mature language that is well
suited for managing visual images and for rapid development but does not incur the enormous
overhead imposed by enterprise Java systems like Java Struts [http://struts.apache.org/ ] and
EJB [http://java.sun.com/products/ejb/]. The MorphoBank architecture is designed to incorporate
loosely coupled web services, which will allow other developers to expose their tools within
MorphoBank (see point 2 above). In this model new services will not be constrained to PHP, but
can be created in the language of the developer’s choice and exposed as Web Services. The
collaboration of the MorphoBank group with the CIPRES project (see letter: Miller) will serve as
a test case for deploying loosely coupled services within the MorphoBank application.
The scalability of MorphoBank storage resources is another key issue in sustaining the
program. We do not anticipate outgrowing the resources available through the SDSC Data
Allocation protocol during the proposed funding period. Our current RDB use is less than 50 MB,
and the use of file space for images is well below 1 TB. However, it is quite possible that much
larger storage resources will be required going forward, if MorphoBank becomes highly
subscribed. The solution to this issue is to maintain the RDB (which will be of modest size) in a
single location, and to adopt a distributed storage resource solution to handle the image files
(which account for most of the storage required by MorphoBank). The distributed model will
allow the storage costs to be borne by individual projects without imposing the difficult challenge
of providing a fee-for-service recharge system.
The Storage Resource Broker (SRB; www.sdsc.edu/srb) is an ideal solution for this part
of the sustainability problem. The SRB is a mature software package under active development.
It provides its users with the ability to transparently share data across distributed heterogeneous
data resources. It accomplishes this through a single user sign-on and a single logical file
hierarchy for all remote resources. The SRB is an attractive solution because it would allow each
MorphoBank user to store images on their own resource (or resources), and still access them
through MorphoBank. The SRB can be used to connect metadata stored in the central RDB to
the remote image files in a transparent way (i.e. the user will not be aware they are leaving the
MorphoBank domain to fetch the data). Thus, projects that have huge storage and archiving
budgets can share their images within the MorphoBank Web Application even if their images
cannot be housed within the MorphoBank storage area. In the event that all MorphoBank
storage is filled, this model will still allow the resource to scale to new users, although it
admittedly also returns responsibility for back-up and uptime to the data providers.
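The idea of a single logical file hierarchy over distributed storage can be illustrated with the sketch below. It does not use or reproduce the SRB interface; the catalog layout, resource names, and example.org addresses are all hypothetical and serve only to show how metadata in a central database could point to image files held elsewhere.

```python
from typing import Dict, Tuple

# Hypothetical storage resources: resource name -> base address.
RESOURCES: Dict[str, str] = {
    "morphobank-central": "https://images.example.org/morphobank",
    "project-lab-archive": "https://archive.example.edu/lab-images",
}

# Hypothetical central catalog: logical path -> (resource name, physical path).
CATALOG: Dict[str, Tuple[str, str]] = {
    "/projects/42/skull_dorsal.jpg": ("morphobank-central", "p42/skull_dorsal.jpg"),
    "/projects/77/femur_lateral.jpg": ("project-lab-archive", "2005/femur_lateral.jpg"),
}


def resolve(logical_path: str) -> str:
    """Translate a logical image path into the address where the file is actually stored."""
    resource, physical_path = CATALOG[logical_path]
    return f"{RESOURCES[resource]}/{physical_path}"


print(resolve("/projects/77/femur_lateral.jpg"))
```

The user works only with logical paths; moving a project's images to a different resource would require a change to the catalog entry and nothing else.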
IV.
Targeted collaborations: Exemplars for General Problems
Systematists are collecting some of the largest datasets of images of morphology as they
build increasingly large phenotype-based matrices to examine different aspects of the Tree of
Life. Examples include several of the AToL projects currently underway: Squamates (XX
images), Beetles (XX images), Spiders (XX images) along with other projects on dinosaurs (XX
images), mammals (XX images), fungi (XX images). Below we describe specific projects and
collaborators that we have identified as examples that fit the research problems identified above.
We have targeted collaborations with particular phylogenetics research programs on different
clades. This accomplishes two goals: (1) solution of each particular problem provides the
community with a general solution, and (2) this allows us to clearly specify minimum deliverable
data entry into MorphoBank.org by the time the proposed project is completed. Of course,
investigators outside this select group will be encouraged to use MorphoBank as well.
Achieving Mesquite-MorphoBank interoperability. Project: Spider AToL (letters:
Ramirez, Wheeler, Wayne?).
The spider AToL project will have amassed over xx images for xx taxa by completion of the
morphological data collection (xx% already collected). These data are currently organized in the
program Mesquite in a new version built for the project that is scheduled for public release. We
propose to develop a query language that will enable individuals using Mesquite to upload and
download a matrix with images from MorphoBank.org.
Offline editor. Most investigators are currently working on the desktop only, often using
Mesquite. Mesquite has some image viewing capabilities, but these are limited compared
to those in MorphoBank: images have no zoom and pan features, no labels, and cannot be
viewed simultaneously by a team. This is a software development project that would have to
be done in collaboration with Wayne, because no one except him can figure out how to
add modules to Mesquite. It would be a job the CIPRES team could handle, but not without
funding.
Linking MorphoBank to online taxonomic authorities (ITIS, UBIO, PBDB).
Projects: Squamate ATOL, Spider ATOL, Mammal AToL project (under review currently).
This activity will provide immediate connection between information housed in taxonomic web
lexicons (e.g., common name) and data amassed by scientists for systematics projects. (see
item 3 above). We may also include a direct link to IUCN. UBio is seeking insect collaborations.
Taxon searches will provide results from all providers, and indicate any matches, and map the
matches with existing tools. In addition, the global tree of life from the Tree of Life Web site can
be downloaded and served along with taxonomic names
(http://tolweb.org/tree/home.pages/downloadtree.html).
Retrieving digital library data in a matrix cell; Projects: Squamate AToL project (letter:
Kearney).
Making MorphoBank a back end for the Beetle ATOL project.
Need to speak to David Maddison about this - not sure he is on board
Integration of molecular workspaces with morphological workspaces. Exemplar: Fungus
AToL project (letters: Hibbet). The fungal AToL project has developed the WASABI
collaborative workspace for handling molecular sequences. The WASABI web software is
considered one of the most developed of the AToL projects, but does not currently handle
morphology. Since the fungal AToL project plans to include morphological data, we plan to have
MorphoBank.org house those data, to collaborate on shared software
functionalities, and to achieve good integration of morphological character and sequence data.
Dynamic matrix data linkage for journals. Exemplar: Mammal AToL project (pending) and
future submissions to Palaeontologia Electronica. David Polly and Palaeontologia
Electronica - dynamic matrix support linked through morphological phylogenies there.
Linking to specific images in online pdf files. Exemplar: AMNH online publications series
(Novitates, Bulletin, a full public resource) linked to O’Leary’s whale matrix, Mammal
AToL matrices, and the Spider ATOL project. Need to be able to link from cells to exact
figures available in online pdfs of publications. It should be possible to mount a search from
within a cell, label that image within MorphoBank and save and display that as a cell thumbnail.
Dynamic link to Paleobiology Database (PBDB) Exemplars: Student Claeson: fossil rays;
Curry Rogers and Wilson: Sauropod dinosaurs. Specimens and taxa (two tables in
MorphoBank.org), particularly fossils, have associated data that is not currently housed in
MorphoBank.org. These include tables for temporal and geographic occurrences. When
someone deposits a matrix in MorphoBank.org, we would like that to create the option - through
a link, not through a table in MorphoBank.org - to specify PBDB data such as time and place.
This would then result in a link between the two databases such that a search in the PBDB
would point investigators to current work in MorphoBank.org and a search in MorphoBank would
tell a user if temporal and spatial data are also available for the taxa in the PBDB. This would be
built first around the stated exemplar cases in which the investigator actually enters data into
both databases herself and then expanded to a dynamic search of data potentially entered by
different researchers. Searches in each database of the common tables (taxonomic name,
specimen number) would return information on both databases.
Also, it is important to enter specimen data and repository information along standards
established by PBDB, Paleoportal, and GBIF.
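A search over the common tables could be sketched as below. The row layout, the `taxon_name` key, and the sample rows are hypothetical; in practice each list would come from a query against the respective database rather than from in-memory data.

```python
from typing import Dict, List


def cross_search(
    taxon: str, morphobank_rows: List[dict], pbdb_rows: List[dict]
) -> Dict[str, List[dict]]:
    """Return the matching rows from both databases for one taxonomic name."""
    key = taxon.strip().lower()

    def matches(rows: List[dict]) -> List[dict]:
        return [row for row in rows if row["taxon_name"].strip().lower() == key]

    return {"morphobank": matches(morphobank_rows), "pbdb": matches(pbdb_rows)}


# Illustrative rows only (placeholder names and values).
morphobank_rows = [
    {"taxon_name": "Taxon A", "matrix": "Matrix 1"},
    {"taxon_name": "Taxon B", "matrix": "Matrix 1"},
]
pbdb_rows = [
    {"taxon_name": "Taxon A", "interval": "Interval X", "locality": "Locality Y"},
]

result = cross_search("taxon a", morphobank_rows, pbdb_rows)
print(len(result["morphobank"]), len(result["pbdb"]))
```

An empty list for one database tells the user that the taxon has no temporal and spatial data in the PBDB, or no current work in MorphoBank, as the case may be.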
Hosting a morphological matrix with characters and character states that are ontology-based.
Exemplar: Zebrafish ATOL -check (Letter: Mabee) Spider AToL (letter:
Ramirez/Wheeler). A perusal of major systematics journals (e.g., Systematic Biology, Cladistics,
Journal of Vertebrate Paleontology) indicates that practicing morphological systematists are not
currently using ontologies in defining characters and character states. Certain AToL projects,
such as the two mentioned here, are creating ontologies from their characters and/or by working
closely with existing ontologies that have been developed for model organisms. Currently
MorphoBank.org does not contain a mechanism to search and store information in accordance
with a known ontology. These exemplar projects will give us the opportunity to develop tools to
allow ontology based naming and searching of character states (this will be an important
functionality for other projects as well). We will adapt existing tools to read and interpret
ontologies in OWL (and OBOL if necessary) format in the context of the MorphoBank
application.
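Ontology-based searching could work along the lines of the sketch below, in which a search for a general term also returns characters annotated with more specific terms. The small is-a hierarchy and the character names are hypothetical examples, and the sketch uses a plain dictionary rather than an OWL parser.

```python
from typing import Dict, List, Set

# Hypothetical is-a relations: child term -> parent term.
IS_A: Dict[str, str] = {
    "incisor": "tooth",
    "molar": "tooth",
    "tooth": "skeletal element",
    "femur": "skeletal element",
}


def term_and_descendants(term: str, is_a: Dict[str, str]) -> Set[str]:
    """Collect a term together with every term that is (transitively) a kind of it."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in is_a.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def search_characters(term: str, characters: Dict[str, str], is_a: Dict[str, str]) -> List[str]:
    """Return characters annotated with the term or with any of its descendants."""
    terms = term_and_descendants(term, is_a)
    return [name for name, annotation in characters.items() if annotation in terms]


# Hypothetical characters, each annotated with one ontology term.
characters = {
    "Upper incisor count": "incisor",
    "Femoral head shape": "femur",
    "Molar cusp pattern": "molar",
}
print(search_characters("tooth", characters, IS_A))
```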
Addition of new features, Standardization of repository names, specimen numbers, and
bibliographic references with other databases. Exemplar problem: all projects listed. As
online libraries become the norm rather than the exception, it is important that our growing
database follow standards that they have developed.
V.
Broader Impact
Outreach links (I don’t understand this section, but will be happy to add if you can
explain)
These I think are more services to the broader biological community than they are core activities
in systematics. Please feel free to add things here. I will also develop this more
1. Link to animal diversity web.
2. Link to Tree of Life project? Maddison
3. Improved Google searches?
4. Links to common names - Fishbase has this. SDSC
5. Link to IUCN red list. Search a taxon on MorphoBank, pulls up links to Red list too? Search
Red list and it tells you if there are current studies on MorphoBank. SDSC
Promotion of Teaching, Training and Learning. In this project, undergraduate students will
be full participants in the biological and computer science research. O’Leary has budgeted
research stipends for students to work in her lab consistently during this project. Specific
deliverables for the students involved include: co-authorship of talks and papers, opportunities
for public speaking, student prizes on campus and nationally, and inclusion in national scientific
meetings. The undergraduate students will be offered opportunities to travel to national
meetings with O’Leary, and Stony Brook University will seek funds for the students’ travel. The
SDSC group will apply for Research Experiences for Undergraduates (REU) funds from the
allocation at the SDSC, and students will be recruited from the UCSD Academic
Enrichment Program’s Faculty Mentorship Program (FMP) and the California Alliance for
Minority Participation in Science, Engineering and Mathematics Program (CAMP)
(http://aep.ucsd.edu/default2.htm).
Broadening the Participation of Underrepresented Minorities. At present underrepresented
minorities are less than full participants in the professional scientific fields of systematics,
evolutionary biology and vertebrate paleontology as well as computer science. Stony Brook
University is ideally positioned to make a difference in minority participation in these sciences for
three reasons: (1) it is a state school with a large minority undergraduate population, (2) it has
strong Department of Anatomical Sciences research programs in these areas, and (3) it has an
active network of federal and state programs that serve underrepresented minority students in
STEM (science, technology, engineering and mathematics) majors from secondary school to the
professoriate. The Carnegie Foundation has recognized Stony Brook (student population ~
21,000) as a “Type I Research” university, the foundation’s highest classification and a reflection
of the university’s diversity of research, and its large number of doctoral candidates.
In this project we propose to link our research to specific campus programs targeted at
bringing minorities into the sciences. They are: the NSF SUNY Alliance for Graduate Education
and the Professoriate (AGEP), the SUNY Louis Stokes Alliance for Minority Participation
(LSAMP), the New York State Science and Technology Entry Program (STEP), and the
Collegiate Science and Technology Entry Program (C-STEP). As noted above, O’Leary has
already independently mentored minority high school students in science research (O’Leary Biographical Sketch), and one of these students has won international awards for his science
projects. O’Leary and Ferguson have recently started a new collaboration between the
STEP programs and the research of the Department of Anatomical Sciences.
A central part of minority outreach planned is targeting undergraduate research to
minority populations through SUNY LSAMP and C-STEP. In addition to the outreach efforts at
Stony Brook, the SDSC group will participate in outreach activities by recruiting undergraduates
in underrepresented groups through the Faculty Mentorship Program (FMP) and the California
Alliance for Minority Participation in Science, Engineering and Mathematics Program (CAMP)
(http://aep.ucsd.edu/default2.htm). These students will be recruited to experiment with the web
application, to report usability issues, to experiment with creating matrices in the web
application, and to collaborate with the Mammalian Diversity Web to create educational
demonstrations for undergraduate students.
Enhancing Infrastructure for Research and Education.
The proposed research is highly collaborative both in the development that is proposed, and in
the use of the resulting tools. We have reported in the text, and in the appendices, letters of
commitment from XX projects distributed across at least YY sites. The collaborations cross
boundaries of domain, from taxonomy to biodiversity to paleontology to systematics to computer
science and software engineering. The project will bring together groups that have developed
independently, and help to bring them into a unique community of systematics informatics. The
product of the work will provide a new infrastructure in a production environment, thus
contributing to the ability of morphological researchers to conduct their work.
Dissemination of results and enhancing understanding.
We will help to build web presence for undergraduate, and late high school students that will be
disseminated through our relationships with the Animal Diversity Web, and the AMNH (others).
Benefits to Society.
I can’t think of anything for this area.