ppt

advertisement
Software Development
• The team
–
–
–
–
–
–
–
–
FSU: David Swofford, Mark Holder, Fredrik Ronquist
UBC: Wayne Maddison
UConn: Paul Lewis
Arizona: David Maddison
AMNH: Ward Wheeler
UT-Austin: Tandy Warnow
UNM: Bernard Moret, David Bader, Tiffani Williams
SDSC/UCSD: Mark Miller, Fran Berman, Philip Bourne, John
Huelsenbeck
Software Development
• Activities
– Design of system architecture with APIs
– Development of internal/external documentation
system
– Wrapping/interoperability of existing software
– Implementation of new “solution modules” for existing
and novel methods
– Testing/usability assessment
– Integration of database, tree estimation, and post-tree
analysis functions
– Implementation of scalar, parallel, and distributed
versions
Overall design of system architecture
• Modularity, communication, distributed processing:
– Protocols for communication among components (modules) need to be defined
in general, and in particular for key categories of modules (database, tree search
engine, GUI)
– Different styles of communication may be needed for different links: the
database-tree search link may need to be very high speed; the link to GUI not so
high.
– The system needs to be designed to accommodate different styles of use
(theoreticians feeding simulated data repeatedly into the tree search engines
versus heavily interactive use by single user interested in secondary conclusions
about character evolution.).
Overall design of system architecture
• Modularity, communication, distributed processing (cont.):
– The system needs to be “commandable” in different ways -- command line, GUI,
or some more direct pipeline.
– The system must be designed to be convenient for developers to add/try out their
own modules.
– The system should manage the distribution of computations to local machines or
on a grid; this should be easy both for the user (i.e. it happens automatically
depending on resources available and needs of the computation) and for the
developer (code can be compiled for grid-ready computations with little
modification).
– Error handling must generate messages informative to the user and useful to the
developer.
Overall design of system architecture
• Data standards, logging:
– Development of XML standards for communication among modules &
data storage. Given NEXUS experience, this will not be trivial (e.g., not
just simple data, but including assumptions, which grade into a
programming language for computation, as in Hy-Phy batch files and in
PAUP and Mesquite blocks).
– Sophisticated logging of analyses, including allowing the "freezing" or
snapshotting of calculations midstream and undo; the log could be
presented as a user-friendly notebook.
Base class providing
ability to read a NEXUS
data file (manages a
linked list of NxsBlock
objects)
NxsReader
Floor provides a functional
program with a simple,
portable, command line
interface
Floor
TreesManager treesMgr
TaxaManager taxaMgr
NxsBlock
Base class providing the
ability to read a private
NEXUS block (list of
commands and associated
handlers)
Canopy
CharManager charMgr
OutputManager outMgr
Derived class providing GUI (plots,
dialogs, output window, etc.)
Overview of Phorest classes
.
.
.
Web Forms
PHP
(all information about
command entered via web
form, including command
name, description, available
options, and names of classes
and/or functions responsible
for handling the command)
MySQL
(information
about all
commands
stored in
database)
HTML
(e.g. online
documentation)
PHP
(PHP scripts used to write out
XML file, which can then be
XML converted to HTML, PDF, or
source code, etc. using the
eXtensible Stylesheet
Language)
XSL
PDF
C++/Java
(e.g. command
reference manual)
(source code
to implement
command)
Modules
• Tree search and database modules should have top priority.
• Core tree search engines
– At least initially, Phorest/MrBayes/Poy can serve as modules to use for
tree searches. Modifications will or may be needed to link them into
databases and GUI's, including bringing them to speak a common
language for data structures and communication (XML/SOAP?). This
common language/ communication protocol needs to be designed.
– Development of new tree search modules
• Modifications to the tree search modules will be needed to incorporate
improvements to algorithms from the theoreticians, and to accommodate
new criteria or assumptions.
Modules (cont.)
• Database:
– Issues in the database design include:
• aligned versus unaligned sequence storage & retrieval
• storage of assumptions and methods (see above re: XML)
• whether the database serves to store not just input to analyses, not just
output from analyses, but also information sufficient to replicate step by step
(snapshotting etc.).
– Our main concern is how the software interacts with the databases and
making sure databases are designed in accordance with software
needs
Modules (cont.)
• User-interfaces, including editing tools
– A basic data matrix/sequence editor to prepare data for submission to
the database or tree search engines.
• Perhaps this could be achieved by adapting an existing tool, but we should
be prepared to build one from scratch to be able to handle the flexibility we'll
demand. An appealing and well-done editor integrated into a system to
submit for tree searches could do a lot to attract users.
– Tree viewers (possibly for large trees) and GUI front ends for the
various commands in the system.
– Some/all of this work by SDSC “professionals?”
Modules (cont.)
• Post-tree analyses
– Modules to determine implications of trees for conclusions on character
evolution, speciation, etc. should be linkable into the tree search and database,
especially as the trees themselves may be viewed as the intermediate, not the
ultimate, product of an analysis.
• Modules to aid design of the system
– Linking Mesquite & other existing programs to our architecture could provide
some of these services.
• Simulation engines:
– These should appear to the system much like the database, as a source of
character data.
• The software development team will be primarily responsible for defining the protocols
for supplying character data; others will write the simulations code.
Testing/usability assessment
• Communication with users will be an integral part of the
software development process.
– Participation in workshop courses will both serve as outreach
and give us valuable feedback about design.
– Journaling system will allow us to track how users interact with
the system, and allow problems to be reproduced, analyzed, and
debugged
Resolving conflicting priorities
• Some goals are common across the software suite
– interoperability
– ease of use
– flexibility
• Others will inevitably conflict, e.g.:
– Mesquite: emphasizes modularity, extensibility, visualization
– Phorest: emphasizes efficient use of memory, speed, native GUI
feel, scalability
– MrBayes: emphasizes rapid exploration and implementation of
new ideas
• This is the challenge: how to bundle everything
together into a coherent package, without sacrificing the
strengths of different approaches?
“Notion of overall schedule”
• Year 1
– Develop a concrete set of use cases that document goals and focus further
efforts
– Define a system architecture, with a collection of simple APIs (Swofford, W.
Maddison, Lewis, Holder, working with SDSC professional team)
– Write wrappers around existing software from team members (e.g., PAUP*,
MrBayes, Mesquite, Poy, GRAPPA, etc.) to temporary use on our systems
(Swofford, W. Maddison, D. Maddison, Lewis, Holder, Huelsenbeck, Ronquist)
– Continue implementation of phorest with the ultimate goal of applying lessons
learned there to improve the design of the ITR project and to maintain
compatibility with the developing IT resource (Swofford, Lewis, Holder,
Ronquist?)
– Work with the database group to develop standards for data input and exchange
formats, drawing from experience with Nexus and other existing methods,
incorporating XML and possibly other emerging metalanguages. (Swofford, W.
Maddison, Lewis, others)
– Work with other team members on issues related to algorithm engineering and
high-performance computing; set up a prototype analysis ("stunt run") of a few
million-sequence datasets to demonstrate feasibility and gather some preliminary
computational data. (Moret, Bader, Berman, Wheeler, plus other members listed
above)
“Notion of overall schedule”
• Year 2
– Fully implement the architecture designed in the first year, providing a framework
for installation of software modules for phylogeny reconstruction, post-tree
analyses, performance evaluation, and simulations of evolution.
– Begin populating the framework with solutions modules (replacing old code with
newly developed modules)
– Implement a rough user interface for temporary use, with partial integration of the
database, solution modules, and current simulation tools.
– Release an alpha version of the software suite and make available for testing.
– Conduct a new stunt run, with improved simulated data and new modules; use
results from this to develop new plans for application and platform scalability
“Notion of overall schedule”
• Year 3
– Continue populating the computational framework with reconstruction algorithms,
evaluation modules, etc., with the goal of replacing earlier re-wrapped software
with new modules that (among other things) smoothly integrate database
functions with software modules.
– Release a beta version that for use by outside users ("official" collaborators,plus
other ATOL investigators, students, and participants in our annual workshops);
this version should run on laptops through small SMPS.
– Perform a formal evaluation effort by outside users based on this beta version,
and plan revisions based on the results of this evaluation.
– Work to make all of the code base run efficiently on large machines, including
clusters of SMPs
– Work with algorithms group to implement, test, and refine the best algorithms
devised to date.
“Notion of overall schedule”
• Year 4
– Implement post-tree analysis modules and integrate them with the
database
– Implement accepted recommendations from the user panel and
experimental findings of Year 3.
– Perform large-scale testing on large datasets produced by the
simulation team
– Work closely with ATOL partners on analysis of their data, identifying
and learning from the problems exposed in that process
“Notion of overall schedule”
• Year 5
– Incorporate modules contributed by international collaborators and others into
the framework
– Enable software suite for Grid usage.
– Identify new development targets
– Work closely with the SDSC team to produce a “final” package that will include all
of our work and it will run tests with this package on a large variety of platforms.
– Document and report the rate of software development over the years of the
project and its success in attracting outside contributors, in the interest of
improving the efficiency of large open-source software development projects.
Download