Experimental Visual Environment
The EVE Project
Bercovici Sivan

Table of content

Introduction
    Goal
    Tool Overview
Requirements
Design
    Information
    Information Provider
    Information Consumer
    Tool
    Tool Input
    Tool Output
    Tool Category
    Toolbox
    Workspace
Implementation
    Input
        Manual nucleotide sequence input
        Nucleotide sequence file input
        Range annotation file input
        Fasta loader
    Sequence tools
        Add range annotation
        Filter range annotations
        Search repeats
        Break sequence by selection
        Pair alignment
    Sequence viewer
    Statistics
        Sequence pattern histogram
        Histogram viewer
        Position specific nucleotide frequencies
        Position specific frequencies viewer
    Blast
        Blast two groups
        Blast results viewer
Extending EVE
    Encapsulating the Data
    Creating the Tool
    Registering the Tool
Future directions
Conclusion

Introduction

Over the past 30 years, the field of bioinformatics has shown an increasing growth in research and development. Nonetheless, as we examine the academically available research tools and infrastructures, we identify two major shortcomings:

- There is no standard infrastructure/information format for bioinformatics tools.
- Many bioinformatics tools are hard to handle by biology researchers.
It is evident that no single environment exists in the academic world that offers an integrated research tool. The same goes for bioinformatics information-handling infrastructure.

The result of the first shortcoming is that the same infrastructure is often written over and over again by different researchers. While some libraries are available (e.g., Perl libraries that handle sequences), there is no emphasis on ensuring that newly added infrastructure can communicate with legacy infrastructure, and vice versa. More and more environments become available daily, each offering a set of tools, supporting a set of formats, and producing new information sets in new formats. This forces researchers into a constant learning stage and leaves the tools unable to interact.

As many tools do not offer a visual interface, working with them is sometimes hard, up to the point where a tool is not used by researchers although it offers important information.

Goal

The main goal of EVE is to provide a freely available, easy-to-use visual working environment for bioinformatics researchers, as well as a home for the collection of bioinformatics tools developed in different locations.

The EVE project attempts to deal with the two shortcomings described in the previous section by offering bridges between different information formats, and by providing a convenient visual environment for constructing tool pipelines. We believe that by encapsulating existing tools, and offering conversion services between the formats these tools use, we can reach greater flexibility in the construction of information pipelines. EVE does not introduce a new format and propose it as a standard, but rather offers conversion services between different formats. This allows different tools to communicate. Thus, as long as new tools use the same subset of formats, EVE will be able to overcome the tool-interaction problem.
We also believe that by emphasizing the visualization of the tools, we enable biology researchers to use the tools in an intuitive fashion, shortening the learning curve and increasing productivity. More importantly, with visualization in mind, previously cryptic outputs from different tools become more convenient to work with.

Tool Overview

In EVE, a user interacts with a single workspace. Information processing pipelines are drawn by the user as a graph, where the nodes are information processing units (tools), and the edges mark the information flow between the different tools.

Figure 1 – the workspace

A set of tools in different categories is offered to the user in a browsing tree, at the left side of the environment.

Figure 2 – the toolbox

A user can drag a selected tool from the toolbox into the workspace. To build an edge, representing information flow, a user clicks and drags a tool’s output (depicted as blue circles in figure 1) onto a tool’s input (depicted as red circles in figure 1).

Each tool may have unique execution parameters. Double-clicking on a tool in the workspace enables a user to modify the tool’s specific parameters prior to its execution.

A tool is either valid or non-valid. In order to make a tool valid, one has to execute it. A user can execute the entire graph by pressing the play button in the top tool-bar. When a tool is non-valid, it is marked with a red circle holding the letter ‘V’ (signifying a validity problem).

Figure 3 – Validity and Connectivity problems

Each tool either provides information, or consumes information provided by other tools. Thus, a tool is either properly connected to its information providing tools, or not.
As one can see in the previous figure, if a tool’s connections are not satisfied, the tool is marked with a red circle holding the letter ‘C’ (signifying a connectivity problem).

Sometimes, a tool will require the user to set its execution parameters. If a tool’s parameters are not properly set, this is indicated by a red circle holding the letter ‘P’ (signifying a parameters problem). This is shown in the next figure.

Figure 4 – Parameters problem

The environment offers two more state indication components. One is the logger, which logs events from the different tools and user interactions. The logger is available in the lower part of the environment. The other component is the short messages bar, which displays context-sensitive messages below the logger.

Figure 5 – The logger

A user can drag tools across the workspace, re-arranging the graph layout. A user can also remove tools and edges by first selecting them (with a mouse click) and then pressing the delete button.

To learn about the different tools, a user can read the available tool-tips by pointing with the mouse at a tool, in either the workspace or the toolbox browsing tree. Pointing at a tool’s inputs or outputs reveals the type of information consumed or produced by the tool, respectively.

Figure 6 – Tool-tips

Requirements

After identifying the shortcomings of current tools, we define the following requirements for the EVE tool:

Ease of use: The tool should be usable in an intuitive fashion. An emphasis should be placed on making the tool as visual as possible.
Short learning curve: Learning how to use the tool should take only marginal effort.
Flexibility: The tool should allow the interaction of different tools from the field of bioinformatics.
Basic inputs support: The tool should support basic data formats for the elementary data used by most bioinformatics applications.
Basic data-structures: The tool should provide data structures for the elementary data types handled by most bioinformatics tools.
Basic viewers: The tool should provide a set of basic viewers for the examination of the basic data structures.
Extendibility: The tool should be easily extendable to support newly arising external bioinformatics tools.
Free: The tool should be freely available.

Design

In this section, we review the fundamental building blocks of EVE. The most basic components are the information objects. These objects are either produced by tools, consumed by tools, or both. When information is produced by a tool, it is pushed to the information provider objects, which lie inside a specific tool output. A tool may, of course, have several outputs. On the other side of the information flow arrow, a tool may have several tool inputs. Each tool input may have several information consumers. The tool retrieves the information it requires for processing from the information consumers. This flexible design allows us to easily develop new tools with different numbers of inputs and outputs that use the supported information types while offering a different processing service.

Information

As stated above, the most basic component in EVE is the Information interface, which encapsulates the basic data units on which processing is performed. The Information interface defines the basic methods that must be provided by any object that is to move through a pipeline defined in the EVE environment. In EVE, we have implemented several Information objects that were identified as fundamental to bioinformatics research. Among the implemented basic data units of Information are:

1. Nucleotide Sequence: Holds a single nucleotide sequence, as well as specially marked sub-ranges. Other information stored in these objects is the sequence name and a multi-line comment.
2. Multiple nucleotide sequence: Holds a set of nucleotide sequences.
3. Histogram: Holds statistical information. The x-axis represents the different categories, and the y-axis represents the value measured for each category.
4. Motif: Holds multiple sequences as well as position-specific nucleotide frequencies.
5. Blast two groups information: Holds the pair-wise alignment results of the given multiple sequences.

Information Provider

An Information Provider interacts with a corresponding set of information consumers (i.e., Information Consumer objects). It can interact with a varying number of information consumers, within the limits defined during construction, sending them a copy of the internally stored information upon request.

Information Consumer

An Information Consumer interacts with a corresponding set of information providers (i.e., Information Provider objects). It can interact with a varying number of information providers, within the limits defined during construction. The consumer can ask any of the registered providers for a copy of the information they hold.

Tool

This is one of the most basic components of EVE: the skeleton of a tool from which all EVE tools must inherit. It provides the mechanism for setting up the tool's inputs and outputs, and defines the interface for tool processing and description. The processing of a tool is divided into 5 stages:

Pre-process: The part of the tool’s execution that must occur prior to the tool’s main processing stage. This stage may be executed at any time during pipeline execution, as long as it is before the tool’s main execution stage.
Acquire information: At this stage, the tool retrieves all information from the tools on which it depends for input. This stage may take place at any time before the actual processing stage, yet after the input sources have all finished executing and the pre-processing stage of the tool has finished.
Processing: The main processing stage of the tool.
At this stage all data needed for processing is available for the tool to use. The processing stage takes place after the information acquiring stage has finished.
Post information: This stage, which takes place after the tool’s main processing stage has finished, is responsible for pushing the information through the tool’s outputs. This stage is important so that the tools which depend on the output of this tool will be able to retrieve the required data.
Post-process: This stage may execute at any time after the information posting stage has finished. It is intended mostly for cleanup.

Breaking the tool’s execution into these 5 stages enables better task scheduling, and encourages programmers to give the required attention to each of the tool’s natural execution stages.

Tool Input

A tool input consumes information from different information providers, based on the information consumers that it holds. A Tool may have several tool input objects attached to it.

Tool Output

A tool output provides information for different information consumers, based on the information providers that it holds. A Tool may have several tool output objects attached to it.

Tool Category

A Tool Category object is used to manage a set of tools inside the toolbox.

Toolbox

A singleton class that manages the set of tools in EVE. The toolbox acts as a container of tools, organized into categories.

Workspace

The workspace, implemented in the GraphPanel class, provides the main EVE workspace. On this component, the information pipeline is drawn in the shape of a flow graph.

Implementation

In this section, we review the tools, models and viewers implemented in this version of EVE.

Input

EVE provides a few basic input methods, either manual or from file. In the next sub-sections, we describe these methods (i.e., EVE tools).
Manual nucleotide sequence input

Description: This tool can be used by EVE users to manually enter a nucleotide sequence.
Inputs: None.
Outputs: Supports both SingleNucleotideSequence and MultipleNucleotideSequence output.
Parameters: The parameters of this tool are the sequence name, the actual sequence, and a comment on the sequence. The parameter editing panel is depicted in the next figure.

[Figure: Nucleotide sequence manual input]

Nucleotide sequence file input

Description: This tool is used to load a plain nucleotide sequence from a file.
Inputs: None.
Outputs: Supports both SingleNucleotideSequence and MultipleNucleotideSequence output.
Parameters: Double-clicking on this tool will open the default file browser, allowing the user to select the file that is to be loaded.

Range annotation file input

Description: This tool is used to load range annotations for sequences. The file format is: <Range start position> <Range end position> <Range annotation string>.
Inputs: None.
Outputs: Range annotations information object.
Parameters: Double-clicking on this tool will open the default file browser, allowing the user to select the file containing the range annotations that are to be loaded.

Fasta loader

Description: This tool is used to load multiple nucleotide sequences from files in the FASTA format.
Inputs: None.
Outputs: The tool outputs MultipleNucleotideSequences information.
Parameters: Double-clicking on this tool will open the default file browser, allowing the user to select the file that is to be loaded.

Sequence tools

In this section we describe the basic sequence tools provided in this version of EVE. The tools are used to perform elementary tasks on the sequences.

Add range annotation

Description: This tool is used to annotate a given sequence with a given range annotation.
Inputs: The left input takes the single nucleotide sequence information, and the right input takes the range annotation to be applied.
Outputs: This tool outputs both a SingleNucleotideSequence and a MultipleNucleotideSequences object.
Parameters: None.

Filter range annotations

Description: This tool is used to pre-select specific ranges in multiple nucleotide sequences, for later processing.
Inputs: MultipleNucleotideSequence.
Outputs: MultipleNucleotideSequence.
Parameters: Users cannot edit the parameters of this tool in the usual manner. Instead, during the execution of the pipeline, a sequence range-browser pops up, showing the available annotations for all of the input multiple nucleotide sequences. The browser enables the user to select the desired ranges.

Search repeats

Description: This tool searches the input multiple nucleotide sequences for repeats of a short pattern provided by the user. The results are shown as annotated ranges in the output MultipleNucleotideSequences object.
Inputs: MultipleNucleotideSequences.
Outputs: MultipleNucleotideSequences.
Parameters: A user has to enter the nucleotide pattern to search for, as well as the number of repeats to search for.

[Figure: Search repeats in given nucleotide sequences]

Break sequence by selection

Description: This tool is used to break a given nucleotide sequence into a new multiple nucleotide sequence according to the selection (i.e., annotated ranges).
Inputs: SingleNucleotideSequence.
Outputs: MultipleNucleotideSequences.
Parameters: None.

Pair alignment

Description: This tool is used to perform either local or global alignment on two given nucleotide sequences.
Inputs: The tool takes two inputs, each with a SingleNucleotideSequence information object.
Outputs: MultipleNucleotideSequence.
Parameters: A user may input the alignment method (i.e., either global or local), as well as the penalty matrix.
This is depicted in the following figure.

[Figure: Pair alignment parameters]

Sequence viewer

Description: The sequence viewer is one of the most important tools available in the current version of EVE. It provides the users with a way to examine multiple nucleotide sequences and their annotated ranges.

[Figure: The Sequence Viewer]

As depicted in the previous figure, the viewer is constructed from 3 parts. The upper quick navigation bar enables the user to set the viewing window size and location, and gives a quick overview of the sequences and their annotated ranges. A change in the sliding window is reflected in the center viewing piece. The center viewing piece shows the actual sequences, marking annotated ranges in red. On the left side of the viewer, one can examine the annotated ranges of each sequence separately. A user can also mark the ranges he or she wishes to examine. Checking the range boxes switches their visibility state on and off. These changes are reflected both in the quick navigation bar and in the center nucleotide sequence viewing piece.

Inputs: MultipleNucleotideSequences.
Outputs: None.
Parameters: A user cannot edit the parameters of this tool. Double-clicking on the tool when its state is valid will reveal the multiple nucleotide sequences viewer, and enable the user to examine the input multiple nucleotide sequences.

Statistics

The current version of EVE provides two sequence-related statistics tools. One focuses on short pattern statistics, and the other focuses on position-specific nucleotide frequencies.

Sequence pattern histogram

Description: This tool is used to examine patterns of a certain length in the given sequences.
Inputs: MultipleNucleotideSequences.
Outputs: HistogramInformation.
Parameters: As depicted in the following figure, a user can set the size of the patterns on which this statistical tool will gather information.
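The counting performed by a pattern histogram of this kind can be sketched as follows. This is only a minimal illustration and not EVE's actual implementation: the class name PatternHistogram and the method countPatterns are hypothetical, plain strings stand in for the nucleotide sequence objects, and we assume that overlapping occurrences of a pattern are all counted.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not part of the EVE code base. */
final class PatternHistogram {

    /**
     * Counts every overlapping pattern of the given length in all sequences.
     * Keys are the patterns (the histogram categories, sorted alphabetically);
     * values are the number of occurrences of each pattern.
     */
    static Map<String, Integer> countPatterns(List<String> sequences, int patternLength) {
        if (patternLength <= 0) {
            throw new IllegalArgumentException("patternLength must be positive");
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (String sequence : sequences) {
            String s = sequence.toUpperCase(Locale.ROOT);
            // Slide a window of patternLength over the sequence, one base at a time.
            for (int i = 0; i + patternLength <= s.length(); i++) {
                counts.merge(s.substring(i, i + patternLength), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the pattern counts for two short example sequences.
        System.out.println(countPatterns(List.of("ACGTAC", "ACAC"), 2));
    }
}
```

The resulting map corresponds to the histogram described in the Design section: each key is a category on the x-axis, and each value is the measurement on the y-axis.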
[Figure: Sequence patterns - statistical tool parameters]

Histogram viewer

Description: This tool is a general histogram viewer. It is not specialized to any specific tool, but rather works with the generic HistogramInformation input.
Inputs: HistogramInformation.
Outputs: None.
Parameters: There are no special parameters a user can set in this viewer. When a user double-clicks on the tool while the tool is valid, the histogram viewer appears. The histogram viewer shows the statistical information given as input as a histogram. As shown in the following figure, when a user points at a specific bar, a tool-tip appears, reflecting the statistical information represented by the bar.

[Figure: The histogram viewer]

Position specific nucleotide frequencies

Description: This tool is used to examine the position-specific nucleotide frequencies. It is mainly used for the quick discovery of motifs and consensus patterns in multiple nucleotide sequence alignments.
Inputs: MultipleNucleotideSequence.
Outputs: MotifInformation.
Parameters: None.

Position specific frequencies viewer

Description: Visually similar to the sequence viewer tool, this viewer provides the user with the ability to examine the position-specific frequencies information gathered by the position-specific nucleotide frequencies tool. The main change from the previously described sequence viewer is that the center piece no longer presents the sequences themselves, but rather the frequencies gathered.
Inputs: MotifInformation.
Outputs: None.
Parameters: This tool does not take user parameters. When a user double-clicks on a valid tool, the viewer appears. The following figure shows the position specific frequencies viewer.

[Figure: Position specific frequencies viewer]

One can notice in the example above that there is a short 3-base consensus at the beginning of the given 6 sequences.
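Position-specific frequencies of the kind shown by the viewer can be computed along the following lines. This is a sketch under stated assumptions rather than EVE's own code: the class PositionFrequencies and its methods compute and consensus are hypothetical names, plain strings stand in for the sequence objects, the sequences are assumed to be already aligned to equal length, and the threshold-based consensus is our illustration of how a short consensus could be read off the frequencies.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the EVE code base. */
final class PositionFrequencies {

    /** Order of the bases within each row of the frequency matrix. */
    static final String ALPHABET = "ACGT";

    /**
     * Returns a matrix where result[position][base] is the fraction of
     * sequences that hold ALPHABET.charAt(base) at that position.
     * Symbols outside ALPHABET (gaps, ambiguity codes) are not counted.
     */
    static double[][] compute(List<String> alignedSequences) {
        if (alignedSequences.isEmpty()) {
            return new double[0][ALPHABET.length()];
        }
        int length = alignedSequences.get(0).length();
        double[][] frequencies = new double[length][ALPHABET.length()];
        for (String sequence : alignedSequences) {
            if (sequence.length() != length) {
                throw new IllegalArgumentException("sequences must have equal length");
            }
            String s = sequence.toUpperCase(Locale.ROOT);
            for (int position = 0; position < length; position++) {
                int base = ALPHABET.indexOf(s.charAt(position));
                if (base >= 0) {
                    frequencies[position][base] += 1.0;
                }
            }
        }
        // Turn the raw counts into fractions of the number of sequences.
        for (double[] row : frequencies) {
            for (int base = 0; base < row.length; base++) {
                row[base] /= alignedSequences.size();
            }
        }
        return frequencies;
    }

    /** Most frequent base per position, or '-' where no base reaches the threshold. */
    static String consensus(double[][] frequencies, double threshold) {
        StringBuilder result = new StringBuilder();
        for (double[] row : frequencies) {
            int best = 0;
            for (int base = 1; base < row.length; base++) {
                if (row[base] > row[best]) {
                    best = base;
                }
            }
            result.append(row[best] >= threshold ? ALPHABET.charAt(best) : '-');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        double[][] frequencies = compute(List.of("ACGA", "ACGT", "ACTC"));
        System.out.println(consensus(frequencies, 0.6));
    }
}
```

A run of consecutive non-dash characters in the consensus string corresponds to the kind of short consensus that is visible in the viewer.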
Blast

We support simple blast services, as an example of how external tools can be easily encapsulated for use in the EVE environment.

Blast two groups

Description: This tool takes two groups of nucleotide sequences, and performs alignment on every pair that combines a nucleotide sequence from the first group with a nucleotide sequence from the second group. The tool that performs the actual alignment is the blastn tool, which is executed locally.
Inputs: Two MultipleNucleotideSequences information objects.
Outputs: BlastTwoGroupsInformation.
Parameters: A user can set the maximal allowed e-value. Any alignment which produces an e-value above the user-specified limit will be disregarded.

[Figure: Blasting Two groups - parameters]

Blast results viewer

Description: This tool takes the blast results of the Blast two groups tool, and enables the users to view them.
Inputs: BlastTwoGroupsInformation.
Outputs: None.
Parameters: There are no unique parameters to be set by the user. When a user double-clicks on a valid blast results viewer tool, the viewer appears, showing the given input results.

[Figure: Example for the blast viewer results]

Extending EVE

In this section we provide a step-by-step guide for the implementation of a new tool. The motivation is to show how easy it is to extend the EVE tool, encouraging the addition of new tools to the environment.

Encapsulating the Data

The first step in the construction of a new tool is to examine the type of information it uses, as well as the type of data it produces. A programmer should first examine the currently available Information objects, as the most basic types of data, used by most bioinformatics applications, have already been implemented. We encourage the re-use of these Information objects, so as to increase the interconnectivity with both currently available tools and tools implemented in the future.
In case none of the Information objects fits the requirements of the new tool (either for input or for output), a new class should be written, implementing the lightweight Information interface. The most important point here is that a programmer should implement the createCopy method, which should create an entirely independent copy (i.e., a deep copy) of the Information object. As an example, we provide the code for the single nucleotide information class.

    public class NucleotideSequenceInformation implements Information {

        public NucleotideSequenceInformation( NucleotideSequence seq ) {
            this.sequence = seq;
        }

        // Deep copy: wraps a new NucleotideSequence built from the stored one.
        public Information createCopy() {
            return new NucleotideSequenceInformation( new NucleotideSequence( sequence ) );
        }

        public String getDescription() {
            return "Nucleotide sequence";
        }

        public Class getType() {
            return NucleotideSequence.class;
        }

        public NucleotideSequence getSequence() {
            return sequence;
        }

        private NucleotideSequence sequence;
    }

The second step is to implement the Information Consumer and the Information Provider interfaces for the newly created Information type. When implementing these interfaces, one should notice two important points. The first point is that the constructors of both the Information Consumer and the Information Provider must call the parent constructor, providing the minimum and maximum number of information clients as parameters. A zero in the maximum providers field will enable the user to connect an unlimited number of information providers to the specific information consumer. These parameters are enforced in the workspace, allowing a user to connect only the allowed number of edges between the tools. The second important point is that the type method must be overridden so that it reflects the exact data type of the handled Information.
As an example, we provide the code for the nucleotide sequence information consumer and provider:

    public class SingleNucleotideSequenceInformationConsumer extends InformationConsumer {

        public SingleNucleotideSequenceInformationConsumer() {
            // Minimum 1, maximum 1: exactly one provider must be connected.
            super(1,1);
        }

        // Convenience accessor: unwraps the sequence held by the single provider.
        public NucleotideSequence getSequence() {
            return ((NucleotideSequenceInformation)getInformation(0)).getSequence();
        }

        public Class type() {
            return NucleotideSequenceInformation.class;
        }

        public String getDescription() {
            return "Single nucleotide sequence information consumer";
        }
    }

    public class SingleNucleotideSequenceInformationProvider extends InformationProvider {

        public SingleNucleotideSequenceInformationProvider() {
            // Minimum 0; a zero maximum is described above as "no upper limit".
            super(0,0);
        }

        public Class type() {
            return NucleotideSequenceInformation.class;
        }

        // Wraps the sequence in an Information object and stores it for the consumers.
        public void setSequence( NucleotideSequence seq ) {
            setInformation( new NucleotideSequenceInformation(seq) );
        }

        public String getDescription() {
            return "Nucleotide sequence information provider";
        }
    }

Creating the Tool

When constructing a new tool, one should extend the Tool class. In order to change the default image of the tool, it is recommended that a new tool image be loaded into the toolImage variable during the construction of the tool (i.e., in the constructor).

Next, the programmer should initialize the tool’s inputs and outputs in the constructor of the tool. The required number of inputs and outputs should be built from the corresponding ToolInput and ToolOutput objects. These objects can hold numerous information consumers and providers (consumers for inputs, and providers for outputs). Once the inputs and outputs are constructed, they should be added to the tool (during construction) using the addToolInput and addToolOutput methods.
As an example for the coding of a tool, we show the code for the nucleotide-pattern-histogram tool constructor:

    public SequenceHistogramTool() {
        // Replace the default tool image (ImageIcon is javax.swing.ImageIcon).
        toolImage = new ImageIcon(getClass()
                                  .getClassLoader()
                                  .getResource("images/SequenceStatisticsToolIcon.gif"));

        //
        // Setup tool input/output
        // One output, providing histogram information
        ToolOutput toolOutput = new ToolOutput( this );
        toolOutput.addInformationProvider( new HistogramInformationProvider() );
        addToolOutput( toolOutput );

        //
        // One input, consuming multiple nucleotide sequences
        ToolInput toolInput = new ToolInput( this );
        toolInput.addInformationConsumer( new MultipleNucleotideSequenceInformationConsumer() );
        addToolInput( toolInput );

        //
        // Setup tool status
        setParametersStatus( true );
        invalidate();
    }

It is important to notice that the programmer should implement the newInstance method, returning a new instance of the new tool.

A programmer should implement four out of the five tool processing steps, by overriding the following methods:

    preProcess
    process
    postProcess
    updateInformationProviders

The step in which the information is brought to the tool’s information consumers is inherited from the Tool abstract class. Nonetheless, in most cases, a programmer would prefer to bring the information one step further, from the different information consumers into the actual tool. It is recommended that this be implemented as part of the process method.

In order to allow a user to interact with the tool, so as to set its specific parameters, a programmer should override the editParameters method. This method is invoked when a user double-clicks a tool on the workspace.

Registering the Tool

The last step in the introduction of a new tool into EVE is the registration stage. In order to register a tool into EVE, a programmer should add it to a category. In case the tool belongs to a new category, this category must be constructed in the EVEFrame class, during construction.
Next, the tool should be added to the correct category, followed by the registration of the category with the toolbox. As an example, we provide the code for the registration of the statistics tools. This piece of code is taken from the EVEFrame constructor:

    // Build the category, add its tools, then register it with the toolbox.
    category = new ToolCategory("Statistics");
    category.addTool( new SequenceHistogramTool() );
    category.addTool( new HistogramViewerTool() );
    category.addTool( new GatherPositionSpecificFrequenciesTool() );
    category.addTool( new PositionSpecificFrequenciesViewer() );
    toolbox.addToolCategory( category );

Future directions

EVE is in a preliminary stage. In order to enable users to actually work with the tool, more external tools should be encapsulated. Among these tools are blastp, clustalW, ATR hunter, motif searchers, and more. Interacting with tools available on the web is also an important step toward increasing the richness of EVE. Databases available on the Internet should be accessible through EVE as well, thus providing the biology researcher with a single working center.

More file formats should be supported, apart from the plain sequence format and the FASTA format. EVE is currently nucleotide-oriented, thus extending it to support amino-acid sequences is also an important step.

The information viewers, such as the sequence viewer and the position specific nucleotide frequencies viewer, should be more stable, allowing the examination of a large number of long sequences. The viewers should also be extended, adding more visual information such as an outline of the degree of consensus between the input sequences.

More statistical tools should be implemented, giving the user the ability to locally analyze and examine the data.

Persistency support is an important feature that will enable a user to save and load workspaces.

Some work remains to be done on the improvement of the workspace: the graph layout, and a non-bounded workspace.
A scripting interface should be added to allow advanced users to interact with EVE and the data it produces through short scripts.

Another interesting direction would be to examine the possibility of scheduling the independent tasks in a pipeline on different remote computers. This way a pipeline could be executed faster.

Conclusion

EVE is an initial effort to provide a freely available, visual working environment for bioinformatics researchers. As shown in the previous sections, EVE is easy to extend with both new tools and encapsulated external tools. EVE provides basic means for the input and viewing of data. The basic Information objects provided by this version of EVE give the tool its flexibility, enabling different tools to interact.

We conclude that the EVE tool marks the path we believe should be taken toward the future working environment of bioinformatics researchers. As the number of tools grows and the information they provide becomes denser, researchers need a convenient integrated environment which will enable them to interact with these tools in an easy and non-cryptic manner.