Documentation for PathBinder in MetNet

Table of Contents (revise…

1 System Overview
2 User Manual
2.1 System Requirement
2.2 Installation Guide
2.3 PathBinder Updater
2.3.1 Download MEDLINE files
2.3.2 Update sentence repository
2.3.3 Update species filters
2.3.4 Update dictionary
2.4 PathBinder Viewer
2.4.1 User Interface
3 Programmer Guide
3.1 Database schema
3.2 Call PathBinder from Other Applications

1 System Overview

[Figure 1. System overview of PathBinder in MetNet. Components shown: MEDLINE, PathBinder Updater, PathBinderDB in MetNet, MetNetDB, PathBinder Viewer, FCModeler.]

PathBinder is a gateway that provides enhanced access to MEDLINE information about metabolic interactions.
MEDLINE baseline release files, as well as monthly update files, are searched against a precompiled dictionary, which consists of biochemical entities (metabolites, enzymes, co-factors, etc.), interaction-related verbs and sub-cellular locations. The sentences containing at least two search terms are stored in PathBinderDB, which is accessible through a GUI or a programming API. The major enhancements include:

- More relevant information retrieval. A user can query for sentences containing two biochemical entities, optionally in conjunction with interaction-related verbs or sub-cellular locations. Since the entities appear in the same sentence, they are more likely to interact with each other. If such queries were sent to PubMed directly, the entities might appear anywhere within an abstract, possibly far apart from each other and with little chance of interacting.
- Automatic synonym expansion. Any query term is automatically expanded to include its synonyms.
- Taxonomy filter. PathBinder provides a hierarchical taxonomy filter, which enables a user to restrict queries to the abstracts that mention the species of interest.

Combined with MetNetDB and FCModeler, PathBinder provides users with literature evidence for curating and simulating metabolic and regulatory pathways.

PathBinder consists of three sub-systems: PathBinderUpdater, PathBinderViewer and PathBinderDatabase (a part of the MetNet database: http://www.public.iastate.edu/~mash/MetNet/MetNet_db.htm).

PathBinderUpdater scans the entire MEDLINE release and the monthly updates off-line against a pre-compiled dictionary for sentences containing at least two search terms or their synonyms. The search terms include biochemical entities (i.e. metabolites, proteins, and RNAs, and their synonyms, imported from the MetNetDB database), interaction-related verbs (and their inflections), and selected sub-cellular locations (see Table 1 for the number of terms in each category).
PathBinderUpdater also collects species information from all abstracts by querying PubMed. PathBinderUpdater updates PathBinderDB every 2 or 3 months.

The search and query results are stored in PathBinderDB. As of February 16, 2016, it contains over 17 million scanned sentences. See the Programmer’s Guide for more details about its internal structure.

PathBinderViewer is a stand-alone, easy-to-use GUI for intuitive access to the information in PathBinderDB. The PathBinderViewer search tool also provides an API that can be invoked from other applications, such as FCModeler and MetNetDB. Thus, a user working in software such as FCModeler or MetNetDB can click on any two terms on the screen and invoke a PathBinderViewer window for those two terms.

Table 1. Statistics of PathBinderDB (February 16, 2016)

Biochemical entities and synonyms      114,286
Verbs and inflections                  432
Sub-cellular locations and synonyms    76
Stored sentences                       17,586,396

2 User Manual: PathBinderViewer, a biologist's search tool

This chapter describes how to install and use PathBinderViewer as a stand-alone application.

2.1 System Requirement

To use PathBinderViewer, a user needs a computer running Windows, Mac OS, Linux or Unix with Java J2SE 1.4 (or higher) and Java WebStart installed. A web browser and an Internet connection are also required.

2.2 Start PathBinderViewer

Type the following URL into a web browser’s address bar:

http://metnetdb.gdcb.iastate.edu/PbFrame.jnlp

If this is the first time PathBinderViewer is run on the computer, Java WebStart will download the PathBinderViewer program automatically from the server. After the download has finished, WebStart will ask the user for permission to access the local hard disk and the Internet connection. Please click “Yes”. WebStart will also ask whether to put a shortcut on the desktop. If a shortcut is created, PathBinderViewer can be started directly by double-clicking the shortcut, without the need to open a web browser and type in the URL.
If this is not the first run, WebStart will check for updates to the program and automatically download them if there are any. No user intervention is required. This ensures that a user is always running the latest version of the program.

2.3 User's guide to PathBinderViewer: a biologist's search tool

2.3.1 User Interface

The PathBinderViewer user interface consists of two separate windows: the control window (Fig. 2) and the sentence window (Fig. 3). The control window is for specifying the search terms (entities, verbs, and/or locations) and filters (taxonomy and/or location). The difference between a term and a filter is that a term must be present in the sentence, while a filter can appear anywhere in the abstract. For example, if a user specifies PKC and acetyl-coA carboxylase as two search terms, and Arabidopsis as a filter, PathBinderViewer will retrieve sentences containing both PKC and acetyl-coA carboxylase from the abstracts that contain Arabidopsis anywhere within them. The retrieved sentences are displayed in the sentence window with query terms highlighted. In addition, the synonyms of entity terms are listed.

[Figure 2. PathBinderViewer: the control window]

The upper half of the control window is for setting up the taxonomy filter, and the lower half is for specifying biochemical entities, verbs and/or sub-cellular locations. The entities and the verbs can only be used as search terms, while the locations may work in either term or filter mode.

2.3.2 Set up taxonomy filter

There are three options for using the species filter:

- No filter (all abstracts qualify).
- Use the simplified taxonomy tree in the upper-left window for commonly studied species of green plants. Hold down the “Shift” or “Ctrl” key while clicking tree nodes to set up a filter that includes multiple species.
- Search the NCBI taxonomy database for plant species that are not present in the simplified tree. Type the keyword(s) in the “search term” field, and click the search button in the “Taxonomy search result” panel. Check the box(es) below the search field if you want an exact and/or case-sensitive search. By default, the search result overwrites the previous searches. However, if the “append” checkbox is checked, the new search result is appended to the previous ones. The species in the search result list (upper-right window) can be removed with the “delete” or “clear” commands, sorted, and/or saved to a file for future use. Highlight the items in the result list to set up the taxonomy filter (hold down the “Shift” or “Ctrl” key while clicking an item for multiple selections).

[Figure 3. PathBinderViewer: sentence window]

After highlighting the species names of interest in the simplified tree or in the search result, click the “Update filter” button. If sub-cellular locations are used in filter mode, also select the locations before clicking the “Update filter” button. The number of abstracts that pass the filter will be displayed as “PMID count.”

2.3.3 Specify search terms

A user may ask PathBinder to show sentences that contain three types of search terms:

- One or more interaction-related verbs (and their inflections), optional;
- One or more sub-cellular locations (and their synonyms), optional;
- One or two entity names. If neither verbs nor locations are requested, two entities are required. Otherwise, at least one entity is required.

Sub-cellular locations may also work in filter mode.

Verbs and locations are organized into hierarchical groups in the lower-left and lower-right windows (Fig. 2). A single verb (location) or a group of verbs (locations) can be specified by clicking a tree node. Hold down the “Shift” or “Ctrl” key while clicking a node for multiple selections. Either tree may be quickly disabled by unchecking the box above the tree.
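The entity-count rule stated in section 2.3.3 can be written down as a small check. The following is only a sketch of that rule as worded in this manual; the class name, method name and parameters are hypothetical and are not part of PathBinder. It also assumes that locations used in filter mode do not count as search terms.

```java
// Hypothetical helper, not part of PathBinder: encodes the rule from section 2.3.3.
class QueryRule {

    /**
     * @param entitySlots      number of entity lists in use (Entity 1, Entity 2)
     * @param hasVerbs         true if at least one verb is selected as a search term
     * @param hasLocationTerms true if at least one location is selected in term mode
     *                         (assumption: filter-mode locations do not count)
     */
    static boolean isValidQuery(int entitySlots, boolean hasVerbs, boolean hasLocationTerms) {
        if (entitySlots < 1 || entitySlots > 2) {
            return false;                 // one or two entity names are allowed
        }
        if (!hasVerbs && !hasLocationTerms) {
            return entitySlots == 2;      // no verbs or locations: two entities required
        }
        return true;                      // otherwise one entity is enough
    }

    public static void main(String[] args) {
        System.out.println(isValidQuery(1, false, false)); // one entity, nothing else
        System.out.println(isValidQuery(1, true, false));  // one entity plus a verb
        System.out.println(isValidQuery(2, false, false)); // two entities
    }
}
```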
There are three ways to populate the entity lists (“Entity 1” and “Entity 2”):

- Search the “Entity” and “Entity_synonym” tables of the MetNet database. Type the keyword(s) in the “search term” field, and click the search button in the “Entity 1” or “Entity 2” panel. Check the box(es) below the search field for an exact and/or case-sensitive search. The search result may overwrite or be appended to the previous searches (“append” checkbox). The list may be saved for future use.
- Load entity names from a file. The file may be a previously saved search result, generated by other programs (MetNetDB, FCModeler, etc.), or created manually in any text editor. The file is in plain text format, with one entity per line. Each line contains two fields separated by a vertical bar (“|”). The first field is the EntityID in the “Entity” or “Entity_synonym” table, and the second is the entity name.
- Insert entities directly from other programs (MetNetDB, FCModeler, etc.). See “Programmer’s Guide” for more details.

The “Entity 2” list can additionally be populated by copying from the “Entity 1” list. Entities in both lists can be removed with the “delete” or “clear” commands. The lists may be sorted alphabetically in ascending order.

2.3.4 Use sub-cellular locations as a filter

By selecting the corresponding radio button, a user can use sub-cellular locations in filter mode. After selecting the location node(s), as well as the species names if the taxonomy filter is also used, click the “Update filter” button.

2.3.5 Display sentences

After setting up the species/location filter and specifying the search terms, click the “Show sentences” button to bring up the sentence window (Fig. 3). The requested sentences are displayed in the main window with search terms highlighted in different colors. All of the synonyms of entity 1 and/or entity 2 are shown in the drop-down lists above the main window.
The PMID at the beginning of each sentence is a clickable link, which opens the corresponding PubMed abstract in a web browser.

3 System Administrators’ Guide to PathBinderUpdater

This chapter describes how to install and use PathBinderUpdater as a stand-alone application. PathBinderUpdater is used by the MetNetDB system administrator to keep PathBinderDB up to date with regard to MEDLINE. Updating every 2 or 3 months is recommended. PathBinderUpdater has four main functions:

1. Download MEDLINE XML release files and monthly updates;
2. Scan downloaded MEDLINE files against a dictionary to update the sentence repository;
3. Query PubMed to update species/location filter information;
4. Rebuild the dictionary if necessary (i.e. when many changes have been made to the “Entity” and “Entity_synonym” tables in the MetNet database).

3.1 System Requirement and Installation

PathBinderUpdater is already installed on the server (metnetdb.gdcb.iastate.edu). No further installation is required. If the server is moved to a new computer, follow the steps below to set up PathBinderUpdater on the new server.

- The new server must have Java (J2SE 1.4 or higher) installed. There is no preference for operating systems.
- Copy the content of the entire “c:\pb” folder (including sub-folders) to the new server.

  cleaned.tax            Green plant taxonomy dictionary
  pb.dic                 Search term dictionary
  pb_metnet_src.jar      Java source code
  pbframe.bat            PathBinderViewer executable
  pbupdater.bat          PathBinderUpdater executable
  sample_entity.list     Sample entity list
  sample_species.list    Sample species list
  singles_no_abb.txt     Plural noun dictionary
  stop.list              Stopwords for term dictionary
  lib\icons.zip          Icons
  lib\liquidlnf.jar      Liquid L&F library
  lib\pb_metnet.jar      PathBinderUpdater library
  lib\pbframe.jar        PathBinderViewer library

- Register the new server’s IP address at the MEDLINE FTP server, which can be requested via email by the licensee of MEDLINE.
  [Visit http://www.nlm.nih.gov/bsd/licensee.html for more details about licensing MEDLINE and registering an IP address]

3.2 Download MEDLINE files

[Figure 4. PathBinderUpdater user interface (MEDLINE FTP client). Labeled regions: remote folder, file type filter, remote folder content, local folder, local folder browser, local folder content.]

Users can browse files in four folders on the remote FTP server:

- Baseline gz (MEDLINE baseline release files compressed in gzip format)
- Baseline zip (MEDLINE baseline release files compressed in zip format)
- Monthly update (MEDLINE monthly update files in gzip and zip format)
- Sample (sample MEDLINE files)

A user may select a file type filter to show only a certain type of file (zip only, gz only, or all types) in the remote folder content window. There are several ways to select the files for download:

- Click a file directly to select a single file;
- Press and hold down the “Shift” or “Ctrl” key while clicking a file to select multiple files;
- Use the “Select all” button to select all files shown in the remote folder (files masked by the file type filter are not selected);
- Use the “Select undownloaded” button to select all files shown in the remote folder that are not in the local folder.

Once one or more remote files are selected, the “download” button is enabled. Clicking the button starts the download.

Downloading and verifying the md5 checksum files is optional. It is designed for unreliable network connections. If a file does not pass the md5 check, it should be downloaded again. However, no mismatch occurred while downloading 619 baseline and monthly update files.

By default, the downloaded MEDLINE files are stored in the current working folder on the local computer. The local folder can be changed by using the “local folder browser” or by typing directly in the text field.

3.3 Update sentence repository

[Figure 5. PathBinderUpdater user interface: Sentence repository. Labeled regions: entity dictionary, plural noun dictionary, local MEDLINE folder.]

The controls on the “Sentence repository” tab are disabled if the computer is not connected to MetNetDB. To connect to MetNetDB, type in the MetNet server name (metnetdb.gdcb.iastate.edu), database name (dj_metnet), username and password, and click the “Connect MetNet” button.

To scan MEDLINE sentences, PathBinderUpdater needs a term dictionary, a plural noun dictionary, and the name of the folder where the MEDLINE files are stored.

The default term dictionary is pb.dic, and the default plural noun dictionary is singles_no_abb.txt. To use dictionaries other than the defaults, type the file path directly in the corresponding text field, or select a file using the “…” button. The entity dictionary needs periodic updates when the “Entity” and “Entity_synonym” tables in the MetNet database have changed significantly. There is no need to update the plural noun dictionary.

The MEDLINE folder is where the downloaded MEDLINE release files are stored. Once a folder is specified (typed in or selected through “…”), PathBinderUpdater searches the folder for files whose names end with .xml, .xml.gz, or .xml.zip, and lists them in the “Unprocessed files” window. All of the files in this list will be scanned. If some of the files were scanned before, highlight them (hold down the “Shift” or “Ctrl” key for multiple selection) and click the “<” button to move them into the “Processed files” window. After scanning, the files are automatically moved to the “Processed files” window. The “<<” button moves all files from the unprocessed window to the processed window, and the “>” and “>>” buttons move files in the opposite direction.

The “Clear sentence repository” button deletes all of the scanned sentences in the database so that the repository can be rebuilt. It should be used after the entity dictionary is updated. The “Delete duplicated sentences” button is for cases where the same abstracts have been scanned multiple times.
This may happen accidentally, or because there are some duplicated PMIDs in the MEDLINE monthly update files.

3.4 Update filters

The controls on this tab are disabled if the computer is not connected to MetNetDB. To connect to MetNetDB, type in the MetNet server name (metnetdb.gdcb.iastate.edu), database name (dj_metnet), username and password, and click the “Connect MetNet” button.

To update the filters, select (using the “…” button) or type in the species dictionary filename and the location dictionary filename, and click the “Update” button. The default dictionaries are cleaned.tax and location.dic.

Optionally, a user can type in the date of the last update, so that the program only processes MEDLINE records added since then. However, specifying a date saves little running time (only about 5 to 10%). If no date is given, the previous filters should be deleted, either with the “Clear” button before the update or with the “Delete duplicate” button after the update. Using “Clear” is much faster than “Delete duplicate.”

After the update, the user should rebuild the simplified taxonomy tree (“Build simplified tree” button).

[Figure 6. PathBinderUpdater user interface: Species]

3.5 Update dictionary

The controls on the “Dictionary” tab are disabled if the computer is not connected to MetNetDB. To connect to MetNetDB, type in the host name, database name, username and password, and click the “Connect MetNet” button.

Since the entire sentence repository needs a complete rebuild after the dictionary is updated, the update should be carried out only when necessary, i.e. when significant changes have been made to the “Entity”, “Entity_synonym”, “Location” and “Location_synonym” tables in MetNetDB.

To create new dictionaries, type their filenames in the corresponding fields. To overwrite existing dictionaries, type in the old filenames or select them with the “…” button.
An optional stopword file can be specified to exclude entity names that are also common English words. A sample stopword file, stop.list, is in the working directory (c:\pb).

[Figure 7. PathBinderUpdater user interface: Sentence repository]

4 Programmer’s Guide

This chapter describes the internal structure of the PathBinder database, and how to use PathBinder from other applications (e.g. MetNetDB, FCModeler).

4.1 Database schema

PathBinder adds 15 tables to the MetNet database, which can be conceptually divided into 4 groups around the “pb_sentence” table (Fig. 8). The “pb_sentence” table stores the sentences that contain at least two search terms, together with the PMIDs to which they belong.

The group of location (verb) tables registers which location (verb) appears in which sentence, and holds the information about the locations (verbs).

The group of entity tables registers which entity appears in which sentence, similar to the verb and location groups. However, unlike the other two groups, there is an additional intermediate layer of termIDs between sentenceIDs and entityIDs. EntityIDs are first mapped to termIDs (pb_term2entity), and then mapped to sentenceIDs indirectly through termIDs (pb_term_hits). This design is a response to the high ambiguity in MetNet entity names (i.e. a name or synonym corresponds to multiple entityIDs), and its purpose is to save table space and improve query speed. If entityIDs were used directly to register hits in the sentences, every ambiguous entity name (synonym) would need multiple copies of its record in the database, one copy for each entityID. If several ambiguous names were found in the same sentence, every possible combination of the entityIDs would need a copy of the record. If the three most ambiguous names in the current version of MetNet happened to be in the same sentence, it would need more than 5,000,000 copies. Therefore, a unique termID is assigned to every unique name (synonym), and only 1 copy is needed to register a hit.

[Figure 8. PathBinder database schema. Group labels: Entities, Locations, Verbs, Taxonomy. Tables shown: Entity, Entity_synonym, pb_term, pb_term2entity, pb_term_hits, location, location_synonym, pb_location_hits, pb_loc2pmid, pb_verb, pb_verb_variant, pb_verb_hits, pb_sentence, pb_prefilter, pb_tax2pmid, tax_simplified, green_plants_node, green_plants_name.]

The group of taxonomy tables stores which PMIDs are associated with which taxonomy nodes. Please note that some information in “tax_simplified” and “tax_prefilter” is redundant. It is there for faster queries.
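The saving from the termID layer can be illustrated with a short calculation. The sketch below follows the reasoning in section 4.1: registering hits by entityID needs one record per combination of entityIDs, whereas registering hits by termID needs one record per name found. The class and method names are hypothetical, and the ambiguity counts (150, 180 and 200 entityIDs per name) are invented for illustration; the manual only states that the real worst case exceeds 5,000,000 copies.

```java
// Hypothetical illustration, not part of PathBinder.
class TermLayerCost {

    /** Records needed if every combination of entityIDs is stored for one sentence. */
    static long recordsWithEntityIds(int... entityIdsPerName) {
        long records = 1;
        for (int count : entityIdsPerName) {
            records = Math.multiplyExact(records, (long) count); // fails loudly on overflow
        }
        return records;
    }

    /** Records needed if each name is registered once through its termID. */
    static long recordsWithTermIds(int... entityIdsPerName) {
        return entityIdsPerName.length;
    }

    public static void main(String[] args) {
        // Invented ambiguity counts for three names found in the same sentence.
        int[] ambiguity = {150, 180, 200};
        System.out.println("entityID combinations: " + recordsWithEntityIds(ambiguity));
        System.out.println("termID hits:           " + recordsWithTermIds(ambiguity));
    }
}
```

With these invented counts the direct scheme would need 150 × 180 × 200 = 5,400,000 records for a single sentence, while the termID scheme needs three hits.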
4.2 Call PathBinderViewer from Other Applications

Although other applications may retrieve PathBinder sentences directly from the tables described above, doing so is not recommended unless it is absolutely necessary. PathBinderViewer provides an API that lets other applications use PathBinder easily. To integrate PathBinder into another application, include “pb_metnet.jar” in the class path. A programmer may call PathBinderViewer from the control window or the sentence window.

4.2.1 Call PathBinderViewer from the control window

The biggest advantage of invoking PathBinderViewer at the control window is its easy-to-use and flexible interface for setting up the species filter and specifying locations and/or verbs. To activate the control window, simply create a new instance of PbFrame, and call its show() method.

    import edu.iastate.jtm.util.MySqlConnector;
    import fcmodeler.pathbinder.gui.PbFrame;
    ...
    DbConnector db = new MySqlConnector("metnetdb.gdcb.iastate.edu", // server
                                        "dj_metnet",                 // database
                                        "dingjing",                  // username
                                        "dingjing");                 // password
    PbFrame pbf = new PbFrame(db);
    pbf.pack();
    pbf.show();
    ...

The species filter setup and the location (verb) selection cannot be done from outside the control window; that is the point of using the control window. If this is not acceptable, consider the other option (invoking the sentence window directly). On the other hand, there are several ways to populate the two entity lists, including ones from outside the control window.

- Use PathBinderViewer’s built-in search capability to find entities.
- Read from entity files. The files can be generated by other applications.
- Insert entities into the lists directly from the other application using PbFrame’s addEntity() method, which makes PathBinder appear more tightly integrated with the application.
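For the entity-file option, another application only needs to read or write the plain text format described in the User Manual: one entity per line, with the EntityID and the entity name separated by a vertical bar. The following is a minimal sketch of that format, not PathBinder code. The class and method names are hypothetical, the EntityIDs in the example are invented, and splitting at the first bar and skipping blank lines are assumptions about details the manual does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PathBinder: reads and writes "EntityID|name" lines.
class EntityListFile {

    /** One line of an entity file. */
    static final class Entry {
        final int entityId;
        final String name;

        Entry(int entityId, String name) {
            this.entityId = entityId;
            this.name = name;
        }
    }

    /** Parses a single "EntityID|name" line, splitting at the first vertical bar. */
    static Entry parseLine(String line) {
        int bar = line.indexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("Missing '|' separator: " + line);
        }
        int id = Integer.parseInt(line.substring(0, bar).trim());
        return new Entry(id, line.substring(bar + 1).trim());
    }

    /** Parses all lines, skipping blank ones (an assumption, see above). */
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                entries.add(parseLine(line));
            }
        }
        return entries;
    }

    /** Formats an entry back into the file format. */
    static String formatLine(Entry entry) {
        return entry.entityId + "|" + entry.name;
    }

    public static void main(String[] args) {
        // Invented EntityIDs; real values come from the MetNet "Entity" tables.
        List<String> lines = Arrays.asList("101|PKC", "", "202|acetyl-coA carboxylase");
        for (Entry entry : parse(lines)) {
            System.out.println(formatLine(entry));
        }
    }
}
```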
After setting up the species filter and specifying the entities, locations and/or verbs, the sentence window can be brought up by clicking the “Show sentences” button in the control window, or by calling PbFrame’s showPathBinder() method.

4.2.2 Call PathBinderViewer from the sentence window

PathBinderViewer can also be invoked from the sentence window, if the species filter, verb and location capabilities are not necessary or do not meet the requirements of the other application. The following sample code shows how to invoke the sentence window.

    import edu.iastate.jtm.util.MySqlConnector;
    import fcmodeler.pathbinder.PathBinder;
    import fcmodeler.pathbinder.PbSentence;
    import fcmodeler.pathbinder.gui.PbViewer;
    ...
    /* Create an instance of PathBinder, and send query to database. */
    DbConnector db = new MySqlConnector("metnetdb.gdcb.iastate.edu", // server
                                        "dj_metnet",                 // database
                                        "dingjing",                  // username
                                        "dingjing");                 // password
    PathBinder pb = new PathBinder(db);   // Sentence querier
    pb.sendQuery(eid1,       // int, entityID of entity 1
                 eid2,       // int, entityID of entity 2
                 location,   // Set, locationIDs in String (could be null)
                 verb);      // Set, verbIDs in String (could be null)

    /* Retrieve sentences and entity synonyms */
    PbSentence[] sens = pb.getSentence(true);   // sorted on PMIDs
    Set term1 = pb.getEntity1TermID();
    Set t1name = pb.getEntity1TermName();
    Set term2 = pb.getEntity2TermID();
    Set t2name = pb.getEntity2TermName();

    /* Create an instance of PbViewer, and display the retrieved sentences */
    PbViewer pbViewer = new PbViewer();   // The sentence window
    pbViewer.setSentences(sens, term1, term2, t1name, t2name, location, verb);
    pbViewer.pack();
    pbViewer.show();
    ...

Optionally, a species/location filter can be set up before sending the query for sentences.

    import fcmodeler.pathbinder.TaxonomyFilter;
    ...
    /* Create an instance of PmidFilter. */
    PmidFilter pf = new PmidFilter(db);
    pf.addTaxEntry("33090", 124376, 222641);   // Add an entry for "green plants"
    pf.updateFilterTable(true, null);          // Create the filter
    ...
    PathBinder pb = new PathBinder(db);        // Sentence querier
    pb.useFilter(true);
    pb.sendQuery(...
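The two integers passed to addTaxEntry() alongside the taxonomy ID look like a pair of left_id and right_id values, which are columns of the taxonomy tables in section 4.1. The manual does not explain how these columns are used. If they follow the common nested-set encoding of a tree (an assumption, not something the source confirms), a node belongs to the subtree of a filter entry exactly when its interval lies inside the entry's interval. The sketch below shows that test; the class and method names are hypothetical, and the species intervals in the example are invented.

```java
// Hypothetical illustration of a nested-set subtree test, not part of PathBinder.
class NestedSetCheck {

    /**
     * Returns true if the node interval [nodeLeft, nodeRight] lies inside the
     * root interval [rootLeft, rootRight], i.e. the node is the root itself
     * or one of its descendants under nested-set numbering.
     */
    static boolean isInSubtree(int nodeLeft, int nodeRight, int rootLeft, int rootRight) {
        return rootLeft <= nodeLeft && nodeRight <= rootRight;
    }

    public static void main(String[] args) {
        // Interval taken from the addTaxEntry() example for "green plants".
        int rootLeft = 124376;
        int rootRight = 222641;

        // Invented intervals for two species nodes.
        System.out.println(isInSubtree(150000, 150001, rootLeft, rootRight)); // inside
        System.out.println(isInSubtree(300000, 300001, rootLeft, rootRight)); // outside
    }
}
```

Under this encoding a whole subtree can be selected with a single range condition on left_id, which may be why the filter entry carries the interval in addition to the taxonomy ID.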