Aaron Schoenhofer

Introduction

The purpose of this project is to represent data in a way that is simultaneously readable and meaningful for humans and software. To be human-readable, data must be represented in an organized, user-friendly fashion (Noy, 2004) that does not overwhelm the reader, especially when large datasets are involved. Both humans and machines need some sort of context or semantics in order to attach meaning to a piece of data (Hussain, 2006). An ontology is used to represent datasets of experimental results from protein-protein interaction (PPI) experiments. The datasets contain information on the proteins involved in the interaction, what kind of interaction occurred, and references and database accession numbers. Displaying data in the form of individuals and properties improves human-readability. The XML/RDF format allows machine-readability. Explicit relationships give context.

Methods

I used the Protege-OWL editor (http://protege.stanford.edu/) to create and modify the ontology. Classes and properties in the ontology are based on the PSI MI format (http://psidev.sourceforge.net/mi/rel25/). ProtegeOWL also has an API that can be used with Java to manipulate an ontology. Java and the SAX parser are used to parse input files. Eclipse is the IDE.

The Protege-OWL editor can automatically generate Java code for the classes and properties in the ontology. With the ontology project open, select the Code menu. From the Code menu, select Generate ProtegeOWL Java Code. A pop-up window has options for setting the output directory, package name, and Factory class name. I named the package owlOntCode25 and the factory OwlOntCode25Factory. After clicking OK, the output directory should contain two packages. One package has the name that was selected in the code generation window, and the second package has the same name with 'Impl' appended. Instances of these classes can be created and used to populate an ontology.

To set up an Eclipse project, create a new Eclipse Java project in the workspace and import the directories containing the packages generated by Protege-OWL into the project's source directory. To use the ProtegeOWL API, several jar files need to be included in the build path. To do this, go to Java Build Path in the project properties menu. Select the Libraries tab, then the Add External JARs button. The jar files needed are in the Protege installation directory and in the subdirectory plugins/edu.stanford.smi.protegex.owl of the Protege installation directory.

Alternatively, unzip the Eclipse project at http://www.cis.ksu.edu/~aps7777/Ontology. Import it into Eclipse, or create a new project and import the files into it. Then set the build path by adding the jar files in the P-OWL-JARs directory with the Add JARs button.

A few data structures are parameterized and cause build errors when not using Java 1.5. Changing the compiler to 1.5 removes the build errors. This can be done by selecting Java Compiler in the project properties menu. Check the Enable Project Specific Settings box, and change the compliance level to 5.0. Then click Yes to rebuild.

To run the program, use PsiMiXmlParser as the main class. Arguments are optional; if none are supplied, four xml files are parsed by default. The files are MIF versions 1.1, 2.0, and 2.5.

Code Description

Package xmlInOwlOut

Classes in the xmlInOwlOut package parse xml files and use the extracted data to instantiate classes and their properties. New individuals are instantiated with the help of the owlOntCode and owlOntCode.impl packages. After creating new individuals from the data in the xml files, a new owl ontology file is written to disk. The new ontology containing the newly instantiated individuals can be viewed in the Protege-OWL editor.

PsiMiXmlParser.java - This class contains the main method for the program. It reads and loads any files given as command line arguments.
It then parses the data in the xml files and uses it to instantiate the DbSourceEntry and DbInteractionEntry classes. Only one instantiation of DbSourceEntry is made per input file. The DbSourceEntry class keeps a record of where the xml file came from (MIPS, BIND, etc.) and is used later when creating individuals in the ontology. A DbInteractionEntry instance is created for each protein-protein interaction in the xml file. When all files have been parsed, PsiMiXmlParser calls the method processOutputQueue in OwlOutput.java. There are three methods that do most of the parsing. One of the three is chosen after reading the level and version attributes from the entrySet tag.

DbSuperEntry.java - This is an abstract class and the superclass of DbSourceEntry and DbInteractionEntry. A HashMap is used for storing information parsed from the xml input files. To make the information accessible, HashMap keys are named after the data that can be retrieved using the key. DbSuperEntry also has methods for putting data into and getting data from the HashMap. Although HashMap's built-in method for retrieving values may return null, DbSuperEntry's method for retrieving map values always returns a non-null value. This method and the reason for it are explained in the Code Examples section below.

DbInteractionEntry.java - This is a subclass of DbSuperEntry. DbInteractionEntry's inherited HashMap is used as a container for the data within each pair of interaction tags in the input files. Each instance of DbInteractionEntry represents one interaction and is used to instantiate classes and properties defined in the ontology.

DbSourceEntry.java - This is also a subclass of DbSuperEntry. The difference between this subclass and DbInteractionEntry is that each instance of DbSourceEntry represents a source instead of an interaction. In a PSI MI record, source refers to the provider of the data (http://psidev.sourceforge.net/mi/xml/doc/user/).
The PSI MI files for this project usually come from MIPS or BIND. One DbSourceEntry is created for each parsed input file, and it contains the data within a pair of source tags of a file in PSI MI format.

OwlOutput.java - As input files are parsed, the data is placed in instances of the subclasses of DbSuperEntry, and these are placed in a queue in the OwlOutput class. When all input files have been parsed, the entries are dequeued. New instances of classes in the ontology are created from the data contained in each entry. OwlOutput uses the instanceof operator to determine whether the data should be used to create classes and properties of Source or Interaction. When the queue is empty, the populated ontology is written to disk as an owl file and can be opened in the Protege-OWL editor.

OwlOutputHelperFunctions.java - This class is used by OwlOutput when attempting to create new instances of classes in the ontology. Using the owl API to instantiate an individual with a non-unique name causes errors. Therefore, all methods in OwlOutputHelperFunctions that directly instantiate new individuals are declared private. OwlOutput must call the public createNew method, which gets a unique name by calling the getNextUniqueName method before attempting to create a new individual. Every time a new individual is created, its name is also inserted into a HashSet. The getNextUniqueName method is recursive and will not return until it has found a name that is not in the set of used names.

Classes in the owlOntCode and owlOntCode.impl packages were automatically generated by Protege-OWL. Calling the factory method creates and returns new individuals of classes in the ontology. The returned individuals can then be used for setting their specific properties. Individuals that are successfully created will be included in the populated ontology file that is written to disk just before the program ends.

Code Examples

Attempting to instantiate classes or properties with a null value causes errors.
If a class or property is not instantiated at all, it may cause problems later. Instead of returning a null value, a string of the form Key_<key>_NotFound is returned in case the code that invoked getFromMap() isn't expecting a null value.

    // If ret != null, return ret; otherwise return a not-found message.
    // Prevents returning null and helps debugging.
    public String getFromMap(String k) {
        String ret = entryMap.get(k);
        return ( (ret != null) ? ret : ("Key_" + k + "_NotFound") );
    }

Creating a working model based on an ontology loaded from an online repository.

    String uri = "http://www.cis.ksu.edu/~aps7777/Ontology/OWL-Repos/PSI-MI-Ontology25/PMO25.owl";
    owlModel = ProtegeOWL.createJenaOWLModelFromURI(uri);

Once the input has been processed and individuals inserted, the following code saves the populated ontology to a local file.

    // Find the current working directory and base the file name on the current time.
    String outputBaseStr = getOutputBaseFN();

    // Set project and owl file names. Then set namespace.
    Project project = (Project) owlModel.getProject();
    project.setProjectFilePath( outPathStr + baseFN + ".pprj" );
    JenaKnowledgeBaseFactory.setOWLFileName(project.getSources(), outPathStr + baseFN + ".owl" );

    // Namespace for a local file should be in the form
    // file:///<directory>/<OwlFileName>#
    String newNamespace = outPathStr.replaceFirst("/", "///") + baseFN + ".owl#";
    NamespaceManager nsMan = owlModel.getNamespaceManager();
    nsMan.setDefaultNamespace( newNamespace );

    // Try to save and print result.
    Collection errors = new ArrayList();
    project.save(errors);
    if (errors.size() == 0) {
        System.out.println("Saved as " + outPathStr + baseFN + ".owl");
    } else {
        printErrors("saving file", errors);
    }

Creating a new Primary Reference. The parameter indName is a unique name for the individual, and the other three parameter values are set according to data parsed from the input files.

    private PrimaryRef newPrimaryRef(String indName, String id, String db, String vers)
    {
        PrimaryRef ret;
        ret = new OwlOntCodeFactory(owlModel).createPrimaryRef(indName);
        usedNames.add(indName);
        ret.addId(id);
        ret.addDb(db);
        ret.addVersion(vers);
        return ret;
    }

The returned instance of PrimaryRef can be set as another individual's hasXref property. For example, to set a reference for an individual of a class:

    PrimaryRef pRef = newPrimaryRef( ... );
    individualName.addHasXref( pRef );

    // Call getNextUniqueName() first. If the name is not unique, getNextUniqueName()
    // will return a similar but unique name for an individual.
    // Works by incrementing the last digit(s) of the name. If <ClassName>-0 is taken,
    // try <ClassName>-1 and so on until a unique name is found, then return.
    public String getNextUniqueName(String className, DbSuperEntry dbe) {
        String newName = dbe.getSourceDbName() + className + "-0";
        if ( !usedNames.contains(newName) ) {
            return newName;
        } else {
            return incrementSuffix(newName);
        }
    }

Three methods from the auto-generated owlOntCode package.

    public RDFProperty getHasNamesProperty() {
        final String uri = "http://www.cis.ksu.edu/~aps7777//Ontology/OWL-Repos/PSI-MI-Ontology/PMO.owl#hasNames";
        final String name = getOWLModel().getResourceNameForURI(uri);
        return getOWLModel().getRDFProperty(name);
    }

    public void setShortLabel(Collection newShortLabel) {
        setPropertyValues(getShortLabelProperty(), newShortLabel);
    }

    public RDFProperty getShortLabelProperty() {
        final String uri = "http://www.cis.ksu.edu/~aps7777//Ontology/OWL-Repos/PSI-MI-Ontology/PMO.owl#shortLabel";
        final String name = getOWLModel().getResourceNameForURI(uri);
        return getOWLModel().getRDFProperty(name);
    }

The error below is caused by both newSource() and newPrimaryRef() setting the same property to the same value. newPrimaryRef() first sets a Source as its isBibRef property. Then the Source also sets the same PrimaryRef as its hasBibRef. hasBibRef and isBibRef are inverse properties.
If one is set, then the inverse is set as well. To fix this error, set either the property or its inverse, not both. I removed the line ret.addIsBibRef(isRefOf); from newPrimaryRef().

    public Source newSource(String className, DbSourceEntry se) {
        ...
        Source ret = new OwlGeneratedFactory(owlModel).createSource(indName);
        PrimaryRef pref = newPrimaryRef( ... );
        ret.addHasBibRef( ((BibRef) pref) );
        ...
    }

    private PrimaryRef newPrimaryRef(String indName, String id, String db, Source isRefOf) {
        PrimaryRef ret = new OwlGeneratedFactory(owlModel).createPrimaryRef(indName);
        ...
        ret.addIsBibRef(isRefOf);
        return ret;
    }

    [OWLFrameStore] Warning: Attempted to assign duplicate value to MIPSSource-0.hasBibRef

Reasoner

    DIGReasoner r = (DIGReasoner) drf.create( conf );
    // now make a model
    ...
    OntModel m = ModelFactory.createOntologyModel( spec, null );
    m.read( fn );
    }
    // list inconsistent classes
    StmtIterator i = m.listStatements( null, OWL.equivalentClass, OWL.Nothing );
    while (i.hasNext()) {
        System.out.println( "Class " + i.nextStatement().getSubject() + " is unsatisfiable" );
    }

Logger

    logger.info("Program complete at " + endTime + "\n");
    long time = (endTime-initTime);
    logger.info("Program Duration: " + ( time/1000.0 ) + " seconds\n");

Screenshots

Screenshot 1. Inheritance - Subclasses inherit properties from their superclasses (Hussain; Sachs). The screenshot shows an instance of PrimaryRef. Some properties have values and some have a red outline. The red outline indicates that a required property value is missing, but the values are missing because PrimaryRef should not have all of these properties. They are inherited from its superclass, Source. PrimaryRef should be a sibling class of Source, not a subclass. Another example of incorrectly using inheritance is with InteractionList and Interaction. Because InteractionList is a list of Interaction instances, the Interaction class sounds like it could be a subclass of InteractionList.
However, this means Interaction would also inherit the hasInteraction property, which does not conform to PSI MI. To use inheritance correctly, a superclass should only have properties that its subclass should have as well. BibRef and Xref are the correct superclasses of PrimaryRef. Inheritance is correctly used in this case because the superclasses do not have any properties that the subclass should not. InteractionDetection could be the superclass of ProteinParticipant. The superclass has the properties hasNames and hasXref. The subclass, ProteinParticipant, should also have these properties and can be specialized with its own properties like hasInteractionWith and hasHostOrganism. Inheritance can reduce complexity by making relationships explicit and by making re-use of code and design elements easier.

Screenshot 2. Four input files were successfully parsed. The output .owl file is open in the Protege-OWL editor. Input files were MIF versions 1.1, 2.0, and 2.5. Because each version has its differences, I wrote a separate parser for each version. The parser to be used is determined at run-time by reading the xml tag containing version information, which is in the first few lines of a MIF xml file.

Screenshot 3. Displaying the Names properties of an instance of Participant.

Screenshot 4. Ontology in the OWLViz tab.

    --- Interactions ---
    BINDProteinParticipant-0 -> BINDProteinParticipant-1
    BINDProteinParticipant-1 -> BINDProteinParticipant-0
    ...
    MIPSProteinParticipant-10 -> MIPSProteinParticipant-11
    MIPSProteinParticipant-11 -> MIPSProteinParticipant-10
    --- End Interactions ---

Output. Print participants in all instances of Interaction.

Problems While Coding

Correctly parsing input files was the first problem I ran into while coding. The xml structure of a file from one source is usually different from the xml structure of a file from a different source. Most sources did not fully conform to the PSI MI format.
Although revisions to PSI MI have improved the format (HH2004; http://psidev.sourceforge.net/mi/rel25/changes1To25.html), the parser needs to handle files that were created according to previous PSI MI versions. Because of this, I had to write several different parsing methods in order to make the program robust.

Testing the parser was also difficult, mainly because the values returned by the parser had to be checked by hand. PSI MI files from BIND are not indented, and most files are so large that only a small subset of values can be checked. The DIP website has a MIF File Viewer that is helpful for displaying a PSI MI file in human-readable format, but it only works on files that conform exactly to PSI MI 2.5.3 (http://dip.doe-mbi.ucla.edu/dip/MIFut.cgi). Even some of the latest DIP files could not be correctly displayed.

There are many possible run-time errors that must be prevented for the program to run correctly. All individuals must have unique names. NullPointer exceptions were common. Null values from the parser are prevented by returning a NotFound string instead of null when the parser cannot find a value. HashMap's get(Key k) method returns null if the key cannot be found or when the value for the key has not been set. Returning null values from the HashMap is prevented by making the HashMap accessible only through get and set methods. Limiting access to HashMap's built-in get method, which may return null, means that the possibility of a null value can be handled in one place instead of everywhere the method is called.

Although the program saves a .pprj file to disk, it is saved with all tabs in the Protege-OWL editor set to hidden. Tab visibility can be toggled under Preferences in the OWL menu of the editor, but it's easier and faster to open the .owl file. The .owl file is saved along with the .pprj file and does not have the tab visibility problem because its tabs are set to visible by default.
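
Returning to the version handling described earlier in this section, the following is a minimal sketch of how a parsing method might be chosen from the level and version attributes of the entrySet tag. It is not the project's actual code. The class name MifVersionSniffer, the methods detectVersion and chooseParser, and the returned parser names are hypothetical, and joining level and version into strings such as "2.5" is an assumption about how the attributes map to MIF versions. Only the SAX classes that ship with the JDK are used.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: inspects only the root element of a PSI MI document.
final class MifVersionSniffer {

    /** Records the level and version attributes of the first element, then stops. */
    private static final class RootHandler extends DefaultHandler {
        boolean rootSeen = false;
        String level;
        String version;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            rootSeen = true;
            if ("entrySet".equals(qName)) {
                level = atts.getValue("level");
                version = atts.getValue("version");
            }
            // Abort on purpose: nothing after the root start tag is needed here.
            throw new SAXException("stop after root element");
        }
    }

    /** Returns level and version joined by a dot, or "unknown" if they are absent. */
    static String detectVersion(String xml) throws Exception {
        RootHandler handler = new RootHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (!handler.rootSeen) {
                throw e; // a genuine parse error before any element was read
            }
        }
        if (handler.level == null || handler.version == null) {
            return "unknown";
        }
        return handler.level + "." + handler.version;
    }

    /** Maps a detected version to the name of a (hypothetical) parsing method. */
    static String chooseParser(String mifVersion) {
        if ("1.1".equals(mifVersion)) return "parseMif11";
        if ("2.0".equals(mifVersion)) return "parseMif20";
        if ("2.5".equals(mifVersion)) return "parseMif25";
        return "unsupported";
    }
}
```

For a document whose root tag is <entrySet level="2" version="5">, detectVersion returns "2.5" and chooseParser then selects the 2.5 method. Stopping at the root element means even a very large input file is barely read before the choice is made.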

MIF 1.0 Upgrade to MIF 2.0 (PromptTab Screenshot)

Several changes were made from the first version of the PSI MI format. These changes are documented at http://psidev.sourceforge.net/mi/rel2/doc/changes1To2.html. BIND uses MIF version 2.0; therefore, output is in MIF v.2 format.

Improvements & Further Work

The PSI MI format allows some elements to have multiple values, such as secondaryRef and alias. Multiple values are represented by repeating each value with its own pair of tags. The parser ignores any consecutively repeated tags, unless they are entry tags or interaction tags. Since the parser does recognize the first set of tags, checking and parsing another set can be added to the code. I have added code in the OwlOutputHelperFunctions class so that the parser can now read elements with multiple values. However, the values still need to be inserted into the ontology after parsing.

While parsing large PSI MI files (30-40 MB), the program ran out of memory. Currently, all parsed data is stored in a queue and retrieved after parsing is complete to instantiate individuals in the ontology. To conserve memory, instantiations can be made while parsing instead of storing all the data in a queue.

Some tag names have been changed in revisions to PSI MI. For example, interactionDetection has been changed to interactionDetectionMethod. The parser handles both tags, but bases the decision on the source of the xml file. If the source is MIPS, the parser will use interactionDetection because MIPS has not changed to the newer interactionDetectionMethod. Otherwise, the parser checks for interactionDetectionMethod. The parser would be more robust if it based the decision on the version; version information is available to the parser as an attribute of the entrySet tag. Another alternative is to check for the correct tag and, if it is not found, look for older versions of the name. Now the level and version attributes are being used.
The parser uses interactionDetectionMethod unless the version is 1.1, in which case interactionDetection is used.

The naming scheme for instances in the ontology could be improved. Currently, I'm using <SourceName><ClassName>-<integer>. Some example names for instances are MIPSInteraction-2, MIPSNames-2, BINDInteraction-2, BINDInteraction-3, etc. Retrieving a specific individual is difficult with this naming scheme because the name that an individual will be given cannot be predicted. As of PSI MI version 2.5, major elements require an id attribute that is unique in the file (http://psidev.sourceforge.net/mi/rel25/changes1To25.html). This revision allows a more predictable naming scheme to be used. Because the id attribute is unique, names in the form <SourceName>-<ClassName>-<id> will also be unique. The pseudocode to get a specific instance of Interactor is:

    SourceName = interaction source, probably the same as the data currently being worked with
    ClassName = "Interaction"

    if ( participant is looking for an instance of its interactor ) {
        lookingForId = participant.getId()
        iterator = all instances of interactor
        while ( iterator not empty && hasNext )
            if ( ((Interaction) iterator.next()).getparticipantId() == lookingForId ) {
                // found
            }
    } else if ( interactionList is looking for an instance of interactor ) {
        lookingForId = interactorId
        iterator = all instances of interactor
        while ( iterator not empty && hasNext )
            if ( ((Interactor) iterator.next()).getId() == lookingForId ) {
                // found
            }
    }
    // the getIndividual function could also be used

Using the above pseudocode allows for checking if a specific individual exists in the ontology. If the individual does not exist, it can be instantiated. If it already exists, the existing instance can be retrieved.

Works Cited

Hermjakob, H., et al. (2004). The HUPO PSI's Molecular Interaction format - a community standard for the representation of protein interaction data. Nature Biotechnology, 22, 177-183.

Hermjakob, H. (2005). Molecular Interaction XML Format 2.5: Documentation of schema changes from version 1.0 to 2.5. Retrieved March 2007, from http://psidev.sourceforge.net/mi/rel25/changes1To25.html

Horridge, M., Knublauch, H., Rector, A., Stevens, R., Wroe, C. (2004). A Practical Guide To Building OWL Ontologies Using the Protégé-OWL Plugin and CO-ODE Tools. Retrieved January 14, 2007, from http://www.co-ode.org/resources/tutorials/ProtegeOWLTutorial.pdf

Hussain, F. K., Sidhu, A. S., Dillon, T. S., Chang, E. (2006). Engineering Trustworthy Ontologies: Case Study of Protein Ontology. IEEE Symposium on Computer-Based Medical Systems.

Kalyanpur, A., Pastor, D., Battle, S., Padget, J. (2004). Automatic Mapping of OWL Ontologies into Java. Proceedings of the Sixteenth International Conference on Software Engineering and Knowledge Engineering (SEKE), June 20-24, 2004, Banff, Canada. Retrieved March 2007, from http://www.mindswap.org/~aditkal/SEKE04.pdf

Noy, N. F., Rubin, D. L., & Musen, M. A. (2004). Making Biomedical Ontologies and Ontology Repositories Work. IEEE Intelligent Systems, 04, 78-81.

Sachs, E. (2006). Getting Started with Protégé-Frames. Retrieved January 14, 2007, from http://protege.stanford.edu/doc/tutorial/get_started/table_of_content.html