P-OwlWriteUp

Aaron Schoenhofer
Introduction
The purpose of this project is to represent data in a way that is simultaneously readable and meaningful for
humans and software. To be human-readable, data must be represented in an organized, user-friendly fashion
(Noy, 2004) that does not overwhelm the reader, especially when large datasets are involved. Both humans and
machines need some sort of context or semantics in order to place meaning on some piece of data (Hussain,
2006).
This project uses an ontology to represent datasets of results from protein-protein interaction (PPI)
experiments. The datasets contain information on the proteins involved in the interaction, what kind of
interaction occurred, and references and database accession numbers. Displaying data in the form of individuals
and properties improves human-readability. The RDF/XML format allows machine-readability. Explicit
relationships give context.
Methods
I used the Protege-OWL editor (http://protege.stanford.edu/) to create and modify the ontology. Classes and
properties in the ontology are based on the PSI MI format (http://psidev.sourceforge.net/mi/rel25/). Protege-OWL
also has an API that can be used with Java to manipulate an ontology. Java and the SAX parser are used to
parse input files. Eclipse is the IDE.
The Protege-OWL editor can automatically generate Java code for the classes and properties in the
ontology. With the ontology project open, select the Code menu. From the Code menu, select Generate
Protege-OWL Java Code. A pop-up window has options for setting the output directory, package name, and Factory
class name. I named the package owlOntCode25 and the factory OwlOntCode25Factory. After clicking OK, the
output directory should contain two packages. One package has the name that was selected in the code
generation window, and the second package has the same name with 'Impl' appended. Instances of these classes
can be created and used to populate an ontology.
To set up an Eclipse project, create a new Eclipse Java project in the workspace and import the directories
containing the packages generated by Protege-OWL into the project's source directory. To use the Protege-OWL
API, several jar files need to be included in the build path. To do this, go to Java Build Path in the
project properties menu. Select the Libraries tab, then click the Add External JARs button. The jar files needed
are in the Protege installation directory and in its subdirectory plugins/edu.stanford.smi.protegex.owl.
Alternatively, unzip the Eclipse project at http://www.cis.ksu.edu/~aps7777/Ontology and either import it into
Eclipse or create a new project and import its contents. Then set the build path by adding the jar files in the
P-OWL-JARs directory with the Add JARs button.
A few data structures are parameterized and cause build errors when not using Java 1.5. Changing the
compiler to 1.5 removes the build errors. This can be done by selecting Java Compiler in the project properties
menu. Check the Enable project specific settings box, and change the compliance level to 5.0. Then click Yes to
rebuild.
To run the program, use PsiMiXmlParser as the main class. Arguments are optional; four xml files are
parsed by default if no arguments are supplied. The files are in MIF versions 1.1, 2.0, and 2.5.
Code Description
Package xmlInOwlOut
Classes in the xmlInOwlOut package parse xml files and use the extracted data to instantiate classes and
their properties. New individuals are instantiated with the help of the owlOntCode and owlOntCode.impl packages.
After creating new individuals from the data in the xml files, a new owl ontology file is written to disk. The new
ontology containing the newly instantiated individuals can be viewed in the Protege-OWL editor.
PsiMiXmlParser.java - This class contains the main method for the program.
It reads and loads any files given as command line arguments. It then parses the data in the xml files and uses it
to instantiate the DbSourceEntry and DbInteractionEntry classes. Only one instance of DbSourceEntry is made per
input file. The DbSourceEntry class keeps a record of where the xml file came from (MIPS, BIND, etc.) and is
used later when creating individuals in the ontology. A DbInteractionEntry instance is created for each
protein-protein interaction in the xml file. When all files have been parsed, PsiMiXmlParser calls the method
processOutputQueue in OwlOutput.java. There are three methods that do most of the parsing. One of the three
methods is chosen for parsing after reading the level and version attributes from the entrySet tag.
DbSuperEntry.java - This is an abstract class and the superclass of DbSourceEntry and DbInteractionEntry. A
HashMap is used for storing information parsed from the xml input files. To make the information accessible,
hashMap keys are named after the data that can be retrieved using the key. DbSuperEntry also has methods for
putting and getting data in the HashMap. Although HashMap's built-in method for retrieving values may return
null values, DbSuperEntry's method for retrieving map values always returns a non-null value. This method and
the reason for it are explained in the Code Examples section below.
DbInteractionEntry.java - This is a subclass of DbSuperEntry. DbInteractionEntry's inherited HashMap is used
as a container for the data within each pair of interaction tags in the input files. Each instance of
DbInteractionEntry represents one interaction and is used to create classes and properties defined in the
ontology.
DbSourceEntry.java - This is also a subclass of DbSuperEntry. The difference between this subclass and
DbInteractionEntry is that each instance of DbSourceEntry represents a source instead of an interaction. In a
PSI MI record, source refers to the provider of the data (http://psidev.sourceforge.net/mi/xml/doc/user/). The
PSI MI files for this project usually come from MIPS or BIND. One DbSourceEntry is created for each parsed
input file, and contains the data within a pair of source tags of a file in PSI MI format.
OwlOutput.java - As input files are parsed, the data is placed in instances of the subclasses of
DbSuperEntry, which are placed in a queue in the OwlOutput class. When all input files have been parsed, the
entries are dequeued. New instances of classes in the ontology are created from the data contained in each entry.
OwlOutput uses the instanceof operator to determine whether the data should be used to create classes and
properties of Source or Interaction. When the queue is empty, the populated ontology is written to disk as an
owl file and can be opened in the Protege-OWL editor.
OwlOutputHelperFunctions.java - This class is used by OwlOutput when attempting to create new instances of
classes in the ontology. Using the owl API to instantiate an individual with a non-unique name causes errors.
Therefore, all methods in OwlOutputHelperFunctions that directly instantiate new individuals are declared
private. OwlOutput must call the public createNew method, which gets a unique name by calling the
getNextUniqueName method before attempting to create a new individual. Every time a new individual is
created, its name is also inserted into a HashSet. The getNextUniqueName method is recursive and will not return
until it has found a name that is not in the set of used names.
Classes in the owlOntCode and owlOntCode.impl packages were automatically generated by Protege-OWL.
Calling the factory method creates and returns new individuals of classes in the ontology. The returned
individuals can then be used for setting their specific properties. Individuals that are successfully created will be
included in the populated ontology file that is written to disk just before the program ends.
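The queue handling described for OwlOutput can be illustrated with a minimal sketch. It is not the project's code: Entry, SourceEntry, InteractionEntry, and OutputQueueSketch are simplified, hypothetical stand-ins for DbSuperEntry, DbSourceEntry, DbInteractionEntry, and OwlOutput, and instead of creating individuals the sketch only records which kind of individual would be created.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, simplified stand-ins for the project's entry classes.
abstract class Entry { }
final class SourceEntry extends Entry { }
final class InteractionEntry extends Entry { }

final class OutputQueueSketch {
    private final Queue<Entry> queue = new ArrayDeque<Entry>();

    // Called while parsing: store the entry until all files have been read.
    void enqueue(Entry e) {
        queue.add(e);
    }

    // Drains the queue and records which kind of individual each entry maps to.
    List<String> processOutputQueue() {
        List<String> created = new ArrayList<String>();
        Entry e;
        while ((e = queue.poll()) != null) {
            if (e instanceof SourceEntry) {
                created.add("Source");
            } else if (e instanceof InteractionEntry) {
                created.add("Interaction");
            }
        }
        return created;
    }
}
```

An alternative to the instanceof test would be an abstract method on the superclass that each subclass overrides, which keeps the dispatch inside the entry classes.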
Code Examples
Attempting to instantiate classes or properties with a null value causes errors. If a class or property is not
instantiated at all, it may cause problems later. Because the code that invokes getFromMap() may not expect a
null value, getFromMap() returns a string of the form Key_<key>_NotFound instead of null.
//If ret != null then return ret, else return a not-found message.
//Prevents returning null and helps debugging.
public String getFromMap(String k) {
    String ret = entryMap.get(k);
    return ( (ret != null) ? ret : ("Key_" + k + "_NotFound") );
}
Creating a working model based on an ontology loaded from an online repository.
String uri = "http://www.cis.ksu.edu/~aps7777/Ontology/OWL-Repos/PSI-MI-Ontology25/PMO25.owl";
owlModel = ProtegeOWL.createJenaOWLModelFromURI(uri);
Once the input has been processed and individuals inserted, the following code saves the populated ontology to
a local file.
//Find the current working directory and base the file name on the current time.
String outputBaseStr = getOutputBaseFN();
//Set project and owl file names. Then set namespace.
Project project = (Project) owlModel.getProject();
project.setProjectFilePath( outPathStr + baseFN + ".pprj" );
JenaKnowledgeBaseFactory.setOWLFileName(project.getSources(),
outPathStr + baseFN + ".owl" );
//Namespace for a local file should be in the form
//file:///<directory>/<OwlFileName>#
String newNamespace = outPathStr.replaceFirst("/", "///") + baseFN + ".owl#";
NamespaceManager nsMan = owlModel.getNamespaceManager();
nsMan.setDefaultNamespace( newNamespace );
//Try to save and print result.
Collection errors = new ArrayList();
project.save(errors);
if (errors.size() == 0) { System.out.println("Saved as " + outPathStr + baseFN + ".owl"); }
else { printErrors("saving file", errors); }
Creating a new Primary Reference. The parameter indName is a unique name for the individual, and the other
three parameter values are set according to data parsed from the input files.
private PrimaryRef newPrimaryRef(String indName, String id, String db, String vers)
{
    PrimaryRef ret;
    ret = new OwlOntCodeFactory(owlModel).createPrimaryRef(indName);
    usedNames.add(indName);
    ret.addId(id);
    ret.addDb(db);
    ret.addVersion(vers);
    return ret;
}
The returned instance of PrimaryRef can be set as another individual's hasXref property. For example, to set a
reference for an individual of a class:
PrimaryRef pRef = newPrimaryRef( ... );
individualName.addHasXref( pRef );
//Call getNextUniqueName() first. If the name is not unique, getNextUniqueName()
//will return a similar but unique name for an individual.
//Works by incrementing the last digit(s) of the name. If <ClassName>-0 is taken,
//try <ClassName>-1 and so on until a unique name is found, then return.
public String getNextUniqueName(String className, DbSuperEntry dbe) {
    String newName = dbe.getSourceDbName() + className + "-0";
    if ( !usedNames.contains(newName) ) { return newName; }
    else { return incrementSuffix(newName); }
}
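The incrementSuffix method is not shown in this write-up. As an illustration of the same idea, the following minimal sketch finds the next free name with a loop instead of recursion. The class UniqueNames and its method next are hypothetical, and unlike the project code the sketch records the name as used at the moment it is handed out.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; not the project's OwlOutputHelperFunctions class.
final class UniqueNames {
    private final Set<String> usedNames = new HashSet<String>();

    // Returns base + "-" + n for the smallest n >= 0 that is not yet used,
    // and marks that name as used so it is never returned twice.
    String next(String base) {
        int n = 0;
        while (usedNames.contains(base + "-" + n)) {
            n++;
        }
        String name = base + "-" + n;
        usedNames.add(name);
        return name;
    }
}
```

A loop avoids deep recursion when many individuals share the same base name. A per-base counter stored in a map would avoid rescanning from zero each time.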
Three methods from the auto-generated owlOntCode package.
public RDFProperty getHasNamesProperty() {
    final String uri = "http://www.cis.ksu.edu/~aps7777//Ontology/OWL-Repos/PSI-MI-Ontology/PMO.owl#hasNames";
    final String name = getOWLModel().getResourceNameForURI(uri);
    return getOWLModel().getRDFProperty(name);
}
public void setShortLabel(Collection newShortLabel) {
    setPropertyValues(getShortLabelProperty(), newShortLabel);
}
public RDFProperty getShortLabelProperty() {
    final String uri = "http://www.cis.ksu.edu/~aps7777//Ontology/OWL-Repos/PSI-MI-Ontology/PMO.owl#shortLabel";
    final String name = getOWLModel().getResourceNameForURI(uri);
    return getOWLModel().getRDFProperty(name);
}
The warning below is caused by both newSource() and newPrimaryRef() setting the same property to the same
value. newPrimaryRef() first sets a Source as the PrimaryRef's isBibRef property. newSource() then sets the same
PrimaryRef as the Source's hasBibRef. hasBibRef and isBibRef are inverse properties: if one is set, the inverse is
set as well. To fix this, set either the property or its inverse, not both. I removed the line
ret.addIsBibRef(isRefOf); from newPrimaryRef().
public Source newSource(String className, DbSourceEntry se) {
...
Source ret = new OwlGeneratedFactory(owlModel).createSource(indName);
PrimaryRef pref = newPrimaryRef( ... );
ret.addHasBibRef( ((BibRef) pref) );
...
}
private PrimaryRef newPrimaryRef(String indName, String id, String db, Source isRefOf)
{
PrimaryRef ret = new OwlGeneratedFactory(owlModel).createPrimaryRef(indName);
...
ret.addIsBibRef(isRefOf);
return ret;
}
[OWLFrameStore] Warning: Attempted to assign duplicate value to MIPSSource-0.hasBibRef
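The effect of inverse properties can be illustrated with a toy model. This sketch does not use the Protege-OWL API, and it is only an assumption about how such a pair behaves: the class InversePairSketch keeps two plain maps, setting either direction fills in both, and a second attempt to record the same link is counted as a duplicate. The individual name MIPSPrimaryRef-0 in the usage note is made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a pair of inverse properties; not the Protege-OWL API.
final class InversePairSketch {
    private final Map<String, List<String>> hasBibRef = new HashMap<String, List<String>>();
    private final Map<String, List<String>> isBibRef = new HashMap<String, List<String>>();
    int duplicateWarnings = 0;

    // Setting the property on the source side.
    void addHasBibRef(String source, String ref) {
        link(source, ref);
    }

    // Setting the inverse on the reference side records the same link.
    void addIsBibRef(String ref, String source) {
        link(source, ref);
    }

    // Records both directions once; a repeated link only counts as a duplicate.
    private void link(String source, String ref) {
        List<String> refs = valuesOf(hasBibRef, source);
        if (refs.contains(ref)) {
            duplicateWarnings++;
            return;
        }
        refs.add(ref);
        valuesOf(isBibRef, ref).add(source);
    }

    private static List<String> valuesOf(Map<String, List<String>> map, String key) {
        List<String> values = map.get(key);
        if (values == null) {
            values = new ArrayList<String>();
            map.put(key, values);
        }
        return values;
    }

    List<String> getIsBibRef(String ref) {
        return valuesOf(isBibRef, ref);
    }
}
```

In this model, calling addIsBibRef("MIPSPrimaryRef-0", "MIPSSource-0") and then addHasBibRef("MIPSSource-0", "MIPSPrimaryRef-0") leaves one link in place and raises duplicateWarnings to 1, which mirrors the situation described above.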
Reasoner
DIGReasoner r = (DIGReasoner) drf.create( conf );
// now make a model
...
OntModel m = ModelFactory.createOntologyModel( spec, null );
m.read( fn );
// list inconsistent classes
StmtIterator i = m.listStatements( null, OWL.equivalentClass, OWL.Nothing );
while (i.hasNext()) {
    System.out.println( "Class " + i.nextStatement().getSubject() + " is unsatisfiable" );
}
Logger
logger.info("Program complete at " + endTime + "\n");
long time = (endTime-initTime);
logger.info("Program Duration: " + ( time/1000.0 ) + " seconds\n");
Screenshots
Screenshot 1.
Inheritance - Subclasses inherit properties from their superclasses (Hussain, 2006; Sachs, 2006). The screenshot
shows an instance of PrimaryRef. Some properties have values and some have a red outline. The red outline
indicates that a required property value is missing, but the values are missing because PrimaryRef should not
have all of these properties. They are inherited from its superclass, Source. PrimaryRef should be a sibling class
of Source, not a subclass.
Another example of incorrectly using inheritance is with InteractionList and Interaction. Because
InteractionList is a list of Interaction instances, the Interaction class sounds like it could be a subclass of
InteractionList. However, this means Interaction will also inherit the hasInteraction property, which does not
conform to PSI MI.
To correctly use inheritance, a subclass's superclass should only have properties that the subclass has as
well. BibRef and Xref are the correct superclasses of PrimaryRef. Inheritance is correctly used in this case
because the superclasses do not have any properties that the subclass should not. InteractionDetection could be
the superclass of ProteinParticipant. The superclass has the properties hasNames and hasXref. The subclass,
ProteinParticipant, should also have these properties and can be specialized with its own properties like
hasInteractionWith and hasHostOrganism.
Inheritance can reduce complexity by making relationships explicit and making re-use of code and design
elements easier.
Screenshot 2. Four input files were successfully parsed. The output .owl file is open in the Protege-OWL editor.
Input files were MIF versions 1.1, 2.0, and 2.5. Because each version has its differences, I wrote a separate
parser for each version. The parser to be used is determined at run-time by reading the xml tag containing
version information that is in the first few lines of a MIF xml file.
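The run-time version check can be sketched with the JDK's SAX API. This is a minimal illustration, not the project's parser: the class MifVersionSniffer and the returned parser names are hypothetical, and the sketch assumes the MIF version is formed by joining the level and version attributes of the entrySet tag with a dot (for example level 2 and version 5 for MIF 2.5).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; not part of the project's xmlInOwlOut package.
final class MifVersionSniffer {

    // Thrown to stop parsing as soon as the entrySet element has been seen.
    private static final class StopParsing extends SAXException {
        StopParsing() { super("entrySet found"); }
    }

    // Returns "<level>.<version>" from the entrySet tag, or null if there is none.
    static String sniff(String xml) {
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) throws SAXException {
                if ("entrySet".equals(qName)) {
                    result[0] = atts.getValue("level") + "." + atts.getValue("version");
                    throw new StopParsing();
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (StopParsing expected) {
            // Normal exit: the attributes have been read, the rest of the file is skipped.
        } catch (Exception e) {
            throw new IllegalStateException("Could not read entrySet attributes", e);
        }
        return result[0];
    }

    // Maps the version to one of three parsing routines (placeholder names).
    static String chooseParser(String levelDotVersion) {
        if ("1.1".equals(levelDotVersion)) { return "parseMif11"; }
        if ("2.0".equals(levelDotVersion)) { return "parseMif20"; }
        return "parseMif25";
    }
}
```

Stopping the parse with an exception keeps the check cheap even for large files, because only the first tag is read before the chosen parser takes over.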
Screenshot 3. Displaying the Names properties of an instance of Participant.
Screenshot 4. Ontology in the OWLViz tab.
--- Interactions ---
BINDProteinParticipant-0 -> BINDProteinParticipant-1
BINDProteinParticipant-1 -> BINDProteinParticipant-0
...
MIPSProteinParticipant-10 -> MIPSProteinParticipant-11
MIPSProteinParticipant-11 -> MIPSProteinParticipant-10
--- End Interactions ---
Output. Print participants in all instances of Interaction.
Problems While Coding
Correctly parsing input files was the first problem I ran into while coding. The xml structure of a file from
one source is usually different from the xml structure of a file from a different source. Most sources did not
fully conform to the PSI MI format. Although revisions to PSI MI have improved the format (Hermjakob, 2004;
http://psidev.sourceforge.net/mi/rel25/changes1To25.html), the parser needs to handle files that were created
according to previous PSI MI versions. Because of this, I had to write several different parsing methods
in order to make the program robust.
Testing the parser was also difficult, mainly because the values returned by the parser had to be checked by
hand. PSI MI files from BIND are not indented, and most files are so large that only a small subset of
values can be checked. The DIP website has a MIF File Viewer that is helpful for displaying a PSI MI file in
human-readable format, but it only works on files that conform exactly to PSI MI 2.5.3
(http://dip.doe-mbi.ucla.edu/dip/MIFut.cgi). Even some of the latest DIP files could not be correctly displayed.
There are many possible run-time errors that must be prevented for the program to run correctly. All
individuals must have unique names. NullPointer exceptions were common. Null values from the parser are
prevented by returning a NotFound string instead of null when the parser cannot find a value. HashMap's
get(Key k) method returns null if the key cannot be found or if the value for the key has not been set. Returning
null values from the HashMap is prevented by making the HashMap accessible only through get and set methods.
Limiting access to HashMap's built-in get method, which can return null, means that the possibility of a null
value is handled in one place instead of everywhere the method is called from.
Although the program saves a .pprj file to disk, it is saved with all tabs in the Protege-OWL editor set to
hidden. Tab visibility can be toggled under Preferences in the OWL menu of the editor, but it's easier and faster
to open the .owl file. The .owl file is saved along with the .pprj file and does not have the tab visibility problem
because its tabs are set to visible by default.
MIF 1.0 Upgrade to MIF 2.0 (PromptTab Screenshot)
Several changes were made from the first version of the PSI MI format. These changes are documented at
http://psidev.sourceforge.net/mi/rel2/doc/changes1To2.html. BIND uses MIF version 2.0; therefore, output is in
MIF version 2.0 format.
Improvements & Further Work
The PSI MI format allows some elements to have multiple values, such as secondaryRef and alias. Multiple
values are represented by repeating each value with its own pair of tags. The parser originally ignored any
consecutively repeated tags, unless they were entry tags or interaction tags. Since the parser already recognized
the first set of tags, checking for and parsing another set could be added to the code. I have since added code in
the OwlOutputHelperFunctions class so that the parser can read elements with multiple values. However, the
values still need to be inserted into the ontology after parsing.
While parsing large PSI MI files, 30-40MB, the program ran out of memory. Currently, all parsed data is
stored in a queue and retrieved after parsing is complete to instantiate individuals in the ontology. To conserve
memory, instantiations can be made while parsing instead of storing all the data in a queue.
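One way to instantiate while parsing is to hand each interaction to a callback as soon as its closing tag is read. The following is a minimal sketch under stated assumptions, not the project's parser: the class StreamingInteractionHandler and its Sink interface are hypothetical, the handler flattens each interaction into a map from tag name to text (later values overwrite earlier ones), and only the interaction and shortLabel element names are taken from PSI MI.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; everything except the PSI MI element names is illustrative.
final class StreamingInteractionHandler extends DefaultHandler {

    // Receives one interaction at a time, e.g. to create its individuals immediately.
    interface Sink {
        void accept(Map<String, String> interaction);
    }

    private final Sink sink;
    private final StringBuilder text = new StringBuilder();
    private Map<String, String> current; // non-null only while inside an interaction element

    StreamingInteractionHandler(Sink sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("interaction".equals(qName)) {
            current = new HashMap<String, String>();
        }
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null) {
            if ("interaction".equals(qName)) {
                sink.accept(current); // hand over now; the handler keeps no reference
                current = null;
            } else {
                String value = text.toString().trim();
                if (value.length() > 0) {
                    current.put(qName, value);
                }
            }
        }
        text.setLength(0);
    }

    // Convenience for trying the handler on a small in-memory document.
    static List<String> shortLabels(String xml) {
        final List<String> labels = new ArrayList<String>();
        Sink sink = new Sink() {
            public void accept(Map<String, String> interaction) {
                labels.add(interaction.get("shortLabel"));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new StreamingInteractionHandler(sink));
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return labels;
    }
}
```

With this shape, memory use for parsed data is bounded by the size of a single interaction rather than the whole file, although the ontology model itself still grows as individuals are added.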
Some tag names have been changed in revisions to PSI MI. For example, interactionDetection has been
changed to interactionDetectionMethod. The parser handles both tags. It originally based the decision on the
source of the xml file: if the source was MIPS, the parser used interactionDetection because MIPS had not changed
to the newer interactionDetectionMethod; otherwise, the parser checked for interactionDetectionMethod. The
parser would be more robust if it based the decision on the version, since version information is available to
the parser as an attribute of the entrySet tag. Another alternative is to check for the current tag and, if it is
not found, look for older versions of the name. The level and version attributes are now being used: the parser
uses interactionDetectionMethod unless the level and version are 1.1, in which case interactionDetection is used.
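The alternative of checking for the current tag name first and then falling back to older names can be sketched as follows. This is an illustration only: the class TagFallback is hypothetical, it assumes the parsed data is available as a map from tag name to value, and it borrows the Key_<key>_NotFound convention of getFromMap() for the case where no name matches.

```java
import java.util.Map;

// Hypothetical helper illustrating a newest-name-first lookup.
final class TagFallback {

    // Returns the value stored under newestName, or under the first older name
    // that is present; never returns null.
    static String firstPresent(Map<String, String> parsed, String newestName,
                               String... olderNames) {
        String value = parsed.get(newestName);
        if (value != null) {
            return value;
        }
        for (String name : olderNames) {
            value = parsed.get(name);
            if (value != null) {
                return value;
            }
        }
        return "Key_" + newestName + "_NotFound";
    }
}
```

A call such as firstPresent(parsed, "interactionDetectionMethod", "interactionDetection") would then work for files of either version without consulting the source or the version attributes.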
The naming scheme for instances in the ontology could be improved. Currently, I'm using
<SourceName><ClassName>-<integer>. Some example names for instances are MIPSInteraction-2,
MIPSNames-2, BINDInteraction-2, BINDInteraction-3, etc. Retrieving a specific individual is difficult with
this naming scheme because the name that an individual will be given cannot be predicted.
As of PSI MI version 2.5, major elements require an id attribute that is unique in the file
(http://psidev.sourceforge.net/mi/rel25/changes1To25.html). This revision allows a more predictable naming
scheme to be used. Because the id attribute is unique, names in the form <SourceName>-<ClassName>-<id>
will also be unique. The pseudocode to get a specific instance of Interactor is:
SourceName = interaction source, probably the same as the data currently being processed;
ClassName = "Interaction";
if ( participant is looking for an instance of its interactor )
    lookingForId = participant.getId();
    iterator = all instances of interactor
    while ( iterator.hasNext() )
        if ( ((Interaction)iterator.next()).getParticipantId() == lookingForId )
        { //found }
else if ( interactionList is looking for an instance of interactor )
    lookingForId = interactorId
    iterator = all instances of interactor
    while ( iterator.hasNext() )
        if ( ((Interactor)iterator.next()).getId() == lookingForId ) { //found }
//A getIndividual function could also be used.
Using the above pseudocode allows for checking if a specific individual exists in the ontology. If the individual
does not exist, it can be instantiated. If it already exists, the existing instance can be retrieved.
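The get-or-create behavior that a predictable name makes possible can be sketched with a plain map. This is a minimal illustration, not the project's code and not the Protege-OWL API: IndividualRegistry and its nested Individual class are hypothetical placeholders, and the registry stands in for looking an individual up by name in the ontology model.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry keyed by <SourceName>-<ClassName>-<id>.
final class IndividualRegistry {

    // Placeholder for an ontology individual.
    static final class Individual {
        final String name;
        Individual(String name) { this.name = name; }
    }

    private final Map<String, Individual> byName = new HashMap<String, Individual>();

    // Builds the predictable name from the source, the class, and the file-unique id.
    static String nameFor(String sourceName, String className, String id) {
        return sourceName + "-" + className + "-" + id;
    }

    // Returns the existing individual with that name, or creates and stores a new one.
    Individual getOrCreate(String sourceName, String className, String id) {
        String name = nameFor(sourceName, className, id);
        Individual individual = byName.get(name);
        if (individual == null) {
            individual = new Individual(name);
            byName.put(name, individual);
        }
        return individual;
    }
}
```

Because the name can be computed from data already in the file, the lookup is a single map access instead of an iteration over all instances.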
Works Cited
Hermjakob, H., et al. (2004). The HUPO PSI's Molecular Interaction format - a community standard for the representation of
protein interaction data. Nature Biotechnology, 22, 177-183.
Hermjakob, H. (2005). Molecular Interaction XML Format 2.5: Documentation of schema changes from version 1.0 to 2.5.
Retrieved March 2007, from http://psidev.sourceforge.net/mi/rel25/changes1To25.html
Horridge, M., Knublauch, H., Rector, A., Stevens, R., Wroe, C. (2004). A Practical Guide To Building OWL Ontologies using
the Protégé-OWL Plugin and CO-ODE Tools. Retrieved January 14, 2007 from
http://www.co-ode.org/resources/tutorials/ProtegeOWLTutorial.pdf.
Hussain, F. K., Sidhu, A. S., Dillon, T. S., Chang, E. (2006). Engineering Trustworthy Ontologies. Case Study of Protein
Ontology. IEEE Symposium on Computer-Based Medical Systems.
Kalyanpur, A., Pastor, D., Battle, S., Padget, J., (2004). Automatic Mapping of OWL Ontologies into Java, Proceedings of
Sixteenth International Conference on Software Engineering and Knowledge Engineering (SEKE), June 20-24, 2004, Banff,
Canada. Retrieved March 2007, from http://www.mindswap.org/~aditkal/SEKE04.pdf
Noy, N. F., Rubin, D. L., & Musen, M.A. (2004). Making Biomedical Ontologies and Ontology Repositories work. IEEE Intelligent
Systems. 04, 78-81.
Sachs, E. (2006). Getting Started with Protégé-Frames. Retrieved January 14, 2007 from
http://protege.stanford.edu/doc/tutorial/get_started/table_of_content.html.