Effectively Querying Unstructured Data Using Hierarchies of Properties

Tao Chen
Department of Computer Science
Memorial University of Newfoundland
St. John’s, NL, Canada
chent@cs.mun.ca

Jeffrey Parsons
Faculty of Business Administration
Memorial University of Newfoundland
St. John’s, NL, Canada
jeffreyp@mun.ca

Research Question

There is increasing pressure on organizations to implement methods to effectively locate and manage relevant resources in unstructured or textual documents, such as news stories, blogs, wikis, and online customer feedback forms. In this paper, we evaluate the effectiveness of enhancing keyword-based queries with a property precedence schema, a mechanism for organizing properties from more general to more specific. We propose a method for developing such a schema and demonstrate the effectiveness of this approach in finding relevant documents that would not be identified using simple keyword search or lexicon-assisted keyword search.

Approach

There are at least two challenges in using repositories of unstructured documents: 1) different information producers may use very different terminology to express the same or similar ideas; and 2) different information producers may express information at different levels of detail. In general, information retrieval techniques treat a document as a unit (e.g., Hofmann 1999; Sebastiani 2002; Li et al. 2008). Property precedence (Parsons and Wand 2003) has been developed as a mechanism to integrate schemas and overcome some difficulties of matching data across sources. This framework makes it possible to handle data with different granularities and less structure by decomposing documents into smaller structures. In this research, we apply property precedence to extract structure from textual documents and use this structure to enhance the effectiveness of queries over unstructured documents.
A property, P1, is said to precede another, P2, if and only if the set of instances possessing P2 is a subset of the set of instances possessing P1. Properties can be modeled as predicates, or statements about instances. The basic idea of property precedence is that two properties from different sources might be distinct from each other, yet have similar meaning at a more general conceptual level (Parsons and Wand 2003). For example, “earnings” and “depreciation and amortization” are different, but in a discussion of corporate income, “depreciation and amortization” is reflected in “earnings”. In the “property-instance” paradigm, the basic query is “find the instances possessing a property.” For example, suppose the data set has three instances with properties, expressed in the sentences: “Anne is a student”, “Bob is a teacher”, and “Charles is a high school teacher and attends Memorial University for a higher degree.” Suppose further that there is a known precedence relation: “student” precedes “attending university.” The answer to the query “who is a student?” in this data set is “Anne” and “Charles”, since “Anne” possesses the property “student” and “Charles” possesses the property “attending university,” which is preceded by “student”.

Main Findings / Expected Contributions

To test the effectiveness of queries that employ a property precedence schema, we use the Reuters-21578 data set (Reuters-21578, Distribution 1.0), which has 12,902 labelled documents covering business news. Every document is a news story, and the labels assigned to the document (by humans) indicate the topics of the corresponding news story; these topics might not appear as words in the document. We consider every news story as an instance, the document of the news story as the data/description of the instance, and the topics of the news story as properties of the instance.
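To make the property-instance query from the Anne, Bob, and Charles example concrete, the following is a minimal sketch in Python. The function names (`specializations`, `query`) and the dictionary representation of instances and of the schema are assumptions made for illustration; they are not taken from the authors' implementation. The sketch also follows precedence transitively, which is consistent with the subset-based definition but is not discussed explicitly in the text.

```python
from collections import deque


def specializations(prop, schema):
    """Return prop together with every property it precedes,
    directly or transitively (hypothetical helper)."""
    seen = {prop}
    queue = deque([prop])
    while queue:
        current = queue.popleft()
        for narrower in schema.get(current, ()):
            if narrower not in seen:
                seen.add(narrower)
                queue.append(narrower)
    return seen


def query(prop, instances, schema=None):
    """Find the instances possessing prop. When a schema is supplied,
    an instance also matches if it possesses a property preceded by prop."""
    targets = specializations(prop, schema) if schema else {prop}
    return sorted(name for name, props in instances.items() if props & targets)


# The three instances from the example, each with its stated properties.
instances = {
    "Anne": {"student"},
    "Bob": {"teacher"},
    "Charles": {"high school teacher", "attending university"},
}

# Known precedence relation: "student" precedes "attending university".
schema = {"student": {"attending university"}}

print(query("student", instances))          # ['Anne']
print(query("student", instances, schema))  # ['Anne', 'Charles']
```

Without the schema, only the instance that literally possesses “student” is returned; with it, “Charles” is also found through “attending university”.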
We applied a probability model to partition the sentences in the story in a way such that each part in the partition is meaningful to humans. The parts in the partition are considered as properties of the instance. To distinguish the properties in the topics from the properties in the data, we call the former “topic properties.” Two-thirds of all documents in the data set are training instances. The remainder are testing instances. As topic properties might not appear as words in the document, we applied a machine learning approach to determine the existence of a topic property in a testing instance. This approach summarizes the knowledge of a topic property from the instances known to possess the topic property and applies this knowledge to determine whether a testing instance possesses the topic property (Chen 2008).

To build a property precedence schema, we applied the definition of property precedence. Excerpts of the resulting precedence schema are given in Fig. 1.

Fig. 1 Excerpts of the property precedence schema

acq precedes: affiliate investment firm, effort to acquire, aggregate corp, expression of interest, financial restructure, ...
crude precedes: alaskan oil, diesel fuel, canadian petroleum, champlin refine, colombian pipeline, ...
earn precedes: an initial dividend, anticipate loss, asset writedowns, company pre-tax profit, consolidate balance sheet, ...
ship precedes: channel ferry, crewman, flag vessel, kuwaiti tanker, panama canal, ...

To evaluate the effectiveness of the precedence schema, we tested queries involving the topic properties. For example, a query for stories dealing with acquisitions (“acq”) produces some interesting results. An excerpt of the result is: “Atlantis Group Inc said it bought 100,000 shares of Charter-Crellin Inc common stock, or 6.3 pct of the total outstanding, and may seek control in a negotiated transaction.” This story does not contain words such as “acquire” or “acquisition”.
Without knowing that “acq” precedes “negotiate transaction”, the query would be unable to retrieve such results.

In total, we tested 90 queries involving 90 topic properties. We compared the case where the property precedence schema is enabled (available to be used by the queries) with the case where it is disabled. When the property precedence schema is disabled, recall is 47.58% (2639/5546). When the precedence schema is enabled, recall is 71.42% (3961/5546). Thus, the property precedence schema yields a relative improvement in recall of about 50%.

We observed 107 incorrect results when the property precedence schema was enabled. We examined the incorrect results and noticed precedence relations such as “crude” precedes “mln barrel” and “ship” precedes “freight cost”. On further investigation of the corresponding news stories, we found that the incorrect results produced by these precedence relations can be considered correct, because the properties involved are subtopics, rather than major topics, of the stories. If the topic properties, instead of being as specific as “crude” and “ship”, were more general properties such as “oil products” and “transport”, these precedence relations would be correct. We conclude that property precedence can significantly increase the number of correct results by bridging the semantic differences between terminologies used in different documents, and that the number of incorrect results introduced by such a property precedence schema is manageable.

To summarize, the results indicate that the property precedence schema resolves heterogeneity in terminology and level of detail across documents, and significantly improves the effectiveness of the query. This approach finds many documents relevant to a query when the keyword(s) of the query do not appear in a document, without requiring use of a lexicon or domain ontology. This approach can be integrated with search engines to retrieve web pages relevant to a query when the keyword(s) of the query do not appear in the web pages.
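Two steps in this section lend themselves to a short sketch: building the schema by applying the definition of precedence to labelled training instances, and the recall arithmetic reported above. The Python code below is only an illustration under stated assumptions. The function names (`derive_schema`, `recall`), the data layout, and the toy training set are hypothetical; the actual schema construction is described in Chen (2008) and may differ, for instance in how it copes with noisy data. Only the counts 2639, 3961, and 5546 come from the text.

```python
def derive_schema(training, topic_properties):
    """Map each topic property to the non-topic properties it precedes.

    Applies the definition directly: topic T precedes property p if and
    only if every instance possessing p also possesses T.
    """
    # Index each property by the set of instances that possess it.
    owners = {}
    for instance_id, props in training.items():
        for prop in props:
            owners.setdefault(prop, set()).add(instance_id)

    schema = {}
    for topic in topic_properties:
        topic_owners = owners.get(topic, set())
        schema[topic] = {
            prop
            for prop, prop_owners in owners.items()
            if prop not in topic_properties and prop_owners <= topic_owners
        }
    return schema


def recall(correct_retrieved, total_relevant):
    """Fraction of the relevant results that were actually retrieved."""
    return correct_retrieved / total_relevant


# Hypothetical toy training set: instance id -> topic and text properties.
training = {
    "d1": {"acq", "effort to acquire", "company said"},
    "d2": {"acq", "effort to acquire"},
    "d3": {"ship", "kuwaiti tanker", "company said"},
}
toy_schema = derive_schema(training, {"acq", "ship"})
print(toy_schema["acq"])   # {'effort to acquire'}
print(toy_schema["ship"])  # {'kuwaiti tanker'}

# Recall figures recomputed from the counts reported in the text.
disabled = recall(2639, 5546)
enabled = recall(3961, 5546)
relative_gain = (enabled - disabled) / disabled
print(f"{disabled:.2%}")       # 47.58%
print(f"{enabled:.2%}")        # 71.42%
print(f"{relative_gain:.1%}")  # 50.1%
```

In the toy data, “company said” occurs in both an “acq” story and a “ship” story, so neither topic precedes it. The last three lines confirm that the reported improvement of about 50% is relative to the baseline recall, not a difference in percentage points.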
Current Status of Manuscript

A complete draft of the manuscript has been written. By the time of the Winter Conference, we intend to report additional performance results of the approach on a more general class of web documents.

References

Chen, T. “Integrating Unstructured Data Using Property Precedence,” M.Sc. thesis, Memorial University of Newfoundland, St. John’s, Canada, 2008.

Hofmann, T. “Probabilistic latent semantic indexing,” Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, August 1999, 50-57.

Li, T., C. Ding, Y. Zhang, and B. Shao. “Knowledge transformation from word space to document space,” Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore, July 2008, 187-194.

Liu, Y., W. Li, Y. Lin, and L. Jing. “Spectral geometry for simultaneously clustering and ranking query search results,” Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore, July 2008, 539-546.

Sebastiani, F. “Machine learning in automated text categorization,” ACM Computing Surveys, 34(1), March 2002, 1-47.

Parsons, J., and Y. Wand. “Emancipating instances from the tyranny of classes in information modeling,” ACM Transactions on Database Systems, 25(2), June 2000, 228-268.

Parsons, J., and Y. Wand. “Attribute-Based Semantic Reconciliation of Multiple Data Sources,” Journal on Data Semantics, 1, 2003, 21-47.

Reuters-21578, Distribution 1.0. Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578/.