Effectively Querying Unstructured Data Using Hierarchies of Properties

Tao Chen
Department of Computer Science
Memorial University of Newfoundland
St. John’s, NL, Canada
chent@cs.mun.ca
Jeffrey Parsons
Faculty of Business Administration
Memorial University of Newfoundland
St. John’s, NL, Canada
jeffreyp@mun.ca
Research Question
There is increasing pressure on organizations to implement methods to effectively locate and manage
relevant resources in unstructured or textual documents, such as news stories, blogs, wikis, and online
customer feedback forms. In this paper, we evaluate the effectiveness of enhancing keyword-based
queries with a property precedence schema, a mechanism for organizing properties from more general to
more specific. We propose a method for developing such a schema and demonstrate the effectiveness of
this approach in finding relevant documents that would not be identified using simple keyword search, or
lexicon-assisted keyword search.
Approach
There are at least two challenges in using repositories of unstructured documents: 1) different information
producers may use very different terminology to express the same or similar ideas; and 2) different
information producers may express information at different levels of detail. In general, information
retrieval techniques treat a document as a unit (e.g., Hofmann 1999; Sebastiani 2002; Li et al. 2008).
Property precedence (Parsons and Wand 2003) has been developed as a mechanism to integrate
schemas and overcome some difficulties of matching data across sources. This framework allows the
model to handle data with different granularities and less structure by decomposing documents into
smaller structures. In this research we apply property precedence to extract structure from textual
documents and use this structure to enhance the effectiveness of queries over unstructured documents.
A property, P1, is said to precede another, P2, if and only if the set of instances possessing P2 is a
subset of the set of instances possessing P1. Properties can be modeled as predicates, or statements about
instances. The basic idea of property precedence is that two properties of different sources might be
distinct from each other, yet have similar meaning at a more general conceptual level (Parsons and Wand
2003). For example, “earnings” and “depreciation and amortization” are different, but in the discussion of
corporate income, “depreciation and amortization” is reflected in “earnings”.
In the “property-instance” paradigm, the basic query is ‘find the instances possessing a
property.’ For example, suppose the data set has three instances with properties expressed in the sentences:
“Anne is a student”, “Bob is a teacher”, and “Charles is a high school teacher and attends Memorial
University for a higher degree.” Suppose further that there is a known precedence relation: “student”
precedes “attending university.” The answer to the query “who is a student?” in this data set is “Anne”
and “Charles”, since “Anne” possesses the property “student” and “Charles” possesses the property “attending
university,” which is preceded by “student”.
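The query in this example can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary layout and the function names `expand` and `query` are assumptions made for illustration. A query for a property is answered by matching the property itself plus every property it precedes. Following precedence transitively is justified by the subset-based definition above.

```python
from typing import Dict, List, Set

# Properties observed for each instance, taken from the three example sentences.
instance_properties: Dict[str, Set[str]] = {
    "Anne": {"student"},
    "Bob": {"teacher"},
    "Charles": {"high school teacher", "attending university"},
}

# Known precedence relations: general property -> properties it precedes.
precedes: Dict[str, Set[str]] = {
    "student": {"attending university"},
}


def expand(prop: str, precedes: Dict[str, Set[str]]) -> Set[str]:
    """Return prop plus every property it precedes, directly or transitively."""
    seen = {prop}
    stack = [prop]
    while stack:
        current = stack.pop()
        for specific in precedes.get(current, set()):
            if specific not in seen:
                seen.add(specific)
                stack.append(specific)
    return seen


def query(prop: str,
          instance_properties: Dict[str, Set[str]],
          precedes: Dict[str, Set[str]]) -> List[str]:
    """Find instances possessing prop or any property preceded by prop."""
    targets = expand(prop, precedes)
    return sorted(name for name, props in instance_properties.items()
                  if props & targets)


print(query("student", instance_properties, precedes))  # ['Anne', 'Charles']
```

Without the precedence relation, the same query would return only “Anne”, because “Charles” is never described with the word “student”.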
Main Findings / Expected Contributions
To test the effectiveness of queries that employ a property precedence schema, we use the Reuters-21578
data set (Reuters-21578, Distribution 1.0), which has 12,902 labelled documents covering business
news. Every document is a news story and the labels assigned to the document (by humans) indicate the
topics of the corresponding news story – these topics might not appear as words in the document. We
consider every news story as an instance, the document of the news story as the data/description of the
instance, and the topics of the news story as properties of the instance. We applied a probabilistic model to
partition the sentences of each story so that each part of the partition is meaningful to humans.
The parts of the partition are treated as properties of the instance. To distinguish the properties in the
topics from the properties in the data, we call the former “topic properties.” Two-thirds of all documents
in the data set are training instances. The remainder are testing instances. As topic properties might not
appear as words in the document, we applied a machine learning approach to determine the existence of a
topic property in a testing instance. This approach summarizes the knowledge of a topic property from the
instances known to possess the topic property and applies this knowledge to determine whether a testing
instance possesses the topic property (Chen 2008). To build a property precedence schema, we applied
the definition of property precedence. Excerpts of the resulting precedence schema are given in Fig. 1.
Fig. 1 Excerpts of the property precedence schema
acq precedes: affiliate investment firm; effort to acquire; aggregate corp; expression of interest; financial restructure; ...
crude precedes: alaskan oil; diesel fuel; canadian petroleum; champlin refine; colombian pipeline; ...
earn precedes: an initial dividend; anticipate loss; asset writedowns; company pre-tax profit; consolidate balance sheet; ...
ship precedes: channel ferry; crewman; flag vessel; kuwaiti tanker; panama canal; ...
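As an illustration of how a schema like this could be derived by applying the definition of property precedence directly, the following Python sketch marks P1 as preceding P2 whenever the instances possessing P2 form a subset of those possessing P1. The story identifiers and their property assignments are hypothetical toy data (only the property names come from Fig. 1), and the function names are assumptions. The actual procedure is described in Chen (2008) and may differ.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple


def instances_by_property(
        instance_properties: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert instance -> properties into property -> instances."""
    index: Dict[str, Set[str]] = {}
    for instance, props in instance_properties.items():
        for prop in props:
            index.setdefault(prop, set()).add(instance)
    return index


def derive_precedence(
        instance_properties: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (general, specific) pairs where general precedes specific."""
    index = instances_by_property(instance_properties)
    pairs = []
    for general, specific in permutations(index, 2):
        # Definition: P1 precedes P2 iff instances(P2) is a subset of instances(P1).
        if index[specific] <= index[general]:
            pairs.append((general, specific))
    return sorted(pairs)


# Hypothetical training instances; the assignments are invented for illustration.
training = {
    "story1": {"acq", "effort to acquire"},
    "story2": {"acq", "expression of interest"},
    "story3": {"crude", "diesel fuel"},
    "story4": {"crude"},
    "story5": {"acq"},
}

for general, specific in derive_precedence(training):
    print(f"{general} precedes {specific}")
```

Under this strict reading, two properties possessed by exactly the same instances precede each other. A practical system working with noisy text would likely need a more tolerant criterion, which this sketch does not attempt.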
To evaluate the effectiveness of the precedence schema, we tested queries involving the topic
properties. For example, a query for stories dealing with acquisitions (“acq”) produces some interesting
results. An excerpt of the result is: “Atlantis Group Inc said it bought 100,000 shares of Charter-Crellin
Inc common stock, or 6.3 pct of the total outstanding, and may seek control in a negotiated transaction.”
This story does not contain words such as “acquire” or “acquisition”. Without knowing that “acq”
precedes “negotiate transaction”, the query is unable to retrieve such results. In total, we tested 90 queries
involving 90 topic properties. We compared the case where the property precedence schema is enabled
(available to be used by the queries) with the case where it is disabled. When the schema is disabled,
recall is 47.58% (2639/5546). When the schema is enabled, recall is 71.42%
(3961/5546). Thus, the property precedence schema yields a relative improvement in recall of about 50%
(from 47.58% to 71.42%). We observed 107 incorrect results when the property precedence schema was
enabled. We examined these results and noticed that they arise from precedence relations such as “crude”
precedes “mln barrel” and “ship” precedes “freight cost”. On closer inspection of the corresponding news
stories, these results can arguably be considered correct, because the properties involved are subtopics,
rather than major topics, of the stories. If the topic properties were more general, such as “oil products”
and “transport” instead of “crude” and “ship”, these precedence relations would be correct. We conclude
that property precedence can significantly increase the number of correct results by bridging the semantic
differences between the terminologies used in different documents, and that the number of incorrect
results introduced by such a schema is manageable.
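The recall figures above can be checked with a short Python calculation. The counts are those reported in the text. The helper name `recall` and the variable names are illustrative assumptions.

```python
def recall(relevant_retrieved: int, total_relevant: int) -> float:
    """Fraction of all relevant documents that the queries retrieved."""
    return relevant_retrieved / total_relevant


TOTAL_RELEVANT = 5546  # relevant documents across the 90 queries

recall_disabled = recall(2639, TOTAL_RELEVANT)  # schema disabled
recall_enabled = recall(3961, TOTAL_RELEVANT)   # schema enabled

# Relative improvement: gain in recall as a fraction of the baseline recall.
relative_gain = (recall_enabled - recall_disabled) / recall_disabled

print(f"recall, schema disabled: {recall_disabled:.2%}")  # 47.58%
print(f"recall, schema enabled:  {recall_enabled:.2%}")   # 71.42%
print(f"relative improvement:    {relative_gain:.2%}")    # 50.09%
```

The 50% figure is therefore a relative gain over the baseline. In absolute terms, recall rises by roughly 24 percentage points.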
To summarize, the results indicate that the property precedence schema resolves heterogeneity in
terminology and level of detail across documents, and significantly improves the effectiveness of the
query. This approach finds many documents relevant to a query when the keyword(s) of the query do not
appear in a document, without requiring use of a lexicon or domain ontology. This approach can be
integrated with search engines to retrieve web pages relevant to a query when the keyword(s) of the query
do not appear in the web pages.
Current Status of Manuscript
A complete draft of the manuscript has been written. By the time of the Winter Conference, we intend to
report additional performance results of the approach on a more general class of web documents.
References
Chen, T. “Integrating Unstructured Data Using Property Precedence,” M.Sc. thesis, Memorial University
of Newfoundland, St. John’s, Canada, 2008.
Hofmann, T. "Probabilistic latent semantic indexing." Proceedings of the 22nd Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, August
1999, 50-57.
Li, T., C. Ding, Y. Zhang, and B. Shao. "Knowledge transformation from word space to document space."
Proceedings of the 31st Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, Singapore, July 2008, 187-194.
Liu, Y., W. Li, Y. Lin, and L. Jing. "Spectral geometry for simultaneously clustering and ranking query
search results." Proceedings of the 31st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, Singapore, July 2008, 539-546.
Sebastiani, F. "Machine learning in automated text categorization," ACM Computing Surveys, 34(1),
March 2002, 1-47.
Parsons, J., and Y. Wand. "Emancipating instances from the tyranny of classes in information modeling."
ACM Transactions on Database Systems, 25(2), June 2000, 228-268.
Parsons, J., and Y. Wand. "Attribute-Based Semantic Reconciliation of Multiple Data Sources," Journal
on Data Semantics, 1, 2003, 21-47.
Reuters-21578, Distribution 1.0. Reuters-21578.
http://www.daviddlewis.com/resources/testcollections/reuters21578/.