Measuring Generality of Documents

Hyun Woong Shin (1), Eduard Hovy (1), Dennis McLeod (1), and Larry Pryor (2)
(1) Computer Science Department
(2) Annenberg School for Journalism
University of Southern California
Los Angeles, CA 90089
April, 2006

Contents
• Problems of Current Technology
• Hypothesis
• Related Work
• Goals and Technical Issues
• Generality Spectrum
• Concluding Remarks and Future Work

Hyun Woong Shin April, 2006

Problems of Current Technology
• Information Retrieval
• Retrieved information is frequently irrelevant
  – Most IR systems operationalize "relevant" as term frequency
  – It is difficult to capture and/or present the semantics of the corpus and/or the user's intent
• Specification of desired information is based on a user's knowledge and interests
  – Users may want to retrieve a specific level of information (e.g., "general story")

Hypothesis
• The notion of generality (level of conceptual abstraction) is useful for improving user satisfaction with the result of an information request
• To test the hypothesis:
  • Define generality
  • Quantify generality
  • Confirm generality

The Generality
• Generality
  • Level of conceptual abstraction
  • Specification of desired information
• Nothing to do with
  • Story organization
  • Audience
• Measure of the concept in question, within a framework/taxonomy of concepts
• These concepts can then be placed in an appropriate structured context

Definition of Generality
• Definition
  • In journalism, generality is reflected in both how a story is linearly organized and the needs and expectations of the audience
• Abstraction
  • Measuring the specificity of words (subset of index terms) in a document
• Assumption
  • The generality or specificity of a topic depends on the number of specific words

Generality Quantification
• Specific word set
  • Subset of the index term set that does not appear in other word sets
    \[ S_i = T_i \setminus \bigcup_{j \neq i} T_j \]
• An index term:
    A semantic reference that serves as a mnemonic device for recalling the main themes of a document
• Identification of index terms
  • Term extraction (index terms)
  • Stop word removal (stop word list)
  • Stemming and conflation (Porter's algorithm)
  • Index term weighting

Generality Algorithm (Basic)
Let $D = \{d_j \mid j \in J\}$ be a document containing a set of words from an index set $J$, and let $S = \{t_i \mid i \in I\}$ be a set of specific terms for an index set $I$.
Define a characteristic function $\chi : I \times J \to \{0, 1\}$ by
\[ \chi(i, j) = \begin{cases} 1, & t_i \in d_j \\ 0, & t_i \notin d_j \end{cases} \]
\[ g_j = \frac{\sum_{i \in I} \chi(i, j)}{|D_j|} \]
for the number $|D_j|$ of terms in $D_j$.
Then, if $D_j \cap S = \emptyset$, then $g_j = 0$.

Generality Algorithm (Adjust)
Suppose that $S_k$ and $S_{k+i}$ are the sets of specific terms of a parent node and its child nodes, respectively, for $i = 1, 2, \ldots, c$ (the number of child nodes of parent node $N_k$).
Let $N = \{N_k \mid k \in A\}$ be a set of nodes for an index set $A$, with $k$ ordered by depth, and let $\{g_{i,m} \mid i \in A \text{ and } m \in I\}$ be the set of degrees of generality $g_{i,m}$ for a document $d_m$ in a node $N_i$.
The adjusted generality is:
\[ d(k, m) = g_{k+i,m} \;\square\; w_{k+i}, \quad \text{where } w_{k+i} = \frac{|S_k \;\square\; S_{k+i}|}{|D_k|} \]
for the number $|D_k|$ of terms in a node $N_k$ and some child node $N_{k+i}$ containing the document $d_m$.
(The operators marked $\square$ did not survive in the source.)

Experiments
• Corpus Analysis
  • Sports news from web sources
  • 12 ontology nodes and 62 articles
• Evaluation Plan
  • Measure the relationship between human judges
  • Compare the values from humans and the algorithm
  • Use Pearson's correlation coefficient

Results
• Between human judges

                                        Level 0 (n=62)    Level 1 (n=62)    Level 2 (n=29)
  Pearson correlation coefficient (r)   0.84              0.81              0.81
  p-value from the t-test               < .0001 (df=60)   < .0001 (df=60)   < .0001 (df=27)

• Between human and algorithm

                                        Level 0 (n=62)    Level 1 (n=62)    Level 2 (n=29)
  Pearson correlation coefficient (r)   -0.22             0.73              0.68
  p-value from the t-test               0.075 (df=60)     < .0001 (df=60)   < .0001 (df=27)

Results (Cont'd)
• Scatter plot
  [Figure: degree of generality per document, plotting the human judges' scores against the scores from the algorithm]

Conclusion and Future Work
• The major contribution
  • Introduced, quantified, and confirmed a new criterion, generality
  • A new facility for capturing a user's intent
• The result shows that
  • There is a generality (confirmed by human judges)
  • The scores from human judges and from our algorithm are significantly correlated at Levels 1 and 2 (p < .0001)
• Future Work
  • Implement the generality model on an IR system
  • Test the reranking algorithm in realistic IR tasks
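
Sketch: Computing the Basic Generality Score
The following is a minimal Python sketch of the specific word set and the basic score g_j from the "Generality Quantification" and "Generality Algorithm (Basic)" slides, plus the Pearson coefficient named in the evaluation plan. It is an illustration under stated assumptions, not the authors' implementation: a document is treated as a set of distinct index terms (already stop-word filtered and stemmed), the function names `specific_terms`, `basic_generality`, and `pearson_r` are hypothetical, and the node names and terms in the usage example are invented.

```python
import math
from typing import Dict, Iterable, Sequence, Set


def specific_terms(term_sets: Dict[str, Set[str]], node: str) -> Set[str]:
    """Terms of `node` that appear in no other node's index term set."""
    others: Set[str] = set()
    for name, terms in term_sets.items():
        if name != node:
            others |= terms
    return term_sets[node] - others


def basic_generality(doc_terms: Iterable[str], specific: Set[str]) -> float:
    """Count of specific terms found in the document, divided by the
    number of distinct terms in the document (assumption: terms are
    counted once each). Returns 0.0 when no specific term occurs."""
    doc = set(doc_terms)
    if not doc:
        return 0.0
    return len(doc & specific) / len(doc)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical two-node example (invented terms, not the paper's corpus).
term_sets = {
    "baseball": {"pitcher", "inning", "score", "team"},
    "soccer": {"goal", "penalty", "score", "team"},
}
baseball_specific = specific_terms(term_sets, "baseball")
score = basic_generality(["team", "score", "pitcher", "win"], baseball_specific)
print(sorted(baseball_specific), score)
```

In this toy example only "pitcher" and "inning" are specific to the baseball node, and one of the four document terms is specific, so the score is 1/4. Counting distinct terms rather than token occurrences is a design choice made here for simplicity; the slides do not say which was used.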