Measuring Generality of Documents
Hyun Woong Shin¹, Eduard Hovy¹, Dennis McLeod¹, and Larry Pryor²
¹ Science Department
² School for Journalism
University of Southern California
Los Angeles, CA 90089
April, 2006
Problems of Current Technology
Related Work
Goals and Technical Issues
Generality Spectrum
Concluding Remarks and Future Work
Problems of Current Technology
• Information Retrieval
• Retrieved information is frequently irrelevant
– Most IR systems operationalize “Relevant” as term
– Difficult to capture and/or present semantics of
corpus and/or user intent
• Specification of desired information is based on a user’s
knowledge and interests
– Users may want to retrieve a specific level of
information (e.g., “general story”)
• The notion of generality (level of conceptual
abstraction) is useful for improving user satisfaction
with the result of an information request
• To test the hypothesis:
• Define generality
• Quantify generality
• Confirm generality
The Generality
• Generality
• Level of conceptual abstraction
• Specification of desired information
• Nothing to do with
• Story organization
• Audience
• Measure of the concept in question, within a
framework/taxonomy of concepts
• These concepts then can be placed in
appropriate structured context
Definition of Generality
• Definition
• In journalism, generality is reflected in both
how a story is linearly organized and the needs
and expectations of the audience
• Abstraction
• Measuring the specificity of words (subset of
index terms) in a document
• Assumption
• The generality or specificity of a topic depends
on the number of specific words
Generality Quantification
• Specific word set
• Subset of the index term set that does not
appear in other word sets
$$S_i = T_i \setminus \bigcup_{j \neq i} T_j$$
• An index term
• Semantic reference serves as a mnemonic device
for recalling the main themes of a document
• Identification of index terms
• Term extraction (index terms)
• Stop word removal (stop word list)
• Stemming and conflation (Porter’s algorithm)
• Index term weighting
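The identification steps and the specific word set can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop word list is a tiny hypothetical one, `crude_stem` is a rough suffix stripper standing in for Porter's algorithm (it is not an implementation of it), index term weighting is omitted, and the function names are invented for this sketch.

```python
import re

# Hypothetical, deliberately tiny stop word list (assumption, not from the slides).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}


def crude_stem(word):
    """Rough suffix stripping; a stand-in for Porter's algorithm, not Porter itself."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Term extraction, stop word removal, then stemming; returns a set of terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}


def specific_word_sets(term_sets):
    """For each index term set T_i, keep only terms that appear in no other set T_j."""
    result = {}
    for name, terms in term_sets.items():
        others = set().union(*(t for n, t in term_sets.items() if n != name))
        result[name] = terms - others
    return result


term_sets = {
    "baseball": index_terms("The pitcher throws a fastball in the game"),
    "soccer": index_terms("The striker scores a goal in the game"),
}
specific = specific_word_sets(term_sets)
print(specific["baseball"])
print(specific["soccer"])
```

In this toy input the shared term "game" is dropped from both specific word sets, because it appears in more than one index term set.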
Generality Algorithm (Basic)
Let $D = \{d_j \mid j \in J\}$ be a document containing a set of words from an index set $J$
and $S = \{t_i \mid i \in I\}$ be a set of specific terms for an index set $I$.
Define a characteristic function $\chi : I \times J \to \{0, 1\}$ by
$$\chi(i, j) = \begin{cases} 1, & t_i \in d_j \\ 0, & t_i \notin d_j \end{cases}$$
The degree of generality is
$$g_j = \frac{\sum_{i \in I} \chi(i, j)}{|D_j|}$$
for the number $|D_j|$ of terms in $D_j$.
Then, if $D_j \cap S = \emptyset$, then $g_j = 0$.
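As a minimal sketch of the basic measure, the score can be computed as the number of specific terms found in a document divided by the number of terms in the document. The function name and the choice to treat a document as a set of distinct terms are assumptions made for this illustration.

```python
def generality(document_terms, specific_terms):
    """Fraction of the document's distinct terms that are specific terms.

    Assumption: both arguments are collections of already-normalized terms,
    and the document is treated as a set of distinct terms.
    """
    doc = set(document_terms)
    if not doc:
        return 0.0  # an empty document has no specific terms
    hits = sum(1 for term in doc if term in specific_terms)
    return hits / len(doc)


specific = {"pitcher", "throw", "fastball"}
score_mixed = generality({"pitcher", "throw", "game", "team"}, specific)
score_none = generality({"game", "team"}, specific)
print(score_mixed, score_none)
```

A document that shares no terms with the specific term set scores 0, matching the last condition stated above.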
Generality Algorithm (Adjust)
Suppose that $S_k$ and $S_{k_i}$ are sets of specific terms of a parent node and
its children nodes respectively, for $i = 1, 2, \ldots, c$
(the number of children nodes for parent node $N_k$).
Let $N = \{N_k \mid k \in A\}$ be a set of nodes for an index set $A$, where
$k$ is ordered by depth, and let
$\{g_{i,m} \mid i \in A \text{ and } m \in I\}$ be a set of the degrees of generality $g_{i,m}$
for a document $d_m$ in a node $N_i$.
The adjusted generality is:
d(k ,m) g k i,m  wk i
where, wk i 
S k  S k i
for the number $|D_k|$ of terms in a node $N_k$ and
some child node $N_{k_i}$ containing the document $d_m$
• Corpus Analysis
• Sports news from web sources
• 12 ontology nodes and 62 articles
• Evaluation Plan
• Measure a relationship between human judges
• Compare the values from humans and algorithm
• Use Pearson’s correlation coefficient
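The comparison step can be illustrated with a short Python sketch that computes Pearson's r and the associated t statistic with n − 2 degrees of freedom. The scores below are made-up placeholder values, not data from the study, and the function names are assumptions for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two lists of equal length with at least 3 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    if abs(r) >= 1:
        raise ValueError("t statistic is undefined for |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical scores for five articles (placeholders only).
human_scores = [1, 2, 3, 4, 5]
algorithm_scores = [2, 4, 5, 4, 5]

r = pearson_r(human_scores, algorithm_scores)
t = t_statistic(r, len(human_scores))
print(f"r = {r:.4f}, t = {t:.4f}, df = {len(human_scores) - 2}")
```

The p-value is then read from the t distribution with the stated degrees of freedom; that lookup is left out here to keep the sketch dependency-free.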
Results
• Between human judges (p-value from the t-test on the Pearson correlation coefficient r)
– Level 0 (n=62): p < .0001 (df=60)
– Level 1 (n=62): p < .0001 (df=60)
– Level 2 (n=29): p < .0001 (df=27)
• Between human and algorithm (p-value from the t-test on the Pearson correlation coefficient r)
– Level 0 (n=62): p = 0.075 (df=60)
– Level 1 (n=62): p < .0001 (df=60)
– Level 2 (n=29): p < .0001 (df=27)
Results (Cont’d)
• Scatter plot of the degree of generality: human judges’ scores vs. scores from the algorithm
Conclusion and Future Work
• The major contribution
– Introduced, quantified, and confirmed a new criterion: generality
– A new facility for capturing a user’s intent
• The result shows that
– Generality is a perceivable property of documents (confirmed by human judges)
– The scores from the human judges and from our
algorithm are significantly correlated (p < 0.0001)
• Future Work
– Implement the generality model in an IR system
– Test the reranking algorithm in realistic IR tasks