Creating and Visualizing Document Classification

J. Gelernter, D. Cao, R. Lu, E. Fink, J. Carbonell
Justification for fuzzy document
classification
Fuzzy aims: how can you
know exactly what you’re
looking for when you don’t
know the possibilities?
“anomalous state of knowledge”
(Belkin et al 1982)
So fuzzy clusters reflect the
cognitive state
Research overview
Hypothesis: Fuzzy results clustering and visualization
should save time by directing searchers to the level of
results that they wish to view (rather than breaking off
arbitrarily at screen bottom)
…in a prototype digital library for paleontology
Talk overview
Background: classification is often done with algorithms
alone, de-emphasizing document representation
Approach: * Facets and browse categories
* Metadata generation
* Classifier algorithms
* Visualization: labels and color grid
Findings from paleontologist experiments
positive response to our fuzzy classification
muted response to our fuzzy visualization
Background: fuzzy clustering
Text classification is well-researched (Sebastiani,
2002 review).
Classification depends on the algorithm used
(k-nearest neighbor, naïve Bayes, support
vector machines, etc.)
and on document representation (bag of words,
or with natural language processing factors)
Our work differs from others’ in its emphasis on
document representation, which we hoped would
provide greater precision.
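For reference, the bag-of-words representation mentioned above can be sketched in a few lines. This is only a minimal illustration of the general idea, not the representation used in the prototype; the function name `bag_of_words` and the simple letters-only tokenizer are assumptions made for the example.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens, ignoring word order and punctuation.

    Hypothetical helper for illustration; the tokenizer keeps only runs
    of ASCII letters, which is an assumption, not the authors' method.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


if __name__ == "__main__":
    counts = bag_of_words("Allosaurus fossils; Allosaurus teeth.")
    print(counts.most_common())
```

A representation like this discards where in the article a term occurs, which is the information the rhetoric-based metadata is meant to retain.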
Background: fuzzy info visualization
“Research in visualisation of
fuzzy systems is still at an
early stage”
(Pham and Brown, 2003)
-- location on the page—
with the top being most
relevant (see left, ours)
-- 3D
-- icons (see left)
-- color gradations with
dark most relevant (ours)
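The color-gradation idea (darker means more relevant) can be sketched as a mapping from a relevance level to a gray shade. The function name `relevance_shade`, the three-level default, and the lightness range are assumptions chosen for illustration; the slides do not specify the actual colors used.

```python
def relevance_shade(level: int, max_level: int = 3) -> str:
    """Return a gray hex color; a higher relevance level gives a darker shade.

    Hypothetical helper: the lightness endpoints below are assumptions.
    """
    if not 1 <= level <= max_level:
        raise ValueError("level must be between 1 and max_level")
    lightest, darkest = 0.85, 0.25  # lightness for least / most relevant
    if max_level == 1:
        lightness = darkest
    else:
        step = (lightest - darkest) / (max_level - 1)
        lightness = lightest - step * (level - 1)
    value = round(lightness * 255)
    return f"#{value:02x}{value:02x}{value:02x}"


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(level, relevance_shade(level))
```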
Pre-set queries:
facets based on user needs
Queries are supported by controlled
vocabulary, or ontology
Metadata generation:
classification according to
article rhetoric
(could be improved)
Knowledge Engineering
rather than machine learning for small document set
Rules for finding matches of
document to query
Example:
Ma [number]
Mya [number]
Myr [number]
B.P [number]
in document matches to
associated time periods
Rules for clustering
documents into fuzzy
categories (requires
metadata generation)
Example:
*** Highly relevant if
match found in title or
abstract
** Relevant if match
found in caption…
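The two rules stated above could be encoded as a lookup from the article section where a match occurs to a fuzzy category. This is a sketch under assumptions: the section names, the function name `fuzzy_category`, and the choice to return `None` for sections the slide does not cover are all hypothetical; the rules elided on the slide are deliberately not filled in.

```python
from typing import Iterable, Optional

# Only the two rules given on the slide; further rules are elided there.
SECTION_CATEGORY = {
    "title": "highly relevant",
    "abstract": "highly relevant",
    "caption": "relevant",
}
RANK = {"highly relevant": 2, "relevant": 1}


def fuzzy_category(matched_sections: Iterable[str]) -> Optional[str]:
    """Return the strongest category over all sections containing a match.

    Returns None when no matched section is covered by the stated rules.
    """
    best = None
    for section in matched_sections:
        label = SECTION_CATEGORY.get(section)
        if label is not None and (best is None or RANK[label] > RANK[best]):
            best = label
    return best


if __name__ == "__main__":
    print(fuzzy_category(["caption", "title"]))
```

Taking the strongest category means a document matched in both a caption and the title is filed once, under the higher level.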
To solve the problem of showing
uncertainty clusters in a familiar list
To solve the problem of
showing more results per
screen as well as
showing clusters
Participant experiments
(algorithm testing)
Participants: 3 paleontologists (undergraduate,
graduate and museum curator)
Method: Compare classifications of people and
system for same articles
• Sample: 30 articles, a mix of training-set and non-training-set articles, from 3 categories: ginkgo (3
levels of relevancy), Allosaurus (3 levels of
relevancy), neither
RESULTS: 70% of system classifications agreed with at least 1/3 of
participant ratings
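One way such a person-versus-system comparison could be computed is sketched below. It assumes the measure is the share of articles for which the system's category matches the rating of at least one participant; the slide does not spell out the exact measure, so this reading, the function name `agreement_rate`, and the sample labels are all assumptions for illustration and are not the study's data.

```python
from typing import Sequence


def agreement_rate(system_labels: Sequence[str],
                   participant_labels: Sequence[Sequence[str]]) -> float:
    """Fraction of articles whose system label matches at least one
    participant's label for the same article (assumed measure)."""
    if len(system_labels) != len(participant_labels):
        raise ValueError("one list of participant ratings is needed per article")
    if not system_labels:
        raise ValueError("no articles to compare")
    hits = sum(
        1
        for system, ratings in zip(system_labels, participant_labels)
        if system in ratings
    )
    return hits / len(system_labels)


if __name__ == "__main__":
    # Made-up labels for four articles and three raters each.
    system = ["high", "low", "none", "medium"]
    raters = [
        ["high", "medium", "high"],
        ["medium", "medium", "high"],
        ["none", "none", "low"],
        ["low", "medium", "low"],
    ]
    print(agreement_rate(system, raters))
```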
Participant experiments
(interface testing)
Pilot testing with a paleontologist in our group
• Paleontology conferences:
– Spring 2009 NAPC (North American Paleontological
Convention) – 17 returned
– Fall 2009 SVP (Society of Vertebrate Paleontology)
Ask 3 graduate or undergraduate students in paleontology to
classify the articles – results not yet returned
• Questionnaires
– Spring questionnaire: design focus
– Fall: comparative focus (features as well as design)
RESULTS: 58.8% liked our labels
35.7% liked our grid
Future directions
To improve fuzzy classification:
adapt CiteSeer parse algorithm to improve
our classification
To improve visualization:
list view with labels and colors for uncertainty
levels
Contributions in summary
(1) Fuzzy result groupings represent the “fuzzy”
concept of the search aim as in the user’s mind,
so uncertainty labels are appreciated
(2) Fuzzy color blocks that represent
abstract categories are not liked; stick to
minor modifications of the familiar list
References
Belkin, N.J., Oddy, R.N. and Brooks, H.M. (1982). ASK for
information retrieval. Part I: Background and theory;
Part II: Results of a design study. Journal of
Documentation, 38(2), 61-71; 38(3), 145-164.
Pham, B. and Brown, R. (2003). Analysis of visualisation
requirements for fuzzy systems. Proceedings of the 1st
international conference on computer graphics and
interactive techniques in Australasia and South East
Asia, Melbourne, Australia, 181 ff.
Sebastiani, F. (2002). Machine learning in automated text
categorization. ACM Computing Surveys, 34(1), 1-47.