ppt

advertisement
Understanding Text Corpora
with Multiple Facets
Lei Shi, Furu Wei, Shixia Liu, Xiaoxiao Lian, Li Tan and Michelle X. Zhou
IBM Research
Emergency Room Records
Hotel Reviews
Intelligence Reports
Email Documents
Financial News/Blogs/Message Boards
Outline
 Problem & Related Work
 Multi-Facet Text Data Model and Text Processing
– Data model
– Text pre-processing
– Content summarization
 Visualization
– Metaphor
– Creation algorithm
– Interactions
 Video Demo
Problem & Related Work
 It’s challenging to build a visual analytics
tool to explain multi-faceted text corpora!
– How to combine the raw text data with rich
text analytics result for visualization?
– What visual metaphors to apply to effectively
illustrate text content, evolution and facet
correlations?
– How to customize interactions to assist user in
data navigation and other visual analytics task?
 Related work
– Text trend visualization
• ThemeRiver, NameVoyager, etc.
– Text content visualization
• Tag cloud, Wordle, PhraseNet, etc.
– Text entity pattern visualization
• TileBars, Jigsaw, FeatureLens, Takmi, etc.
– Text visualization in specific domains
• Themail@email, TileBars@search,
Multi-Facet Data Model and Text Pre-Processing
 Multi-Facet Data Model for Text Corpora -– Time Facet
D  , Fu , Fs , T 
• Explicit field or extracted from raw text
– Category Facet
• Topic modeling by Latent Dirichlet Allocation (LDA, Blei et al. 2003)
• Category labels from document classification/clustering
• Leverage other nominal structured information (hotel names, countries, etc.)
– Unstructured (Content) Facets
• Inherent multiple text fields
• Multiple facets from NE extraction (people, location, organization) or POS
parsing (Noun, Verbs, Adjective)
– Structured Facets
• Categorical, numerical or nominal data fields
• Other calculated categorical value (sentiment orientations, average ratings)
Content Facet Summarization
kth document in
the collection
{…, P(Ti | Dk), …}
A set of topic
probabilities
A set of topics
{T1, …Ti,… TN }
A set of keywords
Rank the topics to present
most valuable ones first
{W1, …, Wj, …, WM}
A set of word probabilities
Select keyword sub-set for each
time segment for content summary {…, P(Wj | Ti), …}
{…} t-1, {…, Wj, …}t, {…} t+1,
Content Facet Summarization
 Topic/category re-ranking by topic coverage and variance: find the most
active topic with significant variety
Doc-topic dist.
– Topic coverage:
Doc no.
Doc length
– Topic variance:
– Balancing two metrics:
 Keyword re-ranking
– Topic keyword re-ranking:
Topic-keyword distribution
Topic number
– Time-sensitive keyword re-ranking: preserve completeness and distinctiveness
• Completeness: cover the original keywords of a topic
• Distinctiveness: distinguish one time segment from another
System Architecture
Text
collection
Text Preprocessing
Text content +
meta data
Text Summarization
Summarization
results
Visualization
User
Interaction
Visualization Metaphors
 Multi-stack trend visualization + Time-sensitive tag clouds
– Vis-data mappings: time facet – x (time) axis, category facet – stack,
unstructured facets – tag clouds, structured facet – keyword style (color/font)
– Other mappings: document count – y axis, re-ranked occurrence count -keyword size
Unstructured Facets
Category Facet
Structured Facets
Time
Keywords Layout
 Keyword layout with the sweep-line greedy algorithm
Interactions
 Temporal zooming for time facet navigation
 Topic editing for category facet navigation
 Unstructured facet navigation panel
 Structured facet mapping
 Other customized interactions: topic focus-in-context view
Focus-In-Context View Calculation
 Constraints for detailed trend view
– Contour-preserving
– Flexible space control
– All topic trends as undistorted as possible
 1D
–
–
–
fisheye distortion
Height calculation for expanded trend
Order-preserving height adjustment
Apply fisheye distortion from the center line of selected topic
Video Demo
Visual Analytics for Emergency Room Record
Thai
Korean
Traditional Chinese
Russian
Gracias
Thank You
English
Spanish
Obrigado
Brazilian Portuguese
Arabic
Danke
German
Grazie
Italian
Simplified Chinese
Merci
French
Japanese
Tamil
18
Hindi
Download