Data Mining: Concepts and Techniques
— Chapter 10 —
10.3.2 Mining Text and Web Data (II)
Jiawei Han and Micheline Kamber
Department of Computer Science, University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
Acknowledgements: Slides by students at CS512 (Spring 2009)

Outline
• Probabilistic Topic Models (Yue Lu)
• Opinion Mining (Hyun Duk Kim)
• Mining Query Logs for Personalized Search (Yuanhua Lv)
• Online Analytical Processing on Multidimensional Text Database (Duo Zhang)

Probabilistic Topic Models
Yue Lu, Department of Computer Science, University of Illinois, Urbana-Champaign
Many slides are adapted/taken from different sources, including presentations by ChengXiang Zhai, Qiaozhu Mei and Tom Griffiths

Intuition
• Documents exhibit multiple topics. [Figure: an example article annotated with three topics: "social network website", "education", "criticism"]

What is a Topic?
• Representation: a multinomial distribution over words, i.e., a unigram language model. Example:
  retrieval 0.2, information 0.15, model 0.08, query 0.07, language 0.06, feedback 0.03, ...
• A topic is a broad concept/theme, semantically coherent, which is hidden in documents

Organize Information with Topics
• Example topic (a word distribution, e.g., an "oil price" topic): price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, ...
• Text can be organized at different resolutions; how many of each appear in a document?
  – Categories (e.g., natural hazards): 1
  – Topics (e.g., government response; oil price; loss statistics): several
  – Entities (e.g., new orleans; president bush): 50~100
  – Phrases (e.g., new orleans; put together): hundreds
  – Words (e.g., oil, new, put, orleans, is): thousands
  – Patterns: many
• Topics thus offer a compact middle level of resolution for organizing information

The Usage of Topic Models
• Example topics extracted from Hurricane Katrina news:
  – Topic 1: government 0.3, response 0.2, ...
  – Topic 2: donate 0.1, relief 0.05, help 0.02, ...
  – Topic k: city 0.2, new 0.1, orleans 0.05, ...
  – Background B: is 0.05, the 0.04, a 0.03, ...
• Example document with topic annotations: "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. ... 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] ... [Over seventy countries pledged monetary donations or other assistance]."
• Uses of a topic model:
  – Summarize themes/aspects
  – Navigate documents
  – Retrieve documents
  – Segment documents
  – Document classification
  – Document clustering

General Idea of Probabilistic Topic Models
• Cast the intuition into a generative probabilistic process (Generative Process):
  – Each document is a mixture of corpus-wide topics (a multinomial distribution / unigram LM)
  – Each word is drawn from one of those topics
• Since we only observe the documents, we need to figure out (Estimation/Inference):
  – What are the topics?
  – How are the documents divided according to those topics?
• Two basic models: PLSA and LDA

Probabilistic Latent Semantic Analysis/Indexing (PLSA) [Hofmann 99]

PLSA: Generation Process [Hofmann 99], [Zhai et al. 04]
• Topics: θ_1 (battery 0.3, life 0.2, ...), θ_2 (design 0.1, screen 0.05, ...), ..., θ_k (price 0.2, purchase 0.15, ...)
• Collection background θ_B: is 0.05, the 0.04, a 0.03, ...
• To generate a word w in document d: with probability λ_B draw w from the background θ_B; otherwise choose topic j with document-specific probability π_{d,j} and draw w from θ_j
• Parameters: λ_B = noise level (manually set); the θ's and π's need to be estimated

PLSA: Estimation [Hofmann 99], [Zhai et al. 04]
• Now the topic word distributions and document mixing weights are all unknown (battery ?, life ?, ...; π_{d,1} ?, ..., π_{d,k} ?)
• Estimated with the Maximum Likelihood Estimator (MLE) through an EM algorithm, maximizing the log-likelihood of the collection:
  \log p(C) = \sum_{d \in C} \sum_{w \in V} c(w,d) \log \left[ \lambda_B\, p(w|\theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w|\theta_j) \right]
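To make the estimation step concrete, here is a minimal Python sketch of the EM algorithm for PLSA with a fixed background model, maximizing the log-likelihood above. It is an illustrative sketch rather than code from the slides: the array names (`pi`, `theta`), the random initialization, and the fixed iteration count are all assumptions.

```python
import numpy as np

def plsa_em(counts, k, p_bg, lambda_b=0.9, iters=50, seed=0):
    """EM for PLSA with a fixed background model.

    counts: (D, V) term-frequency matrix c(w, d)
    k:      number of topics
    p_bg:   (V,) background distribution p(w | theta_B)
    Returns (pi, theta): document-topic weights and topic-word distributions.
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    pi = rng.dirichlet(np.ones(k), size=D)        # pi[d, j]
    theta = rng.dirichlet(np.ones(V), size=k)     # theta[j, w]
    for _ in range(iters):
        # E-step: probability that word w in doc d came from a topic (not background)
        mix = pi @ theta                           # (D, V): sum_j pi[d,j] theta[j,w]
        p_topic = (1 - lambda_b) * mix
        z_not_bg = p_topic / np.maximum(lambda_b * p_bg[None, :] + p_topic, 1e-12)
        # responsibility of each topic among the non-background mass
        resp = pi[:, :, None] * theta[None, :, :]  # (D, k, V)
        resp /= np.maximum(mix[:, None, :], 1e-12)
        # M-step: re-estimate pi and theta from expected counts
        exp_counts = (counts * z_not_bg)[:, None, :] * resp   # (D, k, V)
        pi = exp_counts.sum(axis=2)
        pi /= np.maximum(pi.sum(axis=1, keepdims=True), 1e-12)
        theta = exp_counts.sum(axis=0)
        theta /= np.maximum(theta.sum(axis=1, keepdims=True), 1e-12)
    return pi, theta
```

A reasonable choice for the background model is the collection word distribution, e.g. `p_bg = counts.sum(0) / counts.sum()`.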
Problems with PLSA
• "Documents have no generative probabilistic semantics", i.e., a document is just a symbol
• The model has many parameters
  – linear in the number of documents
  – needs heuristic methods to prevent overfitting
• Cannot generalize to new documents

Latent Dirichlet Allocation (LDA) [Blei et al. 03]

Basic Idea of LDA [Blei et al. 03], [Griffiths & Steyvers 02, 03, 04]
• Add a Dirichlet prior α on the topic distribution in documents
• Add a Dirichlet prior β on the word distribution in topics
• α, β can be vectors, but for convenience α = α_1 = α_2 = ...; β = β_1 = β_2 = ... (smoothed LDA)

Dirichlet Hyperparameters α, β
• Generally have a smoothing effect on the multinomial parameters
• Large α, β: more smoothed topic/word distributions
• Small α, β: more skewed topic/word distributions (e.g., biased towards a few words for each topic)
• Common settings: α = 50/K, β = 0.01
• PLSA is maximum-a-posteriori-estimated LDA under a uniform prior: α = 1, β = 1

Inference
• Exact inference is intractable
• Approximation techniques:
  – Mean-field variational methods (Blei et al., 2001, 2003)
  – Expectation propagation (Minka and Lafferty, 2002)
  – Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
  – Collapsed variational inference (Teh et al., 2006)

Would like to know more?
• "Parameter estimation for text analysis" by Gregor Heinrich
• "Probabilistic topic models" by Mark Steyvers
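Among the approximation techniques listed above, collapsed Gibbs sampling is the easiest to implement. Below is a minimal illustrative Python sketch (not code from the slides); it integrates out the document-topic and topic-word multinomials and resamples each token's topic assignment from its conditional distribution given all other assignments.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of token-id lists, e.g. [[0, 2, 2], [1, 3]]
    Returns (ndk, nkw): doc-topic and topic-word count matrices, from which
    the topic mixtures and topic word distributions can be estimated.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))                 # topic counts per document
    nkw = np.zeros((K, V))                 # word counts per topic
    nk = np.zeros(K)                       # total tokens per topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):         # initialize counts randomly
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(z_i = k | rest) ∝ (n_dk + alpha) (n_kw + beta) / (n_k + V beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                # record new assignment
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw
```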
Opinion Mining
Hyun Duk Kim

Agenda
• Overview
• Opinion finding & sentiment classification
• Opinion summarization
• Other works
• Discussion & conclusion

Web 2.0
• "Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as a platform, and an attempt to understand the rules for success on that new platform." [Wikipedia]
• Users participate in content creation, e.g., blogs, reviews, Q&A forums

Opinion Mining
• Huge volume of opinions on the Web, e.g., product reviews, blog posts about political issues
• Need a good technique to summarize them
• Example of a commercial system: MS Live Search

Usefulness of Opinion Mining
• Individuals
  – Purchasing a product/service
  – Tracking political topics
  – Other decision-making tasks
• Businesses and organizations
  – Product and service benchmarking
  – Surveys on a topic
• Ads placement
  – Place an ad when someone praises a product
  – Place an ad from a competitor when someone criticizes a product
[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]

Subtasks
• Opinion finding: decide whether the target text is opinion or fact
  – Methods: lexicon-based method, machine learning
• Sentiment classification: decide whether an opinion is positive or negative (in detail, positive/negative/mixed)
  – Methods: lexicon-based method, machine learning
• Opinion summarization: how to show opinion finding/classification results effectively
  – Basic statistics
  – Feature-level summary [Hu & Liu, KDD'04; Hu & Liu, AAAI'04]
  – Summary paragraph generation [Kim et al., TAC'08]
  – Probabilistic analysis [Mei et al., WWW'07]
• Other works

Opinion Finding
• Lexicon-based method
  – Prepare an opinion word list, e.g., words: 'good', 'bad'; phrases: 'I think', 'in my opinion'
  – Check special parts of speech expressing opinions, e.g., adjectives: 'excellent', 'horrible'; verbs: 'like', 'hate'
  – Decide based on the occurrences of those words
  – Lexicon sources: manually classified word lists; WordNet; external sources such as Wikipedia (objective) and review data (subjective)
• Machine learning
  – Train with tagged examples
  – Main features: opinion lexicons; part-of-speech tags; punctuation (e.g., '!'); modifiers (e.g., 'not', 'very'); word tokens; dependency relations

Opinion Sentiment Classification
• Methods are similar to opinion finding: lexicon-based method and machine learning
• Instead of 'opinionated' words/examples, use 'positive' and 'negative' words/examples
  – If positive or negative words dominate → classify as positive or negative
  – If both positive and negative words are dominant → classify as mixed

Query-Dependent Sentiment Classification [Lee et al., TREC'08; Jia et al., TREC'08]
• Motivation: sentiments are expressed differently for different queries, e.g., 'small' can be good for iPod size but bad for LCD monitor size
• Key ideas: use external web sources to obtain positive and negative opinion lexicons
  – Objective words: Wikipedia, the product specification part of Amazon.com
  – Subjective words: reviews from Amazon.com, Rateitall.com, and Epinions.com
    · Reviews rated 4 or 5 out of 5 → positive words
    · Reviews rated 1 or 2 out of 5 → negative words
• Top-ranked in the Text REtrieval Conference (TREC)
[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]

Opinion Summarization
• Basic statistics: show how many opinions of each polarity, e.g., opinions about iPod: positive 80%, negative 20%

Opinion Summarization (cont.)
• Feature-based summary [Hu & Liu, KDD'04; Hu & Liu, AAAI'04]: find lower-level features and analyze each, e.g., opinions about iPod:

  | Feature      | Pos | Neg |
  |--------------|-----|-----|
  | Battery life | 50% | 50% |
  | Price        | 95% |  5% |
  | Design       | 30% | 70% |

  – Feature extraction: features are usually nouns/noun phrases
  – Frequent feature identification by association mining
  – Feature pruning and infrequent feature identification based on heuristic rules
  – Sentiment summary for each feature (see the sketch following this section)

Opinion Summarization (cont.)
• Summary paragraph generation [Kim et al., TAC'08]: general NLP summarization techniques
  – Opinion filtering: show opinionated sentences; show sentences having the same polarity as the goal of the summary
  – Sentence-extraction-based summary
  – Opinion ordering: divide paragraphs by opinion polarity, e.g., paragraphs introduced by "Following are positive opinions ...", "Following are negative opinions ...", "Following are mixed opinions ..."

Opinion Summarization (cont.)
• Probabilistic analysis: the topic-sentiment mixture model [Mei et al., WWW'07]
  – Topic modeling with opinion priors
  – [Figure: the generation process of the topic-sentiment mixture model]
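As referenced in the feature-based summary above, here is a minimal Python sketch of a lexicon-based classify-then-aggregate pipeline in the spirit of Hu & Liu. The word lists are tiny illustrative stand-ins for a real opinion lexicon, and the feature list is assumed to come from an upstream extraction step (e.g., association mining over nouns), which the sketch does not implement.

```python
from collections import defaultdict

# Illustrative opinion lexicons; real systems use WordNet-derived lists.
POS_WORDS = {"good", "excellent", "great", "love", "like"}
NEG_WORDS = {"bad", "horrible", "poor", "hate", "dim"}

def polarity(sentence):
    """Lexicon-based sentiment: compare positive vs. negative word counts."""
    toks = sentence.lower().split()
    pos = sum(t in POS_WORDS for t in toks)
    neg = sum(t in NEG_WORDS for t in toks)
    if pos > neg:
        return "pos"
    if neg > pos:
        return "neg"
    return "mixed" if pos else None      # None: no opinion words found

def feature_summary(sentences, features):
    """Per-feature sentiment statistics (feature-based summary).

    sentences: list of review sentences
    features:  product features to report on, assumed already extracted
    """
    counts = defaultdict(lambda: {"pos": 0, "neg": 0})
    for s in sentences:
        p = polarity(s)
        if p not in ("pos", "neg"):
            continue
        for f in features:
            if f in s.lower():
                counts[f][p] += 1
    return {f: {k: v / max(sum(c.values()), 1) for k, v in c.items()}
            for f, c in counts.items()}

# e.g. feature_summary(["battery life is great", "battery life is poor"],
#                      ["battery life"])
# -> {'battery life': {'pos': 0.5, 'neg': 0.5}}
```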
Other Works: Comparative Analysis
• Focus on texts containing contradiction or comparison
• Finding comparative sentences [Jindal & Liu, SIGIR'06]
  – Comparison indicators such as 'than' or 'as well as', e.g., "'iPod' is better than 'Zune'"
  – Sequential patterns indicating comparative sentences, e.g., 〈{NN}{VBZ}{RB}{more JJR}{NN}{IN}{NN}〉 → comparative
• Finding the preferred entity [Ganapathibhotla & Liu, COLING'08]
  – Rule-based approach
  – Context-dependent orientation finding using Pros-and-Cons reviews
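A toy illustration of comparative-sentence detection. Jindal & Liu learn class sequential rules over POS-tag sequences; the sketch below replaces that learned machinery with one hand-written comparative pattern and a small keyword list, so it only gestures at the real method.

```python
import re

# Illustrative comparison indicators; the actual approach mines sequential
# patterns such as <{NN}{VBZ}{RB}{more JJR}{NN}{IN}{NN}> -> comparative.
INDICATORS = {"than", "superior", "inferior", "prefer", "versus"}
COMPARATIVE = re.compile(r"\b(\w+er|better|worse|more \w+|less \w+)\s+than\b")

def is_comparative(sentence):
    """Heuristic detector: comparative form followed by 'than', or a keyword."""
    s = sentence.lower()
    if COMPARATIVE.search(s):
        return True
    return any(tok in INDICATORS for tok in s.split())

# is_comparative("iPod is better than Zune")  -> True
```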
"Opinion Integration Through Semisupervised Topic Modeling", In Proceedings of the 17th International World Wide Web Conference (WWW'08) Kavita Ganesan, Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008 Hyun Duk Kim, Dae Hoon Park, V.G.Vinod Vydiswaran, and ChengXiang Zhai,Opinion Summarization Using Entity Features and Probabilistic Sentence Coherence Optimization: UIUC at TAC 2008 Opinion Summarization Pilot, Text Analysis Conference (TAC), Maryland, USA. 3/13/2016 Data Mining: Principles and Algorithms 38 References Y. Lee, S.-H. Na, J. Kim, S.-H. Nam, H.-Y. Jung and J.-H. Lee , KLE at TREC 2008 Blog Track: Blog Post and Feed Retrieval , TREC, 2008. L. Jia, C. Yu and W. Zhang, UIC at TREC 208 Blog Track, TREC, 2008. Nitin Jindal and Bing Liu. "Identifying Comparative Sentences in Text Documents" To appear in Proceedings of the 29th Annual International ACM SIGIR Conference on Research & Development on Information Retrieval (SIGIR-06), Seattle 2006. Opinion Mining and Summarization (including review spam detection), tutorial given at WWW-2008, April 21, 2008 in Beijing, China. Murthy Ganapathibhotla and Bing Liu, Mining opinions in comparative sentences, Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 241–248, Manchester, August 2008 3/13/2016 Data Mining: Principles and Algorithms 39 Thank you 3/13/2016 Data Mining: Principles and Algorithms 40 Mining User Query Logs for Personalized Search Yuanhua Lv (Some slides are taken from Xuehua Shen, Bin Tan, and ChengXiang Zhai’s presentation) Problem of Current Search Engines Jaguar Suppose we know: 1. 2. Short-term query logs: previous query = “racing cars”. [Shen et al. 05] Apple Software Car Long-term query logs: “car” occurs far more frequentlyAnimal than “Apple” in the user’s query logs of the recent 2 months. [Tan et al. 06] Chemistry Software 42 Problem Definition ? User Information Need Q1 User Query e.g., Apple software {C1,1 , C1,2 ,C1,3 ,…} C1 User Clickthrough Apple - Mac OS X The Apple Mac OS X product page. Describes features in the current version of Mac OS X, a screenshot gallery, latest software downloads, and a directory of ... e.g., Q2 … {C2,1 , C2,2 ,C2,3 ,… } C2 Qk How to model and mine user query logs? 43 Retrieval Model Basis: Unigram language model + KL divergence Mining query logs to update query model U Qk D θ ' Qk Q θD Similarity Measure k ' || ) D ( D(QQkk || DD ) Results | pk ()w| Qpk(,w p( w p | (kw) Qk|1Q ,..., k ) Q1 , Ck 1 ,..., C1 ) Query Logs 44 Mining Short-term User Query Logs [Shen et al. 05] Average user’s previous clickthrough C1 p( w | H C ) i k 1 1 k 1 p( w | C ) … i i 1 Ck-1 H p( w | H Q ) k11 p( w | Q ) … i 1 i Qk-1 H 1 Average user’s previous queries Q1 i k 1 p( w | H ) p( w | H C ) (1 ) p( w | H Q ) C H Q Combine previous clickthrough and previous queries 1 k Linearly interpolate current query and history model Qk p(w | k ) p(w | Qk ) (1 ) p(w | H ) 45 Four Heuristic Variants • FixInt: fixed coefficient interpolation p(w | k ) p(w | Qk ) (1 ) p(w | H ) Mining Short-term User Query Logs [Shen et al. 05] Average user’s previous clickthrough C1 p( w | H C ) i k 1 1 k 1 p( w | C ) … i i 1 Ck-1 H p( w | H Q ) k11 p( w | Q ) … i 1 i Qk-1 H 1 Average user’s previous queries Q1 i k 1 p( w | H ) p( w | H C ) (1 ) p( w | H Q ) C H Q Combine previous clickthrough and previous queries 1 k Fixed α? 
Four Heuristic Variants
• FixInt: fixed-coefficient interpolation, p(w | \theta_k) = \alpha\, p(w | Q_k) + (1-\alpha)\, p(w | H)
• BayesInt: adapt the interpolation coefficient to the query length
  – Intuition: if the current query Q_k is longer, we should trust Q_k more
• OnlineUp: assign more weight to more recent records
• BatchUp: the user becomes better and better at query formulation as time goes on, but we do not need to "decay" the clickthrough

Data Set of Evaluation
• Data collection: TREC AP88-90
• Topics: 30 hard topics of TREC topics 1-150
• System: search engine + RDBMS
• Context: query and clickthrough history of 3 participants

Overall Effect of Search Context
(Parameters: FixInt α=0.1, β=1.0; BayesInt μ=0.2, ν=5.0; OnlineUp μ=5.0, ν=15.0; BatchUp μ=2.0, ν=15.0)

| Query     | FixInt MAP | FixInt pr@20 | BayesInt MAP | BayesInt pr@20 | OnlineUp MAP | OnlineUp pr@20 | BatchUp MAP | BatchUp pr@20 |
|-----------|------------|--------------|--------------|----------------|--------------|----------------|-------------|---------------|
| Q3        | 0.0421     | 0.1483       | 0.0421       | 0.1483         | 0.0421       | 0.1483         | 0.0421      | 0.1483        |
| Q3+HQ+HC  | 0.0726     | 0.1967       | 0.0816       | 0.2067         | 0.0706       | 0.1783         | 0.0810      | 0.2067        |
| Improve   | 72.4%      | 32.6%        | 93.8%        | 39.4%          | 67.7%        | 20.2%          | 92.4%       | 39.4%         |
| Q4        | 0.0536     | 0.1933       | 0.0536       | 0.1933         | 0.0536       | 0.1933         | 0.0536      | 0.1933        |
| Q4+HQ+HC  | 0.0891     | 0.2233       | 0.0955       | 0.2317         | 0.0792       | 0.2067         | 0.0950      | 0.2250        |
| Improve   | 66.2%      | 15.5%        | 78.2%        | 19.9%          | 47.8%        | 6.9%           | 77.2%       | 16.4%         |

• The short-term query log helps the system improve retrieval accuracy
• BayesInt is better than FixInt; BatchUp is better than OnlineUp

Mining Long-term User Query Logs [Tan et al. 06]
• Can we mine the long-term query log similarly?
• Challenge: the long-term query log is noisy
  – How do we handle the noise?
  – Can we still improve performance?
• Solution: assign weights to the query log data (estimated with an EM algorithm)

Hierarchical History Models
• Partition the log into units S_1, ..., S_{t-1}, where each unit S_k = (q_k, D_k, C_k) is a past query with its result documents and clickthrough
• Unit history model: \theta_{S_k} estimated from q_k D_k C_k
• Overall history model: \theta_H = \sum_k w_k \theta_{S_k}, where w_k is the weight of unit S_k
• The contextual query model \theta_{q,H} combines the original query model \theta_q with \theta_H; documents \theta_d are ranked by D(\theta_{q,H} \| \theta_d)

Discriminative Weighting with Mixture Model
• How to set the weights? Treat the current result documents D_t as data generated by a mixture model with a background model \theta_B:
  \theta_{Mix} = \lambda_B \theta_B + \lambda_q \theta_q + \sum_{k=1}^{t-1} \lambda_k \theta_{S_k}
• Select {λ} to fit the data: maximize p(D_t | \theta_{Mix}) with an EM algorithm
• E.g., for the current query "jaguar" with results such as "<d1> jaguar car perfect for racing", "<d2> jaguar is a big cat ...", "<d3> locate jaguar dealer in champaign ...", the history units that explain these documents well receive larger weights
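A sketch of the weight-fitting step: with all component models held fixed, EM over the mixing weights {λ} is straightforward. The dict/array representation is an assumption for illustration.

```python
import numpy as np

def fit_mixture_weights(doc_counts, components, iters=100):
    """EM for the mixing weights of fixed unigram components.

    doc_counts:  dict word -> count over the current documents D_t
    components:  list of dicts word -> prob (theta_B, theta_q, theta_S1, ...)
    Returns lam, the mixing weights maximizing p(D_t | theta_Mix).
    """
    words = list(doc_counts)
    c = np.array([doc_counts[w] for w in words], dtype=float)
    P = np.array([[m.get(w, 1e-12) for w in words] for m in components])
    lam = np.full(len(components), 1.0 / len(components))
    for _ in range(iters):
        # E-step: posterior that each word occurrence came from component k
        joint = lam[:, None] * P
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate weights from expected counts
        lam = (post * c[None, :]).sum(axis=1)
        lam /= lam.sum()
    return lam
```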
Experimental Results
• Two query types: long-term history helps recurring queries far more than fresh queries (recurring ≫ fresh)
• Usefulness of the evidence: combination ≈ clickthrough > result documents > past queries > no context

Summary
• Mining user query logs can personalize search results and improve retrieval performance
  – Four different models to exploit short-term query logs [Shen et al. 05]
  – Assign weights to the long-term query logs to reduce the effect of noise [Tan et al. 06]

References
• Xuehua Shen, Bin Tan, and ChengXiang Zhai. Context-Sensitive Information Retrieval Using Implicit Feedback. SIGIR 2005: 43-50.
• Bin Tan, Xuehua Shen, and ChengXiang Zhai. Mining Long-Term Search History to Improve Search Accuracy. KDD 2006: 718-723.

Data Mining: Concepts and Techniques
— Chapter 11 —
11.8. Online Analytical Processing on Multidimensional Text Database
Duo Zhang
Department of Computer Science, University of Illinois at Urbana-Champaign
http://sifaka.cs.uiuc.edu/~dzhang22/

Outline
• Motivation
• Text Cube: Computing IR Measures for Multidimensional Text Database Analysis
• Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases

Motivation
• Industry and commercial applications often collect huge amounts of data containing both structured data records and unstructured text in a multidimensional text database
  – Incident reports
  – Job descriptions
  – Product reviews
  – Service feedback
• It is highly desirable and strategically important to support high-performance search and mining over such databases

Example: Aviation Safety Reporting System (ASRS)

| Time   | Location | Environment | ... | Narrative |
|--------|----------|-------------|-----|-----------|
| 199801 | TX       | Daylight    | ... | "... I TOLD HIM I WAS AT 2000 FT AND HE SAID OK ..." |
| 199801 | LA       | Daylight    | ... | "... WE STOPPED THE DSCNT AT CIRCLING MINIMUMS ..." |
| 199801 | LA       | Night       | ... | "... THE TAXI/LNDG LIGHTS VERY DIM. NO OTHER VISIBLE TFC IN SIGHT ..." |
| 199902 | FL       | Night       | ... | "... I FEEL WE SHOULD ALL EDUCATE OURSELVES ON CHKLISTS ..." |

• How to organize the data to help experts efficiently explore and digest the text information? E.g., compare the reports in 1998 with the reports in 1999
• How to help experts analyze a specific type of anomaly in different contexts? E.g., what did pilots say about the anomaly "landing without clearance" during daylight vs. night?

Text Cube: Computing IR Measures for Multidimensional Text Database Analysis [C. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao, ICDE'08]
• A novel data cube model integrating the power of the traditional data cube and IR techniques for effective text mining
  – Computes IR measures for multidimensional text database analysis
  – Heterogeneous records are examined: structured categorical attributes plus unstructured free text
  – IR statistics evaluated: TF-IDF, inverted index

Text Cube – Implementation
• Preprocessing: stemming, stop-word elimination, TF-IDF weighting
• Concept hierarchy construction
  – A dimension hierarchy takes the form of a tree or a DAG; an attribute at a lower level reveals more details
  – Four operations are supported: roll-up, drill-down, slice, and dice
• Term hierarchy construction
  – A term hierarchy represents the semantic levels of terms in the text and their correlations, infused with expert knowledge
  – Two novel operations: pull-up and push-down
• Partial materialization
  – If a non-materialized cell is retrieved, it is computed on the fly from the partially materialized cuboids
  – A balance between time and space: given a time threshold δ, minimize the storage size subject to retrieving any cell of interest within the query time bound δ

Experiment – Efficiency and Effectiveness
• [Figures: comparisons of avgTF across cells with different "Environment: Weather Elements" values and different "Supplementary: Problem Areas" values]
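The additivity of term statistics is what makes partial materialization cheap: a parent cell's term frequencies and document counts are sums over its children, so aggregated cells never require rescanning the raw text. A minimal sketch, with an assumed (Counter, doc-count) cell representation:

```python
from collections import Counter

def make_cell(docs):
    """Materialize a base cell: term counts plus document count."""
    tf = Counter()
    for doc in docs:
        tf.update(doc.lower().split())
    return tf, len(docs)

def rollup(children):
    """Parent cell = sum of child cells: TFs and doc counts are additive."""
    tf, n = Counter(), 0
    for child_tf, child_n in children:
        tf.update(child_tf)
        n += child_n
    return tf, n

def avg_tf(cell, term):
    """avgTF of a term in a cell: total TF divided by number of documents."""
    tf, n = cell
    return tf[term] / n if n else 0.0
```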
Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases [D. Zhang, C. Zhai, and J. Han, SDM'09]

Motivation
• Same ASRS scenario as above: experts want to explore the reports along the standard dimensions (Time, Location, Environment, ...) and along the semantic topics hidden in the narratives, e.g., compare the reports in 1998 and 1999, or compare what pilots said about the anomaly "landing without clearance" during daylight vs. night

Solution: Topic Cube
• [Figure: a cube with a time dimension (98.01, 98.02, 99.01, 99.02 rolling up to 1998, 1999), a location dimension (LAX, SJC, MIA, AUS rolling up to CA, FL, TX), and a topic dimension of anomaly events (e.g., "deviation" drilling down to "undershoot"/"overshoot", "encounter" drilling down to "turbulence"/"birds"); roll-up and drill-down operate along all three dimensions]
• Challenges
  – How to support operations along the topic dimension?
  – How to quickly extract the semantic topics?

Constructing a Topic Cube
• The topic dimension is a hierarchy of anomaly events that supports roll-up and drill-down, e.g., ALL → {Anomaly Altitude Deviation, Anomaly Maintenance Problem, Anomaly Inflight Encounter}; Anomaly Altitude Deviation → {Undershoot, Overshoot}; Anomaly Maintenance Problem → {Improper Documentation, Improper Maintenance}; Anomaly Inflight Encounter → {Birds, Turbulence}
• Each topic node carries a word distribution estimated from the narratives by a topic model, e.g., (altitude 0.03, ft 0.02, climb 0.01, ...) or (descent 0.06, cloud 0.03, ft 0.01, ...)
• Materialization: instead of running PLSA from scratch in every cell (e.g., LAX-overshoot, CA-overshoot, US-overshoot, LAX-all, CA-all, US-all), initialize an aggregated cell from its already-materialized children so that EM converges quickly
  – M_topic-agg (aggregate along the topic dimension, combining the child topics {s_i^{(L-1)}, ..., e_i^{(L-1)}} of topic i at level L):
    p_c^{(0)}(w \mid \theta_i^{(L)}) = \frac{\sum_d \sum_{j' \in \{s_i^{(L-1)}, \ldots, e_i^{(L-1)}\}} c(w,d)\, p(z_{d,w} = j')}{\sum_{w'} \sum_d \sum_{j' \in \{s_i^{(L-1)}, \ldots, e_i^{(L-1)}\}} c(w',d)\, p(z_{d,w'} = j')}
  – M_std-agg (aggregate along a standard dimension, pooling the documents D_{c_i} of the child cells c_i):
    p_c^{(0)}(w \mid \theta_j^{(L_n)}) = \frac{\sum_{c_i} \sum_{d \in D_{c_i}} c(w,d)\, p(z_{d,w} = j)}{\sum_{w'} \sum_{c_i} \sum_{d \in D_{c_i}} c(w',d)\, p(z_{d,w'} = j)}

Experimental Results
• Word distributions of the topic "landing without clearance" in two contexts:

| Daylight word | p(w|θ) | Night word                | p(w|θ) |
|---------------|--------|---------------------------|--------|
| tower         | 0.075  | tower                     | 0.035  |
| pattern       | 0.061  | runway                    | 0.029  |
| final         | 0.060  | light                     | 0.027  |
| runway        | 0.053  | instrument landing system | 0.015  |
| land          | 0.052  | beacon                    | 0.014  |
| downwind      | 0.039  |                           |        |

• Example daylight narrative: "... WINDS ALOFT AT PATTERN ALT OF 1000 FT MSL, WERE MUCH STRONGER AND A DIRECT XWIND. NEEDLESS TO SAY, THE PATTERNS AND LNDGS WERE DIFFICULT FOR MY STUDENT AND THERE WAS LIGHT TURB ON THE DOWNWIND ..."
• Example night narrative: "... I LISTENED TO HWD ATIS AND FOUND THE TWR CLOSED AND AN ANNOUNCEMENT THAT THE HIGH INTENSITY LIGHTS FOR RWY 28L WERE INOP. BROADCASTING IN THE BLIND AND LOOKING FOR THE TWR BEACON AND LOW INTENSITY LIGHTS AGAINST A VERY BRIGHT BACKGROUND CLUTTER OF STREET LIGHTS, ETC ..."
• [Figures: objective function vs. number of iterations (closeness to the optimum point), and running time in seconds across contexts, for the aggregation-based materialization vs. the baseline]
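To illustrate the M_std-agg formula above, here is a minimal Python sketch, assuming each materialized child cell stored its expected word-topic counts \sum_{d \in D_c} c(w,d) p(z_{d,w} = j) from its own PLSA run; pooling and normalizing these gives the starting point p_c^{(0)}(w | \theta_j) for EM in the parent cell. The data structures are illustrative, not from the paper.

```python
from collections import defaultdict

def mstd_agg_init(child_stats, n_topics):
    """Initialize parent-cell topic models from materialized child cells.

    child_stats: for each child cell, a list (over topics j) of dicts
                 word -> sum over d in D_c of c(w, d) * p(z_{d,w} = j),
                 i.e., expected word counts saved from the child's PLSA run.
    Returns p0[j][w], the EM starting point for the parent cell.
    """
    agg = [defaultdict(float) for _ in range(n_topics)]
    for stats in child_stats:            # pool expected counts over children
        for j in range(n_topics):
            for w, cnt in stats[j].items():
                agg[j][w] += cnt
    p0 = []
    for j in range(n_topics):            # normalize each topic to a distribution
        total = sum(agg[j].values()) or 1.0
        p0.append({w: cnt / total for w, cnt in agg[j].items()})
    return p0
```

Starting EM from this pooled estimate rather than a random point is what makes the aggregation strategy converge in fewer iterations than the baseline in the experiments above.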