Research in IR at Microsoft Research (http://research.microsoft.com)
• Adaptive Systems and Interaction - IR/UI
• Machine Learning and Applied Statistics
• Data Mining
• Natural Language Processing
• Collaboration and Education
• MSR Cambridge
• MSR Beijing
• Microsoft product groups - many IR-related

IR Research at MSR
• Improvements to representations and matching algorithms
• New directions:
  - user modeling
  - domain modeling
  - interactive interfaces
• An example: text categorization

Traditional View of IR
[Diagram: Query Words -> Ranked List]

What's Missing?
[Diagram: Query Words -> Ranked List, augmented with User Modeling, Domain Modeling, and Information Use]

Domain/Object Modeling
• Not all objects are equal … a potentially big win
• A priori importance
• Information use ("readware"; collaborative filtering)
• Inter-object relationships - link structure / hypertext
• Subject categories - e.g., text categorization, text clustering
• Metadata - e.g., reliability, recency, cost -> combining
User/Task Modeling
• Demographics
• Task - what is the user's goal? (e.g., Lumiere)
• Short- and long-term content interests (e.g., Implicit Queries)
  - Interest model = f(content_similarity, time, interest)
  - e.g., Letizia, WebWatcher, Fab

Information Use
• Beyond the batch IR model ("query -> results")
• Consider the larger task context - knowledge management
• Human attention is the critical resource
• Techniques for automatic information summarization, organization, discovery, filtering, mining, etc.
• Advanced UIs and interaction techniques - e.g., tight coupling of search and browsing to support information management

The Broader View of IR
[Diagram: Query Words -> Ranked List, surrounded by User Modeling, Domain Modeling, and Information Use]

Text Categorization: Road Map
• Text categorization basics
• Inductive learning methods
• Reuters results
• Future plans

Text Categorization
• Text categorization: assign objects to one or more of a predefined set of categories using text features
• Example applications:
  - sorting new items into existing structures (e.g., general ontologies, file folders, spam vs. not)
  - information routing/filtering/push
  - topic-specific processing
  - structured browsing & search

Text Categorization - Methods
• Human classifiers (e.g., Dewey, LCSH, MeSH, Yahoo!, CyberPatrol)
• Hand-crafted knowledge-engineered systems (e.g., CONSTRUE)
• Inductive learning of classifiers - (semi-)automatic classification

Classifiers
• A classifier is a function f(x) = conf(class) from attribute vectors x = (x1, x2, …, xd) to target values confidence(class)
• Example classifiers:
  - if (interest AND rate) OR (quarterly), then confidence(interest) = 0.9
  - confidence(interest) = 0.3*interest + 0.4*rate + 0.1*quarterly

Inductive Learning Methods
• Supervised learning to build classifiers:
  - labeled training data (i.e., examples of each category)
  - "learn" a classifier
  - test effectiveness on new instances
• Statistical guarantees of effectiveness
• Classifiers are easy to construct and update; they require only subject knowledge
• Customizable for an individual's categories and tasks

Text Representation
• Vector space representation of documents, one dimension per word:
    word1  word2  word3  word4  …
    Doc 1 = <1, 0, 3, 0, …>
    Doc 2 = <0, 1, 0, 0, …>
    Doc 3 = <0, 0, 0, 5, …>
• Text can have 10^7 or more dimensions - e.g., 100k web pages had 2.5 million distinct words
• Mostly use: simple words, binary weights, a subset of features

Feature Selection
• Word distribution - remove very frequent and very infrequent words, following Zipf's law: frequency * rank ~ constant (so log(frequency) + log(rank) ~ constant)
[Plot: # words (f) vs. words by rank order (r), r = 1, 2, 3, …, m]

Feature Selection (cont'd)
• Fit to categories - use mutual information to select the features that best discriminate category vs. not:
    MI(x, C) = Σ p(x, C) log( p(x, C) / (p(x) p(C)) )
• Designer features - domain-specific, including non-text features
• Use the 100-500 best features from this process as input to the learning methods

Inductive Learning Methods
• Find Similar
• Decision Trees
• Naïve Bayes
• Bayes Nets
• Support Vector Machines (SVMs)
• All support:
  - "probabilities" - graded membership; comparability across categories
  - adaptation - over time; across individuals

Find Similar
• Aka relevance feedback (Rocchio)
• Classifier parameters are a weighted combination of the weights in positive and negative examples - a "centroid":
    w_j = Σ_{i in rel} x_{i,j} / n  -  Σ_{i in non_rel} x_{i,j} / (N - n)
• New items are classified using: Σ_j w_j x_j
• Use all features, idf weights

Decision Trees
• Learn a sequence of tests on features, typically using top-down, greedy search
• Binary (yes/no) or continuous decisions
[Example tree: test f7, then f1; leaf probabilities P(class) = .9, .6, .2]

Naïve Bayes
• Aka the binary independence model
• Maximize Pr(Class | Features):
    P(class | x) = P(x | class) P(class) / P(x)
• Assume features are conditionally independent given the class - the math is easy, and it is surprisingly effective
[Graphical model: C with arrows to x1, x2, x3, …, xn]

Bayes Nets
• Maximize Pr(Class | Features)
• Does not assume independence of features - dependency modeling
[Graphical model: C with arrows to x1, x2, x3, …, xn, plus dependencies among the features]

Support Vector Machines
• Vapnik (1979)
• Binary classifiers that maximize the margin
• Find the hyperplane separating positive and negative examples
• Optimization for maximum margin:
    min ||w||^2, subject to w·x + b >= +1 (positive examples), w·x + b <= -1 (negative examples)
• Classify new items using: w·x + b
• The margin is determined by the support vectors
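The binary bag-of-words representation and the mutual-information feature selection described above can be sketched in a few lines of Python. The toy vocabulary, documents, and labels below are invented for illustration; the function names are mine, not from the deck:

```python
import math

def binary_vector(text, vocabulary):
    """Binary bag-of-words: 1 if the word occurs in the document, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

def mutual_information(vectors, labels, j):
    """MI(x_j, C) = sum over x in {0,1}, c in {0,1} of
    p(x, c) * log( p(x, c) / (p(x) p(c)) )."""
    n = len(labels)
    mi = 0.0
    for x in (0, 1):
        for c in (0, 1):
            p_xc = sum(1 for v, l in zip(vectors, labels)
                       if v[j] == x and l == c) / n
            p_x = sum(1 for v in vectors if v[j] == x) / n
            p_c = sum(1 for l in labels if l == c) / n
            if p_xc > 0:
                mi += p_xc * math.log(p_xc / (p_x * p_c))
    return mi

# Hypothetical toy corpus for an "interest"-like category (1 = in category)
vocabulary = ["rate", "discount", "wheat", "the"]
docs = ["the discount rate remains unchanged",
        "the bundesbank left the lombard rate",
        "wheat exports rose sharply",
        "the wheat harvest was poor"]
labels = [1, 1, 0, 0]

vectors = [binary_vector(d, vocabulary) for d in docs]
scores = {w: mutual_information(vectors, labels, j)
          for j, w in enumerate(vocabulary)}
best = sorted(scores, key=scores.get, reverse=True)  # highest-MI words first
```

On this toy data the perfectly discriminating words ("rate", "wheat") get the maximum MI of log 2, while the near-uniform word "the" scores much lower; keeping only the top-ranked features mirrors the "100-500 best features" step above.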
Support Vector Machines
• Extendable to:
  - non-separable problems (Cortes & Vapnik, 1995)
  - non-linear classifiers (Boser et al., 1992)
• Good generalization performance:
  - handwriting recognition (LeCun et al.)
  - face detection (Osuna et al.)
  - text classification (Joachims)
• Platt's Sequential Minimal Optimization (SMO) algorithm is very efficient

SVMs: Platt's SMO Algorithm
• SMO is a very fast way to train SVMs
• SMO works by analytically solving the smallest possible QP sub-problem
• Substantially faster than chunking
• Scales somewhere between O(N) and O(N^2)

Text Classification Process
[Pipeline diagram: text files -> Index Server -> word counts per file -> feature selection -> data set -> learning methods (Find Similar, Decision Tree, Naïve Bayes, Bayes Nets, Support Vector Machine) -> test classifier]

Reuters Data Set (21578 - ModApte split)
• 9603 training articles; 3299 test articles
• Example "interest" article:

    2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
    FRANKFURT, March 2 - The Bundesbank left credit policies unchanged after
    today's regular meeting of its council, a spokesman said in answer to
    enquiries. The West German discount rate remains at 3.0 pct, and the
    Lombard emergency financing rate at 5.0 pct.
    REUTER

• Average article is 200 words long

Reuters Data Set (21578 - ModApte split)
• 118 categories; an article can be in more than one category
• Learn 118 binary category distinctions
• Most common categories (#train, #test):
  - Earn (2877, 1087)
  - Acquisitions (1650, 179)
  - Money-fx (538, 179)
  - Grain (433, 149)
  - Crude (389, 189)
  - Trade (369, 119)
  - Interest (347, 131)
  - Ship (197, 89)
  - Wheat (212, 71)
  - Corn (182, 56)

Category: "interest"
[Decision-tree diagram with tests such as rate=1, rate.t=1, lending=0, prime=0, discount=0, pct=1, year=0, year=1]

Category: Interest - example SVM feature weights (w)
• Positive: 0.70 prime, 0.67 rate, 0.63 interest, 0.60 rates, 0.46 discount, 0.43 bundesbank, 0.43 baker
• Negative: -0.71 dlrs, -0.35 world, -0.33 sees, -0.25 year, -0.24 group, -0.24 dlr, -0.24 january

Accuracy Scores
• Based on a contingency table:

                   Truth: Yes   Truth: No
    System: Yes        a            b
    System: No         c            d

• Effectiveness measures for binary classification:
  - error rate = (b+c)/n
  - accuracy = 1 - error rate
  - precision (P) = a/(a+b)
  - recall (R) = a/(a+c)
  - break-even = (P+R)/2
  - F measure = 2PR/(P+R)

Reuters - Accuracy ((R+P)/2)

                 Findsim   NBayes   BayesNets   Trees   LinearSVM
    earn          92.9%     95.9%     95.8%     97.8%     98.0%
    acq           64.7%     87.8%     88.3%     89.7%     93.6%
    money-fx      46.7%     56.6%     58.8%     66.2%     74.5%
    grain         67.5%     78.8%     81.4%     85.0%     94.6%
    crude         70.1%     79.5%     79.6%     85.0%     88.9%
    trade         65.1%     63.9%     69.0%     72.5%     75.9%
    interest      63.4%     64.9%     71.3%     67.1%     77.7%
    ship          49.2%     85.4%     84.4%     74.2%     85.6%
    wheat         68.9%     69.7%     82.7%     92.5%     91.8%
    corn          48.2%     65.3%     76.4%     91.8%     90.3%
    Avg Top 10    64.6%     81.5%     85.0%     88.4%     92.0%
    Avg All Cat   61.7%     75.2%     80.0%     N/A       87.0%

• Recall: % labeled in category among those stories that are really in the category
• Precision: % really in category among those stories labeled in the category
• Break-even: (Recall + Precision) / 2

[ROC plot (precision vs. recall) for category "Grain", comparing Linear SVM, Decision Tree, Naïve Bayes, and Find Similar]
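The contingency-table measures above, and the precision/recall curves plotted per category, can both be computed directly from classifier output. A minimal sketch; the function names and the toy scores are mine, not from the deck:

```python
def binary_metrics(a, b, c, d):
    """Effectiveness measures from the contingency table:
    a = system yes / truth yes,  b = system yes / truth no,
    c = system no / truth yes,   d = system no / truth no."""
    n = a + b + c + d
    p = a / (a + b)  # precision
    r = a / (a + c)  # recall
    return {"error rate": (b + c) / n,
            "accuracy": 1 - (b + c) / n,
            "precision": p,
            "recall": r,
            "break-even": (p + r) / 2,
            "F": 2 * p * r / (p + r)}

def pr_points(scores, labels):
    """Trace a precision/recall curve by sweeping a decision threshold
    over graded classifier scores, strictest cutoff first."""
    pts = []
    for t in sorted(set(scores), reverse=True):
        pred = [s >= t for s in scores]
        a = sum(1 for p, l in zip(pred, labels) if p and l)          # true pos
        b = sum(1 for p, l in zip(pred, labels) if p and not l)      # false pos
        c = sum(1 for p, l in zip(pred, labels) if not p and l)      # false neg
        pts.append((a / (a + c), a / (a + b)))                       # (recall, precision)
    return pts

m = binary_metrics(a=80, b=20, c=20, d=880)
# Six hypothetical test stories with graded scores; 1 = truly in category
curve = pr_points([0.9, 0.8, 0.7, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0, 0])
```

Each threshold yields one (recall, precision) point; connecting the points for each learning method produces curves like the ones shown for Grain and the other categories.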
[ROC plots (precision vs. recall, both 0-1) for categories Earn, Acq, Money-Fx, Grain, Crude, Trade, Interest, Ship, Wheat, and Corn, each comparing Linear SVM, Decision Tree, Naïve Bayes, and Find Similar]

SVM: Dumais et al. vs. Joachims (top 10 Reuters categories, micro-averaged break-even)

                         Dumais et al.          Joachims
    Accuracy             92.0                   84.2 - 86.5 (rbf, g=.8)
    Features, type       binary                 tf*idf
    Features, number     300/category           9947
    Stemming             none                   stemmed
    Title/Body           distinguished          ?
    Number of categs     118                    90
    Split                ModApte (9603/3299)    ModApte (9603/3299)

Reuters - Sample Size (SVM)

Per-category training-sample size and (p+r)/2 break-even at 100%, 10%, 5%, and 1% of the training data, for two random samples:

Sample set 1:

    category      100% sz / BE    10% sz / BE    5% sz / BE     1% sz / BE
    0-acq         2876 / 98.3%    281 / 97.8%    145 / 97.4%    35 / 93.6%
    1-earn        1650 / 97.0%    162 / 94.6%     80 / 90.4%    14 / 65.6%
    3-money-fx     538 / 80.2%     55 / 66.3%     28 / 63.9%     3 / 41.9%
    4-grain        433 / 95.9%     46 / 91.5%     21 / 87.0%     3 / 50.3%
    5-crude        389 / 90.4%     45 / 82.9%     18 / 76.9%     3 / ??
    6-trade        369 / 80.9%     40 / 78.2%     21 / 76.4%     2 / 12.0%
    7-interest     347 / 79.9%     32 / 68.4%     17 / 55.3%     2 / 50.8%
    8-ship         197 / 85.5%     20 / 57.4%     11 / 53.9%     2 / ??
    9-wheat        212 / 92.5%     24 / 84.8%     11 / 65.7%     2 / 50.7%
    10-corn        182 / 93.0%     23 / 78.2%      9 / 60.3%     1 / 50.9%
    micro top10       - / 93.9%      - / 89.7%      - / 86.0%     - / 70.3%

Sample set 2:

    category      100% sz / BE    10% sz / BE    5% sz / BE     1% sz / BE
    0-acq         2876 / 98.3%    264 / 97.6%    139 / 97.3%    26 / 94.6%
    1-earn        1650 / 97.0%    184 / 94.1%     79 / 90.6%    16 / 73.5%
    3-money-fx     538 / 80.2%     62 / 72.9%     29 / 71.0%     5 / 49.9%
    4-grain        433 / 95.9%     40 / 85.8%     28 / 88.6%     7 / 76.5%
    5-crude        389 / 90.4%     35 / 81.6%     15 / 65.6%     2 / ??
    6-trade        369 / 80.9%     41 / 80.1%     22 / 72.7%     5 / 51.6%
    7-interest     347 / 79.9%     30 / 71.8%     15 / 63.0%     3 / 45.1%
    8-ship         197 / 85.5%     21 / 62.8%     10 / 54.5%     2 / 50.6%
    9-wheat        212 / 92.5%     19 / 80.1%     15 / 80.1%     5 / 68.3%
    10-corn        182 / 93.0%     16 / 76.6%     11 / 69.6%     4 / 43.4%
    micro top10       - / 93.9%      - / 89.6%      - / 86.4%     - / 75.5%

Reuters - Other Experiments
• Simple words vs. NLP-derived phrases
• NLP-derived phrases:
  - factoids (April_8, Salomon_Brothers_International)
  - multi-word dictionary entries (New_York, interest_rate)
  - noun phrases (first_quarter, modest_growth)
• No advantage for Find Similar, Naïve Bayes, or SVM
• Vary the number of features
• Binary vs. 0/1/2 features: no advantage of 0/1/2 for Decision Trees or SVM

Number of Features - NLP
[Plot: break-even point (0.75-0.95) vs. number of features (0-2000) for plain words, NLP phrases, and LF; series: Words-all, Words-10, Words+Phrases-all, Words+Phrases-10, Words+Phrases+LF-all, Words+Phrases+LF-10]

Reuters Summary
• Accurate classifiers can be learned automatically from training examples
• Linear SVMs are efficient and provide very good classification accuracy - the best results for this test collection
• Widely applicable, flexible, and adaptable representations

Text Classification Horizon
• Other applications: OHSUMED, TREC, spam vs. not-spam, the Web
• Text representation enhancements
• Use of hierarchical category structure
• UI for semi-automatic classification
• Dynamic interests

More Information
• General: http://research.microsoft.com/~sdumais
• SMO: http://research.microsoft.com/~jplatt

Optimization Problem to Train SVMs
• Let y_i = desired output of the ith example (+1/-1)
• Let u_i = actual output: u_i = w·x_i + b
• Margin = 1 / ||w||
• min (1/2)||w||^2, subject to y_i u_i >= 1 for all i

Dual Quadratic Programming Problem
• Equivalent dual problem - Lagrangian saddle point:
    min_{w,b} max_{α>=0}  (1/2)||w||^2 - Σ_j α_j [ y_j (w·x_j + b) - 1 ]
• One-to-one relationship between Lagrange multipliers α and examples
• Dual:
    min_α  (1/2) Σ_{n,m} y_n y_m (x_n·x_m) α_n α_m - Σ_n α_n
    subject to  α_n >= 0  and  Σ_n y_n α_n = 0
• Recover the classifier from the multipliers:
    w = Σ_n y_n α_n x_n,   b = y_m - w·x_m  for any m with α_m > 0
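For two linearly separable examples the dual problem can be solved in closed form: with y1 = +1 and y2 = -1, the constraint Σ y_n α_n = 0 forces α1 = α2 = α, and minimizing the dual objective gives α = 2 / ||x1 - x2||^2. A toy check of this, using the w·x + b convention; the example points are mine:

```python
def toy_svm_dual(x1, x2):
    """Hard-margin SVM for one positive example x1 (y = +1) and one
    negative example x2 (y = -1).  The dual constraint forces
    alpha_1 = alpha_2 = alpha, with alpha = 2 / ||x1 - x2||^2.
    Then w = sum_n y_n alpha_n x_n = alpha * (x1 - x2), and b is set
    from the support-vector condition w.x1 + b = +1."""
    diff = [a - b for a, b in zip(x1, x2)]
    alpha = 2 / sum(d * d for d in diff)
    w = [alpha * d for d in diff]
    b = 1 - sum(wi * xi for wi, xi in zip(w, x1))
    return w, b, alpha

w, b, alpha = toy_svm_dual([2.0, 0.0], [0.0, 0.0])
# Both training points are support vectors, lying exactly on the margins:
out_pos = sum(wi * xi for wi, xi in zip(w, [2.0, 0.0])) + b
out_neg = sum(wi * xi for wi, xi in zip(w, [0.0, 0.0])) + b
```

The separating hyperplane sits midway between the two points, and the geometric margin 1/||w|| equals half the distance between them, matching the maximum-margin picture on the earlier SVM slide. SMO generalizes this idea: it repeatedly solves such two-multiplier sub-problems analytically.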