Web Mining
Giuseppe Attardi [includes slides borrowed from C. Manning]

Types of Mining
Web Usage Mining
– Access logs, query logs, etc.
– For improving performance (caching)
– For understanding user needs or preferences
Web Content Mining
– For extracting information or knowledge

Content Mining
Web Search
Link Analysis
Categorization
Clustering
Knowledge Extraction
– Ontology
– Semantic Web
Question Answering

Text Categorization

Categorization
Problem: Given a universe of objects and a predefined set of classes, or categories, assign each object to its correct class.
Examples:
Problem                      Objects            Categories
Tagging                      words in context   POS tag
Word Sense Disambiguation    words in context   word sense
PP attachment                sentences          parse trees
Language identification      text               language
Text categorization          text               topic

Overview
Definition of Text Categorization
Techniques
– Naïve Bayes
– Decision Trees
– Maximum Entropy Modeling
– k-Nearest Neighbor Classification
– SVM

Text categorization
Classification (= Categorization)
– Task of assigning objects to classes or categories
Text categorization
– Task of classifying the topic or theme of a document

Statistical classification
Training set of objects
Data representation model
Model class
Training procedure
Evaluation

Training set of objects
A set of objects, each labeled by one or more classes
Example from Reuters:
<REUTERS TOPICS="YES" NEWID="2005">
<DATE> 5-MAR-1987 09:22:57.75</DATE>
<TOPICS><D>earn</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<TEXT>&#2;
<TITLE>NORD RESOURCES CORP &lt;NRD> 4TH QTR NET</TITLE>
<DATELINE> DAYTON, Ohio, March 5 - </DATELINE>
<BODY>Shr 19 cts vs 13 cts
Net 2,656,000 vs 1,712,000
Revs 15.4 mln vs 9,443,000
Avg shrs 14.1 mln vs 12.6 mln
Shr 98 cts vs 77 cts
Net 13.8 mln vs 8,928,000
Revs 58.8 mln vs 48.5 mln
Avg shrs 14.0 mln vs 11.6 mln
NOTE: Shr figures adjusted for 3-for-2 split paid Feb 6, 1987.
Reuter &#3;</BODY></TEXT>
</REUTERS>

Data Representation Model
The training set is encoded via a data representation model
Typically, each object in the training set is represented by a pair (x, c), where:
– x: a vector of measurements
– c: class label

Data Representation
For text categorization:
– use words that are frequent in “earnings” documents
– the 20 most representative words are: vs, mln, cts, loss, &, 000, profit, dlrs, pct, etc.
Each document j is represented as a vector x_j = (s_1j, …, s_kj), where
s_ij = round(10 · (1 + log tf_ij) / (1 + log l_j))  if tf_ij > 0, and s_ij = 0 otherwise
tf_ij is the number of occurrences of word i in document j, and l_j is the length of document j.
Example:
word    s_ij
vs      5
mln     5
cts     3
;       3
&       3
000     4
loss    0
‘       0
“       0
3       4
profit  0
dlrs    3
1       2
pct     0
is      0
s       0
that    0
net     3
lt      2
at      0

Model Class and Training Procedure
Model class
– A parameterized family of classifiers
– e.g. a model class for binary classification: g(x) = w·x + w0
• if g(x) > 0, choose class c1, else c2
Training procedure
– Algorithm to select one classifier from this family
– i.e., to select proper parameter values (e.g. w, w0)

Evaluation
Borrowing from IR, NLP systems are evaluated in terms of precision, recall, etc.
Example: for text categorization, given a set of documents of which a subset is in a particular category (say, “earnings”), the system classifies some other subset of the documents as belonging to the “earnings” category.
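To make the encoding above concrete, here is a minimal sketch of the s_ij weighting in Python. The list of feature words is an illustrative subset (the slides only name some of the 20), and natural logarithms are assumed since the slides do not specify the base.

```python
import math

FEATURE_WORDS = ["vs", "mln", "cts", "loss", "profit", "dlrs", "pct", "net"]  # illustrative subset

def encode(tokens):
    """Encode one tokenized document as the vector (s_1j, ..., s_kj) defined above."""
    l_j = len(tokens)                          # document length l_j
    vec = []
    for w in FEATURE_WORDS:
        tf = tokens.count(w)                   # occurrences of word i in document j
        if tf == 0:
            vec.append(0)
        else:
            vec.append(round(10 * (1 + math.log(tf)) / (1 + math.log(l_j))))
    return vec

# Example: a toy "earnings"-style snippet
doc = "shr 19 cts vs 13 cts net 2,656,000 vs 1,712,000 revs 15.4 mln vs 9,443,000".split()
print(encode(doc))
```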
The results of the system are compared with the actual results as follows:
                actually in category     actually not in category
Assigned        tp (true positive)       fp (false positive)
Not assigned    fn (false negative)      tn (true negative)

Evaluation measures
Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
Accuracy = (tp + tn) / (tp + tn + fp + fn)
Error = (fp + fn) / (tp + tn + fp + fn)

Evaluation of text categorization
macro-averaging
– Compute an evaluation measure for each category’s contingency table separately and average over categories
– gives equal weight to each category
– macro-averaged precision = (a1/(a1 + b1) + a2/(a2 + b2) + a3/(a3 + b3)) / 3
micro-averaging
– Make a single contingency table for all categories by summing the scores in each cell, then compute the evaluation measure for the whole table
– gives equal weight to each object
– micro-averaged precision = (a1 + a2 + a3) / ((a1 + a2 + a3) + (b1 + b2 + b3))
(ai and bi are the true-positive and false-positive counts for category i)

Classification Techniques
Naïve Bayes
Decision Trees
Maximum Entropy Modeling
Support Vector Machines
k-Nearest Neighbor

Bayesian Classifiers

Bayesian Methods
Learning and classification methods based on probability theory
Bayes’ theorem plays a critical role in probabilistic learning and classification
Build a generative model that approximates how data is produced
Uses the prior probability of each category given no information about an item
Categorization produces a posterior probability distribution over the possible categories given a description of an item

Notation
P(A): probability of an event A
P(A, B): probability of both events A and B
P(A | B): conditional probability of event A given that event B has occurred

Bayes’ Rule
P(C, X) = P(C | X) P(X) = P(X | C) P(C)
P(C | X) = P(X | C) P(C) / P(X)
– P(C): prior probability
– P(C | X): posterior probability

Maximum a posteriori Hypothesis
h_MAP = argmax_{h ∈ H} P(h | D)
      = argmax_{h ∈ H} P(D | h) P(h) / P(D)
      = argmax_{h ∈ H} P(D | h) P(h)

Maximum likelihood Hypothesis
If all hypotheses are a priori equally likely, we need only consider the P(D | h) term:
h_ML = argmax_{h ∈ H} P(D | h)

Naïve Bayes Classifiers
Task: classify a new instance based on a tuple of attribute values (x1, x2, …, xn)
c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xn)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj) / P(x1, x2, …, xn)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)

Naïve Bayes Classifier: Assumptions
P(cj)
– Can be estimated from the frequency of classes in the training examples
P(x1, x2, …, xn | cj)
– O(|X|^n · |C|) parameters
– Could only be estimated if a very, very large number of training examples was available
Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.

The Naïve Bayes Classifier
Example: class Flu, with attributes X1 runny nose, X2 sinus, X3 cough, X4 fever, X5 muscle-ache
Conditional Independence Assumption: features are independent of each other given the class:
P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)

Learning the Model
(Bayes net: class node C with children X1 … X6)
Common practice: maximum likelihood
– simply use the frequencies in the data
P̂(cj) = N(C = cj) / N
P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

Problem with Max Likelihood
P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)
What if we have seen no training cases where the patient had no flu but did have muscle aches?
P̂(X5 = t | C = nf) = N(X5 = t, C = nf) / N(C = nf) = 0
Zero probabilities cannot be conditioned away, no matter the other evidence!
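A tiny numeric illustration of this failure mode, and of the add-one smoothing introduced next (toy counts, not taken from the slides):

```python
# Toy counts for the flu example: class "no-flu" (nf) seen 10 times,
# never together with muscle-ache (X5 = t).
N_nf = 10
N_x5_and_nf = 0

p_mle = N_x5_and_nf / N_nf             # maximum-likelihood estimate = 0.0

# Any product containing this factor is 0, regardless of the other evidence:
other_evidence = 0.9 * 0.8 * 0.7
print(p_mle * other_evidence)           # 0.0

# Add-one smoothing (k = 2 values for the binary attribute X5) keeps it non-zero:
k = 2
p_smoothed = (N_x5_and_nf + 1) / (N_nf + k)
print(p_smoothed)                       # 0.0833...
```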
The class chosen is argmax_c P̂(c) ∏_i P̂(xi | c), so a single zero estimate forces the whole product to zero.

Smoothing to Avoid Overfitting
P̂(xi | cj) = (N(Xi = xi, C = cj) + 1) / (N(C = cj) + k)
– k: number of values of Xi
Somewhat more subtle version:
P̂(xi,k | cj) = (N(Xi = xi,k, C = cj) + m · p_i,k) / (N(C = cj) + m)
– p_i,k: overall fraction in the data where Xi = xi,k
– m: extent of “smoothing”

Text Classification Using Naïve Bayes: Basic method
Attributes are text positions, values are words
c_NB = argmax_{cj ∈ C} P(cj) ∏_i P(xi | cj)
     = argmax_{cj ∈ C} P(cj) P(x1 = "our" | cj) · … · P(xn = "text" | cj)
The Naive Bayes assumption is clearly violated.
– Example?
Still too many possibilities
Assume that classification is independent of the positions of the words
– Use the same parameters for each position

Text Classification Algorithms: Learning
From the training corpus, extract the Vocabulary
Calculate the required P(cj) and P(xk | cj) terms
– For each cj in C do
• docsj ← subset of documents for which the target class is cj
• P(cj) ← |docsj| / |total # documents|
• Textj ← single document containing all docsj
• n ← total number of word positions in Textj
• for each word xk in Vocabulary
– nk ← number of occurrences of xk in Textj
– P(xk | cj) ← (nk + 1) / (n + |Vocabulary|)

Text Classification Algorithms: Classifying
positions ← all word positions in the current document which contain tokens found in Vocabulary
Return c_NB, where
c_NB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj)

Naive Bayes Time Complexity
Training time: O(|D| Ld + |C||V|), where Ld is the average length of a document in D
– Assumes V and all Di, ni, and nij pre-computed in O(|D| Ld) time during one pass through all of the data
– Generally just O(|D| Ld), since usually |C||V| < |D| Ld
Test time: O(|C| Lt), where Lt is the average length of a test document
Very efficient overall, linearly proportional to the time needed just to read in all the data

Underflow Prevention
Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow
Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities
The class with the highest final un-normalized log probability score is still the most probable

Naïve Bayes Posterior Probabilities
Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate
However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not
– Output probabilities are generally very close to 0 or 1

Two Models
Model 1: Multivariate binomial
– One feature Xw for each word in the dictionary
– Xw = true in document d if w appears in d
– Naive Bayes assumption:
• Given the document’s topic, the appearance of one word in the document tells us nothing about the chances that another word appears

Two Models
Model 2: Multinomial
– One feature Xi for each word position in the document
• the feature’s values are all the words in the dictionary
– Value of Xi is the word in position i
– Naïve Bayes assumption:
• Given the document’s topic, the word in one position tells us nothing about the values of words in other positions
– Second assumption:
• word appearance does not depend on position: P(Xi = w | c) = P(Xj = w | c) for all positions i, j, words w and classes c

Parameter estimation
Binomial model: P̂(Xw = t | cj) = fraction of documents of topic cj in which word w appears
Multinomial model: P̂(Xi = w | cj) = fraction of times word w appears across all documents of topic cj
– create a mega-document for topic j by concatenating all documents in this topic
– use the frequency of w in the mega-document
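Putting the pieces above together, a minimal sketch of the multinomial model with add-one smoothing and log-space scoring (a toy Python illustration, not the system evaluated in these slides):

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, class) pairs. Returns priors, per-class word counts, totals, vocabulary."""
    priors, word_counts, totals = {}, defaultdict(Counter), Counter()
    vocab = set()
    n_docs = Counter(c for _, c in docs)
    for tokens, c in docs:
        word_counts[c].update(tokens)           # counts in the "mega-document" for class c
        totals[c] += len(tokens)
        vocab.update(tokens)
    for c in n_docs:
        priors[c] = n_docs[c] / len(docs)       # P(c_j)
    return priors, word_counts, totals, vocab

def classify(tokens, priors, word_counts, totals, vocab):
    """Return argmax_c of log P(c) + sum_i log P(x_i | c), with add-one smoothing."""
    best, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for w in tokens:
            if w in vocab:                      # only positions whose token is in Vocabulary
                p = (word_counts[c][w] + 1) / (totals[c] + len(vocab))
                score += math.log(p)
        if score > best_score:
            best, best_score = c, score
    return best

train_docs = [("profit rose vs cts net".split(), "earn"),
              ("shares fell on takeover talk".split(), "acq")]
model = train(train_docs)
print(classify("net profit cts".split(), *model))   # -> "earn"
```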
Feature selection via Mutual Information We might not want to use all words, but just reliable, good discriminators In training set, choose k words which best discriminate the categories. One way is in terms of Mutual Information: p(ew , ec ) I ( w , c ) p(ew , ec ) log p(ew ) p(ec ) e w { 0 ,1} ec { 0 ,1} – For each word w and each category c Feature selection via MI (2) For each category we build a list of k most discriminating terms For example (on 20 Newsgroups): – sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, … – rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, … Greedy: does not account for correlations between terms In general feature selection is necessary for binomial NB, but not for multinomial NB Evaluating Categorization Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances) Classification accuracy: c/n where n is the total number of test instances and c is the number of test instances correctly classified by the system Results can vary based on sampling error due to different training and test sets Average results over multiple training and test sets (splits of the overall data) for the best results Example: AutoYahoo! Classify 13,589 Yahoo! webpages in “Science” subtree into 95 different topics (hierarchy depth 2) Sample Learning Curve (Yahoo Science Data) Importance of Conditional Independence Assume a domain with 20 binary (true/false) attributes A1,…, A20, and two classes c1 and c2. Goal: for any case A=A1,…,A20 estimate P(A,ci). A) No independence assumptions: Computation of 221 parameters (one for each combination of values) ! The training database will not be so large! Huge Memory requirements / Processing time. Error Prone (small sample error). B) Strongest conditional independence assumptions (all attributes independent given the class) = Naive Bayes: P(A,ci)=P(A1,ci)P(A2,ci)…P(A20,ci) Computation of 20*22 = 80 parameters. Space and time efficient. Robust estimations. What if the conditional independence assumptions do not hold?? C) More relaxed independence assumptions Tradeoff between A) and B) Conditions for Optimality of Naïve Bayes Answer Fact Sometimes NB performs well even if the Conditional Independence assumptions are badly violated. Questions Assume two classes c1 and c2. A new case A arrives. NB will classify A to c1 if: P(A, c1) > P(A, c2) P(A,c1) P(A,c2) Class of A Actual Probability 0.1 0.01 c1 Estimated Probability by NB 0.08 0.07 c1 WHY? And WHEN? Hint Classification is about predicting the correct class label and NOT about accurately estimating probabilities. Despite the big error in estimating the probabilities the classification is still correct. Correct estimation accurate prediction but NOT accurate prediction Correct estimation Naïve Bayes is Not So Naïve Naïve Bayes: First and Second place in KDD-CUP 97 competition, among 16 (then) state of the art algorithms Goal: Financial services industry direct mail response prediction model. Predict if the recipient of mail will actually respond to the advertisement – 750,000 records. Robust to Irrelevant Features Irrelevant Features cancel each other without affecting results Instead Decision Trees & Nearest-Neighbor methods can heavily suffer from this. Very good in Domains with many equally important features Decision Trees suffer from fragmentation in such cases – especially if little data A good dependable baseline for text classification (but not the best)! 
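Returning to the mutual-information criterion defined above, a small sketch of how I(w, c) can be computed from document counts (toy counts; cells with zero count contribute zero to the sum):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(w, c) from document counts:
       n11 = docs containing w and in c,   n10 = docs containing w but not in c,
       n01 = docs in c without w,          n00 = docs with neither."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for (ew, ec), nij in {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}.items():
        if nij == 0:
            continue                             # 0 * log(0) is taken as 0
        p_joint = nij / n
        p_w = (n11 + n10) / n if ew else (n01 + n00) / n
        p_c = (n11 + n01) / n if ec else (n10 + n00) / n
        mi += p_joint * math.log(p_joint / (p_w * p_c))
    return mi

# e.g. a term like "circuit" vs. sci.electronics: frequent inside the class, rare outside (toy counts)
print(mutual_information(n11=49, n10=27, n01=141, n00=774))
```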
Optimal if the Independence Assumptions hold: – If assumed independence is correct, then it is the Bayes Optimal Classifier for problem Very Fast: – Learning with one pass over the data; testing linear in the number of attributes, and document collection size Low Storage requirements Handles Missing Values Naïve Bayes Drawbacks Doesn’t do higher order interactions Typical example: Chess end games – Each move completely changes the context for the next move – C4.5 99.5% accuracy; NB 87% accuracy. What if you have BOTH high order interactions AND few training data? Doesn’t model features that do not equally contribute to distinguishing the classes – If few features ONLY mostly determine the class, additional features usually decrease the accuracy – Because NB gives same weight to all features Decision Trees Decision Trees Example: decision whether to assign documents to the category "earning" node 1 7681 articles P(c|n1) = 0.300 split: cts value: 2 cts < 2 cts 2 node 2 5977 articles P(c|n2) = 0.116 split: net value: 1 net < 1 node 3 5436 articles P(c|n3) = 0.050 node 5 1704articles P(c|n5) = 0.943 split: vs value: 2 net 1 vs < 2 node 4 541 articles P(c|n4) = 0.649 node 6 201 articles P(c|n6) = 0.694 vs 2 node 7 1403 articles P(c|n7) = 0.996 Decision Trees - Training procedure (1) Growing a tree with training data – splitting criterion • for finding the feature and its value on which to split • e.g. maximum information gain – stopping criterion • determines when to stop splitting • e.g. all elements at node have same category Pruning it back to reasonable size – to avoid overfitting the training set e.g. ‘dlrs’ and ‘pct’ in just one document – to optimize performance Maximum Information Gain Information gain H(t) – H(t|a) = H(t) – (pLH(tL) + pRH(tR)) where: a is attribute we split on t is distribution of the node we split pL and pR are the percent of nodes passed on to left and right nodes tL and tR are the distributions of left and right nodes Choose attribute which maximizes IG Example: H(n1) = - 0.3 log(0.3) - 0.7 log(0.7) = 0.881 H(n2) = 0.518 H(n5) = 0.315 H(n1) – H(n1| ‘cts’) = 0.881 – (5977/7681) · 0.518 – (1704/7681) · 0.315 = 0.408 Decision Trees – Pruning (1) At each step, drop a node considered least helpful Find best tree using validation on validation set – validation set: portion of training data held out from training Find best tree using cross-validation – I. Determine the optimal tree size • 1. Divide training data into N partitions • 2. Grow using N-1 partitions, and prune using held-out partition • 3. Repeat 2. N times • 4. Determine average pruned tree size as optimal tree size – II. 
Training using total training data, and pruning back to optimal size Decision Trees - Pruning (2) Effect of pruning on accuracy Optimal performance on test set pruning 951 nodes Decision Trees Summary Useful for non-trivial classification tasks (for simple problems, use simpler methods) Tend to split the training set into smaller and smaller subsets: – may lead to poor generalizations – not enough data for reliable prediction – accidental regularities Volatile: very different model from slightly different data Can be interpreted easily – easy to trace the path – easy to debug one’s code – easy to understand a new domain Maximum Entropy Modeling Maximum Entropy Modeling Maximum Entropy Modeling – The model with maximum entropy of all the models that satisfy the constraints – desire to preserve as much uncertainty as possible Model class: log linear model Training procedure: generalized iterative scaling MaxEntropy: example data Features Outcome Sunny, Happy Outdoor Sunny, Happy, Dry Outdoor Sunny, Happy, Humid Outdoor Sunny, Sad, Dry Outdoor Sunny, Sad, Humid Outdoor Cloudy, Happy, Humid Outdoor Cloudy, Happy, Humid Outdoor Cloudy, Sad, Humid Outdoor Cloudy, Sad, Humid Outdoor Rainy, Happy, Humid Indoor Rainy, Happy, Dry Indoor Rainy, Sad, Dry Indoor Rainy, Sad, Humid Indoor Cloudy, Sad, Humid Indoor Cloudy, Sad, Humid Indoor MaxEnt: example predictions Context Cloudy, Happy, Humid Outdoor 0.771 Indoor 0.228 Rainy, Sad, Humid 0.001 0.998 Maximum Entropy Modeling (2) word Model class: loglinear model 1 K ( x ,c ) f p( x , c ) i i Z i 1 1 if sij 0 and c 1 fi ( x j , c) otherwise 0 – i : weight for i-th feature – Z : normalizing constant Class of new document – compute p(x, 0), p(x, 1) – choose the class label with the greater probability log(i) vs 0.613 mln -0.110 cts 1.298 ; -0.432 & -0.429 000 -0413 loss -0.332 ‘ -0.085 “ 0.202 3 -0.463 profit 0.360 dlrs -0.202 1 -0.211 pct -0.260 is -0.546 s -0.490 that -0.285 net -0.300 lt 1.016 at -0.465 fK+1 0.009 Maximum Entropy Modeling (3) Training procedure: generalized interative scaling – Expected value of p: E f p i p x , c x ,c f i ( x, c) – Epfi maximum entropy distribution p*: Ep*fi = Epfi Algorithm 1. initialize (1). compute Epfi. n=1 2. compute p(n)(x, c) for each training data 3. compute Ep(n)fi 4. update (n+1) 5. if converged, stop. otherwise n=n+1, goto 2 Maximum Entropy Modeling (4) Define (K+1)th feature: for the constraint that sum of fi is equal to C K N f K 1 ( x , c ) C f i ( x , c ) C max x ,c i 1 i 1 f i ( x, c) Expected value of p is defined as E p f i p x , c f i ( x , c ) x ,c Expected value for the empirical distribution is computed as 1 E ~p f i ~p x , c f i ( x , c ) N x ,c N j 1 fi ( x j , c) Expected value of p is approximately computed as 1 E p fi N p(c | x j ) f i ( x j , c) N j 1 c GIS Algorithm (full) 1. Initialize {ai(1)}. 
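As a rough sketch of how the trained log-linear model above is then used to classify a document, using the log(α_i) weights from the table for the “earnings” class (the small correction feature f_K+1 and the normalizer Z are dropped here, since only the comparison between the two classes matters; this is an illustration, not the exact system):

```python
# log(alpha_i) weights taken from the table above (subset shown)
WEIGHTS = {"vs": 0.613, "mln": -0.110, "cts": 1.298, "loss": -0.332,
           "profit": 0.360, "dlrs": -0.202, "pct": -0.260, "net": -0.300, "lt": 1.016}

def score(present_words):
    """Sum of log(alpha_i) over features that fire, i.e. words with s_ij > 0 and c = 1.
    For c = 0 no feature fires, so its score is 0; the correction feature (log weight
    0.009 in the table) is ignored for simplicity."""
    return sum(WEIGHTS[w] for w in present_words if w in WEIGHTS)

words = {"vs", "cts", "net", "lt"}          # feature words present in a document
print("earnings" if score(words) > 0 else "not earnings", score(words))
```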
Maximum Entropy Modeling (6) word Application to text categorization – trained on 9603 articles, 500 iteration – test result: 88.6% accuracy log(i) vs 0.613 mln -0.110 cts 1.298 ; -0.432 & -0.429 000 -0413 loss -0.332 ‘ -0.085 “ 0.202 3 -0.463 profit 0.360 dlrs -0.202 1 -0.211 pct -0.260 is -0.546 s -0.490 that -0.285 net -0.300 lt 1.016 at -0.465 fK+1 0.009 Maximum Entropy Modeling (7) Shortcoming of MEM – restricted to binary feature • low performance in some situations – computationally expensive: slow convergence – the lack of smoothing can cause problem Strength of MEM – can specify all possible relevant information • complex features can be defined – can use heterogeneous features and weighting feature – an integrated framework for feature selection & classification • a very large number of features could be down to a manageable size during training procedure Vector Space Classifiers Vector Space Representation Each document is a vector, one component for each term (= word) Normalize to unit length Properties of vector space – terms are axes – n docs live in this space – even with stemming, may have 10,000+ dimensions, or even 1,000,000+ Classification Using Vector Spaces Each training doc a point (vector) labeled by its class Similarity hypothesis: docs of the same class form a contiguous region of space. Or: Similar documents are usually in the same class. Define surfaces to delineate classes in space Classes in a Vector Space Similarity hypothesis true in general? Government Science Arts Given a Test Document Figure out which region it lies in Assign corresponding class Test Document = Government Government Science Arts k-Nearest Neighbor k-Nearest Neighbor Classification To classify document d into class c Define k-neighborhood N as k nearest neighbors of d Count number of documents l in N that belong to c Estimate P(c|d) as l/k Example: k=6 (6NN) P(science| )? Government Science Arts kNN Learning Algorithm Learning is just storing the representations of the training examples in D Testing instance x: – Compute similarity between x and all examples in D – Assign x the category of the most similar example in D Does not explicitly compute a generalization or category prototypes Also called: – Case-based learning – Memory-based learning – Lazy learning kNN Is Close to Optimal Cover and Hart 1967 Asymptotically, the error rate of 1-nearestneighbor classification is less than twice the Bayes rate [error rate of classifier knowing model that generated data] In particular, asymptotic error rate is 0 if Bayes rate is 0 Assume: query point coincides with a training point Both query point and training point contribute error → 2 times Bayes rate kNN with Inverted Index Naively finding nearest neighbors requires a linear search through |D| documents in collection But determining k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents Use standard vector space inverted index methods to find the k nearest neighbors Testing Time: O(B|Vt|) where B is the average number of training documents in which a test-document word appears – Typically B << |D| kNN: Discussion No feature selection necessary Scales well with large number of classes – Don’t need to train n classifiers for n classes Classes can influence each other – Small changes to one class can have ripple effect Scores can be hard to convert to probabilities No training necessary – Actually: perhaps not true. (Data editing, etc.) kNN vs. 
Naive Bayes Bias/Variance tradeoff – Variance ≈ Capacity kNN has high variance and low bias. – Infinite memory NB has low variance and high bias. – Decision surface has to be linear (hyperplane – see later) Consider: Is an object a tree? (Burges) – Too much capacity/variance, low bias • Botanist who memorizes • Will always say “no” to new object (e.g., # leaves) – Not enough capacity/variance, high bias • Lazy botanist • Says “yes” if the object is green – Want the middle ground kNN vs. Naïve Bayes Bias/Variance tradeoff Variance ≈ Capacity kNN has high variance and low bias Regression has low variance and high bias Consider: Is an object a tree? (Burges) Too much capacity/variance, low bias – Botanist who memorizes – Will always say “no” to new object (e.g., #leaves) Not enough capacity/variance, high bias – Lazy botanist – Says “yes” if the object is green kNN: Discussion Classification time linear in training set No feature selection necessary Scales well with large number of classes – Don’t need to train n classifiers for n classes Classes can influence each other – Small changes to one class can have ripple effect Scores can be hard to convert to probabilities No training necessary – Actually: not true. Why? Binary Classification Consider 2 class problems How do we define (and find) the separating surface? How do we test which region a test doc is in? Separation by Hyperplanes Assume linear separability for now: – in 2 dimensions, can separate by a line – in higher dimensions, need hyperplanes Can find separating hyperplane by linear programming (e.g. perceptron): – separator can be expressed as ax + by = c Linear Programming / Perceptron Find a, b, c, such that ax + by c for red points ax + by c for blue points Relationship to Naïve Bayes? Find a, b, c, such that ax + by c for red points ax + by c for blue points Linear Classifiers Many common text classifiers are linear classifiers Despite this similarity, large performance differences – For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose? – What to do for non-separable problems? Which Hyperplane? In general, lots of possible solutions for a, b, c Which Hyperplane? Lots of possible solutions for a, b, c. Some methods find a separating hyperplane, but not the optimal one (e.g., perceptron) Most methods find an optimal separating hyperplane Which points should influence optimality? – All points • Linear regression • Naïve Bayes – Only “difficult points” close to decision boundary • Support vector machines • Logistic regression (kind of) Hyperplane: Example Class: “interest” (as in interest rate) Example features of a linear classifier (SVM) wi ti 0.70 prime 0.67 rate 0.63 interest 0.60 rates 0.46 discount 0.43 bundesbank wi ti -0.71 dlrs -0.35 world -0.33 sees -0.25 year -0.24 group -0.24 dlr More Than Two Classes One-of classification: each document belongs to exactly one class – How do we compose separating surfaces into regions? Any-of or multiclass classification – For n classes, decompose into n binary problems Vector space classifiers for one-of classification – Use a set of binary classifiers – Centroid classification – K Nearest Neighbor classification Composing Surfaces: Issues ? ? ? 
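Going back to “Separation by Hyperplanes” above, a minimal perceptron sketch that searches for a, b, c with ax + by ≥ c for one class and ax + by ≤ c for the other (toy data, assumed linearly separable):

```python
def perceptron(points, epochs=100):
    """points: list of ((x, y), label) with label +1 (red) or -1 (blue).
    Learns a, b, c such that sign(a*x + b*y - c) matches the label."""
    a = b = c = 0.0
    for _ in range(epochs):
        updated = False
        for (x, y), label in points:
            if label * (a * x + b * y - c) <= 0:    # misclassified (or on the boundary)
                a += label * x
                b += label * y
                c -= label
                updated = True
        if not updated:                              # converged: all points separated
            break
    return a, b, c

data = [((2.0, 3.0), +1), ((3.0, 4.0), +1), ((0.0, 0.5), -1), ((1.0, 0.0), -1)]
print(perceptron(data))
```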
Set of Binary Classifiers: Any of Build a separator between each class and its complementary set (docs from all other classes) Given test doc, evaluate it for membership in each class Apply decision criterion of classifiers independently Done Set of Binary Classifiers: One of Build a separator between each class and its complementary set (docs from all other classes) Given test doc, evaluate it for membership in each class Assign document to class with: – maximum score – maximum confidence – maximum probability Why different from multiclass/any of classification? Negative Examples Formulate as above, except negative examples for a class are added to its complementary set. Positive examples Negative examples Centroid Classification Given training docs for a class, compute their centroid Now have a centroid for each class Given query doc, assign to class whose centroid is nearest Example Government Science Arts Support Vector Machines Which Hyperplane? In general, lots of possible solutions for a, b, c Support Vector Machine (SVM) finds an optimal solution Support Vector Machine (SVM) SVMs maximize the margin around the separating hyperplane: Support vectors – a.k.a. large margin classifiers The decision function is fully specified by a subset of training samples, the support vectors Quadratic programming problem State of the art text classification method Maximize margin Geometric Margin wT x b r w Distance from example to the separator is Examples closest to the hyperplane are support vectors Margin ρ of the separator is the width of separation between support vectors of classes ρ r Maximum Margin: Formalization w: hyperplane normal xi: data point i yi: class of data point i (+1 or -1) Constraint optimization formalization: xi · w + b ≥ +1 for yi = +1 (2) xi · w + b ≤ -1 for yi = -1 (3) maximize margin: 2/||w|| (1) Quadratic Programming One can show that hyperplane w with maximum margin is: w i yi x i i i: Lagrange multipliers xi: data point i yi: class of data point i (+1 or -1) where the i are the solution to maximizing: LD i 12 i j yi y j xi x j i i Most i will be zero Building an SVM Classifier Now we know how to build a separator for two linearly separable classes What about classes whose exemplary docs are not linearly separable? Not Linearly Separable Find a line that penalizes points on “the wrong side” Penalizing Bad Points Negative for bad points. Define distance for each point with respect to separator ax + by = c: (ax + by) - c for red points c - (ax + by) for blue points. Solve Quadratic Program Solution gives “separator” between two classes: choice of a,b Given a new point (x,y), can score its proximity to each class: – evaluate ax+by – Set confidence threshold 3 5 7 Non-linear SVMs Datasets that are linearly separable (with some noise) work out great: x 0 But what are we going to do if the dataset is just too hard? x 0 How about … mapping data to a higher-dimensional space: x2 0 x Non-linear SVMs: Feature spaces General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x) The “Kernel Trick” The linear classifier relies on an inner product between vectors K(xi,xj) = xiTxj If every datapoint is mapped into high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes: K(xi, xj) = φ(xi)Tφ(xj) A kernel function is some function that corresponds to an inner product in some expanded feature space. 
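The algebra just below shows this identity for the quadratic kernel; here is a quick numeric check of the same fact for 2-dimensional inputs (illustrative only):

```python
import math

def k_quadratic(x, z):
    """K(x, z) = (1 + x.z)^2 computed directly in the original 2-D space."""
    dot = x[0] * z[0] + x[1] * z[1]
    return (1 + dot) ** 2

def phi(x):
    """Explicit feature map phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    r2 = math.sqrt(2)
    return [1, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1]]

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = k_quadratic(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
print(lhs, rhs)        # both ~4.0: (1 + (3 - 2))^2 = 4 = phi(x).phi(z)
```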
Example: 2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiTxj)2 Need to show that K(xi,xj) = φ(xi)Tφ(xj): K(xi,xj) = (1 + xiTxj)2 = = 1+ xi12xj12 + 2 xi1xj1 xi2xj2+ xi22xj22 + 2xi1xj1 + 2xi2xj2 = = [1 xi12 √2 xi1xi2 xi22 √2xi1 √2xi2]T [1 xj12 √2 xj1xj2 xj22 √2xj1 √2xj2] = φ(xi) Tφ(xj) where φ(x) = [1 x12 √2 x1x2 x22 √2x1 √2x2] Kernels Why use kernels? – Make non-separable problem separable – Map data into better representational space Common kernels – Linear – Polynomial K(x,z) = (1+xTz)d – Radial basis function (infinite dimensional space) Evaluation: Classic Reuters Data Set Most (over)used data set 21578 documents 9603 training, 3299 test articles (ModApte split) 118 categories – An article can be in more than one category – Learn 118 binary category distinctions Average document: about 90 types, 200 tokens Average number of classes assigned – 1.24 for docs with at least one category Only about 10 out of 118 categories are large Common categories (#train, #test) • Earn (2877, 1087) • Acquisitions (1650, 179) • Money-fx (538, 179) • Grain (433, 149) • Crude (389, 189) • • • • • Trade (369,119) Interest (347, 131) Ship (197, 89) Wheat (212, 71) Corn (182, 56) Performance of SVM SVM are seen as best-performing method by many Statistical significance of most results not clear There are many methods that perform about as well as SVM Example: regularized regression (Zhang&Oles) Example of a comparison study: Yang&Liu Dumais et al. 1998: Reuters - Accuracy earn acq money-fx grain crude trade interest ship wheat corn Avg Top 10 Avg All Cat Rocchio NBayes Trees LinearSVM 92.9% 95.9% 97.8% 98.2% 64.7% 87.8% 89.7% 92.8% 46.7% 56.6% 66.2% 74.0% 67.5% 78.8% 85.0% 92.4% 70.1% 79.5% 85.0% 88.3% 65.1% 63.9% 72.5% 73.5% 63.4% 64.9% 67.1% 76.3% 49.2% 85.4% 74.2% 78.0% 68.9% 69.7% 92.5% 89.7% 48.2% 65.3% 91.8% 91.1% 64.6% 61.7% 81.5% 75.2% na 88.4% 91.4% 86.4% Yang&Liu: SVM vs Other Methods Results for Kernels (Joachims) Confusion matrix Entry (i, j) means 53 of the docs actually in class i were put in class j by the classifier Actual Class Class assigned by classifier 53 In a perfect classification, only the diagonal has non-zero entries SVM Summary Choose hyperplane based on support vectors – Support vector = “critical” point close to decision boundary (Degree-1) SVMs are linear classifiers Kernels: powerful and elegant way to define similarity metric Perhaps best performing text classifier – But there are other methods that perform about as well as SVM, such as regularized logistic regression (Zhang & Oles 2001) Partly popular due to availability of SVMlight – SVMlight is accurate and fast – and free (for research) Now lots of software: libsvm, TinySVM, …. Comparative evaluation of methods Real world: exploit domain specific structure! Resources Manning and Schütze. Foundations of Statistical Natural Language Processing. Chapter 16. MIT Press. Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition, 1998. S. T. Dumais, Using SVMs for text categorization, IEEE Intelligent Systems, 13(4), Jul/Aug 1998. S. T. Dumais, J. Platt, D. Heckerman and M. Sahami. 1998. Inductive learning algorithms and representations for text categorization. Proceedings of CIKM ’98, pp. 148-155. A re-examination of text categorization methods (1999) Yiming Yang, Xin Liu 22nd Annual International SIGIR. Tong Zhang, Frank J. Oles: Text Categorization Based on Regularized Linear Classification Methods. 
Information Retrieval 4(1): 5-31 (2001) Trevor Hastie, Robert Tibshirani and Jerome Friedman, "Elements of Statistical Learning: Data Mining, Inference and Prediction" SpringerVerlag, New York. T. Joachims, Learning to Classify Text using Support Vector Machines. Kluwer, 2002 Tong Zhang, Frank J. Oles: Text Categorization Based on Regularized Linear Classification Methods. Information Retrieval 4(1): 5-31 (2001) Question Answering Question Answering from Open-Domain Text Search Engines (IR) return list of (possibly) relevant documents Users still to have to dig through returned list to find answer QA: give the user a (short) answer to their question, perhaps supported by evidence People do ask questions Examples from AltaVista query log – – – – who invented surf music? how to make stink bombs where are the snowdens of yesteryear? which english translation of the bible is used in official catholic liturgies? – how to do clayart – how to copy psx – how tall is the sears tower? Examples from Excite query log (12/1999) – – – – – how can i find someone in texas where can i find information on puritan religion? what are the 7 wonders of the world how can i eliminate stress What vacuum cleaner does Consumers Guide recommend Around 12–15% of query logs The Google answer #1 Include question words (why, who, etc.) in stop-list Do standard IR Sometimes this (sort of) works: – Question: Who was the prime minister of Australia during the Great Depression? – Answer: James Scullin (Labor) 1929–31 Page about Curtin (WW II Labor Prime Minister) (Can deduce answer) Page about Curtin (WW II Labor Prime Minister) (Lacks answer) Page about Chifley (Labor Prime Minister) (Can deduce answer) But often it doesn’t… Question: How much money did IBM spend on advertising in 2002? Answer: I dunno, but I’d like to … The Google answer #2 Take the question and try to find it as a string on the web Return the next sentence on that web page as the answer Works brilliantly if this exact question appears as a FAQ question, etc. Works lousily most of the time Reminiscent of the line about monkeys and typewriters producing Shakespeare But a slightly more sophisticated version of this approach has been revived in recent years with considerable success… AskJeeves AskJeeves was the most hyped example of “Question answering” – Have basically given up now: just web search except when there are factoid answers of the sort MSN also does It largely did pattern matching to match your question to their own knowledge base of questions If that works, you get the human-curated answers to that known question If that fails, it falls back to regular web search A potentially interesting middle ground, but a fairly weak shadow of real QA Online QA Examples LCC: http://www.languagecomputer.com/demos/question_answering/i ndex.html AnswerBus is an open-domain question answering system: www.answerbus.com Ionaut: http://www.ionaut.com:8400/ EasyAsk, AnswerLogic, AnswerFriend, Start, Quasm, Mulder, Webclopedia, etc. Question Answering at TREC Question answering competition at TREC Until 2004, consisted of answering a set of 500 fact-based questions, e.g. “When was Mozart born?”. For the first three years systems were allowed to return 5 ranked answer snippets (50/250 bytes) to each question. 
– IR think – Mean Reciprocal Rank (MRR) scoring: • 1, 0.5, 0.33, 0.25, 0.2, 0 for 1, 2, 3, 4, 5, 6+ doc – Mainly Named Entity answers (person, place, date, …) From 2002 the systems are only allowed to return a single exact answer and the notion of confidence has been introduced The TREC Document Collection The retrieval collection uses news articles from the following sources: • AP newswire, 1998-2000 • New York Times newswire, 1998-2000 • Xinhua News Agency newswire, 1996-2000 In total there are 1,033,461 documents in the collection: 3GB of text Too much to handle entirely using NLP techniques so the systems usually consist of an initial information retrieval phase followed by more advanced processing Many supplement this text with use of the web, and other knowledge bases Sample TREC questions 1. Who is the author of the book, “The Iron Lady: A Biography of Margaret Thatcher”? 2. What was the monetary value of the Nobel Peace Prize in 1989? 3. What does the Peugeot company manufacture? 4. How much did Mercury spend on advertising in 1993? 5. What is the name of the managing director of Apricot Computer? 6. Why did David Koresh ask the FBI for a word processor? 7. What debts did Qintex group leave? 8. What is the name of the rare neurological disease with symptoms such as: involuntary movements (tics), swearing, and incoherent vocalizations (grunts, shouts, etc.)? Top Performing Systems Currently the best performing systems at TREC can answer approximately 60-80% of the questions Approaches and successes have varied a fair deal – Knowledge-rich approaches, using a vast array of NLP techniques • Notably Harabagiu, Moldovan et al. – SMU/UTD/LCC – AskMSR system stressed how much could be achieved by very simple methods with enough text – Middle ground is to use a large collection of surface matching patterns (Insight Software) TREC 2000 Q&A track 693 fact-based, short answer questions – either short (50 B) or long (250 B) answer ~3 GB newspaper/newswire text (AP, WSJ, SJMN, FT, LAT, FBIS) Score: MRR (penalizes second answer) Questions: 186 (Encarta), 314 (seeds from Excite logs), 193 (syntactic variants of 54 originals) Sample Questions Q: Who shot President Abraham Lincoln? A: John Wilkes Booth Q: How many lives were lost in the Pan Am crash in Lockerbie? A: 270 Q: How long does it take to travel from London to Paris through the Channel? A: three hours 45 minutes Q: Which Atlantic hurricane had the highest recorded wind speed? A: Gilbert (200 mph) Q: Which country has the largest part of the rain forest? 
A: Brazil (60%) Question Types Class 1 Answer: single datum or list of items C: who, when, where, how (old, much, large) Class 2 A: multi-sentence C: extract from multiple sentences Class 3 A: across several texts C: comparative/contrastive Class 4 A: an analysis of retrieved information C: synthesized coherently from several retrieved fragments Class 5 A: result of reasoning C: word/domain knowledge and common sense reasoning Question subtypes Class 1.A About subjects, objects, manner, time or location Classs 1.B About properties or attributes Class 1.C Taxonomic nature TREC 2000 Results Knowledge Rich 0.8 Approach: 0.7 Falcon 0.6 0.5 0.4 MRR 0.3 0.2 0.1 a Pi s IC N TT M SI LI IB M o W at er lo en s Q ue SM U 0 Falcon: Architecture Question Collins Parser + NE Extraction Paragraph Index Question Semantic Form Collins Parser + NE Extraction Answer Semantic Form Paragraph filtering Question Taxonomy Answer Logical Form Expected Answer Type Answer Paragraphs WordNet Coreference Resolution Question Expansion Abduction Filter Question Logical Form Answer Question Processing Paragraph Processing Answer Processing Question parse S VP S VP PP NP WP VBD DT JJ NNP NP NP TO VB IN NN Who was the first Russian astronaut to walk in space Question semantic form first Russian astronaut Answer type PERSON walk space Question logic form: first(x) astronaut(x) Russian(x) space(z) walk(y, z, x) PERSON(x) Expected Answer Type WordNet QUANTITY dimension size Argentina Question: What is the size of Argentina? Questions about definitions Special patterns: – What {is|are} …? – What is the definition of …? – Who {is|was|are|were} …? Answer patterns: – …{is|are} – …, {a|an|the} –…- Question Taxonomy Question Location Country Reason City Product Province Nationality Continent Manner Number Speed Currency Degree Language Dimension Mammal Rate Reptile Duration Game Percentage Organization Count Question expansion Morphological variants – invented inventor Lexical variants – killer assassin – far distance Semantic variants – like prefer Indexing for Q/A Alternatives: – IR techniques – Parse texts and derive conceptual indexes Falcon uses paragraph indexing: – Vector-Space plus proximity – Returns weights used for abduction Abduction to justify answers Backchaining proofs from questions Axioms: – Logical form of answer – World knowledge (WordNet) – Coreference resolution in answer text Example – Q: When was the internal combustion engine invented? 
– A: The first internal-combustion engine was built in 1867 – invent → create_mentally → create → build Effectiveness: – 14% improvement – Filters 121 erroneous answers (of 692) – Takes 60% of question processing time TREC 2001 Results 0.7 Surface Approach: Insight Software 0.6 0.5 0.4 0.3 MRR 0.2 0.1 sa Pi SR M IB M ng o ap or e Si er lo at IS I W e LC C O ra cl In si g ht 0 TREC 2001: no NLP Best system from Insight Software using surface patterns AskMSR uses a Web Mining approach, by retrieving suggestions from Web searches Insight Sofware: Surface patterns approach Best at TREC 2001: 0.68 MRR Use of Characteristic Phrases “When was <person> born” – Typical answers • “Mozart was born in 1756.” • “Gandhi (1869-1948)...” – Suggests phrases (regular expressions) like • “<NAME> was born in <BIRTHDATE>” • “<NAME> ( <BIRTHDATE>-” – Use of Regular Expressions can help locate correct answer Use Pattern Learning Example: • “The great composer Mozart (1756-1791) achieved fame at a young age” • “Mozart (1756-1791) was a genius” • “The whole world would always be indebted to the great music of Mozart (1756-1791)” – Longest matching substring for all 3 sentences is "Mozart (1756-1791)” – Suffix tree would extract "Mozart (1756-1791)" as an output, with score of 3 Reminiscent of Information Extraction pattern learning Pattern Learning (cont.) Repeat with different examples of same question type – “Gandhi 1869”, “Newton 1642”, etc. Some patterns learned for BIRTHDATE – a. born in <ANSWER>, <NAME> – b. <NAME> was born on <ANSWER> , – c. <NAME> ( <ANSWER> – d. <NAME> ( <ANSWER> - ) Experiments 6 different Q types – from Webclopedia QA Typology (Hovy et al., 2002a) • • • • • • BIRTHDATE LOCATION INVENTOR DISCOVERER DEFINITION WHY-FAMOUS Experiments: pattern precision BIRTHDATE table: • • • • • • • 1.0 0.85 0.6 0.59 0.53 0.50 0.36 <NAME> ( <ANSWER> - ) <NAME> was born on <ANSWER>, <NAME> was born in <ANSWER> <NAME> was born <ANSWER> <ANSWER> <NAME> was born - <NAME> ( <ANSWER> <NAME> ( <ANSWER> - DEFINITION • 1.0 <NAME> and related <ANSWER> • 1.0 form of <ANSWER>, <NAME> • 0.94 as <NAME>, <ANSWER> and Shortcomings & Extensions Need for POS &/or semantic types • "Where are the Rocky Mountains?” • "Denver's new airport, topped with white fiberglass cones in imitation of the Rocky Mountains in the background , continues to lie empty” • <NAME> in <ANSWER> Named Entity tagger &/or ontology could enable system to determine “background” is not a location name Shortcomings ... (cont.) Long distance dependencies – “Where is London?” – “London, which has one of the most busiest airports in the world, lies on the banks of the river Thames” – would require pattern like: <QUESTION>, (<any_word>)*, lies on <ANSWER> Abundance & variety of Web data helps system to find an instance of patterns w/o losing answers to long distance dependencies Shortcomings... (cont.) System currently has only one anchor word – Doesn't work for Q types requiring multiple words from question to be in answer • “In which county does the city of Long Beach lie?” • “Long Beach is situated in Los Angeles County” • required pattern: <Q_TERM_1> is situated in <ANSWER> <Q_TERM_2> Did not use case • “What is a micron?” • “...a spokesman for Micron, a maker of semiconductors, said SIMMs are...” If Micron had been capitalized in question, would be a perfect answer AskMSR: Web Mining Web Question Answering: Is More Always Better? 
– Dumais, Banko, Brill, Lin, Ng (Microsoft, MIT, Berkeley) Q: “Where is the Louvre located?” Want “Paris” or “France” or “75058 Paris Cedex 01” or a map Don’t just want URLs AskMSR: Details 1 2 3 5 4 Step 1: Rewrite queries Intuition: The user’s question is often syntactically quite close to sentences that contain the answer – Where is the Louvre Museum located? – The Louvre Museum is located in Paris – Who created the character of Scrooge? – Charles Dickens created the character of Scrooge. Query rewriting – – – Classify question into seven categories Who is/was/are/were…? When is/did/will/are/were …? Where is/are/were …? a. Category-specific transformation rules Nonsense, eg “For Where questions, move ‘is’ to all possible but who locations” cares? It’s only a few “Where is the Louvre Museum located” more queries “is the Louvre Museum located” to Google. “the is Louvre Museum located” “the Louvre is Museum located” “the Louvre Museum is located” “the Louvre Museum located is” b. Expected answer “Datatype” (eg, Date, Person, Location, …) When was the French Revolution? DATE Hand-crafted classification/rewrite/datatype rules Step 2: Query search engine Send all rewrites to a Web search engine Retrieve top N answers For speed, rely just on search engine’s “snippets”, not the full text of the actual document Step 3: Mining N-Grams Simple: Enumerate all N-grams (N=1,2,3 say) in all retrieved snippets – Use hash table and other fancy footwork to make this efficient Weight of an N-gram: occurrence count, each weighted by “reliability” (weight) of rewrite that fetched the document Example: “Who created the character of Scrooge?” – – – – – – – – Dickens - 117 Christmas Carol - 78 Charles Dickens - 75 Disney - 72 Carl Banks - 54 A Christmas - 41 Christmas Carol - 45 Uncle - 31 Step 4: Filtering N-Grams Each question type is associated with one or more “data-type filters” = regular expressions When… Date Where… Location What … Person Who … Boost score of N-grams that do match regexp Lower score of N-grams that don’t match regexp Details omitted from paper…. Step 5: Tiling the Answers Scores 20 Charles Dickens Dickens 15 10 merged, discard old n-grams Mr Charles Score 45 Mr Charles Dickens tile highest-scoring n-gram N-Grams N-Grams Repeat, until no more overlap AskMSR Results Standard TREC contest test-bed: ~1M documents; 900 questions Technique doesn’t do too well (though would have placed in top 9 of ~30 participants!) – MRR = 0.262 (i.e., right answer ranked about #4#5 on average) – Why? Because it relies on the enormity of the Web! Using the Web as a whole, not just TREC’s 1M documents… MRR = 0.42 (i.e., on average, right answer is ranked about #2-#3) AskMSR Issues In many scenarios (e.g., monitoring an individual’s email…) we only have a small set of documents Works best/only for “Trivial Pursuit”style fact-based questions Limited/brittle repertoire of – question categories – answer data types/filters – query rewriting rules PiQASso: Pisa Question Answering System “Computers are useless, they can only give answers” Pablo Picasso PiQASso Architecture ? Question analysis Answer analysis MiniPar Relation Matching Answer Scoring Question Classification Type Matching Popularity Ranking MiniPar Answe r found? Document collection WNSense Sentence Splitter Indexer Query Formulation /Expansion WordNet Answer Pars Answer Question Analysis What metal has the highest melting point? 1 Parsing obj subj lex-mod What metal has the highest melting point? 
mod 2 Keyword extraction 3 Answer type detection metal, highest, melting, point SUBSTANCE 4 Relation extraction <SUBSTANCE, has, subj> <point, has, obj> <melting, point, lex-mod> <highest, point, mod> 1. NL question is parsed 2. POS tags are used to select search keywords 3. Expected answer type is determined applying heuristic rules to the dependency tree 4. Additional relations are inferred and the answer entity is identified Answer Analysis Tungsten is a very dense material and has the highest melting point of any metal. 1 Parsing …………. 2 Answer type check SUBSTANCE 1. Parse retrieved paragraphs 2. Paragraphs not containing an entity of the expected type are discarded 3. Dependency relations are extracted from Minipar output 4. Matching distance between word relations in question and answer is computed 5. Too distant paragraphs are filtered out 6. Popularity rank used to weight distances 3 Relation extraction <tungsten, material, pred> <tungsten, has, subj> <point, has, obj> … 4 Matching Distance Tungsten 5 Distance Filtering 6 Popularity Ranking ANSWER Match Distance between Question and Answer Analyze relations between corresponding words considering: – number of matching words in question and in answer – distance between words. Ex: moon - satellite – relation types. Ex: words in the question related by subj while the matching words in the answer related by pred http://medialab.di.unipi.it/askpiqasso.html TREC 2004 Questions: factoid (230) and list (56) in series (65), definition – Where was Franz Kafka born? – When was he born? – What books did he author? Question series provide context TREC 2004 Results (factoid) Knowledge Rich 0.8 Approach: PowerAnswer2 0.7 0.6 0.5 0.4 0.3 Accuracy 0.2 0.1 da n Fu IR ST IT M IB M LC C W al e Si ng s ap or e Sa ar la nd 0 PowerAnswer2 Architecture Q Context Insertion A Answer Formulation Strategy Selection Answer Ranking COGEX Answer Detection Keyword Selection Semantic Matching Parser NER Lexical Matching Keyword Alternations/ Expansions Passage Retrieval COGEX Theorem Prover Semantic Calculus Linguistic Axioms Axiom Building Logic Prover Q Semantic Parser Anwer Ranking Text WordNet Axioms Word Knowledge Axioms Linguistic axioms improve overall accuracy by 12% QA: Linguistic + Web mining Saarland: linguistic + Web Match question with syntactic pattern surface structure of possible answer Search Web, parse, tag and match results Use NER to identify answer Example Question: When did Amtrak begin operations? 
pattern: When,did,NP,Verb,NP|PP target: ref(3), ref(4)->past, ref(5), in, NP target: in,NP,[,],ref(3),ref(4)->past,ref(5) answerType: NP|PP – Weights: Date = 4, year = 3, other = 1 Web search Queries: – “Amtrak began operations in” – “In” “Amtrak began operations” Results: – “Amtrak began operations in 1971” – “… through 1971, when Amtrak began operations in the state” Matches: – “1971”: 4 – “the state”: 1 Effectiveness Algorithm Accuracy Rephrasing 0.20 Google fallback 0.395 QA without linguistic knowledge Lexiclone: Approach Find answer in a smaller and specific database Extract a few predicative definitions from the text summary Find in AQUAINT sentence with same predicatives as those found Accuracy: 0.63 (second best score in 2003) Lexiclone: Summary “Most don’t live in the city, …” 3 nouns, 3 verbs, 4 adjectives Build all triads: – City live most – Most live city – City live in – Most live in –… QA: role of Named Entities Lexical Terms Extraction as input to IR Questions approximated by sets of unrelated words (lexical terms) Similar to bag-of-word IR models: but choose nominal non-stop words and verbs Question (from TREC QA track) Lexical terms Q002: What was the monetary monetary, value, value of the Nobel Peace Prize in Nobel, Peace, Prize 1989? Q003: What does the Peugeot company manufacture? Peugeot, company, manufacture Q004: How much did Mercury spend on advertising in 1993? Mercury, spend, advertising, 1993 Extracting Answers for Factoid Questions: NER In TREC 2003 the LCC QA system extracted 289 correct answers for factoid questions A Name Entity Recognizer was responsible for 234 of them Names are classified into classes matched to questions QUANTITY 55 ORGANIZATION 15 PRICE 3 NUMBER 45 AUTHORED WORK 11 SCIENCE NAME 2 DATE 35 PRODUCT 11 ACRONYM 1 PERSON 31 CONTINENT 5 ADDRESS 1 COUNTRY 21 PROVINCE 5 ALPHABET 1 OTHER LOCATIONS 19 QUOTE 5 URI 1 CITY 19 UNIVERSITY 3 NE-driven QA The results of the past 5 TREC evaluations of QA systems indicate that current state-of-theart QA is largely based on the high accuracy recognition of Named Entities: – – – – Precision of recognition Coverage of name classes Mapping into concept hierarchies Participation into semantic relations (e.g. predicate-argument structures or frame semantics) Not all problems are solved yet! Where do lobsters like to live? – on a Canadian airline Where are zebras most likely found? – near dumps – in the dictionary Why can't ostriches fly? – Because of American economic sanctions What’s the population of Mexico? – Three What can trigger an allergic reaction? – ..something that can trigger an allergic reaction References AskMSR: Question Answering Using the Worldwide Web – Michele Banko, Eric Brill, Susan Dumais, Jimmy Lin – http://www.ai.mit.edu/people/jimmylin/publications/Bank o-etal-AAAI02.pdf – In Proceedings of 2002 AAAI SYMPOSIUM on Mining Answers from Text and Knowledge Bases, March 2002 Web Question Answering: Is More Always Better? – Susan Dumais, Michele Banko, Eric Brill, Jimmy Lin, and Andrew Ng – http://research.microsoft.com/~sdumais/SIGIR2002-QASubmit-Conf.pdf D. Ravichandran and E.H. Hovy. 2002. Learning Surface Patterns for a Question Answering System. ACL conference, July 2002. References S. Harabagiu, D. Moldovan, M. Paşca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Gîrju, V.Rus and P. Morărescu. FALCON: Boosting Knowledge for Answer Engines. The Ninth Text REtrieval Conference (TREC 9), 2000. 
Marius Pasca and Sanda Harabagiu, High Performance Question/Answering, in Proceedings of the 24th Annual International ACL SIGIR Conference on Research and Development in Information Retrieval (SIGIR-2001), September 2001, New Orleans LA, pages 366-374. L. Hirschman, M. Light, E. Breck and J. Burger. Deep Read: A Reading Comprehension System. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999. C. Kwok, O. Etzioni and D. Weld. Scaling Question Answering to the Web. ACM Transactions in Information Systems, Vol 19, No. 3, July 2001, pages 242-262. M. Light, G. Mann, E. Riloff and E. Breck. Analyses for Elucidating Current Question Answering Technology. Journal of Natural Language Engineering, Vol. 7, No. 4 (2001). Dependency Parsing Parsing in QA Top systems in TREC 2005 perform parsing of queries and answer paragraphs Some use specially built parser Parsers are slow: ~ 1min/sentence Constituent Parsing Requires Grammar – CFG, PCFG, Unification Grammar Produces phrase structure parse tree VP S VP S VP NP NP NP VP ADJP Rolls-Royce Inc. said it expects its sales to remain steady Dependency Tree Word-word dependency relations Far easier to understand and to annotate Rolls-Royce Inc. said it expects its sales to remain steady Parsing as classification Inductive dependency parsing Parsing based on Shift/Reduce actions Learn from annotated corpus which action to perform Right Shift Left Parser Actions top next Ho VER:aux visto VER:pper una DET ragazza NOM con PRE gli DET occhiali NOM . POS Parser State The parser state is a quadruple S, I, T, A, where S is a stack of partially processed tokens I is a list of (remaining) input tokens T is a stack of temporary tokens A is the arc relation for the dependency graph (w, r, h) A represents an arc w → h, tagged with dependency r Parser Actions Shift Right Left S, n|I, T, A n|S, I, T, A s|S, n|I, T, A S, n|I, T, A{(s, r, n)} s|S, n|I, T, A S, s|I, T, A{(n, r, s)} Learning Features feature Value W word L lemma P part of speech (POS) tag M morphology: e.g. 
singular/plural W< word of the leftmost child node L< lemma of the leftmost child node P< POS tag of the leftmost child node, if present M< whether the rightmost child node is singular/plural W> word of the rightmost child node L> lemma of the rightmost child node P> POS tag of the rightmost child node, if present M> whether the rightmost child node is singular/plural Learning Event left context Sosteneva VER che PRO target nodes leggi NOM le DET context anti ADV Serbia NOM right context che PRO , PON erano VER discusse ADJ (-3, W, che), (-3, P, PRO), (-2, W, leggi), (-2, P, NOM), (-2, M, P), (-2, W<, le), (-2, P<, DET), (-2, M<, P), (-1, W, anti), (-1, P, ADV), (0, W, Serbia), (0, P, NOM), (0, M, S), (+1, W, che), ( +1, P, PRO), (+1, W>, erano), (+1, P>, VER), (+1, M>, P), (+2, W, ,), (+2, P, PON) Projectivity A dependency tree is projective iff every arc is projective An arc wi→wk is projective iff j, i < j < k or i > j > k, wi →* wk Non Projective Většinu těchto přístrojů lze take používat nejen jako fax , ale Actions for non-projective arcs Right2 Left2 Right3 Left3 Extract Insert s1|s2|S, n|I, T, A s1|S, n|I, T, A{(s2, r, n)} s1|s2|S, n|I, T, A s2|S, s1|I, T, A{(n, r, s2)} s1|s2|s3|S, n|I, T, A s1|s2|S, n|I, T, A{(s3, r, n)} s1|s2|s3|S, n|I, T, A s2|s3|S, s1|I, T, A{(n, r, s3)} s1|s2|S, n|I, T, A n|s1|S, I, s2|T, A S, I, s1|T, A s1|S, I, T, A CoNLL-X Shared Task Results Maximum Entropy Language MBL LAS % UAS % LA % Train sec Arabic 54.15 69.50 72.97 181 Bulgarian 72.90 85.24 77.68 Chinese 70.00 81.33 Czech 62.10 Danish Parse sec LAS % UAS % LA % Train sec Parse sec 2.6 59.70 74.69 75.49 24 950 452 1.5 79.17 85.92 83.22 88 353 58.75 1156 1.8 72.17 83.08 75.55 540 478 73.44 69.84 13800 12.8 69.20 80.22 77.72 496 13500 71.72 78.84 74.65 386 3.2 76.13 83.65 82.06 52 627 Dutch 63.71 68.93 66.47 679 3.3 68.97 74.73 75.93 132 923 German 75.88 80.25 78.39 9315 4.3 79.79 84.31 86.88 1399 3756 Japanese 78.01 82.05 73.68 129 0.8 83.39 86.73 89.95 44 97 Portuguese 79.40 85.03 80.79 1044 4.9 80.97 86.78 85.27 160 670 Slovene 60.63 72.14 69.36 98 3.0 62.67 76.60 72.72 16 547 Spanish 70.33 74.25 82.19 204 2.4 74.37 79.70 85.23 54 769 Swedish 75.20 83.03 72.42 1424 2.9 74.85 83.73 77.81 96 1177 Turkish 48.83 65.25 49.81 177 2.3 47.58 65.25 59.65 43 727 Future Directions Opinion Extraction – Finding opinions (positive/negative) – Blog track in TREC2006 Intent Analysis – Determine author intent, such as: problem (description, solution), agreement (assent, dissent), preference (likes, dislikes), statement (claim, denial) References S. Chakrabarti, Mining the Web, Morgan-Kaufmann, 2004. G. Attardi. Experiments with a Multilanguage Non-projective Dependency Parser, CoNLL-X, 2006. H. Yamada, Y. Matsumoto. Statistical Dependency Analysis with Support Vector Machines. In Proc. IWPT, 2003.
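As a closing sketch, the Shift/Left/Right transitions described earlier can be driven by any classifier over the parser state; below is a hypothetical minimal loop in Python with the classifier replaced by a toy rule (the temporary stack T, the extra actions for non-projective arcs, and dependency labels are simplified away):

```python
def parse(tokens, choose_action):
    """Shift/Right/Left parsing loop over a state (S, I, A), following the transitions above.
    choose_action(stack, buffer) would normally be a trained classifier; here it is a stub."""
    stack, buffer, arcs = [], list(tokens), []    # A holds (dependent, relation, head) triples
    while buffer:
        action = choose_action(stack, buffer) if stack else "shift"
        if action == "shift":
            stack.insert(0, buffer.pop(0))                  # S, n|I  ->  n|S, I
        elif action == "right":
            s = stack.pop(0)
            arcs.append((s, "dep", buffer[0]))              # arc s -> n; s is removed from the stack
        elif action == "left":
            n = buffer.pop(0)
            arcs.append((n, "dep", stack[0]))               # arc n -> s; s goes back to the input
            buffer.insert(0, stack.pop(0))
    return arcs

# Toy rule: reduce (Right) whenever the next input token is the head noun "car"
def toy_policy(stack, buffer):
    return "right" if buffer[0] == "car" else "shift"

print(parse(["the", "red", "car"], toy_policy))
# [('red', 'dep', 'car'), ('the', 'dep', 'car')] -- "car" is left on the stack as the root
```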