ACML 2010 Tutorial
Web People Search: Person Name Disambiguation and Other Problems
Hiroshi Nakagawa: Introduction, Feature Extraction (Phrase Extraction)
Minoru Yoshida: Feature Extraction (Information Extraction Approach) to End
(University of Tokyo)

Contents
1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues

Introduction
1. Motivation
2. Problem Settings
3. Differences from other problems
4. History

Motivation
• Web search for person names: over 10% of all queries
– A study of the query logs of the AllTheWeb and Altavista search sites gives an idea of the relevance of the people search task: 11-17% of the queries were composed of a person name with additional terms, and 4% were identified simply as person names (Artiles+, 09; WePS2).
• "Same-name" problem in person names
– When different real-world entities have the same name, the reference from the name to the entity can be ambiguous.
– Many different persons having the same name
  • (e.g.,) John Smith
– Persons having the same name as a famous one
  • (e.g.,) Bill Gates
  • With ordinary search engines, it is tough to find a Bill Gates who is not the Microsoft founder: the famous one dominates the results.
• It is difficult to reach the target person.

Problem in People Search
• Query -> Search engine -> Results
• Which pages are about which persons?

Person Name Clustering
• Query -> Search engine -> Search result -> Clusters of Web pages
• Each page in a cluster refers to the same entity.

Sample System
• Query = Ichiro Suzuki (famous Japanese baseball player)
• Output: keywords about the person, and documents about the same person

Output Example (Ichiro Suzuki)
• Clusters found besides the baseball player: painter, dentist, lawyer
• Used as an example name because Ichiro is so famous

Problem Setting
• Given: a set of Web pages returned from a search engine when entering person name queries
• Goal: to cluster Web pages
– One cluster for one entity
– Possibly with related information (e.g., biography and/or related words)
• Another usage: if a person has many aspects, like scientist and poet, the pages for each aspect are grouped together. This makes it easy to grasp who he/she is.

Example: Sakai Shuichi
• Sakai Shuichi is a professor at the University of Tokyo in the field of computer architecture. One group of pages is about his books on computer architecture.
• He is a Japanese poet too. Another group of pages is about his collections of poems.

Example: Famous car maker "TOYOTA"
• One group of pages is about TOYOTA's retailer network.
• Another group of pages is about TOYOTA HOME, a house maker that is one of the TOYOTA group enterprises.

Difference from Other Tasks
                   | WSD, Categorization | Person Name Clustering                              | Document Clustering
Goal               | Categorize          | Cluster documents about the same entity (= person)  | Cluster similar documents
Learning method    | Supervised          | Unsupervised                                        | Unsupervised
Answers            | Definite y/n        | Definite y/n, but exact # unknown                   | Not definite
Number of clusters | # of categories     | # of entities in the real world (unknown)           | Task dependent
Training data      | Yes                 | Difficult to use                                    | No
• Person name clustering: cluster documents for the same person
• It is difficult to use training data made for other person names

WSD: Word Sense Disambiguation
• I was strolling the bank.
• Do you use a bank card there?
• Did you go to the bank?
• Which sense of "bank"?

Challenges
• Noisy Web data
– (1) Heavy and sophisticated NLP tools such as an HPSG parser are not suitable for the purpose.
– (2) The system should work at a tolerable speed, so lightweight tools are needed.
– Light linguistic tools
  • POS taggers, stemmers, NE taggers
  • Pattern-based information extraction
• How to use "training data"
– Most systems use an unsupervised clustering approach
– Some systems assume "background knowledge"
• How to determine K (number of clusters)
– Remember: in real use, K does not depend on the user's intention; it is exact and fixed. This is different from usual clustering!

History
• (Word Sense Disambiguation) (Coreference Resolution)
• 1998: Cross-document coreference resolution [Bagga+, 98]
– Naive VSM
• 2003: Disambiguation for Web search results [Mann+, 03]
– Biographic data
• 2007: Web People Search Workshop (WePS) [Artiles+, 07][Artiles+, 09]

History
• Web People Search Workshop
– 1st, SemEval-2007
– 2nd, WWW-2009
  • Document Clustering
  • Attribute Extraction
– 3rd, CLEF-2010 (Conference on Multilingual and Multimodal Information Access Evaluation), 20-23 September 2010, Padua
  • Document Clustering & Attribute Extraction
  • Organization Name Disambiguation

WePS2 Data Source: 30 names
WePS2 Data 1, 2, 3 (Artiles+, 09)
WePS2 summary report

Main Steps
1. Preprocessing
2. Feature extraction
3. Feature weighting / Similarity calculation
4. Clustering
5. (Related Information Extraction)

PREPROCESSING

In addition, alphabetically ordered name-list pages count as junk pages
(Ono+, 08)

Preprocessing
• Filter out useless pages ("junk pages")
– The name is matched, but the matched string doesn't refer to a person (e.g., a company name)
• Data cleaning
– HTML tag removal
– Sentence (snippet) extraction
– Coreference resolution (used by Bagga+); in fact, a very difficult NLP task

Junk Page Filtering
• SVM-based classification (Wan+, 05)
– Features (words related or not related to the person name)
  • Simple lexical features
  • Stylistic features (fonts / tags), i.e., how many and which words are in bold font
  • Query-relevant features (next-to-query words)
  • Linguistic features (NE counts), such as how many person, organization, and location names appear
  • …

FEATURE EXTRACTION

Feature Extraction
• How to characterize each name appearance
– The name itself cannot be used for disambiguation!
• Each name appearance can be characterized by its contexts.
• Possible contexts
– Surrounding words, adjacent strings, syntactically related words, etc.
– Which to use?

Basic Approach
• Use all words in documents
– Or snippets (texts around the name)
– Or titles/summaries (first sentence, etc.)
• Use the TFIDF weighting scheme

Problem
• There exist relatively useful features and relatively useless features (especially for person name disambiguation)
– Useful: NEs, biography, noun phrases, etc.
– Useless: general words, boilerplate, etc.
• How to distinguish useful features from others
• How to give weight to each feature

Named Entities
• Documents about Bill Gates: related person names, related organization names

Noun Phrases
• Documents about Bill Gates: related key words

Other Words
• Documents about Bill Gates: some of the other words are more important than the rest

Extracting Useful Features
• Thresholding
– Based on a score related to our purpose: TFIDF etc.
• Tool-based approach
– POS tagging, NE tagging
• Information Extraction approach
– Described later by Yoshida
• Meta-data approach
– Link structures, meta tags

Thresholding
• Calculate TFIDF scores of words
• Discard the words with low TFIDF scores
• Unigrams, bigrams, and even N-grams can be used
– (Chen+, 09) uses the Google 5-gram corpus (from 1T words) to calculate the TFIDF score
• Other scores: log-likelihood ratio, mutual information, KL-divergence

Tool Based Approach
• Available tools:
– POS tagging
  • High-performance POS taggers have been developed for many languages. For Western languages, stemmers have also been developed.
– NE extraction (sophisticated)
– Bigram, N-gram (unsophisticated but simple)
– Keyword extraction (in the middle between NE and bigram/N-gram)

Part of Speech (POS) Tagging
• Detect the grammatical categories of the words
– Nouns, verbs, prepositions, adverbs, adjectives, …
– Typically nouns are used as features
• Example: William Henry "Bill" Gates III [NOUNS] (born [VERB] October 28, 1955 [NOUNS]) is [VERB] an [DETERMINER] American [ADJECTIVE] business magnate, philanthropist [NOUNS], …
– Noun phrases can be extracted with some simple rules
– Many available tools (e.g., Tree Tagger)

Named Entities (NE) Extraction
• Find "proper names" in texts
– e.g., names of persons, organizations, locations, …
– Time expressions are included in many cases
• Example: William Henry "Bill" Gates III [PERSON] (born October 28, 1955 [DATE]) is an American business magnate, philanthropist, …
– Many available tools (Stanford NER, OpenNLP, Espotter, …)

Key Phrase Extraction
• Noun phrases consisting of 2 or more words
– Likely to be topic-related concepts
– Term-extraction tool "Gensen" (Nakagawa+, 05)
  • Noun phrases with a "term-likelihood" score
  • Topic-related term -> higher score
• Example: Gates held the positions of CEO and chief software architect [SCORE=45.2], and remains the largest individual shareholder [SCORE=22.4] …

Gensen (言選) Web Score
• From the corpus we extract: 信息処理 (information processing), 計算機処理能力 (computer processing capacity), 処理段階 (processing step), 信息処理学会 (information processing society)
• Left adjacent words of 処理 (processing): 信息 (information), 計算機 (computer)
• Right adjacent words of 処理: 能力 (capacity), 段階 (step), 学会 (society)
• L: # of left adjacent words + 1; R: # of right adjacent words + 1
– L(W=処理) = 2 + 1
– R(W=処理) = 3 + 1
– L(処理) × R(処理) = 3 × 4 = 12

Calculation of LR and FLR
• Compound word: W = w_1 ... w_n, where each w_i is a simple noun
• L(w_i) = (# of left-side connections of w_i) + 1
• R(w_i) = (# of right-side connections of w_i) + 1
• The LR score of a compound word W = w_1 ... w_n, like 信息処理学会, is defined as follows (the exponent normalizes by length):
  $LR(W) = \left( \prod_{i=1}^{n} L(w_i) R(w_i) \right)^{1/(2n)}$
• Example: LR(信息処理) = [L(信息) × R(信息) × L(処理) × R(処理)]^{1/4}, or in English, LR(information processing) = [L(info.) × R(info.) × L(proc.) × R(proc.)]^{1/4}

Calculation of LR and FLR
• F(W) is the independent frequency of compound word W, where "independent" means that W is not a part of a longer compound word.
• FLR(W) is then defined as
  $FLR(W) = F(W) \times LR(W)$
• FLR is the score used to rank term candidates.
• F(W) has an effect similar to TF, so if the corpus is big, F(W) affects FLR(W) more.
• Example: FLR(信息処理) = F(信息処理) × [L(信息) × R(信息) × L(処理) × R(処理)]^{1/4}

Example of term extraction by Gensen Web
English article: SVM on Wikipedia

Support vector machines (SVMs) are a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. The original SVM algorithm was invented by Vladimir Vapnik and the current standard incarnation (soft margin) was proposed by Corinna Cortes and Vladimir Vapnik. The standard SVM is a non-probabilistic binary linear classifier, i.e. it predicts, for each given input, which of two possible classes the input is a member of. Since an SVM is a classifier, then given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.
Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
…….
Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the Karush-Kuhn-Tucker conditions of the primal and dual problems. Instead of solving a sequence of broken down problems, this approach directly solves the problem as a whole. To avoid solving a linear system involving the large kernel matrix, a low rank approximation to the matrix is often used to use the kernel trick.

Extracted terms (term score)
• Top 1-17: hyperplane 116.65, margin 109.54, SVM 74.08, vector 56.12, point 52.85, support vector 49.34, training data 48.12, data 47.83, problem 44.27, space 44.09, data point 38.01, classifier 30.59, classification 29.58, optimization problem 26.05, set 25.30, support vector machine 24.66, kernel 21.00
• Top 18-38: set of point 20.73, linear classifier 19.99, maximum-margin hyperplane 19.92, example 19.60, one 17.32, Vladimir Vapnik 15.87, parameter 14.70, linear SVM 14.40, training set 14.00, optimization 13.42, model 12.25, training vector 12.04, support vector classification 11.70, two classe 11.57, normal vector 11.38, kernel trick 11.22, maximum margin classifier 11.22, .....
• Top 408-426 (last): Vandewalle 1.00, derive 1.00, it 1.00, Leisch 1.00, 2.3 1.00, H1 1.00, c 1.00, Hornik 1.00, mean 1.00, testing 1.00, transformation 1.00, unconstrained 1.00, homogeneous 1.00, need 1.00, learner 1.00, grid-search 1.00, convex 1.00, See 1.00, trade 1.00

Information Extraction Approach
• Information extraction:
– The task of extracting specific types of information
– e.g., a person and his/her working place
• Example: William Henry "Bill" Gates III [NAME] (born October 28, 1955 [DATE OF BIRTH]) is an American [NATIONALITY] business magnate, philanthropist [OCCUPATION], …

Information Extraction Approach
• Useful features for disambiguation (Wan+, 05) (Mann+, 03) (Niu+, 04)
• Also used as "summaries" of clusters
– To help users find the clusters they are looking for
– WePS-2 "attribute extraction task"

Information Extraction Approach
• Different methods for different attributes
– Simple patterns (hand-crafted / automatically obtained)
  • Phone, FAX, URL, E-mail
– Syntactic rules (hand-crafted / automatically generated)
  • Date of birth, titles, positions
– Dictionary match (from Wikipedia, etc.)
  • Occupation, major, degree, nationality
– Keywords extracted by NER tools
  • Birth place (LOCATION), affiliation (ORGANIZATION), schools (ORGANIZATION)

Hand-Crafted Patterns
• Typically written with regular expressions
• Phone, FAX
– +## (#) ####-####
• URLs
– http://www.xxx.xxx.xxx/...
• E-mails
– xxx@xxx.com
• Needs some classification (Phone or FAX?)
– Supervised learning
– Keyword-based approach (e.g., "born" for date of birth)

Automatically Generated Patterns
• Patterns for birth years (Mann+, 03)
– <name> (<birth year> - ####)
– <name> ( <birth year>
– <name> was born in <birth year>
• Patterns for titles (Wan+, 05)
– <name> is a <title>

Automatically Generated Patterns
• Approach by (Mann+, 03)
– Bootstrapping method
• Start with seed facts
– (e.g., (Mozart, 1756))
• Find sentences (from the Web) that contain both elements
– (e.g., "Mozart was born in 1756")
• Perform some generalization
– (e.g., "<name> was born in <birth year>")
• Extract substrings with high scores (measured using the current facts)
• Extract new facts

Dictionary Matching
• Construct a list of occupations, nations (for "nationality" attributes), etc. from existing dictionaries
– Wikipedia, WordNet, etc.
• e.g., a list of countries

Link Structure Approach
• It is difficult to find correct network structures
– Difficulty in finding "in-links"
• Needs some approximation
• (Bekkerman+, 05): "socially linked persons tend to link similar pages"
– Determine whether two pages are linked or not
– MaxEnt classification with "linked-page" (URLs in pages) features

FEATURE WEIGHTING / SIMILARITY CALCULATION

Feature Weighting
• Knowledge-based approach
– US Census data, WordNet
• Web-query approach
• SVD
• Bootstrapping
• Determination of link/non-link by supervised classifiers

Knowledge-Base Approach
• US Census data
– Frequent name -> ambiguous (Fleischman+, 04)
• WordNet
– Semantic similarity for concept words
  • WordNet distance

WordNet
• Publicly available "dictionary" (thesaurus)
– Hierarchical structures between words
– We can find "synonyms", "hyponyms", and "hypernyms" of words
• Many "semantic distance" measures between two words
– Path length
– Depth of common hypernyms
– …

Web-Query Approach
• Name-concept relation (Fleischman+, 04)
• Validate relations between context NEs by Web search counts (Kalashnikov+, 08) (Nuray-Turan+, 09)
• Use the query "name + bigram", concatenating the snippets into a new document (Chen+, 09)
• Obtaining reliable counts (google_df) (Bekkerman+, 05)

Name-concept relation (Fleischman+, 04)
• Task: distinguish (name, concept) pairs
– (Paul Simon, pop star) ; (Paul Simon, singer): same person
– (Paul Simon, pop star) ; (Paul Simon, politician): different persons
• MaxEnt classifier
• Features using Web counts (Q: Web count of a query, N: name, c: concept, +: AND operation)
– Q(N + c1 + c2): Intersection
– | Q(N + c1) - Q(N + c2) |: Difference
– Q(N + c1 + c2) / (Q(N + c1) + Q(N + c2)): Ratio

Validate relations between context NEs by Web search counts (Kalashnikov+, 08) (Nuray-Turan+, 09)
• NE-based document similarity calculated using Web counts
– NE: persons or organizations
• WebDice (C: context set … [c1] OR [c2] OR …)
– 2Q(N + C1 + C2) / (Q(N + C1) + Q(N + C2))
– 2Q(N + C1 + C2) / (Q(N) + Q(C1 + C2))
– The second one was better

Use the query "name + bigram", concatenating the snippets into a new document (Chen+, 09)
• Obtain additional features for similarity calculation
– Web page -> b: the maximal-weight bigram
– Snippets100(N + b) -> one new document
– New document -> additional features (tokens)

Obtaining reliable counts (google_df) (Bekkerman+, 05)
• Google_tfidf(w) = tf(w) / log(Q(w))
• Some recent systems use Google N-gram (Chen+, 09)

Dimension Reduction by SVD (Pedersen+, 05)
• Reduce sparseness of context vectors
• More semantic-level representations (can use word similarities in contexts)
• Bigram features (contexts)

Cluster Refinement by Bootstrapping (1/4) (Yoshida+, 10)
• Strong features can identify a person
– High precision, but not always observed
• Strong features: NEs, CKWs, ...
– e.g., a page with "Bill Gates, Paul Allen, Microsoft" and a page with "Bill Gates, Steve Ballmer, Microsoft" are about the same person
• Weak features: e.g., "program"
– Not useful in general, but useful for this name

Cluster Refinement by Bootstrapping (2/4)
• Feature set F = {f_1, ..., f_m}
• Document set D = {d_1, ..., d_n}
• Document-feature matrix P
• Feature-cluster relation r_{F,C}
• Document-cluster relation r_{D,C}, initialized from the initial clusters

Cluster Refinement by Bootstrapping (3/4)
• Starting from the initial clusters, the relations are updated alternately:
  $r_{F,C}^{(t)} = P^{T} r_{D,C}^{(t)}$
  $r_{D,C}^{(t+1)} = P \, r_{F,C}^{(t)}$
• Combining the two steps:
  $r_{D,C}^{(t+1)} = P P^{T} r_{D,C}^{(t)}$

Cluster Refinement by Bootstrapping (4/4)
• Each document is put into the cluster with the largest relation value
• Example: Refined values = P P^T × Initial values (5 documents, 3 clusters)

  P P^T:
  0.40 0.40 0.10 0.10 0.30
  0.40 0.60 0.05 0.05 0.20
  0.10 0.05 0.45 0.40 0.20
  0.10 0.05 0.40 0.45 0.20
  0.30 0.20 0.20 0.20 0.30

  Initial values:
  1 0 0
  1 0 0
  0 1 0
  0 1 0
  0 0 1

  Refined values:
  0.80 0.20 0.30
  1.00 0.10 0.20
  0.15 0.85 0.20
  0.15 0.85 0.20
  0.50 0.40 0.30

Determination of "linked" or "not-linked" by supervised classifiers
• MaxEnt Classification (Fleischman+, 04)
– Features: name features, web features, etc.
• SkyLine-Based Classification (Kalashnikov+, 08)
– Features: search engine hit counts

CLUSTERING

Problem: How to Determine K
• Hierarchical clustering with thresholds
• Online clustering (single pass clustering)
• Building "core" clusters (2-stage clustering)
• Variable-component-number clustering (e.g., Dirichlet Process Mixture)

Hierarchical clustering with thresholds
• Used in many systems
• Popular settings:
– Agglomerative clustering
– Group-average method (or sometimes the single-link method)
– Predetermined threshold (or sometimes a threshold determined by cross-validation)

Hierarchical clustering with thresholds
• [Dendrogram: cluster similarity runs from low to high; document IDs along the axis: 5, 2, 3, 9, 1, 8, 7, 6, 4]
– Cutting at a low similarity threshold -> 2 clusters: {1,2,3,5,9}, {4,6,7,8}
– Cutting at a higher similarity threshold -> 4 clusters: {2,5}, {1,3,9}, {6,7,8}, {4}
• Cluster similarity (group average method):
  $sim_C(C_x, C_y) = \frac{1}{|C_x| \, |C_y|} \sum_{d_x \in C_x,\ d_y \in C_y} sim(d_x, d_y)$

Cluster-Distance Calculation
• Single linkage method
• Complete linkage method
• Centroid method

Online Clustering
• Single Pass Clustering (Balog+, 08)
– Take pages in order from the 1st in the search results
– For each page, find the most similar cluster
– If the similarity is below the threshold, create a new cluster
• Similarity: Naïve Bayes | Cosine with TFIDF

Building Core Clusters
• 1st stage clustering
– High-precision clusters
– Relatively high threshold (Mann+, 03)
– Use strong features only (Ikeda+, 09)
• 2nd stage clustering
– Treat the remaining documents
– Add them to the most similar 1st-stage clusters (Mann+, 03) (Ikeda+, 09)
– Feature weighting by 1st-stage clusters (Yoshida+, 10)

Query Expansion Approach (Ikeda+, 09)
• Re-extract key-phrases by using the 1st-stage clusters
– Key-phrases for documents -> key-phrases for clusters
– More reliable than those from one document
• Procedure (example top CKWs of the current cluster: home runs, major leagues, all stars)
1. Extract the top CKWs from the current cluster
2. Search for the CKWs in documents outside the cluster
3. If such documents exist, copy them into the cluster (soft clustering)
4. Remove 1-element clusters

Feature weighting by 1st stage clusters (Yoshida+, 10)
1. Make clusters by strong features
2. Weight weak features using the clusters, and refine similarities
3. Refine clusters by using the new similarities

Using Dirichlet Process Mixture (Ono+, 08)
• Topic = word distribution
– Topic "economics" = word distribution {"dollar": 0.03, "stock": 0.05, "share": 0.01, ...}
• Document = mixture of topics
– {economics: 0.3, politics: 0.2, ...}
• Document's topic = the topic with the highest weight
• Modeling by DPUM (Dirichlet Process Unigram Mixture)
– The number of topics is determined automatically

Example: Estimation of Latent Topics
• [Figure: axes word-1, word-2, word-3; latent entity = each (red) bar; document = each point]

Dirichlet Process Unigram Mixture
• G0 = distribution for θ (Dirichlet distribution)
• θ = multinomial distribution
• [Graphical models: UM with nodes θ and w_dn and plates N_d and M; DPUM with nodes G0, G, θ_d, and w_dn, where G is a countable number of multinomial distributions]

DPUM Parameter Estimation
• Start from an initial entity distribution
• Estimate the entity distribution by iteratively maximizing the likelihood
• Merge clusters with the same topic
• [Figure labels: Politics, Economics, Sports, Society, Arts, Entertainment]

EVALUATION ISSUES

Evaluation Issues
• Evaluation measures
• Available corpus
• WePS workshop

Evaluation Measures
• Precision / Recall / F-measure
• Purity / Inverse Purity
• B-cubed Precision / Recall / F-measure
– Extended B-cubed

Recall and Precision for Clustering
• Features and recall/precision
• First-stage cluster = high precision
• A: size of the cluster; B: # of correct documents; C: # of correct documents in the cluster
• Example: A = 5, B = 8, C = 3
– Precision P = C / A = 3/5 = 0.6
– Recall R = C / B = 3/8 = 0.375

Recall and Precision [Larsen and Aone 1999]
• Correct clusters: $C = \{C_1, \dots, C_m\}$
• Machine-made clusters: $D = \{D_1, \dots, D_k\}$
• Precision, recall, and F-measure are calculated for each pair of a correct cluster $C_i$ and a machine-made cluster $D_j$:
  $P(C_i, D_j) = \frac{|C_i \cap D_j|}{|D_j|}, \quad R(C_i, D_j) = \frac{|C_i \cap D_j|}{|C_i|}, \quad F(C_i, D_j) = \frac{2 P R}{P + R}$
• For each $C_i$, the $D_j$ that maximizes the F-measure is chosen. Total F-measure (F), with N the total number of documents:
  $F = \sum_{i} \frac{|C_i|}{N} \max_{j} F(C_i, D_j)$
• Note: the total precision (P) and recall (R) are calculated in the same way.

Example
• Correct cluster in C: [A][A][A][A][A]
• Machine-made clusters in D:
– [A][A][B]: P = 2/3, R = 2/5, F = 1/2
– [A][A][A][B][C]: P = 3/5, R = 3/5, F = 3/5
– …

Purity / Inverse Purity
• Similar to precision / recall
– L: manually annotated categories (clusters)
– C: clusters output by systems
• With N the total number of documents:
  $Purity = \sum_{i} \frac{|C_i|}{N} \max_{j} \frac{|C_i \cap L_j|}{|C_i|}, \quad InversePurity = \sum_{j} \frac{|L_j|}{N} \max_{i} \frac{|L_j \cap C_i|}{|L_j|}$

B-Cubed Precision/Recall
• Entity-wise accuracy calculation
– C: cluster (by system) containing e
– L: cluster (by human) containing e
• For each entity e, averaged over all e:
  $Precision(e) = \frac{|C \cap L|}{|C|}, \quad Recall(e) = \frac{|C \cap L|}{|L|}$

B-Cubed Precision/Recall
• [Figure borrowed from (Amigo+, 09)]

Other Metrics
• Counting pairs
– Given a pair of documents, label it "link" or "unlink"
– Problem: the number of pairs is quadratic in the size of the clusters
• Entropy
– Low entropy in a cluster -> pure
• Edit distance
– Distance from the system output to the correct output

Which Metrics to Use
• Constraints (borrowed from (Amigo+, 09))
• Homogeneity: the purer, the better
• Completeness: the more complete, the better

Which Metrics to Use
• Constraints (borrowed from (Amigo+, 09))
• Rag bag
– Adding noise to a noisy cluster: better!
– Adding noise to a pure cluster: worse!
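The B-cubed and Purity / Inverse Purity measures named in the evaluation slides above can be sketched in a few lines. This is a minimal illustration, not the WePS scorer: it assumes hard clustering (each document belongs to exactly one system cluster and one manually annotated cluster), so it does not cover the extended B-cubed used for soft clustering. The function names `bcubed` and `purity` and the toy labels are hypothetical, introduced only for this example.

```python
from collections import defaultdict


def _members(assignment):
    """Invert {doc: label} into {label: set of docs}."""
    groups = defaultdict(set)
    for doc, label in assignment.items():
        groups[label].add(doc)
    return groups


def bcubed(system, gold):
    """B-cubed precision, recall and F, averaged over documents.

    system, gold: dicts mapping the same document ids to a cluster label
    (hard clustering is assumed).
    """
    sys_groups, gold_groups = _members(system), _members(gold)
    precision = recall = 0.0
    for doc in system:
        c = sys_groups[system[doc]]   # system cluster containing doc
        l = gold_groups[gold[doc]]    # human cluster containing doc
        overlap = len(c & l)
        precision += overlap / len(c)
        recall += overlap / len(l)
    n = len(system)
    precision, recall = precision / n, recall / n
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


def purity(clusters, categories):
    """Purity of `clusters` with respect to `categories`.

    Calling purity(gold, system) gives Inverse Purity.
    """
    cluster_groups = _members(clusters)
    category_groups = _members(categories)
    matched = sum(
        max(len(c & l) for l in category_groups.values())
        for c in cluster_groups.values()
    )
    return matched / len(clusters)


# Toy data (made up): two real persons A and B, two system clusters 1 and 2.
gold = {"d1": "A", "d2": "A", "d3": "A", "d4": "B", "d5": "B"}
system = {"d1": 1, "d2": 1, "d3": 2, "d4": 2, "d5": 2}

p, r, f = bcubed(system, gold)
print(f"B-cubed P={p:.3f} R={r:.3f} F={f:.3f}")   # P = R = F = 11/15
print(f"Purity={purity(system, gold):.2f}")        # 4/5
print(f"Inverse Purity={purity(gold, system):.2f}")  # 4/5
```

In the toy data, document d3 is the only error: it sits with B's pages in the system output, which lowers its own B-cubed precision and recall to 1/3 and also penalizes the documents that share a cluster with it.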
Which Metrics to Use
• Constraints (borrowed from (Amigo+, 09))
• Cluster size vs. quantity
– A small error in a big cluster: better!
– (A large number of) small errors in small clusters: worse!

Which Metrics to Use
• [Figures borrowed from (Amigo+, 09)]

Baselines

P-IP vs. B-Cubed: for Practical Data
• The Purity / Inverse Purity measure is not appropriate in the soft-clustering case
– It gives very high scores to a "cheat" baseline clustering (COMBINED in the table)
• The B-cubed measure is appropriate in this case

Available Corpus
• John Smith Corpus (Bagga+, 98)
• 12 different people (Bekkerman+, 05)
• WePS corpus (Artiles+, 07) (Artiles+, 09)
– WePS-1
  • 79 person names (49 training + 30 test), 100 top pages for each
– WePS-2
  • 30 person names, 150 top pages for each

WePS (Web People Search) Workshops (Artiles+, 07) (Artiles+, 09)
• Evaluation campaigns for person name disambiguation (along with person attribute extraction)
• WePS-1
– with SemEval-2007
– 16 teams participated
• WePS-2
– with WWW-2009
– 17 teams participated

References
• (Amigo+, 09) Enrique Amigó, Julio Gonzalo, Javier Artiles, Felisa Verdejo, A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval, v.12 n.4, p.461-486, August 2009
• (Artiles+, 07) Javier Artiles, Julio Gonzalo, Satoshi Sekine, The SemEval-2007 WePS evaluation: establishing a benchmark for the web people search task, Proceedings of the 4th International Workshop on Semantic Evaluations, p.64-69, June 23-24, 2007, Prague, Czech Republic
• (Artiles+, 09) J. Artiles, J. Gonzalo, and S. Sekine, WePS 2 Evaluation Campaign: overview of the Web People Search Clustering Task, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
• (Bagga+, 98) Amit Bagga, Breck Baldwin, Entity-based cross-document coreferencing using the Vector Space Model, Proceedings of the 17th international conference on Computational linguistics, August 10-14, 1998, Montreal, Quebec, Canada
• (Balog+, 08) K. Balog, L. Azzopardi, and M. de Rijke, Personal name resolution of web people search, WWW2008 Workshop: NLP Challenges in the Information Explosion Era (NLPIX 2008), 2008
• (Balog+, 09) Krisztian Balog, Jiyin He, Katja Hofmann, Valentin Jijkoun, Christof Monz, Manos Tsagkias, Wouter Weerkamp and Maarten de Rijke, The University of Amsterdam at WePS2, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
• (Bekkerman+, 05) Ron Bekkerman, Andrew McCallum, Disambiguating Web appearances of people in a social network, Proceedings of the 14th international conference on World Wide Web, May 10-14, 2005, Chiba, Japan
• (Bollegala+, 06) Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka, Extracting key phrases to disambiguate personal name queries in web search, Proceedings of the Workshop on How Can Computational Linguistics Improve Information Retrieval?, July 23, 2006, Sydney, Australia
• (Bunescu+, 06) R. Bunescu and M. Pasca, Using encyclopedic knowledge for named entity disambiguation, Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 2006
• (Chen+, 09) Ying Chen, Sophia Yat Mei Lee and Chu-Ren Huang, PolyUHK: A Robust Information Extraction System for Web Personal Names, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
• (Chen+, 07) Ying Chen, James Martin, Towards Robust Unsupervised Personal Name Disambiguation, EMNLP-CoNLL 2007, pp. 190-198, 2007
• (Elmacioglu+, 07) Ergin Elmacioglu, Yee Fan Tan, Su Yan, Min-Yen Kan, Dongwon Lee, PSNUS: web people name disambiguation by simple clustering with rich features, Proceedings of the 4th International Workshop on Semantic Evaluations, p.268-271, June 23-24, 2007, Prague, Czech Republic
• (Fleischman+, 04) M.B. Fleischman and E.H. Hovy, Multi-Document Person Name Resolution, Proceedings of the Reference Resolution Workshop at the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Barcelona, Spain, 2004
• (Gooi+, 04) Chung H. Gooi, James Allan, Cross-Document Coreference on a Large Scale Corpus, HLT-NAACL 2004: Main Proceedings, pp. 9-16, 2004
• (Han+, 04) Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, Kostas Tsioutsiouliklis, Two supervised learning approaches for name disambiguation in author citations, JCDL 2004, pp. 296-305, 2004
• (Ikeda+, 09) M. Ikeda, S. Ono, I. Sato, M. Yoshida, and H. Nakagawa, Person Name Disambiguation on the Web by Two-Stage Clustering, 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009
• (Kalashnikov+, 08) Dmitri V. Kalashnikov, Rabia Nuray-Turan, Sharad Mehrotra, Towards breaking the quality curse: a web-querying approach to web people search, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore
• (Li+, 04) X. Li, P. Morie and D. Roth, Robust Reading: Identification and Tracing of Ambiguous Names, Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pp. 17-24, 2004
• (Malin, 05) Bradley Malin, Unsupervised name disambiguation via social network similarity, Workshop on Link Analysis, Counterterrorism, and Security, with SDM 2005
• (Murakami, 10) Hiroshi Ueda, Harumi Murakami, and Shoji Tatsumi, Suggesting Subject Headings using Web Information Sources, ... Conference on Agents and Artificial Intelligence (ICAART 2010), Volume 1: Artificial Intelligence, pp. 640-643, 2010
• (Mann+, 03) Gideon S. Mann, David Yarowsky, Unsupervised personal name disambiguation, Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, p.33-40, May 31, 2003, Edmonton, Canada
• (Nakagawa+, 03) H. Nakagawa and T. Mori, Automatic term recognition based on statistics of compound nouns and their components, Terminology, 9(2):201-219, 2003
• (Niu+, 04) Cheng Niu, Wei Li, Rohini K. Srihari, Weakly supervised learning for cross-document person name disambiguation supported by information extraction, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, p.597-es, July 21-26, 2004, Barcelona, Spain
• (Nuray-Turan+, 09) R. Nuray-Turan, Z. Chen, D. Kalashnikov, and S. Mehrotra, Exploiting web querying for web people search in weps2, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
• (On+, 07) B.-W. On and D. Lee, Scalable name disambiguation using multi-level graph partition, Proc. of the SIAM SDM Conf., Minneapolis, Minnesota, USA, 2007
• (Ono+, 08) Shingo Ono, Issei Sato, Minoru Yoshida, Hiroshi Nakagawa, Person name disambiguation in web pages using social network, compound words and latent topics, Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining, May 20-23, 2008, Osaka, Japan
• (Pedersen+, 05) Ted Pedersen, Amruta Purandare, Anagha Kulkarni, Name Discrimination by Clustering Similar Contexts, CICLing 2005, pp. 226-237, 2005
• (Resnick+, 94) Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, John Riedl, GroupLens: an open architecture for collaborative filtering of netnews, Proceedings of the 1994 ACM conference on Computer supported cooperative work, p.175-186, October 22-26, 1994, Chapel Hill, North Carolina, United States
• (Yoshida+, 10) Minoru Yoshida, Masaki Ikeda, Shingo Ono, Issei Sato, Hiroshi Nakagawa, Person name disambiguation by bootstrapping, SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 10-17, 2010
• (Wan+, 05) Xiaojun Wan, Jianfeng Gao, Mu Li, Binggong Ding, Person resolution in person search results: WebHawk, Proceedings of the 14th ACM international conference on Information and knowledge management, October 31-November 05, 2005, Bremen, Germany