Feng Zhang, Guang Qiu, Jiajun Bu*, Mingcheng Qu, Chun Chen
College of Computer Science, Zhejiang University, Hangzhou, China
Reporter: 洪紹祥
Adviser: 鄭淑真
Date: 2010/10/26

The textual advertising market is becoming a substantial source of Web revenue.
Contextual advertising has played an important role in it.
Relevance between content and ads leads users to click and browse the ads, and brings advertisers a potential increase in revenue.

The key step of contextual advertising
  Keyword extraction directly affects the accuracy of the advertising system.
  Research has been done on English keyword extraction.
  Little work exists on Chinese keyword extraction:
    1. The unique characteristics of the Chinese language
    2. The Internet and Web advertising market have only just started in China

News and email query extraction
  TFIDF
  The closed captioning of TV news
  Mail subject
Information extraction
  Extract phrases
  The extraction techniques adopted are different from keyword extraction.
Keyword extraction in the case of English
  Keyphrase Extraction Algorithm (KEA), three features:
    TFIDF
    Distance (number of words before the first occurrence / all words)
    Term frequency

Data process

Candidate selection criteria
  The length of a candidate is at least two words.
  When a candidate occurs in different places in the same document:
    1. The occurrences are considered identical.
    2. Their feature values are combined.

Building the classifier (using the C4.5 decision tree algorithm)
Feature selection
  Binary value
    Linguistic features: noun, verb, …
    Named entity: name, place, …
  Numeric value
    Length
      Length of the candidate
      Length of the document
      Number of sentences in the document
    Location
      First: (nth phrase / all phrases), (nth sentence / all sentences)
      Last: (nth phrase / all phrases), (nth sentence / all sentences)
    TFIDF
      Traditional
      log2(TF + 1)
      log2(IDF + 1)
    Information entropy
      H(x) = −(T/N) * log2(T/N)
    Diameter
      Last(nth phrase) − first(nth phrase)
      Last(nth sentence) − first(nth sentence)

Corpus construction
  Contains 2200 documents: 2000 for training and 100 for testing
  Labeling: submit the candidates in a document to Google
Performance measures
  Top-N = CorrectNum / TotalNum

Algorithm comparison experiment

Feature contribution experiment
  To analyze the influence of the other features

The experimental results show that our approach is promising and improves substantially on KEA and Yih's work, setting aside the difference in language. We attribute the superior performance to the features we select and the classification algorithm we adopt.
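
The numeric features listed for the classifier (location, damped TF and IDF, information entropy, diameter) and the Top-N measure can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation. The slides do not define T, N, CorrectNum or TotalNum, so the readings used here are assumptions: T is taken as the number of occurrences of a candidate, N as the total number of phrases in the document, CorrectNum as the number of labeled keywords among the top N ranked candidates, and TotalNum as the number of candidates in that top N. All function names are hypothetical.

```python
import math


def location(positions, total):
    """First and last relative positions (nth unit / all units).

    positions: 1-based indices of the phrases (or sentences) in which
    the candidate occurs; total: number of phrases (or sentences).
    """
    return min(positions) / total, max(positions) / total


def diameter(positions):
    """Last occurrence index minus first occurrence index."""
    return max(positions) - min(positions)


def damped_tf_idf(tf, idf):
    """The two log-damped values from the slides: log2(TF+1), log2(IDF+1).

    How the two values are combined is not stated in the slides,
    so they are returned separately.
    """
    return math.log2(tf + 1), math.log2(idf + 1)


def information_entropy(t, n):
    """H(x) = -(T/N) * log2(T/N).

    Assumption: t = occurrences of the candidate, n = all phrases.
    """
    if t == 0:
        return 0.0
    p = t / n
    return -p * math.log2(p)


def top_n_score(ranked, labeled, n):
    """Top-N = CorrectNum / TotalNum under the assumed reading above."""
    top = ranked[:n]
    if not top:
        return 0.0
    correct_num = sum(1 for candidate in top if candidate in labeled)
    return correct_num / len(top)
```

For a candidate that occurs in phrases 2, 5 and 9 of a 10-phrase document, `location([2, 5, 9], 10)` gives the first and last relative positions and `diameter([2, 5, 9])` gives the span between them. In practice these values would be collected into one feature vector per candidate and passed to the decision tree learner.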