Reporter: 洪紹祥 Adviser: 鄭淑真 Date:2010/10/26

advertisement
Feng Zhang, Guang Qiu, Jiajun Bu*, Mingcheng Qu, Chun Chen
College of Computer Science, Zhejiang University
Hangzhou, China



The textual advertising market is becoming a
substantial source of Web revenue.
Contextual advertising has played an
important role in it.
Relevance between content and ads leads users
to click and browse the ads and brings
advertisers a potential increase in revenue.

The key step of contextual advertising



Keyword extraction directly affects the accuracy of the
advertising system.
Research has been done on English keyword
extraction.
There is little existing work on Chinese
keyword extraction, for two reasons:
1. The unique characteristics of the Chinese language
2. The Internet and Web advertising market have just
started in China

News and email query extraction

TFIDF
 The closed captioning of TV news
 Mail subject

Information extraction

Extract phrases
 The extraction techniques adopted are different from keyword
extraction.

Keyword extraction in case of English

Keyphrase Extraction Algorithm (KEA)
 three features
 TFIDF
 Distance
 (number of words before the first occurrence / number of all words)
 Term frequency
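
The KEA-style features above can be sketched in a few lines. This is only an illustration under stated assumptions, not KEA's actual code: the function names `tfidf` and `first_occurrence_distance` are hypothetical, a document is assumed to be a list of tokens, and each candidate is a single token for simplicity.

```python
import math

def tfidf(phrase, doc, corpus):
    """TF x IDF: relative frequency in the document times
    -log2 of the share of corpus documents containing the phrase."""
    tf = doc.count(phrase) / len(doc)
    df = sum(1 for d in corpus if phrase in d)
    if df == 0:
        # Assumption: a phrase unseen in the corpus gets a score of 0.
        return 0.0
    return tf * -math.log2(df / len(corpus))

def first_occurrence_distance(phrase, doc):
    """Number of words before the first occurrence / number of all words."""
    return doc.index(phrase) / len(doc)

# Toy data (made up for illustration only).
doc = ["web", "ads", "keyword", "ads"]
corpus = [doc, ["web", "news"]]
print(tfidf("ads", doc, corpus))
print(first_occurrence_distance("ads", doc))
```

A small distance value means the phrase appears early in the document, which tends to favor phrases from titles and opening sentences.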

Data processing

Candidate selection criteria
1. The length of a candidate is at least two words.
2. A candidate that occurs in different places in the same
document is considered as the identical one, and its
feature values are combined.
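
The two selection criteria can be sketched as follows. The slides do not say how feature values are combined, so this sketch assumes the simplest option: all positions of a repeated candidate are collected and summarized as a count plus first and last position. The function name `merge_candidates` and the input format are hypothetical.

```python
from collections import defaultdict

def merge_candidates(occurrences):
    """occurrences: list of (tokens, position) pairs, one per appearance.
    Drops candidates shorter than two words and merges repeated ones."""
    positions = defaultdict(list)
    for tokens, position in occurrences:
        if len(tokens) < 2:
            continue  # criterion 1: at least two words
        positions[tuple(tokens)].append(position)  # criterion 2: same candidate
    # Assumption: "combining" means summarizing all positions of one candidate.
    return {
        phrase: {"count": len(pos), "first": min(pos), "last": max(pos)}
        for phrase, pos in positions.items()
    }

# Toy input (made up for illustration only).
occurrences = [
    (["keyword", "extraction"], 3),
    (["ads"], 5),
    (["keyword", "extraction"], 9),
]
print(merge_candidates(occurrences))
```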

Building the classifier (using the C4.5 decision tree algorithm)
 Feature selection.
 Binary Value
 Linguistic features.
 noun, verb …
 Named Entity.
 Name, place …
 Numeric Value
 Length.
 Length of the candidate
 Length of the document
 Sentence number of the document

Building the classifier (using the C4.5 decision tree algorithm)
 Feature selection.
 Location.
 First (nth phrase / all phrases), (nth sentence / all sentences)
 Last (nth phrase / all phrases), (nth sentence / all sentences)
 TFIDF.
 Traditional
 log2 (TF +1)
 log2 (IDF +1)
 Information entropy.
 H(x) = −(T/N)*log2(T/N)
 Diameter.
 Last (nth phrase) − first (nth phrase)
 Last (nth sentence) − first (nth sentence)
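
The numeric features on this slide can be sketched as below. The slides do not define T and N in the entropy formula, so they are simply passed in as arguments here; all function names are hypothetical, and positions are assumed to be phrase or sentence indices of each occurrence of a candidate.

```python
import math

def location_features(positions, total):
    """First and last occurrence as fractions (nth unit / all units)."""
    return min(positions) / total, max(positions) / total

def diameter(positions):
    """Last occurrence index minus first occurrence index."""
    return max(positions) - min(positions)

def log_tf(tf):
    """Smoothed term frequency: log2(TF + 1)."""
    return math.log2(tf + 1)

def log_idf(idf):
    """Smoothed inverse document frequency: log2(IDF + 1)."""
    return math.log2(idf + 1)

def entropy_term(t, n):
    """H(x) = -(T/N) * log2(T/N); defined as 0 when T is 0."""
    p = t / n
    return -p * math.log2(p) if p > 0 else 0.0

# Toy values (made up): a candidate seen at phrases 2, 5 and 8 out of 10.
positions = [2, 5, 8]
print(location_features(positions, 10), diameter(positions))
print(log_tf(3), entropy_term(1, 2))
```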

Corpus construction.

Contains 2200 documents
 2000 for training and 100 for testing

Labeling.
 Submit the candidates in a document to Google

Performance measures

Top-N = CorrectNum / TotalNum
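
A minimal sketch of the Top-N measure, under an assumed reading of the formula: CorrectNum is the number of top-N predicted keywords that appear in the labeled set, and TotalNum is the number of top-N predictions, both summed over all test documents. The slides do not spell this out, and the function name `top_n_score` is hypothetical.

```python
def top_n_score(ranked_by_doc, gold_by_doc, n):
    """ranked_by_doc: per-document candidate lists, best first.
    gold_by_doc: per-document sets of labeled keywords."""
    correct = 0
    total = 0
    for ranked, gold in zip(ranked_by_doc, gold_by_doc):
        top = ranked[:n]
        correct += sum(1 for candidate in top if candidate in gold)
        total += len(top)
    return correct / total if total else 0.0

# Toy data (made up for illustration only).
ranked = [["a", "b", "c"], ["d", "e"]]
gold = [{"a", "c"}, {"e"}]
print(top_n_score(ranked, gold, 2))
```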

Algorithm comparison experiment.

Feature contribution experiment.

Feature contribution experiment.

To analyze other features’ influences


The experimental results show that our
approach is promising and achieves a large
improvement over KEA and Yih’s work,
ignoring the difference in language.
We attribute the superior performance to the
appropriate features we select and the
classification algorithm we adopt.