University of Tehran
Database Research Group
Persian@CLEF 2008
Mono & Cross Language Experiments on Persian Text
Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Database Research Group
School of Electrical and Computer Engineering
University of Tehran
18 Sep 2008
Outline
Persian Language
Persian Test Collections
Hamshahri in CLEF 2008
UT Participants
Using Part of Speech Tagging in Persian Information Retrieval
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Local Cluster Analysis Using Part of Speech Tagging
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
Cross Language Experiments at Persian@CLEF 2008
Next Year
The Persian Language
A branch of the Indo-European languages
Official language of Iran, Afghanistan and Tajikistan
Its morphological analysis is comparatively difficult
The word “خبر” (news) has two plural forms:
• Persian rules: “خبرها”
• Arabic rules: “اخبار”
Some Processing Issues
Writing style issues:
e.g. “می شود” and “میشود” are the same word
e.g. “کتابها” and “کتاب ها” are the same word
Kasre (a short vowel that is normally not written):
e.g. “چراغ علی خانه را سوزاند” has two different meanings:
• CheraghAli burned the house
• Ali’s lantern burned the house
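The writing-style variants above can be collapsed to one surface form before indexing. The following is only a minimal sketch: it joins the prefix "mi" and the plural suffix "ha" to the neighbouring word whether they were separated by a space or by a zero-width non-joiner. The function name and the choice of the joined form as the canonical one are assumptions, not something the slides specify, and a real normalizer would need a lexicon to avoid wrong joins.

```python
import re

ZWNJ = "\u200c"      # zero-width non-joiner, often typed between a word and its affix
MI = "\u0645\u06cc"  # verbal prefix "mi" (meem + Farsi yeh)
HA = "\u0647\u0627"  # plural suffix "ha" (heh + alef)

# "mi" at the start of a word, then spaces and/or ZWNJ, then another letter.
_MI_GAP = re.compile(rf"\b{MI}[ {ZWNJ}]+(?=\w)")
# Spaces and/or ZWNJ between a word and a word-final "ha".
_HA_GAP = re.compile(rf"(?<=\w)[ {ZWNJ}]+{HA}\b")


def normalize_spacing(text: str) -> str:
    """Join detached 'mi' prefixes and 'ha' suffixes to their word (assumed policy)."""
    text = _MI_GAP.sub(MI, text)
    text = _HA_GAP.sub(HA, text)
    return text
```

With this policy both spellings of each example map to the same string, so they match at retrieval time. The kasre ambiguity is not addressed by this kind of normalization.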
Some Processing Issues
Encoding
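The slide names encoding as an issue without giving details. One commonly encountered case in Persian text, offered here as an assumption about what is meant, is that the letters yeh and kaf may be stored either with Arabic code points or with their Persian counterparts, so visually identical words fail to match. A minimal sketch that folds the Arabic code points into the Persian ones:

```python
# Arabic-block code points that often stand in for Persian letters.
# The mapping direction (Arabic -> Persian) is an assumed convention.
_FOLD = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def fold_characters(text: str) -> str:
    """Map Arabic yeh/kaf to their Persian forms; other characters are unchanged."""
    return text.translate(_FOLD)
```

The same folding has to be applied to both documents and queries for it to help.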

Persian in the Middle East
December 31, 2007
Source: Internet World Stats, http://internetworldstats.com/
User Population Growth on the Web (2000-2009)
Persian Test Collections
IR domain
Ghavanin (domain specific)
Hamshahri (news), Web: http://ece.ut.ac.ir/dbrg/hamshahri
NLP domain
Bijankhan (2 million words), Web: http://ece.ut.ac.ir/dbrg/bijankhan
Hamshahri in CLEF 2008
News articles from the Hamshahri newspaper, 1996 to 2002
Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB)
22 assessors
Evaluation based on the DIRECT system
Hamshahri in CLEF 2008
Collection size: 564 MB (Unicode text)
No. of documents: 166,774
No. of unique terms: 417,339
Average document length: 380 terms
No. of categories: 9
No. of topics: 50 (bilingual)
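Statistics of this kind (number of documents, unique terms, average length) can be computed in one pass over a tokenized collection. This is only an illustrative sketch; the function name and the assumption that documents are already tokenized are not from the slides.

```python
from typing import Dict, Iterable, List


def collection_stats(docs: Iterable[List[str]]) -> Dict[str, float]:
    """Count documents, unique terms and average length over tokenized documents."""
    n_docs = 0
    n_tokens = 0
    vocabulary = set()
    for tokens in docs:
        n_docs += 1
        n_tokens += len(tokens)
        vocabulary.update(tokens)
    return {
        "documents": n_docs,
        "unique_terms": len(vocabulary),
        "avg_length": n_tokens / n_docs if n_docs else 0.0,
    }
```

The counts depend on the tokenizer and normalization used, so different preprocessing will give different figures for the same collection.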
Implementation of our methods
We submitted top 100 for each run
Using Part of Speech Tagging in Persian Information Retrieval
Reza Karimpour, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
[Figure: system architecture]
Documents: Hamshahri corpus -> POS tagging (Bijankhan tagged collection as training data) -> Hamshahri tagged document collection -> simple stemming -> stemmed and tagged corpus -> retrieval
Queries: user query -> POS tagging (Bijankhan tagged collection as training data) -> query refinement (parts of speech with corresponding weights) -> simple stemming -> stemmed and tagged queries -> retrieval
Using Part of Speech Tagging in Persian Information Retrieval
Config. | Corpus | Query
1 | Tagged | Title with equal weighting for all POS tags
2 | Stemmed and tagged | Stemmed title with equal weighting for all POS tags
3 | Stemmed | Stemmed title without POS tagging
4 | Stemmed | Stemmed title plus description
5 | Stemmed (stop words removed) | Stemmed title plus description (stop words removed)
6 | Tagged | Title plus description with equal weighting for all POS tags
7 | Tagged | Title with various weighting schemes for different POS tags
8 | Normal | Title (neither stemmed nor tagged)
Using Part of Speech Tagging in Persian Information Retrieval
POS weights | Average precision | R-Precision
20 least-used tags omitted, others equal weight | 0.2745 | 0.3097
Noun=3, Verb=2, Adj=1, Adv=1 | 0.2635 | 0.3104
Noun=3, Verb=0, Adj=3, Adv=0 | 0.2597 | 0.2888
Noun=0, Verb=2, Adj=0, Adv=0 | 0.1108 | 0.1256
Noun=0, Verb=0, Adj=1, Adv=0 | 0.1198 | 0.1186
Noun=0, Verb=0, Adj=0, Adv=1 | 0.0977 | 0.1111
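A POS-based weighting such as Noun=3, Verb=2, Adj=1, Adv=1 can be applied by giving each query term the weight of its tag and dropping terms whose tag has weight 0. The sketch below is an assumption about how such a scheme might be wired up. The tag names, function names and the simple weighted term-frequency score are hypothetical and are not the retrieval model used in the experiments.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical tag names; the weights mirror one scheme from the slide.
POS_WEIGHTS = {"N": 3.0, "V": 2.0, "ADJ": 1.0, "ADV": 1.0}


def weight_query(tagged_query: Iterable[Tuple[str, str]],
                 pos_weights: Dict[str, float]) -> Dict[str, float]:
    """Turn (term, tag) pairs into term weights; unknown or zero-weight tags are dropped."""
    weights: Dict[str, float] = {}
    for term, tag in tagged_query:
        w = pos_weights.get(tag, 0.0)
        if w > 0:
            weights[term] = weights.get(term, 0.0) + w
    return weights


def score(doc_tokens: List[str], weighted_query: Dict[str, float]) -> float:
    """Toy score: sum of query weight times term frequency in the document."""
    tf = Counter(doc_tokens)
    return sum(w * tf[t] for t, w in weighted_query.items())
```

For example, `weight_query([("election", "N"), ("hold", "V")], POS_WEIGHTS)` weights the noun above the verb, so documents matching the noun are favoured.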
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Zahra Aghazade, Nazanin Dehghani, Leili Farzinvash, Razieh Rahimi, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Weighting Model | Description
BB2 | Bose-Einstein model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
BM25 | The BM25 probabilistic model
DFR_BM25 | The DFR version of BM25
IFB2 | Inverse Term Frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expB2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expC2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization with natural logarithm
InL2 | Inverse document frequency model for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
PL2 | Poisson estimation for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
TF_IDF | The tf*idf weighting function, where tf is given by Robertson's tf and idf is given by the standard Sparck Jones' idf
Terrier Open Source Retrieval Engine: http://ir.dcs.gla.ac.uk/terrier/
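To make one of the listed models concrete, here is a sketch of a BM25 term score in a common textbook form. It is not Terrier's implementation: the logarithm base, the idf variant and the defaults k1=1.2 and b=0.75 are assumptions chosen for illustration.

```python
import math


def bm25_term_score(tf: float, df: int, n_docs: int,
                    doc_len: float, avg_doc_len: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Contribution of one query term to a document's BM25 score (textbook form)."""
    if tf <= 0:
        return 0.0
    # Robertson/Sparck Jones style idf; it goes negative when df > n_docs / 2.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Longer-than-average documents are penalised, controlled by b.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
```

A document's score for a query is the sum of this value over the query terms. The saturation in tf means a second occurrence of a term adds less than the first.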
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Weighting Model | Average Precision | R-Precision
BB2 | 0.3854 | 0.4167
BM25 | 0.3562 | 0.4009
DFR_BM25 | 0.4006 | 0.4347
IFB2 | 0.4017 | 0.4328
In_expB2 | 0.3997 | 0.4329
In_expC2 | 0.4190 | 0.4461
InL2 | 0.3832 | 0.4200
PL2 | 0.4314 | 0.4548
TF_IDF | 0.3574 | 0.4017
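For reference, the two measures reported here can be computed per topic as follows. This is a standard formulation written as a small sketch; it assumes each ranked list contains no duplicate documents, and the figures in the table are averages over all topics.

```python
from typing import Sequence, Set


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant document retrieved."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision after R documents, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r
```

Averaging `average_precision` over the topics gives the mean average precision (MAP) used elsewhere in these slides.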
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
The OWA (Ordered Weighted Averaging) operator, and two other variations of this operator: IOWA and NOWA
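The basic OWA idea can be sketched as follows: for each document, the scores it received from the individual retrieval models are sorted in descending order and combined with a fixed weight vector, so the weights attach to score positions rather than to particular models. This is a minimal sketch of plain OWA only. The min-max normalization, the zero score for documents missing from a run, and the example weights are assumptions, and the IOWA and NOWA variants are not reproduced here.

```python
from typing import Dict, List, Sequence


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one run's scores to [0, 1] (assumed normalization)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def owa(values: Sequence[float], weights: Sequence[float]) -> float:
    """Ordered weighted average: weights are applied to values sorted descending."""
    if len(values) != len(weights):
        raise ValueError("need one weight per value")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def fuse(runs: List[Dict[str, float]], weights: Sequence[float]) -> List[str]:
    """Merge several runs into one ranking by OWA over normalized scores."""
    normalized = [min_max(run) for run in runs]
    docs = set().union(*normalized)
    # A document missing from a run gets score 0.0 for that run (assumption).
    fused = {d: owa([run.get(d, 0.0) for run in normalized], weights) for d in docs}
    return sorted(fused, key=lambda d: (-fused[d], d))
```

With weights concentrated on the first positions the operator behaves like a maximum; with uniform weights it reduces to the plain average of the normalized scores.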
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Post hoc Results
Retrieval Method | Toolkit | Average Precision | R-Precision
TF_IDF with unstemmed single terms | Terrier | 0.3847 | 0.4122
PL2 with 4gram terms | Terrier | 0.3669 | 0.3939
Indri with stemmed terms | Lemur | 0.3955 | 0.4149
IOWA | - | 0.4515 | 0.4708
NOWA | - | 0.4522 | 0.4736
Dif (average precision, in points, over the best single run): IOWA +5.6, NOWA +5.67
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
Amir Hossein Jadidinejad, Mitra Mohtarami, Hadi Amiri
[Figure: system architecture]
Preprocessing: Bijankhan collection (training and test sets) -> POS taggers (MLE and TNT) -> Hamshahri clear collection tagged by MLE and by TNT -> content-less tag removal -> useful tags
Retrieval: retrieval engine -> retrieved results
Post-processing: cluster analysis (clustering of the retrieved results) -> relevant cluster / irrelevant cluster -> reranked results
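The post-processing step (cluster the retrieved results, separate a relevant cluster from an irrelevant one, rerank) could look roughly like the sketch below. The slides do not say which clustering algorithm was used or how the relevant cluster is chosen, so both are assumptions here: cluster labels are taken as given, and the relevant cluster is the one holding most of the top-k documents of the initial ranking.

```python
from collections import Counter
from typing import Dict, List


def pick_relevant_cluster(ranked: List[str], cluster_of: Dict[str, int],
                          top_k: int = 10) -> int:
    """Assumed heuristic: the cluster that contains most of the top-k documents."""
    counts = Counter(cluster_of[d] for d in ranked[:top_k])
    # Ties are broken by the smaller cluster label to keep the result deterministic.
    return min(counts, key=lambda c: (-counts[c], c))


def rerank(ranked: List[str], cluster_of: Dict[str, int], top_k: int = 10) -> List[str]:
    """Move documents of the relevant cluster ahead of the rest, keeping original order."""
    if not ranked:
        return []
    relevant = pick_relevant_cluster(ranked, cluster_of, top_k)
    head = [d for d in ranked if cluster_of[d] == relevant]
    tail = [d for d in ranked if cluster_of[d] != relevant]
    return head + tail
```

Within each group the original retrieval order is preserved, so the reranking only changes the relative position of the two clusters.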
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
However, the results on the test set were not good
Cross Language Experiments at Persian@CLEF 2008
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian
Run | tot-ret | rel-ret | MAP | Retrieval Model | Tool
Using Light Stemmer | 5161 | 1967 | 26.89 | Vector Space | Lucene
Without Stemmer | 5161 | 1991 | 27.08 | Vector Space | Lucene
3Grams | 5161 | 1901 | 26.07 | Language Modeling | Lemur
4Grams | 5161 | 1950 | 26.70 | Language Modeling | Lemur
5Grams | 5161 | 1983 | 27.13 | Language Modeling | Lemur
Term-Based | 5161 | 2035 | 28.14 | Language Modeling | Lemur
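The 3Grams, 4Grams and 5Grams runs index character n-grams instead of whole terms. A minimal sketch of such a tokenizer is shown below. Whether the original runs let n-grams cross word boundaries, and how they treated words shorter than n, is not stated, so keeping n-grams within words and emitting short words unchanged are assumptions.

```python
from typing import List


def char_ngrams(word: str, n: int) -> List[str]:
    """Overlapping character n-grams of one word; short words are returned whole."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_tokens(text: str, n: int) -> List[str]:
    """Split on whitespace, then replace each word by its character n-grams."""
    return [gram for word in text.split() for gram in char_ngrams(word, n)]
```

For example, `char_ngrams("hamshahri", 4)` yields six overlapping 4-grams starting with `"hams"` and ending with `"ahri"`. The same tokenizer has to be applied to queries and documents.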
Cross Language Experiments at Persian@CLEF 2008
Query Translation
Probabilistic Structured Queries (PSQ)
Combinatorial Translation Probability (CTP)
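In the PSQ approach as generally described in the literature, the term frequency of a source-language query term in a document is estimated as a sum over its candidate translations, each weighted by its translation probability. The sketch below illustrates only that estimate. The function name and the probabilities in the usage note are made up, and the slides do not show how CTP derives the probabilities.

```python
from typing import Mapping


def psq_term_frequency(translations: Mapping[str, float],
                       doc_tf: Mapping[str, int]) -> float:
    """Expected term frequency of a query term, given p(translation | query term)."""
    return sum(p * doc_tf.get(target, 0) for target, p in translations.items())
```

For instance, with hypothetical translations `{"a": 0.7, "b": 0.3}` and document counts `{"a": 2, "b": 1}`, the estimated frequency is 0.7 * 2 + 0.3 * 1 = 1.7. The estimate can then be fed into a monolingual weighting model in place of an ordinary term frequency.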

Cross Language Experiments at Persian@CLEF 2008
Query Translation Results
[Figure: precision-recall curves; precision axis from 0 to 0.6, recall at 11 points]
All Meanings; MAP 6.73
First Meaning; MAP 12.4
PSQ_CTP+4Grams; MAP 14.46
Cross Language Experiments at Persian@CLEF 2008
Document Translation
Using the Shiraz machine translation system from CRL (Computing Research Laboratory) of NMSU (New Mexico State University)
Took 10 days to translate 130,000+ docs from Persian to English
Cross Language Experiments at Persian@CLEF 2008
Document Translation & Hybrid Results
[Figure: precision-recall curves; precision axis from 0 to 0.9, recall from 0 to 1]
Document Translation; MAP 12.88
Monolingual; MAP 27.08
Query Translation; MAP 14.46
Hybrid; MAP 16.19
Next Year
Ham2 for the Next Year
Extended Version of Hamshahri Collection
2 times larger (~1.5 GB)
<DOC>
<DOCID>HAM2-851011-001</DOCID>
<DOCNO>HAM2-851011-001</DOCNO>
<ORIGINALFILE>/1385/851011/news/_adabh.htm</ORIGINALFILE>
<ISSUE>دوشنبه 11 دي 1385 - سال چهاردهم - شماره 4172 - Jan 1, 2007</ISSUE>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">ادب و هنر</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<TITLE>
<![CDATA[مدیركل كتاب و كتابخواني وزارت فرهنگ و ارشاد اسلامي خبر داد
آیین نامه خرید كتاب اصلاح شد]]>
</TITLE>
<TEXT>
<image>/1385/851011/news/008505.jpg</image>
<![CDATA[
فارس: مدیر كل كتاب و كتاب خواني وزارت فرهنگ و ارشاد اسلامي گفت: آیین نام
</TEXT>
</DOC>
<DOC>
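A record in this format can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened, well-formed stand-in for the sample, because the sample on the slide is truncated. The placeholder title and body text and the returned field names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened stand-in record; title and body are placeholders, not real data.
SAMPLE = """<DOC>
<DOCID>HAM2-851011-001</DOCID>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">\u0627\u062f\u0628 \u0648 \u0647\u0646\u0631</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<TITLE><![CDATA[placeholder title]]></TITLE>
<TEXT>
<image>/1385/851011/news/008505.jpg</image>
<![CDATA[placeholder body]]>
</TEXT>
</DOC>"""


def parse_doc(xml_text: str) -> dict:
    """Extract id, date, categories by language, title and body from one <DOC>."""
    root = ET.fromstring(xml_text)
    text_el = root.find("TEXT")
    body = ""
    if text_el is not None:
        # Body text sits around the <image> child, so collect text and tails only.
        parts = [text_el.text or ""] + [child.tail or "" for child in text_el]
        body = "".join(parts).strip()
    return {
        "docid": root.findtext("DOCID", default="").strip(),
        "date": root.findtext("DATE", default="").strip(),
        "categories": {c.get(XML_LANG): (c.text or "").strip()
                       for c in root.findall("CAT")},
        "title": root.findtext("TITLE", default="").strip(),
        "body": body,
    }
```

Calling `parse_doc(SAMPLE)` returns a dictionary whose `categories` entry maps `"en"` to `"Literature and Art"`. A full collection file would contain many such records and would need to be split into individual `<DOC>` elements first.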
Questions?
Thanks For Your Attention
Database Research Group
http://ece.ut.ac.ir/dbrg