University of Tehran Database Research Group

Persian@CLEF 2008: Mono & Cross Language Experiments on Persian Text
Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Database Research Group
School of Electrical and Computer Engineering
University of Tehran
18 Sep 2008

Outline
Persian Language
Persian Test Collections
Hamshahri in CLEF 2008
UT Participants
  • Using Part of Speech Tagging in Persian Information Retrieval
  • Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
  • Local Cluster Analysis Using Part of Speech Tagging
  • Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
  • Cross Language Experiments at Persian@CLEF 2008
Next Year

The Persian Language
A branch of the Indo-European languages
Official language of Iran, Afghanistan and Tajikistan
Its morphological analysis is comparatively difficult
The word "خبر" ("news") has two plural forms:
  • Persian rules: "خبرها"
  • Arabic rules: "اخبار"

Some Processing Issues
Writing style issues:
  • e.g. "می شود" and "میشود" ("becomes") are the same
  • e.g. "کتابها" and "کتاب ها" ("books") are the same
KASRE (the short vowel -e, usually not written):
  e.g. "چراغ علی خانه را سوزاند" has two different meanings:
  • CheraghAli burned the house
  • Ali's lantern burned the house
Encoding

Persian in the Middle East
December 31, 2007
User Population Growth on the Web (2000-2009)
Source: Internet World Stats, http://internetworldstats.com/

Persian Test Collections
IR Domain
  • Ghavanin (domain specific)
  • Hamshahri (news)
  • WEB: http://ece.ut.ac.ir/dbrg/hamshahri
NLP Domain
  • Bijankhan (2 million words)
  • WEB: http://ece.ut.ac.ir/dbrg/bijankhan

Hamshahri in CLEF 2008
News articles of the Hamshahri newspaper from 1996 to 2002
Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB)
22 assessors
Evaluation based on the DIRECT system

Collection size                564 MB (Unicode text)
No. of documents               166,774
No. of unique terms            417,339
Average length of documents    380 terms
No. of categories              9
No. of topics                  50 bilingual

Implementation of our methods
We submitted the top 100 results for each run

Using Part of Speech Tagging in Persian Information Retrieval
Reza Karimpour, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian

[Diagram: system architecture]
Documents: Hamshahri corpus -> POS tagging (Bijankhan tagged collection of documents as train data) -> Hamshahri tagged document collection -> simple stemming -> stemmed and tagged corpus -> retrieval
Queries: user query -> POS tagging (Bijankhan tagged collection of documents as train data) -> refine query parts of speech with corresponding weight -> simple stemming -> stemmed and tagged queries -> retrieval

Config.  Corpus                         Query
1        Tagged                         Title with equal weighting for all POS tags
2        Stemmed and tagged             Stemmed title with equal weighting for all POS tags
3        Stemmed                        Stemmed title without POS tagging
4        Stemmed                        Stemmed title plus description
5        Stemmed (stop words removed)   Stemmed title plus description (stop words removed)
6        Tagged                         Title plus description with equal weighting for all POS tags
7        Tagged                         Title with various weighting schemes for different POS tags
8        Normal                         Title (neither stemmed nor tagged)

POS tag weighting                                  Average precision   R-Precision
20 least used tags omitted, others equal weight    0.2745              0.3097
Noun=3, Verb=2, Adj=1, Adv=1                       0.2635              0.3104
Noun=3, Verb=0, Adj=3, Adv=0                       0.2597              0.2888
Noun=0, Verb=2, Adj=0, Adv=0                       0.1108              0.1256
Noun=0, Verb=0, Adj=1, Adv=0                       0.1198              0.1186
Noun=0, Verb=0, Adj=0, Adv=1                       0.0977              0.1111

Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Zahra Aghazade, Nazanin Dehghani, Leili Farzinvash, Razieh Rahimi, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian

Weighting Model   Description
BB2               Bose-Einstein model for randomness, the ratio of two Bernoulli processes for first normalization, and Normalization 2 for term frequency normalization
BM25              The BM25 probabilistic model
DFR_BM25          The DFR version of BM25
IFB2              Inverse Term Frequency model for randomness, the ratio of two Bernoulli processes for first normalization, and Normalization 2 for term frequency normalization
In_expB2          Inverse expected document frequency model for randomness, the ratio of two Bernoulli processes for first normalization, and Normalization 2 for term frequency normalization
In_expC2          Inverse expected document frequency model for randomness, the ratio of two Bernoulli processes for first normalization, and Normalization 2 for term frequency normalization with natural logarithm
InL2              Inverse document frequency model for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
PL2               Poisson estimation for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
TF_IDF            The tf*idf weighting function, where tf is given by Robertson's tf and idf is given by the standard Sparck Jones idf

Terrier Open Source Retrieval Engine: http://ir.dcs.gla.ac.uk/terrier/

Weighting Model   Average Precision   R-Precision
BB2               0.3854              0.4167
BM25              0.3562              0.4009
DFR_BM25          0.4006              0.4347
IFB2              0.4017              0.4328
In_expB2          0.3997              0.4329
In_expC2          0.4190              0.4461
InL2              0.3832              0.4200
PL2               0.4314              0.4548
TF_IDF            0.3574              0.4017

And two other variations of this operator: IOWA and NOWA

Post hoc Results

Retrieval Method                      Toolkit   Average Precision   R-Precision
TF_IDF with unstemmed single terms    Terrier   0.3847              0.4122
PL2 with 4gram terms                  Terrier   0.3669              0.3939
Indri with stemmed terms              Lemur     0.3955              0.4149
IOWA                                            0.4515              0.4708
NOWA                                            0.4522              0.4736

Dif: +5.6, +5.67 (these values match the average precision gains of IOWA and NOWA over the best single method, Indri, in percentage points: 0.4515 - 0.3955 and 0.4522 - 0.3955)

Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
Amir Hossein Jadidinejad, Mitra Mohtarami, Hadi Amiri

[Diagram: system architecture]
Preprocessing: Bijankhan collection (training and test) -> POS tagger (MLE and TNT) -> Hamshahri clear collection -> Hamshahri collection tagged by MLE and by TNT -> content-less tag removal -> useful tags
Post processing: retrieval engine -> retrieved results -> clustering (relevant cluster, irrelevant cluster) -> cluster analysis -> reranked results

But the results were not good on the test set

Cross Language Experiments at Persian@CLEF 2008
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian

Run                   tot-ret   rel-ret   MAP     Retrieval Model     Tool
Using Light Stemmer   5161      1967      26.89   Vector Space        Lucene
Without Stemmer       5161      1991      27.08   Vector Space        Lucene
3Grams                5161      1901      26.07   Language Modeling   Lemur
4Grams                5161      1950      26.70   Language Modeling   Lemur
5Grams                5161      1983      27.13   Language Modeling   Lemur
Term-Based            5161      2035      28.14   Language Modeling   Lemur

Query Translation
  • Probabilistic Structured Queries (PSQ)
  • Combinatorial Translation Probability (CTP)

Query Translation Results
[Figure: precision-recall curves]
  • All Meanings: MAP 6.73
  • First Meaning: MAP 12.4
  • PSQ_CTP+4Grams: MAP 14.46

Document Translation
Using the Shiraz machine translation system from CRL of NMSU
Took 10 days to translate 130,000+ docs from Persian to English

Document Translation & Hybrid Results
[Figure: precision-recall curves]
  • Document Translation: MAP 12.88
  • Monolingual: MAP 27.08
  • Query Translation: MAP 14.46
  • Hybrid: MAP 16.19

Next Year
Ham2 for the Next Year
Extended version of the Hamshahri collection
2 times larger (~1.5 GB)

<DOC>
<DOCID>HAM2-851011-001</DOCID>
<DOCNO>HAM2-851011-001</DOCNO>
<ORIGINALFILE>/1385/851011/news/_adabh.htm</ORIGINALFILE>
<ISSUE>دوشنبه 11 دي 1385- سال چهاردهم- شماره 4172- Jan 1, 2007</ISSUE>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">ادب و هنر</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<TITLE>
<![CDATA[مدیركل كتاب و كتابخواني وزارت فرهنگ و ارشاد اسلامي خبر داد آیین نامه خرید كتاب اصلاح شد]]>
</TITLE>
<TEXT>
<image>/1385/851011/news/008505.jpg</image>
<![CDATA[فارس: مدیر كل كتاب و كتاب خواني وزارت فرهنگ و ارشاد اسلامي گفت: آیین نام ...]]>
</TEXT>
</DOC>
<DOC>
...

Questions?
Thanks For Your Attention
Database Research Group
http://ece.ut.ac.ir/dbrg
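Note on the writing-style issue listed under Some Processing Issues ("می شود" and "میشود" are the same; "کتابها" and "کتاب ها" are the same): a minimal Python sketch of one way to make such variants compare equal before indexing. This is an illustration only, not the preprocessing used in the runs reported here. The function name `match_key`, the choice to handle only the prefix "می" and the suffix "ها", and the treatment of the zero-width non-joiner as a separator are all assumptions made for the example.

```python
import re

ZWNJ = "\u200c"          # zero-width non-joiner, often typed instead of a space
MI = "\u0645\u06cc"      # the verbal prefix "می" (hypothetical rule, see lead-in)
HA = "\u0647\u0627"      # the plural suffix "ها" (hypothetical rule, see lead-in)


def match_key(text: str) -> str:
    """Return a spacing-insensitive key for comparing Persian strings."""
    # Treat ZWNJ like a space, then collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text.replace(ZWNJ, " ")).strip()
    # Attach a standalone prefix "می" to the word that follows it.
    text = re.sub(rf"(^| ){MI} ", rf"\1{MI}", text)
    # Attach a standalone suffix "ها" to the word that precedes it.
    text = re.sub(rf" {HA}( |$)", rf"{HA}\1", text)
    return text


if __name__ == "__main__":
    # The two spellings from the slide map to the same key.
    print(match_key("می شود") == match_key("میشود"))
    print(match_key("کتاب ها") == match_key("کتابها"))
```

The sketch joins the affix to its neighbour rather than inserting a separator, so both spellings collapse to the unspaced form. It does not unify Arabic and Persian code points for the same letter (for example the two forms of yeh), and it will also join a standalone word that happens to be spelled like the prefix, so a real normalizer would need a lexicon or a tagger to decide.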