Topic Clustering of Stemmed Transcribed Arabic Broadcast News

Authors
• Ahmed Abdelaziz Jafar (O6U)
• Prof. Mohamed Waleed Fakhr (AAST)
• Prof. Mohamed Hesham Farouk (Cairo University)

Outline
• Motivations
• Challenges
• Objectives
• Research Procedure
• Experimental Results
• Conclusion

Motivations
• Why topic clustering of transcribed broadcast news:
  – The amount of audible news broadcast on TV channels, radio stations, and the Internet is growing rapidly.
  – This rapid growth demands reliable and fast techniques to organize and store these vast amounts of news in order to facilitate future processing.

Motivations (cont’d)
• Why transcribe news:
  – News is important, so archiving it is also important.
  – Prepared news stories are carefully edited and highly structured.
• Why the Arabic language:
  – Arabic is one of the six most widely spoken languages in the world.
  – Automatic transcription and processing of Arabic documents is an active research field due to its complex morphological nature.

Challenges
• Speech transcription challenges:
  – Transcription errors, including:
    – Word deletion errors
    – Word insertion errors
    – Word misidentification (substitution) errors
    – Minor spelling errors
  – Such errors result mainly from drawbacks of the ASR system.

Challenges (cont’d)
• Speech transcription challenges:
  – Grammatical errors:
    – Use of grammatically incorrect sentences.
    – A common problem in conversational speech.
  – Out-of-vocabulary (OOV) problem:
    – Presence of unknown words that appear in the speech but not in the recognition vocabulary of the ASR.
    – The continuous growth of natural languages is the main cause of this problem.
  – Combinations of the previously mentioned problems.

Objectives
• Achieve automatic topic clustering of transcribed speech documents.
• Overcome the negative effect of some of the transcription errors by using stemming techniques with the aid of a Chi-square-based similarity measure.

Research Procedure
• Transcription process: Audio Files → ASR System → Transcribed Documents
• Preprocessing steps: Tokenization → Stop Words Removal → Words Formatting → Stemming → Weighted Matrix Construction
• Clustering-based topic identification: Clustering Algorithm + Similarity Measure → N topics identified

Research Procedure (cont’d)
• Transcription process (Audio Files → ASR System → Transcribed Speech Documents)
• Dragon Dictation
  – Free application made by Nuance Communications, currently available only on the iOS platform.
  – Speaker-independent recognizer that supports many languages, including Arabic.
  – Open-domain recognizer, hence it does not require writing a grammar.

Research Procedure (cont’d)
• Transcription process
• ASR system output (transcribed documents)
  – 1000 transcribed news stories:
    – collected from the broadcasts of various Arabic news networks: Al-Jazeera, Al-Arabiya, and BBC Arabic.
    – divided into five general topics: arts and culture, economics, politics, science, and sports.
  – The average length of the original audible news story is about two minutes.

Research Procedure (cont’d)
• Transcription process
• ASR system evaluation
  – The ASR system is evaluated using the Word Error Rate (WER), which is commonly used to measure speech recognition accuracy. It is based on the frequency of occurrence of three types of errors: substitutions, insertions, and deletions.
  – WER is calculated as follows:

      WER = (S + I + D) / N × 100%

    where S, I, and D are the numbers of substitutions, insertions, and deletions, and N is the number of words in the reference transcript.
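As a quick illustration of this formula, a minimal sketch in Python (the helper name is mine, not from the slides; the counts are those reported on the evaluation slide that follows):

```python
# Minimal sketch of the WER computation described on the slide above.
def word_error_rate(substitutions, insertions, deletions, reference_words):
    """WER = (S + I + D) / N, expressed as a percentage."""
    return 100.0 * (substitutions + insertions + deletions) / reference_words

# Counts from the ASR evaluation slide: 17327 substitutions, 2105 insertions,
# 574 deletions over 68720 reference words.
print(word_error_rate(17327, 2105, 574, 68720))  # ~29.11%
```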
Research Procedure (cont’d)
• Transcription process
• ASR system evaluation using WER

  #Reference words | #Substitutions | #Insertions | #Deletions | WER
  68720            | 17327          | 2105        | 574        | 29.11%

Research Procedure (cont’d)
• Preprocessing
• Impact of the stop-words-removal step on the transcription errors

  #Reference words | #Substitutions | #Insertions | #Deletions | WER
  30040            | 3607           | 1831        | 766        | 20.65%

Research Procedure (cont’d)
• Preprocessing: Words Formatting
  – Unify all the different shapes of the same letter into one form.
  – Also remove some unwanted suffixes (و, ا, وا) in order to fine-tune the input for the stemming step.

Research Procedure (cont’d)
• Preprocessing: used stemming techniques
  – Light stemming: does not deal with patterns or infixes; it is simply the process of stripping off prefixes and/or suffixes.
  – Root-based stemming: removes suffixes, infixes, and prefixes and uses pattern matching to extract the roots.
  – Rule-based light stemming: a hybrid technique between light stemming and root-based stemming.

Research Procedure (cont’d)
• Preprocessing: Weighted Matrix Construction
  – According to the Okapi method, the combined weight (CW) of each word in each document is calculated and arranged in a term–document weight matrix (an illustrative sketch of this weighting and of the similarity measures follows the Experiments slides below):

         D1   | D2   | D3   | … | Dj
    W1 | CW11 | CW12 | CW13 | … | CW1j
    W2 | CW21 | CW22 | CW23 | … | CW2j
    W3 | CW31 | CW32 | CW33 | … | CW3j
    …  | …    | …    | …    | … | …
    Wi | CWi1 | CWi2 | CWi3 | … | CWij

Research Procedure (cont’d)
• Clustering-based topic identification
  – Clustering algorithms:
    – Basic k-means algorithm
    – Spectral clustering algorithm (Shi–Malik)
  – Similarity measures:
    – Chi-square similarity measure
    – Cosine similarity measure
  – Output: N topics identified

Experimental Results
• Experiments: four test scenarios are evaluated:
  – without applying stemming
  – with light stemming applied
  – with root-based stemming applied
  – with rule-based stemming applied
• In each scenario, the dataset is divided into smaller subsets whose sizes range from 50 to 200 documents per topic category.

Experimental Results (cont’d)
• Experiments
  – The clustering algorithms are applied to all the subsets in each scenario twice per subset:
    – once with the Chi-square similarity measure,
    – and once with the popular cosine similarity.
  – The clustering accuracy is evaluated for each subset, and then the average accuracy is calculated over all the subsets.
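To make the weighting and similarity steps above concrete, here is a small sketch, not the study's actual implementation: it builds an Okapi/BM25-style combined-weight matrix and compares two documents with cosine similarity and a common chi-square-based similarity. The parameters k1 and b, the exact Okapi variant, the exact chi-square formulation, and all function names are assumptions, since the slides do not give the formulas.

```python
# Hedged sketch: Okapi-style term weighting plus cosine and chi-square
# document similarities. Parameter values and formulations are illustrative
# assumptions, not taken from the slides.
import numpy as np

def okapi_weights(tf, k1=1.2, b=0.75):
    """Build a weighted term-document matrix from raw term frequencies.

    tf: (n_terms, n_docs) array of raw counts (terms = stemmed words).
    Uses a standard BM25-style combined weight; the exact Okapi variant
    used in the study is not specified on the slides.
    """
    n_docs = tf.shape[1]
    df = np.count_nonzero(tf, axis=1)              # document frequency per term
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    doc_len = tf.sum(axis=0)
    norm = 1.0 - b + b * doc_len / doc_len.mean()  # document-length normalization
    return idf[:, None] * tf * (k1 + 1.0) / (tf + k1 * norm[None, :])

def cosine_sim(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

def chi_square_sim(x, y):
    """Chi-square-based similarity between two weighted document vectors.

    One common form: turn the chi-square distance between the normalized
    vectors into a similarity in (0, 1]. The exact formulation used in the
    research is assumed, not quoted.
    """
    p = x / (x.sum() + 1e-12)
    q = y / (y.sum() + 1e-12)
    d = 0.5 * np.sum((p - q) ** 2 / (p + q + 1e-12))  # chi-square distance
    return 1.0 / (1.0 + d)

# Toy usage: 5 stemmed terms, 3 documents.
tf = np.array([[3, 0, 1],
               [0, 2, 0],
               [1, 1, 4],
               [2, 0, 0],
               [0, 3, 1]], dtype=float)
cw = okapi_weights(tf)
print(cosine_sim(cw[:, 0], cw[:, 2]), chi_square_sim(cw[:, 0], cw[:, 2]))
```

Either similarity can then be used to build the affinity matrix consumed by k-means (via distances) or by the Shi–Malik spectral clustering algorithm.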
Experimental Results (cont’d)
• Results (transcribed documents): average accuracy

  Clustering approach / similarity measure | Non-stemmed | Light-stemmed | Root-stemmed | Rule-stemmed
  k-means / cosine                         | 39.42%      | 44.61%        | 54.41%       | 60.04%
  k-means / Chi-square                     | 44.3%       | 47.6%         | 56.5%        | 63.35%
  Spectral clustering / cosine             | 45.62%      | 50.96%        | 65.57%       | 71.33%
  Spectral clustering / Chi-square         | 46.5%       | 53.8%         | 68.9%        | 76.11%

Experimental Results (cont’d)
• Results (original documents): average accuracy

  Clustering approach / similarity measure | Non-stemmed | Light-stemmed | Root-stemmed | Rule-stemmed
  k-means / cosine                         | 62.2%       | 64.63%        | 68.06%       | 76.84%
  k-means / Chi-square                     | 65.9%       | 67.97%        | 72.84%       | 79.05%
  Spectral clustering / cosine             | 72.2%       | 74.97%        | 80.77%       | 85.15%
  Spectral clustering / Chi-square         | 74.87%      | 76.85%        | 82.74%       | 87.21%

Experimental Results (cont’d)
• Results evaluation
  – By comparing the accuracy results and observing the clustering confusion matrix for each clustering scenario on the original and the transcribed data, it is concluded that:
    – In both sets of data, there are documents causing clustering confusion.
    – The existence of topic overlaps in the original data is the main cause of such confusion.
    – The information loss due to the transcription errors increases the confusion even further in the transcribed data.

Experimental Results (cont’d)
• Results evaluation: confusion matrix samples
  – Sample 1: original text divided into subsets of 200 documents per topic; rule-based stemming applied; spectral clustering applied.
  – Sample 2: transcribed text divided into subsets of 200 documents per topic; rule-based stemming applied; spectral clustering applied.
  – (Columns are the true topic categories, 200 documents each; rows are the resulting clusters; the last column is the cluster size.)

  Sample 1 (original text)
  Cluster \ Topic | Arts | Economics | Politics | Science | Sports | Total
  Arts            | 170  | 21        | 2        | 16      | 9      | 218
  Economics       | 10   | 125       | 4        | 6       | 5      | 150
  Politics        | 13   | 33        | 193      | 10      | 7      | 256
  Science         | 6    | 15        | 1        | 167     | 2      | 191
  Sports          | 1    | 6         | 0        | 1       | 177    | 185

  Sample 2 (transcribed text)
  Cluster \ Topic | Arts | Economics | Politics | Science | Sports | Total
  Arts            | 156  | 25        | 3        | 21      | 17     | 222
  Economics       | 13   | 102       | 3        | 9       | 5      | 132
  Politics        | 21   | 39        | 191      | 15      | 11     | 277
  Science         | 8    | 22        | 2        | 149     | 7      | 188
  Sports          | 2    | 12        | 1        | 6       | 160    | 181

Experimental Results (cont’d)
• Experiments (Phase 2)
  – The fuzzy c-means algorithm and the possibilistic Gustafson–Kessel (GK) algorithm are applied to both the transcribed and the original data, and the membership matrix is analyzed to estimate the number of confusing documents in each topic.
  – A document is considered confusing to the clustering process (see the sketch after these result slides) if:
    – its membership degrees to all clusters are under a certain predefined threshold, or
    – its membership degrees to all clusters are convergent (nearly equal).
  – By determining which documents are degrading the clustering accuracy, they can be excluded.

Experimental Results (cont’d)
• Experiments (Phase 2)
  [Figure: Confusing documents detected in the original data — (a) by fuzzy c-means, (b) by the possibilistic GK algorithm]

Experimental Results (cont’d)
• Experiments (Phase 2)
  [Figure: Number of confusing documents detected in the transcribed data — (a) by fuzzy c-means, (b) by the possibilistic GK algorithm]

Experimental Results (cont’d)
• Results (Phase 2)
  – After excluding the documents flagged by fuzzy c-means, the average clustering accuracy of the remaining data improved to a maximum of 79.34% on the transcribed data and 90.52% on the original data; after excluding the documents flagged by the possibilistic GK algorithm, it improved to maxima of 85.62% and 92.26% respectively.
  – In both cases, the maximum average accuracy is obtained when spectral clustering is applied to rule-based stemmed data.
  – Manual categorization can be considered a solution for categorizing the excluded documents.
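The Phase 2 "confusing document" test lends itself to a short sketch. The following is a minimal illustration of the two criteria listed above, operating on a membership matrix such as the one produced by fuzzy c-means or the possibilistic GK algorithm; the function name and the threshold values (low_threshold, convergence_gap) are assumptions, since the slides only say the thresholds are predefined.

```python
# Hedged sketch of the "confusing document" test described above. It takes a
# fuzzy membership matrix (clusters x documents) and flags documents whose
# memberships are all below a threshold or nearly equal across clusters.
# Threshold values are illustrative assumptions, not taken from the slides.
import numpy as np

def find_confusing_documents(U, low_threshold=0.5, convergence_gap=0.1):
    """Return indices of documents that confuse the clustering.

    U: (n_clusters, n_documents) membership matrix (columns sum to ~1 for
       fuzzy c-means; possibilistic memberships need not sum to 1).
    A document is flagged if (a) no cluster membership reaches
    `low_threshold`, or (b) its top two memberships differ by less than
    `convergence_gap`, i.e. the memberships are "convergent".
    """
    confusing = []
    for j in range(U.shape[1]):
        memberships = np.sort(U[:, j])[::-1]                 # descending order
        weak = memberships[0] < low_threshold                 # criterion (a)
        convergent = (memberships[0] - memberships[1]) < convergence_gap  # (b)
        if weak or convergent:
            confusing.append(j)
    return confusing

# Toy usage with 3 clusters and 4 documents.
U = np.array([[0.80, 0.40, 0.34, 0.10],
              [0.15, 0.35, 0.33, 0.70],
              [0.05, 0.25, 0.33, 0.20]])
print(find_confusing_documents(U))  # documents 1 and 2 are flagged
```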
Conclusion
• Research contributions:
  – Utilizing stemming to overcome the negative effects of some of the transcription errors (misidentification errors) present in the transcribed Arabic text.
  – Stemming techniques improved the accuracy of all clustering algorithms applied to the transcribed Arabic documents in all scenarios by an average of 19.7%.
  – Rule-based light stemming improved the accuracy of the clustering process by an average of 23.75%.
  – Root-based and light stemming improved the accuracy of the clustering process by averages of 17.39% and 5.28% respectively.

Conclusion (cont’d)
• Research contributions:
  – Utilizing the Chi-square similarity measure as a helping method alongside stemming in order to eliminate some of the transcription errors present in the transcribed Arabic text.

Conclusion (cont’d)
• The research has shown that:
  – Rule-based light stemming improved the accuracy of the clustering process more than the other stemming techniques.
  – The spectral clustering algorithm achieved higher accuracy than the k-means algorithm in all cases.
  – The Chi-square similarity measure is superior to the popular, traditional cosine similarity, and it is best exploited by the spectral clustering algorithm.

Conclusion (cont’d)
• The research has shown that:
  – Applying the fuzzy c-means and the possibilistic GK algorithms to both the transcribed and the original data revealed some characteristics of the data:
    – The economics topic has the largest number of confusing documents.
    – The arts and science topics take the second and third places in the number of confusing documents.
    – The politics topic has the second-fewest confusing documents, yet it is the topic that receives the most wrongly clustered documents from all the other categories.

References
[1] Ibrahim Abu El-Khair, “Effects of stop words elimination for Arabic information retrieval: a comparative study,” International Journal of Computing & Information Sciences, 2006, pp. 119-133.
[2] Eiman Al-Shammari and Jessica Lin, “A novel Arabic lemmatization algorithm,” Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, 2008, pp. 113-118.
[3] L. S. Larkey and M. E. Connell, “Arabic information retrieval at UMass in TREC-10,” Proceedings of the Tenth Text REtrieval Conference (TREC-10), E. M. Voorhees and D. K. Harman (eds.), 2001, pp. 562-570.
[4] S. Khoja and R. Garside, “Stemming Arabic text,” Computing Department, Lancaster University, Lancaster, UK, 1999.
[5] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford, “Okapi at TREC-3,” Proceedings of the Third Text REtrieval Conference (TREC 1994), Gaithersburg, USA, November 1994. NIST SP 500-225, 1995, pp. 109-126.
[6] Oktay Ibrahimov, Ishwar Sethi, and Nevenka Dimitrova, “A novel similarity based clustering algorithm for grouping broadcast news,” Proc. SPIE Conf. Data Mining and Knowledge Discovery: Theory, Tools, and Technology IV, 2002.
[7] Michael Steinbach, George Karypis, and Vipin Kumar, “A comparison of document clustering techniques,” University of Minnesota, 2000.
[8] Ulrike von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395-416, December 2007.
[9] Jianbo Shi and Jitendra Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, August 2000.
[10] Dragon Dictation application on iOS, https://itunes.apple.com/us/app/dragon-dictation/id341446764?mt=8
[11] G. Kanaan, R. Al-Shalabi, M. Ababneh, and A. Al-Nobani, “Building an effective rule-based light stemmer for Arabic language to improve search effectiveness,” International Conference on Innovations in Information Technology (IIT 2008), 2008, pp. 312-316.
[12] D. E. Gustafson and W. C. Kessel, “Fuzzy clustering with a fuzzy covariance matrix,” Proc. IEEE CDC, San Diego, CA, USA, 1979, pp. 761-766.

Thank You

Questions?