Topic Clustering of Stemmed
Transcribed Arabic Broadcast News
Authors
Ahmed Abdelaziz Jafar (O6U)
Prof. Mohamed Waleed Fakhr (AAST)
Prof. Mohamed Hesham Farouk (Cairo University)
Outline
• Motivations
• Challenges
• Objectives
• Research Procedure
• Experimental Results
• Conclusion
Motivations
• Why Topic Clustering of Transcribed Broadcast News?
– The amount of audible news broadcast on TV channels, radio stations, and the Internet is growing rapidly.
– This rapid growth demands reliable and fast techniques to organize and store these vast amounts of news in order to facilitate future processing.
Motivations (cont’d)
• Why Transcribing News:
– News is important, thus archiving it is also important.
– Prepared news stories are carefully edited and highly
structured.
• Why Arabic Language:
– Arabic is one of the six most widely spoken languages in the world.
– Automatic transcription and processing of Arabic documents is an active research field due to its complex morphological nature.
Challenges
• Speech Transcription Challenges:
– Transcription errors include:
– Word Deletion Errors
– Word Insertion Errors
– Word Misidentification (Substitution) Errors
– Minor Spelling Errors
– Such errors result mainly from drawbacks of the ASR system.
Challenges (cont’d)
• Speech Transcription Challenges:
– Grammatical Errors:
– Use of grammatically incorrect sentences.
– Common problem in conversational speech.
– Out of Vocabulary Problem (OOV):
– Presence of unknown words that appear in the speech but
not in the recognition vocabulary of the ASR.
– The daily growth of natural languages is the main cause of this problem.
– Combination of the previously mentioned problems.
Objectives
• Achieve automatic topic clustering of transcribed
speech documents.
• Overcome the negative effect of some of the
transcription errors by using stemming techniques
with the aid of a Chi-square-based similarity measure.
Research Procedure
• Transcription Process: Audio Files → ASR System → Transcribed Documents
• Preprocessing Steps: Tokenization → Stop Words Removal → Words Formatting → Stemming → Weighted Matrix Construction
• Clustering-Based Topic Identification: Clustering Algorithm + Similarity Measure → N topics identified
Research Procedure (cont’d)
• Transcription Process: Audio Files → ASR System → Transcribed Speech Documents
• Dragon Dictation
– Free application by Nuance Communications, currently available only on the iOS platform.
– Speaker-independent recognizer that supports many languages, including Arabic.
– Open-domain recognizer, hence it does not require writing a grammar.
Research Procedure (cont’d)
• Transcription Process
• ASR System Output (Transcribed Documents)
– 1000 transcribed news stories:
– collected from the broadcasts of various Arabic news networks: Al-Jazeera, Al-Arabiya, and BBC Arabic.
– divided into five general topics: arts and culture, economics, politics, science, and sports.
– The average length of the original audible news story is
about two minutes.
Research Procedure (cont’d)
• Transcription Process
• ASR System Evaluation
– The ASR system is evaluated using the Word Error Rate (WER), which is commonly used to measure speech recognition accuracy. It is based on the frequency of occurrence of three types of errors: substitutions (S), insertions (I), and deletions (D).
– WER is calculated as follows:
WER = (S + I + D) / N
where N is the number of words in the reference transcript.
Research Procedure (cont’d)
• Transcription Process
• ASR System Evaluation using WER

#Reference Words   #Substitutions   #Insertions   #Deletions   WER %
68720              17327            2105          574          29.11
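As a quick check, the WER figure above can be reproduced from the error counts with a few lines of Python (a generic WER-from-counts helper, not code from the research):

```python
def word_error_rate(substitutions: int, insertions: int, deletions: int,
                    reference_words: int) -> float:
    """WER = (S + I + D) / N, expressed as a percentage."""
    return 100.0 * (substitutions + insertions + deletions) / reference_words

# Error counts from the evaluation table above
wer = word_error_rate(substitutions=17327, insertions=2105, deletions=574,
                      reference_words=68720)
print(f"{wer:.4f}%")  # -> 29.1123%
```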
Research Procedure (cont’d)
• Preprocessing Steps
• Impact of the Stop Words Removal Step on the Transcription Errors

#Reference Words   #Substitutions   #Insertions   #Deletions   WER %
30040              3607             1831          766          20.65
Research Procedure (cont’d)
• Preprocessing Steps
• Words Formatting
– Unify all the different shapes of the same letter into one form.
– Also remove some unwanted suffixes (و, ا, وا) in order to fine-tune the input for the stemming step.
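A minimal Python sketch of the letter-unification step, assuming a typical normalization map (alef variants, taa marbuta, alef maqsura); the exact mapping used in the research may differ:

```python
# Illustrative normalization map; the actual set of unified letters in the
# research may differ from this common choice.
NORMALIZE = {
    "أ": "ا", "إ": "ا", "آ": "ا",  # alef variants -> bare alef
    "ة": "ه",                      # taa marbuta  -> haa
    "ى": "ي",                      # alef maqsura -> yaa
}

def format_word(word: str) -> str:
    """Unify the different shapes of the same letter into one form."""
    for variant, unified in NORMALIZE.items():
        word = word.replace(variant, unified)
    return word

print(format_word("إستقلال"))  # the initial alef variant collapses to bare alef
```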
Research Procedure (cont’d)
• Preprocessing Steps
• Used Stemming Techniques
– Light Stemming: does not deal with patterns or infixes; it is simply the process of stripping off prefixes and/or suffixes.
– Root-Based Stemming: removes suffixes, infixes, and prefixes and uses pattern matching to extract the roots.
– Rule-Based Light Stemming: a hybrid technique between light and rule-based stemming.
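To illustrate the difference in spirit, here is a toy light stemmer in Python that only strips affixes; the affix lists are illustrative assumptions, not the ones used in the research:

```python
# Illustrative affix lists for a toy light stemmer; real light stemmers use
# larger, carefully chosen lists.
PREFIXES = ["ال", "وال", "بال", "كال", "فال", "لل", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ية", "ه", "ة"]

def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix; never touch infixes."""
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
            break
    return word

print(light_stem("والمدرسة"))  # -> "مدرس"
```

A root-based stemmer would go further and match the remaining string against morphological patterns to extract the root.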
Research Procedure (cont’d)
• Preprocessing Steps
• Weighted Matrix Construction
– According to the Okapi method, the Combined Weight (CW) of each word is calculated and stored in a term-document matrix:

        D1     D2     D3    …    Dj
W1    CW11   CW12   CW13   …   CW1j
W2    CW21   CW22   CW23   …   CW2j
W3    CW31   CW32   CW33   …   CW3j
…       …      …      …    …     …
Wi    CWi1   CWi2   CWi3   …   CWij
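The Okapi combined weight can be sketched as a BM25-style function of term frequency, document frequency, and document length, following Robertson et al.'s Okapi at TREC-3; the tuning constants k1 and b below are conventional defaults, assumed rather than taken from the research:

```python
import math

def combined_weight(tf: int, df: int, n_docs: int,
                    doc_len: int, avg_doc_len: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """BM25-style Okapi combined weight CW for one word in one document."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))   # collection weight
    norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)  # length normalization
    return idf * tf * (k1 + 1) / (norm + tf)

# A word occurring 3 times in a slightly long document, appearing in 40 of
# 1000 documents (hypothetical figures):
print(round(combined_weight(tf=3, df=40, n_docs=1000,
                            doc_len=120, avg_doc_len=100), 3))
```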
Research Procedure (cont’d)
• Clustering-Based Topic Identification
– Clustering Algorithms:
• Basic k-means algorithm
• Spectral clustering algorithm (Shi–Malik)
– Similarity Measures:
• Chi-square similarity measure
• Cosine similarity measure
– Output: N topics identified.
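The two similarity measures can be sketched in plain Python over the CW vectors of a document pair; the chi-square variant here uses the common chi-square histogram distance mapped to a similarity, which may differ in detail from the formulation used in the research:

```python
import math

def chi_square_similarity(x, y, eps=1e-12):
    """Chi-square distance between two weight vectors, mapped to (0, 1]."""
    dist = 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(x, y))
    return 1.0 / (1.0 + dist)

def cosine_similarity(x, y, eps=1e-12):
    """Standard cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny + eps)

v = [1.0, 2.0, 0.5]
print(chi_square_similarity(v, v))  # identical vectors -> 1.0
```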
Experimental Results
• Experiments
– Four test scenarios are evaluated:
– without applying stemming
– with light stemming applied
– with root-based stemming applied
– with rule-based stemming applied
– In each scenario, the dataset is divided into smaller subsets of sizes ranging from 50 to 200 documents per topic category.
Experimental Results (cont’d)
• Experiments
– The clustering algorithms are applied to all the subsets in each scenario twice per subset:
– once using the Chi-square similarity measure,
– once using the popular cosine similarity.
– The accuracy of the clustering is evaluated for each subset,
and then the average accuracy is calculated among all the
subsets.
Experimental Results (cont’d)
• Results (Transcribed Documents)

Average Accuracy
Clustering Approach / Similarity Measure   Non-Stemmed   Light-Stemmed   Root-Stemmed   Rule-Stemmed
k-Means / Cosine                           39.42%        44.61%          54.41%         60.04%
k-Means / Chi-square                       44.30%        47.60%          56.50%         63.35%
Spectral Clustering / Cosine               45.62%        50.96%          65.57%         71.33%
Spectral Clustering / Chi-square           46.50%        53.80%          68.90%         76.11%
Experimental Results (cont’d)
• Results (Original Documents)

Average Accuracy
Clustering Approach / Similarity Measure   Non-Stemmed   Light-Stemmed   Root-Stemmed   Rule-Stemmed
k-Means / Cosine                           62.20%        64.63%          68.06%         76.84%
k-Means / Chi-square                       65.90%        67.97%          72.84%         79.05%
Spectral Clustering / Cosine               72.20%        74.97%          80.77%         85.15%
Spectral Clustering / Chi-square           74.87%        76.85%          82.74%         87.21%
Experimental Results (cont’d)
• Results Evaluation
– By comparing the accuracy results and observing the clustering confusion matrix for each clustering scenario on the original and transcribed data, it is concluded that:
– In both sets of data, there are documents causing clustering confusion.
– The existence of topic overlaps in the original data is the main cause of such confusion.
– The information loss due to transcription errors increases the confusion even further in the transcribed data.
Experimental Results (cont’d)
• Results Evaluation: Confusion Matrices

Sample 1: original text divided into subsets of 200 docs; rule-based stemming applied; spectral clustering algorithm applied.

            Arts   Economics   Politics   Science   Sports   Total
Arts         170       21          2         16        9      218
Economics     10      125          4          6        5      150
Politics      13       33        193         10        7      256
Science        6       15          1        167        2      191
Sports         1        6          0          1      177      185

Sample 2: transcribed text divided into subsets of 200 docs; rule-based stemming applied; spectral clustering algorithm applied.

            Arts   Economics   Politics   Science   Sports   Total
Arts         156       25          3         21       17      222
Economics     13      102          3          9        5      132
Politics      21       39        191         15       11      277
Science        8       22          2        149        7      188
Sports         2       12          1          6      160      181
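The overall clustering accuracy of each sample can be recomputed from its confusion matrix as the diagonal (correctly clustered documents) over the total:

```python
def accuracy(confusion):
    """Percentage of documents on the diagonal of a confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total

# Sample 1 above (original text, rule-based stemming, spectral clustering),
# rows/columns ordered Arts, Economics, Politics, Science, Sports:
sample1 = [
    [170, 21, 2, 16, 9],
    [10, 125, 4, 6, 5],
    [13, 33, 193, 10, 7],
    [6, 15, 1, 167, 2],
    [1, 6, 0, 1, 177],
]
print(f"{accuracy(sample1):.1f}%")  # -> 83.2%
```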
Experimental Results (cont’d)
• Experiments (Phase 2)
– Fuzzy c-means algorithm and Possibilistic Gustafson-Kessel (GK)
algorithm are applied on both the transcribed and the original
data, and the membership matrix is analyzed to evaluate the
amount of confusing documents in each topic.
– A document is considered confusing to the clustering process if:
– its membership degrees to all clusters are below a certain predefined threshold, or
– its membership degrees to all clusters are convergent.
– By determining which documents are affecting the clustering
accuracy, they can be excluded.
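The two membership-based criteria above can be sketched as a small predicate over one document's row of the fuzzy membership matrix; the threshold and the convergence gap below are illustrative assumptions:

```python
def is_confusing(memberships, threshold=0.5, gap=0.1):
    """Flag a document whose membership degrees match either criterion."""
    ranked = sorted(memberships, reverse=True)
    if ranked[0] < threshold:        # all membership degrees are low
        return True
    if ranked[0] - ranked[1] < gap:  # the top degrees are convergent
        return True
    return False

print(is_confusing([0.22, 0.20, 0.19, 0.20, 0.19]))  # -> True
print(is_confusing([0.85, 0.05, 0.04, 0.03, 0.03]))  # -> False
```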
Experimental Results (cont’d)
• Experiments (Phase 2)
– [Figure] Confusing documents detected in the original data: (a) by fuzzy c-means; (b) by the possibilistic GK algorithm.
Experimental Results (cont’d)
• Experiments (Phase 2)
– [Figure] Number of confusing documents detected in the transcribed data: (a) by fuzzy c-means; (b) by the possibilistic GK algorithm.
Experimental Results (cont’d)
• Results (Phase 2)
– After excluding the documents flagged by fuzzy c-means, the average clustering accuracy of the remaining data improved to maxima of 79.34% and 90.52% for the transcribed and original data, respectively; after using the possibilistic GK algorithm, it improved to maxima of 85.62% and 92.26%, respectively.
– In both cases, the maximum average accuracy is obtained
when spectral clustering is used on rule-based stemmed
data.
– Manual categorization can be considered a solution to
categorize the excluded documents.
Conclusion
• Research Contributions:
– Utilizing stemming to overcome the negative effects of some of the transcription errors (misidentification errors) existing in the transcribed Arabic text.
– Stemming techniques improved the accuracy of all clustering algorithms applied to the transcribed Arabic documents in all scenarios by an average of 19.7%.
– Rule-based light stemming improved the accuracy of the clustering process by an average of 23.75%.
– Root-based and light stemming improved the accuracy of the clustering process by averages of 17.39% and 5.28%, respectively.
Conclusion (cont’d)
• Research Contributions:
– Utilizing the Chi-square similarity measure as a helping method to stemming in order to eliminate some of the transcription errors existing in the transcribed Arabic text.
Conclusion (cont’d)
• The research has shown that:
– Rule-based light stemming has improved the accuracy of
the clustering process more than the other stemming
techniques.
– The spectral clustering algorithm achieved higher accuracy than the k-means algorithm in all cases.
– The Chi-square similarity measure is superior to the popular and traditional cosine similarity, and it is best utilized by the spectral clustering algorithm.
Conclusion (cont’d)
• The research has shown that:
– Applying the fuzzy c-means and the possibilistic GK algorithms to both the transcribed and original data revealed some characteristics of the data:
– The Economics topic has the largest number of confusing documents.
– Arts and Science take the second and third places in the number of occurrences of confusing documents.
– The Politics topic has the second-fewest confusing documents, yet it is the topic that receives the most wrongly clustered documents from all other categories.
References
[1] Ibrahim Abu El-Khair, “Effects of stop words elimination for Arabic information retrieval: a comparative study,” International Journal of Computing & Information Sciences, 2006, pp. 119-133.
[2] Eiman Al-Shammari and Jessica Lin, “A novel Arabic lemmatization algorithm,” Proceedings of the second
workshop on Analytics for noisy unstructured text data, 2008, pp. 113-118.
[3] L.S. Larkey and M.E. Connell, “Arabic information retrieval at UMass in TREC-10,” Proceedings of the Tenth Text REtrieval Conference (TREC-10), E.M. Voorhees and D.K. Harman, eds., 2001, pp. 562-570.
[4] S. Khoja and R. Garside, “Stemming Arabic text,” Lancaster, UK, Computing Department, Lancaster University,
1999.
[5] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. “Okapi at
TREC-3,” Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
NIST SP 500-225, 1995, 109-126.
[6] Oktay Ibrahimov, Ishwar Sethi, and Nevenka Dimitrova, “A novel similarity based clustering algorithm for grouping
broadcast news,” Proc. of SPIE Conf. 'Data Mining and Knowledge Discovery: Theory, Tools, and Technology IV,
2002.
[7] Michael Steinbach, George Karypis, and Vipin Kumar, “A comparison of document clustering techniques”, University
of Minnesota, 2000.
[8] Ulrike von Luxburg, “A tutorial on spectral clustering,” Springer Statistics and Computing, vol. 17, issue 4, pp. 395-416, December 2007.
[9] Jianbo Shi and Jitendra Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 22, No. 8, August 2000.
[10] Dragon Dictation Application on iOS https://itunes.apple.com/us/app/dragon-dictation/id341446764?mt=8
[11] G. Kanaan, R. Al-Shalabi, M. Ababneh, and A. Al-Nobani, “Building an effective rule-based light stemmer for Arabic language to improve search effectiveness,” Innovations in Information Technology (IIT 2008), International Conference, pp. 312-316.
[12] D.E. Gustafson and W.C. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In Proc. IEEE CDC, pages 761–
766, San Diego, CA, USA, 1979.
Thank You
Questions