Text segmentation in Informedia
Faculty Mentor: Alex Hauptmann
TA Mentor: Vandi Verma

Students: Zhirong Wang, Ningning Hu, Jichuan Chang
Data and Methods
• Data
– CNN WorldView (01/1999-10/2000)
– Stemming, merging, stop-word removal, …
• Methods
– Classification
• Artificial Neural Network (sentence)
• Naive Bayes (sentence/fixed length window)
• SVM (sentence)
– Topic change detection
• EM clustering
• # topics, block size
Sample closed-caption text (line stamps 001630 to 001654):
CENTURY
>>> WE PEOPLE TEND TO
PUT THINGS LIKE THE PASSING OF A
MILLENIUM IN SHARP FOCUS. WE
CELEBRATE, CONTEMPLATE, EVEN
WORRY A BIT, SOMETIMES WORRY A
LOT. AFTER ALL, IT'S SOMETHING
THAT HAPPENS ONLY ONCE EVERY ONE
THOUSAND YEARS. A BIG DEAL?
PERHAPS NOT TO ALL LIVING THINGS,
AS CNN'S RICHARD BLYSTONE
FOUND OUT WHEN HE CONSIDERED ONE
VERY OLD TREE. >>> HO HUM.
ANOTHER MILLENNIUM. THE GREAT YEW
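The preprocessing steps listed under Data (stemming, stop-word removal) and the fixed-length window used by some of the classifiers can be sketched roughly as follows. This is an illustration only: the stop-word list, the suffix-stripping rule, and the function names `crude_stem`, `preprocess`, and `fixed_length_blocks` are assumptions, not the project's actual code, and the "merging" step is not shown because the slides do not define it.

```python
import re

# Hypothetical stop-word list; the slides do not say which list was used.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "we", "it", "s", "that", "is"}


def crude_stem(word):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(caption_text):
    """Lowercase, drop markers and punctuation, remove stop words, then stem."""
    tokens = re.findall(r"[a-z]+", caption_text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def fixed_length_blocks(tokens, block_size):
    """Split a token stream into consecutive blocks of block_size tokens."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


tokens = preprocess(">>> WE PEOPLE TEND TO PUT THINGS")
blocks = fixed_length_blocks(tokens, 2)
```

Working on fixed-length blocks rather than sentences avoids depending on sentence punctuation, which may be unreliable in caption text.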
Experimental Result
Figure: identified boundaries compared with reference boundaries over a sequence of sentences; each boundary is labeled OK, Miss, or False Alarm.
Recall = (#OK) / (#OK + #Miss)
Precision = (#OK) / (#OK + #False Alarm)
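The two formulas above can be computed as in the sketch below. It assumes a boundary counts as OK only when the identified position matches a reference position exactly; the slides do not say whether a tolerance window was allowed. The function name `score_boundaries` and the sentence positions are made up for illustration.

```python
def score_boundaries(identified, reference):
    """Count OK / Miss / False Alarm and derive recall and precision.

    identified, reference: iterables of sentence indices where a story
    boundary is placed (exact-match scoring is an assumption).
    """
    identified, reference = set(identified), set(reference)
    ok = len(identified & reference)           # found and correct
    miss = len(reference - identified)         # in the reference, not found
    false_alarm = len(identified - reference)  # found, not in the reference
    recall = ok / (ok + miss) if ok + miss else 0.0
    precision = ok / (ok + false_alarm) if ok + false_alarm else 0.0
    return {"ok": ok, "miss": miss, "false_alarm": false_alarm,
            "recall": recall, "precision": precision}


# Illustrative positions only: three OK, one Miss, one False Alarm.
scores = score_boundaries(identified=[3, 12, 15, 21], reference=[3, 8, 15, 21])
```

With three OK boundaries, one Miss, and one False Alarm, both recall and precision come out to 3 / 4 = 0.75.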
• Feature selection
• Block size
• Best Classifier:
– Naive Bayes Classifier
– Fixed length block
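A Naive Bayes classifier over fixed-length blocks might look like the following minimal sketch. It is a generic multinomial Naive Bayes with add-one smoothing, not the project's implementation; the class name `BlockNaiveBayes`, the smoothing constant, and the toy training blocks are all assumptions made for illustration.

```python
import math
from collections import Counter


class BlockNaiveBayes:
    """Multinomial Naive Bayes: does a block of tokens contain a boundary?"""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing constant

    def fit(self, blocks, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(blocks, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def log_score(self, tokens, c):
        """log P(c) plus the sum of log P(w | c) over in-vocabulary tokens."""
        total = sum(self.word_counts[c].values())
        denom = total + self.alpha * len(self.vocab)
        score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
        for w in tokens:
            if w in self.vocab:  # unseen words are ignored
                score += math.log((self.word_counts[c][w] + self.alpha) / denom)
        return score

    def predict(self, tokens):
        return max(self.classes, key=lambda c: self.log_score(tokens, c))


# Toy training data (hypothetical): True means the block contains a boundary.
train_blocks = [["cnn", "report", "found"], ["tree", "old", "yew"],
                ["cnn", "correspondent", "report"], ["worry", "celebrate", "millennium"]]
train_labels = [True, False, True, False]
model = BlockNaiveBayes().fit(train_blocks, train_labels)
```

In this toy setup, a block containing reporter hand-off words such as "cnn" and "report" scores higher for the boundary class than a block of ordinary story vocabulary.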
Figure: precision and recall (scale 0 to 1) for SVM, ANN, NB, TCD, and FL.
Discussion
• Impact of data set
– Good recall, lower precision
– Noisy: closed-captioning text
– Ratio of positive to negative examples
• Combining different classifiers
– Different granularity
– Voting
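One simple way to combine classifiers by voting is a per-position majority vote, sketched below. It assumes every classifier's output has already been mapped to the same sequence of candidate positions (for example, sentence ends), which is one possible way to reconcile the different granularities; the slides do not specify the scheme. The function name `majority_vote` and the example predictions are hypothetical.

```python
def majority_vote(predictions):
    """Combine boundary decisions from several classifiers.

    predictions: list of equal-length lists of booleans, one list per
    classifier, aligned to the same candidate positions (an assumption).
    Returns True at a position when more than half the classifiers agree.
    """
    n = len(predictions)
    return [sum(votes) * 2 > n for votes in zip(*predictions)]


# Hypothetical outputs from three classifiers over four candidate positions.
svm_out = [True, False, False, True]
ann_out = [True, True, False, False]
nb_out = [True, False, True, True]
combined = majority_vote([svm_out, ann_out, nb_out])
```

Requiring a strict majority tends to suppress boundaries that only one classifier proposes, which could help with the lower precision noted above, possibly at some cost in recall.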