Text Segmentation in Informedia

Faculty Mentor: Alex Hauptmann
TA Mentor: Vandi Verma
Students: Zhirong Wang, Ningning Hu, Jichuan Chang

Data and Methods
• Data
  – CNN WorldView (01/1999-10/2000)
  – Stemming, merging, stop-word removal, …
• Methods
  – Classification
    • Artificial Neural Network (sentence)
    • Naive Bayes (sentence/fixed-length window)
    • SVM (sentence)
  – Topic change detection
    • EM clustering
    • # topics, block size

[Figure: sample closed-caption transcript, time codes 001630 to 001654, first caption fragment "CENTURY"]
>>> WE PEOPLE TEND TO PUT THINGS LIKE THE PASSING OF A MILLENIUM IN SHARP FOCUS. WE CELEBRATE, CONTEMPLATE, EVEN WORRY A BIT, SOMETIMES WORRY A LOT. AFTER ALL, IT'S SOMETHING THAT HAPPENS ONLY ONCE EVERY ONE THOUSAND YEARS. A BIG DEAL? PERHAPS NOT TO ALL LIVING THINGS, AS CNN'S RICHARD BLYSTONE FOUND OUT WHEN HE CONSIDERED ONE VERY OLD TREE.
>>> HO HUM. ANOTHER MILLENNIUM. THE GREAT YEW

Experimental Result
[Figure: identified boundaries and reference boundaries marked along a sequence of sentences; outcomes labeled OK, Miss, False Alarm, OK, OK]
Recall = (#OK) / (#OK + #Miss)
Precision = (#OK) / (#OK + #False Alarm)
• Feature selection
• Block size
• Best Classifier:
  – Naive Bayes Classifier
  – Fixed-length block
[Figure: bar chart of precision and recall (scale 0 to 1) for SVM, ANN, NB, TCD, FL]

Discussion
• Impact of data set
  – Good recall, lower precision
  – Noisy: closed-captioning text
  – Ratio of positive to negative examples
• Combining different classifiers
  – Different granularity
  – Voting
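
The recall and precision definitions given under Experimental Result can be illustrated with a minimal Python sketch. The function name `boundary_scores` and the use of sentence indices are assumptions made for illustration. The sketch also assumes exact-match scoring, since the slides do not say whether a tolerance window around the reference boundary was allowed.

```python
def boundary_scores(identified, reference):
    """Score identified story boundaries against reference boundaries.

    Both arguments are iterables of sentence indices at which a boundary
    is placed. Exact-match scoring is an assumption of this sketch.
    """
    identified = set(identified)
    reference = set(reference)

    ok = len(identified & reference)           # boundaries found correctly
    miss = len(reference - identified)         # reference boundaries not found
    false_alarm = len(identified - reference)  # identified but not in reference

    # Guard against empty inputs so the ratios are always defined.
    recall = ok / (ok + miss) if (ok + miss) else 0.0
    precision = ok / (ok + false_alarm) if (ok + false_alarm) else 0.0

    return {
        "ok": ok,
        "miss": miss,
        "false_alarm": false_alarm,
        "recall": recall,
        "precision": precision,
    }


# Hypothetical indices chosen to mirror the slide's diagram:
# OK (2), Miss (5), False Alarm (7), OK (9), OK (12).
scores = boundary_scores(identified=[2, 7, 9, 12], reference=[2, 5, 9, 12])
print(scores["recall"], scores["precision"])
```

With three OK boundaries, one Miss, and one False Alarm, both ratios come out to 3/4.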
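
The voting idea listed under Discussion could look roughly like the sketch below. It is not the authors' method: the function name `vote_boundaries`, the 0/1 per-sentence encoding, and the majority threshold are all assumptions. It further assumes that classifiers working at a different granularity (for example, fixed-length blocks) have already had their decisions mapped onto sentence positions.

```python
def vote_boundaries(predictions, min_votes=None):
    """Combine per-sentence boundary flags from several classifiers.

    predictions: list of equal-length lists, one per classifier, where
    1 marks a predicted boundary at that sentence and 0 marks none.
    min_votes: votes needed to accept a boundary; defaults to a strict
    majority of the classifiers (an assumption of this sketch).
    """
    if not predictions:
        return []

    n_sentences = len(predictions[0])
    if any(len(p) != n_sentences for p in predictions):
        raise ValueError("all classifiers must cover the same sentences")

    if min_votes is None:
        min_votes = len(predictions) // 2 + 1

    combined = []
    for i in range(n_sentences):
        votes = sum(p[i] for p in predictions)
        combined.append(1 if votes >= min_votes else 0)
    return combined


# Hypothetical outputs from three classifiers over five sentences.
svm = [0, 1, 0, 0, 1]
ann = [0, 1, 1, 0, 0]
nb = [0, 0, 1, 0, 1]
print(vote_boundaries([svm, ann, nb]))  # [0, 1, 1, 0, 1]
```

Raising `min_votes` trades recall for precision, which may matter given the good-recall, lower-precision pattern noted in the Discussion.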