Text Segmentation in the Informedia Project

Faculty Mentor: Alex Hauptmann (alex@cs.cmu.edu)
Students: Jichuan Chang (cjc@cs.cmu.edu), Ningning Hu (hnn@cs.cmu.edu), Zhirong Wang (zhirong@cs.cmu.edu)
Abstract
In this paper, we report our experiences building a text segmenter for the
Informedia project. Several methods are used in our project; their experimental
results are compared and analyzed. Based on the application context of the
Informedia project, we choose relaxed error metrics for performance evaluation.
1. Introduction
Segmentation is an integral and critical process in the Informedia digital video library.
The success of information retrieval in Informedia hinges on the critical assumption that
we can segment a whole news broadcast into individual paragraphs or stories. The
segmentation task can be conducted on different media (video, speech, text, etc.), and
their results can be integrated to achieve better performance. This paper reports our
experiences building a closed-caption text segmenter to aid the segmentation of audio
and video data.
The text segmentation problem focuses on identifying story boundaries, where one
region of text ends and another begins, within a document. This work was motivated by
the observations that such a seemingly simple problem can actually prove quite difficult
to automate [3], and that a tool for partitioning a stream of undifferentiated text into
coherent regions would be of great benefit to a number of existing applications.
Consider the following scenario: a video-on-demand application responds to a news
event query by providing the user with a stream of video containing related news clips.
This application may be able to accurately locate positions in its database which are
highly relevant to the query, yet be unable to determine how much of the neighboring data
should be provided to the user. Clearly, without an accurate segmentation tool, the
user will be flooded with overly abundant or unrelated (commercials, for example)
information!
A text segmenter also helps to detect subtopics in a long passage, allowing readers to
quickly jump to the topics that interest them most. Because segmentation provides
additional structural information about the document, such tools can also be used in
information extraction and summarization tasks, to quickly build outlines of the key
points in a long passage.
We treat text segmentation as the task of automatically locating topic boundaries. It can
be refined into a classification problem: given a block of consecutive words (or
sentences), a segmenter should tell us whether there is a boundary in this block, after
observing a set of labeled data. Different classification methods are used in our project so
that we can compare their performance and find the best one. These methods include Neural
Networks (back-propagation networks), Naive Bayes classification, and Support Vector Machines.
Sentences within a correctly segmented text region are (part of) a semantically coherent
unit, belonging to the same topic. Story boundaries in news transcripts are usually
related to topic shifts between different parts of the document. This observation suggests
performing text segmentation by detecting topic changes, so we also use topic change
detection to assist the segmenter. The underlying topics of news stories are identified
using the Expectation Maximization (EM) clustering method.
2. Some Previous Work
In this section we very briefly discuss some previous approaches to the text segmentation
problem.
2.1 Exponential Models
Beeferman's paper "Text Segmentation Using Exponential Models" [3] introduces a
new statistical approach to partitioning text automatically into coherent segments. The
approach enlists both short-range and long-range language models to help it sniff out
likely sites of topic changes in text. To aid its search, the system consults a set of simple
lexical hints it has learned to associate with the presence of boundaries through
inspection of a large corpus of annotated data.
2.2 Machine learning
Litman's paper "Combining Multiple Knowledge Sources for Discourse Segmentation"
[9] predicts discourse segment boundaries from linguistic features of utterances, using a
corpus of spoken narratives as data. The paper presents two methods for developing
segmentation algorithms from training data: hand tuning and machine learning. When
multiple types of features are used, results approach human performance on an
independent test set (both methods) and under cross-validation (machine learning).
2.3 Lexical cohesion
Kozima's paper "Text Segmentation Based on Similarity Between Words" [10]
proposes a new indicator of text structure, called the lexical cohesion profile (LCP),
which locates segment boundaries in a text. A text segment is a coherent scene; the words
in a segment are linked together via lexical cohesion relations. LCP records the mutual
similarity of words in a sequence of text. The similarity of words, which represents their
cohesiveness, is computed using a semantic network. Comparison with text segments
marked by a number of subjects shows that LCP closely correlates with human
judgments, and LCP may also provide valuable information for resolving anaphora and
ellipsis. Kozima generalizes lexical cohesiveness to apply to a window of text, plots the
cohesiveness of successive text windows in a document, and identifies the valleys in the
measure as segment boundaries.
2.4 TextTiling
Hearst's paper "Multi-Paragraph Segmentation of Expository Text" [8] describes
TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph
discourse units which reflect the subtopic structure of the texts. The algorithm uses
domain-independent lexical frequency and distribution information to recognize the
interactions of multiple simultaneous themes.
Two fully implemented versions of the algorithm are described and shown to produce
segmentations that correspond well to human judgments of the major subtopic boundaries
of thirteen lengthy texts.
A cosine measure is used to gauge the similarity between constant-size blocks of
morphologically analyzed tokens. First-order rates of change of this measure are then
calculated to decide the placement of boundaries between blocks, which are then adjusted
to coincide with the paragraph segmentation, provided as input to the algorithm. This
approach leverages the observation that text segments are dense with repeated content
words. Relying on this fact, however, may limit precision because the repetition of
concepts within a document is subtler than can be recognized by only a “bag of words"
tokenizer and morphological filter.
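To make the block-comparison idea concrete, below is a minimal sketch (our own simplification, not Hearst's actual implementation): adjacent fixed-size token blocks are scored with a cosine measure, and positions where the similarity dips are proposed as boundaries. The block size and threshold are illustrative assumptions.

import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # cosine similarity between two word-count vectors
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def texttiling_like(tokens, block_size=20, threshold=0.1):
    """Return token indices that look like segment boundaries."""
    candidates = []
    for i in range(block_size, len(tokens) - block_size, block_size):
        left = Counter(tokens[i - block_size:i])
        right = Counter(tokens[i:i + block_size])
        # a boundary is proposed wherever adjacent blocks share little vocabulary
        if cosine(left, right) < threshold:
            candidates.append(i)
    return candidates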
2.5 Dragon’s Approach for text segmentation
Allan's paper "Topic Detection and Tracking Pilot Study Final Report" [11] describes
several approaches to text segmentation. One is from Dragon. Dragon's approach to
segmentation is to treat a story as an instance of some underlying topic, and to model an
unbroken text stream as an unlabeled sequence of these topics. In this model, finding
story boundaries is equivalent to finding topic transitions. At a certain level of abstraction,
identifying topics in a text stream is similar to recognizing speech in an acoustic stream.
Each topic block in a text stream is analogous to a phoneme in speech recognition, and
each word or sentence (depending on the granularity of the segmentation) is analogous to
an ``acoustic frame''. Identifying the sequence of topics in
an unbroken transcript therefore corresponds to recognizing phonemes in a continuous
speech stream. Just as in speech recognition, this situation is subject to analysis using
classic Hidden Markov Model (HMM) techniques, in which the hidden states are topics
and the observations are words or sentences.
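As a rough illustration of this HMM view (not Dragon's actual system), the sketch below assumes each topic is given as a plain unigram word-probability table and uses Viterbi decoding with a fixed topic-switch penalty; boundaries are reported wherever the decoded topic changes. The penalty value and smoothing constant are our own assumptions.

import math

def viterbi_segment(sentences, topic_models, switch_penalty=5.0):
    """sentences: list of token lists; topic_models: list of dicts mapping word -> probability."""
    def emit(t, sent):
        # log-likelihood of a sentence under topic t (with crude smoothing)
        return sum(math.log(topic_models[t].get(w, 1e-6)) for w in sent)

    T = len(topic_models)
    delta = [emit(t, sentences[0]) for t in range(T)]   # best log-score ending in topic t
    back = []                                           # back-pointers, one list per later sentence
    for sent in sentences[1:]:
        new_delta, ptrs = [], []
        for t in range(T):
            best_prev = max(range(T),
                            key=lambda s: delta[s] - (switch_penalty if s != t else 0.0))
            ptrs.append(best_prev)
            new_delta.append(delta[best_prev]
                             - (switch_penalty if best_prev != t else 0.0)
                             + emit(t, sent))
        delta, back = new_delta, back + [ptrs]
    # trace back the most likely topic sequence
    path = [max(range(T), key=lambda t: delta[t])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    path.reverse()
    # a story boundary is proposed wherever the decoded topic changes
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]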
3. Our Approach
We approach the text segmentation problem with several methods. As there are many tools
in the public domain which already implement most of these methods, our work falls mainly
in data preparation, feature selection, parameter tuning, and result comparison and
analysis.
3.1 Data Collection and Preparation
The closed-caption raw data come from the Informedia project, comprising several
thousand transcripts under different classes: CNN World View, The World Today,
Early Prime, and Science & Technology. We chose to use the CNN World View
transcripts from 1999 to October 2000 (about 500 passages). The data come in a
proprietary format (see the example below), with timing information alongside (">>>"
indicates a boundary).
001630 CENTURY >>> WE PEOPLE TEND TO
001631 PUT THINGS LIKE THE PASSING OF A
001633 MILLENIUM IN SHARP FOCUS. WE
001633 CELEBRATE, CONTEMPLATE, EVEN
001635 WORRY A BIT, SOMETIMES WORRY A
001636 LOT. AFTER ALL, IT'S SOMETHING
001638 THAT HAPPENS ONLY ONCE EVERY ONE
001641 THOUSAND YEARS. A BIG DEAL?
001641 PERHAPS NOT TO ALL LIVING THINGS,
001642 AS CNN'S RICHARD BLYSTONE
001643 FOUND OUT WHEN HE CONSIDERED ONE
001654 VERY OLD TREE. >>> HO HUM.
001654 ANOTHER MILLENNIUM. THE GREAT YEW
The raw data are pre-processed in several logical steps (by one or more programs):
(1) Capitalize all words; remove non-printable characters and timing information.
(2) Remove stories that are too short (fewer than 50 words) or too long (more than 1500
words); these two limits are based on the actual distribution of story lengths.
Below is the pre-processed example, in one-sentence-per-line format:
WE PEOPLE TEND TO PUT THINGS LIKE THE PASSING (omitted for short).
WE CELEBRATE, CONTEMPLATE, EVEN WORRY A BIT, SOMETIMES WORRY A LOT.
AFTER ALL, IT'S SOMETHING THAT HAPPENS ONLY (omitted for short).
A BIG DEAL PERHAPS NOT TO ALL LIVING THINGS, (omitted for short).
>>>
HO HUM.
ANOTHER MILLENNIUM.
(3) Stem words using a standard algorithm (Porter's stemming algorithm).
(4) Merge numbers, titles, dates and times, and abbreviations into their class names. For
example, "3:50 PM" and "10:00 AM" are both represented as __TIME__.
(5) For the different methods we used, the intermediate data are divided into fixed-length
(in words or sentences) text blocks.
(6) Some of the tools we used already exclude stop words (common words like "the" and
"where" which rarely help the classification) from the input data, so we remove
them ourselves only when needed.
(7) For some tools (ANN and SVM), we need to transform each block of text into a
vector. Given the selected feature space, a vector is composed of component values
(the numbers of occurrences of the distinct feature words). A small sketch of this
pipeline appears after this list.
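The following sketch shows, under our assumptions, what steps (1)-(7) could look like in code. The regular expressions, the tiny stop word list, and the function names are illustrative, and stemming is omitted for brevity; this is not the exact Informedia preprocessing code.

import re
from collections import Counter

STOP_WORDS = {"THE", "A", "AND", "OF", "TO", "IS", "IN", "IT", "THAT", "WHERE"}

def normalize(line: str) -> str:
    line = line.upper()
    line = re.sub(r"\b\d{1,2}:\d{2}\s*(AM|PM)\b", "__TIME__", line)   # "3:50 PM" -> __TIME__
    line = re.sub(r"\b\d+\b", "__NUM__", line)                        # bare numbers -> __NUM__
    return re.sub(r"[^A-Z_' ]+", " ", line)                           # drop punctuation / junk

def tokenize(line: str, remove_stop_words: bool = False):
    words = normalize(line).split()
    return [w for w in words if not (remove_stop_words and w in STOP_WORDS)]

def to_vector(block_tokens, feature_words):
    """Word-occurrence vector of a text block over a fixed feature word list."""
    counts = Counter(block_tokens)
    return [counts[w] for w in feature_words]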
3.2 Support Vector Machine
Support vector machines are based on the Structural Risk Minimization principle
stemming from computational learning theory. A nice property of SVMs is that their
classification performance is largely independent of the dimensionality of the feature
space, which is particularly useful for our text segmentation problem (usually involving
tens of thousands of features -- words). It has also been reported that SVMs achieve
substantial improvements over other text classification methods.
In our experiments, we divide each passage into blocks of exactly 2 consecutive
sentences. If there is a boundary between the two sentences, the block is labeled
"yes" and called a "boundary block"; otherwise the block is labeled "no" and called a
"background block".
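A small sketch of this blocking and labeling step, assuming the transcript is already a list of sentence strings with ">>>" markers for true story boundaries (the helper name is our own):

def make_blocks(items, block_size=2):
    """items: sentences in order, with ">>>" strings marking true boundaries."""
    sentences, boundary_after = [], set()
    for it in items:
        if it == ">>>":
            if sentences:
                boundary_after.add(len(sentences) - 1)   # boundary follows the previous sentence
        else:
            sentences.append(it)
    blocks = []
    for i in range(0, len(sentences) - block_size + 1, block_size):
        # a block is a boundary block if a true boundary falls between its sentences
        label = "yes" if any(j in boundary_after for j in range(i, i + block_size - 1)) else "no"
        blocks.append((sentences[i:i + block_size], label))
    return blocks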
We have tried two SVM classification tools:

• Rainbow: Rainbow is a statistical text classification tool built on top of the Bow
library. It first builds an index file by counting words in the training data; an SVM
classifier is then trained and used to classify the testing data. In our experiments, its
performance is similar to that of Naive Bayes classification (although achieved in a
much longer time).

• SVMlight: SVMlight is an SVM tool built for text classification. It accepts vector (of
word counts) or sparse vector input. After counting the number of distinct words, we
realized that, even for an SVM, there are too many features to train a classifier. In
order to reduce the dimension of the feature space to several hundred, we decided to
keep only the words with the highest average mutual information. To simplify the
computation, we actually chose the words from the sentences sitting just before and
after a boundary that have the largest difference between their occurrence
probabilities in boundary blocks and background blocks (a sketch of this selection
appears after this list). The result is disappointing, mainly because SVMlight takes
too long to learn. For a simple case with 500 training blocks (actually 3 passages),
the training process finished in 15 minutes (our machine runs Linux on an Intel
Pentium III box). The training time increases quickly with the training data size:
when we increased the training set to 1600 blocks (about 8 passages), it failed to
finish after 15 hours (in fact, the training never finished after 2 weeks, and we finally
gave up on it).
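A hedged sketch of this simplified feature selection (our own re-implementation of the idea, not the exact script we used): words are ranked by the absolute difference between the fraction of boundary blocks and the fraction of background blocks they occur in, and the top n become the feature space.

from collections import Counter

def select_features(blocks, n=100):
    """blocks: list of (tokens, label) pairs with label 'yes' (boundary) or 'no'."""
    boundary, background = Counter(), Counter()
    n_boundary = n_background = 0
    for tokens, label in blocks:
        target = boundary if label == "yes" else background
        target.update(set(tokens))                  # count how many blocks each word occurs in
        if label == "yes":
            n_boundary += 1
        else:
            n_background += 1
    vocab = set(boundary) | set(background)
    def diff(w):
        # difference of occurrence probability in boundary vs. background blocks
        return abs(boundary[w] / max(n_boundary, 1) - background[w] / max(n_background, 1))
    return sorted(vocab, key=diff, reverse=True)[:n]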
3.3 Neural Network
We use the stochastic gradient descent version of the Back Propagation algorithm for
feed-forward networks containing 2 layers of sigmoid units. The network structure is
illustrated in Figure 1. Units in each layer are connected to all units in the preceding
layer. The output is a vector of 2 components, corresponding to the probabilities that the
input block is or is not a boundary block. Below we discuss two other network
parameters:
Figure 1. Structure of the 2-layer Back Propagation network: 100 input units, 10 hidden units, 2 output units.
• Input Units and Hidden Units
The number and features of the input units are determined by experiments. We choose to
use an n-vector (n = 100 or 200) counting the occurrences of the top n words with the
highest mutual information (the same features as used with SVMlight). More input units
are expected to improve performance; in our experiments, Neural Networks with 200
input units outperform those with 100 units by 5%, but the computation cost increases
much more quickly.
The number of hidden units is also determined by experiments and a tradeoff between
accuracy and computation cost. We finally chose to use 100 input units and 10 hidden
units (see Figure 1).
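For concreteness, here is a minimal NumPy sketch of such a 100-10-2 sigmoid network trained with stochastic gradient descent back-propagation. It is our own simplified illustration (squared-error loss, no momentum), not the exact implementation used in the experiments.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BPNetwork:
    def __init__(self, n_in=100, n_hidden=10, n_out=2, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # input -> hidden weights
        self.W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))  # hidden -> output weights
        self.lr = lr

    def forward(self, x):
        h = sigmoid(self.W1 @ x)
        return h, sigmoid(self.W2 @ h)

    def train_step(self, x, target):               # target: e.g. [1, 0] = boundary block
        h, o = self.forward(x)
        delta_o = (o - target) * o * (1 - o)            # output-layer error term
        delta_h = (self.W2.T @ delta_o) * h * (1 - h)   # back-propagated hidden error
        self.W2 -= self.lr * np.outer(delta_o, h)
        self.W1 -= self.lr * np.outer(delta_h, x)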
• Merging False Alarms
By observing the classification results of the ANN, we noticed an interesting phenomenon:
about 15% of false alarms are "clustered" around a true boundary. For example, below
are the classification results for a short passage. Boundary blocks are represented as 0's
and background blocks as 1's. The three consecutive 0's in the before-merging line show
one such "false alarm cluster". Because there are usually more than two sentences
contributing to a story's introduction and conclusion (sign-offs), features in such sentences
(suggesting the existence of a nearby boundary) are also learned by our Neural Network.
Such features cause confusion when our segmenter tries to distinguish between a
boundary block and a background block.
Reference Classification: 111111011111011110111111
Classification (Before Merging): 110111000101011110111111
Classification (After Merging): 110111011101011110111111
One way to reduce such confusion is to include temporal information, which may help
the segmenter distinguish the first boundary block in the introduction from the
following background blocks. But we chose a much simpler, brute-force method to
solve this problem: merging such false alarms. Assuming that the input data are arranged
in their order of occurrence, we simply traverse each false alarm cluster, select the one
block with the highest target value (the one our segmenter considers most likely to
contain a boundary), and change the other 0's into 1's.
This method is simple and effective, except that it might remove a true boundary while
leaving a nearby false alarm in place, which slightly reduces the recall value. We cannot
recover such errors, but the relaxed error metrics will not count them.
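The merging step itself is easy to sketch (the variable names are ours): within each run of consecutive predicted boundary blocks, keep the block with the highest boundary score and flip the rest back to background.

def merge_false_alarms(labels, scores):
    """labels: list of 0 (boundary) / 1 (background); scores: boundary confidence per block."""
    merged = labels[:]
    i = 0
    while i < len(merged):
        if merged[i] == 0:
            j = i
            while j < len(merged) and merged[j] == 0:
                j += 1                                           # end of the cluster of 0's
            best = max(range(i, j), key=lambda k: scores[k])     # most boundary-like block
            for k in range(i, j):
                if k != best:
                    merged[k] = 1                                # demote the rest to background
            i = j
        else:
            i += 1
    return merged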
• Stop Words
We also studied the impact of removing stop words in our experiments. Removing stop
words is common practice in text processing applications. Why bother to observe the
effect of classification with stop words? Because our mutual information statistics showed
that stop words occupy more than 2/3 of the top 50 words, with different importance in
the two word groups (see Appendix A). We ran experiments using input data with and
without stop word removal, trying to find out whether the stop words really matter in our
classification. The results show that segmentation with stop words increases the recall
value but decreases the precision. This suggests that there probably exist special patterns
of stop words in boundary blocks, which help to identify more true boundaries; but such
patterns also occur in background blocks and introduce more false alarms.
3.4 Naive Bayes Classification
Naive Bayes classification is a powerful method widely used in text classification
applications. Rainbow toolkit is utilized in our experiments. One of the major problems
in Naive Bayes classification (also in other methods) is the selection of training data.
After cutting raw data into blocks (size = 2 sentences), there are only 7% boundary
6
blocks. After finished several initial experiments, we realized that such low frequency of
boundary blocks can’t be used to effectively train our classifier (it can only identify 10%
true boundaries).
Actually, increasing the percentage of boundary blocks in the training data effectively
improves the recall value, but it also hurts the precision of segmentation. We ran
experiments to choose a suitable percentage of boundary blocks for our training data. A
good tradeoff between precision and recall depends on the application context. In our
project, we assume that lower precision only gives the user shorter news clips, which is
better than flooding the user with unrelated information as a result of lower recall. This
tradeoff leads to a relative preference for recall in our experiments.
Also, different classification methods prefer different percentages of boundary blocks in
the training data. For Naive Bayes and Neural Network classification, we use 50%
boundary blocks in the training data (a resampling sketch appears below).
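A minimal sketch of how such a training distribution can be produced, assuming we simply down-sample background blocks; the function and its defaults are illustrative, not the exact procedure of the toolkits we used.

import random

def resample(blocks, boundary_fraction=0.5, seed=0):
    """blocks: list of (features, label) with label 'yes'/'no'; returns a rebalanced sample."""
    rng = random.Random(seed)
    pos = [b for b in blocks if b[1] == "yes"]
    neg = [b for b in blocks if b[1] == "no"]
    # number of background blocks needed so that boundary blocks make up the requested fraction
    n_neg = int(len(pos) * (1 - boundary_fraction) / boundary_fraction)
    sample = pos + rng.sample(neg, min(n_neg, len(neg)))
    rng.shuffle(sample)
    return sample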
3.5 Topic Change Detection
Topic change detection was used in Dragon's approach to text segmentation and proved
to be quite effective (67% recall and 65% precision). Dragon uses a multi-pass k-means
algorithm to construct the clusters, while we choose Rainbow's EM clustering to attack
this problem. There are two important parameters to be determined in our method:
• Number of Topics
When clustering documents, one must provide the number of clusters in the data set.
Dragon borrowed a trick from the speech recognition field, using thresholding to limit the
size of the search space (the number of clusters) and iteratively merging topics and
creating new ones. In our project, due to the limitations of the Rainbow tool, we have to
choose the number by intuition and experiment.
• Size of Sliding Window
We have tried different window sizes in the topic change detection method, and 8
sentences works best. As the size of the text window grows, more boundary blocks are
combined into one text window, which decreases the number of identified boundary
blocks. We stop at 8 sentences per window because, beyond that, the proportion of such
errors is no longer negligible.
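Sketched below is the detection step under our assumptions: each sliding window is represented by a feature vector, assigned to its closest topic centroid by cosine similarity (in practice we obtained the topics with Rainbow's EM clustering), and a boundary is proposed wherever the assigned topic changes.

import math

def closest_topic(vec, centroids):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    return max(range(len(centroids)), key=lambda t: cos(vec, centroids[t]))

def topic_change_boundaries(window_vectors, centroids):
    """window_vectors: one feature vector per sliding window, in text order."""
    topics = [closest_topic(v, centroids) for v in window_vectors]
    # propose a boundary between windows whose topic assignments differ
    return [i for i in range(1, len(topics)) if topics[i] != topics[i - 1]]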
Table 1 shows the results of segmenters built with different sliding window sizes and
numbers of topics. According to these results, clustering into more topics improves the
overall classification accuracy, but, limited by Rainbow's processing capacity, we only
had time to obtain the results below.
              Recall (window size 4 / 6 / 8)    Precision (window size 4 / 6 / 8)
# topics = 8      0.322 / 0.256 / 0.312             0.197 / 0.248 / 0.366
# topics = 16     0.421 / 0.360 / 0.381             0.198 / 0.268 / 0.354
Table 1. Segmentation performance using the topic change detection method
3.6 Fixed-Length Text Segmentation Using a Naive Bayes Classifier
We cut our news stories into small passages with a fixed window. These passages were
manually classified into two classes: if a passage includes a boundary sentence, it is
labeled "yes" (there is a boundary in this passage); otherwise it is labeled "no". We
divided the dataset into two parts, one for training and the other for testing. The testing
data include CNN World View 2000 from July to October.
Our objective is then to use these two categories of data to build a classifier. Since Naive
Bayes classifiers are among the most successful known algorithms for learning to classify
text documents, we applied a Naive Bayes classifier to our problem.
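As a present-day sketch of this classifier (we actually used the Rainbow toolkit; scikit-learn is shown here only as an assumed, readily available substitute):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_nb_segmenter(passages, labels):
    """passages: fixed-length word windows as strings; labels: 'yes'/'no' per window."""
    model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
    model.fit(passages, labels)
    return model

# usage sketch: predictions = train_nb_segmenter(train_texts, train_labels).predict(test_texts)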
4. Experimental Results and Analysis
4.1 Error Metrics
Given the experimental results, how can we evaluate the performance of the different
segmentation methods? Two useful indicators are precision and recall, the conventional
information retrieval metrics. For our segmentation task:

Recall = (# true boundaries identified) / (# total true boundaries)
Precision = (# true boundaries identified) / (# boundaries identified)
Researchers have also proposed other measures for the text segmentation problem. For
fixed-length segments, Dragon [11] uses the fraction of overlap between the segment
and the relevant story as a relevance metric; for text segmentation based on language
models, [3] proposes an error metric based on the distance (in number of words) between
an identified boundary and the neighboring actual boundaries. A similar but much
simplified idea is used in our approach.
We use a sentence as the minimum unit of segmentation. In the simplest case, an
identified boundary is correct if and only if it is a true boundary. But considering our
application of interactive query, a segmentation method is almost satisfactory if it always
comes close to the true boundary. Closeness can be defined in units of words or
sentences. Here we relax our correctness criterion to accept all boundaries that are one or
two sentences off a true boundary. We call the distance between an identified boundary
and the closest true boundary DR (degree of relaxation). Figure 2 illustrates the relaxed
failure model for our sentence-based segmentation methods.
Figure 2. Failure model of the sentence-based text segmentation method (adapted from [3]). Identified boundaries are compared against reference boundaries over a sequence of sentences and scored as OK (YY0, YY1, NN), Miss (YN), or False Alarm (NY). (YY# means that under degree of relaxation #, the identified boundary is counted as correct.)
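Under these definitions, the relaxed metric can be sketched as follows (function and variable names are ours): a predicted boundary counts as correct if it falls within DR sentences of some true boundary, and a true boundary counts as found if some prediction falls within DR sentences of it.

def relaxed_metrics(predicted, reference, dr=1):
    """predicted, reference: sorted sentence indices of boundaries; dr: degree of relaxation."""
    hit_pred = sum(1 for p in predicted if any(abs(p - r) <= dr for r in reference))
    hit_ref = sum(1 for r in reference if any(abs(r - p) <= dr for p in predicted))
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_ref / len(reference) if reference else 0.0
    return precision, recall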
Table 2 shows our results with the relaxed error metric for the ANN method (10 hidden
units); the relaxed error metric helps to discount the errors introduced by false alarm
merging.
        Before merging             After merging
DR      Precision   Recall        Precision   Recall
0         0.241      0.554          0.263      0.516
1         0.290      0.666          0.331      0.648
2         0.336      0.772          0.383      0.749
Table 2. Performance of ANN segmentation
4.2 Performance Evaluation
(1) SVM: Rainbow and SVMlight
             SVMlight    Rainbow
Recall         0.07        ??
Precision      0.223       ??
Table 3. Segmentation results of Rainbow and SVMlight
(2) ANN
2.1. Impact of Training Data Distribution
According to the following data, we choose to use 50% boundary blocks in our training
data, because this distribution provides a rather high recall (71% after merging) with
acceptable precision (33% after merging). This means the average length of our segments
is 5 sentences, corresponding to about 30 seconds of news broadcast.
Figure 3. Impact of training data distribution: recall and precision (with and without merging) for training sets with %Y = 25%, 33%, 50%, and 67% boundary blocks.
2.2. Impact of Stop Word Removal
Keeping the stop words helps to improve the recall value but hurts the precision. Table 4
gives the results of the ANN with 100 input units, using 50% boundary blocks in the
training data. The same trend can be observed with different training data.
%Y = 50%,          No Stop Words   With Stop Words   No Stop Words   With Stop Words
100 input units      (No merge)       (No merge)        (Merged)        (Merged)
Recall                 0.597            0.721             0.705           0.846
Precision              0.201            0.115             0.327           0.208
Table 4. Impact of stop word removal
2.3. Impact of Number of Features (Number of Input Units)

%Y = 50%           100 Input Units   200 Input Units   100 Input Units   200 Input Units
                      (No merge)        (No merge)         (Merged)          (Merged)
Recall                  0.597             0.628              0.705             0.739
Precision               0.201             0.137              0.327             0.219
Table 5. Impact of the number of features
The effect of increasing the number of input units is the same as that of keeping the stop
words: recall improves but precision drops. The same trend can be observed with
different training data. The second reason we chose 100 input units is that it greatly
reduces the computation cost.
(3) Naive Bayes classification
             %Y = 25%    %Y = 33%    %Y = 50%
Recall        0.589       0.777       0.888
Precision     0.122       0.100       0.009
Table 6. Impact of training data distribution on the Naive Bayes segmenter
(4) Topic change detection method
              Recall (window size 4 / 6 / 8)    Precision (window size 4 / 6 / 8)
# topics = 8      0.322 / 0.256 / 0.312             0.197 / 0.248 / 0.366
# topics = 16     0.421 / 0.360 / 0.381             0.198 / 0.268 / 0.354
Table 7. Impact of window size and number of topics
The results of the topic change detection method are very different from the other
methods, with much lower recall but relatively high precision. We can say that the TCD
method is a conservative segmenter, which is not tempted to identify too many
boundaries, because it uses global information only. Such global information could be
combined with the ANN method, which uses only the information within 2 consecutive
sentences. Because the window size in the TCD method differs from the ANN and Naive
Bayes methods, we did not test voting among the different methods, but we believe
future work in this direction could improve segmentation accuracy by integrating their
strengths.
(5) Fixed length segmentation
Here are the results of fixed-length segmentation using the Naive Bayes classifier.
Correct: 55313 out of 65359 (84.63% accuracy)

                         Predicted No   Predicted Yes    Total    Acc (%)
Actual No  (class 0)         49330           7177        56507     87.30
Actual Yes (class 1)          2869           5983         8852     67.59

(Confusion details: rows are actual classes, columns are predicted; average accuracy 84.63%.)
Table 8. Fixed-length segmentation using a fixed window size (30 words)
Table 9 gives the recall and precision of this method.

                                   Recall    Precision
30 words (fixed window size)        0.68       0.45
Table 9. Recall and precision of fixed-length segmentation (30-word window)
(6) Performance of different segmentation methods
Figure 4. Segmentation accuracy: precision and recall of the SVM, ANN, NB (Naive Bayes), TCD (topic change detection), and FL (fixed length) segmenters.
The chart above shows the performance of the different methods we have tried; the best
tradeoff point between precision and recall was selected for each method and compared.
All of our methods suffer from rather low precision but achieve higher recall (although
topic change detection also has relatively low recall). We currently choose to use the
Fixed Length (FL) segmenter because it reaches the best tradeoff point among all these
methods.
Our results differ from published segmentation results, partly due to differences in the
data sets. The closed-caption transcripts we use are much noisier (with noise words,
omitted sentences, and incorrect labels), which proved more difficult to work with.
5. Conclusion
We applied some simple and traditional machine learning methods to the problem of text
segmentation. Compared with published results, we achieved slightly higher recall but
rather lower precision. This leaves a lot of room for improving our methods, for example
by integrating time series analysis with our ANN classification or using more topics to
cluster news stories. We can also combine the current methods with more sophisticated
ones (such as Dragon's approach or Hearst's algorithm), or with segmentation
information coming from other media (such as video segmentation and speech
recognition).
References
[1] Y. Yang, T. Ault, T. Pierce, and C. Lattimer, "Improving text categorization methods
for event tracking", Proceedings of the ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR), pp. 65-72, 2000.
[2] T. Mitchell, Machine Learning, McGraw-Hill, 1997.
[3] D. Beeferman, A. Berger, and J. Lafferty, "Text Segmentation Using Exponential
Models", Proceedings of Empirical Methods in Natural Language Processing (AAAI 97),
Providence, RI, 1997.
[4] M. A. Hearst and C. Plaunt, "Subtopic structuring for full-length document access",
in Proc. ACM SIGIR-93 Int'l Conf. on Research and Development in Information
Retrieval, pp. 59-68, Pittsburgh, PA, 1993.
[5] A. Merlino, D. Morey, and M. Maybury, "Broadcast News Navigation using Story
Segmentation", ACM Multimedia 1997, November 1997.
[6] A. Hauptmann and M. Witbrock, "Story Segmentation and Detection of Commercials
in Broadcast News Video", ADL-98 Advances in Digital Libraries Conference, Santa
Barbara, CA, April 22-24, 1998.
[7] D. Beeferman, A. Berger, and J. Lafferty, "Statistical Models for Text Segmentation",
Machine Learning (Special Issue on Natural Language Learning, C. Cardie and
R. Mooney, eds.), 34(1-3), pp. 177-210, 1999.
[8] M. A. Hearst, "Multi-paragraph segmentation of expository text", Proceedings of the
32nd Annual Meeting of the Association for Computational Linguistics, 1994.
[9] D. Litman and R. J. Passonneau, "Combining Multiple Knowledge Sources for
Discourse Segmentation", in Proceedings of the ACL, 1995.
[10] H. Kozima, "Text Segmentation Based on Similarity between Words", in
Proceedings of the ACL, 1993.
[11] J. Allan, J. G. Carbonell, G. Doddington, J. Yamron, and Y. Yang, "Topic Detection
and Tracking Pilot Study Final Report", Proceedings of the Broadcast News
Transcription and Understanding Workshop (sponsored by DARPA), February 1998.
Appendix A: Top 50 Features used in ANN experiments
We count the words appearing in the first and last sentences of stories respectively, so the
50 words actually come from two groups: 25 words from the ending sentences and 25
from the beginning sentences.
Without Stop Word Removal:
CNN, THE, OF, IT, TO, REPORTS, AND, THAT, IS, THIS, CAPTIONING, ON, OF, A, I, THAT, AND, IN, IT, WORLDVIEW, THEY, WE, __USA__, ON, A, __NUM__, THEY, WE, ARE, CLOSED, BY, THERE, WORLDVIEW, AT, I, DO, ADDITION, YOU, HE, BE, THERE, DO, TO, HAVE, FOR, UNITED, THIS, WILL, CLINTON, FROM, THINK

After Stop Word Removal:
CNN, REPORTS, CAPTIONING, __NUM__, CLOSED, WORLDVIEW, ADDITION, REPORTING, BELL, LONDON, PROVIDED, ORDERED, WORLDVIEW, __USA__, UNITED, CLINTON, THINK, PRESIDENT, STATES, __NUM__, CNN, SPACE, WORLD, IRAQ, HEART, JERUSALEM, ATLANTIC, WASHINGTON, WALTER, RODGERS, PEOPLE, COMMUNICATION, WHITE, THANK, CORRESPONDENT, MOSCOW, PRINCESS, DAY, DIANA, WELL, ISRAEL, NEW, JUDY, AHEAD, GET, TODAY, FIRST, MILITARY, MONEY