An Introduction to Text Classification
Toni Giorgino
1 December 2004

Final essay for the third year of the SAFI school. The work is based on the course Statistical Learning: Theory and Applications, Academic Year 2002-03. Coordinator: Prof. Giuseppe De Nicolao.
Abstract
The increasing volume of information available in digital form, and the need to organize it, have generated and progressively intensified interest in automatic text categorization. A widely used research approach to this problem is based on machine learning techniques: an inductive process builds a classifier by automatically learning the characteristics of the categories. Machine learning is more portable and less labour-intensive than the manual definition of classifiers. This essay provides an account of the most prominent features of the machine learning approach to text classification, including data preparation, attribute extraction and selection, learning algorithms and kernel methods, performance measures, and availability of training corpora.
Contents

1 Introduction
2 Supervised learning
3 Attributes
  3.1 Preprocessing
  3.2 Bag of words
  3.3 N-gram
  3.4 Attribute selection
  3.5 Discussion
4 Algorithms
  4.1 Nearest Neighbours
  4.2 Rocchio
5 Kernel based methods
  5.1 Support Vector Machines
  5.2 Kernel Nearest Neighbour
6 Data sets
  6.1 Reuters 21578
  6.2 Ohsumed
  6.3 Other
7 Performance measures
  7.1 Averaging
8 Conclusions
1 Introduction
Text classification is the problem of automatically assigning zero, one or more labels from a predefined set to a given segment of free text. The labels should be chosen to reflect the “meaning” of the text. Selecting the appropriate set of labels may be ambiguous even for a human rater; when a machine tries to mimic this human behaviour, the algorithm has to cope with a large amount of uncertainty coming from various sources.

First of all, on a purely lexicographic level, human language is ambiguous per se, including words and word combinations with multiple senses that are disambiguated by context. More importantly, the meaning of a text is itself still vaguely defined, and a matter of debate. One does not want to ask whether a computer has “understood” a text, but rather, operationally, whether it can provide a result comparable to what a human would provide (and find useful).

Three hallmarks, namely (a) reasoning in the presence of uncertainty, (b) the complexity of the task, and (c) the heuristics required to approximate human judgement, suggest that text classification belongs to the field of Artificial Intelligence [1].
2 Supervised learning
The task of assigning labels to items has been extensively studied in the broader context of Machine Learning, and a large number of algorithms and methods have been published in the literature. In the context of machine learning, the items to be classified are called examples. Each example is described by a fixed number of attributes, which are usually numeric or discrete.
Some algorithms try to identify features common to subsets of examples,
looking only at attributes and their correlations. These algorithms belong to the
class of unsupervised learning.
For text classification tasks, instead, one usually wants to reproduce a given classification scheme. The discrete label to be predicted for a given item is called its class. The classification scheme, as discussed above, is not explicitly laid out according to human-designed rules; machine learning algorithms are instead designed to learn it from training examples. In other words, they extract the implicit knowledge contained in the texts and their labels.

To train a machine learning algorithm, therefore, a human expert needs to prepare a set of texts, each of which is manually annotated with the “correct” labels. These texts, named the training set, are processed and fed into the learning algorithm along with their labels. The algorithm adjusts a set of internal parameters which, if training is successful, generalize the training set in a sensible way. Text classification problems therefore belong to the category of supervised learning.
3 Attributes
Most often, text is stored in computer form in the well-known ASCII representation (which may be inappropriate for non-English languages). In this representation, any text is encoded as a sequence of numbers, one per character, in the original order.

While this form is a faithful representation of the text and causes no loss of semantic information, it is not suited to being fed into machine learning algorithms, which have been developed to treat examples as vectors of features, or attributes.
Each example is assumed to have a fixed number of attributes, and each attribute is supposed to have a specific meaning (say, age, text length, etc.) across all examples. Some algorithms, such as Support Vector Machines, perform well even in the presence of thousands of features.
The problem then arises of encoding texts in a suitable way, i.e. of deciding on relevant features and extracting them from the ASCII texts. Feature extraction is of paramount importance, since the chosen attributes will be the only input on which the class decision is based.

Two of the most widely used approaches for generating feature vectors from text for information retrieval, bag of words and n-grams, are described in the following.
3.1 Preprocessing
The preparation of input texts for processing is also known in the machine learning context as “data cleaning”.

One often-applied transformation of the input text is the substitution of characters outside the usual 26-letter English alphabet with a single space. Multiple spaces are then reduced to one, and upper-case letters are folded into lower case. These mappings make any punctuation indistinguishable from white space, which is accepted as an inconsequential loss of information.

This simple transformation is only appropriate for the English language. For other languages, more complex transformation rules are needed; for European languages, e.g., accented characters may be replaced with their unaccented equivalents. More specifically, non-English transformations depend on the encoding of the specific text, the language, and several other factors (known collectively as the locale).
On a more semantic level, a transformation often applied is stemming, which reduces each word to the base form from which it originates. Stemming is clearly language dependent; it maps, e.g., computer into comput and going into go. A widely known stemming algorithm for English was developed by Porter [2], with variants for several European languages.

Before or after stemming, words such as conjunctions can be removed altogether. This transformation is known as stop-word cancellation.
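As an illustration, the following Python sketch chains these cleaning steps: character normalization, case folding, stop-word cancellation, and stemming. The tiny stop-word list and the use of NLTK's Porter stemmer are assumptions made for the example, not prescribed by the text.

```python
import re

from nltk.stem import PorterStemmer  # assumes the NLTK package is installed

# A tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "are"}

stemmer = PorterStemmer()

def clean(text: str) -> list[str]:
    # Replace anything outside the 26-letter alphabet with a space,
    # collapse runs of spaces, and fold everything to lower case.
    text = re.sub(r"[^A-Za-z]+", " ", text).strip().lower()
    # Stop-word cancellation followed by Porter stemming.
    return [stemmer.stem(w) for w in text.split() if w not in STOP_WORDS]

print(clean("The computers are going to... COMPUTE!"))  # ['comput', 'go', 'comput']
```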
3.2 Bag of words
One widely used method for feature extraction is the bag of words approach (figure 1). Each text is scanned, and word frequencies are calculated with reference to a known vocabulary; the vector of word frequencies is then used to characterize the given text [3].

A standard enhancement to the bag of words approach is to weight the count of each word w by its inverse document frequency Iw:

Iw = log(n / fw)

where n is the total number of documents in the corpus, and fw is the number of documents in which word w appears. The purpose of this weighting is to enhance the contribution of uncommon words with respect to more widely found ones, such as conjunctions and adverbs.

Applied together with word stemming, the bag of words approach works well in practice, even though the representation completely discards any information contained in the word ordering.
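A minimal sketch of this weighting, assuming tokenized documents (e.g. produced by the clean() helper of section 3.1) and a small in-memory corpus; all names are illustrative:

```python
import math
from collections import Counter

def bag_of_words(docs: list[list[str]]) -> list[dict[str, float]]:
    """Turn tokenized documents into IDF-weighted word-count vectors."""
    n = len(docs)
    # f_w: the number of documents in which word w appears.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Weight each raw count by I_w = log(n / f_w).
        vectors.append({w: c * math.log(n / doc_freq[w])
                        for w, c in counts.items()})
    return vectors

corpus = [["wheat", "grain", "trade"],
          ["trade", "interest", "money"],
          ["wheat", "crude"]]
for vector in bag_of_words(corpus):
    print(vector)
```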
3.3 N-gram
The n-gram approach provides a natural representation that takes into account the order of characters in the input text [4, 5]. Unlike bag of words plus stemming, the n-gram approach can be applied without prior knowledge of the language of the input.

The input sequence is broken down into n-plets of consecutive characters. For example, when n = 2 the feature vector contains as many elements as there are possible pairs of characters; counting the spacer symbol (section 3.1), the feature space of character bigrams has 27² elements. Each attribute counts how many times that specific n-gram appears in the given input document. For example, the five-character text “to me” contains the four bigrams (“to”, “o ”, “ m”, “me”), or the three trigrams (“to ”, “o m”, “ me”).
The choice of n is an interesting tradeoff. Good results are often quoted in the literature for n = 4, but increasing n will not improve classification accuracy indefinitely. As n grows, the dimensionality of the feature space grows exponentially, yet most of these features will be zero for all documents. Large n-grams, in fact, become increasingly specific, and very soon so specific that each of them is found in only a single text (think of them as whole sentences). All of the examples to be categorized then become very long vectors composed of zeroes and a few ones. No vector has features in common with another, which means that all documents are orthogonal to each other and no useful “distance” can be calculated.
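A sketch of character n-gram extraction over a cleaned text (the function name is illustrative):

```python
from collections import Counter

def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count the character n-grams of a cleaned input text."""
    # Slide a window of length n over the text; each window is one n-gram.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

print(char_ngrams("to me", 2))  # Counter({'to': 1, 'o ': 1, ' m': 1, 'me': 1})
print(char_ngrams("to me", 3))  # Counter({'to ': 1, 'o m': 1, ' me': 1})
```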
3.4 Attribute selection
Both the bag of words and the n-gram approach represent a given text as a vector with many features. The number of features in the vector is known as the dimension of the input space. There may be computational or theoretical reasons to reduce the dimensionality of the input space, i.e. to discard some of the features from all examples. Computational reasons for feature selection include keeping the computation within practical memory or time requirements. Theoretical reasons may apply, for example, because certain algorithms are known to perform poorly in the presence of many irrelevant features (which do not help class prediction) or redundant features (strongly correlated with each other).
One simple approach to feature selection, document frequency thresholding, is based on the document frequency fw introduced above. One sorts words (or whatever features are used) by the number of documents they appear in, and disregards terms for which fw is below a certain threshold. The basic assumption is that words appearing in few documents do not carry useful information, i.e. they are either non-informative for category prediction or not influential on global performance. An improvement in categorization accuracy is even possible, when the rare terms happen to be noise terms.
A more sophisticated criterion formally takes into account the information contained in a given feature: it measures the number of bits of information obtained, per category, by knowing the presence or absence of a term in a document. The information gain builds on the definition of entropy. Suppose a binary classification in which the label may be + or −, with probabilities p+ and p− respectively. The entropy of this document set, i.e. the minimum theoretical number of bits required to store the class of one instance, is

H = −(p+ log p+ + p− log p−)

Now suppose that a term t is present in a fraction s of the documents. Within this document subset, the entropy is

Ht = −(pt+ log pt+ + pt− log pt−)

where pt+ indicates the fraction of documents belonging to class +, considering only those containing the term t. Analogously, one evaluates the entropy Ht̄ of the documents which do not contain term t. How much was the entropy reduced by the knowledge of t? The answer is obtained by evaluating

G = H − [s Ht + (1 − s) Ht̄]

This is the information gain provided by the knowledge of the presence or absence of term t. Features are then selected which have an information gain above a certain threshold. As one would expect, the information gain G is zero for words which appear in no documents or in all of them. The definition given above is easily extended to the multiple-class case.
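The following sketch computes this binary information gain from the class labels and a term's presence indicators; the function and variable names are illustrative:

```python
import math

def entropy(pos: int, total: int) -> float:
    """Entropy of a binary-labelled set with `pos` positive members."""
    h = 0.0
    for k in (pos, total - pos):
        if 0 < k < total:
            p = k / total
            h -= p * math.log2(p)
    return h

def information_gain(labels: list[bool], has_term: list[bool]) -> float:
    n = len(labels)
    with_t = [l for l, t in zip(labels, has_term) if t]
    without_t = [l for l, t in zip(labels, has_term) if not t]
    s = len(with_t) / n  # fraction of documents containing the term
    # G = H - [s * H_t + (1 - s) * H_t-bar]
    return (entropy(sum(labels), n)
            - s * entropy(sum(with_t), len(with_t))
            - (1 - s) * entropy(sum(without_t), len(without_t)))

# A term perfectly aligned with the class yields the maximal gain of 1 bit.
print(information_gain([True, True, False, False],
                       [True, True, False, False]))  # 1.0
```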
3.5 Discussion
Several other feature selection techniques have been proposed and evaluated; a comprehensive review is given in [6]. That paper quotes results for two classifiers and two different domains (Ohsumed and Reuters, see section 6), which show consistently that with the bag-of-words approach, reducing the feature space to 10% of its original size is appropriate and does not affect classification performance (or improves it slightly). Classification performance is thus rather robust with respect to feature selection: most feature selection methods yield results which are unaffected by a reduction of the dimensionality from 16000 to 2000 features. Information gain feature selection, in particular, performs equally well with as few as 200 features left.
This might mean either that there are relatively few highly relevant features, which are easily found by all selection algorithms, or that many features are equally suited to produce a good classification. The latter turns out to be the case, as can be gathered from figure 2, reproduced from reference [3], in which feature selection was performed in bands: all features were ranked according to their binary mutual information, and a naive Bayes classifier was then trained using only the features ranked 1-200, 201-500, 501-1000, 1001-2000, 2001-4000, and 4001-9947. The results show that even the lowest-ranked features still contain considerable information and are somewhat relevant: using the 50% “least desirable” features, for example, still yields much better prediction than random class assignment.
4 Algorithms
This section briefly reviews two algorithms which have shown state-of-the-art performance on text classification [7]. An intuitive description of the idea underlying each algorithm is provided, leaving the details to the relevant literature. Other important classification algorithms exist, such as naive Bayes [8] and the decision tree classifier [9], which are not discussed here.
4.1 Nearest Neighbours
This rather intuitive classification mechanism has shown good performance on text categorization tasks. The idea is to evaluate a “similarity measure” between the example to be classified and the existing ones; the algorithm formalizes the intuition that the class of an unseen example is likely to be the same as that of the closest known instances. The degree of similarity is defined according to a suitable criterion: when examples are identified by a vector, the euclidean distance may be used.
Figure 1: The bag-of-words approach to feature extraction. On the right, part of the language-dependent dictionary (specs, graphics, unix, quicktime, ...) is shown along with the word-count feature vector representing the text on the left, a comp.graphics Usenet post. In this example word stemming (section 3.1) is not used.
Figure 2: Learning without the “best” features. Precision/recall breakeven point of a naive Bayes classifier trained using only the features ranked 1-200, 201-500, 501-1000, 1001-2000, and 2001-4000 by mutual information. Random-guess classification is shown as a dotted line.
Classification accuracy is generally improved when one considers not only the single closest instance, but rather the k closest examples, with a majority rule (figure 3).
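A minimal k-nearest-neighbour sketch over sparse word-count vectors like those of section 3.2, using euclidean distance and a majority vote; all names are illustrative:

```python
import math
from collections import Counter

def euclidean(v: dict[str, float], w: dict[str, float]) -> float:
    keys = set(v) | set(w)
    return math.sqrt(sum((v.get(k, 0.0) - w.get(k, 0.0)) ** 2 for k in keys))

def knn_classify(train: list[tuple[dict[str, float], str]],
                 x: dict[str, float], k: int = 3) -> str:
    # Sort the training examples by distance to x, then vote among the k closest.
    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [({"wheat": 2.0, "grain": 1.0}, "grain"),
         ({"money": 1.5, "interest": 1.0}, "money"),
         ({"grain": 2.0}, "grain")]
print(knn_classify(train, {"wheat": 1.0, "grain": 1.0}))  # 'grain'
```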
4.2 Rocchio
Another widely used method in information retrieval is the Rocchio algorithm [10]. In summary, it vectorially averages all of the normalized positive examples and subtracts the average of the negative ones, thus finding a sort of “prototype vector” for the positive class. An unseen example can then be classified according to how close it is to the computed prototype, using a threshold (figure 4).
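A sketch of the Rocchio prototype on dense feature vectors; the names and the zero-threshold convention are illustrative:

```python
import numpy as np

def rocchio_prototype(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """pos, neg: matrices whose rows are the extracted feature vectors."""
    normalize = lambda m: m / np.linalg.norm(m, axis=1, keepdims=True)
    # Average the normalized positives, subtract the averaged negatives.
    return normalize(pos).mean(axis=0) - normalize(neg).mean(axis=0)

def rocchio_classify(proto: np.ndarray, x: np.ndarray,
                     threshold: float = 0.0) -> bool:
    # Closeness to the prototype, measured here by the dot product.
    return float(np.dot(proto, x / np.linalg.norm(x))) > threshold
```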
5 Kernel based methods
Sometimes one may find it useful to classify vectors not in their original form, but after transformation through an arbitrary function φ. This may be desirable, for example, because the transformed space is amenable to easier classification, such as linear separation. Say that φ maps the original feature space S into the transformed space T, with the original vectors v, w belonging to S.

An interesting situation arises because some machine learning schemes depend on the input vectors only through their scalar products. When this is the case, during their operation they treat input vectors only through scalar products of the form

K(v, w) = φ(v) · φ(w).
K is called a kernel function, and appears as a function of v and w. In practice, however, we do not need to explicitly devise a mapping φ, nor do we carry out the inner product in T (which might be infinite-dimensional). What we do is simply choose a real-valued kernel function K of two variables on S × S.
Figure 3: The k nearest neighbour algorithm, when the attribute space has two
dimensions. An unknown example is classified to have the same label ω as the
closest (or the majority of the k closest) examples.
Figure 4: Rocchio: positive normalized examples are averaged; negative normalized examples are also averaged, and subtracted from the former.
For K to be an inner product in the transformed space, it suffices to ensure that it satisfies the Mercer condition [11]. The single-variable transformation function φ need not be known explicitly.
Kernel methods are interesting because they allow one to calculate distances between input vectors with arbitrary algorithms or functions of pairs of vectors. The transformation will hopefully generate an image of the input space which is separated by simpler decision surfaces. In practice, some parametric kernels are commonly used: the polynomial kernel K = (1 + x · y)^p, the radial basis function kernel K = exp(−α|x − y|²), and the sigmoid kernel K = tanh(αx · y + β).

It is worth repeating, however, that K need not be an algebraic function of x and y. A novel kernel, for example, has recently received remarkable interest. Originally proposed for bioinformatics applications, it measures the degree of similarity between two texts (input vectors) by comparing their common subsequences, allowing for character insertions and deletions. The input space S for this algorithm is the texts themselves, and it is called the string subsequence kernel [12].
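The three parametric kernels can be written directly on dense vectors; a sketch with arbitrary illustrative parameter values:

```python
import numpy as np

def polynomial_kernel(x, y, p=2):
    return (1.0 + np.dot(x, y)) ** p

def rbf_kernel(x, y, alpha=0.5):
    return np.exp(-alpha * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, alpha=0.1, beta=0.0):
    return np.tanh(alpha * np.dot(x, y) + beta)

x, y = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 1.0])
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```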
5.1 Support Vector Machines
Methods depending only on the scalar products of the input vectors, i.e. on the Gram matrix, are called kernel methods. Kernel methods were popularized by Support Vector Machines (SVMs) [13]. An SVM is an algorithm which computes the linear separation surface with the maximum margin for a given training set (figure 5). Only a subset of the input vectors influences the choice of the margin (circled in the figure); such vectors are called support vectors. When a linear separation surface does not exist, for example in the presence of noisy data, SVM algorithms with slack variables are appropriate [14].
SVM methods have been shown, both theoretically and experimentally, to be excellent on text classification tasks. Reference [3] shows SVMs outperforming other methods on most categories of the Reuters and Ohsumed datasets, without the need for parameter tuning. (Although not clearly indicated, the experiments seem to have been performed with bag-of-words feature vectors and no feature selection.)
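As a present-day illustration, not part of the original experiments, a linear SVM can be trained on bag-of-words vectors with scikit-learn (an assumed dependency; the toy corpus and labels are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder corpus; a real experiment would use e.g. Reuters-21578.
texts = ["wheat prices rise on export news",
         "bank raises interest rates again",
         "grain harvest beats expectations",
         "central bank discusses money supply"]
labels = ["grain", "money", "grain", "money"]

vectorizer = TfidfVectorizer()        # bag of words with IDF weighting
X = vectorizer.fit_transform(texts)
classifier = LinearSVC().fit(X, labels)

test = vectorizer.transform(["interest rates and money markets"])
print(classifier.predict(test))       # expected: ['money']
```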
5.2 Kernel Nearest Neighbour
A simple modification of the k nearest neighbour algorithm has recently been proposed that turns it into a kernel method [11]. If we express the euclidean distance between two vectors x and y transformed through φ, we get

d²(φ(x), φ(y)) = |φ(x) − φ(y)|²
              = (φ(x) − φ(y)) · (φ(x) − φ(y))
              = φ(x) · φ(x) + φ(y) · φ(y) − 2 φ(x) · φ(y)

which, recalling the definition K(x, y) = φ(x) · φ(y), means

d²(φ(x), φ(y)) = K(x, x) + K(y, y) − 2K(x, y).
Figure 5: Support vector machines find the separating hyperplane h which maximises the distance from the positive and negative training examples. Support vectors are circled.
In summary, the norm distance in the image feature space can be calculated using a kernel function and the input vectors in the original feature space. When the nearest-neighbour algorithm is applied in the transformed space, one obtains a kernel nearest-neighbour classifier (KNN).
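The resulting distance is a one-liner given any kernel K; a sketch, with a guard against small negative values caused by floating-point error:

```python
import math

def kernel_distance(x, y, K):
    # d^2(phi(x), phi(y)) = K(x, x) + K(y, y) - 2 K(x, y)
    return math.sqrt(max(0.0, K(x, x) + K(y, y) - 2.0 * K(x, y)))
```

Substituting this distance for the euclidean one in a nearest-neighbour rule, such as the sketch in section 4.1, yields the kernel nearest-neighbour classifier.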
6 Data sets
Supervised learning schemes pose the problem of finding sufficient amounts of training text which have been properly and reliably annotated. The task of manually assigning labels is time-consuming, and by definition needs to be performed (at least partially) by humans; annotated data sets may therefore be expensive to obtain. Another peculiar problem with data sets is the repeatability of the assessment of algorithms: to compare classification performance among algorithms developed by different researchers, one wants to disregard performance variations due to the particular choice of examples.

Luckily, both availability and repeatability can be addressed by using preannotated sets, often available on the Internet free of charge for research purposes. Most of these sets have been collected by interested institutions and later made available under a non-restrictive distribution license. The texts collected in a corpus tend to be homogeneous, usually reflecting the activity of the original compiler.
Several institutions maintain collections of data sets potentially useful for data mining purposes, for example the University of California, Irvine (UCI Machine Learning Archive) [15]. The open source software project R, besides being a powerful statistics package, also includes a collection of data sets [16].

While data for instance classification is relatively widely available, less extensive data sets exist for text classification. The reason is that collections of instance data can be reasonably anonymized in several ways, so it is more likely that their owners will be willing to release them free of ownership claims. On the contrary, most text available in electronic form is subject to the intellectual property rights of its authors, and cannot be reproduced without permission. Luckily, there are exceptions to this rule, and data sets for text classification, called corpora, are indeed available.
6.1 Reuters 21578
One of the most widely examined text corpora for text classification is known as Reuters-21578; it comes from the Carnegie Group, Inc. and Reuters, Ltd. It is a collection of 21578 real-world news stories and news-agency headlines in the English language, approximately 25 megabytes in total. Most of the stories are annotated with zero or more topics according to their economic subject; other (orthogonal) annotation categories are present as well, such as people, places, organisations, etc.

Each of the annotation categories can be chosen for a prediction task, but topics is preferred in the existing literature because it is more abstract. People, places and organisations can likely be found by spotting the corresponding name in the story text: typically, a document assigned to a category from one of these sets explicitly includes some form of the category name in the document's text (something which is usually not true for topics categories). However, not all documents containing a named entity corresponding to the category name are assigned to that category, since the entity was required to be a focus of the news story [17]. Thus these proper-name categories are not as simple to assign correctly as might be thought.
Each text may be given one, more, or zero category labels. A “negative” example for a given category is a text to which that category has not been assigned; texts with no category labels act as negative examples for all categories. As is apparent from figure 6, the largest share of the stories (47%) is a negative example for all categories; almost all of the remaining texts have exactly one topic (44%), and the remaining 9% have two or more labels. In summary, the data set is unbalanced towards negative examples. The correctness of so many unlabelled documents has been questioned [18], so restricting the training set to documents with at least one label may be a good choice. Figure 7 shows the frequencies of the most frequent categories; the most represented is the topic earn, which was assigned to 17% of the labelled documents.

State-of-the-art classification algorithms (SVM, Rocchio, naive Bayes) tend to achieve high accuracy on the Reuters corpus: microaveraged breakeven-point performance is in the order of 80-85% [7], and Reuters-21578 is thus considered an “easy” data set. Performance measures are discussed in section 7.
6.2 Ohsumed
The Ohsumed collection [19] is a clinically-oriented MEDLINE subset consisting of 348,566 references (out of a total of over 7 million), covering all references from 270 medical journals over a five-year period (1987-1991). The test database is about 400 megabytes in size. A number of fields normally present in a MEDLINE record but not pertinent to content-based information retrieval have been deleted; the only fields present are the title, abstract, MeSH indexing terms, author, source, and publication type.

Label prediction on the Ohsumed data set is harder than on Reuters-21578, arguably due to the specialist nature of the articles, which assume vast background knowledge of the field from their target readers. Microaveraged breakeven-point performance for state-of-the-art algorithms on this data set is in the order of 65% [7].
Figure 6: Number of topic labels per story in the Reuters-21578 dataset.
Figure 7: Topic frequencies (fraction of stories) for the top categories in the Reuters-21578 dataset: earn, acq, money-fx, crude, grain, trade, interest, wheat, …
6.3 Other
The DigiTrad (Digital Tradition) database contains folk songs collected from several countries, with annotations added to facilitate song retrieval. The texts typical of DigiTrad are very different from those in Reuters-21578: sentence structure is loose and often non-existent; clever, flowery and rhyming language is generally preferred over clarity; songs are often written in dialect; and the writers of the songs tend to be indirect, often skirting around a topic without explicitly mentioning it. The DigiTrad labels are therefore consistently very hard to predict by automated learning [20].
A newer data set from Reuters, the RCV1 Reuters Corpus, is likely to supersede the Reuters-21578 data set in the future. This corpus greatly expands the previous one, including over 800,000 stories and 2.5 GB of text. The corpus is, however, not readily available for download (a special request must be filed with NIST).
7 Performance measures
Obtaining a performance measure for classification is important in order to tune the parameters of algorithms and kernels, and to compare methods and data sets.

Performance measures build upon the concepts of precision p and recall r. The former is the probability that a document predicted to be in class “+” truly belongs to that class; the latter is the probability that a document belonging to class “+” is classified into it. When a single performance measure is desired, the harmonic mean of precision and recall,

F1 = 2 / (1/p + 1/r)

is sometimes quoted as the performance for each specific class.

Many algorithms have adjustable parameters, such as a confidence threshold, upon which precision and recall depend. When this is the case, a single performance measure can be obtained by finding the precision-recall breakeven point, i.e. the (interpolated) value of p obtained by varying the parameter until p = r.
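For a single class these quantities reduce to counts of true and false positives and negatives; a sketch with illustrative names:

```python
def precision_recall_f1(predicted: list[bool], actual: list[bool]):
    tp = sum(p and a for p, a in zip(predicted, actual))      # true positives
    fp = sum(p and not a for p, a in zip(predicted, actual))  # false positives
    fn = sum(not p and a for p, a in zip(predicted, actual))  # false negatives
    p = tp / (tp + fp) if tp + fp else 0.0   # precision
    r = tp / (tp + fn) if tp + fn else 0.0   # recall
    f1 = 2.0 / (1.0 / p + 1.0 / r) if p and r else 0.0  # harmonic mean
    return p, r, f1
```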
7.1 Averaging
Both F1 and the breakeven point apply to single-class classification problems, i.e. problems in which each example is either positive or negative. For text classification, except in the very special case of a single label, the p and r values of the various classes must be combined.

There are two possible approaches. Macro-averaged precision averages, over all classes, the per-class fraction of retrieved documents which are relevant:

PM = (1/N) Σi=1..N (relevant documents retrieved for class i) / (documents retrieved for class i)

where N is the number of classes. Macro averaging is therefore simply the average precision over the classes, and the precision values of all classes contribute equally to it. Macroaveraging has a problem if no document is retrieved for some class, which makes PM undefined.
The other way to compute an overall measure of performance is to first sum the counts of retrieved and of relevant retrieved documents over all classes, and divide afterwards. The micro-averaged precision is computed as

pm = (relevant documents retrieved in all classes) / (total documents retrieved)

Thus microaveraging circumvents the problem of empty sets, and causes every individual document to have an equal influence on the result. A very comprehensive review of the subject may be found in [21].
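A sketch contrasting the two averages, starting from per-class counts of retrieved and relevant-retrieved documents (names illustrative):

```python
def macro_precision(per_class: list[tuple[int, int]]) -> float:
    """per_class: (relevant_retrieved, retrieved) counts for each class."""
    # Average of per-class precisions; undefined if a class retrieves nothing.
    return sum(rel / ret for rel, ret in per_class) / len(per_class)

def micro_precision(per_class: list[tuple[int, int]]) -> float:
    # Pool the counts over all classes first, then divide once.
    total_rel = sum(rel for rel, _ in per_class)
    total_ret = sum(ret for _, ret in per_class)
    return total_rel / total_ret

counts = [(8, 10), (1, 2), (30, 40)]    # three classes
print(macro_precision(counts))  # (0.8 + 0.5 + 0.75) / 3 ≈ 0.683
print(micro_precision(counts))  # 39 / 52 = 0.75
```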
8 Conclusions
The widespread availability of digital data has driven interest in automated text labelling for information retrieval. Text classification is a large field which borrows several ideas from the neighbouring subjects of machine learning and language modelling. This short essay has provided an overview of several concepts and techniques currently used in text classification; the topics examined include data cleaning, feature extraction and selection, machine learning schemes, and performance measures.

Further discussion can be found in the vast existing literature on the subject, which builds upon the concepts discussed here. A partial list of subjects under current investigation includes the incorporation of “common sense” knowledge through the use of hypernyms and hyponyms [22], efficient implementations of existing algorithms [23, 24], and approximations of kernels for more efficient computation [25].

The availability of efficiently implemented kernel methods and approximation schemes makes the development of new kernels, especially kernels suited to text comparison [12], a particularly stimulating area, amenable to yielding results that can prove useful in other domains such as bioinformatics and multimedia retrieval.
References
[1] E. Motta, Reusable Components for Knowledge Modelling: Case Studies in
Parametric Design Problem Solving. IOS Press, 1999.
[2] M. Porter, “An algorithm for suffix stripping,” Program, vol. 14, no. 3, pp. 130–137, 1980.
[3] T. Joachims, “Text categorization with support vector machines: learning with many relevant features,” in Proceedings of ECML-98, 10th European Conference on Machine Learning (C. Nédellec and C. Rouveirol, eds.),
no. 1398, (Chemnitz, DE), pp. 137–142, Springer Verlag, Heidelberg, DE,
1998.
[4] W. B. Cavnar and J. M. Trenkle, “N-gram-based text categorization,” in
Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis
and Information Retrieval, (Las Vegas, US), pp. 161–175, 1994.
[5] W. B. Cavnar, “Using an n-gram-based document representation with a vector processing retrieval model,” in TREC, pp. 269–277, 1994.
[6] Y. Yang and J. O. Pedersen, “A comparative study on feature selection in
text categorization,” in Proceedings of ICML-97, 14th International Conference on Machine Learning (D. H. Fisher, ed.), (Nashville, US), pp. 412–420,
Morgan Kaufmann Publishers, San Francisco, US, 1997.
[7] T. Joachims, “Text categorization with support vector machines: learning
with many relevant features,” tech. rep., University of Dortmund, Fachbereich Informatik, 1997.
[8] J. Demsar, B. Zupan, M. Kattan, J. Beck, and I. Bratko, “Naive Bayesian-based nomogram for prediction of prostate cancer recurrence,” Stud Health Technol Inform, vol. 68, pp. 436–441, 1999.
[9] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann,
1993.
[10] J. J. Rocchio, The SMART Retrieval System, Experiments in Automatic
Document Processing. Prentice Hall, 1971.
[11] K. Yu, L. Ji, and X. Zhang, “Kernel nearest-neighbor algorithm,” Neural
Process. Lett., vol. 15, no. 2, pp. 147–156, 2002.
[12] H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. J. C. H. Watkins, “Text
classification using string kernels,” in NIPS, pp. 563–569, 2000.
[13] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167,
1998.
[14] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning,
vol. 20, no. 3, pp. 273–297, 1995.
[15] C. Blake and C. Merz, “UCI repository of machine learning databases,”
1998.
[16] R Development Core Team, R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria, 2004.
ISBN 3-900051-07-0.
[17] P. J. Hayes and S. P. Weinstein, “CONSTRUE/TIS: a system for content-based indexing of a database of news stories,” in Second Annual Conference on Innovative Applications of Artificial Intelligence, 1990.
[18] Y. Yang, “An evaluation of statistical approaches to text categorization,”
Information Retrieval, vol. 1, no. 1/2, pp. 69–90, 1999.
[19] W. R. Hersh, C. Buckley, T. J. Leone, and D. H. Hickam, “Ohsumed: An
interactive retrieval evaluation and new large test collection for research,”
in Proceedings of the 17th Annual International ACM-SIGIR Conference on
Research and Development in Information Retrieval. Dublin, Ireland, 3-6
July 1994 (Special Issue of the SIGIR Forum) (W. B. Croft and C. J. van
Rijsbergen, eds.), pp. 192–201, ACM/Springer, 1994.
[20] S. Scott, “Feature engineering for a symbolic approach to text classification,”
Master’s thesis, Ottawa, CA, 1998.
[21] F. Sebastiani, “Machine learning in automated text categorization,” ACM
Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002.
[22] S. Scott and S. Matwin, “Feature engineering for text classification,” in Proceedings of ICML-99, 16th International Conference on Machine Learning
(I. Bratko and S. Dzeroski, eds.), (Bled, SL), pp. 379–388, Morgan Kaufmann Publishers, San Francisco, US, 1999.
[23] T. Joachims, “Making large-scale support vector machine learning practical,” in Advances in Kernel Methods: Support Vector Learning (B. Schölkopf, C. J. C. Burges, and A. J. Smola, eds.), MIT Press, Cambridge, MA, 1998.
[24] A. K. McCallum, “Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering.” http://www.cs.cmu.edu/~mccallum/bow, 1996.
[25] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola, “Input space vs. feature space in kernel-based methods,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1000–1017, 1999.