Feature Extraction for Genre Classification

Eva Forsbom
evafo@stp.ling.uu.se
Uppsala University/
Graduate School of Language Technology
GSLT: Statistical Methods
Teacher: Joakim Nivre
Spring 2005
Abstract
In this paper, we describe an exploratory experiment on genre classification of Swedish texts, using as predictors various subsets of a base lemma
vocabulary previously derived from the Stockholm-Umeå Corpus. The purpose of this particular experiment was to find the subset that gives the best
classification or most useful features.
Three dimensions of the base vocabulary were investigated: size of feature set, genre vs. domain category set, and dispersion. Decision trees grown
for the various subsets did not predict the genres particularly well, partially
because of sparse data and a skewed distribution of genres in the corpus, although they outperformed both baselines: uniform genre probability and
majority genre. However, the experiment suggested that the often-used set of
the 50 most frequent function words is well motivated for genre classification
if word frequency is used and a small and easy-to-get set is needed.
1 Introduction
Many natural language processing applications could benefit from knowing what
genre a document belongs to when processing it, e.g. for lexical and syntactic
disambiguation, and for relevance ranking. In our case, we would ultimately like
to use knowledge of genre as guidance in discourse parsing of texts. The purpose
of the experiment here, though, is to find the subset of a base vocabulary that gives the best classification or the most useful features.
There exist a multitude of definitions of what a ’genre’ is. It is often used
as a rather loose umbrella term for ’registers’, ’style’, ’text types’, and sometimes
even ’domain’. Swales (1990, p. 58) defines a genre to be a class of communicative
events where there is some shared set of communicative purposes, so that instances
of a genre will have some similarity in form or function.
Biber (1995, p. 9f) uses ’register’ as a synonym to ’genre’, with roughly the
same meaning as Swales, i.e. defined by external criteria and commonly recognised. ’Text type’ on the other hand, he sees as purely internal, and defined only by
linguistic criteria. He also sees text type categorisation as a prerequisite for genre
classification, rather than the opposite.
Karlgren (2000, p. 30) defines ’style’ as a consistent and distinguishable tendency to make choices in organising the material, and between synonyms and syntactic constructions. He aims to find ’functional styles’ that can be used to classify a text into a genre (to predict the usefulness of a retrieved text in information retrieval).
The EAGLES preliminary recommendations for text typology annotation of corpora (EAGLES, 1996) make use of 3 major external (E) and 2 major internal (I) criteria for text classification:
E.1.origin — matters concerning the origin of the text that are thought to affect
its structure or content.
E.2.state — matters concerning the appearance of the text, its layout and relation
to non-textual matter, at the point when it is selected for the corpus.
E.3.aims — matters concerning the reason for making the text and the intended
effect it is expected to have.
I.1.topic — the subject matter, knowledge domain(s) of the text.
I.2.style — the patterns of language that are thought to correlate with external
parameters.
Melin and Lange (2000) define ’text type’ as the professional writers’ collective knowledge of how to adjust to the given conditions in a certain pragmatic or
productional situation, in the systemic functional linguistics tradition of Halliday
(1994). They reserve ’genre’ for historically established and strongly conventionalised text types, and ’register’ for a professional or social language usage; distinctions that are often used elsewhere in the literature as well.
Argamon and Dodick (2004a,b) also view ’genre’ in the systemic tradition.
’Style’ (and ’attitude’) is defined as a general preference for certain choices in
the network of possible choices for the same representational meaning, and ’register’ as differences between systemic preferences across genres, in terms of ’mode’
(communication channel), ’tenor’ (social relation between participants), and ’field’
(discourse domain).
Following these definitions, we will view ’genre’ as a functional rather than a
topical classification. For our purposes, we would like to find linguistic cues to the
author’s intention of a certain unit in the text. We assume that the intentions are
formulated by means of relational expressions, e.g. function words, mental verbs
and attitudinal expressions.
The paper is organised as follows: In Section 2, related work on genre classification and features used therein is described. Details on the experimental setup,
including descriptions of the feature pool, corpus, and method used, are given in
Section 3, and the outcome is discussed in Section 4. Finally, some concluding
remarks sum up the paper (Section 5).
2 Genre classification
Genre classification is related to, for example, text categorisation, author attribution and identification, but the classification tasks focus on different aspects of
language (Manning and Schütze, 1999, p. 575). While text categorisation is about classifying texts by topic or theme, genre and author classification are about distinguishing different styles. In genre classification, it is the functional style that
matters, and in author classification, the issue is the individual variation within a
functional style (van Halteren et al., 2005).
Since genre classification is supposed to be based on functional variations of
style, frequencies of function words have often been used as the feature set for
genre classification, as well as author classification (Lebart et al., 1998, p. 167f).
In particular, the highest-frequency function words have been used, alone or as a
complement to other stylometrics (Baayen, 2001, p. 214).
Biber (1995) used 67 linguistic criteria in a multidimensional analysis of the
given registers, as well as some of the subcategories, in the Lancaster-Oslo/Bergen
(LOB) and London-Lund corpora to define 7 English text types along 5 dimensions: involved vs. informational production, narrative vs. non-narrative discourse,
situation-dependent vs. elaborated reference, overt expression of argumentation,
abstract vs. non-abstract style. The 67 criteria belong to 16 major grammatical and
functional categories: tense and aspect markers, place and time adverbials, pronouns and pro-verbs, questions, nominal forms, passives, stative forms, subordination features, prepositional phrases, adjectives and adverbs, lexical specificity,
lexical classes, modals, specialised verb classes, reduced forms and discontinuous
structures, co-ordination, and negation.
Like Biber, Kessler et al. (1997) studied dimensions of register, or facets, rather
than genre. They categorise previously used measures into 4 levels: structural, lexical, character-level, and derivatives of the other levels, such as ratios (e.g. words per sentence). For their experiment involving logistic regression and two
kinds of neural networks, they used 55 measures from the last three levels to categorise texts from the Brown corpus into three categoric facets and their sublevels:
Brow, Narrative, and Genre.
Karlgren (2000, with Cutting 1994) used a subset of Biber’s 67 features, i.e.
the ones that could be computed using only a part-of-speech tagger. They evaluated
the classifiers, based on the two first functions from discriminant analysis, on the
Brown corpus.
Santini (2004) also used part-of-speech tagged measures: trigrams, bigrams,
and unigrams. She trained the classifiers with a Naïve Bayes algorithm and kernel
density estimation, and evaluated them on ten spoken and written genres from the
British National Corpus. Trigrams were found to have strong discriminative power.
In their review of previously-used measures, Stamatatos et al. (2000a) also
mention various measures of vocabulary richness, but conclude that they are too
dependent on text length to be of much use. In their own experiment, they decided
to use output from a general-purpose text analysis tool, SCBD, such as sentence
and chunk boundaries, but also intermediate information from within the tool, such
as number of alternative analyses and number of failures. For training their classifiers, they chose multiple regression and discriminant analysis, and evaluated the
classifiers on a Greek corpus extracted from the web. Results were encouraging as
the classifiers outperformed existing lexically-based methods.
They also tried to find out the minimum size of a corpus for training genre
classifiers, and found that for homogeneous categories, 10 texts per category were
enough, and that the lower boundary for text lengths is 1,000 words per text.
As can be noted, many different measures, learning algorithms, and evaluation methods have been used, which makes comparison between experiments hard.
Frequency counts and lexically based features seem to be the most tried measures,
and balanced corpora compiled in the manner of the Brown and LOB corpora most
frequently used in evaluations, so we will follow suit in order to be able to make
some comparisons.
3 Experimental setup
In the experiment reported here, genre classification is based on subsets from a
base lemma vocabulary ranked by frequency distribution and derived from the
Stockholm-Umeå Corpus (SUC, 1997). Various subsets are selected with regard to
category set size, feature set size, and dispersion, i.e. how many genres contribute
to the frequency count for a lemma. Texts from SUC are then used in training and
testing decision trees for genre classification, with the text typology given in SUC.
3.1 SUC
SUC is a balanced corpus of modern Swedish prose covering approximately 1 million word tokens. The texts are from the years 1990 to 1994, and they were selected and classified according to criteria corresponding to the ones used for the
Brown corpus (Francis and Kucera, 1979) and the Lancaster-Oslo/Bergen (LOB)
corpus (Johansson et al., 1986), albeit adapted to Swedish culture. The basic idea
of the compilation was that it should mirror what a Swedish person might read in
the early nineties. The taxonomy of text types, i.e. genres and domains, is shown
in Appendix A.
The distribution of genres (or main categories) is shown in Figure 1 and the
distribution of domains (or subcategories) in Figure 2. As can be noted, the distribution is not ideal for genre analysis: SUC was not compiled for the purpose of genre analysis and really has too few samples of most genres, but it is the only larger Swedish corpus with text types other than news texts and with a given text typology. Since it has been compiled in the same spirit as other corpora used for genre classification (cf. Section 2), it is also easier, though not easy, to compare the results.
SUC consists of text samples from 1,040 texts, grouped into 500 files with an
average of 2,065 tokens per file. The samples were selected at random, but with an
effort to choose coherent stretches of text.
Version 2.0 of SUC (SUC, forthcoming) contains the same text samples, but
has corrected annotations and additional named entity annotation. For this experiment, version 2.0 was used.
Figure 1: Distribution of texts per genre in SUC.
3.2 Base lemma vocabulary
In many language technology applications, there is a need for a base vocabulary,
i.e. a vocabulary that could be reused for most domains and text types. For many
languages, there exist one or more frequency dictionaries that contain some kind of
base vocabulary, often based on a combination of word frequency and dispersion
in a corpus. Although such base vocabularies can be useful for some applications,
the corpora they are based on might not be representative for other applications,
and so the base vocabularies will be of less use.
In our case, there exist some Swedish frequency dictionaries based on various
text types collected at various time periods. The most recent and widely known
is “Nusvensk frekvensordbok” (NFO) (Allén, 1971), which is based on a 1 million word corpus of news text from the late sixties, Press65, and contains a base
vocabulary.
Figure 2: Distribution of texts per domain in SUC. (Domains are in alphabetical order from the left.)
However, since NFO is only representative of news texts, its base vocabulary
cannot be considered representative for our present purposes, i.e. as a norm to
compare various text types with. Instead, we will use a base vocabulary we have
extracted from SUC.
The units of the base vocabulary are lemmas, or rather the baseforms from
the SUC annotation disambiguated for part-of-speech, so that the preposition om
’about’ becomes om.S and the subjunction om ’if’ becomes om.CS. They are
ranked according to relative frequency weighted with dispersion, i.e. how evenly
spread out they are across the subdivisions of the corpus, so that more dispersed
words with the same frequency are ranked higher. This is done to compensate
for accidental peaks of frequency due to certain texts, domains or genres, and is
based on the ideas behind the base vocabularies of most frequency dictionaries of
today, including the Swedish NFO, introduced by Juilland (e.g. in Juilland and
Chang-Rodriguez (1964)).
In NFO, the same weighting scheme for dispersion is used as in Juilland’s
works, but it does not discriminate properly for some types of uneven distributions (Muller, 1965). In the German equivalent of NFO, Rosengren (1972, p. XXIX) introduced another weighting scheme, Korrigierte Frequenz ’adjusted frequency’¹, which gives better discrimination (see Equation 1). This measure is also used in later frequency dictionaries, e.g. the one based on the Brown corpus, and in our base vocabulary.
AF = (Σ_{i=1}^{n} √(d_i x_i))²    (1)

where
AF  = adjusted frequency
d_i = relative size of category i
x_i = frequency in category i
n   = number of categories
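As a minimal sketch, Equation 1 is straightforward to compute. The Python function below is our own illustration (the name `adjusted_frequency` and the toy numbers are not from the paper):

```python
from math import sqrt

def adjusted_frequency(freqs, rel_sizes):
    """Rosengren's adjusted frequency (Equation 1):
    AF = (sum_{i=1}^{n} sqrt(d_i * x_i))**2,
    where x_i is the frequency in category i and d_i its relative size."""
    return sum(sqrt(d * x) for d, x in zip(rel_sizes, freqs)) ** 2

# A lemma spread evenly over 4 equal-sized categories keeps its full frequency:
print(adjusted_frequency([25, 25, 25, 25], [0.25] * 4))  # 100.0
# The same total frequency concentrated in one category is penalised:
print(adjusted_frequency([100, 0, 0, 0], [0.25] * 4))    # 25.0
```

This shows the intended effect: for equal total frequency, evenly dispersed lemmas rank higher than lemmas with accidental peaks in a few texts.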
The total vocabulary has 69,560 entries, but the base vocabulary is restricted to the 8,554 entries which occur in at least 3 genres, since this turned out to give the most stable ranking for adjusted frequency across three category divisions (genre, domain, text). Our base vocabulary, therefore, consists of those words that are not genre- or domain-dependent, given the subdivisions of SUC. The top-ranked
entries are mostly function words, but also stylistically neutral content words, e.g.
words in multi-word function words. These are the words we think would most
probably signal discourse patterns.
This base vocabulary from which the feature sets are selected requires relatively few resources to compute, and it is probably useful for other applications
as well, e.g. as a basis for stop lists in information retrieval. It contains lexical
information and some morphosyntactic information in the reduced part-of-speech
bit of the lemma. Although the lemmatising masks information about the actual
form used, it is a way of handling disambiguation as well as data sparseness. By
combining the lemma information with part-of-speech tag information, one would
probably get a better prediction. The information on punctuation distribution gives
valuable clues that correspond to measures such as number of words per sentence
(.!?), number of clauses (,), number of attributions (–), and number of declarative,
imperative, and interrogative sentences (.!?), without the extra effort of computing
the actual measures.
3.3 Decision trees
Frequency counts for the selected features were extracted into a feature vector for
each SUC text. The counts were log-normalised for text length, in the manner
frequently used for topic categorisation (Manning and Schütze, 1999, p. 580). The
score, s, in Equation 2 reflects the fact that the importance of an increase in relative
frequency is not linearly proportional to the increase.
¹ Attributed by Rosengren to J. Lanke of Lund University.

s_{ij} = round(10 · (1 + log tf_{ij}) / (1 + log l_j))    (2)

where
s_{ij}  = score for term i in document j
tf_{ij} = number of occurrences of term i in document j
l_j     = length of document j
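A small Python sketch of Equation 2 follows. The log base is not stated in the text, so base 10 is assumed here, and mapping a zero count to a zero score is our own convention (log 0 is undefined):

```python
from math import log10

def score(tf, doc_length):
    """Log-normalised term score from Equation 2:
    s = round(10 * (1 + log tf) / (1 + log l))."""
    if tf == 0:
        return 0  # absent terms score 0 (our assumption)
    return round(10 * (1 + log10(tf)) / (1 + log10(doc_length)))

# For an average-length SUC file of about 2,065 tokens:
print(score(2065, 2065))  # a term filling the whole document scores 10
print(score(5, 2065))     # a rare term gets a small but non-zero score
```

Note how the score grows with the logarithm of the count, reflecting that a doubling of relative frequency is less important at high counts than at low ones.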
The feature vectors were then used in building classification trees through the
rpart package of R (R Development Core Team, 2004; Therneau and Atkinson,
2004), which follows the description in Breiman et al. (1984) quite closely.
Decision trees are built by recursively splitting the learning sample into smaller
groups by some splitting criterion until a stopping criterion is met. The splitting
criterion is usually one of Information Gain (Breiman et al., 1984, p. 25f) or the
Gini Index (Breiman et al., 1984, p. 103). The Information Gain criterion is based
on entropy, while the Gini Index is based on estimated probability of misclassification or variance. It is not really clear which gives the best tree for a given data
set (Raileanu and Stoffel, 2004), so we used the default choice in rpart, the Gini
Index (see Equation 3):
i(t) = Σ^{J}_{j≠i} p(j|t) · p(i|t)    (3)

where
i(t)   = Gini index
t      = a node in the tree
J      = number of classes
p(j|t) = estimated class probability for the current class j
p(i|t) = estimated class probability for another class i
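Since the sum over j ≠ i of p(j|t)p(i|t) equals 1 − Σ_j p(j|t)², the Gini index can be sketched in a few lines of Python (a minimal illustration of Equation 3, not rpart's actual implementation):

```python
def gini(class_probs):
    """Gini index of a node (Equation 3): sum_{j != i} p(j|t) * p(i|t),
    computed via the algebraically equivalent form 1 - sum_j p(j|t)**2."""
    return 1.0 - sum(p * p for p in class_probs)

print(gini([1.0] + [0.0] * 8))  # a pure node: 0.0
print(gini([1 / 9] * 9))        # 9 equiprobable genres: 8/9, the maximum
```

A split is chosen to maximise the reduction in this impurity, so pure (single-genre) nodes are the ideal outcome of a split.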
The most trivial stopping criterion is that all elements at a node have an identical feature vector or the same category so that splitting would not further discriminate between them. To prevent the tree from knowing the training set by heart,
i.e. overfitting the training data but performing less well on unseen data, the tree
is usually pruned afterwards. We use 10-fold cross-validation and minimal cost-complexity pruning.
Cost complexity is based on adding a complexity cost, i.e. the number of terminal nodes weighted by a cost parameter, to the resubstitution estimates of misclassification for a tree (Breiman et al., 1984, p. 34f, 66), see Equation 4.
R_α(T) = R(T) + α|T|    (4)

where
R_α(T) = cost complexity of tree T
α      = complexity parameter
R(T)   = resubstitution estimate of misclassification for tree T
|T|    = number of terminal nodes in T
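Equation 4 itself is a one-liner; the sketch below (names are ours) mainly illustrates the trade-off it encodes:

```python
def cost_complexity(resub_error, n_terminal_nodes, alpha):
    """Cost complexity of a tree (Equation 4): R_alpha(T) = R(T) + alpha * |T|."""
    return resub_error + alpha * n_terminal_nodes

# With alpha = 0 the criterion reduces to the resubstitution error alone, so
# larger trees are never penalised; a positive alpha trades tree size against
# training-set accuracy.
print(cost_complexity(0.55, 6, 0.01))  # ~0.61
```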
R(T ) is a measure of the estimated misclassification cost for the whole tree.
The misclassification cost for a single item could either be uniform (1), based on
the probability distribution of classes or supplied by the user. The default in rpart
is probability distribution.
The complexity parameter of the smallest tree whose cross-validated error is less than or equal to the minimum cross-validated R(T) plus the cross-validation standard error is then used for pruning the tree (Maindonald and Braun, 2003, p. 275).
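This selection rule can be sketched as follows. The four-column row layout mimics the cross-validation table printed by rpart's printcp(), but the tuple format is our assumption for this illustration:

```python
def select_cp(cp_table):
    """One-standard-error rule for choosing the pruning parameter:
    take the cp of the smallest tree whose cross-validated error is within
    one standard error of the minimum cross-validated error.
    Each row is (cp, n_splits, cv_error, cv_std_error)."""
    best = min(cp_table, key=lambda row: row[2])
    threshold = best[2] + best[3]                  # min CV error + its SE
    within = [row for row in cp_table if row[2] <= threshold]
    return min(within, key=lambda row: row[1])[0]  # fewest splits wins

# Hypothetical cross-validation results:
table = [(0.10, 0, 1.00, 0.05),
         (0.05, 2, 0.80, 0.04),
         (0.01, 5, 0.78, 0.04)]
print(select_cp(table))  # 0.05: 0.80 <= 0.78 + 0.04, and 2 < 5 splits
```

The rule deliberately prefers a smaller tree over the one with the absolute minimum cross-validated error, which guards against overfitting.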
4 Results
4.1 Size of category set
The SUC corpus is divided into 9 genre categories and 48 domain subcategories.
Training the decision trees on the larger set of categories resulted in poorer performance, approximately 10 percentage points worse than for the smaller category set
(cf. Tables 1 and 2, and Tables 3 and 4, respectively), although none of the sets
gave good predictors in the first place. This is most likely partly due to the skewed
category distribution, and the fact that SUC was not compiled to be representative
for genres, but to be representative for the distribution of genres a person might
read in a year. The larger set also has fewer texts to learn from, which gives the
smaller set an advantage. The fewer learning examples might also go in tandem
with the slightly bigger trees induced for the larger category set, since less generalisation is possible, and there is a bigger risk that the training set does not contain
any samples of a less represented category.
However, both sets outperformed both the uniform category probability baseline (1) and the majority category baseline (2), which could be an indicator that
the features used are potentially good predictors. Almost the same features appear
in the trees for both category sets, and they are mostly function words or punctuation², which seems to suggest that both the genre and domain divisions are based
also on functional style.
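Both baselines can be computed directly from the category counts; the Python sketch below is illustrative (the counts are made-up, not SUC's):

```python
def baseline_errors(category_counts):
    """Misclassification rates for the two baselines:
    (1) guessing uniformly at random among the categories,
    (2) always guessing the majority category."""
    total = sum(category_counts)
    uniform = 1 - 1 / len(category_counts)
    majority = 1 - max(category_counts) / total
    return uniform, majority

# With 9 genres, the uniform baseline is 1 - 1/9, i.e. the 88.89% of Table 1;
# the majority baseline depends on the majority genre's share of the texts.
uniform, majority = baseline_errors([50, 30, 20])
print(uniform, majority)
```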
As a comparison, in the study by Karlgren (2000) on the Brown corpus, the
larger category sets also performed worse than the smaller ones: 2 categories gave
4% misclassifications (corresponding to the SUC category K vs. the rest), 4 categories gave 27% misclassifications (SUC categories ABC, EFG, HJ, K), and 15
categories gave 48% misclassifications (SUC categories A, B, C, E, F, G, H, J, KK,
KL, KN, KR).
² Due to the human factor, information on the mapping from an index to the actual punctuation
character was unfortunately lost, except for the full stop, and could not be regained without repeating
the whole experiment. Although the mapping information would be very valuable in a final decision
tree, it did not seem crucial for the results here.
Subset      R(T)     Splits  Features
10          60.67%   2       och.CC; det.PF
20          59.62%   3       han.PF; av.S; att.CS
30          61.35%   3       han.PF; kunna.V; av.S
40          57.40%   5       han.PF; eller.CC; av.S; att.CS, man.PI
50          58.85%   5       =
60          58.65%   5       =
70          59.14%   5       =
80          57.50%   5       =
90          60.00%   5       =
100         59.42%   5       =
200         56.92%   5       =
300         58.65%   5       =
400         57.88%   5       =
500         55.19%   9       han.PF; eller.CC; av.S; musik.NCU, man.PI; politisk.AQ; hon.PF; historia.NCU; kunna.V
600         57.02%   5       han.PF; eller.CC; av.S; musik.NCU, man.PI
700         57.12%   5       =
800         59.04%   7       han.PF; eller.CC; av.S; scen.NCU, man.PI; politisk.AQ, vilja.V
900         56.63%   7       =
1000        57.12%   7       =
2000        56.83%   7       =
Baseline 1  88.89%
Baseline 2  74.13%
Mean        58.25%
Std          1.54%
Table 1: Genre classification results for the 9 genres of SUC, with various subsets
(top N) of the ranked base vocabulary, and with contribution from all 9 genres.
4.2 Dispersion and size of feature set
The size of feature sets did not have much effect on performance, except for the
very smallest subsets (cf. Tables 1 and 3), but the size and shape of the trees
generally changed with size.
Regardless of the size of the category set, however, the trees stayed stable between sets of 50 and 100 features; for genres they were stable up to 500 features.
Up to 100 features, only words with contribution from all 9 genres are in the set,
and from 2000 features, there are no features with contribution from all genres (cf.
Table 5). With feature set sizes over 1000, more topical content words like film
’film’ and tränare ’coach’ start to appear, especially for the larger
category set.
Since dispersion is used in the weighted ranking of the base vocabulary, it
seems as if it would be rather safe to slacken the dispersion restriction and only use
the ranking restriction when selecting a subset of the base vocabulary, at least for
the top 500 features. It also seems as if the optimal subset is somewhere in the top
50-100 features, which supports the findings of related studies (cf. Section 2).
In particular, Stamatatos et al. (2000b) used the most frequent words computed
from a much larger corpus, the British National Corpus, and tested it on a more homogeneous set of genres from the Wall Street Journal. They found that the best
Subset      R(T)     Splits  Features
10          78.56%   3       vara.V; och.CC; och.CC
20          76.25%   4       vara.V; han.PF; och.CC; inte.RG
30          78.17%   4       =
40          71.83%   9       vara.V; han.PF; hon.PF; inte.RG; den.DF; och.CC, men.CC; X.F*; ska.V
50          72.40%   9       =
60          72.60%   9       =
70          72.89%   9       =
80          71.54%   9       =
90          73.37%   9       =
100         72.40%   9       =
200         70.00%   9       vara.V; han.PF; enligt.S; hon.PF; inte.RG; den.DF; och.CC; X.F*; ska.V
300         70.10%   9       =
400         71.25%   9       vara.V; han.PF; enligt.S; hon.PF; inte.RG; den.DF; kommun.NCU; X.F*; ska.V
500         71.83%   9       =
600         71.44%   9       =
700         72.60%   8       vara.V; han.PF; enligt.S; hon.PF; inte.RG; den.DF; kommun.NCU; X.F*
800         72.31%   8       =
900         72.50%   8       =
1000        73.56%   8       =
2000        72.31%   8       =
Baseline 1  97.92%
Baseline 2  86.63%
Mean        72.89%
Std          2.28%
Table 2: Genre classification results for the 48 domains of SUC, with various subsets (top N) of the ranked base vocabulary, and with contribution from all 9 genres.
performance was with the top 30 words set, and from then on the performance degraded. Used in combination with punctuation marks, the set was also more stable
for fewer training samples than the other sets.
Apart from the features showing up in the trees, information from the next-in-rank primary splits and the surrogate splits, i.e. splits to use when values are
missing, gives clues to masked features, which are almost as informative as the
top-ranked primary split. For example, the third person pronoun hon ’she’ is lurking in the background when present in the feature set, while han ’he’ is always
showing up in the tree when present. Another method which takes more account
of covariation would therefore probably be preferable once the feature set has been
selected.
5 Concluding remarks
In this paper, we described an exploratory experiment on genre classification of
Swedish texts, using as predictors various subsets of a base lemma vocabulary
previously derived from the Stockholm-Umeå Corpus.
Three dimensions of the base vocabulary were investigated: size of feature
Subset      R(T)     Splits  Features
100         59.42%   5       han.PF; eller.CC; av.S; att.CS, man.PI
200         58.65%   5       =
300         59.90%   5       =
400         58.85%   5       =
500         59.23%   5       =
600         56.63%   5       han.PF; eller.CC; av.S; musik.NCU, man.PI
700         54.71%   9       han.PF; eller.CC; av.S; musik.NCU, man.PI; politisk.AQ; hon.PF; historia.NCU; kunna.V
800         55.87%   9       =
900         57.69%   9       han.PF; eller.CC; av.S; scen.NCU, kapitel.NCN; politisk.AQ, de.PF; vilja.V, till exempel.RG
1000        58.27%   9       =
2000        60.29%   10      han.PF; eller.CC; av.S; film.NCU, kapitel.NCN; scen.NCU, de.PF; politisk.AQ, till exempel.RG; kunna.V
3000        55.48%   7       han.PF; eller.CC; av.S; regi.NCU, kapitel.NCN; de.PF; till exempel.RG
4000        56.25%   7       =
Baseline 1  88.89%
Baseline 2  74.13%
Mean        58.30%
Std          1.75%
Table 3: Genre classification results for the 9 genres of SUC, with various subsets
(top N) of the ranked base vocabulary, and with contribution from at least 3 genres.
set, genre vs. domain category set, and dispersion. The decision trees grown for
the various subsets did not predict the genres particularly well, partially because of
sparse data and a skewed distribution of genres in the corpus, but they outperformed
both baselines: uniform genre probability and majority genre.
Using the coarser-grained category set of genres gave better predictions than
the finer-grained set of domains. Regardless of category set, the trees stayed stable between sets of 50 and 100 features; for genres they were stable up to 500
features. Slackening the dispersion criterion to allow for more genre-specific features
increased the number of content words in the size-500 sets and above. The optimal
subset, then, seems to be one with a coarse-grained category set and a feature set
from the top 50-100 ranked features. In that range, the dispersion criterion does not
matter. However, to make any substantial conclusions, one would need a corpus
specifically compiled for genre analysis.
Decision trees are appealingly easy to interpret and well suited to feature extraction, but they tend to look at only one feature at a time, and do not account well
for dependencies. Features that are potentially useful in combination are easily
masked by stronger features. By looking at the lower ranked primary splits and
surrogate splits, it is possible to see recurring features that are always masked in
the final trees. Therefore, other methods which take dependencies more into account should probably be used for the final classifier.
Subset      R(T)     Splits  Features
100         72.40%   9       vara.V; han.PF; hon.PF; inte.RG; den.DF; och.CC, men.CC; X.F*; ska.V
200         71.06%   10      vara.V; han.PF; enligt.S; hon.PF; inte.RG; den.PF; företag.NCN; liten.AQ; X.F*; ska.V
300         70.67%   9       vara.V; han.PF; enligt.S; hon.PF; inte.RG; X.F*; och.CC; X.F*; ska.V
400         70.96%   9       vara.V; han.PF; enligt.S; hon.PF; inte.RG; X.F*; och.CC; X.F*, kommun.NCU
500         70.38%   9       =
600         72.31%   9       =
700         72.69%   9       =
800         71.64%   9       =
900         71.64%   9       =
1000        72.21%   9       =
2000        69.90%   12      vara.V; han.PF; film.NCU; Umeå.NP; bestämmelse.NCU; hon.PF; inte.RG; X.F*; företag.NCN; med.S; utställning.NCU; X.F*
3000        70.87%   10      vara.V; han.PF; film.NCU; Umeå.NP; bestämmelse.NCU; hon.PF; inte.RG; jfr.V; och.CC; X.F*
4000        69.90%   11      vara.V; han.PF; film.NCU; Umeå.NP; bestämmelse.NCU; hon.PF; inte.RG; jfr.V; tränare.NCU; och.CC; X.F*
Baseline 1  97.92%
Baseline 2  86.63%
Mean        72.47%
Std          2.34%
Table 4: Genre classification results for the 48 domains of SUC, with various subsets (top N) of the ranked base vocabulary, and with contribution from at least 3
genres.
Subset  100  200  300  400  500  600  700  800  900  1000  2000  3000  4000
¬9        0    1    2    5    9   15   34   49   69   105   614  1511  2494
Table 5: Introduction of features without full contribution by set size in the base
vocabulary.
References
Sture Allén. Nusvensk frekvensordbok baserad på tidningstext 2. Lemman. [Frequency
dictionary of present-day Swedish based on newspaper material 2. Lemmas.]. Data
linguistica 4. Almqvist & Wiksell international, Stockholm, 1971.
Shlomo Argamon and Jeff T. Dodick. Conjunction and modal assessment in genre classification: A corpus-based study of historical and experimental science writing. In
Notes of AAAI Spring Symposium on Attitude and Affect in Text: Theories and Applications, pages 1–8, Stanford University, Palo Alto, California, USA, March 2004a.
URL http://lingcog.iit.edu/doc/SSS404ArgamonS.pdf.
Shlomo Argamon and Jeff T. Dodick. Linking rhetoric and methodology in formal scientific writing. In Proceedings of the 26th Annual Meeting of the Cognitive Science Society (CogSci 2004), Chicago, Illinois, USA, August 2004b. URL http:
//lingcog.iit.edu/doc/ArgamonDodickCR.pdf.
R. Harald Baayen. Word Frequency Distributions. Text, Speech and Language Technology
18. Kluwer Academic Publishers, Dordrecht, The Netherlands; Boston, Massachusetts,
USA, and London, England, 2001. ISBN 0-7923-7017-1.
Douglas Biber. Dimensions of register variation: A cross-linguistic comparison. Cambridge University Press, Cambridge, UK, 1995. ISBN 0-521-47331-4.
Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. The Wadsworth Statistics/Probability Series. Wadsworth
International Group, Belmont, California, USA, 1984. ISBN 0-534-98053-8.
EAGLES. Preliminary recommendations on text typology. Preliminary Recommendation
EAG–TCWG–TTYP/P, Expert Advisory Group on Language Engineering Standards,
June 1996. URL http://www.ilc.cnr.it/EAGLES96/texttyp/texttyp.
html.
W. Nelson Francis and Henry Kucera. Manual of information to accompany a Standard
Sample of Present-day Edited American English, for use with digital computers. Providence, R.I., USA, 1979. URL http://www.hit.uib.no/icame/brown/bcm.
html. Original ed. 1964, revised 1971, revised and augmented 1979.
M. A. K. Halliday. An Introduction to Functional English Grammar. Edward Arnold,
London, third edition, 1994.
Stig Johansson, Eric Atwell, Roger Garside, and Geoffrey Leech. The Tagged LOB Corpus
Users’ Manual. Bergen, Norway, 1986. URL http://www.hit.uib.no/icame/
lobman/lob-cont.html.
Alphonse Juilland and E. Chang-Rodriguez. Frequency Dictionary of Spanish words. The
Romance Languages and Their Structure, First Series S 1. Mouton & Co, The Hague,
The Netherlands, 1964.
Jussi Karlgren. Stylistic experiments for information retrieval, volume 26 of SICS dissertation series. Swedish Institute of Computer Science (SICS), Kista, 2000. ISBN
91-7265-058-3. PhD thesis, Stockholm University.
Brett Kessler, Geoffrey Nunberg, and Hinrich Schütze. Automatic detection of text genre.
In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and the 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL/EACL’97), pages 32–38, Madrid, Spain, July 1997. URL
http://xxx.lanl.gov/abs/cmp-lg/9707002.pdf.
Ludovic Lebart, André Salem, and Lisette Berry. Exploring Textual Data. Text, Speech and
Language Technology 4. Kluwer Academic Publishers, Dordrecht, The Netherlands;
Boston, Massachusetts, USA, and London, England, 1998. ISBN 0-7923-4840-0.
John Maindonald and John Braun. Data Analysis and Graphics Using R: An Example-based Approach. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK, 2003. ISBN 0-521-81336-0.
Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language
Processing. The MIT Press, Cambridge, Massachusetts, USA and London, England,
1999. ISBN 0-262-13360-1. Sixth printing with corrections 2003.
Lars Melin and Sven Lange. Att analysera text: Stilanalys med exempel [Analysing text: Style analysis with examples]. Studentlitteratur, Lund, third edition, 2000. ISBN 91-44-01562-3.
Charles Muller. Fréquence, dispersion et usage [Frequency, dispersion and usage]. Cahiers de lexicologie, VII(2):33–42, 1965.
R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2004. URL http://www.R-project.org.
Laura Elena Raileanu and Kilian Stoffel. Theoretical comparison between the Gini index
and information gain criteria. Annals of Mathematics and Artificial Intelligence, 41(1):
77–93, May 2004.
Inger Rosengren. Ein Frequenzwörterbuch der deutschen Zeitungssprache [A frequency dictionary of German newspaper language]. CWK Gleerup, Lund, 1972.
Marina Santini. A shallow approach to syntactic feature extraction for genre classification. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics, Birmingham, UK, January 2004. URL http://www.cs.bham.ac.uk/~mgl/cluk/papers/santini.pdf.
Efstathios Stamatatos, Nikos Fakotakis, and George Kokkinakis. Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4), 2000a.
Efstathios Stamatatos, Nikos Fakotakis, and George Kokkinakis. Text genre detection
using common word frequencies. In Proceedings of the 18th International Conference
on Computational Linguistics (COLING2000), Saarbrücken, Germany, July 31 - August
4 2000b.
SUC. Stockholm-Umeå corpus. CD, version 1.0, 1997.
SUC. Stockholm-Umeå corpus. Version 2.0, forthcoming.
John M. Swales. Genre Analysis: English in academic and research settings. The Cambridge applied linguistics series. Cambridge University Press, Cambridge, UK, 1990.
ISBN 0-521-32869-1.
Terry M. Therneau and Beth Atkinson. rpart: Recursive Partitioning, 2004. R package
version 3.1-20. R port by Brian Ripley <ripley@stats.ox.ac.uk>. S-PLUS 6.x original
at http://www.mayo.edu/hsr/Sfunc.html.
Hans van Halteren, Harald R. Baayen, Fiona Tweedie, Marco Haverkort, and Anneke Neijt.
New machine learning methods demonstrate the existence of a human stylome. Journal
of Quantitative Linguistics, 12(1):65–77, 2005.
A SUC taxonomy

ID  Genre
A   Press: Reportage
B   Press: Editorial
C   Press: Reviews
E   Skills and Hobbies
F   Popular Lore
G   Biographies, essays
H   Miscellaneous
J   Learned and scientific writing
K   Imaginative prose

ID  Domain
AA  Political
AB  Community
AC  Financial
AD  Cultural
AE  Sports
AF  Spot News
BA  Institutional
BB  Debate articles
CA  Books
CB  Films
CC  Art
CD  Theater
CE  Music
CF  Artists, shows
CG  Radio, TV
EA  Hobbies, amusements
EB  Society press
EC  Occupational and trade union press
ED  Religion
FA  Humanities
FB  Behavioral sciences
FC  Social sciences
FD  Religion
FE  Complementary life styles
FF  History
FG  Health and medicine
FH  Natural science, technology
FJ  Politics
FK  Culture
GA  Biographies, memoirs
GB  Essays
HA  Government publications
HB  Municipal publications
HC  Financial reports, business
HD  Financial reports, non-profit organisations
HE  Internal publications, companies
HF  University publications
JA  Humanities
JB  Behavioral sciences
JC  Social sciences
JD  Religion
JE  Technology
JF  Mathematics
JG  Medicine
JH  Natural science, technology
KK  General fiction
KL  Science fiction and mystery
KN  Light reading
KR  Humour
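As the table shows, each two-letter domain ID nests under the one-letter genre ID given by its first letter (AA–AF under A, JA–JH under J, and so on). A minimal Python sketch of this two-level category set (the names GENRES and genre_of are ours, not part of the SUC distribution):

```python
# Top-level SUC genres, keyed by their one-letter IDs (Appendix A).
GENRES = {
    "A": "Press: Reportage",
    "B": "Press: Editorial",
    "C": "Press: Reviews",
    "E": "Skills and Hobbies",
    "F": "Popular Lore",
    "G": "Biographies, essays",
    "H": "Miscellaneous",
    "J": "Learned and scientific writing",
    "K": "Imaginative prose",
}

def genre_of(domain_id: str) -> str:
    """Return the genre name for a two-letter SUC domain ID.

    The first letter of a domain ID identifies its enclosing genre,
    e.g. 'AC' (Financial) belongs to genre 'A' (Press: Reportage).
    """
    return GENRES[domain_id[0]]
```

For example, genre_of("JE") yields "Learned and scientific writing", since domain JE (Technology) falls under genre J.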