A Case Study of Computing Appraisals in Design Text

Xiong Wang and Andy Dong
University of Sydney, Australia
This paper presents case studies of the calculation of appraisals, the linguistic
construal of emotions and attitudinal positions, in natural language design text.
Using two data sets, one a standard set of movie review text and one a set of natural language design text, we compare the performance of support vector
machines for the classification of design documents by overall semantic
orientation based on two different numerical representations of the design text. For
the natural language design text data set, we additionally compare the performance
of the support vector machine for the categorization of the text into three
categories, Product, Process and People. We find that the sparse yet high
dimensional representation of the design text allows the support vector machine to
perform best. Further, we find modest benefit in encoding statistically derived data
about the semantic orientation and lexical data about the semantic category into
the representation beyond frequency counts on the occurrence of unigrams in the
text.
Computing sentiments in design text
In recent years, the computational linguistics community has turned its
attention toward the modeling of the subjectivity and sentiment of
language. The aim of understanding the sentiment of a text is to distinguish
the subject of the text from the subjective stance taken by the author
towards the topic. At the moment, a primary task of sentiment analysis is
to determine the semantic orientation, positive or negative, of the
sentiment. Determining the semantic orientation of design documents may
have useful outcomes such as determining the level of risk and uncertainty
in product specifications, assessing the temperament of a design team, or
managing the progress of the design process. For example, by mapping
semantic orientation in design text to a model of design as reflective
practice, we have been able to elicit the co-existence of affective processing and rational cognitive processing [4] and the effect of attitudes toward the formation of shared understanding [5]. Building a general purpose sentiment classifier for design text is, however, not a priori obvious. Some of the challenges in building such a classifier are studied in this paper.

J.S. Gero and A.K. Goel (eds.), Design Computing and Cognition ’08,
© Springer Science + Business Media B.V. 2008
One of the principal challenges of developing sentiment analysis
systems has been the lack of a precise computational language model of
sentiment. Within the theory of systemic-functional linguistics, the
APPRAISAL system [7] provides a rigorous, network-based model for
sentiment, which linguists characterize as the construal of emotions and
interpersonal relations in language. The model has been partially
implemented [14].
Yet, what is intriguing is that one of the most accurate supervised
machine learning sentiment classifiers to date relies on a standard bag-of-words representation of the text using unigrams found in the text [8] rather
than feature sets from the semantic resources in the APPRAISAL system.
The semantic resources for the APPRAISAL system are all the linguistic
means available to a speaker to express subjective content, such as affect
(“I like this book”) and engagement (“You know, this is a really good
book”). Pang’s method is based on characterizing a document as a bag-of-words. The bag-of-words ignores knowledge about the words, such as
part-of-speech and semantic meaning, and the grammar, treating the text as
an unordered collection of words. Each document in the corpus is
numerically represented as a high-dimension vector of frequency counts of
unigrams matching a pre-determined list. For any unigram in the candidate
list of content-bearing words, if the unigram appears in both the target
document and the list, the corresponding position in the vector will be
labeled as 1; else it will be marked as 0. This method attained around 80%
accuracy in semantic orientation classification. Only by combining the
semantic resources for Attitude and Orientation from the APPRAISAL
system with bag-of-words features does the performance of sentiment
classification improve [14], though not by much. Another method which
employed cognitive and commonsense knowledge as rules also does not
perform as well as the bag-of-words classification [9].
There are two challenges in adapting the bag-of-words and appraisal
features methods to the analysis of design text. The first challenge has to do
with re-using the existing bag-of-words sentiment classifier. The standard test
corpus for sentiment analysis in the computational linguistics community
is movie reviews (http://www.cs.cornell.edu/people/pabo/movie-review-data/). All of the sentiment classifiers described previously were trained,
tested, and compared based on this movie review data set. Whilst it is not
altogether explainable, at least theoretically, why the bag-of-words method
works so well compared to a method rigorously grounded in linguistic
theory, it is tempting to apply the bag-of-words classifier immediately to
design text nonetheless. After all, the performance of the APPRAISAL
theory based classifier is not so markedly better to justify the additional
computational expense of generating the lexicon for the appraisal groups.
However, it is important to note that the bag-of-words classifier was
trained on a data set consisting of text about movie reviews. It is not a
priori obvious whether a supervised machine learning system trained on a
linguistic data set from a specific domain, movie reviews, could perform
equally well on the target data set, design text. The lexicon of the two
domains is different. It is not known whether this difference will result in a significant degradation in the accuracy of sentiment classification when the classifier is trained on one data set but deployed on another. This drop in performance has been claimed [9] but not tested. Yet,
there is the potential that it might be possible to transfer the sentiment
classifier across domains. There is evidence from research by Wiebe [15]
that a sentence is 55.8% likely to have subjective content if there is an
adjective within; thus, the appearance of a broad range of adjectives in the
training data set might be sufficient for sentiment analysis, and it would
not be necessary to utilize all the words that appear in the target domain to
train the classifier. It would be highly attractive to re-use a sentiment
classifier trained on existing data sets due to the expense of producing
tagged data to train machine learning classifiers.
Second, the bag-of-words method relies on a very high-dimension
representation that hinges on training a system on a text domain which
contains a high coverage of words that are likely to appear in the target
corpora. The fewer words that the two corpora share, the less likely it is
that the bag-of-words sentiment classifier would perform well. This is not altogether attractive, since it makes such systems difficult to transport across linguistic domains, or even within a linguistic domain in which the training set does not have as many unique words as found in the target
domain. Also, if such a system were deployed on a very large corpus, such
as the corpus associated with the design of very large engineering systems,
such as aircraft, it is very possible that there will be millions of features
(unigrams). This very high dimensionality reduces the computational
efficiency of the machine learning system and introduces other pragmatic
implementation challenges. Finally, the more unique words which exist in
the text, the more training cases are needed, which is, from a practical
standpoint, difficult to obtain. It would, instead, be desirable to have a
lexicon-independent representation.
In our prior work [13], we presented an initial feasibility study of a
method to calculate the semantic orientation of design text using a compact
representation of text wherein the possible semantic orientation of
individual words in clauses was calculated using a statistical measure of
co-occurrence with known positively and negatively oriented words. That
study confirmed the feasibility of the approach, but revealed two key
limitations. First, the calculation of a word’s semantic orientation is
computationally expensive and requires a real-time lookup to a very large
online database. For that research, we used Google, although other “off-line” data sets are available, such as the Google/Linguistic Data Consortium Web 1T 5-gram data set. We found that the calculated values
from the Google queries fluctuated depending upon which Google servers
were “hit” during a particular query. Second, the evidence from the recent
research comparing computational sentiment analysis based on appraisal
theory against the bag-of-words approach, showing that the bag-of-words
approach is superior, leads us to question the utility of embedding
statistically derived data about semantic orientation into the representation
of text. Thus, in this project, we tested the claim that embedding statistically derived data about a word’s semantic orientation into the text representation improves sentiment classification, by comparing against a configuration in which the sentiment value of every individual word is set to 1 (neutral).
In this paper, we set out to investigate these related issues. That is, this
paper investigates:
• The accuracy of a bag-of-words sentiment classifier trained on the
movie review data set applied to a design text data set (same algorithm,
same text representation, different training and target data sets)
• The accuracy of a bag-of-words sentiment classifier compared to a
sentiment classifier operating on a highly compact representation of
design text (same algorithm, different text representation, same training
and target data sets)
• The accuracy of the compact representation when knowledge of the
semantic orientation of individual words is included in the
representation and when it is absent
To address these research questions, we:
• Produced a tagged corpus of design text rated for semantic orientation
(positive or negative)
• Developed a compact document representation for text that includes
features for word category and semantic orientation
• Trained a bag-of-words sentiment classifier on the design text
• Executed various comparative tests
What is it about design documents?
There are characteristics which distinguish design documents from other
types of documents such as newspaper stories or political policy platforms.
The language of design is a special kind of language owing to the performativity of the language. That is, the reality-producing effect of the language of design, where the reality is both the design work and the design process, is itself an enactment of design, a praxis about materialising realities [3].
From this point of view, the suggestion is that there is a connection between
the practice of design and the linguistic properties of the language of
design. Design texts constitute the materiality of the design work through language, and it is the sentiment in the text which sanctions one work over another. Thus, understanding this process of privileging one work over another, which we believe is realised through the linguistic process of appraisal, is a key step in understanding how the language of design produces
design. Secondly, designers write down their thoughts about a project as
they are working on it. Design documents can be considered a recording of
a design process over time. Ascertaining the ebbs and flows of attitudes
toward the design work and design process could illustrate a more nuanced
picture of how well design happened.
In sentiment classification of design text, it is also important to consider
the subject matter. We categorize three high-level subject matters.
Appraisals of Product are directed towards the design work, including its
requirements and goals and the data informing the construal of the design
brief. Appraisals of Product can justify (provide rationale for) decisions taken
during the design process. That is, appraisals of Product can explain how
the designers’ feelings toward the design work influenced the designing of
the work. In the appraisal of Product, the designer may rely on semantic
resources that apply an external, normative judgment or a personal,
subjective appreciation. For example, “Uniqlo is a comparatively reserved design” is a positive appraisal by 'Tokyo Style' of the style of the Uniqlo clothing line.
Taking stances towards tangible tasks and actions performed during
designing identifies the appraisal of Process. Appraisal of Process is
generally associated with concrete actions. In all of the process-oriented
appraisal clauses, a tangible action is being evaluated. The evaluation
associates a position toward the state of being of the action. An example is,
“That’s a risky strategy here,” which displays a negative attitude towards a
way of doing a design task.
Appraisals of People express evaluations of a person’s (a stakeholder in
the design process) cognitive and physical states of being. Appraisals of
People tend to take on an air of normative evaluation about how people
should and should not be or behave. “Uniqlo hired the country’s hottest
retail designer” is a positive appraisal of the people doing design for
Uniqlo by Tokyo Style, here employing a normative evaluation where a
norm for a “hot” designer is assumed to be known by the reader.
System Development and Experiment Setup
Labeled Design Text
To understand the difference in performance of the bag-of-words
sentiment classifier when trained on a movie review data set and deployed
on design text, we needed to create labeled design text. That is, we needed
to create a new data set consisting of text about design works, the process
of designing, and designers, which were labeled for semantic orientation.
A cohort of three native English speakers with a background in a designrelated discipline (e.g., engineering, architecture, and computer-science)
was tasked with reading and categorizing various design texts. The texts
included formal and informal design text from various online sources and
across various design-related disciplines. All design texts were collected
by the authors. Each coder was paid to classify the texts. The rating cohort was trained to identify the proper category and its semantic orientation according to the context. Training lasted for one hour. During
coding, 2 of the 3 coders had to agree on the semantic meaning (category),
semantic orientation (orientation), and the value of the orientation, that is,
positive or negative. Working in two hour time blocks, the coders read
various design texts, including formal design reports, reviews of designed
works, reviews of designers, and transcripts of conversations of designers
working together.
Every thirty minutes, the coders took a “fatigue check” test to assess
their performance. The fatigue test consisted of six appraisals that the second author had previously labeled. These constituted a set of “known”
appraisals with correct content categorization and semantic orientation.
The appraisals were randomized so that the group would not receive two
tests containing the same set of appraisals in the same order. The fatigue
test usefully provides us a baseline for the "best" performance we could
expect from the machine learning classifier as well as an assessment of the
internal validity of the collected data. One-third of the data set was cross-categorized by the second author and a colleague, coding per standard practice in protocol analysis, to ensure agreement upon a reliable
categorization and sentiment classification. Finally, the second author
reviewed the cohorts' work to ensure that they were correctly following the
rules for coding the categories and orientation according to the framework
for the language of appraisal in design. Following the review, the coders
were required to make one last pass through the data to re-code text the
author thought might be incorrectly labeled. In the practice of human labeling of data for training computational linguistic classifiers, the guideline of more than five votes per paragraph is used as a baseline for confirming "correct" labeling of a text. This coding system satisfies this
requirement and is reliable because at least five people, three people from
the student cohort and two researchers, agreed on the rating.
The accuracy of the coders for semantic orientation was about 90% over most of the sessions [13]. Pang [8] found that human-based
classifiers were accurate only 58% to 64%; thus, the performance of our
coders is consistent (actually better since the coders discussed their
interpretations) with other studies and is likely to be the “best” that could
be expected from a computational system.
Methods of Representation of Design Text
Bag-of-Words Representation
The bag-of-words sentiment classifier operates on a simple word
occurrence representation format. In this format, a full-text analysis software program counts the frequency of occurrence of
a single word (unigram) or word group (bigram, trigram … n-gram). This
word-by-document matrix F (see Table 1) is comprised of n words w1, w2,
…, wn in m documents d1, d2, … dm , where the weights indicate the total
frequency of occurrence of term wp in document dq.
Let us take the sentence “It is a great masterpiece” to show how to
compose the bag-of-words (BoW) representation. First, the “bag” consists
of a list of 12,111 words which appear in the design documents. The list of
words is generated from a full-text parse of the design text to extract key
words and phrases, but not stop words such as “a” and “the.” Because the
representation vector is sparse, we only show the columns which are non-zero. In the following equation, the right side is the BoW vector for the
sample sentence. For each vector component, the first number is the
column, and the appearance of the word (indicated by a 1) follows the
colon: BoW(“It is a great masterpiece.”) = [3:1 6:1 10:1 176:1 3077:1]
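This sparse encoding can be sketched in Python. The vocabulary and its column indices below are hypothetical stand-ins for the paper's 12,111-word list, so the exact indices differ from the sample vector above.

```python
# Minimal sketch of the sparse bag-of-words encoding; the vocabulary
# and column indices are hypothetical, not the paper's actual list.

def bow_sparse(sentence, vocabulary, stop_words):
    """Return a sparse {column: count} representation of `sentence`.
    Stop words and out-of-vocabulary words are ignored."""
    counts = {}
    for token in sentence.lower().split():
        word = token.strip(".,!?")
        if word in stop_words or word not in vocabulary:
            continue
        col = vocabulary[word]
        counts[col] = counts.get(col, 0) + 1
    return counts

# Hypothetical column indices for illustration only:
vocab = {"it": 3, "is": 6, "great": 176, "masterpiece": 3077}
stops = {"a", "the"}

print(bow_sparse("It is a great masterpiece.", vocab, stops))
# {3: 1, 6: 1, 176: 1, 3077: 1}
```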
Table 1 Sample Word by Document Matrix

        d1    d2    …    dm
w1      0     1          2
w2      1     0          0
…             0
wn            1          1
Compact Representation
To compare the accuracy of the bag-of-words sentiment classifier with that of a sentiment classifier operating on a highly compact representation of design text, we propose a compact representation of design text that could overcome problems with lack of lexicon coverage
between the training data set and the target data set. That is, the aim of the
representation is to be lexicon independent and train the classifier only on
the appearance of a set of numerical values relating to the potential
sentiment of the text and the semantic category of the text.
The basic requirement for the representation is that it must encode
which categories words might belong to and whether the individual words
express a positive or negative sentiment. Thus, for each clause, we need to
encode the category – Product, Process, or People – and the sentiment of
the clause’s constituents in the numerical value. The content-bearing
constituents in a clause are nouns, verbs, and adjective or adverb
modifiers. Encoding this information requires 3 × 3 = 9 combinations: 3 categories by 3 constituent types in a clause. The insight here is that each triplet of vector dimensions encodes
the category of the clause and the length of each vector dimension encodes
the sentiment. In total, we have a 9–dimensional vector of the following
form (Fig. 1):

[PdN  PdV  PdA  PrN  PrV  PrA  PpN  PpV  PpA]

Fig. 1. 9-dimensional vector representation of text

where Pd = Product, Pr = Process, Pp = People, N = noun, V = verb, and A = adjective/adverb.
The value of each of the vector dimensions is determined in the
following way. First, all the content-bearing words are automatically
extracted from the text, and their grammatical relationships are identified.
Then, each word is looked up (queried) in the WordNet lexicographer
database to ascertain the logical grouping that might indicate the
appropriate category (Product, Process, People) for the word. The WordNet lexicographer database, with its syntactic categories and logical groupings, was used to categorize nouns as being about Product, Process or People. Verbs, adjectives and adverbs are categorized according to the category(ies) of the noun they relate to grammatically. These clusters of syntactically related words are called word groups.
For the noun in each word group, rules were applied to identify which
of the WordNet logical groupings would contain nouns in the categories
[13]. Two correction factors are applied to the count of the frequency of occurrence of a word in the target clause: κ1, which is inversely proportional to the number of possible Product-Process-People categories a WordNet logical grouping can belong to; and κ2, which takes into account the uncertainty of a word’s category. Since the correction factor κ2 for a word may have up to three values, it is normally expressed as a vector of the form κ2(word) = [κ2,Pd, κ2,Pr, κ2,Pp].
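These correction factors can be sketched as follows. The mapping from WordNet lexicographer groupings to the Product-Process-People categories shown here is an invented placeholder; the paper's actual rules are given in [13].

```python
# Sketch of the correction factors k1 and k2. The grouping-to-category
# mapping below is illustrative only, not the paper's actual rules.

GROUPING_CATEGORIES = {
    "noun.artifact":      {"Product"},
    "noun.act":           {"Process"},
    "noun.person":        {"People"},
    "noun.communication": {"Product", "Process"},  # an ambiguous grouping
}

CATEGORIES = ("Product", "Process", "People")

def kappa1(grouping):
    """k1 is inversely proportional to the number of categories a
    WordNet logical grouping can belong to."""
    return 1.0 / len(GROUPING_CATEGORIES[grouping])

def kappa2(groupings):
    """k2 spreads a word's category uncertainty over the three
    categories, returned as [k2_Pd, k2_Pr, k2_Pp]."""
    possible = set()
    for g in groupings:
        possible |= GROUPING_CATEGORIES[g]
    weight = 1.0 / len(possible)
    return [weight if c in possible else 0.0 for c in CATEGORIES]

# A word whose senses fall in artifact and communication groupings may
# be Product or Process, but not People:
print(kappa2(["noun.artifact", "noun.communication"]))  # [0.5, 0.5, 0.0]
```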
The semantic orientation (SO) of the words in each word group is
calculated using the SO-PMI measure, which is in turn based on their
pointwise mutual information (PMI) [12]. The strategy for calculating the
SO-PMI is to calculate the log-odds (Equation 1) of a canonical basket of
positive (Pwords) or negative (Nwords) words appearing with the target
word on the assumption that if the canonical good or bad word appears
frequently with the target word then the target word has a similar semantic
orientation.
PMI(word1, word2) = log2 [ p(word1 & word2) / (p(word1) p(word2)) ]

Equation 1 The log odds that two words co-occur
As reported in our prior research [13], we used a Google query with
the NEAR operator to look up the co-occurrence of the target word with the
canonical basket of positive and negative words. The SO-PMI based on the
NEAR operator is described by Equation 2.
SO-PMI(word) = log2 [ (hits(word NEAR Pwords) × hits(Nwords)) / (hits(word NEAR Nwords) × hits(Pwords)) ]

Equation 2 The semantic orientation of a word based on mutual co-occurrence with a canonical basket of positive and negative words
SO-PMI values generally follow intuitive notions of positive and negative words. The SO-PMI value for a strongly negative word such as “unlikely” is –8.1, whereas it is 1.7 for “unintended” and 17.9 for “unified”, a positive word.
a positive word. However, this is not always true as many positive words
could be used ironically or sarcastically such as “This is a masterpiece?”.
SO-PMI values can be “close” and negative-valued even when,
qualitatively, it is intuitively known that the two words differ in
positive/negative orientation. For example, SO-PMI(“masterpiece”) = –5.7
whereas SO-PMI(“depressed”) = –6.7. No structural relation to the
sentiment of a text can necessarily be imputed from the SO-PMI values of
individual words alone. It is for this reason that it is necessary to use a machine learning system trained over aggregations of SO-PMI values for word groups in the text rather than over individual words.
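The SO-PMI computation itself reduces to a ratio of co-occurrence hit counts; the sketch below uses invented counts standing in for the Google NEAR-query results.

```python
# Sketch of SO-PMI from co-occurrence hit counts, following the
# PMI-IR formulation [12]; `eps` smooths zero counts.
from math import log2

def so_pmi(hits_word_near_p, hits_word_near_n, hits_p, hits_n, eps=0.01):
    """Semantic orientation of a word from its co-occurrence with the
    canonical positive (Pwords) and negative (Nwords) baskets."""
    return log2(((hits_word_near_p + eps) * (hits_n + eps)) /
                ((hits_word_near_n + eps) * (hits_p + eps)))

# Invented counts: the word co-occurs four times as often with the
# positive basket, and the baskets are equally frequent overall.
print(so_pmi(hits_word_near_p=400, hits_word_near_n=100,
             hits_p=1000, hits_n=1000))  # ~2.0
```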
The distribution of SO-PMI values for the data sets is shown in Fig. 2. From this figure, we can see that the SO-PMI values are fairly evenly distributed. This implies that the words in the training and validation set are not inherently biased negative or positive.
Fig. 2. Frequency Distribution over SO-PMI
We selected a basket of 12 canonical positive and negative words.
Adjectives and adverbs were selected based on most frequent occurrence
in written and spoken English according to the British National Corpus [6,
pp. 286-293]. Because the adjective and adverb lists are published separately, we joined both lists and ordered them by frequency per million words. We selected only those
adjectives and adverbs which were judged positive or negative modifiers
according to the General Inquirer corpus [http://www.wjh.harvard.edu/~
inquirer/]. The basis for the selection of these frequently occurring words
as the canonical words is the increased likelihood of finding documents
which contain both the canonical word and the word for which the PMI–IR
is being calculated. This increases the accuracy of the SO-PMI measurement. Table 2 lists the canonical Pwords and Nwords and their frequency
per million words.
Table 2 Canonical positive and negative words

Positive Words       Negative Words
good (1276)          bad (264)
well (1119)          difficult (220)
great (635)          dark (104)
important (392)      cold (103)
able (304)           cheap (68)
clear (239)          dangerous (58)
The SO-PMI values of all unigrams (nouns, verbs, modifiers) in the target lexicon are pre-calculated and saved in a database to speed up the analysis.
In previously reported research, the assignment of modifiers to a specific category did not take into account which noun or verb a modifier is complementing; rather, the category for a modifier was assigned by lexical distance from a noun or verb. In this paper, we correct this gross assumption by making use of a more rigorous grammatical parse of the clauses. Second, we previously ignored the verb to be, despite the possibility of its contribution to the semantic orientation in spoken English when the verb is emphasized through a prosodic effect, as in “It is!”
In order to determine which noun or verb a modifier is associated with,
we generated a part-of-speech (POS) parse [11] for each clause. The POS tagger provides a way to analyze grammatical relationships between words within a sentence and outputs various analysis formats, including part-of-speech tags, phrase structure trees (how the words are assembled into the clause), and typed dependencies (which words modify another word in the clause). We applied the latter two formats to analyze the labeled design
documents to correctly associate the modifiers with the respective noun or
verb as described in Fig. 3.
We will use the simple clause “It is a great masterpiece” to demonstrate
how to assign modifiers to the correct noun or verb. In Phrase Structure
Tree format, the clause is parsed as:
It/PRP is/VBZ a/DT great/JJ masterpiece/NN ./.
In Typed Dependency Format:
nsubj(masterpiece-5, It-1)
cop(masterpiece-5, is-2)
det(masterpiece-5, a-3)
amod(masterpiece-5, great-4)
Fig. 3. Interaction between phrase structure trees and typed dependency parse
All modifiers (adjectives and adverbs) and their complements (nouns
and verbs) are picked up according to the Phrase Structure Tree Format.
“is” (VBZ)
“great” (JJ)
“masterpiece” (NN)
The amod dependency (adjectival modifier) amod(masterpiece-5, great-4) is the most relevant relation and identifies that “great” modifies “masterpiece.”
For each noun in a sentence, all verbs and modifiers related to it will be
clustered with the noun and saved into a queue with the other clausal
participants. In this example, the queue for the sentence is [5 4 2]. All verbs and modifiers which do not belong to any noun are attached to the end of the queue. Stop words and anaphoric references are ignored.
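The clustering step can be sketched as below. The typed-dependency triples are written out by hand here (a parser such as [11] would produce them), and the within-cluster ordering of modifiers before verbs is our assumption, chosen to reproduce the example queue [5 4 2].

```python
# Sketch of noun-based clustering from typed dependencies, given as
# (relation, head_index, dependent_index) triples with 1-based indices.

def cluster_by_noun(tokens, pos_tags, dependencies, stop_words):
    """Group each noun with its grammatically related modifiers and
    verbs, returning a flat queue of 1-based token indices. Ordering
    modifiers before other dependents is an assumption made here to
    match the paper's example."""
    MODIFIER_RELS = ("amod", "advmod")
    nouns = [i for i, t in enumerate(pos_tags, 1) if t.startswith("NN")]
    content = [i for i, t in enumerate(pos_tags, 1)
               if t.startswith(("NN", "VB", "JJ", "RB"))
               and tokens[i - 1].lower() not in stop_words]
    attached, queue = set(), []
    for noun in nouns:
        queue.append(noun)
        related = [(rel, dep) for rel, head, dep in dependencies
                   if head == noun and dep in content]
        related.sort(key=lambda rd: rd[0] not in MODIFIER_RELS)
        for _, dep in related:
            queue.append(dep)
            attached.add(dep)
    # Verbs and modifiers not related to any noun go at the end.
    queue += [i for i in content if i not in nouns and i not in attached]
    return queue

tokens = ["It", "is", "a", "great", "masterpiece"]
tags   = ["PRP", "VBZ", "DT", "JJ", "NN"]
deps   = [("nsubj", 5, 1), ("cop", 5, 2), ("det", 5, 3), ("amod", 5, 4)]
print(cluster_by_noun(tokens, tags, deps, stop_words={"a", "the", "it"}))
# [5, 4, 2]
```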
Let w1, w2 and w3 be the SO-PMI values for each of these words, respectively. The correction factor κ2 for the word masterpiece (based on the WordNet 2.1 dictionary) is κ2(masterpiece) = [0.5 0.5 0]. These values determine the initial 9-dimension vector.
Because there is no relevant modifier relation for the word “is,” the
algorithm just:
1. looks up its SO-PMI value
2. divides the value by 3 since it could belong to any category
3. inserts the value into the vector
This yields the final 9-dimension vector for the clause “It is a great masterpiece.”
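Assembling the 9-dimension vector for this example can be sketched as follows. The weighting of each contribution by the governing noun's κ2 entry is our reading of the procedure described above, and the SO-PMI values are invented for illustration.

```python
# Sketch of the 9-dimension vector assembly for "It is a great
# masterpiece". Layout follows Fig. 1:
# [PdN PdV PdA  PrN PrV PrA  PpN PpV PpA].
# The k2 weighting rule below is an assumption.

def clause_vector(w_noun, w_adj, w_verb_unattached, kappa2_noun):
    vec = [0.0] * 9
    N, V, A = 0, 1, 2            # constituent offsets within a triplet
    for cat in range(3):         # Pd, Pr, Pp
        k = kappa2_noun[cat]
        vec[3 * cat + N] += k * w_noun  # "masterpiece"
        vec[3 * cat + A] += k * w_adj   # "great", modifying the noun
        # "is" has no modifier relation, so its SO-PMI is divided
        # evenly across the three categories:
        vec[3 * cat + V] += w_verb_unattached / 3.0
    return vec

w1, w2, w3 = 2.0, 3.0, 3.0       # invented SO-PMI values
print(clause_vector(w1, w2, w3, kappa2_noun=[0.5, 0.5, 0.0]))
# [1.0, 1.0, 1.5, 1.0, 1.0, 1.5, 0.0, 1.0, 0.0]
```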
Supervised Machine Learning With Support Vector Machines
The supervised machine learning algorithm used in this research is based
on support vector machines. For a two-class pattern classification problem,
a support vector machine seeks to determine a separating hyperplane that
maximizes the margin of separation between the two classes of patterns.
Here, the two classes of patterns are documents of positive sentiment and
documents of negative sentiment.
For a set of patterns xi ∈ Rn with labels yi ∈ {±1} that are linearly separable in input space, the separating hyperplane w·x + b = 0 must satisfy, for each training data sample (xi, yi), a set of constraints that are usually written in the form yi[w·xi + b] ≥ 1, i = 1, 2, …, m for a training set of m data points. The distance between these two sets of points is 2/║w║. Thus, the margin of separation of the two classes in this linearly separable case can be maximized by minimizing ½║w║². This minimization problem can be solved by forming the Lagrangian function in the usual way. The solution then lies at the saddle point of the Lagrangian, given by minimizing with respect to w and b and maximizing with respect to the Lagrange multipliers. Conditions on w and b at the saddle point can be found by differentiating the Lagrangian and setting it to zero. These conditions on w and b allow them to be eliminated from the Lagrangian, and all that remains then is to maximize with respect to the Lagrange multipliers. This step takes the form:

maximize W(α) = Σi αi − ½ Σi Σj αi αj yi yj (xi · xj)

such that Σi αi yi = 0. The αi are the Lagrange multipliers, which are constrained to be non-negative. Optimum values of w and b (i.e., those that define the hyperplane that provides the maximum margin of separation) can then be obtained by substituting the values of the αi that maximize W(α) into the conditions on w and b that hold at the saddle point.
To format the training data for the SVM, the bag-of-words or the compact representation, respectively, is used as the xi and the sentiment orientation as the yi.
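This setup can be sketched with a small, dependency-free linear SVM trained by subgradient descent on the hinge loss (a simplification of the dual derivation above, not the paper's actual solver); the 9-dimension vectors and labels below are invented.

```python
# Sketch: linear SVM trained by subgradient descent on
# lam/2*||w||^2 + hinge loss. A stand-in for a full SVM solver.

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # point violates the margin
                w = [wj - lr * (lam * wj - yi * xj)
                     for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # only regularisation shrinks w
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Invented compact vectors (layout of Fig. 1); y = +1 positive, -1 negative.
X = [[1.0, 0.2, 1.5, 1.0, 0.2, 1.5, 0.0, 0.2, 0.0],
     [-0.8, 0.1, -1.2, -0.8, 0.1, -1.2, 0.0, 0.1, 0.0],
     [0.9, 0.3, 1.1, 0.9, 0.3, 1.1, 0.0, 0.3, 0.0],
     [-1.1, 0.0, -0.9, -1.1, 0.0, -0.9, 0.0, 0.0, 0.0]]
y = [1, -1, 1, -1]

w, b = train_linear_svm(X, y)
print(predict(w, b, [0.7, 0.2, 1.0, 0.7, 0.2, 1.0, 0.0, 0.2, 0.0]))  # 1
```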
Data Processing
After the rated text data was collected, spell-checked, grammar-checked,
and saved in a sentence pool, additional pre-processing steps were taken
due to the following issues:
• Imbalance of text data distribution. In the 10131 rated sentences/paragraphs, the distribution of Process/Product/People category text is 3915:4484:1732. Keeping this ratio in the training and validation sets would lead to an imbalance of the training and validation data.
• Shortage of data. The machine learning system should be trained on 900 vectors and validated on 600 vectors (fixed-length paragraphs), respectively. In total, 1500 (900+600) paragraphs are needed. If each of the paragraphs is composed of 7-8 rated sentences, then about 10500-12000 rated sentences are required, and this amount exceeds the 10131 rated sentences/paragraphs mentioned above.
To obtain sufficient data points for this study, we must “reuse” the rated
sentences to generate data points.
We modified the data pre-processing step mentioned in our previous paper [13] to produce more data for training. The difference between the old method and the new one is that we use a sentence rather than a paragraph as the textual unit. For each sentence in the sentence pool,
part-of-speech tagging provides the phrase structure trees and typed dependencies needed to obtain the grammatical relationships. A noun-based
clustering algorithm is then applied. The basic idea is to identify every
noun in a sentence and put all verbs and modifiers (adjectives and adverbs)
connected to the noun together with it. The average value of the SO-PMI
of all words in a word-cluster will be distributed into the corresponding
categories in the 9-dimension vector. When all word-clusters in a sentence
are processed, a complete 9-dimension vector is generated.
The advantage of this approach is that we can generate enough data for
training and validation by selecting paragraphs with the same
semantic orientation and semantic meaning, separate the sentences from
the paragraphs, and re-combine the sentences into synthetic paragraphs to
compose the training and validation vectors. That is, one sentence could be
used in more than one training or validation vector. As shown in Table 3,
after applying this sentence-reuse method, we have enough data points for
training and validation. If each paragraph is composed of n sentences,
then for a corpus of m sentences there are, theoretically, combinatorially
many possible paragraphs.
For example, there are 1732 rated sentences about People. If we applied
the pre-processing method in [13], only about 220 paragraphs could be
composed, which is insufficient for training and validation purposes. By
re-using the rated sentences, we can obtain more than 2.47× as many
paragraphs for the same purpose.
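Under the assumption that a synthetic paragraph is an unordered selection of n distinct sentences from the m available, the number of possible paragraphs is the binomial coefficient C(m, n); the numbers below use the People category figures:

```python
from math import comb

# m rated sentences in a category, synthetic paragraphs of n sentences;
# treating a paragraph as an unordered selection is our assumption
m, n = 1732, 7  # the People category with 7-sentence paragraphs

# Without sentence reuse across paragraphs, only about m // n
# disjoint paragraphs can be formed
print(m // n)       # roughly the ~220 paragraphs reported for People

# With reuse, the pool of distinct possible paragraphs is C(m, n)
print(comb(m, n))   # astronomically larger than 220
```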
Table 3 Statistics about collected design text data

Category   Rated Text Data          Proposed              Re-used Sentences
           (Paragraphs/Sentences)   Training/Validation   (× )
Process    500/3915                 300/200               1698
Product    550/4484                 300/200               5035
People     220/1732                 300/200               2.47
Results and Discussion of Results
In the first set of results, we compared the accuracy of sentiment
classification for the movie review data set and the design text using
unigrams (Table 4). We conducted the test of the movie review data set to
ensure the accuracy of the software implementation. The results obtained
are consistent with those obtained by Pang [8]. The bag-of-words text
representation achieved almost 88% accuracy in sentiment orientation for
the design text. The overall, average accuracy of sentiment classification
that we achieved in the previous method for sentiment classification of
design text using the compact representation and SO-PMI values for
individual words was 70.0% [13]. When we set the SO-PMI value for
individual words to 1, i.e., neutral (strictly, 1 is not neutral, as no
word is truly neutral; but as long as all words share the same nonzero
SO-PMI value, there is no difference between words in semantic
orientation, so they are neutral relative to one another), the overall
accuracy of sentiment classification was still 70.0%. There is no
significant improvement or degradation in performance even when correcting
for the correct noun modifiers in the 9-dimensional vector representation.
Table 4 Accuracy of sentiment classification for bag-of-words and compact
representation

Data Set       Bag-of-Words   Compact Representation
Movie Review   80.2%          N/A
Design Text    87.6%          70%
The accuracy of categorization was also compared. Again, the bag-of-words
approach beats the compact representation. There is an improvement in
semantic categorization over our prior results [13] when the SO-PMI value
is not included in the representation (i.e., set to 1).
Turney [12] reported that it is not clear whether SO-PMI would help (or
hinder) semantic categorization. These results indicate that SO-PMI may
hinder semantic categorization when training over limited feature sets.
Table 5 Accuracy of categorization for bag-of-words and compact representation

Data Set   Bag-of-Words   Compact Representation
Product    87.05%         79.7%
Process    84.52%         77.0%
People     89.82%         78.3%
Table 6 reports the accuracy of sentiment classification when the
classifier is trained on the design text data using the unigram word list
from the movie review data set (A) and using the unigram word list from the
design text data set (B). Here, we find that the results are worse than
merely guessing. What this finding confirms is that it is not possible to
train the sentiment classifier on one text domain and then deploy the
classifier on another domain since the classifier is highly sensitive to the
words which appear in both the training and target domains.
Table 6 Accuracy of sentiment classification when trained on design data

Data Set       A        B
Movie Review   33.02%   33.67%
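The domain sensitivity can be made concrete by measuring how little the unigram vocabularies of two corpora overlap (a sketch with invented toy corpora; the actual data sets are far larger):

```python
def vocabulary(docs):
    """Unigram vocabulary of a corpus: the set of distinct tokens."""
    return {tok for doc in docs for tok in doc.lower().split()}

# Invented toy corpora standing in for the two domains
movie = ["a gripping film with superb acting",
         "dull plot and wooden performances"]
design = ["the prototype tolerances exceed specification",
          "stakeholder feedback shaped the concept"]

v_movie, v_design = vocabulary(movie), vocabulary(design)
shared = v_movie & v_design
overlap = len(shared) / len(v_movie | v_design)

# Unigram features learned on one domain rarely fire on the other
print(f"shared tokens: {sorted(shared)}, overlap: {overlap:.0%}")
```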
We conjectured that the results might improve if we used a word list
consisting of words that appear in both the movie review data set and a
word list consisting of words that appear in either the movie review or the
design text data sets. To examine the effect of the word list on the
sentiment classification, we ran tests of the sentiment classifier when
trained on the movie review data set using words (unigrams) which appear
in both the movie and design text data sets (M & D) and using words
(unigrams) which appear in the movie or design text data sets (M || D). The
results are reported in Table 7. There was essentially no improvement. It
would appear that there are specific patterns of co-occurrence of words in
specific texts which are related to semantic orientation. This result
suggests that the accuracy of the sentiment classifier is not necessarily
improved merely by adding more words (evidence) into the training set,
particularly if the added words do not actually appear in the training texts.
Table 7 Accuracy of sentiment classification when trained on movie review

Data Set      Bag-of-Words (M && D)   Bag-of-Words (M || D)
Design Text   33.58%                  32.56%
Despite the limitation of the bag-of-words approach in training the
sentiment classifier on a corpus which differs from the target set, the
results point to a more intriguing hypothesis. When Claude Shannon
published A Mathematical Theory of Communication [10], he showed that
it was possible to model the generation of communication as a
probabilistic system based on relatively simple rules about the statistical
co-occurrence of letters in English words. Prior research has demonstrated
that latent semantic analysis exploits the statistical co-occurrence of
words in discourse to model the underlying knowledge
representation of the communicator and that meaning emerges from the
statistical co-occurrence of semantics [1]. An application of lexical chain
analysis showed that the statistical co-occurrence of semantic links in
discourse reveals the way that ideas are connected by language and that
concept formation is driven by the accumulation of knowledge represented
as lexicalized concepts [2]. Finally, the success of the bag-of-words approach
for sentiment analysis shows that a relevant set of semantic features is
sufficient to recognize the semantic orientation. This new result adds
evidence that it might be possible to generate a computationally derived
view of the language of design [3] purely from statistical patterns of
co-occurrence of linguistic features rather than from a rigorous linguistic
model.
Conclusions
This paper presented empirical studies of the calculation of semantic
orientation (sentiment analysis) of design texts using the same supervised
machine learning algorithm on two different representations of the text,
one which is knowledge-lean (bag-of-words) and one which requires more
linguistic knowledge. In all cases, the knowledge-lean representation
outperformed the other. The compact representation performed better than
guessing, but it still trails the bag-of-words method
significantly. The effect of setting the SO-PMI value to 1 as done in this
research is equivalent to reducing the number of dimensions of the
representation. The dimensional reduction to 9 dimensions suggests that
lower dimensional representations might work as well as higher
dimensional ones, though this claim needs further elaboration.
From these results and results from other researchers, the semantic
orientation of documents appears to be largely dependent upon the
semantic dependencies of words within a corpus, at least from a statistical
natural language processing point of view. Embedding any additional
knowledge about the semantic orientation of individual words in a clause
does not appear to improve or degrade the performance of the sentiment
classification. This result parallels findings in information retrieval,
which have shown that a purely statistical natural language processing approach
such as latent semantic indexing (LSI) generally outperforms more
knowledge-rich approaches. A sentiment classifier based on the
bag-of-words approach, however, is very sensitive to the word list used in
the training set and to its co-occurrence patterns. This sensitivity is
not found in LSI.
This is a rather pessimistic finding for the development of a
general-purpose semantic orientation classifier, given that it is not readily obvious
how to use existing tagged corpora to train a sentiment classifier and then
to deploy the trained classifier onto another corpus. Further investigation
could alter the words (unigrams and bigrams) used in the training vector.
Data from the British National Corpus on the most commonly used words [6]
might be an attractive source for this word list. It would be interesting to
use an evolutionary optimization algorithm to select the best set of
“general” words from this word list which results in the best sentiment
classification over multiple corpora rather than a single corpus. This might
allow the trained classifier to be deployed against various corpora.
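A toy stand-in for that idea, using simple hill climbing over word subsets rather than a full evolutionary algorithm, and a made-up fitness function in place of cross-corpus classification accuracy:

```python
import random

random.seed(0)

# Stand-in for a BNC-derived list of common words (invented here)
words = ["good", "bad", "strong", "weak", "fast", "slow", "clear", "vague"]

def fitness(subset):
    # Stand-in objective: the real fitness would be the mean
    # sentiment-classification accuracy over multiple corpora when
    # training uses only `subset` as the feature word list; here we
    # simply prefer subsets of a target size
    return -abs(len(subset) - 4)

# Hill climbing over word subsets: flip one word's membership per step
current = set(random.sample(words, 3))
for _ in range(200):
    candidate = set(current)
    candidate.symmetric_difference_update({random.choice(words)})
    if fitness(candidate) >= fitness(current):
        current = candidate

print(sorted(current))  # a subset of the "general" word list
```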
The main conclusion from this study is that if one is interested in
developing a sentiment classifier, then one must devote considerable time
and cost towards training the classifier on the target text. The expense of
producing the tagged training data should not be underestimated. The cost
of hiring research assistants to produce the training data for the design
text exceeded tens of thousands of Australian dollars. The training set that has
now been produced might be valuable for further research in sentiment
analysis of design text. For the time being, the bag-of-words approach
coupled with support vector machines appears to be the optimal approach
for sentiment classification.
Acknowledgements
This research was supported under Australian Research Council’s
Discovery Projects funding scheme (project number DP0557346). Xiong
Wang is supported by an Australian Postgraduate Award scholarship.
References
1. Dong A (2005) The latent semantic approach to studying design team
communication. Design Studies 26(5): 445-461
2. Dong A (2006) Concept formation as knowledge accumulation: a
computational linguistics study. Artificial Intelligence for Engineering
Design, Analysis and Manufacturing 20(1): 35-53
3. Dong A (2007) The enactment of design through language. Design Studies
28(1): 5-21
4. Dong A, Kleinsmann M, Valkenburg R (2007) Affect-in-cognition through
the language of appraisals. In McDonnell J, Lloyd P (eds), Design Thinking
Research Symposium 7, Central Saint Martins College of Art and Design,
University of the Arts London, London, UK: 69-80
5. Kleinsmann M, Dong A (2007) Investigating the affective force on creating
shared understanding. In 19th International Conference on Design Theory and
Methodology (DTM), ASME, New York, DETC2007-34240
6. Leech G, Rayson P, Wilson A (2001) Word frequencies in written and spoken
English based on the British National Corpus. Pearson Education Limited,
Harlow, UK
7. Martin JR, White PRR (2005) The language of evaluation: appraisal in
English. Palgrave Macmillan, New York
8. Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification
using machine learning techniques. In 2002 Conference on Empirical
Methods in Natural Language Processing (EMNLP), Association for
Computational Linguistics, University of Pennsylvania, Philadelphia, PA: 79-86
9. Shaikh MAM, Prendinger H, Mitsuru I (2007) Assessing Sentiment of Text
by Semantic Dependency and Contextual Valence Analysis. In Paiva A et al.
(eds), Affective Computing and Intelligent Interaction, Springer-Verlag Berlin
Heidelberg, Berlin, 191-202
10. Shannon CE (1948) A mathematical theory of communication. The Bell
System Technical Journal 27: 379-423 and 623-656
11. Toutanova K, Manning CD (2000) Enriching the knowledge sources used in a
maximum entropy part-of-speech tagger. In Schütze H, Su K-Y (eds),
Proceedings of the 2000 Joint SIGDAT Conference on Empirical methods in
natural language processing and very large corpora, Association for
Computational Linguistics, Morristown, NJ: 63-70
12. Turney PD, Littman ML (2003) Measuring praise and criticism: Inference of
semantic orientation from association. ACM Transactions on Information
Systems 21(4): 315-346
13. Wang J, Dong A (2007) How am I doing 2: Computing the language of
appraisal in design. In Design for Society: Knowledge, innovation and
sustainability, 16th International Conference on Engineering Design
(ICED'07), Ecole Centrale Paris, Paris, France, ICED'07/124
14. Whitelaw C, Garg N, Argamon S (2005) Using appraisal groups for sentiment
analysis. In CIKM '05, Proceedings of the 14th ACM international conference
on Information and knowledge management, ACM, NY: 625-631
15. Wiebe JM, Bruce RF, O'Hara TP (1999) Development and use of a
gold-standard data set for subjectivity classifications. In Proceedings of the 37th
annual meeting of the Association for Computational Linguistics on
Computational Linguistics, Association for Computational Linguistics,
Morristown, NJ: 246-253