A Case Study of Computing Appraisals in Design Text

Xiong Wang and Andy Dong
University of Sydney, Australia

This paper presents case studies of the calculation of appraisals, the linguistic construal of emotions and attitudinal positions, in natural language design text. Using two data sets, one a standard set of movie review text and the other a set of natural language design text, we compare the performance of support vector machines for the classification of design documents by overall semantic orientation based on two different numerical representations of the design text. For the natural language design text data set, we additionally compare the performance of the support vector machine for the categorization of the text into three categories: Product, Process, and People. We find that the sparse yet high-dimensional representation of the design text allows the support vector machine to perform best. Further, we find only modest benefit in encoding statistically derived data about the semantic orientation, and lexical data about the semantic category, into the representation beyond frequency counts of the unigrams occurring in the text.

Computing sentiments in design text

In recent years, the computational linguistics community has turned its attention toward the modeling of the subjectivity and sentiment of language. The aim of understanding the sentiment of a text is to distinguish the subject of the text from the subjective stance taken by the author towards the topic. At the moment, a primary task of sentiment analysis is to determine the semantic orientation, positive or negative, of the sentiment. Determining the semantic orientation of design documents may have useful outcomes such as determining the level of risk and uncertainty in product specifications, assessing the temperament of a design team, or managing the progress of the design process.
J.S. Gero and A.K. Goel (eds.), Design Computing and Cognition ’08, © Springer Science + Business Media B.V. 2008

For example, by mapping semantic orientation in design text to a model of design as reflective practice, we have been able to elicit the co-existence of affective processing and rational cognitive processing [4] and the effect of attitudes toward the formation of shared understanding [5]. Building a general-purpose sentiment classifier for design text is, however, not a priori obvious. Some of the challenges in building such a classifier are studied in this paper. One of the principal challenges of developing sentiment analysis systems has been the lack of a precise computational language model of sentiment. Within the theory of systemic-functional linguistics, the APPRAISAL system [7] provides a rigorous, network-based model for sentiment, which linguists characterize as the construal of emotions and interpersonal relations in language. The model has been partially implemented [14]. Yet, what is intriguing is that one of the most accurate supervised machine learning sentiment classifiers to date relies on a standard bag-of-words representation of the text using unigrams found in the text [8] rather than feature sets from the semantic resources in the APPRAISAL system. The semantic resources of the APPRAISAL system are all the linguistic means available to a speaker to express subjective content, such as affect (“I like this book”) and engagement (“You know, this is a really good book”). Pang’s method is based on characterizing a document as a bag-of-words. The bag-of-words ignores knowledge about the words, such as part-of-speech and semantic meaning, and the grammar, treating the text as an unordered collection of words. Each document in the corpus is numerically represented as a high-dimension vector of frequency counts of unigrams matching a pre-determined list.
For any unigram in the candidate list of content-bearing words, if the unigram appears in both the target document and the list, the corresponding position in the vector is set to 1; otherwise it is set to 0. This method attained around 80% accuracy in semantic orientation classification. Only by combining the semantic resources for Attitude and Orientation from the APPRAISAL system with bag-of-words features does the performance of sentiment classification improve [14], though not by much. Another method, which employed cognitive and commonsense knowledge as rules, also does not perform as well as the bag-of-words classification [9]. There are two challenges in adapting the bag-of-words and appraisal features methods to the analysis of design text. The first challenge has to do with re-using the existing bag-of-words sentiment classifier. The standard test corpus for sentiment analysis in the computational linguistics community is movie reviews (http://www.cs.cornell.edu/people/pabo/movie-review-data/). All of the sentiment classifiers described previously were trained, tested, and compared on this movie review data set. Whilst it is not altogether explainable, at least theoretically, why the bag-of-words method works so well compared to a method rigorously grounded in linguistic theory, it is tempting to apply the bag-of-words classifier immediately to design text nonetheless. After all, the performance of the APPRAISAL theory based classifier is not so markedly better as to justify the additional computational expense of generating the lexicon for the appraisal groups. However, it is important to note that the bag-of-words classifier was trained on a data set consisting of text about movie reviews.
It is not a priori obvious whether a supervised machine learning system trained on a linguistic data set from a specific domain, movie reviews, could perform equally well on the target data set, design text. The lexicons of the two domains are different. It is not known whether this difference results in a significant degradation in the accuracy of sentiment classification when the classifier is trained on one data set but deployed on another. This drop in performance has been claimed [9] but not tested. Yet, there is the potential that it might be possible to transfer the sentiment classifier across domains. There is evidence from research by Wiebe [15] that a sentence is 55.8% likely to have subjective content if it contains an adjective; thus, the appearance of a broad range of adjectives in the training data set might be sufficient for sentiment analysis, and it would not be necessary to utilize all the words that appear in the target domain to train the classifier. It would be highly attractive to re-use a sentiment classifier trained on existing data sets because of the expense of producing tagged data to train machine learning classifiers. Second, the bag-of-words method relies on a very high-dimension representation that hinges on training the system on a text domain with high coverage of the words likely to appear in the target corpus. The fewer words the two corpora share, the less likely it is that the bag-of-words sentiment classifier will perform well. This makes such systems difficult to transport across linguistic domains, or even within a linguistic domain when the training set does not contain as many unique words as the target domain. Also, if such a system were deployed on a very large corpus, for example the corpus associated with the design of a very large engineering system such as an aircraft, it is very possible that there would be millions of features (unigrams).
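Both concerns can be illustrated directly: lexicon coverage between corpora can be measured as the fraction of target-domain vocabulary that also appears in training, and pooling corpora shows how the unigram feature set grows. The toy corpora below are invented.

```python
# Illustrative sketch: measure how much of the target domain's vocabulary the
# training domain covers, and how large the unigram feature set becomes when
# the corpora are pooled. The corpora here are invented toy examples.

def vocabulary(docs):
    """Set of unique lowercased tokens across all documents."""
    return {w for doc in docs for w in doc.lower().split()}

def coverage(training_docs, target_docs):
    """Fraction of the target vocabulary that also appears in training."""
    train, target = vocabulary(training_docs), vocabulary(target_docs)
    return len(train & target) / len(target)

movie_reviews = ["a great film with a superb cast", "the plot was dull"]
design_texts = ["a great design with a risky fabrication process"]

print(round(coverage(movie_reviews, design_texts), 2))   # shared-word coverage
print(len(vocabulary(movie_reviews + design_texts)))     # unigram feature count
```

The lower the coverage, the less a classifier trained on one corpus can exploit in the other; and every new document can only enlarge the feature space.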
This very high dimensionality reduces the computational efficiency of the machine learning system and introduces other pragmatic implementation challenges. Finally, the more unique words exist in the text, the more training cases are needed, and these are, from a practical standpoint, difficult to obtain. It would, instead, be desirable to have a lexicon-independent representation. In our prior work [13], we presented an initial feasibility study of a method to calculate the semantic orientation of design text using a compact representation of text wherein the possible semantic orientation of individual words in clauses was calculated using a statistical measure of co-occurrence with known positively and negatively oriented words. That study confirmed the feasibility of the approach, but revealed two key limitations. First, the calculation of a word’s semantic orientation is computationally expensive and requires a real-time lookup to a very large online database. For that research, we used Google, although other “offline” data sets are available, such as the Google/Linguistic Data Consortium Web 1T 5-gram data set. We found that the calculated values from the Google queries fluctuated depending upon which Google servers were “hit” during a particular query. Second, the evidence from the recent research comparing computational sentiment analysis based on appraisal theory against the bag-of-words approach, showing that the bag-of-words approach is superior, leads us to question the utility of embedding statistically derived data about semantic orientation into the representation of text. Thus, in this project, we tested the claim that embedding statistically derived data about a word’s semantic orientation into the text representation improves the sentiment classification of text by setting the sentiment value of individual words to 1 (neutral). In this paper, we set out to investigate these related issues.
That is, this paper investigates:
• The accuracy of a bag-of-words sentiment classifier trained on the movie review data set applied to a design text data set (same algorithm, same text representation, different training and target data sets)
• The accuracy of a bag-of-words sentiment classifier compared to a sentiment classifier operating on a highly compact representation of design text (same algorithm, different text representation, same training and target data sets)
• The accuracy of the compact representation when knowledge of the semantic orientation of individual words is included in the representation and when it is absent

To address these research questions, we:
• Produced a tagged corpus of design text rated for semantic orientation (positive or negative)
• Developed a compact document representation for text that includes features for word category and semantic orientation
• Trained a bag-of-words sentiment classifier on the design text
• Executed various comparative tests

What is it about design documents?

There are characteristics which distinguish design documents from other types of documents such as newspaper stories or political policy platforms. The language of design is a special kind of language owing to its performativity. That is, the reality-producing effect of the language of design, where the reality is both the design work and the design process, is itself an enactment of design, a praxis about materialising realities [3]. From this point of view, the suggestion is that there is a connection between the practice of design and the linguistic properties of the language of design. Design texts constitute the materiality of the design work through language, and it is the sentiment in text which privileges one work over another.
Thus, understanding this process of privileging one work over another, which we believe is realised through the linguistic process of appraisal, is a key step in understanding how the language of design produces design. Secondly, designers write down their thoughts about a project as they are working on it. Design documents can therefore be considered a recording of a design process over time. Ascertaining the ebbs and flows of attitudes toward the design work and design process could paint a more nuanced picture of how well the designing went. In sentiment classification of design text, it is also important to consider the subject matter. We distinguish three high-level subject matters. Appraisals of Product are directed towards the design work, including its requirements and goals and the data informing the construal of the design brief. Appraisals of Product can justify (provide rationale for) decisions taken during the design process. That is, appraisals of Product can explain how the designers’ feelings toward the design work influenced the designing of the work. In the appraisal of Product, the designer may rely on semantic resources that apply an external, normative judgment or a personal, subjective appreciation. For example, “Uniqlo is a comparatively reserved design” is a positive appraisal by Tokyo Style of the style of the Uniqlo clothing line. Taking stances towards tangible tasks and actions performed during designing identifies the appraisal of Process. Appraisal of Process is generally associated with concrete actions. In all of the process-oriented appraisal clauses, a tangible action is being evaluated, and the evaluation associates a position toward the state of being of the action. An example is “That’s a risky strategy here,” which displays a negative attitude towards a way of doing a design task. Appraisals of People express evaluations of the cognitive and physical states of being of a person (a stakeholder in the design process).
Appraisals of People tend to take on an air of normative evaluation about how people should and should not be or behave. “Uniqlo hired the country’s hottest retail designer” is a positive appraisal by Tokyo Style of the people doing design for Uniqlo, here employing a normative evaluation in which a norm for a “hot” designer is assumed to be known by the reader.

System Development and Experiment Setup

Labeled Design Text

To understand the difference in performance of the bag-of-words sentiment classifier when trained on a movie review data set and deployed on design text, we needed to create labeled design text. That is, we needed to create a new data set consisting of text about design works, the process of designing, and designers, labeled for semantic orientation. A cohort of three native English speakers with a background in a design-related discipline (e.g., engineering, architecture, and computer science) was tasked with reading and categorizing various design texts. The texts included formal and informal design text from various online sources and across various design-related disciplines. All design texts were collected by the authors. Each coder was paid to classify the texts. The coders were trained to identify the proper category and its semantic orientation according to the context. Training lasted for one hour. During coding, 2 of the 3 coders had to agree on the semantic meaning (category) and the semantic orientation, that is, positive or negative. Working in two-hour time blocks, the coders read various design texts, including formal design reports, reviews of designed works, reviews of designers, and transcripts of conversations of designers working together. Every thirty minutes, the coders took a “fatigue check” test to assess their performance. The fatigue test consisted of six appraisals that the second author had previously labeled.
These constituted a set of “known” appraisals with correct content categorization and semantic orientation. The appraisals were randomized so that the group would not receive two tests containing the same set of appraisals in the same order. The fatigue test usefully provides us a baseline for the “best” performance we could expect from the machine learning classifier as well as an assessment of the internal validity of the collected data. One-third of the data set was cross-categorized by the second author and a colleague per standard practice in protocol analysis to ensure agreement upon a reliable categorization and sentiment classification. Finally, the second author reviewed the coders’ work to ensure that they were correctly following the rules for coding the categories and orientation according to the framework for the language of appraisal in design. Following the review, the coders were required to make one last pass through the data to re-code text the author thought might be incorrectly labeled. In the practice of human labeling of data for training computational linguistic classifiers, the guideline of more than five votes per paragraph is used as a baseline for confirming “correct” labeling of a text. This coding system satisfies the requirement and is reliable because at least five people, three people from the student cohort and two researchers, agreed on each rating. The accuracy of the coders for semantic orientation was about 90% over most of the sessions [13]. Pang [8] found that human-based classifiers were accurate only 58% to 64% of the time; thus, the performance of our coders is consistent with other studies (actually better, since the coders discussed their interpretations) and is likely to be the “best” that could be expected from a computational system.
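The 2-of-3 agreement rule used during coding can be sketched as a simple majority vote over (category, orientation) labels. The labels below are illustrative, not drawn from the actual coding sessions.

```python
# Sketch of the "2 of 3 coders must agree" rule: a text's label is accepted
# only when at least two coders assign the identical (category, orientation)
# pair; otherwise the text has no agreed label.

from collections import Counter

def majority_label(labels, required=2):
    """Return the label at least `required` coders agree on, else None."""
    winner, count = Counter(labels).most_common(1)[0]
    return winner if count >= required else None

ratings = [("Product", "positive"), ("Product", "positive"), ("Process", "positive")]
print(majority_label(ratings))  # -> ('Product', 'positive')
```

Texts with no agreed label would be flagged for review rather than entering the training data.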
Methods of Representation of Design Text

Bag-of-Words Representation

The bag-of-words sentiment classifier operates on a simple word occurrence representation format. In this representation format, a full-text analysis program counts the frequency of occurrence of a single word (unigram) or word group (bigram, trigram, … n-gram). This word-by-document matrix F (see Table 1) comprises n words w1, w2, …, wn in m documents d1, d2, …, dm, where the weights indicate the total frequency of occurrence of term wp in document dq. Let us take the sentence “It is a great masterpiece” to show how to compose the bag-of-words (BoW) representation. First, the “bag” consists of a list of 12,111 words which appear in the design documents. The list of words is generated from a full-text parse of the design text to extract key words and phrases, but not stop words such as “a” and “the.” Because the representation vector is sparse, we only show the columns which are nonzero. In the following equation, the right side is the BoW vector for the sample sentence. For each vector component, the first number is the column, and the appearance of the word (indicated by a 1) follows the colon:

BoW(“It is a great masterpiece.”) = [3:1 6:1 10:1 176:1 3077:1]

Table 1 Sample word-by-document matrix

     w1  w2  …  wn
d1    0   1  …
d2    1   0  …   1
…
dm    2   0  …   1

Compact Representation

To compare the accuracy of the bag-of-words sentiment classifier with that of a sentiment classifier operating on a highly compact representation of design text, we propose a compact representation of design text that could overcome the lack of lexicon coverage between the training data set and the target data set. That is, the aim of the representation is to be lexicon independent and to train the classifier only on a set of numerical values relating to the potential sentiment of the text and the semantic category of the text.
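The sparse column:value encoding shown earlier for “It is a great masterpiece” can be generated as follows. The word-to-column mapping below is hypothetical; the actual candidate list in the paper has 12,111 entries.

```python
# Sketch of the sparse "column:count" encoding (SVM-light style) shown above.
# The word-to-column mapping is a hypothetical fragment of the full word list.

def sparse_bow(document, word_to_column):
    """Return the nonzero columns of the BoW vector as 'col:count' pairs."""
    counts = {}
    for token in document.lower().rstrip(".").split():
        col = word_to_column.get(token)
        if col is not None:
            counts[col] = counts.get(col, 0) + 1
    return " ".join(f"{col}:{val}" for col, val in sorted(counts.items()))

word_to_column = {"it": 3, "is": 6, "a": 10, "great": 176, "masterpiece": 3077}
print(sparse_bow("It is a great masterpiece.", word_to_column))
# -> "3:1 6:1 10:1 176:1 3077:1"
```

Only the nonzero columns are stored, which is what makes the very high-dimensional representation practical.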
The basic requirement for the representation is that it must encode which categories words might belong to and whether the individual words express a positive or negative sentiment. Thus, for each clause, we need to encode the category – Product, Process, or People – and the sentiment of the clause’s constituents in the numerical value. The content-bearing constituents in a clause are nouns, verbs, and adjective or adverb modifiers. Encoding this information requires 3 × 3 = 9 combinations: three dimensions for the categories crossed with three dimensions for the constituent types in a clause. The insight here is that each triplet of vector dimensions encodes the category of the clause and the length of each vector dimension encodes the sentiment. In total, we have a 9-dimensional vector of the following form (Fig. 1):

Fig. 1. 9-dimensional vector representation of text where Pd = Product, Pr = Process, Pp = People, N = noun, V = verb, and A = adjective/adverb

The value of each of the vector dimensions is determined in the following way. First, all the content-bearing words are automatically extracted from the text, and their grammatical relationships are identified. Then, each word is looked up (queried) in the WordNet lexicographer database to ascertain the logical grouping that might indicate the appropriate category (Product, Process, People) for the word. The WordNet lexicographer database, with its syntactic categories and logical groupings, was used to categorize nouns as being about Product, Process, or People. Verbs, adjectives, and adverbs are categorized according to the category(ies) of the noun they relate to grammatically. These clusters of syntactically related words are called word groups. For the noun in each word group, rules were applied to identify which of the WordNet logical groupings would contain nouns in the categories [13].
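A minimal sketch of the noun categorization step, assuming a hand-built mapping from WordNet lexicographer groupings to the three categories and an even split of weight across ambiguous groupings; the mapping below is hypothetical, and the actual rule set in [13] is more extensive.

```python
# Hypothetical mapping from WordNet lexicographer groupings to the three
# categories; the real rules in [13] are more detailed than this sketch.
LEXNAME_TO_CATEGORIES = {
    "noun.artifact": ("Product",),
    "noun.act": ("Process",),
    "noun.person": ("People",),
    "noun.cognition": ("Product", "Process"),  # an ambiguous grouping
}

def category_weights(lexname):
    """Return a [Product, Process, People] weight vector for a noun's grouping."""
    cats = LEXNAME_TO_CATEGORIES.get(lexname, ())
    if not cats:
        return [0.0, 0.0, 0.0]
    share = 1.0 / len(cats)  # split evenly across candidate categories
    return [share if c in cats else 0.0 for c in ("Product", "Process", "People")]

print(category_weights("noun.cognition"))  # ambiguity split across two categories
```

An ambiguous grouping thus contributes fractionally to each candidate category rather than being forced into one.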
Two correction factors are applied to the frequency count of a word in the target clause: κ1, which is inversely proportional to the number of possible Product-Process-People categories a WordNet logical grouping can belong to; and κ2, which takes into account the uncertainty of a word’s category. Since the correction factor κ2 for a word may have up to three values, it is normally expressed as a vector of the form κ2(word) = [κ2,Pd, κ2,Pr, κ2,Pp]. The semantic orientation (SO) of the words in each word group is calculated using the SO-PMI measure, which is in turn based on their pointwise mutual information (PMI) [12]. The strategy for calculating the SO-PMI is to calculate the log-odds (Equation 1) of a canonical basket of positive (Pwords) or negative (Nwords) words appearing with the target word, on the assumption that if the canonical good or bad word appears frequently with the target word then the target word has a similar semantic orientation.

PMI(word1, word2) = log2 [ p(word1 & word2) / (p(word1) p(word2)) ]

Equation 1 The log odds that two words co-occur

As reported in our prior research [13], we used a Google query with the NEAR operator to look up the co-occurrence of the target word with the canonical basket of positive and negative words. The SO-PMI based on the NEAR operator is described by Equation 2.

SO-PMI(word) = log2 [ hits(word NEAR Pwords) × hits(Nwords) / (hits(word NEAR Nwords) × hits(Pwords)) ]

Equation 2 The semantic orientation of a word based on mutual co-occurrence with a canonical basket of positive and negative words

SO-PMI values generally follow intuitive notions of positive and negative words. The SO-PMI value for a strongly negative word such as “unlikely” is –8.1, whereas it is 1.7 for “unintended” and 17.9 for “unified”, a positive word. However, this is not always true, as many positive words can be used ironically or sarcastically, as in “This is a masterpiece?”. SO-PMI values can be “close” and negative-valued even when, qualitatively, it is intuitively known that the two words differ in positive/negative orientation.
For example, SO-PMI(“masterpiece”) = –5.7 whereas SO-PMI(“depressed”) = –6.7. No structural relation to the sentiment of a text can necessarily be imputed from the SO-PMI values of individual words alone. It is for this reason that it is necessary to use a machine learning system trained over aggregations of SO-PMI values for word groups in the text rather than individual words. The distribution of SO-PMI values for the data sets is shown in Fig. 2. From this figure, we can see that the SO-PMI values are fairly evenly distributed. This implies that the words in the training and validation set are not inherently biased negative or positive.

Fig. 2. Frequency distribution over SO-PMI

We selected a basket of 12 canonical positive and negative words. Adjectives and adverbs were selected based on most frequent occurrence in written and spoken English according to the British National Corpus [6, pp. 286-293]. Because the adjective and adverb lists are published separately, we joined both lists and ordered them by frequency per million words. We selected only those adjectives and adverbs which were judged positive or negative modifiers according to the General Inquirer corpus [http://www.wjh.harvard.edu/~inquirer/]. The basis for the selection of these frequently occurring words as the canonical words is the increased likelihood of finding documents which contain both the canonical word and the word for which the PMI-IR is being calculated. This increases the accuracy of the SO-PMI measurement. Table 2 lists the canonical Pwords and Nwords and their frequency per million words.

Table 2 Canonical positive and negative words

Positive words: good (1276), well (1119), great (635), important (392), able (304), clear (239)
Negative words: bad (264), difficult (220), dark (104), cold (103), cheap (68), dangerous (58)

The SO-PMI of all unigrams (nouns, verbs, modifiers) in the target lexicon are pre-calculated and saved in a database to speed up the analysis.
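The SO-PMI calculation of Equation 2 can be sketched as follows, with a small smoothing constant to avoid division by zero. The `hits` arguments stand in for the Google NEAR query counts the paper used; the counts in the example are invented for illustration.

```python
# Sketch of Equation 2: semantic orientation from co-occurrence hit counts
# (Turney's SO-PMI). The hit counts stand in for Google NEAR queries and are
# invented; a smoothing constant guards against zero counts.

import math

def so_pmi(hits_near_pos, hits_near_neg, hits_pos, hits_neg, smoothing=0.01):
    """SO-PMI of a target word given co-occurrence counts with the baskets."""
    return math.log2(
        ((hits_near_pos + smoothing) * (hits_neg + smoothing)) /
        ((hits_near_neg + smoothing) * (hits_pos + smoothing))
    )

# Invented counts: the target word co-occurs far more often with the
# canonical positive basket than with the negative one.
score = so_pmi(hits_near_pos=4000, hits_near_neg=500,
               hits_pos=100000, hits_neg=20000)
print(round(score, 2))  # positive score -> positively oriented word
```

Swapping the co-occurrence counts flips the sign, which matches the intuition that the measure compares association with the positive basket against association with the negative one.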
In previously reported research, the assignment of modifiers to a specific category did not take into account which noun or verb a modifier complements; rather, the category for a modifier was assigned by lexical distance from a noun or verb. In this paper, we correct this gross assumption by making use of a more rigorous grammatical parse of the clauses. Second, we previously ignored the verb to be. There is the possibility of its contribution to the semantic orientation in spoken English when the verb is emphasized through a prosodic effect, as in “It is!” In order to determine which noun or verb a modifier is associated with, we generated a part-of-speech (POS) parse [11] for each clause. The POS tagger provides a way to analyze the grammatical relationships between words within a sentence and outputs various analysis formats, including part-of-speech, phrase structure trees (how the words are assembled into the clause), and typed dependencies (which words modify another word in the clause). We applied the latter two formats to analyze the labeled design documents to correctly associate the modifiers with the respective noun or verb, as described in Fig. 3. We will use the simple clause “It is a great masterpiece” to demonstrate how to assign modifiers to the correct noun or verb. In Phrase Structure Tree format, the clause is parsed as:

It/PRP is/VBZ a/DT great/JJ masterpiece/NN ./.

In Typed Dependency format:

nsubj(masterpiece-5, It-1)
cop(masterpiece-5, is-2)
det(masterpiece-5, a-3)
amod(masterpiece-5, great-4)

Fig. 3. Interaction between phrase structure trees and typed dependency parse

All modifiers (adjectives and adverbs) and their complements (nouns and verbs) are picked up according to the Phrase Structure Tree format.
“is” (VBZ), “great” (JJ), “masterpiece” (NN)

The amod dependency (adjectival modifier) amod(masterpiece-5, great-4) is the most relevant relation and identifies that “great” modifies “masterpiece.” For each noun in a sentence, all verbs and modifiers related to it are clustered with the noun and saved into a queue with the other clausal participants. In this example, the queue for the sentence is [5 4 2]. All verbs and modifiers which do not belong to any noun are attached to the end of the queue. Stop words and anaphoric references are ignored. Let w1, w2, and w3 be the SO-PMI values for each of these words (“masterpiece”, “great”, and “is”), respectively. The correction factor κ2 for the word “masterpiece” (based on the WordNet 2.1 dictionary) is κ2(masterpiece) = [0.5 0.5 0]. The initial 9-dimension vector is therefore:

[Pd: 0.5w1, 0, 0.5w2 | Pr: 0.5w1, 0, 0.5w2 | Pp: 0, 0, 0]

Because there is no relevant modifier relation for the word “is,” the algorithm simply:
1. looks up its SO-PMI value
2. divides the value by 3, since it could belong to any category
3. inserts the value into the vector

The final 9-dimension vector for the clause “It is a great masterpiece” is:

[Pd: 0.5w1, w3/3, 0.5w2 | Pr: 0.5w1, w3/3, 0.5w2 | Pp: 0, w3/3, 0]

Supervised Machine Learning with Support Vector Machines

The supervised machine learning algorithm used in this research is based on support vector machines. For a two-class pattern classification problem, a support vector machine seeks a separating hyperplane that maximizes the margin of separation between the two classes of patterns. Here, the two classes of patterns are documents of positive sentiment and documents of negative sentiment. For a set of patterns xi ∈ ℝn with labels yi ∈ {±1} that are linearly separable in input space, the separating hyperplane w·x + b = 0 must satisfy, for each training data sample (xi, yi), a set of constraints that are usually written in the form

yi(w·xi + b) ≥ 1, i = 1, 2, …, m

for a training set of m data points. The distance between these two sets of points is 2/‖w‖.
Thus, the margin of separation of the two classes in this linearly separable case can be maximized by minimizing ½‖w‖2. This minimization problem can be solved by forming the Lagrangian function in the usual way. The solution then lies at the saddle point of the Lagrangian, given by minimizing with respect to w and b and maximizing with respect to the Lagrange multipliers. Conditions on w and b at the saddle point can be found by differentiating the Lagrangian and setting it to zero. These conditions on w and b allow them to be eliminated from the Lagrangian, and all that remains then is to maximize with respect to the Lagrange multipliers. This step takes the form:

maximize W(α) = Σi αi − ½ Σi Σj αi αj yi yj (xi · xj)

The αi are the Lagrange multipliers, which are constrained to be non-negative. Optimum values of w and b (i.e., those that define the hyperplane that provides the maximum margin of separation) can then be obtained by substituting the values of the αi that maximize W(α) into the conditions on w and b that hold at the saddle point. To format the training data for the SVM, the bag-of-words or the compact representation, respectively, is defined as the xi and the semantic orientation as the yi.

Data Processing

After the rated text data was collected, spell-checked, grammar-checked, and saved in a sentence pool, additional pre-processing steps were taken due to the following issues:
• Imbalance of text data distribution. In the 10,131 rated sentences/paragraphs, the distribution of Process/Product/People text is 3915:4484:1732. Keeping this ratio in the training and validation sets would leave them imbalanced.
• Shortage of data. The machine learning system should be trained on 900 vectors and validated on 600 vectors (fixed-length paragraphs), respectively. In total, 1500 (900+600) paragraphs are needed.
If each of the paragraphs is composed of 7 to 8 rated sentences, then about 10,500 to 12,000 rated sentences are required, and this amount exceeds the 10,131 rated sentences/paragraphs mentioned above. To obtain sufficient data points for this study, we must “reuse” the rated sentences to generate data points. We modified the data pre-processing step described in our previous paper [13] to produce more data for training. The difference between the old method and the new one is that we use a sentence rather than a paragraph as the textual unit. For each sentence in the sentence pool, part-of-speech tagging provides the phrase structure trees and typed dependencies from which the grammatical relationships are obtained. A noun-based clustering algorithm is then applied. The basic idea is to identify every noun in a sentence and put all verbs and modifiers (adjectives and adverbs) connected to the noun together with it. The average value of the SO-PMI of all words in a word-cluster is distributed into the corresponding categories in the 9-dimension vector. When all word-clusters in a sentence are processed, a complete 9-dimension vector is generated. The advantage of this approach is that we can generate enough data for training and validation by selecting paragraphs with the same semantic orientation and semantic meaning, separating the sentences from the paragraphs, and re-combining the sentences into synthetic paragraphs to compose the training and validation vectors. That is, one sentence may be used by more than one training or validation vector. As shown in Table 3, after applying this sentence-reuse method, we have enough data points for training and validation. If each paragraph is composed of n sentences, then for a corpus with m sentences there are, theoretically, C(m, n) = m!/(n!(m−n)!) possible paragraphs. For example, there are 1732 rated sentences about People.
If we applied the pre-processing method in [13], only about 220 paragraphs could be composed. That is insufficient for training and validation purposes. By reusing the rated sentences, we can obtain more than 2.47× as many paragraphs for the same purpose.

Table 3 Statistics about collected design text data

Category   Rated Text Data (Paragraphs/Sentences)   Proposed Training/Validation   Re-used Sentences (×)
Process    500/3915                                 300/200                        1698
Product    550/4484                                 300/200                        5035
People     220/1732                                 300/200                        2.47

Results and Discussion of Results

In the first set of results, we compared the accuracy of sentiment classification for the movie review data set and the design text using unigrams (Table 4). We conducted the test on the movie review data set to verify the accuracy of the software implementation; the results obtained are consistent with those obtained by Pang. The bag-of-words text representation achieved almost 88% accuracy in sentiment orientation for the design text. The overall average accuracy of sentiment classification achieved by our previous method for design text, using the compact representation and SO-PMI values for individual words, was 70.0% [13]. When we set the SO-PMI value for every word to 1, that is, neutral (strictly, 1 is not neutral, as there is no neutral word; but as long as all words have the same nonzero SO-PMI value, there is no difference between words in semantic orientation, so they are neutral relative to each other), the overall accuracy of sentiment classification was still 70.0%. There is no significant improvement or degradation in performance even when the correct noun modifiers are encoded in the 9-dimensional vector representation.
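The relative-neutrality argument in the parenthetical above can be made concrete: if every word carries the same nonzero SO-PMI value, every word-cluster average collapses to that value, so no word differs from any other in semantic orientation. The words and clusters below are invented examples.

```python
NEUTRAL = 1.0  # the common SO-PMI value assigned to every word

# Invented word-clusters; their content no longer matters.
clusters = [["design", "elegant"], ["prototype", "fails", "badly"]]

# Each cluster average equals NEUTRAL, regardless of cluster size.
averages = [sum(NEUTRAL for _ in cluster) / len(cluster) for cluster in clusters]
```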
Table 4 Accuracy of sentiment classification for bag-of-words and compact representation

Data Set       Bag-of-Words   Compact Representation
Movie Review   80.2%          N/A
Design Text    87.6%          70%

The accuracy of categorization was also compared. Again, the bag-of-words approach beats the compact representation. There is an improvement in semantic categorization over our prior results [13] when the SO-PMI value is not included in the representation (i.e., set to 1). Turney [12] reported that it is not clear whether SO-PMI would help (or hinder) semantic categorization. These results indicate that SO-PMI may hinder semantic categorization when training over limited feature sets.

Table 5 Accuracy of categorization for bag-of-words and compact representation

Representation           Product   Process   People
Bag-of-Words             87.05%    84.52%    89.82%
Compact Representation   79.7%     77.0%     78.3%

Table 6 reports on the accuracy of the sentiment classification when trained on the design text data set and tested on the movie review data set, using the unigram word list from the movie review data set (A) and using the unigram word list from the design text data set (B). Here, we find that the results are worse than merely guessing. What this finding confirms is that it is not possible to train the sentiment classifier on one text domain and then deploy the classifier on another domain, since the classifier is highly sensitive to the words which appear in both the training and target domains.

Table 6 Accuracy of sentiment classification when trained on design data

Test Data Set   A        B
Movie Review    33.02%   33.67%

We conjectured that the results might improve if we used a word list consisting of words that appear in both the movie review and the design text data sets, or a word list consisting of words that appear in either of the two data sets.
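The two combined word lists in this conjecture amount to plain set operations over the vocabularies; a minimal sketch, with invented toy vocabularies standing in for the actual unigram lists:

```python
# Toy stand-ins for the unigram lists of the two data sets.
movie_vocab = {"plot", "acting", "good", "bad", "boring"}
design_vocab = {"design", "prototype", "good", "bad", "risk"}

# Words appearing in both data sets (M && D) ...
m_and_d = sorted(movie_vocab & design_vocab)

# ... and words appearing in either data set (M || D).
m_or_d = sorted(movie_vocab | design_vocab)
```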
To examine the effect of the word list on the sentiment classification, we ran tests of the sentiment classifier when trained on the movie review data set using words (unigrams) which appear in both the movie and design text data sets (M && D) and using words (unigrams) which appear in the movie or design text data sets (M || D). The results are reported in Table 7. There was essentially no improvement. It would appear that there are specific patterns of co-occurrence of words in specific texts which are related to semantic orientation. This result suggests that the accuracy of the sentiment classifier is not necessarily improved merely by adding more words (evidence) into the word list, particularly if the added words do not actually appear in the training set.

Table 7 Accuracy of sentiment classification when trained on movie review

Data Set      Bag-of-Words (M && D)   Bag-of-Words (M || D)
Design Text   33.58%                  32.56%

Despite the limitation of the bag-of-words approach in training the sentiment classifier on a corpus which differs from the target set, the results point to a more intriguing hypothesis. When Claude Shannon published A Mathematical Theory of Communication [10], he showed that it was possible to model the generation of communication as a probabilistic system based on relatively simple rules about the statistical co-occurrence of letters in English words. In prior research, it has been demonstrated that latent semantic analysis exploits the statistical co-occurrence of words in discourse to model the underlying knowledge representation of the communicator and that meaning emerges from the statistical co-occurrence of semantics [1]. An application of lexical chain analysis showed that the statistical co-occurrence of semantic links in discourse reveals the way that ideas are connected by language and that concept formation is driven by the accumulation of knowledge represented as lexicalized concepts [2].
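The kind of document-level co-occurrence statistic alluded to here can be counted directly; a minimal sketch over an invented toy corpus (not the method of [1] or [2], only the raw statistic such approaches build on):

```python
from collections import Counter
from itertools import combinations

def cooccurrences(docs):
    """Count how often each pair of unigrams appears in the same document."""
    counts = Counter()
    for doc in docs:
        # Sort so each unordered pair is keyed consistently.
        for a, b in combinations(sorted(set(doc.lower().split())), 2):
            counts[(a, b)] += 1
    return counts

docs = ["good elegant design", "bad design", "good design choice"]
pairs = cooccurrences(docs)
```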
Finally, the success of the bag-of-words approach for sentiment analysis shows that a relevant set of semantic features is sufficient to recognize the semantic orientation. This new result adds evidence that it might be possible to generate a computationally-derived view of the language of design [3] purely on statistical patterns of cooccurrence of linguistic features rather than based on a rigorous linguistic model. Conclusions This paper presented empirical studies of the calculation of semantic orientation (sentiment analysis) of design texts using the same supervised machine learning algorithm on two different representations of the text, one which is knowledge lean (bag-of-words) and one which requires more linguistic knowledge. In all cases, the knowledge lean representation outperformed the other representation. The compact representation performed better than guessing, but it still trails the bag-of-words method significantly. The effect of setting the SO-PMI value to 1 as done in this research is equivalent to reducing the number of dimensions of the representation. The dimensional reduction to 9 dimensions suggests that lower dimensional representations might work as well as higher dimensional ones, though this claim needs further elaboration. From these results and results from other researchers, the semantic orientation of documents appears to be largely dependent upon the semantic dependencies of words within a corpus, at least from a statistical natural language processing point of view. Embedding any additional knowledge about the semantic orientation of individual words in a clause does not appear to improve or degrade the performance of the sentiment classification. This result parallels findings in information retrieval which has shown that a purely statistical natural language processing approach such as latent semantic indexing (LSI) generally outperforms more knowledge-rich approaches. 
A sentiment classifier based on the bag-of-words approach, however, is very sensitive to the word list used in the training set and to the co-occurrence patterns. This sensitivity is not found in LSI. This is a rather pessimistic finding in terms of developing a general-purpose semantic orientation classifier, given that it is not readily obvious how to use existing tagged corpora to train a sentiment classifier and then to deploy the trained classifier onto another corpus. Further investigation could alter the words (unigrams and bigrams) used in the training vector. Data from the British National Corpus on the most commonly used words might be an attractive source for this word list. It would be interesting to use an evolutionary optimization algorithm to select the set of “general” words from this word list which results in the best sentiment classification over multiple corpora rather than a single corpus. This might allow the trained classifier to be deployed against various corpora. The main conclusion from this study is that if one is interested in developing a sentiment classifier, then one must devote considerable time and cost to training the classifier on the target text. The expense of producing the tagged training data should not be underestimated: the cost of hiring the research assistants to produce the training data for the design text exceeded tens of thousands of Australian dollars. The training set that has now been produced may be valuable for further research in sentiment analysis of design text. For the time being, the bag-of-words approach coupled with support vector machines appears to be the optimal approach for sentiment classification.

Acknowledgements

This research was supported under the Australian Research Council’s Discovery Projects funding scheme (project number DP0557346). Xiong Wang is supported by an Australian Postgraduate Award scholarship.

References
1. Dong A (2005) The latent semantic approach to studying design team communication. Design Studies 26(5): 445-461
2. Dong A (2006) Concept formation as knowledge accumulation: a computational linguistics study. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 20(1): 35-53
3. Dong A (2007) The enactment of design through language. Design Studies 28(1): 5-21
4. Dong A, Kleinsmann M, Valkenburg R (2007) Affect-in-cognition through the language of appraisals. In McDonnell J, Lloyd P (eds), Design Thinking Research Symposium 7, Central Saint Martins College of Art and Design, University of the Arts London, London, UK: 69-80
5. Kleinsmann M, Dong A (2007) Investigating the affective force on creating shared understanding. In 19th International Conference on Design Theory and Methodology (DTM), ASME, New York, DETC2007-34240
6. Leech G, Rayson P, Wilson A (2001) Word frequencies in written and spoken English based on the British National Corpus. Pearson Education Limited, Harlow, UK
7. Martin JR, White PRR (2005) The language of evaluation: appraisal in English. Palgrave Macmillan, New York
8. Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using machine learning techniques. In 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, University of Pennsylvania, Philadelphia, PA: 79-86
9. Shaikh MAM, Prendinger H, Mitsuru I (2007) Assessing sentiment of text by semantic dependency and contextual valence analysis. In Paiva A et al. (eds), Affective Computing and Intelligent Interaction, Springer-Verlag, Berlin Heidelberg: 191-202
10. Shannon CE (1948) A mathematical theory of communication. The Bell System Technical Journal 27: 379-423 and 623-656
11. Toutanova K, Manning CD (2000) Enriching the knowledge sources used in a maximum entropy part-of-speech tagger.
In Schütze H, Su K-Y (eds), Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, Association for Computational Linguistics, Morristown, NJ: 63-70
12. Turney PD, Littman ML (2003) Measuring praise and criticism: inference of semantic orientation from association. ACM Transactions on Information Systems 21(4): 315-346
13. Wang J, Dong A (2007) How am I doing 2: computing the language of appraisal in design. In Design for Society: Knowledge, Innovation and Sustainability, 16th International Conference on Engineering Design (ICED'07), Ecole Centrale Paris, Paris, France, ICED'07/124
14. Whitelaw C, Garg N, Argamon S (2005) Using appraisal groups for sentiment analysis. In CIKM '05, Proceedings of the 14th ACM International Conference on Information and Knowledge Management, ACM, New York: 625-631
15. Wiebe JM, Bruce RF, O'Hara TP (1999) Development and use of a gold-standard data set for subjectivity classifications. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Morristown, NJ: 246-253