Special Topics in Text Mining
Manuel Montes y Gómez
http://ccc.inaoep.mx/~mmontesg/  mmontesg@inaoep.mx
University of Alabama at Birmingham, Spring 2011

Non-thematic text classification

Agenda
• Authorship attribution
  – Description of the task
  – Applications and related tasks
  – Features and methods
• Sentiment classification
  – Description of the task
  – Applications
  – Features and methods
  – Cross-domain sentiment classification

Special Topics on Information Retrieval

Authorship attribution
Stamatatos, E. 2009. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, 60(3): 538–556.

AA as a classification problem
• In the typical authorship attribution (AA) problem, a text of unknown authorship is assigned to one of a set of candidate authors, for each of whom text samples of undisputed authorship are available.
• From a machine-learning point of view, this is a multi-class, single-label text categorization task.
Possible applications of AA? Other related tasks?

At the heart of AA
• Information retrieval methods for representing and classifying large volumes of text.
• Machine learning algorithms that can handle high-dimensional, sparse data, allowing more expressive representations.
• NLP techniques that analyze text efficiently, providing new kinds of measures for representing style (e.g., syntax-based features).
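These ingredients can be combined into a minimal end-to-end sketch of AA as multi-class categorization: each text is represented by its character 3-gram counts (a feature type discussed later in these slides), one cumulative profile is built per author, and an unknown text is attributed to the author with the most similar profile under cosine similarity. The author names and texts below are toy data for illustration only.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Slide a window of n characters over the text (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def profile(texts, n=3):
    """Cumulative character n-gram counts over all of an author's texts."""
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t.lower(), n))
    return counts

def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(p[g] * q[g] for g in p if g in q)
    norm = math.sqrt(sum(v * v for v in p.values())) \
         * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def attribute(unknown, profiles, n=3):
    """Assign the unknown text to the author with the most similar profile."""
    target = profile([unknown], n)
    return max(profiles, key=lambda a: cosine(profiles[a], target))

# Toy training samples for two hypothetical authors
profiles = {
    "author_a": profile(["the cat sat on the mat", "the cat ran to the barn"]),
    "author_b": profile(["stocks rallied as markets opened",
                         "markets fell on weak earnings"]),
}
print(attribute("the cat slept on the mat", profiles))  # prints "author_a"
```

Note that, on this toy data, the two authors differ in topic as much as in style; the slides that follow discuss which features actually capture style rather than content.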
Application areas of AA
• Intelligence
  – Attribution of messages or proclamations to known terrorists
  – Linking different messages by authorship
• Criminal law
  – Identifying writers of harassing messages
  – Verifying the authenticity of suicide notes
• Civil law
  – Copyright disputes
• Computer forensics
  – Identifying the authors of the source code of malicious software
• Literary research
  – Attributing anonymous or disputed literary works to known authors

Related tasks
• Author verification
  – Deciding whether or not a given text was written by a certain author
• Plagiarism detection
  – Finding similarities between two texts
• Author profiling or characterization
  – Extracting information about the age, education, sex, etc. of the author of a given text
• Detection of stylistic inconsistencies
  – Analyzing collaborative writing
  – Detecting plagiarism within a single document (intrinsic plagiarism detection)

Features and methods
• As mentioned, the main idea behind AA is that by measuring certain textual features we can distinguish between texts written by different authors.
• We therefore need features that quantify the writing style of authors, and methods able to learn from that kind of features.
How to address the AA problem? What features could be used?

Lexical features (1)
• Several different lexical features have been used for AA:
  – Simple measures such as sentence-length and word-length counts
    • Can be applied to any language and any corpus
    • For some languages (Chinese, German, etc.) word segmentation is not trivial
  – Vocabulary richness and the number of hapax legomena (i.e., words occurring only once)
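The richness measures just listed can be computed directly from token counts. A minimal sketch (whitespace tokenization is an illustrative simplification; real systems would use a proper tokenizer):

```python
from collections import Counter

def lexical_richness(text):
    """Type-token ratio and hapax legomena count for a text.

    Caveat from the slides: both measures depend heavily on text length,
    so they are only comparable across texts of similar size.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    hapax = [w for w, c in counts.items() if c == 1]
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "type_token_ratio": len(counts) / len(tokens),
        "hapax_legomena": len(hapax),
    }

stats = lexical_richness("to be or not to be that is the question")
print(stats)  # 10 tokens, 8 types, 6 hapax legomena
```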
    • Vocabulary size depends heavily on text length

Lexical features (2)
  – Traditional bag-of-words text representation
    • Good for topic classification, but does not necessarily capture the writing style of authors
  – Function words
    • Used in a largely unconscious manner by authors, and topic-independent
  – Subset of the most frequent words
    • Similar problems to bag-of-words
  – Word n-grams
    • Not always better than individual word features
    • Dimensionality increases considerably

Character features
• In this family of measures, a text is viewed as a mere sequence of characters.
• Various character-level measures: counts of alphabetic characters, digits, uppercase and lowercase characters, and punctuation marks; letter frequencies; etc.
• Frequencies of character n-grams, which capture:
  – Lexical information (e.g., |_in_|, |text|)
  – Contextual information (e.g., |in_t|)
  – Use of punctuation and capitalization
  – Commonly used suffixes (e.g., |ful_|, |ing_|)

Character features – some issues
• Extracting n-grams is language-independent and requires no special tools.
• Dimensionality is considerably higher than in the word-based representation.
• How long should the n-grams be?
  – A large n better captures lexical and contextual information, but it also captures thematic information and increases dimensionality.
  – A small n can represent sub-word (syllable-like) information, but is not adequate for representing contextual information.

Syntactic features (1)
• The idea is that authors tend to use similar syntactic patterns unconsciously.
• Syntactic information is considered a more reliable authorial fingerprint than lexical information.
• Disadvantages:
  – Robust and accurate NLP tools are required to perform syntactic analysis of texts
  – Language-dependent procedure

Syntactic features (2)
• POS-tag frequencies or POS-tag n-gram frequencies
  – A_DT few_JJ examples_NNS of_PREP heterologous_JJ expression_NN
• Noun-phrase counts, verb-phrase counts, lengths of noun phrases and verb phrases, etc.
  – NP[Another attempt] VP[to exploit] NP[syntactic information] VP[was proposed] PP[by Stamatatos, et al. (2000)].
• Rewrite-rule frequencies from the output of a syntactic parser
  – PP → PREP + NP

Semantic features
• The more detailed the text analysis required for extracting features, the less accurate the produced measures.
  – Hence few attempts to exploit high-level features
• Examples of the usage of semantic information:
  – Semantic relations (from dependency trees)
  – Synonyms and hypernyms of words (WordNet)
  – Semantic similarity between words, detected by means of LSI

Domain-specific features
• In some applications it is possible to use structural measures to quantify authorial style.
• Some examples:
  – Use of greetings and farewells in messages
  – Types of signatures
  – Use of indentation
  – Paragraph length
  – Font color and font size counts

Authorship attribution methods
• Instance-based approaches
  – Each training text is individually represented as a separate instance of authorial style.
  – Use vector-space representations and supervised learning algorithms, as in traditional text classification.
• Profile-based approaches
  – Concatenate all the available training texts per author into one big file and extract a cumulative representation of that author's style (a profile) from this concatenated text.
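A minimal sketch of the profile-based idea, using a compression model as the cumulative representation: the author profile is simply the concatenated training text, and an unseen text x is scored by how few extra compressed bytes it adds, d(x, xa) = C(xa + x) − C(xa). The toy corpus and the choice of zlib (rather than the compressors evaluated in the literature) are illustrative assumptions.

```python
import zlib

def C(text):
    """Compressed size in bytes (zlib here; other compressors also work)."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def compression_distance(x, author_texts):
    """d(x, xa) = C(xa + x) - C(xa): the extra bytes needed to encode x
    once the compressor has already seen the author's concatenated texts."""
    xa = " ".join(author_texts)
    return C(xa + " " + x) - C(xa)

# Toy corpora for two hypothetical authors
corpus = {
    "author_a": ["the cat sat on the mat"] * 5,
    "author_b": ["quarterly revenue exceeded forecasts"] * 5,
}
x = "the cat sat on a mat"
best = min(corpus, key=lambda a: compression_distance(x, corpus[a]))
print(best)  # the smaller the distance, the more similar the style
```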
What is better? Advantages and disadvantages?

Profile-based approaches (1)
• Training just comprises the extraction of profiles for the candidate authors.
• Attribution is based on the distance between the profile of an unseen text and the profile of each author.
• Can be realized using probabilistic and compression models.

Profile-based approaches (2)
• Probabilistic models: attempt to maximize the probability P(x|a) that a text x belongs to an author a.
  – Can be applied to both character and word sequences
• Compression models: the difference in bit-wise size of the compressed files, d(x, xa) = C(xa + x) − C(xa), indicates the similarity of text x to author a.
  – Several compression algorithms have been tested, including RAR, LZW, GZIP, BZIP2, and 7ZIP.

Comparison table
[table not reproduced]

Additional issues
• The number of candidate authors
  – Increasing the number of authors leads to a significant decrease in performance.
  – Character n-grams outperform other feature types; providing a more heterogeneous set of features improves the results significantly.
• The size of the training set
  – AA can give reasonable results even when only limited data is available.
  – Character n-grams are more robust to the effect of data size than syntactic or word-based features.

Sentiment analysis
Bo Pang and Lillian Lee. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, Vol. 2, No. 1–2 (2008): 1–135.

New classification alternatives
• Growing interest in non-topical text analysis
  – Analysis of the opinions, feelings, and attitudes expressed in a text, rather than just the facts.
• Web resources such as discussion forums, review sites, and blogs are a great source of information:
  – Many people guide their own decisions by the opinions that other consumers have publicly expressed.
  – Analysts from government, commercial, and political domains need tools to automatically track attitudes and feelings in the news and in online forums.
Applications?

Applications (1)
• As a sub-component technology
  – Recommendation systems: penalize items that receive a lot of negative feedback
  – Information extraction: discard information found in subjective sentences
  – Question answering: handle opinion-oriented questions
  – Summarization: consider multiple viewpoints
  – Citation analysis: determine whether an author cites a piece of work as supporting evidence or as research that he or she dismisses

Applications (2)
• In business and government intelligence
  – Product quality: classify products based on their reviews, e.g., for future recommendation or to stop production
  – Product analysis: identify the product features that customers have expressed opinions on, e.g., to change the design or to use in publicity
  – Analysis of political debates: find speeches that represent support of or opposition to a given proposal
  – Reputation analysis: identify bad and good opinions about public personalities (e.g., politicians)

Two main tasks
• Subjectivity classification/detection
  – Distinguish sentences used to present opinions and other forms of subjectivity from sentences used to objectively present factual information.
• Sentiment classification
  – Classify an opinion as falling under one of two opposing sentiment polarities (positive or negative), or locate its position on the continuum between these two polarities.
How to carry out these tasks? Which features could be useful?
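As a preview of the supervised route described on the following slides, here is a minimal sketch of polarity classification using Boolean (word-presence) bag-of-words features and a Naive Bayes classifier with Laplace smoothing. The toy reviews and labels are invented for illustration.

```python
import math
from collections import defaultdict

def train_nb(docs):
    """Naive Bayes over Boolean word-presence features."""
    class_docs = defaultdict(list)
    for text, label in docs:
        class_docs[label].append(set(text.lower().split()))
    vocab = set().union(*(s for sets in class_docs.values() for s in sets))
    model = {}
    for label, sets in class_docs.items():
        prior = math.log(len(sets) / len(docs))
        # Laplace-smoothed log-probability that each word appears in the class
        word_lp = {w: math.log((1 + sum(w in s for s in sets)) / (2 + len(sets)))
                   for w in vocab}
        model[label] = (prior, word_lp)
    return model, vocab

def classify(model, vocab, text):
    """Score only words seen in training; presence is Boolean, not counted."""
    words = set(text.lower().split()) & vocab
    scores = {label: prior + sum(word_lp[w] for w in words)
              for label, (prior, word_lp) in model.items()}
    return max(scores, key=scores.get)

docs = [
    ("great acting and a great plot", "pos"),
    ("wonderful film loved it", "pos"),
    ("terrible plot and awful acting", "neg"),
    ("boring film hated it", "neg"),
]
model, vocab = train_nb(docs)
print(classify(model, vocab, "loved the great plot"))  # prints "pos"
```

Note that repeated words ("great" twice in the first review) contribute only once, matching the observation that word presence, not frequency, is what matters for sentiment.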
Main features for sentiment analysis (1)
• Bag of words
  – Better results using Boolean weights than tf-idf
    • Word presence is enough, since sentiment is not usually highlighted through repeated use of the same terms.
• Lexical features beyond single words
  – The position of a token within a textual unit can have important effects on how much that token affects the overall sentiment or subjectivity status of the enclosing unit.
  – Word n-grams; their usefulness appears to be a matter of some debate.

Main features for sentiment analysis (2)
• Part-of-speech (POS) tags
  – The idea is to capture the presence (or polarity) of certain adjectives and adverbs.
  – Other parts of speech also contribute to expressing sentiment (nouns: gem; verbs: love).
• Syntactic features
  – Collocations and syntactic patterns have been found useful for subjectivity detection, e.g.:
    • <subj> was satisfied; to condemn <dobj>

Supervised sentiment classification
• Uses labeled document sets
• Considers all the described features
  – Best results using lexical features
  – Robust results with binary weights
• Applies standard text-categorization algorithms
  – Best reported results using SVM and Naïve Bayes
How to do the classification without a training set?

Unsupervised sentiment classification
• Idea: it is not hard to identify sentiment words and their orientation.
• The algorithm of Turney (2002):
  1. Select phrases containing adjectives or adverbs
  2. Extract pairs of words (e.g., ADJ NOUN or NOUN NOUN patterns)
  3. Estimate the semantic orientation of the extracted phrases using the PMI-IR algorithm (against seed words, e.g., awful and excellent)
  4. Calculate the average semantic orientation of the phrases in the given review, and classify the review as recommended if the average is positive and as not recommended otherwise.

Special Topics on Information Retrieval
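Steps 3 and 4 above can be sketched as follows. PMI-IR originally estimated its statistics from search-engine hit counts (queries like phrase NEAR "excellent"); the sketch below substitutes invented toy counts, and the example phrases stand in for the word pairs extracted in step 2. Note that the difference of two PMI values reduces to a simple ratio of hit counts, since the phrase's own marginal count cancels.

```python
import math

# Toy hit counts standing in for Turney's search-engine statistics.
# All numbers below are invented for illustration.
HITS = {
    "excellent": 500,
    "awful": 500,
    ("direct deposit", "excellent"): 80,
    ("direct deposit", "awful"): 10,
    ("virtual monopoly", "excellent"): 5,
    ("virtual monopoly", "awful"): 60,
}

def semantic_orientation(phrase):
    """SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'awful'),
    which algebraically reduces to a log-ratio of co-occurrence counts."""
    return math.log2(
        (HITS[(phrase, "excellent")] * HITS["awful"])
        / (HITS[(phrase, "awful")] * HITS["excellent"])
    )

def classify_review(phrases):
    """Average the semantic orientation of the review's extracted phrases."""
    avg = sum(semantic_orientation(p) for p in phrases) / len(phrases)
    return "recommended" if avg > 0 else "not recommended"

print(semantic_orientation("direct deposit"))  # log2(80/10) = 3.0
print(classify_review(["direct deposit", "virtual monopoly"]))
# the strongly negative phrase outweighs the positive one: "not recommended"
```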