Special Topics on Information Retrieval

Special Topics in
Text Mining
Manuel Montes y Gómez
http://ccc.inaoep.mx/~mmontesg/
mmontesg@inaoep.mx
University of Alabama at Birmingham, Spring 2011
Non-thematic text classification
Agenda
• Authorship attribution
– Description of the task
– Applications and related tasks
– Features and methods
• Sentiment classification
– Description of the task
– Applications
– Features and methods
– Cross-domain sentiment classification
Authorship attribution
Stamatatos, E. 2009. A Survey of Modern Authorship Attribution Methods. Journal of the
American Society for Information Science and Technology, 60(3): 538-556.
AA as a classification problem
• In the typical authorship attribution problem, a
text of unknown authorship is assigned to one
candidate author, given a set of candidate authors
for whom text samples of undisputed authorship
are available.
• From a machine learning point-of-view, this can be
viewed as a multi-class single-label text
categorization task.
Possible applications of AA?
Other related tasks?
At the heart of AA
• Information retrieval methods for representing and
classifying large volumes of text.
• Machine learning algorithms that handle
multidimensional and sparse data, allowing more
expressive representations.
• NLP techniques that analyze text efficiently and
provide new kinds of measures for representing
style (e.g., syntax-based features).
Application areas of AA
• Intelligence
– Attribution of messages or proclamations to known terrorists
– Linking different messages by authorship
• Criminal law
– Identifying writers of harassing messages
– Verifying the authenticity of suicide notes
• Civil law
– Copyright disputes
• Computer forensics
– Identifying the authors of source code of malicious software
• Literary research
– Attributing anonymous or disputed literary works to known
authors
Related tasks
• Author verification
– Deciding whether a given text was written by a certain
author or not
• Plagiarism detection
– Finding similarities between two texts
• Author profiling or characterization
– Extracting information about the age, education, sex,
etc. of the author of a given text
• Detection of stylistic inconsistencies
– Analyzing collaborative writing
– Detecting plagiarism (intrinsic plagiarism detection)
Features and methods
• As mentioned, the main idea behind AA is that
by measuring some textual features we can
distinguish between texts written by different
authors.
• It is important to have features that quantify
the writing style of authors, and to apply
methods able to learn from such features.
How to address the AA problem?
What features could be used?
Lexical features (1)
• Several different lexical features have been
used in the task of AA:
– Simple measures such as sentence length counts
and word length counts
• Can be applied to any language and any corpus
• For certain languages, word segmentation is not
trivial (e.g., Chinese, German)
– Vocabulary richness and the number of hapax
legomena (i.e., words occurring once).
• Vocabulary size heavily depends on text-length
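The simple lexical measures above can be sketched in a few lines. This is a minimal illustration, assuming naive regular-expression segmentation for English; the function name lexical_profile and the sample text are hypothetical, and the type-token ratio is used here as one common stand-in for vocabulary richness.

```python
import re
from collections import Counter

def lexical_profile(text):
    """Return simple length and vocabulary-richness measures for a text."""
    # Naive segmentation (an assumption): sentences end at ., ! or ?;
    # words are runs of letters and apostrophes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    hapax = [w for w, c in counts.items() if c == 1]  # words occurring once
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(counts) / len(words),
        "hapax_legomena": len(hapax),
    }

sample = "The cat sat. The dog sat too. A bird sang!"
profile = lexical_profile(sample)
print(profile)
```

Because vocabulary size depends on text length, measures such as the type-token ratio are only comparable between texts of similar length.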
Lexical features (2)
– Traditional bag-of-words text representation
• Good for topic classification, but does not necessarily
capture the writing style of authors.
– Function words
• Used largely unconsciously by authors, and
topic-independent
– Subset of the most frequent words
• Similar problems to bag-of-words
– Word n-grams
• Not always better than individual word features
• Dimensionality increases considerably
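A function-word representation can be sketched as a vector of relative frequencies. The word list below is a tiny illustrative assumption, not a list from the cited survey; published studies typically use much longer lists. The name function_word_vector is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative list (an assumption); real lists are much longer.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "that", "but"]

def function_word_vector(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each function word, in vocabulary order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in vocabulary]

vec = function_word_vector("The style of the author, and not the topic, is what matters.")
print(vec)
```

Content words such as "style" or "topic" are ignored on purpose, which is what makes the representation largely topic-independent.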
Character features
• According to this family of measures, a text is
viewed as a mere sequence of characters.
• Various character-level measures:
– alphabetic characters count, digit characters count,
uppercase and lowercase characters count, letter
frequencies, punctuation marks count, etc.
• Frequencies of character n-grams
– Lexical information (e.g., |_in_|, |text|)
– Contextual information (e.g., |in_t|)
– Use of punctuation and capitalization
– Commonly used suffixes (e.g., |ful_|, |ing_|)
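Character n-gram extraction needs no linguistic tools, as this minimal sketch suggests. It assumes whitespace is collapsed and shown as '_' (the notation used in the examples above) and keeps case and punctuation untouched; the name char_ngrams is hypothetical.

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Count overlapping character n-grams; whitespace is written as '_'."""
    normalized = "_".join(text.split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))

grams = char_ngrams("in text mining", n=4)
print(grams.most_common(5))
```

Here the 4-gram "in_t" spans a word boundary (contextual information), while "text" covers a whole word (lexical information).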
Character features – some issues
• Extracting n-grams is language-independent and
requires no special tools.
• Dimensionality is considerably increased in
comparison to the word-based representation.
• How long should the n-grams be?
– A large n better captures lexical and contextual
information, but it also captures thematic
information and increases dimensionality.
– A small n can represent sub-word (syllable-like)
information, but it is not adequate for
representing contextual information.
Syntactic features (1)
• The idea is that authors tend to use similar
syntactic patterns unconsciously.
• Syntactic information is considered a more
reliable authorial fingerprint than lexical
information.
• Disadvantages:
– Robust and accurate NLP tools are required to
perform syntactic analysis of texts
– Language-dependent procedure
Syntactic features (2)
• POS tag frequencies or POS tag n-gram
frequencies
– A_DD few_JJ examples_NNS of_PREP heterologous_JJ
expression_NN
• Noun phrase counts, verb phrase counts, length
of noun phrases, length of verb phrases, etc.
– NP[Another attempt] VP[to exploit] NP[syntactic
information] VP[was proposed] PP[by Stamatatos, et al.
(2000)].
• Rewrite rule frequencies from the output of a
syntactic parser
– PP → PREP + NP
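POS tag n-gram frequencies can be counted directly from tagger output. The sketch below assumes the text is already tagged in the word_TAG format of the example above (the tagger itself is out of scope), and the name pos_ngrams is hypothetical.

```python
from collections import Counter

def pos_ngrams(tagged_sentence, n=2):
    """Count POS-tag n-grams from a 'word_TAG word_TAG ...' string."""
    # Split each token on its last underscore to separate word and tag.
    tags = [token.rsplit("_", 1)[1] for token in tagged_sentence.split()]
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

tagged = "A_DD few_JJ examples_NNS of_PREP heterologous_JJ expression_NN"
unigrams = pos_ngrams(tagged, n=1)
bigrams = pos_ngrams(tagged, n=2)
print(unigrams)
print(bigrams)
```

Only the tags are kept, so the resulting features describe syntactic habits rather than vocabulary.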
Semantic features
• The more detailed the text analysis required for
extracting features, the less accurate the
produced measures.
– Few attempts to exploit high-level features
• Examples of the usage of semantic information:
– Use semantic relations (from dependency trees)
– Use synonyms and hypernyms of words (WordNet)
– Detect semantic similarity between words by
means of LSI
Domain-specific features
• In some applications it is possible to use some
structural measures to quantify the authorial
style.
• Some examples are:
– Use of greetings and farewells in the messages
– Types of signatures
– Use of indentation
– Paragraph length
– Font color counts and font size counts
Authorship attribution methods
• Instance-based approaches
– Each training text is individually represented as a separate
instance of authorial style.
– Use vector space representations and apply supervised
learning algorithms, as in traditional text classification.
• Profile-based approaches
– Concatenate all the available training texts per author in
one big file and extract a cumulative representation of that
author’s style (profile) from this concatenated text.
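The difference between the two approaches lies in what gets represented: one vector per training text, or one vector per author. The sketch below is only an illustration under simplifying assumptions: bag-of-words term frequencies, cosine similarity, and a nearest-neighbour decision standing in for a real supervised learner. All function names and the toy texts are hypothetical.

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Term-frequency bag of words stored as a Counter."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse Counter vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def instance_based(training, unknown):
    """One vector per training text; return the author of the nearest text."""
    x = vectorize(unknown)
    return max(training, key=lambda pair: cosine(vectorize(pair[1]), x))[0]

def profile_based(training, unknown):
    """One vector per author, built from that author's concatenated texts."""
    per_author = defaultdict(list)
    for author, text in training:
        per_author[author].append(text)
    profiles = {a: vectorize(" ".join(texts)) for a, texts in per_author.items()}
    x = vectorize(unknown)
    return max(profiles, key=lambda a: cosine(profiles[a], x))

# Toy data, invented for illustration only.
training = [
    ("A", "upon the whole it is certain"),
    ("A", "upon this ground it is evident"),
    ("B", "while the states are there"),
    ("B", "whilst there is no doubt"),
]
unknown = "it is evident upon reflection"
print(instance_based(training, unknown), profile_based(training, unknown))
```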
What is better?
Advantages and disadvantages?
Profile-based approaches (1)
• Training just comprises the extraction of profiles
for the candidate authors.
• Attribution is based on the distance between the profile
of an unseen text and the profile of each author.
• It can be realized by using probabilistic and
compression models
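A probabilistic profile can be sketched as a character-level language model per author. The slides do not fix a particular model, so this example assumes a character bigram model with add-one smoothing; the function names and the toy texts are hypothetical.

```python
import math
from collections import Counter

def train_profile(text):
    """Profile = character-bigram counts plus counts of each left context."""
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return bigrams, contexts

def log_prob(text, profile, alphabet_size):
    """log P(text | author) under an add-one-smoothed character bigram model."""
    bigrams, contexts = profile
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + alphabet_size))
        for a, b in zip(text, text[1:])
    )

def attribute(unknown, author_texts):
    """Return the author whose profile gives the unseen text the highest probability."""
    alphabet = set(unknown).union(*author_texts.values())
    profiles = {a: train_profile(t) for a, t in author_texts.items()}
    return max(profiles, key=lambda a: log_prob(unknown, profiles[a], len(alphabet)))

# Toy concatenated training texts per author, invented for illustration.
author_texts = {
    "A": "whilst we wait, whilst we watch, whilst we wonder",
    "B": "while you rest, while you read, while you roam",
}
print(attribute("whilst we walk", author_texts))
```

Training here is just counting, which matches the point that a profile is extracted once per author rather than learned per document.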
Profile-based approaches (2)
• Probabilistic models: attempt to maximize the
probability P(x|a) for a text x to belong to an
author a.
– Can be applied to both character and word sequences
• Compression models: the difference in bit-wise size
of the compressed files, d(x, x_a) = C(x_a + x) - C(x_a),
indicates the similarity of text x with author a
(x_a: concatenated training texts of author a;
C(.): compressed size).
– Several compression algorithms have been tested
including RAR, LZW, GZIP, BZIP2, 7ZIP.
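The compression-based distance can be tried with any off-the-shelf compressor. This sketch assumes Python's zlib (DEFLATE, the algorithm behind GZIP) as the compressor C; the function names and the two toy author texts are hypothetical.

```python
import zlib

def compressed_size(text):
    """C(.): size in bits of the zlib-compressed UTF-8 text."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def compression_distance(x, author_text):
    """d(x, x_a) = C(x_a + x) - C(x_a); smaller means more similar."""
    return compressed_size(author_text + x) - compressed_size(author_text)

def attribute_by_compression(x, author_texts):
    """Pick the author whose text makes x cheapest to compress."""
    return min(author_texts, key=lambda a: compression_distance(x, author_texts[a]))

# Toy concatenated training texts, invented for illustration.
author_texts = {
    "A": "The committee shall convene whenever the chairman deems it necessary and proper to do so. ",
    "B": "Honestly, I just grabbed my coat, ran outside, and never looked back at that awful party! ",
}
x = "whenever the chairman deems it necessary"
print(attribute_by_compression(x, author_texts))
```

The unseen text compresses cheaply when the compressor can reuse strings already seen in the author's text, so a small d(x, x_a) suggests stylistic overlap. On texts this short the effect is only visible because x repeats a passage verbatim; realistic use needs far more data per author.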
Comparison table
Additional issues
• The number of candidate authors
– Increasing the number of authors leads to a significant
decrease in performance
– Character n-grams outperform other feature types;
providing a more heterogeneous set of features
improves the results significantly
• The size of the training set
– AA can lead to reasonable results even when only
limited data is available
– Character n-grams show more robustness to the effect
of data size than syntactic or word-based features
Sentiment analysis
Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in
Information Retrieval, Vol. 2, No 1-2 (2008) 1–135.
New classification alternatives
• Growing interest in non-topical text analysis
– Analysis of the opinions, feelings, and attitudes
expressed in a text, rather than just the facts.
• Web resources such as discussion forums, review
sites and blogs are a great source of information:
– Many people guide their own decisions by the
opinions that other consumers have publicly
expressed.
– Analysts from government, commercial, and political
domains require tools to automatically track attitudes
and feelings in the news and on-line forums
Applications?
Applications (1)
• As a sub-component technology
– Recommendation systems: penalize items that
receive a lot of negative feedback
– Information extraction: discard information found in
subjective sentences
– Question answering: handle opinion-oriented
questions
– Summarization: consider multiple viewpoints
– Citation analysis: determine whether an author is
citing a piece of work as supporting evidence or as
research that he or she dismisses.
Applications (2)
• In business and government intelligence
– Product quality: classify products based on their
reviews → for future recommendation, to
stop production, etc.
– Product analysis: identify product features that
customers have expressed opinions on → to
change design, to use in publicity, etc.
– Analysis of political debates: find speeches that
represent support of or opposition to a given proposal
– Reputation analysis: identify good and bad opinions
about public figures (e.g., politicians).
Two main tasks
• Subjectivity classification/detection
– Distinguish sentences used to present opinions
and other forms of subjectivity from sentences
used to objectively present factual information.
• Sentiment classification
– Classify the opinion as falling under one of two
opposing sentiment polarities {positive and
negative}, or locate its position on the continuum
between these two polarities.
How to carry out these tasks?
Which features could be useful?
Main features for sentiment analysis (1)
• Bag of words
– Boolean weights give better results than tf-idf
• Word presence is enough since sentiment is not usually
highlighted through repeated use of the same terms.
• Lexical features beyond single words
– The position of a token within a textual unit can
potentially have important effects on how much
that token affects the overall sentiment or
subjectivity status of the enclosing textual unit.
– Word n-grams; their usefulness appears to be a
matter of some debate.
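The difference between Boolean and frequency-based weights is easy to see on a toy review. This sketch compares raw term frequency with Boolean presence over a small, hypothetical vocabulary; the name bow_vectors is also hypothetical.

```python
from collections import Counter

def bow_vectors(text, vocabulary):
    """Return (term-frequency vector, Boolean presence vector) for a text."""
    counts = Counter(text.lower().split())
    tf = [counts[w] for w in vocabulary]
    boolean = [1 if counts[w] else 0 for w in vocabulary]
    return tf, boolean

vocab = ["great", "boring", "plot", "acting"]
tf, boolean = bow_vectors("great great great acting but boring plot", vocab)
print(tf)
print(boolean)
```

With Boolean weights, repeating "great" three times adds nothing, in line with the observation that presence matters more than repetition for sentiment.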
Main features for sentiment analysis (2)
• Part of speech (POS) tags
– The idea is to capture the presence (or polarity) of
(certain) adjectives and adverbs.
– Other parts of speech also contribute to expressing
sentiment (nouns: gem; verbs: love).
• Syntactic features
– Collocations and syntactic patterns have been
found to be useful for subjectivity detection.
Patterns such as:
• <subj> was satisfied; to condemn <dobj>
Supervised sentiment classification
• Uses labeled document sets
• Considers all the features described
– Best results using lexical features
– Robust results with binary weights
• Applies standard text-categorization
algorithms
– Best reported results using SVM and Naïve Bayes
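A supervised sentiment classifier with binary weights can be sketched without any library. This example assumes a Naive Bayes model over word-presence features with add-one smoothing; it is an illustration, not the setup of any particular paper, and the function names and four training reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (label, text). Each word counts once per document."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        present = set(text.lower().split())  # Boolean presence, not frequency
        doc_counts[label] += 1
        word_counts[label].update(present)
        vocab |= present
    return doc_counts, word_counts, vocab

def classify_nb(text, model):
    """Return the label with the highest smoothed log-probability."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    present = set(text.lower().split()) & vocab  # ignore unseen words
    scores = {}
    for label in doc_counts:
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for w in present:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy labeled reviews, invented for illustration.
model = train_nb([
    ("pos", "a wonderful and moving film"),
    ("pos", "wonderful acting and a great story"),
    ("neg", "a dull and boring film"),
    ("neg", "boring story and awful acting"),
])
print(classify_nb("a great and wonderful story", model))
```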
How to do the classification without a training set?
Unsupervised sentiment classification
• Idea: it is not hard to identify the sentiment
words and their orientation.
• The algorithm in one paper (Turney, 2002) is:
1. Select phrases containing adjectives or adverbs
2. Extract pairs of words ADJ NOUN or NOUN NOUN
3. Estimate the semantic orientation of the extracted
phrases, using the PMI-IR algorithm (against some
seed words; e.g., awful and excellent)
4. Calculate the average semantic orientation of the
phrases in the given review and classify the review
as recommended if the average is positive and
otherwise not recommended.
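Steps 3 and 4 can be sketched once co-occurrence counts are available. The semantic orientation of a phrase is the difference of its PMI with a positive and a negative seed word, which reduces to a single log ratio of hit counts. In this sketch the hit counts are invented toy numbers (no search engine is queried), the small smoothing constant that avoids division by zero is an assumption, and the function names are hypothetical.

```python
import math

def semantic_orientation(near_pos, near_neg, hits_pos, hits_neg, smoothing=0.01):
    """SO(phrase) = PMI(phrase, pos_seed) - PMI(phrase, neg_seed).

    near_pos / near_neg: co-occurrence counts of the phrase with each seed word.
    hits_pos / hits_neg: total counts of each seed word on its own.
    """
    return math.log2(((near_pos + smoothing) * hits_neg) / ((near_neg + smoothing) * hits_pos))

def classify_review(phrase_counts, hits_pos, hits_neg):
    """Average the orientation of all extracted phrases and threshold at zero."""
    scores = [semantic_orientation(p, n, hits_pos, hits_neg) for p, n in phrase_counts]
    average = sum(scores) / len(scores)
    label = "recommended" if average > 0 else "not recommended"
    return label, average

# Toy (near_pos, near_neg) counts for three phrases of one review.
phrase_counts = [(80, 20), (10, 40), (60, 15)]
label, average = classify_review(phrase_counts, hits_pos=1000, hits_neg=1000)
print(label, round(average, 2))
```

No labeled training reviews are needed: the only supervision is the choice of the two seed words.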