Twitter sentiment vs. Stock price

Background
•  On April 23rd 2013, the Twitter account belonging to the Associated Press was hacked. Fake posts about the White House being bombed and the President being injured were published. This led to a 1% loss on the Dow Jones.
•  On May 6th 2010, a poorly written algorithm triggered a selling spree that caused a 9.2% drop of the Dow Jones.
•  Using text mining as part of trading algorithms is common, and more incidents similar to these have happened (e.g. fake news about American Airlines going bankrupt once made the stock price fall quickly).
Aim
Inspired by this, I wanted to look into the following:
•  Is it possible to collect posts from Twitter (known as tweets) that mention a specified stock ticker (Apple Inc. uses AAPL), calculate a sentiment score for these tweets, and find a visual relationship between this score and the stock's current price?
By a visual relationship we mean that we want to plot the score and the price side by side and be able to visually see a relationship between them. More on this later…
Method - High level perspective
•  The general idea is to get all tweets for a specific hour, calculate the average sentiment score of these tweets, and plot it next to the closing price of the stock for that hour.
•  But what is a sentiment score?
1.  Find (or create) a corpus of tweets that are classified as positive or negative, create features, and use them in a naïve Bayes classifier (use the distribution rather than the label as the score).
2.  Use a lexicon of sentiment-tagged words (e.g. bad could be negative and super could be positive). For each tweet, count the number of positive and negative words and create a score from these counts.
Approach 1
•  The first approach was built upon what we have seen in the labs: creating features and using a naïve Bayes classifier.
•  I found a corpus of 1 600 000 tweets that were labelled as positive or negative. Based on these I wanted to create features and use them in a naïve Bayes classifier.
•  I created unigram, bigram and trigram features. Furthermore, I created a TF-IDF index over these tweets and used it as a feature. I also partially used the second approach (a lexicon of sentiment words, more on this later…).
Approach 1
•  However, it turned out that after a few days of trying to coerce my code into running in reasonable time, I failed.
•  Since each run was taking very long, I decided that I needed to save the tokenized and cleaned tweets, along with their features (and the TF-IDF index), to disk.
•  However, when trying to serialize the class structure I had created, the “pickle” module included in Python was using > 5 GB of RAM to check for cycles in the objects being saved, and it basically blew up every time (giving MemoryError).
•  So I had a choice: fix this (and save to an SQL database rather than to a flat file), or find another approach… I decided to use another approach.
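One way around the pickle blow-up, had that route been continued, would be a flat append-friendly format such as JSON Lines instead of serializing the whole object graph; a sketch under that assumption (the field contents are made up):

```python
import json

def save_tweets(tweets, path):
    """Write one JSON object per line; avoids pickle's whole-graph traversal."""
    with open(path, 'w', encoding='utf-8') as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + '\n')

def load_tweets(path):
    """Read the tweets back, one line at a time."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f]
```

Each tweet is written and read independently, so memory use stays proportional to one record rather than to the whole corpus.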
Approach 2
•  I found three lexicons that all consist of words with a positive or negative label attached. One of them also includes the POS of the word:
•  Example of the first lexicon (8221 words):
   word1=agony pos1=noun priorpolarity=negative
   word1=agree pos1=verb priorpolarity=positive
•  Example of the second lexicon (3642 words): two files, one with positive words (shield, shiny, …) and one with negative words (worst, wreck, …).
•  Example of the third lexicon (6787 words): two files, one with positive words (fine, flashy, …) and one with negative words (spooky, sporadic, …).
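The key=value lines of the first lexicon can be parsed with a one-liner; a sketch (the function name is mine):

```python
def parse_lexicon_line(line):
    """Split 'word1=agony pos1=noun priorpolarity=negative' into a dict."""
    return dict(field.split('=', 1) for field in line.split())
```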
Approach 2
•  These lexicons were parsed and placed into one large lexicon (duplicates were allowed and not removed).
•  I then downloaded 7945 tweets that contained the word AAPL (the stock ticker for Apple Inc.).
•  For each of the tweets I did the following processing: lowercase; remove all http://… and other URL structures; remove all usernames (i.e. @username); collapse multiple whitespaces into one; replace #word with word; reduce repetitions of letters to at most two (e.g. yeeeeeehaaaaa became yeehaa); remove all words that start with a number (e.g. 3am was removed); strip punctuation (!, ?, ., ,).
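The cleaning steps listed above could look roughly like this (a sketch; the regular expressions are my own approximation of the described steps):

```python
import re

def clean_tweet(text):
    """Apply the tweet-cleaning steps described above, in order."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # remove URL structures
    text = re.sub(r'@\w+', '', text)                    # remove @usernames
    text = re.sub(r'#(\w+)', r'\1', text)               # #word -> word
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)          # yeeeeeehaaaaa -> yeehaa
    text = re.sub(r'[!?.,]', '', text)                  # strip punctuation
    # drop words starting with a number; split/join also collapses whitespace
    return ' '.join(t for t in text.split() if not t[0].isdigit())
```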
Approach 2
•  The next step was to create the actual sentiment score. For each tweet I wanted to look up the tokens in my lexicon to try to decide whether each token was positive or negative.
•  Since one of my lexicons also contains the POS of the word, each of my tweets was subjected to POS tagging.
•  Each token of a tweet was sent to the lexicon (along with its POS tag) and a sentiment was returned.
•  I did a simple count of the positive and negative words.
Approach 2
•  Since multiple lexicons were included in my larger lexicon, I needed a way of deciding which lexicon to trust for a given word (since there was some overlap between the lexicons).
•  The following algorithm was created to resolve ties:
1.  If only one lexicon contains the word, then this lexicon wins.
2.  If the token and POS match the first lexicon, then this lexicon wins.
3.  If all lexicons agree on the sentiment, then all win.
4.  If the lexicons disagree, then count (i.e. if one lexicon says positive and the other two say negative, then negative wins).
5.  If it is still a tie, then return neutral.
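The tie-breaking rules above can be sketched as a function over the lexicon matches for one (token, POS) lookup; this is my reading of the rule order, with lexicon id 1 standing in for the POS-aware first lexicon:

```python
from collections import Counter

def resolve_sentiment(matches):
    """matches: (lexicon_id, sentiment) pairs for one token/POS lookup,
    where id 1 is the POS-aware first lexicon (illustrative convention)."""
    if not matches:
        return 'neutral'
    lexicons = {lex for lex, _ in matches}
    if len(lexicons) == 1:                       # rule 1: only one lexicon has it
        return matches[0][1]
    for lex, sentiment in matches:               # rule 2: the first lexicon wins
        if lex == 1:
            return sentiment
    counts = Counter(s for _, s in matches)
    if len(counts) == 1:                         # rule 3: all lexicons agree
        return next(iter(counts))
    (s1, c1), (_, c2) = counts.most_common(2)    # rule 4: majority vote
    if c1 > c2:
        return s1
    return 'neutral'                             # rule 5: still a tie
```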
Approach 2
•  For each tweet there now exists a positive count (p), a negative count (n), and the total number of tokens (N).
•  The following two scores were then associated with each tweet:
•  Sentiment diff: p – n
•  Positive score: p / N
•  But I was not satisfied with this, because I felt that some words must be more negative than others, and some words must be more positive than others.
Approach 2
•  The idea was then to create a TF-IDF index using the tokens in the lexicon (approx. 8000 unique tokens) and 2000 tweets from the downloaded AAPL tweets.
•  This TF-IDF index was created (and since it was of reasonable size it could be serialized to disk).
•  The issue then arose that it was only really useful on the 2000 tweets used to create it; when incoming tweets were to be processed they did not belong to the index.
Approach 2
•  So, since ignorance is bliss, I invented the average TF-IDF weight:
•  I calculated the average TF-IDF for each token in the index, saved this value, and threw away all the other values in the index, creating a very compact index of average TF-IDF values.
•  So for any token (regardless of which tweet it came from) I could get an average weight for the token.
•  E.g. “good” could have weight 0.008 and “awesome” could have weight 0.1.
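One way to build such an index is sketched below. It is my reading of the idea: each token's TF-IDF weight is averaged over the documents that actually contain it (the exact TF and IDF variants the author used are not stated):

```python
import math
from collections import Counter, defaultdict

def average_tfidf(documents):
    """documents: list of token lists. Return each token's TF-IDF weight
    averaged over the documents that contain it."""
    n_docs = len(documents)
    df = Counter()                      # document frequency per token
    for doc in documents:
        df.update(set(doc))
    totals = defaultdict(float)
    for doc in documents:
        tf = Counter(doc)
        for token, count in tf.items():
            totals[token] += (count / len(doc)) * math.log(n_docs / df[token])
    return {token: total / df[token] for token, total in totals.items()}
```

The resulting dict is the compact index: one float per token, regardless of which tweet a later lookup comes from.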
Approach 2
•  Armed with the average TF-IDF index, I continued my sentiment scoring.
•  Instead of counting the positive and negative words, I looked them up in the average TF-IDF index and summed the weights. A weighted positive count (wp) and a weighted negative count (wn) gave the following scores:
•  Weighted sentiment diff: wp – wn
•  Weighted positive score: wp / N
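All four scores for a single tweet can be computed in one pass; a sketch where `sentiment_of` and `avg_weight` stand in for the lexicon lookup and the average TF-IDF index (both hypothetical dict interfaces here):

```python
def tweet_scores(tokens, sentiment_of, avg_weight):
    """Return the four per-tweet scores: p-n, p/N, wp-wn and wp/N."""
    p = n = 0
    wp = wn = 0.0
    N = len(tokens)
    for token in tokens:
        sentiment = sentiment_of.get(token, 'neutral')
        weight = avg_weight.get(token, 0.0)
        if sentiment == 'positive':
            p += 1
            wp += weight
        elif sentiment == 'negative':
            n += 1
            wn += weight
    return {'diff': p - n, 'positive': p / N,
            'weighted_diff': wp - wn, 'weighted_positive': wp / N}
```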
Plotting
•  The 7945 downloaded tweets were grouped by hour, so all tweets posted between 11:01 AM and 12:00 PM were considered to belong to 12:00 PM.
•  For each grouping, the individual sentiment score for each tweet was calculated (using all four sentiment scores discussed). The total sentiment score for the grouping was simply the average score.
•  Hourly closing prices for AAPL were downloaded from Google Finance (this means that at 11:00 AM the latest price AAPL was sold for is the closing price for that hour).
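The hourly grouping can be sketched as follows, assuming each tweet comes with a timestamp and an already-computed score:

```python
from collections import defaultdict
from datetime import timedelta

def hourly_average(scored_tweets):
    """scored_tweets: (timestamp, score) pairs. A tweet posted between 11:01
    and 12:00 is assigned to the bucket ending 12:00; return per-bucket averages."""
    buckets = defaultdict(list)
    for ts, score in scored_tweets:
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        if ts != bucket:
            bucket += timedelta(hours=1)    # round up to the end of the hour
        buckets[bucket].append(score)
    return {hour: sum(scores) / len(scores) for hour, scores in buckets.items()}
```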
Plots
Sentiment difference (raw counts)
At first glance visually useless; however, it is worth noting that the maximum of each oscillation increases…
Note: the flat horizontal lines are created while the stock market is closed.
[Figure: hourly price and sentiment score between the 21st of May and the 27th of May]
Plots
p / N (raw counts)
Difficult to find anything visually appealing about this…
[Figure: hourly price and positive score between the 21st of May and the 27th of May]
Plots
wp / N (weighted sum)
Just as bad as the positive score without the TF-IDF weighting…
[Figure: hourly price and weighted sum between the 21st of May and the 27th of May]
Plots
wp – wn (weighted difference)
•  “Chartists” are investors who mainly look at charts of price and volume rather than the fundamental data about a company.
•  Chartists look for “trends” in the charts.
•  One of the classical ways of finding a trend is to find “higher lows”.
•  The support lines drawn in the charts show that both the price and the sentiment are creating “higher lows”, indicating that the stock and the sentiment are entering (or already in) a period of upward trend.
Results
•  It is easy to conclude that most results were useless; however, it is interesting to see some similarity in trend (in the chartist sense of the word) between the price and the weighted diff.
•  One obvious flaw in the process could be the fact that I averaged the sentiment score of each hour; if this were kept raw, then hours where there were a lot of positive tweets would possibly outweigh other hours more clearly, and possibly remove some of the oscillation.
•  When comparing the sentiment scores against the already labelled tweets from approach 1 (the 1 600 000 tweets), the accuracy of the scores was low (almost as low as randomly guessing the sentiment).
•  An attractive feature of the sentiment scoring approach is that it needs no labelled corpus (the lexicons can be reused).
Average TF-IDF

•  … give a different result. It would be interesting to expand this lexicon further to include more words, and also to try it on text that is not as random as tweets are. The average TF-IDF index used here is not very large (only 1000 tweets were used); expanding it could possibly also increase its usefulness.

1.6 Task 1(c)

1.6.1 Method

Using the selected processors (Table 4), the naive Bayes classifier was run again, this time with some added feature generators. We included the 1000 most frequent bigrams (creating has_bigram('word1','word2') features for each document). A feature was added that tells the classifier whether the average document word length is greater than, less than, or equal to the corpus average word length. Furthermore, a 10-bin feature with cutoffs based on the lexical diversity of the document was created.

1.6.2 Results

Table 5: Results from Task 1(c), without average TF-IDF features

Processors | Features | Accuracy | Pre(P) | Rec(P) | F-M(P) | Pre(N) | Rec(N) | F-M(N)
PunctuationProcessor, StemmingProcessor, LemmatizerProcessor | HWFG, BFG, LDG, AWLG | 0.79 | 0.84 | 0.75 | 0.79 | 0.75 | 0.84 | 0.79
LowerProcessor, StopWordProcessor, StemmingProcessor, LemmatizerProcessor | HWFG, BFG, LDG, AWLG | 0.79 | 0.84 | 0.75 | 0.79 | 0.75 | 0.84 | 0.79
LowerProcessor, PunctuationProcessor, StemmingProcessor, LemmatizerProcessor | HWFG, BFG, LDG, AWLG | 0.79 | 0.84 | 0.75 | 0.79 | 0.75 | 0.84 | 0.79
LowerProcessor, NumberProcessor, PunctuationProcessor, StopWordProcessor, StemmingProcessor, LemmatizerProcessor | HWFG, BFG, LDG, AWLG | 0.79 | 0.84 | 0.75 | 0.79 | 0.75 | 0.84 | 0.79

1.6.3 Conclusions

Adding the feature generators changes the column values more than trying different combinations of processors does. However, there is no difference between the choices of processors.

1.7 Task 1(d) Average TF-IDF

1.7.1 Method

The idea was to include TF-IDF as a binary feature. This has been done by calculating the average TF-IDF weight for each term in the entire corpus, and then setting the feature [tfidf('word') > Avg] or [tfidf('word') <= Avg] for each frequent term in each document. As before, only the 1000 most frequent terms have been used. The TFIDF feature generator was added to the generators in Sec. 1.6, using only a selection of the processors.

1.7.2 Results

Table 6: Results from Task 1(d), with average TF-IDF features

Processors | Features | Accuracy | Pre(P) | Rec(P) | F-M(P) | Pre(N) | Rec(N) | F-M(N)
PunctuationProcessor, StemmingProcessor, LemmatizerProcessor | HWFG, BFG, LDG, AWLG, TFIDF | 0.80 | 0.84 | 0.76 | 0.80 | 0.76 | 0.84 | 0.80
LowerProcessor, StopWordProcessor, StemmingProcessor, LemmatizerProcessor | HWFG, BFG, LDG, AWLG, TFIDF | 0.80 | 0.84 | 0.76 | 0.80 | 0.76 | 0.84 | 0.80
LowerProcessor, PunctuationProcessor, StemmingProcessor, LemmatizerProcessor | HWFG, BFG, LDG, AWLG, TFIDF | 0.80 | 0.84 | 0.76 | 0.80 | 0.76 | 0.84 | 0.80
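The binary [tfidf('word') > Avg] feature described above can be sketched as follows; `doc_tfidf` and `corpus_avg` stand in for the real per-document TF-IDF values and the corpus averages, and the feature-name format is illustrative:

```python
def binary_tfidf_features(doc_tfidf, corpus_avg, frequent_terms):
    """For each frequent term, emit a boolean tfidf(term)>Avg feature."""
    features = {}
    for term in frequent_terms:
        above = doc_tfidf.get(term, 0.0) > corpus_avg.get(term, 0.0)
        features["tfidf(%s)>Avg" % term] = above
    return features
```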
Lexicons
MPQA – Subjectivity lexicon: http://mpqa.cs.pitt.edu/lexicons/
tm.plugin.tags – an R package that contains positive and negative words
Opinion Mining, Sentiment Analysis and Opinion Spam Detection: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html