
University of Arkansas – CSCE Department
CSCE 4613 Artificial Intelligence – Final Report – Fall 2013
Sentiment Analysis
Addam Hardy
Abstract
The purpose of this project was to research the science behind sentiment analysis engines and to
compare different methods for accuracy by implementing them in code. Sentiment analysis sits at the
intersection of Natural Language Processing and Machine Learning. It is used to classify and
filter text when there is far more of it than anyone could read. A company that wants to know how
its new product is being received on Twitter, for example, faces far too many posts for a human
to read and classify by hand. Using machine learning, we can pre-filter and pre-classify data on a
scale from negative to positive sentiment before showing it to an end user.
1. Introduction
1.1 Problem
I am attempting to understand existing methods of sentiment analysis and to implement them
myself. These methods include Bayesian classifiers, neural networks, Support Vector Machines,
and a few others.
1.2 Objective
The objective of my project is to research and understand possible sentiment analysis methods
and to implement them in code. The main goal is to implement at least one web application
that can process a training data set and then classify given text on a sentiment scale.
1.3 Context
Sentiment analysis relates to Artificial Intelligence through the intersection of two AI
disciplines: Natural Language Processing and Machine Learning. It attempts to reduce the effort
any agent needs to classify written language as negative, positive, or anywhere on the gray
scale in between.
2. Related Work
2.1 Key Technologies
I decided to use Ruby to implement my demo of a sentiment analysis engine behind a web
application. Ruby lets me prototype much faster and focus on the important parts of the
implementation rather than on the structure and rules of a larger statically typed language
like C or Java. In my Ruby application, the core of the sentiment analysis engine is a library
for building Support Vector Machines.
3. Architecture
3.1 Requirements or Use Cases

• Given a negative training set, the engine will train itself on possible negative speech
patterns.
• Given a positive training set, the engine will train itself on possible positive speech
patterns.
• The training sets will be snippets from movie reviews, which sets the domain knowledge of
the system to movie reviews and the sentiment expressed in them.
• After the system has been trained, the user can input any text into the web application, and
it will output an estimate of whether the text's sentiment is positive or negative.
3.2 Architecture or Design Space
The web app is built with Sinatra, a very lightweight Ruby web framework. It lets you create a
very basic web application by directly declaring a route and the block of code that executes
and returns a response for that route.
Within the application, a SentimentClassifier class takes an array of movie reviews and, with
the help of the Corpus class, tokenizes them and builds sparse vectors that are fed to the
Support Vector Machine. The SVM provides a non-probabilistic method for classifying text on a
sentiment scale of -1 to 1 (negative to positive).
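The step of turning tokenized reviews into sparse vectors can be sketched as follows. This is a minimal illustration, not the report's actual Corpus class: the method names `build_vocabulary` and `sparse_vector` and the word-count encoding are assumptions made for the example.

```ruby
# Hypothetical helpers for illustration; the report's Corpus class may differ.

# Assign each distinct token an integer index, in order of first appearance.
def build_vocabulary(tokenized_docs)
  vocabulary = {}
  tokenized_docs.each do |tokens|
    tokens.each { |token| vocabulary[token] ||= vocabulary.size }
  end
  vocabulary
end

# Encode one document as a sparse vector: a Hash of { word_index => count }.
# Only words that occur in the document get an entry.
def sparse_vector(tokens, vocabulary)
  vector = Hash.new(0)
  tokens.each do |token|
    index = vocabulary[token]
    vector[index] += 1 if index # words not seen during training are skipped
  end
  vector
end

docs = [
  %w[great fun movie],  # a positive snippet
  %w[dull boring movie] # a negative snippet
]
vocabulary = build_vocabulary(docs)
vectors = docs.map { |tokens| sparse_vector(tokens, vocabulary) }
```

Storing only the non-zero entries keeps each vector small even when the vocabulary grows to thousands of words, since a short review snippet uses only a handful of them.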
Once the SVM has been trained on the data set, the end user can enter any text into the textbox
on the homepage and get a sentiment classification for that text.
3.3 Tasks
Addam Hardy completed all research and tasks with a small amount of mentoring help from a
professional colleague.

• Research on methods
• Implement classes for Support Vector Machine
• Implement web application for SVM demoing purposes
• Reporting/Presentations
3.4 Testing
I tested the application using the MiniTest unit testing framework for Ruby. The test cases and
output are below:
Corpus
defines a sentiment_code of 1 for positive
Corpus
consumes multiple files and turns it into sparse vectors
Corpus
consumes a positive training set and unique set of words
Corpus
consumes a positive training set
Corpus
defines a sentiment_code of 1 for positive
Corpus::tokenize
downcases all the word tokens
Corpus::tokenize
ignores all stop symbols
Corpus::tokenize
ignores the unicode space
8 tests run in 0.00492 seconds.
Errors: 0 | Failures: 0 | Skips: 0
starting to get sparse vectors
SentimentClassifier
yields a zero error when it uses itself
starting to get sparse vectors
SentimentClassifier
cross validates with an error of 35% or less
SentimentClassifier
builds using defaults .neg for negative and .pos for positive
3 tests run in 114.664837 seconds.
Errors: 0 | Failures: 0 | Skips: 0
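A cross-validation check like the one named in the output above could be computed along these lines. This is only a sketch of k-fold cross-validation in plain Ruby: the helper names `k_folds` and `cross_validation_error` are hypothetical, and the majority-class trainer is a stand-in for the SVM, not the report's test code.

```ruby
# Hypothetical helpers for illustration; not taken from the report's test suite.

# Split the indices 0...n into k folds of near-equal size (round-robin).
def k_folds(n, k)
  (0...n).group_by { |i| i % k }.values
end

# Fraction of samples misclassified across all k held-out folds.
# The block receives the training samples and labels and must return a
# callable that maps one sample to a predicted label.
def cross_validation_error(samples, labels, k)
  errors = 0
  k_folds(samples.size, k).each do |test_indices|
    train_indices = (0...samples.size).to_a - test_indices
    model = yield(samples.values_at(*train_indices), labels.values_at(*train_indices))
    errors += test_indices.count { |i| model.call(samples[i]) != labels[i] }
  end
  errors.to_f / samples.size
end

# Stand-in trainer: always predict the most common training label.
samples = %w[s0 s1 s2 s3 s4 s5]
labels  = [1, 1, 1, -1, 1, -1]
error = cross_validation_error(samples, labels, 3) do |_train_x, train_y|
  majority = train_y.max_by { |label| train_y.count(label) }
  ->(_sample) { majority }
end
puts format("cross-validation error: %.2f", error)
```

Replacing the stand-in trainer with a block that trains the real classifier would give the error rate that a threshold such as 35% can be asserted against.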
4. Results and Analysis
4.1 Summary
We use a Support Vector Machine with a linear kernel and input each movie review snippet into
the SVM as a vector. Each snippet is split into word tokens, and the tokens are cross-referenced
against a list of stop words in order to remove words that do little to improve the accuracy of
the classification and bear no weight on the content of the sentence. Example stop words
include: a, an, the, be, of, etc.
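Tokenizing a snippet and dropping stop words might look like the sketch below. The `tokenize` method, the regular expression, and the five-word stop list (taken from the examples above) are assumptions for illustration; the report's Corpus::tokenize may work differently.

```ruby
require 'set'

# Illustrative stop list built from the examples in the text; a real list is longer.
STOP_WORDS = Set.new(%w[a an the be of]).freeze

# Downcase the text, keep runs of letters (with an optional apostrophe part),
# and drop stop words. Punctuation and whitespace, including the Unicode
# non-breaking space, never match the pattern and are therefore ignored.
def tokenize(text)
  text.downcase
      .scan(/[[:alpha:]]+(?:'[[:alpha:]]+)?/)
      .reject { |token| STOP_WORDS.include?(token) }
end

p tokenize("The plot of the film is a triumph!")
# prints ["plot", "film", "is", "triumph"]
```

Using a Set for the stop list makes each membership check a constant-time lookup, which matters once the list grows beyond a few entries.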
Using the linear kernel and normalizing the data by removing stop words and keeping snippets to
about the same average length, I was able to get between 83% and 86% accuracy for negative and
positive reviews.
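With a linear kernel, classifying a trained-on vocabulary reduces to taking the sign of a weighted sum of the input's features. The sketch below shows that idea on sparse vectors. The weights and bias are made-up values for illustration, not parameters learned by the report's SVM, and the method names are hypothetical.

```ruby
# Linear kernel: the dot product of two sparse vectors ({ index => value }).
# Iterating over the smaller Hash keeps the work proportional to its size.
def linear_kernel(x, y)
  small, large = x.size <= y.size ? [x, y] : [y, x]
  small.sum { |index, value| value * large.fetch(index, 0) }
end

# Decision value w . x + b for a linear model.
def decision_value(weights, bias, x)
  linear_kernel(weights, x) + bias
end

# The sign of the decision value is the predicted class: 1 or -1.
def classify(weights, bias, x)
  decision_value(weights, bias, x) >= 0 ? 1 : -1
end

# Made-up weights: positive for "good" word indices, negative for "bad" ones.
weights = { 0 => 0.8, 1 => 0.5, 3 => -0.9, 4 => -0.7 }
bias = 0.0

puts classify(weights, bias, { 0 => 1, 2 => 1 })         # prints 1
puts classify(weights, bias, { 3 => 1, 4 => 1, 2 => 1 }) # prints -1
```

Word indices with no weight (index 2 here) contribute nothing to the sum, so words the model has no evidence about do not affect the classification.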
I would have liked to compare the SVM to a Naive Bayes classifier or a neural network
implementation, but I was unable to complete one without using someone else’s code.
4.2 Screenshots
5. Conclusions
5.1 Summary
My ultimate conclusion is that I am, at a minimum, able to function in this realm using the work
of others as a foundation. This was my initial venture into AI, NLP, and ML, so it was more than
likely that I would be confused or lost most of the time and spend the majority of my time
looking for understanding rather than having it. Even with a great deal of research, I feel I
will probably only ever be a beneficiary and user of this knowledge rather than a creator of it.
I really enjoyed the challenge of trying to truly understand Bayesian classification, neural
networks, Support Vector Machines, and others, but it was definitely much more of a challenge
for me than a success. I was, however, able to come up with one implementation using an SVM and
a web application to demo its functionality. So in some small part, I can call my venture a
success. There is just much more ground to cover and much more to learn.
5.2 Potential Impact
I did not create or discover any new methods for sentiment analysis and only used methods that
have already been discovered and proven, so the external impact of my project is quite small.
However, the personal impact of the project on my own work has led me to venture further into
more scientific computer science and finally to demonstrate some artificial intelligence to
laypeople with my demo web application.
5.3 Future Work
Since the math was difficult for me and I ended up spending a lot of time in the research phase
rather than on implementation, I was only able to implement one method of sentiment analysis and
was unable to compare the accuracy of multiple methods. Given more time, I would continue to
implement more methods and build a testing framework to compare their accuracy.
5.4 Project Value
I previously worked on a project that had a sentiment analysis component. Another member of
my team developed that component, and how it worked was always fascinating to me. I gained a
lot of knowledge about how these systems are implemented and how difficult, at least for me as
a non-mathematician, the underlying theory and math can be. This project also allowed
me to experiment with two disciplines of Artificial Intelligence: Natural Language Processing
and Machine Learning.
Bios

• Addam Hardy – Hardy is a senior majoring in Computer Science at the University of Arkansas,
Fayetteville. He is also the lead developer at RevUnit in Bentonville, AR, overseeing software
development projects on many platforms, including Ruby on Rails, iOS and Android.
References
[1] https://github.com/NaturalNode/natural
[2] http://nlp.stanford.edu/sentiment/
[3] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[4] http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
Appendix A – Deliverables Manifest

• Final Report
• Sinatra Web Application
o Install Ruby
o Install Ruby Gems
o Install the Bundler Gem
o `bundle install` in the project directory to install all dependencies
o `rackup` in the project directory to start the web application
o Go to http://localhost:9292 to view the demo application