University of Arkansas – CSCE Department
CSCE 4613 Artificial Intelligence – Final Report – Fall 2013

Sentiment Analysis

Addam Hardy

Abstract

The purpose of this project was to research the science behind sentiment analysis engines and to compare different methods for accuracy by implementing them in code. Sentiment analysis sits at the intersection of Natural Language Processing and Machine Learning. It is used to classify and filter text when there is far more of it than people can read. A company that wants to know how its new product is being received on Twitter, for example, faces far too many posts for a human to read and classify by hand. Using machine learning, we can pre-filter and pre-classify data on a scale from negative to positive sentiment before showing it to an end user.

1. Introduction

1.1 Problem

I am attempting to understand methods of sentiment analysis that have already been discovered and to implement them myself. These methods include Bayesian classifiers, neural networks, Support Vector Machines, and a few others.

1.2 Objective

The objective of my project is to research and understand possible sentiment analysis methods and to implement them in code. The main goal is to implement at least one web application that can process a training data set and then classify given text on a sentiment scale.

1.3 Context

Sentiment analysis relates to Artificial Intelligence through the intersection of two AI disciplines: Natural Language Processing and Machine Learning. It attempts to reduce the effort any agent must spend to classify written language as negative or positive, or anywhere on the gray scale in between.

2. Related Work

2.1 Key Technologies

I decided to use Ruby to implement my demo of a sentiment analysis engine behind a web application.
Using Ruby allows me to prototype much faster and to focus on the important parts of the implementation rather than on the structure and rules of a larger, statically typed language like C or Java. In my Ruby application, the core of the sentiment analysis engine is a library for building Support Vector Machines.

3. Architecture

3.1 Requirements or Use Cases

- Given a negative training set, the engine trains itself on possible negative speech patterns.
- Given a positive training set, the engine trains itself on possible positive speech patterns.
- The training sets are snippets from movie reviews, which sets the domain knowledge of the system to movie reviews and the sentiment in them.
- After the system has been trained, the user can input any text into the web application, and it outputs an estimate of whether the text has positive or negative sentiment.

3.2 Architecture or Design Space

The web app is built with Sinatra, a very lightweight Ruby web framework. It lets you create a basic web application by directly declaring a route and the block of code that executes and returns a response for that route.

Within the application, a SentimentClassifier class takes an array of movie reviews as input and, with the help of the Corpus class, builds a corpus. The corpus turns the tokenized dataset into sparse vectors, which are fed to the Support Vector Machine. The SVM provides a non-probabilistic method for classifying text on a sentiment scale of -1 to 1 (negative to positive). Once the SVM has been “trained” on the data set, the end user can enter any text into the textbox on the homepage and get a sentiment classification for it.

3.3 Tasks

Addam Hardy completed all research and tasks with a small amount of mentoring help from a professional colleague.
- Research on methods
- Implement classes for Support Vector Machine
- Implement web application for SVM demoing purposes
- Reporting/Presentations

3.4 Testing

I tested the application using the MiniTest unit testing framework for Ruby. The test cases and output are below:

    Corpus defines a sentiment_code of 1 for positive
    Corpus consumes multiple files and turns it into sparse vectors
    Corpus consumes a positive training set and unique set of words
    Corpus consumes a positive training set
    Corpus defines a sentiment_code of 1 for positive
    Corpus::tokenize downcases all the word tokens
    Corpus::tokenize ignores all stop symbols
    Corpus::tokenize ignores the unicode space

    8 tests run in 0.00492 seconds.
    Errors: 0 | Failures: 0 | Skips: 0

    starting to get sparse vectors
    SentimentClassifier yields a zero error when it uses itself
    starting to get sparse vectors
    SentimentClassifier cross validates with an error of 35% or less
    SentimentClassifier builds using defaults .neg for negative and .pos for positive

    3 tests run in 114.664837 seconds.
    Errors: 0 | Failures: 0 | Skips: 0

4. Results and Analysis

4.1 Summary

Using a Support Vector Machine with a linear kernel, I input each movie review snippet into the SVM as a vector. Each snippet is tokenized into words, and the tokens are checked against a list of stop words to remove words that carry little of the sentence's content and do little to improve classification accuracy. Example stop words include: a, an, the, be, of, etc. Using the linear kernel and normalizing the data by removing stop words and using snippets of roughly the same length, I was able to get between 83% and 86% accuracy for negative and positive reviews. I would have liked to compare the SVM to a Naïve Bayes classifier or a neural network implementation, but I was unable to complete one without using someone else’s code.

4.2 Screenshots

5. Conclusions

5.1 Summary

My ultimate conclusion is that I am, at a minimum, able to function in this field using the work of others as a foundation. This was my first venture into AI, NLP, and ML, so it was likely that I would be confused or lost much of the time and would spend most of my time seeking understanding rather than having it. Even with a great deal of research, I feel I will probably only ever be a beneficiary and user of this knowledge rather than a creator of it. I really enjoyed the challenge of trying to truly understand Bayesian classification, neural networks, Support Vector Machines, and other methods, but it was definitely more of a challenge for me than a success. I was, however, able to produce one implementation using an SVM and a web application to demo its functionality. So in some small part, I can call my venture a success. There is simply much more ground to cover and much more to learn.

5.2 Potential Impact

I did not create or discover any new methods for sentiment analysis and only used methods that have already been discovered and proven, so the external impact of my project is quite small. However, the personal impact has been larger: the project has led me to venture further into more scientific computer science, and my demo web application finally let me show some demonstrable artificial intelligence to laymen.

5.3 Future Work

Because the math was difficult for me, I spent much of my time in the research phase rather than on implementation. As a result, I was able to implement only one method of sentiment analysis and could not comparatively analyze the accuracy of multiple methods. Given more time, I would implement more methods and build a testing framework to compare their accuracy.

5.4 Project Value

I previously worked on a project that had a sentiment analysis component. Another member of my team developed that component, and how it worked always fascinated me.
I gained a lot of knowledge about how these systems are implemented and about how difficult, at least for me as a non-mathematician, the underlying theory and math can be. This project also allowed me to experiment with two disciplines of Artificial Intelligence: Natural Language Processing and Machine Learning.

Bios

Addam Hardy – Hardy is a senior majoring in Computer Science at the University of Arkansas, Fayetteville. He is also the lead developer at RevUnit in Bentonville, AR, overseeing software development projects on many platforms, including Ruby on Rails, iOS, and Android.

Appendix A – Deliverables Manifest

- Final Report
- Sinatra Web Application
    o Install Ruby
    o Install RubyGems
    o Install the Bundler gem
    o Run `bundle install` in the project directory to install all dependencies
    o Run `rackup` in the project directory to start the web application
    o Go to http://localhost:9292 to view the demo application
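The preprocessing described in Sections 3.2 and 4.1 (tokenizing each snippet, dropping stop words, and turning the result into a sparse vector) can be sketched in plain Ruby roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual Corpus class. The names STOP_WORDS, tokenize, build_vocabulary, and sparse_vector are hypothetical, the stop-word list contains only the examples named in the report, and the 1-based "feature index => count" hash is an assumed vector format rather than the one the SVM library necessarily expects.

```ruby
require 'set'

# Hypothetical stop-word list: only the examples named in Section 4.1.
STOP_WORDS = Set.new(%w[a an the be of]).freeze

# Split text into lowercase word tokens, dropping punctuation, whitespace
# (including non-ASCII spaces), and stop words. ASCII letters only, as a
# simplification for this sketch.
def tokenize(text)
  text.downcase.scan(/[a-z']+/).reject { |word| STOP_WORDS.include?(word) }
end

# Assign every distinct token in the training snippets a 1-based feature index.
def build_vocabulary(snippets)
  vocabulary = {}
  snippets.each do |snippet|
    tokenize(snippet).each { |token| vocabulary[token] ||= vocabulary.size + 1 }
  end
  vocabulary
end

# Represent one snippet as a sparse vector: { feature_index => count }.
# Tokens that never appeared in the training data are skipped.
def sparse_vector(text, vocabulary)
  tokenize(text).each_with_object(Hash.new(0)) do |token, vector|
    index = vocabulary[token]
    vector[index] += 1 if index
  end
end

# Made-up training snippets, labeled on the report's scale:
# 1 for positive, -1 for negative.
training = [
  ['A moving and beautiful film', 1],
  ['The plot is dull and lifeless', -1]
]

vocabulary = build_vocabulary(training.map(&:first))

# Labeled sparse vectors, ready to hand to whatever SVM library is in use.
examples = training.map { |text, label| [label, sparse_vector(text, vocabulary)] }
```

With this vocabulary, "film" is feature 4, so `sparse_vector('The film, the FILM!', vocabulary)` yields a hash mapping 4 to 2: the stop word "the" is dropped and both spellings of "film" are downcased into the same feature. Only non-zero entries are stored, which keeps each vector small even when the vocabulary grows to thousands of words.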