Twist: User Timeline Tweets Classifier

Team: Priya Iyer, Vaidy Venkat, Sonali Sharma
Mentor: Andy Schlaikjer

Auto-classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology
Input: user timeline tweets
Output: list of auto-classified tweets

Twitter allows users to create custom Friend Lists based on user handles. Our application is a twist on this functionality: we auto-classify tweets on the user’s timeline based only on the occurrence of terms in the tweet.

Step 1: Data collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until the classification is correct

▪ Remove special characters
▪ Tokenize
▪ Remove redundant letters in words
▪ Spell check
▪ Stemming
▪ Language identification
▪ Remove stop words
▪ Generate bigrams and change to lower case

Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
Stopwords: SF Giants! amaazzzing feelin’!!!! \/ :D
Special chars: SF Giants amaazzzing feelin
Spell check: SF Giants amazing feeling
Stemming: SF Giants amazing feel
Stopwords: SF Giants amazing feel

Logistic Regression Classifier
Reasons:
▪ Most popular linear classification technique for text classification
▪ Ability to handle multiple categories with ease
▪ Gave the best cross-validation accuracy and precision-recall score
Library: LIBLINEAR for Python

SF Giants amazing feel
Indexing: SF-1, Giants-2, amazing-3, feel-4
Boolean: SF-1 (1), Giants-2 (1), amazing-3 (1), feel-4 (1)
Training input for the SVM: 1 1:1 2:1 3:1 4:1

Andy, Marti & The Twitter Team

Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”
Tweets were not purely “Sports” or “Business” related
Personal messages were prominent
Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly

Noise in the data:
▪ Tweets are in inconsistent format
▪ Lots of meaningless words
▪ Misspellings
▪ More of individual expression
▪ For example, BAAAAAAAAAAAASSKEttt!!!! bskball, futball, %, :D, \m/, ^xoxo
Solution: Regular expressions and NLP toolkit

Different words, same root
Playing, plays, playful - play
Solution: Stemming

Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
Comma-separated values of the categories that each tweet belongs to

Accuracy here is 94%. Precision: 0.89 Recall: 0.89
Experiment with different kernels for a better accuracy

Category-based tweets from https://twitter.com/i/#!/who_to_follow/interests
Coding done in Python
Database – sqlite3
ML tool – LIBSVM
Stemming – Porter’s Stemming
NLP Tool kit
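
The indexing and Boolean encoding step shown earlier (SF-1, Giants-2, amazing-3, feel-4, giving the training line "1 1:1 2:1 3:1 4:1") could be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the function names build_vocabulary and to_sparse_line are hypothetical, it assumes the tweets have already been cleaned and tokenized, and it assumes the label 1 stands for Sports as in the category numbering above.

```python
def build_vocabulary(token_lists):
    """Assign each distinct term a 1-based index in order of first appearance.

    Hypothetical helper: the slides only show the resulting index, not how
    it was built.
    """
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_sparse_line(label, tokens, vocab):
    """Format one tweet as '<label> <index>:1 ...' using Boolean features.

    A term that occurs at all gets the value 1; repeated occurrences are not
    counted. Indices are emitted in ascending order, as the sparse
    LIBSVM/LIBLINEAR text format expects.
    """
    indices = sorted({vocab[t] for t in tokens if t in vocab})
    return " ".join([str(label)] + [f"{i}:1" for i in indices])


# Example from the slides: a preprocessed Sports tweet (assumed label 1).
tweets = [(1, ["SF", "Giants", "amazing", "feel"])]

vocab = build_vocabulary(tokens for _, tokens in tweets)
lines = [to_sparse_line(label, tokens, vocab) for label, tokens in tweets]

print(vocab)     # {'SF': 1, 'Giants': 2, 'amazing': 3, 'feel': 4}
print(lines[0])  # 1 1:1 2:1 3:1 4:1
```

Writing one such line per tweet to a text file gives a training file in the sparse format that both LIBLINEAR and LIBSVM read. Terms not present in the vocabulary are skipped here, which is one simple way to handle unseen words at classification time.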