Project Presentation

Twist: User Timeline Tweets Classifier
Team:
Priya Iyer
Vaidy Venkat
Sonali Sharma
Mentor: Andy Schlaikjer



Auto-classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology
Input: user timeline tweets
Output: list of auto-classified tweets

Twitter allows users to create custom Friend Lists based on the user handles.

Our application is a twist on this functionality
of Twitter where we auto classify tweets on
the user’s timeline based on just the
occurrence of terms in the tweet.








Step 1: Data Collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until the classification is correct








Remove special characters
Tokenize
Remove redundant letters in words
Spell Check
Stemming
Language Identification
Remove Stop Words
Generate bigrams and change to lower case
Example:
Input: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
After stop-word removal: SF Giants! amaazzzing feelin’!!!! \/ :D
After special-character removal: SF Giants amaazzzing feelin
After spell check: SF Giants amazing feeling
After stemming: SF Giants amazing feel me
After stop-word removal: SF Giants amazing feel
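
Part of the text-mining steps listed above can be sketched in Python as follows. This is only a rough illustration, not the project's actual code: the stop-word set and the function name `preprocess` are assumptions, the order of the steps is simplified, and spell check, stemming and language identification are left out.

```python
import re

# Tiny illustrative stop-word set (assumption; a real list is much longer).
STOP_WORDS = {"go", "such", "an", "a", "the", "me", "is", "to"}

def preprocess(tweet):
    """Return (tokens, bigrams) for one tweet."""
    # Replace everything except letters, digits and whitespace with a space.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", tweet)
    # Tokenize on whitespace and lower-case each token.
    tokens = [t.lower() for t in cleaned.split()]
    # Drop stop words and one-character leftovers of emoticons such as \m/ and :D.
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 1]
    # Pair each token with its successor to form bigrams.
    bigrams = list(zip(tokens, tokens[1:]))
    return tokens, bigrams

tokens, bigrams = preprocess("Go SF Giants! Such an amaazzzing feelin'!!!! \\m/ :D")
print(tokens)   # ['sf', 'giants', 'amaazzzing', 'feelin']
print(bigrams)
```

The misspelled tokens are kept on purpose here; they would be handled by the later spell-check and stemming steps.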


Logistic Regression Classifier
Reasons:
▪ Most popular linear classification technique for text classification
▪ Ability to handle multiple categories with ease
▪ Gave the best cross-validation accuracy and precision-recall score
▪ Library: LIBLINEAR for Python
SF Giants amazing feel
Indexing: SF-1, Giants-2, amazing-3, feel-4
Boolean: SF-1 (1), Giants-2 (1), amazing-3 (1), feel-4 (1)
Training input for the SVM: 1 1:1 2:1 3:1 4:1
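
The indexing and boolean encoding above could be produced along these lines. The function names `build_vocabulary` and `to_sparse_line` are hypothetical, and the sketch assumes the sparse `label index:value` text format that LIBSVM and LIBLINEAR read, with feature indices in ascending order.

```python
def build_vocabulary(token_lists):
    """Assign each distinct term a 1-based index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for term in tokens:
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab

def to_sparse_line(label, tokens, vocab):
    """Format one tweet as '<label> <index>:1 ...' using boolean features."""
    # A set removes repeated terms; sorting keeps the indices ascending.
    indices = sorted({vocab[t] for t in tokens if t in vocab})
    features = " ".join(f"{i}:1" for i in indices)
    return f"{label} {features}"

tokens = ["SF", "Giants", "amazing", "feel"]
vocab = build_vocabulary([tokens])
line = to_sparse_line(1, tokens, vocab)  # label 1 = sports
print(line)  # 1 1:1 2:1 3:1 4:1
```

Terms that are not in the vocabulary are simply skipped, which is one possible way to treat unseen words at prediction time.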
Andy, Marti & The Twitter Team




Collected >2000 tweets from the “Who to
follow” interest lists on Twitter for “Sports”
and “Business”
Tweets were not purely “Sports” or
“Business” related
Personal messages were prominent
Solution: Compared against a corpus of
sports/business related terms and assigned
weights accordingly
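
One way such a weight could be computed is the share of a tweet's tokens found in a category term list. The term sets, the sample tweets and the function name `category_weight` below are made up for illustration; the slides do not say how the weights were actually assigned.

```python
# Hypothetical category term lists (assumption; the real corpus is not shown).
SPORTS_TERMS = {"giants", "game", "score", "team", "league"}
BUSINESS_TERMS = {"market", "stocks", "shares", "profit", "earnings"}

def category_weight(tokens, terms):
    """Fraction of the tweet's tokens that appear in the category term set."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in terms)
    return hits / len(tokens)

# Toy tokenized tweets: one sports, one business, one personal message.
tweets = [
    ["giants", "win", "the", "game"],
    ["stocks", "rally", "on", "earnings"],
    ["happy", "birthday", "mom"],
]
sports_weights = [category_weight(t, SPORTS_TERMS) for t in tweets]
business_weights = [category_weight(t, BUSINESS_TERMS) for t in tweets]
print(sports_weights)    # [0.5, 0.0, 0.0]
print(business_weights)  # [0.0, 0.5, 0.0]
```

A personal message scores 0.0 for both categories, so a threshold on the weight would let such tweets be filtered out of the training data.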

Noise in the data:
▪ Tweets are in inconsistent format
▪ Lots of meaningless words
▪ Misspellings
▪ More of individual expression
▪ For example, BAAAAAAAAAAAASSKEttt!!!! bskball , futball, % , :D,\m/, ^xoxo
Solution: Regular expressions and NLP toolkit
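
A small regular-expression sketch of this kind of clean-up, using only Python's standard `re` module. The function names are hypothetical, and collapsing runs of three or more letters to a single letter is just one possible rule, assumed here for illustration.

```python
import re

def squeeze_repeats(word):
    """Collapse runs of three or more identical characters to a single one."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

def strip_noise(text):
    """Drop non-alphanumeric characters, then squeeze repeated letters."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return " ".join(squeeze_repeats(w) for w in text.split())

print(strip_noise("BAAAAAAAAAAAASSKEttt!!!!"))  # BASSKEt
print(squeeze_repeats("amaazzzing"))            # amaazing
```

The output is still misspelled ("BASSKEt", "amaazing"), which is why a spell-check step is needed after the regular expressions.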

Different words, same root
Playing, plays, playful → play
Solution: Stemming
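
To show the idea, here is a deliberately simplified suffix stripper. It is not Porter's algorithm, which applies several ordered rule steps with conditions on the stem; the suffix list and the minimum stem length are assumptions chosen only to reproduce the example above.

```python
# Illustrative suffix list (assumption); Porter's stemmer uses far more rules.
SUFFIXES = ("ing", "ful", "ed", "s")

def simple_stem(word):
    """Strip one common suffix, keeping at least three letters of the stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [simple_stem(w) for w in ["Playing", "plays", "playful"]]
print(stems)  # ['play', 'play', 'play']
```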




Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
Comma-separated values of the categories that each tweet belongs to
Accuracy here is 94%. Precision: 0.89. Recall: 0.89.
Experiment with different kernels for better accuracy
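
For reference, accuracy, precision and recall over the four category labels can be computed as sketched below. The label lists are toy data, not the project's results, and macro-averaging across categories is an assumption, since the slides do not say how precision and recall were averaged.

```python
def evaluate(true_labels, predicted_labels):
    """Return (accuracy, macro precision, macro recall) for multi-class labels."""
    pairs = list(zip(true_labels, predicted_labels))
    labels = sorted(set(true_labels) | set(predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for _, p in pairs)
        actual = sum(t == label for t, _ in pairs)
        # Guard against division by zero for labels never predicted or never present.
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)

# Toy labels: 1 = sports, 2 = finance, 3 = entertainment, 4 = technology.
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 1, 2, 3, 3, 3, 4, 4]
accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, round(precision, 3), recall)  # 0.875 0.917 0.875
```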

Category based tweets from
 https://twitter.com/i/#!/who_to_follow/interests





Coding done in Python
Database – sqlite3
ML tool – LIBSVM
Stemming – Porter’s Stemming
NLP toolkit