TWITTER AND REDDIT
SENTIMENTS
Xingxin Ma
Math 899
CONTENTS
1. Project introduction
2. Data Entry
3. Data cleaning
4. Natural language feature extraction
5. Model selection
6. Conclusion
1. PROJECT INTRODUCTION
Many existing voice products (such as Alexa) and chatbots do not effectively
support sentiment detection and do not differentiate their response
strategies for different emotions.
This lowers customer expectations for smart products and chatbots. When
embedding a chatbot function in a product, sentiment analysis is an
important factor to consider in the reply logic.
In this project, I use the dataset named "Twitter and Reddit Sentiments"
from Kaggle. The data is labeled, so this is a supervised learning problem. I use
mathematical methods to build the models, practice turning the mathematics
into code, and predict the sentiment of the sentences.
2. DATA ENTRY
The data is from Kaggle. The download link is:
https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset
The data contains two files: Reddit_Data.csv and Twitter_Data.csv.
Reddit_Data.csv contains 37k comments along with their sentiment labels.
Twitter_Data.csv contains 163k tweets along with their sentiment labels.
For the sentiment labels:
0 indicates a neutral tweet/comment,
1 indicates a positive sentiment,
-1 indicates a negative tweet/comment.
I used 80% of the data as training data and 20% as testing data.
2. DATA ENTRY
Each file has the two columns shown below, and I also printed the shape of the two data sets.
After combining the Reddit and Twitter data, I have 200,229 rows in total.
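A minimal sketch of this loading, combining, and splitting step, assuming pandas and scikit-learn. The Reddit column names (clean_comment, category) appear later in the slides; the Twitter text column name and the random seed are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the two CSV files downloaded from Kaggle
reddit = pd.read_csv("Reddit_Data.csv")    # columns: clean_comment, category
twitter = pd.read_csv("Twitter_Data.csv")  # text column name assumed: clean_text
print(reddit.shape, twitter.shape)

# Rename the Twitter text column so both frames match, then combine them
twitter = twitter.rename(columns={"clean_text": "clean_comment"})
data = pd.concat([reddit, twitter], ignore_index=True)
print(data.shape)   # about 200k rows in total

# 80% of the data for training, 20% for testing
train, test = train_test_split(data, test_size=0.2, random_state=42)
```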
2. DATA ENTRY
Below is the histogram of different sentiments:
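The plot itself did not survive the export. A minimal sketch of how such a class-distribution histogram can be produced from the combined frame `data`, assuming the label column is named category:

```python
import matplotlib.pyplot as plt

# Count how many rows fall into each sentiment category (-1, 0, 1)
counts = data["category"].value_counts().sort_index()

# Bar chart of the class distribution
counts.plot(kind="bar")
plt.xlabel("sentiment label")
plt.ylabel("number of comments / tweets")
plt.title("Distribution of sentiment labels")
plt.show()
```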
3. DATA CLEANING
3.1 Drop NA data
Using isnull().sum() to check how many NA values there are in each column,
I found 104 missing values in clean_comment and 7 missing in category.
So I used dropna() to drop the 111 NA rows.
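A minimal sketch of this check-and-drop step on the combined frame `data`:

```python
# Check how many NA values there are in each column
print(data.isnull().sum())    # the slide reports clean_comment: 104, category: 7

# Drop the 111 rows that contain any missing value
data = data.dropna().reset_index(drop=True)
print(data.shape)
```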
3. DATA CLEANING
3.2 Remove noises
I removed punctuation, URLs, HTML tags, and non-ASCII characters.
Twitter and Reddit text contains some abbreviations, so I substituted
the abbreviations with the original words.
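A minimal sketch of this noise-removal step. The abbreviation map shown here is only an illustration; the slides do not list the actual substitutions used:

```python
import re
import string

# Illustrative abbreviation map (the real list is not shown on the slides)
ABBREVIATIONS = {"u": "you", "r": "are", "pls": "please"}

def remove_noise(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)                # URLs
    text = re.sub(r"<.*?>", " ", text)                                 # HTML tags
    text = text.encode("ascii", errors="ignore").decode()              # non-ASCII
    text = text.translate(str.maketrans("", "", string.punctuation))   # punctuation
    # Substitute abbreviations with the original words
    return " ".join(ABBREVIATIONS.get(w, w) for w in text.split())

data["clean_comment"] = data["clean_comment"].astype(str).apply(remove_noise)
```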
3. DATA CLEANING
3.3 Delete stop words
After the previous cleaning, I counted the words and made a histogram of
the 20 most frequently occurring words. I found that most of these words
carry little meaning, so I should remove common words such as "the",
"and", and "a", which do not contribute much to the sentiment of the text.
Here I used the stopwords list from nltk.corpus.
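A minimal sketch of this step with the NLTK stop word list named on the slide:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def remove_stop_words(text: str) -> str:
    # Keep only the words that are not in the NLTK English stop word list
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

data["clean_comment"] = data["clean_comment"].apply(remove_stop_words)
```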
4. NATURAL LANGUAGE FEATURE EXTRACTION
Step 1: Count how many words there are in total
[Flowchart: for each sentence, get the tokens, convert them all to lowercase, delete noises, drop stop words, lemmatize, and if a word is not yet in words_number_map, add it.]

words_number_map:
Word      Number
word 1    1
word 2    2
word 3    3
...       ...
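A minimal sketch of Step 1, assuming the cleaning from Section 3 has already been applied and that word numbers are assigned in order of first appearance:

```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

words_number_map = {}   # word -> number (1, 2, 3, ...)

for sentence in data["clean_comment"]:
    for token in word_tokenize(sentence):
        word = lemmatizer.lemmatize(token.lower())
        if word in stop_words:
            continue                      # drop stop words
        if word not in words_number_map:  # only add words not seen before
            words_number_map[word] = len(words_number_map) + 1

print("total number of distinct words:", len(words_number_map))
```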
Step 2: Make a large words_table
[Flowchart: initialize the words_table using the largest word number obtained in Step 1; then, for each sentence, get the tokens, convert them to lowercase, delete noises, drop stop words, lemmatize, and add each remaining word's count to the words_table.]
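A minimal sketch of Step 2 under the same assumptions. A dense matrix of this size can exhaust RAM (the conclusion mentions exactly this problem), so a sparse matrix is a practical alternative:

```python
import numpy as np

n_words = len(words_number_map)
n_sentences = len(data)

# One row per word, one column per sentence (the sentiment row is added in Step 3)
words_table = np.zeros((n_words, n_sentences), dtype=np.int32)

for j, sentence in enumerate(data["clean_comment"]):
    for token in word_tokenize(sentence):
        word = lemmatizer.lemmatize(token.lower())
        if word in stop_words or word not in words_number_map:
            continue
        i = words_number_map[word] - 1   # word number assigned in Step 1
        words_table[i, j] += 1           # add to this word's count for the sentence
```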
Step 3: Calculate the correlation coefficients
words_table:

             sentence 1   sentence 2   sentence 3   sentence 4   ...
Sentiments        1            0           -1            1       ...
Word 1            0            2            1            1       ...
Word 2            2            3            1            0       ...
Word 3            1            0            1            2       ...
...              ...          ...          ...          ...      ...
Every column comes from one sentence.
Calculate the correlation coefficient between each word row and the first row (the sentiments).
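A minimal sketch of Step 3 that keeps the sentiment labels as a separate vector instead of literally storing them as the first row of the table; the calculation is the same:

```python
# Sentiment label of each sentence (the "first row" of the table above)
sentiments = data["category"].to_numpy(dtype=float)

# Correlation coefficient between each word row and the sentiments row
correlations = np.array([
    np.corrcoef(words_table[i, :], sentiments)[0, 1]
    for i in range(words_table.shape[0])
])
```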
Step 4: Sort the correlation coefficients and select the top 500 words
I selected the top 500 words by sorting the absolute values of their correlation coefficients.
[Figure: histogram of the correlation coefficients.]
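A minimal sketch of the selection and the histogram; treating NaN correlations (words with constant counts) as zero is an added assumption:

```python
import matplotlib.pyplot as plt

# Sort by the absolute value of the correlation with the sentiment labels
order = np.argsort(-np.abs(np.nan_to_num(correlations)))
top_500 = order[:500]          # row indices of the 500 most correlated words

# Histogram of the correlation coefficients
plt.hist(correlations[~np.isnan(correlations)], bins=50)
plt.xlabel("correlation coefficient")
plt.ylabel("number of words")
plt.show()
```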
Step 5: Make a small words_table
[Flowchart: initialize a new words_table with 500 as one of its dimensions; then, for each sentence, get the tokens, convert them to lowercase, delete noises, drop stop words, lemmatize; drop words that are not in the top 500, and add the remaining words' counts to the words_table.]
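The flowchart rebuilds the table sentence by sentence; since Step 2 already produced the full table, an equivalent shortcut is to keep only the rows of the top 500 words. A sketch of that shortcut:

```python
# Keep only the rows of the 500 most correlated words
small_words_table = words_table[top_500, :]     # shape: (500, n_sentences)

# Feature matrix and labels for the models: one row per sentence, 500 columns
X = small_words_table.T
y = data["category"].to_numpy()
```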
5. MODEL SELECTION

Model                                Accuracy score
1. KNN                               K=1: 0.695, K=3: 0.699, K=5: 0.691, K=7: 0.688, K=9: 0.680
2. Multinomial Logistic Regression   0.878
3. Naïve Bayes                       0.734
4. Random forest                     0.849
5. Boosting                          0.688
6. XGBoosting                        0.817
7. LSTM (RNN)                        0.877
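A minimal sketch of how the best-scoring model in this table could be fit on the 500-word features. The hyperparameters are assumptions, and the 0.878 above is the score reported on the slide, not the output of this sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Multinomial logistic regression: with the default lbfgs solver, scikit-learn
# optimizes the multinomial objective for this three-class problem
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```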
6. CONCLUSION
From the test results, Multinomial Logistic Regression got the highest accuracy score: 0.878.
LSTM (RNN) has almost the same score as Multinomial Logistic Regression: 0.877.
Random forest (0.849) and XGBoosting (0.817) also got high accuracy scores.
For KNN, there is no big difference between the different values of k.
In this project I ran into a lot of crashes because the RAM was exhausted. The ways to solve this issue, sketched below, are:
a. delete large variables that are no longer needed as soon as possible
b. optimize the algorithms
c. divide and conquer
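A minimal sketch of points (a) and (c); the chunk size is an arbitrary illustrative value:

```python
import gc

# a. Delete large intermediate variables as soon as they are no longer needed
del words_table
gc.collect()

# c. Divide and conquer: process the sentences in smaller chunks
chunk_size = 10_000
for start in range(0, len(data), chunk_size):
    chunk = data.iloc[start:start + chunk_size]
    # ... process one chunk at a time ...
```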
THANK YOU!