TWITTER AND REDDIT SENTIMENTS
Xingxin Ma
Math 899

CONTENTS
1. Project introduction
2. Data Entry
3. Data cleaning
4. Natural language feature extraction
5. Model selection
6. Conclusion

1. PROJECT INTRODUCTION
Many existing voice products (such as Alexa) and chatbots do not effectively support sentiment judgement and do not differentiate their response strategies between different emotions. This lowers customer expectations for smart products and chatbots. When embedding a chatbot function in a product, sentiment analysis is an important factor to consider in the reply logic.
In this project, I use the "Twitter and Reddit Sentiments" data set from Kaggle. The data is labeled, so this is a supervised learning problem. I will use mathematical methods to build the models, practice turning that mathematics into code, and predict the sentiment of sentences.

2. DATA ENTRY
The data is from Kaggle. The download link is:
https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset
The data contains two files: Reddit_Data.csv and Twitter_Data.csv.
Reddit_Data.csv contains 37k comments with their sentiment labels; Twitter_Data.csv contains 163k tweets with their sentiment labels.
For the sentiment labels: 0 indicates a neutral tweet/comment, 1 indicates a positive sentiment, and -1 indicates a negative tweet/comment.
I used 80% of the data as training data and 20% as testing data.
Each file has two columns: the comment text and its sentiment category. I also printed the shape of the two data sets; after combining the Reddit and Twitter data, there are 200,229 rows in total.
[Histogram of the different sentiment labels]

3. DATA CLEANING
3.1 Drop NA data
Using isnull().sum(), I checked how many NA values there are in each column: 104 rows are missing clean_comment and 7 are missing category. I then used dropna() to drop these 111 rows.
3.2 Remove noise
I removed punctuation, URLs, HTML tags, and non-ASCII characters. Twitter and Reddit text also contains abbreviations, which I substituted with the original words.
3.3 Delete stop words
After the previous cleaning, I counted the words and made a histogram of the 20 most frequently occurring words. Most of them carry little meaning, so I removed common words such as "the", "and", and "a", which contribute little to the sentiment of the text. Here I used the stopwords list from nltk.corpus.

4. NATURAL LANGUAGE FEATURE EXTRACTION
Step 1: Count how many words there are in total.
For each sentence: get the tokens, convert them to lowercase, delete noise, drop stop words, and lemmatize. If a word is not yet in words_number_map, add it with the next number.

    words_number_map
    Word      Number
    Word 1    1
    Word 2    2
    Word 3    3
    ...       ...

Step 2: Make a large words_table.
Initialize the words_table using the largest word number obtained in Step 1 as one of its dimensions. For each sentence: get the tokens, convert them to lowercase, delete noise, drop stop words, and lemmatize. If a word is not yet in words_table, add it; otherwise increment its count.

Step 3: Calculate the correlation coefficients.

    words_table
    Sentiments   1   0  -1   1  ...
    Word 1       0   2   1   1  ...
    Word 2       2   3   1   0  ...
    Word 3       1   0   1   2  ...
    ...

Every column comes from one sentence. Calculate the correlation coefficient between each word row and the first (sentiments) row.
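Steps 1-3 can be summarized in code. The following is a minimal sketch using nltk and numpy; the names clean_tokens, build_words_table, and word_sentiment_correlations are illustrative, not the project's actual function names.

    import re
    import numpy as np
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # Requires: nltk.download("stopwords"); nltk.download("wordnet")
    STOP_WORDS = set(stopwords.words("english"))
    LEMMATIZER = WordNetLemmatizer()

    def clean_tokens(text):
        """Lowercase, strip URLs/HTML/non-ASCII/punctuation, drop stop words, lemmatize."""
        text = str(text).lower()
        text = re.sub(r"http\S+|www\.\S+", " ", text)          # URLs
        text = re.sub(r"<[^>]+>", " ", text)                    # HTML tags
        text = text.encode("ascii", errors="ignore").decode()   # non-ASCII characters
        text = re.sub(r"[^a-z\s]", " ", text)                   # punctuation and digits
        return [LEMMATIZER.lemmatize(t) for t in text.split() if t not in STOP_WORDS]

    def build_words_table(sentences):
        """Steps 1-2: number every distinct word, then fill a (words x sentences) count table."""
        word_ids = {}
        tokenized = []
        for sentence in sentences:
            tokens = clean_tokens(sentence)
            tokenized.append(tokens)
            for t in tokens:
                word_ids.setdefault(t, len(word_ids))
        table = np.zeros((len(word_ids), len(sentences)), dtype=np.int32)
        for col, tokens in enumerate(tokenized):
            for t in tokens:
                table[word_ids[t], col] += 1
        return word_ids, table

    def word_sentiment_correlations(table, labels):
        """Step 3: correlation between each word's count row and the sentiment labels."""
        labels = np.asarray(labels, dtype=float)
        return np.array([np.corrcoef(row, labels)[0, 1] for row in table])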
Step 4: Sort the correlation coefficients and select the top 500 words.
I selected the top 500 words by sorting the absolute values of their correlation coefficients.
[Histogram of the correlation coefficients]

Step 5: Make a small words_table.
Initialize the words_table using 500 as one of its dimensions. For each sentence: get the tokens, convert them to lowercase, delete noise, drop stop words, and lemmatize. Keep only words that are among the top 500; if such a word is not yet in words_table, add it, otherwise increment its count.

5. MODEL SELECTION

    Model                                 Accuracy score
    1  KNN                                K=1: 0.695, K=3: 0.699, K=5: 0.691, K=7: 0.688, K=9: 0.680
    2  Multinomial Logistic Regression    0.878
    3  Naïve Bayes                        0.734
    4  Random forest                      0.849
    5  Boosting                           0.688
    6  XGBoost                            0.817
    7  LSTM (RNN)                         0.877

6. CONCLUSION
From the test results, Multinomial Logistic Regression got the highest accuracy score, 0.878, and LSTM (RNN) scored almost the same, 0.877. Random forest also scored well at 0.849, followed by XGBoost at 0.817. For KNN, there is no big difference between the different k values.
In this project I ran into many crashes because RAM was exhausted. The ways I solved this issue were:
a. delete large variables promptly once they are no longer needed
b. optimize the algorithm
c. divide and conquer, i.e. process the data in smaller pieces (see the sketch below)

THANK YOU!
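For point (c) in the conclusion, the following is a minimal sketch of chunk-wise processing with pandas; it assumes the clean_comment and category column names from Section 3, and the chunk size is an arbitrary illustrative value.

    import pandas as pd

    def label_counts_in_chunks(csv_path, chunksize=20_000):
        """Read the CSV in chunks so the whole file never has to sit in RAM at once."""
        counts = {}
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            chunk = chunk.dropna(subset=["clean_comment", "category"])  # as in Section 3.1
            for label, n in chunk["category"].value_counts().items():
                counts[label] = counts.get(label, 0) + int(n)
            del chunk  # point (a): free the chunk before reading the next one
        return counts

    # Example call on the Reddit file from Section 2:
    # print(label_counts_in_chunks("Reddit_Data.csv"))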