
Lab 10 NLP

R Manual for Natural Language Processing (NLP) - MBM433
Nhat Quang Le
This manual aims to give students an overview of the analyses they have learned in the regular lectures.
Exploring the business problem
Gift cards are a billion-dollar industry. In fact, the global gift card market was estimated at about US$1
trillion in 2020 (e.g., GlobeNewswire 2020) and is still growing rapidly. This represents a huge opportunity
for companies to offer new gift card ideas and products. Nevertheless, we still have limited understanding of
customers’ preferences regarding gift cards. For example, what do customers expect from this product?
Are they satisfied with it? Which options are the most (or least) attractive to them? Should any new
features be offered? Imagine that you are working as a marketing analyst at an international gift card
company. You and your team have decided to answer the above questions using customer product reviews.
The given data set is explained below.
Exploring the data set
In this manual, you will work with a subset of the actual Amazon review data collected in 2018, which was made
publicly available by Ni, Li, and McAuley (Empirical Methods in Natural Language Processing (EMNLP)
2019). The original data set contains a total of 233.1 million online reviews published on Amazon.com
between May 1996 and Oct 2018 across 29 different product categories (e.g., Amazon Fashion, Appliances,
Automotive, Software, etc.). In this session, you will only work with one specific product category, Gift
Cards, which contains a total of 147,194 reviews for 1,548 different products.
To load the data to your working environment, you need to read the “Gift_Cards.json” file. As you might
know, the .json extension is used for JSON files that store data structures and objects in JavaScript Object
Notation (JSON) format. The JSON data format is commonly used to store data that is transmitted between
a web application and a server (e.g., when you collect data using APIs), especially in the text mining field.
An example of a simple JSON object can be found below:
# define a simple JSON object
json_example <- c('{"reviewerID": 12, "name": {"first": "Frank", "last": "Bauer"}}',
                  '{"reviewerID": 24, "name": {"first": "Joe", "last": "Doe"}}',
                  '{"reviewerID": 38, "name": {"first": "Helene", "last": "Fisher"}}')
json_example
As you can see, each element of the JSON object corresponds to all the information we have about one
record (here, one reviewer), including both the type of information (e.g., reviewerID) and its value (e.g., 12).
As such, we are not going to load the information from the JSON file as it is, but will convert it into a typical
data frame with columns representing the variables and rows representing the observations. One way to do
that is to use the stream_in() function from the jsonlite library. Follow the example code below, but remember
to adapt it using your own file path.
# read the .json file and convert it to a data frame
# REMEMBER: replace the path below with the path to the
# .json file on your own computer
# and use forward slashes (not backslashes)
library(jsonlite) # load the package - REMEMBER TO INSTALL IT FIRST
gift_reviews <- stream_in(file("C:/Users/Documents/Gift_Cards.json"))
As a backup, I have also uploaded the “Gift_Cards.RData” file, which contains the same data set, in case you
cannot read the .json file.
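If you go the .RData route, a minimal sketch would look as follows (this is just an illustration: adjust the path to your own computer, and I assume the file stores the same gift_reviews data frame).
# load the backup .RData file - REMEMBER to adapt the path
# (assumed to contain the same gift_reviews data frame)
load("C:/Users/Documents/Gift_Cards.RData")
# list the objects now available in your environment
ls()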
The data set consists of the following variables:
• overall: product rating (from 1 to 5)
• vote: the number of votes for the helpfulness of a review
• verified: with (TRUE) or without (FALSE) a verified purchase
• reviewTime: time of the review
• reviewerID: ID of the reviewer
• asin: ID of the product
• style: containing three characteristics of the gift card: amount, format, and size
• reviewerName: name of the reviewer
• reviewText: text of the review
• summary: summary of the review
• unixReviewTime: time of the review (in unix format; see the small conversion sketch after this list)
• image: link of the product image (if any)
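As a small, optional sketch (not part of the original workflow), the unixReviewTime column can be converted into a regular date; the reviewDate column name below is only illustrative.
# convert the unix time (seconds since 1970-01-01) into a readable date
# the new column name reviewDate is only an example
gift_reviews$reviewDate <- as.POSIXct(gift_reviews$unixReviewTime,
                                      origin = "1970-01-01", tz = "UTC")
# read the first converted dates
head(gift_reviews$reviewDate)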
Text analytics
To speed up the modeling process, we will only work with a subset of this data set in this tutorial.
# only work with the first 1000 reviews
gift_reviews <- gift_reviews[1:1000, ]
1) DATA EXPLORATION
We should always start by exploring the original data. For example, how many observations do we have?
What data type does each column have? Are there any columns with missing values? Any data anomalies?
# read the first rows in the data
head(gift_reviews)
##   overall vote verified  reviewTime     reviewerID       asin
## 1       1   25    FALSE 12 19, 2008  APV13CM0919JD B001GXRQW0
## 2       5 <NA>    FALSE 12 17, 2008 A3G8U1G1V082SN B001GXRQW0
## 3       5    4    FALSE 12 17, 2008  A11T2Q0EVTUWP B001GXRQW0
## 4       5 <NA>    FALSE 12 17, 2008  A9YKGBH3SV22C B001GXRQW0
## 5       1 <NA>     TRUE 12 17, 2008 A34WZIHVF3OKOL B001GXRQW0
## 6       3  146    FALSE 12 16, 2008 A221J8EC5HNPY6 B001GXRQW0
##   style.Gift Amount: style.Format: style.Size: reviewerName
## 1                 50          <NA>        <NA>          LEH
## 2                 50          <NA>        <NA>         Tali
## 3                 50          <NA>        <NA>            Z
## 4                 25          <NA>        <NA>   Giotravels
## 5               <NA>          <NA>        <NA>     King Dad
## 6                 25          <NA>        <NA>   D. Daniels
##                                                                                            reviewText
## 1                                                                                                 ...
## 2                                                                                                 ...
## 3 aren’t we going to save trees?! :) People who were complaining about paper gift cards can simply bu
## 4                                                                                                 ...
## 5                                                                                                 ...
## 6                                                                                                 ...
##                                               summary unixReviewTime image
## 1                                    Merry Christmas.     1229644800  NULL
## 2                       Gift card with best selection     1229472000  NULL
## 3 A convenient and great gift for the environment :-)     1229472000  NULL
## 4                                  Totally make sense     1229472000  NULL
## 5                                          Give CASH!     1229472000  NULL
## 6                    Great Gift but only if it works!     1229385600  NULL
Let’s create a word frequency chart with the raw data to learn more about it.
# create a frequency bar chart before cleaning the data
#------------------------# Step 1: combine all text
# collect all the review text
words_combined <- paste(gift_reviews$reviewText)
# split the sentences into words
words_combined <- unlist(strsplit(words_combined, split = " "))
# remove the empty character
words_combined <- words_combined[words_combined != ""]
#------------------------# Step 2: count the times each word appears
# create a frequency table
freqTab <- as.data.frame(table(words_combined))
# sort the table
freqTab <- freqTab[order(freqTab$Freq, decreasing = T), ]
#------------------------# Step 3: plot a bar chart
# select the top 20 most frequent words
topFreqTab <- freqTab[1:20, ]
# create a bar chart
barplot(height = topFreqTab$Freq, # the height of the bars
        horiz = F, # FALSE to draw the bars vertically
        col = "darkgrey", # color of the bars
        names.arg = topFreqTab$words_combined, # bar labels
        main = "Word Frequency Bar Chart") # title of the plot
[Figure: Word Frequency Bar Chart (raw text). The y-axis runs from 0 to about 1,400; the tallest bars belong to words such as “the”, “I”, “a”, “gift”, “it”, “card”, “my”, “that”, “this”, “with”, and “on”.]
As you can see, the most frequent words are “the”, “I”, “a”, etc., which are not very meaningful for
understanding customer perceptions of the gift cards. We should therefore proceed with data cleaning
and preprocessing.
2) DATA CLEANING AND PREPROCESSING
While cleaning and pre-processing the data are important steps before analyzing the text, remember that certain
steps might be irrelevant or should be skipped depending on the type of analysis as well as the research
questions of interest. For example, stemming might make it difficult to study the writing styles of the reviewers.
Below I therefore show you the standard procedure for data cleaning and pre-processing, but not all steps will
be relevant in your future applications of text analytics.
The basic steps include: 1) changing all text to lowercase, 2) removing stop words, 3) removing punctuation
and special characters, 4) tokenization, and 5) stemming and lemmatization.
Step 1: changing all review text to lowercase
# convert all text to lowercase
# and save it in a new column
gift_reviews$reviewText <- tolower(gift_reviews$reviewText)
# read the first two reviews
head(gift_reviews$reviewText, 2)
## [1] "amazon,\ni am shopping for amazon.com gift cards for christmas gifts and am really so disappoint
## [2] "i got this gift card from a friend, and it was the best! the site has so much to choose from...
Step 2: removing stop words
Stop words are words that are not informative and important for understanding the text. You can access
the common English stopwords using the stopwords() function from the tm library.
# load the package --- REMEMBER TO INSTALL IT FIRST
library(tm)
# list of common English stop words
stopwords()
##   [1] "i"          "me"         "my"         "myself"     "we"
##   [6] "our"        "ours"       "ourselves"  "you"        "your"
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"
##  [16] "his"        "himself"    "she"        "her"        "hers"
##  [21] "herself"    "it"         "its"        "itself"     "they"
##  [26] "them"       "their"      "theirs"     "themselves" "what"
## [ ... output shortened: the full list of common English stop words continues,
##   including "the", "and", "but", "not", "very", and negated contractions
##   such as "isn't", "don't", and "won't" ... ]
In addition, it is common to adapt this list of stop words to the specific context. In our case, it
might be a good idea to add some more words that are commonly present when writing about gift cards but
are of little value, such as “amazon”, “gift”, “card”, etc. We can also remove some other general
words that often offer little insight: “really”, “also”, etc.
# create your own list of stop words
library(tm) # load the package --- REMEMBER TO INSTALL IT FIRST
my_stop_words <- c(stopwords(), "amazon", "gift", "card", "really", "also")
# remove stop words
gift_reviews$reviewText <- removeWords(gift_reviews$reviewText, my_stop_words)
# read the first two reviews
head(gift_reviews$reviewText, 2)
## [1] ",\n  shopping  .com   cards  christmas gifts    disappointed   five choices  one  says \"merry
## [2] " got    friend,     best!  site   much  choose ... great ."
Step 3: removing punctuation and special characters
First we remove all punctuation marks such as period, question mark, exclamation point, comma, semicolon,
colon, etc. We will use the removePunctuation() function from the tm package.
# remove punctuation
library(tm)
gift_reviews$reviewText <- removePunctuation(gift_reviews$reviewText)
# read the first two reviews
head(gift_reviews$reviewText, 2)
## [1] "\n  shopping  com   cards  christmas gifts    disappointed   five choices  one  says merry chr
## [2] " got    friend     best site   much  choose  great "
Next, we need to remove all unwanted special characters, symbols, extra spaces/tabs, and numbers.
# remove special characters: ~!@#$%^&*(){}_+:"<>?,./;'[]-=
gift_reviews$reviewText <- gsub("[^[:alnum:]]", " ", gift_reviews$reviewText)
# remove unwanted characters: â í ü Â á ą ę ś ć
gift_reviews$reviewText <- gsub("[^a-zA-Z#]", " ", gift_reviews$reviewText)
# remove tabs, extra spaces
library(tm)
gift_reviews$reviewText <- stripWhitespace(gift_reviews$reviewText)
# remove numbers
gift_reviews$reviewText <- removeNumbers(gift_reviews$reviewText)
# read the first two reviews
head(gift_reviews$reviewText, 2)
## [1] " shopping com cards christmas gifts disappointed five choices one says merry christmas mentions
## [2] " got friend best site much choose great "
Let’s create a word frequency chart before tokenization and stemming/lemmatization to see how the top
20 words have changed.
# create a frequency bar chart before tokenization
#------------------------# Step 1: combine all text
# collect all the review text
words_combined <- paste(gift_reviews$reviewText)
# split the sentences into words
words_combined <- unlist(strsplit(words_combined, split = " "))
# remove the empty character
words_combined <- words_combined[words_combined != ""]
#------------------------# Step 2: count the times each word appears
# create a frequency table
freqTab <- as.data.frame(table(words_combined))
# sort the table
freqTab <- freqTab[order(freqTab$Freq, decreasing = T), ]
#------------------------# Step 3: plot a bar chart
# select the top 20 most frequent words
topFreqTab <- freqTab[1:20, ]
# create a bar chart
barplot(height = topFreqTab$Freq, # the height of the bars
        horiz = F, # FALSE to draw the bars vertically
        col = "darkgrey", # color of the bars
        names.arg = topFreqTab$words_combined, # bar labels
        main = "Word Frequency Bar Chart") # title of the plot
[Figure: Word Frequency Bar Chart (after cleaning). The y-axis runs from 0 to about 300; frequent words include “cards”, “easy”, “use”, “get”, “time”, “like”, “give”, “got”, and “love”.]
Step 4: Tokenization
Let’s split up each online product review into smaller and more manageable sections or tokens. In other
words, we will break each online review into separate words. To do that, we use the tokens() function from
the quanteda library. Note that tokens() function is also available in another library called koRpus which
might be called by another library that we use in this tutorial. As such, we need to add “quanteda::” in
front of the function to make sure that R will use the function tokens() from the quanteda library.
# tokenize the product reviews
library(quanteda)
toks_review <- quanteda::tokens(gift_reviews$reviewText, what = c("word"))
# print the first 3 documents
head(toks_review, 3)
## Tokens consisting of 3 documents.
## text1 :
##  [1] "shopping"     "com"          "cards"        "christmas"    "gifts"
##  [6] "disappointed" "five"         "choices"      "one"          "says"
## [11] "merry"        "christmas"
## [ ... and 18 more ]
##
## text2 :
## [1] "got"    "friend" "best"   "site"   "much"   "choose" "great"
##
## text3 :
##  [1] "going"       "save"        "trees"       "people"      "complaining"
##  [6] "paper"       "cards"       "can"         "simply"      "buy"
## [11] "electronic"  "via"
## [ ... and 43 more ]
The results of the tokenization process are stored in a “tokens” object (i.e., toks_review), which is essentially
a list. As such, we can access the tokens of the first product review in the same way we would
access the first element of a list.
# access the first element of toks_review
toks_review[1]
## Tokens consisting of 1 document.
## text1 :
##  [1] "shopping"     "com"          "cards"        "christmas"    "gifts"
##  [6] "disappointed" "five"         "choices"      "one"          "says"
## [11] "merry"        "christmas"
## [ ... and 18 more ]
# access only the tokens of the first element of toks_review
toks_review[[1]]
##  [1] "shopping"     "com"          "cards"        "christmas"    "gifts"
##  [6] "disappointed" "five"         "choices"      "one"          "says"
## [11] "merry"        "christmas"    "mentions"     "christmas"    "sure"
## [16] "alone"        "wanting"      "reflects"     "actual"       "holiday"
## [21] "celebrating"  "principle"    "send"         "christmas"    "political"
## [26] "correctness"  "bad"          "marketing"    "decision"     "lynn"
Step 5: Stemming and lemmatization
As mentioned in the regular lecture, stemming is used to reduce words to a simple or root form by removing
their prefixes or suffixes. Lemmatization, on the other hand, is used to reduce words to their lemma form so
that different inflected forms of a word can be analyzed as a single term. One way to perform stemming and
lemmatization is to use the stem_words() and lemmatize_words() functions in the textstem library. See
examples below.
# load the package --- REMEMBER TO INSTALL IT FIRST
library(textstem)
# create a vector containing some example words
example_vec <- c("buy", "bought", "buying", "buyer")
# stemming
stem_words(example_vec)
## [1] "bui"
"bought" "bui"
"buyer"
# lemmatization
lemmatize_words(example_vec)
## [1] "buy"
"buy"
"buy"
"buyer"
As we need to apply stemming (or lemmatization) to the tokenized words in each review, we will use the
lapply() function to apply the stem_words() or lemmatize_words() function to each tokenized review.
# load the package --- REMEMBER TO INSTALL IT FIRST
library(textstem)
# stemming of review text
reviews_stem <- lapply(X = toks_review, # the targeted list
FUN = stem_words) # the function
# print the first stemmed words
head(reviews_stem)
## $text1
##  [1] "shop"       "com"        "card"       "christma"   "gift"
##  [6] "disappoint" "five"       "choic"      "on"         "sai"
## [11] "merri"      "christma"   "mention"    "christma"   "sure"
## [16] "alon"       "want"       "reflect"    "actual"     "holidai"
## [21] "celebr"     "principl"   "send"       "christma"   "polit"
## [26] "correct"    "bad"        "market"     "decis"      "lynn"
##
## $text2
## [1] "got"    "friend" "best"   "site"   "much"   "choos"  "great"
##
## [ ... output shortened: the stemmed tokens of $text3 to $text6 are omitted here ... ]
Lemmatization is much slower than stemming (as lemmatization also considers the context), so it may take
a while to get the results.
# load the package --- REMEMBER TO INSTALL IT FIRST
library(textstem)
# lemmatization of review text
reviews_lem <- lapply(X = toks_review, # the targeted list
                      FUN = lemmatize_words) # the function
# print the first lemmatized words
head(reviews_lem)
## $text1
##  [1] "shop"        "com"         "card"        "christmas"   "gift"
##  [6] "disappoint"  "five"        "choice"      "one"         "say"
## [11] "merry"       "christmas"   "mention"     "christmas"   "sure"
## [16] "alone"       "want"        "reflect"     "actual"      "holiday"
## [21] "celebrate"   "principle"   "send"        "christmas"   "political"
## [26] "correctness" "bad"         "market"      "decision"    "lynn"
##
## $text2
## [1] "get"    "friend" "good"   "site"   "much"   "choose" "great"
##
## [ ... output shortened: the lemmatized tokens of $text3 to $text6 are omitted here ... ]
Let’s combine all these stemmed words into one text again (i.e., de-tokenization) and put them back into the
original data set. In other words, we need to combine the separate tokens into one string (e.g., combine “hello”
and “world” into “hello world”). To do this, we will use the paste() function and set the collapse
argument to " ", meaning that all the separate words will be combined into one long string with a blank
space in between.
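As a quick illustration of the collapse argument (a toy example, not review data):
# combine two separate tokens into one string
paste(c("hello", "world"), collapse = " ") # returns "hello world"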
# combine separate tokens into one string
# store the results in a new column in the data set
gift_reviews$text_clean <- sapply(reviews_stem,
FUN = paste,
collapse = " ")
Now let’s create a frequency bar chart based on the stemmed words to see how the results have changed.
# create a frequency bar chart on stemmed reviews
#------------------------# Step 1: combine all text
# collect all the stemmed review text
words_combined <- paste(gift_reviews$text_clean)
# split the sentences into words
words_combined <- unlist(strsplit(words_combined, split = " "))
# remove the empty character
words_combined <- words_combined[words_combined != ""]
#------------------------# Step 2: count the times each word appears
# create a frequency table
freqTab <- as.data.frame(table(words_combined))
# sort the table
freqTab <- freqTab[order(freqTab$Freq, decreasing = T), ]
#------------------------# Step 3: plot a bar chart
# select the top 20 most frequent words
topFreqTab <- freqTab[1:20, ]
# create a bar chart
barplot(height = topFreqTab$Freq, # the height of the bars
        horiz = F, # FALSE to draw the bars vertically
        col = "darkgrey", # color of the bars
        names.arg = topFreqTab$words_combined, # bar labels
        main = "Word Frequency Bar Chart") # title of the plot
[Figure: Word Frequency Bar Chart (after stemming). The y-axis runs from 0 to about 300; top stems include “card”, “bui”, “can”, “order”, “on”, “give”, “will”, “receiv”, “just”, and “dai”.]
Let’s compare this bar chart with the one we created before data cleaning and preprocessing. Do you see any
differences? Do any new words appear in the list? And do you think that stemming has helped us represent
the data more accurately?
3) SENTIMENT ANALYSIS
In this part, we will perform some basic sentiment analysis to see whether different reviews contain different opinions
(i.e., the polarity of the text), using two values: positive and negative. To assess the sentiment, we
will use the Lexicoder Sentiment Dictionary (2015), which has been shared by Young, L. & Soroka, S.
(2012) (Lexicoder Sentiment Dictionary, available at http://www.snsoroka.com/data-lexicoder/). You can
access this dictionary by loading the quanteda library. As this dictionary has 4 different keys: positive,
negative, neg_positive (positive words preceded by a negation), and neg_negative (negative words preceded
by a negation), we need to subset the dictionary to select only the positive and negative entries as follows:
data_dictionary_LSD2015[1:2].
# load the package --- REMEMBER TO INSTALL IT FIRST
library(quanteda)
# print the first words of the dictionary LSD2015
head(data_dictionary_LSD2015[1:2])
## Dictionary object with 2 key entries.
## - [negative]:
##   - a lie, abandon*, abas*, abattoir*, abdicat*, aberra*, abhor*, abject*, abnormal*, abolish*, abomi
## - [positive]:
##   - ability*, abound*, absolv*, absorbent*, absorption*, abundanc*, abundant*, acced*, accentuat*, ac
There are many ways to compute the sentiment score of a text. In this example, we will compute the
sentiment score in two steps:
Step 1: Tokenize each text and use the tokens_lookup() function from the quanteda library to
convert each token into either positive or negative sentiment using a specific dictionary (the LSD
2015 in our example).
Step 2: Compute the overall sentiment score of the text, assuming that:
a) each positive word counts as +1 point, while each negative word counts as -1 point;
b) the overall sentiment score is computed by adding all the points and then dividing by the total number
of words.
As such, the values range from -1 (all words are negative) to 1 (all words are positive). It is 0 when we
have equal numbers of positive and negative words. For example, if a tokenized text contains 4 words of
which three are positive and one is negative, the total score is 3 - 1 = 2. The sentiment score is therefore
2/4 = 0.5, which indicates a positive sentiment.
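The worked example above can be checked with a short sketch (the vector below is made up purely for illustration):
# a toy tokenized text with three positive words and one negative word
x <- c("positive", "positive", "positive", "negative")
# overall sentiment score: (3 - 1) / 4 = 0.5
(length(x[x == "positive"]) - length(x[x == "negative"])) / length(x)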
# convert lemmatized tokens to positive/negative
# we use as.tokens() function to convert reviews_lem to tokens form
text_sent <- tokens_lookup(as.tokens(reviews_lem), dictionary = data_dictionary_LSD2015[1:2])
# print the first rows of text_sent
head(text_sent)
## Tokens consisting of 6 documents.
## text1 :
## [1] "positive" "negative" "positive" "positive" "positive" "positive" "positive"
## [8] "positive" "negative"
##
## text2 :
## [1] "positive" "positive" "positive"
##
## text3 :
##  [1] "negative" "positive" "positive" "negative" "negative" "negative"
##  [7] "positive" "positive" "positive" "positive" "positive" "positive"
##
## text4 :
## [1] "positive" "positive" "negative" "positive" "negative"
##
## text5 :
## [1] "positive" "negative" "positive" "negative" "negative"
##
## text6 :
## [1] "negative" "negative" "positive" "positive" "negative" "negative" "positive"
## [8] "negative" "positive"
# compute the sentiment score
# and store the results in a new column of the data set
gift_reviews$sent_score <- sapply(X = text_sent,
                                  FUN = function(x) (length(x[x == "positive"]) -
                                                       length(x[x == "negative"])) / length(x))
Note that the tokens_lookup() function only works with tokens objects, so we need the as.tokens() function
to convert the lemmatized reviews reviews_lem to tokens form. In addition, to compute the sentiment score,
we use the sapply() function to apply a function to each element of the given list (i.e., each tokenized review
in the list text_sent). In this case, the applied function is an anonymous function defined in the second
argument of sapply(). As you can see, the code length(x[x == "positive"]) returns the
number of positive words in a given object x. The object x here refers to one element of the
list text_sent; when the anonymous function is applied to each element of text_sent, the value of x
changes accordingly.
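To see how sapply() walks through a list with such an anonymous function, here is a tiny standalone demo on made-up data:
# a small toy list with two elements
toy_list <- list(a = c("positive", "negative", "positive"),
                 b = c("negative", "negative"))
# apply the same anonymous function to each element
sapply(X = toy_list,
       FUN = function(x) (length(x[x == "positive"]) - length(x[x == "negative"])) / length(x))
# returns a named vector: a = 0.33, b = -1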
Now let’s check the gift_reviews data set. Each review has been assigned a sentiment score. How
can we interpret these results?
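One possible way to start the interpretation (a sketch, assuming the sent_score and overall columns created above) is to summarize the scores and relate them to the star ratings:
# distribution of the sentiment scores
summary(gift_reviews$sent_score)
# average sentiment score per star rating
# (reviews with no dictionary words give NaN and are dropped by aggregate)
aggregate(sent_score ~ overall, data = gift_reviews, FUN = mean)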
4) TOPIC MODELING: LDA
Topic modeling is a process of discovering potential hidden patterns in the data by identifying words that
occur together and then grouping them into topics. To do that, we will first need to create a term-document
matrix and then apply a topic model. In this example, we will use the Latent Dirichlet Allocation (LDA)
model.
Step 1: Create a Document Term Matrix (DTM)
Here we need to convert the list of our tokenized reviews into a DTM (which quanteda calls a document-feature matrix, or DFM).
# load the package --- REMEMBER TO INSTALL IT FIRST
library(quanteda)
# convert the list of tokenized text into a DFM
review_dfm <- dfm(as.tokens(reviews_stem))
# print the first rows and columns of the dfm
head(review_dfm)
## Document-feature matrix of: 6 documents, 2,514 features (98.97% sparse) and 0 docvars.
##        features
## docs    shop com card christma gift disappoint five choic on sai
##   text1    1   1    1        4    1          1    1     1  1   1
##   text2    0   0    0        0    0          0    0     0  0   0
##   text3    0   0    3        0    0          0    0     0  0   0
##   text4    0   0    0        0    0          0    0     0  0   0
##   text5    0   0    1        0    0          0    0     0  0   0
##   text6    0   0    1        0    0          0    0     0  0   0
## [ reached max_nfeat ... 2,504 more features ]
It is often a good idea to trim our DFM so that we only keep terms that occur in more than 5%
and less than 95% of the documents. Other thresholds (e.g., more than 7.5% and less than 90%) can also be chosen.
# load the package --- REMEMBER TO INSTALL IT FIRST
library(quanteda)
# trim the dfm
review_dfm_trimmed <- dfm_trim(x = review_dfm, # dfm object
min_docfreq = 0.05, # min frequency
max_docfreq = 0.95, # max frequency
docfreq_type = "prop") # type of min/max_docfreq
Step 2: Run the LDA model
We use the LDA() function from the topicmodels library. Note that we will need to choose an arbitrary
number of topics and then apply the model to the DTM object above. In addition, LDA model requires
that the matrix cannot contain any document with no terms in it. Therefore, we first need to remove all
documents with no terms.
# compute the number of words in each document
num_words <- apply(X = review_dfm_trimmed, # the matrix
                   MARGIN = 1, # apply the function to all rows
                   FUN = sum) # the function to be applied
# remove all documents with no term
review_dfm_trimmed_new <- review_dfm_trimmed[num_words > 0, ]
# load the package --- REMEMBER TO INSTALL IT FIRST
library(topicmodels)
k <- 3 # number of topics
seed <- 123 # for reproducibility purposes
# run the LDA model
lda_results <- LDA(review_dfm_trimmed_new, # a DTM object
                   k = k, # number of topics
                   control = list(seed = seed)) # use the seed for reproducibility
To obtain an overview of the results, we use the terms() function, which requires two arguments: the first
is the result of the LDA model, and the second is the maximum number of terms to return
for each topic.
# print the top 7 words for each of the k topics
terms(lda_results, 7)
##      Topic 1   Topic 2   Topic 3
## [1,] "get"     "us"      "card"
## [2,] "want"    "purchas" "order"
## [3,] "purchas" "time"    "like"
## [4,] "can"     "card"    "bui"
## [5,] "bui"     "great"   "receiv"
## [6,] "on"      "will"    "great"
## [7,] "give"    "give"    "love"
Insights revealed by the analysis
In this tutorial, we have used text analytics to learn about customers’ perceptions of the gift card products.
Based on the analysis you have done above, are you able to identify the topics that were present in the
text? Can we use those insights to make a better marketing decision? In addition, are customers positive
or negative with the products? If the text is positive, why is that? And if the text is negative, what are the
common reasons for that?
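To connect the topics back to the individual reviews, a small follow-up sketch (using the topics() helper from the topicmodels library; recall that documents with no remaining terms were dropped before fitting the model):
# the most likely topic for each (non-empty) document
review_topics <- topics(lda_results)
head(review_topics)
# how many reviews fall into each topic
table(review_topics)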