R Manual for Natural Language Processing (NLP) - MBM433

Nhat Quang Le

This manual aims to give students an overview of the analyses that they have learned in the regular lectures.

Exploring the business problem

Gift cards are a billion-dollar industry. In fact, the global gift card market was estimated at about US$ 1 trillion in 2020 (e.g., GlobeNewswire 2020) and is still growing rapidly. This represents a huge opportunity for companies to offer new gift card ideas/products. Nevertheless, we still have a limited understanding of customers' preferences regarding gift cards. For example, what do customers expect from this product? Are they satisfied with it? Which options are the most (least) attractive to them? Should any new features be offered?

Imagine that you are working as a marketing analyst in an international gift card company. You and your team have decided to answer the above questions using customer product reviews. The data set is explained below.

Exploring the data set

In this manual, you will work with a subset of the actual Amazon review data collected in 2018, which was made publicly available by Ni, Li, and McAuley (Empirical Methods in Natural Language Processing (EMNLP) 2019). The original data set contains a total of 233.1 million online reviews published on Amazon.com between May 1996 and Oct 2018 for 29 different product categories (e.g., Amazon Fashion, Appliances, Automotive, Software, etc.). In this session, you will only work with one specific product category, Gift Cards, which contains a total of 147,194 reviews for 1,548 different products.

To load the data into your working environment, you need to read the "Gift_Cards.json" file. As you might know, the .json extension is used for JSON files that store data structures and objects in JavaScript Object Notation (JSON) format. The JSON format is commonly used to store data that is transmitted between a web application and a server (e.g., when you collect data using APIs), and it appears frequently in the text mining field. An example of a simple JSON object can be found below:

# define a simple JSON object
json_example <- c('{"reviewerID": 12, "name": {"first": "Frank", "last": "Bauer"}}',
                  '{"reviewerID": 24, "name": {"first": "Joe", "last": "Doe"}}',
                  '{"reviewerID": 38, "name": {"first": "Helene", "last": "Fisher"}}')

json_example

As you can see, each row in the JSON object corresponds to all the information we have about one customer product review, including both the type of information (e.g., reviewerID) and its value (e.g., 12). As such, we are not going to load the information from the JSON file as it is, but will convert it into a typical data frame with columns representing the variables and rows representing the values. One way to do that is to use the stream_in() function from the jsonlite library. Follow the example code below, but remember to adapt it using your own file path.

# read the .json file and convert it to a data frame
# REMEMBER: you need to replace the path below
# with the path to the .json file on your own computer
# and use forward slashes (not backslashes)
library(jsonlite) # load the package - REMEMBER TO INSTALL IT FIRST
gift_reviews <- stream_in(file("C:/Users/Documents/Gift_Cards.json"))

As a backup, I also uploaded the "Gift_Cards.RData" file that contains the same data set, just in case you cannot read the .json file.
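To see what this conversion looks like on a small scale, you can stream the toy json_example defined above straight from memory. The code below is only a minimal sketch (the textConnection() and flatten() calls are one possible way to do it, not part of the assignment data):

# a minimal sketch: convert the toy JSON object above into a data frame
library(jsonlite)
json_df <- stream_in(textConnection(json_example)) # read the three JSON rows
flatten(json_df)                                   # unnest "name" into name.first / name.last columns

If you work with the "Gift_Cards.RData" backup instead, load() restores the data set into your workspace (assuming the object stored in the file is called gift_reviews; adjust the name if needed):

# alternative: load the .RData backup
# (the object name inside the file is assumed to be gift_reviews)
load("C:/Users/Documents/Gift_Cards.RData")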
The data set consists of the following variables:

• overall: product rating (from 1 to 5)
• vote: the number of votes for the helpfulness of a review
• verified: with (TRUE) or without (FALSE) a verified purchase
• reviewTime: time of the review
• reviewerID: ID of the reviewer
• asin: ID of the product
• style: containing three characteristics of the gift card: amount, format, and size
• reviewerName: name of the reviewer
• reviewText: text of the review
• summary: summary of the review
• unixReviewTime: time of the review (in unix format)
• image: link to the product image (if any)

Text analytics

To speed up the modeling process, we only work with a subset of this data set in this tutorial.

# only work with the first 1000 reviews
gift_reviews <- gift_reviews[1:1000, ]

1) DATA EXPLORATION

We should always start by exploring the original data. For example, how many observations do we have? What data type does each column have? Are there any columns with missing values? Any data anomalies? (A minimal sketch of these checks is shown at the end of this subsection.)

# read the first rows in the data
head(gift_reviews)

##   overall vote verified  reviewTime     reviewerID       asin
## 1       1   25    FALSE 12 19, 2008  APV13CM0919JD B001GXRQW0
## 2       5 <NA>    FALSE 12 17, 2008 A3G8U1G1V082SN B001GXRQW0
## 3       5    4    FALSE 12 17, 2008  A11T2Q0EVTUWP B001GXRQW0
## 4       5 <NA>    FALSE 12 17, 2008  A9YKGBH3SV22C B001GXRQW0
## 5       1 <NA>     TRUE 12 17, 2008 A34WZIHVF3OKOL B001GXRQW0
## 6       3  146    FALSE 12 16, 2008 A221J8EC5HNPY6 B001GXRQW0
##   style.Gift Amount: style.Format: style.Size: reviewerName
## 1                 50          <NA>        <NA>          LEH
## 2                 50          <NA>        <NA>         Tali
## 3                 50          <NA>        <NA>            Z
## 4                 25          <NA>        <NA>   Giotravels
## 5               <NA>          <NA>        <NA>     King Dad
## 6                 25          <NA>        <NA>   D. Daniels

[The reviewText column, which contains the full review texts, is too wide to reproduce here and is truncated in the console output.]

##                                                 summary unixReviewTime image
## 1                                      Merry Christmas.     1229644800  NULL
## 2                         Gift card with best selection     1229472000  NULL
## 3   A convenient and great gift for the environment :-)     1229472000  NULL
## 4                                    Totally make sense     1229472000  NULL
## 5                                            Give CASH!     1229472000  NULL
## 6                       Great Gift but only if it works!     1229385600  NULL

Let's create a word frequency chart with the raw data to learn more about it.

# create a frequency bar chart before cleaning the data

#-------------------------
# Step 1: combine all text
#-------------------------
# join all text together
words_combined <- paste(gift_reviews$reviewText, collapse = " ")
# split the sentences into words
words_combined <- unlist(strsplit(words_combined, split = " "))
# remove the empty character
words_combined <- words_combined[words_combined != ""]

#------------------------------------------
# Step 2: Count the times each word appears
#------------------------------------------
# create a frequency table
freqTab <- as.data.frame(table(words_combined))
# sort the table
freqTab <- freqTab[order(freqTab$Freq, decreasing = T), ]

#-------------------------
# Step 3: Plot a bar chart
#-------------------------
# select the top 20 most frequent words
topFreqTab <- freqTab[1:20, ]
# create a bar chart
barplot(height = topFreqTab$Freq,              # the height of the bars
        horiz = F,                             # FALSE to draw the bars vertically
        col = "darkgrey",                      # color of the bars
        names.arg = topFreqTab$words_combined, # bar labels
        main = "Word Frequency Bar Chart")     # title of the plot

[Figure: Word Frequency Bar Chart (before cleaning). The tallest bars are function words such as "the", "I", and "a", together with "gift", "it", "card", "my", "that", "this", "with", and "on".]

As you can see, the most frequent words are "to", "the", "a", etc., which are not very meaningful for understanding customer perceptions of the gift cards. We should therefore proceed with data cleaning and preprocessing.
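As promised above, the quick checks below answer the basic exploration questions (number of observations, data types, missing values, rating distribution). This is only a minimal sketch using base R:

# quick checks on the raw data (a minimal sketch)
dim(gift_reviews)                               # number of reviews and number of variables
str(gift_reviews)                               # data type of each column
sapply(gift_reviews, function(x) sum(is.na(x))) # number of missing values per column
summary(gift_reviews$overall)                   # distribution of the star ratings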
2) DATA CLEANING AND PREPROCESSING

While cleaning and pre-processing the data are important steps before analyzing the text, remember that certain steps might be relevant or should be skipped depending on the type of analysis as well as the research questions of interest. For example, stemming might make it difficult to study the writing styles of the writers. Thus, below I show you the standard procedure for data cleaning and pre-processing, but not all steps will be relevant in your future applications of text analytics. The basic steps include: 1) changing all text to lowercase, 2) removing stop words, 3) removing punctuation and special characters, 4) tokenization, and 5) stemming and lemmatization.

Step 1: changing all review text to lowercase

# convert all text to lowercase
# and overwrite the reviewText column
gift_reviews$reviewText <- tolower(gift_reviews$reviewText)

# read the first two reviews
head(gift_reviews$reviewText, 2)

## [1] "amazon,\ni am shopping for amazon.com gift cards for christmas gifts and am really so disappoint
## [2] "i got this gift card from a friend, and it was the best! the site has so much to choose from...

Step 2: removing stop words

Stop words are words that are not informative or important for understanding the text. You can access the common English stop words using the stopwords() function from the tm library.

# load the package --- REMEMBER TO INSTALL IT FIRST
library(tm)

# list of common English stop words
stopwords()

##   [1] "i"          "me"         "my"         "myself"     "we"
##   [6] "our"        "ours"       "ourselves"  "you"        "your"
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"
##  [16] "his"        "himself"    "she"        "her"        "hers"
##  [21] "herself"    "it"         "its"        "itself"     "they"

[The full list continues with further pronouns, auxiliary verbs, contractions, prepositions, and articles.]

In addition, it is common to adapt this list of stop words to the specific context. In our case, it might be a good idea to add some more words that are commonly present when writing about gift cards but are of little value, such as "amazon", "gift", "card", etc. We can also remove some other general words that offer little insight, such as "really" and "also".
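To see what removeWords() does before applying it to the whole data set, here is a tiny illustration on a made-up sentence (the sentence is only an example):

# a quick illustration of removeWords() on a made-up sentence
library(tm)
toy_sentence <- "i really love this amazon gift card"
removeWords(toy_sentence, c(stopwords(), "amazon", "gift", "card", "really"))
# only the informative word "love" is left (plus leftover spaces,
# which stripWhitespace() will clean up in a later step)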
# create own list of stop words
library(tm)
my_stop_words <- c(stopwords(), "amazon", "gift", "card", "really", "also")

# remove stop words
library(tm)
gift_reviews$reviewText <- removeWords(gift_reviews$reviewText, my_stop_words)

# read the first two reviews
head(gift_reviews$reviewText, 2)

## [1] ",\n  shopping  .com   cards  christmas gifts    disappointed  five choices  one says \"merry
## [2] " got    friend,    best! site   much  choose ... great ."

Step 3: removing punctuation and special characters

First we remove all punctuation marks such as periods, question marks, exclamation points, commas, semicolons, colons, etc. We will use the removePunctuation() function from the tm package.

# remove punctuation
library(tm)
gift_reviews$reviewText <- removePunctuation(gift_reviews$reviewText)

# read the first two reviews
head(gift_reviews$reviewText, 2)

## [1] "\n  shopping  com   cards  christmas gifts    disappointed  five choices  one says merry chr
## [2] " got    friend    best site   much  choose  great "

Next, we need to remove all unwanted special characters, symbols, extra spaces/tabs, and numbers.

# remove special characters: ~!@#$%^&*(){}_+:"<>?,./;'[]-=
gift_reviews$reviewText <- gsub("[^[:alnum:]]", " ", gift_reviews$reviewText)

# remove unwanted characters: â í ü Â á ą ę ś ć
gift_reviews$reviewText <- gsub("[^a-zA-Z#]", " ", gift_reviews$reviewText)

# remove tabs, extra spaces
library(tm)
gift_reviews$reviewText <- stripWhitespace(gift_reviews$reviewText)

# remove numbers
gift_reviews$reviewText <- removeNumbers(gift_reviews$reviewText)

# read the first two reviews
head(gift_reviews$reviewText, 2)

## [1] " shopping com cards christmas gifts disappointed five choices one says merry christmas mentions
## [2] " got friend best site much choose great "

Let's create a word frequency chart before tokenization and stemming/lemmatization to see how the top 20 words have changed.

# create a frequency bar chart before tokenization

#-------------------------
# Step 1: combine all text
#-------------------------
# join all text together
words_combined <- paste(gift_reviews$reviewText, collapse = " ")
# split the sentences into words
words_combined <- unlist(strsplit(words_combined, split = " "))
# remove the empty character
words_combined <- words_combined[words_combined != ""]

#------------------------------------------
# Step 2: Count the times each word appears
#------------------------------------------
# create a frequency table
freqTab <- as.data.frame(table(words_combined))
# sort the table
freqTab <- freqTab[order(freqTab$Freq, decreasing = T), ]

#-------------------------
# Step 3: Plot a bar chart
#-------------------------
# select the top 20 most frequent words
topFreqTab <- freqTab[1:20, ]
# create a bar chart
barplot(height = topFreqTab$Freq,              # the height of the bars
        horiz = F,                             # FALSE to draw the bars vertically
        col = "darkgrey",                      # color of the bars
        names.arg = topFreqTab$words_combined, # bar labels
        main = "Word Frequency Bar Chart")     # title of the plot

[Figure: Word Frequency Bar Chart (after removing stop words, punctuation, and special characters). Frequent words now include "cards", "easy", "use", "get", "time", "like", "give", "got", and "love".]

Step 4: Tokenization

Let's split up each online product review into smaller and more manageable sections, or tokens. In other words, we will break each online review into separate words. To do that, we use the tokens() function from the quanteda library. Note that a tokens() function is also available in another library, koRpus, which might be loaded by one of the other libraries we use in this tutorial. We therefore add "quanteda::" in front of the function name to make sure that R uses the tokens() function from the quanteda library.
# tokenize the product reviews
library(quanteda)
toks_review <- quanteda::tokens(gift_reviews$reviewText, what = c("word"))

# print the first 3 documents
head(toks_review, 3)

## Tokens consisting of 3 documents.
## text1 :
##  [1] "shopping"     "com"          "cards"        "christmas"    "gifts"
##  [6] "disappointed" "five"         "choices"      "one"          "says"
## [11] "merry"        "christmas"
## [ ... and 18 more ]
##
## text2 :
## [1] "got"    "friend" "best"   "site"   "much"   "choose" "great"
##
## text3 :
##  [1] "going"       "save"        "trees"       "people"      "complaining"
##  [6] "paper"       "cards"       "can"         "simply"      "buy"
## [11] "electronic"  "via"
## [ ... and 43 more ]

The results of the tokenization process are stored in a "tokens" object (i.e., toks_review), which is essentially a list. As such, we can access the tokens of the first product review in the same way as we would access the first element of a list.

# access the first element of toks_review
toks_review[1]

## Tokens consisting of 1 document.
## text1 :
##  [1] "shopping"     "com"          "cards"        "christmas"    "gifts"
##  [6] "disappointed" "five"         "choices"      "one"          "says"
## [11] "merry"        "christmas"
## [ ... and 18 more ]

# access only the tokens of the first element of toks_review
toks_review[[1]]

##  [1] "shopping"     "com"          "cards"        "christmas"    "gifts"
##  [6] "disappointed" "five"         "choices"      "one"          "says"
## [11] "merry"        "christmas"    "mentions"     "christmas"    "sure"
## [16] "alone"        "wanting"      "reflects"     "actual"       "holiday"
## [21] "celebrating"  "principle"    "send"         "christmas"    "political"
## [26] "correctness"  "bad"          "marketing"    "decision"     "lynn"

Step 5: Stemming and lemmatization

As mentioned in the regular lecture, stemming is used to reduce words to a simple or root form by removing their prefixes or suffixes. Lemmatization, on the other hand, is used to reduce words to their lemma form so that different inflected forms of a word can be analyzed as a single term. One way to perform stemming and lemmatization is to use the stem_words() and lemmatize_words() functions in the textstem library. See the examples below.

# load the package --- REMEMBER TO INSTALL IT FIRST
library(textstem)

# create a vector containing some example words
example_vec <- c("buy", "bought", "buying", "buyer")

# stemming
stem_words(example_vec)

## [1] "bui"    "bought" "bui"    "buyer"

# lemmatization
lemmatize_words(example_vec)

## [1] "buy"   "buy"   "buy"   "buyer"

As we need to apply stemming (or lemmatization) to the tokenized words in each review, we will use the lapply() function to apply the stem_words() or lemmatize_words() function to each tokenized review.
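If lapply() is new to you, the small sketch below (with a made-up list) shows the idea: it applies a function to every element of a list and returns the results as a list of the same length.

# a small illustration of lapply() with a made-up list
library(textstem)
toy_list <- list(a = c("buying", "cards"), b = c("bought", "loved"))
lapply(X = toy_list,      # the targeted list
       FUN = stem_words)  # the function applied to each element
# returns a list with the stemmed words of element "a" and of element "b"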
# load the package --- REMEMBER TO INSTALL IT FIRST
library(textstem)

# stemming of review text
reviews_stem <- lapply(X = toks_review,   # the targeted list
                       FUN = stem_words)  # the function

# print the first stemmed words
head(reviews_stem)

## $text1
##  [1] "shop"       "com"        "card"       "christma"   "gift"
##  [6] "disappoint" "five"       "choic"      "on"         "sai"
## [11] "merri"      "christma"   "mention"    "christma"   "sure"
## [16] "alon"       "want"       "reflect"    "actual"     "holidai"
## [21] "celebr"     "principl"   "send"       "christma"   "polit"
## [26] "correct"    "bad"        "market"     "decis"      "lynn"
##
## $text2
## [1] "got"    "friend" "best"   "site"   "much"   "choos"  "great"

[The stemmed tokens of text3 to text6 are printed in the same way and are omitted here.]

Lemmatization is much slower than stemming (as lemmatization also considers the context), so it will take a while to get the results.
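If you are curious about the speed difference, you can time both approaches on the tokenized reviews with system.time(); this is just a quick sketch and the exact timings will depend on your computer.

# compare the run time of stemming and lemmatization (a quick sketch)
library(textstem)
system.time(lapply(toks_review, stem_words))       # stemming
system.time(lapply(toks_review, lemmatize_words))  # lemmatization (noticeably slower)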
# load the package --- REMEMBER TO INSTALL IT FIRST
library(textstem)

# lemmatization of review text
reviews_lem <- lapply(X = toks_review,         # the targeted list
                      FUN = lemmatize_words)   # the function

# print the first lemmatized words
head(reviews_lem)

## $text1
##  [1] "shop"        "com"         "card"        "christmas"   "gift"
##  [6] "disappoint"  "five"        "choice"      "one"         "say"
## [11] "merry"       "christmas"   "mention"     "christmas"   "sure"
## [16] "alone"       "want"        "reflect"     "actual"      "holiday"
## [21] "celebrate"   "principle"   "send"        "christmas"   "political"
## [26] "correctness" "bad"         "market"      "decision"    "lynn"
##
## $text2
## [1] "get"    "friend" "good"   "site"   "much"   "choose" "great"

[The lemmatized tokens of text3 to text6 are printed in the same way and are omitted here.]

Let's combine all these stemmed words into one text again (i.e., de-tokenization) and put them back into the original data set. In other words, we need to combine the separate tokens into one string (e.g., combine "hello" and "world" into "hello world"). To do this, we will use the paste() function and set the collapse argument to " ", meaning that all the separate words will be combined into one long string with a blank space in between.

# combine separate tokens into one string
# store the results in a new column in the data set
gift_reviews$text_clean <- sapply(reviews_stem, FUN = paste, collapse = " ")
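In case the collapse argument is unfamiliar, the one-liner below shows what it does with two made-up tokens:

# what collapse = " " does: glue separate tokens into one string
paste(c("hello", "world"), collapse = " ")

## [1] "hello world"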
Now let's create a frequency bar chart based on the stemmed words to see how the results have changed.

# create a frequency bar chart on stemmed reviews

#-------------------------
# Step 1: combine all text
#-------------------------
# join all text together
words_combined <- paste(gift_reviews$text_clean, collapse = " ")
# split the sentences into words
words_combined <- unlist(strsplit(words_combined, split = " "))
# remove the empty character
words_combined <- words_combined[words_combined != ""]

#------------------------------------------
# Step 2: Count the times each word appears
#------------------------------------------
# create a frequency table
freqTab <- as.data.frame(table(words_combined))
# sort the table
freqTab <- freqTab[order(freqTab$Freq, decreasing = T), ]

#-------------------------
# Step 3: Plot a bar chart
#-------------------------
# select the top 20 most frequent words
topFreqTab <- freqTab[1:20, ]
# create a bar chart
barplot(height = topFreqTab$Freq,              # the height of the bars
        horiz = F,                             # FALSE to draw the bars vertically
        col = "darkgrey",                      # color of the bars
        names.arg = topFreqTab$words_combined, # bar labels
        main = "Word Frequency Bar Chart")     # title of the plot

[Figure: Word Frequency Bar Chart (after stemming). Frequent stems now include "card", "bui", "can", "order", "on", "give", "will", "receiv", "just", and "dai".]

Compare this bar chart with the one we created before data cleaning and preprocessing: do you see any differences? Do any new words appear in the list? And do you think that stemming has helped us represent the data more accurately?

3) SENTIMENT ANALYSIS

In this part, we will perform some basic sentiment analysis to see if different reviews contain different opinions (i.e., the polarity of the text), using two values: positive and negative. To understand the sentiment, we will use the Lexicoder Sentiment Dictionary (2015) shared by Young, L. & Soroka, S. (2012) (Lexicoder Sentiment Dictionary, available at http://www.snsoroka.com/data-lexicoder/). You can access this dictionary by loading the quanteda library. As this dictionary has four different keys: positive, negative, neg_positive (positive words preceded by a negation), and neg_negative (negative words preceded by a negation), we need to subset the dictionary to only the positive and negative entries as follows: data_dictionary_LSD2015[1:2].

# load the package --- REMEMBER TO INSTALL IT FIRST
library(quanteda)

# print the first words of the dictionary LSD2015
head(data_dictionary_LSD2015[1:2])

## Dictionary object with 2 key entries.
## - [negative]:
##   - a lie, abandon*, abas*, abattoir*, abdicat*, aberra*, abhor*, abject*, abnormal*, abolish*, abomi
## - [positive]:
##   - ability*, abound*, absolv*, absorbent*, absorption*, abundanc*, abundant*, acced*, accentuat*, ac

There are many ways to compute the sentiment score of a text. In this example, we will compute it in two steps as follows:

Step 1: Tokenize each text and use the tokens_lookup() function from the quanteda library to convert each token into either positive or negative sentiment using a specific dictionary (the LSD 2015 in our example).

Step 2: Compute the overall sentiment score of the text, assuming that: a) each positive word counts as +1 point, while each negative word counts as -1 point, and b) the overall sentiment score is computed by adding up all the points and then dividing by the total number of positive and negative words found in the text. As such, the values range from -1 (all words are negative) to 1 (all words are positive). It is 0 when we have similar numbers of positive and negative words.

For example, suppose a tokenized text contains 4 such words, of which three are positive and one is negative.
In this case, the total score is 3 - 1 = 2. The sentiment score is therefore 2/4 = 0.5, which indicates a positive sentiment.

# convert lemmatized tokens to positive/negative
# we use the as.tokens() function to convert reviews_lem to tokens form
text_sent <- tokens_lookup(as.tokens(reviews_lem),
                           dictionary = data_dictionary_LSD2015[1:2])

# print the first rows of text_sent
head(text_sent)

## Tokens consisting of 6 documents.
## text1 :
## [1] "positive" "negative" "positive" "positive" "positive" "positive" "positive"
## [8] "positive" "negative"
##
## text2 :
## [1] "positive" "positive" "positive"
##
## text3 :
##  [1] "negative" "positive" "positive" "negative" "negative" "negative"
##  [7] "positive" "positive" "positive" "positive" "positive" "positive"
##
## text4 :
## [1] "positive" "positive" "negative" "positive" "negative"
##
## text5 :
## [1] "positive" "negative" "positive" "negative" "negative"
##
## text6 :
## [1] "negative" "negative" "positive" "positive" "negative" "negative" "positive"
## [8] "negative" "positive"

# compute the sentiment score
# and store the results in a new column of the data set
gift_reviews$sent_score <- sapply(X = text_sent,
                                  FUN = function(x) (length(x[x == "positive"]) -
                                                       length(x[x == "negative"])) / length(x))

Note that the tokens_lookup() function only works with tokens objects, so we need the as.tokens() function to convert the lemmatized reviews (reviews_lem) to tokens form. In addition, to compute the sentiment score, we use the sapply() function to apply a function to each element of the given list (i.e., each tokenized review in the list text_sent). In this case, the function is an anonymous function defined in the second argument of sapply(). The expression length(x[x == "positive"]) returns the number of positive words in a given object x, where x refers to one element of the list text_sent; as the function is applied to each element of text_sent in turn, the value of x changes accordingly.

Now let's check the gift_reviews data set. Each review has been assigned one sentiment score. How can we interpret these results?

4) TOPIC MODELING: LDA

Topic modeling is a process of discovering potential hidden patterns in the data by identifying words that occur together and then grouping them into topics. To do that, we first need to create a document-term matrix and then apply a topic model. In this example, we will use the Latent Dirichlet Allocation (LDA) model.

Step 1: Create a Document Term Matrix (DTM)

Here we need to convert the list of our tokenized reviews into a DTM.

# load the package --- REMEMBER TO INSTALL IT FIRST
library(quanteda)

# convert list of tokenized text into DTM
review_dfm <- dfm(as.tokens(reviews_stem))

# print the first rows and columns of the dtm
head(review_dfm)

## Document-feature matrix of: 6 documents, 2,514 features (98.97% sparse) and 0 docvars.
##        features
## docs    shop com card christma gift disappoint five choic on sai
##   text1    1   1    1        4    1          1    1     1  1   1
##   text2    0   0    0        0    0          0    0     0  0   0
##   text3    0   0    3        0    0          0    0     0  0   0
##   text4    0   0    0        0    0          0    0     0  0   0
##   text5    0   0    1        0    0          0    0     0  0   0
##   text6    0   0    1        0    0          0    0     0  0   0
## [ reached max_nfeat ... 2,504 more features ]

It is often a good idea to trim our DFM so that we only work with terms that occur in more than 5% and less than 95% of the documents. Other thresholds (e.g., more than 7.5% and less than 90%) can also be chosen.
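If you want to see which terms a given threshold would keep or drop, you can inspect the document frequencies yourself before trimming. The sketch below uses quanteda's docfreq() and ndoc() functions; it is only one possible way to do this check.

# share of documents in which each term appears (a small sketch)
library(quanteda)
doc_share <- docfreq(review_dfm) / ndoc(review_dfm)
head(sort(doc_share, decreasing = TRUE), 10) # the most widespread terms
mean(doc_share > 0.05 & doc_share < 0.95)    # share of terms that would survive a 5%/95% trim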
# load the package --- REMEMBER TO INSTALL IT FIRST
library(quanteda)

# trim the dfm
review_dfm_trimmed <- dfm_trim(x = review_dfm,        # dfm object
                               min_docfreq = 0.05,    # min document frequency
                               max_docfreq = 0.95,    # max document frequency
                               docfreq_type = "prop") # type of min/max_docfreq

Step 2: Run the LDA model

We use the LDA() function from the topicmodels library. Note that we will need to choose an arbitrary number of topics and then apply the model to the DTM object above. In addition, the LDA model requires that the matrix does not contain any document with no terms in it. Therefore, we first need to remove all documents with no terms.

# compute the number of words in each document
num_words <- apply(X = review_dfm_trimmed, # the matrix
                   MARGIN = 1,             # apply the function to all rows
                   FUN = sum)              # the function to be applied

# remove all documents with no term
review_dfm_trimmed_new <- review_dfm_trimmed[num_words > 0, ]

# load the package --- REMEMBER TO INSTALL IT FIRST
library(topicmodels)

k <- 3      # number of topics
seed <- 123 # for reproducibility purposes

# run LDA model
lda_results <- LDA(review_dfm_trimmed_new,      # a DTM object
                   k = k,                       # number of topics
                   control = list(seed = seed)) # set the seed so the results are reproducible

To obtain an overview of the results, we use the terms() function, which requires two arguments: the first one is the result of the LDA model, and the second one is the maximum number of terms to return for each topic.

# print the top 7 words for each of the k topics
terms(lda_results, 7)

##      Topic 1   Topic 2   Topic 3
## [1,] "get"     "us"      "card"
## [2,] "want"    "purchas" "order"
## [3,] "purchas" "time"    "like"
## [4,] "can"     "card"    "bui"
## [5,] "bui"     "great"   "receiv"
## [6,] "on"      "will"    "great"
## [7,] "give"    "give"    "love"

Insights revealed by the analysis

In this tutorial, we have used text analytics to learn about customers' perceptions of gift card products. Based on the analysis you have done above, are you able to identify the topics that were present in the text? Can we use those insights to make better marketing decisions? In addition, are customers positive or negative about the products? If the text is positive, why is that? And if the text is negative, what are the common reasons for that? A few commands that can help you start answering these questions are sketched below.
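As a starting point, the sketch below (reusing the sent_score column and the lda_results object created above) summarizes the sentiment scores, relates them to the star ratings, and looks at the most likely topic of each review. It is only a sketch; there are many other ways to dig into these questions.

# a starting point for interpreting the results (a minimal sketch)

# distribution of the sentiment scores
# (reviews with no positive or negative words get NaN)
summary(gift_reviews$sent_score)

# average sentiment score per star rating
aggregate(sent_score ~ overall, data = gift_reviews, FUN = mean)

# most likely topic for each (non-empty) review
library(topicmodels)
head(topics(lda_results))

# read a few of the most negative reviews to look for common complaints
# (note: reviewText has already been cleaned in the steps above)
head(gift_reviews$reviewText[order(gift_reviews$sent_score)], 3)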