
Identifying Frequent Word Associations for Extracting Specific Product Features from Customer Reviews

Ashequl Qadir
University of Wolverhampton, UK
ashequl.qadir@wlv.ac.uk, ashequlqadir@yahoo.com

Abstract

Product feature extraction from customer reviews is an important task in the field of opinion mining.

Extracted features help to assess feature-based opinions written by customers who bought a particular product and expressed their satisfaction and criticism. This helps future customers and vendors learn the pros and cons of the product under consideration.

Because the opinion text is unstructured in most cases, it is necessary to formulate ways to extract product features that are both implicitly and explicitly visible in plain-text reviews. In this paper, a process is discussed in which words frequently associated with specific product features are identified with the help of a previously prepared corpus of product reviews.

The process involves finding the keywords or N-grams that are frequently associated with specific product features. These frequent associations are then normalized within each product feature scope with the popular tf.idf metric. Two different classification techniques are applied to associate unclassified review lines with appropriate product feature classes using the found word associations. Results are then evaluated by comparison with the human-identified product features.

1. Introduction

Studying feature-level reviews of a specific product before buying has become widely popular among customers. Both expert and relatively naïve customers look for feature-oriented opinions of other customers and their experiences with a purchased product. Due to the vast number of reviews and their unstructured presentation on the web, it is quite inconvenient and time consuming for customers to summarize opinions related to specific product features by reading plain-text reviews one after another. In the opinion lines, some of these features are mentioned explicitly, such as "This product is very reasonably priced" indicating a 'Price' feature, whereas most features are only implicitly visible within the text; an example is "I do travel a lot so they get banged about." indicating a 'Portability' feature.

The task of processing product reviews is very challenging because, in most cases, the written reviews contain informal terms and expressions and, in some cases, grammatically incorrect sentences and misspelled words. This poses a potential threat to the training of a system. Sometimes, terms that are very indicative of a product feature are used only infrequently by a few expert users, which makes them difficult to identify. Also, some terms can be associated with more than one product feature in similar or different product domains. There are also sentences containing general comments or descriptions of events that do not relate to any specific product feature but only express an opinion about the product. These issues complicate the task of identifying product features in customer reviews.

There are many websites that publish reviews written by customers. Opinion sites such as Epinions and CNET, e-commerce sites such as Amazon, and blogs are very popular sources of reviews in which customers express their opinions after buying or using a product. Most of these reviews are written in plain text; some sites let customers provide an overall rating or recommendation by means of a 'thumbs up/thumbs down' notation. But these reviews do not contain formatted opinions indicating a positive or negative recommendation specific to individual product features. A system that can identify various implicit and explicit product features can therefore contribute significantly towards product-feature-based review and recommendation classification. The process proposed in this paper uses the keywords and N-grams that are most likely to appear with specific product features.

2. Related Work

Yi et al. [1],[5] worked on designing a sentiment analyzer that extracts topic-specific features. To select candidate feature terms, they used three term selection heuristics to extract noun phrases of specific patterns. They applied a mixture language model algorithm and a likelihood ratio algorithm to the extracted noun phrases to select the product features. Their approach is restricted to finding explicit product features only.

Hu and Liu [2] used association rule mining based on the Apriori algorithm to extract frequent phrases. To reduce the list of extracted phrases and identify potential product features, they applied compactness pruning to phrases having more than one word by determining whether or not the words appear together. For extracted single words, they applied redundancy pruning, where a candidate feature is pruned if it is a subset of another feature. They also tried to identify infrequent product features by determining the nearby noun phrases in review sentences where a frequent feature is absent and opinion words are present.

Popescu and Etzioni [3] introduced an unsupervised information extraction system, OPINE, that extracts explicit product features by recursively identifying parts and properties of a given product. It extracts noun phrases and computes pointwise mutual information between the phrases and meronymy discriminators associated with the product classes. Their system distinguishes between parts and properties using WordNet's IS-A hierarchy. Again, their approach is limited to extracting explicit product features.

Ghani et al. [4] approached explicit product feature extraction as a classification problem. They used Naive Bayes with a multi-view semi-supervised algorithm. In their process, the output of an unsupervised seed generation algorithm is combined with unlabeled data that is used by the semi-supervised algorithm to extract product attributes and values, which are later linked together by dependency information and a correlation score. For implicitly mentioned attributes, they used labeled data for their training corpus, trained the system, and then performed classification using a baseline, Naive Bayes, and Expectation-Maximization, an iterative statistical technique for maximum likelihood estimation.

The approach described in this paper differs from those mentioned above in that it uses statistical methods, counting the frequency of N-grams and then calculating tf.idf weight scores to assign a review line to a product feature. It does not extract noun phrases and apply pruning to reduce candidate features, nor does it try to identify part and property relations. Also, the described approach does not differentiate between explicit and implicit features and works equally for both.

Pang et al. [6] worked on evaluating machine learning approaches for classifying documents based on positive and negative sentiment. They applied Bayes' rule to derive their Naive Bayes classifier. An adaptation of their Naive Bayes classifier is utilized and described later in this paper for assigning product features to reviews.

3. Methodology

The described process is divided into three parts: corpus creation; normalized weight calculation to identify frequent N-grams associated with product features; and product feature identification using a classification scheme.

3.1. Corpus Creation

A corpus is created by obtaining reviews on a particular type of product from Amazon, using Amazon AWS. The review texts are then split into individual sentences that each indicate an individual product feature. If more than one product feature is present in a single sentence, the sentence is segmented accordingly. Complex and compound sentences are also segmented if they contain separate feature information. These units relating to specific product features are then tagged manually with feature titles.

To avoid noisy information, general sentences that do not relate to any product feature are removed.

For this experiment, 120 reviews have been collected by searching for the product 'harddisk' on amazon.co.uk. Among them, 100 randomly selected reviews are used for training, keeping the remaining 20 for testing.
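For illustration, a minimal sketch of how the manually tagged training units could be represented is given below; the feature titles and example sentences are hypothetical and only show the intended structure (one product feature label per segmented review unit).

```python
from collections import defaultdict

# Hypothetical tagged review units: (product feature title, segmented review unit).
# The labels and sentences are illustrative only, not taken from the actual corpus.
tagged_corpus = [
    ("Working Smoothness", "the drive works with no problems"),
    ("Usability",          "very easy to use"),
    ("Portability",        "I do travel a lot so they get banged about"),
]

# Group the units by product feature to form the per-feature scopes
# used later for N-gram counting and tf.idf weighting.
feature_scopes = defaultdict(list)
for feature, unit in tagged_corpus:
    feature_scopes[feature].append(unit)
```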

3.2. Weight Calculation for frequent N-grams

Frequent N-grams are counted within the specific feature scopes obtained from different reviews. Both unigrams and bigrams are taken as N-grams in the counting process. If all the words in an N-gram are function words, that N-gram is removed, eliminating N-grams that are common to any text and do not carry product-feature-specific terms.

To normalise the influence of N-grams that are product specific but not feature specific, the tf.idf metric has been used.

If n_{i,j} is the number of occurrences of term t_i in document d_j, which has k terms, |D| is the number of documents in the corpus, and |{d : t_i ∈ d}| is the number of documents in which term t_i appears, then the tf.idf weight is calculated by multiplying the term frequency tf_{i,j} with the inverse document frequency idf_i, where tf_{i,j} and idf_i are the following:

tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad ; \qquad idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}

To calculate term frequency, the number of occurrences of an N-gram is counted within a product feature scope and is divided by the total number of N-grams within that scope. The inverse document frequency is calculated by dividing the total number of unique product features previously tagged by the number of product features associated with the N-gram under consideration, and then taking the logarithm of the quotient. A sample of the identified N-grams associated with the top five most popular product features in the training data set, selected using the tf.idf weight score, is shown in Table 1.
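The following Python sketch illustrates this weighting step under simplified assumptions: each product feature scope is treated as one "document", and the function-word list and tokenization shown here are illustrative stand-ins, not the exact resources used in the experiment.

```python
import math
import re
from collections import Counter

# Illustrative stand-in for a proper function-word list.
FUNCTION_WORDS = {"the", "a", "an", "to", "of", "in", "it", "is", "and", "so"}

def ngrams(text, n):
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_ngrams(text):
    # Unigrams and bigrams; drop N-grams made up entirely of function words.
    grams = ngrams(text, 1) + ngrams(text, 2)
    return [g for g in grams if not all(w in FUNCTION_WORDS for w in g.split())]

def tfidf_weights(feature_scopes):
    """feature_scopes: {feature: [review unit, ...]} -> {feature: {ngram: weight}}"""
    counts = {f: Counter(g for unit in units for g in candidate_ngrams(unit))
              for f, units in feature_scopes.items()}
    num_features = len(counts)                  # |D|: each feature scope acts as a document
    doc_freq = Counter(g for c in counts.values() for g in c)
    weights = {}
    for feature, c in counts.items():
        total = sum(c.values())                 # total N-grams in this feature scope
        weights[feature] = {g: (n / total) * math.log(num_features / doc_freq[g])
                            for g, n in c.items()}
    return weights
```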

Table 1. Product feature associated N-grams

Product Feature    | Unigram  | Bigram
Working Smoothness | drive    | the drive
Working Smoothness | works    | western digital
Working Smoothness | passport | no problems
Working Smoothness | problems | the same
Usability          | use      | to use
Usability          | drive    | easy to
Usability          | passport | hard drive
Usability          | easy     | the drive
Outlook            | drive    | the drive
Outlook            | just     | of kit
Outlook            | good     | it looks
Outlook            | looks    | the same
Software Support   | drive    | the software
Software Support   | software | the drive
Software Support   | use      | hard drive
Software Support   | just     | sync software
Accessories        | usb      | the usb
Accessories        | power    | western digital
Accessories        | case     | usb cable
Accessories        | cable    | power supply

3.3. Product Feature Identification

Identifying product features in the reviews has been treated as a document classification problem in which each review line is considered a document to be classified and the product features are the classes. Two classification schemes have been used. The first is Naive Bayes classification.

According to Bayes’ rule,

P(c \mid d) = \frac{P(c)\, P(d \mid c)}{P(d)}

Pang et al. [6] derived their NB classifier by reformulating Bayes' rule as follows:

P_{NB}(c \mid d) := \frac{P(c) \left( \prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)} \right)}{P(d)}

where the probability of finding class c given document d is calculated by multiplying the probability of finding class c, P(c), with the probability of finding each feature f_i given class c, and then dividing by the probability of finding document d, P(d). Their feature set is denoted by f = {f_1, ..., f_m}, and n_i(d) is the frequency of feature f_i in document d.

Adapting their NB classifier to classify review lines, with N-grams as the feature set, product features as classes, and each review line as a document, the NB classifier can be rewritten as

P_{NB}(c \mid d) := \frac{P(c) \left( \prod_{i=1}^{m} P(f_i \mid c)^{w_i(c)} \right)}{P(d)}

where the set of N-grams in a review line is denoted by {f_1, ..., f_m} and w_i(c) is the tf.idf weight of N-gram f_i for product feature class c. Laplace smoothing is used to avoid getting zero as the result of the multiplication.

Because P(d) makes no contribution towards selecting a class, it has been ignored. The review line d is assigned to the product feature class c^* where

c^* = \arg\max_{c} P(c \mid d)
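A minimal sketch of this adapted classifier is shown below, scored in log space and reusing the candidate_ngrams and tfidf_weights helpers from the earlier sketch. The Laplace smoothing details and the estimation of the class priors from the proportion of review units per feature are assumptions, since the paper does not spell them out.

```python
import math
from collections import Counter

def train_nb(feature_scopes):
    """Estimate class priors and per-class N-gram counts from the tagged corpus.
    Reuses candidate_ngrams() from the tf.idf sketch above."""
    class_counts = {f: Counter(g for unit in units for g in candidate_ngrams(unit))
                    for f, units in feature_scopes.items()}
    vocab = {g for counts in class_counts.values() for g in counts}
    total_units = sum(len(units) for units in feature_scopes.values())
    priors = {f: len(units) / total_units for f, units in feature_scopes.items()}
    return class_counts, vocab, priors

def classify_nb(line, class_counts, vocab, priors, weights):
    """Literal transcription of the adapted formula: each log P(f_i|c) is
    weighted by the tf.idf weight w_i(c); P(d) is ignored."""
    grams = candidate_ngrams(line)
    best_class, best_score = None, float("-inf")
    for c, counts in class_counts.items():
        total = sum(counts.values())
        score = math.log(priors[c])
        for g in grams:
            p = (counts[g] + 1) / (total + len(vocab))   # Laplace smoothing
            score += weights[c].get(g, 0.0) * math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```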

In the other classification scheme, a summation based approach has been used with the tf.idf weights of the N-grams. A review line is assigned to the class c^* where

c^* = \arg\max_{c} \sum_{i=1}^{m} w_i(c)

and w_i(c) is the tf.idf weight of N-gram f_i in the feature set for product feature class c.
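A corresponding sketch of the summation based scheme, again reusing candidate_ngrams and the per-feature weight dictionaries produced by the tf.idf sketch:

```python
def classify_sb(line, weights):
    """Summation based scheme: sum the tf.idf weights of the line's N-grams
    for every product feature class and pick the argmax."""
    grams = candidate_ngrams(line)
    scores = {c: sum(w.get(g, 0.0) for g in grams) for c, w in weights.items()}
    best_class = max(scores, key=scores.get)
    # If none of the line's N-grams were seen in training, all scores are zero
    # and the line stays unclassified (cf. the "Not Classified" column of Table 4).
    return best_class if scores[best_class] > 0 else None
```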

4. Test Results

Tests have been performed on the small set of 20 reviews kept aside for testing purposes. Table 2 below shows the result of product feature identification using unigrams as the selected N-grams. Product features are sorted by their popularity in the training corpus. A few product features that were present in the training corpus but not in the test data set are omitted. NB denotes Naive Bayes classification and SB denotes summation based classification.
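The per-feature precision, recall and F-measure reported in Tables 2 and 3 can be computed from the gold and predicted labels roughly as in the following sketch; treating undefined values as NA (None) is an assumption based on how the tables are presented.

```python
def per_class_scores(gold, predicted):
    """gold, predicted: parallel lists of product feature labels (None = unclassified)."""
    results = {}
    for c in set(gold):
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, predicted) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, predicted) if g == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) > 0 else None   # None stands in for NA
        recall = tp / (tp + fn) if (tp + fn) > 0 else None
        if precision is None or recall is None or precision + recall == 0:
            f_measure = None
        else:
            f_measure = 2 * precision * recall / (precision + recall)
        results[c] = (precision, recall, f_measure)
    return results
```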

Table 2. Result of product feature identification using Unigram

Product Features    | Precision NB | Precision SB | Recall NB | Recall SB | F-measure NB | F-measure SB
Working Smoothness  | 0.4118 | 1.0000 | 0.2917 | 0.2857 | 0.3415 | 0.4444
Usability           | 0.1429 | 0.3333 | 0.0909 | 0.2500 | 0.1111 | 0.2857
Capacity            | 0.2000 | 1.0000 | 0.2500 | 0.6667 | 0.2222 | 0.8000
Outlook             | 0.2857 | 0.5000 | 0.3333 | 0.6667 | 0.3077 | 0.5714
Software Support    | 0.5000 | 0.8000 | 0.4167 | 0.6667 | 0.4545 | 0.7273
Accessories         | 0.7500 | 0.5385 | 0.2500 | 0.6364 | 0.3750 | 0.5833
Compatibility       | 0.4000 | 1.0000 | 0.2857 | 0.1429 | 0.3333 | 0.2500
Working Speed       | 1.0000 | 0.5000 | 1.0000 | 1.0000 | 1.0000 | 0.6667
Size                | 0.1429 | 0.8000 | 0.2000 | 1.0000 | 0.1667 | 0.8889
Portability         | 1.0000 | 0.0000 | 0.3333 | 0.0000 | 0.5000 | NA
Longevity           | 0.0000 | 1.0000 | 0.0000 | 1.0000 | NA     | 1.0000
Customer Support    | 0.5000 | 1.0000 | 0.2000 | 0.1250 | 0.2857 | 0.2222
Noise               | 0.4000 | 0.3750 | 0.5000 | 1.0000 | 0.4444 | 0.5455
Product Information | 0.0000 | 0.2000 | 0.0000 | 0.5000 | NA     | 0.2857
Average             | 0.4095 | 0.6462 | 0.2965 | 0.5671 | 0.3785 | 0.5593

Table 3. Result of product feature identification using Unigram+Bigram

Product Features    | Precision NB | Precision SB | Recall NB | Recall SB | F-measure NB | F-measure SB
Working Smoothness  | 0.7500 | 1.0000 | 0.1250 | 0.2273 | 0.2143 | 0.3704
Usability           | 1.0000 | 0.4286 | 0.2727 | 0.3000 | 0.4286 | 0.3529
Capacity            | 0.2000 | 1.0000 | 0.2500 | 0.7500 | 0.2222 | 0.8571
Outlook             | 0.6667 | 0.5000 | 0.3333 | 0.5000 | 0.4444 | 0.5000
Software Support    | 0.6667 | 0.8000 | 0.3333 | 0.6667 | 0.4444 | 0.7273
Accessories         | 0.0000 | 0.7778 | 0.0000 | 0.5833 | NA     | 0.6667
Compatibility       | 1.0000 | 0.0000 | 0.1429 | 0.0000 | 0.2500 | NA
Working Speed       | 1.0000 | 0.6250 | 0.6000 | 1.0000 | 0.7500 | 0.7692
Size                | 0.4000 | 0.5000 | 0.4000 | 1.0000 | 0.4000 | 0.6667
Portability         | 0.0000 | 0.2500 | 0.0000 | 0.5000 | NA     | 0.3333
Longevity           | 0.0000 | 0.2000 | 0.0000 | 1.0000 | NA     | 0.3333
Customer Support    | 0.0000 | 1.0000 | 0.0000 | 0.1250 | NA     | 0.2222
Noise               | 0.5000 | 0.3750 | 0.5000 | 1.0000 | 0.5000 | 0.5455
Product Information | 0.0000 | 0.2000 | 0.0000 | 0.5000 | 0.0000 | 0.2857
Average             | 0.4417 | 0.5469 | 0.2112 | 0.5823 | 0.3654 | 0.5100

Table 3 shows the result of product feature identification when both unigrams and bigrams are used as N-grams.

Table 4. Accuracy rate of classification

N-grams        | Accuracy (%) NB | Accuracy (%) SB | Not Classified (%) NB | Not Classified (%) SB
Unigram        | 29.09 | 41.82 | 0 | 11.82
Unigram+Bigram | 19.09 | 41.82 | 0 | 7.27

Table 4 gives the overall accuracy rate of identifying product features using the two classification schemes for the different lengths of N-grams tested.

5. Result Analysis

For the test performed using unigrams only, Table 2 shows that the summation based classification scheme performed better than Naive Bayes classification in terms of both precision and recall, and thus yielded a better F-measure score. Table 3 shows that when both unigrams and bigrams are used, Naive Bayes performed slightly better than the other scheme in precision for some of the popular product features, but for the less popular product features precision is still very low. On the other hand, the summation based classification scheme achieved better results on average.

Use of bigrams only was also tested but showed poor results, because the frequency of the bigrams decreased notably and they no longer remained indicative of the individual product features. A bigger training corpus might improve performance in this respect. Also, the statistical independence assumed for Naive Bayes classification deteriorates significantly.

Use of unigrams and bigrams together yielded a slightly better result than using unigrams only when the summation based classification technique was applied, as can be seen in the average recall score. But average precision is still higher when only unigrams are used.

Table 4 shows that the accuracy rate remained unchanged, but the percentage of undetermined classes decreased. The summation based classification technique failed to assign a product feature to a review line when none of the N-grams in that test review line were present in the training corpus. Increasing the size of the training corpus would reduce the likelihood of this issue. On the other hand, because of the default probability of finding a product feature class, the Naive Bayes classification scheme was always able to assign a class to a review line, even when the N-grams in the review line to be classified were absent from the training corpus. This increases the possibility of wrong classification when the Naive Bayes classification scheme is used.

The classification accuracy rates for both schemes are still very low. Because no lemmatization or stemming was used, the frequency of words sharing the same canonical form was spread across different surface forms and thus contributed less towards indicating the product feature associated with a review line. The use of word synonyms might also improve accuracy, as more accurate and relevant frequency counts would then be possible.

6. Conclusion

The presence of these frequently associated N-grams alone might not be sufficient to identify product features from unstructured plain-text reviews, and more tests in varying domains and with bigger corpora are needed to improve performance. However, it is quite evident that such N-grams can contribute significantly towards identifying implicit and explicit product features. Future work will involve fine-tuning the association identification process, using synonyms of the found words, lemmatization, stemming, and applying other classification techniques to find an optimal solution for identifying product features.

7. References

[1] Yi, J., Nasukawa, T., Bunescu, R. and Niblack, W., "Sentiment Analyzer: Extracting Sentiments about a Given Topic using Natural Language Processing Techniques", In Proceedings of the IEEE International Conference on Data Mining (ICDM), IEEE Computer Society, 2003, pp. 427-434.

[2] Hu, M. and Liu, B., "Mining Opinion Features in Customer Reviews", In Proceedings of AAAI, AAAI Press, San Jose, USA, July 2004, pp. 755-760.

[3] Popescu, A.-M. and Etzioni, O., "Extracting Product Features and Opinions from Reviews", In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), Association for Computational Linguistics, Vancouver, British Columbia, Canada, 2005, pp. 339-346.

[4] Ghani, R., Probst, K., Liu, Y., Krema, M. and Fano, A., "Text Mining for Product Attribute Extraction", SIGKDD Explorations Newsletter, 8(1), 2006, pp. 41-48.

[5] Yi, J. and Niblack, W., "Sentiment Mining in WebFountain", In Proceedings of the International Conference on Data Engineering (ICDE), IEEE Computer Society, 2005, pp. 1073-1083.

[6] Pang, B., Lee, L. and Vaithyanathan, S., "Thumbs up? Sentiment Classification using Machine Learning Techniques", In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2002, pp. 79-86.
