Predictive Analysis and Data Visualization Approach for Decision Processes in
Marketing Strategies: A Case of Study
Conference Paper · October 2020
DOI: 10.1007/978-3-030-61834-6_6
Andrés García-Pérez, María Alejandra Millán Hernández,
and Daniela E. Castellón Marriaga
Programa de Ingeniería Industrial, Universidad Tecnológica de Bolívar,
Cartagena de Indias, Colombia
agarcia@utb.edu.co, marialeja-9805@hotmail.com,
castellonmarriga@gmail.com
Abstract. In this paper, we propose a new strategy for recommender systems in
online entertainment platforms. As a case study, we analyzed the reading preferences
of users of Goodreads, a social network for readers, to classify books according to
variables such as average rating, rating count, and text review count. Multivariate
cluster analysis and benchmarking for the comparison of predictive models were used.
Graphs and data are presented that allow evaluation of the optimal number of clusters
and the precision of the models. Finally, we show the existence of groups of items that
can be overlooked by traditional recommendation systems because of their low visibility
on the platform. We propose using promotional strategies to highlight these high-quality
items with little visibility. Overall, the classification of books that predictive models
can offer may benefit the authors, readers, and investors of Goodreads through the
retention and attraction of users.
Keywords: Machine Learning · Predictive analytics · Data visualization ·
Recommender systems · Marketing strategies
1 Introduction
The emergence of digital reading with the arrival of electronic books has transformed
the relationship between author and reader. [1] describes the challenges facing the
publishing industry today, claiming that this industry requires a high capital investment,
referring to the costs of publishing and promoting books in physical format, while the
low costs of electronic publishing remove barriers to entry. These lower costs make
e-books affordable and give them a competitive advantage over the traditional publishing
industry. In addition, digital publication encompasses another scenario: self-publishing,
where authors publicize their books on the internet without the intervention of an editor,
making them easily accessible to readers.
Social networks have strengthened the author-reader relationship, encouraging feedback
between them and expanding the possibilities for exchanging ideas. Likewise, readers
© Springer Nature Switzerland AG 2020
J. C. Figueroa-García et al. (Eds.): WEA 2020, CCIS 1274, pp. 60–70, 2020.
https://doi.org/10.1007/978-3-030-61834-6_6
from any part of the world can meet on digital platforms in spaces such as virtual reading
clubs, where the topics of books are discussed and critiqued and, above all, book
recommendations are made. Goodreads is a social network for readers, which was born as
a proposal for readers who wanted a virtual library of the books they had read, the ones
they were reading, and those they wanted to read. Among its basic functions, it offers
adding ratings and reviews to books read, as well as making friends with other
readers with similar literary preferences.
The platform has around 90 million users. It was sold to Amazon six years after
its launch and helps authors make themselves known through giveaways of books
that have just been released. Due to all its functionalities, Goodreads draws attention
not only from readers but also from authors, who establish links with readers, get
feedback on their books, learn their popularity in the community, and can assess their
positioning as writers. To keep the reader up to date on the latest books, it is necessary
to select and filter the books of greatest interest. In response, recommendation systems
are integrated, since they study people's current preferences to predict future ones,
saving users time and generating higher conversion rates.
Various ways of creating generic recommendation systems have been proposed [2].
Collaborative filtering, for example, assumes that if a user has rated two books, then
another user who has read one of these books can be recommended the other. [3]
describes how Amazon and Netflix use algorithms to offer quality recommendations
to their users through collaborative filtering. This method is intended to generate a
list of similar items. However, in the users' experience, recommendations are based only
on the reader's previous interests, regardless of the wide variety of other options
that this type of platform can offer. This opens up the possibility of creating user
interest lists that allow low-visibility authors to be promoted, keeping readers more
interested in the recommendations.
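The co-rating assumption behind collaborative filtering can be illustrated with a minimal
Python sketch. The user names, book titles, and the `recommend` helper are hypothetical and
serve only as an illustration; the systems described in [3] are considerably more elaborate.

```python
from collections import Counter
from itertools import permutations

# Hypothetical ratings log: user -> set of books the user has rated.
ratings = {
    "ana": {"Book A", "Book B"},
    "luis": {"Book A", "Book B", "Book C"},
    "sara": {"Book A"},
}

def co_rating_counts(ratings):
    """Count, for every ordered pair (x, y), how many users rated both books."""
    counts = {}
    for books in ratings.values():
        for x, y in permutations(sorted(books), 2):
            counts.setdefault(x, Counter())[y] += 1
    return counts

def recommend(user, ratings, top_n=3):
    """Suggest books co-rated with the user's books that the user has not rated."""
    counts = co_rating_counts(ratings)
    seen = ratings[user]
    scores = Counter()
    for book in seen:
        for other, n in counts.get(book, {}).items():
            if other not in seen:
                scores[other] += n
    return [book for book, _ in scores.most_common(top_n)]

print(recommend("sara", ratings))
```

Because the score of a book grows with the number of users who co-rated it, a sketch of this
kind tends to surface items that are already popular, which is the limitation discussed above.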
The present work presents a new approach to recommendation systems on virtual
entertainment platforms. Our proposal uses data visualization and machine learning
techniques to create experimental lists of items that, together with the inclusion of
other features, can generate a higher conversion rate and maintain user traffic
on the platform. This methodology is applied to the Goodreads case study to identify
lists of books to be prioritized when generating innovative recommendations that
favor Goodreads authors, readers, and investors.
2 Related Works
Recommendation systems are based on the relationships between products or users.
There are two main techniques for this task: collaborative filtering and
content-based recommendation. [4–6] based their studies on collaborative filtering,
which relies on the feedback collected from users on products. These authors used
algorithms such as KNN to find similarities between items. On the other hand, [7–9]
used content-based recommendation, which focuses on the description of the products,
applying clustering algorithms in their research. Meanwhile, [10–12] suggest the
combination of these two systems, proposing a hybrid approach that combines techniques
from both recommenders to cover their deficiencies.
[13] propose adding a third level of user interests to the typical characteristics of users
and products. In this way, over-specialization is mitigated and the recommendations are
personalized. Recently, works have used the power of Artificial Neural Networks to learn
about user or product features. This technique has also been used to recover missing
values such as ratings and to develop hybrid recommender systems [14, 15]. [16] suggests
using three similarity measures to avoid co-rated items: users, rating choice weight, and
the ratio of co-rated items, obtaining good results in minimizing the deviation of the
similarity calculation.
From the works analyzed, there is clear interest in making recommendations that engage
users, which can generate higher conversion rates. On entertainment platforms, users can
become bored when the recommendations are always similar. This work therefore aims to
start a discussion on including, in recommendation systems, lists that allow users to
explore the wide range of options that an entertainment platform can offer without
losing quality.
3 Methodology
The study used the K-Means clustering technique and benchmarking to compare
supervised Machine Learning techniques. The analyzed data were taken from the
Goodreads dataset stored in [17], which has 11127 book records read by platform
users up to 2019. To classify the books, we consider the preferences of the readers.
Three features were chosen, which represent the most significant variability among
the books, to carry out the segmentation. Features such as the identifier code and ISBN
were left out. The variables considered were:
• Average rating (average_rating): The average rating that the book has received in total
(users can rate according to their perception from 1 to 5, with 1 being “very bad” and
5 “very good”).
• Rating count (ratings_count): Total number of ratings received by the book.
• Comment count (text_reviews_count): Total number of comments the book received.
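As an illustration of this feature selection, the following Python sketch keeps the three
variables, drops records with missing values, and rescales each variable. It is not the R
code used in the study: the inline CSV excerpt is invented, and z-score standardization is
an assumed choice, since the normalization method is not specified.

```python
import csv
import io
from statistics import mean, pstdev

FEATURES = ["average_rating", "ratings_count", "text_reviews_count"]

# Hypothetical excerpt standing in for the Goodreads file; all values are invented.
RAW = """bookID,title,average_rating,ratings_count,text_reviews_count,isbn
1,Book A,4.5,2000,150,0000000001
2,Book B,3.9,,12,0000000002
3,Book C,4.1,500,40,0000000003
4,Book D,3.2,20,3,0000000004
"""

def load_features(text):
    """Keep only the three features and drop rows with a missing value."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        values = [record[name].strip() for name in FEATURES]
        if all(values):
            rows.append([float(v) for v in values])
    return rows

def standardize(rows):
    """Z-score each column (assumed normalization; requires non-constant columns)."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

rows = load_features(RAW)
scaled = standardize(rows)
print(len(rows), "records kept")
```

Rescaling matters here because the rating counts span several orders of magnitude more than
the average rating, and distance-based methods such as K-Means would otherwise be dominated
by the counts.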
The perception of users on this platform is that the recommendations made are often
based on similar authors, so emerging authors with excellent ratings are ignored. The type
of recommendation proposed is aimed at the marketing team, to encourage the promotion
of little-known writers with great potential in topics of interest to the user. We
therefore propose that recommendations of this type, which benefit authors and readers
by offering a greater diversity of books, be incorporated into marketing strategies.
As mentioned, emerging authors are overshadowed by famous authors. Figure 1 shows
the top ten famous authors on Goodreads. This violin chart illustrates the relationship
between authors and their book scores. The scores given to the author
Rumiko Takahashi are highly concentrated around the median, which means that the majority
of people who read her books agree that they are good. The scores given to
Stephen King are more spread out, so the ratings present a high standard deviation. In
other words, there are people who did not like his books and rated them low, and there are
people who loved them and rated them high. All of this supports the idea that if there is
a group of well-rated authors, they deserve the opportunity to be known.
Fig. 1. Top ten famous authors in Goodreads
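The reading of the violin chart, a concentrated versus a spread-out distribution, can be
summarized numerically with the median and the standard deviation. The sketch below uses
invented ratings for two hypothetical authors; it does not reproduce the data behind Fig. 1.

```python
from statistics import median, stdev

# Hypothetical average ratings of books by two authors (invented values).
book_ratings = {
    "Author A": [4.10, 4.12, 4.15, 4.11, 4.13],
    "Author B": [3.20, 4.60, 3.80, 4.40, 3.50],
}

def rating_summary(scores):
    """Return the median and sample standard deviation of a list of ratings."""
    return {"median": median(scores), "std": stdev(scores)}

summaries = {author: rating_summary(s) for author, s in book_ratings.items()}
for author, summary in summaries.items():
    print(f"{author}: median={summary['median']:.2f}, std={summary['std']:.2f}")
```

A small standard deviation corresponds to a violin that is wide around the median, while a
large one corresponds to a long, thin violin.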
Figure 2 shows the proposed method. The data are grouped by their
characteristics, such as the score and the counts of ratings and comments. The score
received reflects the quality of the work, while the rating count and comment count
reflect the author's popularity. The cluster is then assigned to the records in the
original database. The cluster to target is identified, and the promotional strategies
are generated. Finally, classification techniques are used to evaluate new data received
and thus maintain the system.
Fig. 2. The workflow of the proposed method.
4 Results and Discussion
For this study, the R language was used. The data were pre-processed by selecting
the features described in Sect. 3, removing missing values, and normalizing the
data. The percentage of variance explained as a function of the number of
clusters is analyzed [18]. The variation in the number of clusters is evaluated with the
within-cluster sum of squared residuals, as shown in Fig. 3. We determine the optimal
number using the Silhouette Coefficient presented by [19]. For each observation i, the
silhouette coefficient (s_i) is obtained as follows:
1. Calculate the average of the distances (a_i) between observation i and the rest of the
observations that belong to the same cluster. The smaller a_i, the better the assignment
of i to its cluster.
2. Calculate the average distance between observation i and each of the other clusters,
where the average distance between i and a given cluster K is the mean of
the distances between i and the observations in K.
3. Identify as b_i the smallest of the average distances between i and the rest of the
clusters, that is, the distance to the nearest cluster (neighboring cluster).
4. Calculate the value of the silhouette, as shown in Eq. 1.
$$ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \qquad (1) $$
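The four steps and Eq. 1 can be sketched directly in Python. The points and labels below are
invented for illustration, Euclidean distance is assumed, and the value 0 for an observation
that is alone in its cluster follows the usual convention; the sketch requires at least two
clusters.

```python
from math import dist

def silhouette(i, points, labels):
    """Silhouette coefficient s_i of observation i, following steps 1 to 4."""
    own = labels[i]
    # Step 1: mean distance a_i to the other members of the same cluster.
    same = [dist(points[i], p) for j, p in enumerate(points)
            if labels[j] == own and j != i]
    if not same:
        return 0.0  # convention for an observation alone in its cluster
    a_i = sum(same) / len(same)
    # Steps 2 and 3: mean distance to every other cluster; b_i is the smallest.
    b_i = min(
        sum(dist(points[i], p) for p, lab in zip(points, labels) if lab == k)
        / labels.count(k)
        for k in set(labels) if k != own
    )
    # Step 4: Eq. 1.
    return (b_i - a_i) / max(a_i, b_i)

# Two invented, well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
scores = [silhouette(i, points, labels) for i in range(len(points))]
print(sum(scores) / len(scores))
```

Values close to 1 indicate that an observation is much closer to its own cluster than to the
nearest neighboring one; values near 0 or below indicate a doubtful assignment.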
The average silhouette method takes as the optimal number of clusters the one that
maximizes the mean silhouette coefficient over all observations. Here, the maximum
Average Silhouette Width is obtained for K = 7 clusters (see Fig. 3).
Fig. 3. Elbow method and average silhouette for selection of cluster number.
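A scan of this kind over candidate values of K can be sketched with scikit-learn, used here
as a stand-in for the R tooling of the study. The data are synthetic (three planted groups),
so the selected value is not comparable with the K = 7 reported above; `scan_k` is a
hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the normalized book features: three compact groups.
centers = np.array([[0.0, 0.0, 0.0], [6.0, 6.0, 0.0], [0.0, 6.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

def scan_k(X, k_values):
    """Return {k: (within-cluster sum of squares, average silhouette width)}."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results

results = scan_k(X, range(2, 8))
# Average silhouette criterion: keep the k with the largest mean silhouette.
best_k = max(results, key=lambda k: results[k][1])
print(best_k)
```

The first element of each tuple is the quantity plotted in an elbow chart, and the second is
the average silhouette width, so both panels of a figure like Fig. 3 can be drawn from
`results`.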
After choosing the optimal number of clusters, it is possible to carry out an adequate
classification of the books, which will be divided into seven categories.
Figure 4 shows the resulting clusters for the Goodreads database. There is a large
concentration of books with a high average rating. In addition, the rating count and the
text review count, which represent the number of people who evaluated the books, are
positively correlated.
Fig. 4. Goodreads book clusters
Clusters are shown in Fig. 5 for average rating and ratings count. The
aim is to identify the cluster of well-rated books that are little known or evaluated on
the platform. There are very popular books with high average ratings, such as those
belonging to cluster 3. At the same time, there are very little-known books with low
ratings, as in the case of cluster 1. The cluster of interest is cluster 6, which
presents a range of ratings similar to that of very popular books, but whose books are
little commented or evaluated.
Fig. 5. Identification of the target group.
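In the study the target group was identified visually. As a complement, a cluster profile
can be computed to support that reading. The sketch below is one possible rule, not the
criterion used in the study: the records are invented, and the rating threshold of 4.0 and
the helper names are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (cluster, average_rating, ratings_count) records; values are invented.
books = [
    (1, 2.9, 40), (1, 3.1, 25),
    (3, 4.4, 900000), (3, 4.5, 1200000),
    (6, 4.3, 3000), (6, 4.4, 5000),
]

def cluster_profiles(books):
    """Mean rating and mean rating count per cluster."""
    grouped = defaultdict(list)
    for cluster, rating, count in books:
        grouped[cluster].append((rating, count))
    return {
        c: (mean(r for r, _ in rows), mean(n for _, n in rows))
        for c, rows in grouped.items()
    }

def pick_target(profiles, min_rating=4.0):
    """Among well-rated clusters, choose the one with the fewest ratings."""
    well_rated = {c: p for c, p in profiles.items() if p[0] >= min_rating}
    return min(well_rated, key=lambda c: well_rated[c][1])

profiles = cluster_profiles(books)
target = pick_target(profiles)
print(target)
```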
In this way, the target cluster that should be prioritized has been visually identified.
For the present study, the two characteristics used for identification represented the
popularity and the quality of the work. Other possible approaches include visualizing
the data in multiple dimensions and using a factor rating system. Table 1 shows some
examples of books belonging to each cluster, along with the characteristics of each one.
Table 1. Example of books by clusters.
| Book category | Description | Title |
|---|---|---|
| Cluster 1 | Low rating count number, Low rating average | Out to Eat London 2002 (Lonely Planet Out to Eat); Juiced Official Strategy Guide; Open City 6: The Only Woman He Ever Left |
| Cluster 2 | Low rating count number, High rating average | Harry Potter and the Chamber of Secrets (Harry Potter #2); Harry Potter Boxed Set Books 1-5 (Harry Potter #1-5); Harry Potter Collection (Harry Potter #1-6) |
| Cluster 3 | High rating count number, High rating average | Harry Potter and the Half-Blood Prince (Harry Potter #6); Harry Potter and the Order of the Phoenix (Harry Potter #5); Harry Potter and the Prisoner of Azkaban (Harry Potter #3) |
| Cluster 4 | Low rating count number, Medium rating average | Bill Bryson’s African Diary; Hatchet Jobs: Writings on Contemporary Fiction; Changeling (Changeling #1) |
| Cluster 5 | Medium rating count number, High rating average | Atlas Shrugged; Memoirs of a Geisha; Snow Flower and the Secret Fan |
| Cluster 6 | Medium rating count number, High rating average | The Ultimate Hitchhiker’s Guide to the Galaxy (Hitchhiker’s Guide to the Galaxy #1-5); A Short History of Nearly Everything; In a Sunburned Country |
| Cluster 7 | Low rating count number, High rating average | Unauthorized Harry Potter Book Seven News: Half-Blood Prince Analysis and Speculation; Bryson’s Dictionary of Troublesome Words: A Writer’s Guide to Getting It Right; I’m a Stranger Here Myself: Notes on Returning to America After Twenty Years Away |
With clusters assigned to each book, it is possible to process new data. The aim
is to determine when a book moves from one cluster to another. We are
especially interested in when a book enters cluster 6, so that it can be considered in
the promotion strategy. For this purpose, k-fold cross-validation was performed with
k = 10 on the labeled examples of the dataset for four models: Random Forests, Support
Vector Machines (svm), KNN, and Extreme Gradient Boosting (xgboost). The performance
measures used to choose the best model are Classification Accuracy (acc), Balanced
Accuracy (bacc), and Classification Error (ce). The results are presented in Fig. 6.
The Support Vector Machines model shows, on average, the best maximum and mean
performance and, in general, offers a good relationship between
the minimum and maximum values for the different performance measures.
Fig. 6. Model performance.
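A benchmark of this kind can be sketched with scikit-learn as follows. This is not the R
code of the study: the data are synthetic, default hyperparameters are assumed, and
scikit-learn's `GradientBoostingClassifier` is swapped in for xgboost, which is a separate
package. Classification error is computed as 1 minus accuracy.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: three features per "book" and a cluster label as the target.
centers = [[0.0, 0.0, 0.0], [6.0, 6.0, 0.0], [0.0, 6.0, 6.0]]
X, y = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Support Vector Machines": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    # Stand-in for Extreme Gradient Boosting (assumption, see lead-in).
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

def benchmark(models, X, y, folds=10):
    """k-fold cross-validation; returns mean test acc, bacc and ce per model."""
    table = {}
    for name, model in models.items():
        cv = cross_validate(model, X, y, cv=folds,
                            scoring=("accuracy", "balanced_accuracy"))
        acc = cv["test_accuracy"].mean()
        bacc = cv["test_balanced_accuracy"].mean()
        table[name] = {"acc": acc, "bacc": bacc, "ce": 1.0 - acc}
    return table

table = benchmark(models, X, y)
for name, row in table.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

Balanced accuracy averages the recall of each class, so it is the more informative measure
when some clusters contain far fewer books than others.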
To find out which model performed better in all tasks simultaneously, performance
statistics were calculated for each one. The positions occupied by each model according
to Table 2 are shown in Table 3. In this work, the Support Vector Machines model
obtains the best evaluation on the test set (rank 1) and the second position
on the train set (rank 2). In contrast, the Extreme Gradient Boosting model
obtains the worst evaluation on both the train and test sets (rank 4 in both).
Figure 7 shows the classification of books using the Support Vector Machines model,
from the perspective of average rating and ratings count. In this way, the items (books)
on the platform that can be prioritized for promotional activities can be identified
through the proposed methodology. These activities can take the form of greater
visibility on the platform or a larger number of recommendations to users, depending
on their interests.
This approach can increase the diversity of options offered, which can promote
further exploration of the platform. As verified in Table 2, the predictive model has a
high hit rate. So, for system maintenance, this classification model can be used to
identify candidates for these promotions. This information can be used by a
recommendation system that takes into account content diversification and other
variables such as common characteristics of users, products, and the user’s interests.
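The maintenance step described above, classifying new records and flagging those that fall
in the target cluster, can be sketched as follows. The data are synthetic, the label
`TARGET_CLUSTER = 1` and the helper `promotion_candidates` are hypothetical, and
scikit-learn's `SVC` with default settings is assumed in place of the R model.

```python
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

TARGET_CLUSTER = 1  # hypothetical label of the cluster to promote

# Synthetic stand-in for books that already carry a cluster label.
centers = [[0.0, 0.0, 0.0], [6.0, 6.0, 0.0], [0.0, 6.0, 6.0]]
X, clusters = make_blobs(n_samples=300, centers=centers,
                         cluster_std=1.0, random_state=0)

# Scaling and the classifier are fitted together on the labeled records.
classifier = make_pipeline(StandardScaler(), SVC()).fit(X, clusters)

def promotion_candidates(new_books):
    """Return indices of new records predicted to fall in the target cluster."""
    predicted = classifier.predict(new_books)
    return [i for i, c in enumerate(predicted) if c == TARGET_CLUSTER]

# Invented new records: the first lies near group 0, the other two near group 1.
new_books = [[0.2, -0.1, 0.3], [5.8, 6.1, 0.2], [6.3, 5.7, -0.4]]
print(promotion_candidates(new_books))
```

Running the same check periodically on updated rating counts would reveal when a book moves
into the target cluster.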
Table 2. Measures of each model for the test and training data set.
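| Model | Train acc | Train bacc | Train ce | Test acc | Test bacc | Test ce |
|---|---|---|---|---|---|---|
| Random Forest | 1.000 | 1.000 | – | 0.997 | 0.960 | 0.003 |
| Support Vector Machines | 0.999 | 0.993 | 0.001 | 0.998 | 0.981 | 0.002 |
| KNN | 0.999 | 0.988 | 0.001 | 0.995 | 0.965 | 0.005 |
| Extreme Gradient Boosting | 0.998 | 0.982 | 0.002 | 0.994 | 0.937 | 0.006 |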
Table 3. Positions occupied by each model according to their average performance.
| Model | Rank (Train) | Rank (Test) |
|---|---|---|
| Support Vector Machines | 2 | 1 |
| Random Forests | 1 | 2 |
| KNN | 3 | 3 |
| Extreme Gradient Boosting | 4 | 4 |
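The exact rule used to turn the measures of Table 2 into the positions of Table 3 is not
stated. One plausible reading, sketched below, ranks the models on each measure (ties share
the mean of their ranks) and then orders them by their mean rank; under the additional
assumption that the missing train ce of Random Forest is 0.000, this reproduces Table 3.

```python
# Values transcribed from Table 2 as (acc, bacc, ce). The missing train ce of
# Random Forest is taken as 0.000, an assumption consistent with acc = 1.000.
TRAIN = {
    "Random Forest": (1.000, 1.000, 0.000),
    "Support Vector Machines": (0.999, 0.993, 0.001),
    "KNN": (0.999, 0.988, 0.001),
    "Extreme Gradient Boosting": (0.998, 0.982, 0.002),
}
TEST = {
    "Random Forest": (0.997, 0.960, 0.003),
    "Support Vector Machines": (0.998, 0.981, 0.002),
    "KNN": (0.995, 0.965, 0.005),
    "Extreme Gradient Boosting": (0.994, 0.937, 0.006),
}

def average_ranks(scores, higher_is_better):
    """Rank models on one measure; tied models share the mean of their ranks."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    for name in scores:
        tied = [pos + 1 for pos, other in enumerate(ordered)
                if scores[other] == scores[name]]
        ranks[name] = sum(tied) / len(tied)
    return ranks

def overall_rank(table):
    """Order models by their mean rank over acc, bacc (higher) and ce (lower)."""
    per_measure = [
        average_ranks({m: v[0] for m, v in table.items()}, True),
        average_ranks({m: v[1] for m, v in table.items()}, True),
        average_ranks({m: v[2] for m, v in table.items()}, False),
    ]
    mean_rank = {m: sum(r[m] for r in per_measure) / 3 for m in table}
    ordered = sorted(mean_rank, key=mean_rank.get)
    return {m: ordered.index(m) + 1 for m in table}

print(overall_rank(TRAIN))
print(overall_rank(TEST))
```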
Fig. 7. Support Vector Machines in Goodreads book ranking.
5 Conclusions and Future Work
The results obtained in the study serve as a decision-making tool that can benefit the
Goodreads community. We propose that developers place greater emphasis on the group
of books that are not very popular but are highly rated, using marketing strategies to
make them better known and to help these writers position themselves.
The presence of authors as users on the platform could increase if they achieve their
objective: to reach a critical mass of readers. Goodreads developers can support authors
who produce good content but are not yet widely recognized. With this, readers
and investors will also benefit. The former will have new books in their reading
suggestions that are recommended for their excellent ratings and that can offer them
a different experience. The latter will obtain greater profits from the additional
traffic generated by users who are motivated to stay on Goodreads and
by new users who come to the platform.
Future Work
Future work should consider readers’ dissatisfaction with specific recommendations. When
the scanning option is enabled on the platform, the possibility of implementing a hybrid
model that combines collaborative and content-based filtering with the
proposed prioritization system can be evaluated.
References
1. Shatzkin, M., Riger, R.: The Book Business, 1st edn. Oxford University Press, New York
(2019)
2. Rana, A., Deeba, K.: Online book recommendation system using collaborative filtering (with
Jaccard similarity). Nano Sci. J. Phys. Conf. Ser. 1362, 012130 (2019).
https://doi.org/10.1088/1742-6596/1362/1/012130
3. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a survey
of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 17(6), 734–749
(2005). https://doi.org/10.1109/TKDE.2005.99
4. Resnick, P., Iakovou, N.: GroupLens: an open architecture for collaborative filtering of
netnews. In: Computer Supported Cooperative Work Conference (1994)
5. Hill, W., Stead, L., Rosenstein, M., Furnas, G.: Recommending and evaluating choices in a
virtual community of use. In: Proceedings of Conference on Human Factors in Computing
Systems (1995)
6. Sarwar, B., Karypis, G., Konstan, J.: Item-based collaborative filtering recommendation
algorithms. In: Proceedings of 10th International WWW Conference (2001)
7. Lang, K.: Newsweeder: learning to filter netnews. In: Proceedings of 12th International
Conference Machine Learning (1995)
8. Balabanovic, M., Shoham, Y.: Fab: content-based collaborative recommendation. Comm.
ACM 40(3), 66–72 (1997)
9. Pazzani, M., Billsus, D.: Learning and revising user profiles: the identification of interesting
web sites. Mach. Learn. 27, 313–331 (1997)
10. Claypool, M., Gokhale, A., Miranda, T.: Combining content-based and collaborative filters
in an online newspaper. In: Proceedings of ACM SIGIR 1999 Workshop Recommender
11. Tran, T., Cohen, R.: Hybrid recommender systems for electronic commerce. In: Proceedings
of Knowledge-Based Electronic Markets. Papers from the AAAI Workshop, Technical report
WS-00-04, AAAI Press (2000)
12. Melville, P., Mooney, R.: Content-boosted collaborative filtering for improved recommendations. In: Proceedings of 18th National Conference Artificial Intelligence (2002)
13. Liu, Q., Chen, E., Xiong, H., Ding, C.H.Q., Chen, J.: Enhancing collaborative filtering by user
interest expansion via personalised ranking. IEEE Trans. Syst. Man Cybern. Part B Cybern.
42(1), 2012 (2012)
14. Strub, F., Gaudel, R., Mary, J.: Hybrid recommender system based on autoencoders. In:
Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, ACM, pp. 11–
16 (2016)
15. Zhang, S., Yao, L., Sun, A.: Deep learning based recommender system: a survey and new
perspectives. arXiv preprint arXiv:1707.07435 (2017)
16. Feng, J., Fengs, X., Zhang, N., Peng, J.: An improved collaborative filtering method based
on similarity. PLoS ONE 13(9), e0204003 (2018)
17. Kaggle (2020). https://www.kaggle.com/jealousleopard/goodreadsbooks
18. Bholowalia, P., Kumar, A.: EBK-means: a clustering technique based on elbow method and
k-means in WSN. Int. J. Comput. Appl. 105(9), 17–24 (2014)
19. Rousseeuw, P.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987). https://doi.org/10.1016/0377-0427(87)90125-7. ISSN 0377-0427