Predictive Analysis and Data Visualization Approach for Decision Processes in Marketing Strategies: A Case of Study

Conference Paper · October 2020 · DOI: 10.1007/978-3-030-61834-6_6

Andrés García-Pérez, María Alejandra Millán Hernández, and Daniela E. Castellón Marriaga

Programa de Ingeniería Industrial, Universidad Tecnológica de Bolívar, Cartagena de Indias, Colombia
agarcia@utb.edu.co, marialeja-9805@hotmail.com, castellonmarriga@gmail.com

Abstract. In this paper, we propose a new strategy for recommender systems in online entertainment platforms. As a case study, we analyzed the reading preferences of users of Goodreads, a social network for readers, to classify books according to variables such as average rating, rating count, and text review count. Multivariate cluster analysis and benchmarking for the comparison of predictive models were used. Graphs and data are presented that allow the number of clusters and the precision of the models to be evaluated. Finally, we show the existence of groups of items that can be overlooked by traditional recommendation systems because of their low visibility on the platform.
We propose using promotional strategies to highlight these high-quality items that have little visibility. Overall, the classification of books that predictive models can offer may benefit the authors, readers, and investors of Goodreads through the retention and attraction of users.

Keywords: Machine Learning · Predictive analytics · Data visualization · Recommender systems · Marketing strategies

1 Introduction

The emergence of digital reading with the arrival of electronic books has transformed the relationship between author and reader. [1] describes the challenges facing the publishing industry today, noting that the industry requires high capital investment to cover the costs of publishing and promoting books in physical format, while the low prices of electronic publishing remove barriers to entry. These lower costs make e-books affordable and give them a competitive advantage over the traditional publishing industry. Digital publication also encompasses another scenario: self-publishing, in which authors publicize their books on the internet without the intervention of an editor, making them easily accessible to readers.

Social networks have strengthened the author-reader relationship, promoting feedback between them and expanding the possibilities for exchanging ideas. Likewise, readers from any part of the world can meet on digital platforms in spaces such as virtual reading clubs, where books are discussed and criticized and, above all, book recommendations are made.

© Springer Nature Switzerland AG 2020
J. C. Figueroa-García et al. (Eds.): WEA 2020, CCIS 1274, pp. 60–70, 2020. https://doi.org/10.1007/978-3-030-61834-6_6

Goodreads is a social network for readers, born as a proposal for readers who wanted a virtual library of the books they had read, the ones they were reading, and those they wanted to read.
Among its basic functions, it lets users add ratings and reviews to the books they have read and make friends with other readers who have similar literary preferences. The platform has around 90 million users. It was sold to Amazon six years after its launch, and it helps authors make themselves known through giveaways of newly released books. Because of these features, Goodreads draws attention not only from readers but also from authors, who establish links with readers, get feedback on their books, learn how popular they are in the community, and can assess their positioning as writers.

To keep readers up to date on the latest books, the books of greatest interest need to be selected and filtered. Recommendation systems are integrated for this purpose: they study people's current preferences to predict future ones, saving them time and generating higher conversion rates. Various ways of creating generic recommendation systems have been proposed [2]. Collaborative filtering, for example, assumes that if one user has rated two books, then another user who has read one of them can be recommended the other. [3] describes how Amazon and Netflix use collaborative filtering algorithms to offer quality recommendations to their users. This method is intended to generate a list of similar items.

However, in the experience of users, recommendations are based only on the reader's previous interests, regardless of the great variety of other options that this type of platform can offer. This opens up the possibility of creating user interest lists that promote low-visibility authors and keep readers more interested in the recommendations. The present work presents a new approach to recommendation systems on virtual entertainment platforms.
Our proposal uses data visualization and machine learning techniques to create experimental lists of items that, together with the inclusion of other features, can generate a higher conversion rate and maintain user traffic on the platform. This methodology is applied to the Goodreads case study to identify lists of books to be prioritized when trying to generate innovative recommendations that favor Goodreads authors, readers, and investors.

2 Related Works

Recommendation systems are based on the relationships between products or users. Two main techniques are used for this task: collaborative filtering and content-based recommendation. [4–6] based their studies on collaborative filtering, using the feedback collected from users on products and algorithms such as KNN to find similarities between items. In contrast, [7–9] used content-based recommendation, focusing on the description of the products to make the recommendations and applying clustering algorithms in their research. [10–12] suggest combining these two systems in a hybrid approach that uses techniques from both recommenders to try to cover their deficiencies. [13] proposes adding a third level, user interests, to the typical characteristics of users and products. In this way, over-specialization is mitigated and the recommendations become more personalized.

More recently, works have used Artificial Neural Networks to learn user or product features. This technique has also been used to recover missing values such as ratings and to develop hybrid recommender systems [14, 15]. [16] suggests combining three similarity factors (users, rating choice weight, and the ratio of co-rated items) to avoid relying solely on co-rated items, obtaining good results in minimizing the deviation of the similarity calculation.
From the works analyzed, it is possible to observe an interest in generating excitement among users about the recommendations made, which can produce higher conversion rates. On entertainment platforms, users can become bored because the recommendations are always similar. We therefore intend to start a discussion on including, in recommendation systems, lists that allow users to explore the wide range of options an entertainment platform can offer without losing quality.

3 Methodology

The study used the K-Means clustering technique and benchmarking to compare supervised Machine Learning techniques. The analyzed data were taken from the Goodreads dataset stored in [17], which has 11127 records of books read by platform users up to 2019. To classify the books, we consider the preferences of the readers. Three characteristics that represent the most significant variability among the books were chosen to carry out the segmentation. Features such as the identifier code and ISBN were left out. The variables considered were:

• Average rating (average_rating): The average rating that the book has received in total (users can rate according to their perception from 1 to 5, with 1 being “very bad” and 5 “very good”).
• Rating count (ratings_count): Total number of ratings received by the book.
• Comment count (text_reviews_count): Total number of comments the book received.

Users of this platform perceive that the recommendations are often based on similar authors, so growing authors with excellent ratings are ignored. The type of recommendation proposed is aimed at the marketing team and seeks to promote little-known writers who have great potential in topics of interest to the user. We propose that these recommendations, which benefit authors and readers by offering a greater diversity of books, be incorporated into marketing strategies.
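The selection of the three variables listed above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the study used R, while Python is used here only for the example. The sample records are made up, the `bookID` and `isbn` field names are assumed, and the z-score standardization is an assumption as well, since the text states only that the data were normalized.

```python
from statistics import mean, pstdev

# The three variables used for the segmentation.
FEATURES = ("average_rating", "ratings_count", "text_reviews_count")

# Made-up records for illustration only; real rows come from the dataset in [17].
books = [
    {"bookID": 1, "isbn": "0000000001", "average_rating": 4.5,
     "ratings_count": 2000, "text_reviews_count": 150},
    {"bookID": 2, "isbn": "0000000002", "average_rating": 3.0,
     "ratings_count": 10, "text_reviews_count": 2},
    {"bookID": 3, "isbn": "0000000003", "average_rating": None,
     "ratings_count": 5, "text_reviews_count": 0},
    {"bookID": 4, "isbn": "0000000004", "average_rating": 4.2,
     "ratings_count": 40, "text_reviews_count": 6},
]


def select_features(records):
    """Keep only the three variables and drop records with missing values."""
    rows = []
    for rec in records:
        values = [rec.get(name) for name in FEATURES]
        if all(v is not None for v in values):
            rows.append([float(v) for v in values])
    return rows


def standardize(rows):
    """Z-score each column (assumed normalization): (x - mean) / std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against zero variance
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


X = select_features(books)   # identifier and ISBN are left out
X_scaled = standardize(X)    # columns now have mean 0 and unit variance
```

Standardizing matters here because ratings_count spans several orders of magnitude more than average_rating, and a distance-based method such as K-Means would otherwise be dominated by the count variables.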
As noted, growing authors are overshadowed by famous authors. Figure 1 shows the top ten famous authors on Goodreads. This violin chart illustrates the relationship between authors and their book scores. The scores given to Rumiko Takahashi are highly concentrated around the median, which means that most people who read her books agree that they are good. The scores given to Stephen King are more spread out, so his ratings have a higher standard deviation. In other words, some readers disliked his books and rated them low, while others loved them and rated them high. All of this supports the idea that if there is a group of well-rated authors, they deserve the opportunity to be known.

Fig. 1. Top ten famous authors in Goodreads

Figure 2 shows the proposed method. The data are grouped by their characteristics: the score and the counts of ratings and comments. The score reflects the quality of the work, while the rating count and comment count reflect the popularity of the author. The resulting cluster is then assigned to each record in the original database. The cluster to target is identified, and the promotional strategies are generated. Finally, classification techniques are used to evaluate new incoming data and thus maintain the system.

Fig. 2. The workflow of the proposed method.

4 Results and Discussion

For this study, the R language was used. The data were pre-processed: the characteristics described in Sect. 3 were selected, the missing values were removed, and the data were normalized. The percentage of variance explained as a function of the number of clusters was analyzed [18]. The effect of varying the number of clusters was evaluated with the sum of squares of the residuals (the within-cluster sum of squares), as shown in Fig. 3. We determined the optimal number using the Silhouette Coefficient presented by [19].
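The two selection criteria mentioned here can be illustrated with a small sketch: the within-cluster sum of squares behind the elbow plot, and the silhouette coefficient of [19], whose steps are detailed in the following paragraphs. The study used R; this Python version is only an illustration. The six toy points and the two hand-written candidate partitions are made up and stand in for the output of K-Means at K = 2 and K = 3.

```python
from math import dist


def wss(points, labels):
    """Within-cluster sum of squared distances to each cluster centroid."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(axis) / len(members) for axis in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total


def silhouette(points, labels, i):
    """Silhouette of observation i: (b_i - a_i) / max(a_i, b_i)."""
    own = [j for j, l in enumerate(labels) if l == labels[i] and j != i]
    if not own:  # common convention: a singleton cluster gets s_i = 0
        return 0.0
    # a_i: mean distance to the other members of the same cluster
    a = sum(dist(points[i], points[j]) for j in own) / len(own)
    # b_i: smallest mean distance to the members of any other cluster
    mean_to_other = []
    for c in set(labels):
        if c == labels[i]:
            continue
        idx = [j for j, l in enumerate(labels) if l == c]
        mean_to_other.append(sum(dist(points[i], points[j]) for j in idx) / len(idx))
    b = min(mean_to_other)
    return (b - a) / max(a, b)


def mean_silhouette(points, labels):
    """Average silhouette width over all observations."""
    return sum(silhouette(points, labels, i) for i in range(len(points))) / len(points)


# Toy data: three well-separated pairs of points (made up for illustration).
points = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 0), (10, 1)]
labels_k2 = [0, 0, 0, 0, 1, 1]  # hypothetical partition with K = 2
labels_k3 = [0, 0, 1, 1, 2, 2]  # hypothetical partition with K = 3

for name, labels in (("K=2", labels_k2), ("K=3", labels_k3)):
    print(name, round(wss(points, labels), 3), round(mean_silhouette(points, labels), 3))
```

On this toy data the partition with K = 3 has both the lower within-cluster sum of squares and the higher average silhouette, which is the pattern the selection procedure looks for.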
For each observation $i$, the silhouette coefficient ($s_i$) is obtained as follows:

1. Calculate the average distance ($a_i$) between observation $i$ and the rest of the observations that belong to the same cluster. The smaller $a_i$, the better the assignment of $i$ to its cluster.
2. Calculate the average distance between observation $i$ and each of the other clusters, where the average distance between $i$ and a given cluster $K$ is the mean of the distances between $i$ and the observations in $K$.
3. Identify as $b_i$ the smallest of the average distances between $i$ and the rest of the clusters, that is, the distance to the nearest cluster (neighboring cluster).
4. Calculate the silhouette value as shown in Eq. 1.

$$ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{1} $$

The average silhouette method takes as the optimal number of clusters the one that maximizes the mean silhouette coefficient over all the observations. The maximum Average Silhouette Width (see Fig. 3) is obtained for K = 7 clusters.

Fig. 3. Elbow method and average silhouette for selection of cluster number.

After choosing the optimal number of clusters, the books can be classified into seven categories. Figure 4 shows the resulting clusters for the Goodreads database. There is a large concentration of books with a high average rating. In addition, the rating count and the text review count, which represent the number of people who evaluated the books, are positively correlated.

Fig. 4. Goodreads book clusters

Figure 5 shows the clusters for average rating and ratings count. The aim is to identify the cluster of well-rated books that are little known or little evaluated on the platform. There are very popular books with high average ratings, such as those belonging to cluster 3.
At the same time, there are little-known books with low ratings, as in the case of cluster 1. The cluster of interest is cluster 6, which has a range of ratings similar to that of very popular books but whose books are little commented on or evaluated.

Fig. 5. Identification of the target group.

In this way, the target cluster that should be prioritized has been identified visually. In the present study, the two characteristics used for identification represent popularity and quality of work. Other possible approaches include visualizing the data in multiple dimensions and factor rating systems. Table 1 shows examples of books belonging to each cluster along with the characteristics of each cluster.

Table 1. Example of books by clusters.

| Book category | Description | Titles |
|---|---|---|
| Cluster 1 | Low rating count, low average rating | Out to Eat London 2002 (Lonely Planet Out to Eat); Juiced Official Strategy Guide; Open City 6: The Only Woman He Ever Left |
| Cluster 2 | Low rating count, high average rating | Harry Potter and the Chamber of Secrets (Harry Potter #2); Harry Potter Boxed Set Books 1-5 (Harry Potter #1-5); Harry Potter Collection (Harry Potter #1-6) |
| Cluster 3 | High rating count, high average rating | Harry Potter and the Half-Blood Prince (Harry Potter #6); Harry Potter and the Order of the Phoenix (Harry Potter #5); Harry Potter and the Prisoner of Azkaban (Harry Potter #3) |
| Cluster 4 | Low rating count, medium average rating | Bill Bryson’s African Diary; Hatchet Jobs: Writings on Contemporary Fiction; Changeling (Changeling #1) |
| Cluster 5 | Medium rating count, high average rating | Atlas Shrugged; Memoirs of a Geisha; Snow Flower and the Secret Fan |
| Cluster 6 | Medium rating count, high average rating | The Ultimate Hitchhiker’s Guide to the Galaxy (Hitchhiker’s Guide to the Galaxy #1-5); A Short History of Nearly Everything; In a Sunburned Country |
| Cluster 7 | Low rating count, high average rating | Unauthorized Harry Potter Book Seven News: Half-Blood Prince Analysis and Speculation; Bryson’s Dictionary of Troublesome Words: A Writer’s Guide to Getting It Right; I’m a Stranger Here Myself: Notes on Returning to America After Twenty Years Away |

With a cluster assigned to each book, new data can be processed. The goal is to determine when a book moves from one cluster to another. We are especially interested in books that enter cluster 6, so that they can be considered in the promotion strategy. For this purpose, k-fold cross-validation with k = 10 was performed on the set of examples (the dataset) for four models: Random Forests, Support Vector Machines (svm), KNN, and Extreme Gradient Boosting (xgboost). The performance measures used to choose the best model are Classification Accuracy (acc), Balanced Accuracy (bacc), and Classification Error (ce). The results are presented in Fig. 6. The Support Vector Machines model shows the best maximum and average values and, in general, a good balance between the minimum and maximum values for the different performance measures.

Fig. 6. Model performance.

To find out which model performed best on all measures simultaneously, performance statistics were calculated for each one. The positions occupied by each model, according to Table 2, are shown in Table 3. In this work, the Support Vector Machines model obtains the best evaluation on the test set (rank 1) and the second position on the training set (rank 2). In contrast, the Extreme Gradient Boosting model obtains the worst evaluation on both the training and test sets (rank 4 in both). Figure 7 shows the classification of books using the Support Vector Machines model, from the perspective of average rating and ratings count.
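The benchmarking procedure described above (k-fold cross-validation scored with acc, bacc, and ce) can be sketched as follows. This is an illustration under stated assumptions, not the authors' R code: it uses a hand-written KNN, the simplest of the four models, deterministic interleaved folds instead of random ones, and made-up toy data with two hypothetical labels ("target" and "other") in place of the seven cluster labels.

```python
from collections import Counter
from math import dist


def accuracy(y_true, y_pred):
    """Fraction of correct predictions (acc); ce is 1 - acc."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true (bacc)."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points."""
    nearest = sorted(range(len(train_X)), key=lambda j: dist(train_X[j], x))[:k]
    return Counter(train_y[j] for j in nearest).most_common(1)[0][0]


def cross_validate(X, y, folds=10, k=3):
    """Hold out each fold once; return the mean (acc, bacc, ce) over folds."""
    scores = []
    for f in range(folds):
        # Assumed fold scheme: observation i belongs to fold i % folds.
        test_idx = [i for i in range(len(X)) if i % folds == f]
        train_idx = [i for i in range(len(X)) if i % folds != f]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        y_true = [y[i] for i in test_idx]
        y_pred = [knn_predict(train_X, train_y, X[i], k) for i in test_idx]
        acc = accuracy(y_true, y_pred)
        scores.append((acc, balanced_accuracy(y_true, y_pred), 1 - acc))
    return tuple(sum(s[m] for s in scores) / folds for m in range(3))


# Toy data: two well-separated groups (made up for illustration).
X = [(0.0 if i % 2 == 0 else 10.0, 0.1 * i) for i in range(20)]
y = ["target" if i % 2 == 0 else "other" for i in range(20)]

acc, bacc, ce = cross_validate(X, y, folds=10, k=3)
print(acc, bacc, ce)
```

Balanced accuracy is reported alongside plain accuracy because the clusters differ greatly in size, so a model could score a high acc while misclassifying most books in a small cluster such as the target one.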
In this way, the items (books) on the platform that can be prioritized for promotional activities can be identified through the proposed methodology. These activities can take the form of more visibility on the platform or a larger number of recommendations to users, depending on their interests. This approach increases the diversity of the options offered, which can promote further exploration of the platform. As could be verified, the predictive model achieves a high hit rate. For system maintenance, this classification model can therefore be used to identify candidates for these promotions. This information can be used by a recommendation system that takes into account content diversification and other variables such as the common characteristics of users and products and the interests of the user.

Table 2. Measures of each model for the test and training data set.

| Model | Train acc | Train bacc | Train ce | Test acc | Test bacc | Test ce |
|---|---|---|---|---|---|---|
| Random Forests | 1.000 | 1.000 | – | 0.997 | 0.960 | 0.003 |
| Support Vector Machines | 0.999 | 0.993 | 0.001 | 0.998 | 0.981 | 0.002 |
| KNN | 0.999 | 0.988 | 0.001 | 0.995 | 0.965 | 0.005 |
| Extreme Gradient Boosting | 0.998 | 0.982 | 0.002 | 0.994 | 0.937 | 0.006 |

Table 3. Positions occupied by each model according to their average performance.

| Model | Rank (Train) | Rank (Test) |
|---|---|---|
| Support Vector Machines | 2 | 1 |
| Random Forests | 1 | 2 |
| KNN | 3 | 3 |
| Extreme Gradient Boosting | 4 | 4 |

Fig. 7. Support Vector Machines in Goodreads book ranking.

5 Conclusions and Future Work

The results obtained in the study serve as a decision-making tool that can benefit the Goodreads community. We propose that developers place greater emphasis on the group of books that are not very popular but are highly rated, using marketing strategies to make them better known and giving these writers better positioning. The presence of authors as users of the platform could increase if they achieve their objective: reaching a critical mass of readers.
Goodreads developers can support authors who produce good content but are not yet widely recognized. Reading users and investors will also benefit. The former will have new books in their reading suggestions, recommended for their excellent ratings, that can offer them a different experience. The latter will obtain greater profits from the additional traffic generated by users who are motivated to stay on Goodreads and by new users who come to the platform.

Future Work

Future work should consider the dissatisfaction of readers with specific recommendations. When the scanning option is enabled on the platform, the possibility of implementing a hybrid model that combines collaborative and content-based filtering with the proposed prioritization system can be evaluated.

References

1. Shatzkin, M., Riger, R.: The Book Business, 1st edn. Oxford University Press, New York (2019)
2. Rana, A., Deeba, K.: Online book recommendation system using collaborative filtering (with Jaccard similarity). Nano Sci. J. Phys. Conf. Ser. 1362, 012130 (2019). https://doi.org/10.1088/1742-6596/1362/1/012130
3. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 17(6), 734–749 (2005). https://doi.org/10.1109/TKDE.2005.99
4. Resnick, P., Iakovou, N.: GroupLens: an open architecture for collaborative filtering of netnews. In: Computer Supported Cooperative Work Conference (1994)
5. Hill, W., Stead, L., Rosenstein, M., Furnas, G.: Recommending and evaluating choices in a virtual community of use. In: Proceedings of Conference on Human Factors in Computing Systems (1995)
6. Sarwar, B., Karypis, G., Konstan, J.: Item-based collaborative filtering recommendation algorithms. In: Proceedings of 10th International WWW Conference (2001)
7. Lang, K.: Newsweeder: learning to filter netnews. In: Proceedings of 12th International Conference Machine Learning (1995)
8. Balabanovic, M., Shoham, Y.: Fab: content-based collaborative recommendation. Comm. ACM 40(3), 66–72 (1997)
9. Pazzani, M., Billsus, D.: Learning and revising user profiles: the identification of interesting web sites. Mach. Learn. 27, 313–331 (1997)
10. Claypool, M., Gokhale, A., Miranda, T.: Combining content-based and collaborative filters in an online newspaper. In: Proceedings of ACM SIGIR 1999 Workshop Recommender
11. Tran, T., Cohen, R.: Hybrid recommender systems for electronic commerce. In: Proceedings of Knowledge-Based Electronic Markets. Papers from the AAAI Workshop, Technical report WS-00-04, AAAI Press (2000)
12. Melville, P., Mooney, R.: Content-boosted collaborative filtering for improved recommendations. In: Proceedings of 18th National Conference Artificial Intelligence (2002)
13. Liu, Q., Chen, E., Xiong, H., Ding, C.H.Q., Chen, J.: Enhancing collaborative filtering by user interest expansion via personalised ranking. IEEE Trans. Syst. Man Cybern. Part B Cybern. 42(1), 2012 (2012)
14. Strub, F., Gaudel, R., Mary, J.: Hybrid recommender system based on autoencoders. In: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, ACM, pp. 11–16 (2016)
15. Zhang, S., Yao, L., Sun, A.: Deep learning based recommender system: a survey and new perspectives. arXiv preprint arXiv:1707.07435 (2017)
16. Feng, J., Fengs, X., Zhang, N., Peng, J.: An improved collaborative filtering method based on similarity. PLoS ONE 13(9), e0204003 (2018)
17. Kaggle (2020). https://www.kaggle.com/jealousleopard/goodreadsbooks
18. Bholowalia, P., Kumar, A.: EBK-means: a clustering technique based on elbow method and k-means in WSN. Int. J. Comput. Appl. 105(9), 17–24 (2014)
19. Rousseeuw, P.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987). https://doi.org/10.1016/0377-0427(87)90125-7. ISSN 0377-0427