TPT1201 Research Methodology in Computer Science
ASSIGNMENT 2

Hadoop MapReduce based C5.0 Algorithm in Predicting Customer Behavior in Retail

Prepared by
Steven Tan Chung Hong, Ng Kai Sheng, Cho Xuan Xian, Dylan Lim Yong Sen
1201300636, 1201302264, 1201302646, 1201302652
011-57679012, 017-3632015, 011-55086875, 017-8949038

Abstract

Building on our previous literature review, “Big Data in Retail”, we propose the Hadoop MapReduce based C5.0 algorithm for predicting customer behaviour in retail. To date, we have found no evidence of past work applying the C5.0 algorithm in the field of retail. We expect that implementing the Hadoop MapReduce based C5.0 algorithm will lead to better predictions of customer behaviour in retail.

1 Introduction

In the current Big Data driven era, Big Data is being applied in various fields to improve operational efficiency from many angles (Sivarajah, Kamal, Irani, & Weerakkody, 2017). It has proven exceptionally helpful in sectors such as medicine, agriculture, and even environmental protection. Along with the emergence of Big Data, countless data science methods and techniques have been developed over time to enhance its utilisation in different sectors. Among these, we found one method that currently stands above the others: the Hadoop MapReduce based C5.0 algorithm, an improved version of the C4.5 algorithm. However, there is no evidence that the C5.0 algorithm has been usefully implemented in the retail sector. This research has two primary objectives: to implement the Hadoop MapReduce based C5.0 algorithm for customer behaviour prediction in retail, and to improve on the speed and accuracy of current customer behaviour prediction methods.

2 Motivation of the Research

With the vast usage of Big Data today, data scientists have developed countless methods and techniques for handling and utilising Big Data to the benefit of particular sectors. However, one of these techniques, the Hadoop MapReduce based C5.0 algorithm, has proven extremely beneficial in other sectors but has not been fully implemented in the retail industry (Rathinasamy, Balamurali, & Raj, 2019). Understanding customer purchasing behaviour has always been a challenge because each customer has different needs and wants (Zhang & Tan, 2020). With the faster and more accurate C5.0 algorithm applied to predicting customer purchase behaviour, the retail sector would be able to cater to different customers’ needs, ensure customer satisfaction, and understand current market trends. In the long run, this should increase sales, which is tremendously beneficial to retailers and to the business as a whole.

3 Research Objectives

1. To implement the Hadoop MapReduce based C5.0 algorithm for the prediction of customer behaviour in retail.
2. To improve the speed and accuracy of the current customer behaviour prediction method.

4 Literature Review

The effects of poor prediction of customer purchase behaviour have been examined in multiple past studies. Poor prediction lowers customer satisfaction and organisational value (Ying, Sindakis, Aggarwal, Chen, & Su, 2021). The emergence of the online shopping trend has made customer behaviour harder to predict, resulting in demand volatility and uncertainty in the retail industry and, in the long run, negative consequences for inventory control and shareholder profits (Sun & David, 2021).

Singh, Ghutla, Lilo Jnr, Mohammed, and Rashid (2017) stated that retailers often find it hard to comprehend market conditions because their stores are spread across various geographical locations, an issue that is intertwined with poor customer behaviour prediction. Ahmad, Jafar, and Aljoumaa (2019) likewise discussed the difficulty of optimising customer behaviour prediction to reduce churn in telecommunications.

Author                            Original Approach   New Approach                      Improvements Made
Maryani and Riana (2017)          RFM                 RFM with CRM                      More efficient business relationships with customers and maximised customer satisfaction
Heldt, Silveira, and Luce (2021)  RFM                 RFM/P                             Reduced customer base value prediction error and individual customer value forecasting errors
Pandey and Shukla (2018)          Apriori             Improved Apriori                  Takes less time and works on all types of database
Panhalkar and Doye (2021)         Swarm Intelligence  Modified Swarm Intelligence, ABO  Minimises time complexity and creates a globally optimised decision tree
Li et al. (2022)                  RBFNN               ILS-RBFNN                         Further improves the accuracy of customer consumption behaviour prediction

Table 1: Summary of Past Methods that have been Improved

Table 1 lists several original approaches, namely the RFM model, Apriori, Swarm Intelligence and RBFNN. It also shows that each has since been improved to address weaknesses found in the existing approach, giving rise to new and better versions.

Algorithm  Processing Time (s)  Accuracy (%)
C5.0       13                   63.89
CART       17                   33.33

Table 2: Result Obtained from Maung (2020)

Table 2 shows that the processing time of the C5.0 algorithm is 4 seconds shorter than that of the CART algorithm. Furthermore, the accuracy of C5.0 is also higher, at 63.89% compared to 33.33% for the CART algorithm.

Myint and Tin (2021) state that the C5.0 decision tree was refined for the greatest accuracy and that the algorithm has become the industry standard for creating decision trees. Sathe and Adamuthe (2021) show that the C5.0 algorithm outperformed the other algorithms in terms of accuracy when predicting students’ academic performance.

5 Research Method

The research method we propose is an experiment to evaluate the performance of customer behaviour prediction in retail. The experiment follows the flow shown in Figure 1.

[Flowchart components: Customer Dataset, HDFS, Mapper, C5.0, Reducer, Rule Generation. Labelled steps: "Give Map output to algorithm", "Give the results to reducer", "Generate Rule", "Stores generated decision tree rules in HDFS".]

Figure 1: Flowchart of the Proposed System

The flow of the system is as follows:

1. Load the customer dataset from HDFS as input for the algorithm.
2. Invoke the C5.0 algorithm.
3. Apply the Hadoop MapReduce framework. The Map function is invoked to check whether each instance belongs to the current node. If there are uncovered attributes, it outputs the attribute index, its value, and the class label of the instance.
4. Apply the Reduce function.
5. Process the input dataset from HDFS according to the C5.0 decision tree algorithm within the MapReduce framework.
6. Generate the decision rules and store them in HDFS.
7. Access the test data in HDFS and perform categorisation based on the rules.
8. Save the relevant prediction results to a csv file.

6 Expected Outcomes

From this research, it is expected that the Hadoop MapReduce based C5.0 algorithm will improve customer behaviour prediction in terms of both accuracy and speed. We hope that this research proposal will provide useful insights for researchers contributing to the retail sector in the future.

References

Ahmad, A. K., Jafar, A., & Aljoumaa, K. (2019). Customer churn prediction in telecom using machine learning in big data platform.
Journal of Big Data, 6, 28. Retrieved from https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0191-6#citeas doi: https://doi.org/10.1186/s40537-019-0191-6

Heldt, R., Silveira, C. S., & Luce, F. B. (2021). Predicting customer value per product: From RFM to RFM/P. Journal of Business Research, 127, 444-453. Retrieved from https://www.sciencedirect.com/science/article/pii/S0148296319303030 doi: https://doi.org/10.1016/j.jbusres.2019.05.001

Li, Y., Jia, X., Wang, R., Qi, J., Jin, H., Chu, X., & Mu, W. (2022). A new oversampling method and improved radial basis function classifier for customer consumption behavior prediction. Expert Systems with Applications, 199, 116982. Retrieved from https://www.sciencedirect.com/science/article/pii/S0957417422004067 doi: https://doi.org/10.1016/j.eswa.2022.116982

Maryani, I., & Riana, D. (2017). Clustering and profiling of customers using RFM for customer relationship management recommendations. In 2017 5th International Conference on Cyber and IT Service Management (CITSM) (p. 1-6). doi: 10.1109/CITSM.2017.8089258

Maung, E. T. W. (2020). Comparison of data mining classification algorithms: C5.0 and CART for car evaluation and credit card information datasets, 63. Retrieved from https://onlineresource.ucsy.edu.mm/handle/123456789/2476

Myint, K. L., & Tin, H. H. K. (2021, 3). Analyzing the comparison of C4.5, CART and C5.0 algorithms on heart disease dataset using decision tree method. EAI. doi: 10.4108/eai.27-2-2020.2303221

Pandey, K. K., & Shukla, D. (2018). Mining on relationships in big data era using improve Apriori algorithm with MapReduce approach. In 2018 International Conference on Advanced Computation and Telecommunication (ICACAT) (p. 1-5). doi: 10.1109/ICACAT.2018.8933674

Panhalkar, A. R., & Doye, D. D. (2021). Optimization of decision trees using modified African buffalo algorithm. Journal of King Saud University - Computer and Information Sciences.
Retrieved from https://www.sciencedirect.com/science/article/pii/S1319157821000136 doi: https://doi.org/10.1016/j.jksuci.2021.01.011

Rathinasamy, R., Balamurali, S., & Raj, L. (2019, 01). Classifying agricultural crop pest data using Hadoop MapReduce based C5.0 algorithm. Journal of Cyber Security and Mobility, 8, 393-408. doi: 10.13052/jcsm2245-1439.835

Sathe, M., & Adamuthe, A. (2021, 02). Comparative study of supervised algorithms for prediction of students’ performance. International Journal of Modern Education and Computer Science, 13, 1-21. doi: 10.5815/ijmecs.2021.01.01

Singh, M., Ghutla, B., Lilo Jnr, R., Mohammed, A. F. S., & Rashid, M. A. (2017). Walmart’s sales data analysis - a big data analytics perspective. In 2017 4th Asia-Pacific World Congress on Computer Science and Engineering (APWC on CSE) (p. 114-119). doi: 10.1109/APWConCSE.2017.00028

Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of big data challenges and analytical methods. Journal of Business Research, 70, 263-286. Retrieved from https://www.sciencedirect.com/science/article/pii/S014829631630488X doi: https://doi.org/10.1016/j.jbusres.2016.08.001

Sun, & David, Y. (2021). A user behavior analysis method based on big data. In 2021 5th Annual International Conference on Data Science and Business Analytics (ICDSBA) (p. 491-496). doi: 10.1109/ICDSBA53075.2021.00101

Ying, S., Sindakis, S., Aggarwal, S., Chen, C., & Su, J. (2021). Managing big data in the retail industry of Singapore: Examining the impact on customer satisfaction and organizational performance. European Management Journal, 39(3), 390-400. Retrieved from https://www.sciencedirect.com/science/article/pii/S0263237320300530 doi: https://doi.org/10.1016/j.emj.2020.04.001

Zhang, C., & Tan, T. (2020, 05). The impact of big data analysis on consumer behavior. Journal of Physics: Conference Series, 1544, 012165.
doi: 10.1088/1742-6596/1544/1/012165

1 Task 1

1.1 Program Used

The program used is based on Method 2, which is shown in the section below.

1.2 Computer Specifications

Component       Computer 1    Computer 2
RAM             LPDDR4x 8GB   DDR4 16GB
Hard disk type  SSD           SSD
CPU             M1            i7-10875H

1.3 Results

No  Computer 1  Computer 2
1   1.725       4.542
2   1.656       4.475
3   1.672       4.488
4   1.662       4.472
5   1.665       4.511
6   1.667       4.557
7   1.670       4.506
8   1.667       4.457
9   1.656       4.505
10  1.665       4.455

Table 3: First 10 Execution Time Comparison using 2 Computers

Figure 2: p value obtained from Welch’s t-test

In this experiment, Computer 1 is an Apple MacBook Air M1 (2020) and Computer 2 is an Asus ROG Strix Scar 15 (2020). Computer 1 uses the Apple M1 chip, whereas Computer 2 uses the Intel i7-10875H. Computer 1 has 8GB of LPDDR4x RAM, whereas Computer 2 has 16GB of DDR4 RAM. Both computers use SSDs. Table 3 shows that Computer 1 has a shorter runtime than Computer 2 when assigning vaccinees to PPVs, despite having less RAM. Moreover, benchmark results indicate that the M1 chip outperforms the Intel chip, so we hypothesise that the CPU is the primary factor behind the time difference. Welch’s t-test gives a p-value smaller than 0.05, so the difference in runtime between the two computers is statistically significant.
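
As a rough cross-check of the Welch's t-test result, the test statistic can be recomputed from the ten runs listed in Table 3. This is a minimal sketch using only the Python standard library. It assumes that only those ten runs are used (the original test may have included more), and the names welch_t, computer_1 and computer_2 are illustrative rather than taken from the original program.

```python
import math
import statistics

# Runtimes copied from Table 3 (first 10 runs only; an assumption of this sketch).
computer_1 = [1.725, 1.656, 1.672, 1.662, 1.665, 1.667, 1.670, 1.667, 1.656, 1.665]
computer_2 = [4.542, 4.475, 4.488, 4.472, 4.511, 4.557, 4.506, 4.457, 4.505, 4.455]


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    # Welch's t does not pool the variances, so unequal variances are allowed.
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


t_stat, dof = welch_t(computer_1, computer_2)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```

The magnitude of the t statistic is very large because the two sets of runtimes do not overlap at all, which is consistent with a p-value below 0.05. If SciPy is available, scipy.stats.ttest_ind(computer_1, computer_2, equal_var=False) should report the same statistic together with the p-value.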

Figure 3: Shaded density plot of runtime comparison between two computers

2 Task 2

2.1 Methods

Figure 4: Method 1

Figure 5: Method 2

Figure 6: Method 3

Approach  Data Structure  Distance Formula  Loop Type   CSV Loader
Method 1  DataFrame       Haversine         df.apply()  Pandas
Method 2  List            Haversine         while       csv
Method 3  List            Vincenty Inverse  for         csv

Table 4: Comparison of three methods

2.2 Results

No  Method 1  Method 2  Method 3
1   6.432     1.763     6.248
2   6.765     1.774     5.809
3   6.538     1.744     5.916
4   6.443     1.677     5.975
5   7.176     1.723     5.983
6   6.464     1.709     6.280
7   6.400     1.723     6.283
8   6.927     1.719     5.850
9   6.273     1.728     5.792
10  6.218     1.668     5.855

Table 5: First 10 Execution Time (s) of different methods

Figure 7: p value obtained from One-Way ANOVA test

As summarised in Table 4, Method 1 uses the Haversine formula imported from the Haversine package and applies it to each row of the given datasets with the pandas df.apply() function. Method 2 uses the same Haversine package but is implemented with a traditional while loop and reads the CSV files with the csv module instead of the pandas module used in Method 1. Method 3 is based on the Vincenty inverse formula and is implemented with for loops.

Based on the final results, the average runtime is 6.589 seconds for Method 1, 1.694 seconds for Method 2, and 6.031 seconds for Method 3. The major gap between Methods 1 and 2 arises because, in our context, the pandas df.apply() function is much slower than a traditional while loop. Method 3 is slightly faster than Method 1 even though the Vincenty inverse formula is more computationally demanding than Haversine, which suggests that the overhead of df.apply() in Method 1 outweighs the extra cost of the formula in Method 3. Finally, Method 2 is faster than Method 3 because Vincenty’s inverse formula requires more computation in exchange for more accurate results than Haversine’s.
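
The loop-based Haversine approach described for Method 2 might look roughly like the sketch below. It is not the original program. The formula is written out by hand instead of importing the Haversine package so that the example runs without dependencies, the coordinates are hypothetical, and it assumes that "assigning" a vaccinee means choosing the closest centre, which the report does not state. The names haversine_km and nearest_centre are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; the sphere is an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_centre(person, centres):
    """Return (index, distance_km) of the closest centre, using a plain while loop."""
    best_index, best_distance = -1, math.inf
    i = 0
    while i < len(centres):
        d = haversine_km(person[0], person[1], centres[i][0], centres[i][1])
        if d < best_distance:
            best_index, best_distance = i, d
        i += 1
    return best_index, best_distance


# Hypothetical data: three centres and one vaccinee as (latitude, longitude).
centres = [(3.1390, 101.6869), (2.9213, 101.6559), (5.4141, 100.3288)]
vaccinee = (2.9300, 101.6500)
index, distance = nearest_centre(vaccinee, centres)
print(index, round(distance, 2))
```

Working on plain lists of tuples keeps each distance call a simple function call, whereas df.apply() adds per-row overhead on top of the same arithmetic.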

The one-way ANOVA test gives a p-value smaller than 0.05, so there is a statistically significant difference in runtime among the three methods.

Figure 8: Shaded Density Plot
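
To illustrate how the one-way ANOVA result arises, the F statistic can be recomputed from the ten runs listed in Table 5. This is a minimal standard-library sketch. It assumes that only those ten runs per method are used (the reported averages suggest the full experiment had more), and the function name one_way_anova_f is illustrative.

```python
import statistics

# Runtimes in seconds copied from Table 5 (first 10 runs only; an assumption of this sketch).
method_1 = [6.432, 6.765, 6.538, 6.443, 7.176, 6.464, 6.400, 6.927, 6.273, 6.218]
method_2 = [1.763, 1.774, 1.744, 1.677, 1.723, 1.709, 1.723, 1.719, 1.728, 1.668]
method_3 = [6.248, 5.809, 5.916, 5.975, 5.983, 6.280, 6.283, 5.850, 5.792, 5.855]


def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the given groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2 for group in groups
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (x - statistics.mean(group)) ** 2 for group in groups for x in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


f_stat, df_b, df_w = one_way_anova_f(method_1, method_2, method_3)
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

The F statistic is very large because the spread within each method is tiny compared with the gap between Method 2 and the other two, which is consistent with a p-value below 0.05. If SciPy is available, scipy.stats.f_oneway(method_1, method_2, method_3) should return the same statistic along with the p-value. Note that ANOVA only indicates that at least one method differs; it does not say which pairs differ.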