Research Proposal Example on Big Data in Retail

TPT1201
Research Methodology in Computer Science
ASSIGNMENT 2
Hadoop MapReduce based C5.0 Algorithm in
Predicting Customer Behavior in Retail
Prepared by
Steven Tan Chung Hong, 1201300636, 011-57679012
Ng Kai Sheng, 1201302264, 017-3632015
Cho Xuan Xian, 1201302646, 011-55086875
Dylan Lim Yong Sen, 1201302652, 017-8949038
Abstract
Building on our earlier literature review, “Big Data in Retail”, we propose applying
the Hadoop MapReduce based C5.0 algorithm to the prediction of customer behaviour in
retail. To date, we have found no published work in which the C5.0 algorithm has
been applied in the field of retail. We expect that implementing the Hadoop MapReduce
based C5.0 algorithm will yield better predictions of customer behaviour in retail.
1 Introduction
In the current Big Data driven era, Big Data is being applied in various fields
to improve operational efficiency from many angles (Sivarajah, Kamal, Irani, &
Weerakkody, 2017). It has proven exceptionally helpful in sectors such as medicine,
agriculture, and even environmental protection. Along with the emergence of Big Data,
countless data science methods and techniques have been developed to enhance its
utilisation in different sectors. Among these, we found one method that currently
stands out from the others: the Hadoop MapReduce based C5.0 algorithm, an improved
version of the C4.5 algorithm. However, there is no evidence that the C5.0 algorithm
has been usefully implemented in the retail sector. This research has two primary
objectives: to implement the Hadoop MapReduce based C5.0 algorithm for customer
behaviour prediction in retail, and to improve on the speed and accuracy of current
customer behaviour prediction methods.
2 Motivation of the Research
With the vast usage of Big Data today, data scientists have developed countless
methods and techniques for handling Big Data and using it to their advantage in
particular sectors. However, one of these techniques, the Hadoop MapReduce based C5.0
algorithm, has proven to be extremely beneficial in other fields but has not been
fully implemented in the retail industry (Rathinasamy, Balamurali, & Raj, 2019).
Understanding customer purchasing behaviour has always been a challenge because each
customer has different needs and wants (Zhang & Tan, 2020). With the faster and more
accurate C5.0 algorithm applied to predicting customer purchase behaviour, the retail
sector would be able to cater to different customers’ needs, ensure customer
satisfaction, and understand current market trends. In the long run, this should
increase sales, which is tremendously beneficial to retailers and to the business.
3 Research Objectives
1. To implement the Hadoop MapReduce based C5.0 algorithm for the prediction of
customer behaviour in retail.
2. To improve on the speed and accuracy of current customer behaviour prediction
methods.
4 Literature Review
The effects of poor prediction of customer purchase behaviour have been examined in
multiple past studies. Poor prediction lowers customer satisfaction and organisational
value (Ying, Sindakis, Aggarwal, Chen, & Su, 2021). The emergence of the online
shopping trend has made customer behaviour harder to predict, resulting in demand
volatility and uncertainty in the retail industry and leading to negative consequences
for inventory control and shareholder profits in the long run (Sun & David, 2021).
Singh, Ghutla, Lilo Jnr, Mohammed, and Rashid (2017) stated that it is often hard for
retailers to comprehend market conditions because their stores are spread across
various geographical locations, a problem that is intertwined with poor customer
behaviour prediction. Outside retail, Ahmad, Jafar, and Aljoumaa (2019) discussed the
difficulty of optimising customer behaviour prediction to reduce churn in
telecommunications.
| Author | Original Approach | New Approach | Improvements Made |
|---|---|---|---|
| Maryani and Riana (2017) | RFM | RFM with CRM | more efficient business relationships with customers and maximised customer satisfaction |
| Heldt, Silveira, and Luce (2021) | RFM | RFM/P | reduced customer base value prediction error and reduced individual customer value forecasting errors |
| Pandey and Shukla (2018) | Apriori | Improved Apriori | takes less time and works on all types of database |
| Panhalkar and Doye (2021) | Swarm Intelligence | Modified Swarm Intelligence, ABO | minimises time complexity and creates a globally optimized decision tree |
| Li et al. (2022) | RBFNN | ILS-RBFNN | further improves the accuracy of prediction of customer consumption behavior |
Table 1: Summary of Past Methods that have been Improved
Table 1 shows various original approaches, namely the RFM model, Apriori, Swarm
Intelligence and RBFNN. It also shows that improvements have been made to address
the weaknesses found in these existing approaches, giving rise to new and better
versions.
| Algorithm | Processing Time (s) | Accuracy (%) |
|---|---|---|
| C5.0 | 13 | 63.89 |
| CART | 17 | 33.33 |
Table 2: Result Obtained from Maung (2020)
Table 2 shows that the processing time of the C5.0 algorithm is 4 seconds shorter
than that of the CART algorithm. Furthermore, the accuracy of C5.0 is higher: 63.89%
compared with 33.33% for CART.
Myint and Tin (2021) state that the C5.0 decision tree was refined to achieve the
greatest accuracy and that the algorithm has become the industry standard for
creating decision trees.
Sathe and Adamuthe (2021) show that C5.0 was more accurate than the other algorithms
they compared in predicting students’ academic performance.
5 Research Method
The research method we propose is conducting an experiment to evaluate the performance
of customer behaviour prediction in retail. The experiment we propose follows the flow
of Figure 1.
[Flowchart: Customer Dataset -> HDFS -> Mapper -> C5.0 -> Reducer -> Rule Generation.
Edge labels: "Give Map output to algorithm", "Give the results to reducer",
"Generate Rule", "Stores generated decision tree rules in HDFS".]
Figure 1: Flowchart of the Proposed System
The flow of the system is as follows:
1. Load the customer dataset from HDFS as input for the algorithm.
2. Invoke the C5.0 algorithm.
3. Apply the Hadoop MapReduce framework. The Map function is invoked to check whether
each instance belongs to the current node. If the instance has uncovered attributes,
the function outputs the attribute index, its value and the class label of the instance.
4. Apply the Reduce function.
5. Process the input dataset from HDFS according to the C5.0 decision tree algorithm
within the MapReduce framework.
6. Generate the decision rules and store them in HDFS.
7. Access the test data in HDFS and classify it based on the rules.
8. Save the relevant prediction results to a CSV file.
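The Map and Reduce steps listed above can be illustrated with a small in-process sketch. It is not the proposed Hadoop implementation: the toy records, the function names and the use of an entropy-based gain ratio to pick the split attribute are all assumptions made for illustration, and C5.0 features such as boosting and pruning are left out.

```python
from collections import Counter, defaultdict
from math import log2

# Hypothetical toy records: ((age group, income level), class label).
RECORDS = [
    (("young", "high"), "buy"),
    (("young", "low"), "skip"),
    (("old", "high"), "buy"),
    (("old", "low"), "skip"),
    (("young", "high"), "buy"),
    (("old", "high"), "buy"),
]

def map_instance(attributes, label):
    """Map step: emit ((attribute index, value, class label), 1) per attribute."""
    for index, value in enumerate(attributes):
        yield (index, value, label), 1

def reduce_counts(pairs):
    """Reduce step: sum the counts that share the same key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals

def entropy(counts):
    """Shannon entropy (in bits) of a collection of counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_ratio(totals, index, class_counts):
    """Gain ratio of one attribute, computed only from the reduced counts."""
    by_value = defaultdict(Counter)
    for (i, value, label), count in totals.items():
        if i == index:
            by_value[value][label] += count
    n = sum(class_counts.values())
    remainder = sum(
        sum(c.values()) / n * entropy(c.values()) for c in by_value.values()
    )
    gain = entropy(class_counts.values()) - remainder
    split_info = entropy(sum(c.values()) for c in by_value.values())
    return gain / split_info if split_info else 0.0

# Simulate the shuffle between Map and Reduce with a flat list of pairs.
pairs = [kv for attrs, label in RECORDS for kv in map_instance(attrs, label)]
totals = reduce_counts(pairs)
class_counts = Counter(label for _, label in RECORDS)
best_attribute = max(range(2), key=lambda i: gain_ratio(totals, i, class_counts))
print("split on attribute index", best_attribute)
```

In a real Hadoop job the mapper and reducer would run on separate nodes and the counts would be written back to HDFS; the point of the sketch is only that the split decision needs nothing more than the aggregated (index, value, label) counts.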
6 Expected Outcomes
From this research, we expect the Hadoop MapReduce based C5.0 algorithm to improve
customer behaviour prediction in terms of both accuracy and speed. We hope that this
research proposal will provide useful insights for researchers contributing to the
retail sector in the future.
References
Ahmad, A. K., Jafar, A., & Aljoumaa, K. (2019). Customer churn prediction in telecom using machine learning in big data platform. Journal of Big Data, 6, 28. Retrieved from https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0191-6#citeas doi: https://doi.org/10.1186/s40537-019-0191-6
Heldt, R., Silveira, C. S., & Luce, F. B. (2021). Predicting customer value per product: From RFM to RFM/P. Journal of Business Research, 127, 444-453. Retrieved from https://www.sciencedirect.com/science/article/pii/S0148296319303030 doi: https://doi.org/10.1016/j.jbusres.2019.05.001
Li, Y., Jia, X., Wang, R., Qi, J., Jin, H., Chu, X., & Mu, W. (2022). A new oversampling method and improved radial basis function classifier for customer consumption behavior prediction. Expert Systems with Applications, 199, 116982. Retrieved from https://www.sciencedirect.com/science/article/pii/S0957417422004067 doi: https://doi.org/10.1016/j.eswa.2022.116982
Maryani, I., & Riana, D. (2017). Clustering and profiling of customers using RFM for customer relationship management recommendations. In 2017 5th International Conference on Cyber and IT Service Management (CITSM) (p. 1-6). doi: 10.1109/CITSM.2017.8089258
Maung, E. T. W. (2020). Comparison of data mining classification algorithms: C5.0 and CART for car evaluation and credit card information datasets. , 63. Retrieved from https://onlineresource.ucsy.edu.mm/handle/123456789/2476
Myint, K. L., & Tin, H. H. K. (2021, 3). Analyzing the comparison of C4.5, CART and C5.0 algorithms on heart disease dataset using decision tree method. EAI. doi: 10.4108/eai.27-2-2020.2303221
Pandey, K. K., & Shukla, D. (2018). Mining on relationships in big data era using improve Apriori algorithm with MapReduce approach. In 2018 International Conference on Advanced Computation and Telecommunication (ICACAT) (p. 1-5). doi: 10.1109/ICACAT.2018.8933674
Panhalkar, A. R., & Doye, D. D. (2021). Optimization of decision trees using modified African buffalo algorithm. Journal of King Saud University - Computer and Information Sciences. Retrieved from https://www.sciencedirect.com/science/article/pii/S1319157821000136 doi: https://doi.org/10.1016/j.jksuci.2021.01.011
Rathinasamy, R., Balamurali, S., & Raj, L. (2019, 01). Classifying agricultural crop pest data using Hadoop MapReduce based C5.0 algorithm. Journal of Cyber Security and Mobility, 8, 393-408. doi: 10.13052/jcsm2245-1439.835
Sathe, M., & Adamuthe, A. (2021, 02). Comparative study of supervised algorithms for prediction of students’ performance. International Journal of Modern Education and Computer Science, 13, 1-21. doi: 10.5815/ijmecs.2021.01.01
Singh, M., Ghutla, B., Lilo Jnr, R., Mohammed, A. F. S., & Rashid, M. A. (2017). Walmart’s sales data analysis - a big data analytics perspective. In 2017 4th Asia-Pacific World Congress on Computer Science and Engineering (APWC on CSE) (p. 114-119). doi: 10.1109/APWConCSE.2017.00028
Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of big data challenges and analytical methods. Journal of Business Research, 70, 263-286. Retrieved from https://www.sciencedirect.com/science/article/pii/S014829631630488X doi: https://doi.org/10.1016/j.jbusres.2016.08.001
Sun, & David, Y. (2021). A user behavior analysis method based on big data. In 2021 5th Annual International Conference on Data Science and Business Analytics (ICDSBA) (p. 491-496). doi: 10.1109/ICDSBA53075.2021.00101
Ying, S., Sindakis, S., Aggarwal, S., Chen, C., & Su, J. (2021). Managing big data in the retail industry of Singapore: Examining the impact on customer satisfaction and organizational performance. European Management Journal, 39(3), 390-400. Retrieved from https://www.sciencedirect.com/science/article/pii/S0263237320300530 doi: https://doi.org/10.1016/j.emj.2020.04.001
Zhang, C., & Tan, T. (2020, 05). The impact of big data analysis on consumer behavior. Journal of Physics: Conference Series, 1544, 012165. doi: 10.1088/1742-6596/1544/1/012165
1 Task 1
1.1 Program Used
The program used is based on Method 2, which is shown in Section 2.1 below.
1.2 Computer Specifications
| Component | Computer 1 | Computer 2 |
|---|---|---|
| RAM | LPDDR4x 8GB | DDR4 16GB |
| Hard disk type | SSD | SSD |
| CPU | M1 | i7-10875H |
1.3 Results
| No | Computer 1 (s) | Computer 2 (s) |
|---|---|---|
| 1 | 1.725 | 4.542 |
| 2 | 1.656 | 4.475 |
| 3 | 1.672 | 4.488 |
| 4 | 1.662 | 4.472 |
| 5 | 1.665 | 4.511 |
| 6 | 1.667 | 4.557 |
| 7 | 1.670 | 4.506 |
| 8 | 1.667 | 4.457 |
| 9 | 1.656 | 4.505 |
| 10 | 1.665 | 4.455 |
Table 3: First 10 Execution Time Comparison using 2 Computers
Figure 2: p value obtained from Welch’s t-test
In this experiment, Computer 1 is an Apple MacBook Air M1 (2020) and Computer 2 is
an Asus ROG Strix Scar 15 (2020). Computer 1 uses the Apple M1 chip, whereas Computer 2
uses the Intel i7-10875H. Computer 1 has 8GB of LPDDR4x RAM, whereas Computer 2 has
16GB of DDR4 RAM. Both computers use SSDs. From Table 3, the runtime for assigning
vaccinees to PPVs is shorter on Computer 1 than on Computer 2, despite Computer 1
having less RAM. Moreover, benchmark results show that the M1 chip outperforms the
Intel chip, so we hypothesised that the CPU is the primary factor behind the time
difference. Welch’s t-test gives a p-value smaller than 0.05, so the difference in
runtime between the two computers is statistically significant.
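As a rough illustration of the test, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom from the ten runtimes listed in Table 3 only, using the Python standard library. The function name is hypothetical, and the reported p-value may have been computed from more runs than these ten. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p-value.

```python
from math import sqrt
from statistics import mean, variance

# First ten runtimes (s) copied from Table 3.
COMPUTER_1 = [1.725, 1.656, 1.672, 1.662, 1.665, 1.667, 1.670, 1.667, 1.656, 1.665]
COMPUTER_2 = [4.542, 4.475, 4.488, 4.472, 4.511, 4.557, 4.506, 4.457, 4.505, 4.455]

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, dof

t_stat, dof = welch_t(COMPUTER_1, COMPUTER_2)
print(f"t = {t_stat:.1f}, df = {dof:.1f}")
```

Welch's variant is the appropriate choice here because the two machines show visibly different spreads in runtime, so equal variances should not be assumed.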
Figure 3: Shaded density plot of runtime comparison between two computers
2 Task 2
2.1 Methods
Figure 4: Method 1
Figure 5: Method 2
Figure 6: Method 3
| Approach | Data Structure | Distance Formula | Loop Type | CSV Loader |
|---|---|---|---|---|
| Method 1 | DataFrame | Haversine | df.apply() | Pandas |
| Method 2 | List | Haversine | while | csv |
| Method 3 | List | Vincenty Inverse | for | csv |
Table 4: Comparison of three methods
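The Method 2 row of Table 4 (list, Haversine, while loop, csv module) might look roughly like the sketch below. The original code appears only as a figure, so this is an assumption-laden reconstruction: the Haversine formula is written out with the `math` module instead of using the Haversine package, and the sample CSV, column names and reference point are hypothetical.

```python
import csv
import io
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Hypothetical in-memory CSV standing in for the assignment's dataset.
SAMPLE_CSV = "name,lat,lon\nA,3.1390,101.6869\nB,1.3521,103.8198\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))  # csv module, plain list

origin = (2.9264, 101.6964)  # hypothetical reference point (lat, lon)

distances = []
i = 0
while i < len(rows):  # Method 2 style: explicit while loop over the list
    row = rows[i]
    distances.append(
        haversine_km(origin[0], origin[1], float(row["lat"]), float(row["lon"]))
    )
    i += 1

print(distances)
```

Method 1 would instead load the file into a pandas DataFrame and call the distance function through `df.apply()`, and Method 3 would swap the distance function for a Vincenty inverse implementation inside a for loop.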
2.2 Results
| No | Method 1 (s) | Method 2 (s) | Method 3 (s) |
|---|---|---|---|
| 1 | 6.432 | 1.763 | 6.248 |
| 2 | 6.765 | 1.774 | 5.809 |
| 3 | 6.538 | 1.744 | 5.916 |
| 4 | 6.443 | 1.677 | 5.975 |
| 5 | 7.176 | 1.723 | 5.983 |
| 6 | 6.464 | 1.709 | 6.280 |
| 7 | 6.400 | 1.723 | 6.283 |
| 8 | 6.927 | 1.719 | 5.850 |
| 9 | 6.273 | 1.728 | 5.792 |
| 10 | 6.218 | 1.668 | 5.855 |
Table 5: First 10 Execution time of different methods
Figure 7: p value obtained from One-Way ANOVA test
From Table 4, Method 1 imports the Haversine package and applies the Haversine
formula to each row of the given datasets using the pandas df.apply() function.
Method 2 uses the same Haversine package but is implemented with a traditional while
loop and reads the CSV files with the csv module instead of the pandas module used in
Method 1. Method 3 is based on the Vincenty inverse formula and is implemented using
for loops. Based on the final results we obtained, the average runtime is 6.589
seconds for Method 1, 1.694 seconds for Method 2, and 6.031 seconds for Method 3. The
major difference between the results of Methods 1 and 2 arises because the pandas
df.apply() function is much slower than a traditional while loop in our context.
Method 3 is slightly faster than Method 1 even though the Vincenty inverse formula is
more complicated than the Haversine formula, which suggests that the overhead of
df.apply() outweighs the extra cost of the formula. Finally, Method 2 is faster than
Method 3 because the Vincenty inverse formula requires greater computational power,
as it provides more accurate results than the Haversine formula. The one-way ANOVA
test gives a p-value smaller than 0.05, so there is a statistically significant
difference among the runtimes of the three methods.
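For illustration, the one-way ANOVA F statistic can be computed by hand from the ten runtimes per method listed in Table 5. This sketch uses only the standard library and a hypothetical function name; the averages quoted above come from the final results, which may cover more runs than the ten shown here. If SciPy is available, `scipy.stats.f_oneway` performs the same test and also returns the p-value.

```python
from statistics import mean

# First ten runtimes (s) per method, copied from Table 5.
RUNTIMES = {
    "Method 1": [6.432, 6.765, 6.538, 6.443, 7.176, 6.464, 6.400, 6.927, 6.273, 6.218],
    "Method 2": [1.763, 1.774, 1.744, 1.677, 1.723, 1.709, 1.723, 1.719, 1.728, 1.668],
    "Method 3": [6.248, 5.809, 5.916, 5.975, 5.983, 6.280, 6.283, 5.850, 5.792, 5.855],
}

def one_way_anova(groups):
    """Return (F statistic, between-group df, within-group df)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

f_stat, df_between, df_within = one_way_anova(list(RUNTIMES.values()))
print(f"F({df_between}, {df_within}) = {f_stat:.1f}")
```

A significant ANOVA result only says that at least one method differs from the others; a post-hoc pairwise comparison would be needed to confirm which pairs differ.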
Figure 8: Shaded Density Plot