Uploaded by Siddhartha Kc

CC5067NP Smart Data Discovery - Siddhartha-Kc

Module Code & Module Title: CC5067NP Smart Data Discovery
Assessment Weightage & Type: 60% Individual Coursework
Year and Semester: 2023 Spring
C4 Siddhartha Kc
London Met ID: 21050025
College ID: NP04CP4A210073
Assignment Deadline: 04 May 2023
Assignment Submission Date: 04 May 2023
I understand that I must submit my coursework online via MST before the deadline in order to receive a
mark. Late submissions will not be accepted and will result in a mark of zero.
Acknowledgement
I would like to express my gratitude to my module leader, Mr. Badri Raj
Lamichhane, for providing me with the opportunity to undertake this
coursework for the module "Smart Data Discovery". His guidance and
support throughout the coursework have been invaluable, and I have learned
a great deal from his expertise in the field.
I would also like to thank Islington College and London Metropolitan
University for providing me with the resources and support necessary to
complete this coursework successfully.
Lastly, I would like to acknowledge the hard work and dedication of my fellow
classmates, whose contributions and feedback have helped me in my
learning process.
Contents
1. Introduction
1.1 Data Processing Cycle
2. Objectives
3. Overview
4. Data Understanding
5. Data Preparation
a. Merge data from each month into one CSV
b. Removing NAN missing values
c. Converting Quantity Ordered & Price Each into numeric
d. Creating new column Month from Order Date
e. Creating new column City from Purchase Address
6. Data Analysis
f. Showing summary statistics of
g. Calculating correlation of all variables
7. Data Exploration
h. Showing which month had the best sales in Bar graph
i. Showing which city had Highest product sold
j. Illustrating most sold item in Bar graph
k. Showing histogram plot of Order Months
8. Conclusion
References
List of Figures
Figure 1: Importing Libraries
Figure 2: Merging Data frames into one CSV
Figure 3: Merging Data Frame Result
Figure 4: Removing Missing Values
Figure 5: Before Removing NAN Values
Figure 6: After Removing all Missing Values
Figure 7: Converting Order and price to numeric
Figure 8: Before/After changing the data type
Figure 9: Creating Month column from Order Date
Figure 10: Showing new Month Column
Figure 11: Showing Month data type
Figure 12: Creating City Column from Purchase Address
Figure 13: Showing New City Column from Purchase Address
Figure 14: Code for Summary Statistic
Figure 15: Showing summary Statistic
Figure 16: Describing Price Each
Figure 17: Calculating correlation of all variable
Figure 18: Showing corr of all variable
Figure 19: Code for best sale month in Bar Graph
Figure 20: Bar Graph of Total Sales by Month
Figure 21: Code for City with Highest product sold
Figure 22: Bar Graph of Total Product Sales by City
Figure 23: Code For most sold item in Overall
Figure 24: Bar Graph of Top Quantity Ordered
Figure 25: Histogram plot of Order Months
Figure 26: Showing Histogram Plot
CC5067
Smart Data Discovery
1. Introduction
Smart data discovery is a powerful tool that enables people to extract useful information
from vast amounts of data. The process involves using advanced techniques in statistics,
mathematics, and computer science to analyse and make sense of complex data sets.
By using smart data discovery, businesses and organizations can gain valuable insights
into their customers, products, and services, which can help them to make more informed
decisions and improve their operations. (IBM 2020)
The objective of the coursework is to analyse the sales data of ABC Company for the
year 2019 using Python. Python is a popular programming language that is widely used
in data analysis due to its rich set of libraries and tools that facilitate data processing and
analysis. The aim is to prepare the data for further data mining and analysis by covering
various stages, including data understanding, preparation, exploration, and initial
analysis.
The first stage of the data analysis process is data understanding, which involves
collecting and describing data to identify any quality issues or missing data. This stage is
critical because any inaccuracies or errors in the data can undermine the accuracy of the
analysis. Once the data is collected, it is prepared by cleaning, transforming, and
integrating it so that it is ready for analysis.
In the data exploration stage, the data is visualized and summarized to identify patterns
and relationships. This stage is critical since it enables the analyst to gain insights into
the data and identify any trends that may be relevant to the analysis. Finally, in the initial
analysis stage, statistical methods are applied to test hypotheses and draw conclusions.
Through this coursework, the student will demonstrate their ability to apply critical thinking
and problem-solving skills to real-world data analysis tasks. They will also gain
experience using Python, a powerful programming language that is widely used in data
analysis. The report will showcase various stages of data analysis and highlight the
importance of smart data analysis in today's business environment.
Siddhartha Kc
1
Data: refers to a collection of facts, figures, and statistics that are stored and analysed by
computers to gain insights and knowledge. It can take various forms, including text,
numbers, images, and sounds. “Data is essential in decision-making processes and
provides valuable information to businesses, governments, and individuals.” (Java point)
In a business context, data can include customer information, sales data, financial
records, and more. Data can be obtained through various sources such as surveys,
customer feedback, social media, and website analytics. Once data is collected, it needs
to be processed to extract meaningful insights. Data processing involves converting raw
data into a format that can be analysed, including cleaning, transforming, and aggregating
data. Data processing is essential for businesses because it enables them to extract
insights from the data and make informed decisions based on accurate and reliable
information. By processing data efficiently, businesses can optimize their operations,
reduce costs, and increase profits.
Smart data: refers to the process of analysing and utilizing data to gain insights, improve
decision-making, and drive business growth. The Smart Data Discovery module focuses
on teaching the process of collecting, analysing, and presenting data from various
sources to uncover hidden patterns and insights for smart decision-making in a business
context. Business data discovery is crucial for organizations to gain a competitive edge
in the market, improve their processes, and create new opportunities for growth and
innovation. This module covers topics such as data visualization, automated data
preparation, and integration of data to provide actionable insights. By leveraging smart
data solutions, businesses can sustain their growth and improve their profitability.
(TechTarget 2023)
1.1 Data Processing Cycle
The data processing cycle is an essential part of modern businesses, where data is often
used to drive decision-making processes. By following the data processing cycle,
organizations can ensure that they are collecting accurate data, processing it efficiently,
and presenting the results in a useful format. (Simplilearn 2016)
❖ The first stage of the cycle, gathering, involves collecting data from various sources.
This stage is crucial, as it sets the foundation for the entire process. If the data is
inaccurate or incomplete, the resulting insights will be of little value.
❖ The second stage, preparation of data, involves cleaning and transforming raw data
into a form suitable for further analysis. This stage is often the most time-consuming
part of the process, but it is essential for ensuring that the data is accurate and reliable.
❖ The third stage, data entry, involves entering the prepared data into a computer
system for processing. This stage is typically automated, but it is important to ensure
that the data is entered accurately.
❖ The fourth stage, processing of data, involves using various tools and techniques to
analyse and manipulate the data. This stage is where the most value is often derived
from the data, as it allows analysts to identify trends, patterns, and insights that may
be hidden in the data.
❖ The fifth stage, output, involves presenting the results of the data processing in a
format that is useful to the end user. This may include reports, graphs, and
visualizations.
❖ Finally, the sixth stage, storage, involves storing the processed data for future use.
This stage is essential for ensuring that the data is available for future analysis and
decision-making.
Figure 1: Data Processing Cycle
In today's data-driven world, investing in the right data processing tools and technologies
can be crucial for businesses to unlock the full potential of their data and turn it into a
competitive advantage. By using modern data processing tools such as machine learning
algorithms and artificial intelligence, businesses can gain insights that were previously
impossible to obtain. These insights can help businesses make informed decisions,
improve their processes, and stay ahead of the competition. Therefore, it is essential for
businesses to stay up-to-date with the latest data processing technologies to leverage the
full potential of their data.
2. Objectives
The course is designed to equip students with highly sought-after skills that are crucial
for success in today's job market. The primary objective is to enhance students' abilities
in problem-solving, critical thinking, and data analysis. By developing these skills,
students will be better prepared to contribute to the success of their employers and secure
positions in various industries.
The first objective aims to improve students' problem-solving and critical thinking skills.
Programming is a critical tool for problem-solving and critical thinking, as it requires the
programmer to break down complex problems into smaller, more manageable
components. By learning how to code, students will be able to analyze problems,
determine the most efficient solutions, and implement them in a structured manner. These
skills will be valuable in any field, as the ability to solve problems and think critically is
highly sought after in the job market.
The second objective is to develop data analysis skills with a focus on business prospects.
Data analysis is an essential skill for businesses to make informed decisions. By
analyzing data, businesses can identify patterns, trends, and correlations that can help
them make more effective decisions. In this coursework, students will learn how to gather,
clean, and analyse data to create meaningful insights for businesses. These skills will
make them valuable assets to companies, as data analysis is becoming increasingly
important in today's data-driven world.
Overall, the goals of this course are to acquire abilities that are highly appreciated in the
employment market. Students will be more equipped to enter the workforce and
contribute to their employers' success if they develop problem-solving, critical thinking,
and data analysis abilities. Furthermore, because these skills can be applied in a variety
of fields, this coursework is beneficial to students pursuing careers in a variety of
industries.
3. Overview
This coursework is focused on the analysis of sales data for the year 2019 of ABC
Company using the Python programming language. The objective of this project is to
prepare the data for further mining and analysis by going through different stages,
including data understanding, preparation, exploration, and initial analysis. This
coursework is essential to develop data analysis skills and to apply programming
knowledge to solve real-world data analysis problems.
The report opens with an overview of smart data discovery, a technique for extracting
important information from vast amounts of data. It draws on tools and methods from
statistics, mathematics, and computer science to interpret the data. The report also
emphasizes the value of data analysis skills in the current job market and how this
coursework helps students develop them.
The various phases of data analysis are then covered in the report. To find any quality
problems or missing data, the data are collected and described during the data
understanding stage. To make sure the data is ready for analysis, it is cleaned,
transformed, and integrated during the data preparation stage. Data is summarized and
visually represented during the data exploration stage in order to spot trends and
relationships. Finally, statistical techniques are used in the preliminary analysis stage to
test hypotheses and draw conclusions.
The Python programming language is utilized in this course because it includes a large
number of libraries and tools that provide efficient and powerful data processing and
analysis solutions. Python is a fantastic choice for data analysis activities since it is simple
to learn and use.
In conclusion, this coursework provides an excellent opportunity for students to develop
their programming knowledge and data analysis skills using Python. The various stages
of data analysis covered in this report will equip students with the necessary skills to
understand and manipulate large data sets, identify patterns, and draw conclusions.
These skills are in high demand in today's job market, and students who complete this
coursework will have a competitive advantage over their peers.
4. Data Understanding
Data understanding is a crucial step in any data analysis project. It involves identifying
the data sources, gathering information about the data, and understanding the
characteristics of the data. This step is important because it helps to ensure that the data
is fit for the intended purpose of the analysis.
In the case of the provided CSV files, we analysed the data to gain a better understanding
of the information contained within them. The CSV files contained six columns: Order ID,
Product, Quantity Ordered, Price Each, Order Date, and Purchase Address. The data
pertained to sales records of an electronics store for the period January to December
2019.
One of the key findings from our analysis was that the Order Date column was not in the
correct format. Specifically, the dates were in a text format rather than a date format. This
meant that we would need to convert the column into a date format to analyze the data
over time. Additionally, the Purchase Address column contained both the street address
and city, which required cleaning to separate the two into separate columns.
Another important finding was that some rows contained missing values, indicated by
NaN values. This meant that we needed to handle the missing values before proceeding
with any analysis to ensure the accuracy of the results.
We also examined the data types of the columns. The Order ID, Product, and Purchase
Address columns are strings. The Quantity Ordered and Price Each columns represent
integers and floats, respectively, and the Order Date column represents a date/time.
In the raw files, however, these three columns are read as text (object), so they must be
converted before they can be used in calculations or time-based analysis.
Overall, our analysis of the data provided us with a clear understanding of the data
sources and characteristics. This understanding enabled us to prepare the data for further
analysis, identify potential issues or limitations, and ultimately ensure that the data is fit
for the intended purpose of the analysis.
Column Name        | Description                            | Data Type
Order ID           | Unique identifier for each order       | String
Product            | Name of the product                    | String
Quantity Ordered   | Quantity of the product ordered        | Integer
Price Each         | Price of each product                  | Float
Order Date         | Date and time of the order             | Date/Time
Purchase Address   | Address where the product was shipped  | String
Table 1: Data Type
5. Data Preparation
Data preparation is a crucial step in the data analysis process. It involves cleaning and
transforming raw data into a format that can be easily analysed. Data preparation ensures
that the data is consistent, accurate, and complete, which helps to minimize errors and
improve the quality of the analysis. Proper data preparation techniques can also help to
identify and remove any outliers or irrelevant data points that can skew the analysis.
Overall, data preparation is an essential step that lays the foundation for successful data
analysis.
a. Merge data from each month into one CSV
This code imports several libraries such as pandas, warnings, os, calendar,
matplotlib.pyplot, seaborn, and tabulate.
import calendar
import os
import warnings
import matplotlib.pyplot as plt
import pandas as pd
import seaborn
from tabulate import tabulate
# Suppress library warnings to keep the output readable
warnings.filterwarnings("ignore")
Figure 1: Importing Libraries
Pandas is a library used for data manipulation and analysis, while warnings, os, and
calendar are standard libraries used for handling warnings, operating system interfaces,
and dates, respectively. Matplotlib.pyplot is a library used for data visualization, while
seaborn is another library used for statistical data visualization. Finally, tabulate is a
library used for creating tables in Python.
Figure 2: Merging Data frames into one CSV
# Read the merged .csv file into a DataFrame
merged_file_path = 'sales_data.csv'
df = pd.read_csv(merged_file_path)
# Display the first 7 rows of the DataFrame
print("First 7 rows of the DataFrame:")
df.head(7)
# display the last 7 rows without the index column
Result:
The merging code (Figure 2) first defines the folder name that contains the CSV files, and
then lists all the CSV files in that folder. If no CSV files are found, it raises a FileNotFoundError.
Next, an empty list is initialized to store the DataFrames from each file, and the code
iterates over the CSV files, reads them into DataFrames, and appends them to the list. If
there is an error reading a file, the error message is printed to the console.
After all the CSV files have been read and stored in the list, the pd.concat() function is
used to merge the DataFrames into a single DataFrame. The ignore_index=True
argument is used to reset the index of the merged DataFrame.
Finally, the merged DataFrame is saved to a new CSV file using the to_csv() function,
and the user is notified if the file was saved successfully or if an error occurred. The file
is saved in the same directory as the original CSV files.
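The merging steps described above can be sketched roughly as follows. This is a minimal illustration and not the code shown in Figure 2: the function name merge_monthly_csvs, the demo file names, and the sample rows are all hypothetical, and skipping an existing output file on a re-run is an added assumption.

```python
import os
import tempfile

import pandas as pd


def merge_monthly_csvs(folder, output_name="sales_data.csv"):
    """Concatenate every CSV in `folder` and save the result alongside them."""
    # Skip a previously written output file so a re-run does not merge it again
    csv_files = sorted(
        f for f in os.listdir(folder) if f.endswith(".csv") and f != output_name
    )
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {folder!r}")

    frames = []
    for name in csv_files:
        try:
            frames.append(pd.read_csv(os.path.join(folder, name)))
        except Exception as exc:
            # Report the problem and carry on with the remaining files
            print(f"Could not read {name}: {exc}")

    # ignore_index=True gives the merged frame a fresh 0..n-1 index
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(os.path.join(folder, output_name), index=False)
    return merged


# Tiny demonstration with two made-up monthly files in a temporary folder
with tempfile.TemporaryDirectory() as demo_folder:
    pd.DataFrame({"Order ID": [1, 2], "Product": ["A", "B"]}).to_csv(
        os.path.join(demo_folder, "Sales_January_2019.csv"), index=False)
    pd.DataFrame({"Order ID": [3], "Product": ["C"]}).to_csv(
        os.path.join(demo_folder, "Sales_February_2019.csv"), index=False)
    merged_demo = merge_monthly_csvs(demo_folder)
    print(merged_demo)
```

Collecting the DataFrames in a list and calling pd.concat() once is generally cheaper than concatenating inside the loop, because each concatenation copies the data.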
The result obtained is a merged DataFrame containing all the sales data from each
month. This DataFrame is used in subsequent tasks to perform further analysis. The code
saves the merged DataFrame as a new CSV file named 'sales_data.csv'. This file can be
used in practical cases to analyze the sales data for a given period and make informed
business decisions based on this data. In future cases, the merged DataFrame can be
used to detect trends or patterns in the sales data over different months, which can be
useful for forecasting and predicting future sales.
Figure 3: Merging Data Frame Result
The code reads a CSV file 'sales_data.csv' into a Pandas DataFrame called 'df'. The
'head' function is used to display the first 7 rows of the DataFrame. This code is useful for
quickly inspecting the structure and contents of the DataFrame.
The output displays the first 7 rows of the DataFrame, including the column headers. It
provides a snapshot of the data that the DataFrame contains, allowing for a quick
understanding of the type of data and the format of the columns. The 'head' function is
commonly used to get an overview of the data before performing any further analysis or
manipulation. In this case, it can help to identify any issues with the data, such as missing
values or unexpected data types.
b. Removing NAN missing values
Code:
Figure 4: Removing Missing Values
Results:
Figure 5: Before Removing NAN Values
Figure 6: After Removing all Missing Values
The code above loads the consolidated sales data into a pandas DataFrame and uses
the isna() method to count the number of NaN values in each column. The output is printed
as two tables: the first shows the count of NaN values before the rows containing NaN
values are removed, and the second shows the count after they are removed.
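A small sketch of this before/after count is given below. The sample rows are made up, and it assumes the default dropna() behaviour, which drops a row if any of its values is NaN; the report's own code (Figure 4) may differ in detail.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the merged sales data
df_nan_demo = pd.DataFrame({
    "Order ID": ["1001", np.nan, "1003"],
    "Product": ["USB-C Cable", np.nan, "Monitor"],
    "Quantity Ordered": ["2", np.nan, "1"],
})

nan_before = df_nan_demo.isna().sum()   # NaN count per column
df_nan_clean = df_nan_demo.dropna()     # drop every row containing at least one NaN
nan_after = df_nan_clean.isna().sum()

print("Before:", nan_before, sep="\n")
print("After:", nan_after, sep="\n")
```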
Dropping rows with NaN values is a common data cleaning strategy for removing
incomplete records. This operation produces a modified DataFrame with the same
number of columns but fewer rows. After the rows are removed, no NaN values remain,
indicating that the dataset is complete and ready for analysis.
This matters because incomplete or missing data can lead to incorrect analysis and
conclusions. In practice, a complete dataset helps ensure that the analysis is valid and
that the results are based on accurate data. The cleaned data can be used for a variety
of future analyses, including trend analysis, forecasting, and spotting patterns and
anomalies.
This code can be useful in identifying and dealing with missing values in a DataFrame.
Missing values can be problematic when analyzing data, as they can skew results and
affect statistical analyses. By knowing the number and location of missing values, we can
make informed decisions on how to handle them, such as imputing missing values or
dropping rows with missing values.
The above code removes duplicate rows from the sales_data.csv file based on specific
columns, namely Order ID, Product, Quantity Ordered, Price Each, Order Date, and
Purchase Address. It first prints the original number of rows in the DataFrame, and then
uses the drop_duplicates() method to remove duplicate rows from the DataFrame. The
cleaned DataFrame is then saved back to "sales_data.csv", overwriting the earlier
version, and the number of rows in the cleaned DataFrame is printed.
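As an illustration, the duplicate-removal step might look like the sketch below. The sample rows and variable names are invented for the example, and the file read and write steps are left out so that it runs on its own.

```python
import pandas as pd

key_columns = ["Order ID", "Product", "Quantity Ordered",
               "Price Each", "Order Date", "Purchase Address"]

# Made-up rows: the first two are exact duplicates of each other
df_dup_demo = pd.DataFrame(
    [
        ["1001", "Monitor", 1, 149.99, "01/05/19 10:30", "1 Main St, Austin, TX 73301"],
        ["1001", "Monitor", 1, 149.99, "01/05/19 10:30", "1 Main St, Austin, TX 73301"],
        ["1002", "USB-C Cable", 2, 11.95, "01/06/19 09:12", "9 Elm St, Boston, MA 02215"],
    ],
    columns=key_columns,
)

print("Rows before:", len(df_dup_demo))
# Keep the first occurrence of each row that matches on all key columns
df_dedup = df_dup_demo.drop_duplicates(subset=key_columns)
print("Rows after:", len(df_dedup))
```

Because the comparison uses all six columns, two different products that share one Order ID are not treated as duplicates.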
This step is important because duplicate data can skew the analysis and lead to incorrect
insights. By removing duplicates, we ensure that the data is accurate and reliable, and
can be used to generate meaningful insights.
c. Converting Quantity Ordered & Price Each into numeric
Figure 7: Converting Order and price to numeric
Figure 8 Before/ After changing the data type
This code block is used to change the data types of the "Quantity Ordered" and "Price
Each" columns from object to numeric. Before changing the data types, it first displays
the data types of these two columns using the tabulate module. After that, it converts
these two columns to numeric using the pd.to_numeric() method with the "errors"
parameter set to "coerce" to convert any invalid values to NaN. It then saves the changes
back to the CSV file. Finally, it displays the new data types of these two columns using
the tabulate module.
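The conversion can be sketched as below. The sample values are made up (including a stray text value standing in for an invalid entry), and the columns are created with an explicit object dtype to mirror the situation described in the report.

```python
import pandas as pd

# Made-up text data; the last value in each column is not a valid number
df_num_demo = pd.DataFrame({
    "Quantity Ordered": ["2", "1", "Quantity Ordered"],
    "Price Each": ["11.95", "149.99", "Price Each"],
}, dtype=object)
print(df_num_demo.dtypes)   # both columns hold text before conversion

for column in ["Quantity Ordered", "Price Each"]:
    # errors="coerce" turns anything unparseable into NaN instead of raising
    df_num_demo[column] = pd.to_numeric(df_num_demo[column], errors="coerce")

print(df_num_demo.dtypes)   # both columns are float64 after conversion
```

Note that coercion creates new NaN values wherever a value could not be parsed, so a further dropna() or similar check may be needed after this step.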
The result of this code block shows that the data types of both columns have been
successfully changed from object to float64. This allows us to perform mathematical
operations on these columns, such as multiplication, addition, and averaging. Because
invalid entries are converted to NaN rather than left as text, they can be detected and
removed instead of silently breaking later calculations. This change in data type is an
important step in preparing the data for analysis, and it supports future calculations and
data visualization.
d. Creating new column Month from Order Date
Figure 9: Creating Month column from Order Date
Result:
Figure 10 Showing new Month Column
Using pandas' str.split() method, this code splits the "Order Date" column into
three separate columns: "Month", "Day", and "Year". It splits the date string on the
forward slash ("/") separator. The code then removes the "Year" and "Day" columns
before saving the modified DataFrame back to the 'sales_data.csv' file. Finally,
the code uses the head() method to print the first five rows of the modified DataFrame.
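A minimal sketch of this split is shown below. It assumes the dates look like "MM/DD/YY HH:MM", which is a guess at the file's format, and the final cast to an integer is an added convenience rather than something the description above states.

```python
import pandas as pd

# Made-up dates in an assumed "MM/DD/YY HH:MM" layout
df_month_demo = pd.DataFrame({"Order Date": ["04/19/19 08:46", "12/01/19 21:05"]})

# Split on "/" into three pieces; the third piece still carries the time of day
df_month_demo[["Month", "Day", "Year"]] = (
    df_month_demo["Order Date"].str.split("/", expand=True)
)
df_month_demo = df_month_demo.drop(columns=["Day", "Year"])

# Store the month as a number so it sorts and groups numerically ("04" -> 4)
df_month_demo["Month"] = df_month_demo["Month"].astype(int)

print(df_month_demo.head())
```

An alternative under the same format assumption would be pd.to_datetime(df["Order Date"], format="%m/%d/%y %H:%M").dt.month, which also validates the dates while extracting the month.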
This change is beneficial because it allows us to readily analyze sales data on a monthly
basis. The new "Month" column allows you to arrange sales data by month of the year,
which can aid in identifying trends and patterns in the data. This could be useful for
business owners when deciding on inventory, marketing, and sales methods. For
example, they might want to know which months are the busiest for them so that they can
modify their marketing and inventory accordingly.
This update may be useful in the future for undertaking time-series analysis, projecting
future sales, and finding seasonal trends.
Figure 11: Showing Month data type
This code uses Pandas to read a CSV file named 'sales_data.csv' and assigns the
resulting DataFrame to the variable 'df'. Then, using the 'tabulate' function from the
'tabulate' module, it prints a table of column names and data types. The table has two
columns, 'Column Name' and 'Data Type', and uses the 'psql' format, which is a
console-printable format.
The code produces a table that displays the column names together with their data
types. Each row of the table represents one column of the DataFrame: the 'Column
Name' column shows its name, and the 'Data Type' column shows its data type.
Integer, float, object, and datetime64 are among the data types
supported. This information is useful for understanding the dataset's structure as well as
data cleaning and manipulation.
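The data type table could be produced along the lines of the sketch below. The sample DataFrame is made up, and the plain pandas fallback for environments without the third-party tabulate package is an addition for the sake of a self-contained example.

```python
import pandas as pd

try:
    from tabulate import tabulate
except ImportError:      # tabulate is a third-party package and may be absent
    tabulate = None

# Made-up frame standing in for the contents of sales_data.csv
df_types_demo = pd.DataFrame({
    "Product": ["Monitor", "USB-C Cable"],
    "Quantity Ordered": [1.0, 2.0],
    "Month": [4, 12],
})

# One (name, dtype) pair per DataFrame column
dtype_rows = [(name, str(dtype)) for name, dtype in df_types_demo.dtypes.items()]

if tabulate is not None:
    print(tabulate(dtype_rows, headers=["Column Name", "Data Type"], tablefmt="psql"))
else:
    fallback = pd.DataFrame(dtype_rows, columns=["Column Name", "Data Type"])
    print(fallback.to_string(index=False))
```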
e. Creating new column City from Purchase Address
Figure 12: Creating City Column from Purchase Address
This code splits the "Purchase Address" column of the pandas DataFrame "df" into three
separate columns: "Street", "City", and "State Zip". The "str.split()" function separates
the column values at the comma followed by a space that serves as the delimiter
between the parts of the address. The argument "expand=True" ensures that
the split parts are returned as separate columns.
After splitting, the code drops the "Street" and "State Zip" columns because they are
not needed for the analysis. Finally, the revised DataFrame is saved back to
"sales_data.csv" using the "to_csv()" function, with the "index=False" argument
excluding the index column from the saved file.
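The address split can be illustrated as follows. The addresses are invented, and the sketch assumes every address has exactly the form "street, city, state zip"; the save to CSV is omitted so that the example runs on its own.

```python
import pandas as pd

# Made-up addresses in an assumed "street, city, state zip" layout
df_city_demo = pd.DataFrame({
    "Purchase Address": ["1 Main St, Austin, TX 73301", "9 Elm St, Boston, MA 02215"],
})

# Split on ", " so each part of the address lands in its own column
df_city_demo[["Street", "City", "State Zip"]] = (
    df_city_demo["Purchase Address"].str.split(", ", expand=True)
)
# Only the city is needed for the analysis
df_city_demo = df_city_demo.drop(columns=["Street", "State Zip"])

print(df_city_demo)
```

One possible refinement, not part of the report, is to keep the state alongside the city, since two cities in different states can share a name and would otherwise be grouped together.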
Result:
Figure 13: Showing New City Column from Purchase Address
The "Purchase Address" field is divided into three additional columns by this code:
"Street", "City", and "State Zip". Then it removes the superfluous "Street" and "State Zip"
columns, leaving only the "City" column. This change is significant because it enables a
more granular study of sales data by geographic region. By determining which locations
have the largest sales demand, the corporation may concentrate on increasing their
market share in certain places while also strengthening marketing and distribution efforts
in areas that require more attention.
The resulting DataFrame displays the new "City" column alongside the original
DataFrame's remaining columns. This data modification is useful in data analysis
because it enables a more extensive study of sales patterns and trends by geographic
area. The new column can be used for additional analysis, such as determining which
cities or areas have the highest sales performance or which marketing efforts are most
effective in particular locations. Overall, this adjustment increases the data's quality and
usability for analysis.
In practice, this change can help businesses optimize their sales strategy and allocate
resources more effectively. For example, if a corporation determines that a specific city
has a strong demand for a specific product, it can boost its inventory levels in that region
to satisfy the demand. Furthermore, by analyzing sales patterns by city, businesses can
better tailor their marketing efforts to specific regions, resulting in increased sales and
revenue.
In the future, this adjustment may assist businesses in adapting to changing market
conditions and consumer behavior. Companies can spot trends and alter their plans by
examining sales patterns by geographic region over time. Changes in inventory levels,
marketing campaigns, and distribution networks are examples of this. Overall, the ability
to examine sales data by city or area can provide significant insights that can assist
businesses in improving their sales performance and remaining competitive in their
particular marketplaces.
6. Data Analysis
f. Showing summary statistics of
Figure 14 : Code for Summary Statistic
The given code computes and shows numerous sales-related statistics, such as the total
number of products sold, the mean and standard deviation of product prices, the
skewness of product prices, and the kurtosis of the quantity ordered. The data has been
cleaned and pre-processed, including the removal of any null values and the conversion
of data types to the appropriate format.
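The statistics listed above could be computed along the following lines. This is a sketch under stated assumptions: the sample values are hypothetical, and the column names "Quantity Ordered" and "Price Each" are taken from the report's description of the dataset.

```python
import pandas as pd

# Hypothetical cleaned sample (no nulls, numeric dtypes already applied).
df = pd.DataFrame({
    "Quantity Ordered": [1, 2, 1, 1, 3],
    "Price Each": [11.95, 3.84, 150.00, 11.95, 2.99],
})

total_sold = df["Quantity Ordered"].sum()       # total number of products sold
mean_price = df["Price Each"].mean()            # average product price
std_price = df["Price Each"].std()              # sample standard deviation (ddof=1)
skew_price = df["Price Each"].skew()            # > 0 means a long right tail
kurt_quantity = df["Quantity Ordered"].kurt()   # excess kurtosis (normal distribution = 0)

print(total_sold, mean_price, std_price, skew_price, kurt_quantity)
```

A positive skewness for price, as in this sample, would indicate a few expensive products pulling the mean above the median.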
Result:
Figure 15 : Showing summary Statistic
The outcome is positive since it provides useful insights into sales data, such as product
demand and price volatility. These data can be utilized to inform price and inventory plans,
as well as uncover sales process flaws. In practice, organizations can utilize this outcome
to evaluate their sales success and make data-driven decisions.
In the future, the outcome can be used to compare sales data from different time periods
or to observe changes in sales trends over time. Businesses can make informed
judgments about their operations and establish plans to remain competitive in the market
by evaluating these patterns. Overall, the provided code contains useful information that
can assist firms in optimizing their sales operations and improving their bottom line.
Figure 16 Describing Price Each
The describe() method returns a summary of the statistical measures for the
'Price Each' column of the DataFrame df. It contains the data's count, mean, standard
deviation, minimum, maximum, and quartile values. This strategy is helpful for quickly
analyzing data and identifying outliers or extreme results. The summary statistics can
help you comprehend the data's central tendency, dispersion, and range. The gathered
data can assist firms in making educated judgments about pricing strategies and product
offerings.
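As an illustration of how describe() can support outlier checks, the sketch below summarizes a hypothetical price series and then applies the common 1.5 × IQR rule. The IQR rule is an assumption added for this example; the report does not state which outlier criterion, if any, was used.

```python
import pandas as pd

# Hypothetical prices; the real column would be df["Price Each"].
prices = pd.Series([2.99, 3.84, 11.95, 11.95, 150.00], name="Price Each")

# count, mean, std, min, 25%, 50%, 75%, max
summary = prices.describe()
print(summary)

# Assumed rule of thumb: values above Q3 + 1.5 * IQR are flagged as high outliers.
iqr = summary["75%"] - summary["25%"]
upper_fence = summary["75%"] + 1.5 * iqr
outliers = prices[prices > upper_fence]
print(outliers)
```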
g. Calculating correlation of all variables
Figure 17: Calculating correlation of all variables
The provided code determines the correlation matrix for the DataFrame df's numerical
columns. The correlation matrix is then shown using a heatmap from the seaborn library.
The pairwise correlation coefficients between the columns in the DataFrame are
displayed in the correlation matrix. The correlation coefficient, which has values between
-1 and 1, is a measurement of the linear relationship between two variables.
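A minimal sketch of the correlation step is shown below, using hypothetical data chosen so that the coefficients are easy to verify by eye. Only the matrix computation is executed; the heatmap call appears as a comment because it depends on seaborn being installed.

```python
import pandas as pd

# Hypothetical data: price rises exactly with quantity, month falls exactly with it.
df = pd.DataFrame({
    "Quantity Ordered": [1, 2, 3, 4],
    "Price Each": [10.0, 20.0, 30.0, 40.0],
    "Month": [4, 3, 2, 1],
    "Product": ["a", "b", "c", "d"],  # non-numeric, excluded below
})

# Keep numeric columns only, then compute pairwise Pearson coefficients.
corr = df.select_dtypes(include="number").corr()
print(corr)

# With seaborn available, the matrix could be drawn as a heatmap, for example:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True, cmap="viridis")
```

In this sample, "Quantity Ordered" and "Price Each" have a coefficient of 1, while "Quantity Ordered" and "Month" have a coefficient of -1.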
Figure 18: Showing corr of all variable
The correlation matrix displays the correlation coefficient between each pair of variables
in the dataset. The values range from -1 to 1: a value of 1 shows a perfect positive
correlation, a value of 0 indicates no linear correlation, and a value of -1 indicates a
perfect negative correlation. A positive correlation between two variables indicates a
tendency for them to rise or fall together, whilst a negative correlation indicates a
tendency for them to move in opposite directions (BYJUS 2020).
The 'viridis' color scheme, which produces a natural gradation from yellow to green to
blue, is used in the modified code. The colors in this color scheme were chosen to be
easily distinguished by the human eye and to have an approximately constant perceived
color difference between neighboring hues. This color scheme is intended to be
perceptually uniform. Because of this, it is a good option for displaying numerical data
with a continuous range of values.
The pairwise correlation coefficients between all
variables in the dataset are calculated to yield this result. The correlation matrix is useful
for discovering correlations between variables, which can help us better comprehend the
data and inform decision-making. It can, for example, assist us in identifying variables
that are highly correlated with one another and can be excluded from the study to avoid
multicollinearity.
The correlation matrix can be utilized in the future to find trends or patterns in data over
time. For example, we may discover that the correlation between variables changes over
time as market conditions or consumer behavior change. This can assist organizations in
adapting their strategy to changing conditions and improving their performance. The
correlation matrix can also be used to examine the performance of various products or
marketplaces in order to uncover characteristics that contribute to their success or failure.
7. Data Exploration
Data exploration is the process of analysing and understanding a dataset to uncover
patterns, relationships, and insights that can help inform further analysis and
decision-making. This typically involves using various techniques to visualize and summarize the
data, such as creating charts and graphs, calculating descriptive statistics, and identifying
outliers or missing values (Tibco 2023).
h. Showing which month had the best sales in Bar graph
Figure 19: Code for best sale month in Bar Graph
The code first converts the "Order Date" column in the pandas dataframe into a datetime
format using the pd.to_datetime() function. It then extracts the month from the "Order
Date" column using the dt.month attribute and creates a new column "Month" in the
dataframe to store the extracted month values.
Next, the code calculates the total sales by month by grouping the dataframe by the
"Month" column and summing the "Sales" column using the groupby() and sum()
functions, respectively. The resulting sales values are stored in the "monthly_sales"
variable.
The code then determines the month with the highest sales by finding the maximum sales
value using the max() function and looking up the corresponding month name by passing the
result of the idxmax() method to the calendar.month_name array.
To visualize the sales by month, the code creates a vertical bar graph using the plt.bar()
function from the Matplotlib library. The color of the bar for the month with the highest
sales is set to blue, while the rest of the bars are set to black. The x-axis ticks are set to
display month names using the plt.xticks() function.
Finally, the code sets the axis labels, title, and background color using various Matplotlib
functions and displays the plot using the plt.show() function. The month with the highest
sales and the corresponding total sales value are then printed using the print() function.
The code generates a vertical bar graph showing the total sales by month for the ABC
company. The result highlights the month with the highest sales by colouring its bar
dark blue. The code extracts the month from the 'Order Date' column, calculates
the total sales by month, and groups the data by month to visualize the sales data. The
modified data is used to create a more meaningful visualization of the sales performance
over time.
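The monthly aggregation and highlighting described above can be sketched as follows. The sample rows, the date format string, and the helper name plot_monthly_sales are assumptions made for illustration; the plotting helper imports Matplotlib only when it is called.

```python
import calendar
import pandas as pd

# Hypothetical orders; the date format is assumed, not confirmed by the report.
df = pd.DataFrame({
    "Order Date": ["01/15/19 10:00", "12/24/19 18:30", "12/25/19 09:15", "04/02/19 12:00"],
    "Sales": [100.0, 250.0, 300.0, 120.0],
})

# Convert to datetime and extract the month number into a new column.
df["Order Date"] = pd.to_datetime(df["Order Date"], format="%m/%d/%y %H:%M")
df["Month"] = df["Order Date"].dt.month

# Total sales per month, then the best month and its name.
monthly_sales = df.groupby("Month")["Sales"].sum()
best_month = int(monthly_sales.idxmax())
best_month_name = calendar.month_name[best_month]

# One colour per bar: blue for the best month, black for the rest.
colors = ["blue" if m == best_month else "black" for m in monthly_sales.index]


def plot_monthly_sales(monthly_sales, colors):
    """Hypothetical helper: draw the vertical bar graph with month-name ticks."""
    import matplotlib.pyplot as plt

    months = list(monthly_sales.index)
    plt.bar(months, monthly_sales.values, color=colors)
    plt.xticks(months, [calendar.month_name[m] for m in months], rotation=45)
    plt.xlabel("Month")
    plt.ylabel("Total Sales")
    plt.title("Total Sales by Month")
    plt.show()


print(best_month_name, monthly_sales.max())
```

Calling plot_monthly_sales(monthly_sales, colors) would then render the chart.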
Figure 20: Bar Graph of Total Sales by Month
The outcome is favourable because it provides a clear depiction of sales performance by
month, allowing the ABC corporation to identify and compare the more successful months
to the less successful ones. This data can be used to improve marketing campaigns and
make educated inventory management decisions. In practice, the significance of this
result is that it assists organizations in identifying trends and patterns in sales data that
may be utilized to make educated future decisions. In the future, this outcome can be
used to evaluate sales performance between months, find trends or patterns in sales
data, and make informed decisions about marketing tactics and inventory management.
This data can also be utilized to improve pricing strategies and supply chain management,
ensuring that the company meets client demand for popular products. Overall, this finding
will help the ABC company make informed business decisions.
i. Showing which city had Highest product sold
Figure 21: Code for City with Highest product sold
In the first part of the code, the total product sales are calculated for each city by grouping
the DataFrame by city and using the sum() function to get the total quantity of products
sold in each city. Next, a bar graph is created to visualize the total product sales by city.
The x-axis represents the cities and the y-axis represents the total product sales. The
colour of the bars is set to black. The x-ticks are set to the city names and rotated vertically
for better readability. The x-label is set to "City" and the y-label is set to "Total Product
Sales". The title of the graph is set to "Total Product Sales by City". The background
colour of the graph is set to a light, near-white colour to make it more visually appealing. The
city with the highest product sales is highlighted by adding a red bar to the graph and
labelling it with the total number of units sold. The city name is also printed in red text to
make it stand out.
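The per-city aggregation could look like the sketch below. The sample rows are hypothetical, and the colour list shows one possible way to mark the top city in red; the report's own plotting code may differ.

```python
import pandas as pd

# Hypothetical rows; "City" is the column derived from "Purchase Address".
df = pd.DataFrame({
    "City": ["Dallas", "Boston", "Dallas", "Austin"],
    "Quantity Ordered": [2, 1, 3, 4],
})

# Total units sold per city.
city_sales = df.groupby("City")["Quantity Ordered"].sum()

# City with the highest total and the number of units behind it.
top_city = city_sales.idxmax()
top_units = city_sales.max()

# Red for the top city, black for every other bar.
colors = ["red" if city == top_city else "black" for city in city_sales.index]

print(top_city, top_units)
```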
Figure 22: Bar Graph of Total Product Sales by City
The outcome displays the city with the highest product sales and total units sold. This
assists firms in better understanding their sales performance, identifying growth
prospects, and optimizing their marketing strategy. It is useful in practice because it
guides resource allocation decisions. This outcome can be utilized in the future to
evaluate sales performance over time, analyze trends, optimize pricing strategies, and
improve supply chain management.
j. Illustrating most sold item in Bar graph
Figure 23: Code For most sold item in Overall
This code groups the DataFrame by product and computes the total quantity ordered
for each. It then generates a sorted product list and reorders the quantity ordered list to
match the sorted product list. A horizontal bar chart is generated with light blue bars
representing non-maximum values and dark blue bars representing the best-selling
goods. In blue text, the quantity ordered value is added inside the bar. The y-axis ticks
and label have been set to display the product names, although the x-axis label, chart
title, and background colour have not been changed.
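A compact sketch of the per-product totals is given below. It uses sort_values() on the grouped result instead of the separate sorted product list and reordering step described above, which is a simplification chosen for this example. The product names and quantities are hypothetical.

```python
import pandas as pd

# Hypothetical order lines.
df = pd.DataFrame({
    "Product": ["USB-C Charging Cable", "AAA Batteries (4-pack)", "iPhone",
                "USB-C Charging Cable", "AAA Batteries (4-pack)"],
    "Quantity Ordered": [3, 4, 1, 2, 3],
})

# Total quantity per product, sorted ascending so that a horizontal bar chart
# (which draws the first entry at the bottom) shows the best seller on top.
product_qty = df.groupby("Product")["Quantity Ordered"].sum().sort_values()

best_product = product_qty.idxmax()

# Dark blue for the best seller, light blue for the others.
colors = ["darkblue" if p == best_product else "lightblue" for p in product_qty.index]

print(product_qty)
```

The resulting series and colour list could be passed to Matplotlib's barh() to draw the chart.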
Figure 24: Bar Graph of Top Quantity Ordered
This code generates a horizontal bar chart displaying the total quantity ordered for each
product, with a green bar highlighting the best-selling product. The outcome assists
organizations in identifying their top-selling products and making informed decisions
about inventory management and marketing tactics. This data is useful for optimizing
pricing strategies, finding potential new items, and enhancing supply chain management.
This result can be used in the future to compare the sales performance of different items
over time and to find trends or patterns in sales data.
k. Showing histogram plot of Order Months
Figure 25: Histogram plot of Order Months
The code generates a histogram that shows the distribution of orders over the 12 months
of the year. The data is modified by converting the 'Month' column from object to integer
data type. The histogram shows that the highest number of orders were placed in
December, followed by October and November. This result can help businesses to
understand seasonal trends in order placement and plan their inventory management and
marketing strategies accordingly.
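The type conversion and binning behind such a histogram can be sketched as follows. The month values are hypothetical, and NumPy's histogram function stands in for the plotting call so that the bin counts can be inspected directly.

```python
import numpy as np
import pandas as pd

# Hypothetical "Month" column stored as strings (object dtype).
df = pd.DataFrame({"Month": ["1", "12", "12", "10", "11", "12", "10"]})

# Convert from object to integer so the values can be binned numerically.
df["Month"] = df["Month"].astype(int)

# Bin edges 0.5, 1.5, ..., 12.5 give each month exactly one bin.
edges = np.arange(0.5, 13.5, 1.0)
counts, _ = np.histogram(df["Month"], bins=edges)

# Label the 12 bin counts with their month numbers.
orders_per_month = pd.Series(counts, index=range(1, 13))
print(orders_per_month)
```

Placing the bin edges on the half-integers avoids two months sharing a bin, which can happen with default bin edges.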
Figure 26: Showing Histogram Plot
A histogram is a type of chart that shows the distribution of numerical data. It divides the
data into intervals called bins and counts how many values fall into each bin. The height
of each bar in the histogram represents the frequency or relative frequency of the bin. A
histogram can help you see the shape, centre, spread and outliers of the data.
(BYJUS, n.d.)
In the given code, the histogram is generated from the data after it has been grouped by
month and counted to determine the number of orders placed in each month. This
provides valuable insights into customer behaviour and helps businesses better
understand their customers. The resulting histogram shows the distribution of orders over
the 12 months of the year, allowing for observations on which months have higher or
lower order volumes. By analyzing this data, businesses can identify trends and patterns
in customer behavior and adjust their marketing and sales strategies accordingly.
The peak in the histogram represents the month in which the most orders were placed,
while the troughs indicate the months in which the least orders were placed. In this case,
the histogram shows that most orders were placed in December (the holiday season),
followed by October and November, while the least orders were placed in January and
February. This information can be used by businesses to optimize inventory management
and marketing strategies for different months of the year.
Overall, the result obtained is good because it provides insights into customer behaviours
that can be used to improve business operations. By using this data to inform marketing
and sales strategies, businesses can increase their revenue and improve customer
satisfaction. In practical cases, this result can be used by businesses to identify the
months with the highest demand for their products and ensure that they have enough
stock to meet customer demand.
This result can be used in the future to analyse the distribution of orders over different
years and detect trends or patterns in the data. It can also be used to inform forecasting
models and estimate future sales, allowing firms to plan ahead and remain competitive.
Overall, this outcome is a useful tool for firms trying to streamline their operations and
increase their bottom line.
8. Conclusion
Having completed my coursework, I would now like to reflect on my experience of the
project. While working on this project, I discovered a lot of new and interesting things
and concepts and it was a tremendous learning experience. I learned about data cleaning,
pre-processing, merging, and analysis, and how to apply these skills to real-world data
sets. This experience has been a valuable opportunity for me to apply the knowledge and
skills I have acquired in a practical setting. Some parts of the coursework were difficult to
understand, which I remedied by conducting online research and consulting with my
teacher.
The dataset I worked on for this project was related to sales analysis for ABC company
for the year 2019. It contained information on attributes such as Order ID, Product, and
Quantity Ordered. My primary objective was to prepare the data for further data mining
and analysis.
In addition to gaining technical skills, I also learned the importance of data analysis in
business. Data analysis plays a critical role in decision-making processes, and
businesses rely on it to optimize their operations, identify areas for growth, and gain a
competitive edge. By analysing and interpreting data, businesses can make informed
decisions that can lead to greater success and profitability.
Overall, the coursework has been a valuable learning experience for me. It has helped
me develop my data science skills and has given me a better understanding of the
importance of data analysis in business. I look forward to applying these skills to future
projects and continuing to learn more about the exciting field of data science. I would like
to thank my teachers for entrusting us with this project. Every effort I made to finish this
coursework was pleasurable. I hope that my work will be entertaining and perhaps even
informative.
References
❖ What is Data? Definition, Types, Computer, Information - javatpoint. (n.d.).
www.javatpoint.com. https://www.javatpoint.com/data
❖ What is smart data? | Definition from TechTarget. (2016, December 1). WhatIs.com.
https://www.techtarget.com/whatis/definition/smart-data
❖ What Is Data Processing: Cycle, Types, Methods, Steps and Examples | Simplilearn.
(2020, October 21). Simplilearn.com. https://www.simplilearn.com/what-is-data-processing-article
❖ Histogram - Definition, Types, Graph, and Examples. (n.d.). BYJUS.
https://byjus.com/maths/histogram/
❖ Bar Graph - Definition, Types, Uses, How to Draw Bar graph, Examples. (2020,
August 9). BYJUS. https://byjus.com/maths/bar-graph/
❖ Correlation - Correlation Coefficient, Types, Formulas & Example. (2019, November
24). BYJUS. https://byjus.com/maths/correlation/
❖ Statistics Definitions, Types, Formulas & Applications. (2019, December 26). BYJUS.
https://byjus.com/maths/statistics/
❖ User Guide — pandas 2.0.1 documentation. (n.d.). User Guide — Pandas 2.0.1
Documentation. https://pandas.pydata.org/docs/user_guide/index.html#user-guide
❖ The Python Tutorial. (n.d.). Python Documentation.
https://docs.python.org/3/tutorial/index.html
❖ Berthold, M. R., Borgelt, C., Höppner, F., & Klawonn, F. (n.d.). Data Understanding.
Data Understanding | SpringerLink. https://doi.org/10.1007/978-1-84882-260-3_4
❖ Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting,
Filtering, Groupby). (2018, October 25). YouTube.
https://www.youtube.com/watch?v=vmEHCJofslg
❖ Data Analysis with Python Course - Numpy, Pandas, Data Visualization. (2021,
February 18). YouTube. https://www.youtube.com/watch?v=GPVsHOlRBBI
❖ Quick start guide — Matplotlib 3.7.1 documentation. (n.d.). Quick Start Guide —
Matplotlib 3.7.1 Documentation.
https://matplotlib.org/stable/tutorials/introductory/quick_start.html
❖ Data Visualization: Definition, Benefits, and Examples. (n.d.). Coursera.
https://www.coursera.org/articles/data-visualization
❖ How to calculate summary statistics — pandas 2.0.1 documentation. (n.d.). How to
Calculate Summary Statistics — Pandas 2.0.1 Documentation.
https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html
❖ Pandas - Data Correlations. (n.d.). Pandas - Data Correlations.
https://www.w3schools.com/python/pandas/pandas_correlations.asp
❖ What is Data Exploration? | TIBCO Software