Module Code & Module Title: CC5067NP Smart Data Discovery
Assessment Weightage & Type: 60% Individual Coursework
Year and Semester: 2023 Spring
C4 Siddhartha Kc
London Met ID: 21050025
College ID: NP04CP4A210073
Assignment Deadline: 04 May 2023
Assignment Submission Date: 04 May 2023

I understand that I must submit my coursework online via MST before the deadline in order to receive a mark. Late submissions will not be accepted and will result in a mark of zero.

Acknowledgement

I would like to express my gratitude to my module leader, Mr. Badri Raj Lamichhane, for providing me with the opportunity to undertake this coursework for the module "Smart Data Discovery". His guidance and support throughout the coursework have been invaluable, and I have learned a great deal from his expertise in the field. I would also like to thank Islington College and London Metropolitan University for providing me with the resources and support necessary to complete this coursework successfully. Lastly, I would like to acknowledge the hard work and dedication of my fellow classmates, whose contributions and feedback have helped me in my learning process.

Contents

1. Introduction
1.1 Data Processing Cycle
2. Objectives
3. Overview
4. Data Understanding
5. Data Preparation
a. Merge data from each month into one CSV
b. Removing NAN missing values
c. Converting Quantity Ordered & Price Each into numeric
d. Creating new column Month from Ordered Data
e. Creating new column City from Purchase Address
6. Data Analysis
f. Showing summary statistics of
g. Calculating correlation of all variable
7. Data Exploration
h. Showing which month had the best sales in Bar graph
i. Showing which city had Highest product sold
j. Illustrating most sold item in Bar graph
k. Showing histogram plot of Order Months
8. Conclusion
References

List of Figures

Figure 1: Importing Libraries
Figure 2: Merging Data frames into one CSV
Figure 3: Merging Data Frame Result
Figure 4: Removing Missing Values
Figure 5: Before Removing NAN Values
Figure 6: After Removing all Missing Values
Figure 7: Converting Order and price to numeric
Figure 8 Before/ After changing the data type
Figure 9 Creating Month column from Ordered Data
Figure 10 Showing new Month Column
Figure 11: Showing Month data type
Figure 12: Creating City Column from Purchase Address
Figure 13: Showing New City Column from Purchase Address
Figure 14 : Code for Summary Statistic
Figure 15 : Showing summary Statistic
Figure 16 Describing Price Each
Figure 17: Calculating correlation of all variable
Figure 18: Showing corr of all variable
Figure 19: Code for best sale month in Bar Graph
Figure 20: Bar Graph of Total Sales by Month
Figure 21: Code for City with Highest product sold
Figure 22: Bar Graph of Total Product Sales by City
Figure 23: Code For most sold item in Overall
Figure 24: Bar Graph of Top Quantity Ordered
Figure 25: Histogram plot of Order Months
Figure 26: Showing Histogram Plot

CC5067 Smart Data Discovery

1. Introduction

Smart data discovery is a powerful tool that enables people to extract useful information from vast amounts of data. The process involves using advanced techniques in statistics, mathematics, and computer science to analyse and make sense of complex data sets. By using smart data discovery, businesses and organizations can gain valuable insights into their customers, products, and services, which can help them to make more informed decisions and improve their operations (IBM 2020).

The objective of the coursework is to analyse the sales data of ABC Company for the year 2019 using Python. Python is a popular programming language that is widely used in data analysis due to its rich set of libraries and tools that facilitate data processing and analysis. The aim is to prepare the data for further data mining and analysis by covering various stages, including data understanding, preparation, exploration, and initial analysis.

The first stage of the data analysis process involves data understanding, which involves collecting and describing data to identify any quality issues or missing data. This stage is critical since any inaccuracies or errors in the data can impact the accuracy of the analysis. Once the data is collected, it is then prepared by cleaning, transforming, and integrating it to ensure that it is ready for analysis.

In the data exploration stage, the data is visualized and summarized to identify patterns and relationships.
This stage is critical since it enables the analyst to gain insights into the data and identify any trends that may be relevant to the analysis. Finally, in the initial analysis stage, statistical methods are applied to test hypotheses and draw conclusions.

Through this coursework, the student will demonstrate their ability to apply critical thinking and problem-solving skills to real-world data analysis tasks. They will also gain experience using Python, a powerful programming language that is widely used in data analysis. The report will showcase various stages of data analysis and highlight the importance of smart data analysis in today's business environment.

Data refers to a collection of facts, figures, and statistics that are stored and analysed by computers to gain insights and knowledge. It can take various forms, including text, numbers, images, and sounds. Data is essential in decision-making processes and provides valuable information to businesses, governments, and individuals (Java point). In a business context, data can include customer information, sales data, financial records, and more. Data can be obtained through various sources such as surveys, customer feedback, social media, and website analytics.

Once data is collected, it needs to be processed to extract meaningful insights. Data processing involves converting raw data into a format that can be analysed, including cleaning, transforming, and aggregating data. Data processing is essential for businesses because it enables them to extract insights from the data and make informed decisions based on accurate and reliable information. By processing data efficiently, businesses can optimize their operations, reduce costs, and increase profits.

Smart data refers to the process of analysing and utilizing data to gain insights, improve decision-making, and drive business growth.
The Smart Data Discovery module focuses on teaching the process of collecting, analysing, and presenting data from various sources to uncover hidden patterns and insights for smart decision-making in a business context. Business data discovery is crucial for organizations to gain a competitive edge in the market, improve their processes, and create new opportunities for growth and innovation. This module covers topics such as data visualization, automated data preparation, and integration of data to provide actionable insights. By leveraging smart data solutions, businesses can sustain their growth and improve their profitability (TechTarget 2023).

1.1 Data Processing Cycle

The data processing cycle is an essential part of modern businesses, where data is often used to drive decision-making processes. By following the data processing cycle, organizations can ensure that they are collecting accurate data, processing it efficiently, and presenting the results in a useful format (Simplilearn 2016).

❖ The first stage of the cycle, gathering, involves collecting data from various sources. This stage is crucial, as it sets the foundation for the entire process. If the data is inaccurate or incomplete, the resulting insights will be of little value.
❖ The second stage, preparation of data, involves cleaning and transforming raw data into a form suitable for further analysis. This stage is often the most time-consuming part of the process, but it is essential for ensuring that the data is accurate and reliable.
❖ The third stage, data entry, involves entering the prepared data into a computer system for processing. This stage is typically automated, but it is important to ensure that the data is entered accurately.
❖ The fourth stage, processing of data, involves using various tools and techniques to analyse and manipulate the data.
This stage is where the most value is often derived from the data, as it allows analysts to identify trends, patterns, and insights that may be hidden in the data.
❖ The fifth stage, output, involves presenting the results of the data processing in a format that is useful to the end user. This may include reports, graphs, and visualizations.
❖ Finally, the sixth stage, storage, involves storing the processed data for future use. This stage is essential for ensuring that the data is available for future analysis and decision-making.

Figure 1: Data Processing Cycle

In today's data-driven world, investing in the right data processing tools and technologies can be crucial for businesses to unlock the full potential of their data and turn it into a competitive advantage. By using modern data processing tools such as machine learning algorithms and artificial intelligence, businesses can gain insights that were previously impossible to obtain. These insights can help businesses make informed decisions, improve their processes, and stay ahead of the competition. Therefore, it is essential for businesses to stay up-to-date with the latest data processing technologies to leverage the full potential of their data.

2. Objectives

The course is designed to equip students with highly sought-after skills that are crucial for success in today's job market. The primary objective is to enhance students' abilities in problem-solving, critical thinking, and data analysis. By developing these skills, students will be better prepared to contribute to the success of their employers and secure positions in various industries.

The first objective aims to improve students' problem-solving and critical thinking skills. Programming is a critical tool for problem-solving and critical thinking, as it requires the programmer to break down complex problems into smaller, more manageable components.
By learning how to code, students will be able to analyse problems, determine the most efficient solutions, and implement them in a structured manner. These skills will be valuable in any field, as the ability to solve problems and think critically is highly sought after in the job market.

The second objective is to develop data analysis skills with a focus on business prospects. Data analysis is an essential skill for businesses to make informed decisions. By analysing data, businesses can identify patterns, trends, and correlations that can help them make more effective decisions. In this coursework, students will learn how to gather, clean, and analyse data to create meaningful insights for businesses. These skills will make them valuable assets to companies, as data analysis is becoming increasingly important in today's data-driven world.

Overall, the goal of this course is for students to acquire abilities that are highly valued in the employment market. Students will be better equipped to enter the workforce and contribute to their employers' success if they develop problem-solving, critical thinking, and data analysis abilities. Furthermore, because these skills can be applied in a variety of fields, this coursework is beneficial to students pursuing careers in a variety of industries.

3. Overview

This coursework is focused on the analysis of sales data for the year 2019 of ABC Company using the Python programming language. The objective of this project is to prepare the data for further mining and analysis by going through different stages, including data understanding, preparation, exploration, and initial analysis. This coursework is essential to develop data analysis skills and to apply programming knowledge to solve real-world data analysis problems.

An overview of smart data discovery, a technique for extracting important information from vast amounts of data, opens the report.
To interpret the data, it makes use of tools and methods from the fields of statistics, math, and computer science. The value of data analysis skills in the current job market is emphasized, as well as how this coursework can give students access to them.

The various phases of data analysis are then covered in the report. To find any quality problems or missing data, the data are collected and described during the data understanding stage. To make sure the data is ready for analysis, it is cleaned, transformed, and integrated during the data preparation stage. Data is summarized and visually represented during the data exploration stage in order to spot trends and relationships. Finally, statistical techniques are used in the preliminary analysis stage to test hypotheses and draw conclusions.

The Python programming language is utilized in this course because it includes a large number of libraries and tools that provide efficient and powerful data processing and analysis solutions. Python is a fantastic choice for data analysis activities since it is simple to learn and use.

In conclusion, this coursework provides an excellent opportunity for students to develop their programming knowledge and data analysis skills using Python. The various stages of data analysis covered in this report will equip students with the necessary skills to understand and manipulate large data sets, identify patterns, and draw conclusions. These skills are in high demand in today's job market, and students who complete this coursework will have a competitive advantage over their peers.

4. Data Understanding

Data understanding is a crucial step in any data analysis project. It involves identifying the data sources, gathering information about the data, and understanding the characteristics of the data. This step is important because it helps to ensure that the data is fit for the intended purpose of the analysis.
In the case of the provided CSV files, we analysed the data to gain a better understanding of the information contained within them. The CSV files contained six columns: Order ID, Product, Quantity Ordered, Price Each, Order Date, and Purchase Address. The data pertained to sales records of an electronics store for the period January to December 2019.

One of the key findings from our analysis was that the Order Date column was not in the correct format. Specifically, the dates were stored as text rather than in a date format. This meant that we would need to convert the column before analysing the data over time. Additionally, the Purchase Address column contained both the street address and the city, so it required cleaning to split the two into separate columns.

Another important finding was that some rows contained missing values, indicated by NaN values. This meant that we needed to handle the missing values before proceeding with any analysis to ensure the accuracy of the results.

We also examined the kind of value each column is meant to hold. Order ID, Product, and Purchase Address hold strings, Quantity Ordered holds integers, Price Each holds floats, and Order Date holds date/time values, even though, as noted above, the dates are stored as text in the raw files.

Overall, our analysis of the data provided us with a clear understanding of the data sources and characteristics. This understanding enabled us to prepare the data for further analysis, identify potential issues or limitations, and ultimately ensure that the data is fit for the intended purpose of the analysis.
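The checks described in this section can be sketched in a few lines of pandas. This is only an illustration: the three rows below are made-up placeholder values, not records from the coursework dataset, and only the column names are taken from the text above.

```python
import pandas as pd

# Hypothetical sample rows that mimic the six columns described above.
# The last row is entirely missing, like the NaN rows found in the raw files.
sample = pd.DataFrame({
    "Order ID": ["1001", "1002", None],
    "Product": ["Item A", "Item B", None],
    "Quantity Ordered": ["2", "1", None],
    "Price Each": ["11.95", "99.99", None],
    "Order Date": ["04/19/19 08:46", "04/07/19 22:30", None],
    "Purchase Address": ["1 Example St, Dallas, TX 75001",
                         "2 Sample Ave, Boston, MA 02215", None],
})

# Data type pandas assigned to each column (text usually shows up as
# "object", or as a string dtype in newer pandas versions).
column_types = sample.dtypes

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

print(column_types)
print(missing_per_column)
```

Running the same two calls on the merged sales DataFrame would show which columns are still stored as text and how many rows need cleaning.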
Column Name        Description                               Data Type
Order ID           Unique identifier for each order          String
Product            Name of the product                       String
Quantity Ordered   Quantity of the product ordered           Integer
Price Each         Price of each product                     Float
Order Date         Date and time of the order                Date/Time
Purchase Address   Address where the product was shipped     String

Table 1 Data Type

5. Data Preparation

Data preparation is a crucial step in the data analysis process. It involves cleaning and transforming raw data into a format that can be easily analysed. Data preparation ensures that the data is consistent, accurate, and complete, which helps to minimize errors and improve the quality of the analysis. Proper data preparation techniques can also help to identify and remove any outliers or irrelevant data points that can skew the analysis. Overall, data preparation is an essential step that lays the foundation for successful data analysis.

a. Merge data from each month into one CSV

This code imports several libraries such as pandas, warnings, os, calendar, matplotlib.pyplot, seaborn, and tabulate.

    import pandas as pd
    import warnings, os, calendar, seaborn
    import matplotlib.pyplot as plt
    warnings.filterwarnings("ignore")
    from tabulate import tabulate

Figure 1: Importing Libraries

Pandas is a library used for data manipulation and analysis, while warnings, os, and calendar are standard libraries used for handling warnings, operating system interfaces, and dates, respectively. Matplotlib.pyplot is a library used for data visualization, while seaborn is another library used for statistical data visualization. Finally, tabulate is a library used for creating tables in Python.
Figure 2: Merging Data frames into one CSV

    # Read the merged .csv file into a DataFrame
    merged_file_path = 'sales_data.csv'
    df = pd.read_csv(merged_file_path)

    # Display the first 7 rows of the DataFrame
    print("First 7 rows of the DataFrame:")
    df.head(7)

Result:

This code first defines the folder name that contains the CSV files, and then lists all the CSV files in that folder. If no CSV files are found, it raises a FileNotFoundError. Next, an empty list is initialized to store the DataFrames from each file, and the code iterates over the CSV files, reads them into DataFrames, and appends them to the list. If there is an error reading a file, the error message is printed to the console.

After all the CSV files have been read and stored in the list, the pd.concat() function is used to merge the DataFrames into a single DataFrame. The ignore_index=True argument is used to reset the index of the merged DataFrame. Finally, the merged DataFrame is saved to a new CSV file using the to_csv() function, and the user is notified if the file was saved successfully or if an error occurred. The file is saved in the same directory as the original CSV files.

The result obtained is a merged DataFrame containing all the sales data from each month. This DataFrame is used in subsequent tasks to perform further analysis. The code saves the merged DataFrame as a new CSV file named 'sales_data.csv'. This file can be used in practical cases to analyse the sales data for a given period and make informed business decisions based on this data. In future cases, the merged DataFrame can be used to detect trends or patterns in the sales data over different months, which can be useful for forecasting and predicting future sales.
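Since the merging code itself survives only as a figure caption, the steps described above can be sketched roughly as follows. This is a minimal reconstruction under assumptions, not the original code: the function name merge_csv_files, the temporary folder, and the two tiny monthly files are hypothetical and exist only to make the example runnable.

```python
import glob
import os
import tempfile

import pandas as pd


def merge_csv_files(folder, output_path):
    """Concatenate every .csv file found in `folder` and save the result."""
    csv_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {folder!r}")

    frames = []  # one DataFrame per monthly file
    for path in csv_files:
        try:
            frames.append(pd.read_csv(path))
        except (OSError, ValueError) as exc:
            # Report the unreadable file and continue with the others.
            print(f"Could not read {path}: {exc}")

    # ignore_index=True gives the merged frame a fresh 0..n-1 index.
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(output_path, index=False)
    return merged


# Demonstration with two made-up monthly files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Order ID": [1, 2], "Product": ["A", "B"]}).to_csv(
        os.path.join(tmp, "Sales_January_2019.csv"), index=False)
    pd.DataFrame({"Order ID": [3], "Product": ["C"]}).to_csv(
        os.path.join(tmp, "Sales_February_2019.csv"), index=False)
    merged = merge_csv_files(tmp, os.path.join(tmp, "sales_data.csv"))

print(merged)
```

One design point worth noting: because the output is written into the same folder as the inputs, running the merge a second time would also pick up the merged file. Writing the output elsewhere, or skipping that file name when listing inputs, avoids counting rows twice.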
Figure 3: Merging Data Frame Result

The code reads the CSV file 'sales_data.csv' into a Pandas DataFrame called 'df'. The 'head' function is used to display the first 7 rows of the DataFrame. This code is useful for quickly inspecting the structure and contents of the DataFrame.

The output displays the first 7 rows of the DataFrame, including the column headers. It provides a snapshot of the data that the DataFrame contains, allowing for a quick understanding of the type of data and the format of the columns. The 'head' function is commonly used to get an overview of the data before performing any further analysis or manipulation. In this case, it can help to identify any issues with the data, such as missing values or unexpected data types.

b. Removing NAN missing values

Code:

Figure 4: Removing Missing Values

Results:

Figure 5: Before Removing NAN Values

Figure 6: After Removing all Missing Values

The code above loads the consolidated sales data into a pandas DataFrame and uses the isna() method to count the number of NaN values in each column. The output is written in two tables, the first displaying the count of NaN values before removing the rows containing NaN values and the second displaying the count after removing the rows.

Dropping rows with NaN values is a typical data cleaning strategy for removing incomplete or missing data. This operation produces a modified DataFrame with the same number of columns but fewer rows. After eliminating the rows, we can observe that there are no more NaN values, indicating that the dataset is complete and ready for analysis. This is the desired result, because incomplete or missing data can lead to incorrect analysis and conclusions. In practical circumstances, having a complete dataset ensures that the analysis is valid and that the results are founded on accurate data.
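The before-and-after count described above might look like the following sketch. The three-row DataFrame is made-up sample data, and dropna() is called with its default settings, which is an assumption since the original call is not shown in the text.

```python
import pandas as pd

# Made-up sample data: the middle row is completely missing.
df = pd.DataFrame({
    "Order ID": ["1001", None, "1003"],
    "Product": ["Item A", None, "Item C"],
    "Quantity Ordered": ["1", None, "2"],
})

# Count NaN values per column before cleaning.
nan_before = df.isna().sum()

# Drop every row that contains at least one NaN (the dropna default).
cleaned = df.dropna()

# Count again to confirm nothing is missing any more.
nan_after = cleaned.isna().sum()

print("Before:", nan_before.to_dict())
print("After: ", nan_after.to_dict())
print("Rows:", len(df), "->", len(cleaned))
```

The column count is unchanged and only the row count shrinks, which matches the behaviour described in this section.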
The cleaned information can be utilized for a variety of analyses in the future, including trend analysis, forecasting, and spotting patterns and anomalies.

This code can be useful in identifying and dealing with missing values in a DataFrame. Missing values can be problematic when analyzing data, as they can skew results and affect statistical analyses. By knowing the number and location of missing values, we can make informed decisions on how to handle them, such as imputing missing values or dropping rows with missing values.

The above code removes duplicate rows from the sales_data.csv file based on specific columns, namely Order ID, Product, Quantity Ordered, Price Each, Order Date, and Purchase Address. It first prints the original number of rows in the DataFrame, and then uses the drop_duplicates() method to remove duplicate rows from the DataFrame. The cleaned DataFrame is then saved to a new CSV file named "sales_data.csv" and the number of rows in the cleaned DataFrame is printed. This step is important because duplicate data can skew the analysis and lead to incorrect insights. By removing duplicates, we ensure that the data is accurate and reliable, and can be used to generate meaningful insights.

c. Converting Quantity Ordered & Price Each into numeric

Figure 7: Converting Order and price to numeric

Figure 8 Before/ After changing the data type

This code block is used to change the data types of the "Quantity Ordered" and "Price Each" columns from object to numeric. Before changing the data types, it first displays the data types of these two columns using the tabulate module. After that, it converts these two columns to numeric using the pd.to_numeric() method with the "errors" parameter set to "coerce" to convert any invalid values to NaN. It then saves the changes back to the CSV file.
Finally, it displays the new data types of these two columns using the tabulate module.

The result of this code block shows that the data types of both columns have been successfully changed from object to float64. This will allow us to perform mathematical operations on these columns such as multiplication, addition, and averaging. It also prevents errors that could arise from invalid or incorrect values in these columns, since such values are converted to NaN. This change in data type is an important step in preparing the data for analysis, and it will be helpful for future calculations and data visualization.

d. Creating new column Month from Ordered Data

Figure 9 Creating Month column from Ordered Data

Result:

Figure 10 Showing new Month Column

Using pandas' str.split() method, this code sample divides the "Order Date" column into three different columns: "Month", "Day", and "Year". It divides the date string into three columns using the forward slash ("/") separator. The code then removes the "Year" and "Day" columns before saving the modified DataFrame back to the 'sales_data.csv' file. Finally, the code uses the head() method to print the first five rows of the changed DataFrame.

This change is beneficial because it allows us to readily analyse sales data on a monthly basis. The new "Month" column allows you to arrange sales data by month of the year, which can aid in identifying trends and patterns in the data. This could be useful for business owners when deciding on inventory, marketing, and sales methods. For example, they might want to know which months are the busiest for them so that they can modify their marketing and inventory accordingly. This update may be useful in the future for undertaking time-series analysis, projecting future sales, and finding seasonal trends.
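The two column transformations described in sections c and d can be sketched together as follows. The rows are made-up sample data; the third row imitates a stray repeated header so that errors="coerce" has something to convert to NaN. Casting "Month" to an integer at the end is an assumption made here for convenience, since the text does not say which type the original code produced.

```python
import pandas as pd

# Made-up sample data; the last row holds non-numeric text on purpose.
df = pd.DataFrame({
    "Quantity Ordered": ["2", "1", "Quantity Ordered"],
    "Price Each": ["11.95", "99.99", "Price Each"],
    "Order Date": ["04/19/19 08:46", "12/07/19 22:30", "01/05/19 10:00"],
})

# Section c: convert text to numbers; values that cannot be parsed become NaN.
for column in ["Quantity Ordered", "Price Each"]:
    df[column] = pd.to_numeric(df[column], errors="coerce")

# Section d: split "MM/DD/YY HH:MM" on "/" into three pieces,
# then keep only the month.
df[["Month", "Day", "Year"]] = df["Order Date"].str.split("/", expand=True)
df = df.drop(columns=["Day", "Year"])
df["Month"] = df["Month"].astype(int)  # assumption: store the month as an integer

print(df.dtypes)
print(df)
```

An alternative that avoids string splitting is pd.to_datetime on "Order Date" followed by the .dt.month accessor, which also validates the dates. It is mentioned only as an option and is not what the report describes.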
Figure 11: Showing Month data type

This code uses Pandas to read a CSV file named 'sales_data.csv' and assigns the generated DataFrame to the variable 'df'. Then, using the 'tabulate' function from the 'tabulate' module, it prints a table of column names and data types. The table has two columns, 'Column Name' and 'Data Type', and the format is 'psql', which is a console-printable format.

The code produces a table that displays the column names as well as the data types associated with them. Each row in the table represents a column of the DataFrame: the 'Column Name' column displays the column's name, while the 'Data Type' column displays the column's data type. Integer, float, object, and datetime64 are among the data types supported. This information is useful for understanding the dataset's structure as well as for data cleaning and manipulation.

e. Creating new column City from Purchase Address

Figure 12: Creating City Column from Purchase Address

This code divides the "Purchase Address" column of the pandas DataFrame "df" into three independent columns: "Street", "City", and "State Zip". The "str.split()" function is used to separate the column values at the comma followed by a space that serves as a delimiter between the various portions of the address. The argument "expand=True" ensures that the split sections are returned as independent columns. The code drops the "Street" and "State Zip" columns after splitting the column because they are superfluous for the analysis. Finally, the revised DataFrame is saved to "sales_data.csv" using the "to_csv()" function, with the "index=False" argument to exclude the index column from the saved file.

Result:

Figure 13: Showing New City Column from Purchase Address

The "Purchase Address" field is divided into three additional columns by this code: "Street", "City", and "State Zip".
Then it removes the superfluous "Street" and "State Zip" columns, leaving only the "City" column. This change is significant because it enables a more granular study of sales data by geographic region. By determining which locations have the largest sales demand, the corporation may concentrate on increasing their market share in certain places while also strengthening marketing and distribution efforts in areas that require more attention.

The resulting DataFrame displays the new "City" column alongside the original DataFrame's remaining columns. This data modification is useful in data analysis because it enables a more extensive study of sales patterns and trends by geographic area. The new column can be utilized for additional analysis, such as determining which cities or areas have the highest sales performance or which marketing efforts are most effective in particular locations. Overall, this adjustment increases the data's quality and usability for analysis.

In practice, this change can help businesses optimize their sales strategy and allocate resources more effectively. For example, if a corporation determines that a specific city has a strong demand for a specific product, it can boost its inventory levels in that region to satisfy the demand. Furthermore, by analysing sales patterns by city, businesses can better tailor their marketing efforts to specific regions, resulting in increased sales and revenue.

In the future, this adjustment may assist businesses in adapting to changing market conditions and consumer behavior. Companies can spot trends and alter their plans by examining sales patterns by geographic region over time. Changes in inventory levels, marketing campaigns, and distribution networks are examples of this.
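The City extraction described in this section can be sketched as below. The two addresses are made-up examples that follow the "street, city, state zip" layout implied by the text. The sketch assumes every address has exactly two ", " separators; a real dataset may need a check for rows that do not.

```python
import pandas as pd

# Made-up addresses in the "street, city, state zip" layout.
df = pd.DataFrame({
    "Purchase Address": [
        "1 Example St, Dallas, TX 75001",
        "2 Sample Ave, Boston, MA 02215",
    ],
})

# Split on comma + space; expand=True returns one column per piece.
df[["Street", "City", "State Zip"]] = (
    df["Purchase Address"].str.split(", ", expand=True)
)

# Keep only the city, as described in the text.
df = df.drop(columns=["Street", "State Zip"])

print(df)
```

With the "City" column in place, a call such as df.groupby("City") can aggregate any numeric column by location.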
Overall, the ability to examine sales data by city or area can provide significant insights that help businesses improve their sales performance and remain competitive in their particular marketplaces.

6. Data Analysis

f. Showing summary statistics of

Figure 14: Code for Summary Statistic

The given code computes and shows several sales-related statistics: the total number of products sold, the mean and standard deviation of product prices, the skewness of product prices, and the kurtosis of quantity ordered. The data has already been cleaned and pre-processed, including the removal of null values and the conversion of data types to the appropriate format.

Result:

Figure 15: Showing summary Statistic

The outcome is positive since it provides useful insights into the sales data, such as product demand and price volatility. These statistics can be used to inform pricing and inventory plans, as well as to uncover flaws in the sales process. In practice, organizations can use this outcome to evaluate their sales success and make data-driven decisions. In the future, the outcome can be used to compare sales data from different time periods or to observe changes in sales trends over time. By evaluating these patterns, businesses can make informed judgments about their operations and establish plans to remain competitive in the market. Overall, the output contains useful information that can assist firms in optimizing their sales operations and improving their bottom line.

Figure 16: Describing Price Each

The describe() method returns a summary of statistical measures for the 'Price Each' column of the DataFrame df. It contains the data's count, mean, standard deviation, minimum, maximum, and quartile values. This method is helpful for quickly analyzing data and identifying outliers or extreme values.
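The statistics named in this section can be reproduced with standard pandas methods. The sketch below uses a few made-up rows and assumes the columns are already numeric; the variable names are illustrative and are not taken from Figure 14.

```python
import pandas as pd

# Hypothetical, already-cleaned sample data.
df = pd.DataFrame({
    "Quantity Ordered": [1, 2, 1, 1, 3],
    "Price Each": [11.95, 99.99, 600.0, 11.99, 2.99],
})

total_units = df["Quantity Ordered"].sum()   # total number of products sold
mean_price = df["Price Each"].mean()         # mean product price
std_price = df["Price Each"].std()           # sample standard deviation
price_skew = df["Price Each"].skew()         # asymmetry of the price distribution
qty_kurt = df["Quantity Ordered"].kurt()     # tail heaviness of quantity ordered

# count, mean, std, min, quartiles and max in a single call
summary = df["Price Each"].describe()

print(total_units, round(mean_price, 2), round(std_price, 2))
print(summary)
```

In this sample the single expensive item pulls the mean well above the median, which is the kind of extreme value that describe() and the skewness make visible.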
The summary statistics help in understanding the data's central tendency, dispersion, and range. The resulting figures can assist firms in making informed decisions about pricing strategies and product offerings.

g. Calculating correlation of all variable

Figure 17: Calculating correlation of all variable

The provided code computes the correlation matrix for the numerical columns of the DataFrame df and then displays it as a heatmap using the seaborn library. The correlation matrix contains the pairwise correlation coefficients between the columns. The correlation coefficient, which takes values between -1 and 1, measures the strength of the linear relationship between two variables.

Figure 18: Showing corr of all variable

The correlation matrix displays the correlation coefficient between each pair of variables in the dataset. A value of 1 shows a perfect positive correlation, a value of 0 indicates no linear correlation, and a value of -1 indicates a perfect negative correlation. A positive correlation between two variables indicates a tendency for them to rise or fall together, whilst a negative correlation indicates a tendency for them to move in opposite directions (BYJUS 2020).

The modified code uses the 'viridis' color scheme, which runs from dark purple and blue through green to yellow. This color scheme is designed to be perceptually uniform: its colors are easily distinguished by the human eye, and the perceived difference between neighboring hues is approximately constant. Because of this, it is a good option for displaying numerical data with a continuous range of values.

This result is obtained by calculating the pairwise correlation coefficients between all variables in the dataset.
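As a minimal sketch of the correlation step, the following computes the pairwise coefficients for the numeric columns only. The sample values are invented so that the coefficients are easy to check by eye, and the exact code in Figure 17 may differ.

```python
import pandas as pd

# Hypothetical sample: price rises with quantity, month falls with quantity.
df = pd.DataFrame({
    "Quantity Ordered": [1, 2, 3, 4],
    "Price Each": [10.0, 20.0, 30.0, 40.0],
    "Month": [12, 9, 6, 3],
    "Product": ["A", "B", "C", "D"],  # non-numeric, so excluded below
})

# Keep only the numeric columns, then compute Pearson correlation coefficients.
corr = df.select_dtypes(include="number").corr()

print(corr)
```

If seaborn is installed, a call such as sns.heatmap(corr, annot=True, cmap="viridis") draws this matrix as a heatmap in the colour scheme discussed above.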
The correlation matrix is useful for discovering relationships between variables, which can help us better comprehend the data and inform decision-making. It can, for example, help us identify variables that are highly correlated with one another, so that one of them can be excluded from the study to avoid multicollinearity.

The correlation matrix can be used in the future to find trends or patterns in data over time. For example, we may discover that the correlation between variables changes over time as market conditions or consumer behavior change. This can assist organizations in adapting their strategy to changing conditions and improving their performance. The correlation matrix can also be used to examine the performance of various products or marketplaces in order to uncover characteristics that contribute to their success or failure.

7. Data Exploration

Data exploration is the process of analysing and understanding a dataset to uncover patterns, relationships, and insights that can help inform further analysis and decision-making. This typically involves using various techniques to visualize and summarize the data, such as creating charts and graphs, calculating descriptive statistics, and identifying outliers or missing values (Tibco 2023).

h. Showing which month had the best sales in Bar graph

Figure 19: Code for best sale month in Bar Graph

The code first converts the "Order Date" column of the pandas DataFrame into a datetime format using the pd.to_datetime() function. It then extracts the month from the "Order Date" column using the dt.month attribute and stores the extracted values in a new "Month" column.

Next, the code calculates the total sales by month by grouping the DataFrame by the "Month" column and summing the "Sales" column using the groupby() and sum() functions, respectively.
The resulting sales values are stored in the "monthly_sales" variable. The code then determines the month with the highest sales: the max() function gives the maximum sales value, and the idxmax() method gives the corresponding month number, which is converted to a month name using calendar.month_name.

To visualize the sales by month, the code creates a vertical bar graph using the plt.bar() function from the Matplotlib library. The color of the bar for the month with the highest sales is set to blue, while the rest of the bars are set to black. The x-axis ticks are set to display month names using the plt.xticks() function. Finally, the code sets the axis labels, title, and background color using various Matplotlib functions and displays the plot using the plt.show() function. The month with the highest sales and the corresponding total sales value are then printed using the print() function.

The code generates a vertical bar graph showing the total sales by month for the ABC company, with the month with the highest sales highlighted in dark blue. To produce it, the code extracts the month from the 'Order Date' column, groups the data by month, and calculates the total sales for each month. The modified data gives a more meaningful visualization of sales performance over time.

Figure 20: Bar Graph of Total Sales by Month

The outcome is favourable because it provides a clear depiction of sales performance by month, allowing the ABC company to identify the more successful months and compare them with the less successful ones. This data can be used to improve marketing campaigns and make informed inventory management decisions. In practice, the significance of this result is that it helps organizations identify trends and patterns in sales data that can be used to make informed future decisions.
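The monthly aggregation described above can be sketched as follows. Two things here are assumptions not stated in this section: the date format string, and the definition of "Sales" as quantity multiplied by unit price. The sample rows are invented.

```python
import calendar
import pandas as pd

# Hypothetical sample orders; the date format is an assumption.
df = pd.DataFrame({
    "Order Date": ["01/15/19 10:30", "12/24/19 18:05",
                   "12/26/19 09:12", "10/03/19 14:45"],
    "Quantity Ordered": [1, 2, 1, 3],
    "Price Each": [100.0, 50.0, 200.0, 20.0],
})

# Convert to datetime and extract the month number (1-12).
df["Order Date"] = pd.to_datetime(df["Order Date"], format="%m/%d/%y %H:%M")
df["Month"] = df["Order Date"].dt.month

# Assumed definition of the "Sales" column: quantity times unit price.
df["Sales"] = df["Quantity Ordered"] * df["Price Each"]

# Total sales per month, then the month with the largest total.
monthly_sales = df.groupby("Month")["Sales"].sum()
best_month = int(monthly_sales.idxmax())
best_month_name = calendar.month_name[best_month]

print(best_month_name, monthly_sales.max())
```

The grouped series can be passed directly to plt.bar(monthly_sales.index, monthly_sales.values) to draw the bar graph, with the bar at best_month given a different colour.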
In the future, this monthly sales result can be used to compare sales performance between months, find trends or patterns in sales data, and make informed decisions about marketing tactics and inventory management. This data can also be used to improve pricing strategies and supply chain management, ensuring that the company meets client demand for popular products. Overall, this finding will help the ABC company make informed business decisions.

i. Showing which city had Highest product sold

Figure 21: Code for City with Highest product sold

In the first part of the code, the total product sales are calculated for each city by grouping the DataFrame by city and using the sum() function to get the total quantity of products sold in each city.

Next, a bar graph is created to visualize the total product sales by city. The x-axis represents the cities and the y-axis represents the total product sales. The colour of the bars is set to black. The x-ticks are set to the city names and rotated vertically for better readability. The x-label is set to "City" and the y-label is set to "Total Product Sales". The title of the graph is set to "Total Product Sales by City". The background colour of the graph is set to an off-white colour to make it more visually appealing.

The city with the highest product sales is highlighted by adding a red bar to the graph and labelling it with the total number of units sold. The city name is also printed in red text to make it stand out.

Figure 22: Bar Graph of Total Product Sales by City

The outcome displays the city with the highest product sales and its total units sold. This helps firms better understand their sales performance, identify growth prospects, and optimize their marketing strategy. It is useful in practice because it guides resource allocation decisions.
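A minimal sketch of the per-city aggregation is shown below. The sample rows are invented, and the variable names are illustrative rather than taken from Figure 21.

```python
import pandas as pd

# Hypothetical sample with the "City" column already extracted.
df = pd.DataFrame({
    "City": ["Dallas", "Boston", "Dallas", "Seattle"],
    "Quantity Ordered": [2, 1, 3, 4],
})

# Total units sold per city.
city_units = df.groupby("City")["Quantity Ordered"].sum()

# City with the highest total and the total itself.
top_city = city_units.idxmax()
top_units = city_units.max()

print(top_city, top_units)
```

Because city_units is indexed by city name, it can be plotted directly as a bar graph, and top_city identifies which bar to colour differently.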
The city-level result can also be used in the future to evaluate sales performance over time, analyze trends, optimize pricing strategies, and improve supply chain management.

j. Illustrating most sold item in Bar graph

Figure 23: Code For most sold item in Overall

This code groups the DataFrame by product and computes the total quantity ordered for each product. It then generates a sorted product list and reorders the quantity ordered list to match it. A horizontal bar chart is generated, with light blue bars representing non-maximum values and a dark blue bar representing the best-selling product. The quantity ordered value is added inside the bar in blue text. The y-axis ticks and label have been set to display the product names, while the x-axis label, chart title, and background colour have been left unchanged.

Figure 24: Bar Graph of Top Quantity Ordered

This code generates a horizontal bar chart displaying the total quantity ordered for each product, with a differently coloured bar highlighting the best-selling product. The outcome helps organizations identify their top-selling products and make informed decisions about inventory management and marketing tactics. This data is useful for optimizing pricing strategies, finding potential new items, and enhancing supply chain management. This result can be used in the future to compare the sales performance of different items over time and to find trends or patterns in sales data.

k. Showing histogram plot of Order Months

Figure 25: Histogram plot of Order Months

The code generates a histogram that shows the distribution of orders over the 12 months of the year. The data is modified by converting the 'Month' column from the object data type to the integer data type. The histogram shows that the highest number of orders was placed in December, followed by October and November.
This result can help businesses understand seasonal trends in order placement and plan their inventory management and marketing strategies accordingly.

Figure 26: Showing Histogram Plot

A histogram is a type of chart that shows the distribution of numerical data. It divides the data into intervals called bins and counts how many values fall into each bin. The height of each bar in the histogram represents the frequency or relative frequency of the bin. A histogram can help you see the shape, centre, spread and outliers of the data (BYJUS, n.d.).

In the given code, the histogram is generated from the data after it has been grouped by month and counted to determine the number of orders placed in each month. The resulting histogram shows the distribution of orders over the 12 months of the year, making it easy to see which months have higher or lower order volumes. By analyzing this data, businesses can identify trends and patterns in customer behavior and adjust their marketing and sales strategies accordingly.

The peak in the histogram represents the month in which the most orders were placed, while the troughs indicate the months in which the fewest orders were placed. In this case, the histogram shows that most orders were placed in December (the holiday season), followed by October and November, while the fewest orders were placed in January and February. Businesses can use this information to optimize inventory management and marketing strategies for different months of the year.

Overall, the result is good because it provides insights into customer behaviour that can be used to improve business operations. By using this data to inform marketing and sales strategies, businesses can increase their revenue and improve customer satisfaction.
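The binning behind the histogram can be sketched with NumPy as follows. The month values are invented, and the choice of one bin per month with edges at 0.5, 1.5, ..., 12.5 is an assumption made for this illustration, not a detail taken from Figure 25.

```python
import numpy as np
import pandas as pd

# Hypothetical "Month" values stored as strings (object dtype), as before conversion.
months = pd.Series(["1", "2", "10", "11", "12", "12", "12", "10"], name="Month")

# Convert from object to integer so the values can be binned numerically.
months = months.astype(int)

# One bin per month: 13 edges at 0.5, 1.5, ..., 12.5 give 12 bins.
edges = np.arange(0.5, 13.5, 1.0)
counts, _ = np.histogram(months, bins=edges)

# Label each bin with its month number and find the peak.
orders_per_month = pd.Series(counts, index=range(1, 13), name="Orders")
peak_month = int(orders_per_month.idxmax())

print(orders_per_month)
print("Peak month:", peak_month)
```

Passing the same values and edges to plt.hist(months, bins=edges) draws these counts as bars, so the tallest bar corresponds to peak_month.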
In practice, businesses can use the histogram result to identify the months with the highest demand for their products and ensure that they have enough stock to meet customer demand. This result can be used in the future to analyse the distribution of orders over different years and detect trends or patterns in the data. It can also be used to inform forecasting models and estimate future sales, allowing firms to plan ahead and remain competitive. Overall, this outcome is a useful tool for firms trying to streamline their operations and increase their bottom line.

8. Conclusion

Having completed my coursework, I would now like to reflect on my experience of the project. While working on it, I discovered many new concepts, and it was a tremendous learning experience. I learned about data cleaning, pre-processing, merging, and analysis, and how to apply these skills to real-world data sets. This experience has been a valuable opportunity for me to apply the knowledge and skills I have acquired in a practical setting. Some parts of the coursework were difficult to understand, which I resolved by conducting online research and consulting my teacher.

The dataset I worked on for this project was related to sales analysis for the ABC company for the year 2019. It contained information on attributes such as Order ID, Product, and Quantity Ordered. My primary objective was to prepare the data for further data mining and analysis.

In addition to gaining technical skills, I also learned the importance of data analysis in business. Data analysis plays a critical role in decision-making processes, and businesses rely on it to optimize their operations, identify areas for growth, and gain a competitive edge. By analysing and interpreting data, businesses can make informed decisions that can lead to greater success and profitability.
Overall, the coursework has been a valuable learning experience for me. It has helped me develop my data science skills and has given me a better understanding of the importance of data analysis in business. I look forward to applying these skills to future projects and continuing to learn more about the exciting field of data science. I would like to thank my teachers for entrusting us with this project. Every effort I made to finish this coursework was pleasurable. I hope that my report will be engaging and perhaps even informative.

References

❖ What is Data? Definition, Types, Computer, Information - javatpoint. (n.d.). www.javatpoint.com. https://www.javatpoint.com/data
❖ What is smart data? | Definition from TechTarget. (2016, December 1). WhatIs.com. https://www.techtarget.com/whatis/definition/smart-data
❖ What Is Data Processing: Cycle, Types, Methods, Steps and Examples | Simplilearn. (2020, October 21). Simplilearn.com. https://www.simplilearn.com/what-is-data-processing-article
❖ Histogram - Definition, Types, Graph, and Examples. (n.d.). BYJUS. https://byjus.com/maths/histogram/
❖ Bar Graph - Definition, Types, Uses, How to Draw Bar graph, Examples. (2020, August 9). BYJUS. https://byjus.com/maths/bar-graph/
❖ Correlation - Correlation Coefficient, Types, Formulas & Example. (2019, November 24). BYJUS. https://byjus.com/maths/correlation/
❖ Statistics Definitions, Types, Formulas & Applications. (2019, December 26). BYJUS. https://byjus.com/maths/statistics/
❖ User Guide — pandas 2.0.1 documentation. (n.d.). User Guide — Pandas 2.0.1 Documentation. https://pandas.pydata.org/docs/user_guide/index.html#user-guide
❖ The Python Tutorial. (n.d.). Python Documentation. https://docs.python.org/3/tutorial/index.html
❖ Berthold, M. R., Borgelt, C., Höppner, F., & Klawonn, F. (n.d.). Data Understanding. Data Understanding | SpringerLink. https://doi.org/10.1007/978-1-84882-260-3_4
❖ Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby). (2018, October 25). YouTube. https://www.youtube.com/watch?v=vmEHCJofslg
❖ Data Analysis with Python Course - Numpy, Pandas, Data Visualization. (2021, February 18). YouTube. https://www.youtube.com/watch?v=GPVsHOlRBBI
❖ Quick start guide — Matplotlib 3.7.1 documentation. (n.d.). Quick Start Guide — Matplotlib 3.7.1 Documentation. https://matplotlib.org/stable/tutorials/introductory/quick_start.html
❖ Data Visualization: Definition, Benefits, and Examples. (n.d.). Coursera. https://www.coursera.org/articles/data-visualization
❖ How to calculate summary statistics — pandas 2.0.1 documentation. (n.d.). How to Calculate Summary Statistics — Pandas 2.0.1 Documentation. https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html
❖ Pandas - Data Correlations. (n.d.). Pandas - Data Correlations. https://www.w3schools.com/python/pandas/pandas_correlations.asp
❖ What is Data Exploration? | TIBCO Software