Assignment 1 - Predictive Analytics

Assignment 1 – Commerce 3DA3
This homework assignment contains 6 questions. Your submission should be one Jupyter
Notebook file. The code in your Jupyter notebook should produce results that reflect the
items asked in the questions below. For full marks, avoid manual operations (such as
manual data entry, copy/paste, etc.) as much as possible. If something can be done using
code, you must use code to do it.
Submit the assignment by uploading the file(s) into the "Assignment 1" folder on Avenue to
Learn. You can find this folder under Assessments>Assignments on the course webpage.
Submitting work after the deadline may result in a deduction of points or a penalty. The
deadline for your submission is 11:59PM on Thursday, February 15.
Data: We are going to use the data stored in the dataset client_data.csv. This dataset
includes information on clients (Name, Surname, Email, CLIENT_ID, TRANSACTION_ID,
Country, Region) as well as on the employees associated with each client
(SALES_MANAGER, SALES_VICE_PRESIDENT), together with the number of years the
employees have been associated with the client, the sales of the current year and the
previous year on the same day, and the respective category (YEARS_AS_CLIENT,
CURRENT_YEAR_SALES, PREVIOUS_YEAR_SALES, CATEGORY).
Question 0
In the first cell of your Jupyter notebook, insert the following text as markdown. Add your
first name, last name, your student ID number, and the section number.
Question 1
Import the data from the "client_data.csv" dataset and assign it to a dataframe called df.
Explore the structure of this dataframe (number of rows, column types, etc.). Display the
first few rows of the dataframe. In a markdown cell, describe the dataset to the best of your
ability. Try to be as comprehensive as possible. To answer this question, you do not need to
run additional code. This description should be based solely on the procedures performed
in this question.
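The import-and-explore steps above can be sketched as follows. This is a minimal illustration, not a model answer: the small in-memory CSV is a hypothetical stand-in for client_data.csv (only a few of the real columns, with made-up values) so that the snippet runs on its own. With the real file, the read_csv call would take the file name instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for client_data.csv so the sketch runs on its own.
# With the real file: df = pd.read_csv("client_data.csv")
sample_csv = io.StringIO(
    "Name,Country,YEARS_AS_CLIENT,CURRENT_YEAR_SALES\n"
    "Ana,Spain,3,1200.5\n"
    "Ben,Canada,5,\n"
    "Chloe,France,1,830.0\n"
)
df = pd.read_csv(sample_csv)

n_rows, n_cols = df.shape  # number of rows and columns
print(n_rows, n_cols)
print(df.dtypes)           # data type of each column
df.info()                  # non-null counts per column and memory usage
print(df.head())           # first rows (five by default)
```

The output of shape, dtypes, info() and head() is usually enough material for the written description the question asks for.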
Question 2
Show how many variables in the dataframe have missing values. Then update the
dataframe by removing rows with missing values. Confirm that the dataframe now has no
missing values.
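One possible way to count the variables with missing values, drop the affected rows, and confirm the result is sketched below. The toy dataframe is an assumption made for illustration only; its values are not taken from client_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real df would come from client_data.csv.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", None, "Dee"],
    "Country": ["Spain", "Canada", "France", "Italy"],
    "CURRENT_YEAR_SALES": [1200.5, np.nan, 830.0, 410.0],
})

# Missing values per column, then the number of columns that have any.
missing_per_column = df.isna().sum()
n_columns_with_missing = int((missing_per_column > 0).sum())
print(missing_per_column)
print("Columns with missing values:", n_columns_with_missing)

# Update the dataframe by dropping every row that has a missing value.
df = df.dropna()

# Confirm that no missing values remain.
total_missing_after = int(df.isna().sum().sum())
print("Missing values after dropna:", total_missing_after)
```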
Question 3
Create a new dataframe df_EU, containing all the variables but only for customers whose
country is a member of the European Union. Reset the index of this dataframe. Update
the dataframe by removing the Country variable. Confirm that the variable has been
removed. Create a CSV file in your working directory, df_EU.csv, based on this dataframe.
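The filtering and export steps above might look like the sketch below. Two things here are assumptions: the toy dataframe, and the spelling of the country names in EU_MEMBERS (the 27 current member states written in English). The real Country column may use different spellings or codes, so the set should be checked against the unique values in the data.

```python
import pandas as pd

# The 27 EU member states; spellings are an assumption and may need
# adjusting to match the values actually found in the Country column.
EU_MEMBERS = {
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czech Republic",
    "Denmark", "Estonia", "Finland", "France", "Germany", "Greece",
    "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg",
    "Malta", "Netherlands", "Poland", "Portugal", "Romania", "Slovakia",
    "Slovenia", "Spain", "Sweden",
}

# Hypothetical toy data; the real df would come from client_data.csv.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Chloe", "Dee"],
    "Country": ["Spain", "Canada", "France", "Japan"],
    "CATEGORY": ["A", "B", "A", "C"],
})

# Keep all columns, but only rows whose country is in the EU set,
# then renumber the rows from 0.
df_EU = df[df["Country"].isin(EU_MEMBERS)].reset_index(drop=True)

# Remove the Country variable and confirm it is gone.
df_EU = df_EU.drop(columns="Country")
print("Country" in df_EU.columns)

# Write the result to the working directory without the index column.
df_EU.to_csv("df_EU.csv", index=False)
```

Passing drop=True to reset_index discards the old index instead of keeping it as an extra column.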
Question 4
Working with the dataframe df, create a bar chart based on the CATEGORY variable. Add
labels for the axes and a title. Change the color of the bars and modify the tick marks on the
horizontal axis to show all the unique categories. Create a pie chart based on the CATEGORY
variable. Add a legend. In a markdown cell, based on the charts, describe the distribution of
the CATEGORY variable with as much detail as possible.
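A minimal matplotlib sketch of the two charts follows. The category values, the bar color, and the axis wording are illustrative assumptions, not values from client_data.csv. In a notebook the figure is rendered at the end of the cell.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; the real df would come from client_data.csv.
df = pd.DataFrame({"CATEGORY": ["A", "B", "A", "C", "A", "B"]})

# Frequency of each category, ordered by category label.
counts = df["CATEGORY"].value_counts().sort_index()
labels = list(counts.index)
positions = list(range(len(counts)))

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: custom color, one tick per unique category, labels and title.
ax_bar.bar(positions, counts.values, color="teal")
ax_bar.set_xticks(positions)
ax_bar.set_xticklabels(labels)
ax_bar.set_xlabel("Category")
ax_bar.set_ylabel("Count")
ax_bar.set_title("Observations per category")

# Pie chart: wedge labels feed the legend.
ax_pie.pie(counts.values, labels=labels, autopct="%1.0f%%")
ax_pie.legend(title="Category", loc="center left", bbox_to_anchor=(1.0, 0.5))
ax_pie.set_title("Share of observations per category")

fig.tight_layout()
```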
Question 5
Working with the dataframe df, create a scatter plot that shows the relationship between
CURRENT_YEAR_SALES and PREVIOUS_YEAR_SALES variables. Indicate, if you were to add
a trend line, which polynomial degree would be appropriate for this association (you do not
need to add the trend line). Also, explain what this relationship implies about customers'
sales in the current and previous years.
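The scatter plot itself can be sketched as below. The sales figures are made up for illustration, so the shape of this toy cloud says nothing about the appropriate polynomial degree for the real data. That judgment has to come from the plot of client_data.csv.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; the real df would come from client_data.csv.
df = pd.DataFrame({
    "PREVIOUS_YEAR_SALES": [100.0, 250.0, 400.0, 550.0, 700.0],
    "CURRENT_YEAR_SALES": [120.0, 240.0, 430.0, 560.0, 690.0],
})

fig, ax = plt.subplots()

# One point per row: previous year on the x-axis, current year on the y-axis.
ax.scatter(df["PREVIOUS_YEAR_SALES"], df["CURRENT_YEAR_SALES"], alpha=0.7)
ax.set_xlabel("Previous year sales")
ax.set_ylabel("Current year sales")
ax.set_title("Current vs. previous year sales")
```

Which variable goes on which axis is a design choice. Placing the previous year on the horizontal axis reads naturally as "last year's sales versus this year's".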