
CSIS Lab Assignment: Data Preparation & Exploration

DOUGLAS COLLEGE – Summer 2022
CSIS 3290 – 001 – Lab 02
• Create a folder and rename it according to the folder structure and naming convention stated below
• All the files you are required to submit for the assignment should be placed inside this folder.
• You will lose points if you just cut and paste materials from class exercises (e.g., if I see the same
comments, variable names, etc. being used in your code).
• If cheating is determined (i.e., you shared your work with another student in the class), your work will
receive a mark of ZERO and you will face further consequences.
• Include all the files needed for the code to run properly without
producing any errors.
In this lab, we will practice how to prepare and explore the cleaned dataset. You need to study the textbook and demo
code, and do your own research, to make sure that you can perform all the tasks described below.
1. Create a Python notebook named Lab2_ABcXXXXX, where A is the first letter of your first name, Bc is
the first two letters of your last name, and XXXXX is the last five digits of your student ID.
2. Create a markdown cell at the top of the Jupyter notebook to state the lab, your name and student ID with the
correct heading.
3. For each of the following sections, you need to create a markdown heading cell followed by a few code cells to
complete the tasks. Please also put some comments in each code cell.
a. Load the Python libraries. Please load all the required Python libraries in this section.
b. Read the data. Please load the CSV file you created in Lab1 and have a peek at the data by using the
head() function. If you have a column without any name or with # as its name, it means that you did not save
the dataset with the index=False parameter when you did Lab1. You have to drop that column.
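As a minimal sketch of this step, the snippet below reads a small in-memory CSV and drops the leaked index column. The CSV text and its values are made up for illustration; in the lab you would pass the path of your own Lab1 file to read_csv() instead. pandas typically labels a header-less column "Unnamed: 0", which is what the filter below assumes.

```python
import io

import pandas as pd

# Toy CSV text standing in for the Lab1 file (hypothetical values).
# The empty first header cell mimics a file saved without index=False.
csv_text = ",make,price\n0,Ford,12000\n1,BMW,30000\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# Drop any column whose name starts with "Unnamed" (the leaked index).
unnamed_cols = [c for c in df.columns if str(c).startswith("Unnamed")]
df = df.drop(columns=unnamed_cols)
print(df.head())
```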
c. Drop unneeded columns. Columns model and trim have too many unique values. Their values depend on the
car model. We will assume that the car price may depend on the model, body type and other features. So,
columns model and trim should be dropped.
d. Reduce the number of unique car makes. Use unique() and value_counts() to find the distribution of unique car
makes in the dataset. We want to reduce the number of car makes in our dataframe. Looking at the distribution,
we decide to:
• Change the strings in the make column to lowercase letters. Replace any dash "-" with an underscore "_".
• Drop the rows whose car make is represented by fewer than 500 data points. Hint: from the value_counts()
result you can find the makes that have fewer than 500 rows, then use Boolean indexing to find the row indices and drop
them. You can also manually find the indices of the rows containing a less-represented make, e.g., maserati,
and drop those rows, and so forth.
• Combine the car makes that are represented by fewer than 1000 data points into a new make category:
"other".
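The three steps above can be sketched on a toy dataframe as follows. This is only an illustration under stated assumptions: the data is invented, and the thresholds DROP_BELOW and OTHER_BELOW are hypothetical small stand-ins for the 500 and 1000 used in the lab.

```python
import pandas as pd

# Toy data (hypothetical); the real dataset has far more rows per make.
df = pd.DataFrame(
    {"make": ["Ford"] * 5 + ["Mercedes-Benz"] * 3 + ["Maserati"] * 1}
)

# Step 1: lowercase the make strings and replace dashes with underscores.
df["make"] = df["make"].str.lower().str.replace("-", "_", regex=False)

# Hypothetical thresholds standing in for 500 and 1000.
DROP_BELOW = 2
OTHER_BELOW = 4

# Step 2: find the rare makes, get their row indices, and drop those rows.
counts = df["make"].value_counts()
rare_makes = counts[counts < DROP_BELOW].index
rows_to_drop = df[df["make"].isin(rare_makes)].index
df = df.drop(index=rows_to_drop)

# Step 3: relabel the remaining less-represented makes as "other".
counts = df["make"].value_counts()
minor_makes = counts[counts < OTHER_BELOW].index
df.loc[df["make"].isin(minor_makes), "make"] = "other"

print(df["make"].value_counts())
```

Recomputing value_counts() after the drop keeps the second threshold based on the rows that are actually left.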
e. Find the summary statistics of the dataset using describe(). Pass the following parameter to see
more detailed information: percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]
f. Produce the correlation matrix of the dataset. You will see that mmr has the highest correlation to price.
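A minimal sketch of the summary statistics and correlation steps, using a small invented dataframe (the numbers are hypothetical and chosen only so that mmr tracks price closely). Selecting the numeric columns first is an assumption made here so that corr() also works when text columns are still present.

```python
import pandas as pd

# Toy numeric data (hypothetical values).
df = pd.DataFrame(
    {
        "price": [5000, 7000, 9000, 12000, 20000],
        "mmr": [5200, 6900, 9100, 11800, 19500],
        "odometer": [120000, 90000, 70000, 40000, 10000],
    }
)

# Summary statistics with the extra percentiles requested in the lab.
summary = df.describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95])
print(summary)

# Correlation matrix over the numeric columns only.
corr = df.select_dtypes("number").corr()

# Correlation of every other feature with price, strongest first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```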
Copyright © 2021 Bambang A.B. Sarif
g. Univariate data analysis. We will analyze a few important features from the dataset. First of all, since we want
to predict price, we want to check that the price is approximately normally distributed.
• Produce the distribution plot and boxplot of the price feature. Notice that the price distribution has a long
tail on the right. We will cut some of these outliers. For now, we decide to keep the data points whose
price value is between the 5th percentile and the 95th percentile. Hint: Find the indices of the data points whose price
value is less than the 5th percentile (see the summary statistics) and drop them. Do the same with the data points
whose price is greater than the 95th percentile. After you drop those data points, plot another distribution plot
or boxplot. Notice that the distribution is much better now.
• Produce the distribution plot and boxplot of the odometer feature. For the odometer feature, we decide
that we will drop the data points whose odometer value is either the max (999999) or the min (1). Feel free to
practice on your own by dropping additional data points. However, please make sure that you keep
at least 80% of the rows of the original dataset.
• We will keep the condition and mmr features as they are. Feel free to practice analyzing the year feature.
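The trimming described above can be sketched as follows (plots omitted). The dataframe is synthetic and its values are hypothetical. The cut-offs are computed here with quantile(), which is one option; reading them off the describe() output, as the hint suggests, should give the same numbers.

```python
import pandas as pd

# Synthetic data (hypothetical): 100 prices, two bad odometer readings.
odometer = [50000] * 100
odometer[10] = 999999  # sentinel max value
odometer[20] = 1       # sentinel min value
df = pd.DataFrame(
    {"price": list(range(1000, 101000, 1000)), "odometer": odometer}
)
n_original = len(df)

# 5th and 95th percentile cut-offs for price.
low = df["price"].quantile(0.05)
high = df["price"].quantile(0.95)

# Find the indices of the rows outside the cut-offs and drop them.
too_cheap = df[df["price"] < low].index
too_expensive = df[df["price"] > high].index
df = df.drop(index=too_cheap.union(too_expensive))

# Drop the rows whose odometer is the max (999999) or the min (1).
bad_odometer = df[df["odometer"].isin([1, 999999])].index
df = df.drop(index=bad_odometer)

# Check that at least 80% of the original rows are kept.
kept_fraction = len(df) / n_original
print(f"kept {kept_fraction:.0%} of the rows")
```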
h. Multivariate data analysis. Create a scatter plot to compare mmr and price and notice that you can clearly see
a strong relationship between these two features.
• Create a scatter plot to compare odometer and price. Write down your observation.
• Create a scatter plot to compare year and price. Write down your observation.
• Create a scatter plot to compare condition and price. Write down your observation.
i. Produce the countplot or value_counts() plot of the following features: body, transmission, color, interior and
condition.
j. Data transformation: binning. Notice that the condition values range from 1.0 to 4.9. We will reduce these
to discrete values: 1, 2, 3, 4, 5. This can easily be done using rounding.
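A minimal sketch of binning by rounding, on invented condition values. One thing worth checking on your own data: pandas rounds exact halves (such as 2.5) to the nearest even integer, so values ending in .5 may not always round up.

```python
import pandas as pd

# Toy condition values (hypothetical).
df = pd.DataFrame({"condition": [1.0, 1.4, 2.6, 3.7, 4.9]})

# Round to the nearest whole number and store the result as integers.
df["condition"] = df["condition"].round().astype(int)

print(df["condition"].tolist())
```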
k. Data transformation: dummy features. Some features have categorical values: make, body, transmission,
color and interior. Out of these features, body and transmission do not have an "other" value. So, we will treat
them a bit differently.
• Create dummy values for the body feature with drop_first=True and prefix="body". Join the dummy with
the original dataframe.
• Create dummy values for the transmission feature with drop_first=True and prefix="trans". Join the
dummy with the original dataframe.
• Create dummy values for the make feature with prefix="make". Drop the make_other column from the
created dummy features. Then, join the dummy with the original dataframe.
• Create dummy values for the color feature with prefix="color". Drop the color_other column from the
created dummy features. Then, join the dummy with the original dataframe.
• Create dummy values for the interior feature with prefix="inter". Drop the inter_other column from the
created dummy features. Then, join the dummy with the original dataframe.
• Drop the following columns from the dataframe: body, transmission, make, color, interior
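The two dummy-feature patterns above can be sketched on a toy dataframe, shown here for body (drop_first=True) and make (drop the "other" column); the other features follow the same pattern. The data is invented, and passing dtype=int is an assumption made here so the dummy columns come out as 0/1 integers rather than booleans in recent pandas versions.

```python
import pandas as pd

# Toy data (hypothetical values).
df = pd.DataFrame(
    {
        "price": [10000, 12000, 9000, 15000],
        "body": ["sedan", "suv", "sedan", "coupe"],
        "make": ["ford", "other", "bmw", "ford"],
    }
)

# body has no "other" value: let drop_first remove the first category.
body_dummies = pd.get_dummies(
    df["body"], prefix="body", drop_first=True, dtype=int
)
df = df.join(body_dummies)

# make has an "other" value: keep all dummies, then drop make_other.
make_dummies = pd.get_dummies(df["make"], prefix="make", dtype=int)
make_dummies = make_dummies.drop(columns="make_other")
df = df.join(make_dummies)

# Drop the original categorical columns once the dummies are joined.
df = df.drop(columns=["body", "make"])

print(df)
```

In both cases one category per feature is left out, so a row of all zeros for that feature stands for the omitted category.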
l. Check the dataframe with info() to make sure that all columns have a numerical data type.
m. Use reset_index() to reset the index of the dataframe.
n. Save the dataframe into Lab02_prepared.csv. Make sure to use parameter index=False
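A minimal sketch of the last three steps on a toy dataframe. Two assumptions are made here: reset_index() is called with drop=True so that the old index is not added back as a new column, and the file is written to a temporary directory only to keep this example self-contained (in the lab you would save Lab02_prepared.csv in your submission folder).

```python
import os
import tempfile

import pandas as pd

# Toy prepared data (hypothetical) with gaps in the index after row drops.
df = pd.DataFrame({"price": [9000, 15000], "body_suv": [0, 1]}, index=[3, 7])

# Inspect the dtypes; every column should be numeric at this point.
df.info()
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

# Reset the index so that it runs 0, 1, 2, ... again.
df = df.reset_index(drop=True)

# Save without the index, then read the file back as a quick check.
out_path = os.path.join(tempfile.mkdtemp(), "Lab02_prepared.csv")
df.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(reloaded.head())
```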
Note on submission:
• Create a folder named Lab02_ABcXXXXX following the naming convention.
• Put your Jupyter notebook and the original and cleaned datasets in this folder.
• Zip the folder and submit it through Blackboard.
LAB/ASSIGNMENT PRE-SUBMISSION CHECKLIST
• Did you follow the naming convention for your files?!
• Did you follow the naming convention for your folder?!
• Does your submission work on another computer?!
• Double check **before** submitting
Copyright © 2022 Bambang A.B. Sarif and others. NOT FOR REDISTRIBUTION.
STUDENTS FOUND REDISTRIBUTING COURSE MATERIAL ARE IN VIOLATION OF ACADEMIC INTEGRITY
POLICIES AND MAY FACE DISCIPLINARY ACTION BY THE COLLEGE ADMINISTRATION