CURE Project Deliverable 1 Sep 17

ENDG 319 - Fall 23
CURE Project – Deliverable 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Descriptive Statistics, Python, and Common Machine Learning Datasets
Instructions:
This deliverable is worth 3% of your overall grade.
The entire deliverable requires Python and MS Excel, and all solutions involve pasting
screenshots of your work into a Word file. Please keep the original Word file. Once done,
convert the Word file to PDF with the file name: Last Name_First Name_UCID_CURE
Deliverable 1.
Submit the PDF file to Assessment > Dropbox > ‘CURE Project Deliverable 1’ by Sep 29, 2023,
11:59 pm (MT).
You must submit this deliverable on time to be able to submit the upcoming deliverables.
This is not a group project. Each student must submit their own report.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Research skill:
(i) Looking at data critically (Who collected the data? why? what was the context? Are there missing
data? What is the usefulness of the dataset?)
(ii) Data preparation for developing machine learning algorithms with Python; graphical and
numerical diagnostics and analysis of data
Relevant course content: Descriptive Statistics (Ch 1)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
ENDG 319 (Fall 23)
CURE Project Deliverable 1
1|Page
TASK 1. GAINING A BASIC UNDERSTANDING OF PANDAS DATAFRAME
Storing tabular data as pandas dataframe:
(a) Data preprocessing is one of the steps in machine learning. The pandas library in Python is
well suited to tabular data. Create a variable ‘emissions’ and assign to it the following data
(Table 1) as a pandas DataFrame. Create an Excel file ‘emissions_from_pandas.xlsx’ from the
‘emissions’ variable using Python.
Table 1. Particulate matter (PM) emissions (in g/gal) for 15 vehicles driven at low altitude and
another 15 vehicles driven at high altitude.
Low Altitude    High Altitude
1.50            7.59
1.48            2.06
2.98            8.86
1.40            8.67
3.12            5.61
0.25            6.28
6.73            4.04
5.30            4.40
9.30            9.52
6.96            1.50
7.21            6.07
0.87            17.11
1.06            3.57
7.39            2.68
1.37            6.46
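One way to build the DataFrame in part (a) is from a dictionary of lists, one list per column. The sketch below is illustrative only: the column labels come from Table 1, while the helper names `low` and `high` are assumptions and not part of the assignment.

```python
import pandas as pd

# Values copied from Table 1 (PM emissions in g/gal).
low = [1.50, 1.48, 2.98, 1.40, 3.12, 0.25, 6.73, 5.30,
       9.30, 6.96, 7.21, 0.87, 1.06, 7.39, 1.37]
high = [7.59, 2.06, 8.86, 8.67, 5.61, 6.28, 4.04, 4.40,
        9.52, 1.50, 6.07, 17.11, 3.57, 2.68, 6.46]

# Each dictionary key becomes a column label; each list becomes a column.
emissions = pd.DataFrame({"Low Altitude": low, "High Altitude": high})

print(emissions.shape)   # (number of rows, number of columns)
print(emissions.head())  # first five rows
```

Both lists must have the same length, otherwise pandas raises a ValueError when constructing the DataFrame.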
iloc[] method:
(b) Using the .iloc[] method, we can access any part of the dataframe. Run the following commands
and show the outputs:
emissions.head()
emissions.iloc[0,0]
emissions.iloc[1,1]
emissions.iloc[0:2,0:2]
emissions.iloc[2:4,:]
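The .iloc[] indexer selects by integer position, rows first and columns second, and a slice excludes its end position. A minimal sketch of the commands above, assuming for brevity a DataFrame that holds only the first five rows of Table 1 (the variable names on the left are hypothetical):

```python
import pandas as pd

# First five rows of Table 1 only, to keep the example short.
emissions = pd.DataFrame({
    "Low Altitude": [1.50, 1.48, 2.98, 1.40, 3.12],
    "High Altitude": [7.59, 2.06, 8.86, 8.67, 5.61],
})

first_cell = emissions.iloc[0, 0]      # row 0, column 0: a single value
second_cell = emissions.iloc[1, 1]     # row 1, column 1: a single value
top_left = emissions.iloc[0:2, 0:2]    # rows 0 and 1, columns 0 and 1
middle_rows = emissions.iloc[2:4, :]   # rows 2 and 3, every column

print(first_cell, second_cell)
print(top_left)
print(middle_rows)
```

A single pair of integers returns a scalar, while any slice returns a smaller DataFrame that keeps the original row labels.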
File conversion:
(c) Create an Excel file "emissions_from_pandas.xlsx" from the emissions variable using the
.to_excel method. Paste a screenshot of the input command.
(d) Create an MS Excel file ‘emissions_excel.xlsx’ containing the data in Table 1 above, including
the column headers, and save it on your computer. Create a variable ‘emissions_from_excel’ from
the ‘emissions_excel.xlsx’ file using the pd.read_excel function. Show the first five rows using
.head(). Paste a screenshot with the input commands used.
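The two file-conversion steps can be sketched as a round trip: write a DataFrame with .to_excel, then read it back with pd.read_excel. This is only an illustration under stated assumptions: it uses the first three rows of Table 1, writes to a temporary folder rather than your working directory, and relies on the optional openpyxl package, which pandas needs for .xlsx files. The function name `excel_round_trip` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# First three rows of Table 1 only, to keep the example short.
emissions = pd.DataFrame({
    "Low Altitude": [1.50, 1.48, 2.98],
    "High Altitude": [7.59, 2.06, 8.86],
})


def excel_round_trip(df, path):
    """Write df to an .xlsx file, then load that file back into a new DataFrame."""
    df.to_excel(path, index=False)  # index=False keeps the row labels out of the sheet
    return pd.read_excel(path)


try:
    with tempfile.TemporaryDirectory() as folder:
        xlsx_path = Path(folder) / "emissions_from_pandas.xlsx"
        emissions_from_excel = excel_round_trip(emissions, xlsx_path)
    print(emissions_from_excel.head())
except ImportError:
    # Raised when no Excel engine such as openpyxl is installed.
    print("An Excel engine (for example openpyxl) is required for .xlsx files.")
```

Without index=False, the row labels are written as an extra first column, which then reappears as an unnamed column when the file is read back.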
TASK 2. NUMERICAL AND GRAPHICAL SUMMARY OF DATASETS USING
PYTHON AND PANDAS
Report the following statistics for the emissions data (in Task 1) both at low and high altitude:
sample size or count, sample mean, sample standard deviation (std), minimum, maximum,
median, first and third quartile.
(Note: As mentioned in the textbook, different software packages calculate quartiles in slightly
different ways, so do not worry if your quartile values are slightly different from those of
someone who uses a different method.)
(a) Use Python (any library, but pandas will be fast). You can paste a screenshot of both the input
and the output (the output will look like the following).
(b) Use the pandas library in python to generate a comparative boxplot of the emissions dataset.
Interpret the boxplot (max 50 words)
(c) Use Excel to compute the statistics as discussed in part (a) and draw the comparative boxplot
mentioned in part (b). The screenshot with your work may look like the following for the
summary statistics part.
[Example screenshot: Excel worksheet with the Table 1 data in two columns, Low Altitude and High Altitude.]
[Example summary-statistics layout: columns Low Altitude and High Altitude; rows count, mean, std, min, 25%, 50%, 75%, max.]
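Parts (a) and (b) can be approached with pandas alone. The sketch below is a hedged illustration, not a model answer: it calls .describe() for the eight statistics, shows how the quartiles of one column shift with the `interpolation` option of .quantile (one reason packages disagree), and applies the common 1.5 × IQR rule to flag points that a boxplot would typically draw as outliers. The names `quartiles_by_method` and `flagged` are assumptions.

```python
import pandas as pd

# Values copied from Table 1 (PM emissions in g/gal).
emissions = pd.DataFrame({
    "Low Altitude": [1.50, 1.48, 2.98, 1.40, 3.12, 0.25, 6.73, 5.30,
                     9.30, 6.96, 7.21, 0.87, 1.06, 7.39, 1.37],
    "High Altitude": [7.59, 2.06, 8.86, 8.67, 5.61, 6.28, 4.04, 4.40,
                      9.52, 1.50, 6.07, 17.11, 3.57, 2.68, 6.46],
})

# count, mean, std, min, 25%, 50%, 75%, max for every numeric column.
summary = emissions.describe()
print(summary.round(2))

# Quartiles of one column under several interpolation rules.
quartiles_by_method = {
    method: emissions["Low Altitude"]
    .quantile([0.25, 0.75], interpolation=method)
    .tolist()
    for method in ("linear", "lower", "higher", "midpoint")
}
print(quartiles_by_method)

# 1.5 * IQR rule: values beyond these fences are usually plotted as outliers.
q1 = emissions.quantile(0.25)
q3 = emissions.quantile(0.75)
iqr = q3 - q1
outside = (emissions < q1 - 1.5 * iqr) | (emissions > q3 + 1.5 * iqr)
flagged = {col: emissions.loc[outside[col], col].tolist() for col in emissions.columns}
print(flagged)
```

If matplotlib is installed, `emissions.boxplot()` draws the comparative boxplot directly from the same DataFrame.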
TASK 3. IMPORTANCE OF GRAPHS
Observe the following two bivariate datasets.
Bivariate dataset I
Bivariate dataset II
(a) Define a variable ‘dataset1’ of type dataframe using the bivariate dataset I given above. Find
the summary statistics using dataset1.describe().
[Hint: You can create the dataframe by any of the techniques mentioned in Task 1. Alternatively,
load the ‘anscombe’ dataset from seaborn, then define dataset1 using the .iloc method.]
(b) Define a variable ‘dataset2’ of type dataframe using the bivariate dataset II given above. Find
the summary statistics using dataset2.describe().
(c) Do you see any difference between the statistics that summarize the y variables in the two
datasets?
(d) Draw a diagram showing two scatterplots on the same axes to display dataset1 and dataset2
above. Do you see any difference between the two datasets? Comment using fewer than 50 words.
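The workflow in parts (a) to (d) can be sketched as follows. This is an illustration under a loud assumption: the tables for bivariate datasets I and II are not reproduced in this text, so the sketch uses the published values of sets I and II of Anscombe's quartet, arranged in the long layout (dataset, x, y) that the hint suggests. Check these values against the tables in your copy of the handout. The names `anscombe_like` and `scatter_both` are hypothetical.

```python
import pandas as pd

# ASSUMPTION: published Anscombe quartet values for sets I and II.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_set1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_set2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Long layout: one row per point, with a label column in front.
anscombe_like = pd.DataFrame({
    "dataset": ["I"] * 11 + ["II"] * 11,
    "x": x + x,
    "y": y_set1 + y_set2,
})

# Rows 0-10 hold set I and rows 11-21 hold set II; columns 1-2 are x and y.
dataset1 = anscombe_like.iloc[0:11, 1:3].reset_index(drop=True)
dataset2 = anscombe_like.iloc[11:22, 1:3].reset_index(drop=True)

# Put the summaries of the two y variables side by side.
comparison = pd.concat(
    {"dataset1": dataset1["y"].describe(), "dataset2": dataset2["y"].describe()},
    axis=1,
)
print(comparison.round(2))


def scatter_both(ax, first, second):
    """Draw both datasets on one matplotlib Axes object supplied by the caller."""
    ax.scatter(first["x"], first["y"], marker="o", label="dataset1")
    ax.scatter(second["x"], second["y"], marker="x", label="dataset2")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    return ax
```

With matplotlib installed, `fig, ax = plt.subplots()` followed by `scatter_both(ax, dataset1, dataset2)` produces the combined scatterplot for part (d).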
TASK 4. EXPLORING AN EXISTING DATASET IN THE PYTHON LIBRARY
SKLEARN.
There are several Python packages that contain datasets often used in the study of machine
learning. We saw two such packages: sklearn and seaborn. We used the following commands to
load the iris dataset, the keys, and the description of the dataset from the sklearn package.
A list of some small standard datasets available in sklearn is given below, with the functions
to load them in Python.
For more, please see: https://scikit-learn.org/stable/datasets/toy_dataset.html
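The commands referred to above are not reproduced in this text. A minimal sketch of loading the iris dataset, listing its keys, and reading its description, assuming scikit-learn is installed, might look like this:

```python
from sklearn.datasets import load_iris

# load_iris returns a dictionary-like Bunch object.
iris = load_iris()

print(list(iris.keys()))   # names of the fields stored in the Bunch
print(iris.data.shape)     # (number of instances, number of attributes)
print(iris.DESCR[:600])    # beginning of the plain-text description
```

The DESCR field is a long string, so the example prints only its first 600 characters; remove the slice to read the full description.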
(a) Access and explore the ‘digits’ dataset, and report the following:
I. Number of Instances
II. Number of Attributes
III. Attribute Information
IV. Missing Attribute Values
V. Creator
VI. Date
VII. Class
(b) Access and explore the ‘breast_cancer’ dataset, and report the following:
I. Number of Instances
II. Number of Attributes
III. Attribute Information
IV. Missing Attribute Values
V. Creator
VI. Date
VII. Class
Note: You can take a screenshot directly and paste it as your answer.
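Some of the requested items can also be checked programmatically rather than read from the description. The sketch below, which assumes scikit-learn and NumPy are installed, counts instances, attributes, classes, and missing values for both datasets; the helper name `summarize` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, load_digits


def summarize(bunch):
    """Count instances, attributes, classes, and NaN entries in a loaded dataset."""
    n_instances, n_attributes = bunch.data.shape
    return {
        "instances": n_instances,
        "attributes": n_attributes,
        "classes": len(np.unique(bunch.target)),
        "missing": int(np.isnan(bunch.data).sum()),
    }


loaders = {"digits": load_digits, "breast_cancer": load_breast_cancer}
summaries = {name: summarize(loader()) for name, loader in loaders.items()}

for name, info in summaries.items():
    print(name, info)
```

The creator, date, and attribute information are not stored as separate fields; they appear in the text of each dataset's DESCR attribute.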