ENDG 319 - Fall 23 CURE Project – Deliverable 1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Descriptive Statistics, Python, and Common Machine Learning Datasets Instructions: This deliverable is worth 3% of your overall grade. The entire deliverable will require you to use Python and MS Excel and all solutions involve pasting screenshots of your work in a Word file. Please keep the original Word file. Once done, convert the Word file into pdf with the file name: Last Name_First Name_UCID_CURE Deliverable 1. Submit the pdf file to Assessment > Dropbox > ‘CURE Project Deliverable 1’ by Sep 29, 2023, 11:59 pm (MT). You must submit this deliverable on time to be able to submit the upcoming deliverables. This is not a group project. Each student needs to submit his/her own report. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Research skill: (i) Looking at data critically (Who collected the data? why? what was the context? Are there missing data? What is the usefulness of the dataset?) (ii) Data preparation to develop machine learning algorithm with python, graphical/numerical diagnostics/analysis of data Relevant course content: Descriptive Statistics (Ch 1) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ENDG 319 (Fall 23) CURE Project Deliverable 1 1|Page TASK 1. GAINING A BASIC UNDERSTANDING OF PANDAS DATAFRAME Storing tabular data as pandas dataframe: (a) Data preprocessing is one of the steps in machine learning. The pandas library in python is suitable to deal with tabular data. Create a variable ‘emissions’ and assign to it the following data (Table 1) as padas DataFrame. Create an excel file ‘emissions_from_pandas.xlsx’ from the ‘emissions’ variable using python. Table 1. Particulate matter (PM) emissions (in g/gal) for 15 vehicles driven at low altitude and another 15 vehicles driven at high altitude. Low Altitude 1.50 1.48 2.98 1.40 3.12 0.25 6.73 5.30 9.30 6.96 7.21 0.87 1.06 7.39 1.37 High Altitude 7.59 2.06 8.86 8.67 5.61 6.28 4.04 4.40 9.52 1.50 6.07 17.11 3.57 2.68 6.46 iloc[] method: (b) Using the .iloc[] method, we can access any part of the dataframe. Run the following commands and show the outputs: emissions.head() emissions.iloc[0,0] emissions.iloc[1,1] emissions.iloc[0:2,0:2] emissions.iloc[2:4,:] ENDG 319 (Fall 23) CURE Project Deliverable 1 2|Page File conversion: (c) Create an xl file "emissions_from_pandas.xlsx" from the emissions variable using the .to_excel method. Paste the screenshot of the input command. (d) Create an MS Excel file ‘emissions_excel.xlsx’ containing the data in Table 1 above with the column header and save it on your computer. Create a variable ‘emissions_from_excel’ from the ‘emissions_excel.xlsx’ file using pd_read function. Show the first five rows using .head(). Paste a screenshot with the input commands used. TASK 2. NUMERICAL AND GRAPHICAL SUMMARY OF DATASETS USING PYTHON AND PANDAS Report the following statistics for the emissions data (in Task 1) both at low and high altitude: sample size or count, sample mean, sample standard deviation (std), minimum, maximum, median, first and third quartile. (Note: As mentioned in the textbook, different software packages calculate quartiles in slightly different ways, so do not worry if your quartile values is slightly different from someone else who use a different method.) (a) Use python (any library, but pandas will be fast). You can past screenshot of both input and output (output will look like the following.) (b) Use the pandas library in python to generate a comparative boxplot of the emissions dataset. Interpret the boxplot (max 50 words) (c) Use Excel to compute the statistics as discussed in part (a) and draw the comparative boxplot mentioned in part (b). The screenshot with your work may look like the following for the summary statistics part. ENDG 319 (Fall 23) CURE Project Deliverable 1 3|Page Low High Altitude Altitude 1.50 7.59 1.48 2.06 2.98 8.86 1.40 8.67 3.12 5.61 0.25 6.28 6.73 4.04 5.30 4.40 9.30 9.52 6.96 1.50 7.21 6.07 0.87 17.11 1.06 3.57 7.39 2.68 1.37 6.46 ENDG 319 (Fall 23) Low Altitude High Altitude count mean std min 25% 50% 75% max CURE Project Deliverable 1 4|Page TASK 3. IMPORTANCE OF GRAPHS Observe the following two bivariate datasets. Bivariate dataset I Bivariate dataset II (a) Define a variable ‘dataset1’ of type dataframe using the bivariate dataset I given above. Find the summary statistics using dataset1.describe(). [Hints: You can create the dataframe by any one of the techniques mentioned in Task A. Or, find the dataset ‘anscombe’ from seaborn as shown below. Then define dataset1 using .iloc method. ] (b) Define a dataframe ‘dataset2’ of type dataframe using the bivariate dataset II given above. Find the summary statistics using dataset2.describe(). (c) Do you see any difference between the statistics that summarize the y variables in the two datasets? (d) Draw a diagram showing two scatterplots using the same axes to display dataset1 and dataset2 above. Do you see any difference between the two datasets. Comment using less than 50 words. ENDG 319 (Fall 23) CURE Project Deliverable 1 5|Page TASK 4. EXPLORING AN EXISTING DATASET IN THE PYTHON LIBRARY SKLEARN. There are several python packages that contain some datasets often used in the study of machine learning. We saw two such packages: sklearn and seborn. We used the following commands to load the iris dataset, the keys, and the description of the dataset from sklearn package. The list of some small standard datasets available in sklearn are given below with the functions to load them on python. For more, please see: https://scikit-learn.org/stable/datasets/toy_dataset.html (a) Access and explore the ‘digits’ dataset, and report how many instances used, I. II. III. IV. V. VI. VII. Number of Instances Number of Attributes Attribute Information Missing Attribute Values Creator Date Class ENDG 319 (Fall 23) CURE Project Deliverable 1 6|Page (b) Access and explore the ‘breast_cancer’ dataset, and report how many instances used, I. II. III. IV. V. VI. VII. Number of Instances Number of Attributes Attribute Information Missing Attribute Values Creator Date Class Note: You can directly take screenshot and paste it as your answers. ENDG 319 (Fall 23) CURE Project Deliverable 1 7|Page