Exploring a Dataset
Cole Spinale
2023-02-09

Getting Started: there are 12 parts for you to answer.

Create a project in a folder on your machine. You may call the project something like “week1”. You may use this project for this activity as well as the lab.

R Markdown

This is an R Markdown document, or Rmd file. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Coding activities in this course are done in an R Markdown file format, as are the labs. You will submit a pdf or an HTML file that is generated from your Rmd file by executing a “Knit” function.

The main purpose of an R Markdown file is to produce a professional looking, human-readable report. The R code is usually hidden from view, and the output can be controlled by attributes on the R code chunks. For in-class activities, you may sometimes display the code and its output in the knitted file in order for us to evaluate your work.

Code Chunks

R code in an Rmd file is embedded in an R code chunk. The chunk can have attributes that control how it functions. A reference for code chunks: https://rmarkdown.rstudio.com/lesson-3.html (see “Chunk Options”).

Libraries, also called Packages.

Installing and loading libraries (also called packages) is an important part of doing data science work. This chunk loads two libraries into your R environment.

The chunk attribute echo=FALSE means the R code will not be displayed in the knitted file. The chunk attribute include=FALSE means the chunk still runs, but neither its code nor its output, including messages, is displayed in the knitted file.

If you don’t have these libraries installed, execute the install.packages("package name") statement in the R console below, then, after the install completes, execute the load statement again.

Chunk Output: We suggest that you route the output of R code to the console and not inline.
To do this, go to the settings (“gear”) icon above, open the drop-down options, and choose “Chunk Output in Console”. This will avoid problems with knitting and with running some R library code. The code output should appear in the lower right-hand pane in RStudio.

Text in the Rmd File: The text outside of code chunks is formatted with markdown syntax. This is a fairly simple syntax that allows you to format text, tables, lists and other characters and symbols in a presentable manner. See https://www.markdownguide.org/basic-syntax/ for reference.

Reading a data file.

The read.table function (execute ?read.table in the R console below for the documentation page) takes several arguments, including the name of the file, a delimiting character, and whether the first line in the file is a header. A header is a row that contains column names instead of data. The read.table function returns a data frame object that contains the contents of the file, if it has been found and successfully processed.

We will mostly work with data files with a .csv extension. The file extension, “csv”, means the data is delimited, or separated, by the comma character. This is a very common form for files that contain data in text format (in contrast to other encodings, such as binary files).

This statement reads the file “FlightDelays.csv” by calling the read.csv function (which calls read.table with the appropriate arguments for a csv file). Execute the statement below, then look in the Global Environment pane in the upper right.

Notice that the Global Environment pane shows that the object delays.df (a data frame) was created. You can expand that entry to inspect the object, or double-click on it to have it display on a tab in this pane. Notice the column names in the data frame object. These were parsed from the header row in the file.

Clearing the Global Environment.
You can clear all objects that have been created by clicking the broom icon above the Global Environment pane. Be sure to select the “Include hidden objects” option. Do that now. You will see that the objects have been cleared.

Sometimes you will have issues with objects retaining links to other objects as you work on your code. After clearing the environment, you can re-establish objects by executing the code chunks. Execute the read.csv statement again to re-establish the delays.df data frame.

Exploring and cleaning a dataframe.

A data frame is a two-dimensional structure, like a matrix, except that its columns can hold different data types. The columns (vectors) do have to be the same length. Some data files will have columns with different lengths; they will cause an error when you attempt to read them in.

Now you have the delays.df object in your environment. The next step is to explore the data, looking for any missing values and extreme values, and checking whether the columns have the correct data type. One benefit of going through this exploration is that you become familiar with the data set, which is very helpful when you start analyzing and visualizing the data. The process of dealing with missing or erroneous values and adjusting data types is referred to as “cleaning” the data.

A quick way to look at a data frame is with the summary function. Note that for files with very many columns this becomes impractical, and then you can do summaries for subsets or individual columns.

Now compare the output of summary in the console below with the view of delays.df in the Global Environment pane. You want to check that the data types are correct and look for any missing or extreme values. The Global Environment shows each column and some of the first rows of data. You can see that many of the columns are character data, not numeric values.

Which of the columns are numeric values?
ID, FlightNo, FlightLength, Delay, Delayed30

Missing values: This can be tricky, as missing values can be coded in different ways, such as a blank, 0, “None”, or “NA”, to name a few. In this data set, the missing values are encoded as NA, which happens to be the value R uses to denote a missing value.

Can you see missing values in the output of summary?

Partly: only some of the NA values show up, listed under the quartiles of the numeric variables.

Data types, factors and levels.

Before dealing with the missing values, we should correct the data types of the columns. Remember that data can be numeric or categorical (discrete). A column such as Carrier is listed as “character” or “chr” in the environment pane, which means it is categorical. The environment pane also reports how many rows are in the data set (the number of observations, “obs”) and the number of columns (“variables”). This data set has 4031 rows and 10 columns.

In R, categorical data is called “factor” data. The unique values a factor can have are its “levels”. We need to change the columns that are character to factor. The FlightNo column should also be a factor. We will ignore the ID column, as we are not using it in this activity.

Run the following code and check that the data types for these columns change to “Factor”. Note that accessing a single column can be done by using the “dereferencing” operator, the $ symbol, and the column name. Columns can also be called “fields”.

1) TODO: Enter and execute statements to change the other fields to factors as appropriate. Note: do not change the Delayed30 column.

Now execute summary. We will display the result in the knitted render only so we can check the results:

##       ID         Carrier       FlightNo      Destination     DepartTime
##  Min.   :   1   AA  :2906   Min.   :  71.0   ORD    :1784   4-8am   : 700
##  1st Qu.:1008   UA  :1123   1st Qu.: 371.0   DFW    : 918   4-8pm   : 971
##  Median :2016   NA's:   2   Median : 693.0   MIA    : 610   8-Mid   : 257
##  Mean   :2016               Mean   : 827.8   DEN    : 263   8-Noon  :1054
##  3rd Qu.:3024               3rd Qu.: 787.0   STL    : 225   Noon-4pm:1047
##  Max.   :4031               Max.   :2255.0   (Other): 229   NA's    :   2
##                                              NA's   :   2
##    Day       Month     FlightLength        Delay          Delayed30
##  Fri:637   June:2032   Min.   :  0.0   Min.   :-19.00   Min.   :0.0000
##  Mon:630   May :1999   1st Qu.:155.0   1st Qu.: -6.00   1st Qu.:0.0000
##  Sat:453               Median :163.0   Median : -2.50   Median :0.0000
##  Sun:551               Mean   :185.2   Mean   : 11.72   Mean   :0.1481
##  Thu:566               3rd Qu.:228.0   3rd Qu.:  5.00   3rd Qu.:0.0000
##  Tue:630               Max.   :295.0   Max.   :693.00   Max.   :1.0000
##  Wed:564               NA's   :1       NA's   :1

Notice two things: the summary now reports on the levels and frequencies for the factor columns, and we see that there are missing values, NA, in those columns which we did not see before. We can now see the distribution of levels for the categorical columns as well as the descriptive stats for the distribution of the numerical columns.

Handle missing values.

Now that we have the categorical and numeric columns adjusted, we can deal with missing and extreme values. From the summary we can see the missing values. An extreme value, also called an outlier, would show up as a very large or small number relative to the rest of the values in the distribution, or as a categorical level with a frequency of 1. A frequency of 1 could indicate a coding error.

As a data scientist, you need to decide how to deal with missing and extreme values. In our example, we will remove rows that contain missing values. We will call the na.omit function on the delays.df data frame. Note: you could keep the version with NAs and assign the result to a new variable such as delays.df.omit. We will not do that here. The values in the original file remain unchanged, of course. Also, be sure to take note of how many rows were deleted. This can affect your statistical power and should be reported.

Before removing the rows with missing data, note how many rows are in the data set. There are 4031 rows before running the chunk below.
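The row bookkeeping described above can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: `toy.df`, its values, and the helper variable names are hypothetical and are not taken from FlightDelays.csv or from the original chunk.

```r
# Toy data frame standing in for delays.df (values are made up for illustration)
toy.df <- data.frame(
  Carrier      = factor(c("AA", "UA", NA, "AA", "UA")),
  FlightLength = c(155, NA, 228, 163, 295),
  Delay        = c(-6, 5, 12, NA, 0)
)

rows.before <- nrow(toy.df)     # row count before cleaning
toy.clean   <- na.omit(toy.df)  # drop every row that has at least one NA
rows.after  <- nrow(toy.clean)  # row count after cleaning

# Number of rows lost to missing values; this is the figure to report
rows.removed <- rows.before - rows.after
print(rows.removed)
```

The same subtraction applies to delays.df: record nrow(delays.df) before and after the na.omit call.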
The summary after removing the rows with missing values:

##       ID        Carrier      FlightNo      Destination     DepartTime
##  Min.   :   1   AA:2906   Min.   :  71.0   ORD    :1780   4-8am   : 700
##  1st Qu.:1013   UA:1119   1st Qu.: 371.0   DFW    : 918   4-8pm   : 970
##  Median :2019             Median : 693.0   MIA    : 610   8-Mid   : 257
##  Mean   :2019             Mean   : 828.2   DEN    : 263   8-Noon  :1054
##  3rd Qu.:3025             3rd Qu.: 787.0   STL    : 225   Noon-4pm:1044
##  Max.   :4031             Max.   :2255.0   BNA    : 172
##                                            (Other):  57
##    Day       Month     FlightLength        Delay          Delayed30
##  Fri:636   June:2032   Min.   :  0.0   Min.   :-19.00   Min.   :0.0000
##  Mon:630   May :1993   1st Qu.:155.0   1st Qu.: -6.00   1st Qu.:0.0000
##  Sat:453               Median :164.0   Median : -2.00   Median :0.0000
##  Sun:550               Mean   :185.2   Mean   : 11.74   Mean   :0.1483
##  Thu:565               3rd Qu.:228.0   3rd Qu.:  5.00   3rd Qu.:0.0000
##  Tue:629               Max.   :295.0   Max.   :693.00   Max.   :1.0000
##  Wed:562

2) How many rows were removed?

The number of rows removed was 6.

Extreme values and outliers.

[Figure: boxplot of FlightLength; axis from 0 to 300]

Finally, we can see a suspected extreme value in the FlightLength column. It is extreme because it is very far from the rest of the distribution. The best way to check for outliers in numeric data is with a boxplot (box and whisker plot). Outliers are data points that appear outside of the “whiskers”. We can see from the plot that at least one value is well beyond the lower whisker. In fact, we can see this value in the summary as the minimum of the distribution, 0.0. A flight length of 0 is not meaningful if we are looking at flight delays, so we will remove all 0 values from the data set (there could be more than one).

First, let’s see how many zero values for flight length exist in the data set, then assign NA to those entries, then omit all NA values.

## [1] 0 0

3) How many rows have a FlightLength of 0?

There are 2 instances of flights with length 0.

This statement assigns the value NA to the FlightLength column where the value is 0:

4) TODO: execute the statement to remove all NA rows from the data set.

##       ID        Carrier      FlightNo      Destination     DepartTime
##  Min.   :   1   AA:2905   Min.   :  71.0   ORD    :1780   4-8am   : 699
##  1st Qu.:1012   UA:1118   1st Qu.: 371.0   DFW    : 918   4-8pm   : 970
##  Median :2018             Median : 693.0   MIA    : 610   8-Mid   : 257
##  Mean   :2018             Mean   : 827.5   DEN    : 263   8-Noon  :1053
##  3rd Qu.:3024             3rd Qu.: 787.0   STL    : 225   Noon-4pm:1044
##  Max.   :4029             Max.   :2255.0   BNA    : 172
##                                            (Other):  55
##    Day       Month     FlightLength        Delay          Delayed30
##  Fri:636   June:2030   Min.   : 68.0   Min.   :-19.00   Min.   :0.0000
##  Mon:630   May :1993   1st Qu.:155.0   1st Qu.: -6.00   1st Qu.:0.0000
##  Sat:453               Median :164.0   Median : -3.00   Median :0.0000
##  Sun:550               Mean   :185.3   Mean   : 11.75   Mean   :0.1484
##  Thu:565               3rd Qu.:228.0   3rd Qu.:  5.00   3rd Qu.:0.0000
##  Tue:627               Max.   :295.0   Max.   :693.00   Max.   :1.0000
##  Wed:562

5) How many rows were removed?

Two rows were removed.

Now, we attach the dataframe to the environment. This allows us to refer to variables in the dataframe by their name. For example, we don’t have to type delays.df$FlightLength, but only FlightLength.

Visualizing the data.

Generating plots, charts and graphs is an important aid to exploring a data set. Likewise, visualizing results of analysis, which you will be doing in this course, is the best way to communicate your findings to an audience of stakeholders.

Basic plots.

The plot function is an easy way to create simple plots in R. It is especially useful when you are exploring a dataset and do not need to spend a lot of time worrying about the appearance of the plot, as you would in a more formal report. The plot function will produce different plots based on the type of data that is passed to it.

Let’s look at the DepartTime column. This is an example of categorical data, which you can see either in the Global Environment pane or by looking at the summary function output. Categorical data is rendered as a bar chart when it is passed in to the plot function. This plot is a visual way to see the frequencies of occurrence of each level in the data set.
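As a minimal sketch of what plot does with categorical data, the example below uses a small made-up factor (the name `depart` and its values are hypothetical, not the DepartTime column itself). It prints the level frequencies with table and then draws the bar chart whose bar heights are those same counts.

```r
# Toy factor standing in for a categorical column (values are made up for illustration)
depart <- factor(c("4-8am", "8-Noon", "8-Noon", "Noon-4pm", "4-8pm", "8-Noon"))

# table() counts how often each level occurs
level.counts <- table(depart)
print(level.counts)

# Passing a factor to plot() draws one bar per level; bar height is the level's count
plot(depart, main = "Departures per time window (toy data)", ylab = "Frequency")
```

Printing the table next to the plot is a simple way to read off exact frequencies without building a more elaborate chart.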
A better plot would include the actual frequency counts in the plot, but that is a more complex plot for another time.

[Figure: bar chart of DepartTime frequencies; levels 4-8am, 4-8pm, 8-Mid, 8-Noon, Noon-4pm; y-axis from 0 to 1000]

Histogram and Density plots

A histogram shows the distribution of values in a discretized manner. The histogram’s columns, or “bins”, group the numerical data. The number of bins can be adjusted: more bins give more detail, while too few bins can hide the true shape of the data. You can set the number of bins by adjusting the breaks parameter. For more, you can access the help page with the statement: ?hist

This statement generates a histogram for the FlightLength vector:

[Figure: “Histogram of FlightLength”; x-axis: FlightLength, y-axis: Frequency]

Another way to view the shape of a distribution is to create a density plot. While a histogram is a “discretized” view of a distribution, a density plot is a continuous (smoothed) view of the data. We will discuss density and “smoothing” later on in the course.

It is often useful to plot a histogram and a density plot together. Here is an example of how you can combine a density curve and a histogram in a single plot. Notice that the shape of the density curve follows the height of the bins in the histogram. You also see how to use some graphical parameters to customize the visual aspects of the plot and to add labels and a title.

[Figure: “Flight Length”, histogram with density curve; x-axis: scores, y-axis: Density]

6) TODO: write and execute code to display a density curve and a histogram in a single plot for the Delay variable.

[Figure: “Delay”, histogram with density curve; x-axis: scores, y-axis: Density]

7) What do you notice about the shape of this distribution?

It is significantly skewed to the right.

Bivariate plots

You often want to explore the relationship between two variables. A scatter plot is a great way to see how two variables interact, or not. Is there a correlation between FlightLength and the length of Delay?
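The general pattern for a bivariate check can be sketched on synthetic data before looking at the flight data. Everything in this example is an assumption made for illustration: `x` and `y` are randomly generated stand-ins, not FlightLength and Delay, and the call pattern is not necessarily the one used in the original chunks.

```r
# Synthetic data for illustration only (not the flight data)
set.seed(1)
x <- runif(200, min = 100, max = 300)  # stand-in for a duration in minutes
y <- rnorm(200, mean = 10, sd = 30)    # stand-in for a delay, generated independently of x

# Scatter plot: one point per observation
plot(x, y, xlab = "x (synthetic)", ylab = "y (synthetic)")

# Pearson correlation with a significance test
result <- cor.test(x, y)
print(result$estimate)  # correlation coefficient, always between -1 and 1
print(result$p.value)   # p-value for the null hypothesis of zero correlation
```

A coefficient near 0 with a large p-value means there is no evidence of a linear relationship; the scatter plot remains useful for spotting nonlinear patterns that a single coefficient would miss.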
8) Without going any further, do you think these two variables would be related?

I would imagine that the longer the flight, the longer a potential delay may be, as there is more room for error.

This generates a scatterplot of these two variables:

[Figure: scatterplot; x-axis: FlightLength, y-axis: Delay]

9) Based on the plot, do you think there is a correlation between FlightLength and the length of Delay?

There does not seem to be much correlation between the variables, although if there is any, it appears to be negative.

To investigate further, it’s a good idea to get a more quantitative measure with the correlation coefficient:

##
##  Pearson's product-moment correlation
##
## data:  FlightLength and Delay
## t = -1.1445, df = 4021, p-value = 0.2525
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04892200  0.01286334
## sample estimates:
##         cor
## -0.01804656

10) Does the result support the scatterplot view?

The correlation is about -0.018, indicating a slight negative correlation, although the p-value (0.2525) suggests that the correlation is not statistically significant.

Delays and Carrier

What do the delays look like for each airline company, or Carrier? The Delayed30 column encodes delays > 30 minutes as 1, otherwise 0. This code uses the R table function to create a table object. The table will calculate the frequencies of the Delayed30 values for each carrier. The kable function is used for rendering the table so it looks nice.

Table 1: Delays > 30 min per Carrier

         0      1
AA    2512    393
UA     914    204

This code displays a barplot visualization of the table created above:

[Figure: barplot, “Delays > 30 min per Carrier”; labels 0, 1, AA, UA]

11) According to the table and plot above, which of the two airlines has a larger proportion of delayed flights?

UA has the larger proportion: 204/1118 (0.18) of its flights were delayed, while AA has only 393/2905 (0.14).

Again, it can be difficult to gauge proportion by visual inspection.
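The hand-computed proportions in the answer to question 11 can be checked directly from the counts in Table 1. This is a minimal sketch that assumes the counts are typed in by hand as a matrix; the names `counts` and `by.carrier` are illustrative and do not come from the original chunk.

```r
# Counts copied from Table 1 (rows: carrier, columns: Delayed30 value)
counts <- matrix(c(2512, 393,
                    914, 204),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Carrier = c("AA", "UA"),
                                 Delayed30 = c("0", "1")))

# margin = 1 divides each row by its row total, so every row sums to 1
by.carrier <- prop.table(counts, margin = 1)
print(round(by.carrier, 2))
```

The "1" column reproduces the 0.14 (AA) and 0.18 (UA) quoted above. Calling prop.table without the margin argument divides by the grand total instead, which answers a different question.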
Let’s get a more quantitative result. The prop.table function can be used for calculating the proportion of delays > 30 minutes for each carrier. Because Delayed30 is not a categorical variable (we have not declared it so on purpose), we use the table from above as an argument to prop.table because that table has the frequencies.

Table 2: Proportion of Delays > 30 min per Carrier

         0      1
AA    0.62   0.10
UA    0.23   0.05

Note that the four cells of Table 2 sum to 1, so each value is a proportion of all flights in the data set rather than of each carrier’s own flights.

Condensing a Categorical Variable’s Levels

Create a new column in the delays.df dataframe called TimePeriod that has three levels: ‘Morning’, ‘Afternoon’, and ‘Evening’. This column will condense the levels found in the DepartTime column according to these rules:

4-8am, 8-Noon -> Morning
Noon-4pm -> Afternoon
4-8pm, 8-Mid -> Evening

This code uses ‘ifelse’ conditional statements to do the condensing. The ifelse function is useful because it is vectorized: it works element by element down the column, so we don’t need to write a loop.

12) How many of each level of TimePeriod do you see?

Afternoon: 1044
Evening: 1227
Morning: 1752

Now we want to create a table that shows how many times there were delays of more than 30 minutes for each TimePeriod. Use the TimePeriod column as the grouping, and pass the grouping to the summarize function, also applying the sum function to the Delayed30 column. The sum is applied to the Delayed30 values for each group (level) of TimePeriod.

Table 3: Number of delays > 30 per time period.

Time Period    Delays > 30
Afternoon      174
Evening        326
Morning         97

Knitting the Rmd file.

After you have verified that your R code runs correctly, you can convert the Rmd file into an HTML or pdf file by clicking on the Knit drop-down menu at the top of RStudio and selecting “Knit to HTML” or “Knit to PDF”. You will see some output from the knit function scrolling by. If there are no errors, you will see the HTML or PDF file appear. The file is written to the project directory.

Note: We will accept either HTML or pdf file formats. The pdf is often a better looking rendering.
Inspect the resulting HTML or pdf file to make sure that it looks the way you intend; make any changes to the Rmd file if necessary and re-knit. You will submit all of your R assignments as knitted files, in either HTML or pdf format.

If any of your R code does not run, or if a file that you are reading cannot be found, you will see error messages in the “R Markdown” tab below. Click on the “Output” option for details and line numbers for debugging.

Now, knit this file to make sure it is working and you will be all set!
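For reference, the rules from the “Condensing a Categorical Variable’s Levels” section can be sketched with nested ifelse calls on toy vectors. This is an illustration under stated assumptions: `depart`, `delayed30`, and `period` are hypothetical stand-ins with made-up values, and base R tapply is used here in place of the summarize call, whose exact form is not shown in this document.

```r
# Toy vectors standing in for DepartTime and Delayed30 (values are made up for illustration)
depart    <- c("4-8am", "8-Noon", "Noon-4pm", "4-8pm", "8-Mid", "8-Noon")
delayed30 <- c(0, 1, 0, 1, 1, 0)

# Nested ifelse() is vectorized: each element is tested in turn, no loop needed
period <- ifelse(depart %in% c("4-8am", "8-Noon"), "Morning",
          ifelse(depart == "Noon-4pm", "Afternoon", "Evening"))
period <- factor(period)

# Frequency of each condensed level
print(table(period))

# Sum of the 0/1 indicator within each level (base R stand-in for a grouped summary)
delays.per.period <- tapply(delayed30, period, sum)
print(delays.per.period)
```

Because the final ifelse branch catches everything that is not a morning or afternoon level, an unexpected or misspelled level would silently be counted as Evening; checking table(period) against the original level counts guards against that.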