
Activity1-Exploring-a-Dataset

Exploring a Dataset
Cole Spinale
2023-02-09
Getting Started: there are 12 parts for you to answer.
Create a project in a folder on your machine. You may call the project something like “week1”. You may use
this project for this activity as well as the lab.
R Markdown. This is an R Markdown document, or Rmd file. Markdown is a simple formatting
syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see
http://rmarkdown.rstudio.com.
Coding activities in this course are done in an R Markdown file format as are the labs. You will submit a pdf
or an HTML file that is generated from your Rmd file by executing a “Knit” function.
The main purpose of an R Markdown file is to produce a professional looking, human-readable report. The R
code is usually hidden from view, and the output can be controlled by attributes on the R code chunks. For
in-class activities, you may sometimes display the code and its output in the knitted file in order for us to
evaluate your work.
Code Chunks. R code in an Rmd file is embedded in an R code chunk. The chunk can have attributes
that control how it functions. A reference for code chunks: https://rmarkdown.rstudio.com/lesson-3.html
(see “Chunk Options”).
Libraries, also called Packages. Installing and loading libraries is an important
part of doing data science work. This chunk loads two libraries into your R environment.
The chunk attribute echo=FALSE means the R code will not be displayed in the knitted file. The chunk
attribute include=FALSE means the chunk still runs, but neither its code nor its output (including
messages) is displayed in the knitted file.
If you don’t have these libraries installed, execute the install.packages("package name")
statement in the R console below, then, after the install completes, execute the load statement again.
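The loading chunk is not visible in the knitted file, so the following is only a sketch of the install-then-load pattern. The package names `dplyr` and `knitr` are assumptions (chosen because the text later mentions `summarize` and `kable`), not the confirmed contents of the original chunk.

```r
# Assumed package names; the original chunk is hidden in the knitted file.
pkgs <- c("dplyr", "knitr")

# requireNamespace() returns FALSE (instead of raising an error) when a package is absent.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Run install.packages(missing_pkgs) in the console, then re-run this chunk.
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach each package so its functions can be called by name.
  for (p in pkgs) library(p, character.only = TRUE)
}
```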
Chunk Output: We suggest that you route the output of R code to the console and not inline. To do
this, click the settings (“gear”) icon above, open the drop-down menu, and choose “Chunk
Output in Console”. This will avoid problems with knitting and with running some R library code. The code
output should appear in the lower right-hand pane in RStudio.
Text in the Rmd File: The text outside of code chunks is formatted with markdown syntax. This is
a fairly simple syntax that allows you to format text, tables, lists and other characters and symbols in a
presentable manner. See https://www.markdownguide.org/basic-syntax/ for reference.
Reading a data file. The read.table function (execute ?read.table in the R console below for the
documentation page) takes several arguments, including the name of the file, a delimiting character, and
whether the first line in the file is a header. A header is a row that contains column names instead of
data. The read.table function returns a data frame object that contains the contents of the file, if it has been
found and successfully processed.
We will mostly work with data files with a .csv extension. The file extension, “csv”, means the data is
delimited, or separated, by the comma character. This is a very common form for files that contain data in
text format (in contrast to other encodings, such as binary files, for example).
This statement reads the file “FlightDelays.csv” by calling the read.csv function (which calls read.table with
the appropriate arguments for a csv file).
Execute the statement below, then look in the Global Environment pane in the upper right.
Notice the Global Environment pane in the upper right shows that the object delays.df (a data frame) was
created. You can expand that entry to inspect the object, or double-click on it to have it display on a tab in
this pane. Notice the column names in the data frame object. These were parsed from the header row in the
file.
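The read step described above could be sketched as follows. To keep the example runnable without the course file, it writes a tiny stand-in CSV to a temporary path; the column names are taken from the summary output in this activity, but the three data rows are invented for illustration.

```r
# Hypothetical stand-in for FlightDelays.csv (rows are made up).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Carrier,FlightNo,Destination,DepartTime,Day,Month,FlightLength,Delay,Delayed30",
  "1,UA,403,DEN,4-8am,Fri,May,281,-1,0",
  "2,AA,405,ORD,8-Noon,Sat,May,NA,102,1",
  "3,NA,409,DFW,4-8pm,Sun,June,0,4,0"
), csv_path)

# header = TRUE tells read.csv that the first line holds the column names.
delays.df <- read.csv(csv_path, header = TRUE)

# str() lists each column with its data type, much like the Environment pane.
str(delays.df)
```

With the real file in the project folder, the call would presumably be `read.csv("FlightDelays.csv")`.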
Clearing the Global Environment. You can clear all objects that have been created by clicking the
broom icon above the Global Environment pane. Be sure to select the “Include hidden objects” option. Do
that now. You will see that the objects have been cleared. Sometimes you will have issues with objects
retaining links to other objects as you work on your code. After clearing the environment, you can re-establish
objects by executing the code chunks: execute the read.csv statement again
to re-establish the delays.df data frame.
Exploring and cleaning a dataframe.
A data frame is a 2 dimensional structure, like a matrix, except that it can contain vectors of mixed data
types. The vectors do have to be the same length. Some data files will have columns with different lengths.
They will cause an error when you attempt to read them in.
Now you have the delays.df object in your environment. The next step is to explore the data, looking for any
missing values and extreme values, and checking whether each column has the correct data type. One benefit of
going through this exploration is that you become familiar with the data set, which is very helpful when you
start analyzing and visualizing the data.
The process of dealing with missing or erroneous values and adjusting data types is referred to as “cleaning”
the data.
A quick way to look at a data frame is with the summary function. Note that for files with very many
columns this becomes impractical, and then you can do summaries for subsets or individual columns.
Now compare the output of summary in the console below with the view of delays.df in the Global Environment
pane. You want to check that the data types are correct and look for any missing or extreme values.
The Global Environment shows each column and some of the first rows of data. You can see that most of the
columns are character data, not numeric values. Which of the columns are numeric values?
ID, FlightNo, FlightLength, Delay, Delayed30
Missing values: This can be tricky as missing values can be coded in different ways, such as a blank, 0,
“None”, or “NA”, to name a few. In this data set, the missing values are encoded as NA, which happens to
be the value R uses to denote a missing value. Can you see missing values in the output of summary?
Kind of, only some of the NA values show up under the quartiles for each variable.
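Because summary does not report NA counts for character columns, a direct count per column can help. The sketch below uses a small invented data frame in place of delays.df.

```r
# Toy stand-in for delays.df; values are invented for illustration.
toy <- data.frame(
  Carrier = c("AA", "UA", NA, "AA"),
  FlightLength = c(281, NA, 0, 150),
  Delay = c(-1, 102, 4, NA)
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# complete.cases() flags rows with no missing values; negate it to list the rest.
incomplete_rows <- toy[!complete.cases(toy), ]
print(incomplete_rows)
```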
Data types, factors and levels. Before dealing with the missing values, we should correct the data types
of the columns. Remember that data can be numeric or categorical (discrete). A column such as Carrier
is listed as “character” or “chr” in the environment pane; here such columns hold categorical data. The
environment pane also reports the number of rows in the data set (observations, “obs”) and the number of
columns (“variables”). This data set has 4031 rows and 10 columns.
In R, categorical data is called “factor” data. The unique values a factor can have are its “levels”. We need
to change the columns that are character to factor. The FlightNo column should also be a factor. We
will ignore the ID column as we are not using it in this activity.
Run the following code and check that the data types for these columns change to “Factor”. Note that
accessing a single column can be done by using the “dereferencing” operator, the $ symbol, and the column
name. Columns can also be called “fields”.
1) TODO: Enter and execute statements to change the other fields to factors as appropriate.
Note: do not change the Delayed30 column.
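A minimal sketch of the conversion, on an invented data frame rather than delays.df. The multi-column form with lapply is one possible approach and is not necessarily the one used in the original chunk.

```r
# Toy stand-in for delays.df; values are invented for illustration.
toy <- data.frame(
  Carrier = c("AA", "UA", "AA"),
  Day = c("Fri", "Sat", "Fri"),
  FlightNo = c(403, 405, 403),
  Delayed30 = c(0, 1, 0)
)

# Convert a single column, accessed with the $ operator.
toy$Carrier <- as.factor(toy$Carrier)

# Convert several columns at once; Delayed30 is deliberately left numeric.
to_factor <- c("Day", "FlightNo")
toy[to_factor] <- lapply(toy[to_factor], as.factor)

str(toy)             # the converted columns now show as "Factor"
levels(toy$Carrier)  # the unique values of the factor
```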
Now execute summary. We will display the result in the knitted render only so we can check the results:
##        ID       Carrier        FlightNo       Destination     DepartTime
##  Min.   :   1   AA  :2906   Min.   :  71.0   ORD    :1784   4-8am   : 700
##  1st Qu.:1008   UA  :1123   1st Qu.: 371.0   DFW    : 918   4-8pm   : 971
##  Median :2016   NA's:   2   Median : 693.0   MIA    : 610   8-Mid   : 257
##  Mean   :2016               Mean   : 827.8   DEN    : 263   8-Noon  :1054
##  3rd Qu.:3024               3rd Qu.: 787.0   STL    : 225   Noon-4pm:1047
##  Max.   :4031               Max.   :2255.0   (Other): 229   NA's    :   2
##                                              NA's   :   2
##   Day        Month       FlightLength       Delay          Delayed30
##  Fri:637   June:2032   Min.   :  0.0   Min.   :-19.00   Min.   :0.0000
##  Mon:630   May :1999   1st Qu.:155.0   1st Qu.: -6.00   1st Qu.:0.0000
##  Sat:453               Median :163.0   Median : -2.50   Median :0.0000
##  Sun:551               Mean   :185.2   Mean   : 11.72   Mean   :0.1481
##  Thu:566               3rd Qu.:228.0   3rd Qu.:  5.00   3rd Qu.:0.0000
##  Tue:630               Max.   :295.0   Max.   :693.00   Max.   :1.0000
##  Wed:564               NA's   :1       NA's   :1
Notice two things: the summary now reports on the levels and frequencies for the factor columns, and we
see that there are missing values, NA, in those columns which we did not see before. We can now see the
distribution of levels for the categorical columns as well as the descriptive stats for the distribution of the
numerical columns.
Handle missing values.
Now that we have the categorical and numeric columns adjusted, we can deal with missing and extreme values.
From the summary we can see the missing values. An extreme value, also called an outlier, would show up as
a very large or small number relative to the rest of the values in the distribution, or as a categorical level
with a frequency of 1. A frequency of 1 could indicate a coding error.
As a data scientist, you need to decide how to deal with missing and extreme values.
In our example, we will remove rows that contain missing values. We will call the na.omit function on the
delays.df data frame. Note: you could keep the version with NAs and assign the result to a new variable, such
as delays.df.omit. We will not do that here. The values in the original file remain unchanged, of course. Also,
be sure to note how many rows were deleted. This can affect your statistical power and
should be reported.
Before removing the rows with missing data, note how many rows are in the data set.
There are 4031 rows before running the chunk below.
##        ID       Carrier      FlightNo       Destination     DepartTime
##  Min.   :   1   AA:2906   Min.   :  71.0   ORD    :1780   4-8am   : 700
##  1st Qu.:1013   UA:1119   1st Qu.: 371.0   DFW    : 918   4-8pm   : 970
##  Median :2019             Median : 693.0   MIA    : 610   8-Mid   : 257
##  Mean   :2019             Mean   : 828.2   DEN    : 263   8-Noon  :1054
##  3rd Qu.:3025             3rd Qu.: 787.0   STL    : 225   Noon-4pm:1044
##  Max.   :4031             Max.   :2255.0   BNA    : 172
##                                            (Other):  57
##   Day        Month       FlightLength       Delay          Delayed30
##  Fri:636   June:2032   Min.   :  0.0   Min.   :-19.00   Min.   :0.0000
##  Mon:630   May :1993   1st Qu.:155.0   1st Qu.: -6.00   1st Qu.:0.0000
##  Sat:453               Median :164.0   Median : -2.00   Median :0.0000
##  Sun:550               Mean   :185.2   Mean   : 11.74   Mean   :0.1483
##  Thu:565               3rd Qu.:228.0   3rd Qu.:  5.00   3rd Qu.:0.0000
##  Tue:629               Max.   :295.0   Max.   :693.00   Max.   :1.0000
##  Wed:562
2) How many rows were removed?
number of rows removed was: 6
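The row bookkeeping can be done in code rather than by reading the Environment pane. This sketch uses an invented data frame; with the real data the same pattern would be applied to delays.df.

```r
# Toy stand-in for delays.df; values are invented for illustration.
toy <- data.frame(
  Carrier = c("AA", "UA", NA, "AA", "UA"),
  Delay = c(-1, 102, 4, NA, 7)
)

rows_before <- nrow(toy)

# na.omit() drops every row that contains at least one NA.
toy_clean <- na.omit(toy)

rows_removed <- rows_before - nrow(toy_clean)
print(rows_removed)
```

For the summaries above, the same subtraction is 4031 - 4025 = 6.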
Extreme values and outliers.
Finally, we can see a suspected extreme value in the FlightLength column. It is extreme because it is very far
from the rest of the distribution. The best way to check for outliers in numeric data is with a boxplot (box and
whisker plot). Outliers are data points that appear outside of the “whiskers”.
[Figure: horizontal boxplot of FlightLength; axis from 0 to 300]
We can see from the plot that at least one value is far beyond the lower whisker. In fact, we can
see this value in the summary as the minimum of the distribution, 0.0. A flight length of 0 is not
meaningful if we are looking at flight delays, so we will remove all 0 values from the data set (there could be
more than one).
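A sketch of the boxplot check on invented flight lengths. boxplot.stats is used here to list the points that fall outside the whiskers; whether the original chunk did this is not known.

```r
# Invented flight lengths, including one implausible 0.
flight_length <- c(0, 150, 155, 160, 163, 170, 228, 240, 295)

# Horizontal boxplot; points drawn beyond the whiskers are candidate outliers.
boxplot(flight_length, horizontal = TRUE, xlab = "FlightLength")

# boxplot.stats() returns the same outliers as numbers in its $out element.
outliers <- boxplot.stats(flight_length)$out
print(outliers)
```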
First, let’s see how many zero values for flight length exist in the data set, then assign NA to those entries,
then omit all NA values.
## [1] 0 0
3) How many rows have a FlightLength of 0?
There are 2 instances of flights with length 0.
This statement assigns the value NA to the FlightLength column where the value is 0:
4) TODO: execute the statement to remove all NA rows from the data set.
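The three steps (count the zeros, recode them as NA, omit the NA rows) might look like this on an invented data frame. Only FlightLength is recoded; a Delay of 0 is a legitimate value and is left alone.

```r
# Toy stand-in for delays.df; values are invented for illustration.
toy <- data.frame(
  FlightLength = c(0, 155, 163, 0, 228),
  Delay = c(5, -2, 10, 0, 30)
)

# which() returns the row positions where the condition is TRUE.
zero_rows <- which(toy$FlightLength == 0)
n_zero <- length(zero_rows)
print(n_zero)

# Recode the zeros as missing, then drop every row that contains an NA.
toy$FlightLength[zero_rows] <- NA
toy <- na.omit(toy)
print(nrow(toy))
```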
##        ID       Carrier      FlightNo       Destination     DepartTime
##  Min.   :   1   AA:2905   Min.   :  71.0   ORD    :1780   4-8am   : 699
##  1st Qu.:1012   UA:1118   1st Qu.: 371.0   DFW    : 918   4-8pm   : 970
##  Median :2018             Median : 693.0   MIA    : 610   8-Mid   : 257
##  Mean   :2018             Mean   : 827.5   DEN    : 263   8-Noon  :1053
##  3rd Qu.:3024             3rd Qu.: 787.0   STL    : 225   Noon-4pm:1044
##  Max.   :4029             Max.   :2255.0   BNA    : 172
##                                            (Other):  55
##   Day        Month       FlightLength       Delay          Delayed30
##  Fri:636   June:2030   Min.   : 68.0   Min.   :-19.00   Min.   :0.0000
##  Mon:630   May :1993   1st Qu.:155.0   1st Qu.: -6.00   1st Qu.:0.0000
##  Sat:453               Median :164.0   Median : -3.00   Median :0.0000
##  Sun:550               Mean   :185.3   Mean   : 11.75   Mean   :0.1484
##  Thu:565               3rd Qu.:228.0   3rd Qu.:  5.00   3rd Qu.:0.0000
##  Tue:627               Max.   :295.0   Max.   :693.00   Max.   :1.0000
##  Wed:562
5) How many rows were removed?
There were two rows removed.
Now, we attach the dataframe to the environment. This allows us to refer to variables in the dataframe by
their name. For example, we don’t have to type delays.df$FlightLength, but only FlightLength.
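A small sketch of attach on an invented data frame, with `with()` shown as an alternative that avoids changing the search path. Note that attach works on a copy: changes made to the data frame afterwards are not seen through the attached names.

```r
# Toy stand-in for delays.df; values are invented for illustration.
toy <- data.frame(FlightLength = c(155, 163, 228), Delay = c(-2, 10, 30))

# After attach(), columns can be referenced by bare name.
attach(toy)
mean_attached <- mean(FlightLength)
detach(toy)  # remove it from the search path when done

# with() evaluates an expression inside the data frame without attaching it.
mean_with <- with(toy, mean(FlightLength))
```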
Visualizing the data.
Generating plots, charts and graphs is an important aid to exploring a data set. Likewise, visualizing results
of analysis, which you will be doing in this course, is the best way to communicate your findings to an
audience of stakeholders.
Basic plots.
The plot function is an easy way to create simple plots in R. It is especially useful when you are exploring a
dataset and do not need to spend a lot of time worrying about the appearance of the plot, as you would in a
more formal report.
The plot function will produce different plots based on the type of data that is passed to it. Let’s look at
the DepartTime column. This is an example of categorical data, which you can see either in the Global
Environment pane or by looking at the summary function output. Categorical data is rendered as a bar chart
when it is passed in to the plot function. This plot is a visual way to see the frequencies of occurrence of each
level in the data set. A better plot would include the actual frequency counts in the plot, but that is a more
complex plot for another time.
[Figure: bar chart of DepartTime frequencies; bars for 4-8am, 4-8pm, 8-Mid, 8-Noon and Noon-4pm; vertical axis from 0 to 1000]
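As a sketch of how plot treats a factor, the example below passes an invented DepartTime vector to plot and also prints the counts with table. The title and axis label are illustrative choices.

```r
# Invented departure-time categories, stored as a factor.
depart_time <- factor(c("4-8am", "8-Noon", "8-Noon", "Noon-4pm", "4-8pm", "8-Noon"))

# For a factor, plot() draws a bar chart of the level frequencies.
plot(depart_time, main = "Flights per DepartTime", ylab = "Frequency")

# table() gives the same frequencies as numbers.
counts <- table(depart_time)
print(counts)
```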
Histogram and Density plots
A histogram shows the distribution of values in a discretized manner. The histogram’s columns, or “bins”,
group the numerical data. The number of bins can be adjusted: more bins = more detail. Sometimes too few
bins will hide the true shape of the data. You can set the number of bins by adjusting the breaks parameter.
For more, you can access the help page with the statement ?hist. This statement generates a histogram for
the FlightLength vector:
[Figure: “Histogram of FlightLength”; x-axis FlightLength (100 to 300), y-axis Frequency (0 to 1500)]
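The effect of the breaks parameter can be sketched on simulated data (the values below are random draws, not the flight data). When breaks is a single number, hist treats it as a suggestion and may choose a nearby "pretty" number of bins.

```r
# Simulated flight lengths; set.seed() makes the draw reproducible.
set.seed(1)
flight_length <- round(rnorm(200, mean = 185, sd = 45))

# Few bins: a coarse view of the distribution.
h_coarse <- hist(flight_length, breaks = 5,
                 main = "Histogram of FlightLength (coarse)", xlab = "FlightLength")

# More bins: more detail.
h_fine <- hist(flight_length, breaks = 30,
               main = "Histogram of FlightLength (fine)", xlab = "FlightLength")

# hist() also returns the bin counts, which can be inspected directly.
length(h_coarse$counts)
length(h_fine$counts)
```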
Another way to view the shape of a distribution is to create a density plot. While a histogram is a “discretized”
view of a distribution, a density plot is a continuous (smoothed) view of the data. We will discuss density
and “smoothing” later on in the course. It is often useful to plot a histogram and a density plot together.
Here is an example of how you can combine a density curve and a histogram in a single plot. Notice the
shape of the density curve follows the height of the bins in the histogram. You also see how to use some
graphical parameters to customize the visual aspects of the plot and to add labels and a title.
[Figure: histogram with density curve, titled “Flight Length”; x-axis “scores” (100 to 300), y-axis Density (0.000 to 0.020)]
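One way to overlay a density curve on a histogram is sketched below on simulated data. The colours and line width are illustrative; the title and x-axis label mirror the figure above.

```r
# Simulated flight lengths (random draws, not the flight data).
set.seed(1)
flight_length <- rnorm(200, mean = 185, sd = 45)

# freq = FALSE puts the histogram on the density scale so the curve is comparable.
hist(flight_length, freq = FALSE, col = "lightgray",
     main = "Flight Length", xlab = "scores")

# density() computes a smoothed estimate; lines() adds it to the existing plot.
d <- density(flight_length)
lines(d, col = "blue", lwd = 2)
```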
6) TODO: write and execute code to display a density curve and a histogram in a single plot
for the Delay variable.
[Figure: histogram with density curve, titled “Delay”; x-axis “scores” (0 to 600), y-axis Density (0.000 to 0.012)]
7) What do you notice about the shape of this plot of this distribution?
It is significantly skewed to the right.
Bivariate plots
You often want to explore the relationship between two variables. A scatter plot is a great way to see how
two variables interact, or not. Is there a correlation between FlightLength and the length of Delay?
8) Without going any further, do you think these two variables would be related?
I would imagine that the longer the flight is the longer a potential delay for that flight may be as
there is more room for error.
This generates a scatterplot of these two variables:
[Figure: scatterplot of Delay (y-axis, 0 to 700) against FlightLength (x-axis, 100 to 300)]
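A scatterplot of two numeric vectors needs only a call to plot. The data below are simulated for illustration and are not the flight data.

```r
# Simulated values standing in for the two columns.
set.seed(42)
flight_length <- runif(100, min = 70, max = 300)
delay <- rexp(100, rate = 1 / 20) - 10  # right-skewed, like real delays

# First argument goes on the x-axis, second on the y-axis.
plot(flight_length, delay, xlab = "FlightLength", ylab = "Delay")
```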
9) Based on the plot, do you think there is a correlation between FlightLength and the length of
Delay?
It doesn’t seem like there is much correlation between the variables although if there were any it
seems like they are negatively correlated.
To investigate further, it’s a good idea to get a more quantitative measure with the correlation coefficient:
##
##  Pearson's product-moment correlation
##
## data:  FlightLength and Delay
## t = -1.1445, df = 4021, p-value = 0.2525
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04892200  0.01286334
## sample estimates:
##         cor
## -0.01804656
10) Does the result support the scatterplot view?
The correlation is about -0.018 indicating a slight negative correlation, although the p-value
suggests that the correlation is insignificant.
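Output of this form comes from R's cor.test function. The sketch below runs it on simulated data and shows how to pull out the individual pieces; the numbers it prints will differ from the flight data results above.

```r
# Simulated, independent variables (not the flight data).
set.seed(42)
flight_length <- runif(100, min = 70, max = 300)
delay <- rexp(100, rate = 1 / 20) - 10

# Pearson correlation test (the default method).
result <- cor.test(flight_length, delay)

result$estimate   # the correlation coefficient
result$p.value    # p-value for the null hypothesis of zero correlation
result$conf.int   # 95 percent confidence interval
result$parameter  # degrees of freedom, n - 2
```

The degrees of freedom are the number of pairs minus 2, which is why the output above reports df = 4021 for the 4023 remaining rows.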
Delays and Carrier. What do the delays look like for each airline company, or Carrier? The Delayed30
column encodes delays > 30 minutes as 1, otherwise 0. This code uses the R table function to create a table
object. The table will calculate the frequencies of the Delayed30 values for each carrier. The kable function is
used for rendering the table so it looks nice.
Table 1: Delays > 30 min per Carrier
Carrier   Delayed30 = 0   Delayed30 = 1
AA                 2512             393
UA                  914             204
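A two-way frequency table of this kind can be sketched with table on two invented vectors. In the Rmd, the result would then be passed to kable for rendering; that call is left out here so the example runs without extra packages.

```r
# Invented carrier and delay indicators (six flights).
carrier <- c("AA", "AA", "UA", "AA", "UA", "UA")
delayed30 <- c(0, 1, 0, 0, 1, 0)

# Rows are carriers, columns are the Delayed30 values.
delay_tab <- table(carrier, delayed30)
print(delay_tab)

# Individual cells are addressed by row and column label.
delay_tab["AA", "1"]
```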
This code displays a barplot visualization of the table created above:
[Figure: bar plot titled “Delays > 30 min per Carrier”; vertical axis from 0 to 2500; labels 0, 1, AA and UA]
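A bar plot of the Table 1 counts might be produced as follows. The counts are entered by hand from Table 1; the side-by-side layout and the legend are assumptions about how the original figure was drawn.

```r
# Counts copied from Table 1 (rows: Carrier, columns: Delayed30).
delay_tab <- matrix(c(2512, 914, 393, 204), nrow = 2,
                    dimnames = list(Carrier = c("AA", "UA"),
                                    Delayed30 = c("0", "1")))

# beside = TRUE draws the carriers side by side within each Delayed30 group;
# legend.text = TRUE labels the bars with the row names (AA, UA).
bar_mids <- barplot(delay_tab, beside = TRUE, legend.text = TRUE,
                    main = "Delays > 30 min per Carrier", xlab = "Delayed30")
```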
11) According to the table and plot above, which of the two airlines has a larger proportion
of delayed flights?
UA has 204/1118 (0.18) flights delayed while AA has only 393/2905 (0.14)
Again, it can be difficult to gauge proportion by visual inspection. Let’s get a more quantitative result.
The prop.table function can be used for calculating the proportion of delays > 30 minutes for each carrier.
Because Delayed30 is not a categorical variable (we have not declared it so on purpose), we use the table
from above as an argument to prop.table because that table has the frequencies.
Table 2: Proportion of Delays > 30 min per Carrier
Carrier   Delayed30 = 0   Delayed30 = 1
AA                 0.62            0.10
UA                 0.23            0.05
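The Table 2 values are each cell of Table 1 divided by the grand total of 4023 flights. To compare carriers directly, prop.table also accepts a margin argument that divides within each row instead. The sketch below enters the Table 1 counts by hand.

```r
# Counts copied from Table 1 (rows: Carrier, columns: Delayed30).
delay_tab <- matrix(c(2512, 914, 393, 204), nrow = 2,
                    dimnames = list(Carrier = c("AA", "UA"),
                                    Delayed30 = c("0", "1")))

# No margin: each cell divided by the grand total (this reproduces Table 2).
overall <- prop.table(delay_tab)
print(round(overall, 2))

# margin = 1: each cell divided by its row total, so each carrier's row sums to 1.
by_carrier <- prop.table(delay_tab, margin = 1)
print(round(by_carrier, 2))
```

The row-wise version gives about 0.14 for AA and 0.18 for UA in the Delayed30 = 1 column, matching the hand calculation in the answer to question 11.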
Condensing a Categorical Variable’s Levels. Create a new column in the delays.df dataframe called
TimePeriod that has three levels: ‘Morning’, ‘Afternoon’, and ‘Evening’. This column will condense the
levels found in the DepartTime column according to these rules: 4-8am, 8-Noon -> Morning, Noon-4pm ->
Afternoon, 4-8pm, 8-Mid -> Evening. This code uses ‘ifelse’ conditional statements to do the condensing.
The ifelse function is useful because it is vectorized: it tests every element of the column at once, so we don’t
need to write a loop.
12) How many of each level of TimePeriod do you see?
Afternoon: 1044 Evening: 1227 Morning: 1752
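The condensing rules can be sketched with nested ifelse calls on an invented DepartTime vector. The exact form of the original chunk is not shown, so this is one possible way to write it.

```r
# Invented departure-time categories.
depart_time <- c("4-8am", "8-Noon", "Noon-4pm", "4-8pm", "8-Mid", "8-Noon")

# ifelse() is evaluated element by element; the inner call handles the
# values that did not match the first condition.
time_period <- ifelse(depart_time %in% c("4-8am", "8-Noon"), "Morning",
               ifelse(depart_time == "Noon-4pm", "Afternoon", "Evening"))

# Store as a factor so summary() and table() report the level counts.
time_period <- factor(time_period)
period_counts <- table(time_period)
print(period_counts)
```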
Now we want to create a table that shows how many times there were delays of more than 30 minutes for each
TimePeriod. Use the TimePeriod column as the grouping, and pass the grouping to the summarize function,
also applying the sum function to the Delayed30 column. The sum is applied to the Delayed30 values for
each group (level) of TimePeriod.
Table 3: Number of delays > 30 per time period.
Time Period   Delays > 30
Afternoon             174
Evening               326
Morning                97
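The grouped sum can be sketched in base R with aggregate, used here in place of the grouping and summarize approach described above so that the example runs without extra packages. The data frame is invented.

```r
# Toy stand-in for delays.df; values are invented for illustration.
toy <- data.frame(
  TimePeriod = c("Morning", "Morning", "Afternoon", "Evening", "Evening", "Evening"),
  Delayed30 = c(0, 1, 0, 1, 1, 0)
)

# Sum Delayed30 within each TimePeriod; because Delayed30 is 0/1,
# the sum is the number of delayed flights in that group.
delays_by_period <- aggregate(Delayed30 ~ TimePeriod, data = toy, FUN = sum)
print(delays_by_period)
```

As a check on Table 3, its three counts add up to 174 + 326 + 97 = 597, the same total as the Delayed30 = 1 column of Table 1 (393 + 204).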
Knitting the Rmd file.
After you have verified that your R code runs correctly, you can convert the Rmd file into an HTML or pdf
file by clicking on the Knit drop-down menu at the top of RStudio and selecting “Knit to HTML” or “Knit to
PDF”. You will see some output from the knit function scrolling by. If there are no errors, you will see the
HTML or PDF file appear. The file is written to the project directory.
Note: We will accept either HTML or pdf file formats. The pdf is often a better looking rendering.
Inspect the resulting HTML or pdf file to make sure that it looks the way you intend, make any changes
to the Rmd file if necessary and re-knit. You will submit all of your R assignments as knitted files, either
HTML or pdf format.
If any of your R code does not run, or if a file that you are reading cannot be found, you will see error
messages in the “R Markdown” tab below. Click on the “Output” option for details and line numbers for
debugging.
Now, knit this file to make sure it is working and you will be all set!