Uploaded by AlexaViola_12

stat503 spring2023 hw2 (1.0)

advertisement
STAT 503 – Statistical Methods for Biologists
Modified: 2023-01-27
NAME:
STAT 503 – Statistical Methods for Biology
Homework 2
30 Points
Due at 11:59 PM on Friday, February 3, 2023
Please use complete sentences unless the question is marked with an asterisk (*).
Please download the Dalgleish et al. tree data from the Brightspace and save it to your hard
drive (the file is titled dalgleish_et_al_trees.csv). This dataset describes the species and
diameter-at-beast-height (dbh, about 1.4 m above the ground, measured in cm) of 5681
individual trees at a site in Maine. It also contains age, reproductive data, and disease status for
American chestnut (Castanea dentata) trees at this site, as well as map coordinates for each tree.
We will use these data to practice working with large datasets and calculating summary statistics.
1. [Total: 9 points] Follow the steps below to build and present a frequency distribution for dbh
in quaking aspen.
a. *[1 point] After you read in the data, use the filter() and arrange() functions from
the dplyr package to create a new data frame that contains only data for quaking aspen
trees and has the rows arranged in order of increasing dbh (i.e., the first row has the
smallest tree, the last has the largest). Tutorial 2 explains how to run these operations,
but if you have difficulty, post a question on Brightspace Discussion Forums. Once you
have sorted the data, report the last five values here.
b. *[4 points] Use the 42 dbh values that you obtained in question 1a to build a frequency
table (with 9 intervals) for dbh in quaking aspen at this site. Please begin your table at
dbh ≤ 10 cm, use 4 cm bins, and include columns for both absolute and relative
frequency. Please do not use R for this question. See the Guidelines for reporting
statistical results on the Brightspace Cheat sheets tab for table formatting guidelines, and
round decimals to 3 decimal places.
Table 1: Descriptive caption required here, and you need to create 9 intervals in total
DBH (cm)
Frequency
≤ 10
1
Relative frequency
STAT 503 – Statistical Methods for Biologists
Modified: 2023-01-27
> 38
Total
c. *[2 points] In Tutorial 1, you learned how to control the number of bins in a histogram.
You can also control the specific breaks points that divide the bins.
For instance, if I have saved my quaking aspen data to a data frame called qa, then the
following statement will draw a histogram of the table from question 1b:
qplot(x = dbh, data = qa, breaks = seq(from = 8, to = 64, by = 4))
(don’t forget to load ggplot2!).
The goal of building a histogram is to approximate the distribution of the variable in the
population. To do this, we must "focus" or "tune" the number of bins so that we can see
the overall shape of the distribution while minimizing the effects of random noise due to
sampling error. If we use too few bins, then we “oversmooth” the distribution and make
it difficult to see any patterns. On the other hand, using too many bins “undersmooths”
the plot and mainly visualizes random noise in our particular dataset.
With this in mind, please use qplot() to generate a series of 12 histograms with 3-14
bins, and then select one of the graphs and present it here (see Guidelines for making
graphs). Please include the number of bins in your caption, and briefly explain why you
chose this plot.
d. [2 points] You can use qplot() to generate a boxplot instead of a histogram using syntax
similar to this:
qplot(y = myVariable, data = myData, geom = “boxplot”)
(see Tutorial 4 for additional information on plotting). Please present a boxplot of the
quacking aspen dbh data, and briefly explain how to interpret this plot. What does each
part of the plot represent? Do you see any evidence of potential outliers? If not, what
would they look like if they were present? What is the skewness of this boxplot and why?
2
STAT 503 – Statistical Methods for Biologists
Modified: 2023-01-27
2. [Total: 12 points] Repeat the procedure that you used in question 1a to get the dbh data for
eastern hemlock. This should give you 13 values ranging from 10.160 – 32.258 cm.
Please use these data to calculate the following summary statistics. Then, for each
statistic, briefly explain its interpretation.
You should do these calculations by hand (calculators are okay) and show your work (you
need to show the symbolic equations that you are applying or to explain how you are getting
the answer for the mean and median).
Explanations should be brief and to the point (1-2 sentences), but they do need to give more
detail than just “it represents location” or “spread.” What, specifically, does the statistic
represent?
a. [1 point] Please list your 1 3 values below:
b. [2 points] Mean
c. [2 points] Median
d. [3 points] Variance and standard deviation
e. [4 points] Interquartile range
3. *[Total: 4 points] In this exercise, you will use R to generate a table that summarizes the dbh
data for a selection of the tree species in the Dalgleish dataset. Your table should include 5
rows of results (one for each species that is included in the analysis). Tutorial 3 discusses
summary statistics. You may export the final table using the procedures in Tutorial 3
Section 4, or you may copy the values into your homework manually.
Please include the following statistics in your table, in this order: sample size, mean, standard
deviation, coefficient of variation, the 25% quantile, median, the 75% quantile, and IQR. See
the select() function in Tutorial 2 for instructions on how to reorder columns.
a. As a first step, you will need to use filter() to clean the data. This is often necessary
when we want to work with a subset of a larger dataset.
There are two ways to define a search criterion that can be met in more than one way.
Either use the OR operator (a vertical line: | ) , or the %in% operator, as shown below:
3
STAT 503 – Statistical Methods for Biologists
Modified: 2023-01-27
## GOAL: identify vowels in the lowercase letters
# letters is a character vector showing the lower-case Latin
# alphabet (it is preprogrammed into R; you don’t need to make it)
letters
# first element is TRUE; all others are FALSE
letters == “a”
# TRUE for all vowels
letters == “a” | letters == “e” | letters == “i” |
letters == “o” | letters == “u”
# also TRUE for all vowels
letters %in% c(“a”, ”e”, ”i”, ”o”, ”u”)
(i) Use filter() to isolate the data for eastern hemlock, northern red oak, American
chestnut (this is not capitalized in the data!), sugar maple, and balsam fir.
(ii) These data are focused on American chestnut, so it includes data for all of the
chestnut trees that we encountered. Other species were only recorded if they had
dbh ≥ 5 cm. The ensure a fair comparison, use filter() to limit the data to only
those trees with dbh ≥ 5 cm. This can be done using a second filter() statement,
or it can be combined with step 3a(i).
[there is nothing to report here]
b. (OPTIONAL) Technically, dplyr produces a special type of data frame called a “tibble.”
Tibbles and “regular” data frames are mostly the same, but tibbles can behave oddly
when printed. To ensure that your summary table prints with the correct number of
decimal places, you can modify the following code (note the use of the pipe – see
Tutorial 2, Section 3.4):
myTable %>%
# a summary table or other tibble
as.data.frame() %>%
# convert the tibble to a regular data frame
mutate_if(is.numeric, round, 3)
# see below
The mutate_if() function looks at each column, checks to see if it meets a criterion
(here, is the column numeric?), and then applies a function to the column if the answer is
TRUE. Here, we check to see if the column is numeric, and if it is, we round the value to
three decimal places. See the help file for mutate_if() for details.
[there is nothing to report here]
c. [4 points] Generate your summary table and present it here. Note that you can use the
results here to double-check your answers in question 3 (the result that you get for IQR
by hand may differ from the result in R).
Table 2: Descriptive caption required here
species
n
Mean
SD
American chestnut
balsam fir
eastern hemlock
4
CV
25%
50%
75%
IQR
STAT 503 – Statistical Methods for Biologists
species
n
Mean
Modified: 2023-01-27
SD
CV
25%
50%
75%
IQR
northern red oak
sugar maple
4. [Total: 4 points] Each of the following graphs illustrates an association between two
variables. For each graph, identify (1) the type of graph, (2) the explanatory and
response variables, and (3) the type of data (whether numerical or categorical) for each
variable.
a. [2 points, 0.5 point each] Taste sensitivity to phenylthiocarbamide (PTC) in a
sample of human subjects grouped according to their genotype at the PTC gene—
namely, AA, Aa, or aa (Kim et al. 2003):
b. [2 points, 0.5 point each] The frequency of injection-heroin users who share or do
not share needles according to their known HIV infection status (Wood et al.
2001):
5
STAT 503 – Statistical Methods for Biologists
Modified: 2023-01-27
5. [1 point] Please paste your script here. Remember that your script should be an edited, clean
document, and should contain comments that mark the code for each question in the
assignment.
6
Download