STAT 503 – Statistical Methods for Biologists Modified: 2023-01-27 NAME: STAT 503 – Statistical Methods for Biology Homework 2 30 Points Due at 11:59 PM on Friday, February 3, 2023 Please use complete sentences unless the question is marked with an asterisk (*). Please download the Dalgleish et al. tree data from the Brightspace and save it to your hard drive (the file is titled dalgleish_et_al_trees.csv). This dataset describes the species and diameter-at-beast-height (dbh, about 1.4 m above the ground, measured in cm) of 5681 individual trees at a site in Maine. It also contains age, reproductive data, and disease status for American chestnut (Castanea dentata) trees at this site, as well as map coordinates for each tree. We will use these data to practice working with large datasets and calculating summary statistics. 1. [Total: 9 points] Follow the steps below to build and present a frequency distribution for dbh in quaking aspen. a. *[1 point] After you read in the data, use the filter() and arrange() functions from the dplyr package to create a new data frame that contains only data for quaking aspen trees and has the rows arranged in order of increasing dbh (i.e., the first row has the smallest tree, the last has the largest). Tutorial 2 explains how to run these operations, but if you have difficulty, post a question on Brightspace Discussion Forums. Once you have sorted the data, report the last five values here. b. *[4 points] Use the 42 dbh values that you obtained in question 1a to build a frequency table (with 9 intervals) for dbh in quaking aspen at this site. Please begin your table at dbh ≤ 10 cm, use 4 cm bins, and include columns for both absolute and relative frequency. Please do not use R for this question. See the Guidelines for reporting statistical results on the Brightspace Cheat sheets tab for table formatting guidelines, and round decimals to 3 decimal places. Table 1: Descriptive caption required here, and you need to create 9 intervals in total DBH (cm) Frequency ≤ 10 1 Relative frequency STAT 503 – Statistical Methods for Biologists Modified: 2023-01-27 > 38 Total c. *[2 points] In Tutorial 1, you learned how to control the number of bins in a histogram. You can also control the specific breaks points that divide the bins. For instance, if I have saved my quaking aspen data to a data frame called qa, then the following statement will draw a histogram of the table from question 1b: qplot(x = dbh, data = qa, breaks = seq(from = 8, to = 64, by = 4)) (don’t forget to load ggplot2!). The goal of building a histogram is to approximate the distribution of the variable in the population. To do this, we must "focus" or "tune" the number of bins so that we can see the overall shape of the distribution while minimizing the effects of random noise due to sampling error. If we use too few bins, then we “oversmooth” the distribution and make it difficult to see any patterns. On the other hand, using too many bins “undersmooths” the plot and mainly visualizes random noise in our particular dataset. With this in mind, please use qplot() to generate a series of 12 histograms with 3-14 bins, and then select one of the graphs and present it here (see Guidelines for making graphs). Please include the number of bins in your caption, and briefly explain why you chose this plot. d. [2 points] You can use qplot() to generate a boxplot instead of a histogram using syntax similar to this: qplot(y = myVariable, data = myData, geom = “boxplot”) (see Tutorial 4 for additional information on plotting). Please present a boxplot of the quacking aspen dbh data, and briefly explain how to interpret this plot. What does each part of the plot represent? Do you see any evidence of potential outliers? If not, what would they look like if they were present? What is the skewness of this boxplot and why? 2 STAT 503 – Statistical Methods for Biologists Modified: 2023-01-27 2. [Total: 12 points] Repeat the procedure that you used in question 1a to get the dbh data for eastern hemlock. This should give you 13 values ranging from 10.160 – 32.258 cm. Please use these data to calculate the following summary statistics. Then, for each statistic, briefly explain its interpretation. You should do these calculations by hand (calculators are okay) and show your work (you need to show the symbolic equations that you are applying or to explain how you are getting the answer for the mean and median). Explanations should be brief and to the point (1-2 sentences), but they do need to give more detail than just “it represents location” or “spread.” What, specifically, does the statistic represent? a. [1 point] Please list your 1 3 values below: b. [2 points] Mean c. [2 points] Median d. [3 points] Variance and standard deviation e. [4 points] Interquartile range 3. *[Total: 4 points] In this exercise, you will use R to generate a table that summarizes the dbh data for a selection of the tree species in the Dalgleish dataset. Your table should include 5 rows of results (one for each species that is included in the analysis). Tutorial 3 discusses summary statistics. You may export the final table using the procedures in Tutorial 3 Section 4, or you may copy the values into your homework manually. Please include the following statistics in your table, in this order: sample size, mean, standard deviation, coefficient of variation, the 25% quantile, median, the 75% quantile, and IQR. See the select() function in Tutorial 2 for instructions on how to reorder columns. a. As a first step, you will need to use filter() to clean the data. This is often necessary when we want to work with a subset of a larger dataset. There are two ways to define a search criterion that can be met in more than one way. Either use the OR operator (a vertical line: | ) , or the %in% operator, as shown below: 3 STAT 503 – Statistical Methods for Biologists Modified: 2023-01-27 ## GOAL: identify vowels in the lowercase letters # letters is a character vector showing the lower-case Latin # alphabet (it is preprogrammed into R; you don’t need to make it) letters # first element is TRUE; all others are FALSE letters == “a” # TRUE for all vowels letters == “a” | letters == “e” | letters == “i” | letters == “o” | letters == “u” # also TRUE for all vowels letters %in% c(“a”, ”e”, ”i”, ”o”, ”u”) (i) Use filter() to isolate the data for eastern hemlock, northern red oak, American chestnut (this is not capitalized in the data!), sugar maple, and balsam fir. (ii) These data are focused on American chestnut, so it includes data for all of the chestnut trees that we encountered. Other species were only recorded if they had dbh ≥ 5 cm. The ensure a fair comparison, use filter() to limit the data to only those trees with dbh ≥ 5 cm. This can be done using a second filter() statement, or it can be combined with step 3a(i). [there is nothing to report here] b. (OPTIONAL) Technically, dplyr produces a special type of data frame called a “tibble.” Tibbles and “regular” data frames are mostly the same, but tibbles can behave oddly when printed. To ensure that your summary table prints with the correct number of decimal places, you can modify the following code (note the use of the pipe – see Tutorial 2, Section 3.4): myTable %>% # a summary table or other tibble as.data.frame() %>% # convert the tibble to a regular data frame mutate_if(is.numeric, round, 3) # see below The mutate_if() function looks at each column, checks to see if it meets a criterion (here, is the column numeric?), and then applies a function to the column if the answer is TRUE. Here, we check to see if the column is numeric, and if it is, we round the value to three decimal places. See the help file for mutate_if() for details. [there is nothing to report here] c. [4 points] Generate your summary table and present it here. Note that you can use the results here to double-check your answers in question 3 (the result that you get for IQR by hand may differ from the result in R). Table 2: Descriptive caption required here species n Mean SD American chestnut balsam fir eastern hemlock 4 CV 25% 50% 75% IQR STAT 503 – Statistical Methods for Biologists species n Mean Modified: 2023-01-27 SD CV 25% 50% 75% IQR northern red oak sugar maple 4. [Total: 4 points] Each of the following graphs illustrates an association between two variables. For each graph, identify (1) the type of graph, (2) the explanatory and response variables, and (3) the type of data (whether numerical or categorical) for each variable. a. [2 points, 0.5 point each] Taste sensitivity to phenylthiocarbamide (PTC) in a sample of human subjects grouped according to their genotype at the PTC gene— namely, AA, Aa, or aa (Kim et al. 2003): b. [2 points, 0.5 point each] The frequency of injection-heroin users who share or do not share needles according to their known HIV infection status (Wood et al. 2001): 5 STAT 503 – Statistical Methods for Biologists Modified: 2023-01-27 5. [1 point] Please paste your script here. Remember that your script should be an edited, clean document, and should contain comments that mark the code for each question in the assignment. 6