Descriptive statistics in R

This lecture shows how to calculate various descriptive statistics using R.

The data set Nile in base R gives measurements of the annual flow of the river Nile at Aswan, 1871–1970.

hist(Nile)
mean(Nile)
sd(Nile)
var(Nile)

> mean(Nile)
[1] 919.35
> sd(Nile)
[1] 169.2275
> var(Nile)
[1] 28637.95

Notice that the variance and standard deviation can be calculated for any numeric data set; the data do not need to be normally distributed. We'll illustrate with the data set faithful in base R, whose waiting times have a clearly bimodal histogram.

?faithful

Description
Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
…
Format
A data frame with 272 observations on 2 variables.

[,1]  eruptions  numeric  Eruption time in mins
[,2]  waiting    numeric  Waiting time to next eruption (in mins)

hist(faithful[,2])
or
hist(faithful[,"waiting"])

> mean(faithful[,"waiting"])
[1] 70.89706
> sd(faithful[,"waiting"])
[1] 13.59497
> var(faithful[,"waiting"])
[1] 184.8233

Sample standard deviation

Do the following calculations by hand, and then confirm the results using R.

x1 = c(-1, 0, 1)
sd(x1) # Should be sd = 1

x2 = c(-2, -2, 0, 2, 2)
sd(x2) # Should be sd = 2

x3 = c(99, 100, 101)
sd(x3) # Should be sd = 1

Standard error of the mean

Recall the equation

SEM = S/sqrt(n)

where S is the sample standard deviation and n is the number of observations.

Example 1.

x1 = c(-1, 0, 1)
mean(x1)
sd(x1) # Should be 1

> mean(x1)
[1] 0
> sd(x1) # Should be 1
[1] 1

sem.x1 = sd(x1)/sqrt(3)
sem.x1
[1] 0.5773503

Here's a shorter version of the calculation for Example 1. Note that the division by sqrt(3) goes outside the call to sd().

sd(c(-1,0,1))/sqrt(3)
[1] 0.5773503

Example 2. A sample of n=9 observations has sample standard deviation S=6. Calculate the standard error of the mean.

SEM = S/sqrt(n) = 6/sqrt(9) = 2.

Standardized variables: z-scores

In R, the function scale() calculates the z-scores: each value minus the sample mean, divided by the sample standard deviation. Calculate the z-scores for x2 by hand, and then confirm the result using R.

x2 = c(-2, -2, 0, 2, 2)
sd(x2) # Recall sd = 2
scale(x2)

> scale(x2)
     [,1]
[1,]   -1
[2,]   -1
[3,]    0
[4,]    1
[5,]    1

Here's another example.
> x=1:10
> sd(x)
[1] 3.027650
> scale(x)
            [,1]
 [1,] -1.4863011
 [2,] -1.1560120
 [3,] -0.8257228
 [4,] -0.4954337
 [5,] -0.1651446
 [6,]  0.1651446
 [7,]  0.4954337
 [8,]  0.8257228
 [9,]  1.1560120
[10,]  1.4863011
attr(,"scaled:center")
[1] 5.5
attr(,"scaled:scale")
[1] 3.027650
> y=scale(x)
> plot(x,y)

The summary() function

The summary() function gives summary descriptive statistics.

x=1:10
summary(x)

> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    3.25    5.50    5.50    7.75   10.00

x=(0:200)
summary(x)

> x=(0:200)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0      50     100     100     150     200

For categorical data, the summary function gives counts of the number of observations in each category. Let's look at the lung data set. The lung data frame has 18 rows and 3 columns. It contains data on three different methods of determining human lung volume.

library("ISwR")

> lung
   volume method subject
1     3.3      A       1
2     3.1      B       1
3     4.0      C       1
4     2.5      A       2
5     2.6      B       2
6     2.8      C       2
7     3.1      A       3
8     3.5      B       3
9     4.1      C       3
10    3.0      A       4
11    3.7      B       4
12    3.5      C       4
13    2.8      A       5
14    3.6      B       5
15    3.9      C       5
16    2.9      A       6
17    2.8      B       6
18    2.9      C       6

> summary(lung)
     volume      method subject
 Min.   :2.500   A:6    1:3
 1st Qu.:2.825   B:6    2:3
 Median :3.100   C:6    3:3
 Mean   :3.228          4:3
 3rd Qu.:3.575          5:3
 Max.   :4.100          6:3

Log transform to produce a more Normal distribution in a data set

# Generate the data
values=c(rep(1,6),rep(2,11),rep(3,14),rep(4,13),rep(5,12),rep(6,10),rep(7,8),rep(8,6),
         rep(9,4),rep(10,3),rep(11,4),rep(12,2),rep(13,1),rep(14,2),rep(17,2),rep(25,1))
values

# Plot the values and the log transformed values
par(mfrow=c(1,2))
hist(values, breaks=c(0:30))
hist(log(values), breaks=10)
par(mfrow=c(1,1))
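
The two histograms can also be compared numerically. The sketch below is one possible check, not part of the original lecture: it rebuilds the same data in a compact form and computes a moment-based sample skewness before and after the log transform. The helper skewness() and the names vals, counts, skew_raw and skew_log are hypothetical, introduced only for this illustration; base R has no built-in skewness function.

```r
# Same data as above, written compactly:
# counts[i] is the number of times vals[i] occurs
vals   <- c(1:14, 17, 25)
counts <- c(6, 11, 14, 13, 12, 10, 8, 6, 4, 3, 4, 2, 1, 2, 2, 1)
values <- rep(vals, times = counts)

# Hypothetical helper (not a base R function):
# moment-based sample skewness, m3 / m2^(3/2)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(values)       # skewness of the original values
skew_log <- skewness(log(values))  # skewness after the log transform

skew_raw
skew_log
```

A skewness near zero indicates a roughly symmetric distribution. For these data the raw values are strongly right-skewed, and the log-transformed values should come out much closer to zero, which agrees with what the histograms show.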