2.2 Descriptive statistics in R

This lecture shows how to calculate various descriptive statistics using R.
The data set Nile, included with base R, gives measurements of the annual flow of the
river Nile at Aswan for 1871–1970 (in units of 10^8 cubic metres).
hist(Nile)
mean(Nile)
sd(Nile)
var(Nile)
> mean(Nile)
[1] 919.35
> sd(Nile)
[1] 169.2275
> var(Nile)
[1] 28637.95
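The three statistics above are closely related, and the relationship can be checked directly. The sketch below recomputes the sample variance from its definition (sum of squared deviations divided by n - 1) and confirms that sd() is the square root of var(). The names n, dev and var.by.hand are illustrative choices, not part of the original lecture.

```r
# Recompute the sample variance of Nile from its definition.
n   <- length(Nile)                    # number of annual observations
dev <- Nile - mean(Nile)               # deviations from the sample mean
var.by.hand <- sum(dev^2) / (n - 1)    # divide by n - 1, not n

all.equal(var.by.hand, var(Nile))      # TRUE
all.equal(sqrt(var(Nile)), sd(Nile))   # TRUE
```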
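Notice that the variance and standard deviation can be calculated for any numeric data
set; the data do not have to be normally distributed. We'll illustrate with the data set
faithful, also included with base R, whose waiting times have a clearly bimodal histogram.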
?faithful
Description
Waiting time between eruptions and the duration of the eruption
for the Old Faithful geyser in Yellowstone National Park,
Wyoming, USA.
…
Format
A data frame with 272 observations on 2 variables.
[,1]  eruptions  numeric  Eruption time in mins
[,2]  waiting    numeric  Waiting time to next eruption (in mins)
hist(faithful[,2])   # equivalently: hist(faithful[,"waiting"])
> mean(faithful[,"waiting"])
[1] 70.89706
> sd(faithful[,"waiting"])
[1] 13.59497
> var(faithful[,"waiting"])
[1] 184.8233
Sample standard deviation
Do the following calculations by hand, and then confirm the results using R.
x1 = c(-1, 0 , 1)
sd(x1) # Should be sd = 1
x2 = c(-2, -2, 0 , 2, 2)
sd(x2) # Should be sd = 2
x3 = c(99, 100 , 101)
sd(x3) # Should be sd = 1
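The hand calculation follows the definition of the sample standard deviation: subtract the mean, square the deviations, sum them, divide by n - 1, and take the square root. A minimal sketch for x2 is shown below; the intermediate names m, ss and s are illustrative only.

```r
x2 <- c(-2, -2, 0, 2, 2)

m  <- mean(x2)                       # 0
ss <- sum((x2 - m)^2)                # 4 + 4 + 0 + 4 + 4 = 16
s  <- sqrt(ss / (length(x2) - 1))    # sqrt(16 / 4) = 2

s
sd(x2)                               # agrees with the hand calculation

# x3 is x1 shifted by 100; adding a constant does not change the spread.
sd(c(-1, 0, 1) + 100)                # 1, the same as sd(c(-1, 0, 1))
```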
Standard error of the mean
Recall the equation SEM = S/sqrt(n), where S is the sample standard deviation and n is
the sample size.
Example 1.
x1 = c(-1, 0 , 1)
mean(x1)
sd(x1) # Should be 1
> mean(x1)
[1] 0
> sd(x1) # Should be 1
[1] 1
sem.x1=sd(x1)/sqrt(3)
sem.x1
[1] 0.5773503
Here's a shorter version of the calculation for Example 1.
sd(c(-1,0,1))/sqrt(3)
[1] 0.5773503
Example 2. A sample of n=9 observations has sample standard deviation S=6. Calculate
the standard error of the mean.
SEM = S/sqrt(n) = 6/sqrt(9) = 2.
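Base R has no built-in function for the standard error of the mean, so it can be convenient to wrap the formula in a small helper. The function name sem below is a hypothetical choice made for this sketch; it simply applies SEM = S/sqrt(n) to a numeric vector, and the last lines redo Example 2 from its summary values.

```r
# Hypothetical helper: standard error of the mean of a numeric vector.
sem <- function(x) {
  sd(x) / sqrt(length(x))
}

sem(c(-1, 0, 1))          # Example 1: 1/sqrt(3)

# Example 2 gives only S and n, so apply the formula directly.
S <- 6
n <- 9
sem.ex2 <- S / sqrt(n)    # 6/3 = 2
sem.ex2
```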
Standardized variables: z-scores
In R, the function scale() calculates the z-scores.
Calculate the z-scores for x2 by hand, and then confirm the result using R.
x2 = c(-2, -2, 0 , 2, 2)
sd(x2) # Recall sd = 2
scale(x2)
> scale(x2)
[,1]
[1,] -1
[2,] -1
[3,] 0
[4,] 1
[5,] 1
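The by-hand calculation subtracts the sample mean from each value and divides by the sample standard deviation. A short sketch of that check follows; the name z.by.hand is illustrative. Because scale() returns a one-column matrix with extra attributes, as.vector() is used to compare plain numbers.

```r
x2 <- c(-2, -2, 0, 2, 2)

# z = (x - mean) / sd; here mean(x2) is 0 and sd(x2) is 2.
z.by.hand <- (x2 - mean(x2)) / sd(x2)
z.by.hand                                    # -1 -1  0  1  1

# as.vector() drops the matrix structure and attributes added by scale().
all.equal(z.by.hand, as.vector(scale(x2)))   # TRUE
```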
Here's another example.
> x=1:10
> sd(x)
[1] 3.027650
> scale(x)
[,1]
[1,] -1.4863011
[2,] -1.1560120
[3,] -0.8257228
[4,] -0.4954337
[5,] -0.1651446
[6,] 0.1651446
[7,] 0.4954337
[8,] 0.8257228
[9,] 1.1560120
[10,] 1.4863011
attr(,"scaled:center")
[1] 5.5
attr(,"scaled:scale")
[1] 3.027650
> y=scale(x)
> plot(x,y)
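Two properties of z-scores are worth confirming on this example: the standardized values have mean 0 and standard deviation 1, and they are a linear function of the original values, which is why the plot shows points on a straight line. The sketch below checks both; the name z is an illustrative choice.

```r
x <- 1:10
z <- as.vector(scale(x))   # drop the matrix structure and attributes

mean(z)                    # zero, up to floating-point rounding
sd(z)                      # 1

# z is a linear function of x with slope 1/sd(x),
# so the correlation between x and z is 1.
cor(x, z)
```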
The summary() function
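The summary() function gives summary descriptive statistics. For a numeric vector these
are the minimum, first quartile, median, mean, third quartile, and maximum.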
x=1:10
summary(x)
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.25 5.50 5.50 7.75 10.00
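The quartiles reported by summary() for a numeric vector can be reproduced with quantile() using its default settings, together with mean(). The sketch below shows the correspondence for x = 1:10; the name q is illustrative.

```r
x <- 1:10

q <- quantile(x)   # 0%, 25%, 50%, 75%, 100% with the default method
q                  # 1.00 3.25 5.50 7.75 10.00
mean(x)            # 5.5, the one value in summary(x) that is not a quantile
```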
x=(0:200)
summary(x)
> x=(0:200)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0      50     100     100     150     200
For categorical data, the summary function gives counts of the number of observations in
each category.
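Categorical data in R are normally stored as factors. As a minimal sketch, the small factor below is made-up data used only for illustration (it is not taken from any data set in this lecture); summary() and table() both return the count for each level.

```r
# Hypothetical categorical variable with three levels.
method <- factor(c("A", "B", "C", "A", "B", "C"))

summary(method)   # counts per level: A 2, B 2, C 2
table(method)     # the same counts, as a table
```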
Let's look at the lung data set from the ISwR package. The lung data frame has 18 rows
and 3 columns. It contains data on three different methods of determining human lung
volume.
library("ISwR")
> lung
   volume method subject
1     3.3      A       1
2     3.1      B       1
3     4.0      C       1
4     2.5      A       2
5     2.6      B       2
6     2.8      C       2
7     3.1      A       3
8     3.5      B       3
9     4.1      C       3
10    3.0      A       4
11    3.7      B       4
12    3.5      C       4
13    2.8      A       5
14    3.6      B       5
15    3.9      C       5
16    2.9      A       6
17    2.8      B       6
18    2.9      C       6
> summary(lung)
     volume      method subject
 Min.   :2.500   A:6    1:3
 1st Qu.:2.825   B:6    2:3
 Median :3.100   C:6    3:3
 Mean   :3.228          4:3
 3rd Qu.:3.575          5:3
 Max.   :4.100          6:3
Log transform to produce a more Normal distribution in a data set
# Generate the data
values=c(rep(1,6), rep(2,11), rep(3,14), rep(4,13), rep(5,12), rep(6,10),
         rep(7,8), rep(8,6), rep(9,4), rep(10,3), rep(11,4), rep(12,2),
         rep(13,1), rep(14,2), rep(17,2), rep(25,1))
values
# Plot the values and the log transformed values
par(mfrow=c(1,2))
hist(values, breaks=c(0:30))
hist(log(values), breaks=10)
par(mfrow=c(1,1))
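The histograms give a visual impression of the effect of the log transform. One way to put a number on it, offered here only as a sketch, is a moment-based skewness coefficient, which is 0 for a perfectly symmetric sample. The function skewness below is a hypothetical helper written for this example (base R does not provide one), and values is rebuilt in a more compact but equivalent form so that the block runs on its own.

```r
# Same data as above, built with a vector of repeat counts.
values <- rep(c(1:14, 17, 25),
              times = c(6, 11, 14, 13, 12, 10, 8, 6, 4, 3, 4, 2, 1, 2, 2, 1))

# Hypothetical helper: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skewness(values)        # clearly positive: long right tail
skewness(log(values))   # much closer to zero after the log transform
```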