Lecture 2 - Tamu.edu

Summarizing Data
Numeric Methods
Rcmdr
• Features for loading, viewing and
analyzing data
• Help system
• Packages
Data in R
• Several formats: vectors, arrays,
matrices, lists, data.frames
• Generally we use data.frames as
they have the advantage of letting us
store different kinds of data and
linking them by row.
• Rcmdr uses data.frames
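A minimal sketch of the formats listed above. The three Length values come from the head(DartPoints) output in this lecture; the object names and the Whole column are made up for illustration.

```r
# Vector: a single type only
len <- c(34.5, 36.0, 32.4)

# Matrix: two dimensions, still a single type
m <- matrix(1:6, nrow = 2)

# List: components may differ in type and in length
lst <- list(name = "Darl", sizes = len)

# Data frame: columns may differ in type, but every column has the same rows
pts <- data.frame(Name   = factor(c("Darl", "Darl", "Darl")),
                  Length = len,
                  Whole  = c(TRUE, TRUE, FALSE))   # hypothetical column
str(pts)
```

Because each row of pts ties a name, a measurement, and a logical flag to the same specimen, the data frame is usually the most convenient of the four.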
Referencing Data.frames
• R allows you to refer to rows,
columns and individual cells in a
data frame in multiple ways
• Every cell has a row and column
number that identifies it (like Excel)
• Every cell is the intersection of a
named row and a named column
Types of Data in R
• Numeric – integer and decimal
• Categorical – factors
• Ranks – ordered factor
• Logical (True/False)
• Character
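As a rough illustration of these types, the sketch below builds one small vector of each kind and asks R for its class. The values are borrowed from the head(DartPoints) output where possible; the size ranks and logical values are invented.

```r
num <- c(34.5, 36.0, 32.4)                        # numeric, decimal
int <- c(62L, 58L, 62L)                           # numeric, integer
grp <- factor(c("Darl", "Pedernales", "Darl"))    # categorical (factor)
rnk <- factor(c("small", "large", "medium"),      # ranks (ordered factor)
              levels = c("small", "medium", "large"),
              ordered = TRUE)                     # hypothetical size classes
lgl <- c(TRUE, FALSE, TRUE)                       # logical
chr <- c("41CV0270", "41CV1023", "41CV0495")      # character

class(num); class(int); class(grp); class(rnk); class(lgl); class(chr)

# Comparisons such as < are only meaningful for ordered factors
rnk[1] < rnk[2]
```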
Data sets
• Darl and Pedernales points from
Fort Hood archaeological surveys
• Data on village and house sizes
among California Indian tribes from
an article by Sherburne Cook and
Robert Heizer
[Figure: Darl and Pedernales points]
> head(DartPoints)
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
36-3321 Darl 41CV1023 12/58   58    12   36.0  17.1   4.0
36-3520 Darl 41CV0495 16/62   62    16   32.4  14.5   5.2
35-2382 Darl 41CV0611 22/62   62    22   31.2  15.6   5.1
40-0847 Darl 41CV1287 05/48   48     5   33.6  15.8   5.1
35-2959 Darl 41CV0235 21/63   63    21   41.8  16.8   4.1
> tail(DartPoints)
              Name     TARL  QUAD East North Length Width Thick
38-0098 Pedernales 41BL0416 39/45   45    39   74.0  34.0   6.6
35-2951 Pedernales 41CV0235 21/63   63    21   64.5  28.5   8.2
35-0173 Pedernales 41CV0869 22/66   66    22   78.3  28.1   8.5
36-4266 Pedernales 41CV0240 15/65   65    15   64.1  27.2  12.0
41-0239 Pedernales 41CV0493 16/62   62    16   67.2  27.1  12.0
35-2855 Pedernales 41CV0843 24/65   65    24   49.3  19.5   7.5
> str(DartPoints)
'data.frame':   55 obs. of  8 variables:
 $ Name  : Factor w/ 2 levels "Darl","Pedernales": 1 1 1 1 1 1 1 1 1 1 ...
 $ TARL  : Factor w/ 43 levels "41BL0183","41BL0205",..: 13 34 18 21 ...
 $ QUAD  : Factor w/ 38 levels "05/48","08/63",..: 30 9 17 28 1 26 15 ...
 $ East  : num  62 58 62 62 48 63 33 63 59 63 ...
 $ North : num  24 12 16 22 5 21 16 17 26 20 ...
 $ Length: num  34.5 36 32.4 31.2 33.6 41.8 33.5 32 42.8 37.5 ...
 $ Width : num  15.9 17.1 14.5 15.6 15.8 16.8 16.6 16 15.8 16.3 ...
 $ Thick : num  4.8 4 5.2 5.1 5.1 4.1 4.9 5.4 5.8 6.1 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:2] 39 53
  .. ..- attr(*, "names")= chr [1:2] "35-2650" "35-2384"
> attributes(DartPoints)
$names
[1] "Name"   "TARL"   "QUAD"   "East"   "North"  "Length" "Width"
[8] "Thick"
$row.names
 [1] "35-3026"  "36-3321"  "36-3520"  "35-2382"  "40-0847"  "35-2959"
 [7] "41-0257"  "36-3619"  "41-0322"  "35-2921"  "36-3036"  "35-2905"
[13] "35-2866"  "36-3487"  "36-4247"  "35-2928"  "35-2871"  "36-3898"
[19] "35-2946"  "38-0736"  "35-2325"  "35-0164"  "41-0323"  "35-3043"
[25] "35-2004"  "35-2960"  "41-0237"  "44-0643"  "43-0110"  "36-3549"
[31] "41-0008"  "36-4320"  "44-1315M" "35-2901"  "41-0220"  "35-2873"
[37] "47-0041"  "36-3879"  "41-0054"  "50-0092"  "44-1492M" "36-3880"
[43] "35-2875"  "36-3081"  "36-3897"  "44-1253M" "36-3229"  "41-0058"
[49] "35-2391"  "38-0098"  "35-2951"  "35-0173"  "36-4266"  "41-0239"
[55] "35-2855"
$class
[1] "data.frame"
> DartPoints[1,]
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
> DartPoints["35-3026",]
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
> DartPoints[,6]
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints[,"Length"]
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints$Length
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints[1,6]
[1] 34.5
> head(CAIndians)
  Region   Tribe   Language AreaHouse FamilySize FpHouse PpHouse AreapPer
1      1   Yurok   Algonkin       439        7.5       1     7.5     58.5
2      2   Wiyot   Algonkin       254        7.5       1     7.5     33.8
3      3   Karok      Hokan        NA        7.5       1     7.5       NA
4      4    Hupa Athabaskan       400        7.0       1     7.0     57.1
5      5 Chilula Athabaskan        NA        7.5       1     7.5       NA
6      6  Shasta      Hokan       264        7.0       1     7.0     33.0
  HpVillage PpVillage AreapVil AreaVillage VpHouse VpPer PctFloor
1       7.8        60     3434       25450    3263   424     13.5
2       7.6        57     1930       28400    3738   498      6.8
3       4.1        31       NA          NA      NA    NA       NA
4      10.9        76     4360          NA      NA    NA       NA
5       7.0        52       NA          NA      NA    NA       NA
6       6.0        48     1584       18950    3158   394      8.4
> str(CAIndians)
'data.frame':   30 obs. of  15 variables:
 $ Region     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Tribe      : Factor w/ 30 levels "Achomawi","Athabascans",..: 30 27 8 7 5...
 $ Language   : Factor w/ 6 levels "Algonkin","Athabaskan",..: 1 1 3 2 2 3 3 ...
 $ AreaHouse  : int  439 254 NA 400 NA 264 110 118 100 125 ...
 $ FamilySize : num  7.5 7.5 7.5 7 7.5 7 6 6 6 6 ...
 $ FpHouse    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ PpHouse    : num  7.5 7.5 7.5 7 7.5 7 6 6 6 6 ...
 $ AreapPer   : num  58.5 33.8 NA 57.1 NA 33 18.3 19.6 16.7 20.8 ...
 $ HpVillage  : num  7.8 7.6 4.1 10.9 7 6 5.3 5.4 3.6 5 ...
 $ PpVillage  : int  60 57 31 76 52 48 32 32 22 30 ...
 $ AreapVil   : int  3434 1930 NA 4360 NA 1584 583 637 360 625 ...
 $ AreaVillage: int  25450 28400 NA NA NA 18950 14000 27100 61500 6390 ...
 $ VpHouse    : int  3263 3738 NA NA NA 3158 2641 5019 17084 1278 ...
 $ VpPer      : int  424 498 NA NA NA 394 438 847 2795 214 ...
 $ PctFloor   : num  13.5 6.8 NA NA NA 8.4 4.2 2.4 0.6 9.8 ...
Central Tendency
• Mean (Average) = Sum/Number
– For dichotomous (0/1) data, the mean is
the proportion present
• Median = Middle value
• Mode = Predominant value
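The three measures can be sketched on a few values. The lengths are the first six from head(DartPoints); the presence/absence vector, the type vector, and the helper name stat_mode are made up for illustration. Base R has no function for the statistical mode (mode() reports how an object is stored), so the sketch tabulates the values instead.

```r
len     <- c(34.5, 36.0, 32.4, 31.2, 33.6, 41.8)    # from head(DartPoints)
present <- c(1, 0, 1, 1, 0, 1)                      # hypothetical 0/1 data
type    <- c("Darl", "Darl", "Pedernales", "Darl")  # hypothetical categories

mean(len)        # sum divided by the number of values
median(len)      # middle value (average of the two middle values here)
mean(present)    # proportion present; multiply by 100 for a percentage

# Statistical mode: tabulate, then take the most frequent value
# (the first in sorted order if there are ties)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}
stat_mode(type)
```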
> mean(DartPoints$Length)
[1] 48.64
> median(DartPoints$Length)
[1] 47.1
> mean(CAIndians$AreaHouse)
[1] NA
> mean(CAIndians$AreaHouse, na.rm=TRUE)
[1] 299.4815
> median(CAIndians$AreaHouse)
[1] NA
> median(CAIndians$AreaHouse, na.rm=TRUE)
[1] 129
> colMeans(DartPoints[,6:8])
   Length     Width     Thick
48.640000 22.052727  7.283636
> colMeans(DartPoints[DartPoints$Name=="Darl",6:8])
   Length     Width     Thick
40.574074 18.003704  5.981481
> colMeans(DartPoints[DartPoints$Name=="Pedernales",6:8])
   Length     Width     Thick
56.417857 25.957143  8.539286
Dispersion
• Range (max – min)
• Standard Deviation, Variance
(Sample vs. Population)
• Coefficient of Variation =
StDev/Mean * 100
• Quartiles and the Interquartile
Range
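A short sketch of the sample versus population distinction and of the coefficient of variation, using the first six Length values from head(DartPoints). The variable names are illustrative. R's var() and sd() use the sample formulas (dividing by n - 1), so the population versions have to be rescaled by hand.

```r
len <- c(34.5, 36.0, 32.4, 31.2, 33.6, 41.8)   # from head(DartPoints)
n   <- length(len)

samp_var <- var(len)                  # sample variance, divides by n - 1
pop_var  <- samp_var * (n - 1) / n    # population variance, divides by n
samp_sd  <- sd(len)
pop_sd   <- sqrt(pop_var)

cv <- samp_sd / mean(len) * 100       # coefficient of variation, in percent

q   <- quantile(len, c(0.25, 0.75))   # first and third quartiles
iqr <- unname(q[2] - q[1])            # same value as IQR(len)
```

The population standard deviation is always slightly smaller than the sample one, and the gap shrinks as n grows.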
> range(DartPoints$Length)
[1] 31.2 84.0
> diff(range(DartPoints$Length))
[1] 52.8
> sd(DartPoints$Length)
[1] 12.22144
> var(DartPoints$Length)
[1] 149.3636
> sd(DartPoints$Length)/mean(DartPoints$Length)*100
[1] 25.12631
> quantile(DartPoints$Length)
   0%   25%   50%   75%  100%
31.20 40.90 47.10 55.65 84.00
> IQR(DartPoints$Length)
[1] 14.75
> diff(range(CAIndians$AreaHouse, na.rm=TRUE))
[1] 1175
> sd(CAIndians$AreaHouse, na.rm=TRUE)
[1] 339.4273
> var(CAIndians$AreaHouse, na.rm=TRUE)
[1] 115210.9
> quantile(CAIndians$AreaHouse, na.rm=TRUE)
    0%    25%    50%    75%   100%
  75.0  110.5  129.0  310.0 1250.0
Shape
• Symmetry, Skewness
– Normal = 0, Positive or Negative
indicates tail in that direction
• Peaked vs Flat, Kurtosis
– Normal = 0, Positive – more clustered
(peaked) than normal, Negative – more
spread (flatter) than normal
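To make the sign conventions concrete, here is a base-R sketch of the classic moment-based skewness and excess kurtosis. The function name moment_shape is made up, and this is not necessarily the estimator a given package uses: e1071's skewness() and kurtosis(), for instance, take a type argument that selects among small-sample variants, so their results can differ slightly from these.

```r
# Moment-based skewness and excess kurtosis (both 0 for a normal distribution)
moment_shape <- function(x) {
  x  <- x[!is.na(x)]          # drop missing values
  d  <- x - mean(x)           # deviations from the mean
  m2 <- mean(d^2)             # second central moment
  m3 <- mean(d^3)             # third central moment
  m4 <- mean(d^4)             # fourth central moment
  c(skewness = m3 / m2^1.5,
    kurtosis = m4 / m2^2 - 3)
}

moment_shape(c(1, 2, 3, 4, 5))     # symmetric, so skewness is 0
moment_shape(c(1, 1, 1, 2, 10))    # long right tail, so skewness is positive
```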
> library(e1071)
Loading required package: class
> skewness(DartPoints$Length)
[1] 0.7749526
> kurtosis(DartPoints$Length)
[1] 0.12126
> skewness(CAIndians$AreaHouse, na.rm=TRUE)
[1] 1.708035
> kurtosis(CAIndians$AreaHouse, na.rm=TRUE)
[1] 1.498035
Descriptive Stats
• summary() – in base R
• numSummary() – in Rcmdr
• describe() – in psych
• describe() – in prettyR
• stat.desc() – pastecs
> summary(DartPoints)
         Name          TARL         QUAD         East           North
 Darl      :27   41CV0235: 4   21/63  : 4   Min.   :33.00   Min.   : 5.00
 Pedernales:28   41CV0859: 3   14/62  : 3   1st Qu.:55.00   1st Qu.:14.50
                 41CV1092: 3   16/62  : 3   Median :62.00   Median :20.00
                 41BL0205: 2   20/63  : 3   Mean   :58.24   Mean   :19.02
                 41CV0132: 2   22/66  : 3   3rd Qu.:63.50   3rd Qu.:23.00
                 41CV0493: 2   24/66  : 3   Max.   :70.00   Max.   :39.00
                 (Other) :39   (Other):36
     Length          Width           Thick
 Min.   :31.20   Min.   :14.50   Min.   : 4.000
 1st Qu.:40.90   1st Qu.:16.95   1st Qu.: 5.850
 Median :47.10   Median :22.00   Median : 7.200
 Mean   :48.64   Mean   :22.05   Mean   : 7.284
 3rd Qu.:55.65   3rd Qu.:26.95   3rd Qu.: 8.050
 Max.   :84.00   Max.   :34.00   Max.   :12.000
> numSummary(DartPoints[,6:8])
            mean        sd   0%   25%  50%   75% 100%  n
Length 48.640000 12.221438 31.2 40.90 47.1 55.65   84 55
Width  22.052727  5.194579 14.5 16.95 22.0 26.95   34 55
Thick   7.283636  1.891870  4.0  5.85  7.2  8.05   12 55
> library(psych)
> describe(DartPoints[,6:8])
       var  n  mean    sd median trimmed   mad  min max range skew kurtosis   se
Length   1 55 48.64 12.22   47.1   47.63 12.16 31.2  84  52.8 0.77     0.38 1.65
Width    2 55 22.05  5.19   22.0   21.85  7.41 14.5  34  19.5 0.24    -1.16 0.70
Thick    3 55  7.28  1.89    7.2    7.13  1.48  4.0  12   8.0 0.69     0.50 0.26
> detach(package:psych)
> library(prettyR)
> describe(DartPoints[,6:8])
Description of DartPoints[, 6:8]
Numeric
        Length Width Thick
mean     48.64 22.05 7.284
median   47.1  22    7.2
var     149.4  26.98 3.579
sd       12.22 5.195 1.892
valid.n  55    55    55
> library(pastecs)
> stat.desc(DartPoints[,6:8])
                   Length        Width       Thick
nbr.val        55.0000000   55.0000000  55.0000000
nbr.null        0.0000000    0.0000000   0.0000000
nbr.na          0.0000000    0.0000000   0.0000000
min            31.2000000   14.5000000   4.0000000
max            84.0000000   34.0000000  12.0000000
range          52.8000000   19.5000000   8.0000000
sum          2675.2000000 1212.9000000 400.6000000
median         47.1000000   22.0000000   7.2000000
mean           48.6400000   22.0527273   7.2836364
SE.mean         1.6479384    0.7004369   0.2550997
CI.mean.0.95    3.3039176    1.4042914   0.5114441
var           149.3635556   26.9836498   3.5791717
std.dev        12.2214384    5.1945789   1.8918699
coef.var        0.2512631    0.2355527   0.2597425
Decimals
• The various summaries of statistics
provide limited ways to round or
modify the output:
• Options digits= and scipen= can be
set before running the summary
• Wrapping the function in round()
works for some.
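A minimal base-R sketch of both approaches, using the six Length and Width values from head(DartPoints). The object names are illustrative. round() works here because the result is a plain numeric vector; functions that return formatted text or their own classes may not accept it.

```r
pts <- data.frame(Length = c(34.5, 36.0, 32.4, 31.2, 33.6, 41.8),
                  Width  = c(15.9, 17.1, 14.5, 15.6, 15.8, 16.8))

means <- sapply(pts, mean)   # named numeric vector of column means
means                        # printed with the default number of digits
round(means, 2)              # explicit rounding to two decimals

op <- options(digits = 3)    # options() returns the previous settings
print(means)                 # now printed with about 3 significant digits
options(op)                  # restore the previous settings
```

Saving the old settings in op and restoring them afterwards keeps the change from affecting later output.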
> stat.desc(DartPoints[,6:8], norm=TRUE)
                 Length       Width      Thick
. . . . .
skewness   7.749526e-01  0.23775656 0.68608894
skew.2SE   1.204307e+00  0.36948312 1.06620943
kurtosis   1.212600e-01 -1.22700641 0.23109312
kurt.2SE   9.570531e-02 -0.96842355 0.18239189
normtest.W 9.435694e-01  0.92900685 0.94970998
normtest.p 1.207084e-02  0.00300526 0.02233218
> op <- options(digits=3, scipen=100)
> stat.desc(DartPoints[,6:8], norm=TRUE)
           Length    Width  Thick
. . . . .
skewness   0.7750  0.23776 0.6861
skew.2SE   1.2043  0.36948 1.0662
kurtosis   0.1213 -1.22701 0.2311
kurt.2SE   0.0957 -0.96842 0.1824
normtest.W 0.9436  0.92901 0.9497
normtest.p 0.0121  0.00301 0.0223
> options(op)
Publishable Tables
• Much of the focus in producing
publishable tables in R is on LaTeX
• Most anthropologists are more
familiar with HTML
• xtable() provides both if there is an
xtable method for your output
Using xtable()
• xtable(function-output) produces a
LaTeX version of the table
• print(xtable(function-output),
type="html") converts it to HTML
• Adding file="mytable.html" to the
print() call writes the output to a file
> library(psych)
> library(xtable)
> print(xtable(describe(DartPoints[,6:8])), type="html")
<!-- html table generated in R 2.13.1 by xtable 1.5-6 package -->
<!-- Mon Sep 05 11:54:20 2011 -->
<TABLE border=1>
<TR> <TH> </TH> <TH> var </TH> <TH> n </TH> <TH> mean </TH> <TH> sd </TH>
<TH> median </TH> <TH> trimmed </TH> <TH> mad </TH> <TH> min </TH>
<TH> max </TH> <TH> range </TH> <TH> skew </TH> <TH> kurtosis </TH>
<TH> se </TH> </TR>
. . . . .
</TABLE>
Use print(xtable(x), type="html"), select the HTML
from <TABLE> to </TABLE>, copy, and paste it into
Excel; or use
print(xtable(x), type="html", file="filename.html") and
insert the file into Excel or Word
Summary
Function             Mixed Data  Customize  Groups  digits=     round()  xtable()
summary              Yes         No         by()    Yes         No       Yes
numSummary (Rcmdr)   No          Yes        Yes     print() (1) Yes (1)  Yes (1)
describe (psych)     No          Yes        Yes     print()     No       Yes
describe (prettyR)   Yes         Yes        by()    print()     Yes      No
stat.desc (pastecs)  No          Some       by()    No          No       Yes
(1) Extract the table part of the results:
print(numSummary(x)$table, digits=4)
round(numSummary(x)$table, 3)
print(xtable(numSummary(x)$table), type="html")
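Where the Groups column says by(), the summary function has no grouping argument of its own and is applied to each group through base R's by(). A small sketch follows, using three Darl and three Pedernales rows taken from the head() and tail() output earlier; the object names are illustrative, and aggregate() is shown only as one alternative that returns a single data frame.

```r
pts <- data.frame(
  Name   = factor(rep(c("Darl", "Pedernales"), each = 3)),
  Length = c(34.5, 36.0, 32.4, 74.0, 64.5, 78.3),
  Width  = c(15.9, 17.1, 14.5, 34.0, 28.5, 28.1)
)

# by() splits the rows on a factor and applies the function to each piece
by(pts[, c("Length", "Width")], pts$Name, summary)

# aggregate() returns one row of group statistics per level
grp_means <- aggregate(cbind(Length, Width) ~ Name, data = pts, FUN = mean)
grp_means
```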