Summarizing Data
Numeric Methods

Rcmdr
• Features for loading, viewing and analyzing data
• Help system
• Packages

Data in R
• Several formats: vectors, arrays, matrices, lists, data.frames
• Generally we use data.frames because they let us store different kinds of data and link them by row
• Rcmdr uses data.frames

Referencing Data.frames
• R allows you to refer to rows, columns and individual cells in a data frame in multiple ways
• Every cell has a row and column number that identifies it (like Excel)
• Every cell is the intersection of a named row and a named column

Types of Data in R
• Numeric – integer and decimal
• Categorical – factors
• Ranks – ordered factor
• Logical (True/False)
• Character

Data sets
• Darl and Pedernales points from Fort Hood archaeological surveys
• Data on village and house sizes among California Indian tribes from an article by Sherburne Cook and Robert Heizer

[Figure: Darl and Pedernales points]

> head(DartPoints)
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
36-3321 Darl 41CV1023 12/58   58    12   36.0  17.1   4.0
36-3520 Darl 41CV0495 16/62   62    16   32.4  14.5   5.2
35-2382 Darl 41CV0611 22/62   62    22   31.2  15.6   5.1
40-0847 Darl 41CV1287 05/48   48     5   33.6  15.8   5.1
35-2959 Darl 41CV0235 21/63   63    21   41.8  16.8   4.1
> tail(DartPoints)
              Name     TARL  QUAD East North Length Width Thick
38-0098 Pedernales 41BL0416 39/45   45    39   74.0  34.0   6.6
35-2951 Pedernales 41CV0235 21/63   63    21   64.5  28.5   8.2
35-0173 Pedernales 41CV0869 22/66   66    22   78.3  28.1   8.5
36-4266 Pedernales 41CV0240 15/65   65    15   64.1  27.2  12.0
41-0239 Pedernales 41CV0493 16/62   62    16   67.2  27.1  12.0
35-2855 Pedernales 41CV0843 24/65   65    24   49.3  19.5   7.5

> str(DartPoints)
'data.frame': 55 obs. of 8 variables:
 $ Name  : Factor w/ 2 levels "Darl","Pedernales": 1 1 1 1 1 1 1 1 1 1 ...
 $ TARL  : Factor w/ 43 levels "41BL0183","41BL0205",..: 13 34 18 21 ...
 $ QUAD  : Factor w/ 38 levels "05/48","08/63",..: 30 9 17 28 1 26 15 ...
 $ East  : num 62 58 62 62 48 63 33 63 59 63 ...
 $ North : num 24 12 16 22 5 21 16 17 26 20 ...
 $ Length: num 34.5 36 32.4 31.2 33.6 41.8 33.5 32 42.8 37.5 ...
 $ Width : num 15.9 17.1 14.5 15.6 15.8 16.8 16.6 16 15.8 16.3 ...
 $ Thick : num 4.8 4 5.2 5.1 5.1 4.1 4.9 5.4 5.8 6.1 ...
 - attr(*, "na.action")=Class 'omit' Named int [1:2] 39 53
  .. ..- attr(*, "names")= chr [1:2] "35-2650" "35-2384"

> attributes(DartPoints)
$names
[1] "Name" "TARL" "QUAD" "East" "North" "Length" "Width" "Thick"

$row.names
 [1] "35-3026" "36-3321"  "36-3520"  "35-2382"  "40-0847"  "35-2959"
 [7] "41-0257" "36-3619"  "41-0322"  "35-2921"  "36-3036"  "35-2905"
[13] "35-2866" "36-3487"  "36-4247"  "35-2928"  "35-2871"  "36-3898"
[19] "35-2946" "38-0736"  "35-2325"  "35-0164"  "41-0323"  "35-3043"
[25] "35-2004" "35-2960"  "41-0237"  "44-0643"  "43-0110"  "36-3549"
[31] "41-0008" "36-4320"  "44-1315M" "35-2901"  "41-0220"  "35-2873"
[37] "47-0041" "36-3879"  "41-0054"  "50-0092"  "44-1492M" "36-3880"
[43] "35-2875" "36-3081"  "36-3897"  "44-1253M" "36-3229"  "41-0058"
[49] "35-2391" "38-0098"  "35-2951"  "35-0173"  "36-4266"  "41-0239"
[55] "35-2855"

$class
[1] "data.frame"

> DartPoints[1,]
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
> DartPoints["35-3026",]
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8

> DartPoints[,6]
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints[,"Length"]
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints$Length
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints[1,6]
[1] 34.5

> head(CAIndians)
  Region   Tribe   Language AreaHouse FamilySize FpHouse PpHouse AreapPer
1      1   Yurok   Algonkin       439        7.5       1     7.5     58.5
2      2   Wiyot   Algonkin       254        7.5       1     7.5     33.8
3      3   Karok      Hokan        NA        7.5       1     7.5       NA
4      4    Hupa Athabaskan       400        7.0       1     7.0     57.1
5      5 Chilula Athabaskan        NA        7.5       1     7.5       NA
6      6  Shasta      Hokan       264        7.0       1     7.0     33.0
  HpVillage PpVillage AreapVil AreaVillage VpHouse VpPer PctFloor
1       7.8        60     3434       25450    3263   424     13.5
2       7.6        57     1930       28400    3738   498      6.8
3       4.1        31       NA          NA      NA    NA       NA
4      10.9        76     4360          NA      NA    NA       NA
5       7.0        52       NA          NA      NA    NA       NA
6       6.0        48     1584       18950    3158   394      8.4

> str(CAIndians)
'data.frame': 30 obs. of 15 variables:
 $ Region     : int 1 2 3 4 5 6 7 8 9 10 ...
 $ Tribe      : Factor w/ 30 levels "Achomawi","Athabascans",..: 30 27 8 7 5...
 $ Language   : Factor w/ 6 levels "Algonkin","Athabaskan",..: 1 1 3 2 2 3 3 ...
 $ AreaHouse  : int 439 254 NA 400 NA 264 110 118 100 125 ...
 $ FamilySize : num 7.5 7.5 7.5 7 7.5 7 6 6 6 6 ...
 $ FpHouse    : num 1 1 1 1 1 1 1 1 1 1 ...
 $ PpHouse    : num 7.5 7.5 7.5 7 7.5 7 6 6 6 6 ...
 $ AreapPer   : num 58.5 33.8 NA 57.1 NA 33 18.3 19.6 16.7 20.8 ...
 $ HpVillage  : num 7.8 7.6 4.1 10.9 7 6 5.3 5.4 3.6 5 ...
 $ PpVillage  : int 60 57 31 76 52 48 32 32 22 30 ...
 $ AreapVil   : int 3434 1930 NA 4360 NA 1584 583 637 360 625 ...
 $ AreaVillage: int 25450 28400 NA NA NA 18950 14000 27100 61500 6390 ...
 $ VpHouse    : int 3263 3738 NA NA NA 3158 2641 5019 17084 1278 ...
 $ VpPer      : int 424 498 NA NA NA 394 438 847 2795 214 ...
 $ PctFloor   : num 13.5 6.8 NA NA NA 8.4 4.2 2.4 0.6 9.8 ...
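The ways of referencing rows, columns and cells shown for DartPoints can be tried without the course data. The following is a minimal sketch, assuming a small hand-typed data frame; the object name `pts` is hypothetical, and its three rows simply copy a few values from the head(DartPoints) listing.

```r
# Hypothetical stand-in for DartPoints: three rows copied from head(DartPoints).
pts <- data.frame(
  Name   = factor(c("Darl", "Darl", "Darl")),
  Length = c(34.5, 36.0, 32.4),
  Width  = c(15.9, 17.1, 14.5),
  row.names = c("35-3026", "36-3321", "36-3520")
)

# One row: by position or by row name (both return a one-row data frame).
row_by_pos  <- pts[1, ]
row_by_name <- pts["35-3026", ]

# One column: by position, by name, or with $ (all return a numeric vector).
col_by_pos  <- pts[, 2]
col_by_name <- pts[, "Length"]
col_dollar  <- pts$Length

# One cell: give both the row and the column.
cell <- pts[1, "Length"]

# The equivalent forms agree with each other.
stopifnot(isTRUE(all.equal(row_by_pos, row_by_name)))
stopifnot(identical(col_by_pos, col_by_name), identical(col_by_name, col_dollar))
print(cell)
```

Names are usually the safer choice in scripts, since a column's position changes whenever columns are added or reordered.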
Central Tendency
• Mean (Average) = Sum/Number
  – For dichotomous (0/1) data, the mean is the proportion present
• Median = Middle value
• Mode = Predominant value

> mean(DartPoints$Length)
[1] 48.64
> median(DartPoints$Length)
[1] 47.1
> mean(CAIndians$AreaHouse)
[1] NA
> mean(CAIndians$AreaHouse, na.rm=TRUE)
[1] 299.4815
> median(CAIndians$AreaHouse)
[1] NA
> median(CAIndians$AreaHouse, na.rm=TRUE)
[1] 129

> colMeans(DartPoints[,6:8])
   Length     Width     Thick
48.640000 22.052727  7.283636
> colMeans(DartPoints[DartPoints$Name=="Darl",6:8])
   Length     Width     Thick
40.574074 18.003704  5.981481
> colMeans(DartPoints[DartPoints$Name=="Pedernales",6:8])
   Length     Width     Thick
56.417857 25.957143  8.539286

Dispersion
• Range (max – min)
• Standard Deviation, Variance (Sample vs. Population)
• Coefficient of Variation = StDev/Mean * 100
• Quartiles and the Interquartile Range

> range(DartPoints$Length)
[1] 31.2 84.0
> diff(range(DartPoints$Length))
[1] 52.8
> sd(DartPoints$Length)
[1] 12.22144
> var(DartPoints$Length)
[1] 149.3636
> sd(DartPoints$Length)/mean(DartPoints$Length)*100
[1] 25.12631
> quantile(DartPoints$Length)
   0%   25%   50%   75%  100%
31.20 40.90 47.10 55.65 84.00
> IQR(DartPoints$Length)
[1] 14.75

> diff(range(CAIndians$AreaHouse, na.rm=TRUE))
[1] 1175
> sd(CAIndians$AreaHouse, na.rm=TRUE)
[1] 339.4273
> var(CAIndians$AreaHouse, na.rm=TRUE)
[1] 115210.9
> quantile(CAIndians$AreaHouse, na.rm=TRUE)
    0%    25%    50%    75%   100%
  75.0  110.5  129.0  310.0 1250.0

Shape
• Symmetry, Skewness
  – Normal = 0, Positive or Negative indicates tail in that direction
• Peaked vs Flat, Kurtosis
  – Normal = 0, Positive – more clustered (peaked) than normal, Negative – more spread (flatter) than normal

> library(e1071)
Loading required package: class
> skewness(DartPoints$Length)
[1] 0.7749526
> kurtosis(DartPoints$Length)
[1] 0.12126
> skewness(CAIndians$AreaHouse, na.rm=TRUE)
[1] 1.708035
> kurtosis(CAIndians$AreaHouse, na.rm=TRUE)
[1] 1.498035

Descriptive Stats
• summary() – in base R
• numSummary() – in Rcmdr
• describe() – in psych
• describe() – in prettyR
• stat.desc() – in pastecs

> summary(DartPoints)
         Name          TARL         QUAD          East           North
 Darl      :27   41CV0235: 4   21/63  : 4   Min.   :33.00   Min.   : 5.00
 Pedernales:28   41CV0859: 3   14/62  : 3   1st Qu.:55.00   1st Qu.:14.50
                 41CV1092: 3   16/62  : 3   Median :62.00   Median :20.00
                 41BL0205: 2   20/63  : 3   Mean   :58.24   Mean   :19.02
                 41CV0132: 2   22/66  : 3   3rd Qu.:63.50   3rd Qu.:23.00
                 41CV0493: 2   24/66  : 3   Max.   :70.00   Max.   :39.00
                 (Other) :39   (Other):36
     Length          Width           Thick
 Min.   :31.20   Min.   :14.50   Min.   : 4.000
 1st Qu.:40.90   1st Qu.:16.95   1st Qu.: 5.850
 Median :47.10   Median :22.00   Median : 7.200
 Mean   :48.64   Mean   :22.05   Mean   : 7.284
 3rd Qu.:55.65   3rd Qu.:26.95   3rd Qu.: 8.050
 Max.   :84.00   Max.   :34.00   Max.   :12.000

> numSummary(DartPoints[,6:8])
            mean        sd   0%   25%  50%   75% 100%  n
Length 48.640000 12.221438 31.2 40.90 47.1 55.65   84 55
Width  22.052727  5.194579 14.5 16.95 22.0 26.95   34 55
Thick   7.283636  1.891870  4.0  5.85  7.2  8.05   12 55

> library(psych)
> describe(DartPoints[,6:8])
       var  n  mean    sd median trimmed   mad  min max range skew kurtosis   se
Length   1 55 48.64 12.22   47.1   47.63 12.16 31.2  84  52.8 0.77     0.38 1.65
Width    2 55 22.05  5.19   22.0   21.85  7.41 14.5  34  19.5 0.24    -1.16 0.70
Thick    3 55  7.28  1.89    7.2    7.13  1.48  4.0  12   8.0 0.69     0.50 0.26
> detach(package:psych)

> library(prettyR)
> describe(DartPoints[,6:8])
Description of DartPoints[, 6:8]
Numeric
        Length Width Thick
mean     48.64 22.05 7.284
median    47.1    22   7.2
var      149.4 26.98 3.579
sd       12.22 5.195 1.892
valid.n     55    55    55

> library(pastecs)
> stat.desc(DartPoints[,6:8])
                   Length        Width       Thick
nbr.val        55.0000000   55.0000000  55.0000000
nbr.null        0.0000000    0.0000000   0.0000000
nbr.na          0.0000000    0.0000000   0.0000000
min            31.2000000   14.5000000   4.0000000
max            84.0000000   34.0000000  12.0000000
range          52.8000000   19.5000000   8.0000000
sum          2675.2000000 1212.9000000 400.6000000
median         47.1000000   22.0000000   7.2000000
mean           48.6400000   22.0527273   7.2836364
SE.mean         1.6479384    0.7004369   0.2550997
CI.mean.0.95    3.3039176    1.4042914   0.5114441
var           149.3635556   26.9836498   3.5791717
std.dev        12.2214384    5.1945789   1.8918699
coef.var        0.2512631    0.2355527   0.2597425

Decimals
• The summary functions offer only limited control over rounding and output format:
  – Options digits= and scipen= can be set before running the summary
  – Wrapping the function in round() works for some

> stat.desc(DartPoints[,6:8], norm=TRUE)
                 Length       Width      Thick
. . . . .
skewness   7.749526e-01  0.23775656 0.68608894
skew.2SE   1.204307e+00  0.36948312 1.06620943
kurtosis   1.212600e-01 -1.22700641 0.23109312
kurt.2SE   9.570531e-02 -0.96842355 0.18239189
normtest.W 9.435694e-01  0.92900685 0.94970998
normtest.p 1.207084e-02  0.00300526 0.02233218
> op <- options(digits=3, scipen=100)
> stat.desc(DartPoints[,6:8], norm=TRUE)
           Length    Width  Thick
. . . . .
skewness   0.7750  0.23776 0.6861
skew.2SE   1.2043  0.36948 1.0662
kurtosis   0.1213 -1.22701 0.2311
kurt.2SE   0.0957 -0.96842 0.1824
normtest.W 0.9436  0.92901 0.9497
normtest.p 0.0121  0.00301 0.0223
> options(op)

Publishable Tables
• Much of the focus in producing publishable tables in R is on LaTeX
• Most anthropologists are more familiar with HTML
• xtable() provides both if there is an xtable method for your output

Using xtable()
• xtable(function-output) produces a LaTeX version of the table
• print(xtable(function-output), type="html") converts it to HTML
• Adding file="mytable.html" to the print() call writes the output to a file

> library(xtable)
> library(psych)
> print(xtable(describe(DartPoints[,6:8])), type="html")
<!-- html table generated in R 2.13.1 by xtable 1.5-6 package -->
<!-- Mon Sep 05 11:54:20 2011 -->
<TABLE border=1>
<TR> <TH> </TH> <TH> var </TH> <TH> n </TH> <TH> mean </TH> <TH> sd </TH> <TH> median </TH> <TH> trimmed </TH> <TH> mad </TH> <TH> min </TH> <TH> max </TH> <TH> range </TH> <TH> skew </TH> <TH> kurtosis </TH> <TH> se </TH> </TR>
. . . . .
</TABLE>

• Use print(xtable(x), type="html"), select the HTML commands (<TABLE> to </TABLE>), copy, and paste into Excel
• Or use print(xtable(x), type="html", file="filename.html") and insert the file into Excel or Word

Summary              Mixed Data  Customize  Groups  digits=   round()  xtable()
summary              Yes         No         by()    Yes       No       Yes
numSummary – Rcmdr   No          Yes        Yes     print()¹  Yes¹     Yes¹
describe – psych     No          Yes        Yes     print()   No       Yes
describe – prettyR   Yes         Yes        by()    print()   Yes      No
stat.desc – pastecs  No          Some       by()    No        No       Yes

¹ Extract the table part of the results: print(numSummary(x)$table, digits=4); round(numSummary(x)$table, 3); print(xtable(numSummary(x)$table), type="html")
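
The Groups column above lists by() for several of the summary functions. The following is a minimal base-R sketch of grouped summaries with rounding applied afterwards. It assumes a small hand-typed data frame: the name `pts` is hypothetical, and its six rows copy a few Length and Width values from the head() and tail() listings of DartPoints, so the results are not those of the full data set.

```r
# Hypothetical subset: three Darl and three Pedernales points from the listings.
pts <- data.frame(
  Name   = factor(c("Darl", "Darl", "Darl",
                    "Pedernales", "Pedernales", "Pedernales")),
  Length = c(34.5, 36.0, 32.4, 74.0, 64.5, 78.3),
  Width  = c(15.9, 17.1, 14.5, 34.0, 28.5, 28.1)
)

# Group means for several numeric columns at once; returns a data frame.
grp_means <- aggregate(cbind(Length, Width) ~ Name, data = pts, FUN = mean)

# by() applies a function to each group's rows; here, base summary().
grp_summary <- by(pts[, c("Length", "Width")], pts$Name, summary)

# Coefficient of variation (StDev/Mean * 100) of Length within each group.
cv <- function(x) sd(x) / mean(x) * 100
grp_cv <- tapply(pts$Length, pts$Name, cv)

# Round only the numeric columns for display; the factor column is left alone.
grp_means[, -1] <- round(grp_means[, -1], 1)
print(grp_means)
print(grp_summary)
print(round(grp_cv, 1))
```

aggregate() is used for the means because it returns an ordinary data frame, which is easier to round or pass on to other functions than the list-like object returned by by().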