Using R for Elementary Statistics

Chia-hung Tsai
Election Study Center, Graduate Institute of East Asian Studies, NCCU
March 26, 2014

Table of Contents

1. Data Analysis
2. Objects
3. Table
4. Plot
5. Exercises

Read Data

Stata data:
  library(foreign)
  udata1<-read.dta("mystata.dta", convert.factors=FALSE)

SPSS data (to.data.frame=TRUE returns a data frame instead of a list):
  udata2<-read.spss("myspss.sav", to.data.frame=TRUE)

Excel data (saved as CSV):
  udata3<-read.csv("myexcel.csv",header=TRUE,sep=",")

Text table:
> setwd("~/Google /ESCtraining")
> birth<-read.table("birthrate.txt",header=T,sep=",",
+ dec=".",na.string=c("NA","missing

Using Data

There are many datasets available in R.
> data()
> library(car)
> data()

We can pick one of them for analysis.
> data(Florida)
> head(Florida)
           GORE   BUSH BUCHANAN NADER BROWNE HAGELIN H
ALACHUA   47300  34062      262  3215    658      42
BAKER      2392   5610       73    53     17       3
BAY       18850  38637      248   828    171      18
BRADFORD   3072   5413       65    84     28       2
BREVARD   97318 115185      570  4470    643      39
BROWARD  386518 177279      789  7099   1212     128
        MOOREHEAD PHILLIPS Total
ALACHUA        21       20 86242

Checking Data

Don’t use attach() unless you have only one dataset in your workspace. Use $ instead.
Check the data after reading it and before using it.
> head(birth)
  country birthrate rank
1   Niger      7.03    1
2    Mali        NA    2
3  Zambia      5.81    7
4  Angola   missing   10
5 Vietnam      1.87
6  Sweden      1.67  173
> class(birth)
[1] "data.frame"
> nrow(birth)
[1] 14
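The "NA" and "missing" entries seen in head(birth) can also be caught while reading the file. The following is a minimal sketch, assuming the raw fields are padded with blanks; the inline text and the name demo are made up for illustration and stand in for birthrate.txt.

```r
# Hypothetical stand-in for birthrate.txt: four rows taken from the
# head(birth) output, with blank-padded fields (an assumption).
raw <- "country,birthrate,rank
Niger, 7.03, 1
Mali, NA, 2
Zambia, 5.81, 7
Angola, missing, 10"

# strip.white=TRUE trims the padding before na.strings is matched,
# so both "NA" and "missing" are read as missing values.
demo <- read.table(text = raw, header = TRUE, sep = ",", dec = ".",
                   na.strings = c("NA", "missing"), strip.white = TRUE)

str(demo)               # birthrate is read as a numeric column
colSums(is.na(demo))    # number of missing values per column
```

With the markers handled at read time, the column stays numeric and is.na() can be used directly in later steps.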

Remedy Data

We can find out which cases have missing values. Beforehand, we should convert factors to character.
> newdata<-data.frame(lapply(birth,as.character),
+ stringsAsFactors=FALSE)
> newdata[]<-lapply(birth,as.character)
> newdata$birthrate
 [1] "7.03"     " NA"      " 5.81"    " missing" " 1.87"
 [7] " 1.66"    " 1.59"    " 1.46"    " 1.42"    " 1.42"
[13] " 1.11"    " 0.79"

Then we use which() to find out which rows have missing data.
> newdata[which(newdata$rank==" NA"),]
     country birthrate rank
14 Singapore      0.79   NA

If we have new information, we can change the missing value to, for example, 4.2.
> ok<-newdata$birthrate==" NA"
> newdata$birthrate[ok]<-4.2
> newdata$birthrate
 [1] "7.03"     "4.2"      " 5.81"    " missing" " 1.87"
 [7] " 1.66"    " 1.59"    " 1.46"    " 1.42"    " 1.42"
[13] " 1.11"    " 0.79"

Objects

Vector (x=c())
Matrix (matrix(x, nrow=2, ncol=2))
Array (array(x, dim=length(x)))
Data frame (data.frame(x))
List (list(A,B,C))

Types of Data

Numeric (1, 0.1, pi)
Character ("A", "b", "1")
Factor (A, "Group A")
Integer (1L, 2L, 3L)
Logical (x==4)
Command: class()

Conversion and Recode

We can use as.numeric() to convert a factor variable to a numeric variable (its level codes).
> data(Prestige)
> table(Prestige$type)

  bc prof   wc
  44   31   23
> newtype<-as.numeric(Prestige$type)
> table(newtype)
newtype
 1  2  3
44 31 23
as.numeric() changes the <NA> to NA.
Unfortunately, there are some missing values. We can use ifelse() to convert NA to any value we want. For example, we change it to 0.
> Prestige$types<-ifelse(is.na(newtype), 0, newtype)
ifelse() assigns 0 if newtype is a missing value and keeps it as it is otherwise. The test must be is.na(newtype), because the comparison newtype==NA always returns NA.

Convert

We can assign any value to a variable.
> city<-c("Taipei County","Yilan County","Taoyuan County","Hsinchu County",
+ "Miaoli County", "Taichung County","Chunghua County",
+ "Nantou County","Yunlin County",
+ "Chiayi County","Tainan County","Kaohsiung County","Pingtung County",
+ "Taitung County","Hwalien County", "Penghu County","Keelung City",
+ "Hsinchu City","Taichung City", "Chiayi City","Tainan City",
+ "Taipei City","Kaohsiung City")
> citynew<-rep(NA, length(city))
> citynew[city=="Taipei County"]<-18 # New Taipei City
> citynew[city=="Kaohsiung County"]<-17 # Kaohsiung City
> citynew[city=="Taipei City"]<-16 # Taipei City
> citynew[city=="Kaohsiung City"]<-17 # Kaohsiung City
> citynew[city=="Tainan County"]<-20 # Tainan City
> citynew[city=="Taichung County"]<-19 # Taichung City
> sort(citynew)
[1] 16 17 17 18 19 20
> citynew
 [1] 18 NA NA NA NA 19 NA NA NA NA 20 17 NA NA NA NA NA NA NA NA NA 16 17

Recode

We can also recode values. For example, we divide education into three groups.
> library(car)
> edu3<-recode(Prestige$education,
+ "0:10.99=1;11:12.99=2;13:16=3")
> table(edu3)
edu3
 1  2  3
54 25 23
> edu3f<-recode(edu3,"1='low';2='middle';3='high'")
> table(edu3f); class(edu3f)
edu3f
  high    low middle
    23     54     25
[1] "character"
Because edu3f is a character vector, levels(edu3f) is NULL, and ordered(edu3f,levels=levels(edu3f)) returns only <NA>. List the levels explicitly instead:
> ordered(edu3f,levels=c("low","middle","high"))
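The three-way split of education can also be sketched in base R with cut(), which returns an ordered factor in one step. The education values below are made up for illustration, and the cut points are chosen to mirror the recode() ranges; treat this as an alternative sketch, not as the method used on the slide.

```r
# Made-up years of education (not taken from Prestige).
education <- c(8.5, 10.99, 11, 12.5, 13, 16)

# right=FALSE gives the intervals [0,11), [11,13), [13,16];
# include.lowest=TRUE closes the last interval so that 16 is kept.
edu3c <- cut(education, breaks = c(0, 11, 13, 16),
             labels = c("low", "middle", "high"),
             right = FALSE, include.lowest = TRUE,
             ordered_result = TRUE)

table(edu3c)
class(edu3c)   # an ordered factor: low < middle < high
```

Unlike the recode() ranges, the intervals here leave no gap between 10.99 and 11, so every value from 0 to 16 falls into a group.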

na.omit()

We can use na.omit() to drop missing values.
> q1<-c(2,3,6,NA,10,8,1)
> mean(q1)
[1] NA
> mean(na.omit(q1))
[1] 5

We can also use a logical vector to filter out missing values.
> ok<-!is.na(q1)
> table(ok)
ok
FALSE  TRUE
    1     6
> mean(q1[ok])
[1] 5

Cross-table

We can observe the bivariate relationship between two variables. For example, we look at the relationship between the origin of car companies and the availability of manual transmission.
> library(UsingR)
> tab1<-table(Cars93$Man.trans.avail, Cars93$Origin)
> 100*prop.table(tab1,2)

           USA  non-USA
  No  54.16667 13.33333
  Yes 45.83333 86.66667

Barplot

> nohs<-c(.18,.23,.59);names(nohs)<-c("Favor","Neutral","Oppose")
> hs<-c(.14,.23,.63);names(hs)<-c("Favor","Neutral","Oppose")
> ba<-c(.41,.17,.41);names(ba)<-c("Favor","Neutral","Oppose")
> baup<-c(.44,.07,.49);names(baup)<-c("Favor","Neutral","Oppose")
> par(mfrow=c(2,2))
> barplot(nohs, main="No HS", col="ForestGreen")
> barplot(hs, main="HS", col="Salmon")
> barplot(ba, main="BA", col="RoyalBlue")
> barplot(baup, main="BA+", col="Orchid")

[Figure: 2 x 2 grid of bar plots titled "No HS", "HS", "BA", and "BA+"; each panel shows the shares for Favor, Neutral, and Oppose.]

ggplot2
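A minimal sketch of how the four bar plots from the Barplot slide might be drawn with ggplot2. It assumes the ggplot2 package is installed; the data frame layout and the names opinion, education, response, and share are assumptions made for this example, not code from the original slides.

```r
library(ggplot2)

# Same shares as in the barplot() example, stacked into one long
# data frame with one row per education group and response.
groups    <- c("No HS", "HS", "BA", "BA+")
responses <- c("Favor", "Neutral", "Oppose")
opinion <- data.frame(
  education = factor(rep(groups, each = 3), levels = groups),
  response  = factor(rep(responses, times = 4), levels = responses),
  share     = c(.18, .23, .59,   # No HS
                .14, .23, .63,   # HS
                .41, .17, .41,   # BA
                .44, .07, .49)   # BA+
)

# One panel per education group, arranged in a 2 x 2 grid.
p <- ggplot(opinion, aes(x = response, y = share)) +
  geom_col() +
  facet_wrap(~ education, ncol = 2)
print(p)
```

Keeping the data in long form lets facet_wrap() build the panels, so no par(mfrow=...) call is needed.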

Exercises 1

1. Regarding the birthrate variable in the birth data, please replace the missing values with the average of the non-missing values of that variable.
2. Open Duncan in the car package and recode the income variable into four categories: 0-25.5, 25.6-44, 44.1-62.5, 62.6-99. Then assign the labels "1st group", "2nd group", "3rd group", and "4th group" to the categories, respectively.

Exercises 2

1. Please use the Salaries data in the car package and show the table of rank and sex.
2. Please recode yrs.since.phd into three groups: 5 years or fewer, more than 5 and up to 10 years, and more than 10 years.

Solutions 1

1. rate_n<-as.numeric(newdata$birthrate)
   mean.naomit<-mean(na.omit(rate_n))
   ok<-is.na(rate_n) # catches both " NA" and " missing"
   newdata$birthrate[ok]<-mean.naomit
2. inc4<-recode(Duncan$income,
   "0:25.5='1st group'; 25.6:44='2nd group'; 44.1:62.5='3rd group';62.6:99='4th group'")

Solutions 2

1. table(Salaries$rank,Salaries$sex)
2. yrs_phd<-recode(Salaries$yrs.since.phd,"0:5=1;6:10=2; 11:60=3")
   table(yrs_phd)