Using R for Elementary Statistics

Chia-hung Tsai
Election Study Center, Graduate Institute of East Asian Studies
NCCU
March 26, 2014
Table of Contents
1. Data Analysis
2. Objects
3. Table
4. Plot
5. Exercises
Read Data
Stata data:
library(foreign)
udata1<-read.dta("mystata.dta", convert.factors=FALSE)
SPSS data (to.data.frame=TRUE returns a data frame instead of a list):
udata2<-read.spss("myspss.sav", to.data.frame=TRUE)
Excel data (saved as CSV):
udata3<-read.csv("myexcel.csv",header=TRUE,sep=",")
Text table:
> setwd("~/Google /ESCtraining")
> birth<-read.table("birthrate.txt",header=TRUE,sep=",",
+                   dec=".",na.strings=c("NA","missing
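As a hedged sketch of how na.strings behaves, the example below writes a few made-up rows to a temporary file and reads them back. The file contents and the name demo are assumptions for illustration, not the birthrate.txt used in these slides. strip.white=TRUE is added here so that a field such as " missing" is trimmed before it is compared with na.strings.

```r
# Made-up rows in a temporary file, so the sketch runs without birthrate.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("country, birthrate, rank",
             "Niger, 7.03, 1",
             "Mali, NA, 2",
             "Angola, missing, 10"), tmp)

# na.strings lists the codes to treat as missing;
# strip.white trims the blank that follows each comma
demo <- read.table(tmp, header = TRUE, sep = ",", dec = ".",
                   na.strings = c("NA", "missing"), strip.white = TRUE)

str(demo)                    # structure of the result
sum(is.na(demo$birthrate))   # number of missing birth rates
```

When every missing-value code is caught at read time, the column can be read as numeric directly and no later conversion from character is needed.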
Using Data
There are many datasets available in R.
> data()
> library(car)
> data()
We can pick one of them for analysis.
> data(Florida)
> head(Florida)
          GORE   BUSH BUCHANAN NADER BROWNE HAGELIN H
ALACHUA  47300  34062      262  3215    658      42
BAKER     2392   5610       73    53     17       3
BAY      18850  38637      248   828    171      18
BRADFORD  3072   5413       65    84     28       2
BREVARD  97318 115185      570  4470    643      39
BROWARD 386518 177279      789  7099   1212     128
        MOOREHEAD PHILLIPS Total
ALACHUA        21       20 86242
Checking Data
Don’t use attach() unless you have only one dataset in your
workspace. Use $ instead.
Check the data after reading it and before using it.
> head(birth)
  country birthrate rank
1   Niger      7.03    1
2    Mali        NA    2
3  Zambia      5.81    7
4  Angola   missing   10
5 Vietnam      1.87
6  Sweden      1.67  173
> class(birth)
[1] "data.frame"
> nrow(birth)
[1] 14
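A few more checks are often useful at this stage. The sketch below uses a small made-up data frame (birth_demo is a hypothetical stand-in, not the birth data above) to show str(), summary(), and a per-column count of missing values, and it refers to columns with $ rather than attach().

```r
# Small made-up data frame standing in for a freshly read dataset
birth_demo <- data.frame(country   = c("Niger", "Mali", "Zambia"),
                         birthrate = c(7.03, NA, 5.81),
                         rank      = c(1, 2, 7))

str(birth_demo)                 # type of every column
summary(birth_demo$birthrate)   # range, mean, and number of NAs
colSums(is.na(birth_demo))      # missing values per column

# Refer to a column with $ instead of attaching the data frame
mean(birth_demo$rank)
```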
Remedy Data
We can find out which cases have missing values. Beforehand,
we should convert the factors to character.
> newdata<-data.frame(lapply(birth,as.character),
+                     stringsAsFactors=FALSE)
> newdata[]<-lapply(birth,as.character)
> newdata$birthrate
 [1] "7.03"     " NA"      " 5.81"    " missing" " 1.87"
 [7] " 1.66"    " 1.59"    " 1.46"    " 1.42"    " 1.42"
[13] " 1.11"    " 0.79"
Then we use which() to find out which rows have missing data.
> newdata[which(newdata$rank==" NA"),]
     country birthrate rank
14 Singapore      0.79   NA
If we have new information, we can change the missing value to,
for example, 4.2.
> ok<-newdata$birthrate==" NA"
> newdata$birthrate[ok]<-4.2
> newdata$birthrate
 [1] "7.03"     "4.2"      " 5.81"    " missing" " 1.87"
 [7] " 1.66"    " 1.59"    " 1.46"    " 1.42"    " 1.42"
[13] " 1.11"    " 0.79"
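The padded strings above (" NA", " missing") have to be matched exactly, blank included. A hedged alternative, sketched here on a made-up vector named rate rather than on newdata, is to trim the blanks first, recode every missing-value code to NA, and only then convert to numeric.

```r
# Made-up character values that mimic the padded strings shown above
rate <- c("7.03", " NA", " 5.81", " missing", " 1.87")

clean <- trimws(rate)                         # drop leading/trailing blanks
clean[clean %in% c("NA", "missing")] <- NA    # recode both missing codes
rate_num <- as.numeric(clean)                 # convert without warnings

which(is.na(rate_num))          # positions of the missing values: 2 4
mean(rate_num, na.rm = TRUE)    # mean of the observed values
```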
Objects
Vector (x<-c(1,2,3,4))
Matrix (matrix(x, nrow=2, ncol=2))
Array (array(x, dim=length(x)))
Data frame (data.frame(x))
List (list(A,B,C))
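The five kinds of objects can be built from the same vector. The following is a minimal sketch; the names m, a, d, and l are chosen here for illustration only.

```r
x <- c(1, 2, 3, 4)                    # vector
m <- matrix(x, nrow = 2, ncol = 2)    # 2 x 2 matrix, filled column by column
a <- array(x, dim = c(2, 2, 1))       # array with three dimensions
d <- data.frame(x = x)                # data frame with one column
l <- list(x, m, d)                    # list holding objects of different kinds

dim(m)       # 2 2
m[1, 2]      # 3, the first element of the second column
length(l)    # 3
```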
Types of Data
Numeric (1, 0.1, pi)
Character ("A", "b", "1")
Factor (factor(c("A","Group A")))
Integer (1L, 2L, 3L)
Logical (x==4)
Command to check the type: class()
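class() can be tried on one value of each type. This short sketch also shows that a plain 1 is numeric and that the L suffix is what makes a value an integer.

```r
class(0.1)                         # "numeric"
class("1")                         # "character"
class(factor(c("A", "Group A")))   # "factor"
class(1)                           # "numeric"
class(1L)                          # "integer"
x <- 4
class(x == 4)                      # "logical"
```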
Conversion and Recode
We can use as.numeric() to convert a factor to a numeric
variable; each level is replaced by its integer code.
> data(Prestige)
> table(Prestige$type)
  bc prof   wc 
  44   31   23 
> newtype<-as.numeric(Prestige$type)
> table(newtype)
newtype
1 2 3
44 31 23
as.numeric() changes the <NA> to NA, so newtype still contains
some missing values (table() leaves them out by default). We can
use ifelse() to convert NA to any value we want. For example, we
change it to 0.
> Prestige$types<-ifelse(is.na(newtype), 0, newtype)
ifelse() assigns 0 if newtype is a missing value and keeps the
original value otherwise. The test must be is.na(newtype), because
newtype==NA is always NA.
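Two pitfalls are worth a small illustration: as.numeric() on a factor returns level codes rather than the labels, and a comparison with NA never returns TRUE. The factor f below is made up for this sketch.

```r
# Made-up factor whose labels look like numbers
f <- factor(c("10", "20", "20", NA))

as.numeric(f)                   # level codes: 1 2 2 NA
as.numeric(as.character(f))     # the labels as numbers: 10 20 20 NA

f_num <- as.numeric(as.character(f))
f_num == NA                     # NA NA NA NA, never TRUE
ifelse(is.na(f_num), 0, f_num)  # 10 20 20 0
```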
Convert
We can assign new values to selected elements of a variable.
> city<-c("Taipei County","Yilan County","Taoyuan County","Hsinchu County",
+         "Miaoli County",
+         "Taichung County","Chunghua County", "Nantou County","Yunlin County",
+         "Chiayi County","Tainan County","Kaohsiung County","Pingtung County",
+         "Taitung County","Hwalien County",
+         "Penghu County","Keelung City", "Hsinchu City","Taichung City",
+         "Chiayi City","Tainan City", "Taipei City","Kaohsiung City")
> citynew<-rep(NA, length(city))
> citynew[city=="Taipei County"]<-18    # New Taipei City
> citynew[city=="Kaohsiung County"]<-17 # Kaohsiung City
> citynew[city=="Taipei City"]<-16      # Taipei City
> citynew[city=="Kaohsiung City"]<-17   # Kaohsiung City
> citynew[city=="Tainan County"]<-20    # Tainan City
> citynew[city=="Taichung County"]<-19  # Taichung City
> sort(citynew)
[1] 16 17 17 18 19 20
> citynew
 [1] 18 NA NA NA NA 19 NA NA NA NA 20 17 NA NA NA NA NA NA NA NA NA 16 17
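When many values are recoded, a named lookup vector can replace the chain of assignments. This is a sketch of that alternative, not the method used on the slide; lookup, city_demo, and citynew_demo are hypothetical names, and the codes are copied from the assignments above.

```r
# Codes copied from the assignments above, stored as a named vector
lookup <- c("Taipei County" = 18, "Kaohsiung County" = 17,
            "Taipei City" = 16, "Kaohsiung City" = 17,
            "Tainan County" = 20, "Taichung County" = 19)

# A short made-up vector of names to recode
city_demo <- c("Taipei County", "Yilan County", "Kaohsiung City", "Taipei City")

# Indexing by name returns NA for names that are not in the lookup
citynew_demo <- unname(lookup[city_demo])
citynew_demo   # 18 NA 17 16
```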
Recode
We can also recode values. For example, we divide education
into three groups.
> library(car)
> edu3<-recode(Prestige$education,
+              "0:10.99=1;11:12.99=2;13:16=3")
> table(edu3)
edu3
 1  2  3 
54 25 23 
> edu3f<-recode(edu3,"1='low';2='middle';3='high'")
> table(edu3f); class(edu3f)
edu3f
  high    low middle 
    23     54     25 
[1] "character"
edu3f is a character vector, so levels(edu3f) is NULL and
ordered(edu3f,levels=levels(edu3f)) returns only <NA>. We list the
levels explicitly instead.
> edu3o<-ordered(edu3f,levels=c("low","middle","high"))
> levels(edu3o)
[1] "low"    "middle" "high"
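Base R's cut() is an alternative to car's recode() for turning a numeric variable into ordered groups in one step. The sketch below is an assumption-laden illustration: educ is a made-up vector, and the break points 11 and 13 are chosen to mirror the three ranges used above.

```r
# Made-up years of education
educ <- c(6.5, 10.2, 11.0, 12.9, 13.0, 15.9)

# Intervals [0,11), [11,13), [13,16]; labels are attached in that order
edu3_demo <- cut(educ, breaks = c(0, 11, 13, 16),
                 labels = c("low", "middle", "high"),
                 right = FALSE, include.lowest = TRUE,
                 ordered_result = TRUE)

table(edu3_demo)
levels(edu3_demo)   # "low" "middle" "high"
```

Because the breaks are contiguous, no value can fall into a gap between two ranges, which can happen with ranges such as 0:10.99 and 11:12.99.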
na.omit()
We can use na.omit() to drop missing values before computing a statistic.
> q1<-c(2,3,6,NA,10,8,1)
> mean(q1)
[1] NA
> mean(na.omit(q1))
[1] 5
We can also use a logical vector to filter out missing values.
> ok<-!is.na(q1)
> table(ok)
ok
FALSE  TRUE 
    1     6 
> mean(q1[ok])
[1] 5
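Many summary functions also accept na.rm=TRUE, which gives the same result without a separate filtering step. A short sketch with the same q1:

```r
q1 <- c(2, 3, 6, NA, 10, 8, 1)

mean(q1, na.rm = TRUE)     # 5, same as mean(na.omit(q1))
median(q1, na.rm = TRUE)   # 4.5
sum(is.na(q1))             # 1 missing value
```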
Cross-table
We can examine the relationship between two variables in a
cross-table. For example, we look at the relationship between the
origin of car companies and the availability of manual transmission.
> library(UsingR)
> tab1<-table(Cars93$Man.trans.avail, Cars93$Origin)
> 100*prop.table(tab1,2)
           USA  non-USA
  No  54.16667 13.33333
  Yes 45.83333 86.66667
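The second argument of prop.table() decides which totals the proportions are based on. The sketch below uses a hand-built table so that it runs without UsingR; the counts are an assumption, chosen so that the column percentages match those shown above, and are not read from Cars93.

```r
# Hand-built 2 x 2 table of counts (assumed values, see the note above)
tab <- matrix(c(26, 22, 6, 39), nrow = 2,
              dimnames = list(Manual = c("No", "Yes"),
                              Origin = c("USA", "non-USA")))

prop.table(tab)             # cell proportions: the whole table sums to 1
prop.table(tab, 1)          # row proportions: each row sums to 1
100 * prop.table(tab, 2)    # column percentages: each column sums to 100
addmargins(tab)             # counts with row and column totals
```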
Barplot
> nohs<-c(.18,.23,.59);names(nohs)<-c("Favor","Neutral","Oppose")
> hs<-c(.14,.23,.63);names(hs)<-c("Favor","Neutral","Oppose")
> ba<-c(.41,.17,.41);names(ba)<-c("Favor","Neutral","Oppose")
> baup<-c(.44,.07,.49);names(baup)<-c("Favor","Neutral","Oppose")
> par(mfrow=c(2,2))
> barplot(nohs, main="No HS", col="ForestGreen")
> barplot(hs, main="HS", col="Salmon")
> barplot(ba, main="BA", col="RoyalBlue")
> barplot(baup, main="BA+", col="Orchid")
[Figure: 2 x 2 grid of bar plots titled No HS, HS, BA, and BA+, each
showing the proportions for Favor, Neutral, and Oppose.]
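The four panels can also be combined into one grouped bar plot, which puts the education groups on a common scale. This is a sketch of that alternative, not the slide's own layout; the matrix name opin, the axis label, and the colour assignment are choices made here.

```r
# Proportions from the four vectors above, one column per education group
opin <- cbind("No HS" = c(.18, .23, .59),
              "HS"    = c(.14, .23, .63),
              "BA"    = c(.41, .17, .41),
              "BA+"   = c(.44, .07, .49))
rownames(opin) <- c("Favor", "Neutral", "Oppose")

par(mfrow = c(1, 1))                 # back to a single panel
barplot(opin, beside = TRUE,         # bars side by side within each group
        col = c("forestgreen", "salmon", "royalblue"),
        legend.text = rownames(opin),
        ylab = "Proportion")
```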
ggplot2
Exercises 1
1. Regarding the birthrate variable in the birth data, please
replace the missing data with the average of the non-missing
data for that variable.
2. Open Duncan in the car package and recode the income variable
into four categories: 0-25.5, 25.6-44, 44.1-62.5, 62.6-99.
Then assign "1st group", "2nd group", "3rd group", and "4th
group" to the categories respectively.
Exercises 2
1. Please use the Salaries data in the car package and show the table of
rank and sex.
2. Please recode yrs.since.phd into three groups: within 5 years,
more than 5 but no more than 10 years, and more than 10 years.
Solutions 1
1. rate_n<-as.numeric(newdata$birthrate)  # " NA" and " missing" become NA
   mean.naomit<-mean(na.omit(rate_n))
   ok<-is.na(rate_n)
   rate_n[ok]<-mean.naomit
   newdata$birthrate<-rate_n
2. inc4<-recode(Duncan$income, "0:25.5='1st group';
   25.6:44='2nd group';
   44.1:62.5='3rd group';62.6:99='4th group'")
Solutions 2
1. table(Salaries$rank,Salaries$sex)
2. yrs_phd<-recode(Salaries$yrs.since.phd,"0:5=1;6:10=2;
   11:60=3")
   table(yrs_phd)