Introduction to R/R-Studio
This workshop introduces you to R using R-Studio and provides tips and tools you can use to get
started with statistical analysis. When we are finished you will:
- Understand the components of the R-Studio interface
- Be able to enter, load, and describe data
- Use the basic building blocks of an R program
- Learn how to install and load packages
- Be able to use syntax to analyze and graph data
- Generate new and recode existing variables
- Generate Descriptive Statistics
- Run Correlations and Linear Regressions
- Create Histograms, Bar Charts, and Scatter Plots
SCENARIO: We are interested in studying the relationship between the amount of corruption in a country
and the quality of its economy.
DATA: To access the data for this workshop visit: https://tinyurl.com/r-intro-ctrl
For this question, we are lucky because the data already exist. We will use the “World Development Indicators,”
which are compiled annually by The World Bank. We are using data from a past year, and we have
simplified the data set, but you can find the full data set here: http://data.worldbank.org/datacatalog/world-development-indicators. We are also lucky because the World Bank saves its data in a
way that makes it easy to open. Often when you find a data set (for example, old Census data), it can take
a lot of effort just to transform the data into a usable form. And, of course, there is never a guarantee that
someone else has created the data set you want in advance; often the first step in a research project is
spending a long time looking up values to create the data set you need.
Within the WDI data set, we will focus on the variables “corrup” and “gdppc.” The first is an index of
perceived corruption within a country’s government. Citizens are surveyed and asked questions like
“Would you need to plan bribes to start a local business?” Note that corrup ranges from 0 to 10 and is
“reverse coded,” so high scores indicate low corruption. The second variable is Gross Domestic Product
per Capita, which indicates how well the economy is doing, scaled for the population size of a country
(for example, it is not very interesting to note that Botswana produces less than Brazil overall, but it might
be interesting to note that Botswana produced more per citizen than Brazil in 2014).
Note that, in using these variables from the WDI data set, we have handed the responsibility of
determining what “corruption” and “quality of economy” mean to the World Bank. We have also allowed
dynamic concepts to become simple numbers. However, this allows us to move quickly to the later stages
of analysis, which has its advantages.
"Intro to R" created by Bill Harder and E.R. Schuler at American University's Center for
Teaching, Research & Learning is licensed under CC BY SA 4.0
BACKGROUND ON R/R-STUDIO
R is a programming language and an environment for statistical computing and graphics. It is an open
source project similar to the S language and environment, which was developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R
provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series
analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible. R can be
downloaded here: https://cran.r-project.org/
R has some commands already built in, but often you will need to add packages written by
others. This is the cool part of R being open source: there are well over 20,000 packages.
If you are interested in using Base R:
R Console: Default window when you open R. This is the command-line interface/output window.
Editor: Click on File > New Script. This is the primary script/code window. You run the current line in this
window by pressing CTRL + R, or run several lines of code by highlighting them and pressing the same
keys.
Data Editor: When you’ve loaded data, type fix(name_of_dataset) to view and edit your data in this
window.
R-Studio is an open-source interface that runs Base R and has a built-in text editor with debugging features,
a panel to view data sets, variables and functions, and a window to view installed packages, plots, and
package user guides. The open source version of R-Studio can be downloaded here (be sure to download
Base R first): https://www.rstudio.com
I. R-STUDIO OVERVIEW
Before we begin, it is important to become familiar with R-Studio, including the
menus, the source pane (for R scripts or code), the console (runs Base R and prints output), the
environment pane (shows lists, variables, functions, and datasets), and the plots/packages pane.
II. SOME R BASICS
# denotes a comment in R; R skips everything after it on that line.
We can assign a number (5) to a variable we call x (note that R is case sensitive).
x <- 5
x #prints out the value associated with x
To remove an object we can type and run
rm(x)
We can create an object with multiple values by typing.
Y <- c(1, 2, 3, 4)
Y #prints out the object
We can also have characters (strings) in an object.
Z <- c("Hi", "Hey", "Hello")
Z #prints out the strings
We can create three vector objects.
name <- c("Evelyn", "Erica", "Jose", "Javier")
age <- c(33, 32, 40, 27)
sex <- c("F", "F", "M", "M")
Then we can combine them into a data frame.
person.data <- data.frame(name, age, sex)
person.data #prints the data frame
We can also view this data like a spreadsheet.
View(person.data)
A "$" is used to reference a specific variable (column) within a data frame. For example, to reference the "sex"
variable within the data frame "person.data":
person.data$sex
We could also use this approach to add a new variable to the data frame.
person.data$school <- c("SIS", "SPA", "SOC", "SIS")
person.data # Prints out the dataframe
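As a small, self-contained sketch of the "$" notation above, the same idea extends to computing on a column and filtering rows. The `toy` data frame and its values are made up for illustration and are not part of the workshop data.

```r
# Hypothetical toy data, not from the workshop files
toy <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  age  = c(30, 25, 41),
  stringsAsFactors = FALSE
)

mean_age <- mean(toy$age)      # summarize one column
toy$over_28 <- toy$age > 28    # add a new logical column
older <- toy[toy$age > 28, ]   # keep only rows where the condition is TRUE

mean_age
older$name
```

The square brackets take a row condition before the comma and (optionally) column names after it, so leaving the second slot empty keeps every column.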
Whenever you need help in R, you can search its extensive help files with the following command.
help.search("view")
Or go to the package tab in the lower right pane.
Let's clear everything and start fresh.
rm(person.data, age, name, sex, Y, Z) #R is case sensitive, so Y and Z must be capitalized
Note: CTRL + L clears the console window (it does not remove objects).
If we want to install a package, we can do that without needing any administrative access. We just need
to know the name of the package. We will install the ‘psych’ package that has a number of helpful
functions.
install.packages("psych")
To load or activate a library and make the functions in that library accessible to us in this R session, type
and run:
library(psych)
These are the other packages we will need today
#install packages
install.packages("psych")
install.packages("car")
install.packages("lmtest")
install.packages("yhat")
#load packages
library(psych)
library(car)
library(lmtest)
library(yhat)
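Re-running install.packages() every session downloads each package again. One possible way to avoid that, sketched here under the assumption that you only want to install what is missing, is to compare the packages you need against what is already installed. The helper name `missing_packages` is hypothetical and not part of R.

```r
# Packages used in this workshop
needed <- c("psych", "car", "lmtest", "yhat")

# Return the subset of `pkgs` that is not yet installed.
# `missing_packages` is a hypothetical helper name, not a built-in function.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(needed)
to_install  # character(0) means everything is already installed

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

You still need library() in every new session; installing puts a package on disk, while loading makes its functions available.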
III. MANAGING DATA AND DESCRIPTIVE STATISTICS
The WDI data comes as an Excel file. After we look briefly at the file in Excel, we will import the data
into R. First we need to tell R where to look for and save files. You can do this from the "Session"
menu.
Go to -> "Session" -> "Set Working Directory" -> "Choose Directory" -> Select the folder that you want
When you do this, copy the line of code that runs in the console and paste it into your script.
setwd("C:/Users/eschuler/Desktop")
Importing Data
Import the data from a .csv file
cs<-read.csv("cs.csv", header=TRUE)
For SPSS (the SPSS dataset must be saved in transport format)
install.packages("Hmisc")
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
# last option converts value labels to R factors
For SAS
install.packages("Hmisc")
library(Hmisc)
mydata <- sasxport.get("c:/mydata.xpt")
#character variables are converted to R factors
For Stata
install.packages("foreign")
library(foreign)
mydata <- read.dta("c:/mydata.dta")
To take a look at the column names of our dataframe and the first few entries in each, run:
colnames(cs)
str(cs)
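If you do not have cs.csv at hand, the import-and-inspect steps can be tried on a throwaway file. The sketch below writes a made-up three-row data set to a temporary .csv and reads it back; the column names mirror the workshop variables, but every value is invented.

```r
# Hypothetical miniature data set; the values are invented for illustration
toy <- data.frame(
  country = c("A", "B", "C"),
  gdppc   = c(1200.5, 8300, NA),   # NA marks a missing value
  oecd    = c(0, 1, 0)
)

path <- tempfile(fileext = ".csv")   # temporary file location
write.csv(toy, path, row.names = FALSE)

toy_in <- read.csv(path, header = TRUE)

dim(toy_in)              # number of rows and columns
head(toy_in, 2)          # first two rows
str(toy_in)              # type and first values of each column
colSums(is.na(toy_in))   # count of missing values per column
```

Checking dim() and the missing-value counts right after an import is a quick way to confirm that the file was read the way you expected.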
Obtaining Descriptive Statistics
For most questions you might have about your data, you will want to look at descriptive statistics, then
look at graphs, then run a statistical test.
The "summary" function is in R by default. We can run it on every variable in the cs dataframe or choose
a single variable in the dataframe by using ‘$’
summary(cs)
summary(cs$oecd)
When we loaded the ‘psych’ package, the describe function became available to us. We can use that
function to obtain a more detailed summary of the descriptive statistics.
describe(cs)
describe(cs$corrup)
Recoding an Existing Variable into a New Variable
As we mentioned earlier, corrup is reverse coded so higher corrup scores reflect lower perceived
corruption. We will now reverse code corrup so that higher scores reflect higher perceived corruption
and save it as a new variable, corrup2.
cs$corrup2 <- 10 - cs$corrup
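To see what the reverse coding does, here is a minimal sketch on a made-up vector of scores (the values are invented, not taken from the WDI data). Subtracting from 10 works because the scale runs from 0 to 10; for a scale with other endpoints the general form would be (maximum + minimum) minus the score.

```r
# Hypothetical scores on the original 0-10 scale (higher = less corrupt)
corrup <- c(0, 2.5, 7, 10, NA)

corrup2 <- 10 - corrup   # flip the scale; NA stays NA

corrup2
range(corrup2, na.rm = TRUE)                 # still 0 to 10
cor(corrup, corrup2, use = "complete.obs")   # a pure flip correlates at -1
```

A correlation of -1 between the old and new variable is a handy check that the recode only reversed the direction and did not change the spacing between scores.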
Recoding a Continuous Variable into a Discrete Variable
Sometimes we may want to turn a continuous variable into a categorical (low, medium, and high)
variable. To do that:
cs$corrup_cat <- NA #start with an empty column
cs$corrup_cat[cs$corrup2 < 3.3] <- 1
cs$corrup_cat[cs$corrup2 >= 3.3 & cs$corrup2 < 6.6] <- 2
cs$corrup_cat[cs$corrup2 >= 6.6] <- 3 #6.6 rather than 6.61, so no score falls between groups
table(cs$corrup_cat)
Then we can add some labels.
cs$corrup_cat <- ordered(cs$corrup_cat, levels = c(1,2,3), labels = c( "Low", "Medium",
"High"))
table(cs$corrup_cat)
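The recode and labeling steps above can also be done in one call with base R's cut(). This is offered as an alternative sketch, not as the workshop's method; the scores below are invented and chosen to sit on the cut points so you can see which group each boundary value lands in.

```r
# Hypothetical corruption scores on a 0-10 scale
corrup2 <- c(0, 3.2, 3.3, 6.59, 6.6, 10, NA)

corrup_cat <- cut(
  corrup2,
  breaks = c(0, 3.3, 6.6, 10),
  labels = c("Low", "Medium", "High"),
  right = FALSE,           # intervals are [0, 3.3), [3.3, 6.6), [6.6, 10]
  include.lowest = TRUE,   # with right = FALSE this keeps the top value, 10
  ordered_result = TRUE    # ordered factor, like ordered()
)

table(corrup_cat, useNA = "ifany")
```

Because the breaks are listed once, the groups cannot overlap or leave a gap, which is the main reason to consider cut() over separate assignments.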
If we look at the variable oecd, it is dummy coded 0 and 1. We can attach value labels so that 0 = No
and 1 = Yes.
cs$oecd <- factor(cs$oecd, levels = c(0,1), labels = c("No", "Yes"))
table(cs$oecd)
Crosstabs allow us to look at 2 categorical variables at the same time in a table, and we can save this
table as an object.
oecd_corruption_table <- table(cs$oecd,cs$corrup_cat)
oecd_corruption_table
If we want to see row proportions
prop.table(oecd_corruption_table, 1)
If we want to see column proportions
prop.table(oecd_corruption_table, 2)
If we want sums
addmargins(oecd_corruption_table)
Alternatively, we can combine the two previous commands
addmargins(prop.table(oecd_corruption_table, 1))
We can also do a chi square test
summary(oecd_corruption_table)
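The sketch below repeats the crosstab steps on invented counts and compares summary() on a table with chisq.test(), which runs the same Pearson chi-square test and also exposes the expected counts. The two vectors are hypothetical stand-ins for oecd and corrup_cat.

```r
# Hypothetical counts, invented for illustration
oecd  <- rep(c("No", "Yes"), times = c(60, 40))
level <- c(rep(c("Low", "Medium", "High"), times = c(10, 20, 30)),
           rep(c("Low", "Medium", "High"), times = c(25, 10, 5)))

tab <- table(oecd, level)

row_props <- prop.table(tab, 1)   # each row sums to 1
rowSums(row_props)

from_summary <- summary(tab)      # chi-square test of independence
from_chisq   <- chisq.test(tab)   # same test, with expected counts attached

from_summary$statistic
from_chisq$statistic
from_chisq$expected               # rule of thumb: expected counts of at least 5
```

One caveat: for a 2 x 2 table chisq.test() applies a continuity correction by default, so its statistic will differ from summary() unless you set correct = FALSE.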
If we want to visualize the gdppc amounts, we can create a histogram
hist(cs$gdppc)
hist(cs$gdppc, main = "GDP Per Capita", sub="Source: WDI 2010, The World Bank Group",
xlab = " ")
If we want to calculate the average GDP per capita for each of our corruption groups we can use the aggregate command
gdp_corrup_table <- aggregate(cs[, c('gdppc')], list(cs[, c('corrup_cat')]), mean, na.rm=TRUE)
Then we could generate a bar graph using this data
barplot(gdp_corrup_table$x, main = "GDP by Corruption", ylab = "GDP Per Capita")
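aggregate() also accepts a formula, which keeps the original variable names in the result instead of Group.1 and x. The following is a sketch on invented values, assuming the same idea of averaging gdppc within each corruption group.

```r
# Hypothetical data, invented for illustration
toy <- data.frame(
  gdppc      = c(1000, 3000, 10000, NA, 40000, 20000),
  corrup_cat = factor(c("High", "High", "Medium", "Medium", "Low", "Low"),
                      levels = c("Low", "Medium", "High"))
)

# Formula interface: mean of gdppc within each level of corrup_cat.
# Rows with a missing gdppc are dropped by default (na.action = na.omit).
by_group <- aggregate(gdppc ~ corrup_cat, data = toy, FUN = mean)
by_group
```

With named columns, the bars can be labeled by passing names.arg = as.character(by_group$corrup_cat) to barplot().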
IV. QUESTIONS ABOUT THE RELATIONSHIP BETWEEN CONTINUOUS VARIABLES
Here we will try to determine the relationship between gdppc and corruption, leaving corruption as a
continuous score from 0 to 10 (corrup2). The most basic graph for two continuous variables is a scatter plot. The scatter
plot will show us that the two variables appear to be related. We will follow this visual intuition up with
two tests. First, a correlation tells us how tight the relationship is between the two variables. A correlation
close to 0 indicates that there is no linear relationship; a correlation close to 1 (or negative 1) indicates that
the relationship is very close to a perfectly straight line. Second, a regression tells us the nature of the line
in question. A line is defined by a slope and an intercept, and the regression output will tell you if either
of those is significantly different from zero.
Correlation
The correlation coefficient is a measure of association between two variables. The sign indicates whether the
relationship is positive (direct) or negative (inverse). 0 would be a non-relationship, while a value of 1 would be a
perfect positive relationship, and -1 would be a perfect inverse relationship.
cor(cs$gdppc, cs$corrup2, use = "complete.obs", method = "pearson")
Your output will be a single correlation coefficient.
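To make the correlation coefficient less of a black box, this sketch computes Pearson's r by hand on invented values and compares it with cor(). It also shows what use = "complete.obs" does: pairs with a missing value are dropped before the calculation.

```r
# Hypothetical paired values, invented for illustration
gdppc   <- c(40000, 30000, NA, 9000, 2000, 1500)
corrup2 <- c(1, 2, 3, 5, 8, 9)

# Keep only pairs where both values are present (what "complete.obs" does)
ok <- complete.cases(gdppc, corrup2)
x <- corrup2[ok]
y <- gdppc[ok]

# Pearson's r: co-variation scaled by the variation in each variable
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_builtin <- cor(gdppc, corrup2, use = "complete.obs", method = "pearson")

r_manual
r_builtin

# cor.test() adds a p-value and confidence interval to the same estimate
cor.test(gdppc, corrup2)$p.value
```

If you need more than the single coefficient, cor.test() is the usual next step, since cor() alone does not say whether the correlation differs significantly from zero.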
Scatter Plot
Scatter plots will place each value of gdppc on the y-axis and the country’s corresponding corrup2 score
on the x-axis. The abline command will draw a line of best fit.
plot(cs$corrup2, cs$gdppc,
main = "Gross Domestic Product Per Capita by Corruption",
ylab = "GDP Per Capita",
xlab = "Corruption")
abline(lm(cs$gdppc~cs$corrup2), col = "red", lwd = 2, lty = 1)
Regression
Regression provides additional information about the relationship between two variables. Specifically, it
allows you to predict the value of one variable given information about another variable. We will regress
gdppc (our dependent variable) on corrup2 (our independent variable).
First we specify the model and create it as an object
model1 <- lm(cs$gdppc ~ cs$corrup2)
Then we need to use the summary command to see the output
summary(model1)
This output can be used to write a linear equation:
gdppc = b0 + b1 * corrup2
If we wanted to add more independent variables we place + between them
model2 <- lm(cs$gdppc ~ cs$corrup2 + cs$oecd)
summary(model2)
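The linear equation can be checked directly. In the sketch below the data are invented so that gdppc follows a known line exactly, which lets you confirm that lm() recovers the intercept (b0) and slope (b1) and that predict() applies them. It uses the data = argument rather than the cs$ prefix; that is a choice made for this example, not the form used in the workshop code.

```r
# Hypothetical data built to follow gdppc = 50000 - 4000 * corrup2 exactly
toy <- data.frame(corrup2 = c(1, 3, 5, 7, 9))
toy$gdppc <- 50000 - 4000 * toy$corrup2

# Passing data = keeps variable names clean and lets predict() accept new data
fit <- lm(gdppc ~ corrup2, data = toy)

b <- coef(fit)   # b["(Intercept)"] is b0, b["corrup2"] is b1
b

# Predicted gdppc for a country with corrup2 = 4: b0 + b1 * 4
predict(fit, newdata = data.frame(corrup2 = 4))
```

Models written with the cs$ prefix fit the same line, but predict() generally cannot match new data to terms named cs$corrup2, so the data = form is worth considering when you plan to make predictions.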
Here are some commands to run regression diagnostics
par(mar=c(2,2,2,2), mfrow=c(2,2))
plot(model2)
vif(model2)
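The vif() call reports variance inflation factors, which measure how strongly each predictor is explained by the other predictors. As a rough sketch of the idea in base R (using invented numeric predictors, not the workshop model), the VIF for one predictor is 1 / (1 - R squared) from regressing it on the rest.

```r
# Hypothetical predictors, invented for illustration
x1 <- c(1, 2, 3, 4, 5, 6, 7, 8)
x2 <- c(2, 1, 4, 3, 6, 5, 8, 7)   # strongly related to x1

# VIF for x1: regress it on the other predictor(s), then 1 / (1 - R^2)
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1

# With exactly two predictors this equals 1 / (1 - r^2), r being their correlation
1 / (1 - cor(x1, x2)^2)
```

A VIF of 1 means a predictor is unrelated to the others; larger values mean its coefficient is estimated less precisely because it overlaps with them.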
Alternatively, we can use the ‘yhat’ package to look at how much each independent variable uniquely
explains the variability in gdppc and how much shared variance there is among the independent variables.
This is called a commonality analysis. In the command below, we list our dataframe name, the
dependent variable, and the two independent variables
commonalityCoefficients(cs,"gdppc",list("corrup2","oecd"))
The output is the amount of shared and unique variance.
V. TO EXPORT RESULTS AND TABLES
To export these results we can write them to a .csv file
write.csv(as.data.frame(summary(model2)$coefficients), file = "model2.csv")
We could also do this with any table that we make
write.csv(oecd_corruption_table, file = "table1.csv")
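To confirm that an exported file contains what you expect, you can read it straight back in. This sketch fits a small model on invented data, writes its coefficient table to a temporary file, and reloads it; the file path and data are placeholders, not workshop files.

```r
# Hypothetical data, invented for illustration
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)
)
fit <- lm(y ~ x, data = toy)

# The coefficient table is a matrix; convert it before writing
coef_table <- as.data.frame(summary(fit)$coefficients)

path <- tempfile(fileext = ".csv")
write.csv(coef_table, file = path)   # row names (term names) go in column 1

# Read it back, telling R that column 1 holds the row names
coef_back <- read.csv(path, row.names = 1, check.names = FALSE)
coef_back
```

Setting check.names = FALSE keeps column headers such as "Std. Error" as written instead of converting them to syntactic R names.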
If you are interested in learning more, see Garrett Grolemund and Hadley Wickham’s online,
open-source book “R for Data Science” at https://r4ds.had.co.nz/