Introduction to R/R-Studio This workshop is designed to introduce you to R using R-Studio and provide tips and tools you can use to get started with statistical analysis. When we are finished you will: Understand the components of the R-Studio interface Be able to enter, load, and describe data Use the basic building blocks of an R program Learn how to install and load packages Be able to syntax to analyze and graph data Generate new and recode existing variables Generate Descriptive Statistics Run Correlations and Linear Regressions Create Histograms, Bar Charts, and Scatter Plots SCENARIO: We are interested in studying the relationship between the amount of corruption in a country and the quality of their economy. DATA: To access the data for this workshop visit: https://tinyurl.com/r-intro-ctrl For this question, we are lucky as data already exists. We will use the “World Development Indicators” which are compiled annually by The World Bank. We are using data from a past year, and we have simplified the data set, but you can find the full data set here: http://data.worldbank.org/datacatalog/world-development-indicators. We are also lucky, because the World Bank saves there data in a way that make it easy to open, often when you find a data set (for example, old Census data), it can take a lot of effort just to transform the data into a usable form. And, of course, there is never a guarantee that someone else has created the data set you want in advance; often the first step in a research project is spending a long time looking up values to create the data set you need. Within the WDI data set, we will focus on the variables “corrup” and “gdppc.” The first is an index of perceived corruption within a country’s government. Citizens are surveyed, and asked questions like “Would you need to plan bribes to start a local business?” Note that corrupt is “reverse coded” so that high scores equal low corruption and ranges from 0 to 10. The second variable is Gross Domestic Product per Capita, which indicates how well the economy is doing, scaled for the population size of a country (for example, it is not very interesting to note that Botswana produces less than Brazil overall, but it might be interesting to note that Botswana produced more per citizen than Brazil in 2014). Note that, in using these variables from the WDI data set, we have given over the responsibility of determining what “corruption” and “quality of economy” means to the World Bank. We have also allowed dynamic concepts to become simple numbers. However, this allows us to move quickly to the later stages of analysis, which has its advantages. "Intro to R" created by Bill Harder and E.R. Schuler at American University's Center for Teaching, Research & Learning is licensed under CC BY SA 4.0 BACKGROUND ON R/R-STUDIO R is a programming language and an environment for statistical computing and graphics. It is a open source project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible. R can be downloaded here: https://cran.r-project.org/ R has some comands already built into it, but often you will need to add in additional packages written by others. This is the cool part of R being open source, there are well over 20,000 packages. If you are interested in using Base R: R Console: Default window when you open R. This is the command-line interface/output window. Editor: Click on File > New Script. This is the primary script/code window. You run commands in this window by pressing CTRL + R or run certain lines of code by highlighting the lines and pressing these commands. Data Editor: When you’ve loaded data, type fix(name_of_dataset) to view and edit your data in this window. R-Studio is an open-source interface that runs Base R and has a built in text editor with debugging features, a panel to view data sets, variables and functions, and a window to view installed packages, plots, and package user guides. The open source version of R-Studio can be downloaded here (be sure to download Base R first): https://www.rstudio.com I. R-STUDIO OVERVIEW Before we begin, it is important to become familiar with R-Studio, including becoming familiar with the menus, and the source pane (for R-scripts or code), the console (runs base R and will print output), the environment that lists lists, variables, functions, and datasets, and the plots/packages pane. II. SOME R BASICS # denots a comment in R and the line will be skipped. We can assign a number (5) to a variable we call x (note that R is case sensitive) x <- 5 x #prints out the value associated with x To remove an object we can type and run rm(x) "Intro to R" created by Bill Harder and E.R. Schuler at American University's Center for Teaching, Research & Learning is licensed under CC BY SA 4.0 We can create an object with multiple values by typing. Y <- c(1, 2, 3, 4) Y #prints out the object We can also have characters (strings) in an object. Z <- c(“Hi”, “Hey”, “Hello”) Z #prints out the strings We can create three vector objects. name <- c("Evelyn", "Erica", "Jose", "Javier") age <- c(33, 32, 40, 27) sex <- c("F", "F", "M", "M") Then we can combine them into a data frame. person.data <- data.frame(name, age, sex) person.data #prints the data frame We can also view this data like a spreadsheet. View(person.data) A "$" is used to show a specific value within a dataframe. If we wanted to specifically reference the "sex" variable within the dataframe "person.data" person.data$sex We could also use this approach to add a new variable to the data frame. person.data$school <- c("SIS", "SPA", "SOC", "SIS") person.data # Prints out the dataframe Whenever you need help in R you can search their extensive helpfiles with the following command. help.search("view") Or go to the package tab in the lower right pane. Let's clear everything and start fresh. rm(person.data, age, name, sex, y, z) "Intro to R" created by Bill Harder and E.R. Schuler at American University's Center for Teaching, Research & Learning is licensed under CC BY SA 4.0 Notes: CTRL + L also clears the consol If we want to install a package, we can do that without needing any administrative access. We just need to know the name of the package. We will install the ‘psych’ package that has a number of helpful functions. install.packages(“psych”) To load or activate a library and make the functions in that library accessible to us in this R session, type and run: library(psych) These are the other packages we will need today #install packages install.packages("psych") install.packages("car") install.packages("lmtest") install.packages("yhat") #load packages library(psych) library(car) library(lmtest) library(yhat) III. MANAGING DATA AND DESCRIPTIVE STATISTICS The WDI data comes as an Excel file. After we look briefly at the file in Excel, we will import the data into R. We also need to tell R where to look for and save files. You can also do this from the "Session" Menu. Go to -> "Session" -> "Set Working Directory" -> "Choose Directory" -> Select the folder that you want When you do this, copy and past the line of code that is run from the consol to your scipt. setwd("C:/Users/eschuler/Desktop") "Intro to R" created by Bill Harder and E.R. Schuler at American University's Center for Teaching, Research & Learning is licensed under CC BY SA 4.0 Importing Data Import the data from a .csv file cs<-read.csv("cs.csv", header=TRUE) For SPSS (must save SPSS dataset in trasport format) Install.package(“Hmisc”) library(Hmisc) mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE) # last option converts value labels to R factors For SAS install.package(“Hmisc”) library(Hmisc) mydata <- sasxport.get("c:/mydata.xpt") #character variables are converted to R factors For Stata install.package(“foreign”) library(foreign) mydata <- read.dta("c:/mydata.dta") To take a look at the column names of our dataframe and the first few entries in each, run: colnames(cs) str(cs) Obtaining Descriptive Statistics For most questions you might have about your data, you will want to look at descriptive statistics, then look at graphs, then run a statistical test. The "summary" funtion is in R by default, we can run it on every variable in the cs dataframe or choose a single variable in the dataframe by using ‘$’ summary(cs) summary(cs$oecd) When we installed the ‘psych’ package, the describe function became available to us. We can use that function to obtain a more detailed summary of the descriptive statistics. describe(cs) describe(cs$corrup) "Intro to R" created by Bill Harder and E.R. Schuler at American University's Center for Teaching, Research & Learning is licensed under CC BY SA 4.0 Recoding an Existing Variable into a New Variable As we mentioned early, corrup is reverse coded so higher corupp scores reflect lower perceived corruption. We will now reverse code corrupt so that higher scores reflect higher perceived corruption and save it as a new variable, corrup2. cs$corrup2 <- 10 - cs$corrup Recode Continuous Variables into Discrete Variable Sometimes we may want to turn a continuous variable into a categorical (low, medium, and high) variable. To do that: cs$corrup_cat[cs$corrup2 < 3.3] <- 1 cs$corrup_cat[cs$corrup2 >= 3.3 & cs$corrup2 < 6.6] <- 2 cs$corrup_cat[cs$corrup2 >= 6.61] <- 3 table(cs$corrup_cat) Then we can add some labels. cs$corrup_cat <- ordered(cs$corrup_cat, levels = c(1,2,3), labels = c( "Low", "Medium", "High")) table(cs$corrup_cat) If we look at the variable oecd, it is dummy coded 0 and 1. We can attach value labels so that 0 = No and 1 = Yes. cs$oecd <- factor(cs$oecd, levels = c(0,1), labels = c("No", "Yes")) table(cs$oecd) Crosstabs allows us to look at 2 cateorical variables at the same time in a table and we can save this table as an object. oecd_corruption_table <- table(cs$oecd,cs$corrup_cat) oecd_corruption_table If we want to see row proportions prop.table(oecd_corruption_table, 1) If we want to see column proportions prop.table(oecd_corruption_table, 2) If we want sums addmargins(oecd_corruption_table) "Intro to R" created by Bill Harder and E.R. Schuler at American University's Center for Teaching, Research & Learning is licensed under CC BY SA 4.0 Alternatively, we can combine the two previous commands addmargins(prop.table(oecd_corruption_table, 1)) We can also do a chi square test summary(oecd_corruption_table) If we wanted to visualize the GDPPC amounts we can create a Histograms hist(cs$gdppc) hist(cs$gdppc, main = "GDP Per Capita", sub="Source: WDI 2010, The World Bank Group", xlab = " ") If we want to calculate the average corruption for each of our groups we can use the aggregate command gdp_corrup_table <- aggregate(cs[, c('gdppc')], list(cs[, c('corrup_cat')]), mean, na.rm=TRUE) Then we could generate a bar graph using this data barplot(gdp_corrup_table$x, main = "GDP by Corruption", ylab = "GDP Per Capita") IV. QUESTIONS ABOUT THE RELATIONSHIP BETWEEN CONTINUOUS VARIABLES Here we will try to determine the relationship between gdppc and corruption, leaving corruption on its original scale from 0 to 10. The most basic graph for two continuous variables is a scatter plot. The scatter plot will show us that the two variables appear to be related. We will follow this visual intuition up with two tests: First, a correlation tells us how tight the relationship is between the two variables. A correlation close to 0 indicates that that there is no relationship, a correlation close to 1 (or negative 1) indicates that the relationship is very close to a perfectly straight line. Second, a regression tells us the nature of the line in question. A line is defined by a slope and an intercept, and the regression output will tell you if either of those is significantly different than zero. Correlation The correlation coefficient is a measure of association between two variables. The sign indicates inverse versus directly proportional relationships. 0 would be a non-relationship, while a value of 1 would be a perfect positive relationship, and -1 would be a perfect inverse relationship. cor(cs$gdppc, cs$corrup2, use = "complete.obs", method = "pearson") Your output will be a single correlation coefficient. "Intro to R" created by Bill Harder and E.R. Schuler at American University's Center for Teaching, Research & Learning is licensed under CC BY SA 4.0 Scatter Plot Scatter plots will place each value of gdppc on the y-axis and the country’s corresponding corrup2 score on the x-axis. The abline command will draw a line of best fit. plot(cs$corrup2, cs$gdppc, main = "Gross Domestic Product Per Capita by Corruption", ylab = "GDP Per Capita", xlab = "Corruption") abline(lm(cs$gdppc~cs$corrup2), col = "red", lwd = 2, lty = 1) Regression Regression provides additional information about the relationship between two variables. Specifically, it allows you to guess the value of one variable given information about another variable. We will regress gdppc (our dependent variable) over corrup2 (our independent variable). First we specify the model and create it as an object model1 <- lm(cs$gdppc ~ cs$corrup2) Then we need to use the summary command to see the output summary(model1) This output can be used to make a Linear equation: Gdppc = b0 + b1* Corruption2 If we wanted to add more independent variables we place + between them model2 <- lm(cs$gdppc ~ cs$corrup2 + cs$oecd) summary(model2) Here are some commands to run regression diagnostics par(mar=c(2,2,2,2), mfrow=c(2,2)) plot(model2) vif(model2) "Intro to R" created by Bill Harder and E.R. Schuler at American University's Center for Teaching, Research & Learning is licensed under CC BY SA 4.0 Alternatively, we can us the ‘yhat’ package to look at the how each independent variable can uniquely explain the variability in gdppc and how much shared variance there is among the independent variables. This is called a commonality analysis. In the command below, we list our dataframe name, the dependent variable, and list the two independent variables commonalityCoefficients(cs,"gdppc",list("corrup2","oecd")) The output is the amount of shared and unique variance. V. TO EXPORT RESULTS AND TABLES To export these results we can write them to a .csv file write.csv (as.data.frame(summary(model2)$coef), file = "model2.csv") We could also do this with any table that we make write.csv(oecd_corruption_table, file = "table1.csv") If you are interested in learning more, see Garret Grolemund and Hadley Wickham’s online and open sourced book “R for Data Science” at https://r4ds.had.co.nz/ "Intro to R" created by Bill Harder and E.R. Schuler at American University's Center for Teaching, Research & Learning is licensed under CC BY SA 4.0