Introduction to R

"Statistics are no substitute for judgment"
Henry Clay, U.S. congressman and senator

R
- R is a free software environment for statistical computing and graphics
- Object-oriented
- It runs on a wide variety of platforms
- Highly extensible
- Command line and GUI
- Conflict between extensible and GUI

RStudio
- Scripts
- Results
- Datasets
- Files, plots, packages, & help

Creating a project
- Store all R scripts and data in the same folder or directory by creating a project
- File > New Project…

Script
- A script is a set of R commands
- A program

    # CO2 parts per million for 2000-2009
    co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
    year <- (2000:2009) # a range of values
    # show values
    co2
    year
    # compute mean and standard deviation
    mean(co2)
    sd(co2)
    plot(year,co2)

- c is short for combine in c(369.40, …)

Exercise
Plot kWh per square foot by year for the following University of Georgia data.

    year   sqfeet       kWh
    2007   14,214,216   2,141,705
    2008   14,359,041   2,108,088
    2009   14,752,886   2,150,841
    2010   15,341,886   2,211,414
    2011   15,573,100   2,187,164
    2012   15,740,742   2,057,364
Smart editing
1. Copy each column to a word processor
2. Convert table to text
3. Search and replace commas with nothing
4. Search and replace returns with commas
5. Edit to put R text around numbers

    # Data in R format
    year <- (2007:2012)
    sqft <- c(14214216, 14359041, 14752886, 15341886, 15573100, 15740742)
    kwh <- c(2141705, 2108088, 2150841, 2211414, 2187164, 2057364)

Datasets
- A dataset is a table
- One row for each observation
- Columns contain observation values
- Same as the relational model
- R supports multiple data structures and multiple data types

Data structures

Vector
- A single row table where data are all of the same type

    co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
    year <- (2000:2009)
    co2[2] # get the second value

Matrix
- A table where all data are of the same type

    m <- matrix(1:12, nrow=4, ncol=3)
    m[4,3] # value in row 4, column 3

Exercise
Create a matrix with 6 rows and 3 columns containing the numbers 1 through 18

Array
- Extends a matrix beyond two dimensions

    a <- array(1:24, c(4,3,2))
    a[1,1,1]

Data frame
- Same as a relational table
- Columns can have different data types
- Typically, read a file to create a data frame

    gender <- c("m","f","f")
    age <- c(5,8,3)
    df <- data.frame(gender,age)
    df[1,2] # row 1, column 2
    df[1,]  # all of row 1
    df[,2]  # all of column 2

List
- An ordered collection of objects
- Can store a variety of objects under one name

    l <- list(co2,m,df)
    l[[3]]    # third object in the list (the data frame)
    l[[1]][2] # second element of the first object in the list

Logical operations

    Logical operator   Symbol
    EQUAL              ==
    AND                &
    OR                 |
    NOT                !
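The operators in the table work element by element on vectors, which makes them useful for picking rows out of a data frame. The following is a minimal sketch, assuming the small gender/age data frame from the data frame slide; the result names (is_female, older_girls, and so on) are made up for illustration.

```r
# Small illustrative data frame (same shape as the data frame example)
gender <- c("m", "f", "f")
age <- c(5, 8, 3)
df <- data.frame(gender, age)

# EQUAL: compare every element of the column with "f"
is_female <- df$gender == "f"

# AND: rows where both conditions are true
older_girls <- df[df$gender == "f" & df$age > 4, ]

# OR: rows where at least one condition is true
young_or_male <- df[df$gender == "m" | df$age < 4, ]

# NOT: reverse a logical vector
not_female <- df[!is_female, ]
```

Each comparison returns a logical vector with one value per row, and df[condition, ] keeps the rows where that value is TRUE.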
Objects
- Anything that can be assigned to a variable
- Constant
- Data structure
- Function
- Graph
- …

Types of data
- Classification: Nominal
- Sorting or ranking: Ordinal
- Measurement: Interval, Ratio

Factors
- Nominal and ordinal data are factors
- Before R 4.0.0, strings were converted to factors by default when a data frame was created or read; from R 4.0.0 on, they stay as character unless stringsAsFactors=TRUE is specified
- Factors determine how data are analyzed and presented
- Failure to realize a column contains a factor can cause confusion
- Use str() to inspect a data frame's structure

Missing values
- Missing values are indicated by NA (not available)
- Arithmetic expressions and functions containing missing values generate missing values

    sum(c(1,NA,2))

- Use the na.rm=TRUE option to exclude missing values from calculations

    sum(c(1,NA,2),na.rm=TRUE)

- Remove rows with missing values by using na.omit()

    gender <- c("m","f","f","f")
    age <- c(5,8,3,NA)
    df <- data.frame(gender,age)
    df2 <- na.omit(df)

Packages
- R's base set of packages can be extended by installing additional packages
- Over 4,000 packages
- Search the R Project site to identify packages and functions
- Install using RStudio
- Packages must be installed prior to use and their use specified in a script: library(packagename)

    # install ONCE on your computer
    # can also use RStudio to install
    install.packages("knitr")
    # library EVERY TIME before using a package in a session
    # loads the package into memory
    library(knitr)

Exercise
Install the package birk and use one of its functions to do the following conversions:
- 100°F to °C
- 100 meters to feet

Compile a notebook
- A notebook is a report of an analysis
- Interweaves R code and output
- File > Compile Notebook …
- Select html, pdf, or Word output
- Install knitr before use
- Install suggested packages

Reading a file
- R can read a wide variety of input formats
- Text
- Statistical package formats (e.g., SAS)
- DBMS

Reading a text file
- Delimited text file, such as CSV
- Creates a data frame
- Specify as required: presence of header, separator, row names

Note: R will not find the following local file on your computer.
    t <- read.table("~/Dropbox/Documents/R/Data/centralparktemps.txt", header=TRUE, sep=',')

Reading a text file
- Read a file using a URL

    url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
    t <- read.table(url, header=TRUE, sep=',')

Learning about an object
- Click on the name of the file in the top-right window to see its content
- Click on the blue icon of the file in the top-right window to see its structure

    url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
    t <- read.table(url, header=TRUE, sep=',')
    head(t)  # first few rows
    tail(t)  # last few rows
    dim(t)   # dimension
    str(t)   # structure of a dataset
    class(t) # type of object

Referencing data
- datasetName$columnName (the data set, then the column)

    url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
    t <- read.table(url, header=TRUE, sep=',')
    # qualify with the data set name to reference columns
    mean(t$temperature)
    max(t$year)
    range(t$month)

Creating a new column

    library(birk)
    url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
    t <- read.table(url, header=TRUE, sep=',')
    # compute Celsius (unit names are passed as strings)
    t$Ctemp <- round(conv_unit(t$temperature,'F','C'),1)

Reshaping
- Converting data from one format to another
- Melt: wide to narrow
- Cast: narrow to wide (the reverse of melt)

Narrow format:

    year  month  co2
    1959  1      315.62
    1959  2      316.38
    1959  3      316.71
    1959  4      317.72
    1959  5      318.29
    1959  6      318.15
    1959  7      316.54
    1959  8      314.80
    1959  9      313.84
    1959  10     313.26
    1959  11     314.80
    1959  12     315.58

Wide format:

    Year  1       2       3       4       5       6       7       8      9       10      11     12
    1959  315.62  316.38  316.71  317.72  318.29  318.15  316.54  314.8  313.84  313.26  314.8  315.58

External files & RStudio server
- Upload a file
- Download a file: More > Export …

Reshaping

    library(reshape)
    url <- 'http://people.terry.uga.edu/rwatson/data/meltExample.csv'
    s <- read.table(url, header=FALSE, sep=',')
    colnames(s) <- c('year', 1:12)
    # melt (normalization)
    m <- melt(s,id='year')
    colnames(m) <- c('year','month','co2')
    # cast: reverse of melt
    c <- cast(m,year~month, value='co2')

Writing files

    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read.table(url, header=TRUE, sep=',')
    # compute Celsius and round to one decimal place
    t$Ctemp <- round((t$temperature-32)*5/9,1)
    colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
    write.table(t,"centralparktempsCF.txt")

- The file is stored in the project's folder

sqldf
- An R package for using SQL with data frames
- Returns a data frame
- Supports MySQL

Subset
- Selecting rows

    library(sqldf)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    trowSQL <- sqldf("select * from t where year = 1999")

- Selecting columns

    tcolSQL <- sqldf("select year, month, Ctemp from t")

- Selecting rows and columns

    trowcolSQL <- sqldf("select year, month, Ctemp from t where year > 1989 and year < 2000")

Sort
- Sorting on column name

    sSQL <- sqldf("select * from t order by year desc, month")

Recoding
- Some analyses might be facilitated by the recoding of data
- Split a continuous measure into two categories

    t$Category <- 'Other'
    t$Category[t$Ftemp >= 30] <- 'Hot'

Deleting a column

    t$Category <- NULL

Exercise
- Download the spreadsheet of monthly mean CO2 measurements (PPM) taken at the Mauna Loa Observatory from 1958 onwards: http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html
- Export a CSV file that contains three columns: year, month, and average CO2
- Read the file into R
- Recode missing values (-99.99) to NA
- Plot year versus CO2

Summarizing data

    library(sqldf)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read.table(url, header=TRUE, sep=',')
    w <- sqldf("select year, avg(temperature) as mean from t group by year")

Merging files
- There must be a common column in both files

    library(sqldf)
    options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read.table(url, header=TRUE, sep=',')
    # average monthly temp for each year
    a <- sqldf("select year, avg(temperature) as mean from t group by year")
    # read yearly carbon data
    # (source: http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html)
    url <- 'http://people.terry.uga.edu/rwatson/data/carbon1959-2011.txt'
    carbon <- read.table(url, header=TRUE, sep=',')
    m <- sqldf("select a.year, CO2, mean from a, carbon where a.year = carbon.year")

Correlation coefficient

    cor.test(m$mean,m$CO2)

        Pearson's product-moment correlation

    data:  m$mean and m$CO2
    t = 3.1173, df = 51, p-value = 0.002997
    95 percent confidence interval:
     0.1454994 0.6049393
    sample estimates:
          cor
    0.4000598

- The correlation is significant (p-value = 0.002997)

Concatenating files
- Taking a set of files with the same structure and creating a single file
- Same type of data in corresponding columns
- Files should be in the same directory

Concatenating files: local directory

    # read the file names from a local directory
    # (pattern is a regular expression matching names that end in .csv)
    filenames <- list.files("homeC-all/homeC-power", pattern="\\.csv$", full.names=TRUE)
    # append the files one after another
    for (i in seq_along(filenames)) {
      # create the concatenated data frame using the first file
      if (i == 1) {
        cp <- read.table(filenames[i], header=FALSE, sep=',')
      } else {
        temp <- read.table(filenames[i], header=FALSE, sep=',')
        cp <- rbind(cp, temp) # append to the existing data frame
        rm(temp)              # remove the temporary data frame
      }
    }
    colnames(cp) <- c('time','watts')

- Takes a while to run

Concatenating files: remote directory with FTP

    # read the file names from a remote directory (FTP)
    library(RCurl)
    url <- "ftp://watson_ftp:bulldawg1989@people.terry.uga.edu/rwatson/power/"
    dir <- getURL(url, dirlistonly = TRUE)
    filenames <- unlist(strsplit(dir,"\n")) # split into filenames
    # append the files one after another
    for (i in seq_along(filenames)) {
      file <- paste(url,filenames[i],sep='') # concatenate for url
      if (i == 1) {
        cp <- read.table(file, header=FALSE, sep=',')
      } else {
        temp <- read.table(file, header=FALSE, sep=',')
        cp <- rbind(cp, temp) # append to the existing data frame
        rm(temp)              # remove the temporary data frame
      }
    }
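The loop-and-rbind pattern above can also be written without an explicit loop. The following is a minimal sketch of that alternative, not the approach used in the slides: it reads every file into a list with lapply() and binds the list once with do.call(rbind, ...). The temporary folder, file names, and values are invented so that the example runs on its own.

```r
# Create two small CSV files in a temporary folder (illustrative data only)
demo_dir <- file.path(tempdir(), "power-demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(1:2, c(100, 110)), file.path(demo_dir, "a.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(3:4, c(120, 130)), file.path(demo_dir, "b.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# Read the file names, then read each file into one element of a list
filenames <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
frames <- lapply(filenames, read.table, header = FALSE, sep = ",")

# Bind all data frames in a single step and name the columns
cp <- do.call(rbind, frames)
colnames(cp) <- c("time", "watts")
```

Binding once at the end avoids copying the growing data frame on every iteration, which is one likely reason the loop version takes a while to run on many files.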
Database access
- MySQL access

    library(RMySQL)
    conn <- dbConnect(RMySQL::MySQL(), host="richardtwatson.com", dbname="Weather", user="db2", password="student")
    # query the database and create data frame t for use with R
    t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
    head(t)

Exercise
Using the Atlanta weather database and the lubridate package:
- Compute the average temperature at 5 pm in August
- Determine the maximum temperature for each day in August for each year

Resources
- R books
- Reference card
- Quick-R

Key points
- R is a platform for a wide variety of data analytics
  - Statistical analysis
  - Data visualization
  - HDFS and MapReduce
  - Text mining
  - Energy Informatics
- R is a programming language
- Much to learn