Subsets, Types & Objects
Stat 579
Heike Hofmann

Loading data
• Import data with:
  - read.csv() for csv files
• Use file.choose() to help find your file
• Data are stored in a data.frame
• Check whether the import worked, e.g. with the functions summary, str, head, dim

Working Directories
• R uses a special folder to store information: the working directory
• getwd()
• setwd("path/to/working/directory")
• dirname(file.choose()) helps to find paths
Hint: start each project or new analysis by creating a new working directory

What happened?
• Loading data in R is sometimes frustrating. Work through the following example to see whether things worked and, if not, figure out why:
• data1 <- read.csv("http://www.public.iastate.edu/~hofmann/data/data1.csv")
• data2 <- read.csv("http://www.public.iastate.edu/~hofmann/data/data2.csv")
• data3 <- read.csv("http://www.public.iastate.edu/~hofmann/data/data3.csv")
• data4 <- read.csv("http://www.public.iastate.edu/~hofmann data/data4.csv")

Difficulties?
• Paths in Windows
  - Right-click on a file and open Properties
  - Use / or \\ between folders
• Need read/write permission for the folder

Outline
• Logical Operations in R:
  - Vectors
  - Operators
  - Subsetting Data
• Factors, Characters & Numbers

Vectors
• Everything in R is a vector (pretty much)
• A vector must contain elements of the same type
• Logical, vector of TRUEs and FALSEs (T and F)
• Numeric, vector of numbers
• Character, vector of strings
• Lists, vector of vectors

Properties of vectors
• All have length and mode
• str will tell you both, and give a sample of the contents
• All can contain missing values (NA)
• Beware "NA" vs NA
• c("NA", NA)

Logical vectors
• Usually created with a logical comparison
• <, >, ==, !=, <=, >=
• %in%
• subset

Logical expressions
• & and | are the logical and and or
• ! is the logical negation
• use parentheses () when linking expressions to avoid misinterpretation

Logical Operators
[Venn diagrams of two sets A and B]
• A & B is the set of elements that are in both A and B
• A | B is the set of elements that are in A, in B, or in both

Practice
• a <- c(1, 15, 3, 20, 5, 8, 9, 10, 1, 3)
• Define a logical vector z depending on whether values in a are:
  - even (look at a %% 2)
  - less than 20
  - squared value is at least 100 or less than 10
  - equal to 1 or 3

Updating subsets
• You can take a subset and update the original data
• a <- 1:4
  a[2:3] <- 0
  a
• Very useful with logical subsetting

And now for real ...
• The General Social Survey is a telephone survey done every two years since 1972
• asks participants about 400 questions
• http://www3.norc.org/gss+website/
• … let’s browse the website for a bit

Economical Status
• The file economical-status.csv is an extract of the General Social Survey, with variables
• region, income, happy, age, finrela, marital, degree, health, wrkstat, partyid, polviews, sex, year
• Go to the GSS website and find out more about these variables

Advanced
• Load the economical-status.csv data into R
• Find subsets for two years of your choice
• Plot income. Why is the order of the labels so strange? Is there a difference in income between the two years?
• Plot age. Is there an age difference between the two years?

[ + logical vectors
• The most complicated to understand, but the most powerful
• Lets you extract a subset defined by some characteristic of the data
• ty <- twoyears
  ty[ty$age > 90, ]

Practice
• What is the % of individuals with an income of more than $25,000 in the two years?
• Extract all records for 80 year old females (or males) in the midwest (or another region of your choosing). What is their general health?
• What is the record with the highest age? What does that mean? Check the codebook online and fix correspondingly.

Practice
• Check the website for a description of the levels for income.
• Make a new variable income2 in the gss data by copying all values of income. Replace all values in income2 that don’t correspond to intervals by the value NA.
• Advanced: what does the command levels() do? Use it to replace all intervals in the income2 variable by their midpoint, i.e. use 2000 instead of “$1000 TO $2999”; use the end point if the interval is one-sided

More about missings
• NA + x = NA, NA * x = NA
• x == NA is NA as well, so it cannot be used to find missing values
• is.na returns a logical vector flagging the missing values of a single vector
• na.omit removes all missing values from a vector
• complete.cases returns a logical vector flagging the rows of a data.frame that have no missing values
• Many functions have a parameter na.rm

Practice
• Compute average age and standard deviation for both years.
• What is the correlation between age and income2?

Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical variables are converted to factors
• factor creates a new factor with specified labels

Checking for, and casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor, integer, numeric, logical, character, ...) checks for a specific type
• as.XXX casts to a specific type

Casting between types
[Diagram: factor, character and numeric, connected by as.factor, as.character and as.numeric]

Factors
• factor variables often have to be re-ordered for ease of comparisons
• We can specify the order of the levels by explicitly listing them, see help(factor)
• We can make the order of the levels in one variable dependent on the summary statistic of another variable, see help(reorder)

Practice
• Plot barcharts of wrkstat, marital, degree.
• Based on the codebook, re-code the levels of degree by approximate years of education.
• Plot boxplots of age by marital status.
• Reorder income by income2.
• Reorder marital status by frequency.
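As a recap of the earlier slides on logical vectors, logical expressions and updating subsets, here is a minimal sketch on toy data. The vector x and the data frame people are made up for illustration and are not part of the GSS extract.

```r
# Toy data (hypothetical, not from the course files)
x <- c(4, 12, 7, 30, 2)

# A comparison returns a logical vector of the same length as x
big <- x >= 10                 # FALSE TRUE FALSE TRUE FALSE

# Combine conditions with & (and), | (or) and ! (not);
# parentheses make the grouping explicit
mid <- (x > 3) & !(x >= 10)    # TRUE FALSE TRUE FALSE FALSE

# %in% tests membership in a set of values
in_set <- x %in% c(2, 30)      # FALSE FALSE FALSE TRUE TRUE

# Logical subsetting keeps the elements where the condition is TRUE
x[mid]                         # 4 7

# Updating a subset: overwrite only the selected elements
y <- x
y[y >= 10] <- 0                # y is now 4 0 7 0 2

# The same idea selects rows of a data frame (note the trailing comma)
people <- data.frame(age = c(25, 91, 40), sex = c("F", "M", "F"))
old <- people[people$age > 90, ]
```

The row selection in the last line follows the same pattern as ty[ty$age > 90, ] on the slides: the logical vector goes in the row position and the empty column position keeps all variables.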
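The behaviour of missing values described on the "More about missings" slide can be tried out with a short sketch. The values of age and income below are invented, and the data frame name d is an assumption made for this example only.

```r
# Made-up values with some missing entries
age    <- c(34, NA, 51, 80, 22)
income <- c(2000, 5000, NA, 9000, 1500)

# NA propagates through arithmetic and comparisons
age + 1                        # second element stays NA
age == NA                      # NA for every element, never TRUE

# is.na is the reliable test for missingness
is.na(age)                     # FALSE TRUE FALSE FALSE FALSE

# The string "NA" is not a missing value
is.na(c("NA", NA))             # FALSE TRUE

# Summary functions return NA unless told to drop missing values
mean(age)                      # NA
mean(age, na.rm = TRUE)        # 46.75
sd(age, na.rm = TRUE)

# na.omit drops the missing entries of a vector
na.omit(age)

# complete.cases flags the rows of a data frame without any NA
d <- data.frame(age = age, income = income)
keep <- complete.cases(d)      # TRUE FALSE FALSE TRUE TRUE
d[keep, ]

# cor needs to be told how to treat missing values as well
r <- cor(d$age, d$income, use = "complete.obs")
```

For the practice questions, the same calls would be applied to the age and income2 columns of each yearly subset.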
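The factor slides (creating factors, casting between types, and re-ordering levels) can be illustrated with a small sketch. The labels "school", "college" and "grad" and the age values are hypothetical and do not correspond to the actual GSS codes.

```r
# Hypothetical categorical data
degree <- c("school", "college", "school", "grad", "college", "school")
age    <- c(30, 45, 20, 60, 50, 25)

# By default the levels are sorted alphabetically
f <- factor(degree)
levels(f)                      # "college" "grad" "school"

# Listing the levels explicitly fixes their order
f2 <- factor(degree, levels = c("school", "college", "grad"))
as.integer(f2)                 # underlying codes: 1 2 1 3 2 1

# Pitfall when casting: as.numeric on a factor returns the codes,
# so go through as.character to recover numbers stored as labels
yrs <- factor(c("12", "16", "12"))
as.numeric(yrs)                # 1 2 1
as.numeric(as.character(yrs))  # 12 16 12

# reorder sorts the levels by a summary (here the mean) of another variable
f3 <- reorder(f, age, FUN = mean)
levels(f3)                     # "school" "college" "grad"

# Ordering levels by frequency, most frequent first
f4 <- factor(degree, levels = names(sort(table(degree), decreasing = TRUE)))
levels(f4)                     # "school" "college" "grad"
```

Ordering by frequency is done here with table and sort rather than reorder; both approaches change only the level order, not the data values.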
Tables
• table(x) returns a frequency table of variable x
• for two or more variables, the corresponding contingency table of frequencies is returned
• ggfluctuation(table(x,y)) renders the matrix graphically (area proportional to cell count)

Practice
For all of these questions, answer by creating a table of numbers and a fluctuation diagram:
• are men or women happier?
• do men or women earn more?
• is health different for men or women?
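As a starting point for the table questions, here is a minimal sketch of one-way and two-way frequency tables. The vectors sex and happy are made-up toy data, not the GSS variables. The fluctuation diagram is left out so that the sketch runs in base R alone.

```r
# Toy data (hypothetical)
sex   <- c("F", "M", "F", "F", "M", "M", "F")
happy <- c("yes", "no", "yes", "no", "yes", "yes", "yes")

# One variable: frequency table
table(sex)                     # F: 4, M: 3

# Two variables: contingency table, rows = sex, columns = happy
tab <- table(sex, happy)
tab

# Single cells can be looked up by level name
tab["F", "yes"]                # 3

# Row proportions make groups of different size comparable
props <- prop.table(tab, margin = 1)
props
```

Comparing the rows of props, rather than the raw counts, is one way to answer a question such as "are men or women happier?" when the groups differ in size.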