Subsets, Types & Objects Stat 579
 Heike Hofmann

1
Loading data
• Import data with read.csv() for csv files
• Use file.choose() to help find your file
• Data are stored in a data.frame
• Check whether the import worked, e.g. with the functions summary, str, head, dim
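The import-and-check steps can be sketched as follows. The file and its contents are made up for illustration (written to a temporary location so the example runs anywhere); with real data, the path would come from file.choose() instead.

```r
# Write a small, made-up csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,income",
             "Ann,34,52000",
             "Bob,51,48000",
             "Cy,29,NA"), csv_path)

# Import the file; read.csv() returns a data.frame
people <- read.csv(csv_path)

# Check whether the import worked
dim(people)      # number of rows and columns
head(people)     # first few rows
str(people)      # type and a sample of each column
summary(people)  # per-column summaries, including the count of NAs
```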
2
Working Directories
• R uses a special folder to store information: the
working directory
• getwd()
• setwd("path/to/working/directory")
• dirname(file.choose()) helps to find
paths
Hint: start each project/each new analysis
by creating a new working directory
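A minimal sketch of the one-directory-per-project hint. The folder name "my-analysis" is hypothetical, and it is created under tempdir() only so that the example leaves no files behind.

```r
old_wd <- getwd()                       # remember the current working directory

# Hypothetical project folder, created inside R's temporary directory
project_dir <- file.path(tempdir(), "my-analysis")
dir.create(project_dir, showWarnings = FALSE)
setwd(project_dir)                      # make it the working directory

# Relative paths are now resolved inside project_dir
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)
file.exists("example.csv")

setwd(old_wd)                           # restore the previous working directory
```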
3
What happened?
• Loading data in R is sometimes frustrating. Work through the
following examples to see whether things worked and, if not,
figure out why:
• data1 <- read.csv("http://www.public.iastate.edu/~hofmann/data/data1.csv")
• data2 <- read.csv("http://www.public.iastate.edu/~hofmann/data/data2.csv")
• data3 <- read.csv("http://www.public.iastate.edu/~hofmann/data/data3.csv")
• data4 <- read.csv("http://www.public.iastate.edu/~hofmann data/data4.csv")
4
Difficulties?
• Paths in Windows
- Right-click on a file and open Properties to see its location
- Use / or \\ between folders; a single \ starts an escape sequence in an R string
• You need read/write permission for the folder
5
Outline
• Logical Operations in R:
- Vectors
- Operators
- Subsetting Data
• Factors, Characters & Numbers
6
Vectors
• Everything in R is a vector (pretty much)
• A vector must contain elements of the same type:
- Logical, vector of TRUEs and FALSEs (T and F)
- Numeric, vector of numbers
- Character, vector of strings
• Lists, vectors of vectors; their elements may differ in type
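A short sketch of the vector types listed above, using made-up values. It also shows what happens when types are mixed inside c(): R coerces everything to one common type, whereas a list keeps each element as it is.

```r
flags  <- c(TRUE, FALSE, T)       # logical (T is shorthand for TRUE)
counts <- c(1, 15, 3)             # numeric
words  <- c("a", "b")             # character

mode(flags)
mode(counts)
mode(words)

# Mixing types in c() coerces all elements to a single type (here: character)
mixed <- c(1, "a", TRUE)
mixed

# A list keeps the type of each element
record <- list(1, "a", TRUE)
sapply(record, mode)
```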
7
Properties of vectors
• All have length and mode
• str will tell you both, and give a sample of
the contents
• All can contain missing values (NA)
• Beware "NA" (a character string) vs NA (a missing value)
• c("NA", NA)
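The difference between the string "NA" and the missing value NA can be checked directly. This is a small sketch using only the vector from the slide.

```r
x <- c("NA", NA)

length(x)   # 2
mode(x)     # "character"
str(x)      # type, length and a sample of the contents

# Only the second element is actually missing
is.na(x)    # FALSE TRUE
```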
8
Logical vectors
• Usually created with a logical comparison
• <, >, ==, !=, <=, >=
• %in%
• subset
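A sketch of how these comparisons produce logical vectors. The ages and the small data frame are made-up values, not taken from the course data.

```r
ages <- c(23, 67, 45, 91, 18)     # made-up values

ages > 40                          # FALSE TRUE TRUE TRUE FALSE
ages != 45                         # TRUE TRUE FALSE TRUE TRUE
ages %in% c(18, 91)                # FALSE FALSE FALSE TRUE TRUE

# subset() keeps the rows of a data.frame for which the condition is TRUE
people <- data.frame(age = ages, sex = c("F", "M", "F", "F", "M"))
older  <- subset(people, age > 40)
older
```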
9
Logical expressions
• & and | are the logical and and or
• ! is the logical negation
• Use parentheses () when linking
expressions to avoid misinterpretation
10
Logical Operators
[Figure: Venn diagrams of two sets A and B]
• A & B is the set of elements
that are in both A and B
• A | B is the set of elements
that are in A, in B, or in both
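The operators can be tried on a small made-up vector. The last two lines illustrate why parentheses matter: & is evaluated before |, so the two expressions differ.

```r
x <- c(2, 7, 12, 25, 40)          # made-up values

A <- x > 5
B <- x < 20

A & B                              # in A and in B: FALSE TRUE TRUE FALSE FALSE
A | B                              # in A or in B or in both: all TRUE
!A                                 # not in A: TRUE FALSE FALSE FALSE FALSE

# Without parentheses, & is evaluated before |
no_parens   <- x < 5 | x > 10 & x %% 2 == 1
with_parens <- (x < 5 | x > 10) & x %% 2 == 1
no_parens                          # TRUE FALSE FALSE TRUE FALSE
with_parens                        # FALSE FALSE FALSE TRUE FALSE
```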
11
Practice
• a <- c(1, 15, 3, 20, 5, 8, 9, 10, 1, 3)
• Define a logical vector z depending on
whether values in a are:
- even (look at a %% 2)
- less than 20
- squared value is at least 100 or less than 10
- equal to 1 or 3
12
Updating subsets
• You can take a subset and update the
original data
• a <- 1:4
  a[2:3] <- 0
  a
• Very useful with logical subsetting
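A sketch of updating a subset that is selected by a logical condition rather than by position. The code -1 for "no answer" is a hypothetical convention chosen only for this example.

```r
a <- c(1, 15, 3, 20, 5, 8, 9, 10, 1, 3)

# Replace every value above 9 by NA
a[a > 9] <- NA
a                                  # 1 NA 3 NA 5 8 9 NA 1 3

# Hypothetical survey variable in which -1 stands for "no answer"
answer <- c(34, -1, 61, -1)
answer[answer == -1] <- NA
answer                             # 34 NA 61 NA
```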
13
And now for real ...
• The General Social Survey is a survey of US adults,
conducted since 1972 (every two years since 1994)
• It asks participants about 400 questions
• http://www3.norc.org/gss+website/
• … let’s browse the website for a bit
14
Economical Status
• The file economical-status.csv is an extract
of the General Social Survey, with variables
• region, income, happy, age, finrela, marital,
degree, health, wrkstat, partyid, polviews,
sex, year
• Go on the GSS website and find out more
about these variables
15
Advanced
• Load the economical-status.csv data into R
• Find subsets for two years of your choice
• Plot income. Why is the order of the labels so
strange? Is there a difference in income between the
two years?
• Plot age. Is there an age difference between the
two years?
16
[ + logical vectors
• The most complicated to understand, but
the most powerful
• Lets you extract a subset defined by some
characteristic of the data
• ty <- twoyears
  ty[ty$age > 90, ]
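A sketch of logical row selection on a data frame. The data frame below is a made-up stand-in for twoyears with only three of its columns; the point to notice is how a missing age behaves inside the condition.

```r
# Made-up stand-in for the twoyears data
twoyears <- data.frame(year = c(2000, 2000, 2008, 2008),
                       age  = c(34, 92, NA, 95),
                       sex  = c("FEMALE", "MALE", "FEMALE", "FEMALE"))
ty <- twoyears

# Rows where the condition is TRUE; an NA in the condition yields a row of NAs
over90 <- ty[ty$age > 90, ]
over90

# which() keeps only the positions where the condition is TRUE
over90_clean <- ty[which(ty$age > 90), ]
over90_clean

# Conditions can be combined, and columns selected at the same time
ty[which(ty$age > 90 & ty$year == 2008), c("age", "sex")]
```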
17
Practice
• What is the % of individuals with an income of
more than $25,000 in the two years?
• Extract all records for 80 year old females (or
males) in the midwest (or another region of
your choosing). What is their general health?
• What is the record with the highest age? What does that mean? - check the codebook
online. Fix correspondingly.
18
Practice
• Check the website for a description of the levels
for income.
• Make a new variable income2 in the gss data by
copying all values of income. Replace all values in
income2 that don’t correspond to intervals by the
value NA.
• Advanced: what does the command levels() do? Use
it to replace all intervals in the income2 variable by
their midpoint, i.e. use 2000 instead of “$1000 TO
$2999”. Use the end point if the interval is one-sided.
19
More about missings
• NA + x = NA, NA * x = NA
• x == NA is always NA, so it cannot be used to find missing values
• is.na returns a logical vector that flags the missing values
• na.omit removes all missing values from a
vector
• complete.cases flags the rows of a data.frame that
have no missing values
• Many functions have a parameter na.rm
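The behaviour of missing values can be sketched on made-up data. mean() is used here as one example of a function with an na.rm parameter.

```r
x <- c(4, NA, 10)                  # made-up values

NA + 1                             # NA
x == NA                            # NA NA NA: comparing with NA never gives TRUE
is.na(x)                           # FALSE TRUE FALSE

mean(x)                            # NA
mean(x, na.rm = TRUE)              # 7
x_clean <- na.omit(x)              # the two non-missing values

# complete.cases() flags the rows of a data.frame without any missing value
df <- data.frame(age = c(30, NA, 50), income = c(1, 2, NA))
complete.cases(df)                 # TRUE FALSE FALSE
df[complete.cases(df), ]
```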
20
Practice
• Compute average age and standard deviation
for both years.
• What is the correlation between age and
income2?
21
Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical
variables are converted to factors
• factor creates a new factor with specified
labels
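A sketch of turning numeric category codes into a factor. The codes and labels are hypothetical and do not come from the GSS codebook.

```r
# Hypothetical numeric codes for a categorical variable
codes <- c(1, 2, 2, 3, 1)

marital <- factor(codes, levels = 1:3,
                  labels = c("single", "married", "widowed"))
marital

levels(marital)                    # the labels
as.integer(marital)                # the underlying integer codes
table(marital)                     # counts per category
```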
22
Checking for, and
casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor,
integer, numeric, logical,
character, ... ) checks for specific type
• as.XXX casts to specific type
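A few type checks and casts on made-up values. Note that a cast that cannot succeed gives NA together with a warning, which is suppressed here only to keep the output quiet.

```r
x <- c("10", "20", "thirty")       # made-up character data

is.character(x)                    # TRUE
is.numeric(x)                      # FALSE

# "thirty" cannot be converted, so it becomes NA (R also issues a warning)
num <- suppressWarnings(as.numeric(x))
num                                # 10 20 NA

as.character(1:3)                  # "1" "2" "3"
as.integer(3.9)                    # 3: the fractional part is dropped
```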
23
Casting between types
[Diagram: conversions among factor, character and numeric]
• as.factor casts to factor
• as.character casts to character
• as.numeric casts to numeric
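One conversion deserves a sketch of its own: applying as.numeric directly to a factor returns the internal level codes, not the numbers shown in the labels. Going through as.character first recovers the values. The data are made up.

```r
f <- factor(c("10", "5", "10", "20"))

levels(f)                          # "10" "20" "5": sorted as strings

as.numeric(f)                      # 1 3 1 2: the level codes
as.numeric(as.character(f))        # 10 5 10 20: the values in the labels
```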
24
Factors
• factor variables often have to be re-ordered for
ease of comparisons
• We can specify the order of the levels by explicitly
listing them, see help(factor)
• We can make the order of the levels in one variable
dependent on the summary statistic of another
variable, see help(reorder)
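Both ways of ordering levels can be sketched on made-up data. The level names and ages below are illustrative only, and the median is just one possible choice of summary statistic for reorder.

```r
# Explicit order: list the levels in the order you want
degree <- factor(c("bachelor", "high school", "graduate", "high school"))
levels(degree)                     # alphabetical by default
degree <- factor(degree, levels = c("high school", "bachelor", "graduate"))
levels(degree)

# Order by a summary statistic of another variable
marital <- factor(c("married", "single", "widowed",
                    "single", "married", "widowed"))
age <- c(45, 23, 78, 30, 50, 81)
marital_by_age <- reorder(marital, age, FUN = median)
levels(marital_by_age)             # "single" "married" "widowed"
```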
25
Practice
• Plot barcharts of wrkstat, marital, degree.
• Based on the codebook, re-code the levels of degree by
approximate years of education.
• Plot boxplots of age by marital status.
• Reorder income by income2.
• Reorder marital status by frequency.
26
Tables
• table(x) returns a frequency table of
variable x
• for two or more variables, the contingency
table of joint frequencies is returned
• ggfluctuation(table(x,y)) renders the
table graphically (area proportional to cell
count)
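A sketch of one-way and two-way tables on made-up data. Only base R is used, so the plotting step is left out; prop.table is added as one way to turn counts into row proportions.

```r
# Made-up data
sex   <- c("F", "M", "F", "F", "M", "M")
happy <- c("happy", "happy", "not happy", "happy", "not happy", "happy")

table(sex)                         # frequency table of one variable

counts <- table(sex, happy)        # contingency table of two variables
counts

# Proportions within each row (here: within each sex)
row_props <- prop.table(counts, margin = 1)
row_props
```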
27
Practice
For all of these questions answer by creating a
table of numbers and a fluctuation diagram:
• are men or women happier?
• do men or women earn more?
• is health different for men or women?
28