Basics_of_Data_Analysis_Lecture

Basic Data Analysis Using R
Xiao He
AGENDA
1. Data cleaning (e.g., missing values)
2. Descriptive statistics
3. t-tests
4. ANOVA
Data visualization
5. Linear regression
1. DATA CLEANING
NA and NaN:
1. NA (Not Available): missing values.
a). Represented in the form of NA, or in the form of <NA>.
Note: You may have coded missing values using other notations. For example, some researchers/programs use 99 or -999 to denote missing values. If that is the case, you need to check for missing values using other methods that won’t be covered here.
2. NaN (Not a Number): when an arithmetic operation returns a non-numeric result: e.g., in R, 0/0 gives you NaN.
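As a small illustration of the distinction above, the sketch below builds a made-up vector (the values are purely illustrative) and shows how is.na() and is.nan() treat NA and NaN.

```r
# A made-up vector containing one NA and one NaN (0/0 is NaN in R).
x <- c(3, NA, 0/0, 7)

# is.na() flags both NA and NaN; is.nan() flags only NaN.
is.na(x)    # FALSE  TRUE  TRUE FALSE
is.nan(x)   # FALSE FALSE  TRUE FALSE

# Summary functions propagate missing values unless told to drop them.
mean(x)                 # missing, not a usable number
mean(x, na.rm = TRUE)   # 5 (mean of 3 and 7)
```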
3. How to deal with NA and NaN?
a). Check how many cases (rows) do NOT have NA or NaN:
complete.cases()
#Returns Booleans (TRUE, FALSE). TRUE means no missing
#value (in a given row), and FALSE means there is at
#least one missing value.
Ex1.1: (refer to handout)
b). Remove cases with NA or NaN:
na.omit()
#Returns a data object with NA and NaN removed. For data
#frames, an entire row will be removed if it contains NA
#or NaN.
Ex1.2: (refer to handout)
c). More sophisticated ways of dealing with NA (covered by Addie’s workshop in two weeks)
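A minimal sketch of steps a) and b), using a small made-up data frame (the column names and values are hypothetical, not the handout data):

```r
# Hypothetical data: height has one NA and one NaN, pulse has one NA.
df <- data.frame(
  id     = 1:5,
  height = c(170, NA, 165, 180, 0/0),
  pulse  = c(72, 80, NA, 64, 70)
)

# a) TRUE for rows with no missing value, FALSE otherwise.
ok <- complete.cases(df)
ok        # TRUE FALSE FALSE TRUE FALSE
sum(ok)   # 2 complete rows

# b) Drop every row that contains an NA or NaN.
clean <- na.omit(df)
nrow(clean)   # 2
```

Note that na.omit() discards the whole row, so a single missing cell costs you every other value recorded for that case.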
2. DESCRIPTIVE STATISTICS
1. Help you diagnose potential problems w/ data entry or collection:
a. Were any values entered incorrectly?
e.g., a survey study uses a 1 to 5 Likert scale, but when you check the range of your data, you find that the maximum value in your dataset is 7.
b. Any strange responses?
e.g., Did your participants give you any odd responses?
2. Help you get a sense of how your data are distributed:
a. Extreme values (outliers)
b. Non-normality
c. Skewness
d. Unequal variance
Compute individual descriptive statistics
1. Location:
mean(x, trim)
median(x)
Ex2.1: (refer to handout)
2. Dispersion:
var(x)
sd(x)
range(x); min(x); max(x)
IQR(x)
Ex2.2: (refer to handout)
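The functions listed above can be tried on a short made-up vector. The values below are illustrative only; the last one is deliberately extreme to show why the trimmed mean and median are useful.

```r
# Made-up values with one extreme observation (40).
x <- c(2, 4, 4, 5, 7, 9, 40)

# Location
mean(x)               # pulled upward by the extreme value
mean(x, trim = 0.2)   # 5.8: drops the lowest and highest 20% before averaging
median(x)             # 5

# Dispersion
var(x)
sd(x)
range(x)              # 2 40
min(x); max(x)
IQR(x)                # 4
```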
Compute a set of descriptive statistics
1. Use the function summary().
Ex3.1: (refer to handout)
2. Use the function describe() in the package `psych`.
Ex3.2: (refer to handout)
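A brief sketch of both approaches on a hypothetical data frame (column names and values are made up). The describe() call is guarded because `psych` is an add-on package that may not be installed.

```r
# Hypothetical data: one numeric column with a missing value, one factor.
df <- data.frame(
  score = c(1, 3, 4, 4, 5, NA),
  group = factor(c("a", "a", "b", "b", "b", "a"))
)

# summary() reports quartiles and the mean for numeric columns,
# counts for factors, and the number of NA's where present.
summary(df)
s <- summary(df$score)
s

# describe() adds n, sd, skew, kurtosis, and more for numeric columns.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(df["score"]))
}
```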
3. T TESTS
t.test()
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = .95)
1. One-sample t-test:
Suppose someone hypothesized that the mean undergrad age was 19.75. Let’s
test whether the mean age was significantly different from 19.75.
H0: mu0 = 19.75
H1: mu0 ≠ 19.75
Ex4.1: (refer to handout)
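The handout data is not reproduced in these slides. The sketch below assumes it is the `survey` data set from the MASS package, which contains the variables named in this lecture (Age, Height, Sex, Wr.Hnd, NW.Hnd); treat that as an assumption rather than the instructor's exact code.

```r
# Assumption: the class data is MASS::survey (student survey data).
survey <- MASS::survey

# One-sample, two-sided t-test of H0: mean age = 19.75.
res <- t.test(survey$Age, mu = 19.75)
res

# Individual pieces of the result can be extracted by name.
res$statistic   # t value
res$p.value     # two-sided p-value
res$conf.int    # 95% confidence interval for the mean
```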
2. Independent t-test:
Let’s test whether the mean height of female students is significantly different from the mean height of male students.
H0: muFemale = muMale
H1: muFemale ≠ muMale
Ex4.2: (refer to handout)
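A sketch of the x/y form of this test, again assuming the data is MASS::survey. Because var.equal defaults to FALSE, this is Welch's t-test, which does not assume equal variances in the two groups.

```r
# Assumption: the class data is MASS::survey.
survey <- MASS::survey

# Split heights by sex; which() skips rows where Sex is missing.
female <- survey$Height[which(survey$Sex == "Female")]
male   <- survey$Height[which(survey$Sex == "Male")]

# Independent two-sample t-test (Welch by default).
res <- t.test(female, male)
res

# Set var.equal = TRUE for the classical pooled-variance test.
res_pooled <- t.test(female, male, var.equal = TRUE)
```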
t.test(formula, data, …)
formula = Y ~ X
Y: outcome variable (e.g., height)
X: 2-level grouping variable (e.g., Sex)
2. Independent t-test (formula interface):
Let’s test whether the mean height of female students is significantly different from the mean height of male students, this time using a formula.
H0: muFemale = muMale
H1: muFemale ≠ muMale
Ex4.3: (refer to handout)
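The formula interface can be sketched as follows, under the same assumption that the data is MASS::survey. The outcome goes on the left of `~` and the 2-level grouping variable on the right.

```r
# Assumption: the class data is MASS::survey.
survey <- MASS::survey

# Height is the outcome, Sex is the 2-level grouping variable.
res_formula <- t.test(Height ~ Sex, data = survey)
res_formula$estimate   # group means for Female and Male

# The same test written with separate x and y vectors.
res_xy <- t.test(
  survey$Height[which(survey$Sex == "Female")],
  survey$Height[which(survey$Sex == "Male")]
)
all.equal(unname(res_formula$statistic), unname(res_xy$statistic))
```

The formula form avoids subsetting by hand and keeps the variable names in the printed output.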
3. Paired t-test:
Let’s test whether the mean writing hand span (Wr.Hnd) and the mean non-writing hand span (NW.Hnd) differ significantly.
H0: muWr.Hnd = muNW.Hnd
H1: muWr.Hnd ≠ muNW.Hnd
Ex4.4: (refer to handout)
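A sketch of the paired test, assuming the data is MASS::survey (where Wr.Hnd and NW.Hnd are measured on the same student). It also shows that a paired t-test is the same as a one-sample t-test on the within-person differences.

```r
# Assumption: the class data is MASS::survey.
survey <- MASS::survey

# Paired t-test: each row supplies one Wr.Hnd / NW.Hnd pair.
res_paired <- t.test(survey$Wr.Hnd, survey$NW.Hnd, paired = TRUE)
res_paired

# Equivalent one-sample test on the differences, H0: mean difference = 0.
res_diff <- t.test(survey$Wr.Hnd - survey$NW.Hnd, mu = 0)
all.equal(unname(res_paired$statistic), unname(res_diff$statistic))
```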
4. ANOVA
aov()
aov(formula, data)
1. One-way ANOVA:
Suppose we are interested in whether the mean pulse rates differ among people of different exercise statuses.
H0: muNone = muSome = muFreq
H1: Not all group means are equal.
Ex5.1: (refer to handout)
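A sketch of this one-way ANOVA, assuming the data is MASS::survey, where Pulse is the pulse rate and Exer is a factor with levels Freq, None, and Some.

```r
# Assumption: the class data is MASS::survey.
survey <- MASS::survey

# Group means, ignoring missing pulse values.
tapply(survey$Pulse, survey$Exer, mean, na.rm = TRUE)

# One-way ANOVA: does mean pulse differ across exercise groups?
fit <- aov(Pulse ~ Exer, data = survey)
tab <- summary(fit)
tab   # F test for Exer, with 2 degrees of freedom (3 groups - 1)
```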
We will import a new dataset and use it for the next exercise.
hsb2 <- read.table("http://www.ats.ucla.edu/stat/r/faq/hsb2.csv", sep=",", header=TRUE)
2. Two-way ANOVA:
The formula/model for factorial ANOVA (taking a two-way interaction as an example) is specified as follows:
Y ~ X1 * X2, which is equivalent to Y ~ X1 + X2 + X1:X2
Why do we do this?
Suppose we are interested in the main effects of race and schtyp (school type) as well as the interaction effect between the two variables on read. The formula/model can be specified as:
read ~ as.factor(race) * as.factor(schtyp)
Ex5.2: (refer to handout)
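In R, by default, ANOVA results are based on Type 1 (“sequential”) sums of squares: each term is tested after the terms listed before it in the formula, so with unbalanced data the order of the terms matters.
Some other programs report Type 3 SS (SPSS) or both Type 1 and Type 3 (SAS).
The Type 3 SS for each term is calculated given that all other terms are in the model.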
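The sketch below uses simulated data rather than hsb2 (the cell counts and effect sizes are made up) to illustrate two points: the `*` shorthand, and the order dependence of Type 1 sums of squares in an unbalanced design. The grouping variables are stored as integer codes, so as.factor() is needed for R to treat them as categories instead of numbers.

```r
set.seed(1)

# Simulated, unbalanced 4 x 2 design; race and schtyp are integer codes.
cells  <- expand.grid(race = 1:4, schtyp = 1:2)
counts <- c(8, 5, 6, 9, 3, 4, 2, 5)   # made-up cell sizes
d <- cells[rep(seq_len(nrow(cells)), times = counts), ]
d$read <- 50 + 2 * d$race + rnorm(nrow(d), sd = 8)

# X1 * X2 expands to X1 + X2 + X1:X2, so both fits are identical.
fit_star <- aov(read ~ as.factor(race) * as.factor(schtyp), data = d)
fit_long <- aov(read ~ as.factor(race) + as.factor(schtyp) +
                  as.factor(race):as.factor(schtyp), data = d)
all.equal(coef(fit_star), coef(fit_long))
summary(fit_star)

# Type 1 SS are sequential: swapping the order of the main effects
# generally changes their sums of squares when the design is unbalanced.
fit_ab <- aov(read ~ as.factor(race) + as.factor(schtyp), data = d)
fit_ba <- aov(read ~ as.factor(schtyp) + as.factor(race), data = d)
summary(fit_ab)
summary(fit_ba)
```

The residual sum of squares is the same in both orderings; only the split between the two main effects changes.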
Let’s import a new dataset:
expenditure <- read.table("http://dornsife.usc.edu/assets/sites/210/docs/GC3/educationExpenditure.txt", sep=",", header=TRUE)
5. LINEAR REGRESSION
lm()
lm(formula, data)
education: Per-capita education expenditures, dollars.
income: Per-capita income, dollars.
young: Proportion under 18, per 1000.
urban: Proportion urban, per 1000.
1. Simple linear regression
formula = Y ~ X
Suppose we are interested in regressing per-capita education expenditure on per-capita income. The model should be specified as:
formula = education ~ income
Ex6.1: (refer to handout)
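A sketch of a simple regression fit on simulated data. The data frame `sim` and its numbers are made up for illustration and only borrow the column names from the lecture's dataset; with the real data you would pass `data = expenditure` instead.

```r
set.seed(42)

# Simulated stand-in for the expenditure data (values are made up).
n   <- 50
sim <- data.frame(income = round(runif(n, 2000, 4500)))
sim$education <- 20 + 0.05 * sim$income + rnorm(n, sd = 20)

# Regress education on income.
fit <- lm(education ~ income, data = sim)
summary(fit)     # coefficient table, R-squared, overall F test
coef(fit)        # intercept and slope
confint(fit)     # 95% confidence intervals for the coefficients
```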
2. Multiple regression
formula = Y ~ X1 + X2 + … + Xp
Suppose we are interested in regressing education on income, young, and urban. The model should be specified as:
formula = education ~ income + young + urban
Ex6.2: (refer to handout)
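A sketch of the multiple regression on simulated data. As before, `sim` and all of its values are made up and only reuse the lecture's column names; substitute `data = expenditure` for the real analysis.

```r
set.seed(7)

# Simulated stand-in with three predictors (values are made up).
n   <- 50
sim <- data.frame(
  income = round(runif(n, 2000, 4500)),
  young  = round(runif(n, 320, 420)),
  urban  = round(runif(n, 300, 900))
)
sim$education <- -100 + 0.05 * sim$income + 0.4 * sim$young +
  rnorm(n, sd = 20)

# Each slope is the effect of one predictor holding the others fixed.
fit <- lm(education ~ income + young + urban, data = sim)
summary(fit)
coef(fit)
```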
Thanks!