T-Test

R
Acknowledgement:
http://www.stat.columbia.edu/~martin/W2024/W2024.html
And many other websites
Introduction to R

Introduction:


R is a free software programming language and software environment for statistical computing and graphics.
R's popularity has increased substantially in recent years.
Installing R for Windows

Download and install:


Download R for Windows from http://cran.r-project.org/
Double-click “R-3.1.1-win.exe” and follow the setup wizard until the installation finishes successfully.
Open R

Double-click the R icon on your Windows desktop, or choose R from the Start menu: Start → R → R i386 3.1.0
Section 1
GETTING STARTED WITH R
R
R is a programming language and software environment for statistical computing and graphics.
It is highly extensible and has become one of the most popular languages among statisticians for developing and introducing new statistical methods.

Basic commands

To see which datasets are available in your
workspace

To get more info

To quit R
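The three tasks above can be sketched as follows. This is a minimal illustration, not necessarily the exact commands shown on the slide; the object `x` is a hypothetical placeholder, and the calls that open a viewer or end the session are commented out.

```r
x <- c(1, 2, 3)   # hypothetical object, so the workspace is not empty

ls()              # list the objects currently in the workspace
# data()          # list the datasets that ship with R
# help(mean)      # get more information on a function; ?mean is a shortcut
# q()             # quit R
```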
Math operations
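A few basic arithmetic operations, as an illustrative sketch (the specific numbers are arbitrary):

```r
10 / 4        # division: 2.5
2^3           # exponentiation: 8
7 %/% 2       # integer division: 3
7 %% 2        # remainder (modulo): 1
sqrt(16)      # square root: 4
log(exp(1))   # natural logarithm of e: 1
```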
Vectors and Matrices

A vector is a one-dimensional array of values of a single type, such as numbers or character strings.
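A minimal sketch of creating and indexing vectors and matrices; the values and the names `v`, `s`, and `m` are arbitrary and chosen only for illustration.

```r
v <- c(2, 4, 6, 8)        # numeric vector
s <- c("a", "b", "c")     # character vector

v[2]                      # second element: 4
v * 2                     # arithmetic is applied element by element
length(v)                 # number of elements: 4

m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
m[2, 3]                   # row 2, column 3: 6
dim(m)                    # 2 rows, 3 columns
t(m)                      # transpose: 3 rows, 2 columns
```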
Data Frames
R often works with data frames, which are like
matrices except the columns are allowed to
be of different types (e.g., one column can be
numerical, while another consists of
characters)
 data.frame()
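As a sketch of the idea, the data frame below mixes a character column with a numeric one. The column names and values are hypothetical.

```r
people <- data.frame(
  name = c("Ann", "Bob", "Cy"),   # character column
  age  = c(28, 35, 42),           # numeric column
  stringsAsFactors = FALSE
)

str(people)                  # shows the type of each column
people$age                   # extract one column as a vector
people[people$age > 30, ]    # keep only the rows where age exceeds 30
```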

Reading and Writing Data
read.table() can read an external file and
create a data frame
 bank.txt
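The contents of bank.txt are not shown, so the sketch below first writes a small hypothetical table to a temporary file and then reads it back with read.table(). The column names `id` and `balance` are assumptions made for the example.

```r
# Write a small hypothetical table to a temporary text file
tmp <- tempfile(fileext = ".txt")
example <- data.frame(id = 1:3, balance = c(120.5, 80, 310.25))
write.table(example, file = tmp, row.names = FALSE, quote = FALSE)

# header = TRUE tells R that the first row holds the variable names
bank <- read.table(tmp, header = TRUE)
bank
```

With a real file, the call would be along the lines of read.table("bank.txt", header = TRUE), assuming the file has a header row.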

Built-in Datasets

Trees dataset: provides the measurements of
the girth, height, and volume of timber in 31
felled black cherry trees.
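A short sketch of loading and inspecting this built-in dataset:

```r
data(trees)      # load the built-in dataset into the workspace
head(trees)      # first six rows
str(trees)       # 31 observations of Girth, Height, and Volume
summary(trees)   # summary statistics for each column
```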
Section 2
MEAN, MEDIAN, MODE
Calculate mean in R

Create a vector a:

Calculate the mean of “a” and save it in “b”:
Calculate median in R

Create a vector “a”:

Calculate the median of vector “a” and save it in “b”:
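A minimal sketch of both calculations. The vector is hypothetical, and the results are stored under separate names (`mean_a`, `median_a`) rather than reusing “b”, so that both remain available.

```r
a <- c(2, 3, 3, 5, 7, 10)   # hypothetical data

mean_a <- mean(a)       # arithmetic mean: 30 / 6 = 5
median_a <- median(a)   # middle value: average of 3 and 5 = 4

mean_a
median_a
```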
Calculate mode in R

Create a vector “a”:


Count how many times each value occurs and save the counts in “b”:

For assignment, the operator “<-” has the same effect as “=”.
The first row of “b” is a sorted list of all unique values in the vector “a”.
The second row of “b” counts how many times each value occurs.
Calculate mode in R

Calculate the mode of vector “a”:
This command returns the names of the values that have the highest count in the second row of “b”.
Since the mode is the value (or values) that occurs most frequently in a vector or matrix, this line returns the mode.
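R has no built-in function for the statistical mode, so the approach described above can be sketched with table(). The vector is hypothetical.

```r
a <- c(2, 3, 3, 5, 7, 10)   # hypothetical data

b <- table(a)                     # counts for each unique value, sorted by value
mode_a <- names(b)[b == max(b)]   # names of the value(s) with the highest count

mode_a               # returned as character: "3"
as.numeric(mode_a)   # convert back to a number if needed
```

If several values tie for the highest count, all of them are returned.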
Section 3
VARIABILITY
Calculate range in R

Create a vector “a”:

Calculate range in R:
Calculate SD in R

Create a vector “a”:

Calculate SD in R:

sd() matches STDEV() in Excel: both compute the sample standard deviation, with divisor N-1. To obtain the STDEVP() result (the population standard deviation, with divisor N), multiply the sd() result by sqrt((N-1)/N).
Calculate variance in R

Create a vector “a”:

Calculate variance in R:

The variance is the square of the SD.
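A sketch of the range, SD, and variance calculations on a hypothetical vector, including the conversion from the sample SD to the population SD:

```r
a <- c(2, 3, 3, 5, 7, 10)   # hypothetical data

range(a)         # smallest and largest values: 2 10
diff(range(a))   # width of the range: 8

sd(a)            # sample standard deviation (divisor N - 1)
var(a)           # sample variance: 9.2, equal to sd(a)^2

# Population SD (divisor N), the equivalent of STDEVP() in Excel
n <- length(a)
sd_pop <- sd(a) * sqrt((n - 1) / n)
sd_pop
```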
Section 4
Z SCORE
Calculate z score in R
z = (x - mean(x)) / sd(x)
Compute the z scores of the following values, where the mean is 50 and the standard deviation is 5:
55, 50, 60, 57.5, 46
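A sketch of the exercise. The first calculation uses the stated mean (50) and standard deviation (5); the second applies the general formula with the sample's own mean and SD, which gives different numbers.

```r
x <- c(55, 50, 60, 57.5, 46)

# z scores relative to the stated mean and standard deviation
z <- (x - 50) / 5
z   # 1.0  0.0  2.0  1.5 -0.8

# z scores relative to the sample's own mean and SD
z_sample <- (x - mean(x)) / sd(x)
z_sample
```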
Z-test in R (1)



Suppose that 10 volunteers have taken an intelligence test, with the following scores:
65, 78, 88, 55, 48, 95, 66, 57, 79, 81
The mean score on the same test for the entire population is 75.
Is there a statistically significant difference (at the 0.05 significance level, i.e. 95% confidence) between the sample mean and the population mean? Assume that the population variance is known and equal to 18.
Z-test in R (2)

Add a function to R

Apply the function
P value for Z score calculator:
http://www.socscistatistics.com/pvalues/normaldistribution.aspx
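Base R does not provide a z-test function, so one has to be defined by hand. The sketch below is one possible definition; the function name `z.test` and its argument names are assumptions, not necessarily those used on the slide. The p-value is computed directly with pnorm() instead of the online calculator.

```r
# One-sample z statistic with known population variance
z.test <- function(x, mu, var) {
  (mean(x) - mu) / sqrt(var / length(x))
}

scores <- c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)
z <- z.test(scores, mu = 75, var = 18)
z                        # about -2.83

# Two-sided p-value from the standard normal distribution
p <- 2 * pnorm(-abs(z))
p                        # below 0.05, so the difference is significant
```

Since |z| exceeds the critical value 1.96, the null hypothesis of equal means is rejected at the 0.05 level.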
Section 5
T-TEST
T-test in R



t-tests are used to determine whether the means of two groups are equal to each other. The classical test assumes that both groups are sampled from normal distributions with equal variances. The null hypothesis is that the two means are equal, and the alternative is that they are not.
Ref: http://statistics.berkeley.edu/computing/r-t-tests
http://www.stat.columbia.edu/~martin/W2024/R2.pdf
The t.test() function in R can perform both one-sample and two-sample t-tests on vectors of data.
One-sample t-tests
P < 0.05, so we reject the null hypothesis and conclude that the mean Salmonella level in the ice cream is greater than 0.3 MPN/g.
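The measurements behind this result are not reproduced in the text. The sketch below uses nine assumed Salmonella readings (in MPN/g) to show the form of a one-sided, one-sample test against 0.3; treat the data as illustrative.

```r
# Assumed measurements (MPN/g); not taken from the slide itself
x <- c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418)

# H0: mean = 0.3   versus   H1: mean > 0.3
res <- t.test(x, mu = 0.3, alternative = "greater")
res

res$statistic   # t statistic
res$p.value     # one-sided p-value
```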
Two-sample t-tests
P < 0.05, so we reject the null hypothesis.
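A sketch of a two-sample test on hypothetical data. Note that t.test() uses Welch's test by default, which does not assume equal variances; passing var.equal = TRUE gives the classical pooled-variance test.

```r
# Hypothetical measurements for two independent groups
group1 <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5)
group2 <- c(6.9, 7.2, 6.5, 7.0, 7.4, 6.8, 7.1)

welch  <- t.test(group1, group2)                     # default: unequal variances
pooled <- t.test(group1, group2, var.equal = TRUE)   # classical equal-variance test

welch$p.value
pooled$p.value
```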
Paired t-test
P < 0.05, so we reject the null hypothesis.
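A sketch of a paired test on hypothetical before/after measurements taken on the same six subjects. A paired t-test is equivalent to a one-sample t-test on the within-pair differences.

```r
# Hypothetical paired measurements (same subjects, two occasions)
before <- c(200, 210, 195, 220, 205, 215)
after  <- c(192, 204, 190, 211, 199, 208)

paired <- t.test(before, after, paired = TRUE)
paired

# Same result from a one-sample test on the differences
diff_test <- t.test(before - after, mu = 0)
```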
Section 6
ANOVA
ANOVA in R

Sometimes we need to determine whether
the means from more than two populations or
groups are equal or not. To test whether the
difference in means is statistically significant,
we use analysis of variance (ANOVA). The
function in R is aov().
First, we can graphically compare the means of
the variable of interest across groups. We can
create boxplots of measurements organized in
groups using plot()
Example: A drug company tested three formulations of a pain relief medicine for migraine headache sufferers. For the experiment, 27 volunteers were selected and 9 were randomly assigned to each of the three drug formulations. The subjects were instructed to take the drug during their next migraine headache episode and to report their pain on a scale of 1 to 10 (10 being the most pain).
Boxplots
aov(response ~ factor, data = data_name)
summary()

The results give F = 11.91 and p = 0.0003, which is highly significant. We therefore reject the null hypothesis: the means of the three drug groups are not all equal.
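The data table is not reproduced in the text. The sketch below uses assumed pain scores for the three formulations (labelled A, B, and C) to show the boxplot and the aov() call; the variable names `pain`, `drug`, and `migraine` are also assumptions.

```r
# Assumed pain scores: 9 subjects for each of drugs A, B, and C
pain <- c(4, 5, 4, 3, 2, 4, 3, 4, 4,
          6, 8, 4, 5, 4, 6, 5, 8, 6,
          6, 7, 6, 6, 7, 5, 6, 5, 5)
drug <- factor(rep(c("A", "B", "C"), each = 9))
migraine <- data.frame(pain, drug)

# Side-by-side boxplots of pain by drug (plot() draws boxplots for a factor)
plot(pain ~ drug, data = migraine)

# One-way ANOVA
results <- aov(pain ~ drug, data = migraine)
summary(results)
```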
Multiple comparisons
The ANOVA F-test answers the question of whether there are significant differences among the K population means. However, it does not tell us how they differ.
The function pairwise.t.test() performs pairwise comparisons between group means, with p-values adjusted for multiple testing.

Drugs B and C do not differ significantly, but A differs from both B and C (p-values below 0.05 or 0.01).
So drug A is clearly different from the other two.
Another multiple comparisons procedure is Tukey’s method (i.e., Tukey’s Honest Significant Difference test). The function TukeyHSD() creates a set of confidence intervals on the differences between means with the specified family-wise probability of coverage.
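Both procedures can be sketched on assumed pain scores for three drugs (A, B, C). The data, the variable names, and the choice of Bonferroni adjustment are assumptions made for illustration.

```r
# Assumed pain scores: 9 subjects for each of drugs A, B, and C
pain <- c(4, 5, 4, 3, 2, 4, 3, 4, 4,
          6, 8, 4, 5, 4, 6, 5, 8, 6,
          6, 7, 6, 6, 7, 5, 6, 5, 5)
drug <- factor(rep(c("A", "B", "C"), each = 9))

# Pairwise t-tests with Bonferroni-adjusted p-values
pw <- pairwise.t.test(pain, drug, p.adjust.method = "bonferroni")
pw

# Tukey's method: confidence intervals for all pairwise differences
results <- aov(pain ~ drug)
tuk <- TukeyHSD(results, conf.level = 0.95)
tuk
```

An interval in the TukeyHSD() output that excludes zero indicates a significant difference for that pair.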
Two-way ANOVA

Two-way ANOVA is a technique for studying
the relationship between a quantitative
dependent variable and two qualitative
independent variables.
Example: A student was interested in her success at basketball free throws. The study investigates whether there is any relationship between the quantitative variable “number of shots made” and the two qualitative variables “Time of Day” and “Shoe Worn”.
R can read data from a text file. The text file must be in the form of a table, with columns representing variables and the first row giving the variable names. All columns must be the same length. Missing values are coded as “NA”.
R is case-sensitive.
After attach() is used, the variables in the object basketball can be referred to directly by name.
Now we are ready to run the ANOVA.
First, we can compare the two times (morning vs.
night), or the two shoes (favorite vs. others) by
looking at summary statistics or boxplots.
To get the means for each level of each factor, use
tapply()
It seems that she does better at night and in her
favorite shoes. But that could just be due to natural
variability. So we use ANOVA.
The p-value for the interaction is 0.38, so we fail to reject the null hypothesis of no interaction: there is no evidence that the effect of time of day on free-throw performance depends on the shoes worn.
The p-values for both main effects are greater than 0.05, so we also fail to reject those null hypotheses: there is no evidence that either variable on its own affects the number of shots made.
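The basketball data file is not reproduced in the text. The sketch below builds a hypothetical data frame with the same structure (the column names `Time`, `Shoes`, and `Made` are assumptions) and shows the tapply() summaries and the two-way ANOVA with interaction. It uses with() and the data argument instead of attach(), which avoids name clashes in the workspace; the numbers will not match the p-values quoted above.

```r
# Hypothetical data: 4 sessions for each Time x Shoes combination
basketball <- data.frame(
  Time  = factor(rep(c("Morning", "Night"), each = 8)),
  Shoes = factor(rep(rep(c("Favorite", "Others"), each = 4), times = 2)),
  Made  = c(25, 26, 23, 24,  22, 23, 20, 24,
            27, 28, 25, 26,  23, 26, 24, 25)
)

# Mean number of shots made for each level of each factor
means_time  <- with(basketball, tapply(Made, Time, mean))
means_shoes <- with(basketball, tapply(Made, Shoes, mean))
means_time
means_shoes

# Two-way ANOVA: Time * Shoes expands to both main effects plus the interaction
fit2 <- aov(Made ~ Time * Shoes, data = basketball)
summary(fit2)
```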
Section 7
CORRELATION
Correlation in R


Create your data file. Use a spreadsheet and make
each column a variable. Each row is a replicate. The
first row should contain the variable names. Save this
as a .CSV file (R_Correlation.csv)
Ref: http://www.gardenersown.co.uk/Education/Lectures/R/index.htm
Read the data into R and save as some name
Allow the factors within the data to be
accessible to R
Decide on the method, run the correlation, and assign the result to a new variable. The methods are "pearson" (default), "kendall", and "spearman".
Perform a pairwise correlation on all the variables in the data set. Decide on the method ("pearson" (default), "kendall", or "spearman").
To evaluate the statistical significance of your
correlation decide on the appropriate method
(pearson is the default), assign a variable and
run the test
Have a look at the result of your significance test
Plot a graph of the two variables from your
correlation. pch=21 plots an open circle,
pch=19 plots a solid circle. Try other values.

Add a line of best fit (if appropriate)
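The file R_Correlation.csv is not available here, so the sketch below runs the same steps on R's built-in trees dataset instead. That substitution is an assumption made for illustration; with your own file you would start from read.csv().

```r
data(trees)   # built-in dataset used as a stand-in for R_Correlation.csv

# Correlation between two variables (method can be "pearson", "kendall", "spearman")
r <- cor(trees$Girth, trees$Volume, method = "pearson")
r

# Pairwise correlations between all variables in the data set
cor_all <- cor(trees, method = "spearman")
cor_all

# Significance test for the correlation
ct <- cor.test(trees$Girth, trees$Volume, method = "pearson")
ct

# Scatter plot with solid circles, plus a line of best fit
plot(trees$Girth, trees$Volume, pch = 19)
abline(lm(Volume ~ Girth, data = trees))
```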
Graph:
Section 8
REGRESSION
Correlation and Linear Regression





We are interested in studying the relationship between variables to determine whether they are associated with one another.
Changes in variable x may explain, or even cause, changes in variable y.
x is called the explanatory variable and y the response variable.
If the scatter plot looks like a straight line, the relationship is linear.
The relationship is strong if the data points lie close to a straight line and weak if the points are widely scattered about the line.
The covariance and correlation are measures of the strength
and direction of a linear relationship between two quantitative
variables.
A regression line is a mathematical model for describing a
linear relationship between an explanatory variable, x and a
response variable y. It can be used to predict the value of y for
a given value of x.
cov(), cor()
Covariance and correlation

Example: A pediatrician wants to study the relationship between a child’s height and head circumference (both measured in inches). She selects a simple random sample (SRS) of 11 three-year-old children and obtains the following data.
The variances of Height and Circ are 1.198 and 0.048, respectively. The covariance between Height and Circ is 0.219, indicating a positive relationship.

The correlation between Height and Circ is 0.911. Hence, there is a strong positive linear relationship between the variables.
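The data table is not reproduced in the text. The sketch below uses assumed measurements for 11 children that are consistent with the summary figures quoted above; treat the individual values as illustrative.

```r
# Assumed measurements (inches) for 11 three-year-old children
Height <- c(27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5)
Circ   <- c(17.5, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5)

var(Height)         # about 1.198
var(Circ)           # about 0.048
cov(Height, Circ)   # about 0.219
cor(Height, Circ)   # about 0.911

plot(Height, Circ)  # scatter plot to check that the relationship looks linear
```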
Linear Regression



If there is a strong linear relationship between two variables, it is often useful to model the relationship using a regression line.
lm(response ~ explanatory)
Circ = 12.493 + 0.183 × Height
Y = 12.493 + 0.183 X
So a one-inch increase in height is associated with an average increase of 0.183 inch in head circumference.
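A sketch of fitting the regression line with lm(), using assumed measurements that are consistent with the coefficients quoted above:

```r
# Assumed measurements (inches) for 11 three-year-old children
Height <- c(27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5)
Circ   <- c(17.5, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5)

fit <- lm(Circ ~ Height)   # response ~ explanatory
coef(fit)                  # intercept about 12.493, slope about 0.183

plot(Height, Circ)
abline(fit)                # add the fitted regression line to the scatter plot
```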
Next step is to verify all the relevant model assumptions
needed for using the simple linear regression model.
The residuals should be normally distributed with equal
variance for every value of x.
The plot shows no apparent pattern in the residuals indicating
no clear violations of any model assumptions
To check the normality assumption, we make a QQ-plot or a histogram of the residuals.
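The residual checks described above can be sketched as follows, again on assumed height and head circumference data:

```r
# Assumed measurements (inches) for 11 three-year-old children
Height <- c(27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5)
Circ   <- c(17.5, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5)
fit <- lm(Circ ~ Height)

res <- resid(fit)    # residuals: observed minus fitted values

# Residuals against the explanatory variable: look for patterns or changing spread
plot(Height, res)
abline(h = 0)

# Normality checks
qqnorm(res)          # points close to a straight line suggest normality
qqline(res)
hist(res)
```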
The next step is inference. We want to construct tests and confidence intervals for the slope and intercept, confidence intervals for the mean response, and prediction intervals for future observations.
To test whether the slope is significantly different from 0:
We can reject the null hypothesis of no linear relationship between Height and Circ (p = 9.59e-05).
R-squared



R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
R-squared is the percentage of the variation in the response variable that is explained by a linear model.
R-squared = explained variation / total variation (0-100%)
0% means that the model explains none of the variability of the response data around its mean.
100% means that the model explains all of the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
Here R-squared = 0.83, indicating that 83% of the variability in the response is explained by the explanatory variable.
Now we can use the regression equation to predict future values of the response variable.
The predicted value of head circumference for a child of a given height has two interpretations:
It represents the mean circumference of all children whose height is x.
It represents the predicted circumference of a randomly selected child whose height is x.
The predicted value is the same in both cases, but the standard error is larger in the second case because of the additional variation of individual responses about the mean.
To obtain a 95% confidence interval for the mean head circumference of children who are 25 inches tall: the confidence interval is (16.95, 17.17).

To obtain a 95% prediction interval for the head circumference of an individual child who is 25 inches tall: the prediction interval is (16.81, 17.30).
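A sketch of the inference steps with summary() and predict(), using assumed measurements that are consistent with the figures quoted above. The object names `s`, `conf_int`, and `pred_int` are arbitrary.

```r
# Assumed measurements (inches) for 11 three-year-old children
Height <- c(27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5)
Circ   <- c(17.5, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5)
fit <- lm(Circ ~ Height)

s <- summary(fit)
s$coefficients["Height", "Pr(>|t|)"]   # p-value for the test that the slope is 0
s$r.squared                            # about 0.83
confint(fit, level = 0.95)             # confidence intervals for intercept and slope

new <- data.frame(Height = 25)

# Interval for the MEAN circumference of all children who are 25 inches tall
conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)

# Interval for the circumference of ONE child who is 25 inches tall (wider)
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

conf_int
pred_int
```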
Multiple Linear Regression
We want to model the relationship between X1 and the predictors X2, X3, and X4, where:
X1 = first-year box office receipts (millions)
X2 = total production costs (millions)
X3 = total promotional costs (millions)
X4 = total book sales (millions)


First create your data file. Use a spreadsheet and make each column a variable. Each row is a replicate. The first row should contain the variable names. Save this as a .CSV file (R_Regression.csv).
The data (X1, X2, X3, X4) are recorded for each movie.
Read the data into R and save it under a name of your choice

Allow the factors within the data to be accessible to R

Take a first look at the data as a pairs graph (plots all combinations as scatter plots)

Decide on the model, run it and assign the result to a new
variable

See the basic coefficients of your regression
A more detailed summary of your regression
Overall p = 7.913e-05, so the model as a whole is highly significant. X2 and X3 are significant, but X4 is not.
Once you have the basic information, you can go on to examine more components of your regression model
Examine an individual coefficient
Beta coefficients are standardized against one another to show their relative strengths.
A beta coefficient is determined as: beta = b × sd(x) / sd(y), where b is the regression coefficient of x
Calculate the beta coefficients (you will need to do one for each x factor)
Display all your beta coefficients
The R-squared value tells us how strong the fit is (the proportion of the variance explained). However, R only shows the value for the overall model.
We can find the individual R-squared contributions once we know the beta coefficients: R2 = beta × correlation(X, Y)
Calculate the R-squared value for each component
Display all your R-squared values
Plot a graph of two variables from your regression
Add a line of best fit
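The movie data in R_Regression.csv are not available here, so the whole workflow is sketched on R's built-in trees dataset, with Volume as the response and Girth and Height as predictors. That substitution is an assumption made for illustration, and the standardized beta coefficient is computed as b × sd(x) / sd(y).

```r
data(trees)    # stand-in for the movie data
pairs(trees)   # scatter plots of all pairs of variables

# Fit the model and look at the coefficients
fit <- lm(Volume ~ Girth + Height, data = trees)
coef(fit)      # basic coefficients
summary(fit)   # detailed summary: p-values, overall R-squared

# Standardized (beta) coefficients: b * sd(x) / sd(y)
b    <- coef(fit)[-1]                    # drop the intercept
sds  <- sapply(trees[, names(b)], sd)    # SD of each predictor
beta <- b * sds / sd(trees$Volume)
beta

# Contribution of each predictor to R-squared: beta * cor(x, y)
r2_parts <- beta * cor(trees[, names(b)], trees$Volume)[, 1]
r2_parts
sum(r2_parts)   # adds up to the overall R-squared

# Plot two of the variables and add a line of best fit
plot(trees$Girth, trees$Volume)
abline(lm(Volume ~ Girth, data = trees))
```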
Transformations

In many situations there exists a non-linear
relationship between the variables. This can
sometimes be remedied by applying a suitable
transformation of the variables, such as power
transformations or logarithms.
Data were collected on the number of
academic journals published on the Internet
during the period of 1991-1997
Clearly, there is a non-linear relationship between year and journals. Taking the logarithm of the number of journals may be appropriate before fitting a simple linear regression model.
The residual plot shows no apparent pattern, which suggests that a simple linear regression model is appropriate for the transformed data.
Now we predict the number of journals
published in 1998 (x=8)
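The journal counts are not reproduced in the text, so the sketch below uses made-up counts that grow roughly exponentially, with the years 1991 to 1997 coded as x = 1 to 7. The numbers are purely hypothetical; only the steps matter.

```r
# Hypothetical data: x = 1 corresponds to 1991, x = 7 to 1997
year     <- 1:7
journals <- c(30, 40, 50, 140, 300, 1100, 2500)   # made-up counts

plot(year, journals)              # clearly non-linear

# Fit a straight line to the log-transformed response
fit_log <- lm(log(journals) ~ year)
plot(year, resid(fit_log))        # residuals of the transformed model
abline(h = 0)

# Predict for 1998 (x = 8) and transform back to the original scale
pred_log <- predict(fit_log, newdata = data.frame(year = 8))
pred <- exp(pred_log)
pred
```

The prediction is made on the log scale, so exp() is needed to express it as a number of journals.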
Section 9
CHI-SQUARE
Chi-Squared in R
In an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We want to determine whether gender is related to voting preference.
We use the chi-square test for independence.

When to use the chi-square test for independence:
The sampling method is simple random sampling
Each population is at least 10 times as large as its respective sample
The variables under study are each categorical


Four steps

State the hypotheses, formulate an analysis plan,
analyze sample data, and interpret results
Example: A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were classified by gender (male or female) and by voting preference (Republican, Democrat, or Independent). The results are shown below. Is there a gender gap? Do the men’s voting preferences differ significantly from the women’s preferences? Use a 0.05 level of significance.
State the hypotheses:
H0: gender and voting preference are independent.
Ha: gender and voting preference are not independent.

Read in your file and assign it to a variable name. This
command tells R that the 1st column contains the row names.

Run the Chi-Squared test and assign it to a variable

If you need to apply Yates correction for a 2 x 2 matrix
So we reject the null hypothesis (p < 0.05): men’s voting preferences differ significantly from women’s preferences.

To see the original data i.e. observed values

To see the expected values

To see the Pearson residuals (O-E)/sqrt(E)
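The table of poll results is not reproduced in the text. The sketch below enters an assumed 2 x 3 table of counts for 1000 voters directly as a matrix, instead of reading it from a file, and then runs the steps listed above. The counts and the names `votes` and `res` are assumptions.

```r
# Assumed counts for 1000 voters (rows: gender, columns: voting preference)
votes <- matrix(c(200, 150, 50,
                  250, 300, 50),
                nrow = 2, byrow = TRUE,
                dimnames = list(Gender = c("Male", "Female"),
                                Preference = c("Republican", "Democrat", "Independent")))

res <- chisq.test(votes)   # chi-square test for independence
res

res$observed    # original data (observed values)
res$expected    # expected values under independence
res$residuals   # Pearson residuals: (O - E) / sqrt(E)

# Yates' continuity correction applies only to 2 x 2 tables and is
# controlled by the 'correct' argument (TRUE by default)
```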
Section 10
CLUSTERING
K-means clustering in R

This is the most basic algorithm:
1. Pick an initial set of K centroids (at random or by any other means).
2. Assign each data point to the closest centroid according to the given distance function.
3. Move each centroid to the mean of all the data points assigned to it. Go back to step 2 until the memberships no longer change and the centroid positions are stable.
4. Output the centroids.
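To make the steps concrete, here is a minimal from-scratch sketch. It assumes squared Euclidean distance and random initial centroids drawn from the data; the function name `simple_kmeans` and the toy points are hypothetical. It is meant for illustration only, and R's built-in kmeans() should be preferred in practice.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k distinct data points at random as initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")   # index of closest centroid
    if (all(new_cluster == cluster)) break   # memberships stable: stop
    cluster <- new_cluster
    # Step 3: move each centroid to the mean of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centroids[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  # Step 4: output the centroids (and the memberships)
  list(centers = centroids, cluster = cluster)
}

# Toy example: two well-separated groups of three points each
set.seed(1)
pts <- rbind(cbind(c(0, 0.1, 0.2), c(0, 0.1, 0)),
             cbind(c(10, 10.1, 10.2), c(10, 10, 10.1)))
out <- simple_kmeans(pts, k = 2)
out$cluster
out$centers
```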

Prepare data: Iris.csv

Data introduction:


http://en.wikipedia.org/wiki/Iris_flower_data_set
The dataset consists of 50 samples from each of three species
of Iris (Iris setosa, Iris virginica and Iris versicolor). Four
features were measured from each sample: the length and the
width of the sepals and petals, in centimetres.
Read data in R

K-means clustering

Use columns 1 and 2 for clustering, with 3 clusters.
Plot the result
https://stat.ethz.ch/R-manual/Rpatched/library/graphics/html/points.html
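A sketch of the clustering and plotting steps. It uses R's built-in iris dataset in place of Iris.csv, on the assumption that the file holds the same data; the seed and the nstart value are arbitrary choices.

```r
data(iris)   # built-in copy of the Iris data, used here instead of Iris.csv

set.seed(42)   # k-means starts from random centroids, so fix the seed
km <- kmeans(iris[, 1:2], centers = 3, nstart = 10)   # columns 1-2, 3 clusters

km$centers                         # final centroid positions
table(km$cluster, iris$Species)    # compare clusters with the true species

# Plot the points coloured by cluster, then mark the centroids with points()
plot(iris[, 1:2], col = km$cluster, pch = 19)
points(km$centers, col = 1:3, pch = 8, cex = 2)
```

Cluster numbers are arbitrary labels, so they do not correspond to species in any fixed order.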
Section 11
R GRAPHICS
Graphics
Box plot
Good R graphics Tutorial

http://teachpress.environmentalinformaticsmarburg.de/2013/07/creating-publicationquality-graphs-in-r-7/