Psyc 6627 – Lab 4 – Correlation
use USairpollution data set from the HSAUR2 package
Correlation in Excel
1. z-scores: FISHER and FISHERINV
Correlation in R
1. Scatterplot
a. choose from integer variables
b. Identify points: to use cursor to add ID/row number
c. Jitter: to see data points that are overlaying
d. Log: use log values
e. Marginal boxplots: boxplot for each variable
f. Least-squares line: solid green; regression line
g. Smooth line: solid red, locally weighted, use to look for deviations from linearity
h. Show spread: spread around the smooth line
i. Span for smooth: span that is used to determine the smooth line
j. Change axis labels
k. Subset expression: to use a subset of observations (sex == “male”)
l. Plot by groups: select a categorical variable to distinguish between groups
2. Scatterplot matrix
a. Least-squares lines
b. Smooth lines
c. Show spread
d. Density plots: on diagonal, similar to a histogram, good way to view the distribution
e. Histograms: on diagonal
f. Boxplots: on diagonal
g. One-dimensional scatterplots: not very useful
h. Normal QQ plots: Q = quantile; compares the distribution of a variable to the normal
distribution (the straight line), good way to look at linearity of the data
i. Subset and Plot by groups
3. Calculating correlations
a. Statistics, Summaries, Correlation Matrix
i. Pairwise p-values: to get matrix of p-values
b. Statistics, Summaries, Correlation Text (only possible to select two variables)
i. Alternative Hypothesis (one and two-tailed options)
ii. Output: t, df, and p-value, statement of the alternative hypothesis, 95% CI of the
correlation, sample estimate (the correlation value)
4. Correlations with script
#calculate and output the correlations and covariance for two variables
cor(temp, precip)
cov(temp, precip)
###calculate correlations for all variables in a dataframe or matrix
#use: how to handle missing data, other options are all.obs (for no missing data) and
complete.obs (listwise deletion)
#method: type of correlation, other options are spearman and kendall
cor(USairpollution, use="pairwise.complete.obs", method="pearson")
cov(USairpollution, use="pairwise.complete.obs")
###Compute correlations between certain variables (not a whole matrix)
#these variables will be the rows; the numbers mean variables 1 and 2 from the data set
x <- USairpollution[1:2]
#these variables will the the columns; the numbers mean variables 3 through 7
y <- USairpollution[3:7]
cor(x,y)
cov(x,y)
###Calculate correlations and covariances with significance levels.
#data structure has to be a matrix. Only uses pairwise deletion
#load the Hmisc package
library (Hmisc)
#convert the data from to a matrix and calculate the full correlation matrix
rcorr(as.matrix(USairpollution))
#calculate correlations for pairs of variables
rcorr(temp, precip)
###Calculate correlation and covariance for different levels of a categorical variable
#attach Cowles data set because it has a categorical variables
data(Cowles, package="car")
attach(Cowles)
#load the package called plyr
library(plyr)
#ddply = input is a dataframe and output is a dataframe
#Cowles is the name of the data set
#sex is the variable we are splitting by
#neuroticm and extraversion are the variables being correlated
ddply(Cowles, .(sex), summarise, "corr" = cor(neuroticism, extraversion))
ddply(Cowles, .(sex), summarise, "covariance" = cov(neuroticism, extraversion))
###Calculate confidence intervals
library(psychometric)
CIr(r=.90, n = 100, level = .95)
###Compare independent correlations
#Install and load the psych package
library (psych)
#(r1, r2, NULL, n, n2=) NULL is needed because there is not a third correlation, n2= is
optional
paired.r(.52, .12, NULL, 60, n2=50)
###Compare dependent correlations
#Use the psych package
#library (psych)
#(r31, r32, r12, N)
paired.r(.20,.45,.15,103)
5. Online calculator for p-values:
http://www.socscistatistics.com/pvalues/pearsondistribution.aspx