R: statistics for all of us

advertisement
R: statistics for all of us
R: an international statistical collaboratory
Prepared for part of the symposium on
Multivariate Statistical Methods in Individual Differences Research
International Society for the Study of Individual Differences
Biennial meeting, Adelaide, July , 2005
William Revelle, Northwestern University
personality-project.org/r/
R: statistics for all of us
• What is it?
• Why use it?
• Common (mis)perceptions
• Examples for personality and individual
differences research
R: What is it?
•R: An international collaboration
•R: the open source - public domain
version of S+
•R: written by statisticians (and all of us)
for statisticians (and the rest of us)
•R: an extensible language
Common statistical programs
General
R
S+
SAS
SPSS
STATA
SYSTAT
Specialized
AMOS
EQS
LISREL
Plus
Mx
your favorite program.
Common statistical programs
most are costly
General
R
$+
$A$
$P$$
$TATA
$Y$TAT
Specialized
AMO$
EQ$
LI$REL
Maple$
Mx
your favorite program.
R: a way of thinking
(from the R point of view)
• “R is the lingua franca of statistical
research. Work in all other languages
should be discouraged.”
• “This is R. There is no if. Only how.”
• “Overall, SAS is about 11 years behind R
and S-Plus in statistical capabilities (last
year it was about 10 years behind) in my
estimation.”
Taken from the R.-fortunes (selections from the R.-help list serve)
But it is open source how can you trust it?
• Q: When you use it [R], since it is written by so
many authors, how do you know that the results are
trustable?
• A:The R engine [...] is pretty well uniformly
excellent code but you have to take my word for
that. Actually, you don't. The whole engine is open
source so, if you wish, you can check every line of it.
If people were out to push dodgy software, this is
not the way they'd go about it.
Taken from the R.-fortunes (selections from the R.-help list serve)
What is R? : Technically
• R is an open source implementation of S (S-Plus
is a commercial implementation)
• R is available under GNU Copy-left
• The current version of R is 2.1.1
• R is group project run by a core group of
developers (with new releases semiannually)
•
(Adapted from Robert Gentleman)
R: History
• 1991-93: Ross Dhaka and Robert Gentleman begin
work on R project at U. Auckland
• 1995: R available by ftp under the GPL
• 96-97: mailing list and R core group is formed
• 2000: John Chambers, designer of S joins the R core
(wins a prize for best software from ACM for S)
• 2001-2005: Core team continues to improve base
package
• Many (>400)
others contribute “packages”
Why R?
• Graphics for data exploration and interpretation
• Data manipulation including statistics as data
• Statistical analysis
• Standard univariate and multivariate
generalizations of the linear model
• Multivariate-structural extensions
Why R? Graphics
• Sample graphics taken from
• http:personality-project.org/r/
• showing what can be done by an amateur
• http://addictedtor.free.fr/graphiques/
• showing some most impressive graphs
Standard Plots of factor loadings
Two
T
M
wo dimensions of affect
m
--1.0
-0.5
0.0
0
0.5
1
1.0
ffactor
M
sluggish
depressed
rsatisfied
relaxed
b
iblue
intense
scared
enthusiastic
fsad
full_of_pep
u
lunhappy
lively
tired
dull
at_ease
guilty
excited
gloomy
n
nervous
g
grouchy
calm
drowsy
alert
surprised
e
energetic
ssorry
sleepy
p
pleased
h
happy
d
distressed
tcontent
tense
ashamed
a
canxious
M
cheerful
ntense
ively
ull_of_pep
ired
ense
elaxed
actor
1.0
0.5
luggish
atisfied
cared
ad
alm
urprised
orry
leepy
ontent
heerful
epressed
lue
nthusiastic
nhappy
ull
t_ease
uilty
xcited
loomy
ervous
rouchy
rowsy
lert
nergetic
leased
appy
istressed
shamed
nxious
.5 2
.0
1
1.0
Two dimensions of affect
0.5
distressed
tense
unhappy
blue
nervous
sorry
depressed
anxious
sad
gloomy
scared
ashamed
grouchy
guilty
intense
0.0
sluggish
excited
energetic
alert
lively
full_of_pep
enthusiastic
cheerful
dull
tired
drowsy
sleepy
pleased happy
satisfied
content
calm
at_ease
-0.5
relaxed
-1.0
factor 2
surprised
-1.0
-0.5
0.0
factor 1
0.5
1.0
Data points can be dynamically identified
m
--0.4
-0.2
0.0
0.2
0.4
0.6
0
0.8
1
1.0
correlations
correlations
cM
M
sluggish
depressed
rsatisfied
relaxed
b
iblue
intense
scared
fsad
full_of_pep
u
lunhappy
lively
tired
dull
at_ease
guilty
excited
gloomy
n
nervous
g
grouchy
calm
drowsy
alert
surprised
e
energetic
ssorry
sleepy
p
pleased
h
happy
d
distressed
tcontent
tense
ashamed
a
canxious
M
cheerful
ntense
ively
ull_of_pep
ired
ense
elaxed
0.4
0.2
luggish
atisfied
cared
ad
alm
urprised
orry
leepy
ontent
heerful
epressed
lue
nhappy
ull
t_ease
uilty
xcited
loomy
ervous
rouchy
rowsy
lert
nergetic
leased
appy
istressed
shamed
nxious
orrelations
.2
.4
.6
.8
.0
orrelationswith
with
happy
sad
happy
and sad
0.4
sad
blue
unhappy
depressed
gloomy
distressed
sorry
ashamed
tense
grouchy
scared
0.2
guilty
dull
tired
nervous
anxious
sluggish sleepy
drowsy
intense
0.0
surprised
alert
lively
energetic
full_of_pep
calm
-0.2
excited
relaxed
at_ease
pleased
satisfied
content
cheerful
happy
-0.4
correlations with sad
0.6
0.8
1.0
correlations with happy and sad
-0.4
-0.2
0.0
0.2
0.4
correlations with happy
0.6
0.8
1.0
0.5
-0.5
-1.0
-0.5
0.0
0.5
1.0
-1.0
0.0
0.5
Simulated correlations
120 degrees
Observed correlations
with happy and sad
1.0
0.5
0.0
-0.5
-1.0
-0.5
0.0
r with sad
0.5
1.0
r with happy
-1.0
-0.5
0.0
0.5
1.0
-1.0
-0.5
0.0
0.5
r with happy
Simulated correlations
180 degrees
Observed correlations
with active and inactive
1.0
0.5
-0.5
-1.0
-1.0
-0.5
0.0
r with inactive
0.5
1.0
r with 0 degrees
1.0
-1.0
0.0
r with 120 degrees
-0.5
r with 0 degrees
1.0
-1.0
r with 180 degrees
0.0
r with nervous
0.5
0.0
-0.5
-1.0
r with 90 degrees
1.0
Observed correlations
with happy and nervous
1.0
Simulated correlations
90 degrees
-1.0
-0.5
0.0
r with 0 degrees
0.5
1.0
-1.0
-0.5
0.0
r with active
0.5
1.0
Multi-panel graphs can be
labeled separately and
organized vertically or
horizontally
Simulated data can be
generated to fit normal,
rectangular, binomial,
poissen, exponential, etc.
distributions
Scatter Plot Matrices can show smoothed fits
2
4
6
8
10
12
0
1
2
3
4
5
6
7
12
10
0.85 0.80 -0.22 -0.18
5
epiE
15
20
0
10
epiS
0
2
4
6
8
0.43 -0.05 -0.22
4
2
7
0
-0.24 -0.07
6
8
epiImp
5
6
epilie
0
1
2
3
4
-0.25
0
5
10
15
20
epiNeur
5
10
15
20
0
2
4
6
8
0
5
10
15
20
Can scale font size of correlations by absolute size of r
2
4
6
8
10
12
0
1
2
3
4
5
6
7
20
0
12
-0.18
-0.052
-0.22
10
-0.22
5
0.85 0.80
15
epiE
8
10
epiS
0
2
4
6
0.43
6
8
epiImp
-0.24
7
0
2
4
-0.074
4
5
6
epilie
0
1
2
3
-0.25
0
5
10
15
20
epiNeur
5
10
15
20
0
2
4
6
8
0
5
10
15
20
Error bars on two dimensions
1.5
Movie Study 2
1.5
Movie Study 1
Sad
1.0
1.0
●
0.5
Sad
0.0
Negative Affect
0.5
0.0
Threat Neutral
Neutral
Threat
−0.5
Happy
−1.0
−0.5
Happy
−1.0
Negative Affect
●
−1.0
−0.5
0.0
0.5
Positive Affect
1.0
1.5
−1.0
−0.5
0.0
0.5
Positive Affect
1.0
1.5
PA x NA
PA x NA
1.5
Study 3
1.5
Study 2
1.0
Happy
0.0
0.5
1.0
0.5
Neutral
Threat
−1.0
−0.5
Happy
0.0
0.5
Positive Affect
PA x EA
PA x EA
Energetic Arousal
Neutral
●
1.0
1.5
Threat
Neutral
●
−0.5
0.0
0.5
1.0
1.5
Sad
−1.0
−0.5
0.0
0.5
Postive Affect
NA x TA
NA x TA
●
Neutral
1.5
●
Neutral
Happy
−1.5
−1.5
Happy
1.0
Sad
Threat
−0.5
Sad
0.5 1.0 1.5
Postive Affect
Threat
−0.5
−1.5
Tense Arousal
0.5 1.0 1.5
−1.0
−1.5
−1.0
−0.5
0.0
0.5
Negative Affect
1.0
1.5
−1.5
−1.0
−0.5
0.0
0.5
Negative Affect
1.0
Presentation graphics
scale to fit page
and produce output
for pdf presentations
Window or page size
is controllable
Happy
−1.5
−1.5
Sad
0.5 1.0 1.5
Positive Affect
Threat
−0.5
0.0
1.5
Happy
−1.5
Tense Arousal
Sad
−0.5
0.5 1.0 1.5
−0.5
●
−1.0 −0.5
Threat
−1.0
Energetic Arousal
Negative Affect
0.5
0.0
Neutral
−1.0 −0.5
Negative Affect
1.0
●
Sad
1.5
Graphic output can be taken from multiple programs
(e.g.Very Simple Structure, factor analysis
Built-in data sets provide useful demonstrations of stats
12
10
8
y2
4
6
8
4
6
y1
10
12
Anscombe's 4 Regression data sets
5
10
15
5
10
10
8
y4
4
6
8
6
4
y3
10
12
x2
12
x1
15
5
10
15
x3
5
10
15
x4
Mapping data (GIS) may be combined with descriptive data
Many GIS map files are available for download from ESRI
GIS maps detail regional boundaries
US counties
Counties of California
Combine income data from 2000 census
with zipcode map of San Diego County
Median Household Income for 2000
2000 median household income for zip codes
GIS files of Borneo can show country boundaries
(e.g., Malaysia, Brunei, Indonesia)
-5
0
latitude
5
10
Borneo and its surrounds
100
105
110
115
longitude
120
125
130
Public access GIS files include rivers and roads
-5
0
latitude
5
10
Borneo and its surrounds
100
105
110
115
longitude
120
125
130
Even more graphics
• Taken from a collection of R
demonstrations and graphics
• http://addictedtor.free.fr/graphiques/
Principal components and clustering of sources of
variance in USA arrest data
PCA 5 vars
princomp(x = data, cor = cor)
c(-1, 1)
Fertility
Examination
Catholic
Education
Agriculture
V. De Geneve
1.5
(1-3) 60%
0.0
0.5
1.0
x1
Comp.3
Comp.4
Comp.5
Factor 1 [41%]
c(-1, 1)
c(-1, 1)
28
16
1
2
60
40
Factor 3 [19%]
Groups
48
80
c(0, 1)
Comp.2
c(0, 1)
Comp.1
Clustering 4 groups
20
0
c(-1, 1)
Mixture models
0.6
0.4
0.2
0.0
-2.5
-2.0
-1.5
-1.0
-0.5
0.0
0.5
1.0
Notched Boxplots show confidence regions
Notched Boxplots
6
4
2
0
1
2
3
4
5
6
7
8
9
10
Multipanel graphs
virginica
Petal
Length
Three
Varieties
Sepal
Width
of
Iris
Sepal
Length
setosa
versicolor
Petal
Length
Petal
Length
Sepal
Width
Sepal
Length
Sepal
Width
Sepal
Length
Scatter Plot Matrix
Histograms and fitted distributions
60
Alto 2
65
70
75
60
Alto 1
Soprano 2
65
70
75
Soprano 1
0.20
0.15
0.10
0.05
Density
0.00
Bass 2
Bass 1
Tenor 2
Tenor 1
0.20
0.15
0.10
0.05
0.00
60
65
70
75
60
Height (inches)
65
70
75
Combine scatter plot with histograms
3
2
1
0
-1
-2
-3
-3
-2
-1
0
1
2
3
Why R?
• Graphics for data exploration and interpretation
• Data manipulation including statistics as data
• Statistical analysis
• Standard univariate and multivariate
generalizations of the linear model
• Multivariate-structural extensions
Data Manipulation
Data Entry
• from console
• from clipboard (copied from other programs)
• from file (text files, csv, SPSS, Excel, MySQL)
• from the web
Data Manipulation:
Data Structures
• Data types: integer, real, logical, character, string
• Vectors of any data type
• Matrices of any data type
• Data Frames (similar to matrix of mixed type)
• Lists of any mixture of types
• All operations are functions and the returned
values may be used in any data structure (e.g., as
an element of a data frame or of a list)
Data Manipulation
• standard arithmetic and logical operations
• matrix operations including transpose,
inner product, outer product, diagonal,
trace, invert
• searching, sorting, merging
• data cleaning by logical commands
Why R?
• Graphics for data exploration and interpretation
• Data manipulation including statistics as data
• Statistical analysis
• Standard univariate and multivariate
generalizations of the linear model
• Multivariate-structural extensions
Standard Statistical
packages in R
• Descriptive and exploratory statistics
• The general linear model and its special cases
• t-test, ANOVA, MANOVA, regression, logistic
regression, cox models, etc.
• multilevel models (mixed models, hierarchical
models)
• time series, econometrics
• circular statistics, environmental-geographical
statistics
Psychometric packages
• Structural Equation Modeling
• Rasch Modeling of one parameter IRT
• Factor and Principal Component Analysis
• Rotations (Orthogonal:Varimax, Oblique:
quartimax, quartimin, Promax, etc.)
• singular value decomposition
• eigen vector - eigen value decomposition
• Multidimensional scaling
R is extensible
• Functions can be defined easily and then
stored for later use.
• Packages of functions are written by “all of
us” to solve particular problems
• e.g. sem, rasch, rotation, lattice graphics
• Packages can be developed and tailored for
a specific lab or problem area, or can be
shared through the CRAN repository of
(currently) > 400 packages
Programming in R-personal examples
• Very Simple Structure
• 6 weeks in Fortran for a mainframe
• 3 weeks in Pascal for a Mac
• 2 days in R for Mac/Unix/Windows
• Schmid Leiman decomposition and
estimation of coefficient omega
• 1 day
Popular misconceptions
• R is hard to learn
• R is not user friendly
• I can’t figure R out
• R is free -- it can’t be very good
Popular Misconceptions
• R is hard to learn
• With a brief web based tutorial, 2nd and 3rd
year undergraduates in psychological methods
and personality research courses were using R
for descriptive and inferential statistics and
producing publication quality graphics
• R is easy to learn, hard to master
• R-help newsgroup is very supportive
• Multiple web based and pdf tutorials
Popular Misconceptions
• R is not user friendly
• Does not have a GUI, but those are being
developed
• Help and examples are embedded within
program,
• ? function produces help with examples
for any function
• R-help newsgroup is very supportive
Popular Misconceptions
• R is free, it can not be any good
• Developed by some of the best
statisticians around
• vetted by all users, bugs when they exist,
are rapidly fixed
• R community provides excellent and
timely statistical and programming help (if
somewhat acerbic to those who have not
done their homework)
R: an international
collaboratory
• Most appropriate for an International
Society to be aware of the power of
international collaboration to produce
cutting edge software
R: statistics for all of us
R: an international statistical collaboratory
Prepared for part of the symposium on
Multivariate Statistical Methods in Individual Differences Research
International Society for the Study of Individual Differences
Biennial meeting, Adelaide, July , 2005
William Revelle, Northwestern University
personality-project.org/r/
personality-project.org/r.short.html
Download