R: statistics for all of us R: an international statistical collaboratory Prepared for part of the symposium on Multivariate Statistical Methods in Individual Differences Research International Society for the Study of Individual Differences Biennial meeting, Adelaide, July , 2005 William Revelle, Northwestern University personality-project.org/r/ R: statistics for all of us • What is it? • Why use it? • Common (mis)perceptions • Examples for personality and individual differences research R: What is it? •R: An international collaboration •R: the open source - public domain version of S+ •R: written by statisticians (and all of us) for statisticians (and the rest of us) •R: an extensible language Common statistical programs General R S+ SAS SPSS STATA SYSTAT Specialized AMOS EQS LISREL Plus Mx your favorite program. Common statistical programs most are costly General R $+ $A$ $P$$ $TATA $Y$TAT Specialized AMO$ EQ$ LI$REL Maple$ Mx your favorite program. R: a way of thinking (from the R point of view) • “R is the lingua franca of statistical research. Work in all other languages should be discouraged.” • “This is R. There is no if. Only how.” • “Overall, SAS is about 11 years behind R and S-Plus in statistical capabilities (last year it was about 10 years behind) in my estimation.” Taken from the R.-fortunes (selections from the R.-help list serve) But it is open source how can you trust it? • Q: When you use it [R], since it is written by so many authors, how do you know that the results are trustable? • A:The R engine [...] is pretty well uniformly excellent code but you have to take my word for that. Actually, you don't. The whole engine is open source so, if you wish, you can check every line of it. If people were out to push dodgy software, this is not the way they'd go about it. Taken from the R.-fortunes (selections from the R.-help list serve) What is R? : Technically • R is an open source implementation of S (S-Plus is a commercial implementation) • R is available under GNU Copy-left • The current version of R is 2.1.1 • R is group project run by a core group of developers (with new releases semiannually) • (Adapted from Robert Gentleman) R: History • 1991-93: Ross Dhaka and Robert Gentleman begin work on R project at U. Auckland • 1995: R available by ftp under the GPL • 96-97: mailing list and R core group is formed • 2000: John Chambers, designer of S joins the R core (wins a prize for best software from ACM for S) • 2001-2005: Core team continues to improve base package • Many (>400) others contribute “packages” Why R? • Graphics for data exploration and interpretation • Data manipulation including statistics as data • Statistical analysis • Standard univariate and multivariate generalizations of the linear model • Multivariate-structural extensions Why R? Graphics • Sample graphics taken from • http:personality-project.org/r/ • showing what can be done by an amateur • http://addictedtor.free.fr/graphiques/ • showing some most impressive graphs Standard Plots of factor loadings Two T M wo dimensions of affect m --1.0 -0.5 0.0 0 0.5 1 1.0 ffactor M sluggish depressed rsatisfied relaxed b iblue intense scared enthusiastic fsad full_of_pep u lunhappy lively tired dull at_ease guilty excited gloomy n nervous g grouchy calm drowsy alert surprised e energetic ssorry sleepy p pleased h happy d distressed tcontent tense ashamed a canxious M cheerful ntense ively ull_of_pep ired ense elaxed actor 1.0 0.5 luggish atisfied cared ad alm urprised orry leepy ontent heerful epressed lue nthusiastic nhappy ull t_ease uilty xcited loomy ervous rouchy rowsy lert nergetic leased appy istressed shamed nxious .5 2 .0 1 1.0 Two dimensions of affect 0.5 distressed tense unhappy blue nervous sorry depressed anxious sad gloomy scared ashamed grouchy guilty intense 0.0 sluggish excited energetic alert lively full_of_pep enthusiastic cheerful dull tired drowsy sleepy pleased happy satisfied content calm at_ease -0.5 relaxed -1.0 factor 2 surprised -1.0 -0.5 0.0 factor 1 0.5 1.0 Data points can be dynamically identified m --0.4 -0.2 0.0 0.2 0.4 0.6 0 0.8 1 1.0 correlations correlations cM M sluggish depressed rsatisfied relaxed b iblue intense scared fsad full_of_pep u lunhappy lively tired dull at_ease guilty excited gloomy n nervous g grouchy calm drowsy alert surprised e energetic ssorry sleepy p pleased h happy d distressed tcontent tense ashamed a canxious M cheerful ntense ively ull_of_pep ired ense elaxed 0.4 0.2 luggish atisfied cared ad alm urprised orry leepy ontent heerful epressed lue nhappy ull t_ease uilty xcited loomy ervous rouchy rowsy lert nergetic leased appy istressed shamed nxious orrelations .2 .4 .6 .8 .0 orrelationswith with happy sad happy and sad 0.4 sad blue unhappy depressed gloomy distressed sorry ashamed tense grouchy scared 0.2 guilty dull tired nervous anxious sluggish sleepy drowsy intense 0.0 surprised alert lively energetic full_of_pep calm -0.2 excited relaxed at_ease pleased satisfied content cheerful happy -0.4 correlations with sad 0.6 0.8 1.0 correlations with happy and sad -0.4 -0.2 0.0 0.2 0.4 correlations with happy 0.6 0.8 1.0 0.5 -0.5 -1.0 -0.5 0.0 0.5 1.0 -1.0 0.0 0.5 Simulated correlations 120 degrees Observed correlations with happy and sad 1.0 0.5 0.0 -0.5 -1.0 -0.5 0.0 r with sad 0.5 1.0 r with happy -1.0 -0.5 0.0 0.5 1.0 -1.0 -0.5 0.0 0.5 r with happy Simulated correlations 180 degrees Observed correlations with active and inactive 1.0 0.5 -0.5 -1.0 -1.0 -0.5 0.0 r with inactive 0.5 1.0 r with 0 degrees 1.0 -1.0 0.0 r with 120 degrees -0.5 r with 0 degrees 1.0 -1.0 r with 180 degrees 0.0 r with nervous 0.5 0.0 -0.5 -1.0 r with 90 degrees 1.0 Observed correlations with happy and nervous 1.0 Simulated correlations 90 degrees -1.0 -0.5 0.0 r with 0 degrees 0.5 1.0 -1.0 -0.5 0.0 r with active 0.5 1.0 Multi-panel graphs can be labeled separately and organized vertically or horizontally Simulated data can be generated to fit normal, rectangular, binomial, poissen, exponential, etc. distributions Scatter Plot Matrices can show smoothed fits 2 4 6 8 10 12 0 1 2 3 4 5 6 7 12 10 0.85 0.80 -0.22 -0.18 5 epiE 15 20 0 10 epiS 0 2 4 6 8 0.43 -0.05 -0.22 4 2 7 0 -0.24 -0.07 6 8 epiImp 5 6 epilie 0 1 2 3 4 -0.25 0 5 10 15 20 epiNeur 5 10 15 20 0 2 4 6 8 0 5 10 15 20 Can scale font size of correlations by absolute size of r 2 4 6 8 10 12 0 1 2 3 4 5 6 7 20 0 12 -0.18 -0.052 -0.22 10 -0.22 5 0.85 0.80 15 epiE 8 10 epiS 0 2 4 6 0.43 6 8 epiImp -0.24 7 0 2 4 -0.074 4 5 6 epilie 0 1 2 3 -0.25 0 5 10 15 20 epiNeur 5 10 15 20 0 2 4 6 8 0 5 10 15 20 Error bars on two dimensions 1.5 Movie Study 2 1.5 Movie Study 1 Sad 1.0 1.0 ● 0.5 Sad 0.0 Negative Affect 0.5 0.0 Threat Neutral Neutral Threat −0.5 Happy −1.0 −0.5 Happy −1.0 Negative Affect ● −1.0 −0.5 0.0 0.5 Positive Affect 1.0 1.5 −1.0 −0.5 0.0 0.5 Positive Affect 1.0 1.5 PA x NA PA x NA 1.5 Study 3 1.5 Study 2 1.0 Happy 0.0 0.5 1.0 0.5 Neutral Threat −1.0 −0.5 Happy 0.0 0.5 Positive Affect PA x EA PA x EA Energetic Arousal Neutral ● 1.0 1.5 Threat Neutral ● −0.5 0.0 0.5 1.0 1.5 Sad −1.0 −0.5 0.0 0.5 Postive Affect NA x TA NA x TA ● Neutral 1.5 ● Neutral Happy −1.5 −1.5 Happy 1.0 Sad Threat −0.5 Sad 0.5 1.0 1.5 Postive Affect Threat −0.5 −1.5 Tense Arousal 0.5 1.0 1.5 −1.0 −1.5 −1.0 −0.5 0.0 0.5 Negative Affect 1.0 1.5 −1.5 −1.0 −0.5 0.0 0.5 Negative Affect 1.0 Presentation graphics scale to fit page and produce output for pdf presentations Window or page size is controllable Happy −1.5 −1.5 Sad 0.5 1.0 1.5 Positive Affect Threat −0.5 0.0 1.5 Happy −1.5 Tense Arousal Sad −0.5 0.5 1.0 1.5 −0.5 ● −1.0 −0.5 Threat −1.0 Energetic Arousal Negative Affect 0.5 0.0 Neutral −1.0 −0.5 Negative Affect 1.0 ● Sad 1.5 Graphic output can be taken from multiple programs (e.g.Very Simple Structure, factor analysis Built-in data sets provide useful demonstrations of stats 12 10 8 y2 4 6 8 4 6 y1 10 12 Anscombe's 4 Regression data sets 5 10 15 5 10 10 8 y4 4 6 8 6 4 y3 10 12 x2 12 x1 15 5 10 15 x3 5 10 15 x4 Mapping data (GIS) may be combined with descriptive data Many GIS map files are available for download from ESRI GIS maps detail regional boundaries US counties Counties of California Combine income data from 2000 census with zipcode map of San Diego County Median Household Income for 2000 2000 median household income for zip codes GIS files of Borneo can show country boundaries (e.g., Malaysia, Brunei, Indonesia) -5 0 latitude 5 10 Borneo and its surrounds 100 105 110 115 longitude 120 125 130 Public access GIS files include rivers and roads -5 0 latitude 5 10 Borneo and its surrounds 100 105 110 115 longitude 120 125 130 Even more graphics • Taken from a collection of R demonstrations and graphics • http://addictedtor.free.fr/graphiques/ Principal components and clustering of sources of variance in USA arrest data PCA 5 vars princomp(x = data, cor = cor) c(-1, 1) Fertility Examination Catholic Education Agriculture V. De Geneve 1.5 (1-3) 60% 0.0 0.5 1.0 x1 Comp.3 Comp.4 Comp.5 Factor 1 [41%] c(-1, 1) c(-1, 1) 28 16 1 2 60 40 Factor 3 [19%] Groups 48 80 c(0, 1) Comp.2 c(0, 1) Comp.1 Clustering 4 groups 20 0 c(-1, 1) Mixture models 0.6 0.4 0.2 0.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 Notched Boxplots show confidence regions Notched Boxplots 6 4 2 0 1 2 3 4 5 6 7 8 9 10 Multipanel graphs virginica Petal Length Three Varieties Sepal Width of Iris Sepal Length setosa versicolor Petal Length Petal Length Sepal Width Sepal Length Sepal Width Sepal Length Scatter Plot Matrix Histograms and fitted distributions 60 Alto 2 65 70 75 60 Alto 1 Soprano 2 65 70 75 Soprano 1 0.20 0.15 0.10 0.05 Density 0.00 Bass 2 Bass 1 Tenor 2 Tenor 1 0.20 0.15 0.10 0.05 0.00 60 65 70 75 60 Height (inches) 65 70 75 Combine scatter plot with histograms 3 2 1 0 -1 -2 -3 -3 -2 -1 0 1 2 3 Why R? • Graphics for data exploration and interpretation • Data manipulation including statistics as data • Statistical analysis • Standard univariate and multivariate generalizations of the linear model • Multivariate-structural extensions Data Manipulation Data Entry • from console • from clipboard (copied from other programs) • from file (text files, csv, SPSS, Excel, MySQL) • from the web Data Manipulation: Data Structures • Data types: integer, real, logical, character, string • Vectors of any data type • Matrices of any data type • Data Frames (similar to matrix of mixed type) • Lists of any mixture of types • All operations are functions and the returned values may be used in any data structure (e.g., as an element of a data frame or of a list) Data Manipulation • standard arithmetic and logical operations • matrix operations including transpose, inner product, outer product, diagonal, trace, invert • searching, sorting, merging • data cleaning by logical commands Why R? • Graphics for data exploration and interpretation • Data manipulation including statistics as data • Statistical analysis • Standard univariate and multivariate generalizations of the linear model • Multivariate-structural extensions Standard Statistical packages in R • Descriptive and exploratory statistics • The general linear model and its special cases • t-test, ANOVA, MANOVA, regression, logistic regression, cox models, etc. • multilevel models (mixed models, hierarchical models) • time series, econometrics • circular statistics, environmental-geographical statistics Psychometric packages • Structural Equation Modeling • Rasch Modeling of one parameter IRT • Factor and Principal Component Analysis • Rotations (Orthogonal:Varimax, Oblique: quartimax, quartimin, Promax, etc.) • singular value decomposition • eigen vector - eigen value decomposition • Multidimensional scaling R is extensible • Functions can be defined easily and then stored for later use. • Packages of functions are written by “all of us” to solve particular problems • e.g. sem, rasch, rotation, lattice graphics • Packages can be developed and tailored for a specific lab or problem area, or can be shared through the CRAN repository of (currently) > 400 packages Programming in R-personal examples • Very Simple Structure • 6 weeks in Fortran for a mainframe • 3 weeks in Pascal for a Mac • 2 days in R for Mac/Unix/Windows • Schmid Leiman decomposition and estimation of coefficient omega • 1 day Popular misconceptions • R is hard to learn • R is not user friendly • I can’t figure R out • R is free -- it can’t be very good Popular Misconceptions • R is hard to learn • With a brief web based tutorial, 2nd and 3rd year undergraduates in psychological methods and personality research courses were using R for descriptive and inferential statistics and producing publication quality graphics • R is easy to learn, hard to master • R-help newsgroup is very supportive • Multiple web based and pdf tutorials Popular Misconceptions • R is not user friendly • Does not have a GUI, but those are being developed • Help and examples are embedded within program, • ? function produces help with examples for any function • R-help newsgroup is very supportive Popular Misconceptions • R is free, it can not be any good • Developed by some of the best statisticians around • vetted by all users, bugs when they exist, are rapidly fixed • R community provides excellent and timely statistical and programming help (if somewhat acerbic to those who have not done their homework) R: an international collaboratory • Most appropriate for an International Society to be aware of the power of international collaboration to produce cutting edge software R: statistics for all of us R: an international statistical collaboratory Prepared for part of the symposium on Multivariate Statistical Methods in Individual Differences Research International Society for the Study of Individual Differences Biennial meeting, Adelaide, July , 2005 William Revelle, Northwestern University personality-project.org/r/ personality-project.org/r.short.html