Combining the Power of R and Excel: RExcel
A LISA Short Course
February 2012

Matthew Lanham
Ph.D. Student, Business Information Technology
M.S. Student, Statistics

As you come in, please get materials here: https://filebox.vt.edu/users/lanham/LISA/

Motivation for this course: Two Facts
1. Excel is the most prevalent software used for data storage and analysis. It has many built-in statistical functions in addition to the “Analysis ToolPak.”
2. R is a free and open-source program, and one of the most powerful and fastest-growing statistics programs.
Why not use them both together!!

[Image: This is you with Excel]
[Image: This is you with Excel + R]

Outcome from this course: I hope to provide you with some examples that you can incorporate into your own work and that prove beneficial.

Let's get started:
1) Double-click the "RExcel2010 with Rcommander" icon.
This will open Excel and R Commander. R Commander is like the standard R GUI, but looks a bit different. You will find R in the Excel Ribbon as well.

Part 1 Transferring data between R and Excel
• Data from Excel to R
• Data from R to Excel

RExcel Drop-down
• Close R – Closes the open instance of R, and R Commander as well
• Run Code – Runs R code
• Get R Value (Array or Dataframe) – Gets data from R into Excel
• Put R Value (Array or Dataframe) – Sends a cell or range from Excel to R
• Get R Output – Retrieves code output from R to Excel
• Set R working dir – Defines the folder on your PC that you want to work from
• Load R file – Loads a data set or .R file
• Copy code – Copies code in Excel
• Debug R – If checked, opens a debugger when an error occurs
• Error log – Shows you all the R errors
• Options – Offers a few basic options
• Set R server – Lets you select the server type, server name (for remote servers), and R process name (for servers from a serverpool)
• RExcel Help – Takes you here: file:///C:/Program%20Files%20%28x86%29/RExcel/doc/RExcel.html
• Rhelp – Takes you here: http://127.0.0.1:18357/doc/html/index.html
• Rcommander – Opens R Commander with menus in the Excel Ribbon or in R Commander
• Demo worksheets – There are five demos for learning how to use the software
• Mark calc cells – If activated, marks all cells containing calculated results with a special marker in the upper left corner
• About RExcel

Part 1 Functions, Arrays, and Dataframes
• Advantages
  – Use Excel as a container for dependencies
  – Use R code functions without lengthy “IF” statements
  – Allows automatic recalculation via Excel’s computation engine (R will not do this by itself)
See RExcelExamples workbook, Part1 tab

Part 2 Regression: Excel and R

Excel
1. Excel functions
  – TREND(Y-range, X-range, X-value for prediction) function
  – LINEST(Y-range, X-range, Const, Stats) array function
2. Excel’s Analysis ToolPak
  – Data -> Data Analysis -> Regression -> then fill in the dialog box (see example sheet)

R
1. Use R Commander: "Statistics" -> "Fit models" -> "Linear regression..."
2. Use R code via RExcel:
    myfit = lm(formula = Sales ~ Advertising, data = salesdata)
    summary(myfit)

Benefits of each: Use whichever you prefer and is more advantageous for your problem
• The Excel functions update automatically
• The Analysis ToolPak outputs the statistics in a nice, readable table
• R Commander has nice drop-down menus
• R provides plots that are not easily available via Excel alone
• R is more extensible and allows more advanced modeling

Regression: Assumption Review

[Figure: Sales vs. Advertising scatter plot; x-axis: Advertising (in $1000s), y-axis: Sales (in $1000s)]

Gauss-Markov Theorem
Tells us that our OLS estimators (our intercept and slope) are unbiased and have minimum variance among all linear unbiased estimators IF…
Two assumptions:
(1) Independence => $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
(2) Equal variance (a.k.a. homoscedasticity, same finite variance) => $\mathrm{Var}(\varepsilon_i) = \sigma^2$
To make inferences, perform statistical tests, and create confidence intervals, we need to assume a third condition:
(3) Error is normally distributed => $\varepsilon_i \sim N(\text{mean} = 0, \text{variance} = \sigma^2)$

Part 3 Regression: Assumption investigation (a.k.a. Diagnostics)
The linear relationship between Sales and Advertising looks fine.
What do you think about our independence assumption?
What do you think about the constant finite variance assumption?
What about normality?
Anything else stand out?

Regression: Fit without influential points
Here we see our new fitted line, in addition to how well our model performed at estimating sales.

Regression: More diagnostics
The linear relationship between Sales and Advertising is fine.
What do you think about our independence assumption?
What do you think about the constant finite variance assumption?
What about normality?
Anything else stand out?

Regression: Interpretation

Regression Statistics: What is this calculation?
• Multiple R = 0.956: This is the Pearson correlation of x and y for simple regression, or the square root of "R Square".
• R Square = 0.914: A.k.a. Multiple R-squared, this is the fraction of total variation explained by the model.
• Adjusted R Square = 0.909: Similar to R-square, but adjusts for the number of covariates in the model.
• Standard Error = 32.707: This is the standard error of our residuals.
• Observations = 18: This is the total number of observations we used.

Regression Statistics: What does it mean?
• Multiple R = 0.956: There is a strong positive linear relationship between advertising and sales.
• R Square = 0.914: 91.4% of the variation in sales is explained by the variation in advertising.
• Adjusted R Square = 0.909: 90.9% of the variation in sales is explained by the variation in advertising, accounting for the number of covariates used.
• Standard Error = 32.707: This is our measure of spread or variability for our residuals in the model.
• Observations = 18: This is the total number of observations we used.

ANOVA
The F and Significance F values tell us that our slope is statistically significant, meaning it is highly unlikely that the true slope is 0.

              df         SS         MS       F   Significance F
Regression     1   181861.7   181861.7   170.0            0.000
Residual      16    17116.4     1069.8
Total         17   198978.0

ANOVA - This is just a table that summarizes the levels of variation.
• Regression: df = k = # covariates; SS = SSR = variation in the mean response; MS = MSR = SSR/k; F = MSR/MSE; Significance F = p-value of the F-test
• Residual: df = n - 1 - k; SS = SSE = variation in residuals; MS = MSE = SSE/(n - 1 - k)
• Total: df = n - 1; SS = SST = total variation in the response

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept          -109.64            37.46    -2.93     0.010     -189.06      -30.23
Advertising           6.26             0.48    13.04     0.000        5.24        7.28

• Intercept: Coefficient = y-int; Standard Error = seY = s.e. of y-intercept; t Stat = y-int/seY; P-value = significance for y-int; 95% CI for parameter = (lower limit, upper limit)
• Advertising: Coefficient = slope; Standard Error = seX = s.e. of slope; t Stat = slope/seX; P-value = significance for slope; 95% CI for parameter = (lower limit, upper limit)

Part 4 Using R Commander
• Obtain data sets from R libraries or load in your own
• Nice drop-down menus for basic statistics and plots (code prints to the R Commander window)
• Common distributions are available via drop-down menus
See RExcelExamples workbook, Part4 tab

Part 5 Using built-in R Commander plug-ins
Let's look at RcmdrPlugin.HH.
These plots are useful, but somewhat dull. The code that generates them will show up in the R Commander window (very useful for newbies). As with plotting in Excel, you can get what you need by default, but you’ll probably have to modify the graph a bit.
[Figures: XY conditioning plot (HH); side-by-side boxplot]

Additional References
• http://rcom.univie.ac.at/RExcelDemo/
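
Worked check of the regression output
As a cross-check on the Part 3 regression output, most of the summary statistics can be recomputed by hand from the ANOVA and coefficient tables. The arithmetic below is only a sketch using the printed (rounded) values, with n = 18 observations and k = 1 covariate. The adjusted R-square formula and the critical value t(0.975, 16) ≈ 2.120 are standard results that the slides do not state, so treat them as assumptions here. Small discrepancies are due to rounding.

```latex
\begin{align*}
MSE &= \frac{SSE}{n-1-k} = \frac{17116.4}{16} \approx 1069.8 \\
F &= \frac{MSR}{MSE} = \frac{181861.7}{1069.8} \approx 170.0 \\
R^2 &= \frac{SSR}{SST} = \frac{181861.7}{198978.0} \approx 0.914 \\
\text{Multiple } R &= \sqrt{R^2} \approx \sqrt{0.914} \approx 0.956 \\
R^2_{adj} &= 1 - (1 - R^2)\,\frac{n-1}{n-1-k}
           \approx 1 - 0.086 \cdot \frac{17}{16} \approx 0.909 \\
\text{Standard Error} &= \sqrt{MSE} = \sqrt{17116.4 / 16} \approx 32.707 \\
t_{\text{slope}} &= \frac{\text{slope}}{se_X} = \frac{6.26}{0.48} \approx 13.04,
  \qquad t_{\text{slope}}^2 \approx 170.1 \approx F \\
\text{95\% CI for slope} &= 6.26 \pm 2.120 \times 0.48 \approx (5.24,\ 7.28)
\end{align*}
```

The near-equality of the squared slope t statistic and F holds because the model has a single covariate, in which case the F-test and the slope t-test are the same test.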