GENERAL LINEAR MODELS
Oneway ANOVA, GLM Univariate (n-way ANOVA, ANCOVA)

BASICS
Dependent variable is continuous
Independent variables are nominal, categorical (factor, CLASS) or continuous (covariate)
Are the group means of the dependent variable different across the groups defined by the independents?
Main effects, interactions and nested effects
Often used for testing hypotheses with experimental data

                        Factor A (industry)
                        Level 1 (manufact)    Level 2 (trade)
Factor B (size)
  Level 1 (small)       cell                  cell
  Level 2 (medium)      cell                  cell
  Level 3 (large)       cell                  cell

3 X 2 full factorial design (full: each cell has observations)
Balanced design: each cell has an equal number of observations

ASSUMPTIONS
Enough observations in each group? (n > 20)
Independence of observations
Similarity of variance-covariance matrices (no problem if largest group variance < 1.5 * smallest group variance, 4 * if balanced design)
Normality
Linearity
No outlying observations

STEPS OF INTERPRETATION
Model significance? F-test and R square
  Welch, if unequal group variances (this can be tested using the Levene or Brown-Forsythe test)
Significance of effects? (F-test and partial eta squared)
Which group differences are significant? Post hoc or contrast tests
What are the group differences like? Estimated marginal means for groups

Oneway ANOVA
A continuous dependent variable (y) and one categorical independent variable (x), with at least
3 categories (k = number of categories)
Assumptions: y is normally distributed with equal variance in each x category
H0: the mean of y is the same in all x categories
Variance of y is divided into two components: within groups (error) and between groups (model, treatment)
Test statistic = between mean square / within mean square; follows the F-distribution with k-1, n-k degrees of freedom
F-test can be replaced by Welch if variances are unequal

Oneway ANOVA
If the F test is significant, you can use post hoc tests for pairwise comparison of means across the groups
Alternatively (in experiments) you can define contrasts ex ante

Contrast Coefficients
             hair color
Contrast     fair     dark     red     bald
1            1        0        -1      0
2            0.5      0.5      -1      0

SAS: oneway ANOVA
[Screenshots of the SAS task dialogs; annotations:]
Welch: use this instead of F if variances are not equal
BF or Levene, H0: group variances are equal
Post hoc tests

MODEL FIT

Class Level Information
Class              Levels   Values
class_popgrowth    4        1 2 3 4

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3    298.3992640       99.4664213     13.41      <.0001
Error              68    504.4139305        7.4178519
Corrected Total    71    802.8131944

R-Square    Coeff Var    Root MSE    deathrate Mean
0.371692    34.10981     2.723573    7.984722

EQUALITY OF VARIANCES

Levene's Test for Homogeneity of deathrate Variance
ANOVA of Squared Deviations from Group Means
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
class_popgrowth     3     3004.3           1001.4         4.23       0.0084
Error              68    16110.6            236.9

Welch's ANOVA for deathrate
Source             DF         F Value    Pr > F
class_popgrowth     3.0000    13.00      <.0001
Error              18.9519

GROUP MEANS

Level of                  deathrate
class_popgrowth    N      Mean          Std Dev
1                  27     10.5666667    3.01457996
2                  22      6.9272727    1.60064922
3                  17      5.9705882    1.87941637
4                   6      5.9500000    5.61809576

POST HOC TEST
Comparisons significant at the 0.05 level are indicated by ***.
class_popgrowth    Difference         Simultaneous 95%
Comparison         Between Means      Confidence Limits
1-2                 3.6394             1.5135     5.7653    ***
1-3                 4.5961             2.3044     6.8877    ***
1-4                 4.6167             1.2760     7.9573    ***
2-1                -3.6394            -5.7653    -1.5135    ***
2-3                 0.9567            -1.4335     3.3468
2-4                 0.9773            -2.4317     4.3862
3-1                -4.5961            -6.8877    -2.3044    ***
3-2                -0.9567            -3.3468     1.4335
3-4                 0.0206            -3.4942     3.5353
4-1                -4.6167            -7.9573    -1.2760    ***
4-2                -0.9773            -4.3862     2.4317
4-3                -0.0206            -3.5353     3.4942

BOXPLOTS
[Figure: boxplots]

Multiway ANOVA, GLM
A continuous dependent variable y, two or more categorical independent variables (factorial design)
ANCOVA, if there are continuous independents (covariates)
Main effects and interaction effects can be modeled
Fixed factor if all groups are present in the data; random factor if only a random sample of the groups is represented
Eta squared = SSK/SST (between-groups sum of squares / total sum of squares) expresses what percentage of the variance in y is explained by x (not available in the EG menus; add it in SAS code: model y = x1 x2 / ss3 EFFECTSIZE;)

INTERACTION EFFECT
Synergy of two factors: the effect of one factor is different in the groups of the other factor
Crossing effect = interaction effect
Ordinal (lines in the means plot have different slopes, but do not cross)
Disordinal (lines cross in the means plot)

NO INTERACTION
[Figure: means plot of mean profitability by size (small, medium, large), separate lines for manufact and trade]
Size and industry both have a significant main effect
No interaction, homogeneity of slopes

INTERACTIONS
[Figure: means plot of mean profitability by size (small, medium, large), separate lines for manufact and trade]
Ordinal interaction (the effect of size is stronger in manufacturing than in trade)
[Figure: means plot of mean profitability by size (small, medium, large), separate lines for manufact and trade]
Disordinal interaction (the effect of size has a different sign in manufacturing and trade)

NESTED EFFECTS
Nested effect B(A), "B nested within A"
size(industry): the effect of size is estimated separately for each industry group
The difference between a nested effect and an interaction effect is that the main effect of B (size) is not included
The slope of B (size) is different in each category of A
(industry)

ESTIMATED GROUP MEANS
Estimated marginal means or LS (least squares) means
Predicted group means are calculated using the estimated model coefficients
The effects of other independent variables are controlled for
In general they are not equal to the group means from the sample

SUM OF SQUARES
Type I SS does not control for the effects of independent variables that are specified later in the model
Type II SS controls for the effects of all other independents
Types III and IV SS are better in unbalanced designs, IV if there are empty cells

POST HOC TESTS
Multiple comparison procedures, mean separation tests
The idea is to control the inflated risk of Type I error that results from doing many pairwise tests, each at the 5% risk level
E.g. Bonferroni, Scheffe, Sidak, ...
Tukey-Kramer is the most powerful
H0: equal group means -> rejection means that group means are not equal, but failure to reject does not necessarily mean that they are equal (small sample size -> low power -> failure to reject the null)

ANCOVA
The model includes a covariate (= continuous independent variable, often one whose effect you want to control for)
Regress y on the covariate -> then ANOVA with factors explaining the residual
The relationship between covariate and y must be linear, and the slope is assumed to be the same at all factor levels
The covariate and the factor should not be strongly related to each other
Do not include too many covariates: max 0.1*n - (k-1), where k is the number of groups

SAS: Analyze > ANOVA > Linear Models
[Screenshots of the task dialog pages; annotations:]
Effects to be estimated
  For an interaction, first select both variables, then click Cross
Sums of squares
Other options: defaults are ok
Post hoc tests
Plots

SAS - code

  /* ANCOVA: covariate ln_hlo, factors Elinkaari (phase) and Perheyr (family) */
  PROC GLM DATA=libname.datafilename
      PLOTS(ONLY)=DIAGNOSTICS(UNPACK)
      PLOTS(ONLY)=RESIDUALS
      PLOTS(ONLY)=INTPLOT
    ;
    CLASS Elinkaari Perheyr;   /* categorical factors */
    MODEL growthorient = ln_hlo Elinkaari Perheyr Elinkaari*Perheyr
      / SS3 SOLUTION SINGULAR=1E-07 EFFECTSIZE;   /* Type III SS, parameter estimates, effect sizes */
    LSMEANS Elinkaari Perheyr Elinkaari*Perheyr
      / PDIFF ADJUST=BON;   /* Bonferroni-adjusted pairwise comparisons */
  RUN;
  QUIT;

Model significance and fit

Class Level Information
Class        Label     Levels    Values
Elinkaari    phase     3         2 3 4
Perheyr      family    2         0 1

Number of Observations Read    181
Number of Observations Used    132

Source             DF     Sum of Squares    Mean Square    F Value    Pr > F
Model                6    13.03085542       2.17180924     3.59       0.0026
Error              125    75.69810081       0.60558481
Corrected Total    131    88.72895623

R-Square    Coeff Var    Root MSE    growthorient Mean
0.146861    21.79382     0.778193    3.570707

Significance of predictors

Source               Label           DF    Type III SS    Mean Square    F Value    Pr > F
ln_hlo               employees        1    2.88693851     2.88693851     4.77       0.0309
Elinkaari            phase            2    9.52176337     4.76088169     7.86       0.0006
Perheyr              family           1    0.28960870     0.28960870     0.48       0.4905
Elinkaari*Perheyr    Phase*Family     2    1.99071120     0.99535560     1.64       0.1974

EFFECT SIZE OF PREDICTORS

Total Variation Accounted For
                     Semipartial    Semipartial     Conservative 95%
Source               Eta-Square     Omega-Square    Confidence Limits
ln_hlo               0.0325          0.0255         0.0000    0.1112
Elinkaari            0.1073          0.0930         0.0219    0.2056
Perheyr              0.0033         -0.0035         0.0000    0.0488
Elinkaari*Perheyr    0.0224          0.0087         0.0000    0.0842

Partial Variation Accounted For
                     Partial        Partial         95%
Source               Eta-Square     Omega-Square    Confidence Limits
ln_hlo               0.0367          0.0277         0.0000    0.1158
Elinkaari            0.1117          0.0942         0.0225    0.2073
Perheyr              0.0038         -0.0040         0.0000    0.0503
Elinkaari*Perheyr    0.0256          0.0097         0.0000    0.0887

Parameter estimates

Parameter                        Estimate            Standard Error    t Value    Pr > |t|
Intercept                         3.196306815   B    0.49826714         6.41      <.0001
ln_hlo (employees)                0.161079578        0.07377500         2.18      0.0309
Elinkaari 2 (growth)              0.372704251   B    0.49030119         0.76      0.4486
Elinkaari 3 (mature)             -0.041166136   B    0.46224369        -0.09      0.9292
Elinkaari 4 (decline)             0.000000000   B    .                  .         .
Perheyr 0 (non family)           -0.862973482   B    0.92404272        -0.93      0.3522
Perheyr 1 (family)                0.000000000   B    .                  .         .
Elinkaari*Perheyr 2 0             1.250588328   B    0.98491805         1.27      0.2065
Elinkaari*Perheyr 2 1             0.000000000   B    .                  .         .
Elinkaari*Perheyr 3 0             0.654885600   B    0.94241380         0.69      0.4884
Elinkaari*Perheyr 3 1             0.000000000   B    .                  .         .
Elinkaari*Perheyr 4 0             0.000000000   B    .                  .         .
Elinkaari*Perheyr 4 1             0.000000000   B    .                  .         .
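The F values, R-Square and eta-squares in the output above follow directly from the sums of squares. Below is a minimal Python sketch that re-derives them, assuming only the figures printed in the tables. The function and variable names are illustrative and are not part of SAS.

```python
# Sums of squares and degrees of freedom copied from the PROC GLM output.
SS_MODEL = 13.03085542
SS_ERROR = 75.69810081
SS_TOTAL = 88.72895623
DF_MODEL = 6
DF_ERROR = 125

# Type III sum of squares and degrees of freedom for each effect.
TYPE3 = {
    "ln_hlo": (2.88693851, 1),
    "Elinkaari": (9.52176337, 2),
    "Perheyr": (0.28960870, 1),
    "Elinkaari*Perheyr": (1.99071120, 2),
}


def f_value(ss_effect, df_effect, ss_error=SS_ERROR, df_error=DF_ERROR):
    """F = effect mean square / error mean square."""
    return (ss_effect / df_effect) / (ss_error / df_error)


def semipartial_eta_sq(ss_effect, ss_total=SS_TOTAL):
    """Share of the total variation accounted for by the effect."""
    return ss_effect / ss_total


def partial_eta_sq(ss_effect, ss_error=SS_ERROR):
    """Share of (effect + error) variation accounted for by the effect."""
    return ss_effect / (ss_effect + ss_error)


r_square = SS_MODEL / SS_TOTAL
model_f = f_value(SS_MODEL, DF_MODEL)

if __name__ == "__main__":
    print(f"Model: F = {model_f:.2f}, R-Square = {r_square:.6f}")
    for name, (ss, df) in TYPE3.items():
        print(
            f"{name}: F = {f_value(ss, df):.2f}, "
            f"semipartial eta2 = {semipartial_eta_sq(ss):.4f}, "
            f"partial eta2 = {partial_eta_sq(ss):.4f}"
        )
```

The omega-squares and the confidence limits in the effect size table need further formulas and are not reproduced here.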
Prediction for 6 cells

Elinkaari=2 & perheyr=0 (growth phase, non family):
  Growth = 3.20 + 0.16*ln_hlo + 0.37 - 0.86 + 1.25 = 3.96 + 0.16*ln_hlo
Elinkaari=3 & perheyr=0 (mature phase, non family):
  Growth = 3.20 + 0.16*ln_hlo - 0.04 - 0.86 + 0.65 = 2.95 + 0.16*ln_hlo
Elinkaari=4 & perheyr=0 (decline phase, non family):
  Growth = 3.20 + 0.16*ln_hlo + 0.00 - 0.86 + 0.00 = 2.34 + 0.16*ln_hlo
Elinkaari=2 & perheyr=1 (growth phase, family):
  Growth = 3.20 + 0.16*ln_hlo + 0.37 + 0.00 + 0.00 = 3.57 + 0.16*ln_hlo
Elinkaari=3 & perheyr=1 (mature phase, family):
  Growth = 3.20 + 0.16*ln_hlo - 0.04 + 0.00 + 0.00 = 3.16 + 0.16*ln_hlo
Elinkaari=4 & perheyr=1 (decline phase, family):
  Growth = 3.20 + 0.16*ln_hlo + 0.00 + 0.00 + 0.00 = 3.20 + 0.16*ln_hlo

Parameter estimates
SAS note: "The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable."
This note always appears when the model contains categorical independent variables; SAS can nevertheless estimate the coefficients

[Figure: diagnostic plots, annotated: homoskedasticity, outlier diagnostics, residual distribution, model fit, influence diagnostics, residual vs. covariate]

Significance of group differences, main effects

Elinkaari (phase)    growthorient LSMEAN    LSMEAN Number
2 growth             4.14643211             1
3 mature             3.43471035             2
4 decline            3.14843369             3

Least Squares Means for effect Elinkaari
Pr > |t| for H0: LSMean(i)=LSMean(j)
Dependent Variable: growthorient
i/j    1         2         3
1                0.0006    0.1225
2      0.0006              1.0000
3      0.1225    1.0000

                                            H0: LSMean1=LSMean2
Perheyr (family)     growthorient LSMEAN    Pr > |t|
0                    3.46261763             0.4905
1                    3.69043314

Significance of group differences, interaction

Phase        Family    growthorient LSMEAN    LSMEAN Number
2 growth     0         4.34023953             1
2 growth     1         3.95262468             2
3 mature     0         3.33066641             3
3 mature     1         3.53875430             4
4 decline    0         2.71694695             5
4 decline    1         3.57992043             6

Least Squares Means for effect Elinkaari*Perheyr
Pr > |t| for H0: LSMean(i)=LSMean(j)
Dependent Variable: growthorient
i/j    1         2         3         4         5         6
1                1.0000    0.0161    0.1052    0.8474    1.0000
2      1.0000              0.1040    0.8177    1.0000    1.0000
3      0.0161    0.1040              1.0000    1.0000    1.0000
4      0.1052    0.8177    1.0000              1.0000    1.0000
5      0.8474    1.0000    1.0000    1.0000              1.0000
6      1.0000    1.0000    1.0000    1.0000    1.0000

Non-family firms in growth phase differ from non-family firms in mature phase (cells 1 and 3, p = 0.0161)

REPORTING GLM
Model fit: F + df + p and R Square
Nature and significance of effects:
  parameter estimates B + s.e. + t + p, and F + p
  estimated group means (means plot)
  post hoc test results

Means plot
[Figure: means plot of growth orientation by phase (growth, mature, ending), separate lines for family and non-family firms; employees at its mean value (20)]
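The cell equations and the interaction LS means can be checked against the parameter estimates. The Python sketch below is illustrative only: the names are hypothetical, the unrounded estimates are copied from the SOLUTION output, and the covariate mean is not printed in the slides, so it is back-solved from the LS means on the assumption that SAS evaluates LS means at the mean of ln_hlo.

```python
import math

# Parameter estimates copied from the PROC GLM SOLUTION output.
INTERCEPT = 3.196306815
B_LN_HLO = 0.161079578
B_PHASE = {2: 0.372704251, 3: -0.041166136, 4: 0.0}      # Elinkaari
B_FAMILY = {0: -0.862973482, 1: 0.0}                     # Perheyr
# Interaction terms; every cell not listed has estimate 0.
B_PHASE_FAMILY = {(2, 0): 1.250588328, (3, 0): 0.654885600}

# Interaction LS means copied from the LSMEANS output, keyed by (phase, family).
LSMEANS = {
    (2, 0): 4.34023953, (2, 1): 3.95262468,
    (3, 0): 3.33066641, (3, 1): 3.53875430,
    (4, 0): 2.71694695, (4, 1): 3.57992043,
}


def cell_intercept(phase, family):
    """Constant part of the prediction equation for one cell."""
    return (INTERCEPT + B_PHASE[phase] + B_FAMILY[family]
            + B_PHASE_FAMILY.get((phase, family), 0.0))


def predict(phase, family, ln_hlo):
    """Predicted growth orientation for a cell at a given covariate value."""
    return cell_intercept(phase, family) + B_LN_HLO * ln_hlo


def implied_covariate_mean(cell):
    """Covariate value at which the cell prediction equals its LS mean."""
    return (LSMEANS[cell] - cell_intercept(*cell)) / B_LN_HLO


if __name__ == "__main__":
    for cell in sorted(LSMEANS):
        print(cell, round(cell_intercept(*cell), 2),
              round(implied_covariate_mean(cell), 4))
    # Hypothetical usage: a growth-phase family firm with 20 employees,
    # assuming ln_hlo is the natural log of the number of employees.
    print(predict(2, 1, math.log(20)))
```

With unrounded coefficients the decline-phase, non-family intercept comes out as 2.33 rather than 2.34, because the slide adds coefficients that were already rounded. All six cells imply the same covariate value, about 2.38, which is consistent with the LS means being the cell predictions at one common value of ln_hlo.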