MTO4 – summary slides

Introduction

Types of data:
- Nonmetric
  - Nominal (categorical) = blue, green
  - Ordinal = bad, medium, good
- Metric
  - Interval = 1..5
  - Ratio = 43.28

Validity = how well the instrument measures what it is supposed to measure
Reliability = the degree to which repeated measurement gives the same result

Statistical Techniques

Dependence:
- Multiple Regression: 1 metric variable explained by other metric variables
- Logistic Regression: 1 binary variable explained by metric variables
- (M)ANOVA: metric variable(s) explained by non-metric variables
- Structural Equation Modeling (SEM): multiple interrelated relationships explained with a structural and a measurement model

Interdependence (no distinction between dependent and independent variables):
- Factor Analysis: analysis of structure to determine underlying dimensions

Examining Data

Missing data: can be random or systematic
- Testing randomness: t-test (tests mean differences between the group with missing data and the other group); Little's MCAR χ² test (p < 0.05 → missingness is non-random)
- Dealing with missing data: use only complete observations, delete cases with many missing values (>10%), or estimate the missing values (MCAR or MAR)

Outliers:
- Univariate (1 variable): extreme values, z = (xi − x̄) / s
- Bivariate (2 variables): unusual combination of values; scatterplots, boxplots
- Multivariate (>2 variables): unusual combination of multiple values; Mahalanobis distance (p < 0.001)
Causes: extreme events, data entry errors, atypical respondents

Assumptions:
- Normality: histograms, normal probability plots; non-normality: skewness, kurtosis; testing: Shapiro-Wilk test, Kolmogorov-Smirnov test; data transformations when the normality assumption is violated
- Linearity: linear functional form of the relationships; test with scatterplots; data transformations to achieve linearity
- Homoscedasticity: variances of the subpopulations of values are all equal; test with scatterplots, Levene test, Box's M test; when violated, take the square root (for a pattern opening to the left; for one opening to the right, first take the inverse)
- Uncorrelated errors: violated when observations are 'nested' within groups; in that case use hierarchical linear modeling to take the nesting into account

(Exploratory) Factor Analysis

Objectives
Definition: examining the interrelationships among a larger set of variables and then attempting to explain them in terms of their common underlying dimensions (factors); tries to explain maximal variance in the variables with minimal loss of information
Objectives: identifying the underlying causal structure, data reduction, summarization

Designing
Measurement: metric
Variables: minimum 3 per latent variable (factor)
Sample size: ideally n > 150, minimum n > 50, random sampling
Normality assumption

Assumptions
Strong conceptual foundation: a substantial number of correlations > .30
Bartlett's Test of Sphericity (roundness)
Measure of Sampling Adequacy: proportion of the variance in the variables that is common variance (minimum > .5, > .8 is good)
No multicollinearity: no singular matrix, determinant > .00001

Deriving Factors & Assessing Overall Fit
Selecting a factor extraction method (a sketch follows this list):
1. Principal Components Analysis (PCA, default in SPSS): creates linear combinations of the original variables, with the weights determined such that they maximize the variance of the variables explained by the factors; factors are uncorrelated (orthogonal); not a very realistic method
2. Common Factor Analysis (in SPSS: Principal Axis Factoring): the total variance of each variable consists of common variance + unique variance + random variance
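A minimal numpy sketch of what PCA-style extraction amounts to (the data are simulated purely for illustration, not from the slides): the eigenvalues of the correlation matrix give the variance explained per component, which also feeds the number-of-factors criteria discussed below.

```python
import numpy as np

# Simulated data: 150 respondents, 6 metric items (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))

# PCA operates on standardized variables: extract the eigenvalues of
# the correlation matrix; each eigenvalue is the variance explained by
# one component (together they sum to the number of variables).
R = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(R)[::-1]             # largest first

kaiser = int((eigenvalues > 1).sum())                 # latent root criterion
cum_var = np.cumsum(eigenvalues) / eigenvalues.sum()  # cumulative proportion
print(eigenvalues.round(2), kaiser, cum_var.round(2))
```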
Assessing Fit
The estimation procedure is based on the correlation matrix with communalities on the diagonal.
Communality of X = the percentage of variance in X explained by all factors.
Starting communality of a variable: 1 by default → PCA; the estimated R² of a multiple regression with the given variable as Y and the other variables as X → Common Factor Analysis.

Choosing a Method
- Objective: data reduction → PCA; understanding → Common Factor Analysis
- Amount of prior knowledge: small error → PCA; limited knowledge → Common Factor Analysis

Number of Factors
- A priori criterion: determine the number of factors in advance, then choose the best solution
- Latent Root or Kaiser criterion: eigenvalue > 1
- Proportion of Variance Accounted For: cumulative (> 60%)
- Scree Test criterion: determine from the graph

Interpreting
Estimate the factor matrix: the significance of a loading depends on sample size (n = 120 → > .5, n = 60 → > .7)
Interpret the factor matrix: optimal when every variable loads highly on one factor → convergent validity
Delete variables that cross-load (load highly on more than 1 factor) → discriminant validity; cut-off value for cross-loadings > .4
Variables should generally have communalities > .5

Factor Rotation
Unrotated: loadings are determined such that the factors explain maximal variance; difficult to interpret
Rotated: redistributes variance from earlier factors to later factors to achieve a more meaningful pattern
Methods:
- Orthogonal (factors not correlated): data reduction (SPSS: Varimax)
- Oblique (factors are correlated): obtaining theoretically meaningful factors (SPSS: Direct Oblimin)
Interpretation of the results: orthogonal → rotated factor/component matrix; oblique → pattern matrix (loadings), not the structure matrix (correlations)

Validation
Confirmatory perspective → overall fit of the factor solution
Detecting influential observations

Additional
Confirmatory Factor Analysis is quite similar to, but philosophically different from, Exploratory Factor Analysis:
- EFA: factors are derived from the statistical results, not from theory; all measured variables are related to every factor
- CFA: the number of factors, and which variables belong to each factor, are determined a priori

Logistic Regression Analysis

1 binary dependent variable explained by multiple independent metric variables.
Predicted probabilities are always between 0 and 1.
Maximum Likelihood method: yields the values of the unknown parameters that maximize the probability of obtaining the observed set of data.
Interpreting the coefficients (see the sketch below):
- Direction (positive or negative relationship)
- Significance (Wald test)
- Magnitude (percentage change in odds)
Assumptions
Hosmer & Lemeshow test: checks whether the differences between observed and predicted values are about equal across groups
Isolate points for which the model fits poorly and influential points (Cook's distance > 1)
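A minimal sketch of a logistic regression fit (the data are simulated for illustration; statsmodels is one standard way to do this, not part of the slides). It walks through the three interpretation steps: direction (sign of the coefficient), significance (Wald z-test), and magnitude (odds ratio).

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: binary outcome y, two metric predictors.
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 2)))  # prepend intercept column
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=200) > 0).astype(int)

fit = sm.Logit(y, X).fit(disp=0)  # maximum likelihood estimation
print(fit.params)                 # direction: sign of each coefficient
print(fit.pvalues)                # significance: Wald z-tests
print(np.exp(fit.params))         # magnitude: odds ratios (change in odds)
probs = fit.predict(X)            # predicted probabilities, always in (0, 1)
print(probs.min(), probs.max())
```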
Multiple Regression Analysis

Stages of multivariate statistical analysis:
1. Examining the data
2. Checking the quality of the constructs (Cronbach's alpha, factor analysis)
3. Analyzing the relationships (regression analysis, MANOVA, SEM, etc.)

Multiple regression analysis is used to analyze the relationship between a single dependent variable and multiple independent variables: explaining variance in outcomes, predicting/forecasting outcomes.
Y = a + b1·X1 + … + bk·Xk + ε

Estimation Procedure
Testing significance: F-test
Explained variance: R² = SSregression / SStotal (physical sciences minimum > .6, social sciences > .25)
R²adjusted is useful when comparing equations with different numbers of observations and/or predictors.

Regression coefficient, significance and interpretation
H0: Bi = 0, H1: Bi ≠ 0
t-value = B coefficient / standard error of B
B coefficient: reflects the absolute effect size; not comparable when the X's vary in scale
Beta coefficient: reflects relative importance; the X's can be compared
Bivariate correlation: the linear relationship between two variables

Assumptions
1. Linearity
2. Homoscedasticity
3. Independence of errors
4. Normally distributed errors
5. Variables at least intervally scaled
6. No multicollinearity between the X's
7. The independent variables (X's) are measured without error

Special Topics
Power: 1 − β, the probability that you correctly find a significant effect; should be at least 80%
Ratio of observations to variables: minimum 5:1, ideally 15:1
Multicollinearity: the undesirable situation in which the predictor variables (X's) are strongly related to each other
- Bivariate correlations (between two variables)
- The extent to which the variance in Xi can be explained by the other X's:
  - Tolerance (TOL): the proportion of variance that cannot be explained by the others (> .1)
  - Variance Inflation Factor (VIF): the inverse of TOL
When including non-metric variables, create dummy variables (number of categories − 1).

Selection Methods
Simultaneous regression (SPSS: Enter): includes all predictors at the same time
- Appropriate for confirmation of existing theory; problem: specification error
Sequential regression methods (e.g. stepwise): include the most relevant predictor, then consider what the other X's add
- First the X with the highest bivariate correlation with Y; then the X with the highest F-test with Y is added

Univariate and Multivariate Analysis of Variance (ANOVA, MANOVA)

ANOVA
- Tests whether all group means are the same (number of levels > 2)
- Variance: between-groups + within-groups (error)
- When H0: μ1 = μ2 = μ3 is rejected, ANOVA does not tell which pairs of means differ from one another
- Yij = μ + αj + eij

MANOVA
- Extension of ANOVA in which the effect(s) of discrete independent variables are assessed on a combination of dependent variables
- Tests whether mean differences among groups on a combination of dependent variables are likely to have occurred by chance
- Creates a new dependent variable, a linear combination of the original dependent variables, that maximizes the difference between groups

MANOVA vs. ANOVA
1. Multiple testing with ANOVA increases the probability of an alpha error
2. For multiple dependent variables, MANOVA takes the intercorrelations into account
3. Differences between groups may be too small to be detected on a single dependent variable, but when the variables are considered jointly there may be a significant difference

Multivariate test criteria
- Wilks' Lambda, Hotelling's Trace, Pillai's Trace: pool the variance from all dimensions to create the test statistic
- Roy's largest root: uses the variance from the dimension that separates the groups most

Assumptions (a one-way ANOVA sketch follows this list)
- Independence of observations
- Homogeneity of the covariance matrices; test: Levene's test, Box's M
- Multivariate normality: the dependent variables are normally distributed, normally distributed within each group, and any linear combination of them must be normally distributed; check with histograms, cumulative normal probability plots, scatterplots
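A minimal sketch of a one-way ANOVA with scipy (groups and scores are simulated for illustration): Levene's test checks the homogeneity-of-variance assumption, and the F-test compares between-group to within-group variance.

```python
import numpy as np
from scipy import stats

# Simulated scores for three groups; H0: mu1 = mu2 = mu3.
rng = np.random.default_rng(2)
g1 = rng.normal(5.0, 1.0, size=30)
g2 = rng.normal(5.5, 1.0, size=30)
g3 = rng.normal(6.0, 1.0, size=30)

print(stats.levene(g1, g2, g3))    # homogeneity of variances
print(stats.f_oneway(g1, g2, g3))  # between- vs. within-group variance
# A significant F only says the means are not all equal; a post-hoc
# test is needed to find which pairs of means differ.
```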
Structural Equation Modeling (SEM)

Nowadays the most dominant multivariate technique. Combines measurement theory (CFA) and structural theory (SEM) in one analysis.

Confirmatory Factor Analysis (CFA) vs. Exploratory Factor Analysis (EFA):
CFA: specify in advance both the number of factors that exist within a set of variables and which factor each variable will load on; then test whether this a priori pattern of factor loadings represents the actual data.

CFA or SEM?
- Relationship between a construct and a variable → CFA
- Relationship between a construct and multiple variables → CFA
- Structural relationship between two constructs → SEM
- Correlational relationship between constructs → CFA or SEM

Causality
Four types of evidence must be met: 1. systematic covariation, 2. temporal sequence, 3. nonspurious covariance, 4. theoretical support

Six modeling stages (stages 1 through 4 = CFA; 5 and 6 = SEM):
1. Defining the individual constructs
2. Developing the overall measurement model
3. Designing a study to produce empirical results: for <5 constructs (>3 items each) with high item communalities (>.6), minimum sample size 100-150, 200 recommended; more constructs require a bigger sample, >500
4. Assessing the measurement model validity: Goodness of Fit (GOF); the types are absolute (overall), incremental (degree of improvement) and parsimonious (number of estimated coefficients required) fit measures; use several measures (the chi-square is the only statistical model test)
5. Specifying the structural model
6. Assessing the structural model validity

Advanced Topics
Testing moderating effects: Multigroup Analysis (MGA), Continuous Variable Interaction (CVI); see the sketch below
Longitudinal analysis:
1. Alpha, beta, gamma change
2. Alternative Models (AM) testing
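The Continuous Variable Interaction idea can be previewed with an ordinary regression (a simplified sketch on simulated data, not the full SEM treatment of latent interactions): a moderating effect of m on the x → y relationship shows up as a significant x·m interaction term.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data in which m moderates the effect of x on y.
rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.normal(size=200), "m": rng.normal(size=200)})
df["y"] = df["x"] + 0.5 * df["m"] + 0.3 * df["x"] * df["m"] + rng.normal(size=200)

fit = smf.ols("y ~ x * m", data=df).fit()     # fits x, m and the x:m term
print(fit.params["x:m"], fit.pvalues["x:m"])  # significant -> moderation
```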