
MTO4 – summary slides
Introduction
Types of data:
Nonmetric
Nominal (categorical) = blue, green
Ordinal = bad, medium, good
Metric
Interval = 1..5
Ratio = 43.28
Validity = how well the instrument measures what it is supposed to measure
Reliability = degree to which repeated measurements give the same result
Statistical Techniques
Dependence:
Multiple Regression: 1 metric explained by other metrics
Logistic Regression: 1 binary explained by other metrics
(M)ANOVA: metric(s) explained by non-metrics
Structural Equation Modeling (SEM): multiple interrelated relationships explained with a structural and a
measurement model
Interdependence: (without distinction between dependent and independent)
Factor Analysis: Analyses of structure to determine underlying dimensions
Examining data
Missing data: can be random or systematic
Test randomness: t-test (tests mean differences between the group with missing data and the other group)
Little’s MCAR χ²-test (p < 0.05 -> non-random)
Dealing with missing data: only using complete observations, deleting cases with many missing values (>10%),
or estimating values (MCAR or MAR)
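A minimal sketch of these three options in Python, assuming the raw responses sit in a hypothetical pandas DataFrame `df` (mean imputation stands in for the estimation approach):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"x1": [1.0, None, 3.0, 4.0],
                   "x2": [2.0, 2.5, None, 4.5]})

complete = df.dropna()                             # only complete observations
few_missing = df[df.isna().mean(axis=1) <= 0.10]   # drop cases with >10% missing
imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                       columns=df.columns)         # estimate missing values
```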
Outliers:
π‘₯𝑖 −π‘₯Μ…
Univariate (1): extreme values, 𝑧 = 𝑠𝑑
Bivariate (2): unusual combination of values, scatterplots boxplots
Multiple (>2): unusual combination of multiple values, Mahalonobis Distance (p<0.001)
Extreme events, data entry errors, weird respondents
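A sketch of the univariate and multivariate checks above, with the cut-offs as listed (the array `X` is illustrative):

```python
import numpy as np
from scipy import stats

X = np.random.default_rng(0).normal(size=(100, 3))

# Univariate: z = (xi - x̄) / sd, flag extreme values
z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
univariate_outliers = np.abs(z) > 3

# Multivariate: Mahalanobis distance of each case from the centroid,
# compared to a chi-square distribution (flag p < 0.001)
d = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
md2 = np.einsum("ij,jk,ik->i", d, cov_inv, d)      # squared distances
multivariate_outliers = stats.chi2.sf(md2, df=X.shape[1]) < 0.001
```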
Assumptions:
Normality: histograms, normal probability plots
Non-normality: skewness, kurtosis
Testing: Shapiro-Wilk test, Kolmogorov-Smirnov test
Data transformations when normality assumption is violated
Linearity: linear functional form of relationships, testing with scatterplots
Data transformations to achieve linearity
Homoscedasticity: variances of subpopulations of values are all equal
Testing with scatterplots, Levene test, Box M test
When violated, take the square root (for a cone opening to the left; for one opening to the right, first take the inverse)
Uncorrelated Errors: violated when observations are ‘nested’ within groups
When violated use hierarchical linear modelling and take this into account
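The tests named above are available in scipy; a quick sketch with illustrative data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(size=200)
group_a, group_b = y[:100], y[100:]

print(stats.skew(y), stats.kurtosis(y))   # non-normality diagnostics
print(stats.shapiro(y))                   # Shapiro-Wilk normality test
print(stats.kstest(y, "norm"))            # Kolmogorov-Smirnov vs N(0, 1)
print(stats.levene(group_a, group_b))     # homoscedasticity across groups
```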
(Exploratory) Factor Analysis
Objectives
Definition: examining the interrelationships among a larger set of variables and attempting to explain
them in terms of their common underlying dimensions (factors); the aim is to explain maximal variance
in the variables with minimal loss of information
Objectives: Identifying underlying causal structure, data reduction, summarization
Designing
Measurement: metric
Variables: min 3 per Latent variable (factor)
Sample size: ideally n>150, min n>50, random sampling
Normality Assumption
Assumptions
Strong conceptual foundation: a substantial number of correlations > 0.30
Bartlett’s Test of Sphericity (roundness)
Measure of Sampling Adequacy: proportion of variance in the variables that is common variance (min > .5, > .8 good)
No multicollinearity: no singular matrix, determinant > .00001
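These checks can be run with the third-party factor_analyzer package (an assumption, not part of the slides); `df` is a hypothetical DataFrame of items:

```python
import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import (calculate_bartlett_sphericity,
                                             calculate_kmo)

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(150, 6)),
                  columns=[f"item{i}" for i in range(1, 7)])

chi2, p = calculate_bartlett_sphericity(df)   # Bartlett's Test of Sphericity
kmo_per_item, kmo_total = calculate_kmo(df)   # MSA: want > .5, > .8 is good
det = np.linalg.det(df.corr())                # no singularity: det > .00001
```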
Deriving Factors & Assessing Overall Fit
Selecting Factor Extraction Methods:
1. Principal Components Analysis (PCA, default in SPSS): creates linear combinations of the original
variables, with weights chosen such that the variance of the variables explained by the
factors is maximized.
Factors are uncorrelated (orthogonal)
Not a very realistic method
2. Common Factor Analysis (in SPSS: Principal Axis Factoring)
Total variance of each variable consists of: common variance + unique variance + random variance
Assessing fit
Estimation procedure is based on correlation matrix with communalities in the diagonal
Communality of X = percentage of variance in X explained by all factors
Starting communality of variables: 1 by default -> PCA; estimated R² from a multiple regression with the
given variable as Y and the others as X -> Common Factor Analysis
Choosing Method
Objectives: data reduction -> PCA, understanding-> Common Factor Analysis
Amount of prior knowledge: small error -> PCA, limited knowledge -> Common Factor Analysis
Number of factors
A priori criterion: determine the number of factors in advance, then choose the best
Latent Root or Kaiser Criterion (eigenvalue): eigenvalue>1
Proportion of Variance Accounted For: cumulative (>60%)
Scree Test Criterion: Determine on Graph
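The eigenvalue-based criteria can be computed directly from the correlation matrix; a sketch with illustrative data:

```python
import numpy as np

X = np.random.default_rng(3).normal(size=(200, 8))
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

kaiser_n = int((eigenvalues > 1).sum())               # Kaiser: eigenvalue > 1
cum_var = np.cumsum(eigenvalues) / eigenvalues.sum()  # cumulative proportion
variance_n = int(np.argmax(cum_var >= 0.60)) + 1      # first point past 60%
# Scree test: plot `eigenvalues` against their rank and look for the elbow.
```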
Interpreting
Estimate Factor Matrix: significance (n=120 >.5, n=60 >.7)
Interpret Factor Matrix: optimal when all variables have high loadings on one factor -> convergent validity
Delete variables that cross-load (high load on >1 factor) -> discriminant validity
Cut-off value cross-loadings >.4
Variables should generally have communalities>.5
Factor Rotation
Unrotated: loadings determined such that factors explain maximal variance, difficult for interpretation
Rotated: redistribute variance from earlier factors to later factors to achieve more meaningful pattern
Methods:
Orthogonal (factors not correlated): data reduction (SPSS: Varimax)
Oblique (factors are correlated): obtaining a framework of theoretically meaningful factors (SPSS: Direct Oblimin)
Interpretation of results: Orthogonal -> Rotated Factor Component Matrix, Oblique -> Pattern Matrix
(loadings) not Structure Matrix (correlations)
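A minimal rotation sketch, again assuming the third-party factor_analyzer package (illustrative data; the SPSS names above map to the `rotation` argument):

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(150, 6)),
                  columns=[f"item{i}" for i in range(1, 7)])

fa = FactorAnalyzer(n_factors=2, rotation="varimax")      # orthogonal
fa.fit(df)
print(fa.loadings_)                                       # rotated loadings

fa_obl = FactorAnalyzer(n_factors=2, rotation="oblimin")  # oblique
fa_obl.fit(df)
print(fa_obl.loadings_)   # pattern matrix: interpret these, not correlations
```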
Validation
Confirmatory Perspective -> Overall fit of factor solution
Detecting influential observations
Additional
Confirmatory Factor Analysis is quite similar to, but philosophically different from, Exploratory Factor Analysis
EFA: factors derived from statistical results, not theory, all measured variables related to every factor
CFA: a priori determination of the number of factors and which variables belong to each factor
Logistic Regression Analysis
1 binary dependent variable explained by multiple independent metric variables
Predicted probabilities are always between 0 and 1
Maximum Likelihood method: yields the values for the unknown parameters that maximize the probability
of obtaining the observed set of data
Interpret coefficients
- Direction (positive or negative relationship)
- Significance (Wald method)
- Magnitude (percentage change in odds)
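A sketch of fitting and interpreting a logistic regression with statsmodels (data and names are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

fit = sm.Logit(y, sm.add_constant(X)).fit()   # maximum likelihood estimation
print(fit.summary())                          # direction and Wald significance
print(np.exp(fit.params))                     # odds ratios: % change in odds
```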
Assumptions
Hosmer & Lemeshow test: checks whether observed and predicted frequencies are about equal
Isolate points for which the model fits poorly and influential data points (Cook’s distance > 1)
Multiple Regression Analysis
Stages of multivariate statistical analysis
1. Examining data
2. Checking quality of constructs (Cronbach’s alpha, factor analysis)
3. Analyzing relationships (regression analysis, MANOVA, SEM, etc.)
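For step 2, Cronbach’s alpha has no single standard call in the common Python libraries; a small hand-rolled sketch (`items` is an n-by-k array of item scores for one construct):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal-consistency reliability of k items measuring one construct."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the sum score
    return k / (k - 1) * (1 - item_var / total_var)
```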
Multiple regression analysis: used to analyze the relationship between a single dependent variable and
multiple independent variables.
Explaining variance in outcomes, Predicting/forecasting outcomes
Y = a + biXi + ε
Estimation Procedure
Testing significance: F-test = MSregression / MSresidual
Explained variance: R² = SSregression / SStotal (physical sciences min > .6, social sciences > .25)
𝑅 π‘Žπ‘‘π‘—π‘’π‘ π‘‘π‘’π‘‘is useful when comparing equations with different observations and or predictors
Regression coefficient, significance and interpretation
H0 : Bi = 0
H1 : Bi ≠ 0
𝐡 π‘π‘œπ‘’π‘“π‘“π‘–π‘π‘–π‘’π‘›π‘‘
𝑑 π‘£π‘Žπ‘™π‘’π‘’ =
𝑠𝑑. 𝑑𝑣.
B coefficient: reflect absolute effect size, not possible to compare when X’s vary in scales
Beta coefficient: reflect relative importance, X’scan be compared
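All of the quantities above (F-test, R², adjusted R², t-values) come out of one statsmodels OLS fit; a sketch with illustrative data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.fvalue, fit.f_pvalue)        # F-test = MSregression / MSresidual
print(fit.rsquared, fit.rsquared_adj)  # R² and adjusted R²
print(fit.tvalues, fit.pvalues)        # t = B coefficient / standard error
```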
Bivariate correlation: linear relationship between two variables
Assumptions
1. Linearity
2. Homoscedasticity
3. Independence of errors
4. Normally distributed
5. Variables at least intervally scaled
6. No multicollinearity between X’s
7. Independent variables, X’s, are measured without error
Special Topics
Power: 1-𝛽, probability that you correctly find a significant effect, should be at least 80%
Ratio observations/variables, min 5/1 – ideally 15/1
Multicollinearity: undesirable situation in which the predictor variables (X’s) are strongly related to each other
- Bivariate correlations (between two variables)
- Extent to which variance in Xi can be explained by the other X’s
o Tolerance (TOL) – proportion of variance that cannot be explained by others (>.1)
o Variance Inflation Factor (VIF), inverse of TOL
When including non-metric variables, one needs to create dummy variables (number of categories - 1)
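VIF is available in statsmodels and dummy coding in pandas; a sketch (data and names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame(np.random.default_rng(7).normal(size=(100, 3)),
                 columns=["x1", "x2", "x3"])
Xc = sm.add_constant(X)

# VIF per predictor (skipping the constant); TOL is its inverse, flag TOL < .1
vif = [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])]
tol = 1 / np.array(vif)

color = pd.Series(["blue", "green", "red"] * 34)[:100]
dummies = pd.get_dummies(color, drop_first=True)   # number of categories - 1
```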
Selection Methods
Simultaneous Regression (SPSS: Enter): includes all predictors at the same time
Appropriate: confirmation of existing theory
Problem: specification error
Sequential Regression methods (e.g. Stepwise): include the most relevant predictor, then consider what the other X’s add
First the X with the highest bivariate correlation with Y; then the X with the highest F-test is added
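A simplified sketch of that forward procedure (real stepwise methods also apply F-to-enter/remove thresholds; this version just greedily adds the predictor that raises R² most):

```python
import numpy as np
import statsmodels.api as sm

def forward_select(X: np.ndarray, y: np.ndarray, n_keep: int) -> list:
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        # R² of the model when column j is added to the current set
        def rsq(j):
            return sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().rsquared
        best = max(remaining, key=rsq)
        selected.append(best)
        remaining.remove(best)
    return selected
```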
Univariate and Multivariate Analysis of Variance (ANOVA, MANOVA)
ANOVA
- tests whether all group means are the same (number of levels > 2)
Variance: between-groups + within-group (error)
When H0 : μ1 = μ2 = μ3 is rejected, ANOVA does not tell which pairs of means differ
from one another
Yij = μ + αj + eij
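A one-way ANOVA sketch for the model Yij = μ + αj + eij, using scipy with illustrative groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
g1, g2, g3 = (rng.normal(loc=m, size=30) for m in (0.0, 0.2, 0.5))

f_stat, p_value = stats.f_oneway(g1, g2, g3)   # H0: all group means equal
# A significant F does not say which pairs differ; follow up with post-hoc tests.
```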
MANOVA
- Extension of ANOVA in which the effect(s) of discrete independent variables are assessed on a
combination of dependent variables
- Tests whether mean differences among groups on a combination of dependent variables are
likely to occur by chance
- Creates a new dependent variable that is a linear combination of the original dependent
variables and maximizes the difference between groups
MANOVA vs ANOVA
1. Multiple testing with ANOVA will increase the probability of an alpha (Type I) error
2. For multiple dependent variables MANOVA takes the intercorrelations into account
3. Differences between groups may be too small to be detected on a single dependent variable,
but when they are considered jointly there may be a significant difference
Multivariate test criteria
Wilks’ Lambda, Hotelling’s Trace, Pillai’s Trace: pool variance from all dimensions to create the test statistic
Roy’s largest root: uses variance from the dimension that separates the groups most
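statsmodels reports all four criteria from one MANOVA fit; a sketch with illustrative data:

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(9)
df = pd.DataFrame({"y1": rng.normal(size=90),
                   "y2": rng.normal(size=90),
                   "group": ["a", "b", "c"] * 30})

fit = MANOVA.from_formula("y1 + y2 ~ group", data=df)
print(fit.mv_test())   # Wilks' lambda, Pillai's trace, Hotelling-Lawley, Roy
```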
Assumptions
- Independence of observations
- Homogeneity of the covariance matrices
Test: Levene’s test, Box’s M
- Multivariate Normality
Dep. var. normally distributed; dep. var. normally distributed within each group; any linear
combination must be normally distributed
Use histograms, cumulative normal probability plots, scatter plots
Structural Equation Modeling (SEM)
Nowadays the most dominant multivariate technique
Combines measurement theory (CFA) and structural theory (SEM) in one analysis
Confirmatory Factor Analysis (CFA) vs. Exploratory Factor Analysis (EFA):
CFA: specify in advance both the number of factors that exist within a set of variables and which
factor each variable will load on; the model is then tested to see whether this a-priori pattern of
factor loadings represents the actual data.
CFA or SEM
Relationship between construct and variable -> CFA
Relationship between construct and multiple variables -> CFA
Structural relation between two constructs -> SEM
Correlational relationship between constructs -> CFA or SEM
Causality
Four types of evidence must be met: 1. systematic covariation, 2. temporal sequence, 3. nonspurious
covariance, 4. theoretical support
Six modelling stages (1-4 = CFA, 5-6 = SEM)
1. Defining Individual constructs
2. Developing Overall measurement model
3. Designing a study to produce empirical Results
<5 constructs (>3 items each) with high item communalities (>.6): min sample size 100-150, 200
recommended. More constructs need a bigger sample size, >500
4. Assessing the measurement model validity
Goodness of Fit (GOF); types are absolute (overall), incremental (degree of improvement) and
parsimonious (number of estimated coefficients required) fit measures; use multiple measures
(the chi-square is the only statistical model test)
5. Specifying the structural model
6. Assessing structural model validity
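In Python, a combined measurement plus structural model can be sketched with the third-party semopy package (an assumption, not part of the slides; lavaan-style syntax, illustrative constructs):

```python
import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(10)
data = pd.DataFrame(rng.normal(size=(200, 6)),
                    columns=["x1", "x2", "x3", "y1", "y2", "y3"])

desc = """
# measurement model (CFA part)
F1 =~ x1 + x2 + x3
F2 =~ y1 + y2 + y3
# structural model (SEM part)
F2 ~ F1
"""

model = semopy.Model(desc)
model.fit(data)
print(semopy.calc_stats(model))   # chi-square and other GOF measures
```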
Advanced Topics
Testing moderating effects: Multigroup Analysis (MGA), Continuous Variable Interaction (CVI)
Longitudinal Analysis:
1. Alpha, Beta, Gamma change
2. Alternative Models (AM) testing