Multivariate Statistical Data Analysis with Its Applications
Hua-Kai Chiou, Ph.D., Assistant Professor
Department of Statistics, NDMC
hkchiou@rs590.ndmc.edu.tw
September, 2005

Agenda
1. Introduction
2. Examining Your Data
3. Sampling & Estimation
4. Hypothesis & Testing
5. Multiple Regression Analysis
6. Logistic Regression
7. Multivariate Analysis of Variance
8. Principal Components Analysis
9. Factor Analysis
10. Cluster Analysis
11. Discriminant Analysis
12. Multidimensional Scaling
13. Canonical Correlation Analysis
14. Conjoint Analysis
15. Structural Equation Modeling

Introduction

Some Basic Concepts of MVA
• What is Multivariate Analysis (MVA)?
• Impact of the Computer Revolution
• Multivariate Analysis Defined
• Measurement Scales
• Types of Multivariate Techniques

• Dependence technique – the objective is prediction of the dependent variable(s) by the independent variable(s), e.g., regression analysis.
• Dependent variable – presumed effect of, or response to, a change in the independent variable(s).
• Dummy variable – nonmetrically measured variable transformed into a metric variable by assigning 1 or 0 to a subject, depending on whether it possesses a particular characteristic.
• Effect size – estimate of the degree to which the phenomenon being studied (e.g., a correlation or a difference in means) exists in the population.

• Indicator – single variable used in conjunction with one or more other variables to form a composite measure.
• Interdependence technique – classification of statistical techniques in which the variables are not divided into dependent and independent sets (e.g., factor analysis).
• Metric data – also called quantitative, interval, or ratio data. These measurements identify or describe subjects (or objects) not only by the possession of an attribute but also by the amount or degree to which the subject may be characterized by the attribute. For example, a person's age and weight are metric data.
• Multicollinearity – extent to which a variable can be explained by the other variables in the analysis. As multicollinearity increases, the variate becomes harder to interpret, because the interrelationships among the variables make it difficult to ascertain the effect of any single variable.
• Nonmetric data – also called qualitative data.
• Power – probability of correctly rejecting the null hypothesis when it is false, that is, of correctly finding a hypothesized relationship when it exists. It is determined as a function of (1) the statistical significance level (α) set by the researcher for a Type I error, (2) the sample size used in the analysis, and (3) the effect size being examined.

• Practical significance – means of assessing multivariate analysis results based on their substantive findings rather than their statistical significance. Whereas statistical significance determines whether the result is attributable to chance, practical significance assesses whether the result is useful.
• Reliability – extent to which a variable or set of variables is consistent in what it is intended to measure. Reliability relates to the consistency of the measure(s).
• Validity – extent to which a measure or set of measures correctly represents the concept of study. Validity is concerned with how well the concept is defined by the measure(s).

• Type I error – probability of incorrectly rejecting the null hypothesis.
• Type II error – probability of incorrectly failing to reject the null hypothesis, that is, the chance of not finding a correlation or mean difference when it does exist.
• Variate – linear combination of variables formed in the multivariate technique by deriving empirical weights applied to a set of variables specified by the researcher.

The Relationship between Multivariate Dependence Methods

Analysis of Variance (ANOVA)
  $Y_1 = X_1 + X_2 + X_3 + \dots + X_n$
  (metric) = (nonmetric)

Multivariate Analysis of Variance (MANOVA)
  $Y_1 + Y_2 + Y_3 + \dots + Y_n = X_1 + X_2 + X_3 + \dots + X_n$
  (metric) = (nonmetric)

Canonical Correlation
  $Y_1 + Y_2 + Y_3 + \dots + Y_n = X_1 + X_2 + X_3 + \dots + X_n$
  (metric, nonmetric) = (metric, nonmetric)

Discriminant Analysis
  $Y_1 = X_1 + X_2 + X_3 + \dots + X_n$
  (nonmetric) = (metric)

Multiple Regression Analysis
  $Y_1 = X_1 + X_2 + X_3 + \dots + X_n$
  (metric) = (metric, nonmetric)

Conjoint Analysis
  $Y_1 = X_1 + X_2 + X_3 + \dots + X_n$
  (metric, nonmetric) = (nonmetric)

Structural Equation Modeling
  $Y_1 = X_{11} + X_{12} + X_{13} + \dots + X_{1n}$
  $Y_2 = X_{21} + X_{22} + X_{23} + \dots + X_{2n}$
  $Y_m = X_{m1} + X_{m2} + X_{m3} + \dots + X_{mn}$
  (metric) = (metric, nonmetric)

What type of relationship is being examined?
• Dependence – How many variables are being predicted?
  – Multiple relationships of dependent and independent variables: structural equation modeling
  – Several dependent variables in a single relationship – What is the measurement scale of the dependent variable?
      Metric dependent, metric independent variables: canonical correlation analysis
      Metric dependent, nonmetric independent variables: multivariate analysis of variance (MANOVA)
      Nonmetric dependent variables: canonical correlation analysis with dummy variables
  – One dependent variable in a single relationship – What is the measurement scale of the dependent variable?
      Metric: multiple regression, conjoint analysis
      Nonmetric: multiple discriminant analysis, linear probability models
• Interdependence – Is the structure of relationships among:
  – Variables: factor analysis
  – Cases/respondents: cluster analysis
  – Objects – How are the attributes measured?
      Metric: multidimensional scaling
      Nonmetric: correspondence analysis

A Structured Approach to Multivariate Model Building
Stage 1: Define the research problem, objectives, and multivariate technique to be used
Stage 2: Develop the analysis plan
Stage 3: Evaluate the assumptions underlying the multivariate technique
Stage 4: Estimate the multivariate model and assess overall model fit
Stage 5: Interpret the variate(s)
Stage 6: Validate the multivariate model

Examining Your Data

HATCO Case
• Primary Database
  – This example investigates a business-to-business case from existing customers of HATCO.
  – The primary database consists of 100 observations on 14 separate variables.
• Three types of information were collected:
  – Perceptions of HATCO: 7 attributes (X1–X7);
  – Actual purchase outcomes: 2 specific measures (X9, X10);
  – Characteristics of the purchasing companies: 5 characteristics (X8, X11–X14).

Table 2.1 Description of Database Variables (Hair et al., 1998)

Variable  Description               Variable Type  Rating Scale
Perceptions of HATCO
X1        Delivery Speed            Metric         0–10
X2        Price Level               Metric         0–10
X3        Price Flexibility         Metric         0–10
X4        Manufacturer's Image      Metric         0–10
X5        Overall Service           Metric         0–10
X6        Salesforce Image          Metric         0–10
X7        Product Quality           Metric         0–10
Purchase Outcomes
X9        Usage Level               Metric         100-point percentage
X10       Satisfaction Level        Metric         0–10
Purchaser Characteristics
X8        Size of Firm              Nonmetric      {0,1}
X11       Specification Buying      Nonmetric      {0,1}
X12       Structure of Procurement  Nonmetric      {0,1}
X13       Type of Industry          Nonmetric      {0,1}
X14       Type of Buying Situation  Nonmetric      {1,2,3}

Fig 2.1 Scatter Plot Matrix of Metric Variables (Hair et al., 1998)

Fig 2.2 Examples of Multivariate Graphical Displays (Hair et al., 1998)

Missing Data
• A missing data process is any systematic event external to the respondent (e.g., data entry errors or data collection problems) or action on the part of the respondent (such as refusal to answer) that leads to missing values.
• Missing data are detrimental in two ways: they can introduce "hidden" biases into the results, and they reduce the sample size available for analysis.
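The effect of missing values on the usable sample size can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are randomly generated (they are not the HATCO pretest data), the variables V1 to V5 are hypothetical, and each value is made missing independently with probability 0.10. Under that assumption the expected share of complete cases is 0.9^5 ≈ 0.59, so an analysis that requires complete cases would lose roughly 40 percent of the sample even though each variable is about 90 percent complete.

```python
import random

def missing_summary(rows):
    """Count complete cases and per-variable valid values.

    rows: list of dicts mapping variable name -> value, where None marks
    a missing value. Returns (n_complete_cases, {variable: n_valid}).
    """
    variables = list(rows[0].keys())
    # Number of non-missing values available for each variable on its own
    valid_per_variable = {
        v: sum(1 for r in rows if r[v] is not None) for v in variables
    }
    # Number of cases with no missing value on any variable
    complete_cases = sum(
        1 for r in rows if all(r[v] is not None for v in variables)
    )
    return complete_cases, valid_per_variable

# Hypothetical data: 100 cases on 5 variables (V1..V5), each value
# independently missing with probability 0.10 (an illustrative assumption).
random.seed(0)
rows = [
    {f"V{j}": (None if random.random() < 0.10 else random.gauss(5.0, 1.0))
     for j in range(1, 6)}
    for _ in range(100)
]

complete, valid = missing_summary(rows)
print("valid values per variable:", valid)
print("complete cases:", complete)
```

The per-variable counts show how much data each variable retains by itself, while the complete-case count shows what remains when every variable must be present at once.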
• Understanding the missing data
  – Ignorable missing data
  – Remediable missing data
• Examining the pattern of missing data

Table 2.2 Summary Statistics of Pretest Data (Hair et al., 1998)

Table 2.3 Assessing the Randomness of Missing Data through Group Comparisons of Observations with Missing versus Valid Data (Hair et al., 1998)

Table 2.4 Assessing the Randomness of Missing Data through Dichotomized Variable Correlations and the Multivariate Test for Missing Completely at Random (MCAR) (Hair et al., 1998)

Table 2.5 Comparison of Correlations Obtained with All-Available (Pairwise), Complete Case (Listwise), and Mean Substitution Approaches (Hair et al., 1998)

Table 2.6 Results of the Regression and EM Imputation Methods (Hair et al., 1998)

Outliers
• Four classes of outliers:
  – Procedural errors
  – Extraordinary events that can be explained
  – Extraordinary observations that have no explanation
  – Observations that fall within the ordinary range of values on each variable but are unique in their combination of values across the variables
• Detecting outliers
  – Univariate detection
  – Bivariate detection
  – Multivariate detection

Outlier detection
• Univariate detection threshold:
  – For small samples, cases with standardized variable values beyond ±2.5 are flagged
  – For larger samples, the threshold rises to standardized values of ±3 or ±4
• Bivariate detection threshold:
  – An ellipse representing between 50 and 90 percent of the normal distribution; cases falling outside the ellipse are flagged
• Multivariate detection:
  – The Mahalanobis distance $D^2$

Table 2.7 Identification of Univariate and Bivariate Outliers (Hair et al., 1998)

Fig 2.3 Graphical Identification of Bivariate Outliers (Hair et al., 1998)

Table 2.8 Identification of Multivariate Outliers (Hair et al., 1998)

Testing the Assumptions of Multivariate Analysis
• Graphical analyses of normality
  – Kurtosis refers to the peakedness or flatness of the distribution compared with the normal distribution.
  – Skewness appears in a normal probability plot as an arc, either above or below the diagonal.
• Statistical tests of normality

  $z_{\text{skewness}} = \dfrac{\text{skewness}}{\sqrt{6/N}}$;  $z_{\text{kurtosis}} = \dfrac{\text{kurtosis}}{\sqrt{24/N}}$

Fig 2.4 Normal Probability Plots and Corresponding Univariate Distribution (Hair et al., 1998)

Homoscedasticity vs. Heteroscedasticity
• Homoscedasticity is an assumption related primarily to dependence relationships between variables.
• Although the dependent variables must be metric, the concept of an equal spread of variance across the independent variables applies whether those variables are metric or nonmetric.

Fig 2.5 Scatter Plots of Homoscedastic and Heteroscedastic Relationships (Hair et al., 1998)

Fig 2.6 Normal Probability Plots of Metric Variables (Hair et al., 1998)

Table 2.9 Distributional Characteristics, Testing for Normality, and Possible Remedies (Hair et al., 1998)

Fig 2.7 Transformation of X2 (Price Level) to Achieve Normality (Hair et al., 1998)

Table 2.10 Testing for Homoscedasticity (Hair et al., 1998)

Sampling Distribution

Understanding sampling distributions
• A histogram is constructed from a frequency table. The intervals are shown on the X-axis, and the number of scores in each interval is represented by the height of a rectangle located above the interval.

• A bar graph is much like a histogram, differing in that the columns are separated from each other by a small distance. Bar graphs are commonly used for qualitative variables.

What is a normal distribution?
• Normal distributions are a family of distributions that have the same general shape. They are symmetric, with scores more concentrated in the middle than in the tails. Normal distributions are sometimes described as bell shaped. The height of a normal distribution can be specified mathematically in terms of two parameters: the mean (μ) and the standard deviation (σ).
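As an illustration, the height of a normal curve with mean μ and standard deviation σ is given by the standard density $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$. The sketch below evaluates that density and also computes the two normality z statistics from the slide on statistical tests of normality. It rests on assumptions the slides do not spell out: skewness and kurtosis are estimated from simple sample moments, kurtosis is taken as excess kurtosis (zero for a normal distribution), and the function names and example scores are hypothetical.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normality_z_stats(values):
    """Return (z_skewness, z_kurtosis) = (skewness/sqrt(6/N), kurtosis/sqrt(24/N)).

    Assumption: moment-based skewness and excess kurtosis (normal = 0).
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0
    return skewness / math.sqrt(6.0 / n), kurtosis / math.sqrt(24.0 / n)

# Peak of the standard normal curve, 1/sqrt(2*pi), about 0.3989
print(normal_pdf(0.0))
# The curve is symmetric about the mean
print(normal_pdf(1.0) == normal_pdf(-1.0))

# Hypothetical scores with one large value in the right tail
scores = [4.1, 4.8, 5.0, 5.2, 5.3, 5.9, 6.4, 9.7]
z_skew, z_kurt = normality_z_stats(scores)
print("z_skewness:", z_skew, "z_kurtosis:", z_kurt)
```

A common reading compares each z value with a critical value from the standard normal distribution, such as ±1.96 for a .05 significance level; values beyond it suggest a departure from normality on that characteristic.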