Module Three: Graphical and numerical exploration of bivariate variables

The graphical and numerical summaries discussed so far describe a single variable, or compare the responses among different levels of a factor. In many interlaboratory testing studies, we are interested in
• studying the relationship between two responses (correlation study),
• predicting a response based on a group of variables (regression modeling).
This type of analysis involves two or more variables. In this module, we will take a look at bivariate cases.

Revisit the example of student information

Student  GPA  Gender  Year  Major       Hours study/wk
1        2.0  M       3     Biology     3.6
2        3.2  F       1     Biology     7.4
3        2.5  F       4     Biology     4.8
4        2.8  M       3     Accounting  5.0
5        3.6  F       2     Accounting  6.5
6        3.1  M       3     Law         4.2
7        2.8  M       2     Law         3.8
8        2.4  M       2     Math        2.5
9        2.8  F       1     Math        5.2
10       2.6  M       3     Math        3.5
11       3.0  F       4     Math        6.8
12       3.2  M       2     Computer    9.3
13       3.7  M       1     Computer    7.2
14       2.7  F       3     Computer    5.3
15       2.9  M       2     Computer    4.2
16       2.5  M       4     Language    2.8
17       2.8  F       4     Language    3.8
18       3.2  F       1     Language    4.8
19       3.4  M       3     Engineer    4.4
20       3.1  F       3     Engineer    7.4

We are interested in studying the following problems:

Q1: Are GPAs different between male and female students? This is a comparative study.
• Graphical methods: side-by-side bar charts or pie charts, or stacked bar charts.
• Numerical comparison between two independent groups.

Q2: Is there a relationship between GPA and hours of study per week? This is a relationship study.
• Graphical method: scatter plots between two variables.
• Numerical investigation of correlation or regression models.

Module One shows a variety of graphical and numerical summaries for comparative studies. In this module, we will focus on bivariate relationships and regression modeling.

How to display the relationship between two variables: Scatter plots
• Describing patterns shown in the following scatter plots:
- What type of pattern do you see? Upward or downward? Curved? None or random?
- How strong is the pattern? All points follow it?
Only weakly visible?
- Are there any unusual observations? Outliers? Clusters? Explanation for groupings?

Numerical Measures for Quantitative Bivariate Data
• The numerical measure used to quantify the relationship is Pearson's correlation coefficient, r.
• Properties of r: $-1 \le r \le 1$.
  - r = 1: perfect positive correlation. Every pair of observations lies on a straight line.
  - r = -1: perfect negative correlation.
  - r = 0: no linear correlation, for example a random pattern.
• In most real-world applications, r is rarely exactly 1, -1 or 0.

Some common relationship patterns
[Figure: six scatter plots of Y against X. The panel descriptions are listed below.]
• Positive correlation, with an outlier: Y = b + ax with a > 0. The regression models with and without the outlier are different. r is positive, around 0.7 with the outlier. Without the outlier, r would be close to one.
• Random, no correlation: Y = b + ax with a almost zero. r is almost zero.
• Nonlinear: the straight line Y = b + ax has b about 19 and a almost zero. The quadratic Y = b + ax + cx^2 has c > 0, so Y has a minimum. r is about zero.
• Positive linear, nonlinear: the straight line Y = b + ax has b about 23 and a > 0. The quadratic Y = b + ax + cx^2 has c < 0, so Y has a maximum. r is positive, about 0.7.
• Highly positive: Y = b + ax with b about 20 and a > 0. r is close to one.
• Highly negative: Y = b + ax with b about 35 and a < 0. r is close to -1.

Numerical Measure of Relationship: Correlation Coefficient
• A simple measure of the relationship between two variables x and y is the correlation coefficient, r:
$$ r = \frac{s_{xy}}{s_x s_y} $$
where $s_x$ and $s_y$ are the standard deviations of the variables x and y.
• The new quantity $s_{xy}$ is called the covariance between x and y and is defined as:
$$ s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$
• A computing formula for the covariance:
$$ s_{xy} = \frac{\sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}}{n - 1} $$
where $\sum x_i y_i$ is the sum of the products $x_i y_i$ for each of the n pairs of measurements.
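As a minimal sketch of the covariance and correlation formulas above, the following Python code applies the computing formula to GPA and hours of study per week from the student example in this module. Python is used here only for illustration (the module itself uses Minitab), and the function names covariance, std_dev and correlation are assumptions of this sketch, not part of the course material.

```python
import math

# GPA and hours of study per week for the 20 students in the example.
gpa = [2.0, 3.2, 2.5, 2.8, 3.6, 3.1, 2.8, 2.4, 2.8, 2.6,
       3.0, 3.2, 3.7, 2.7, 2.9, 2.5, 2.8, 3.2, 3.4, 3.1]
hours = [3.6, 7.4, 4.8, 5.0, 6.5, 4.2, 3.8, 2.5, 5.2, 3.5,
         6.8, 9.3, 7.2, 5.3, 4.2, 2.8, 3.8, 4.8, 4.4, 7.4]


def covariance(x, y):
    """Sample covariance s_xy via the computing formula."""
    n = len(x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    return (sum_xy - sum(x) * sum(y) / n) / (n - 1)


def std_dev(x):
    """Sample standard deviation: the square root of cov(x, x)."""
    return math.sqrt(covariance(x, x))


def correlation(x, y):
    """Pearson correlation r = s_xy / (s_x * s_y)."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))


s_xy = covariance(hours, gpa)
r = correlation(hours, gpa)
print(f"s_xy = {s_xy:.4f}, r = {r:.3f}")
```

For these data r comes out at about 0.64, a moderately strong positive correlation. Squaring it gives about 0.41, which agrees with the R-Sq = 41.1% reported for the fitted line in this module.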
Some connections between the correlation coefficient r and the covariance $s_{xy}$
• If the points in the x vs y plot tend to run from lower left to upper right, then $s_{xy}$ and r will be positive.
• If the points tend to run from upper left to lower right, then $s_{xy}$ and r will be negative.
• If the points are scattered high and low and left and right, then $s_{xy}$ and r will be close to zero.

Use Minitab to compute the correlation coefficient, r:
1. Go to Stat, choose Basic Statistics.
2. Choose Correlation, then complete the dialog box.
3. If 'Display p-value' is selected, the output will show whether the correlation coefficient is significant or not. (More will be discussed later.)

In many real-world applications, we are interested in determining the relationship between Y and X using a model for prediction purposes. E.g., how can you build a model to predict mileage using car weights?
• The value of y typically depends on the value of x; y is called the dependent variable and x is called the independent variable.
• Sometimes it is possible to describe the relationship between x and y using a straight line given by the equation y = ax + b.
• The best-fitting line relating y to x, called the regression or least-squares line, is found by minimizing the sum of the squared vertical differences between the data points and the line itself.
• In some cases, the relation between Y and X is nonlinear. The regression is then a polynomial regression, not a straight line. For example, a quadratic polynomial regression has the form $Y = a_0 + a_1 x + a_2 x^2$. Rice production and the amount of fertilizer would have such a relation.

In many situations, we would like to explain the response Y using several independent variables. This is what we call a multiple regression model. For example, both temperature and humidity may have a significant impact on the length of time a meat product stays fresh. An experiment can be conducted to measure the time period under different combinations of (humidity, temperature).
Y = time period the meat is fresh, X1 = humidity, X2 = temperature.
A multiple regression model would be:
$$ Y = a_0 + a_1 X_1 + a_2 X_2 + a_3 X_1 X_2 + a_4 X_1^2 + a_5 X_2^2 $$
This model is also called a response surface model, since our main purpose is to find the combination of humidity and temperature that maximizes the time. In fact, we can make this experiment more complicated by adding different types of meat to the study: modeling the time using humidity and temperature and, at the same time, comparing the time among different types of meat. This is regression modeling that involves both quantitative and qualitative independent variables.

Regression modeling is itself a two-semester course sequence. The discussion of this subject will be brief, due to the time limit.

The concept of the least squares method for developing regression models

Regression model for estimating GPA using hours of study per week:
GPA = 2.14677 + 0.149898 Hourstudy
S = 0.327960   R-Sq = 41.1%   R-Sq(adj) = 37.8%
[Figure: fitted line plot of GPA against hours of study per week, showing the regression line and the 95% CI bands.]

The idea behind the regression model is to find the 'best line', the one that best describes the relationship. One way to do this is to find the line that gives the smallest sum of squared residuals. A residual is the difference between the observed $y_i$ and the predicted $y_i$.

More specifically, we first choose the type of model that we think will fit the relation best. In this case it is a straight line, y = a + bx, with intercept a and slope b. To distinguish the predicted value from the observed data $(x_i, y_i)$, we use the notation:
$$ \hat{y}_i = a + b x_i $$
The difference is then
$$ e_i = y_i - \hat{y}_i $$
This is the residual for the ith case.
• The best least squares regression model is the one that minimizes the sum of the squared $e_i$'s. That is, we are looking for a and b, using the data $(x_i, y_i)$, so that
$$ \sum e_i^2 = \sum (y_i - \hat{y}_i)^2 $$
is minimized.
• The solution of this minimization problem gives the formulas for computing b and a:
$$ b = r \frac{s_y}{s_x} \quad \text{and} \quad a = \bar{y} - b \bar{x} $$
When r is positive, b is positive; when r is negative, b is negative; when r is zero, b is zero.

Computing a regression line similar to the one above for fitting GPA using Hours Study:
1. Go to Stat, choose Regression, select 'Fitted Line Plot'.
2. In the dialog box, enter GPA for Response Y, and Hourstudy for Predictor X.
3. You may choose a linear, quadratic or cubic regression line, depending on the relation shown in the data.
4. You may click on 'Options' and choose 'Display Confidence Band', as shown in the example above.

A few key points about regression modeling
1. The regression model for this study is GPA = 2.15 + .15(Hours Study).
2. Hours Study explains about 41% of the variation in GPA (R² = 41.1%; adjusted R² = 37.8%). More specifically, R² measures the proportion of the variation in GPA explained by Hours Study. Generally speaking, the higher the R², the more the X variable can explain the pattern of the response.
3. We can apply the model to estimate a student's GPA if we know the number of hours they study per week. For example, Tom spends about 5 hours per week studying. Based on this model, we estimate his GPA would be about 2.9.
4. Furthermore, we are 95% confident that the mean GPA of students who study 5 hours per week is between 2.74 and 3.05. This interval comes from the upper and lower 95% confidence bands shown on the graph: for a given X value (hours of study), the band gives a 95% confidence interval for the mean GPA at that X. An interval for an individual student's GPA (a prediction interval) would be wider. The numerical limits can also be computed using Minitab.
5. What happens if a student spends, on average, 20 hours per week studying? What is the estimated GPA? Using the model, the estimate would be 5.15, which is impossible! So, what is wrong? (Hint: the hours of study in the data range only from 2.5 to 9.3.)
6. How do we know if the variable 'Hours Study' is indeed a 'good' predictor? R² gives some indication of this.
The higher the R², the more of the variation in GPA the X variable can explain, and therefore the more precise the prediction will be. Another commonly used approach to find out the degree of significance of an X variable is to conduct a hypothesis test. HOW?

Using Minitab for Regression Modeling

When we fit the regression line for modeling GPA using Hours Study, we use the 'Fitted Line Plot' procedure. This handles only the one-variable case and gives rather limited results. A more general procedure in Minitab for regression modeling is the following:
1. Go to Stat, choose Regression, then select the 'Regression' procedure.
2. In the dialog box, enter GPA as Response. Enter HourStudy into the Predictors box. (You can enter as many independent variables as needed here for multiple regression.)
3. There are four sub-dialog boxes available: Graphs for residual analysis, Results for additional outputs, Storage for storing some results in the worksheet for further analysis, and Options for a variety of additional analyses.
4. The confidence limits for studying 5 hours that I obtained (2.74 to 3.05) are computed in the Options dialog. Click on 'Options', enter the X values for which you would like to compute confidence limits or prediction limits (in this case, I enter 5), and choose to store the confidence limits. This will give us 2.74 and 3.05 in the data worksheet as CLIM1 and CLIM2.

How to conduct a complete Regression Analysis?

The above discussion covers some basic regression modeling and its interpretation, and introduces Minitab for conducting these analyses. It will take some time to talk about a more complete regression analysis. If we have time, we will come back to this topic.

Before we move on to another topic, let's work on a project of simple linear regression analysis.
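As a possible starting point for such a project, here is a minimal Python sketch of the least-squares formulas from this module, applied to the GPA and hours of study data. It is an illustration only and is not how Minitab computes its output; the function name fit_line and the variable names are assumptions of this sketch.

```python
# GPA and hours of study per week for the 20 students in the example.
gpa = [2.0, 3.2, 2.5, 2.8, 3.6, 3.1, 2.8, 2.4, 2.8, 2.6,
       3.0, 3.2, 3.7, 2.7, 2.9, 2.5, 2.8, 3.2, 3.4, 3.1]
hours = [3.6, 7.4, 4.8, 5.0, 6.5, 4.2, 3.8, 2.5, 5.2, 3.5,
         6.8, 9.3, 7.2, 5.3, 4.2, 2.8, 3.8, 4.8, 4.4, 7.4]


def fit_line(x, y):
    """Return (a, b, r_squared) for the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx              # algebraically equal to r * s_y / s_x
    a = y_bar - b * x_bar      # the line passes through (x_bar, y_bar)
    r_squared = sxy ** 2 / (sxx * syy)
    return a, b, r_squared


a, b, r_squared = fit_line(hours, gpa)
print(f"GPA = {a:.5f} + {b:.6f} * Hourstudy, R-Sq = {100 * r_squared:.1f}%")

# Estimate GPA at 5 and 20 hours, flagging values outside the observed X range.
for h in (5, 20):
    in_range = min(hours) <= h <= max(hours)
    note = "" if in_range else " (outside the observed range: extrapolation)"
    print(f"{h} hours -> estimated GPA {a + b * h:.2f}{note}")
```

The fitted coefficients should agree with the Minitab output quoted in this module (GPA = 2.14677 + 0.149898 Hourstudy, R-Sq = 41.1%). With the unrounded coefficients the 20-hour estimate is about 5.14; the 5.15 quoted in the key points uses the rounded model 2.15 + .15(Hours Study). Either way the value exceeds any possible GPA, because 20 hours lies far outside the observed range of 2.5 to 9.3 hours.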