AUTHOR: Manoj D. Chiba (2015). Not for distribution without the author's permission.

TESTS FOR PREDICTION:

The ability to choose the most appropriate statistical test(s) to perform on data depends on the type of management question you seek to answer, gain greater insight into, and/or explore. These management questions can be broadly classified into three "buckets", namely: difference, association, and prediction. This set of notes deals with the "bucket" of prediction, and more specifically with multiple regression.

In business management, we are most often concerned with predicting the value of a dependent variable based on the value of an independent variable. The dependent variable is also referred to as the outcome, target or criterion variable, and the independent variable as the predictor, explanatory or regressor variable. A simple linear regression is also referred to as a bivariate linear regression, or simply as a linear regression, premised on the relationship being linear. While some relationships are not linear, we deal only with linear relationships. The simple linear regression deals with only one dependent and one independent variable, such as the prediction of:

1. Sales based on the number of advertisements placed. The number of advertisements placed is the independent variable, with the amount of sales generated, measured on a continuous scale, being the dependent variable. The easiest way to think about this is: the amount of sales generated is DEPENDENT on the number of advertisements placed, and is therefore the dependent variable. The number of advertisements placed is INDEPENDENT of sales, i.e. not dependent on sales in the short term.

2. Employee productivity based on the amount of training received. The amount of training is the independent variable, with employee productivity the dependent variable. The way to think of this is: employee productivity is DEPENDENT on the amount of training received, making training the independent variable.

For those who confuse dependent and independent variables, the principle of CPM may assist: C = Cause, P = Predictor, M = Manipulated. If a variable has one of these characteristics, it is the INDEPENDENT VARIABLE.

From the above it should, however, become apparent that in business no single 'factor', such as the number of advertisements placed, is sufficient to understand the driver of sales. There are at least a few more factors that contribute to sales. So while simple regression allows for an understanding of one variable predicting another, multiple regression allows one to build a more holistic picture of what is influencing a factor such as sales, beyond advertisements alone. Thus, multiple regression allows you to predict a dependent variable (sales, fuel consumption, or electricity demand, for example) based on multiple independent variables.

As multiple regression is an extension of simple regression, the following are the basic requirements for understanding and executing a multiple regression:

1. Two or more independent variables; and
2. One dependent variable.

The independent variables can be either continuous or categorical, while the dependent variable must be continuous.

Types of business problems/scenarios solved using multiple regression:

1. Predict new values for the dependent variable given the independent variables.
a. Personnel professionals (Human Resource practitioners) generally use multiple regression to determine equitable compensation. You can identify a number of factors or dimensions, such as "amount of responsibility" or "number of people to supervise", that you believe contribute to the value of a job. The personnel analyst then usually conducts a salary survey among comparable companies in the market, recording the salaries and respective characteristics (i.e. dimensions) for different positions. This information can be used to build a regression model and so understand the underlying drivers of salary in the market. For example, the model (equation) may be: Salary = 0.5(amount of responsibility) + 0.8(number of people to supervise) + 30,000. This indicates that a position with no responsibility (hardly believable) and no supervision would attract a base salary of R30,000. However, for every one-unit increase in the amount of responsibility, the salary increases by 0.5; and so on. The key here is that the biggest underlying driver of salary is the number of people supervised, as its coefficient is larger than that of the amount of responsibility (assuming the two dimensions are measured on comparable scales).

2. To determine or understand how much variation in the dependent variable is explained by the independent variables.
a. Using the salary example above, we know, or can reasonably assume, that salary is not determined only by the two factors or dimensions of "amount of responsibility" and "number of people to supervise", but by other factors as well, such as the number of years of relevant work experience. A regression model will therefore allow us to understand how much variation is explained by "amount of responsibility" and "number of people to supervise". In this case it may be 20%, which simply means that of the variation in salary that is observed, only 20% is explained by the two dimensions in the equation, and 80% is explained by "other" factors (a short sketch of this calculation follows the list below).
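The 20%/80% split in scenario 2 is just the R-squared (coefficient of determination) of the model. The following is a minimal Python sketch with invented salary figures, not survey data: R-squared is one minus the ratio of the variation the model fails to explain to the total variation around the mean.

    import numpy as np

    salary = np.array([32000, 35000, 31000, 40000, 36000])      # observed salaries (invented)
    predicted = np.array([33000, 34000, 32000, 38000, 37000])   # model's predictions (invented)

    ss_residual = ((salary - predicted) ** 2).sum()   # variation the model misses
    ss_total = ((salary - salary.mean()) ** 2).sum()  # total variation around the mean
    r_squared = 1 - ss_residual / ss_total
    print(round(r_squared, 2))   # 0.84: these invented predictions explain ~84% of the variation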
Like all statistical tests, multiple regression has underlying assumptions that are either assumed to be true or tested. For this course in particular, the underlying assumptions of multiple regression are assumed to be true.

All statistical tests performed using software are underpinned by specific null and alternate hypotheses. These are not hypotheses that YOU specify; rather, they are those that are specified and tested by the software package being used. For the multiple linear regression the hypotheses are:

Null hypothesis (H0): There is no relationship between the X variables and the Y variable. The null hypothesis for the multiple linear regression simply states that the fit of the observed Y values to those predicted by the multiple regression is no better than you would expect by chance.

Alternate hypothesis (H1): There is a relationship between the X variables and the Y variable.

Using the procedure for a multiple linear regression in SPSS below, we will start gaining an understanding of the above theory.
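Before turning to the software, it is worth seeing how a fitted equation is actually used for prediction, which is what the hypotheses above are testing. The sketch below simply codes the salary model from scenario 1; the input scores (a responsibility score of 100 and 12 people supervised) are invented for illustration.

    # Evaluating the hypothetical salary model; the inputs are made-up scores.
    def predicted_salary(responsibility, people_supervised):
        return 0.5 * responsibility + 0.8 * people_supervised + 30_000

    print(predicted_salary(0, 0))     # 30000.0, the base salary (the constant)
    print(predicted_salary(100, 12))  # 30059.6, i.e. 0.5*100 + 0.8*12 + 30000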
PROCEDURE IN SPSS:

Context: A health researcher wants to be able to predict maximal aerobic capacity (VO2max), an indicator of fitness and health. Ultimately, the researcher is trying to understand how fit and healthy executives are during the MBA. Normally, performing this procedure requires expensive laboratory equipment and necessitates that an individual exercise to their maximum (i.e., until they can no longer continue exercising due to physical exhaustion). This can put off individuals who are not very active/fit and those who might be at higher risk of ill health (e.g., older, unfit subjects). For these reasons, it has been desirable to find a way of predicting an individual's VO2max based on attributes that can be measured more easily and cheaply. To this end, the researcher recruits 100 MBA students to perform a maximal VO2max test, and also records their age, weight, heart rate and gender. Heart rate is the average over the last 5 minutes of a 20-minute, much easier, lower-workload cycling test. The researcher's goal is to be able to predict VO2max based on age, weight, heart rate and gender. This will allow the researcher to understand how healthy and fit executives are during the MBA.

Step 1: Click Analyse > Regression > Linear. You will be presented with the Linear Regression dialog box.

Step 2: Transfer the dependent variable, VO2 Max, into the Dependent box by first clicking on the variable VO2 Max and then using the arrow button to move it across. NOTE: You can only add one variable here; you cannot predict the outcome of multiple variables at the same time. Highlight the independent variables of age, weight, heart rate and gender and move them into the Independent(s) box using the arrow button.

Step 3: Click the Statistics button. Ensure the following boxes are checked: Estimates (sometimes selected by default); Confidence intervals; Model Fit (sometimes selected by default); and R squared change. Click Continue.

Step 4: Click OK to generate the output.

PROCEDURE IN EXCEL:

Please note that this requires the Data Analysis Add-In. I have used a different example for the screenshots below; however, the procedure is the same.

Step 1: Click on Data > Data Analysis > Regression. Note that in this example we are trying to predict Cars (dependent variable) from HH size and cubed HH size (two independent variables).

Step 2: Input the dependent variable into the Input Y Range and the two independent variables into the Input X Range.

Step 3: Ensure that you have selected the correct output range and click OK to generate the output.
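If you have neither SPSS nor the Excel Data Analysis Add-In, the same procedure can be run in Python. The sketch below is an illustration under stated assumptions: it uses the statsmodels library, and because the researcher's actual dataset is not available, it simulates 100 records whose variable names mirror the VO2max study. The summary it prints contains the same quantities interpreted step by step in the next section (coefficients, t-values, p-values, R-squared and the ANOVA F-test).

    # A sketch of the regression procedure in Python (statsmodels), run on
    # simulated data; this is NOT the researcher's dataset.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 100
    df = pd.DataFrame({
        "age": rng.integers(25, 45, n),
        "weight": rng.normal(80, 12, n),
        "heart_rate": rng.normal(140, 10, n),
        "gender": rng.integers(1, 3, n),   # coded 1 = male, 2 = female
    })
    # Simulated outcome, loosely shaped like the equation derived below.
    df["vo2max"] = (88 - 0.17 * df["age"] - 0.38 * df["weight"]
                    - 0.12 * df["heart_rate"] + 13 * df["gender"]
                    + rng.normal(0, 3, n))

    model = smf.ols("vo2max ~ age + weight + heart_rate + gender", data=df).fit()
    print(model.summary())   # coefficients, t, p, R-squared, ANOVA F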
INTERPRETING THE OUTPUT FROM SPSS:

Step 1: The first output from SPSS lists the variables that are contained in the equation. Ensure that SPSS has not removed any variables; if it has, establish why.

Step 2: Look at the output labelled Model Summary. There are three key measures that you are interested in in this box:

1. The "R" column represents the value of R, the multiple correlation coefficient. When there is only one independent variable, as in simple linear regression, R is r, the Pearson correlation coefficient; the multiple correlation coefficient, R, generalises the correlation coefficient, r. R can be considered one measure of the quality of the prediction of the dependent variable, in this case VO2 Max. R is, in fact, the correlation between the predicted scores and the actual scores of the dependent variable. R can range in value from 0 to 1, with higher values indicating that the predicted values are more closely correlated with the dependent variable (i.e., the greater the value of R, the better the independent variables are at predicting the dependent variable). A value of 0.760, in this example, indicates a good level of prediction.

2. The second and third key measures are related, and are therefore dealt with together. The "R Square" column represents the R2 value (also called the coefficient of determination). This represents the proportion of variance in the dependent variable that can be explained by the independent variables. You can see from our value of 0.577 that our independent variables explain 57.7% of the variability of our dependent variable, VO2 Max. However, R2 is based on the sample and is considered a positively biased estimate of the proportion of the variance of the dependent variable accounted for by the regression model (i.e., it is larger than it should be when generalising to a larger population). "Adjusted R Square" (adj. R2) attempts to correct for this bias and thus provides smaller values, as would be expected in the population. As such, it is preferable to use this value to report the proportion of variance explained (i.e., report 55.9% instead of 57.7%). From the above we can already see that 55.9% of the variance in VO2 Max is explained by Gender, Age, Heart Rate and Weight.
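The relationship between R Square and Adjusted R Square is mechanical, so the reported figures can be verified with the standard adjustment formula, adj R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the sample size and k the number of independent variables:

    # Verifying the Model Summary: n = 100 students, k = 4 independent
    # variables, R-squared = 0.577.
    n, k, r2 = 100, 4, 0.577
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    print(round(adj_r2, 3))   # 0.559, i.e. the 55.9% reported above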
Step 3: Look at the output labelled ANOVA. The ANOVA table in a multiple regression indicates whether the model that YOU have proposed is a good fit for the data. As such, the most important column to understand here is the Sig. column (the p-value). The general significance rule applies: if this value is less than 0.05 (assuming we are testing at a 95% confidence level), then the model that YOU have proposed is a good fit for the data. If the Sig. column (p-value) is greater than 0.05 (assuming you are testing at a 95% confidence level), then the model you are proposing is a bad fit for the data. If this is the result, it is generally best to stop and rethink which variables, other than those in this model, should apply.

Step 4: Look at the table labelled Coefficients. The first column you want to look at in this table is the B column under Unstandardised Coefficients. NOTE: B in a multiple regression can be thought of as m, the slope, in the general straight-line equation. From the table we can build our model/equation:

VO2 Max = 87.830 - 0.165(age) - 0.385(weight) - 0.118(heart rate) + 13.208(gender)

The first notable aspect is that the constant is 87.830, which technically means that if all the independent variables were 0, the VO2 max would be 87.830. In the context of the above this is impossible, as it would mean that you do not have a person being measured; the person does not exist. The principle nonetheless applies, but the CONTEXT matters. NOTE: As gender was coded as 1 for male and 2 for female, you would need to enter that code into the equation, i.e. 1 for male and 2 for female. This indicates that, holding the other variables constant, females in the above equation would have a VO2 max that is 13.208 higher than that of males.

Step 5: While the above starts giving us an indication of the equation, we must understand whether each independent variable is a significant predictor/contributor to the equation. Technically, we are looking at whether the independent variables' coefficients are statistically different from 0. To identify the significant predictors/independent variables, look at the columns labelled t and Sig. The t column gives the t-value, while the Sig. column gives the p-value. As we can observe from the output, all the independent variables are significant, i.e. their p-values are less than 0.05.

Step 6: Look at the table labelled Coefficients again, and look at the columns for the 95% Confidence Interval for B, Lower Bound and Upper Bound. These simply indicate the upper and lower 95% confidence bounds for the independent variables' coefficients.

The conclusion that we draw from the multiple regression: VO2 Max can be predicted from our independent variables of age, weight, heart rate and gender, and each is a significant predictor. The equation is:

VO2 Max = 87.830 - 0.165(age) - 0.385(weight) - 0.118(heart rate) + 13.208(gender)
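To use the fitted equation for its stated purpose, prediction, you simply substitute values for the independent variables. The sketch below codes the equation above directly; the example person (a 30-year-old female weighing 70 kg with a heart rate of 130) is invented for illustration.

    # Predicting VO2 max from the fitted SPSS equation.
    def predicted_vo2max(age, weight, heart_rate, gender):
        # gender coded 1 = male, 2 = female, as in the dataset
        return (87.830 - 0.165 * age - 0.385 * weight
                - 0.118 * heart_rate + 13.208 * gender)

    print(round(predicted_vo2max(30, 70, 130, 2), 2))   # 67.01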
INTERPRETING THE OUTPUT FROM EXCEL:

Excel provides a single table of output. I use output generated from another dataset (the Cars example) to interpret the findings.

Step 1: Look at the Regression Statistics. There are three key measures that you are interested in here:

1. The "Multiple R" row represents the value of R, the multiple correlation coefficient. When there is only one independent variable, as in simple linear regression, R is r, the Pearson correlation coefficient; the multiple correlation coefficient, R, generalises the correlation coefficient, r. R can be considered one measure of the quality of the prediction of the dependent variable, in this case Cars. R is, in fact, the correlation between the predicted scores and the actual scores of the dependent variable. R can range in value from 0 to 1, with higher values indicating that the predicted values are more closely correlated with the dependent variable (i.e., the greater the value of R, the better the independent variables are at predicting the dependent variable). In this example the Multiple R is approximately 0.896 (the square root of the R Square value of 0.802 below), indicating a good level of prediction.

2. The second and third key measures are related, and are therefore dealt with together. The "R Square" row represents the R2 value (also called the coefficient of determination). This represents the proportion of variance in the dependent variable that can be explained by the independent variables. You can see from our value of 0.802 that our independent variables explain 80.2% of the variability of our dependent variable, Cars. However, R2 is based on the sample and is considered a positively biased estimate of the proportion of the variance of the dependent variable accounted for by the regression model (i.e., it is larger than it should be when generalising to a larger population). "Adjusted R Square" (adj. R2) attempts to correct for this bias and thus provides smaller values, as would be expected in the population. As such, it is preferable to use this value to report the proportion of variance explained (i.e., report 60.5% instead of 80.2%). From the above we can already see that 60.5% of the variance in Cars is explained by HH size and cubed HH size.

Step 2: Look at the ANOVA part of the table. The ANOVA table in a multiple regression indicates whether the model that YOU have proposed is a good fit for the data. As such, the most important column to understand here is Significance F (the p-value). The general significance rule applies: if this value is less than 0.05 (assuming we are testing at a 95% confidence level), then the model that YOU have proposed is a good fit for the data. If Significance F (the p-value) is greater than 0.05 (assuming you are testing at a 95% confidence level), then the model you are proposing is a bad fit for the data. If this is the result, it is generally best to stop and rethink which variables, other than those in this model, should apply. THIS IS THE CASE in the above example. For illustrative purposes we continue.

Step 3: Look at the last part of the output, which contains the intercept and the variables. The first column you want to look at in this section is the Coefficients column (the same as the B column under Unstandardised Coefficients in the SPSS output). From this we can build our model/equation:

Cars = 0.896 + 0.33(HH size) + 0.002(cubed HH size)

Step 4: While the above starts giving us an indication of the equation, we must understand whether each independent variable is a significant predictor/contributor to the equation. Technically, we are looking at whether the independent variables' coefficients are statistically different from 0. To identify the significant predictors/independent variables, look at the columns labelled t Stat and P-value. The t Stat column gives the t-value, while the P-value column gives the p-value. As we can observe from the output, none of the independent variables is significant, i.e. their p-values are greater than 0.05.

Step 5: Look at the columns for the 95% confidence intervals, Lower 95% and Upper 95%. These simply indicate the upper and lower 95% confidence bounds for the independent variables' coefficients.
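The confidence bounds offer a second route to the conclusion of Step 4: if an independent variable's 95% interval contains zero, a coefficient of zero (i.e., no effect) cannot be ruled out, so the variable is not a significant predictor at the 5% level. The sketch below illustrates the rule; the bounds are invented for illustration, not read off the output above.

    # The zero-in-the-interval rule, with invented confidence bounds.
    bounds = {"HH size": (-0.41, 1.07), "Cubed HH size": (-0.03, 0.04)}
    for name, (lower, upper) in bounds.items():
        if lower < 0 < upper:
            print(name, "-> interval spans 0: not a significant predictor")
        else:
            print(name, "-> interval excludes 0: significant predictor")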
The conclusion that we draw from the multiple regression: Cars cannot be predicted from our independent variables of HH size and cubed HH size.