Quantitative Research Methods
Oliver Fris Bjørling: 202207312
Julius Jebjerg: 202208510

Contents

Exam prep notes
  Definitions: SRS; Independence; Trustworthiness; Variance homogeneity; Residuals; Prediction interval
Exam summary
  Variables: Nominal; Ordinal; Continuous; Exception
  Syllabus-relevant tests: Goodness of fit / representativeness test; Independence test; ANOVA; Two-factor ANOVA; Regression
Assignment walkthroughs
  Introduction to the data
  Assignment 1: Simple linear regression
    a) Scatter plot of salary vs. age; assess the assumptions for linear regression and discuss how the model could be reformulated to avoid assumption problems
       (base assumptions for the sample; assumptions specifically concerning linear regression: linearity between x and y; uniform distribution of x; normal distribution of y; residual analysis: normally distributed residuals, expected value 0, homoscedasticity, independence)
    b) Develop a relevant linear regression model and explain which variable is the dependent variable
    c) Test whether age has a significant effect on salary (test the model)
    d) Interpret the results of the model
    e) Estimate a 95% confidence interval for the effect that age has on salary
  Assignment 2: Transformations in linear regression + PI + CI
    a) Add the quadratic terms for age to the model and assess whether they should be included
    b) Expected salary, 95% confidence interval, and prediction interval for a 30-year-old employee
    c) Using Sam_løn and Sam_alder, fit y = b0 + b1·x + b2·x² and determine at what age the maximum point (top point) is reached (uncentered and centered polynomial)
  Assignment 3: Multiple linear regression
    a) Estimate the model and briefly assess the assumptions, including whether there is an issue with multicollinearity
       (x variables uniformly distributed; y normally distributed; residual analysis: homoscedasticity, normality, independence, expected value 0; estimating the model)
    b) Argue for the actual (reduced) model
    c) Compare the reduced model with the original model; which one is preferable?
    d) Investigate whether any interaction effects should be included in the final reduced model
  Assignment 4: Variance analyses (one-way / 1-factor ANOVA)
    a) Mean and variance of salary per machine; test for homogeneity of variances, stating the assumptions for the test in b) precisely
    b) Conduct a relevant test to determine if salaries are the same across machines
  Assignment 5: Variance analyses (two-way / two-factor ANOVA)
    a) Formulate a two-factor ANOVA model and test for interaction between gender and marital status when explaining salary
       (assumptions: normally distributed populations or n > 30; approximately even group sizes; variance homogeneity between the groups (F-test))
    b) Comment on any interaction; if there is no interaction, reduce the model and comment
  Assignment 6: Logistic regression
    a) Develop a model that can explain whether an employee is satisfied
       (assumptions: binary dependent variable; approximately equal numbers of observations in groups 1 and 0; no multicollinearity between the x variables)
    b) Interpret and assess the final (reduced) model
    c) Odds ratio for men being satisfied compared to women
    d) Odds ratio per additional unit of currency earned
    e) Probability that an average, married, female employee on machine 1 is satisfied (not all information may be necessary)
  Assignment 7: Chi-squared goodness of fit test
       (assumptions: mutually exclusive groups; independence between groups; rule of five)
  Assignment 8: Chi-squared independence test
    a) Test whether the gender distribution differs across machines; comment on the results
  Assignment 9: Forecasting
    a) Trend model for average salary; predict January 2019; evaluate autocorrelation
    b) Assess whether seasonality can enhance the trend model; predict January 2016
    c) Autoregressive model where average salary is explained by previous periods' salaries; predict January 2019
    d) Which model provides the best estimate for January 2019?
  JMP Guide: scatterplot; regression line; log(y) regression line; checking for improvement in distribution (log-level model); normal distribution and quantile plot; parameter estimates; checking variable distribution (uniform/normal); residual plot; residual analysis with multiple variables; saving residuals to the dataset; confidence intervals; quadratic terms; prediction interval; multicollinearity; one-way ANOVA; two-way ANOVA; logistic regression; chi-squared; forecasting

Exam prep notes:

1) Theoretical contributions surrounding the JMP calculations should only be included if you have extra time.

Definitions

SRS: Simple random sampling.

Independence:

Trustworthiness:

Variance homogeneity:

Difference between the t-test and the F-test: the t-test is used to compare the means of two populations, whereas the F-test is used to compare two population variances.

Residuals: The distance between the observations and the linear regression line. If all residuals were 0, every observation would fall exactly on the regression line (which is not a realistic scenario).

Prediction interval: What happens if the dataset receives one more observation?
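The t-test vs. F-test distinction above can be sketched in Python. The two samples below are simulated and purely illustrative (not the exam data):

```python
import numpy as np
from scipy import stats

# Two hypothetical salary samples (illustrative only)
rng = np.random.default_rng(0)
a = rng.normal(loc=45_000, scale=5_000, size=50)
b = rng.normal(loc=47_000, scale=5_000, size=50)

# t-test: compares the MEANS of the two populations
t_obs, t_p = stats.ttest_ind(a, b)

# F-test: compares the VARIANCES (ratio of sample variances)
f_obs = np.var(a, ddof=1) / np.var(b, ddof=1)
df1, df2 = len(a) - 1, len(b) - 1
f_p = 2 * min(stats.f.cdf(f_obs, df1, df2), stats.f.sf(f_obs, df1, df2))

print(t_p, f_p)  # small t_p -> means differ; small f_p -> variances differ
```
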
Exam summary:

Y = X + X + X  =>  Dependent variable = independent variable + independent variable + independent variable
______________________________________________________________________________
Variables:

Nominal variable: Categorical variable without rank order (e.g. sex) (0-1)
Ordinal variable: Categorical variable with a clear rank order (e.g. first to third place) (1-3)
Continuous variable: A variable where it makes sense to take the mean (average). Typically decimal numbers (e.g. weight or height).

Exception: Sex (gender) is a categorical variable, but coded as a 0/1 dummy it can also be entered in a model as if it were numeric, so it can function as both a categorical and a "continuous" variable.
____________________________________________________________________________________
Syllabus-relevant tests:

Goodness of fit test / representativeness test
Categorical = % or proportions. For example, we test whether the distribution is 50/50, or some stated set of proportions, e.g. group 1 is 10%, group 2 is 30%, and so on.

Independence test:
Categorical variable = categorical variable
For example: opinion about the environment (1-5) vs. whether you have a MacBook (0-1).

Logistic regression
Categorical (binary) variable = interval + ... + interval (minimum 1 x-variable; with no x-variables it is a goodness of fit test instead)
Example: whether you continue directly onto an economics master's degree (0-1) = age + sex + weight

ANOVA
Interval = categorical → one-way ANOVA
Interval = categorical + categorical → two-way ANOVA

Two-factor ANOVA
Interval variable = categorical + categorical + interaction → two-factor ANOVA

Regression:
Continuous = continuous → simple linear regression
Continuous = continuous + continuous → multiple regression
Continuous = time → forecasting / autoregressive model
______________________________________________________________________________

Assignment walkthroughs:

Introduction to the data:

The dataset "Arbejdsliv" contains information on 100 randomly selected employees from the company Sleepy, a manufacturer of beds and bed accessories. Information on salary, age, machine, promotion, sex, marital status, and average salary was collected from contracts and other company records, whereas experience and satisfaction were collected through an employee questionnaire.
The variables in the dataset to be used in this assignment are:

A: ID number (i = 1 to 100)
B: Løn (the employee's salary in DKK)
C: Alder (the employee's age in years)
D: Erfaring (the employee's number of years of experience in the industry)
E: Maskine (the machine the employee works on, 1 = machine 1..3)
F: Kvinde (dummy indicating whether the person is a woman)
G: Uddannelseskategori (education category: 1 = none, 2 = primary school, 3 = short higher education, 4 = long higher education)
H: Gift (dummy indicating whether the employee is married)
I: Forfremmet (dummy indicating whether a promotion has been given)
J: Tilfreds (dummy indicating whether the employee is satisfied with their job)

Assignment 1: Simple linear regression

a) Describe the relationship between salary and age in a scatter plot and assess whether the assumptions for linear regression are met. Discuss how the model could be reformulated to avoid any assumption-related issues.

Typically, the older you are, the more you earn, until you reach retirement age, after which income decreases again. Whether this pattern shows up depends on the sample: if the sample only includes people up to 25 years of age, for example, you will not see income decreasing around age 60. The scatterplot shows a linear trend between age and income, with no decreasing trend visible yet; with a dataset covering higher ages we might see the expected downturn (in which case the relationship could be described by a polynomial).

Assumptions (base assumptions for the sample)

Simple random sample (SRS): 100 randomly chosen workers were picked for this sample, as described in the introduction to the data above. This assumption is therefore met.
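Outside JMP, the fitted line from the scatter plot in a) can be sketched in a few lines. The Alder/Løn values below are simulated stand-ins for the dataset columns, purely illustrative:

```python
import numpy as np

# Simulated stand-in for the Alder and Løn columns of "Arbejdsliv"
# (the real analysis runs on the dataset itself; numbers are illustrative)
rng = np.random.default_rng(1)
alder = rng.integers(29, 64, size=100)
loen = 8000 + 850 * alder + rng.normal(0, 7000, size=100)

# Fit the line løn = b0 + b1 * alder, as JMP's "Fit Line" does
b1, b0 = np.polyfit(alder, loen, 1)
print(b0, b1)
```
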
Independence: "Have the respondents affected each other?" The introduction to the data explicitly states that the variables age and salary are taken from registration systems, so the observations could not have influenced each other.

Trustworthiness: The data comes from a reliable source (registration systems), so there is no reason to believe the data is untrustworthy.

Assumptions specifically concerning linear regression (guideline)

1. There needs to be linearity between x and y.

From the graph we can see clear linearity between salary and age, with no immediate signs that a transformation is needed. There is a positive correlation between the two variables, as expected: the older you are, the more money you earn. This also matches the real world, where we would expect a 45-year-old to earn more than a 25-year-old.

2. Examine whether the x variable has a uniform distribution.

From the distribution above, the minimum age in the sample is 29 and the maximum is 63. This matters because the sample does not include people at retirement age; the proportion of observations aged 60+ is correspondingly low. The distribution also shows that the model describes observations around age ≈ 45 best. The x-variable is not perfectly uniform, but the middle observations are roughly uniformly distributed. There are very few observations aged 29-31 and 60+, so the model will describe these people less well. It is better at explaining observations aged 32-56, since ≈ 80% of the observations fall within that range.

3.
Examine whether the y variable is normally distributed

The graph shows the income distribution. Income is not normally distributed — income rarely is. The majority of the observations earn roughly 45,000, while a few observations earn much more than the average person, making the distribution right-skewed. We have therefore identified an assumption violation, which makes the model less valid: the model will describe the observations less well than it would if the assumption held. Later in the assignment, we show how this problem can be fixed.

→ Residual analysis:

1. Residuals are normally distributed (make a residual plot and discuss)

The residuals are somewhat normally distributed, though slightly right-skewed — which follows from income being right-skewed. A residual is the distance from an observation to the regression line.

2. The expected value of the residuals is 0 (the average of the residuals is 0).

Assuming the residuals are roughly normally distributed and the regression is a good linear fit with no indication that a transformation is needed (for example adding quadratic/polynomial terms), it is fair to assume that the expected value of the residuals is 0.

3. Homoscedasticity (the variance of the residuals is constant; look for a "trumpet" shape in the residual plot)

Homoscedasticity means constant variance in the residuals; heteroscedasticity means the variances are not constant. The plot shows that the variances cannot be said to be constant: the residuals get bigger and bigger the higher the income.
This is backed up by the "trumpet" shape of the residuals: the assumption of homoscedasticity does not hold — this is a case of heteroscedasticity. Since the assumption is broken, the overall reliability of the model is reduced compared to a model where no assumption is violated, so the conclusions drawn from this model are of lower quality.

4. Independence between residuals

To check for independence between the residuals, we simply look at the residual plot above for any patterns in the observations. The residuals look randomly scattered, so we take the residuals to be independent.

Reformulation of the model to eliminate the assumption problems:

Model        Equation                   X increases   Y increases       Help
Level-level  y = b0 + b1*x              1 unit        b1 units          N/a
Log-level    log(y) = b0 + b1*x         1 unit        approx. b1*100%   Can help IF y is NOT normally distributed (the only relevant one in an exam)
Level-log    y = b0 + b1*log(x)         1%            b1/100 units      Not usually used
Log-log      log(y) = b0 + b1*log(x)    1%            b1%               Not usually used

The following problems arose with our assumptions:
1) The y-variable is not normally distributed (right-skewed)
2) The variance of the residuals is not constant (heteroscedasticity)

To eliminate these problems, the recommended model is ŷ = b0 + b1*x reformulated into log(ŷ) = b0 + b1*x.

Here we evaluate whether using the logged value of y resolves the assumption problems, re-examining the assumptions that were breached. Overall, the model is usable, although there is not a big difference in the usability of the model. Furthermore, the correlation between log(salary) and age is strong. The correlation between the two variables is positive.
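A minimal sketch of the log-level reformulation, on simulated right-skewed salaries (illustrative stand-ins, not the exam data):

```python
import numpy as np
from scipy import stats

# Simulated right-skewed salaries as a stand-in for the Løn column
rng = np.random.default_rng(2)
alder = rng.integers(29, 64, size=100)
loen = np.exp(10 + 0.02 * alder + rng.normal(0, 0.4, size=100))

# Level-level fit: løn = b0 + b1*alder (1-unit x increase -> b1-unit y increase)
b1_lvl, b0_lvl = np.polyfit(alder, loen, 1)

# Log-level fit: log(løn) = b0 + b1*alder
# (1-unit x increase -> approx. b1 * 100 % change in y)
b1_log, b0_log = np.polyfit(alder, np.log(loen), 1)

# The log transform pulls in the right tail of the response
print(stats.skew(loen), stats.skew(np.log(loen)))
```
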
We now examine whether the violated assumption about the normal distribution of salary has improved: from the histogram, the distribution is a lot smoother with the log-level model than before — it is now almost normally distributed. This is supported by the normal quantile plot above the histogram, where the majority of the observations fall within the expected limits. The boxplot likewise suggests the distribution is now roughly normal.

b) Based on this analysis, develop a relevant linear regression model and explain which variable is the dependent variable.

(From the question, we can choose either the level-level model or the log-level model, since nothing is explicitly stated in the assignment. For ease of interpretation, we continue with the level-level (original) model.)

Model formulation:

y = β0 + β1 · x1 + ε  →  the "true" model.

The true model cannot be estimated, because it includes an error term that we cannot observe. We are therefore limited to the estimated model:

ŷ = b0 + b1 · x1  →  the estimated model.

Inserting our variables gives:

Løn̂ = b0 + b1 · Alder

From here we examine whether salary can be explained by age. Since we are working with a linear regression, the next section tests whether there is in fact a linear relation.

c) Test whether age has a significant effect on salary (test the model).

To test whether age has a significant effect on salary, we set up the following hypotheses:

H0: β1 = 0  (the slope of the regression line is 0)
H1: β1 ≠ 0

These hypotheses test whether there is a linear relation between x and y. Alpha is set to 0.05, because nothing else is stated in the assignment.
The results of the test are as follows: from the parameter estimates, the p-value for age is < 0.0001*. This is below our significance level (0.05), so we reject the null hypothesis in favor of H1. In other words, age has a significant (linear) relationship with salary.

With calculations (only if explicitly asked for in the assignment, or there is extra time):

t_obs = (b1 − 0) / s_b1

Inserting the values from the parameter estimates:

t_obs = (845.45 − 0) / 76.22 = 11.09

The critical value is calculated:

t_{n−k−1; α/2} = t_{100−1−1; 0.025} = t_{98; 0.025} = 1.984

Is t_obs within the critical limits? Our observed value of 11.09 is far larger than the critical value of 1.984, so we reject the null hypothesis in favor of H1. We conclude that there is a linear relationship, since the slope is not 0.

d) Interpret the results of the model:

From the linear regression, we have the model

Løn̂ = b0 + b1 · Alder

Inserting the values from the parameter estimates gives:

Løn̂ = 8182.75 + 845.45 · Alder

Each time age increases by 1, salary increases by 845.45. This seems reasonable, since you earn more the older you are. Note that the intercept (b0) is 8182.75, meaning that at age 0 the predicted salary is 8182.75. This makes no practical sense — a 0-year-old rarely earns money — and the model only encompasses observations aged 29-63, so b0 has no explanatory power on its own. Instead we can compute the predicted salary of a 29-year-old, the youngest observation in the sample:

Løn_29 = 8182.75 + 845.45 · 29 = 32,700

This makes much more sense and is more useful for the interpreter; it also seems fair that a 29-year-old earns about 32,700. It is important to emphasize, however, that the model has two large assumption violations.
One is the distribution of the y-variable, and the other is the lack of variance homogeneity (heteroscedasticity). The results above are therefore based on a model that does not fulfill the expected assumptions, which means they should not form the basis of any major decisions.

e) Estimate a 95% confidence interval for the effect that age has on salary.

The confidence interval is set up as:

b ± t_{n−k−1; α/2} · s_b

In this case we are asked to set up a confidence interval for the slope coefficient of age, b1:

b1 ± t_{n−k−1; α/2} · s_b1

The assumptions for this confidence interval are the same as those covered in the questions above.

845.45 ± t_{100−1−1; 0.025} · 76.22

Working this out:

β1 ∈ (694.2; 996.7)

With 95% confidence, the slope coefficient of age on income is between 694 and 997. Since 0 is not included in this interval, this further indicates that age is a significant factor in explaining wages.

In JMP: Analyze → Fit Model → income in Y, x in Construct Model Effects → Run → red triangle → Regression Reports → Show All Confidence Intervals → scroll to the bottom.

Assignment 2: Transformations in linear regression + PI + CI

a) Add the quadratic terms for age to the model from the previous task and assess whether they should be included.

Based on the previous task, adding the quadratic term is not expected to improve the model: it does not provide a better link between income and age, because the dataset only covers ages 29-63. What we do instead is:

Model formulation:

ŷ = b0 + b1 · Alder + b2 · Alder²

Using this equation, we can examine whether the quadratic term benefits the conclusion.
(It is recommended to use Fit Model when working with multiple quadratic terms.)

To test whether the quadratic term contributes, we form the following hypotheses for each coefficient:

H0: βj = 0
H1: βj ≠ 0

We are testing whether the slope coefficients equal 0, with the help of the parameter estimates. According to the parameter estimates, the quadratic model is not significant, so the model should not be extended with a polynomial term: there is no correlation between income and age squared. The p-values for both age and age² are above our alpha level (5%), so we fail to reject H0 — an indication that the slope coefficients are 0.

Calculating t_obs:

t_obs = (b1 − 0) / s_b1 = (525.26 − 0) / 775.32 = 0.68
t_obs = (b2 − 0) / s_b2 = (3.60 − 0) / 8.60 = 0.42

The critical limit is known from the previous assignment:

t_{n−k−1; α/2} = t_{100−2−1; 0.025} = 1.985

Both observed values fall within the critical limits, so we fail to reject H0. This supports the conclusion from the JMP output: we do not deem it necessary to include the quadratic term.

b) Calculate the expected salary and a corresponding 95% confidence interval for a 30-year-old employee in the model you find appropriate. Also, determine a prediction interval.

The model that fits best is the original model without the quadratic terms:

Løn̂ = b0 + b1 · Alder

First we find the prediction interval:

ŷ ± t_{n−2; α/2} · s_e · √(1 + 1/n + (x − x̄)² / ((n−1) · s_x²))

JMP is used to calculate this: Analyze → Fit Model → add y → add x → Run → red triangle → Save Columns → Indiv Confidence Limit Formula → adds prediction limits to the dataset for each value.

From the output, we can set up the following interval:

95% prediction interval = [20,085; 47,007]

This means that if the company hires a NEW employee who is 30 years old, he/she would with 95% certainty have a salary between 20,085 and 47,007 kr.

Prediction interval: What happens if the dataset gets one more observation?
Confidence interval: Set up for a current employee who is 30 years old.

The confidence interval is:

ŷ ± t_{n−2; α/2} · s_e · √(1/n + (x − x̄)² / ((n−1) · s_x²))

Use JMP to calculate it: red triangle → Save Columns → Mean Confidence Limit Formula.

From the output we can set up the following confidence interval:

95% CI = [31,009; 36,084]
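The PI and CI formulas above can also be computed directly. A sketch on simulated stand-in data (so the intervals differ from the JMP numbers quoted above):

```python
import numpy as np
from scipy import stats

# Illustrative stand-ins for Alder (x) and Løn (y); not the exam data
rng = np.random.default_rng(3)
x = rng.integers(29, 64, size=100).astype(float)
y = 8000 + 850 * x + rng.normal(0, 7000, size=100)

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s_e = np.sqrt(np.sum(resid**2) / (n - 2))   # residual standard error
t_crit = stats.t.ppf(0.975, df=n - 2)       # t_{n-2; 0.025}

x0 = 30.0                                   # the 30-year-old employee
y_hat = b0 + b1 * x0
core = (x0 - x.mean())**2 / ((n - 1) * x.var(ddof=1))

ci = t_crit * s_e * np.sqrt(1/n + core)     # confidence-interval half-width
pi = t_crit * s_e * np.sqrt(1 + 1/n + core) # prediction-interval half-width

print(f"CI: [{y_hat-ci:.0f}; {y_hat+ci:.0f}]  PI: [{y_hat-pi:.0f}; {y_hat+pi:.0f}]")
```

The prediction interval is always wider than the confidence interval: the extra 1 under the square root accounts for the variance of a single new observation, not just the uncertainty of the fitted mean.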
Predictions interval: What happens if the data set gets one more observation? Confidence interval: Confidence interval is set up for a current employee who is 30 years old. The following confidence interval is set up. ̂ ± 𝑦𝑦−2:𝑦/2 ∗ 𝑦𝑦 ∗ √1/𝑦 + 𝑦 𝑦 − 𝑦)^2 (𝑦 − 1) ∗ 𝑦2𝑦 USe jmp to calculate the above confidence interval. (Red triangle→ save columns→ mean confidence limit formula) From the above we can set up the following confidence interval: 95% 𝑦𝑦 = [31.009; 36.084] 23 c) In the two variables Sam_løn and Sam_alder, a random selection of Danes is used to represent the salaries and ages of the population. Create the model �� = ��0 + ��1 ⋅ �� + ��2 ⋅ ��2 and determine at what age the maximum point (top point) is reached based on this data. 𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦 = 𝑦0 + 𝑦1 ∗ 𝑦𝑦𝑦𝑦𝑦𝑦 + 𝑦2 ∗ 𝑦𝑦𝑦^2 𝑦𝑦𝑦𝑦𝑦 A polynomial as above can also be reformulated into: 𝑦 = 𝑦𝑦2 + 𝑦𝑦 + 𝑦𝑦 This is used when we have to calculate our top point. In a section further below we will show how to calculate the top point for a centered and an uncentered polynomial. −𝑦1 2 ∗ 𝑦2 𝑦1 =𝑦− 2 ∗ 𝑦2 𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦 (𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦) = 𝑦𝑦𝑦𝑦 = 𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦 (𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦) = 𝑦𝑦𝑦𝑦 First we examining a uncentered polynomial: In JMP: Analyze → Fit model → Add y variable → add x variable → right click x variable → transform → square → add squared x variable → run From the output above, we utilize the values to calculate the top point: 𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦 (𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦) = 𝑦𝑦𝑦𝑦 = −4816,2277 = 52,5611 2 ∗ −45,81 From the above it can be see that the top point is 52.56 which approximates to 53 years old for when one reaches the largest income possible (for data set). Income for this person can be calculated by: 𝑦𝑦𝑦𝑦ø𝑦 = 𝑦0 + 𝑦1 ∗ 𝑦𝑦𝑦𝑦𝑦𝑦 + 𝑦2 ∗ 𝑦𝑦𝑦2𝑦𝑦𝑦 𝑦𝑦𝑦𝑦ø𝑦 = −73739,28 + 4816,2277 ∗ 52,56 − 45,81 ∗ 52.56^2 = 52.837 It is seen by the above that if you calculate the precise income then the value of 52837 is realistic for a person of 53 years of age. 
Examining the centered polynomial

Top point (centered polynomial): x_top = x̄ − b1 / (2 · b2)

In JMP: Analyze → Fit Y by X → add y variable → add x variable → red triangle → Fit Polynomial → quadratic.

Using JMP we get the centered polynomial. From the plot we can rough-estimate the top point to be around 50, so our earlier calculation of 52.56 appears correct and appropriate. We can now calculate the top point for the centered polynomial (it should be the same):

x_top (centered) = x̄ − b1 / (2 · b2)

Using the values from the parameter estimates:

x_top (centered) = 46.47 − 558.13 / (2 · (−45.81549)) = 52.56

Assignment 3 - Multiple linear regression

a) Estimate the model below and briefly assess the assumptions for the analysis, including whether there might be an issue with multicollinearity.

Løn = β0 + β1 · Alder + β2 · Erfaring + β3 · Kvinde + β4 · Uddannelse + ε

The model above is the true model. We then do the analysis on the estimated model:

Løn̂ = b0 + b1 · Alder + b2 · Erfaring + b3 · Kvinde + b4 · Uddannelse

In this section we evaluate whether there are problems with multicollinearity; the assumptions for multiple regression are also discussed. We start by testing for multicollinearity.

From the correlation matrix, we can read the pairwise correlations between our variables. A correlation lies between −1 and 1, where 1 is a perfect positive correlation (when one variable increases, the other increases perfectly in step) and −1 is a perfect negative correlation (the opposite). From the x-variables, we can see that all of them (excluding sex) have a strong correlation with Løn, since they all have correlation values greater than 0.60.
This was as expected, as age, experience, and education all plausibly explain salary: in reality, salary does depend on age, experience, and length of education.

Note: it is good if the x-variables have a large correlation with the y-variable.
Note: it is BAD if the x-variables have a large internal correlation. That means the x-variables are good at explaining each other, which is not what we want — we want them to explain the y-variable. Internal correlation between x-variables creates "noise" in the analysis.

Between age and experience there is a high correlation of 0.85. These variables describe each other very well, which can mean that one of them is not needed in the regression analysis. The correlation between experience and education is also relatively large, which is somewhat counterintuitive: with a long education you would expect less experience, since time spent on education is time not spent working and gaining experience. The Kvinde variable has little to no correlation with the other x-variables, but it also explains little of the income variable (the y-variable).

Assumptions for the model:

X variables should be uniformly distributed

Age: Age was already checked for uniform distribution in the previous question (assumptions for linear regression). There we found that age is not perfectly uniformly distributed and ranges from 29 to 63.

Experience: Experience does not follow a uniform distribution; there are many observations with 5-10 years of experience. This means the model will be skewed towards explaining those observations.

Women: Sex can take a uniform distribution.

Education: Education looks reasonably uniformly distributed.
There are a few more observations with lower education than with high, but since the difference is slight, it is assumed not to have a significant effect on the results.

The y-variable needs to be normally distributed: covered in the previous question, so please refer back to that.

Residual analysis:

1) Homoscedasticity: From the residual plot, we can evaluate whether the variation in the residuals is constant. There is no immediate "trumpet" shape or heteroscedasticity; the variation lies between −10,000 and +10,000, which seems fine.

2) Normally distributed residuals: From the distribution, the residuals are almost normally distributed, with the exception of a single observation that can be classified as an outlier (the black dot in the box plot). This is good for our test, as we want normally distributed residuals.

3) Independent residuals: There is no clear pattern in the residuals, so we assume the residuals are independent.

4) Expected value of residuals = 0: Given the (near) normal distribution of the residuals, we can conclude that the expected value is 0. To be exact it is 3.383e-12 — not exactly 0, but close enough.

Estimating the model:

The model is now estimated in JMP: Analyze → Fit Model → add y variable → add x variables → Run.

In multiple regression, we first examine whether the "whole model test" is significant, i.e. we test the model as a whole (including all variables):

H0: β1 = β2 = β3 = ... = βk = 0
H1: at least one βj ≠ 0

This tests whether the entire model is insignificant, or whether there is at least one significant variable.
The assumptions were gone through in sub-question a). For the hypothesis test above we observe the following.

Theoretical note on what JMP calculates (only include if you have time):

Fobs = MS(model) / MS(error) = 114,76

The observed F of 114,76 is compared to the critical limit:

F(k; n-k-1; α) = F(4; 100-4-1; 0,05) = 2,467

where k = number of x-variables, n = sample size and α = significance level.

Fobs falls outside the critical limit, so we reject H0 in favor of H1: at least one slope coefficient is significant. We now examine which variable(s) are significant.

From the parameter estimates we can see that age and gender are not significant, i.e. "age" and "woman" cannot be used to explain the development in salary. In assignment 1 we found that age WAS significant, so the two conclusions contrast. The explanation lies in the multicollinearity analysis: age and experience were correlated with 0,85, so when income is modelled with age and experience at the same time, experience already explains everything age could explain, plus a little more. The variable "age" therefore becomes redundant and insignificant.

An example of how JMP reaches its conclusions (for understanding only), using the "woman" coefficient:

tobs = (bj - 0) / SE(bj) = (-1097,50 - 0) / 847,35 = -1,30
critical limit: t(n-k-1; α/2) = 1,985 (from a previous assignment)

where k = number of x-variables, n = sample size and α = significance level.

Since |tobs| falls inside the critical limit, we keep H0: the coefficient is insignificant, meaning "woman" cannot explain salary.
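As a quick cross-check outside JMP, the t-test arithmetic for a single coefficient can be sketched in Python. The estimate (-1097,50) and standard error (847,35) for the "woman" dummy are taken from the parameter-estimates table above; this is illustration only, not part of the JMP workflow.

```python
def t_statistic(estimate, std_error, hypothesized=0.0):
    """t_obs = (b - b0) / SE(b), the same ratio JMP reports per coefficient."""
    return (estimate - hypothesized) / std_error

t_obs = t_statistic(-1097.50, 847.35)   # roughly -1.30
t_crit = 1.985                          # t(n-k-1; alpha/2) = t(95; 0.025), from the text
significant = abs(t_obs) > t_crit       # False -> keep H0, "woman" is insignificant
```

Because |t_obs| stays inside the critical limit, the conclusion matches the JMP output: the coefficient for "woman" is insignificant.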
b) Argue for the actual (reduced) model.

Not all variables in the model are significant, so we remove the insignificant variables and reduce the model until only significant variables remain. We removed age first (highest p-value) and then woman.

Above we see the results of the full significance test. First, the "whole model test" (Analysis of Variance) is significant. Second, all remaining x-variables are individually significant (parameter estimates). We can therefore conclude that salary can be explained by experience and education. The model has a degree of explanation of 0,82, meaning experience and education explain 82% of the variation in salary, which is fairly high. Note that with more than one x-variable we use R² adjusted. It is important to note the assumption violation from assignment 1 (salary, the y-variable, was not normally distributed); the conclusion above rests on that violation.

We can then set up the following linear regression model:

løn-hat = 28.851 + 912,82 * experience + 1.373,96 * education

When experience and education are both 0, the predicted income is 28.851; this seems plausible for an uneducated individual without experience (at age 29). Each extra year of experience increases salary by 912,82 kr., and each extra year of education increases it by 1.373,96 kr.

c) Compare the reduced model with the original model from 4.a. Which one is preferable?

d) In the final reduced model, please investigate if there are any interaction effects that should be included.
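The reduced model can be sketched as a small Python function (coefficients taken from the fitted model above; a cross-check for understanding, not a substitute for the JMP output):

```python
def predict_salary(experience, education):
    # Reduced model from the notes: løn-hat = 28.851 + 912,82*experience + 1.373,96*education
    return 28851 + 912.82 * experience + 1373.96 * education

base = predict_salary(0, 0)            # intercept only: 28.851 kr.
example = predict_salary(10, 5)        # 10 years of experience, 5 years of education
```

Each extra year of experience adds 912,82 kr. and each extra year of education adds 1.373,96 kr., exactly as read off the parameter estimates.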
Assignment 4: Variance analyses (one-way ANOVA / 1-factor ANOVA)

a) Calculate the mean and variance of the salary for employees on each of the 3 machines and test for homogeneity of variances for the three samples, specifying the assumptions for the test in question b precisely.

We start by examining the distributions. In JMP: Analyze → Distribution → Add y variable → Add x variable in "By" → OK

For the first group (Løn where Maskine = 1) the distribution is not normal but left-skewed, since the majority of the observations fall above the middle (50.000). For machine 2 the mean is 42.459 and the distribution is relatively normal. For machine 3 the mean is 40.026 and the distribution is right-skewed.

The above gives us an indication that workers on machine 1 earn more than those on the other two machines. This is examined through a one-way ANOVA.

Assumptions for one-way ANOVA:
Refer back to assignment 1, where trustworthiness, SRS and independence have already been discussed. The groups also need to be approximately equally large: machine 1 has 35 observations, machine 2 has 33 and machine 3 has 32, which is approximately even.

We now test for variance homogeneity (equal variances) within the groups, using Hartley's F-test (Fmax test):

H0: equal variances
H1: unequal variances

Fobs = s²max / s²min

Since we divide the largest variance by the smallest, the result is always above 1, so we only need to check the upper critical limit. We start by calculating the test statistic:

Fobs = 9344² / 6888² = 1,84

The critical limit is F(n1-1; n2-1; α*/2), where n1 is the sample size of the group with the largest standard deviation and n2 the sample size of the group with the smallest.
Alpha-star is calculated below:

α* = 2α / (k(k-1)) = (2 * 0,05) / (3(3-1)) = 0,01667  =>  α*/2 = 0,00833

where k = number of groups.

F(35-1; 32-1; 0,00833) = 2,381

Fobs = 1,84 and the critical limit is 2,381, so the observed value falls within the critical limit and we fail to reject H0. The variances are equal, and the assumption is fulfilled.

The significance level is corrected according to the Bonferroni principle: running more tests increases the risk of making a type 1 error, which is why the significance level per test is lowered. This correction is typically used in Hartley's F-test and in simultaneous confidence intervals.

b) Conduct a relevant test to determine if the salaries are the same for employees working on different machines.

One-way ANOVA:

H0: μ1 = μ2 = μ3
H1: at least 2 means are different

This can also be written as a regression model: y = μ + τi + εi, where H0: τi = 0.

The assumptions have already been covered in previous assignments, so we only conduct the test. The calculations JMP performs (for understanding): the test statistic belonging to the hypotheses is

Fobs = MSB / MSW = 29,1 (with MSB = 63.198.039 from the JMP output)

This is tested against the critical limit F(k-1; n-k; α) = F(2; 97; 0,05) ≈ 3,09.

Our observed F is far larger than the critical limit, so we can conclude that at LEAST one of the machine groups has a different mean than the others; in other words, at least one machine has a different salary. The analysis of variance shows the same conclusion: the significant p-value indicates that at least one group mean differs. We reject H0 in favor of H1.

We now examine WHERE the difference lies, i.e. which mean differs from the others. From the plot we can see that machine 1 has a significantly higher salary than the other machines.
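The Bonferroni correction and Hartley test statistic from question a) can be sketched in Python (standard deviations 9344 and 6888 taken from the notes; purely a cross-check of the hand calculation):

```python
def bonferroni_alpha_star(alpha, k):
    # alpha* = 2*alpha / (k*(k-1)); the critical F then uses alpha*/2
    return 2 * alpha / (k * (k - 1))

def hartley_f_max(s_max, s_min):
    # Hartley's Fmax: ratio of the largest to the smallest group variance
    return s_max ** 2 / s_min ** 2

a_star = bonferroni_alpha_star(0.05, 3)   # 0.01667, so alpha*/2 = 0.00833
f_obs = hartley_f_max(9344, 6888)         # roughly 1.84
```

Since 1,84 is below the critical limit of 2,381, the equal-variance assumption holds, matching the conclusion above.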
Furthermore, it is a little hard to tell whether machines 2 and 3 share the same mean or are significantly different, so we use the following outputs. From the LSMeans table: machine 1 has a mean income of 53.809, machine 2 has a mean of 42.459 and machine 3 has a mean of 40.026. The question is whether machines 2 and 3 are significantly different or can statistically be viewed as equal. In the ANOVA model: machine red triangle → LSMeans Student's t-test.

From the table we can see the following: the difference between machines 2 and 1 is 11.350 and significant, as expected. The difference between machines 3 and 1 is 13.783 and also significant. The difference between machines 2 and 3 is 2.432, which is NOT significant. The salaries of the employees working machines 2 and 3 can therefore statistically be viewed as equal, and the conclusion is that the workers on machine 1 earn significantly more than the others.

Assignment 5: Variance analyses (Two-way/two-factor ANOVA)

a) Formulate a Two-factor ANOVA model and test whether there is an interaction between gender and marital status when explaining salary.

Alternative formulation: Estimate the following model (possible exam question format):

løn-hat = β0 + β1 * Civilstatus + β2 * Kvinde + β3 * EG + ε

where EG is the interaction variable between sex and civil status. To test whether salary differs by civil status and gender, we can set up the model:

y_ijk = μ + α_i + β_j + γ_ij + ε_ijk

where gamma (γ_ij) is the interaction between gender and civil status. The following assumptions are tied to the analysis:

1.
Normally distributed populations, or n > 30 (look at the distributions)

With the aid of JMP we can examine whether this assumption is met. (In the exam, if you have many graphs, merely screenshot the "ugly" ones for discussion.) Note that with this many distributions, I have chosen to discuss the ones furthest from the assumption of normally distributed populations.

The first distribution, for unmarried women, is not normally distributed. There is also a gap between 45.000 and 50.000, which in principle means the model cannot explain anything for an unmarried woman earning between 45.000 and 50.000. It gets even worse in the next distribution, for married women, where we observe multiple gaps around 60.000 and 70.000. The assumption of normally distributed populations is therefore not met, which will weaken any future conclusion.

2. Even population sizes (approximately)

The groups are approximately equal: n1 = 23, n2 = 26, n3 = 24, n4 = 27. The assumption is met.

3. Hartley's F-test - variance homogeneity between the groups

We now test whether the variance between the groups is equal:

H0: equal variances
H1: unequal variances

Fobs = s²max / s²min = 10291² / 8980² = 1,313

Calculating the critical limit:

F(n1-1; n2-1; α*/2) = F(27-1; 24-1; 0,00417) = 3,0678 (from the Excel template)

Alpha-star calculation:

α* = 2α / (k(k-1)) = (2 * 0,05) / (4(4-1)) = 0,00833  =>  α*/2 = 0,00417

The observed value falls within the critical limit, so the variances can be assumed equal. This is positive for the test, and the assumption of variance homogeneity is met.
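The same Bonferroni-corrected Hartley check, now with k = 4 groups and the standard deviations 10291 and 8980 from the notes, can be sketched as:

```python
# Hartley's Fmax check for the two-factor ANOVA groups (k = 4).
alpha_star = 2 * 0.05 / (4 * (4 - 1))   # 0.00833; the critical F uses alpha_star / 2 = 0.00417
f_obs = 10291 ** 2 / 8980 ** 2          # ratio of largest to smallest group variance, ~1.313
f_crit = 3.0678                          # F(26; 23; 0.00417), taken from the Excel template
equal_variances = f_obs < f_crit         # True -> keep H0, variances may be treated as equal
```

Note how the correction shrinks per-test alpha as the number of groups grows: with k = 3 it was 0,00833, with k = 4 it is 0,00417.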
The test can now be formulated (2-factor ANOVA model), with the following hypotheses:

H0: no effect of the interaction      H1: effect of the interaction
H0: no effect of gender               H1: effect of gender
H0: no effect of civil status         H1: effect of civil status

NOTE: We ALWAYS examine the interaction first, and if possible the interaction is the first term to be reduced. If the interaction between the variables is significant, we cannot reduce the model further; if the interaction term is insignificant, it must be removed first.

We use JMP to examine this. From the parameter estimates we can see that the interaction term is not significant, and neither is "kvinde" (gender). Since the interaction term must always be reduced first, we examine it before removing it:

The graph shows that the lines are reasonably parallel, which means there is no interaction between marital status and gender: the effect of being married on salary is the same for men and women. It is not the case that you earn more as a single man than as a single woman, or more as a single man than as a married woman. The pattern that IS apparent: if you are married, you earn more than if you are single, regardless of gender, and correspondingly singles earn less than the married.

On account of the above, we can now reduce (remove) the interaction term: take it out of the model and run again.
By running the model we get the following results: gender is insignificant, meaning that no matter your gender, you earn the same. This is in accordance with the previous questions, where we concluded there was no linear correlation between gender and income.

The model is reduced until only significant terms remain, so the next step is to take gender out of the model. Looking at the new (and reduced) parameter estimates, the model is now significant, and being married has a significant effect on salary.

From the table to the left we can clearly see the effect: for civilstatus = 0 (not married) the mean salary is 42.352, and for civilstatus = 1 (married) it is 48.580. There is a large difference between the two; you make more money if you are married. This makes sense, since most people marry later in life, so the married employees are often also older, and in an earlier question we found that age was significant in explaining salary. Married people are often older, and older people earn more, so this correlation is expected.

To conclude, the following model is formulated:

y-hat = β0 + β1 * Civilstatus

The above model is significant in relation to explaining salary.

b) If an interaction is concluded, it should be commented upon. If it is concluded that there is no interaction, the model should be reduced and commented upon.

Question b) has been answered throughout sub-question a), so please refer back to a) for the conclusion on interaction and the reduced model.
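The fitted means from the final reduced model can be recapped in a couple of lines (values taken from the LSMeans output above):

```python
# Fitted group means from the reduced model with civil status only.
mean_unmarried = 42352
mean_married = 48580
marriage_premium = mean_married - mean_unmarried   # salary gap in kr. per month
```

The gap of 6.228 kr. is the β1 effect the reduced model attributes to being married.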
Assignment 6: Logistic regression (y-variable is categorical, x-variables are continuous)

The following variables can be considered as explanatory variables in the chosen model: salary, age, experience, gender, machine, and marital status. No transformed forms and/or interaction effects should be used.

a) Develop a model that can explain whether an employee is satisfied or not.

Model formulation: we set up an estimated model and refer back to the previous assignment for explanation:

satisfaction-hat = β0 + β1 * salary + β2 * age + β3 * experience + β4 * gender + β5 * machine + β6 * civilstatus

We examine whether we can predict the likelihood of being satisfied based on the variables above. The dependent variable (satisfaction) is categorical, so the model is fitted with MLE (Maximum Likelihood Estimation). The model estimates the likelihood of being satisfied.

Assumptions for logistic regression:

1. The dependent variable must be categorical and binary
This assumption is met: "satisfaction" is categorical, and from the distribution we can clearly see exactly two outcomes, as required. The observations are distributed approximately evenly: 45% unsatisfied and 55% satisfied.

2. Approximately the same number of observations in group 0 and group 1
Met, as seen in the distribution above: roughly the same number of observations on either side.

3. There must not be multicollinearity between the x-variables
We want a large correlation between the x-variables and the y-variable, and a low correlation among the x-variables. There is a fairly high correlation between satisfaction and salary. This is as expected: most people work to earn money, so salary is a big factor.
Furthermore, age and experience can also help explain satisfaction. A little more questionable is the negative correlation between machine and satisfaction: the higher the machine number (1-3), the lower the satisfaction. This actually makes sense, given that we concluded in an earlier assignment that machine 1 is the most attractive, since you earn more there than on the other two machines. We can also see that gender (1 = woman, 0 = man) correlates negatively with satisfaction, i.e. women tend to be less satisfied.

Between the x-variables: the relations age vs experience and experience vs income have already been discussed in a previous question, so we focus on the rest. There is a large correlation between machine and income, which links nicely back to the previous question where machine 1 paid the most. There is also a correlation between age and marital status, which makes sense as discussed earlier.

We test the following hypotheses. To start off, we make a whole model test to check whether the model as a whole is significant:

H0: β1 = β2 = ... = 0
H1: at least one βj ≠ 0

We test whether all slope coefficients are 0, or whether at least one differs from 0. The model's assumptions have already been covered in a previous question (refer back to regression). Using JMP: from the whole model test we can see that the p-value is significant, so the model as a whole is significant; there is thus at least one slope coefficient that is NOT equal to 0.

We can now examine which coefficients are ≈ 0 and should be removed. From the parameter estimates we can see a fair number of insignificant slope coefficients; at first glance only gender (and salary) appear significant. The model is reduced until only significant slope coefficients are left.
We remove the slope coefficients one by one, always reducing the most insignificant first: 1) machine, 2) marital status, 3) age, 4) experience. The result is that salary and gender are significant, as their p-values fall below the significance level. Their slope coefficients are therefore not equal to 0, and they help predict the likelihood of being satisfied.

The final model is:

logit(satisfaction-hat) = -6,8935 + 0,00018074 * salary - 1,832 * woman

Alternative model formulation:

P(y = 1 (satisfied)) = exp(-6,893 + 0,00018 * salary - 1,832 * woman) / (1 + exp(-6,893 + 0,00018 * salary - 1,832 * woman))

The reduction has also removed the issues with multicollinearity, so the model can now be said to be relatively strong.

b) Based on the final (reduced) model, interpret and assess the results.

To give the best possible explanation of the model, we use JMP to report the pseudo R², the estimates and the hit rate. From the confusion matrix we can calculate the hit rate:

hit rate = (44 + 35) / 100 = 79%

Thus the model has a prediction accuracy of 79%. We would like the model to predict 25% more outcomes than pure guessing. The guessing benchmark:

P(y = 1) = 55/100 = 55%
P(y = 0) = 45/100 = 45%

If we just guessed (always picking the majority group), we would be right roughly 55% of the time. The benchmark is therefore 55% * (1 + 0,25) = 68,75%. Our model's hit rate of 79% is above 68,75%, so our model is significantly better than guessing; we have a good model.

From the confusion matrix we can also see:

P(correct | y = 1) = 44/55 = 0,80 = 80%
P(correct | y = 0) = 35/45 = 0,78 = 78%

The model is almost as good at predicting satisfaction (y = 1) as it is at predicting dissatisfaction (y = 0).
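The hit-rate arithmetic above can be sketched in Python (confusion-matrix counts 44 and 35 out of n = 100 are taken from the notes; a cross-check, not part of the JMP workflow):

```python
def hit_rate(correct_0, correct_1, n):
    # Overall share of correctly classified observations.
    return (correct_0 + correct_1) / n

overall = hit_rate(35, 44, 100)   # 0.79
benchmark = 0.55 * 1.25           # majority-guess rate plus the required 25% lift
sensitivity = 44 / 55             # correct among the satisfied (y = 1)
specificity = 35 / 45             # correct among the dissatisfied (y = 0)
```

Since 0,79 exceeds the 0,6875 benchmark, the model beats guessing by the required margin.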
Another expression of the model's quality is the pseudo R². From the JMP output, R²(U) = 0,3505, meaning the model can explain roughly 35% of the variation in employee satisfaction.

Lastly we can analyse the coefficients individually. When income rises, satisfaction rises; this can be seen from the positive sign on income. Conversely, when gender goes from 0 (man) to 1 (woman), satisfaction falls: men have higher satisfaction than women in this data.

c) What is the odds ratio for men being satisfied compared to women being satisfied?

From the odds ratios we get:

odds ratio = odds(y = 1 | woman) / odds(y = 1 | man) = exp(-1,832) = 0,16

Interpretation: the odds of a woman being satisfied are only 0,16 times the odds of a man being satisfied. We can also calculate the opposite:

odds ratio = odds(y = 1 | man) / odds(y = 1 | woman) = exp(1,832) = 6,24

So the odds of being satisfied are roughly 6 times larger for a man than for a woman.

d) What is the odds ratio for each additional unit of currency earned?

The odds of being satisfied are 1,000181 times larger for each extra DKK earned (read from the odds ratio output above).

e) What is the probability that an average, married, female employee at machine 1 becomes satisfied? (Not all information may be necessary for the prediction.)

JMP: Analyze → Distribution → add y variable → change gender (x) to nominal → add x variable → run → stack → take the mean value. The average woman earns 45.670.

We can then calculate the probability in JMP: from the logistic regression (reduced model) → red triangle (Nominal Logistic Fit) → Profiler. From the graph we can derive that a woman with the average salary has a 61,6% chance of being dissatisfied and a 38,4% chance of being satisfied. (Marital status and machine were reduced out of the model, so they do not affect the prediction.)
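The odds ratios and the profiler probability can be reproduced by hand. The sketch below assumes the salary coefficient is 0,00018074 (consistent with the per-krone odds ratio of 1,000181 quoted above); the intercept and gender coefficient are taken from the final model in the notes.

```python
import math

b0 = -6.893
b_salary = 0.00018074   # assumed from the odds ratio 1.000181, i.e. ln(1.000181)
b_woman = -1.832

odds_ratio_woman = math.exp(b_woman)    # ~0.16: women's odds of satisfaction vs men's
odds_ratio_man = math.exp(-b_woman)     # ~6.2: men's odds vs women's

def p_satisfied(salary, woman):
    # Logistic response: P(y=1) = exp(z) / (1 + exp(z))
    z = b0 + b_salary * salary + b_woman * woman
    return math.exp(z) / (1 + math.exp(z))

p_avg_woman = p_satisfied(45670, 1)     # close to the 38.4% JMP profiler reading
```

Small rounding differences against the JMP output are expected, since the printed coefficients are truncated.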
Assignment 7: Chi-squared goodness of fit test

a) To assess whether the employees in the study are representative of the gender distribution in the company, please conduct a relevant test to determine if the gender distribution in the sample aligns with the entire company, considering the fact that there are an equal number of men and women.

We are asked to examine whether gender follows a uniform (50/50) distribution. We construct the hypotheses:

H0: p(men) = 0,50 and p(women) = 0,50
H1: at least one proportion differs

The following χ² assumptions are discussed:

1. Mutually exclusive groups
Gender is assumed mutually exclusive: in this data set you cannot be both a man and a woman, and no other option can be chosen. It is either/or.

2. Independence between groups
We assume independence: one person's gender does not influence the gender of the next person in the sample.

3. Rule of five
The expected count in every group must be over 5. Here the expected count per group is n * p = 100 * 0,50 = 50, which is well above 5.

To examine the hypotheses, the test statistic is calculated:

χ² = Σ (Oi - Ei)² / Ei

i.e. the sum over the groups of (observed count - expected count)² / expected count.

From JMP we can see a test statistic of 0,04 and a p-value of 0,8415. We therefore fail to reject H0: the proportions of men and women in the company are consistent with 50% each. This is backed up by the confidence intervals: for both genders, 0,50 lies within the confidence interval, further supporting that the gender proportions at the company are equal.
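The goodness-of-fit statistic can be verified in a few lines. The observed split of 49/51 is an assumption (the notes mention a smallest group of 49, and that split also reproduces JMP's χ² of 0,04):

```python
def chi_square(observed, expected):
    # chi^2 = sum over groups of (O_i - E_i)^2 / E_i
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumed counts: 49 men and 51 women out of n = 100; expected 50/50 under H0.
chi2 = chi_square([49, 51], [50, 50])   # matches the JMP test statistic of 0.04
```

With such a tiny statistic relative to the χ²(1) distribution, failing to reject H0 is unsurprising.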
Assignment 8: Chi-squared independence test

a) Conduct a relevant test to determine if the gender distribution across different machines differs. Comment on the results you find.

Based on the question text, an independence test needs to be conducted. The following assumptions need further discussion (compared to assignment 7):

Mutually exclusive: the machines are assumed to be mutually exclusive. In reality there might be a chance that employees work on several machines. For the discussion of mutual exclusivity regarding gender, refer back to assignment 7.

Rule of 5: the expected count in all groups must be larger than 5. JMP is used to control this assumption. In JMP: Analyze → Fit Y by X → add y variable (nominal) → add x variable (nominal) → OK → red triangle on contingency table → remove Total% → remove Col% → remove Row% → add Expected → add Cell Chi Square. From the table to the left we can see that the expected count in every cell is larger than 5.

We therefore set up the following hypotheses:

H0: gender and machine are independent (same gender distribution across machines)
H1: gender and machine are dependent

The test statistic is:

χ² = Σ (Oi - Ei)² / Ei

From the JMP output, the test statistic is 0,144 with a corresponding p-value of 0,9305. We therefore keep the H0 conclusion: there is no association between gender and machine. It cannot be concluded that there are more men on machine 1 than women. The cell χ² values are all low; had they been high, we would lean towards H1. That is not the case in this question, so gender and machine are treated as independent.
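The expected counts behind the rule-of-5 check come from the margins: E_ij = (row total * column total) / n. The margins below are assumptions for illustration (49/51 genders as in assignment 7, machine sizes 35/33/32 as in assignment 4):

```python
def expected_counts(row_totals, col_totals, n):
    # Expected cell counts under independence: E_ij = row_i * col_j / n
    return [[r * c / n for c in col_totals] for r in row_totals]

exp_table = expected_counts([49, 51], [35, 33, 32], 100)
min_expected = min(min(row) for row in exp_table)   # rule-of-5 check on the smallest cell
```

Even the smallest expected cell is well above 5, so the rule of 5 is comfortably met under these margins.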
(The larger the χ² value, the higher the likelihood of a significant difference.)

Assignment 9: Forecasting

The company's average salary for the period 2011-2018 is shown in the JMP file. The variable ID can be used to indicate time (1 = January 2011 ... 96 = December 2018). Assumptions that are not explicitly requested to be tested are assumed to be met in the tasks below.

a) Estimate a trend model that explains the development in average salary and use the model to predict the average salary in January 2019. Evaluate whether the model exhibits autocorrelation.

Model formulation. We construct the following model:

y-hat = β0 + β1 * ID

where ID is used as time (1 = January 2011, 96 = December 2018). The assumptions are assumed to hold, and JMP provides the results.

From the model we can immediately see an R² of 0,80, meaning time can explain 80% of the variation in average salary. The slope coefficient for ID is 86,00, meaning that for each month that passes, the salary increases by 86 kr. From the parameter estimates, β0 = 26.190, meaning that in period 0 (December 2010) the average salary was 26.190. We can also see many systematic deviations around the trend line, which could indicate seasonal fluctuations (these are analyzed in question b).

The fitted model is:

average salary-hat = 26.190 + 86 * ID

We are asked to predict the average salary for January 2019. If ID = 96 corresponds to December 2018, then ID = 97 must be January 2019:

average salary-hat (Jan 19) = 26.190 + 86 * 97 = 34.532

Thus the predicted mean salary in January 2019 is 34.532 kr. We are furthermore asked to test for autocorrelation.
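The trend-model arithmetic can be sketched as a one-line function (coefficients from the fitted model in the notes; an illustrative cross-check only):

```python
def trend_forecast(period_id):
    # Trend model from the notes: average salary-hat = 26.190 + 86 * ID
    return 26190 + 86 * period_id

jan_2019 = trend_forecast(97)   # ID 96 = December 2018, so ID 97 = January 2019
```

Plugging in ID = 97 reproduces the 34.532 kr. forecast above.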
To test for autocorrelation we use JMP. From the Durbin-Watson test, the p-value > 5%, which means there is no autocorrelation. This is positive for our model.

b) Assess whether there is a possibility to enhance the trend model by incorporating seasonality and use the model to predict the average salary for January 2019.

We saw that the model left some systematic deviations unexplained. These could possibly reflect seasonal fluctuations, such as a December bonus, so we insert a dummy variable to capture this. In JMP: Add a new column → name it → right-click the new column → Formula → Conditional → If → Month → use "==" for a strict comparison → insert "December" → Then clause = 1 → Else = 0.

The expanded model is:

y-hat = β0 + β1 * ID + β2 * December

In JMP: Analyze → Fit model → add y variable (mean salary) → add x variables (season + ID) → Run

The model now has an R² of 0,99, an almost perfect fit. The model as a whole is significant, with a p-value of approximately 0%, and all the x-variables and their respective coefficients are significant. Season thus does help explain the average salary. The regression is:

mean salary-hat = 25.960 + 83,47 * ID + 4.227 * December (1 = December)

So a December month carries a bonus of 4.227 kr. on top of the trend; for example, December 2018 (ID = 96) is estimated at 25.960 + 83,47 * 96 + 4.227 ≈ 38.200 kr. For January 2019 (ID = 97, December = 0):

mean salary-hat = 25.960 + 83,47 * 97 + 4.227 * 0 = 25.960 + 8.097 = 34.057

c) Estimate an autoregressive model where average salary is explained by previous periods' salaries. Use the model to predict the average salary for January 2019.
We need to explain the mean salary based on earlier periods' average salary, so we set up the following autoregressive model:

y-hat_t = β0 + β1 * y_(t-1)

which can be reformulated as:

mean salary-hat_t = β0 + β1 * mean salary_(t-1)

We calculate a new (lagged) variable in JMP containing the previous period's average salary and fit the autoregressive model. The model has an R² of 0,60, which is assumed to be okay, and the model is significant. The slope estimate of 0,80 means that each extra krone of the previous period's salary raises the expected current salary by 0,80 kr. Using the results, the model is:

mean salary-hat = 6.107 + 0,80 * previous period's salary

We are asked about January 2019, so we use the observed salary from December 2018:

mean salary-hat (Jan 19) = 6.107 + 0,80 * 38.196 = 36.664

Here the predicted mean salary for January 2019 is 36.664, whereas the trend model with season gave 34.057. This is a relatively large difference, which is discussed in the next question.

d) Which model provides the best estimate for January 2019?

Model                         Jan 2019 estimate   R²
a) Trend model                34.532              0,80
b) Trend model with season    34.057              0,99
c) Autoregression             36.664              0,60

From the table we conclude that the best model is b), the trend model including season, due to its very high explanatory power and the fact that it accounts for seasonal fluctuations.

JMP Guide:

Scatterplot
Analyze → Fit Y by X → y response variable (income), x factor (describing variable) age. Check that the modelling types are correct (in this case continuous).
For regression line (after creating scatter plot):
→ Red triangle → Fit line

For log(y) regression line (improving model if needed):
→ Analyze → Fit y by x → y response variable (income), x factor (describing variable) age → Red triangle → Fit line → Red triangle → Fit special → Log(y)

Checking for improvement in distribution (log-level model):
→ Analyze → Distribution → right click y variable → Transform → Log → add log(y) into JMP y variable → Check distribution.

Can add a little extra (normal distribution and quantile plot):
Distribution: Red triangle (Log) → Continuous fit → Normal distribution
Normal quantile plot: Red triangle (Log) → Normal Quantile Plot
Parameter estimates → make regression line → screenshot parameter estimates and discuss

Checking variable distribution in JMP (uniform/normal distribution)
Analyze → Distribution → Add either X or Y variable (NOT BOTH) → OK → red triangle (distribution) → Stack → copy plot into answer and discuss problems with model (if any).

Residual plot in JMP
Analyze → Fit y by x → check the variables' modelling types → fit income (Y) variable, fit age (X) variable → press OK → top red triangle → Plot line of fit → red triangle (Linear Fit) → Plot residuals.
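The residuals JMP plots in the last step can also be computed by hand as a sanity check. A sketch using the same kind of made-up age/income numbers as above (not the exam data):

```python
import numpy as np

# Illustrative data — made-up numbers, not the exam data
age = np.array([25, 30, 35, 40, 45, 50], dtype=float)
income = np.array([24_000, 27_000, 30_500, 33_000, 36_500, 39_000], dtype=float)

b1, b0 = np.polyfit(age, income, deg=1)
residuals = income - (b0 + b1 * age)  # observed minus fitted

# Least-squares residuals always average to zero (up to floating-point noise),
# which is the "expected value of residuals = 0" assumption in the notes
print(residuals.mean())
```

Plotting `residuals` against `age` gives the same picture as JMP's Plot Residuals, which is what you inspect for homoscedasticity and normality.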
Residual analysis with multiple variables
Analyze → Fit model → y variable (income) → add x variables to construct model effects → Run

Saving residuals to dataset (test for distribution):
Analyze → Fit model → y variable (income) → add x variables to construct model effects → Run → red triangle → Save columns → Residuals
Then: Analyze → Distribution → add residuals as y → Run → Stack

Confidence interval in JMP
Analyze → Fit model → income in y, x in construct model effects → Run → red triangle → Regression reports → Show all confidence intervals → then scroll to bottom
Or for confidence intervals in the dataset:
Analyze → Fit model → income in y, x in construct model effects → Run → red triangle → Save columns → Mean Confidence Limit Formula

Inserting quadratic terms for improving the fit:
Analyze → Fit model → add y variable → add x variable → right click x variable → Transform → Square → add x squared to the x variables → Run
*Always recommended to use Fit model when dealing with more than 1 variable*

Prediction interval:
Analyze → Fit model → Add y → add x → Run → red triangle → Save columns → Indiv Confidence Limit Formula → adds prediction limits to the dataset for each value.
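Behind those saved limits lies the standard prediction-interval formula for simple regression. A minimal sketch under stated assumptions: illustrative made-up data, and the t-quantile t(0,975; df = 4) = 2,776 is hard-coded from a table rather than computed.

```python
import numpy as np

# Illustrative data — made-up numbers, not the exam data
x = np.array([25, 30, 35, 40, 45, 50], dtype=float)
y = np.array([24_000, 27_000, 30_500, 33_000, 36_500, 39_000], dtype=float)

n = len(x)
b1, b0 = np.polyfit(x, y, deg=1)
resid = y - (b0 + b1 * x)
s = np.sqrt((resid ** 2).sum() / (n - 2))      # residual standard error
sxx = ((x - x.mean()) ** 2).sum()
t = 2.776                                      # t(0.975, df = n - 2 = 4), from a table

x0 = 30.0                                      # e.g. a 30-year-old employee
y_hat = b0 + b1 * x0
# Prediction interval half-width: t * s * sqrt(1 + 1/n + (x0 - x̄)² / Sxx)
half = t * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
print(f"{y_hat:.0f} ± {half:.0f}")
```

Dropping the leading "1 +" under the square root gives the narrower confidence interval for the mean, i.e. what JMP saves with Mean Confidence Limit Formula instead of Indiv Confidence Limit Formula.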
Test for multicollinearity
This is for multiple regression: Analyze → Multivariate methods → Multivariate → y variables (income, age, experience, sex, education)

One-way/one-factor ANOVA (checking distribution of variables)
Analyze → Distribution → y variable → put x variable into "By" → Stack
(Put into "By" because we test income in terms of which machine you are working at)
Analyze → Fit model → y variable (continuous) → x variable (nominal - might need to convert) into construct model effects → red triangle on x variable plot → LSMeans plot

Two-way/two-factor ANOVA (checking distribution of variables):
Analyze → Distribution → add y variable → add x variables (make sure to convert to categorical variables)
→ Analyze → Fit model → add y variable → add x variables → add interaction term (highlight x variables in column selection and press Cross in model effects) → Run

Logistic regression
Analyze → Fit model → add y variable (as nominal) → add x variables (as continuous) → Run
Check distribution: Analyze → Distribution → change y variable to nominal → OK → Stack → screenshot
Multicollinearity (all variables need to be continuous when testing for it): Analyze → Multivariate methods → Multivariate → add continuous variables → OK
Interpretation of model (pseudo R² and hit rate): red triangle on Nominal Logistic → Odds ratio and Confusion matrix

Chi squared:
For distribution: Analyze → Distribution → y variable (nominal) → press red triangle and press Test probabilities
Distribution of dependent variable in relation to independent (y in relation to x): Analyze → Fit y by x → add y variable (nominal) → add x variable (nominal) → OK → red triangle on contingency table → remove Total% → remove Col% → remove Row% → add Expected → add Cell Chi Square
To find confidence intervals: press red triangle and go down to Confidence intervals 95%

Forecasting
Analyze → Fit model → y variable (average income) → x variable (ID dates)

Test for autocorrelation: From forecasting model → Response red triangle →
Row diagnostics → Durbin-Watson Test

Adding dummy variable (to include outliers):
Add a new column → Name it → right click new column → Formula → Conditional → If → fill in the If formula with the relevant information

Autoregressive model:
Step 1) Lag the variable: double click on new column (old salary) → right click → Formula → Row → Lag → choose variable to lag (in this case the salary) → n = 1 (given we only go back one period)
Step 2) Analyze → Fit model → choose y variable → choose x variables → Run
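The lag-and-fit steps above, and the Durbin-Watson statistic from the autocorrelation test, can be sketched in a few lines. A toy salary series is used here as an assumption; the real data lives in JMP.

```python
import numpy as np

# Toy salary series — not the exam data
y = np.array([30_000, 31_200, 30_800, 32_500, 33_100, 34_000, 33_600, 35_200],
             dtype=float)

# Step 1) Lag the variable one period (JMP: Formula → Row → Lag, n = 1)
y_lag = y[:-1]   # previous period's salary
y_now = y[1:]    # current period's salary

# Step 2) Fit y(t) = b0 + b1 * y(t-1) by least squares
b1, b0 = np.polyfit(y_lag, y_now, deg=1)
next_value = b0 + b1 * y[-1]   # one-step-ahead prediction for the next period

# Durbin-Watson on the residuals: DW = Σ(e_t - e_{t-1})² / Σe_t², ranges 0–4
e = y_now - (b0 + b1 * y_lag)
dw = ((e[1:] - e[:-1]) ** 2).sum() / (e ** 2).sum()
print(round(dw, 2))
```

DW values near 2 indicate no first-order autocorrelation, which is the conclusion drawn from the p-value > 5% in the walkthrough.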