Homework Assignment #5, Math 738/8838, Spring 2023 Due 4/17/2023 I want the completed assignment handed in electronically through myCourses as a PDF document. You can upload the completed assignment through the Assignment page on myCourses. Do note email your completed assignment to the instructor they must be submitted through myCourses (Canvas). Important: in the name of the file that you submit please be certain to include your name (last name, then first name), the course number, and the assignment number; this helps greatly with grading. You may discuss the homework with each other; however, you must turn in your own original solutions – no copying. The intent is that you may participate in a study group, but you must do your own work. Remember, copying someone else’s work and presenting it as your own is cheating (and stealing from them). The rules will be enforced. Please show all work for each answer; i.e., show how you arrived at the answer. Also, in order to get full credit, where asked to use JMP to answer questions you must turn in a copy of the portion of the JMP output that is relevant to your solution. If you use other software, then you still must include screenshots from that output that support your answers. Do Not just write down answers, in order to get full credit, I need to see the basis for your answers and any relevant JMP (other software) output upon which your answer is based. For JMP users, remember the selection tool (fat plus sign) on the Cursor Tool Bar can be used to copy paste output from JMP to other applications. The assignment covers logistic and Poisson regression. 1 (15 pts) We will use the JMP dataset Lost Sales. The data consists of “quotes” provided by a company in response to sales inquiries by potential customers. The quote consists of the Part Type (AE = after market, OE = original equipment), Quote is the price given to the customer, and Time to Delivery is the lead time for the customer to receive the product after an actual order is placed. Status is the target and informs as to whether the company won the business Won, or lost the business Lost. The manager of Sales and Marketing considers the amount of lost business to be unacceptable and is of the opinion that the primary cause of lost business is that the quoted prices are too high. He wants to make across the board price cuts on products. The Operations Manager of the business is reluctant to cut prices and is unconvinced that price is the primary issue. Your task is to analyze the sales data and determine if the sales and marketing manager is Page 1of 4 Homework Assignment #5, Math 738/8838, Spring 2023 Due 4/17/2023 correct that price is the key issue or is lost sales due to some other possible cause. Step a. I have provided a Validation column that should be used in your analyses to define the validation set; this way everyone has the same set for comparison purposes. Use the Partition platform to perform a k-Nearest Neighbors analysis and select a value of K large, say 20 or so; JMP will search through all K up to the specified size. Based on your k-Nearest Neighbors analysis, how well do the predictors Quote, Part Type, and Time to Delivery classify Status? Based your conclusions on the validation misclassification rates. Step b. Use Fit Model to fit a logistic regression model to Status. You are free to add cross products of predictors or even quadratic terms in continuous predictors – I make no claim that such terms will improve classification. Based upon the Whole Model test does the model appear to have any capacity to classify Status? Explain. Which predictors appear most important to classifying Status? Comment on the classification accuracy of the model based upon Confusion Matrices, Lift, and ROC curves. Does your model appear to be a reliable classifier? Step c. Based upon the analysis in step b, does your analysis support the contention of the Sales and Marketing manager that Quote (the price) is the key cause of Lost business? Comment on what might be more important to lost business if not Quote, again based on your analysis? Step d. Repeat the analysis in step b, but this time use the Generalized Linear Models fitting personality in Fit Model. The distribution can be considered Binomial and the Logit is the appropriate link function. Does the fitting report support the same conclusions as those reached in step b? Explain. Finally, examine the residual plot. Do you see any fit or data issues that may need to be addressed? 2 (15 pts). Use the JMP data set WineQualityBC. The data consists of physical and chemical characteristics of 6,497 bottles of wine. There are 15 features of each wine and the target is Quality Rating with 3 categories. Our interest is in seeing how the characteristics of the wine impact the Page 2of 4 Homework Assignment #5, Math 738/8838, Spring 2023 Due 4/17/2023 quality rating. The quality rating was determined by a panel of wine tasting experts; we will not concern ourselves in this exercise with the taster issues. Step a. Pre-process and explore the data. Look for outliers? Multicollinearity among predictors? Or, anomalies that you uncover (I have none in mind)? If you decide to eliminate any outliers, please reference the points removed. If there is substantial multicollinearity it does affect tests and estimates much as it does in ordinary regression, but is more difficult to diagnose in the logistic case. Step b. Since there are three levels of the target, use Fit Model to perform multinomial logistic regression. I have created a validation column that you can use in the analyses. Based upon the logistic does the overall model appear to be significant in terms of predicting Quality Rating? Explain using the diagnostics available to you in the report. Similarly which predictors seem most important to Quality Rating? Explain. Step c. Finally, suppose the vintners have hired you to recommend which physical characteristics will most likely result in a Good rating. From your analysis, what settings of the important predictors will result in the highest probability of Good? You might try using the Profiler and the Desirability Functions in the Profiler menu. I have not formally discussed Desirability, but it is an important criterion for optimized settings of features to achieve a most desired response level; see if you can figure it out. 3 (10 pts) The data set Bollworm.JMP contains data on an experiment testing the effectiveness of a Hormone spray to kill bollworms, an agricultural pest. The study looks at various concentrations of the hormone applied to batches of bollworm larvae, the batch size can vary for each trial. The target is the count of dead larvae at the conclusion of each experimental trial. The scientists which to see of the hormone at some concentration can effectively kill bollworm larvae. The target is Dead? the count of dead larvae and the Larvae variable is the batch size for each run. Since the number of dead larvae is likely to increase with batch size, we think of Larvae as an Offset variable – for technical reasons. Basically, it is used to adjust the response rate for the opportunity for deaths to be observed on Page 3of 4 Homework Assignment #5, Math 738/8838, Spring 2023 Due 4/17/2023 each trial. Step a. Begin by creating a scatterplot of Dead? vs. Hormone concentration. Do you see an issues or outliers in the relationship? If you find outliers or an issue, try to find a way to correct for it; this could include simply excluding some of the observations. This is an open question and is really up to you. As a hint there really is a substantial issue in the data that for some unknown reason went unrecognized in the published analysis. Step b. Use Fit Model to fit a Poisson regression model (the generalized linear models fitting personality). Dead? is the response and hormone is the predictor. Do not forget that the batch size Larvae is an offset variable and very important to the analysis. From the report does the hormone seem effective in killing bollworm larvae? Explain. From the residual plot does some lack of fit appear present? Explain. Step c. Try refitting the model but this time include a hormone^2 term. Does the fit of the model seem to improve? Again, using the residual plot does there still seem to be a problem with model fit? Explain. Overall does the model seem to predict Dead? well? Explain. If it does not fit well, what do you think might be the reasons (besides there may not have been a relationship)? Page 4of 4