Uploaded by wantmypp

Homework Assignment #5 2023

advertisement
Homework Assignment #5, Math 738/8838, Spring 2023
Due 4/17/2023
I want the completed assignment handed in electronically through myCourses as a
PDF document. You can upload the completed assignment through the
Assignment page on myCourses. Do note email your completed assignment to
the instructor they must be submitted through myCourses (Canvas). Important:
in the name of the file that you submit please be certain to include your name
(last name, then first name), the course number, and the assignment number;
this helps greatly with grading.
You may discuss the homework with each other; however, you must turn in your
own original solutions – no copying. The intent is that you may participate in
a study group, but you must do your own work. Remember, copying someone
else’s work and presenting it as your own is cheating (and stealing from them).
The rules will be enforced.
Please show all work for each answer; i.e., show how you arrived at the answer.
Also, in order to get full credit, where asked to use JMP to answer questions
you must turn in a copy of the portion of the JMP output that is relevant to
your solution. If you use other software, then you still must include screenshots
from that output that support your answers. Do Not just write down answers,
in order to get full credit, I need to see the basis for your answers and any
relevant JMP (other software) output upon which your answer is based. For
JMP users, remember the selection tool (fat plus sign) on the Cursor Tool Bar
can be used to copy paste output from JMP to other applications.
The assignment covers logistic and Poisson regression.
1 (15 pts) We will use the JMP dataset Lost Sales. The data consists of
“quotes” provided by a company in response to sales inquiries by potential
customers. The quote consists of the Part Type (AE = after market, OE =
original equipment), Quote is the price given to the customer, and Time to
Delivery is the lead time for the customer to receive the product after an
actual order is placed. Status is the target and informs as to whether the
company won the business Won, or lost the business Lost. The manager of
Sales and Marketing considers the amount of lost business to be
unacceptable and is of the opinion that the primary cause of lost business is
that the quoted prices are too high. He wants to make across the board price
cuts on products. The Operations Manager of the business is reluctant to
cut prices and is unconvinced that price is the primary issue. Your task is to
analyze the sales data and determine if the sales and marketing manager is
Page 1of 4
Homework Assignment #5, Math 738/8838, Spring 2023
Due 4/17/2023
correct that price is the key issue or is lost sales due to some other possible
cause.
Step a. I have provided a Validation column that should be used in
your analyses to define the validation set; this way everyone has the
same set for comparison purposes. Use the Partition platform to
perform a k-Nearest Neighbors analysis and select a value of K large,
say 20 or so; JMP will search through all K up to the specified size.
Based on your k-Nearest Neighbors analysis, how well do the
predictors Quote, Part Type, and Time to Delivery classify Status?
Based your conclusions on the validation misclassification rates.
Step b. Use Fit Model to fit a logistic regression model to Status. You
are free to add cross products of predictors or even quadratic terms in
continuous predictors – I make no claim that such terms will improve
classification. Based upon the Whole Model test does the model
appear to have any capacity to classify Status? Explain. Which
predictors appear most important to classifying Status? Comment on
the classification accuracy of the model based upon Confusion
Matrices, Lift, and ROC curves. Does your model appear to be a
reliable classifier?
Step c. Based upon the analysis in step b, does your analysis support
the contention of the Sales and Marketing manager that Quote (the
price) is the key cause of Lost business? Comment on what might be
more important to lost business if not Quote, again based on your
analysis?
Step d. Repeat the analysis in step b, but this time use the Generalized
Linear Models fitting personality in Fit Model. The distribution can
be considered Binomial and the Logit is the appropriate link function.
Does the fitting report support the same conclusions as those reached
in step b? Explain. Finally, examine the residual plot. Do you see
any fit or data issues that may need to be addressed?
2 (15 pts). Use the JMP data set WineQualityBC. The data consists of
physical and chemical characteristics of 6,497 bottles of wine. There are 15
features of each wine and the target is Quality Rating with 3 categories.
Our interest is in seeing how the characteristics of the wine impact the
Page 2of 4
Homework Assignment #5, Math 738/8838, Spring 2023
Due 4/17/2023
quality rating. The quality rating was determined by a panel of wine tasting
experts; we will not concern ourselves in this exercise with the taster issues.
Step a. Pre-process and explore the data. Look for outliers?
Multicollinearity among predictors? Or, anomalies that you uncover
(I have none in mind)? If you decide to eliminate any outliers, please
reference the points removed. If there is substantial multicollinearity
it does affect tests and estimates much as it does in ordinary
regression, but is more difficult to diagnose in the logistic case.
Step b. Since there are three levels of the target, use Fit Model to
perform multinomial logistic regression. I have created a validation
column that you can use in the analyses. Based upon the logistic does
the overall model appear to be significant in terms of predicting
Quality Rating? Explain using the diagnostics available to you in the
report. Similarly which predictors seem most important to Quality
Rating? Explain.
Step c. Finally, suppose the vintners have hired you to recommend
which physical characteristics will most likely result in a Good
rating. From your analysis, what settings of the important predictors
will result in the highest probability of Good? You might try using
the Profiler and the Desirability Functions in the Profiler menu. I
have not formally discussed Desirability, but it is an important
criterion for optimized settings of features to achieve a most desired
response level; see if you can figure it out.
3 (10 pts) The data set Bollworm.JMP contains data on an experiment
testing the effectiveness of a Hormone spray to kill bollworms, an
agricultural pest. The study looks at various concentrations of the hormone
applied to batches of bollworm larvae, the batch size can vary for each trial.
The target is the count of dead larvae at the conclusion of each experimental
trial. The scientists which to see of the hormone at some concentration can
effectively kill bollworm larvae. The target is Dead? the count of dead
larvae and the Larvae variable is the batch size for each run. Since the
number of dead larvae is likely to increase with batch size, we think of
Larvae as an Offset variable – for technical reasons. Basically, it is used to
adjust the response rate for the opportunity for deaths to be observed on
Page 3of 4
Homework Assignment #5, Math 738/8838, Spring 2023
Due 4/17/2023
each trial.
Step a. Begin by creating a scatterplot of Dead? vs. Hormone
concentration. Do you see an issues or outliers in the relationship? If
you find outliers or an issue, try to find a way to correct for it; this
could include simply excluding some of the observations. This is an
open question and is really up to you. As a hint there really is a
substantial issue in the data that for some unknown reason went
unrecognized in the published analysis.
Step b. Use Fit Model to fit a Poisson regression model (the
generalized linear models fitting personality). Dead? is the response
and hormone is the predictor. Do not forget that the batch size
Larvae is an offset variable and very important to the analysis. From
the report does the hormone seem effective in killing bollworm
larvae? Explain. From the residual plot does some lack of fit appear
present? Explain.
Step c. Try refitting the model but this time include a hormone^2 term.
Does the fit of the model seem to improve? Again, using the residual
plot does there still seem to be a problem with model fit? Explain.
Overall does the model seem to predict Dead? well? Explain. If it
does not fit well, what do you think might be the reasons (besides
there may not have been a relationship)?
Page 4of 4
Download