STAT 303: Take-home Final Exam, Fall 2014, Points: 100

Name(s):________________________________
________________________________
________________________________
For the first few questions, we are going to investigate issues related to Medicare spending. The cost of
healthcare has gained a lot of attention in the last couple of years. The Medicare Hospital Spending
dataset can be downloaded from our course website.
Variables in dataset:
• Hospital Name
• Provider Number
• Period (before, during, or after hospital admission)
• Claim Type: Home Health Agency, Hospice, Inpatient, Outpatient, Skilled Nursing Facility,
  Durable Medical Equipment, Carrier, and Total
• AvgBillPerEpisode = Average bill per episode for that particular claim type at that hospital
• AvgBillPerEpisode = Average bill per episode for that particular claim type for all hospitals in
  that state
• AvgBillPerEpisode = Average bill per episode for that particular claim type for all hospitals in
  the nation
1. For this first problem, obtain a subset of the data where Claim Type = Total. This represents the
average total expense per episode for that hospital. Use Analyze > Distribution to obtain a
complete summary of this outcome variable.
a. What is the average bill per episode? (1 pt)
b. What is the standard deviation? What does this value tell us about the outcome variable?
Discuss. (3 pts)
c. What is the standard error? What does this value tell us? Discuss. (3 pts)
d. What do the upper and lower 95% Mean values tell you about the outcome variable of
interest? Discuss. (3 pts)
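The quantities reported by Analyze > Distribution are related by simple formulas, which the following minimal Python sketch illustrates. The bill values are hypothetical and not taken from the Medicare dataset, and the function name `summarize` is made up for illustration. The interval uses a large-sample normal approximation, whereas JMP bases its Lower and Upper 95% Mean values on the t distribution, so results on small samples will differ slightly.

```python
import math
import statistics
from statistics import NormalDist


def summarize(values, confidence=0.95):
    """Return n, mean, sample SD, standard error, and an approximate CI for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)            # sample SD (n - 1 in the denominator)
    se = sd / math.sqrt(n)                   # standard error of the mean
    # Assumption: normal critical value (about 1.96 for 95%) instead of a t quantile.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "se": se,
        "lower": mean - z * se,
        "upper": mean + z * se,
    }


# Hypothetical average bills per episode for six hospitals.
bills = [18200.0, 19500.0, 20100.0, 17800.0, 21400.0, 19000.0]
summary = summarize(bills)
```

The standard deviation describes how much individual hospitals vary, while the standard error (SD divided by the square root of n) describes how precisely the mean itself is estimated.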
Z-Scores are used to identify outliers in a dataset (see p8 of Chapter 5 Notes).
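Formula

$$\text{Z-Score} = \frac{\text{Data Point} - \text{Mean}}{\text{Standard Deviation}}$$

Outlier Rule

$$\text{Z-Score} = \begin{cases} \text{Less than } -2 & \text{Outlier on lower side} \\ \text{Between } -2 \text{ and } 2 & \text{Not an outlier} \\ \text{Greater than } 2 & \text{Outlier on upper side} \end{cases}$$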
Z-scores can be obtained in JMP by selecting Save > Standardized from the red drop-down menu.
Each observation has its own Z-score. JMP places the Z-score for each observation in a column in the
dataset.
Create a new variable in your dataset, say Outlier High Side, to flag the expensive hospitals.
Enter the following formula. The If statement can be found under Conditional, Comparison is used
to get “>”, and strings in JMP must be enclosed in double quotes.
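As an illustration of the Z-score formula and the outlier rule above, here is a minimal Python sketch. It is not the JMP formula itself. The data are hypothetical, the function names are made up for this example, and the sample standard deviation is assumed when standardizing.

```python
import statistics


def z_scores(values):
    """Standardize each value: (data point - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1 denominator)
    return [(v - mean) / sd for v in values]


def classify(z):
    """Apply the outlier rule: below -2, between -2 and 2, or above 2."""
    if z > 2:
        return "Outlier on upper side"
    if z < -2:
        return "Outlier on lower side"
    return "Not an outlier"


# Hypothetical bills: nine typical hospitals and one expensive one.
bills = [100.0] * 9 + [200.0]
scores = z_scores(bills)
labels = [classify(z) for z in scores]
```

With these made-up values, only the last hospital is flagged on the upper side; the others have Z-scores close to zero.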
e. Create a new variable called Outlier High Side as described above. Obtain a subset of the data
that includes only the hospitals that would be considered outliers on the high side. Select
Tables > Summary and place N(State) in the Statistics box and State in the Group box as
shown here. Click OK.
Use the output obtained to determine which two states have the most high-side outliers, i.e. the
most expensive hospitals. (4 pts)
State with the most outliers on the high side?
State with the 2nd most outliers on the high side?
f. Repeat part e to identify the states with the most outliers on the low side. (2 pts)
State with the most outliers on the low side?
State with the 2nd most outliers on the low side?
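The Tables > Summary step in parts e and f amounts to counting flagged hospitals per state. The sketch below shows that counting logic in Python on hypothetical rows. The state labels and Z-scores are invented for illustration and do not come from the Medicare dataset.

```python
from collections import Counter

# Hypothetical (state, z_score) rows, one per hospital.
rows = [
    ("State A", 2.4),
    ("State A", 2.1),
    ("State B", 3.0),
    ("State C", 0.3),
    ("State B", -2.5),
    ("State A", 1.0),
]

# Count outliers per state on each side, mirroring N(State) grouped by State.
high = Counter(state for state, z in rows if z > 2)
low = Counter(state for state, z in rows if z < -2)

# The two states with the most high-side outliers, largest count first.
top_two_high = high.most_common(2)
```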
2. For this problem, we will obtain a different subset of the original dataset.
Period = During Index Hospital Admission AND Claim Type = Inpatient
Using this subset of the data, select Fit Y by X and set up the following.
Apply a Local Data Filter to this output. Pick any three states in the Local Data Filter and
obtain the Means and Std Deviations, CDF Plot, and a Compare Densities plot as shown here.
Provide a three-sentence description of what is learned by considering these plots. (4 pts)
3. Use the same subset of the data used in Problem #2.
Period = During Index Hospital Admission AND Claim Type = Inpatient
Select Tables > Summary and specify the following in this window.
a. What information is gained by considering the N(AvgBillPerEpisode_HospitalLevel) column
in the output returned by JMP? (3 pts)
b. What information is gained by considering the Mean(AvgBillPerEpisode_HospitalLevel)
column in the output returned by JMP? (3 pts)
c. What information is gained by considering the Std Dev(AvgBillPerEpisode_HospitalLevel)
column in the output returned by JMP? (3 pts)
d. Which state has the smallest standard deviation from part c? What can be said about
average hospital bill per episode for inpatient stays during hospitalization for this state?
Discuss. (3 pts)
4. Suppose your peer in the class did not subset the data correctly in Problem #2. In particular, he
just used Claim Type = Inpatient when obtaining the subset and ignored the Period.
Claim Type = Inpatient (Period ignored)
Obtain the Means and Standard Deviations, CDF Plot, and Compare Densities plot as was done
in Problem #2. Compare and contrast the output in Problem #2 to that obtained here. Why are
the summaries and plots obtained here not a good reflection of average per episode expense for
inpatient during hospital admission? Discuss. (3 pts)
Keeping close track of inventory on hand is important: too much inventory adds excess cost, and too
little might stop production. Most inventory tracking systems are computerized; however, the actual
inventory on hand often differs from what the computer suggests. Even with the advent of these types of
systems, many companies periodically complete an expensive and time-consuming count of the physical
inventory on hand to recalibrate the computer system.
One simple approach to verify the agreement between the two systems is to calculate the ratio.
$$\text{Ratio} = \frac{\text{Computer Inventory}}{\text{Physical Inventory}}$$
Ratio Interpretation: The 5th observation in the above list has a ratio of 1.15. This implies that our
computer inventory system says we have 15% more inventory on hand than what was physically
counted.
Consider the following summaries for the ratio values.
5. The average ratio value is about 1.17. Interpret this value for a manager who happens
to hate math. What does this value imply about the agreement between the two
inventory systems? (3 pts)
6. Next, consider the Lower 95% Mean and Upper 95% Mean values. Again, provide an
interpretation of what is learned by considering these values in layman’s language. (3 pts)
7. Assume we want the two inventory systems to agree. What are the optimal values for
the following quantities when the Ratio = Computer / Physical is considered? (2 pts)
a. Optimal Value for Mean: _________________________
b. Optimal Value for Standard Deviation: _____________________
8. Use Test Mean from the red drop-down menu in JMP to conduct the following statistical test.

$$H_0: \text{Average Ratio} = 1$$
$$H_A: \text{Average Ratio} \neq 1$$
a. What is the p-value from this test? (2 pts) _________________
b. Write a conclusion for this test, again using layman’s language. (3 pts)
c. In what ways do the Lower and Upper 95% Mean values from Problem #6 support
what you learned from the p-value above? (3 pts)
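For readers who want to see the arithmetic behind a test of H0: Average Ratio = 1, here is a minimal Python sketch. The ratio values are hypothetical and the function name `one_sample_test` is made up for illustration. The p-value uses a large-sample normal approximation; JMP's Test Mean reports a t-based p-value, which will be somewhat larger for small samples.

```python
import math
import statistics
from statistics import NormalDist


def one_sample_test(values, null_value=1.0):
    """Return the test statistic and an approximate two-sided p-value."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    stat = (mean - null_value) / se
    # Assumption: normal reference distribution instead of t with n - 1 df.
    p_approx = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_approx


# Hypothetical computer / physical inventory ratios.
ratios = [1.10, 1.25, 0.98, 1.30, 1.15, 1.22, 1.05, 1.35]
stat, p_approx = one_sample_test(ratios)
```

When the same reference distribution is used for both, a two-sided p-value below 0.05 corresponds to a 95% interval for the mean that excludes the hypothesized value of 1, which is the link part c asks about.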
9. Suppose your friend decides to use the difference between the computer and physical inventory
values instead of the ratios.
a. How would the test conducted in Problem #8 change when the analysis is based on the
differences? Discuss. (2 pts)
b. Run an appropriate statistical test on the differences. Write a final conclusion for your
test for a manager who doesn’t understand much math. (3 pts)
c. Obtain the Lower and Upper 95% Mean values from the analysis done on the
differences. Interpret these values in the context of this problem. (3 pts)
A second method for comparing the two inventory systems is to plot the computer
inventory against the physical inventory, which has been done here. A simple linear regression analysis
was run in JMP as well, and the output is provided here.
[Figure: Scatterplot and JMP Output. Dotted line: Y = X line. Solid line: trend line.]
10. Identify the following values. (2 pts each)
c. Y-Intercept of trend line: ______________________
d. Slope of trend line: _____________________
e. The best estimate for the amount of physical inventory on hand when the computer
inventory is 1500: __________________________
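The trend line in the JMP output is a least-squares line, and a prediction is obtained by plugging an X value into it. The following Python sketch shows that calculation on hypothetical data in which physical inventory is exactly 90% of computer inventory. The numbers and the function name `fit_line` are illustrative assumptions and do not reproduce the exam's JMP output.

```python
import statistics


def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical inventories: X = computer, Y = physical.
computer = [500.0, 1000.0, 1500.0, 2000.0]
physical = [450.0, 900.0, 1350.0, 1800.0]

intercept, slope = fit_line(computer, physical)

# Estimated physical inventory when the computer inventory is 1500.
predicted = intercept + slope * 1500
```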
Consider the following graphs. In each graph, the dotted line is the Y=X line and the solid line represents
various trend lines. For simplicity, you can assume the R² value is the same and really close to 1 for all
three models (i.e. the dots fall really close to the trend line).
Graph A
Graph B
Graph C
11. Answer the following True / False questions regarding these plots. (1 pt each)
a. T  F  Graph A would be the best situation of the three presented here for making appropriate
adjustments in order to get the two systems to agree.
b. T  F  The computer system is most off from the physical inventory in Graph C.
c. T  F  When inventory is high, the computer system tends to overestimate inventory in Graph C.
e. T  F  If the R² value is really close to 1 on each of the plots, then the average distance a data
point is from the line will be fairly large.
12. Two-Way ANOVA and Crabs…
Mahon & Campbell recorded data on 200 specimens of
Leptograpsus variegatus crabs found on the shore in
Western Australia. This species has two color forms -- blue
and orange. They recorded each crab’s sex as well (Male,
Female).
There are four outcome variables in this dataset – Rear
Width, Frontal Lobe Width, Carapace (Shell) Width,
Carapace (Shell) Length. These outcome variables are
identified as RW, FL, CW, and CL, respectively, to the right.
a. Download the Crabs.JMP file off our course website. Run a two-way ANOVA model to
investigate the potential differences in the rear width between species (Blue vs. Orange)
and Gender (Female vs. Male). Make sure to include the interaction term in your model
(see pages 43 – 45 of Chapter 6 notes for more information). Copy and paste or screen
capture the output and include it here. (3 pts)
b. Notice that the two-way interaction term is statistically significant in this analysis.
Explain in everyday language what this means in the context of this problem. (3 pts)
Hint: It might be useful to look at the LSMEANS plot for the interaction term.
c. Obtain the ‘letterings’ (i.e. A, B, C, etc.) in JMP for the interaction term. Explain, in the
context of this problem, the information that is gained by looking at these letterings. (3 pts)
d. Next, run an analysis for Frontal Lobe width. There are some differences in the final
conclusions from this analysis as compared to the analysis for rear width. Discuss these
differences in detail. (4 pts)
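To make the idea of an interaction term concrete, the sketch below computes cell means and a "difference of differences" in Python. It does not run the two-way ANOVA or test significance, which JMP does for you. The rear-width values are hypothetical and are not taken from Crabs.JMP.

```python
import statistics

# Hypothetical rear widths (mm) for each color form and sex.
cells = {
    ("Blue", "Female"): [12.0, 13.0, 14.0],
    ("Blue", "Male"): [11.0, 12.0, 13.0],
    ("Orange", "Female"): [14.0, 15.0, 16.0],
    ("Orange", "Male"): [11.0, 12.0, 13.0],
}

# Cell means, i.e. the points an interaction plot would display.
means = {key: statistics.mean(values) for key, values in cells.items()}

# Female-minus-male gap within each color form.
blue_gap = means[("Blue", "Female")] - means[("Blue", "Male")]
orange_gap = means[("Orange", "Female")] - means[("Orange", "Male")]

# With no interaction the two gaps would be equal and this contrast would be 0.
interaction_contrast = blue_gap - orange_gap
```

In these made-up numbers the female-male gap is 1 mm for blue crabs and 3 mm for orange crabs, so the effect of sex depends on color form. That dependence is what an interaction term captures.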
13. To manage a population of African lions, a wildlife manager needs to know the age
structure of the population. Behavioral studies suggest that removal of males older than six
years has virtually no impact on the social structure of the group, whereas taking younger males
is more disruptive. It has been suggested that the amount of black pigmentation on the nose of
male lions increases with age. Whitman et al. (2004) measured the proportion of black on the
noses of lions of known ages in Tanzania, East Africa. The research question of interest is how
well the proportion of black in the nose predicts the animal’s age.
Close-up colour photographs were taken of known-aged lions from the Serengeti National Park and Ngorongoro Crater,
Tanzania, between 1999 and 2002. Each photograph was first digitized at high resolution into a .tif file, and the fleshy part of the
nose (‘nose tip’) from each image was excised using Adobe Photoshop 4.01 LE. Then, the Spatial Analyst extension of ESRI
Arcview 3.2 was used to rasterize each cut-out nose tip and assign each newly created ‘grid’ a range of colour values. By
limiting the colour values to either ‘black’ or ‘not black’, the nasal pigmentation pattern was ‘mapped’ and quantified for the
percentage of readable pixels that contained ‘black’.
Identification photograph of a 3-yr-old Serengeti male
Excised photo of nose tip
GIS rendering of nose colouration
a. Download the Lions dataset off our course website. Obtain a scatterplot of Age (X Variable)
vs Proportion of Black on Nose (Y Variable). Discuss the relationship between these two
variables. Do you think a trend line would be appropriate to model the relationship
between these two variables? Discuss. (2 pts)
b. Carry out a hypothesis test to determine whether the slope of the true regression line is different
from zero, i.e. not flat (see page 17 of Chapter 7 notes). Give a conclusion for this test in the
context of this problem. (2 pts)
c. What is the R-Square value for your model? Explain what this number is measuring in
the context of this problem. (2 pts)
d. In practice it is much easier to take a picture of a lion’s nose than to determine its age.
Plugging Age = 6 into my estimated regression line suggests a general rule of thumb: a lion
with over 40% black on its nose will likely be older than 6 years of age.
$$\begin{aligned} \text{Mean Proportion} \mid \text{Age} &= 0.0582 + 0.0587 \times \text{Age} \\ &= 0.0582 + 0.0587 \times 6 \\ &= 0.4104 \\ &\approx 41\% \end{aligned}$$
A good friend and colleague of yours (who has not had much statistics) says the
following, “Your rule of thumb is not reasonable – by looking at the graph your cutoff
should be more like 60%.”
i. Why is a cutoff of 60% not statistically reasonable? (2 pts)
ii. Using the regression model constructed from the observed data, what would be
the approximate Age of a lion with 60% black on their nose? (2 pts)
iii. What would be the consequences (in terms of lion population management) of
using your colleague’s cutoff of 60%? (2 pts)
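The rule-of-thumb arithmetic in part d can be checked with a few lines of Python. The intercept and slope below are the ones given in part d. The function names are made up for illustration, and the second function simply inverts the fitted line algebraically, which is an approximation rather than a formal inverse-prediction method.

```python
# Estimated regression line from part d: proportion = 0.0582 + 0.0587 * Age.
INTERCEPT = 0.0582
SLOPE = 0.0587


def mean_proportion(age):
    """Estimated mean proportion of black on the nose at a given age (years)."""
    return INTERCEPT + SLOPE * age


def age_at_proportion(proportion):
    """Age at which the fitted line reaches a given proportion (line solved for Age)."""
    return (proportion - INTERCEPT) / SLOPE


cutoff_at_six = mean_proportion(6)            # about 0.41, matching the rule of thumb
age_back = age_at_proportion(cutoff_at_six)   # recovers Age = 6
```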
Consider the following plots from the Lions analysis. For this analysis the Y variable was Proportion of
Nose Black (PNB) and the X variable was Age. Each plot contrasts two prides, the Serengeti and
Ngorongoro prides. The plot on the left fits the data using two trend lines; the plot on the right uses
spline curves.
Graph #1: Trend Lines
Graph #2: Spline Curves
14. Answer the following T/F questions regarding these plots. (2 pts each)
a. T  F  Suppose your friend runs a complete statistical analysis for Graph #1 and determines that
we lack statistical evidence to suggest a difference in the slopes and in the y-intercepts between
the two prides. The correct conclusion from this analysis is that the relationship between PNB
and Age is the same for Serengeti and Ngorongoro prides.
b. T  F  Suppose five years have passed and you now have a lot more data to use in construction of
Graph #1 and this additional data is used to fit updated models. The statistical tests are run again
and we are now able to say there are differences in the y-intercepts, but we still lack statistical
evidence to say the slopes are different. The correct conclusion from this analysis is that the
Serengeti pride has a statistically higher PNB than Ngorongoro and this is true regardless of the
Age of the lion.
c. T  F  The models in Graph #2 are more likely to mimic the true relationship between PNB and
Age than the models in Graph #1 over the entire lifespan of a lion.