STAT 303: Take-home Final Exam
Fall 2014
Points: 100 Points

Name(s): ________________________________
         ________________________________
         ________________________________

For the first few questions, we are going to investigate issues related to Medicare spending. The cost of healthcare has gained a lot of attention in the last couple of years. The Medicare Hospital Spending dataset can be downloaded from our course website.

Variables in dataset:
   Hospital Name
   Provider Number
   Period (before, during, or after hospital admission)
   Claim Type: Home Health Agency, Hospice, Inpatient, Outpatient, Skilled Nursing Facility, Durable Medical Equipment, Carrier, and Total
   AvgBillPerEpisode = Average bill per episode for that particular claim type at that hospital
   AvgBillPerEpisode = Average bill per episode for that particular claim type for all hospitals in that state
   AvgBillPerEpisode = Average bill per episode for that particular claim type for all hospitals in the nation

1. For this first problem, obtain a subset of the data where Claim Type = Total. This represents the average total expense per episode for that hospital. Use Analyze > Distribution to obtain a complete summary of this outcome variable.

   a. What is the average bill per episode? (1 pt)

   b. What is the standard deviation? What does this value tell us about the outcome variable? Discuss. (3 pts)

   c. What is the standard error? What does this value tell us? Discuss. (3 pts)

   d. What do the Upper and Lower 95% Mean values tell you about the outcome variable of interest? Discuss. (3 pts)

Z-scores are used to identify outliers in a dataset (see p. 8 of the Chapter 5 notes).

Formula:

$$\text{Z-Score} = \frac{\text{Data Point} - \text{Mean}}{\text{Standard Deviation}}$$

Outlier Rule:

$$\text{Z-Score} = \begin{cases} \text{Less than } -2 & \text{Outlier on lower side} \\ \text{Between } -2 \text{ and } 2 & \text{Not an outlier} \\ \text{Greater than } 2 & \text{Outlier on upper side} \end{cases}$$

To have JMP compute Z-scores automatically, select Save > Standardized from the red drop-down menu.
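The Z-score formula and outlier rule above can also be sketched outside JMP. The following is a minimal Python illustration, not part of the JMP workflow. It assumes the sample standard deviation is the one intended, and the bill amounts are made-up values, not figures from the course dataset. The function names `z_scores` and `classify` are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (value - mean) / standard deviation for each value."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator); an assumption
    return [(v - m) / s for v in values]


def classify(z, cutoff=2.0):
    """Apply the outlier rule: below -cutoff, above +cutoff, or neither."""
    if z < -cutoff:
        return "Outlier on lower side"
    if z > cutoff:
        return "Outlier on upper side"
    return "Not an outlier"


# Hypothetical average bills per episode (made-up numbers for illustration only)
bills = [18000, 19500, 20000, 20500, 21000, 19000, 20200, 19800, 20700, 35000]

labels = [classify(z) for z in z_scores(bills)]
for bill, label in zip(bills, labels):
    print(bill, label)
```

In this made-up list only the last value lies more than two standard deviations above the mean, so it is the only one flagged on the upper side.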
Each observation has its own Z-score. JMP places the Z-score for each observation in a new column in the dataset.

Create a new variable in your dataset, say Outlier High Side, to flag the expensive hospitals. Enter the following formula. The If statement can be found under Conditional, Comparison is used to get ">", and strings in JMP take double quotes.

   e. Create a new variable called Outlier High Side as described above. Obtain a subset of the data that includes only the hospitals that would be considered outliers on the high side. Select Tables > Summary and place N(State) in the Statistics box and State in the Group box, as shown here. Click OK. Use the output obtained to determine which two states have the most outliers among the most expensive hospitals. (4 pts)

      State with the most outliers on the high side?
      State with the 2nd most outliers on the high side?

   f. Repeat part e to identify the states with the most outliers on the low side. (2 pts)

      State with the most outliers on the low side?
      State with the 2nd most outliers on the low side?

2. For this problem, we will obtain a different subset of the original dataset.

      Period = During Index Hospital Admission AND Claim Type = Inpatient

   Using this subset of the data, select Fit Y by X and set up the following. Apply a Local Data Filter to this output. Pick any three states in the Local Data Filter and obtain the Means and Std Deviations, CDF Plot, and Compare Densities plot, as shown here. Provide a three-sentence description of what is learned by considering these plots. (4 pts)

3. Use the same subset of the data used in Problem #2.

      Period = During Index Hospital Admission AND Claim Type = Inpatient

   Select Tables > Summary and specify the following in this window.

   a. What information is gained by considering the N(AvgBillPerEpisode_HospitalLevel) column in the output returned by JMP? (3 pts)

   b. What information is gained by considering the Mean(AvgBillPerEpisode_HospitalLevel) column in the output returned by JMP? (3 pts)

   c. What information is gained by considering the Std Dev(AvgBillPerEpisode_HospitalLevel) column in the output returned by JMP? (3 pts)

   d. Which state has the smallest standard deviation from part c? What can be said about the average hospital bill per episode for inpatient stays during hospitalization for this state? Discuss. (3 pts)

4. Suppose your peer in the class did not subset the data correctly in Problem #2. In particular, he used only Claim Type = Inpatient when obtaining the subset and ignored the Period.

      Claim Type = Inpatient

   Obtain the Means and Standard Deviations, CDF Plot, and Compare Densities plot as was done in Problem #2. Compare and contrast the output in Problem #2 with that obtained here. Why are the summaries and plots obtained here not a good reflection of the average per-episode expense for inpatient claims during hospital admission? Discuss. (3 pts)

Keeping close track of inventory on hand is important: too much inventory incurs excess costs, and too little might cause production to stop. Most inventory tracking systems are computerized; however, the actual inventory on hand is often different from what the computer suggests. Even with the advent of these types of systems, many companies periodically complete an expensive and time-consuming count of the physical inventory on hand to recalibrate the computer system.

One simple approach to verify the agreement between the two systems is to calculate the ratio.

$$\text{Ratio} = \frac{\text{Computer Inventory}}{\text{Physical Inventory}}$$

Ratio Interpretation: The 5th observation in the above list has a ratio of 1.15. This implies that our computer inventory system says we have 15% more inventory on hand than what was physically counted.

Consider the following summaries for the ratio values.

5. The average ratio value is about 1.17. Interpret this value for a manager who happens to hate math. What does this value imply about the agreement between the two inventory systems? (3 pts)

6. Next, consider the Lower 95% Mean and Upper 95% Mean values. Again, provide an interpretation of what is learned by considering these values in layman's language. (3 pts)

7. Assume we want the two inventory systems to agree. What are the optimal values for the following quantities when the Ratio = Computer / Physical is considered? (2 pts)

   a. Optimal Value for Mean: _________________________

   b. Optimal Value for Standard Deviation: _____________________

8. Use Test Mean from the red drop-down menu in JMP to conduct the following statistical test.

$$H_0: \text{Average Ratio} = 1 \qquad H_A: \text{Average Ratio} \neq 1$$

   a. What is the p-value from this test? (2 pts) _________________

   b. Write a conclusion for this test, again using layman's language. (3 pts)

   c. In what ways do the Lower and Upper 95% Mean values from Problem #6 support what you learned from the p-value above? (3 pts)

9. Suppose your friend decides to use the difference between the computer and physical inventory values instead of the ratios.

   a. How would the test conducted in Problem #8 change when the analysis is based on the differences? Discuss. (2 pts)

   b. Run an appropriate statistical test on the differences. Write a final conclusion for your test for a manager who doesn't understand much math. (3 pts)

   c. Obtain the Lower and Upper 95% Mean values from the analysis done on the differences. Interpret these values in the context of this problem. (3 pts)

A second method for making comparisons between the two inventory systems is to plot the computer inventory against the physical inventory, which has been done here. A simple linear regression analysis was run in JMP as well, and the output is provided here.

[Figure: scatterplot and JMP output. Dotted line: Y = X line. Solid line: trend line.]

10. Identify the following values. (2 pts each)

   c. Y-intercept of trend line: ______________________

   d. Slope of trend line: _____________________

   e. The best estimate for the amount of physical inventory on hand when the computer inventory is 1500: __________________________

Consider the following graphs. In each graph, the dotted line is the Y = X line and the solid line represents various trend lines. For simplicity, you can assume the R² value is the same and really close to 1 for all three models (i.e., the dots fall really close to the trend line).

[Figure: Graph A, Graph B, Graph C]

11. Answer the following True / False questions regarding these plots. (1 pt each)

   a. T  F   Graph A would be the best situation of the three presented here for making appropriate adjustments in order to get the two systems to agree.

   b. T  F   The computer system is most off from the physical inventory in Graph C.

   c. T  F   When inventory is high, the computer system tends to overestimate inventory in Graph C.

   d. T  F   If the R² value is really close to 1 on each of the plots, then the average distance a data point is from the line will be fairly large.

12. Two-Way ANOVA and Crabs

   Mahon & Campbell recorded data on 200 specimens of Leptograpsus variegatus crabs found on the shore in Western Australia. This species has two color forms, blue and orange. They recorded each crab's sex as well (Male, Female). There are four outcome variables in this dataset: Rear Width, Frontal Lobe Width, Carapace (Shell) Width, and Carapace (Shell) Length. These outcome variables are identified as RW, FL, CW, and CL, respectively, to the right.

   a. Download the Crabs.JMP file from our course website. Run a two-way ANOVA model to investigate the potential differences in rear width between species (Blue vs. Orange) and gender (Female vs. Male). Make sure to include the interaction term in your model (see pages 43-45 of the Chapter 6 notes for more information). Copy and paste or screen capture the output and include it here. (3 pts)

   b. Notice that the two-way interaction term is statistically significant in this analysis. Explain in everyday language what this means in the context of this problem. (3 pts)
      Hint: It might be useful to look at the LSMEANS plot for the interaction term.

   c. Obtain the 'letterings' (i.e., A, B, C, etc.) in JMP for the interaction term. Explain, in the context of this problem, the information that is gained by looking at these letterings. (3 pts)

   d. Next, run an analysis for Frontal Lobe Width. There are some differences in the final conclusions from this analysis as compared to the analysis for rear width. Discuss these differences in detail. (4 pts)

13. In order to manage a population of African lions, a wildlife manager needs to know the age structure of the population. Behavioral studies suggest that removal of lions older than six years of age has virtually no impact on the social structure of the group, whereas taking younger males is more disruptive. It has been suggested that the amount of black pigmentation on the nose of male lions increases with age. Whitman et al. (2004) measured the proportion of black on the noses of lions of known ages in Tanzania, East Africa. The research question of interest is how well the proportion of black in the nose predicts the animal's age.

   Close-up colour photographs were taken of known-aged lions from the Serengeti National Park and Ngorongoro Crater, Tanzania, between 1999 and 2002. Each photograph was first digitized at high resolution into a .tif file, and the fleshy part of the nose ('nose tip') from each image was excised using Adobe Photoshop 4.01 LE. Then, the Spatial Analyst extension of ESRI Arcview 3.2 was used to rasterize each cut-out nose tip and assign each newly created 'grid' a range of colour values. By limiting the colour values to either 'black' or 'not black', the nasal pigmentation pattern was 'mapped' and quantified as the percentage of readable pixels that contained 'black'.
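The final quantification step described above (the share of readable pixels classified as 'black') can be illustrated with a small Python sketch. This is only an illustration under stated assumptions and is not the Photoshop and ArcView procedure used in the study. The grayscale encoding, the `threshold` value, the use of `None` for unreadable pixels, and the function name `proportion_black` are all hypothetical.

```python
def proportion_black(grid, threshold=60):
    """Fraction of readable pixels classified as 'black'.

    grid: rows of grayscale values (0 = darkest, 255 = lightest);
          None marks an unreadable pixel (an assumption of this sketch).
    threshold: hypothetical cutoff; values at or below it count as 'black'.
    """
    readable = [p for row in grid for p in row if p is not None]
    if not readable:
        raise ValueError("no readable pixels")
    black = sum(1 for p in readable if p <= threshold)
    return black / len(readable)


# Made-up 3 x 4 nose-tip grid for illustration; not real image data
nose = [
    [10, 200, 30, None],
    [220, 40, 210, 230],
    [None, 50, 190, 240],
]

pnb = proportion_black(nose)
print(f"Proportion of nose black: {pnb:.2f}")
```

Here 4 of the 10 readable pixels fall at or below the cutoff, so the sketch reports a proportion of 0.40. Unreadable pixels are excluded from both the numerator and the denominator.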
[Figure: identification photograph of a 3-yr-old Serengeti male; excised photo of nose tip; GIS rendering of nose colouration]

   a. Download the Lions dataset from our course website. Obtain a scatterplot of Age (X variable) vs. Proportion of Black on Nose (Y variable). Discuss the relationship between these two variables. Do you think a trend line would be appropriate to model the relationship between these two variables? Discuss. (2 pts)

   b. Carry out a hypothesis test to ensure the slope of the true regression line is not flat (see page 17 of the Chapter 7 notes). Give a conclusion for this test in the context of this problem. (2 pts)

   c. What is the R-Square value for your model? Explain what this number is measuring in the context of this problem. (2 pts)

   d. In practice, it is much easier to take a picture of a lion's nose than to determine its age. Thus, plugging Age = 6 into my estimated regression line suggests, as a general rule of thumb, that a lion with over 40% black on its nose will likely be older than 6 years of age.

$$
\begin{aligned}
\text{Mean Proportion} \mid \text{Age} &= 0.0582 + 0.0587 \times \text{Age} \\
&= 0.0582 + 0.0587 \times 6 \\
&= 0.4104 \approx 41\%
\end{aligned}
$$

      A good friend and colleague of yours (who has not had much statistics) says the following: "Your rule of thumb is not reasonable. Looking at the graph, your cutoff should be more like 60%."

      i. Why is a cutoff of 60% not statistically reasonable? (2 pts)

      ii. Using the regression model constructed from the observed data, what would be the approximate age of a lion with 60% black on its nose? (2 pts)

      iii. What would be the consequences (in terms of lion population management) of using your colleague's cutoff of 60%? (2 pts)

Consider the following plots from the Lions analysis. For this analysis the Y variable was Proportion of Nose Black (PNB) and the X variable was Age. Each plot contrasts two prides, the Serengeti and Ngorongoro prides. The plot on the left fits the data using two lines; spline curves were used for the plot on the right.
[Figure: Graph #1, Trend Lines; Graph #2, Spline Curves]

14. Answer the following T/F questions regarding these plots. (2 pts each)

   a. T  F   Suppose your friend runs a complete statistical analysis for Graph #1 and determines that we lack statistical evidence to suggest a difference in the slopes and in the y-intercepts between the two prides. The correct conclusion from this analysis is that the relationship between PNB and Age is the same for the Serengeti and Ngorongoro prides.

   b. T  F   Suppose five years have passed and you now have a lot more data to use in the construction of Graph #1, and this additional data is used to fit updated models. The statistical tests are run again and we are now able to say there are differences in the y-intercepts, but we still lack statistical evidence to say the slopes are different. The correct conclusion from this analysis is that the Serengeti pride has a statistically higher PNB than the Ngorongoro pride, and this is true regardless of the age of the lion.

   c. T  F   The models in Graph #2 are more likely to mimic the true relationship between PNB and Age than the models in Graph #1 over the entire lifespan of a lion.