Math 311, Winter 2003, Lab 5
Part I: the amazing confidence interval
In this part of the lab, we’re going to have Minitab compute many, many 95% confidence
intervals and see what percentage of these confidence intervals capture the true mean of
the distribution (hopefully you can guess).
The population we will be pulling our data from will be uniformly distributed between 9
and 11. Thus the mean of the population will be μ = 10 (obvious) and the standard
deviation will be σ = (11 − 9)/√12 ≈ 0.57735 (not obvious! For a uniform distribution on
[a, b], the standard deviation is (b − a)/√12).
Step 1: Start Minitab
Step 2: Have Minitab compute 10,000 rows of 36 columns (c1-c36) of uniformly
distributed data with a lower endpoint of 9.0 and an upper endpoint of 11.0. Recall that
the menu sequence is Calc>Random Data>Uniform…
This gives us 10,000 samples of size n = 36.
If 10,000 takes too long, try 5,000.
Step 3: Have Minitab compute the mean of each row and store that mean in column c38.
Title this column “Sample Means.”
Step 4: Have Minitab compute the lower bound for each of our 10,000 confidence
intervals and store this in column c40.
Do this as follows: select Calc>Calculator and then type c40 in the “Store result in
variable” box. Then type c38 - 1.96*0.57735/6 in the “Expression:” box. Make certain
you know where each of these numbers came from.
Title column c40 “Lower bound.”
Step 5: Have Minitab compute the upper bound for each of our 10,000 confidence
intervals and store this in column c41.
Do this as follows: select Calc>Calculator and then type c41 in the “Store result in
variable” box. Then type c38 + 1.96*0.57735/6 in the “Expression:” box.
Title column c41 “Upper bound.”
Step 6: Now we need to figure out which, if any, of the confidence intervals we just
computed captured the true mean (μ = 10) of the population. So, select Calc>Calculator
and then type c43 in the “Store result in variable” box. Then type 10 >= c40 And 10 <=
c41 in the “Expression:” box.
This command will return a 1 in column c43 if 10 is in the confidence interval.
It will return a 0 if 10 is not in the confidence interval.
Step 7: Now let’s see how many times μ = 10 was captured by selecting Calc>Column
Statistics… then select Sum and an input variable of c43. Leave the “Store result in:”
box empty. This command will return the number of times μ = 10 was captured.
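
If you would like to double-check the whole procedure outside Minitab, here is a
minimal Python sketch of Steps 2–7 (assuming NumPy is available; the comments refer
to the Minitab columns above):

    import numpy as np

    rng = np.random.default_rng()

    # Step 2: 10,000 samples of size n = 36 from Uniform(9, 11)
    data = rng.uniform(9.0, 11.0, size=(10_000, 36))

    # Step 3: the mean of each row (column c38, "Sample Means")
    sample_means = data.mean(axis=1)

    # Steps 4-5: bounds of each 95% confidence interval
    # (1.96 is the z critical value, 0.57735 is sigma, 6 = sqrt(36))
    margin = 1.96 * 0.57735 / 6
    lower = sample_means - margin   # c40, "Lower bound"
    upper = sample_means + margin   # c41, "Upper bound"

    # Step 6: 1 if the interval captures mu = 10, 0 otherwise (c43)
    captured = (lower <= 10) & (10 <= upper)

    # Step 7: how many of the 10,000 intervals captured mu = 10
    print(captured.sum(), "captures =", 100 * captured.mean(), "percent")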
Question 1: How many times was μ = 10 captured? What percentage is this of the
10,000 confidence intervals you computed? How does this compare to the 95%
confidence you were trying to establish?
Part II: insurance gumshoes – from the files of “real life.”
Download the dataset Firefraud.mtw from the class webpage. Here is the story behind
the data:
A wholesale furniture retailer stores in-stock items at a large warehouse located in
Florida. Several years ago, a fire destroyed the warehouse and all the furniture in it.
After determining the fire was an accident, the retailer sought to recover costs by
submitting a claim to its insurance company.
As is typical in a fire insurance policy of this type, the furniture retailer must provide the
insurance company with an estimate of “lost” profit for the destroyed items. Retailers
calculate profit margin in percentage form using the Gross Profit Factor (GPF). By
definition, the GPF for a single sold item is the ratio of the profit to the item’s selling
price, measured as a percentage. That is,
Item GPF = (Profit / Sales Price) * 100%
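For example, an item that sold for $500 and earned $200 in profit (hypothetical
numbers) would have a GPF of (200 / 500) * 100% = 40%.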
Of interest to both the retailer and the insurance company is the average GPF for all of
the items in the warehouse. Since these furniture pieces were all destroyed, their eventual
selling prices and profit values are obviously unknown. Consequently, the average GPF
for all the warehouse items is unknown.
One way to estimate the mean GPF of the destroyed items is to use the mean GPF of
similar, recently sold items. The retailer sold 3,005 furniture items in the year prior to the
fire and kept paper invoices on all sales. Rather than calculate the mean GPF for all
3,005 items (the data was not computerized), the retailer sampled a total of 253 of the
invoices and computed the mean GPF for these items as 50.8%. The retailer applied this
average GPF to the costs of the furniture items destroyed in the fire to obtain an estimate
of the “lost” profit.
According to experienced claims adjusters at the insurance company, the GPF for sale
items of the type destroyed in the fire rarely exceeds 48%. Consequently, the estimate of
50.8% appeared to be unusually high. (A 1% increase in GPF for items of this type
equates to roughly an additional $16,000 in profit.) As a result, a dispute arose
between the furniture retailer and the insurance company, and a lawsuit was filed.
In one portion of the suit, the insurance company accused the retailer of fraudulently
representing its sampling methodology. Rather than selecting a sample randomly, the
retailer was accused of selecting an unusual number of “high profit” items from the
population in order to increase the average GPF of the sample.
To support its claim of fraud, the insurance company hired a CPA firm to independently
assess the retailer’s true GPF. Through the discovery process, the CPA firm legally
obtained the paper invoices for the entire population of 3,005 items sold the year before
the fire and input the information into a computer. The selling price, profit, profit
margin, and month sold for these 3,005 furniture items are available in the file Firefraud.
Question 2: Suppose we want to know how likely it is to obtain a GPF value that
exceeds the estimated mean GPF of 50.8%. Since the data for all 3,005 items are
available in the file, we can find the actual mean and standard deviation for the 3,005
gross profit margins (you should do so now). Assume that these data come from a
normally distributed population. Find the probability that a randomly selected item will
have a GPF that exceeds 50.8%.
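
As a sketch of the calculation (in Python with SciPy, rather than Minitab), the tail
probability is one call to the normal survival function; mu and sigma below are
placeholders for the mean and standard deviation you computed from the 3,005 profit
margins:

    from scipy.stats import norm

    def prob_gpf_exceeds(mu, sigma, cutoff=50.8):
        # P(X > cutoff) for X ~ Normal(mu, sigma); sf is 1 - cdf.
        # mu and sigma are the mean and standard deviation computed
        # from the 3,005 gross profit margins in Firefraud.mtw.
        return norm.sf(cutoff, loc=mu, scale=sigma)

    # Example call with hypothetical placeholder values:
    # print(prob_gpf_exceeds(mu=48.0, sigma=10.0))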
Question 3: In the previous question you assumed that the data was normally distributed.
One method for verifying this is to look at a histogram. Although you cannot use a
histogram to “prove” that the data is normal, you can use it to “prove” that it isn’t normal.
For example, if the data were skewed strongly to the left in the histogram you would be
able to conclude that it wasn’t normal.
There are other ways of checking normality besides looking at a histogram. As a rule of
thumb, if the data set is from a normally distributed population, then

(Q3 − Q1) / s ≈ 1.34

where Q1 is the first quartile, Q3 is the third quartile, and s is the standard deviation of
the data set. (Q3 − Q1 is called the interquartile range.) Compute (Q3 − Q1)/s for these
data.
What do you conclude?
Be careful – notice that the rule of thumb is “if the population is normally distributed,
then the ratio is close to 1.34”; it does not say “if the ratio is close to 1.34, then the
population is normally distributed.” One may conclude, however, that if the ratio is
nowhere near 1.34, then the population probably isn’t normally distributed.
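
(Where does 1.34 come from? The interquartile range of a standard normal distribution
is about 1.349, and dividing by its standard deviation of 1 leaves 1.349 ≈ 1.34. A short
Python check, assuming NumPy and SciPy, with a helper for the ratio:)

    import numpy as np
    from scipy.stats import norm

    # IQR of the standard normal: Q3 - Q1 = about 1.349
    print(norm.ppf(0.75) - norm.ppf(0.25))

    def iqr_over_s(data):
        # (Q3 - Q1)/s for a data array, e.g. the GPF column
        q1, q3 = np.percentile(data, [25, 75])
        return (q3 - q1) / np.std(data, ddof=1)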
Question 4: Recall that the retailer sampled paper invoices for 253 of 3,005 furniture
items sold in the previous year. The retailer claimed that, in fact, the 253 items were
obtained by selecting an initial random sample of 134 items, and then augmenting this
sample with a second random sample of 119 items. The mean GPFs for the two
subsamples were calculated to be 50.6% and 51.0%, respectively, yielding an overall
average of 50.8%, a value that was deemed unusually high.
Is it likely that two independent, random samples of sizes 134 and 119 will yield mean
GPFs of at least 50.6% and 51.0%, respectively? (This was the question posed to a
statistician retained by the CPA firm.) That is, we want to find the probability
P(x̄₁ ≥ 50.6 and x̄₂ ≥ 51.0), where x̄₁ is the sample mean for the first sample of 134
items and x̄₂ is the sample mean for the second sample of 119 items. Note that since the
samples were obtained independently,

P(x̄₁ ≥ 50.6 and x̄₂ ≥ 51.0) = P(x̄₁ ≥ 50.6) · P(x̄₂ ≥ 51.0)
Use the mean and standard deviation of all 3,005 furniture items sold as the mean and
standard deviation of the population.
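
Under the Central Limit Theorem, each sample mean is approximately normal with
standard deviation σ/√n. A minimal Python sketch of the computation (assuming SciPy;
mu and sigma below are placeholders for the values you computed from all 3,005 items):

    import math
    from scipy.stats import norm

    def prob_both_means(mu, sigma, n1=134, n2=119, m1=50.6, m2=51.0):
        # Each sample mean is approximately Normal(mu, sigma/sqrt(n)),
        # so P(xbar >= m) is the normal survival function at m.
        p1 = norm.sf(m1, loc=mu, scale=sigma / math.sqrt(n1))
        p2 = norm.sf(m2, loc=mu, scale=sigma / math.sqrt(n2))
        return p1 * p2   # independence: multiply the two probabilities

    # Example call with hypothetical placeholder values:
    # print(prob_both_means(mu=48.0, sigma=10.0))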
Question 5: How does the probability obtained in Q4 mesh with the probability obtained
in Q2?
If you were the statistician retained by the CPA firm, what would you recommend?