Lab 3 (M13)
Commands
set obs n
gen x=+*invnorm(uniform())
edit
graph varname, histogram normal
summarize x y, detail
corr x y
drop x
Objective: This lab will give you practice generating and summarizing data from a normal distribution; it also
will give you practice exploring the relationship between two variables.
Activity 1:
Lesson 1:
We discussed in class that the data that you collect in a study arise from some kind of underlying process. For
example, the underlying process that gives rise to an individual’s height involves a summing of the effects of
multiple genes and environmental factors. If you collect measurements on the heights of 100 people and plot a
histogram of your data, the histogram should look roughly like a normal curve. In fact, a variable that arises
from a “summing of things” (many small, independent effects added together) tends to look approximately
normally distributed when plotted.
Instead of actually going out and measuring the heights of 100 people, let’s generate 100 height
measurements, assuming that height follows a normal curve with μ = 64.5 and σ = 2.5.
Type
set obs 100
[sets the number of observations to 100]
gen height=64.5+2.5*invnorm(uniform())
[generates numbers from a normal distribution with mean = 64.5 and standard deviation = 2.5 and puts them in a
variable named height]
To look at the values of the 100 observations:
Type
edit
[opens the Data Editor, which shows your data]
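The generating step above can also be sketched in a form that gives the same numbers every time it is run. This is only a sketch: it assumes a recent Stata release in which the rnormal() function is available, and the seed value 12345 is an arbitrary choice, not part of the lab.

```stata
* Start from an empty dataset
clear
* Arbitrary seed so that the same 100 draws can be reproduced later
set seed 12345
set obs 100
* Draw from a normal distribution with mean 64.5 and standard deviation 2.5
generate height = rnormal(64.5, 2.5)
```

Without set seed, each run produces a different set of 100 values, which is what parts e) through g) rely on.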
a) Make a histogram of the 100 observations. Do the data roughly follow a normal curve?
b) Now superimpose the idealized normal curve on the histogram of your data.
Type
graph height, histogram normal
Do the data roughly follow that normal curve? Why doesn’t your histogram match the normal curve
perfectly? In general, can you conclude that there is a relationship between the underlying process that gives
rise to your data and the shape of the distribution of the data?
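In more recent Stata releases, graphs of this kind are usually drawn with the histogram command rather than graph with a histogram option. A minimal sketch for parts a) and b), assuming such a release and the variable height generated above:

```stata
* Part a: histogram of the 100 observations
histogram height
* Part b: the same histogram with a normal density overlaid
histogram height, normal
```

With histogram, the overlaid curve uses the mean and standard deviation of the sample itself, not the values 64.5 and 2.5 used to generate the data.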
c) Compute the sample mean x̄ and standard deviation s of the 100 values you obtained and record them.
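One way to obtain these two numbers is sketched below, assuming the variable is named height as above. After summarize runs, Stata keeps the results in r(mean) and r(sd).

```stata
* Prints the number of observations, mean, standard deviation, minimum, and maximum
summarize height
* The stored results can be displayed on their own
display "sample mean = " r(mean)
display "sample sd   = " r(sd)
```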
Lesson 2:
Histograms of data that look somewhat normally distributed can be idealized by a normal curve with mean μ
and standard deviation σ. Often, however, you will not know the true values of μ and σ. Because the normal
curve is an idealization of your data, the mean (μ) and sd (σ) of the normal curve can be estimated by your
sample mean x̄ and sd s. Of course, you hope that the sample mean x̄ and sd s are good estimates of the
mean (μ) and sd (σ) of the population from which your sample data came. Let’s find out.
d) Look at the sample mean and sd of the first 100 values that you generated. Are x̄ and s close in value to
the μ and σ of the distribution from which the observations were drawn?
e) Maybe your first set of results was a fluke. Let’s see what happens if we repeat the process of generating
100 observations from that normal distribution.
To accommodate the repetition, you have a choice. Either keep the variable named “height” and replace the first
set of 100 observations with a new set (this option deletes the first set of observations), OR create an
entirely new variable with a new name to hold the new set of 100 observations.
To just keep using the variable named “height”:
Type
drop height
set obs 100
gen height=64.5+2.5*invnorm(uniform())
To create a new variable:
Type
set obs 100
gen varname=64.5+2.5*invnorm(uniform())
[you must specify the varname that you want]
Repeat 15 times the process of generating 100 observations from the N(64.5, 2.5) distribution and recording x̄
and s. With each repetition, does it look like x̄ and s are close in value to the μ and σ of the distribution from
which the observations were drawn?
f) Make a histogram of the 15 values of x̄ and another histogram of the 15 values of s. Briefly describe each of
these distributions. Are they symmetric or skewed? Are they roughly normal? Where are their centers?
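If you would rather not repeat the commands by hand, the 15 repetitions can be sketched as a loop that stores each sample mean and standard deviation in its own variable. This is a sketch under assumptions, best run from a do-file: the variable names xbar, s, and h and the graph names hist_xbar and hist_s are hypothetical, and the histogram command assumes a more recent Stata release.

```stata
* Sketch only: xbar, s, and h are hypothetical variable names
clear
set obs 100
* Empty variables to hold the 15 sample means and standard deviations
generate xbar = .
generate s = .
forvalues i = 1/15 {
    * Draw a fresh sample of 100 heights
    generate h = 64.5 + 2.5*invnorm(uniform())
    quietly summarize h
    * Save the results before any other command runs
    local m = r(mean)
    local sd = r(sd)
    quietly replace xbar = `m' in `i'
    quietly replace s = `sd' in `i'
    drop h
}
* The 15 recorded pairs sit in the first 15 rows
list xbar s in 1/15
histogram xbar, name(hist_xbar, replace)
histogram s, name(hist_s, replace)
```

Only the first 15 rows of xbar and s are filled in; the remaining rows stay missing and are ignored by histogram.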
g) Is there evidence that the sample mean and standard deviation are good estimates of the idealized mean
and standard deviation? Explain why or why not.
Before you leave, hand in parts b, c, d, f, and g to your TA.
Activity 2:
Example 2.18 in Moore & McCabe
Femur    Humerus
38       41
56       63
59       70
64       72
74       84
These data are the lengths of two bones in five fossil specimens of the extinct beast Archaeopteryx.
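Instead of typing the values into the Data Editor, the five pairs can be entered with the input command. A minimal sketch, assuming you start from an empty dataset and name the variables femur and humerus:

```stata
* Remove any data currently in memory (for example, the heights from Activity 1)
clear
input femur humerus
38 41
56 63
59 70
64 72
74 84
end
* Check that the five pairs were entered correctly
list
```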
h) Graph a scatterplot for these data. Describe the relationship between the length of the femur and the
length of the humerus that you see from the scatterplot. Is it linear? Does it appear that the lengths of the
femur and humerus are correlated or not? Do you expect the value of the correlation coefficient to be large or
small, positive or negative?
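A scatterplot for part h) can be sketched as follows, assuming a more recent Stata release and variables named femur and humerus. The first variable goes on the vertical axis and the second on the horizontal axis; which bone goes on which axis is a choice made here for illustration, not something the lab specifies.

```stata
* humerus on the vertical axis, femur on the horizontal axis
scatter humerus femur
```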
i) Now compute the correlation coefficient by hand.
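For the hand calculation, one common form of the sample correlation coefficient is given below. Here x and y stand for the femur and humerus lengths, x̄ and ȳ are their sample means, s_x and s_y are their sample standard deviations, and n = 5 is the number of specimens. Check it against the definition used in your textbook.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

In words: standardize each femur length and each humerus length, multiply the two standardized values for each specimen, add up the five products, and divide by n - 1 = 4.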
j) Now use STATA to compute the correlation between femur and humerus. Compare the computer output
with your hand calculation – they should be the same.
To compute the correlation between two variables:
Type
corr x y
k) Now edit your datafile and add the following pair of observations: femur = 72 and humerus = 20. Graph the
scatterplot. What do you see? Will this change the value of the correlation coefficient? Compute the
correlation coefficient for these data. What is it?
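As an alternative to typing the new pair into the Data Editor, part k) can be sketched with commands. This assumes that femur and humerus already hold the five original pairs and that a more recent Stata release is in use for scatter.

```stata
* Add one empty observation (the dataset grows from 5 to 6 rows)
set obs 6
* Fill in the new pair in row 6
replace femur = 72 in 6
replace humerus = 20 in 6
* Redraw the scatterplot and recompute the correlation with all six pairs
scatter humerus femur
corr femur humerus
```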
Before you leave, hand in parts h, i, j, and k to your TA.