Lab 3 (M13)

Commands
    set obs n
    gen x=μ+σ*invnorm(uniform())
    edit
    graph varname, histogram normal
    summarize x y, detail
    corr x y
    drop x

Objective: This lab gives you practice generating and summarizing data from a normal distribution, and practice exploring the relationship between two variables.

Activity 1:

Lesson 1. We discussed in class that the data you collect in a study arise from some underlying process. For example, the process that gives rise to an individual’s height involves summing the effects of multiple genes and environmental factors. If you measure the heights of 100 people and plot a histogram of your data, the histogram should look roughly like a normal curve. In fact, a variable that arises from a “summing of things” (many small, independent effects added together) will look approximately normally distributed when plotted.

Instead of actually going out and measuring the heights of 100 people, let’s generate 100 height measurements, assuming that height follows a normal curve with μ = 64.5 and σ = 2.5.

Type
    set obs 100
        [set the number of observations to 100]
    gen height=64.5+2.5*invnorm(uniform())
        [generate numbers from a normal distribution with mean = 64.5 and standard deviation = 2.5 and put them in a variable named height]

To look at the values of the 100 observations:

Type
    edit
        [opens the editor that displays your data]

a) Make a histogram of the 100 observations. Do the data roughly follow a normal curve?

b) Now superimpose the idealized normal curve on the histogram of your data.

Type
    graph height, histogram normal

Do the data roughly follow that normal curve? Why doesn’t your histogram match the normal curve perfectly? In general, can you conclude that there is a relationship between the underlying process that gives rise to your data and the shape of the distribution of the data?

c) Compute the mean x̄ and standard deviation s of the 100 values you obtained and record them.
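Parts a) through c) can also be run from a short do-file. The sketch below is one possible version, not the only way to do it. It uses the same commands listed at the top of this lab. The clear and set seed lines (with an arbitrary seed value) are assumptions added here so that the run starts from an empty dataset and can be reproduced.

```stata
* Sketch only: start from an empty dataset (assumption: no unsaved data in memory)
clear
* Assumption: arbitrary seed, used only so the same 100 values come back on a rerun
set seed 12345
* 100 draws from a normal distribution with mean 64.5 and sd 2.5
set obs 100
gen height = 64.5 + 2.5*invnorm(uniform())
* Mean, sd, and percentiles of the 100 values
summarize height, detail
* summarize leaves its results in r(); show the two values asked for in part c)
display "sample mean = " r(mean)
display "sample sd   = " r(sd)
```

Because the seed is set once at the top, rerunning the whole file reproduces the same sample. Removing that line and generating again in the same session gives a new sample.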
Lesson 2: Histograms of data that look roughly normally distributed can be idealized by a normal curve with mean μ and standard deviation σ. Often, however, you will not know the true values of μ and σ. Because the normal curve is an idealization of your data, the mean (μ) and sd (σ) of the normal curve can be estimated by your sample mean x̄ and sd s. Of course, you hope that x̄ and s are good estimates of the mean (μ) and sd (σ) of the population from which your sample data came. Let’s find out.

d) Look at the sample mean and sd of the first 100 values that you generated. Are x̄ and s close in value to the μ and σ of the distribution from which the observations were drawn?

e) Maybe your first set of results was a fluke. Let’s see what happens if we repeat the process of generating 100 observations from that normal distribution. To accommodate the repetition, you have a choice: keep the variable named “height” and replace the first set of 100 observations with a new set (this option deletes the first set), OR create an entirely new variable with a new name to hold the new set of 100 observations.

To just keep using the variable named “height”:

Type
    drop height
    set obs 100
    gen height=64.5+2.5*invnorm(uniform())

To create a new variable:

Type
    set obs 100
    gen varname=64.5+2.5*invnorm(uniform())
        [you must specify the varname that you want]

Repeat 15 times the process of generating 100 observations from the N(64.5, 2.5) distribution and recording x̄ and s. With each repetition, does it look like x̄ and s are close in value to the μ and σ of the distribution from which the observations were drawn?

f) Make a histogram of the 15 values of x̄ and another histogram of the 15 values of s. Briefly describe each of these distributions. Are they symmetric or skewed? Are they roughly normal? Where are their centers?
g) Is there evidence that the sample mean and standard deviation are good estimates of the idealized mean and standard deviation? Explain why or why not.

Before you leave, hand in parts b, c, d, f, g to your TA.

Activity 2: Example 2.18 in Moore & McCabe

    Femur    Humerus
      38        41
      56        63
      59        70
      64        72
      74        84

These data are the lengths of two bones in five fossil specimens of the extinct beast Archaeopteryx.

h) Graph a scatterplot for these data. Describe the relationship between the length of the femur and the length of the humerus that you see from the scatterplot. Is it linear? Does it appear that the lengths of the femur and humerus are correlated or not? Do you expect the value of the correlation coefficient to be large or small, positive or negative?

i) Now compute the correlation coefficient by hand.

j) Now use STATA to compute the correlation between femur and humerus. Compare the computer output with your hand calculation; they should be the same.

To compute the correlation between two variables:

Type
    corr x y
        [replace x and y with the names of your two variables]

k) Now edit your datafile and add the following pair of observations: femur = 72 and humerus = 20. Graph the scatterplot. What do you see? Will this change the value of the correlation coefficient? Compute the correlation coefficient for these data. What is it?

Before you leave, hand in parts h, i, j, k to your TA.
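For parts j) and k), the data can be typed in with the editor or entered from a do-file. The sketch below takes the do-file route as one possible approach. The input, set obs, and replace commands are standard Stata commands, but input and replace are not in this lab's command list, so treat their use here as an assumption about how you might choose to enter the data. The variable names femur and humerus are taken from the table above.

```stata
* Sketch only: enter the five fossil measurements (assumption: nothing unsaved in memory)
clear
input femur humerus
38 41
56 63
59 70
64 72
74 84
end
* Part j): correlation for the original five specimens
corr femur humerus
* Part k): add a sixth observation with femur = 72 and humerus = 20
set obs 6
replace femur = 72 in 6
replace humerus = 20 in 6
* Correlation with the extra point included
corr femur humerus
```

Running the two corr lines one after the other puts the before and after values next to each other for the comparison in part k).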