8 - Continuous Random Variables and the Normal Distribution

Motivating Problem: Diagnosing Spina Bifida

The procedure of amniocentesis involves drawing a sample of the amniotic fluid that surrounds an unborn child in its mother’s womb. A high concentration of alpha fetoprotein in this fluid can indicate that the child has spina bifida, a condition which can be very serious. However, the concentration of alpha fetoprotein also tends to increase with the size of the foetus. Amniocentesis is not without risk, because it results in miscarriage for 1% of those who have it, so preliminary tests involve measuring the level of alpha fetoprotein in the mother’s urine.

For mothers with normal foetuses the mean level of alpha fetoprotein is 15.73 moles/liter with a standard deviation of 0.72 moles/liter. For mothers carrying foetuses with spina bifida the mean is 23.05 and the standard deviation is 4.08. In both groups the level of alpha fetoprotein appears to be approximately Normally distributed.

To operate a diagnostic test for spina bifida, medical professionals must set a threshold concentration of alpha fetoprotein, T say. If the alpha fetoprotein level is below T, the foetus is diagnosed as not having spina bifida, whereas if the level is above T further testing is required.

[Figure: Normal curves for the two groups, with the means 15.73 and 23.05 marked]

If T were set at 17.80 moles/liter:

What is the probability that a foetus with spina bifida is correctly diagnosed?
What is the probability that a foetus not suffering from spina bifida is correctly diagnosed?

If they wanted to ensure that 99% of foetuses with spina bifida were correctly diagnosed, at what level should they set T? What are the implications of setting T at this level?

Our goal is to be able to answer these questions assuming that the alpha fetoprotein levels for both groups of foetuses are approximately Normally distributed, i.e. follow bell-shaped curves as shown in the diagram above.
CONTINUOUS RANDOM VARIABLES

If a random variable, X, can take any value in some interval of the real line it is called a __________________________

Examples:

THE STANDARDIZED HISTOGRAM or DENSITY SCALING

Example: Dietary Carbohydrate: The average daily intake of carbohydrate (g/day) in the diet was found for a sample of n = 5929 people. The histogram of the data given in Graph (a) shows the carbohydrate intake. From this we see that:
*
*
*

[Figure: four panels, each with Carbohydrate (g/day) from 0 to 800 on the horizontal axis and the density scale on the vertical axis. (a) Standardized histogram. (b) Area between a = 225 and b = 375 shaded; shaded area = .483 (corresponds to 48.3% of observations). (c) With approximating curve. (d) Area between a = 225 and b = 375 shaded; shaded area = .486 (cf. area = .483 for histogram).]

Note: Histograms are usually drawn with the height of the rectangle for the ith interval being the frequency (representing the count) or the relative frequency (representing the proportion). The standardized histogram adjusts the height of the rectangle to ______________________________________________________________________, i.e. the area of the ith rectangle tells us what proportion of the data lie in the ith class interval.

For a standardized histogram:
The vertical scale is relative frequency/interval width; this is called the density scale.
Total area under the histogram = ______
The proportion of the data between a and b is __________________________________________

Carbohydrate Example (cont’d)

The proportion of people with carbohydrate intakes between 225 and 375 g/day is the shaded area in the histogram in Graph (b) (= 0.483 or 48.3%). Graph (c) shows an approximating smooth curve on the standardized histogram, and in Graph (d) the area under this curve from 225 to 375 is shaded.
This area is calculated to be 0.486 and is very close to the proportion of people who had carbohydrate intakes of between 225 and 375 g/day.

Example 2: Cell Radii of Malignant Breast Tumors (see Assignment #1)

To add a density scale axis to a histogram in JMP, select Histogram Options > Density Axis. The histogram on the left is for cell radii of malignant tumor cells from the fine needle aspirations in the breast cancer study on your first assignment. Here the class intervals all have width 1, so the area of any bar is simply its height on the density axis; hence the heights of the bars here represent empirical probabilities. For example, if we let X = radius of a randomly selected malignant tumor cell, we estimate that P(14 < X < 15) ≈ .10, or a 10% chance.

Example 3: Spina Bifida (SpinaBifida.JMP in Read Only folder)

To obtain a standardized histogram in JMP select Histogram Options > Density Axis. This is a histogram of the alpha fetoprotein levels of women who are carrying a foetus with spina bifida. The shaded area in the histogram above is 2.5 (width of the interval) times .10 (height of the bar in the standardized scale), which is .25. This says that the estimated probability that the alpha fetoprotein level found in the urine of a mother carrying a foetus with spina bifida lies between 22.5 moles/liter and 25 moles/liter is .25, or a 25% chance. If we define X = alpha fetoprotein level found in the urine of mothers carrying a foetus with spina bifida, we can say the following:

P(22.5 ≤ X ≤ 25) ≈ .25, or a 25% chance

Note: Examination of the spreadsheet confirms that the number of observations highlighted is exactly 25 of the 100 observations.

DENSITY/SMOOTH CURVES

Take a standardized histogram, decrease the width of the class intervals and increase the number of observations. Then the top of the histogram tends to a smooth curve.
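The density-scale calculation used in the examples above (bar height = relative frequency / interval width, so bar area = proportion of observations) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name standardized_histogram is hypothetical, and the data are simulated rather than taken from SpinaBifida.JMP, so the numbers will not match the JMP output.

```python
import random

def standardized_histogram(data, lower, width, n_bins):
    """Return (left_edge, density) pairs for class intervals [left, left + width).

    density = relative frequency / interval width, so that
    width * density is the proportion of the data in that interval.
    Values outside the bins are ignored, so choose bins that cover the data.
    """
    counts = [0] * n_bins
    for x in data:
        k = int((x - lower) // width)  # index of the interval containing x
        if 0 <= k < n_bins:
            counts[k] += 1
    n = len(data)
    return [(lower + k * width, counts[k] / n / width) for k in range(n_bins)]

# Assumption: 100 simulated AFP-like values, NOT the course data set.
random.seed(1)
afp = [random.gauss(23.05, 4.08) for _ in range(100)]

width = 2.5
bars = standardized_histogram(afp, lower=0.0, width=width, n_bins=24)

# Area of a bar = width * height on the density scale.
total_area = sum(width * d for _, d in bars)
area_22_5_to_25 = sum(width * d for left, d in bars if left == 22.5)

print("total area under the histogram:", total_area)
print("estimated P(22.5 <= X < 25):", area_22_5_to_25)
```

With real data, the same arithmetic reproduces the calculation in Example 3: a bar of width 2.5 and density-scale height .10 has area 2.5 × .10 = .25.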
[Figure: standardized histograms for n = 100, n = 500, n = 10000, n = 100000 and n = 1,000,000; vertical axis: Density]

The limiting smooth curve can be described using a function called the probability density function. To obtain a sample-based estimate of this function in JMP select Analyze > Distribution > Fit Distribution > Smooth Curve. As n increases, as shown above, the histogram itself “converges” to the probability density function.

PROPERTIES OF THE PROBABILITY DENSITY FUNCTION (p.d.f.), f(x)

1. f(x) ≥ 0 (i.e. the p.d.f. curve stays on or above the x-axis)
2. P(a ≤ X ≤ b) = ________
3. Area under the p.d.f. curve = ________

ENDPOINTS OF INTERVALS

For a continuous random variable, X, endpoints of intervals are ___________________

P(a ≤ X ≤ b) = ________

(Inclusion or exclusion of the endpoints will not change the area.)

THE NORMAL DISTRIBUTION

Examples: Alpha fetoprotein levels of mothers carrying a foetus with spina bifida.

A limiting distribution that is a smooth, bell-shaped, symmetric curve is called the Normal p.d.f. curve or just the Normal curve.

[Figure: Normal curve, symmetric about the mean, with 50% of the area on each side]

If a random variable, X, has a Normal distribution with a mean μ and a standard deviation σ we write: ________

The Normal distribution is important because:
it fits a lot of data reasonably well;
it can be used to approximate other distributions (e.g. binomial);
it is important in statistical inference (see later work).

The shape is determined solely by μ and σ: the population mean μ controls where the Normal curve is centered, and the population standard deviation σ controls the spread about μ.

Example: Alpha fetoprotein levels found in the urine of mothers carrying a foetus with spina bifida. Let X = alpha fetoprotein level in the urine of a mother carrying a foetus with spina bifida. The mean AFP level is _________ moles/liter and the standard deviation is __________ moles/liter.

EMPIRICAL RULE

Approximately _______ % of the mothers in this population will have AFP levels within 1 standard deviation of the mean, i.e.
we estimate that approximately ________% of this population of mothers will have AFP levels:
between _______________ and ________________ = between _______________ and ________________.

Approximately _______ % of the mothers in this population will have AFP levels within 2 standard deviations of the mean, i.e. we estimate that approximately ________% of this population of mothers will have AFP levels:
between _____________ and ______________ = between _____________ and ______________.

Approximately _______ % of the mothers in this population will have AFP levels within 3 standard deviations of the mean, i.e. we estimate that approximately ________% of this population of mothers will have AFP levels:
between _____________ and ________________ = between ______________ and ________________.

For the Normal Distribution:

A random observation has approximately:
68% chance of falling within 1σ of μ;
95% chance of falling within 2σ of μ;
99.7% chance of falling within 3σ of μ.

or

In a Normal distribution, approximately:
68% of observations are within 1σ of μ;
95% of observations are within 2σ of μ;
99.7% of observations are within 3σ of μ.

OBTAINING OTHER PROBABILITIES ASSOCIATED WITH A NORMAL DISTRIBUTION

Normal distribution probabilities can be obtained from all statistical packages by giving the mean and standard deviation of the distribution (see Normal Probability Calculator in Tutorials section of website). Most tables and computer packages calculate the value of P(X ≤ x), i.e. cumulative or lower-tail probabilities.

[Figure: Normal curve with the area to the left of x shaded] Area = P(X ≤ x) or Area = P(X < x)

Standardization and the Standard Normal Distribution

Fact: If X ~ N(μ, σ) and we define a new random variable Z = (X − μ)/σ, then Z ~ N(0, 1), i.e. we create a new random variable Z where the observed values of Z are the z-scores for the random variable X. The process of converting a random variable X to z-scores is called standardization.

Basic method for obtaining probabilities

1.
Sketch a Normal curve, marking the mean and the value(s) of interest.
2. Shade the area under the curve corresponding to the required probability.
3. Convert all values in the original scale to their corresponding z-scores.
4. Obtain the desired probability from the lower-tail areas provided by the standard normal table found inside the front cover of the text.

Find the following standard normal probabilities using the Standard Normal Table:
a) P(Z < .67)
b) P(Z > 2.25)
c) P(Z > 3.00)
e) P(Z < -2.33)
f) P(-1.96 < Z < 1.96)
h) Find z so that P(Z < z) = .90, i.e. what is the 90th percentile of the standard normal distribution?

Spina Bifida Example (continued)

X = AFP level of a randomly selected mother carrying a foetus with spina bifida. Let’s assume that X ~ Normal(μ = 23.05, σ = 4.08), using the sample mean and sample standard deviation. Find the following:
a) P(X < 15.00) =
b) P(X < 27.00)
c) P(X > 17.0)
d) Find the 90th percentile.
e) Find the 25th percentile.

Original Problem: Spina bifida

[Figure: Normal curves for the two groups, with the means 15.73 and 23.05 marked]

Recall: For normal foetuses μ = 15.73 and σ = 0.72, and for foetuses with spina bifida μ = 23.05 and σ = 4.08. Assume the threshold for detecting spina bifida is set at 17.8. (A foetus would be diagnosed as not having spina bifida if the alpha fetoprotein level is below 17.8.)

a) What is the probability that a foetus not suffering from spina bifida is correctly diagnosed?
b) What is the probability that a foetus with spina bifida is correctly diagnosed?
c) If they wanted to ensure that 99% of foetuses with spina bifida were correctly diagnosed, at what level should they set T?
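Answers obtained from the table can be checked by computer. The sketch below follows steps 3 and 4 of the basic method (standardize, then take a lower-tail area) using Python's standard-library statistics.NormalDist in place of the printed table. The helper names lower_tail and percentile are hypothetical, and the code assumes that a spina bifida case counts as "correctly diagnosed" when its level is above T (so that further testing is triggered), which is one reading of the problem statement.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)  # standard Normal distribution, Z ~ N(0, 1)

def lower_tail(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma): convert x to a z-score, then look up the lower-tail area."""
    z = (x - mu) / sigma
    return Z.cdf(z)

def percentile(p, mu, sigma):
    """Value x with P(X < x) = p: find the z-score for p, then convert back to the original scale."""
    return mu + Z.inv_cdf(p) * sigma

# Means and standard deviations quoted in the notes.
normal_mu, normal_sigma = 15.73, 0.72   # normal foetuses
sb_mu, sb_sigma = 23.05, 4.08           # foetuses with spina bifida
T = 17.80                               # threshold from the problem

# a) Normal foetus correctly diagnosed: level falls below T.
p_normal_correct = lower_tail(T, normal_mu, normal_sigma)

# b) Spina bifida correctly diagnosed (assumed: level falls above T).
p_sb_correct = 1 - lower_tail(T, sb_mu, sb_sigma)

# c) Threshold that leaves only 1% of spina bifida cases below it.
T_99 = percentile(0.01, sb_mu, sb_sigma)

# Implication: proportion of normal foetuses still below the new threshold.
p_normal_correct_at_T99 = lower_tail(T_99, normal_mu, normal_sigma)

print("a) P(normal foetus below T)        =", round(p_normal_correct, 4))
print("b) P(spina bifida foetus above T)  =", round(p_sb_correct, 4))
print("c) T for 99% detection             =", round(T_99, 2))
print("   P(normal foetus below that T)   =", round(p_normal_correct_at_T99, 4))
```

The same two helpers cover the other exercises: an upper-tail probability is 1 minus the lower tail, and the probability of an interval is the difference of two lower tails, e.g. Z.cdf(1.96) - Z.cdf(-1.96) for P(-1.96 < Z < 1.96).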