1180: Lab 9
James Moore
March 26th, 2012

1 Diagnosing Bloaty Head

We’ve already discussed how to make a differential diagnosis using Bayes’ theorem. In the original example, every test could return only two results: positive or negative. Let’s look at the key diagnostic tool for analyzing bloaty head (BH): head circumference. The mean head circumference is 55cm in healthy individuals versus 56.6cm in BH patients. Both populations are normally distributed with a standard deviation of 1.2cm. Suppose 75% of individuals are normal and 25% have BH.

We can generate a joint probability distribution to represent these facts. One random variable represents whether or not a patient has BH and is either 0 or 1. The other random variable is head circumference. We can’t list the probability for every value of head circumference because it is a continuous random variable. Instead we will plot the probability density.

First, plot the distribution for the ‘normal’ group. Note that we multiply by .75 to reflect that 75% of the population is normal.

    head_circlist=seq(51,61,.1)
    plot(head_circlist,.75*dnorm(head_circlist,mean=55,sd=1.2),
         xlab='Head Circumference',type='l')

Then we add on the distribution for the ‘Bloaty Head’ group.

    lines(head_circlist,.25*dnorm(head_circlist,mean=56.6,sd=1.2))

Save this plot.

Next, we calculate the marginal distribution for head circumference. We do this simply by adding together the distributions from the two groups.

    headcirc_mpf=.75*dnorm(head_circlist,mean=55,sd=1.2)
                 +.25*dnorm(head_circlist,mean=56.6,sd=1.2)

Now we can use this marginal distribution to calculate the conditional probability that someone has bloaty head. Note that the formula below is exactly Bayes’ theorem. It is also the same as taking a single column (or row) of a joint probability table and dividing by its total.

    plot(head_circlist,.25*dnorm(head_circlist,mean=56.6,sd=1.2)/headcirc_mpf,
         xlab="Head Circumference",ylab="Probability of Bloaty Head")

We want to know when we are 95% sure that someone has bloaty head.
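For reference, the conditional probability plotted above can be written out explicitly. The notation here is introduced for illustration and is not part of the original lab: x is the head circumference and \varphi(x; \mu, \sigma) is the normal density that dnorm evaluates.

```latex
P(\mathrm{BH}=1 \mid x)
  = \frac{0.25\,\varphi(x;\,56.6,\,1.2)}
         {0.75\,\varphi(x;\,55,\,1.2) + 0.25\,\varphi(x;\,56.6,\,1.2)}
```

The denominator is the marginal density stored in headcirc_mpf. Being 95% sure corresponds to the values of x for which this ratio is at least 0.95.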
We put a horizontal line at .95 to help read the graph.

    lines(head_circlist,.95+0*head_circlist)

Save this plot and record what circumference is needed to be 95% sure.

We can use the pnorm function (which is just the cdf) to figure out what percentage of BH positive people have an undetectable disease. Suppose we determine the threshold to be 58. It’s not that, but suppose that it were. The call below evaluates the cumulative distribution function of the BH group at 58. In other words, it tells us what percentage of BH patients have a head circumference below 58.

    pnorm(58,mean=56.6,sd=1.2)
    [1] 0.8783275

Record what percentage of BH positive people won’t be detectable with 95% confidence.

2 Adding a test for Height

The test in the previous section kind of sucked. This is because of the large amount of variability in head circumference. Much of this variability can actually be explained by height. As we don’t yet have much (any) experience with multivariate statistics, we’ll do all of this with simulation.

First, we assign patients to be either BH positive or not. Their heights are then drawn from a predetermined distribution. The head circumference depends on both height and whether the head is bloated. There is also some randomness in that measurement. We collect all variables into a data frame for convenience.

    N=10000
    BH_pos=rbinom(N,size=1,prob=.25)
    height=rnorm(N,mean=165,sd=12)
    hcirc=rnorm(N,mean=.1*height+38.5+1.6*BH_pos,sd=.3)
    Patients=data.frame(BH_pos,height,hcirc)

We have used the summary command before. Now we can see that it actually gives us an estimate of the marginal probability distributions for all of our random variables.

    > summary(Patients)
         BH_pos           height          hcirc
     Min.   :0.0000   Min.   :122.2   Min.   :50.61
     1st Qu.:0.0000   1st Qu.:156.9   1st Qu.:54.43
     Median :0.0000   Median :165.1   Median :55.38
     Mean   :0.2484   Mean   :165.1   Mean   :55.41
     3rd Qu.:0.0000   3rd Qu.:173.4   3rd Qu.:56.36
     Max.   :1.0000   Max.   :222.9   Max.   :61.20

This is not very useful by itself, but it’s good practice to always do this with a data frame.
In addition, we should always take a look at what we’ve actually computed with the head command.

    head(Patients)
      BH_pos   height    hcirc
    1      0 157.3137 54.70650
    2      1 175.2231 57.99186
    3      1 146.6770 54.41488
    4      0 162.3073 55.03319
    5      0 151.8542 53.38065
    6      1 165.7490 57.28193

Another nice thing about data frames is that R can ‘intuitively understand’ them. So we can give the basic plot command, with only one extra option to make the plotted points a bit smaller.

    plot(Patients,pch='.')

Save the resulting plot. Explain what is plotted in each box. Answer this question later if it’s unclear immediately.

Although this plot is a good start, we’d like to visualize the relationship between all three variables. First we’ll plot the joint distribution of height and head circumference in the normies.

    plot(height~hcirc,Patients,subset=BH_pos==0,pch='.')

Add the joint distribution of bloaty headed patients to this plot as red dots. Then save the plot.

3 More Questions

1. Create histograms to show the marginal distributions of height and head circumference. Save both plots. (Hint: use the function hist)
2. To make a diagnosis, which conditional probability distribution will we need to calculate?
3. Use the cov and cor R functions on your data frame. Record the results. Analyze what each result in the correlation table means. Why is the table symmetric?
4. How does the addition of the height measurement help? Justify your answer using the joint probability distributions with and without the extra measurement.
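The claim in Section 2 that height explains much of the variability in head circumference can be checked numerically. The sketch below is not part of the original lab. It reruns the simulation with an arbitrary fixed seed (an assumption, used only for reproducibility) and introduces the names sd_theory, group_mean, group_sd, resid and resid_sd for illustration. It compares the within-group spread of head circumference before and after removing the height contribution, using the slope .1 taken from the simulation code.

```r
# Sketch only: rerun the Section 2 simulation and compare spreads.
set.seed(1)  # assumption: arbitrary seed, fixed for reproducibility

N <- 10000
BH_pos <- rbinom(N, size = 1, prob = 0.25)
height <- rnorm(N, mean = 165, sd = 12)
hcirc <- rnorm(N, mean = 0.1 * height + 38.5 + 1.6 * BH_pos, sd = 0.3)
Patients <- data.frame(BH_pos, height, hcirc)

# Within-group sd implied by the simulation parameters:
# height contributes 0.1 * 12, measurement noise contributes 0.3.
sd_theory <- sqrt((0.1 * 12)^2 + 0.3^2)

# Empirical mean and sd of head circumference in each group
# (groups are labelled "0" and "1").
group_mean <- tapply(Patients$hcirc, Patients$BH_pos, mean)
group_sd <- tapply(Patients$hcirc, Patients$BH_pos, sd)

# Remove the height contribution (slope 0.1 is known from the simulation)
# and measure the spread that is left in each group.
resid <- Patients$hcirc - 0.1 * Patients$height
resid_sd <- tapply(resid, Patients$BH_pos, sd)

print(sd_theory)
print(group_mean)
print(group_sd)
print(resid_sd)
```

Under the simulation’s parameters, the expected head circumference is .1*165+38.5 = 55 for normal patients and 56.6 for BH patients, and sd_theory is sqrt(1.53), roughly 1.24, which is close to the 1.2 used in Section 1. The residual spread should be near the measurement noise of .3, which is what makes the two groups easier to separate once height is known.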