Biostatistics and Epidemiology, Midterm Review
New York Medical College
By: Jasmine Nirody

This review is meant to cover lectures from the first half of the Biostatistics course. The sections are not organised by lecture, but rather by topic. If you have any comments or corrections, please email them to me at jnirody@gmail.com!

1 Introduction to Statistics

This section discusses the definition of biostatistics (the application of statistics to a wide range of topics in the biological and medical sciences) and how it can be used in our medical and research careers.

1.1 Types of Measurements

Data can be highly variable due to several factors, including genetics, age, sex, race, economic background, measurement techniques, and many others. For this reason, we need ways to classify measurements.

1.1.1 Categorical Data

Categorical data places observations into more or less "arbitrary" groups—meaning that the way the groups are ordered or presented doesn't affect anything in the presentation. This is a qualitative measure, and usually is not numerical (though sometimes numbers can be assigned; we will show an example).

Examples: Sex (Male/Female), Blood Type (A, B, AB, O), Disease Status (Y/N, 1/0). [Note in the last example that, though numbers (1, 0) may be used to denote categorical data, the number choices are arbitrary. That is, you can choose to denote positive or negative disease status by any symbol—1, 0, 293, or so on.]

When constructing a categorical variable, note that categories should be exhaustive, that is, there should be a category for every possibility (this often means including an "Other" category), and mutually exclusive, which means that every observation fits into one, and ONLY one, category.

Often it is possible to "convert" ordinal or quantitative data into categorical data. For example, consider a situation where you have weight data, which is continuous quantitative. By assigning ranges to be considered Underweight, Normal, Overweight, and Obese, we now have categorical data. As a general rule, more informative data can be converted into less informative data, but not vice versa.

1.1.2 Ordinal Data

Ordinal data assigns data into categories which can be ranked, though only the order, and not the 'distance', between categories is considered. A good general rule is that if there is no "true zero", the data is ordinal.

Examples: Opinion ranked on a scale of 1-5. Here, we could have used any (numeric or non-numeric) scale with 5 categories, say [Poor, Fair, Good, Very Good, Amazing], or 1-10 (using only even numbers), or the set [-4, 23, 9, 30, 3993]. The specific numbers used don't matter in ordinal data, only that they exist in some prespecified order.

1.1.3 Quantitative Data

Quantitative data is data you can put onto a numeric scale, where "zero" has a real meaning. There are two types of quantitative data: continuous (which can take on any value within a certain range) and discrete (which can only take certain values within a certain range). A good way to tell the difference is to pick any two possible values the data could have, and then pick any random value between those two numbers. Can the data take on that value? If no, it's discrete. If yes, it might be continuous, but you need to check whether that works for any value in that range, or if you just picked luckily!
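If you like to see these ideas in code, here is a minimal Python sketch of converting continuous (quantitative) weight data into exhaustive, mutually exclusive categories. The cutoff values are made up purely for illustration (real weight-status categories are defined on BMI, not raw weight):

def weight_category(weight_kg):
    """Map a continuous weight measurement onto exhaustive, mutually exclusive categories."""
    if weight_kg < 55:
        return "Underweight"     # hypothetical cutoffs, for illustration only
    elif weight_kg < 80:
        return "Normal"
    elif weight_kg < 95:
        return "Overweight"
    else:
        return "Obese"           # the catch-all branch keeps the categories exhaustive

weights = [48.2, 63.5, 70.1, 88.0, 102.3]        # continuous data
print([weight_category(w) for w in weights])
# ['Underweight', 'Normal', 'Normal', 'Overweight', 'Obese']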
Examples: Blood pressure (continuous), Height/Weight (continuous), Age in whole years (discrete), Age with no restrictions (continuous—you can be 22.33 or 22.33943 or 22.33943943 or blablabla years old).

Quantitative data is often presented in frequency tables. We show an example from the lecture in Figure 1.

Figure 1: An example of a frequency table.

Relative frequency (RF) is defined as the fraction (in the form of a fraction, percentage, or decimal) of times a certain answer occurs. For example, the RF of 4-year-olds in the data shown in Figure 1 can be calculated as:

RF = (number of four year olds)/(total number of children) = 7/14 = 1/2 = 50% = 0.5.

Cumulative relative frequency (Cum RF) is the sum of all the relative frequencies that occur when any value less than or equal to the answer is considered. For example, the Cum RF for 4-year-olds in the same data is:

Cum RF = (number of four year olds + number of three year olds)/(total number of children) = (7 + 5)/14 = 12/14 ≈ 86% = 0.86.
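Here is a tiny Python sketch of RF and Cum RF. Only the 7 four-year-olds, the 5 three-year-olds, and the total of 14 are given in the text, so the two remaining children are assumed (purely so the counts add up) to be five-year-olds:

counts = {3: 5, 4: 7, 5: 2}     # age -> number of children; the 5-year-old count is assumed
total = sum(counts.values())    # 14

cum = 0.0
for age in sorted(counts):      # ages in increasing order
    rf = counts[age] / total    # relative frequency
    cum += rf                   # cumulative relative frequency
    print(f"age {age}: RF = {rf:.2f}, Cum RF = {cum:.2f}")
# age 3: RF = 0.36, Cum RF = 0.36
# age 4: RF = 0.50, Cum RF = 0.86
# age 5: RF = 0.14, Cum RF = 1.00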
1.2 Types of Inaccuracies

Inaccuracies in collecting and presenting data can appear through imprecision in measurement (which results in poor reproducibility of data), or by inherent bias in the measurement. See Figure 2.

Figure 2: Repeated glucose measurements on a single sample.

2 Descriptive Statistics

Often, working with raw data is difficult and cumbersome, especially since there is usually a lot of it, so we look for ways to visualise the data. This is accomplished by using frequency distributions.

2.1 Frequency Distributions

Continuous quantitative variables can be represented using continuous distributions. Discrete variables can be plotted using histograms and other methods, but we will not be particularly concerned with this. A fuller discussion is given in the lecture slides.

The distribution we will be most concerned with in this course is the Normal distribution or Gaussian distribution, shown in Figure 3. The normal distribution has a higher density in the middle, and tapers off towards the edges. Even within the category of "normal distributions", we can observe some unique shapes: the distribution might be flat and spread out (high variability in the data) or very tall and thin (low variability in the data). Of course, distributions don't necessarily have to be normal, but we will discuss these in a later section.

Figure 3: A normal (Gaussian) distribution.

2.2 Measures of Central Tendency

Instinctively, when the grades for an exam come out, the first thing we wonder is "How did I do in relation to the rest of the class?". To properly answer this question, we involve measures of central tendency. In this section, we discuss not one, not two, but three ways to formally define the center of a distribution: the mean, median, and mode.

2.2.1 Mean

The mean, also called the average, is the sum of all observations in a certain group, divided by the number of observations in that group. Symbolically, in a group of n observations:

x̄ = (1/n) Σ_{i=1}^{n} x_i = (x_1 + x_2 + x_3 + ... + x_n)/n.

When the number of observations in a data set is small, the mean is sensitive to extreme values (outliers). [Note, we define outliers as values which are three standard deviations from the center. These terms will be further discussed in a later section.] As the number of observations in a group increases, the effect of outliers is diluted.

2.2.2 Median

The median is defined as the true midpoint of a set of data. The calculation of the median is easily done in two steps:

1. Arrange the data in order of magnitude.
2. If the number of observations is odd, choose the middle number. If even, choose the middle two numbers and calculate their mean.

The median is insensitive to outliers. Consider a set of numbers organised by order of magnitude. Whether the value in the final space has magnitude 40 or 9008, the median remains unchanged.

2.2.3 Mode

The mode is probably the simplest measure of central tendency to calculate. It is defined as the value which occurs most often in a data set. There may be more than one mode in a set (multimodal), but rarely more than two (bimodal).

2.3 Measures of Variability

While knowing where the center of a distribution is located is important, we also tend to wonder what the distribution actually looks like—that is, are all the data points located right at the center, or are they spread out? To answer this question formally, we use measures of variability. We also discuss three of these: range, variance, and standard deviation.

2.3.1 Range

The range is defined as the difference between the highest and lowest values in a data set. Calculation is straightforward.

2.3.2 Variance

We define a deviation of a value as the difference between that value and the mean. Symbolically, x_i − x̄. We then can define the variance (s²) as the sum of the squares of the deviations divided by one less than the number of observations:

s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1).

Note that the squared deviations, rather than the deviations themselves, are used. This is to account for values on opposite sides of the mean, which would have deviations of opposite sign and would cancel each other out in a summation.

2.3.3 Standard Deviation

Because variance uses the squares of the deviations, the units of variance are also squared. That is, if the units of the observations in a data set are inches, then the unit of the variance of this set will be inches squared. For this reason, it is usually preferable to use another measure of variability, the standard deviation. The standard deviation is simply the square root of the variance:

s = √(s²) = √( Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) ).

Example: Consider the following data set: [3, 5, 6, 9, 0, -5, 3]. The mean of this data set is calculated as follows:

x̄ = (3 + 5 + 6 + 9 + 0 + 3 + (−5))/7 = 21/7 = 3.

The median is calculated by ordering the data set in order of magnitude: [-5, 0, 3, 3, 5, 6, 9]. Since n is odd, we choose the midpoint: 3. The mode is easily seen to also be 3. The range of the data is the difference between the highest value (9) and the lowest value (-5) = 14. The variance is calculated as follows:

s² = Σ (x_i − x̄)²/(n − 1) = [(−5 − 3)² + (0 − 3)² + (3 − 3)² + (3 − 3)² + (5 − 3)² + (6 − 3)² + (9 − 3)²]/6 = 20.33333.

Calculation of the standard deviation is straightforward from here:

s = √(s²) = √20.33333 = 4.50925.
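All of these summary statistics can be checked in a few lines with Python's standard library (just a sketch, nothing course-specific). Note that statistics.variance and statistics.stdev use the n − 1 denominator, exactly as above:

import statistics

data = [3, 5, 6, 9, 0, -5, 3]

print(statistics.mean(data))        # 3
print(statistics.median(data))      # 3  (sorts the data for you)
print(statistics.mode(data))        # 3
print(max(data) - min(data))        # 14 -> the range
print(statistics.variance(data))    # 20.333... (sample variance, divides by n - 1)
print(statistics.stdev(data))       # 4.509...  (square root of the variance)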
2.4 Quartiles

We define a quartile as one of four equal groups, each representing one fourth of a distribution. Specifically, we define:

• first quartile: cuts off the lowest 25% of the data
• second quartile: cuts the data set in half
• third quartile: cuts off the highest 25% of the data (or, equivalently, the lowest 75%).

There are many ways to compute quartiles, which can give slightly different results. We will discuss Tukey's hinges.

2.4.1 Tukey's Hinges System

Tukey's Hinges system is used to determine the 25th and 75th percentiles of a data set—so, the first and third quartiles. According to this system, the first quartile is defined as the median of the first half of the sample and the third quartile as the median of the second half of the sample. The calculations, then, can be divided into the following simple steps:

1. Order the data from smallest to largest. This is just like what we do when finding the median of a set.
2. That being said... find the median. Remember, if n is even, the median is the mean of the middle two numbers.
3. Since the median is the midpoint of the set, split the data set into two groups—one with values higher than the median, and one with values lower than the median. [Note: When n is even, the median is not included in either of the two groups! When n is odd, the median is included in both of the two groups!]
4. Now you have two sets of data. Find the median of each. The median of the "low" group is Q1, the first quartile. The median of the "high" group is Q3, the third quartile. Again, remember that if n (where now, n is the number in each of the two groups) is even, you use the mean of the two middle values.

Finally, we discuss the interquartile range, which is defined simply as the difference between the third and first quartiles: IQR = Q3 − Q1.

Example: Let's use the same data set as above: [3, 5, 6, 9, 0, -5, 3]. As before, we order it by magnitude to get: [-5, 0, 3, 3, 5, 6, 9]. From above, we know that the median of the set is 3, and that n = 7 is odd, so the median is included in both the high and low groups. We now form these groups: low group = [-5, 0, 3, 3], high group = [3, 5, 6, 9]. The median calculations are straightforward (remember n = 4 is now even), and we arrive at Q1 = 1.5 and Q3 = 5.5. The interquartile range, Q3 − Q1, is 4.
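Here is a minimal Python sketch of Tukey's hinges for the example above. It simply follows the four steps; it is not a general-purpose quartile function, and other software may use a different quartile convention:

from statistics import median

def tukey_hinges(data):
    """Q1, Q3, and IQR via Tukey's hinges (median included in both halves when n is odd)."""
    x = sorted(data)                      # step 1: order the data
    n = len(x)
    half = n // 2
    if n % 2 == 1:                        # n odd: the median goes into BOTH halves
        low, high = x[:half + 1], x[half:]
    else:                                 # n even: the median belongs to neither half
        low, high = x[:half], x[half:]
    q1, q3 = median(low), median(high)    # steps 3-4: median of each half
    return q1, q3, q3 - q1

print(tukey_hinges([3, 5, 6, 9, 0, -5, 3]))   # (1.5, 5.5, 4.0)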
2.5 Coefficient of Variation

We quickly discuss one final term dealing with variability: the coefficient of variation. This is defined as the ratio of the standard deviation to the mean:

cv = s/x̄.

Note that this is only valid for data with a non-zero mean.

3 Basic Probability

Knowing the basic rules of probability is important to understand and deal with random variability in a data set. In this section, we will define some terms and explain some fundamental probability rules.

3.1 Probability

The probability of an event is the number of ways the event can occur divided by the total number of possible outcomes:

P(E) = n/N = (number of favorable outcomes)/(number of possible outcomes).

Often there are so many possible outcomes that it is not possible to count all of them, and so we cannot directly determine the probability of an event by counting. In this case, we have to estimate the probability as a long-term relative frequency—that is, repeat a process over and over until we are more or less sure we are "close" to the real probability. The Law of Large Numbers states that if the same experiment is performed a large number of times, the average result from that experiment will be close to the expected value. For example, in a coin toss, we expect to get the result "heads" with a probability of 0.5. While we cannot observe this directly, if we were to perform a large number of coin tosses, the proportion of heads would be close to the expected value 0.5.

Since the probability of an event is a ratio, its value is always between 0 and 1. If the probability of an event is 0, this means that the number of favorable outcomes is 0, and so the event is impossible. On the other hand, if the probability is 1, the number of favorable outcomes is equal to the number of possible outcomes, and the event is certain to occur.

3.2 Multiple Events

So far we have described the probability of single events, e.g. the result of a single coin toss. However, often we are concerned with multiple events, e.g. the rolling of two dice simultaneously or two consecutive coin tosses.

3.2.1 The Addition Rule

The addition rule is used to calculate the probability of event A or event B occurring. For events that cannot both occur at the same time (mutually exclusive events, defined more carefully below), this probability is calculated by:

P(A or B) = P(A) + P(B).

(If A and B can overlap, the overlap is subtracted: P(A or B) = P(A) + P(B) − P(A and B).) This rule can be generalised to any number of events.

3.2.2 The Multiplication Rule

Before we continue, we define mutually exclusive events as those which cannot occur simultaneously; for example, one cannot have been vaccinated for the flu and not vaccinated for the flu at the same time. If events A and B are mutually exclusive, then the probability of both A and B occurring is always 0. For independent events (where the outcome of one does not affect the other—see the next subsection), however:

P(A and B) = P(A) × P(B).

This rule can also be generalised to any number of events.

3.2.3 Conditional Probability

Up until now, we have assumed that all events are independent of each other—that is, the outcome of one event doesn't influence the outcome of the following event. The probabilities in this case are called unconditional probabilities. However, when two events are not independent, we consider their conditional probabilities. An example of conditional probability is the likelihood of getting into a car accident with a BAC of 0.08 versus that of a sober driver. We calculate the conditional probability of event A given that event B has occurred as follows:

P(A|B) = P(A and B)/P(B).

Note that, if A and B are independent events:

P(A|B) = P(A and B)/P(B) = P(A)P(B)/P(B) = P(A).

So, the conditional probability of event A occurring if event B has occurred is simply the probability of event A occurring, as expected.

3.3 The Binary Classification Test

Finally, we will quickly go over some concepts necessary to analyse a binary classification test. A binary classification test is one that classifies the members of a set into two groups depending on the existence of one property, for example, diseased or not diseased.

Sensitivity is the number of positives which are correctly identified divided by the total number of actual positives (for example, the number of people correctly diagnosed with a disorder divided by the total number of people who actually have that disorder). Specificity is the number of correctly identified negatives divided by the total number of negatives (for example, the number of people who are identified as healthy divided by the total number of healthy people).

The outcome of a binary classification test may take four results:

• True positive: sick people diagnosed as sick
• False positive or Type I error: healthy people diagnosed as sick
• True negative: healthy people identified as healthy
• False negative or Type II error: sick people left undiagnosed.
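A quick sketch of sensitivity and specificity in Python, using a made-up 2×2 table (the counts below are invented purely for illustration):

# Hypothetical screening-test results (counts invented for illustration).
tp = 90   # true positives:  sick people correctly diagnosed as sick
fn = 10   # false negatives: sick people left undiagnosed (Type II error)
tn = 85   # true negatives:  healthy people identified as healthy
fp = 15   # false positives: healthy people diagnosed as sick (Type I error)

sensitivity = tp / (tp + fn)      # correctly identified positives / all actual positives
specificity = tn / (tn + fp)      # correctly identified negatives / all actual negatives
print(sensitivity, specificity)   # 0.9 0.85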
4 The Binomial Distribution

The binomial distribution is the probability distribution typically associated with the number of successes when performing n "yes/no" experiments. The classical experiment associated with the binomial distribution is the Bernoulli trial, an experiment with a random outcome with two possibilities: "success" or "failure". (Fun fact: the binomial distribution for n = 1 is called the Bernoulli distribution.)

So, when would we use the binomial distribution? Consider a population—say all the adults on the planet Blorg. If you are considering a certain trait for which you know the prevalence in that population (say, purple skin), then we can use the binomial distribution to tell us what the chances are of randomly selecting some person (or some random sample of people) from the population who has that trait.

Example: So let us assume we have the Blorgian population we discussed above. Now, assume we know that the percentage of Blorgian adults with purple skin is 29% (p = 0.29). Now, if we pick 1000 random adults (N = 1000) from this population, we want to know the probability (P) that we will get exactly 230 (x = 230) purple-skinned ones. For this, we use the binomial distribution and the following formula:

P(x) = C(N, x) p^x (1 − p)^(N−x),

where the binomial coefficient C(N, x) ("N choose x") is calculated by:

C(N, x) = N!/(x!(N − x)!).

So, in our example:

P(230) = [1000!/(230!(1000 − 230)!)] × 0.29^230 × (1 − 0.29)^(1000−230).

The factorials involved are enormous numbers, and this is very difficult to calculate by hand, or even on a calculator. We'll see in a later section that we can use the normal distribution to approximate the binomial distribution in certain cases.

There are some other things we can calculate for the binomial distribution—the mean and variance are given as follows:

µ = Np
s² = Np(1 − p).

5 The Normal Distribution

The normal distribution, which we briefly discussed before (see Figure 3), is considered the most "basic" probability distribution, and is determined by its mean µ and its standard deviation σ. A normal distribution has the following properties:

• the mean is at the center
• if you consider the area under the curve within one standard deviation of the mean in both directions, you will cover approximately 68% of the area (exactly, 68.2%)
• similarly, two standard deviations comprise about 95% and three, 99.7% (see Figure 4).

Figure 4: Standard deviations in a normal distribution.

Example: Let's consider the same Blorgian population as above, again, and this time we'll look at number of nipples. The mean number of nipples in this population is 4, with a standard deviation of 0.7, because Blorgians can have fractions of a nipple. We want to know the probability of finding a Blorgian with 3.3 nipples or less. This corresponds to a boundary that is one standard deviation to the left of the mean. Looking at the normal curve (Figure 4), we see that if we consider the area under the curve to the left of this boundary (which corresponds to less than or equal to 3.3 nipples), the probability of finding such a Blorgian is approximately 16%.
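Both of the probabilities in this section are one-liners if you happen to have SciPy around (a sketch for checking your work, not something the course requires):

from scipy.stats import binom, norm

# Exact binomial probability of exactly 230 purple-skinned Blorgians out of 1000,
# with prevalence p = 0.29 -- no giant factorials needed.
print(binom.pmf(230, 1000, 0.29))   # a small number: 230 is far below the expected 290

# Probability of a Blorgian with 3.3 nipples or fewer, with mean 4 and SD 0.7:
# one standard deviation below the mean, so roughly 16%.
print(norm.cdf(3.3, loc=4, scale=0.7))    # ~0.1587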
Since we often cannot sample an entire planet, we must settle for choosing a random sample of size N. If we do this, then we must find a relationship between the parameters of the population (µ, σ) and the statistics of the sample (x̄, s). For a normal approximation, this is quite simple: x̄ = µ and s = σ/√N. This term, s, is called the standard error. Note that as N becomes large, the standard error becomes small, so that the distribution of the sample mean converges onto the true mean of the population.

5.1 The Central Limit Theorem

Even though the normal distribution is nice to work with, it is often not a great approximation for a sample (for example, in skewed distributions). However, if N is large enough, we can use the normal distribution to approximate the sample mean, no matter the original distribution of the population. This is the Central Limit Theorem. More formally:

• If the underlying distribution of the population is normal, X ~ N(µ, σ²), then the sample mean distribution is also a normal distribution, X̄ ~ N(µ, σ²/N).
• However, if the underlying distribution of the population is not normal but rather some unknown distribution, X ~ f(x|µ, σ²), then for large enough N, the sample mean distribution can still be approximated by the normal distribution, X̄ ≈ N(µ, σ²/N).

For our purposes, we consider N = 50-100 to be large enough.

5.2 Standardized Normal Distribution and Z-Scores

To make calculations more convenient, sometimes we standardize the normal distribution—meaning, we convert the distribution to one that has µ = 0 and σ = 1. In order to do this, we first shift the distribution so that it is centered around zero (so, we subtract the mean) and then we divide by the standard deviation. This gives us the Z-score:

Z = (X − µ)/σ.

Now, we can use this Z-score to tell us many things (quickly, and without any other calculations!) about where we sit on the distribution.

Example: Z < −1 means that we are more than one standard deviation to the left of the mean, and so, as we had calculated before: P(Z < −1) = 0.16.

6 Hypothesis Testing

In doing research, we are often presented with a claim, which we then either support or reject based on experiments. The same is true in statistics, and is called hypothesis testing. The original claim presented to you is called the null hypothesis, and the opposite of that claim, which you are trying to prove, is called the alternative hypothesis. The procedure for developing a hypothesis test is as follows:

1. Develop the null and alternative hypotheses. An important thing to consider is that the two hypotheses must encompass all possibilities and be mutually exclusive—that is, in every case one, and ONLY one, of the hypotheses is true.
2. Set an α-level. This determines your tolerance for Type I errors (false positives, discussed previously; in hypothesis testing, a Type I error means rejecting the null hypothesis when it is true). A typical α-level is 0.05 (5%), but some stricter journals may require 0.01 (1%).
3. Once the α-level is established, you can calculate whether or not you can reject the null hypothesis at this level. Keep in mind that a higher α means a higher chance of a Type I error—that is, you may reject a null hypothesis that you would have failed to reject at a stricter α.

Example: Let's say we have our Blorgian population once again! We are presented with the claim that the proportion of Blorgians with blue toenails in the population is 29%. Set up your hypotheses as:

H0: p = 0.29
H1: p ≠ 0.29.

We set an α value of 0.05, which means we are willing to take a 5% chance that we are wrong. This corresponds to a "critical" z-score of 1.96 (we shouldn't try to memorise the big table of z-scores, it's probably impossible and definitely a huuuge waste of time. But this one is good to know!). Now we take a random sample with N = 100. In this sample, we find 33 sets of blue toenails (x = 33, p̂ = 0.33). We now calculate the z-score:

z = (p̂ − p)/√(p(1 − p)/N) = (0.33 − 0.29)/√(0.29(1 − 0.29)/100) = 0.88.

Now, since the z we calculated is lower than our critical z-score, we fail to reject the null hypothesis!

But what if we had taken a bigger sample? Let's consider N = 1000. In this sample, we find 330 sets of blue toenails (x = 330, p̂ = 0.330). We now calculate the z-score:

z = (p̂ − p)/√(p(1 − p)/N) = (0.330 − 0.290)/√(0.290(1 − 0.290)/1000) = 2.79.

Here, we see that our z-score is now BIGGER than the critical value, and so this time we reject the null hypothesis. (If the true proportion really does differ from 29%, then the bigger sample saved us from a Type II error—failing to reject a false null hypothesis.)
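The two z-scores above can be checked with a few lines of plain Python (just the formula, nothing fancy):

from math import sqrt

def one_proportion_z(p_hat, p0, n):
    """z-score for a one-sample proportion test: (p_hat - p0) / sqrt(p0(1 - p0)/n)."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z_critical = 1.96                                  # two-sided, alpha = 0.05
for n, p_hat in [(100, 0.33), (1000, 0.330)]:
    z = one_proportion_z(p_hat, 0.29, n)
    verdict = "reject H0" if abs(z) > z_critical else "fail to reject H0"
    print(f"N = {n}: z = {z:.2f} -> {verdict}")
# N = 100: z = 0.88 -> fail to reject H0
# N = 1000: z = 2.79 -> reject H0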
6.1 Confidence Intervals

So far, we have discussed point estimation—specifically, estimation of the mean. Another type of estimation is interval estimation, which attempts to provide a range of likely values, called a confidence interval. As with point estimation, we set an error level which we deem acceptable (generally, this is 5%, corresponding to a 95% confidence interval—meaning that we are 95% confident that the correct answer lies within the range we are suggesting). Note that the higher the acceptable error, the smaller the interval actually is! This may seem counter-intuitive at first, but consider that if you want to be 100% confident (thus have the smallest error, 0%) that you are in the right range, you would have to include all possible values (thus having the largest confidence interval possible).

Example: Let us consider again the same population as before, Blorgians with 29% blue toenails. Now we want to know the 95% confidence interval for a sample of size N = 100. Let's say the population standard deviation is σ = 3%. Then we can calculate the standard error by:

σE = σ/√N = 3/10 = 0.3.

Now, for a 95% confidence interval, we know we will cover the range −1.96 ≤ z ≤ 1.96. This means we can go 1.96 standard errors in both directions from the mean. So, the 95% confidence interval is then given as:

29% ± (1.96)(0.3) ≈ 29% ± 0.6%.
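A tiny sketch of the interval calculation, using the 1.96 critical value directly:

from math import sqrt

sigma, n = 3.0, 100           # population SD (in %) and sample size
point_estimate = 29.0         # sample estimate, in %

se = sigma / sqrt(n)          # standard error = 0.3
margin = 1.96 * se            # ~0.59, rounded to 0.6 in the text
print(f"95% CI: {point_estimate - margin:.1f}% to {point_estimate + margin:.1f}%")
# 95% CI: 28.4% to 29.6%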
But what if we didn't know the standard deviation of the population? In this case, we would use a Student's t-distribution instead of the normal, and a t-statistic instead of the z-score. For this course, the calculations will be exactly the same, only using different charts. Note, however, that on the t-statistic chart there is an extra parameter called degrees of freedom, which for a one-sample problem is simply equal to N − 1.

6.2 Comparing Two Means (Two-Sample t-test)

Sometimes we are given two samples and our task is to find out if there exists a statistically significant difference between them. The procedure is not so different from a one-sample t-test (which is not so different from a z-test), but we will work out an example anyways!

Example: Blorg's neighboring planet Glorf also has some subset of the population with multiple (and fractional) nipples. Everyone actually suspects that the Glorfites migrated over from Blorg and are the same species. Scientists determined that the only way to know for sure is to check if there is a statistically significant difference in the two populations in relation to nipple number. We pick two samples (N = 10 each), one Glorfite and one Blorgian. We observe that the Glorfite sample has on average 3.7 nipples, with a standard deviation of 0.3 (variance of 0.09), and the Blorgians have 3.9 nipples, with a standard deviation of 0.3 (variance of 0.09). The closeness of the variances of the two samples is necessary for the two-sample t-test and is referred to as homogeneity of variance. We also must have that the two populations are normally, or close to normally, distributed. Now we begin the calculations!

First we must calculate the difference between the two means: µB − µG = 3.9 − 3.7 = 0.2. The claim that has been made is that there is no difference between the two populations, so we state our null and alternative hypotheses as follows:

H0: µB − µG = 0
H1: µB − µG ≠ 0.

The standard error of the difference of the means is given as:

SE_{M1−M2} = √(s1²/N1 + s2²/N2) = √(0.09/10 + 0.09/10) ≈ 0.134.

Next, we compute our t-statistic:

t = (observed − hypothesised)/(standard error) = (0.2 − 0)/0.134 ≈ 1.49.

Using degrees of freedom = N1 + N2 − 2 = 18, a t-statistic of 1.49 corresponds to a p-value of approximately 0.15 for a two-sided t-test. This is much higher than our cut-off of 0.05, and so we fail to reject the null. [Note, however, we used an extremely small sample size, and the result very well might not have been the same had we used more of the population.]

6.3 Comparing Multiple Means (ANOVA)

Sometimes, if we wish to compare multiple means (more than 2), we must consider a method other than the t-test. Technically, we could perform as many pairwise comparisons as needed to come to a conclusion, but this can be tiring and tedious. It also increases our chances of making a Type I error (because we have a chance to make one at every test), though it decreases our chance of making a Type II error (because we have several chances, rather than one, to reject the null hypothesis). We would like a single test which would efficiently and easily perform a comparison between multiple means. Such a test is the ANOVA (or ANalysis Of VAriance). ANOVA can only determine whether at least one population mean is different from at least one other population mean, but not which mean is different. If we wish to find that out, we perform other (usually pairwise) tests called post-hoc tests after the ANOVA.

Example: The planet on the other side of Blorg, Flugle, is also suspected of being composed of migrated Blorgians. In addition to the samples above, we also pick 10 Fluglers, who have on average 3.2 nipples with a standard deviation of 0.3 (variance 0.09). We state our hypotheses:

H0: µF = µG = µB
H1: not all of the population means are equal.

For ANOVA tests, we use a statistic called the F-statistic, which depends on several parameters including: the number of groups r (here, 3), the combined sample size N (here, 30), and α (here, 0.05 as usual). The critical value of F is denoted as F(r−1, N−r, α) = F(2, 27, 0.05). The first value, r − 1, is called the numerator degrees of freedom and the second, N − r, the denominator degrees of freedom. Our critical F-value is 3.35 (from the F-statistic table in the lecture slide appendix). The calculation of the F-statistic is somewhat involved (and we won't work it out here), but we give the formula:

F = (between-group variability)/(within-group variability) = [Σ_i n_i (X̄_i − X̄)²/(r − 1)] / [Σ_{ij} (X_{ij} − X̄_i)²/(N − r)].

Here, r and N are as defined above, n_i is the size of an individual group i, X̄_i is the mean of that group, X̄ is the mean of the entire data set, and X_{ij} is an individual observation (number "j") in group i. Since this is a pretty tedious calculation, we won't do it out here, but let's assume that the F-value was less than the critical value of 3.35, and the Blorgians were correct in assuming that they are the sole source of intelligent life in their immediate surroundings.
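If SciPy is available, both of these calculations can be sanity-checked without any tables (a sketch; scipy.stats.ttest_ind_from_stats works directly from summary statistics and assumes equal variances here):

from scipy.stats import ttest_ind_from_stats, f

# Two-sample t-test from the summary statistics in the example
# (means 3.9 and 3.7, SD 0.3 each, 10 per group, equal variances assumed).
t_stat, p_value = ttest_ind_from_stats(mean1=3.9, std1=0.3, nobs1=10,
                                       mean2=3.7, std2=0.3, nobs2=10,
                                       equal_var=True)
print(t_stat, p_value)     # ~1.49, ~0.15 -> fail to reject at alpha = 0.05

# Critical F-value for the ANOVA example: 2 and 27 degrees of freedom, alpha = 0.05.
print(f.ppf(0.95, 2, 27))  # ~3.35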
7 Correlation and Regression

The final section (thank god!) has to do with correlation and regression, which are both methods to evaluate and quantify the relationship between two (quantitative) variables. One of the variables is called the dependent variable and the other the independent variable. The dependent variable is usually the factor we are measuring or interested in, such as disease prevalence or the outcome of a treatment, while the independent variable is something we freely control, like dosage level or exposure to a carcinogen. Data points are usually graphically represented in a scatter plot, such as the one shown in Figure 5.

Figure 5: Scatter plot denoting cigarette use vs. kidney disease.

7.1 Correlation

In this section, we talk about Pearson's correlation (ρ in a population, r in a sample), which is defined as a measure of the strength of the linear relationship between two variables. If a relationship between two variables exists but is not linear, then this coefficient may not be adequate to describe the relation. This coefficient has a value between -1 and 1, with r = −1 denoting a perfect negative relationship between the two variables, r = 1 denoting a perfect positive relationship between variables, and r = 0 denoting that there is no (linear) relationship.

7.2 Simple Linear Regression

Going one step further than correlation, regression is used to describe a functional relationship between two variables by fitting a line to bivariate data points. The equation denoting a relationship between variables x and y is given as:

y = a + bx,

where x is the independent and y the dependent variable, b is the slope of the line, and a is the intercept—the value of y at which the line crosses the y-axis (that is, the value of y when x = 0). Since there is almost no way that there will be a single line that goes perfectly through all points, there will be some distance between the points and the line. We call this the residual, and calculate it by:

residual = observed y − predicted y.

The least squares line is the one which minimizes this error. To calculate the parameters a and b for this line, we use the following formulas:

b = r (s_y / s_x),
a = ȳ − b x̄,

where s_y and s_x are the sample standard deviations of y and x, r is the correlation coefficient, and x̄ and ȳ are the sample means.

7.3 Multiple Linear Regression

Often, there are multiple factors that affect a certain outcome. In this case, we need to consider more than one independent variable, and so we perform multiple linear regression. In this course, we won't really be concerned much with multiple linear regression except to note how changing each independent variable affects the dependent variable.

Example: Attractiveness (A) on Blorg is a combination of three factors: number of nipples (n), how blue one is (which Blorgians rate on a continuous scale: 0 ≤ b ≤ 10), and intelligence (which Blorgians also rate on a continuous scale: 0 ≤ i ≤ 10). The relationship is given by the following equation:

A = 2.1n − 2.3b + 0.8i + 1.4.

(Intelligence is not that important to the Blorgians.) From this equation, we can see how changes in any of these attributes affect attractiveness. For example, if one loses a nipple (somehow), one's attractiveness goes down by 2.1 units. Conversely, if one were to find that nipple someone else lost, then that person's attractiveness would increase by 2.1 units. In another example, if a Blorgian fell into a tub of permanent paint (which exists on Blorg, I guess) and became less blue by 3 units, his/her attractiveness would increase (because there is a negative sign before the blueness term) by 6.9 units!

We can do this for any number of independent variables. Most likely anything more complicated would be done using software, which we have not learned this term, so don't worry about more complicated problems!

Thanks for reading! Again, please send me any corrections that you find!!
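P.S. For anyone who likes to double-check formulas numerically, here is one last Python sketch of the least-squares formulas from Section 7.2, run on a tiny made-up data set (the x and y values are invented purely for illustration), plus the multiple-regression equation from Section 7.3:

import statistics as st

# Made-up (x, y) data purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 4.2, 4.8, 6.1]

x_bar, y_bar = st.mean(x), st.mean(y)
s_x, s_y = st.stdev(x), st.stdev(y)

# Pearson's r from the deviations (sample version, n - 1 in the denominator).
n = len(x)
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b = r * (s_y / s_x)        # slope
a = y_bar - b * x_bar      # intercept: predicted y when x = 0
print(f"r = {r:.3f}, y = {a:.2f} + {b:.2f}x")   # r = 0.995, y = 1.05 + 0.99x

# The Blorgian attractiveness equation from Section 7.3:
def attractiveness(nipples, blueness, intelligence):
    return 2.1 * nipples - 2.3 * blueness + 0.8 * intelligence + 1.4

# Becoming 3 units less blue raises attractiveness by 2.3 * 3 = 6.9 units.
print(round(attractiveness(4, 5, 7) - attractiveness(4, 8, 7), 1))   # 6.9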