Introductory Biostatistics

Biostatistics and Epidemiology, Midterm Review
New York Medical College
By: Jasmine Nirody
This review is meant to cover lectures from the first half of the Biostatistics course. The sections are not
organised by lecture, but rather by topic. If you have any comments or corrections, please email them to me
at jnirody@gmail.com!
1 Introduction to Statistics
This section discusses the definition of biostatistics (the application of statistics to a wide range of
topics in the biological and medical sciences) and how it can be used in our medical and research
careers.
1.1 Types of Measurements
Data can be highly variable due to several factors, including genetics, age, sex, race, economic background,
measurement techniques, and many others. For this reason, we need ways to classify measurements.
1.1.1 Categorical Data
Categorical data is meant to place data into more or less “arbitrary” groups—meaning that the way the
groups are ordered or presented doesn’t carry any information. This is a qualitative measure,
and usually is not numerical (though sometimes numbers can be assigned; we will show an example).
Examples: Sex (Male/Female), Blood Type (A, B, AB, O), Disease Status (Y/N, 1/0). [Note in the last
example that, though numbers (1, 0) may be used to denote categorical data, the number choices are arbitrary. That is, you can choose to denote positive or negative disease status by any symbol—1, 0, 293, or so on.]
When constructing a categorical variable, note that categories should be exhaustive, that is, there should
be a category for every possibility (this often means including an “Other” category) and mutually exclusive,
which means that all observations fit into one, and ONLY one, category.
Often it is possible to “convert” ordinal or quantitative data into categorical data. For example, consider a
situation where you have weight data which is continuous quantitative. By assigning ranges to be considered
Underweight, Normal, Overweight, and Obese, we now have categorical data. As a general rule,
more informative data can be converted into less informative data, but not vice versa.
1.1.2 Ordinal Data
Ordinal data assigns data into categories which can be ranked, though only the order, and not the ‘distance’,
between categories is considered. A good general rule is that if there is no “true zero”, the data is ordinal.
Examples: Opinion ranked on a scale of 1-5. Here, we could have used any (numeric or non-numeric) scale
with 5 categories, say [Poor, Fair, Good, Very Good, Amazing], or 1-10 (using only even numbers), or the
set [-4, 2/3, 9, 30, 3993]. The specific numbers used don’t matter in ordinal data, only that they exist in some
prespecified order.
1.1.3 Quantitative Data
Quantitative data is data you can put onto a numeric scale, where “zero” has a real meaning. There are
two types of quantitative data, continuous (which can take on any value within a certain range) and discrete
(which can only have certain values within a certain range). A good way to tell the difference is to pick any
two possible values the data could have, and then pick any random value between those two numbers. Can
the data take on that value? If no, it’s discrete. If yes, it might be continuous, but you need to see if it can
do that for any value in that range, or if you just picked luckily!
Examples: Blood pressure (continuous), Height/Weight (Continuous), Age in whole years (discrete), Age
with no restrictions (Continuous—you can be 22.33 or 22.33943 or 22.33943943 or blablabla years old).
Quantitative data is often presented in frequency tables. We show an example from the lecture in Figure 1.
Figure 1: An example of a frequency table.
Relative frequency (RF) is defined as the fraction (in the form of a fraction, percentage, or decimal) of times
a certain answer occurs. For example, the RF of 4-year olds in the data shown in Figure 1 can be calculated
as:
\[ \frac{\text{number of four year olds}}{\text{total number of children}} = \frac{7}{14} = \frac{1}{2} = 0.5 = 50\%. \]
2
Cumulative relative frequency (Cum RF) is the sum of all the relative frequencies that occur when any value
less than or equal to the answer is considered. For example, the Cum RF for 4-year olds in the same data is:
\[ \frac{\text{number of four year olds} + \text{number of three year olds}}{\text{total number of children}} = \frac{7+5}{14} = \frac{12}{14} \approx 0.86 = 86\%. \]
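To make the arithmetic concrete, here is a minimal Python sketch. The counts are reconstructed from the worked example above (7 four-year-olds, 5 three-year-olds, 14 children total); the remaining two children are assumed, purely for illustration, to be five-year-olds.

```python
# Hypothetical frequency table: age -> number of children (14 total).
freq = {3: 5, 4: 7, 5: 2}
total = sum(freq.values())

cumulative = 0.0
for age in sorted(freq):
    rf = freq[age] / total     # relative frequency of this age
    cumulative += rf           # cumulative relative frequency (ages <= this one)
    print(f"age {age}: RF = {rf:.2f}, Cum RF = {cumulative:.2f}")
```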
1.2 Types of Inaccuracies
Inaccuracies in collecting and presenting data can appear through imprecision in measurement (which results
in poor reproducibility of data), or through inherent bias in the measurement. See Figure 2.
2 Descriptive Statistics
Often, working with raw data is difficult and cumbersome, especially since there is usually a lot of it, so we look
for ways to visualise the data. This is accomplished by using frequency distributions.
2.1 Frequency Distributions
Continuous quantitative variables can be represented using continuous distributions. Discrete variables can
be plotted using histograms and other methods, but we will not be particularly concerned with this. A fuller
discussion is given in the lecture slides.
Figure 2: Repeated glucose measurements on a single sample.
The distribution we will be most concerned with in this course is the Normal distribution or Gaussian
distribution, shown in Figure 3. The normal distribution has a higher density in the middle, and tapers
off towards the edges. Even within the category of “normal distributions”, we can observe some unique
shapes: the distribution might be flat and spread out (high variability in the data) or very high and thin
(low variability in the data). Of course, distributions don’t necessarily have to be normal, but we will discuss
these in a later section.
Figure 3: A normal (Gaussian) distribution.
2.2 Measures of Central Tendency
Instinctively, when the grades for an exam come out, the first thing we wonder is “How did I do in relation
to the rest of the class?”. To properly answer this question, we involve measures of central tendency. In
this section, we discuss not one, not two, but three ways to formally define the center of a distribution: the
mean, median, and mode.
2.2.1 Mean
The mean, also called the average, is the sum of all observations in a certain group, divided by the number
of observations in that group. Symbolically, in a group of n observations:
\[ \bar{x} = \sum_{i=1}^{n} \frac{x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}. \]
When the number of observations in a data set is small, the mean is sensitive to extreme values (outliers).
[Note, we define outliers as values which are three or more standard deviations from the mean. These terms will be
further discussed in a later section.] As the number of observations in a group increases, the effect of outliers
is diluted.
2.2.2 Median
The median is defined as the true midpoint of a set of data. The calculation of the median is easily done in
two steps:
1. Arrange data in order of magnitude.
2. If number of observations is odd, choose middle number. If even, choose middle two numbers and
calculate their mean.
The median is insensitive to outliers. Consider a set of numbers organised by order of magnitude. Whether
the value in the final space has magnitude 40 or 9008, the median remains unchanged.
2.2.3 Mode
The mode is probably the simplest measure of central tendency to calculate. It is defined as the value
which occurs most often in a data set. There may be more than one mode in a set (such a set is called
multimodal, or bimodal if it has exactly two modes), but rarely more than two.
2.3 Measures of Variability
While knowing where the center of a distribution is located is important, we also tend to wonder what the
distribution actually looks like—that is, are all the data points located right at the center, or are they spread
out? To answer this question formally, we use measures of variability. We also discuss three of these: range,
variance, and standard deviation.
2.3.1 Range
The range is defined as the difference between the highest and lowest values in a data set. Calculation is
straightforward.
2.3.2 Variance
We define a deviation of a value as the difference between that value and the mean; symbolically, x_i − x̄.
We then can define the variance (s²) as the sum of the squares of the deviations divided by one less than
the number of observations:
\[ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}. \]
Note that the squared deviations, rather than the deviations themselves, are used. This is to account for
values on opposite sides of the mean, which would have deviations of opposite sign, and would cancel each
other out in a summation.
2.3.3 Standard Deviation
Because variance uses the squares of the deviations, the units of variance are also squared. That is, if the
units of the observations in a data set are inches, then the units of the variance of this set will be inches squared.
For this reason, it is usually preferable to use another measure of variability, the standard deviation. The
standard deviation is simply the square root of the variance:
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}. \]
Example: Consider the following data set: [3, 5, 6, 9, 0, -5, 3]. The mean of this data set is calculated as
follows:
\[ \bar{x} = \frac{3 + 5 + 6 + 9 + 0 + 3 + (-5)}{7} = \frac{21}{7} = 3. \]
The median is calculated by ordering the data set in order of magnitude: [-5, 0, 3, 3, 5, 6, 9]. Since n is odd,
we choose the midpoint: 3. The mode is easily seen to also be 3.
The range of the data is the difference between the highest value (9) and the lowest value (−5): 9 − (−5) = 14. The
variance is calculated as follows:
\[ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{(-5-3)^2 + (0-3)^2 + (3-3)^2 + (3-3)^2 + (5-3)^2 + (6-3)^2 + (9-3)^2}{6} = \frac{122}{6} = 20.33333. \]
Calculation of the standard deviation is straightforward from here:
\[ s = \sqrt{s^2} = \sqrt{20.33333} = 4.50925. \]
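As a quick sanity check on the worked example, the following short Python sketch (standard library only) reproduces all of these summary statistics:

```python
from statistics import mean, median, mode, variance, stdev

data = [3, 5, 6, 9, 0, -5, 3]

# Measures of central tendency
print("mean:  ", mean(data))             # (3+5+6+9+0+3-5)/7 = 3
print("median:", median(data))           # middle value of the sorted data = 3
print("mode:  ", mode(data))             # most frequent value = 3

# Measures of variability
print("range: ", max(data) - min(data))  # 9 - (-5) = 14
print("s^2:   ", variance(data))         # sample variance, divides by n - 1 = 6
print("s:     ", stdev(data))            # square root of the sample variance
```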
2.4 Quartiles
We define a quartile as one of four equal groups, each representing one fourth of a distribution. Specifically, we
define
• first quartile (Q1): the value below which the lowest 25% of the data fall
• second quartile (Q2): the median, which cuts the data set in half
• third quartile (Q3): the value below which the lowest 75% of the data fall (the highest 25% lie above it).
There are many ways to compute quartiles, all of which can give slightly different results. We will discuss Tukey’s
hinges.
2.4.1 Tukey’s Hinges System
Tukey’s Hinges system is used to determine the 25th and 75th percentiles of a data set–so, the first and third
quartiles. According to this system, the first quartile is defined as the median of the first half of the sample
and the third quartile as the median of the second half of the sample. The calculations then, can be divided
into the following simple steps:
1. Order the data from smallest to largest. This is similar to if we were simply finding the median of a
set.
2. That being said....find the median. Remember, if n is even, the median is the mean of the middle two
numbers.
3. Since the median is the midpoint of the set, split the data set into two groups–one with values higher
than the median, and one with values lower than the median. [Note: When n is even, the median is
not included in either of the two groups! When n is odd, the median is included in both of the two
groups!]
4. Now you have two sets of data. Find the median of each. The median of the “low” group is Q1, the
first quartile. The median of the “high” group is Q3, the third quartile. Again, remember that if n
(where now, n is the number in each of the two groups) is even, you use the mean of the two middle
values.
Finally, we discuss the interquartile range, which is defined simply as the difference between the third and
first quartile: IQR = Q3-Q1.
Example: Let’s use the same data set as above: [3, 5, 6, 9, 0, -5, 3]. As before, we order it by magnitude
to get: [-5, 0, 3, 3, 5, 6, 9]. From above, we know that the median of the set is 3, and that n = 7 is odd. So
the median is included in both high and low groups. We now form these groups:
low group = [-5 0 3 3] , high group = [3 5 6 9].
The median calculations are straightforward (remember n = 4 is now even), and we arrive at Q1 = 1.5 and
Q3 = 5.5. The interquartile range, Q3 - Q1, is 4.
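Here is a small Python sketch of the Tukey's hinges procedure described above (following this course's convention of including the median in both halves when n is odd); it reproduces Q1 = 1.5 and Q3 = 5.5 for the example data:

```python
from statistics import median

def tukey_hinges(data):
    """Q1 and Q3 via Tukey's hinges, following the steps above:
    when n is odd the median is included in both halves."""
    x = sorted(data)
    n = len(x)
    half = (n + 1) // 2              # size of each half (median shared if n is odd)
    low, high = x[:half], x[n - half:]
    return median(low), median(high)

data = [3, 5, 6, 9, 0, -5, 3]
q1, q3 = tukey_hinges(data)
print(q1, q3, q3 - q1)               # 1.5 5.5 4.0 for this data set
```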
2.5 Coefficient of Variation
We quickly discuss one final term dealing with variability: the coefficient of variation. This is defined as the
ratio of the standard deviation to the mean:
\[ cv = \frac{s}{\bar{x}}. \]
Note that this is only valid for data with a non-zero mean.
3 Basic Probability
Knowing the basic rules of probability is important for understanding and dealing with random variability in a
data set. In this section, we will define some terms and explain some fundamental probability rules.
3.1 Probability
The probability of an event is the number of ways the event can occur divided by the total number of possible outcomes:
\[ P(E) = \frac{n}{N} = \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}}. \]
Often there are so many possible events that it is not possible to count all of them, and so we cannot directly
determine the probability of an event by counting. In this case, we have to estimate the probability as a
long term relative frequency—that is, repeat a process over and over until we are more or less sure we are
“close” to the real probability. The Law of Large Numbers states that if the same experiment is performed
a large number of times, the average results from that experiment will be close to the expected value. For
example, in a coin toss, we expect to get the result “heads” with a probability 0.5. While we cannot observe
this directly, if we were to perform a large number of coin tosses, the proportion of heads would be close to
the expected value 0.5.
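A quick way to see the Law of Large Numbers in action is to simulate coin tosses. The sketch below (plain Python, with a fixed random seed so the run is repeatable) shows the observed proportion of heads settling towards 0.5 as the number of tosses grows:

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

# Law of Large Numbers: the observed proportion of heads drifts
# towards the expected probability 0.5 as the number of tosses grows.
for n_tosses in (10, 100, 1_000, 10_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    print(f"{n_tosses:>7} tosses: proportion of heads = {heads / n_tosses:.3f}")
```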
Since the probability of an event is a ratio, its value is always between 0 and 1. If the probability of an
event is 0, this means that the number of favorable outcomes is 0, and so the event is impossible. On the
other hand, if the probability is 1, the number of favorable outcomes is equal to the number of possible
outcomes, and the event is certain to occur.
3.2 Multiple Events
So far we have described the probability of single events, e.g. the result of a single coin toss. However,
often we are concerned with multiple events, e.g. the rolling of two dice simultaneously or two consecutive
coin tosses.
3.2.1 The Addition Rule
The addition rule is used to calculate the probability of event A or event B occurring. For mutually exclusive
events (defined in the next section), this probability is calculated by:
P(A or B) = P(A) + P(B).
More generally, P(A or B) = P(A) + P(B) − P(A and B); the subtracted term is 0 when the two events cannot
both occur. This rule can be generalised to any number of events.
3.2.2 The Multiplication Rule
Before we continue, we define mutually exclusive events as those which cannot occur simultaneously; for
example, one cannot have been vaccinated for the flu and not vaccinated for the flu at the same time. If
events A and B are mutually exclusive, then the probability of both A and B occurring is always 0. For
independent events (events whose outcomes do not influence one another), however:
P(A and B) = P(A) × P(B).
This rule can also be generalised to any number of events.
3.2.3 Conditional Probability
Up until now, we have assumed that all events are independent of each other—that is, the outcome of
one event doesn’t influence the outcome of the following event. The probabilities in this case are called
unconditional probabilities. However, when two events are not independent, we consider their conditional
probabilities. An example of conditional probability is the likelihood of getting into a car accident for a driver
with a BAC of 0.08 versus that of a sober driver.
We calculate the conditional probability of event A given that event B has occurred as follows:
\[ P(A|B) = \frac{P(A \text{ and } B)}{P(B)}. \]
Note that, if A and B are independent events:
\[ P(A|B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A). \]
So, the conditional probability of event A occurring if event B has occurred is simply the probability of event
A occurring, as expected.
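To see these definitions in action, the sketch below enumerates all 36 outcomes of rolling two dice (an example of my own choosing, not from the lecture) and checks that P(A|B) = P(A and B)/P(B) reduces to P(A) when the events are independent, but not when they aren't:

```python
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) = favorable outcomes / possible outcomes."""
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

A = lambda o: o[0] == 6          # first die shows a 6
B = lambda o: o[1] % 2 == 0      # second die is even (independent of A)
C = lambda o: sum(o) == 8        # the two dice sum to 8 (NOT independent of A)

p_A_given_B = prob(lambda o: A(o) and B(o)) / prob(B)
p_A_given_C = prob(lambda o: A(o) and C(o)) / prob(C)

print(prob(A), p_A_given_B)   # independent: both are 1/6, about 0.167
print(prob(A), p_A_given_C)   # dependent:   1/6 versus 1/5 = 0.2
```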
3.3 The Binary Classification Test
Finally, we will quickly go over some concepts necessary to analyse a binary classification test. A binary
classification test is one that classifies the members of a set into two groups depending on the existence of
one property, for example, diseased or not diseased.
Sensitivity is a measure of the number of positives which are correctly identified divided by the total number
of actual positives (for example, the number of people correctly diagnosed with a disorder divided by the
total number of people who actually have that disorder). Specificity is the number of correctly identified
negatives divided by the total number of negatives (for example, the number of people who are correctly identified
as healthy divided by the total number of healthy people). The outcome of a binary classification test may
be one of four results:
• True positive: sick people diagnosed as sick
• False positive or Type I error: healthy people diagnosed as sick
• True negative: healthy people identified as healthy
• False negative or Type II error: sick people left undiagnosed.
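As a small illustration, the sketch below computes sensitivity and specificity from the four outcome counts; the counts themselves are made up purely for the example:

```python
# A minimal sketch: sensitivity and specificity from the four outcome counts.
# The counts below are made up purely for illustration.
TP = 90   # true positives:  sick people diagnosed as sick
FN = 10   # false negatives: sick people left undiagnosed (Type II error)
TN = 170  # true negatives:  healthy people identified as healthy
FP = 30   # false positives: healthy people diagnosed as sick (Type I error)

sensitivity = TP / (TP + FN)   # correctly identified positives / all actual positives
specificity = TN / (TN + FP)   # correctly identified negatives / all actual negatives

print(f"sensitivity = {sensitivity:.2f}")   # 0.90
print(f"specificity = {specificity:.2f}")   # 0.85
```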
4 The Binomial Distribution
The binomial distribution is the probability distribution typically associated with the number of successes
when performing n “yes/no” experiments. The classical experiment associated with the binomial distribution is the Bernoulli trial, an experiment with a random outcome that has exactly two possibilities: “success” or “failure”.
(Fun fact: the binomial distribution for n = 1 is called the Bernoulli distribution.)
So, when would we use the binomial distribution? Consider a population—say all the adults on the planet
Blorg. If you are considering a certain trait for which you know the prevalence in that population (say,
purple skin), then we can use the binomial distribution to tell us what the chances are of randomly selecting
some person (or some random sample of people) from the population who has that trait.
Example: So let us assume we have the Blorgian population we discussed above. Now, assume we know
that the percentage of Blorgian adults with purple skin is 29% (p = 0.29). If we pick 1000 random
adults (N = 1000) from this population, we want to know the probability (P) that exactly 230 of them
(x = 230) are purple-skinned. For this, we use the binomial distribution and the following formula:
\[ P(x) = \binom{N}{x} p^x (1-p)^{N-x}. \]
Here, \binom{N}{x} (read “N choose x”) is calculated by:
\[ \binom{N}{x} = \frac{N!}{x!(N-x)!}. \]
So, in our example:
\[ P = \frac{1000!}{230!(1000-230)!}\, 0.29^{230} (1-0.29)^{1000-230}. \]
These factorials are enormous, and the expression is very difficult to evaluate directly, even on a calculator. We’ll see in a later section
that we can use the normal distribution to approximate the binomial distribution in certain cases.
There are some other things we can calculate for the binomial distribution—mean and variance are given as
follows:
\[ \mu = Np, \qquad s^2 = Np(1-p). \]
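Although the probability P above is impractical by hand or on a calculator, a computer handles it easily, since Python integers can hold the enormous binomial coefficient exactly. A minimal sketch:

```python
from math import comb

N, p, x = 1000, 0.29, 230

# Exact binomial probability P(x) = C(N, x) * p**x * (1 - p)**(N - x).
prob = comb(N, x) * p**x * (1 - p)**(N - x)
print(f"P(X = {x}) = {prob:.3g}")           # a very small probability

# Mean and variance of the binomial distribution
print("mean     =", N * p)                   # 290.0
print("variance =", N * p * (1 - p))         # about 205.9
```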
5 The Normal Distribution
The normal distribution, which we briefly discussed before (See Figure 3), is considered the most “basic”
probability distribution, and is determined by its mean µ and its standard deviation σ. A normal distribution
has the following properties:
• the mean is at the center
• if you consider an area under the curve one standard deviation in both directions away from the mean,
you will cover approximately 68% of the area (exactly, 68.2%)
• similarly, two standard deviations comprise about 95% and three, 99.7% (See Figure 4)
Figure 4: Standard deviations in a normal distribution.
Example: Let’s consider the same Blorgian population as above, again, and this time we’ll look at number
of nipples. The mean number of nipples in this population is 4, with a standard deviation of 0.7, because
Blorgians can have fractions of a nipple. We want to know the probability of finding a Blorgian with 3.3
nipples or less. This corresponds to a boundary that is one standard deviation towards the left of the mean.
Looking at the normal curve (Figure 4), we see that if we consider the area under the curve to the left of this
boundary (which corresponds to less than or equal to 3.3 nipples), the probability of finding such a Blorgian
is approximately 16%.
Since we often cannot sample an entire planet, we must settle for choosing a random sample of size N. If we
do this, then we must find a relationship between the parameters of the population (µ, σ) and the statistics
of the sample (x̄, s). For a normal approximation, this is quite simple: x̄ ≈ µ and s = σ/√N. This term, s,
is called the standard error. Note that as N becomes large, the standard error becomes small, so that the
distribution of the sample mean converges onto the true mean of the population.
5.1 The Central Limit Theorem
Even though the normal distribution is nice to work with, it is often not a great approximation for a sample
(for example, in skewed distributions). However, if N is large enough, we can use the normal distribution
to approximate the sample mean, no matter the original distribution of the population. This is the Central
Limit Theorem. More formally:
• If the underlying distribution of the population is normal, X ∼ N(µ, σ²), then the sample mean is also normally distributed, with X̄ ∼ N(µ, σ²/N).
• However, if the underlying distribution of the population is not normal but rather some unknown distribution, X ∼ f(x|µ, σ²), then for large enough N the distribution of the sample mean can still be approximated by the normal distribution: X̄ ≈ N(µ, σ²/N).
For our purposes, we consider N in the range of 50–100 to be large enough.
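The Central Limit Theorem is easy to see by simulation. The sketch below (assumptions of my own: an exponential "population" with µ = σ = 2 and samples of size N = 50) draws many samples and checks that the sample means cluster around µ with spread close to σ/√N:

```python
import random
from statistics import mean, stdev

random.seed(2)

# Population: an exponential distribution (clearly skewed, not normal)
# with mean mu = 2 and standard deviation sigma = 2.
mu = sigma = 2.0
N = 50  # sample size ("large enough" by the rule of thumb above)

# Draw many samples of size N and record each sample mean.
sample_means = [mean(random.expovariate(1 / mu) for _ in range(N))
                for _ in range(5000)]

print("mean of sample means:", round(mean(sample_means), 3))   # close to mu = 2
print("sd of sample means:  ", round(stdev(sample_means), 3))  # close to sigma/sqrt(N), about 0.28
```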
5.2 Standardized Normal Distribution and Z-Scores
To make calculations more convenient, sometimes we standardize the normal distribution—meaning, we
convert the distribution to one that has µ = 0 and σ = 1. In order to do this, we first shift the distribution
so that it is centered around zero (so, we subtract the mean) and then we divide by the standard deviation.
This gives us the Z-score:
\[ Z = \frac{X - \mu}{\sigma}. \]
Now, we can use this Z-score to tell us many things (quickly, and without any other calculations!) about
where we sit on the distribution.
Example: Z = −1 means that we are one standard deviation to the left of the mean, and so, as we had
calculated before: P(Z < −1) ≈ 0.16.
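Rather than memorising the z-table, you can compute these probabilities directly. The sketch below implements the standard normal CDF via the error function and reproduces both the 3.3-nipple example and P(Z < −1) ≈ 0.16:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Blorgian nipples: mu = 4, sigma = 0.7. P(X <= 3.3)?
print(round(normal_cdf(3.3, mu=4, sigma=0.7), 4))   # about 0.1587, i.e. roughly 16%

# Same thing via a z-score on the standardized normal distribution:
z = (3.3 - 4) / 0.7                                  # = -1.0
print(round(normal_cdf(z), 4))                       # P(Z < -1), about 0.16
```

The same result can also be obtained with statistics.NormalDist(4, 0.7).cdf(3.3) in Python 3.8 and later.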
6 Hypothesis Testing
In doing research, we are often presented with a claim, which we then either prove or disprove by experiments.
The same is true in statistics, and is called hypothesis testing. The original claim presented to you is called
the null hypothesis, and the opposite of that claim, which you are trying to prove, is called the alternative
hypothesis. The procedure for developing a hypothesis test is as follows:
1. Develop the null and alternative hypotheses. An important thing to consider is that the two hypotheses
must encompass all possibilities and be mutually exclusive—that is, in every case one, and ONLY one,
of the hypotheses is true.
2. Set an α-level. This determines your tolerance of Type 1 errors (or false positives, discussed previously).
In the case of hypothesis testing, a Type 1 error means rejecting the null hypothesis when it is true.
A typical α-level is 0.05 (5%), but some stricter journals may require 0.01 (1%).
3. Once the α-level is established, you can calculate whether or not you can reject the null hypothesis at
this level. Note that the higher the α, the higher the chance of a Type 1 error—that is, you may reject
the null hypothesis at this level when you would not have rejected it at a stricter α (and when the null
may in fact be true).
Example: Let’s say we have our Blorgian population once again! We are presented with the claim that the
proportion of Blorgians with blue toenails in the population is 29%. Set up your null and alternative hypotheses as:
H0: p = 0.29
H1: p ≠ 0.29.
We set an α value of 0.05, which means we are willing to take a 5% chance that we are wrong. This corresponds to a “critical” z-score of 1.96 (We shouldn’t try to memorise the big table of z-scores, it’s probably
impossible and definitely a huuuge waste of time. But this one is good to know!).
Now we take a random sample with N = 100. In this sample, we find 33 sets of blue toenails (x = 33,
p̂ = 0.33). We now calculate the z-score:
\[ z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{N}}} = \frac{0.33 - 0.29}{\sqrt{\frac{0.29(1-0.29)}{100}}} = 0.88. \]
Now, since the z we calculated is lower than our critical z-score, we fail to reject the null hypothesis!
But what if we had taken a bigger sample? Let’s consider N = 1000. In this sample, we find 330 sets of blue
toenails (x = 330, p̂ = 0.330). We now calculate the z-score:
\[ z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{N}}} = \frac{0.330 - 0.290}{\sqrt{\frac{0.290(1-0.290)}{1000}}} = 2.79. \]
Here, we see that our z-score is now BIGGER than the critical value, and so this time we reject the null hypothesis.
By taking a bigger sample size, we gained enough power to detect the difference, avoiding a Type II error (failing to reject a false null hypothesis).
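The whole test is only a few lines of Python; here is a sketch that reruns both sample sizes against the critical value of 1.96:

```python
from math import sqrt

def proportion_z(p_hat, p0, n):
    """One-sample z-statistic for a proportion, as in the formula above."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z_crit = 1.96  # two-sided critical value for alpha = 0.05

for n, x in ((100, 33), (1000, 330)):
    z = proportion_z(x / n, 0.29, n)
    decision = "reject H0" if abs(z) > z_crit else "fail to reject H0"
    print(f"N = {n:>4}: z = {z:.2f} -> {decision}")
```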
6.1 Confidence Intervals
So far, we have discussed point estimation—specifically, estimation of the mean. Another type of estimation
is interval estimation, which attempts to provide a range of likely values, called a confidence interval. As
with point estimation, we set an error level which we deem acceptable (generally, this is 5%, corresponding
to a 95% confidence interval—meaning that we are 95% confident that the correct answer lies within the
range we are suggesting). Note that the higher the acceptable error, the smaller the interval actually is!
This may seem counter-intuitive at first, but consider that if you want to be 100% confident (thus have the
smallest error, 0%) you are in the right range, you would have to include all possible values (thus having
the largest confidence interval possible).
Example: Let us consider again the same population as before, Blorgians with 29% blue toenails. Now we
want to know the 95% confidence interval for a sample of size N = 100. Let’s say the population standard
deviation is σ = 3%. Then we can calculate the standard error by:
\[ \sigma_E = \frac{\sigma}{\sqrt{N}} = \frac{3\%}{10} = 0.3\%. \]
Now, for a 95% confidence interval, we know we will cover the range −1.96 ≤ z ≤ 1.96. This means we can
go 1.96 standard errors in both directions from the mean. So, the 95% confidence interval is then given as:
\[ 29\% \pm (0.3\%)(1.96) \approx 29\% \pm 0.6\%. \]
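For completeness, the same interval computed in a few lines of Python (a sketch, using the numbers from the example above):

```python
from math import sqrt

point_estimate = 29.0   # percent
sigma = 3.0             # population standard deviation, in percent
N = 100
z_crit = 1.96           # 95% confidence

standard_error = sigma / sqrt(N)       # 0.3
margin = z_crit * standard_error       # about 0.59

print(f"95% CI: {point_estimate:.1f}% ± {margin:.1f}%")
print(f"      = ({point_estimate - margin:.2f}%, {point_estimate + margin:.2f}%)")
```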
But what if we didn’t know the standard deviation of the population? In this case, we would use a Student’s
t-distribution instead of the normal, and a t-statistic instead of the z-score. For this course, the calculations
will be exactly the same, only using different charts. Note, however, that on the t-statistic chart there is an
extra parameter called degrees of freedom, which is simply equal to N − 1.
6.2 Comparing Two Means (Two Sample t-test)
Sometimes we are given two samples and our task is to find out if there exists a statistically significant
difference between them. The procedure is not so different from a one-sample t-test (which is not so different
from a z-test) but we will work out an example anyways!
Example: Blorg’s neighboring planet Glorf also has some subset of the population with multiple (and
fractional) nipples. Everyone actually suspects that the Glorfites migrated over from Blorg and are the
same species. Scientists determined that the only way to know for sure is to check if there is a statistically
significant difference in the two populations in relation to nipple number. We pick two samples (N = 10),
one Glorfite and one Blorgian. We observe that the Glorfite sample has on average 3.7 nipples, with a
standard deviation of 0.3 (variance of 0.09), and the Blorgians have 3.9 nipples with a standard deviation of
0.3 (variance of 0.09). The closeness of the variance of the two samples is necessary for the two-sample t-test
and is referred to as homogeneity of variance. We also must have that the two populations are normally, or
close to normally, distributed. Now we begin the calculations!
First we must calculate the observed difference between the two sample means: x̄B − x̄G = 3.9 − 3.7 = 0.2. The claim that
has been made is that there is no difference between the two populations, so we state our null and alternative
hypotheses as follows:
H0: µB − µG = 0
H1: µB − µG ≠ 0.
The standard error of the difference of the means is given as:
\[ SE_{M_1 - M_2} = \sqrt{\frac{s_1^2 + s_2^2}{N}} = \sqrt{\frac{0.3^2 + 0.3^2}{10}} = 0.134. \]
Next, we compute our t-statistic by:
\[ t = \frac{\text{observed} - \text{hypothesised}}{\text{standard error}} = \frac{0.2 - 0}{0.134} = 1.49. \]
Using degrees of freedom = 10 - 1 = 9, we see that a t-statistic of 1.49 corresponds to a p-value of
approximately 0.17 for a two-sided t-test. This is much higher than our cut-off of 0.05, and so we fail to
reject the null. [Note, however, we used an extremely small sample size, and the result very well might not
have been the same had we used more of the population.]
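Here is a short Python sketch of the same calculation, using the summary statistics above and comparing the t-statistic against a tabulated two-sided critical value for df = 9 (about 2.262) rather than computing a p-value:

```python
from math import sqrt

# Summary statistics for the two samples (equal sizes).
mean_B, sd_B = 3.9, 0.3   # Blorgians
mean_G, sd_G = 3.7, 0.3   # Glorfites
N = 10                    # observations per group

se_diff = sqrt((sd_B**2 + sd_G**2) / N)       # about 0.134
t_stat = (mean_B - mean_G - 0) / se_diff      # about 1.49

df = N - 1                                    # the course's convention: 10 - 1 = 9
t_crit = 2.262                                # two-sided critical t for alpha = 0.05, df = 9 (from a t-table)

print(f"t = {t_stat:.2f}, critical t = {t_crit}")
print("reject H0" if abs(t_stat) > t_crit else "fail to reject H0")
```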
6.3 Comparing Multiple Means (ANOVA)
Sometimes, if we wish to compare multiple means (more than 2), we must consider an alternative method
other than the t-test. Technically, we could perform as many pairwise comparisons as needed to come to
a conclusion, but this can be tiring and tedious. It also increases our chances of making a Type 1 error
(because we have a chance to make one at every test), though it decreases our chance of making a Type 2
error (because we have several chances, rather than one, to reject the null hypothesis).
We would like to think of a single test which would efficiently and easily perform a comparison between
multiple means. Such a test is the ANOVA (or ANalysis Of VAriance). ANOVA can only determine whether
at least one population mean is different from at least one other population mean, but not which mean is
different. If we wish to find that out, we perform other (usually pairwise) tests called post-hoc tests after
the ANOVA.
Example: The planet on the other side of Blorg, Flugle, is also suspected of being composed of migrated
Blorgians. In addition to the samples above, we also pick 10 Fluglers, who have on average 3.2 nipples with
a standard deviation of 0.3 (variance 0.09). We state our hypotheses:
H0: µF = µG = µB
H1: not all of the population means are equal.
For ANOVA tests, we use a statistic called the F-statistic, which depends on several parameters including:
number of groups r (here, 3), combined sample size N (here, 30), and α (here, 0.05 as usual). The critical
value of F is denoted as F(r−1, N−r, α) = F(2, 27, 0.05). The first value, r − 1, is called the numerator degrees of
freedom and the second, N − r, the denominator degrees of freedom. Our critical F-value is 3.35
(from the F-statistic table in the lecture slide appendix).
The calculation of the F-statistic is somewhat complicated (and we won’t work it out here) but we give the
formula:
\[ F = \frac{\text{between-group variability}}{\text{within-group variability}} = \frac{\sum_i n_i(\bar{X}_i - \bar{X})^2 / (r-1)}{\sum_{ij}(X_{ij} - \bar{X}_i)^2 / (N-r)}. \]
Here, r and N are as defined above, ni is the size of an individual group i, and X̄i is the mean of that group,
while X̄ is the mean of the entire data set, and Xij is an individual observation (number “j”) in group i.
Since this is a pretty tedious calculation, we won’t do it out here, but let’s assume that the F-value was less
than the critical value of 3.35, and the Blorgians were correct in assuming that they are the sole source of
intelligent life in their immediate surroundings.
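If you ever do need to grind through the formula, it is straightforward to code up. The sketch below implements the F-statistic exactly as written above and runs it on some made-up raw data (not the Blorg/Glorf/Flugle summary statistics, which we agreed not to work out); if SciPy is available, scipy.stats.f_oneway should give the same F for raw data like this:

```python
def anova_f(groups):
    """One-way ANOVA F-statistic, implementing the formula above."""
    r = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    group_means = [sum(g) / len(g) for g in groups]

    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, group_means)) / (r - 1)
    within = sum((x - m) ** 2
                 for g, m in zip(groups, group_means) for x in g) / (N - r)
    return between / within

# Hypothetical nipple counts for three small groups (made-up numbers).
groups = [[3.1, 3.4, 3.3, 3.0, 3.2],
          [3.3, 3.5, 3.2, 3.4, 3.6],
          [3.1, 3.3, 3.2, 3.4, 3.2]]
print(round(anova_f(groups), 2))   # compare against the tabulated critical F
```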
7 Correlation and Regression
The final section (thank god!) has to do with correlation and regression, which are both methods to evaluate and quantify the relationship between two (quantitative) variables. One of the variables is called the
dependent variable and the other the independent variable. The dependent variable is usually the factor we
are measuring or interested in, such as disease prevalence or outcome of a treatment, while the independent
variable is something we freely control, like dosage level or exposure to a carcinogen. Data points are usually
graphically represented in a scatter plot, such as the one shown in Figure 5.
Figure 5: Scatter plot denoting cigarette use vs. kidney disease.
7.1 Correlation
In this section, we talk about Pearson’s correlation (ρ in a population, r in a sample), which is defined as a
measure of strength of the linear relationship between two variables. If a relationship between two variables
exists but is not linear, then this coefficient may not be adequate to describe the relation. This coefficient has
a value between -1 and 1, with r = −1 denoting a perfect negative relationship between the two variables,
r = 1 denoting a perfect positive relationship between variables, and r = 0 denoting that there is no (linear)
relationship.
7.2 Simple Linear Regression
Going one step further than correlation, regression is used to denote a functional relationship between two
variables by fitting a line to bivariate data points. The equation denoting a relationship between variables
x and y is given as:
y = a + bx
where x is the independent and y is the dependent variable, b is the slope of the line, and a is the intercept,
the y-value at which the line crosses the y-axis (that is, the value of y when x = 0). Since there is almost
no way that there will be a single line that goes
perfectly through all points, there will be some distance between the points and the line. We call this the
residual, and calculate it by:
residual = observed y - predicted y
The least squares line is the one which minimizes this error. To calculate the parameters a and b for this
line, we use the following formulas:
\[ b = r\left(\frac{s_y}{s_x}\right), \qquad a = \bar{y} - b\bar{x}, \]
where s_y and s_x are the sample standard deviations of y and x respectively, r is the correlation coefficient,
and x̄ and ȳ are the sample means.
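As an illustration, the sketch below computes r, b, and a from a small made-up data set using exactly these formulas (standard library only):

```python
from statistics import mean, stdev

# Made-up bivariate data (x = cigarettes per day, y = some disease score).
x = [0, 2, 4, 6, 8, 10]
y = [1.0, 1.8, 2.9, 3.5, 4.8, 5.4]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Pearson correlation coefficient r = sample covariance / (s_x * s_y).
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# Least squares slope and intercept, using the formulas above.
b = r * (s_y / s_x)
a = y_bar - b * x_bar

print(f"r = {r:.3f}, slope b = {b:.3f}, intercept a = {a:.3f}")
```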
7.3 Multiple Linear Regression
Often, there are multiple factors that affect a certain outcome. In this case, we need to consider more
than one independent variable, and so we perform multiple linear regression. In this course, we won’t really
be concerned much with multiple linear regression except to note how changing each independent variable
affects the dependent variable.
Example: Attractiveness (A) on Blorg is a combination of three factors: number of nipples (n), how blue
one is (which Blorgians rate on a continuous scale: 0 ≤ b ≤ 10), and intelligence (which Blorgians also rate
on a continuous scale: 0 ≤ i ≤ 10). The relationship is given by the following equation:
A = 2.1n − 2.3b + 0.8i + 1.4.
(Intelligence is not that important to the Blorgians.)
From this equation, we can see how changes in any of these attributes affect attractiveness. For example, if
one loses a nipple (somehow), one’s attractiveness goes down by 2.1 units. Conversely, if one were to find that
nipple someone else lost, then that person’s attractiveness would increase by 2.1 units. In another example, if
a Blorgian fell into a tub of permanent paint (which exists on Blorg, I guess) and became less blue by 3 units,
his/her attractiveness would increase (because there is a negative sign before the blueness term) by 6.9 units!
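A tiny sketch makes the coefficient interpretation explicit (the baseline Blorgian here is made up):

```python
def attractiveness(n, b, i):
    """Blorgian attractiveness from the regression equation above."""
    return 2.1 * n - 2.3 * b + 0.8 * i + 1.4

base = attractiveness(n=4, b=7, i=5)

# Each coefficient is the change in A per one-unit change in that variable,
# holding the others fixed.
print(round(attractiveness(3, 7, 5) - base, 1))   # lose a nipple:     -2.1
print(round(attractiveness(4, 4, 5) - base, 1))   # 3 units less blue: +6.9
```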
We can do this for any number of independent variables. Most likely anything more complicated would
be done using software, which we have not covered this term, so don’t worry about more complicated
problems!
Thanks for reading! Again, please send me any corrections that you find!!