University of Saskatchewan - The Department of Mathematics

University of Saskatchewan
College of Arts & Science
Department of Mathematics and Statistics
STATS 245.3 (02): Introduction to Statistical Methods
Introduction
What is Statistics? Any beginning statistics student should know at least one version of the definition
of Statistics. The authors of our textbook provide the following definition: “Statistics as a subject
provides a body of principles and methodology for designing the process of data collection,
summarizing and interpreting the data, and drawing conclusions or generalities [from the data].”
This course provides an introduction to each of these three areas which together form the field of
Statistics.
The branch of statistics referred to as experimental design deals specifically with the collection of
data. The branch of statistics referred to as descriptive statistics deals with summarizing the data
after it has been collected. The branch of statistics referred to as inferential statistics deals with
assessing and evaluating the data and drawing conclusions based on this assessment.
Common to all three parts of the definition of Statistics is data. Data, simply put, is the information
that is recorded during some investigation. Data can be either numerical (sometimes referred to as
measurement) or qualitative (sometimes referred to as categorical).
Information that can be represented or measured numerically is referred to as numerical (or
measurement) data. Examples of this are such properties as heights, weights, etc. Information that
cannot be measured numerically but classified by categories is referred to as qualitative (or
categorical) data. Examples of this type of data are properties such as eye colour, gender, etc.
Whether the data to be collected is numerical or categorical, the data is based on observations of
some characteristic of a collection of objects. The individual entities, whether a person or object,
whose characteristics are of interest are referred to as units. A population is the complete collection
of the units from which information is sought. A statistical population is the set of measurements (or
record of some qualitative trait) on the population.
Examples:
Population          Unit          Characteristic
Populations are generally quite large and perhaps even infinite in size. As a result, it is usually
extremely difficult, and often impossible, to analyze the entire population.
In these instances when the population size is too large or too difficult (or both) to analyze, we have
to be content with a smaller subset of the original population. This smaller subset is referred to as a
sample. In order to get an accurate representation of the general population, the subjects to be
included in the sample must be selected randomly from the general population.
Example:
Properties of our random sample usually constitute our data. Once we have collected our data, then
what?
Describing A Data Set (Descriptive Statistics)
Numerical Data: Suppose we have a data set of n readings (heights, weights, etc.). We denote the
individual readings x1, x2, x3, …,xn . We denote the i’th reading by xi. Sometimes we wish to add the
readings, that is
x1+x2+x3+…+xn
A short-hand notation for the above is Sigma notation, that is
$x_1 + x_2 + x_3 + \dots + x_n = \sum_{i=1}^{n} x_i$.
We can use Sigma notation for other expressions. For example,
$\sum_{k=1}^{m} x_k^2 = x_1^2 + x_2^2 + \dots + x_m^2$
$2y_1 + 2y_2 + 2y_3 + \dots + 2y_n = \sum_{i=1}^{n} 2y_i$.
Example: Suppose a research scientist records four observations: x1=3, x2=1, x3=0, x4=4.
Compute the following:
1. $\sum_{i=1}^{4} x_i$
2. $\sum_{i=1}^{4} x_i^2$
3. $\sum_{i=1}^{4} (x_i - 2)^2$
4. $\left( \sum_{i=1}^{4} x_i \right)^2$
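The four sums in this example can be checked with a short Python sketch. It assumes the third expression is the sum of (x_i - 2)^2; the variable names are illustrative only.

```python
# Observations from the example: x1=3, x2=1, x3=0, x4=4
x = [3, 1, 0, 4]

sum_x = sum(x)                                    # 1. sum of the x_i
sum_x_sq = sum(xi ** 2 for xi in x)               # 2. sum of the squared x_i
sum_shifted_sq = sum((xi - 2) ** 2 for xi in x)   # 3. sum of (x_i - 2)^2 (assumed form)
sq_of_sum = sum(x) ** 2                           # 4. square of the sum of the x_i

print(sum_x, sum_x_sq, sum_shifted_sq, sq_of_sum)  # prints: 8 26 10 64
```

Note that the sum of the squares (26) and the square of the sum (64) are different quantities.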
Measures of Central Tendency
A numerical value describing a sample is called a statistic. A numerical value describing a
population is called a parameter. We often use statistics to estimate parameters. One of the most
common examples of a statistic is the sample mean (or average). Given individual readings x1, x2,
x3, …,xn, the sample mean is denoted $\bar{x}$ and is calculated using the formula
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$.
Example: Find the average of the five readings 2.1, 3.2, 4.1, 5.6, and 3.7.
Notes: When doing any calculations, rounding off introduces error. Whenever possible, do NOT
round off until you are ready to report the result of your calculation!!!!
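As a quick check of the sample mean formula and the rounding advice, the following Python sketch computes the mean of the five readings at full precision and rounds only when reporting. The variable names are illustrative.

```python
# Five readings from the example
readings = [2.1, 3.2, 4.1, 5.6, 3.7]

# Sample mean: sum of the readings divided by the number of readings
mean = sum(readings) / len(readings)

# Round only at the reporting step, not during the calculation
print(f"sample mean = {mean:.2f}")
```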
The sample mean $\bar{x}$ is an example of a statistic and it is quite often used to estimate the parameter
$\mu$, the population mean. Why? Quite often it is unreasonable (if not impossible) to determine the
population mean $\mu$. Using $\bar{x}$ to estimate $\mu$ is a topic from “inferential statistics” which we will be
discussing later in the course.
Both the population mean $\mu$ and the sample mean $\bar{x}$ are measures of central tendency. $\mu$ is the
average or “central” value for the population and $\bar{x}$ is the average or “central” value for the sample.
The mean (or average) is not the only measure of central tendency. Two other measures of central
tendency are the median and mode.
The sample median is denoted $\tilde{x}$. The median represents the middle value of a set when the set is
expressed in ascending or descending order. If there is an odd number of elements in this ordered
set, then the median is the middle value. If there is an even number of elements in this ordered set,
then the median is the average of the two middle values.
Example: Find the median time of the following times: 37, 49.7, 43, 45.1, 44.6.
Example: Find the median height of the following set: 123 cm, 178 cm, 167 cm, 159 cm.
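The odd and even cases of the median rule can be sketched in Python as follows. The function name `sample_median` is an illustrative choice, not something taken from the textbook.

```python
def sample_median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average of the two middle values

times = [37, 49.7, 43, 45.1, 44.6]   # five values (odd)
heights = [123, 178, 167, 159]       # four values (even), in cm

print(sample_median(times))    # 44.6
print(sample_median(heights))  # 163.0
```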
The sample mode is also another measure of central tendency. The mode is the value which occurs
the most frequently in a sample.
Example: Find the mode of the following shoe sizes: 9, 5, 8, 6, 9, 7.
What happens if no value occurs more than once in a data set?
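A small Python sketch makes the question concrete. It returns every value that attains the highest frequency; the function name `sample_modes` is illustrative. When no value repeats, every value ties for the highest frequency, and one common convention is then to say the data set has no mode.

```python
from collections import Counter

def sample_modes(values):
    """Return all values that occur most frequently, in increasing order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

shoe_sizes = [9, 5, 8, 6, 9, 7]
print(sample_modes(shoe_sizes))    # [9]
print(sample_modes([5, 6, 7, 8]))  # [5, 6, 7, 8]: every value occurs once
```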
Which measure of central tendency is the “best”? The answer is very problem specific. For
instance, calculate the mean and median of the following survival times:
3, 15, 46, 64, 126, 623.
Conclusions:
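One way to explore this numerically is the Python sketch below, which uses the standard `statistics` module. Recomputing both measures without the largest value (623) is an added illustration, not part of the original exercise.

```python
import statistics

survival_times = [3, 15, 46, 64, 126, 623]

mean_all = statistics.mean(survival_times)
median_all = statistics.median(survival_times)

# Same calculation with the extreme value 623 left out (illustration only)
trimmed = [t for t in survival_times if t != 623]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"all six values:  mean = {mean_all:.2f}, median = {median_all}")
print(f"without 623:     mean = {mean_trimmed:.2f}, median = {median_trimmed}")
```

The mean moves a great deal when the single extreme value is removed, while the median changes comparatively little.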
If the number of observations is quite large (greater than or equal to 30, for instance), the idea of
median can be extended to dividing the data into quarters, tenths, or even hundredths!!! Just as the
point which divides a data set in half is called the median, the points which divide the data set into
quarters are called quartiles, the points which divide a data set into tenths are called deciles, and the
points which divide a data set into hundredths are called percentiles.
Definition: The sample 100p-th percentile is a value such that after the data are ordered from
smallest to largest, at least 100p% of the observations are at or below this value and at least
100(1 - p)% are at or above this value.
The deciles D1, D2, D3, D4, D5, D6, D7, D8, and D9 are the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, and
90th percentiles, respectively.
The quartiles Q1, Q2, and Q3 are the 25th, 50th, and 75th percentiles, respectively.
Calculating the Sample 100p-th Percentile:
1. Order the data from smallest to largest
2. Calculate np.
3. If np is not an integer, round it up to the next integer and find the corresponding ordered value. If
np is an integer, say k, calculate the average of the k’th and the (k+1)’st ordered values.
Example: The times (in seconds) of 50 runners in a marathon were:
238  239  240  240  240  241  242  242  242  242
243  243  243  243  244  244  244  244  245  245
245  245  245  245  246  246  246  246  246  246
246  246  247  247  247  247  248  248  248  248
248  248  248  249  249  249  250  250  250  251
Locate the quartiles and also compute the first decile.
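The three-step percentile rule can be sketched in Python as follows. To avoid floating-point trouble when testing whether np is an integer, this sketch takes the percentile as a whole-number percent; that interface and the name `sample_percentile` are assumptions made for illustration. The data are rebuilt from the frequency of each time in the list above.

```python
def sample_percentile(data, percent):
    """Sample percentile using the round-up / average rule; percent is an integer, 0 < percent < 100."""
    if not 0 < percent < 100:
        raise ValueError("percent must be strictly between 0 and 100")
    ordered = sorted(data)                    # step 1: order the data
    n = len(ordered)
    k, remainder = divmod(n * percent, 100)   # step 2: np = k + remainder/100
    if remainder:                             # np not an integer: round up to k+1
        return ordered[k]                     # (k+1)'th ordered value is at index k
    return (ordered[k - 1] + ordered[k]) / 2  # np = k: average k'th and (k+1)'st values

# Marathon times, rebuilt from how often each value appears in the list
counts = {238: 1, 239: 1, 240: 3, 241: 1, 242: 4, 243: 4, 244: 4,
          245: 6, 246: 8, 247: 4, 248: 7, 249: 3, 250: 3, 251: 1}
times = [t for t, c in counts.items() for _ in range(c)]

q1 = sample_percentile(times, 25)
q2 = sample_percentile(times, 50)
q3 = sample_percentile(times, 75)
d1 = sample_percentile(times, 10)
print(q1, q2, q3, d1)  # 243 246.0 248 240.5
```

Other texts and software packages use slightly different percentile rules, so their answers may differ a little from the rule used in these notes.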
Measures of Variation
The simplest measure of the variability among a set of data is the range. The range of a sample is
simply the largest value in the set minus the smallest value in the set.
Example: Find the range of the following shoe sizes: 9, 5, 8, 6, 9, 7:
Find the range of the following survival times: 3, 15, 46, 64, 126, 623.
Find the range of the marathon runners’ times.
Consider the ranges of the sample 2, 3, 4, 5, 6, 7 and the sample 2, 6, 6, 6, 6, 7. Though both
samples have a range of 5, the first sample is certainly more spread out than the second sample.
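A minimal Python sketch of the range confirms that the two samples share the same range even though they are spread out differently. The function name `sample_range` is illustrative.

```python
def sample_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

sample_a = [2, 3, 4, 5, 6, 7]
sample_b = [2, 6, 6, 6, 6, 7]

print(sample_range(sample_a), sample_range(sample_b))  # 5 5
```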
We would like a statistic which measures the variability in the sample associated with how the data is
spread out. One such method uses $\bar{x}$, the center of the sample, as a “frame of reference”.
Because $\bar{x}$ is a measure of the center of a data set, one method of determining the variability of the
individual data points $x_i$ about the center is their deviation from the mean, that is $x_i - \bar{x}$.
Example:
Observation $x_i$    Deviation $x_i - \bar{x}$
3
5
7
7
8
$\bar{x}$ =          $\sum [x_i - \bar{x}]$ =
We refer to the quantity $\sum [x_i - \bar{x}]$ as the total deviation. It is not a good measure of the variation
within a data set because the total deviation is always 0 (to be proved on Assignment #1). To obtain
a measure of how “spread out” the data is from the mean, we need to eliminate the negative signs
of the individual deviations. One way is to square the individual deviations. If we add these
individual squared deviations and divide this sum by one less than the number of data points, we get
the sample variance, denoted $s^2$, with n-1 degrees of freedom.
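The steps just described (deviations, squared deviations, division by n-1) can be traced in Python for the observations 3, 5, 7, 7, 8. This is a sketch for checking hand calculations; the variable names are illustrative.

```python
observations = [3, 5, 7, 7, 8]
n = len(observations)

mean = sum(observations) / n                     # sample mean
deviations = [x - mean for x in observations]    # x_i minus the mean
total_deviation = sum(deviations)                # zero for this sample
squared_deviations = [d ** 2 for d in deviations]
variance = sum(squared_deviations) / (n - 1)     # divide by n-1, not n

print(mean, deviations, total_deviation)
print(squared_deviations, variance)
```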
Example:
Observation $x_i$    Deviation $x_i - \bar{x}$    Squared Deviation $(x_i - \bar{x})^2$
3
5
7
7
8
$\sum x_i$ =         $\sum [x_i - \bar{x}]$ =     $\sum [x_i - \bar{x}]^2$ =
$\bar{x}$ =
$s^2 = \frac{\sum [x_i - \bar{x}]^2}{n - 1}$ =
Now we define the sample standard deviation, denoted s, as
$s = \sqrt{s^2} = \sqrt{\frac{\sum [x_i - \bar{x}]^2}{n - 1}}$.
We use the sample standard deviation, not the sample variance, as a basic measure of variability.
Example:
1. a. Compute the sample standard deviations for
i. 1, 4, 5, 9, 11
ii. 3, 5, 7, 7, 8
b. Compare the two standard deviations you calculated above.
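The two standard deviations in the example can be compared with the Python sketch below, which follows the definition of s directly. The function name `sample_std` is illustrative; the standard library function `statistics.stdev` uses the same n-1 divisor and could be used instead.

```python
import math

def sample_std(values):
    """Sample standard deviation: square root of the sum of squared deviations over n-1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))

sample_i = [1, 4, 5, 9, 11]
sample_ii = [3, 5, 7, 7, 8]

s_i = sample_std(sample_i)
s_ii = sample_std(sample_ii)
print(s_i, s_ii)  # 4.0 2.0
```

Both samples have mean 6, but the first is more spread out about that mean, which is reflected in its larger standard deviation.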
The formula used to calculate the sample variance, $s^2 = \frac{\sum [x_i - \bar{x}]^2}{n - 1}$, is somewhat tedious to
calculate. An equivalent, but somewhat more calculation friendly, formula for the sample variance is
$s^2 = \frac{1}{n - 1} \left[ \sum x_i^2 - \frac{\left( \sum x_i \right)^2}{n} \right]$. Proving the equivalency will be a question on Assignment 1.
Example: Calculate the sample standard deviation of 4, 2, 3, 3, 6, 3.
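As a numerical check (not a proof) that the two variance formulas agree, the following Python sketch evaluates both for the data 4, 2, 3, 3, 6, 3. The function names are illustrative.

```python
import math

def variance_definition(values):
    """Sum of squared deviations from the mean, divided by n-1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def variance_shortcut(values):
    """Calculation-friendly form: (sum of x^2 - (sum of x)^2 / n) / (n-1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x_sq = sum(x ** 2 for x in values)
    return (sum_x_sq - sum_x ** 2 / n) / (n - 1)

data = [4, 2, 3, 3, 6, 3]
var_def = variance_definition(data)
var_short = variance_shortcut(data)
std = math.sqrt(var_short)

print(f"s^2 = {var_def:.4f} (definition), {var_short:.4f} (shortcut), s = {std:.4f}")
```

For this data set the sum of the values is 21 and the sum of the squares is 83, so the shortcut gives (83 - 441/6)/5 = 1.9.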
Another measure of variation is the sample interquartile range, which is the length of the interval
covering the center half of the data points.
Example: Calculate the sample interquartile range for the Marathon runner example.
Describing A Data Set with a Table
In a data set, the frequency of a value is the number of times the value appears in the data set. The
relative frequency of a value is the proportion of times the value appears in a data set.
Example: On a Stats 245 quiz (marked out of 5), the grades of twenty students were
5  4  2  5  4  2  1  4  4  2
3  3  3  3  5  4  3  5  0  3
Determine the frequency and relative frequency of each grade.
The table we completed above is referred to as the Frequency Distribution Table.
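The frequencies and relative frequencies of the quiz grades can be tabulated with a short Python sketch using `collections.Counter`. The variable names are illustrative.

```python
from collections import Counter

# The twenty quiz grades (marked out of 5)
grades = [5, 4, 2, 5, 4, 2, 1, 4, 4, 2,
          3, 3, 3, 3, 5, 4, 3, 5, 0, 3]

counts = Counter(grades)
n = len(grades)

# Frequency and relative frequency for every possible grade 0..5
frequency = {g: counts[g] for g in range(6)}
relative_frequency = {g: counts[g] / n for g in range(6)}

print("Grade  Freq.  Rel. Freq.")
for g in range(6):
    print(f"{g:>5}  {frequency[g]:>5}  {relative_frequency[g]:>10.2f}")
```

The frequencies add up to 20 and the relative frequencies add up to 1, which is a useful check on any frequency table.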
Grouped Frequency Distribution Table
Quite often when the range of the data set is large, the individual frequencies do not reveal much
information about the data. We commonly group the data into intervals. If the intervals are not given
a priori, a rule of thumb is to form TEN intervals. This by no means always holds (for instance, when the
range is small or when there is a lot of data). With experience, one is able to determine an appropriate
number of intervals to represent his/her data. The intervals should not overlap and should continuously
cover the range of the data.
A template for the Grouped Frequency Distribution Table is as follows:
Interval   Midpoint   Tally   Freq.   Cumulative Frequency   Percentage Frequency   Cumulative Per. Freq.
Example: The times (in seconds) of 50 runners in a marathon were:
246  243  244  244  242  246  250  242  244  247
238  245  246  248  245  248  242  246  240  250
246  243  249  240  239  248  243  246  245  247
251  241  246  243  244  249  245  245  247  248
240  248  245  249  246  248  242  247  248  250
Suppose the classes for the runners were defined (inclusively) by:
237-239
240-242
243-245
246-248
249-251
Complete a Grouped Frequency Distribution Table.
Interval   Midpoint   Tally   Freq.   Cumulative Frequency   Relative Frequency   Cumulative Rel. Freq.
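The columns of the grouped table (other than the tally) can be produced with the Python sketch below, using the five inclusive classes given above. The variable names and the print layout are illustrative.

```python
# The 50 marathon times as listed in the example
times = [246, 243, 244, 244, 242, 246, 250, 242, 244, 247,
         238, 245, 246, 248, 245, 248, 242, 246, 240, 250,
         246, 243, 249, 240, 239, 248, 243, 246, 245, 247,
         251, 241, 246, 243, 244, 249, 245, 245, 247, 248,
         240, 248, 245, 249, 246, 248, 242, 247, 248, 250]

# Inclusive class limits from the example
classes = [(237, 239), (240, 242), (243, 245), (246, 248), (249, 251)]
n = len(times)

midpoints = [(low + high) / 2 for low, high in classes]
freqs = [sum(low <= t <= high for t in times) for low, high in classes]

# Running totals give the cumulative frequencies
cumulative = []
running = 0
for f in freqs:
    running += f
    cumulative.append(running)

print("Interval  Midpoint  Freq.  Cum. Freq.  Rel. Freq.  Cum. Rel. Freq.")
for (low, high), mid, f, cf in zip(classes, midpoints, freqs, cumulative):
    print(f"{low}-{high}   {mid:>7.1f}  {f:>5}  {cf:>10}  {f / n:>10.2f}  {cf / n:>15.2f}")
```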
Notes on a Grouped Frequency Distribution Table:
1. Suitable class sizes are subjective and depend on the data. Usually the class sizes should be
of the same width. If the class sizes are too wide, then too much of the information is lost. The
graphic representation of the Grouped Frequency Distribution (a histogram, to be discussed in
the next section) would appear to be “box-like”. If the class sizes are too narrow, then the
graphic representation (a histogram) has many little boxes and little discernible information
can be obtained from the graph.
2. It is a good strategy to set the class boundaries (which we use to form the class intervals) at
values where no data points lie. Usually the trick we used above works. In the above
example, our data consisted of whole numbers, so we were able to use ###.5’s as the
boundaries. If our data consisted of data of the form $###.##, our class boundaries would
have to be of the form $###.##5.
Describing A Data Set Graphically
We will look at some of the more popular graphical representations of data.
1. Pie Chart: The Manitoba Government decided that in the 1994 fiscal year, each dollar will be spent
as follows: Health $0.25, Education $0.25, Highways $0.15, Administration $0.20, Parks $0.10, and
Other $0.05. Draw the corresponding Pie Chart:
2. Pareto Diagram: A Pareto Diagram displays events according to their frequency. The events are
plotted in descending frequency with the miscellaneous (or other) category always being plotted last.
Example: Draw a Pareto Diagram for the grades of the Stats 245 quiz (marked out of 5).
3. Line (or Stick) Diagram: Line Diagrams are generally used to graphically illustrate the frequency
(or relative frequency) of discrete events. We always label the horizontal axis with the events and the
vertical axis with the frequencies (or relative frequencies). We should take particular note that if we
are graphing the events vs the relative frequencies, the labeling on the vertical axis should always
start with 0 (zero) and finish with 1 (one).
Example: Draw a Line Diagram for the frequency of the grades of the Stats 245 quiz (marked out of
5).
Example: Draw a Line Diagram for the relative frequency of the grades of the Stats 245 quiz (marked
out of 5).
4. Dot Diagram: Suppose we have the following set of measurements (in cm):
{37.0, 43.0, 44.6, 45.1, 37.0, 49.7}
The corresponding Dot Diagram plots the data points in the sample as well as the sample mean on a
number line with integer labellings:
5. Stem-and-Leaf Display: A stem-and-leaf display can be thought of as a variation of the histogram (to
be discussed in a moment), especially when the observations are two-digit numbers. To draw a
stem-and-leaf display,
a. List the digits 0 to 9 in a column and draw a vertical line. These correspond to the leading
digits of the observations.
b. For each observation, record its second digit to the right of this vertical line in the row
where the first digit appears.
c. Arrange the second digits in each row so that they are in increasing order.
The column of the first digits is referred to as the stem and the second digits are the leaves.
Example: Construct a Stem-and-Leaf Display of the following Stats 245 grades:
75  78  15  90  98  99  57  62  42  90
68  58  75  80  60  49  84  89  77  52
Stem-and-Leaf Display for Stats 245 Grades
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
To see variations of the Stem-and-Leaf Display for data, other than two-digit whole numbers, please
refer to the textbook.
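Steps a to c for two-digit whole numbers can be sketched in Python as follows. The function name `stem_and_leaf` is illustrative, and the sketch only handles whole numbers from 0 to 99.

```python
def stem_and_leaf(values):
    """Map each leading digit 0..9 (stem) to its sorted list of second digits (leaves)."""
    rows = {stem: [] for stem in range(10)}  # step a: stems 0 to 9
    for v in sorted(values):                 # sorting first keeps the leaves ordered (step c)
        stem, leaf = divmod(v, 10)           # step b: first digit and second digit
        rows[stem].append(leaf)
    return rows

grades = [75, 78, 15, 90, 98, 99, 57, 62, 42, 90,
          68, 58, 75, 80, 60, 49, 84, 89, 77, 52]

display = stem_and_leaf(grades)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```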
6. Boxplots: A Boxplot is a graphical illustration of the data which is based on the information in the
quartiles. To construct a Boxplot, you need the minimum and maximum observations and the values
Q1, Q2, and Q3.
Example: Draw a Boxplot for the “Marathon Times” example.
7. Scatter Diagram: A Scatter Diagram can be used to display data sets consisting of two variables,
where each variable is measured numerically. We would call this data set bivariate. A Scatter
Diagram is a first step when trying to determine a relationship between the two variables. Graphically,
one variable is represented by the horizontal axis and the other variable is represented by the vertical
axis.
Example: Let y denote the height (centimeters relative to a fixed reference level) of the water in a
reservoir, and let x be the monthly rainfall (centimeters). The data for a period of 14 months is as
follows.
y   30  -10    0   50   10  -40   -5   25  -15  -60   15   35   20   -5
x    5    2    1    9    6    3    4    8    6    0    5    7   10    9
Draw a Scatter Diagram for the above information.
8. Histogram: The histogram is the most common graph used to represent data pictorially. A
histogram can be used to display frequencies (or relative frequencies).
Example: Construct a histogram for the frequency and also the relative frequency of the grades of the
Stats 245 quiz (marked out of 5).
Example: Construct a histogram for the grouped frequencies of the marathon runners.
We can estimate the mean and standard deviation of a data set from the information stored in a
histogram:
The Mean
$\bar{y} = \frac{1}{n} \sum_{i=1}^{\#\,\text{intervals}} y_i f_i$, where $y_i$ is the midpoint of the i’th interval and
$f_i$ is the frequency of the i’th interval.
The Standard Deviation
$s^2 = \frac{n \sum_{i=1}^{\#\,\text{intervals}} y_i^2 f_i - \left( \sum_{i=1}^{\#\,\text{intervals}} y_i f_i \right)^2}{n(n - 1)}$, where $y_i$ is the midpoint of the i’th interval and
$f_i$ is the frequency of the i’th interval. The standard deviation s is the square root of $s^2$.
Example: For the previous example, use the above two formulae to estimate the mean and standard
deviation. Compare these values to the mean and standard deviation calculated from the raw data:
mean=245.28 and st. dev. =3.11.
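The two grouped-data formulae can be evaluated with the Python sketch below. The class frequencies used here (2, 8, 14, 19, 7) are an assumption: they were obtained by counting the listed marathon times in the classes 237-239 through 249-251, and should be checked against your own completed table.

```python
import math

# Class midpoints and frequencies for the marathon data
# (frequencies are assumed from counting the listed times in each class)
midpoints = [238, 241, 244, 247, 250]
freqs = [2, 8, 14, 19, 7]

n = sum(freqs)
sum_yf = sum(y * f for y, f in zip(midpoints, freqs))        # sum of y_i * f_i
sum_y2f = sum(y ** 2 * f for y, f in zip(midpoints, freqs))  # sum of y_i^2 * f_i

grouped_mean = sum_yf / n
grouped_var = (n * sum_y2f - sum_yf ** 2) / (n * (n - 1))
grouped_std = math.sqrt(grouped_var)

print(f"estimated mean = {grouped_mean:.2f}, estimated st. dev. = {grouped_std:.2f}")
```

The estimates differ slightly from the raw-data values because every observation in a class is treated as if it were equal to the class midpoint.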
Chapter 3
When discussing the topic of Scatter Diagrams, we were introduced to the idea of collecting numerical
data for two distinct variables. We called this bivariate data. Generally when one collects data for more
than one variable, s/he wishes to determine:
1. if any relationship exists between the different variables,
2. how strong the relationship is, and
3. whether one variable can be predicted from the other variable(s).
When two traits which can be recorded in some qualitative, categorical manner are observed for each
sampling unit, the information can be represented in a two-way frequency table. The categories for
one trait are labeled along the left margin of the table and the categories for the second trait are
labeled along the top margin. The frequencies are stored in the corresponding cells. This two-way
frequency table is called a contingency table.
From the information contained in a contingency table we can create two different tables: the relative
frequency table and the group relative frequency table.
Example: For the following contingency table, complete the corresponding relative frequency table
and the group relative frequency table.
                      Died    Lived    Total
Research Hospital       90     2110     2200
Community Hospital      23      677      700
Total                  113     2787     2900
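The two derived tables can be sketched in Python as follows. This sketch assumes that the relative frequency table divides every cell by the grand total, and that the group relative frequency table divides each cell by the total for its own hospital (row); check the textbook for the exact definition used in the course.

```python
# Contingency table from the example (cell counts only, totals are recomputed)
table = {
    "Research Hospital": {"Died": 90, "Lived": 2110},
    "Community Hospital": {"Died": 23, "Lived": 677},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Relative frequency: each cell divided by the grand total
relative = {
    hospital: {outcome: count / grand_total for outcome, count in row.items()}
    for hospital, row in table.items()
}

# Group relative frequency (assumed): each cell divided by its own row total
group_relative = {
    hospital: {outcome: count / sum(row.values()) for outcome, count in row.items()}
    for hospital, row in table.items()
}

for hospital in table:
    print(hospital, relative[hospital], group_relative[hospital])
```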
Notes:
(1) One must be careful when combining the information from two contingency tables from
different sources into one table. The information found in this new table can be misleading.
There is a risk that some unreported variable(s) may cause a reversal of the findings recorded
in the original tables. This reversal of the findings is referred to as Simpson’s Paradox. For an
explicit example, please refer to the textbook pages 93 to 94.
(2) Quite often one of the traits in the contingency table is the placebo effect, that is the tendency
for a person to respond to a treatment just because of the attention that is being paid to
him/her.
~~~
The following is the midterm and final exam data from a Law 30 class.
Student   Midterm Mark (X)   Final Mark (Y)
1         55                 60
2         60                 70
3         65                 60
4         85                 75
5         75                 80
6         70                 85
7         75                 80
8         65                 70
9         65                 60
10        85                 80
By creating a scatter diagram with the above data, we get the following graph :
[Figure: Scatter Diagram of Midterm and Final Marks. Horizontal axis: Midterm Marks (X), 0 to 100. Vertical axis: Final Marks (Y), 0 to 90.]
We would like to determine if there is a linear relationship between the Midterm
Marks (X) and the Final Marks (Y). This relationship is referred to as the
correlation $\rho$ (rho). (This is NOT a p). We measure the correlation by the sample
correlation coefficient r where
$r = \frac{n \sum x_i y_i - \left( \sum x_i \right) \left( \sum y_i \right)}{\sqrt{\left[ n \sum x_i^2 - \left( \sum x_i \right)^2 \right] \left[ n \sum y_i^2 - \left( \sum y_i \right)^2 \right]}}$.
The standard deviation of r is given by $s_r = \sqrt{\frac{1 - r^2}{n - 2}}$.
It can be shown using the formula for r that we always have $-1 \le r \le +1$.
In other words, the sample correlation coefficient can NEVER be smaller than -1 or
greater than +1.
A line that rises from left to right has a positive slope.
A line that falls from left to right has a negative slope.
Notes:
(1) r does NOT measure the slope of the line (referred to as the regression
line or the line of best fit) that we are trying to fit to our data, apart from the sign.
(2) If +0.7 < r ≤ +1, then we have a strong positive correlation between the two
variables.
If +0.4 < r < +0.7, then we have a moderate positive correlation between the
two variables.
If 0.0 < r < +0.4, then we have a weak positive correlation between the two
variables.
If -1.0 ≤ r < -0.7, then we have a strong negative correlation between the two
variables.
If -0.7 < r < -0.4, then we have a moderate negative correlation between the
two variables.
If -0.4 < r < 0.0, then we have a weak negative correlation between the two
variables.
If r is close to ZERO, then there is little to no correlation (or linear relationship)
between the two variables.
Example: Calculate the linear correlation between the Midterm Marks (X) and the
Final Marks (Y).
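The sample correlation coefficient for the midterm and final marks can be evaluated with the Python sketch below, which follows the computational formula for r term by term. The function name `correlation` is illustrative.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r from the sums of x, y, x^2, y^2 and xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x ** 2 for x in xs)
    sum_y2 = sum(y ** 2 for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

midterm = [55, 60, 65, 85, 75, 70, 75, 65, 65, 85]  # X
final = [60, 70, 60, 75, 80, 85, 80, 70, 60, 80]    # Y

r = correlation(midterm, final)
print(f"r = {r:.4f}")
```

For these data the sum of X is 700, the sum of Y is 720 and the sum of XY is 50975, so the numerator is 10(50975) - (700)(720) = 5750.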
Does |r|~1 always imply a strong linear relationship between the two variables?
A common error that people make is that they interpret a strong correlation as a cause and
effect. Sometimes such a relationship does exist (Smokers and Physical Endurance), but in
many cases no such causal relationship exists even if the correlation is strong (our midterm and
final exam scores). In such situations, there usually are “hidden” variables linking the two
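quantities of interest. Also note that if, say, r = 0.6, this does not mean that the independent
variable(s) explain(s) 60% of the variability in the dependent variable. In simple linear regression
the proportion explained is r squared, which is 0.36 (36%) in this case.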
Predicting One Variable From Another
Once we have determined that a correlation exists between the two variables of interest,
we would like to draw a regression equation (or a line of best fit) through the data points.
One reason for determining the regression equation is we can use a regression equation
to predict the dependent variable given a specific value for the independent variable.
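For two variables, the equation of the sample regression line is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where the ^
indicates a sample estimator,
$\hat{\beta}_1 = \frac{n \sum x_i y_i - \left( \sum x_i \right) \left( \sum y_i \right)}{n \sum x_i^2 - \left( \sum x_i \right)^2}$
and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.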
Example: Calculate the regression line for the midterm and final exam data.
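The slope and intercept of the sample regression line for the midterm and final marks can be computed with the Python sketch below. The function name `regression_line` and the prediction at a midterm mark of 80 are illustrative additions.

```python
def regression_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y-hat = intercept + slope * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x ** 2 for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * (sum_x / n)  # y-bar minus slope times x-bar
    return intercept, slope

midterm = [55, 60, 65, 85, 75, 70, 75, 65, 65, 85]  # X
final = [60, 70, 60, 75, 80, 85, 80, 70, 60, 80]    # Y

b0, b1 = regression_line(midterm, final)
predicted = b0 + b1 * 80  # predicted final mark for a midterm mark of 80 (illustration)

print(f"y-hat = {b0:.4f} + {b1:.4f} x")
print(f"predicted final mark at x = 80: {predicted:.2f}")
```

As with any regression line, predictions are most trustworthy for midterm marks inside the range of the observed data (55 to 85 here).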
We will revisit correlation and linear regression later in the course.