Classifying Variables, Frequency Distributions and Tables - nwss

CLASSIFYING VARIABLES, FREQUENCY DISTRIBUTIONS AND TABLES
A population is the complete collection of measurements, objects, or individuals under
study. A survey of the entire population is called a census.
A sample is a portion or subset taken from a population. Therefore, sample results will not
always accurately reflect the population. However, if sampling is done properly and
scientifically the sample results will be sufficiently accurate. The subset should be chosen at
random to avoid bias.
A survey is the collection of information from a sample.
A random sample is a sample in which each member of the population has the same chance
of being selected.
Data is information about individuals in a population.
Categorical variables describe a particular quality or characteristic which can be divided
into categories. This is qualitative data.
Examples:
Numerical variables describe a characteristic which has a numerical value that can be
counted or measured. This is quantitative data.
Examples:
A parameter is a numerical quantity that measures some aspect of a population.
Outliers are data values that are much larger or much smaller than the general body of data.
They should be included in analysis unless they are the result of human or other known error.
A distribution is the pattern of variation of data, which may be described as:
positively skewed
symmetrical
negatively skewed
A discrete numerical variable takes exact number values and is often a result of counting.
Examples:
A continuous variable takes numerical values within a certain continuous range. It is usually
a result of measuring.
Examples:
A frequency distribution or frequency table groups data items into classes and then records
the number of items that appear in each class. The classes can either be continuous or
discontinuous.
Ex. Arrange the following heights (in cm) of a group of students into a frequency table.
156 150 153 172 160 154 168 163 152
153 152 155 170 157 150 160 158 150
170 162 152 156 154 152 160 159 154
160 163 151 172 157 151 174 160 154
Choose your interval width so that there are about 10 classes or intervals (for example, the
first interval could be 148-150 cm). Use the difference between the lowest and highest values
(the range) to determine an appropriate interval width. Mark a tally in the appropriate row of
the tally column for each entry. Then total the tallies to get the frequency.
Heights (cm)   | Tally | Frequency | Relative Frequency
148 ≤ x < 151  |       |           |
151 ≤ x < 154  |       |           |
154 ≤ x < 157  |       |           |
157 ≤ x < 160  |       |           |
160 ≤ x < 163  |       |           |
163 ≤ x < 166  |       |           |
166 ≤ x < 169  |       |           |
169 ≤ x < 172  |       |           |
172 ≤ x < 175  |       |           |
Totals         |       |           |
The total of the frequencies should be the total number of data items.
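The tallying procedure for the heights can be checked with a short script. This is only a sketch: it assumes classes of width 3 cm starting at 148 cm (each class includes its lower end and excludes its upper end), and the function name `frequency_table` is made up for this illustration.

```python
from collections import Counter

# Heights (cm) of the 36 students from the example.
heights = [
    156, 150, 153, 172, 160, 154, 168, 163, 152,
    153, 152, 155, 170, 157, 150, 160, 158, 150,
    170, 162, 152, 156, 154, 152, 160, 159, 154,
    160, 163, 151, 172, 157, 151, 174, 160, 154,
]

def frequency_table(data, start, width, n_classes):
    """Return (lower, upper, frequency, relative frequency) for each class.

    Class k covers start + k*width <= x < start + (k+1)*width.
    Values outside all classes are simply not counted.
    """
    counts = Counter((x - start) // width for x in data)
    table = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        freq = counts.get(k, 0)
        table.append((lower, upper, freq, freq / len(data)))
    return table

table = frequency_table(heights, start=148, width=3, n_classes=9)
for lower, upper, freq, rel in table:
    print(f"{lower} <= x < {upper}: frequency {freq:2d}, relative frequency {rel:.1%}")
```

The frequencies printed should add up to 36, the number of heights in the list.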
We will graph the data to best show its distribution. We do this by making a histogram. A
histogram is a bar graph that portrays the data found in a frequency distribution. The bars
are of equal width and correspond to the class intervals. A histogram is used for continuous
data while a column graph is used for discrete (discontinuous) data.
Make a histogram for the data above.
A relative frequency histogram would be similar, but the vertical axis would show the
relative frequency as a % rather than the number of items.
Higher Level Only
Alternatively, if the class widths are not uniform, the area of the bar can represent the
frequency of the class. The height of the bar is measured by the density, or the frequency of
the class per unit of the class size.
Area of the bar = width x height
(where height is the frequency density= frequency/class width)
Area of the bar is the frequency (see pp. 473-474 in Pearson HL Text)
The mode of a data set is the item or class of items that has the highest frequency in the
distribution. A distribution can be uniform (each item or class has the same frequency),
unimodal, bimodal, symmetric, etc.
A cumulative frequency distribution shows the accumulated frequencies of the table.
Ex. A cumulative frequency distribution showing litres of fizzy cola syrup sold by 50
employees of Slimline Beverage Company in 1 week.
Litres Sold | Number of Employees
<80         | 0
<90         | 2
<100        | 8
<110        | 18
<120        | 32
<130        | 41
<140        | 48
<150        | 50
How many employees sold between 110 and 120 litres?
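A question like this can be answered by subtracting consecutive cumulative frequencies. The sketch below shows the idea with the cola data; the helper name `class_frequencies` is just an illustrative choice.

```python
# Cumulative frequency table: number of employees who sold less than each bound.
upper_bounds = [80, 90, 100, 110, 120, 130, 140, 150]
cumulative = [0, 2, 8, 18, 32, 41, 48, 50]

def class_frequencies(cumulative):
    """Turn cumulative frequencies back into the frequency of each class."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

freqs = class_frequencies(cumulative)

# Employees who sold at least 110 but less than 120 litres:
between_110_and_120 = (
    cumulative[upper_bounds.index(120)] - cumulative[upper_bounds.index(110)]
)
print(between_110_and_120)  # 32 - 18 = 14
```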
The graphic presentation of a cumulative frequency distribution is called a cumulative
frequency graph/curve or an ogive (oh jive).
Ex. Make a cumulative frequency distribution for the data in the table below. Then make a
cumulative frequency graph and find the upper quartile, median, and lower quartile.
Salary        | Number of Employees
0-9,999       | 2
10,000-14,999 | 5
15,000-19,999 | 9
20,000-24,999 | 11
25,000-29,999 | 22
30,000-34,999 | 30
35,000-39,999 | 13
40,000-44,999 | 7
45,000-49,999 | 1
Salary   | Number of Employees
0-9,999  |
0-14,999 |
0-19,999 |
0-24,999 |
0-29,999 |
0-34,999 |
0-39,999 |
0-44,999 |
0-49,999 |
You can create frequency histograms and bar graphs using your graphing calculator. Enter
your data into a list, go to STATPLOT and change it to a bar graph and use the appropriate
list, then graph.
To find the median, find the halfway point of the cumulative frequencies and draw a straight
line to the curve. Then draw a straight line down to the x-axis to find the median value.
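Reading a value off the cumulative frequency graph can be imitated numerically. The sketch below uses the salary data and assumes the graph is made of straight line segments between class boundaries, with each class read as running up to the next boundary (for example, 10,000-14,999 is treated as 10,000 ≤ salary < 15,000). The function names are illustrative only, and a smooth hand-drawn curve will give slightly different readings.

```python
# Class boundaries and class frequencies from the salary table.
boundaries = [0, 10_000, 15_000, 20_000, 25_000, 30_000,
              35_000, 40_000, 45_000, 50_000]
frequencies = [2, 5, 9, 11, 22, 30, 13, 7, 1]

def cumulative_frequencies(freqs):
    """Running totals of the class frequencies."""
    total, out = 0, []
    for f in freqs:
        total += f
        out.append(total)
    return out

def read_off_ogive(boundaries, cumulative, target):
    """Salary at which a straight-line ogive reaches cumulative frequency `target`."""
    points = list(zip(boundaries, [0] + cumulative))
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            # Linear interpolation inside the class that contains the target.
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target lies outside the cumulative frequency range")

cumulative = cumulative_frequencies(frequencies)
n = cumulative[-1]

lower_quartile = read_off_ogive(boundaries, cumulative, n / 4)
median = read_off_ogive(boundaries, cumulative, n / 2)
upper_quartile = read_off_ogive(boundaries, cumulative, 3 * n / 4)
print(cumulative)
print(round(lower_quartile), round(median), round(upper_quartile))
```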
For a frequency polygon, the point goes at the middle of the interval.
[Figure: graph axes with Frequency (0 to 35) on the vertical axis and Salaries (0 to 55,000) on the horizontal axis]
Stem and Leaf Plots
Stem | Leaf
0    | 2 5 7 8
1    | 0 1 4 4 4 7 9
2    | 2 3 3 6
3    | 1 4
MEASURES OF CENTRAL TENDENCY AND QUARTILES
A parameter is a numerical characteristic of a population.
A statistic is a numerical characteristic of a sample. A statistic is a single value that
summarizes some characteristic of interest.
There are many statistics that can be of interest. Some of these statistics are measures of
central tendency such as the mean, median, and mode of a set of data.
The mode of a set of data is the most commonly occurring value in a series. If we are
looking at classes, it is the class that contains the most entries (has highest frequency). It is
possible to have one mode, more than one mode (bimodal, trimodal), or no mode at all. The
mode is not the most useful of the measures of central tendency and is typically only used
when others are not available. The mode will always be a value in the data series.
The arithmetic mean of a set of data is usually what people mean when they say average.
The mean is the sum of all the data values divided by the number of data values. The mean
does not necessarily have to be a value in the data series.
$\mu = \dfrac{\sum x}{N}$   mean for a population
$\bar{x} = \dfrac{\sum x}{n}$   mean for a sample
N = the number of x items in the population
n = the number of x items in the sample
If the data is in the form of a frequency table with item $x_i$ of frequency $f_i$ then the
formula is:
$\text{Mean} = \dfrac{\sum f_i x_i}{\sum f_i}$
For a table that uses intervals, use the mid-value of the interval as your $x_i$.
The median of a set of data occupies the middle position in an array of values. Half of the
items are below the median, and half the values are above the median. In the case where
there is an even number of items, the median is the mean of the two middle values. To find
the median, arrange the items in order of size.
Odd number of scores: the median is the $\dfrac{\sum f + 1}{2}$th score
Even number of scores: the median is the mean of the $\dfrac{\sum f}{2}$th and $\left(\dfrac{\sum f}{2} + 1\right)$th scores
Ex. Find the mean, median, and mode of the ages shown below.
23 43 28 41 35 52 23 29
Mean:
Median:
Mode:
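The three measures for the ages can be checked with Python's standard `statistics` module. This is a sketch for checking an answer, not a replacement for working it by hand; `multimode` is used so that data with more than one mode would also be reported.

```python
import statistics

ages = [23, 43, 28, 41, 35, 52, 23, 29]

mean_age = statistics.mean(ages)        # sum of the ages divided by 8
median_age = statistics.median(ages)    # mean of the 4th and 5th sorted ages
modes = statistics.multimode(ages)      # list of the most frequent value(s)

print(sorted(ages))   # [23, 23, 28, 29, 35, 41, 43, 52]
print(mean_age)       # 34.25
print(median_age)     # 32.0
print(modes)          # [23]
```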
On the Calculator: Enter the items into list 1 on your calculator. Press STAT, EDIT,
SortA(L1) to sort the list in ascending order. Then press 2nd STAT, press MATH and
median(L1) to find the median of the list. Press 2nd STAT, MATH, and mean(L1) to find
the mean of the set of data. Note: You can also choose STAT, CALC, 1-Var Stats L1 to
find the mean and median. You will have to view the sorted list to find the mode: press
2nd STAT, choose L1, press enter and scroll through the list.
Ex. Find the mean, median, and mode of lengths of 100 Dover sole shown in the table
below.
Length (mm) | Number of Fish ($f_i$) | Mid-Interval ($x_i$) | $f_i x_i$
275-299     | 1                      | 287                  | 287
300-324     | 1                      | 312                  | 312
325-349     | 14                     | 337                  | 4718
350-374     | 24                     | 362                  | 8688
375-399     | 30                     | 387                  | 11610
400-424     | 22                     | 412                  | 9064
425-449     | 6                      | 437                  | 2622
450-474     | 2                      | 462                  | 924
Total       | 100                    |                      | 38225
Mean:
Median:
Mode:
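The grouped-data calculation for the Dover sole can be sketched as follows. The code uses the mid-interval value of each class as $x_i$, as described earlier; for the median and mode it only identifies the class that contains them, which is an assumption about how much detail the exercise expects.

```python
# Length classes (mm) and number of fish in each class.
classes = [(275, 299), (300, 324), (325, 349), (350, 374),
           (375, 399), (400, 424), (425, 449), (450, 474)]
frequencies = [1, 1, 14, 24, 30, 22, 6, 2]

# Mid-interval value of each class, e.g. (275 + 299) / 2 = 287.
midpoints = [(low + high) // 2 for low, high in classes]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
mean_length = sum_fx / n

# Modal class: the class with the highest frequency.
modal_class = classes[frequencies.index(max(frequencies))]

# Median class: the first class whose cumulative frequency reaches n / 2.
running_total = 0
for cls, f in zip(classes, frequencies):
    running_total += f
    if running_total >= n / 2:
        median_class = cls
        break

print(n, sum_fx, mean_length)     # 100 38225 382.25
print(modal_class, median_class)  # (375, 399) (375, 399)
```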
Note: To do this on the calculator, enter the mid-interval values into L1 and the
frequencies into L2. Then press STAT, CALC, 1 Var Stats (L1, L2), ENTER
Note where the median and mean are for skewed data (median is solid, mean is dashed):
[Figure: three distributions labelled positively skewed, symmetrical, and negatively skewed]
A quartile contains a quarter of the data values.
The lower quartile (Q1) is the value below which lie one-quarter of the data items.
The upper quartile (Q3) is the value above which lie one-quarter of the data items.
The median divides the data series into two parts, an upper and lower part. The lower
quartile is the midpoint of the lower part and the upper quartile is the midpoint of the upper
part.
The inter-quartile range (IQR) is the difference between the upper and lower quartiles. It
contains the middle 50 percent of the values.
On a cumulative frequency curve, the lower quartile is found by finding a quarter of the
cumulative frequency, drawing a horizontal line to the curve and dropping a vertical line to
the x-axis. The upper quartile is found similarly.
In fact, any percentile can be found using this method. The tenth percentile is the value
below which lie ten percent of the data values. The upper quartile therefore is also known as
the 75th percentile.
The kth percentile, Pk, is a value such that k percent of the data are less than or equal to Pk
and (100-k) percent are greater than or equal to Pk.
Ex: Find the median and first and third quartiles for the following:
1) 3, 5, 8, 9, 11
2) 3, 5, 6, 8, 9, 11
3) 3, 5, 6, 8, 9, 11, 11
4) 3, 5, 6, 8, 9, 11, 11, 13
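The "midpoint of each half" description of the quartiles can be turned into a short function for checking answers. This sketch assumes one common convention: when the number of items is odd, the median itself is left out of both halves. Other conventions exist and can give slightly different quartiles. The function name `quartiles` is illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median of the lower and upper halves.

    Assumption: for an odd number of items the middle item is excluded
    from both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[-half:]
    return median(lower_half), median(data), median(upper_half)

for data in ([3, 5, 8, 9, 11],
             [3, 5, 6, 8, 9, 11],
             [3, 5, 6, 8, 9, 11, 11],
             [3, 5, 6, 8, 9, 11, 11, 13]):
    q1, q2, q3 = quartiles(data)
    print(data, "Q1 =", q1, "median =", q2, "Q3 =", q3, "IQR =", q3 - q1)
```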
Cumulative Frequency
Mark          | Frequency | Cumulative Frequency
10 ≤ x < 20   | 2         |
20 ≤ x < 30   | 1         |
30 ≤ x < 40   | 4         |
40 ≤ x < 50   | 3         |
50 ≤ x < 60   | 5         |
60 ≤ x < 70   | 11        |
70 ≤ x < 80   | 25        |
80 ≤ x < 90   | 17        |
90 ≤ x < 100  | 11        |
Total:        |           |
Draw a cumulative frequency graph.
Find the following percentiles:
Median (50th):
75th:
15th:
80th:
What percentage of scores were between 30 and 50?
What percentage of students failed?
What percentage of students were between 80 and 100?
BOXPLOTS AND MEASURES OF SPREAD
A box and whisker plot (or boxplot) shows the middle half of the values in a data set – what
we call the interquartile range – as a box and then draws lines, or whiskers, extending to the
left and right from the box to indicate the remaining 50 percent of the data items. (Only
Standard Level requires the box plot).
A box and whisker plot uses a five-number summary; the minimum value (Minx), the lower
quartile (Q1), the median (Q2), the upper quartile (Q3), and the maximum value (Maxx).
The lower and upper quartiles form the box, while whiskers are drawn from either side to the
minimum and maximum values. Outliers are values that are extremely large or extremely
small compared to the rest of the data.
The box is drawn from Q1, to Q3 with a line drawn at the median. The whiskers are from the
box to the minimum and maximum (unless they are outliers).
[Figure: boxplot with Min, Q1, Median, Q3 and Max labelled]
Ex. Find the median, Q1 and Q3 of the set of data {0 1 2 2 4 5 5 6 7}.
Ex. Find the five number summary for the data series below and use it to draw a box and
whisker plot.
4 7 6 3 9 6 2 9 5 7 3 2
On the calculator, enter the data into L1, then press STAT, CALC, 1 Var Stats L1 and read
off the five-number summary. To graph the boxplot, press 2nd Y=, select Plot1, select the
boxplot with the median line, set the Xlist to L1 and the frequency to 1.
One advantage to a boxplot is that it can show whether data is reasonably symmetrical or
skewed. If the median is in the centre of the box then the data set is reasonably symmetrical.
However, if the median is closer to either the upper or lower quartile then the data set may be
skewed. Also, skewness is indicated if one whisker is appreciably longer than another.
Ex. The boxplot shows the results of a test (out of 100 marks). Find the following:
[Figure: boxplot drawn above a scale from 0 to 100 marks in steps of 10]
a) The highest mark scored
b) The lowest mark scored
c) The mark above which half of the class scored
d) The mark that 25% of the class scored above
e) The mark that 75% of the class scored above
MEASURES OF SPREAD
Variance
Measures of central tendency are not the only way to summarize data. Another way is to use
measures of dispersion, that is, measures of the variability that exists in a data set. One
reason to measure dispersion is to judge how well the average value depicts the data. Two
sets of data can have the same mean but be very different.
For example look at these two sets of class test scores:
Set 1: 23 45 67 75 84 96
Set 2: 53 59 67 68 69 74
The mean of both sets is 65, but it is easy to see that the scores in Set 1 are much more spread
out.
One measure of dispersion is the range of the data set, but this does not necessarily give us
enough information about the data set because it only uses the minimum and maximum
values in its calculation.
Another measure of dispersion that is used is the variance. The calculation for variance
involves using all the values in the data set.
Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n}$, where $n = \sum_{i=1}^{k} f_i$
Sample variance: $s_n^2 = \dfrac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n}$, where $n = \sum_{i=1}^{k} f_i$
The variance is the average of the squares of the differences between an item and the mean.
Since variance measures the square of the difference, if we take the square root of the
variance, we should get the average difference between items and the mean. This is called
the standard deviation.
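To see how these measures separate two data sets with the same mean, here is a sketch that applies the definitions to the two sets of class test scores (Set 1: 23, 45, 67, 75, 84, 96 and Set 2: 53, 59, 67, 68, 69, 74). Every score occurs once, so each frequency is 1 and the divisor is n. The function name `spread` is illustrative.

```python
from math import sqrt

set_1 = [23, 45, 67, 75, 84, 96]
set_2 = [53, 59, 67, 68, 69, 74]

def spread(values):
    """Mean, range, variance and standard deviation, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return {
        "mean": mean,
        "range": max(values) - min(values),
        "variance": variance,
        "sd": sqrt(variance),
    }

summary_1 = spread(set_1)
summary_2 = spread(set_2)
print(summary_1)
print(summary_2)
```

Both sets have mean 65, but Set 1 has the larger range, variance and standard deviation, which matches the observation that its scores are more spread out.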
Standard Deviation
- if the standard deviation is high, the data is widely spread
- if the standard deviation is low, the data is clustered (usually about the mean)
Calculated in the following way:
$\sigma = \sqrt{\dfrac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \dots + (x_n - \mu)^2}{n}} = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$
where:
x = any score
μ = mean
n = number of scores
s = sample standard deviation
σ = population standard deviation
Ex. Calculate the variance and standard deviation for the following data set.
Data: 1 4 5 5 5 6 6 8
μ = 5
σ =
Note: This works the same as grouped discrete data except that you have to use class
midpoints to represent the class.
f = frequency, $\sigma^2$ = variance
$\sigma = \sqrt{\dfrac{\sum f_i (x_i - \mu)^2}{\sum f_i}}$ and $\sigma^2 = \dfrac{\sum f_i (x_i - \mu)^2}{\sum f_i}$
The larger the variance is, the larger the standard deviation is, and therefore the larger the
spread in the data (more deviation from the mean). A very small standard deviation indicates
that the data items are clustered around the mean.
Population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n}}$, where $n = \sum_{i=1}^{k} f_i$
Sample standard deviation: $s_n = \sqrt{\dfrac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n}}$, where $n = \sum_{i=1}^{k} f_i$
Note: For this course we will use the formula for sample standard deviation but we will
always use the calculator output that represents the population standard deviation, σ or σx.
Ex. Calculate the mean, variance, and standard deviation for the following data set.
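Score  | Freq. (f) | Midpoint (x) | f·x | x − μ | (x − μ)² | f·(x − μ)²
0-3    |           |              |     |       |          |
4-7    |           |              |     |       |          |
8-11   |           |              |     |       |          |
12-15  |           |              |     |       |          |
16-19  |           |              |     |       |          |
20-23  |           |              |     |       |          |
24-27  |           |              |     |       |          |
28-31  |           |              |     |       |          |
32-35  |           |              |     |       |          |
Total: |           |              |     |       |          |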
Standard Deviation and Mean may be calculated by using the TI-83/84 Plus.
The key strokes are: STAT, ENTER, enter the values in List 1, then STAT → CALC, ENTER (1-Var Stats), ENTER*
*the default for this is the data entered in List 1.
Mean on the calculator is x̄ and standard deviation is σx.
To clear the list: STAT, ENTER, ↑ (to L1), CLEAR, ENTER
Ex. Use the calculator to determine the mean and standard deviation for the following sets of
data:
Data 1:
1 4 5 5 5 6 6 8
Data 2:
1 3 4 5 5 6 7 8
Data 3:
1 2 3 4 5 6 7 8
To calculate the mean and standard deviation of a distribution we always enter
OUTCOMES in LIST 1 and FREQUENCY (of each outcome) in LIST 2.
Once the 2 lists are entered, and you have 1-Var Stats on the screen, you must “tell” the
calculator that 2 lists must be used. The key strokes are: 2nd 1 (L1) , 2nd 2 (L2), then press
ENTER to get the solution.
Ex. Using the calculator, determine the mean and standard deviation for the following data:
3 coins are tossed 40 times and the number of heads showing each
time is recorded. The results are:
# of heads | 0 | 1  | 2  | 3
Frequency  | 7 | 18 | 11 | 4
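The two-list calculation can also be written out directly from the frequency formulas. The sketch below uses the coin-toss table, with outcomes playing the role of List 1 and frequencies the role of List 2, and divides by n so that the result corresponds to the population standard deviation.

```python
from math import sqrt

outcomes = [0, 1, 2, 3]        # number of heads (List 1)
frequencies = [7, 18, 11, 4]   # how often each outcome occurred (List 2)

n = sum(frequencies)
mean = sum(f * x for x, f in zip(outcomes, frequencies)) / n
variance = sum(f * (x - mean) ** 2 for x, f in zip(outcomes, frequencies)) / n
sd = sqrt(variance)

print(n)                    # 40
print(round(mean, 4))       # 1.3
print(round(variance, 4))   # 0.76
print(round(sd, 4))         # 0.8718
```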
CORRELATION AND THE PEARSON COEFFICIENT OF COVARIANCE
Correlation refers to the relationship or association between two variables.
Follow these steps when looking at the correlation:
Step 1: Look at the scatterplot for any pattern.
Step 2: Look at the spread of points to make a judgement about the strength of the
correlation.
For positive relationships:
For negative relationships:
Step 3: Look at the pattern of points to see whether or not it is linear.
Step 4: Look for any outliers. Investigate any outliers as they may be mistakes made in
recording or plotting data. If the data is genuinely extraordinary it should be included.
Causation
Correlation and causation are often conflated. Just because two variables are correlated does
not mean that one causes the other.
For example, if it were found that a correlation existed between raining and being in school,
that doesn’t mean that going to school causes it to rain, nor that rain causes you to attend
school.
Only if the variables are related such that if one is changed the other changes as well can we
conclude that there exists a causal relationship between the variables.
Measuring Correlation
The correlation coefficient (r) measures the strength of correlation between two variables.
This value lies between -1 and 1.
Two variables are positively correlated if an increase in one variable is associated with an
increase in the other in an approximately linear manner.
For positively correlated variables, the value of r lies between 0 and 1.
- If r is near 0, that indicates that little or no linear association (correlation) is present.
- If r is near 1, that indicates a strong linear association; r = 1 means a perfect linear
association (perfect positive correlation) exists.
The following scatter diagrams show various r values for positive correlation.
Two variables are negatively correlated if an increase in one variable is associated with a
decrease in the other in an approximately linear manner.
For negatively correlated variables, the value of r lies between 0 and -1.
- If r is near 0, that indicates that little or no linear association (correlation) is present.
- If r is near -1, that indicates a strong linear association; r = -1 means a perfect linear
association (perfect negative correlation) exists.
The following scatter diagrams show various r values for negative correlation.
Pearson’s Correlation Coefficient
Pearson’s correlation coefficient is used to find the degree of linearity between two random
variables X and Y, given n ordered pairs: $(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_n, y_n)$
$r = \dfrac{s_{xy}}{s_x s_y}$
where:
$s_{xy}$ = the covariance of X and Y
$s_x$ = the standard deviation of X
$s_y$ = the standard deviation of Y
The second formula is useful because it doesn’t require the means of the X and Y
distributions to be found.
Ex. A chemical company has been trying out a new product to control the number of lawn
beetles in the soil. Determine the extent of the correlation between the quantity of chemical
used and the number of surviving lawn beetles per square metre of lawn.
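Lawn    | Amount of chemical (g), x | Number of surviving lawn beetles, y | x − x̄ | y − ȳ | (x − x̄)(y − ȳ) | (x − x̄)² | (y − ȳ)²
A       | 2                         | 11                                  |       |       |                |          |
B       | 5                         | 6                                   |       |       |                |          |
C       | 6                         | 4                                   |       |       |                |          |
D       | 3                         | 6                                   |       |       |                |          |
E       | 9                         | 3                                   |       |       |                |          |
Totals: |                           |                                     |       |       |                |          |
x̄ =          ȳ =
$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
r =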
We will usually use technology to find this value.
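As one example of using technology, the sum-of-deviations formula for r can be coded directly. This sketch uses the lawn beetle data (x = grams of chemical, y = surviving beetles per square metre); the function name `pearson_r` is an illustrative choice.

```python
from math import sqrt

chemical = [2, 5, 6, 3, 9]    # x: amount of chemical (g) on lawns A to E
beetles = [11, 6, 4, 6, 3]    # y: number of surviving lawn beetles

def pearson_r(xs, ys):
    """Pearson's r from the sums of deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)

r = pearson_r(chemical, beetles)
print(round(r, 3))  # -0.859
```

A value of r this close to -1 suggests a strong negative linear association between the amount of chemical and the number of surviving beetles.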
Calculator:
Ex. Ms. Allan wants to know if there is a correlation between the focussed time students
spend studying for their Statistics Test and their test results. Using your calculator, find the
correlation coefficient, r, for the set of data below.
Time spent studying (hours) | 3  | 2  | 1.6 | 5
Score on test (%)           | 84 | 76 | 75  | 85
LINE OF BEST FIT, LINEAR REGRESSION, INTERPOLATION AND EXTRAPOLATION
Line of Best Fit
The method of fitting a line to a set of data and then finding the equation of the line is called
regression. Sometimes this line is called the model.
We can use a line of best fit to predict a value of the dependent variable given a value of the
independent variable.
We can fit a straight line ‘by eye’ or by the method of ‘least squares’ (linear regression).
To draw a line of best fit by eye, find the mean of the data points, plot it and draw a line
through the mean that fits the trend of the data and so that about half the data points are above
the line and half are below it.
Ex. Create a scatter plot, find the mean, and graph the best fit line for the data given below.
Age (months) | 8 | 9 | 10  | 12 | 15
Shoe Size    | 1 | 2 | 2.5 | 3  | 4
To find the equation of the line, choose two points that the line passes through and use them
to find the slope and then generate the equation 𝑦 = 𝑚𝑥 + 𝑏.
Linear Regression
Of course, drawing the line of best fit by eye isn’t very accurate. We will use the method of
linear regression to find the best fit line.
A residual is the vertical distance between a data point and the possible line of best fit.
Least Squares Regression for y on x
The method for finding the best line involves a process that minimizes the sum of the squares
of the residuals.
Note: $y - \bar{y} = \dfrac{s_{xy}}{s_x^2}(x - \bar{x})$
We will use technology to find the equation of the best fit line and use the line to predict.
Calculator:
Ex. Use a calculator to find the equation of the best fit line for the data given in the first
example (age and shoe size).
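For comparison with the calculator, the least squares line for the age and shoe size data can be sketched in code. The slope is the sum of products of deviations divided by the sum of squared x deviations (the same ratio as covariance over variance of x), and the line passes through the mean point. The function name `least_squares_line` is illustrative.

```python
ages = [8, 9, 10, 12, 15]          # x: age in months
shoe_sizes = [1, 2, 2.5, 3, 4]     # y: shoe size

def least_squares_line(xs, ys):
    """Slope and intercept of the least squares regression line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x   # forces the line through (mean_x, mean_y)
    return slope, intercept

slope, intercept = least_squares_line(ages, shoe_sizes)
print(f"y = {slope:.3f}x {intercept:+.3f}")  # roughly y = 0.390x -1.708
```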
Interpolation and Extrapolation
The largest value in the data set is called the upper pole and the smallest value in the data set
is called the lower pole (whether you use the independent or dependent values depends on
which variable you are looking at).
If we are trying to predict a data value that lies between the poles, we are interpolating. If we
are trying to predict a data value that lies outside the poles, we are extrapolating.
The accuracy of an interpolation depends on how linear the data is.
The accuracy of an extrapolation depends on how linear the data is and on the assumption
that the trend will continue past the poles.
Ex. The table below shows a restaurateur’s data for the number of diners in March and the
temperature at noon.
Temperature (X °C)   | 23 | 25 | 28 | 30 | 30 | 27 | 25 | 28 | 32 | 31 | 33 | 29 | 27
Number of diners (Y) | 57 | 64 | 62 | 75 | 69 | 58 | 61 | 78 | 80 | 67 | 84 | 73 | 76
a) Graph the data on a scatterplot (use your calculator and sketch it here).
b) Using technology, find the value of the Pearson Correlation Coefficient, r.
c) What is the equation of the least squares regression line? Graph it above.
d) Using the equation you found in part c:
i) How many diners could be expected if the temperature was 26 °C?
ii) How many diners could be expected if the temperature was 35 °C?
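The difference between the two predictions in part d can be illustrated with a sketch that fits the least squares line and then labels each prediction as an interpolation or an extrapolation, depending on whether the temperature lies between the poles of the recorded temperatures. The helper names `fit_line`, `predict` and `prediction_type` are made up for this illustration.

```python
temperatures = [23, 25, 28, 30, 30, 27, 25, 28, 32, 31, 33, 29, 27]  # X, in °C
diners = [57, 64, 62, 75, 69, 58, 61, 78, 80, 67, 84, 73, 76]        # Y

def fit_line(xs, ys):
    """Slope and intercept of the least squares regression line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(temperatures, diners)

def predict(temperature):
    """Expected number of diners according to the fitted line."""
    return slope * temperature + intercept

def prediction_type(temperature):
    """Interpolation if the value lies between the poles, otherwise extrapolation."""
    if min(temperatures) <= temperature <= max(temperatures):
        return "interpolation"
    return "extrapolation"

for t in (26, 35):
    print(f"{t} °C: about {predict(t):.0f} diners ({prediction_type(t)})")
```

The prediction for 35 °C lies above the upper pole of 33 °C, so it relies on the extra assumption that the trend continues beyond the recorded temperatures.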