Uploaded by Linda Huang

Statistics Exercises: Central Tendency & Distribution

advertisement
CB2200 Week 3
Introduction
to Statistics
Exercises
TUESDAY
23
Concept Review
● Measures of Central Tendency: Mean, Median (only numerical variable) and
Mode (central tendency X best to describe categorical variable)
● Measures of Variation: Range, Interquartile Range (IQR), Variance and
Standard deviation
● Distribution Shape of curve
● Five-numbers Summary and Boxplot
Mean
● Sample mean
● Population mean
Median
Mode
● Value that occurs most often
● Not affected by extreme values (outliers)
● Used for both numerical and categorical data
● There may be no mode
● There may be several modes
Comparison Table
Measures
Variable Type
Can be nonexist?
Can be morethan-one?
Effect of
Outliers
Mean
Numerical
No
No
Yes
Median
Numerical
No
No
No
Mode
Numerical &
Categorical
Yes
Yes
No
Range
● Simplest measure of variation
● Difference between the largest and the smallest values
● Disadvantage: Ignores the way in which data are distributed
● X reflect the distribution shape
Quartiles
Calculating the quantile position for n values:
Eg. N = 3
Q1 = (3+1)/4
=1
= whole number
● If r is a whole number, it is the ranked position to use = n (odd number)
● When r is not a whole number, the following linear interpolation steps can be used to
determine the quantile value = n (even number)
1.
2.
d = r – [r], where [r] is the integer part of r
■ Eg. d = 8.75 - 8 = 0.75
Quantile value = X[r] + d*(X[r]+1 – X[r]), where X[r] is the value at the rank rth position
→ eg. X8 + 0.75 * (X8 + 1 - X8) = X8 + 0.75 * (X9 - X8)
Interquartile Range
● Interquartile range IQR = Q3 – Q1 and measures the spread in the middle 50%
of the data
● Interquartile range is also called the mid-spread because it covers the middle
50% of the data
● Not influenced by outliers or extreme values
● Usually, values fall outside the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are
considered as outliers
Variance
● Most preferred measure of variation due to its mathematical property
● It shows variation of each value from the mean
■ Sample Variance
■ Population Variance
Standard Deviation
● It is the square-root of variance.
● It has the same units as the original data
■ Sample Standard Deviation
■ Population Standard Deviation
Standard Deviation
■ Smaller standard deviation means most values of X are closer to its mean value.
→ The more the data are concentrated, the smaller the range, variance, and standard
deviation → safer, more stable and more reliable
■ Larger standard deviation means the values of X are more spread out
→ The more the data are spread out, the greater the range, variance, and standard
deviation → higher volatility, unstable
Distribution Shape
● Position of mean and median for unimodal continuous distribution
Negatively skewed/
Positively skewed/
● If data are skewed, the median may be a more appropriate measure of
central tendency
The Five Number Summary and Boxplot
● The five numbers that help describe the center, spread and shape of data are
● Boxplot
Distribution Shape = Boxplot
Distribution Shape and Boxplot
Q7
The following is a set of data for a population of size N=10:
7, 5, 11, 8, 3, 5, 2, 1, 10, 8
a) Compute the population mean (µ).
= (7 + 5 + 11 + 8 + 3 + 5 + 2 + 1 + 10 +8) / 10
Population mean = µ = 6
b) Compute the population standard deviation (σ).
= Variance = (7-6)^ 2+…..+[(8-6)^2]/10 = 10.2
= Population standard deviation (σ) = 3.1937
→ Square root of
= 10.2
Q8
n = 10
A food inspector, examining 10 bottles of a certain brand of honey, obtained the following
percentages of impurities:
23.5 19.8 21.3 22.6 19.4 18.2 24.7 21.9 20.0 21.1
What are the mean and standard deviation of this sample?
1
2
Q9
The data contain the price for two tickets with online service charges, large popcorn, and two
medium soft drinks at a sample of six theater chains:
n=6
$36.15 $31.00 $35.05 $40.25 $33.75 $43.00
a) Compute the mean and median
●
Mean, X = (36.15 + 31 + 35.05 + 40.25 + 33.75 + 43) / 6 = 36.53
●
Median → 1. Order the sample data in ascending order (Small to Large)
2. $31, $33.75 (2nd), $35.05, $36.15, $40.25 (5th), $43
3. Median = ($35.05 + $36.15) /2 = $35.6
Q9
The data contain the price for two tickets with online service charges, large popcorn, and two
medium soft drinks at a sample of six theater chains:
$36.15 $31.00 $35.05 $40.25 $33.75 $43.00
b) Compute the variance, standard deviation and range
●
Variance (S2) = (31-36.53)2+.....+(43-36.53)2/(6-1) = 19.27
●
Standard deviation (S) = Square root of 19.27 = 4.39
●
Range = XLargest - XSmallest = 43 - 31 = 12
c) Are the data skewed? If so, how?
→ Since the mean (36.53) is only slightly greater than the median (35.6), the data are slightly
right skewed
Q9
The data contain the price for two tickets with online service charges, large popcorn, and two
medium soft drinks at a sample of six theater chains:
$31
$33.75
$35.05
$36.15
$40.25
$43
d) Based on the results of (a) through (c), describe the data.
1. Describe range (difference): There is a $12 difference between the most
expensive and the least expensive outlet.
2. Compare the mean and median: The prices vary around $36.53 (Mean) with
half of the outlets being more expensive than $35.6 (Median).
3. Mention middle half of the price (Q1&Q3): The middle half of the prices fall
between $33.75 (2nd) and $40.25 (5th)
$31, $33.75 (2nd), $35.05, $36.15, $40.25 (5th), $43
Quartile
Quartile Position
Q1
r = (n+1)/4 = 7/4 = 1.75 ~ 2nd
Q3
r = 3(n+1)/4 = 3(7)/4 = 5.25 ~ 5th
Q10
The data contains the total fat, in grams per serving, for a sample of 20 chicken sandwiches
from fast-food chains. The data are as follows:
n = 20
7
8
4
5 16 20 20 24 19 30
23 30 25 19 29 29 30 30 50 56
a) Compute the first quartile (Q1), the third quartile (Q3), and the interquartile range.
1. Reorder the data → 4 (Min), 5, 7, 8, 16 (5th), 19, 19, 20, 20, 23, 24, 25, 29, 29, 30 (15th), 30,
30, 30, 50, 56 (Max)
Quartile
Quartile Position
Quartile Value
Q1
r = (20+1)/4 = 5.25 ~ 5
16+0.25(19-16)= 16.75
Q3
r = 3(20+1)/4 = 15.75 ~ 15
30+0.75(30-30) = 30
Interquartile range
= Q3-Q1
= 30 – 16.75 = 13.25
Q10
The data contains the total fat, in grams per serving, for a sample of 20 chicken sandwiches
from fast-food chains. The data are as follows:
7
8
4
5 16 20 20 24 19 30
23 30 25 19 29 29 30 30 50 56
b) List the five-number summary.
c) Construct a boxplot and describe the shape.
Q11
The following data is the number of vitamin supplements sold by a health food store in a
sample of 11 days:
n = 11
19 19 20 20 20 22 23 25 26 27 30
a) What are the average (X-bar) and standard deviation (S) of daily sale of vitamin supplements of
the health food store?
b) Work out a five-number summary of the data in the sample. Comment on the distribution of the
sample data.
Formula Reference
Formula Reference
Sample
Mean
Population
µ
Variance
S2
σ2
Standard
Deviation
S
σ
Size
n
N
Summary
Download