Chapter 3: Data Description – Numerical Methods

advertisement
Chapter 3: Data Description – Numerical Methods
Learning Objectives
Upon successful completion of Chapter 3, you will be able to:
•
•
•
•
Summarize data using measures of central tendency, such as the mean, median, mode and
midrange.
Describe data using measures of variation, such as the range, variance, and standard
deviation.
Identify the position of a data value in a data set, using various measures of position, such
as percentiles, deciles, and quartiles.
Use the techniques of exploratory data analysis, including boxplots and five-number
summaries, to discover various aspects of data.
I. Basic Vocabulary
A. Statistics vs. Parameter
•
•
A statistic is a numerical characteristic or numerical summary obtained by using the
data values from a sample.
A parameter is a numerical characteristic or numerical summary obtained by using all
the data values for the entire population.
B. Numerical Summaries of Quantitative Data
1. Measures of the average or center: mean, median, mode, and midrange.
2. Measures of variation (spread, variability, or dispersion): range, variance, and standard
deviation.
THE GENERAL ROUNDING RULE:
Always round to one more place than the data when the final answer is computed.
C. Notation for numerical summaries indicate if it is a parameter or a statistic
N: population size n: sample size
•
population mean
sample mean
Note: The mean is found the same way for the sample or population
•
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 1
population variance
population standard deviation
sample variance
sample standard deviation
population proportion
sample proportion
is a value of the variable or an answer to the question asked
II. Measures of the Center
A. Mean
I. Mean (ungrouped data) – To calculate the mean, take the sum of all data values, and
then divide by the number of values:
Sample Mean
Population Mean
Note: The mean is found the same way for the sample and population.
Example: Mean I
3. 5
10
15
20
25
22
Note: The answer is rounded t one more place than the data.
a) Example: Mean II
5.0
10.1
15.3
20.2
57.3
Note: The answer is rounded t one more place than the data.
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 2
II. Mean for Grouped Data – To calculate the mean of grouped data,
1. Use the class midpoint (𝑥𝑖 ) for each class
2. Use the class frequency for each class (f) with the formula
a) Example: Mean for Grouped Data
Class
7 – 21
22- 36
37 – 51
Totals
Midpoint(x)
14
29
44
Frequency (f)
3
5
4
12
xf
42
145
176
363
Note: The answer is rounded t one more place than the data.
III. Weighted Mean – A weighted mean has an additional factor or weight for each class.
a) Example: Weighted Mean
PSU Grade Point Average (GPA) – grades are weighted by their quality points.
Course
English
Stat
History
Totals
Credit (w)
3.0
4.0
3.0
10.0
Dr. Janet Winter, jmw11@psu.edu
Grade (x)
B+ 3.5
A 4.0
C 2.0
Stat 200
xw
10.5
16.0
6.0
32.5
Page 3
�)
B. Median (MD, 𝒙
I. Median – To calculate the median, place the data in increasing order and find a value in
the center of the ordered list.
II. Median:
• The middle value in an ordered list of data.
• It is the value with the same number of data values above and below it.
• Used for data sets with outliers.
• In the absence of outliers, use the mean.
Process:
1. Order the data values from the smallest to the largest.
2. When the sample size n is odd, the median is the data located in the exact middle.
3. When the sample size n is even, there are two data values in the middle. The median
is the average of the two data values in the middle.
Example:
Example data: 14 23 13 54 67 12 4
Ordered data: 4 12 13 14 23 54 67
Answer: 14 is the data value in the center
Example: Median
Example data: 14 23 13 54 67 12 4
Ordered data: 4 12 13 14 23 54 67
90
90
The middle value is between 14 and 23.
Solve for the average: (14 + 23)/2 = 37/2 = 18.5
MD = 18.5 (round to one more place or tenths)
C. Mode
•
•
•
•
•
•
For ungrouped data, the mode is the value that occurs most often in a data set.
For grouped data, the mode is the class with the highest frequency and is called the
modal class.
Bi-modal – Two classes each with the largest frequency OR Two data value each with
the largest frequency.
No mode- no value is repeated.
Multi modal- more than two data values or more than two classes with the same
greatest frequency.
No symbol for the mode.
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 4
Question 1
When a person says that the average age of a group of workers is 35, the average
a)
b)
c)
d)
is the mean of the ages.
is the median of the ages.
could be either the mean or the median of the ages.
do not know.
Question 2
If we are taking a test and we wish to score in the upper half of the students, then we wish
to be higher than the
a)
b)
c)
d)
is the mean of the ages.
is the median of the ages.
could be either the mean or the median of the ages.
do not know.
D. Midrange (MR)
I. Midrange (MR) :
• The value in the middle of the range
• The value midway between the lowest and highest data values
II. Example: Midrange
Find the midrange for: 2, 13, 1, 25, 45, 67, 90
1. Order the data: 1 2 13 25 45 67 90
2. The average of 1 (lowest) and 90 (highest)
3. (1 + 90)/2 = 45.5 (round to one more place or tenths)
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 5
E. Comparison of Mean, Median, Mode, and Midrange
Takes
Every
Affecte
Advantages
Measure
How
Value
d by
and
of
Definition
Existence
Common?
into
Extreme Disadvantage
Center
Account Values?
s
?
Mean
Median
Mode
middle
value
most
frequent
data value
Midrange
used throughout
this book; works
well with many
statistical
methods
often a good
choice if there
are some
extreme values
most
familiar
“average”
always
exists
yes
yes
commonly
used
always
exists
no
no
sometimes
used
might not
exist; may
be more
than one
mode
no
no
appropriate for
data at the
nominal level
rarely used
always
exists
no
yes
very sensitive
extreme values
General Comments:
• For a data collection that is approximately symmetric with one mode, the mean, median, mode, and
midrange tend to be about the same.
• For a data collection that is obviously asymmetric, it would be good to report both the mean and median.
• The mean is relatively reliable. That is, when samples are drawn from the same population, the sample
means tend to be more consistent than the other measures of center (consistent in the sense that the
means of samples drawn from the same population don’t vary as much as the other measure of center.
(Triola & Triola, 2006)
Question 3
Which measures of the center are influenced by outliers?
a)
b)
c)
d)
e)
Mean
Median
Mode
Midrange
Both A & D
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 6
Question 4
If we tally the votes in an election, then the winner would be the candidate corresponding to
a)
b)
c)
d)
the mean of the number of votes.
the median of the number of votes.
the mode of the number of votes.
do not know.
III. Shapes of Distributions
A. Symmetrical
•
Symmetrical shapes have evenly distributed data values on both side of the mean.
•
Mean median and mode are all equal.
B. Positively skewed or Right skewed
•
Positively skewed or right skewed shapes have the majority of data values fall to the left of
the mean and cluster at the lower end of the distribution, with the tail to the right.
•
The mean and median are to the right of the mode.
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 7
C. Negatively skewed or Left skewed
•
Negatively skewed or left skewed shapes have the majority of the data values fall to the
right of the mean and cluster at the upper end of the distribution, with the tail to the left.
•
The mean and median are to the left of the mode.
IV. Measures of Spread – (dispersion or variability)
A. Types
1. Range – the highest value minus the lowest value in a data set (R)
2. Variance
3. Standard Deviation
Question 5
An entertainment event advertises that people ages 1 to 100 would enjoy the event. The
advertisement specifically describes a set of people with
a)
b)
c)
d)
a large number of ages.
a large range of ages.
a large mean of ages.
do not know.
B. Variance
I. Population Variance ( ) - To calculate population variance:
1. Find the mean.
2. Subtract the Mean from each data value
3. Square each difference
4. Divide the sum by the number of data values
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 8
II. Sample Variance ( ) – To calculate sample variance:
1. Find the mean.
2. Subtract the Mean from each data value
3. Square each difference
4. Divide the sum by the number of data values minus one
∑(𝑥𝑖 − 𝑥̅ )2
𝑠 =
(𝑛 − 1)
2
III. Example: Sample Variance
Data: 3 7 6 4
x
3
7
6
4
Total
3 – 5 = -2
7–5=2
6–5=1
4 – 5 = -1
4
4
1
1
10
Note: Round to one more
place than the original data.
C. Standard Deviation
I. Population Standard Deviation (σ) – The population standard deviation (σ) is the
square root of the population variance (σ2).
Same Rounding rule: Round the final answer to one more decimal place than the original
data.
II. Sample Standard Deviation (s)
a) Deviation Formula – the sample standard deviation (s) is the square root of the
sample variances (s2).
𝑠= �
(𝑥𝑖 −𝑥̅ )2
(𝑛−1)
Same Rounding rule: Round the final answer to one more decimal place than the
original data.
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 9
b) Computational Formula
Note: This is used for better accuracy when the mean has several decimal points
and folks are more likely to ignore those decimals.
Process:
1. Find the sum of all of the data values
2. Find the sum of the squared data values
3. Multiply the sum of the squared data values by the number of data values
4. Square the sum of the data values in step 1
5. Subtract step 4 answer from the step 3 answer:
6. Divide the difference in step 5 by the n times (n – 1)
7. Take the square root of the quotient
c) Example: Standard Deviation (again with the computational formula)
Data: 3 7 6 4
x
3
7
6
4
Total 20
x2
9
49
36
16
110
D. Range Rule of Thumb
A rough estimate of the standard deviation is:
Where range is highest data value minus lowest data value.
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 10
E. Standard Deviation for Grouped Data:
Use the class midpoints and frequencies
𝑠= �
(𝑥𝑖 −𝑥̅ )2 𝑓𝑖
(𝑛−1)
I. Example: Standard Deviation for Grouped Data
Data:
Midpoint
14
29
44
Frequency (f)
3
5
4
12
xf
42
145
176
363
x2f
588
4205
7744
12537
Question 6
If we know the variance of a set of data, then to calculate the standard deviation of this data
a) is a long process because of the many operations needed.
b) is a short process because the standard deviation is equal to the variance.
c) is a short process because the standard deviation is the square root of the variance.
d) do not know.
F. Uses for Variance and Standard Deviation
1. Measures of spread, variability, and consistency.
2. To complete inferential statistics.
3. To understand data distributions using Chebyshev’s theorem and the Empirical Rule.
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 11
G. Coefficient of variation (cvar): Comparing Standard Deviations for Different
Distributions
•
•
•
To compare standard deviations for different distributions, use the coefficient of
variation.
The coefficient of variation is the standard deviation divided by the mean and
multiplied by 100%.
It is free of measurement units.
𝑐𝑣𝑎𝑟 =
𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑑𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛
𝑚𝑒𝑎𝑛
× 100%
I. Example: Comparing Standard Deviations for Different Distributions I
The mean of the number of sales of cars over a 3 month period is 87, and the standard
deviation is 5. The mean of the commissions is $5225, and the standard deviation is
$773. Compare the variations of the two.
Sales
Commissions
The commissions are more variable than the sales.
II. Example: Comparing Standard Deviations for Different Distributions II
John took two tests last week. The average for the history test was 61.3 and the
standard deviation was 2.94. The average for the math test was 81.5 and the standard
deviation was 3.14. Compare the variation for the two tests.
History Test
Math Test
The history test is more variable than the math test.
V. Calculator
A. TI-83 Key Strokes to Clear Lists
ALWAYS clear out Lists before entering data.
1. STAT
2. CLRLIST (L1, L2, L3, …)
Use second function 1 for L1, second function 2 for L2 etc. Be sure to include commas
and end with parentheses.
3. ENTER
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 12
B. TI-83 Key Strokes to Enter Data
Enter data into a Cleared List.
1. STAT
2. EDIT
3. Enter the data in the lists as need pressing ENTER after each data value.
C. TI-83 Basic Statistics for Ungrouped Data
1.
2.
3.
4.
5.
Clear L1 and enter the data in L1
STAT
CALC
1 – VARIABLE STATS L1
ENTER
D. TI-83 Basic Statistics for Grouped Data
1.
2.
3.
4.
5.
6.
7.
8.
Clear L1, L2
STAT
EDIT
Enter midpoints in L1 and enter their corresponding frequencies in L2
STAT
CALC
I-variable stats L1, L2
Check that n is the sum of the frequencies
VI. Rules For Data Distribution
•
•
For all data sets, use Chebyshev’s Theorem.
For bell-shaped or approximately normally distributed data sets, use the Empirical Rule
(68 - 99 - 99.75 Rule)
A. Chebyshev’s Theorem for All Distributions
For any distribution, the proportion of values from a data set that will fall within k standard
deviations of the mean will be at least:
1– 1/k2, where k is a number greater than 1.
I. Process
Select values for k and compute 1 – 1/k2
k
2.1
2
3.3
1 – 1/k2
1
1−
= .7732
2.12
1
1 − 2 = .75
2
1
1−
= .90817
3.32
Dr. Janet Winter, jmw11@psu.edu
Interpretation
77.32% of the data is within 2.1 standard deviations of the mean
or would be in the interval (𝑋� − 2.1 𝑆, 𝑋� + 2.1 𝑆)
75% of the data is within 2 standard deviations of the mean or
would be in the interval (𝑋� − 2 𝑆, 𝑋� + 2 𝑆)
90.82% of the data is within 3.3 standard deviations of the mean
or would be in the interval (𝑋� − 3.35 𝑆, 𝑋� + 3.35 𝑆)
Stat 200
Page 13
II. Example: Chebyshev’s Theorem for All Distributions - with k = 2
The mean price of houses in a certain neighborhood is $150,000, and the standard
deviation is $10,000.
Find the price range for which at least 75% of the houses will sell. Using the table from
the previous example, k=2.
$150,000 +2($10,000) = $150,000 + $20,000 = $170,000
$150,000 +2($10,000) = $150,000 – $20,000 = $130,000
75% of the houses cost between $130,000 and $170,000
B. Empirical Rule for Bell Shaped Distributions
•
•
•
Approximately 68% of data values fall within one standard deviation of the mean.
Approximately 95% of the data values fall within two standard deviations of the mean.
Approximately 99.75% of the data values fall within three standard deviations of the
mean.
VII. Measures of Position or Relative Standing
Measures of position are the relative positions of one data value in comparison with the entire
set of data values.
• Z-score
• Percentiles
• Quartiles
• Deciles
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 14
A. Standard Scores (used to compare data values between two groups)
To compare data values, subtract the mean from the data value and divide by the standard
deviation.
I. Z-score Forumlas
For samples:
For populations:
II. Example: Z-score I
A student scored 75 on a calculus test that had a mean of 50 and a standard deviation
of 10; she scored 80 on a history test with a mean of 75 and a standard deviation of
6.1.
Compare her relative positions on the two tests.
The second z-score is larger. Thus, the 75 in calculus is a better grade as a standard
score or compared to the classmates than the 80 on the history test.
Note: z-scores are always given to two-place accuracy.
III. Understanding Z-score
a) Z-scores have a mean of 0 and a standard deviation of 1.
b) A z-score is the number of standard deviations a value is away from the mean for a
specific distribution.
c)
d) Ordinary and Unusual z-scores
• Ordinary values: -2 < z < 2
• Unusual values: z < -2 or z> 2
e) Whenever a value is less than the mean, its corresponding z-score is negative.
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 15
f) Example: Z-score II
Using the information below, compare Joe’s height of 78 inches to Susan’s height of
73 inches.
Men have heights with a mean of 69.0 inches and a standard deviation of 2.8 inches.
Women have heights with a mean of 63.6 inches and a standard deviation of 2.5 inches.
Joe: z = (78 – 69)/2.8 = 3.21
Susan: z = (73 – 63.6)/2.5 = 3.76
Susan is taller compared to other women than Joe compared to other men.
B. Percentiles (the position of a data value within its group)
A percentile, P, is an integer between 1 and 99 such that P% of the data values are less than or
equal to the value and (100 – P)% of the data values are greater than or equal to the value.
I. Given a data value x, find the percentile P
1. Count the number of data values below x
2. Add .5
3. Divide the sum by the number of data values n
4. Multiply by 100%
5. Round to an integer using regular rounding rules
II. Given the percentile P, find the data value x
• n: the total number of data values
• p: the percentile
• c: used to find the position of the data value
1. Order the data lowest to highest
2. To find the position of the data value x, let: c = (n p)/100
3. To find the data value, use the position value c
•
•
If c is not a whole number, round to the next larger whole number. Starting at
the lowest data value, count to the number that corresponds to the rounded up
value of c.
If c is a whole number, use the value halfway between the c th and (c + 1) st
values when counting up from the lowest value.
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 16
III. Example: Percentiles I
Find the value corresponding to the 13th percentile.
Unordered Data: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Ordered data: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20
=
c
•
•
•
n⋅ p
=
100
10 ⋅13
= 1.3
100
Since c is not a whole number, round up to 2.
Start at the lowest score and count to the second value, which is 3.
3 is the 13th percentile value.
IV. Example: Percentiles II
A teacher gives a 20-point test to 10 students. The scores are shown below.
Find the percentile rank of a score of 12.
Unordered Data: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Ordered Data: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20
Percentile =
6 + 0.5
⋅100% = 65th percentile
10
C. Quartiles
Divide the order list of data values into four groups.
• Q1 is the same as the 25th percentile
• Q2 is the 50th percentile or the median
• Q3 is the 75th percentile
Question 7
If a botanist measures the length of flower petals and finds that 75% of the lengths are 1.5
cm or longer, then 1.5 is
a) the f the first quartile of lengths of petals.
b) the 25th percentile of lengths of petals.
c) both of the above.
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 17
d) do not know.
D. Deciles
Deciles divide the distribution into 10 groups. They are denoted by D1, D2, …, D10.
How do deciles related to percentiles?
Question 8
The percentile that corresponds to the mean is
a) the 50th percentile.
b) the 100th percentile.
c) no particular percentile corresponds to the mean.
d) do not know.
VIII. Exploratory Data Analysis
A. Introduction
I. Purpose:
• Examine data patterns when the mean is affected by outliers.
• Find gaps in the data.
• Find patterns.
• Compare data sets.
• Identify outliers (values located far away from other values)
II. Exploratory Data Analysis is…
1. Five-number summary
2. Box plot
B. Five-Number Summary
A five-number summary is a list of:
• The lowest value of data set (L or minimum)
• Q1 (25th percentile)
• The median (MD or 50th percentile)
• Q3 (75th percentile)
• The highest value of data set (H or maximum)
A box plot is a graphical representation of a five-number summary on a scaled axes. Be
sure the box is above the scaled line and drawn to scale (see example in the text and
section x).
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 18
Question 9
A box plot can be drawn from data in a stem and leaf plot by
a) counting the values in the stem and leaf plot to determine the five number summary.
b) adding the values in the stem and leaf plot to determine the five number summary.
c) graphing only the stems and not the leaves from the stem and leaf plot.
d) do not know.
IX. Outliers
•
•
•
•
•
An outlier is an extremely high or an extremely low data value when compared with the
rest of the data values.
Can be the result of measurement or observational error. Outliers can also indicate
something else in the data.
Can have a dramatic affect on the mean
Can have a dramatic affect on the standard deviation
Can have a dramatic affect on the scale of the histogram so that the shape of the
distribution is obscured.
A. Outliers for Normally Distributed Data
Any data value more than three standard deviations away from the mean is considered an
outlier.
B. Outliers for Other Distributions
1.
2.
3.
4.
Arrange the data in order
Find Quartile 1 and Quartile 3
Find the inter-quartile range: IQR = Q3 – Q1
Outliers are:
• Any data value larger than Q3 + 1.5 (IQR)
• Any data value smaller than Q1 - 1.5 (IQR)
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 19
X. Box Plots (Box and Whisker Plots)
Scaled graph of the five number summary
Process:
1. Find the 5-number summary (minimum, Q1, Q2, Q3, and maximum)
2. Construct a horizontal scale that includes the minimum and the maximum data. Start
the scale at or below the lowest data values and end it slightly above the largest data
value.
3. Construct a rectangle “floating” above the line with the left end at Quartile 1 and the
right end at Quartile 3.
4. Construct a vertical line segment inside the box at the median.
5. Construct a horizontal line segment from the center of the lower vertical box edge to
the lowest data value that is not an outlier. Construct a second horizontal line segment
from the cent of the upper vertical box edge to the highest data value that is not an
outlier.
6. Graph mild outliers with a solid dot. Graph extreme outliers with an open dot.
XI. Summary
•
Histograms, frequency polygons and ogives are used for quantitative data organized in a
grouped frequency distribution.
•
Pareto charts and bar graphs are frequency graphs for qualitative variables.
•
Time series graphs are used to show a pattern or trend that occurs over time.
•
Pie graphs are used to show the relationship between the parts and the whole for
qualitative or categorical data.
•
Data can be organized in meaningful ways using frequency distributions and graphs.
•
In descriptive statistics, we use all of these numerical and graphical techniques with
sampling methods to collect, organize, summarize, and present data.
•
Data is organized for interpretation and inference
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 20
Answer: Question 1
When a person says that the average age of a group of workers is 35, the average
C – could be either the mean or the median of the ages.
Answer: Question 2
If we are taking a test and we wish to score in the upper half of the students, then we wish to be
higher than the
B – the median of the test scores.
Answer: Question 3
Which measures of the center are influenced by outliers?
E – both A & D.
Answer: Question 4
If we tally the votes in an election, then the winner would be the candidate corresponding to
C – the mode of the number of votes.
Answer: Question 5
An entertainment event advertises that people ages 1 to 100 would enjoy the event. The
advertisement specifically describes a set of people with
B – a large range of ages.
Answer: Question 6
If we know the variance of a set of data, then to calculate the standard deviation of this data
C – is a short process because the standard deviation is the square root of the variance.
Answer: Question 7
If a botanist measures the length of flower petals and finds that 75% of the lengths are 1.5cm or
longer, then 1.5 is
C – both A & C.
Answer: Question 8
The percentile that corresponds to the mean is
C – no particular percentile corresponds to the mean.
Answer: Question 9
A box plot can be drawn from data in a stem and leaf plot by
A – counting the values in the stem and leaf plot to determine the five number summary.
Dr. Janet Winter, jmw11@psu.edu
Stat 200
Page 21
Download