Chapter 03.pptx

Chapter 3
Numerical Descriptions of Data
Slide set to accompany "Statistics Using Technology" by Kathryn Kozak (Slides by David H Straayer) is
licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at
http://www.tacomacc.edu/home/dstraayer/published/Statistics/Book/StatisticsUsingTechnology112314b.pdf.
3.1 Measures of Center
• Mode (nah, not really)
• Mean
• Median
Mode
• Most frequently appearing value, or most
common frequency class
• “Humps” in the distribution
• Need not be all the exact same height or
count
• Mostly to recognize bi-modal and multi-modal
• Multi-Modal is often a tip-off that there are different
types of individuals in the data set.
Mean or Average
• Known as Mean, Arithmetic Mean, and
Average.
• Especially useful when the data is roughly
symmetrical and without many outliers.
• Can be misleading on very skewed data.
• Consider the average income of people in this
classroom.
• What happens if Bill Gates walks in and we
calculate a new average?
Median
• The “halfway point” – roughly half are smaller
than this value, half are larger.
• This measure of center is more “resistant” to
skewness and outliers.
• Frequently used for distributions like income
and house cost.
• Doesn’t change much when Bill G walks into
the room
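To see the "Bill Gates walks in" effect in numbers, here is a minimal Python sketch. The incomes and the visitor's figure are made-up values for illustration, not real data; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical annual incomes (in dollars) for a small classroom.
incomes = [32_000, 41_000, 45_000, 52_000, 60_000]

# Hypothetical, deliberately extreme income for one very wealthy visitor.
with_visitor = incomes + [1_000_000_000]

print("Before:", mean(incomes), median(incomes))
print("After: ", mean(with_visitor), median(with_visitor))
```

The mean jumps from tens of thousands to well over a hundred million, while the median only moves to the midpoint of the two middle incomes.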
Shapes and measures of center
Weighted Average
• Students are often familiar with this, as grades
are usually computed as weighted averages.
• For example:
– Homework 20%
– Worksheets 15%
– Tests 65%
• Weighted average = $\dfrac{\sum w x}{\sum w}$, where w is the weight of the data value x
• If the weights add up to 1, the denominator is 1, so it is even simpler!
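A minimal sketch of the grade calculation: multiply each score by its weight, add the products, and divide by the sum of the weights. The category weights are the ones on this slide; the three category scores and the helper name `weighted_mean` are hypothetical.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

# Weights from the slide: homework 20%, worksheets 15%, tests 65%.
weights = [0.20, 0.15, 0.65]

# Hypothetical category scores for one student, in the same order.
scores = [90, 80, 70]

grade = weighted_mean(scores, weights)
print(round(grade, 1))
```

Because the function divides by the sum of the weights, passing the weights as 20, 15, 65 instead of 0.20, 0.15, 0.65 gives the same grade.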
3.2 Measures of Spread
• Range
– Just Max-Min
– Obviously sensitive to outliers
• IQR
– Range of middle half
– Less sensitive to outliers
– Will discuss more in next section
• Standard Deviation
– Appropriate for symmetric distributions where the
mean is a good measure of center.
Spread: how is the data spread out?
• We are often interested in the source and
amount of variation in a data set.
• You will find it useful to pause for a moment
and ask “What is the source of variation in this
data set?”
• Often knowing the source of variation can
help you understand the shape of the
distribution.
Canonical Examples of variation
• The weights of Ping-Pong balls
– Because of standards and high-quality
manufacturing, very little spread
• The weights of apples in a grocery store
– Obviously some, but not extreme, variability
• “Your little brother’s rock collection”
– That sapphire is almost microscopic, but the
granite rock is pretty hefty. A lot of variation.
Range
• Sorting the data makes it easy to find
minimum and maximum values.
• Many computer systems have functions min()
and max().
• Probably the most common measure in nontechnical situations, but its sensitivity to
outliers can be a problem.
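A short Python sketch of the range and its sensitivity to a single outlier. The data values are hypothetical; `min()` and `max()` are Python built-ins.

```python
# Hypothetical data set.
data = [12, 7, 3, 15, 9]
data_range = max(data) - min(data)

# The same data with one hypothetical extreme value added.
data_with_outlier = data + [90]
range_with_outlier = max(data_with_outlier) - min(data_with_outlier)

print(data_range, range_with_outlier)
```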
Standard Deviation
• I call it “Average un-averageness”.
• It is a good description of spread in data that
is reasonably free of extreme outliers.
• It is sensitive to outliers.
• Although not as useful in skewed data sets, it
can be calculated for any data set.
Let’s think about the design of S.D.
• Each data point in a data set has a deviation
from the mean (average).
• A deviation is calculated with a subtraction:
$x_i - \bar{x}$ is the deviation of “the ith x”
• By the definition of mean, and the laws of
math, if we add up all the deviations, the sum
is always zero.
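The reason is that adding up all the deviations gives (sum of the x values) minus (n times the mean), and n times the mean is the sum of the x values. A quick check in Python, using a hypothetical data set chosen so the mean is a whole number:

```python
from statistics import mean

# Hypothetical data set; its mean is exactly 5.
data = [2, 4, 4, 5, 10]

x_bar = mean(data)
deviations = [x - x_bar for x in data]

print(deviations)
print(sum(deviations))
```

With data whose mean is not exactly representable, the sum may come out as a tiny rounding error rather than exactly zero.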
We might try absolute values
• We could add up (and then divide to make an
average) the absolute values of the deviations.
• Although this seems appealing, there are a
few difficult-to-explain reasons why absolute
values are not such a good idea.
• But there is another easy way to get rid of all
those nasty minus signs…
Squaring gets rid of negatives
• Squaring the deviations also gets rid of the
negatives, so they can be added up on the way
to calculating an average.
• This also has some nice mathematical
properties. (smooth graph, rewarding “close”)
We can get an average by dividing
• Yet another strange complication – in most cases
it is better to divide by “one less than the count”.
• This makes for slightly larger (safer) estimates
of variability. “Unbiased estimator”
• We call it variance.
• Of course when we’re done, we get a (sort of)
average squared deviation.
• Units like squared ounces or grams are just plain
weird.
Variance
• This is the average squared deviation.
• Not much use for us in this class, because the
units are squared.
• Think of it as “the last thing you calculate
before you take the square root.”
Getting back to sane units
• Simple – just take the square root of the
average squared deviation.
• And that’s what we call the “sample standard
deviation” – or just “standard deviation” for
short. (S or Sx)
• There is also a “population standard
deviation” – where we don’t subtract one
before dividing, but we won’t use it much.
A formula for nerds
$S_x = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
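The design steps from the previous slides (deviation, square, divide by one less than the count, square root) can be sketched in Python as follows. The data set is hypothetical, and `statistics.stdev` from the standard library is used only to cross-check the hand calculation.

```python
import math
from statistics import mean, stdev

# Hypothetical sample.
data = [2, 4, 4, 5, 10]
n = len(data)

x_bar = mean(data)                                   # step 1: the mean
squared_deviations = [(x - x_bar) ** 2 for x in data]  # step 2: square each deviation
variance = sum(squared_deviations) / (n - 1)         # step 3: divide by n - 1
s_x = math.sqrt(variance)                            # step 4: back to sane units

print(variance, s_x, stdev(data))
```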
Population Standard Deviation
• We use a Greek letter for this – the lower case
sigma: σ
$\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$
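For comparison, Python's standard `statistics` module offers both versions: `pstdev` divides by N and `stdev` divides by n − 1. A small sketch on a hypothetical data set:

```python
from statistics import pstdev, stdev

# Hypothetical data set, treated first as a whole population, then as a sample.
data = [2, 4, 4, 5, 10]

sigma = pstdev(data)  # population standard deviation (divide by N)
s_x = stdev(data)     # sample standard deviation (divide by n - 1)

print(sigma, s_x)
```

Dividing by the smaller number makes the sample version the larger of the two.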
Technology
• Excel, and many other computer systems, uses
stdev(…) for standard deviation, and has
average(…) (or mean(…) on other systems) for
mean, and median(…) for median
• On the TI, put the data in a list, and perform
Stats/Calc/1-var stats
On the TI
That little “down arrow”
• On the TI screen, that means “there is more
on the next page”: just scroll down to see it.
• We’ll cover the second-screen stuff next.
Use of Standard Deviation
• Chebyshev’s Theorem: For any set of data:
– At least 75% of the data fall in the interval from
μ − 2σ to μ + 2σ.
– At least 88.9% of the data fall in the interval from
μ − 3σ to μ + 3σ.
– At least 93.8% of the data fall in the interval from
μ − 4σ to μ + 4σ.
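These three percentages are instances of the general Chebyshev bound: at least 1 − 1/k² of the data lie within k standard deviations of the mean, for any k greater than 1. A minimal sketch that reproduces them (the function name `chebyshev_lower_bound` is a hypothetical helper, not from the book):

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4):
    print(k, f"{100 * chebyshev_lower_bound(k):.2f}%")
```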
Z-Score and unusualness
• How far is a data point from the mean?
• If the data value is outside two standard
deviations of the mean, either above or below,
then the number is uncommon.
• $z = \dfrac{x - \mu}{\sigma}$ or $z = \dfrac{x - \bar{x}}{S_x}$ says how far x is away from
the mean (μ or x-bar), in units of standard
deviation (σ or Sx). “Dilbert’s way of
expressing weirdness”
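A minimal sketch of the z-score and the "more than two standard deviations" rule of thumb, using the sample mean and sample standard deviation. The data set and the helper name `z_score` are hypothetical.

```python
from statistics import mean, stdev

def z_score(x, center, spread):
    """How many standard deviations x lies from the mean."""
    return (x - center) / spread

# Hypothetical sample: mean 5, sample standard deviation 3.
data = [2, 4, 4, 5, 10]
x_bar = mean(data)
s_x = stdev(data)

z = z_score(10, x_bar, s_x)
is_uncommon = abs(z) > 2  # outside two standard deviations, above or below

print(round(z, 2), is_uncommon)
```

Here the largest value, 10, sits about 1.67 standard deviations above the mean, so by this rule it is not flagged as uncommon.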
3.3 Ranking
• Percentile: ranks by 100ths
• Decile: ranks by 10ths
• Quartile: ranks by 4ths
• Quintiles: ranks by 5ths
Percentile
• Percentile: The kth percentile is the data value
that has k% of the data at or below that value.
• For example, “The 1%” refers to the highest-earning 1% of Americans. The 99th percentile
of income is about $400,000.
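One way to read the definition is to go in the other direction: pick a value and count what percent of the data is at or below it. A small sketch, with a hypothetical data set and a hypothetical helper name `percent_at_or_below`:

```python
def percent_at_or_below(data, value):
    """Percent of the data values that are less than or equal to value."""
    count = sum(1 for x in data if x <= value)
    return 100 * count / len(data)

# Hypothetical data: the whole numbers 1 through 20.
data = list(range(1, 21))

print(percent_at_or_below(data, 15))
```

Since 15 of the 20 values are at or below 15, the value 15 is the 75th percentile of this data set under the definition above.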
Quartiles
1. Sort the data in increasing order.
2. Find the median; this divides the data list into 2
halves.
3. Find the median of the data below the median.
This value is Q1.
4. Find the median of the data above the median.
This value is Q3.
There are slightly different algorithms for
computing Q1 and Q3. Our author uses: “Ignore
the median in both calculations for Q1 and Q3”
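The four steps, with the "ignore the median" rule, can be sketched in Python as follows. The function name `quartiles` and the two data sets are hypothetical; other software may use a different algorithm and return slightly different Q1 and Q3.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3), leaving the median out of both halves."""
    ordered = sorted(data)              # step 1: sort
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data values")
    half = n // 2
    lower = ordered[:half]              # values below the median position
    # With an odd count the middle value is the median itself, so skip it.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([13, 1, 7, 3, 11, 5, 9]))  # odd count
print(quartiles([1, 3, 5, 7, 9, 11]))      # even count
```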
Interquartile Range (IQR)
• IQR = Q3 − Q1. This is another measure of
spread. It is resistant to outliers in much the
same way that the median is a resistant measure of
center.
• Q1 − 1.5*IQR and Q3 + 1.5*IQR provide another
definition of “outlier”: any value below the first or above the second.
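A minimal sketch of the 1.5*IQR rule. It takes Q1 and Q3 as inputs so it works with whichever quartile algorithm you use; the function name `iqr_outliers`, the data set, and its quartile values are hypothetical.

```python
def iqr_outliers(data, q1, q3):
    """Return the values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

# Hypothetical data; ignoring the median (7), Q1 = 3 and Q3 = 11.
data = [1, 3, 5, 7, 9, 11, 40]

print(iqr_outliers(data, q1=3, q3=11))
```

With IQR = 8 the cutoffs are −9 and 23, so only the value 40 is flagged.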
5-Number Summary
• These 5 numbers are often used to provide a
nice summary of a distribution:
1. Minimum
2. Q1
3. Median
4. Q3
5. Maximum
Box-and-Whiskers Plot
Uses of Box-and-Whiskers plots
• They are frequently used to gauge symmetry
and spread of a distribution.
• Multiple Box-and-Whiskers plots on the same
field are convenient to compare two
distributions.