Intro to Stats Chapter 1 Lecture Notes

STAT 2331
Chapter 1: Looking at Data – Distributions (Exploring and Summarizing Data)
A variable is any characteristic we are studying – height, GPA, religious affiliation, income,
major, and number of pets. For each of these characteristics, we expect to see a certain amount of
variability – some people are tall, others are short; some people have many pets, some people
have no pets, etc. Statistical methods provide ways to measure and understand variability.
1.1 Different Types of Data (textbook page 2-8)
In general, there are two types of variables: categorical and quantitative. A variable is called
categorical if each observation belongs to one of a set of categories. A variable is called
quantitative if observations are numerical values. Some examples are given in the table below.
Variable Type
Examples
Categorical
Gender (male, female), type of residence (house, apartment, dorm,
other), political affiliation (Republican, Democrat, other)
Quantitative
Age, daily high temperature, SAT score, calories consumed per day
For quantitative variables, we distinguish between discrete and continuous. Discrete variables
have a countable number of values (for example, number of pets). Continuous variables have an
uncountable number of values; they can usually take on every value in an interval (for example, height).
If you’re not sure if a variable is categorical or quantitative, think about finding the average. If it
does make sense to find the average, then the variable is quantitative. If computing the average
does not make sense, then the variable is categorical.
Example 1: Variable types
Identify the variable and then determine if it is a categorical or quantitative variable.
1. The proportion of customers currently using Verizon is 43%.
2. The average high temperature in March (spring break month) in Athens is 66 °F.
Using Frequency Tables
Frequency tables are an easy way to summarize data (usually categorical). A frequency table is
a listing of possible values for a variable together with the number of observations for each
value. The term frequency means count.
Example 2: Favorite cookie
Thirty college students were randomly selected and asked which of the following is their
favorite cookie: Oreo, sugar, peanut butter, chocolate chip, brownie, or oatmeal raisin. The
responses are shown below. Organize the data into a frequency table.
Oreo
Sugar
Peanut Butter
Chocolate Chip
Chocolate Chip
Chocolate Chip
Oreo
Oreo
Oatmeal Raisin
Oreo
Oatmeal Raisin
Chocolate Chip
Chocolate Chip
Sugar
Brownie
Sugar
Oatmeal Raisin
Oreo
Brownie
Peanut Butter
Peanut Butter
Oatmeal Raisin
Chocolate Chip
Oreo
Oatmeal Raisin
Brownie
Peanut Butter
Brownie
Chocolate Chip
Oreo
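The tally for a frequency table can also be done in code. The following is a minimal Python sketch, assuming the 30 responses are typed in exactly as listed above; the variable names are illustrative only.

```python
from collections import Counter

# The 30 responses from Example 2, in the order listed.
responses = [
    "Oreo", "Sugar", "Peanut Butter", "Chocolate Chip", "Chocolate Chip",
    "Chocolate Chip", "Oreo", "Oreo", "Oatmeal Raisin", "Oreo",
    "Oatmeal Raisin", "Chocolate Chip", "Chocolate Chip", "Sugar", "Brownie",
    "Sugar", "Oatmeal Raisin", "Oreo", "Brownie", "Peanut Butter",
    "Peanut Butter", "Oatmeal Raisin", "Chocolate Chip", "Oreo", "Oatmeal Raisin",
    "Brownie", "Peanut Butter", "Brownie", "Chocolate Chip", "Oreo",
]

# Counter maps each category to its frequency (count).
frequency = Counter(responses)

# One row per category, most frequent first, with its proportion of the total.
for cookie, count in frequency.most_common():
    print(f"{cookie:<15} {count:>2}  {count / len(responses):.3f}")
```

The counts should add up to the number of students surveyed (30), which is a quick check on any frequency table.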
Favorite Cookie    Frequency
Oreo               ______
Chocolate Chip     ______
Oatmeal Raisin     ______
Sugar              ______
Peanut Butter      ______
Brownie            ______
1.2 Graphical Summaries of Categorical Data (textbook page 8-27)
The two primary graphical displays for a categorical variable are the bar graph and the pie
chart. In the bar graphs shown, notice that (1) the categories are on the horizontal axis, (2)
frequencies (or proportions) are on the vertical axis, and (3) the heights of the rectangles for each
category are equal to the category’s frequency or proportion.
A pie chart is a circle divided into sectors. Each sector represents a category of data.
NOW ON TO QUANTITATIVE VARIABLES
1.3 Graphs for Quantitative Variables (textbook page 8-27)
Recall the two different types of data: categorical and quantitative. Now we’ll focus on
quantitative variables.
A dot plot is as it sounds – a plot with dots. This is best understood through an example.
3, 7, 6, 8, 5, 5, 7, 5, 9
A dot plot for this data set looks like this:
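As a rough stand-in for the picture, a dot plot can be drawn as text. This is only a sketch: it assumes single-digit whole-number data (so the columns line up) and uses `*` for each dot.

```python
from collections import Counter

data = [3, 7, 6, 8, 5, 5, 7, 5, 9]
counts = Counter(data)
values = range(min(data), max(data) + 1)  # every value on the axis, 3 through 9

# Build one text row per stack height, from the tallest stack down to 1.
rows = []
for level in range(max(counts.values()), 0, -1):
    row = " ".join("*" if counts[value] >= level else " " for value in values)
    rows.append(row.rstrip())

# The last line is the horizontal axis.
axis = " ".join(str(value) for value in values)
dot_plot = "\n".join(rows + [axis])
print(dot_plot)

# Output:
#     *
#     *   *
# *   * * * * *
# 3 4 5 6 7 8 9
```

Each dot is one observation, so the stack of three dots above 5 shows that 5 occurs three times in the data set.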
Do you know what a histogram is? A histogram is a graph that uses bars to portray the
frequencies of the possible outcomes of a quantitative variable. The horizontal (x) axis represents
the values the variable can take on, grouped into intervals. The vertical (y) axis tells how many
observations fall within each interval.
Example 4: SMU on Offense
The table below shows the number of points that the SMU football team will score in the 2018-2019 season. Consider a histogram with bar widths by tens: 10-19, 20-29, 30-39, 40-49, and 50-59.
Opponent       Points    Opponent       Points
North Texas    31        TCU            52
Michigan       13        Navy           17
HBU            45        UCF            31
Tulane         41        Cincinnati     24
Houston        27        Connecticut    51
Memphis        18        Tulsa          30
UCF            48        Florida St.    26
You can use the point totals to create a frequency table, then use the table to create the
histogram. Usually, the bars in a histogram are touching (though that’s not always the case in the
images provided).
Group    Frequency
10-19    3
20-29    3
30-39    3
40-49    3
50-59    2
[Histogram: SMU Offense; x-axis: Points Scored; y-axis: Frequency]
Example 5: Sodium
The sodium values of 20 cereals are given in the table below.
Cereal                   Sodium    Cereal                   Sodium
Frosted Mini Wheats      0         Fruit Loops              140
Raisin Bran              340       Honey Bunches of Oats    180
All Bran                 70        Honey Nut Cheerios       190
Apple Jacks              140       Life                     160
Cap’n Crunch             200       Rice Krispies            290
Cheerios                 180       Honey Smacks             50
Cinnamon Toast Crunch    210       Special K                220
Crackling Oat Bran       150       Wheaties                 180
Fiber One                100       Corn Flakes              200
Frosted Flakes           130       Honeycomb                210
Using intervals of 50, connect information in the frequency table and histogram.
Interval    Frequency
0-49        1
50-99       2
100-149     4
150-199     6
200-249     5
250-299     1
300-349     1
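The step from raw values to a histogram's frequency table can be sketched in Python. This is one possible approach, assuming whole-number data and intervals of width 50 that start at 0; the names `WIDTH`, `bin_counts`, and `frequency_table` are illustrative.

```python
from collections import Counter

# Sodium values for the 20 cereals in the sodium example.
sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

WIDTH = 50  # interval width used in the example

# Integer division maps each value to the lower edge of its interval,
# e.g. 180 // 50 * 50 == 150, so 180 lands in the 150-199 interval.
bin_counts = Counter(value // WIDTH * WIDTH for value in sodium)

# Label each interval "low-high" and keep empty intervals with a count of 0.
frequency_table = {
    f"{low}-{low + WIDTH - 1}": bin_counts.get(low, 0)
    for low in range(0, max(sodium) + 1, WIDTH)
}

for interval, count in frequency_table.items():
    # A row of '#' marks gives a rough sideways histogram.
    print(f"{interval:>7} | {'#' * count} ({count})")
```

The bar for each interval is as long as its frequency, which is exactly the link between the frequency table and the histogram.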
Example 5: IQ Scores
The histogram shows 7th grade IQ scores. How many students were sampled?
a. 205
b. 98
c. 149
d. 324
Using the same histogram, which class has the most observations?
a. 60-69
b. 70-79
c. 80-89
d. 90-99
e. 100-109
f. 110-119
g. 120-129
h. 130-139
i. 140-149
Using the same histogram, how many observations are in the bin with the most?
a. 98
b. 100
c. 109
d. 119
The Shape of a Distribution
Looking at the shape of a histogram (or dot plot) allows us to describe the distribution of the
data set. Some common distributions are presented in the table below.
Shape: Symmetric/Unimodal
Description: One side is a mirror image of the other. The histogram looks symmetric (SAT scores, height of male SMU students).
How it looks: [histogram]

Shape: Skewed left
Description: Left tail is stretched out longer than the right tail (life span of humans).
How it looks: [histogram]

Shape: Skewed right
Description: Right tail is stretched out longer than the left tail (income, number of pets).
How it looks: [histogram]

Shape: Bimodal
Description: Two distinct humps (height of all SMU students – why?).
How it looks: [histogram]
We’ll deal mostly with the first three types of graphs shown above, all of which are unimodal
distributions. In picturing features such as skew and symmetry, it’s common to use smooth
curves to summarize the shape of a histogram.
Example 6: Reading a Histogram
The figure below shows a histogram for 7th grade IQ scores. Answer the questions that follow.
[Histogram: 7th Grade IQ Scores; x-axis: IQ Score, in classes 60-69, 70-79, 80-89, 90-99, 100-109, 110-119, 120-129, 130-139, 140-149; y-axis: Frequency, 0 to 100; bar labels as extracted from the figure: 98, 38, 32, 14, 2, 12, 4, 4, 1]
1. How many students were sampled?
2 + 4 + ⋯ + 4 + 1 = 205
2. Which class has the highest frequency? How many observations fell within that range?
3. Which class has the fewest observations? How many observations fell within that range?
4. What proportion of students have an IQ between 120 and 129?
5. Describe the shape of the distribution.
Example 6: More . . . Reading Histograms
Match the variable description to the possible histogram:
1. Scores on a fairly easy exam
2. SAT scores of a group of college students
3. Heights of college students
4. Number of medals won by countries in the 2018 Winter Olympics
[Histograms A, B, C, and D]
Example 7: Hitting streak
The following histogram shows how many hits a baseball player had in the first eight
games of the season. Convert the histogram to a data set.
[Histogram: Hits per Game; x-axis: Hits; y-axis: Frequency]
Time plots
A time plot shows each observation plotted against the time at which it was obtained. Time goes
on the horizontal axis.
1.4 Measuring the Center of Quantitative Data (textbook page 27-51)
It’s always a good idea to look at the data with a graph first to get a feel for the distribution – is it
symmetric, is it skewed, are there outliers? (We’ll talk more about outliers later.) After graphing
the data, you can summarize data with the center and the spread of the data set.
The most frequently used measures of center are the mean and the median. The mean (average)
is the sum of the observations divided by the number of observations. When observations are
ordered from smallest to largest, the median is the point that splits the data in two – half of the
observations are below it, half of the observations are above it.
The formula for the mean is:
$$\bar{x} = \frac{\sum x}{n}$$
Remember, x̄ is a statistic representing the sample average, while μ is a parameter representing
the population average. We will never calculate μ.
Example 8: Calculating mean and median
Student    Test 1    Test 2    Test 3    Test 4    Test 5
Score      20        20        80        100       100
Mean = ______
Median = ______
Why do we need different measures of center?
Student      Test 1    Test 2    Test 3    Test 4    Test 5    Mean      Median
Student A    50        50        50        50        50        ______    ______
Student B    30        40        55        55        70        ______    ______
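The mean and median for these score sets can be checked with Python's standard `statistics` module. This is a small sketch; the dictionary name `score_sets` and its labels are just for illustration.

```python
from statistics import mean, median

# Test scores from Example 8 and from the Student A / Student B comparison.
score_sets = {
    "Example 8": [20, 20, 80, 100, 100],
    "Student A": [50, 50, 50, 50, 50],
    "Student B": [30, 40, 55, 55, 70],
}

centers = {}
for label, scores in score_sets.items():
    # mean: sum of the observations divided by how many there are
    # median: middle value once the observations are sorted
    centers[label] = (mean(scores), median(scores))
    print(f"{label}: mean = {centers[label][0]}, median = {centers[label][1]}")
```

For the Example 8 scores the mean is 64 and the median is 80. The two low scores pull the mean down but leave the median untouched.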
1.5 Measuring the Variability of Quantitative Data (textbook 27-51)
We’ve discussed a few ways to describe the center of quantitative data (mean, median), and now
we’ll talk about how to describe the variability. Why do we need to measure the variability of
data?
Student      Test 1    Test 2    Test 3    Test 4    Test 5
Student A    50        50        50        50        50
Student B    30        40        50        60        70
Example 9: Variability
Which set of numbers (A, B, or C) do you think has the most variability?
A: 13, 17, 1, 18, 4, 22, 11, 8, 6
B: 122, 125, 127, 126, 128, 121, 128
C: 51, 78, 53, 42, 47, 33, 82, 75, 91
The three measures of spread (range, variance, and standard deviation)
Range:
Variance:
Standard deviation:
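To compare a first impression of variability with actual numbers, the three measures can be computed for sets A, B, and C from Example 9. This sketch uses the standard definitions: range is the largest value minus the smallest, sample variance divides by n − 1, and the standard deviation is the square root of the variance. Python's `statistics.variance` and `statistics.stdev` follow the sample (n − 1) convention.

```python
from statistics import stdev, variance

data_sets = {
    "A": [13, 17, 1, 18, 4, 22, 11, 8, 6],
    "B": [122, 125, 127, 126, 128, 121, 128],
    "C": [51, 78, 53, 42, 47, 33, 82, 75, 91],
}

summary = {}
for name, values in data_sets.items():
    summary[name] = {
        "range": max(values) - min(values),  # largest minus smallest
        "variance": variance(values),        # sample variance (divides by n - 1)
        "stdev": stdev(values),              # square root of the sample variance
    }
    print(name, summary[name])
```

Set B has the largest numbers but the smallest spread: variability depends on how far the observations are from each other, not on how big they are.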
Mean
Median
Example 10: Calculating standard deviation
A data set contains the observations: 6, 1, 4, 3, 1
a. x =
b. x2 =
c. ∑(x − x̄) =
Observation    (x − x̄)    (x − x̄)²
6
1
4
3
1
Sums
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\underline{\qquad}}{5 - 1}} = \sqrt{\underline{\qquad}}$$
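The table-based calculation can be mirrored step by step in Python as a way to check hand work. This is a sketch of the sample standard deviation formula with n − 1 in the denominator; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

data = [6, 1, 4, 3, 1]
n = len(data)

x_bar = sum(data) / n                              # sample mean
deviations = [x - x_bar for x in data]             # (x - x_bar) column
squared_deviations = [d ** 2 for d in deviations]  # (x - x_bar)^2 column
sum_sq = sum(squared_deviations)                   # "Sums" row of the squared column

s = sqrt(sum_sq / (n - 1))                         # sample standard deviation

print("mean:", x_bar)
print("deviations:", deviations, "sum:", sum(deviations))
print("squared deviations:", squared_deviations, "sum:", sum_sq)
print("s:", round(s, 4), "| statistics.stdev:", round(stdev(data), 4))
```

The deviations always sum to 0, which is why they are squared before being added. Here the squared deviations sum to 18, so s = √(18/4) ≈ 2.12.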
Example 11: Exam standard deviation
For an exam given to a class, the students’ scores
ranged from 35 to 98 with a mean of 74. Which of the
following is the most realistic value for the standard
deviation: -10, 0, 3, 12, 63? What is unrealistic about
the other values?
A distribution with a large standard deviation will be
wider than a distribution with a smaller standard
deviation. In the graphs below, the second distribution
is wider and has a larger standard deviation.

You won’t EVER have to calculate standard
deviation by hand. Any statistical software
package, including StatCrunch, and advanced
scientific calculators (TI-83 and TI-84) will do this
for you.
Related to the standard deviation is the variance. Variance is standard deviation squared. The
symbol for sample variance is s². If you are given the standard deviation and asked to find the
variance, all you have to do is square s. The formula is:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Because the mean is used in calculating both s and s², these statistics will be influenced by
extremely large or extremely small observations.
A Quick Note on Symbols
Remember, a statistic is a numerical summary of a sample. A parameter is a numerical
summary of a population. In practice, parameters are generally unknown. I have introduced
symbols for the sample standard deviation and the sample variance, but not the population
standard deviation or the population variance. A summary of the symbols you should be familiar
with is in the table below.
Statistic/Parameter    Sample    Population
Mean                   x̄         μ
Proportion             p̂         p
Standard deviation     s         σ
Variance               s²        σ²
1.6 Using Measures of Position to Describe Variability (textbook 27-51)
The mean and median describe the center of the data. The range and standard deviation describe
the variability. We can use some measures of position to further describe the data.
The pth percentile is a value such that p percent of the observations fall below or at that value.
Suppose your SAT score falls at the 90th percentile. This means 90% of the people who took the
SAT scored at or below your score (and so 10% of them scored above you).
Three useful percentiles are the quartiles. Each set of data has three quartiles. The 25th percentile
is referred to as the first quartile (Q1). The 50th percentile is the second quartile (Q2), but we
just call this the median. The 75th percentile is referred to as the third quartile (Q3). The
quartiles split the distribution into four parts, each containing 25% of the observations.
Example 12: Manual Dexterity
A research study of manual dexterity involved determining the time required to complete a task.
The time required for each of 40 individuals is:
7.1 7.2 7.2 7.6 7.6 7.9 8.1 8.1 8.1 8.3 8.3 8.4 8.4 8.9 9.0 9.0 9.1
9.1 9.1 9.1 9.4 9.6 9.9 10.1 10.1 10.1 10.2 10.3 10.5 10.7 11.0 11.1 11.2 11.2
11.2 12.0 13.6 14.7 14.9 15.5
Compute Median, Q1, and Q3
Median = ______
Q1 = __________
Q3 = __________
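Quartile conventions differ between textbooks and software, so the following sketch makes one explicit assumption: Q1 is the median of the lower half of the sorted data and Q3 is the median of the upper half. The helper name `quartiles` is hypothetical, and it assumes at least two observations.

```python
from statistics import median

# Task completion times for the 40 individuals in Example 12.
times = [
    7.1, 7.2, 7.2, 7.6, 7.6, 7.9, 8.1, 8.1, 8.1, 8.3,
    8.3, 8.4, 8.4, 8.9, 9.0, 9.0, 9.1, 9.1, 9.1, 9.1,
    9.4, 9.6, 9.9, 10.1, 10.1, 10.1, 10.2, 10.3, 10.5, 10.7,
    11.0, 11.1, 11.2, 11.2, 11.2, 12.0, 13.6, 14.7, 14.9, 15.5,
]


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # observations below the median position
    upper = ordered[-half:]  # observations above the median position
    return median(lower), median(ordered), median(upper)


q1, med, q3 = quartiles(times)
print(f"Q1 = {q1:.2f}, median = {med:.2f}, Q3 = {q3:.2f}")
```

Other tools (for example `statistics.quantiles` with its default settings) interpolate differently and may report slightly different quartiles for the same data.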
The Interquartile Range (IQR)
The middle 50% of observations fall between the first quartile and the third quartile – 25% from
Q1 to the median and 25% from the median to Q3. The distance from Q1 to Q3 is called the
interquartile range, denoted IQR.
IQR = Q3 − Q1
Example 13: Manual Dexterity IQR
Find the IQR for the manual dexterity data
Q1 = 8.3
Q3 = 10.85
IQR =
Detecting Potential Outliers
An outlier is an unusually small or unusually large observation. Outliers can occur due to an
error in data entry, but this isn’t always the case. Consider the number of home runs Brady
Anderson hit per season from 1992 to 2001: 21, 13, 12, 16, 50, 18, 18, 24, 19, 8. Fifty is an
unusually large observation and would likely be classified as an outlier.
One of the ways to flag an observation as potentially being an outlier is the 1.5 × IQR Criterion.
This criterion states that if an observation is less than Q1 − 1.5(IQR) or greater than Q3 +
1.5(IQR), it is considered an outlier.
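As an illustration, the criterion can be applied to the Brady Anderson home run data quoted above. The sketch assumes the median-of-halves convention for Q1 and Q3; other quartile conventions would shift the fences slightly.

```python
from statistics import median

# Home runs per season, 1992 to 2001.
home_runs = [21, 13, 12, 16, 50, 18, 18, 24, 19, 8]

ordered = sorted(home_runs)
half = len(ordered) // 2
q1 = median(ordered[:half])   # median of the lower half
q3 = median(ordered[-half:])  # median of the upper half
iqr = q3 - q1

# Fences from the 1.5 x IQR criterion.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any observation outside the fences is flagged as a potential outlier.
flagged = [x for x in home_runs if x < lower_fence or x > upper_fence]
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: {lower_fence} to {upper_fence}; flagged: {flagged}")
```

With these data Q1 = 13 and Q3 = 21, so the fences are 1 and 33, and only the 50-home-run season is flagged.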
Example 14: Manual Dexterity outliers?
Are any of the observations in the manual dexterity data outliers?
Recall: Min = 7.1, Q1 = 8.3, Median = 9.25, Q3 = 10.85, Max = 15.5, IQR = 2.55
The Five-Number Summary and Box Plots
The five-number summary of a data set is the minimum value, the first quartile, the median, the
third quartile, and the maximum value. The five-number summary is the basis of a graphical
display called the box plot. In other textbooks, the box plot is occasionally called a box-and-whisker plot because of its appearance.
Example 15: Manual Dexterity Boxplot
Using the manual dexterity data from the previous example, construct a box plot.
Recall: Min = 7.1, Q1 = 8.3, Median = 9.25, Q3 = 10.85, Max = 15.5, IQR = 2.55
How to Measure and Report Center and Spread
The five-number summary is usually better than the mean and standard deviation for describing a
skewed distribution or a distribution with extreme outliers. Use the mean and the standard
deviation only for reasonably symmetric distributions that are free of outliers.
Which measures of center and spread should be used for a distribution?
• If the shape is skewed, the median and IQR should be reported.
• If the shape is unimodal and symmetric, the mean and standard deviation and possibly the
median and IQR should be reported.
• If there are multiple modes, try to determine if the data can be split into separate groups.
• If there are unusual observations, point them out and report the mean and standard
deviation with and without the values.
• Always pair the median with the IQR and the mean with the standard deviation.
• Remember: the median and IQR are resistant to skewness and outliers, but the mean
and standard deviation are not.
Side-by-Side Boxplots Help to Compare Groups
Boxplots are particularly useful to compare two or
more groups on a quantitative variable. Some
engineers in Germany were investigating ways to
improve traffic flow by enabling traffic lights to
communicate information about traffic flow with
nearby lights. The graph below displays the results
of one experiment that simulated buses moving
along a street and recorded the delay time (in
seconds) for both a fixed time and a flexible system
of traffic lights. Compare the two groups.
Using the Box Plot to Determine the Shape of the Distribution
Like a histogram, a box plot can tell us if the distribution of the data is symmetric or if skew is
present.
• If the box plot looks approximately symmetric, then the distribution is approximately
symmetric.
• If the median is closer to Q1 and/or the right whisker is much longer than the left
whisker, then the distribution is skewed right.
• If the median is closer to Q3 and/or the left whisker is much longer than the right
whisker, then the distribution is skewed left.
1.7 Normal Distribution (textbook 27-51)
Density Curves
A density curve is a curve that
- is always on or above the horizontal axis
- has an area of exactly 1 underneath it
Comparing the Mean and Median
The mean is sensitive to extreme values in the data set. This means an extremely large value will
pull the mean to the right and an extremely small value will pull the mean to the left. This is
because the calculation of the mean uses all values in the data set.
The median, which is determined by only the values in the middle of the data set, is generally
resistant to extreme values.
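A quick way to see this is to compute both measures with and without one extreme value. The sketch below reuses the Brady Anderson home run counts quoted earlier in these notes (21, 13, 12, 16, 50, 18, 18, 24, 19, 8) and drops the unusually large season of 50.

```python
from statistics import mean, median

home_runs = [21, 13, 12, 16, 50, 18, 18, 24, 19, 8]
without_50 = [x for x in home_runs if x != 50]  # same data minus the extreme value

mean_with, median_with = mean(home_runs), median(home_runs)
mean_without, median_without = mean(without_50), median(without_50)

print(f"with 50:    mean = {mean_with:.2f}, median = {median_with}")
print(f"without 50: mean = {mean_without:.2f}, median = {median_without}")
```

Removing the single extreme season lowers the mean from 19.9 to about 16.6, while the median stays at 18 in both cases.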
Normal Distributions
All normal distributions have symmetric, unimodal, and bell-shaped distribution curves.
The mean μ determines the location of the curve and the standard deviation σ determines its spread.
The Empirical Rule (aka 68-95-99.7 Rule)
For a normal distribution, the value of σ has a more precise interpretation. The Empirical
Rule says that if a distribution is bell shaped, then approximately
• 68% of the observations fall within 1 standard deviation of the mean, denoted μ ± σ
• 95% of the observations fall within 2 standard deviations of the mean, μ ± 2σ
• 99.7% of the observations fall within 3 standard deviations of the mean, μ ± 3σ
This graph illustrates this rule.
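The three percentages are rounded values of areas under the normal curve. As a check, they can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the standard normal distribution (μ = 0, σ = 1) is used here for convenience, since the shares are the same for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of observations within k standard deviations of the mean:
# area under the curve between -k and +k.
shares = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```

The computed shares are about 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.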
Example 12: White Walkers
White walkers have taken over Westeros. Once the white walkers invade, the mean time until
every dead body in one city is revived is 150 minutes with a standard deviation of 25 minutes.
Assume the distribution of these times is bell shaped.
1. What percentage of cities have every dead body revived in somewhere between 125 and
150 minutes?
2. In 95% of cities, every dead body will be revived within what range of times?
3. What percentage of cities have every dead body revived in less than 125 minutes?
Standardizing observations: z-score
The z-score of an observation tells us how many standard deviations the observation falls from
the mean. A positive z-score indicates the observation is above the mean. A negative z-score
indicates the observation is below the mean.
$$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}} = \frac{\text{observation} - \mu}{\sigma}$$
Example 16: Height and z-scores
The heights of men aged 20-29 follow a normal distribution with a mean of 70 inches and a
standard deviation of 2.8 inches. The heights of women also follow a normal distribution, with a
mean of 64.6 inches and a standard deviation of 2.6 inches.
1. Find the z-score for a 75-inch-tall man.
2. Find the z-score for a 70-inch-tall woman.
3. Who is relatively taller, a 75-inch man or a 70-inch woman?
4. What is the z-score for a man whose height is 2.3 standard deviations below the average
height?
5. David’s height is 1.1 standard deviations above the mean. How tall is David?
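The z-score formula and its inverse (going from a z-score back to a height) can be sketched as below. The function name `z_score` and the constants `MEN` and `WOMEN` are illustrative; the means and standard deviations are the ones given in this example.

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / standard_deviation


MEN = (70.0, 2.8)    # (mean, standard deviation) of men's heights, in inches
WOMEN = (64.6, 2.6)  # (mean, standard deviation) of women's heights, in inches

z_man = z_score(75, *MEN)      # 75-inch-tall man
z_woman = z_score(70, *WOMEN)  # 70-inch-tall woman

# Going the other way: height = mean + z * standard deviation.
david_height = MEN[0] + 1.1 * MEN[1]

print(f"man: z = {z_man:.2f}, woman: z = {z_woman:.2f}")
print(f"David: {david_height:.2f} inches")
```

The man's z-score is about 1.79 and the woman's is about 2.08, so the woman is taller relative to her group even though she is shorter in inches.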
Using z-scores to check for Outliers
When the data are approximately normal, a data value is regarded as an outlier if it falls more
than three standard deviations from the mean. In other words, if an observation has a z-score less
than -3 or greater than +3, then it is an outlier.
(Think about this in connection with the Empirical Rule, which says that about 99.7% of the data
should fall within three standard deviations of the mean. Any observation beyond this
three-standard-deviation band is an outlier.)
Example 17: Heights of Men
Assume that male heights are normally distributed with a mean of 70 inches and a standard
deviation of 2.8 inches.
1. Would a male with a height of 78 inches be considered an outlier?
2. What height is the cutoff for outliers among short men?
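The z-score rule for outliers can be sketched in a few lines. The helper `is_outlier` is hypothetical, and the cutoff of 3 standard deviations is the one stated in this section.

```python
MEAN, SD = 70.0, 2.8  # male heights in inches, as given in the example


def is_outlier(height, mean=MEAN, sd=SD, cutoff=3.0):
    """Flag a value whose z-score is below -cutoff or above +cutoff."""
    z = (height - mean) / sd
    return abs(z) > cutoff


z_78 = (78 - MEAN) / SD      # z-score for a 78-inch-tall man
low_cutoff = MEAN - 3 * SD   # heights below this are outliers
high_cutoff = MEAN + 3 * SD  # heights above this are outliers

print(f"z for 78 inches: {z_78:.2f}, outlier: {is_outlier(78)}")
print(f"outlier cutoffs: below {low_cutoff:.1f} or above {high_cutoff:.1f} inches")
```

A height of 78 inches has a z-score of about 2.86, just inside the cutoff, so it is not flagged. The cutoff for short men is 70 − 3(2.8) = 61.6 inches.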