CHAPTER 1, SECTIONS 1.1 AND 1.2 Revised January 24, 2012
Section 1.1
Looking at Data
Individuals (units) are objects described by a set of data. A unit can be a
person, a place, or a thing (e.g., a student, the city of West Lafayette, Purdue
University). A person is often called a subject.
A variable is any characteristic of an individual. It could be your birth date,
gender, marital status, in-state or out-of-state status, major, semester in
school, GPA, etc.
There are two types of variables:
Quantitative variable: takes numerical values for which arithmetic
operations such as adding and averaging make sense. Examples: your age,
GPA, exam scores in this course.
Categorical variable: places an individual into one of several groups or
categories, and is summarized by the count or percent of the individuals in
each category.
Examples: your major, gender, academic year (freshman, sophomore,
junior, senior), in-state vs. out-of-state status, etc.
The distribution of a variable describes what values that variable takes and
how often it takes each value.
If you have more than one variable in a problem, look at each variable by
itself first, then look for any relationships between the variables.
Example:
Lecture 2, Section 1.1 & 1.2
Identify the unit in the following questions and whether the answer would
give you categorical or quantitative data. If it is categorical, state the
possible answers.
a) What letter grade did you get in your Calculus class last semester?
b) What was your score on the last exam?
c) What is your GPA?
d) How many red M&Ms are in this bag?
e) What colors are in the bag?
f) What color was the most common?
g) Which type of M&Ms has more red ones, peanut or plain?
We will look at describing data on a single variable by:
1. Graphing it, always a good place to start.
2. Finding numerical summaries.
Graphing Methods:
The best way to examine data is to make a graph of it.
Graphing techniques depend on the type of variable.
For Categorical variables the most commonly used types are
1. Bar graph
2. Pie graph
With bar graphs, the y-axis represents either the
1. frequency (number of observations) in each of the categories,
which are displayed on the x-axis, OR
2. relative frequency (number of observations / total number of
observations).
Example:
The distribution of the brand of golf drivers used in the 2008 Mercedes
Championship:
Driver Brand    No. of Users    Percent
Taylor Made     15              37.5
Titleist        10              25
Callaway        6               15
Ping            3               7.5
Nike            2               5
All others      4               10
Total           40              100
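The Percent column is the relative frequency of each brand. As a minimal sketch (the function name `relative_frequencies` is illustrative, not part of the course material), the percents can be recomputed from the counts, with a row of `#` characters standing in for the bars of a bar graph:

```python
# Counts taken from the driver-brand table.
counts = {
    "Taylor Made": 15,
    "Titleist": 10,
    "Callaway": 6,
    "Ping": 3,
    "Nike": 2,
    "All others": 4,
}

def relative_frequencies(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}

percents = relative_frequencies(counts)
for brand, pct in percents.items():
    # One '#' per user gives a crude text bar graph of the frequencies.
    print(f"{brand:<12} {counts[brand]:>3} {pct:>6.1f}%  {'#' * counts[brand]}")
```

Plotting the counts or the percents gives bars of the same relative heights; only the scale on the y-axis changes.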
For Quantitative Variables:
1. Stem-and-leaf-plots
2. Histograms
3. Box Plots
Stem-and-leaf plots: Display the actual values of all observations. Good
for small data sets. Similar in shape to a histogram.
Steps:
1. Arrange the data in ascending order.
2. Dissect each observation into a stem and a leaf. The leaf is the right-most
digit (always a single digit from 0 to 9), and the stem is all digits in
front of the leaf.
3. Write stems in a vertical column with the smallest at the top, and draw a
vertical line to separate the stems from the leaves.
4. Write each leaf in the row to the right of its stem, in increasing order out
from the stem.
5. As an option we might want to “split” the stems, i.e., list each stem twice,
in order to spread out the leaves. Leaves 0 through 4 go on the first stem
and leaves 5 through 9 go on the second stem.
6. As another option we might want to reduce the number of stems by
rounding each observation to a value ending in 0 and dropping the 0 so
that the leaf is the last digit before the 0.
Example: Bob’s last 20 golf scores, beginning with his last score.
69 73 77 77 80 76 75 77 78 78 77 81 82 75 79 76 83 77 80 84

Leaves in the order recorded:
stems | leaves
6 | 9
7 | 3776578875967
8 | 012304

Leaves in increasing order:
stems | leaves
6 | 9
7 | 3556677777889
8 | 001234

Split stems:
stems | leaves
6 |
6 | 9
7 | 3
7 | 556677777889
8 | 001234
8 |
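The stem-and-leaf steps can be sketched in code. This is a minimal illustration that assumes non-negative whole-number data; the function name `stem_and_leaf` and its `split` option are hypothetical names chosen for this example.

```python
from collections import defaultdict

scores = [69, 73, 77, 77, 80, 76, 75, 77, 78, 78,
          77, 81, 82, 75, 79, 76, 83, 77, 80, 84]

def stem_and_leaf(values, split=False):
    """Return a list of (stem, leaves) rows; leaves is a string of digits.

    With split=True each stem appears twice: leaves 0-4 first, then 5-9.
    """
    rows = defaultdict(list)
    for v in sorted(values):              # step 1: ascending order
        stem, leaf = divmod(v, 10)        # step 2: leaf is the last digit
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))
    stems = range(min(values) // 10, max(values) // 10 + 1)
    halves = (0, 1) if split else (0,)
    # Steps 3-4: stems from smallest to largest, leaves in increasing order.
    return [(s, "".join(rows[(s, h)])) for s in stems for h in halves]

for stem, leaves in stem_and_leaf(scores):
    print(f"{stem} | {leaves}")
```

Calling `stem_and_leaf(scores, split=True)` lists every stem twice, including the empty rows, which keeps the vertical scale of the plot evenly spaced.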
Example: Breaking strength of wood.
23422, 25389, 28128, 22673, 29452, 28138, 24487, 26841, 27793

Rounded to 100's:
stems | leaves
22 | 7
23 | 4
24 | 5
25 | 4
26 | 8
27 | 8
28 | 11
29 | 5

Rounded to 1000's (split stems):
stems | leaves
2 | 334
2 | 578889
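The rounding option can be checked with a short sketch. The helper name `round_and_drop` is an assumption made for this example; it rounds to the nearest multiple of a unit and drops the trailing zeros, so the last remaining digit becomes the leaf.

```python
strengths = [23422, 25389, 28128, 22673, 29452, 28138, 24487, 26841, 27793]

def round_and_drop(values, unit):
    """Round each value to the nearest multiple of unit, then drop the zeros."""
    # Adding unit // 2 before integer division rounds halves upward.
    return sorted((v + unit // 2) // unit for v in values)

hundreds = round_and_drop(strengths, 100)     # 23422 -> 234 (stem 23, leaf 4)
thousands = round_and_drop(strengths, 1000)   # 23422 -> 23  (stem 2, leaf 3)

for label, rounded in (("100's", hundreds), ("1000's", thousands)):
    print(f"Rounded to {label}:")
    for value in rounded:
        stem, leaf = divmod(value, 10)
        print(f"  {stem} | {leaf}")
```

This prints one leaf per line; leaves that share a stem would be written on the same row in the finished plot.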
Histograms: Used for large data sets. The data are grouped into class
intervals, and the height of each bar displays the count or the percent of
the observations in that class interval.
Steps:
1. Divide the range of data into class intervals of equal width.
2. Count the number of individuals in each class interval or develop a
percent of the total in each class interval. The counts in each class
interval are called frequencies. The table of frequencies for all class
intervals is called a frequency table.
3. Draw the Histogram.
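The three steps can be sketched as follows, using Bob's 20 golf scores from the stem-and-leaf example. The starting point of 65 and the class width of 5 are assumptions chosen for illustration, and `frequency_table` is a hypothetical helper name.

```python
scores = [69, 73, 77, 77, 80, 76, 75, 77, 78, 78,
          77, 81, 82, 75, 79, 76, 83, 77, 80, 84]

def frequency_table(values, start, width):
    """Count integer observations in equal-width classes [lower, lower + width).

    Assumes every value is >= start.
    """
    n_classes = (max(values) - start) // width + 1    # step 1: equal widths
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1             # step 2: tally each class
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

table = frequency_table(scores, start=65, width=5)
for lower, upper, count in table:
    percent = 100 * count / len(scores)
    # Step 3: a bar of '#' characters stands in for the histogram bar.
    print(f"{lower} - <{upper}  {count:>2}  {percent:>5.1f}%  {'#' * count}")
```

A different starting point or class width can change the apparent shape of the histogram, so it is worth trying more than one choice.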
Example 1: USGA handicap ranges and percentage within each range.
HANDICAP    PERCENT
0 - <5      4.8
5 - <10     15.6
10 - <15    26.4
15 - <20    24.8
20 - <25    15.6
25 - <30    7.9
30 - <35    3.2
35 - <40    1.7
All         100
If the overall pattern of a large number of observations is quite regular, we
choose to describe it by a smooth curve called a density curve. A density
curve is an idealized model for a distribution of data.
There are many types of density curves.
[Figure: sketches of unimodal, bimodal, and multimodal density curves]
What to look for in a histogram or density curve: Look at the overall
pattern of the data and any deviations from that pattern (outliers).
The pattern is described by shape, center, and spread.
1. SHAPE:
[Figure: sketches of unimodal left-skewed, unimodal symmetric, and unimodal
right-skewed distributions]
2. The center is defined by either 1) median, 2) mean, or 3) mode.
- The median, M, is the middle value in an ascending list of values.
- The mean is the arithmetic average of all the values: either the
sample mean, x̄, or the population mean, µ.
- The mode is the value which occurs most often.
- If the pattern is symmetric, the mean = median.
- If the pattern is left skewed, the mean < median.
- If the pattern is right skewed, the mean > median.
3. The spread defines the width of a distribution.
- Sometimes the spread is a From ___ To ___ expression.
- Sometimes it is just the total range.
- Sometimes it is the center + a value, and center – a value.
Section 1.2
Describing Distributions with Numerical Summaries:
A numerical summary must describe two important features:
1. The value that represents the center of the distribution, and
2. A measure of the spread (variability) going away from the center.
Measuring Center:
1. Mode: The measurement that occurs most often. Not often used.
2. Median: The middle value (countwise) when the measurements are
arranged from the lowest to the highest. Symbol is M
Steps to finding the median:
a. Arrange the observations from smallest to largest. Count the
observations, n. Calculate (n + 1)/2 to find the position of the center of
the data set.
b. If n is odd, M is the data point at the center of the data set.
c. If n is even, (n + 1)/2 ends with .5 and falls between 2 data points,
called the middle pair. M = the average of the middle pair.
Example: Bob’s last 20 golf scores,
69 73 77 77 80 76 75 77 78 78 77 81 82 75 79 76 83 77 80 84
Put the data in ascending order:
69 73 75 75 76 76 77 77 77 77 77 78 78 79 80 80 81 82 83 84
n = 20, position of the median = (20 + 1)/2 = 10.5
The median is halfway between the 10th and 11th values, which are both 77,
and is therefore 77.
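The median steps can be written out as a short sketch. The function below is an illustration that follows steps a through c directly; Python's standard library also offers `statistics.median`, which gives the same result.

```python
scores = [69, 73, 77, 77, 80, 76, 75, 77, 78, 78,
          77, 81, 82, 75, 79, 76, 83, 77, 80, 84]

def median(values):
    """Median found by locating position (n + 1)/2 in the ordered data."""
    ordered = sorted(values)              # step a: smallest to largest
    n = len(ordered)
    position = (n + 1) / 2                # 1-based position of the center
    if n % 2 == 1:
        # Step b: n odd, so the position is a whole number.
        return ordered[int(position) - 1]
    # Step c: n even, so the position ends in .5; average the middle pair.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median(scores))
```

The subtraction of 1 inside the list index is needed only because Python lists are numbered from 0, while positions in the notes are numbered from 1.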
3. Mean or average (arithmetic mean): The sum of the measurements
divided by the total number of measurements.
On Bob’s last 20 golf scores, the sum of the scores is 1554:
$$\bar{x} = \frac{1}{n}\sum x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1554}{20} = 77.7$$
RESISTANT MEASURES:
A measure that can resist the influence of extreme observations is called a
resistant measure.
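A quick experiment shows what resistance means in practice. In this sketch the extreme score of 120 is hypothetical, introduced only to show the effect; the rest of the data are Bob's 20 golf scores.

```python
from statistics import mean, median

scores = [69, 73, 77, 77, 80, 76, 75, 77, 78, 78,
          77, 81, 82, 75, 79, 76, 83, 77, 80, 84]

# Hypothetical change: replace the highest score, 84, with a very bad round of 120.
with_extreme = [120 if s == 84 else s for s in scores]

print("original:     mean =", mean(scores), " median =", median(scores))
print("with extreme: mean =", mean(with_extreme), " median =", median(with_extreme))
```

The mean moves because every value enters the sum, while the median depends only on the middle of the ordered list and so stays where it was.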
Characteristics of each measure:
Mode:
- There can be more than one mode for a data set.
- Resistant measure.
- For grouped data, the value can change depending on the categories.
- Can be found for both categorical and quantitative data.
Median:
- Only one median.
- Resistant measure.
- Quantitative data only.
- Usually used with skewed data.
- Affected less than the mean.
Mean:
- Only one mean.
- Non-resistant measure.
- Quantitative data only.
- Usually used with symmetric data.
- Affected more than the median.
Measuring Spread: (variability)
There are three measures of spread that we will look at:
1. Range, Maximum - Minimum
2. Five number summary, which is Minimum, First Quartile, Median,
Third Quartile and Maximum
3. Standard Deviation, or Variance, which is the standard deviation squared.
1. Range
The difference between the largest and the smallest measurement of a data
set. Range = Maximum - Minimum
Example: the range of Bob’s golf scores is
84 - 69 = 15
2. The Five Number Summary
The five number summary involves determination of the Quartiles and the
Median. We showed how the Median is determined above.
Now we show how the Quartiles are determined.
The quartiles Q1 and Q3 are calculated as follows
a. Arrange the observations in ascending order and determine the
median, M, as shown previously.
b. Now repeat the procedure to find the median of the lower half, i.e.,
the first quartile, Q1, is the median of the observations which are
below the overall median, M.
c. Now repeat the procedure to find the median of the upper half, i.e.,
the third quartile, Q3, is the median of the observations which are
above the overall median, M.
Example: Bob’s last 20 golf scores, beginning with his last score.
69 73 77 77 80 76 75 77 78 78 77 81 82 75 79 76 83 77 80 84
Put the data in ascending order:
69 73 75 75 76 76 77 77 77 77 77 78 78 79 80 80 81 82 83 84
Lower Half: Observations 1 through 10, n = 10 in lower half
Upper Half: Observations 11 through 20, n = 10 in upper half
NOTE THAT THE MEDIAN DOES NOT BELONG TO EITHER
HALF. THE MEDIAN SEPARATES THE TWO HALVES.
The First Quartile is at position 5.5 in the lower half and has a value of 76.
The Third Quartile is at position 5.5 in the upper half and has a value of 80.
Five Number Summary:
Minimum, Q1, MEDIAN, Q3, Maximum
Example with Bob’s golf scores:
69, 76, 77, 80, 84
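The quartile procedure can be sketched as below. The function names are illustrative. Note that statistical software often uses slightly different quartile conventions, so built-in quantile functions may not reproduce these values exactly for every data set.

```python
scores = [69, 73, 77, 77, 80, 76, 75, 77, 78, 78,
          77, 81, 82, 75, 79, 76, 83, 77, 80, 84]

def median_of_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                     # observations below the median
    # When n is odd the median is a data point and belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median_of_sorted(lower), median_of_sorted(ordered),
            median_of_sorted(upper), ordered[-1])

print(five_number_summary(scores))
```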
The Interquartile Range, IQR
The IQR is the distance between the first and the third quartiles.
IQR = Q3 – Q1. In the example above, it would be 80-76 = 4
Outliers: (1.5 × IQR criterion)
Call an observation a suspected outlier if it falls more than 1.5(IQR) above
the third quartile or more than 1.5(IQR) below the first quartile.
Example with Bob’s golf scores:
IQR = 80-76 = 4
1.5 (IQR) = 6
Upper cutoff = 80 + 6 = 86
Lower cutoff = 76 – 6 = 70
(69 is a suspected outlier)
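The 1.5(IQR) rule is easy to apply mechanically. This sketch takes Q1 = 76 and Q3 = 80 from the example; the function name `suspected_outliers` is an illustrative choice.

```python
scores = [69, 73, 77, 77, 80, 76, 75, 77, 78, 78,
          77, 81, 82, 75, 79, 76, 83, 77, 80, 84]

def suspected_outliers(values, q1, q3):
    """Return (lower cutoff, upper cutoff, flagged values) for the 1.5(IQR) rule."""
    iqr = q3 - q1
    lower_cutoff = q1 - 1.5 * iqr
    upper_cutoff = q3 + 1.5 * iqr
    # An observation is flagged only if it falls strictly outside the cutoffs.
    flagged = [v for v in values if v < lower_cutoff or v > upper_cutoff]
    return lower_cutoff, upper_cutoff, flagged

lower_cutoff, upper_cutoff, flagged = suspected_outliers(scores, q1=76, q3=80)
print(lower_cutoff, upper_cutoff, flagged)
```

A flagged value is only a suspected outlier; whether to keep it is a judgment about the data, not something the rule decides.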
Boxplot:
A boxplot is a graph of the five number summary.
- A central box spans the quartiles, Q1 and Q3.
- A line inside the box marks the median, M.
- Lines extend from the box out to the smallest and largest observations.
- In a modified boxplot, lines extend from the box out to the smallest
and largest observations which are NOT outliers by the 1.5(IQR) rule,
and an asterisk marks any outliers.
Example with Bob’s golf scores:
3. Standard Deviation
The variance, s², of a set of observations is:
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$$
For Bob’s last 20 golf scores:
$$s^2 = \frac{230.2}{19} = 12.11579$$
The standard deviation, s, is the square root of the variance, s².
Example: the standard deviation of Bob’s last 20 golf scores is
s = square root (variance) = square root (12.11579) = 3.48077
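The variance and standard deviation for Bob's scores can be reproduced with a few lines. This sketch writes the formula out by hand (the function name `sample_variance` is illustrative); Python's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.

```python
import math

scores = [69, 73, 77, 77, 80, 76, 75, 77, 78, 78,
          77, 81, 82, 75, 79, 76, 83, 77, 80, 84]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

variance = sample_variance(scores)
std_dev = math.sqrt(variance)
print(round(variance, 5), round(std_dev, 5))
```

Dividing by n - 1 rather than n is what makes this the sample variance, matching the formula for s².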
Characteristics of measures of spread:
Range:
- Non-resistant.
- Simple.
IQR:
- Resistant.
- Used with the median.
- IQR = 0 does not mean there is no spread.
- Usually used with nonsymmetric data.
Standard Deviation:
- Non-resistant.
- Approximately range/4.
- Used with the mean.
- Good for symmetric distributions with no outliers.
- s = 0 means there is no spread.
Best method for describing center and spread:
- The 5-Number Summary is better for describing skewed distributions
or distributions with outliers.
- The Mean and Standard Deviation are preferred for describing
reasonably symmetric distributions free of outliers.
- Again, always start with a graph of the data to evaluate skewness.
Linear Transformation of data values:
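Example: The temperature is 86 degrees Fahrenheit (F). What is the
temperature in Celsius (C)?
C = 5/9 (F - 32)
C = 5/9 (86 - 32) = 30
A linear transformation of the form x_new = a + bx multiplies measures of
center by b and then shifts them by a. Measures of spread such as the
standard deviation and the IQR are multiplied by |b| and are not affected
by a. In the temperature example, a = -160/9 and b = 5/9.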
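The effect of a linear transformation can be checked numerically. In this sketch the five Fahrenheit readings are hypothetical values chosen only for illustration; a and b come from rewriting C = 5/9 (F - 32) in the form a + bF.

```python
from statistics import mean, stdev

# Hypothetical Fahrenheit readings, chosen only for illustration.
fahrenheit = [68, 77, 86, 95, 104]

a, b = -160 / 9, 5 / 9          # C = 5/9 (F - 32) = -160/9 + (5/9) F
celsius = [a + b * f for f in fahrenheit]

mean_f, sd_f = mean(fahrenheit), stdev(fahrenheit)
mean_c, sd_c = mean(celsius), stdev(celsius)

print(mean_c, a + b * mean_f)    # center: multiplied by b, then shifted by a
print(sd_c, b * sd_f)            # spread: multiplied by b, with no shift
```

Each printed pair agrees up to floating-point rounding, which is why the comparison is best made with a small tolerance rather than exact equality.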
Mean and Standard Deviation for Grouped Data.
See my webpage for an Excel File which shows how to calculate the mean
and standard deviation from grouped data.
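The Excel file is not reproduced here. As a hedged sketch of one common approach, which may differ from the method in that file, each class can be represented by its midpoint and weighted by its share of the observations. The example uses the USGA handicap table from Section 1.1, and the standard deviation below treats the percents as proportions (no n - 1 correction), which is an assumption of this sketch.

```python
import math

# (lower limit, upper limit, percent) for each class in the USGA handicap table.
classes = [
    (0, 5, 4.8), (5, 10, 15.6), (10, 15, 26.4), (15, 20, 24.8),
    (20, 25, 15.6), (25, 30, 7.9), (30, 35, 3.2), (35, 40, 1.7),
]

def grouped_mean_sd(classes):
    """Approximate mean and standard deviation from class midpoints."""
    total = sum(pct for _, _, pct in classes)
    # Pair each class midpoint with its weight (share of all observations).
    weighted = [((lo + hi) / 2, pct / total) for lo, hi, pct in classes]
    mean = sum(mid * w for mid, w in weighted)
    variance = sum(w * (mid - mean) ** 2 for mid, w in weighted)
    return mean, math.sqrt(variance)

grouped_mean, grouped_sd = grouped_mean_sd(classes)
print(round(grouped_mean, 2), round(grouped_sd, 2))
```

Because every observation in a class is treated as if it sat at the midpoint, the results are approximations to the values that the raw data would give.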