Chapter 12
Describing Data
© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved.
Doing Exploratory Data Analysis


Use EXPLORATORY DATA ANALYSIS (EDA) to
search for patterns in your data
Before computing any inferential statistic, use
EDA to ensure that your data meet the
requirements and assumptions of the test you
are planning to use (e.g., that scores are normally distributed)

Steps involved in the EDA:
1. Organize and summarize your data on a data
coding sheet
2. If desired, organize data for computer entry
3. Graph data (bar graph, histogram, line graph, or
scatterplot) so that you can visually inspect
distributions
   This will help you choose the appropriate statistics
4. Display frequency distributions on a histogram,
and create a STEMPLOT
5. Examine your graphs for normality or skewness
in your distributions
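
As a rough illustration of the stemplot in step 4, the sketch below builds a text stem-and-leaf display. It assumes non-negative integer scores, with the tens as stems and the units as leaves; the function name `stemplot` and the scores are hypothetical, not taken from the chapter.

```python
from collections import defaultdict


def stemplot(scores):
    """Return a text stem-and-leaf display (stem = tens, leaf = units)."""
    stems = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)
        stems[stem].append(leaf)
    lines = []
    # Print every stem between the lowest and highest, including empty ones,
    # so that gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return "\n".join(lines)


scores = [67, 72, 75, 81, 84, 84, 88, 90, 93, 101]  # hypothetical scores
print(stemplot(scores))
```

Read sideways, the rows of leaves form a crude histogram, so the same display can be inspected for normality or skewness.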
Graphing Your Data

Bar Graph





Presents data as bars extending from the axis
representing the independent variable
Length of each bar determined by value of the
dependent variable
Width of each bar has no meaning
Can be used to represent data from single-factor and
two-factor designs
Best if independent variable is categorical

Line Graph




Data represented by a series of points connected by a
line
Most appropriate for quantitative independent
variables
Used to display functional relationships
Line graphs can show different shapes


Positively accelerated: Curve starts flat and becomes
progressively steeper as it moves along x-axis
Negatively accelerated: Curve is steep at first and then “levels
off” as it moves along x-axis

Once the curve levels off it is said to be asymptotic

A line graph can vary in complexity

A monotonic function represents a uniformly increasing or
decreasing function
A nonmonotonic function has reversals in direction

Scatterplot

Used to represent data from two dependent variables
The value of one dependent variable is represented on
the x-axis and the value of the other on the y-axis
Pie Chart

Used to represent proportions or percentages
The Frequency Distribution



Represents a set of mutually exclusive categories
into which actual values are classified
Can take the form of a table or a graph
Graphically, a frequency distribution is shown on
a histogram



A bar graph on which the bars touch
The y-axis represents a frequency count of the number
of observations falling into a category
Categories represented on the x-axis
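
A minimal sketch of a frequency distribution in both forms, table and graph, assuming integer scores. The function names, the `width` parameter, and the ratings are hypothetical illustrations.

```python
from collections import Counter


def frequency_distribution(scores, width=1):
    """Count the scores falling into each category of the given width.

    Categories are mutually exclusive: each score belongs to exactly one.
    """
    counts = Counter((score // width) * width for score in scores)
    low, high = min(counts), max(counts)
    # Include empty categories so the table has no gaps.
    return {start: counts.get(start, 0)
            for start in range(low, high + width, width)}


def text_histogram(freq):
    """Draw one row per category; the row length is the frequency count."""
    return "\n".join(f"{cat:>4} | {'#' * n} ({n})" for cat, n in freq.items())


ratings = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7]  # hypothetical
freq = frequency_distribution(ratings)
print(text_histogram(freq))
```

Here the histogram is drawn sideways, with categories down the page and frequency across it, but the information is the same as in the upright version.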
Histogram Showing a Normal Distribution
[Figure: histogram; x-axis: Response Categories (1 to 7), y-axis: Frequency (0 to 5)]
Histogram Showing a Positive Skew
[Figure: histogram; x-axis: Response Category (1 to 9), y-axis: Frequency (0 to 6)]
Histogram Showing a Negative Skew
[Figure: histogram; x-axis: Response Category (1 to 9), y-axis: Frequency (0 to 6)]
A Bimodal Distribution
[Figure: histogram; x-axis: Grade category (55 to 95), y-axis: Frequency (0 to 20)]
Measures of Center: Characteristics
and Applications

Mode





Most frequent score in a distribution
Simplest measure of center
Scores other than the most frequent not considered
Limited application and value
Median




Central score in an ordered distribution
More information taken into account than with the
mode
Relatively insensitive to outliers
Used primarily when the mean cannot be used

Mean



Average of all scores in a distribution
Value dependent on each score in a distribution
Most widely used and informative measure of
center
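
To make the contrast among the three measures concrete, the sketch below computes each one with Python's standard `statistics` module, before and after adding one extreme score. The scores are hypothetical.

```python
import statistics

scores = [2, 3, 3, 3, 4, 6, 7]   # hypothetical scores
with_outlier = scores + [52]     # same scores plus one extreme value

for label, data in (("original", scores), ("with outlier", with_outlier)):
    print(label,
          "mode =", statistics.mode(data),
          "median =", statistics.median(data),
          "mean =", statistics.mean(data))

# original:      mode = 3, median = 3,   mean = 4
# with outlier:  mode = 3, median = 3.5, mean = 10
```

One extreme score moves the mean from 4 to 10, while the median shifts only from 3 to 3.5 and the mode does not change. This is the sense in which the mean depends on every score and the median is relatively insensitive to outliers.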
Measures of Center: Applications

Mode


Used if data are measured along a nominal scale
Median


Used if data are measured along an ordinal scale
(the median requires scores that can be ordered, so it
does not apply to nominal data)
Used if interval data do not meet requirements for
using the mean

Mean



Used if data are measured along an interval or
ratio scale
Most sensitive measure of center
Used if scores are normally distributed
Measures of Spread: Characteristics

Range





Subtract the lowest from the highest score in a
distribution of scores
Simplest and least informative measure of spread
Scores between extremes are not taken into account
Very sensitive to extreme scores
Semi-Interquartile Range


Less sensitive than the range to extreme scores
Used when you want a simple, rough estimate of
spread

Variance


Average squared deviation of scores from the mean
Standard Deviation


Square root of the variance
Most widely used measure of spread
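
The four measures of spread can be sketched with the standard `statistics` module. Two assumptions are worth flagging. First, the variance here divides by N, matching the "average squared deviation" wording; the sample variance divides by N - 1 instead. Second, quartile conventions differ across textbooks and software, so the semi-interquartile range may not match a hand calculation exactly. The function name and scores are hypothetical.

```python
import statistics


def spread_summary(scores):
    """Return the range, semi-interquartile range, variance, and SD."""
    # quantiles() with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(scores, n=4)
    return {
        "range": max(scores) - min(scores),
        "semi_interquartile_range": (q3 - q1) / 2,
        # Population formulas: divide by N, not N - 1.
        "variance": statistics.pvariance(scores),
        "standard_deviation": statistics.pstdev(scores),
    }


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores; the mean is 5
print(spread_summary(scores))
```

For these scores the squared deviations sum to 32, so the variance is 32 / 8 = 4 and the standard deviation is 2.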
Measures of Spread:
Applications

The range and standard deviation are sensitive to
extreme scores



In such cases the semi-interquartile range is best
When your distribution of scores is skewed, the
standard deviation does not provide a good index
of spread
With a skewed distribution, use the semi-interquartile range
The Five Number Summary and
Box Plots

Five Number Summary


Convenient way to represent a distribution with a few
numbers
Statistics included





Minimum score
The first quartile
The median (second quartile)
Third quartile
Maximum score
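
A minimal sketch of the five number summary, again using the standard `statistics` module. The "inclusive" quartile method is an assumption chosen for illustration; other conventions can give slightly different quartiles. The function name and scores are hypothetical.

```python
import statistics


def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum) for a list of scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return (min(scores), q1, median, q3, max(scores))


scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical scores
print(five_number_summary(scores))     # (1, 3.0, 5.0, 7.0, 9)
```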
Example of a Five Number Summary

Maximum               132
Third Quartile (Q3)   110
Median (Q2)           101
First Quartile (Q1)    90
Minimum                67

Boxplot





Graphic representation of the five number summary
First and third quartiles define the ends of the box
A line in the box represents the median
Vertical “whiskers” extending above and below the box
represent the maximum and minimum scores
(respectively)
Data from multiple treatments are represented by side-
by-side boxplots
Example of a Boxplot
[Figure: boxplot; y-axis: IQ (0 to 150)]
The Pearson Product–Moment
Correlation (r)




Most widely used measure of correlation
Value of r can range from +1 through 0 to –1
Magnitude of r tells you the degree of LINEAR
relationship between variables
Sign of r tells you the direction (positive or
negative) of the relationship between variables



Presence of outliers can affect both the sign and
the magnitude of r
Variability of scores within a distribution
affects the value of r
Used when scores are normally distributed
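
As a sketch of how r behaves, the code below computes the Pearson correlation from deviation scores and then shows how one outlier can reverse its sign. The function name `pearson_r` and all data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between paired scores x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]                   # perfect negative linear relationship
r_clean = pearson_r(x, y)             # about -1

# One extreme pair added to otherwise identical data.
r_outlier = pearson_r(x + [20], y + [20])   # positive
print(r_clean, r_outlier)
```

The five original pairs fall on a descending straight line, yet the single added pair is enough to make r positive, which is why a scatterplot should be inspected before r is interpreted.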
Measures of Association

Pearson Product-Moment Correlation

Index of linear relationship between two continuously
measured variables

Point-Biserial Correlation

Index of correlation between two variables, one of
which is measured on a nominal scale and the other
on at least an interval scale

Spearman Rank-Order Correlation

Index of correlation between two variables measured
along an ordinal scale

Phi Coefficient

Index of correlation between two variables
measured along a nominal scale
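
Two of these indices can be sketched briefly. The Spearman coefficient is computed here as a Pearson correlation on ranks, with tied scores sharing the average of their ranks, and the phi coefficient uses the usual formula for a 2 x 2 table with cells a, b (top row) and c, d (bottom row). All function names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between paired scores x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank scores from 1 (smallest); ties get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 frequency table [[a, b], [c, d]]."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


print(ranks([10, 20, 20, 30]))                       # [1.0, 2.5, 2.5, 4.0]
print(spearman_rho([1, 2, 3, 4], [1, 4, 9, 16]))     # about 1: same ordering
print(phi_coefficient(10, 0, 0, 10))                 # 1.0
```

The second call shows that the Spearman coefficient reaches 1 for any strictly increasing relationship, even a curved one, because only the ordering of the scores is used.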
Linear Regression and
Prediction

Used to find the straight line that best fits the
data plotted on a scatterplot
The best fitting straight line is known as the least
squares regression line

The regression line is defined mathematically:

$Y = a + bX$



The regression weight (b) is based on raw scores
and is difficult to interpret
The standardized regression weight (beta weight)
is based on standard scores and is easier to
interpret
You can predict a value of Y from a value of X
once the regression equation has been calculated

The difference between a predicted and an observed value
of Y is a residual; the standard error of estimate
summarizes the typical size of these differences
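
A minimal sketch of simple linear regression, assuming one predictor X and one outcome Y. The intercept a and slope b come from the least squares formulas, and the standard error of estimate here divides the summed squared residuals by N - 2, which is one common convention (some texts divide by N). The function names and data are hypothetical.

```python
import math


def least_squares(x, y):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # regression weight (slope), in raw-score units
    a = mean_y - b * mean_x    # intercept
    return a, b


def standard_error_of_estimate(x, y, a, b):
    """Typical size of the residuals (observed Y minus predicted Y)."""
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # N - 2 in the denominator is an assumed convention, see the note above.
    return math.sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))


x = [1, 2, 3, 4, 5]   # hypothetical predictor scores
y = [2, 4, 5, 4, 5]   # hypothetical outcome scores
a, b = least_squares(x, y)                     # a is about 2.2, b about 0.6
predicted = a + b * 6                          # predicted Y for X = 6
see = standard_error_of_estimate(x, y, a, b)
print(a, b, predicted, see)
```

For these data the line is approximately Y = 2.2 + 0.6X, so the predicted value at X = 6 is about 5.8.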