1.1 Displaying Distributions with Graphs
YMS3e
Describe what is meant by exploratory data analysis.
Explain what is meant by distribution of a variable.
Differentiate between categorical variables and quantitative variables.
Construct bar graphs and pie charts for a set of categorical data.
Construct a stemplot for a set of quantitative data.
Construct a back-to-back stemplot to compare two related distributions.
Construct a stemplot using split stems.
Construct a histogram for a set of quantitative data, and discuss how changing the class width can change the impression of the data given by the histogram.
Describe the overall pattern of a distribution by its shape, center and spread.
Explain what is meant by the mode of a distribution.
Recognize and identify symmetric and skewed distributions.
Explain what is meant by an outlier in a stemplot or histogram.
Construct and interpret an ogive (relative cumulative frequency graph) from a relative frequency table.
Construct a time plot for a set of data collected over time.
Nielsen Ratings
Read the study on page 37.
What do you observe? Does one network appear to “win” the ratings race?
How can we get a better sense of which network has the best ratings?
How can Statistics help us understand these data?
Exploratory Data Analysis (EDA):
Statistical practice of analyzing distributions of data through graphical displays and numerical summaries.
Distribution:
Description of the values a variable takes on and how often the variable takes on those values.
An EDA allows us to identify patterns and departures from patterns in distributions.
EDA is the part of statistical practice concerned with reviewing, communicating, and using data where there is a low level of knowledge about its cause system.
EDA Objectives
Suggest hypotheses about the causes of observed phenomena.
Assess assumptions on which statistical inference will be based.
Support the selection of appropriate statistical tools and techniques.
Provide a basis for further data collection through surveys or experiments.
Categorical Variable:
Values are labels or categories.
Distributions list the categories and either the count or percent of individuals in each.
Displays: Bar Graphs and Pie Charts
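The distribution of a categorical variable can be tabulated in a few lines. The sketch below is only an illustration: the `responses` list is hypothetical data, not taken from the text.

```python
from collections import Counter

# Hypothetical categorical data: one label per individual.
responses = ["rock", "pop", "rock", "jazz", "pop", "rock", "country", "pop"]

# Count of individuals in each category.
counts = Counter(responses)
total = sum(counts.values())

# Percent of individuals in each category.
percents = {category: 100 * count / total for category, count in counts.items()}

for category, count in counts.most_common():
    print(f"{category:<8} count={count}  percent={percents[category]:.1f}%")
```

The counts are what a bar graph displays; the percents, which sum to 100, are what a pie chart displays.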
When describing a distribution remember your SOCS!
Shape
Outliers
Center
Spread
Look carefully at the data, searching for patterns and for values that seem to differ from the overall pattern.
Clusters
Outliers
Gaps
Quantitative Variable:
Values are numeric: arithmetic computation (average, etc.) makes sense.
Distributions list the values and number of times the variable takes on that value.
Displays:
Dotplots
Stemplots
Histograms
Boxplots
Only organized data can illuminate!
Your goal is to make neat, organized, labeled graphs that display the distribution of data effectively and provide an insight into patterns and departures from patterns.
Small datasets with a small range (max - min) can be easily displayed using a dotplot.
Draw and label a number line from min to max.
Place one dot per observation above its value.
Stack multiple observations evenly.
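The three steps above can be sketched as a text dotplot. This is a minimal illustration, assuming integer-valued observations; the number line is rotated to run down the page, and the `data` list is hypothetical.

```python
from collections import Counter

def dotplot(values):
    """Return a text dotplot: one row per integer from min to max,
    with one dot per observation stacked beside its value."""
    counts = Counter(values)
    lines = []
    for v in range(min(values), max(values) + 1):
        # Values with no observations still get a row, so gaps stay visible.
        lines.append((f"{v:>3} | " + "." * counts[v]).rstrip())
    return "\n".join(lines)

# Hypothetical small dataset with a small range.
data = [3, 5, 5, 6, 6, 6, 7, 9]
print(dotplot(data))
```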
A stemplot gives a quick picture of the shape of a distribution while including the numerical values.
Separate each observation into a stem and a leaf, e.g. 14 g -> 1|4, 256 -> 25|6, 32.9 oz -> 32|9.
Write stems in a vertical column and draw a vertical line to the right of the column.
Write each leaf to the right of its stem.
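The stem-and-leaf steps above can be sketched in code. This is a minimal version, assuming non-negative integers whose last digit is the leaf; the scores are the ten test scores used later in these notes.

```python
from collections import defaultdict

def stemplot(values):
    """Text stemplot for non-negative integers:
    the last digit is the leaf, the remaining digits are the stem."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # List every stem from smallest to largest, including empty ones,
    # so that gaps in the distribution remain visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return "\n".join(lines)

scores = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]
print(stemplot(scores))
```

Sorting before splitting keeps the leaves in increasing order on each stem.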
Example 1.4, pages 42-43: Literacy Rates in Islamic Nations
Note: Stemplots do not work well for large data sets
Back-to-Back Stemplots: Compare datasets
Splitting Stems:
Double the number of stems, writing leaves 0-4 after the first and leaves 5-9 after the second.
Or split each stem into five (leaves 0-1, 2-3, 4-5, 6-7, 8-9).
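Splitting each stem in two can be sketched as follows. This is an illustrative version only, assuming non-negative integers; `split_stemplot` is a hypothetical helper name, and the scores are the ten test scores used later in these notes.

```python
def split_stemplot(values):
    """Stemplot with each stem split in two:
    leaves 0-4 go on the first row, leaves 5-9 on the second."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = 0 if leaf <= 4 else 1
        rows.setdefault((stem, half), []).append(leaf)
    first, last = min(rows), max(rows)
    lines = []
    for stem in range(first[0], last[0] + 1):
        for half in (0, 1):
            if (stem, half) < first or (stem, half) > last:
                continue  # skip empty rows outside the range of the data
            row = "".join(str(leaf) for leaf in rows.get((stem, half), []))
            lines.append(f"{stem:>2} | {row}".rstrip())
    return "\n".join(lines)

scores = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]
print(split_stemplot(scores))
```

With only ten observations the split spreads the data thin; splitting tends to help most when the unsplit stems are crowded.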
Example 1.5, pages 42-43: Virginia College Tuition
Page 47 # 1.3: Cheese and Chemistry
As cheddar cheese matures, a variety of chemical processes take place. The taste of mature cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the Latrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition.
The final concentrations of lactic acid in the 30 samples, as multiples of their initial concentrations, are given in the table.
A dotplot and a stemplot from the Minitab statistical software package are shown.
Which plot does a better job of summarizing the data? Explain why.
What do the numbers in the left column in the stemplot tell us? How does Minitab identify the row that contains the center of the distribution?
The final concentration of lactic acid in one of the samples stayed the same (as its initial concentration). Identify the sample in both plots.
Histograms break the range of data values into classes and display the count or percent of observations that fall into each class.
Divide the range of data into equal-width classes.
Count the observations in each class: the “frequency”.
Draw bars to represent classes - height = frequency
Bars should touch (unlike bar graphs).
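The class-counting step can be sketched in code. This is a minimal illustration, assuming every value is at least `start` and that each class contains its left endpoint but not its right; `class_counts` is a hypothetical helper, and the scores are the ten test scores used later in these notes.

```python
def class_counts(values, start, width):
    """Count observations in equal-width classes
    [start, start + width), [start + width, start + 2 * width), ..."""
    n_classes = int((max(values) - start) // width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[int((v - start) // width)] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

scores = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]

# The same data with two different class widths.
for width in (10, 20):
    print(f"class width {width}")
    for low, high, count in class_counts(scores, 40, width):
        print(f"  {low:>3}-{high:<3} {'*' * count}")
```

With width 10 the class from 50 to 60 is empty and the low score stands apart; with width 20 that gap disappears, which is one way a change of class width alters the impression a histogram gives.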
Example 1.6, page 49: IQ Scores for 5th Graders
Describe the SOCS
What do these data suggest?
Page 57 #1.11: Presidential ages at inauguration
The table gives the ages of all U.S. presidents when they took office.
Make a histogram of the ages of the presidents at inauguration. Use class intervals of 40-44, 45-49, and so on. Each interval should contain the left-hand endpoint but not the right-hand endpoint.
Describe the shape, center and spread of the distribution.
Who was the youngest president? Who was the oldest?
Was Bill Clinton, at age 46, unusually young?
Be sure to label carefully any required graphs.
This means your axes should be labeled and your scales should be made clear.
“Describe” means to discuss shape, center and spread!
The purpose of an Exploratory Data Analysis is to organize data and identify patterns/departures.
PLOT YOUR DATA - Choose an appropriate graph
Look for overall pattern and departures from pattern
Shape {mound, bimodal, skewed, uniform}
Outliers {points clearly away from body of data}
Center {What number “typifies” the data?}
Spread {How “variable” are the data values?}
Outliers need to be looked at carefully.
Is it “bad data” that can be thrown out?
Is there a reason for that particular value to occur?
Modes
Peaks in the graph.
A distribution can be unimodal (1 peak), bimodal (2 peaks), etc.
Symmetric
The values above and below the midpoint are mirror images of each other.
Skewed
Skewed right means the tail is pulled to the right, skewed left means the tail is pulled to the left.
Relative frequency refers to the proportion of values that fall into a certain class.
Cumulative frequency refers to the number of values that fall into a class and into all classes below it.
Relative Cumulative Frequency refers to the proportion of values that fall into a class and into all classes before it.
These are graphed with an Ogive.
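The three definitions above can be computed directly from a frequency table. The sketch below assumes class counts for the ten test scores used later in these notes, with classes of width 10 starting at 40; the variable names are illustrative.

```python
from itertools import accumulate

# Classes 40-50, 50-60, ..., 90-100 (left endpoint included) and their counts.
class_starts = [40, 50, 60, 70, 80, 90]
frequencies = [1, 0, 1, 3, 2, 3]

total = sum(frequencies)

# Proportion of values in each class.
relative = [f / total for f in frequencies]
# Number of values in each class and all classes below it.
cumulative = list(accumulate(frequencies))
# Proportion of values in each class and all classes below it.
relative_cumulative = [c / total for c in cumulative]

for start, f, rf, cf, rcf in zip(class_starts, frequencies, relative,
                                 cumulative, relative_cumulative):
    print(f"{start}-{start + 10}: freq={f} rel={rf:.2f} cum={cf} rel_cum={rcf:.2f}")
```

An ogive plots each relative cumulative frequency above the upper boundary of its class, starting from 0 at the lower boundary of the first class and ending at 1 (100%).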
Pages 60-61, Example 1.9: Presidents
Page 64 1.14: Glucose Levels
People with diabetes must monitor and control their blood glucose level. The goal is to maintain “fasting plasma glucose” between about 90 and 130 milligrams per deciliter (mg/dl) of blood.
Here are the fasting plasma glucose levels for 18 diabetics enrolled in a diabetes control class, five months after the end of the class.
Make a stemplot of these data and describe the main features of the distribution. (You will want to round and split stems.) Are there outliers? How well does this group do as a whole achieving the goal for controlling glucose levels?
Construct a relative cumulative frequency graph (ogive) for these data.
Use your graph to answer the following questions.
What percent of blood glucose levels were between 90 and 130?
What is the center of the distribution?
What relative cumulative frequency is associated with a blood glucose level of 130?
A timeplot of a variable plots each observation against the time at which it was measured.
Time is on the horizontal scale
The variable you are measuring goes on the vertical scale.
Connecting the points emphasizes change over time.
1.2 Describing Distributions with Numbers
YMS3e
Given a data set, compute the mean and median as measures of the center.
Explain what is meant by a resistant measure.
Identify situations in which the mean is the most appropriate measure of center and situations in which the median is the most appropriate measure.
Given a data set, find the quartiles.
Given a data set, find the five-number summary.
Use the five-number summary of a data set to construct a boxplot for the data.
Compute the interquartile range (IQR) of a data set.
Given a data set, use the 1.5xIQR rule to identify outliers.
Given a data set, compute the standard deviation and variance as measures of spread.
Give two reasons why we use squared deviations rather than just average deviations from the mean.
Explain what is meant by degrees of freedom.
Identify situations in which the standard deviation is the most appropriate measure of spread and situations in which the interquartile range is the most appropriate measure.
Explain the effect of a linear transformation of a data set on the mean, median and standard deviation of the set.
Use numerical and graphical techniques to compare two or more data sets.
Consider the following test scores for a small class:
75 76 82 93 45 68 74 82 91 98
Plot the data and describe the SOCS:
Shape?
Outliers?
Center?
Spread?
What number best describes the “center”?
What number best describes the “spread”?
A numerical description of a distribution begins with a measure of its “center”.
If you could summarize the data with one number, what would it be?
Mean: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum x_i}{n}$
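Median: Q2 or M. The “middle” value of a dataset.
Arrange observations in order from min to max.
If n is odd, the median is the middle observation; if n is even, it is the average of the two middle observations.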
The mean and the median are the most common measures of center.
If a distribution is perfectly symmetric, the mean and the median are the same.
The mean is not resistant to outliers .
You must decide which number is the most appropriate description of the center...
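A short sketch can make the difference in resistance concrete. It uses the ten test scores from these notes; the altered list, in which the low score 45 is replaced by 5, is a hypothetical change made only for illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by n."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered data;
    the average of the two middle values when n is even."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

scores = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]
print("original:", mean(scores), median(scores))

# Hypothetical change: make the lowest score far more extreme.
altered = [5 if x == 45 else x for x in scores]
print("altered: ", mean(altered), median(altered))
```

The mean drops by 4 points when the one score changes, while the median stays at 79.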
MeanMedian Applet
Variability is the key to Statistics. Without variability, there would be no need for the subject.
When describing data, never rely on center alone.
Measures of Spread:
Range - {rarely used...why?}
Quartiles - InterQuartile Range {IQR=Q3-Q1}
Variance and Standard Deviation {var and $s_x$}
Like Measures of Center, you must choose the most appropriate measure of spread.
Quartiles Q1 and Q3 represent the 25th and 75th percentiles.
To find them, order data from min to max.
Determine the median - average if necessary.
The first quartile is the middle of the ‘bottom half’.
The third quartile is the middle of the ‘top half’.
19 22 23 23 23 26 26 27 28 29 30 31 32
Q1=23 med=26 Q3=29.5
45 68 74 75 76 82 82 91 93 98
Q1=74 med=79 Q3=91
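The quartile procedure above can be sketched as follows, applied to the two datasets just shown. It follows the convention in these notes (the median itself is left out of both halves when n is odd); statistical software may use slightly different quartile conventions, so results from a package can differ.

```python
def median(sorted_values):
    """Median of data that is already in order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the 'middle of each half' rule."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]         # observations below the median's position
    upper = data[(n + 1) // 2:]    # observations above the median's position
    return median(lower), median(data), median(upper)

first = [19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 30, 31, 32]
scores = [45, 68, 74, 75, 76, 82, 82, 91, 93, 98]

print(quartiles(first))
print(quartiles(scores))
```

Adding the minimum and maximum to these three values gives the five-number summary.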
The 5 Number Summary provides a reasonably complete description of the center and spread of a distribution.
MIN Q1 MED Q3 MAX
We can visualize the 5 Number Summary with a boxplot.
min=45 Q1=74 med=79 Q3=91 max=98
[Boxplot of Quiz Scores, axis from 45 to 100]
Outlier?
InterQuartile Range “IQR”: Distance between Q1 and Q3. Resistant measure of spread...only measures middle 50% of data.
IQR = Q3 - Q1 {width of the “box” in a boxplot}
1.5 IQR Rule: If an observation falls more than 1.5 IQRs above Q3 or below Q1, it is an outlier.
Why 1.5? According to John Tukey, 1 IQR seemed like too little and 2 IQRs seemed like too much...
Find 5 Number Summary
Determine IQR
Multiply 1.5xIQR
Set up “fences” Q1-(1.5IQR) and Q3+(1.5IQR)
Observations “outside” the fences are outliers.
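The fence procedure above can be sketched in a few lines. The example applies it to the quiz scores from these notes, taking Q1 = 74 and Q3 = 91 from their five-number summary; `iqr_outliers` is a hypothetical helper name.

```python
def iqr_outliers(values, q1, q3):
    """Apply the 1.5 x IQR rule: return both fences and any observations outside them."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

scores = [45, 68, 74, 75, 76, 82, 82, 91, 93, 98]
low, high, flagged = iqr_outliers(scores, q1=74, q3=91)
print(f"fences: {low} and {high}; outliers: {flagged}")
```

Here IQR = 17, so the fences sit at 48.5 and 116.5, and the score of 45 falls below the lower fence.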
All data on p. 48.
IQR = 45.72 - 19.06 = 26.66
1.5IQR = 1.5(26.66) = 39.99
Lower fence: 19.06 - 39.99 = -20.93
Upper fence: 45.72 + 39.99 = 85.71
[Boxplot of Spending ($), axis from 0 to 100, with outliers marked]
Another common measure of spread is the Standard Deviation: a measure of the “average” deviation of all observations from the mean.
To calculate Standard Deviation:
Calculate the mean.
Determine each observation’s deviation (x - xbar).
Square each deviation.
“Average” the squared deviations by dividing the total squared deviation by (n-1).
This quantity is the Variance.
Take the square root of the result to determine the Standard Deviation.
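Variance: $\text{var} = \dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard Deviation: $s_x = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$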
Example 1.16 (p.85): Metabolic Rates
1792 1666 1362 1614 1460 1867 1439
Metabolic Rates: mean = 1600

x          x - xbar    (x - xbar)^2
1792          192         36864
1666           66          4356
1362         -238         56644
1614           14           196
1460         -140         19600
1867          267         71289
1439         -161         25921
Totals:         0        214870

Total Squared Deviation = 214870
Variance: var = 214870/6 = 35811.67
Standard Deviation: s = √35811.67 = 189.24 cal
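The arithmetic in this example can be checked with a short sketch that follows the same steps: mean, deviations, squared deviations, division by n - 1, then square root. The variable names are illustrative.

```python
from math import sqrt

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
xbar = sum(rates) / n                     # mean
deviations = [x - xbar for x in rates]    # x - xbar for each observation
squared = [d ** 2 for d in deviations]    # squared deviations

variance = sum(squared) / (n - 1)         # divide by n - 1, not n
s = sqrt(variance)                        # standard deviation

print(f"mean = {xbar}")
print(f"sum of deviations = {sum(deviations)}")
print(f"total squared deviation = {sum(squared)}")
print(f"variance = {variance:.2f}, s = {s:.2f}")
```

The deviations always sum to zero, which is one reason they are squared before averaging.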
What does this value, s, mean?
Variables can be measured in different units (feet vs meters, pounds vs kilograms, etc.).
When converting units, the measures of center and spread will change.
Linear Transformations ($x_{new} = a + bx$) do not change the shape of a distribution.
Multiplying each observation by a positive number b multiplies both the measures of center and the measures of spread by b. (If b is negative, spread is multiplied by |b|.)
Adding a to each observation adds a to the measure of center, but does not affect spread.
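The effect of a linear transformation can be checked numerically. The sketch below uses the ten test scores from these notes; the constants `a = 5` and `b = 2` are hypothetical values chosen only for illustration.

```python
from statistics import mean, median, stdev

scores = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]
a, b = 5, 2  # hypothetical constants for x_new = a + b * x

transformed = [a + b * x for x in scores]

# Center: multiplied by b, then shifted by a.
print("mean:  ", mean(scores), "->", mean(transformed))
print("median:", median(scores), "->", median(transformed))

# Spread: multiplied by |b|; adding a has no effect.
print("stdev: ", round(stdev(scores), 2), "->", round(stdev(transformed), 2))
```

The new mean and median equal a + b times the old ones, while the new standard deviation is exactly twice the old one.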
To answer a statistical question of interest:
Data: Organize and Examine
Who are the individuals described?
What are the variables?
Why were the data gathered?
When, Where, How, By Whom were data gathered?
Graph: Construct an appropriate graphical display
Describe SOCS
Numerical Summary: Calculate appropriate center and spread (mean and s or 5 number summary)
Interpretation: Answer question in context!
Data Analysis is the art of describing data in context using graphs and numerical summaries. The purpose is to describe the most important features of a dataset.