Chapter 2: Section 2.1 (Raw Data)
Raw Data: Numbers and Category Labels that have been collected but have not yet been processed in any way.
Ex. Answers to Survey Questions
1. How many hours do you work per week?
Raw Data: 15 Hours.
2. What is your height in inches?
Raw Data: 65 Inches.
Sample Data: Measurements taken from a subset of a population.
Population Data: When all individuals in a population are measured.
Statistic: A summary measure computed from sample data.
Parameter: A summary measure computed from population data.
Descriptive Statistics: A summary measure for either a population or a sample.
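The difference between a statistic and a parameter can be illustrated with a minimal Python sketch. The heights below are hypothetical values invented for illustration, and the mean is used only as one example of a summary measure.

```python
from statistics import mean

# Hypothetical population: heights (inches) of all 8 members of a small club.
population = [60, 62, 65, 67, 68, 70, 72, 72]

# A sample is a subset of the population (here, every other member).
sample = population[::2]

parameter = mean(population)  # summary measure computed from population data
statistic = mean(sample)      # summary measure computed from sample data

print(f"Parameter (population mean): {parameter}")
print(f"Statistic (sample mean):     {statistic}")
```

The same function produces both numbers; what makes one a parameter and the other a statistic is only whether the data cover the whole population or a subset of it.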
Chapter 2: Section 2.2 (Types of Data)
Variable: A characteristic that differs from one individual to the next.
Categorical Variable: A non-numerical characteristic. Examples include a person’s sex, place of birth, eye color, or salad preference.
Each individual falls into only one category. Most prevalent summaries of categorical variables include how many individuals and what percent of the group fall into each category.
Quantitative Variables: Raw data recorded as a numerical value, either a measurement or a count, taken on each individual. Examples include a person’s height, weight, number of hours worked each week, number of credit hours during the semester, and annual salary. Raw data such as Social Security numbers and student I.D. numbers are NOT quantitative variables. Also known as Measurement Variables and Numerical Variables.
Continuous Variable: When every value within some interval is a possible result. Example, the weight of males in this class.
Ordinal Variable: When Categorical Variables have ordered categories. Example, “On a scale of 1 to 10 where 1=most polite and 10=rude, how polite were the children”.
Explanatory Variable: The variable which is thought to partially explain another variable.
Response Variable: The variable which is partially explained by the explanatory variable.
Example: When examining the relationship between number of hours spent studying and exam scores, the time spent studying is the explanatory variable and the exam score is the response variable.
Chapter 2: Section 2.3 (Summarizing One or Two Categorical Variables)
Count the number of individuals that fall into each category.
Calculate the percent in each category
Raw Data for the eye color of 35 students in MA 2830 from Spring 2005.
Blue, Blue, Brown, Hazel, Green, Blue, Blue, Brown,
Blue, Brown, Hazel, Blue, Green, Green, Brown,
Blue, Blue, Blue, Brown, Brown, Blue, Brown,
Green, Blue, Blue, Blue, Brown, Brown, Hazel, Blue,
Brown, Brown, Blue, Brown, Blue
Numerical Summary of The Categorical Data
Eye Color   Frequency   Percent
Blue        16          46%
Brown       12          34%
Green       4           11%
Hazel       3           9%
Total       35          100%
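The counting and percent steps can be sketched in Python using the 35 raw eye-color values listed above. This is one possible approach rather than a prescribed method; the variable names are illustrative, and percents are rounded to the nearest whole number.

```python
from collections import Counter

raw = """
Blue, Blue, Brown, Hazel, Green, Blue, Blue, Brown,
Blue, Brown, Hazel, Blue, Green, Green, Brown,
Blue, Blue, Blue, Brown, Brown, Blue, Brown,
Green, Blue, Blue, Blue, Brown, Brown, Hazel, Blue,
Brown, Brown, Blue, Brown, Blue
"""

# Split the raw text on commas and drop surrounding whitespace.
colors = [c.strip() for c in raw.split(",") if c.strip()]

counts = Counter(colors)   # frequency of each category
total = len(colors)        # number of individuals
percents = {color: round(100 * n / total) for color, n in counts.items()}

for color in sorted(counts):
    print(f"{color:<6} {counts[color]:>3} {percents[color]:>3}%")
print(f"{'Total':<6} {total:>3}")
```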
The Same Methods Apply for Two Categorical Variables
When Summarizing Two Categorical Variables
Look for Outcome or Response Variables
Example 2.2 Pg 19
                 S Picked     Q Picked     Total
S Listed First   61 (66%)     31 (34%)     92
Q Listed First   45 (46%)     53 (54%)     98
Total            106 (56%)    84 (44%)     190
Explanatory Variable: The Question Form the Student Received
Response Variable: Letter Chosen (S or Q)
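The percents in Example 2.2 are row percents: each count is divided by the total for its question form (the explanatory variable). A minimal Python sketch of that calculation follows; the dictionary layout and the helper name `row_percents` are illustrative choices, not part of the example.

```python
# Counts from Example 2.2.
# Rows: question form received (explanatory variable).
# Columns: letter picked (response variable).
counts = {
    "S Listed First": {"S Picked": 61, "Q Picked": 31},
    "Q Listed First": {"S Picked": 45, "Q Picked": 53},
}

def row_percents(row):
    """Percent of the row total that falls in each response category."""
    row_total = sum(row.values())
    return {label: round(100 * n / row_total) for label, n in row.items()}

percents = {form: row_percents(row) for form, row in counts.items()}
row_totals = {form: sum(row.values()) for form, row in counts.items()}
grand_total = sum(row_totals.values())

for form, row in percents.items():
    print(form, row, "n =", row_totals[form])
print("Grand total:", grand_total)
```

Comparing the two rows of percents shows how the response differs between the two question forms.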
Visual Summaries for Categorical Variables
Pie Charts: Used for Single Categorical Variables with few categories
Bar Graphs: Can be used for summarizing one or two categorical variables and making comparisons.
[Figure: Pie chart “Favorite Pet”: Dog, 35, 47%; Cat, 22, 29%; Fish, 10, 13%; Other, 8, 11%]
[Figure: Two bar graphs “Favorite Pet” by Pet (Dog, Cat, Fish, Other), one showing counts and one showing percents (47%, 29%, 13%, 11%)]
[Figure: Two bar graphs “Does Order Matter” by Letter Listed First, one showing counts (S Listed First: S Picked 61, Q Picked 31; Q Listed First: S Picked 45, Q Picked 53) and one showing percents (S Listed First: S Picked 66%, Q Picked 34%; Q Listed First: S Picked 46%, Q Picked 54%)]
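As a rough illustration of a bar graph for one categorical variable, the Favorite Pet counts (Dog 35, Cat 22, Fish 10, Other 8) can be drawn as a text chart. This is only a sketch: the helper name `text_bar_chart` is hypothetical, and a plotting library would normally be used instead of `#` characters.

```python
def text_bar_chart(counts):
    """Return one line per category: a bar of '#' marks, the count, and the percent."""
    total = sum(counts.values())
    width = max(len(name) for name in counts)
    lines = []
    for name, count in counts.items():
        percent = round(100 * count / total)
        lines.append(f"{name:<{width}} | {'#' * count} {count} ({percent}%)")
    return lines

favorite_pet = {"Dog": 35, "Cat": 22, "Fish": 10, "Other": 8}
chart = text_bar_chart(favorite_pet)
print("\n".join(chart))
```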
Chapter 2: Section 2.4 (Quantitative Data)
Median 50
Quartiles 45 57
Extremes 32 74
Summary Features for Quantitative Variables
Location (Center, Average): The Median Is an Indication of the Center of the Data
Spread (Variability): The Difference between the two extremes and the difference between the two quartiles tell us about the spread of the data.
Shape: Cannot be determined from a 5-Number Summary
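The two spread measures can be worked out directly from the 5-number summary at the start of this section (median 50, quartiles 45 and 57, extremes 32 and 74). In the short Python sketch below, the names `data_range` and `interquartile_range` are the usual terms for these two differences, although the notes above do not name them.

```python
# Five-number summary from the start of this section.
low, q1, median, q3, high = 32, 45, 50, 57, 74

data_range = high - low          # difference between the two extremes
interquartile_range = q3 - q1    # difference between the two quartiles

print(f"Center (median): {median}")
print(f"Range: {data_range}")
print(f"Interquartile range: {interquartile_range}")
```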
Outliers
A Data Point Not Consistent with the Bulk of the Data
An Unusually High or Low Value (Michael Jordan’s Salary compared to Other Geography Majors from UNC-Chapel Hill)
Cannot Usually be determined from a 5-Number Summary
3 Common Reasons for Outliers
1. A mistake was made while taking a measurement or entering it into the computer. Either correct the mistake, if possible, or remove the outlier.
2. The individual in question belongs to a different group than the bulk of individuals measured. Inclusion or removal should be based upon the question of interest.
3. The outlier is a legitimate data value and represents natural variability for the group and variable(s) measured. Do not discard.
Chapter 2: Section 2.5 (Pictures for Quantitative Data)
“You Can Observe a Lot By Watching” Yogi Berra
When displaying quantitative data strive for graphics that allow the reader to assess the center, spread, shape, and outliers of the data.
Shape: The Shape of a dataset is either symmetric or skewed.
For a symmetric data set the Mean and Median usually overlap. They will also overlap the Mode if it is unique.
“Skewed to the Left” - Values trail off to the left
Usually the mean is less than the median when data is skewed to the left.
“Skewed to the Right” - Values trail off to the right
Usually the mean is greater than the median when the data is skewed to the right.
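The relationship between the mean and the median under skew can be checked on small datasets. The three lists below are hypothetical values made up for illustration; they show the usual pattern, not a rule that holds for every dataset.

```python
from statistics import mean, median

# Hypothetical datasets, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]        # one large value trails off to the right
left_skewed = [1, 17, 18, 18, 18, 19, 19, 20]   # one small value trails off to the left

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(f"{name:<13} mean = {mean(data):<6} median = {median(data)}")
```

In the skewed lists, the single extreme value pulls the mean toward the tail while the median stays near the bulk of the data.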
Boxplot or Box-and-Whisker Plot
Visually displays information obtained from the 5-number summary.
Cotinine Levels of 40 Smokers
0    87   173  253
1    103  173  265
1    112  198  266
3    121  208  277
17   123  210  284
32   130  222  289
35   131  227  290
44   149  234  313
48   164  245  477
86   167  250  491
Median 170
Quartiles 86.5 251.5
Extremes 0 491
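A 5-number summary for the 40 cotinine values can be computed with a short Python sketch. It assumes the quartiles are the medians of the lower and upper halves of the sorted data; other textbooks and software use slightly different quartile conventions. For the 40 values as listed, this convention gives a median of 170 and quartiles of 86.5 and 251.5.

```python
from statistics import median

cotinine = [
    0, 1, 1, 3, 17, 32, 35, 44, 48, 86,
    87, 103, 112, 121, 123, 130, 131, 149, 164, 167,
    173, 173, 198, 208, 210, 222, 227, 234, 245, 250,
    253, 265, 266, 277, 284, 289, 290, 313, 477, 491,
]

def five_number_summary(values):
    """Return (low, lower quartile, median, upper quartile, high).

    Assumption: quartiles are medians of the lower and upper halves;
    with an odd number of values the middle value is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

summary = five_number_summary(cotinine)
print(summary)
```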
Boxplots Also Illustrate the Shape of the Data
[Figure: Three boxplots labeled Skewed Left, Skewed Right, and Symmetric]
Dotplots: Before computers could draw decent graphs, people used dot plots as a way of describing numerical data.
The range of values in the dataset is marked on an x-axis, and a dot is placed above the relevant point on the axis for each value in the dataset. If two or more observations have the same value, the dots are stacked on top of each other.
Example
Draw a Dot Plot for the following dataset
50 35 70 55 50 30 40 65 50 75 60 45 35 75 60
55 55 50 40 55 50
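One way to sketch this dot plot is as text, stacking one `*` per observation above each axis value. The helper name `dot_plot` and its `step` parameter are hypothetical, and the sketch assumes every value lies on an evenly spaced grid (multiples of 5 here), which holds for this dataset.

```python
from collections import Counter

data = [50, 35, 70, 55, 50, 30, 40, 65, 50, 75, 60, 45, 35, 75, 60,
        55, 55, 50, 40, 55, 50]

def dot_plot(values, step=5):
    """Text dot plot. Assumes all values fall on a grid spaced `step` apart,
    starting at the smallest value; off-grid values would not be drawn."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1, step)
    rows = []
    # Build rows from the tallest stack down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(" *  " if counts[v] >= level else "    " for v in axis)
        rows.append(row.rstrip())
    rows.append("".join(f"{v:<4}" for v in axis).rstrip())
    return "\n".join(rows)

plot = dot_plot(data)
print(plot)
```

The tallest stack sits over 50, which occurs five times in the dataset.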
Stem and Leaf Diagrams
Stem and Leaf Diagrams are graphical ways to display a group of integers in a dataset.
Steps for Constructing a Stem and Leaf Diagram
1. Select one or more of the leading digits to be the Stem values; the remaining digits become the Leaves.
2. List possible Stem values in a column.
3. Record the Leaf for every observation beside the corresponding Stem value.
4. Indicate on the display what units are used for the Stems and Leaves.
Example
Measurements are taken of the molar polarization of gaseous water at 100 kPa. Some of these measurements are given below in units of cm³ mol⁻¹.
71 52 52 75 64 60 48 56 67 29 11
53 25 46 58 46 49 62 66 40 19 54
57 54 60 19 59 43 51 40 21 45 46
62 73 59 36 45 55 46 45 32 55 46
51 46 65 49 61 40
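The four construction steps can be sketched in Python for the 50 measurements above. This sketch assumes two-digit whole numbers, with the tens digit as the stem and the ones digit as the leaf; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

measurements = [
    71, 52, 52, 75, 64, 60, 48, 56, 67, 29, 11,
    53, 25, 46, 58, 46, 49, 62, 66, 40, 19, 54,
    57, 54, 60, 19, 59, 43, 51, 40, 21, 45, 46,
    62, 73, 59, 36, 45, 55, 46, 45, 32, 55, 46,
    51, 46, 65, 49, 61, 40,
]

def stem_and_leaf(values):
    """Assumes whole numbers: stem = tens digit(s), leaf = ones digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)      # step 1: split into stem and leaf
        leaves[stem].append(leaf)       # step 3: record each leaf by its stem
    # Step 2: list every possible stem, including any with no leaves.
    return [f"{stem} | {''.join(str(leaf) for leaf in leaves[stem])}"
            for stem in range(min(leaves), max(leaves) + 1)]

diagram = stem_and_leaf(measurements)
print("\n".join(diagram))
print("Key: 4 | 6 means 46 (stem = tens, leaf = ones)")  # step 4: state the units
```

Sorting the values first puts the leaves in increasing order within each stem, which makes the shape of the data easier to read.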
Histogram Slides Go Here
Chapter 2: Section 2.6 (Numerical Summaries of Quantitative Variables)
n = The Number of Individuals in a dataset
x_1, x_2, x_3, ..., x_n represent the individual raw data values
Example: A dataset consists of the height (in inches) of all individuals in MA 7280. The values are 74, 70, 62, 68, 69, 71, 65, 66. n=8 x