Chapter 2: Section 2

Chapter 2: Section 2.1 (Raw Data)

Raw Data: Numbers and Category Labels that have been collected but have not yet been processed in any way.

Ex. Answers to Survey Questions

1. How many hours do you work per week?

Raw Data: 15 Hours.

2. What is your height in inches?

Raw Data: 65 Inches.

Sample Data: Measurements taken from a subset of a population.

Population Data: Measurements taken on every individual in a population.

Statistic: A summary measure computed from sample data.

Parameter: A summary measure computed from population data.

Descriptive Statistics: Summary measures computed for either a population or a sample.

Chapter 2: Section 2.2 (Types of Data)

Variable: A characteristic that differs from one individual to the next.

Categorical Variable: A non-numerical characteristic. Examples include a person’s sex, place of birth, eye color, or salad preference.

Each individual falls into only one category. Most prevalent summaries of categorical variables include how many individuals and what percent of the group fall into each category.

Quantitative Variable: Raw data recorded as a numerical value, either a measurement or a count, taken on each individual. Examples include a person’s height, weight, number of hours worked each week, number of credit hours during the semester, and annual salary. Numeric labels such as Social Security numbers and student I.D. numbers are NOT quantitative variables. Also known as a Measurement Variable or Numerical Variable.

Continuous Variable: A quantitative variable for which every value within some interval is a possible result. Example: the weight of males in this class.

Ordinal Variable: A categorical variable whose categories are ordered. Example: “On a scale of 1 to 10, where 1 = most polite and 10 = rude, how polite were the children?”

Explanatory Variable: The variable which is thought to partially explain another variable.

Response Variable: The variable which is partially explained by the explanatory variable.

Example: When examining the relationship between the number of hours spent studying and exam scores, the time spent studying is the explanatory variable and the exam score is the response variable.

Chapter 2: Section 2.3 (Summarizing One or Two Categorical Variables)

Count the number of individuals that fall into each category.

Calculate the percent in each category.

Raw Data for the eye color of 35 students in MA 2830 from Spring 2005:

Blue, Blue, Brown, Hazel, Green, Blue, Blue, Brown,

Blue, Brown, Hazel, Blue, Green, Green, Brown,

Blue, Blue, Blue, Brown, Brown, Blue, Brown,

Green, Blue, Blue, Blue, Brown, Brown, Hazel, Blue,

Brown, Brown, Blue, Brown, Blue

Numerical Summary of the Categorical Data

Eye Color    Frequency    Percent
Blue         16           46%
Brown        12           34%
Green        4            11%
Hazel        3            9%
Total        35           100%
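The counting and percent steps above can be sketched in Python. This is a minimal illustration, not part of the original notes: the helper name `summarize` is an assumption, and percents are rounded to whole numbers to match the table.

```python
from collections import Counter

# Raw eye-color data for the 35 students, as listed above.
eye_colors = (
    "Blue, Blue, Brown, Hazel, Green, Blue, Blue, Brown, "
    "Blue, Brown, Hazel, Blue, Green, Green, Brown, "
    "Blue, Blue, Blue, Brown, Brown, Blue, Brown, "
    "Green, Blue, Blue, Blue, Brown, Brown, Hazel, Blue, "
    "Brown, Brown, Blue, Brown, Blue"
).split(", ")

def summarize(values):
    """Return {category: (frequency, percent rounded to a whole number)}."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, round(100 * n / total)) for cat, n in sorted(counts.items())}

summary = summarize(eye_colors)
for category, (freq, pct) in summary.items():
    print(f"{category:<6} {freq:>3} {pct:>4}%")
```

Rounded percents do not always add up to exactly 100%, although they happen to do so for this data.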

The Same Methods Apply for Two Categorical Variables

When Summarizing Two Categorical Variables, Look for Outcome or Response Variables

Example 2.2 Pg 19

                  S Picked     Q Picked     Total
S Listed First    61 (66%)     31 (34%)     92
Q Listed First    45 (46%)     53 (54%)     98
Total             106 (56%)    84 (44%)     190

Explanatory Variable: The Question Form the Student Received

Response Variable: Letter Chosen (S or Q)
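The percents in Example 2.2 are row percents: each count is divided by the total for its question form, so the response variable can be compared across the two values of the explanatory variable. A minimal Python sketch, in which the function name `row_percents` and the dictionary layout are illustrative assumptions:

```python
# Counts from Example 2.2: rows are the question form (explanatory variable),
# columns are the letter picked (response variable).
counts = {
    "S Listed First": {"S Picked": 61, "Q Picked": 31},
    "Q Listed First": {"S Picked": 45, "Q Picked": 53},
}

def row_percents(table):
    """Convert each row of counts to whole-number percents of that row's total."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {col: round(100 * n / row_total) for col, n in row.items()}
    return result

percents = row_percents(counts)
for row_label, row in percents.items():
    print(row_label, row)
```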

Visual Summaries for Categorical Variables

Pie Charts: Used for single categorical variables with few categories.

Bar Graphs: Can be used for summarizing one or two categorical variables and making comparisons.

[Figure: Pie chart, “Favorite Pet”: Dog 35 (47%), Cat 22 (29%), Fish 10 (13%), Other 8 (11%)]

[Figure: Bar graphs, “Favorite Pet”: one showing the count for each pet (Dog, Cat, Fish, Other) and one showing the percent (47%, 29%, 13%, 11%)]

[Figure: Bar graphs, “Does Order Matter”: counts and percents of S Picked and Q Picked, grouped by Letter Listed First. S Listed First: S Picked 61 (66%), Q Picked 31 (34%). Q Listed First: S Picked 45 (46%), Q Picked 53 (54%)]

Chapter 2: Section 2.4 (Quantitative Data)

5-Number Summary

Median 50

Quartiles 45 57

Extremes 32 74

Summary Features for Quantitative Variables

Location (Center, Average): The Median Is an Indication of the Center of the Data

Spread (Variability): The Difference between the two extremes and the difference between the two quartiles tell us about the spread of the data.

Shape: Cannot be determined from a 5-Number Summary
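As a worked example of the two spread measures described above, the short Python sketch below subtracts the extremes and the quartiles of the 5-Number Summary given at the start of this section. The names "range" and "interquartile range" are the standard terms for these differences; the notes have not introduced them at this point, so treat the variable names as assumptions.

```python
# Five-number summary from this section.
low, q1, median, q3, high = 32, 45, 50, 57, 74

data_range = high - low          # spread between the two extremes
interquartile_range = q3 - q1    # spread of the middle half of the data

print(data_range, interquartile_range)
```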

Outliers

An outlier is a data point that is not consistent with the bulk of the data.

It is an unusually high or low value (for example, Michael Jordan’s salary compared to other Geography majors from UNC-Chapel Hill).

Outliers cannot usually be identified from a 5-Number Summary.

3 Common Reasons for Outliers

1. A mistake was made while taking a measurement or entering it into the computer. Either remove the outlier or, if possible, correct the mistake.

2. The individual in question belongs to a different group than the bulk of individuals measured. Inclusion or removal should be based upon the question of interest.

3. The outlier is a legitimate data value and represents natural variability for the group and variable(s) measured. Do not discard.

Chapter 2: Section 2.5 (Pictures for Quantitative Data)

“You Can Observe a Lot By Watching” Yogi Berra

When displaying quantitative data strive for graphics that allow the reader to assess the center, spread, shape, and outliers of the data.

Shape: The Shape of a dataset is either symmetric or skewed.

For a symmetric data set the Mean and Median are usually about equal. They will also be close to the Mode if it is unique.

“Skewed to the Left” - Values trail off to the left

Usually the mean is less than the median when data is skewed to the left.

“Skewed to the Right” - Values trail off to the right

Usually the mean is greater than the median when the data is skewed to the right.
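The relationship between skewness, the mean, and the median can be checked on a small example. The data below are hypothetical (they do not come from these notes) and were chosen only so that a few large values trail off to the right; the left-skewed set is its mirror image.

```python
from statistics import mean, median

# Hypothetical data (not from the notes): most values are small,
# and a few large values trail off to the right.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]

# Mirror the values to get a data set that trails off to the left instead.
left_skewed = [20 - x for x in right_skewed]

for name, values in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: mean = {mean(values)}, median = {median(values)}")
```

The extreme values pull the mean toward the tail, while the median depends only on the middle of the ordered data.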

Boxplot or Box-and-Whisker Plot

Visually displays information obtained from the 5-number summary.

Cotinine Levels of 40 Smokers

0 87 173 253 1 103 173 265 1 112

198 266 3 121 208 277 17 123 210 284

32 130 222 289 35 131 227 290 44 149

234 313 48 164 245 477 86 167 250 491

5-Number Summary

Median 170

Quartiles 86.5 251.5

Extremes 0 491
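A 5-number summary like this one can be computed directly from the raw data. The sketch below is one possible approach, assuming the quartiles are taken as the medians of the lower and upper halves of the sorted data; the notes do not state which quartile rule they use, and other conventions can give slightly different values. The function name `five_number_summary` is illustrative.

```python
from statistics import median

# Cotinine levels of the 40 smokers listed above.
cotinine = [
    0, 87, 173, 253, 1, 103, 173, 265, 1, 112,
    198, 266, 3, 121, 208, 277, 17, 123, 210, 284,
    32, 130, 222, 289, 35, 131, 227, 290, 44, 149,
    234, 313, 48, 164, 245, 477, 86, 167, 250, 491,
]

def five_number_summary(values):
    """Return (low, Q1, median, Q3, high) using the median-of-halves rule.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves of
    the sorted data, leaving out the middle value when the count is odd.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]     # smallest half of the data
    upper = ordered[-half:]    # largest half of the data
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

low, q1, med, q3, high = five_number_summary(cotinine)
print(f"Median {med}, Quartiles {q1} {q3}, Extremes {low} {high}")
```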

Boxplots Also Illustrate the Shape of the Data

[Figure: three boxplots labeled Skewed Left, Skewed Right, and Symmetric]

Dotplots: Before computers could draw decent graphs, people drew Dot Plots by hand as a way of describing numerical data.

The range of values in the dataset is marked on an X-axis, and then a dot is placed above the relevant point on the axis for each value in the dataset. If two or more observations have the same value, the dots are stacked on top of each other.

Example

Draw a Dot Plot for the following dataset:

50 35 70 55 50 30 40 65 50 75 60 45 35 75 60 55 55 50 40 55 50
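One way to check a hand-drawn dot plot is to build a text version. The sketch below is a minimal illustration rather than a standard library routine: the function `dot_plot` and its `step` and `width` parameters are assumptions, and the axis spacing of 5 is chosen because every value in this dataset is a multiple of 5.

```python
from collections import Counter

# Data set from the example above.
data = [50, 35, 70, 55, 50, 30, 40, 65, 50, 75, 60, 45, 35, 75, 60,
        55, 55, 50, 40, 55, 50]

def dot_plot(values, step=5, width=4):
    """Return the lines of a text dot plot, one stacked column per axis value.

    Assumption: the axis is marked every `step` units from the smallest to
    the largest value, and every data value falls on one of those marks.
    """
    counts = Counter(values)
    axis = list(range(min(values), max(values) + 1, step))
    tallest = max(counts.values())
    lines = []
    # Build rows from the top of the tallest stack down to the axis.
    for level in range(tallest, 0, -1):
        row = "".join(("." if counts[v] >= level else " ").center(width) for v in axis)
        lines.append(row.rstrip())
    lines.append("".join(str(v).center(width) for v in axis).rstrip())
    return lines

print("\n".join(dot_plot(data)))
```

The tallest stack sits above 50, which occurs five times in the dataset.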

Stem and Leaf Diagrams

Stem and Leaf Diagrams are graphical ways to display a group of integers in a dataset.

Steps for Constructing a Stem and Leaf Diagram

1. Select one or more of the leading digits to be the Stem values; the remaining digits become the Leaves.

2. List Possible Stem values in a column

3. Record the Leaf for every observation beside the corresponding Stem value.

4. Indicate on the display what units are used for the Stems and Leaves.

Example

Measurements are taken of the molar polarization of gaseous water at 100 kPa. Some of these measurements are given below in units of cm³ mol⁻¹.

71 52 52 75 64 60 48 56 67 29 11

53 25 46 58 46 49 62 66 40 19 54

57 54 60 19 59 43 51 40 21 45 46

62 73 59 36 45 55 46 45 32 55 46

51 46 65 49 61 40
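The four construction steps can be applied to this dataset with a short Python sketch. It assumes the tens digit is used as the Stem and the units digit as the Leaf, which suits these two-digit values; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

# Molar polarization measurements from the example above.
data = [
    71, 52, 52, 75, 64, 60, 48, 56, 67, 29, 11,
    53, 25, 46, 58, 46, 49, 62, 66, 40, 19, 54,
    57, 54, 60, 19, 59, 43, 51, 40, 21, 45, 46,
    62, 73, 59, 36, 45, 55, 46, 45, 32, 55, 46,
    51, 46, 65, 49, 61, 40,
]

def stem_and_leaf(values):
    """Map each tens-digit stem to its sorted list of units-digit leaves."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # step 1: split into stem and leaf
        leaves[stem].append(leaf)        # step 3: record the leaf beside its stem
    # Step 2: list every possible stem, including any with no leaves.
    return {stem: leaves[stem] for stem in range(min(leaves), max(leaves) + 1)}

diagram = stem_and_leaf(data)
for stem, stem_leaves in diagram.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in stem_leaves)}")
print("Key: 7 | 1 represents 71")        # step 4: state the units
```

Sorting the leaves within each stem is optional, but it makes the median and quartiles easier to read off the display.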

Histogram Slides Go Here

Chapter 2: Section 2.6 (Numerical Summaries of Quantitative Variables)

n = the number of individuals in a dataset

$x_1, x_2, x_3, \ldots, x_n$ represent the individual raw data values

Example: A dataset consists of the height (in inches) of all individuals in MA 7280. The values are 74, 70, 62, 68, 69, 71, 65, 66. n=8 x
