Turning_DataSP14


Turning Data into Information

1. Discuss variable types, i.e. categorical (binary, ordinal, and nominal) and quantitative (discrete and continuous).

2. Graph types
   i. Pie chart and bar chart for categorical data
   ii. Histogram, stem-and-leaf plot (stemplot), and boxplot for quantitative data

3. Location or center
   i. Mode: the most frequent outcome; used primarily for categorical data
   ii. Mean and median for quantitative data

The mean is found by summing all observations and dividing by the total number of observations. For example, for the Big 10 president and football coach salaries, add the 28 salaries together and divide by 28: 387.223/28 = 13.83 (in units of $100,000), or $1,383,000.
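As a quick check of that arithmetic, here is a minimal Python sketch. The helper name `mean_from_total` is illustrative only, and the total 387.223 is taken from the example because the 28 individual salaries are not listed in these notes.

```python
def mean_from_total(total, n):
    """Mean = sum of all observations divided by the number of observations."""
    if n <= 0:
        raise ValueError("need at least one observation")
    return total / n

# Total and count from the Big 10 example (units of $100,000).
mean_comp = mean_from_total(387.223, 28)

print(round(mean_comp, 2))              # 13.83
print(f"${mean_comp * 100_000:,.0f}")   # $1,382,939, i.e. about $1,383,000
```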

The median is found by:

1. Ordering the observations from smallest to largest

2. Taking the total number of observations plus one, then dividing by 2

3. Finding the observation at the position from Step 2 within the ordered list. For example, for the Big 10 president and football coach salaries, order the 28 values from lowest to highest and take (28+1)/2 to get 14.5. Then find the value at position 14.5, i.e. midway between the 14th and 15th observations. From the ordered data, this is halfway between 7.5 and 8.34, which is 7.92 or $792,000.

Note that if the total number of observations is odd, the position falls exactly on an observation.
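The three steps can be sketched in Python as follows. The function name `median_by_position` and the small data sets are hypothetical; only the two middle salaries (7.5 and 8.34) come from the example above.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 position rule."""
    ordered = sorted(values)               # Step 1: order smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    position = (n + 1) / 2                 # Step 2: 1-based position of the median
    lower = ordered[int(position) - 1]     # observation at the whole-number part
    if position == int(position):          # odd n: lands exactly on an observation
        return lower
    upper = ordered[int(position)]         # even n: the next observation up
    return (lower + upper) / 2             # Step 3: midway between the two

# Hypothetical small data sets (not the Big 10 salaries).
print(median_by_position([3, 1, 2]))       # 2
print(median_by_position([4, 1, 3, 2]))    # 2.5

# The 14th and 15th ordered salaries quoted in the example.
print(round((7.5 + 8.34) / 2, 2))          # 7.92
```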

4. Spread of data
   i. Range = Max - Min = 43 - 4.12 = 38.88
   ii. Standard deviation (SD), the square root of the variance. It represents roughly the average distance that observations fall from the mean. The variance is found by subtracting the mean from each observation, squaring each difference, adding up the squares, and dividing by n - 1. Here SD = 11.13.
   iii. The quartiles and their interpretation
   iv. Interquartile range: IQR = Q3 - Q1
   v. "Fences" for finding outliers: Q1 - 1.5*IQR and Q3 + 1.5*IQR

IQR = 20.0 - 5.16 = 14.84, and 1.5*IQR = 1.5 * 14.84 = 22.26

Q1 - 1.5*IQR = 5.16 - 22.26 = -17.10, or effectively 0 since compensation cannot be negative

Q3 + 1.5*IQR = 20.0 + 22.26 = 42.26

Outliers, then, are any observations that fall outside this range; they are marked with an asterisk in a boxplot. Here there is at least one: the salary of 43 exceeds the upper fence of 42.26.
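The fence calculation can be sketched in Python as below. The function names `fences` and `outliers` are illustrative, and the quartiles are passed in directly because software packages compute quartiles in slightly different ways. Only the minimum and maximum salaries are checked, since those are the individual values given in these notes.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outliers(values, q1, q3):
    """Observations that fall outside the two fences."""
    low, high = fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Q1 and Q3 from the Big 10 compensation data (units of $100,000).
low, high = fences(5.16, 20.0)
print(round(low, 2), round(high, 2))          # -17.1 42.26

# Minimum and maximum salaries: only the maximum is flagged.
print(outliers([4.12, 43.0], 5.16, 20.0))     # [43.0]
```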

Descriptive Statistics: Comp(100k)

Variable    N   N*  Mean   SE Mean  StDev  Minimum  Q1    Median  Q3     Maximum
Comp(100k)  28  0   13.83  2.10     11.13  4.12     5.16  7.92    20.00  43.00
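The variance recipe from item 4 (subtract the mean, square, sum, divide by n - 1) can be sketched in Python as follows. The function names and the small data set are hypothetical, since the 28 salaries are not listed here. The last line assumes the usual definition of the SE Mean column, SD divided by the square root of N, which is not spelled out in these notes.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Standard deviation = square root of the sample variance."""
    return math.sqrt(sample_variance(values))

# Hypothetical data (mean is 5, squared deviations sum to 32).
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(sample_variance(data), 3))    # 4.571
print(round(sample_sd(data), 3))          # 2.138

# Assumed relationship: SE Mean = StDev / sqrt(N), using the table values.
print(round(11.13 / math.sqrt(28), 2))    # 2.1
```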

[Figure: Boxplot of Comp(100k); horizontal axis Comp(100k), 0 to 40]

Stem-and-Leaf Display: Comp(100k)

Stem-and-leaf of Comp(100k)  N = 28
Leaf Unit = 1.0

  6   0  444444
 (9)  0  555567778
 13   1  222
 10   1  69
  8   2  003
  5   2  68
  3   3  0
  2   3  8
  1   4  3
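To show how the stem and leaf columns are built, here is a minimal Python sketch. The function `stem_and_leaf` is illustrative and is not Minitab's algorithm: it omits the leftmost depth column and skips empty rows. The list `truncated` holds the whole-number values read back off the display above (stem = tens digit, leaf = ones digit), not the exact salaries.

```python
from collections import defaultdict

def stem_and_leaf(values, split=True):
    """Stem-and-leaf rows for non-negative values with leaf unit 1.

    With split=True each stem gets two rows: leaves 0-4, then leaves 5-9.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)    # tens digit(s) and ones digit
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))
    return [f"{stem} {''.join(leaves)}"
            for (stem, half), leaves in sorted(rows.items())]

# Values truncated to whole units, as read from the display.
truncated = ([4] * 6 + [5, 5, 5, 5, 6, 7, 7, 7, 8] + [12, 12, 12] + [16, 19]
             + [20, 20, 23] + [26, 28] + [30, 38, 43])

for row in stem_and_leaf(truncated):
    print(row)
```

The printed rows reproduce the stem and leaf columns of the display, from `0 444444` down to `4 3`.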

When a categorical variable defines groups of a quantitative variable, e.g. the Big 10 salaries by position (President or Football Coach), side-by-side boxplots are useful for comparing the distributions across the groups.

[Figure: Boxplot of Salary(in 100K), paneled by Position (FootballCoach, President); horizontal axis Salary(in 100K), 0 to 40]

OR dot plots by group

[Figure: Dotplot of Comp(100k) by group (FC, Pres); horizontal axis Comp(100k), 6 to 42]

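5. Data shapes: skewed left (negatively skewed), skewed right (positively skewed), and bell-shaped (symmetric). NOTE: salary/compensation typically follows a right-skewed distribution.
   i. Relationship between mean and median: in right-skewed data the mean is typically pulled above the median (here 13.83 versus 7.92).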
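A rough way to read the direction of skew from the mean and median can be sketched in Python. The function `skew_hint` and its `tol` parameter are hypothetical, and the comparison is only a rule of thumb, not a formal test of skewness.

```python
def skew_hint(mean, median, tol=0.0):
    """Rule of thumb: the mean is pulled toward the long tail."""
    if mean - median > tol:
        return "right-skewed (mean > median)"
    if median - mean > tol:
        return "left-skewed (mean < median)"
    return "roughly symmetric (mean about equal to median)"

# Mean and median of the Big 10 compensation data (units of $100,000).
print(skew_hint(13.83, 7.92))    # right-skewed (mean > median)
```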

[Figure: Histogram of Salary(in 100K); horizontal axis Salary(in 100K), 10 to 40; vertical axis 0 to 10]

6. Empirical Rule: For bell-shaped data there is a specific relationship between the number of standard deviations from the mean and the share of observations that fall within that distance. This relationship is called the Empirical Rule and is interpreted as follows:

For any bell-shaped data set we can expect approximately the following:

- 68% of the observations fall within the mean plus and minus one SD

- 95% of the observations fall within the mean plus and minus two SD

- 99.7% of the observations fall within the mean plus and minus three SD

For example, SAT Math scores are typically bell-shaped with a mean of 500 and an SD of 100. Therefore we would expect roughly 68% of SAT Math scores to fall between 400 and 600, 95% to fall between 300 and 700, and almost all to fall between 200 and 800.
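The SAT intervals follow directly from mean plus and minus one, two, and three SDs. A minimal Python sketch, with `empirical_rule` as an illustrative function name:

```python
def empirical_rule(mean, sd):
    """Approximate coverage intervals for bell-shaped data."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {pct: (mean - k * sd, mean + k * sd) for k, pct in coverage.items()}

# SAT Math example: mean 500, SD 100.
for pct, (low_end, high_end) in empirical_rule(500, 100).items():
    print(f"about {pct} of scores between {low_end} and {high_end}")
# about 68% of scores between 400 and 600
# about 95% of scores between 300 and 700
# about 99.7% of scores between 200 and 800
```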

7. A z-score is found by taking (observed - mean)/SD. The z-score represents how many standard deviations an observation falls from the mean. One helpful application of z-scores when data are bell-shaped is comparing scores across variables measured on different scales (e.g. a score in a really tough class versus a score in a really easy class).
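A minimal Python sketch of the z-score calculation and of the class comparison follows. The function name `z_score`, the SAT score of 650, and the class means and SDs are all hypothetical values chosen for illustration.

```python
def z_score(observed, mean, sd):
    """Number of standard deviations an observation falls from the mean."""
    if sd <= 0:
        raise ValueError("SD must be positive")
    return (observed - mean) / sd

# Hypothetical SAT Math score of 650 with mean 500 and SD 100.
print(z_score(650, 500, 100))     # 1.5

# Hypothetical classes: a 75 in a tough class (mean 60, SD 10)
# versus a 90 in an easy class (mean 85, SD 5).
tough = z_score(75, 60, 10)       # 1.5 SDs above the class mean
easy = z_score(90, 85, 5)         # 1.0 SD above the class mean
print(tough > easy)               # True: the 75 is the stronger relative score
```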

8. Categorical variable graphing using bar graphs and pie charts. Consider a January 4-6, 2013 Public Policy Polling survey of 675 PA registered voters regarding their opinion of Governor Corbett's lawsuit against the NCAA over the PSU sanctions.

[Figure: Chart of Observed; bar chart of counts (vertical axis 0 to 400) for Not Sure, Oppose, and Support]

[Figure: Pie Chart of Corbet_NCAA]

Category   Count  Percent
Support    350    51.9%
Oppose     230    34.1%
Not Sure   95     14.1%
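Each pie chart percentage is the category count divided by the 675 respondents. A minimal Python sketch, with `percentages` as an illustrative helper and the counts taken from the chart labels:

```python
def percentages(counts):
    """Convert category counts to percentages of the total, to one decimal."""
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}

# Counts from the poll of 675 PA registered voters.
counts = {"Support": 350, "Oppose": 230, "Not Sure": 95}

print(sum(counts.values()))    # 675
print(percentages(counts))     # {'Support': 51.9, 'Oppose': 34.1, 'Not Sure': 14.1}
```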
