Turning Data into Information
1. Discuss variable types, i.e. categorical (binary, ordinal, and nominal) and quantitative (discrete and continuous).
2. Graph Types i. Pie Chart/Graph and Bar Chart/Graph for categorical data ii. Histogram, Stem and Leaf (stemplot), Boxplot for quantitative data
3. Location or Center i. Mode: most frequent outcome and used primarily for categorical data ii. Mean and median for quantitative data
Mean is found by summing all observations and dividing by total number of observations e.g. in Big 10 President and football coach salaries add the 28 salaries together and divide by 28 = 387.223/28 = 13.83 or $1,383,000
Median found by:
1. Ordering observations from smallest to largest
2. Taking total number of observations plus one then dividing by 2
3. Find the location of the observation found in Step 2 within ordered string e.g. in Big 10 President and football coach salaries order 28 values from lowest to highest; take (28+1)/2 to get 14.5. Then find the observation that would represent position 14.5 position, i.e. the position midway between the 14 th and 15 th observation. From the ordered data, this would be halfway between 7.5 and 8.34 within the ordered string which is 7.92 or $792,000.
Note that if the total number of observations is an odd number then the position will fall exactly on an observation value.
4. Spread of data i. Range = (Max - Min) = 43 – 4.12 = 38.88 ii. Standard Deviation (or square root of variance). Represents the average distance observations fall from the mean. Variance found by taking each observation minus the mean; squaring this difference; adding up these squares; then dividing by n-1. SD = 11.13 ii. The quartiles and their interpretation iii. Interquartile Range or IQR = Q3 - Q1 iiii. Calculating “fence” to find outliers by Q3 + 1.5*IQR and Q1 - 1.5*IQR
IQR = 20.0 – 5.16 = 14.84 times 1.5 = 22.26
Q1 - 1.5*IQR = 5.16 – 22.26 = - 17.1 or 0 since no negative compensation!
Q3 + 1.5*IQR = 20.0 + 22.26 = 42.26
Outliers, then, are any observation(s) that fall outside this range of data are would be marked with an asterisk in a boxplot. We would have at least one with the salary of 43 exceeding 42.26
Descriptive Statistics: Comp(100k)
Variable N N* Mean SE Mean StDev Minimum Q1 Median Q3
Comp(100k) 28 0 13.83 2.10 11.13 4.12 5.16 7.92 20.00
Variable Maximum
Comp(100k) 43.00
1
Boxplot of Comp(100k)
0 10 20
Comp(100k)
30 40
Stem-and-Leaf Display: Comp(100k)
Stem-and-leaf of Comp(100k) N = 28
Leaf Unit = 1.0
6 0 444444
(9) 0 555567778
13 1 222
10 1 69
8 2 003
5 2 68
3 3 0
2 3 8
1 4 3
When we have a categorical variable used to define groups of a quantitative variable, e.g. the Big
10 salaries by position of President of Football Coach, side-by-side boxplots can be useful in comparing the distribution across the groups.
Boxplot of Salary(in 100K)
0 10
FootballCoach
20
President
30 40
0 10 20
Panel variable: Position
30 40
Salary(in 100K)
2
Dot Plots by group
OR
Dotplot of Comp(100k)
FC
Pres
6 12 18 24
Comp(100k)
30 36 42
5. Data Shapes - skewed left (negatively skewed), right (positively skewed) and bell-shaped
(symmetric). NOTE: salary/compensation typically follows a right-skew distribution. i. Relationship b/w mean and median
Histogram of Salary(in 100K)
10
4
2
8
6
0
10 20 30
Salary(in 100K)
40
3
6. Empirical Rule: For data that is bell-shaped there exists a unique relationship between number of standard deviations that the data falls from the mean. This unique relationship is described as the Empirical Rule and is interpreted as follows:
For any bell-shaped data set we can expect approximately the following:
- 68% of the observations fall within the mean plus and minus
- 95% of the observations fall within the mean plus and minus one two
- 99.7% of the observations fall within the mean plus and minus
SD
SD three SD
For example, SAT Math scores are typically bell shaped with a mean of 500 and SD of
100. Therefore we would expect roughly 68% of the SAT Math scores to fall between
400 and 600; 95% to fall between 300 and 700; and almost all to fall between 200 and
800.
6. A Z-score is found by taking (observed - mean)/SD. The z-score represents how many standard deviations an observation falls from the mean. One helpful application of zscores when data is bell-shaped is to compare scores across variables that have different measurements (e.g. comparing a really tough class to a really easy class).
7. Categorical variable graphing using bar graphs and pie charts. Consider a January 4-6,
2013 Public Policy Polling survey of 675 PA registered voters regarding their opinion on
Governor Corbett’s suing of the NCAA over the PSU sanction.
Chart of Observed
400
300
200
100
0
Not Sure Oppose
Observed
Support
4
Oppose
230, 34.1%
Pie Chart of Corbet_NCAA
Not Sure
95, 14.1%
Support
350, 51.9%
Category
Support
Oppose
Not Sure
5