Describing and Summarizing Data (Section 3)
I. One variable
A. Measures of Central Tendency (measures of central location)
Measure of Central Tendency- A measure that attempts to find the
middle or "typical" value of the data
- There are many measures of central tendency. We will
concentrate on 3 of them (mean, median, and mode)
1. Means
- Population Mean: $\mu_x = \frac{1}{N} \sum_{i=1}^{N} x_i$
Example
- Sample Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Example
- Weighted Means- Sometimes the data you have found groups the
observations into classes; in that case you will have to use a weighted mean.
- Population Weighted Mean: $\mu_x = \frac{1}{N} \sum_{i=1}^{C} f_i m_i$
- Sample Weighted Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{C} f_i m_i$
where C is the number of classes, f_i is the number of observations in
class i, and m_i is the class's middle value
Example- Sample of years it takes to graduate from college
Years    Number of people
0-3      3
3-6      80
6-9      13
9-12     4
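The weighted mean for this table can be sketched in Python. The handout itself works by hand or in Excel, so the language, the `grouped_mean` helper, and the tuple layout are illustrative assumptions. Each class midpoint m_i is taken as the average of the class bounds.

```python
# Grouped sample from the example: (lower bound, upper bound, frequency).
classes = [(0, 3, 3), (3, 6, 80), (6, 9, 13), (9, 12, 4)]

def grouped_mean(classes):
    """Weighted mean: sum of f_i * m_i divided by the total number of observations."""
    total = sum(f for _, _, f in classes)
    weighted_sum = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return weighted_sum / total

mean_years = grouped_mean(classes)
print(round(mean_years, 2))  # 5.04
```

Because the sample formula and the population formula differ only in the symbol used for the count, the same function serves both.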
2. Medians
Median- The middle point in the data.
Ex 1: 1,2,3,4,5,6,7
Median=
Ex 2: 1,2,3,4,5,6,7,8
Median=
Why might the median be a better measure of “average” income than
the mean?
Ex Calculate the mean and median of the following incomes:
$25,000 / $30,000 / $30,000 / $35,000 / $35,000 / $40,000 / $2,000,000
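A quick way to check this exercise is Python's standard `statistics` module. This is only a sketch for verifying hand calculations; the variable names are assumptions.

```python
from statistics import mean, median

incomes = [25_000, 30_000, 30_000, 35_000, 35_000, 40_000, 2_000_000]

income_mean = mean(incomes)      # pulled upward by the single $2,000,000 income
income_median = median(incomes)  # middle value of the sorted list

print(round(income_mean, 2))  # 313571.43
print(income_median)          # 35000
```

Six of the seven incomes are at or below $40,000, yet the mean is above $300,000, which is the contrast the question is asking about.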
Median Class- The first class to have a cumulative relative frequency
of 50% or more.
Ex: Sample of years it takes to graduate from college
Years    Number of People    Relative Frequency    Cumulative Relative Frequency
0-3      3
3-6      80
6-9      13
9-12     4
What is the Median Class?
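The blank columns and the median class can be filled in with a short Python sketch. The variable names are assumptions; the rule applied is the one stated above (first class whose cumulative relative frequency reaches 50% or more).

```python
from itertools import accumulate

labels = ["0-3", "3-6", "6-9", "9-12"]
counts = [3, 80, 13, 4]
n = sum(counts)

relative = [c / n for c in counts]              # relative frequency per class
cumulative_counts = list(accumulate(counts))    # running total of observations
cumulative = [c / n for c in cumulative_counts] # cumulative relative frequency

# First class whose cumulative relative frequency is 50% or more.
median_class = next(
    label for label, cum in zip(labels, cumulative_counts) if 2 * cum >= n
)

print(relative)      # [0.03, 0.8, 0.13, 0.04]
print(cumulative)    # [0.03, 0.83, 0.96, 1.0]
print(median_class)  # 3-6
```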
3. Mode
Mode- Value that occurs most often
Ex- 1,2,2,3,4,7,7,9,11
Mode=
Modal Class- Class in a frequency distribution that has the most
occurrences in it.
What is our modal class in the above example?
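Both ideas can be checked with a few lines of Python. This is a sketch with assumed variable names; `statistics.multimode` returns every value tied for most frequent, which matters here because the example list has two modes.

```python
from statistics import multimode

data = [1, 2, 2, 3, 4, 7, 7, 9, 11]
modes = multimode(data)  # all values that occur most often
print(modes)  # [2, 7]

# Modal class for the years-to-graduate table: the class with the largest count.
labels = ["0-3", "3-6", "6-9", "9-12"]
counts = [3, 80, 13, 4]
modal_class = labels[counts.index(max(counts))]
print(modal_class)  # 3-6
```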
B. Other measures of location
1. Percentile- The pth percentile is a value such that p percent of the
observations fall at or below the value and (100-p) percent of the
observations fall at or above the value
Exs
90th percentile:
80th percentile on a test:
Calculating the pth percentile:
1. Arrange the data from smallest to largest
2. Compute the index i: $i = \left(\frac{p}{100}\right) n$
where p is the percentile of interest and n is the number of observations
3. a) If i is not an integer, round up
b) If i is an integer, add 1 to i
Ex: 200, 190, 170, 210, 220, 120, 140, 250 What numbers are in
the 60th percentile?
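The index steps can be sketched in Python as follows. The function name `percentile_position` is hypothetical, and the code follows the rule exactly as written in these notes (round up, or add 1 when i is an integer); other textbooks treat the integer case differently.

```python
import math

def percentile_position(p, n):
    """1-based position in the sorted data, following the steps in these notes."""
    if (p * n) % 100 == 0:          # i = (p/100) * n is an integer: add 1
        return p * n // 100 + 1
    return math.ceil(p * n / 100)   # otherwise round i up

data = sorted([200, 190, 170, 210, 220, 120, 140, 250])
pos = percentile_position(60, len(data))  # i = 4.8, rounded up to 5
value = data[pos - 1]

print(pos, value)   # 5 200
print(data[pos - 1:])  # values at that position and above
```

Note that for p = 100 the integer rule points one position past the end of the data, so this sketch is only meant for percentiles below 100.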
2. Quartiles- Dividing the data into four parts (using three lines)
Q1 First quartile (25th percentile)
Q2 Second quartile (50th percentile, also the median)
Q3 Third quartile (75th percentile)
Ex: 5, 10, 11, 12, 13, 19, 23, 25, 28, 29, 30, 33
Use your eyes to divide the numbers into quartiles
Use the percentile calculation to divide the above into quartiles
Divide the Histogram below into quartiles:
[Figure: Histogram. Horizontal axis: Bin; vertical axis: Frequency (0 to 3.5)]
C. Skewness
- Skewness- A measure of the degree of asymmetry of a
distribution.
Positive (Right) Skewed: Mean > Median
Negative (Left) Skewed: Mean < Median
Symmetric Distribution
D. Measures of Dispersion
- Measures of Dispersion- Measures of the variability of a
distribution (often used to measure risk)
1. Range- Difference between the largest and smallest numbers in a
data set
Example-
2. Interquartile Range- Interquartile Range = Q3 - Q1
Ex: Use the previous example from quartiles to calculate the interquartile range
3. Variance
- Population Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
- Sample Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Ex- Sample of drinks per week of law students at IU:
0,1,2,3,4
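The two variance formulas differ only in the divisor, which a small Python sketch makes explicit. The `variance` helper and its `sample` flag are assumptions for illustration, not part of the notes.

```python
def variance(xs, sample=True):
    """Sum of squared deviations from the mean, divided by n-1 (sample) or n (population)."""
    n = len(xs)
    mean = sum(xs) / n
    squared_deviations = sum((x - mean) ** 2 for x in xs)
    return squared_deviations / (n - 1) if sample else squared_deviations / n

drinks = [0, 1, 2, 3, 4]
s2 = variance(drinks)                    # sample variance: 10 / 4
sigma2 = variance(drinks, sample=False)  # same data treated as a population: 10 / 5

print(s2, sigma2)  # 2.5 2.0
```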
- Why is sample variance divided by n-1 instead of n?
4. Standard Deviation
- Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$
- Sample Standard Deviation: $s = \sqrt{s^2}$
Example- Find the Standard deviation for our previous sample
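As a cross-check on a hand calculation, Python's standard library has both versions built in: `statistics.stdev` divides by n-1 and `statistics.pstdev` divides by n. A minimal sketch using the drinks-per-week sample (0, 1, 2, 3, 4):

```python
from statistics import pstdev, stdev

drinks = [0, 1, 2, 3, 4]

s = stdev(drinks)       # sample standard deviation, square root of 2.5
sigma = pstdev(drinks)  # population standard deviation, square root of 2.0

print(round(s, 4), round(sigma, 4))  # 1.5811 1.4142
```

Unlike the variance, the result is in the same units as the data (drinks per week).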
Why do we look at standard deviation instead of just looking at
variance?
What does a Standard deviation tell us?
Why would this be important if our data is returns of a particular
stock?
           Year 1    Year 2    Year 3    Avg Return
Stock 1    10%       10%       10%
Stock 2    -10%      40%       0%
5. Coefficient of Variation
Coefficient of Variation = $\frac{\text{Standard Deviation}}{\text{Mean}} \times 100$, that is, $\frac{s}{\bar{x}} \times 100$ or $\frac{\sigma}{\mu} \times 100$
Why?
It gets rid of units of measurement.
The standard deviation gets larger as the magnitude of the values used
in the calculation gets larger, so an unadjusted comparison of the
standard deviations of two stocks is not a fair way to compare their
"riskiness".
Ex 1: August 17, 1998 issue of Fortune magazine
Mean return of Legg Mason Value Primary Fund is 29.4% and the
standard deviation is 17.3%
Mean return of the Reynolds Blue Chip Growth Fund is 23.7% and its
standard deviation is 17.5%
Calculate the coefficient of variation for both:
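The calculation for the two funds can be sketched in Python. The function name is an assumption; the inputs are the means and standard deviations quoted above.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

legg_mason = coefficient_of_variation(17.3, 29.4)
reynolds = coefficient_of_variation(17.5, 23.7)

print(round(legg_mason, 2))  # 58.84
print(round(reynolds, 2))    # 73.84
```

The two standard deviations are nearly identical, but relative to its mean return the second fund varies more.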
Interpret:
The coefficient of variation is the best measure of risk we have
studied thus far.
Ex 2:
Stock 1 returns: 5%, 6%, 7%
Stock 2 returns: 7%, 10%, 13%
Calculate the mean & standard deviation of both (assume population):
If you measured risk using standard deviation (or variance or range)
which one would you say is more risky?
Calculate the coefficient of variation for each stock:
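A short Python sketch for checking the numbers in this exercise, treating each set of returns as a population as instructed. The dictionary layout and variable names are assumptions.

```python
from statistics import fmean, pstdev

stocks = {"Stock 1": [5, 6, 7], "Stock 2": [7, 10, 13]}

summary = {}
for name, returns in stocks.items():
    mean = fmean(returns)
    std_dev = pstdev(returns)   # population standard deviation (divides by N)
    cv = std_dev / mean * 100   # coefficient of variation in percent
    summary[name] = (mean, std_dev, cv)
    print(name, round(mean, 2), round(std_dev, 4), round(cv, 2))
# Stock 1 6.0 0.8165 13.61
# Stock 2 10.0 2.4495 24.49
```

When interpreting the output, compare the lowest return of Stock 2 with the highest return of Stock 1.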
What does coefficient of variation say is the riskiest stock?
Why does this not make sense?
6. Variance and Standard Deviation for Grouped Data:
- Population Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{C} f_i (m_i - \mu)^2$
How would you write this if you had relative frequency?
- Sample Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{C} f_i (m_i - \bar{x})^2$
- Population Standard Deviation: σ =
- Sample Standard Deviation: s =
Example- Sample of years it takes to graduate from college
Years    Number of people
0-3      3
3-6      80
6-9      13
9-12     4
Calculate sample variance:
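The grouped sample variance can be sketched in Python along the same lines as the grouped mean. The helper name and tuple layout are assumptions; midpoints are the averages of the class bounds.

```python
import math

# (lower bound, upper bound, frequency) for each class.
classes = [(0, 3, 3), (3, 6, 80), (6, 9, 13), (9, 12, 4)]

def grouped_sample_variance(classes):
    """Sum of f_i * (m_i - x_bar)^2 divided by n - 1."""
    n = sum(f for _, _, f in classes)
    midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
    freqs = [f for _, _, f in classes]
    x_bar = sum(f * m for f, m in zip(freqs, midpoints)) / n
    squared = sum(f * (m - x_bar) ** 2 for f, m in zip(freqs, midpoints))
    return squared / (n - 1)

s2 = grouped_sample_variance(classes)
s = math.sqrt(s2)

print(round(s2, 4))  # 2.6145
print(round(s, 4))   # 1.617
```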
E. Measures of relative location & detecting outliers
1. Z-scores- Number of standard deviations our observation (x_i) is
away from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
Z-score can be interpreted as measure of relative location in a data
set.
Ex
Mean return of a stock is 15% and the standard deviation is 5%, we
get a return of 10%.
Observe how many standard deviations our return is from the mean.
Confirm your answer by calculating a z-score.
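The z-score for this example can be confirmed with a one-line function in Python (the function name is an assumption):

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations x lies from the mean (negative means below)."""
    return (x - mean) / std_dev

z = z_score(10, 15, 5)  # returns in percent
print(z)  # -1.0
```

A return of 10% is one standard deviation below the mean return of 15%.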
2. General Rules for Bell shaped distributions
Looking at this distribution would you believe that you were 5
standard deviations away from the mean?
F. Descriptive (Summary) Statistics in Excel
Click: Tools → Data Analysis → Descriptive Statistics
Click the Summary Statistics box
Column1
Mean                  5.5
Standard Error        0.507194614
Median                5.5
Mode                  3
Standard Deviation    2.484736011
Sample Variance       6.173913043
Kurtosis              -0.938233756
Skewness              0
Range                 9
Minimum               1
Maximum               10
Sum                   132
Count                 24
These are the summary statistics for the histogram we used earlier:
[Figure: Histogram. Horizontal axis: Bin; vertical axis: Frequency (0 to 3.5)]
Why does the mean equal the median? Why is skewness=0?
Draw 1 standard deviation above and below the mean on the histogram.
Draw 2 standard deviations above and below the mean on the histogram.
Would you believe someone who told you that you were 1 standard
deviation below the mean?
Would you believe someone who told you that you were 2 standard
deviations below the mean?
Other commands in Excel:
Mean:                   =AVERAGE()
Median:                 =MEDIAN()
Mode:                   =MODE()
Population variance:    =VARP()
Sample variance:        =VAR()
Square root:            =SQRT()
II. Association between two variables
A. Covariance
- Population Covariance: $\sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$
- Sample Covariance: $s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
Ex 1:
Calculate the population covariance for the following two variables
Day    Food Sales $    Drink Sales $
1      2000            200
2      500             800
3      1000            500
4      600             700
5      900             300
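The population covariance for the food and drink sales can be sketched in Python. The function name is an assumption; it pairs each day's deviation from the mean food sales with that day's deviation from the mean drink sales and divides by N.

```python
def population_covariance(xs, ys):
    """Average product of paired deviations from the two means (divides by N)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

food = [2000, 500, 1000, 600, 900]
drink = [200, 800, 500, 700, 300]

cov_food_drink = population_covariance(food, drink)
print(cov_food_drink)  # -102000.0
```

The sign is the part to interpret; the magnitude depends on the units (dollars times dollars).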
Negative Covariance:
Loose Interpretation:
More Technical Interpretation:
Positive Covariance:
Loose Interpretation:
More Technical Interpretation:
Ex 2:
Calculate the sample covariance for the following two variables
Day    Drink Sales $    Ice Cream Sales $
1      200              50
2      800              200
3      500              100
4      700              150
5      300              0
Interpret your result:
B. Correlation Coefficient
Population correlation coefficient: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
Sample correlation coefficient: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
• The value of the correlation coefficient is between -1 and 1
(including -1 and 1)
+1: Perfect positive linear relationship between the two variables
-1: Perfect negative linear relationship between the two variables
+ number: Positive linear relationship between the two variables
- number: Negative linear relationship between the two variables
Ex 1:
Calculate the population correlation coefficient for the following two
variables
Day    Food Sales $    Drink Sales $
1      2000            200
2      500             800
3      1000            500
4      600             700
5      900             300
Ex 2:
Calculate the sample correlation coefficient for the following two
variables
Day    Drink Sales $    Ice Cream Sales $
1      200              50
2      800              200
3      500              100
4      700              150
5      300              0
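Both correlation exercises can be checked with one Python sketch. The `covariance` and `correlation` helpers and their `sample` flag are assumptions for illustration; the flag switches between dividing by n-1 (sample) and N (population), matching the formulas in this section.

```python
import math

def covariance(xs, ys, sample):
    """Covariance of paired data; divides by n-1 if sample, else by N."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1) if sample else total / n

def correlation(xs, ys, sample):
    """Covariance divided by the product of the two standard deviations."""
    std_x = math.sqrt(covariance(xs, xs, sample))  # covariance of x with itself is its variance
    std_y = math.sqrt(covariance(ys, ys, sample))
    return covariance(xs, ys, sample) / (std_x * std_y)

food = [2000, 500, 1000, 600, 900]
drink = [200, 800, 500, 700, 300]
ice_cream = [50, 200, 100, 150, 0]

rho_food_drink = correlation(food, drink, sample=False)    # Ex 1 (population)
s_drink_ice = covariance(drink, ice_cream, sample=True)    # sample covariance
r_drink_ice = correlation(drink, ice_cream, sample=True)   # Ex 2 (sample)

print(round(rho_food_drink, 2))  # -0.84
print(s_drink_ice)               # 18750.0
print(round(r_drink_ice, 2))     # 0.93
```

The divisor cancels in the correlation ratio, so the `sample` flag changes the covariance but not the correlation coefficient.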
C. Scatter Plots, Covariance, & Correlation Coefficient
[Figure: five scatter plots titled Perfect Positive Linear Relationship, Positive Linear Relationship, Perfect Negative Linear Relationship, Negative Linear Relationship, and Almost No Linear Relationship]
Additional Excel commands:
Population Covariance:     =COVAR(,)
Sample Covariance:         =(n/(n-1))*COVAR(,) where n is the
number of observations you have
Correlation Coefficient:   =CORREL(,)
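The rescaling from population covariance to sample covariance works because the two formulas share the same sum and differ only in the divisor (N versus n-1). A minimal Python sketch, with an assumed function name and the food and drink sales population covariance of -102000 over 5 days as an illustrative input:

```python
def sample_cov_from_population_cov(pop_cov, n):
    """Rescale a population covariance (divisor n) to a sample covariance (divisor n-1)."""
    return pop_cov * n / (n - 1)

sample_cov = sample_cov_from_population_cov(-102000, 5)
print(sample_cov)  # -127500.0
```

The parentheses around n-1 matter: without them, n/n-1 evaluates to 1-1 = 0.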
Formulas for the Section
Below write down the formulas for this section in an organized manner that will help you remember them.
PRACTICE QUESTIONS
1. Assume the following data are a population:
(Make sure that you can do the following calculations by hand)
1, 5, 6, 9, 200, 9, 6, 2, 5, 17, 25
a. Calculate the mean
b. Calculate the mode
c. Calculate the median
d. What numbers are in the 40th percentile?
e. What is the skewness of the data?
f. Calculate the range
g. Calculate the variance
h. Calculate the standard deviation
i. Calculate the coefficient of variation
2. Assume the following data are a sample:
(Make sure that you can do the following calculations by hand)
223, 699, 1222, 845, 111, 3
a. Calculate the mean
b. Calculate the mode
c. Calculate the median
d. What is the skewness of the data?
e. Calculate the range
f. Calculate the variance
g. Calculate the standard deviation
h. Calculate the coefficient of variation
3. Assume the following data are a population:
(Make sure that you can do the following calculations by hand)
Data on speeders
MPH over the limit    Number of people
0-5                   3
10-15                 23
15-20                 50
20-25                 25
a. Calculate the mean
b. Calculate the mode
c. Which is the median class?
d. Calculate the variance
e. Calculate the standard deviation
4. Assume the following data are a sample (for stocks):
(Make sure that you can do the following calculations by hand)
Yield %               Cumulative Relative Frequency
-10% but under -5%    0.10
-5% but under 0%      0.20
0% but under 5%       0.45
5% but under 10%      0.60
10% but under 15%     0.85
15% but under 20%     1.00
a. Calculate the mean
b. Calculate the mode
c. Which is the median class?
5. Answer the following questions using the histogram below:
[Figure: Histogram. Horizontal axis: Bin (20 to 100); vertical axis: Frequency (0 to 10)]
a) Which way is the skew of the histogram?
b) What can you say about the relationship between the mean and median?
6. Answer the following questions using the histogram below:
[Figure: Histogram. Horizontal axis: Bin (20 to 100); vertical axis: Frequency (0 to 9)]
a) Which way is the skew of the histogram?
b) What can you say about the relationship between the mean and median?
7. The mean of a data set is 17, the median is 15, the mode is 5, and the variance is 25. The data point you are looking at is
25. How many standard deviations away from the mean is this data point?
8. Use the population data below to answer the following questions:
City    Amount of Rain (inches)    Ski Sales (thousands)
A       17                         200
B       44                         100
C       100                        80
D       50                         90
E       60                         77
a. Calculate the covariance
b. Calculate the correlation coefficient
9. Use the sample data below to answer the following questions:
FIRM    % OF WOMEN WORKING FOR THE FIRM    PROFITS (MILLIONS)
A       17                                 2
B       44                                 1
C       100                                7
D       50                                 0.5
E       60                                 3
a. Calculate the covariance
b. Calculate the correlation coefficient
10. Is a histogram used to describe one variable or two?
11. Is a scatter plot used to describe one variable or two?
Use the following table to answer the next 4 questions
Weight       # of dogs    Relative frequency    Percent frequency    Cumulative frequency    Cumulative relative frequency (in %)
0-50 lbs     15
50-100       25
100-150      10
Blank cells labeled in the table: I, II, III, IV, V
12. Assume this is a population. Calculate the weighted mean. (round to the nearest tenth)
a. 47.5
b. 48.5
c. 70.0
d. 17.4
e. None of the above
13. Assume this is a population. Calculate the median class
a. 15 dogs
b. 25 dogs
c. 0-50 lbs
d. 50-100 lbs
e. 100-150 lbs
14. Assume this is a population. Calculate the modal class
a. 15 dogs
b. 25 dogs
c. 0-50 lbs
d. 50-100 lbs
e. 100-150 lbs
15. Assume this is a population. Calculate the variance (round to the nearest whole number)
a. 1,225
b. 1,250
c. 15,475
d. 15,791
e. None of the above
Use the following table to answer the next 2 questions
Cost    Quality rating
10      25
30      90
50      35
16. Assume the data is a population. Calculate the covariance
a. 1.3
b. 2.2
c. 66.7
d. 100.0
e. None of the above
17. Assume the data is a population. Calculate the correlation coefficient (round to the nearest 100th)
a. 0.00
b. 0.04
c. 0.07
d. 0.10
e. 0.14
18. The mean return of investment 1 is 2% and the variance is 100 percent²
The mean return of investment 2 is 50% and the variance is 10,000 percent²
According to the standard deviation which stock is riskier?
a. Stock 1
b. Stock 2
c. They are equally risky
d. Neither of the stocks has any risk
19. The mean return of investment 1 is 2% and the variance is 100 percent²
The mean return of investment 2 is 50% and the variance is 10,000 percent²
According to the coefficient of variation which stock is riskier?
a. Stock 1
b. Stock 2
c. They are equally risky
d. Neither of the stocks has any risk
Questions from book:
Page 84: 8
Page 92: 18
Page 112: 47 & 48
Page 119: 58
ANSWERS
(Some of the Answers may be incorrect; let me know if you think you found an incorrect answer)
1.
a) 25.91
b) 5, 6, 9
c) 6
d) 6 and above
e) right skewed (mean>median)
f) 199
g) 3074.45 (remember it is a population)
h) 55.45
i) 214
2.
a) 517.17
b) none
c) 461
d) right
e) 1219
f) 230,640.17 (remember it is a sample)
g) 480.25
h) 93
3.
a) 17.15 MPH
b) 15-20 MPH
c) 15-20 MPH
d) 18.44 MPH²
e) 4.29
4. It is easier to work with relative frequency, so add a column:
Yield %               Cumulative Relative Frequency    Relative Frequency
-10% but under -5%    0.10                             0.10
-5% but under 0%      0.20                             0.10
0% but under 5%       0.45                             0.25
5% but under 10%      0.60                             0.15
10% but under 15%     0.85                             0.25
15% but under 20%     1.00                             0.15
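a) 6.5%
b) 0-5% or 10-15% (both classes have the highest relative frequency, 0.25)
c) 5-10% (the first class with a cumulative relative frequency of 50% or more)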
5.
a) left skewed
b) mean < median
6. a) right skewed
b) mean > median
7. It is 1.6 standard deviations above the mean
8.
a) -945.48
b) - 0.76
9.
a) 62.83
b) 0.80
10. One variable
11. Two variables
12. C
13. D
14. D
15. A
16. C
17. E
18. B
19. A