Today: - Common problems from assignment 1. Q1,2,3,4.

Question 1. Consider the data set {15, 9, 7, 20, 4, 12, 8, 0, 31}
1f: Determine if this distribution is positively or negatively
skewed. How do you know this works?
Common question: How do we know if it’s skewed without
graphing the distribution?
First: What is skew?
A positive skew means that the most extreme values are high
values (the long tail is on the right).
Negative skew means that the most extreme values are low
values (the long tail is on the left).
Synthesis: What does this have to do with the other things we
know about this dataset (from previous parts of the question)?
- Range
- Median
- Interquartile Range
- Outliers, if any
- Mean
The mean is ‘pulled’ by extreme values (sensitive to them)
The median is not affected by extreme values (robust to them)
Since the mean is ‘pulled’ by extreme values, it will be closer to
the extreme values than the median will be.
So if Mean > Median, we have positive skew.
If Mean < Median, we have negative skew.
For {15, 9, 7, 20, 4, 12, 8, 0, 31}
Mean = (0 + 4 + 7 + 8 + 9 + 12 + 15 + 20 + 31)/9 = 11.78
Median = ½ * (9 + 1)th value, or the 5th value. Median = 9
Remember to sort the numbers lowest → highest first!
Mean = 11.78 > 9 = Median, so we have positive
skew.
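As a quick check of the arithmetic above, here is a minimal Python sketch. The function name skew_direction is made up for this illustration, and it only applies the mean-versus-median rule of thumb from these slides, not a formal skewness statistic.

```python
from statistics import mean, median

data = [15, 9, 7, 20, 4, 12, 8, 0, 31]

def skew_direction(values):
    """Rule of thumb from the slides: compare the mean to the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "none detected"

print(round(mean(data), 2))   # 11.78
print(median(data))           # 9 (median() sorts the values for us)
print(skew_direction(data))   # positive
```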
The median is not pulled by extreme values.
Consider two datasets: {1,2,3,4,5} and {1,2,3,4,99999}
The median only cares how many numbers are above/below,
in both cases the median is 3.
There are two values above the median and two below and it
doesn’t matter how far above/below.
The mean IS pulled by extremes.
Consider two datasets: {1,2,3,4,5} and {1,2,3,4,99999}
To find the mean, we get the total of the numbers and divide
by how many there are.
By changing the last number from 5 to 99,999 we change the
total a great deal. We also change the mean from 3 to about
20,002 (100,009 / 5 = 20,001.8).
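The same comparison can be run directly. This is just a sketch using Python's standard statistics module; the variable names a and b are ours.

```python
from statistics import mean, median

a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 99999]

# The median only depends on the middle position, so it does not move.
print(median(a), median(b))   # 3 3

# The mean depends on the total, so one extreme value drags it upward.
print(mean(a), mean(b))       # 3 20001.8
```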
I’m glad we solved that mystery.
Question 2 (Part C) “Find the median”
Attitude Towards Lab Meat    Coded Attitude    f
Strongly Against             1                 23
Somewhat Against             2                 10
Slightly Against             3                 2
Neutral                      4                 26
Slightly For                 5                 8
Somewhat For                 6                 17
Strongly For                 7                 14
                                               N = 100
This is a frequency table. If we were to put the coded attitudes
in a raw data set it would look like
{1,1,1,1,1,1,1,….,1,1,2,2,2,….,2,2,3,3,4,4,…. ,7,7,7,7}
“ …. “ means skipped values.
“f” stands for “Frequency”, as in “how often”. In this case, 1
has a frequency of 23, meaning that 1 appears in the data set
23 times.
N at the bottom is the sample size. There are 100 values in this
data set as a whole.
Attitude Towards Lab Meat    Coded Attitude    f      cf
Strongly Against             1                 23     23
Somewhat Against             2                 10     33
Slightly Against             3                 2      35
Neutral                      4                 26     61
Slightly For                 5                 8      69
Somewhat For                 6                 17     86
Strongly For                 7                 14     100
                                               N = 100
Since there are 100 values, the median is the ½ x (100 + 1) =
50.5th value, or between the 50th and 51st value.
We find that using the cumulative frequency (part b of
question)
By the cumulative frequency (cf), we find there are 23 values
of 1 or less, and 33 values of 2 or less.
Sorted, the 1st, 2nd, 3rd, … , 22nd, 23rd values are all 1’s, and the
24th, 25th, … , 32nd, 33rd values are all 2’s…
...and that the 36th, 37th, 38th, … , 59th, 60th, 61st values are
all 4 (Neutral).
The 50th and 51st values are in the 4 (Neutral) range, so that’s
where the median is, at Neutral.
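The cumulative-frequency lookup can be sketched in Python as follows. The helper value_at is hypothetical (made up for this example), and the sketch assumes an even N, as in this question, so the median sits between the two middle positions.

```python
from itertools import accumulate

codes = [1, 2, 3, 4, 5, 6, 7]
freqs = [23, 10, 2, 26, 8, 17, 14]

cum_freqs = list(accumulate(freqs))   # [23, 33, 35, 61, 69, 86, 100]
n = cum_freqs[-1]                     # 100

def value_at(position):
    """Return the code of the sorted value at a 1-based position."""
    for code, cf in zip(codes, cum_freqs):
        if position <= cf:
            return code
    raise ValueError("position is beyond the end of the data")

lower = value_at(n // 2)        # 50th value
upper = value_at(n // 2 + 1)    # 51st value
median_code = (lower + upper) / 2
print(median_code)              # 4.0, i.e. Neutral
```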
All good? Onward then!
Question 2 (Parts D and E)
“What assumption is needed to get the mean. Find the mean.”
Note: This opinion data is ordinal (part a). That means the
mean doesn’t work for it. The mean is only for interval data.
But…
Interval data is just ordinal data that has even spaces between
responses and no gaps.
In this question, that means we can use the mean IF we treat
this data like interval data.
We have to assume that the differences between the categories
are evenly spaced.
That is, a difference between “Strongly Against” and
“Somewhat Against” is assumed to be the same as the
difference between “Neutral” and “Slightly For”.
Both of these are a difference of one category.
If we do that, we also assume that the mean makes sense.
Finding the mean.
Attitude Towards Lab Meat    Coded Attitude    f
Strongly Against             1                 23
Somewhat Against             2                 10
Slightly Against             3                 2
Neutral                      4                 26
Slightly For                 5                 8
Somewhat For                 6                 17
Strongly For                 7                 14
                                               N = 100
The mean is the total of all the values divided by the number of
values. There are N = 100 values.
23 of those values are “1”
10 are “2” and so on.
We add up the 23 “1”s: 1 + 1 + 1 + … = 23 x 1
Then add up the “2”s: 2 + 2 + 2 + … = 10 x 2, and so on.
Total = 23x1 + 10x2 + 2x3 + 26x4 + 8x5 + 17x6 + 14x7
= 23 + 20 + 6 + 104 + 40 + 102 + 98 = 393
Mean = 393 / 100 = 3.93.
By the mean, people as a whole are very slightly against lab
grown meat. (The mean is slightly less than 4, and the lower
values are against)
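For anyone who wants to check the total, here is a small Python sketch of the same frequency-weighted mean. It relies on the same assumption as above: the codes 1 to 7 are treated as evenly spaced interval values.

```python
codes = [1, 2, 3, 4, 5, 6, 7]
freqs = [23, 10, 2, 26, 8, 17, 14]

# Each code contributes (code x frequency) to the total.
total = sum(code * f for code, f in zip(codes, freqs))
n = sum(freqs)

mean_attitude = total / n
print(total, n, mean_attitude)   # 393 100 3.93
```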
Question 3 – Crosstab of academic success vs Mother’s
age.
Child born before/after 30 * Academic Success Cross-tabulation
(Count)

Mother’s age at           Academic Success
first childbirth    Excellent   Satisfactory   Marginal   Total
Over 30             29          45             18         92
Under 30            21          55             32         108
Total               50          100            50         200
Crosstabs also show frequency (or count) data, but they
show the relationship between two variables, each of which
is ordinal or nominal.
Nominal data is just “what type is something”, like flavour
of ice cream.
Ordinal data is similar, but has a natural ordering like “OK,
good, very good, doubleplusgood.”
3d. What percentage of students are only marginally
successful?
The tricky part is identifying:
What group are we discussing? All Students.
What do we want? % that are marginal.
Only looking at one aspect (success), so use the totals.
200 students in total, 50 are marginal.
Proportion that are marginal: 50 / 200 = .25, or 25%
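A short Python sketch of the same calculation from the raw cell counts. The nested dictionary is just one convenient way to hold the crosstab; the layout is our choice, not something from the assignment.

```python
crosstab = {
    "Over 30":  {"Excellent": 29, "Satisfactory": 45, "Marginal": 18},
    "Under 30": {"Excellent": 21, "Satisfactory": 55, "Marginal": 32},
}

# Column total for Marginal and the grand total across all cells.
marginal_total = sum(row["Marginal"] for row in crosstab.values())
grand_total = sum(sum(row.values()) for row in crosstab.values())

pct_marginal = 100 * marginal_total / grand_total
print(marginal_total, grand_total, pct_marginal)   # 50 200 25.0
```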
3e. What is the ratio of students with mothers over 30 at
childbirth to those under 30?
A ratio is a comparison of the size of one group to another.
92 older mothers to 108 young mothers.
Ratio = 92 to 108, or 92/108.
Alternative answers: 46/54, or 0.85/1.
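The ratio can also be checked in Python. Note that Fraction reduces all the way to lowest terms, so it prints 23/27, which is the same ratio as 92/108 and 46/54.

```python
from fractions import Fraction

over_30, under_30 = 92, 108

ratio = Fraction(over_30, under_30)
print(ratio)                           # 23/27 (lowest terms)
print(round(over_30 / under_30, 2))    # 0.85, i.e. 0.85 to 1
```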
3f. Of the students whose mothers were under 30 at first
childbirth, what percentage have Satisfactory success or
better?
Start by identifying:
What group are we discussing? ONLY young mother students.
What do we want? % that are satisfactory or better.
Only considering one category is called conditioning. Our
analysis for this question is conditional on the mothers
being young (it doesn’t apply to all students).
Satisfactory or better means Satisfactory or Excellent.
Academic Success (UNDER 30 ONLY)
Excellent   Satisfactory   Marginal   Total
21          55             32         108
That’s 21 + 55 = 76 students out of a total 108.
Proportion = 76/108 = .704 = 70.4%
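Conditioning just means using the row total, not the grand total, as the denominator. A minimal Python sketch, with the percentage rounded to one decimal place:

```python
# Only the "Under 30" row of the crosstab: this is the conditioning step.
under_30 = {"Excellent": 21, "Satisfactory": 55, "Marginal": 32}

group_total = sum(under_30.values())
satisfactory_or_better = under_30["Excellent"] + under_30["Satisfactory"]

pct = 100 * satisfactory_or_better / group_total
print(satisfactory_or_better, group_total, round(pct, 1))   # 76 108 70.4
```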
Feelin’ refined and edified? One more problem to go.
Question 4 – Children’s Luck Scores
The histogram should have looked like this.
I would say it’s bimodal, others say it’s multimodal. Both work.
A cumulative frequency table is built by going to Analyse →
Descriptive Stats → Frequencies, and just leaving the
frequency table box checked.
Luck Score   Frequency   Percent   Cumulative Percent
0            11          20.4      20.4
1            10          18.5      38.9
2            3           5.6       44.4
3            2           3.7       48.1
5            5           9.3       57.4
6            1           1.9       59.3
7            1           1.9       61.1
8            3           5.6       66.7
9            6           11.1      77.8
10           12          22.2      100.0
Total        54          100.0
How many students reported that they were 2 or less lucky?
The cumulative percent is adding from lowest values to
highest, so 44.4% of the students reported being 2 or less
lucky.
What percentage of students said they were 8 or more lucky?
The table only shows how many are X or less lucky. However,
8 or more is the opposite of 7 or less.
So 100% – 61.1% = 38.9% are 8 or more lucky.
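To see where the cumulative percent column comes from, and to check the "opposite" trick, here is a Python sketch that rebuilds the column from the frequencies. This is a reconstruction for illustration, not the software's own output.

```python
from itertools import accumulate

scores = [0, 1, 2, 3, 5, 6, 7, 8, 9, 10]
freqs = [11, 10, 3, 2, 5, 1, 1, 3, 6, 12]
n = sum(freqs)   # 54

# Cumulative percent: running total of frequencies as a share of n.
cum_pct = {s: 100 * cf / n for s, cf in zip(scores, accumulate(freqs))}

seven_or_less = cum_pct[7]
eight_or_more = 100 - seven_or_less   # the complement of "7 or less"
print(round(seven_or_less, 1), round(eight_or_more, 1))   # 61.1 38.9
```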
Alternative method: Add up the frequencies.
11 + 10 + 3 = 24 respondents reported being 2 or less lucky.
24 out of 54 total.
24/54 = .444 = 44.4%
3 + 6 + 12 = 21 respondents reported being 8 or more lucky.
21 out of 54 total.
21/54 = .389 = 38.9%
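Both frequency-based answers can be verified with a few lines of Python. The dictionary maps each luck score to its frequency; note there is no score of 4 in the data.

```python
freq = {0: 11, 1: 10, 2: 3, 3: 2, 5: 5, 6: 1, 7: 1, 8: 3, 9: 6, 10: 12}
n = sum(freq.values())   # 54

two_or_less = sum(f for score, f in freq.items() if score <= 2)
eight_or_more = sum(f for score, f in freq.items() if score >= 8)

print(two_or_less, round(100 * two_or_less / n, 1))       # 24 44.4
print(eight_or_more, round(100 * eight_or_more / n, 1))   # 21 38.9
```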
Next time, on StatsClass:
Friday:
- Review of midterm material
- (from the assignment 2 realm)
Monday:
- Midterm (Extra office hours… most of the morning?)