Discrete Data Slides

Discrete Data
Distributions and Summary Statistics
Terms: histogram, mode, mean,
range, standard deviation, outlier
Discrete vs. Continuous Data
dis·crete adj.
1. Constituting a separate thing. See Synonyms at distinct.
2. Consisting of unconnected distinct parts.
3. Mathematics: Defined for a finite or countable set of
values; not continuous.
con·tin·u·ous adj.
1. Uninterrupted in time, sequence, substance, or extent.
See Synonyms at continual.
2. Attached together in repeated units: a continuous form
fed into a printer.
3. Mathematics: Of or relating to a line or curve that extends
without a break or irregularity.
Discrete vs. Continuous Data
discrete
Usually related to counts.
Variable values for different units often tie.
Averaging two values does not necessarily yield another
possible value.
continuous
Any value in some interval.
A tie among different units is in theory virtually
impossible. In practice, ties (due to rounding) do occur
but are infrequent.
The average of any two values is another (and different)
possible value.
Distribution
The distribution of a variable tells us what values it takes
and how often it takes those values.
MAKE A PICTURE!
For discrete quantitative data, use a relative frequency
chart / histogram* to display the distribution.
* Fundamentally these are the same thing.
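A minimal sketch of a relative frequency chart for discrete data. The sibling counts are hypothetical (they are not data from these slides), and the helper name relative_frequencies is an illustration only.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to the fraction of observations equal to it."""
    counts = Counter(values)
    n = len(values)
    return {value: counts[value] / n for value in sorted(counts)}


# Hypothetical discrete data (number of siblings); not from the slides.
siblings = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]
rel_freq = relative_frequencies(siblings)

for value, share in rel_freq.items():
    # One text "bar" per value gives a crude relative frequency chart.
    print(f"{value}: {'#' * round(share * 20):<8} {share:.0%}")
```

The bars show both what values occur and how often, which is exactly what a distribution describes.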
Left Skewed Distribution
Right Skewed Distribution
Symmetric Distribution
Outlier
outlier noun
1: something that is situated away from or
classed differently from a main or related body
2: a statistical observation that is markedly
different in value from the others of the sample
Measures of Center
Median
Half the data are above/below the median.
Not well suited to highly discrete data. More later about this.
(Sample) Mean
Sum all the data x, then divide by how many (n)
Denoted x̄ (“x bar”)
Both have the same measurement units as the data.
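As a quick illustration, here are both measures of center for the five age guesses used in the standard deviation calculation later in these slides. This is a sketch using Python's standard statistics module; the variable names are illustrative.

```python
import statistics

# The five age guesses from the standard deviation calculation in these slides.
ages = [43, 48, 42, 44, 47]

mean = sum(ages) / len(ages)      # sum all the data, then divide by n
median = statistics.median(ages)  # middle value once the data are sorted

print(f"mean = {mean}, median = {median}")
```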
Less Important Measures of Center
Midrange
Average the minimum and maximum
For highly skewed data, the midrange is often a value that is
quite atypical.
Mode
Most common value - highest proportion of occurrence
There can be 2 (or more) modes if there are ties in relative
frequencies.
Generally found by graphical inspection.
Sometimes not anywhere near any “center.”
Both have the same measurement units as the data.
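A short sketch of the midrange and mode. The data are hypothetical, chosen to be right skewed so that the midrange lands on an atypical value; the helper name midrange is an assumption for illustration.

```python
import statistics


def midrange(values):
    """Average of the minimum and maximum."""
    return (min(values) + max(values)) / 2


# Hypothetical right-skewed counts; not from the slides.
skewed = [0, 0, 0, 1, 1, 2, 3, 9]
skewed_midrange = midrange(skewed)           # pulled toward the lone 9
skewed_modes = statistics.multimode(skewed)  # most common value(s)

# A tie in relative frequency gives two modes.
tied = [1, 1, 2, 2, 3]
tied_modes = statistics.multimode(tied)

print(skewed_midrange, skewed_modes, tied_modes)
```

Here the midrange (4.5) is larger than all but one observation, while the mode (0) sits at the low end rather than near any “center.”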
Measures of spread / variation
(“Spread” and “variation” are the SAME THING.)
Range = Max – Min
In statistics, the range is a single number.
Interquartile Range
Better suited to continuous data
More later about this.
Variance / Standard Deviation
All but variance have the same measurement
units as the data.
Variance S²
Mean of the squared deviations from the mean
1. Obtain the Mean.
2. Determine, for each value, the deviation from the Mean.
3. Square each of these deviations
4. Sum these squares
5. Divide this sum by one fewer than the number of observations
to get the Variance
Measure of squared variation from the mean
Standard Deviation S
Square root of the Variance
Measure of spread / variation (from the mean)
Same measurement units as the data.
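The five steps for the variance, followed by the square root for the standard deviation, can be sketched as below. The function names are illustrative; the data are the five age guesses worked through later in these slides.

```python
import math


def sample_variance(values):
    """Sample variance, following the five steps listed above."""
    n = len(values)
    mean = sum(values) / n                    # 1. obtain the mean
    deviations = [x - mean for x in values]   # 2. deviation of each value
    squared = [d ** 2 for d in deviations]    # 3. square each deviation
    total = sum(squared)                      # 4. sum the squares
    return total / (n - 1)                    # 5. divide by one fewer than n


def sample_sd(values):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


ages = [43, 48, 42, 44, 47]
variance = sample_variance(ages)
sd = sample_sd(ages)
print(f"variance = {variance:.2f}, SD = {sd:.2f}")
```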
Comparing Means & Standard Deviations
[Figure: Age Guess for the Small Class and the Large Class; axis from 38 to 50]
Small: Mean = 41.60, SD = 2.07
Large: Mean = 44.80, SD = 2.59
Comparing Means & Standard Deviations
Large class to start: Mean = 44.80, SD = 2.59
Add a 40 and a 50… Mean = 44.86, SD = 3.58
Add a 42 and a 48… Mean = 44.86, SD = 2.73
Add 45 and 45… Mean = 44.86, SD = 2.12
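These before-and-after values can be reproduced with a short sketch. It assumes the starting data are the five large-class guesses from the standard deviation calculation in these slides; summarize is an illustrative helper.

```python
import statistics

large = [43, 48, 42, 44, 47]


def summarize(values):
    """Return (mean, sample SD), each rounded to two decimals."""
    return (round(statistics.mean(values), 2),
            round(statistics.stdev(values), 2))


base = summarize(large)
far_from_mean = summarize(large + [40, 50])   # new values far from the mean
nearer_mean = summarize(large + [42, 48])     # new values somewhat closer
at_mean = summarize(large + [45, 45])         # new values almost at the mean

for label, result in [("start", base), ("add 40, 50", far_from_mean),
                      ("add 42, 48", nearer_mean), ("add 45, 45", at_mean)]:
    print(f"{label:>11}: mean = {result[0]}, SD = {result[1]}")
```

Each added pair sums to 90, so the mean moves the same way every time; only the distance of the new values from the mean changes the SD.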
Comparing Means & Standard Deviations
[Figure: two distributions plotted on a common axis from 0 to 16]
First: Mean = 4.0, SD = 3.0
Second: Mean = 8.0, SD = 3.0
Comparing Means & Standard Deviations
[Figure: two distributions plotted on a common axis from 0 to 16]
First: Mean = 8.0, SD = 3.0
Second: Mean = 8.0, SD = 6.0
Computing Mean & Standard Deviation
Data listed by unit
1. By hand with calculator support (UGH)
2. Using your calculator’s built-in statistics
functionality
• 60 second quiz: Determine and write down the
mean and standard deviation of at most 10 data
values in under 1 minute
3. Using Excel
4. Using Minitab
Z = # of St Devs from Mean
“…within Z standard deviations of the mean…”
Determine Z × SD.
Find the values
Mean – Z × SD
&
Mean + Z × SD
This means:
“…between __________ and ______________.”
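A minimal sketch of filling in the blanks. The function name within_z is illustrative; the mean and SD are the large-class values reported in these slides, and Z = 2 is chosen only as an example.

```python
def within_z(mean, sd, z):
    """Endpoints of the interval within z standard deviations of the mean."""
    margin = z * sd                     # determine Z x SD
    return mean - margin, mean + margin


# Large class from these slides: Mean = 44.80, SD = 2.59; Z = 2 as an example.
low, high = within_z(44.80, 2.59, 2)
print(f"...between {low:.2f} and {high:.2f}.")
```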
Mean & Standard Deviation
Where the data are
In general you’ll find that about
68% of the data falls within 1 standard deviation of
the mean
95% falls within 2
nearly all falls within 3
There are exceptions.
These guidelines hold fairly precisely for data
that has a bell (Normal) shaped histogram.
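For an exactly Normal shape, the percentages can be computed rather than memorized. This sketch uses NormalDist from Python's standard statistics module; share_within is an illustrative helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(z):
    """Fraction of a Normal distribution within z SDs of its mean."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)


shares = {z: share_within(z) for z in (1, 2, 3)}
for z, share in shares.items():
    print(f"within {z} SD: {share:.1%}")
```

Real data sets are never exactly Normal, so treat these as guidelines, as the slide says.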
Range Rule of Thumb
To guess the standard deviation, take the usual
range of data and divide by four.
Most homes for sale in the Oswego City School
District are listed at prices between $50,000 and
$200,000. What would you guess for the standard
deviation of prices?
$50,000 to $200,000
Range about $200,000 – $50,000 = $150,000
Apply the RRoT…
$150,000 / 4 = $37,500
Students are asked to complete a survey online.
This assignment is made on a Monday at about
noon. The survey closes Wednesday at midnight.
Since each student’s submission is accompanied
by a time stamp, it is simple to figure how early,
relative to the deadline, each student submitted
the work.
For the data set of amount of time early, guess
the standard deviation. Give results in both days
and hours.
This assignment is made on a Monday at about
noon. The survey closes Wednesday at midnight.
That’s 2.5 days, or 60 hours. People will hand it
in between immediately (2.5 days / 60 hours
early) and at the last minute (0 early). The range
is about 2.5 days or 60 hours.
Apply the RRoT…
2.5 / 4 = 0.625 days
60 / 4 = 15 hours
These are the same: 0.625 days × 24 hours per day = 15 hours.
Consider GPAs of graduating seniors.
Guess the standard deviation.
GPAs. You can’t graduate under 2.0. All As
gives 4.0.
Min about 2.0
Max probably exactly 4.0
Range about 4.0 – 2.0 = 2.0
Apply the RRoT…
2.0 / 4 = 0.5
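The three guesses above can be reproduced with a one-line helper. The name range_rule_of_thumb is illustrative; the inputs are the usual minimum and maximum from each example.

```python
def range_rule_of_thumb(usual_min, usual_max):
    """Guess the standard deviation as (usual range) / 4."""
    return (usual_max - usual_min) / 4


home_price_sd = range_rule_of_thumb(50_000, 200_000)  # dollars
early_sd_days = range_rule_of_thumb(0, 2.5)           # days early
early_sd_hours = range_rule_of_thumb(0, 60)           # hours early
gpa_sd = range_rule_of_thumb(2.0, 4.0)                # grade points

print(home_price_sd, early_sd_days, early_sd_hours, gpa_sd)
```

The guess carries the same units as the data, so converting the days result to hours (times 24) matches the hours result.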
Example
An instructor asked students in two sections of
the same course to guess the instructor’s age.
Students in the first class (in a large lecture hall)
had no other knowledge of the instructor’s
personal life. Students in the second class (in a
small classroom) knew that the instructor was the
father of a young girl.
Variable
Guess of instructor’s age
Quantitative
Units
The students
Guess of instructor’s age varies from student to
student.
Variable
Class (or Which class?)
Categorical
Units
The students
Which class varies from student to student.
[Figure: distribution of Age_Large; axis from 28 to 54]
This is a fairly symmetric distribution.
Mode = 42
Range = 54 – 32 = 22
[Figure: distribution of Age_Large; axis from 28 to 54]
This is a symmetric distribution.
Mean = 42.0
Mode = 42
Symmetry: Typically Mean ≈ Mode (“nearly equal”)
[Figure: dotplots of Age_Large and Age_Small on matching axes from 28 to 54]
Age_Large: Mean = 42.0
Age_Small: Mean = 39.0
St Dev ≈ 22 / 4 = 5.5
[Figure: Large Class and Small Class age guesses; axis from 30 to 45]
Large Class: Mean = 40.25, St Dev = 4.33 (guess 4.25)
Small Class: Mean = 38.15, St Dev = 4.14 (guess 3.75)
Properties: Mean & Standard Deviation
They don’t really “depend” (in the usual sense)
on how much data there is. They depend on the
relative frequency (percent) of occurrence of
each value.
Adding a new unit…
Sometimes the mean will go up; sometimes down.
But on average it will stay the same.
Same for standard deviation.
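One way to see this is to double a data set so that every value keeps the same relative frequency. The sketch below reuses the five large-class guesses from these slides. It compares the sample SD (dividing by n – 1, as in these slides) with the divide-by-n version (pstdev), which is brought in here only for comparison.

```python
import statistics

ages = [43, 48, 42, 44, 47]
doubled = ages * 2   # twice as many units, identical relative frequencies

mean_before = statistics.mean(ages)
mean_after = statistics.mean(doubled)

# Divide-by-n SD depends only on the relative frequencies.
pop_sd_before = statistics.pstdev(ages)
pop_sd_after = statistics.pstdev(doubled)

# Divide-by-(n - 1) SD shifts a little, because n - 1 is not exactly doubled.
sample_sd_before = statistics.stdev(ages)
sample_sd_after = statistics.stdev(doubled)

print(mean_before, mean_after)
print(round(pop_sd_before, 3), round(pop_sd_after, 3))
print(round(sample_sd_before, 3), round(sample_sd_after, 3))
```

The mean is unchanged, and the sample SD moves only slightly, which is the sense in which these summaries don't really “depend” on how much data there is.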
Standard Deviation Calculation
Standard Deviation Calculation for the Large Section

        Age    Mean     Deviation from Mean    Deviation squared
        43     44.8     43 – 44.8 = -1.8       (-1.8)² = 3.24
        48     44.8     48 – 44.8 = +3.2       3.2² = 10.24
        42     44.8     42 – 44.8 = -2.8       (-2.8)² = 7.84
        44     44.8     44 – 44.8 = -0.8       (-0.8)² = 0.64
        47     44.8     47 – 44.8 = +2.2       2.2² = 4.84
Sums    224    224.0    0                      26.80

Mean = 224 / 5 = 44.8
Variance = 26.8 / 4 = 6.7
SD = √6.7 = 2.59
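The same calculation written as formulas, as a sketch in LaTeX. The symbols S and x̄ follow these slides; n = 5 is the number of guesses.

```latex
\[
\bar{x} = \frac{43 + 48 + 42 + 44 + 47}{5} = \frac{224}{5} = 44.8
\]
\[
S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
    = \frac{3.24 + 10.24 + 7.84 + 0.64 + 4.84}{5 - 1}
    = \frac{26.8}{4} = 6.7
\]
\[
S = \sqrt{6.7} \approx 2.59
\]
```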
The deviations from the mean sum to 0: ALWAYS – for every data set.
Sample Mean: x̄ = 44.80
Sample Standard Deviation: S = 2.59