01/16/2008

Univariate Descriptive Statistics
Chapter 2
Lecture Overview
Tabular and Graphical Techniques
Distributions
Measures of Central Tendency
Measures of Dispersion
Tabular and Graphical Techniques
Frequency Tables
– Ungrouped
– Grouped
Histograms
Cumulative Frequency Histogram
Frequency Tables
Bin    Frequency
170    3
180    7
190    8
200    9
210    12
220    6
230    6
240    4
250    2
260    3
Histograms
[Figure: histogram titled "Burglary Frequency"; x axis: bins 170 to 260 in steps of 10; y axis: frequency, 0 to 15]
Note: sometimes percent is on the Y axis rather than frequency
Cumulative Frequency Histograms
[Figure: cumulative frequency histogram titled "Burglary Frequency"; x axis: bins 170 to 260 in steps of 10; y axis: cumulative percent, 0% to 100%]
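As a rough sketch of how the percent and cumulative percent axes relate to the frequency table, the bins and counts from the Frequency Tables slide can be accumulated as follows. The variable names are illustrative and not part of the lecture.

```python
# Bins and frequencies copied from the Frequency Tables slide.
bins = [170, 180, 190, 200, 210, 220, 230, 240, 250, 260]
freq = [3, 7, 8, 9, 12, 6, 6, 4, 2, 3]

total = sum(freq)  # total number of observations

# Running total of the frequencies, one entry per bin.
cumulative = []
running = 0
for f in freq:
    running += f
    cumulative.append(running)

# Percent of observations in each bin, and cumulative percent up to each bin.
percent = [100 * f / total for f in freq]
cumulative_percent = [100 * c / total for c in cumulative]

for b, f, p, cp in zip(bins, freq, percent, cumulative_percent):
    print(f"{b:>4} {f:>4} {p:6.2f}% {cp:7.2f}%")
```

The last cumulative percent is always 100%, which is why the cumulative histogram climbs to the top of its axis.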
Key Concepts
Choosing Intervals (i.e., choosing your “bins”)
Rules from the textbook (pages 38 – 39)
Commonly Used Examples from GIS
– Equal Interval
– Quantiles (e.g., quartiles and quintiles)
– Natural Breaks
– Standard Deviation
Rules For Bin Sizes
Note: This is very relevant for GIS
Rule 1: Use intervals with simple bounds
Rule 2: Respect natural breakpoints
Rule 3: Intervals should not overlap
Rule 4: Intervals should be the same width
Rule 5: Select an appropriate number of classes
The Effect of Classification
Equal Interval
– Splits data into user-specified number of classes of
equal width
– Each class has a different number of observations
The Effect of Classification
Quantiles
– Data divided so that there are an equal number of
observations in each class
– Some classes can have quite narrow intervals
The Effect of Classification
Natural Breaks
– Splits data into classes based on natural breaks
represented in the data histogram
The Effect of Classification
Standard Deviation
– Class breaks set at the mean ± one or more standard deviations
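To make the contrast between equal interval and quantile classes concrete, here is a minimal sketch. The data values and the helper names (`equal_interval_breaks`, `quantile_breaks`, `count_per_class`) are invented for illustration, and GIS packages may compute quantile breaks slightly differently from Python's `statistics.quantiles`.

```python
import statistics

# Hypothetical values, invented for illustration only.
values = [2, 3, 5, 8, 9, 12, 15, 21, 30, 40, 55, 90]

def equal_interval_breaks(data, n_classes):
    """Class boundaries that split the data range into equal-width classes."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_classes
    return [lo + k * width for k in range(n_classes + 1)]

def quantile_breaks(data, n_classes):
    """Interior cut points that put roughly equal counts in each class."""
    return statistics.quantiles(data, n=n_classes)

def count_per_class(data, breaks):
    """Count observations per class; the last class includes its upper bound."""
    counts = [0] * (len(breaks) - 1)
    for x in data:
        for k in range(len(counts)):
            last = k == len(counts) - 1
            if breaks[k] <= x < breaks[k + 1] or (last and x == breaks[-1]):
                counts[k] += 1
                break
    return counts

ei = equal_interval_breaks(values, 4)
# Add the data minimum and maximum so the quantile cut points form closed classes.
qt = [min(values)] + quantile_breaks(values, 4) + [max(values)]

print("equal interval:", ei, count_per_class(values, ei))
print("quantiles:     ", qt, count_per_class(values, qt))
```

With these skewed values the equal-width classes hold 8, 2, 1 and 1 observations, while the quartile classes hold 3 each but have very different widths.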
Key Concepts
Making sense of your histograms using
distributions
– Rectangular
– Unimodal
– Bimodal
– Multimodal
– Skew (positive and negative)
Bimodal Distribution
[Figure: histogram of a bimodal distribution; x axis: bins from 150 to 500 in steps of 25, plus "More"; y axis: frequency, 0 to 30]
Multimodal Distribution
[Figure: histogram of a multimodal distribution; x axis: bins from 150 to 975, labeled every 75; y axis: frequency, 0 to 30]
Skew
An asymmetrical distribution
Measures of Central Tendency
Measures of central tendency
– Measures of the location of the middle or the
center of a distribution
– Mean, median, mode, midrange
Definitions
Midrange
Mode
Median
– Quantiles
Mean
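A small sketch of the four measures on one data set. The scores are invented for illustration, and the midrange is taken as the usual (minimum + maximum) / 2.

```python
import statistics

# Hypothetical scores, invented for illustration only.
scores = [2, 4, 4, 6, 10]

midrange = (min(scores) + max(scores)) / 2  # halfway between the two extremes
mode = statistics.mode(scores)              # most frequent value
median = statistics.median(scores)          # middle value of the sorted data
mean = statistics.mean(scores)              # sum of the values divided by their count

print("midrange:", midrange)
print("mode:    ", mode)
print("median:  ", median)
print("mean:    ", mean)
```

The four measures need not agree: here the single large score pulls the mean and midrange above the median and mode.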
Definitions
Sample Mean
Population Mean
Description of Mean
Mean – Most commonly used measure of central tendency
Average of all observations
The sum of all the scores divided by the number of scores
Note: Assuming that each observation is equally
significant
Symbols
n : the number of observations
N : the number of elements in the whole population
Σ : this (capital sigma) is the symbol for sum
i : the index of summation; the value written below Σ (e.g., i = 1) is the starting point of a series of numbers
X : one element in our dataset, usually has a subscript (e.g., i, min, max)
x̄ : the sample mean
μ : the population mean
Summation Notation: Components
$\sum_{i=1}^{n} x_i$
– Σ indicates we are taking a sum
– n (above Σ) refers to where the sum of terms ends
– i = 1 (below Σ) refers to where the sum of terms begins
– x_i indicates what we are summing up
Mathematical Notation of Mean
The mathematical notation used most often in this course is
the summation notation
The Greek letter capital sigma is used as a shorthand way
of indicating that a sum is to be taken:
in
x
i 1
The expression is equivalent to:
i
x1  x2      xn
Summation Notation: Simplification
A summation will often be written leaving out
the upper and/or lower limits of the summation,
assuming that all of the terms available are to be
summed
i n
n
 x   x  x
i 1
i
i 1
i
i
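The equivalence of the explicit and shorthand forms can be checked with a few lines of code. This sketch uses the values 8, 4, 2, 6, 10 from the lecture's first mean example; the variable names are illustrative.

```python
# Data from the lecture's first mean example.
x = [8, 4, 2, 6, 10]
n = len(x)

# Explicit limits: i runs from 1 to n.
# Python lists are 0-based, so x_i is stored at x[i - 1].
explicit_sum = 0
for i in range(1, n + 1):
    explicit_sum += x[i - 1]

# Shorthand with the limits left out: sum every available term.
shorthand_sum = sum(x)

print(explicit_sum, shorthand_sum)
```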
Equation for Mean
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Example Mean Calculations
Example I
– Data: 8, 4, 2, 6, 10
$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{8 + 4 + 2 + 6 + 10}{5} = 6$
Example II
– Sample: 10 trees randomly selected from Battle Park
– Diameter (inches):
9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5
$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{9.8 + 10.2 + \cdots + 24.5}{10} = 14.38$
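Both examples can be reproduced with a short function. This is a minimal sketch; `sample_mean` is an illustrative name, and `math.fsum` is used only to limit floating-point rounding in the sum.

```python
import math

def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return math.fsum(values) / len(values)

# Example I data.
example_1 = [8, 4, 2, 6, 10]
# Example II data: tree diameters in inches.
example_2 = [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]

print(sample_mean(example_1))            # mean of Example I
print(round(sample_mean(example_2), 2))  # mean of Example II, two decimals
```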
Example Mean Calculations
Example III
[Figure: monthly mean temperature (°F) at Chapel Hill, NC (2001)]
Annual mean temperature (°F): $\bar{x} = 59.70$
Examples IV & V
[Figures: mean annual precipitation (mm) and mean annual temperature (°F), Chapel Hill, NC (1972-2001)]
Mean annual precipitation: 1198.10 mm
Mean annual temperature: 58.51 °F
Explanation of Mean
Advantage
– Sensitive to any change in the value of any observation
Disadvantage
– Very sensitive to outliers
#     Tree Height (m)    #     Tree Height (m)
1     5.0                6     5.3
2     6.0                7     7.1
3     7.5                8     25.4
4     8.0                9     7.5
5     4.8                10    4.5
Mean = 6.19 m without #8
Mean = 8.11 m with #8
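The effect of the outlier can be recomputed from the ten listed heights. This sketch also prints the median for comparison, as an illustration of a measure that moves much less. The helper name `mean_of` is illustrative.

```python
import math
import statistics

# Tree heights (m) from the table, keyed by tree number.
heights = {1: 5.0, 2: 6.0, 3: 7.5, 4: 8.0, 5: 4.8,
           6: 5.3, 7: 7.1, 8: 25.4, 9: 7.5, 10: 4.5}

def mean_of(values):
    """Arithmetic mean: sum of the values divided by their count."""
    values = list(values)
    return math.fsum(values) / len(values)

with_outlier = list(heights.values())
without_outlier = [h for tree, h in heights.items() if tree != 8]

mean_with = mean_of(with_outlier)
mean_without = mean_of(without_outlier)
median_with = statistics.median(with_outlier)
median_without = statistics.median(without_outlier)

print(f"mean   with #8: {mean_with:.2f} m, without #8: {mean_without:.2f} m")
print(f"median with #8: {median_with:.2f} m, without #8: {median_without:.2f} m")
```

From these values the mean is about 8.11 m with tree #8 and about 6.19 m without it, while the median only moves from 6.55 m to 6.00 m.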
Measures of Dispersion
Used to describe the data
dispersion/spread/variation/deviation
numerically
Usually used in conjunction with measures
of central tendency
Measures of variation
[Figure: two histograms of score (x axis, 0 to 90) against # of obs (y axis), one labeled "Low variation" and one labeled "High variation"]
Groups have equal means and equal n, but one varies more than the other
Definitions
Range
Mean Deviation
Variance
Standard Deviation
Coefficient of Variation
Pearson’s
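As a rough illustration of three of these measures, the sketch below uses common textbook definitions, which are assumptions here since the slide lists only the names: range = maximum minus minimum, mean deviation = average absolute deviation from the mean, and coefficient of variation = sample standard deviation divided by the mean. The data are the values 8, 4, 2, 6, 10 from the lecture's first mean example.

```python
import math

# Data from the lecture's first mean example.
data = [8, 4, 2, 6, 10]
n = len(data)
mean = sum(data) / n

# Range: distance between the largest and smallest observation.
data_range = max(data) - min(data)

# Mean deviation: average absolute distance from the mean (assumed divisor n).
mean_deviation = sum(abs(x - mean) for x in data) / n

# Sample standard deviation (divisor n - 1), needed for the coefficient of variation.
sample_sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Coefficient of variation: spread relative to the mean (unitless).
coefficient_of_variation = sample_sd / mean

print(data_range, mean_deviation, round(sample_sd, 3), round(coefficient_of_variation, 3))
```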
Symbols
s² : the sample variance
σ² : the population variance
s : the sample standard deviation
σ : the population standard deviation
Sample Variance and Standard Deviation
Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Note: as with the mean there are both sample and population
standard deviations & variances
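A minimal sketch of the sample formulas, worked on the values 8, 4, 2, 6, 10 from the lecture's first mean example. The population versions are included for comparison and assume the usual definition with divisor N. Python's `statistics.variance` is used only as a cross-check.

```python
import math
import statistics

# Data from the lecture's first mean example.
data = [8, 4, 2, 6, 10]
n = len(data)
mean = sum(data) / n

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample variance and standard deviation: divide by n - 1.
sample_variance = sum(squared_deviations) / (n - 1)
sample_sd = math.sqrt(sample_variance)

# Population variance and standard deviation: divide by N
# (treats the five values as the whole population).
population_variance = sum(squared_deviations) / n
population_sd = math.sqrt(population_variance)

print(sample_variance, round(sample_sd, 3))
print(population_variance, round(population_sd, 3))
print(statistics.variance(data))  # library cross-check of the sample variance
```

Dividing by n - 1 instead of N always gives the larger value, and the gap shrinks as the number of observations grows.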
Next Class
Read chapter 3
Work on the homework
Come with questions
Bring your laptop