Skewness, Kurtosis & Histograms

Review of Previous Lecture
• Range
– The difference between the largest and smallest values
• Interquartile range
– The difference between the 25th and 75th percentiles
• Variance
– The sum of squares divided by the population size or the
sample size minus one
• Standard deviation
– The square root of the variance
• Z-scores
– The number of standard deviations an observation is
away from the mean
Outline of Session
• Another Measure of Dispersion
– Coefficient of Variation (CV)
• Histograms
• Skewness
• Kurtosis
• Other Descriptive Summary Measures
Measures of Dispersion – Coefficient of Variation
• Coefficient of variation (CV) measures the spread
of a set of data as a proportion of its mean.
• It is the ratio of the sample standard deviation to
the sample mean
$$CV = \frac{s}{\bar{x}} \times 100\%$$
• It is sometimes expressed as a percentage
• There is an equivalent definition for the coefficient
of variation of a population
Measures of Dispersion – Coefficient of Variation
• A standard application of the Coefficient of
Variation (CV) is to characterize the variability of
geographic variables over space or time
• The CV is often used to characterize the interannual
variability of climate variables (e.g., temperature or
precipitation) or biophysical variables (e.g., leaf area
index (LAI), biomass)
                                Chapel Hill (A)    Bend (B)
Mean                            1198.10            298.07
Standard Deviation              191.80             82.08
Coefficient of Variation (CV)   0.16 (16%)         0.28 (28%)
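The CV values in the table can be checked directly from the reported means and standard deviations. The sketch below is only an illustration: the dictionary layout and the function name `coefficient_of_variation` are assumptions made here, not part of the lecture.

```python
# Means and standard deviations copied from the Chapel Hill / Bend table.
sites = {
    "Chapel Hill (A)": {"mean": 1198.10, "sd": 191.80},
    "Bend (B)": {"mean": 298.07, "sd": 82.08},
}

def coefficient_of_variation(sd, mean):
    """Return the CV as a proportion (multiply by 100 for a percentage)."""
    return sd / mean

cv = {name: coefficient_of_variation(v["sd"], v["mean"]) for name, v in sites.items()}
for name, value in cv.items():
    print(f"{name}: CV = {value:.2f} ({value:.0%})")
```

Although Chapel Hill has the larger standard deviation, Bend has the larger CV, so its variability is greater relative to its mean.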
Coefficient of Variation (CV)
• It is a dimensionless number that can be used
to compare the amount of variability between
populations with different means
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
$$CV = \frac{s}{\bar{x}} \times 100\%$$
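As a rough sketch of how the sample variance, standard deviation and CV fit together when starting from raw observations, the following uses only the Python standard library. The precipitation values and the function names are hypothetical and chosen purely for illustration.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def cv_percent(xs):
    """Sample standard deviation expressed as a percentage of the sample mean."""
    s = math.sqrt(sample_variance(xs))
    mean = sum(xs) / len(xs)
    return s / mean * 100

# Hypothetical annual precipitation totals (mm); not data from the lecture.
annual_precip = [1100.0, 1250.0, 980.0, 1400.0, 1210.0]
print(f"s^2 = {sample_variance(annual_precip):.1f}")
print(f"s   = {math.sqrt(sample_variance(annual_precip)):.1f}")
print(f"CV  = {cv_percent(annual_precip):.1f}%")
```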
[Figures: NDVI, Temperature, Precipitation, 1990 - 2000; CV of NDVI ~ CV of temperature; CV of NDVI ~ CV of precipitation. Sources: http://www.daymet.org; Xiao & Moody, 2004]
Measures of Skewness and Kurtosis
• A fundamental task in many statistical analyses is
to characterize the location and variability of a
data set (Measures of central tendency vs.
measures of dispersion)
• Neither kind of measure tells us anything about
the shape of the distribution
• A further characterization of the data includes
skewness and kurtosis
• The histogram is an effective graphical technique
for showing both the skewness and kurtosis of a
data set
Histograms
Fig. 3. Histogram of crown width (m) measured in situ for a random sample
of Quercus robur trees in Frame Wood (n = 63; mean = 9.3 m; SD = 4.64 m).
Source: Koukoulas & Blackburn, 2005. Journal of Vegetation Science: Vol. 16, No. 5, pp. 587–596
Frequency & Distribution
• A histogram is one way to depict a frequency
distribution
• Frequency is the number of times a variable takes
on a particular value
• Note that any variable has a frequency distribution
• e.g., roll a pair of dice several times and record the
resulting values (constrained to lie between 2 and 12),
count the number of times each value occurs (the
frequency of that value), and take these counts
together to form a frequency distribution
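The dice example can be simulated in a few lines. This is a minimal sketch, assuming 1000 rolls and a fixed random seed (both arbitrary choices made here so the run is repeatable).

```python
import random
from collections import Counter

def roll_pair(rng):
    """Return the total shown by two six-sided dice."""
    return rng.randint(1, 6) + rng.randint(1, 6)

rng = random.Random(0)            # fixed seed: an arbitrary choice for repeatability
rolls = [roll_pair(rng) for _ in range(1000)]

# The frequency distribution: how many times each total (2..12) occurred.
frequency = Counter(rolls)
for total in range(2, 13):
    print(f"{total:>2}: {frequency[total]}")
```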
Frequency & Distribution
• Frequencies can be absolute (when the frequency
provided is the actual count of the occurrences) or
relative (when they are normalized by dividing the
absolute frequency by the total number of
observations, so that each lies in [0, 1])
• Relative frequencies are particularly useful if you
want to compare distributions drawn from two
different sources (i.e., when the numbers of
observations from each source differ)
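To illustrate why relative frequencies help when sample sizes differ, the sketch below normalizes two small samples of different length. The land-cover labels and both samples are hypothetical.

```python
from collections import Counter

def relative_frequencies(observations):
    """Divide each absolute frequency by the total number of observations."""
    counts = Counter(observations)
    total = len(observations)
    return {value: count / total for value, count in counts.items()}

# Two hypothetical sources with different numbers of observations.
source_a = ["forest", "forest", "urban", "water", "forest"]
source_b = ["forest", "urban", "urban", "water", "forest",
            "forest", "urban", "forest", "forest", "forest"]

rel_a = relative_frequencies(source_a)
rel_b = relative_frequencies(source_b)
for label in ("forest", "urban", "water"):
    print(f"{label:>6}: A = {rel_a[label]:.2f}, B = {rel_b[label]:.2f}")
```

Here the absolute count of "forest" differs (3 versus 6), but the relative frequency is 0.6 in both sources.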
Histograms
• We may summarize our data by constructing
histograms, which are vertical bar graphs
• A histogram is used to graphically summarize
the distribution of a data set
• A histogram divides the range of values in a data
set into intervals
• Over each interval is placed a bar whose height
represents the frequency of data values in the
interval.
Building a Histogram
• To construct a histogram, the data are first
grouped into categories
• The histogram contains one vertical bar for each
category
• The height of the bar represents the number of
observations in the category (i.e., frequency)
• It is common to note the midpoint of the category
on the horizontal axis
Building a Histogram – Example
• 1. Develop an ungrouped frequency table
– That is, we build a table that counts the number of
occurrences of each variable value from lowest to
highest:
TMI Value    Ungrouped Freq.
4.16         2
4.17         4
4.18         0
…            …
13.71        1
• We could attempt to construct a bar chart from this table,
but it would have too many bars to really be useful
Building a Histogram – Example
• 2. Construct a grouped frequency table
– Select an appropriate number of classes
Class            Frequency    Percentage
4.00 - 4.99      120          4.0
5.00 - 5.99      807          27.2
6.00 - 6.99      1411         47.6
7.00 - 7.99      407          13.7
8.00 - 8.99      87           2.9
9.00 - 9.99      33           1.1
10.00 - 10.99    17           0.6
11.00 - 11.99    22           0.7
12.00 - 12.99    43           1.4
13.00 - 13.99    19           0.6
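Grouping raw values into classes of equal width can be sketched as follows. The function name `grouped_frequencies` and the short sample are assumptions for illustration: only a few TMI values appear in the ungrouped table, so the rest are made up.

```python
import math
from collections import Counter

def grouped_frequencies(values, lower, width, n_classes):
    """Count how many values fall in each class [lower + k*width, lower + (k+1)*width)."""
    counts = Counter()
    for v in values:
        k = math.floor((v - lower) / width)
        if 0 <= k < n_classes:        # values outside the class range are ignored
            counts[k] += 1
    return [counts[k] for k in range(n_classes)]

# Hypothetical sample of TMI values, not the Pond Branch data.
tmi_sample = [4.16, 4.16, 4.17, 5.30, 6.02, 6.45, 6.99, 7.10, 13.71]
freq = grouped_frequencies(tmi_sample, lower=4.0, width=1.0, n_classes=10)

for k, count in enumerate(freq):
    low = 4.0 + k
    print(f"{low:.2f} - {low + 0.99:.2f}: {count}")
```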
Building a Histogram – Example
• 3. Plot the frequencies of each class
– All that remains is to create the bar graph
[Figure: Pond Branch TMI Histogram. x-axis: Topographic Moisture Index (a proxy for soil moisture), 4 to 16; y-axis: Percent of cells in catchment, 0 to 48]
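The grouped frequencies from step 2 can be turned into percentages and drawn as a simple text histogram. This is only a sketch using `#` characters in place of a plotted bar graph; the class labels and counts are taken from the grouped frequency table.

```python
classes = [
    "4.00 - 4.99", "5.00 - 5.99", "6.00 - 6.99", "7.00 - 7.99", "8.00 - 8.99",
    "9.00 - 9.99", "10.00 - 10.99", "11.00 - 11.99", "12.00 - 12.99", "13.00 - 13.99",
]
frequencies = [120, 807, 1411, 407, 87, 33, 17, 22, 43, 19]

total = sum(frequencies)
percentages = [100 * f / total for f in frequencies]

# One bar per class; each '#' stands for roughly one percent of the cells.
for label, pct in zip(classes, percentages):
    bar = "#" * round(pct)
    print(f"{label:>13} | {bar} {pct:.1f}%")
```

The tallest bar is the 6.00 - 6.99 class at just under 48 percent, which matches the top of the vertical axis in the figure.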
Further Moments of the Distribution
• While measures of dispersion are useful for helping
us describe the width of the distribution, they tell us
nothing about the shape of the distribution
Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA:
Macmillan College Publishing Co., p. 91.
Further Moments of the Distribution
• There are further statistics that describe the
shape of the distribution, using formulae that are
similar to those of the mean and variance
• 1st moment - Mean (describes central value)
• 2nd moment - Variance (describes dispersion)
• 3rd moment - Skewness (describes asymmetry)
• 4th moment - Kurtosis (describes peakedness)
Further Moments – Skewness
• Skewness measures the degree of asymmetry
exhibited by the data
$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$
• If skewness equals zero, the histogram is
symmetric about the mean
• Positive skewness vs negative skewness
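A minimal sketch of the skewness statistic, assuming that s is the sample standard deviation (the n - 1 form defined earlier in the lecture). The three small data sets are hypothetical.

```python
import math

def sample_sd(xs):
    """Sample standard deviation (divides by n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def skewness(xs):
    """Sum of cubed deviations from the mean, divided by n * s^3."""
    n = len(xs)
    mean = sum(xs) / n
    s = sample_sd(xs)
    return sum((x - mean) ** 3 for x in xs) / (n * s ** 3)

symmetric = [1, 2, 3, 4, 5]          # hypothetical data
right_tailed = [1, 1, 2, 2, 3, 10]   # a few high values: positive skewness
left_tailed = [10, 9, 9, 8, 8, 1]    # a few low values: negative skewness

for name, data in [("symmetric", symmetric),
                   ("right-tailed", right_tailed),
                   ("left-tailed", left_tailed)]:
    print(f"{name:>12}: skewness = {skewness(data):+.3f}")
```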
Further Moments – Skewness
Source: http://library.thinkquest.org/10030/3smodsas.htm
Further Moments – Skewness
• Positive skewness
– There are more observations below the mean
than above it
– Typically, the mean is greater than the median
• Negative skewness
– There are a small number of low observations
and a large number of high ones
– Typically, the median is greater than the mean
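The relationship between the mean and the median can be checked on two small samples. Both samples are hypothetical, and the comparison is a rule of thumb rather than a guarantee for every data set.

```python
import statistics

right_tailed = [1, 1, 2, 2, 3, 10]   # positively skewed (hypothetical)
left_tailed = [10, 9, 9, 8, 8, 1]    # negatively skewed (hypothetical)

for name, data in [("positive skew", right_tailed), ("negative skew", left_tailed)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name}: mean = {mean:.2f}, median = {median:.2f}")
```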
Further Moments – Kurtosis
• Kurtosis measures how peaked the histogram is
$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$$
• The kurtosis of a normal distribution is 0, because
the formula subtracts 3 (the value of the first term
for a normal distribution)
• Kurtosis characterizes the relative peakedness
or flatness of a distribution compared to the
normal distribution
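The kurtosis statistic can be sketched in the same way as skewness, again assuming that s is the sample standard deviation (n - 1 form). The two data sets are hypothetical: one is spread evenly, the other is concentrated at the center with two extreme values.

```python
import math

def sample_sd(xs):
    """Sample standard deviation (divides by n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def kurtosis(xs):
    """Sum of fourth-power deviations divided by n * s^4, minus 3."""
    n = len(xs)
    mean = sum(xs) / n
    s = sample_sd(xs)
    return sum((x - mean) ** 4 for x in xs) / (n * s ** 4) - 3

flat = list(range(1, 11))            # evenly spread values (hypothetical)
peaked = [0] * 18 + [-10, 10]        # mostly central values with long tails (hypothetical)

print(f"flat:   kurtosis = {kurtosis(flat):+.2f}")
print(f"peaked: kurtosis = {kurtosis(peaked):+.2f}")
```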
Further Moments – Kurtosis
• Platykurtic – When the kurtosis < 0, the
frequencies throughout the curve are closer to being
equal (i.e., the curve is flatter and wider)
• Thus, negative kurtosis indicates a relatively flat
distribution
• Leptokurtic – When the kurtosis > 0, there are
high frequencies in only a small part of the curve
(i.e., the curve is more peaked)
• Thus, positive kurtosis indicates a relatively
peaked distribution
Further Moments – Kurtosis
[Figure: platykurtic and leptokurtic curves. Source: http://www.riskglossary.com/link/kurtosis.htm]
• Kurtosis is based on the size of a distribution's
tails.
• Negative kurtosis (platykurtic) – distributions with
short tails
• Positive kurtosis (leptokurtic) – distributions with
relatively long tails
Why Do We Need Kurtosis?
• These two distributions have the same variance,
approximately the same skew, but differ markedly
in kurtosis.
Source: http://davidmlane.com/hyperstat/A53638.html
How to Graphically Summarize Data?
• Histograms
• Box plots
Functions of a Histogram
• The function of a histogram is to graphically
summarize the distribution of a data set
• The histogram graphically shows the following:
1. Center (i.e., the location) of the data
2. Spread (i.e., the scale) of the data
3. Skewness of the data
4. Kurtosis of the data
5. Presence of outliers
6. Presence of multiple modes in the data.
Functions of a Histogram
• The histogram can be used to answer the
following questions:
1. What kind of population distribution do the
data come from?
2. Where are the data located?
3. How spread out are the data?
4. Are the data symmetric or skewed?
5. Are there outliers in the data?
Source: http://www.robertluttman.com/vms/Week5/page9.htm (First three)
http://office.geog.uvic.ca/geog226/frLab1.html (Last)
Box Plots
• We can also use a box plot to graphically
summarize a data set
• A box plot represents a graphical summary of
what is sometimes called a “five-number
summary” of the distribution
– Minimum
– Maximum
– 25th percentile
– 75th percentile
– Median
• Interquartile Range (IQR)
[Figure: box plot marking the min., 25th %-ile, median, 75th %-ile, and max. Source: Rogerson, p. 8.]
Box Plots
• Example – Consider the first 9 Commodore prices
(in thousands of dollars)
6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0
• Arrange these in order of magnitude
3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5, 20.0
• The median is Q2 = 6.7 (there are 4 values on
either side)
• Q1 = 5.9 (median of the 4 smallest values)
• Q3 = 10.2 (median of the 4 largest values)
• IQR = Q3 – Q1 = 10.2 - 5.9 = 4.3
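The quartile method used in this example (the median of the values below the overall median and the median of the values above it) can be sketched as follows. The function name `quartiles_by_halves` is an assumption made here.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as medians of the lower and upper halves.

    For an odd number of values the overall median is left out of both halves.
    """
    xs = sorted(values)
    half = len(xs) // 2
    lower = xs[:half]
    upper = xs[-half:]
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

prices = [6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0]   # in thousands of dollars
q1, q2, q3 = quartiles_by_halves(prices)
print(f"Q1 = {q1:.3f}, Q2 = {q2:.3f}, Q3 = {q3:.3f}, IQR = {q3 - q1:.3f}")
```

Unrounded, Q1 is 5.895 and Q3 is 10.2375; the slide reports these rounded to 5.9 and 10.2, giving an IQR of about 4.3.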
Box Plots
Example: Table 1.1 Commuting data (Rogerson, p5)
Ranked commuting times:
5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21,
21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47
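25th percentile is represented by observation (30+1)/4 = 7.75
75th percentile is represented by observation 3(30+1)/4 = 23.25
25th percentile: 11.75 (three quarters of the way from the 7th observation, 11, to the 8th, 12)
75th percentile: 26 (the 23rd and 24th observations are both 26)
Interquartile range: 26 – 11.75 = 14.25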
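The percentile rule used for the commuting data, position p(n + 1) with linear interpolation between neighbouring ranked observations, can be sketched as below. The function name `percentile_n_plus_1` is an assumption made here. Note that other software may use a different interpolation rule and return slightly different quartiles.

```python
def percentile_n_plus_1(sorted_values, p):
    """Percentile at 1-based position p * (n + 1), interpolating linearly."""
    n = len(sorted_values)
    position = p * (n + 1)
    lower_rank = int(position)            # rank of the observation just below the position
    fraction = position - lower_rank
    if lower_rank < 1:
        return sorted_values[0]
    if lower_rank >= n:
        return sorted_values[-1]
    below = sorted_values[lower_rank - 1]
    above = sorted_values[lower_rank]
    return below + fraction * (above - below)

commuting = [5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21,
             21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47]

p25 = percentile_n_plus_1(commuting, 0.25)
p75 = percentile_n_plus_1(commuting, 0.75)
print(f"25th = {p25}, 75th = {p75}, IQR = {p75 - p25}")
```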
Other Descriptive Summary Measures
• Descriptive statistics provide an organization
and summary of a dataset
• A small number of summary measures replaces
the entirety of a dataset
• We’ll briefly talk about other simple descriptive
summary measures
Other Descriptive Summary Measures
• You're likely already familiar with some simple
descriptive summary measures
– Ratios
– Proportions
– Percentages
– Rates of Change
– Location Quotients
Other Descriptive Summary Measures
• Ratios – ratio = (# of observations in A) / (# of observations in B),
e.g., A = 6 overcast days, B = 24 mostly cloudy days gives a ratio of 6/24 = 0.25
• Proportions – Relate one part or category of
data to the entire set of observations, e.g., a box
of marbles that contains 4 yellow, 6 red, 5 blue,
and 2 green gives a yellow proportion of 4/17
colorcount = {yellow, red, blue, green}
acount = {4, 6, 5, 2}
$$\text{proportion}_i = \frac{a_i}{\sum_j a_j}$$
Other Descriptive Summary Measures
• Proportions – The sum of all proportions is 1. They
are useful for comparing two sets of data
with different sizes and category counts, e.g., a
different box of marbles gives a yellow proportion
of 2/23, and for the comparison to be reasonable
we need to know the totals for both
samples
• Percentages – Calculated as proportion × 100,
e.g., 2/23 × 100% = 8.696%. Their use should
be restricted to larger sample sizes, perhaps
20+ observations
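The marble example can be worked through in code. This is a small sketch that reuses the `colorcount` and `acount` names from the first box of marbles; the helper name `proportions` is an assumption made here.

```python
colorcount = ["yellow", "red", "blue", "green"]
acount = [4, 6, 5, 2]

def proportions(labels, counts):
    """Divide each category count by the total number of observations."""
    total = sum(counts)
    return {label: count / total for label, count in zip(labels, counts)}

props = proportions(colorcount, acount)
for color, p in props.items():
    print(f"{color:>6}: proportion = {p:.3f}, percentage = {p * 100:.1f}%")

# The second box from the text: a yellow proportion of 2/23.
yellow_pct_second_box = 2 / 23 * 100
print(f"second box, yellow: {yellow_pct_second_box:.3f}%")
```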
Other Descriptive Summary Measures
• Location Quotients – An index of relative
concentration in space: a region's share of some
quantity divided by its share of a base total (here, area)
• Example – Suppose we have a region of 1000 km²
which we subdivide into three smaller areas of 200,
300, and 500 km² (labeled A, B, & C)
• The region has an influenza outbreak with 150
cases in A, 100 in B, and 350 in C (a total of 600 flu
cases):
                      A                B                C
Proportion of Area    200/1000=0.2     300/1000=0.3     500/1000=0.5
Proportion of Cases   150/600=0.25     100/600=0.17     350/600=0.58
Location Quotient     0.25/0.2=1.25    0.17/0.3=0.57    0.58/0.5=1.17
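The location quotients for the influenza example can be reproduced with a short function. This is a sketch: the name `location_quotients` and the dictionary layout are assumptions made here. It divides the unrounded proportions, which gives about 0.56 for B. The 0.57 in the table comes from dividing the rounded proportion 0.17 by 0.3.

```python
def location_quotients(cases, areas):
    """Share of cases in each subregion divided by its share of total area."""
    total_cases = sum(cases.values())
    total_area = sum(areas.values())
    return {
        name: (cases[name] / total_cases) / (areas[name] / total_area)
        for name in areas
    }

areas = {"A": 200, "B": 300, "C": 500}     # km², from the example
cases = {"A": 150, "B": 100, "C": 350}     # influenza cases, from the example

lq = location_quotients(cases, areas)
for name, value in lq.items():
    print(f"{name}: LQ = {value:.2f}")
```

A quotient above 1 (A and C) means the subregion has more than its proportional share of cases, and a quotient below 1 (B) means it has less.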
Assignment II
• Due by Thursday (02/09/2006)
• Downloadable from Course website:
– http://www.unc.edu/courses/2006spring/geog/090/001/www/