Descriptive Statistics Example Dataset: Shellfish Contamination

Observed variables:
• Location, Year, Species, Species2, Method, ….
• Metals (mg/kg): Lead, Cadmium, Chromium, Copper, Mercury, Zinc
Understanding Data:
• Understand the distribution of the variables
• Find relationships among variables (trends, correlation)
• Consider the need for transformation
• Look for potentially influential observations
• Find errors that have occurred during data entry
• Test the assumptions of the statistical models that you
intend to employ
Descriptive Statistics
Numerical: size, ‘middle’, spread, …
Graphical: boxplot, histogram, …
Categorical Variables
Size: n
Proportions: p1, …, pk
Continuous Variables
Size: n
‘Middle’: arithmetic mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, or median
Spread: standard deviation s; range R = max(x) − min(x)
Quantiles
Sort the data into ascending order to obtain a sequence of order statistics x(1), x(2), …, x(n).
The p'th quantile qp is the (1+(n-1)p)'th order statistic x(1+(n-1)p) (or an average of neighbouring values if 1+(n-1)p is not an integer).
q0.25 = lower quartile, q0.5 = median, q0.75 = upper quartile
E.g. n = 11: the median is the 1+(10)(0.5) = 6th order statistic.
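The quantile rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `quantile` is made up here, and it assumes that the "average of neighbouring values" is a linear interpolation weighted by the fractional part of 1+(n-1)p.

```python
import math

def quantile(data, p):
    """Return the p'th quantile as the (1 + (n-1)p)'th order statistic.

    Assumption: when 1 + (n-1)p is not an integer, the two neighbouring
    order statistics are combined by linear interpolation.
    """
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    xs = sorted(data)                 # order statistics x(1) <= ... <= x(n)
    n = len(xs)
    pos = 1 + (n - 1) * p             # 1-based position in the sorted data
    lower = math.floor(pos)
    frac = pos - lower
    if frac == 0:
        return xs[lower - 1]          # position is an integer: take x(pos)
    # Otherwise interpolate between x(lower) and x(lower + 1).
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

# With n = 11 the median is the 6th order statistic.
print(quantile(range(1, 12), 0.5))
```

For an even sample size such as n = 4, the position 1+(3)(0.5) = 2.5 falls halfway between the 2nd and 3rd order statistics, so the function returns their plain average.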
Unlike the arithmetic mean, the median is not at all influenced by the exact value of the largest observations and so provides a resistant measure of central location.
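A small numerical check makes the resistance of the median concrete. The readings below are made-up values for illustration only, not taken from the shellfish dataset.

```python
from statistics import mean, median

# Hypothetical readings in mg/kg (invented for this illustration).
clean = [0.12, 0.15, 0.18, 0.21, 0.25]
# Same sample, but the largest reading is replaced by an extreme value.
with_extreme = clean[:-1] + [2.50]

# The mean is pulled towards the extreme value; the median does not move.
print("mean  :", mean(clean), "->", mean(with_extreme))
print("median:", median(clean), "->", median(with_extreme))
```

Only the middle order statistic enters the median, so changing the largest value leaves it untouched as long as the ordering of the middle of the sample is unchanged.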
Graphical Summaries
A picture can save a thousand
numbers ….
Boxplot (box-and-whiskers plot)
• The boxplot is a useful way of plotting the five quantiles q0, q0.25, q0.5, q0.75 and q1 of the data.
• The ends of the whiskers show the minimum and maximum of the data, whereas the edges of the box show the lower and upper quartiles and the line in the centre of the box shows the median.
• The whiskers show at a glance the behaviour of the extreme values, whereas the box edges and mid-line summarize the sample in a resistant manner.
• Strong asymmetry in the box mid-line and whiskers suggests that the distribution of the data is not symmetric.
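The five numbers drawn by a boxplot can be computed directly. The sketch below uses `statistics.quantiles` with `method="inclusive"`, which places the quartiles at position 1+(n-1)p in the sorted data with linear interpolation; the helper name `five_number_summary` is hypothetical.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (q0, q0.25, q0.5, q0.75, q1) for a sample of at least two values."""
    xs = sorted(data)
    # method="inclusive" interpolates at position 1 + (n-1)p in the sorted data.
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")
    return xs[0], q1, q2, q3, xs[-1]

print(five_number_summary(range(1, 12)))
```

Comparing the distances from the median to each quartile, and from each quartile to the corresponding extreme, gives a quick numerical check of the asymmetry that the plot shows visually.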
Modified Box-Plot
• The modified version draws the whiskers only up to 1.5*IQR beyond the quartiles.
• IQR stands for the interquartile range, which is q0.75 − q0.25.
• Points beyond the whiskers, called outliers, are plotted individually (in MINITAB using the * symbol).
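The 1.5*IQR rule can be sketched as follows. Two assumptions are made that the slide does not spell out: quartiles are computed with the "inclusive" interpolation method, and each whisker ends at the most extreme observation that still lies within 1.5*IQR of the box. The function name and the sample values are hypothetical.

```python
from statistics import quantiles

def modified_boxplot_stats(data, k=1.5):
    """Compute IQR, whisker ends and outliers for a modified boxplot."""
    xs = sorted(data)
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Observations inside the fences determine where the whiskers end.
    inside = [x for x in xs if low_fence <= x <= high_fence]
    # Observations beyond the fences are flagged and plotted individually.
    outliers = [x for x in xs if x < low_fence or x > high_fence]
    return {
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up sample with one value far above the rest.
print(modified_boxplot_stats(list(range(1, 12)) + [30]))
```

A flagged point is only a candidate for closer inspection; the rule itself says nothing about whether the value is an error.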
Time Series plots
• Useful way of seeing if there is any trend in a continuous
variable across time.
Scatter plots
• Useful way of seeing if there is any relationship between
pairs of continuous variables.
Histogram
• The range of values is divided up into a finite set of class intervals (bins). The number of observations in each bin is counted and divided by the sample size to obtain the relative frequency of occurrence, and these are then plotted as vertical bars of varying height.
• The histogram quickly reveals the location, spread, and shape of the distribution. The shape of the distribution can be unimodal (one hump), multimodal (many humps) or skewed (longer tail to left or right).
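The binning step can be sketched without any plotting library. The function below is an illustrative helper (its name is made up) that assumes equal-width bins spanning the sample minimum to the sample maximum, with the maximum placed in the last bin.

```python
def relative_frequencies(data, n_bins):
    """Return (bin_edges, relative_frequencies) for equal-width bins."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must contain at least two distinct values")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(data)
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, [c / n for c in counts]

# Made-up, right-skewed sample.
edges, freqs = relative_frequencies([1, 2, 2, 3, 3, 3, 4, 4, 5, 10], 3)
print(edges)
print(freqs)
```

The relative frequencies always sum to 1, so histograms of samples with different sizes can be compared on the same vertical scale.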
Probability Distributions
• Models for population variability
• Provide simple descriptions
• Used as basis for statistical inference
• Many different models
– discrete: categories, counts
– continuous: standard measurements
Discrete Distributions
• Described by probabilities
Example: Binomial distribution B(n, π)
20% of fish have high PCB levels, i.e. π = 0.2
How many contaminated fish are there in a group of size n?
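The question can be answered by tabulating the binomial probabilities. The sketch below uses the standard binomial formula P(X = k) = C(n, k) π^k (1 − π)^(n − k); the function name `binomial_pmf` is illustrative, and n = 10 is chosen only as an example group size.

```python
from math import comb

def binomial_pmf(k, n, pi):
    """Probability of exactly k contaminated fish in a group of size n."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)

n, pi = 10, 0.2  # group size (example choice) and proportion from the slide
for k in range(n + 1):
    print(f"P(X = {k:2d}) = {binomial_pmf(k, n, pi):.4f}")
```

The probabilities over k = 0, …, n sum to 1, and for n = 10 and π = 0.2 the most likely count is k = 2.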
Illustration:
Binomial distributions B(n, π) for n=10
Continuous Distributions
• For variables measured to an arbitrary
precision on some scale
• No probability associated with specific
values
• Histograms provide a useful lead-in …
Equal-width intervals:
• Draw boxes of height equal to frequency for
each interval
Area of each bar is proportional to frequency and relative frequency.
[Figure: frequency histogram of bwt; x-axis bwt from 40 to 190, y-axis Frequency from 0 to 300]
Probability Densities
For a population histogram, as you
• increase the number of histogram cells, and
• decrease the interval width
the histogram approaches a smooth curve
(conceptually).
This is called a probability density function, or
simply a density.
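One way to see why the limit is a density is to rescale the bars so that area, not height, carries the relative frequency. The sketch below is an illustration under stated assumptions: the data are simulated rather than real, the bins are equal-width over the sample range, and the helper name is made up.

```python
import random

def density_histogram(data, n_bins):
    """Return (bin_width, bar_heights) with heights on the density scale."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        idx = min(int((x - lo) / width), n_bins - 1)  # maximum goes in last bin
        counts[idx] += 1
    n = len(data)
    # Height = relative frequency / bin width, so each bar's area is its
    # relative frequency.
    return width, [c / (n * width) for c in counts]

rng = random.Random(0)  # fixed seed so the illustration is reproducible
sample = [rng.gauss(100, 15) for _ in range(1000)]  # simulated values

for bins in (5, 10, 20, 40):
    width, heights = density_histogram(sample, bins)
    print(bins, "bins, total area =", round(sum(h * width for h in heights), 6))
```

However many bins are used, the total area stays at 1, which is the property the smooth density curve inherits.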
Illustration
[Figure: population histograms of pop (0 to 200000) on the density scale, using 5, 10, 20 and 40 bins]
Probability Models
This smooth density curve gives us a
probability model for the population
• Take (simple) mathematical forms for these
• Allow probability calculations for the
population (areas under the density curve)
• Can be compared with the distribution of
the sample given by a histogram
Histogram with superimposed normal probability model
[Figure: density histogram of x with a normal curve overlaid; annotation: "Good agreement!"]
Normal Distribution
• Model for continuous measurements
• Bell-shaped curve that approximates a
density histogram for many types of
observations
• Single mode
• Symmetric
• Parameters:
– mean µ
– standard deviation σ (variance σ²)
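For reference, the bell-shaped curve has a standard closed form in terms of the two parameters. The slide does not state it; it is added here as the usual textbook expression for the normal density:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left( -\frac{(x-\mu)^{2}}{2\sigma^{2}} \right),
\qquad -\infty < x < \infty .
```

The curve peaks at x = µ with height 1/(σ√(2π)), so a larger σ gives a lower and wider curve.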
Effects of µ and σ
(a) Changing µ shifts the curve along the axis
(b) Increasing σ increases the spread and flattens the curve
[Figure: pairs of normal curves on an axis from 140 to 200. Panel (a): µ1 = 160, µ2 = 174, σ1 = σ2 = 6. Panel (b): µ1 = µ2 = 170, σ1 = 6, σ2 = 12.]
Understanding the standard deviation σ
(c) Probabilities and numbers of standard deviations
Shaded area = 0.683: 68% chance of falling between µ − σ and µ + σ
Shaded area = 0.954: 95% chance of falling between µ − 2σ and µ + 2σ
Shaded area = 0.997: 99.7% chance of falling between µ − 3σ and µ + 3σ
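These three areas can be checked numerically with the standard library's `statistics.NormalDist`. The helper name `prob_within` is made up for this sketch, and the values µ = 170 and σ = 6 in the last line are arbitrary example parameters.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, "standard deviation(s):", round(prob_within(k), 3))

# The result does not depend on the particular mean and standard deviation.
print(round(prob_within(2, mu=170, sigma=6), 3))
```

Because the areas depend only on the number of standard deviations, the same three figures apply to any normal distribution.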
Histogram with normal curve
[Figure: histogram of Cadmium with fitted normal curve; x-axis Cadmium, y-axis Frequency. Mean = 0.2687, StDev = 0.1633, N = 168]
Note: The approximation to a normal distribution improves after taking logarithms of the values.
Probability Plot
[Figure: normal probability plot of Cadmium with 95% CI; x-axis Cadmium, y-axis Percent. Mean = 0.2687, StDev = 0.1633, N = 168, AD = 4.539, P-Value < 0.005]
Probability Plot – Normality
[Figure: normal probability plot of C22 with 95% CI; x-axis C22, y-axis Percent. Mean = 0.04470, StDev = 1.068, N = 100, AD = 0.194, P-Value = 0.890]
Plotting by groups
[Figure: normal probability plots of Cadmium with 95% CI, one panel per SpeciesGroup (M and O); x-axis Cadmium, y-axis Percent]
SpeciesGroup  Mean    StDev    N   AD     P-Value
M             0.1656  0.08702  89  3.465  <0.005
O             0.3848  0.1510   79  0.242  0.763
Skewness
Measured by the skewness coefficient
– Negative ⇒ left skewed (tail to left)
– Zero ⇒ symmetric
– Positive ⇒ right skewed (tail to right)
Environmental data are frequently positive and skewed to the right, so that mean > median.
Variable: Cadmium, Skewness: 0.84
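The slide does not say which formula lies behind the skewness coefficient. The sketch below assumes the bias-adjusted sample skewness, n / ((n − 1)(n − 2)) · Σ((xi − x̄)/s)³, which is one common convention; other software may use a slightly different definition. The function name and the sample values are made up.

```python
from statistics import mean, median, stdev

def sample_skewness(data):
    """Bias-adjusted sample skewness (an assumed convention, see lead-in)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least three observations")
    m, s = mean(data), stdev(data)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in data)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]   # made-up sample with a long right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed), mean(right_skewed) > median(right_skewed))
```

For the right-skewed sample the coefficient is positive and the mean exceeds the median, matching the pattern described for environmental data.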
Outliers
Points which are outside the general pattern of
the data
– Recording errors
– Measurement failures
– Rogue values
– Greater variability
– Unsuspected factors
Identify, assess impact, delete?
Histogram of Copper
[Figure: histograms of Copper, one panel per SpeciesGroup (M and O); x-axis Copper from 0 to 60, y-axis Frequency from 0 to 90]
Measures of location
• Mean
– highly sensitive to outliers, skewness
• Median
– insensitive to outliers, distribution shape
• Trimmed mean
– trim 5% from each tail; calculate mean of central part
– the median is the 50% trimmed mean
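A trimmed mean is straightforward to sketch. The helper below is illustrative: it assumes the number of observations dropped from each tail is the trimming proportion times n, rounded down, and it handles only proportions below 50% (the median being the limiting case). The sample is made up.

```python
from statistics import mean, median

def trimmed_mean(data, proportion=0.05):
    """Mean of the central part after dropping `proportion` of the data from each tail."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    xs = sorted(data)
    k = int(len(xs) * proportion)      # observations removed from each tail
    return mean(xs[k:len(xs) - k])

# Made-up sample of 20 values with one extreme observation.
sample = list(range(1, 20)) + [1000]

print("mean        :", mean(sample))
print("5% trimmed  :", trimmed_mean(sample, 0.05))
print("median      :", median(sample))
```

With 20 observations, trimming 5% removes one value from each tail, which is enough to discard the single extreme observation and bring the result close to the median.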
Measures of Spread
• Range = max − min: highly sensitive to outliers
• Standard deviation: very sensitive to outliers, skewness
• Interquartile range: length of central box of boxplot
• MAD: median absolute deviation of data values from the median; robust
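The four measures can be compared side by side on a sample with and without an extreme value. This is a sketch under stated assumptions: the MAD is reported unscaled (no normal-consistency factor), quartiles use the "inclusive" interpolation method, and the function name and data are made up.

```python
from statistics import median, quantiles, stdev

def spread_measures(data):
    """Return range, standard deviation, IQR and (unscaled) MAD of a sample."""
    xs = sorted(data)
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    # MAD: median of the absolute deviations from the median.
    mad = median([abs(x - med) for x in xs])
    return {"range": xs[-1] - xs[0], "sd": stdev(xs), "iqr": q3 - q1, "mad": mad}

clean = list(range(1, 10))               # made-up sample 1, ..., 9
with_extreme = list(range(1, 9)) + [100]  # largest value replaced by 100

print(spread_measures(clean))
print(spread_measures(with_extreme))
```

In this example the range and the standard deviation increase sharply when the extreme value is introduced, while the IQR and the MAD are unchanged.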