Laboratory in Oceanography:
Data and Methods
Intro to the Statistics Toolbox
MAR550, Spring 2013
Miles A. Sundermeyer
Intro to Statistics Toolbox
Statistics Toolbox/Descriptive Statistics
Measures of Central Tendency
Function Name    Description
geomean          Geometric mean
harmmean         Harmonic mean
mean             Arithmetic mean
median           50th percentile
mode             Most frequent value
trimmean         Trimmed mean (specify percent to exclude)
• Geometric mean:

  $\left( \prod_{i=1}^{n} a_i \right)^{1/n} = \sqrt[n]{a_1 a_2 \cdots a_n} = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \ln a_i \right)$

• Harmonic mean:

  $\frac{n}{\frac{1}{a_1} + \frac{1}{a_2} + \cdots + \frac{1}{a_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{a_i}}, \quad a_i > 0 \text{ for all } i$
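A quick check of these in MATLAB (a minimal sketch; the sample vector is arbitrary):

>> a = [1 2 4 8 16];
>> geomean(a)        % (1*2*4*8*16)^(1/5) = 4
>> harmmean(a)       % 5/sum(1./a), approx 2.58
>> mean(a)           % arithmetic mean = 6.2
>> median(a)         % 4
>> trimmean(a,20)    % mean after excluding the extreme 20% of the data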
Intro to Statistics Toolbox
Statistics Toolbox/Descriptive Statistics
Measures of Dispersion
Function Name    Description
iqr              Interquartile range
mad              Mean absolute deviation
moment           Central moment of all orders
range            Range
std              Standard deviation
var              Variance

• Interquartile range: difference between the 75th and 25th percentiles
• Mean absolute deviation: mean(abs(x-mean(x)))
• Moment: mean((x-mean(x)).^order) (e.g., order=2 gives the variance)
• skewness: third central moment of x, divided by the cube of its standard deviation (positive/negative skewness implies a longer right/left tail)
• kurtosis: fourth central moment of x, divided by the 4th power of its standard deviation (high kurtosis means a sharper peak and longer, fatter tails)
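For example (a minimal sketch; the data vector is arbitrary):

>> x = randn(1000,1);
>> iqr(x)                      % interquartile range
>> mad(x)                      % mean absolute deviation
>> moment(x,2)                 % 2nd central moment (variance, divisor n)
>> [std(x) var(x) range(x)]
>> [skewness(x) kurtosis(x)]   % approx 0 and 3 for a normal sample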
Intro to Statistics Toolbox
Statistics Toolbox/Descriptive Statistics
Examples of Skewness & Kurtosis:

[Figure: example distributions illustrating positive/negative skewness and low/high kurtosis]
Intro to Statistics Toolbox
Statistics Toolbox/Descriptive Statistics
Bootstrap Method
• Involves choosing random samples with replacement from a data set and analyzing each sample data set the same way as the original data set. The number of elements in each bootstrap sample equals the number of elements in the original data set. The range of sample estimates obtained provides a means of estimating the uncertainty of the quantity being estimated.
• In general, the bootstrap method can be used to compute the uncertainty of any functional calculation, provided the sample data set is 'representative' of the true distribution.
Jackknife Method
• The jackknife is similar to the bootstrap, but uses re-sampling (typically leaving out one observation at a time) to estimate the bias and variance of sample statistics.
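In MATLAB, bootstrp and jackknife resample a data set and apply a given function to every resample (a minimal sketch; the data are made up, and the percentile interval is a rough confidence estimate):

>> x = exprnd(3,100,1);                 % skewed sample data
>> bootstat = bootstrp(1000,@mean,x);   % 1000 bootstrap estimates of the mean
>> ci = prctile(bootstat,[2.5 97.5])    % rough 95% confidence limits
>> jackstat = jackknife(@mean,x);       % leave-one-out estimates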
Intro to Statistics Toolbox
Statistics Toolbox/Descriptive Statistics
Example:
Bootstrap Method for estimating uncertainty on Lagrangian Integral Time Scale
(from Sundermeyer and Price, 1998)
“Integrating the LACFs using 100 days as the upper limit of the integral of R_ii(t) in (12) gives the integral timescales τ_I(11,22) = (10.6 ± 4.8, 5.4 ± 2.8) days for the (zonal, meridional) components, where uncertainties represent 95% confidence limits estimated using a bootstrap method [e.g., Press et al., 1986].”
Intro to Statistics Toolbox
Statistics Toolbox/Statistical Visualization
Probability Distribution Plots
• Normal Probability Plots:
>> x = normrnd(10,1,25,1);   % normal sample: points fall near a straight line
>> normplot(x)
>> x = exprnd(10,100,1);     % exponential sample: strong curvature
>> normplot(x)
Intro to Statistics Toolbox
Statistics Toolbox/Statistical Visualization
Probability Distribution Plots
• Quantile-Quantile Plots:
>> x = poissrnd(10,50,1); y = poissrnd(5,100,1);   % same family, different means
>> qqplot(x,y);                                    % roughly linear
>> x = normrnd(5,1,100,1);
>> y = wblrnd(2,0.5,100,1);                        % very different distributions
>> qqplot(x,y);                                    % strong curvature
Intro to Statistics Toolbox
Statistics Toolbox/Statistical Visualization
Probability Distribution Plots
• Cumulative Distribution Plots:
>> y = evrnd(0,3,100,1);    % extreme-value random sample
>> cdfplot(y)               % empirical CDF
>> hold on
>> x = -20:0.1:10;
>> f = evcdf(x,0,3);        % theoretical CDF
>> plot(x,f,'m')
>> legend('Empirical','Theoretical','Location','NW')
Intro to Statistics Toolbox
Statistics Toolbox/Probability Distributions/Supported Distributions
Supported distributions include a wide range of:
• Continuous distributions (data)
• Continuous distributions (statistics)
• Discrete distributions
• Multivariate distributions
Function Name    Description
pdf              Probability density functions
cdf              Cumulative distribution functions
inv              Inverse cumulative distribution functions
stat             Distribution statistics functions
fit              Distribution fitting functions
like             Negative log-likelihood functions
rnd              Random number generators
http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/
http://www.mathworks.com/support/product/product.html?product=ST
Intro to Statistics Toolbox
Statistics Toolbox/Probability Distributions/Supported Distributions
Supported distributions (cont’d)
Normal (Gaussian): pdf: normpdf, pdf | cdf: normcdf, cdf | inv: norminv, icdf | stat: normstat | fit: normfit, mle, dfittool | like: normlike | rnd: normrnd, randn, random, randtool
Pearson system: fit: pearsrnd | rnd: pearsrnd
Piecewise: pdf: pdf | cdf: cdf | inv: icdf | fit: paretotails | rnd: random
Rayleigh: pdf: raylpdf, pdf | cdf: raylcdf, cdf | inv: raylinv, icdf | stat: raylstat | fit: raylfit, mle, dfittool | rnd: raylrnd, random, randtool
...
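Each family follows the same naming pattern, e.g., for the normal distribution (a minimal sketch):

>> p = normcdf(1.96,0,1)          % P(X <= 1.96) for N(0,1), approx 0.975
>> x = norminv(0.975,0,1)         % inverse CDF, approx 1.96
>> f = normpdf(0,0,1)             % density at 0, approx 0.3989
>> r = normrnd(0,1,1000,1);       % 1000 random draws
>> [muhat,sigmahat] = normfit(r)  % recover the parameters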
Intro to Statistics Toolbox
Statistics Toolbox/Probability Distributions/Supported Distributions
Supported statistics
Chi-square: pdf: chi2pdf, pdf | cdf: chi2cdf, cdf | inv: chi2inv, icdf | stat: chi2stat | rnd: chi2rnd, random, randtool
F: pdf: fpdf, pdf | cdf: fcdf, cdf | inv: finv, icdf | stat: fstat | rnd: frnd, random, randtool
Noncentral chi-square: pdf: ncx2pdf, pdf | cdf: ncx2cdf, cdf | inv: ncx2inv, icdf | stat: ncx2stat | rnd: ncx2rnd, random, randtool
Noncentral F: pdf: ncfpdf, pdf | cdf: ncfcdf, cdf | inv: ncfinv, icdf | stat: ncfstat | rnd: ncfrnd, random, randtool
Noncentral t: pdf: nctpdf, pdf | cdf: nctcdf, cdf | inv: nctinv, icdf | stat: nctstat | rnd: nctrnd, random, randtool
Student's t: pdf: tpdf, pdf | cdf: tcdf, cdf | inv: tinv, icdf | stat: tstat | rnd: trnd, random, randtool
t location-scale: fit: dfittool
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
Hypothesis Testing
• Can only disprove a hypothesis.
• null hypothesis – an assertion about a population. It is "null" in that it represents a status quo belief, such as the absence of a characteristic or the lack of an effect.
• alternative hypothesis – a contrasting assertion about the population that can be tested against the null hypothesis:
  H1: µ ≠ null hypothesis value (two-tailed test)
  H1: µ > null hypothesis value (right-tailed test)
  H1: µ < null hypothesis value (left-tailed test)
• test statistic – a random sample of the population is collected, and a test statistic is computed to characterize the sample. The statistic varies with the type of test, but its distribution under the null hypothesis must be known (or assumed).
• p-value – the probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as or more extreme than the value computed from the sample.
• significance level – a threshold of probability; a typical value of α is 0.05. If p-value < α, the test rejects the null hypothesis; if p-value > α, there is insufficient evidence to reject the null hypothesis.
• confidence interval – an estimated range of values with a specified probability of containing the true population value of a parameter.
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
Hypothesis Testing
• Hypothesis tests make assumptions about the distribution of the random variable being sampled in the data. These must be considered when choosing a test and when interpreting the results.
• The z-test (ztest) and the t-test (ttest) both assume that the data are independently sampled from a normal distribution.
• Both the z-test and the t-test are relatively robust with respect to departures from this assumption, so long as the sample size n is large enough.
• The difference between the z-test and the t-test is in the assumption about the standard deviation σ of the underlying normal distribution. A z-test assumes that σ is known; a t-test does not, and must instead estimate σ using the sample standard deviation s.
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
ztest
• The test requires σ (the standard deviation of the population) to be known.
• The z score for the z-test is calculated as:

  $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
http://www.stats4students.com/Essentials/Standard-Score/Overview.php
where:
  $\bar{x}$ is the sample mean
  μ is the mean of the population
  σ is the population standard deviation and n is the sample size
• The z-score is compared to a z-table, which contains the percent of area under the normal curve between the mean and the z-score. This table indicates whether the calculated z-score is within the realm of chance, or is so different from the mean that the sample mean is unlikely to have occurred by chance.
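In MATLAB (a minimal sketch; the data and hypothesized means are made up):

>> x = normrnd(10,1,25,1);     % sample from N(10,1) with known sigma = 1
>> [h,p,ci] = ztest(x,10,1)    % test H0: mu = 10; h = 0 -> fail to reject
>> [h,p] = ztest(x,11,1)       % h = 1 -> reject at the default alpha = 0.05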
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
ttest
• Like the z-test, except the t-test does not require σ to be known.
• The t score for the t-test is calculated as:

  $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
http://www.stats4students.com/Essentials/Standard-Score/Overview.php
where:
  $\bar{x}$ is the sample mean
  μ is the mean of the population
  s is the sample standard deviation and n is the sample size
• Under the null hypothesis that the population is distributed with mean μ, the z-statistic has a standard normal distribution, N(0,1). Under the same null hypothesis, the t-statistic has Student's t distribution with n − 1 degrees of freedom.
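In MATLAB (a minimal sketch; σ is not needed):

>> x = normrnd(10,2,25,1);    % sigma treated as unknown
>> [h,p,ci] = ttest(x,10)     % test H0: mu = 10; h = 0 -> fail to reject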
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
ttest2
• Performs a t-test of the null hypothesis that the data in vectors x and y are independent random samples from normal distributions with equal means. The variances are unknown, and may be assumed either equal or unequal.
• The score for ttest2 is calculated as:

  $t = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}}$

where:
  $\bar{x}$, $\bar{y}$ are the sample means
  $s_x^2$, $s_y^2$ are the sample variances
  n, m are the sample sizes
http://www.socialresearchmethods.net/kb/stat_t.php
• The null hypothesis is that the two samples are distributed with the same mean.
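In MATLAB (a minimal sketch; the samples are made up and need not be the same size):

>> x = normrnd(5,1,100,1);
>> y = normrnd(5.5,1,120,1);   % shifted mean
>> [h,p] = ttest2(x,y)         % h = 1 -> reject equal means (very likely here)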
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
Function         Description

ansaribradley    Ansari-Bradley test. Tests if two independent samples come from the same distribution, against the alternative that they come from distributions that have the same median and shape but different variances.

chi2gof          Chi-square goodness-of-fit test. Tests if a sample comes from a specified distribution, against the alternative that it does not come from that distribution.

dwtest           Durbin-Watson test. Tests if the residuals from a linear regression are independent, against the alternative that there is autocorrelation among them.

jbtest           Jarque-Bera test. Tests if a sample comes from a normal distribution with unknown mean and variance, against the alternative that it does not come from a normal distribution.

linhyptest       Linear hypothesis test. Tests if H*b = c for parameter estimates b with estimated covariance H and specified c, against the alternative that H*b ≠ c.

kstest           One-sample Kolmogorov-Smirnov test. Tests if a sample comes from a continuous distribution with specified parameters, against the alternative that it does not come from that distribution.

kstest2          Two-sample Kolmogorov-Smirnov test. Tests if two samples come from the same continuous distribution, against the alternative that they do not come from the same distribution.

lillietest       Lilliefors test. Tests if a sample comes from a distribution in the normal family, against the alternative that it does not come from a normal distribution.

ranksum          Wilcoxon rank sum test. Tests if two independent samples come from identical continuous distributions with equal medians, against the alternative that they do not have equal medians.

runstest         Runs test. Tests if a sequence of values comes in random order, against the alternative that the ordering is not random.

signrank         One-sample or paired-sample Wilcoxon signed rank test. Tests if a sample comes from a continuous distribution symmetric about a specified median, against the alternative that it does not have that median.

signtest         One-sample or paired-sample sign test. Tests if a sample comes from an arbitrary continuous distribution with a specified median, against the alternative that it does not have that median.

ttest            One-sample or paired-sample t-test. Tests if a sample comes from a normal distribution with unknown variance and a specified mean, against the alternative that it does not have that mean.

ttest2           Two-sample t-test. Tests if two independent samples come from normal distributions with unknown but equal (or, optionally, unequal) variances and the same mean, against the alternative that the means are unequal.

vartest          One-sample chi-square variance test. Tests if a sample comes from a normal distribution with specified variance, against the alternative that it comes from a normal distribution with a different variance.

vartest2         Two-sample F-test for equal variances. Tests if two independent samples come from normal distributions with the same variance, against the alternative that they come from normal distributions with different variances.

vartestn         Bartlett multiple-sample test for equal variances. Tests if multiple samples come from normal distributions with the same variance, against the alternative that they come from normal distributions with different variances.

ztest            One-sample z-test. Tests if a sample comes from a normal distribution with known variance and specified mean, against the alternative that it does not have that mean.
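Usage is uniform across these functions: pass the sample(s) and get back a logical reject flag h and a p-value (a minimal sketch):

>> x = normrnd(0,1,100,1);
>> [h,p] = kstest(x)       % vs. a standard normal N(0,1) by default
>> [h,p] = lillietest(x)   % normality with estimated mean and variance
>> [h,p] = jbtest(x)       % normality via skewness and kurtosis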
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
ANOVA (ANalysis Of VAriance)
• ANOVA is like a t-test among multiple (typically >2) data sets simultaneously
• t-tests can be done between two data sets, or between one set and a "true" value
• uses the F-distribution instead of the t-distribution
• assumes that all of the data sets have equal variances
One-way ANOVA is a simple special case of the linear model. The one-way ANOVA form of the model is

  $y_{ij} = \alpha_{.j} + \varepsilon_{ij}$
where:
• $y_{ij}$ is a matrix of observations; each column represents a different group.
• $\alpha_{.j}$ is a matrix whose columns are the group means. (The "dot j" notation means α applies to all rows of column j. That is, $\alpha_{ij}$ is the same for all i.)
• $\varepsilon_{ij}$ is a matrix of random disturbances.
The model assumes that the columns of y are a constant plus a random disturbance. ANOVA tests whether the constants are all the same.
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
One-way ANOVA
Example: Hogg and Ledolter bacteria counts in milk. Columns represent
different shipments, rows are bacteria counts from cartons chosen randomly
from each shipment. Do some shipments have higher counts than others?
>> load hogg
>> hogg
hogg =
    24    14    11     7    19
    15     7     9     7    24
    21    12     7     4    19
    27    17    13     7    15
    33    14    12    12    10
    23    16    18    18    20
>> [p,tbl,stats] = anova1(hogg);
>> p
p = 1.1971e-04
• The standard ANOVA table has columns for the sums of squares, degrees of freedom, mean squares (SS/df), the F statistic, and the p-value.
• The p-value comes from the F statistic and tests the hypothesis that the bacteria counts are all the same.
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
One-way ANOVA (cont’d)
• In this case the p-value is about 0.0001, a very small value. This is a strong indication
that the bacteria counts from the different shipments are not the same. An F statistic
as extreme as this would occur by chance only once in 10,000 times if the counts
were truly equal.
• The p-value returned by anova1 depends on assumptions about the random disturbances $\varepsilon_{ij}$ in the model equation. For the p-value to be correct, these disturbances need to be independent, normally distributed, and have constant variance.
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
Multiple Comparisons
• Sometimes need to determine not just whether there are differences among means, but which pairs of means are significantly different.
• In a t-test, we compute a t-statistic and compare it to a critical value. However, when testing multiple pairs: if the probability of the t-statistic exceeding the critical value is 5% for a single pair, then across 10 pairs it is much more likely that at least one will falsely exceed it.
• Can perform a multiple comparison test using the multcompare function by
supplying it with the stats output from anova1.
Example:
>> load hogg
>> [p,tbl,stats] = anova1(hogg);
>> [c,m] = multcompare(stats)
Example:
see Light_DO.m
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
Two-way ANOVA
Determine whether data from several groups have a common mean. Differs from one-way ANOVA in that the groups in two-way ANOVA have two categories of defining characteristics instead of one (e.g., think of two independent variables/dimensions).
Two-way ANOVA is again a special case of the linear model. The two-way ANOVA form of the model is

  $y_{ijk} = \mu + \alpha_{.j} + \beta_{i.} + \gamma_{ij} + \varepsilon_{ijk}$
where:
• $y_{ijk}$ is a matrix of observations (with rows i, columns j, and repetition k).
• μ is a constant matrix of the overall mean of the observations.
• $\alpha_{.j}$ is a matrix whose columns are the deviations of each observation attributable to the first independent variable. All values in a given column of $\alpha_{.j}$ are identical, and the values in each row sum to 0.
• $\beta_{i.}$ is a matrix whose rows are the deviations of each observation attributable to the second independent variable. All values in a given row of $\beta_{i.}$ are identical, and the values in each column sum to 0.
• $\gamma_{ij}$ is a matrix of interactions. Values in each row sum to 0, and values in each column sum to 0.
• $\varepsilon_{ijk}$ is a matrix of random disturbances.
The model assumes that the columns of y are a series of constants plus a random
disturbance. You want to know if the constants are all the same.
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
Two-way ANOVA
Example: Determine effect of car model and factory on the mileage rating of cars.
There are three models (columns) and two factories (rows). Data from first factory is in
first three rows, data from second factory is in last three rows. Do some cars have
different mileage than others?
>> load mileage
mileage =
    33.3000    34.5000    37.4000
    33.4000    34.8000    36.8000
    32.9000    33.8000    37.6000
    32.6000    33.4000    36.6000
    32.5000    33.7000    37.0000
    33.0000    33.9000    36.7000
>> cars = 3;
>> [p,tbl,stats] = anova2(mileage,cars);
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
Two-way ANOVA (cont’d)
• In this case the p-value for the first effect is zero to four decimal places. This
indicates that the effect of the first predictor varies from one sample to another.
An F statistic as extreme as this would occur by chance only once in 10,000
times if the samples were truly equal.
• The p-value for the second effect is 0.0039, which is also highly significant. This
indicates that the effect of the second predictor varies from one sample to another.
• There does not appear to be any interaction between the two predictors. The p-value, 0.8411, means that the observed result is quite likely (84 out of 100 times) given that there is no interaction.
• The p-values returned by anova2 depend on assumptions about the random disturbances $\varepsilon_{ijk}$ in the model equation. For the p-values to be correct, these disturbances need to be independent, normally distributed, and have constant variance.
• In addition, anova2 requires that data be balanced, which means there must be the
same number of samples for each combination of control variables. Other ANOVA
methods support unbalanced data with any number of predictors.
Intro to Statistics Toolbox
Statistics Toolbox/Regression Analysis
Linear Regression Models
• In statistics, linear regression models take the form of a summation of coefficient · (independent variable or combination of independent variables). For example:

  $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon$
• In this example, the response variable y is modeled as a combination of constant, linear, interaction, and quadratic terms formed from two predictor variables x1 and x2.
• Uncontrolled factors and experimental errors are modeled by ε. Given data on x1, x2, and y, regression estimates the model parameters $\beta_j$ (j = 0, 1, ..., 5).
• More general linear regression models represent the relationship between a continuous response y and a continuous or categorical predictor x in the form:

  $y = \beta_1 f_1(x) + \cdots + \beta_p f_p(x) + \varepsilon$
Intro to Statistics Toolbox
Statistics Toolbox/Regression Analysis
Example (system of equations):
Suppose we have a series of measurements of stream discharge and stage, measured at n
different times.
time (day) = [0 14 28 42 56 70]
stage (m) = [0.612 0.647 0.580 0.629 0.688 0.583]
discharge (m^3/s) = [0.330 0.395 0.241 0.338 0.531 0.279]
Suppose we now wish to fit a rating curve to these measurements. Let x = stage and y = discharge; then we can write this series of measurements as

  $y_i = m x_i + b, \quad i = 1, \ldots, n$

This in turn can be written as y = Xb, or:
  $\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix}$

  with dimensions $[n \times 1] = [n \times 2]\,[2 \times 1]$
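In MATLAB this least-squares fit can be done with regress (or the backslash operator). A minimal sketch using the numbers above:

>> stage = [0.612 0.647 0.580 0.629 0.688 0.583]';
>> discharge = [0.330 0.395 0.241 0.338 0.531 0.279]';
>> X = [stage ones(size(stage))];   % columns: x_i and a constant
>> b = regress(discharge,X)         % b(1) = slope m, b(2) = intercept b
>> b2 = X\discharge                 % same answer via backslash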
Intro to Statistics Toolbox
Statistics Toolbox/Regression Analysis
Example: Harmonic Analysis
• sin(θ + φ) = sin(θ)cos(φ) + cos(θ)sin(φ)
• Let A = C cos(φ), B = C sin(φ)
  => C sin(ωt + φ) = A sin(ωt) + B cos(ωt)
• Linear regression y = Xb:

  $\underbrace{\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} \sin(\omega_{M2} t_1) & \cos(\omega_{M2} t_1) & 1 \\ \sin(\omega_{M2} t_2) & \cos(\omega_{M2} t_2) & 1 \\ \vdots & \vdots & \vdots \\ \sin(\omega_{M2} t_n) & \cos(\omega_{M2} t_n) & 1 \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} A \\ B \\ \bar{u} \end{bmatrix}}_{b}$
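A minimal MATLAB sketch of this fit (t and u are hypothetical time and current records; the M2 period is taken as 12.42 hours):

>> t = (0:0.5:240)';                            % time in hours (made up)
>> w = 2*pi/12.42;                              % M2 frequency, rad/hour
>> u = 3 + 20*sin(w*t + 0.7) + randn(size(t));  % synthetic current record
>> X = [sin(w*t) cos(w*t) ones(size(t))];
>> b = X\u;                                     % b = [A; B; mean]
>> C = sqrt(b(1)^2 + b(2)^2)                    % amplitude, approx 20
>> phi = atan2(b(2),b(1))                       % phase, approx 0.7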
Intro to Statistics Toolbox
Statistics Toolbox/Regression Analysis
Example: Harmonic analysis (cont’d)
Southampton Surface Currents:
Harmonic analysis for M2, M4=2xM2, M6=3xM2 ...
[Figure: observed current speed (cm s^-1) versus time (hours) over a 24-hour tidal cycle, and power spectral density ((cm s^-1)^2) versus frequency (cycles day^-1)]

Note: tidal harmonics can cause the tidal cycle to appear asymmetric.
www.soes.soton.ac.uk/teaching/courses/oa311/tides_3.ppt
Intro to Statistics Toolbox
Statistics Toolbox/Regression Analysis
Generalized linear models (GLMs) are a flexible generalization of ordinary least squares regression. They relate the random distribution of the measured variable of the experiment (the distribution function) to the systematic (non-random) portion of the experiment (the linear predictor) through a function called the link function.
Generalized additive models (GAMs) are another extension of GLMs, in which the linear predictor η is not restricted to be linear in the covariates X but is an additive function of the $x_i$'s:

  $\eta = f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p)$

The smooth functions $f_i$ are estimated from the data. In general this requires a large number of data points and is computationally intensive.
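MATLAB's glmfit fits GLMs; for example, a Poisson regression with its default log link (a minimal sketch with made-up data):

>> x = (1:20)';
>> y = poissrnd(exp(0.2 + 0.1*x));   % synthetic Poisson counts
>> b = glmfit(x,y,'poisson')         % default link for 'poisson' is log
>> yhat = glmval(b,x,'log');         % fitted means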
Data Handling in Matlab
Useful Tidbits
• regress – performs multiple linear regression using least squares
• nlinfit – performs nonlinear least-squares regression
• glmfit – fits a generalized linear model
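For instance, nlinfit takes a model function handle and starting values (a minimal sketch; the exponential model and data are made up):

>> x = (0:0.5:5)';
>> y = 2*exp(-0.8*x) + 0.05*randn(size(x));     % synthetic decay data
>> model = @(beta,x) beta(1)*exp(-beta(2)*x);   % y = b1*exp(-b2*x)
>> beta0 = [1 1];                               % initial guess
>> beta = nlinfit(x,y,model,beta0)              % approx [2 0.8]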