MATLAB-instructions

Prof. Dr. J. Franke
Basic Statistics A.1
Appendix: Statistics with MATLAB
Statistics Toolbox of MATLAB - Probability distributions
Random number generators:
command ***rnd (e.g. normrnd, exprnd, ...)
generates vectors of i.i.d. (pseudo) random numbers
• simulations
• model validation (simulate from the chosen model with parameters estimated from data - graphical representations like histograms should look similar for simulated and real data)
Probability densities:
command ***pdf (e.g. normpdf, exppdf, ...)
(Cumulative) distribution functions:
command ***cdf (e.g. normcdf, expcdf, ...)
for calculating probabilities; if F(z) is the distribution function of X, then
pr(a < X ≤ b) = F(b) − F(a)
Quantiles, median:
command ***inv (e.g. norminv, expinv, ...)
the quantile function is the inverse of the distribution function
α-quantile q_α = F^{-1}(α),
median q_{0.5} = F^{-1}(1/2)
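As a small illustration of the ***cdf and ***inv commands, the following sketch computes an interval probability from the distribution function and checks that the quantile function inverts it. The parameter values (mean 1, standard deviation 2, interval (−1, 3], α = 0.9) are chosen here only for illustration and are not taken from the course.

```matlab
% Illustrative values (assumptions): X ~ N(1, 2^2), interval (a, b]
mu = 1; sigma = 2;
a = -1; b = 3;

% pr(a < X <= b) = F(b) - F(a)
prob = normcdf(b, mu, sigma) - normcdf(a, mu, sigma);

% the quantile function is the inverse of the distribution function
alpha = 0.9;
q = norminv(alpha, mu, sigma);         % 0.9-quantile
backToAlpha = normcdf(q, mu, sigma);   % reproduces alpha up to rounding

% median = 0.5-quantile; for the normal distribution it equals the mean
med = norminv(0.5, mu, sigma);
```

Since a and b lie exactly one standard deviation below and above the mean, prob is the familiar value of about 0.68.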
Expectation and variance:
command ***stat (e.g. normstat, expstat, ...)
Parameter estimates and confidence intervals:
command ***fit (e.g. normfit, expfit, ...)
by default, the interval estimates or confidence intervals have level 0.95, but that can be changed
Statistics Toolbox of MATLAB - Descriptive statistics
commands mean, median (sample mean and sample median)
store data vectors as columns! (for a matrix, these commands treat each column as a separate sample)
commands var, std (sample variance s_N^2 and sample standard deviation s_N)
commands cov, corr (sample covariance and sample correlation)
output = symmetric matrix: corr e.g. has 1s on the diagonal and,
for i ≠ j, the estimate of corr(X_i, X_j) as the (i, j)th entry
Statistics Toolbox of MATLAB - Hypothesis tests
commands ttest, ttest2 (one- and two-sample t-test)
default level is 0.05, but that can be changed
command ztest (Gauss-test)
Statistics Toolbox of MATLAB - Statistical plots
model-free visualization of data distribution:
command hist (histogram of data)
command histfit (histogram + superimposed normal density with
parameters estimated from data)
command boxplot (boxplot of data)
command ksdensity (smooth probability density estimate, smoothing parameter optimal for normal data)
command ***plot (e.g. normplot - probability plots)
Statistics Toolbox of MATLAB - Analysis of Variance
commands anova1, anova2 (one- and two-factor layout ANOVA)
MATLAB session
with comments in blue, commands in black
General remark on MATLAB random numbers: By default, MATLAB uses the same so-called seed after every start for the basic random number generator (RNG), from which all other random numbers are calculated by
transformations. So, if you start MATLAB again and use exactly the same sequence
of commands (at least those involving ***rnd functions) as in the previous session, you will get exactly the same results. RNGs are deterministic, but their output looks like realizations
of random variables and cannot be predicted if you do not know the seed.
The advantage of a fixed seed is the reproducibility of simulation sessions. If
you do not want this, you can choose different seeds, e.g. if you split a large
simulation between several MATLAB sessions. Other software
automatically uses different seeds (frequently taken from the internal clock of
your computer) for different sessions by default; there, you have to state explicitly
that you want to use the same seed from a certain point in the
session on.
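The seed behaviour described above can be tried out with a short sketch. It assumes a MATLAB release that provides the rng function, which is not part of the original session; the seed value 42 is an arbitrary choice.

```matlab
% Assumption: rng is available in the MATLAB release at hand
rng(42);                       % fix the seed (42 is arbitrary)
a = normrnd(0, 1, 3, 1);
rng(42);                       % reset to the same seed
b = normrnd(0, 1, 3, 1);
sameNumbers = isequal(a, b);   % true: same seed, same random numbers

rng('shuffle');                % seed taken from the current time
c = normrnd(0, 1, 3, 1);       % changes from run to run
```

Fixing the seed at the start of each part of a split simulation, with a different value for each part, keeps every part reproducible while avoiding identical random numbers across parts.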
help normrnd
"help commandname" provides help for any MATLAB command - in particular,
the link to the documentation explains what the inputs and outputs are
and how you can change the default settings
X=normrnd(1,2,100,1);
generates a sample of N=100 normal random variables with mean 1 and standard deviation 2, stored as a column vector of length 100 (i.e. as a 100x1
matrix)
hist(X)
plots a histogram of the N numbers in X; by default, hist uses 10 equally spaced bins, but the number of bins can be changed
according to your preferences - compare the next 2 commands
hist(X,20)
hist(X,5)
boxplot(X)
plots a boxplot of the sample stored in data vector X
normplot(X)
plots a normal probability plot; if the data are normally distributed then the
data, represented as points, should lie roughly on a straight line.
histfit(X)
plots a histogram together with a plot of the normal density fitted to the data,
i.e. with the sample mean and the sample variance of the data as parameters
[f,t]=ksdensity(X); plot(t,f)
plots an estimate of the probability density of the data without assuming a
specific model - it uses the method of kernel smoothing with a bandwidth
which is optimal in case that the data are normally distributed
X=normrnd(1,2,100,1);
generates another normal sample with the same parameters
histfit(X)
[f,t]=ksdensity(X); plot(t,f)
X=exprnd(1,100,1);
generates a sample of N=100 exponential random variables with parameter
µ = 1/λ = 1 - note that MATLAB parametrizes this distribution with the
mean µ, which is the inverse of the usual parameter λ used in the course
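The conversion between the two parametrizations can be checked with a short sketch. The value λ = 2 and the evaluation point x = 0.7 are arbitrary illustrative choices, and the formula F(x) = 1 − exp(−λx) is the standard exponential distribution function written with the rate λ.

```matlab
lambda = 2;                      % rate parameter (illustrative value)
mu = 1/lambda;                   % mean, which is what MATLAB expects

x = 0.7;                         % arbitrary evaluation point
pMatlab = expcdf(x, mu);         % toolbox: parameter is the mean mu
pRate   = 1 - exp(-lambda*x);    % same probability written with lambda

[m, v] = expstat(mu);            % expectation mu and variance mu^2
```

To simulate with a given rate λ, one would accordingly call exprnd(1/lambda, N, 1).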
histfit(X)
[f,t]=ksdensity(X); plot(t,f)
here, the density estimate is not very good around 0; one should have made
a boundary correction for the fact that the exponential density function has a
jump at 0
normplot(X)
boxplot(X)
from all the above plots one can see that the data are not normally distributed
help exprnd
help wblrnd
as for the special case of the exponential distribution (see above), MATLAB
uses a different parametrization for the Weibull distribution than the course
help normstat
[M,V]=normstat(1,2)
calculates the expectation and variance of the normal distribution with mean
1 and standard deviation 2
normcdf(3)
calculates the distribution function of the standard normal distribution at 3
normcdf(6)
in the usual MATLAB number format this is shown as 1, but it is not really
a certain event. There is a very small, but positive chance to observe also
standard normal random variables with values above 6 - this can be seen by
changing to a number format showing more digits and avoiding too much
rounding:
format long
normcdf(6)
format short
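A sketch of an alternative to switching the display format: print the value with an explicit number of digits, and compute the small upper tail probability directly. Using the symmetry of the standard normal distribution, pr(Z > 6) = pr(Z < −6), avoids subtracting a number very close to 1 from 1.

```matlab
pBelow = normcdf(6);     % very close to 1, but strictly smaller
pAbove = normcdf(-6);    % pr(Z > 6) = pr(Z < -6) by symmetry

fprintf('pr(Z <= 6) = %.15f\n', pBelow);
fprintf('pr(Z >  6) = %.4e\n', pAbove);
```

The tail probability is of the order 1e-9, i.e. small but positive.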
help normfit
[muhat,sigmahat,muci,sigmaci] = normfit(X)
sample mean, sample standard deviation, confidence interval for µ and confidence interval for σ, assuming that the data in the data vector X are normally
distributed. The default confidence level is 1−α = 0.95, but it can be changed.
Note that it may well happen that one of the true parameters does not lie
in the corresponding 0.95 confidence interval; the chance that at least one
of the two confidence intervals does not contain the true µ resp. σ is almost
10% (roughly 1 − 0.95^2 ≈ 0.0975 if the two intervals are treated as independent)!
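The statement about the two confidence intervals can be checked by a small simulation. This is only a sketch: the number of repetitions, the seed and the use of rng are assumptions made here for illustration, while the true parameters µ = 1, σ = 2 and the sample size N = 100 are those of the session.

```matlab
rng(1);                        % arbitrary seed, only for reproducibility
nRep = 2000; N = 100;          % nRep is an illustrative choice
mu = 1; sigma = 2;             % true parameters as in the session

miss = false(nRep, 1);
for r = 1:nRep
    X = normrnd(mu, sigma, N, 1);
    [~, ~, muci, sigmaci] = normfit(X);      % default level 0.95
    muOut    = mu    < muci(1)    || mu    > muci(2);
    sigmaOut = sigma < sigmaci(1) || sigma > sigmaci(2);
    miss(r)  = muOut || sigmaOut;            % at least one interval misses
end
missRate = mean(miss);         % fraction of samples with at least one miss
```

The value of missRate should come out in the neighbourhood of 0.1. A different level can be requested through the second argument, e.g. normfit(X,0.01) for 0.99 confidence intervals.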
X=normrnd(1,2,100,1);
[muhat,sigmahat,muci,sigmaci] = normfit(X)
same for another data set
X=exprnd(1,100,1);
[muhat,muci] = expfit(X)
estimate and confidence interval of the parameter µ = 1/λ of a sample of
exponential random variables
X=exprnd(1,1000,1);
[muhat,muci] = expfit(X)
same but with tenfold sample size N=1000
X=normrnd(1,2,100,1);
generating again a normal sample as before
help ztest
[h,p,ci,zval] = ztest(X,0,2)
result of the Gauss test for testing µ = 0 with known standard deviation σ0 = 2;
the default level is α = 0.05, the default alternative is two-sided, µ ≠ 0.
Output: h shows the decision (0: accept H0, 1: reject H0); p is the p-value (if
p < α, H0 is rejected); ci is a confidence interval for µ, and zval is the value
of the test statistic Z
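To see where the outputs of ztest come from, the test statistic and the two-sided p-value can be recomputed by hand. The data vector below is made up for illustration; the formulas are the usual ones for the Gauss test, Z = (mean − µ0)/(σ0/√N) and p = 2·pr(Z' ≤ −|Z|) for a standard normal Z'.

```matlab
X = [0.5; 1.2; -0.3; 2.1; 0.9];    % made-up data, for illustration only
mu0 = 0; sigma0 = 2; N = numel(X);

z = (mean(X) - mu0) / (sigma0/sqrt(N));    % test statistic Z
pManual = 2*normcdf(-abs(z));              % two-sided p-value

[h, p, ci, zval] = ztest(X, mu0, sigma0);  % toolbox result for comparison
```

Up to rounding, zval agrees with z and p with pManual; h equals 1 exactly when p is below the level 0.05.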
X=normrnd(0,2,100,1);
[h,p,ci,zval] = ztest(X,0,2)
same but with other data
help ttest
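[h,p,ci,stats] = ttest(X,0)
same as for the Gauss test but now using the t-test for unknown variance; therefore, the standard deviation σ0 = 2 is missing in the function argument; the fourth output is a structure containing the value of the test statistic (tstat), the degrees of freedom (df) and the sample standard deviation (sd)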
X=normrnd(0,2,10,1);
[h,p,ci,zval] = ztest(X,0,2)
[h,p,ci,stats] = ttest(X,0)
same as above but with much smaller sample size N=10
X=normrnd(0.5,2,10,1);
up to now, H0 : µ = 0 was true; now we gradually increase µ such that
H1 : µ ≠ 0 becomes true; we start with µ = 0.5
[h,p,ci,stats] = ttest(X,0)
X=X+0.5; [h,p,ci,stats] = ttest(X,0)
µ = 1 by adding 0.5 to all previous data
X=X+1; [h,p,ci,stats] = ttest(X,0)
µ = 2 by adding 1 to all previous data
help anova1
X=normrnd(0,1,5,10);
the data Y_kj are arranged as an m×p matrix, which requires a balanced design by
default; the numbers in column no. j correspond to the repeated observations
for the factor value j
in contrast to the notation of the course, the two indices are interchanged,
i.e. the first (row) index corresponds to repeated observations whereas the
second (column) index corresponds to the different factor values
p=10 factor values, m=5 observations for each factor value - the means are
now chosen such that they differ by 0.1 for neighbouring factor values, i.e.
they assume the values 0, 0.1, 0.2, ..., 0.9:
for i=2:10; X(:,i)=X(:,i)+(i-1)*0.1; end;
mean(X)
"mean" applied to a matrix results in a row with entries equal to the sample
means of the data in each column; here you get the estimates Ȳ•j of the
individual means µj for factor values j=1, ..., p=10
[p,table] = anova1(X)
compared to the ANOVA table in the course, MATLAB omits the last line
related to the total variability; it adds another column on the right which
contains the p-value of the test from which you can see immediately at which
levels you can reject the hypothesis of no factor effects
additionally, MATLAB provides boxplots of the subsamples corresponding to
each factor value
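The p-value reported by anova1 can be reproduced from the standard one-factor sums of squares. The following is a sketch under assumptions: the seed and the use of rng are illustrative, and the third argument 'off' is used to suppress the table and boxplot figures.

```matlab
rng(0);                                  % arbitrary seed for reproducibility
m = 5; p = 10;                           % layout as in the session
X = normrnd(0, 1, m, p);
for j = 2:p; X(:,j) = X(:,j) + (j-1)*0.1; end

colMeans  = mean(X);                     % estimates of the group means
grandMean = mean(X(:));
SSB = m * sum((colMeans - grandMean).^2);          % between factor values
SSW = sum(sum((X - repmat(colMeans, m, 1)).^2));   % within factor values
F = (SSB/(p-1)) / (SSW/(p*(m-1)));                 % F statistic
pManual = 1 - fcdf(F, p-1, p*(m-1));

pToolbox = anova1(X, [], 'off');         % 'off': no table or boxplot figure
```

With only m=5 observations per factor value and mean differences of 0.1 between neighbours, the factor effect is often not detected.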
X=normrnd(0,1,10,10);
for i=2:10; X(:,i)=X(:,i)+(i-1)*0.1; end;
same kind of data, but with m=10 repetitions for each factor value
[p,table] = anova1(X)
help anova2
X1=X;
X2=normrnd(0,1,10,10); for i=2:10; X2(:,i)=X2(:,i)+(i-1)*0.1; end;
X3=normrnd(0,1,10,10);
for i=2:10; X3(:,i)=X3(:,i)+(i-1)*0.1; end;
X=[X1; X2; X3];
The data Y_kij of a two-factor layout ANOVA form a 3-dimensional array
of numbers with 3 indices: k (number of the repeated observation for fixed factor
values), i (value of first factor, i=1, ..., p), j (value of second factor, j=1, ...,
q). For anova2, you have to enter the data as a sequence of q matrices of
size m×p stacked one above the other to form a large (q·m)×p matrix.
More precisely: for each fixed value j of the second factor, form an m×p matrix
Xj with entries Y_kij, k = 1, ..., m, i = 1, ..., p, like the data matrix in the one-factor layout. Then stack the matrices X1, ..., Xq one above the other to get
the large data matrix X which is the input to anova2. The command is of the
form: X = [X1; X2; ...; Xq];
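[p,table] = anova2(X,10)
the second argument is the number m=10 of repeated observations for each combination of factor values; without it, anova2 assumes a single observation per combination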
In this example, the first factor (columns in the ANOVA table), has the same
influence on the means as in the one-factor layout example; the second factor
(rows in the ANOVA table) has no influence on the means
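The stacking of the data matrices can be tried on a smaller layout. This is a sketch with illustrative sizes (m = 4, p = 3, q = 2) and an arbitrary seed; the number of repeated observations is passed as the second argument, and 'off' suppresses the table figure.

```matlab
rng(0);                        % arbitrary seed for reproducibility
m = 4; p = 3; q = 2;           % small illustrative layout
X1 = normrnd(0, 1, m, p);      % second factor at value j = 1
X2 = normrnd(0, 1, m, p);      % second factor at value j = 2
X  = [X1; X2];                 % (q*m) x p data matrix

% second argument: number of repeated observations per combination
[pvals, tbl] = anova2(X, m, 'off');
% pvals(1): columns (first factor), pvals(2): rows (second factor),
% pvals(3): interaction between the two factors
```

With more than one observation per combination, anova2 returns three p-values, including one for the interaction of the two factors.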