Prof. Dr. J. Franke, Basic Statistics

Appendix: Statistics with MATLAB

Statistics Toolbox of MATLAB - Probability distributions

Random number generators: command ***rnd (e.g. normrnd, exprnd, ...)
generates vectors of i.i.d. (pseudo) random numbers, used for
• simulations
• model validation (simulate from the chosen model with parameters estimated from the data; graphical representations such as histograms should look similar for simulated and real data)

Probability densities: command ***pdf (e.g. normpdf, exppdf, ...)

(Cumulative) distribution functions: command ***cdf (e.g. normcdf, expcdf, ...)
for calculating probabilities; if F(z) is the distribution function of X: pr(a < X ≤ b) = F(b) − F(a)

Quantiles, median: command ***inv (e.g. norminv, expinv, ...)
the quantile function is the inverse of the distribution function:
α-quantile q_α = F^{-1}(α), median q_{0.5} = F^{-1}(1/2)

Expectation and variance: command ***stat (e.g. normstat, expstat, ...)

Parameter estimates and confidence intervals: command ***fit (e.g. normfit, expfit, ...)
by default, the interval estimates (confidence intervals) have level 0.95, but that can be changed

Statistics Toolbox of MATLAB - Descriptive statistics

commands mean, median (sample mean and sample median)
data vectors have to be columns!
commands var, std (sample variance s_N^2 and sample standard deviation s_N)
commands cov, corr (sample covariance and sample correlation)
output = symmetric matrix: corr, for example, has 1s on the diagonal and, for i ≠ j, the estimate of corr(X_i, X_j) as the (i, j)th entry

Statistics Toolbox of MATLAB - Hypothesis tests

commands ttest, ttest2 (one- and two-sample t-test)
default level is 0.05, but that can be changed
command ztest (Gauss test)

Statistics Toolbox of MATLAB - Statistical plots

model-free visualization of the data distribution:
command hist (histogram of data)
command histfit (histogram + superimposed normal density with parameters estimated from the data)
command boxplot (boxplot of data)
command ksdensity (smooth probability density estimate; the smoothing parameter is optimal for normal data)
command ***plot (e.g. normplot - probability plots)

Statistics Toolbox of MATLAB - Analysis of Variance

commands anova1, anova2 (one- and two-factor layout ANOVA)

MATLAB session with comments

General remark on MATLAB random numbers: By default, MATLAB always uses the same so-called seed after startup for the basic random number generator (RNG), from which all other random numbers are calculated by transformations. So, if you start MATLAB again and use the exact sequence of commands of the previous session (at least those involving ***rnd functions), you will get exactly the same results. RNGs are deterministic, but their output looks like realizations of random variables and cannot be predicted if you do not know the seed. The advantage of a fixed seed is the reproducibility of simulation sessions. If you do not like it, you can choose different seeds, e.g. if you run a large simulation and split it between several MATLAB sessions. Other software automatically uses different seeds for different sessions by default (frequently taken from the internal clock of your computer); there, you have to state explicitly that you always want to use the same seed from a certain point in the session on.
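The remark on seeds can be tried out directly. The following lines are a minimal sketch, assuming a MATLAB version that provides the rng command; the seed values 1 and 2 and the variable names X1, X2, X3, sameSeed are arbitrary choices for illustration and not part of the course material.

```matlab
% Sketch: reproducing and varying random numbers via the seed.
% The seeds 1 and 2 are arbitrary illustrative choices.
rng(1);                     % fix the seed of the basic generator
X1 = normrnd(1,2,5,1);      % 5 normal random numbers, mean 1, sd 2
rng(1);                     % reset to the same seed
X2 = normrnd(1,2,5,1);      % repeats the numbers stored in X1
sameSeed = isequal(X1,X2)   % logical 1: both samples are identical
rng(2);                     % a different seed, e.g. for a second session
X3 = normrnd(1,2,5,1);      % a different sample
rng('shuffle');             % seed taken from the current time
rng('default');             % back to the generator settings used at startup
```

Fixing a different seed in each session is one way to split a large simulation across sessions without repeating the same random numbers.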
    help normrnd
"help commandname" provides help for any MATLAB command; in particular, the link to the documentation explains what the inputs and outputs are and how you can change the default settings

    X=normrnd(1,2,100,1);
generates a sample of N=100 normal random variables with mean 1 and standard deviation 2, stored as a column vector of length 100 (i.e. as a 100x1 matrix)

    hist(X)
plots a histogram of the N numbers in X; by default, hist uses 10 equally spaced bins, but the number of bins can be changed according to your preferences - compare the next 2 commands
    hist(X,20)
    hist(X,5)

    boxplot(X)
plots a boxplot of the sample stored in the data vector X

    normplot(X)
plots a normal probability plot; if the data are normally distributed, then the data, represented as points, should lie roughly on a straight line

    histfit(X)
plots a histogram together with a plot of the normal density fitted to the data, i.e. with the sample mean and the sample variance of the data as parameters

    [f,t]=ksdensity(X); plot(t,f)
plots an estimate of the probability density of the data without assuming a specific model - it uses the method of kernel smoothing with a bandwidth which is optimal if the data are normally distributed

    X=normrnd(1,2,100,1);
generates another normal sample with the same parameters
    histfit(X)
    [f,t]=ksdensity(X); plot(t,f)

    X=exprnd(1,100,1);
generates a sample of N=100 exponential random variables with parameter µ = 1/λ = 1 - note that MATLAB parametrizes this distribution with the mean µ, which is the inverse of the usual parameter λ used in the course
    histfit(X)
    [f,t]=ksdensity(X); plot(t,f)
here, the density estimate is not very good around 0; one should have made a boundary correction for the fact that the exponential density function has a jump at 0
    normplot(X)
    boxplot(X)
from all the above plots one can see that the data are not normally distributed

    help exprnd
    help wblrnd
as for the special case of the exponential distribution (see above),
MATLAB uses a different parametrization for the Weibull distribution than the course

    help normstat
    [M,V]=normstat(1,2)
calculates the expectation and variance of the normal distribution with mean 1 and standard deviation 2

    normcdf(3)
calculates the distribution function of the standard normal distribution at 3

    normcdf(6)
in the usual MATLAB number format this is shown as 1, but it is not really a certain event. There is a very small but positive chance of observing standard normal random variables with values above 6 - this can be seen by changing to a number format that shows more digits and avoids too much rounding:
    format long
    normcdf(6)
    format short

    help normfit
    [muhat,sigmahat,muci,sigmaci] = normfit(X)
sample mean, sample standard deviation, confidence interval for µ and confidence interval for σ, assuming that the data in the data vector X are normally distributed. The default confidence level is 1−α = 0.95, but it can be changed. Note that it might well happen that one of the true parameters does not lie in the corresponding 0.95 confidence interval; the chance that at least one of the two confidence intervals does not contain the true µ resp. σ is almost 10%!

    X=normrnd(1,2,100,1);
    [muhat,sigmahat,muci,sigmaci] = normfit(X)
same for another data set

    X=exprnd(1,100,1);
    [muhat,muci] = expfit(X)
estimate and confidence interval of the parameter µ = 1/λ of a sample of exponential random variables

    X=exprnd(1,1000,1);
    [muhat,muci] = expfit(X)
same but with tenfold sample size N=1000

    X=normrnd(1,2,100,1);
generating again a normal sample as before

    help ztest
    [h,p,ci,zval] = ztest(X,0,2)
result of the Gauss test for testing µ = 0 with known standard deviation σ0 = 2; the default level is α = 0.05, the default alternative is two-sided, µ ≠ 0.
Output: h shows the decision (0: accept H0, 1: reject H0); p is the p-value (if p < α, H0 is rejected); ci is a confidence interval for µ, and zval is the value of the test statistic Z

    X=normrnd(0,2,100,1);
    [h,p,ci,zval] = ztest(X,0,2)
same but with other data

    help ttest
    [h,p,ci,stats] = ttest(X,0)
same as for the Gauss test but now using the t-test for unknown variance; therefore, the standard deviation σ0 = 2 is missing in the function arguments. The fourth output of ttest is a structure containing the value of the test statistic (tstat), the degrees of freedom (df) and the sample standard deviation (sd)

    X=normrnd(0,2,10,1);
    [h,p,ci,zval] = ztest(X,0,2)
    [h,p,ci,stats] = ttest(X,0)
same as above but with much smaller sample size N=10

    X=normrnd(0.5,2,10,1);
up to now, H0: µ = 0 was true; now we gradually increase µ such that H1: µ ≠ 0 becomes true; we start with µ = 0.5
    [h,p,ci,stats] = ttest(X,0)

    X=X+0.5;
    [h,p,ci,stats] = ttest(X,0)
µ = 1 by adding 0.5 to all previous data

    X=X+1;
    [h,p,ci,stats] = ttest(X,0)
µ = 2 by adding 1 to all previous data

    help anova1
    X=normrnd(0,1,5,10);
the data Y_kj are arranged as an m×p matrix, requiring a balanced design by default; the numbers in column no. j correspond to the repeated observations for the factor value j. In contrast to the notation of the course, the two indices are interchanged, i.e. the first (row) index corresponds to repeated observations whereas the second (column) index corresponds to the different factor values.
p=10 factor values, m=5 observations for each factor value - the means are now chosen such that they differ by 0.1 for neighbouring factor values, i.e.
they assume the values 0, 0.1, 0.2, ..., 0.9:
    for i=2:10; X(:,i)=X(:,i)+(i-1)*0.1; end;

    mean(X)
"mean" applied to a matrix results in a row with entries equal to the sample means of the data in each column; you get here the estimates Ȳ_•j of the individual means µ_j for factor values j=1, ..., p=10

    [p,table] = anova1(X)
compared to the ANOVA table in the course, MATLAB omits the last line related to the total variability; it adds another column on the right which contains the p-value of the test, from which you can see immediately at which levels you can reject the hypothesis of no factor effects. Additionally, MATLAB provides boxplots of the subsamples corresponding to each factor value

    X=normrnd(0,1,10,10);
    for i=2:10; X(:,i)=X(:,i)+(i-1)*0.1; end;
same kind of data, but with m=10 repetitions for each factor value
    [p,table] = anova1(X)

    help anova2
    X1=X;
    X2=normrnd(0,1,10,10); for i=2:10; X2(:,i)=X2(:,i)+(i-1)*0.1; end;
    X3=normrnd(0,1,10,10); for i=2:10; X3(:,i)=X3(:,i)+(i-1)*0.1; end;
    X=[X1; X2; X3];
The data Y_kij of a two-factor layout ANOVA form a 3-dimensional array of numbers with 3 indices: k (number of the repeated observation for fixed factor values), i (value of the first factor, i=1, ..., p), j (value of the second factor, j=1, ..., q). For anova2, you have to enter the data as a sequence of q matrices of size m×p stacked one above the other to form a large (q·m)×p matrix.
More precisely: for each fixed value j of the second factor, form an m×p matrix Xj with entries Y_kij, k = 1, ..., m, i = 1, ..., p - like the data matrix in the one-factor layout. Then stack the matrices X1, ..., Xq one above the other to get the large data matrix X which is the input to anova2. The command is of the form:
    X = [X1; X2; ...; Xq];
    [p,table] = anova2(X,m)
The second argument m is the number of repeated observations for each combination of factor values; without it, anova2 assumes one observation per cell and treats every row of X as a separate value of the second factor.
In this example, the first factor (columns in the ANOVA table) has the same influence on the means as in the one-factor layout example; the second factor (rows in the ANOVA table) has no influence on the means
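
The stacking of the data for the two-factor layout can be written compactly with a loop. The following lines are only a sketch under the sizes used in this session (p=10, q=3, m=10); the names shift, Xj, pvals and tbl are illustrative, and the call assumes the documented form anova2(y,reps,displayopt), in which the number m of repeated observations per cell is passed explicitly.

```matlab
% Sketch: build the (q*m) x p data matrix for anova2 block by block.
% Assumed sizes: p=10 values of factor 1, q=3 values of factor 2,
% m=10 repeated observations per combination of factor values.
m = 10; p = 10; q = 3;
shift = (0:p-1)*0.1;                  % column means 0, 0.1, ..., 0.9
X = [];
for j = 1:q
    % m x p block for value j of the second factor (no row effect)
    Xj = normrnd(0,1,m,p) + repmat(shift,m,1);
    X = [X; Xj];                      % stack the blocks one above the other
end
size(X)                               % the matrix has q*m = 30 rows and p = 10 columns
[pvals,tbl] = anova2(X,m,'off');      % 'off' suppresses the table figure
pvals                                 % p-values: columns, rows, interaction
```

With simulated data the p-values change from run to run, so no particular values should be expected.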