Francis, UNL, 16 April 2015 Computational Statistics in MATLAB By: Francis Ayimiah – Nterful Department of Statistics - UNL Outline Part I Introduction What is MATLAB? The MATLAB System MATLAB Online Help MATLAB Development Environment Starting and Quitting MATLAB MATLAB Desktop and Desktop Tools Manipulating Arrays And Matrices In MATLAB MATLAB Graphics Programming With MATLAB Part II Sampling From Random Variables Sampling from the standard distributions 1 Sampling from non-standard distribution - Inverse Transform Sampling Part I 1. Introduction In this presentation, we will discuss the basics of how to use MATLAB along with the basic statistical aspects of it, including inverse transform sampling. 1.1 What Is MATLAB? MATLAB is a high-performance language for technical computing. It is an integrated computing environment for numeric computation, visualization and programming. Typical uses include - Math and computation - Algorithm development - Data acquisition - Modeling, simulation, and visualization - Scientific and engineering graphics - Application development, including graphical user interface building. MATLAB is an interactive system whose basic data element is a matrix. Programming features 2 in MATLAB are similar to those of other computer languages; examples are functions, IF statements and FOR loops. MATLAB provides GUI tools so the user can develop applications. The name MATLAB stands for matrix (MAT) laboratory (LAB) for the reason that it was originally written to provide easy access to matrix software developed by the LINPACK and EISPACK projects. Latest version of MATLAB released on February 12, 2015. MATLAB features a family of add-on applicationspecific solutions called toolboxes. Toolboxes allow you to learn and apply specialized technology. They are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems. Areas in which toolboxes are available include: 3 - Signal processing - Optimization - Communication - Control system - System identification - Neural networks - Statistics (this include) Probability distribution Descriptive statistics Hypothesis tests Cluster analysis Linear models Nonlinear models Multivariate statistics Statistical plots Design of experiment, and Statistical process control - Image processing, and many others. There are numerous individuals that offer third party toolboxes in MATLAB. These can be found at the MATLAB webpage: 4 Http://www.mathworks.com: 1.2 The MATLAB System The MATLAB system consists of five main parts. These are: 1. Development Environment – This is all the components that help you use MATLAB functions and files, see your results, and interact with data from other sources. It includes the MATLAB desktop and Command Window, a command history, an editor and debugger, browsers for viewing help, the workspace, files, and the search path. The MATLAB development environment is available for the following operating systems: - Microsoft-Window - Macintosh - UNIX / Linux 2. The MATLAB Mathematical Function Library – MATLAB contains a vast collection of computational algorithms that comprise of simple functions like sine, cosine, sum, etc, as well as 5 sophisticated functions for matrix computation and manipulations, statistical analysis, signal processing and many others. 3. The MATLAB Language – At the heart of MATLAB is a high-level programming language with matrix as its basic entity. All the usual programming features such as control flow statements, functions, data structure, input-output, and object-oriented features are available. 4. MATLAB Graphics – MATLAB has extensive tools for displaying vectors and matrices as graphs, as well as annotating and printing these graphs. Handle Graphics is available through the MATLAB Language for MATLAB programmers to build custom user interfaces and applications. 5. The MATLAB Application Program Interface (API) - This includes a library of functions that allows you to write programs that interact with MATLAB. 6 With this library, your C and Fortran programs, for example, can access MATLAB functions, access MATLAB data files, and call MATLAB as a computational engine. 1.3 MATLAB Online Help To view the online documentation, select MATLAB Help from the Help menu in MATLAB.(Further detail in appendix 1) 2. MATLAB Development Environment As discussed above, the MATLAB development environment is the set of tools and facilities that help you use MATLAB functions and files. 2.1 Starting MATLAB On Windows platforms, double-click the MATLAB shortcut icon; , on your Windows desktop to start MATLAB. On UNIX platforms, to start MATLAB, type matlab at the operating system prompt. 7 After starting MATLAB, the MATLAB desktop opens – see Figure 1 below: Figure 1: MATLAB Desktop 2.2 Quitting MATLAB To end your MATLAB session, select Exit MATLAB from the File menu in the desktop, or simply type quit in the Command Window and hit enter. To execute specified functions each time a MATLAB quits, such as saving the workspace, you can create and run a finish.m script. 2.3 MATLAB Desktop The MATLAB desktop contains a number of tools. The tools include - Command Window: Used to enter variables and run functions and M-files. You can run 8 external programs from the MATLAB Command Window. - Command History: Statements you enter in the Command Window are logged in the Command History. So, you can view previously run statements, copy and execute selected statements in the Command History. - Help Browser: Use Help Browser to search and view documentation and demos for all MathWork products. You can also type doc or help in the Command Window to view documentation (e.g. doc plot). - Workspace Browser: This consists of the set of variables (named arrays) built up during a MATLAB session and stored in memory. You add variables to the workspace by using functions, running M-files, and loading saved workspace. - Editor/Debugger: Use the Editor/Debugger to create and debug M-files, which are programs you write to run MATLAB functions. The Editor/Debugger provides a graphical user interface for basic text editing, as well as for M-file debugging. 9 3. Manipulating Array and Matrices 3.1 Array/Matrix operations Like other computer languages, MATLAB provides high-level operators and functions for creating and manipulating arrays. Arithmetic operations on arrays are done element-by-element. This means that addition and subtraction are the same for arrays and matrices, but that multiplicative operations are different. The list of operators includes Subtraction + Addition .* Element-by-element multiplication ./ Element-by-element division .\ Element-by-element left division .^ Element-by-element power .’ Unconjugated array transpose Examples Addition and Subtraction >> a = 1:5 integers 1 to 10 % vector containing the 10 >> b = 3:7 integers 3 to 7 >> a+b = 4 6 8 10 12 % vector containing the Squaring a vector >> t = [ 1 2 3 4 5 ]; >> m = t.^2; % vector of values to square % square all the values m= 1 2 9 16 25 Reshape an array >> a=1:6 a= 123456 >> reshape(a,2,3) x 3 matrix ans = 1 3 5 2 4 6 % reshape vector a to 2 3.2 Building Tables Array operations are useful for building tables. Similar to other familiar languages, MATLAB uses column-oriented analysis for multivariate statistical data. Each column in a data set 11 represents a variable and each row an observation. The (i,j)th element is the ith observation of the jth variable. Example: Consider a data set, D, with three variables: v1, v2, and v3 (making up the columns). For 5 observations, the array is given as follows >> D = [72 134 3.2; 81 201 3.5; 69 156 7.1; 82 148 2.4; 75 170 1.2]; D = 72 134 3.2 81 201 3.5 69 156 7.1 82 148 2.4 75 170 1.2 To obtain the mean and standard deviation of each column, we have >>mu = mean (D), sigma = std (D) mu = 75.8 161.8 3.48 12 Sigma = 5.6303 25.499 2.2107 So, mean (v1) = 75.8, mean (v2) = 161.8, and mean (v3) = 3.48. Similarly, the standard deviations are std (v1) = 5.6303, std(v2) = 25.499, and std(v3) = 2.2107 Can also find correlation among variables, and many others. 4 MATLAB Graphics 4.1 2-D Plots The basic 2-D plotting function in MATLAB is plot. (Further details in appendix 2) Example 1: using the plotmatrix command >> x =randn(50,3); %Normally distributed random values >> y = x*[-1 2 1;2 0 1;1 -2 3]'; figure(1); >> plotmatrix(y) % creates a matrix of subaxes; same as plotmatrix(y,y) >> title('Matrix plot') 13 Example 2: Multiple plots in one figure % subplot (nrows,ncols,plot_number). This is multiple plots in one figure figure(2); x=0:.1:2*pi; % x vector from 0 to 2*pi, dx = 0.1 subplot(2,2,1); % plot sine function plot(x,sin(x)); title('sin(x)') subplot(2,2,2); % plot cosine function plot(x,cos(x)); title('cos(x)') subplot(2,2,3) % plot negative exponential function plot(x,exp(-x)); title('exp(-x)') %Put all above curves together to form the 4th plot subplot(2,2,4); 14 plot(x, sin(x),'k-', x, cos(x),'b+',x, exp(-x),'ro'); legend('sin(x)','cos(x)','exp(-x)') 4.2 3-D plot Example t=0:pi/10:2*pi; % a vector of t values from 0 to 2pi % in increment of pi/10 [X,Y,Z]=cylinder(4*cos(t));% returns the x-, y-,and %zcoordinates of the cylinder 15 figure(3); subplot(2,2,1); mesh(X); subplot(2,2,2); mesh(Y); subplot(2,2,3); mesh(Z); subplot(2,2,4); mesh(X,Y,Z); % mesh produces wireframe surfaces that color only the lines connecting the defining points 5. Programming With MATLAB MATLAB provides extensive programming features, with just a few mentioned here. This include 16 Flow control constructs such as if, for, while, continue, and break. Scripts and Functions, which are called Mfiles. While Scripts do not accept input argument or return output arguments, Functions can accept input arguments and return our arguments. Demonstration programs Part II 6 Sampling from Random Variables 6.1 Sampling from the standard distributions The MATLAB Statistics Toolbox supports about 20 probability distributions. For each distribution, there are 5 associated functions. These are - Probability distribution function (pdf) - Cumulative distribution function (cdf) - Inverse of the cumulative distribution function - Random number generator - Mean and variance as a function of the parameter 17 Table 6.1 lists some of the standard distributions supported by MATLAB and how to sample random values from them. The MATLAB documentation lists many more distributions that can be simulated with MATLAB. Using online resources, it is often easy to find support for a number of other common distributions. Table 6.1: Examples of MATLAB functions for evaluating probability density, cumulative density and drawing random numbers Distribution PDF Normal normpdf normcdf norm Uniform(continuous) unifpdf CDF unifcdf Random # Generation unifrnd Beta betapdf betacdf betarnd Exponential exppdf expcdf exprnd Uniform (discrete) unidpdf unidcdf unidrnd Binomial binopdf binocdf binornd Multinomial mnpdf mnrnd Poisson poisspdf poisscdf poissrnd The Statistics Toolbox has functions for computing parameter estimates and confidence intervals of these data driven distributions. 18 As an illustration for some of these functions, we can use the MATLAB code below (Code 6.2) to visualize the Normal (µ, σ) distribution where µ = 100 and σ = 15. - Let us assume that this distribution represents the observed variability of IQ coefficients in some population. - The code shows how to display the probability density and the cumulative density. - It also shows how to draw random values from this distribution and how to visualize the distribution of these random samples using the hist function. Code 6.2: MATLAB code for visualizing the normal distribution %% Explore the Normal distribution N(mu , sigma) mu = 100; % the mean sigma = 15; % the standard deviation xmin = 70; % minimum x value for pdf and cdf plot xmax = 130; % maximum x value for pdf and cdf plot n = 100; % number of points on pdf and cdf plot 19 k = 10000; % number of random draws for histogram % create a set of values ranging from xmin to xmax x = linspace( xmin , xmax , n ); p = normpdf( x , mu , sigma ); % calculate the pdf c = normcdf( x , mu , sigma ); % calculate the cdf figure( 4 ); clf; % create a new figure and clear the contents subplot( 1,3,1 ); plot( x , p , 'k' ); xlabel( 'x' ); ylabel( 'pdf' ); title('Probability Density Function' ); subplot( 1,3,2 ); plot( x , c , 'k' ); xlabel( 'x' ); ylabel( 'cdf' ); title('Cumulative Density Function' ); % draw k random numbers from a N(mu, sigma)distribution y = normrnd( mu , sigma , k , 1 ); subplot( 1,3,3 ); 20 hist( y , 20 ); xlabel( 'x' ); ylabel( 'frequency' ); title( 'Histogram of random values' ); The code produces the output shown below (Figure 6.3). Figure 6.3: Illustration of the Normal(µ, σ) distribution where µ = 100 and σ = 15 6.2 Sampling from non-standard distribution Here, we want to sample from a distribution that is not one of the standard distributions that is supported by MATLAB. Let 𝐹(𝑥); 𝑥 ∈ 𝐼𝑅; denote any cumulative distribution function (cdf) (continuous or not). (Properties of F in appendix 3) Our objective is to generate (simulate) random variables X distributed as F; that is, we want to 21 simulate a random variable X such that 𝑃(𝑋 ≤ 𝑥 ) = 𝐹 (𝑥 ); 𝑥 ∈ 𝐼𝑅. Define the generalized inverse of F, 𝐹 −1 ∶ [0; 1] → 𝐼𝑅, via 𝐹 −1 (𝑦) = min{𝑥 ∶ 𝐹(𝑥) ≥ 𝑦}; 𝑦 ∈ [0; 1]: 6.2.1 Inverse transform sampling Theorem: (Inverse Transform Method) Let ; 𝐹(𝑥) ; 𝑥 ∈ 𝐼𝑅 be the cumulative density function (cdf) of our target variable X (continuous or not). Let 𝐹 −1 (𝑦), 𝑦 ∈ 𝐼𝑅 be the inverse of this function, assuming that we can actually calculate this inverse. Define 𝑋 = 𝐹 −1 (𝑈), where 𝑈 has the continuous uniform distribution over the interval (0; 1). Then X is distributed as F, that is, 𝑃(𝑋 ≤ 𝑥 ) = 𝐹 (𝑥 ), 𝑥 ∈ 𝐼𝑅. Proof of theorem is in appendix 4. Therefore, in order to generate a random variable X~F, we can generate U according to U(0,1) and then make the transformation X=F − 1 (U) 22 The algorithm is 1. Draw U ∼ Uniform (0, 1) 2. Set X = F−1 (U) Example 1: 𝐸𝑥𝑝𝑜𝑛𝑒𝑛𝑡𝑖𝑎𝑙: 𝑓(𝑥) = 1 𝜆 𝑒 1 𝜆 − 𝑥 Suppose we want to sample random numbers from the exponential distribution. When 𝜆 > 0, the cumulative density function is 𝐹 (𝑥|𝜆) = 1 − 𝑒𝑥𝑝(−𝑥/𝜆). Using some simple algebra, one can find the inverse of this function, which is 𝐹 −1 (𝑢|𝜆) = −𝑙𝑜𝑔(1 − 𝑢)𝜆. This leads to the following sampling procedure to sample random numbers from an Exponential (λ) distribution: 1. Draw 𝑈 ∼ 𝑈𝑛𝑖𝑓𝑜𝑟𝑚(0, 1) 2. Set 𝑋 = −𝑙𝑜𝑔(1 − 𝑈)𝜆 Code 6.4: MATLAB code for inverse transform sampling from exponential (𝜆 = 2) seed=12; rand('state',seed); r=1000; % Let's take r samples u=unifrnd(0,1,r,1); %Uniform (0,1);or can use rand figure(5); 23 hist(u) % this generates a fairly uniform diagram Figure 6.5: Uniform distribution. % let 𝜆 = 2 x=-log(1-u)*2; %inverse cdf %Alternative approach for inverse cdf x2 = icdf('Exponential',u,2); cd=cdf('Exponential',x2,2); %Plotting histogram figure(6); clf subplot( 1,3,1 ); hist(x); xlabel( 'x' ); ylabel( 'frequency' ); title('Histogram of random valuesEDF'); % Actual distribution Z=exprnd(2,r,1); subplot( 1,3,2 ); 24 hist( Z , 20); xlabel( 'x' ); ylabel( 'frequency' ); title('Histogram of random values CDF'); %Plot CDF and EDF on the same graph subplot( 1,3,3 ); tt=sort(x); %Sort random sample mm=(1:r)/r; %Prob of rth sample g=linspace(0,20); w=expcdf(g,2); plot(g,w,'k',tt,mm,'r'); legend({'CDF', 'EDF'}) xlabel( 'x' ); ylabel( 'Probability' ); title( 'CDF vs. EDF of random values ' ); Figure 6.6: Distribution graphs for the exponential(2) 25 Example 2: B𝑖𝑛omial random variables with n = 9, p = 0.60; 𝑛 𝑗 𝑛−𝑗 𝑃(𝑋 = 𝑗) = ( 𝑗 ) 𝑝 𝑞 = 𝑟𝑗, 𝑗 = 0, 1, 2, . . . 𝑛 , 𝑝 = 1−𝑞 Let 𝑟𝑗+1 = 𝑟𝑗 𝑛−𝑗+1 𝑗(1−𝑝) with linear search. Then in MATLAB, we have the following Code 6.7: Generating Binomial RVs n = 9; p=0.60; q = 1-p; r(1) = q^n; js=[1:n+1]; for j=1:n, r(j+1) = r(j)*p*(nj+1)/(j*q); end F = cumsum(r); %cumulative sum of r K=10000; for k=1:K, X(k)=min(js(F>=rand))-1; end figure(7); subplot(1,2,1), hist(X),xlabel('x'),ylabel('Frequency' ) title('Histogram of x') subplot(1,2,2), plot(sort(X),[1:K]/K),xlabel('x'),ylab el('Probability') title('EDF of x') for k=1:K, X(k)=sum(rand(1,n)<p); end 26 Exercise 1 (Part I) 1. Adapt the Matlab program in Code 6.2 to illustrate the Beta(α, β) distribution where α = 2 and β = 3. Similarly, show the Exponential(λ) distribution where λ = 2. 2. Adapt the matlab program above to illustrate the Binomial(N, θ) distribution where N = 10 and θ = 0.7. 3. Write a demonstration program to sample 10 values from a Bernoulli(θ) distribution with θ = 0.3. Recall that there are only two possible outcomes, 0 and 1. With probability θ, the outcome is 1, and with probability 1 − θ, the outcome is 0. In other words, p(X = 1) = θ, and p(X = 0) = 1 − θ. In Matlab, you can simulate the Bernoulli distribution using the binomial distribution with N = 1. However, for the 27 purpose of this exercise, please write the code needed to randomly sample Bernoulli distributed values that does not make use of the built-in binomial distribution. Using a seed of 21, verify that the first four observations agree with 1, 1, 0 and 1 respectively. Your MATLAB code could use the following line for the seeding: >> seed=21; rand(’state’,seed); randn(’state’,seed); Exercise 2 (Part II) 1. In this exercise, we want to generate the random variable x=(x1, x2) from Beta(1, 𝛽) by inverse transform method. a) Write down the algorithm for generating x. b) Use MATLAB to construct two sets of N=1000 observations each from Beta (1,4). Set seed =131. Plot histograms of both sets on the same graph and print out the first 5 observations from each set. Observe that the first observations are x1=0.0251 and x2 = 0.0348. c) Find the mean of each set generated. 28 7. References 1. www.mathworks.com 2. Computational Statistics Handbook with MATLAB; Wendy L. Martinez & Angel R. Martinez 29