Probability Distributions with SCILAB

By Gilberto E. Urroz, Ph.D., P.E.

Distributed by InfoClearinghouse.com

©2001 Gilberto E. Urroz
All Rights Reserved

A "zip" file containing all of the programs in this document (and other SCILAB documents at InfoClearinghouse.com) can be downloaded at the following site:

http://www.engineering.usu.edu/cee/faculty/gurro/Software_Calculators/Scilab_Docs/ScilabBookFunctions.zip

The author's SCILAB web page can be accessed at:

http://www.engineering.usu.edu/cee/faculty/gurro/Scilab.html

Please report any errors in this document to: gurro@cc.usu.edu

PROBABILITY DISTRIBUTIONS
  Discrete probability distributions
    Bernoulli probability distribution
    Binomial probability distribution
    Poisson probability distribution
    Geometric probability distribution
    Hypergeometric probability mass function
  Cumulative distribution functions for discrete probability distributions
    SCILAB functions for discrete cumulative distribution functions
    SCILAB function cdfbin
  Discrete probability calculations through user-defined functions
    Combinations
    Binomial distribution
    Poisson distribution
    Geometric distribution
    Hypergeometric distribution
  Continuous probability functions
    Factorials and the Gamma function
    The gamma distribution
    The exponential distribution
    The beta distribution
    The Weibull distribution
    The uniform distribution
    User-defined functions for continuous probability distributions
  Continuous probability distributions used in statistical inference
    The Normal distribution
    The Student-t distribution
    The Chi-squared (χ2) distribution
    The F distribution
  Applications of the normal distribution in data analysis
    Plotting a histogram and its corresponding normal curve
    Plotting data against their normal scores
    The lognormal distribution
  Generating synthetic data
    Generating normally-distributed synthetic data
    Additional applications of function rand
    SCILAB function for generating synthetic data
    Examples of synthetic data generation using function grand
    Additional notes on function grand
    Pseudo-random generators
    Generating log-normally-distributed data
    Generating data that follows the Weibull distribution
    Generating data that follows the Student's t distribution
    Generating data that follows a discrete distribution
  Statistical simulation
    Simulating traffic through a service station
    A user-defined function to simulate traffic through a service station
    Modeling traffic through a service station with random input
  STIXBOX: a rudimentary statistics toolbox
  Exercises

Download at InfoClearinghouse.com © 2001 Gilberto E. Urroz

Probability Distributions

There are a number of mathematical functions that possess the properties of a probability mass function for discrete random variables or the properties of a probability density function for continuous random variables. In this section we introduce a number of those functions for the calculation of probabilities. Because these probability distributions depend on a finite number of parameters they are typically referred to as parametric distributions.

Discrete probability distributions

Some of the most useful discrete probability distributions are the Bernoulli, Binomial, Poisson, geometric, and hypergeometric distributions. The definitions of the corresponding probability mass and distribution functions are shown below. We also present expressions for the mean, variance, and standard deviation of these distributions.

Bernoulli probability distribution

The Bernoulli probability distribution applies to a discrete random variable that can only have values of 0 or 1, i.e., X = 0, 1. Let the probability of X = 1 be p, i.e., fX(1) = p, then fX(0) = 1-p. This can be summarized as

$$f_X(x) = p^x (1-p)^{1-x}, \quad x = 0, 1$$

The mean value of the distribution is µX = 0⋅(1-p) + 1⋅p = p.
The expectation of X^2, E(X^2), is needed to calculate the variance Var(X) = E(X^2) - µX^2. For the Bernoulli distribution, E(X^2) = 0^2⋅(1-p) + 1^2⋅p = p, and Var(X) = E(X^2) - µX^2 = p - p^2 = p(1-p). Thus, the standard deviation is σX = [p(1-p)]^(1/2). These results can be obtained using SCILAB as follows:

-->p=poly(0,'p')
 p  =
    p

-->X = [0,1]
 X  =
!   0.    1. !

-->Prob = [1-p p]
 Prob  =
!   1 - p    p !

-->muX = X*Prob'
 muX  =
    p

-->EX2 = X.^2*Prob'
 EX2  =
    p

-->VarX = EX2 - muX^2
 VarX  =
         2
    p - p

The Bernoulli distribution applies to simple binary experiments in which only two possible outcomes exist: 1 or 0, yes or no, success or failure. The value of the probability of success, p, can be obtained, for example, from the classical or from the frequency definitions of probability. Bernoulli processes form the basis of the binomial and geometric distributions presented below.

Binomial probability distribution

If a Bernoulli experiment with success probability p is repeated n times, the probability of having x successes out of the n trials is given by

$$f_X(x) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{\Gamma(n+1)}{\Gamma(x+1)\,\Gamma(n-x+1)}\, p^x (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n, \quad 0 < p < 1$$

with µX = np, Var(X) = np(1-p), and σX = [np(1-p)]^(1/2).

In SCILAB, we can define the probability mass function for the Binomial distribution as

-->deff('[f]=fX(x,n,p)',...
-->'f=gamma(n+1).*p.^x.*(1-p).^(n-x)./(gamma(x+1).*gamma(n-x+1))')

Next, we use this function to produce a plot of the probability mass function for n = 10, p = 0.10:

-->n=10; p=0.10; xx=[0:1:10]; yy = fX(xx,n,p);
-->xset('window',1);xset('mark',-9,2); plot2d(xx',yy',-9)
-->xtitle('Binomial pmf','x','fX(x)')

The following commands produce a plot of the cumulative distribution function:

-->yyy = [];for j = 1:n+1, yyy = [yyy sum(yy(1:j))]; end;
-->xset('window',2); xset('mark',-9,2); plot2d(xx',yyy',-9)
-->xtitle('Binomial cdf','x','FX(x)')

Poisson probability distribution

If X is a Binomial variable with n →∞ and p →0, we calculate the parameter λ = n⋅p, and define the Poisson probability mass function as

$$f_X(x) = \frac{e^{-\lambda}\,\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots; \quad \lambda > 0$$

The Poisson pmf can be used to model the number of occurrences of a certain event in a given time period or per unit length, area or volume, if λ represents the mean number of occurrences of the event per unit time, length, area or volume, respectively. The Poisson distribution has the parameters µX = λ, Var(X) = λ, and σX = λ^(1/2).

In SCILAB we can define the Poisson distribution pmf as:

-->deff('[p]=fX(x,lambda)','p=exp(-lambda).*lambda.^x./gamma(x+1)')

A plot of the pmf for λ = 2.5 for values of x between 0 and 20:

-->lambda = 2.5; xx = [0:1:20]; yy =fX(xx,lambda);
-->xset('window',1);xset('mark',-9,2);plot2d(xx',yy',-9)
-->xtitle('Poisson pmf','x','fX(x)')

A plot of the corresponding cumulative distribution function follows:

-->yyy = []; for j = 1:21, yyy = [yyy sum(yy(1:j))]; end;
-->xset('window',2); xset('mark',-9,2); plot2d(xx',yyy',-9)
-->xset('window',2); xset('mark',-9,2); plot2d(xx',yyy',9)
-->xtitle('Poisson cdf','x','FX(x)')

Geometric probability distribution

Suppose that we have a Bernoulli experiment with probability of success p being repeated until a successful outcome occurs. Let X represent the number of repetitions needed to obtain the first success, then X can be modeled with the geometric pmf:

$$f_X(x) = p\,(1-p)^{x-1}, \quad x = 1, 2, \ldots; \quad 0 < p < 1$$

The geometric distribution has the parameters µX = 1/p, Var(X) = (1-p)/p^2, and σX = (1-p)^(1/2)/p.
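The mean and variance quoted for the geometric distribution can be checked numerically. The following is a minimal sketch that is not part of the original text: it truncates the infinite sums at an arbitrary cutoff K = 500 (an assumption; the neglected tail is negligible for p = 0.25), and the variable names are chosen here for illustration only.

```scilab
// Numerical check of the geometric mean and variance (sketch).
// K is an arbitrary truncation point for the infinite sums (assumption).
p = 0.25; K = 500;
k = 1:K;
f = p.*(1-p).^(k-1);           // geometric pmf for x = 1, 2, ..., K
muX  = sum(k.*f);              // approximates 1/p
varX = sum(k.^2.*f) - muX^2;   // approximates (1-p)/p^2
mprintf("mean = %f   1/p = %f\n", muX, 1/p);
mprintf("var  = %f   (1-p)/p^2 = %f\n", varX, (1-p)/p^2);
```

The same pattern (sum of k⋅f and of k^2⋅f over the support) can be reused for the other discrete distributions of this section.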
The pmf for the geometric distribution and a plot of it are obtained in SCILAB by using (x starts at 1, the first value in the domain of the distribution):

-->deff('[f]=fX(p,x)','f=p.*(1-p).^(x-1)')
-->p = 0.25; xx = [1:1:20]; yy = fX(p,xx);
-->xset('window',1);xset('mark',-9,2);plot2d(xx',yy',-9)
-->xtitle('geometric pmf','x','fX(x)')

A plot of the geometric distribution CDF is shown next:

-->yyy = [];for j = 1:20, yyy = [yyy sum(yy(1:j))]; end;
-->xset('window',2); xset('mark',-9,2); plot2d(xx',yyy',-9)
-->xtitle('geometric cdf','x','FX(x)')

Hypergeometric probability mass function

Suppose that we have a finite population of N elements, out of which a < N elements are defective. Suppose also that we take a sample of size n < N out of the population, and let X represent the number of defective elements in the sample of size n. The probability of X is given by the following pmf:

$$f_X(x, n, a, N) = \frac{\binom{a}{x}\binom{N-a}{n-x}}{\binom{N}{n}}, \quad 0 < n < N, \quad 0 < a < N, \quad x = 0, 1, \ldots, n$$

Parameters of the distribution are: µX = n⋅a/N, Var(X) = n⋅a⋅(N-a)⋅(N-n)/(N^2⋅(N-1)).
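As a quick check of these two expressions, the pmf can be summed directly over its support. This is only a sketch under stated assumptions: the values N = 100, a = 25, n = 20 are example values chosen here, and Cnr is a hypothetical helper name for the binomial coefficient written in terms of the gamma function.

```scilab
// Numerical check of the hypergeometric mean and variance (sketch).
N = 100; a = 25; n = 20;       // example values (assumption)
// Cnr: hypothetical helper, binomial coefficient via the gamma function
deff('[CC]=Cnr(n,r)','CC=gamma(n+1)./(gamma(r+1).*gamma(n-r+1))')
x = 0:n;
f = Cnr(a,x).*Cnr(N-a,n-x)./Cnr(N,n);   // hypergeometric pmf for x = 0..n
total = sum(f);                          // should be close to 1
muX   = sum(x.*f);                       // compare with n*a/N
varX  = sum(x.^2.*f) - muX^2;            // compare with n*a*(N-a)*(N-n)/(N^2*(N-1))
mprintf("sum = %f  mean = %f  var = %f\n", total, muX, varX);
mprintf("n*a/N = %f  formula var = %f\n", n*a/N, n*a*(N-a)*(N-n)/(N^2*(N-1)));
```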
To produce plots of the hypergeometric probability mass function and cumulative distribution function, we first define a function accounting for the binomial coefficient:

-->deff('[CC]=C(n,r)','CC=gamma(n+1)./(gamma(r+1).*gamma(n-r+1))')

This function is incorporated in the definition of the hypergeometric function (which uses the values of a, N, and n defined in the workspace):

-->deff('[p]=fX(x)','p=C(a,x).*C(N-a,n-x)./C(N,n)')

Next, we produce plots of the hypergeometric pmf and CDF for N = 100, a = 25, and n = 20:

-->N=100;a=25;n=20;
-->xx=[0:1:20];yy=fX(xx);
-->xset('window',1);xset('mark',-9,2);
-->plot2d(xx',yy',-9);xtitle('Hypergeometric distribution','x','fX(x)');
-->yyy=[];for j=1:21, yyy=[yyy sum(yy(1:j))]; end;
-->xset('window',2);xset('mark',-9,2);
-->plot2d(xx',yyy',-9);xtitle('Hypergeometric distribution','x','FX(x)');
-->plot2d(xx',yyy',9)

Cumulative distribution functions for discrete probability distributions

Out of the five probability distributions presented above, namely, Bernoulli, Binomial, Poisson, geometric, and hypergeometric, three represent finite populations of discrete values (Bernoulli, Binomial, hypergeometric) and two represent infinite populations (Poisson and geometric). For the Binomial, Poisson, and hypergeometric distributions, whose domain starts at x = 0, the cumulative distribution function is calculated using

$$F_X(x) = \sum_{k=0}^{x} f_X(k),$$

where fX(x) represents the corresponding probability mass function. (This is the definition used to produce the CDF graphics shown in the previous examples.) The cumulative distribution function FX(x) is defined over the same range of values as the discrete random variable X.

For the geometric distribution, whose domain starts at x = 1, the corresponding expression is

$$F_X(x) = \sum_{k=1}^{x} f_X(k) = \sum_{k=1}^{x} p\,(1-p)^{k-1}, \quad x = 1, 2, 3, \ldots$$

SCILAB functions for discrete cumulative distribution functions

SCILAB provides a number of functions for operations with cumulative distribution functions.
For discrete distributions the following functions are provided:

• cdfbin - Binomial distribution
• cdfnbn - Negative binomial distribution
• cdfpoi - Poisson distribution (described in detail in Chapter …)

Information on these functions can be obtained by using the help function. Next, we describe the use of function cdfbin.

SCILAB function cdfbin

There are four different forms of the call to function cdfbin:

[P,Q]=cdfbin("PQ",S,Xn,Pr,Ompr)
[S]=cdfbin("S",Xn,Pr,Ompr,P,Q)
[Xn]=cdfbin("Xn",Pr,Ompr,P,Q,S)
[Pr,Ompr]=cdfbin("PrOmpr",P,Q,S,Xn)

The variable Pr in these calls represents the probability of success on any given trial, which we refer to as p in the definition of the Bernoulli pmf shown earlier. On the other hand, Ompr represents 1-Pr (in some references this is referred to as q = 1 - p), i.e., the probability of failure in a given trial. The variable P represents the probability P(X≤S), where X ~ Binomial(Xn,Pr), while Q = 1 - P.

The first argument in the calls to function cdfbin is a string that determines which variable is being sought, according to:

"PQ"      - calculate probabilities, P = P(X≤S) and Q = 1 - P
"S"       - calculate the inverse CDF, i.e., calculate S from P = P(X≤S)
"Xn"      - calculate the number of trials (n in the definition of the pmf)
"PrOmpr"  - calculate the probability of success in any given trial (p in the pmf definition)

Care should be exercised in keeping the proper order of the variables in the calls to the function. Some examples follow:

-->n = 10; x = 6; p = 0.35; q = 1-p;
-->[P,Q] = cdfbin('PQ',x,n,p,q)          //Calculating probabilities
 Q  =
    .0260243
 P  =
    .9739757

-->n=20;p=0.35;q=1-p;P=0.75;Q=1-P;
-->x = cdfbin("S",n,p,q,P,Q)             //Calculating the inverse CDF
 x  =
    7.9132062

-->[p,q] = cdfbin("PrOmpr",P,Q,x,n)      //Calculating p and q = 1-p
 q  =
    .7391494
 p  =
    .2608506

Notes: Use help cdfnbn to learn more about the function that implements the negative Binomial distribution.
The function cdfpoi was described in detail in Chapter 13.

Discrete probability calculations through user-defined functions

Besides the few pre-programmed cumulative distribution functions provided by SCILAB, probabilities can be calculated by defining probability mass and cumulative distribution functions for the different distributions presented earlier. The basic definitions of probabilities in terms of probability mass and cumulative distribution functions are:

pmf: P(X=x) = fX(x)

cdf for Binomial, Poisson, and hypergeometric distributions:

$$P(X \le x) = \sum_{k=0}^{x} f_X(k)$$

cdf for geometric distribution:

$$P(X \le x) = \sum_{k=1}^{x} f_X(k)$$

We will define the following functions for the distributions shown earlier:

                 pmf             CDF
Binomial         b(x,n,p)        B(x,n,p)
Poisson          p(x,lambda)     P(x,lambda)
geometric        g(x,p)          G(x,p)
hypergeometric   h(x,N,n,a)      H(x,N,n,a)

The following is a SCILAB script, called DiscreteProbabilityFunctions, which includes the definitions for the eight functions listed in the table immediately above, plus the binomial coefficient C(n,r):

//Defining discrete probability distributions
deff('[CC]=C(n,r)','CC=gamma(n+1)./(gamma(r+1).*gamma(n-r+1))')   //Binomial coefficient
deff('[bb]=b(x,n,p)','bb=C(n,x).*p.^x.*(1-p).^(n-x)')             //Binomial pmf
deff('[BB]=B(x,n,p)','BB=sum(b([0:1:x],n,p))')                    //Binomial CDF
deff('[pp]=p(x,lambda)','pp=exp(-lambda).*lambda.^x./gamma(x+1)') //Poisson pmf
deff('[PP]=P(x,lambda)','PP=sum(p([0:1:x],lambda))')              //Poisson CDF
deff('[gg]=g(x,p)','gg=p.*(1-p).^(x-1)')                          //Geometric pmf
deff('[GG]=G(x,p)','GG=sum(g([1:x],p))')                          //Geometric CDF
deff('[hh]=h(x,N,n,a)','hh=C(a,x).*C(N-a,n-x)./C(N,n)')           //Hypergeometric pmf
deff('[HH]=H(x,N,n,a)','HH=sum(h([0:1:x],N,n,a))')                //Hypergeometric CDF

To execute the script that defines the discrete probability functions use:

-->exec('DiscreteProbabilityFunctions')

Combinations

The function C(n,r) represents combinations of n elements taken r at a time, or the binomial coefficient:

-->C(10,5)
 ans  =
    252.

This is a vector of values of C(n,r) for n = 10, and r = 0, 1, …, 10:

-->C10=[];for j=0:10,C10=[C10 C(10,j)]; end; C10
 C10  =
!   1.    10.    45.    120.    210.    252.    210.    120.    45.    10.    1. !

Binomial distribution

For the binomial distribution with n = 10 and p = 0.25, the following call to function b(x,n,p) calculates the probability P(X=2) = b(2,10,0.25):

-->b(2,10,0.25)
 ans  =
    .2815676

The following is a list of values of the binomial pmf for n = 10, p = 0.25, for all possible values of x = 0, 1, …, 10:

-->b10=[];for j=0:10,b10=[b10 b(j,10,0.25)]; end; b10
 b10  =
         column 1 to 7
!   .0563135    .1877117    .2815676    .2502823    .145998    .0583992    .016222 !
         column 8 to 11
!   .0030899    .0003862    .0000286    9.537E-07 !

The binomial CDF for x = 2, n = 10, p = 0.25 is calculated with the following call to function B(x,n,p). This value represents P(X≤2):

-->B(2,10,0.25)
 ans  =
    .5255928

This value represents P(X>2) = 1 - P(X≤2):

-->1-B(2,10,0.25)
 ans  =
    .4744072

The following is a list of values of the binomial CDF for n = 10, p = 0.25, for all values of x = 0, 1, …, 10:

-->B10=[];for j=0:10,B10=[B10 B(j,10,0.25)]; end; B10
 B10  =
         column 1 to 7
!   .0563135    .2440252    .5255928    .7758751    .9218731    .9802723    .9964943 !
         column 8 to 11
!   .9995842    .9999704    .9999990    1. !

Poisson distribution

The pmf of the Poisson distribution can be used to calculate probabilities such as P(X=2) for λ = 5.2:

-->p(2,5.2)
 ans  =
    .0745840

For P(X=6), the Poisson distribution with λ = 5.2 produces:

-->p(6,5.2)
 ans  =
    .1514803

The cumulative distribution function for the Poisson distribution, with λ = 5.2, provides the probability P(X≤6):

-->P(6,5.2)
 ans  =
    .7323933

The following SCILAB commands produce a vector of values of the Poisson cdf for x = 1, 2, …, 10, and λ = 5.2:

-->P10=[];for j=1:10, P10=[P10 P(j,5.2)]; end; P10
 P10  =
         column 1 to 7
!   .0342027    .1087867    .2380655    .406128    .580913    .7323933    .8449216 !
         column 8 to 10
!   .9180650    .9603256    .9823011 !

Geometric distribution

The probabilities P(X=3) and P(X=5) using the geometric distribution with p = 0.50 are calculated as:

-->g(3,0.50)
 ans  =
    .125

-->g(5,0.50)
 ans  =
    .03125

The following example shows a way to calculate a vector of values of the geometric distribution pmf for x = 1, 2, …, 10:

-->g([1:10],0.5)
 ans  =
         column 1 to 9
!   .5    .25    .125    .0625    .03125    .015625    .0078125    .0039063    .0019531 !
         column 10
!   .0009766 !

The following evaluations of the geometric distribution cdf are used to calculate the probabilities P(X≤6), P(X≤3), and P(X≤1), respectively:

-->G(6,0.5)
 ans  =
    .984375

-->G(3,0.5)
 ans  =
    .875

-->G(1,0.5)
 ans  =
    .5

A vector of values of the geometric distribution CDF, with p = 0.5, is produced by using the following commands:

-->G10=[];for j=1:10, G10=[G10 G(j,0.5)]; end; G10
 G10  =
         column 1 to 9
!   .5    .75    .875    .9375    .96875    .984375    .9921875    .9960938    .9980469 !
         column 10
!   .9990234 !

Hypergeometric distribution

The next line assigns values to the parameters N, n, and a in the hypergeometric distribution:

-->N=100;n=20;a=35;

The probability P(X=12) for the hypergeometric distribution with the parameters N, n, and a defined above is calculated as:

-->h(12,N,n,a)
 ans  =
    .0078581

The cumulative distribution function for the hypergeometric distribution for x = 12 is calculated as follows:

-->H(12,N,n,a)
 ans  =
    .9976693

The value just calculated represents the probability P(X≤12). The next statement generates a vector of values of the hypergeometric pmf for x = 0, 1, 2, …, 20:

-->h([0:20],N,n,a)
 ans  =
         column 1 to 7
!   .0000529    .0008046    .0055295    .0228093    .0633073    .1256018    .1847085 !
         column 8 to 14
!   .2060210    .1768671    .1179114    .0613139    .0248839    .0078581    .0019176 !
         column 15 to 21
!   .0003575    .0000501    .0000051    3.698E-07    1.761E-08    4.924E-10    6.060E-12 !

The next line collects values in a vector named H10 for x = 1, 2, …, 10. As written, the loop calls the pmf h, so the values listed are pmf values; to obtain the corresponding CDF values, replace h with H in the loop:

-->H10=[];for j=1:10,H10=[H10 h(j,N,n,a)]; end; H10
 H10  =
         column 1 to 7
!   .0008046    .0055295    .0228093    .0633073    .1256018    .1847085    .2060210 !
         column 8 to 10
!   .1768671    .1179114    .0613139 !

Continuous probability functions

In this section we describe several continuous probability distributions including the gamma, exponential, beta, and Weibull distributions. Some of these distributions make use of the Gamma function, Γ(x), which is defined next.

__________________________________________________________________________________

Factorials and the Gamma function (see also Chapter 13)

The Gamma function is defined by

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\, dx$$

This function has the property that Γ(α) = (α-1)⋅Γ(α-1), for α > 1; therefore, it can be related to the factorial of a number, i.e., Γ(α) = (α-1)!, when α is a positive integer.

Factorials have applications in combinatorics (calculation of combinations and permutations, etc.), and in some discrete probability distributions (e.g., binomial probability distribution), while the gamma function has applications in continuous probability distributions (e.g., the gamma probability distribution).

__________________________________________________________________________________

The gamma distribution

The probability density function (pdf) for the gamma distribution is given by

$$f(x) = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\, x^{\alpha-1} \exp\left(-\frac{x}{\beta}\right), \quad x > 0, \quad \alpha > 0, \quad \beta > 0$$

The parameters α and β are referred to, respectively, as the shape and scale parameters of the gamma distribution. Other parameters of this distribution are: µX = α⋅β, σX^2 = α⋅β^2.

SCILAB provides function cdfgam for operations with the gamma distribution CDF.
The calls to this function take the form

[P,Q]=cdfgam("PQ",X,Shape,Scale)
[X]=cdfgam("X",Shape,Scale,P,Q)
[Shape]=cdfgam("Shape",Scale,P,Q,X)
[Scale]=cdfgam("Scale",P,Q,X,Shape)

where P = P(XX<X), Q = 1 - P, and Shape = α, with XX following the gamma distribution. Note that the results printed in the examples below are reproduced only if the argument Scale multiplies x in the exponential, i.e., if Scale = 1/β in the notation of the pdf given above. For example, with Shape = 2 and Scale = 3, Q = P(XX>0.5) = exp(-3⋅0.5)⋅(1 + 3⋅0.5) = 0.5578254, which is the value returned by the third call below.

The following are examples of applications of function cdfgam. The following three calls determine, respectively, the probabilities P = P(X<10), P = P(X<3), and P = P(X<0.5), as well as the probabilities of the complement, Q = 1 - P, for the gamma distribution with Shape = 2, Scale = 3:

-->[P,Q]=cdfgam("PQ",10,2,3)
 Q  =
    2.901E-12
 P  =
    1.

-->[P,Q]=cdfgam("PQ",3,2,3)
 Q  =
    .0012341
 P  =
    .9987659

-->[P,Q]=cdfgam("PQ",0.5,2,3)
 Q  =
    .5578254
 P  =
    .4421746

The next call to function cdfgam calculates the inverse of the gamma CDF, i.e., the value of x for which P = P(X<x) = 0.4, where X follows the gamma distribution with Shape = 2, Scale = 3:

-->x=cdfgam('X',2,3,0.4,0.6)
 x  =
    .4588071

The next call to the function is used to calculate the shape parameter, given a probability P = P(X<0.3) = 0.6, Q = 1-P = 0.4, with X following the gamma distribution with Scale = 2:

-->alpha = cdfgam('Shape',2,0.6,0.4,0.3)
 alpha  =
    .7190660

The next call to function cdfgam calculates the Scale argument, given a probability P = P(X<1.2) = 0.2, Q = 1-P = 0.8, with X following the gamma distribution with Shape = 3:

-->beta = cdfgam('Scale',0.2,0.8,1.2,3)
 beta  =
    1.2792035

The exponential distribution

The exponential distribution is the gamma distribution with α = 1. Its pdf is given by

$$f_X(x) = \frac{1}{\beta} \exp\left(-\frac{x}{\beta}\right), \quad x > 0, \quad \beta > 0$$

while its cdf is given by FX(x) = 1 - exp(-x/β), for x > 0, β > 0.

Parameters of the exponential distribution follow from those of the gamma distribution with α = 1: µX = β, σX = β.

The beta distribution

The pdf for the beta distribution is given by

$$f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}, \quad 0 < x < 1, \quad \alpha > 0, \quad \beta > 0$$

As in the case of the gamma distribution, the corresponding cdf for the beta distribution is also given by an integral with no closed-form solution.

The parameters of the beta distribution include

$$\mu_X = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{Var}(X) = \frac{\alpha\,\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}$$

SCILAB provides function cdfbet for operations with the cumulative distribution function of the beta distribution. Calls to the function are the following:

[P,Q]=cdfbet("PQ",X,Y,A,B)
[X,Y]=cdfbet("XY",A,B,P,Q)
[A]=cdfbet("A",B,P,Q,X,Y)
[B]=cdfbet("B",P,Q,X,Y,A)

In these calls P = P(XX<X), Y = 1 - X, Q = 1 - P, and A, B are the parameters α and β of the beta distribution.

Next, we present some applications of function cdfbet. The first example calculates the probability P(X<0.35) for the beta distribution with α = 2, β = 3:

-->[P,Q]=cdfbet('PQ',0.35,1-0.35,2,3)
 Q  =
    .5629813
 P  =
    .4370187

An example that calculates the inverse function of the beta cdf, i.e., the value of x for which P = P(X<x) = 0.75, for the beta distribution with α = 3, β = 5 is presented next:

-->[X,Y] = cdfbet("XY",3,5,0.75,1-0.75)
 Y  =
    .5139030
 X  =
    .4860970

The next two examples show how to obtain the parameters α and β of the beta distribution. In the first application β = 3.5, X = 0.3, Y = 1-X = 0.7, P = P(X<0.3) = 0.4, and Q = 1-P = 0.6. In the second application α = 1.5 and, as the call shows, X = 0.8, Y = 0.2, P = 0.6, and Q = 0.4:

-->alpha = cdfbet("A",3.5,0.4,0.6,0.3,0.7)
 alpha  =
    2.0459494

-->beta = cdfbet("B",0.6,0.4,0.8,0.2,1.5)
 beta  =
    .7453948

The Weibull distribution

The pdf for the Weibull distribution is given by

$$f(x) = \alpha\,\beta\, x^{\beta-1} \exp(-\alpha\, x^{\beta}), \quad x > 0, \quad \alpha > 0, \quad \beta > 0$$

while the corresponding cdf is given by

$$F(x) = 1 - \exp(-\alpha\, x^{\beta}), \quad x > 0, \quad \alpha > 0, \quad \beta > 0$$

Parameters of this distribution are:

$$\mu_X = \alpha^{-1/\beta}\, \Gamma\left(1+\frac{1}{\beta}\right), \qquad \mathrm{Var}(X) = \alpha^{-2/\beta} \left[\Gamma\left(1+\frac{2}{\beta}\right) - \Gamma^2\left(1+\frac{1}{\beta}\right)\right]$$

The uniform distribution

The uniform distribution for a continuous random variable is defined for values of X such that a < x < b. The corresponding probability density function is given by

$$f_X(x) = \frac{1}{b-a}, \quad a < x < b$$

The cumulative distribution function is

$$F_X(x) = \frac{x-a}{b-a}, \quad a < x < b$$

The parameters of the uniform distribution are:

$$\mu_X = \frac{a+b}{2}, \qquad \mathrm{Var}(X) = \frac{(b-a)^2}{12}$$

The following function definition implements the cumulative distribution function for the uniform distribution in SCILAB (it uses the values of a and b defined in the workspace):

-->deff('[FF]=FX(x)','FF=(x-a)/(b-a)')

For values of a = 2.5 and b = 3.2, we proceed to calculate some probabilities:

--> a = 2.5; b = 3.2;

First, we calculate P(X<2.7) = FX(2.7):

-->FX(2.7)
 ans  =
    .2857143

Next, we calculate P(X>3) = 1 - P(X<3) = 1 - FX(3):

-->1-FX(3)
 ans  =
    .2857143

The following example calculates P(2.8<X<3) = P(X<3) - P(X<2.8) = FX(3) - FX(2.8):

-->FX(3)-FX(2.8)
 ans  =
    .2857143

User-defined functions for continuous probability distributions

The following SCILAB script defines the probability density function and the cumulative distribution function for four selected continuous distributions: gamma, exponential, beta, and Weibull. The script is called ContinuousProbabilityFunctions, and is invoked by using:

-->exec('ContinuousProbabilityFunctions')

The listing of the script is the following:

//Define selected continuous probability functions
deff('[gg]=gam(x,a,b)','gg=x.^(a-1).*exp(-x./b)./(b.^a.*gamma(a))')
deff('[GG]=GAM(x,a,b)','GG=intg(0,x,gam)')
deff('[ee]=eex(x,b)','ee=exp(-x./b)./b')
deff('[EE]=EEX(x,b)','EE=1-exp(-x./b)')
deff('[bb]=bet(x,a,b)',...
'bb=gamma(a+b).*x.^(a-1).*(1-x).^b./(gamma(a).*gamma(b))')
deff('[BB]=BET(x,a,b)','BB=intg(0,x,bet)')
deff('[ww]=w(x,a,b)','ww=a.*b.*x.^(b-1).*exp(-a.*x.^b)')
deff('[WW]=W(x,a,b)','WW=1-exp(-a.*x.^b)')

The functions defined through the script are summarized in the following table:

              pdf            CDF
gamma         gam(x,α,β)     GAM(x,α,β)
exponential   eex(x,β)       EEX(x,β)
beta          bet(x,α,β)     BET(x,α,β)
Weibull       w(x,α,β)       W(x,α,β)

Applications of these functions follow, starting with the gamma distribution.

The gamma distribution

First, we plot the pdf of the distribution using α = 2 and β = 3:

-->xx=(0:0.1:20);yy=gam(xx,2,3);
-->plot(xx,yy,'x','fX(x)','gamma distribution')

A plot of the gamma distribution CDF for α = 2 and β = 3 is obtained by using:

-->yyy=[];for x=0:0.1:20, yyy=[yyy GAM(x,2,3)]; end;
-->plot(xx,yyy,'x','FX(x)','gamma distribution')

The CDF can be used to calculate probabilities. The next three lines calculate the following probabilities: P(X<5) = FX(5), P(6<X<11) = FX(11) - FX(6), and P(X>7.5) = 1 - P(X<7.5) = 1 - FX(7.5):

-->GAM(5,2,3)
 ans  =
    .4963317

-->GAM(11,2,3)-GAM(6,2,3)
 ans  =
    .2867187

-->1-GAM(7.5,2,3)
 ans  =
    .2872975

The exponential distribution

The following commands generate plots of the pdf and CDF for the exponential distribution using β = 2.5:

-->xx=(0:0.1:20);yy=eex(xx,2.5);
-->plot(xx,yy,'x','fX(x)','exponential distribution')
-->yyy=[];for x=0:0.1:20, yyy=[yyy EEX(x,2.5)]; end;
-->plot(xx,yyy,'x','FX(x)','exponential distribution')

The following probability calculations for the exponential distribution with β = 2.5 are presented next: P(X<6) = FX(6), P(X>4) = 1 - P(X<4) = 1 - FX(4), and P(4<X<6) = FX(6)-FX(4):

-->EEX(6,2.5)
 ans  =
    .9092820

-->1-EEX(4,2.5)
 ans  =
    .2018965

-->EEX(6,2.5)-EEX(4,2.5)
 ans  =
    .1111786

The beta distribution

To plot the pdf and CDF of the beta distribution with α = 2.5, β = 3.5, we use:

-->xx=(0:0.05:1);yy=bet(xx,2.5,3.5);
-->plot(xx,yy,'x','fX(x)','beta distribution')
-->yyy=[];for x=0:0.05:1, yyy=[yyy BET(x,2.5,3.5)]; end;
-->plot(xx,yyy,'x','FX(x)','beta distribution')

The following probability calculations for the beta distribution with α = 2.5, β = 3.5 are presented next: P(X<0.25) = FX(0.25), P(X>0.75) = 1 - P(X<0.75) = 1 - FX(0.75), and P(0.3<X<0.8) = FX(0.8)-FX(0.3). Note that the function bet, as listed in the script, raises (1-x) to the power b instead of b-1, so that it integrates to β/(α+β) ≈ 0.583 over 0 < x < 1 rather than to 1. The values printed below are the ones returned by that listing; replacing (1-x).^b with (1-x).^(b-1) in the script makes bet agree with the pdf formula given earlier and changes these values.

-->BET(0.25,2.5,3.5)
 ans  =
    .1737696

-->1-BET(0.75,2.5,3.5)
 ans  =
    .4250376

-->BET(0.8,2.5,3.5)-BET(0.3,2.5,3.5)
 ans  =
    .3428804

The Weibull distribution

Plots of the pdf and CDF for the Weibull distribution with α = 2 and β = 3 are obtained as follows:

-->xx=(0:0.01:2);yy=w(xx,2,3);
-->plot(xx,yy,'x','fX(x)','Weibull distribution')
-->yyy=[];for x=0:0.01:2, yyy=[yyy W(x,2,3)]; end;
-->plot(xx,yyy,'x','FX(x)','Weibull distribution')

The following probability calculations for the Weibull distribution with α = 2 and β = 3 are presented next: P(X<1.5) = FX(1.5), P(X>0.6) = 1 - P(X<0.6) = 1 - FX(0.6), and P(0.5<X<1.2) = FX(1.2)-FX(0.5):

-->W(1.5,2,3)
 ans  =
    .9988291

-->1-W(0.6,2,3)
 ans  =
    .6492094

-->W(1.2,2,3)-W(0.5,2,3)
 ans  =
    .7472451

Continuous probability distributions used in statistical inference

Statistical inference is the process by which sample data is used to provide information about the population. Some of the products of statistical inference are the generation of confidence intervals and the test of hypotheses for population parameters. There are a number of continuous probability distributions of great utility in statistical inference.
These are:

• the standard normal distribution
• the Student's t distribution
• the Chi-square (χ2) distribution
• the F distribution

The probability density functions (pdf) for these distributions are presented below.

The Normal distribution

The expression for the normal distribution pdf is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right],$$

where µ is the mean, and σ^2 the variance of the distribution.

SCILAB provides function cdfnor for operations with the cumulative distribution function for the normal distribution. Function cdfnor was presented in detail in Chapter …. To find on-line information on this function use the command:

-->help cdfnor

The Student-t distribution

The Student-t, or simply, the t-, distribution has one parameter ν, known as the degrees of freedom. The probability density function (pdf) is given by

$$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}} \left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \quad -\infty < t < \infty$$

The following SCILAB commands can be used to plot the pdf for the Student t distribution with ν = 6:

-->deff('[f]=fT(t,nu)',...
-->'f=gamma((nu+1)./2).*(1+t.^2./nu).^(-(nu+1)/2)/(sqrt(%pi*nu)*gamma(nu/2))')
-->tt=[-4:0.1:4];ff=fT(tt,6);
-->plot(tt,ff,'t','fT(t)','Student t - nu = 6')

SCILAB provides function cdft for operations with the cumulative distribution function of the Student's t distribution. The calls to the function are as follows:

[P,Q]=cdft("PQ",T,Df)
[T]=cdft("T",Df,P,Q)
[Df]=cdft("Df",P,Q,T)

In these function calls, P = P(TT<T), Q = 1 - P, Df = degrees of freedom = ν, with TT ~ Student t(Df).
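Since the "PQ" and "T" forms of cdft are inverses of one another, a round trip through both is a convenient sanity check on the order of the arguments. The following is a minimal sketch using the call forms listed above; the values t0 = 0.4 and nu = 6 and the variable names are chosen here for illustration only.

```scilab
// Round trip through the "PQ" and "T" forms of cdft (sketch).
nu = 6; t0 = 0.4;              // example values (assumption)
[P,Q] = cdft("PQ",t0,nu);      // P = P(TT < t0), Q = 1 - P
t1 = cdft("T",nu,P,Q);         // inverse CDF: should recover t0
mprintf("P = %f  Q = %f  recovered t = %f\n", P, Q, t1);
```

If the recovered value differs noticeably from t0, the most likely cause is a misplaced argument, since each call form expects its inputs in a different order.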
-->[P,Q] = cdft("PQ",0.4,6)             //Probability calculation
 Q  =
    .3515041
 P  =
    .6484959

-->t = cdft("T",8,0.45,1-0.45)          //Inverse CDF calculation
 t  =
  - .1297073

-->nu = cdft("Df",0.7,0.3,0.8)          //Obtaining degrees of freedom
 nu  =
    .7716700

A plot of the CDF for the Student t distribution can be produced using the following commands:

-->xx=[-4:0.1:4];
-->yy=[];for x=-4:0.1:4, yy=[yy cdft('PQ',x,6)]; end;
-->plot(xx,yy,'t','FT(t)','Student t - nu = 6')

The Chi-squared (χ2) distribution

The Chi-squared (χ2) distribution has one parameter ν, known as the degrees of freedom. The probability density function (pdf) is given by

$$f(x) = \frac{1}{2^{\nu/2}\,\Gamma\left(\frac{\nu}{2}\right)}\, x^{\frac{\nu}{2}-1} e^{-\frac{x}{2}}, \quad \nu > 0, \quad x > 0$$

A plot of the CDF for the Chi-square distribution with ν = 4 can be obtained by using function cdfchi, described below:

-->xx = [0:0.1:10];
-->yy=[];for x=0:0.1:10, yy=[yy cdfchi('PQ',x,4)]; end;
-->plot(xx,yy,'x','FX(x)','Chi-square - nu = 4')

SCILAB provides function cdfchi for operations with the cumulative distribution function of the χ2 (chi-square) distribution. The calls to this function include:

[P,Q]=cdfchi("PQ",X,Df)
[X]=cdfchi("X",Df,P,Q);
[Df]=cdfchi("Df",P,Q,X)

In these calls to function cdfchi P = P(XX<X), Q = 1 - P, Df = degrees of freedom, with XX ~ χ2(Df).

-->[P,Q] = cdfchi("PQ",1,10)            //Probability calculation
 Q  =
    .9998279
 P  =
    .0001721

-->[P,Q] = cdfchi("PQ",0.2,10)          //Probability calculation
 Q  =
    .9999999
 P  =
    7.668E-08

-->chi2 = cdfchi("X",4,0.4,0.6)         //Inverse CDF calculation
 chi2  =
    2.7528427

-->nu = cdfchi("Df",0.4,0.6,2.7)        //Calculating degrees of freedom
 nu  =
    3.9409085

A plot of the pdf for the Chi-square distribution with ν = 10 is obtained by using:

-->deff('[f]=fC(x,nu)',...
-->'f=x.^(nu/2-1).*exp(-x./2)/(2.^(nu/2).*gamma(nu./2))') -->cc=[0:0.1:30];ff=fC(cc,10); -->plot(cc,ff,'chi^2','fC(chi^2)','Chi-square - nu = 10') The F distribution The F distribution has two parameters νN = numerator degrees of freedom, and νD = denominator degrees of freedom. The probability distribution function (pdf) is given by Download at InfoClearinghouse.com 28 © 2001 Gilberto E. Urroz νN νN −1 νN + νD νN ) ⋅( ) 2 ⋅ F 2 Γ( νD 2 f (x) = νN +νD νN νD νN ⋅ F ( 2 ) ⋅Γ( ) ⋅ (1 + ) Γ( νD 2 2 ) νD>0, νN>0, x>0. A plot of the F-distribution pdf for nN = 4, nD = 6, is obtained by using: -->deff('[f]=fF(F,nuN,nuD)',... -->'f=gamma((nuN+nuD)./2).*(nuN./nuD).^(nuN./2).*F.^(nuN./2-1)./... -->(gamma(nuN./2).*gamma(nuD./2).*(1+nuN.*F./nuD).^((nuN+nuD)./2))') -->xx=[0:0.1:10];ff=fF(xx,4,6); -->plot(xx,ff,'F','fF(F)','F distribution - nuNum = 4 - nuDen = 6') SCILAB provides the function cdff for operations with the cumulative distribution function of the F distribution. [P,Q]=cdff("PQ",F,Dfn,Dfd) [F]=cdff("F",Dfn,Dfd,P,Q); [Dfn]=cdff("Dfn",Dfd,P,Q,F); [Dfd]=cdff("Dfd",P,Q,F,Dfn) In these calls of the function cdff, P = P(FF<F), Q = 1 - P, Dfn and Dfd = degrees of freedom in the numerator and denominator of F. -->[P,Q] = cdff("PQ",1.2,6,12) Q = .3697351 P = .6302649 -->F = cdff("F",10,2,0.4,0.6) F = //Probability calculation //Inverse CDF calculation .9944093 -->nuNum= cdff('Dfn',5,0.4,0.6,0.8) nuNum = //calculating degrees of freedom 5.3847039 Download at InfoClearinghouse.com 29 © 2001 Gilberto E. Urroz A plot of the F-distribution CDF is produced through the following SCILAB commands: -->xx = [0:0.1:10]; -->yy=[];for x=0:0.1:10, yy=[yy cdff('PQ',x,4,6)]; end; -->plot(xx,yy,'t','fX(t)','F - nuNum = 4 - nuDen = 6') Applications of the normal distribution in data analysis The normal distribution, also known as the bell curve, appears commonly when determining the frequency distribution of different types of physical measurements. 
We first introduced the normal distribution in Chapter 14 as an example of a continuous probability distribution. In this section we present some applications of this probability distribution in data analysis.

The probability density function, pdf, for a general normal distribution, X, with a mean value, µ, and a standard deviation, σ, is given by

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \quad \sigma > 0,\ -\infty < x < \infty.$$

The standard normal distribution has mean value µ = 0 and standard deviation σ = 1.

SCILAB provides function cdfnor for operations with the normal cumulative distribution function. The different forms of the call to the function were presented in detail earlier, and are repeated here:

[p,q] = cdfnor("PQ",x,mu,sigma)
[x] = cdfnor("X",mu,sigma,p,q)
[mu] = cdfnor("Mean",sigma,p,q,x)
[sigma] = cdfnor("Std",p,q,x,mu)

where mu is the mean value (µ), sigma is the standard deviation (σ), p = P(X<x), and q = 1 - p = P(X>x). The first argument in the different calls to cdfnor is a string that indicates the type of result expected:

"PQ"   - to request probabilities p and q
"X"    - to request a value of the normal variable
"Mean" - to request the mean of the distribution
"Std"  - to request the standard deviation of the distribution

Because the normal distribution is commonly found in the analysis of physical measurements, it is often recommended that you check whether your data set (your sample) follows the normal distribution. In this section we present two graphical approaches for checking if your data follows the normal distribution. The first consists of superimposing a normal distribution pdf, based on the mean value and standard deviation of the sample, on top of the sample histogram. The second approach consists of plotting the data against what is commonly known as their normal scores.
The resulting graph is equivalent to plotting the data on normal probability paper, i.e., a paper with one scale representing the normal probability corresponding to the data set. These two approaches are described next.

Plotting a histogram and its corresponding normal curve

The purpose of this plot is to visually check if the histogram of a sample, with a suitable number of classes, matches a superimposed normal curve. For that purpose we propose the following SCILAB user-defined function, histnorm:

function [chi2,cmark,fcount]=histnorm(x, xclass)
//This function calculates the frequency distribution
//for the data in (row) vector x according to the
//class boundaries contained in the (row) vector
//xclass. It also produces a histogram of the
//data and the normal curve that best fit the data.
//
//Typical call: [chi2,cm,f] = histnorm(x,xclass)
//where cm = class marks, f = frequency count,
//      chi2 = chi-square parameter for the fitting

[m n] = size(x);          //Sample size
[m nB] = size(xclass);    //Number of class boundaries
k = nB - 1;               //Number of classes

//Calculate class marks
cmark = zeros(1,k);
for ii = 1:k
    cmark(ii) = 0.5*(xclass(ii)+xclass(ii+1));
end

//Initialize frequency counts to zero
fcount=zeros(1,k); fbelow=0; fabove=0;

//Accumulate frequency counts
for ii = 1:n
    if x(ii) < xclass(1)
        fbelow = fbelow + 1;
    elseif x(ii) > xclass(nB)
        fabove = fabove + 1;
    else
        for jj = 1:k
            if x(ii)>= xclass(jj) & x(ii)< xclass(jj+1)
                fcount(jj) = fcount(jj) +1;
            end
        end
    end
end

//define normal CDF, calculate xbar, sx, chi-square parameter
nn = sum(fcount);
xbar = mean(x);
sx = st_deviation(x);
xmin = min(xclass);
xmax = max(xclass);
pk = [];
for j = 1:k+1
    pk = [pk cdfnor("PQ",xclass(j),xbar,sx)];
end;
p_in_classes = pk(k+1)-pk(1);
pxclass = pk(2:k+1) - pk(1:k);
fc = pxclass*nn*p_in_classes;

//Chi square parameter
chi2=0;
for j = 1:length(fc)
    chi2 = chi2 + (fcount(j)-fc(j))^2/fc(j);
end;

//Produce normal distribution for data
Dx = (xmax-xmin)/100;
xx = [xmin:Dx:xmax];
xxx = xx(1:100) + Dx/2;
pkk = [];
for j = 1:101
    pkk = [pkk cdfnor("PQ",xx(j),xbar,sx)];
end;
pp = pkk(2:101) - pkk(1:100);
fcc = pp*p_in_classes*nn*100/k;

//Determine plot rectangle
ymin = 0; ymaxf = max(fcount); ymaxy = max(fcc);
ymax = max(ymaxf,ymaxy); ymax = int(1.1*ymax);
plotrectangle = [xmin ymin xmax ymax];

//plot the histogram and normal curve
xp = xclass(1:k);
xset('window',1);xbasc(1);
plot2d2('onn',xclass',[fcount fcount(k)]',[1],'011','y',[xmin ymin xmax ymax]);
plot2d3('onn',xp',fcount',[1],'000');
plot2d(xxx',fcc',[2],'000');
xtitle('Histogram with normal curve','x','frequency');
//end function histnorm

Notice that this function uses SCILAB function cdfnor to calculate values of the cumulative distribution function for the normal distribution where needed. The general call to the function is:

[chi2,cm,f] = histnorm(x,xclass)

which returns, in general, the chi-square parameter, chi2, the class marks, cm, and the frequency count, f. The chi-square parameter is defined as

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - fc_i)^2}{fc_i},$$

where f_i is the actual frequency count for the ith class, fc_i is the estimated frequency count obtained from the normal distribution for the ith class, and k is the number of classes in the frequency distribution. The χ² parameter follows the chi-square distribution with ν = k-1 degrees of freedom, and it is used to check the hypothesis that the frequency distribution under consideration does indeed follow the normal distribution. The subject of hypothesis testing is developed in Chapter …; therefore, we delay until then the use of the parameter returned from function histnorm.

Application of the function histnorm

In this example we apply function histnorm to a set of 200 data values between 0 and 100 generated using function rand as follows:

-->x = int(100*rand(1,200));

First, we check the minimum and maximum value of the data:

-->min(x), max(x)
 ans  =
    0.
 ans  =
    99.

A set of class boundaries of 0, 10, 20, …, 100, will produce 10 classes for this sample:

-->xclass = [0:10:100];

Next, we load the function histnorm and apply the function to the data stored in x using the class boundaries stored in xclass:

-->getf('histnorm')
-->histnorm(x,xclass)
 ans  =
    1.9583514

The value returned is the chi-square parameter for the normal curve fitting. The plot of the histogram with the superimposed normal curve is:

A second example for the same data sample is presented next, in which we use 20 classes, with class boundaries 0, 5, 10, …, 95, 100, to classify the data:

-->xclass=[0:5:100];

The results from function histnorm are the chi-square parameter and the following plot:

-->histnorm(x,xclass)
 ans  =
    2.0146916

The function can be invoked with a vector of three values in the left-hand side to produce not only the chi-square parameter and the plot, but also the class marks and the frequency count of the sample:

-->[X2,cm,f] = histnorm(x,[0:10:100])
 f  =
         column 1 to 9
!   20.    18.    27.    18.    23.    22.    16.    18.    14. !
         column 10
!   24. !
 cm  =
!   5.    15.    25.    35.    45.    55.    65.    75.    85.    95. !
 X2  =
    1.9583514

Notice that in the two graphs shown above, the normal curve does not fit the histograms very well. The main reason is that the data was generated from a uniform distribution (i.e., using the default settings of SCILAB’s function rand) and not from a normal distribution. Later in this chapter we deal with the generation of data from distributions other than the uniform distribution, and will be using function histnorm to check how well those data fit the normal distribution.
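The chi-square parameter defined earlier is a sum over classes, so it can also be evaluated in a single vectorized statement. The sketch below is an illustration only: it takes the frequency counts f printed above and, as an assumed alternative to the normal-curve estimate used inside histnorm, uses equal expected counts per class, i.e., the uniform distribution from which the data were actually drawn. The names fc_unif and chi2_unif are hypothetical and do not appear in histnorm.

```scilab
// Frequency counts returned by histnorm for the 10-class example above
f = [20 18 27 18 23 22 16 18 14 24];
k = length(f);                        // number of classes
nn = sum(f);                          // number of data points counted (200)
// Assumed expected counts under a uniform distribution (NOT what histnorm uses)
fc_unif = (nn/k)*ones(1,k);
// Vectorized chi-square parameter: sum of (f_i - fc_i)^2/fc_i
chi2_unif = sum((f-fc_unif).^2 ./ fc_unif);
disp(chi2_unif);                      // 7.1 for these counts
```

Because the expected counts here come from a different distribution, the result is not comparable with the value 1.9583514 returned by histnorm; the sketch only shows how the summation is carried out.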
Plotting data against their normal scores

Assume that the continuous random variable X follows the normal distribution with mean µ and standard deviation σ. Given a probability p (0<p<1) such that P(X<x) = p with X ~ N(µ,σ), the value of x is referred to as the normal score for p. [Note: In some references in the statistical literature the normal scores are related to a probability α = 1 - p, so that if P(X>xα) = α, with X ~ N(µ,σ), xα is the normal score for α.]

Suppose that we have an ordered data set, xp = {xp1 < xp2 < … < xpn}, that follows the normal distribution with mean and standard deviation equal to the sample’s mean (x̄) and standard deviation (sx). Also, assume that the probability of the interval [xpi, xpi+1] is the same for all values of i = 1, 2, …, n-1, say P(xpi<X<xpi+1) = q. Also, assume that P(X<xp1) = P(X>xpn) = q. Thus, the entire area under the normal curve is split into n+1 sub-regions of the same area q as illustrated in the figure below.

The value of q is, therefore, q = 1/(n+1), and we can write: P(X<xp1) = q, P(X<xp2) = 2q, …, P(X<xpi) = iq, …, P(X<xpn) = nq. In general, P(X<xpi) = i/(n+1) = pi, for i = 1, 2, …, n. The values pi are referred to as plotting positions, for they are used to obtain the corresponding normal scores xpi.

Given an ordered data set, x = {x1 < x2 < … < xn}, of size n, we can generate a vector of plotting positions, pi = i/(n+1), and obtain a set of normal scores xpi by using the function call cdfnor("X",x̄,sx,pi,1-pi), where x̄ and sx are the mean and standard deviation of the data set. If the given data set, x, does indeed follow the normal distribution with mean µ = x̄ and standard deviation σ = sx, a plot of normal scores xp versus the original data x should produce a straight line.
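The calculation of plotting positions and normal scores described above can be sketched for a small sample as follows. The five data values are hypothetical and chosen only for illustration; the mean and standard deviation are computed from their definitions so that the sketch does not depend on any helper function.

```scilab
// Small illustrative sample, already sorted in increasing order (hypothetical data)
x = [12 15 17 20 26];
n = length(x);
xbar = sum(x)/n;                        // sample mean
sx = sqrt(sum((x-xbar).^2)/(n-1));      // sample standard deviation
p = (1:n)/(n+1);                        // plotting positions p_i = i/(n+1)
xp = [];
for i = 1:n
    // normal score: value with P(X<xp_i) = p_i for X ~ N(xbar,sx)
    xp = [xp cdfnor("X",xbar,sx,p(i),1-p(i))];
end
disp([x; xp]);                          // data in row 1, normal scores in row 2
```

With an odd sample size the middle plotting position is 0.5, so the middle normal score coincides with the sample mean, which is a convenient sanity check.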
A function to produce a plot of data versus normal scores

The following function, normplot, takes as input a data set, or sample, x = {x1, x2, …, xn}, sorts it in increasing order, obtains the plotting positions pi, calculates the normal scores xpi, and plots the normal scores versus the ordered data. It also plots a straight line representing y = x, or the exact fitting for a normal distribution. The closer the plot of normal scores vs. data is to the straight line representing the exact fitting for a normal distribution, the more closely the data set follows the normal distribution.

function normplot(x)
//This function produces a normal probability
//paper plot for the data in (row) vector x

xx = sortup(x);           //order sample in increasing order
xm = mean(xx);            //mean of sample
sx = st_deviation(xx);    //standard deviation of sample
nn = length(x);           //sample size

//Calculating plotting positions and normal scores
pp = []; xp = [];
for j = 1:nn
    pp = [pp j/(nn+1)];
    xp = [xp cdfnor("X",xm,sx,pp(j),1-pp(j))];
end;

//Determine the plotting rectangle
xmin1 = min(xx); xmin2 = min(xp); xmin = min(xmin1,xmin2);
xmax1 = max(xx); xmax2 = max(xp); xmax = max(xmax1,xmax2);
ymin = min(xp); ymax = max(xp);

//Produce a graduated scale
[xminp, xmaxp, nxp] = graduate(xmin,xmax);
[yminp, ymaxp, nyp] = graduate(ymin,ymax);

//Plot scores vs. data and exact normal distribution fitting
plot2d(xp',xp',[ 1],'011','y',[xminp yminp xmaxp ymaxp])
xset('mark',-9,2);
plot2d(xx',xp',[-9],'011','y',[xminp yminp xmaxp ymaxp])
xtitle('Normal probability plot','x','z');
//end function normplot

An application of this function is shown next. First, we produce a sample of 200 data points using a uniform distribution. Next, we load function normplot and produce the normal probability plot.
-->x =int(100*rand(1,200));
-->getf('normplot')
-->normplot(x)

The resulting graph shows that the data does not follow the normal distribution, particularly near the lowest and highest values of the data set.

The lognormal distribution

If the random variable Y = ln(X) follows the normal distribution with mean µY = µln(X) and standard deviation σY = σln(X), then we say that the random variable X follows the lognormal distribution. The probability density function of the lognormal distribution is given by

$$f_X(x) = \frac{1}{\sigma_{\ln(X)}\, x \sqrt{2\pi}} \exp\left[-\frac{(\ln x - \mu_{\ln(X)})^2}{2\sigma_{\ln(X)}^2}\right], \quad x > 0,$$

with

$$\mu_X = \exp\left(\mu_{\ln(X)} + \frac{1}{2}\sigma_{\ln(X)}^2\right), \quad \mathrm{Var}(X) = \exp(\sigma_{\ln(X)}^2)\left(\exp(\sigma_{\ln(X)}^2) - 1\right)\exp(2\mu_{\ln(X)}).$$

For calculating probabilities we can use the normal distribution cdf by first calculating the natural log of the variable. For example, if X ~ lognormal(µln(X) = 1.2, σln(X) = 0.5), to calculate the probability P(X<2) use P(X<2) = P(ln(X)<ln(2)) = P(Y<0.6931), where Y ~ N(1.2, 0.5). We can use function cdfnor to calculate this probability in SCILAB as follows:

-->cdfnor("PQ",log(2),1.2,0.5)
 ans  =
    .1553616

Suppose that we want to find the inverse cumulative distribution function, i.e., a value of x for which P(X<x) = 0.35, given µln(X) = 1.2, σln(X) = 0.5. We can use:

-->cdfnor("X",1.2,0.5,0.35,0.65)
 ans  =
    1.0073398

The previous result actually gives a value of Y = ln(X) with Y ~ N(1.2, 0.5). The corresponding value of X is calculated as X = exp(Y), i.e.,

-->exp(ans)
 ans  =
    2.7383068

A graph of the lognormal probability density function for µln(X) = 1.2, σln(X) = 0.5 is produced by using:

-->deff('[ff]=fX(x,mu,sigma)',...
-->'ff=exp(-(log(x)-mu).^2./(2.*sigma.^2))./(sigma.*x.*sqrt(2.*%pi))')
-->mu=1.2;sigma=0.5;xx=[0.01:0.1:10];yy=fX(xx,mu,sigma);
-->plot(xx,yy,'x','fX(x)','Log-normal pdf')

Generating synthetic data

In this section we present pre-defined and user-defined functions that allow us to generate data that follows a particular probability distribution. We refer to such data as synthetic data.

Generating normally-distributed synthetic data

In the examples presented in the previous section on applications of the normal distribution we generated data by using the function rand, which, by default, produces random data uniformly distributed in the interval [0,1]. The function rand can also be used to produce normally distributed data, z, that follows the standard normal distribution, i.e., Z ~ N(0,1), by first using the function call rand('normal') and next using the function call rand(n,m), where n and m are integers. The last call to function rand will produce a matrix of n rows and m columns whose elements are random numbers following the standard normal distribution. Recalling that the standardized normal variate is defined as Z = (X-µ)/σ, values of x can be obtained from values of z by using x = µ + σz.

The following example illustrates how to use function rand to produce 200 data points that follow the normal distribution with mean µ = 150 and standard deviation σ = 50:

-->rand('normal');
-->x = 150 + 50.*rand(1,200);

To verify that the data do indeed follow the normal distribution, we use functions histnorm and normplot applied to this data set. To use function histnorm, we first determine the minimum and maximum values of the data set to decide which class boundaries to use in the histogram:

-->xmin = min(x), xmax = max(x)
 xmin  =
    34.558873
 xmax  =
    317.59609

We select for class boundaries the values 25, 50, 75, …, 300, 325:

-->xclass = [25:25:325];

The resulting histogram and superimposed normal curve are shown next:

-->histnorm(x,xclass);

The fitting of the histogram to the corresponding normal curve is relatively good, in spite of the apparent discrepancy towards the center of the data.
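The transformation x = µ + σz used to produce this data set can also be checked numerically. The sketch below is an illustration under the assumption that rand is first switched to its normal mode, as described in the text; it generates the sample and compares the sample mean and standard deviation with the target values. The names z, xbar, and sx are local to the sketch.

```scilab
// Generate N(150,50) data from standard normal variates and check the moments
rand('normal');                       // switch rand to the standard normal generator
z = rand(1,200);                      // z ~ N(0,1)
x = 150 + 50*z;                       // x = mu + sigma*z
n = length(x);
xbar = sum(x)/n;                      // sample mean, expected near 150
sx = sqrt(sum((x-xbar).^2)/(n-1));    // sample standard deviation, expected near 50
disp([xbar sx]);
```

The generator is left in normal mode here, so later calls to rand will keep producing normal variates until rand('uniform') is issued.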
We can also use function normplot to check the normality of the data as follows:

-->normplot(x)

The resulting normal probability plot is:

The plot suggests that the data follows the normal distribution for most of the range except for values larger than about 220.

Additional applications of function rand

SCILAB’s function rand, like most numerical random number generators, uses a number, known as the seed, to produce random numbers. To find out the current value of the seed in function rand use:

-->rand('seed')
 ans  =
    8.096E+08

To find out which type of random number generator is active in function rand (i.e., normal or uniform) use:

-->rand('info')
 ans  =
 normal

To change the function rand back to uniform use:

-->rand('uniform')

To change the seed to the number 15, for example, use:

-->rand('seed',15)

The first 10 random numbers generated by rand after seeding it with a value of 15 are:

-->rand(1,10)
 ans  =
         column 1 to 5
!   .1018111    .5348560    .9628528    .1235873    .6667947 !
         column 6 to 10
!   .4106913    .6578733    .6756193    .1201851    .0268646 !

After generating those 10 random numbers the value of the seed has changed to:

-->rand('seed')
 ans  =
    57691269.

If, for some reason, you need to re-start the previous sequence of random numbers, you can simply re-seed function rand with the value of 15:

-->rand('seed',15)

Check that you get the same sequence of random numbers by comparing the following 5 random numbers with the first 5 random numbers generated earlier after using seed = 15:

-->rand(1,5)
 ans  =
!   .1018111    .5348560    .9628528    .1235873    .6667947 !

SCILAB function for generating synthetic data

SCILAB provides function grand (generating random numbers) to generate a vector or matrix with data that follows, among others, the following distributions: binomial, Poisson, gamma, beta, exponential, uniform integer, uniform real, normal, chi-squared, and F.
Two general calls to the function are:

[x] = grand(m,n,dist_type,dist_parameters)
[x] = grand(A,dist_type,dist_parameters)

where dist_type is a string identifying the type of distribution, and dist_parameters is a list of the parameters defining the distribution. In the first form of the call the values m and n represent the number of rows and columns of a matrix to be generated containing random numbers that follow the desired distribution. In the second form of the function call an existing matrix A is provided so that the function generates a new matrix with the same dimensions as A containing the random numbers that follow the desired distribution.

The following strings identify the type of distribution requested. We also identify the parameters required for each distribution:

String    Distribution       Parameters
'bin'     Binomial           N, P
'poi'     Poisson            λ
'bet'     Beta               α, β
'gam'     Gamma              α = shape, β = scale
'exp'     exponential        µ = 1/β
'nor'     normal             µ, σ
'chi'     chi-square         ν
'f'       F                  νN, νD
'uin'     uniform integer    a, b
'unf'     uniform real       a, b

The specific function calls for each probability distribution are shown next:

Binomial:         x=grand(m,n,'bin',N,P),     x=grand(A,'bin',N,P)
Poisson:          x=grand(m,n,'poi',λ),       x=grand(A,'poi',λ)
Beta:             x=grand(m,n,'bet',α,β),     x=grand(A,'bet',α,β)
Gamma:            x=grand(m,n,'gam',α,β),     x=grand(A,'gam',α,β)
Exponential:      x=grand(m,n,'exp',µ),       x=grand(A,'exp',µ)
Normal:           x=grand(m,n,'nor',µ,σ),     x=grand(A,'nor',µ,σ)
Chi-square:       x=grand(m,n,'chi',ν),       x=grand(A,'chi',ν)
F-distribution:   x=grand(m,n,'f',νN,νD),     x=grand(A,'f',νN,νD)
Uniform integer:  x=grand(m,n,'uin',a,b),     x=grand(A,'uin',a,b)
Uniform real:     x=grand(m,n,'unf',a,b),     x=grand(A,'unf',a,b)

Examples of synthetic data generation using function grand

The following examples demonstrate how to use function grand to generate sets of 200 data points that follow specific probability distributions.
After the data are generated we determine their maximum and minimum values, select class boundaries for histograms of the data, and use functions histnorm and normplot to check how close the data are to normality. We start the exercises by loading these two functions:

-->getf('histnorm');getf('normplot');

Binomial data

-->x=grand(1,200,'bin',20,0.35);xmin=min(x),xmax=max(x)
 xmin  =
    2.
 xmax  =
    14.

-->xclass=[2:2:14];xset('window',1);histnorm(x,xclass);
-->xset('window',2);normplot(x);

Poisson data

-->x=grand(1,200,'poi',12.5);xmin=min(x),xmax=max(x)
 xmin  =
    4.
 xmax  =
    23.

-->xclass=[4:2:24];xset('window',1);histnorm(x,xclass);
-->xset('window',2);normplot(x);

Beta data

-->x=grand(1,200,'bet',2,3);xmin=min(x),xmax=max(x)
 xmin  =
    .0480813
 xmax  =
    .9132797

-->xclass=[0:0.1:1];xset('window',1);histnorm(x,xclass);
-->xset('window',2);normplot(x);

Gamma data

-->x=grand(1,200,'gam',2,3);xmin=min(x),xmax=max(x)
 xmin  =
    .0042184
 xmax  =
    2.6455776

-->xclass=[0:0.4:2.8];xset('window',1);histnorm(x,xclass);
-->xset('window',2);normplot(x);

Normal data

-->x=grand(1,200,'nor',2500,1250);xmin=min(x),xmax=max(x)
 xmin  =
    1294.6718
 xmax  =
    6467.2541

-->xclass=[-1000:1000:7000];xset('window',1);histnorm(x,xclass);
-->xset('window',2);normplot(x);

Chi-square data

-->x=grand(1,200,'chi',12);xmin=min(x),xmax=max(x)
 xmin  =
    3.8312405
 xmax  =
    28.583772

-->xclass=[0:3:30];xset('window',1);histnorm(x,xclass);
-->xset('window',2);normplot(x);

F distribution data

-->x=grand(1,200,'f',10,5);xmin=min(x),xmax=max(x)
 xmin  =
    .110966
 xmax  =
    53.694396

-->xclass=[0:10:60];xset('window',1);histnorm(x,xclass);
-->xset('window',2);normplot(x);
-->xclass=[0:2:12];histnorm(x,xclass);
-->xclass=[0:0.5:6];histnorm(x,xclass);

Uniform integer data

-->x=grand(1,200,'uin',-5,5);xmin=min(x),xmax=max(x)
 xmin  =
  - 5.
 xmax  =
    5.

-->xclass=[-5:1:5];xset('window',1);histnorm(x,xclass);
-->xset('window',2);normplot(x);

Uniform real data

-->x=grand(1,200,'unf',-5,5);xmin=min(x),xmax=max(x)
 xmin  =
  - 4.9677424
 xmax  =
    4.9660118

-->xclass=[-5:1:5];xset('window',1);histnorm(x,xclass);
-->xset('window',2);normplot(x);

Additional notes on function grand

The previous examples were used to illustrate applications of function grand to the generation of data that follows the binomial, Poisson, gamma, beta, exponential, normal, chi-square, F-, uniform integer, and uniform real distributions. Function grand allows the user to obtain data that follow other distributions that are not presented in this book, such as the negative binomial distribution, the multinomial distribution, the non-central F distribution, and the non-central chi-square distribution. (To find information about these and other distributions consult a statistics and probability textbook such as Spanos, A., 1999, “Probability Theory and Statistical Inference - Econometric Modeling with Observational Data,” Cambridge University Press, Cambridge, U.K.) To obtain additional details on the use of function grand use:

-->help grand

Function grand has access to 32 different random number generators that constitute the basis upon which random numbers that follow a particular probability distribution are generated. By default, functions rand and grand use generator number 1. To check which random number generator is currently active use:

-->grand('getcgn')
 ans  =
    1.

This result indicates that you are currently using SCILAB’s default random number generator.
The random number generators provided by SCILAB for use with function grand require two seed numbers. To see the current seed numbers you can use the statement:

-->seeds = grand('getsd')
 seeds  =
    1.0E+08 *
!   20.45933    9.2172801 !

You can re-initialize those seeds to the original seeds by using:

-->grand('initgn',-1)
 ans  =
    1.

We can check the initial seeds after re-initialization by using:

-->seeds = grand('getsd')
 seeds  =
    1.0E+08 *
!   12.345679    1.2345679 !

You can also re-seed the generator (i.e., provide new seeds) by using the following call to function grand:

-->grand('setall',10,20)
 ans  =
 setall

To check that the new seeds are active use:

-->seeds=grand('getsd')
 seeds  =
!   10.    20. !

To change the random number generator from generator number 1 to generator number 5, for example, use:

-->grand('setcgn',5)
 ans  =
    5.

The following call to function grand can be used to verify that the change of generator has been made:

-->grand('getcgn')
 ans  =
    5.

To check the values of the seeds for the current generator use:

-->seeds=grand('getsd')
 seeds  =
!   3.795E+08    77757764. !

Pseudo-random generators

The random number generators used in SCILAB and other computer applications are known as pseudo-random generators because, after generating a sufficiently long sequence of numbers, the numbers start repeating. Therefore, they are not strictly random generators, but only quasi-random or pseudo-random.

The random number generator provided with SCILAB is able to produce 2.3×10^18 numbers before repetition of numbers occurs. This collection of numbers is partitioned into 32 pseudo-random generators, each containing 2^20 = 1,048,576 blocks of non-overlapping random numbers. Each block is 2^30 = 1,073,741,824 in length.

Given the size of the sequences of random numbers that can be generated with each of SCILAB’s 32 pseudo-random number generators, we are confident that the numbers thus generated are random enough for most practical applications. Furthermore, use of the default generator should be enough for most applications.

Another application of function grand is in the generation of permutations of a column vector. For example, the following application produces 10 permutations of the vector M containing the first five positive integers. The permutations are shown as columns of a matrix.

-->M = [1 2 3 4 5]';
-->grand(10,'prm',M)
 ans  =
!   1.    2.    4.    1.    4.    4.    5.    4.    1.    3. !
!   3.    1.    2.    4.    2.    2.    1.    3.    4.    2. !
!   2.    3.    5.    5.    5.    3.    2.    2.    2.    5. !
!   5.    4.    3.    3.    3.    5.    3.    1.    3.    4. !
!   4.    5.    1.    2.    1.    1.    4.    5.    5.    1. !

Generating log-normally-distributed data

To generate log-normally distributed data we first generate a set of normally distributed data and then apply the exponential function to that data set. For example, if X follows the lognormal distribution with µln(X) = 1.2, σln(X) = 0.5, we can use the following SCILAB commands to generate a set of 200 data points. We apply functions histnorm and normplot to this data set to check how close the data are to normality.

-->y=grand(1,200,'nor',1.2,0.5);          //Generate normal data N(1.2,0.5)
-->x=exp(y);                              //Generate log-normal data by using exp
-->xmin=min(x),xmax=max(x)                //Determine min and max values
 xmin  =
    1.1210567
 xmax  =
    11.161347

-->xclass=[0:2:12];histnorm(x,xclass);    //Histogram
-->normplot(x);                           //Normal probability plot

Generating data that follows the Weibull distribution

SCILAB does not provide a function to generate data that follows the Weibull distribution. However, using the uniformly-generated random numbers from function rand we can generate numbers p between 0 and 1 that represent probabilities p = FX(x) = P(X<x).
Next, we use the cumulative distribution function for the Weibull distribution, namely,

$$F(x) = 1 - \exp(-\alpha \cdot x^{\beta}), \quad \text{for } x > 0,\ \alpha > 0,\ \beta > 0,$$

and solve for x given values of p, i.e.,

$$x = \left[-\frac{\ln(1-p)}{\alpha}\right]^{1/\beta}.$$

The following SCILAB commands are used to generate 200 data points that follow the Weibull distribution with α = 2, β = 3. We also use functions histnorm and normplot to check how close these data are to normality.

-->getf('histnorm');getf('normplot')    //Load functions
-->p=rand(1,200);                       //Generate probabilities
-->a=2; b=3;                            //parameters of Weibull distribution
-->x = (-log(1-p)/a).^(1/b);            //generate Weibull data
-->xmin=min(x), xmax = max(x)           //check data range
 xmin  =
    .1230276
 xmax  =
    1.3553315

-->xclass = [0:0.1:1.4];                //select classes for histogram
-->histnorm(x,xclass);                  //plot histogram and normal curve
-->normplot(x)                          //create normal probability plot

It is interesting to notice that this Weibull data is very close to normality.

Generating data that follows the Student’s t distribution

Function grand does not allow for the generation of data following the Student’s t distribution. However, SCILAB provides function cdft, which lets you obtain the inverse of the cumulative distribution function. Using an approach similar to that shown above for the Weibull distribution, we can generate random probability values through function rand, and then use function cdft to generate the data required.
The following example illustrates the procedure:

-->getf('histnorm');getf('normplot');    //Load functions histnorm & normplot
-->pp = rand(1,200);                     //Generate random probabilities
-->x = [];                               //This line and the for … end
-->for j =1:200                          //construct calculate values of x
-->  x = [x cdft("T",6,pp(j),1-pp(j))];
-->end;
-->xmin=min(x), xmax=max(x)              //Determine min & max values
 xmin  =
  - 6.9441809
 xmax  =
    3.4425429

-->xclass=[-7:1:4];xset('window',1);histnorm(x,xclass);    //Histogram
-->xset('window',2);normplot(x);                           //Normal probability plot

Generating data that follows a discrete distribution

Using function grand we were able to generate discrete data that follows the binomial, Poisson, and uniform integer distributions. In this section we present a general method for the generation of data given a discrete distribution in the form of a table. For example, the following table shows the probability mass function, fX(x) = P(X=x), and cumulative distribution function, FX(x) = P(X≤x), of a discrete random variable X:

                             Random numbers
  X      fX(x)    FX(x)      From     to
 0.5     0.10     0.10       0.00     0.10
 1.5     0.25     0.35       0.10     0.35
 2.5     0.20     0.55       0.35     0.55
 3.5     0.15     0.70       0.55     0.70
 4.5     0.15     0.85       0.70     0.85
 5.5     0.15     1.00       0.85     1.00

The last two columns of the table represent the range of probabilities corresponding to the cumulative distribution function for each value of X. The procedure for generating data consists of obtaining a random probability value p = P(X≤x) from a uniform distribution, e.g., using function rand, and then assigning a value of X according to the range of values of the random numbers. Thus, if function rand produces the random number 0.25, we assign to x the corresponding value X = 1.5.
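The table lookup just described can be sketched for a single random number as follows. This is only an illustration of the idea, using SCILAB's find function to locate the first interval whose upper limit reaches p; the fixed value p = 0.25 stands in for a number that rand might produce, and the index name k is local to the sketch.

```scilab
// Table of the discrete distribution shown above
X  = [0.5 1.5 2.5 3.5 4.5 5.5];           // values of the random variable
FX = [0.10 0.35 0.55 0.70 0.85 1.00];     // cumulative distribution function
p = 0.25;                                  // a random number, as rand might produce
k = min(find(p <= FX));                    // first interval whose upper limit reaches p
x = X(k);
disp(x);                                   // 1.5, as in the text
```

Replacing the fixed value of p with a call to rand(), with rand in its default uniform mode, yields one random value of X per call.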
The following function, discrand, will generate an n×m matrix of random numbers given vectors of values of X and FX, representing the values of a discrete random variable and its corresponding cumulative distribution function.

function [x] = discrand(n,m,xx,FX)
//A function to generate a matrix nxm
//following a discrete probability distribution
//represented by vectors xx and FX = P(X<=xx)
nx = length(xx);
pp = rand(n,m);
x = zeros(n,m);
FXX = [0.00 FX];
for i = 1:n
    for j = 1:m
        for k = 1:nx
            if pp(i,j)>FXX(k) & pp(i,j)<=FXX(k+1) then
                x(i,j) = xx(k);
            end;
        end;
    end;
end;
//end function discrand

An application of the function to generate 200 data points that follow the probability distribution shown in the table above is presented next. We first load function discrand, then enter the values of X and FX(x), and generate a row vector of 200 points. Next, we load functions histnorm and normplot to check how well the data follows a normal distribution.

-->getf('discrand')
-->X = [0.5:1.0:5.5]; FX = [0.10,0.35,0.55,0.70,0.85,1.00];
-->x=discrand(1,200,X,FX);
-->getf('histnorm');getf('normplot');
-->xmin=min(x), xmax=max(x)
 xmin  =
    .5
 xmax  =
    5.5

-->xclass=[0.5:0.5:5.5];
-->histnorm(x,xclass)
 ans  =
    24.643214

-->normplot(x)

Statistical simulation

Many physical and other types of systems are described by one or more mathematical relationships (e.g., algebraic, difference, or differential equations) of diverse degrees of complexity. We will refer to the set of mathematical relationships that describe a physical system as a model. A model typically depends on certain constant values known as the parameters of the model. In the simplest of cases, a model can be represented by a black box into which a set of input data is provided, and from which a set of output results is obtained.
This is illustrated in the following figure:

If the model is such that for a given set of input data it always produces a predictable result, it is referred to as a deterministic model. An example of a deterministic model is the equation that describes the electric current, I, through a resistor, R, when a voltage, V, is applied across the terminals of the resistor. The equation is I = V/R. If we apply a constant voltage V0 to the resistor, we get back a constant electric current, I0 = V0/R. If we instead apply a variable voltage V(t) = V0⋅sin(ωt), we obtain an electric current, I(t) = (V0/R)⋅sin(ωt). Thus, knowing the value of the resistance R and the input to the system, i.e., the voltage, V0 or V(t), we can always find the value of the electric current. We cannot get more deterministic than this example.

If the input to the model is of a random nature, or if there is a random component to the model itself, the model is said to be probabilistic or stochastic. For example, the black-box model described above can be used to describe a hydrological basin. The input data is the amount and duration of the precipitation falling on the basin over a certain period of time. (A graphical representation of precipitation vs. time is referred to as a hyetograph.) This input is, by its own nature, random or stochastic. This means that we cannot know exactly the amount of precipitation that will occur, say, in the next 24 hours. Although a hydrological basin is far more complicated than an electric resistor, the model used to predict the runoff (output) of the system can be a simple relationship involving one or two parameters. (A graphical representation of the runoff coming out of the basin as a function of time is known as a hydrograph.) If the input hyetograph is known, then the output hydrograph can be obtained in a deterministic way.
However, because we do not know the input hyetograph for a particular period of time exactly, but only in a statistical manner, the model is indeed a stochastic one. By keeping historical records of precipitation in the basin we can get a good idea of the stochastic nature of precipitation to use as input for our stochastic model. We can then generate synthetic data representing the precipitation and use it as input to the model. This approach to modeling physical (or economic, or other types of) systems is known as a Monte Carlo method. (The name derives from Monte Carlo, the district of the European principality of Monaco famous for its casinos, where the laws of probability are seen in action night and day.)

Monte Carlo methods find applicability in all types of models where there is a random component to the input or parameters of the model. Statistical modeling can be used to model, for example, economic responses from human populations, the distribution of soil permeabilities in an aquifer, the distribution of animal or plant populations, traffic patterns in highways or airports, weather phenomena, etc. A simple application of a Monte Carlo method to simulate the patterns of traffic through a service station is shown below.

Simulating traffic through a service station

Suppose we want to simulate the traffic through a service station in which only one customer can be serviced at a time. We also assume that once a customer arrives at the service station, he or she will not leave until service is provided. This is a simplistic model, but it could be used to simulate a vehicle service station in a city or highway, a medical emergency room, a highway service station for state or privately owned trucks, a store, etc.

The first customer arrives at a certain arrival time, AT1 (Arrival Time).
He or she is taken care of right away, so that the starting time of service for customer 1, ST1 (Starting Time), coincides with his or her arrival time; thus, ST1 = AT1. The waiting time for customer 1 is, therefore, zero, i.e., WT1 = 0. The number of customers awaiting service at this point is also zero, i.e., NW1 = 0. The time required to service this first customer is referred to as TS1 (Time of Service). The first customer leaves the service station at time ET1 = ST1 + TS1 (Ending Time).

The second customer arrives at the service station at a time AT2. If AT2 < ET1 (i.e., the second customer arrives before service for the first one has finished), the second customer must wait until the first customer leaves, so that ST2 becomes ET1 (ST2 = ET1). In this case, we can calculate a waiting time for the second customer equal to WT2 = ET1 - AT2. Also, the number of customers waiting for service at this point is NW2 = 1. If, instead, the second customer arrives at a time AT2 ≥ ET1, then ST2 = AT2, and WT2 = 0. In any event, the ending time for the second customer is calculated as ET2 = ST2 + TS2.

We define the inter-arrival time between customers 1 and 2 as IAT1 = AT2 - AT1. In general, the inter-arrival time between customers i and i+1 is IAT(i) = AT(i+1) - AT(i). The inter-arrival time, IAT(i), and the time of service, TS(i), are considered random variables of discrete nature. Thus, IAT(i) and TS(i) constitute random input to the model.

To simulate the operation of the service center for n customers, we first generate n-1 values of inter-arrival time {IAT(1), IAT(2), …, IAT(n-1)}, as well as n values of the service time {TS(1), TS(2), …, TS(n)}. Then, we proceed to calculate the arrival times as AT(i+1) = AT(i) + IAT(i), i = 1, 2, …, n-1. As indicated earlier, the starting and ending times for the first customer are ST1 = AT1, ET1 = ST1 + TS1.
Also, the waiting time and number of customers waiting at this stage are both zero, i.e., WT1 = 0 and NW1 = 0. The starting time for customer 2 is obtained as follows:

   If AT2 ≥ ET1, then ST2 = AT2, WT2 = 0, and NW2 = 0.
   If AT2 < ET1, then ST2 = ET1, WT2 = ET1 - AT2, and NW2 = 1.

For the third customer, we need to check the arrival time, AT3, against the ending times of both the first and second customers so we can determine the starting time, the waiting time, and the number of customers waiting at that point. The following piece of pseudo-code can be used to determine such values:

for j = 2:n
   NWj = 0
   WTj = 0
   for k = 1:j-1
      if ATj < ETk then
         NWj = NWj + 1
         WTj = ETk - ATj
         STj = ETk
      else
         STj = ATj
      end
   end
   ETj = STj + TSj
end

A user-defined function to simulate traffic through a service station

The steps outlined above are put together in the following function, service:

function [MR] = service(IAT,TS)
//Simulation of traffic in a service station
//Given n-1 values of inter-arrival time IAT
//and n values of time of service TS.
//Results:
//Arrival time = AT, Starting time = ST
//Ending time = ET, Waiting time = WT
//Number of waiting customers = NW
//
n = length(TS);
AT = zeros(1,n); ST = zeros(1,n); ET = zeros(1,n);
NW = zeros(1,n); WT = zeros(1,n);
IATT = [IAT 0];
ST(1) = AT(1);
ET(1) = ST(1) + TS(1);
for j = 2:n
   AT(j) = AT(j-1) + IAT(j-1);
end;
for j = 2:n
   NW(j) = 0;
   WT(j) = 0;
   for k = 1:j-1
      if AT(j) < ET(k) then
         NW(j) = NW(j) + 1;
         WT(j) = ET(k) - AT(j);
         ST(j) = ET(k);
      else
         ST(j) = AT(j);
      end;
   end;
   ET(j) = ST(j)+TS(j);
end;
disp(' ');
printf('===============================================================\n');
printf(' j AT IAT ST TS ET WT NW \n');
printf('===============================================================\n');
for j = 1:n
   printf('%3.0f %8.2f %8.2f %8.2f %8.2f %8.2f %8.2f %3.0f\n',...
   j,AT(j),IATT(j),ST(j),TS(j),ET(j),WT(j),NW(j));
end;
printf('===============================================================\n');
MR = [AT' IATT' ST' TS' ET' WT' NW']; //Matrix of Results
printf('AT = arrival times IAT = inter-arrival times \n');
printf('ST = starting times TS = time of service \n');
printf('ET = ending times WT = waiting times \n');
printf('NW = number of customers waiting \n');
disp(' AT IAT ST TS ET WT NW');
//end function service

As an example, suppose that we have the following inter-arrival times (IAT) and times of service (TS):

-->IAT = [ 0.5 0.75 0.5 0.25 0.5];
-->TS = [ 1 2 1 1 2 1];

We can load function service and run it with the values of IAT and TS defined above to obtain the following results:

-->Matrix_of_results = service(IAT,TS)

===============================================================
  j       AT      IAT       ST       TS       ET       WT   NW
===============================================================
  1     0.00      .50     0.00     1.00     1.00     0.00    0
  2      .50      .75     1.00     2.00     3.00      .50    1
  3     1.25      .50     3.00     1.00     4.00     1.75    1
  4     1.75      .25     4.00     1.00     5.00     2.25    2
  5     2.00      .50     5.00     2.00     7.00     3.00    3
  6     2.50     0.00     7.00     1.00     8.00     4.50    4
===============================================================
AT = arrival times     IAT = inter-arrival times
ST = starting times    TS = time of service
ET = ending times      WT = waiting times
NW = number of customers waiting

      AT      IAT     ST     TS     ET     WT      NW
 Matrix_of_results  =
!   0.      .5      0.     1.     1.     0.      0. !
!   .5      .75     1.     2.     3.     .5      1. !
!   1.25    .5      3.     1.     4.     1.75    1. !
!   1.75    .25     4.     1.     5.     2.25    2. !
!   2.      .5      5.     2.     7.     3.      3. !
!   2.5     0.      7.     1.     8.     4.5     4. !

The function is designed to provide a table of results, as well as a matrix summarizing the results in case additional operations on those results are required within SCILAB. The function, as applied in this case, is purely deterministic in the sense that for the given input we get a unique result.
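Because only one customer is serviced at a time and customers are served in order of arrival, the ending times never decrease, so the starting time of customer j can also be written as the larger of AT(j) and ET(j-1). The following is a minimal sketch of that single-pass formulation, offered as an illustration rather than as the author's implementation. The function name queue_sketch is hypothetical, and the sketch assumes a SCILAB version that accepts function … endfunction definitions inside a script.

```scilab
// Hypothetical helper (not part of the original text): single-pass
// version of the bookkeeping carried out by function service.
function [AT, ST, ET, WT, NW] = queue_sketch(IAT, TS)
    n  = length(TS);
    AT = [0 cumsum(IAT(:)')];          // arrival times, with AT(1) = 0
    AT = AT(1:n);                      // keep one arrival time per customer
    ST = zeros(1, n); ET = zeros(1, n); NW = zeros(1, n);
    ST(1) = AT(1);
    ET(1) = ST(1) + TS(1);
    for j = 2:n
        ST(j) = max(AT(j), ET(j-1));   // wait only if the server is still busy
        ET(j) = ST(j) + TS(j);
        // earlier customers whose service has not ended when customer j arrives
        NW(j) = length(find(ET(1:j-1) > AT(j)));
    end
    WT = ST - AT;                      // waiting time = start of service - arrival
endfunction

IAT = [0.5 0.75 0.5 0.25 0.5];
TS  = [1 2 1 1 2 1];
[AT, ST, ET, WT, NW] = queue_sketch(IAT, TS);
disp([AT' ST' ET' WT' NW'])
```

For the values of IAT and TS used in the example, this sketch gives WT = [0 0.5 1.75 2.25 3 4.5] and NW = [0 1 1 2 3 4], the same as the WT and NW columns in the table printed by function service.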
To work out a stochastic model of traffic through a service station we need to provide random input. The following example shows how to obtain that random input.

Modeling traffic through a service station with random input

Suppose that the inter-arrival times and times of service for the service station model follow the probability distributions shown in the following table:

   x = IAT    FX(x)        x = TS    FX(x)
     0.1      0.05          0.25     0.10
     0.2      0.10          0.50     0.20
     0.3      0.20          0.75     0.40
     0.4      0.35          1.00     0.70
     0.5      0.45          1.25     0.80
     0.6      0.50          1.50     0.90
     0.7      0.70          1.75     0.95
     0.8      0.75          2.00     1.00
     0.9      0.95
     1.0      1.00

We want to analyze the traffic through the service station for 10 customers by generating 9 inter-arrival times and 10 service times from these distributions. The inter-arrival times and times of service can be generated using function discrand as follows:

-->getf('discrand')
-->xIAT = [0.1:0.1:1.0]; FIAT = [0.05,0.1,0.2,0.35,0.45,0.5,0.7,0.75,0.95,1.0];
-->xTS = [0.25:0.25:2]; FTS = [0.1,0.2,0.4,0.7,0.8,0.9,0.95,1];
-->IAT = discrand(1,9,xIAT,FIAT)           //generate IAT data
 IAT  =
!   .4    .7    .7    .5    .4    .7    .5    .9    .1 !
-->TS = discrand(1,10,xTS,FTS)             //generate TS data
 TS  =
!   1.    .75    1.    .75    .5    1.25    .75    .5    1.    .5 !
With these values of IAT and TS we now call function service:

-->M = service(IAT,TS)

===============================================================
  j       AT      IAT       ST       TS       ET       WT   NW
===============================================================
  1     0.00      .40     0.00     1.00     1.00     0.00    0
  2      .40      .70     1.00      .75     1.75      .60    1
  3     1.10      .70     1.75     1.00     2.75      .65    1
  4     1.80      .50     2.75      .75     3.50      .95    1
  5     2.30      .40     3.50      .50     4.00     1.20    2
  6     2.70      .70     4.00     1.25     5.25     1.30    3
  7     3.40      .50     5.25      .75     6.00     1.85    3
  8     3.90      .90     6.00      .50     6.50     2.10    3
  9     4.80      .10     6.50     1.00     7.50     1.70    3
 10     4.90     0.00     7.50      .50     8.00     2.60    4
===============================================================
AT = arrival times     IAT = inter-arrival times
ST = starting times    TS = time of service
ET = ending times      WT = waiting times
NW = number of customers waiting

      AT     IAT    ST      TS      ET      WT      NW
 M  =
!   0.     .4     0.      1.      1.      0.      0. !
!   .4     .7     1.      .75     1.75    .6      1. !
!   1.1    .7     1.75    1.      2.75    .65     1. !
!   1.8    .5     2.75    .75     3.5     .95     1. !
!   2.3    .4     3.5     .5      4.      1.2     2. !
!   2.7    .7     4.      1.25    5.25    1.3     3. !
!   3.4    .5     5.25    .75     6.      1.85    3. !
!   3.9    .9     6.      .5      6.5     2.1     3. !
!   4.8    .1     6.5     1.      7.5     1.7     3. !
!   4.9    0.     7.5     .5      8.      2.6     4. !

Out of the matrix of results, M, we can extract individual columns of data. For example, the waiting time data correspond to the sixth column of M:

-->WT = M(:,6)
 WT  =
!   0.   !
!   .6   !
!   .65  !
!   .95  !
!   1.2  !
!   1.3  !
!   1.85 !
!   2.1  !
!   1.7  !
!   2.6  !

The number of waiting customers is extracted from the seventh column of matrix M:

-->NW = M(:,7)
 NW  =
!   0. !
!   1. !
!   1. !
!   1. !
!   2. !
!   3. !
!   3. !
!   3. !
!   3. !
!   4. !
The columns of data extracted from the matrix of results, M, can be used to obtain statistics such as the mean and standard deviation:

-->WT_mean = mean(WT), WT_sdev = st_deviation(WT)
 WT_mean  =
    1.295
 WT_sdev  =
    .7836701
-->NW_mean = mean(NW), NW_sdev = st_deviation(NW)
 NW_mean  =
    2.1
 NW_sdev  =
    1.2866839

We can also use function normplot to check how close the data is to normality:

-->getf('normplot')
-->normplot(NW')
-->normplot(WT')

STIXBOX: a rudimentary statistics toolbox

STIXBOX (an abbreviation of statistical toolbox) is a collection of functions that perform selected statistical and probability calculations. STIXBOX is available for download from the SCILAB main web page (http://www-rocq.inria.fr/SCILAB/). Instructions for its installation are provided with the downloaded functions. The package includes a set of help manual pages that briefly describe the operation of the functions. Once loaded, the manual pages are available through the main SCILAB Help window.

Probability mass and probability density functions

The names of probability mass functions, or pmf (for discrete random variables), and probability density functions, or pdf (for continuous random variables), start with the letter d, e.g., dbeta, dbinom, etc. Probability mass functions are referred to by pX(k) = P[X=k], and probability density functions by fX(x). Thus, if X ~ Binomial(n,p) with n = 10, p = 0.5, P[X=2] = pX(2) = dbinom(2,10,0.5). And, if X ~ Normal(µ,σ2) with µ = 1.5, σ = 0.2, then fX(1.75) = dnorm(1.75,1.5,0.2).
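The two values quoted above can be cross-checked directly from the defining formulas, without STIXBOX. The following is only a sketch under stated assumptions: the variable names are illustrative, the binomial coefficient is computed with the gamma function (using n! = Γ(n+1)), and the third argument of dnorm is taken to be the standard deviation σ, as the text indicates.

```scilab
// Binomial pmf, P[X=k] = C(n,k) p^k (1-p)^(n-k), for n = 10, p = 0.5, k = 2
n = 10; p = 0.5; k = 2;
Cnk = gamma(n+1)/(gamma(k+1)*gamma(n-k+1));   // binomial coefficient C(10,2) = 45
pX2 = Cnk * p^k * (1-p)^(n-k);                // 45/1024, about 0.0439

// Normal pdf with mu = 1.5, sigma = 0.2, evaluated at x = 1.75
mu = 1.5; sigma = 0.2; x = 1.75;
fX = exp(-(x-mu)^2/(2*sigma^2))/(sigma*sqrt(2*%pi));   // about 0.913
disp([pX2 fX])
```

If STIXBOX is loaded, these numbers should agree with dbinom(2,10,0.5) and dnorm(1.75,1.5,0.2) to within rounding.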
The following probability mass and density functions are defined:

   dbeta      the beta density function
   dbinom     the binomial probability function
   dchisq     the chisquare density function
   df         the F density function [modified by the author, 2/1/2001]
   dgamma     the gamma density function
   dhypgeo    the hypergeometric probability function
   dnorm      the normal density function [modified by the author, 2/1/2001]
   dt         the student t density function

Cumulative distribution functions

Cumulative distribution functions (cdf) are referred to as distribution functions if dealing with continuous variables, or as cumulative probability functions if dealing with discrete variables. All cdfs in this package start with a p: pbeta, pbinom, etc. Both discrete and continuous cdfs are referred to by FX(x) = P[X≤x]. Thus, if X ~ Binomial(n,p) with n = 10, p = 0.5, P[X≤2] = FX(2) = pbinom(2,10,0.5). And, if X ~ Normal(µ,σ2) with µ = 1.5, σ = 0.2, then FX(1.75) = pnorm(1.75,1.5,0.2).

The following cumulative distribution functions are defined:

   pbeta      the beta distribution function
   pbinom     the binomial cumulative probability function
   pchisq     the chisquare distribution function
   pf         the F distribution function
   pgamma     the gamma distribution function
   phypge     the hypergeometric cumulative probability function
   pnorm      the normal distribution function
   pt         the student t cdf (modified by the author, 2/1/2001)

Inverse cumulative distribution functions

Inverse cumulative distribution functions start with q: qbeta, qbinom, etc. If FX(q) = P[X≤q] = p, then q = FX^(-1)(p). The value q is also referred to as a quantile of the distribution.
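The inverse relationship between a cdf and its quantile can be illustrated with SCILAB's built-in function cdfnor, used here in place of the STIXBOX functions pnorm and qnorm. This is a minimal sketch with illustrative variable names: the probability P = FX(1.75) is computed first and then fed back to recover the quantile.

```scilab
// X ~ Normal with mu = 1.5 and sigma = 0.2
mu = 1.5; sigma = 0.2; x = 1.75;

[P, Q] = cdfnor("PQ", x, mu, sigma);   // P = P[X <= x], Q = 1 - P
q = cdfnor("X", mu, sigma, P, Q);      // quantile corresponding to probability P

disp([P q])                            // P is about 0.894; q returns to about 1.75
```

The round trip from x to P and back to q is what the p- and q-functions of the package express as q = FX^(-1)(FX(q)).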
The following inverse cumulative distribution functions are defined:

   qbeta      the beta inverse distribution function
   qbinom     the binomial inverse cdf
   qchisq     the chisquare inverse distribution function
   qf         the F inverse distribution function
   qgamma     the gamma inverse distribution function
   qhypg      the hypergeometric inverse cdf
   qnorm      the normal inverse distribution function
   qt         the student t inverse distribution function
   quantile   empirical quantile (percentile)

Generating synthetic data

The generation of synthetic data that follows a particular distribution can be accomplished with the following random number generators. The names of the random generator functions begin with r: rbeta, rbinom, etc. SCILAB already provides function rand, which produces uniformly distributed random numbers (use help rand for more information). The functions provided by STIXBOX generate random numbers that follow the distributions suggested by the names of the functions. Thus, if you want to generate n = 10 data values x that follow the normal distribution with µ = 0.5 and σ = 0.1, use rnorm(10,0.5,0.1).

   rbeta      random numbers from the beta distribution
   rbinom     random numbers from the binomial distribution
   rchisq     random numbers from the chisquare distribution
   rexpweib   random numbers from the exponential or weibull distributions
   rf         random numbers from the F distribution
   rgamma     random numbers from the gamma distribution
   rgeom      random numbers from the geometric distribution
   rhypg      random numbers from the hypergeometric distribution
   rjbinom    random numbers from the binomial distribution (reject method)
   rjgamma    generates gamma random deviates (reject method)
   rjpoiss    random numbers from the poisson distribution (reject method)
   rnorm      normal random numbers
   rpoiss     random numbers from the poisson distribution (renewal method)
   rt         random numbers from the student t distribution

Logistic regression

These functions involve the logistic population growth model (see, for example, Example 8.3, page 504, in Kottegoda, N.T. and R. Rosso, 1997, Probability, Statistics, and Reliability for Civil and Environmental Engineers, The McGraw-Hill Companies, Inc., New York).

   lodds      log odds function
   loddsinv   compute the inverse of log odds
   logitfit   fit a logistic regression model

Statistical graphics

Functions to produce a variety of statistical graphics. A normal probability paper plot is obtained by using qqnorm. Probability paper plots are also referred to as Q-Q plots. For that reason the corresponding function names start with qq, e.g., qqgamma, qqnorm, etc. Also of interest are functions histo and plotsym.

   histo      plot a histogram
   identify   identify points on a plot by clicking with the mouse
   pairs      pairwise scatter plots (does not work)
   plotdens   draw a nonparametric density estimate
   plotsym    plot with symbols
   qqnorm     normal probability paper plot
   qqplot     empirical quantile vs empirical quantile

Binomial coefficients

   bincoef    calculates binomial coefficients: (n r) = n!/(r!(n-r)!)

Resampling methods

These methods apply to the process of resampling, by which an attempt is made to remove any existing bias in the sample. For a quick introduction to the jackknife (named so because the jackknife, like this method, is a useful tool) and the bootstrap (named so from the expression "lifting oneself by one's bootstraps"), see pp. 116-117 in Kottegoda, N.T. and R. Rosso, 1997, Probability, Statistics, and Reliability for Civil and Environmental Engineers, The McGraw-Hill Companies, Inc., New York.

   covboot    bootstrap estimate of the variance of a parameter estimate
   covjack    jackknife estimate of the variance of a parameter estimate
   stdboot    bootstrap estimate of the parameter standard deviation
   stdjack    jackknife estimate of the standard deviation of a parameter
   rboot      simulate a bootstrap resample from a sample
   ciboot     various bootstrap confidence intervals
   test1b     bootstrap t test and confidence interval for the mean

Tests, confidence intervals, and model estimation

These are functions related to statistical inference. Of interest for this class are the functions lsfit, test1n, and test2r. Use the help function to obtain additional information on the functions.

   cmpmod     compare small linear model versus large one
   ciquant    nonparametric confidence interval for quantile
   kstwo      Kolmogorov-Smirnov statistic from two samples (needs function pks)
   linreg     linear or polynomial regression
   lsfit      fit a multiple regression model
   lsselect   select a predictor subset for regression
   test1n     tests and confidence intervals based on a normal sample
   test1r     test for median equals 0 using rank test
   test2n     tests and confidence intervals based on two normal samples
   test2r     test for equal location of two samples using rank test

Stixbox demonstrations

These are SCILAB functions that demonstrate some of the functions contained in STIXBOX:

   stixdemo   demonstrate various stixbox routines
   stixtest   a second demo for stixbox

Famous datasets

Function getdata is used to load well-known datasets into the SCILAB environment. The data sets included are:

    1  Phosphorus Data
    2  Scottish Hill Race Data
    3  Salary Survey Data
    4  Health Club Data
    5  Brain and Body Weight Data
    6  Cement Data
    7  Colon Cancer Data
    8  Growth Data
    9  Consumption Function
   10  Cost-of-Living Data
   11  Demographic Data

To activate function getdata and load data into variable x use:

--> x = getdata()

This function produces a dialog box displaying the list of data sets. The user can type in the number of the data set and get back some information about the data set before the set is loaded. The dialog box produced by getdata() is shown below.

The dialog box shows that we have selected data set number 5.
Pressing [OK] will load the data as well as provide information as shown below.

Examples on probability distributions using STIXBOX

- Plot of the standard normal distribution:

-->z=-4:0.1:4;phi=dnorm(z,0,1);plot(z,phi,'z','phi(z)','standard normal')

- Plot of the Student-t distribution for ν = 2, 5, 10, 15, 20:

-->t=-4.0:0.1:4;nu=[2,5,10,15,20];
-->for k=1:5,f=dt(t,nu(k));plot2d(t,f,k,'011',' ',[-4 0 4 0.4]), end
-->xtitle('Student t distribution','t','f(t)')

- Plot of the chi-square distribution for nu=5:

-->x=0:0.1:20;nu=5;f=dchisq(x,nu);
-->plot(x,f,'x','f(x)','Chi-square distribution, nu=5')

- Plot of the F distribution for nu1=5 and nu2=10:

-->x=0:0.1:5;nu1=5;nu2=10;f=df(x,nu1,nu2);
-->plot(x,f,'F','f(F)','F distribution, nu1=5, nu2=10')

- Determining z_α such that P(Z > z_α) = α, or P(Z < z_α) = 1 - α. Also, z_α/2 is such that P(Z > z_α/2) = α/2, or P(Z < z_α/2) = 1 - α/2:

-->alpha = 0.05; z_alpha=qnorm(1-alpha), z_alpha2=qnorm(1-alpha/2)
 z_alpha  =
    1.6448536
 z_alpha2  =
    1.959964

- Determining t_ν,α such that P(T > t_ν,α) = α, or P(T < t_ν,α) = 1 - α. Also, t_ν,α/2 is such that P(T > t_ν,α/2) = α/2, or P(T < t_ν,α/2) = 1 - α/2:

-->nu=10;alpha=0.01;t_alpha=qt(1-alpha,nu),t_alpha2=qt(1-alpha/2,nu)
 t_alpha  =
    2.7637695
 t_alpha2  =
    3.1692727

- Determining χ2_ν,α such that P(Χ2 > χ2_ν,α) = α, or P(Χ2 < χ2_ν,α) = 1 - α. Similar definitions are used to calculate the values χ2_ν,1−α, χ2_ν,α/2, and χ2_ν,1−α/2:

-->nu=6;alpha=0.10;X_alpha=qchisq(1-alpha,nu)
 X_alpha  =
    10.644641
-->X_alpha2=qchisq(1-alpha/2,nu)
 X_alpha2  =
    12.591587
-->nu=6;alpha=0.10;X_alpha=qchisq(alpha,nu)
 X_alpha  =
    2.2041307
-->X_alpha2=qchisq(alpha/2,nu)
 X_alpha2  =
    1.6353829

- Generating 20 data points that follow the Weibull distribution, and producing a normal probability plot for such data:

-->x = rexpweib(20,3,5); qqnorm(x,'o')
- Generating 200 data points that follow the binomial distribution. A histogram of the data is then produced:

-->x = rbinom(200,10,0.35); histo(x);

Other options for function histo() are shown next. In the following call we suggest 8 classes (or bins) and set the parameter odd = 0. Function histo() chooses 6 classes:

-->histo(x,8,0)

In the next call, we suggest 15 classes, and the odd parameter takes a value odd = 1:

-->histo(x,15,1)

The next call scales the area of the histogram bars so that the total area is equal to 1:

-->histo(x,8,0,1)

Exercises

[1]. The probability of a flood occurring in a particular section of a river in a given month is estimated, from existing records, to be 0.15. (a) What is the probability that there will be three months of flood in the next year? (b) What is the probability that there will be less than 6 months of flood in the next year?

[2]. Data kept at an airport show an average of five cars per minute stopping to leave or pick up passengers at the terminal curb. (a) What is the probability that in the next minute there will be 10 or more cars stopping at the curb? (b) What is the probability that there will be no cars at the curb in a given minute?

[3]. It is known that 25 out of a batch of 200 concrete cylinders were prepared using a defective type of cement. If a laboratory receives a sample of 15 of those cylinders, what is the probability that the sample will contain 5 of the defective cylinders?

[4]. If a factory is known to produce 5% defective truck tires, what is the probability that in a given assembly line the first defective tire is detected after 20 tires have come out of the assembly line? What is the probability that the first defective tire is detected after 10 tires have come out of the assembly line?

[5].
The time required to finish the construction of a mile of a particular highway is known to have a normal distribution with a mean value of 3.5 days and a standard deviation of 0.5 days. What is the probability that the next mile in the road will be completed between 3 and 5 days? What is the probability that the construction of the next mile of the road will take more than 7 days?

[6]. Let X represent the intensity of an earthquake in a particular scale. If X is modeled using the exponential distribution with parameter β = 6.5, determine the probability that the intensity of the next earthquake will be 3.5 or less. Also, determine the probability that the intensity of the earthquake will be between 2.5 and 4.5.

[7]. The gamma distribution, with parameters α = 1.2 and β = 0.5, is used to model the time to failure (in hours) of an electronic component. Determine the probability that a particular component will last 100 hours or more. Determine the probability that the component will last less than 2 hours.

[8]. If the wind velocity in miles per hour near a harbor is assumed to follow a Weibull distribution with parameters α = 2 and β = 3, determine the probability of the wind velocity being between 15 and 75 mph. Also, determine the probability of the wind velocity being larger than 10 mph.

[9]. For a large value of n, the Binomial distribution can be approximated by the normal distribution with parameters µ = np, σ = √(np(1-p)). Suppose that you receive a shipment of 1000 resistors produced by a machine that is known to produce 0.5% defective resistors. What is the probability that there will be more than 200 defective resistors in the shipment by using: (a) the normal distribution approximation to the Binomial distribution, and (b) the Poisson distribution approximation to the Binomial distribution.

[10].
Plot the probability mass function, fX(x), and the cumulative distribution function, FX(x), for the following discrete distributions:

   (a) Binomial with n = 20, p = 0.25
   (b) Binomial with n = 20, p = 0.50
   (c) Binomial with n = 20, p = 0.75
   (d) Poisson with λ = 5.0, plot for x = 0,1,2,…,10
   (e) Geometric with p = 0.25, for x = 1,2,…,10
   (f) Geometric with p = 0.50, for x = 1,2,…,10
   (g) Geometric with p = 0.75, for x = 1,2,…,10
   (h) Hypergeometric with N = 100, n = 20, a = 40
   (i) Hypergeometric with N = 40, n = 10, a = 20
   (j) Hypergeometric with N = 120, n = 80, a = 10

[11]. Let X be a discrete random variable that follows the binomial distribution with parameters n and p. Let P0 = P(X ≤ x). Calculate:

   (a) P0 given n = 20, p = 0.35, x = 5
   (b) n given p = 0.25, x = 8, P0 = 0.80
   (c) p given n = 25, x = 20, P0 = 0.75
   (d) x given n = 10, p = 0.80, P0 = 0.30

[12]. Plot the probability density function, fX(x), and the cumulative distribution function, FX(x), for the following continuous distributions:

   (a) Gamma with α = 0.5, β = 1.5
   (b) Gamma with α = 2, β = 3
   (c) Beta with α = 0.5, β = 1.5
   (d) Beta with α = 3, β = 2
   (e) Weibull with α = 0.5, β = 1.5
   (f) Weibull with α = 2, β = 2
   (g) Uniform with a = 2, b = 6
   (h) Uniform with a = -3, b = 3
   (i) Exponential with β = 12.5
   (j) Exponential with β = 4.8
   (k) Normal with µ = 5, σ = 5
   (l) Normal with µ = 150, σ = 25
   (m) Student t with ν = 4
   (n) Student t with ν = 12
   (o) Chi-square with ν = 4
   (p) Chi-square with ν = 12
   (q) F distribution with νN = 4, νD = 10
   (r) F distribution with νN = 4, νD = 10

[13]. Let X be a continuous random variable that follows the Gamma probability distribution with parameters α and β. Let P0 = P(X ≤ x). Calculate:

   (a) P0 given α = 2, β = 3, x = 3.5
   (b) α given P0 = 0.40, β = 1.5, x = 1.2
   (c) β given P0 = 0.60, α = 5, x = 10.5
   (d) x given P0 = 0.20, α = 10.5, β = 0.3

[14].
Let X be a continuous random variable that follows the Beta probability distribution with parameters α and β. Let P0 = P(X ≤ x). Calculate:

   (a) P0 given α = 2, β = 3.5, x = 0.35
   (b) α given P0 = 0.40, β = 2.3, x = 0.76
   (c) β given P0 = 0.60, α = 2.5, x = 0.45
   (d) x given P0 = 0.20, α = 10.5, β = 0.3

[15]. Let T be a continuous random variable that follows the Student t distribution with ν degrees of freedom. Let P0 = P(T ≤ t). Calculate:

   (a) P0 given ν = 10, t = 1.5
   (b) ν given P0 = 0.40, t = -0.8
   (c) t given P0 = 0.20, ν = 8

[16]. Let Χ2 be a continuous random variable that follows the chi-square distribution with ν degrees of freedom. Let P0 = P(Χ2 ≤ χ2). Calculate:

   (a) P0 given ν = 6, χ2 = 2.25
   (b) ν given P0 = 0.40, χ2 = -0.8
   (c) χ2 given P0 = 0.20, ν = 12

[17]. Let F be a continuous random variable that follows the F distribution with νN degrees of freedom in the numerator and νD degrees of freedom in the denominator. Let P0 = P(F ≤ F). Calculate:

   (a) P0 given νN = 4, νD = 10, F = 2.5
   (b) νN given P0 = 0.40, νD = 15, F = 3.2
   (c) νD given P0 = 0.60, νN = 3, F = 0.45
   (d) F given P0 = 0.20, νN = 8, νD = 12

[18]. The following data represent measurements of the diameter of a cylinder produced for a precision mechanism:

   232. 246. 260. 244. 267. 248. 308. 247. 264. 243.
   242. 221. 228. 243. 270. 250. 275. 274. 255. 275.
   239. 261. 205. 261. 260. 244. 217. 254. 236. 281.
   265. 260. 230. 226. 240. 262. 273. 252. 264. 257.
   259. 228. 263. 260. 268. 236. 269. 255. 265. 231.

(a) Use function histnorm with a suitable number of classes to plot a histogram of the data as well as the corresponding normal curve. (b) Use function normplot to produce a normal probability plot of the data. (c) Based on these two plots, how well do the data follow the normal distribution?

[19]. The following data set represents the time to failure, in years, of light bulbs.
   1.39   .97  1.33  3.05  3.21  1.07  1.01   .82   .42   .74
   3.22   .44  2.04  1.17  3.04  3.67  1.97  1.02  1.72  2.74
    .55  1.9    .53  2.68   .83   .81   .89   .13   .56   .79
   1.22  3.25  2.06  2.13  1.56  1.26   .85  2.96  1.56  1.55
    .05  1.04  1.96  2.09   .96  1.54   .43  1.5   1.26  1.23

(a) Use function histnorm with a suitable number of classes to plot a histogram of the data as well as the corresponding normal curve. (b) Use function normplot to produce a normal probability plot of the data. (c) Based on these two plots, how well do the data follow the normal distribution?

[20]. The following data set represents the yearly rainfall depth, in mm, recorded at a certain location:

   126.  408.  277.  135.  82.9  189.  13.7  215.  41.5  646.
   52.3  106.  4.35  7.82  171.  201.  346.  313.  314.  51.
   102.  17.4  60.6  43.   830.  165.  29.1  335.  12.8  24.5
   468.  59.4  366.  32.6  887.  174.  471.  39.3  44.5  870.

(a) Use function histnorm with a suitable number of classes to plot a histogram of the data as well as the corresponding normal curve. (b) Use function normplot to produce a normal probability plot of the data. (c) Based on these two plots, how well do the data follow the normal distribution?

[21]. The following data set represents the number of vehicles stopping at a service station in a given hour:

    3.   4.   6.   6.   4.   8.   5.   8.   9.   5.
    5.   3.   6.   5.  10.   9.   9.  11.   4.   4.
    7.  12.   8.   4.   5.   4.   4.  11.   7.   5.
    9.   6.   3.   5.   5.   9.   4.   7.   5.  13.
    3.   5.   4.   4.   9.   8.   6.   1.  11.   7.
    9.  10.   5.   8.   4.   8.  11.   6.   5.   6.

(a) Use function histnorm with a suitable number of classes to plot a histogram of the data as well as the corresponding normal curve. (b) Use function normplot to produce a normal probability plot of the data. (c) Based on these two plots, how well do the data follow the normal distribution?

[22]. Generate data sets consisting of k values that follow the indicated distribution with the parameters listed below. Use functions histnorm and normplot to produce a histogram and a normal probability plot of the data.
How well do the data thus generated follow the normal distribution based on the histogram and probability plot?
(a) Binomial, k = 200, n = 30, p = 0.7
(b) Poisson, k = 300, λ = 14.5
(c) Beta, k = 150, α = 3.5, β = 5.2
(d) Gamma, k = 100, α = 3.5, β = 5.2
(e) Exponential, k = 500, µ = 5.75
(f) Normal, k = 180, µ = 5.75, σ = 1.2
(g) Chi-square, k = 230, ν = 5
(h) F-distribution, k = 350, $\nu_N = 5$, $\nu_D = 5$
(i) Uniform integer, k = 125, a = -50, b = 50
(j) Uniform real, k = 200, a = 5.5, b = 17.5
(k) Weibull, k = 200, α = 7.2, β = 2.1
(l) Student’s t, k = 150, ν = 12
(m) Log-normal, k = 200, $\mu_{\ln X} = 1.2$, $\sigma_{\ln X} = 0.5$

[23]. Generate data sets consisting of 250 values that follow the discrete distribution described by the following probability mass function:

x          1.2   2.3   4.1   5.2   6.1   7.2   8.4   9.3   11.1
$f_X(x)$   0.04  0.08  0.12  0.16  0.08  0.04  0.20  0.24  0.04

Use functions histnorm and normplot to produce a histogram and a normal probability plot of the data. How well do the data thus generated follow the normal distribution based on the histogram and probability plot?

[24]. Function service was developed to simulate the traffic through a service station. Use function service to produce a simulation of traffic through a service station that takes as input 50 values of the inter-arrival time (IAT) and 50 values of the time of service (TS) generated from the following probability mass functions (the tabulated values in each row sum to 1):

x = IAT    0.2   0.4   0.6   0.8   1.0   1.2   1.4   1.6   1.8   2.0
$f_X(x)$   0.03  0.14  0.08  0.12  0.23  0.10  0.05  0.05  0.10  0.10

x = TS     0.4   0.8   1.2   1.6   2.0   2.4
$f_X(x)$   0.05  0.15  0.35  0.25  0.15  0.05

Use functions histnorm and normplot to produce a histogram and a normal probability plot of the waiting time (WT) and number of customers waiting (NW). How well do the WT and NW data follow the normal distribution?

[25]. One-dimensional random walk. Consider a particle that moves along a straight line subject to a random motion. The particle starts at $x_1 = 0$ and moves to position $x_2 = x_1 + \Delta x_1$, where $\Delta x_1$ is a random number.
The next position of the particle is $x_3 = x_2 + \Delta x_2$, where $\Delta x_2$ is a second random number. Subsequent positions of the particle are given by $x_{k+1} = x_k + \Delta x_k$. The random numbers used must include both positive and negative values so that the particle can move forward and backward.
(a) Plot the position $x_k$ vs. k for a one-dimensional random walk that involves 300 displacements $\Delta x_k$ generated from a normal distribution with µ = 0 and σ = 1.
(b) Plot the position $x_k$ vs. k for a one-dimensional random walk that involves 300 displacements $\Delta x_k$ generated from a uniform distribution between -1 and 1.

[26]. Two-dimensional random walk. A two-dimensional random walk involves the displacement of a particle from a point $(x_k, y_k)$ to a point $(x_{k+1}, y_{k+1})$ so that $x_{k+1} = x_k + r_k \cos(\theta_k)$ and $y_{k+1} = y_k + r_k \sin(\theta_k)$, where the values $r_k$ and $\theta_k$ are random numbers.
(a) Plot the two-dimensional random walk that results from 200 values of $r_k$ with a normal distribution with mean µ = 1 and standard deviation σ = 0.2, and 200 values of $\theta_k$ uniformly distributed between 0 and 2π.
(b) Plot the two-dimensional random walk that results from 100 values of $r_k$ with a Weibull distribution with parameters α = 2 and β = 3, and 200 values of $\theta_k$ uniformly distributed between 0 and 2π.
(c) Plot the two-dimensional random walk that results from 150 values of $r_k$ with a Gamma distribution with parameters α = 0.2 and β = 1.3, and 200 values of $\theta_k$ normally distributed with mean µ = π and standard deviation σ = π/2.
(d) Plot the two-dimensional random walk that results from 250 values of $r_k$ with a Beta distribution with parameters α = 2 and β = 3, and 200 values of $\theta_k$ uniformly distributed between 0 and 2π.

[27]. The following table shows the annual maximum flow for the Ganga River in India measured at a specific station.
Year  Q(m³/s)    Year  Q(m³/s)    Year  Q(m³/s)    Year  Q(m³/s)
1885     7241    1907     7546    1929     4545    1951     4458
1886     9164    1908    11504    1930     5998    1952     3919
1887     7407    1909     8335    1931     3470    1953     5470
1888     6870    1910    15077    1932     6155    1954     5978
1889     9855    1911     6493    1933     5267    1955     4644
1890    11887    1912     8335    1934     6193    1956     6381
1891     8827    1913     3579    1935     5289    1957     4548
1892     7546    1914     9299    1936     3320    1958     4056
1893     8498    1915     7407    1937     3232    1959     4493
1894    16757    1916     4726    1938     3525    1960     3884
1895     9680    1917     8416    1939     2341    1961     4855
1896    14336    1918     4668    1940     2429    1962     5760
1897     8174    1919     6296    1941     3154    1963     9192
1898     8953    1920     8174    1942     6650    1964     3024
1899     7546    1921     9079    1943     4442    1965     2509
1900     6652    1922     7407    1944     4229    1966     4741
1901    11409    1923     5482    1945     5101    1967     5919
1902     9164    1924    19136    1946     4629    1968     3789
1903     7404    1925     9680    1947     4345    1969     4546
1904     8579    1926     3698    1948     4890    1970     3842
1905     9362    1927     7241    1949     3619    1971     4542
1906     7092    1928     3698    1950     5899

(a) Use function histnorm with a suitable number of classes to plot a histogram of the data as well as the corresponding normal curve.
(b) Use function normplot to produce a normal probability plot of the data.
(c) Based on these two plots, how well do the data follow the normal distribution?

The following problems require that you load the functions from the Stixbox SCILAB toolbox.

[28]. Using function getdata() load data set number 1, described as:
__________________________________________________________________________________
************************ Phosphorus Data **********************************
Source: Snedecor, G. W. and Cochran, W. G. (1967), Statistical Methods, (6 Edition), Iowa State University, Ames, Iowa, p. 384.
Taken From: Chatterjee and Hadi (1988), p. 82.
Dimension: 18 observations on 3 variables
Description: An investigation of the source from which corn plants obtain their phosphorus was carried out.
Concentrations of phosphorus in parts per millions in each of 18 soils were measured.
Column   Description
1        Concentrations of inorganic phosphorus in the soil
2        Concentrations of organic phosphorus in the soil
3        Phosphorus content of corn grown in the soil at 20 degrees C
__________________________________________________________________________________
(a) Separate the three columns of data into vectors x, y, and z, and use the user-defined function describe to obtain statistics of each of the columns of data.
(b) Use Stixbox function histo to obtain histograms of each of the columns of data.
(c) Use Stixbox function qqnorm to obtain a normal probability plot of each of the data columns.

[29]. Using function getdata() load data set number 1, described as:
*********************** Scottish Hill Race Data *************************
(...lines removed...)
Column   Definition
1        Distance (miles)
2        Climb (ft)
3        Time (seconds)
__________________________________________________________________________________
(a) Separate the three columns of data into vectors x, y, and z, and use the user-defined function describe to obtain statistics of each of the columns of data.
(b) Use Stixbox function histo to obtain histograms of each of the columns of data.
(c) Use Stixbox function qqnorm to obtain a normal probability plot of each of the data columns.