Normal Distribution

DEFINITIONS

We are studying the effect of environmental conditions on human subjects.

The SYSTEM is the human subject.

The POPULATION is the temperature readings of human subjects under the experimental conditions.

The RANDOM VARIABLE is the measurement recorded on a thermometer.

The SYSTEM OUTPUT is the reading on the thermometer.

In this example the MODEL is assumed to be a normal distribution.

There are two PARAMETERS of interest in a normal distribution: μ and σ

Ten subjects are tested and the DATA are:

98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4.

Entering these data in SPSS we obtain ESTIMATES of μ as 97.85 and σ as 0.550.

These are the estimates of the MEAN and STANDARD DEVIATION of the population temperatures.

The estimates of skewness (√b1) and kurtosis (b2) can be obtained from the SPSS output and are −0.798 and 0.975.
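These sample moments are easy to verify directly. A minimal Python sketch (the bias-adjusted skewness and kurtosis formulas below are the ones SPSS reports, which is an assumption; note that the computed skewness comes out negative, a minus sign that is easy to lose in transcription):

```python
import math

temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]
n = len(temps)
mean = sum(temps) / n
s = math.sqrt(sum((x - mean) ** 2 for x in temps) / (n - 1))  # sample standard deviation

devs = [x - mean for x in temps]
# Bias-adjusted skewness (g1) and excess kurtosis (g2), the statistics SPSS reports
skew = (n / ((n - 1) * (n - 2))) * sum(d ** 3 for d in devs) / s ** 3
kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * sum(d ** 4 for d in devs) / s ** 4 \
    - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

print(round(mean, 2), round(s, 3), round(skew, 3), round(kurt, 3))
```

Running this reproduces the estimates 97.85, 0.550, −0.798 and 0.975.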

Estimate the 90th percentile:

1) If we assume that the normal distribution is the correct model, then the 90th PERCENTILE of the standard normal distribution, Z.90, is 1.645. Using the equation Y.90 = μ + σZ.90 and substituting the estimates for the parameters μ and σ yields Y.90 = 98.755.

2) If we do not assume a statistical model and want to estimate the percentile directly from the data the following equation is used:

Let i = np +1/2

Where n is the number of data points, p is the desired percentile and i is the order number in the ordered data list corresponding to the desired percentile.

To find the estimate of the 75th percentile for our example, i = 10 × 0.75 + 0.5 = 8.0.

Thus the eighth value in the ordered list is the 75th percentile. If i is not an integer we use linear interpolation to find the estimate.

The ordered list is denoted the ORDER STATISTICS. The order statistics for our example are:

96.7, 97.4, 97.6, 97.7, 97.8, 97.9, 98.2, 98.2, 98.4, 98.6.

Thus 98.2 is the estimate of the 75th percentile.
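The order-statistics percentile rule above can be sketched in Python (the function name and the interpolation handling are illustrative, not from the notes):

```python
def percentile_estimate(data, p):
    """Estimate the 100p-th percentile using i = np + 1/2 and linear interpolation."""
    xs = sorted(data)                 # the order statistics
    n = len(xs)
    i = n * p + 0.5                   # rank of the desired percentile
    lo = int(i)                       # integer part
    if lo < 1:
        return xs[0]
    if lo >= n:
        return xs[-1]
    return xs[lo - 1] + (i - lo) * (xs[lo] - xs[lo - 1])

temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]
print(percentile_estimate(temps, 0.75))  # i = 8.0, so the 8th order statistic: 98.2
```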

To find P{Y < 99.0} we use the DISTRIBUTION FUNCTION of the normal distribution evaluated at 99.0.

To obtain this probability, we use the equation:

Z = (Y − μ)/σ

using the estimates of μ and σ and the table of the standard normal distribution to find F(Z).

For this example:

Z = (99.0 − 97.85)/0.55 = 2.09 and F(2.09) = 0.982.
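The table lookup can be checked without a table; a sketch using the error function, where Φ(z) = ½[1 + erf(z/√2)]:

```python
import math

def norm_cdf(z):
    # Standard normal distribution function via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 97.85, 0.55   # estimates from the SPSS output above
z = (99.0 - mu) / sigma
print(round(z, 2), round(norm_cdf(z), 3))  # 2.09 and 0.982
```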

If we want to obtain a confidence interval for μ we use the Student t distribution.

The 95% interval from SPSS is (97.46, 98.24).

Testing Hypotheses

To TEST the HYPOTHESIS that:

H0: μ = 98.7 vs. HA: μ < 98.7

a lower tailed t test is used. This can be done using SPSS.

For this example t = −4.89 and p = 0.00043.

Thus the conclusion is that the data provide strong evidence against μ = 98.7.

Note for a one tail test we must divide the p-value from SPSS by two.
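The t statistic is quick to check by hand; a sketch in Python (the hypothesized mean 98.7 is reconstructed from the reported t = −4.89 together with the estimates 97.85 and 0.550, so treat it as an assumption):

```python
import math

ybar, s, n = 97.85, 0.550, 10   # estimates from the SPSS output above
mu0 = 98.7                      # hypothesized mean, reconstructed from t = -4.89
t = (ybar - mu0) / (s / math.sqrt(n))
print(round(t, 2))  # -4.89; the one-tail p-value then comes from a t table with n - 1 df
```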

CONTINUOUS DISTRIBUTIONS

Normal Distribution

The normal is the most commonly used distribution to model system output. The normal distribution is used to represent system output that results from the additive effect of many factors.

Consider the blood pressure measurement from a single individual. As you know, one's blood pressure varies each time it is measured. The reading obtained is the result of many factors, both physical and mental, at the time of the reading. Thus it is logical to use the normal distribution as a model for blood pressure readings. Many of the system outputs that are measured are the result of a series of added effects.

The density function of the normal distribution is given by:

f(t) = [σ²(2π)]^(−1/2) exp[−(1/2){(t − μ)/σ}²], −∞ < t < ∞, σ > 0, −∞ < μ < ∞,

where μ and σ are the mean and standard deviation of the distribution.

The density can only be evaluated numerically and hence a table or computer program must be used to determine the distribution function F(y). You learned in your statistics class how to estimate these quantities using the sample mean and sample standard deviation. (See equations (2-31) and (2-51a) in your text.) The normal distribution is symmetrical about its mean, hence its skewness measure is zero. The kurtosis measure is 3. The Student t distribution is used to obtain confidence intervals for the parameter μ and the chi-squared distribution is used for the parameter σ. Review your statistics notes for these formulas.

[Figure: Normal distribution density curves for several values of μ]

The Half Normal Distribution

In some problems the random variable of interest is the absolute value of the measurement. If we are measuring the deviation from a standard, are not interested in whether the reading is positive or negative, and the distribution of the original data is normal, then the appropriate model for the absolute values is a half-normal distribution.

The density of the half-normal distribution is:

f(t) = [2/(πσ²)]^(1/2) exp[−t²/(2σ²)], t > 0, σ > 0.

There is only one parameter for the half-normal: σ.

The mean of the distribution is 0.798σ and the variance is 0.363σ².

The graph of the distribution looks like the positive portion of the normal curve.

Thus it is skewed with a skewness measure of 0.995 and a kurtosis measure of 3.869.

[Figure: Half-normal distribution density]

Matching of Moments

The method of MATCHING OF MOMENTS is a useful technique for estimating the parameters of a distribution. In this method we equate the moments obtained from the data (sample mean, sample variance, sample measure of skewness and/or sample measure of kurtosis) with the moments of the selected model. We match one moment for each unknown parameter. Thus for the half-normal distribution we match the sample mean with the mean of the half-normal, 0.798σ. If a distribution had two parameters we would match the sample mean and variance with the mean and variance of the model.

EXAMPLE

A pacemaker is being tested to determine the average error from the nominal of 60 beats per minute. Thus the random variable of interest is the deviation from the nominal of 60.

We are not concerned whether the deviation is positive or negative. The readings are:

6.5 1.6 6.9 11.7 5.7 2.1 2.5 5.3 1.8 2.0 10.9 6.0 13.2

The sample mean is 5.86. Therefore we match 5.86 with 0.798σ, yielding the estimate of σ of 5.86/0.798 = 7.34. Tables and other methods of estimation can be found in Leone and Nelson, "The Folded Normal Distribution", Technometrics, 3, 543 (1961). The folded normal distribution is more general than the half-normal, which is the special case of the folded normal where the distribution is folded at zero.
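The matching-of-moments calculation for this example can be reproduced directly:

```python
# Matching of moments for the half-normal: sample mean is matched to 0.798*sigma
deviations = [6.5, 1.6, 6.9, 11.7, 5.7, 2.1, 2.5, 5.3, 1.8, 2.0, 10.9, 6.0, 13.2]
mean = sum(deviations) / len(deviations)   # 5.86
sigma_hat = mean / 0.798                   # estimate of sigma, about 7.34
print(round(mean, 2), round(sigma_hat, 2))
```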

Exponential distribution

Consider some outcome that represents the time to the end of life of a system under study, the time of death of a subject, the time of failure of a piece of equipment, or the time between occurrences of an event where the probability of occurrence is proportional to the length of the time interval and the rate of occurrence is constant over time.

Let F(t) = pr{T < t}, the distribution function. The probability that a failure occurs in the next instant of time, given that the system is working at time t, is the conditional probability of failure in the interval (t, t + Δt).

This probability is:

[F(t + Δt) − F(t)]/[1 − F(t)] = λΔt, where λ is the constant rate of occurrence. The solution of this equation is:

1 − F(t) = e^(−λt), where t > 0 and λ > 0.

The density function is obtained by differentiating with respect to t.

This yields the exponential distribution f(t) = λe^(−λt), t > 0 and λ > 0.

The distribution function is F(t) = 1 − e^(−λt).

[Figure: Exponential distribution density]

The Conditional Failure Rate Function

A useful function which can serve as a guide for the selection of a failure model is r(t), the CONDITIONAL FAILURE RATE FUNCTION. This yields the conditional probability of a failure occurring in the next instant of time given that the system has not failed up to time t.

r(t) = f(t)/ [1 – F(t)]

The value of r(t) for the exponential distribution is r(t) = λe^(−λt)/{1 − (1 − e^(−λt))} = λ.

The conditional failure rate for the exponential distribution is a constant and it would be used for the distribution of the time to failure for systems which do not wear-out.

The reciprocal of the parameter λ is both the mean and the standard deviation of this distribution. The reciprocal of the sample mean, t̄, is used to estimate this parameter.

A confidence interval for this parameter can be obtained using the fact that the statistic 2nλt̄ has a chi-squared distribution with 2n degrees of freedom, where n is the sample size. Hence a (1 − α)100% confidence interval follows from the fact that:

pr{χ²(2n, α/2) < 2nλt̄ < χ²(2n, 1−α/2)} = 1 − α,

yielding the interval (χ²(2n, α/2)/(2nt̄), χ²(2n, 1−α/2)/(2nt̄)).

Review the use of the chi-squared tables.

In an experiment to test the durability of a new design of pacemakers an accelerated test was conducted on 50 units. The average time to failure was 4.05 months.

Estimate the parameter λ and obtain a 95% confidence interval.

The estimate of λ is 1/4.05 = 0.247. The confidence interval is obtained via the chi-squared table with 100 degrees of freedom; using α = 0.05, the 0.025 and 0.975 percentiles are found in the table to be 74.2 and 129.6.

This yields the interval (74.2/(100 × 4.05), 129.6/(100 × 4.05)) = (0.183, 0.320). We conclude that the best estimate of the mean life of the pacemakers is 4.05 months, and we are 95% confident that the mean life is between 3.13 months and 5.46 months in the accelerated environment.
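The interval calculation can be sketched as follows (the chi-squared percentiles 74.2 and 129.6 for 100 degrees of freedom are read from a table, as in the notes):

```python
n, tbar = 50, 4.05               # sample size and average time to failure (months)
chi2_lo, chi2_hi = 74.2, 129.6   # chi-squared 0.025 and 0.975 percentiles, 100 df
lam_hat = 1 / tbar               # estimate of lambda
ci = (chi2_lo / (2 * n * tbar), chi2_hi / (2 * n * tbar))
print(round(lam_hat, 3), round(ci[0], 3), round(ci[1], 3))
```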

Gamma distribution

Let us now consider a model that can be used when the event or failure does not occur until there are η sub-occurrences, where the time between each of these sub-occurrences has an independent exponential distribution.

Consider such cases as the life of an animal that does not die until it is attacked five times by a predator, or the time to overhaul of a machine that is overhauled after it is repaired six times, where the times between events are independent exponential variables with a constant rate λ. Thus the random variable of interest is the sum of the times between occurrences up to the η-th occurrence, where the times between occurrences are independent exponential variables each with the parameter λ.

This random variable has a gamma distribution whose density is

f(t) = λ^η t^(η−1) e^(−λt)/Γ(η); t > 0, λ > 0, η > 0.

The function Γ(η) = (η − 1)! when η is a positive integer. When η is not an integer this value of the function must be looked up in a table of the gamma function.

The distribution function can only be evaluated analytically when η is an integer. The distribution function F(t), when η is an integer, is given by

F(t) = 1 − Σ (λt)^k e^(−λt)/k!, where the sum runs over k = 0, 1, …, η − 1.

This sum can be obtained from a Poisson table with parameter λt and y = η − 1.

Check this out in your text.

The mean of this distribution is η/λ and the variance is η/λ².

The parameters can be estimated by the method of matching the moments. Since there are two parameters we use the sample mean and sample variance in the matching process.

The estimate of λ is t̄/s². The estimate of η is t̄²/s², i.e. the sample mean squared divided by the sample variance.

If the parameter η can only take on integer values, the gamma distribution is sometimes called the Erlangian distribution. Computation of confidence intervals and the conditional failure rate are complicated and can be found in Statistical Modeling Techniques by Shapiro and Gross, published by Marcel Dekker, 1981. The failure rate increases with time when η > 1 and decreases when it is less than one.

An experiment is run to estimate the average time it takes for a machine to require a complete overhauling. A machine is overhauled after it needs to be recalibrated six times. The times between recalibrations have independent exponential distributions.

The average time between overhauls is 525.5 hours and the standard deviation is 207.7. The estimate of λ is 525.5/207.7² = 0.0122 and the estimate of η is 0.0122 × 525.5 = 6.4.

Find the probability that a machine will need overhauling in less than 300 hours.

We can get an approximate answer to this if we:

• assume that η is equal to 6.0,

• use the Poisson table with y = 6 − 1 = 5 and the column value λt = (0.0122)300 = 3.67.

Using the closest column value to 3.67, the value of the sum for y = 5 is approximately 0.844.

Therefore F(300) = 1 − 0.844 = 0.156.
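The Poisson-sum form of F(t) is easy to evaluate directly; a sketch (computing the sum exactly rather than from the nearest table column gives F(300) ≈ 0.164, close to the 0.156 read from the table above):

```python
import math

def gamma_cdf_integer_eta(t, lam, eta):
    # F(t) = 1 - sum_{k=0}^{eta-1} (lam*t)^k * exp(-lam*t) / k!, eta a positive integer
    x = lam * t
    return 1.0 - sum(x ** k * math.exp(-x) / math.factorial(k) for k in range(eta))

print(round(gamma_cdf_integer_eta(300, 0.0122, 6), 3))
```

When η = 1 this reduces to the exponential distribution function 1 − e^(−λt), as noted earlier.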

S. Shapiro and L. Chen, “Composite Test for the Gamma Distribution”, Journal of Quality Technology,33, 47-59, (1998)

[Figure: Gamma distribution densities for several values of η]

Weibull distribution

The exponential model is limited as a lifetime model since it can only be used in situations where the conditional failure rate is constant. If we start with the function r(t) = (η/σ)(t/σ)^(η−1), then if η > 1 the function increases with time and if η < 1 it decreases.

Note that when η = 1, r(t) is constant and it is the function for the exponential distribution. Setting r(t) equal to f(t)/(1 − F(t)) yields:

f(t) = (η/σ)(t/σ)^(η−1) exp[−(t/σ)^η], t ≥ 0,

F(t) = 1 − exp[−(t/σ)^η], t ≥ 0,

where η > 0 and σ > 0.

The mean of this distribution is σΓ(1 + 1/η), where Γ(x) is the gamma function discussed previously. The variance of the distribution is σ²[Γ(1 + 2/η) − {Γ(1 + 1/η)}²]. The estimation of the parameters requires a numerical procedure or a graphical procedure to be discussed later. Once the parameters are known the distribution function can be used to obtain probabilities.

In a study of 20 patients receiving an analgesic to relieve a headache pain a Weibull distribution was used to model the time to the cessation of pain.

The estimate of η was 2.79 and the estimate of σ was 2.14. The mean relief time was 1.89 hours.

The probability that the relief time will exceed 4.0 hours is obtained from

1 − F(4) = exp[−(4/2.14)^2.79] = 0.003.
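Both the mean and the exceedance probability follow from the formulas above; a sketch in Python:

```python
import math

eta, sigma = 2.79, 2.14                         # Weibull shape and scale estimates
mean_relief = sigma * math.gamma(1 + 1 / eta)   # about 1.9 hours
p_exceed_4 = math.exp(-(4.0 / sigma) ** eta)    # P{relief time > 4 hours}
print(round(mean_relief, 2), round(p_exceed_4, 3))
```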

L. Chen and S. Shapiro, "Can the Idea of the QH Test for Normality be Used for Testing the Weibull Distribution", J. Statistical Computation and Simulation, 55, 258-263 (1996).

[Figure: Weibull distribution densities, σ = 1 for all plots, for several values of η]

The Rayleigh distribution is a Weibull with η = 2.

Lognormal distribution

It was stated that the normal distribution is the statistical model for events that represent additive effects. We now consider a model that is used when the event is caused by multiplicative effects.

If T = X1 X2 … Xn then Y = ln(T) = Σ ln Xi, and Y is the sum of random variables and can be modeled by a normal distribution. Then T has a lognormal distribution with density function

f(t) = [σt(2π)^(1/2)]^(−1) exp[−(1/(2σ²)){ln t − μ}²], t > 0, σ > 0, −∞ < μ < ∞.

The mean and variance of the lognormal are exp[μ + σ²/2] and exp[2μ + σ²]{exp[σ²] − 1}.

Estimation of the parameters is simple. Take the natural log of the data; the estimate of μ is the sample mean of the logs and the estimate of σ is the sample standard deviation of the logs.

Remember that these are not the mean and the standard deviation of the data!

An experiment is run to estimate the growth of a strain of bacteria in a period of two days. The growth at any point in time depends on the size at the instant prior to the measurement. There is a multiplicative effect that determines the size of the bacteria colony. We will model the size using a lognormal distribution. The following sizes of ten colonies were measured after two days:

9.98 10.36 10.04 12.82 10.86 10.39 9.06 11.17 10.29 10.78

Taking the natural log of these numbers yields:

2.30 2.34 2.31 2.55 2.36 2.34 2.20 2.41 2.33 2.38

The estimate of μ is 2.352 and the estimate of σ is 0.039.

The mean size of the colonies is: exp[2.352 + 0.039²/2] = 10.514.

The variance of the colony size is exp[2(2.352) + 0.039²]{exp[0.039²] − 1} = 110.56(0.0015) = 0.17.

[Figure: Lognormal distribution densities]

Logistic distribution

The logistic distribution plays a major role in describing growth processes, survival data and demographic studies. Some of the early applications were in the study of population growth, as well as numerous other growth studies referenced in the Handbook of the Logistic Distribution by Balakrishnan (1992). More recently it has been used in bioassay studies and in the analysis of survival distributions.

The distribution function is given by:

F(x; μ, σ) = 1/[1 + e^(−π(x − μ)/(σ√3))], −∞ < x < ∞, σ > 0,

with corresponding density function given by:

f(x; μ, σ) = [π/(σ√3)] e^(−π(x − μ)/(σ√3))/[1 + e^(−π(x − μ)/(σ√3))]², −∞ < x < ∞, σ > 0.

The mean and variance of the distribution are μ and σ².

Like the normal distribution, practitioners often work with the standard logistic random variable, Z = π(X − μ)/(σ√3).

Z has density:

f(z) = e^(−z)/[1 + e^(−z)]², −∞ < z < ∞.

The standard logistic distribution function is

F(z) = 1/[1 + e^(−z)], −∞ < z < ∞.

The moment estimators of the parameters are the sample mean for μ and the sample variance for σ². The distribution resembles the normal distribution but has heavier tails.

The growth of a bacteria colony can be modeled by a logistic distribution. In an experiment the diameter of a colony is measured after three days growth. Ten colonies are measured and the results are:

23 46 32 38 30 68 44 37 36 30

The sample mean is 38.4 and the sample standard deviation is 12.44.

Hence, the estimate of μ is 38.4 and the estimate of σ is 12.44.

The probability that the size of a colony will be less than 22 is given by F(22).

Converting this to the standard variable, Z = 3.14(22 − 38.4)/(12.44 × 1.73) = −2.39; then

F(−2.39) = 0.084.

Thus there is a probability of about 0.084 that a colony will have a diameter of less than 22.
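The standardization and lookup can be done in one step; a sketch using F(x) = 1/[1 + e^(−π(x − μ)/(σ√3))]:

```python
import math

def logistic_cdf(x, mu, sigma):
    # Logistic distribution function with mean mu and standard deviation sigma
    z = math.pi * (x - mu) / (sigma * math.sqrt(3))
    return 1.0 / (1.0 + math.exp(-z))

print(round(logistic_cdf(22, 38.4, 12.44), 3))  # about 0.084
```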

[Figure: Logistic distribution densities for several values of σ]

See pages 122-132 of HS for a summary description of this material.

There are several other models which are not covered in this course. Such models as the Pareto and the extreme value distribution can be found in the literature.

DISCRETE DISTRIBUTIONS

Discrete distributions

The foregoing distributions are used when the data being observed are measurements. In some cases the data are counts (discrete) and hence a discrete distribution must be chosen. The selection process depends on two factors: the type of experiment and the question under study. There are two types of experiments that we will cover in this course. The first involves Bernoulli trials.

Bernoulli trial

A Bernoulli trial is one in which the outcome of the experiment has only two possibilities. We will call these success and failure, coded one and zero respectively. This covers experiments such as live or die, win or lose, success or failure, cure or not cure, etc.

The Bernoulli random variable is:

0 if a failure occurs

1 if a success occurs

The probability function is: f(y) = p^y (1 − p)^(1−y), y = 0, 1.

This is not a useful model but serves as a building block for other discrete distributions.

The Binomial Distribution

Experiment : n independent Bernoulli trials where the pr{1} = p on each trial.

Question : What is the probability of Y successes in n trials?

Answer: Binomial Distribution: f(y) = [n!/{y!(n − y)!}] p^y (1 − p)^(n−y); y = 0, 1, …, n; 0 < p < 1,

where n is the sample size, Y is the number of ones and p is the probability of a one. The estimator of the parameter p is Y/n, the mean of Y is np and the variance is np(1 − p).

A new antibiotic is being tested on 20 subjects that have been infected with a virus.

At the end of three days they are tested to see if the infection is gone.

There are two possible outcomes for this experiment: Yes and No

The trials are independent and we are interested in the probability that two or fewer people will not be cured (Y ≤ 2).

The results were that only one person was not cured. In this case Y is the number of persons not cured. The estimate of p is 1/20 = 0.05.

Using a binomial table or the computer the probability that

Y is less than or equal to two is: 0.9245.

Thus, the estimate of the probability that someone will be cured from the treatment is 0.95.
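The cumulative binomial probability can be computed directly rather than read from a table; a sketch:

```python
from math import comb

def binom_cdf(y_max, n, p):
    # P{Y <= y_max} for a binomial(n, p) random variable
    return sum(comb(n, y) * p ** y * (1 - p) ** (n - y) for y in range(y_max + 1))

print(round(binom_cdf(2, 20, 0.05), 4))  # 0.9245
```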

The Geometric distribution

Experiment : n independent Bernoulli trials where the pr{1}

= p on all trials.

Question : What is the probability that the first success comes on trial Y?

Answer : Geometric distribution

In this situation the random variable is the number of trials while in the binomial distribution it was the number of ones occurring in n trials.

f(y) = (1 − p)^(y−1) p, y = 1, 2, …, 0 ≤ p ≤ 1.

The mean of the distribution is 1/p and the variance is (1 − p)/p². The estimator of the parameter, p, is the reciprocal of the sample mean.

In the use of a defibrillator the device is applied until the patient's heart starts beating on its own. A study was conducted to estimate the probability that at least five uses of the device are required. The data available are the number of trials required for a successful use of the device on each of 20 patients. The data are:

2 5 3 6 7 4 2 8 2 1 4 6 3 7 1 4 2 4 5 2

The sample mean is 78/20 = 3.9. Thus the estimate of p is 1/3.9 =0.256.

The p{Y ≥ y} = (1 − p)^(y−1) and p{Y < y} = 1 − (1 − p)^(y−1).

The estimated probability that a defibrillator will be used at least five times is (1 − 0.256)^4 = 0.306.
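A sketch of the estimate and the tail probability:

```python
uses = [2, 5, 3, 6, 7, 4, 2, 8, 2, 1, 4, 6, 3, 7, 1, 4, 2, 4, 5, 2]
p_hat = len(uses) / sum(uses)       # reciprocal of the sample mean, 1/3.9
p_at_least_5 = (1 - p_hat) ** 4     # P{Y >= 5} = (1 - p)^(5-1)
print(round(p_hat, 3), round(p_at_least_5, 3))  # 0.256 and 0.306
```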

A generalization of the geometric is the negative binomial or Pascal distribution. The experiment is the same as in the geometric case; however, the question is: what is the probability that the s-th success occurs on trial Y? The formula and other information can be found in HS.

The formula and other information can be found in HS.

The Hypergeometric distribution

Experiment: Dependent Bernoulli trials in a finite population of size N, from which a sample of size n is taken, where there are k ones and N − k zeros.

Question : What is the probability of Y successes in the sample of n?

The probability function is:

f(y) = [k!/{y!(k − y)!}][(N − k)!/{(n − y)!(N − k − n + y)!}]/[N!/{n!(N − n)!}], y = 0, 1, 2, …, n; y ≤ k; n − y ≤ N − k.

The mean of the distribution is nk/N and the variance is [nk(N − k)(N − n)]/{N²(N − 1)}.

The parameter of interest in this case is k; its estimator is the greatest integer less than or equal to (N + 1)y/n. In some problems N is the unknown parameter. If that is the case, the estimator of N is the greatest integer less than or equal to kn/y.

A public health official is interested in estimating the number of persons with a given disease in a group of

20.

The examination is costly so he takes a random sample of 10 people and he finds that 4 have the disease.

He wishes to estimate, k, the number of sick individuals.

The estimate of k is 21(4)/10 = 8.4, hence the estimate is 8 sick individuals.

Using this estimate, the probability of finding 4 people out of ten with the disease would be:

[8!/(4!4!)][12!/(6!6!)]/[20!/(10!10!)] = 0.35.
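A sketch of the estimator and the hypergeometric probability:

```python
from math import comb

N, n, y = 20, 10, 4
k_hat = (N + 1) * y // n            # greatest integer <= (N+1)y/n
prob = comb(k_hat, y) * comb(N - k_hat, n - y) / comb(N, n)
print(k_hat, round(prob, 2))  # 8 and 0.35
```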

The Poisson distribution

Experiment: Count the number of occurrences of independent events that occur at rate λ in a period of time or in an area, volume, unit, etc.

Question : What is the probability of Y occurrences in the period of time, area, etc.?

Answer : Poisson distribution.

This experiment does not involve Bernoulli trials, nor is there a sample size. Examples are the number of times a pacemaker has to be replaced in a patient in a period of one year, the number of tumors found on a patient, the number of fish caught in a five minute period, the number of arrivals of ambulances in a one hour period, etc. The probability function is: f(y) = λ^y e^(−λ)/y!; λ > 0, y = 0, 1, 2, …

The distribution function is tabled in most statistics books. Note we used these tables to evaluate F(t) for the gamma distribution when η is an integer. The mean and variance of the distribution are both λ. The sample mean is an estimator of the parameter λ.

In an experiment the number of bacterial colonies on a slide is counted. This count can be modeled by a Poisson distribution. A total of 25 slides are examined, yielding a sample mean of 24.0.

The estimate of λ is 24 colonies per slide.

If we wish to estimate the probability that there are fewer than 30 colonies on a slide, we find p{Y < 30} = 0.868 from the Poisson table using F(29).
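A sketch of the Poisson lookup, summing the probability function directly instead of using the table:

```python
import math

def poisson_cdf(y_max, lam):
    # P{Y <= y_max} for a Poisson random variable with mean lam
    return sum(lam ** k * math.exp(-lam) / math.factorial(k) for k in range(y_max + 1))

prob = poisson_cdf(29, 24.0)   # P{Y < 30} = F(29), about 0.868
print(round(prob, 3))
```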

TESTING MODEL FIT

Test of Model Fit

An important part of analyzing data is to check on whether the model you intend to use is a reasonable approximation of the system output. A test of the distributional assumption should precede any data analysis. There are tests for all the models that we covered in the previous lecture. I will show you some of these; for the others I will supply references. There are many procedures available for testing model fit; many of the ones presented here were developed by your professor who obviously has a bias in their selection.

Probability Plotting

The simplest method of assessing model fit is to use a graphical procedure known as probability plotting. This technique can be used for distributions that do not have a shape parameter or can be transformed to one that does not have a shape parameter. A shape parameter is one that changes the shape of the distribution.

The normal, logistic and exponential distributions have no shape parameters. The lognormal and Weibull can be transformed to distributions without shape parameters. If the parameter is known then probability plots can be made for specified values of this parameter.

To construct a probability plot one follows these steps:

1 . Obtain a sheet of probability paper for the distribution being tested.

2 . Rank the observations from smallest to highest.

3 . Plot the ranked observations vs. p = i/(n + 1). The letter "i" is the rank number of the data point, which runs from 1 to n.

4 . Depending on the brand of the paper chosen one of the two axes will have a preprinted scale. The values of p are plotted on this axis.

5 . The values of the ordered observations are plotted on the other axis.

You must scale this axis according to the values of the data. Try to use as much of the axis as possible.

6 . Plot the data.

7. If the data fall in an approximate straight line then you have chosen a reasonable model. It helps to draw a straight line through the points to judge whether it reasonably fits the data. Remember the data are random variables and will not form a perfect straight line.

Look for departures in the tails. See chapter 8 in HS for examples of plots.

8. If the plot is approximately linear crude estimators of the parameters can be obtained from the plot.

9. In evaluating a plot remember that the variance of points in the tail(s) is higher than those in the middle of the distribution. Thus the relative linearity of the plot near the tails of a distribution will often seem poorer than at the center of the distribution even if the correct model was chosen. This statement is not applicable if a tail is bounded. Thus for the exponential distribution it is not true for the lower tail.

10. The plotted points are ordered and hence not independent. Each point is higher than the previous one. Thus we should not expect them to be randomly scattered about the plotted line.

11. A model can never be proven to be correct even if a straight line appears to be appropriate. This is especially true for small sample size where only extreme differences from the selected model can be detected. The best one can say is that there is no evidence from the data that the model is unreasonable.
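The steps above can also be carried out numerically; a sketch, using the plotting position p = i/(n + 1), where `NormalDist.inv_cdf` supplies the normal quantiles that the preprinted scale of probability paper encodes. The intercept and slope of a least-squares line give crude plot-style estimates of μ and σ (the slope estimate is cruder than the sample standard deviation):

```python
from statistics import NormalDist

def normal_plot_fit(data):
    """Least-squares line through ordered data vs. normal quantiles at p = i/(n + 1).
    Returns (intercept, slope), which estimate (mu, sigma)."""
    xs = sorted(data)
    n = len(xs)
    q = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    qbar, xbar = sum(q) / n, sum(xs) / n
    slope = (sum((a - qbar) * (b - xbar) for a, b in zip(q, xs))
             / sum((a - qbar) ** 2 for a in q))
    return xbar - slope * qbar, slope

temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]
mu_hat, sigma_hat = normal_plot_fit(temps)
print(round(mu_hat, 2), round(sigma_hat, 2))
```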

Using the data from the class assignment prepare a normal probability plot by hand on the probability paper given to you. Normal probability plots can also be done using SPSS.

Enter the data in the computer and use SPSS to prepare the plot.

Compare the two plots.

Normal Distribution

An estimate of the mean of the distribution, μ, can be obtained from the plotted line by determining the data value where the line crosses the 50th percentile line (the preprinted scale). The parameter σ can be estimated from the slope of the plotted line. For any normal distribution the standard deviation equals approximately two-fifths of the difference between the 90th and the 10th percentiles. The percentiles are obtained from the plotted line: the data value where the plotted line crosses the .10 and .90 percentile lines on the preprinted scale. Any percentile can be obtained directly from the plotted line in like manner. Use the plot to estimate the standard deviation for the above data set and compare the result with that obtained from SPSS. Find the value from the distribution that will not be exceeded by more than five percent of the values.

Weibull Distribution

The plot for the Weibull distribution is done exactly as that for the normal distribution except the scales on the paper are different and there are two scales for each axis.

On the paper shown in HS the values on the bottom X scale are Z = ln X.

The scaling of this axis makes the transformation.

You need only to plot the data using the preprinted scale.

The parameter η can be estimated from the slope of the plotted line as follows:

1. Select 2 values of W, the Y axis values at the right hand side of the paper. Call these W2 (the larger value chosen) and W1 (the smaller value chosen).

2. Using the scale at the top of the X axis, find the values Z2 and Z1 corresponding to where the chosen W lines cross the plotted line.

3. The estimator of η is b = [W2 − W1]/[Z2 − Z1].

4. The estimator of σ is obtained by finding a, the intercept where the plotted line crosses the line Z = 0 on the top scale.

Other Probability Plots

Probability plots for the lognormal distribution can be done in two ways. If you do not have a sheet of lognormal paper you need only take the log of the data and plot the transformed data on regular normal probability paper. Use the same procedures for estimating the two parameters as was done for the normal case. If you have a sheet of log normal paper then simply plot the data without transforming them. This paper has a log scale.

There is also probability paper for the gamma distribution if you know the value of η and it is an integer. The paper varies with the value of η. Note that when η is equal to one the distribution is the exponential. The scale parameter of the gamma distribution can be estimated by the slope of the line.

See HS for further details.

Tests Of Distributional Fit

Probability plots are a simple method of checking whether a chosen model is reasonable.

However it is a subjective procedure and two individuals looking at the same plot could come to different conclusions.

A formal test of hypothesis can be made for which a p-value can be obtained and used for making a decision as to the reasonableness of the chosen model.

Most of these procedures are complex and only references to them will be given.

There are several general approaches to these test procedures; one that I have used will be given for this class.

We have seen that in the normal probability plotting procedure one can estimate the variance of the distribution from the slope of the plotted line squared.

However this estimate is a valid estimator only if the points plotted fall in an approximate linear pattern. If the points do not fall in an approximate straight line then the estimated variance is incorrect.

The sample variance is an unbiased estimator of the variance whether or not the sample data came from a normal model. Similar to the ANOVA technique that you learned in your statistics class a test of the null hypothesis that the model is correct can be made by comparing these two estimates by computing their ratio.

Shapiro-Wilk W Test For The Normal

Distribution

The numerator and denominator of the ratio for testing for normality are not independent; hence the ratio does not have an F distribution, and the slope of the line cannot be obtained by simple linear regression.

The steps for computing the ratio and obtaining the test statistic, W, can be found in HS. This test procedure is included in most general software packages including

SPSS.

Use the data from the normal and probability plot examples to test for normality.

The Brain-Shapiro Test for the Exponential

Distribution

The test for the exponential distribution in HS is outdated. A better test was devised by Brain and Shapiro in "A Regression Test for Exponentiality: Censored and Complete Samples", Technometrics, 25, 60-76 (1983). The test statistic is computed as follows:

1. Order the data from smallest to largest: X_1 ≤ X_2 ≤ ... ≤ X_n.

2. Compute the normalized spacings Y_{i+1} = (n - i)(X_{i+1} - X_i), i = 1, 2, ..., n-1.

3. Compute

Z = [12/(n-2)]^{1/2} [Σ_{i=1}^{n-1} (i - n/2) Y_{i+1}] / [Σ_{i=1}^{n-1} Y_{i+1}]

4. Compute

V = [5/{4(n+1)(n-2)(n-3)}]^{1/2} [12 Σ_{i=1}^{n-1} (i - n/2)² Y_{i+1} - n(n-2) Σ_{i=1}^{n-1} Y_{i+1}] / [Σ_{i=1}^{n-1} Y_{i+1}]

5. Calculate the test statistic U = Z² + V².

6. For large samples U has a chi-squared distribution with two degrees of freedom when the data come from an exponential distribution. This is an upper-tail test; non-exponentiality will result in large values of U. When n > 15 the critical values for the test are as follows:

90th percentile: U_0.90 = 4.605 - 2.5/n
95th percentile: U_0.95 = 5.991 - 1.25/n
97.5th percentile: U_0.975 = 7.378 + 3.333/n

7. Thus if the value of U is greater than one of the three values above, then the p-value is less than 1 minus the corresponding subscript. Hence if U is greater than the 90th percentile, the p-value is less than 0.10.
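The steps above can be sketched in Python. The spacing coefficient (n - i) and the placement of the normalizing constants follow the reconstruction given here, so verify against the 1983 Technometrics article before serious use. A useful sanity check: if the data are exactly the expected exponential order statistics, every normalized spacing is equal and U is essentially zero.

```python
import math

def brain_shapiro_U(data):
    """Brain-Shapiro statistic for exponentiality (a sketch of the
    numbered steps in the text)."""
    x = sorted(data)
    n = len(x)
    # Step 2: normalized spacings Y_{i+1} = (n - i)(X_{i+1} - X_i)
    y = [(n - i) * (x[i] - x[i - 1]) for i in range(1, n)]
    s = sum(y)
    # Step 3: Z, a standardized linear contrast in the spacings
    z = (math.sqrt(12.0 / (n - 2))
         * sum((i - n / 2) * yi for i, yi in enumerate(y, start=1)) / s)
    # Step 4: V, a standardized quadratic contrast in the spacings
    v = (math.sqrt(5.0 / (4 * (n + 1) * (n - 2) * (n - 3)))
         * (12 * sum((i - n / 2) ** 2 * yi for i, yi in enumerate(y, start=1))
            - n * (n - 2) * s) / s)
    # Step 5: U = Z^2 + V^2, approximately chi-squared(2) under exponentiality
    return z ** 2 + v ** 2

# Sanity check: expected exponential order statistics give equal
# spacings, so U should be ~0 (well below any critical value).
n = 20
ideal = [sum(1.0 / (n - j) for j in range(k)) for k in range(1, n + 1)]
u_ideal = brain_shapiro_U(ideal)
```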

Chi-squared Goodness-of-Fit Test

The chi-squared goodness-of-fit test is used to test whether a selected discrete model is appropriate for a set of discrete data. We will not discuss this procedure since it was covered in your prior statistics course. The procedure is described in HS.

Goodness-of-Fit Tests for Other Distributions

The following are references for tests for other models.

Johnson Distribution

Scientists are often faced with the problem of summarizing a set of data by means of a mathematical function that fits the data and allows them to obtain estimates of percentiles and probabilities. A common practice is to use a flexible family of distributions to accomplish this. In most cases the family has four parameters. One such system is the Johnson system of distributions, which has three families, each with four parameters.

These families are described in HS; however, HS's method for determining which of the three families to use and for estimating the parameters is out of date. The current procedures are found in Slifker and Shapiro, "The Johnson System: Selection and Parameter Estimation," Technometrics, Vol. 22, pp. 239-246, 1980.

The following is abstracted from that article:

The system was devised by using a transformation of a standard normal variable via the equation z = γ + η k(x; λ, ε), where z is a standard normal variable and k is a function that covers a wide variety of shapes. The parameters γ and η are shape parameters, λ is a scale parameter, and ε is a location parameter.

The three families are obtained by letting the function k equal the following:

k(x; λ, ε) = sinh⁻¹[(x - ε)/λ], denoted the S_U distribution, -∞ < x < ∞

k(x; λ, ε) = ln[(x - ε)/(λ + ε - x)], denoted the S_B distribution, ε ≤ x ≤ ε + λ

k(x; λ, ε) = ln[(x - ε)/λ], denoted the S_L distribution, x ≥ ε

The S_L distribution is a form of the log-normal distribution having three parameters. The first step in using this system is to determine which of the three families to use. The following is the procedure to accomplish this:

1. Choose a value of z > 0. Any value will do; however, for a good fit in the tails of the distribution a value close to 0.5 is recommended for moderate sample sizes and 1.0 or higher for large sample sizes. A good value is 0.548.

2. Compute 3z. If the recommended value of 0.548 is used, then 3z = 1.645.

3. Determine from a table of the normal distribution the percentage points corresponding to -3z, -z, z, and 3z. Using the selected value above, the corresponding percentage points are 0.05, 0.292, 0.708, and 0.95. Call these the p_i's.

4. We next estimate the data percentiles corresponding to these percentages using the equation i = n·p_i + 1/2 to determine the i-th rank in the ordered list of the data, where n is the sample size. This will usually not be an integer, and linear interpolation will be necessary.
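The i = n·p + 1/2 rule with linear interpolation can be sketched as follows; the function name is illustrative, and the temperature data from the earlier example are reused for the demonstration (there, p = 0.75 gave i = 8.0 and a 75th-percentile estimate of 98.2).

```python
def percentile_estimate(data, p):
    """Order-statistic percentile estimate using i = n*p + 1/2,
    with linear interpolation when i is not an integer."""
    x = sorted(data)
    n = len(x)
    i = n * p + 0.5
    lo = int(i)          # integer part of the rank
    frac = i - lo        # fractional part drives the interpolation
    if frac == 0:
        return x[lo - 1]                     # exact i-th order statistic (1-based)
    return x[lo - 1] + frac * (x[lo] - x[lo - 1])

temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]
q75 = percentile_estimate(temps, 0.75)  # i = 8.0, so the 8th order statistic
```

Note the function assumes 1 ≤ i ≤ n; percentages too extreme for the sample size would need clamping or a model-based estimate instead.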

5. Next compute the quantities m, n, and p using the data values (x_{jz}; j = -3, -1, 1, 3) from the prior step as follows (note that here n denotes this difference, not the sample size):

m = x_{3z} - x_z
n = x_{-z} - x_{-3z}
p = x_z - x_{-z}

6. If mn/p² > 1 use the S_U distribution; if it is less than one use the S_B; and if it equals one use the S_L.
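The family-selection rule of steps 5-6 can be sketched directly. The function name is illustrative; the demonstration uses the rounded percentile values from the worked example below (10.16, 13.58, 15.24, 16.68), so the ratio differs slightly from the article's unrounded value of 1.664 but still selects S_U.

```python
def johnson_family(x_m3z, x_mz, x_z, x_3z):
    """Select the Johnson family from the four data percentiles
    at -3z, -z, z, and 3z (steps 5-6 in the text)."""
    m = x_3z - x_z          # upper-tail spread
    n = x_mz - x_m3z        # lower-tail spread (not the sample size)
    p = x_z - x_mz          # central spread
    ratio = m * n / p ** 2
    if ratio > 1:
        return "SU", ratio
    if ratio < 1:
        return "SB", ratio
    return "SL", ratio

family, ratio = johnson_family(10.16, 13.58, 15.24, 16.68)
```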

7. Once the proper family is selected the next step is to estimate the parameters. The estimation formulas for each family are different; you must select the appropriate set depending on the selection from the last step.

i) Johnson S_U Distribution

The values of the parameters are presented in such a way as to emphasize their dependence on the ratios m/p and n/p.

Parameter estimates for the Johnson S_U distribution (as given by Slifker and Shapiro; these are the equations used in the worked example below):

η = 2z / cosh⁻¹[(1/2)(m/p + n/p)]

γ = η sinh⁻¹[(n/p - m/p) / {2(mn/p² - 1)^{1/2}}]

λ = 2p(mn/p² - 1)^{1/2} / [(m/p + n/p - 2)(m/p + n/p + 2)^{1/2}]

ε = (x_z + x_{-z})/2 + p(n/p - m/p) / [2(m/p + n/p - 2)]

ii) Johnson S_B Distribution

The solutions for the S_B parameters turn out to depend on the ratios p/m and p/n (as opposed to m/p and n/p for S_U). The parameter estimation formulas for the S_B distribution are given in the Slifker and Shapiro article.

iii) Johnson S_L Distribution

The parameter estimation formulas for the S_L distribution are likewise given in the Slifker and Shapiro article.

8. To determine the value of F(x) for the data set, simply substitute the parameter estimates into the transformation for z for the selected model and use a table of the normal distribution.

9. To determine the data percentile x_p corresponding to a given percentage point p, solve the defining equation for z to obtain x_p, where z_p is the standard normal value corresponding to the desired value of p. Let w = (z_p - γ)/η. Then:

For S_L use x_p = ε + λ e^w

For S_B use x_p = [ε + (λ + ε)e^w] / [1 + e^w]

For S_U use x_p = λ[e^{2w} - 1]/(2e^w) + ε (equivalently, x_p = ε + λ sinh(w))

a. The following example was taken from a large sample of 9440 measurements.
b. Since this is a very large sample we use a value of z = 1.0.
c. The first step is to order the data from smallest to largest.
d. Using the chosen value of z we obtain the percentage points corresponding to -3z (p = 0.0014), -z (p = 0.1587), z (p = 0.8413), and 3z (p = 0.9986).
e. Using i = n·p_i + 1/2 we find that x_{-3z} = 10.16, x_{-z} = 13.58, x_z = 15.24, and x_{3z} = 16.68.
f. Thus m = 1.447, n = 3.172, and p = 1.661. (Note that the numbers on the line above were rounded off.) Hence mn/p² = 1.664. This indicates that the proper model is the S_U distribution.
g. Using the estimating equations for this model we find that the estimates of the parameters are:

η = 2(1)/cosh⁻¹[(1/2)(0.871 + 1.910)] = 2.333

γ = 2.333 sinh⁻¹[(1.910 - 0.871)/{2((1.910)(0.871) - 1)^{1/2}}] = 1.402

λ = 2(1.661)(1.664 - 1)^{1/2}/[(0.871 + 1.910 - 2)(0.871 + 1.910 + 2)^{1/2}] = 1.585

ε = (15.24 + 13.58)/2 + 1.661(1.910 - 0.871)/[2(0.871 + 1.910 - 2)] = 15.516

To find the probability that a measurement will be smaller than 9.0 we compute F(9.0) by first finding the corresponding z value:

z = γ + η sinh⁻¹[(x - ε)/λ]

z = 1.402 + 2.333 sinh⁻¹[(9.0 - 15.516)/1.585] = 1.402 + 2.333 sinh⁻¹(-4.11) = -3.54

F(-3.54) = 0.0002

Thus there is a very low probability of getting a measurement below 9.0.

If one desires the median of the distribution (p = 0.5), corresponding to a z value of zero, then w = (0 - 1.402)/2.333 = -0.601 and

x_{0.50} = 1.585[e^{2w} - 1]/(2e^w) + 15.516 = 1.585(0.301 - 1)/1.097 + 15.516 = 14.51.

Thus one-half of the distribution is below 14.51.
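The fitted S_U model can be checked numerically. This is a minimal sketch that hard-codes the parameter estimates from the example; small differences from the text's hand-rounded arithmetic are expected.

```python
import math

# Fitted S_U parameters from the worked example
eta, gam, lam, eps = 2.333, 1.402, 1.585, 15.516

def su_cdf(x):
    """F(x) for the fitted S_U model: z = gam + eta*asinh((x - eps)/lam),
    followed by the standard normal CDF."""
    z = gam + eta * math.asinh((x - eps) / lam)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def su_quantile(zp):
    """Inverse transform: x_p = eps + lam*sinh((zp - gam)/eta)."""
    return eps + lam * math.sinh((zp - gam) / eta)

prob_below_9 = su_cdf(9.0)   # very small, ~0.0002
median = su_quantile(0.0)    # ~14.5
```

Note that the quantile function is the exact inverse of the CDF transformation, so applying one after the other recovers the starting point.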

1. S. Gulati and S. Shapiro, "Goodness of Fit for the Pareto Distribution," in Statistical Models and Methods for Biomedical and Technical Systems, Birkhäuser, Boston, pp. 263-277 (2007).

2. S. Gulati and S. Shapiro, "Goodness of Fit Tests for the Logistic Distribution," Journal of Statistical Theory and Practice, 1.

Analyzing Random Models Via Simulation

In the prior two weeks you learned about modeling of biological systems. However, these models only represent the average output to be expected: the variables in the model are all constants. In real life each subject has a different value for each constant, and to get a more realistic picture of the output these constants should be replaced by random variables that have distributions like the ones we have discussed. To do this it is necessary to choose a statistical distribution for each variable in the model, assign values to the parameters of each random variable, and then run the model over and over on a computer, inserting a new value for each of the random variables on each run. This technique is called Monte Carlo simulation, and it is typically repeated 1000 or more times. You thereby generate a data set that gives the distribution of the output, as opposed to a static model that gives a single value. You can then use the Johnson system to fit a model to the output and find desired probabilities, giving a much more comprehensive picture of the properties of the output.
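The procedure can be sketched with a toy model. The model (clearance time = volume/rate) and the distributions and parameter values assigned to its constants are purely hypothetical, chosen only to illustrate the Monte Carlo loop.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed for a reproducible illustration

def clearance_model(rate, volume):
    """Hypothetical static model: time to clear a volume at a given rate."""
    return volume / rate

# Monte Carlo: replace each constant with a random draw and repeat.
outputs = []
for _ in range(1000):
    rate = random.gauss(2.0, 0.2)      # assumed distribution for the rate constant
    volume = random.gauss(10.0, 1.0)   # assumed distribution for the volume
    outputs.append(clearance_model(rate, volume))

# The output is now a distribution rather than the single static
# value volume/rate = 5; its percentiles and spread can be studied,
# or a Johnson model fitted to it.
avg = mean(outputs)
```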

In this lecture we will first discuss how to generate random numbers from some of the distributions covered earlier. Most computer systems have programs for generating random variates from the normal and uniform distributions. We will use these programs to generate variates from other models. Define Z as a normal variate with mean zero and variance one and U as a uniform variate on the interval (0,1).

Normal Distribution

Suppose we desire random variates, Y, from a normal distribution with mean μ and variance σ².

Thus, setting Y = μ + σZ will yield the desired random variable.

Exponential Distribution

F(Y), for any continuous distribution, has a uniform distribution on (0,1). If F can be expressed as an analytical function, we can set F(Y) = U and generate a random variate by solving for Y.

Thus for the exponential distribution F(y) = 1 - e^{-λy} = U, and solving for Y,

Y = -(1/λ) ln(1 - U)

yields the desired random variate from the exponential distribution.
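The inverse-transform step can be sketched as follows; the function name and the choice λ = 0.5 are illustrative.

```python
import math
import random

random.seed(42)  # fixed seed for a reproducible illustration

def exponential_variate(lam):
    """Inverse transform: Y = -(1/lam) * ln(1 - U), U uniform on (0,1)."""
    u = random.random()          # U in [0, 1)
    return -math.log(1.0 - u) / lam

lam = 0.5
sample = [exponential_variate(lam) for _ in range(10000)]
# The sample mean should be close to the exponential mean 1/lam = 2.
sample_mean = sum(sample) / len(sample)
```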

Gamma Distribution (integer shape parameter)

The gamma distribution with an integer shape parameter η can be viewed as the sum of η independent exponential variables with scale parameter λ.

Thus to generate a random variate we merely add exponential variables using the above formula:

Y = -(1/λ) Σ_{i=1}^{η} ln(1 - U_i)

Log-normal Distribution

The log of a log-normal variable has a normal distribution with parameters μ and σ.

Thus we simply exponentiate a normal variate:

Y = e^{σZ + μ}

Weibull Distribution

Since F(y) = 1 - exp[-(y/α)^β] = U, solving for Y yields:

Y = α[-ln(1 - U)]^{1/β}
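The three generators above can be sketched together; function names are illustrative, α and β denote the Weibull scale and shape as written here, and the sample-mean checks use well-known moments (gamma mean η/λ, log-normal mean e^{μ+σ²/2}, Weibull(α=1, β=2) mean Γ(1.5) ≈ 0.886).

```python
import math
import random

random.seed(7)  # fixed seed for a reproducible illustration

def gamma_variate(eta, lam):
    """Sum of eta independent exponentials (integer shape eta, scale 1/lam)."""
    return -sum(math.log(1.0 - random.random()) for _ in range(eta)) / lam

def lognormal_variate(mu, sigma):
    """Exponentiate a normal variate: Y = exp(sigma*Z + mu)."""
    return math.exp(sigma * random.gauss(0.0, 1.0) + mu)

def weibull_variate(alpha, beta):
    """Inverse transform: Y = alpha * (-ln(1 - U))**(1/beta)."""
    return alpha * (-math.log(1.0 - random.random())) ** (1.0 / beta)

g = [gamma_variate(3, 2.0) for _ in range(5000)]       # mean should be near 3/2
l = [lognormal_variate(0.0, 0.5) for _ in range(5000)] # mean near exp(0.125)
w = [weibull_variate(1.0, 2.0) for _ in range(5000)]   # mean near 0.886
```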

Other Distributions

Formulas for some discrete and other continuous distributions can be found in HS.
