Chapter 8

Princess Nora University
Modeling and Simulation
Input Modeling
and
Goodness-of-fit tests
Arwa Ibrahim Ahmed
Input models
2
Input models provide the driving force for a simulation model.
• The quality of the output is no better than the quality of inputs.
Steps of input model development:
• Collect data from the real system.
• Identify a probability distribution to represent the input process.
• Choose parameters for the distribution.
• Evaluate the chosen distribution and parameters for goodness
of fit.
DATA COLLECTION
3
Data collection is one of the biggest tasks in solving a real problem. GIGO – garbage-in, garbage-out.
Suggestions that may enhance and facilitate data collection:
• Plan ahead: begin with a practice or pre-observation session; watch for unusual circumstances.
• Analyze the data as it is being collected: check adequacy.
• Combine homogeneous data sets, e.g., from successive time periods or from the same time period on successive days.
• Be aware of data censoring: the quantity is not observed in its entirety,
danger of leaving out long process times.
• Check for relationship between variables, e.g. build scatter diagram.
• Check for autocorrelation.
• Collect input data, not performance data.
IDENTIFYING THE DISTRIBUTION
4
• Histograms
• Selecting families of distribution
• Parameter estimation
• Goodness-of-fit tests
• Fitting a non-stationary process
HISTOGRAMS [IDENTIFYING THE DISTRIBUTION]
• A frequency distribution or histogram is useful in determining the shape of a distribution.
The number of class intervals depends on:
• The number of observations.
• The dispersion of the data.
• For continuous data: corresponds to the probability density function of a theoretical distribution.
• For discrete data: corresponds to the probability mass function.
• If few data points are available: combine adjacent cells to eliminate the
ragged appearance of the histogram.
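For instance, a quick sketch (with made-up data) showing how the choice of the number of class intervals changes the picture a histogram gives:

```python
import numpy as np

# Stand-in for collected input data: 200 made-up service times
data = np.random.exponential(scale=2.0, size=200)

# Try a few class-interval counts; the apparent shape can differ noticeably
for k in (5, 10, 20):
    counts, edges = np.histogram(data, bins=k)
    print(f"k = {k:2d}  cell counts: {counts}")
```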
IDENTIFYING THE DISTRIBUTION
A family of distributions is selected based on:
• The context of the input variable
• The shape of the histogram
• Frequently encountered distributions:
  - Easier to analyze: exponential, normal, and Poisson
  - Harder to analyze: beta, gamma, and Weibull
IDENTIFYING THE DISTRIBUTION
Use the physical basis of the distribution as a guide,
for example:
• Binomial: number of successes in n trials.
• Poisson: number of independent events that occur in a fixed amount of time or space.
• Normal: distribution of a process that is the sum of a number of component processes.
IDENTIFYING THE DISTRIBUTION
• Exponential: time between independent events, or a process time that is memoryless.
• Weibull: time to failure for components.
• Discrete or continuous uniform: models complete
uncertainty.
• Triangular: a process for which only the minimum,
most likely, and maximum values are known.
• Empirical: resamples from the actual data collected.
Poisson distribution
• is a discrete probability distribution that expresses the probability of a given
number of events occurring in a fixed interval of time and/or space if these
events occur with a known average rate and independently of the time
since the last event.
• Example:
• Suppose you typically get 4 pieces of mail per day. That becomes your
expectation, but there will be a certain spread: sometimes a little more,
sometimes a little less, once in a while nothing at all.
• Given only the average rate, for a certain period of observation (pieces of
mail per day), the Poisson Distribution will tell you how likely it is that you
will get 3, or 5, or 11, or any other number, during one period of
observation.
Poisson distribution:
10
The distribution equation
If the expected number of occurrences in a given interval is λ, then the probability that there are exactly k occurrences (k being a non-negative integer, k = 0, 1, 2, ...) is equal to:
    P(X = k) = e^(−λ) λ^k / k!
Where:
* e is the base of the natural logarithm (e = 2.71828...)
* k is the number of occurrences of an event (the value taken by the random variable), whose probability is given by the function
* k! is the factorial of k
* λ is a positive real number, equal to the expected number of occurrences during the given interval (an average rate of occurrence).
For instance, if the events occur on average 4 times per minute, and one is
interested in the probability of an event occurring k times in a 10 minute
interval, one would use a Poisson
distribution as the model with λ = 10×4 = 40.
Poisson distribution:
Example:
Suppose that, on average, 2 customers arrive at an office per day (λ = 2). Calculate the probability that exactly 3 customers arrive tomorrow.
Step 1: Find e^(−λ), where λ = 2 and e = 2.718:
    e^(−λ) = (2.718)^(−2) = 0.135.
Step 2: Find λ^x, where λ = 2 and x = 3:
    λ^x = 2^3 = 8.
Step 3: Find f(x):
    f(x) = e^(−λ) λ^x / x!
    f(3) = (0.135)(8) / 3! = 0.18.
Hence there is an 18% probability that exactly 3 customers arrive tomorrow.
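As a quick check of the arithmetic above, a minimal Python sketch (the helper name poisson_pmf is ours) that evaluates the Poisson pmf:

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k occurrences, given mean rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Office example: lambda = 2 expected arrivals, probability of exactly 3
print(round(poisson_pmf(3, 2.0), 4))        # ~0.1804, i.e. about 18%

# Mail example: lambda = 4 pieces per day, probability of exactly 3, 5, or 11
for k in (3, 5, 11):
    print(k, round(poisson_pmf(k, 4.0), 4))
```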
Exponential distribution
12
• is a family of continuous probability distributions. It describes
the time between events in a Poisson process, i.e. a process in
which events occur continuously and independently at a
constant average rate. It is the continuous analogue of the
geometric distribution.
• The probability density function (pdf) of an exponential distribution is
    f(x) = λ e^(−λx) for x ≥ 0, and f(x) = 0 otherwise
• λ > 0 is the parameter of the distribution and x is the random variable.
Exponential distribution
• For example, the rate of incoming phone calls differs according to the time of day. But if we focus on a time interval during which the rate is roughly constant, such as from 2 to 4 p.m. during work days, the exponential distribution can be used as a good approximate model for the time until the next phone call arrives.
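To make this concrete, a small sketch that samples exponential inter-arrival times; the call rate of 30 calls per hour is an illustrative assumption, not a figure from the slides:

```python
import random

rate = 30.0  # assumed calls per hour during the roughly-constant window (illustrative)

# Time (in hours) until the next call: exponential with mean 1/rate.
# random.expovariate takes the rate parameter lambda directly.
next_call = random.expovariate(rate)
print(f"Next call in {next_call * 60:.1f} minutes")

# The average of many samples should be close to the theoretical mean 1/rate.
samples = [random.expovariate(rate) for _ in range(100_000)]
print(f"Sample mean: {sum(samples) / len(samples):.4f} hours (theory: {1 / rate:.4f})")
```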
Binomial Distribution
Definition:
• The Binomial Distribution is a discrete probability distribution. It is used when there are exactly two mutually exclusive outcomes of a trial; these outcomes are appropriately labeled Success and Failure.
• The Binomial Distribution is used to obtain the probability of observing r
successes in n trials, with the probability of success on a single trial denoted by p.
• Function:
    P(X = r) = nCr · p^r · (1 − p)^(n − r)
• where,
    n = number of trials
    r = number of successful events
    p = probability of success on a single trial
    nCr = n! / (r! (n − r)!)
    1 − p = probability of failure
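A small Python check of this formula (math.comb gives nCr directly; the trial count and success probability below are illustrative):

```python
import math

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# Illustrative numbers: 10 trials, success probability 0.3, exactly 3 successes
print(round(binomial_pmf(3, 10, 0.3), 4))   # ~0.2668
```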
Goodness-of-fit
Conduct hypothesis testing on input data distribution using:
• Kolmogorov-Smirnov test
• Chi-square test
Goodness-of-fit tests provide helpful guidance for evaluating the
suitability of a potential input model.
In a real application, no single distribution is the one correct choice.
• If very little data are available, it is unlikely to reject any
candidate distributions
• If a lot of data are available, it is likely to reject all candidate
distributions.
Goodness-of-fit
16
Intuition: comparing the histogram of the data to the shape of the candidate
density or mass function
Valid for large sample sizes when parameters are estimated by maximum
likelihood
By arranging the n observations into a set of k class intervals or cells, the test statistic is:
    χ0² = Σ (i = 1 to k) (Oi − Ei)² / Ei
where Oi is the observed frequency and Ei the expected frequency in the ith class interval. It approximately follows the chi-square distribution with k − s − 1 degrees of freedom, where s = number of parameters of the hypothesized distribution estimated by the sample statistics.
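A minimal sketch of this statistic (the function name chi_square_statistic is ours):

```python
def chi_square_statistic(observed, expected):
    """Chi-square goodness-of-fit statistic: sum over cells of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```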
Goodness-of-fit
17
The hypotheses of a chi-square test are:
H0: The random variable, X, conforms to the distributional assumption with the parameter(s) given by the estimate(s).
H1: The random variable, X, does not conform.
If the distribution tested is discrete and combining adjacent cells is not required (so that each Ei exceeds the minimum requirement):
Each value of the random variable should be a class interval, unless combining is necessary, and
    pi = p(xi) = P(X = xi)
Goodness-of-fit
18
If the distribution tested is continuous:
    pi = ∫ from a(i−1) to a(i) f(x) dx = F(ai) − F(ai−1)
where ai−1 and ai are the endpoints of the ith class interval, f(x) is the assumed pdf, and F(x) is the assumed cdf.
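For instance, a small sketch computing the cell probabilities pi from the cdf of a hypothesized exponential distribution, F(x) = 1 − e^(−λx); the rate and interval endpoints below are illustrative:

```python
import math

lam = 0.5                                 # hypothesized rate (illustrative)
endpoints = [0, 1, 2, 4, float("inf")]    # class-interval endpoints a_0, ..., a_k (illustrative)

def F(x):
    """Exponential cdf, F(x) = 1 - exp(-lam * x)."""
    return 1.0 if x == float("inf") else 1.0 - math.exp(-lam * x)

# p_i = F(a_i) - F(a_{i-1}) for each class interval
p = [F(b) - F(a) for a, b in zip(endpoints[:-1], endpoints[1:])]
print([round(v, 4) for v in p], "sum =", round(sum(p), 4))   # the p_i sum to 1
```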
Recommended number of class intervals (k):

Sample Size, n    Number of Class Intervals, k
20                Do not use the chi-square test
50                5 to 10
100               10 to 20
> 100             n^(1/2) to n/5

Caution: a different grouping of the data (i.e., a different k) can affect the hypothesis-testing result.
Goodness-of-fit
19
The pmf for the Poisson distribution was given earlier:
    p(x) = e^(−α) α^x / x!,   x = 0, 1, 2, ...
    p(x) = 0,                 otherwise
For α = 3.64, the probabilities associated with various values of x are obtained using this equation with the following results:

p(0) = 0.026    p(3) = 0.211    p(6) = 0.085    p(9)  = 0.008
p(1) = 0.096    p(4) = 0.192    p(7) = 0.044    p(10) = 0.003
p(2) = 0.174    p(5) = 0.140    p(8) = 0.020    p(11) = 0.001
Goodness-of-fit
20
Vehicle Arrival Example (continued):
H0: the random variable is Poisson distributed.
H1: the random variable is not Poisson distributed.
xi      Observed Frequency, Oi    Expected Frequency, Ei    (Oi − Ei)²/Ei
0        12                         2.6                      7.87  (cells 0–1 combined)
1        10                         9.6
2        19                        17.4                      0.15
3        17                        21.1                      0.80
4        10                        19.2                      4.41
5         8                        14.0                      2.57
6         7                         8.5                      0.26
7         5                         4.4                     11.62  (cells ≥ 7 combined)
8         5                         2.0
9         3                         0.8
10        3                         0.3
≥ 11      1                         0.1
Total   100                       100.0                     27.68

(Cells with small expected frequencies are combined before computing the statistic: x = 0 and x = 1 form one cell, and x ≥ 7 form one cell, leaving k = 7 cells.)

Degrees of freedom are k − s − 1 = 7 − 1 − 1 = 5. Since
    χ0² = 27.68 > χ²(0.05, 5) = 11.1,
the hypothesis is rejected at the 0.05 level of significance.
Goodness-of-fit
21
p-value for the test statistic
The significance level at which one would just reject H0
for the given test statistic value.
A measure of fit, the larger the better
Large p-value: good fit
Small p-value: poor fit
Vehicle Arrival Example (cont.):
H0: the data are Poisson distributed.
Test statistic: χ0² = 27.68, with 5 degrees of freedom
p-value = 0.00004, meaning we would just reject H0 at the 0.00004 significance level; hence the Poisson distribution is a poor fit.
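The p-value quoted above can be reproduced with the chi-square survival function in scipy:

```python
from scipy.stats import chi2

p_value = chi2.sf(27.68, df=5)       # P(chi-square with 5 d.f. exceeds 27.68)
print(f"p-value = {p_value:.5f}")    # approximately 0.00004
```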
Goodness-of-fit
22
Many software packages use the p-value as the ranking measure to automatically determine the "best fit". Things to be cautious about:
• The software may not know about the physical basis of the data; the distribution families it suggests may be inappropriate.
• Close conformance to the data does not always lead to the most appropriate input model.
• The p-value does not say much about where the lack of fit occurs.
Recommended: always inspect the automatic selection using graphical methods.
Goodness-of-fit
23
Fitting an NSPP (non-stationary Poisson process) to arrival data is difficult; the most practical approach is:
• Approximate a constant arrival rate over some basic interval of time, but vary it from time interval to time interval.
Suppose we need to model arrivals over the time horizon [0, T]; this approach is most appropriate when we can:
• Observe the time period repeatedly, and
• Count arrivals / record arrival times.
Goodness-of-fit
24
The estimated arrival rate during the ith time period is:
    λ̂(t) = (1 / (n · Δt)) Σ (j = 1 to n) Cij
where n = number of observation periods, Δt = time interval length, and
Cij = number of arrivals during the ith time interval on the jth observation period.
Time Period     Number of Arrivals (Day 1, Day 2, Day 3)    Estimated Arrival Rate (arrivals/hr)
8:00 -  8:30    12, 14, 10                                   24
8:30 -  9:00    23, 26, 32                                   54
9:00 -  9:30    27, 18, 32                                   52
9:30 - 10:00    20, 13, 12                                   30
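A small sketch of this estimate for the table above (half-hour intervals, three observed days):

```python
# Arrival counts C_ij: rows = time intervals i, columns = observation days j
counts = [
    [12, 14, 10],   # 8:00 - 8:30
    [23, 26, 32],   # 8:30 - 9:00
    [27, 18, 32],   # 9:00 - 9:30
    [20, 13, 12],   # 9:30 - 10:00
]
n = 3        # number of observation periods (days)
dt = 0.5     # interval length in hours

# lambda_hat(t) = (1 / (n * dt)) * sum_j C_ij for each interval i
rates = [sum(row) / (n * dt) for row in counts]
print([round(r, 1) for r in rates])   # roughly [24.0, 54.0, 51.3, 30.0]
```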
Goodness-of-fit
25
If data are not available, some possible sources of information about the process are:
• Engineering data: often a product or process has performance ratings provided by the manufacturer, or company rules specify time or production standards.
• Expert opinion: people who are experienced with the process or similar processes; often they can provide optimistic, pessimistic, and most-likely times, and they may know the variability as well.
• Physical or conventional limitations: physical limits on performance, limits or bounds that narrow the range of the input process.
• The nature of the process.
The uniform, triangular, and beta distributions are often used as input models in this situation.
Goodness-of-fit
26
Example: Production planning simulation.
• Input of the sales volume of various products is required; the salesperson of product XYZ says that:
  - No fewer than 1,000 units and no more than 5,000 units will be sold.
  - Given her experience, she believes there is a 90% chance of selling more than 2,000 units, a 25% chance of selling more than 3,000 units, and only a 1% chance of selling more than 4,000 units.
• Translating this information into a cumulative probability of being less than or equal to those goals for simulation input:

i    Interval (Sales)       Cumulative Frequency, ci
1    1000 ≤ x ≤ 2000        0.10
2    2000 < x ≤ 3000        0.75
3    3000 < x ≤ 4000        0.99
4    4000 < x ≤ 5000        1.00
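One way to turn this table into a simulation input is inverse-cdf sampling with linear interpolation inside each interval; a sketch under that approach (helper names are ours):

```python
import random

# Piecewise-linear empirical cdf for XYZ sales, from the table above
breakpoints = [1000, 2000, 3000, 4000, 5000]
cum_probs = [0.00, 0.10, 0.75, 0.99, 1.00]

def sample_sales():
    """Draw a sales volume by inverting the piecewise-linear empirical cdf."""
    u = random.random()
    for i in range(1, len(cum_probs)):
        if u <= cum_probs[i]:
            lo_p, hi_p = cum_probs[i - 1], cum_probs[i]
            lo_x, hi_x = breakpoints[i - 1], breakpoints[i]
            # Linear interpolation within the interval
            return lo_x + (u - lo_p) / (hi_p - lo_p) * (hi_x - lo_x)
    return breakpoints[-1]

samples = [sample_sales() for _ in range(100_000)]
print(f"Share above 2000 units: {sum(s > 2000 for s in samples) / len(samples):.3f}")   # ~0.90
```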