EMIS 7300 SYSTEMS ANALYSIS METHODS
Spring 2006
Dr. John Lipp
Copyright © 2002 - 2006 John Lipp

Today’s Session Topics
• Part 2: The Statistics You Thought You Knew.
  – What is Statistics?
  – Populations and Samples.
  – Mean, Variance, Standard Deviation.
  – Mode, Range, Quartiles, Percentiles.
  – Frequency, Relative Frequency.
  – Dot Diagram and Box Plot.
  – Mechanistic vs. Empirical Models.
  – Deterministic vs. Statistical Modeling.

What is Statistics?
• Statistics is the branch of mathematics dealing with applied probability theory.
• Statistics is very prescriptive.
• Statistics’ main emphasis is on decision making.
• Engineering is well populated with decisions to be made based on random or imperfect data:
  – Is this radar measurement just cosmic radiation, or is it a stealth fighter?
  – How many high-pressure hoses in this lot should be destructively tested to be confident the whole lot is good?
  – Is there a correlation between system performance and missile mass, antenna gain, bad FLIR pixels, etc.?

What is Statistics? (cont.)
• Statistics is also concerned with estimation of unknown quantities (statistical parameters like mean and variance).
• Estimation is the more prevalent statistics problem found in engineering:
  – Curve fitting (Regression)
    • Linear,
    • Logistic,
  – The Kalman filter (a course in and of itself).
  – Design of Experiments (another course).

Populations
• Missiles are built in lots of 10. The following parameters are measured as percentages of the requirement specifications.

Missile   Weight   Motor   Seeker   Range   Labor
   1         99      96       99      96     105
   2        101     102      101      99     101
   3        102      98      101      95     101
   4        101     105      102     103      97
   5        103      99      101      95      96
   6        101     103       99     100      99
   7        102     102      100      98      90
   8         98      98       99     100     101
   9        100      94      100      94     105
  10        100     101      100     100     100

Populations (cont.)
Dot Diagrams
[Figure: dot diagrams of Weight, Motor, Seeker, Range, and Labor, each on a 90 to 110 axis. The number of dots represents the frequency of the data value.]

Population Mean, Variance, and Standard Deviation
• The size of a population is denoted N. The number of unique data values will be denoted M.
• The population mean is a measure of a population’s central tendency. It is commonly denoted by the Greek letter $\mu$ and is computed from data via
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \quad\text{or}\quad \mu = \sum_{i=1}^{M} \frac{N_i}{N}\, x_i = \sum_{i=1}^{M} x_i f_i$$
  In the second form the sum runs over the M unique values, $N_i$ is the frequency of the unique value $x_i$, and $f_i = N_i / N$ is its relative frequency.
• The population variance is a measure of a population’s variability about the population mean. It is commonly denoted by the Greek letter $\sigma^2$ and is computed from data via
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \quad\text{or}\quad \sigma^2 = \sum_{i=1}^{M} (x_i - \mu)^2 f_i$$

Population Mean, Variance, and Standard Deviation (cont.)
• The population standard deviation, $\sigma$, is the square root of the population variance. It is also a measure of variability about the mean.
  – Unlike the population variance, the population standard deviation has the same units as the population mean and the raw data.
  – In engineering $\sigma^2$ is usually proportional to power, while $\sigma$ is proportional to magnitude (voltage, current, force, velocity, etc.).
  – $\mu \pm 1\sigma$ contains 68.3% of “normal” data
  – $\mu \pm 2\sigma$ contains 95.5% of “normal” data
  – $\mu \pm 3\sigma$ contains 99.7% of “normal” data

Population Mean, Variance, and Standard Deviation (cont.)
[Figure: dot diagrams of Weight, Motor, Seeker, Range, and Labor, each on a 90 to 110 axis.]
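
The two forms of the population mean and variance can be checked against each other numerically. The following is a minimal sketch for the Weight column of the missile lot only; the variable names (`weight`, `u`, `Ni`, `f`) are illustrative choices, not part of the course code.

```matlab
% Minimal sketch: population mean and variance of the Weight column,
% computed from the raw data and from the relative frequencies.
weight = [99 101 102 101 103 101 102 98 100 100];
N = numel(weight);

% Form 1: sum over all N data values.
mu_raw     = sum(weight) / N;
sigma2_raw = sum((weight - mu_raw).^2) / N;

% Form 2: sum over the M unique values, weighted by relative frequency.
u  = unique(weight);                       % the M unique data values (sorted)
Ni = arrayfun(@(v) sum(weight == v), u);   % frequency of each unique value
f  = Ni / N;                               % relative frequencies, sum to 1
mu_freq     = sum(u .* f);
sigma2_freq = sum((u - mu_freq).^2 .* f);

% The standard deviation has the same units as the data.
sigma = sqrt(sigma2_raw);

fprintf('mean: %.2f (raw), %.2f (frequency form)\n', mu_raw, mu_freq);
fprintf('variance: %.2f (raw), %.2f (frequency form)\n', sigma2_raw, sigma2_freq);
fprintf('standard deviation: %.2f\n', sigma);
```

Both forms should agree to round-off, giving a mean of 100.7 and a variance of 2.01 for Weight.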

Population Range, Median, Quartiles, and Percentiles
• The population range is another measure of variability. It is the difference between the largest and smallest data values.
• The population mode is the most frequently occurring value in the data, that is, the most probable value. Ties are allowed.
• The population range and mode are rarely used. If the size of the population is infinite they can be undefined.

Population Range, Median, Quartiles, and Percentiles
• The population median is another measure of central tendency. It is computed by sorting the data (from lowest to highest) and dividing this ordered data into two equal halves at the data mid-point.
  – If N is odd, the median is the “left over” data point after dividing at the mid-point into equal halves.
  – If N is even, the median is the average of the two data points on either side of the mid-point.
  – Regardless of N’s value, the same number of data points are above and below the median’s value (ties with the median are allocated above/below as necessary).

Population Range, Median, Quartiles, and Percentiles (cont.)
• The division points which divide ordered data into four “equal” portions are the population quartiles:
  – The first or lower quartile is denoted q1.
  – The second quartile is the median, q2.
  – The third or upper quartile is denoted q3.
• The difference q3 – q1 is called the interquartile range and is yet another measure of variability. For “normal” data the interquartile range should be about $4\sigma/3$.
• The division points which divide ordered data into 100 “equal” portions are the population percentiles.
  – Percentiles are most commonly indexed by i, for the i-th percentile.

Population Range, Median, Quartiles, and Percentiles (cont.)
• What constitutes “equal” portions is defined by the following algorithm:
  – Sort (order) the x data in increasing order and call the result y,
$$\{y_1, y_2, y_3, \ldots, y_N\} = \operatorname{sort}\{x_1, x_2, x_3, \ldots, x_N\}$$
  – Determine the quartile / percentile point z by computing
$$z = \frac{Q(N+1)}{4} \ \text{(quartiles)} \qquad z = \frac{K(N+1)}{100} \ \text{(percentiles)}$$
  – $Q \in \{1, 2, 3\}$ for the first quartile, second quartile (median), or third quartile, respectively.
  – $K \in \{1, \ldots, 99\}$ for the K-th percentile.

Population Range, Median, Quartiles, and Percentiles (cont.)
  – Linear interpolation is used to compute the quartiles and percentiles from the two closest data points to z via
$$f(z) = (z - \lfloor z \rfloor)\, y_{\lceil z \rceil} + (\lceil z \rceil - z)\, y_{\lfloor z \rfloor}$$
  – $\lfloor z \rfloor$ = closest integer less than or equal to z (floor).
  – $\lceil z \rceil$ = closest integer greater than or equal to z (ceiling).
  – If z is itself an integer, both weights are zero; in that case use $f(z) = y_z$ directly.
• For the missile lot, N = 10. Thus the value of z for q1 is 2.75 and for q3 it is 8.25. The quartile equations are then
$$q_1 = f(2.75) = (2.75 - 2)\, y_3 + (3 - 2.75)\, y_2 = \frac{3 y_3 + y_2}{4}$$
$$q_3 = f(8.25) = (8.25 - 8)\, y_9 + (9 - 8.25)\, y_8 = \frac{3 y_8 + y_9}{4}$$

Population Range, Median, Quartiles, and Percentiles (cont.)
• The population statistics (means, variances, quartiles, etc.) for the missile lot are computed in the table below.

Parameter   Weight    Motor    Seeker    Range    Labor
μ           100.7     99.8     100.2     98.0     99.5
σ²            2.01    10.36      0.96     7.60    17.65
σ             1.42     3.22      0.98     2.76     4.20
range         5       11         3        9       15
q3 - q1       2.25     4.75      2.00     5.00     5.25
q1           99.75    97.50     99.00    95.00    96.75
q2          101.0    100.0     100.0     98.5    100.5
q3          102.00   102.25    101.00   100.00   102.00

Population Range, Median, Quartiles, and Percentiles (cont.)
• Below is a common method of illustrating the statistical quantiles called a box plot.
  – The box is drawn from q1 to q3 (the interquartile range) and has the median, q2, marked in the middle.
  – A line and mark known as a whisker extends from the box’s q1 end to the smallest data point within 1.5 interquartile ranges from q1.
  – Likewise, a whisker is drawn from the box’s q3 end to the largest data point within 1.5 interquartile ranges from q3.
[Figure: box plot of Labor on a 90 to 110 axis, with a point labeled “Outlier”.]

Population Range, Median, Quartiles, and Percentiles (cont.)
[Figure: Weight and Motor, each plotted on a 90 to 110 axis.]

Population Range, Median, Quartiles, and Percentiles (cont.)
[Figure: Seeker and Range, each plotted on a 90 to 110 axis.]

Population Range, Median, Quartiles, and Percentiles (cont.)
• The code below is for MATLAB.
• Notice that MATLAB has built-in functions to compute most of the statistical values.
• std(x,1) divides by N to give the population standard deviation, while std(x) divides by N-1 to give the sample standard deviation.

    % Population analysis for Missile Lot
    % Rows are missiles 1..10; columns are Weight, Motor, Seeker, Range, Labor
    x = [  99  96  99  96 105
          101 102 101  99 101
          102  98 101  95 101
          101 105 102 103  97
          103  99 101  95  96
          101 103  99 100  99
          102 102 100  98  90
           98  98  99 100 101
          100  94 100  94 105
          100 101 100 100 100 ];
    y = sort(x);                      % sorts each column in increasing order
    x_bar = mean(x)
    sigma = std(x,1)                  % population standard deviation (divides by N)
    sigma2 = sigma.^2                 % named sigma2 so the built-in var() is not shadowed
    q2 = median(x)
    x_range = max(x) - min(x)         % named x_range so the built-in rng() is not shadowed
    q1 = (3*y(3,:) + y(2,:)) / 4      % z = 2.75
    q3 = (3*y(8,:) + y(9,:)) / 4      % z = 8.25
    q31 = q3 - q1                     % interquartile range

In Class Assignment – Deal or Dud?
• The customer is unhappy with his lot of missiles and is canceling the contract.
  – They claim the missiles don’t meet the contract requirements!!!
• Divide up into two teams of statisticians,
  – One representing the plaintiff (the customer), and
  – The other the defendant (Missile King).
• Prepare to argue WITH STATISTICS the case for your side!
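
The quartile and percentile algorithm from the preceding slides (sort, compute z from N + 1, then interpolate linearly) can be sketched as a small general-purpose helper. This is a minimal sketch, not the course's code: `interp_point` is a hypothetical name, it assumes 1 ≤ z ≤ N, and it weights the lower point by one minus the fractional part of z so that an integer z returns y(z) exactly.

```matlab
% Minimal sketch: (N+1)-based quartiles / percentiles with linear
% interpolation, applied to the Weight column of the missile lot.
weight = [99 101 102 101 103 101 102 98 100 100];
y = sort(weight);            % ordered data y_1 ... y_N
N = numel(y);

% interp_point is a hypothetical helper, not a MATLAB built-in.
% It interpolates between the two ordered points that bracket position z.
% Assumes 1 <= z <= N. For non-integer z this equals the slide formula;
% for integer z it returns y(z).
interp_point = @(y, z) (1 - (z - floor(z))) .* y(floor(z)) + ...
                       (z - floor(z)) .* y(min(floor(z) + 1, numel(y)));

q1  = interp_point(y, 1 * (N + 1) / 4);      % z = 2.75
q2  = interp_point(y, 2 * (N + 1) / 4);      % z = 5.5 (median)
q3  = interp_point(y, 3 * (N + 1) / 4);      % z = 8.25
p40 = interp_point(y, 40 * (N + 1) / 100);   % 40th percentile, z = 4.4

fprintf('q1 = %.2f, q2 = %.2f, q3 = %.2f, q3 - q1 = %.2f\n', q1, q2, q3, q3 - q1);
fprintf('40th percentile = %.2f\n', p40);
```

For Weight this should reproduce q1 = 99.75, q2 = 101, and q3 = 102 from the population statistics table. Small percentiles of a small population can give z < 1, which this sketch does not handle.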

Samples
• Often an entire population is not, or cannot be, measured to determine the population’s statistics:
  – The measurement process is destructive.
  – The measurement costs are excessive.
  – The population is evolving, i.e., the statistics fluctuate.
  – The population is theoretical ($N = \infty$).
• Instead of measuring the population, a sub-set or sample of the population can be measured and the parameters estimated by statistical inference.
• The number of items in the sample is typically denoted as n in statistics. (Note n < N.) Denote the number of unique data values as m.

Sample Mean, Variance, and Standard Deviation
• The sample mean is an estimate of the population mean. It is commonly denoted by $\bar{x}$ and is computed from data via
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad\text{or}\quad \bar{x} = \sum_{i=1}^{m} x_i f_i$$
• The sample variance is an estimate of the population variance. It is commonly denoted by $s^2$ and is computed from data via
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$
$$\text{or}\quad s^2 = \frac{n}{n-1}\sum_{i=1}^{m} (x_i - \bar{x})^2 f_i$$

Sample Range, Median, Quartiles, and Percentiles (cont.)
• The sample standard deviation, s, is likewise the square root of the sample variance and is an estimate of the population standard deviation.
• The sample range is the difference between the largest and smallest data sample values.
• Similarly, the sample median, sample quartiles, and sample percentiles are found by using the sorted data samples and substituting n for N in the population equations.
• The sample mode is the most frequently occurring value in the data samples. Ties are allowed.

Sample Statistics
• Consider the missile lot. If only 3 of the 10 missiles are sampled, what is the sample mean?
• Let i, j, and k denote the missiles selected for the sample.
  Then the sample mean is
$$\bar{x} = \frac{1}{3}\left(x_i + x_j + x_k\right)$$
• Since {i, j, k} are selected at random, so are the {xi, xj, xk} data values. That implies the sample mean
  – Is itself a random variable,
  – Has a population, and
  – Has its own mean, variance, quartiles, etc.

Sample Statistics (cont.)
• The first step in determining the sample mean’s statistics is to determine its population:
  – Select i first. Since N = 10, i can take on one of 10 values.
  – Select j next. Since $j \ne i$, j can take on one of 9 values.
  – Finally select k. Since $k \ne i$ and $k \ne j$, k is one of 8 values.
  – The total population size is $10 \times 9 \times 8 = 720$.
• The process applicable to determining the population in this case is known as selection without replacement. The general formula for the size is known as the number of permutations
$${}_nP_r = \frac{n!}{(n-r)!}$$
  where n is the population size and r is the sample size (in this formula only; here n = 10 and r = 3).

Sample Statistics (cont.)
• However, the order in which {xi, xj, xk} are added does not change the value of the sample mean.
• Regardless of the values of i, j, and k, there are six orders in which they can be randomly drawn:
  – {i, j, k} {j, i, k} {i, k, j} {j, k, i} {k, i, j} {k, j, i}
  – Thus, the population size can be reduced to 720 / 6 = 120.
• Generally, the number of orders in which r things can be arranged is r! (= rPr).
• When the order of draw is not important, the formula for the number of combinations is
$${}_nC_r = \frac{n!}{(n-r)!\; r!}$$

Sample Statistics (cont.)
• Consider the n = 3 sample mean for the weight of the missile lot. Note that
  – Largest possible value = (103 + 102 + 102) / 3 = 102 1/3.
  – Smallest possible value = (98 + 99 + 100) / 3 = 99.
  – Range is 3 1/3, with discrete steps every 1/3.
  – That is, the number of unique values is 11 !?!
• What is different about the proposed populations?
  For
  – N = 720: each permutation of i, j, and k is equally probable.
  – N = 120: each combination of i, j, and k is equally probable.
  – N = 11: the sample means are not equally probable!

Sample Statistics (cont.)
[Figure: sampling distribution of the mean weight (n = 3), on a 98 to 102 axis.]

Sample Statistics (cont.)
• The code below is for MATLAB and comes in three sections.
• This code assumes that you have already run the earlier missile-lot population code, which defines x.
• The first code section formulates the population for the n = 3 mean.
• The second code section computes the statistics.
• The third code section computes the data for dot diagrams / histograms and plots them.

    % Population of N = 120
    % (i, j, k follow the slide notation; they shadow MATLAB's imaginary unit)
    i = zeros(120,1);
    j = zeros(120,1);
    k = zeros(120,1);
    ijkdx = zeros(120,3);
    N = 0;
    for idx = 1:10,
      for jdx = (idx+1):10,
        for kdx = (jdx+1):10,
          N = N + 1;
          i(N) = idx;
          j(N) = jdx;
          k(N) = kdx;
          ijkdx(N,:) = [idx jdx kdx];
        end
      end
    end
    % Check that i,j,k are all different (each sum should be 0)
    sum(i==j), sum(i==k), sum(j==k),

Sample Statistics (cont.)

    % Sample mean population statistical analysis
    x3 = (x(i,:) + x(j,:) + x(k,:)) / 3;
    x_bar = mean(x3)
    sigma = std(x3,1)
    sigma2 = sigma.^2                 % named sigma2 so the built-in var() is not shadowed
    q2 = median(x3)
    y3 = sort(x3);
    z1 = 1 * (N+1) / 4;
    z3 = 3 * (N+1) / 4;
    q1 = (z1 - floor(z1)) * y3(ceil(z1),:) + ...
         (ceil(z1) - z1) * y3(floor(z1),:)
    q3 = (z3 - floor(z3)) * y3(ceil(z3),:) + ...
         (ceil(z3) - z3) * y3(floor(z3),:)
    q31 = q3 - q1

    % Compute / plot dot diagrams (but use the
    % built-in histogram function) for all columns
    for loop = 1:size(x3,2),
      sums = round(3*x3(:,loop));     % integer sums of the three sampled values
      bins = min(sums):max(sums);     % named bins so the built-in rng() is not shadowed
      figure(loop);
      hist(sums,bins);
      xx = get(gca,'xtick');
      set(gca,'xticklabel',xx/3);     % relabel the axis in units of the mean
    end
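
As a cross-check on the counting argument and the loop-built population, the same population can be enumerated with MATLAB's `nchoosek`. The following is a minimal sketch for the Weight column only; the variable names are illustrative. The last lines compare the variance of the sample mean with the standard finite-population result $\sigma^2/n \cdot (N-n)/(N-1)$, which the slides do not state.

```matlab
% Minimal sketch: enumerate every n = 3 sample of the Weight column.
weight = [99 101 102 101 103 101 102 98 100 100];
N = numel(weight);
n = 3;

n_perm = factorial(N) / factorial(N - n);   % ordered draws: 720
n_comb = nchoosek(N, n);                    % unordered draws: 120

idx  = nchoosek(1:N, n);           % 120-by-3 matrix of index triples
xbar = sum(weight(idx), 2) / n;    % sample mean of each combination

% Group the combinations by their (integer) sum to avoid round-off
% when comparing means that differ by 1/3.
sums = round(n * xbar);
[vals, ~, grp] = unique(sums);
counts = accumarray(grp, 1);       % number of combinations giving each mean
n_unique = numel(vals);            % 11 distinct sample means
disp([vals / n, counts]);          % each mean and how often it occurs

% Mean and variance of the sample-mean population.
mean_xbar = mean(xbar);                        % equals the population mean
var_xbar  = mean((xbar - mean_xbar).^2);
sigma2    = mean((weight - mean(weight)).^2);  % population variance of Weight
var_fpc   = (sigma2 / n) * (N - n) / (N - 1);  % finite-population correction
fprintf('var of sample mean = %.4f, formula = %.4f\n', var_xbar, var_fpc);
```

The unequal entries in `counts` are why the 11 unique sample means are not equally probable, even though each of the 120 combinations is.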

Sample Statistics (cont.)

                 Sample Mean (n = 3) of
Parameter   Weight      Motor       Seeker      Range      Labor
μ           100.7        99.8       100.2        98.0       99.5
σ²            0.52        2.69        0.25        1.97       4.58
σ             0.72        1.64        0.50        1.40       2.14
q3 - q1       1.0000      2.3333      0.6667      2.0000     3.0000
q1          100.3333     98.6667    100.0000     97.0000    98.0000
q2          100.6667     99.6667    100.3333     98.0000    99.6667
q3          101.3333    101.0000    100.6667     99.0000   101.0000

Statistical Modeling
• Engineering analysis often begins with a mechanistic model of a physical system using scientific first principles, for example, F = ma, V = IR, etc.
• The analysis results of such a design are deterministic, exact, and reproducible.
[Figure: block diagram. Inputs x1, x2, …, xn enter a transfer function y = f(x1, x2, …, xn), which produces the output y.]

Statistical Modeling (cont.)
• Sources of experimental variability are many:
  – Imperfect hardware and measurement devices.
  – Assumptions are approximate (frictionless surfaces really aren’t, missiles flex during maneuvers, etc.).
• A mechanistic model can be augmented with random errors to represent this lack of knowledge:
[Figure: block diagram. Inputs x1, x2, …, xn and random errors e1, e2, …, em enter y = f(x1, x2, …, xn, e1, e2, …, em), which produces the output y.]

Statistical Modeling (cont.)
• The primary parameters in a statistical model of a system are
  – The mean of the response, Y-bar.
  – The standard deviation of the response, S.
  – The random distribution of the response errors.
• A deterministic model only considers the mean of the response;
  – The response’s standard deviation is effectively 0.
  – The response’s random distribution is an indeterminate concept.

Statistical Modeling (cont.)
• When a mechanistic model is unavailable, an empirical model based on experimental evidence can be constructed by considering the system to be a “black box.”
[Figure: black-box diagram. Input factors x1, x2, …, xn and random errors / noise enter the box, which produces the output response(s) y; the response mean and the response variation are each modeled in terms of the input factors.]

Modeling (cont.)
• Neither a mechanistic nor an empirical model is appropriate in some cases! Some phenomena are purely random, possibly even irreducibly random.
• Regardless of the problem, statistics boils down to modeling!

Statistical Model for Sample Mean and Variance
• Simplest Transfer Function: Y-bar and S are constants.
[Figure: plot of y versus x for x from 0 to 100 with Y-bar = 5 and S = 1; the y axis runs from 2.5 to 7.5.]
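
Data like that in the constant-model plot can be simulated in a few lines. This is a minimal sketch that assumes the random errors are standard normal; the slide does not state the error distribution, and the variable names are illustrative.

```matlab
% Minimal sketch: simulate the simplest statistical model, in which the
% response mean Y-bar and the response standard deviation S are constants.
% ASSUMPTION: errors are standard normal (not stated on the slide).
Ybar = 5;
S    = 1;
x = 1:100;                         % input values; they do not affect y here
y = Ybar + S * randn(size(x));     % response = mean + scaled random error

y_bar = mean(y);                   % sample mean, an estimate of Y-bar
s     = std(y);                    % sample standard deviation (divides by n-1)

plot(x, y, '.');
xlabel('x');
ylabel('y');
title(sprintf('Y-bar = %g, S = %g', Ybar, S));
fprintf('sample mean = %.3f, sample std = %.3f\n', y_bar, s);
```

Each run gives slightly different values of `y_bar` and `s`, because the estimates are themselves random variables.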