Education in statistics

Combination of random variables

Ingemar Sjöström

Contents
   Combination of random variables
   Gauss' approximation rules for calculating μ and σ
   A one-variable combination
   A multi-variable combination
   Appendix
   The expected value in a statistical distribution
   The standard deviation in a statistical distribution
   Summary of linear combinations of variables
   The standard deviation of the linear combination
   Important formulas for the average value
   Proof of Gauss' approximation formulas for μ and σ
   A combination of one variable
   A combination of several variables

NB that chapter 4 of "A course in statistics" contains a full description and discussion of the notion of linear combination of variables. The documents A collection of diagrams and Exercises with computer support also include a number of references to linear combinations of variables and corresponding computer macros for simulation.

Reproduction of documents without written permission is prohibited according to Swedish copyright law (1960:729). This prohibition includes text as well as illustrations, and concerns any form of reproduction including printing, photocopying, tape-recording etc.

© Ing-Stat – statistics for the industry, www.ing-stat.se, Rev D, 2015-05-19


Combination of random variables

The idea of combinations of variables, and especially so-called linear combinations of variables, is extensively covered in chapter 4 of A course in statistics. In addition, the macro-programs %Avestd, %LinC, %CLT, %Die and %LinCom all discuss and exemplify several features of such combinations of variables. In this paper we extend the ideas to non-linear combinations.

Anyone who works with measurements from some kind of process and is interested in understanding relationships, sources of variation, improvement based on facts etc. will sooner or later need to learn about combinations of variables. This is true irrespective of whether we design products or processes, manufacture or assemble products, control administrative processes, run a service department etc.

A combination of variables arises when we add, subtract or perform any other mathematical operation involving random variables. Some examples:

1. Assembling parts into a unit means that e.g. lengths are added
2. Summing the daily sales means summing up a number of results
3. Making multilayer boards means summing the thicknesses of a number of layers
4. Calculating the total time of a process by adding a number of steps
5. Calculating the radial deviation from the X- and Y-deviations
6. The total mass in a lift with exactly 12 persons is a simple sum of 12 variables
7. Adding a number of trouble reports (TR) from different areas to one stream of TRs
8. Averaging a number of data gives a linear combination of variables
9. Designing an electronic circuit with formulas for different characteristics
Etc.

The common feature of the above examples is that there exists a model, e.g. the drawing in examples 1 and 3, a flowchart in example 4, the Pythagorean theorem in example 5, a circuit diagram in example 9, that helps us to write down the combination. In addition, all models contain one or more random variables. (If there is no variation, the problem is a simpler mathematical problem with no uncertainty or variation.)

If we call the result Y, we can use a common expression for all the examples with the following function:

   Y = f(X1, X2, X3, ..., Xn)

In some of the examples above we have a simple addition:

   Y = X1 + X2 + X3 + ... + Xn

but in example 5, calculating the radial deviation from the X- and Y-deviations, we have a more complicated expression:

   R = √(X² + Y²)

Features of interest. Usually we are interested in the main features, e.g.

   - the distribution of the resulting variable
   - the expected value μ
   - the standard deviation σ

In some special cases it is easy to derive the distribution of the resulting variable, but usually it takes some thinking and some mathematical manipulation. In principle, the expected value and the standard deviation of the variable can be derived from the distribution at hand. See the appendix for the necessary formulae.

NB that the formulae for the expected value and the standard deviation start from knowledge of the statistical distribution, but if we do not know the distribution we cannot derive these parameters in this way. Before entering ways of deriving the distribution we will therefore use two other, approximate ways of calculating the expected value and the standard deviation, namely the approximation rules of Gauss and simulation. (The latter is a very good way to verify the correctness of the mathematical derivations.)
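As a small illustration of the simulation approach, example 5 above (the radial deviation) can be simulated directly. Here the X- and Y-deviations are assumed to be normally distributed with mean 0 and standard deviation 1; the figures are chosen only for this illustration:

Random 1000 c1;              # X-deviation, 1000 values.
  Normal 0 1.
Random 1000 c2;              # Y-deviation, 1000 values.
  Normal 0 1.
Let c3 = sqrt(c1**2 + c2**2) # R, the radial deviation.
Describe c3                  # Gives the mean, standard deviation etc of R.
Histogram c3                 # Shows the distribution of R.

Without any mathematical derivation, the Describe output estimates the expected value and the standard deviation of R, and the histogram gives a picture of its distribution.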
Gauss' approximation rules for calculating μ and σ

A one-variable combination

If we have a linear combination of variables, the calculation of the expected value and the standard deviation is straightforward and can be done without any use of approximation. See chapter 4 in the Ing-Stat document A course in statistics for details. See also the appendix below. However, in the case of a linear combination the approximation rules of Gauss give the same, exact result, which we will show by a simple example. (It is when the combination is non-linear that we only get approximations to the true values.)

Example 1. Suppose that we have the following model of a linear combination:

   Y = a + b·X

where X is a variable that is linearly transformed into the variable Y. As the model is a straightforward linear combination of one variable, the expected value (μ_Y or E(Y)) and the standard deviation σ_Y of Y become:

   μ_Y = a + b·μ_X     and     σ_Y = √(b²·σ_X²)

Suppose (for simplicity) that X has a uniform, continuous distribution over the range, say, [c = 11, d = 13]. (Actually, we do not need to specify the type of distribution, because the discussion involves only the expected value and the standard deviation. However, in order to be able to simulate the example below, we need to simulate data from at least some kind of distribution.)

The expected value of the variable X is 12 and the variance is

   V(X) = (c − d)²/12 = (11 − 13)²/12 = 0.3333

A quick simulation to check the result:

Random 1000 c2;              # Creates 1000 data uniformly
  Uniform 11 13.             # distributed between 11-13.
Describe c2                  # Gives a number of statistics.

NB that the output gives the standard deviation, which needs to be squared to give the variance. Suppose that a = 3.2 and b = 4.6. We then get

   μ_Y = a + b·μ_X = 3.2 + 4.6·12 = 58.4     and     σ_Y = √(b²·σ_X²) = √(4.6²·0.3333) = 2.66

Approximation formulas, example 1. These approximation formulas rely on the mathematical tool of derivatives. See the appendix for a simple proof of Gauss' formulas. If the combination is linear, the formulas give an exact answer, as in this example:

   E(Y) = E[f(X)] ≈ f(μ_X) = a + b·μ_X = 3.2 + 4.6·12 = 58.4

   V(Y) ≈ (f'(μ_X))²·V(X) = b²·V(X) = 4.6²·0.3333 = 7.05

Thus we have the standard deviation of Y as 2.66 (the square root of 7.05).

A simulation

Random 1000 c1;              # X-data.
  Uniform 11 13.
Let c2 = 3.2 + 4.6*c1        # Y = 3.2 + 4.6*X
Describe c2                  # Gives a number of statistics.

One simulation gave the average 58.4 and the standard deviation 2.66, which seems to coincide well with the theory above.

Example 2. Suppose that we have the following model:

   A = π·D²/4

where D is a variable (a diameter) that is used to calculate the area A. As the model is a non-linear combination of one variable, the expected value (μ_A or E(A)) and the variance V(A) can be calculated approximately as follows (suppose that μ_D = 50 and σ_D = 3.1):

   E(A) = E[f(D)] ≈ f(μ_D) = π·μ_D²/4 = π·50²/4 = 1963.5

   V(A) ≈ (f'(μ_D))²·V(D) = (π·μ_D/2)²·σ_D² = (π·50/2)²·3.1² = 59279.3

Thus we have the standard deviation of A as 243.5. A simulation gives the following:

Random 1000 c1;              # D-data, the diameter.
  Normal 50 3.1.
Let c2 = (Pi() * c1**2)/4    # A, the area.
Describe c2                  # Gives a number of statistics.

One simulation gave the average 1973.7 and the standard deviation 240.5, which seems to coincide well with the theory above.
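If we want to compare the approximation and the simulation side by side, the two formulas of example 2 can also be evaluated as stored constants (a small sketch; the constant names k1 and k2 are arbitrary choices):

Let k1 = (Pi()*50**2)/4                   # Approximate expected value of A, about 1963.5.
Let k2 = sqrt(((Pi()*50/2)**2)*3.1**2)    # Approximate standard deviation of A, about 243.5.
Print k1 k2                               # Compare with the Describe output above.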
A multi-variable combination

If we have a linear combination of several variables, the calculation of the expected value and the standard deviation is straightforward and can be done without any use of approximation. See the appendix below. We can also use the approximation formulas, and then we get an exact result.

Example 3. Suppose that we have the following model (see the macro-program %LinC):

   d = H − D

where H is the diameter of a hole, D is the diameter of a shaft and d is the difference (H − D). As the model is a straightforward linear combination of variables, the expected value (μ_d or E(d)) and the standard deviation σ_d of d become:

   μ_d = μ_H − μ_D     and     σ_d = √(σ_H² + σ_D²)

Using Gauss' approximation formulas. These approximation formulas rely on the mathematical tool of derivatives. See the appendix for a simple proof of Gauss' formulas. Suppose that we have the following expression:

   Z = f(X, Y)

Then the approximation formulas become:

   E(Z) = E[f(X, Y)] ≈ f(μ_X, μ_Y)

   V(Z) = V[f(X, Y)] ≈ (f'_X(μ_X, μ_Y))²·V(X) + (f'_Y(μ_X, μ_Y))²·V(Y)

Suppose that μ_H = 50.02 and μ_D = 49.85, and that σ_H = 0.015 and σ_D = 0.012. We then get:

   E(d) = E[f(H, D)] ≈ μ_H − μ_D = 50.02 − 49.85 = 0.17

   V(d) = V[f(H, D)] ≈ V(H) + V(D) = σ_H² + σ_D² = 0.015² + 0.012² = 0.000369

Thus we have the standard deviation σ_d = 0.019. A simulation gives the following:

Random 1000 c1;              # H-data, the diameter of the hole.
  Normal 50.02 0.015.
Random 1000 c2;              # D-data, the diameter of the shaft.
  Normal 49.85 0.012.
Let c3 = c1 - c2             # d = H - D, i.e. the difference between hole and shaft.
Describe c3                  # Describes the 'd'-data.

One simulation gave the average 0.170 and the standard deviation 0.019, which seems to coincide well with the theory above.

Note. This example was a linear combination of variables, and in such situations the result can be derived exactly according to the ordinary rules. The use of the more complicated approximation rules was only to show that these give the same, exact result for linear combinations.

Example 4. Suppose that we have the following, non-linear model:

   Z = f(X, Y) = X/Y

where Z is a function of the two random variables X and Y. As the model is a non-linear combination of variables, the expected value (μ_Z or E(Z)) and the standard deviation σ_Z of Z need to be calculated using the approximation formulas. Suppose that μ_X = 15.2 and μ_Y = 19.8, and that σ_X = 1.1 and σ_Y = 2.2. We then get:

   E(Z) = E[f(X, Y)] ≈ f(μ_X, μ_Y) = μ_X/μ_Y = 15.2/19.8 = 0.768

   V(Z) = V[f(X, Y)] ≈ (f'_X(μ_X, μ_Y))²·V(X) + (f'_Y(μ_X, μ_Y))²·V(Y)
        = (1/μ_Y)²·σ_X² + (μ_X/μ_Y²)²·σ_Y² = (1/19.8)²·1.1² + (15.2/19.8²)²·2.2² = 0.0104

Thus we have the standard deviation σ_Z = 0.102. A simulation gives the following:

Random 1000 c1;              # X-data.
  Normal 15.2 1.1.
Random 1000 c2;              # Y-data.
  Normal 19.8 2.2.
Let c3 = c1/c2               # Z = X/Y
Describe c3                  # Describes the Z-data.

One simulation gave the average 0.782 and the standard deviation 0.108, which seems to coincide well with the theory above.

Appendix

The expected value in a statistical distribution

   Type of variable     Formula for the expected value
   Discrete             E(X) = μ_X = Σ x·P(X = x)
   Continuous           E(X) = μ_X = ∫ x·f(x) dx

The standard deviation in a statistical distribution

   Type of variable     Formula for the standard deviation
   Discrete             σ = √( Σ (x − μ)²·P(X = x) )
   Continuous           σ = √( ∫ (x − μ)²·f(x) dx )

The two formulas above are somewhat difficult. The following versions are easier:

   Discrete             σ = √( Σ x²·P(X = x) − μ² )
   Continuous           σ = √( ∫ x²·f(x) dx − μ² )

Summary of linear combinations of variables

Let us assume that we have the following linear combination of random variables:

   Y = a·X1 + b·X2 + c·X3 + d·X4

The expected value of the linear combination

   E(Y) = a·E(X1) + b·E(X2) + c·E(X3) + d·E(X4)

(where a, b etc. can be equal to 1) or, with a different notation,

   μ_Y = a·μ_X1 + b·μ_X2 + c·μ_X3 + d·μ_X4

The standard deviation of the linear combination

   σ_Y = √( a²·V(X1) + b²·V(X2) + c²·V(X3) + d²·V(X4) )

or, with a different notation,

   σ_Y = √( a²·σ_X1² + b²·σ_X2² + c²·σ_X3² + d²·σ_X4² )

   Xi   = the random variables of the linear combination
   μ_Xi = the true mean of the random variable Xi
   σ_Xi = the true standard deviation of the random variable Xi

N.B. If there is a correlation between the Xi variables, the formula must contain the so-called covariance terms.
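As a small simulation sketch of the summary above, suppose (the figures are chosen only for this illustration) that Y = 2·X1 + 3·X2, where X1 is normal with μ = 10, σ = 1 and X2 is normal with μ = 20, σ = 2, and where X1 and X2 are independent so that no covariance terms are needed. The formulas then give E(Y) = 2·10 + 3·20 = 80 and σ_Y = √(2²·1² + 3²·2²) = √40 ≈ 6.32:

Random 1000 c1;              # X1-data.
  Normal 10 1.
Random 1000 c2;              # X2-data.
  Normal 20 2.
Let c3 = 2*c1 + 3*c2         # Y = 2*X1 + 3*X2, the linear combination.
Describe c3                  # The mean should be close to 80 and the
                             # standard deviation close to 6.32.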
Important formulas for the average value

(The average is also a linear combination.)

   E(X̄) = E(X)          or, with a different notation,     μ_X̄ = μ

   σ_X̄ = √( V(X)/n )    or, with a different notation,     σ_X̄ = σ/√n

   n = the number of values in the sample
   μ = the true mean value of the random variable X
   σ = the true standard deviation of the random variable X

μ and σ are calculated on theoretical grounds and not from a set of data.

Proof of Gauss' approximation formulas for μ and σ

A combination of one variable

We start with the simplest case, i.e. only one variable. Suppose that we have the following general expression:

   Y = f(X)

where X is a variable that is transformed into Y by the function f. NB that f is here not a distribution of any kind. We want to get an expression for the expected value E(Y) (i.e. μ_Y). First we write f() as a so-called Taylor series. (Writing a function as a Taylor series means that instead of a complicated function f() we get a series of terms where each term is a simpler expression. By choosing only a few terms we usually get a good enough approximation.)

Suppose that we have the following function and its Taylor series, developed around the point z0:

   f(z) = f(z0) + f'(z0)·(z − z0) + f''(z0)·(z − z0)²/2 + ... + f^(i)(z0)·(z − z0)^i/i! + ...

If we include only the first two terms and regard the rest as zero, we get:

   f(z) ≈ f(z0) + f'(z0)·(z − z0)

We replace z by X and z0 by μ_X and get:

   f(X) ≈ f(μ_X) + f'(μ_X)·(X − μ_X)

Now it is time to take the expectation (i.e. get an expression for E(Y)):

   E(Y) = E[f(X)] ≈ E[f(μ_X) + f'(μ_X)·(X − μ_X)]
        = E[f(μ_X)] + E[f'(μ_X)·(X − μ_X)]
        = f(μ_X) + f'(μ_X)·E(X − μ_X) = f(μ_X)          since E(X − μ_X) = 0

The variance becomes (according to the rules for calculating variances):

   V(Y) = V[f(X)] ≈ V[f(μ_X) + f'(μ_X)·(X − μ_X)]
        = V[f'(μ_X)·(X − μ_X)]          since f(μ_X) is a constant, i.e. has no variance
        = (f'(μ_X))²·V(X)

A simple example. Suppose that we have a variable X uniformly distributed over the interval [a = 10, b = 15]. We then create Y = ln(X). What are the expected value and the variance of Y? The expected value of the variable X is 12.5 and the variance is

   V(X) = (a − b)²/12 = (10 − 15)²/12 = 2.0833

According to the approximation rules we get

   E(Y) = E[f(X)] ≈ f(μ_X) = ln(12.5) = 2.53

and the variance becomes (knowing that the first derivative of ln(X) is 1/X):

   V(Y) ≈ (f'(μ_X))²·V(X) = (1/12.5)²·2.0833 = 0.0133

The standard deviation thus is 0.115.

A simulation. Let us simulate this in order to verify the results:

Random 1000 c1;              # 1000 values in c1 from a uniform
  Uniform 10 15.             # distribution with end points 10 and 15.
Let c2 = loge(c1)            # Y = f(X) = ln(X).
Describe c2                  # Gives a number of statistics.

One simulation gave mean = 2.51 and standard deviation = 0.116. This result supports the theoretical derivation above.
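The effect of keeping only the first two Taylor terms can also be seen directly in a simulation. Below, the exact transformation Y = ln(X) is compared with its linearized version ln(12.5) + (X − 12.5)/12.5, i.e. the first two terms developed around μ_X = 12.5 (a small sketch using the same example as above):

Random 1000 c1;                          # X-data, uniform between 10 and 15.
  Uniform 10 15.
Let c2 = loge(c1)                        # The exact transformation Y = ln(X).
Let c3 = loge(12.5) + (c1 - 12.5)/12.5   # The linearized version (first two Taylor terms).
Describe c2 c3                           # The two columns have nearly the same mean and
                                         # standard deviation, which is why the
                                         # approximation rules work.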
A combination of several variables

Suppose that we have the following general expression:

   Z = f(X1, X2, X3, ..., Xn)

where the Xi are variables that are transformed into Z by the function f. NB that f is here not a distribution of any kind. We want to get an expression for the expected value E(Z) (i.e. μ_Z). Again we write f() as a so-called Taylor series. (Writing a function as a Taylor series means that instead of a complicated function f() we get a series of terms where each term is a simpler expression. By choosing only a few terms we usually get a good enough approximation.)

Suppose that we have the following function and its Taylor series, developed around the point (a1, a2, a3, ..., an) and truncated after the first-order terms:

   f(X1, X2, X3, ..., Xn) ≈ f(a1, a2, a3, ..., an) + Σ (i = 1 to n) (∂f/∂Xi)·(Xi − ai)

where the partial derivatives are evaluated at the point (a1, a2, a3, ..., an). If we take expectations we get, after setting ai = μi:

   E[f(X1, X2, X3, ..., Xn)] ≈ E[ f(μ1, μ2, μ3, ..., μn) + Σ (i = 1 to n) (∂f/∂Xi)·(Xi − μi) ] = f(μ1, μ2, μ3, ..., μn)

since E(Xi − μi) = 0 for every i. The variance becomes (according to the rules for calculating variances):

   V(Z) ≈ Σ (i = 1 to n) (∂f/∂Xi)²·V(Xi)

with the derivatives evaluated at (μ1, μ2, μ3, ..., μn).

An example. Suppose that we have two variables X and Y, normally distributed with the parameters μ_X = 25.1 and μ_Y = 39.2, and σ_X = 3.1 and σ_Y = 4.2. We then create

   Z = f(X, Y) = √(X² + Y²)

What are the expected value and the variance of Z? The expected value of the variable Z is:

   E(Z) = E[√(X² + Y²)] ≈ √(μ_X² + μ_Y²) = √(25.1² + 39.2²) = 46.55

In order to calculate the variance we first calculate the two partial derivatives:

   ∂f/∂X = (1/2)·(X² + Y²)^(−1/2)·2X = X/√(X² + Y²)

   ∂f/∂Y = (1/2)·(X² + Y²)^(−1/2)·2Y = Y/√(X² + Y²)

Using the formula above we get:

   V(Z) ≈ ( μ_X/√(μ_X² + μ_Y²) )²·V(X) + ( μ_Y/√(μ_X² + μ_Y²) )²·V(Y)
        = (μ_X²·σ_X² + μ_Y²·σ_Y²)/(μ_X² + μ_Y²)
        = (25.1²·3.1² + 39.2²·4.2²)/(25.1² + 39.2²) = 15.31

The standard deviation thus is 3.91.

A simulation. Let us simulate this in order to verify the results:

Random 1000 c1;              # 1000 X-values in c1 from a normal
  Normal 25.1 3.1.           # distribution.
Random 1000 c2;              # 1000 Y-values in c2 from a normal
  Normal 39.2 4.2.           # distribution.
Let c3 = sqrt(c1**2 + c2**2) # Calculates Z.
Describe c3                  # Gives a number of statistics.

One simulation gave mean = 46.6 and standard deviation = 3.96. This result supports the theoretical derivation above. ■