Dimensional Analysis

It can be argued that much of what statisticians talk about in most experimental design courses (beyond the basic principles of randomization, blocking, and replication) is, for practical purposes, largely about details. One much larger issue, which generally receives little attention, is the selection of independent variables or factors to be included in a study. Statistical discussion of this topic often centers on generic power analysis, e.g. the consequences of how many factors can be considered in an experiment operationally constrained to a given number of runs, where interactions of a specified order cannot be ignored. Some papers published in statistical journals offer practical advice on how to interact with investigators in selecting the experimental factors to be used in an experiment; some of these discussions are valuable, even though most of them are essentially based on (valuable) experience and common sense rather than more formal arguments.

Physical scientists (perhaps especially physicists and engineers dealing with mechanical and hydrodynamic systems) use dimensional analysis as part of their approach to modeling systems and designing experimental studies. Dimensional analysis (DA) is not rooted in statistical ideas, but is built on principles that are generally accepted as necessary axioms in any modeling exercise that represents a physical process. Perhaps the most remarkable aspect of DA is that it can, in some cases, help an investigator reduce the number of experimental factors to be considered in an experimental study. "Optimal" application of DA in complex applications may require substantially more subject-matter knowledge than most statisticians possess. However, because it can be a very powerful experimental design tool from the investigator's perspective, some knowledge of its basic principles should be helpful to statisticians involved in experimental design.

A foundational paper on the topic discussed here was written by Buckingham (1914), for whom the Buckingham Pi Theorem is named. Palmer's (2007) monograph is very accessible for those with a very basic background in physical systems, and contains a large number of good exercises. Albrecht et al. (2013) wrote a general introduction to this topic for statisticians, which is followed by discussion by others.

Physical Quantities

Any physical quantity has both a value and units. For example, X = 6 ft. has the value "6" and units of "feet," a specific unit of the dimension "length." It is often convenient to think about values and units separately. We will use v(−) to denote the value of the physical argument, and [−] to denote its units; in the above example, v(X) = 6 and [X] = feet.

Systems of units are built around the idea that there are a few "fundamental" physical dimensions that require the definition of units, and that the units of all physical quantities can be expressed as monomials of these. For example, length and time are ordinarily regarded as fundamental dimensions, where specific units may be assigned to be (for example) feet and seconds, respectively. But velocity, while a physical quantity, would be given units of feet $\times$ seconds$^{-1}$ in this system, while acceleration would be assigned units of feet $\times$ seconds$^{-2}$. Systems can obviously be defined using different specific units for any one dimension, e.g. length may be given units of meters or miles rather than feet, but derived units (e.g. for velocity and acceleration) would also be modified to match.
Absolute units, or systems of them, are defined so that their corresponding values are on ratio scales with meaningful zeroes. Units of length and mass generally have this property; the most common exception is temperature, where the centigrade and Fahrenheit unit scales are not absolute, but the Kelvin scale is. Conversion between absolute systems of units involves only multiplication by a constant, e.g. X and Y are physically equivalent if v(X) = 1 and [X] = miles, and v(Y) = 5280 and [Y] = feet. Note that this does require the units to be absolute to eliminate an "intercept" from the transformation.

More particularly, every physical quantity X has units expressed as a product of integer powers of r fundamental units:

$[X] = F_1^{a_1} F_2^{a_2} F_3^{a_3} \cdots F_r^{a_r}$   (1)

where the F's denote entities like "feet", "seconds", et cetera. Much of physical modeling can be done with a system of r = 4 fundamental physical units defining length, mass, time, and electric charge. For example, denoting generic units of these by L, M, T, and Q, respectively:

[velocity] = $LT^{-1}$
[acceleration] = $LT^{-2}$
[force] = $MLT^{-2}$
[mass density] = $ML^{-3}$
[pressure] = $ML^{-1}T^{-2}$
[energy] = $ML^{2}T^{-2}$
[electric current] = $QT^{-1}$
[electric field] = $MLT^{-2}Q^{-1}$

It should be noted, however, that there is no "universally accepted" system of dimensions/units. For example, the SI system is based on r = 7 fundamental units, while some physicists argue that systems of 3 or fewer fundamental dimensions should be adequate.

Physical Equations

Physical scientists like to express the deterministic relationship among a collection of physical quantities through a physical equation of general form:

$f(X_1, X_2, X_3, \dots, X_n) = 0.$   (2)

This equation can be regarded (in the usual way) as a relationship among variables for which the numerical values must be related:

$f(v(X_1), v(X_2), v(X_3), \dots, v(X_n)) = 0.$   (3)

But it can also be regarded as a statement about the physical units involved:

$f([X_1], [X_2], [X_3], \dots, [X_n]) = [f]$   (4)

where the right side identifies the physical units attached to the quantity for which the value must be zero. For example, if X1 is the length and X2 the width of a rectangular region, each expressed in units of feet, and X3 is the area of that rectangle in units of square feet, f(X1, X2, X3) can be evaluated numerically or with units as:

$X_1 \times X_2 - X_3 = 0$, with units $\text{feet} \times \text{feet} - \text{feet}^2 = \text{feet}^2$.   (5)

The Dimensional Homogeneity Principle (apparently discussed first in some form by Fourier) actually says two things:

1. The terms of a physical equation must have units that are monomials of fundamental physical units, e.g.

$[\text{any term}] = F_1^{b_1} F_2^{b_2} F_3^{b_3} \cdots F_r^{b_r}$   (6)

with one factor for each fundamental unit. If $b_1 = b_2 = b_3 = \dots = b_r = 0$, the term is unitless, and for such terms we write [−] = 1.

2. All terms in a physical equation must have the same units.

The second of these statements is perhaps the more immediately intuitive; it says you cannot add or subtract quantities that express fundamentally different physical entities, e.g. no addition of length to area, or area to mass. The first is a more fundamental statement about what is required of a physical quantity; it essentially implies that every term in a physical equation must be comprised of products and quotients of integer powers of physical quantities, perhaps multiplied by a constant.
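Before combining these two statements, the bookkeeping can be made concrete with a small R sketch (R being the language used in the Appendix). This is only an illustration under the four-unit system above, and the function name unit and the quantities defined below are choices made for this sketch rather than part of any package: each quantity's units are stored as a vector of integer exponents on L, M, T, and Q, so that multiplying quantities adds exponent vectors, and a term is unitless exactly when all exponents are zero.

unit <- function(L = 0, M = 0, T = 0, Q = 0) c(L = L, M = M, T = T, Q = Q)

length_u   <- unit(L = 1)                      # [length]   = L
time_u     <- unit(T = 1)                      # [time]     = T
mass_u     <- unit(M = 1)                      # [mass]     = M
velocity_u <- length_u - time_u                # [velocity] = L T^-1
force_u    <- mass_u + length_u - 2 * time_u   # [force]    = M L T^-2
energy_u   <- force_u + length_u               # [energy]   = M L^2 T^-2

# Dimensional homogeneity: terms that may be added must share one exponent
# vector; the units of (1/2) m v^2 and of m g h both reduce to energy:
identical(mass_u + 2 * velocity_u, energy_u)                       # TRUE
identical(mass_u + (length_u - 2 * time_u) + length_u, energy_u)   # TRUE

The columns of the matrix A introduced with Buckingham's theorem below are exactly these exponent vectors, one column per physical quantity.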
Taken together, they imply that every physical equation can be written in terms that are unitless, because statement (1) requires that

$f(X_1, X_2, X_3, \dots, X_n) = \sum_{i=1}^{k} \alpha_i X_1^{c_{i,1}} X_2^{c_{i,2}} X_3^{c_{i,3}} \cdots X_n^{c_{i,n}} = 0$   (7)

for some number of terms k, and statement (2) implies that the equation can be made unitless by dividing each term by any one of them, e.g.

$f'(X_1, X_2, X_3, \dots, X_n) = 1 + \sum_{i=2}^{k} \frac{\alpha_i}{\alpha_1} X_1^{c_{i,1}-c_{1,1}} X_2^{c_{i,2}-c_{1,2}} X_3^{c_{i,3}-c_{1,3}} \cdots X_n^{c_{i,n}-c_{1,n}} = 0.$   (8)

Physical equations with unitless terms have the nice property that they are true for any (absolute) system of physical units. For example, if length is involved in such an equation, expressions of length in each term must be raised to the same power in both the numerator and denominator, so that the units of length "cancel out." This being the case, the equation can be written or evaluated with either "feet" or "yards" used as the fundamental unit of length, because the power of 3 used for conversion from one to the other also "cancels out" of each term. Given that one writes unitless equations, the units of individual quantities can then be expressed in generic form (e.g. L, M, T, and Q, et cetera).

Buckingham's Pi Theorem

Consider a physical equation expressed in absolute units:

$f(X_1, X_2, X_3, \dots, X_n) = 0,$   (9)

where

$[X_j] = F_1^{a_{1,j}} F_2^{a_{2,j}} F_3^{a_{3,j}} \cdots F_r^{a_{r,j}}.$   (10)

Now define an r-by-n matrix A with integer elements $\{A\}_{i,j} = a_{i,j}$. We assume for the moment that n > r. Let the rank of A be denoted $r_0$, and note that $r_0 \le r$. Suppose that we have assigned subscripts to the X's so that the first $r_0$ columns of A are linearly independent. As a result, columns $r_0 + 1$ through n of A can each be written as a linear combination of columns 1 through $r_0$. It follows that f can be rewritten as

$F(X_1, X_2, X_3, \dots, X_{r_0}, \pi_1, \pi_2, \pi_3, \dots, \pi_{n-r_0}) = 0,$   (11)

where each π is a unitless product/quotient of integer powers of the X's, i.e. [π] = 1. For example, if $r_0 = 2$, X1 has units of F1, X2 has units of F2, and X3 has units of $F_1 F_2^{-2}$, then π1 can be defined as $X_1^{-1} X_2^{2} X_3$, and F obtained by substituting $\pi_1 X_1 X_2^{-2}$ for X3 wherever it appears in f.

Buckingham's "Pi Theorem" (Buckingham, 1914) states that, in fact, F can be written as an equation in only π1 through $\pi_{n-r_0}$, i.e. that X1 through $X_{r_0}$ can be eliminated. To see why this is true, consider rewriting F for every specific set of values of X1 through Xn in the following way. For any given situation, invent a new system of absolute units for which $v(X_1) = v(X_2) = v(X_3) = \dots = v(X_{r_0}) = 1$ (or any other constant). In each case, this means X1 through $X_{r_0}$ are irrelevant, but π1 through $\pi_{n-r_0}$ are unaffected. As a result, the physical function can be rewritten as:

$F(\pi_1, \pi_2, \pi_3, \dots, \pi_{n-r_0}) = 0.$   (12)

Physical scientists have long used this result as a "dimension reduction" technique, to reduce the number of variables that must be simultaneously analyzed in an experiment to determine the underlying physical law. A popular example is the relationship between physical quantities involved in describing the motion of a simple pendulum:

period of the pendulum, t, [t] = T
length of the pendulum, l, [l] = L
mass of the pendulum, m, [m] = M
acceleration due to gravity, g, [g] = $LT^{-2}$
(horizontal) amplitude, a, [a] = L

where in this simple modeling exercise, the mass is assumed to be concentrated at the bottom of the pendulum.
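As a quick numerical illustration of the earlier claim that equations built from unitless terms hold in any absolute system of units, the short R sketch below evaluates the combination t^2 g / l (which reappears as pi_1 below) once in metric and once in imperial units. The numerical values are arbitrary choices made for this illustration and do not rely on the pendulum law itself.

# Arbitrary illustrative values; only the unit systems differ.
t_sec <- 1.5                                      # period in seconds (shared)
l_m  <- 0.6;  g_m  <- 9.80665                     # length in m, g in m s^-2
ft_per_m <- 3.28084                               # feet per meter
l_ft <- l_m * ft_per_m;  g_ft <- g_m * ft_per_m   # same quantities in feet
t_sec^2 * g_m  / l_m                              # 36.77...
t_sec^2 * g_ft / l_ft                             # same value: the feet cancel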
Writing the matrix A with rows corresponding to T, L, and M (Q isn't involved), and columns corresponding to t, l, m, g, and a, we have:

$A = \begin{pmatrix} 1 & 0 & 0 & -2 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix}$   (13)

Using Buckingham's result, we begin by stating that the relation governing these physical quantities can be written as $F(t, l, m, \pi_1, \pi_2) = 0$ and in this case can quickly determine that:

$\pi_1 = t^2 g / l, \qquad \pi_2 = a / l.$

Note that m, the mass of the pendulum, is not involved, and therefore not needed in the final form of the equation:

$F(t^2 l^{-1} g, \; a l^{-1}) = 0,$

that is, the physical equation reduces to a statement that these unitless arguments must be a zero of some unknown functional form. If a pendulum of a given length can have only one period, there can be only one zero of F, and experimentation can be used to establish that unless $a l^{-1}$ is large its effect is negligible, and that $t^2 l^{-1} g$ must be $4\pi^2$ for any simple pendulum. The most important thing to realize is that this experiment would be performed in two independent variables, rather than the five variables initially noted. (In any case, g would be very hard to manage experimentally, but π1 can be controlled through changes in t and/or l.) This requires much less effort than investigating the general relationship that might exist among the four original variables that remain once m is set aside.

Statistical Modeling

Think now about a "standard" set-up for a regression experiment. We have a response variable y, thought of as a "noisy" measurement of a "true" value Y, and predictors $X_1, X_2, \dots, X_p$. Thinking of the noiseless law relating Y to the X's, we are in search of how these variables must be related so as to satisfy

$f(Y, X_1, X_2, \dots, X_p) = 0$   (14)

for an unknown function f. It is convenient here to label independent variables so that the units of X1 through $X_{r_0}$ constitute a "basis" for all p + 1 physical quantities, i.e. that the units of all variables can be expressed as products/quotients of integer powers of these. ($r_0$ again represents the rank of the matrix A.) Then Buckingham's theorem implies that the physical law can be written as

$F(X_1, \dots, X_{r_0}, \pi_1, \dots, \pi_{p+1-r_0}) = 0.$   (15)

But we can exert a "choice" at this point that helps us toward our regression model. Specifically, let π1 be a function only of X1 through $X_{r_0}$ and Y, π2 be a function only of X1 through $X_{r_0}$ and $X_{r_0+1}$, et cetera. That is, arrange variables so that each of Y, $X_{r_0+1}$, $X_{r_0+2}$, ..., $X_p$ is involved in exactly one dimensionless π. Buckingham's theorem then allows us to write an equivalent expression:

$F(\pi_1, \dots, \pi_{p+1-r_0}) = 0.$   (16)

If we insist that only one value of Y (the noiseless response) can follow from any specific values of X1 through Xp, this implies that there can be only one value of π1 that satisfies the equation once π2 through $\pi_{p+1-r_0}$ are specified, i.e. that the relationship can be inverted to the form:

$\pi_1 = G(\pi_2, \dots, \pi_{p+1-r_0})$   (17)

Finally, let a be the integer power to which Y is raised in π1, and let ρ1 denote the collection of all factors in π1 except for $Y^a$, i.e. $\pi_1 = \rho_1 Y^a$. Then:

$Y = [\rho_1^{-1} G(\pi_2, \dots, \pi_{p+1-r_0})]^{1/a}$   (18)

leading to a regression model

$y = \rho_1^{-1/a} G'(\pi_2, \dots, \pi_{p+1-r_0}) + \epsilon.$   (19)

That is, we have a regression model in $r_0$ fewer predictors than would have been used in the original formulation.
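The reduction just described can be checked numerically. The following sketch, using only base R and object names invented for this illustration, builds the pendulum's dimensional matrix A of equation (13), confirms that its rank is 3 (so that 5 − 3 = 2 unitless groups exist), and verifies that the exponent vectors of π1 and π2 lie in the null space of A; the same bookkeeping produces candidate π's in the regression setting just outlined.

# Rows of A are the fundamental units T, L, M; columns give the exponents
# of those units in t, l, m, g and a, as in equation (13).
A <- rbind(T = c(1, 0, 0, -2, 0),
           L = c(0, 1, 0,  1, 1),
           M = c(0, 0, 1,  0, 0))
colnames(A) <- c("t", "l", "m", "g", "a")

qr(A)$rank                                   # r0 = 3, so n - r0 = 5 - 3 = 2

# Exponent vectors of the two unitless groups identified above:
p1 <- c(t = 2, l = -1, m = 0, g = 1, a = 0)  # pi1 = t^2 g / l
p2 <- c(t = 0, l = -1, m = 0, g = 0, a = 1)  # pi2 = a / l
A %*% p1                                     # all zeros: pi1 is unitless
A %*% p2                                     # all zeros: pi2 is unitless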
Example

Péan et al. (1998) described experiments to investigate the effects of several process variables on the encapsulation yield of nerve growth factor in poly(D,L-lactide-co-glycolide) (abbreviated PLGA) biodegradable microparticles, which may have substantial potential in the development of drug delivery systems for specific target cell populations. In one experiment, a unitless response variable reflecting encapsulation efficiency was measured over 16 experimental trials in which 10 controlled variables were varied according to a regular, 2-level fractional factorial design. The two values (and corresponding units) used for each controlled variable are listed in Table 1. While the authors reported these values in commonly used laboratory units, some of them have been converted here for consistency, using seconds (s) as the basic unit of time, milliliters (ml) of volume, and milligrams (mg) of weight or mass. The experimental design (in coded controlled variables) and responses are listed in Table 2. Note that 5 of the 16 values recorded for the response are 5.0 (reported by Péan et al. as '5', but extended here to match the precision given for other data values); while no comment about this is given by the authors, the most likely explanations suggest that the standard assumptions generally made in regression analysis may not exactly hold.

The authors used these data to screen the ten controlled variables by fitting a first-order (main effects) regression model. They note that, by their analysis, controlled variables 4, 6, 1, and 2 have the most important effects; these are the terms for which the associated t-test p-values are less than 0.05. The authors also note that variables 3 and 5 may have smaller effects; these terms have associated p-values between 0.05 and 0.10. Three of the remaining 4 terms have p-values of less than 0.20, so it is difficult to make a firm conclusion regarding which variables should be ignored in further experimentation.

Using the step command of R, a stepwise regression was performed on these data, beginning with the null model, limited to the ten main effects terms and intercept of the full main effects model, and using the direction = "both" option so that terms can be iteratively added or removed so as to minimize the value of the Akaike Information Criterion (AIC); details are given in the Appendix. The resulting model contained the intercept and all 10 main effects (11 parameters, AIC = 78.42, $\sqrt{MSE}$ = 10.43, $R^2$ = 0.9557, adjusted $R^2$ = 0.8671), again suggesting that while p-values for some individual terms are greater than 0.05, it is difficult to confidently screen out any of the terms in this model. Depending on "where the line is drawn," a follow-up experiment in 4 controlled variables (if everything not significant at the 0.05 level is eliminated), or 6 controlled variables (if a p-value of 0.10 were used as a cut-off), or even more (to include "suggestive" variables) might be necessary.

As an alternative to this analysis, the foregoing discussion suggests that an analysis based on 7 unitless controlled variables (10 original variables, minus the 3 fundamental units of time, length, and mass) might be considered. Table 3 displays seven independent, unitless ratios of the original 10 controlled variables; X4 was the only original variable defined on a unitless scale and it is retained as π1. Note that the selection of these unitless variables is not unique (and is in fact quite arbitrary in the case of this demonstration); for example, any product or ratio of these quantities is also a unitless function of the X's.
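As a check on Table 3, the sketch below records the units of the ten controlled variables as exponent vectors over time (s), volume (ml, treated here as a basic unit, as in the conversions above), and mass (mg). It confirms that the dimensional matrix has rank 3, so that 10 − 3 = 7 independent unitless ratios exist, and that each π listed in Table 3 is indeed unitless. The matrix and object names are constructed for this illustration.

# Columns are X1, ..., X10; rows are the exponents of time, volume and mass
# in each variable's units (X4 is already unitless).
A <- rbind(
  time   = c(0,  0, 0, 0, 1,  0, 0, 1, 0, 1),
  volume = c(0, -1, 1, 0, 0, -1, 1, 0, 1, 0),
  mass   = c(1,  1, 0, 0, 0,  1, 0, 0, 0, 0))
colnames(A) <- paste0("X", 1:10)

qr(A)$rank                     # 3, so 10 - 3 = 7 independent unitless groups

# Exponent vectors for pi2 through pi7 of Table 3 (pi1 = X4 needs no check):
P <- cbind(pi2 = c(1, -1, -1, 0, 0, 0,  0,  0,  0,  0),   # X1 / (X2 * X3)
           pi3 = c(0, -1,  0, 0, 0, 1,  0,  0,  0,  0),   # X6 / X2
           pi4 = c(0,  0,  1, 0, 0, 0, -1,  0,  0,  0),   # X3 / X7
           pi5 = c(0,  0,  0, 0, 1, 0,  0, -1,  0,  0),   # X5 / X8
           pi6 = c(0,  0,  0, 0, 0, 0,  1,  0, -1,  0),   # X7 / X9
           pi7 = c(0,  0,  0, 0, 0, 0,  0,  1,  0, -1))   # X8 / X10
A %*% P                        # a matrix of zeros: every pi is unitless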
With this in mind, we again used stepwise regression (in the same manner described above; see the Appendix) to model the response variable as a linear regression of π1 through π7, beginning the fit with a null model, but allowing the algorithm to consider any term defined as the product of any number of π's (e.g. interactions through order 7). While this is not all possible unitless functions of the X's (again, because ratios of π's are also unitless), it is a very large collection of terms relative to the size of the data set. In this case, the algorithm stopped when 6 unitless terms had been added. (Despite the fact that the direction = "both" option was used, the algorithm did not delete terms at any iteration; once admitted to the model, no terms were subsequently removed.) Even though this model contains fewer terms than the full first-order polynomial in the original variables, the quality of fit to the 16 data values is comparable (7 parameters, AIC = 77.94, $\sqrt{MSE}$ = 9.834, $R^2$ = 0.9291, adjusted $R^2$ = 0.8819). Furthermore, this model is expressed as a function of only 4 unitless variables: main effects for π1, π3, π4, and π5, and "two-factor interactions" (products representing other unitless variables) for π1 × π4 and π3 × π5. Because the fit of the regression to these 4 unitless variables is essentially as good as that for the full main-effects model in the 10 original X's, it might be argued that a reasonable follow-up experiment could be conducted in 4 dimensions. Note that this involves varying 6 of the original variables (2, 3, 4, 5, 6, and 8, including all but one, X1, of the variables that appeared to be most effective as predictors in the first analysis), but only in combinations chosen to construct a good experimental design in π1, π3, π4, and π5.

An additional interesting point about the regression model just described, based on unitless variables, is that the p-value of the intercept, computed in the context of the 7-parameter model, is 0.19, suggesting that this model term may be unnecessary. Exploring this further, the stepwise regression exercise was repeated for the 7 unitless π's and all their interactions, but beginning with a model containing no terms at all and treating the intercept as one of the terms that could be added. In this case, the algorithm continued to add terms until the model was saturated (16 terms), which is perhaps not surprising since there are so many candidate terms available. But the 7-parameter model identified in this run fit the data values even better (7 parameters, AIC = 65.89, $\sqrt{MSE}$ = 6.749), and included terms for π1, π3, π4, π5, π7, π1 × π4, and π5 × π7.
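For readers who want to examine these fits directly, the following sketch refits the full main-effects model and the two 7-parameter π models described above, assuming U1 through U10, Y1, and Pi1 through Pi7 have been defined as in the Appendix. Note that AIC() and the extractAIC() values printed by step() differ by an additive constant, so only differences between models are comparable, and that R-squared is computed against a different baseline when the intercept is omitted.

fit_main <- lm(Y1 ~ U1 + U2 + U3 + U4 + U5 + U6 + U7 + U8 + U9 + U10)
fit_pi   <- lm(Y1 ~ Pi1 + Pi3 + Pi4 + Pi5 + Pi1:Pi4 + Pi3:Pi5)
fit_pi0  <- lm(Y1 ~ 0 + Pi1 + Pi3 + Pi4 + Pi5 + Pi7 + Pi1:Pi4 + Pi5:Pi7)

AIC(fit_main, fit_pi, fit_pi0)   # compare the three fits on a common scale
summary(fit_pi)                  # 7-parameter pi model with intercept
summary(fit_pi0)                 # 7-parameter pi model without intercept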
Conclusion

The initial response of some to the basic idea of DA is that this seems too much like "magic", because it appears to offer dimension reduction without problem-specific information or data. In fact, as with a priori simplification of a statistical model, DA is explicitly based on assumptions, in this case the fundamental assumptions of dimensional homogeneity. These principles are firmly established in physical theory and experience, but it should be remembered that they provide the starting point for Buckingham's theorem.

There is one other point that should be explicitly made. The "axioms" that lead to DA are themselves tacitly built on the idea that the relationship to be modeled is complete, i.e. that all relevant variables are included in the development. If this is not true, it is hard to understand how a "physical law" can be contemplated. This, in turn, suggests that DA is primarily relevant in what might be called "closed systems", where the influence of uncontrolled factors is eliminated or at least minimized. In principle, DA is applicable in any application for which the dimensional homogeneity principle holds, and which is "closed" in the above sense. These conditions (especially the second) are probably most often applicable when studies are done under tight experimental control (i.e. laboratory work).

References

Albrecht, M.C., C.J. Nachtsheim, T.A. Albrecht, and R.D. Cook (2013). "Experimental Design for Engineering Dimensional Analysis," with discussion, Technometrics 55, 257-295.

Buckingham, E. (1914). "On Physically Similar Systems: Illustrations of the Use of Dimensional Equations," Phys. Rev. 4, 345-376.

Palmer, A.C. (2007). Dimensional Analysis and Intelligent Experimentation, World Scientific, New Jersey.

Péan, J.M., M.C. Venier-Julienne, R. Filmon, M. Sergent, R. Phan-Tan-Luu, and J.P. Benoit (1998). "Optimization of HSA and NGF Encapsulation Yields in PLGA Microparticles," International Journal of Pharmaceutics 166, 105-115.

Table 1: Controlled Variables in the Experiment of Péan et al.

      Controlled Variable                                  Units         Lower Value   Upper Value
X1    Volume of internal aqueous phase                     mg            0.2           0.6
X2    Concentration of HSA in internal aqueous phase       mg/ml         12.5          25.0
X3    Quantity of PLGA in organic phase                    ml            50            100
X4    Concentration of CMC Na in internal aqueous phase    %, unitless   0             1
X5    Ultrasonic time                                      s             5             15
X6    Mannitol in internal aqueous phase                   mg/ml         0             10
X7    Volume of external aqueous phase                     ml            30            70
X8    Emulsification time                                  s             60            300
X9    Volume of extracting aqueous phase                   ml            150           400
X10   Time of extraction                                   s             120           600

Table 2: Experimental Design and Unitless Response in the Experiment of Péan et al.

Controlled Variables (coded)    Response
− − − − − + − + − −             48.6
+ − − + − − + + − +              5.0
− + − + − + − − + +              5.0
+ + − − − − + − + −             36.7
− − + + − + + + + −              9.6
+ − + − − − − + + +             61.1
− + + − − + + − − +             32.8
+ + + + − − − − − −              5.0
− − − − + − + − + +             85.4
+ − − + + + − − + −              5.0
− + − + + − + + − −              9.7
+ + − − + + − + − +             16.5
− − + + + − − − − +             77.0
+ − + − + + + − − −             39.2
− + + − + − − + + −             68.4
+ + + + + + + + + +              5.0

Table 3: Unitless Re-expression of Controlled Variables for the Experiment of Péan et al.

π1 = X4
π2 = X1/(X2 × X3)
π3 = X6/X2
π4 = X3/X7
π5 = X5/X8
π6 = X7/X9
π7 = X8/X10

Appendix

# Data from Pean et al:
D1 <- matrix(c(
200, 1.25,  50, 0,  5, 1, 30, 5, 150,  2, 48.6, 22,
600, 1.25,  50, 1,  5, 0, 70, 5, 150, 10,  5  , 31,
200, 2.50,  50, 1,  5, 1, 30, 1, 400, 10,  5  , 24,
600, 2.50,  50, 0,  5, 0, 70, 1, 400,  2, 36.7, 38,
200, 1.25, 100, 1,  5, 1, 70, 5, 400,  2,  9.6, 38,
600, 1.25, 100, 0,  5, 0, 30, 5, 400, 10, 61.1, 26,
200, 2.50, 100, 0,  5, 1, 70, 1, 150, 10, 32.8, 34,
600, 2.50, 100, 1,  5, 0, 30, 1, 150,  2,  5  , 28,
200, 1.25,  50, 0, 15, 0, 70, 1, 400, 10, 85.4, 29,
600, 1.25,  50, 1, 15, 1, 30, 1, 400,  2,  5  , 19,
200, 2.50,  50, 1, 15, 0, 70, 5, 150,  2,  9.7, 27,
600, 2.50,  50, 0, 15, 1, 30, 5, 150, 10, 16.5, 21,
200, 1.25, 100, 1, 15, 0, 30, 1, 150, 10, 77.0, 24,
600, 1.25, 100, 0, 15, 1, 70, 1, 150,  2, 39.2, 51,
200, 2.50, 100, 0, 15, 0, 30, 5, 400,  2, 68.4, 25,
600, 2.50, 100, 1, 15, 1, 70, 5, 400, 10,  5  , 43
), ncol=12, byrow=T)

# Changed to homogeneous units of time (s), mass (mg), and volume (ml)
U1 <- D1[,1]/(10^3)    # recorded in micro (mu) grams, converted to mg
U2 <- D1[,2]*10        # recorded in w/v (grams per 100 ml), converted to mg/ml
U3 <- D1[,3]           # recorded in ml
U4 <- D1[,4]           # unitless
U5 <- D1[,5]           # recorded in seconds
U6 <- D1[,6]*10        # recorded in w/v (grams per 100 ml), converted to mg/ml
U7 <- D1[,7]           # recorded in ml
U8 <- D1[,8]*60        # recorded in minutes, converted to seconds
U9 <- D1[,9]           # recorded in ml
U10 <- D1[,10]*60      # recorded in minutes, converted to seconds
Y1 <- D1[,11]          # unitless
Y2 <- D1[,12]          # unitless

# Stepwise modeling using original (scaled) variables:
fit1lower <- lm(Y1~1)
fit1upper <- lm(Y1~U1+U2+U3+U4+U5+U6+U7+U8+U9+U10)
summary(fit1upper)
step(fit1lower, scope=list(lower=fit1lower, upper=fit1upper),
     direction="forward", trace=1)

# Transformation to unitless Pi's
Pi1 <- U4
Pi2 <- U1/(U2*U3)
Pi3 <- U6/U2
Pi4 <- U3/U7
Pi5 <- U5/U8
Pi6 <- U7/U9
Pi7 <- U8/U10

# Stepwise modeling using unitless variables:
fit2lower <- lm(Y1~1)
fit2upper <- lm(Y1~Pi1*Pi2*Pi3*Pi4*Pi5*Pi6*Pi7)
summary(fit2upper)
step(fit2lower, scope=list(lower=fit2lower, upper=fit2upper),
     direction="both", trace=1)