P1: SFK/RPW P2: SFK/RPW BLUK154-Kottegoda QC: SFK/RPW April 15, 2008 T1: SFK 7:11
Copyright © 2008. Wiley. All rights reserved.

Chapter 1 Preliminary Data Analysis

All natural processes, as well as those devised by humans, are subject to variability. Civil engineers are aware, for example, that crushing strengths of concrete, soil pressures, strengths of welds, traffic flow, floods, and pollution loads in streams have wide variations. These may arise on account of natural changes in properties, differences in interactions between the ingredients of a material, environmental factors, or other causes.

To cope with uncertainty, the engineer must first obtain and investigate a sample of data, such as a set of flow data or triaxial test results. The sample is used in applying statistics and probability at the descriptive stage. For inferential purposes, however, one needs to make decisions regarding the population from which the sample is drawn. By this we mean the total or aggregate, which, for most physical processes, is the virtually unlimited universe of all possible measurements. The main interest of the statistician is in the aggregation; the individual items provide the hints, clues, and evidence.

A data set comprises a number of measurements of a phenomenon such as the failure load of a structural component. The quantities measured are termed variables, each of which may take any one of a specified set of values. Because of its inherent randomness and hence unpredictability, a phenomenon that an engineer or scientist usually encounters is referred to as a random variable, a name given to any quantity whose value depends on chance.¹ Random variables are usually denoted by capital letters. These are classified by the form that their values can possibly take (or are assumed to take). The pattern of variability is called a distribution.
A continuous variable can have any value on a continuous scale between two limits, such as the volume of water flowing in a river per second or the amount of daily rainfall measured in some city. A discrete variable, by contrast, can assume only isolated, countable values such as integers, for example the number of vehicles turning left at an intersection, or other distinct values.

¹ The term will be formally defined in Section 3.1.

Kottegoda, NT, & Rosso, R 2008, Applied Statistics for Civil and Environmental Engineers, Wiley, Hoboken. Available from: ProQuest Ebook Central. [9 September 2022]. Created from upcatalunya-ebooks on 2022-09-09 20:47:15.

Having obtained a sample of data, the first step is its presentation. Consider, for example, the modulus of rupture data for a certain type of timber shown in Table E.1.1, in Appendix E. The initial problem facing the civil engineer is that such an array of data by itself does not give a clear idea of the underlying characteristics of the stress values in this natural type of construction material. To extract the salient features and the particular types of information one needs, one must summarize the data and present them in some readily comprehensible forms. There are several methods of presentation, organization, and reduction of data. Graphical methods constitute the first approach.

1.1 GRAPHICAL REPRESENTATION

If “a picture is worth a thousand words,” then graphical techniques provide an excellent method to visualize the variability and other properties of a set of data. To the powerful interactive system of one’s brain and eyes, graphical displays provide insight into the form and shape of the data and lead to a preliminary concept of the generating process. We proceed by assembling the data into graphs, scanning the details, and noting the important characteristics.
There are numerous types of graphs. Line and dot diagrams, histograms, relative frequency polygons, and cumulative frequency curves are given in this section. Subsequently, exploratory methods, such as stem-and-leaf plots and box diagrams and graphs depicting a possible association between two variables, are presented in Sections 1.3 and 1.4. We begin with the simple task of counting.

1.1.1 Line diagram or bar chart

The occurrences of a discrete variable can be classified on a line diagram or bar chart. In this type of graph, the horizontal axis gives the values of the discrete variable and the occurrences are represented by the heights of vertical lines. The horizontal spread of these lines and their relative heights indicate the variability and other characteristics of the data.

Example 1.1. Flood occurrences. Consider the annual number of floods of the Magra River at Calamazza, situated between Pisa and Genoa in northwestern Italy, over a 34-year period, as shown in Table 1.1.1. A flood in the river at the point of measurement means the river has risen above a specified level, beyond which the river poses a threat to lives and property. The data are plotted in Fig. 1.1.1 as a line diagram. The data suggest a symmetrical distribution with a midlocation of four floods per year. In some other river basins, there is a nonlinear decrease in the occurrences for increasing numbers of floods in a year commencing at zero, showing a negative exponential type of variation.

Table 1.1.1 Number of flood occurrences per year from 1939 to 1972 at the gauging station of Calamazza on the Magra River, between Pisa and Genoa in northwestern Italyᵃ

Number of floods in a year    Number of occurrences
0                             0
1                             2
2                             6
3                             7
4                             9
5                             4
6                             1
7                             4
8                             1
9                             0
Total                         34

ᵃ A flood occurrence is defined as river discharge exceeding 300 m³/s.

Fig. 1.1.1 Line diagram for flood occurrences in the Magra River at Calamazza between Genoa and Pisa in northwestern Italy. (Horizontal axis: number of floods; vertical axis: number of occurrences.)

1.1.2 Dot diagram

A different type of graph is required to present continuous data. If the data are few (say, fewer than 25 items) a dot diagram is a useful visual aid. Consider the possibility that only the first 15 items of data in Table E.1.1, which shows the modulus of rupture in N/mm² for 50 mm × 150 mm Swedish redwood and whitewood, are available. The abridged data are ranked in ascending order and are given in Table 1.1.2 and plotted in Fig. 1.1.2. The reader can see that the midlocation is close to 40 N/mm² but the wide spread makes this location difficult to discern. A larger sample should certainly be helpful.

1.1.3 Histogram

If there are at least, say, 25 observations, one of the most common graphical forms is a block diagram called the histogram. For this purpose, the data are divided into groups according to their magnitudes. The horizontal axis of the graph gives the magnitudes. Blocks are drawn to represent the groups, each of which has a distinct upper and lower limit. The area of a block is proportional to the number of occurrences in the group. The variability of the data is shown by the horizontal spread of the blocks, and the most common values are found in blocks with the largest areas.
Other features such as the symmetry of the data or lack of it are also shown. The first step is to take into account the range $r$ of the observations, that is, the difference between the largest and smallest values.

Example 1.2. Timber strength. We go back to the timber strength data given in Table E.1.1. They are arranged in order of magnitude in Table 1.1.3. There are $n = 165$ observations with somewhat high variability, as expected, because timber is a naturally variable material. Here the range $r$ = 70.22 − 0.00 = 70.22 N/mm². To draw a histogram, one divides the range into a number of classes or cells, $n_c$. The number of occurrences in each class is counted and tabulated. These are called frequencies.

Table 1.1.2 The first 15 items of modulus of rupture data measuring timber strengths in N/mm², from Table E.1.1 (commencing with the top row), ranked in increasing order

29.11  29.93  32.02  32.40  33.06  34.12  35.58  39.34
40.53  41.64  45.54  48.37  48.78  50.98  65.35

Fig. 1.1.2 Dot diagram for a short sample of timber strengths from Table 1.1.2. (Horizontal axis: modulus of rupture, N/mm², from 25 to 70.)

The width of the classes is usually made equal to facilitate interpretation. For some work such as the fitting of a theoretical function to observed frequencies, however, unequal class widths are used. Care should be exercised in the choice of the number of classes, $n_c$. Too few will cause an omission of some important features of the data; too many will not give a clear overall picture because there may be high fluctuations in the frequencies.
A rule of thumb is to make $n_c = \sqrt{n}$ or an integer close to this, but it should be at least 5 and not greater than 25. Thus, histograms based on fewer than 25 items may not be meaningful. Sturges (1926) suggested the approximation

$$n_c = 1 + 3.3 \log_{10} n. \tag{1.1.1}$$

A more theoretically based alternative follows the work of Freedman and Diaconis (1981):²

$$n_c = \frac{r\, n^{1/3}}{2\,\mathrm{iqr}}. \tag{1.1.2}$$

Here iqr is the interquartile range. To clarify this term, we must define $Q_2$, or the median. This denotes the middle term of a set of data when the values are arranged in ascending order, or the average of the two middle terms if $n$ is an even number. The first or lower quartile, $Q_1$, is the median of the lower half of the data, and likewise the third or upper quartile, $Q_3$, is the median of the upper half of the data. This definition will be used throughout.³ Thus,

$$\mathrm{iqr} = Q_3 - Q_1. \tag{1.1.3}$$

² See also Scott (1979).

Table 1.1.3 Ranked modulus of rupture data for timber strengths in N/mm², in ascending order (read row by row)ᵃ

 0.00  17.98  22.67  22.74  22.75  23.14  23.16  23.19  24.09  24.25
24.84  25.39  25.98  26.63  27.31  27.90  27.93  28.00  28.13  28.46
28.69  28.71  28.76  28.83  28.97  28.98  29.11  29.90  29.93  30.02
30.05  30.33  30.53  31.33  31.60  32.02  32.03  32.40  32.48  32.68
32.76  33.06  33.14  33.18  33.19  33.47  33.61  33.71  33.92  34.12
34.40  34.44  34.49  34.56  34.63  35.03  35.17  35.30  35.43  35.58
35.67  35.88  35.89  36.00  36.38  36.47  36.53  36.81  36.84  36.85
36.88  36.92  37.51  37.65  37.69  37.78  38.00  38.05  38.16  38.64
38.71  38.81  39.05  39.15  39.20  39.21  39.33  39.34  39.60  39.62
39.77  39.93  39.97  40.20  40.27  40.39  40.53  40.71  40.85  40.85
41.64  41.72  41.75  41.78  41.85  42.31  42.47  43.07  43.12  43.26
43.33  43.33  43.41  43.48  43.48  43.64  43.99  44.00  44.07  44.30
44.36  44.36  44.51  44.54  44.59  44.78  44.78  45.19  45.54  45.92
45.97  46.01  46.33  46.50  46.86  46.99  47.25  47.42  47.61  47.74
47.83  48.37  48.39  48.78  49.57  49.59  49.65  50.91  50.98  51.39
51.90  53.00  53.63  53.99  54.04  54.71  55.23  56.60  56.80  57.99
58.34  65.35  65.61  69.07  70.22

ᵃ The original data set is given in Table E.1.1; n = 165. The median, 39.05 N/mm², is underlined in the original table.

Table 1.1.4 Frequency computations for the modulus of rupture data ranked in Table 1.1.3ᵃ

Class upper     Class center   Absolute    Relative    Cumulative relative
limit (N/mm²)   (N/mm²)        frequency   frequency   frequency (%)
 5               2.5            1          0.006         0.61
10               7.5            0          0.000         0.61
15              12.5            0          0.000         0.61
20              17.5            1          0.006         1.21
25              22.5            9          0.055         6.67
30              27.5           18          0.109        17.58
35              32.5           26          0.158        33.33
40              37.5           38          0.230        56.36
45              42.5           34          0.206        76.97
50              47.5           20          0.121        89.09
55              52.5            9          0.055        94.55
60              57.5            5          0.030        97.58
65              62.5            0          0.000        97.58
70              67.5            3          0.018        99.39
75              72.5            1          0.006       100.00

ᵃ The width of each class is 5 N/mm² in this example.

Example 1.3. Timber strength. For the timber strength data of Table E.1.1, the median, that is, $Q_2$, is 39.05 N/mm². Also $Q_3$ and $Q_1$ are 44.57 and 32.91 N/mm², respectively, and hence iqr = 11.66 N/mm². From the simple square-root rule, the number of classes is $n_c$ = 12.84. However, by using Eqs. (1.1.1) and (1.1.2), the numbers of classes are 8.32 and 16.52, respectively. If these are rounded to 9 and 15 and the range is extended to 72 and 75 N/mm² for graphical purposes, the equal class widths become 8 and 5 N/mm², respectively. Let us use these widths. It is important to specify the class boundaries without ambiguity for the counting of frequencies; for example, in the first case, these should be from 0 to 7.99, 8.00 to 15.99, and so on.
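The three rules for the number of classes can be checked with a few lines of Python. This is a minimal sketch, not part of the original text: the function names are illustrative, and the inputs are the values of $n$, $r$, and iqr quoted in Examples 1.2 and 1.3.

```python
import math


def classes_square_root(n):
    """Square-root rule of thumb: n_c = sqrt(n)."""
    return math.sqrt(n)


def classes_sturges(n):
    """Eq. (1.1.1): n_c = 1 + 3.3 log10(n)."""
    return 1 + 3.3 * math.log10(n)


def classes_freedman_diaconis(n, r, iqr):
    """Eq. (1.1.2): n_c = r * n**(1/3) / (2 * iqr)."""
    return r * n ** (1 / 3) / (2 * iqr)


# Values quoted in Examples 1.2 and 1.3 for the timber strength data
n = 165       # number of observations
r = 70.22     # range, N/mm^2
iqr = 11.66   # interquartile range, N/mm^2

print(f"square-root rule:  {classes_square_root(n):.2f}")
print(f"Sturges:           {classes_sturges(n):.2f}")
print(f"Freedman-Diaconis: {classes_freedman_diaconis(n, r, iqr):.2f}")
```

The results are not integers, so in practice they are rounded and the range is extended to a convenient multiple of the class width, as done in Example 1.3.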
As already mentioned, the vertical axis of a histogram is made to represent the frequency and the horizontal axis is used as a measurement scale on which the class boundaries are marked. For each of these class widths, 8 and 5 N/mm², class boundaries are made and counting of frequencies is completed using Table 1.1.3; the lowest boundary is at 0 and the highest boundaries are at 72 and 75 N/mm², respectively. Table 1.1.4 gives the absolute and relative frequencies for class widths of 5 N/mm². Rectangles are then erected over each of the classes, proportional in area to the class frequencies. When equal class widths are used, as shown here, the heights of the rectangles represent the frequencies. Thus, Figs. 1.1.3 and 1.1.4 are obtained.

³ There are alternatives, such as rounding $(n + 1)/4$ and $(n + 1) \times (3/4)$ to the nearest integers to calculate the locations of $Q_1$ and $Q_3$, respectively. The rounding is upward or downward, respectively, when the numbers fall exactly between two integers.

Fig. 1.1.3 Histogram for timber strength data with class width of 8 N/mm². (Vertical axis: relative frequency; horizontal axis: modulus of rupture, N/mm², in classes from 0 to 7.99 up to 72 to 79.99.)

The information conveyed by the two histograms seems to be similar. The diagrams are almost symmetrical with a peak in the class below 40 N/mm² and a steady decrease on either side. This type of diagram usually brings out any possible imperfections in the data, such as the gaps at the ends. Further investigations are required to understand the true nature of the population.
More on these aspects will follow in this and subsequent chapters.

1.1.4 Frequency polygon

A frequency polygon is a useful diagnostic tool to determine the distribution of a variable. It can be drawn by joining the midpoints of the tops of the rectangles of a histogram after extending the diagram by one class on both sides. We assume that equal class widths are used. If the ordinates of a histogram are divided by the total number of observations, then a relative frequency histogram is obtained. Thus, the ordinates for each class denote the probabilities bounded by 0 and 1, by which we simply mean the chances of occurrence. The corresponding polygon is called the relative frequency polygon.

Example 1.4. Timber strength. Corresponding to the histogram of Fig. 1.1.4, the values of class center are computed and a relative frequency polygon is obtained; this is shown in Fig. 1.1.5.

Fig. 1.1.4 Histogram for timber strength data with class width of 5 N/mm². (Vertical axis: relative frequency; horizontal axis: modulus of rupture, N/mm², in classes from 0 to 4.99 up to 70 to 74.99.)

Fig. 1.1.5 Relative frequency polygon for timber strength data with class width of 5 N/mm². (Vertical axis: relative frequency; horizontal axis: modulus of rupture, N/mm².)

As the number of observations becomes large, the class widths theoretically tend to decrease and, in the limiting case of an infinite sample, a relative frequency polygon becomes a frequency curve. This is in fact a probability curve, which represents a mathematical probability density function, abbreviated as pdf, of the population.⁴
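The frequency computations of Table 1.1.4, which underlie the histogram of Fig. 1.1.4 and the polygon of Fig. 1.1.5, can be reproduced as sketched below. The function name `frequency_table` and the dictionary keys are illustrative assumptions; the data are transcribed from Table 1.1.3, and classes are taken as closed on the left, so that the first class runs from 0 up to but not including 5 N/mm².

```python
# Ranked modulus of rupture data, N/mm^2, transcribed from Table 1.1.3 (n = 165)
timber = [
    0.00, 17.98, 22.67, 22.74, 22.75, 23.14, 23.16, 23.19, 24.09, 24.25,
    24.84, 25.39, 25.98, 26.63, 27.31, 27.90, 27.93, 28.00, 28.13, 28.46,
    28.69, 28.71, 28.76, 28.83, 28.97, 28.98, 29.11, 29.90, 29.93, 30.02,
    30.05, 30.33, 30.53, 31.33, 31.60, 32.02, 32.03, 32.40, 32.48, 32.68,
    32.76, 33.06, 33.14, 33.18, 33.19, 33.47, 33.61, 33.71, 33.92, 34.12,
    34.40, 34.44, 34.49, 34.56, 34.63, 35.03, 35.17, 35.30, 35.43, 35.58,
    35.67, 35.88, 35.89, 36.00, 36.38, 36.47, 36.53, 36.81, 36.84, 36.85,
    36.88, 36.92, 37.51, 37.65, 37.69, 37.78, 38.00, 38.05, 38.16, 38.64,
    38.71, 38.81, 39.05, 39.15, 39.20, 39.21, 39.33, 39.34, 39.60, 39.62,
    39.77, 39.93, 39.97, 40.20, 40.27, 40.39, 40.53, 40.71, 40.85, 40.85,
    41.64, 41.72, 41.75, 41.78, 41.85, 42.31, 42.47, 43.07, 43.12, 43.26,
    43.33, 43.33, 43.41, 43.48, 43.48, 43.64, 43.99, 44.00, 44.07, 44.30,
    44.36, 44.36, 44.51, 44.54, 44.59, 44.78, 44.78, 45.19, 45.54, 45.92,
    45.97, 46.01, 46.33, 46.50, 46.86, 46.99, 47.25, 47.42, 47.61, 47.74,
    47.83, 48.37, 48.39, 48.78, 49.57, 49.59, 49.65, 50.91, 50.98, 51.39,
    51.90, 53.00, 53.63, 53.99, 54.04, 54.71, 55.23, 56.60, 56.80, 57.99,
    58.34, 65.35, 65.61, 69.07, 70.22,
]


def frequency_table(data, width, n_classes):
    """Tabulate equal classes [k*width, (k+1)*width) for k = 0 .. n_classes-1."""
    counts = [0] * n_classes
    for x in data:
        counts[int(x // width)] += 1   # index of the class containing x
    n = len(data)
    rows, cumulative = [], 0
    for k, count in enumerate(counts):
        cumulative += count
        rows.append({
            "upper": (k + 1) * width,          # class upper limit
            "center": (k + 0.5) * width,       # abscissa of the polygon vertex
            "count": count,                    # absolute frequency
            "relative": count / n,             # ordinate of the polygon vertex
            "cumulative_pct": 100 * cumulative / n,
        })
    return rows


table = frequency_table(timber, width=5, n_classes=15)
for row in table:
    print(f"{row['upper']:5.0f} {row['center']:6.1f} {row['count']:4d} "
          f"{row['relative']:6.3f} {row['cumulative_pct']:7.2f}")
```

Plotting `relative` against `center`, with one empty class added at each end, gives the vertices of the relative frequency polygon.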
1.1.5 Cumulative relative frequency diagram

If a cumulative sum is taken of the relative frequencies step by step from the smallest class to the largest, then the line joining the ordinates (cumulative relative frequencies) at the ends of the class boundaries forms a cumulative relative frequency or probability diagram. On the vertical axis of the graph, this line gives the probabilities of nonexceedance of values shown on the horizontal axis. In practice, this plot is made by utilizing and displaying every item of data distinctly, without the necessity of proceeding via a histogram and the restrictive categories that it entails. For this purpose, one may simply determine (e.g., from the ranked data of Table 1.1.3) the number of observations less than or equal to each value and divide these numbers by the total number of observations. This procedure is adopted here.⁵

Thus, the probability diagram, as represented by the cumulative relative frequency diagram, becomes an important practical tool. This diagram yields the median and other quartiles directly. Also, one can find the 9 values that divide the total frequency into 10 equal parts called deciles and the so-called percentiles, where the $p$th percentile is the value that is greater than $p$ percent of the observations. In general, it is possible to obtain the $(n − 1)$ values that divide the total frequency into $n$ equal parts called the quantiles. Hence a cumulative frequency polygon is also called a quantile or Q-plot; unlike a cumulative frequency diagram, however, a Q-plot has the quantiles on the vertical axis.

Example 1.5. Timber strength. Figure 1.1.6 is the cumulative frequency diagram obtained from the ranked timber strength data of Table 1.1.3 using each item of data as just described. The deciles and percentiles can be read off the diagram. By convention a vertical probability or proportionality scale is used rather than one giving percentages (except in duration curves, discussed shortly). The 90th percentile, for instance, is approximately 51 N/mm², and the value 40 N/mm² has a probability of nonexceedance of approximately 0.56.

Fig. 1.1.6 Cumulative relative frequency diagram for timber strength data. (Vertical axis: cumulative relative frequency, 0.0 to 1.0; horizontal axis: modulus of rupture, N/mm², 0 to 80.)

If the sample size increases indefinitely, the cumulative relative frequency diagram will become a distribution curve in the limit. This represents the population by means of a (mathematical) distribution function, usually called a cumulative distribution function, abbreviated to cdf, just as a relative frequency polygon leads to a probability density function. As a graphical method of ascertaining the distribution of the population, the quantile plot can be drawn using a modified nonlinear scale for the probabilities, which represents one of several types of theoretical distributions.⁶ Also, as shown in Section 1.4, two distributions can be compared using a Q-Q plot.

⁴ This function is discussed in Chapter 3. One of the first tasks in applying inferential statistics, as presented in Chapters 4 and 5, will be to estimate the mathematical function from a finite sample and examine its closeness to the histogram.

⁵ Further aspects of this subject, as related to probability plots, are described in Chapter 5.
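The item-by-item procedure described in Section 1.1.5 (count the observations less than or equal to a value and divide by the sample size) can be sketched as follows. For brevity the sketch uses the 15-item sample of Table 1.1.2 rather than the full data set, so its results differ from those read off Fig. 1.1.6. The function names are illustrative, and the percentile rule shown is only one of several conventions, chosen here as an assumption.

```python
import math
from bisect import bisect_right

# The 15-item timber strength sample of Table 1.1.2, N/mm^2, in ascending order
sample = [29.11, 29.93, 32.02, 32.40, 33.06, 34.12, 35.58, 39.34,
          40.53, 41.64, 45.54, 48.37, 48.78, 50.98, 65.35]


def nonexceedance(sorted_values, x):
    """Cumulative relative frequency: fraction of observations <= x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def percentile_value(sorted_values, p):
    """Smallest observation whose cumulative relative frequency is >= p/100.

    Assumed convention for this sketch; p is an integer percentage (1..100).
    """
    n = len(sorted_values)
    k = max(math.ceil(p * n / 100), 1)   # rank of the required observation
    return sorted_values[k - 1]


print(f"P(X <= 40 N/mm^2) in the sample: {nonexceedance(sample, 40.0):.3f}")
print(f"median (50th percentile):        {percentile_value(sample, 50)}")
print(f"90th percentile:                 {percentile_value(sample, 90)}")
```

Plotting `nonexceedance` at each ranked value gives the stepped cumulative relative frequency diagram for the sample.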
⁶ This method is demonstrated in Section 5.8.

1.1.6 Duration curves

For the assessment of water resources and for associated design and planning purposes, engineers find it useful to draw duration curves. When dealing with flows in rivers, this type of graph is known as a flow duration curve. It is in effect a cumulative frequency diagram with specific time scales. The vertical axis can represent, for example, the percentage of the time a flow is exceeded; in addition, the number of days per year or season during which the flow is exceeded (or not) may be given. The volume of flow per day is given on the horizontal axis. For some purposes, the vertical and horizontal axes are interchanged as in a Q-plot. One example of a practical use is the scaled area enclosed by the curve, a horizontal line representing 100% of the time, and a vertical line drawn at the minimum flow that it is desirable to maintain in the river. This area represents the estimated supplementary volume of water that should be diverted to the river on an annual basis to meet such an objective.

Example 1.6. Streamflow duration. Figure 1.1.7 gives the flow duration curve of the Dora Riparia River in the Alpine region of northern Italy, calculated over a period of 47 years from the records at Salbertrand gauging station. This figure is drawn using the same procedure adopted for a cumulative relative frequency diagram, such as Fig. 1.1.6. For instance, suppose it is decided to divert a proportion of the discharges above 10 m³/s and below 20 m³/s from the river. Then the area bounded by the curve and the vertical lines drawn at these discharges, using the vertical scale on the left-hand side, will give the estimated maximum amount available for diversion during the year in m³ after multiplication by the number of seconds in a day. This area is hatched in Fig. 1.1.7. If such a decision were to be implemented on a long-term basis, it would be essential to use a long series of data and to estimate the distribution function.

Fig. 1.1.7 Flow duration curve of Dora Riparia River at Salbertrand in the Alpine region of Italy. (Left vertical axis: duration, days per year flow is exceeded, 0 to 365; right vertical axis: percentage duration, 0 to 100; horizontal axis: daily streamflow, m³/s, 0 to 50.)

1.1.7 Summary of Section 1.1

In this section we have introduced some of the basic graphical methods. Other procedures such as stem-and-leaf plots and scatter diagrams are presented in Sections 1.3 and 1.4, respectively. More advanced plots are introduced in Chapters 5 and 6. In the next section we discuss associated numerical methods.

1.2 NUMERICAL SUMMARIES OF DATA

Useful graphical procedures for presenting data and extracting knowledge on variability and other properties were shown in Section 1.1. There is a complementary method through which much of the information contained in a data set can be represented economically and conveyed or transmitted with greater precision. This method utilizes a set of characteristic numbers to summarize the data and highlight their main features. These numerical summaries represent several important properties of the histogram and the relative frequency polygon. The most important purpose of these descriptive measures is for statistical inference, a role that graphs cannot fulfill. Basically, there are three distinctive types: measures of central tendency, of dispersion, and of asymmetry, all of which can be visualized through the histogram as discussed in Section 1.1.
The additional measure of “peakedness,” that is, the relative height of the peak, requires a large sample for its estimation and is mainly relevant in the case of symmetric distributions.

1.2.1 Measures of central tendency

Generally data from many natural systems, as well as those devised by humans, tend to cluster around some values of variables. A particular value, known as the central value, can be taken as a representative of the sample. This feature is called central tendency because the spread seems to take place about a center. The definition of the central value is flexible, and its magnitude is obtained through one of the measures of its location. There are three such well-known measures: the mean, the mode, and the median. The choice depends on the use or application of the central value.

The sample arithmetic mean is estimated from a sample of observations, $x_1, x_2, \ldots, x_n$, as

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{1.2.1}$$

If one uses a single number to represent the data, the sample mean seems ideal for the purpose. After counting, this calculation is the next basic step in statistics. For theoretical purposes the mean is the most important numerical measure of location. As stated in Section 1.1, if the sample size increases indefinitely a curve is obtained from a frequency polygon; the mean is the centroid of the area between this curve and the horizontal axis and it is thus the balance point of the frequency curve. The population value of the mean is denoted by μ.
We reiterate our definition of population with reference to a phenomenon such as that represented by the timber strength data of Table E.1.1. A population is the aggregate of observations that might result by making an experiment in a particular manner.

The sample mean has a disadvantage because it may sometimes be affected by unexpectedly high or low values, called outliers. Such values do not seem to conform to the distribution of the rest of the data. There may be physical reasons for outliers. Their presence may be attributed to conditions that have perhaps changed from what were assumed, or to the data being generated by more than one process. On the other hand, they may arise on account of errors caused by faulty instrumentation, measurement, observation, or recording. The engineer must examine any visible outliers and ascertain whether they are erroneous or whether their inclusion is justifiable. The occurrence of any improbable value requires careful scrutiny in practice, and this should be followed by rectification or elimination if there are valid reasons for doing so.

Example 1.7. Timber strength. A case in point is the value of zero in the timber strength data of Table E.1.1. This value is retained here for comparative purposes. The mean of the 165 items, which is 39.09 N/mm², becomes 39.33 N/mm² without the value of zero.

Example 1.8. Concrete test. Table E.1.2 is a list of the densities and compressive strengths at 28 days from the results of 40 concrete cube test records conducted in Barton-on-Trent, England, during the period 8 July 1991 to 21 September 1992, and arranged in reverse chronological order. These have sample means of 2445 kg/m³ and 60.14 N/mm², respectively. The two numbers are measures of location representing the density and compressive strength of concrete.

With many discordant values at the extremes, a trimmed mean, such as a 5% trimmed mean, may be calculated.
For this purpose, the data are ranked and the mean is obtained after ignoring 5% of the observations from each of the two extremities (see Problem 1.16).

The technique of coding is sometimes used to facilitate calculations when the data are given to several significant figures but the digits are constant except for the last few. For example, the densities in Table E.1.2 are higher than 2400 kg/m³ and less than 2500 kg/m³, so that the number 2400 can be subtracted from the densities. The remainders will retain the essential characteristics of the original set (apart from the enforced shift in the mean), thus simplifying the arithmetic.

In considering the entire data set, a weighted mean is obtained if the values of a sample are multiplied by numbers called weights, summed, and then divided by the sum of the weights. It is used if some values should contribute more (or less) to the average than others.

The median is the central value in an ordered set or the average of the two central values if the number of values, $n$, is even, as specified in Section 1.1.

Example 1.9. Concrete test. The calculation of the median and other measures of location will be greatly facilitated if the data are arranged in order of magnitude. For example, the compressive strengths of concrete given in Table E.1.2 are rewritten in ascending order in Table 1.2.1. The median of these data is 60.1 N/mm², which is the average of 60.0 and 60.2 N/mm². The median of the timber strength data of Table 1.1.3 is 39.05 N/mm², as noted in the table.

The median has an advantage over the mean.
It is relatively unaffected by outliers and is thus often referred to as a resistant measure. For instance, the exclusion of the zero value in Table 1.1.3 results only in a minor change of the median from 39.05 to 39.10 N/mm². One of the countless practical uses of the median arises in the application of a disinfectant to many samples of bacteria. Here, one seeks an association between the proportion of bacteria destroyed and the strength of the disinfectant. The concentration that kills 50% of the bacteria is the median dose. This is termed LD50 (lethal dose for 50%) and provides an excellent measure.

The mode is the value that occurs most frequently. Quite often the mode is not unique because two or more sets of values have equal status. For this reason and for convenience, the mode is often taken from the histogram or frequency polygon.

Example 1.10. Concrete test. For the ranked compressive strengths of concrete in Table 1.2.1, the mode is 60.5 N/mm².

Example 1.11. Timber strength. From Fig. 1.1.4, for example, the mode of the timber strength data is 37.5 N/mm², which corresponds to the midpoint of the class with the highest frequency. However, there is ambiguity in the choice of the class widths as already noted. On the other hand, in Table 1.1.3 there are nine values in the range 38.64–39.34 N/mm², and thus 39 N/mm² seems a more representative value, but this problem can only be resolved theoretically.

As the sample size becomes indefinitely large, the modal value will correspond to the peak of the relative frequency curve on a theoretical basis. The mode may often have greater practical significance than the mean and the median. It becomes more useful as the asymmetry of the distribution increases. For instance, if an engineer were to ask a person who habitually sits fishing on the banks of a river to indicate the mean level of the river, he or she would be inclined to point out the modal level.
It is the value most likely to occur and it is not affected by exceptionally high or low values. Clearly, the deletion of the zero value from Table 1.1.3 does not alter the mode, as we have also seen in the case of the median.

Table 1.2.1 Ordered data of density and compressive strength of concrete^a

Order   Density (kg/m³)   Compressive strength (N/mm²)
  1         2411                 49.9
  2         2415                 50.7
  3         2425                 52.5
  4         2427                 53.2
  5         2427                 53.4
  6         2428                 54.4
  7         2429                 54.6
  8         2433                 55.8
  9         2435                 56.3
 10         2435                 56.7
 11         2436                 56.9
 12         2436                 57.8
 13         2436                 57.9
 14         2436                 58.8
 15         2437                 58.9
 16         2437                 59.0
 17         2441                 59.6
 18         2441                 59.8
 19         2444                 59.8
 20         2445                 60.0
 21         2445                 60.2
 22         2446                 60.5
 23         2447                 60.5
 24         2447                 60.5
 25         2448                 60.9
 26         2448                 60.9
 27         2449                 61.1
 28         2450                 61.5
 29         2454                 61.9
 30         2454                 63.3
 31         2455                 63.4
 32         2456                 64.9
 33         2456                 64.9
 34         2457                 65.7
 35         2458                 67.2
 36         2469                 67.3
 37         2471                 68.1
 38         2472                 68.3
 39         2473                 68.9
 40         2488                 69.5

a The original data sets are given in Table E.1.2.

These positive attributes of the mode and median notwithstanding, the mean is indispensable for many theoretical purposes. Also in the same class as the sample arithmetic mean, there are two other measures of location that are used in special situations. These are the harmonic and geometric means.
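Before turning to these two means, the location measures met so far can be checked numerically against the ranked strengths of Table 1.2.1. The following is a minimal sketch using Python's standard `statistics` module; the helper names `trimmed_mean` and `weighted_mean` are illustrative choices, not notation from the text, and the trimming rule (dropping the integer part of 5% of n from each end) is an assumption about how the 5% is rounded.

```python
from statistics import mean, median, multimode

# Ranked compressive strengths of concrete (N/mm^2), copied from Table 1.2.1.
strengths = [
    49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
    56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
    60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
    63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5,
]


def trimmed_mean(values, fraction=0.05):
    """Mean after dropping `fraction` of the ranked values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


print(f"mean         = {mean(strengths):.4f}")
print(f"trimmed mean = {trimmed_mean(strengths):.4f}")
print(f"median       = {median(strengths):.1f}")
print(f"mode(s)      = {multimode(strengths)}")
```

With these data the median and the mode agree with the values of 60.1 and 60.5 N/mm² quoted in Examples 1.9 and 1.10. `multimode` returns a list because, as noted above, the mode need not be unique.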
The harmonic mean is the reciprocal of the mean of the reciprocals. Thus the harmonic mean for a sample of observations, $x_1, x_2, \ldots, x_n$, is defined as

\[ \bar{x}_h = \frac{1}{(1/n)\left[(1/x_1) + (1/x_2) + \cdots + (1/x_n)\right]}. \tag{1.2.2} \]

It is applied in situations where the reciprocal of a variable is averaged.

Example 1.12. Stream flow velocity. A practical example of the harmonic mean is the determination of the mean velocity of a stream based on measurements of travel times over a given reach of the stream using a floating device. For instance, if three velocities are calculated as 0.20, 0.24, and 0.16 m/s, then the sample harmonic mean is

\[ \bar{x}_h = \frac{1}{(1/3)\left[(1/0.20) + (1/0.24) + (1/0.16)\right]} = 0.19 \text{ m/s}. \]

The geometric mean is used in averaging values that represent a rate of change. Here the variable follows an exponential, that is, a logarithmic law. For a sample of observations, $x_1, x_2, \ldots, x_n$, the geometric mean is the positive nth root of the product of the n values. This is the same as the antilog of the mean of the logarithms:

\[ \bar{x}_g = (x_1 x_2 \cdots x_n)^{1/n} = \exp\left[\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right] = \prod_{i=1}^{n} x_i^{1/n}. \tag{1.2.3} \]

Example 1.13. Population growth. Consider the case of populations of towns and cities that increase geometrically, which means that a future increase is expected that is proportional to the current population. Such information is invaluable for planning and designing urban water supplies and sewerage systems. Suppose, for example, that according to a census conducted in 1970 and again in 1990 the population of a city had increased from 230,000 to 310,000. An engineer needs to verify, for purposes of design, the per capita consumption of water in the intermediate period and hence tries to estimate the population in 1980. The central value to use in this situation is the geometric mean of the two numbers, which is

\[ \bar{x}_g = (230{,}000 \times 310{,}000)^{1/2} = 267{,}021. \]

(Note that the sample arithmetic mean $\bar{x}$ = 270,000.)
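Examples 1.12 and 1.13 can be reproduced with a few lines of code. The sketch below implements Eqs. (1.2.2) and (1.2.3) directly; the function names are illustrative, and the geometric mean is computed through logarithms, which is one reasonable choice for avoiding overflow in the product when n is large.

```python
import math


def harmonic_mean(values):
    """Reciprocal of the mean of the reciprocals, Eq. (1.2.2)."""
    return len(values) / sum(1.0 / x for x in values)


def geometric_mean(values):
    """Antilog of the mean of the natural logarithms, Eq. (1.2.3)."""
    return math.exp(sum(math.log(x) for x in values) / len(values))


velocities = [0.20, 0.24, 0.16]   # m/s, Example 1.12
populations = [230_000, 310_000]  # census counts of 1970 and 1990, Example 1.13

print(f"harmonic mean velocity = {harmonic_mean(velocities):.2f} m/s")  # 0.19
print(f"geometric mean         = {geometric_mean(populations):.0f}")    # 267021
print(f"arithmetic mean        = {sum(populations) / len(populations):.0f}")
```

Both means require strictly positive data: a zero value makes the reciprocal and the logarithm undefined.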
As we see in Example 1.13, the geometric mean is less than the arithmetic mean.7

7 This theoretical property is demonstrated in Example 3.10.

1.2.2 Measures of dispersion

Whereas a measure of central tendency is obtained by locating a central or representative value, a measure of dispersion represents the degree of scatter shown by observations or the inherent variability in a phenomenon under observation. Dispersion also indicates the precision of the data. One method of quantification is through an order statistic, that is, one of ranked data.8 The simplest in the category is the range, which is the difference between the largest and smallest values, as defined in Section 1.1.

8 We shall discuss order statistics formally in Chapter 7; see also Chapter 5.

Example 1.14. Timber strength. As noted before, the range of the timber strength data of Table 1.1.3 is 70.22 − 0.00 = 70.22 N/mm².

Example 1.15. Concrete test. For the compressive strengths of concrete given in Table E.1.2 and ranked in Table 1.2.1, the range is r = 69.5 − 49.9 = 19.6 N/mm²; the range of the concrete densities is 2488 − 2411 = 77 kg/m³.

These numbers provide a measure of the spread of the data in each case. The range, however, is a nondecreasing function of the sample size and thus characterizes the population poorly. Moreover, the range is unduly affected by high and low values that may be somewhat incompatible with the rest of the data even though they may not always be classified as outliers. For this reason, the interquartile range, iqr, which is a relatively resistant measure, is preferable.
As defined in Section 1.1, in a ranked set of data this is the difference between the median of the top half and the median of the bottom half.

Example 1.16. Concrete test. For the compressive strengths of concrete, the iqr is 6.55 N/mm².

Example 1.17. Timber strength. The timber strength data in Table 1.1.3 have an iqr of 11.66 and 11.47 N/mm², respectively, with or without the zero value.

A similar and more general measure is given by the interval between two symmetrical percentiles. For example, the 90−10 percentile range for the timber strength data is approximately 52 − 28 = 24 N/mm² from Fig. 1.1.6.

The aforementioned measures of dispersion can be easily obtained. However, their shortcoming is that, apart from two values or numbers equivalent to them, the vast information usually found in a sample of data is ignored. This criticism is not applicable if one determines the average deviation about some central value, thus including all the observations. For example, the mean absolute deviation, denoted by d, measures the average absolute deviation from the sample mean. For a sample of observations, $x_1, x_2, \ldots, x_n$, it is defined as

\[ d = \frac{|x_1 - \bar{x}| + |x_2 - \bar{x}| + \cdots + |x_n - \bar{x}|}{n} = \sum_{i=1}^{n} \frac{|x_i - \bar{x}|}{n}. \tag{1.2.4} \]

Example 1.18. Annual rainfall. If the annual rainfalls in a city are 50, 56, 42, 53, and 49 cm over a 5-year period, the mean absolute deviation with respect to the sample mean of 50 cm is given by

\[ d = \frac{1}{5}\left(|50 - 50| + |56 - 50| + |42 - 50| + |53 - 50| + |49 - 50|\right) = 3.6 \text{ cm}. \]

This measure of dispersion is easily understood and practically useful. However, it is valid only if the large and small deviations are as significant as the average deviations. There are strong theoretical reasons (as seen in Chapters 3, 4, and 5), on the other hand, for using the sample standard deviation, denoted by s, which is the root mean square deviation about the mean.
Indeed, this is the principal measure of dispersion (although the interquartile range is meaningful and expedient). For a sample of observations, $x_1, x_2, \ldots, x_n$, it is defined by

\[ s = \sqrt{\frac{1}{n}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}. \tag{1.2.5} \]

By expanding and summing the terms on the extreme right-hand side,

\[ s = \sqrt{\frac{1}{n}\left(\sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2\right)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}. \tag{1.2.6} \]

Engineers will recognize that this measure is analogous to the radius of gyration of a structural cross section. In contrast to the mean absolute deviation, it is highly influenced by the largest and smallest values.

The standard deviation of the population is denoted by σ. It is common practice to replace the divisor n of Eq. (1.2.5) by (n − 1) and denote the left-hand side by ŝ. The resulting estimate of the standard deviation is, on average, closer to the population value because it is said to have smaller bias. Eq. (1.2.5), by contrast, will, on average, give an underestimate of σ except in the rare case in which the population mean μ is known and is used in place of $\bar{x}$.9 The required modification to Eq. (1.2.6) is as follows:

\[ \hat{s} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} x_i^2 - \frac{n}{n-1}\bar{x}^2}. \tag{1.2.7} \]

9 Terms such as bias are discussed formally in Section 5.2. It is shown in Example 5.1 that ŝ² is unbiased; however, ŝ is known to have bias, though less than s on average.

This reduction in n can be justified by means of the concept of degrees of freedom. It is a consequence of the fact that the sum of the n deviations $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$ is zero, which follows from Eq. (1.2.1) for the mean. Hence, regardless of the arrangement of the data, if any (n − 1) terms are specified the remaining term is fixed or known, because

\[ x_n - \bar{x} = -\sum_{i=1}^{n-1}(x_i - \bar{x}). \]

It follows from this equation that one degree of freedom is lost in defining the sample standard deviation. The concept of degrees of freedom was introduced by the English statistician R. A. Fisher on the analogy of a dynamical system in which the term denotes the number of independent coordinate values necessary to determine the system.

Example 1.19. Annual rainfall. From the annual rainfall data in Example 1.18 (50, 56, 42, 53, and 49 cm), one can estimate the standard deviation σ by using Eq. (1.2.5), as follows:

\[ s = \sqrt{\frac{1}{5}\left[(50 - 50)^2 + (56 - 50)^2 + (42 - 50)^2 + (53 - 50)^2 + (49 - 50)^2\right]} = \sqrt{\frac{1}{5}\left(0 + 6^2 + 8^2 + 3^2 + 1^2\right)} = \sqrt{\frac{110}{5}} = 4.69 \text{ cm}. \]

An alternative estimate of σ (which is, on average, less biased) is obtained using Eq. (1.2.7) as follows:

\[ \hat{s} = \sqrt{\frac{110}{4}} = 5.24 \text{ cm}. \]

Example 1.20. Timber strength. By using Eq. (1.2.7), the sample standard deviation of the timber strength data of Table E.1.1 is 9.92 N/mm² (or 9.46 N/mm² if the zero value is excluded).

Example 1.21. Concrete test. By using Eq. (1.2.7), the sample standard deviations for the density and compressive strength of concrete in Table E.1.2 are 15.99 kg/m³ and 5.02 N/mm², respectively.

Dividing the standard deviation by the mean gives the dimensionless measure of dispersion called the sample coefficient of variation, v:

\[ v = \frac{\hat{s}}{\bar{x}}. \tag{1.2.8} \]

This is usually expressed as a percentage.
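The arithmetic of Examples 1.18 and 1.19 can be checked with a short script. This is a minimal sketch of Eqs. (1.2.4) to (1.2.8); the function names and the `ddof` argument (0 for the divisor n, 1 for the divisor n − 1, a naming convention borrowed from common numerical libraries) are illustrative assumptions rather than notation from the text.

```python
import math

rainfall = [50, 56, 42, 53, 49]  # annual rainfall (cm), Example 1.18


def sample_mean(values):
    return sum(values) / len(values)


def mean_absolute_deviation(values):
    """Average absolute deviation from the sample mean, Eq. (1.2.4)."""
    m = sample_mean(values)
    return sum(abs(x - m) for x in values) / len(values)


def standard_deviation(values, ddof=0):
    """Eq. (1.2.5) when ddof=0 (divisor n); divisor n - 1 when ddof=1."""
    m = sample_mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - ddof))


def standard_deviation_shortcut(values):
    """Right-hand side of Eq. (1.2.6): sqrt(mean of squares - squared mean)."""
    n = len(values)
    return math.sqrt(sum(x * x for x in values) / n - sample_mean(values) ** 2)


d = mean_absolute_deviation(rainfall)
s = standard_deviation(rainfall)               # divisor n
s_hat = standard_deviation(rainfall, ddof=1)   # divisor n - 1
v = s_hat / sample_mean(rainfall)              # coefficient of variation, Eq. (1.2.8)

print(f"d = {d:.1f} cm, s = {s:.2f} cm, s_hat = {s_hat:.2f} cm, v = {100 * v:.1f}%")
```

The shortcut form of Eq. (1.2.6) subtracts two numbers of similar size, so for data with a large mean and a small spread the direct form of Eq. (1.2.5) is usually the numerically safer of the two.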
The coefficient of variation is useful in comparing different data sets with respect to central location and dispersion.

Example 1.22. Comparison of timber and concrete strength data. From the values of mean and standard deviation in Examples 1.7 and 1.20, the sample coefficient of variation of the timber strength data is 25.3% (or 24.0% without the value of zero). Similarly, from Examples 1.8 and 1.21 the density and compressive strength of concrete data have sample coefficients of variation of 0.65% and 8.35%, respectively. The higher variation in the timber strength data is a reflection of the variability of the natural material, whereas the low variation in the density of the concrete is evidence of a uniform quality in the constituents and a high standard of workmanship, including care taken in mixing. The variation in the compressive strength of concrete is higher than that of its density. This can be attributed to random factors that influence strength, such as some subtle changes in the effectiveness of the concrete that do not alter its density.

From the square of the sample standard deviation one obtains the sample variance, ŝ², which is the mean of the squared deviations from the mean. The population variance is denoted by σ². The variance, like the mean, is important in theoretical distributions. By squaring Eqs. (1.2.6) and (1.2.7), two estimators of the population variance are found. Here estimator refers to a method of estimating a constant in a parent population. As in all the foregoing equations, this term means the random variable of which the estimate is a realization. An unbiased estimator is obtained from Eq. (1.2.7) because on average (that is, by repeated sampling) the estimator tends to the population variance σ². In other words, the expectation E, which is in effect the average from an infinite number of observations, of the square of the right-hand side of Eq. (1.2.7) is equal to σ².
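The idea of unbiasedness "by repeated sampling" can be illustrated with a small simulation. The sketch below is not from the text: it assumes a normal population with σ = 2 (so σ² = 4), draws many samples of size n = 5, and averages the two variance estimates obtained by squaring Eqs. (1.2.5) and (1.2.7). The function names, the seed, and the number of trials are arbitrary choices.

```python
import random


def variance(values, ddof):
    """Sum of squared deviations about the sample mean, divided by n - ddof."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - ddof)


def average_variance_estimates(n=5, sigma=2.0, trials=20000, seed=1):
    """Average the divisor-n and divisor-(n - 1) estimates over many samples."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_n += variance(sample, ddof=0)
        total_n_minus_1 += variance(sample, ddof=1)
    return total_n / trials, total_n_minus_1 / trials


biased, unbiased = average_variance_estimates()
print(f"average with divisor n     : {biased:.2f}  (theory: (n-1)/n * sigma^2 = 3.2)")
print(f"average with divisor n - 1 : {unbiased:.2f}  (theory: sigma^2 = 4.0)")
```

Under these assumptions the divisor-n average settles near 3.2 rather than 4, which is the underestimation described above, while the divisor-(n − 1) average settles near σ².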
There are also measures of dispersion pertaining to the mean of the deviations between the observations. Gini's mean difference, for example, is a long-standing method.10 This is given by

\[ g = \frac{2}{n(n-1)} \sum_{j=1}^{n} \sum_{i>j} \left[x_{(i)} - x_{(j)}\right], \tag{1.2.9} \]

in which $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ are the observations $x_1, x_2, \ldots, x_n$ arranged in ascending order.

10 See, for example, Stuart and Ord (1994, p. 58) for more details of this method originated by the Italian mathematician, Gini. See also Problem 1.7.

1.2.3 Measure of asymmetry

Another important property of the histogram or frequency polygon is its shape with respect to symmetry (on either side of the mode). The sample coefficient of skewness measures the asymmetry of a set of data about its mean. For a sample of observations, $x_1, x_2, \ldots, x_n$, it is defined as

\[ g_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3}{n s^3}. \tag{1.2.10} \]

Division by the cube of the sample standard deviation gives a dimensionless measure. A histogram is said to have positive skewness if it has a longer tail on the right, which is toward increasing values, than on the left. In this case the number of values less than the mean is greater than the number that exceeds the mean. Many natural phenomena tend to have this property. For a positively skewed histogram, mode < median < mean. This inequality is reversed if skewness is negative. A symmetrical histogram suggests zero skewness.

Example 1.23. Comparison of timber and concrete strength data. The coefficients of skewness of the timber strength data of Table E.1.1 and the compressive strength data of Table E.1.2 are 0.15 (or 0.53 after excluding the zero value) and 0.03, respectively.
These indicate a small skewness in the first case and a symmetrical distribution in the second case. The example indicates that this measure of skewness is sensitive to the tails of the distribution.

1.2.4 Measure of peakedness

The extent of the relative steepness of ascent in the vicinity and on either side of the mode in a histogram or frequency polygon is said to be a measure of its peakedness or tail weight. This is quantified by the dimensionless sample coefficient of kurtosis, which is defined for a sample of observations, $x_1, x_2, \ldots, x_n$, by

\[ g_2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4}{n s^4}. \tag{1.2.11} \]

Example 1.24. Comparison of timber and concrete strength data. The kurtosis of the timber strength data of Table E.1.1 is 4.46 (or 3.57 without the zero value) and that of the compressive strengths of Table E.1.2 is 2.33. One can easily see from Eq. (1.2.11) that even a small variation in one of the items of data may influence the kurtosis significantly. This observation warrants a large sample size, perhaps 200 or greater, for the estimation of the kurtosis. Small sample sizes, particularly in the second case with n = 40, preclude the attachment of any special significance to these estimates.

1.2.5 Summary of Section 1.2

Of the numerical summaries listed here, the mean, standard deviation, and coefficient of skewness are the best representative measures of the histogram or frequency polygon, from both visual and theoretical aspects. These provide economical measures for summarizing the information in a data set. Sample estimates for the data we have been discussing here, including the coefficients of variation and kurtosis, are given in Table 1.2.2.
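The remaining summaries of this section, Eqs. (1.2.9) to (1.2.11), can be sketched in the same way. The code below is an illustration under two assumptions: the function names are invented for the example, and s in Eqs. (1.2.10) and (1.2.11) is taken to be the divisor-n standard deviation of Eq. (1.2.5), as the notation suggests. The small rainfall sample of Example 1.18 is used only to exercise the functions; as noted above, such a small sample gives no reliable estimate of skewness or kurtosis.

```python
import math


def central_moment(values, order):
    """Average of (x - mean)**order with divisor n."""
    m = sum(values) / len(values)
    return sum((x - m) ** order for x in values) / len(values)


def skewness(values):
    """Sample coefficient of skewness g1, Eq. (1.2.10)."""
    s = math.sqrt(central_moment(values, 2))
    return central_moment(values, 3) / s ** 3


def kurtosis(values):
    """Sample coefficient of kurtosis g2, Eq. (1.2.11); s**4 is the squared variance."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def gini_mean_difference(values):
    """Gini's mean difference, Eq. (1.2.9), over all pairs i > j of ranked data."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered[i] - ordered[j] for j in range(n) for i in range(j + 1, n))
    return 2.0 * total / (n * (n - 1))


rainfall = [50, 56, 42, 53, 49]  # cm, Example 1.18
print(f"g  = {gini_mean_difference(rainfall):.1f} cm")
print(f"g1 = {skewness(rainfall):.3f}")
print(f"g2 = {kurtosis(rainfall):.3f}")
```

The double sum in `gini_mean_difference` grows with n², which is acceptable for samples of the size discussed here but not for very large data sets.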
Table 1.2.2 Sample estimates of numerical summaries of the timber strength data of Table 1.1.3 and the concrete strength and density data of Table 1.2.1

Data set                                  Sample   Mean^a   Standard      Coefficient of   Coefficient   Coefficient
                                          size              deviation^a   variation (%)    of skewness   of kurtosis
Estimated by equation                              1.2.1    1.2.7         1.2.8            1.2.10        1.2.11
Timber strength, full sample              165      39.09    9.92          25.3             0.15          4.46
Timber strength without the zero value    164      39.33    9.46          24.0             0.53          3.57
Compressive strength of concrete          40       60.14    5.02          8.35             0.03          2.33
Density of concrete                       40       2445     15.99         0.65             0.38          3.15

a Units for strength are N/mm²; units for density are kg/m³.

1.3 EXPLORATORY METHODS

Some graphical displays are used when one does not have any specific questions in mind before examining a data set. These methods were appropriately called exploratory data analysis by Tukey (1977). Among such procedures the box plot is advantageous, and the stem-and-leaf plot is also a valuable tool.

1.3.1 Stem-and-leaf plot

The histogram is a highly effective graphical procedure for showing various characteristics of data as seen in Section 1.1. However, for smaller samples, less than, say, 40 in size, it may not give a clear indication of the variability and other properties of the data. The stem-and-leaf plot, which resembles a histogram turned through a right angle, is a useful procedure in such cases. Its advantage is that the data are grouped without loss of information because the magnitudes of all the values are presented. Furthermore, its intrinsic tabular form highlights extreme values and other characteristics that a histogram may obscure.

As in a histogram, the data are initially ranked in ascending order but a different approach is adopted in finding the number of classes. The class widths are almost invariably equal.
For the increments or class intervals (and hence class widths) one uses 0.5, 1, or 2 multiplied by a power of 10, which means that the intervals are in units such as 0.1 or 200 or 10,000, which are more tractable than, say, 0.13 or 140 or 12,000. The terminology is best explained through the following worked example.

Example 1.25. Concrete test. For the concrete strength data of Table E.1.2, the maximum and minimum values are 69.5 and 49.9 N/mm², respectively. As a first choice, the data can be divided into 21 classes in intervals of 1 N/mm² with lower boundaries at 49, 50, 51 N/mm², and so on, up to 69 N/mm². For the ordered stem-and-leaf plot of Fig. 1.3.1, a vertical line is drawn with the class boundaries marked in increasing order immediately to its left. The boundary values are called the leading digits and, together with the vertical line, constitute the stem. The trailing digits on the right represent the items of data in increasing order when read jointly with the leading digits using the indicated units. They are termed leaves, and their counts are the class frequencies. Thus the digits 49 (stem) and 9 (leaf) constitute 49.9. It is useful to provide an additional column at the extreme left, as shown here, giving the cumulative frequencies, called depths, up to each class. This is completed firstly by starting at the top and totaling downward to the line containing the median, for which the individual frequency is given in parentheses, and secondly by starting at the bottom and totaling upward to the line containing the median. The diagram gives all the information in the data, which is its main advantage. Furthermore, the range, median, symmetry, or gaps in the data, frequently occurring values, and any possible outliers can be highlighted. In this example, a symmetrical distribution is indicated.

   1   49 | 9
   2   50 | 7
   2   51 |
   3   52 | 5
   5   53 | 2 4
   7   54 | 4 6
   8   55 | 8
  11   56 | 3 7 9
  13   57 | 8 9
  15   58 | 8 9
  19   59 | 0 6 8 8
 (7)   60 | 0 2 5 5 5 9 9
  14   61 | 1 5 9
  11   62 |
  11   63 | 3 4
   9   64 | 9 9
   7   65 | 7
   6   66 |
   6   67 | 2 3
   4   68 | 1 3 9
   1   69 | 5

Fig. 1.3.1 Stem-and-leaf plot for compressive strengths of concrete in Table E.1.2; units for stem: 1 N/mm²; units for leaves: 0.1 N/mm².

The plot may be redrawn with a smaller number of classes, perhaps for greater clarity, using the guidelines for choosing the intervals stipulated previously. The units of data in a plot can be rounded to any number of significant figures as necessary. Also, the number of stems in a plot can be doubled by dividing each stem into two lines. When 1 multiplied by a power of 10 is used as an interval, for example, the first line, which is denoted by an asterisk (∗), will thus have leaves 0 to 4, and the leaves of the second, represented by a period (.), will be from 5 to 9. Likewise, one may divide a stem into five lines. The stem-and-leaf plot is best suited for small to moderate sample sizes, say, less than 200.

[Figure: two vertical box plots on a common axis, Strength (N/mm²), scaled from 10 to 80. Timber strength, excluding the zero value: minimum 17.98, quartiles 33.10, 39.10, and 44.57, maximum 70.22, other high values 65.35, 65.61, and 69.07, critical values for detecting outliers 15.89 and 61.78. Compressive strength of concrete: minimum 49.9, quartiles 56.8, 60.1, and 63.4, maximum 69.5, critical values for detecting outliers 46.97 and 73.23.]

Fig. 1.3.2 Box plots for timber strength and compressive strength of concrete data from Tables 1.1.3 and 1.2.1.

1.3.2 Box plot

Another plot that is highly useful in data presentation is the box plot, which displays the three quartiles, Q1, Q2, Q3, on a rectangular box aligned either horizontally or vertically. The box, together with the minimum and maximum values, which are shown at the ends of lines extended on either side of the box from the midpoints of its extremities, constitutes the box-and-whiskers plot, as it is sometimes called. The numerical signposts are arranged as follows from top to bottom: minimum, Q1, Q2, Q3, and maximum. Together they constitute a five-number summary. The minimum and maximum values may be replaced by the 5th and 95th (or other extreme) percentiles or supplemented by these and additional extreme values. These plots play an important role in comparing two or more samples. The width of the box is made proportional to the sample size in such cases, if they are different.

Example 1.26. Comparison of timber and concrete strength data. Let us use a box plot to compare the strengths of two representative materials used by civil engineers. Figure 1.3.2 shows the timber strength data ranked in Table 1.1.3, with the zero value excluded, and the compressive strength of concrete data that were ranked in Table 1.2.1. The box plot of compressive strengths of concrete shown on the right strongly indicates symmetry in their distribution. In the case of the timber strength data, the box is less symmetrical, and there are clear signs of positive skewness because the line connecting the highest value to the box is longer than that connecting the lowest value to the box.

Empirical rules have been devised to detect outliers by means of box plots.
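The five-number summary that underlies a box plot can be computed directly from ranked data. The sketch below takes the quartiles as the medians of the bottom and top halves, as in the definition of the iqr recalled in Section 1.2.2. The function name is illustrative, and the treatment of an odd sample size (leaving the median out of both halves) is an assumption, since conventions differ.

```python
from statistics import median

# Ranked compressive strengths of concrete (N/mm^2), copied from Table 1.2.1.
strengths = [
    49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
    56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
    60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
    63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5,
]


def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value is left out of both halves (an assumed convention).
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


minimum, q1, q2, q3, maximum = five_number_summary(strengths)
print(f"min = {minimum}, Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}, max = {maximum}")
print(f"iqr = {q3 - q1:.2f} N/mm^2")
```

For these data the interquartile range comes out at 6.55 N/mm², in agreement with Example 1.16.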
As previously stated, this term signifies an excessive discordance with reference to an assumed distribution to which the majority of observations belong. One such procedure identifies outliers as those values located at distances greater than 1.5 iqr above the third quartile or more than 1.5 iqr below the first quartile.

Example 1.27. Comparison of timber and concrete strength data. The iqrs for the timber strength and compressive strength of concrete data are 11.47 and 6.55 N/mm² and thus the two critical distances for detecting outliers are 17.21 and 9.83 N/mm², respectively. These distances are set out on either side from the extremities of the boxes and are shown by thick horizontal lines in Fig. 1.3.2. By this rule, the concrete data do not have any outliers, whereas there are four outliers beyond the demarcating line for high outliers in the timber strength data of Table 1.1.3. These are the values 65.35, 65.61, 69.07, and 70.22 N/mm². At the other extremity, there is the zero value that was discarded before the diagram was drawn. When such an observation is recorded one should verify whether it stems from a faulty calibration or other source of error; it is clearly an outlier by the method described here.11

11 More precise methods of systematically detecting outliers (such as those investigated by Kottegoda, 1984) are discussed in Chapter 5.

1.3.3 Summary of Section 1.3

In general, box plots are helpful in highlighting distributional features, including the range and many of the properties of a histogram. They provide a valuable means of comparing data measuring related or similar characteristics. The stem-and-leaf plot is also clearly useful in presenting a set of data as an alternative to the histogram. Both diagrams can be easily drawn.
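The 1.5 iqr rule of Example 1.27 is simple enough to sketch in a few lines. In the code below the function names are illustrative, the timber quartiles (33.10 and 44.57 N/mm²) are those marked in Fig. 1.3.2, and only the extreme timber values quoted in the text are tested, because the full ranked sample of Table 1.1.3 is not reproduced here.

```python
def outlier_limits(q1, q3, factor=1.5):
    """Critical values located factor * iqr below Q1 and above Q3."""
    distance = factor * (q3 - q1)
    return q1 - distance, q3 + distance


def flag_outliers(values, q1, q3, factor=1.5):
    """Return the values lying beyond either critical value."""
    low, high = outlier_limits(q1, q3, factor)
    return [x for x in values if x < low or x > high]


# Timber strength (N/mm^2): quartiles from Fig. 1.3.2 and the extreme values
# mentioned in the text, including the zero value that was discarded.
timber_q1, timber_q3 = 33.10, 44.57
timber_extremes = [0.00, 17.98, 65.35, 65.61, 69.07, 70.22]

low, high = outlier_limits(timber_q1, timber_q3)
print(f"critical values: {low:.3f} and {high:.3f} N/mm^2")
print("flagged:", flag_outliers(timber_extremes, timber_q1, timber_q3))

# Concrete strength: the smallest and largest values lie inside the limits.
print("flagged:", flag_outliers([49.9, 69.5], 56.8, 63.35))
```

The value 17.98 N/mm², the smallest timber strength once the zero is removed, lies inside the lower critical value and is therefore not flagged, whereas the zero value and the four high values are.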
These are two of the commonly used exploratory graphical methods. Other methods presented in subsequent chapters include the hanging histogram of Subsection 5.8.5.1.

1.4 DATA OBSERVED IN PAIRS

In the preceding sections, the behavior of one variable was considered. Let us extend this discussion to the case where simultaneous observations are made of two variables and a study is made to find an association between the variables. In this section the simple bivariate case of paired samples is examined, and the types of association between them are briefly assessed.

1.4.1 Correlation and graphical plots

A specific type of association that is frequently examined is known as correlation (from co-relation). In usual practice, graphical methods are initially applied; subsequently, numerical summaries provide a quantification and a means of assessment. For example, if there are n pairs of observations, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, of two variables X and Y, a preliminary indication of the correlation is obtained through a scatter diagram. In this plot the coordinates denote the observed pairs of values.

Example 1.28. Concrete test. The scatter diagram of Fig. 1.4.1 represents the concrete data of Table E.1.2, with the density and compressive strength at 28 days given by the horizontal and vertical axes, respectively. At first sight, there is no well-defined relationship between the two sets of observations although one would expect a density that is higher or lower than average to be associated with a compressive strength of concrete that is correspondingly higher or lower than its average.
[Figure: scatter plot with horizontal axis Density of concrete (kg/m³), scaled from 2400 to 2500, and vertical axis Compressive strength (N/mm²), scaled from 40 to 80.]

Fig. 1.4.1 Scatter diagram of concrete test data from Table E.1.2.

1.4.2 Covariance and the correlation coefficient

The sample covariance, $s_{X,Y}$, gives a numerical summary of the linear association between two quantitative variables X and Y. It is the average of the product of their deviations about the respective means. Thus,

\[ s_{X,Y} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}). \tag{1.4.1} \]

The covariance will be greater when there is a greater direct association between X and Y with respect to higher than average values and similarly for lower than average values. If the sample covariance is divided by the sample standard deviations of the two variables, $s_X$ and $s_Y$ [as in Eq. (1.2.6)], one obtains a dimensionless measure of linear association called the sample coefficient of correlation,

\[ r_{X,Y} = \frac{1}{n s_X s_Y}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}). \tag{1.4.2} \]

Substituting for $s_X$ and $s_Y$, we find

\[ r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}. \tag{1.4.3} \]

The correlation coefficient is constrained by $-1 \le r_{X,Y} \le 1$. Because the association measured here is defined by Eqs. (1.4.2) and (1.4.3), this result is called the linear coefficient of correlation or the product-moment correlation coefficient.12 The two limiting values in the preceding constraint are of theoretical interest and are applicable if all the points of a scatter diagram lie on a straight line of the type

\[ Y_i = \beta_0 + \beta_1 x_i, \tag{1.4.4} \]

where $\beta_0$ and $\beta_1$ are constants. The constant $\beta_1$ will be positive for all positive correlations including the maximum value, $r_{X,Y} = 1$. In the opposite case, $\beta_1$ will be negative, indicating negative correlation. That is, a high value of one variable tends to be associated with a low value of the other; the minimum value, $r_{X,Y} = -1$, is in this category.

12 Another measure, Spearman's rank correlation coefficient, is discussed in Chapter 5.

In some cases the scatter diagram may indicate that there is an exponential or other nonlinear type of relationship between the two variables. In such cases, special procedures are necessary. For example, one may apply a logarithmic, square root, negative reciprocal, or other appropriate transformation to one or both variables prior to analysis (as discussed in Chapter 6).

Example 1.29. Concrete test. The scatter diagram of Fig. 1.4.1 does not show a strong relationship between the density and the compressive strength. This fact is confirmed by the correlation coefficient of +0.44 obtained from Eq. (1.4.3). It is possible that the inclusion of additional variables, such as the results of slump tests, will lead to an improved relationship for predictive purposes in a multiple regression analysis.

Note that a zero correlation does not show that the variables are independent. For variables that have no dependence, however, the correlation will not be of any significance.13 Note that one is only seeking an association between two variables through the correlation coefficient, not a cause and effect relationship. In some cases there are clear reasons for dependency, as in the case of a force exerted on a steel wire and the consequent increase in its length, or as in rainfall resulting in runoff. Often, however, one cannot reach such a conclusion when there is strong positive or negative correlation.
One may find, for instance, that two variables are correlated because they are both associated with a third variable and not because there is a physical relationship between the first two.¹⁴ Equations of regression such as Eq. (1.4.4) are generally used to predict Y for a given value of X without invoking a causal relationship. Accordingly, the given value x is called the explanatory (nonrandom) variable and Y is the response (random) variable.

Example 1.30. Water quality. Another example of positive or negative correlation is the association between variables measuring water quality. A case study is taken from the Blackwater River in central England, which is constantly monitored for the control of pollution. The variables that are measured, among others, are the amounts of dissolved oxygen, DO, and the biochemical oxygen demand, BOD, in the water. Dissolved oxygen is required for the respiration of aerobic life forms such as fish. The BOD denotes the amount of oxygen used in meeting the metabolic needs of aerobic microorganisms in water, whether naturally occurring or resulting from sewage outflows and other discharges; thus, high values of BOD generally indicate high levels of pollution. Usually determined in a laboratory after a 5-day incubation of samples taken from the water, BOD is the most widely used indicator of pollution despite some shortcomings. Sampling at 38 stations along the river gives the data presented in Table E.1.3.

¹³ The significance of small values of correlation and whether they probably indicate zero correlation are discussed in Chapter 6, in addition to other aspects of regression including the particular notation of Eq. (1.4.4). The concept of independence is discussed in Chapter 2.

¹⁴ An absurdity cited in early literature is the apparent relationship between horse kicks suffered by cavalrymen and wheat production in Europe.
Also, Yule (1926) correlates concurrent time series of the proportion of Church of England marriages and the standardized mortality rates per 1000 persons with a "nonsense" correlation coefficient of 0.95; he explains that both variables are highly influenced by a common factor. We now call this behavior spurious correlation.

Fig. 1.4.2 Scatter diagram of water quality data from Table E.1.3 (BOD in mg/L plotted against DO in mg/L).

The scatter diagram of the two indicators of water quality is shown in Fig. 1.4.2. As expected, it strongly indicates a negative type of correlation, with high values of DO associated with low values of BOD and vice versa. The coefficient of correlation from Eq. (1.4.3) is −0.90. It suggests that the value of BOD can be estimated from a measurement of the DO. The scatter in the diagram may be partly attributed to some inadequacies of the BOD test and partly to factors such as temperature and rate of flow, which affect the DO.

The presence of outliers tends to have a significant effect on the coefficient of correlation. Consider, for example, the lowest BOD in Fig. 1.4.2, which corresponds to the first pair of values in Table E.1.3. This may not warrant consideration as an outlier. It can, however, be due to an incorrect observation or an error in recording. With reference to Example 1.30, it is interesting to note that if one changes the first BOD value of Table E.1.3 from 2.27 to 2.77, the correlation coefficient changes from −0.90 to −0.92, that is, the negative correlation becomes slightly stronger.
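The calculations of Eqs. (1.4.1) and (1.4.3), and the sensitivity of the correlation coefficient to a single doubtful observation, can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, and the DO and BOD values below are invented for the purpose and are not those of Table E.1.3.

```python
import math


def sample_covariance(x, y):
    """Sample covariance as in Eq. (1.4.1), using the divisor n."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n


def correlation(x, y):
    """Product-moment correlation coefficient as in Eq. (1.4.3)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


# Invented (hypothetical) DO and BOD values in mg/L, for illustration only.
do_values = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
bod_values = [4.6, 4.1, 3.2, 2.9, 2.0, 1.4]
r_clean = correlation(do_values, bod_values)

# Replace the first BOD value by a doubtful low reading and recompute.
bod_with_outlier = list(bod_values)
bod_with_outlier[0] = 1.0
r_outlier = correlation(do_values, bod_with_outlier)

print(f"r without outlier: {r_clean:.3f}")
print(f"r with outlier:    {r_outlier:.3f}")
```

With these invented values a single altered observation weakens a strong negative correlation considerably, which is the kind of effect described above for the water quality data.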
1.4.3 Q-Q plots

Quantiles representing two attributes or phenomena that are considered to be associated may be compared using a Q-Q plot. Here one plots the quantiles of one data set against the corresponding quantiles of another set as a means of comparing their probability distributions. One proceeds initially with the ranking and calculation of cumulative relative frequencies for a quantile plot for each set of data (as a prerequisite to drawing Fig. 1.1.6, for example). The two quantile plots are then associated graphically by plotting values of data with equal cumulative relative frequencies. In this type of diagram the limiting case, in which the distributions differ only with respect to location and scale, is represented by a straight line. The manner in which the plot departs from linearity indicates other types of difference between the two distributions.

When one quantile function represents a theoretical distribution, the Q-Q plot becomes a probability plot. This is a very useful diagram, first adopted in practice by a civil engineer, R. W. Powell, in 1943. The probability plot may be considered to be an extension of the box plot, because all the quantiles are used in this method of comparing empirical and theoretical distributions.¹⁵

¹⁵ Details of this method are given in Section 5.8.

Fig. 1.4.3 Q-Q plot of concrete test data from Table E.1.2 (compressive strength in N/mm² plotted against density of concrete in kg/m³).

Example 1.31. Concrete test. The distributions of the concrete strengths and densities listed in Table E.1.2 are to be compared using a Q-Q plot.
For this purpose the ranked data of Table 1.2.1 are used to obtain the cumulative relative frequencies for each item of data in the sample of concrete strengths and the sample of concrete densities. Then a Q-Q plot is drawn by associating data of equal cumulative frequencies. When sample sizes are the same, such as in the case of the data used here, one can proceed directly to the Q-Q plot; in other cases one calculates the quantiles of the smaller sample and then interpolates, correspondingly, the quantiles for the larger sample. There are apparent similarities in the distributions of strengths and densities, as shown in Fig. 1.4.3. Although the distributions are not close, they do not seem to be divergent.

1.4.4 Summary of Section 1.4

A brief preliminary introduction is provided here on methods of investigating data observed in pairs. This is a prelude to the formal presentations in Chapters 3 and 5 and particularly in Chapter 6 on regression and multivariate analysis.

1.5 SUMMARY FOR CHAPTER 1

In this chapter numerous graphical methods for presenting data sets are introduced. These include line diagrams, histograms, relative frequency polygons, cumulative relative frequency diagrams, and scatter diagrams. Details of exploratory methods such as stem-and-leaf plots and box plots are also given. Many of the numerical summaries for reducing data in this chapter are essential for the application of statistics and probability in engineering. Among the most important of these statistics are the mean, standard deviation, and the coefficient of correlation.

Several sets of data are provided here as examples of random variables which engineers encounter. One needs to interpret these and draw sensible conclusions. The graphical and numerical methods here are a necessary first step and lead into the probabilistic methods of Chapters 2 and 3 and the verification of mathematical models in subsequent chapters.
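As a supplement to the Q-Q construction of Section 1.4.3, the pairing of quantiles, including the interpolation needed when the two samples differ in size, may be sketched as follows. This is an illustrative sketch rather than the procedure of the text: the function names are hypothetical, and the cumulative relative frequency (i − 0.5)/n assigned to the ith ranked value is an assumed plotting convention that may differ from the one used for Fig. 1.1.6.

```python
def plotting_positions(n):
    """Cumulative relative frequencies of n ranked values.

    The convention (i - 0.5) / n is an assumption made for this sketch.
    """
    return [(i - 0.5) / n for i in range(1, n + 1)]


def interpolate_quantile(ranked, p):
    """Linearly interpolate the value at cumulative frequency p in ranked data."""
    n = len(ranked)
    positions = plotting_positions(n)
    if p <= positions[0]:
        return ranked[0]
    if p >= positions[-1]:
        return ranked[-1]
    h = p * n - 0.5              # fractional zero-based rank matching p
    lower = min(int(h), n - 2)   # guard against rounding at the upper end
    fraction = h - lower
    return ranked[lower] + fraction * (ranked[lower + 1] - ranked[lower])


def qq_pairs(sample_a, sample_b):
    """Return the (a, b) points of a Q-Q plot for two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    if len(a) == len(b):
        # Equal sizes: ranked values already share cumulative frequencies.
        return list(zip(a, b))
    if len(a) < len(b):
        # Use the quantiles of the smaller sample; interpolate in the larger.
        probs = plotting_positions(len(a))
        return [(ai, interpolate_quantile(b, p)) for ai, p in zip(a, probs)]
    probs = plotting_positions(len(b))
    return [(interpolate_quantile(a, p), bi) for bi, p in zip(b, probs)]


# Invented values, for illustration only.
equal_size = qq_pairs([3.0, 1.0, 2.0], [30.0, 10.0, 20.0])
unequal_size = qq_pairs([1.0, 2.0], [10.0, 20.0, 30.0, 40.0])
print(equal_size)
print(unequal_size)
```

The resulting pairs can be passed to any plotting routine. If the points fall close to a straight line, the two distributions differ mainly in location and scale, as noted in Section 1.4.3.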
REFERENCES

General. The following references are given for further reading as required:

Ang, A. H.-S., and W. H. Tang (1975). Probability Concepts in Engineering Planning and Design, Vol. 1: Basic Principles, John Wiley and Sons, New York. A blend of theory and practice with wide appeal for practicing civil engineers.
Benjamin, J. R., and C. A. Cornell (1970). Probability, Statistics and Decision for Civil Engineers, McGraw-Hill, New York. A classic for civil engineers with examples, extensive case studies, and innumerable problems to solve.
Blank, L. T. (1980). Statistical Procedures for Engineering, Management and Science, McGraw-Hill, New York. Well-explained theory and practical examples, commendable as an introductory text.
Chambers, J. M., W. S. Cleveland, B. Kleiner, and P. A. Tukey (1983). Graphical Methods for Data Analysis, Wadsworth, Belmont, CA. A standard reference for those seeking further knowledge of graphical methods.
Groeneveld, R. A. (1979). Introductory Statistical Methods: An Integrated Approach Using Minitab, Marcel Dekker, New York. An ideal statistical guide with computer applications suitable for beginners.
Hahn, G. J., and S. S. Shapiro (1967). Statistical Models for Engineering, John Wiley and Sons, New York. Reprinted in 1994 as a Wiley Classic in the engineering series. Recommended as a reference book for understanding the basics of statistical applications in engineering.
Hand, D. J., F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994). A Handbook of Small Data Sets, Chapman and Hall, London. Diverse data sets.
Hines, W. H., and D. C. Montgomery (1990). Probability and Statistics in Engineering and Management Science, 3rd ed., John Wiley and Sons, New York. Comprehensive book of 700 pages, 300 examples, and 626 problems.
Hoaglin, D. C., F. Mosteller, and J. W. Tukey (eds.) (1983). Understanding Robust and Exploratory Data Analysis, John Wiley and Sons, New York. This authoritative book will further enhance one's knowledge of exploratory data analysis.
Johnson, R., and G. K. Bhattacharyya (1992). Statistics: Principles and Methods, 2nd ed., John Wiley and Sons, New York. Basic principles well explained.
Mendenhall, W., and R. J. Beaver (1994). Introduction to Probability and Statistics, 9th ed., Duxbury Press, Boston. Statistics and probability at beginners' level.
Mendenhall, W., and T. Sincich (1995). Statistics for Engineering and the Sciences, 4th ed., Prentice Hall, Englewood Cliffs, NJ. Introduction with many applications.
Moore, D. S., and G. P. McCabe (2003). Introduction to the Practice of Statistics, 4th ed., W. H. Freeman and Co., New York. A useful primer in statistics.
Moroney, M. J. (1975). Facts from Figures, reprinted 1990, Penguin Books, London. The best book written for an absolute beginner in statistics.
Scheaffer, R. L., and J. T. McClave (1995). Probability and Statistics for Engineers, 4th ed., Duxbury Press, Belmont, CA. A variety of charts and preliminary calculations in Chapter 1. In general, low emphasis in mathematics throughout. Highly commendable as an introduction.
Wackerly, D. D., W. Mendenhall, and R. L. Scheaffer (2002). Mathematical Statistics with Applications, 6th ed., Duxbury, Pacific Grove, CA. Comprehensive introduction.

Additional references quoted in text

Freedman, D., and P. Diaconis (1981). "On the histogram as a density estimator: L2 theory," Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, Vol. 57, pp. 453–476, Chap. 2. Related to the class intervals of a histogram.
Kottegoda, N. T. (1984). "Investigation of outliers in annual maximum flow series," J. Hydrol., Vol. 72, No. 1, pp. 105–137. Methods of detecting outliers.
Scott, D. W. (1979). "On optimal and data-based histograms," Biometrika, Vol. 66, pp. 605–610. Number of classes for histograms.
Stuart, A., and J. K. Ord (1994). Kendall's Advanced Theory of Statistics, Vol. 1, 6th ed., Charles, Edward Arnold, London. Advanced reference. See Gini's mean difference in Chapter 2.
Sturges, H. A. (1926). "The choice of a class interval," J. Am. Stat. Assoc., Vol. 21, pp. 65–66. Historical work on the histogram.
Tukey, J. W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading, MA. Original reference on exploratory methods.
Yule, G. U. (1926). "Why do we sometimes get nonsense correlation between time series," J. R. Stat. Soc., Vol. 89, pp. 1–69. Shows how two unrelated variables can have a high coefficient of correlation because they are influenced by a common factor.

PROBLEMS

1.1. Earthquake records. Measurements of engineering interest have been recorded during earthquakes in Japan and in other parts of the world since 1800. One of the critical recordings is of apparent relative density, RDEN. After the commencement of a strong earthquake, a saturated fine, loose sand undergoes vibratory motion and consequently the sand may liquefy without retaining any shear strength, thus behaving like a dense liquid. This will lead to failures in structures supported by the liquefied sand. These are often catastrophic. The standard penetration test is used to measure RDEN.
Another measurement taken to estimate the prospect of liquefaction is that of the intensity at which the ground shakes. This is the peak surface acceleration of the soil during the earthquake, ACCEL. The data are from J. T. Christian and W. F. Swiger (1975), J. Geotech. Eng. Div., Proc. ASCE, 101, GT111, 1135–1150, and are reproduced by permission of the publisher (ASCE):

RDEN (%)   ACCEL (g)     RDEN (%)   ACCEL (g)     RDEN (%)   ACCEL (g)
   53        0.219          30        0.138          50        0.313
   64        0.219          72        0.422          44        0.224
   53        0.146          90        0.556         100        0.231
   64        0.146          40        0.447          65        0.334
   65        0.684          50        0.547          68        0.419
   55        0.611          55        0.204          78        0.352
   75        0.591          50        0.170          58        0.363
   72        0.522          55        0.170          80        0.291
   40        0.258          75        0.192          55        0.314
   58        0.250          53        0.292         100        0.377
   43        0.283          70        0.299         100        0.434
   32        0.419          64        0.292          52        0.350
   40        0.123          53        0.225          58        0.334

Note: ACCEL is in units of g, the acceleration due to gravity (9.81 m/s²).

Compute the sample mean x̄, standard deviation ŝ, and the coefficient of skewness, g₁, for RDEN and ACCEL. Construct stem-and-leaf plots for each set. Comment on the distributions. Plot the scatter diagram and calculate the correlation coefficient r. What conclusions can be reached?

1.2. Flood discharge. Annual maximum flood flows in the Po River at Pontelagoscuro, Italy, over a 61-year period from 1918 to 1978 are given in the second column of Table E.7.2. Compute the sample mean x̄ and standard deviation ŝ. Sketch a histogram and the cumulative relative frequency diagram. Compute the quartiles and draw a box-and-whiskers plot. Comment on the distribution. Flood embankments along the banks of the river can withstand a flow of 5000 m³/s.
What is the probability that this will be exceeded during a 12-month period?

1.3. Flood discharge. The following are the annual maximum flows in m³/s in the Colorado River at Black Canyon for the 52-year period from 1878 to 1929:

1980 1700 1420 1980 2690 3230 1130 1570 1980 4960 2270 3090 3120
2830 2690 2120 5660 2120 2120 3260 2550 5950 1700 2410 1840 4250
3400 2550 2550 2410 1980 3120 8500 1980 1840 4670 2070 3260 2120
3120 1700 1470 3960 2410 3290 2410 2410 2270 2410 3170 4550 3310

[Adapted from E. J. Gumbel (1954), "Statistical theory of extreme values and some practical applications," National Bureau of Standards, Applied Mathematics Series 33, U.S. Govt. Printing Office, Washington, DC.]

Compute the mean x̄ and standard deviation ŝ. Sketch a histogram and the relative frequency diagram. Compute the quartiles and draw a box-and-whiskers plot. How does this distribution differ from that of Problem 1.2?

1.4. Welding joints for steel. At the University of Birmingham, England, laboratory measurements were taken of the horizontal legs x and vertical legs y of numerous welding joints for steel buildings. The main objective was to make the legs equal to 6 mm. A part of the results is listed below in millimeters.

x = 5.5, 5.0, 5.0, 6.0, 7.0, 5.2, 5.5, 5.5, 6.0, 6.0, 4.5, 6.0, 5.5, 7.7, 7.5, 6.0, 5.6, 5.0, 5.5, 5.5, 6.0, 6.5, 5.5, 5.0, 5.5, 5.5, 6.5, 6.5, 7.0, 5.5, 6.5, 5.5, 6.0, 6.5, 8.5, 5.0, 6.0, 6.5, 5.0, 7.0, 5.0, 5.0, 6.5, 6.5, 6.0, 4.7, 8.0, 7.0, 5.5, 7.0, 6.6, 6.5, 7.0, 6.0, 6.5, 5.0, 7.0, 7.5, 7.0, 7.0

y = 6.5, 6.5, 5.5, 7.5, 6.0, 7.0, 5.0, 8.0, 6.7, 7.8, 5.7, 6.5, 5.5, 8.0, 8.0, 6.3, 6.0, 6.0, 6.0, 5.5, 6.5, 6.0, 6.0, 6.0, 6.0, 6.5, 6.5, 6.0, 6.0, 6.5, 7.5, 7.5, 6.0, 4.5, 7.0, 7.0, 6.0, 4.0, 4.0, 7.0, 7.0, 6.5, 7.0, 5.0, 5.0, 5.7, 5.0, 5.0, 6.0, 7.0, 6.0, 7.0, 6.0, 5.5, 6.0, 4.0, 5.5, 8.0, 7.5, 6.5

The data were provided by Dr A. G. Kamtekar. Draw a scatter diagram for these data.
Draw a line through the ideal point (x = y = 6 mm) and the origin. Draw two lines through the origin that are symmetrical about the first line and envelop all of the points. Comment on the results. Draw the cumulative sum (cusum) plots,

$$C_{x_n} = \sum_{i=1}^{n} (x_i - \mu_x) \quad \text{and} \quad C_{y_n} = \sum_{i=1}^{n} (y_i - \mu_y)$$

for n = 1, 2, . . . , 60 and $\mu_x = \mu_y = 6$. Let

$$d_{x_n} = C_{x_n} - \min_{1 \le i \le n-1} [C_{x_i}]$$

and the critical limit be $\max(d_{x_n}) = 12$ mm. Is the critical limit reached? Repeat for the vertical legs y. [Further details of cusum plots are given by W. H. Woodall and B. M. Adams (1993), "The statistical design of cusum charts," Qual. Eng., Vol. 5, No. 4, pp. 550–570; the associated control chart is the subject of Problem 5.11.]

1.5. Frost frequency. Excessive frost can be harmful to roads. Frequencies of the number of days of frost during April in Greenwich, England, over a 65-year period are given by C. E. Brooks and N. Carruthers (1953), Handbook of Statistical Methods in Hydrology, H. M. Stationery Office, London, and are listed below:

Days of frost    0   1   2   3   4   5   6   7   8   9  10
Frequency       15  11   5  11   7   6   2   3   2   1   2

Draw a line diagram of the data. Comment on the results. Compute the mean number of days of frost in April. What is the probability of a frost-free April in a given year? What change would you expect in the frequency distribution for a month in midwinter?

1.6. Concrete cube test.
From 28-day concrete cube tests made in England in 1990, the following results of maximum load at failure in kilonewtons and compressive strength in newtons per square millimeter were obtained:

Maximum load: 950, 972, 981, 895, 908, 995, 646, 987, 940, 937, 846, 947, 827, 961, 935, 956

Compressive strength: 42.25, 43.25, 43.50, 39.25, 40.25, 44.25, 28.75, 44.25, 41.75, 41.75, 38.00, 42.50, 36.75, 42.75, 42.00, 33.50

The data were supplied by Dr J. E. Ash, University of Birmingham, England. Calculate the means x̄, standard deviations ŝ, mean absolute deviations d, and the coefficients of skewness g₁. Draw two stem-and-leaf plots of the data. Draw a scatter diagram and calculate the coefficient of correlation. What conclusions can be drawn?

1.7. Timber strength. For the timber strength data of Table E.1.1 determine the following measures of dispersion:
(a) Interquartile range, iqr
(b) Mean absolute deviation, d
(c) Gini's mean difference, g
Compare results with the standard deviation ŝ of Table 1.2.2. Repeat these determinations after deleting the zero value. Rank the measures of dispersion in increasing order of susceptibility to the exclusion of the zero value on the basis of percentage change.

1.8. Population growth. From past records, the population of an urban area has doubled every 10 years. Currently, it has a population of 200,000. An engineer needs to make an estimate of the requirements for water supply during the next 23 years. What maximum population does one assume for the period?

1.9. Traffic speed. The following is the frequency distribution of travel times of motorcars on the M1 motorway from Coventry, England, to M10, St Albans, according to a survey conducted in England (see Ph.D. thesis of A. W.
Evans, University of Birmingham, England, 1967):

Mean times (min): 53, 58, 63, 68, 73, 78, 83, 88, 93, 98, 103, 108, 113, 118, 123, 128, 133, 138, 143, 148, 153, 158, 163, 168

Corresponding frequencies: 10, 24, 109, 127, 122, 119, 97, 102, 104, 92, 68, 72, 66, 61, 36, 33, 17, 15, 10, 8, 9, 6, 7, 3

Draw the histogram. Describe the salient features. What is the likely reason for the twin peaks? What inference can be made from the mean time interval between the two peaks?

1.10. Average speed. On a certain country road that runs from a coastal town to a village in the mountains, the average speed of motorcars is 80 km/h uphill and 100 km/h downhill. What is the average speed for a journey from the town to the village and back?

1.11. Annual rainfall. Catchment-averaged annual rainfall values in the Po River basin of Italy for the 61-year period from 1918 to 1978 are given in the penultimate column of Table E.7.2. Draw a stem-and-leaf plot and a box plot of the data. Comment on the type of distribution.

1.12. Rock test. A contractor engaged in building part of a sewer tunnel claimed that the rock was harder than described in his contract with a District Council in the United Kingdom and thus more work was required to construct the tunnel than anticipated. An independent company made tests to verify the contractor's claim. Among these were uniaxial compressive strengths, of which 123 specimens are listed here, in meganewtons per square meter.
2.40, 22.08, 16.80, 4.80, 21.36, 9.12, 9.36, 3.60, 15.36, 15.60, 6.24, 9.84, 16.08, 30.00, 20.40, 12.96, 19.20, 10.32, 15.84, 62.40, 40.80, 4.80, 7.20, 8.88, 14.40, 14.88, 5.76, 18.72, 12.48, 11.04, 8.64, 19.20, 8.16, 18.96, 8.64, 12.00, 14.88, 17.52, 12.48, 13.44, 9.36, 11.28, 8.88, 15.12, 9.36, 17.28, 26.40, 4.32, 11.28, 7.92, 13.92, 11.76, 9.60, 8.40, 9.84, 27.60, 6.00, 14.40, 8.88, 17.04, 12.48, 9.84, 10.80, 12.24, 12.00, 13.20, 11.28, 11.76, 11.76, 8.00, 9.36, 15.12, 11.52, 16.08, 10.80, 14.64, 8.40, 13.44, 10.56, 9.12, 13.44, 12.72, 13.68, 11.28, 5.52, 11.04, 12.00, 7.20, 8.64, 11.76, 8.64, 7.68, 7.68, 13.92, 6.48, 7.20, 7.92, 9.60, 8.64, 9.12, 12.96, 9.36, 14.64, 9.12, 8.88, 20.40, 17.28, 8.64, 11.76, 7.92, 7.68, 11.04, 12.48, 14.40, 9.84, 9.12, 8.40, 12.00, 4.80, 12.72, 9.60, 8.64, 9.84

Draw histograms using Eqs. (1.1.1) and (1.1.2) for the class widths. What do you notice about the histograms in general? Draw a box-and-whiskers plot. What evidence is there to support the contractor's claim?

1.13. Soil erosion. Measurements taken on farmlands of the amounts of soil washed away by erosion suggest a relationship with flow rates. The following results are taken from G. R. Foster, W. R. Ostercamp, and L. J. Lane (1982), "Effect of discharge rate on rill erosion," Winter 1982 Meeting of the American Society of Agricultural Engineers:

Flow (L/s)    Soil eroded (kg)
  0.31            0.82
  0.85            1.95
  1.26            2.18
  2.47            3.01
  3.75            6.07

Draw a plot of the data. Comment on the results.

1.14. Concrete cube test. The following 28-day compressive strengths, in newtons per square millimeter, were obtained from test results on concrete cubes in England:
50.5, 45.8, 49.6, 47.7, 54.0, 49.4, 54.1, 53.1, 56.5, 55.2, 52.7, 52.0, 54.2, 55.2, 53.4, 51.0, 53.1, 48.5, 51.0, 58.6, 52.5, 49.5, 51.1, 48.1, 50.2, 49.3, 47.3, 52.9, 52.8, 49.5, 48.8, 53.8, 47.3, 47.7, 52.2, 45.7, 53.4, 48.5, 49.1, 43.3

The data were supplied by Dr J. E. Ash, University of Birmingham, England. Compare these results with the compressive strengths in Table E.1.2 by drawing back-to-back stem-and-leaf plots. For this purpose, plot the foregoing results on the left of the stem with reference to Fig. 1.3.1 and omit the cumulative frequencies. Comment on the differences in the distributions.

1.15. Water quality. Water quality measurements are taken daily on the River Ouse at Clapham, England. The concentrations of chlorides and phosphates in solution, given below in milligrams per liter, are determined over a 30-day period.

Chloride: 64.0, 66.0, 64.0, 62.0, 65.0, 64.0, 64.0, 65.0, 65.0, 67.0, 67.0, 74.0, 69.0, 68.0, 68.0, 69.0, 63.0, 68.0, 66.0, 66.0, 65.0, 64.0, 63.0, 66.0, 55.0, 69.0, 65.0, 61.0, 62.0, 62.0

Phosphate: 1.31, 1.39, 1.59, 1.68, 1.89, 1.98, 1.97, 1.99, 1.98, 2.15, 2.12, 1.90, 1.92, 2.00, 1.90, 1.74, 1.81, 1.86, 1.86, 1.65, 1.58, 1.74, 1.89, 1.94, 2.07, 1.58, 1.93, 1.72, 1.73, 1.82

Compare the coefficients of variation v. Draw a scatter diagram and compute the correlation coefficient r. Comment on the results. Do you see any role in this association for predictive purposes?

1.16. Timber strength. From the timber strength data of Table 1.1.3, compute the 3% trimmed mean by omitting 3% of the observations from the highest and the lowest extremities of the ranked data. Compute the standard deviation ŝ and the coefficients of skewness g₁ and kurtosis g₂. Compare with the results for the full sample (as given in Table 1.2.2).

1.17. Concrete beam.
Joist-hanger tests carried out at the University of Birmingham, England, on concrete beams gave observations of deflections in millimeters and failure load in kilograms. The following results pertain to 75 mm × 150 mm hangers on which timber joists rest:

Failure load: 1903, 1665, 1903, 1991, 2229, 1910, 2025, 1991, 1882, 2032, 1896, 1346

Deflection: 0.69, 0.67, 0.80, 0.50, 0.74, 0.78, 0.57, 0.91, 0.54, 0.50, 0.97, 0.62

Determine by drawing a scatter diagram and computing the correlation coefficient whether there is any association between the two variables. Discuss your results.

1.18. Hurricane frequency. Hurricane damage is of great concern to civil engineers. The frequencies of hurricanes affecting the east coast of the United States each year during a period of 69 years are given as follows by H. C. S. Thom (1966), Some Methods of Climatological Analysis, World Meteorological Organisation, Geneva:

Number of hurricanes    0   1   2   3   4   5   6   7   8   9
Frequency               1   6  10  16  19   5   7   3   1   1

Draw a line diagram and comment on its form. Discuss differences or similarities between this diagram and Fig. 1.1.1.

1.19. Air pollution. On 13 April 1994, the following concentrations of pollutants were recorded at eight stations of the monitoring system for pollution control located in the downtown area of Milan, Italy:

Station      NO2 (μg/m³)   CO (mg/m³)
Aquileia        130           2.9
Cenisio         120           4.1
Juvara          130           4.4
Liguria         115           3.6
Marche          135           3.3
Senato          142           5.7
Verziere         90           4.8
Zavattari       116           7.3

Compare the coefficients of variation v of the pollutants and determine their correlation r.

1.20. Storm rainfall.
The analysis of storm data is essential for predicting flood hazards in urban areas. Annual maximum rainfall depths (in millimeters) recorded at Genoa University in Italy, for durations varying from 5 minutes to 3 hours, are presented here for the years 1974–1987.

                          Duration (min)
Year      5     10     20     30     40     50     60    120    180
1974   12.1   19.5   28.8   30.5   32.4   35.5   38.7   48.0   51.6
1975   10.1   14.9   26.7   31.2   34.7   38.2   40.2   55.0   56.0
1976   17.9   20.0   31.1   37.2   41.1   51.0   55.7   67.1   80.6
1977   20.0   32.6   52.6   72.4   90.1  108.8  118.9  146.5  157.3
1978    5.1   13.6   16.0   21.3   24.1   24.6   25.0   40.7   49.9
1979   20.5   26.1   36.3   46.1   49.3   50.3   55.6   65.2   90.1
1980   10.0   15.7   20.9   25.0   30.5   38.0   40.1   58.0   63.8
1981   12.0   27.9   47.9   56.0   70.0   80.0   89.4  106.9  114.2
1982   10.0   14.4   20.0   23.3   25.1   26.4   27.2   34.3   41.2
1983   10.0   12.1   17.3   19.2   22.1   27.3   32.7   54.4   66.5
1984   20.1   32.8   60.0   65.7   76.1   92.8  105.7  122.3  122.3
1985    7.6    8.1   13.0   16.5   21.6   25.3   25.3   27.0   32.3
1986    8.7   11.7   20.0   22.9   26.1   26.3   27.6   41.1   56.7
1987   24.6   36.7   56.7   73.9   93.9  110.1  128.5  180.8  188.7

Compute the mean x̄, standard deviation ŝ, and coefficient of skewness g₁ for each duration. Are there some regularities in the growth of these statistics with increasing duration? Comment on the results and the physical relevance to storm characteristics.

1.21. Carbon dioxide. The records of atmospheric trace gases are used in the study of global climatic changes. Monthly carbon dioxide concentrations (in parts per million in volume) recorded at Mount Cimone, Italy, from 1980 to 1988 are given here.
                                                  Month
Year     Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
1980  339.83  342.27  342.51  338.27  335.52  330.14  328.81  331.17  335.03  339.05  340.43  340.87
1981  343.11  342.39  342.51  339.49  335.28  330.77  330.30  333.55  336.80  339.41  343.18  341.47
1982  344.38  345.68  345.70  340.80  336.66  334.65  332.40  335.15  339.26  341.19  345.18  341.70
1983  346.18  345.00  344.24  342.32  338.34  336.03  335.00  336.57  339.86  343.97  345.61  342.38
1984  349.44  351.33  350.50  346.43  344.35  346.29  335.19  337.59  342.26  344.88  346.91  346.32
1985  348.17  350.62  350.61  345.93  341.43  337.67  337.16  339.40  344.07  349.49  347.40  349.92
1986  351.41  352.29  350.75  348.37  342.96  337.22  338.53  340.90  346.28  348.95  350.52  349.41
1987  353.75  354.79  352.61  350.39  347.38  341.64  341.64  342.19  345.60  350.39  352.36  351.94
1988  355.02  354.96  354.51  352.20  346.71  342.60  344.60  343.66  348.99  352.42  353.27  353.13

Compute the mean x̄ and standard deviation ŝ for each year (by rows) and for each month (by columns). Because the temporal evolution of the annual mean indicates that carbon dioxide increases (probably resulting in global warming), compute the annual rate of increase. Comment on the results.

1.22. Historical records of earthquake intensity. Catalogo dei terremoti italiani dall'anno 1000 al 1980 ("Catalog of Italian earthquakes from year 1000 to 1980") was edited by D. Postpischl in 1985, and is available through the National Research Council of Italy. This directory contains all of the available historical information on earthquakes that occurred in Italy during the past (nearly) 1000 years. It also includes values of earthquake intensity in terms of the Mercalli–Cancani–Sieberg (MCS) index. The following table gives the values of MCS intensity for the city of Rome:

MCS intensity
Century 2 XI XII XIII XIV XV XVI XVII XVIII XIX XX 110 3 Total 113 3 4 5 6 7 2 1 1 7 125 4 50 2 132 56 1 1 1 1 2 14 2 1 1 22 4 2 Total 2 1 1 0 3 0 1 15 301 5 329

Draw the line diagram for the whole data and for those recorded in each century. Compare the data recorded in the eighteenth century with those recorded in the other centuries.

1.23. Sea waves. Because of scarcity of records, the characteristics of sea waves are often derived from other climatological data. For this purpose, the SMB method (named after Sverdrup, Munk, and Bretschneider) is widely used in engineering practice [see U.S. Army Corps of Engineers (1977), Shore Protection Manual, Vol. 1, Coastal Engineering Research Center, Washington, DC]. Liberatore and Rosso used this model to simulate sea waves in the upper Adriatic Sea [Liberatore, G., and R. Rosso (1983). “Sulla valutazione stocastica dell’onda di progetto in base alla ricostruzione dello stato del mare: un esempio di applicazione per l’Adriatico centro-meridionale,” Giornale del Genio Civile, Vol. 1–3, pp. 3–25]. They investigated two different strategies for model calibration, called “no. 1” and “no. 2” in the table presented here. The table also includes the observed and the simulated values of the height of the highest sea wave and of its period for measurements taken from August 1977 to September 1978.

                                            Simulated values
   Measured values        Calibration strategy no. 1   Calibration strategy no. 2
Height (m)  Period (s)    Height (m)  Period (s)       Height (m)  Period (s)
   2.26        6.1           1.81        5.4              1.54        5.8
   3.10        4.3           2.93        6.8              2.54        6.4
   3.22        5.7           3.24        7.2              2.80        6.7
   3.84        7.7           3.18        7.1              2.69        6.6
   2.56        5.3           2.74        6.6              2.32        6.1
   2.74        5.7           3.49        7.4              3.00        6.9
   2.28        4.9           2.12        5.8              1.80        5.4
   3.88        6.7           5.10        9.0              4.43        8.4
   2.49        5.0           2.14        5.8              1.81        5.4
   4.22        6.9           4.45        8.8              3.77        7.7
   2.01        5.0           2.57        6.4              2.19        5.9
   2.77        5.9           2.68        6.5              2.27        6.0
   3.61        6.5           3.86        7.8              3.36        7.3
   3.51        7.4           4.02        8.0              3.51        7.5
   2.52        5.0           3.39        7.3              2.95        6.9
   2.12        5.1           2.61        6.5              2.21        6.0
   2.73        6.5           2.22        6.0              1.88        5.5
   3.30        5.4           4.05        8.0              3.49        7.5

Draw a scatter diagram to compare the observed and simulated values of wave heights and periods. Compute the correlation coefficients r. Compute the deviations of the simulated data from the observed data, and find the mean x̄1, standard deviation ŝ1, and coefficient of variation v of these deviations. Do these results indicate which of the two investigated strategies provides the better representation of sea waves from climatological data?

1.24. Surveying. A triangulated network is used to determine the position of three points in space, denoted by u1 ≡ (x1, y1), u2 ≡ (x2, y2), and u3 ≡ (x3, y3), by measuring their mutual distances and their distances from two reference points, uA ≡ (xA, yA) and uB ≡ (xB, yB), as shown in Fig. 1.P1.

Fig. 1.P1 Survey configuration. (Plot of points A, B, 1, 2, and 3; horizontal axis x, m; vertical axis y, m.)

The Cartesian coordinates of the reference points are xA = yA = 0, xB = 92, and yB = 40 m. The table of the measured distances is given next.
        uA    uB    u1    u2    u3
uA       0   100    50    71    92
uB     100     0    86    70    40
u1      50    86     0    26    99
u2      71    70    26     0    93
u3      92    40    99    93     0

Using appropriate trigonometric methods, find the average location and coefficients of variation of the coordinates of point u1 ≡ (x1, y1).
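As a rough illustration of the trigonometric step in Problem 1.24, the following minimal sketch obtains one estimate of u1 by intersecting two circles: one centered at the reference point A with radius equal to the measured distance from A to u1 (50), and one centered at B with radius equal to the measured distance from B to u1 (86). The function name intersect_circles and the variable names are hypothetical and do not come from the text. Two circles generally meet in two points; taking the one to the left of the direction from A to B is an assumption, based on the position of point 1 suggested by Fig. 1.P1.

```python
import math


def intersect_circles(c0, r0, c1, r1):
    """Return the two intersection points of two circles as (left, right).

    'left' and 'right' are relative to the direction from c0 to c1.
    Raises ValueError if the circles do not intersect.
    (Hypothetical helper written for illustration only.)
    """
    (x0, y0), (x1, y1) = c0, c1
    d = math.hypot(x1 - x0, y1 - y0)          # distance between the centers
    if d == 0 or d > r0 + r1 or d < abs(r0 - r1):
        raise ValueError("circles do not intersect")
    # Distance from c0 to the foot of the common chord, along the center line
    a = (r0**2 - r1**2 + d**2) / (2 * d)
    # Half-length of the common chord
    h = math.sqrt(max(r0**2 - a**2, 0.0))
    ex, ey = (x1 - x0) / d, (y1 - y0) / d     # unit vector from c0 to c1
    px, py = x0 + a * ex, y0 + a * ey         # foot of the chord
    left = (px - h * ey, py + h * ex)
    right = (px + h * ey, py - h * ex)
    return left, right


# Reference points and measured distances taken from the problem statement
A = (0.0, 0.0)
B = (92.0, 40.0)
d_A1, d_B1 = 50.0, 86.0

left, right = intersect_circles(A, d_A1, B, d_B1)
u1_estimate = left  # assumption: point 1 lies to the left of the line from A to B
print("estimate of u1:", u1_estimate)
```

Further estimates of u1 might be obtained in the same way from other pairs of points once u2 and u3 have been located, with the average and the coefficients of variation then computed over the set of estimates. This is one possible reading of the problem, not a prescribed method.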