Uploaded by Shriya Upadhya

Applied Statistics for Civil and Environmental Eng... ---- (1 Preliminary Data Analysis)

P1: SFK/RPW
P2: SFK/RPW
BLUK154-Kottegoda
QC: SFK/RPW
April 15, 2008
T1: SFK
7:11
Copyright © 2008. Wiley. All rights reserved.
Chapter 1
Preliminary Data Analysis
All natural processes, as well as those devised by humans, are subject to variability.
Civil engineers are aware, for example, that crushing strengths of concrete, soil pressures,
strengths of welds, traffic flow, floods, and pollution loads in streams have wide variations.
These may arise on account of natural changes in properties, differences in interactions
between the ingredients of a material, environmental factors, or other causes. To cope
with uncertainty, the engineer must first obtain and investigate a sample of data, such as
a set of flow data or triaxial test results. The sample is used in applying statistics and
probability at the descriptive stage. For inferential purposes, however, one needs to make
decisions regarding the population from which the sample is drawn. By this we mean the
total or aggregate, which, for most physical processes, is the virtually unlimited universe
of all possible measurements. The main interest of the statistician is in the aggregation;
the individual items provide the hints, clues, and evidence.
A data set comprises a number of measurements of a phenomenon such as the failure
load of a structural component. The quantities measured are termed variables, each of
which may take any one of a specified set of values. Because of its inherent randomness
and hence unpredictability, a phenomenon that an engineer or scientist usually encounters
is referred to as a random variable, a name given to any quantity whose value depends
on chance.1 Random variables are usually denoted by capital letters. These are classified
by the form that their values can possibly take (or are assumed to take). The pattern of
variability is called a distribution. A continuous variable can have any value on a continuous scale between two limits, such as the volume of water flowing in a river per second
or the amount of daily rainfall measured in some city. A discrete variable, on the contrary,
can only assume countable isolated numbers like integers, such as the number of vehicles
turning left at an intersection, or other distinct values.
Having obtained a sample of data, the first step is its presentation. Consider, for example, the modulus of rupture data for a certain type of timber shown in Table E.1.1, in
Appendix E. The initial problem facing the civil engineer is that such an array of data by
itself does not give a clear idea of the underlying characteristics of the stress values in
this natural type of construction material. To extract the salient features and the particular
types of information one needs, one must summarize the data and present them in some
readily comprehensible forms. There are several methods of presentation, organization,
and reduction of data. Graphical methods constitute the first approach.
1.1 GRAPHICAL REPRESENTATION
If “a picture is worth a thousand words,” then graphical techniques provide an excellent
method to visualize the variability and other properties of a set of data. To the powerful
interactive system of one’s brain and eyes, graphical displays provide insight into the form
1 The term will be formally defined in Section 3.1.
Kottegoda, NT, & Rosso, R 2008, Applied Statistics for Civil and Environmental Engineers, Wiley, Hoboken. Available from: ProQuest Ebook
Central. [9 September 2022].
Created from upcatalunya-ebooks on 2022-09-09 20:47:15.
and shape of the data and lead to a preliminary concept of the generating process. We
proceed by assembling the data into graphs, scanning the details, and noting the important
characteristics. There are numerous types of graphs. Line and dot diagrams, histograms,
relative frequency polygons, and cumulative frequency curves are given in this section.
Subsequently, exploratory methods, such as stem-and-leaf plots and box diagrams and
graphs depicting a possible association between two variables, are presented in Sections
1.3 and 1.4. We begin with the simple task of counting.
1.1.1 Line diagram or bar chart
The occurrences of a discrete variable can be classified on a line diagram or bar chart.
In this type of graph, the horizontal axis gives the values of the discrete variable and the
occurrences are represented by the heights of vertical lines. The horizontal spread of these
lines and their relative heights indicate the variability and other characteristics of the data.
Example 1.1. Flood occurrences. Consider the annual number of floods of the Magra River
at Calamazza, situated between Pisa and Genoa in northwestern Italy, over a 34-year period,
as shown in Table 1.1.1.
A flood in the river at the point of measurement means the river has risen above a specified
level, beyond which the river poses a threat to lives and property. The data are plotted in
Fig. 1.1.1 as a line diagram.
The data suggest a symmetrical distribution with a midlocation of four floods per year.
In some other river basins, there is a nonlinear decrease in the occurrences for increasing
numbers of floods in a year commencing at zero, showing a negative exponential type of
variation.
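The counting that underlies a line diagram is easily reproduced by computer. The following Python sketch is illustrative only: the variable names are our own, and the counts are copied from Table 1.1.1. It checks the length of the record, finds the most frequent number of floods per year, and prints a crude text version of the line diagram.

```python
# Counts from Table 1.1.1: number of floods in a year -> number of years
# in which that many floods were observed (Magra River at Calamazza).
occurrences = {0: 0, 1: 2, 2: 6, 3: 7, 4: 9, 5: 4, 6: 1, 7: 4, 8: 1, 9: 0}

total_years = sum(occurrences.values())
most_common = max(occurrences, key=occurrences.get)

# Text line diagram: one asterisk per year of occurrence.
for floods, years in occurrences.items():
    print(f"{floods:2d} | {'*' * years}")

print("years of record:", total_years)
print("most frequent number of floods per year:", most_common)
```

The longest line of asterisks falls at four floods per year, which agrees with the midlocation noted in the example.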
1.1.2 Dot diagram
A different type of graph is required to present continuous data. If the data are few (say,
less than 25 items) a dot diagram is a useful visual aid. Consider the possibility that only
Table 1.1.1 Number of flood occurrences per year from 1939 to 1972 at the gauging station of Calamazza on the Magra River, between Pisa and Genoa in northwestern Italy [a]

Number of floods in a year    Number of occurrences
            0                           0
            1                           2
            2                           6
            3                           7
            4                           9
            5                           4
            6                           1
            7                           4
            8                           1
            9                           0
          Total                        34

[a] A flood occurrence is defined as river discharge exceeding 300 m³/s.
[Figure: line diagram. Horizontal axis: number of floods (0 to 9). Vertical axis: number of occurrences (0 to 9).]
Fig. 1.1.1 Line diagram for flood occurrences in the Magra River at Calamazza between Genoa and Pisa in northwestern Italy.
the first 15 items of data in Table E.1.1—which shows the modulus of rupture in N/mm2
for 50 mm × 150 mm Swedish redwood and whitewood—are available. The abridged
data are ranked in ascending order and are given in Table 1.1.2 and plotted in Fig. 1.1.2.
The reader can see that the midlocation is close to 40 N/mm2 but the wide spread makes
this location difficult to discern. A larger sample should certainly be helpful.
1.1.3 Histogram
If there are at least, say, 25 observations, one of the most common graphical forms is a
block diagram called the histogram. For this purpose, the data are divided into groups
according to their magnitudes. The horizontal axis of the graph gives the magnitudes.
Blocks are drawn to represent the groups, each of which has a distinct upper and lower
limit. The area of a block is proportional to the number of occurrences in the group.
The variability of the data is shown by the horizontal spread of the blocks, and the most
common values are found in blocks with the largest areas. Other features such as the
symmetry of the data or lack of it are also shown.
The first step is to take into account the range r of the observations, that is, the difference
between the largest and smallest values.
Example 1.2. Timber strength. We go back to the timber strength data given in Table E.1.1.
They are arranged in order of magnitude in Table 1.1.3.
There are n = 165 observations with somewhat high variability, as expected, because
timber is a naturally variable material. Here the range r = 70.22 – 0.00 = 70.22 N/mm2 .
To draw a histogram, one divides the range into a number of classes or cells n c . The
number of occurrences in each class is counted and tabulated. These are called frequencies.
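The counting of frequencies in equal-width classes can be sketched in Python as follows. This is a minimal illustration, not a prescribed procedure: the function name `class_frequencies` is our own, each class is assumed to include its lower boundary and exclude its upper one, and only the first 15 timber strengths (those of Table 1.1.2) are used, purely to keep the listing short.

```python
import math

def class_frequencies(data, width, lower=0.0):
    """Count observations in classes [lower + k*width, lower + (k+1)*width)."""
    counts = {}
    for x in data:
        k = math.floor((x - lower) / width)   # index of the class containing x
        counts[k] = counts.get(k, 0) + 1
    n_classes = max(counts) + 1
    return [counts.get(k, 0) for k in range(n_classes)]

# First 15 timber strengths (N/mm^2), as ranked in Table 1.1.2.
sample = [29.11, 29.93, 32.02, 32.40, 33.06, 34.12, 35.58, 39.34,
          40.53, 41.64, 45.54, 48.37, 48.78, 50.98, 65.35]

freq = class_frequencies(sample, width=5.0)
for k, f in enumerate(freq):
    low = 5.0 * k
    print(f"{low:5.2f} to {low + 4.99:5.2f}: {f}")
```

Applied to all 165 observations with the same class width, the same counting should give the absolute frequencies of Table 1.1.4.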
Table 1.1.2 The first 15 items of modulus of rupture data measuring timber strengths in N/mm², from Table E.1.1 (commencing with the top row), ranked in increasing order

29.11  29.93  32.02  32.40  33.06  34.12  35.58  39.34
40.53  41.64  45.54  48.37  48.78  50.98  65.35
[Figure: dot diagram. Horizontal axis: modulus of rupture, N/mm² (25 to 70).]
Fig. 1.1.2 Dot diagram for a short sample of timber strengths from Table 1.1.3.
The width of the classes is usually made equal to facilitate interpretation. For some work
such as the fitting of a theoretical function to observed frequencies, however, unequal class
widths are used. Care should be exercised in the choice of the number of classes, n c . Too
few will cause an omission of some important features of the data; too many will not give
a clear overall picture because there may be high fluctuations in the frequencies. A rule
of thumb is to make $n_c = \sqrt{n}$ or an integer close to this, but it should be at least 5 and not
greater than 25. Thus, histograms based on fewer than 25 items may not be meaningful.
Sturges (1926) suggested the approximation
$$ n_c = 1 + 3.3 \log_{10} n. \tag{1.1.1} $$
A more theoretically based alternative follows the work of Freedman and Diaconis (1981):2
$$ n_c = \frac{r\, n^{1/3}}{2\, \mathrm{iqr}}. \tag{1.1.2} $$
Here iqr is the interquartile range. To clarify this term, we must define Q 2 , or the
median. This denotes the middle term of a set of data when the values are arranged in
ascending order, or the average of the two middle terms if n is an even number. The first
or lower quartile, Q 1 , is the median of the lower half of the data, and likewise the third
Table 1.1.3 Ranked modulus of rupture data for timber strengths in N/mm², in ascending order (read across each row) [a]

 0.00  17.98  22.67  22.74  22.75  23.14  23.16  23.19  24.09  24.25
24.84  25.39  25.98  26.63  27.31  27.90  27.93  28.00  28.13  28.46
28.69  28.71  28.76  28.83  28.97  28.98  29.11  29.90  29.93  30.02
30.05  30.33  30.53  31.33  31.60  32.02  32.03  32.40  32.48  32.68
32.76  33.06  33.14  33.18  33.19  33.47  33.61  33.71  33.92  34.12
34.40  34.44  34.49  34.56  34.63  35.03  35.17  35.30  35.43  35.58
35.67  35.88  35.89  36.00  36.38  36.47  36.53  36.81  36.84  36.85
36.88  36.92  37.51  37.65  37.69  37.78  38.00  38.05  38.16  38.64
38.71  38.81  39.05* 39.15  39.20  39.21  39.33  39.34  39.60  39.62
39.77  39.93  39.97  40.20  40.27  40.39  40.53  40.71  40.85  40.85
41.64  41.72  41.75  41.78  41.85  42.31  42.47  43.07  43.12  43.26
43.33  43.33  43.41  43.48  43.48  43.64  43.99  44.00  44.07  44.30
44.36  44.36  44.51  44.54  44.59  44.78  44.78  45.19  45.54  45.92
45.97  46.01  46.33  46.50  46.86  46.99  47.25  47.42  47.61  47.74
47.83  48.37  48.39  48.78  49.57  49.59  49.65  50.91  50.98  51.39
51.90  53.00  53.63  53.99  54.04  54.71  55.23  56.60  56.80  57.99
58.34  65.35  65.61  69.07  70.22

[a] The original data set is given in Table E.1.1; n = 165. The median (39.05) is marked with an asterisk.
2 See also Scott (1979).
Table 1.1.4 Frequency computations for the modulus of rupture data ranked in Table 1.1.3 [a]

Class upper limit   Class center   Absolute    Relative    Cumulative relative
(N/mm²)             (N/mm²)        frequency   frequency   frequency (%)
        5                2.5            1        0.006            0.61
       10                7.5            0        0.000            0.61
       15               12.5            0        0.000            0.61
       20               17.5            1        0.006            1.21
       25               22.5            9        0.055            6.67
       30               27.5           18        0.109           17.58
       35               32.5           26        0.158           33.33
       40               37.5           38        0.230           56.36
       45               42.5           34        0.206           76.97
       50               47.5           20        0.121           89.09
       55               52.5            9        0.055           94.55
       60               57.5            5        0.030           97.58
       65               62.5            0        0.000           97.58
       70               67.5            3        0.018           99.39
       75               72.5            1        0.006          100.00

[a] The width of each class is 5 N/mm² in this example.
or upper quartile, $Q_3$, is the median of the upper half of the data. This definition will be
used throughout.3 Thus,
$$ \mathrm{iqr} = Q_3 - Q_1. \tag{1.1.3} $$
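The quartile definition just given can be sketched in Python as follows. The function names are our own, and one detail is an assumption: when n is odd, the middle observation is left out of both halves. This choice appears to be consistent with the quartiles quoted for the full timber data in Example 1.3. For brevity, the sketch is applied to the 15 ranked values of Table 1.1.2.

```python
def median(ordered):
    """Middle term of an ordered list, or the average of the two middle terms."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(data):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, upper half."""
    ordered = sorted(data)
    half = len(ordered) // 2      # assumption: middle item excluded when n is odd
    q1 = median(ordered[:half])
    q2 = median(ordered)
    q3 = median(ordered[len(ordered) - half:])
    return q1, q2, q3


# The 15 timber strengths (N/mm^2) of Table 1.1.2.
sample = [29.11, 29.93, 32.02, 32.40, 33.06, 34.12, 35.58, 39.34,
          40.53, 41.64, 45.54, 48.37, 48.78, 50.98, 65.35]

q1, q2, q3 = quartiles(sample)
iqr = q3 - q1
print("Q1, Q2, Q3:", q1, q2, q3)
print("iqr:", round(iqr, 2))
```

For this short sample the median is the eighth value, 39.34 N/mm², which matches the visual impression from the dot diagram that the midlocation is close to 40 N/mm².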
Example 1.3. Timber strength. For the timber strength data of Table E.1.1, the median,
that is, $Q_2$, is 39.05 N/mm². Also $Q_3$ and $Q_1$ are 44.57 and 32.91 N/mm², respectively, and
hence iqr = 11.66 N/mm². From the simple square-root rule, the number of classes is $n_c = 12.84$.
However, by using Eqs. (1.1.1) and (1.1.2), the numbers of classes are 8.32 and 16.52,
respectively. If these are rounded to 9 and 15 and the range is extended to 72 and 75 N/mm²
for graphical purposes, the equal class widths become 8 and 5 N/mm², respectively. Let us
use these widths. It is important to specify the class boundaries without ambiguity for the
counting of frequencies; for example, in the first case, these should be from 0 to 7.99, 8.00 to
15.99, and so on. As already mentioned, the vertical axis of a histogram is made to represent
the frequency and the horizontal axis is used as a measurement scale on which the class
boundaries are marked. For each of these class widths, 8 and 5 N/mm2 , class boundaries are
made and counting of frequencies is completed using Table 1.1.3; the lowest boundary is
at 0 and the highest boundaries are at 72 and 75 N/mm2 , respectively. Table 1.1.4 gives the
absolute and relative frequencies for class widths of 5 N/mm2 .
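The three rules for the number of classes used in this example can be evaluated with a few lines of Python. This is a sketch with function names of our own choosing; the inputs n = 165, r = 70.22 N/mm² and iqr = 11.66 N/mm² are taken from the text.

```python
import math

def classes_square_root(n):
    """Square-root rule of thumb."""
    return math.sqrt(n)

def classes_sturges(n):
    """Eq. (1.1.1)."""
    return 1 + 3.3 * math.log10(n)

def classes_freedman_diaconis(n, r, iqr):
    """Eq. (1.1.2): range times cube root of n, over twice the interquartile range."""
    return r * n ** (1 / 3) / (2 * iqr)

n, r, iqr = 165, 70.22, 11.66   # timber strength data, Example 1.3

print("square root      :", round(classes_square_root(n), 2))
print("Sturges          :", round(classes_sturges(n), 2))
print("Freedman-Diaconis:", round(classes_freedman_diaconis(n, r, iqr), 2))
```

The results are guides only; as in the example, they are rounded to convenient integers so that the class widths and boundaries are easy to work with.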
Rectangles are then erected over each of the classes, proportional in area to the class
frequencies. When equal class widths are used, as shown here, the heights of the rectangles
represent the frequencies. Thus, Figs. 1.1.3 and 1.1.4 are obtained.
The information conveyed by the two histograms seems to be similar. The diagrams are
almost symmetrical with a peak in the class below 40 N/mm2 and a steady decrease on either
side. This type of diagram usually brings out any possible imperfections in the data, such as
3 There are alternatives, such as rounding (n + 1)/4 and (n + 1) × (3/4) to the nearest integers to calculate the
locations of Q1 and Q3, respectively. The rounding is upward or downward, respectively, when the numbers fall
exactly between two integers.
[Figure: histogram. Horizontal axis: modulus of rupture (N/mm²), in classes from 0 to 7.99 up to 72 to 79.99. Vertical axis: relative frequency (0.0 to 0.4).]
Fig. 1.1.3 Histogram for timber strength data with class width of 8 N/mm².
the gaps at the ends. Further investigations are required to understand the true nature of the
population. More on these aspects will follow in this and subsequent chapters.
1.1.4 Frequency polygon
A frequency polygon is a useful diagnostic tool to determine the distribution of a variable.
It can be drawn by joining the midpoints of the tops of the rectangles of a histogram after
extending the diagram by one class on both sides. We assume that equal class widths are
used. If the ordinates of a histogram are divided by the total number of observations, then
a relative frequency histogram is obtained. Thus, the ordinates for each class denote the
probabilities bounded by 0 and 1, by which we simply mean the chances of occurrence.
The resulting diagram is called the relative frequency polygon.
Example 1.4. Timber strength. Corresponding to the histogram of Fig. 1.1.4, the values
of class center are computed and a relative frequency polygon is obtained; this is shown in
Fig. 1.1.5.
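The construction of a relative frequency polygon can be sketched in Python as follows. The absolute frequencies are those of Table 1.1.4; the variable names are our own, and the listing stops at the coordinates of the vertices rather than drawing the graph.

```python
# Absolute frequencies from Table 1.1.4 (class width 5 N/mm^2, first class 0 to 4.99).
abs_freq = [1, 0, 0, 1, 9, 18, 26, 38, 34, 20, 9, 5, 0, 3, 1]
width = 5.0
n = sum(abs_freq)

# Relative frequencies and class centers.
rel_freq = [f / n for f in abs_freq]
centers = [width * (k + 0.5) for k in range(len(abs_freq))]

# Extend the diagram by one empty class on each side so that the polygon
# starts and ends on the horizontal axis.
polygon_x = [centers[0] - width] + centers + [centers[-1] + width]
polygon_y = [0.0] + rel_freq + [0.0]

for x, y in zip(polygon_x, polygon_y):
    print(f"{x:5.1f}  {y:.3f}")
```

Joining the listed points in order by straight lines gives the polygon; the relative frequencies sum to 1, as they must if they are to be read as chances of occurrence.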
[Figure: histogram. Horizontal axis: modulus of rupture (N/mm²), in classes from 0 to 4.99 up to 70 to 74.99. Vertical axis: relative frequency (0.00 to 0.30).]
Fig. 1.1.4 Histogram for timber strength data with class width of 5 N/mm².
[Figure: relative frequency polygon. Horizontal axis: modulus of rupture (N/mm²), 0 to 80. Vertical axis: relative frequency (0.0 to 0.3).]
Fig. 1.1.5 Relative frequency polygon for timber strength data with class width of 5 N/mm².
As the number of observations becomes large, the class widths theoretically tend to decrease and, in the limiting case of an infinite sample, a relative frequency polygon becomes
a frequency curve. This is in fact a probability curve, which represents a mathematical
probability density function, abbreviated as pdf, of the population.4
1.1.5 Cumulative relative frequency diagram
If a cumulative sum is taken of the relative frequencies step by step from the smallest class
to the largest, then the line joining the ordinates (cumulative relative frequencies) at the
ends of the class boundaries forms a cumulative relative frequency or probability diagram.
On the vertical axis of the graph, this line gives the probabilities of nonexceedance of values
shown on the horizontal axis. In practice, this plot is made by utilizing and displaying every
item of data distinctly, without the necessity of proceeding via a histogram and the restrictive categories that it entails. For this purpose, one may simply determine (e.g., from the
ranked data of Table 1.1.3) the number of observations less than or equal to each value and
divide these numbers by the total number of observations. This procedure is adopted here.5
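The procedure just described, counting the observations less than or equal to each value and dividing by the total, can be sketched in Python as follows. The function names are our own, and the 15 values of Table 1.1.2 stand in for the full data set to keep the listing short.

```python
from bisect import bisect_right

def cumulative_relative_frequency(data):
    """Return (value, proportion of observations <= value) for each ranked value."""
    ordered = sorted(data)
    n = len(ordered)
    return [(x, bisect_right(ordered, x) / n) for x in ordered]

def nonexceedance(data, x):
    """Proportion of observations less than or equal to x."""
    ordered = sorted(data)
    return bisect_right(ordered, x) / len(ordered)

# The 15 timber strengths (N/mm^2) of Table 1.1.2.
sample = [29.11, 29.93, 32.02, 32.40, 33.06, 34.12, 35.58, 39.34,
          40.53, 41.64, 45.54, 48.37, 48.78, 50.98, 65.35]

points = cumulative_relative_frequency(sample)
for value, prob in points:
    print(f"{value:6.2f}  {prob:.3f}")
```

Here `bisect_right` returns the number of ranked values not exceeding its argument, so tied observations receive the same cumulative relative frequency.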
Thus, the probability diagram, as represented by the cumulative relative frequency
diagram, becomes an important practical tool. This diagram yields the median and other
quartiles directly. Also, one can find the 9 values that divide the total frequency into 10
equal parts called deciles and the so-called percentiles, where the pth percentile is the
value that is greater than p percent of the observations. In general, it is possible to obtain
the (n − 1) values that divide the total frequency into n equal parts called the quantiles.
Hence a cumulative frequency polygon is also called a quantile plot or Q-plot, although a Q-plot
shows the quantiles on the vertical axis, whereas a cumulative frequency diagram shows them on the horizontal axis.
Example 1.5. Timber strength. Figure 1.1.6 is the cumulative frequency diagram obtained
from the ranked timber strength data of Table 1.1.3 using each item of data as just described.
4 This function is discussed in Chapter 3. One of the first tasks in applying inferential statistics, as presented in
Chapters 4 and 5, will be to estimate the mathematical function from a finite sample and examine its closeness
to the histogram.
5 Further aspects of this subject, as related to probability plots, are described in Chapter 5.
[Figure: cumulative relative frequency diagram. Horizontal axis: modulus of rupture (N/mm²), 0 to 80. Vertical axis: cumulative relative frequency (0.0 to 1.0).]
Fig. 1.1.6 Cumulative relative frequency diagram for timber strength data.
The deciles and percentiles can be abstracted. By convention a vertical probability or
proportionality scale is used rather than one giving percentages (except in duration curves,
discussed shortly). The 90th percentile, for instance, is 51 N/mm2 approximately and the
value 40 N/mm2 has a probability of nonexceedance of approximately 0.56.
If the sample size increases indefinitely, the cumulative relative frequency diagram will
become a distribution curve in the limit. This represents the population by means of a
(mathematical) distribution function, usually called a cumulative distribution function, abbreviated to cdf, just as a relative frequency polygon leads to a probability density function.
As a graphical method of ascertaining the distribution of the population, the quantile
plot can be drawn using a modified nonlinear scale for the probabilities, which represents
one of several types of theoretical distributions.6 Also, as shown in Section 1.4, two
distributions can be compared using a Q-Q plot.
1.1.6 Duration curves
For the assessment of water resources and for associated design and planning purposes,
engineers find it useful to draw duration curves. When dealing with flows in rivers, this type
of graph is known as a flow duration curve. It is in effect a cumulative frequency diagram
with specific time scales. The vertical axis can represent, for example, the percentage of
the time a flow is exceeded; and in addition, the number of days per year or season during
which the flow is exceeded (or not) may be given. The volume of flow per day is given on
the horizontal axis. For some purposes, the vertical and horizontal axes are interchanged
as in a Q-plot. One example of a practical use is the scaled area enclosed by the curve,
a horizontal line representing 100% of the time, and a vertical line drawn at a minimum
value of flow, which is desirable to be maintained in the river. This area represents the
estimated supplementary volume of water that should be diverted to the river on an annual
basis to meet such an objective.
Example 1.6. Streamflow duration. Figure 1.1.7 gives the flow duration curve of the Dora
Riparia River in the Alpine region of northern Italy, calculated over a period of 47 years from
the records at Salbertrand gauging station. This figure is drawn using the same procedure
6 This method is demonstrated in Section 5.8.
[Figure: flow duration curve. Horizontal axis: daily streamflow (m³/s), 0 to 50. Left vertical axis: duration, days per year flow is exceeded (0 to 365). Right vertical axis: percentage duration (0 to 100).]
Fig. 1.1.7 Flow duration curve of Dora Riparia River at Salbertrand in the Alpine region of Italy.
adopted for a cumulative relative frequency diagram, such as Fig. 1.1.6. For instance, suppose
it is decided to divert a proportion of the discharges above 10 m3 /s and below 20 m3 /s from the
river. Then the area bounded by the curve and the vertical lines drawn at these discharges, using
the vertical scale on the left-hand side, will give the estimated maximum amount available
for diversion during the year in m3 after multiplication by the number of seconds in a day.
This area is hatched in Fig. 1.1.7. If such a decision were to be implemented on a long-term basis, it would be essential to use a long series of data and to estimate the distribution
function.
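The area calculation described in this example can be sketched in Python. The daily flows below are hypothetical (they are not the Dora Riparia record), and the function names are our own. The sketch rests on the observation that the area between the duration curve and the two vertical lines equals the sum, over all days, of that part of each daily flow lying between the two discharges.

```python
SECONDS_PER_DAY = 86_400

def days_exceeded(daily_flows, q):
    """Number of days on which the flow exceeds q (one ordinate of the curve)."""
    return sum(1 for flow in daily_flows if flow > q)

def divertible_volume(daily_flows, q_low, q_high):
    """Volume in m^3 of the flow lying between q_low and q_high (both in m^3/s)."""
    band = sum(min(max(flow - q_low, 0.0), q_high - q_low) for flow in daily_flows)
    return band * SECONDS_PER_DAY

# Hypothetical daily mean flows (m^3/s) for a ten-day record.
flows = [4.0, 6.5, 9.0, 12.0, 15.0, 18.0, 22.0, 30.0, 11.0, 8.0]

exceeded = days_exceeded(flows, 10.0)
print("days with flow above 10 m^3/s:", exceeded)
print("percentage of time exceeded  :", 100 * exceeded / len(flows))
print("volume between 10 and 20 m^3/s (m^3):", divertible_volume(flows, 10.0, 20.0))
```

With a full year of daily flows in place of the ten hypothetical values, the same calculation would give an annual estimate of the maximum volume available for diversion.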
1.1.7 Summary of Section 1.1
In this section we have introduced some of the basic graphical methods. Other procedures
such as stem-and-leaf plots and scatter diagrams are presented in Sections 1.3 and 1.4,
respectively. More advanced plots are introduced in Chapters 5 and 6. In the next section
we discuss associated numerical methods.
1.2 NUMERICAL SUMMARIES OF DATA
Useful graphical procedures for presenting data and extracting knowledge on variability and other properties were shown in Section 1.1. There is a complementary method
through which much of the information contained in a data set can be represented economically and conveyed or transmitted with greater precision. This method utilizes a set
of characteristic numbers to summarize the data and highlight their main features. These
numerical summaries represent several important properties of the histogram and the relative frequency polygon. The most important purpose of these descriptive measures is for
statistical inference, a role that graphs cannot fulfill. Basically, there are three distinctive
types: measures of central tendency, of dispersion, and of asymmetry, all of which can
be visualized through the histogram as discussed in Section 1.1. The additional measure
of “peakedness,” that is, the relative height of the peak, requires a large sample for its
estimation and is mainly relevant in the case of symmetric distributions.
1.2.1 Measures of central tendency
Generally data from many natural systems, as well as those devised by humans, tend to
cluster around some values of variables. A particular value, known as the central value,
can be taken as a representative of the sample. This feature is called central tendency
because the spread seems to take place about a center. The definition of the central value is
flexible, and its magnitude is obtained through one of the measures of its location. There
are three such well-known measures: the mean, the mode, and the median. The choice
depends on the use or application of the central value.
The sample arithmetic mean is estimated from a sample of observations, $x_1, x_2, \ldots, x_n$, as
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{1.2.1} $$
If one uses a single number to represent the data, the sample mean seems ideal for the
purpose. After counting, this calculation is the next basic step in statistics. For theoretical
purposes the mean is the most important numerical measure of location. As stated in
Section 1.1, if the sample size increases indefinitely a curve is obtained from a frequency
polygon; the mean is the centroid of the area between this curve and the horizontal axis
and it is thus the balance point of the frequency curve.
The population value of the mean is denoted by μ. We reiterate our definition of population with reference to a phenomenon such as that represented by the timber strength data
of Table E.1.1. A population is the aggregate of observations that might result by making
an experiment in a particular manner.
The sample mean has a disadvantage because it may sometimes be affected by unexpectedly high or low values, called outliers. Such values do not seem to conform to
the distribution of the rest of the data. There may be physical reasons for outliers. Their
presence may be attributed to conditions that have perhaps changed from what were assumed, or because the data are generated by more than one process. On the other hand,
they may arise on account of errors of faulty instrumentation, measurement, observation,
or recording. The engineer must examine any visible outliers and ascertain whether they
are erroneous or whether their inclusion is justifiable. The occurrence of any improbable
value requires careful scrutiny in practice, and this should be followed by rectification or
elimination if there are valid reasons for doing so.
Example 1.7. Timber strength. A case in point is the value of zero in the timber strength
data of Table E.1.1. This value is retained here for comparative purposes. The mean of the
165 items, which is 39.09 N/mm2 , becomes 39.33 N/mm2 without the value of zero.
Example 1.8. Concrete test. Table E.1.2 is a list of the densities and compressive strengths
at 28 days from the results of 40 concrete cube test records conducted in Barton-on-Trent,
England, during the period 8 July 1991 to 21 September 1992, and arranged in reverse
chronological order.
These have sample means of 2445 kg/m3 and 60.14 N/mm2 , respectively. The two numbers
are measures of location representing the density and compressive strength of concrete.
With many discordant values at the extremes, a trimmed mean, such as a 5% trimmed
mean, may be calculated. For this purpose, the data are ranked and the mean is obtained
after ignoring 5% of the observations from each of the two extremities (see Problem 1.16).
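A trimmed mean can be sketched in Python as follows, using the 40 compressive strengths of concrete ranked in Table 1.2.1. The function name is our own, and rounding the number of trimmed observations down to an integer is an assumption for cases in which 5% of n is not a whole number.

```python
def trimmed_mean(data, fraction=0.05):
    """Mean after discarding the given fraction of observations at each extremity."""
    ordered = sorted(data)
    k = int(fraction * len(ordered))      # assumption: round down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Compressive strengths of concrete (N/mm^2), Table 1.2.1.
strengths = [49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
             56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
             60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
             63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5]

print("ordinary mean  :", round(sum(strengths) / len(strengths), 2))
print("5% trimmed mean:", round(trimmed_mean(strengths), 2))
```

With n = 40, two observations are dropped at each end. The two means differ little here because these data have no discordant extremes.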
The technique of coding is sometimes used to facilitate calculations when the data
are given to several significant figures but the digits are constant except for the last few.
For example, the densities in Table E.1.2 are higher than 2400 kg/m³ and less than
2500 kg/m³, so that the number 2400 can be subtracted from the densities. The remainders
will retain the essential characteristics of the original set (apart from the enforced shift in
the mean), thus simplifying the arithmetic.
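As a short illustration of coding, the following Python sketch subtracts 2400 from the 40 densities of Table 1.2.1, averages the remainders, and restores the shift. The variable names are our own.

```python
# Densities of concrete (kg/m^3), Table 1.2.1.
densities = [2411, 2415, 2425, 2427, 2427, 2428, 2429, 2433, 2435, 2435,
             2436, 2436, 2436, 2436, 2437, 2437, 2441, 2441, 2444, 2445,
             2445, 2446, 2447, 2447, 2448, 2448, 2449, 2450, 2454, 2454,
             2455, 2456, 2456, 2457, 2458, 2469, 2471, 2472, 2473, 2488]

shift = 2400
coded = [d - shift for d in densities]        # two-digit remainders

coded_mean = sum(coded) / len(coded)
mean_density = shift + coded_mean             # undo the shift

print("mean of coded values:", coded_mean)
print("mean density (kg/m^3):", mean_density)
```

The result agrees, after rounding, with the sample mean of 2445 kg/m³ quoted in Example 1.8.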
In considering the entire data set, a weighted mean is obtained if the values of a
sample are multiplied by numbers called weights, the products are summed, and the total is
divided by the sum of the weights. It is used if some values should contribute more (or less)
to the average than others.
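A weighted mean can be sketched in Python as follows. The function name and the numbers are hypothetical, chosen only for illustration: three batch means of compressive strength are combined, with the numbers of cubes in each batch as weights.

```python
def weighted_mean(values, weights):
    """Sum of weight times value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical batch means (N/mm^2) and numbers of cubes tested in each batch.
batch_means = [58.0, 60.0, 63.0]
batch_sizes = [10, 20, 10]

print("weighted mean  :", weighted_mean(batch_means, batch_sizes))
print("unweighted mean:", round(sum(batch_means) / len(batch_means), 2))
```

If all the weights are equal, the weighted mean reduces to the arithmetic mean of Eq. (1.2.1).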
The median is the central value in an ordered set or the average of the two central values
if the number of values, n, is even, as specified in Section 1.1.
Example 1.9. Concrete test. The calculation of the median and other measures of location
will be greatly facilitated if the data are arranged in order of magnitude. For example, the
compressive strengths of concrete given in Table E.1.2 are rewritten in ascending order in
Table 1.2.1.
The median of these data is 60.1 N/mm2 , which is the average of 60.0 and 60.2 N/mm2 .
The median of the timber strength data of Table 1.1.3 is 39.05 N/mm2 , as noted in the
table. The median has an advantage over the mean. It is relatively unaffected by outliers
and is thus often referred to as a resistant measure. For instance, the exclusion of the
zero value in Table 1.1.3 results only in a minor change of the median from 39.05 to
39.10 N/mm2 .
One of the countless practical uses of the median is the application of a disinfectant
to many samples of bacteria. Here, one seeks an association between the proportion of
bacteria destroyed and the strength of the disinfectant. The concentration that kills 50% of
the bacteria is the median dose. This is termed LD50 (lethal dose for 50%) and provides
an excellent measure.
The mode is the value that occurs most frequently. Quite often the mode is not unique
because two or more sets of values have equal status. For this reason and for convenience,
the mode is often taken from the histogram or frequency polygon.
Example 1.10. Concrete test. For the ranked compressive strengths of concrete in
Table 1.2.1, the mode is 60.5 N/mm2 .
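The three measures of central tendency for the compressive strengths of Table 1.2.1 can be checked with Python's standard `statistics` module. This is a sketch for verification only; `multimode` is used in place of `mode` because, as noted above, the mode need not be unique.

```python
import statistics

# Compressive strengths of concrete (N/mm^2), Table 1.2.1.
strengths = [49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
             56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
             60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
             63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5]

mean = statistics.mean(strengths)
median = statistics.median(strengths)     # average of the 20th and 21st values
modes = statistics.multimode(strengths)   # every value with the highest count

print("mean  :", round(mean, 2))
print("median:", round(median, 2))
print("mode  :", modes)
```

The mean, median, and mode lie close together for these data, as one would expect for a roughly symmetrical sample.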
Example 1.11. Timber strength. From Fig. 1.1.4, for example, the mode of the timber
strength data is 37.5 N/mm2 , which corresponds to the midpoint of the class with the highest
frequency. However, there is ambiguity in the choice of the class widths as already noted.
On the other hand, in Table 1.1.3 there are nine values in the range 38.64–39.34 N/mm2 , and
thus 39 N/mm2 seems a more representative value, but this problem can only be resolved
theoretically.
As the sample size becomes indefinitely large, the modal value will correspond to the
peak of the relative frequency curve on a theoretical basis. The mode may often have
greater practical significance than the mean and the median. It becomes more useful as the
asymmetry of the distribution increases. For instance, if an engineer were to ask a person
who sits habitually on the banks of a river fishing to indicate the mean level of the river,
he or she is inclined to point out the modal level. It is the value most likely to occur and it
is not affected by exceptionally high or low values. Clearly, the deletion of the zero value
from Table 1.1.3 does not alter the mode, as we have also seen in the case of the median.

Table 1.2.1 Ordered data of density and compressive strength of concrete (a)

Order   Density (kg/m3)   Compressive strength (N/mm2)
  1          2411                   49.9
  2          2415                   50.7
  3          2425                   52.5
  4          2427                   53.2
  5          2427                   53.4
  6          2428                   54.4
  7          2429                   54.6
  8          2433                   55.8
  9          2435                   56.3
 10          2435                   56.7
 11          2436                   56.9
 12          2436                   57.8
 13          2436                   57.9
 14          2436                   58.8
 15          2437                   58.9
 16          2437                   59.0
 17          2441                   59.6
 18          2441                   59.8
 19          2444                   59.8
 20          2445                   60.0
 21          2445                   60.2
 22          2446                   60.5
 23          2447                   60.5
 24          2447                   60.5
 25          2448                   60.9
 26          2448                   60.9
 27          2449                   61.1
 28          2450                   61.5
 29          2454                   61.9
 30          2454                   63.3
 31          2455                   63.4
 32          2456                   64.9
 33          2456                   64.9
 34          2457                   65.7
 35          2458                   67.2
 36          2469                   67.3
 37          2471                   68.1
 38          2472                   68.3
 39          2473                   68.9
 40          2488                   69.5
(a) The original data sets are given in Table E.1.2.
These positive attributes of the mode and median notwithstanding, the mean is indispensable for many theoretical purposes. Also in the same class as the sample arithmetic
mean, there are two other measures of location that are used in special situations. These
are the harmonic and geometric means.
The harmonic mean is the reciprocal of the mean of the reciprocals. Thus the harmonic
mean for a sample of observations, x1 , x2 , . . . , xn , is defined as
\[
\bar{x}_h = \frac{1}{(1/n)\left[(1/x_1) + (1/x_2) + \cdots + (1/x_n)\right]}. \tag{1.2.2}
\]
It is applied in situations where the reciprocal of a variable is averaged.
Example 1.12. Stream flow velocity. A practical example of the harmonic mean is the
determination of the mean velocity of a stream based on measurements of travel times over a
given reach of the stream using a floating device. For instance, if three velocities are calculated
as 0.20, 0.24, and 0.16 m/s, then the sample harmonic mean is
\[
\bar{x}_h = \frac{1}{(1/3)\left[(1/0.20) + (1/0.24) + (1/0.16)\right]} = 0.19 \text{ m/s}.
\]
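The following sketch reproduces Example 1.12 and illustrates why the harmonic mean is the natural average here: a float covering a fixed reach has travel time proportional to 1/velocity, so averaging the times is the same as averaging the reciprocals. The reach length of 12 m is a hypothetical value chosen only for illustration; it cancels out of the result.

```python
from statistics import harmonic_mean, mean

velocities = [0.20, 0.24, 0.16]  # m/s, from Example 1.12

# Eq. (1.2.2) written out: reciprocal of the mean of the reciprocals.
h_explicit = 1.0 / mean([1.0 / v for v in velocities])

# The standard library offers the same calculation directly.
h_library = harmonic_mean(velocities)

# Travel-time view: reach_length is an assumed (hypothetical) value.
reach_length = 12.0  # m
travel_times = [reach_length / v for v in velocities]  # seconds per run
mean_velocity = reach_length / mean(travel_times)

print(f"harmonic mean (explicit) = {h_explicit:.4f} m/s")
print(f"harmonic mean (library)  = {h_library:.4f} m/s")
print(f"reach / mean travel time = {mean_velocity:.4f} m/s")
```

All three numbers agree, and each rounds to the 0.19 m/s quoted in the example.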
The geometric mean is used in averaging values that represent a rate of change. Here the
variable follows an exponential, that is, a logarithmic law. For a sample of observations,
x1 , x2 , . . . , xn , the geometric mean is the positive nth root of the product of the n values.
This is the same as the antilog of the mean of the logarithms:
\[
\bar{x}_g = (x_1 x_2 \cdots x_n)^{1/n} = \exp\left[\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right] = \left(\prod_{i=1}^{n} x_i\right)^{1/n}. \tag{1.2.3}
\]
Example 1.13. Population growth. Consider the case of populations of towns and cities that
increase geometrically, which means that a future increase is expected that is proportional to
the current population. Such information is invaluable for planning and designing urban water
supplies and sewerage systems. Suppose, for example, that according to a census conducted
in 1970 and again in 1990 the population of a city had increased from 230,000 to 310,000.
An engineer needs to verify, for purposes of design, the per capita consumption of water in
the intermediate period and hence tries to estimate the population in 1980. The central value
to use in this situation is the geometric mean of the two numbers which is
\[
\bar{x}_g = (230{,}000 \times 310{,}000)^{1/2} = 267{,}021.
\]
(Note that the sample arithmetic mean x̄ = 270,000.)
As we see in Example 1.13, the geometric mean is less than the arithmetic mean.7
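As a check on Example 1.13, the sketch below evaluates the geometric mean both as a root of the product and as the antilog of the mean logarithm, the two forms given in Eq. (1.2.3). Variable names are illustrative.

```python
import math

pop_1970 = 230_000
pop_1990 = 310_000

# Root-of-product form of Eq. (1.2.3) for n = 2.
g_root = math.sqrt(pop_1970 * pop_1990)

# Antilog of the mean of the logarithms; algebraically the same quantity.
g_logs = math.exp((math.log(pop_1970) + math.log(pop_1990)) / 2)

# Arithmetic mean, for comparison.
arithmetic = (pop_1970 + pop_1990) / 2

print(f"geometric mean (root) = {g_root:,.0f}")
print(f"geometric mean (logs) = {g_logs:,.0f}")
print(f"arithmetic mean       = {arithmetic:,.0f}")
```

The logarithmic form is usually preferable for long series, because the running product can overflow or underflow while the sum of logarithms stays well scaled.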
1.2.2 Measures of dispersion
Whereas a measure of central tendency is obtained by locating a central or representative
value, a measure of dispersion represents the degree of scatter shown by observations or
the inherent variability in a phenomenon under observation. Dispersion also indicates the
precision of the data. One method of quantification is through an order statistic, that is,
one of ranked data.8 The simplest in the category is the range, which is the difference
between the largest and smallest values, as defined in Section 1.1.
7 This theoretical property is demonstrated in Example 3.10.
8 We shall discuss order statistics formally in Chapter 7; see also Chapter 5.
Example 1.14. Timber strength. As noted before, the range of the timber strength data of
Table 1.1.3 is 70.22 – 0.00 = 70.22 N/mm2 .
Example 1.15. Concrete test. For the compressive strengths of concrete given in Table
E.1.2 and ranked in Table 1.2.1, the range is r = 69.5 − 49.9 = 19.6 N/mm2 ; the range of
the concrete densities is 2488 – 2411 = 77 kg/m3 . These numbers provide a measure of the
spread of the data in each case.
The range, however, is a nondecreasing function of the sample size and thus characterizes the population poorly. Moreover, the range is unduly affected by high and low
values that may be somewhat incompatible with the rest of the data even though they may
not always be classified as outliers. For this reason, the interquartile range, iqr, which is
relatively a resistant measure, is preferable. As defined in Section 1.1, in a ranked set of
data this is the difference between the median of the top half and the median of the bottom
half.
Example 1.16. Concrete test. For the compressive strengths of concrete, the iqr is 6.55
N/mm2 .
Example 1.17. Timber strength. The timber strength data in Table 1.1.3 have an iqr of
11.66 and 11.47 N/mm2 , respectively, with or without the zero value. A similar and more
general measure is given by the interval between two symmetrical percentiles. For example,
the 90−10 percentile range for the timber strength data is approximately 52 – 28 = 24 N/mm2
from Fig. 1.1.6.
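The range and interquartile range of Examples 1.15 and 1.16 can be reproduced as follows. This is a minimal sketch of the "median of each half" rule described above; for an odd sample size it leaves the middle value out of both halves, which is an assumption, since Section 1.1 is not reproduced here. The function name `iqr_from_halves` is hypothetical.

```python
from statistics import median

# Compressive strengths (N/mm2) from Table 1.2.1.
strengths = [
    49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
    56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
    60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
    63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5,
]


def iqr_from_halves(values):
    """Median of the top half minus median of the bottom half (needs n >= 2)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # bottom half of the ranked data
    upper = ordered[-half:]   # top half; the middle value is skipped when n is odd
    return median(upper) - median(lower)


sample_range = max(strengths) - min(strengths)
sample_iqr = iqr_from_halves(strengths)

print(f"range = {sample_range:.1f} N/mm2")
print(f"iqr   = {sample_iqr:.2f} N/mm2")
```

Other quartile conventions (for example, interpolated percentiles) can give slightly different values for the same data, so the convention should be stated whenever an iqr is reported.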
The aforementioned measures of dispersion can be easily obtained. However, their
shortcoming is that, apart from two values or numbers equivalent to them, the vast information usually found in a sample of data is ignored. This criticism is not applicable if one
determines the average deviation about some central value, thus including all the observations. For example, the mean absolute deviation, denoted by d, measures the average
absolute deviation from the sample mean. For a sample of observations, x1 , x2 , . . . , xn , it
is defined as
\[
d = \frac{|x_1 - \bar{x}| + |x_2 - \bar{x}| + \cdots + |x_n - \bar{x}|}{n} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|. \tag{1.2.4}
\]
Example 1.18. Annual rainfall. If the annual rainfalls in a city are 50, 56, 42, 53, and
49 cm over a 5-year period, the mean absolute deviation with respect to the sample mean of 50 cm
is given by
\[
d = \frac{1}{5}\left(|50 - 50| + |56 - 50| + |42 - 50| + |53 - 50| + |49 - 50|\right) = 3.6 \text{ cm}.
\]
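Eq. (1.2.4) translates directly into a few lines of code. The sketch below recomputes Example 1.18; the function name `mean_absolute_deviation` is illustrative.

```python
def mean_absolute_deviation(values):
    """Average absolute deviation about the sample mean, Eq. (1.2.4)."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


rainfall = [50, 56, 42, 53, 49]  # cm, from Example 1.18
d = mean_absolute_deviation(rainfall)
print(f"d = {d:.1f} cm")
```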
This measure of dispersion is easily understood and practically useful. However, it is valid
only if the large and small deviations are as significant as the average deviations. There are
strong theoretical reasons (as seen in Chapters 3, 4, and 5), on the other hand, for using the
sample standard deviation, denoted by s, which is the root mean square deviation about
the mean. Indeed, this is the principal measure of dispersion (although the interquartile
range is meaningful and expedient). For a sample of observations, x1 , x2 , . . . , xn it is
defined by
\[
s = \sqrt{\frac{1}{n}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}. \tag{1.2.5}
\]
By expanding and summarizing the terms on the extreme right-hand side,
\[
s = \sqrt{\frac{1}{n}\left[\sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2\right]} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}. \tag{1.2.6}
\]
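The step from Eq. (1.2.5) to Eq. (1.2.6) can be written out term by term. The only fact used is that the sum of the observations equals n times the sample mean, which is assumed here to be what Eq. (1.2.1) states.

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\,(n\bar{x}) + n\bar{x}^2
     \qquad \text{since } \sum_{i=1}^{n} x_i = n\bar{x} \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 .
\end{aligned}
```

Dividing by n and taking the square root gives the right-hand side of Eq. (1.2.6); dividing by (n − 1) instead gives the form with the reduced divisor.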
Engineers will recognize that this measure is analogous to the radius of gyration of a
structural cross section. In contrast to the mean absolute deviation, it is highly influenced
by the largest and smallest values. The standard deviation of the population is denoted by
σ. It is common practice to replace the divisor n of Eq. (1.2.5) by (n − 1) and denote the
left-hand side by ŝ. Consequently, the estimate of the standard deviation is, on average,
closer to the population value because it is said to have smaller bias. Therefore, Eq. (1.2.5)
will, on average, give an underestimate of σ except in the rare case in which μ is known.9
The required modification to Eq. (1.2.6) is as follows:
\[
\hat{s} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} x_i^2 - \frac{n}{n-1}\,\bar{x}^2}. \tag{1.2.7}
\]
This reduction in n can be justified by means of the concept of degrees of freedom. It is a
consequence of the fact that the sum of the n deviations (x1 − x̄), (x2 − x̄), . . . , (xn − x̄)
is zero, which follows from Eq. (1.2.1) for the mean. Hence, regardless of the arrangement
of the data, if any (n − 1) terms are specified the remaining term is fixed or known, because
\[
x_n - \bar{x} = -\sum_{i=1}^{n-1} (x_i - \bar{x}).
\]
It follows from this equation that one degree of freedom is lost in defining the sample
standard deviation. The concept of degrees of freedom was introduced by the English
statistician R. A. Fisher on the analogy of a dynamical system in which the term denotes
the number of independent coordinate values necessary to determine the system.
Example 1.19. Annual rainfall. From the annual rainfall data in Example 1.18 (50, 56, 42,
53, and 49 cm), one can estimate the standard deviation σ by using Eq. (1.2.5), as follows:
\[
s = \sqrt{\frac{1}{5}\left[(50 - 50)^2 + (56 - 50)^2 + (42 - 50)^2 + (53 - 50)^2 + (49 - 50)^2\right]}
  = \sqrt{\frac{1}{5}\left(0^2 + 6^2 + 8^2 + 3^2 + 1^2\right)} = \sqrt{\frac{110}{5}} = 4.69 \text{ cm}.
\]
An alternative estimate of σ (which is, on average, less biased) is obtained using Eq. (1.2.7)
as follows:
\[
\hat{s} = \sqrt{\frac{110}{4}} = 5.24 \text{ cm}.
\]
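The two estimates in Example 1.19 correspond to the two standard-deviation functions in Python's `statistics` module: `pstdev` uses the divisor n of Eq. (1.2.5), and `stdev` uses the divisor (n − 1) of Eq. (1.2.7). The sketch also evaluates the shortcut form of Eq. (1.2.7) and confirms that the deviations sum to zero, which is the basis of the degrees-of-freedom argument. Variable names are illustrative.

```python
import math
from statistics import pstdev, stdev

rainfall = [50, 56, 42, 53, 49]  # cm, from Example 1.18
n = len(rainfall)
mean = sum(rainfall) / n

s = pstdev(rainfall)     # divisor n, Eq. (1.2.5)
s_hat = stdev(rainfall)  # divisor n - 1, Eq. (1.2.7)

# Shortcut form of Eq. (1.2.7), using the sum of squares and the mean.
sum_sq = sum(x * x for x in rainfall)
s_hat_shortcut = math.sqrt(sum_sq / (n - 1) - n * mean**2 / (n - 1))

# The n deviations always sum to zero, so only n - 1 of them are free.
deviations = [x - mean for x in rainfall]

print(f"s     = {s:.2f} cm")
print(f"s_hat = {s_hat:.2f} cm")
print(f"sum of deviations = {sum(deviations)}")
```

The shortcut form subtracts two large, nearly equal numbers, so for data with a large mean and small spread the two-pass definition (deviations first, then squares) is numerically safer.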
9 Terms such as bias are discussed formally in Section 5.2. It is shown in Example 5.1 that ŝ² is unbiased;
however, ŝ is known to have bias, though less than s on average.
Example 1.20. Timber strength. By using Eq. (1.2.7), the sample standard deviation of
the timber strength data of Table E.1.1 is 9.92 N/mm2 (or 9.46 N/mm2 if the zero value is
excluded).
Example 1.21. Concrete test. By using Eq. (1.2.7), the sample standard deviations of the
density and compressive strength of concrete in Table E.1.2 are 15.99 kg/m3 and 5.02 N/mm2,
respectively.
Dividing the standard deviation by the mean gives the dimensionless measure of dispersion called the sample coefficient of variation, v:
\[
v = \frac{\hat{s}}{\bar{x}}. \tag{1.2.8}
\]
This is usually expressed as a percentage. The coefficient of variation is useful in comparing
different data sets with respect to central location and dispersion.
Example 1.22. Comparison of timber and concrete strength data. From the values of
mean and standard deviation in Examples 1.7 and 1.20, the sample coefficient of variation
of the timber strength data is 25.3% (or 24.0% without the value of zero). Similarly, from
Examples 1.8 and 1.21 the density and compressive strength of concrete data have sample
coefficients of variation of 0.65% and 8.35%, respectively. The higher variation in the timber strength data
is a reflection of the variability of the natural material, whereas the low variation in the density
of the concrete is evidence of a uniform quality in the constituents and a high standard of
workmanship, including care taken in mixing. The variation in the compressive strength
of concrete is higher than that of its density. This can be attributed to random factors that
influence strength, such as some subtle changes in the effectiveness of the concrete that do
not alter its density.
From the square of the sample standard deviation one obtains the sample variance, ŝ 2 ,
which is the mean of the squared deviations from the mean. The population variance is
denoted by σ 2 . The variance, like the mean, is important in theoretical distributions.
By squaring Eqs. (1.2.6) and (1.2.7), two estimators of the population variance are found.
Here estimator refers to a method of estimating a constant in a parent population. As in
all the foregoing equations, this term means the random variable of which the estimate is
a realization. An unbiased estimator is obtained from Eq. (1.2.7) because on average (that
is by repeated sampling) the estimator tends to the population variance σ 2 . In other words,
the expectation E, which is in effect the average from an infinite number of observations,
of the square of the right-hand side of Eq. (1.2.7) is equal to σ 2 .
There are also measures of dispersion pertaining to the mean of the deviations between
the observations. Gini’s mean difference, for example, is a long-standing method.10 This
is given by
\[
g = \frac{2}{n(n-1)} \sum_{i>j} \sum_{j=1}^{n} \left[x_{(i)} - x_{(j)}\right], \tag{1.2.9}
\]
in which the observations x1 , x2 , . . . , xn are arranged in ascending order.
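A minimal sketch of Eq. (1.2.9) follows, reading the double sum as the sum over all pairs with i > j of the ranked observations. The rainfall values of Example 1.18 are reused purely as illustrative input; the text does not report Gini's mean difference for them. The function name is hypothetical.

```python
from itertools import combinations


def gini_mean_difference(values):
    """Average difference over all pairs of ranked observations, Eq. (1.2.9)."""
    ordered = sorted(values)
    n = len(ordered)
    # combinations() on a sorted list yields each pair once, smaller value first,
    # so every difference is nonnegative.
    total = sum(later - earlier for earlier, later in combinations(ordered, 2))
    return 2.0 * total / (n * (n - 1))


rainfall = [50, 56, 42, 53, 49]  # cm
g = gini_mean_difference(rainfall)
print(f"g = {g:.1f} cm")
```

The pairwise loop costs on the order of n squared operations, which is fine for samples of the sizes discussed in this chapter.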
10 See, for example, Stuart and Ord (1994, p. 58) for more details of this method, which originated with the Italian
mathematician Gini. See also Problem 1.7.
1.2.3 Measure of asymmetry
Another important property of the histogram or frequency polygon is its shape with respect
to symmetry (on either side of the mode). The sample coefficient of skewness measures
the asymmetry of a set of data about its mean. For a sample of observations, x1 , x2 , . . . ,
xn , it is defined as
\[
g_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}. \tag{1.2.10}
\]
Division by the cube of the sample standard deviation gives a dimensionless measure.
A histogram is said to have positive skewness if it has a longer tail on the right, which
is toward increasing values, than on the left. In this case the number of values less than the
mean is greater than the number that exceeds the mean. Many natural phenomena tend to
have this property. For a positively skewed histogram,
mode < median < mean.
This inequality is reversed if skewness is negative. A symmetrical histogram suggests zero
skewness.
Example 1.23. Comparison of timber and concrete strength data. The coefficients of
skewness of the timber strength data of Table E.1.1 and the compressive strength data of
Table E.1.2 are 0.15 (or 0.53 after excluding the zero value) and 0.03, respectively. These
indicate a small skewness in the first case and a symmetrical distribution in the second case.
The example indicates that this measure of skewness is sensitive to the tails of the
distribution.
1.2.4 Measure of peakedness
The extent of the relative steepness of ascent in the vicinity and on either side of the
mode in a histogram or frequency polygon is said to be a measure of its peakedness or
tail weight. This is quantified by the dimensionless sample coefficient of kurtosis, which
is defined for a sample of observations, x1 , x2 , . . . , xn by
\[
g_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4}. \tag{1.2.11}
\]
Example 1.24. Comparison of timber and concrete strength data. The kurtosis of the
timber strength data of Table E.1.1 is 4.46 (or 3.57 without the zero value) and that of
the compressive strengths of Table E.1.2 is 2.33. One can easily see from Eq. (1.2.11) that
even a small variation in one of the items of data may influence the kurtosis significantly.
This observation warrants a large sample size, perhaps 200 or greater, for the estimation of
the kurtosis. Small sample sizes, particularly in the second case with n = 40, preclude the
attachment of any special significance to these estimates.
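Eqs. (1.2.10) and (1.2.11) can be evaluated together, as in the sketch below. It assumes that s in both equations is the divisor-n standard deviation of Eq. (1.2.5); the five rainfall values of Example 1.18 serve only as small illustrative input (far too few for a meaningful estimate, as the text cautions), and the function name is hypothetical.

```python
from statistics import fmean, pstdev


def skewness_and_kurtosis(values):
    """Return (g1, g2) following Eqs. (1.2.10) and (1.2.11)."""
    n = len(values)
    mean = fmean(values)
    s = pstdev(values)  # assumed: divisor n, as in Eq. (1.2.5)
    g1 = sum((x - mean) ** 3 for x in values) / (n * s**3)
    g2 = sum((x - mean) ** 4 for x in values) / (n * s**4)
    return g1, g2


rainfall = [50, 56, 42, 53, 49]  # cm
g1, g2 = skewness_and_kurtosis(rainfall)
print(f"g1 = {g1:.3f}, g2 = {g2:.3f}")

# A sample that is symmetrical about its mean has zero skewness.
g1_sym, _ = skewness_and_kurtosis([1, 2, 3, 4, 5])
print(f"g1 for a symmetrical sample = {g1_sym:.3f}")
```

Because the deviations are raised to the third and fourth powers, a single extreme value dominates both sums, which is the sensitivity noted in Examples 1.23 and 1.24.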
1.2.5 Summary of Section 1.2
Of the numerical summaries listed here, the mean, standard deviation, and coefficient of
skewness are the best representative measures of the histogram or frequency polygon, from
both visual and theoretical aspects. These provide economical measures for summarizing
the information in a data set. Sample estimates for the data we have been discussing here,
including the coefficients of variation and kurtosis, are given in Table 1.2.2.
Table 1.2.2 Sample estimates of numerical summaries of the timber strength data of Table 1.1.3
and the concrete strength and density data of Table 1.2.1

Data set                                  Sample   Mean (a)   Standard        Coefficient of   Coefficient   Coefficient
                                          size                deviation (a)   variation (%)    of skewness   of kurtosis
Estimated by equation                              1.2.1      1.2.7           1.2.8            1.2.10        1.2.11
Timber strength, full sample              165      39.09      9.92            25.3             0.15          4.46
Timber strength without the zero value    164      39.33      9.46            24.0             0.53          3.57
Compressive strength of concrete          40       60.14      5.02            8.35             0.03          2.33
Density of concrete                       40       2445       15.99           0.65             0.38          3.15
(a) Units for strength are N/mm2; units for density are kg/m3.
1.3 EXPLORATORY METHODS
Some graphical displays are used when one does not have any specific questions in mind
before examining a data set. These methods were appropriately called exploratory data
analysis by Tukey (1977). Among such procedures the box plot is advantageous, and the
stem-and-leaf plot is also a valuable tool.
1.3.1 Stem-and-leaf plot
The histogram is a highly effective graphical procedure for showing various characteristics
of data as seen in Section 1.1. However, for smaller samples, less than, say, 40 in size,
it may not give a clear indication of the variability and other properties of the data.
The stem-and-leaf plot, which resembles a histogram turned through a right angle, is a
useful procedure in such cases. Its advantage is that the data are grouped without loss
of information because the magnitudes of all the values are presented. Furthermore, its
intrinsic tabular form highlights extreme values and other characteristics that a histogram
may obscure. As in a histogram, the data are initially ranked in ascending order but
a different approach is adopted in finding the number of classes. The class widths are
almost invariably equal. For the increments or class intervals (and hence class widths) one
uses 0.5, 1, or 2 multiplied by a power of 10, which means that the intervals are in units
such as 0.1 or 200 or 10,000, which are more tractable than, say, 0.13 or 140 or 12,000.
The terminology is best explained through the following worked example.
Example 1.25. Concrete test. For the concrete strength data of Table E.1.2, the maximum
and minimum values are 69.5 and 49.9 N/mm2 , respectively. As a first choice, the data can
be divided into 21 classes in intervals of 1 N/mm2 with lower boundaries at 49, 50, 51
N/mm2 , and so on, up to 69 N/mm2 . For the ordered stem-and-leaf plot of Fig. 1.3.1, a
vertical line is drawn with the class boundaries marked in increasing order immediately to
its left.
The boundary values are called the leading digits and, together with the vertical line,
constitute the stem. The trailing digits on the right represent the items of data in increasing
order when read jointly with the leading digits using the indicated units. They are termed
leaves, and their counts are the class frequencies. Thus the digits 49 (stem) and 9 (leaf)
constitute 49.9. It is useful to provide an additional column at the extreme left, as shown
here, giving the cumulative frequencies, called depths, up to each class. This is completed
firstly by starting at the top and totaling downward to the line containing the median for which
the individual frequency is given in parentheses, and secondly by starting at the bottom and
totaling upward to the line containing the median.

  1   49 | 9
  2   50 | 7
  2   51 |
  3   52 | 5
  5   53 | 2 4
  7   54 | 4 6
  8   55 | 8
 11   56 | 3 7 9
 13   57 | 8 9
 15   58 | 8 9
 19   59 | 0 6 8 8
 (7)  60 | 0 2 5 5 5 9 9
 14   61 | 1 5 9
 11   62 |
 11   63 | 3 4
  9   64 | 9 9
  7   65 | 7
  6   66 |
  6   67 | 2 3
  4   68 | 1 3 9
  1   69 | 5
Fig. 1.3.1 Stem-and-leaf plot for compressive strengths of concrete in Table E.1.2; units for
stem: 1 N/mm2; units for leaves: 0.1 N/mm2.
The diagram gives all the information in the data, which is its main advantage. Furthermore, the range, median, symmetry, or gaps in the data, frequently occurring values, and
any possible outliers can be highlighted. In this example, a symmetrical distribution is
indicated. The plot may be redrawn with a smaller number of classes, perhaps for greater
clarity, using the guidelines for choosing the intervals stipulated previously. The units of
data in a plot can be rounded to any number of significant figures as necessary. Also, the
number of stems in a plot can be doubled by dividing each stem into two lines. When
1 multiplied by a power of 10 is used as an interval, for example, the first line, which
is denoted by an asterisk (∗ ), will thus have leaves 0 to 4, and the leaves of the second,
represented by a period (.), will be from 5 to 9. Likewise, one may divide a stem into five
lines. The stem-and-leaf plot is best suited for small to moderate sample sizes, say, less
than 200.
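The construction described in Example 1.25 can be sketched in a few lines. The function below, whose name and `leaf_unit` parameter are illustrative, builds the stems and leaves for the concrete strengths with a stem unit of 1 N/mm2 and a leaf unit of 0.1 N/mm2; the depth column is omitted to keep the sketch short.

```python
from collections import defaultdict

# Compressive strengths (N/mm2) from Table 1.2.1.
strengths = [
    49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
    56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
    60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
    63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5,
]


def stem_and_leaf(values, leaf_unit=0.1):
    """Return the plot as a list of text lines, one per stem (empty stems included)."""
    # Work in whole multiples of the leaf unit to avoid floating-point surprises.
    scaled = sorted(round(v / leaf_unit) for v in values)
    leaves = defaultdict(list)
    for s in scaled:
        leaves[s // 10].append(s % 10)  # leading digits -> stem, last digit -> leaf
    lines = []
    for stem in range(scaled[0] // 10, scaled[-1] // 10 + 1):
        row = " ".join(str(leaf) for leaf in leaves[stem])
        lines.append(f"{stem:>3} | {row}".rstrip())
    return lines


plot = stem_and_leaf(strengths)
print("\n".join(plot))
```

Printing every stem between the minimum and maximum, including empty ones such as 51 and 62, is what allows gaps in the data to show up in the plot.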
[Figure: box plots of timber strength (excluding the zero value) and compressive strength of concrete; vertical axis: strength (N/mm2), 10 to 80. Timber: minimum 17.98, quartiles 33.10, 39.10, and 44.57, maximum 70.22, other high values 65.35, 65.61, and 69.07, critical values for detecting outliers 15.89 and 61.78. Concrete: minimum 49.9, quartiles 56.8, 60.1, and 63.4, maximum 69.5, critical values for detecting outliers 46.97 and 73.23.]
Fig. 1.3.2 Box plots for timber strength and compressive strength of concrete data from Tables
1.1.3 and 1.2.1.
1.3.2 Box plot
Another plot that is highly useful in data presentation is the box plot, which displays the
three quartiles, Q 1 , Q 2 , Q 3 , on a rectangular box aligned either horizontally or vertically.
The box, together with the minimum and maximum values, which are shown at the ends of
lines extended on either side of the box from the midpoints of its extremities, constitutes
the box-and-whiskers plot, as it is sometimes called. The numerical signposts are arranged
as follows from top to bottom: minimum, Q 1 , Q 2 , Q 3 , and maximum. Together they
constitute a five-number summary. The minimum and maximum values may be replaced
by the 5th and 95th (or other extreme) percentiles or supplemented by these and additional
extreme values. These plots play an important role in comparing two or more samples.
The width of the box is made proportional to the sample size in such cases, if they are
different.
Example 1.26. Comparison of timber and concrete strength data. Let us use a box plot
to compare the strengths of two representative materials used by civil engineers. Figure 1.3.2
shows the timber strength data ranked in Table 1.1.3, with the zero value excluded, and
the compressive strength of concrete data that were ranked in Table 1.2.1. The box plot of
compressive strengths of concrete shown on the right strongly indicates symmetry in their
distribution. In the case of the timber strength data, the box is less symmetrical. Moreover,
there are clear signs of positive skewness, because the line connecting the highest
value to the box is longer than that connecting the lowest value to the box.
Empirical rules have been devised to detect outliers by means of box plots. As previously stated, this term signifies an excessive discordance with reference to an assumed
distribution to which the majority of observations belong. One such procedure identifies
outliers as those values located at distances greater than 1.5 iqr above the third quartile or
greater than 1.5 iqr below the first quartile.
Example 1.27. Comparison of timber and concrete strength data. The iqrs for the timber
strength and compressive strength of concrete data are 11.47 and 6.55 N/mm2 and thus the
two critical distances for detecting outliers are 17.21 and 9.83 N/mm2 , respectively. These
distances are set out on either side from the extremities of the boxes and are shown by thick
horizontal lines in Fig. 1.3.2. By this rule, the concrete data do not have any outliers, whereas
there are four outliers beyond the demarcating line for high outliers in the timber strength
data of Table 1.1.2. These are the values 65.35, 65.61, 69.07, and 70.22 N/mm2 . At the other
extremity, there is the zero value that was discarded before the diagram was drawn. When
such an observation is recorded one should verify whether it stems from a faulty calibration
or other source of error; it is clearly an outlier by the method described here.11
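The 1.5 iqr rule of Example 1.27 can be sketched as below for the timber data. The quartiles 33.10 and 44.57 N/mm2 and the candidate values are those shown in Fig. 1.3.2, with the discarded zero reading added back; the full timber data set is not reproduced here, so only these extreme values are tested. The function name and the parameter `k` are illustrative.

```python
def outlier_fences(q1, q3, k=1.5):
    """Critical values located k * iqr below the first and above the third quartile."""
    spread = k * (q3 - q1)
    return q1 - spread, q3 + spread


# Timber strength quartiles (N/mm2), zero value excluded, as shown in Fig. 1.3.2.
q1, q3 = 33.10, 44.57
low, high = outlier_fences(q1, q3)

# Extreme observations to test: the zero reading, the minimum, and the highest values.
candidates = [0.0, 17.98, 65.35, 65.61, 69.07, 70.22]
flagged = [x for x in candidates if x < low or x > high]

print(f"critical values: {low:.2f} and {high:.2f} N/mm2")
print(f"flagged as outliers: {flagged}")
```

The smallest retained value, 17.98 N/mm2, lies inside the lower critical value and is therefore not flagged, whereas the zero reading and the four highest values are.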
1.3.3 Summary of Section 1.3
In general, box plots are helpful in highlighting distributional features, including the range
and many of the properties of a histogram. They provide a valuable means of comparing
data measuring related or similar characteristics. The stem-and-leaf plot is also clearly useful in presenting a set of data as an alternative to the histogram. Both diagrams can be easily
drawn. These are two of the commonly used exploratory graphical methods. Other methods
presented in subsequent chapters include the hanging histogram of Subsection 5.8.5.1.
1.4 DATA OBSERVED IN PAIRS
In the preceding sections, the behavior of one variable was considered. Let us extend this
discussion to the case where simultaneous observations are made of two variables and
a study is made to find an association between the variables. In this section the simple
bivariate case of paired samples is examined, and the types of association between them
are briefly assessed.
1.4.1 Correlation and graphical plots
A specific type of association that is frequently examined is known as correlation (from
co-relation). In usual practice, graphical methods are initially applied; subsequently, numerical summaries provide a quantification and a means of assessment. For example, if
there are n pairs of observations, (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ), of two variables X and Y ,
a preliminary indication of the correlation is obtained through a scatter diagram. In this
plot the coordinates denote the observed pairs of values.
Example 1.28. Concrete test. The scatter diagram of Fig. 1.4.1 represents the concrete data
of Table E.1.2, with the density and compressive strength at 28 days given by the horizontal
and vertical axes, respectively.
At first sight, there is no well-defined relationship between the two sets of observations
although one would expect a density that is higher or lower than average to be associated with
a compressive strength of concrete that is correspondingly higher or lower than its average.
11 More precise methods of systematically detecting outliers (such as those investigated by Kottegoda, 1984)
are discussed in Chapter 5.
[Figure: scatter diagram; horizontal axis: density of concrete (kg/m3), 2400 to 2500; vertical axis: compressive strength (N/mm2), 40 to 80.]
Fig. 1.4.1 Scatter diagram of concrete test data from Table E.1.2.
1.4.2 Covariance and the correlation coefficient
The sample covariance, s X,Y , gives a numerical summary of the linear association between
two quantitative variables X and Y . It is the average of the product of their deviations about
the respective means. Thus,
\[
s_{X,Y} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}). \tag{1.4.1}
\]
The covariance will be greater when there is a greater direct association between X and Y
with respect to higher than average values and similarly for lower than average values. If
the sample covariance is divided by the sample standard deviations of the two variables,
s X and sY [as in Eq. (1.2.6)], one obtains a dimensionless measure of linear association
called the sample coefficient of correlation,
\[
r_{X,Y} = \frac{1}{n s_X s_Y}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}). \tag{1.4.2}
\]
Substituting for s_X and s_Y, we find
\[
r_{X,Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}. \tag{1.4.3}
\]
The correlation coefficient is constrained by –1 ≤ r X,Y ≤ 1. Because the association measured here is defined by Eqs. (1.4.2) and (1.4.3), this result is called the linear coefficient
of correlation or the product-moment correlation coefficient.12
The two limiting values in the preceding constraint are of theoretical interest and are
applicable if all the points of a scatter diagram lie on a straight line of the type
Yi = β0 + βi xi ,
12
Another measure, Spearman’s rank correlation coefficient, is discussed in Chapter 5.
Kottegoda, NT, & Rosso, R 2008, Applied Statistics for Civil and Environmental Engineers, Wiley, Hoboken. Available from: ProQuest Ebook
Central. [9 September 2022].
Created from upcatalunya-ebooks on 2022-09-09 20:47:15.
(1.4.4)
where β0 and β1 are constants. The constant β1 will be positive for all positive correlations
including the maximum value, $r_{X,Y} = 1$. In the opposite case, β1 will be negative, indicating
negative correlation. That is, a high value of one variable tends to be associated with a
low value of the other; the minimum value, $r_{X,Y} = -1$, is in this category.
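The calculation in Eqs. (1.4.1) to (1.4.3) can be sketched in a few lines of Python. This is only an illustration: the function names and the small data sets are assumptions made here, not values from Table E.1.2, and the divisor n follows Eq. (1.4.1). Points lying exactly on a line reproduce the two limiting values discussed above.

```python
import math

def sample_covariance(x, y):
    """Eq. (1.4.1): mean product of deviations about the means (divisor n)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

def correlation(x, y):
    """Eq. (1.4.2): covariance divided by both standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))  # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical data, chosen only to illustrate the limiting cases.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0 + 0.5 * xi for xi in x]      # exact line with positive slope
y_down = [2.0 - 0.5 * xi for xi in x]    # exact line with negative slope
y_scatter = [2.1, 1.9, 3.2, 2.8, 4.0]    # scattered about a rising trend

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
r_scatter = correlation(x, y_scatter)
print(round(r_up, 3), round(r_down, 3), round(r_scatter, 3))
```

For the scattered set the coefficient falls strictly between 0 and 1, as expected when the points only loosely follow a rising line.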
In some cases the scatter diagram may indicate that there is an exponential or other
nonlinear type of relationship between the two variables. In such cases, special procedures
are necessary. For example, one may apply a logarithmic, square root, negative reciprocal,
or other appropriate transformation to one or both variables prior to analysis (as discussed
in Chapter 6).
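The effect of such a transformation can be illustrated with a short Python sketch. The exponential relationship and its constants below are hypothetical, chosen only to show that the linear correlation coefficient understates a perfect but nonlinear association until the response is transformed logarithmically.

```python
import math

def correlation(x, y):
    """Product-moment correlation coefficient, Eq. (1.4.3)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical exact exponential relationship y = 2 exp(0.8 x).
x = [0.5 * k for k in range(1, 11)]
y = [2.0 * math.exp(0.8 * xi) for xi in x]

r_raw = correlation(x, y)                          # below 1: relation is not linear
r_log = correlation(x, [math.log(v) for v in y])   # log(y) is linear in x
print(round(r_raw, 3), round(r_log, 3))
```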
Example 1.29. Concrete test. The scatter diagram of Fig. 1.4.1 does not show a strong
relationship between the density and the compressive strength. This fact is confirmed by the
correlation coefficient of +0.44 obtained from Eq. (1.4.3). It is possible that the inclusion of
additional variables, such as the results of slump tests, will lead to an improved relationship
for predictive purposes in a multiple regression analysis.
Note that a zero correlation does not show that the variables are independent.
For variables that have no dependence, however, the correlation will not be of any
significance.13
Note that one is only seeking an association between two variables through the correlation coefficient, not a cause and effect relationship. In some cases there are clear reasons
for dependency, as in the case of a force exerted on a steel wire and the consequent increase in its length, or as in rainfall resulting in runoff. Often, however, one cannot reach
such a conclusion when there is strong positive or negative correlation. One may find, for
instance, that two variables are correlated because they are both associated with a third
variable and not because there is a physical relationship between the first two.14
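A small simulation can make the role of a third variable concrete. The sketch below is an assumption-laden illustration, not an analysis of any data in this chapter: two series u and v are generated with no direct link, each being a common factor plus independent noise, and the seed, sample size, and noise level are arbitrary choices.

```python
import math
import random

def correlation(x, y):
    """Product-moment correlation coefficient, Eq. (1.4.3)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

random.seed(1)                      # fixed seed so the run is repeatable
n = 200
common = [random.gauss(0.0, 1.0) for _ in range(n)]   # the shared third variable
u = [z + random.gauss(0.0, 0.3) for z in common]      # depends only on common
v = [z + random.gauss(0.0, 0.3) for z in common]      # depends only on common

r_uv = correlation(u, v)            # high, although u does not act on v

# Removing the common factor leaves only the independent noise terms.
u_res = [ui - z for ui, z in zip(u, common)]
v_res = [vi - z for vi, z in zip(v, common)]
r_res = correlation(u_res, v_res)   # close to zero
print(round(r_uv, 2), round(r_res, 2))
```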
Equations of regression such as Eq. (1.4.4) are generally used to predict Y for a given
value of X without invoking a causal relationship. Accordingly, the given value x is called
the explanatory (nonrandom) variable and Y is the response (random) variable.
Example 1.30. Water quality. Another example of positive or negative correlation is the
association between variables measuring water quality. A case study is taken from the Blackwater River in central England, which is constantly monitored for the control of pollution.
The variables that are measured, among others, are the amounts of dissolved oxygen, DO,
and the biochemical oxygen demand, BOD, in the water. Dissolved oxygen is required for the
respiration of aerobic life forms such as fish. The BOD denotes the amount of oxygen used in
meeting the metabolic needs of aerobic microorganisms in water, whether naturally occurring
or resulting from sewage outflows and other discharges; thus, high values of BOD generally
indicate high levels of pollution. Usually determined in a laboratory after a 5-day incubation
of samples taken from the water, BOD is the most widely used indicator of pollution despite
some shortcomings. Sampling at 38 stations along the river gives the data presented in Table
E.1.3.
13 The significance of small values of correlation and whether they probably indicate zero correlation are
discussed in Chapter 6, in addition to other aspects of regression including the particular notation of Eq. (1.4.4).
The concept of independence is discussed in Chapter 2.
14 An absurdity cited in early literature is the apparent relationship between horse kicks suffered by cavalrymen
and wheat production in Europe. Also, Yule (1926) correlates concurrent time series of the proportion of Church
of England marriages and the standardized mortality rates per 1000 persons with a “nonsense” correlation
coefficient of 0.95; he explains that both variables are highly influenced by a common factor; we now call this
behavior spurious correlation.
[Figure 1.4.2: scatter plot; horizontal axis: DO (mg/L), 5 to 10; vertical axis: BOD (mg/L), 1 to 5.]
Fig. 1.4.2 Scatter diagram of water quality data from Table E.1.3.
The scatter diagram of the two indicators of water quality data is shown in Fig. 1.4.2.
As expected, it strongly indicates a negative type of correlation with high values of DO
associated with low values of BOD and vice versa. The coefficient of correlation from
Eq. (1.4.3) is −0.90. It suggests that the value of BOD can be estimated from a measurement of the DO. The scatter in the diagram may be partly attributed to some inadequacies
of the BOD test and partly to factors such as temperature and rate of flow, which affect
the DO.
The presence of outliers tends to have a significant effect on the coefficient of correlation.
Consider, for example, the lowest BOD in Fig. 1.4.2, which corresponds to the first pair
of values in Table E.1.3. This may not warrant consideration as an outlier. It can, however,
be due to an incorrect observation or an error in recording. With reference to Example
1.30, it is interesting to note that if one changes the first BOD value of Table E.1.3 from
2.27 to 2.77, the correlation coefficient changes from –0.90 to –0.92; that is, the negative association appears slightly stronger.
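The sensitivity of the coefficient to a single value is easy to check numerically. The Python sketch below uses six hypothetical (DO, BOD) pairs, not the 38 observations of Table E.1.3, and alters one BOD value as a recording error might.

```python
import math

def correlation(x, y):
    """Product-moment correlation coefficient, Eq. (1.4.3)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical water quality pairs (mg/L), for illustration only.
do = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
bod = [4.6, 4.1, 3.0, 2.7, 1.8, 1.2]

r_base = correlation(do, bod)

# Suppose the first BOD value had been recorded as 2.6 instead of 4.6.
bod_err = [2.6] + bod[1:]
r_err = correlation(do, bod_err)

print(round(r_base, 3), round(r_err, 3))
```

With so few points a single altered value moves the coefficient far more than in the 38-station example; the size of the shift depends on the sample size and on where the altered point lies.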
1.4.3 Q-Q plots
Quantiles representing two attributes or phenomena that are considered to be associated
may be compared using a Q-Q plot. Here one plots the quantiles of one data set against
the corresponding quantiles of another set as a means of comparing their probability
distributions. One proceeds initially with the ranking and calculation of cumulative relative
frequencies for a quantile plot for each set of data (as a prerequisite to drawing Fig. 1.1.6,
for example). The two quantile plots are then associated graphically by plotting values of
data with equal cumulative relative frequencies. In this type of diagram the limiting case,
in which the distributions differ only with respect to location and scale, is represented by
a straight line. The manner in which the plot departs from linearity indicates other types
of difference between the two distributions.
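The pairing of quantiles described above can be sketched in Python as follows. The function names are hypothetical, and the cumulative relative frequency (i − 0.5)/n assigned to the ith ranked value is an assumption made for this sketch; the plotting position used for Fig. 1.1.6 may differ. When the samples differ in size, the quantiles of the smaller sample are matched with linearly interpolated quantiles of the larger one.

```python
def quantile(sorted_data, p):
    """Interpolated quantile at cumulative relative frequency p,
    assuming the i-th ranked value (i = 1..n) sits at (i - 0.5)/n."""
    n = len(sorted_data)
    h = p * n - 0.5                  # fractional zero-based rank
    if h <= 0:
        return sorted_data[0]
    if h >= n - 1:
        return sorted_data[-1]
    lo = int(h)
    frac = h - lo
    return sorted_data[lo] + frac * (sorted_data[lo + 1] - sorted_data[lo])

def qq_pairs(a, b):
    """Return (quantile of a, quantile of b) pairs at equal cumulative frequencies."""
    small, large = sorted(a), sorted(b)
    swapped = len(small) > len(large)
    if swapped:                      # always step through the smaller sample
        small, large = large, small
    n = len(small)
    pairs = []
    for i, q_small in enumerate(small, start=1):
        p = (i - 0.5) / n
        q_large = quantile(large, p)
        pairs.append((q_large, q_small) if swapped else (q_small, q_large))
    return pairs

# Hypothetical samples: b differs from a only in location and scale,
# so the Q-Q points fall on the straight line q_b = 2 q_a + 3.
a = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6]
b = [2.0 * v + 3.0 for v in a]
same_size = qq_pairs(a, b)

# Unequal sizes: three quantiles of the smaller sample, interpolated in the larger.
unequal = qq_pairs([1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(same_size)
print(unequal)
```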
When one quantile function represents a theoretical distribution, the Q-Q plot becomes
a probability plot. This is a very useful diagram adopted in practice initially by a civil
engineer, R. W. Powell in 1943. The probability plot may be considered to be an extension
of the box plot, because all the quantiles are used in this method of comparing empirical
and theoretical distributions.15
15 Details of this method are given in Section 5.8.
[Figure 1.4.3: Q-Q plot; horizontal axis: density of concrete (kg/m3), 2400 to 2500; vertical axis: compressive strength (N/mm2), 40 to 80.]
Fig. 1.4.3 Q-Q plot of concrete test data from Table E.1.2.
Example 1.31. Concrete test. The distributions of the concrete strengths and densities listed
in Table E.1.2 are to be compared using a Q-Q plot. For this purpose the ranked data of Table
1.2.1 are used to obtain the cumulative relative frequencies for each item of data in the
sample of concrete strengths and the sample of concrete densities. Then a Q-Q plot is drawn
by associating data of equal cumulative frequencies. When sample sizes are the same, such
as in the case of the data used here, one can proceed directly to the Q-Q plot; in other cases
one calculates the quantiles of the smaller sample and then interpolates, correspondingly, the
quantiles for the larger sample.
There are apparent similarities in the distributions of strengths and densities, as shown in
Fig. 1.4.3. Although the distributions are not close, they do not seem to be divergent.
1.4.4 Summary of Section 1.4
A brief preliminary introduction is provided here on methods of investigating data observed
in pairs. This is a prelude to the formal presentations in Chapters 3 and 5 and particularly
in Chapter 6 on regression and multivariate analysis.
1.5 SUMMARY FOR CHAPTER 1
In this chapter numerous graphical methods for presenting data sets are introduced. These
include line diagrams, histograms, relative frequency polygons, cumulative relative frequency diagrams, and scatter diagrams. Details of exploratory methods such as stem-and-leaf plots and box plots are also given.
Many of the numerical summaries for reducing data in this chapter are essential for
the application of statistics and probability in engineering. Among the most important of
these statistics are the mean, standard deviation, and the coefficient of correlation. Several
sets of data are provided here as examples of random variables which engineers encounter.
One needs to interpret these and draw sensible conclusions. The graphical and numerical
methods here are a necessary first step and lead into the probabilistic methods of Chapters
2 and 3 and the verification of mathematical models in subsequent chapters.
REFERENCES
General. The following references are given for further reading as required:
Ang, A. M. S., and W. H. Tang (1975). Probability Concepts in Engineering Planning and Design,
Vol. 1: Basic Principles, John Wiley and Sons, New York. A blend of theory and practice with
wide appeal for practicing civil engineers.
Benjamin, J. R., and C. A. Cornell (1970). Probability, Statistics and Decision for Civil Engineers,
McGraw-Hill, New York. A classic for civil engineers with examples, extensive case studies, and
innumerable problems to solve.
Blank, L. T. (1980). Statistical Procedures for Engineering, Management and Science, McGraw-Hill, New York. Well-explained theory and practical examples, commendable as an introductory
text.
Chambers, J. M., W. S. Cleveland, B. Kleiner, and P. A. Tukey (1983). Graphical Methods for Data
Analysis, Wadsworth, Belmont, CA. A standard reference for those seeking further knowledge of
graphical methods.
Groeneveld, R. A. (1979). Introductory Statistical Methods—An Integrated Approach Using Minitab,
Marcel Dekker, New York. An ideal statistical guide with computer applications suitable for
beginners.
Hahn, G. J., and S. S. Shapiro (1967). Statistical Models for Engineering, John Wiley and Sons,
New York. Reprinted in 1994 as a Wiley Classic in the engineering series. Recommended as a
reference book for understanding the basics of statistical applications in engineering.
Hand, D. J., F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994). A Handbook of Small
Data Sets, Chapman and Hall, London. Diverse data sets.
Hines, W. H., and D. C. Montgomery (1990). Probability and Statistics in Engineering and Management Science, 3rd ed., John Wiley and Sons, New York. Comprehensive book of 700 pages,
300 examples, and 626 problems.
Hoaglin, D. C., F. Mosteller, and J. W. Tukey (eds.) (1983). Understanding Robust and Exploratory
Data Analysis, John Wiley and Sons, New York. This authoritative book will further enhance
one’s knowledge of exploratory data analysis.
Johnson, R., and G. K. Bhattacharyya (1992). Statistics—Principles and Methods, 2nd ed., John
Wiley and Sons, New York. Basic principles well explained.
Mendenhall, W., and R. J. Beaver (1994). Introduction to Probability and Statistics, 9th ed., Duxbury
Press, Boston. Statistics and probability at beginners’ level.
Mendenhall, W., and T. Sincich (1995). Statistics for Engineering and the Sciences, 4th ed., Prentice
Hall, Englewood Cliffs, NJ. Introduction with many applications.
Moore, D. S., and G. P. McCabe (2003). Introduction to the Practice of Statistics, 4th ed., W. H.
Freeman and Co., New York. A useful primer in statistics.
Moroney, M. J. (1975). Facts from Figures, reprinted 1990, Penguin Books, London. The best book
written for an absolute beginner in statistics.
Scheaffer, R. L., and J. T. McClave (1995). Probability and Statistics for Engineers, 4th ed., Duxbury
Press, Belmont, CA. A variety of charts and preliminary calculations in Chapter 1. In general,
low emphasis in mathematics throughout. Highly commendable as an introduction.
Wackerly, D. D., W. Mendenhall, and R. L. Scheaffer (2002). Mathematical Statistics with Applications, 6th ed., Duxbury, Pacific Grove, CA. Comprehensive introduction.
Additional references quoted in text
Freedman, D., and P. Diaconis (1981). “On the histogram as a density estimator: L2 theory,”
Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, Vol. 57, pp. 453–476, Chap.
2. Related to the class intervals of a histogram.
Kottegoda, N. T. (1984). “Investigation of outliers in annual maximum flow series,” J. Hydrol., Vol.
72, No. 1, pp. 105–137. Methods of detecting outliers.
Scott, D. W. (1979). “On optimal and data-based histograms,” Biometrika, Vol. 66, pp. 605–610.
Number of classes for histograms.
Stuart, A., and J. K. Ord (1994). Kendall’s Advanced Theory of Statistics, Vol. 1, 6th ed., Charles,
Edward Arnold, London. Advanced reference. See Gini’s mean difference in Chapter 2.
Sturges, H. A. (1926). “The choice of a class interval,” J. Am. Stat. Assoc., Vol. 21, pp. 65–66.
Historical work on the histogram.
Tukey, J. W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading, MA. Original reference
on exploratory methods.
Yule, G. U. (1926). “Why do we sometimes get nonsense correlation between time series,” J. R.
Stat. Soc., Vol. 89, pp. 1–69. Shows how two unrelated variables can have a high coefficient of
correlation because they are influenced by a common factor.
PROBLEMS
1.1. Earthquake records. Measurements of engineering interest have been recorded
during earthquakes in Japan and in other parts of the world since 1800. One
of the critical recordings is of apparent relative density, RDEN. After the commencement of a strong earthquake, a saturated fine, loose sand undergoes vibratory motion and consequently the sand may liquefy without retaining any shear
strength, thus behaving like a dense liquid. This will lead to failures in structures
supported by the liquefied sand. These are often catastrophic. The standard penetration test is used to measure RDEN. Another measurement taken to estimate
the prospect of liquefaction is that of the intensity at which the ground shakes.
This is the peak surface acceleration of the soil during the earthquake, ACCEL.
The data are from J. T. Christian and W. F. Swiger (1975), J. Geotech. Eng. Div.,
Proc. ASCE, 101, GT111, 1135–1150, and are reproduced by permission of the
publisher (ASCE):
RDEN   ACCEL           RDEN   ACCEL           RDEN   ACCEL
(%)    (units of g)    (%)    (units of g)    (%)    (units of g)
 53    0.219            30    0.138            50    0.313
 64    0.219            72    0.422            44    0.224
 53    0.146            90    0.556           100    0.231
 64    0.146            40    0.447            65    0.334
 65    0.684            50    0.547            68    0.419
 55    0.611            55    0.204            78    0.352
 75    0.591            50    0.170            58    0.363
 72    0.522            55    0.170            80    0.291
 40    0.258            75    0.192            55    0.314
 58    0.250            53    0.292           100    0.377
 43    0.283            70    0.299           100    0.434
 32    0.419            64    0.292            52    0.350
 40    0.123            53    0.225            58    0.334
Note: g denotes acceleration due to gravity (9.81 m/s2 ).
Compute the sample mean x̄, standard deviation ŝ, and the coefficient of skewness,
g1 , for RDEN and ACCEL. Construct stem-and-leaf plots for each set. Comment on
the distributions. Plot the scatter diagram and calculate the correlation coefficient
r . What conclusions can be reached?
1.2. Flood discharge. Annual maximum flood flows in the Po River at Pontelagoscuro,
Italy, over a 61-year period from 1918 to 1978 are given in the second column
of Table E.7.2. Compute the sample mean x̄ and standard deviation ŝ. Sketch a
histogram and the cumulative relative frequency diagram. Compute the quartiles and
draw a box-and-whiskers plot. Comment on the distribution. Flood embankments
along the banks of the river can withstand a flow of 5000 m3 /s. What is the probability
that this will be exceeded during a 12-month period?
1.3. Flood discharge. The following are the annual maximum flows in m3 /s in the
Colorado River at Black Canyon for the 52-year period from 1878 to 1929:
1980, 1700, 1420, 1980, 2690, 3230, 1130, 1570, 1980, 4960, 2270, 3090, 3120,
2830, 2690, 2120, 5660, 2120, 2120, 3260, 2550, 5950, 1700, 2410, 1840, 4250,
3400, 2550, 2550, 2410, 1980, 3120, 8500, 1980, 1840, 4670, 2070, 3260, 2120,
3120, 1700, 1470, 3960, 2410, 3290, 2410, 2410, 2270, 2410, 3170, 4550, 3310
[Adapted from E. J. Gumbel (1954), “Statistical theory of extreme values and some
practical applications,” National Bureau of Standards, Applied Mathematics Series
33, U.S. Govt. Printing Office, Washington, DC.]
Compute the mean x̄ and standard deviation ŝ. Sketch a histogram and the relative
frequency diagram. Compute the quartiles and draw a box-and-whiskers plot. How
does this distribution differ from that of Problem 1.2?
1.4. Welding joints for steel. At the University of Birmingham, England, laboratory
measurements were taken of the horizontal legs x and vertical legs y of numerous
welding joints for steel buildings. The main objective was to make the legs equal to
6 mm. A part of the results is listed below in millimeters.
x = 5.5, 5.0, 5.0, 6.0, 7.0, 5.2, 5.5, 5.5, 6.0, 6.0, 4.5, 6.0, 5.5, 7.7, 7.5, 6.0, 5.6,
5.0, 5.5, 5.5, 6.0, 6.5, 5.5, 5.0, 5.5, 5.5, 6.5, 6.5, 7.0, 5.5, 6.5, 5.5, 6.0,
6.5, 8.5, 5.0, 6.0, 6.5, 5.0, 7.0, 5.0, 5.0, 6.5, 6.5, 6.0, 4.7, 8.0, 7.0, 5.5, 7.0,
6.6, 6.5, 7.0, 6.0, 6.5, 5.0, 7.0, 7.5, 7.0, 7.0
y = 6.5, 6.5, 5.5, 7.5, 6.0, 7.0, 5.0, 8.0, 6.7, 7.8, 5.7, 6.5, 5.5, 8.0, 8.0, 6.3, 6.0,
6.0, 6.0, 5.5, 6.5, 6.0, 6.0, 6.0, 6.0, 6.5, 6.5, 6.0, 6.0, 6.5, 7.5, 7.5, 6.0, 4.5,
7.0, 7.0, 6.0, 4.0, 4.0, 7.0, 7.0, 6.5, 7.0, 5.0, 5.0, 5.7, 5.0, 5.0, 6.0, 7.0, 6.0,
7.0, 6.0, 5.5, 6.0, 4.0, 5.5, 8.0, 7.5, 6.5
The data were provided by Dr A. G. Kamtekar.
Draw a scatter diagram for these data. Draw a line through the ideal point
(x = y = 6 mm) and the origin. Draw two lines through the origin that are symmetrical about the first line and envelope all of the points. Comment on the results.
Draw the cumulative sum (cusum) plots,
\[ C_{xn} = \sum_{i=1}^{n} (x_i - \mu_x) \quad \text{and} \quad C_{yn} = \sum_{i=1}^{n} (y_i - \mu_y) \]
for n = 1, 2, . . . , 60 and μx = μy = 6. Let
\[ d_{xn} = C_{xn} - \min_{1 \le i \le n-1} [C_{xi}] \]
and the critical limit be max(dxn) = 12 mm. Is the critical limit reached? Repeat for
the vertical legs y. [Further details of cusum plots are given by W. H. Woodall and
B. M. Adams (1993), “The statistical design of cusum charts,” Qual. Eng., Vol. 5,
No. 4, pp. 550–570; the associated control chart is the subject of Problem 5.11.]
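As an aid to setting up the calculation, the cusum and the quantity dxn can be sketched in Python. The function names are hypothetical, only the first ten horizontal-leg values are used so that the full problem remains to be worked, and d for n = 1 is set to zero here as an assumption because the minimum over an empty set is undefined.

```python
def cusum(values, target):
    """Cumulative sum of deviations from the target value."""
    total, out = 0.0, []
    for v in values:
        total += v - target
        out.append(total)
    return out

def rise_above_minimum(c):
    """d_n = C_n minus the smallest earlier cusum value; d_1 is taken as 0 (assumption)."""
    d = [0.0]
    running_min = c[0]
    for n in range(1, len(c)):
        d.append(c[n] - running_min)
        running_min = min(running_min, c[n])
    return d

# First ten horizontal-leg measurements of Problem 1.4 (mm); target leg is 6 mm.
x_first10 = [5.5, 5.0, 5.0, 6.0, 7.0, 5.2, 5.5, 5.5, 6.0, 6.0]
c_x = cusum(x_first10, target=6.0)
d_x = rise_above_minimum(c_x)

# Interpreting the critical limit as being reached when max(d) is at least 12 mm.
limit_reached = max(d_x) >= 12.0
print([round(v, 1) for v in c_x])
print(round(max(d_x), 1), limit_reached)
```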
1.5. Frost frequency. Excessive frost can be harmful to roads. Frequencies of the number
of days of frost during April in Greenwich, England, over a 65-year period are
given by C. E. Brooks and N. Carruthers (1953), Handbook of Statistical Methods
in Hydrology, H. M. Stationery Office, London, and are listed below:
Days of frost    0    1    2    3    4    5    6    7    8    9   10
Frequency       15   11    5   11    7    6    2    3    2    1    2
Draw a line diagram of the data. Comment on the results. Compute the mean number
of days of frost in April. What is the probability of a frost-free April in a given
year? What change would you expect in the frequency distribution for a month in
midwinter?
1.6. Concrete cube test. From 28-day concrete cube tests made in England in 1990,
the following results of maximum load at failure in kilonewtons and compressive
strength in newtons per square millimeter were obtained:
Maximum load: 950, 972, 981, 895, 908, 995, 646, 987, 940, 937, 846, 947, 827,
961, 935, 956
Compressive strength: 42.25, 43.25, 43.50, 39.25, 40.25, 44.25, 28.75, 44.25, 41.75,
41.75, 38.00, 42.50, 36.75, 42.75, 42.00, 33.50
The data were supplied by Dr J. E. Ash, University of Birmingham, England.
Calculate the means x̄, standard deviations ŝ, mean absolute deviations d, and
the coefficients of skewness g1 . Draw two stem-and-leaf plots of the data. Draw a
scatter diagram and calculate the coefficient of correlation. What conclusions can
be drawn?
1.7. Timber strength. For the timber strength data of Table E.1.1 determine the following measures of dispersion:
(a) Interquartile range, iqr
(b) Mean absolute deviation, d
(c) Gini’s mean difference, g
Compare results with the standard deviation ŝ of Table 1.2.2. Repeat these determinations after deleting the zero value. Rank the measures of dispersion in increasing
order of susceptibility to the exclusion of the zero value on the basis of percentage
change.
1.8. Population growth. From past records, the population of an urban area has doubled
every 10 years. Currently, it has a population of 200,000. An engineer needs to make
an estimate of the requirements for water supply during the next 23 years. What
maximum population does one assume for the period?
1.9. Traffic speed. The following is the frequency distribution of travel times of motorcars on the M1 motorway from Coventry, England, to M10, St Albans, according
to a survey conducted in England (see Ph.D. thesis of A. W. Evans, University of
Birmingham, England, 1967):
Mean times (min): 53, 58, 63, 68, 73, 78, 83, 88, 93, 98, 103, 108, 113, 118, 123,
128, 133, 138, 143, 148, 153, 158, 163, 168
Corresponding frequencies: 10, 24, 109, 127, 122, 119, 97, 102, 104, 92, 68, 72,
66, 61, 36, 33, 17, 15, 10, 8, 9, 6, 7, 3
Draw the histogram. Describe the salient features. What is the likely reason for the
twin peaks? What inference can be made from the mean time interval between the
two peaks?
1.10. Average speed. On a certain country road that runs from a coastal town to a village
in the mountains, the average speed of motorcars is 80 km/h uphill and 100 km/h
downhill. What is the average speed for a journey from the town to the village and
back?
1.11. Annual rainfall. Catchment-averaged annual rainfall values in the Po River basin of Italy
for the 61-year period from 1918 to 1978 are given in the penultimate column of
Table E.7.2. Draw a stem-and-leaf plot and a box plot of the data. Comment on the
type of distribution.
1.12. Rock test. A contractor engaged in building part of a sewer tunnel claimed that the
rock was harder than described in his contract with a District Council in the United
Kingdom and thus more work was required to construct the tunnel than anticipated.
An independent company made tests to verify the contractor’s claim. Among these
were uniaxial compressive strengths, of which 123 specimens are listed here, in
meganewtons per square meter.
2.40, 22.08, 16.80, 4.80, 21.36, 9.12, 9.36, 3.60, 15.36, 15.60, 6.24, 9.84, 16.08,
30.00, 20.40, 12.96, 19.20, 10.32, 15.84, 62.40, 40.80, 4.80, 7.20, 8.88, 14.40,
14.88, 5.76, 18.72, 12.48, 11.04, 8.64, 19.20, 8.16, 18.96, 8.64, 12.00, 14.88, 17.52,
12.48, 13.44, 9.36, 11.28, 8.88, 15.12, 9.36, 17.28, 26.40, 4.32, 11.28, 7.92, 13.92,
11.76, 9.60, 8.40, 9.84, 27.60, 6.00, 14.40, 8.88, 17.04, 12.48, 9.84, 10.80, 12.24,
12.00, 13.20, 11.28, 11.76, 11.76, 8.00, 9.36, 15.12, 11.52, 16.08, 10.80, 14.64,
8.40, 13.44, 10.56, 9.12, 13.44, 12.72, 13.68, 11.28, 5.52, 11.04, 12.00, 7.20, 8.64,
11.76, 8.64, 7.68, 7.68, 13.92, 6.48, 7.20, 7.92, 9.60, 8.64, 9.12, 12.96, 9.36, 14.64,
9.12, 8.88, 20.40, 17.28, 8.64, 11.76, 7.92, 7.68, 11.04, 12.48, 14.40, 9.84, 9.12,
8.40, 12.00, 4.80, 12.72, 9.60, 8.64, 9.84
Draw histograms using Eqs. (1.1.1) and (1.1.2) for the class widths. What do you notice about the histograms in general? Draw a box-and-whiskers plot. What evidence
is there to support the contractor’s claim?
1.13. Soil erosion. Measurements taken on farmlands of the amounts of soil washed away
by erosion suggest a relationship with flow rates. The following results are taken
from G. R. Foster, W. R. Ostercamp, and L. J. Lane (1982), “Effect of discharge
rate on rill erosion,” Winter 1982 Meeting of the American Society of Agricultural
Engineers:
Flow (L/s)         0.31   0.85   1.26   2.47   3.75
Soil eroded (kg)   0.82   1.95   2.18   3.01   6.07
Draw a plot of the data. Comment on the results.
1.14. Concrete cube test. The following 28-day compressive strengths, in newtons per
square millimeter, were obtained from test results on concrete cubes in England:
50.5, 45.8, 49.6, 47.7, 54.0, 49.4, 54.1, 53.1, 56.5, 55.2, 52.7, 52.0, 54.2, 55.2, 53.4,
51.0, 53.1, 48.5, 51.0, 58.6, 52.5, 49.5, 51.1, 48.1, 50.2, 49.3, 47.3, 52.9, 52.8, 49.5,
48.8, 53.8, 47.3, 47.7, 52.2, 45.7, 53.4, 48.5, 49.1, 43.3
The data were supplied by Dr J. E. Ash, University of Birmingham, England.
Compare these results with the compressive strengths in Table E.1.2 by drawing
back-to-back stem-and-leaf plots. For this purpose, plot the foregoing results on the
left of the stem with reference to Fig. 1.3.1 and omit the cumulative frequencies.
Comment on the differences in the distributions.
1.15. Water quality. Water quality measurements are taken daily on the River Ouse at
Clapham, England. The concentrations of chlorides and phosphates in solution,
given below in milligrams per liter, are determined over a 30-day period.
Chloride: 64.0, 66.0, 64.0, 62.0, 65.0, 64.0, 64.0, 65.0, 65.0, 67.0, 67.0, 74.0, 69.0,
68.0, 68.0, 69.0, 63.0, 68.0, 66.0, 66.0, 65.0, 64.0, 63.0, 66.0, 55.0, 69.0, 65.0, 61.0,
62.0, 62.0
Phosphate: 1.31, 1.39, 1.59, 1.68, 1.89, 1.98, 1.97, 1.99, 1.98, 2.15, 2.12, 1.90, 1.92,
2.00, 1.90, 1.74, 1.81, 1.86, 1.86, 1.65, 1.58, 1.74, 1.89, 1.94, 2.07, 1.58, 1.93, 1.72,
1.73, 1.82
Compare the coefficients of variation v. Draw a scatter diagram and compute the
correlation coefficient r . Comment on the results. Do you see any role in this
association for predictive purposes?
1.16. Timber strength. From the timber strength data of Table 1.1.3, compute the 3%
trimmed mean by omitting 3% of the observations from the highest and the lowest
extremities of the ranked data. Compute the standard deviation ŝ and the coefficients
of skewness g1 and kurtosis g2 . Compare with the results for the full sample (as
given in Table 1.2.2).
1.17. Concrete beam. Joist-hanger tests carried out at the University of Birmingham,
England, on concrete beams gave observations of deflections in millimeters and
failure load in kilograms. The following results pertain to 75 mm × 150 mm hangers
on which timber joists rest:
Failure load: 1903, 1665, 1903, 1991, 2229, 1910, 2025, 1991, 1882, 2032, 1896,
1346
Deflection: 0.69, 0.67, 0.80, 0.50, 0.74, 0.78, 0.57, 0.91, 0.54, 0.50, 0.97, 0.62
Determine by drawing a scatter diagram and computing the correlation coefficients
whether there is any association between the two variables. Discuss your results.
1.18. Hurricane frequency. Hurricane damage is of great concern to civil engineers.
The frequencies of hurricanes affecting the east coast of the United States each
year during a period of 69 years are given as follows by H. C. S. Thom (1966),
Some Methods of Climatological Analysis, World Meteorological Organisation,
Geneva:
Number of hurricanes    0    1    2    3    4    5    6    7    8    9
Frequency               1    6   10   16   19    5    7    3    1    1
Draw a line diagram and comment on its form. Discuss differences or similarities
between this diagram and Fig. 1.1.1.
1.19. Air pollution. On 13 April 1994, the following concentrations of pollutants were
recorded at eight stations of the monitoring system for pollution control located in
the downtown area of Milan, Italy:
Station        Aquileia  Cenisio  Juvara  Liguria  Marche  Senato  Verziere  Zavattari
NO2 (μg/m3)    130       120      130     115      135     142     90        116
CO (mg/m3)     2.9       4.1      4.4     3.6      3.3     5.7     4.8       7.3
Compare the coefficients of variation v of the pollutants and determine their correlation r .
1.20. Storm rainfall. The analysis of storm data is essential for predicting flood hazards
in urban areas. Annual maximum rainfall depths (in millimeters) recorded at Genoa
University in Italy, for durations varying from 5 minutes to 3 hours, are presented
here for the years 1974–1987.
        Duration (min)
Year      5     10     20     30     40     50     60    120    180
1974   12.1   19.5   28.8   30.5   32.4   35.5   38.7   48.0   51.6
1975   10.1   14.9   26.7   31.2   34.7   38.2   40.2   55.0   56.0
1976   17.9   20.0   31.1   37.2   41.1   51.0   55.7   67.1   80.6
1977   20.0   32.6   52.6   72.4   90.1  108.8  118.9  146.5  157.3
1978    5.1   13.6   16.0   21.3   24.1   24.6   25.0   40.7   49.9
1979   20.5   26.1   36.3   46.1   49.3   50.3   55.6   65.2   90.1
1980   10.0   15.7   20.9   25.0   30.5   38.0   40.1   58.0   63.8
1981   12.0   27.9   47.9   56.0   70.0   80.0   89.4  106.9  114.2
1982   10.0   14.4   20.0   23.3   25.1   26.4   27.2   34.3   41.2
1983   10.0   12.1   17.3   19.2   22.1   27.3   32.7   54.4   66.5
1984   20.1   32.8   60.0   65.7   76.1   92.8  105.7  122.3  122.3
1985    7.6    8.1   13.0   16.5   21.6   25.3   25.3   27.0   32.3
1986    8.7   11.7   20.0   22.9   26.1   26.3   27.6   41.1   56.7
1987   24.6   36.7   56.7   73.9   93.9  110.1  128.5  180.8  188.7
Compute the mean x̄ and standard deviation ŝ and coefficient of skewness g1 for
each duration. Are there some regularities in the growth of these statistics with
increasing duration? Comment on the results and the physical relevance to storm
characteristics.
1.21. Carbon dioxide. The records of atmospheric trace gases are used in the study
of global climatic changes. Monthly carbon dioxide concentrations (in parts per
million in volume) recorded at Mount Cimone, Italy, from 1980 to 1988 are given
here.
                                                  Month
Year     Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
1980  339.83  342.27  342.51  338.27  335.52  330.14  328.81  331.17  335.03  339.05  340.43  340.87
1981  343.11  342.39  342.51  339.49  335.28  330.77  330.30  333.55  336.80  339.41  343.18  341.47
1982  344.38  345.68  345.70  340.80  336.66  334.65  332.40  335.15  339.26  341.19  345.18  341.70
1983  346.18  345.00  344.24  342.32  338.34  336.03  335.00  336.57  339.86  343.97  345.61  342.38
1984  349.44  351.33  350.50  346.43  344.35  346.29  335.19  337.59  342.26  344.88  346.91  346.32
1985  348.17  350.62  350.61  345.93  341.43  337.67  337.16  339.40  344.07  349.49  347.40  349.92
1986  351.41  352.29  350.75  348.37  342.96  337.22  338.53  340.90  346.28  348.95  350.52  349.41
1987  353.75  354.79  352.61  350.39  347.38  341.64  341.64  342.19  345.60  350.39  352.36  351.94
1988  355.02  354.96  354.51  352.20  346.71  342.60  344.60  343.66  348.99  352.42  353.27  353.13
Compute the mean x̄ and standard deviation ŝ for each year (by rows) and for each
month (by columns). Because the temporal evolution of the annual mean indicates
that carbon dioxide increases (probably resulting in global warming), compute the
annual rate of increase. Comment on the results.
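One possible way to organize the computation in Problem 1.21 is sketched below. Estimating the annual rate of increase as the least-squares slope of the annual means against the year is an assumption of this sketch; the problem does not prescribe a method, and a simple difference between the last and first annual means divided by the number of intervening years would be another reasonable choice.

```python
from statistics import mean, stdev

years = list(range(1980, 1989))

# Monthly CO2 concentrations (ppm by volume): one row per year, Jan to Dec.
co2 = [
    [339.83, 342.27, 342.51, 338.27, 335.52, 330.14, 328.81, 331.17, 335.03, 339.05, 340.43, 340.87],
    [343.11, 342.39, 342.51, 339.49, 335.28, 330.77, 330.30, 333.55, 336.80, 339.41, 343.18, 341.47],
    [344.38, 345.68, 345.70, 340.80, 336.66, 334.65, 332.40, 335.15, 339.26, 341.19, 345.18, 341.70],
    [346.18, 345.00, 344.24, 342.32, 338.34, 336.03, 335.00, 336.57, 339.86, 343.97, 345.61, 342.38],
    [349.44, 351.33, 350.50, 346.43, 344.35, 346.29, 335.19, 337.59, 342.26, 344.88, 346.91, 346.32],
    [348.17, 350.62, 350.61, 345.93, 341.43, 337.67, 337.16, 339.40, 344.07, 349.49, 347.40, 349.92],
    [351.41, 352.29, 350.75, 348.37, 342.96, 337.22, 338.53, 340.90, 346.28, 348.95, 350.52, 349.41],
    [353.75, 354.79, 352.61, 350.39, 347.38, 341.64, 341.64, 342.19, 345.60, 350.39, 352.36, 351.94],
    [355.02, 354.96, 354.51, 352.20, 346.71, 342.60, 344.60, 343.66, 348.99, 352.42, 353.27, 353.13],
]

# Row statistics (per year) and column statistics (per month), n - 1 divisor.
annual_mean = [mean(row) for row in co2]
annual_sd = [stdev(row) for row in co2]
monthly_mean = [mean(col) for col in zip(*co2)]
monthly_sd = [stdev(col) for col in zip(*co2)]

# Least-squares slope of annual mean versus year (ppm per year).
t_bar = mean(years)
c_bar = mean(annual_mean)
slope = (sum((t - t_bar) * (c - c_bar) for t, c in zip(years, annual_mean))
         / sum((t - t_bar) ** 2 for t in years))

for y, m, s in zip(years, annual_mean, annual_sd):
    print(f"{y}: mean = {m:.2f} ppm, s = {s:.2f} ppm")
print(f"estimated rate of increase = {slope:.2f} ppm per year")
```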
1.22. Historical records of earthquake intensity. Catalogo dei terremoti italiani
dall’anno 1000 al 1980 (“Catalog of Italian earthquakes from year 1000 to 1980”)
was edited by D. Postpischl in 1985, and is available through the National Research
Council of Italy. This directory contains all of the available historical information
on earthquakes that occurred in Italy during the past (nearly) 1000 years. It also
includes values of earthquake intensity in terms of the Mercalli–Cancani–Sieberg
(MCS) index. The following table gives the values of MCS intensity for the city of
Rome:
MCS intensity
Century
2
XI
XII
XIII
XIV
XV
XVI
XVII
XVIII
XIX
XX
110
3
Total
113
3
4
5
6
7
2
1
1
7
125
4
50
2
132
56
1
1
1
1
2
14
2
1
1
22
4
2
Total
2
1
1
0
3
0
1
15
301
5
329
Draw the line diagram for the whole data and for those recorded in each century.
Compare the data recorded in the eighteenth century with those recorded in the
other centuries.
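A text-only line diagram for the whole data set of Problem 1.22 might be sketched as follows. It assumes that the "Total" row of the table reads 113, 132, 56, 22, 4, and 2 events for MCS intensities 2 to 7 (these sum to the stated overall total of 329); the per-century diagrams would follow the same pattern with the counts of the corresponding row.

```python
# Number of events per MCS intensity over all centuries
# (assumed reading of the "Total" row of the table).
counts = {2: 113, 3: 132, 4: 56, 5: 22, 6: 4, 7: 2}

n = sum(counts.values())
rel_freq = {k: c / n for k, c in counts.items()}

# One text bar per intensity, scaled so that the longest bar has 50 marks.
longest = max(counts.values())
for k, c in counts.items():
    bar = "|" * round(50 * c / longest)
    print(f"MCS {k}: {bar} {c} ({rel_freq[k]:.3f})")
```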
1.23. Sea waves. Because of scarcity of records, the characteristics of sea waves are
often derived from other climatological data. For this purpose, the SMB method
(named after Sverdrup, Munk, and Bretschneider) is widely used in engineering
practice [see U.S. Army Corps of Engineers (1977), Shore Protection Manual,
Vol. 1, Coastal Engineering Research Center, Washington, DC]. Liberatore and
Rosso used this model to simulate sea waves in the upper Adriatic Sea [Liberatore,
G., and R. Rosso (1983). “Sulla valutazione stocastica dell’onda di progetto in base
alla ricostruzione dello stato del mare: un esempio di applicazione per l’Adriatico
centro-meridionale,” Giornale del Genio Civile, Vol. 1–3, pp. 3–25]. They investigated two different strategies for model calibration, called “no. 1” and “no. 2” in the
table presented here. The table also includes the observed and the simulated values
of the height of the highest sea wave and of its period for measurements taken from
August 1977 to September 1978.
                                               Simulated values
   Measured values        Calibration strategy no. 1   Calibration strategy no. 2
Height (m)  Period (s)     Height (m)  Period (s)       Height (m)  Period (s)
   2.26        6.1            1.81        5.4              1.54        5.8
   3.10        4.3            2.93        6.8              2.54        6.4
   3.22        5.7            3.24        7.2              2.80        6.7
   3.84        7.7            3.18        7.1              2.69        6.6
   2.56        5.3            2.74        6.6              2.32        6.1
   2.74        5.7            3.49        7.4              3.00        6.9
   2.28        4.9            2.12        5.8              1.80        5.4
   3.88        6.7            5.10        9.0              4.43        8.4
   2.49        5.0            2.14        5.8              1.81        5.4
   4.22        6.9            4.45        8.8              3.77        7.7
   2.01        5.0            2.57        6.4              2.19        5.9
   2.77        5.9            2.68        6.5              2.27        6.0
   3.61        6.5            3.86        7.8              3.36        7.3
   3.51        7.4            4.02        8.0              3.51        7.5
   2.52        5.0            3.39        7.3              2.95        6.9
   2.12        5.1            2.61        6.5              2.21        6.0
   2.73        6.5            2.22        6.0              1.88        5.5
   3.30        5.4            4.05        8.0              3.49        7.5
Draw a scatter diagram to compare the observed and simulated values of wave
heights and periods. Compute the correlation coefficients r. Compute the deviations
of the simulated data from the observed data, and find the mean x̄1, standard deviation
ŝ1, and coefficient of variation v of these deviations. Do these results indicate which
of the two investigated strategies provides the better representation of sea waves
from climatological data?
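The comparison asked for in Problem 1.23 can be sketched as below, apart from the scatter diagram. The sketch assumes that a deviation is defined as simulated minus observed and that ŝ uses the n − 1 divisor; the helper names and the layout of the `results` dictionary are illustrative choices, not part of the problem.

```python
from math import sqrt
from statistics import mean, stdev

measured = {
    "height": [2.26, 3.10, 3.22, 3.84, 2.56, 2.74, 2.28, 3.88, 2.49,
               4.22, 2.01, 2.77, 3.61, 3.51, 2.52, 2.12, 2.73, 3.30],
    "period": [6.1, 4.3, 5.7, 7.7, 5.3, 5.7, 4.9, 6.7, 5.0,
               6.9, 5.0, 5.9, 6.5, 7.4, 5.0, 5.1, 6.5, 5.4],
}
simulated = {
    1: {"height": [1.81, 2.93, 3.24, 3.18, 2.74, 3.49, 2.12, 5.10, 2.14,
                   4.45, 2.57, 2.68, 3.86, 4.02, 3.39, 2.61, 2.22, 4.05],
        "period": [5.4, 6.8, 7.2, 7.1, 6.6, 7.4, 5.8, 9.0, 5.8,
                   8.8, 6.4, 6.5, 7.8, 8.0, 7.3, 6.5, 6.0, 8.0]},
    2: {"height": [1.54, 2.54, 2.80, 2.69, 2.32, 3.00, 1.80, 4.43, 1.81,
                   3.77, 2.19, 2.27, 3.36, 3.51, 2.95, 2.21, 1.88, 3.49],
        "period": [5.8, 6.4, 6.7, 6.6, 6.1, 6.9, 5.4, 8.4, 5.4,
                   7.7, 5.9, 6.0, 7.3, 7.5, 6.9, 6.0, 5.5, 7.5]},
}


def pearson_r(x, y):
    """Sample (Pearson) correlation coefficient of paired observations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def deviation_stats(obs, sim):
    """Mean, std. deviation, and coefficient of variation of (simulated - observed)."""
    d = [s - o for o, s in zip(obs, sim)]
    m, sd = mean(d), stdev(d)
    return m, sd, sd / m


# results[(strategy, variable)] = (r, mean deviation, std. deviation, v)
results = {}
for strategy, sim in simulated.items():
    for var in ("height", "period"):
        r = pearson_r(measured[var], sim[var])
        results[(strategy, var)] = (r, *deviation_stats(measured[var], sim[var]))
        m, sd, v = results[(strategy, var)][1:]
        print(f"strategy {strategy}, {var}: r = {r:.3f}, "
              f"mean dev = {m:.3f}, s = {sd:.3f}, v = {v:.2f}")
```

The sign of the mean deviation shows whether a strategy tends to overestimate or underestimate the measured values, which is worth noting alongside r when comparing the two strategies.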
1.24. Surveying. A triangulated network is used to determine the position of three points
in space, denoted by u1 ≡ (x1, y1), u2 ≡ (x2, y2), and u3 ≡ (x3, y3), by measuring
their mutual distances and their distances from two reference points, uA ≡ (xA, yA)
and uB ≡ (xB, yB), as shown in Fig. 1.P1.
[Figure: plan view of points A, B, 1, 2, and 3; horizontal axis x (m) from 0 to 100, vertical axis y (m) from 0 to 70.]
Fig. 1.P1 Survey configuration.
The Cartesian coordinates of the reference points are xA = yA = 0, xB = 92, and
yB = 40 m. The table of the measured distances is given next.
        uA    uB    u1    u2    u3
uA       0   100    50    71    92
uB     100     0    86    70    40
u1      50    86     0    26    99
u2      71    70    26     0    93
u3      92    40    99    93     0
Using appropriate trigonometric methods, find the average location and coefficients
of variation of the coordinates of point u1 ≡ (x1, y1).
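As a starting point for Problem 1.24, the sketch below obtains one estimate of u1 by intersecting the circle of radius 50 about uA with the circle of radius 86 about uB. It assumes that the tabulated distances are in meters and, reading Fig. 1.P1, that point 1 lies above the line AB. The helper `circle_intersections` is a hypothetical name. Further estimates of u1 (via u2 and u3) would be needed before the average location and the coefficients of variation can be computed.

```python
from math import hypot, sqrt


def circle_intersections(c0, r0, c1, r1):
    """Return the two intersection points of the circles (c0, r0) and (c1, r1)."""
    (x0, y0), (xc, yc) = c0, c1
    d = hypot(xc - x0, yc - y0)                # distance between the centers
    if d == 0 or d > r0 + r1 or d < abs(r0 - r1):
        raise ValueError("circles do not intersect in two points")
    a = (r0 ** 2 - r1 ** 2 + d ** 2) / (2 * d)  # offset of the chord from c0
    h = sqrt(max(r0 ** 2 - a ** 2, 0.0))        # half-length of the chord
    ex, ey = (xc - x0) / d, (yc - y0) / d       # unit vector from c0 to c1
    px, py = x0 + a * ex, y0 + a * ey           # foot of the chord on the center line
    return (px - h * ey, py + h * ex), (px + h * ey, py - h * ex)


u_a, u_b = (0.0, 0.0), (92.0, 40.0)   # reference points (m)
d_a1, d_b1 = 50.0, 86.0               # measured distances uA-u1 and uB-u1

candidates = circle_intersections(u_a, d_a1, u_b, d_b1)
# Assumption from Fig. 1.P1: point 1 is the candidate with the larger y.
x1, y1 = max(candidates, key=lambda p: p[1])
print(f"u1 estimated from uA and uB: x1 = {x1:.2f} m, y1 = {y1:.2f} m")
```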