
E BOOK Applied Multivariate Statistical Analysis in Medicine 1st Edition by Jingmei Jiang

CHAPTER 1
Overview of multivariate statistical analysis
1.1 Introduction
In medical research, it is often necessary to include multiple variables simultaneously to
fully describe and analyze the phenomena of interest. For example, health status assessment may involve using indicators of physiological, psychological, and social adjustment,
whereas disease diagnosis may require the integration of clinical manifestations, imaging
examinations, and laboratory tests. In the prediction of cardiovascular events, variables
such as body mass index, blood pressure, lipid levels, diabetes mellitus, and smoking status
may be considered. Multivariate statistical analysis consists of a collection of methods that
can be used when several measurements are made for each individual or object in one or
more samples. The goal of multivariate statistical analysis is to extract important information that is hidden within these complex variable relationships and to identify the essential
features of the phenomena being studied.
The need to understand the relationships between many variables makes multivariate
analysis an inherently difficult subject. Often, the human mind is overwhelmed by the
amount of data. Additionally, more mathematics is required to derive multivariate statistical techniques for making inferences than in a univariate setting. In this textbook, we
introduce the basics of multivariate statistical analysis based on algebraic concepts and,
to avoid derivations of statistical results that require the calculus of many variables, we
make use of illustrative examples and a minimal amount of mathematics. Despite this,
basic mathematical sophistication and a desire to think quantitatively will be required.
We have attempted to motivate readers’ study of multivariate analysis and provide rudimentary, but important, methods for organizing, summarizing, and displaying multivariate data.
Multivariate statistics originated in the 1920s, and famous statisticians such as J. Wishart, H. Hotelling, R.A. Fisher, and S.N. Roy were pioneers in this field. The specific
content of multivariate statistical analysis not only includes the direct extension of
methods used in univariate analysis but also covers problems unique to scenarios in which
multiple variables are encountered simultaneously. With the development of computer
technology, multivariate statistical analysis has been widely used in fields such as geology,
meteorology, economics, and medicine.
In 1975, British statistician M.G. Kendall summarized the issues studied using classic
multivariate statistical analysis into the following categories:
1. Data reduction or structural simplification
Data reduction is the process of converting data from a space with high dimensions to
a space with low dimensions while retaining important features and valuable information
from the original data. The intention is that this will be a simplified and more easily interpretable representation of the phenomenon of interest. Principal component analysis and
factor analysis, which we introduce in Chapters 8 and 9, are typical structural simplification methods.
2. Classification and discrimination
Classification and discrimination refer to classifying (or clustering) individuals (or variables) on the basis of measured characteristics. Additionally, rules for classifying objects
into well-defined groups may be required. Typical and frequently used methods include
cluster analysis and discriminant analysis, which we introduce in Chapters 11 and 12.
3. Investigation of the relationship between variables of interest
Biomedical research often focuses on examining the relationship between variables of
interest. This type of research aims to determine whether there is a correlation between
variables and whether predictions can be made about one variable based on others. Statistical methods such as regression analysis (Chapters 4–7) and canonical correlation
(Chapter 10) are commonly used to address these issues.
4. Statistical inference of multivariate data
The statistical inference of multivariate data is similar to that of univariate analysis,
with a focus on estimating and testing hypotheses about the parameters of multivariate
populations. This may be performed to validate assumptions or to reinforce prior convictions. We introduce these contents in Chapters 2 and 3.
5. Theoretical basis of multivariate statistical analysis
The theoretical basis of multivariate statistical analysis concerns multidimensional
random vectors (mostly normal random vectors) and the multivariate statistics defined
from them, including the derivation of their distributions, the study of their properties,
and their sampling distribution theory. We also introduce these topics in Chapter 3.
We conclude this brief introduction to multivariate analysis with a quotation from
F.H.C. Marriott: “You should keep it in mind whenever you attempt or read about a
data analysis. It allows one to maintain a proper perspective and not be overwhelmed
by the elegance of some of the theory.”
1.2 Application of multivariate statistical analysis
To further illustrate the application of multivariate statistical techniques in medical research, we provide several examples (or problems) that we have personally encountered. These examples may promote readers’ conceptual understanding of multivariate statistical methods and help readers deepen that understanding in combination with their own practice. We classify the examples on the basis of the objective and content of multivariate statistical analysis.
1. Data reduction or structural simplification
• The assessment of self-care ability is a crucial component in research on the quality
of life among older adults and encompasses 12 indicators that range from basic
abilities, such as dressing and eating, to more complex abilities, such as shopping
and financial management. In practice, a challenge arises in the analysis of such
data because it is essential to determine how to simplify the data structure while
preserving critical information.
• A study aims to evaluate the health status of urban residents of Wuhan, China. The
project must manage a large number of variables, and it is challenging to evaluate the health status of the participants comprehensively using such a large index system.
2. Classification and discrimination
• An approach involves grouping geographic areas based on demographic, medical,
and health service indicators, followed by an assessment of the appropriateness of
medical resource allocation.
• For patients with pulmonary nodules, how are malignant tumors identified using
image information such as the size, location, and shape of the nodules, combined
with the clinical manifestations of the patients?
3. Investigation of the relationship between variables of interest
• Based on data from a national health survey, in a study, researchers aim to examine
the correlation between the physical development of adolescents and their lung
function status, taking into consideration their physical development status.
• Prognostic factors that influence the outcome of breast cancer surgery are
explored and the extent of the influence of various prognostic factors on the survival time of patients is determined in a follow-up study.
4. Statistical inference of multivariate data
Multivariate inference is particularly useful for curbing the researcher’s natural tendency
to read too much into the data. Examples include the following:
• How is the efficacy of a new drug evaluated compared with existing drugs in the
treatment of patients with AIDS through changes in laboratory indicators such as
virology and immunology?
• Systolic blood pressure, total cholesterol, and body mass index are important predictors of cardiovascular disease. How are the distributions of these indicators compared
across various ethnic groups based on sample data, and used to guide the derivation of
the prevention and control policy of cardiovascular disease?
It should be noted that real-world research is often multifaceted and complex, and
in many cases multiple approaches are suitable for the same problem.
The field of multivariate statistics in medical research is highly practical and useful.
Although we do not intend to overemphasize the mathematical foundation of this field,
it is important to recognize the inherent relationship between statistical theory and application. The vast range of research topics within basic medicine, clinical medicine, public
health, and preventive medicine provides ample opportunities for the application of
multivariate statistics. Additionally, the use of multivariate statistical methods continues
to expand into new fields, with basic statistical theory serving as a common theoretical
foundation for these methods.
1.3 Structure of multivariate data
Whenever a researcher intends to investigate a phenomenon or validate a certain hypothesis, more than one variable (characteristic) is usually involved and thus a data structure
with multivariate data can be formed. We now introduce the preliminary concepts
that underlie these first steps of data organization.
Let $x_{ij}$ $(i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, p)$ denote the particular measurement of the $i$th item (object, or observation) of the $j$th variable. Consequently, $n$ measurements for $p$ variables can be displayed as shown in Table 1.1.

Table 1.1 Tabular representation of the multivariate data structure.

          Variable
Item      X_1      X_2      …      X_j      …      X_p
1         x_11     x_12     …      x_1j     …      x_1p
2         x_21     x_22     …      x_2j     …      x_2p
⋮         ⋮        ⋮               ⋮               ⋮
i         x_i1     x_i2     …      x_ij     …      x_ip
⋮         ⋮        ⋮               ⋮               ⋮
n         x_n1     x_n2     …      x_nj     …      x_np

Alternatively, we can display these data as a rectangular array, called data matrix $X$, of $n$ rows and $p$ columns:

$$
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\triangleq
\begin{bmatrix}
X_{(1)}^{T} \\ X_{(2)}^{T} \\ \vdots \\ X_{(n)}^{T}
\end{bmatrix},
\quad \text{or} \quad
X \triangleq (X_1, X_2, \ldots, X_p),
\tag{1.1}
$$
where the superscript “T” denotes “transpose”. Data matrix $X$ may be abbreviated as $(x_{ij})_{n \times p}$.
The $i$th $(i = 1, 2, \ldots, n)$ row of matrix $X$ is $X_{(i)}^{T} = (x_{i1}, x_{i2}, \ldots, x_{ip})$, which denotes the observation of the $i$th item and is called a row vector. Before the observation is actually made, it is a $p$-dimensional random vector.
The $j$th $(j = 1, 2, \ldots, p)$ column of matrix $X$ is

$$
X_j = \begin{bmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{bmatrix},
$$

which denotes the $n$ observations of the $j$th variable and is called a column vector. Before the observations are actually made, it is an $n$-dimensional random vector.
In multivariate statistical analysis, all the content involved consists of random vectors
or random matrices that are composed of multiple random vectors. For details, see Chapter
13.
Representing multivariate data using a data matrix has the following advantages: (1) it
may be more convenient for the transformation, processing, and calculation of data; and
(2) it is easy to program the data matrix on a computer; hence, the calculation of some
statistics can be completed by the program.
Example 1.1: In a national project aimed at understanding the health status and basic physiological parameters of different regions in China, chest circumference (cm) $X_1$, waist circumference (cm) $X_2$, and hip circumference (cm) $X_3$ were measured. Part of the data of 57 junior girls (12 years old) from Jiangsu province is shown in Table 1.2.
Three random variables ($X_1$, $X_2$, and $X_3$) are involved in this research. The measurements of these variables for each participant constitute a row vector, which is a random vector with three dimensions $(p = 3)$.

Table 1.2 Physiological data of 12-year-old girls in Jiangsu province.

Individual    X_1     X_2     X_3
1             72.0    65.0    80.0
2             78.0    67.0    91.0
3             75.0    62.0    80.0
4             70.0    61.0    88.0
5             76.0    60.0    91.0
6             71.0    62.0    83.0
7             63.0    58.0    78.0
⋮             ⋮       ⋮       ⋮
57            80.0    68.0    92.0

CAMS Innovation Fund for Medical Sciences (CIFMS) (2020-I2M-2-009).

When the measurements of the 57 junior girls are complete, the row vectors are

$$
X_{(1)}^{T} = (72.0, 65.0, 80.0), \quad
X_{(2)}^{T} = (78.0, 67.0, 91.0), \quad \ldots, \quad
X_{(57)}^{T} = (80.0, 68.0, 92.0).
$$
Similarly, we can obtain column vectors of the three variables (chest circumference, waist circumference, and hip circumference):

$$
X_1 = \begin{bmatrix} 72.0 \\ 78.0 \\ \vdots \\ 80.0 \end{bmatrix}, \quad
X_2 = \begin{bmatrix} 65.0 \\ 67.0 \\ \vdots \\ 68.0 \end{bmatrix}, \quad
X_3 = \begin{bmatrix} 80.0 \\ 91.0 \\ \vdots \\ 92.0 \end{bmatrix}.
$$
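The row and column views of a data matrix can be sketched in a few lines of Python. This is only an illustration: it uses the eight rows of Table 1.2 that are actually printed (individuals 1 to 7 and 57), and the helper names `row_vector` and `column_vector` are assumptions introduced here, not part of the text.

```python
# Rows of Table 1.2 that are displayed in the text (individuals 1-7 and 57).
# The remaining 49 rows are elided in the source, so this is NOT the full sample.
data_matrix = [
    [72.0, 65.0, 80.0],
    [78.0, 67.0, 91.0],
    [75.0, 62.0, 80.0],
    [70.0, 61.0, 88.0],
    [76.0, 60.0, 91.0],
    [71.0, 62.0, 83.0],
    [63.0, 58.0, 78.0],
    [80.0, 68.0, 92.0],
]


def row_vector(matrix, i):
    """Observation of the i-th stored item (1-based): the row X_(i)^T."""
    return list(matrix[i - 1])


def column_vector(matrix, j):
    """All observations of the j-th variable (1-based): the column X_j."""
    return [row[j - 1] for row in matrix]


n = len(data_matrix)      # number of stored items
p = len(data_matrix[0])   # number of variables

print(n, p)
print(row_vector(data_matrix, 2))      # chest, waist, hip of individual 2
print(column_vector(data_matrix, 1))   # chest circumference of all stored items
```

Storing the data as a list of rows keeps each individual's measurements together, which is the layout that the later sketches of the mean vector and covariance matrix assume.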
1.4 Descriptive statistics of multivariate data
Much of the information contained in data can be assessed by calculating certain numerical characteristics known as descriptive statistics. For example, the sample arithmetic mean
(or sample mean) in univariate analysis is a descriptive statistic that provides a measure of
the central location for a set of data. Additionally, the average of the squares of the distances of all the values from the mean provides a measure of the spread, or variation. In
multivariate analysis, we rely most heavily on descriptive statistics that measure location,
variation, and linear association between variables. We provide formal definitions of these
quantities in Chapter 2. In the present chapter, we introduce commonly used descriptive
statistics, such as the sample mean vector, sample covariance matrix, and sample correlation matrix.
1.4.1 Sample mean vector
The sample mean vector plays a central role in the description of the sample data matrix.
Let $n$ be the number of items observed on each of $p$ variables. The mean vector calculated from the sample data is denoted by

$$
\bar{X} = \begin{bmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \vdots \\ \bar{X}_p \end{bmatrix}
= (\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_p)^{T},
\tag{1.2}
$$

where $\bar{X}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$ $(j = 1, 2, \ldots, p)$.
In Example 1.1, there are three variables ($X_1$, $X_2$, and $X_3$); the sample mean of each
variable can be calculated as
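$$
\bar{X}_1 = \frac{1}{57} \sum_{i=1}^{57} x_{i1} = \frac{1}{57}(72.0 + 78.0 + \cdots + 80.0) = 76.58,
$$

$$
\bar{X}_2 = \frac{1}{57} \sum_{i=1}^{57} x_{i2} = \frac{1}{57}(65.0 + 67.0 + \cdots + 68.0) = 67.14,
$$

$$
\bar{X}_3 = \frac{1}{57} \sum_{i=1}^{57} x_{i3} = \frac{1}{57}(80.0 + 91.0 + \cdots + 92.0) = 87.51.
$$

Thus, the sample mean vector $\bar{X}$ can be obtained as follows:

$$
\bar{X} = \begin{bmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \bar{X}_3 \end{bmatrix}
= \begin{bmatrix} 76.58 \\ 67.14 \\ 87.51 \end{bmatrix}
= (76.58, 67.14, 87.51)^{T}.
$$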
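As a rough sketch of Eq. (1.2), the mean vector can be computed column by column. The function name `mean_vector` is an assumption made for illustration, and the input contains only the eight rows printed in Table 1.2, so the result necessarily differs from the full-sample values (76.58, 67.14, 87.51) reported above.

```python
def mean_vector(data):
    """Sample mean of each column of an n x p data matrix (list of rows)."""
    n = len(data)
    p = len(data[0])
    return [sum(row[j] for row in data) / n for j in range(p)]


# Only the rows of Table 1.2 that are displayed (individuals 1-7 and 57).
displayed_rows = [
    [72.0, 65.0, 80.0],
    [78.0, 67.0, 91.0],
    [75.0, 62.0, 80.0],
    [70.0, 61.0, 88.0],
    [76.0, 60.0, 91.0],
    [71.0, 62.0, 83.0],
    [63.0, 58.0, 78.0],
    [80.0, 68.0, 92.0],
]

partial_means = mean_vector(displayed_rows)
print(partial_means)  # [73.125, 62.875, 85.375] for these 8 rows only
```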
1.4.2 Sample covariance matrix
The variance-covariance matrix generalizes the notion of variance from one dimension
to multiple dimensions. We can use the variance-covariance matrix to depict the degree
of dispersion of multiple random variables in the sample and the relationship between any
two variables.
To improve readers’ understanding of the concept, we write the calculation of the
sample variance-covariance matrix in two parts:
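$$
s_{jj} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \quad (j = 1, 2, \ldots, p),
\tag{1.3}
$$

$$
s_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) \quad (j, k = 1, 2, \ldots, p;\ j \neq k).
\tag{1.4}
$$

Here $\bar{x}_j$ denotes the sample mean of the $j$th variable, written $\bar{X}_j$ in Eq. (1.2).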
Eq. (1.3) is the calculation of the variance of each component of the p-dimensional
random vector. Eq. (1.4) is the covariance between any two variables Xj and Xk in the
p-dimensional random vector. In fact, Eqs. (1.3) and (1.4) can be expressed in a uniform
manner because the variance of Xj could be viewed as its own covariance. For convenience, hereafter, we refer to the variance-covariance matrix of samples as the covariance
matrix.
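Thus, for any given $p$-dimensional random vector, the sample covariance matrix is

$$
S = \begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots &        & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{bmatrix}.
$$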
Matrix $S$ is a symmetric matrix (i.e., $S = S^{T}$) that comprises $p$ variances and $p(p-1)/2$
distinct covariances, where the variances lie on the leading diagonal of the matrix.
Example 1.2: Referring to Example 1.1, calculate the sample covariance matrix of
the three variables (X1 , X2 , and X3 ).
Solution:
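First, calculate the variance of each component of the three-dimensional random vector:

$$
s_{11} = s_1^2 = \frac{1}{57-1} \sum_{i=1}^{57} (x_{i1} - \bar{x}_1)^2
= \frac{1}{56}\left[(72.0 - 76.58)^2 + (78.0 - 76.58)^2 + \cdots + (80.0 - 76.58)^2\right] = 67.32,
$$

$$
s_{22} = s_2^2 = \frac{1}{57-1} \sum_{i=1}^{57} (x_{i2} - \bar{x}_2)^2
= \frac{1}{56}\left[(65.0 - 67.14)^2 + (67.0 - 67.14)^2 + \cdots + (68.0 - 67.14)^2\right] = 69.02,
$$

$$
s_{33} = s_3^2 = \frac{1}{57-1} \sum_{i=1}^{57} (x_{i3} - \bar{x}_3)^2
= \frac{1}{56}\left[(80.0 - 87.51)^2 + (91.0 - 87.51)^2 + \cdots + (92.0 - 87.51)^2\right] = 38.47.
$$

Then, calculate the covariance between any two variables:

$$
s_{12} = \frac{1}{57-1} \sum_{i=1}^{57} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)
= \frac{1}{56}\left[(72.0 - 76.58)(65.0 - 67.14) + \cdots + (80.0 - 76.58)(68.0 - 67.14)\right] = 60.85,
$$

$$
s_{13} = \frac{1}{57-1} \sum_{i=1}^{57} (x_{i1} - \bar{x}_1)(x_{i3} - \bar{x}_3)
= \frac{1}{56}\left[(72.0 - 76.58)(80.0 - 87.51) + \cdots + (80.0 - 76.58)(92.0 - 87.51)\right] = 47.31,
$$

$$
s_{23} = \frac{1}{57-1} \sum_{i=1}^{57} (x_{i2} - \bar{x}_2)(x_{i3} - \bar{x}_3)
= \frac{1}{56}\left[(65.0 - 67.14)(80.0 - 87.51) + \cdots + (68.0 - 67.14)(92.0 - 87.51)\right] = 43.27.
$$

Since $s_{12} = s_{21}$, $s_{13} = s_{31}$, and $s_{23} = s_{32}$, the sample covariance matrix of the
three random variables $X_1$, $X_2$, and $X_3$ can be obtained as follows: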
$$
S = \begin{bmatrix}
67.32 & 60.85 & 47.31 \\
60.85 & 69.02 & 43.27 \\
47.31 & 43.27 & 38.47
\end{bmatrix}.
$$
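Eqs. (1.3) and (1.4) can be combined into a single routine, because a variance is the covariance of a variable with itself. The following is a minimal sketch under that reading; the function names and the small `toy` data set are hypothetical and chosen so that the result is easy to check by hand. Reproducing the matrix of Example 1.2 would require all 57 rows, which are not printed in Table 1.2.

```python
def mean_vector(data):
    """Sample mean of each column of an n x p data matrix (list of rows)."""
    n = len(data)
    p = len(data[0])
    return [sum(row[j] for row in data) / n for j in range(p)]


def sample_covariance_matrix(data):
    """p x p sample covariance matrix with divisor n - 1 (Eqs. 1.3 and 1.4)."""
    n = len(data)
    p = len(data[0])
    means = mean_vector(data)
    S = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j, p):
            cross = sum((row[j] - means[j]) * (row[k] - means[k]) for row in data)
            S[j][k] = cross / (n - 1)
            S[k][j] = S[j][k]  # S is symmetric, so fill the mirror entry
    return S


# Hypothetical 3 x 2 data set: the second variable is twice the first.
toy = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
toy_S = sample_covariance_matrix(toy)
print(toy_S)  # [[4.0, 8.0], [8.0, 16.0]]
```

Only the upper triangle is computed explicitly; the symmetry $s_{jk} = s_{kj}$ supplies the rest.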
1.4.3 Sample correlation matrix
Another important descriptive statistic is the sample correlation matrix.
The Pearson correlation coefficient between variables $X_j$ and $X_k$ in $p$-dimensional space is
defined as

$$
r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj} s_{kk}}}
= \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}
{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}}
\quad (j, k = 1, 2, \ldots, p;\ j \neq k).
\tag{1.5}
$$

The values of the correlation coefficient lie between $-1$ and $+1$, and the magnitude
of the absolute value of $r_{jk}$ denotes the strength of the correlation between variables $X_j$
and $X_k$, whereas the sign indicates the direction of the correlation.
Based on Eq. (1.5), the sample correlation matrix is defined as

$$
R = \begin{bmatrix}
1      & r_{12} & \cdots & r_{1p} \\
r_{21} & 1      & \cdots & r_{2p} \\
\vdots & \vdots &        & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{bmatrix}.
$$
The reason that we are interested in the correlation coefficient statistic is that it is unit
free; that is, it does not vary as the unit of measurement changes. In fact, when each variable is standardized (centered at its mean and divided by its standard deviation), the covariance matrix of the standardized variables is
equal to the correlation matrix of the original variables; this standardized covariance is
called a correlation. In practice, the correlation coefficient is more intuitive than the covariance as a measure of the linear association between variables.
Example 1.3: Referring to Example 1.1, calculate the correlation matrix of the three
variables (X1 , X2 , and X3 ).
Solution:
Based on Eq. (1.5), the correlation matrix of $X$ is

$$
R = \begin{bmatrix}
1.00 & 0.89 & 0.93 \\
0.89 & 1.00 & 0.84 \\
0.93 & 0.84 & 1.00
\end{bmatrix}.
$$
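The first form of Eq. (1.5), $r_{jk} = s_{jk} / \sqrt{s_{jj} s_{kk}}$, means that the correlation matrix can be recovered from the covariance matrix alone. The sketch below applies it to the matrix $S$ reported in Example 1.2; the function name `correlation_from_covariance` is an assumption introduced for illustration.

```python
import math

# Sample covariance matrix S as reported in Example 1.2.
S = [
    [67.32, 60.85, 47.31],
    [60.85, 69.02, 43.27],
    [47.31, 43.27, 38.47],
]


def correlation_from_covariance(cov):
    """Divide each covariance s_jk by the product of the two standard deviations."""
    p = len(cov)
    sd = [math.sqrt(cov[j][j]) for j in range(p)]
    return [[cov[j][k] / (sd[j] * sd[k]) for k in range(p)] for j in range(p)]


R = correlation_from_covariance(S)
for row in R:
    print([round(r, 2) for r in row])
# [1.0, 0.89, 0.93]
# [0.89, 1.0, 0.84]
# [0.93, 0.84, 1.0]
```

Rounded to two decimals, the off-diagonal entries agree with the matrix $R$ of Example 1.3.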
1.5 Statistical distance
Although they may appear formidable at first, most multivariate techniques are based on
the simple concept of distance. Distance quantifies how far two objects are from each
other. Because most multivariate methods rely on the measurement of distance between
samples or variables, it is necessary to introduce the concept of distance prior to the introduction of a specific multivariate statistical method. A comprehensive discussion about
distance is available in Chapter 11.
The Euclidean distance (or straight-line distance) is the most common measure of distance.
If we consider point $P(x_1, x_2)$ in two-dimensional space, the Euclidean distance to origin
point $O(0, 0)$ is, according to the Pythagorean theorem,

$$
d(O, P) = \sqrt{x_1^2 + x_2^2}.
\tag{1.6}
$$

Generally, as we expand two-dimensional space to $p$-dimensional space, for any given
point $P$ in $p$-dimensional space with coordinates $(x_1, x_2, \ldots, x_p)$, the Euclidean distance
from $P$ to origin point $O(0, 0, \ldots, 0)$ is

$$
d(O, P) = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}.
\tag{1.7}
$$

The Euclidean distance between two arbitrary points $P = (x_1, x_2, \ldots, x_p)$ and
$Q = (y_1, y_2, \ldots, y_p)$ is given by

$$
d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}.
\tag{1.8}
$$
Although the Euclidean distance is simple and intuitive, it is unsatisfactory for most
statistical purposes. This is because each coordinate contributes equally (equal weight)
to the calculation of the Euclidean distance. Therefore, the Euclidean distance may fail
to capture the change of values of indicators with varying degrees of variation.
The purpose now is to develop a statistical distance that accounts for differences in variation and, in due course, the presence of correlation. Because our choice depends on the
sample variances and covariances, at this point, we use the term statistical distance to
distinguish it from the ordinary Euclidean distance.
To illustrate, suppose we have n pairs of measurements on two variables each having
mean zero. Call the variables x1 and x2 , and assume that the x1 measurements vary independently of the x2 measurements. In addition, assume that the variability in the x1 measurements is larger than the variability in the x2 measurements. A scatter plot of the data
would look something like the one pictured in Fig. 1.1.
The plot shows that the number of observation points per unit length (the density) along the $x_1$-axis is much smaller than that along the $x_2$-axis, which may be caused by the different units of measurement of $x_1$ and $x_2$ or by the degree of variation itself.
Figure 1.1 Schematic scatter plot of points in a plane.
A common approach to solve this problem is to divide each coordinate by the sample
standard deviation. On division by the standard deviations, we obtain the
“standardized” coordinates $x_1^* = x_1 / \sqrt{s_{11}}$ and $x_2^* = x_2 / \sqrt{s_{22}}$. The standardized coordinates
ensure the consistency of the measurement scale.
Thus, the statistical distance of point $P(x_1, x_2)$ from origin $O(0, 0)$ can be computed
from its standardized coordinates $x_1^*, x_2^*$:

$$
d(O, P) = \sqrt{(x_1^*)^2 + (x_2^*)^2}
= \sqrt{\left(x_1 / \sqrt{s_{11}}\right)^2 + \left(x_2 / \sqrt{s_{22}}\right)^2}
= \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}};
\tag{1.9}
$$

that is, the statistical distance is the weighted Euclidean distance from the original
coordinates.
The difference between Eqs. (1.9) and (1.6) is that $k_1 = 1/s_{11}$ and $k_2 = 1/s_{22}$, which are
the weights for $x_1^2$ and $x_2^2$, respectively, are added to Eq. (1.9). When the two variables
have the same variance, that is, $k_1 = k_2$, the statistical distance differs from the Euclidean
distance only by a constant factor; that is, if the variability in the $x_1$ direction is the same as that
in the $x_2$ direction, and the $x_1$ values vary independently of the $x_2$ values, then the
Euclidean distance is appropriate.
Setting the right-hand side of Eq. (1.9) to $c$ $(c > 0)$ and squaring both sides of Eq. (1.9),
we obtain

$$
\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2.
\tag{1.10}
$$

Eq. (1.10) indicates that the locus of all points whose squared statistical distance from the
origin equals the constant $c^2$ is an ellipse centered on the origin, with the major
(long) and minor (short) axes coinciding with the coordinate axes.
The concept of statistical distance can be easily generalized to $p$-dimensional space.
Consider an arbitrary point $P = (x_1, x_2, \ldots, x_p)$ and any fixed point
$Q = (y_1, y_2, \ldots, y_p)$, and assume that the coordinate variables vary independently
of one another. Let $s_{11}, s_{22}, \ldots, s_{pp}$ be sample variances constructed from $n$ measurements.
Then the statistical distance from $P$ to $Q$ is

$$
d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}}.
\tag{1.11}
$$
Eq. (1.11) has a similar geometric implication to Eq. (1.9); that is, all
points whose squared statistical distance from fixed point $Q$ equals a constant
lie on a hyperellipsoid centered at $Q$, and each principal
axis is parallel to the corresponding coordinate axis. Eq. (1.11) also indicates that when
$s_{11} = s_{22} = \cdots = s_{pp} = 1$, the axes of the hyperellipsoid all have the same length,
the hyperellipsoid becomes a hypersphere, and the statistical distance reduces
to the Euclidean distance.
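The contrast between Eqs. (1.8) and (1.11) can be sketched as follows. The point, the variances, and the function names are hypothetical values chosen for illustration; the only assumption carried over from the text is that the coordinates vary independently, so no covariance terms appear.

```python
import math


def euclidean_distance(x, y):
    """Eq. (1.8): every coordinate contributes with equal weight."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def statistical_distance(x, y, variances):
    """Eq. (1.11): each squared difference is divided by that variable's sample variance."""
    return math.sqrt(sum((a - b) ** 2 / s for a, b, s in zip(x, y, variances)))


# Hypothetical point and variances: x1 varies much more than x2.
P = [2.0, 1.0]
O = [0.0, 0.0]
variances = [4.0, 1.0]

d_euclid = euclidean_distance(P, O)                  # sqrt(2^2 + 1^2)
d_stat = statistical_distance(P, O, variances)       # sqrt(4/4 + 1/1)
d_unit = statistical_distance(P, O, [1.0, 1.0])      # unit variances

print(d_euclid, d_stat, d_unit)
```

With unit variances the two functions return the same value, which is the reduction to the Euclidean distance noted after Eq. (1.11). With unequal variances, a deviation of 2 along the high-variance axis counts no more than a deviation of 1 along the low-variance axis.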
The distance in Eq. (1.11) still does not include most of the important cases we
encounter because of the assumption of independent coordinates. As shown in
Fig. 1.2, the spread of the points indicates that variables x1 and x2 are related to each
other. In fact, the coordinates of the pairs ðx1 ; x2 Þ exhibit a tendency to be large or small
together, and the sample correlation coefficient is positive. Moreover, the variability in
the x2 direction is larger than that in the x1 direction.
Fig. 1.2 shows that, with the distribution of the points unchanged, rotating the original coordinate system counterclockwise by an angle $\theta$ leads
to new coordinates $\tilde{x}_1$ and $\tilde{x}_2$ that are independent. Thus, we
define the statistical distance from point $P(\tilde{x}_1, \tilde{x}_2)$ to origin $O(0, 0)$ as

$$
d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}},
\tag{1.12}
$$

where $\tilde{s}_{11}$ and $\tilde{s}_{22}$ denote the sample variances computed using the $\tilde{x}_1$ and $\tilde{x}_2$
measurements, respectively.
Figure 1.2 Schematic diagram of positively correlated data and a rotated coordinate system.
The original coordinates $(x_1, x_2)$ and coordinates $(\tilde{x}_1, \tilde{x}_2)$ after the rotation have the
following relationship:

$$
\tilde{x}_1 = x_1 \cos\theta + x_2 \sin\theta, \qquad
\tilde{x}_2 = -x_1 \sin\theta + x_2 \cos\theta.
\tag{1.13}
$$
Eq. (1.13) is substituted into Eq. (1.12). Then, after a simple calculation, the distance
from point $P$ to origin $O(0, 0)$ can be calculated using $(x_1, x_2)$; that is, the original
coordinates of $P$:

$$
d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2}.
\tag{1.14}
$$

The coefficients $a_{11}, a_{12}, a_{22}$ in Eq. (1.14) are determined by $\theta$ and by the variances $\tilde{s}_{11}$ and $\tilde{s}_{22}$. The difference between Eqs. (1.14) and (1.12) lies in the presence of the cross-product term $2 a_{12} x_1 x_2$
necessitated by the nonzero correlation $r_{12}$.
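The "simple calculation" can be written out as a worked step. The expressions below are not quoted from the text; they follow from inserting the rotation $\tilde{x}_1 = x_1 \cos\theta + x_2 \sin\theta$, $\tilde{x}_2 = -x_1 \sin\theta + x_2 \cos\theta$ into the squared form of Eq. (1.12) and collecting terms.

```latex
\begin{aligned}
d^2(O, P)
  &= \frac{(x_1 \cos\theta + x_2 \sin\theta)^2}{\tilde{s}_{11}}
   + \frac{(-x_1 \sin\theta + x_2 \cos\theta)^2}{\tilde{s}_{22}} \\
  &= a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2, \\[4pt]
a_{11} &= \frac{\cos^2\theta}{\tilde{s}_{11}} + \frac{\sin^2\theta}{\tilde{s}_{22}}, \qquad
a_{22} = \frac{\sin^2\theta}{\tilde{s}_{11}} + \frac{\cos^2\theta}{\tilde{s}_{22}}, \qquad
a_{12} = \sin\theta \cos\theta \left( \frac{1}{\tilde{s}_{11}} - \frac{1}{\tilde{s}_{22}} \right).
\end{aligned}
```

When $\theta = 0$, the cross-product coefficient $a_{12}$ vanishes and the expression takes the same form as Eq. (1.9), in which no rotation is needed.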
Generally, under the condition that variables are correlated with each other, the statistical distance between point $P(x_1, x_2)$ and any fixed point $Q(y_1, y_2)$ is expressed as

$$
d(P, Q) = \sqrt{a_{11}(x_1 - y_1)^2 + 2 a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2}.
\tag{1.15}
$$
Additionally, the coordinates of all the points $P(x_1, x_2)$ whose distance to $Q(y_1, y_2)$ is the
constant $c$ satisfy

$$
a_{11}(x_1 - y_1)^2 + 2 a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2 = c^2.
\tag{1.16}
$$

Eq. (1.16) is the equation of an ellipse centered at $Q$. The graph of such an equation is
displayed in Fig. 1.3. The major and minor axes are indicated. They are parallel to the $\tilde{x}_1$
and $\tilde{x}_2$ axes.
Eqs. (1.15) and (1.16) can be directly generalized to $p$-dimensional space for the calculation of the distance between two points.
For a $p$-dimensional random vector, the Mahalanobis distance, proposed by P. C.
Mahalanobis in 1936, for two observed points $P$ and $Q$ ($\mathbf{x}_{(i)}$ and $\mathbf{x}_{(j)}$) is defined as
$$d(P, Q) = \sqrt{\left(\mathbf{x}_{(i)} - \mathbf{x}_{(j)}\right)^{T} \mathbf{S}^{-1} \left(\mathbf{x}_{(i)} - \mathbf{x}_{(j)}\right)}, \tag{1.17}$$
where $\mathbf{S}^{-1}$ denotes the inverse of the sample covariance matrix $\mathbf{S}$. The Mahalanobis
distance is also called the generalized distance, which is a more general form of the statistical
distance.
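As a quick check on this generalization, Eq. (1.17) can be written out for $p = 2$. The sketch below takes $\mathbf{x}_{(i)} = (x_1, x_2)^T$ and $\mathbf{x}_{(j)} = (y_1, y_2)^T$, and assumes that $\mathbf{S}$ is an invertible $2 \times 2$ sample covariance matrix with entries $s_{11}$, $s_{12}$, $s_{22}$. These symbols are introduced here for illustration, and the identification of coefficients is an illustration under these assumptions rather than a derivation taken from the text.

```latex
\[
\mathbf{S} = \begin{pmatrix} s_{11} & s_{12} \\ s_{12} & s_{22} \end{pmatrix},
\qquad
\mathbf{S}^{-1} = \frac{1}{s_{11}s_{22} - s_{12}^{2}}
\begin{pmatrix} s_{22} & -s_{12} \\ -s_{12} & s_{11} \end{pmatrix}.
\]
Writing $u_1 = x_1 - y_1$ and $u_2 = x_2 - y_2$,
\[
d^{2}(P, Q)
= \begin{pmatrix} u_1 & u_2 \end{pmatrix}
  \mathbf{S}^{-1}
  \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}
= \frac{s_{22}\,u_1^{2} - 2 s_{12}\,u_1 u_2 + s_{11}\,u_2^{2}}
       {s_{11}s_{22} - s_{12}^{2}} .
\]
```

This has the same quadratic form as Eq. (1.15), with $a_{11} = s_{22}/D$, $a_{12} = -s_{12}/D$, and $a_{22} = s_{11}/D$, where $D = s_{11}s_{22} - s_{12}^{2}$. If $s_{12} = 0$, the cross-product term vanishes and the expression reduces to $u_1^{2}/s_{11} + u_2^{2}/s_{22}$, so each squared difference is simply divided by the corresponding variance.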
Figure 1.3 Ellipse of points at a constant distance from a fixed point.
The statistical distance is a basic element of statistical description and statistical inference. The main difference between the statistical distance and the Euclidean distance is that
the statistical distance weights each coordinate difference by the reciprocal of the standard deviation of the corresponding variable and, when the variables are correlated, also accounts for their covariances.
It can therefore take both the variability of the observed values and the relationships between the observed variables into consideration,
which makes it independent of the units of measurement of each variable. In the
following chapters, we repeatedly use the concept of the statistical distance so that readers
can understand it further through the study of statistical principles and the analysis of case
studies.
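The insensitivity to measurement units described above can be checked numerically. The following is a minimal sketch, assuming NumPy is available. The data are simulated for illustration only (they are not from the book), and the helper names `mahalanobis` and `euclidean` are hypothetical. It computes the distance of Eq. (1.17) between two observations, then changes the unit of the first variable and recomputes both distances.

```python
import numpy as np


def mahalanobis(x_i, x_j, S):
    """Distance of Eq. (1.17): sqrt((x_i - x_j)^T S^{-1} (x_i - x_j))."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    # Solve S z = diff rather than forming the inverse of S explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))


def euclidean(x_i, x_j):
    """Ordinary straight-line distance between two points."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.sqrt(diff @ diff))


# Simulated, correlated bivariate data (an assumption for illustration only).
rng = np.random.default_rng(0)
assumed_cov = np.array([[100.0, 3.0],
                        [3.0, 0.25]])
X = rng.multivariate_normal(mean=[65.0, 4.4], cov=assumed_cov, size=200)

S = np.cov(X, rowvar=False)          # sample covariance matrix (divisor n - 1)
d_mahal = mahalanobis(X[0], X[1], S)
d_eucl = euclidean(X[0], X[1])

# Change the unit of the first variable (for example kg -> g) and recompute.
X_scaled = X * np.array([1000.0, 1.0])
S_scaled = np.cov(X_scaled, rowvar=False)
d_mahal_scaled = mahalanobis(X_scaled[0], X_scaled[1], S_scaled)
d_eucl_scaled = euclidean(X_scaled[0], X_scaled[1])

print(f"Mahalanobis: original {d_mahal:.4f}, rescaled {d_mahal_scaled:.4f}")
print(f"Euclidean:   original {d_eucl:.4f}, rescaled {d_eucl_scaled:.4f}")
```

Under these assumptions the Mahalanobis distance is unchanged by the change of unit, up to floating-point rounding, whereas the Euclidean distance becomes dominated by the rescaled variable.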
1.6 Statistical software
The advancement of multivariate statistical analysis has been significantly propelled by
rapid progress in computer technology. The availability of modern computer facilities
enables the analysis of large datasets, thereby helping the application of multivariate
techniques to emerging domains such as image analysis and improving the efficacy
of data analysis, particularly in fields such as medicine. Given the substantial number
of variables involved and the growing complexity of calculation methods, the application of multivariate statistics in medical research would face severe limitations
without the assistance of specialized statistical software. Statistical software, which
serves as a pivotal data analysis tool, represents a distinct technology within the realm
of statistics and assumes an indispensable role in the execution of various intricate
multivariate statistical approaches. At present, there is an abundance of software
choices available, each with a wide range of capabilities for conducting multivariate
analysis. These options include, but are not limited to, well-known software
such as Statistical Product and Service Solutions (SPSS), Statistical Analysis System
(SAS), Stata, and R.
Among the prominent statistical software options for multivariate analysis, each possesses distinct functionalities and characteristics that cater to diverse analytical needs.
SAS is well known for its adaptability and strong statistical prowess. It is prominent in
the realm of data handling by offering efficient utilities for data manipulation. Furthermore, SAS offers sophisticated statistical modeling methods, which makes it the preferred
option in sectors such as healthcare, finance, and research.
Stata is highly esteemed for its user-friendly interface and comprehensive statistical capabilities. It accommodates a diverse array of data types, and provides an extensive library
of statistical and graphical functions. Researchers value Stata for its adeptness in managing
large datasets and its robust regression analysis tools.
SPSS is widely recognized for its intuitive interface and user-friendly approach to statistical analysis. It is the preferred tool for beginners and social scientists. SPSS excels in
data visualization, which makes it easier for researchers to present and interpret results.
It also integrates seamlessly with other data analysis software, which enhances its
versatility.
R stands out as an open-source programming language and environment for statistical
computing and graphics. Its strengths are its extensibility and the large number of community-contributed packages available for specialized analyses. R is favored by data scientists
and statisticians because of its flexibility, which allows custom script creation and the
implementation of cutting-edge statistical techniques.
The statistical software that researchers use depends on the particular analysis required,
the amount of data, the expertise of the user, and personal preferences. SAS, Stata, SPSS,
and R each have strengths; hence, they are all useful for data analysts and researchers. In
this book, we focus on SAS, but the concepts presented in the textbook can be applied to
other software. This allows readers to adapt and switch between various programs as
needed in their research.
1.7 Problems
1. What is multivariate analysis? Are the variables under study generally correlated or
independent?
2. Among the five main topics of multivariate statistical analysis introduced in this
chapter, which methods can be regarded as direct extensions of univariate analysis,
and which are more characteristic of multivariate analysis? Please explain,
with examples.
3. What is a random vector? Does the concept of random vectors exist in univariate
statistical analysis? Please provide your explanation.
4. What is the statistical distance? Have we been exposed to this concept in the study of
univariate statistics? Please provide examples of its role in statistical inference.
5. Drawing on the content of this chapter and your own experience, explain the role
that multivariate analysis plays in medical research.
6. To explore the relationship between body weight (kg) $X_1$ and forced vital capacity
(FVC) (L) $X_2$ of adults, 30 males under 40 years old were randomly sampled, and
their body weight and FVC were measured. The data are shown in Table 1.3.
(a) Create a scatter plot and marginal scatter plots (plotting only one variable at a time)
of the data, and interpret these graphs.
(b) Assess the sign of the sample covariance based on the scatter plot.
(c) Calculate the sample mean vector, sample covariance matrix, and sample correlation
matrix.
Table 1.3 Body weight and FVC data of 30 adult males.

Individual   X1 (kg)   X2 (L)     Individual   X1 (kg)   X2 (L)
1            69.8      4.13       16           60.5      4.48
2            85.5      4.44       17           90.4      4.69
3            74.8      4.02       18           80.2      5.01
4            52.3      4.21       19           80.2      5.01
5            67.4      3.83       20           51.7      4.49
6            61.8      4.74       21           71.1      4.78
7            49.2      4.26       22           71.1      4.78
8            56.9      4.32       23           57.5      5.11
9            59.1      4.42       24           55.3      4.15
10           48.9      4.27       25           50.7      3.93
11           48.9      4.27       26           85.8      4.92
12           60.3      4.18       27           77.9      5.23
13           76.7      4.61       28           68.5      4.53
14           66.9      4.44       29           67.1      4.14
15           53.1      3.83       30           77.9      4.57

National Survey on Health Status and Basic Physiological Parameters (2022).
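For part (c) of Problem 6, one possible way to organize the computation is sketched below. This assumes NumPy is available (the book itself works in SAS), and the variable names are illustrative. The sample covariance uses the divisor n - 1, which is NumPy's default and an assumption here; it should be checked against the definition given earlier in the chapter. The values are transcribed from Table 1.3.

```python
import numpy as np

# Body weight X1 (kg) for individuals 1-30, transcribed from Table 1.3.
weight = np.array([
    69.8, 85.5, 74.8, 52.3, 67.4,
    61.8, 49.2, 56.9, 59.1, 48.9,
    48.9, 60.3, 76.7, 66.9, 53.1,
    60.5, 90.4, 80.2, 80.2, 51.7,
    71.1, 71.1, 57.5, 55.3, 50.7,
    85.8, 77.9, 68.5, 67.1, 77.9,
])

# Forced vital capacity X2 (L) for the same individuals, in the same order.
fvc = np.array([
    4.13, 4.44, 4.02, 4.21, 3.83,
    4.74, 4.26, 4.32, 4.42, 4.27,
    4.27, 4.18, 4.61, 4.44, 3.83,
    4.48, 4.69, 5.01, 5.01, 4.49,
    4.78, 4.78, 5.11, 4.15, 3.93,
    4.92, 5.23, 4.53, 4.14, 4.57,
])

# n x p data matrix: one row per individual, one column per variable.
X = np.column_stack([weight, fvc])
n, p = X.shape

mean_vector = X.mean(axis=0)          # sample mean vector (length p)
S = np.cov(X, rowvar=False)           # sample covariance matrix, divisor n - 1
R = np.corrcoef(X, rowvar=False)      # sample correlation matrix

print("n =", n, "p =", p)
print("Sample mean vector:", mean_vector)
print("Sample covariance matrix:\n", S)
print("Sample correlation matrix:\n", R)
```

The off-diagonal entry of `S` carries the sign asked about in part (b), so it can be compared with the impression given by the scatter plot.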
Bibliography
Acock, A. C. (2023). A gentle introduction to Stata (Revised 6th ed.). Stata Press.
Adachi, K. (2016). Matrix-based introduction to multivariate data analysis. Springer.
Cotton, R. (2013). Learning R: A step-by-step function guide to data analysis (1st ed.). O’Reilly Media.
Elliott, A. C., & Woodward, W. A. (2023). SAS essentials: Mastering SAS for data analytics (2nd ed.). Wiley.
Johnson, R., & Wichern, D. (2018). Applied multivariate statistical analysis (6th ed.). Pearson.
Kendall, M. G. (1975). Multivariate analysis. Griffin.
Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India, 2(1), 49–55.
Marriott, F. H. C. (1974). The interpretation of multiple observations. Academic Press.
Pituch, K. A., & Stevens, J. P. (2016). Applied multivariate statistics for the social sciences: Analyses with SAS and IBM’s SPSS (6th ed.). Routledge.
Rencher, A. C., & Christensen, W. F. (2012). Methods of multivariate analysis (3rd ed.). Wiley.
Salcedo, J., McCormick, K., Peck, J., & Wheeler, A. (2017). SPSS statistics for data analysis and visualization (1st ed.). Wiley.