Uploaded by shaynameklejr

Introduction to Statistics Coursebook

advertisement
Introduction to Statistics:
A Tour in 14 Weeks
Reinhard Furrer
and the Applied Statistics Group
Version February 21, 2023
Git b91ab56
Contents
Prologue
v
1 Exploratory Data Analysis and Visualization of Data
1.1 Structure and Labels of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.3 Visualizing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.4 Constructing Good Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.5 Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.6 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1
2
6
8
20
22
24
2 Random Variables
2.1 Basics of Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3 Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.4 Expectation and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.5 The Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.6 Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.7 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
28
30
34
37
39
42
42
3 Functions of Random Variables
3.1 Independent Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Random Variables Derived from Gaussian Random Variables . . . . . . . . . . .
3.3 Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4 Functions of a Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.5 Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.6 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
45
47
51
54
57
57
4 Estimation of Parameters
4.1 Linking Data with Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Construction of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3 Comparison of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.4 Interval Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.5 Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
61
61
64
67
69
76
76
i
ii
CONTENTS
5 Statistical Testing
81
5.1
The General Concept of Significance Testing . . . . . . . . . . . . . . . . . . . . .
82
5.2
Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
86
5.3
Testing Means and Variances in Gaussian Samples . . . . . . . . . . . . . . . . .
92
5.4
Duality of Tests and Confidence Intervals . . . . . . . . . . . . . . . . . . . . . .
99
5.5
Missuse of p-Values and Other Dangers . . . . . . . . . . . . . . . . . . . . . . .
99
5.6
Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.7
Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6 Estimating and Testing Proportions
105
6.1
Estimation
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2
Comparison of two Proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3
Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.4
Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.5
Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7 Rank-Based Methods
123
7.1
Robust Point Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.2
Rank-Based Tests as Alternatives to t-Tests . . . . . . . . . . . . . . . . . . . . . 127
7.3
Comparing More Than two Groups . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.4
Permutation Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.5
Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.6
Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8 Multivariate Normal Distribution
143
8.1
Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.2
Conditional Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.3
Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8.4
Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
8.5
Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
9 Estimation of Correlation and Simple Regression
157
9.1
Estimation of the Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9.2
Estimation of a p-Variate Mean and Covariance . . . . . . . . . . . . . . . . . . . 161
9.3
Simple Linear Regression
9.4
Inference in Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . 168
9.5
Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.6
Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
10 Multiple Regression
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
177
10.1 Model and Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
10.2 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
10.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.4 Extensions of the Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
CONTENTS
iii
10.5 Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
10.6 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
11 Analysis of Variance
199
11.1 One-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
11.2 Two-Way and Complete Two-Way ANOVA . . . . . . . . . . . . . . . . . . . . . 206
11.3 Analysis of Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
11.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
11.5 Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
11.6 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
12 Design of Experiments
215
12.1 Basic Idea and Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
12.2 Sample Size Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
12.3 Blocking, Randomization and Biases . . . . . . . . . . . . . . . . . . . . . . . . . 220
12.4 DoE in the Classical Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
12.5 Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
12.6 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
13 Bayesian Approach
231
13.1 Motivating Example and Terminology . . . . . . . . . . . . . . . . . . . . . . . . 232
13.2 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
13.3 Choice and Effect of Prior Distribution . . . . . . . . . . . . . . . . . . . . . . . . 240
13.4 Regression in a Bayesian Framework . . . . . . . . . . . . . . . . . . . . . . . . . 242
13.5 Appendix: Distributions Common in Bayesian Analysis
. . . . . . . . . . . . . . 244
13.6 Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
13.7 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
14 Monte Carlo Methods
249
14.1 Monte Carlo Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
14.2 Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
14.3 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
14.4 Bayesian Hierarchical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
14.5 Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
14.6 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
Epilogue
265
A Software Environment R
267
iv
CONTENTS
B Calculus
271
B.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
B.2 Functions in Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
B.3 Approximating Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
C Linear Algebra
277
C.1 Vectors, Matrices and Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
C.2 Linear Spaces and Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
C.3 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
C.4 Matrix Decompositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
C.5 Positive Definite Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
References
282
Glossary
289
Index of Statistical Tests and Confidence Intervals
293
Video Index
295
Index of Terms
297
Prologue
This document accompanies the lecture STA120 Introduction to Statistics that has been given
each spring semester since 2013. The lecture is given in the framework of the minor in Applied
Probability and Statistics (www.math.uzh.ch/aws) and comprises 14 weeks of two hours of lecture
and one hour of exercises per week.
As the lecture’s topics are structured on a week by week basis, the script contains thirteen
chapters, each covering “one” topic. Some of chapters contain consolidations or in-depth studies
of previous chapters. The last week is dedicated to a recap/review of the material.
I have thought long and hard about an optimal structure for this script. Let me quickly
summarize my thoughts. It is very important that the document contains a structure that is
tailored to the content I cover in class each week. This inherently leads to 14 “chapters.” Instead
of covering Linear Models over four weeks, I framed the material in four seemingly different
chapters. This structure helps me to better frame the lectures: each week having a start, a set
of learning goals and a predetermined end.
So to speak, the script covers not 14 but essentially only three topics:
1. Background
2. Statistical foundations in a (i) frequentist and (ii) Bayesian approach
3. Linear Modeling
We will not cover these topics chronologically. For a smoother setting, we will discuss some part
of the background (multivariate Gaussian distribution) before linear modeling, when we need it.
This also allows a recap of several univariate concepts. Similarly, we cover the Bayesian approach
at the end. The book is structured according the path illustrated in Figure 1.
In case you use this document outside the lecture, here are several alternative paths through
the chapters, with a minimal impact on concepts that have not been covered:
• Focusing on linear models: Chapters 1, 2, 3, 7, 4, 5, 9, 10, 11 and 12
• Good background in probability: You may omit Chapters 2, 3 and 7
• Bare minimum in the frequentist setting: Chapters 2, 3, 7, 4, 5, 9, 10, and 11
All the datasets that are not part of regular CRAN packages are available via the URL
www.math.uzh.ch/furrer/download/sta120/. The script is equipped with appropriate links that
facilitate the download.
v
vi
Prologue
Start
Background
Exploratory Data Analysis
Random Variables
Functions of Random Variables
Multivariate Normal Distribution
Statistical Foundations
Estimation of Parameters
Statistical Testing
Frequentist
Proportions
Rank−Based Methods
Bayesian Approach
Bayesian
Monte Carlo Methods
End
Linear Modeling
Correlation and Simple Regression
Multiple Regression
Analysis of Variance
Design of Experiments
Figure 1: Structure of the script
6 min
The lecture STA120 Introduction to Statistics formally requires the prerequisites MAT183
Stochastic for the Natural Sciences and MAT141 Linear Algebra for the Natural Sciences or
equivalent modules. For the content of these lectures we refer to the corresponding course
web pages www.math.uzh.ch/fs20/mat183 and www.math.uzh.ch/hs20/mat141. It is possible to
successfully pass the lecture without having had the aforementioned lectures, some self-studying is
necessary though. This book and the accompanying exercises require differentiation, integration,
vector notation and basic matrix operations, concept of solving a linear system of equations.
Appendix B and C give the bare minimum of relevant concepts in calculus and in linear algebra.
We review and summarize the relevant concepts of probability theory in Chapter 2 and in parts
of Chapter 3.
I have augmented this script with short video sequences giving additional – often more technical – insight. These videos are indicated in the margins with a ‘video’ symbol as here.
Some more details about notation and writing style.
• I do not differentiate between ‘Theorem’, ‘Proposition’, ‘Corollary’, they are all termed as
‘Property’.
• There are very few mathematical derivations in the main text. These are typical and
important. Quite often we only state results. For the interested, the proofs of these are
given in the ‘theoretical derivation’ problem at the end of the chapter.
Prologue
vii
• Some of the end-of-chapter problems and exercises are quite detailed others are open. Of
course there are often several approaches to achieve the same; R-Code of solutions is one
way to get the necessary output.
• Variable names are typically on the lower end of explicitness and would definitely be criticized by a computer scientists scrutinizing the code.
• I keep the headers of the R-Codes rather short and succinct. Most of them are linked to
examples and figures with detailed explanations.
• R-Code contain short comments and should be understandable on its own. The degree of
difficulties and details increases over the chapters. For example, in earlier chapters, we use
explicit loops for simulations, whereas in the latter chapters we typically vectorize.
• The R-Code of each chapter allows to reconstruct the figures, up to possible minor differences in margin specifications. There are a few illustrations made in R, for which the code
is presented only online and not in the book.
• At the end of each chapter there are standard references for the material. In the text I
only cite particularly important references.
• For clarity, we omit the prefix ‘http://’ or ‘https://’ from the URLs - unless necessary. The
links have been tested and worked at the time of the writing.
Many have contributed to this document. A big thanks to all of them, especially (alphabetically) Zofia Baranczuk, Federico Blasi, Julia Braun, Matteo Delucci, Eva Furrer, Florian Gerber,
Michael Hediger, Lisa Hofer, Mattia Molinaro, Franziska Robmann, Leila Schuh and many more.
Kelly Reeve spent many hours improving my English. Without their help, you would not be
reading these lines. Please let me know of any necessary improvements and I highly appreciate
all forms of contributions in form of errata, examples, or text blocks. Contributions can be
deposited directly in the following Google Doc sheet.
Major errors that are detected after the lecture of the corresponding semester started are
listed www.math.uzh.ch/furrer/download/sta120/errata.txt.
Reinhard Furrer
February 2023
viii
Prologue
Chapter 1
Exploratory Data Analysis and
Visualization of Data
Learning goals for this chapter:
⋄ Understand the concept and the need of an exploratory data analysis (EDA)
within a statistical data analysis
⋄ Know different data types and operations we can perform with them
⋄ Calculate different descriptive statistics from a dataset
⋄ Perform an EDA in R
⋄ Sketch schematically, plot with R and interpret a histogram, barplot, boxplot
⋄ Able to visualize multivariate data (e.g. scatterplot) and recognize special
features therein
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter01.R.
It is surprisingly difficult to start a statistical study from scratch (somewhat similar as starting
the first chapter of a book). Hence to get started, we assume a rather pragmatic setup: suppose
we have “some” data. This chapter illustrates the first steps thereafter: exploring and visualizing
the data. The exploration consists of understanding the types and the structure of the data,
including its features and peculiarities. A valuable graphical visualization should quickly and
unambiguously transmit its message. Depending on the data and the intended message, different
types of graphics can be used — but all should be clear, concise and stripped of clutter. Of
course, much of the visualization aspects are not only used at the beginning of the study but are
also used after the statistical analysis. Figure 1.1 shows one representation of a statistical data
analysis flowchart and we discuss in this chapter the right most box of the top line, performing
an exploratory data analysis. Of course, many elements thereof will be again very helpful when
1
2
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
we summarize the results of the statistical analysis (part of the bottom right most box in the
workflow). Subsequent chapters come back to questions we should ask ourselves before we start
collecting data, i.e., before we start an experiment, and how to conduct the statistical data
analysis.
Hypothesis to
investigate/
Phenomena to study
Design the
experiment
Propose statistical
model
Collect data, do
the experiment
Fit the model
(e.g., estimation)
Perform exploratory
data analysis
Validate the model
Summarize results,
and conclude
Figure 1.1: Data analysis workflow seen from a statistical perspective. There might be
situations where an EDA shows that no statistical analysis is necessary (dashed arrow
on the right). Similarly, model validation may indicate that the proposed model is not
adequate (dashed back-pointing arrow).
The workflow is of course not always as linear as indicated, even trained statisticians may
need to revisit statistical models used. Moreover, the workflow is just an extract of the scientific
cycle, or the scientific method: conclusions lead to new or refined hypotheses that require again
data and analyses.
1.1
Structure and Labels of Data
At the beginning of any statistical data analysis, an exploratory data analysis (EDA) should
be performed (Tukey, 1977). An EDA summarizes the main characteristics of the data by
representing observations or measured values graphically and describing them qualitatively and
quantitatively. Each dataset tells us a ‘story’ that we should try to understand before we begin
with the analysis. To do so, we should ask ourselves questions like
• What is the data collection or data generating process?
(discussed in Section 1.1.1)
• What types of data do we have?
(discussed in Section 1.1.2)
• How many data points/missing values do we have?
(discussed in Section 1.2)
• What are the key summary statistics of the data?
(discussed in Sections 1.2 and 1.3.1)
• What patterns/features/clusters exist in the data?
(discussed in Section 1.3.2)
At the end of a study, results are often summarized graphically because it is generally easier
to interpret and understand graphics than values in a table. As such, graphical representation
of data is an essential part of statistical analysis, from start to finish.
1.1. STRUCTURE AND LABELS OF DATA
1.1.1
3
Accessing the Data
Assuming that a data collection process is completed, the “analysis” of this data is one of the
next steps. This analysis is typically done in an appropriate software environment. There are
many of such but our prime choice is R (R Core Team, 2020), often used alternatives are SPSS,
SAS, Minitab, Stat, Prism besides other general purpose programming languages and platforms.
Appendix A gives links to R and to some R resources. We assume now a running version of R.
The first step of the analysis is loading data in the software environment. This task sounds
trivial and for pre-processed and readily available datasets often is. Cleaning own and others’
data is typically very painful and eats up much unanticipated time. We do load external data
but in this book we will not cover the aspect of data cleaning — be aware of this step when
planning your analysis.
When storing own data it is recommended to save it in a tabular, comma-separated values
format, typically called a CSV file. Similarly, the metadata should be provided in a separate text
file. This ensures that the files are readable on “all” computing environments with open source
software and remain so in the foreseeable future. The next two examples illustrate how to access
data in R.
Example 1.1. There are many datasets available in R, the command data() lists all that
are currently available on your computing system. Additional datasets are provided by many
packages. Suppose the package spam is installed, e.g., by executing install.packages("spam"),
then the function call data(package="spam") lists the datasets included in the package spam. A
specific dataset is then loaded by calling data() with argument the name of the dataset (quoted
or unquoted work both). (The command data( package=.packages( all.available=TRUE))
would list all datasets from all R packages that are installed on your computing system.)
♣
Example 1.2. Often, we will work with own data and hence we have to “read-in” (or load or
import) the data, e.g., with R functions read.table(), read.csv(). In R-Code 1.1 we readin observations of mercury content in lake Geneva sediments. The data is available at www.
math.uzh.ch/furrer/download/sta120/lemanHg.csv and is stored in a CSV file, with observation
number and mercury content (mg/kg) on individual lines (Furrer and Genton, 1999). After
importing the data, it is of utmost importance to check if the variables have been properly
read, that (possible) row and column names are correctly parsed. Possible R functions for this
task are str(), head(), tail(), or visualizing the entire dataset with View(). The number of
observations and variables should be checked (if known), for example with dim() for matrices (a
two-by-two arrangement of numbers) and dataframes (a handy tabular format of R), or length()
for vectors or from the output of str(). Note that in subsequent examples we will not always
display the output of all these verification calls to keep the R display to a reasonable length.
We limit the output to pertinent calls such that the given R-Code can be understood by itself
without the need to run it in an R session (although this is recommended).
R-Code 1.1 includes also a commented example code that illustrates how the format of the
imported dataset changes when arguments of importing functions (here read.csv()) are not
properly set (commented second but last line).
♣
4
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
R-Code 1.1 Loading the ‘lemanHg’ dataset in R.
Hg.frame <- read.csv('data/lemanHg.csv')
str( Hg.frame)
# dataframe with 1 numeric column and 293 observations
## 'data.frame': 293 obs. of 1 variable:
## $ Hg: num 0.17 0.21 0.06 0.24 0.35 0.14 0.08 0.26 0.23 0.18 ...
head( Hg.frame, 3)
# column name is 'Hg'
##
Hg
## 1 0.17
## 2 0.21
## 3 0.06
Hg <- Hg.frame$Hg
str( Hg)
##
# equivalent to `Hg.frame[,1]` or `Hg.frame[,"Hg"]`
# now we have a vector
num [1:293] 0.17 0.21 0.06 0.24 0.35 0.14 0.08 0.26 0.23 0.18 ...
sum( is.na( Hg))
# check if there are NAs, alt: `any( is.na( Hg))`
## [1] 0
# str( read.csv('data/lemanHg.csv', header=FALSE))
# Wrong way to import data. Result is a `factor` not `numeric`!
The tabular format we are using is typically arranged such that each observation (data for
a specific location or subject or time point) that may consist of several variables or individual
measurements (different chemical elements or medical conditions) is on one line. We call the
entirety of these data points a dataset.
When analyzing data it is always crucial to query the origins of the data. In the following,
we state some typical questions that we should ask ourselves as well as resulting consequences
or possible implications and pitfalls. “What was the data collected for?”: The data may be
of different quality if collected by scientists using it as their primary source compared to data
used to counter ‘fake news’ arguments. Was the data rather observational or from a carefully
designed experiment. The latter often being more representative and less prone to biases due to
the chosen observation span. “Has the data been collected from different sources?”: This might
imply heterogeneous data because the scientists or labs have used different standards, protocols
or tools. In case of specific questions about the data, it may not be possible to contact all original
data owners. “Has the data been recorded electronically?”: If electronic data recording has lower
chances of erroneous entries. No human is perfect (especially reading off and entering numbers)
and such datasets are more prone to contain errors compared to automatically recorded ones.
In the context of an educational data analysis, the following questions are often helpful “Is
the data a classical textbook example?”: In such setting, we typically have cleaned data that
serve to illustrate one (or at most a couple) pedagogical concepts. The lemanHg dataset used in
this chapter is such a case, no negative surprises to be expected. “Has the data has been analyzed
elsewhere before?”: such data have been typically cleaned and we already have one reference for
1.1. STRUCTURE AND LABELS OF DATA
5
an analysis. “Are the data stemming from simulated data?”: in such cases, there is a known
underlying generation processes.
Of in all the cases mentioned before we can safely apply the proverb “the exception proves
the rule”.
1.1.2
Types of Data
Presumably we all have a fairly good idea of what data is. However, often this view is quite
narrow and boils down to data are numbers. But “data is not just data”: data can be hard or
soft, quantitative or qualitative.
Hard data is associated with quantifiable statements like “The height of this female is 172
cm.” Soft data is often associated with subjective statements or fuzzy quantities requiring interpretation, such as “This female is tall”. Probability statements can be considered hard (derived
from hard data) or soft (due to a lack of quantitative values). In this book, we are especially
concerned with hard data.
An important distinction is whether data is qualitative or quantitative in nature. Qualitative
data consists of categories and are either on nominal scale (e.g., male/female) or on ordinal
scale (e.g., weak<average<strong, nominal with an ordering). Quantitative data is numeric and
mathematical operations can be performed with it.
Quantitative data is either discrete, taking on only specific values (e.g., integers or a subset
thereof), or continuous, taking on any value on the real number line. Quantitative data is
measured on an interval scale or ratio scale. Unlike the ordinal scale, the interval scale is
uniformly spaced. The ratio scale is characterized by a meaningful absolute zero in addition to the
characteristics of the interval scale. Depending on the measurement scale, certain mathematical
operators and thus summary measures or statistical measures are appropriate. The measurement
scales are classified according to Stevens (1946) and summarized in Table 1.1. We will discuss
the statistical measures based on data next and their theoretical counterparts and properties in
later chapters.
Non-numerical data, often gathered from open-ended responses or in audio-visual form, is
considered qualitative. We will not discuss such type of data here.
Example 1.3. The classification of elements as either “C” or “H” results in a nominal variable. If
we associate “C” with cold and “H” with hot we can use an ordinal scale (based on temperature).
In R, nominal scales are represented with factors. R-Code 1.2 illustrates the creation of
nominal and interval scales as well as some simple operations. It would be possible to create
ordinal scales as well, but we will not use it in this book.
When measuring temperature in Kelvin (absolute zero at −273.15◦ C), a statement such as
“The temperature has increased by 20%” can be made. However, a comparison of twice as hot
(in degrees Celsius) does not make sense as the origin is arbitrary.
♣
6
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
Table 1.1: Types of scales according to Stevens (1946) and possible mathematical
operations. The statistical measures are for a description of location and spread.
Measurement scale
nominal ordinal interval
Mathematical
operators
=, ̸=
=, ̸=
<, >
=, ̸=
<, >
−, +
mode
mode
median
mode
median
arithmetic mean
location
Statistical
measures
range
spread
standard deviation
real
=, ̸=
<, >
−, +
×, /
mode
median
arithmetic mean
geometric mean
range
studentized range
standard deviation
coefficient of variation
R-Code 1.2 Example of creating ordinal and interval scales in R.
ordinal <- factor( c("male","female"))
ordinal[1] == ordinal[2]
## [1] FALSE
# ordinal[1] > ordinal[2]
interval <- c(2, 3)
interval[1] > interval[2]
# warning ‘>’ not meaningful for factors
## [1] FALSE
1.2
Descriptive Statistics
With descriptive statistics we summarize and describe quantitatively various aspects and features
of a dataset.
Although rudimentary information, very basic summary of the data is its size: number of
observations, number of variables, and of course their types (see output of R-Code 1.1). Another
important aspect is the number of missing values which deserve careful attention. In R a missing
value is represented as NA in contrast to NaN (not a number) or Inf (for “infinity”). One has to
evaluate if the missing values are due to some random mechanism, emerge consistently or with
some deterministic pattern, appear in all variables, for example.
For a basic analysis one often neglects the observations if missing values are present in any
variable. There exist techniques to fill in missing values but these are quite complicated and not
treated here.
As a side note, with a careful inspection of missing values in ozone readings, the Antarctic
7
1.2. DESCRIPTIVE STATISTICS
“ozone hole” would have been discovered more than one decade earlier (see, e.g., en.wikipedia.
org/wiki/Ozone_depletion#Research_history).
Informally a statistic is a single measure of some attribute of the data, in the context of this
chapter a statistic gives a good first impression of the distribution of the data. Typical statistics
for the location (i.e., the position of the data) include the sample mean, truncated/trimmed
mean, sample median, sample quantiles and quartiles. The trimmed mean omits a small fraction
of the smallest and the same small fraction of the largest values. A trimming of 50% is equivalent
to the sample median. Sample quantiles or more specifically sample percentiles link observations
or values with the position in the ordered data. For example, the sample median is the 50thpercentile, half the data is smaller than the median, the other half is larger. The 25th- and
75th-percentile are also called the lower and upper quartiles, i.e., the quartiles divide the data in
four equally sized groups. Depending on the number of observations at hand, arbitrary quantiles
are not precisely defined. In such cases, a linearly interpolated value is used, for which the precise
interpolation weights depend on the software at hand. It is important to know this potential
ambiguity less important to know the exact values of the weights.
Typical statistics for the spread (i.e., the dispersion of the data) include the sample variance,
sample standard deviation (square root of the sample variance), range (largest minus smallest
value), interquartile range (third quartile minus the first quartile), studentized range (range divided by the standard deviation, representing the range of the sample measured in units of sample
standard deviations) and the coefficient of variation (sample standard deviation divided by the
sample mean). Note that the studentized range and the coefficient of variance are dimension-less
and should only be used with ratio scaled data.
We now introduce mathematical notation for several of these statistics. For a univariate
dataset the observations are written as x1 , . . . , xn (or with some other latin letter), with n being
the sample size. The ordered data (smallest to largest) is denoted with x(1) ≤ · · · ≤ x(n) . Hence,
we use the following classical notation:
sample mean:
sample median:
x=
n
X
xi ,
(1.1)
i=1

x
(n/2+1/2) ,
med(xi ) =
 1 (x
+x
2
(n/2)
if n odd,
(n/2+1) ), if n even,
(1.2)
n
sample variance:
sample standard deviation:
1 X
(xi − x)2 ,
n−1
i=1
√
s = s2 .
s2 =
(1.3)
(1.4)
If the context is clear, we may omit sample. Note that some books also use the quantifier
empirical instead of sample. The symbols for the sample mean, sample variance and sample
standard deviation are quite universal. However, this is not the case for the median. We use the
notation med(x1 , . . . , xn ) or med(xi ) if no ambiguities exist.
8
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
Example 1.4 (continued from Example 1.2). In R-Code 1.3 several summary statistics for 293
observations of mercury in lake Geneva sediments are calculated (see R-Code 1.1 to load the
data). Of course all standard descriptive statistics are available as predefined functions in R. ♣
R-Code 1.3 A quantitative EDA of the ‘lemanHg’ dataset.
c( mean=mean( Hg), tr.mean=mean( Hg, trim=0.1), median=median( Hg))
##
mean tr.mean median
## 0.46177 0.43238 0.40000
c( var=var( Hg), sd=sd( Hg), iqr=IQR( Hg))
# capital letters for IQR!
##
var
sd
iqr
## 0.090146 0.300243 0.380000
summary( Hg)
##
##
Min. 1st Qu.
0.010
0.250
# min, max, quartiles and mean
Median
0.400
range( Hg)
Mean 3rd Qu.
0.462
0.630
Max.
1.770
# min, max, but not the difference
## [1] 0.01 1.77
tail( sort( Hg))
# sorts and then list the 6 largest values
## [1] 1.28 1.29 1.30 1.31 1.47 1.77
For discrete data, the sample mode is the most frequent value of a sample frequency distribution; in order to calculate the sample mode, only the operations {=, ̸=} are necessary. Continuous
data are first divided into categories (by discretization or binning) and then the mode can be
determined.
In subsequent chapters, we will discuss (statistical) properties of these different statistics.
Then, it will be important to emphasize when we are referring to the sample mean or to the
theoretical mean of our statistical model (also known as the expectation).
Another important aspect of EDA is the identification of outliers, which are defined (verbatim
from Olea, 1991): “In a sample, any of the few observations that are separated so far in value
from the remaining measurements that the questions arise whether they belong to a different
population, or that the sampling technique is faulty. The determination that an observation is
an outlier may be highly subjective, as there is no strict criteria for deciding what is and what is
not an outlier”. Graphical respresentations of the data often help in the identification of outliers,
as we will see in the next section.
1.3
Visualizing Data
In the remaining part of the chapter, we discuss how to “graphically” present and summarize
data. With graphically we mean a visualization using a graph, plot, chart or even a diagram. The
1.3. VISUALIZING DATA
9
exact definition and meaning of the latter four terms is not universal but a rough characterization
is as follows. A diagram is the most generic term and is essentially a symbolic representation of
information (Figure 1.1 being an example). A chart is a graphical representation of data (a pie
chart being an example). A plot is a graphical representation of data in a coordinate system (we
will discuss histograms, boxplots, scatterplots, and more below). Finally, a graph is “continuous”
line representing data in a coordinate system (for example, visualizing a function).
Remark 1.1. There are several fundamentally different approaches to creating plots in R: base
graphics (package graphics, which is automatically loaded upon starting R), trellis graphics (packages lattice and latticeExtra), and the grammar of graphics approach (package
ggplot2). In this book we focus on base graphics. This approach is in sync with the R source
code style and we have a clear direct handling of all elements. ggplot functionality may produce seemingly fancy graphics at the price of certain black box elements and more complex code
structure.
♣
1.3.1
Graphically Represent Univariate Data
Univariate data is composed of one variable, or a single scalar component, measured at several
instances or for several individuals or subjects. Graphical depiction of univariate data is usually
accomplished with bar plots, histograms, box plots, or Q-Q plots among others. We now look at
these examples in more details.
Ordinal data, nominal data and count data are often represented with bar plots (also called
bar charts). The height of the bars is proportional to the frequency of the corresponding value.
The German language discriminates between bar plots with vertical and horizontal orientation
(‘Säulendiagramm’ and ‘Balkendiagramm’).
Example 1.5. R-Code 1.4 and Figure 1.2 illustrate bar plots with data giving aggregated CO2
emissions from different sources (transportation, electricity production, deforestation, . . . ) in the
year 2005, as presented by the SWISS Magazine 10/2011-01/2012, page 107 (SWISS Magazine,
2011) and also shown in Figure 1.11. Here and later we may use a numerical specification of the
colors which are, starting from one to eight, black, red, green, blue, cyan, magenta, yellow and
gray.
Note that the values of such emissions vary considerably according to different sources, mainly
due to the political and interest factors associated with these numbers.
♣
Histograms illustrate the frequency distribution of observations graphically and are easy to
construct and to interpret (for a specified partition of the x-axes, called bins, observed counts or
proportions are recorded). Histograms allow one to quickly assess whether the data is symmetric
or rather left-skewed (more smaller values on the left side of the bulk of the data) or right-skewed
(more larger values on the right side of the bulk of the data), whether the data is unimodal (has
rather one dominant peak) or several or whether exceptional values are present. Several valid
rules of thumb exist for choosing the optimal number of bins. However, the number of bins is a
subjective choice that affects the look of the histogram, as illustrated in the next example.
10
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
R-Code 1.4 Representing emissions sources with barplots. (See Figure 1.2.)
100
dat <- c(2, 15, 16, 32, 25, 10)
# see Figure 1.11
emissionsource <- c('Air', 'Transp', 'Manufac', 'Electr', 'Deforest', 'Other')
barplot( dat, names=emissionsource, ylab="Percent", las=2)
barplot( cbind('2005'=dat), col=c(2,3,4,5,6,7), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', xlim=c(0.2,4))
15
60
Percent
20
40
25
Percent
Other
Deforest
Electr
Manufac
Transp
Air
80
30
20
10
5
Other
Deforest
Electr
Manufac
Transp
Air
0
0
2005
Figure 1.2: Bar plots: juxtaposed bars (left), stacked (right) of CO2 emissions according to different sources taken from SWISS Magazine (2011). (See R-Code 1.4.)
Example 1.6 (continued from Examples 1.2 and 1.4). R-Code 1.5 and the associated Figure 1.3
illustrate the construction and resulting histograms of the mercury dataset. The different choices
illustrate that the number of bins needs to be carefully chosen. Often, the default values work
well, which is also the case here. In one of the histograms, a “smoothed density” has been
superimposed. Such curves will be helpful when comparing the data with different statistical
models, as we will see in later chapters. When adding smoothed densities, the histogram has
to represent the fraction of data in each bin (argument probability=TRUE) as opposed to the
frequency (the default).
The histograms in Figure 1.3 shows that the data is unimodal, right-skewed, no exceptional
values (no outliers). Important statistics like mean and median can be added to the plot with
vertical lines. Due to the slight right-skewedness, the mean is slightly larger than the median.
♣
When constructing histograms for discrete data (e.g., integer values), one has to be careful
with the binning. Often it is better to manually specify the bins. To represent the result of several
dice tosses, it would be advisable to use hist( x, breaks=seq( from=0.5, to=6.5, by=1)),
or possibly use a bar plot as explained above. A stem-and-leaf plot is similar to a histogram, in
which each individual observation is marked with a single dot (stem() in R). Although this plot
gives the most honest representation of data, it is rarely used today, mainly beacuse it does not
work for very large sample sizes. Figure 1.4 gives an example.
11
1.3. VISUALIZING DATA
R-Code 1.5 Different histograms (good and bad ones) for the ‘lemanHg’ dataset. (See
Figure 1.3.)
histout <- hist( Hg)
# default histogram
hist( Hg, col=7, probability=TRUE, main="With 'smoothed density'")
lines( density( Hg))
# add smooth version of the histogram
abline( v=c(mean( Hg), median( Hg)), col=3:2, lty=2:3, lwd=2:3)
hist( Hg, col=7, breaks=90, main="Too many bins")
hist( Hg, col=7, breaks=2, main="Too few bins")
str( histout[1:3])
# Contains essentially all information of histogram
## List of 3
## $ breaks : num [1:10] 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8
## $ counts : int [1:9] 58 94 61 38 26 9 5 1 1
## $ density: num [1:9] 0.99 1.604 1.041 0.648 0.444 ...
0.5
1.0
1.5
0.0
0.5
1.0
Hg
Too many bins
Too few bins
Frequency
0
5
0
1.5
100 200
Hg
10 15
0.0
Frequency
1.0
0.0
40
Density
80
With 'smoothed density'
0
Frequency
Histogram of Hg
0.0
0.5
1.0
Hg
1.5
0.0
0.5
1.0
1.5
2.0
Hg
Figure 1.3: Histograms of mercury data with various bin sizes. In the top left panel,
the smoothed density is in back, the mean and median are in green dashed and red
dotted vertical lines, respectively. (See R-Code 1.5.)
A boxplot or a box-and-wishker plot is a graphical representation of five location statistics of
the observations: median, the lower and upper quartiles and the minimum and maximum values.
A box ranging extending to both quartiles contains thus half of the data values. Typically,
all points smaller than the first quartile minus 1.5·IQR and larger than the third quartile plus
12
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
Figure 1.4: Stem-and-leaf plot of residents of the municipality of Staldenried as of
31.12.1993 (Gemeinde Staldenried, 1994).
1.5·IQR are marked individually. The closest non-marked observations to these bounds are called
the whiskers.
A violin plot combines the advantages of the box plot and a “smoothed” histogram by essentially merging both. Compared to a boxplot a violin plot depicts possible multi-modality
and for large datasets de-emphasizes the marked observations outside he whiskers (often termed
“outliers” but see our discussion in Chapter 7).
Example 1.7 (continued from Examples 1.2, 1.4 and 1.6). R-Code 1.6 and Figure 1.5 illustrates
the boxplot and violin plot for the mercury data. Due to the right-skewedness of the data, there
are several data points beyond the upper whisker. Notice that the function boxplot() has several
arguments for tailoring the appearance of the box plots. These are discussed in the function’s
help file.
♣
A quantile-quantile plot (QQ-plot) is used to visually compare the ordered sample, also called
sample quantiles, with the quantiles of a theoretical distribution. For the moment we think of
this theoretical distribution in form of a “smoothed density” similar to the superimposed line
in the top right panel of Figure 1.3. We will talk more about “theoretical distributions” in the
next two chapters. The theoretical quantiles can be thought of as the n values of a “perfect”
realization. If the points of the QQ-plot are aligned along a straight line, then there is a good
13
1.3. VISUALIZING DATA
R-Code 1.6 Boxplot and violin plot for the ‘lemanHg’ dataset. (See Figure 1.5.)
out <- boxplot( Hg, col="LightBlue", ylab="Hg", outlty=1, outpch='')
# 'out' contains numeric values of the boxplot
quantile( Hg, c(0.25, 0.75)) # compare with summary( Hg) and out["stats"]
## 25% 75%
## 0.25 0.63
IQR( Hg)
# interquartile range
## [1] 0.38
quantile(Hg, 0.75) + 1.5 * IQR( Hg)
# upper boundary of the whisker
## 75%
## 1.2
Hg[ quantile(Hg, 0.75) + 1.5 * IQR( Hg) < Hg] # points beyond the whisker
## [1] 1.25 1.30 1.47 1.31 1.28 1.29 1.77
1.5
1.0
0.5
0.0
Hg
1.0
0.5
0.0
Hg
1.5
require(vioplot)
# R package providing violin plots
vioplot( Hg, col="Lightblue", ylab="Hg")
1
Figure 1.5: Box plots and violin plot for the ‘lemanHg’ dataset. (See R-Code 1.6.)
match between the sample and theoretical quantiles. A deviation of the points indicate that the
sample has either too few or too many small or large points. Figure 1.6 illustrates a QQ-plot
based on nine data points (for the ease of understanding) with quantiles from a symmetric (a)
and a skewed (b) distribution respectively. The last panel shows the following cases with respect
to the reference quantiles: (c1) sample has much smaller values than expected; (c2) sample has
much larger values than expected; (c3) sample does not have as many small values as expected;
(c4) sample does not have as many large values as expected.
Notice that the QQ-plot is invariant with respect to changing the location or the scale of the
sample or of the theoretical quantiles. Hence, the omission of the scales in the figure.
Example 1.8 (continued from Examples 1.2, 1.4 to 1.7). R-Code 1.7 and Figure 1.7 illustrate
a QQ-plot for the mercury dataset by comparing it to a “bell-shaped” and a right-skewed distribution. The two distributions are called normal or Gaussian distribution and chi-squared
distribution and will be discussed in subsequent chapters. To “guide-the-eye”, we have added the
a line passing through the lower and upper quartile. The right panel shows a suprisingly good
14
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
(a)
(b)
(c1)
(c2)
(c3)
(c4)
Figure 1.6: (left) QQ-plots with a symmetric (a) and a right-skewed (b) theoretical
distribution. The histogram of the data and the theoretical density are shown in the
margins in green. (c): schematic illustration for the different types of deviations. See
text for an interpretation.
fit.
♣
R-Code 1.7 QQ-plot of the ‘lemanHg’ dataset. (See Figure 1.7.)
qqnorm( Hg)
# QQplot with comparing with bell-shaped theoretical
qqline( Hg, col=2, main='') # add read line
theoQuant <- qchisq( ppoints( 293), df=5) # minor mystery for the moment
# hist( theoQuant, prob=TRUE); lines(density(theoQuant)) # convince yourself
qqplot( theoQuant, Hg, xlab="Theoretical quantiles")
# For 'chisq' some a priori knowledge was used, for 'df=5' minimal
# trial and error was used.
qqline( Hg, distribution=function(p) qchisq( p, df=5), col=2)
1.3.2
Visualizing Multivariate Data
Multivariate data means that two or more variables are collected for each observation and are
of interest. Additionally to the univariate EDA applied for each variable, visualization of multivariate data is often accomplished with scatterplots (plot(x,y) for two variables or pairs() for
several) so that the relationship between pairs of variables is illustrated. For three variables, an
interactive visualization based on the function plot3d() from the package rgl might be helpful.
In a scatterplot, “guide-the-eye” lines are often included. In such situation, some care is
needed as there is an perception of asymmetry between y versus x and x versus y. We will
discuss this further in Chapter 9.
In the case of several frequency distributions, bar plots, either stacked or grouped, may
also be used in an intuitive way. See R-Code 1.8 and Figure 1.8 for two slightly different par-
15
1.3. VISUALIZING DATA
1.5
1.0
0.0
0.5
Hg
1.0
0.5
0.0
Sample Quantiles
1.5
Normal Q−Q Plot
−3
−2
−1
0
1
2
3
0
5
Theoretical Quantiles
10
15
20
Theoretical quantiles
Figure 1.7: QQ-plots using the normal distribution (left) and a so-called chi-squared
distribution with five degrees of freedom (right). The red line passes through the lower
and upper quantiles of both the sample and theoretical distribution. (See R-Code 1.7.)
titions of CO2 emission sources, based on (SWISS Magazine, 2011) and www.c2es.org/factsfigures/international-emissions/sector (approximate values).
R-Code 1.8 Bar plots for emissions by sectors for the year 2005 from two sources. (See
Figure 1.8.)
dat2 <- c(2, 10, 12, 28, 26, 22) # source c2es.org
mat <- cbind( SWISS=dat, c2es.org=dat2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(0.2,5), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', las=2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(1,30), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', beside=TRUE, las=2)
100
Percent
80
60
40
30
Air
Transp
Manufac
Electr
Deforest
Other
25
Percent
Other
Deforest
Electr
Manufac
Transp
Air
20
15
10
20
5
c2es.org
SWISS
c2es.org
0
SWISS
0
Figure 1.8: CO2 emissions by sectors visualized with bar plots for the year 2005 from
two sources (left: stacked, right: grouped). (See R-Code 1.8.)
16
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
Example 1.9. In this example we look at a more complex dataset containing several measurements from different penguin species collected over three years on three different islands. The
package palmerpenguins provides the CSV file called penguins.csv in the package directory
extdata. The dataset provides bill length, bill depth, flipper length, (all in millimeters) and
body mass (in gram) of three penguin species Adelie, Chinstrap, Gentoo, next to their habitat
island, sex and the year when the penguin has been sampled (Horst et al., 2020).
R-Code 1.9 loads the data, performs a short EDA and provides the code for some representative figures. We deliberately use more arguments for several plotting functions to highlight
different features in the data or of the R functions. The eight variables are on the nominal (e.g.,
species), interval (year) and real scale (e.g., bill_length_mm). Two of the length measurements have been rounded to integers (flipper_length_mm and body_mass_g). For two out of
the 344 penguins we do not have any measures and sex is missing for nine additional penguins
(summary(penguins$sex)). Due to their natural habitat, not all penguins have been observed
on all three islands, as shown by the cross-tabulation with table().
Figure 1.9 shows some representative graphics. The first two panels show that, overall,
flipper length is bimodal with a more pronounced first mode. When we partition according to
the species, we notice that the first mode corresponds species Adelie and Chinstrap, the latter
slightly larger. Comparing the second mode with the violin plot of species Gentoo (middle panel)
we notice that some details in the violin plot are “smoothed” out.
The third panel shows a so-called mosaic plot, were we represent a two dimensional frequency
table, i.e., specifically a visualization of the aforementioned table(...) call. Here, in such
a mosaic plot, the width of each vertical band corresponds to the proportion of penguins on
the corresponding island. Each band is then divided according to the species proportion. For
example, we quickly realize that Adeline is home on all three islands and that on Torgersen island
the smallest sample has been taken. Empty cells are indicated with dashed lines. A mosaic plot
is a qualitative assessment only and the areas of the rectangles are harder to compare than the
corresponding numbers of the equivalent table() output.
The final plot is a scatterplot of the four length measures. Figure 1.9 shows that for some
variables specific species cluster well (e.g., Gentoo with flipper length), whereas other variables
are less separated (e.g., Adelie and Chinstrap with bill depth). To quickly increase information
content, we use different colors or plotting symbols according to the ordinary variables. With
the argument row1attop=FALSE we obtain a fully symmetric representation that allows quick
comparisons between both sides of the diagonal. Here, with a black-red coloring we see that
male penguins are typically larger. In fact, there is even a pretty good separation between
both sexes. Some care is needed that the annotations are not overdone: unless there is evident
clustering more than three different symbols or colors are often too much. R-Code 1.9 shows how
to redefine the different panels of a scatterplot (see help(pairs) for a more professional panel
definition). By properly specifying the diag.panel argument, it possible to add charts or plots
to the diagonal panels of pairs(), e.g., the histograms of the variables.
♣
R-Code 1.9: Constructing histograms, boxplots, violin plots and scatterplot with
penguing data. (See Figure 1.9.)
17
1.3. VISUALIZING DATA
require(palmerpenguins)
penguins <- read.csv(
# `path.package()` provides local location of pkg
paste0( path.package('palmerpenguins'),'/extdata/penguins.csv'))
str(penguins, strict.width='cut')
## 'data.frame': 344 obs. of
## $ species
: chr
## $ island
: chr
## $ bill_length_mm
: num
## $ bill_depth_mm
: num
## $ flipper_length_mm: int
## $ body_mass_g
: int
## $ sex
: chr
## $ year
: int
8 variables:
"Adelie" "Adelie" "Adelie" "Adelie" ...
"Torgersen" "Torgersen" "Torgersen" "Torgers"..
39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42..
18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2..
181 186 195 NA 193 190 181 195 193 190 ...
3750 3800 3250 NA 3450 3650 3625 4675 3475 42..
"male" "female" "female" NA ...
2007 2007 2007 2007 2007 2007 2007 2007 2007 ..
summary(penguins[, c(3:6)])
# others variables can be summarized by `table()`
##
##
##
##
##
##
##
##
bill_length_mm bill_depth_mm
Min.
:32.1
Min.
:13.1
1st Qu.:39.2
1st Qu.:15.6
Median :44.5
Median :17.3
Mean
:43.9
Mean
:17.2
3rd Qu.:48.5
3rd Qu.:18.7
Max.
:59.6
Max.
:21.5
NA's
:2
NA's
:2
flipper_length_mm body_mass_g
Min.
:172
Min.
:2700
1st Qu.:190
1st Qu.:3550
Median :197
Median :4050
Mean
:201
Mean
:4202
3rd Qu.:213
3rd Qu.:4750
Max.
:231
Max.
:6300
NA's
:2
NA's
:2
#penguins <- na.omit(penguins) # remove NA's
table(penguins[,c(2,1)])
# tabulating species on each island
##
species
## island
Adelie Chinstrap Gentoo
##
Biscoe
44
0
124
##
Dream
56
68
0
##
Torgersen
52
0
0
hist( penguins$flipper_length_mm, main='', xlab="Flipper length [mm]", col=7)
box()
# rectangle around for similar view with others
with(penguins, vioplot(flipper_length_mm[species=="Gentoo"],
flipper_length_mm[species=="Chinstrap"], flipper_length_mm[
species=="Adelie"], names=c("Gentoo", "Chinstrap", "Adelie"),
col=5:3, xlab="Flipper length [mm]", horizontal=TRUE, las=1))
mosaicplot( table( penguins[,c(2,1)]), col=3:5, main='', cex.axis=.75, las=1)
upper.panel <- function( x,y, ...)
# see `?pairs` for a better way
points( x, y, col=as.numeric(as.factor(penguins$species))+2,...)
lower.panel <- function( x,y, ...)
18
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
points( x, y, col=as.numeric(as.factor(penguins$sex)))
pairs( penguins[,c(3:6)], gap=0, row1attop=FALSE,
# scatterplot
lower.panel=lower.panel, upper.panel=upper.panel)
# pch=as.numeric(as.factor(penguins$island)) # clutters & is not helpful
60
Biscoe
50
Adelie
Adelie
species
30
40
Chinstrap
20
Chinstrap
Gentoo
10
Frequency
Dream Torgersen
0
Gentoo
170 180 190 200 210 220 230
170
Flipper length [mm]
180
190
200
210
220
island
230
Flipper length [mm]
45
55
170
190
210
230
4500
6000
35
210
230
3000
body_mass_g
14 16 18 20
170
190
flipper_length_mm
55
bill_depth_mm
35
45
bill_length_mm
14 16 18 20
3000
4500
6000
Figure 1.9: Visualization of palmerpenguins data. Top left to right: histogram for the
variable flipper length, violin plots for flipper length stratified according to species and
mosaic plot for species on each island (green: Adeline; blue Chinstrap; cyan: Gentoo).
Bottom: scatterplot matrix with colors for species (above the diagonal, colors as above)
and sex (below the diagonal, black female, read male). (See R-Code 1.9.)
For a dozen and more variables, scatterplots are no longer helpful as there are too many
panels and nontrivial underlying structures are hardly visible. Parallel coordinate plots are a
popular way of representing observations in high dimensions. Each variable is recorded along
19
1.3. VISUALIZING DATA
a vertical axis and the values of each observation are then connected with a line across the
various variables. That means that points in the usual (Euclidean) representation correspond to
lines in a parallel coordinate plot. In a classical version of the plot, all interval scaled variables
are normalized to [0, 1]. Additionally, the inclusion of nominal and ordinal variables and their
comparison with interval scaled variables is possible. The natural order of the variables in such
a plot may not be optimal for an intuitive interpretation.
Example 1.10. The dataset swiss (provided by the package datasets) contains 6 variables
(standardized fertility measure and socio-economic indicators) for 47 French-speaking provinces
of Switzerland at about 1888.
R-Code 1.10 and Figure 1.10 give an example of a parallel coordinate plot. The lines are
plotted according to the percentage of practicing catholics (the alternative being protestant).
Groups can be quickly detected and strong associations are spotted directly. For example,
provices with more catholics have a higher fertility rate or lower rates on examination (indicated
by color grouping). Provinces with higher fertility have also higher values in agriculture (lines
are in general parallel) whereas higher agriculture is linked to lower examination (lines typically
cross). We will revisit such associations in Chapter 9.
♣
R-Code 1.10 Parallel coordinate plot for the swiss dataset. (See Figure 1.10.)
dim( swiss)
## [1] 47
# in
package:datasets, available without the need to load.
6
# str( swiss, strict.width='cut')
# or even:
# head( swiss); summary( swiss)
require( MASS)
# package providing the function `parcoord()`
parcoord( swiss, col=2-(swiss$Catholic<40) + (swiss$Catholic>60))
Fertility
Agriculture
Examination
Education
Catholic
Infant.Mortality
Figure 1.10: Parallel coordinate plot of the swiss dataset. The provinces (lines) are
colored according to the percentage catholic (black less than 40%, red between 40% and
60%, green above 60%). (See R-Code 1.10.)
20
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
An alternative way to represent high-dimensional data is to project the data to two or three
dimensions, which can be represented with classical visualization tools. The basic idea of projection pursuit is to find a projection, which highlights an interesting structure of the dataset
(Friedman and Tukey, 1974). These projections are often varied and plotted continuously to find
as many possible structures, see Problem 1.7 for a guided example.
1.4
Constructing Good Graphics
A good graphic should immediately convey the essence without distraction and influential elements. Hence, when creating a figure, one should start thinking what is the message the figure is
conveying. Besides this fundamental rule additional basic guidelines of graphics and charts are:
• Construct honest graphs without hiding facts. Show all data, not omitting some observations or hiding information through aggregation. In R, it is possbile to construct a boxplot
based on three values, but such a representation would suggest to the reader the availability
of a larger amount of data. Thus hiding the fact that only three values are present.
• Construct graphs that are not suggestive. A classical deceptive example is to choose the
scale such that small variations are emphasized (zooming in) or the variations are obscured
by the large scale (zooming out). See bottom panel of Figure 1.11 where by starting the
y-axis at 3 a stronger decrease is suggested.
• Use appropriate and unambiguous labels with units clearly indicated. For presentations
and publications the labels have to be legible. See top left panel of Figure 1.11 where we
do not know the units.
• To compare quantities, one should directly represent ratios, differences or relative differences. When different panels need to be compared, they should have the same plotting
range.
• Carefully choose colors, hues, line width and type, symbols. These need to be explained in
the caption or a legend. Certain colors are more dominant or are directly associated with
emotions. Be aware of color blindness
• Never use three-dimensional renderings of lines, bars etc. The human eye can quickly
compare lengths but not angles, areas, or volumes.
• Never change scales mid axis. If absolutely necessary a second scale can be added on the
right or on the top.
Of course there are situations where it makes sense to deviate from individual bullets of the
list above. In this book, we carefully design our graphics but in order to keep the R-code to
reasonable length, we bend the above rules quite a bit.
From a good-scientific-practice point of view we recommend that figures should be constructed
such that they are “reproducible”. To do so
• create figures based on a script. The figures in this book can be reconstructed based on
sourcing the dedicated R scripts.
• do no post-process figures with graphics editors (e.g., with PhotoShop, ImageMagick)
1.4. CONSTRUCTING GOOD GRAPHICS
21
• as some graphics routines are based on random numbers, initiate the random number
generator before the construction (use set.seed(), see also Chapter 2).
Do not use pie charts unless absolutely necessary. Pie charts are often difficult to read. When
slices are similar in size it is nearly impossible to distinguish which is larger. Bar plots allow an
easier comparison, compare Figure 1.11 with either panel of Figure 1.2.
Figure 1.11: SWISS Magazine 10/2011-01/2012, 107.
Figures 1.12 and 1.11 are examples of badly designed charts and plots. The top panel Figure 1.12 contains an unnecessary 3-D rendering. The lower panel of the figure is still suboptimal
because of the shading not all information is visible. Depending on the intended message, a plot
instead of a graph would be more adequate.
22
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
Bad graphics can be found everywhere including in scientific journals. Figure 1.13 is a snapshot of the webpage http://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/ that includes
the issues of the figures and a discussion of possible improvements.
Figure 1.12: Bad example (above) and improved but still not ideal graphic (below).
Figures from university documents.
1.5
Bibliographic Remarks
A “complete” or representative list of published material about and tutorials on displaying information is beyond the scope of this section. Here are a few links to works that I consider
relevant.
Many consider John W. Tukey to be the founder and promoter of exploratory data analysis.
Thus his EDA book (Tukey, 1977) is often seen as the (first) authoritative text on the subject.
In a series of books, Tufte rigorously yet vividly explains all relevant elements of visualization
and displaying information (Tufte, 1983, 1990, 1997b,a). Many university programs offer lectures
on information visualization or similar topics. The lecture by Ross Ihaka is one example worth
mentioning: www.stat.auckland.ac.nz/ ihaka/120/lectures.html.
Friendly and Denis (2001) give an extensive historical overview of the evolvement of cartography, graphics and visualization. The document at euclid.psych.yorku.ca/SCS/Gallery/milestone/
1.5. BIBLIOGRAPHIC REMARKS
Figure 1.13: Examples of bad plots in scientific journals. The figure is taken
from www.biostat.wisc.edu/˜kbroman/topten_worstgraphs/. The website discusses
the problems with each graph and possible improvements (‘[Discussion]’ links).
23
24
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
milestone.pdf has active links for virtually endless browsing. See also the applet at www.datavis.ca/
milestones/.
There are many interesting videos available illustrating good and not-so-good graphics. For
example, www.youtube.com/watch?v=ajUcjR0ma4c. The websites qz.com/580859/the-mostmisleading-charts-of-2015-fixed/ and www.statisticshowto.com/misleading-graphs/ illustrate and
discuss misleading graphics.
The page en.wikipedia.org/wiki/List_of_statistical_software gives an extensive list of available statistical software programs and environments (from open-source to proprietary) and most
of these are compared at en.wikipedia.org/wiki/Comparison_of_statistical_packages.
The raw swiss data is available at https://opr.princeton.edu/archive/Download.aspx?FileID=
1113 and documentation at https://opr.princeton.edu/archive/Download.aspx?FileID=1116, see
also https://opr.princeton.edu/archive/pefp/switz.aspx.
1.6
Exercises and Problems
Problem 1.1 (Introduction to R/RStudio) The aim of this exercise is to get some insight
on the capabilities of the statistical software environment R and the integrated development
environment RStudio.
a) R has many built-in datasets, one example is volcano. Based on the help of the dataset,
what is the name of the Volcano? Describe the dataset in a few words.
b) Use the R help function to get information on how to use the image() function for plotting
matrices. Display the volcano data.
c) Install the package fields. Display the volcano data with the function image.plot().
What is the maximum height of the volcano?
d) Use the the R help function to find out the purpose of the function demo() and have a look
at the list of available demos. The demo of the function persp() utilizes the volcano data
to illustrate basic three-dimensional plotting. Call the demo and have a look at the plots.
Problem 1.2 (EDA of bivariate data) On www.isleroyalewolf.org/data/data/home.html the
file isleroyale_graph_data_28Dec2011.xlsx contains population data from wolves and moose.
Download the data from the STA120 course page. Have a look at the data.
a) Construct a boxplot and a QQ-plot of the moose and wolf data. Give a short interpretation.
b) Jointly visualize the wolves and moose data, as well as their abundances over the years.
Give a short interpretation of what you see in the figures. (Of course you may compare
the result with what is given on the aforementioned web page).
Problem 1.3 (EDA of trivariate data) Perform an EDA of the dataset www.math.uzh.ch/
furrer/download/sta120/lemanHgCdZn.csv, containing mercury, cadmium and zinc content in
sediment samples taken in lake Geneva.
1.6. EXERCISES AND PROBLEMS
25
Problem 1.4 (EDA of multivariate data) In this problem we want to explore the classical
mtcars dataset (directly available through the package datasets). Perform an EDA thereof and
provide at least three meaningful plots (as part of the EDA) and a short description of what
they display. In what measurement scale are the variables stored and what would be the natural
or original measurement scale?
Problem 1.5 (Beyond Example 1.9) In this problem we extend R-Code 1.9 and create further
graphics summarizing the palmerpenguins dataset.
a) What is mosaicplot( table( penguins[,c(2,1,8)])) displaying? Give an interpretation of the plot. Can you increase the clarity (of the information content) with colors?
Would it be possible to further display sex? Compared to the mosaic-plot of Figure 1.9 is
there any further insight?
b) Add the marginal histograms to the diagonal panels of the scatterplot matrix in Figure 1.9.
c) In the panels of Figure 1.9 only one additional variable is used for annotation. With the
upper- and lower-diagonal plots, it is possible to add two variables. Convince yourself that
adding additionally year or island is rather hindering interpretability than helpful. How
can the information about island and year be summarized/visualized?
Problem 1.6 (Parallel coordinate plot) Construct a parallel coordinate plot using the built-in
dataset state.x77. In the left and right margins, annotate the states. Give a few interpretations
that can be derived from the plot.
Problem 1.7 (Feature detection with ggobi) The open source visualization program ggobi may
be used to explore high-dimensional data (Swayne et al., 2003). It provides highly dynamic and
interactive graphics such as tours, as well as familiar graphics such as scatterplots, bar charts and
parallel coordinates plots. All plots are interactive and linked with brushing and identification.
The package rggobi provides a link to R.
a) Install the required software ggobi and R package rggobi and explore the tool with the
dataset state.x77.
b) The synthetic dataset whatfeature, available at www.math.uzh.ch/furrer/download/sta120/
whatfeature.RData has a hidden feature. Try to find it using projection pursuit in rggobi
and notice how difficult it is to find pronounced structures in small and rather lowdimensional datasets.
Problem 1.8 (BMJ Endgame) Discuss and justify the statements about ‘Skewed distributions’
given in doi.org/10.1136/bmj.c6276.
26
CHAPTER 1. EXPLORATORY DATA ANALYSIS AND VISUALIZATION OF DATA
Chapter 2
Random Variables
Learning goals for this chapter:
⋄ Describe in own words a cumulative distribution function (cdf), a probability
density function (pdf), a probability mass function (pmf), and a quantile
function
⋄ Schematically sketch, plot in R and interpret a pdf/pmf, a cdf, a quantile
function
⋄ Verify that a given function is a pdf/pmf or a cdf. Find a multiplicative
constant that makes a given function a pdf/pmf or cdf.
⋄ Pass from a cdf to a quantile function, pdf or pmf and vice versa
⋄ Given a pdf/pmf or cdf calculate probabilities
⋄ Give the definition and intuition of an expected value (E), variance (Var),
know the basic properties of E, Var used for calculation
⋄ Describe a binomial, Poisson and Gaussian random variable and recognize
their pmf/pdf
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter02.R.
Probability theory is the prime tool of all statistical modeling. Hence we need a minimal
understanding of the theory of probability in order to well understand statistical models, the
interpretation thereof, and so forth. Probability theory could (or rather should?) be covered
in entire books. Hence, we boil the material down to the bare minimum used in subsequent
chapters and thus to some more experienced reader, there seem many gaps in this chapter.
27
28
CHAPTER 2. RANDOM VARIABLES
2.1
Basics of Probability Theory
Suppose we want to (mathematically or stochastically) describe an experiment whose outcome is
perceived as “random”, the outcome is rather according to chance than to plan or determinism.
To do so, we need probability theory, starting with the definition of a probability function, then
random variable and properties thereof.
Assume we have a certain experiment. The set of all possible outcomes of this experiment
is called the sample space, denoted by Ω. Each outcome of the experiment ω ∈ Ω is called an
elementary event. A subset of the sample space Ω is called an event, denoted by A ⊂ Ω.
Example 2.1. Tossing two coins results in a sample space {HH, HT, TH, TT}. The event “at
least one head” {HH, HT, TH} consists of three elementary events.
♣
A probability measure is a function P : Ω → [0, 1], that assigns to an event A of the sample
space Ω a value in the interval [0, 1], that is, P(A) ∈ [0, 1], for all A ⊂ Ω. Such a function cannot
be chosen arbitrarily and has to obey certain rules that are required for consistency. For our
purpose, it is sufficient to link these requirements to Kolmogorov’s axioms. More precisely, a
probability function must satisfy the following axioms:
1. 0 ≤ P(A) ≤ 1, for every event A,
2. P(Ω) = 1,
P
S
3. P i Ai = i P(Ai ), for Ai ∩ Aj = ∅, i ̸= j.
In the last bullet, we only specify the index without indicating start and end, which means sum
P
over all possible indices i, say ni=1 , where n may be finite or countably infinite. (Similarly for
the union).
Summarizing informally, a probability is a function that assigns a value to each event of the
sample space constraint to:
1. the probability of an event is never smaller than zero or greater than one,
2. the probability of the whole sample space is one,
3. the probability of several events is equal to the sum of the individual probabilities, if the
events are mutually exclusive.
Probabilities are often visualized with Euler diagrams (Figure 2.1), which clearly and intuitively illustrate consequences of Kolmogorovs axioms, such as:
monotonicity
empty set
union of two events
complement of an event
conditional probability
law of total probability
if C ⊂ B =⇒ P(C) ≤ P(B),
P(∅) = 0
(2.1)
(2.2)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
(2.3)
P(B ) = P(Ω\B) = 1 − P(B),
(2.4)
c
P(A | B) =
P(A ∩ B)
,
P(B)
P(A) = P(A | B) P(B) + P(A | B c ) P(B c ).
(2.5)
(2.6)
29
2.1. BASICS OF PROBABILITY THEORY
For the second but last statement, we require that P(B) > 0 and conditioning is essentially
equivalent to reducing the sample space from Ω to B. The last statement can be written for
S
P
arbitrary number of events Bi with Bi ∩ Bj = ∅, i ̸= j and i Bi = Ω yielding P(A) = i P(A |
Bi ) P(Bi ).
Ω
B
D
C
A
Figure 2.1: Euler diagram where events are illustrated with ellipses. The magenta
area represents A ∩ B and the event C is in the event B, C ⊂ B.
We now need to formalize this figurative description of probabilities by introducing random
variables. A random variable is a function that assigns values to the elementary events of a random experiment, that is, these values or ranges of values are assumed with certain probabilities.
The outcomes of the experiment, i.e., the values are called realizations of the random variable.
The following definition introduces a random variable and gives a (unique) characterization
of random variables. In subsequent sections, we will see additional characterizations. These,
however, will depend on the type of values the random variable takes.
Definition 2.1. Let P(·) be a probability measure. A random variable X is a measurable
function from the sample space Ω to R and represents a possible numerical outcome of an
experiment (Measurable in terms of the probability measure P(·)). The distribution function
(cumulative distribution function, cdf) of a random variable X is
F (x) = FX (x) = P(X ≤ x),
for all x ∈ R.
(2.7)
♢
Random variables are denoted with uppercase letters (e.g. X, Y ), while realizations (i.e.,
their outcomes, the observations of an experiment) are denoted by the corresponding lowercase
letters (x, y). This means that the theoretical “concept“ are denoted by uppercase letters. Actual
values or data, for example the columns in your dataset, would be denoted with lowercase letters.
The first two following statements are a direct consequence of the monotonicity of a probability measure and probability of empty set/sample space. The last requires a bit more work to
show but will be intuitive shortly.
Property 2.1. A distribution function FX (x) is
1. monotonically increasing, i.e., for x < y, FX (x) ≤ FX (y);
2. normalized, i.e.
lim FX (x) = 0 and lim FX (x) = 1.
x↘−∞
x↗∞
30
CHAPTER 2. RANDOM VARIABLES
3. right-continuous, i.e., lim FX (x + ϵ) = FX (x), for all x ∈ R;
ϵ↘0
Remark 2.1. In more formal treatise, one typically introduces a probability space, being a
triplet (Ω, F, P), consisting of a sample space Ω, a σ-algebra F (i.e., a collection of subsets of Ω
including Ω itself, which is closed under complement and under countable unions) and a probability measure P. A random variable on a probability space is a measurable function from Ω to
the real numbers: X : Ω → R, ω 7→ X(ω). To indicate its dependence on elementary events, one
often writes the argument explicitly, e.g., P(X(ω) ≤ x).
♣
For further characterizations of random variables, we need to differentiate according to the
sample space of the random variables. The next two sections discuss the two essential settings.
2.2
Discrete Distributions
A random variable is called discrete when it can assume only a finite or countably infinite number
of values, as illustrated in the following two examples.
Example 2.2. Let X be the sum of the roll of two dice. The random variable X assumes
the values 2, 3, . . . , 12, with probabilities 1/36, 2/36, . . . , 5/36, 6/36, 5/36, . . . , 1/36. Hence, for
example, P(X ≤ 4) = 1/6, P(X < 2) = 0, P(X < 12.2) = P(X ≤ 12) = 1. The left panel
of Figure 2.2 illustrates the distribution function. This distribution function (as for all discrete
random variables) is piece-wise constant with jumps equal to the probability of that value. ♣
Example 2.3. A boy practices free throws, i.e., foul shots to the basket standing at a distance
of 15 ft to the board. Each shot is either a success or a failure, which can be coded as 1/0. He
counts the number of attempts it takes until he has a successful shot. We let the random variable
X be the number of throws that are necessary until the boy succeeds. Theoretically, there is no
upper bound on this number. Hence X can take the values 1, 2, . . . .
♣
Next to the cdf, another way of describing discrete random variables is the probability mass
function, defined as follows.
Definition 2.2. The probability mass function (pmf) of a discrete random variable X is defined
by fX (x) = P(X = x).
♢
In other words, the pmf gives probabilities that the random variables takes a precise single
value, whereas, as seen, the cdf gives probabilities that the random variables takes that or any
smaller value.
Figure 2.2 illustrates the cumulative distribution and the probability mass function of the
random variable X as given in Example 2.2. The jump locations and sizes (discontinuities) of
the cdf correspond to probabilities given in the right panel. Notice that we have emphasized the
right continuity of the cdf (see Proposition 2.1.3) with the additional dot.
31
2.2. DISCRETE DISTRIBUTIONS
It is important to note that we have a theoretical construct here. When tossing two dice
several times and reporting the frequencies of their sum, the corresponding plot (bar plot or
histogram with appropriate bins) does not exactly match the right panel of Figure 2.2. The
more tosses we take, the better the match is which we will discuss further in the next chapter.
R-Code 2.1 Cdf and pmf of X as given in Example 2.2. (See Figure 2.2.)
2
4
6
8
10
0.00 0.05 0.10 0.15 0.20
pi = fX(xi)
0.0 0.2 0.4 0.6 0.8 1.0
FX(x)
plot.ecdf( outer(1:6, 1:6, "+"),
# generating all possible outcomes
ylab=bquote(F[X](x)), main='', pch=20) # `bquote` for subscripts
x <- 2:12
# possible outcomes
p <- c(1:6, 5:1)/36
# corresponding probabilities
plot( x, p, type='h', xlim=c(1,13), ylim=c(0, .2),
xlab=bquote(x[i]), ylab=bquote(p[i]==f[X](x[i])))
points( x, p, pch = 20) # adding points for clarity
12
2
4
x
6
8
10
12
xi
Figure 2.2: Cumulative distribution function (left) and probability mass function
(right) of X = “the sum of the roll of two dice”. The two y-axes have a different scale.
(See R-Code 2.1.)
Property 2.2. Let X be a discrete random variable with probability mass function fX (x) and
cumulative distribution function FX (x). Then:
1. The probability mass function satisfies fX (x) ≥ 0 for all x ∈ R.
X
2.
fX (xi ) = 1.
i
3. FX (x) =
X
fX (xi ).
i:xi ≤x
4. The values fX (xi ) > 0 are the “jumps” in xi of FX (x).
5. The cumulative distribution function is a right-continuous step function.
32
CHAPTER 2. RANDOM VARIABLES
Points 3 and 4 of the property show that there is a one-to-one relation (also called a bijection)
between the cumulative distribution function and probability mass function. Given one, we can
construct the other.
There is of course no limitation on the number of different random variables. In practice,
we can often reduce our framework to some common distributions. We now look at two discrete
ones.
2.2.1
Binomial Distribution
A random experiment with exactly two possible outcomes (for example: heads/tails, male/female,
success/failure) is called a Bernoulli trial or Bernoulli random variable. For simplicity, we code
the sample space with ‘1’ (success) and ‘0’ (failure). The probability mass function is determined
by a single probability:
P(X = 1) = p,
P(X = 0) = 1 − p,
0 < p < 1,
(2.8)
where the cases p = 0 and p = 1 are typically not relevant.
Example 2.4. A single three throw (see also Example 2.3) can be modeled as a Bernoulli
experiment with success probability p. In practice, repeating throws might probably effect the
success probability. For simplicity, one often keeps the probability constant nevertheless.
♣
If a Bernoulli experiment is repeated n times (resulting in an n-tuple of zeros and ones),
the exact order of the successes are typically not important, only the total number. Hence, the
random variable X = “number of successes in n trials” is intuitive. The distribution of X is
called the binomial distribution, denoted with X ∼ Bin(n, p) and the following applies:
n k
P(X = k) =
p (1 − p)n−k ,
0 < p < 1, k = 0, 1, . . . , n.
(2.9)
k
Example 2.5. In this example we visualize a particular binomial random variable including a
hypothetical experiment. Suppose we draw with replacement 12 times one card from a deck of
36. The probability of having a face card in a single draw is thus 3/9. If X denotes the total
number of face cards drawn, we model X ∼ Bin(12, 1/3). To calculate the probability mass or
cumulative distribution function, R provides the construct of prefix and variate. Here, the latter
is binom. The prefixes are d for the probability mass function, p for the cumulative distribution
function, both illustrated in Figure 2.3.
When I have made the experiment, I’ve had three face cards; my son only had two. Instead
of asking more persons to make the same experiment, we ask R to do so, using the prefix r
with the variate binom. (This also implies that the deck is well mixed and all is added up
correctly). Figure 2.3 shows the counts (left) for 10 and frequencies (right) for 100 experiments,
i.e., realizations of the random variable. The larger the number of realizations, the closer the
match to the probability mass function in the top left corner (again, further discussed in the
next chapter). For example, P (X = 5) = 0.19, whereas 24 out of the 100 realizations “observed”
33
2.2. DISCRETE DISTRIBUTIONS
5 face cards and 24/100 = 0.24 is much closer to 0.19, compared to 1/10 = 0.1, when only ten
realizations are considered.
We have initialized the random number generator or R with a function call set.seed() to
obtain “repeatable” or reproducible results.
♣
R-Code 2.2 Density and distribution function of a Binomial random variable. (See Figure 2.3.)
plot( 0:12, dbinom(0:12, size=12, prob=1/3), type='h',
xlab='x', ylab=bquote(f[X](x)))
plot( stepfun( 0:12, pbinom(-1:12, size=12, prob=1/3)),
ylab=bquote(F[X](x)), verticals=FALSE, main='', pch=20)
set.seed( 14)
# same results if the following lines are evaluated again
print( x10 <- rbinom(10, size=12, prob=1/3 ))
# printing the 10 draws
##
[1] 3 5 7 4 8 4 6 4 4 3
barplot( table( factor(x10, levels=0:12)))
# barplot of the 10 draws
x100 <- rbinom(100, size=12, prob=1/3 )
# 100 draws
barplot( table( factor(x100, levels=0:12))/100) # frequency barplot
sum(x100==5)
# how many times five face cards?
## [1] 24
2.2.2
Poisson Distribution
The Poisson distribution gives the probability of a given number of events occurring in a fixed
interval of time if these events occur with a known and constant rate over time. One way
to formally introduce such a random variable is by defining it through its probability mass
function. Here and in other cases to follow, memorizing the exact form of the pdf is not necessary;
recognizing the stated pmf as the Poisson one is.
Definition 2.3. A random variable X, whose probability mass function is given by
P(X = k) =
λk
exp(−λ),
k!
0 < λ,
k = 0, 1, . . . ,
is said to follow a Poisson distribution with parameter λ, denoted by X ∼ Pois(λ).
(2.10)
♢
The Poisson distribution is also a good approximation for the binomial distribution with large
n and small p; as a rule of thumb if n > 20 and np < 10 (see Problem 3.3.a).
Example 2.6. Seismic activities are quite frequent in Switzerland, most of them are not of high
strength, luckily. The webpage ecos09.seismo.ethz.ch provides a portal to download information
about earthquakes in the vicinity of Switzerland along several other variables. From the page we
have manually aggregated the number of earthquakes with a magnitude exceeding 4 between 1980
34
0
2
4
6
8
10
0.0 0.2 0.4 0.6 0.8 1.0
FX(x)
0.10
0.00
fX(x)
0.20
CHAPTER 2. RANDOM VARIABLES
12
0
4
6
8
10
12
x
0
0.00
1
0.10
2
3
0.20
4
x
2
0 1 2 3 4 5 6 7 8 9
11
0 1 2 3 4 5 6 7 8 9
11
Figure 2.3: Top row: probability mass function and cumulative distribution function
of the binomial random variable X ∼ Bin(12, 1/3). Bottom row: observed counts for
10 repetitions and observed frequencies for 100 repetitions of the experiment. (See RCode 2.2.)
and 2005 (Richter magnitude, ML scale). Figure 2.4 (based on R-Code 2.3) shows a histogram of
the data with superimposed probability mass function of a Poisson random variable with λ = 2.
There is a surprisingly good fit. There are slightly too few years with a single event, with is
offset by a few too many with zero and two events. The value of λ was chosen to match visually
the data. In later chapters we see approaches to determine the best possible value (which would
be λ = 1.92 here).
♣
R-Code 2.3 Number of earthquakes and Poisson random variable. (See Figure 2.4.)
mag4 <- c(2,0,0,2,5,0,1,4,1,2,1,3,3,2,2,3,0,0,3,1,2,2,2,4,3) # data
hist( mag4, breaks=seq(-0.5, to=10), prob=TRUE, main="", col='gray')
points( 0:10, dpois(0:10, lambda=2), pch=19, col=4) # pmf of X~Pois(2)
2.3
Continuous Distributions
A random variable is called continuous if it can (theoretically) assume any value within one or
several intervals. This means that the number of possible values in the sample space is uncount-
35
0.20
0.10
0.00
Density
0.30
2.3. CONTINUOUS DISTRIBUTIONS
0
2
4
6
8
mag4
Figure 2.4: Number of earthquakes with a magnitude exceeding 4 between 1980 and
2005 as barplot with superimposed probabilities of the random variable X ∼ Pois(2).
(See R-Code 2.3.)
able infinite. Therefore, it is impossible to assign a positive probability value to (elementary)
events. Or, in other words, given such an infinite amount of possible outcomes, the likeliness
of one particular value being the outcome becomes zero. For this reason, we need to consider
outcomes that are contained in a specific interval. Instead of a probability mass function we
introduce the density function which is, loosely speaking, the theoretical counterpart to a histogram. The probability is described by an integral, as an area under the probability density
function, which is formally defined as follows.
Definition 2.4. The probability density function (density function, pdf) fX (x), or density for
short, of a continuous random variable X is defined by
Z b
P(a < X ≤ b) =
fX (x)dx,
a < b.
(2.11)
a
♢
The density function does not give directly a probability and thus cannot be compared to
the probability mass function. The following properties are nevertheless similar to Property 2.2.
Property 2.3. Let X be a continuous random variable with density function fX (x) and distribution function FX (x). Then:
1. The density function satisfies fX (x) ≥ 0 for all x ∈ R and fX (x) is continuous almost
everywhere.
Z ∞
2.
fX (x)dx = 1.
−∞
Z x
3. FX (x) =
fX (y)dy.
−∞
4. fX (x) = FX′ (x) =
d
FX (x).
dx
5. The cumulative distribution function FX (x) is continuous everywhere.
36
CHAPTER 2. RANDOM VARIABLES
6. P(X = x) = 0.
Example 2.7. The continuous uniform distribution U(a, b) is defined by a constant density
function over the interval [a, b], a < b, i.e., f (x) = 1/(b − a), if a ≤ x ≤ b, and f (x) = 0,
otherwise. Figure 2.5 shows the density and cumulative distribution function of the uniform
distribution U(0, 1) (see also Problem 1.6).
The distribution function is continuous everywhere and the density has only two discontinuities at a and b.
♣
R-Code 2.4 Density and distribution function of a uniform distribution. (See Figure 2.5.)
−0.5
0.0
0.5
1.0
1.5
0.0 0.2 0.4 0.6 0.8 1.0
punif(x)
0.0 0.2 0.4 0.6 0.8 1.0
fX(x)
plot( c(-0.5, 0, NA, 0, 1, NA, 1, 1.5), c(0, 0, NA, 1, 1, NA, 0, 0),
type='l', xlab='x', ylab=bquote(f[X](x)))
# curve(dunif( x), -0.5, 1.5) # does not emphasize the discontinuity!!
curve(punif( x), -0.5, 1.5)
−0.5
x
0.0
0.5
1.0
1.5
x
Figure 2.5: Density and distribution function of the uniform distribution U(0, 1). (See
R-Code 2.4.)
As given by Property 2.3.3 and 4, there is again a bijection between the density function and
the cumulative distribution function: if we know one we can construct the other. Actually, there
is a third characterization of random variables, called the quantile function, which is essentially
the inverse of the cdf. That means, we are interested in values x for which FX (x) = p.
Definition 2.5. The quantile function QX (p) of a random variable X with (strictly) monotone
cumulative distribution function FX (x) is defined by
QX (p) = FX−1 (p),
0 < p < 1,
i.e., the quantile function is equivalent to the inverse of the distribution function.
(2.12)
♢
37
2.4. EXPECTATION AND VARIANCE
In R, the quantile function is specified with the prefix q and the corresponding variate. For
example, qunif is the quantile function for the uniform distribution, which is QX (p) = a+p(b−a)
for 0 < p < 1.
The quantile function can be used to define the theoretical counter part to the sample quartiles
of Chapter 1 as illustrated next.
Definition 2.6. The median ν of a continuous random variable X with cumulative distribution
function FX (x) is defined by ν = QX (1/2). Accordingly, the lower and upper quartiles of X are
QX (1/4) and QX (3/4).
♢
The quantile function is essential for the theoretical quantiles of QQ-plots (Section 1.3.1). For
example, the two left-most panels of Figure 1.6 are based essentially on the i/(n + 1)-quantiles.
Remark 2.2. Instead of the simple i/(n + 1)-quantiles, R uses the form (i − a)/(n + 1 − 2a), for
a specific a ∈ [0, 1]. The precise value can be specified using the argument type in quantile(),
or qtype in qqline or a in ppoints(). Of course, different definitions of quantiles only play a role
for very small sample sizes and in general we should not bother what approach has been taken. ♣
Remark 2.3. For discrete random variables the cdf is not continuous (see the plateaus in the left
panel of Figure 2.2) and the inverse does not exist. The quantile function returns the minimum
value of x from among all those values with probability p ≤ P(X ≤ x) = FX (x), more formally,
QX (p) = inf {p ≤ FX (x)},
x∈R
0 < p < 1,
where for our level of rigor here, the inf can be read as min.
2.4
(2.13)
♣
Expectation and Variance
Density, cumulative distribution function or quantile function uniquely characterize random variables. Often we do not require such a complete definition and “summary” values are sufficient.
We start introducing a measure of location (expectation) and spread (variance). More and
alternative measures will be seen in Chapter 7.
Definition 2.7. The expectation of a discrete random variable X is defined by
E(X) =
X
xi P(X = xi ) .
(2.14)
i
The expectation of a continuous random variable X is defined by
Z
E(X) =
xfX (x)dx ,
(2.15)
R
where fX (x) denotes the density of X.
♢
38
CHAPTER 2. RANDOM VARIABLES
Remark 2.4. Mathematically, it is possible that the expectation is not finite: the random
variable X may take very, very large values too often. In such cases we would say that the
expectation does not exist. We see a single example in Chapter 8 where this is the case. In all
other situations we assume a finite expectation and for simplicity, we do not explicitly state this.
♣
Many other “summary” values are reduced to calculate a particular expectation. The following
property states how to calculate the expectation of a function of the random variable X, which
is in turn used to summarize the spread of X.
Property 2.4.For an “arbitrary” real function g we have:
X


g(xi ) P(X = xi ), if X discrete,

Zi
E g(X) =



g(x)fX (x)dx,
if X continuous.
R
We often take for g(x) = xk , k = 2, 3 . . . and we call E(X k ) the higher moments of X.
Definition 2.8. The variance of X is the expectation of the squared deviation from its expected
value:
Var(X) = E (X − E(X))2
(2.16)
and is also denoted as the centered second moment, in contrast to the second moment E(X 2 ).
p
The standard deviation of X is SD(X) = Var(X).
♢
The expectation is “linked” to the average (or sample mean, mean()) if we have a set of
realizations thought to be from the particular random variable. Similarly, the variance is “linked”
to the sample variance (var()). This link will be formalized in later chapters.
Example 2.8.
1. The expectation and variance of a Bernoulli trial are
E(X) = 0 · (1 − p) + 1 · p = p,
Var(X) = (0 − p)2 · (1 − p) + (1 − p)2 · p = p(1 − p).
(2.17)
(2.18)
2. The expectation and variance of a Poisson random variable are (see Problem 3.3.a)
E(X) = λ,
Var(X) = λ.
(2.19)
Property 2.5. For random variables X and Y , regardless of whether discrete or continuous,
and for a and b given constants, we have
2
1. Var(X) = E(X 2 ) − E(X) ;
2. E(a + bX) = a + b E(X);
39
2.5. THE NORMAL DISTRIBUTION
3. Var(a + bX) = b2 Var(X) ,
4. E(aX + bY ) = a E(X) + b E(Y ).
The second but last property seems somewhat surprising. But starting from the definition of
the variance, one quickly realizes that the variance is not a linear operator:
2 2 Var(a + bX) = E a + bX − E(a + bX)
= E a + bX − (a + b E(X)) ,
(2.20)
followed by a factorization of b2 .
Example 2.9 (continuation of Example 2.2). For X the sum of the roll of two dice, straightforward calculation shows that
E(X) =
12
X
by equation (2.14),
i P(X = i) = 7,
i=2
6
X
=2
i=1
1
7
i =2· ,
6
2
by using Property 2.5.4 first.
(2.21)
♣
(2.22)
As a small note, recall that the first moment is a measure of location, the centered second
moment a measure of spread. The centered third moment is a measure of asymmetry and can
be used to quantify the skewness of a distribution. The centered forth moment is a measure of
heaviness of the tails of the distribution.
2.5
The Normal Distribution
The normal or Gaussian distribution is probably the most known distribution, having the omnipresent “bell-shaped” density. Its importance is mainly due the fact that the sum of Gaussian
random variables is distributed again as a Gaussian random variable. Moreover, the sum of many
“arbitrary” random variables is distributed approximately as a normal random variable. This is
due to the celebrated central limit theorem, which we see in details in the next chapter. As in
the case of a Poisson random variable, we define the normal distribution by giving its density.
Definition 2.9. The random variable X is said to be normally distributed if the cumulative
distribution function is given by
Z x
FX (x) =
fX (x)dx
(2.23)
−∞
with density function
1 (x − µ)2
exp − ·
,
f (x) = fX (x) = √
2
σ2
2πσ 2
1
for all x (µ ∈ R, σx > 0). We denote this with X ∼ N (µ, σ 2 ).
(2.24)
The random variable Z = (X − µ)/σ (the so-called z-transformation) is standard normal and
its density and distribution function are usually denoted with φ(z) and Φ(z), respectively.
♢
40
CHAPTER 2. RANDOM VARIABLES
While the exact form of the density (2.24) is not important, a certain recognizing factor will
be very useful. Especially, for a standard normal random variable, the density is proportional to
exp(−z 2 /2). Figure 2.7 gives the density, distribution and the quantile function of a standard
norm distributed random variable.
Property 2.6. Let X ∼ N (µ, σ 2 ), then E(X) = µ and Var(X) = σ 2 .
The following property is essentially a rewriting of the second part of the definition. We will
state it explicitly because of its importance.
X−µ
Property 2.7. Let X ∼ N (µ, σ 2 ), then X−µ
∼
N
(0,
1)
and
F
(x)
=
Φ
. Conversely, if
X
σ
σ
2
Z ∼ N (0, 1), then σZ + µ ∼ N (µ, σ ), σ > 0.
The cumulative distribution function Φ has no closed form and the corresponding probabilities must be determined numerically. In the past, so-called “standard tables” summarized
probabilities and were included in statistics books. Table 2.1 gives an excerpt of such a table.
Now even “simple” pocket calculators have the corresponding functions to calculate the probabilities. It is probably worthwhile to remember 84% = Φ(1), 98% = Φ(2), 100% ≈ Φ(3), as well
as 95% = Φ(1.64) and 97.5% = Φ(1.96). Relevant quantiles have been illustrated in Figure 2.6
for a standard normal random variable. For arbitrary normal density, the density scales linearly
with the standard deviation.
P(Z<0)=50.0%
P(Z<1)=84.1%
P(Z<2)=97.7%
−3
−2
−1
0
1
2
3
P(−1<Z<1)=68.3%
P(−2<Z<2)=95.4%
P(−3<Z<3)=99.7%
−3
−2
−1
0
1
2
3
Figure 2.6: Different probabilities for the standard normal distribution.
Table 2.1: Probabilities of the standard normal distribution. The table gives the value
of Φ(zp ) for selected values of zp . For example, Φ(0.2 + 0.04) = 0.595.
zp
0.00
0.02
0.04
0.06
0.08
0.0
0.500
0.508
0.516
0.524
0.532
0.1
0.540
0.548
0.556
0.564
0.571
0.2
0.579
0.587
0.595
0.603
0.610
0.3
0.618
0.626
0.633
0.641
0.648
0.4 . . . 1 . . . 1.6
0.655
0.841
0.945
0.663
0.846
0.947
0.670
0.851
0.949
0.677
0.855
0.952
0.684
0.860
0.954
1.7
0.955
0.957
0.959
0.961
0.962
1.8
0.964
0.966
0.967
0.969
0.970
1.9
0.971
0.973
0.974
0.975
0.976
2 ... 3
0.977
0.999
0.978
0.999
0.979
..
.
0.980
0.981
41
2.5. THE NORMAL DISTRIBUTION
R-Code 2.5 Calculation of the “z-table” (see Table 2.1) and density, distribution, and
quantile functions of the standard normal distribution. (See Figure 2.7.)
y <- seq( 0, by=.02, length=5)
# row values
x <- c( seq( 0, by=.1, to=.4), 1, seq(1.6, by=.1, to=2), 3) # column values
round( pnorm( outer( x, y, "+")), 3)
1.0
0.5
−0.5
0.0
dnorm
1.5
2.0
plot( dnorm, -3, 3, ylim=c(-.5,2), xlim=c(-2.6,2.6))
abline( c(0, 1), h=c(0,1), v=c(0,1), col='gray') # diag and horizontal lines
plot( pnorm, -3, 3, col=3, add=TRUE)
plot( qnorm, 0, 1, col=4, add=TRUE)
−2
−1
0
1
2
x
Figure 2.7: Probability density function (black), cumulative distribution function
(green), and quantile function (blue) of the standard normal distribution. (See RCode 2.5.)
We finish this section with two probability calculations that are found similarly in typical
textbooks. We will, however, revisit the Gaussian distribution in virtually every following chapter.
Example 2.10. Let X ∼ N (4, 9). Then
−2 − 4 3
3
= P(Z ≤ −2) = Φ(−2) = 1 − Φ(2) ≈ 1 − 0.977 = 0.023 .
1. P(X ≤ −2) = P
X − 4
≤
2. P(|X − 3| > 2) = 1 − P(|X − 3| ≤ 2) = 1 − P(−2 ≤ X − 3 ≤ 2)
= 1 − P(X − 3 ≤ 2) − P(X − 3 ≤ −2) = 1 − Φ
≈1−
0.626 + 0.633
+ (1 − 0.841) ≈ 0.5295 .
2
5−4
3
+Φ
1−4
3
♣
42
CHAPTER 2. RANDOM VARIABLES
2.6
Bibliographic Remarks
There is an abundance list of books in probability, discussing the concepts of probabilities,
random variables and properties thereof, e.g., Grinstead and Snell (2003) (PDF version under
GNU FDL exists), Ross (2010) (or any older version), or advanced, but classical Feller (1968).
In fact, the majority of textbooks in statistics have some introduction to probability, in a similar
sprit as here.
The ultimate reference for (univariate) distributions is the encyclopedic series of Johnson
et al. (2005, 1994, 1995). Figure 1 of Leemis and McQueston (2008) illustrates extensively
the links between countless univariate distributions, a simplified version is available at https:
//www.johndcook.com/blog/distribution_chart/.
In general, wikipedia has nice summaries of many distributions. The page https://en.
wikipedia.org/wiki/List_of_probability_distributions lists many thereof.
2.7
Exercises and Problems
Problem 2.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Show Property 2.5.
b) Show Property 2.6.
Hint: Let Z ∼ N (0, 1). Show that E(Z) = 0 and Var(Z) = 1, then use Property 2.5.
Problem 2.2 (Visualizing probabilities) For events B1 , . . . , B5 with Bi ∩ Bj = ∅, i ̸= j and
P
S5
i P(A | Bi ) P(Bi ) using an Euleri=1 Bi = Ω visualize the law of total probability P(A) =
diagram.
Problem 2.3 (Properties of probability measures) For events A and B show that:
a) P(A\B) = P(A) − P(A ∩ B);
b) P(A ∪ B) = 1 − P(Ac ∩ B c );
c) P(A ∩ B) = 1 − P(Ac ∪ B c ).
Problem 2.4 (Counting events) Place yourself in a sidewalk café during busy times and for
several consecutive minutes, count the number of persons walking by. (If you do not enjoy the
experimental setup, it is possible to count the number of vehicules in one of the live cams linked
at https://www.autobahnen.ch/index.php?lg=001&page=017).
a) Visualize the data with a histogram and describe its form.
b) Try to generate realizations from a Poisson random variable whose histogram matches best
the the one seen in a), i.e., use rpois() for different values of the argument lambda.
Would there be any other distribution that would yield a better match?
43
2.7. EXERCISES AND PROBLEMS
Problem 2.5 (Discrete uniform distribution) Let m be a positive integer. The discrete uniform
distribution is defined by the pmf
P(X = k) =
1
,
m
k = 1, . . . , m.
a) Visualize the pmf and cdf of a discrete uniform random variable with m = 12.
b) Draw several realizations from X, visualize the results and compare to the pmf of a). What
are sensible graphics types for the visualization?
Hint: the function sample() conveniently draws random samples without replacement.
c) Show that E(X) =
Hint:
m
X
k2 =
k=1
m2 − 1
m+1
and Var(X) =
.
2
12
m(m + 1)(2m + 1)
.
6
Problem 2.6 (Uniform distribution) We assume X ∼ U(0, θ), for some value θ > 0.
a) For all a and b, with 0 < a < b < θ show that P(X ∈ [a, b]) = (b − a)/θ.
b) Calculate E(X), SD(X) and the quartiles of X.
c) Choose a sensible value for θ. In R, simulate n = 10, 50, 10000 random numbers and
visualize the histogram as well as a QQ-plot thereof. Is it helpful to superimpose a smoothed
density to the histograms (with lines( density( ...)))?
Problem 2.7 (Calculating probabilities) In the following settings, approximate the probabilities
and quantiles q1 and q2 using a Gaussian “standard table”. Compare these values with the ones
obtained with R.
X ∼ N (2, 16) :
P(X < 4),
P(0 ≤ X ≤ 4),
P(X > q1 ) = 0.95,
P(X < −q2 ) = 0.05.
If you do not have standard table, the following two R commands may be used instead: pnorm(a)
and qnorm(b) for specific values a and b.
Problem 2.8 (Exponential Distribution) In this problem we get to know another important
distribution you will frequently come across - the expontential distribution. Consider the random
variable X with density

0,
x < 0,
f (x) =
c · exp(−λx), x ≥ 0,
with λ > 0. The parameter λ is called the rate. Subsequently, we denote an exponential random
variable with X ∼ Exp(λ).
a) Determine c such that f (x) is a proper density.
b) Determine the cumulative distribution function (cdf) F (x) of X.
44
CHAPTER 2. RANDOM VARIABLES
c) Determine the quantile function Q(p) of X. What are the quartiles of X?
d) Let λ = 2. Calculate:
P(X ∈ R)
P(X ≤ 4)
P(X ≥ −10)
P(X ≤ log(2)/2)
P(X = 4)
P(3 < X ≤ 5)
e) Show that E(X) = 1/λ and Var(X) = 1/λ2 .
f) Show that P(X > s + t | X > t) = P(X > s), s, t ≥ 0.
A random variable satisfying the previous equation is called memoryless. (Why?)
Problem 2.9 (BMJ Endgame) Discuss and justify the statements about ‘The Normal distribution’ given in doi.org/10.1136/bmj.c6085.
Chapter 3
Functions of Random Variables
Learning goals for this chapter:
⋄ Know the definition and properties of independent and identically distributed
(iid) random variables
⋄ Know the distribution of the average of iid Gaussian random variables
⋄ Explain in own words how to construct chi-squared, t- and F -distributed
random variables
⋄ Know the central limit theorem (CLT) and being able to approximate binomial random variables
⋄ Able to calculate the cdf of transformed random variable, and if applicable,
the pdf as well
⋄ Able to approximate the mean and variance of transformed random variables
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter03.R.
The last chapter introduced “individual” random variables, e.g., a Poisson random variable,
Normal random variable. In subsequent chapters, we need as many random variables as we have
observations because we will match one random variable with each observation. Of course, these
random variables may have the same distribution (may be identical).
3.1
Independent Random Variables
We considered the case of several trials and aggregated these to a binomial random variable. By
definition, the trials are identical and independent of each other. We now formally introduce
independence.
45
46
CHAPTER 3. FUNCTIONS OF RANDOM VARIABLES
Definition 3.1. Two events A and B are independent if
P(A ∩ B) = P(A) P(B) .
(3.1)
Two random variables X and Y are independent if for all A and B
P(X ∈ A ∩ Y ∈ B) = P(X ∈ A) P(Y ∈ B) .
(3.2)
The random variables X1 , . . . , Xn are independent if for all A1 , . . . , An
P
n
\
i=1
n
Y
Xi ∈ Ai =
P(Xi ∈ Ai ) .
(3.3)
i=1
If this is not the case, they are dependent (all three cases).
♢
The concept of independence can be translated to information flow. Recall the formula of
conditional probability (2.5), if we have independence, P(A | B) = P(A): knowing anything
about the event B does not change the probability (knowledge) of the event A. Knowing that I
was born on a Sunday does not provide any information whether I have a driving license. But
having a driving license increases the probability that I own a car.
Formally, to write Equation (3.2), we would have to introduce bivariate random variables.
We will formalize these ideas in Chapter 8 but note here that the definition of independence also
implies that the joint density and the joint cumulative distribution is simply the product of the
individual ones, also called marginal ones. For example, if X and Y are independent, their joint
density is fX (x)fY (y).
Example 3.1. The sum of two dice (Example 2.2) is not independent of the value of the first
die: P(X ≤ 4 ∩ Y ≤ 2) = 5/36 ̸= P(X ≤ 4) P(Y ≤ 2) = 1/6 · 1/3.
♣
Example 3.2 (continuation of Example 2.3). If we assume that each foul shot is independent,
then the probability that 6 shots are necessary is P(X = 6) = (1 − p) · (1 − p) · (1 − p) · (1 − p) ·
(1 − p) · p = (1 − p)5 p, where p is the probability that a shot is successful.
This independence also implies that the success of the next shot is independent of my previous
one. In other words, there is no memory of how often I have missed in the past. See also
Problem 3.2.
♣
We will often use several independent random variables with a common distribution function.
Definition 3.2. A random sample X1 , . . . , Xn consists of n independent random variables with
iid
the same distribution, specified by, say, F . We write X1 , . . . , Xn ∼ F where iid stands for
“independent and identically distributed”. The number of random variables n is called the sample
size.
♢
3.2. RANDOM VARIABLES DERIVED FROM GAUSSIAN RANDOM VARIABLES
47
The iid assumption is very crucial and relaxing the assumptions to allow, for example, dependence between the random variables, has severe implications on the statistical modeling.
Independence also implies a simple formula for the variance of the sum of two or several random
variables, a formal justification will follow in Chapter 8.
Property 3.1. Let X and Y be two independent random variables. Then
1. Var(aX + bY ) = a2 Var(X) + b2 Var(Y ) .
1
Let X1 , . . . , Xn ∼ F with E(X1 ) = µ and Var(X1 ) = σ 2 . Denote X =
iid
n
n
X
Xi . Then
i=1
n
2. E(X) = E
1X Xi = E(X1 ) = µ .
n
i=1
n
3. Var(X) = Var
1X 1
σ2
Xi = Var(X1 ) =
.
n
n
n
i=1
Example 3.3. For X ∼ Bin(n, p), we have
E(X) = np,
Var(X) = np(1 − p)
as we can write the binomial random variable as a sum of Bernoulli ones.
(3.4)
♣
The latter two points of Property 3.1 are used when we investigate statistical properties of
the sample mean. This concept is quite powerful and works as follows. Consider the sample
P
mean x = 1/n ni=1 xi which can be seen as function of n arguments f (x1 , . . . , xn ) = x. We then
evaluate the function at the arguments corresponding to the random sample, f (X1 , . . . , Xn ),
P
which is the random sample mean X = 1/n ni=1 Xi , a random variable itself!
In subsequent chapters we encounter essentially the following two situations:
1. (practical context) We have n observations x1 , . . . , xn which we will model statistically and
iid
thus assume a distribution F such that X1 , . . . , Xn ∼ F .
2. (theoretical approach) We start with a random sample and study the theoretical properties.
Often these will be illustrated with realizations from the the random sample (generated
using R).
3.2
Random Variables Derived from Gaussian Random Variables
iid
Gaussian random variable have many appealing properties such that we often assume X1 , . . . , Xn ∼
N (µ, σ 2 ). In Chapter 1 we summarized data x1 , . . . , xn with statistical measures like sample
mean, sample variance etc. Now using the approach outlined above by replacing the observations with the random variables (in equations (1.1), (1.3), for example) we may wonder what is
the distribution of these statistics?
We start with the following property which is essential and will be consistently used throughout the work.
48
CHAPTER 3. FUNCTIONS OF RANDOM VARIABLES
Property 3.2. Let X1 ∼ N (µ1 , σ12 ) and X2 ∼ N (µ2 , σ22 ) be independent and a and b arbitrary,
then aX1 + bX2 ∼ N (aµ1 + bµ2 , a2 σ12 + b2 σ22 ).
Hence, the density of the sum of two random variables is again unimodal and does not
correspond to a simple “superposition” of the densities. The following property is essential and
a direct consequence of the previous one.
iid
Property 3.3. Let X1 , . . . , Xn ∼ N (µ, σ 2 ), then X ∼ N (µ, σ 2 /n).
The next paragraphs discuss random variables that are derived (obtained as functions) from
Gaussian random variables, going beyond the average of iid Gaussians.
p
P
More specifically, we will look at the distribution of (X−µ)/ σ 2 /n, S 2 = 1/(n−1) ni=1 (Xi −
X)2 and other forms that are crucial in later chapters.
The expressions of the densities of the following random variables is not essential, they are
complex and we do not use them subsequently. Similarly, the expectation and the variance are for
reference only and might be helpful when scrutinizing some derivations or statistical approaches.
3.2.1
Chi-Square Distribution
The distribution of squared standard normal random variables is said to be chi-squared. Formally,
iid
let Z1 , . . . , Zn ∼ N (0, 1). The distribution of the random variable
X=
n
X
Zi2
(3.5)
i=1
is called the chi-square distribution (X 2 distribution) with n degrees of freedom and denoted
X ∼ Xn2 . The following applies:
E(X) = n;
Var(X) = 2n.
(3.6)
We summarize below, the exact link between the distribution of S 2 and a chi-square distribution. Further, the chi-square distribution is used in numerous statistical tests that we see
Chapter 5 and 6.
R-Code 3.1 Chi-square distribution for various degrees of freedom. (See Figure 3.1.)
x <- seq( 0, to=50, length=150)
# x-values for which we plot
plot(x, dchisq( x, df=1), type='l', ylab='Density') # plot first density
for (i in 1:6)
# different degrees of freedom
lines( x, dchisq(x, df=2^i), col=i+1)
legend( "topright", legend=2^(0:6), col=1:7, lty=1, bty="n")
For small n the density of the chi-square distribution is quite right-skewed (Figure 3.1). For
larger n the density is quite symmetric and has a bell-shaped form. In fact, for n > 50, we
can approximate the chi-square distribution with a normal distribution, i.e., Xn2 is distributed
approximately N (n, 2n) (a justification is given in Section 3.3).
49
0.6
3.2. RANDOM VARIABLES DERIVED FROM GAUSSIAN RANDOM VARIABLES
0.3
0.2
0.0
0.1
Density
0.4
0.5
1
2
4
8
16
32
64
0
10
20
30
40
50
x
Figure 3.1: Densities of the chi-square distribution for various degrees of freedom.
(See R-Code 3.1.)
Remark 3.1. A chi-square distribution is a particular case of a so-called Gamma distribution,
which we will introduce in Chapter 13.
There are further transformations of a chi-square distribution that are approximately norp
2Xn2 is approximately normally
mal, e.g. for Xn2 with n > 30 the random variable X =
√
distributed with expectation 2n − 1 and standard deviation of one. Similar approximations
♣
exist for (Xn2 /n)1/3 or (Xn2 /n)1/4 , see, e.g., Canal (2005).
3.2.2
Student’s t-Distribution
We now introduce a distribution that is used when standardizing X. This distribution is quite
omnipresent and borrowed its name to further statistical approaches, e.g., the famous t-tests to
compare sample means.
2 be two independent random variables. The distribution of the
Let Z ∼ N (0, 1) and X ∼ Xm
random variable
Z
V =p
X/m
(3.7)
is called the t-distribution (or Student’s t-distribution) with m degrees of freedom and denoted
V ∼ tn . We have:
E(V ) = 0,
m
Var(V ) =
,
(m − 2)
for m > 1;
(3.8)
for m > 2.
(3.9)
The density is symmetric around zero and as m → ∞ the density converges to the standard
normal density φ(x) (see Figure 3.2, based on R-Code 3.2).
Remark 3.2. For m = 1, 2 the density is heavy-tailed and the variance of the distribution does
not exist. Realizations of this random variable occasionally manifest with extremely large values.
50
CHAPTER 3. FUNCTIONS OF RANDOM VARIABLES
R-Code 3.2 t-distribution for various degrees of freedom. (See Figure 3.2.)
0.4
x <- seq( -3, to=3, length=100)
# x-values for which we plot
plot( x, dnorm(x), type='l', ylab='Density') # Gaussian density as reference
for (i in 0:6)
lines( x, dt(x, df=2^i), col=i+2)
# t-densities with different dofs
legend( "topright", legend=2^(0:6), col=2:8, lty=1, bty="n")
0.2
0.0
0.1
Density
0.3
1
2
4
8
16
32
64
−3
−2
−1
0
1
2
3
x
Figure 3.2: Densities of the t-distribution for various degrees of freedom. The normal
distribution is in black. A density with 27 = 128 degrees of freedom would make the
normal density function appear thicker. (See R-Code 3.2.)
Of course, the sample variance can still be calculated (see R-Code 3.3). We come back to this
issue in Chapter 6.
♣
R-Code 3.3 Sample variance of the t-distribution with one degree of freedom.
set.seed( 14)
tmp <- rt( 1000, df=1)
var( tmp)
# 1000 realizations
# variance is huge!!
## [1] 37391
sort( tmp)[1:7]
# many "large" values, but 2 exceptionally large
## [1] -190.929 -168.920
-60.603
-53.736
-47.764
-43.377
sort( tmp, decreasing=TRUE)[1:7]
-36.252
## [1] 5726.53 2083.68
280.85
239.75
137.36
The following property is fundamental in statistics.
119.16
102.70
51
3.3. LIMIT THEOREMS
iid
Property 3.4. Let X1 , . . . , Xn ∼ N (µ, σ 2 ). Define X =
1. (a)
X −µ
√ ∼ N (0, 1)
σ/ n
(b)
n
n
i=1
i=1
1X
1 X
Xi and S 2 =
(Xi −X)2 .
n
n−1
n−1 2
2
S ∼ Xn−1
.
σ2
2. X and S 2 are independent.
3.
X −µ
√ ∼ tn−1 .
S/ n
Statement 1(a) is not surprising and is a direct consequence of Properties 2.7 and 3.3. Statement 1(b) is surprising insofar that centering the random variables with X amounts to reducing
the degrees of freedom by one. Point 2 seems very surprising as the same random variables are
used in X and S 2 . A justification is that S 2 is essentially a sum of random variables which were
corrected for X and thus S 2 does not contain information about X anymore. In Chapter 8 we
give a more detailed explanation thereof. Point 3 is not surprising as we use the definition of the
t-distribution and the previous two points.
3.2.3
F -Distribution
The F -distribution is mainly used to compare two sample variances with each other, as we will
see in Chapters 5, 10 and ongoing.
2 and Y ∼ X 2 be two independent random variables. The distribution of the
Let X ∼ Xm
n
random variable
W =
X/m
Y /n
(3.10)
is called the F -distribution with m and n degrees of freedom and denoted W ∼ Fm,n . It holds
that:
n
,
n−2
2n2 (m + n − 2)
Var(W ) =
,
m(n − 2)2 (n − 4)
E(W ) =
for n > 2;
(3.11)
for n > 4.
(3.12)
That means that if n increases the expectation gets closer to one and the variance to 2/m, with
m fixed. Figure 3.3 (based on R-Code 3.4) shows the density for various degrees of freedom.
3.3
Limit Theorems
The probability mass function of a binomial random variable with a large n and moderate p
(i.e., p not too close to zero or one) has a bell shaped form. Similarly, the magenta density of
a seemingly arbitrary random variable in Figure 3.1 looks very much like a Gaussian density.
Coincidence?
No. The following theorem is of paramount importance in statistics and sheds some insight.
In fact, the theorem gives us the distribution of X for virtually arbitrary distributions of Xi and
goes much beyond Property 3.1!
52
CHAPTER 3. FUNCTIONS OF RANDOM VARIABLES
R-Code 3.4 F -distribution for various degrees of freedom. (See Figure 3.3.)
x <- seq(0, to=4, length=500)
# x-values for which we plot
df1 <- c( 1, 2, 5, 10, 50, 50, 250)
# dof for the numerator
df2 <- c( 1, 50, 10, 50, 50, 250, 250)
# dof for the denumerator
plot( x, df( x, df1=1, df2=1), type='l', ylab='Density')
for (i in 2:length(df1))
lines( x, df(x, df1=df1[i], df2=df2[i]), col=i)
legend( "topright", col=1:7, lty=1, bty="n",
legend=parse(text=paste0("F['",df1,",",df2,"']")))
2.0
0.0
1.0
Density
3.0
F1,1
F2,50
F5,10
F10,50
F50,50
F50,250
F250,250
0
1
2
3
4
x
Figure 3.3: Density of the F -distribution for various degrees of freedom. (See RCode 3.4.)
Property 3.5. (Central Limit Theorem (CLT), classical version) Let X1 , X2 , X3 , . . . an infinite
sequence of iid random variables with E(Xi ) = µ and Var(Xi ) = σ 2 . Then
lim P
n→∞
X −µ
n
√ ≤ z = Φ(z)
σ/ n
(3.13)
where we kept the subscript n for the random sample mean to emphasis its dependence on n.
The proof of the CLT is a typical exercise in a probability theory lecture. Many extensions
of the CLT exist, for example, the independence assumptions can be relaxed.
Using the central limit theorem argument, we can show that distribution of a binomial random
variable X ∼ Bin(n, p) converges to a distribution of a normal random variable as n → ∞. Thus,
the distribution of a normal random variable N (np, np(1 − p)) can be used as an approximation
for the binomial distribution Bin(n, p). For the approximation, n should be larger than 30 for
p ≈ 0.5. For p closer to 0 and 1, n needs to be much larger.
For a binomial random variable, P(X ≤ x) = P(X < x+1), x = 1, 2, . . . , n, which motivates a
53
3.3. LIMIT THEOREMS
a so-called continuity correction when calculating probabilities. Specifically, instead of P(X ≤ x)
we approximate P(X ≤ x + 0.5) as illustrated in the following example.
Example 3.4. Let X ∼ Bin(30, 0.5). Then P(X ≤ 10) = 0.0494, “exactly”. However,
X − np
10 − 15 10 − np P(X ≤ 10) ≈ P p
≤p
=Φ p
= 0.0339 ,
np(1 − p)
np(1 − p)
30/4
X − np
10.5 − 15 10 + 0.5 − np P(X ≤ 10 + 0.5) ≈ P p
≤ p
=Φ p
= 0.05017 .
np(1 − p)
np(1 − p)
30/4
(3.14)
(3.15)
The improvement of the continuity correction can be quantified by the reduction of the absolute
errors |0.0494 − 0.0339| = 0.0155 vs |0.0494 − 0.0502| = 0.0008 or by the relative errors |0.0494 −
0.0339|/0.0494 = 0.3138 vs |0.0494 − 0.0502|/0.0494 = 0.0162.
♣
Another very important limit theorem is the law of large numbers (LLN) that essentially
states that for X1 , . . . , Xn iid with E(Xi ) = µ, the average X n converges to µ. We have deliberately used the somewhat ambiguous “convergence” statement, a more rigorous statement is
technically a bit more involved. We will use the LLN in the next chapter, when we try to infer
parameter values from data, i.e., say something about µ when we observe x1 , . . . , xn .
Remark 3.3. There are actually two forms of the LLN theorem, the strong and the weak
formulation. We do not not need the precise formulation later and thus simply state them here
for the sole reason of stating them
Weak LLN: lim P |X n − µ| > ϵ = 0 for all ϵ > 0,
n→∞
Strong LLN: P lim X n = µ = 1.
n→∞
(3.16)
(3.17)
The differences between both formulations are subtle. The weak version states that the average
is close to the mean and excursions (for specific n) beyond µ ± ϵ can happen arbitrary often. The
strong version states that there exists a large n such that the average is always within µ ± ϵ.
The two forms represent fundamentally different notions of convergence of random variables:
(3.17) is almost sure convergence, (3.16) is convergence in probability. The CLT represents convergence in distribution.
♣
We have observed several times that for increasing sample sizes, the discrepancy between
the theoretical value and the sample diminishes. Examples include the CLT, LLN, but also our
observation that the histogram of rnorm(n) looks like the normal density for large n.
It is possible to formalize this important concept. Let X1 , . . . , Xn iid with distribution
function F (x). We define the sample empirical distribution function
n
1X
Ixi ≤x (x),
Fn (x) =
n
(3.18)
i=1
that is, a step function with jump size 1/n at the values x1 , . . . , xn (see also Equation 3.32 for the
meaning of I). As n → ∞ the empirical distribution function Fn (x) converges to the underlying
54
CHAPTER 3. FUNCTIONS OF RANDOM VARIABLES
distribution function F (x). Because of this fundamental result we are able to work with specific
distributions of random samples.
For discrete random variables, the previous convergence result can be written in terms of
probabilities and observed proportions (see Problem 3.3.b). For continuous random variables,
we would have to invoke binning to compare the histogram with the density. The binning adds
another technical layer for the theoretical results. In practice, we simply compare the histogram
(or a smoothed version thereof, Figure 1.3) with the theoretical density.
Remark 3.4. To show the convergence of the sample empirical distribution function, we consider
P
the random version thereof, i.e., the empirical distribution function 1/n ni=1 IXi ≤x (x) and invoke
the LLN. This convergence is pointwise only, meaning it holds for all x. Stronger results hold and
the Glivenko–Cantelli theorem states that the convergence is even uniform: supx Fn (x) − F (x)
converges to zero almost surely.
♣
3.4
Functions of a Random Variable
In the previous sections we saw different examples of often used, classical random variables.
These examples are often not enough and through a modeling approach, we need additional
ones. In this section we illustrate how to construct the cdf and pdf of a random variable that is
the square of one from which we know the density.
Let X be a random variable with distribution function FX (x). We define a random variable
Y = g(X), for a suitable chosen function g(·). The cumulative distribution function of Y is
written as
FY (y) = P (Y ≤ y) = P g(X) ≤ y .
(3.19)
In many cases g(·) is invertable (and differentiable) and we obtain

P X ≤ g −1 (y) = F g −1 (y),
if g −1 monotonically increasing,
X
FY (y) =
P X ≥ g −1 (y) = 1 − F g −1 (y), if g −1 monotonically decreasing.
(3.20)
X
To derive the probability mass function we apply Property 2.2.4. In the more interesting setting
of continuous random variables, the density function is derived by Property 2.3.4 and is thus
fY (y) =
d −1
g (y) fX (g −1 (y)).
dy
(3.21)
Example 3.5. Let X be a random variable with cdf FX (x) and pdf fX (x). We consider Y =
a+bX, for b > 0 and a arbitrary. Hence, g(·) is a linear function and its inverse g −1 (y) = (y−a)/b
is monotonically increasing. The cdf of Y is thus FX (y − a)/b and, for a continuous random
variable X, the pdf is fX (y − a)/b · 1/b. This fact has already been stated in Property 2.7 for
the Gaussian random variables.
♣
This last example also motivates a formal definition of a location parameter and a scale
parameter.
3.4. FUNCTIONS OF A RANDOM VARIABLE
55
Definition 3.3. For a random variable X with density fX (x), we call θ a location parameter if
the density has the form fX (x) = f (x − θ) and call θ a scale parameter if the density has the
form fX (x) = f (x/θ)/θ.
♢
In the previous definition θ stands for an arbitrary parameter. The parameters are often
chosen such that the expectation of the random variable is equal to the location parameter and
the variance equal to the squared scale parameter. Hence, the following examples are no surprise.
Example 3.6. For a Gaussian distribution N (µ, σ 2 ), µ is the location parameter, σ a scale
parameter, as seen from the definition of the density (2.24). The parameter λ of the Poisson
distribution Pois(λ) is neither a location nor a scale parameter.
♣
We consider a second example of a transformation, closely related to Problem 2.8.
Example 3.7. Let X ∼ U(0, 1) and for 0 < x < 1, we set g(x) = − log(1 − x), where log() is
the natural logarithm, i.e., the logarithm to the base e. Thus, g −1 (y) = 1 − exp(−y) and the
distribution and density function of Y = g(X) is
FY (y) = FX (g −1 (y)) = g −1 (y) = 1 − exp(−y),
d −1
g (y) fX g −1 (y) = exp(−y),
fY (y) =
dy
(3.22)
(3.23)
for y > 0. This random variable is called the exponential random variable (with rate parameter
one), denoted by X ∼ Exp(λ) with λ = 1.
The random variable V = Y /λ, λ > 0, has the density fV (v) = exp(−x/θ)/θ (by (3.21)
or Example 3.5) and is denoted V ∼ Exp(λ). Hence, the parameter θ = 1/λ of an exponential
distribution is a scale parameter.
Notice further that g(x) is the quantile function of this random variable.
♣
This last example gives rise to the so-called inverse transform sampling method to draw
realizations from a random variable X with a closed form quantile function QX (p). In more
details, assume an arbitrary cumulative distribution function FX (x) with a closed form inverse
FX−1 (p) = QX (p). For U ∼ U(0, 1), the random variable X = F −1 (U ) has cdf FX (x):
P(X ≤ x) = P FX−1 (U ) ≤ x = P U ≤ FX (x) = FX (x).
(3.24)
Hence, based on Example 3.7, the R expression -log(1- runif(n))/lambda draws a realization
iid
from X1 , . . . , Xn ∼ Exp(1).
Remark 3.5. For U ∼ U(0, 1), 1 − U ∼ U(0, 1), and thus it is possible to further simplify
the sampling procedure. Interestingly, R uses a seemingly more complex algorithm to sample.
The algorithm is however fast and does not require a lot of memory (Ahrens and Dieter, 1972);
properties that were historically very important.
♣
In the case, where we cannot invert the function g, we can nevertheless use the concept of
the approach by starting with (3.19), followed by simplification and use of Property 2.3.4, as
illustrated in the following example.
56
CHAPTER 3. FUNCTIONS OF RANDOM VARIABLES
Example 3.8. The density of a chi-squared distributed random variable with one degree of
freedom is calculated as follows. Let X ∼ N (0, 1) and Y = Z 2 .
√ √
√
√
y = Φ( y) − Φ(− y) = 2Φ( y) − 1
y 1
y
d
1
√ d√ 2
fY (y) =
y √ exp −
exp − .
FY (y) = 2ϕ( y)
√ =√
dy
dy
2 2 y
2
2πy
2π
FY (y) = P(Y = Z 2 ≤ y) = P | Z |≤
(3.25)
(3.26)
♣
As we are often interested in summarizing a random variable by its mean and variance, we now
introduce a very convenient approximation for transformed random variables Y = g(X) by the
so-called delta method . The idea thereof consists of a Taylor expansion around the expectation
E(X):
Y = g(X) ≈ g E(X) + g ′ E(X) · X − E(X)
(3.27)
(first two terms of the Taylor series). Thus
E(Y ) ≈ g E(X) ,
2
Var(Y ) ≈ g ′ E(X) · Var(X).
(3.28)
(3.29)
The approach is illustrated with the following two examples.
Example 3.9. Let X ∼ B(1, p) and Y = X/(1 − X). Thus,
E(Y ) ≈
p
;
1−p
Var(Y ) ≈
2
1
p
.
· p(1 − p) =
2
(1 − p)
(1 − p)3
♣
(3.30)
Example 3.10. Let X ∼ B(1, p) and Y = log(X). Thus
E(Y ) ≈ log(p),
Var(Y ) ≈
1 2
p
· p(1 − p) =
1−p
.
p
♣
(3.31)
Of course, in the case of a linear transformation (as, e.g., in Example 3.5), equation (3.27) is
an equality and thus relations (3.28) and (3.29) are exact, which is in sync with Property 2.7.
Consider g(X) = X 2 , by Property 2.5.1, E(X 2 ) = E(X)2 +Var(X) and thus E(X 2 ) > E(X)2 ,
refining the approximation (3.28). This result can be generalized and states that for every convex
function g(·), E g(X) ≤ g E(X) . For concave functions, the inequality is reversed. For strictly
convex or strictly concave functions, we have strict inequalities. Finally, a linear function is not
concave or convex and we have equality, as given in Property 2.5.4. These inequalities run under
Jensen’s inequality.
Remark 3.6. The following results are for completeness only. They are nicely elaborated in
Rice (2006).
3.5. BIBLIOGRAPHIC REMARKS
57
1. For continuous random variables X1 and X2 it is possible to calculate probability density
function of X1 + X2 . The formula is the so-called convolution. Same holds for discrete
random variables.
2. It is also possible to construct random variables based on an entire random sample, say
Y = g(X1 , . . . , Xn ). Property.3.5 uses exactly such an approach, where g(·) is given by
P
√ g(X1 , . . . , Xn ) =
σ/ n .
i Xi − µ
3. Starting with two random continuous variables X1 and X2 , and two bijective functions
g1 (X1 , X2 ) and g2 (X1 , X2 ), there exists a closed form expression for the (joint) density of
Y1 = g1 (X1 , X2 ) and Y2 = g2 (X1 , X2 ).
♣
We finish the chapter with introducing a very simple but convenient function, the so-called
indicator function, defined by

1, if x ∈ A,
Ix∈A (x) =
(3.32)
0, if x ∈
/ A.
The function argument ‘(x)’ is redundant but often helps to clarify. In the literature, one often
finds the concise notation IA . Here, let X be a random variable then we define a random variable
Y = IX∈A (X), i.e., Y ‘specifies’ if the random variable X is in the set A.
The indicator function is often used to calculate expectations. For example, let Y be defined
as above. Then
E(Y ) = E IX>0 (X) = 0 · P(X ≤ 0) + 1 · P(X > 0) = P(X > 0)
(3.33)
where we used Property 2.7 with g(x) = Ix∈A (x). In other words, we see IX>0 (X) as a Bernoulli
random variable with success probability P (X > 0).
3.5
Bibliographic Remarks
Limit theorems are part of any mathematical statistics or probability book and similar references
as in Section 2.6 can be added here.
Needham (1993) gives a graphical justification of Jensen’s inequality.
The derivation of the chi-square, t- and F -distribution is accessibly presented in Rice (2006)
and their properties extensively discussed in Johnson et al. (1994, 1995).
3.6
Exercises and Problems
Problem 3.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Show that for X ∼ Pois(λ), E(X) = λ, and Var(X) = λ.
58
CHAPTER 3. FUNCTIONS OF RANDOM VARIABLES
b) Starting from the pmf of a Binomial random variable, derive the pmf of the Poisson random
variable when n → ∞, p → 0 but λ = np constant.
Problem 3.2 (Geometric distribution) In the setting of Examples 2.3 and 3.2, denote p =
P(shot is successful). Assume that the individual shots are independent. Show that
a) P(X ≤ k) = 1 − (1 − p)k , k = 1, 2, . . . .
b) E(X) = 1/p and Var(X) = (1 − p)/p2 .
c) P(X = k + j | X > j) = P(X = k).
Problem 3.3 (Poisson distribution) In this problem we visualize and derive some properties of
the Poisson random variable with parameter λ > 0.
a) Visualize in R the cdf and pmf of X ∼ Pois(λ), for λ = 0.2 and λ = 2.
b) For λ = 0.2 and λ = 2, sample from X1 , . . . , Xn ∼ Pois(λ) with n = 200 and draw
histograms. Compare the histograms with a). What do you expect to happen when n
increases?
c) Let X ∼ Pois(λ), with λ = 3: calculate P(X ≤ 2), P(X < 2) and P(X ≥ 3).
d) Plot the pmf of X ∼ Pois(λ), λ = 5, and Y ∼ Bin(n, p) for n = 10, 100, 1000 with λ = np.
What can you conclude?
Problem 3.4 (Approximating probabilities) In the following settings, approximate the probabilities and quantiles by reducing the problem to calculating probabilities from a standard normal,
i.e., using a “standard table”. Compare these values with the ones obtained with R.
If you do not have standard table, the following two R commands may be used instead: pnorm(a)
and qnorm(b) for specific values a and b.
a) X ∼ Bin(42, 5/7):
P(X = 30), P(X < 21), P(24 ≤ X ≤ 35).
b) X ∼ t49 :
P(−1 < X < 1), P(X < q) = 0.05.
c) X ∼ χ268 :
P(X ≤ 68), P(47 ≤ X ≤ 92.7)
Problem 3.5 (Random numbers from a t-distribution) We sample random numbers from a tp
distribution by drawing repeatedly z1 , . . . , zm from a standard normal and setting yj = z/ s2 /m,
P
P
j = 1, . . . , n, where z = 1/m i zi and s2 = 1/(m − 1) i (zi − z)2 .
a) For m = 6 and n = 500 construct Q-Q plots based on the normal and appropriate tdistribution. Based on the plots, is it possible to discriminate which of the two distributions
is more appropriate? Is this question getting more difficult if m and n are chosen larger or
smaller?
59
3.6. EXERCISES AND PROBLEMS
b) Suppose we receive the sample y1 , . . . , yn constructed as above but the value m is not
disclosed. Is it possible to determine m based on Q-Q plots? What general implication
can be drawn, especially for small samples?
Problem 3.6 (Exponential Distribution in R) Let X1 , . . . , Xn be independent and identically
distributed (iid) random variables following Exp(λ). Assume λ = 2 for the following.
a) Sample n = 100 random numbers from Exp(λ). Visualize the data with a histogram and
superimpose the theoretical density.
b) Derive theoretically the distribution of min(X1 , . . . , Xn ).
c) Draw a histogram of min(X1 , . . . , Xn ) from 500 realizations and compare it to the theoretical result from part b).
Problem 3.7 (Inverse transform sampling) The goal of this exercise is to implement your own
R code to simulate from a continous random variable X with the following probability density
function (pdf):
fX (x) = c |x| exp −x2 , x ∈ R.
We use inverse transform sampling, which is well suited for distributions whose cdf is easily
invertible.
Hint: you can get rid of the absolute value by defining

c x exp −x2 ,
if x ≥ 0,
fX (x) =
−c x exp −x2 , if x < 0.
a) Find c such that fX (x) is an actual pdf. Show that the quantile function is
p
 − log (2(1 − p)), p ≥ 0.5,
−1
FX (p) = QX (p) =
p
− − log (2p),
p < 0.5.
b) Write your own code to sample from X. Check the correctness of your sampler.
Problem 3.8 (Geometric mean) The geometric mean of x1 , . . . , xn is defined as
√
mg = mg (x1 , . . . , xn ) = n x1 · · · · · xn
and is often used to summarize rates or to summarize different items that are rated on different
scales.
a) For x1 , . . . , xn and y1 , . . . , yn , show that mg (x1 /y1 , . . . , xn /yn ) =
a profound consequence of this property?
mg (x1 , . . . , xn )
. What is
mg (y1 , . . . , yn )
√
b) Using Jensen’s inequality, show that n x1 · · · · · xn ≤ n1 (x1 + · · · + xn ), i.e., mg ≤ x.
60
CHAPTER 3. FUNCTIONS OF RANDOM VARIABLES
Chapter 4
Estimation of Parameters
Learning goals for this chapter:
⋄ Explain what a simple statistical model is (including the role of parameters)
⋄ Describe the concept of point estimation and interval estimation
⋄ Interpret point estimates and confidence intervals
⋄ Describe the concept of method of moments, least squares and likelihood
estimation
⋄ Construct theoretically and using R confidence intervals
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter04.R.
A central point of statistics is to draw information from observations (in form of data from
measurements of an experiment or from a subset of a population) and to infer from these towards
hypotheses in form of scientific questions or general statements of the population. Inferential
statistics is often explained by “inferring” conclusions from the sample toward the “population”.
This step requires foremost data and an adequate statistical model . Such a statistical model
describes the data through the use of random variables and their distributions, by means the data
is considered a realization of the corresponding distribution. The model comprises of unknown
quantities, so-called parameters. The goal of a statistical estimation is to determine plausible
values for the parameters of the model using the observed data.
4.1
Linking Data with Parameters
In this section we first introduce the concept of statistical models and the associated parameters.
In the second step we introduce the idea of estimation.
61
62
CHAPTER 4. ESTIMATION OF PARAMETERS
4.1.1
Statistical Models and Parameters
We now discuss the box ‘Propose a statistical model’ in our statistical workflow (Figure 1.1). To
introduce a first statistical model we consider a specific example.
Example 4.1. Antimicrobial susceptibility tests (ASTs) are used to identify the level of resistance of a bacterial strain to a specific antibiotic. The procedure can be summarized by
the following steps. The bacteria are isolated and a resulting bacterial suspension is applied
(streaked) on a Petri dish that contains a growth medium and nutrients for the bacteria. The
antibiotics sources are added and the dish is then incubated, that means kept for several hours
at a favorable temperature to allow bacterial growth. If the antibiotic works well, the sources
inhibit the growth and create small circular areas that do not contain any bacterial load. The
larger the disk, the more effective the bacteria.
To assess the variability and errors of the AST testing procedure, 100 measurements based on
100 suspensions have been evaluated for different antibiotics Hombach et al. (2016). Table 4.1
reports the observed diameters with frequencies by E. coli and the antibiotics imipenem and
meropenem.
Natural questions that arise are: What are plausible or representative values of the inhibition
diameter? How much do the individuals diameters deviate around the mean?
The data is visualized in Figure 4.1 (based on R-Code 4.1).
♣
Table 4.1:
meropenem.
Inhibition diameters by E. coli and the antibiotics imipenem and
Diameter (mm)
28
29
30
31
32
33
34
35
36
37
38
39
40
Imipenem
Meropenem
0
0
3
0
7
0
14
0
32
2
20
9
18
33
4
20
1
17
1
9
0
6
0
4
0
0
Although the data of the previous example is rounded to the nearest millimeter, it would be
reasonable to assume that the diameters are real valued. Each measurement being identical to
others and thus fluctuating naturally around a common mean. The following is a very simple
example of a statistical model adequate here:
Y i = µ + εi ,
i = 1, . . . , n,
(4.1)
where Yi are the observations, µ is an unknown diameter and εi are random variables representing
measurement error. It is often reasonable to assume E(εi ) = 0 with a symmetric density. Here,
iid
we further assume εi ∼ N (0, σ 2 ). Thus, Y1 , . . . , Yn are normally distributed with mean µ and
variance σ 2 . As typically, both parameters µ and σ 2 are unknown and we need to determine
plausible values for these parameters from the available data.
Such a statistical model describes the data yi , . . . , yn through random variables Yi , . . . , Yn
and their distributions. In other words, the data is a realization of the random sample given by
the statistical model.
For both antibiotics, a slightly more evolved example consists of
Yi = µimi + εi ,
i = 1, . . . , nimi ,
(4.2)
Yi = µmero + εi ,
i = nimi + 1, . . . , nimi + nmero ,
(4.3)
63
4.1. LINKING DATA WITH PARAMETERS
iid
We assume εi ∼ N (0, σ 2 ). Thus the model states that both diseases have a different mean
but the same variability. This assumption pools information from both samples to estimate the
variance σ 2 . The parameters of the model are µimi , µmero and to a lesser extend σ 2 .
The statistical model represents the population with a seamingly infinite size. The data is
the realization of a subset of our population. And thus based on the data we want to answer
questions like: What are plausible values of the population levels? How much do the individuals
deviate around the population mean?
R-Code 4.1 states the means which we might consider as reasonable values for µimi , µmero ,
similar for their variances. Of course we need to formalize that the mean of the sample can be
taken as a representative value of the entire population. For the variance parameter σ 2 , slightly
more care is needed as neither var(imipDat) nor var(meropDat) are fully satisfying.
The questions if the inhibition diameter of both antibiotics are comparable or if the inhibition
diameter from meropenem is (statistically) larger than 33 mm are of completely different nature
and will be discussed in the next chapter, where we formally discuss statistical tests.
R-Code 4.1 Inhibition diameters (See Figure 4.1.)
diam <- 28:40
# diameters
imi <- c(0, 3, 7, 14, 32, 20, 18, 4, 1, 1, 0, 0 ,0) # frequencies
mero <- c(0, 0, 0, 0, 2, 9, 33, 20, 17, 9, 6, 4, 0)
barplot( imi, names.arg=paste(diam), main="Imipenem")
barplot( mero, names.arg=paste(diam), main="Meropenem")
imiDat <- rep(diam, imi)
# now a vector with the 100 diameters
c( mean(imiDat), sum( mero*diam)/100) # means for both, then spread
## [1] 32.40 35.12
c( var( imiDat), sum( (imiDat-mean(imiDat))^2)/(length(imiDat)-1) )
## [1] 2.2424 2.2424
c( sd( imiDat), sqrt( var( imiDat)))
## [1] 1.4975 1.4975
Meropenem
0
0 5
5
15
15
25
25
Imipenem
28
30
32
34
36
38
40
28
30
32
34
36
38
40
Figure 4.1: Frequencies of inhibition diameters (in mm) by E. coli and imipenem and
meropenem (total 100 measurements). (See R-Code 4.1.)
64
CHAPTER 4. ESTIMATION OF PARAMETERS
4.1.2
Point Estimation
We assume now that an appropriate statistical model parametrized by one or several parameters
has been proposed for a dataset. For estimation, the data is seen as a realization of a random
sample where the distribution of the latter is given by the statistical model. The goal of point
estimation is to provide a plausible value for the parameters of the distribution based on the
data at hand.
Definition 4.1. A statistic is an arbitrary function of a random sample Y1 , . . . , Yn and is therefore also a random variable.
An estimator for a particular parameter is a statistic used to obtain a plausible value for this
parameter, based on the random sample.
A point estimate is the value of the estimator evaluated at y1 , . . . , yn , the realizations of the
random sample.
Estimation (or estimating) is the process of finding a (point) estimate.
♢
Hence, in order to estimate a parameter, we start from an estimator for that particular
parameter and evaluate the estimator at the available data. The estimator may depend on the
underlying statistical model and typically depends on the sample size.
Example 4.2.
1. The numerical values shown in R-Code 4.1 are estimates.
n
1X
Yi is an estimator.
n
2. Y =
i=1
100
y=
1 X
yi = 32.40 is a point estimate.
100
i=1
n
3. S 2 =
1 X
(Yi − Y )2 is an estimator.
n−1
i=1
100
s2 =
1 X
(yi − y)2 = 2.242 or s = 1.498 are a point estimates.
n−1
i=1
♣
Often, we denote parameters with Greek letters (µ, σ, λ, . . . ), with θ being the generic one.
b Context makes clear which of
The estimator and estimate of a parameter θ are denoted by θ.
the two cases is meant.
4.2
Construction of Estimators
We have already encountered several estimators for location and spred, cf. Equations (1.1)
to (1.4). In fact, there are an abundance of different estimators for a specific parameter. We
now consider three approaches to construct estimators. Not all three work in all settings, more
so, depending on the situations a particular approach might be preferable.
65
4.2. CONSTRUCTION OF ESTIMATORS
4.2.1
Ordinary Least Squares
The first approach is intuitive and it is straightforward to construct estimators for location
parameters or, more generally, for parameters linked to the expectation of the random sample.
The ordinary least squares method of parameter estimation minimizes the sum of squares
of the differences between the random variables and the location parameter. More formally, let
Y1 , . . . , Yn be iid with E(Yi ) = µ. The least squares estimator for µ is
µ
b=µ
bLS = argmin
µ
n
X
i=1
(Yi − µ)2 ,
(4.4)
and thus after minimizing the sums of squares we get the estimator µ
bLS = Y and the estimate
µ
bLS = y.
Often, the parameter θ is linked to the expectation E(Yi ) through some function, say g. In
such a setting, we have
θb = θbLS = argmin
θ
n
X
i=1
2
Yi − g(θ) .
(4.5)
b =Y.
and θbLS solves g(θ)
In linear regression settings, the ordinary least squares method minimizes the sum of squares
of the differences between observed responses and those predicted by a linear function of the explanatory variables. Due to the linearity, simple and close form solutions exist (see Chapters 9ff).
4.2.2
Method of Moments
The method of moments is based on the following idea. The parameters of the distribution are
expressed as functions of the moments, e.g., E(Y ), E(Y 2 ). The random sample moments are
then plugged into the theoretical moments of the equations in order to obtain the estimators:
n
1X
Yi = Y ,
µ
b=
n
µ := E(Y ) ,
µ2 := E(Y 2 ) ,
µ
b2 =
1
n
i=1
n
X
Yi2 .
(4.6)
(4.7)
i=1
By using the observed values of a random sample in the method of moments estimator, the
estimates of the corresponding parameters are obtained. If the parameter is a function of the
moments, we need to additionally solve the corresponding equation, as illustrated in the following
two examples.
iid
Example 4.3. Let Y1 , . . . , Yn ∼ E(λ)
b=λ
bMM = 1 .
λ
Y
Thus, the method of moment estimate of λ is the value 1/y.
E(Y ) = 1/λ,
.
b
Y =1 λ
(4.8)
♣
iid
Example 4.4. Let Y1 , . . . , Yn ∼ F with expectation µ and variance σ 2 . Since Var(Y ) = E(Y 2 )−
E(Y )2 (Property 2.5.1), we can write σ 2 = µ2 − (µ)2 and we have the estimator
1
c2 = µ
σ
c2 − (b
µ)2 =
MM
n
n
X
i=1
n
Yi2 − Y 2 =
1X
(Yi − Y )2 .
n
i=1
(4.9)
♣
66
4.2.3
CHAPTER 4. ESTIMATION OF PARAMETERS
Likelihood Method
The likelihood method chooses as estimate the value such that the observed data is most likely
to stem from the model (using the estimate). To derive the method, we consider the probability
density function (for continuous random variables) or the probability mass function (for discrete
random variables) to be a function of parameter θ, i.e.,
fY (y) = fY (y; θ)
pi = P(Y = yi ) = P(Y = yi ; θ)
−→
L(θ) := fY (y; θ)
(4.10)
−→
L(θ) := P(Y = yi ; θ).
(4.11)
For a given distribution, we call L(θ) the likelihood function, or simply the likelihood.
Definition 4.2. The maximum likelihood estimate θbML of the parameter θ is based on maximizing the likelihood, i.e.
♢
θbML = argmax L(θ).
θ
(4.12)
By definition of a random sample, the random variables are independent and identically
distributed and thus the likelihood is the product of the individual densities fY (yi ; θ) (see Section 3.1). To simplify the notation, we have omitted the index of Y . Since θbML = argmaxθ L(θ) =
argmaxθ log L(θ) , the log-likelihood ℓ(θ) := log L(θ) can be maximized instead. The loglikelihood is often preferred because the expressions simplify more and maximizing sums is much
easier than maximizing products.
iid
Example 4.5. Let Y1 , . . . , Yn ∼ Exp(λ), thus
L(λ) =
n
Y
fY (yi ) =
i=1
n
Y
n
λ exp(−λyi ) = λ exp(−λ
i=1
n
X
yi ) .
(4.13)
i=1
Then
d log(λn exp(−λ
dℓ(λ)
=
dλ
dλ
Pn
i=1 yi )
d(n log(λ) − λ
=
dλ
n
Pn
i=1 yi )
=
n X
!
−
yi = 0
λ
(4.14)
1
= .
y
i=1 yi
(4.15)
i=1
b=λ
bML = Pnn
λ
bML = λ
bMM .
In this case (as in some others), λ
♣
In a vast majority of cases, maximum likelihood estimators posses very nice properties. Intuitively, because we use information about the density and not only about the moments, they
are “better” compared to method of moment estimators and to least squares based estimators.
Further, for many common random variables, the likelihood function has a single optimum, in
fact a maximum, for all permissible θ.
In our daily live we often have estimators available and thus we rarely need to rely on the
approaches presented in this section.
67
4.3. COMPARISON OF ESTIMATORS
4.3
Comparison of Estimators
The variance estimate based on the method of moment estimator in Example 4.4 divides the sum
of the squared deviances by n, whereas Equation (1.3) and and R use the denominator n − 1 (see
R-Code 4.1 or Example 4.1). There are different estimators for a particular parameter possible
and we now introduce two measures to compare them.
Definition 4.3. An estimator θb of a parameter θ is unbiased if
b = θ,
E(θ)
b − θ is called the bias.
otherwise it is biased. The value E(θ)
(4.16)
♢
Simply put, an unbiased estimator leads to estimates that are on the long run correct.
iid
Example 4.6. Y1 , . . . , Yn ∼ N (µ, σ 2 )
1. Y is unbiased for µ, since
n
n
1 X
1X
1
EY =E
Yi =
E Yi = n E(Y1 ) = µ .
n
n
n
i=1
(4.17)
i=1
n
1 X
2. S =
(Yi −Y )2 is unbiased for σ 2 . To show this, we expand the square terms and
n−1
i=1
simplify the sum of these cross-terms:
2
(Yi − Y )2 = (Yi − µ + µ − Y )2 = (Yi − µ)2 + 2(Yi − µ)(µ − Y ) + (µ − Y )2 ,
n
X
i=1
2(Yi − µ)(µ − Y ) = 2(µ − Y )
(4.18)
n
X
(Yi − µ) = 2(µ − Y )(nY − nµ)
i=1
2
(4.19)
= −2n(µ − Y ) ,
Collecting the terms yields
(n − 1)S 2 =
n
X
i=1
(Yi − µ)2 − 2n(µ − Y )2 + n(µ − Y )2 .
(4.20)
We now use E(Y ) = E(Y ) = µ, thus E (Yi − µ)2 = Var Yi = σ 2 , and similarly, E (µ −
Y )2 = Var Y = σ 2 /n, by Property 3.1.3. Finally,
(n − 1) E(S 2 ) =
c2 =
3. σ
n
X
i=1
σ2
Var(Yi ) − n Var Y = nσ 2 − n ·
= (n − 1)σ 2 .
n
(4.21)
1X
(Yi − Y )2 is biased for σ 2 , since
n
i
c2 ) =
E(σ
1 X
n−1
1
(n − 1) E
(Yi − Y )2 =
σ2.
n
n−1
n
i
|
{z
}
E(S 2 ) = σ 2
(4.22)
68
CHAPTER 4. ESTIMATION OF PARAMETERS
The bias is
c2 ) − σ 2 =
E(σ
n−1 2
1
σ − σ2 = − σ2,
n
n
which amounts to a slight underestimation of the variance.
(4.23)
♣
Unbiasedness is a nice and often desired property of an estimator. If an estimator is biased
and this bias is known, then it is often possible to correct for it. In the spirit of Example 4.6.3,
b = aθ, leading to a bias (a − 1)θ. Then the
suppose that we have a biased estimator θb with E(θ)
b is unbiased.
estimator θ/a
A second possibility for comparing estimators is the mean squared error
b = E (θb − θ)2 .
MSE(θ)
(4.24)
b = bias(θb )2 + Var(θb ).
The mean squared error can also be written as MSE(θ)
iid
Example 4.7. Let Y1 , . . . , Yn ∼ N (µ, σ 2 ). Using the result (4.17) and Property 3.1.3, we have
MSE(Y ) = bias(Y )2 + Var(Y ) = 0 +
σ2
.
n
Hence, the MSE vanishes as n increases.
(4.25)
♣
There is a another “classical” example for the calculation of the mean squared error, however
it requires some properties of squared Gaussian variables.
iid
2
(see Property 3.4.1). Using
Example 4.8. If Y1 , . . . , Yn ∼ N (µ, σ 2 ) then (n − 1)S 2 /σ 2 ∼ Xn−1
the variance expression from a chi-squared random variable (see Equation (3.6)), we have
MSE(S 2 ) = Var(S 2 ) =
(n − 1)S 2 σ4
σ4
2σ 4
Var
=
2(n − 1) =
.
2
2
2
(n − 1)
σ
(n − 1)
n−1
(4.26)
c2 ) is smaller than Equation (4.26). Moreover, the
Analogously, one can show that MSE(σ
MM
estimator (n − 1)S 2 /(n + 1) possesses the smallest MSE (see Problem 4.1.b).
♣
Remark 4.1. In both examples above, the variance has order O(1/n). In practical settings, it is
not possible to get a better rate. In fact, there is a lower bound for the variance that cannot be
undercut (the bound is called the Cramér–Rao lower bound ). Ideally, we aim to construct and
use minimal variance unbiased estimators (MVUE). Such bounds and properties are studied in
mathematical statistics lectures and treatise.
♣
69
15
10
0
5
Frequency
20
4.4. INTERVAL ESTIMATORS
0.4
0.5
0.6
0.7
0.8
lambdas
Figure 4.2: 100 estimates of the parameter λ = 1/2 based on a sample of size n = 25.
The red and blue vertical line indicate the sample mean and median of the estimates
respectively. (See R-Code 4.2.)
4.4
Interval Estimators
The estimates considered so far are point estimates as they are single values and do not provide
us with uncertainties. For this we have to extend the idea towards interval estimates and interval
estimators. We start with a motivating example followed by a couple analytic manipulations to
finally introduce the concept of a confidence interval.
iid
Example 4.9. Let Y1 , . . . , Yn ∼ Exp(λ) and we take n = 25 and λ = 1/2. According to
b = 1/Y is a legitimate estimator (possibly biased due to Jensen’s inequality).
Example 4.3, λ
If we have different samples, our estimates will not be exactly 1/2, but close to it, as nicely
illustrated with R-Code 4.2. If we repeat and draw additional samples, we notice the variability
in the estimates. Overall, we are quite close to the actual value, but individual estimates may
be quite far from the truth, 90% of the estimates are within the interval [ 0.39, 0.67 ].
♣
R-Code 4.2 Variability in estimates. (See Figure 4.2.)
set.seed(14)
# we work reproducible
n <- 25
# sample size
R <- 100
# number of estimates
lambda <- 1/2
# true value to estimate
samples <- matrix( rexp( n*R, rate=lambda), n, R)
lambdas <- 1/colMeans( samples) # actual estimates
hist( lambdas, main='')
# unimodel, but not quite symmetric
abline( v=c(lambda, mean(lambdas), median(lambdas)), col=c(3,2,4))
quantile(lambdas, probs=c(.05,.95))
##
5%
95%
## 0.3888 0.6693
In practice we have one set of observations and thus we cannot repeat sampling to get
a description of the variability in our estimate. Therefore we apply a different approach by
attaching an uncertainty to the estimate itself, which we now illustrate in the Gaussian setting.
70
CHAPTER 4. ESTIMATION OF PARAMETERS
4.4.1
Confidence Intervals in the Gaussian Setting
iid
Let Y1 , . . . , Yn ∼ N (µ, σ 2 ) with known σ 2 . Thus Y ∼ N (µ, σ 2 /n) and, by standardizing, we
p
have (Y − µ)
σ 2 /n ∼ N (0, 1). As a consequence,
σ Y −µ
σ
1 − α = P zα/2 ≤ √ ≤ z1−α/2 = P zα/2 √ ≤ Y − µ ≤ z1−α/2 √
σ/ n
n
n
σ
σ = P −Y + zα/2 √ ≤ −µ ≤ −Y + z1−α/2 √
n
n
σ
σ = P Y − zα/2 √ ≥ µ ≥ Y − z1−α/2 √
n
n
σ
σ = P Y − z1−α/2 √ ≤ µ ≤ Y + z1−α/2 √
,
n
n
(4.27)
(4.28)
(4.29)
(4.30)
where zp is the p-quantile of the standard normal distribution P(Z ≤ zp ) = p for Z ∼ N (0, 1)
(recall that zp = −z1−p ).
The manipulations on the inequalities in the previous derivation are standard but the probabilities in (4.28) to (4.30) seem at first sight a bit strange, they should be read as P {Y − a ≤
µ} ∩ {µ ≤ Y + a} .
Based on Equation (4.30) we now define an interval estimator.
iid
Definition 4.4. Let Y1 , . . . , Yn ∼ N (µ, σ 2 ) with known σ 2 . The interval
h
σ
σ i
Y − z1−α/2 √ , Y + z1−α/2 √
n
n
(4.31)
is an exact (1 − α) confidence interval for the parameter µ. 1 − α is the called the level of the
confidence interval.
♢
If we evaluate the bounds of the confidence interval Bl , Bu at a realization we denote
bl , bu as the sample confidence interval or the observed confidence interval.
The interpretation of an exact confidence interval is as follows: If a large number of realizations are drawn from a random sample, on average (1 − α) · 100% of the confidence intervals will
cover the true parameter µ.
Sample confidence intervals do not contain random variables and therefore it is not possible
to make probability statements about the parameter.
If the standard deviation σ is unknown, the approach above must be modified by using a
√
p
P
point estimate for σ, typically S = S 2 with S 2 = 1/(n − 1) i (Yi −Y )2 . Since (Y − µ)
S 2 /n
has a t-distribution with n−1 degrees of freedom (see Property 3.4.3), the corresponding quantile
must be modified:
Y −µ
√ ≤ tn−1,1−α/2 .
1 − α = P tn−1,α/2 ≤
S/ n
(4.32)
Here, tn−1,p is the p-quantile of a t-distributed random variable with n − 1 degrees of freedom.
The next steps are similar as in (4.28) to (4.30) and the result is summarized below.
71
4.4. INTERVAL ESTIMATORS
iid
Definition 4.5. Let Y1 , . . . , Yn ∼ N (µ, σ 2 ). The interval
h
S
S i
Y − tn−1,1−α/2 √ , Y + tn−1,1−α/2 √
n
n
(4.33)
is an exact (1 − α) confidence interval for the parameter µ.
♢
Example 4.10 (continuation of Example 4.1). For the antibiotic imipenem, we do not know the
underlying variability and hence we use (4.33). A sample 95% confidence interval for the mean
√
√
inhibition diameter is y − t99,.975 s 100, y + t99,.975 s 100 = 32.40 − 1.98 · 1.50/10, 32.40 +
1.98 · 1.50/10 = 32.1, 32.7 , where we used the information from R Code 4.1 and qt(.975,
99) (being 1.984).
Note we have deliberately rounded to a single digit here as the original data has been rounded
to integer millimeters.
♣
Confidence intervals are, as shown in the previous two definitions, constituted by random variables (functions of Y1 , . . . , Yn ). Similar to estimators and estimates, sample confidence intervals
are computed with the corresponding realization y1 , . . . , yn of the random sample. Subsequently,
relevant confidence intervals will be summarized in the blue-highlighted text boxes, as shown
here.
CI 1: Confidence interval for the mean µ
Under the assumption of a normal random sample,
h
S
S i
Y − tn−1,1−α/2 √ , Y + tn−1,1−α/2 √
n
n
(4.34)
is an exact (1 − α) confidence interval and
h
S
S i
Y − z1−α/2 √ , Y + z1−α/2 √
n
n
(4.35)
an approximate (1 − α) confidence interval for µ.
Notice that both, the sample approximate and sample exact confidence intervals of the mean,
are of the form
estimate ± quantile · SE(estimate),
(4.36)
that is, symmetric intervals around the estimate. Here, SE(·) denotes the standard error of the
estimate, that is, an estimate of the standard deviation of the estimator.
iid
Let Y1 , . . . , Yn ∼ N (µ, σ 2 ). The estimator S 2 for the parameter σ 2 is such, that (n −
2 , i.e. a chi-square distribution with n − 1 degrees of freedom (see Section 3.2.1).
1)S 2 /σ 2 ∼ Xn−1
72
CHAPTER 4. ESTIMATION OF PARAMETERS
Hence
(n − 1)S 2
2
1 − α = P χ2n−1,α/2 ≤
≤
χ
n−1,1−α/2
σ2
(n − 1)S 2
(n − 1)S 2 2
=P
≥
σ
≥
,
χ2n−1,α/2
χ2n−1,1−α/2
(4.37)
(4.38)
where χ2n−1,p is the p-quantile of the chi-square distribution with n − 1 degrees of freedom. The
b because
corresponding exact (1 − α) confidence interval no longer has the form θb ± q1−α/2 SE(θ),
the chi-square distribution is not symmetric.
For large n, the chi-square distribution can be approximated with a normal one (see also
Section 3.2.1) with mean n and variance 2n. Hence, a confidence interval based on a Gaussian
approximation is reasonable, see CI 2.
CI 2: Confidence interval for the variance σ 2
Under the assumption of a normal random sample,
h (n − 1)S 2
χ2n−1,1−α/2
,
(n − 1)S 2 i
χ2n−1,α/2
is an exact (1 − α) confidence interval for σ 2 .
For a random sample with n > 50
√ 2i
√ 2
h
2S
2S
2
2
S − z1−α/2 √ , S + z1−α/2 √
n
n
(4.39)
(4.40)
is an approximate (1 − α) confidence interval for σ 2 .
Example 4.11 (continuation of Example 4.1). For the antibiotic imipenem, we have a sample variance of 2.24 leading to a 95%-confidence interval [1.73, 3.03] for σ 2 , computed with
(100-1)*var(imipDat)/qchisq(c(.975, .025), df=100-1). The confidence interval is slightly
asymmetric: the center of the interval is (1.73 + 3.03)/2 = 2.38 compared to the estimate 2.24.
As the sample size is quite large, the Gaussian approximation yields a 95%-confidence interval [1.62, 2.86] for σ 2 , computed with var(imipDat)+qnorm(c(.025, .975 ))*sqrt(2/100)*
var(imipDat). The interval is symmetric, slightly narrower 2.86 − 1.62 = 1.24 compared to
3.03 − 1.73 = 1.30.
√
If we would like to construct a confidence interval for the standard deviation σ = σ 2 , we
√
√
can use the approximation [ 1.73, 3.03 ] = [ 1.32, 1.74 ]. That means, we have applied the same
transformation for the bounds as for the estimate, a quite common approach.
♣
4.4.2
Interpretation of Confidence Intervals
The correct interpretation of sample confidence intervals is not straightforward and often causes
confusion. An exact sample confidence interval [ bl , bu ] for a parameter θ at level 1−α means that
4.4. INTERVAL ESTIMATORS
73
when repeating the same experiment many times, on average, the fraction 1 − α of all confidence
intervals contain the true parameter.
The sample confidence interval [ bl , bu ] does not state that the parameter θ is in the interval
with fraction 1 − α. The parameter is not random and thus such a probability statement cannot
be made.
iid
Example 4.12. Let Y1 , . . . , Y4 ∼ N (0, 1). Figure 4.3 (based on R-Code 4.3) shows 100 sample confidence intervals based on Equation (4.31) (top), Equation (4.35) (middle) and Equation (4.33) (bottom). We color all intervals that do not contain zero, the true (unknown) parameter value µ = 0, in red (if ci[1]>mu | ci[2]<mu is true).
On average we should observe 5% of the intervals colored red (five here) in the top and
bottom panel because these are exact confidence intervals. Due to sampling variability we have
for the specific simulation three and five in the two panels. In the middle panel there are typically
more, as the normal quantiles are too small compared to the t-distribution ones (see Figure 3.2).
Because n is small, the difference between the normal and the t-distribution is quite pronounced;
here, there are eleven intervals that do not cover zero.
A few more points to note are as follows. As we do not estimate the variance, all intervals
in the top panel have the same lengths. Further, the variance estimate shows a lot of variability (very different interval lengths in the middle and bottom panel). Instead of for()-loops,
calls of the form segments(1:ex.n, ybar + sigmaybar*qnorm(alpha/2)), 1:ex.n, ybar sigmaybar*qnorm(alpha/2)) are possible.
♣
4.4.3
Comparing Confidence Intervals
Confidence intervals can often be constructed starting from an estimator θb and its distribution.
In many cases it is possible to extract the parameter to get to 1 − α = P(Bl ≤ θ ≤ Bu ), often
some approximations are necessary. For the variance parameter σ 2 in the framework of Gaussian
random variables, CI 2 states two different intervals. These two are not the only ones and we
can use further approximations (see derivation of Problem 4.7.a or Remark 3.1), all leading to
slightly different confidence intervals.
Similar as with estimators, it is possible to compare different confidence intervals with equal
level. Instead of bias and MSE, the criteria here are the width of a confidence interval and the
so-called coverage probability. The latter is the probability that the confidence interval actually
contains the true parameter. In case there is an exact confidence interval, the coverage probability
is equal to the level of the interval.
For a specific estimator the width of a confidence interval can be reduced by reducing the
level or increasing n. The former should be fixed at 90%, 95% or possibly 99% by convention.
Increasing n after the experiment has been performed is often impossible. Therefore, the sample
size should be choosen before the experiment, such that under the model assumptions the width
of a confidence interval is below a certain threshold (see Chapter 12).
The following example illustrates the concept of coverage probability. A more relevant case
will be presented in the next chapter.
74
CHAPTER 4. ESTIMATION OF PARAMETERS
R-Code 4.3 100 confidence intervals for the parameter µ, based on three different approaches (exact with known σ, approximate, and exact with unknown σ). (See Figure 4.3.)
set.seed( 1)
# important to reconstruct the same CIs
ex.n <- 100
# 100 confidence intervals
alpha <- .05
# 95\% confidence intervals
n <- 4
# sample size
mu <- 0
# mean
sigma <- 1
# standard deviation
sample <- matrix( rnorm( ex.n * n, mu, sigma), n, ex.n) # sample used
yl <- mu + c( -6, 6)*sigma/sqrt(n)
# same y-axis for all 3 panels
ybar <- apply( sample, 2, mean)
# mean for each sample
# First panel: sigma known:
sigmaybar <- sigma/sqrt(n)
plot( 1:ex.n, 1:ex.n, type='n', ylim=yl, xaxt='n', ylab='',
main=bquote(sigma~known))
abline( h=mu)
for ( i in 1:ex.n){
# draw the individual CIs with appropriate color
ci <- ybar[i] + sigmaybar * qnorm(c(alpha/2, 1-alpha/2))
lines( c(i,i), ci, col=ifelse( ci[1]>mu|ci[2]<mu, 2, 1))
}
# Second panel: sigma unknown, normal approx:
sybar <- apply(sample, 2, sd)/sqrt(n)
# estimate the standard deviation
plot( 1:ex.n, 1:ex.n, type='n', ylim=yl, xaxt='n', ylab='',
main=bquote("Gaussian approximation"))
abline( h=mu)
for ( i in 1:ex.n){
# similar but with individual standard deviation
ci <- ybar[i] + sybar[i] * qnorm(c(alpha/2, 1-alpha/2))
lines( c(i,i), ci, col=ifelse( ci[1]>mu | ci[2]<mu, 2, 1))
}
# Third panel: sigma unknown, t-based:
plot(1:ex.n, 1:ex.n, type='n', ylim=yl, xaxt='n', ylab='',
main='t-distribution')
abline( h=mu)
for ( i in 1:ex.n){ # similar but with t-quantile
ci <- ybar[i] + sybar[i] * qt(c(alpha/2, 1-alpha/2), n-1)
lines( c(i,i), ci, col=ifelse( ci[1]>mu | ci[2]<mu, 2, 1))
}
iid
Example 4.13. Let Y1 , . . . , Yn ∼ N (µ, σ 2 ). If σ 2 is known, the following confidence intervals
75
4.4. INTERVAL ESTIMATORS
−3
−1 0
1
2
3
σ known
Gaussian approximation
−3
−1 0
1
2
3
1:ex.n
t−distribution
−3
−1 0
1
2
3
1:ex.n
1:ex.n
Figure 4.3: Normal and t-distribution based
confidence intervals for the parameters
µ = 0 with σ = 1 (above) and unknown σ (middle and below). The sample size is n = 4
and confidence level is (1 − α) = 95%. Confidence intervals which do not cover the true
value zero are shown in red. (See R-Code 4.3.)
have coverage probability exactly equal to the level 1 − α:
h
h
S
S i
S
Y + tn−1,3α/4 √ , Y + tn−1,1−α/4 √
Y + tn−1,α √ , ∞
n
n
n
(4.41)
with (4.34) having the shortest possible width (for fixed level).
To calculate the coverage probability of (4.35), we would have to calculate probabilities of
√
the form P Y + z1−α/2 S/ n ≥ µ . This is not trival and we approach it differently. Suppose
that z1−α/2 = tn−1,1−α⋆ /2 , i.e., the standard normal p-quantile and the 1 − α⋆ /2-quantile of
the t-distribution with n − 1 degrees of freedom are equivalent. Thus the interval has coverage
probability α⋆ and in the case of Example 4.12 α⋆ ≈ 86%, based on 1-2*uniroot(function(p)
qt(p, 3)-qnorm(.025),c(0,1))$root. In Figure 4.3, we have 11 intervals marked compared
76
CHAPTER 4. ESTIMATION OF PARAMETERS
to expected 14.
4.5
♣
Bibliographic Remarks
Statistical estimation theory is very classical and many books are available. For example, Held
and Sabanés Bové (2014) (or it’s German predecessor, Held, 2008) are written at an accessible
level.
4.6
Exercises and Problems
Problem 4.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Derive the approximate confidence interval (4.40) for σ 2 .
n
iid
b) We assume Y1 , . . . , Yn ∼ N (µ, σ 2 ). Let θbρ =
1X
(Yi − Y )2 . Calculate bias θbρ and
ρ
i=1
Var θbρ , both as a function of ρ. Which value of ρ minimizes the mean squared error?
1
c) We assume Y1 , . . . , Yn ∼ N (µ, σ 2 ). Let θbρ =
iid
ρ
n
X
Yi and calculate the MSE as a function
i=1
of ρ. Argue that Y is the minimum variance unbiased estimator.
Problem 4.2 (Hemoglobin levels) The following hemoglobin levels of blood samples from patients with Hb SS and Hb S/β sickle cell disease are given (Hüsler and Zimmermann, 2010):
HbSS <- c( 7.2, 7.7, 8, 8.1, 8.3, 8.4, 8.4, 8.5, 8.6, 8.7, 9.1,
9.1, 9.1, 9.8, 10.1, 10.3)
HbSb <- c( 8.1, 9.2, 10, 10.4, 10.6, 10.9, 11.1, 11.9, 12.0, 12.1)
a) Visualize the data with boxplots.
b) Propose a statistical model for Hb SS and for Hb S/β sickle cell diseases. What are the
parameters? Indicate your random variables and parameters with subscripts SS and Sb .
c) Estimate all parameters from your model proposed in part b).
d) Propose a single statistical model for both diseases. What are the parameters? Estimate
all parameters from your model based on intuitive estimators.
e) Based on boxplots and QQ-plots, is there coherence between your model and the data?
iid
Problem 4.3 (Normal distribution with known σ 2 ) Let X1 , . . . , Xn ∼ N (µ, σ 2 ), with σ > 0
assumed to be known.
4.6. EXERCISES AND PROBLEMS
a) What is the distribution of
√
77
n(X − µ)/σ? (No formal proofs required).
b) Let n = 9. Calculate P(−1 ≤ (X − µ)/σ ≤ 1).
c) Determine the lower and upper bound of a confidence interval Bl and Bu (both functions
of X̄) such that
√
P(−q ≤ n(X − µ)/σ ≤ q) = P(Bl ≤ µ ≤ Bu )
d) Construct a sample 95%-confidence interval for µ.
e) Determine an expression of the width of the confidence interval? What “elements” appearing in a general 1 − α-confidence interval for µ make the interval narrower?
f) Use the sickle-cell disease data from Problem 2 and construct 90%-confidence intervals for
the means of HbSS and HbSβ variants (assume σ = 1).
g) Repeat problems a)–f) by replacing σ with S and making adequate and necessary changes.
Problem 4.4 (ASTs radii) We work with the inhibition diameters for imipenem and meropenem
as given in Table 4.1.
a) The histograms of the inhibition diameters in Figure 4.1 are not quite symmetric. Someone
proposes to work with square-root diameters.
Visualize the transformed data of Table 4.1 with histograms. Does such a transformation
render the data more symmetric? Does such a transformation make sense? Are there other
reasonable transformations?
b) What are estimates of the inhibition area for both antibiotics (in cm2 )?
c) Construct a 90% confidence interval for the inhibition area for both antibiotics?
d) What is an estimate of the variability of the inhibition area for both antibiotics? What is
the uncertainty of this estimate.
Problem 4.5 (Geometric distribution) In the setting of Example 2.3, denote p = P( shot is
successful ). Assume that it took the boy k1 , k2 , . . . , kn attempts for the 1st, 2nd, . . . , nth
successful shot.
a) Derive the method of moment estimator of p. Argue that this estimator is also the least
squares estimator.
b) Derive the maximum likelihood estimate of p.
P
c) The distribution of the estimator pb = n/ ni=1 ki is non-trivial. And thus we use a simulation approach to assess the uncertainty in the estimate. For n = 10 and p = 0.1 draw from
the geometric distribution (rgeom(...)+1)) and report the estimate. Repeat R = 500
times and discuss the histogram of the estimates. What changes if p = 0.5 or p = 0.9?
78
CHAPTER 4. ESTIMATION OF PARAMETERS
d) Do you expect the estimator in c) to be biased? Can you support your claim with a
simulation?
iid
Problem 4.6 (Poisson Distribution) Consider X1 , . . . , Xn ∼ Pois(λ) with a fixed λ > 0.
b = 1 Pn Xi be an estimator of λ. Calculate E(λ),
b Var(λ)
b and the MSE(λ).
b
a) Let λ
i=1
n
b Var(λ)
b and MSE(λ)
b when n → ∞ ?
b) What is E(λ),
iid
Problem 4.7 (Coverage probability) Let Y1 , . . . , Yn ∼ N (µ, σ 2 ) and consider the usual estimator
S 2 for σ 2 . Show that the coverage probability of (4.40) is given by
1− P W ≥
n−1
√
1 − z1−α/2 √n2
−P W ≤
n−1
√
1 + z1−α/2 √n2
2 . Plot the coverage probability as a function of n.
with W ∼ Xn−1
Problem 4.8 (Germany cancer counts) The dataset Oral is available in the R package spam
and contains oral cavity cancer counts for 544 districts in Germany.
a) Load the data and take a look at its help page using ?Oral.
b) Compute summary statistics for all variables in the dataset.
Which of the 544 regions has the highest number of expected counts E ?
c) Poisson distribution is common for modeling rare events such as death caused by cavity
cancer (column Y in the data). However, the districts differ greatly in their populations.
Define a subset from the data, which only considers districts with expected fatal casualties
caused by cavity cancer between 35 and 45 (subset, column E). Perform a Q-Q Plot for a
Poisson distribution.
Hint: use qqplot() from the stats package and define the theoretical quantiles with
qpois(ppoints( ...), lambda=...).
Simulate a Poisson distributed random variable with the same length and and the same
lambda as your subset. Perform a QQ-plot of your simulated data. What can you say
about the distribution of your subset of the cancer data?
d) Assume that the standardized mortality ratio Zi = Yi /Ei is normally distributed, i.e.,
iid
Z1 , . . . , Z544 ∼ N (µ, σ 2 ). Estimate µ and give a 95% (exact) confidence interval (CI).
What is the precise meaning of the CI?
e) Simulate a 95% confidence interval based on the following bootstrap scheme (sampling with
replacement):
Repeat 10′ 000 times
– Sample 544 observations Zi with replacement
– Calculate and store the mean of these sampled observations
4.6. EXERCISES AND PROBLEMS
79
Construct the confidence interval by taking the 2.5% and the 97.5% quantiles of the stored
means.
Compare it to the CI from d).
Problem 4.9 (BMJ Endgame) Discuss and justify the statements about ‘Describing the spread
of data’ given in doi.org/10.1136/bmj.c1116.
80
CHAPTER 4. ESTIMATION OF PARAMETERS
Chapter 5
Statistical Testing
Learning goals for this chapter:
⋄ Explain the concepts of hypothesis and significance test
⋄ Given a problem situation, state appropriate null and alternative hypotheses,
perform a hypothesis test, interpret the results
⋄ Define p-value and the significance level
⋄ Know the difference between one-sample and two-sample t-tests
⋄ Explain and apply various classical tests
⋄ Understand the duality of tests and confidence intervals
⋄ Be aware of the multiple testing problem and know how to deal with it
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter05.R.
Recently I have replaced a LED light bulb that was claimed to last 20 000 hours. However,
in less than 2 months (and only a fraction thereof in use) the bulb was already broken and I
immediately asked myself if I am unlucky or is the claim simply exaggerated? A few moments
later rational kicked back in and - being a statistician - I knew that one individual break should
not be used to make too general statements
On a similar spirit, suppose we observe 13 heads in 17 tosses of a coin. When tossing a
fair coin, I expect somewhere between seven to ten heads and the 13 observed ones representing
seemingly an unusual case. We intuitively wonder if the coin is fair.
In this chapter we discuss a formal approach to answer if the observed data provides enough
evidence against a hypothesis (against a claimed livetime or against a claimed fairness). We
introduce two interlinked types of statistical testing procedures and provide a series of tests that
can be used off-the-shelf.
81
82
5.1
CHAPTER 5. STATISTICAL TESTING
The General Concept of Significance Testing
The idea of a statistical testing procedure is to formulate a statistical hypothesis and to draw
conclusions from them based on the data. A testing procedure states at the beginning a null
hypothesis, denoted with H0 , and we compare how compatible the data is with respect to this
hypothesis.
Simply stated, starting from a statistical null hypothesis a statistical test calculates a value
from the data and places that value in the context of the hypothetical distribution induced by
the statistical null hypothesis. If the value from the data is unlikely to occur with respect to the
hypothetical distribution, we argue that the data provides evidence against the null hypothesis.
This coherence is typically quantified as a probability, i.e., the famous p-value, with formal
definition as follows.
Definition 5.1. The p-value is the probability under the distribution of the null hypothesis of
obtaining a result equal to or more extreme than the observed result.
♢
Example 5.1. We assume a fair coin is tossed 17 times and we observe 13 heads. Under
the null hypothesis of a fair coin, each toss is a Bernoulli random variable and the 17 tosses
can be modeled with a binomial random variable Bin(n = 17, p = 1/2). Hence, the p-value
is the probability of observing 0, 1, . . . , 4, 13, 14, . . . , 17 heads (or by symmetry of observing
17, . . . , 13, 4, . . . , 0), which can be calculated with sum( dbinom(0:4, size=17, prob=1/2) +
dbinom(13:17, size=17, prob=1/2)) and is 0.049. The p-value indicates that we observe such
a seemingly unlikely event roughly every 20th time.
Note that because of the symmetry of the binomial distribution at Bin(n, 1/2), we can alternatively calculate the p-value as 2*pbinom(4, size=17, prob=1/2) or equivalently as 2*pbinom(12,
size=17, prob=1/2, lower.tail=FALSE).
In this example, we have considered more extreme as very many or very few heads. There
might be situations, where very few heads is not relevant or does not even make sense and thus
“more extreme” corresponds only to observing 13, 14, . . . , 17 heads.
♣
Figure 5.1 illustrates graphically the p-value in two hypothetical situations. Suppose that
under the null hypothesis the hypothetical distribution of the observed result is Gaussian with
mean zero and variance one and suppose that we observe a value of 1.8. If more extreme is considered on both sides of the tails of the density then the p-value consists of two probabilities (here
because of the symmetry, twice the probability of either side). If more extreme is actually larger
(possibly smaller in other situations), the p-value is calculated based on a one-sided probability.
As the Gaussian distribution is symmetric around its mean, the two-sided p-value is twice the
one-sided p-value, here 1-pnorm(1.8), or, equivalently, pnorm(1.8, lower.tail=FALSE).
We illustrate the statistical testing approaches and the statistical tests with data introduced
in the following example.
Example 5.2. In rabbits, pododermatitis is a chronic multifactorial skin disease that manifests
mainly on the hind legs. This presumably progressive disease can cause pain leading to poor
welfare. To study the progression of this disease on the level of individual animals, scientists
83
5.1. THE GENERAL CONCEPT OF SIGNIFICANCE TESTING
p−value=0.072
(two−sided)
H0
p−value=0.036
(one−sided)
H0
Observation
−3
−2
−1
0
1
2
3
Observation
−3
−2
−1
0
1
2
3
Figure 5.1: Illustration of the p-value in the case of a standard normal distribution
with observed value 1.8. Two-sided (left panel) and one-sided setting (right panel).
assessed many rabbits in three farms over the period of an entire year (Ruchti et al., 2019). We
use a subset of the dataset in this and later chapters, consisting of one farm (with two barns) and
four visits (between July 19/20, 2016 and June 29/30, 2017). The 6 stages from Drescher and
Schlender-Böbbis (1996) were used as a tagged visual-analogue-scale to score the occurrence and
severity of pododermatitis on 4 spots on the rabbits hind legs (left and right, heal and middle
position), resulting in the variable PDHmean with range 0–10, for details on the scoring see Ruchti
et al. (2018). We consider the visits in June 2017. R-Code 5.1 loads the dataset and subsets it
correspondingly.
♣
In practice, we often start with a scientific hypothesis and subsequently collect data or perform an experiment to confirm the hypothesis. The data is then “modeled statistically”, in the
sense that we need to determine a theoretical distribution for which the data is a realization. In
our discussion here, the distribution typically involves parameters that are linked to the scientific
question (probability p in a binomial distribution for coin tosses, mean µ of a Gaussian distribution for testing differences pododermatitis scores). We then formulate the null hypothesis H0
for which the data should provide evidence against. The calculation of a p-value can be summarized as follows. When testing about a certain parameter, say θ, we use an estimator θb for that
parameter. We often need to transform the estimator such that the distribution thereof does
not depend on (the) parameter(s). We call this random variable test statistic which is typically
a function of the random sample. The test statistic evaluated at the observed data is then used
to calculate the p-value based on the distribution of the test statistic. Based on the p-value we
summarize the evidence against the null hypothesis. We cannot make any statement for the
hypothesis.
Example 5.3 (continuation of Example 5.2). For the visits in June 2017 we would like to asses
if the score of the rabbits is comparable to 10/3 ≈ 3.33, representing a low-grade scoring (lowgrade hyperkeratosis, hypotrichosis or alopecia). We have 17 observations and the sample mean
is 3.87 with a standard deviation of 0.64. Is this enough evidence in the data to claim that the
observed mean is different from low-grade scoring?
We postulate a Gaussian model for the scores. The observations are a realization of X1 , . . . ,
iid
X17 ∼ N (µ, 0.82 ), i.e., n = 17 and the standard deviation is known (the latter will be relaxed
84
CHAPTER 5. STATISTICAL TESTING
R-Code 5.1 Pododermatitis in rabbits, dataset pododermatitis.
str( podo <- read.csv('data/podo.csv'))
## 'data.frame': 67 obs. of 6 variables:
## $ ID
: int 4 3 9 10 1 7 8 5 14 13 ...
## $ Age
: int 12 12 14 12 17 14 14 12 6 6 ...
## $ Weight : num 4.46 4.31 4.76 5.34 5.71 5.39 5.42 5.13 5.39 5.41 ...
## $ Visit : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Barn
: int 1 1 1 1 1 1 1 1 1 1 ...
## $ PDHmean: num 4.47 3.62 4.28 4.2 1.77 ...
apply( podo, 2, function(x) length(unique(x)) )
##
##
ID
17
Age
21
Weight
63
Visit
4
# 4 visits, 2 barns
Barn PDHmean
2
49
PDHmean <- podo$PDHmean[podo$Visit==13]
length( PDHmean)
## [1] 17
print( me <- mean( PDHmean))
## [1] 3.8691
print( se <- sd( PDHmean))
## [1] 0.63753
soon, here we suppose that this value has been determined by another study or additional
information). The scientific hypothesis results in the null hypothesis H0 : “mean is low-grade
H0 :µ=3.33
scoring” being equivalent to H0 : µ = 3.33. Under the null hypothesis, we have X
∼
2
N (3.33, 0.8 /17). Hence, we test about the parameter µ and our estimator is θb = X with a
known distribution under the null. With this information, the p-value in a two-sided setting is
p-value = P(under the null hypothesis we observe 3.87 or a more extreme value)
= 2 PH0 (|X| ≥ |x|) = 2 1 − PH0 (X < 3.87)
X − 3.33
3.87 − 3.33 √ <
√
=2 1−P
= 2(1 − φ(2.76)) ≈ 0.6%.
0.8/ 17
0.8/ 17
(5.1)
(5.2)
(5.3)
where we have used the subscript “H0 ” to emphasis the calculation under the null hypothesis,
i.e., µ = 3.33. An alternative is to write it in a conditional form PH0 ( · ) = P( · | H0 ). Hence,
there is evidence in the data against the null hypothesis.
♣
In many cases we use a “known” statistical test, instead of a “manually” constructed test
statistic. Specifically, we state the statistical model with a well-known and named test, as we
shall see later.
Some authors summarize p-values in [1, 0.1] as no evidence, in [0.1, 0.01] as weak evidence,
in [0.01, 0.001] as substantial evidence, and smaller ones as strong evidence (see, e.g., Held and
Sabanés Bové, 2014). For certain representations, R output uses symbols for similar ranges ␣ ,
. and * , ** , and, *** .
85
5.1. THE GENERAL CONCEPT OF SIGNIFICANCE TESTING
The workflow of a statistical significance test can be summarized as follows. The starting
point is a scientific question or hypothesis and data that has been collected to support the
scientific claim.
(i) Formulation of the statistical model and statistical assumptions. Formulate the scientific
hypothesis in terms of a statistical one.
(ii) Selection of the appropriate test or test statistic and formulation of the null hypothesis H0
with the parameters of the test.
(iii) Calculation of the p-value,
(iv) Interpretation of the results of the statistical test and conclusion.
Although the workflow is presented in a linear fashion, there are several dependencies. For
example the interpretation depends not only on the p-value but also on the null hypothesis in
terms of proper statistical formulation and, finally, on the scientific question to be answered, see
Figure 5.2. The statistical test hypothesis essentially depends on the statistical assumptions, but
need to be cast to answer the scientific questions, of course. The statistical assumptions may
also determine the selection of statistical tests. This dependency will be taken up in Chapter 7.
Scientific question
or hypothesis;
data from an experiment
(i) Statistical model
assumptions;
test hypothesis
(ii) Selection of
appropriate test;
H 0 with parameters
(iii) Calculation
of the p−value
(iv) Interpretation
of the results,
conclusions
Figure 5.2: Workflow of statistical testing.
Example 5.4 (Revisiting Example 5.1 using the workflow of Figure 5.2). The (scientific) claim
is that the coin is biased and as data we have 17 coin tosses with 13 heads. (i) As the data is
the number of successes among a fixed number of trials a binomal model is appropriate. The
test hypothesis that we want to reject is equal chance of getting head or tail. (ii) We can work
directly with X ∼ Bin(17, p) with the null hypothesis H0 : p = 1/2. (iii) The p-value is 0.049.
(iv) The p-value indicates that we observe such a seemingly unlikely event roughly every 20th
time only weak evidence.
♣
Example 5.5 (Revisiting Example 5.2 using the workflow of Figure 5.2). The scientific claim
is that the observed score is different to low-grade. We have four scores from the hind legs from
17 animals. (i) We work with one average value per animal (PDMean). Thus we assume that
iid
X ∼ N (µ, 0.82 ). We want to quantify how much the observed mean differs from low-grade score.
(ii) We have X ∼ N (µ, 0.82 /17), with the null hypothesis H0 : µ = 3.33. (iii) The p-value is
p-value = 2 PH0 (|X| ≥ |x|) ≈ 0.6%. (iv) There is substantial evidence, that the pododermatitis
scores are different than low-grade scoring.
♣
86
5.2
CHAPTER 5. STATISTICAL TESTING
Hypothesis Testing
A criticism of significance testing is that we have only one hypotheses. It is often easier to choose
between two alternatives and this is essentially the approach of hypothesis testing. However, as
in the setting of significance testing, we will still not choose for one hypothesis but rather reject
or fail to reject the null hypothesis.
Hypothesis testing starts with a null hypothesis H0 and an alternative hypothesis, denoted
by H1 or HA . These hypotheses are with respect to a parameter, say θ or some other choice
specific to the situation.
Example 5.6. We revisit the light bulb situation elaborated at the beginning of the chapter.
I postulate a null hypothesis H0 : “median lifetime is 20 000 h” versus the alternative hypothesis
H1 : “median lifetime is 5 000 h”. I only have one observation and thus I need external information
about the distribution of light bulb lifetime. Although there is no consensus, some published
literature claim that the cdf of certain types of light bulbs are given by F (x) = 1 − exp(−x/λ)k
for x > 0 and k between 4 and 5. For simplicity we take k = 4 and thus the median lifetime is
λ log(2)1/4 . The hypotheses are thus equivalent to H0 : λ = 20 000/ log(2)1/4 h versus H1 : λ =
5 000/ log(2)1/4 h.
♣
In the example above, I could have taken any other value for the alternative. Of course this
is dangerous and very subjective (the null hypothesis is given by companies claim). Therefore,
we state the alternative hypothesis as everything but the null hypothesis. In the example above
it would be H1 : λ ̸= 20 000/ log(2)1/4 h.
In a similar fashion, we could state that the median lifetime is at least 20 000 h. In such
a setting we would have a null hypothesis H0 : “median lifetime is 20 000 h or larger” versus
the alternative hypothesis H1 : “median lifetime smaller than 20 000 h”, which is equivalent to
H0 : λ ≥ 20 000/ log(2)1/4 h versus H1 : λ < 20 000/ log(2)1/4 h.
Hypotheses are classified as simple if parameter θ assumes only a single value (e.g., H0 :
θ = 0), or composite if parameter θ can take on a range of values (e.g., H0 : θ ≤ 0, H1 : µ ̸= µ0 ).
The case of a simple null hypothesis and composite alternative hypothesis is also called a
two-sided setting. Whereas composite null hypothesis and composite alternative hypothesis is
called one-sided or directional setting.
Note that for Example 5.2 a one-sided test is necessary for the hypothesis “there is a progression of the pododermatitis scores between two visits”, but a two-sided test is needed for “the
pododermatitis scores between two visits are different”. We strongly recommend to always use
two-sided tests (e.g. Bland and Bland, 1994; Moyé and Tita, 2002), not only in clinical studies
where it is the norm but as Bland and Bland (1994) states “a one-sided test is appropriate when
a large difference in one direction would lead to the same action as no difference at all. Expectation of a difference in a particular direction is not adequate justification.”. However to illustrate
certain concepts, a one-sided setting may be simpler and more accessible.
In the case of a hypothesis test, we compare the value of the test statistic with the quantiles
of the distribution of the null hypothesis. A predefined threshold determines if we reject H0 , if
not, we fail to reject H0 .
87
5.2. HYPOTHESIS TESTING
Definition 5.2. The significance level α is a threshold determined before the testing, with
0 < α < 1 but it is often set to 5% or 1%.
The rejection region of a test includes all values of the test statistic for which we reject the
null hypothesis. The boundary values of the rejection region are called critical values.
♢
Similar as for the significance test we reject for values of the test statistic that are in the tail
of the density under the null hypothesis, e.g., that would lead to a small p-value. In fact, we can
base our decision on whether the p-value is smaller than the significance level or not.
It is important to realize that the level α is set by the scientists, not by the experiment or
the data. Therefore there is some “arbitrariness” to the value and thus whether we reject the H0
or not. The level α may be imposed to other values in different scientific domains.
For one-sided hypotheses like H0 : µ ≤ µ0 versus H1 : µ > µ0 (similarly H0 : µ ≥ µ0 versus
H1 : µ < µ0 ), the “=” case is used in the null hypothesis, i.e., the most unfavorable case of
the alternative point of view. That means, the null hypothesis is from a calculation perspective
always simple and could be reduced to a simple hypothesis.
Figure 5.3 shows the rejection region of two-sided test H0 : µ = µ0 vs H1 : µ ̸= µ0 (left
panel), H0 : µ ≤ µ0 vs H1 : µ > µ0 and one-sided (right panel), for the specific case µ = 0. For
the left-sided case, H0 : µ ≥ µ0 vs H1 : µ < µ0 , the rejection area is on the left side, analogue to
the right-sided case.
If we assume that the distribution of the test statistic is Gaussian and α = 0.05, the critical
values are ±1.96 and 1.64, respectively (qnorm(c(0.05/2, 1 - 0.05/2) and qnorm(1 - 0.05)).
These critical values are typically linked with the so-called z-test.
H0
H0
Critical value
−3
−2
Critical value
−1
0
1
2
3
Critical value
−3
−2
−1
0
1
2
3
Figure 5.3: Critical values (red) and rejection regions (orange) for two-sided H0 : µ =
µ0 = 0 (left) and one-sided H0 : µ ≤ µ0 = 0 (right) hypothesis test with significance
level α = 5%.
In significance testing two types of errors can occur. Type I errors: we reject H0 if we should
have not and Type II errors: we fail to reject H0 if we should have. The framework of hypothesis
testing allows us to express the probabilities of committing these two errors. The probability of
committing a Type I error is exactly α = P(reject H0 |H0 ). This probability is often called the
size of the test. To calculate the probability of committing a Type II error, we need to assume
a specific value for our parameter within the alternative hypothesis, e.g., a simple alternative.
88
CHAPTER 5. STATISTICAL TESTING
The probability of Type II error is often denoted with β = P(not rejecting H0 |H1 ). Table 5.1
summarizes the errors in a classical 2 × 2 layout.
Table 5.1: Probabilities of Type I and Type II errors in the setting of a significance
test.
Test result
do not reject H0
reject H0
True but unknown state
H0 true
H1 true
1−α
β
α
1−β
Example 5.7 (revisit Example 5.1). The null hypothesis remains as having a fair coin and
the alternative is simply not having a fair coin. Suppose we reject the null hypothesis if we
observe 0, . . . , 4 or 13, . . . , 17 heads out of 17 tosses. The Type I error is 2*pbinom(4, size=17,
prob=1/2), i.e., 0.049, and, if the coin has a probability of 0.7 for heads, the Type II error is
sum(dbinom(5:12, size=17, prob=0.7)), i.e., 0.611. However, if the coin has a probability
of 0.6 for heads, the Type II error increases to sum(dbinom(5:12, size=17, prob=0.6)), i.e.,
0.871.
♣
Ideally we would like to use tests that have simultaneous small Type I and Type II errors.
This is coneptually not possible as reducing one increases the other and one typically fixes the
Type I error to some small value, say 5%, 1% or suchlike (committing a type one error has
typically more severe consequences than a Type II error). Type I and Type II errors are shown
in Figure 5.4 for two different alternative hypotheses. When reducing the significance level α,
the critical values move further from the center of the density under H0 and thus to an increase
of the Type II error β. Additionally, the clearer the separation of the densities under H0 and
H1 , the smaller the Type II error β. This is intuitive, if the data stems from H1 which is “far”
from H0 , the chance that we reject is large.
As a summary, the Type I error
• is defined a priori by selection of the significance level (often 5%, 1%),
• is not influenced by sample size,
• is increased with multiple testing of the same data (we discuss this in Section 5.5.2)
and the Type II error
• depends on sample size and significance level α,
• is a function of the alternative hypothesis.
The value 1−β is called the power of a test. High power of a test is desirable in an experiment:
we want to detect small effects with a large probability. R-Code 5.2 computes the power for a
z-test (Gaussian random sample with known variance). More specifically, under the assumption
√
of σ/ n = 1 we test H0 : µ0 = 0 versus H1 : µ0 ̸= 0. Similarly to the probability of a Type II
89
5.2. HYPOTHESIS TESTING
R-Code 5.2 A one-sided and two-sided power curve for a z-test. (See Figure 5.5.)
alpha <- 0.05
# significance level
mu0 <- 0
# mean under H_0
mu1 <- seq(-1.5, to=5, by=0.1)
# mean under H_1
power_onesided <- 1-pnorm( qnorm(1-alpha, mean=mu0), mean=mu1)
power_twosided <- pnorm( qnorm(alpha/2, mean=mu0), mean=mu1) +
pnorm( qnorm(1-alpha/2, mean=mu0), mean=mu1, lower.tail=FALSE)
plot( mu1, power_onesided, type='l', ylim=c(0,1), xlim=c(-1, 4.25), las=1,
xlab=bquote(mu[1]-mu[0]), ylab="Power", col=4, yaxs='i', lwd=1)
axis(2, at=alpha, labels='')
# adding level
axis(2, at=1.4*alpha, labels=bquote(alpha), las=1, adj=0, tick=FALSE)
lines( mu1, power_twosided, lty=2) # power curve for two-sided test
abline( h=alpha, col='gray')
# significance level
abline( v=c(2, 4), lwd=2, col=3)
# values from figure 4.3
error, the power can only be calculated for a specific assumption of the “actual” mean µ1 , i.e., of
a simple alternative. Thus, as typically done, Figure 5.5 plots power (µ1 − µ0 ).
For µ1 = µ0 , the power is equivalent to the size of the test (significance level α). If µ1 − µ0
H0
−2
0
2
4
H0
−2
H0
H1
0
2
4
6
−2
H1
H0
6
−2
H1
0
2
4
6
H1
0
2
4
6
Figure 5.4: Type I error with significance level α (red) and Type II error with probability β (blue) for two different alternative hypotheses (µ = 2 top row, µ = 4 bottom
row) with two-single hypothesis H0 : µ = µ0 = 0 (left column) and one-sided hypothesis
H0 : µ ≤ µ0 = 0 (right).
90
CHAPTER 5. STATISTICAL TESTING
1.0
Power
0.8
0.6
0.4
0.2
α
0.0
−1
0
1
2
3
4
µ1 − µ0
Figure 5.5: Power curves for a z-test: one-sided (blue solid line) and two-sided (black
dashed line). The gray line represents the level of the test, here α = 5%. The vertical
lines represent the alternative hypotheses µ = 2 and µ = 4 of Figure 5.4. (See RCode 5.2.)
increases, the power increases in sigmoid-shaped form to reach asymptotically one. For a twosided test, the power is symmetric around µ1 = µ0 (no direction preferred) and smaller than the
power for a one sided test. The latter decreases further to zero for negative differences µ1 − µ0
although for the negative values the power curve does not make sense. In a similar fashion as
to reduce the probability of a Type II error, it is possible to increase the power by increasing
√
the sample. Note that in the illustrations above we work with σ/ n = 1, the dependence of the
power on n is somewhat hidden by using the default argument for sd in the functions pnorm()
and qnorm().
Remark 5.1. It is impossible to reduce simultaneously both the Type I and II error probabilities, but it is plausible to consider tests that have the smallest possible Type II error (or
the largest possible power) for a fixed significance level. More theoretical treatise discuss the
existence and construction of uniformly most powerful tests. Note that the latter do not exist
for two-sided settings or for tests involving more than one parameter. Most tests that we discuss in this book are “optimal” (the one-sided version being uniformly most powerful under the
statistical assumptions).
♣
The workflow of a hypothesis test is very similar to the one of a statistical significance test
and only point (ii) and (iii) need to be slightly modified:
(i) Formulation of the statistical model and statistical assumptions. Formulate the scientific
hypothesis in terms of a statistical one.
(ii) Selection of the appropriate test or test statistic, significance level and formulation of the
null hypothesis H0 and the alternative hypothesis H1 with the parameters of the test.
(iii) Calculation of the test statistic value (or p-value), comparison with critical value (or level)
and decision
(iv) Interpretation of the results of the statistical test and conclusion.
91
5.2. HYPOTHESIS TESTING
The choice of test is again constrained by the assumptions. The significance level must,
however, always be chosen before the computations.
The value of the test statistic, say tobs , calculated in step (iii) is compared with critical values
tcrit in order to reach a decision. When the decision is based on the calculation of the p-value,
it consists of a comparison with α. The p-value can be difficult to calculate, but is valuable
because of its direct interpretation as the strength (or weakness) of the evidence against the null
hypothesis.
General remarks about statistical tests
In the following, we will present several statistical tests in text blocks like this one.
In general, we denote
• n, nx , ny , . . . the sample size;
• x1 , . . . , xnx , y1 , . . . , yny , samples, independent observations;
• x, y, . . . the sample mean;
n
x
1 X
(xi − x)2
nx − 1
i=1
(in the tests under consideration, the variance is unknown);
• s2 , s2x , s2y , . . . the sample variance, e.g., s2x =
• α the significance level (0 < α < 1, but α typically small);
• tcrit , Fcrit , . . . the critical values, i.e., the quantiles according to the distribution of the test statistic and the significance level.
We forumlate our scientific question in general terms under Question. The statistical or formal assumptions summarizing the statistical model are given under
Assumptions. Generally, two-sided tests are performed.
For most tests, there is a corresponding R function. The arguments x, y usually
represent vectors containing the data and alpha the significance level. From the
output, it is possible to get the p-value.
In this book we consider typical settings and discuss common and appropriate tests. The
choice of test is primarily dependent on the quantity being tested (location, scale, frequencies,
. . . ) and secondly on the statistical model and assumptions. The tests will be summarized in
yellowish boxes similar as given here. The following list of tests can be used as a decision tree.
• Tests involving the location (mean, means, medians):
– one sample (Test 1 in Section 5.3.1)
– two samples:
∗ two paired/dependent samples (Test 3 in Section 5.3.3 and Test 8 in Chapter 7)
∗ two independent samples (Test 2 in Section 5.3.1 and Test 9 in Chapter 7)
– several samples (Test 11 in Chapter 7 and Test 16 in Chapter 12)
92
CHAPTER 5. STATISTICAL TESTING
• Tests involving variances:
– one sample (Problem 5.4)
– two samples (Test 4 in Section 5.3.4)
• Tests to compare frequencies:
– one proportions
– two proportions (Test 5 in Chapter 6)
– distributions (Test 6 in Chapter 6)
Of course this list is not exhaustive and many additional possible tests exists and are frequently used. Moreover, the approaches described in the first two sections allow to construct
arbitrary tests.
We present several of these tests in more details by motivating the test statistic, giving an
explicit example and by summarizing the test in yellow boxes. Ultimately, we perform test with
a single call in R. However, the underlying mechanism has to be understood, it would be too
dangerous using statistical tests as black-box tools only.
5.3
Testing Means and Variances in Gaussian Samples
In this section, we compare means and variances from normally distributed samples. Formally,
iid
we assume that the data y1 , . . . , yn is a realization of Y1 , . . . , Yn ∼ N (µ, σ 2 ). This distributional
assumption is required to use the results of Section 3.2 and implies that our results (p-values,
Type I and II error probabilities) are exact. If there is a discrepancy between the data and the
statistical model, the results are at most approximate. In many situations, the approximation is
fairly good because of the central limit theorem and we can proceed nevertheless but interpret
the results accordingly. In Chapter 7, we see an alternative and we will relax the Gaussian
assumption entirely.
5.3.1
Comparing a Sample Mean with a Theoretical Value
We revisit the setting when comparing the sample mean with a hypothesized value (e.g., obiid
served pododermatitis score with the value 3.33). As stated above, Y1 , . . . , Yn ∼ N (µ, σ 2 ) with
parameter of interest µ but now unknown σ 2 . Thus, from
Y ∼ N (µ, σ 2 /n)
=⇒
Y −µ
√ ∼ N (0, 1)
σ/ n
σ unknown
=⇒
Y −µ
√ ∼ tn−1 .
S/ n
(5.4)
The null hypothesis H0 : µ = µ0 specifies the hypothesized mean and the distribution in (5.4).
This test is tyically called the “one-sample t-test”, for obvious reasons. To calculate p-values
the function pt(..., df=n-1) (for sample size n) is used.
The box Test 1 summarizes the test and Example 5.8 illustrates the test based on the pododermatitis data.
5.3. TESTING MEANS AND VARIANCES IN GAUSSIAN SAMPLES
93
Test 1: Comparing a sample mean with a theoretical value
Question: Does the sample mean deviate significantly from the postulated but
unknown mean?
Assumptions: The observed data is a realization of a Gaussian random sample with
unknown mean and variance.
Calculation: tobs =
|x − µ0 |
√ .
s/ n
Decision: Reject H0 : µ = µ0 , if tobs > tcrit = tn−1,1−α/2 .
Calculation in R: t.test( x, mu=mu0, conf.level=1-alpha)
Example 5.8 (continuation of Example 5.2). We test the hypothesis that the animals have
a different pododermatitis score than low-grade hyperkeratosis, corresponding to 3.333. The
sample mean is larger and we want to know if the difference is large enough for a statistical
claim.
The statistical null hypothesis is that the mean score is equal to 3.333 and we want to know
if the mean of the (single) sample deviates from a specified value, sufficiently for a statistical
claim. Although there might be a preferred direction of the test (higher score than 3.333), we
perform a two-sided hypothesis test. From R-Code 5.1 we have for the sample mean x̄ = 3.869,
sample standard deviation s = 0.638 and sample size n = 17. Thus,
H0 : µ = 3.333 versus H1 : µ ̸= 3.333;
|3.869 − 3.333|
√
= 3.467;
tobs =
0.638/ 17
tcrit = t16,1−0.05/2 = 2.120
p-value: 0.003.
Formally, we can reject our H0 because tobs > tcrit . The p-value can be calculated with 2*(1-pt(
tobs, n-1)) with tobs defined as 3.467. This calculation is equivalent to 2*pt( -tobs, n-1).
The p-value is low and hence there is substantial evidence against the null hypothesis.
R-Code 5.3 illustrates the direct testing in R with the function t.test() and subsequent
extraction of the p-value.
♣
The returned object of the function t.test() (as well as of virtually all other test functions
we will see) is of class htest. Hence, the output looks always similar and is summarized in
Figure 5.6 for the particular case of the example above.
5.3.2
Comparing Sample Means of two Samples
Comparing means of two different samples is probably the most often used statistical test. To
introduce this test, we assume that both random samples are normally distributed with equal
iid
iid
sample size and variance, i.e., X1 , . . . , Xn ∼ N (µx , σ 2 ), Y1 , . . . , Yn ∼ N (µy , σ 2 ). Further, we
94
CHAPTER 5. STATISTICAL TESTING
R-Code 5.3 One sample t-test, pododermatitis (see Example 5.8 and Test 1)
print( out <- t.test( PDHmean, mu=3.333))
# print the result of the test
##
## One Sample t-test
##
## data: PDHmean
## t = 3.47, df = 16, p-value = 0.0032
## alternative hypothesis: true mean is not equal to 3.333
## 95 percent confidence interval:
## 3.5413 4.1969
## sample estimates:
## mean of x
##
3.8691
out$p.val
# printing only the p-value
## [1] 0.0031759
Name of the statistical test
Name of the object containing the data
Observed value of the test statistic,
degrees of freedom (if adequate) and p−value
Alternative hypothesis
(not equal, less, or ’greater’)
Confidence interval of the main estimate used for the test
Main estimate(s) used for the test with corresponding CI above
Figure 5.6: R output of an object of class htest.
assume that both random samples are independent. Under these assumptions we have
σ2 σ2 2σ 2 X ∼ N µx ,
,
Y ∼ N µy ,
=⇒ X − Y ∼ N µx − µy ,
n
n
n
X − Y − (µx − µy )
X − Y H0 :µx =µy
p
p
=⇒
∼ N (0, 1)
=⇒
∼
N (0, 1).
σ/ n/2
σ/ n/2
(5.5)
(5.6)
As often in practice, we do not know σ and we have to estimate it. One possible estimate is
so-called pooled estimate s2p = (s2x + s2y )/2, where s2x and s2x are the variance estimates of the
two samples. When using the estimator Sp2 in the righ-hand expression of (5.6), the distribution
. p
of (X − Y ) (Sp
n/2) is a t-distribution with 2n − 2 degrees of freedom. This result is not
surprising (up to the degrees of freedom) but somewhat difficult to show formally.
If the sample sizes are different, we need to adjust the pooled estimate and the form is slightly
more complicated (see Test 2). As the calculation s2p requires the estimates µx and µy , we adjust
the degrees of freedom to 2n − 2 or nx + ny − 2 in case of different sample sizes.
The following example revisits the pododermatitis data again and compares the scores between the two different barns.
5.3. TESTING MEANS AND VARIANCES IN GAUSSIAN SAMPLES
95
Test 2: Comparing means from two independent samples
Question: Are the means and of two samples significantly different?
Assumptions: Both samples are normally distributed with the same unknown variance. The samples are independent.
r
nx · ny
|x − y|
|x − y|
=
·
,
Calculation: tobs = p
s
n
sp
1/nx + 1/ny
p
x + ny
1
where s2p =
· (nx − 1)s2x + (ny − 1)s2y .
nx + ny − 2
Decision: Reject H0 : µx = µy if tobs > tcrit = tnx +ny −2,1−α/2 .
Calculation in R: t.test( x, y, var=TRUE, conf.level=1-alpha)
Example 5.9 (continuation of Example 5.2). We question if the pododermatitis scores of the
two barns are significantly different (means 3.83 and 3.67; standard deviations: 0.88 and 0.87;
sample sizes: 20 and 14). Hence, using the formulas given in Test 2, we have
H0 : µx = µy versus H1 : µx ̸= µy
1
s2p =
(19 · 0.8842 + 13 · 0.8682 ) = 0.878
20 + 14 − 2
r
|3.826 − 3.675| 20 · 14
√
tobs =
= 0.494
20 + 14
0.878
tcrit = t32,1−0.05/2 = 2.037
p-value: 0.625.
Hence, 3.826 and 3.675 are not statistically different. See also R-Code 5.4, were we use again the
function t.test() but with two data vectors and the argument var.equal=TRUE.
♣
2
In practice, we often have to assume that the variances of both samples
q are different, say σx
and σy2 . In such a setting, we have to normalize the mean difference by s2x /nx + s2y /ny . While
this estimate seems simpler than the pooled estimate sp , the degrees of freedom of the resulting
t-distribution is difficult to derive, and we refrain to elaborate it here (Problem 5.1.d gives some
insight). In the literature, this test is called Welch’s two sample t-test and is actually the default
choice of t.test( x, y).
5.3.3
Comparing Sample Means from two Paired Samples
In many situations we have paired measurements at two different time points, before and after
a treatment or intervention, from twins, etc. Analyzing the different time points should be done
on an individual level and not on the difference of the sample means of the paired samples.
The assumption of independence of both samples in the previous Test 2 may not be valid if
the two samples consist of two measurements of the same individual, e.g., observations over two
different instances of time. In such settings, were we have a “before” and “after” measurement, it
96
CHAPTER 5. STATISTICAL TESTING
R-Code 5.4 Two-sample t-test with independent samples, pododermatitis (see Example 5.9 and Test 2).
t.test( PDHmeanB1, PDHmeanB2, var.equal=TRUE)
##
## Two Sample t-test
##
## data: PDHmeanB1 and PDHmeanB2
## t = 0.495, df = 32, p-value = 0.62
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4717 0.7742
## sample estimates:
## mean of x mean of y
##
3.8262
3.6750
would be better to take this pairing into account, by considering differences only instead of two
samples. Hence, instead of constructing a test statistic based on X − Y we consider
iid
X1 − Y1 , . . . , Xn − Yn ∼ N (µx − µy , σd2 )
=⇒
=⇒
σ2 X − Y ∼ N µx − µ y , d
n
X − Y H0 :µx =µy
√
∼
N (0, 1).
σd / n
(5.7)
(5.8)
where σd2 is essentially the sum of the variances minus the “dependence” between Xi and Yi . We
formalize this dependence, called covariance, starting in Chapter 8.
The paired two-sample t-test can thus be considered a one sample t-test of the differences
with mean µ0 = 0.
Example 5.10. We consider the pododermatitis measurements from July 2016 and June 2017
and test if there is a progression over time. We have the following summaries for the differences
(see R-Code 5.5 and Test 3). Mean d¯ = 0.21; standard deviation sd = 1.26; and sample size n =
17.
H0 : d = 0 versus H1 : d ̸= 0; or equivalently H0 : µx = µy versus H1 : µx ̸= µy ;
|0.210|
√ = 0.687;
tobs =
1.262/ 17
tcrit = t16;0.05 = 2.12
p-value: 0.502.
There is no evidence that there is a progression over time.
5.3.4
♣
Comparing Sample Variances of two Samples
Just as means can be compared, there are also tests to compare variances of two samples,
where, instead of taking differences, we take the ratio of the estimated variances. If the two
estimates are similar, the ratio should be close to one. The test statistic is accordingly Sx2 Sy2
5.3. TESTING MEANS AND VARIANCES IN GAUSSIAN SAMPLES
97
Test 3: Comparing means from two paired samples
Question: Are the means x and y of two paired samples significantly different?
Assumptions: The samples are paired. The differences are normally distributed
with unknown mean δ. The variance is unknown.
Calculation: tobs =
|d |
√ , where
sd / n
• di = xi − yi is the i-th observed difference,
• d and sd are the mean and the standard deviation of the differences di .
Decision: Reject H0 : δ = 0 if tobs > tcrit = tn−1,1−α/2 .
Calculation in R: t.test(x, y, paired=TRUE, conf.level=1-alpha) or
t.test(x-y, conf.level=1-alpha)
R-Code 5.5 Two-sample t-test with paired samples, pododermatitis (see Example 5.10
and Test 3).
podoV1V13 <- podo[podo$Visit %in% c(1,13),] # select visits from 2016 and 2017
PDHmean2 <- matrix(podoV1V13$PDHmean[order(podoV1V13$ID)], ncol=2, byrow=TRUE)
t.test( PDHmean2[,2], PDHmean2[,1], paired=TRUE)
##
## Paired t-test
##
## data: PDHmean2[, 2] and PDHmean2[, 1]
## t = 0.687, df = 16, p-value = 0.5
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.43870 0.85929
## sample estimates:
## mean difference
##
0.21029
# Same result as with
# t.test( PDHmean2[,2] - PDHmean2[,1])
and the distribution thereof is an F -distribution (see also Section 3.2.3 with quantile, density,
and distribution functions implemented in R with [q,d,p]f). This “classic” F -test is given in
Test 4.
In latter chapters we will see more natural settings, where we need to compare variances (not
98
CHAPTER 5. STATISTICAL TESTING
necessary from a priori two different samples).
Test 4: Comparing two variances
Question: Are the variances s2x and s2y of two samples significantly different?
Assumptions: Both samples are normally distributed and independent.
Calculation: Fobs =
s2x
s2y
Decision: Reject H0 : σx2 = σy2 if Fobs > Fcrit , where Fcrit is the 1 − α/2 quantile of
an F -distribution with nx − 1 and ny − 1 degrees of freedom or if Fobs < Fcrit ,
where Fcrit is the α/2 quantile of an F -distribution with nx − 1 and ny − 1
degrees of freedom.
Calculation in R: var.test( x, y, conf.level=1-alpha)
Example 5.11 (continuation of Example 5.2). As shown by R-Code 5.6 the pododermatitis
mean scores of the two barns do not have any evidence against the null hypothesis of having
equal variances.
♣
R-Code 5.6 Comparison of two variances, PDH (see Example 5.11 and Test 4).
var.test( PDHmeanB1, PDHmeanB2)
##
## F test to compare two variances
##
## data: PDHmeanB1 and PDHmeanB2
## F = 1.04, num df = 19, denom df = 13, p-value = 0.97
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.35027 2.78376
## sample estimates:
## ratio of variances
##
1.0384
It is important to note that for the two-sample t-test we should not test first if the variances
are equal and then deciding on whether to use the classical or Welch’s two-sample t-test. With
such a sequential approach, we cannot maintain the nominal significance level (we use the same
data for several tests, see also Section 5.5.2). It should be rather the experimental setup that
should argue if conceptually the variances should be equivalent (see also Chapter 12).
5.4. DUALITY OF TESTS AND CONFIDENCE INTERVALS
99
Remark 5.2. When flipping the samples in Test 4, the value of the observed test statistic and
the associated confidence interval of σy2 /σx2 changes. However, the p-value of the test remains
the same, see the output of var.test( PDHmeanB2, PDHmeanB1). This is because if W ∼ Fn,m
then 1/W ∼ Fm,n .
♣
5.4
Duality of Tests and Confidence Intervals
There is a close connection between a significance test for a particular parameter and a confidence
interval for the same parameter. Rejecting H0 : θ = θ0 with significance level α is equivalent to
θ0 not being in the (1 − α) confidence interval of θ.
To illustrate, we consider the one sample t-test where we compare the mean with a theoretical
value µ0 . As shown in Test 1, H0 : µ = µ0 is rejected if
|x − µ0 |
√ > tcrit = tn−1,1−α/2 .
s/ n
(5.9)
s
|x − µ0 |
√ ≤ tcrit = tn−1,1−α/2 ⇐⇒ |x − µ0 | ≤ tn−1,1−α/2 √ ,
s/ n
n
(5.10)
tobs =
Hence, H0 cannot be rejected if
tobs =
which also means that H0 is not rejected if
s
−tn−1,1−α/2 √ ≤ x − µ0
n
s
and x − µ0 ≤ tn−1,1−α/2 √ .
n
(5.11)
s
and µ0 ≤ x + tn−1,1−α/2 √ ,
n
(5.12)
We can again rewrite this as
s
x − tn−1,1−α/2 √ ≤ µ0
n
which correspond to the boundaries of the sample (1−α) confidence interval for µ0 . Analogously,
this duality can be established for the other tests described in this chapter.
Example 5.12 (reconsider the situation from Example 5.8). The 95%-confidence interval for
the mean µ is [ 3.54, 4.20 ]. Since the value of µ0 = 3.33 is not in this interval, the null hypothesis
H0 : µ = µ0 = 3.33 is rejected at level α = 5%.
♣
In R, most test functions return the corresponding confidence intervals (named element
$conf.int of the returned list) together with the value of the statistic ($statistic), p-value
($p.value) and other information. Some test functions may explicitly require setting additional
argument conf.int=TRUE.
5.5
Missuse of p-Values and Other Dangers
p-Values and their use have been criticized lately. By not carefully and properly performing
statistical tests it is possible to “obtain” p-values that are small (especially smaller than 0.05),
to “observe” a significant result. In this section we discuss and illustrate a few possible pitfalls
of statistical testing. Note that wrong statistical results are often due to insufficient statistical
knowledge and not due to deliberate manipulation of data or suchlike.
100
CHAPTER 5. STATISTICAL TESTING
5.5.1
Interpretation of p-Values
The definition and interpretation of p-values are not as easy as it seems and quite often lead
to confusion or a misinterpretation. Some scientific journals went as far to ban articles with
p-values altogether. In the last few years many articles have emerged discussing what p-values
are and what they are not, often jointly with the interpretation of confidence intervals. Here, we
cite verbatim from the ASA’s Statement (Wasserstein and Lazar, 2016):
1. p-values can indicate how incompatible the data are with a specified statistical model.
2. p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether
a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance does measure the size of and effect or the importance
of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or
hypothesis.
See also Greenland et al. (2016).
5.5.2
Multiple Testing and p-Value Adjustments
In many cases, we want to perform not just one test, but a series of tests for the same data.
We must then be aware that the significance level α only holds for a single test. In the case
of a single test, the probability of a falsely significant test result equals the significance level,
usually α = 0.05. The probability that the null hypothesis H0 is correctly not rejected is then
1 − 0.05 = 0.95.
Consider the situation in which m > 1 tests are performed. The probability that at least one
false significant test result is obtained is then equal to one minus the probability that no false
significant test results are obtained. It holds that
P(at least 1 false significant results) = 1 − P(no false significant results)
= 1 − (1 − α)m .
(5.13)
(5.14)
Table 5.2 gives the probabilities of at least one false significant result for α = 0.05 and various
m. Even for just a few tests, the probability increases drastically.
Table 5.2: Probabilities of at least one false significant test result when performing m
tests at level α = 5% (top row) and at level αnew = α/m (bottom row).
m
1
2
3
4
5
6
8
10
20
100
1 − (1 − α)m
0.05
0.098
0.143
0.185
0.226
0.265
0.337
0.401
0.642
0.994
0.05
0.049
0.049
0.049
0.049
0.049
0.049
0.049
0.049
0.049
1 − (1 − αnew )m
5.6. BIBLIOGRAPHIC REMARKS
101
There are several different methods that allow multiple tests to be performed while maintaining the selected significance level. The simplest and most well-known of them is the Bonferroni
correction. Instead of comparing the p-value of every test to α, they are compared to a new
significance level, αnew = α/m, see second row of Table 5.2.
There are several alternative methods, which, according to the situation, may be more appropriate. We recommend to use at least method="holm" (default) in p.adjust. For more details
see, for example, Farcomeni (2008).
5.5.3
p-Hacking, HARKing and Publication Bias
There are other dangers with p-values. It is often very easy to tweak the data such that we
observe a significant p-value (declaring values as outliers, removing certain observations, use
secondary outcomes of the experiment). Such tweaking is often called p-hacking: manage the
data until we get a significant result.
Hypothesizing after the results are known (HARKing) is another inappropriate scientific
practice in which a post hoc hypothesis is presented as an a priori hypotheses. In a nutshell, we
collect the data of the experiment and adjust the hypothesis after we have analysed the data,
e.g., select effects small enough such that significant results have been observed.
Along similar lines, analyzing a dataset with many different methods will likely lead to
several significant p-values. In fact, even if in the case of the true underlying null hypothesis, on
average α · 100% of the tests are significant. Due to various inherent decisions often even more.
When searching for a good statistical analysis one often has to make many choices and thus
inherently selects the best one among many. This danger is often called the ‘garden of forking
paths’. Conceptually, adjusting the p-value for the many (not-performed) test would mitigate
the problem.
If a result is not significant, the study is often not published and is left in a ‘file-drawer’. A
seemingly significant result might be well due to Type I error but this is not evident as many
similar experiments lead to non-significant outcomes that are not published. Hence, the so-called
publication bias implies that there are more Type I error results than the nominal α level.
For many scientific domains, it is possible to preregister the study, i.e., to declare the study
experiment, analysis methods, etc. before the actual data has been collected. In other words,
everything is determined except the actual data collection and actual numbers of the statistical
analysis. The idea is that the scientific question is worthwhile investigating and reporting independent of the actual outcome. Such an approach reduces HARKing, garden-of-forking-paths
issue, publication bias and more.
5.6
Bibliographic Remarks
Virtually all introductory statistics books contain material about statistical testing. The classics
Lehmann and Casella (1998); Lehmann and Romano (2005) are quite advanced and mathematical.
102
5.7
CHAPTER 5. STATISTICAL TESTING
Exercises and Problems
Problem 5.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Derive the power of a one-sample t-test for H0 : µ = µ0 and H0 : µ = µ1 for sample size n.
b) Show that the value of the test statistic of the
two sample t-test with equal variances and
q
√
equal sample sizes simplifies to n x − ȳ
s2x + s2y .
c) Starting from results in (5.6), derive the test statistic of Test 2 for the case general case of
sample sizes nx and ny .
d) We give some background to Welch’s two sample t-test with test statistic X − Y SW
2 = S 2 /n + S 2 /n . The distribution thereof will not be exactly a t-distribution
with SW
x
y
x
y
because σx ̸= σy . And thus the denominator is not exactly a scaled chi-squared random
2 /σ 2 is approximately chi-squared with r
variable. However, it can be shown that rSW
W
2 /σ 2 ) = 2r.
degrees of freedom. Determine r such that Var(rSW
W
Problem 5.2 (t-Test) Use again the sickle-cell disease data introduced in Problem 4.2. For the
cases listed below, specify the null and alternative hypothesis. Then use R to perform the tests
and give a careful interpretation.
a) µHbSβ = 10 (α = 5%, two-sided)
b) µHbSβ = µHbSS (α = 1%, two-sided)
c) What changes, if one-sided tests are performed instead?
Problem 5.3 (t-Test) Anorexia is an eating disorder that is characterized by low weight, food
restriction, fear of gaining weight and a strong desire to be thin. The dataset anorexia in the
package MASS gives the weight of 29 females before and after a cognitive behavioral treatment
(in pounds). Test whether the treatment was effective.
Problem 5.4 (Testing the variance of one sample) In this problem we develop a “one-sample
iid
variance” test. Let Y1 , . . . , Yn ∼ N (µ, σ 2 ). We consider the estimator S 2 for the parameter σ 2 .
We test H0 : σ 2 = σ02 against H0 : σ 2 ≥ σ02 .
a) What is the distribution of (n − 1)S 2 /σ 2 ?
b) Find an expression for the p-value for the estimate s2 . For simplicity, we assume s2 > σ02 .
c) We now assume explicit values n = 17, s2 = 0.41 and σ02 = 0.25. What is the p-value?
d) Construct a one-sided sample confidence interval [0, bu ] for the parameter σ 2 in the general
setting and with the values from c).
103
5.7. EXERCISES AND PROBLEMS
Problem 5.5 (p-values under mis-specified assumptions) In this problem investigate the effect
of deviations of statistical assumptions on the p-value. For simplicity, we use the one sample
t-test.
iid
a) For 10000 times, sample X1 , . . . , Xn ∼ N (µ, σ 2 ) with µ = 0, σ = 1 and n = 10. For each
sample perform a t-test for H0 : µ = 0. Plot the p-values in a histogram. What do you
observe? For α = 0.05, what is the observed Type I error?
b) We repeat the experiment with a different distribution. Same questions as in a), but for
• X1 , . . . , Xn from a t-distribution with 4 degrees of freedom and H0 : µ = 0;
• X1 , . . . , Xn from a chi-square distribution with 10 degrees of freedom and H0 : µ = 10.
c) Will the observed Type I error be closer to the nominal level α when we increase n? Justify.
Problem 5.6 (Misinterpretation of overlapping confidence intervals when comparing groups)
Suppose we have measurements x1 , . . . , xnx and y1 , . . . , yny from two groups. There is a common
misinterpretation that if the confidence intervals of the corresponding means are overlapping
then the corresponding two-sample test will not show a statistically significant difference of the
two samples. In this problem we analyze the setting and learn to understand the reason of this
“apparent” violation of the duality of between tests and confidence intervals.
iid
For simplicity, assume that we have n = nx = ny and X1 , . . . , Xn ∼ N (µx , σ 2 ) and
iid
Y1 , . . . , Yn ∼ N (µy , σ 2 ), µx < µy .
a) We assume n = 25, σ 2 = 1 and α = 5%. Generate for µx = 0 a sample x1 , . . . , xn and for
µy = 0 a sample y1 , . . . , yn . Shift the sample y1 , . . . , yn such that both confidence intervals
“touch”. Based on the shifted sample, what is the p-value of the two sample t-test?
Based on trial-and-error, which shift would yield approximately a p-value of α = 5%?
Hint: use t.test(...)$conf to access the sample confidence intervals.
b) Calculate the difference between the lower bound of the sample confidence interval for µy
and upper bound of the sample.confidence interval for µx . Show that the two intervals
√ “touch“ each other when (x − y) (sx + sy )/ n = tn−1,1−α/2 .
c) Which test would be adequate to consider here? What is tobs and tcrit in this specific
setting?
d) Comparing b) and c), why do we not have this apparent duality between the confidence
intervals and the hypothesis test?
Problem 5.7 (BMJ Endgame) Discuss and justify the statements about ‘Independent samples
t test’ given in doi.org/10.1136/bmj.c2673.
104
CHAPTER 5. STATISTICAL TESTING
Chapter 6
Estimating and Testing Proportions
Learning goals for this chapter:
⋄ Identify if the situation involves proportion
⋄ Explain and apply estimation, confidence interval and hypothesis testing for
proportions
⋄ Compare different CI for a single proportion (Wald, Wilson)
⋄ Explain and apply different methods of comparing proportions (difference
between proportions, odds ratio, relative risk)
⋄ Explain the concept of Person’s chi-squared test
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter06.R.
One of the motivational examples in the last chapter was tossing a coin. In this chapter, we
generalize this specific idea and discuss the estimation and testing of a single proportion as well
as two or several proportions. The following example serves as a motivation.
The end-of-semester exam of the lecture ‘Analysis for the natural sciences’ at my university
consisted of two sightly different versions. The exam type is assigned according to the student
listing. It is of utmost importance, that the two versions are identical. For one particular year,
among the 589 students that participated, 291 received exam version A, the others version B.
There were 80 students failing version A versus 85 failing version B. Is there enough evidence
in the data to claim that the exams were not of equal difficulty and hence some students were
disadvantaged.
In this chapter we have a closer look at statistical techniques that help us to correctly answer
the above and similar questions. More precisely, we will estimate and compare proportions. To
simplify the exhibition, we discuss the estimation of one proportion followed by comparing two
proportions. The third section discusses statistical test for different cases.
105
106
6.1
CHAPTER 6. ESTIMATING AND TESTING PROPORTIONS
Estimation
We start with a simple setting where we observe occurrences of a certain event and are interested
in the proportion of the events over the total population. More specifically, we consider the
number of successes in a sequence of experiments, e.g., whether a certain treatment had an effect
or whether a certain test has been passed. We first discuss point estimation followed by the
construction a confidence intervals for a proportion.
6.1.1
Point Estimation for a Proportion
For our setting, we use a binomial random variable X ∼ Bin(n, p), where n is given or known
and p is unknown, the parameter of interest. Intuitively, we find x/n to be an estimate of p and
X/n the corresponding estimator.
More formally, with the method of moments we obtain the estimator pbMM = X/n, since
np = E(X) and we have only one observation (total number of cases). The estimator is thus
identical to the intuitive estimator.
The likelihood estimator is constructed as follows:
n x
L(p) =
p (1 − p)n−x
(6.1)
x
n
ℓ(p) = log L(p) = log
+ x log(p) + (n − x) log(1 − p)
(6.2)
x
dℓ(p)
x n−x
x
n−x
= −
=⇒
=
=⇒
x − xb
pML = nb
pML − xb
pML .
(6.3)
dp
p
1−p
pbML
1 − pbML
Thus, the maximum likelihood estimator is (again) pbML = X/n.
In our example we have the following estimates: pbA = 211/291 ≈ 72.51% for exam version A
and pbB = 213/298 ≈ 71.48% for version B. Once we have an estimate of an proportion, we can
now answer questions (for each version separately), such as:
1. How many cases of failure can be expected in a group of 100 students?
2. What is the probability that more than 100 failures occur in a cohort of 300 students?
(where we use results from Sections 2.2.1 and 3.3 only). We revisit questions like “Is the failure
rate significantly lower than 30%?” later in the chapter.
The estimator pb = X/n does not have a “classical” distribution (it is a “scaled binomial”).
Figure 6.1 illustrates the probability mass function based on the estimate of exam version A. The
figure visually suggest to use a Gaussian approximation. Formally, we approximate the binomial
distribution of X by a Gaussian distribution (see Section 3.3), which is well justified here as
np(1 − p) ≫ 9. The approximation for X is then used to state that the estimator pb is also
app
approximately Gaussian with adjusted parameters: pb ∼ N (x/n, p(1 − p)/n). In this chapter,
we will often use this approximation and thus we implicitly assume np(1 − p) > 9.
Figure 6.1 also indicates that shifting the Gaussian density slightly to the right, the approximation would improve. This shift is linked to the continuity correction and is performed in
practice. In our derivations we often omit the correction for clarity.
107
0.00 0.01 0.02 0.03 0.04 0.05
0.00 0.01 0.02 0.03 0.04 0.05
6.1. ESTIMATION
0.0
0.2
0.4
0.6
0.8
1.0
0.65
0.70
x
0.75
0.80
x
Figure 6.1: Probability mass function of pb = 211/291. The blue curve in the right
panel is the normal approximation. The actual estimate is indicated with a green tick
mark.
When dealing with proportions we often speak of odds, or simply of chance, defined by
ω = p/(1 − p). The corresponding intuitive estimator is ω
b = pb/(1 − pb) for an estimator pb.
Similarly, θb = log(b
ω ) = log (b
p/(1 − pb) is an intuitive estimator of log odds. If pb is an estimate,
then the quantities are the corresponding estimates. As a side note, these estimators also coincide
with the maximum likelihood estimators.
6.1.2
Confidence Intervals for a Proportion
To construct a confidence intervals for the parameter p we use the Gaussian approximation for
the binomial distribution and thus we have
X − np
≤ z1−α/2 .
(6.4)
1 − α ≈ P zα/2 ≤ p
np(1 − p)
This can be rewritten as
p
p
1 − α ≈ P zα/2 np(1 − p) ≤ X − np ≤ z1−α/2 np(1 − p)
X
1p
1p
X
= P − + zα/2
np(1 − p) ≤ −p ≤ − + z1−α/2
np(1 − p) .
n
n
n
n
(6.5)
(6.6)
As a further approximation we replace p with pb in the argument of the square root to obtain
X
1p
X
1p
1 − α ≈ P − + zα/2
nb
p(1 − pb) ≤ −p ≤ − + z1−α/2
nb
p(1 − pb) .
n
n
n
n
(6.7)
Since pb = x/n (as estimate) and q := z1−α/2 = −zα/2 , we have as sample confidence interval
r
bl,u = pb ± q ·
pb(1 − pb)
= pb ± q · SE(b
p) .
n
(6.8)
Remark 6.1. For (regular) models with parameter θ, as n → ∞, likelihood theory states that,
the estimator θbML is normally distributed with expected value θ and variance Var(θbML ).
p
Since Var(X/n) = p(1 − p)/n, one can assume that SE(b
p) = pb(1 − pb)/n. The so-called
Wald confidence interval rests upon this assumption (which can be shown more formally) and is
108
CHAPTER 6. ESTIMATING AND TESTING PROPORTIONS
identical to (6.8).
♣
If the inequality in (6.4) is solved through a quadratic equation, we obtain the sample Wilson
confidence interval
!
r
1
q2
pb(1 − pb)
q2
bl,u =
· pb +
±q·
+ 2 ,
(6.9)
1 + q 2 /n
2n
n
4n
where we use the estimates pb = x/n (see Problem 6.1.a).
CI 3 summarizes both the Wald and Wilson confidence interval. The latter can be constructed
in R with prop.test( x, n, correct=FALSE)$conf.int. An exact (1 − α) confidence interval
for a proportion based on quantiles of the binomial distriubtion is computed with binom.test(
x, n)$conf.int.
CI 3: Confidence intervals for proportions
An approximate (1 − α) Wald confidence interval for a proportion is
r
pb(1 − pb)
Bl,u = pb ± q ·
n
with estimator pb = X/n and quantile q = z1−α/2 .
An approximate (1 − α) Wilson confidence interval for a proportion is
!
r
q2
1
pb(1 − pb)
q2
Bl,u =
· pb +
±q·
+ 2 .
1 + q 2 /n
2n
n
4n
(6.10)
(6.11)
The Wilson confidence interval is “more complicated” than the Wald confidence interval. Is
it also “better” because of one fewer approximation during the derivation?
Ideally the coverage probability of a (1 − α) confidence interval should be 1 − α. Because of
the approximations this will not be the case and the coverage probability can be used to assess
confidence intervals. For a discrete random variable, the coverage is
P(p ∈ [ Bl , Bu ] ) =
n
X
P(X = x)I{p∈[ bl ,bu ] } .
(6.12)
x=0
(see Problem 6.1.b). R-Code 6.1 calculates the coverage of the 95% confidence intervals for
X ∼ Bin(n = 40, p = 0.4). For the particular setting, the Wilson confidence interval does not
seem to have a better coverage (96% compared to 94%).
R-Code 6.1: Coverage of 95% confidence intervals for X ∼ Bin(n = 40, p = 0.4).
6.1. ESTIMATION
109
p <- .4 ;
n <- 40
# defining the binomial random variable
x <- 0:n
# sequence of all possible values
WaldCI <- function(x, n){
# CI formula based on eq (6.8)
mid <- x/n
se <- sqrt(x*(n-x)/n^3)
cbind( pmax(0, mid - 1.96*se), pmin(1, mid + 1.96*se))
}
WaldCIs <- WaldCI(x,n)
# calculate CI for our values n and x
Waldind <- (WaldCIs[,1] <= p) & (WaldCIs[,2] >= p) # p in CI?
Waldcoverage <- sum( dbinom(x, n, p)*Waldind)
# eq (6.12)
WilsonCI <- function(x, n){
# CI formula based on eq (6.9)
mid <- (x + 1.96^2/2)/(n + 1.96^2)
se <- sqrt(n)/(n+1.96^2)*sqrt(x/n*(1-x/n)+1.96^2/(4*n))
cbind( pmax(0, mid - 1.96*se), pmin(1, mid + 1.96*se))
}
WilsonCIs <- WilsonCI(x,n)
# calculate CI for our values n and x
Wilsonind <- (WilsonCIs[,1] <= p) & (WilsonCIs[,2] >= p) # p in CI?
Wilsoncoverage <- sum( dbinom(x, n, p)*Wilsonind)
print( c(true=0.95, Wald=Waldcoverage, Wilson=Wilsoncoverage))
##
true
Wald Wilson
## 0.95000 0.94587 0.96552
Figure 6.2 illustrates the coverage for the confidence intervals for different value of p. Overall
the Wilson confidence interval now dominates the Wald one. The Wilson confidence interval
has better nominal coverage at the center. This observation also holds when n is varied, as seen
in Figure 6.3 which shows the coverage for different values of n and p. The Wilson confidence
interval has a slight tendency of too large coverage (more blueish areas) but is overall much better
than the Wald one. Note that the top “row” of the left and right part of the panel corresponds
to the left and right part of Figure 6.2.
R-Code 6.2 Tips for construction of Figure 6.2.
p <- seq(0, 0.5, .001)
# we exploit the symmetry
# either a loop over all elements in p or a few `apply()`'s
# over the functions 'Wilsonind' and 'Wilsoncoverage'
# `Waldcoverage` and `Wilsoncoverage` are thus vectors!
plot( p, Waldcoverage, type='l', ylim=c(.8,1))
Waldsmooth <- loess( Waldcoverage ~ p, span=.12)
# "guide the eye" curve
lines( Waldsmooth$x, Waldsmooth$fitted, col=3, lw=2) # add to plot.
lines( c(-1, 2), c(.95, .95), col=2, lty=2)
# nominal level
CHAPTER 6. ESTIMATING AND TESTING PROPORTIONS
0.80 0.85 0.90 0.95 1.00
Coverage
110
Wald CI
0.0
0.2
Wilson CI
0.4
0.6
0.8
1.0
p
Figure 6.2: Coverage of the 95% confidence intervals for X ∼ Bin(n = 40, p) for the
Wald CI (left) and Wilson CI (right). The red dashed line is the nominal level 1 − α
and in green we have a smoothed curve to “guide the eye”. (See R-Code 6.2.)
Wilson CI
40
Wald CI
20
10
n
30
1.00
0.95
0.90
0.85
0.80
0.75
0.70
0.0
0.2
0.4
0.6
0.8
1.0
p
Figure 6.3: Coverage of the 95% confidence intervals for X ∼ Bin(n, p) as functions
of p and n. The probabilities are symmetric around p = 1/2. All values smaller than
0.7 are represented with dark red. Left is for the Wald CI, the right is for the Wilson
CI.
The width of a sample confidence interval is bu − bl . For the Wald confidence interval we
obtain
r
r
pb(1 − pb)
x(n − x)
= 2q
(6.13)
2q ·
n
n3
and for the Wilson confidence interval we have
r
r
2q
pb(1 − pb)
q2
2q
x(n − x)
q2
+ 2 =
+ 2.
2
2
3
1 + q /n
n
4n
1 + q /n
n
4n
(6.14)
The widths vary with the observed value of X and are shown in Figure 6.4. For 5 < x < 36, the
Wilson confidence interval has a smaller width and a better nominal coverage (over small ranges
of p). For small and very large values x, the Wald confidence interval has a too small coverage
and thus wider intervals are desired.
111
0.15
0.00
Width
0.30
6.2. COMPARISON OF TWO PROPORTIONS
0
10
20
30
40
x
Figure 6.4: Widths of the sample 95% confidence intervals for X ∼ Bin(n = 40, p) as
a function of the observed value x (the Wald CI is in green, the Wilson CI in blue).
6.2
Comparison of two Proportions
We assume that we have two binomial random variables X1 ∼ Bin(n1 , p1 ) and X2 ∼ Bin(n2 , p2 )
that define two groups. The resulting observations are often presented in a 2 × 2 contingency
table as shown in Table 6.1. Sometimes, ni are also denoted by ri (row totals). The goal of this
section is to introduce formal approaches for a comparison of two proportions p1 and p2 . This
can be accomplished
using (i) a difference p1 − p2 , (ii) a quotient p1 /p2 , or (iii) an odds ratio
.
p1 /(1 − p1 ) (p2 /(1 − p2 )), which we consider in the following three sections. The approaches are
illustrated based on the following example.
Table 6.1: Example of a two-dimensional contingency table displaying frequencies.
The first index refers to the group category and the second to the result category.
Group
A
B
Total
Result
positive
negative
h11
h12
h21
h22
c1
c2
Total
n1
n2
n
Example 6.1. Pre-eclampsia is a hypertensive disorder occurring during pregnancy (gestational
hypertension) with symptoms including: edema, high blood pressure, and proteinuria. In a
double-blinded randomized controlled trial (RCT) 2706 pregnant women were treated with either
a diuretic or with a placebo (Landesman et al., 1965). Pre-eclampsia was diagnosed in 138 of
the 1370 subjects in the treatment group and in 175 of the subjects receiving placebo. The
medical question is whether diuretic medications, which reduce water retension, reduce the risk
of pre-eclampsia.
The two treatments (control/placebo and diuretic medication) define the risk factor, which
is the group category and the diagnosis is the result category in Table 6.1. R-Code 6.3 presents
the 2 × 2 table.
♣
112
CHAPTER 6. ESTIMATING AND TESTING PROPORTIONS
R-Code 6.3 Contingency table for the pre-eclampsia data of Example 6.1.
xD <- 138;
xC <- 175
# positive diagnosed counts
nD <- 1370;
nC <- 1336
# totals in both groups
tab <- rbind( Diuretic=c(xD, nD-xD), Control=c(xC, nC-xC))
colnames( tab) <- c( 'pos', 'neg')
tab
##
pos neg
## Diuretic 138 1232
## Control 175 1161
6.2.1
Difference Between Proportions
The risk difference RD describes the (absolute) difference in the probability of experiencing the
event in question.
Using the notation introduce above, the difference h11 (h11 + h12 ) − h21 (h21 + h22 ) =
h11 n1 − h21 n2 can be seen as a realization of X1 /n1 − X2 /n2 , which is approximately normally
distributed
p1 (1 − p1 ) p2 (1 − p2 ) +
N p1 − p2 ,
n1
n2
(6.15)
(based on the normal approximation of the binomial distribution). Hence, a corresponding
confidence interval can be derived.
6.2.2
Relative Risk
The relative risk estimates the size of the effect of a risk factor compared with the size of the
effect when the risk factor is not present:
RR =
P(Positive diagnosis with risk factor)
.
P(Positive diagnosis without risk factor)
(6.16)
The groups with or without the risk factor can also be considered the treatment and control
groups.
The relative risk is a positive values. A value of RR = 1 means that the risk is the same in
both groups and there is no evidence of a association between the diagnosis/disease/event and
the risk factor. A RR ≥ 1 is evidence of a possible positive association between a risk factor and
a diagnosis/disease. If the relative risk is less than one, the exposure has a protective effect, as
is the case, for example, for vaccinations.
An estimate of the relative risk is (see Table 6.1)
d = pb1 =
RR
pb2
h11
h11 + h12
h21
h21 + h22
=
h11 n2
.
h21 n1
(6.17)
113
6.2. COMPARISON OF TWO PROPORTIONS
d The standard error of θb is
To construct confidence intervals, we consider first θb = log(RR).
determined with the delta method and based on Equation (3.31), applied to a Binomial instead
of a Bernoulli random variable:
d = Var log( pb1 ) = Var log(b
Var( θb ) = Var log(RR)
p1 ) − log(b
p2 )
pb2
1 − pb1
1 − pb2
= Var log(b
p1 ) + Var((log(b
p2 ) ≈
+
n1 · pb1 n2 · pb2
h11
h22
1−
1−
h11 + h12
h21 + h22
≈
+
h11
h22
(h11 + h12 ) ·
(h21 + h22 ) ·
h11 + h12
h21 + h22
1
1
1
1
=
−
+
−
.
h11 h11 + h12 h21 h21 + h22
(6.18)
(6.19)
(6.20)
(6.21)
A back-transformation
h
b
exp θb ± z1−α/2 SE(θ)
i
(6.22)
implies positive confidence boundaries. Note that with the back-transformation we loose the
‘symmetry’ of estimate plus/minus standard error.
CI 4: Confidence interval for relative risk (RR)
An approximate (1 − α) confidence interval for RR, based on the two-dimensional
contingency table (Table 6.1), is
h
i
d ± z1−α/2 SE log(RR)
d
exp log(RR)
(6.23)
where for the sample confidence intervals
rwe use the estimates
1
h
(h
+
h
)
1
1
1
22
d = 11 21
d =
RR
, SE log(RR)
−
+
−
.
(h11 + h12 )h21
h11 h11 + h12 h21 h21 + h22
Example 6.2 (continuation of Example 6.1). The relative risk and corresponding confidence
interval for the pre-eclampsia data are given in R-Code 6.4. The relative risk is smaller than one
(diuretics reduce the risk). An approximate 95% confidence interval does not include the value
one.
♣
The relative risk cannot be applied to so-called case-control studies, where we match for each
subject in the risk group one or several subjects in the second, control group. This matching
implies that the risk in the control group is not representative but influenced by the first group.
114
CHAPTER 6. ESTIMATING AND TESTING PROPORTIONS
R-Code 6.4 Relative Risk with confidence interval.
print( RR <- ( tab[1,1]/ sum(tab[1,])) /
( tab[2,1]/ sum(tab[2,])) )
## [1] 0.769
s <- sqrt( 1/tab[1,1] + 1/tab[2,1] - 1/sum(tab[1,]) - 1/sum(tab[2,]) )
exp( log(RR) + qnorm(c(.025,.975))*s)
## [1] 0.62333 0.94872
6.2.3
Odds Ratio
The relative risk is closely related to the odds ratio, which is defined as
P(Positive diagnosis with risk factor)
P(A)
P(A)(1 − P(B))
P(Negative Diagnosis with risk factor)
1 − P(A)
OR =
=
=
P(Positive diagnosis without risk factor)
P(B)
P(B)(1 − P(A))
P(Negative diagnosis without risk factor)
1 − P(B)
(6.24)
h11
h12
h21
h22
(6.25)
with A and B the positive diagnosis with and without risk factors. The odds ratio indicates the
strength of an association between factors (association measure). The calculation of the odds
ratio also makes sense when the number of diseased is determined by study design, as is the case
for case-control studies. When a disease is rare (very low probability of disease), the odds ratio
and relative risk are approximately equal.
The odds ratio is one of the most used association measures in statistics. Additionally there
are several statistical models that are linked to odds ratio.
An estimate of the odds ratio is
d=
OR
=
h11 h22
.
h12 h21
The construction of confidence intervals for the odds ratio is based on Equation (3.30) and
Equation (3.31), analogous to that of the relative risk.
CI 5: Confidence interval for the odds ratio (OR)
An approximate (1 − α) confidence interval for OR, based on a two-dimensional
contingency table (Table 6.1), is
h
i
d ± z1−α/2 SE(log(OR))
d
exp log(OR)
(6.26)
where for the sample confidence intervals we use the estimates
r
h11 h22
1
1
1
1
d
d
OR =
and SE(log(OR)) =
+
+
+
.
h12 h21
h11 h21 h12 h22
6.3. STATISTICAL TESTS
115
R-Code 6.5 Odds ratio with confidence interval, approximate and exact.
print( OR <- tab[1]*tab[4]/(tab[2]*tab[3]))
## [1] 0.74313
s <- sqrt( sum( 1/tab) )
exp( log(OR) + qnorm(c(.025,.975))*s)
## [1] 0.58626 0.94196
# Exact test from Fisher:
fisher.test(tab)
##
## Fisher's Exact Test for Count Data
##
## data: tab
## p-value = 0.016
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.58166 0.94826
## sample estimates:
## odds ratio
##
0.74321
Example 6.3 (continuation of Example 6.1). The odds ratio with confidence interval for the preeclampsia data is given in R-Code 6.5. The 95% confidence interval is again similar as calculated
for the relative risks and does also not include one, strengthening the claim (i.e., significant
result).
Notice that the function fisher.test() also calculates the odds ratio. As it is based on a
likelihood calculation, there are very minor differences between both estimates.
♣
6.3
Statistical Tests
In this section we look at the statistical tests for proportions. We separate the discussion whether
the test involves a single, two or several proportions.
6.3.1
Tests Involving a Single Proportion
We start the discussion for the hypothesis test H0 : p = p0 versus H1 : p ̸= p0 . This case is
straightforward when relying on the duality of test and confidence intervals. That means, we
reject the null hypothesis at level α if p0 is not in the (1 − α) sample confidence interval [ bl , bu ].
The confidence interval can be obtained based on a Wald, Wilson or some other approach. If
the confidence interval is not exact, the test may not have exact size α.
In R, there is the possibility to use binom.test(n,p), prop.test(n,p), the latter with the
argument correct=TRUE (default) to include a continuity correction or correct=FALSE.
116
CHAPTER 6. ESTIMATING AND TESTING PROPORTIONS
Example 6.4. In one of his famous experiments Gregor Mendel crossed peas based on AA
and aa-type homozygotes. Under the Mendelian inheritance assumption, the third generation
should consist of AA and Aa genotypes with a ratio of 1:2. For one particular genotype, Mendel
reported the counts 8 and 22 (see, e.g., Table 1 of Ellis et al., 2019). We cannot reject the
hypothesis H0 : p = 1/3 based on the p-value 0.56. As shown in R-Code 6.6, both binom.test(8,
8+22, p=1/3) and prop.test(8, 8+22, p=1/3) yield the same p-value (up to two digits). The
corresponding confidence intervals are also very similar. Whereas when using the argument
correct=FALSE the outcome changes noticeably.
♣
R-Code 6.6 Testing a proportion.
rbind(binom=unlist( binom.test(8, 8+22, p=1/3)[c("p.value", "conf.int")]),
prop1=unlist( prop.test(8, 8+22, p=1/3)[c("p.value", "conf.int")]),
prop2=unlist( prop.test(8, 8+22, p=1/3, correct=FALSE)[c("p.value",
"conf.int")]))
##
p.value conf.int1 conf.int2
## binom 0.56215
0.12279
0.45889
## prop1 0.56128
0.12975
0.46173
## prop2 0.43858
0.14183
0.44448
6.3.2
Tests Involving two Proportions
We now consider the 2 × 2 framework as shown in Table 6.1. More specifically, we assume that
each row of the table shows the successes and failures of a binomial random variable. A test for
equality of the proportions tests H0 : p1 = p2 , where p1 and p2 are the two proportions.
One way to derive such a test is to start with the difference of proportions. Under the null
hypothesis, (6.15) simplifies to N 0, p(1−p)(1/n1 + 1/n2 ) and a test statistic can be derived. In
practice one often works with the squared difference of the proportions for which the test statistic
takes quite a simple form, as given in Test 5. Under the null hypothesis, the distribution thereof
is a chi-squared distribution (square of a normal random variable). The quantile, density, and
distribution functions are implemented in R with [q,d,p]chisq (see Section 3.2.1). This test is
also called Pearson’s χ2 test.
Example 6.5 (continuation of Example 6.1). The R-Code 6.7 shows the results for the preeclampsia data, once using a proportion test and once using a chi-squared test (comparing
expected and observed frequencies).
♣
Remark 6.2. We have presented the rows of Table 6.1 in terms of two binomials, i.e., with two
fixed marginals. In certain situations, such a table can be seen from a hypergeometric distribution point of view (see help( dhyper)), where three margins are fixed. For this latter view,
fisher.test is the test of choice.
♣
117
6.3. STATISTICAL TESTS
Test 5: Test of proportions
Question: Are the proportions in the two groups the same?
Assumptions: Both samples are independent from a binomial distribution with
equal probability.
Calculation: Using the same notation as in Table 6.1,
χ2obs =
(h11 h22 − h12 h21 )2 (h11 + h12 + h21 + h22 )
(h11 h22 − h12 h21 )2 n
.
=
(h11 + h12 )(h21 + h22 )(h12 + h22 )(h11 + h21 )
n1 n2 c1 c2
Decision: Reject if χ2obs > χ2crit = χ21,1−α .
Calculation in R: prop.test( tab) or chisq.test(tab) with default continuity
correction.
R-Code 6.7 Test of proportions
det(tab)^2*sum(tab) / prod(c(rowSums(tab), colSums(tab))) # as in Test 5
## [1] 6.0541
chisq.test( tab, correct=FALSE)
# same value of the test statistic
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 6.05, df = 1, p-value = 0.014
prop.test( tab)
# continuity correction by default
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: tab
## X-squared = 5.76, df = 1, p-value = 0.016
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.0551074 -0.0054088
## sample estimates:
## prop 1 prop 2
## 0.10073 0.13099
118
6.3.3
CHAPTER 6. ESTIMATING AND TESTING PROPORTIONS
Tests Involving a Several Proportions
We now extend the tests of the last two sections by first considering so-called multinomial setting
and then second by extending the 2 × 2-tables to general two-way tables. Both settings result in
a so-called chi-square test.
In the binomial setting we have two proportions p and 1 − p and a test H0 : p = p0 consists
of comparing x with p0 n or equivalently n − x with (1 − p0 )n. If there is a large discrepancy
between the observed and the theoretical counts, there is evidence against the null hypothesis.
Hence it is quite natural to extend the same idea to the setting of K proportions (p1 , . . . , pK )
PK
P
with observed counts (x1 , . . . , xK ). Note that K
i=1 pi = 1. Hence we have
i=1 xi = n and
only K − 1 free parameters and in the binomial setting, K = 2.
There is a natural extension of the test of proportions in which we compare “arbitrary” many
proportions. The so-called chi-square test (X 2 test) compares if the observed data follow a
particular distribution by comparing the frequencies of binned observations with the expected
frequencies.
Under the null hypothesis, the chi-square test is X 2 distributed (Test 6). The test is based on
approximations and thus the categories should be aggregated so that all bins contain a reasonable
amount of counts, e.g., ei ≥ 5 (trivially, K − k > 1).
Test 6: Comparison of observations with expected frequencies
Question: Do the observed frequencies oi of a sample deviate significantly from
the expected frequencies ei of a certain distribution?
Assumptions: Sample with data from any scale.
Calculation: Calculate the observed values oi and the expected frequencies ei
(with the help of the expected distribution) and then compute
χ2obs =
K
X
(oi − ei )2
i=1
ei
where K is the number of categories.
Decision: Reject H0 : “no deviation between the observed and expected” if χ2obs >
χ2crit = χ2K−1−k,1−α , where k is the number of parameters estimated from the
data to calculate the expected counts.
Calculation in R: chisq.test( obs, p=expect/sum(obs)) or
chisq.test( obs, p=expect, rescale.p=TRUE)
Example 6.6. With few observations (10 to 50) it is often pointless to test for normality of
the data. Even for larger samples, a Q-Q plot is often more informative. For completeness,
6.3. STATISTICAL TESTS
119
we illustrate a simple goodness-of-fit test by comparing the pododermatitits data with expected
counts constructed from Gaussian density with matching mean and variance (R-Code 6.8). We
pool over both periods and barns (n = 34) (there is no significant difference in the means and
the variances).
The binning of the data is done through a histogram-type binning (an alternative way would
be table( cut( podo$PDHmean))). As we have less than five observations in several bins, the
function chisq.test() issues a warning. This effect could be mitigated if we calculate the pvalue using a bootstrap simulation by setting the argument simulate.p.value=TRUE. Pooling
the bins, say breaks=c(1.5,2.5,3.5,4,4.5,5) would be an alternative as well.
The degrees of freedom are K − 1 − k = 7 − 1 − 2 = 4, as we estimate the mean and standard
deviation to determine the expected counts.
♣
R-Code 6.8 Testing normality, pododermatitis (see Example 6.6 and Test 6).
observed <- hist( podo$PDHmean, plot=FALSE, breaks=5)
# without 'breaks' argument there are too many categories
observed[1:2]
## $breaks
## [1] 1 2 3 4 5 6
##
## $counts
## [1] 3 9 27 27
1
m <- mean( podo$PDHmean)
s <- sd( podo$PDHmean)
p <- pnorm( observed$breaks, mean=m, sd=s)
chisq.test( observed$counts, p=diff( p), rescale.p=TRUE)
## Warning in chisq.test(observed$counts, p = diff(p), rescale.p = TRUE):
Chi-squared approximation may be incorrect
##
## Chi-squared test for given probabilities
##
## data: observed$counts
## X-squared = 8.71, df = 4, p-value = 0.069
General distribution tests (also goodness-of-fit tests) differ from the other tests discussed here,
in the sense that they do not test or compare a single parameter or a vector of proportions. Such
tests run under the names of Kolmogorov-Smirnov Tests (ks.test()), Shapiro–Wilk Normality
test (shapiro.test()), Anderson–Darling test (goftest::ad.test()), etc.
Example 6.7. The dataset haireye gives the hair and eye colors of 592 persons collected
by students (Snee, 1974). The cross-tabulation data helpful to get information about individual combinations or comparing two combinations. The overall picture is best assessed with
120
CHAPTER 6. ESTIMATING AND TESTING PROPORTIONS
BROWN
RED
hazelgreen
brown
blue
BLACK BLOND
Figure 6.5: Mosaic plot of hair and eye colors. (See R-Code 6.9.)
a mosaic-plot, which indicates that there is a larger proportion of persons with blond hair
having blue eyes compared to other hair colors, which is well known. When examining the
individual terms (oi − ei )2 /ei of the χ2 statistic of Test 6, we observe three very large values (BLONDE-blue, BLONDE-brown, BLACK-brown), and thus the very low p-value is no surprise.
These residual terms do not indicate if there is an excess or lack of observed pairs. Restricting to persons with brown and red hair only, the eye color seems independent (p-value 0.15,
chisq.test(HAIReye[c("BROWN","RED"),])$p.value).
R-Code 6.9 Pearson’s Chi-squared test for hair and eye colors data. (See Figure 6.5.)
HAIReye <- read.csv("data/HAIReye.csv") # hair color in upper case
mosaicplot(HAIReye, color=c(4,"brown",3,"orange"), main="")
chisq.test(HAIReye)
# Pearson's Chi-squared test for contingency table
##
## Pearson's Chi-squared test
##
## data: HAIReye
## X-squared = 138, df = 9, p-value <2e-16
n <- sum(HAIReye)
# 592 persons
rs <- rowSums(HAIReye) # row totals
cs <- colSums(HAIReye) # column totals
(outer(rs,cs)/n-HAIReye)^2/(outer(rs,cs)/n) # (o-e)^2/e, sum thereof is 138.3
##
blue
brown
green
hazel
## BLACK 9.4211 19.3459095 3.81688 0.22786
## BLOND 49.6967 34.2341707 0.37540 4.96329
## BROWN 3.8005 1.5214189 0.11909 1.83138
## RED
2.9933 0.0056217 5.21089 0.72633
## Re-run with brown and red colored persons
# (HAIReye <- read.csv("data/HAIReye.csv")[c("BROWN","RED"),])
6.4. BIBLIOGRAPHIC REMARKS
6.4
121
Bibliographic Remarks
Agresti (2007) or the more technical and advanced version Agresti (2002) are ultimate references
for the analysis of categorical data. Brown et al. (2002) is a reference for CI for the binomial
model.
See also the package binom and the references therein for further confidence intervals for
proportions.
6.5
Exercises and Problems
Problem 6.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Derive the Wilson confidence interval (6.11).
b) Derive formula (6.12) to calculate the coverage probablity for X ∼ Bin(n, p).
c) Derive the test statistic of Test 5 (without continuity correction).
d) Show that the test statistic of Test 5 is a particular case of Test 6 (without continuity
correction).
d .
e) Derive standard error of the odds ratio SE log(OR)
Problem 6.2 (Binomial distribution) Suppose that among n = 95 Swiss males, eight are redgreen colour blind. We are interested in estimating the proportion p of people suffering from
such disease among the male population.
a) Is a binomial model suitable for this problem?
b) Calculate the maximum likelihood estimate (ML) p̂ML and the ML of the odds ω̂.
c) Using the central limit theorem (CLT), it can be shown that pb follows approximately
N p, n1 p(1 − p) . Compare the binomial distribution to the normal approximation for
different n and p. To do so, plot the exact cumulative distribution function (CDF) and
compare it with the CDF obtained from the CLT. For which values of n and p is the approximation reasonable? Is the approximation reasonable for the red-green colour blindness
data?
d) Use the R functions binom.test() and prop.test() to compute two-sided 95%-confidence
intervals for the exact and for the approximate proportion. Compare the results.
e) What is the meaning of the p-value?
f) Compute the Wilson 95%-confidence interval and compare it to the confidence intervals
from (d).
122
CHAPTER 6. ESTIMATING AND TESTING PROPORTIONS
Cleared
Not cleared
Treatment A
9
18
Treatment B
5
22
Problem 6.3 (A simple clinical trial) A clinical trial is performed to compare two treatments,
A and B, that are intended to treat a skin disease named psoriasis. The outcome shown in the
following table is whether the patient’s skin cleared within 16 weeks of the start of treatment.
Use α = 0.05 throughout this problem.
a) Compute for each of the two treatments a Wald type and a Wilson confidence interval for
the proportion of patients whose skin cleared.
b) Test whether the risk difference is significantly different to zero (i.e., RD = 0). Use both
an exact and an approximated approach.
c) Compute CIs for both, relative risk (RR) and odds ratio (OR).
d) How would the point estimate of the odds ratio change if we considered the proportions of
patients whose skin did not clear?
Problem 6.4 (BMJ Endgame) Discuss and justify the statements about ‘Relative risks versus
odds ratios’ given in doi.org/10.1136/bmj.g1407.
Chapter 7
Rank-Based Methods
Learning goals for this chapter:
⋄ Explain robust estimators and their properties
⋄ Explain the motivation and the concept of ranks
⋄ Describe, apply and interpret rank based tests as counter pieces to the classical t-test
⋄ Describe the general approach to compare the means of several samples, apply
and interpret the rank based tests
⋄ Explain the idea of a permutation test
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter07.R.
As seen in previous chapters “classic” estimates of the expectation and the variance are
x=
n
n
i=1
i=1
1X
1 X
xi , and s2 =
(xi − x)2 .
n
n−1
(7.1)
If, hypothetically, we set one (arbitrary) value xi to an infinitely large value (i.e., we create an
extreme outlier), these estimates “explode”. A single value may exert enough influence on the
estimate such that the estimate is not representative of the bulk of the data anymore. In a
similar fashion, outliers may not only influence estimates drastically but also the value of test
statistics and thus render the result of the test questionable.
Until now we have often assumed that we have a realization of a Gaussian random sample.
We have argued that the t-test family is exact for Gaussian data but remains usable for moderate
deviations thereof since we use the central limit theorem in the test statistic. If we have very small
sample sizes or if the deviation is substantial, the result of the test may be again questionable.
In this chapter, we discuss basic approaches to estimation and testing for cases that include
the presence of outliers and deviations from Gaussianity.
123
124
7.1
CHAPTER 7. RANK-BASED METHODS
Robust Point Estimates
A robust estimators is an estimator of a parameter that is not sensitive to one or possibly several
outliers. Sensitive here should be understood in the sense that if we replace one or possibly
several values of the sample with arbitrary values, the corresponding estimate does not or only
marginally change.
As argued above, the mean is therefore not a robust estimate of location. However, the
trimmed mean (i.e., the biggest and smallest values are trimmed away and not considered) is
a robust estimate of location. The sample median (the middle value if the sample size is odd
or the center of the two middle-most values otherwise, see (1.2)) is another robust estimate of
location.
Robust estimates of the spread are (i) the sample interquartile range (IQR), calculated as the
difference between the third and first quartiles, and (ii) the sample median absolute deviation
(MAD), calculated as
MAD = c · med |xi − med{x1 , . . . , xn }| ,
(7.2)
where most software programs (including R) use c = 1.4826. The choice of c is such, that for
Gaussian random variables we have an unbiased estimator, i.e., E(MAD) = σ for MAD seen as an
estimator. Since for Gaussian random variables IQR= 2Φ−1 (3/4)σ, IQR/1.349 is an estimator
of σ; for IQR seen as an estimator.
Example 7.1. Let the values 1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 2.7 be given. Suppose that we have erroneously entered the final number as 27. R-Code 7.1 compares several statistics (for location and
scale) and illustrates the effect of this single outlier on the estimates.
♣
R-Code 7.1 Classic and robust estimates from Example 7.1.
sam <- c(1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 27)
print( c(mean(sam), mean(sam, trim=.2), median(sam)))
## [1] 5.6143 2.2400 2.1000
print( c(sd(sam), IQR(sam)/1.349, mad(sam)))
## [1] 9.45083 0.63010 0.44478
Remark 7.1. An intuitive approach to quantify the “robustness” of an estimator is the breakdown
point which quantifies the proportion of the sample that can be set to arbitrary large values before
the estimate takes an arbitrary value, i.e., before it breaks down. The mean has a breakdown
point of 0, (1 out of n values is sufficient to break the estimate), an α-trimmed mean of α
(α ∈ [0, 1/2]), the median of 1/2, the latter is the maximum possible value.
The IQR has a breakdown point of 1/4 and the MAD of 1/2. See also Problem 7.1.a.
♣
Robust estimators have two main drawbacks compared to classical estimators. The first
disadvantage is that robust estimators do not possess simple distribution functions and for this
7.1. ROBUST POINT ESTIMATES
125
reason the corresponding exact confidence intervals are not easy to calculate. More specifically,
for a robust estimator θb of the parameter θ we rarely have the exact quantiles ql and qu (which
depend on θ), to construct a confidence interval starting from
1 − α = P ql (θ) ≤ θb ≤ qu (θ)
(7.3)
If we could assume that the distribution of robust estimators are somewhat Gaussian (for
large samples) we could calculate approximate confidence intervals based on
s
d robust\
Var(
estimator)
robust\
estimator ± zα/2
,
(7.4)
n
b
b Note that we have
which is of course equivalent to θb ± zα/2 SE(θ)/n,
for a robust estimator θ.
deliberately put a hat on the variance term in (7.4) as the variance often needs to be estimated
as well (which is reflected in a precise definition of the standard error). For example, the R expression median( x)+c(-2,2)*mad( x)/sqrt(length( x)) yields an approximate sample 95%
confidence interval for the median.
The second disadvantage of robust estimators is their lower efficiency, i.e., these estimators
have larger variances compared to classical estimators. Formally, the efficiency is the ratio of
the variance of one estimator to the variance of the second estimator.
In some cases the exact variance of robust estimators can be determined, often approximations or asymptotic results exist. For example, for a continuous random variable with distribution
function F (x) and density function f (x), asymptotically, the median is also normally distributed
around the true median η = Q(1/2) = F −1 (1/2) with variance 1 (4nf (η)2 ). The following example illustrates this result and R-Code 7.2 compares the finite sample efficiency of two estimators
of location based on repeated sampling.
iid
Example 7.2. Let X1 , . . . , X10 ∼ N (0, σ 2 ). We simulate realizations of this random sample
and calculate the corresponding sample mean and sample median. We repeat R = 1000 times.
Figure 7.1 shows the histogram of these means and medians including a (smoothed) density of
the sample. The histogram and the density of the sample medians are wider and thus the mean
is more efficient.
For this particular example, the sample efficiency is roughly 72% for n = 10. As the density
is symmetric, η = µ = 0 and thus the asymptotic efficiency is
1 2
σ2
2
σ 2 /n
=
√
·
4n
= ≈ 64%.
2
n
π
1/ 4nf (0)
2πσ
(7.5)
Of course, if we change the distribution of X1 , . . . , X10 , the efficiency changes. For example let
us consider the case of a t-distribution with 4 degrees of freedom, a density with heavier tails
than the normal. Now the sample efficiency for sample size n = 10 is 1.26, which means that the
median is better compared to the mean.
♣
Robust estimation approaches have the advantage of not having to identify outliers and
eliminating these for the estimation process. The decision as to whether a realization of a
126
CHAPTER 7. RANK-BASED METHODS
R-Code 7.2 Distribution of sample mean and median, see Example 7.2. (See Figure 7.1.)
set.seed( 14)
# to reproduce the numbers!
n <- 10
# sample size
R <- 1000
# How often we repeat the sampling
samples <- matrix( rnorm(n*R), nrow=R, ncol=n) # each row one sample
means <- apply( samples, 1, mean)
# R means
medians <- apply( samples, 1, median)
# R medians
print( c( var( means), var( medians), var( means)/var( medians)))
## [1] 0.10991 0.15306 0.71809
hist( medians, border=7, col=7, prob=TRUE, main='', ylim=c(0, 1.2),
xlab='estimates')
hist( means, add=TRUE, prob=TRUE)
lines( density( medians), col=2)
lines( density( means), col=1)
# with a t_4 distribution the situation is different:
samples <- matrix( rt(n*R, df=4), nrow=R, ncol=n) # row a t-sample
means <- apply( samples, 1, mean)
medians <- apply( samples, 1, median)
print( c( var( means), var( medians), var( means)/var( medians)))
0.8
0.4
0.0
Density
1.2
## [1] 0.19268 0.15315 1.25815
−1.5
−1.0
−0.5
0.0
0.5
1.0
1.5
estimates
Figure 7.1: Comparing finite sample efficiency of the mean and median for a Gaussian
sample of size n = 10. Medians in yellow with red smoothed density of the sample,
means in black. (See R-Code 7.2.)
random sample contains outliers is not always easy and some care is needed. For example it is
not possible to declare a value as an outlier if it lays outside the whiskers of a boxplot. For all
distributions with values from R, observations will lay outside the whiskers when n is sufficiently
large (see Problem 7.1.b). Obvious outliers are easy to identify and eliminate, but in less clear
cases robust estimation methods are preferred.
Outliers can be very difficult to recognize in multivariate random samples, because they are
127
7.2. RANK-BASED TESTS AS ALTERNATIVES TO T-TESTS
not readily apparent with respect to the marginal distributions. Robust methods for random
vectors exist, but are often computationally intense and not as intuitive as for scalar values.
It has to be added that independent of the estimation procedures, if an EDA finds outliers,
these should be noted and scrutinized.
7.2
Rank-Based Tests as Alternatives to t-Tests
We now turn to tests. To start, recall that the tests to compare means in Chapter 5 assume
normally distributed data resulting in test statistics that are t-distributed. Slight deviations from
a normal distribution has typically negligible consequences as the central limit theorem reassures
that the mean is approximately normally distributed.
However, if outliers are present or the data are skewed or the data are measured on the
ordinal scale, the t-test family is not exact and, depending on the degree of departure from
the assumption, may not be useful. In such settings, the use of so-called ‘rank-based’ tests is
recommended. Compared to classical tests which typically assume a parametrized distribution
(e.g., µ, σ 2 in N (µ, σ 2 ) or p in Bin(n, p)), rank based tests do not prescribe a detailed or specific
distribution and are thus also called non-parametric tests.
iid
To illustrate the nature of a non-parametric test, consider paired data X1 , . . . , Xn ∼ FX ,
iid
Y1 , . . . , Yn ∼ FY , with continuous distributions FX and FY . We consider the null hypothesis H0
that P(Xi < Yi ) = P(Yi < Xi ) = 1/2 (because of the continuous distribution P(Xi = Yi ) = 0).
In other words, the null hypothesis is that the medians of both distributions are the same.
However, the test does not make any assumption on the distributions FX and FY .
The test compares the cases when xi < yi with yi < xi and we consider the sign of the
difference xi − yi . If there are too many or two few negative signs (xi < yi ) or equivalently too
many or two few positive signs (xi > yi ), the data is not supporting the null hypothesis. More
formally, under H0 , the sign of each pair xi − yi is distributed according a Bernoulli random
variable with probability p = 1/2.
The resulting statistical test is called the sign test and the procedure is illustrated in Test 7.
In practice, there might be the case that yi = xi and all such cases will be dropped from the
sample and the sample size n will be reduced accordingly to n⋆ . One-sided versions of Test 7
are straightforward to construct, by not taking the minimum in Step (3) and comparing to the
α-quantile or 1 − α-quantile of a Bin(n⋆ , 1/2) distribution, respectively.
The sign test is based on very weak assumptions and therefore has little power. We introduce
now the concept of “ranks”, as an extension of “signs”, and resulting rank tests. The rank of a
value in a set of values is the position (order) of that value in the ordered sequence (from smallest
to largest). In particular, the smallest value has rank 1 and the largest rank n. In the case of
ties, the arithmetic mean of the ranks is used.
Example 7.3. The values 1.1, −0.6, 0.3, 0.1, 0.6, 2.1 have ranks 5, 1, 3, 2, 4 and 6. However,
the ranks of the absolute values are 5, (3+4)/2, 2, 1, (3+4)/2 and 6.
♣
Rank-based tests consider the ranks of the observations or ranks of the differences, not the
observations or the differences between the observations and thus mitigate their effect on the test
128
CHAPTER 7. RANK-BASED METHODS
Test 7: Comparing the locations of two paired samples with the sign test
Question: Are the medians of two paired samples significantly different?
Assumptions: The samples are paired and from continuous distributions.
Calculation: (1) Calculate the differences di = xi − yi . Ignore all differences
di = 0 and consider only the n⋆ differences di ̸= 0.
(2) Categorize each difference di by its sign (+ or −) and count the number
P ⋆
I{di >0} . Sobs = min(S + , n⋆ − S + ).
of positive signs: S + = ni=1
Decision: Reject H0 : “the medians are the same”, if Sobs < Scrit , where Scrit is
the α/2-quantile of a Bin(n⋆ , 1/2) distribution.
Calculation in R:
binom.test( sum( d>0), sum( d!=0), conf.level=1-alpha)
statistic. For example, the largest value has always the same rank and therefore has the same
influence on the test statistic independent of its value.
Compared to classical t-tests as discussed in Chapter 5 and further tests that we will discuss
in Chapter 11, rank tests should be used if
• the underlying distributions are skewed.
• the underlying distributions have heavy or light tails compared to the Gaussian distribution.
• if the data has potentially “outliers”.
For smaller sample sizes, it is difficult to check model assumptions and it is recommended
to use rank tests in such situations. Rank tests have fewer assumptions on the underlying
distributions and have a fairly similar power (see Problem 7.6).
We now introduce two classical rank tests which are the Wilcoxon signed rank test and the
Wilcoxon–Mann–Whitney U test (also called the Mann–Whitney test), i.e., rank-based versions
of Test 2 and Test 3 respectively.
7.2.1
Wilcoxon Signed Rank Test
The Wilcoxon signed rank test is used to test an effect in paired samples, i.e., two matched
samples or two repeated measurements on the same subject). As for the sign test or the twosample paired t-test, we start by calculating the differences of the paired observations, followed
by calculating the ranks of the negative differences and of the positive differences. Note that we
omit pairs having zero differences and denote with n⋆ the possibly adjusted sample size. Under
the null-hypotheses, the paired samples are from the same distribution and thus the ranks of the
positive and negative differences should be comparable, not be too small or too large. If this
129
7.2. RANK-BASED TESTS AS ALTERNATIVES TO T-TESTS
is not the case, the data indicates evidence against the hypothesis. This approach is formally
presented in Test 8.
Test 8: Comparing the distribution of two paired samples
Question: Are the distributions of two paired samples significantly different?
Assumptions: Both samples are from continuous distributions of the same shape,
the samples are paired.
Calculation: (1) Calculate the differences di = xi − yi . Ignore all differences
di = 0 and consider only the remaining n⋆ differences di ̸= 0.
(2) Order the n⋆ differences di by their absolute differences |di |.
(3) Calculate the sum of the ranks of the positive differences:
P ⋆
I{di >0} .
W + = ni=1
W − = n⋆ (n2⋆ +1) − W + (sum of the ranks of the negative differences).
Wobs = min(W + , W − )
(W + + W − =
n⋆ (n⋆ + 1)
)
2
Decision: Reject H0 : “the distributions are the same” if Wobs < Wcrit (n⋆ ; α/2),
where Wcrit is the critical value.
Calculation in R: wilcox.test(x-y, conf.level=1-alpha) or
wilcox.test(x, y, paired=TRUE, conf.level=1-alpha)
The quantile, density and distribution functions of the Wilcoxon signed rank test statistic
are implemented in R with [q,d,p]signrank(). For example, the critical value Wcrit (n; α/2)
mentioned in Test 8 is qsignrank( .025, n) for α = 5% and the corresponding p-value is
2*psignrank( Wobs, n) with n = n⋆ .
It is possible to approximate the distribution of test statistic with a normal distribution. The
observed test statistic Wobs is z-transformed as follows:


Wobs − n⋆ (n⋆ + 1) 
4
zobs = r
,
n⋆ (n⋆ + 1)(2n⋆ + 1)
24
(7.6)
and zobs is then compared with the corresponding quantile of the standard normal distribution
(see Problem 7.1.c). This approximation may be used when the samples are sufficiently large,
which is, as a rule of thumb, n⋆ ≥ 20.
In case of ties, R may not be capable to calculate exact p-values and thus will issue a warning. The warning can be avoided by not requiring exact p-values through setting the argument
exact=FALSE. When setting the argument conf.int=TRUE in wilcox.test(), a nonparametric
130
CHAPTER 7. RANK-BASED METHODS
confidence interval is constructed. It is possible to specify the confidence level with conf.level,
default value is 95%. The numerical values of the confidence interval are accessed with the
element $conf.int, also illustrated in the next example.
Example 7.4 (continuation of Example 5.2). We consider again the podo data as introduced in
Example 5.2. R-Code 7.3 performs Wilcoxon signed rank test. As we do not have any outliers
the p-value is similar to the one obtained with a paired t-test in Example 5.10. There are no ties
and thus the argument exact=FALSE is not necessary.
The advantage of robust methods becomes clear when the first value from the second visit
3.75 is changed to 37.5, as shown towards the end of the same R-Code. While the p-value of the
signed rank test does virtually not change, the one from the paired two-sample t-test changes
from 0.5 (see R-Code 5.5) to 0.31. More importantly, the confidence intervals are now drastically
different as the outlier inflated the estimated standard deviation of the t-test. In other situations,
it is quite likely that with or without a particular “outlier” the p-value falls below the magical
threshold α (recall the discussion of Section 5.5.3).
Of course, a corrupt value, as introduced in this example, would be detected with a proper
EDA of the data (scales are within zero and ten).
♣
R-Code 7.3: Rank tests and comparison of a paired tests with a corrupted observation.
# Possibly relaod the 'podo.csv' and construct the variables as in Example 4.1
wilcox.test( PDHmean2[,2], PDHmean2[,1], paired=TRUE)
##
## Wilcoxon signed rank exact test
##
## data: PDHmean2[, 2] and PDHmean2[, 1]
## V = 88, p-value = 0.61
## alternative hypothesis: true location shift is not equal to 0
# wilcox.test( PDHmean2[,2]-PDHmean2[,1], exact=FALSE) # is equivalent
PDHmean2[1, 2] <- PDHmean2[1, 2]*10 # corrupted value, decimal point wrong
rbind(t.test=unlist( t.test( PDHmean2[,2], PDHmean2[,1], paired=TRUE)[3:4]),
wilcox=unlist( wilcox.test( PDHmean2[,2], PDHmean2[,1], paired=TRUE,
conf.int=TRUE)[c( "p.value", "conf.int")]))
##
p.value conf.int1 conf.int2
## t.test 0.31465
-2.2879
6.6791
## wilcox 0.61122
-0.5500
1.1375
For better understanding of the difference between a sign test and the Wilcoxon signed
rank test consider an arbitrary distribution F (x). The assumptions of the Wilcoxon signed
iid
iid
rank test are X1 , . . . , Xn ∼ F (x) and Y1 , . . . , Yn ∼ F (x − δ), where δ represents the shift.
Hence, under the null hypothesis, δ = 0 which further implies that for all i = 1, . . . , n, (1)
P(Xi > Yi ) = P(Xi < Yi ) = 1/2 and (2) the distribution of the difference Xi − Yi is symmetric.
The second point is not required by the sign test. Hence, the Wilcoxon signed rank test requires
more assumptions and has thus generally a higher power.
7.2. RANK-BASED TESTS AS ALTERNATIVES TO T-TESTS
131
For a symmetric distribution, we can thus use wilcox.test() with argument mu=mu0 to test
H0 : µ = µ0 , where µ is the median (and by symmetry also the mean). This setting is the rank
test equivalent of Test 1.
7.2.2
Wilcoxon–Mann–Whitney Test
The Wilcoxon–Mann–Whitney test is probably the most well-known rank test and represents the
two-sample version of the Wilcoxon signed rank test. To motivate this test, assume that we have
two samples from one common underlying distribution. In this case the observations and ranks
of both samples mingle nicely. If both samples have the same or quite similar sample sizes, the
two sums of the ranks are comparable. Alternatively, assume that the first sample has a much
smaller median (or mean) and the ranks of the first sample would be smaller than those of the
sample with the larger median (or mean).
More specifically, the Wilcoxon–Mann–Whitney test calculates the rank sums of the two
samples and corrects them based on the corresponding sample size. The smaller of the resulting
values is then compared with an appropriate quantile, see Test 9. Note that we do not require a
symmetric distribution, but rely on the less stringent assumption that the distribution of both
samples have the same shape, possibly shifted. Under the null hypothesis, the shift is zero and
the Wilcoxon–Mann–Whitney test can be interpreted as comparing the medians between the two
distributions.
Test 9: Comparing the locations of two independent samples
Question: Are the medians of two independent samples significantly different?
Assumptions: Both samples are from continuous distributions of the same shape,
the samples are independent and the data are at least ordinally scaled.
Calculation: Let nx ≤ ny , otherwise switch the samples. Assemble the (nx + ny )
sample values into a single set ordered by rank and calculate the sum Rx and
Ry of the sample ranks. Let
nx (nx + 1)
2
Uobs = min(Ux , Uy )
Ux = Rx −
Uy = Ry −
ny (ny + 1)
2
Note: Ux + Uy = nx ny .
Decision: Reject H0 : “medians are the same” if Uobs < Ucrit (nx , ny ; α/2), where
Ucrit is the critical value.
Calculation in R: wilcox.test(x, y, conf.level=1-alpha)
The quantile, density and distribution functions of the Wilcoxon–Mann–Whitney test statistic
are implemented in R with [q,d,p]wilcox(). For example, the critical value Ucrit (nx , ny ; α/2)
used in Test 9 is qwilcox( .025, nx, ny) for α = 5% and corresponding p-value 2*pwilcox(
Uobs, nx, ny).
132
CHAPTER 7. RANK-BASED METHODS
It is possible to approximate the distribution of the test statistic by a Gaussian one, provided
we have sufficiently large samples, e.g., nx , ny ≥ 10. The value of the test statistic Uobs is
z-transformed to


Uobs − nx ny 
2
zobs = r
,
(7.7)
nx ny (nx + ny + 1)
12
where nx ny /2 is the mean of all ranks and the denominator is the standard deviation of the
sum of the ranks under the null (see Problem 7.1.d). This value is then compared with the
respective quantile of the standard normal distribution. With additional continuity corrections,
the approximation may be improved.
To construct confidence intervals, the argument conf.int=TRUE must be used in the function
wilcox.test() and unless α = 5% a specification of conf.level is required. The numerical
values of the confidence interval are accessed with the list element $conf.int.
Example 7.5 (continuation of Example 5.2 and 7.4). We compare if there is a difference between the pododermatitis scores between both barns. Histogram of the densities support the
assumption that both samples come from one underlying distribution. There is no evidence of
difference (same conclusion as in Example 5.9).
Several scores appear multiple times and thus the ranks contain ties. To avoid a warning
message we set exact=FALSE.
♣
R-Code 7.4: Two-sample rank test.
# hist(PDHmeanB1, density=15) # quick check on shape of distribution
# hist(PDHmeanB2, add=T, border=2, col=2, density=15, angle=-45)
wilcox.test( PDHmeanB1, PDHmeanB2, exact=FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: PDHmeanB1 and PDHmeanB2
## W = 153, p-value = 0.66
## alternative hypothesis: true location shift is not equal to 0
Remark 7.2.
1. When constructing a boxplot in R, the argument notch=TRUE draws a
“notch” a the median. The idea is, that if the notches of two boxplots do not overlap
then the medians are approximately significantly different at the fixed 95% confidence
√
level. The notches have width of 3.15IQR/ n and are based on asymptotic normality of
the median and roughly equal sample sizes for the two medians being compared (McGill
et al., 1978).
2. The interpretation of the confidence interval resulting from argument conf.int=TRUE requires some care, see ?wilcox.test. With this approach, it is possible to set the confidence
level via the argument conf.level.
♣
133
7.3. COMPARING MORE THAN TWO GROUPS
7.3
Comparing More Than two Groups
Until now we have considered at most two groups. Imagine the situations where n subjects are
measured over several times points or where several treatments are applied to different subjects.
In such situations we should not perform blindly individual tests comparing all combinations of
time points or all combinations of treatments. Based on Section 5.5.2, if we were to administer
placebo to six different groups, detecting at least one significant result at level α = 5% is more
likely than not (1-0.95ˆchoose(6,2)).
Instead of correcting for multiple testing the following more powerful approach is used. In a
first step we test if there is any effect at all. If there is, in a second step, we investigate, which
of the groups shows an effect.
In more details, for the first step, we compare the locations of the different groups. In case of
an effect, these exhibit a lot of variability, compared to the variability with the individual groups.
If there is no effect, the variability between the locations of the groups is small compared to the
variability within the groups. Of course some care is needed when comparing the variablities
and we likely have to take into account the number of groups and group sizes. In general, such
an approach is called ANOVA which stands for analysis of variance. We will revisit ANOVA in
the Gaussian framework in Chapter 11.
Similar as in the setting of two groups, we also have to differentiate betweeen matched
or repeated measures and independent groups. We will discuss two classical rank tests now.
The latter two complete the layout shown in Figure 7.2, which summarizes the different nonparametric tests to compare locations (median with a theoretical value or medians of two or
several groups). Note that this layout is by no means complete.
Nonparametric tests
of location
no
no
1 group
2 groups
k>2 groups
yes
yes
yes
no
symmetric
distribution
Sign test
yes
Wilcoxon signed
rank test
∆
yes
no
paired data
Wilcoxon−Mann−
Whitney test
repeated
measures
yes
no
Kruskal−
Wallis test
Friedman test
Figure 7.2: Different non-parametric statistical tests of location. In the case of paired
data, the ∆ operation takes the difference of the two samples and reduces the problem
to a single sample setting.
134
7.3.1
CHAPTER 7. RANK-BASED METHODS
Friedman Test
Suppose we have I subjects that are tested on J different treatments. The resulting data matrix
represents for each subject the J measures. While the subjects are assumed independent, the
measures within each subject are not necessarily.
Such a simulation setup is also called one-way repeated measures analysis of variance, oneway because each subject receives at any point one single treatment and all other factors are
kept the same. We compare the variability between the groups with the variability within the
groups based on ranks.
The data is then often presented in a matrix form with I rows and J columns, resulting in a
total of n = IJ observations. In a first step we calculate the ranks within each subject. If all the
treatments are equal then there is no preference group for small or large ranks. That means, the
sum of the ranks within each group Rj are similar. As we have more than two groups we look
at the variability of the column ranks. There are different ways to present the test statistics and
the following is typically used to calculate the observed value
Fr obs =
J
X 12
Rj2 − 3I(J + 1),
IJ(J + 1)
(7.8)
j=1
where Rj is the sum of the ranks of the jth column. Under the null hypothesis, the value Fr obs
is small.
At first sight it is not evident that the test statistic represents a variability measure. See
Problem 7.1.e for a different and more intuitive form of the statistic. If there are ties then the
test statistic needs to be adjusted because ties affect the variance in the groups. The form of
the adjustment is not intuitive and consists essentially of a larger denominator compared to
IJ(J + 1). For very small values of J and I, critical values are tabulated in selected books (e.g.,
Siegel and Castellan Jr, 1988). For J > 5 (or for I > 5, 8, 13 when J = 5, 4, 3, respectively) an
approximation based on a chi-square distribution with J − 1 degrees of freedom is used. Test 10
summarizes the Friedman test and Example 7.6 uses the podo dataset as an illustration.
Test 10: Comparing the locations of repeated measures
Question: Are the medians of J matched samples significantly different?
Assumptions: All distributions are continuous of the same shape, possibly shifted,
the samples are independent.
Calculation: For each subject the ranks are calculated. For each sample, sum the
ranks to get Rj . Calculate Fr obs according to (7.8).
Decision: Reject H0 : “the medians are the same” if Fr obs > χ2crit = χ2J−1,1−α .
Calculation in R: friedman.test(y)
7.3. COMPARING MORE THAN TWO GROUPS
135
In case H0 is rejected, we have evidence that at least one of the groups has a different location
compared to at least one other. The test does not say which one differs or how many are different.
To check which of the group pairs differ, individual post-hoc tests can be performed. Typically,
for the pair (r, s) the following approximation is used
r
IJ(J + 1)
|Rr − Rs | ≥ zα/(J(J−1))
(7.9)
6
where Rs is as above. Note that we divide the level α of the quantile by J(J − 1) = J2 which
corresponds to the classical Bonferroni correction, Section 5.5.2.
Example 7.6 (continuation of Example 5.2). We consider all four visits of the dataset. Note that
one rabbit was not measured in the third visit (ID 15 in visit called 5) and thus the remaining
three measures have to be eliminated. The data can be arranged in a matrix form, here of
dimension 16 × 4. R-Code shows that there is no significant difference between the visits, even
if we would further stratify according to the barn.
♣
R-Code 7.5 Friedman test, pododermatitis (see Example 7.6 and Test 10)
table(podo$ID)
##
##
##
1
4
2
4
3
4
4
4
5
4
6
4
7
4
8
4
9 10 11 12 13 14 15 16 17
4 4 4 4 4 4 3 4 4
podoCompl <- podo[podo$ID !=15,] # different ways to eliminate
PDHmean4 <- matrix(podoCompl$PDHmean[order(podoCompl$ID)], ncol=4, byrow=TRUE)
colMeans(PDHmean4)
# means of the four visits
## [1] 3.7500 3.4484 3.8172 3.8531
friedman.test( PDHmean4)
# no post-hoc tests necessary.
##
## Friedman rank sum test
##
## data: PDHmean4
## Friedman chi-squared = 2.62, df = 3, p-value = 0.45
## further stratification does not change the result
# podoCompl <- podoCompl[podoCompl$Barn==2,] # Barn 2 only, 6 animals
Often the Friedman test is also called a Friedman two-way ANOVA by ranks. Two-way in
the sense that there is a symmetry with respect to groups and subjects.
7.3.2
Kruskal–Wallis Test
The Kruskal–Wallis test compares I different groups, with possibly different group sizes nj and
thus represents the extension of the Wilcoxon–Mann–Withney test to arbitrary number of groups.
We start by calculating the ranks for all observations. The test statistic calculates the average of
136
CHAPTER 7. RANK-BASED METHODS
the ranks in each group and compares it to the overall average. Similar as in a variance estimate,
we square the difference and additionally weight the difference by the number of observations in
the group. Formally, we have
J
KW obs =
X
2
12
nj R j − R ,
n(n + 1)
(7.10)
j=1
where n is the total number of observations, Rj the average of the ranks in group j and R the
overall mean of the ranks, i.e., (n + 1)/2. The additional denominator is such that for J > 3 and
I > 5, the Kruskal–Wallis test statistics is approximately distributed as a chi-square distribution
with J − 1 degrees of freedom.
Similar comments as for the Friedman test hold: for very small datasets critical values are
tabulated; in case of ties the test statistic needs to be adjusted; if the null hypothesis is rejected
post-hoc tests can be performed.
Test 11: Comparing the locations of several independent samples
Question: Are the medians of several dependent samples significantly different?
Assumptions: All distributions are continuous of the same shape (possibly
shifted), the samples are independent.
Calculation: Calculate the rank of the observations among the entire dataset. For
Rj the average of the ranks in group j and R = (n + 1)/2, calculate KW obs
according (7.10).
Decision: Reject H0 : “the medians are the same” if KW obs > χ2crit = χ2J−1,1−α .
Calculation in R: Kruskal.test(x, g)
7.3.3
Family-Wise Error Rate
In this short paragraph we give an alternative view on correcting for multiple testing. A Bonferoni
correction as used in the post-hoc tests (see Section 5.5.2) guarantees the specified Type I error
but may not be ideal as discussed now. Suppose that we perform several test, with resulting
counts denoted as shown in Table 7.1. The family-wise error rate (FWER) is the probability
of making at least one Type I error in the family of tests FWER = P(V ≥ 1). Hence, when
performing the tests such that FWER ≤ α we keep the probability of making one or more Type I
errors in the family at or below level α.
The Bonferroni correction, where we replace the level α of the test with αnew = α/m, maintains the FWER. Another popular correction method is the Holm–Bonferroni procedure for
which we order the p-values (lowest to largest) and the associated hypothesis. We successively
test if the k th p-value is smaller than α/(m − k + 1). If so we reject the k th hypothesis and move
137
7.4. PERMUTATION TESTS
Table 7.1: Typical notation for multiple testing.
H0 is true
H1 is true
Total
Test is declared significant
Test is declared non-significant
V
U
S
T
Total
m0
m − m0
R
m−R
m
on to k + 1, if not we stop the procedure. The Holm–Bonferroni procedure also maintains the
FWER but has (uniformly) higher power than the simple Bonferroni correction.
Remark 7.3. In bioinformatics and related fields m may be thousands and much larger. In
such settings, a FWER procedure may be too stringent as many effects may be missed due to
the very low thresholds. An alternative correction is to control the false discovery rate (FDR) at
level q with FDR = E(V /R). FDR procedures have greater power at the cost of increased rates
of type I errors.
If all of the null hypotheses are true (m0 = m), then controlling the FDR at level q also
controls the FWER
FWER = P(V ≥ 1) = E(V /R) = FDR ≤ q,
(7.11)
because the event {V ≥ 1} is identical to {V /R = 1} (for V = 0 we make appropriate definitions
of the events). However, if m0 < m it holds FWER ≥ FDR.
♣
7.4
Permutation Tests
We conclude the chapter with additional tests, relying on few statistical assumptions and rather
being a toolbox to construct arbitrary tests. The general idea of permutation tests is to “reassign”
the group or treatment to the observations and recalculate the test statistic. Under the null
hypothesis, the observed test statistic will be similar compared to the ones obtained by reassigning
the groups. The resulting p-value is the proportion of cases that the permutation yielded a more
extreme observation than the data.
Permutation tests are straightforward to implement manually and thus are often used in
settings where the distribution of the test statistic is complex or even unknown. Two classical
examples are given in Tests 12 and 13. Permutation tests assume that under the null hypothesis,
the underlying distributions being compared are the same and exchangeable. For large samples,
the test may be computationally demanding.
Example 7.7 (continuation of Examples 5.2, 7.4 and 7.5). R-Code 7.6 illustrates an alternative
to the Wilcoxon signed rank test and the Wilcoxon–Mann–Whitney test and compares the difference in locations between the two visits and the two barns respectively. For the paired setting,
see also Problem 7.4. Without too much surprise, the resulting p-values are in the same range as
seen in Examples 7.4 and 7.5. Note that the function oneway_test() requires a formula input.
♣
138
CHAPTER 7. RANK-BASED METHODS
Test 12: Comparing the locations of two paired samples
Question: Are the locations of two paired samples significantly different?
Assumptions: The null hypothesis is formulated, such that under H0 the groups
are exchangeable.
Calculation:
(1) Calculate the mean tobs of the differences d1 , . . . , dn .
(2) Multiply each difference di with −1 with probability p = 1/2.
(3) Calculate the mean of differences.
(4) Repeat this procedure R times (R large).
Decision: Compare the selected significance level with the p-value:
1
(number of permuted differences more extreme than tobs )
R
Calculation in R: require(exactRankTests); perm.test(x, y, paired=TRUE)
Test 13: Comparing the locations of two independent samples
Question: Are the locations of two independent samples significantly different?
Assumptions: The null hypothesis is formulated, such that under H0 the groups
are exchangeable.
Calculation: (1) Calculate the difference tobs in the means of the two groups to
be compared (m observations in group 1, n observations in group 2).
(2) Form a random permutation of the values of both groups by randomly
allocating the observed values to the two groups (m observations in
group 1, n observations in group 2).
(3) Calculate the difference in means of the two new groups.
(4) Repeat this procedure R times (R large).
Decision: Compare the selected significance level with the p-value:
1
(number of permuted differences more extreme than tobs )
R
Calculation in R: require(coin); oneway_test( formula, data)
7.5. BIBLIOGRAPHIC REMARKS
139
R-Code 7.6 Permutation test comparing means of two samples.
require(exactRankTests)
perm.test( PDHmean2[,1],
PDHmean2[,2], paired=TRUE)
##
## 1-sample Permutation Test (scores mapped into 1:m using rounded
## scores)
##
## data: PDHmean2[, 1] and PDHmean2[, 2]
## T = 28, p-value = 0.45
## alternative hypothesis: true mu is not equal to 0
require(coin)
oneway_test( PDHmean ~ as.factor(Barn), data=podo)
##
## Asymptotic Two-Sample Fisher-Pitman Permutation Test
##
## data: PDHmean by as.factor(Barn) (1, 2)
## Z = 1.24, p-value = 0.21
## alternative hypothesis: true mu is not equal to 0
Note that the package exactRankTests will no longer be further developed. The package
coin is the successor of the latter and includes extensions to other rank-based tests.
Remark 7.4. The two tests presented in this section are often refered to as randomization tests.
There is a subtle difference between permutation tests and randomization tests. For simplicity
we use the term permutation test only and refer to Edgington and Onghena (2007) for an indepth
discussion.
♣
7.5
Bibliographic Remarks
There are many (“elderly”) books about robust statistics and rank test. Siegel and Castellan Jr
(1988) was a very accessible classic for many decades. The book by Hollander and Wolfe (1999)
treats the topic of rank based methods in much more details and depth.
The classical theory of robust statistics is summarized by Huber (1981); Hampel et al. (1986).
7.6
Exercises and Problems
Problem 7.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Justify (intuitively) the breakdown points stated in Remark 7.1.
140
CHAPTER 7. RANK-BASED METHODS
b) Determine the proportion of observations which would be marked on average as outliers
when the data is from a (i) Gaussian distribution, (ii) from a t-distribution with ν degrees
of freedom, (iii) from an exponential distribution with rate λ.
c) Show that under the null hypothesis, E(Wobs ) = n⋆ (n⋆ + 1)/4 and Var(Wobs ) = n⋆ (n⋆ +
1)(2n⋆ + 1)/24.
P ⋆
kIk where Ik is a binomial random variable with p = 1/2 under
Hint: Write Wobs = nk=1
H0 . Hence, E(Ik ) = 1/2 and Var(Ik ) = 1/4.
d) Show that under the null hypothesis, E(Uobs ) = nx ny /2 and Var(Uobs ) = nx ny (nx + ny +
1)/12.
Hints: Use the result of Problem 2.5.c. If we draw without replacement ny elements from
2
a set of N = nx + ny iid random variables, each
having mean
µ and variance σ , then the
2
ny −1
mean and variance of the sample is µ and nσy 1 − nx +n
. That means that the variance
y −1
needs to be adjusted because of the finite population size N = nx + ny , see also Problem 4.
e) Show that the Friedman test statistic Fr given in equation (7.8) can be written in the
“variance form”
J
Fr =
X
2
12
Rj − R ,
IJ(J + 1)
j=1
where R is the average of Rj .
Hint: start showing R = I(J + 1) 2
Problem 7.2 (Robust estimates) For the mtcars dataset (available by default through the
package datasets), summarize location and scale for the milage, number of cylinders and the
weight with standard and robust estimates. Compare the results.
Problem 7.3 (Weight changes over the years) We consider palmerpenguins and assess if
there is a weight change of the penguins over the different years (due to exceptionally harsh
environmental conditions in the corresponding years).
a) Is there a significant weight change for Adelie between 2008 and 2009?
b) Is there evidence of change for any of the species? Why would a two-by-two comparison
be suboptimal?
Problem 7.4 (Permutation test for paired samples) Test if there is a significant change in
pododermatitis between the first and last visit using a manual implementation of a permutation
test. Compare the result to Example 7.4.
Problem 7.5 (Rank and permutation tests) Download the water_transfer.csv data from
the course web page and read it into R. The dataset describes tritiated water diffusion across
human chorioamnion and is taken from Hollander and Wolfe (1999, Table 4.1, page 110). The
7.6. EXERCISES AND PROBLEMS
141
pd values for age "At term" and "12-26 Weeks" are denoted with yA and yB , respectively. We
will statistically test if the water diffusion is different at the two time points. That means we
test whether there is a shift in the distribution of the second group compared to the first.
a) Use a Wilcoxon–Mann–Whitney test to test for a shift in the groups. Interpret the results.
b) Now, use a permutation test as implemented by the function wilcox_test() from R package coin to test for a potential shift. Compare to a).
c) Under the null hypothesis, we are allowed to permute the observations (all y-values) while
keeping the group assignments fix. Keeping this in mind, we will now manually construct
a permutation test to detect a potential shift. Write an R function perm_test() that
implements a two-sample permutation test and returns the p-value. Your function should
execute the following steps.
• Compute the test statistic tobs = yeA − yeB , where e· denotes the empirical median.
• Then repeat many times (e.g., R = 1000)
– Randomly assign all the values of pd to two groups xA and xB of the same size
as yA and yB .
– Store the test statistic tsim = x
eA − x
eB .
• Return the two-sided p-value, i.e., the number of permuted test statistics tsim which
are smaller or equal than −|tobs | or larger or equal than |tobs | divided by the total
number of permutations (in our case R = 1000).
Problem 7.6 (Comparison of power) In this problem we compare the power of the one sample
t-test to the Wilcoxon signed rank test with a simulation study.
iid
a) We assume that Y1 , . . . , Yn ∼ N (µ1 , σ 2 ), where n = 15 and σ 2 = 1. For µ1 = 0 to 1.2 in
steps of 0.05, simulate R = 1000 times a sample and based on these replicates, estimate
the power for a one sample t-test and a Wilcoxon signed rank test (at level α = 0.1).
p
iid
b) Redo for Yi = (m − 2)/mVi + µ1 , where V1 , . . . , Vn ∼ tm , with m = 4. Interpret.
√
iid
2 , with m = 10. Why is the
c) Redo for Yi = (Xi − m)/ 2m + µ1 , where X1 , . . . , Xn ∼ Xm
Wilcoxon signed rank test having an apparent lower power?
Problem 7.7 (Power of the sign test) In this problem we will analyze the power of the sign test
in two situations and compare it to the Wilcoxon signed rank test. To emphasize the effects, we
assume a large sample size n = 100.
a) We assume that d1 , . . . , d100 are the differences of two paired samples. Simulate the di
according to a normal random variables (with variance one) and different means µ1 (e.g.,
a fine sequence from 0 to 1 seq(0, to=1, by=.1)). For the sign test and the Wilcoxon
signed rank test, calculate the proportion of rejected cases out of 500 of the test H0 :
“median is zero” and visualize the resulting empirical power curves. What are the powers
for µ1 = .2? Interpret the power for µ1 = 0.
142
CHAPTER 7. RANK-BASED METHODS
b) We now look at the Type I error if we have a skewed distribution and we use the chisquared distribution with varying degrees of freedom. For very large values thereof, the
distribution is almost symmetric, for very small values we have a pronounced asymmetry.
Note that the median of a chi-square distribution with ν degrees of freedom is approximately
ν(1 − 2/(9ν))3 .
We assume that we have a sample d1 , . . . , d100 from a shifted chi-square distribution with ν
degrees of freedom and median one, e.g., from rchisq(100, df=nu)-nu*(1-2/(9*nu))ˆ3.
As a function of ν, ν = 2, 14, 26, . . . , 50, (seq(2, to=50, by=12)) calculate the proportion
of rejected cases out of 500 tests for the test H0 : “median is zero”. Calculate the resulting
empirical power curve and interpret.
c) Similar to a) we calculate the power for a shifted chi-squared distribution with ν = 2. To
simplify, simulate first rchisq(100, df=2)-2*(1-2/(9*2))ˆ3 and then shift by the same
amount as done in a). Calculate the resulting empirical power curve and interpret. Why
should the resulting power curves of the sign test not be compared with the power curve
of a)?
Problem 7.8 (Anorexia treatments) Anorexia is an eating disorder that is characterized by
low weight, food restriction, fear of gaining weight and a strong desire to be thin. The dataset
anorexia in the package MASS gives the weight in pounds of 72 females before and after a
treatment, consisting of control, cognitive behavioral treatment and family treatment. Visualize
the data wisely and test for effectiveness of the treatments. If there are several possibilities,
discuss the appropriateness of each.
Problem 7.9 (Water drinking preference) Pachel and Neilson (2010) conducted a pilot study
to assess if house cats prefer water if provided still or flowing. Their experiment consisted of
providing nine cats either form of water over 4 days. The data provided in mL consumption over
the 22h hour period is provided in the file cat.csv. Some measurement had to be excluded due
to detectable water spillage. Assess with a suitable statistical approach if the cats prefer flowing
water over still water.
Problem 7.10 (BMJ Endgame) Discuss and justify the statements about ‘Parametric v nonparametric statistical tests’ given in doi.org/10.1136/bmj.e1753.
Chapter 8
Multivariate Normal Distribution
Learning goals for this chapter:
⋄ Describe a random vector, cdf, pdf of a random vector and its properties
⋄ Give the definition and intuition of E, Var and Cov for a random vector
⋄ Know basic properties of E, Var and Cov for a random vector
⋄ Recognize the density of Gaussian random vector and know properties of
Gaussian random vector
⋄ Explain and work with conditional and marginal distributions.
⋄ Estimate the mean and the variance of the multivariate distribution.
⋄ Explain the relationship between the eigenvalues and eigenvector of the covariance matrix and the shape of the density function.
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter08.R.
In Chapter 2 we have introduced univariate random variables and in Chapter 3 random samples. We now extend the framework to random vectors (i.e., multivariate random variables) where
the individual random variables are not necessarily independent (see Definition 3.1). Within the
scope of this book, we can only cover a tiny part of a beautiful theory. We are pragmatic and
discuss what will be needed in the sequel. Hence, we discuss a discrete case as a motivating
example only and then focus on continuous random vectors, especially Gaussian random vectors.
In this chapter we cover the theoretical details, the next one focuses on estimation.
8.1
Random Vectors
A random vector is a (column) vector X = (X1 , . . . , Xp )⊤ with p random variables as components.
The following definition is the generalization of the univariate cumulative distribution function
(cdf) to the multivariate setting (compare with Definition 2.1).
143
144
CHAPTER 8. MULTIVARIATE NORMAL DISTRIBUTION
Definition 8.1. The multivariate (or multidimensional) distribution function of a random vector
X is defined as
FX (x ) = P(X ≤ x ) = P(X1 ≤ x1 , . . . , Xp ≤ xp ),
(8.1)
where the list in the right-hand-side is to be understood as intersections (∩) of the p events. ♢
The multivariate distribution function generally contains more information than the set of
p
Q
P(Xi ≤
marginal distribution functions P(Xi ≤ xi ), because (8.1) only simplifies to FX (x ) =
xi ) under independence of all random variables Xi (compare to Equation (3.3)).
i=1
Example 8.1. Suppose we toss two fair four-sided dice with face values 0, 1, 2 and 3. The
first player marks the absolute difference between both and the second the maximum value. To
cast the situation in a probabilistic framework, we introduce the random variables T1 and T2
for the result of the tetrahedra and assume that each value appears with probability 1/4. (As a
side note, tetrahedron die have often marked corners instead of faces. For the later, appearance
means the face that is turned down.) Table 8.1 gives the frequency table for X = |T1 − T2 | and
Y = max(T1 , T2 ). The entries are to be interpreted as the joint pdf fX,Y (x, y) = P(X = x, Y =
y). From the joint pdf, we can derive various probabilities, for example P(X ≥ Y ) = 9/16,
P(X ≥ 2 | Y ̸= 3) = 1/8. It is also possible to obtain the marginal pdf of, say, X: fX (x) =
P3
y=0 fX,Y (x, y), also summarized by Table 8.1. More specifically, from the joint we can find the
marignal quantities, but not the other way around.
♣
Table 8.1: Probability table for X and Y of Example 8.1. Last column and last row
give the marginal probabilities of X and Y , respectively.
X Y
0
1
2
3
8.1.1
0
1
2
3
1/16
0
0
0
1/16
1/8
0
0
1/16
1/8
1/8
0
1/16
1/8
1/8
1/8
1/4
3/8
1/4
1/8
1/16
3/16
5/16
7/16
1
Density Function of Continuous Random Vectors
A random vector X is a continuous random vector if each component Xi is a continuous random
variable. The probability density function for a continuous random vector is defined in a similar
manner as for univariate random variables.
Definition 8.2. The probability density function (or density function, pdf) fX (x ) of a pdimensional continuous random vector X is defined by
Z
P(X ∈ A) =
fX (x ) dx ,
for all A ⊂ Rp .
(8.2)
A
♢
145
8.1. RANDOM VECTORS
For convenience, we summarize here a few properties of random vectors with two continuous
components, i.e., for a bivariate random vector (X, Y )⊤ . The univariate counterparts are stated
in Properties 2.1 and 2.3. The properties are illustrated with a subsequent extensive example.
Property 8.1. Let (X, Y )⊤ be a bivariate continuous random vector with joint density function
fX,Y (x, y) and joint distribution function FX,Y (x, y).
1. The distribution function is monotonically increasing:
for x1 ≤ x2 and y1 ≤ y2 , FX,Y (x1 , y1 ) ≤ FX,Y (x2 , y2 ).
2. The distribution function is normalized:
lim FX,Y (x, y) = FX,Y (∞, ∞) = 1.
x,y↗∞
(We use the slight abuse of notation by writing ∞ in arguments without a limit.)
FX,Y (−∞, −∞) = FX,Y (x, −∞) = FX,Y (−∞, y) = 0.
3. FX,Y (x, y) and fX,Y (x, y) are continuous (almost) everywhere.
∂2
FX,Y (x, y).
∂x∂y
Z bZ d
5. P(a < X ≤ b, c < Y ≤ d) =
fX,Y (x, y) dy dx
4. fX,Y (x, y) =
a
c
= FX,Y (b, d) − FX,Y (b, c) − FX,Y (a, d) + FX,Y (a, c).
6. Marginal distributions:
FX (x) = P(X ≤ x, Y arbitrary) = FX,Y (x, ∞) and FY (y) = FX,Y (∞, y);
7. Marginal Zdensities:
fX (x) =
Z
fX,Y (x, y) dy and fY (y) =
R
fX,Y (x, y) dx.
R
The last two points of Property 8.1 refer to marginalization, i.e., reduce a higher-dimensional
random vector to a lower dimensional one. Intuitively, we “neglect” components of the random
vector in allowing them to take any value.
Example 8.2. The joint distribution of (X, Y )⊤ is given by FX,Y (x, y) = y − 1 − exp(−xy) /x,
for x ≥ 0 and 0 ≤ y ≤ 1. Top left panel of Figure 8.1 illustrates the joint cdf and the mononicity
of the joint cdf is nicely visible. The top right panel illustrates that the joint cdf is normalized:
for all values x = 0 or y = 0 the joint cdf is also zero and it reaches one for x → ∞ and y = 1.
The joint cdf is defined for the entire plane (x, y), if any of the two variables is negative, the joint
cdf is zero. For values y > 1 we have FX,Y (x, y) = FX,Y (x, 1). That means, that the joint cdf
outside the domain [0, ∞[ × [0, 1] is determined by its properties (here, for rotational simplicity,
we only state the functions in that specific domain).
∂
∂2
FX,Y (x, y) = ∂x
1 − exp(−xy) = exp(−xy)y
The joint density is given by fX,Y (x, y) = ∂x∂y
and shown in the top right panel of Figure 8.1. The joint density has a particular simple form
but it cannot be factored into the marginal cdf and marginal densities:
FY (y) = FX,Y (∞, y) = y,
Z ∞
fY (y) =
e−xy y dx = 1,
0
FX (x) = FX,Y (x, 1) = FX,Y (x, ∞) = 1 − (1 − e−x )/x,
Z 1
1 − e−x (1 + x)
fX (x) =
e−xy y dy =
.
x2
0
(8.3)
(8.4)
146
CHAPTER 8. MULTIVARIATE NORMAL DISTRIBUTION
1.0
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0
0.0
0
1
1
2
x
2
3
1.0
0.8
0.6
0.4
0.2
y
5 0.0
x
0.8
1.0
0.8
0.6
0.4
0.2
5 0.0
y
0.0
0.4
fX, Y(x, a)
0.2
0.0
fX(x)
3
4
0.4
4
0
1
2
3
4
5
0
1
2
x
3
4
5
x
Figure 8.1: Top row: joint cdf and joint density function for case discussed in Example 8.2. Bottom row: marginal density of X and joint density evaluated at a = 1, .7, .4, .1
(in black, blue, green and red).
(Of course, we could have differentiated the expressions in the first line to get the second line or
integrated the expressions in the second line to get the ones in the first line, due to Property 2.3.)
The marginal distribution of Y is U(0, 1)! Bottom left panel of Figure 8.1 gives the marginal
density of X. The product of the two marginal densities is not equal to the joint one. Hence, X
and Y are not independent.
For coordinate aligned rectangular domains, we use the joint cdf to calculate probabilities,
for example
P(1 < X ≤ 2, Y ≤ 1/2) = FX,Y (2, 1/2) − FX,Y (2, 0) − FX,Y (1, 1/2) + FX,Y (1, 0)
√
( e − 1)2
−2·1/2
−1·1/2
≈ 0.077.
= 1/2 − (1 − e
)/2 − 0 − 1/2 − (1 − e
)/1 − 0 =
2e
(8.5)
and via integration for other cases
Z 1
Z 1 Z 1/y
e −1
−xy
−1
≈ 0.63. (8.6)
P(XY ≤ 1) = FX,Y (1, 1) +
e
y dx dy = e +
e−y − e−1 dy =
e
0
1
0
♣
8.1.2
Mean and Variance of Continuous Random Vectors
We now characterize the first two moments of random vectors.
8.1. RANDOM VECTORS
Definition 8.3. The expected value of a random vector X is defined as

 

X1
E(X1 )
 .  

..
 .  
.
E(X) = E 
.
 .  = 

Xp
E(Xp )
147
(8.7)
♢
Hence the expectation of a random vector is simply the vector of the individual expectations.
Of course, to calculate these, we only need the marginal univariate densities fXi (x) and thus the
expectation does not change whether (8.1) can be factored or not. The expectation of products
of random variables is defined as
Z Z
E(X1 X2 ) =
x1 x2 f (x1 , x2 ) dx1 dx2
(8.8)
(for continuous random variables). The variance of a random vector requires a bit more thought
and we first need the following.
Definition 8.4. The covariance between two arbitrary random variables X1 and X2 is defined
as
Cov(X1 , X2 ) = E (X1 − E(X1 ))(X2 − E(X2 )) = E(X1 X2 ) − E(X1 ) E(X2 ).
(8.9)
♢
In case of two independent random variables, the their joint density can be factored and
Equation (8.8) shows that E(X1 X2 ) = E(X1 ) E(X2 ) and thus their covariance is zero. The
inverse, however, is in general not true.
Using the linearity property of the expectation operator, it is possible to show the following
handy properties.
Property 8.2. We have for arbitrary random variables X1 , X2 and X3 :
1. Cov(X1 , X2 ) = Cov(X2 , X1 ),
2. Cov(X1 , X1 ) = Var(X1 ),
3. Cov(a + bX1 , c + dX2 ) = bd Cov(X1 , X2 ), for arbitrary values a, b, c and d,
4. Cov(X1 , X2 + X3 ) = Cov(X1 , X2 ) + Cov(X1 , X3 ).
It is important to note that the covariance describes the linear relationship between the
random variables.
Definition 8.5. The variance of a p-variate random vector X = (X1 , . . . , Xp )⊤ is defined as
Var(X) = E (X − E(X))(X − E(X))⊤
(8.10)

 

X1
Var(X1 )
. . . Cov(Xi , Xj )

 .  
..





,
= Var  ..  = 
(8.11)
.

Xp
Cov(Xj , Xi ) . . .
Var(Xp )
called the covariance matrix or variance–covariance matrix.
♢
148
CHAPTER 8. MULTIVARIATE NORMAL DISTRIBUTION
The covariance matrix is a symmetric matrix and – except for degenerate cases – a positive
definite matrix. We will not consider degenerate cases and thus we can assume that the inverse
of the matrix Var(X) exists and is called the precision.
Similar to Properties 2.5, we have the following properties for random vectors.
4 min
Property 8.3. For an arbitrary p-variate random vector X, (fixed) vector a ∈ Rq and matrix
B ∈ Rq×p it holds:
1. Var(X) = E(XX⊤ ) − E(X) E(X)⊤ ,
2. E(a + BX) = a + B E(X),
3. Var(a + BX) = B Var(X)B⊤ .
The covariance itself cannot be interpreted as it can take arbitrary values. The correlation
between two random variables X1 and X2 is defined as
Cov(X1 , X2 )
Corr(X1 , X2 ) = p
Var(X1 ) Var(X2 )
5 min
(8.12)
and corresponds to the normalized covariance. It holds that −1 ≤ Corr(X1 , X2 ) ≤ 1, with
equality only in the degenerate case X2 = a + bX1 for some a and b ̸= 0.
8.2
Conditional Densities
In this short section we revisit conditional probabilities and conditional densities with particular
focus on bivariate continuous random variables.
We recall the formula for conditional probablity P(A | B) = P(A ∩ B)/ P(B), which can be
translated directly to the conditional cdf
FX|Y (x | y) =
FX,Y (x, y)
FY (y)
(8.13)
for A = {X ≤ x} and B = {Y ≤ y}. The events A and B can be much more general, for example
A = {X = 2} and B = {Y = 2}. In a setting like Example 8.1, it is still possible to calculate
P(X = 2 | Y = 2) = P(X = 2, Y = 2)/ P(Y = 2) = (1/8)/(5/16) = 2/5. However, in a setting
like Example 8.2, an expression like P(Y = a) is zero and we are - seemingly - stuck.
Definition 8.6. Let (X, Y )⊤ be a bivarite continuous random vector with joint density function
fX,Y (x, y). The conditional density of X given Y = y is
fX|Y (x | y) =
fX,Y (x, y)
,
fY (y)
whenever fY (y) > 0 and zero otherwise.
(8.14)
♢
This definition has nevertheless some intuition. Consider a neighborhood of x and y
R x+δ R y+ϵ
fX,Y (x, y) dy dx
x
y
P(x ≤ X ≤ x + δ | y ≤ Y ≤ y + ϵ) =
R y+ϵ
fY (y) dy
y
≈
δϵfX,Y (x, y)
= δfX|Y (x | y).
ϵfY (y)
(8.15)
(8.16)
149
8.3. MULTIVARIATE NORMAL DISTRIBUTION
That means that the conditional probability of X | Y = y in a neighborhood of (x, y) is proportional to the conditional density.
Visually, the conditional density is like slicing the joint at particular values (of x or y) and
renormalizing the resulting curve to get a proper density.
Conditional densities are proper densities and thus it is possible to calculate the conditional
expectation E(X | Y ), conditional variance Var(X | Y ) and so forth.
Example 8.3. In the setting of Example 8.2 the marginal density of Y is constant and thus
fX|Y (x | y) = fX,Y (x, y). The curves in the lower right panel of Figure 8.2 are actual densities:
the conditional density fX|Y (x | y) for y = 1, 0.7, 0.4 and 0.1.
R∞
The conditional expectation of X | Y = y is E(X | Y ) = 0 x e−xy y dx = 1/y.
♣
8.3
Multivariate Normal Distribution
We now consider a special multivariate distribution: the multivariate normal distribution, by
first considering the bivariate case.
8.3.1
Bivariate Normal Distribution
We introduce the bivariate normal distribution by specifying its density, whose expression is
quite intimidating and is for reference here.
Definition 8.7. The random variable pair (X, Y ) has a bivariate normal distribution if
Z x Z y
FX,Y (x, y) =
fX,Y (x, y) dx dy
(8.17)
−∞
−∞
with density
f (x, y) = fX,Y (x, y)
=
1
p
2πσx σy 1 − ρ2
(8.18)
exp −
1
2(1 − ρ2 )
(x − µx )2 (y − µy )2 2ρ(x − µx )(y − µy )
+
−
σx2
σy2
σx σy
for all x and y and where µx ∈ R, µy ∈ R, σx > 0, σy > 0 and −1 < ρ < 1.
,
♢
The role of some of the parameters µx , µy , σx , σy and ρ might be guessed. We will discuss
their precise meaning after the following example. Note that the joint cdf does not have a closed
form and essentially all probability calculations involve some sort of numerical integration scheme
of the density.
Example 8.4. Figure 8.2 (based on R-Code 8.1) shows the density of a bivariate normal distri√
√
bution with µx = µy = 0, σx = 1, σy = 5, and ρ = 2/ 5 ≈ 0.9. Because of the quadratic form
in (8.18), the contour lines (isolines) are ellipses.
The joint cdf is harder to interpret, it is almost impossible to infer the shape an orientation
of an isoline of the density. Since (x, y) can take any values in the plane, the joint cdf is strictly
positive and the value zero is only reached at limx,y↘−∞ F (x, y) = 0.
Several R packages implement the bivariate/multivariate normal distribution. We recommend
the package mvtnorm.
♣
150
CHAPTER 8. MULTIVARIATE NORMAL DISTRIBUTION
R-Code 8.1: Density of a bivariate normal random vector. (See Figure 8.2.)
require( mvtnorm)
# providing dmvnorm, pmvnorm
require( fields)
# providing tim.colors() and image.plot()
Sigma <- array( c(1,2,2,5), c(2,2))
x <- y <- seq( -3, to=3, length=100)
grid <- expand.grid( x=x, y=y)
densgrid <- dmvnorm( grid, mean=c(0, 0), sigma=Sigma) # zero mean
jdensity <- array( densgrid, c(100, 100))
image.plot(x, y, jdensity, col=tim.colors())
# left panel
faccol <- fields::tim.colors()[cut(jdensity[-1,-1],64)]
persp(x, y, jdensity, col=faccol, border = NA, zlab="", # right panel
tick='detailed', theta=120, phi=30, r=100, zlim=c(0,0.16))
## To calculate the cdf, we need a lower and upper bound. Passing directly
cdfgrid <- apply(grid, 1, function(x) {
# the grid is not possible
pmvnorm( upper=x, mean=c(0, 0), sigma=Sigma) } )
jcdf <- array( cdfgrid, c(100, 100))
image.plot(x, y, jcdf, zlim=c(0,1), col=tim.colors())
# left panel
faccol <- fields::tim.colors()[cut(jcdf[-1,-1],64)]
persp(x, y, jcdf, col=faccol, border = NA, zlab="",
# right panel
tick='detailed', theta=12, phi=50, r=100, zlim=c(0,1))
The bivariate normal distribution has many nice properties. A first set is as follows.
Property 8.4. For the bivariate normal random vector as specified by (8.18), we have: (i) The
marginal distributions are X ∼ N (µx , σx2 ) and Y ∼ N (µy , σy2 ) and (ii)
!
!!
!
!!
ρσx σy
X
µx
X
σx2
.
(8.19)
E
=
Var
=
ρσx σy
σy2
Y
µy ,
Y
Thus,
Cov(X, Y ) = ρσx σy ,
Corr(X, Y ) = ρ.
(8.20)
(iii) If ρ = 0, X and Y are independent and vice versa.
Note, however, that the equivalence of independence and uncorrelatedness is specific to jointly
normal variables and cannot be assumed for random variables that are not jointly normal.
Example 8.5. Figure 8.3 (based on R-Code 8.2) shows realizations from a bivariate normal
distribution with zero mean and unit marginal variance for various values of correlation ρ. Even
for large samples as shown here (n = 200), correlations between −0.25 and 0.25 are barely
perceptible.
♣
151
8.3. MULTIVARIATE NORMAL DISTRIBUTION
3
## Loading required package: mvtnorm
##
## Attaching package: ’mvtnorm’
## The following objects are masked from ’package:spam’:
##
##
rmvnorm, rmvt
2
0.15
1
0.15
0
0.10
−1
y
0.10
0.05
0.05
−2
0.00
−3
−3
−2
−1
−2
−1
−3
0.00
−3
−2
−1
0
1
2
0
0
y
1
1
2
x
2
3 3
3
3
x
2
1.0
1
0.8
0
1.0
−1
0.4
3
0.8
2
0.6
0.2
1
0.4
0
y
y
0.6
−2
0.2
−3
0.0
−3
−2
−1
0
1
2
0.0
−3
−1
−2
−1
−2
0
x
1
2
3 −3
3
x
Figure 8.2: Density (top row) and distribution function (bottom row) of a bivariate
normal random vector. (See R-Code 8.1.)
8.3.2
Multivariate Normal Distribution
For the general multivarate case we have to use vector notation. Surprisingly, we gain clarity
even compared to the bivariate case. We again introduce the distribution through the density
Definition 8.8. The random vector X = (X1 , . . . , Xp )⊤ is multivariate normally distributed if
Z x1
Z xp
FX (x ) =
···
fX (x1 , . . . , xp ) dx1 . . . dxp
(8.21)
−∞
−∞
152
CHAPTER 8. MULTIVARIATE NORMAL DISTRIBUTION
R-Code 8.2 Realizations from a bivariate normal distribution for various values of ρ,
termed binorm (See Figure 8.3.)
4
4
4
set.seed(12)
rho <- c(-.25, 0, .1, .25, .75, .9)
for (i in 1:6) {
Sigma <- array( c(1, rho[i], rho[i], 1), c(2,2))
sample <- rmvnorm( 200, sigma=Sigma)
plot( sample, pch=20, xlab='', ylab='', xaxs='i', yaxs='i',
xlim=c(-4, 4),ylim=c(-4, 4), cex=.4)
legend( "topleft", legend=bquote(rho==.(rho[i])), bty='n')
}
2
4
0
−4
−2
0
2
4
0
2
4
0
2
4
0
2
4
2
0
−4
−2
0
−4
−2
0
−2
−4
−2
−2
ρ = 0.9
2
ρ = 0.75
2
ρ = 0.25
−4
−4
4
0
4
−2
4
−4
−4
−2
0
−2
−4
−4
−2
0
2
ρ = 0.1
2
ρ=0
2
ρ = −0.25
−4
−2
0
2
4
−4
−2
Figure 8.3: Realizations from a bivariate normal distribution with mean zero, unit
marginal variance and different correlations (n = 200). (See R-Code 8.2.)
with density
fX (x1 , . . . , xp ) = fX (x ) =
1
1
⊤ −1
exp
−
(x
−
µ)
Σ
(x
−
µ)
2
(2π)p/2 det(Σ)1/2
(8.22)
for all x ∈ Rp (with µ ∈ Rp and symmetric, positive-definite Σ). We denote this distribution
with X ∼ Np (µ, Σ).
♢
The following two properties give insight into the meaning of the parameters and give the
distribution of linear combinations of Gaussian random variables.
153
8.3. MULTIVARIATE NORMAL DISTRIBUTION
Property 8.5. For the multivariate normal distribution X ∼ Np (µ, Σ) we have:
E(X) = µ ,
Var(X) = Σ .
Property 8.6. Let X ∼ Np (µ, Σ) and let a ∈ Rq , B ∈ Rq×p , q ≤ p, rank(B) = q, then
a + BX ∼ Nq a + Bµ, BΣB⊤ .
(8.23)
(8.24)
This last property has profound consequences. First, it generalizes Property 2.7 and, second, it asserts that the one-dimensional marginal distributions are again Gaussian with Xi ∼
N (µ)i , (Σ)ii , i = 1, . . . , p (by (8.24) with a = 0 and B = (. . . , 0, 1, 0, . . . ), a vector with zeros
and a one in the ith position).
Similarly, any subset and any (non-degenerate) linear combination of random variables of X
is again Gaussian with appropriate subset selection of the mean and covariance matrix.
We now discuss how to draw realizations from an arbitrary Gaussian random vector, much
in the spirit of Property 2.7. We suppose that we can efficiently draw from a standard normal
distribution. Recall that Z ∼ N (0, 1), then σZ + µ ∼ N (µ, σ 2 ), σ > 0. We will rely on
Equation (8.24) but need to decompose the target covariance matrix Σ. Let L ∈ Rp×p such that
LL⊤ = Σ. That means, L is like a “matrix square root” of Σ.
To draw a realization x from a p-variate random vector X ∼ Np (µ, Σ), one starts with
iid
drawing p values from Z1 , . . . , Zp ∼ N (0, 1), and sets z = (z1 , . . . , zp )⊤ . The vector is then
(linearly) transformed with µ + Lz . Since Z ∼ Np (0, I), where I ∈ Rp×p be the identity matrix,
a square matrix which has only ones on the main diagonal and only zeros elsewhere, Property 8.6
asserts that X = µ + LZ ∼ Np (µ, LL⊤ ).
In practice, the Cholesky decomposition of Σ is often used. This factorization decomposes
a symmetric positive-definite matrix into the product of a lower triangular matrix L and its
Q
transpose. It holds that det(Σ) = det(L)2 = pi=1 (L)2ii , i.e., we get the normalizing constant
in (8.22) “for free”.
8.3.3
Conditional Distributions
We now extend the concept of bivariate conditional densities to arbitrary dimensions for Gaussian
random vectors. To do so, we separate the random vector X in two parts of size q and p − q,
taking the role of X and Y in Equation (8.13), (8.14) and so forth. For simplicity we (possibly)
reorder the elements and write
X1
X=
, X1 ∈ Rq , X2 ∈ Rp−q .
(8.25)
X2
We divide the vector µ in similarly sized components µ1 and µ2 and the matrix Σ in 2 × 2 blocks
with according sizes:
!!
Σ11 Σ12
X1
µ1
X=
∼ Np
,
.
(8.26)
X2
µ2
Σ21 Σ22
Both (multivariate) marginal distributions X1 and X2 are again normally distributed with X1 ∼
Nq (µ1 , Σ11 ) and X2 ∼ Np−q (µ2 , Σ22 ) (this can be seen again by Property 8.6).
5 min
154
CHAPTER 8. MULTIVARIATE NORMAL DISTRIBUTION
Property 8.7. If one conditions a multivariate normally distributed random vector (8.26) on a
sub-vector, the result is itself multivariate normally distributed with
1 min
−1
X2 | X1 = x1 ∼ Np−q µ2 + Σ21 Σ−1
11 (x1 − µ1 ), Σ22 − Σ21 Σ11 Σ12 .
(8.27)
X1 and X2 are independent if Σ21 = 0 and vice versa.
Equation (8.27) is probably one of the most important formulas one encounters in statistics
albeit not always in this explicit form. The result is again one of the many features of Gaussian
random vectors: the distribution is closed (meaning again Gaussian) with respect to linear
combinations and conditioning.
We now have a close look at Equation (8.27) and give a detailed explaination thereof. The
expected value of the conditional distribution (conditional expectation) depends linearly on the
value of x 1 , but the variance is independent of the value of x 1 . The conditional expectation
represents an update of X2 through X1 = x 1 : the difference x 1 − µ1 is normalized by its
variance Σ11 and scaled by the covariance Σ21 .
The interpretation of the conditional distribution (8.27) is a bit easier to grasp if we use p = 2
2 2
2 −1 = ρσ /σ and Σ Σ−1 Σ
with X and Y , in which Σ21 Σ−1
y
x
21 11 12 = ρ σy , yielding
11 = ρσy σx (σx )
Y | X = x ∼ N µy + ρσy σx−1 (x − µx ), σy2 − ρ2 σy2 .
(8.28)
This equation is illustrated in Figure (8.4). The bivariate density is shown with blue ellipses. The
vertical green density is the marginal density of Y and represent the “uncertainty” without any
further information. The inclined ellipse indicates dependence between both variables. Hence, if
we know X = x (e.g., one of the ticks on the x-axis), we change our knowledge about the second
variable. We can adjust the mean and reduce the variance. The red vertical densities are two
y
µy + ρσy σx−1 (x − µx )
x
Figure 8.4: Graphical illustration of the conditional distribution of a bivariate normal random vector. Blue: bivariate density with isolines indicating quartiles, green:
marginal densities, red: conditional densities. The respective means are indicated with
circles. The height of the univariate densities are exaggerated by a factor of five.
8.4. BIBLIOGRAPHIC REMARKS
155
examples of conditional densities. The conditional means are on the line µy + ρσy σx−1 (x − µx ).
The unconditional mean µy is corrected based the difference x − µx , the further x from the
mean of X the larger the correction. Of course, this difference needs to be normalized by the
standard deviation of X, leading then to σx−1 (x − µx ). The tighter the ellipses the stronger we
need to correct (further multiplication with ρ) and finally, we have to scale (back) according to
the variable Y (final multiplication with σy ).
The conditional variance remains the same for all possible X = x (the red vertical densities
have the same variance) and can be written as σy2 (1 − ρ2 ) and is smaller the larger the absolute
value of ρ is. As ρ is the correlation, a large correlation means a stronger linear relationship, the
contour lines of the ellipses are tighter and thus the more information we have for Y . A negative
ρ simply tilts the ellipses preserving the shape. Hence, similar information for the variance and
a mere sign change for the mean adjustment.
This interpretation indicates that the conditional distribution and more specifically the conditional expectation plays a large role in prediction where one tries to predict (in the litteral
sense of the word) a random variable based on an observation (see Problem 8.4).
Remark 8.1. With the background of this chapter, we are able to close a gap from Chapter 3.
A slightly more detailed explanation of Property 3.4.2 is as follows. We represent the vector
(X1 , . . . , Xn )⊤ by two components, (X1 − X, . . . , Xn − X)⊤ and X1, with 1 the p-vector containing only ones. Both vectors are orthogonal, thus independent. If two random variables,
say Y and Z are independent, then g(Y ) and h(Z) as well (for reasonable choices of g and h).
P
Here, g and h are multivariate and have the particular form of g(y1 , . . . , yn ) = ni=1 yi2 /(n − 1)
h(z1 , . . . , zn ) = z1 .
♣
8.4
Bibliographic Remarks
Many introductory textbooks contain chapters about random vectors and multivariate Gaussian
distribution. Classics are Rice (2006) and Mardia et al. (1979).
The online book “Matrix Cookbook” is a very good synopsis of the important formulas related
to vectors and matrices (Petersen and Pedersen, 2008).
8.5
Exercises and Problems
Problem 8.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Proof Property 8.3. Specifically, let X be a p-variate random vector and let B ∈ Rq×p and
a ∈ Rq be a non-stochastic matrix and vector, respectively. Show the following.
(a) E(a + BX) = a + B E(X),
(b) Var(X) = E(XX⊤ ) − E(X) E(X)⊤ ,
156
CHAPTER 8. MULTIVARIATE NORMAL DISTRIBUTION
(c) Var(a + BX) = B Var(X)B⊤ .
Problem 8.2 (Bivariate normal) In this exercise we derive the bivariate density in an alternative
fashion.
iid
a) Let X, Y ∼ N (0, 1) Show that the joint density is fX,Y (x, y) = 1/(2π) exp(−(x2 + y 2 )/2).
p
b) Define Z = ρX + 1 − ρ2 Y Derive the distribution of Z and the pair (X, Z). Give an
intuitive explainaition of the dependency between X and Z in terms of ρ.
c) Show that Z | X = x ∼ N (ρx, 1 − ρ2 )
Problem 8.3 (Covariance and correlation) Let fX,Y (x, y) = c · exp(−x)(x + y) for 0 ≤ x ≤ ∞,
0 ≤ y ≤ 1 and zero otherwise.
a) Show that c = 2/3.
b) Derive the marginal densities fX (x) and fY (y) and their first two moments.
c) Calculate Cov(X, Y ) and Corr(X, Y ).
d) Derive the conditional densities fY |X (y | x), fX|Y (x | y) and the conditional expectations
E(Y | X = x), E(X | Y = y). Give an interpretation thereof.
Problem 8.4 (Prediction) According to https://www.bfs.admin.ch/bfs/de/home/statistiken/
gesundheit.assetdetail.7586022.html Swiss values 164.65 0.15
a) What height do we predict for her dauther after grown up?
b) Could the prediction be further refined?
Problem 8.5 (Sum of dependent random variables) We saw that if X, Y are independent
Gaussian variables, then their (weighted) sum is again Gaussian. This problem illustrates that
the assumption of independence is necessary.
Let X be a standard normal random variable. We define Y as

X
if |X| ≥ c,
Y =
−X otherwise,
for a fixed positive constant c.
a) Argue heuristically that also Y ∼ N (0, 1).
b) Argue heuristically and with simulations that (i) X and Y are dependent and (ii) the
random variable X + Y is not normally distributed.
Chapter 9
Estimation of Correlation and Simple
Regression
Learning goals for this chapter:
⋄ Explain the concept of correlation
⋄ Explain different correlation coefficient estimates
⋄ Explain the least squares criterion
⋄ Explain the statistical model of the linear regression
⋄ Apply linear model in R, check the assumptions, interpret the output
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter09.R.
We now start to introduce more complex and realistic statistical models (compared to (4.1)).
In most of this chapter (and most of the remaining ones) we consider “linear models.” We start by
estimating the correlation and then linking the simple regression model to the bivariate Gaussian
distribution, followed by extending the model to several so-called predictors in the next chapter.
Chapter 11 presents the least squares approach from a variance decomposition point of view.
A detailed discussion of linear models would fill entire books, hence we consider only the most
important elements.
The (simple) linear regression is commonly considered the archetypical task of statistics and
is often introduced as early as middle school, either in a black-box form or based on intuition.
We approach the problem more formally and we will in this chapter (i) quantify the (linear)
relationship between variables, (ii) explain one variable through another variable with the help
of a “model”.
157
158
CHAPTER 9. ESTIMATION OF CORRELATION AND SIMPLE REGRESSION
9.1
Estimation of the Correlation
The goal of this section is to quantify the linear relationship between two random variables X
and Y with the help of n pairs of values (x1 , yn ), . . . , (xn , yn ), i.e., realizations of the two random
variables (X, Y ). More formally, we will derive an estimator and estimate the covariance and
correlation of a pair of random variables.
Based on Equation (8.12), an intuitive estimator of the correlation between the random
variables X and Y is
Xn
(xi − x)(yi − y)
\Y )
Cov(X,
\Y ) = q
r = Corr(X,
= rX i=1
,
(9.1)
Xn
n
\
\
2
2
Var(X)Var(Y )
(xi − x)
(yj − y)
i=1
j=1
where we used the same denomiator for the covariance and the variance estimate (e.g., n − 1).
The estimate r is called the Pearson correlation coefficient. Just like the correlation, the Pearson
correlation coefficient also lies in the interval [−1, 1].
We will introduce a handy notation that is often used in the following:
sxy =
n
X
i=1
(xi − x)(yi − y),
sxx =
n
X
i=1
(xi − x)2 ,
syy =
n
X
(yi − y)2 .
(9.2)
i=1
√
Hence, we can express (9.1) as r = sxy / sxx syy . Similarly, unbiased variance estimates are
sxx /(n − 1) and syy /(n − 1) and an estimate for the covariance is sxy /(n − 1) (again an unbiased).
Example 9.1. In R, we use cov() and cor() to obtain an estimate of the covariance and
of Pearson correlation coefficient. For the penguin data (see Example 1.9) we have between
body mass and flipper length a covariance of 9824.42 and a correlation of 0.87. The latter was obtained with the command with(penguins, cor(body_mass_g, flipper_length_mm,
use="complete.obs")), where the argument use specifies the handling of missing values.
♣
A natural question we now tackle is if an observed correlation is statistically significant,
i.e., testing the hypothesis H0 : ρ = 0. As seen in Chapter 5 we need the distribution of the
corresponding estimator. Let us consider bivariate normally distributed random variables as
discussed in the last chapter. Naturally, r as given in (9.1) is an estimate of the correlation
parameter ρ explicated in the density (8.18). Let R be the corresponding estimator of ρ based
on (9.1), i.e., replacing (xi , yi ) by (Xi , Yi ). With quite tedious work, it can be shown that the
random variable
√
n−2
(9.3)
T = R√
1 − R2
is, under H0 : ρ = 0, t-distributed with n − 2 degrees of freedom. Hence, the test consists of using
the test statistic (9.3) with the data and comparing the value with the appropriate quantile of
the t-distribution. The corresponding test is described under Test 14.
The test is quite particular, as it only depends on the sample correlation and sample size.
Astonishingly, for correlation estimates to be significant large sample sizes are required. For
9.1. ESTIMATION OF THE CORRELATION
159
Test 14: Test of correlation
Question: Is the correlation between two paired samples significant?
Assumptions: The pairs of values stem from a bivariate normal distribution.
√
n−2
Calculation: tobs = |r| √
where r is the Pearson correlation coefficient (9.1).
1 − r2
Decision: Reject H0 : ρ = 0 if tobs > tcrit = tn−2,1−α/2 .
Calculation in R: cor.test( x, y, conf.level=1-alpha)
example, r = 0.25 is not significant unless n > 62. Naturally, if we do not have Gaussian data,
Test 14 is not exact and the resulting p-value an approximation only.
In order to construct confidence intervals for correlation estimates, we typically need the
so-called Fisher transformation
1 + r
1
W (r) = log
= arctanh(r)
(9.4)
2
1−r
and the fact that, for bivarate normally distributed random variables, the distribution of W (R)
is approximately N W (ρ), 1/(n − 3) . Then a straight-forward confidence interval can be constructed:
h
z1−α/2 i
W (R) − W (ρ) app
p
∼ N (0, 1),
approx. CI for W (ρ) : W (R) ± √
.
(9.5)
n−3
1/(n − 3)
A confidence interval for ρ requires a back-transformation and is shown in CI 6.
CI 6: Confidence interval for the Pearson correlation coefficient
A sample approximate (1 − α) confidence interval for r is
h
zα/2 zα/2 i
tanh arctanh(r) − √
, tanh arctanh(r) + √
n−3
n−3
where tanh and arctanh are the hyperbolic and inverse hyperbolic tangent functions
(see (9.4)).
Example 9.2. We estimate the correlation of the scatter plots from Figure 8.3 and calculate
the corresponding 95%-confidence intervals thereof in R-Code 9.1. For these specific samples, the
confidence intervals obtained from simulating with ρ = 0 and 0.1 cover the value zero. Hence, ρ
is not statistically significant (different from zero).
The width of the six intervals are slightly different and the correlation estimate is not precisely
in the center, both due to the back-transformation.
♣
160
CHAPTER 9. ESTIMATION OF CORRELATION AND SIMPLE REGRESSION
R-Code 9.1 Pearson correlation coefficient and confidence intervals of the scatter plots
from Figure 8.3.
require( mvtnorm)
set.seed(12)
rho <- c(-.25, 0, .1, .25, .75, .9)
n <- 200
out <- matrix(0, 3,6, dimnames=list(c("rhohat","b_low","b_up"), paste(rho)))
for (i in 1:6) {
Sigma <- array( c(1, rho[i], rho[i], 1), c(2,2))
sample <- rmvnorm( n, sigma=Sigma)
out[1,i] <- cor( sample)[2]
out[2:3,i] <- tanh( atanh( out[1,i]) + qnorm( c(0.025,0.975))/sqrt(n-3))
}
print( out, digits=2)
##
-0.25
0
0.1 0.25 0.75 0.9
## rhohat -0.181 -0.1445 0.035 0.24 0.68 0.89
## b_low -0.312 -0.2777 -0.104 0.11 0.60 0.86
## b_up
-0.044 -0.0059 0.173 0.37 0.75 0.92
There are alternatives to Pearson correlation coefficient, that is, there are alternative estimators of the correlation. The two most common ones are called Spearman’s ϱ or Kendall’s τ and
are based on ranks and thus also called rank correlation coefficients. In brief, Spearman’s ϱ is
calculated similarly to (9.1), where the values are replaced by their ranks. Kendall’s τ compares
the number of concordant (if xi < xj then yi < yj ) and discordant (if xi < xj then yi > yj ) pairs.
As expected, Pearson’s correlation coefficient is not robust, while Spearman’s ϱ or Kendall’s τ
are “robust”.
Example 9.3. Correlation is a measure of linear dependency and it is impossible to deduce
general relationship in a scatterplot based on a single value. Figure 9.1 (based on R-Code 9.2)
gives the four scatterplots of the so-called anscombe data, all having an identical Pearson correlation coefficient of 0.82 but vastly different shape. As given in R-Code 9.2 the robustness of
Spearman’s ϱ or Kendall’s τ are evident. If there are no outliers and the the data does exhibit
a linear relationship, there is, of course, quite some agreement between the estimates.
Note that none of the estimators “detects” the squared relationship in the second sample. The
functional form can be “approximated” by a linear one and hence the large correlation estimates.
♣
161
9.2. ESTIMATION OF A P-VARIATE MEAN AND COVARIANCE
R-Code 9.2 anscombe data: visualization and correlation estimates. (See Figure 9.1.)
library( faraway)
# dataset 'anscombe' is provided by this package
data( anscombe)
# head( anscombe)
# dataset with eleven observations and 4x2 variables
with( anscombe, { plot(x1, y1); plot(x2, y2); plot(x3, y3); plot(x4, y4) })
sel <- c(0:3*9+5)
# extract diagonal entries of sub-block
print(rbind( pearson=cor(anscombe)[sel],
spearman=cor(anscombe, method='spearman')[sel],
kendall=cor(anscombe, method='kendall')[sel]), digits=2)
6
8 10
14
12
10
12
8
y4
10
y3
6
8
6
y2
4
3 4 5 6 7 8 9
8
4
6
y1
10
##
[,1] [,2] [,3] [,4]
## pearson 0.82 0.82 0.82 0.82
## spearman 0.82 0.69 0.99 0.50
## kendall 0.64 0.56 0.96 0.43
4
x1
6
8 10
14
4
6
8 10
x2
14
8
x3
12
16
x4
Figure 9.1: anscombe data, the four cases all have the same Pearson correlation
coefficient of 0.82, yet the scatterplot shows a completely different relationship. (See
R-Code 9.2.)
9.2
Estimation of a p-Variate Mean and Covariance
We now extend the estimation of the covariance and correlation of pairs or variables to random
vectors of arbitrary dimension and thus we will rely again on vector notation. The estimators
for parameters of random vectors are constructed in a manner similar to that for the univariate
case. Let x 1 , . . . , x n be a realization of the random sample X1 , . . . , Xn iid for some p-variate
distribution with n > p. We have the following estimators for the mean and the variance
n
b =X =
µ
1X
Xi
n
i=1
n
b =
Σ
1 X
(Xi − X)(Xi − X)⊤
n−1
(9.6)
i=1
and corresponding estimates
n
1X
b =x =
µ
xi
n
i=1
4 min
n
X
b = 1
Σ
(x i − x )(x i − x )⊤ .
n−1
(9.7)
i=1
The estimators and estimates are intuitive generalizations of the univariate forms (see Problem 8.3.a). In fact, it is possible to show that the two estimators given in (9.6) are unbiased
162
CHAPTER 9. ESTIMATION OF CORRELATION AND SIMPLE REGRESSION
estimators of µ = E(X) and Σ = Var(X).
iid
b is the maximum likelihood and the method of moRemark 9.1. If X1 , . . . , Xn ∼ Np (µ, Σ), µ
b
ment estimator. For the variance, the maximum likelihood estimator is (n − 1)/nΣ.
♣
A normally distributed random variable is determined by two parameters, namely the mean
µ and the variance σ 2 . A multivariate normally distributed random vector is determined by p
and p(p + 1)/2 parameters for µ and Σ, respectively. The remaining p(p − 1)/2 parameters in Σ
are determined by symmetry. However, the p(p + 1)/2 values cannot be arbitrarily chosen since
Σ must be positive-definite (in the univariate case the variance must be strictly positive as well).
As long as n > p, the estimator in (9.7) satisfies this condition. If p ≤ n, additional assumptions
about the structure of the matrix Σ are needed.
Example 9.4. Similar as in R-Code 8.2, we generate bivariate realizations with different sample sizes (n = 10, 50, 100, 500). We estimate the mean vector and covariance matrix according
to (9.7); from these we can calculate the corresponding isolines of the bivariate normal density
(where with plug-in estimates for µ and Σ). Figure 9.2 (based on R-Code 9.3) shows the estimated 95% and 50% confidence regions (isolines). As n increases, the estimation improves, i.e.,
the estimated ellipses are closer to the ellipses based on the true (unknown) parameters.
Note that several packages provide a function called ellipse with varying arguments. We
use the call ellipse::ellipse() to ensure using the one provided by the package ellipse. ♣
R-Code 9.3: Bivariate normally distributed random numbers for various sample sizes with
contour lines of the density and estimated moments. (See Figure 9.2.)
set.seed( 14)
require( ellipse)
# to draw ellipses
n <- c( 10, 50, 100, 500)
# different sample sizes
mu <- c(2, 1)
# theoretical mean
Sigma <- matrix( c(4, 2, 2, 2), 2)
# and covariance matrix
for (i in 1:4) {
plot(ellipse::ellipse( Sigma, centre=mu, level=.95), col='gray',
xaxs='i', yaxs='i', xlim=c(-4, 8), ylim=c(-4, 6), type='l')
lines( ellipse::ellipse( Sigma, centre=mu, level=.5), col='gray')
sample <- rmvnorm( n[i], mean=mu, sigma=Sigma)
# draw realization
points( sample, pch=20, cex=.4)
# add realization
muhat <- colMeans( sample)
# apply( sample, 2, mean) # is identical
Sigmahat <- cov( sample)
# var( sample)
# is identical
lines( ellipse::ellipse( Sigmahat, centre=muhat, level=.95), col=2, lwd=2)
lines( ellipse::ellipse( Sigmahat, centre=muhat, level=.5), col=4, lwd=2)
points( rbind( muhat), col=3, cex=2)
text( -2, 4, paste('n =',n[i]))
}
163
9.2. ESTIMATION OF A P-VARIATE MEAN AND COVARIANCE
muhat
# Estimates for n=500
## [1] 1.93252 0.97785
Sigmahat
##
[,1]
[,2]
## [1,] 4.7308 2.2453
## [2,] 2.2453 1.9619
c(cov2cor( Sigma)[2], cov2cor( Sigmahat)[2])
# correlation= sqrt(2)/2
6
6
## [1] 0.70711 0.73700
2
4
n = 50
−4
−2
0
y
−4
−2
0
y
2
4
n = 10
0
2
4
6
8
x
−2
0
4
6
8
4
6
8
n = 500
2
−4
−2
0
y
−4
−2
0
y
2
2
x
4
n = 100
4
−4
6
−2
6
−4
−4
−2
0
2
x
4
6
8
−4
−2
0
2
x
Figure 9.2: Bivariate normally distributed random numbers. The contour lines of the
true density are in gray. The isolines correspond to 95% and 50% quantiles and the
sample mean in green. (See R-Code 9.3.)
In general, interval estimation within random vectors is much more complex compared to
random variables. In fact, due to the multiple dimensions, a better term for ’interval estimation’ would be ’area or (hyper-)volume estimation’. In practice, one often marginalizes and
constructs intervals for individual elements of the parameter vector (which we will also practice
in Chapter 10, for example).
iid
b = X ∼ Np (µ, Σ/n)
In the trivial case of X1 , . . . , Xn ∼ Np (µ, Σ), with Σ known, we have µ
(can also be shown with Property 8.24) and the uncertainty in the estimate can be expressed with
164
CHAPTER 9. ESTIMATION OF CORRELATION AND SIMPLE REGRESSION
ellipses or hyper-ellipsoids. In case of unknown Σ, plug-in estimates can be used, at the price of
accuracy. This setting is essentially the multivariate version of the middle panel of Figure 4.3.
b is beyond the scope of this book.
The discussion of uncertainty in Σ
To estimate the correlation matrix we define the diagonal matrix D with elements dii =
b
(Σ)ii )−1/2 . The estimated correlation matrix is given by
b
R = DΣD
(9.8)
(see Problem 9.1.a). In R, the function cov2cor() can be used.
Example 9.5. We retake the setup of Example 9.4 for n = 50 and draw 100 realizations.
Figure 9.3 visualizes the density based on the estimated parameters (as in Figure 9.2 top left).
The variablity seems dramatic. However, if we would estimate the parameters marginally, we
would have the same estimates and the same uncertainties. The only additional element here is
b
the estimate of the covariance, i.e., the off-diagonal elements of Σ.
♣
R-Code 9.4: Visualizing uncertainty when estimating parameters of bivariate normally
distributed random numbers. (See Figure 9.3.)
set.seed( 14)
n <- 50
# fixed sample size
R <- 100
# draw R realizations
plot( 0, type='n', xaxs='i', yaxs='i', xlim=c(-4, 8), ylim=c(-4, 6))
for (i in 1:R) {
sample <- rmvnorm( n, mean=mu, sigma=Sigma)
muhat <- colMeans( sample)
# estimated mean
Sigmahat <- cov( sample)
# estimated variance
lines( ellipse::ellipse( Sigmahat, centre=muhat, level=.95), col=2)
lines( ellipse::ellipse( Sigmahat, centre=muhat, level=.5), col=4)
points( rbind( muhat), col=3, pch=20) # add mean in green
}
Assessing multivariate normality from a sample is not straightforward. A simple approach is
to reduce the multivariate dataset to univariate ones, which we assess, typically, with QQ-plots.
A more elaborate approach to assess if a p dimensional dataset indicates any suspicion against
multivariate normality is:
1. assess QQ-plots of all marginal variables;
2. assess pairs plots for elliptic scatter plots of all bivariate pairs;
3. assess QQ-plots of linear combinations, e.g. sums of two, three, . . . , all variables;
−1
b
4. assess d1 , . . . , dn with di = (x i −x )⊤ Σ
(x i −x ) with a QQ-plot against a χ2p distribution.
If any of the above fails, we have evidence against multivariate normality of the entire dataset;
subsets thereof may still be of course. Note that only for the fourth point we do need estimates
of the mean and covariance because the marginal QQ-plots are scale invariant.
165
−4
−2
0
0
2
4
6
9.3. SIMPLE LINEAR REGRESSION
−4
−2
0
2
4
6
8
Figure 9.3: Visualizing uncertainty whenIndex
estimating parameters of bivariate normally
distributed random numbers (n = 50 and 100 realizations). The red and blue isolines
correspond to the 95% and 50% quantiles of the bivariate normal density with plugin
estimates. The sample means are in green. (See R-Code 9.4.)
9.3
Simple Linear Regression
The correlation is a symmetric measure, Corr(X, Y ) = Corr(Y, X), thus there is no preference
between either variable. We now extend the idea of quantifying the linear relationship to an
asymmetric setting where we have one variable as fixed (or given) and observe the second one as
a function of the first. Of course, in practice even the given variable has to be measured. Here,
we refer to situations such as “given the height of the mother, the height of the child is”, “given
the dose, the survival rate is”, etc. In a linear regression model we determine a linear relationship
between two variables.
More formally, in simple linear regression a dependent variable is explained linearly through
a single independent variable. The statistical model writes as
Y i = µ i + εi
= β 0 + β 1 x i + εi ,
(9.9)
i = 1, . . . , n,
(9.10)
with
• Yi : dependent variable, measured values, or observations;
• xi : independent variable, predictor, assumed known or observed and not stochastic;
• β0 , β1 : parameters (unknown);
• εi : error, error term, noise (unknown), with symmetric distribution around zero.
It is often also assumed that Var(εi ) = σ 2 and that the errors are independent of each other.
iid
We make another simplification and further assume εi ∼ N (0, σ 2 ) with unknown σ 2 . Thus,
Yi ∼ N (β0 + β1 xi , σ 2 ), i = 1, . . . , n, and Yi and Yj are independent for i ̸= j. Because of the
varying mean, Y1 , . . . , Yn are not identically distributed.
166
CHAPTER 9. ESTIMATION OF CORRELATION AND SIMPLE REGRESSION
Fitting a linear model is based on pairs of data (x1 , y1 ), . . . , (xn , yn ) and the model (9.10)
with the goal to find estimates for β0 and β1 , essentially an estimation problem. In R, this task
is straightforward as shown in a motivating example below which has been dissected and will
illustrate the theoretical concepts in the remaining part of the chapter.
Example 9.6 (hardness data). One of the steps of manufacturing metal springs is a quenching
bath that cools metal back to room temperature to prevent that a slow cooling process dramatically changes the metal’s microstructure. The temperature of the bath has an influence on the
hardness of the springs. Data is taken from Abraham and Ledolter (2006). The Rockwell scale
measures the hardness of technical materials and is denoted by HR (Hardness Rockwell). Figure 9.4 shows the Rockwell hardness of coil springs as a function of the temperature (in degrees
Celcius) of the quenching bath, as well as the line of best fit. R-Code 9.5 shows how a simple
linear regression is performed with the function lm() and the ‘formula’ statement Hard˜Temp. ♣
R-Code 9.5 hardness data from Example 9.6. (See Figure 9.4.)
40
30
20
Hardness [HR]
50
60
Temp <- c(30, 30, 30, 30, 40, 40, 40, 50, 50, 50, 60, 60, 60, 60)
# Temp <- rep( 10*3:6, c(4, 3, 3, 4)) # alternative way based on `rep()`
Hard <- c(55.8, 59.1, 54.8, 54.6, 43.1, 42.2, 45.2,
31.6, 30.9, 30.8, 17.5, 20.5, 17.2, 16.9)
plot( Temp, Hard, xlab="Temperature [C]", ylab="Hardness [HR]")
lm1 <- lm( Hard~Temp)
# fitting of the linear model
abline( lm1)
# add fit of the linear model to linear model
30
35
40
45
50
55
60
Temperature [C]
Figure 9.4: hardness data: hardness as a function of temperature. The black line is
the fitted regression line. (See R-Code 9.5.)
Classically, we estimate βb0 and βb1 with a least squares approach (see Section 4.2.1), which
167
9.3. SIMPLE LINEAR REGRESSION
minimizes the sum of squared residuals. That means that βb0 and βb1 are determined such that
n
X
i=1
(yi − βb0 − βb1 xi )2
(9.11)
is minimized. This concept is also called ordinary least squares (OLS) method to emphasize that
we have iid errors and no weighting is taken into account. The solution of the minimization is
given by
Xn
r
(xi − x)(yi − y)
sxy
syy
i=1
βb1 = r
=
=
,
(9.12)
Xn
sxx
sxx
(xi − x)2
i=1
βb0 = y − βb1 x .
(9.13)
and are termed the estimated regression coefficients. The first equality in (9.12) emphasizes
the difference to a correlation estimate. Here, we correct Pearson’s correlation estimate with
p
syy /sxx . It can be shown that the associated estimators are unbiased (see Problem 9.1.b).
The predicted values are
ybi = βb0 + βb1 xi ,
(9.14)
which lie on the estimated regression line
y = βb0 + βb1 x .
(9.15)
The residuals (observed minus predicted values) are
ri = yi − ybi = yi − βb0 − βb1 xi .
(9.16)
Finally, an estimate of the variance σ 2 of the errors εi is given by
n
n
i=1
i=1
1 X
1 X 2
σ
b =
(yi − ybi )2 =
ri ,
n−2
n−2
2
(9.17)
where we have used a slightly different denominator than in Example 4.6.2 to obtain an unbiased
estimate. More precisely, instead of estimating one mean parameter we have to estimate two
regression coefficients and thus instead of n − 1 we use n − 2. For simplicity, we do not introduce
another symbol for the estimate here. The variance estimate σ
b2 is often termed mean squared
error and its root, σ
b, residual standard error .
Example 9.7 (continuation of Example 9.6). R-Code 9.6 illustrates how to access the estimates,
the fitted values and the residuals from the output of the linear model fit of Example 9.6.
More specifically, the function lm() generates an object of class "lm" and the functions coef(),
fitted() and residuals() extract the corresponding elements from the object.
To access the residual standard error we need to further “digest” the object via summary().
More specifically, an estimate of σ 2 can be obtained via residuals(lm1) and Equation (9.17)
(e.g., sum(residuals(lm1)ˆ2)/(14-2)) or directly via summary(lm1)$sigmaˆ2.
There are more methods defined for an object of class "lm", for example abline() as used
in R-Code 9.5; more will be discussed subsequently.
♣
168
CHAPTER 9. ESTIMATION OF CORRELATION AND SIMPLE REGRESSION
R-Code 9.6 hardness data from Example 9.7.
coef( lm1)
# extract coefficients, here hat{beta_0}, here hat{beta_1}
## (Intercept)
##
94.1341
Temp
-1.2662
rbind( observation=Hard, fitted=fitted( lm1), residuals=residuals( lm1))[,1:6]
##
1
2
3
4
5
6
## observation 55.80000 59.1000 54.8000 54.6000 43.10000 42.2000
## fitted
56.14945 56.1495 56.1495 56.1495 43.48791 43.4879
## residuals
-0.34945 2.9505 -1.3495 -1.5495 -0.38791 -1.2879
head( Hard - (fitted( lm1) + residuals( lm1)))
## 1 2 3 4 5 6
## 0 0 0 0 0 0
summary(lm1)$sigma
# equivalent to:
sqrt( sum(residuals( lm1)^2)/(14-2))
## [1] 1.495
9.4
Inference in Simple Linear Regression
In this section we discuss the quality of the fitted regression line. In our setting, ‘quality’ is interpreted in the sense of whether the model makes (statistical) sense by verifying if the estimated
slope is significantly different to zero and if the model explains variablity in the data. We will
revisit these questions in later chapters again where we further discuss formal justifications. In
this section we give the conceptual ideas with some justifications in Problem 9.1.
When fitting a linear model with R, much of the discussion is given by the summary output
of linear model fit. R-Code 9.7 gives the summary output which is very classical and further
details are elaborated in Figure 9.5.
Example 9.8 (continuation of Examples 9.6 and 9.7). R-Code 9.7 outputs the summary of a
linear fit by summary(lm1).
♣
The output is essentially build of four blocks. The first restates the call of the model fit, the
second states information about the residuals, the third summarizes the regression coefficients
and the fourth digests the overall quality of the regression.
The summary of the residuals serves to check if the residuals are symmetric (median should
be close to zero, up to the sign similar quartiles) or if there are outliers (small minimum or
maximum value). In the next chapter, we will revisit graphical assessments of the residuals.
The third block of the summary output contains for each estimated coefficient the following
four numbers: the estimate itself, its standard error, the ratio of the latter two and the associated
p-value of the Test 15. More specifically, for the slope parameters of a simple linear regression
these are βb1 , SE(βb1 ) βb1 / SE(βb1 ) and a p-value.
To obtain the aforementioned p-value consider βb1 as an estimator. When properly scaled
it can be shown that the estimator of βb1 / SE(βb1 ) has a t-distribution and thus we can apply a
169
9.4. INFERENCE IN SIMPLE LINEAR REGRESSION
R-Code 9.7 hardness data from Example 9.6.
summary( lm1)
# summary of the fit
##
## Call:
## lm(formula = Hard ~ Temp)
##
## Residuals:
##
Min
1Q Median
3Q
Max
## -1.550 -1.190 -0.369 0.599 2.950
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.1341
1.5750
59.8 3.2e-16 ***
## Temp
-1.2662
0.0339
-37.4 8.6e-14 ***
## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.5 on 12 degrees of freedom
## Multiple R-squared: 0.991,Adjusted R-squared: 0.991
## F-statistic: 1.4e+03 on 1 and 12 DF, p-value: 8.58e-14
statistical test for H0 : β1 = 0, detailed in Test 15. That means for the slope parameter, the
last number is P V ≥ |βb1 |/ SE(βb1 ) with V a t-distributed random variable with n − 2 degrees
of freedom. This latter test is conceptually very similar to Test 1 and essentially identical to
Test 14, see Problem 9.1.1.
The final block summarizes the overall fit of the model. In the case of a “good” model fit, the
1
States the R command used to create the object
2
3
Summary of the residuals:
quartiles as well as smallest and largest residual
Summary of the estimated coefficients:
estimates, standard error thereof,
value of the corresponding t−test and p−value
4
Summarizes the overall model
Figure 9.5: R summary output of an object of class lm (here for the hardness data
from Example 9.6).
170
CHAPTER 9. ESTIMATION OF CORRELATION AND SIMPLE REGRESSION
Test 15: Test of a linear relationship in regression
Question: Is there a linear relationship between the dependent and independent
variables?
Assumptions: Based on the sample x1 , . . . , xn , the second sample is a realization
of a normally distributed random variable with expected value β0 + β1 xi and
variance σ 2 .
Calculation: tobs =
βb1
SE(βb1 )
Decision: Reject H0 : β1 = 0 if tobs > tcrit = tn−2,1−α/2 .
Calculation in R: summary( lm( y ∼ x ))
resulting absolute residuals are small. More precisely, the residual sums of squares are small, at
least compared to the total variability in the data syy . In later chapters, we will formalize that
a linear regression decomposes the total variablity in the data in two components, namely the
variablity explained by the model and the variablity of the residual. We often write this as
SST = SSM + SSE ,
(9.18)
where the subscript indicates ‘total’, ’model’ and ’error’. This statement is of general nature
and holds not only for linear regression models but for virtually all statistical models. Here,
P
the somewhat surprising fact is that we have the simple expressions SST = i (yi − y)2 , SSM =
P
P
yi − y)2 and SSE = i (yi − ybi )2 (see also Problem 9.1.d). The specific values of the final
i (b
block for the simple regression are calculated as follows
P
(yi − ybi )2
SSE
2
2
Multiple R : R = 1 −
= 1 − Pi
,
(9.19)
2
SST
i (yi − y i )
1 − R2
2
Adjusted R2 : Radj
= R2 −
,
(9.20)
n−2
SSM
,
(9.21)
Observed value of F -test:
SSE /(n − 2)
The multiple R2 is sometimes called the coefficient of determination or simply R-squared . The
adjusted R2 is a weighted version of R2 and we revisit it later for more details. For a simple
linear regression the F -test is an alternative form to test H0 : β1 = 0 (see Problem 9.1.e) and a
deeper meaning will be revealed with more complex models. The denominator of (9.21) is σ
b2 .
Note that for the simple regression the degrees of freedom of the F -test are 1 and n − 2.
Hence, if the regression model explains a lot of variability in the data, SSM is large compared
to SSE , that means SSE is small compared to SST , which implies R2 is close to one and the
observed value of the F -test (F -statistic) is large.
In simple linear regression, the central task is often to determine whether there exists a linear
relationship between the dependent and independent variables. This can be tested with the
9.4. INFERENCE IN SIMPLE LINEAR REGRESSION
171
hypothesis H0 : β1 = 0 (Test 15). We do not formally derive the test statistic here. The idea
is to replace in Equation (9.12) the observations yi with random variables Yi with distribution
specified by (9.10) and derive the distribution of a test statistic.
For prediction of the dependent variable at x0 , a specific, given value of the independent
variable, we plug-in x0 in Equation (9.15), i.e., we determine the corresponding value of the
regression line. The function predict() can be used for prediction in R.
Prediction at a (potentially) new value x0 can also be written as
βb0 + βb1 x0 = y − βb1 x + βb1 x0 = y + sxy (sxx )−1 (x0 − x) .
(9.22)
Thus, the last expression is equivalent to Equation (8.27) but with estimates instead of (unknown)
parameters. Or in other words, simple linear regression is equivalent to conditional expectation
of a bivariate normal distribution with plug-in estimates.
The uncertainty of the prediction depends on the uncertainty of the estimated parameter.
Specifically:
Var(b
µ0 ) = Var(βb0 + βb1 x0 ),
(9.23)
is an expression in terms of the independent variables, the dependent variables and (implicitly)
the variance of the error term. Because βb0 and βb1 are not necessarily independent, it is quite
tedious to derive the explicit expression. However, with matrix notation, the above variance is
straightforward to calculate, as will be illustrated in Chapter 10.
To construct confidence intervals for a prediction, we must discern whether the prediction is
for the mean response µ
b0 or for an unobserved (e.g., future) observation yb0 at x0 . The former
prediction interval depends on the variability of the estimates of βb0 and βb1 . For the latter the
prediction interval depends on the uncertainty of µ
b0 and additionally on the variability of the error
2
ε, i.e., σ
b . Hence, the latter is always wider than the former. In R, these two types are denoted
— somewhat intuitively — with interval="confidence" and interval="prediction", (see
Example 9.9 and R-Code 9.8). The confidence interval summary CI 7 gives the precise formulas,
we refrain from the derivation here.
Example 9.9 (continuation of Examples 9.6 to 9.8). We close this example by constructing
pointwise confidence intervals for the mean response and for an unobserved observation. Figure 9.6 (based on R-Code 9.8) illustrates the confidence intervals for a very fine grid of temperatures. For each temperature, we calculate the intervals (hence “pointwise”) but as classically
♣
done they are visualized by lines. The width of both increases as we move further from x.
Linear regression framework is based on several statistical assumptions, including on the
errors εi . After fitting the model, we must check whether these assumptions are realistic or if the
residuals indicate evidence against these (see Figure 1.1). For this task, we return in Chapter 10
to the analysis of the residuals and their properties.
172
CHAPTER 9. ESTIMATION OF CORRELATION AND SIMPLE REGRESSION
R-Code 9.8 hardness data: predictions and pointwise confidence intervals. (See Figure 9.6.)
50
40
30
10
20
Hardness [HR]
60
new <- data.frame( Temp = seq(25, 65, by=.5))
pred.w.clim <- predict( lm1, new, interval="confidence") # for hat(mu)
pred.w.plim <- predict( lm1, new, interval="prediction") # for hat(y), wider!
plot( Temp, Hard, xlab="Temperature [C]", ylab="Hardness [HR]",
xlim=c(28,62), ylim=c(10,65))
# Plotting observations
matlines( new$Temp, cbind(pred.w.clim, pred.w.plim[,-1]),
col=c(1,2,2,3,3), lty=c(1,1,1,2,2)) # Prediction intervals are wider!
30
35
40
45
50
55
60
Temperature [C]
Figure 9.6: hardness data: hardness as a function of temperature. Black: fitted
regression line, red: confidence intervals for the mean µ
bi (pointwise), green: prediction
intervals for (future) predictions ybi (pointwise). (See R-Code 9.8.)
CI 7: Confidence intervals for mean response and for prediction
A (1 − α) confidence interval for the mean response µ
b0 = βb0 + βb1 x0 is
s
1
x0 − x
.
βb0 + βb1 x0 ± tn−2,1−α/2 σ
b
+P
2
n
i (xi − x)
A (1 − α) prediction interval for an unobserved observation of yb0 = βb0 + βb1 x0 is
s
1
x0 − x
b
b
β0 + β1 x0 ± tn−2,1−α/2 σ
b 1+ + P
.
2
n
i (xi − x)
In both intervals we use estimators, i.e., all the estimates yi are to be replaced with
Yi to obtain estimators.
173
9.5. BIBLIOGRAPHIC REMARKS
9.5
Bibliographic Remarks
We recommend the book from Fahrmeir et al. (2009) (German) or Fahrmeir et al. (2013) (English), which is both detailed and accessible. Many other books contain a single chapter on
simple linear regression.
There exists a fancier version of anscombe data, called the ‘DataSaurus Dozen’, found
on https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html. Even the transitional frames in the animation https://blog.revolutionanalytics.com/downloads/DataSaurus%
20Dozen.gif maintain the same summary statistics to two decimal places.
9.6
Exercises and Problems
Problem 9.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Show that the off-diagonal entries of the matrix R of (9.8) are equivalent to (9.8) and that
the diagonal elements are one.
b) Show that the ordinary least squares regression coefficient estimators are unbiased.
c) Show that the rightmost expression of (9.12) can be written as
deduce Var(βb1 ) = σ 2 /sxx
Pn
i=1 (xi − x)yi
sxx and
d) Show the decomposition (9.18) by showing that SSM = r2 syy and SSE = (1 − r2 )syy .
e) Show that the value of the F -test in the lm summary is equivalent to t2obs of Test 15 and
to t2obs of Test 14.
Problem 9.2 (Correlation) For the swiss dataset, calculate the correlation between the variables Catholic and Fertility as well as Catholic and Education. What do you conclude?
Hint: for the interpretation, you might use the parallel coordinate plot, as shown in Figure 1.10
in Chapter 1.
iid
Problem 9.3 (Bivariate normal distribution) Consider the random sample X1 , . . . , Xn ∼ N2 (µ, Σ)
with
!
!
1
1 1
µ=
, Σ=
.
2
1 2
We define estimators for µ and Σ as
n
1X
b=
Xi
µ
n
i=1
n
and
b =
Σ
1 X
b )(Xi − µ
b )⊤ ,
(Xi − µ
n−1
i=1
a) Explain in words that these estimators for µ and Σ “generalize” the univariate estimators
for µ and σ.
174
CHAPTER 9. ESTIMATION OF CORRELATION AND SIMPLE REGRESSION
b) Simulate n = 500 iid realizations from N2 (µ, Σ) using the function rmvnorm() from package
mvtnorm. Draw a scatter plot of the results and interpret the figure.
c) Add contour lines of the density of X to the plot. Calculate an eigendecomposition of Σ
and place the two eigenvectors in the center of the ellipses.
d) Estimate µ, Σ and the correlation between X1 and X2 from the 500 simulated values using
mean(), cov() and cor(), respectively.
e) Redo the simulation with several different covariance matrices, i.e., choose different values
as entries for the covariance matrices. What is the influence of the diagonal elements and
the off-diagonal elements of the covariance matrix on the shape of the scatter plot?
Problem 9.4 (Variance under simple random sampling) In this problem we proof the hint of
Problem 7.1.d. Let Y1 , . . . , Ym be iid with E(Yi ) = µ and Var(Yi ) = σ 2 . We take a random
sample Y1′ , . . . , Yn′ of size n without replacement and denote the associated mean with Y ′ .
a) Show that E Y ′ = µ.
b) Show that Cov(Yi′ , Yj′ ) =
−σ 2
, if i ̸= j and Cov(Yi′ , Yj′ ) = σ 2 , if i = j.
m−1
σ2 n−1
c) Show that Var Y ′ =
1−
. Discuss this result.
n
m−1
Problem 9.5 (Linear regression) The dataset twins from the package faraway contains the
IQ of mono-zygotic twins where one of the siblings has been raised by the biological parents and
the second one by foster parents. The dataset has been collected by Cyril Burt. Explain the IQ
of the foster sibling as a function of the IQ from his sibling. Give an interpretation. Discuss the
shortcomings and issues of the model.
Problem 9.6 (Linear regression) In a simple linear regression, the data are assumed to follow
iid
Yi = β0 + β1 xi + εi with εi ∼ N (0, σ 2 ), i = 1, . . . , n. We simulate n = 15 data points from that
model with β0 = 1, β1 = 2, σ = 2 and the follwoing values for xi .
Hint: to start, copy–paste the following lines into your R-Script.
set.seed(5)
## for reproducable simulations
beta0.true <- 1
## true parameters, intercept
beta1.true <- 2
##
and slope
## observed x values:
x <- c(2.9, 6.7, 8.0, 3.1, 2.0, 4.1, 2.2, 8.9, 8.1, 7.9, 5.7, 1.6, 6.6, 3.0, 6.3)
## simulation of y values:
y <- beta0.true + x * beta1.true + rnorm(15, sd = 2)
data <- data.frame(x = x, y = y)
9.6. EXERCISES AND PROBLEMS
175
a) Plot the simulated data in a scatter plot. Calculate the Pearson correlation coefficient and
the Spearman’s rank correlation coefficient. Why do they agree well?
b) Estimate the linear regression coefficients βb0 and βb1 using the formulas from the script.
Add the estimated regression line to the plot from (a).
c) Calculate the fitted values Ybi for the data in x and add them to the plot from (a).
d) Calculate the residuals (yi − ybi ) for all n points and the residual sum of squares SS =
P
bi )2 . Visualize the residuals by adding lines to the plot with segments(). Are
i (yi − y
the residuals normally distributed? Do the residuals increase or decrease with the fitted
values?
p
e) Calculate standard errors for β0 and β1 . For σ
bε = SS/(n − 2), they are given by
s
s
1
x2
1
σ
bβ0 = σ
bε
,
σ
bβ1 = σ
bε P
.
+P
2
2
n
i (xi − x)
i (xi − x)
f) Give an empirical 95% confidence interval for β0 and β1 . (The degree of freedom is the
number of observations minus the number of parameters in the model.)
g) Calculate the values of the t statistic for βb0 and βb1 and the corresponding two-sided pvalues.
h) Verify your result with the R function lm() and the corresponding S3 methods summary(),
fitted(), residuals() and plot() (i.e., apply these functions to the returned object of
lm()).
i) Use predict() to add a “confidence” and a “prediction” interval to the plot from (a). What
is the difference?
Hint: The meanings of "confidence" and "predict" here are based on the R function. Use
the help of those functions to understand their behaviour.
j) Fit a linear model without intercept (i.e., force β0 to be zero). Add the corresponding
regression line to the plot from (a). Discuss if the model fits “better” the data.
k) How do outliers influence the model fit? Add outliers:
data_outlier <- rbind(data, data.frame(x = c(10, 11, 12), y = c(9, 7, 8)))
Fit a linear model and discuss it (including diagnostic plots).
l) What is the difference between a model with formula y ∼ x and x ∼ y? Explain it from
a stochastic and fitting perspective.
Problem 9.7 (BMJ Endgame) Discuss and justify the statements about ‘Simple linear regression’ given in doi.org/10.1136/bmj.f2340.
176
CHAPTER 9. ESTIMATION OF CORRELATION AND SIMPLE REGRESSION
Chapter 10
Multiple Regression
Learning goals for this chapter:
⋄ Statistical model of multiple regression
⋄ Multiple regression in R, including:
– Multicollinearity
– Influential points
– Interactions between variables
– Categorical variables (factors)
– Model validation and information criterion (basic theory and R)
⋄ Be aware of nonlinear regression - examples
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter10.R.
In many situations a dependent variable is associated with more than one independent variable. The simple linear regression model can be extended by the addition of further independent
variables. We first introduce the model and estimators. Subsequently, we become acquainted
with the most important steps in model validation. Two typical examples of multiple regression
are given. At the end, several typical examples of extensions of linear regression are illustrated.
10.1
Model and Estimators
A natural extension of the simple linear regression model to p independent variables is as follows
Yi = µi + εi ,
(10.1)
= β0 + β1 xi1 + · · · + βp xip + εi ,
= x⊤
i β + εi ,
i = 1, . . . , n,
with
177
(10.2)
n > p,
(10.3)
178
CHAPTER 10. MULTIPLE REGRESSION
• Yi : dependent variable, measured value, observation, response;
• x i = (1, xi1 , . . . , xip )⊤ : (known) independent/explanatory variables, predictors;
• β = (β0 , . . . , βp )⊤ : (unknown) parameter vector, regression coefficients;
• εi : (unknown) error term, noise, with symmetric distribution around zero, E(εi ) = 0.
It is often also assumed that Var(εi ) = σ 2 and/or that the errors are independent of each other.
In matrix notation, Equation (10.3) is written as
Y = Xβ + ε
(10.4)
with X an n × (p + 1) matrix with rows x ⊤
i . The model is linear in β and often referred to as
multiple linear regression model.
To derive estimators and obtain simple, closed form distributions for these, we assume that
iid
εi ∼ N (0, σ 2 ) with unknown σ 2 , also when simply referring to the model (10.3). With a Gaussian
error term we have (in matrix notation)
ε ∼ Nn (0, σ 2 I),
(10.5)
Y ∼ Nn (Xβ, σ 2 I).
(10.6)
The mean of the response varies, implying that Y1 , . . . , Yn are only independent and not iid.
There are some constraints on the independent variables. We assume that the rank of X
equals p + 1 (rank(X) = p + 1, column rank). This assumption guarantees that the inverse of
X⊤ X exists. In practical terms, this implies that we do not include twice the same predictor, or
that a predictor has additional information on top of the already included predictors.
The parameter vector β is estimated with the method of ordinary least squares (see Secb is such that the sum of the squared errors (residuals)
tion 4.2.1). This means, that the estimate β
is minimal and is thus derived as follows:
n
X
2
⊤
b = argmin
β
(yi − x ⊤
(10.7)
i β) = argmin(y − Xβ) (y − Xβ)
β
=⇒
β
i=1
d
(y − Xβ)⊤ (y − Xβ)
dβ
d
=
(y ⊤ y − 2β ⊤ X⊤ y + β ⊤ X⊤ Xβ) = −2X⊤ y + 2X⊤ Xβ
dβ
=⇒
X⊤ Xβ = X⊤ y
=⇒
⊤
−1
b = (X X)
β
(10.8)
(10.9)
(10.10)
⊤
X y
(10.11)
Equation (10.10) is also called the normal equation and Equation (10.11) indicates why we need
to assume full column rank of the matrix X.
We now derive the distributions of the estimator and other related and important vectors.
The derivation of the results are based directly on Property 8.6. Starting from the distributional
assumption of the errors (10.5), jointly with Equations (10.6) and (10.11), it can be shown that
b = (X⊤ X)−1 X⊤ y ,
b ∼ Np+1 β, σ 2 (X⊤ X)−1 ,
β
β
(10.12)
b = X(X⊤ X)−1 X⊤ y = Hy ,
y
b = (I − H)y
r =y −y
b ∼ Nn (Xβ, σ 2 H),
Y
(10.13)
R ∼ Nn 0, σ 2 (I − H) ,
(10.14)
179
10.1. MODEL AND ESTIMATORS
where we term the matrix H = X(X⊤ X)−1 X⊤ as the hat matrix . In the left column we find the
estimate, predicted value and the residuals, in the right column the according functions of the
random sample and its distribution. Notice the subtle difference in the covariance matrix of the
b the hat matrix H is not I, hopefully quite close to it. The latter would
distributions Y and Y:
imply that the variances of R are close to zero.
The distribution of the regression coefficients will be used for inference and when interpreting a fitted regression model (similarly as in the case of the simple regression). The marginal
distributions of the individual coefficients βbi are determined by the distribution (10.12):
βbi ∼ N (βi , σ 2 vii ) with vii = (X⊤ X)−1 ii ,
i = 0, . . . , p,
(10.15)
√
(again direct consequence of Property 8.6). Hence (βbi − βi ) σ 2 vii ∼ N (0, 1). As σ 2 is unknown
and we use a plug-in estimate, typically the unbiased estimate
n
σ
b2 =
X
1
1
(yi − ybi )2 =
r ⊤r ,
n−p−1
n−p−1
(10.16)
i=1
which is again termed mean squared error . Its square root is termed residual standard error
with n − p − 1 degrees of freedom. Finally, we use the same approach when deriving the t-test
in Equation (5.4) and obtain
βbi − βi
√
∼ tn−p−1
σ
b2 vii
(10.17)
as our statistic for testing H0 : βi = β0,i . For testing, we are often interested in H0 : βi = 0 for
which the test statistic reduces to the one shown in Test 15. Confidence intervals are constructed
along equation (4.33) and summarized subsequently in CI 8.
CI 8: Confidence interval for regression coefficients
For the model (10.3), a sample (1 − α) confidence interval for βi
r
h
i
1
βbi ± tn−p−1,1−α/2
r ⊤ r vii
n−p−1
b and vii = (X⊤ X)−1 ii .
with r = y − y
(10.18)
In R, the multiple regression is fitted with lm() again by adding the additional predictor to
the right-hand-side of the formula statement. The summary output is very similar as for the
simple regression, we simply add further lines to the coefficient block. The degrees of freedom
of the numerator of the F -statistic changes from 1 to p − 1. The following example illustrates a
regression with p = 2 predictors.
Example 10.1 (abrasion data). The data comes from an experiment investigating how rubber’s
resistance to abrasion is affected by the hardness of the rubber and its tensile strength (Cleveland,
1993). Each of the 30 rubber samples was tested for hardness and tensile strength, and then
subjected to steady abrasion for a fixed time.
180
CHAPTER 10. MULTIPLE REGRESSION
R-Code 10.1 performs the regression analysis based on two predictors. The sample confidence
intervals confint( res) do not contain zero. Accordingly, the p-values of the three t-tests are
small.
We naively assumed a linear relationship between the predictors hardness and strength and
the abrasion. The scatterplots in Figure 10.1 give some indication that hardness itself has a
linear relationship with abrasion. The scatterplot only depict the marginal relations, i.e., the
red curves would be linked to lm(loss hardness, data=abrasion) and lm(loss strength,
data=abrasion) (the latter indeed shows a poor linear relationship). Similarly, a quadratic term
for strength is not necessary (see, the summary of lm(loss hardness+strength+I(strengthˆ2),
data=abrasion)).
♣
250
350
50
60
70
80
90
120
160
200
240
250
150
70
80
9050
250
loss
50
150
350
150
350
50
200
240 50
60
hardness
120
160
strength
120
160
200
240
Figure 10.1: Pairs plot of abrasion data with red “guide-the-eye” curves. (See RCode 10.1.)
R-Code 10.1: abrasion data: fitting a linear model. (See Figure 10.1.)
abrasion <- read.csv('data/abrasion.csv')
str(abrasion)
## 'data.frame': 30 obs. of 3 variables:
## $ loss
: int 372 206 175 154 136 112 55 45 221 166 ...
## $ hardness: int 45 55 61 66 71 71 81 86 53 60 ...
## $ strength: int 162 233 232 231 231 237 224 219 203 189 ...
pairs(abrasion, upper.panel=panel.smooth, lower.panel=NULL, gap=0)
abres <- lm(loss~hardness+strength, data=abrasion)
summary( abres)
10.2. MODEL VALIDATION
181
##
## Call:
## lm(formula = loss ~ hardness + strength, data = abrasion)
##
## Residuals:
##
Min
1Q Median
3Q
Max
## -79.38 -14.61
3.82 19.75 65.98
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept) 885.161
61.752
14.33 3.8e-14 ***
## hardness
-6.571
0.583 -11.27 1.0e-11 ***
## strength
-1.374
0.194
-7.07 1.3e-07 ***
## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared: 0.84,Adjusted R-squared: 0.828
## F-statistic:
71 on 2 and 27 DF, p-value: 1.77e-11
confint( abres)
##
2.5 %
97.5 %
## (Intercept) 758.4573 1011.86490
## hardness
-7.7674
-5.37423
## strength
-1.7730
-0.97562
10.2
Model Validation
Recall the data analysis workflow shown in Figure 1.1. Suppose we have a model fit of a linear
model (obtained with lm(), as illustrated in the last chapter or in the previous section). Model
validation essentially verifies if (10.3) is an adequate model for the data to probe the given
Hypothesis to investigate. The question is not whether a model is correct, but rather if the
model is useful (“Essentially, all models are wrong, but some are useful”, Box and Draper, 1987
page 424).
Model validation verifies (i) the fixed component (or fixed part) µi and (ii) the stochastic
component (or stochastic part) εi and is typically an iterative process (arrow back to Propose
statistical model in Figure 1.1).
182
10.2.1
CHAPTER 10. MULTIPLE REGRESSION
Basics and Illustrations
Validation is based on (a) graphical summaries, typically (standardized) residuals versus fitted
values, individual predictors, summary values of predictors or simply Q-Q plots and (b) summary
statistics. The latter are also part of a summary() call of a regression object, see, e.g., RCode 9.6. The residuals are summarized by the range and the quartiles. Below the summary of
the coefficients, the following statistics are given
P
(yi − ybi )2
SSE
Multiple R : R = 1 −
= 1 − Pi
,
2
SST
i (yi − y i )
p
2
Adjusted R2 : Radj
= R2 − (1 − R2 )
,
n−p−1
(SST − SSE )/p
,
Observed value of F -Test:
SSE /(n − p − 1)
2
2
(10.19)
(10.20)
(10.21)
were SS stands for sums of squares and SST , SSE for total sums of squares and sums of squares
of the error, respectively. The statistics are similar to the ones from a simple regression up to
taking into account that we have now p + 1 parameters to estimate.
The last statistic explains how much variability in the data is explained by the model and
is essentially equivalent to Test 4 and performs the omnibus test H0 : β1 = β2 = · · · = βp = 0.
When we reject this test, this merely signifies that at least one of the coefficients is significant
and thus often not very useful.
A slightly more general version of the F -Test (10.21) is used to compare nested models. Let
M0 be the simpler model with only q out of the p predictors of the more complex model M1
(0 ≤ q < p). The test H0 : “M0 is sufficient” is based on the statistic
(SSsimple model − SScomplex model )/(p − q)
SScomplex model /(n − p − 1)
(10.22)
and often runs under the name ANOVA (analysis of variance). We see an alternative derivation
thereof in Chapter 11.
In order to validate the fixed components of a model, it must be verified whether the necessary
predictors are in the model. We do not want too many, nor too few. Unnecessary predictors
are often identified through insignificant coefficients. When predictors are missing, the residuals
show (in the ideal case) structure, indicative for model improvement. In other cases, the quality
of the regression is low (F -Test, R2 (too) small). Example 10.2 below will illustrate the most
important elements.
Example 10.2. We construct synthetic data in order to better illustrate the difficulty of detecting a suitable model. Table 10.1 gives the actual models and the five fitted models. In all cases
we use a small dataset of size n = 50 and predictors x1 and x2 = x21 ) that we construct from
iid
a uniform distribution. Further, we set εi ∼ N (0, 0.252 ). R-Code 10.2 and the corresponding
Figure 10.2 illustrate how model deficiencies manifest.
183
10.2. MODEL VALIDATION
We illustrate how residual plots may or may not show missing or unnecessary predictors.
Because of the ‘textbook’ example, the adjusted R2 values are very high and the p-value of the
F -Test is – as often in practice – of little value.
Since the output of summary() is quite long, here we show only elements from it. This is
achieved with the functions print() and cat(). For Examples 2 to 5 the output has been
constructed by a function call to subset_of_summary() constructing the output as the first
example.
The plots should supplement a classical graphical analysis through lm( res).
♣
Table 10.1: Fitted models for five examples of 10.2. The true model is always Yi =
β0 + β1 x1 + β2 x2 + εi .
Example
fitted model
1
2
3
4
Yi = β0 + β1 x1 + β2 x2 + εi
Yi = β0 + β1 x1 + εi
Yi = β0 + β2 x2 + εi
Yi = β0 + β1 x1 + β2 x2 + β3 x3 + εi
correct model
missing predictor x21
missing predictor x1
unnecessary predictor x3
R-Code 10.2: Illustration of missing and unnecessary predictors for an artificial dataset.
(See Figure 10.2.)
set.seed( 18)
n <- 50
x1 <- runif( n);
x2 <- x1^2;
x3 <- runif( n)
eps <- rnorm( n, sd=0.16)
y <- -1 + 3*x1 + 2.5*x1^2 + 1.5*x2 + eps
# Example 1: Correct model
sres <- summary( res <- lm( y ~ x1 + I(x1^2) + x2 ))
print( sres$coef, digits=2)
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
-1.0
0.045
-23 7.9e-27
## x1
3.0
0.238
13 9.9e-17
## I(x1^2)
3.9
0.242
16 6.9e-21
cat("Adjusted R-squared: ", formatC( sres$adj.r.squared),
" F-Test: ", pf(sres$fstatistic[1], 1, n-2, lower.tail = FALSE))
## Adjusted R-squared:
0.9963
F-Test:
4.6376e-53
plotit( res$fitted, res$resid, "fitted values") # Essentially equivalent to:
# plot( res, which=1, caption=NA, sub.caption=NA, id.n=0) with ylim=c(-1,1)
# i.e, plot(), followed by a panel.smooth()
plotit( x1, res$resid, bquote(x[1]))
184
CHAPTER 10. MULTIPLE REGRESSION
plotit( x2, res$resid, bquote(x[2]))
# Example 2: Missing predictor x1^2
sres <- subset_of_summary( lm( y ~ x1 ))
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
-1.5
0.081
-19 5.4e-24
## x1
6.7
0.151
45 8.6e-41
## Adjusted R-squared: 0.9761
F-Test: 8.5684e-41
# Example 3: Missing predictor x1
sres <- subset_of_summary( lm( y ~ x2 ))
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
-0.54
0.052
-10 6.2e-14
## x2
6.88
0.125
55 5.1e-45
## Adjusted R-squared: 0.9841
F-Test: 5.0725e-45
# Example 4: Too many predictors x3
sres <- subset_of_summary( lm( y ~ x1 + x2 + x3))
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
-0.976
0.060
-16.2 1.3e-20
## x1
2.976
0.239
12.5 2.3e-16
## x2
3.938
0.242
16.3 1.0e-20
## x3
-0.069
0.065
-1.1 2.9e-01
## Adjusted R-squared: 0.9963
F-Test: 6.6892e-49
# The results are in sync with:
# tmp <- cbind( y, x1, "x1^2"=x1^2, x2, x3)
# pairs( tmp, upper.panel=panel.smooth, lower.panel=NULL, gap=0)
# cor( tmp)
It is important to understand that the stochastic part εi does not only represent measurement
error. In general, the error is the remaining “variability” (also noise) that is not explained through
the predictors (“signal”).
With respect to the stochastic part εi , the following points should be verified:
1. constant variance: if the (absolute) residuals are displayed as a function of the predictors,
the estimated values, or the index, no structure should be discernible. The observations
can often be transformed in order to achieve constant variance. Constant variance is also
called homoscedasticity and the terms heteroscedasticity or variance heterogeneity are used
otherwise.
indep
More precisely, for heteroscedacity we relax Model (10.3) to εi ∼ N (0, σi2 ), i.e., ε ∼
Nn (0, σ 2 V). In case the diagonal matrix V is known, we can use so-called weighted least
squares (WLS) by considering the argument weights in R (weights=1/diag(V)).
2. independence: correlation between the residuals should be negligible.
10.2. MODEL VALIDATION
185
If data are taken over time, observations might be serially correlated or dependent. That
means, Corr(εi−1 , εi ) ̸= 0 and Var(ε) = Σ which is not a diagonal matrix. This is easy to
test and to visualize through the residuals, illustrated in Example 10.3.
If Var(ε) = Σ = σ 2 R where R is a known correlation matrix a so-called generalized least
squares (GLS) approach can be used. Correlated data are extensively discussed in time
series analysis and in spatial analysis. We refer to follow-up lectures for a more detailed
exhibition.
3. symmetric distribution: it is not easy to find evidence against this assumption. If the
distribution is strongly right- or left-skewed, the scatter plots of the residuals will have
structure. Transformations or generalized linear models may help. We have a quick look
at a generalized linear model in Section 10.4.
Example 10.3 (continuation of Example 10.1). R-Code 10.3 constructs a few diagnostic plots
shown in Figure 10.3.
A quadratic term for strength is not necessary. However, the residuals appear to be slightly
correlated (see bottom right panel, cor( abres$resid[-1],abres$resid[-30]) is 0.53). We
do not have further information about the data here and cannot investigate this aspect further.
♣
R-Code 10.3: abrasion data: model validation. (See Figure 10.3.)
# Fitted values
plot( loss~hardness, ylim=c(0,400), yaxs='i', data=abrasion)
points( abres$fitted~hardness, col=4, data=abrasion)
plot( loss~strength, ylim=c(0,400), yaxs='i', data=abrasion)
points( abres$fitted~strength, col=4, data=abrasion)
# Residuals vs ...
plot( abres$resid~abres$fitted)
lines( lowess( abres$fitted, abres$resid), col=2)
abline( h=0, col='gray')
plot( abres$resid~hardness, data=abrasion)
lines( lowess( abrasion$hardness, abres$resid), col=2)
abline( h=0, col='gray')
plot( abres$resid~strength, data=abrasion)
lines( lowess( abrasion$strength, abres$resid), col=2)
abline( h=0, col='gray')
plot( abres$resid[-1]~abres$resid[-30])
abline( h=0, col='gray')
186
400
0
200
loss
200
0
loss
400
CHAPTER 10. MULTIPLE REGRESSION
50
60
70
80
90
120
140
160
50
50
60
180
strength
200
80
90
220
240
50
0
−50
abres$resid[−1]
50
abres$resid
0
160
70
hardness
−50
140
240
0
100 150 200 250 300 350
abres$fitted
120
220
−50
abres$resid
50
0
50
200
strength
−50
abres$resid
hardness
180
−50
0
50
abres$resid[−30]
Figure 10.3: abrasion data: model validation. Top row shows the loss (black) and
fitted values (blue) as a function of hardness (left) and strength (right). Middle and
bottom panels are different residual plots. (See R-Code 10.3.)
10.3
Model Selection
An information criterion in statistics is a tool for model selection. It follows the idea of Occam’s
razor , in that a model should not be unnecessarily complex. It balances the goodness of fit
of the estimated models with its complexity, measured by the number of parameters. There is
a penalty for the number of parameters, otherwise complex models with numerous parameters
would be preferred.
The coefficient of determination R2 is thus not a proper information criterion: every additional predictor that is added to the model will potentially reduce SSM . The adjusted R2
is somewhat a information criterion as the second term increases with predictors that do not
contribute to reduce the residual sums of squares.
To introduce two well known information criterion, we assume that the distribution of the
187
10.3. MODEL SELECTION
observations follows a known distribution with an unknown parameter θ with p components.
Hence, we can write down the likelihood function of the observations. In maximum likelihood
b or, equivalently, the smaller the negative logestimation, the larger the likelihood function L(θ)
b the better the model is. The oldest criterion was proposed as “an
likelihood function −ℓ(θ),
information criterion” in 1973 by Hirotugu Akaike (Akaike, 1973) and is known today as the
Akaike information criterion (AIC):
b + 2p.
AIC = −2ℓ(θ)
(10.23)
In regression models with normally distributed errors, the maximized log-likelihood is n log(b
σ2)
up to some constant and so the first term describes the goodness of fit.
It is important to not that additive constants are not relevant. Also solely the difference of
two AICs make sense but not the reduction in proportion or something.
Remark 10.1. When deriving the AIC for the regression setting, we have to use likelihood esb
b and σ
b ,σ
timates. β
=β
b2 = r ⊤ r /n. The log-likelihood is then shown to be ℓ(β
b2 ) =
ML
LS
ML
ML
ML
−n/2(log(2 ∗ pi) − log(n) + log(r ⊤ r ) + 1). As additive constants are not relevant for the AIC,
we can express the AIC in terms of the residuals only.
♣
The disadvantage of AIC is that the penalty term is independent of the sample size. The
Bayesian information criterion (BIC)
b + log(n) p
BIC = −2ℓ(θ)
(10.24)
penalizes the model more heavily based on both the number of parameters p and sample size n,
and its use is recommended.
We illustrate the use of information criteria with an example based on a classical dataset.
Example 10.4 (LifeCycleSavings data). Under the life-cycle savings hypothesis developed by
Franco Modigliani, the savings ratio (aggregate personal savings divided by disposable income) is
explained by per-capita disposable income, the percentage rate of change in per-capita disposable
income, and two demographic variables: the percentage of population less than 15 years old and
the percentage of the population over 75 years old (see, e.g., Modigliani, 1966). The data
provided by LifeCycleSavings are averaged over the decade 1960–1970 to remove the business
cycle or other short-term fluctuations and contains information from 50 countries about these
five variables:
• sr aggregate personal savings,
• pop15 % of population under 15,
• pop75 % of population over 75,
• dpi real per-capita disposable income,
• ddpi % growth rate of dpi.
Scatter plots are shown in Figure 10.4. R-Code 10.4 fits a multiple linear model, selects models
through comparison of various goodness of fit criteria (AIC, BIC) and shows the model validation
plots for the model selected using AIC. The step() function is a convenient way for selecting
188
CHAPTER 10. MULTIPLE REGRESSION
relevant predictors. Figure 10.4 gives four of the most relevant diagnostic plots, obtained by
passing a fitted object to plot() (compare with the manual construction of Figure 10.1).
Different models may result from different criteria: when using BIC for model selection,
pop75 drops out of the model.
♣
25
35
45
1
2
3
4
0
2000
4000
0
5
10
15
10 15 20
10 15 20
sr
45 0
0
5
5
5
10 15 20
0
3
4
25
35
pop15
2000
4000 1
2
pop75
10
15 0
dpi
0
5
ddpi
0
5
10
15
Figure 10.4: Scatter plots of LifeCycleSavings with red “guide-the-eye” curves. (See
R-Code 10.4.)
R-Code 10.4: LifeCycleSavings data: EDA, linear model and model selection. (See
Figures 10.4 and 10.5.)
data( LifeCycleSavings)
head( LifeCycleSavings)
##
sr pop15 pop75
dpi ddpi
## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria
12.07 23.32 4.41 1507.99 3.93
## Belgium
13.17 23.80 4.43 2108.47 3.82
## Bolivia
5.75 41.89 1.67 189.13 0.22
## Brazil
12.88 42.19 0.83 728.47 4.56
## Canada
8.79 31.72 2.85 2982.88 2.43
pairs(LifeCycleSavings, upper.panel=panel.smooth, lower.panel=NULL, gap=0)
lcs.all <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
summary( lcs.all)
189
10.3. MODEL SELECTION
##
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
##
## Residuals:
##
Min
1Q Median
3Q
Max
## -8.242 -2.686 -0.249 2.428 9.751
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.566087
7.354516
3.88 0.00033 ***
## pop15
-0.461193
0.144642
-3.19 0.00260 **
## pop75
-1.691498
1.083599
-1.56 0.12553
## dpi
-0.000337
0.000931
-0.36 0.71917
## ddpi
0.409695
0.196197
2.09 0.04247 *
## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.8 on 45 degrees of freedom
## Multiple R-squared: 0.338,Adjusted R-squared: 0.28
## F-statistic: 5.76 on 4 and 45 DF, p-value: 0.00079
lcs.aic <- step( lcs.all)
# AIC is default choice
## Start: AIC=138.3
## sr ~ pop15 + pop75 + dpi + ddpi
##
##
Df Sum of Sq RSS AIC
## - dpi
1
1.9 653 136
## <none>
651 138
## - pop75 1
35.2 686 139
## - ddpi
1
63.1 714 141
## - pop15 1
147.0 798 146
##
## Step: AIC=136.45
## sr ~ pop15 + pop75 + ddpi
##
##
Df Sum of Sq RSS AIC
## <none>
653 136
## - pop75 1
47.9 701 138
## - ddpi
1
73.6 726 140
## - pop15 1
145.8 798 144
190
CHAPTER 10. MULTIPLE REGRESSION
summary( lcs.aic)$coefficients
##
Estimate Std. Error t value
Pr(>|t|)
## (Intercept) 28.12466
7.18379 3.9150 0.00029698
## pop15
-0.45178
0.14093 -3.2056 0.00245154
## pop75
-1.83541
0.99840 -1.8384 0.07247270
## ddpi
0.42783
0.18789 2.2771 0.02747818
plot( lcs.aic)
# 4 plots to assess the models
summary( step( lcs.all, k=log(50), trace=0))$coefficients # now BIC
##
Estimate Std. Error t value
Pr(>|t|)
## (Intercept) 15.59958
2.334394 6.6825 2.4796e-08
## pop15
-0.21638
0.060335 -3.5863 7.9597e-04
## ddpi
0.44283
0.192401 2.3016 2.5837e-02
10.4
Extensions of the Linear Model
Naturally, the linearity assumption (of the individual parameters) in the linear model is central.
There are situations, where we have a nonlinear relationship between the observations and the
parameters. In this section we enumerate some of the alternatives, such that at least relevant
statistical terms can be associated.
10.4.1
Logistic Regression
The response Yi in Model (10.3) is Gaussian, albeit not iid. If this is not the case, for example the
observations yi are proportions, then we need another approach. The idea of this new approach is
to model a function of the expected value of Yi as a linear function of some predictors, for example
g(E(Yi )) = β0 +β1 xi . The function g(·) is such that for any value xi , g −1 (β0 +β1 xi ) is constrained
to [ 0, ]. To illustrate the concept we look at a simple logistic regression. Example 10.5 is based
on widely discussed data.
Example 10.5 (orings data). In January 1986 the space shuttle Challenger exploded shortly
after taking off, killing all seven crew members aboard. Part of the problem was with the
rubber seals, the so-called o-rings, of the booster rockets. Due to low ambient temperature, the
seals started to leak causing the catastrophe. The data set data( orings, package="faraway")
contains the number of defects in the six seals in 23 previous launches (Figure 10.6). The question
we ask here is whether the probability of a defect for an arbitrary seal can be predicted for an
air temperature of 31◦ F (as in January 1986). See Dalal et al. (1989) for a detailed statistical
account or simply https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster.
The variable of interest is a probability (failure of a rubber seal), that we estimate based on
binomial data (failures of o-rings) but a linear model cannot guarantee pbi ∈ [0, 1] (see linear fit in
Figure 10.6). In this and similar cases, logistic regression is appropriate. The logistic regression
10.4. EXTENSIONS OF THE LINEAR MODEL
191
R-Code 10.5 orings data and estimated probability of defect dependent on air temperature. (See Figure 10.6.)
data( orings, package="faraway")
str(orings)
## 'data.frame': 23 obs. of 2 variables:
## $ temp : num 53 57 58 63 66 67 67 67 68 69 ...
## $ damage: num 5 1 1 1 0 0 0 0 0 0 ...
plot( damage/6~temp, xlim=c(21,80), ylim=c(0,1), data=orings, pch='+',
xlab="Temperature [F]", ylab='Probability of damage') # data
abline(lm(damage/6~temp, data=orings), col='gray') # regression line
glm1 <- glm( cbind(damage,6-damage)~temp, family=binomial, data=orings)
points( orings$temp, glm1$fitted, col=2) # fitted values
ct <- seq(20, to=85, length=100)
# vector to predict
p.out <- predict( glm1, new=data.frame(temp=ct), type="response")
lines(ct, p.out)
abline( v=31, col='gray', lty=2)
# actual temp. at start
models the probability of a defect as
p = P(defect) =
1
,
1 + exp(−β0 − β1 x)
(10.25)
where x is the air temperature. Through inversion one obtains a linear model for the log odds
p = β0 + β1 x,
(10.26)
g(p) = log
1−p
where g(·) is generally called the link function. In this special case, the function g −1 (·) is called
the logistic function.
♣
10.4.2
Generalized Linear Regression
The case of logistic regression can be extended to the so-called generalized linear model (GLM).
A GLM contains as a special case the classical linear regression and the logistic regression.
The underlying idea is to model a transformation of the expected value of Yi as a linear
function of some predictors. Lets assume that Yi are independent and from a certain class
of distribution and denote E(Yi ) = µi then g(E(Yi )) = x i β. In the linear regression setting,
the response is Gaussian with g(x) = x, in the logistic regression, the response is Bernoulli (or
equivalently Binomial) with g(x) = log(p/(1−p)). Alternative possible distributions are Poisson,
log-normal, or gamma (the latter is an extension of the exponential or chi-squared distribution).
The advantage of a GLM approach is the common fitting and estimation procedure unified in
the glm() function, where one simply specifies the distribution and link function via the argument
family=...(link="..."). While the fitting via R is straightforward, the interpretation of the
parameters and model validation are more involved compared to linear regression.
192
10.4.3
CHAPTER 10. MULTIPLE REGRESSION
Transformation of the Response
For a Poisson random variable, the variance increases with increasing mean. Similarly, it is
possible that the variance of the residuals increases with an increasing number of observations.
Instead of “modeling” increasing variances, transformations of the response variables often render
the variances of the residuals sufficiently constant. For example, instead of a linear model for Yi ,
a linear model for log(Yi ) is constructed.
√
Typical transformations are log(x), log(x + 1), x or a general power transformation xλ . We
may pick “any” other reasonable transformation. There are formal approaches to determine an
optimal transformation of the data, notably the function boxcox() from the package MASS (see
√
Problem 10.2). However, in practice log(·) and · are used dominantly.
It is important to realize that if the original data stems from a “truly” linear model, any
non-linear transformation leads to an approximation. On the other hand, a log-transformation
of the true model
Yi = β0 xβi 1 eεi
(10.27)
log(Yi ) = log(β0 ) + β1 log(xi ) + εi .
(10.28)
leads to a linear relationship
10.4.4
Nonlinear and Non-parametric Regression
A natural generalization of model (10.3) is to relax the linearity assumption on β and write
Yi = f (x i , β), +εi ,
iid
εi ∼ N (0, σ 2 ),
(10.29)
where f (x , β) is a (sufficiently well behaved) function depending on a vector of covariates x and
the parameter vector β. To estimate the latter we use a least squares approach
b = argmin
β
β
n
X
i=1
2
yi − f (x i , β) .
(10.30)
For most functions f , we do not have closed forms for the resulting estimates and iterative
approaches are needed. That is, starting from an initial condition we improve the solution by
small steps. The correction factor typically depends on the gradient of f . Such algorithms are
variants of the so-called Gauss–Newton algorithm. The details of these are beyond the scope of
this document.
There are some prototype functions f (·, β) that are often used:
Yi = β1 exp −(xi /β2 )β3 + εi

 β + ε , if x < β
1
i
i
3
Yi =
 β + ε , if x ≥ β ,
2
i
i
Weibull model,
(10.31)
break point models.
(10.32)
3
The exponential growth model is a special case of the Weibull model.
There is no guarantee that an optimal solution exists (global minimum). Using nonlinear least
squares as a black box approach is dangerous and should be avoided. In R, the function nls()
(nonlinear least-squares) can be used and general optimizer functions are nlm() and optim().
10.5. BIBLIOGRAPHIC REMARKS
193
An alternative way to express a response in a nonlinear way is through a virtually arbitrary
flexible function, similar to the (red) guide-the-eye curves in many of the scatter plots. These
smooth curves are constructed through a so-called “non-parametric” regression approach.
10.5
Bibliographic Remarks
The book by Faraway (2006) nicely summarizes extensions of the linear model.
10.6
Exercises and Problems
Problem 10.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Derive the distributions of the fitted values and the residuals, i.e., (10.13) and (10.14).
b) Identify the “simple” and “complex model” in the setting of a simple regression such that
(10.22) reduces to (9.21).
c) Show that the AIC for a standard multiple regression model with p+1 regression coefficients
is proportional to n log(r ⊤ r ) + 2p and derive the BIC for the same model.
Problem 10.2 (Box–Cox transformation) In this problem, we motivate the so-called Box–Cox
transformation. Suppose that we have a random variable Y with strictly positive mean µ > 0 and
standard deviation σ = cµα , where c > 0 and α arbitrary. That means the standard deviation
of Y is proportional to a power of the mean.
a) Define the random variable X = Y λ , λ ̸= 0, and show that the standard deviation of X
is approximately proportional to µλ−1+α . What value λ of the transformation leads to a
constant variance?
b) Suppose we observe from Y1 , . . . , Yn with respective means µ1 , . . . , µn . Describe a way to
estimate the transformation parameter λ.
c) The dataset SanMartinoPPts from the package hydroTSM contains daily precipitation at
station San Martino di Castrozza (Trento Italy) over 70 years. Determine the optimal
transformation for monthly totals. Justify that a square root transformation is adequate.
How do you expect the transformation to change when working with annual or daily data?
Hint: monthly totals may be obtained via hydroTSM::daily2monthly(SanMartinoPPts,
FUN=sum).
Problem 10.3 (Multiple linear regression 1) The data stackloss.txt are available on the
course web page. The data represents the production of nitric acid in the process of oxidizing
194
CHAPTER 10. MULTIPLE REGRESSION
ammonia. The response variable, stack loss, is the percentage of the ingoing ammonia that
escapes unabsorbed. Key process variables are the airflow, the cooling water temperature (in
degrees C), and the acid concentration (in percent).
Construct a regression model that relates the three predictors to the response, stack loss.
Check the adequacy of the model.
Exercise and data are from B. Abraham and J. Ledolter, Introduction to Regression Modeling,
2006, Thomson Brooks/Cole.
Hints:
• Look at the data. Outliers?
• Try to find a “optimal” model. Exclude predictors that do not improve the model fit.
• Use model Diagnostics, use t−, F -tests and (adjusted) R2 values to compare different
models.
• Which data points have a (too) strong influence on the model fit? (influence.measures())
• Are the predictors correlated? In case of a high correlation, what are possible implications?
Problem 10.4 (Multiple linear regression 2) The file salary.txt contains information about
average teacher salaries for 325 school districts in Iowa. The variables are
District
districtSize
salary
experience
name of the district
size of the district:
1 = small (less than 1000 students)
2 = medium (between 1000 and 2000 students)
3 = large (more than 2000 students)
average teacher salary (in dollars)
average teacher experience (in years)
a) Produce a pairs plot of the data and briefly describe it.
Hint: districtSize is a categorical random variable with only three possible outcomes.
b) For each of the three district sizes, fit a linear model using salary as the dependent variable
and experience as the covariate. Is there an effect of experience? How can we compare the
results?
c) We now use all data jointly and use districtSize as covariate as well. However, districtSize
is not numerical, rather categorical and thus we set mydata$districtSize <- as.factor(
mydata$districtSize) (with appropriate dataframe name). Fit a linear model using
salary as the dependent variable and the remaining data as the covariates. Is there an
effect of experience and/or district size? How can we interpret the parameter estimates?
Problem 10.5 (Missing or unnecessary predictors) Construct synthetic data according to Yi =
β0 + β1 x1 + β2 x21 + β3 x2 + εi with n = 50, predictors x1 , x2 and x3 drawn from a uniform
iid
distribution and εi ∼ N (0, 0.252 ). Fit models according to the table below and discuss the
10.6. EXERCISES AND PROBLEMS
195
model deficiencies.
Hint: You may consider R-Code 10.2 and Figure 10.2.
Example
fitted model
1
2
3
4
5
Yi = β0 + β1 x1 + β2 x21 + β3 x2 + εi
Yi = β0 + β1 x1 + β2 x2 + εi
Yi = β0 + β1 x21 + β2 x2 + εi
Yi = β0 + β1 x1 + β2 x21 + εi
Yi = β0 + β1 x1 + β2 x21 + β3 x2 + β4 x3 + εi
correct model
missing predictor x21
missing predictor x1
missing predictor x2
unnecessary predictor x3
Problem 10.6 (BMJ Endgame) Discuss and justify the statements about ‘Multiple regression’
given in doi.org/10.1136/bmj.f4373.
196
−1
0
1
2
3
4
5
6
0.0 0.5 1.0
−1.0
0.0 0.5 1.0
−1.0
−1.0
0.0 0.5 1.0
CHAPTER 10. MULTIPLE REGRESSION
0.0
0.2
0.4
0.6
0.8
1.0
2
3
4
5
0.2
0.4
0.6
0.8
1.0
4
5
6
0.2
0.4
0.6
0.8
1.0
3
fitted values
4
5
6
0.8
1.0
0.2
0.4
0.6
0.8
1.0
0.6
0.8
1.0
0.0 0.5 1.0
−1.0
0.0 0.5 1.0
2
0.6
x2
−1.0
0.0 0.5 1.0
1
0.0
x1
−1.0
0
0.4
0.0 0.5 1.0
0.0
fitted values
−1
0.2
−1.0
0.0 0.5 1.0
3
1.0
x2
−1.0
0.0 0.5 1.0
2
0.0
x1
−1.0
1
0.8
0.0 0.5 1.0
0.0
fitted values
0
0.6
−1.0
0.0 0.5 1.0
1
0.4
x2
−1.0
0.0 0.5 1.0
−1.0
0
0.2
x1
fitted values
−1
0.0
0.0
0.2
0.4
0.6
x1
0.8
1.0
0.0
0.2
0.4
x2
Figure 10.2: Residual plots. Residuals versus fitted values (left column), predictor
x1 (middle) and x2 (right column). The rows correspond to the different fitted models.
The panels in the left column have different scaling of the x-axis. (See R-Code 10.2.)
197
10.6. EXERCISES AND PROBLEMS
Normal Q−Q
10
3
Residuals vs Fitted
Zambia
6
8
10
2
1
0
5
0
−2
−5
Residuals
−10
Chile
12
14
Philippines
−1
Standardized residuals
Zambia
Philippines
Chile
16
−2
−1
Fitted values
2
Residuals vs Leverage
3
Zambia
1.5
1
Theoretical Quantiles
Scale−Location
Zambia
2
0
1
0.5
−1
1.0
0.5
1
Japan
Libya
Cook's distance
6
8
10
12
Fitted values
14
16
0.5
1
−2
Standardized residuals
Chile
Philippines
0.0
Standardized residuals
0
0.0
0.1
0.2
0.3
0.4
Leverage
Figure 10.5: LifeCycleSavings data: model validation. (See R-Code 10.4.)
0.5
198
0.2
0.4
0.6
0.8
+
++
+
0.0
Probability of damage
1.0
CHAPTER 10. MULTIPLE REGRESSION
+
+++++
20
30
40
50
60
70
+
++
++
++
+
80
Temperature [F]
Figure 10.6: orings data (proportion of damaged orings, black crosses) and estimated
probability of defect (red dots) dependent on air temperature. Linear fit is given by
the gray solid line. Dotted vertical line is the ambient launch temperature at the time
of launch. (See R-Code 10.5.)
Chapter 11
Analysis of Variance
Learning goals for this chapter:
⋄ Understanding the link between t-tests, linear model and ANOVA
⋄ Define an ANOVA model
⋄ Understand the concept of sums of squares decomposition
⋄ Interpret and discuss the output of a two-way ANOVA
⋄ Define an ANCOVA model for a specific setting
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter11.R.
In this Chapter we will further elaborate on a sums of squares decomposition of linear models.
For the ease of presentation, we focus on qualitative predictors, called factors. The simplest
setting is comparing the means of I independent samples with a common variance term. This
is much in the spirit of Test 2 discussed in Chapter 4, where we compared the means of two
independent samples with each other.
Instead of comparing the samples pairwise (which would amount to I2 tests and would require adjustments due to multiple testing as discussed in Section 5.5.2) we introduce a “better”
method. Although we are concerned with comparing means we frame the problem to comparing variances, termed analysis of variance (ANOVA). We focus on a linear model approach to
ANOVA. Due to historical reasons, the notation is slightly different than what we have seen in
the last two chapters; but we try to link and unify as much as possible.
11.1
One-Way ANOVA
The model of this section is tailored to compare I different groups where the variability of the
observations around the mean is the same in all groups. That means that there is one common
variance parameter and we pool the information across all observations to estimate it.
199
200
CHAPTER 11. ANALYSIS OF VARIANCE
More formally, the model consists of I groups, in the ANOVA context called factor levels,
i = 1, . . . , I, and every level contains a sample of size ni . Thus, in total we have N = n1 +· · ·+nI
observations. The model is given by
Yij = µi + εij
(11.1)
= µ + βi + εij ,
(11.2)
where we use the indices to indicate the group and within group observation. Similarly as in
iid
the regression models of the last chapters, we again assume εij ∼ N (0, σ 2 ). Formulation (11.1)
represents the individual group means directly, whereas formulation (11.2) models an overall
mean and deviations from the mean.
However, model (11.2) is overparameterized (I levels and I + 1 parameters) and an additional
constraint on the parameters is necessary. Often, the sum-to-zero-contrast or treatment contrast,
written as:
I
X
βi = 0
or
β1 = 0,
(11.3)
i=1
are used.
We are inherently interested in whether there exists a difference between the groups and so
our null hypothesis is H0 : β1 = β2 = · · · = βI = 0. Note that the hypothesis is independent of
the constraint. To develop the associated test, we proceed in several steps. We first link the two
group case to the notation from the last chapter. In a second step we intuitively derive estimates
in the general setting. Finally, we state the test statistic.
11.1.1
Two Level ANOVA Written as a Regression Model
Model (11.2) with I = 2 and treatment constraint β1 = 0 can be written as a regression problem
Yi∗ = β0∗ + β1∗ xi + ε∗i ,
i = 1, . . . , N
(11.4)
with Yi∗ the components of the vector (Y11 , Y12 , . . . , Y1n1 , Y21 , . . . , Y2n2 )⊤ and xi = 0 if i =
1, . . . , n1 and xi = 1 otherwise. We simplify the notation and spare ourselves from writing the
index denoted by the asterisk with
Y1
β0
1 0 β0
= Xβ + ε = X
+ε=
+ε
(11.5)
Y2
β1
1 1 β1
and thus we have as least squares estimate
b β0
N n2 −1 1⊤ 1⊤
y1
⊤
−1 ⊤ y 1
b
β = b = (X X) X
=
⊤
⊤
y2
n2 n2
0
1
y2
β1
P
1 P
n2 − n2
1
j y1j
n1
ij yij
P
= 1 P
.
=
1 P
n1 n2 −n2 N
j y2j
j y2j − n
j y1j
n
2
(11.6)
(11.7)
1
Thus the least squares estimates of µ and β2 in (11.2) for two groups are the mean of the first
group and the difference between the two group means.
The null hypothesis H0 : β1 = β2 = 0 in Model (11.2) is equivalent to the null hypothesis
H0 : β1∗ = 0 in Model (11.4) or to the null hypothesis H0 : β1 = 0 in Model (11.5). The latter is
201
11.1. ONE-WAY ANOVA
of course based on a t-test for a linear association (Test 15) and coincides with the two-sample
t-test for two independent samples (Test 2).
Estimators can also be derived in a similar fashion under other constraints or for more factor
levels.
11.1.2
Sums of Squares Decomposition
Historically, we often had the case n1 = · · · = nI = J, representing a balanced setting. In
this case, there is another simple approach to deriving the estimators under the sum-to-zero
constraint. We use dot notation in order to show that we work with averages, for example,
n
i
1 X
y i· =
yij
ni
I
and
j=1
n
i
1 XX
y ·· =
yij .
N
(11.8)
i=1 j=1
Based on Yij = µ + βi + εij we use the following approach to derive estimates
yij = y ·· + (y i· − y ·· ) + (yij − y i· ).
With the least squares method, µ
b and βbi are chosen such that
X
X
b − βbi )2
(yij − µ
b − βbi )2 =
(y ·· + y i· − y ·· + yij − y i· − µ
i,j
(11.9)
(11.10)
i,j
=
X
i,j
2
b) + (y i· − y ·· − βbi ) + (yij − y i· )
(y ·· − µ
(11.11)
is minimized. We evaluate the square of this last equation and note that the cross terms are zero
since
J
X
j=1
(yij − y i· ) = 0
and
I
X
(y i· − y ·· − βbi ) = 0
(11.12)
i=1
(the latter due to sum-to-zero constraint) and thus µ
b = y ·· and βbi = y i· − y ·· . Hence, writing
rij = yij − y i· , we have
yij = µ
b + βbi + rij .
(11.13)
The observations are orthogonally projected in the space spanned by µ and βi . This orthogonal projection allows for the division of the sums of squares of the observations (mean corrected
to be precise) into the sums of squares of the model and sum of squares of the error component.
These sums of squares are then weighted and compared. The representation of this process
in table form and the subsequent interpretation is often equated with the analysis of variance,
denoted ANOVA.
Remark 11.1. This orthogonal projection also holds in the case of a classical regression framework, of course. Using (10.13) and (10.14), we have
b ⊤ r = y ⊤ H⊤ (I − H)y = y ⊤ (H − HH)y = 0,
y
(11.14)
202
CHAPTER 11. ANALYSIS OF VARIANCE
because the hat matrix H is symmetric (H⊤ = H) and idempotent (HH = H).
♣
The decomposition of the sums of squares can be derived with help from (11.9). No assumptions about constraints or ni are made
X
X
(yij − y ·· )2 =
(y i· − y ·· + yij − y i· )2
i,j
(11.15)
ij
=
X
ij
(y i· − y ·· )2 +
X
X
(yij − y i· )2 +
2(y i· − y ·· )(yij − y i· ),
i,j
where the cross term is again zero because
i,j
ni
X
j=1
of the sums of squares
X
i,j
|
(yij − y ·· )2 =
{z
Total
}
X
i,j
|
(11.16)
(yij − y i· ) = 0. Hence we have the decomposition
(y i· − y ·· )2 +
{z
Model
}
X
i,j
|
(yij − y i· )2
{z
Error
(11.17)
}
or SST = SSA + SSE . We choose deliberately SSA instead of SSM as this will simplify subsequent
extensions. Using the least squares estimates µ
b = y ·· and βbi = y i· − y ·· , this equation can be
read as
1 X
1 X
1 X
(yij − µ
b)2 =
ni (µ\
+ βi − µ
b)2 +
(yij − µ\
+ βi )2
N
N
N
i,j
i
i,j
X
2
1
c2 ,
\ij ) =
ni βbi + σ
Var(y
N
(11.18)
(11.19)
i
(where we could have used some divisor other than N ). The test statistic for the statistical
hypothesis H0 : β1 = β2 = · · · = βI = 0 is based on the idea of decomposing the variance into
variance between groups and variance within groups, just as illustrated in (11.19), and comparing
c2 in
them. Formally, this must be made more precise. A good model has a small estimate for σ
comparison to that for the second sum. We now develop a quantitative comparison of the sums.
A raw comparison of both variance terms is not sufficient, the number of observations must
be considered: SSE increases as N increases also in light of a high quality model. In order to
weight the individual sums of squares, we divide them by their degrees of freedom, e.g., instead
of SSE we will use SSE /(N − I) and instead of SSA we will use SSA /(I − 1), which we will
term mean squares. Under the null hypothesis, the mean squares are chi-square distributed
and thus their quotients are F distributed. Hence, an F -test as illustrated in Test 4 is needed
again. Historically, such a test has been “constructed” via a table and is still represented as such.
This so-called ANOVA table consists of columns for the sums of squares, degrees of freedom,
mean squares and F -test statistic due to variance between groups, within groups, and the total
variance. Table 11.1 illustrates such a generic ANOVA table, numerical examples are given in
Example 11.1 later in this section. Note that the third row of the table represents the sum of
the first two rows. The last two columns are constructed from the first two ones.
203
11.1. ONE-WAY ANOVA
Table 11.1: Table for one-way analysis of variance.
Source
Sum of squares Degrees of freedom Mean squares Test statistic
(SS)
(DF)
(MS)
Between groups
SSA =
X
(Factors, levels, . . . )
(y i· − y ·· )2
I −1
MSA =
SSE =
X
(yij − y i· )2
SSA
MSA
Fobs =
I −1
MSE
N −I
MSE =
SST =
X
(yij − y ·· )2
SSE
N −I
N −1
i,j
Within groups
(Error)
i,j
Total
i,j
The expected values of the mean squares (MS) are
E(MSA ) = E SSA /(I − 1) = σ 2 +
2
i ni βi
P
I −1
,
E(MSE ) = E SSE /(N − I) = σ 2 .
(11.20)
Note that only the latter is intuitive (Problem 11.1.b).
Thus under H0 : β1 = · · · = βI = 0, E(MSA ) = E(MSE ) and hence the ratio MSA /MSE is
close to one, but typically larger. We reject H0 for large values of this ratio. Test 16 summarizes
the procedure. The test statistic of Table 11.1 naturally agrees with that of Test 16. Observe
that when MSA ≤ MSE , i.e., Fobs ≤ 1, H0 is never rejected. Details about F -distributed random
variables are given in Section 3.2.3.
Test 16: Performing a one-way analysis of variance
Question: Of the means y 1 , y 2 , . . . , y I , are at least two significantly different?
Assumptions: The I populations are normally distributed with homogeneous variances. The samples are independent.
Calculation: Construct a one-way ANOVA table.
squares of the factor and the error are needed:
Fobs =
The quotient of the mean
MSA
SSA /(I − 1)
=
SSE /(N − I)
MSE
Decision: Reject H0 : β1 = β2 = · · · = βI if Fobs > Fcrit , where Fcrit is the
1 − α quantile of an F -distribution with I − 1 gives the degrees of freedom
“between” and N − I the degrees of freedom “within”
Calculation in R: summary( lm(...)) for the value of the test statistic or anova(
lm(...)) for the explicit ANOVA table.
204
CHAPTER 11. ANALYSIS OF VARIANCE
Example 11.1 discusses a simple analysis of variance.
Example 11.1 (retardant data). Many substances related to human activities end up in
wastewater and accumulate in sewage sludge. The present study focuses on hexabromocyclododecane (HBCD) detected in sewage sludge collected from a monitoring network in Switzerland.
HBCD’s main use is in expanded and extruded polystyrene for thermal insulation foams, in
building and construction. HBCD is also applied in the backcoating of textiles, mainly in furniture upholstery. A very small application of HBCD is in high impact polystyrene, which is used
for electrical and electronic appliances, for example in audio visual equipment. Data and more
detailed background information are given in Kupper et al. (2008) where it is also argued that
loads from different types of monitoring sites showed that brominated flame retardants ending
up in sewage sludge originate mainly from surface runoff, industrial and domestic wastewater.
HBCD is harmful to one’s health, may affect reproductive capacity, and may harm children
in the mother’s womb.
In R-Code 11.1 the data are loaded and reduced to Hexabromocyclododecane. First we use
constraint β1 = 0, i.e., Model (11.5). The estimates naturally agree with those from (11.7). Then
we use the sum-to-zero constraint and compare the results. The estimates and the standard errors
changed (and thus the p-values of the t-test). The p-values of the F -test are, however, identical,
since the same test is used.
The R command aov is an alternative for performing ANOVA and its use is illustrated in
R-Code 11.2. We prefer, however, the more general lm approach. Nevertheless we need a function
which provides results on which, for example, Tukey’s honest significant difference (HSD) test
can be performed with the function TukeyHSD. The differences can also be calculated from the
coefficients in R-Code 11.1. The p-values are smaller because multiple tests are considered. ♣
R-Code 11.1: retardant data: ANOVA with lm command and illustration of various
contrasts.
tmp <- read.csv('./data/retardant.csv')
retardant <- read.csv('./data/retardant.csv', skip=1)
names( retardant) <- names(tmp)
HBCD <- retardant$cHBCD
str( retardant$StationLocation)
##
chr [1:16] "A11" "A12" "A15" "A16" "B11" "B14" "B16" "B25" "C2" "C4" ...
type <- as.factor( rep(c("A","B","C"), c(4,4,8)))
lmout <- lm( HBCD ~ type )
summary(lmout)
##
## Call:
## lm(formula = HBCD ~ type)
205
11.1. ONE-WAY ANOVA
##
## Residuals:
##
Min
1Q Median
3Q
Max
## -87.6 -44.4 -26.3
22.0 193.1
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
75.7
42.4
1.78
0.098 .
## typeB
77.2
60.0
1.29
0.220
## typeC
107.8
51.9
2.08
0.058 .
## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.8 on 13 degrees of freedom
## Multiple R-squared: 0.249,Adjusted R-squared: 0.134
## F-statistic: 2.16 on 2 and 13 DF, p-value: 0.155
options( "contrasts")
## $contrasts
##
unordered
## "contr.treatment"
ordered
"contr.poly"
# manually construct the estimates:
c( mean(HBCD[1:4]), mean(HBCD[5:8])-mean(HBCD[1:4]),
mean(HBCD[9:16])-mean(HBCD[1:4]))
## [1]
75.675
77.250 107.788
# change the constrasts to sum-to-zero
options(contrasts=c("contr.sum","contr.sum"))
lmout1 <- lm( HBCD ~ type )
# summary(lmout1) # as above, except the coefficents are different:
print( summary(lmout1)$coef, digits=3)
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
137.4
22.3
6.15 3.51e-05
## type1
-61.7
33.1
-1.86 8.55e-02
## type2
15.6
33.1
0.47 6.46e-01
beta <- as.numeric(coef(lmout1))
# Construct 'contr.treat' coefficients:
c( beta[1]+beta[2], beta[3]-beta[2], -2*beta[2]-beta[3])
## [1]
75.675
77.250 107.787
206
CHAPTER 11. ANALYSIS OF VARIANCE
R-Code 11.2 retardant data: ANOVA with aov and multiple testing of the means.
aovout <- aov( HBCD ~ type )
options("contrasts")
## $contrasts
## [1] "contr.sum" "contr.sum"
coefficients( aovout)
## (Intercept)
##
137.354
# coef( aovout) is suffient as well.
type1
-61.679
type2
15.571
summary(aovout)
##
## type
## Residuals
Df Sum Sq Mean Sq F value Pr(>F)
2 31069
15534
2.16
0.15
13 93492
7192
TukeyHSD( aovout)
##
Tukey multiple comparisons of means
##
95% family-wise confidence level
##
## Fit: aov(formula = HBCD ~ type)
##
## $type
##
diff
lwr
upr
p adj
## B-A 77.250 -81.085 235.59 0.42602
## C-A 107.787 -29.335 244.91 0.13376
## C-B 30.537 -106.585 167.66 0.82882
11.2
Two-Way and Complete Two-Way ANOVA
Model (11.2) can be extended for additional factors. Adding one factor to a one-way model leads
us to a two-way model
Yijk = µ + βi + γj + εijk
(11.21)
iid
with i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , nij and εijk ∼ N (0, σ 2 ). The indices again specify the
levels of the first and second factor as well as the count for that configuration. As stated, the
model is over parameterized and additional constraints are again necessary, in which case
I
X
i=1
βi = 0,
J
X
γj = 0
or
β1 = 0,
γ1 = 0
(11.22)
j=1
are often used.
When the nij in each group are not equal, the decomposition of the sums of squares is not
necessarily unique. In practice this is often unimportant since, above all, the estimated coefficients are compared with each other (“contrasts”). We recommend always using the command
207
11.2. TWO-WAY AND COMPLETE TWO-WAY ANOVA
lm(...). In case we do compare sums of squares there are resulting ambiguities and factors need
to be included in decreasing order of “natural” importance.
For the sake of illustration, we consider the balanced case of nij = K, called complete two-way
ANOVA. More precisely, the model consists of I · J groups and every group contains K samples
and N = I · J · K. The calculation of the estimates are easier than in the unbalanced case and
are illustrated as follows.
As in the one-way case, we can derive least squares estimates
yijk = y ··· + y i·· − y ··· + y ·j· − y ··· + yijk − y i·· − y ·j· + y ···
|{z} | {z } | {z } |
{z
}
µ
b
γ
bj
rijk
βbi
(11.23)
and separate the sums of squares
X
X
X
X
(yijk − y ··· )2 =
(y i·· − y ··· )2 +
(y ·j· − y ··· )2 +
(yijk − y i·· − y ·j· + y ··· )2 .
i,j,k
|
i,j,k
{z
|
}
SST
i,j,k
{z
SSA
}
|
(11.24)
i,j,k
{z
SSB
}
|
{z
SSE
}
We are interested in the statistical hypotheses H0 : β1 = · · · = βI = 0 and H0 : γ1 = · · · =
γJ = 0. The test statistics of both of these tests are given in the last column of Table 11.2.
The test statistic Fobs,A is compared with the quantiles of the F distribution with I − 1 and
N − I − J + 1 degrees of freedom. Similarly, for Fobs,B we use J − 1 and N − I − J + 1 degrees
of freedom. The tests are equivalent to that of Test 16.
Table 11.2: Table of the complete two-way ANOVA.
Source
SS
Factor A SSA =
X
i,j,k
Factor B SSB =
(y i·· − y ··· )2
X
i,j,k
Error
(y ·j· − y ··· )2
DF
MS
I −1
MSA =
SSA
I −1
Fobs,A =
MSA
MSE
J −1
MSB =
SSB
J −1
Fobs,B =
MSB
MSE
MSE =
SSE
DFE
SSE =
DFE =
X
(yijk − y i·· − y ·j· + y ··· )2 N −I −J +1
i,j,k
Total
SST =
X
i,j,k
(yijk − y ··· )2
Test statistic
N −1
Model (11.21) is additive: “More of both leads to even more”. It might be that there is a
certain canceling or saturation effect. To model such a situation, we need to include an interaction
(βγ)ij in the model to account for the non-linear effects:
Yijk = µ + βi + γj + (βγ)ij + εijk
(11.25)
208
CHAPTER 11. ANALYSIS OF VARIANCE
iid
with εijk ∼ N (0, σ 2 ) and corresponding ranges for the indices. In addition to constraints (11.22)
we require
I
X
(βγ)ij = 0 and
J
X
i=1
j=1
for all i and j
(βγ)ij = 0
(11.26)
or analogous treatment constraints are often used. As in the previous two-way case, we can
derive the least squares estimates
yijk = y ··· + y i·· − y ··· + y ·j· − y ··· + y ij· − y i·· − y ·j· + y ··· + yijk − y ij·
|{z} | {z } | {z } |
{z
} | {z }
d
µ
b
γ
bj
rijk
βbi
(βγ)
(11.27)
ij
and a decomposition in sums of squares is straightforward.
Note that for K = 1 a model with interaction does not make sense as the error of the two-way
model is the interaction.
Table 11.3 shows the table of a complete two-way ANOVA. The test statistics Fobs,A , Fobs,B
and Fobs,AB are compared with the quantiles of the F distribution with I −1, J −1, (I −1)(J −1)
and N − IJ degrees of freedom, respectively.
Table 11.3: Table of the complete two-way ANOVA with interaction.
Source
SS
DF
MS
Factor A
SSA =
X
(y i·· − y ··· )2
I −1
MSA =
SSA
DFA
Fobs,A =
MSA
MSE
X
J −1
MSB =
SSB
DFB
Fobs,B =
MSB
MSE
Interaction SSAB =
(I − 1)×
X
(y ij· − y i·· − y ·j· + y ··· )2 (J − 1)
MSAB =
SSAB
MSAB
Fobs,AB =
DFAB
MSE
SSE =
X
(yijk − y ij· )2
N − IJ
MSE =
SSE
DFE
SST =
X
N −1
i,j,k
Factor B
SSB =
i,j,k
(y ·j· − y ··· )2
Test statistic
i,j,k
Error
i,j,k
Total
i,j,k
11.3
(yijk − y ··· )2
Analysis of Covariance
We now combine regression and analysis of variance elements, i.e., continuous predictors and factor levels, by adding “classic” predictors to models (11.2) or (11.21). Such models are sometimes
called ANCOVA, e.g., example
Yijk = µ + β1 xi + γj + εijk ,
(11.28)
209
11.4. EXAMPLE
iid
with j = 1, . . . , J, k = 1, . . . , nj and εijk ∼ N (0, σ 2 ). Additional constraints are again necessary.
Keeping track of indices and Greek letters quickly gets cumbersome and one often resorts to R
formula notation. For example, if the predictor xi is in the variable Population and γj is in the
variable Treatment, in form of a factor then
y ˜ Population + Treatment
(11.29)
is representing (11.28). An intersection is indicated with :, for example Treatment:Month. The
* operator denotes ‘factor crossing’: a*b is interpreted as a + b + a:b. See Section Details of
help(formula).
ANOVA tables are constructed similarly and individual rows thereof are associated to the
individual elements in a formula specification. Hence, our unified approach via lm(...). Notice
that for the estimates the order of the variable in formula (11.29) does not play a role, for the
decomposition of sums of squares it does. Different statistical software packages have different
approaches and thus may lead to minor differences.
11.4
Example
Example 11.2 (UVfilter data). Octocrylene is an organic UV Filter found in sunscreen and
cosmetics. The substance is classified as a contaminant and dangerous for the environment by
the EU under the CLP Regulation. Sunscreens containing i.a. octocrylene has been forbidden
in some areas as a protective measure for their coral reefs.
Because the substance is difficult to break down, the environmental burden of octocrylene
can be estimated through the measurement of its concentration in sludge from waste treatment
facilities.
The study Plagellat et al. (2006) analyzed octocrylene (OC) concentrations from 24 different
purification plants (consisting of three different types of Treatment), each with two samples
(Month). Additionally, the catchment area (Population) and the amount of sludge (Production)
are known. Treatment type A refers to small plants, B medium-sized plants without considerable
industry and C medium-sized plants with industry.
R-Code 11.3 prepares the data and fits first a one-way ANOVA based on Treatment only,
followed by a two-way ANOVA based on Treatment and Month (with and without interactions).
Note that the setup is not balanced with respect to treatment type.
Adding the factor Month improves considerably the model fit: increase of the adjusted R2
from 40% to 50%) and the standard errors of the treatment effects estimates are further reduced.
The interaction is not significant, as the corresponding p-value is above 14%. Based on
Figure 11.1 this is not surprising. First, the seasonal effect of groups A and B are very similar
and second, the variability in group C is too large.
♣
R-Code 11.3: UVfilter data: one-way ANOVA using lm.
UV <- read.csv( './data/chemosphere.csv')
UV <- UV[,c(1:6,10)]
# reduce to OT
210
CHAPTER 11. ANALYSIS OF VARIANCE
str(UV, strict.width='cut')
## 'data.frame': 24 obs. of 7 variables:
## $ Treatment : chr "A" "A" "A" "A" ...
## $ Site_code : chr "A11" "A12" "A15" "A16" ...
## $ Site
: chr "Chevilly" "Cronay" "Thierrens" "Prahins" ...
## $ Month
: chr "jan" "jan" "jan" "jan" ...
## $ Population: int 210 284 514 214 674 5700 8460 11300 6500 7860 ...
## $ Production: num 2.7 3.2 12 3.5 13 80 150 220 80 250 ...
## $ OT
: int 1853 1274 1342 685 1003 3502 4781 3407 11073 3324 ...
with( UV, table(Treatment, Month))
##
Month
## Treatment jan jul
##
A
5
5
##
B
3
3
##
C
4
4
options( contrasts=c("contr.sum", "contr.sum"))
lmout <- lm( log(OT) ~ Treatment, data=UV)
summary( lmout)
##
## Call:
## lm(formula = log(OT) ~ Treatment, data = UV)
##
## Residuals:
##
Min
1Q Median
3Q
Max
## -0.952 -0.347 -0.136 0.343 1.261
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
8.122
0.116
70.15 < 2e-16 ***
## Treatment1
-0.640
0.154
-4.16 0.00044 ***
## Treatment2
0.438
0.175
2.51 0.02049 *
## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.555 on 21 degrees of freedom
## Multiple R-squared: 0.454,Adjusted R-squared: 0.402
## F-statistic: 8.73 on 2 and 21 DF, p-value: 0.00174
lmout <- lm( log(OT) ~ Treatment + Month, data=UV)
summary( lmout)
##
11.4. EXAMPLE
211
## Call:
## lm(formula = log(OT) ~ Treatment + Month, data = UV)
##
## Residuals:
##
Min
1Q Median
3Q
Max
## -0.7175 -0.3452 -0.0124 0.1691 1.2236
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
8.122
0.106
76.78 < 2e-16 ***
## Treatment1
-0.640
0.141
-4.55 0.00019 ***
## Treatment2
0.438
0.160
2.74 0.01254 *
## Month1
-0.235
0.104
-2.27 0.03444 *
## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.507 on 20 degrees of freedom
## Multiple R-squared: 0.566,Adjusted R-squared: 0.501
## F-statistic: 8.69 on 3 and 20 DF, p-value: 0.000686
summary( aovout <- aov( log(OT) ~ Treatment * Month, data=UV))
##
Df Sum Sq Mean Sq F value Pr(>F)
## Treatment
2
5.38
2.688
11.67 0.00056 ***
## Month
1
1.32
1.325
5.75 0.02752 *
## Treatment:Month 2
1.00
0.499
2.17 0.14355
## Residuals
18
4.15
0.230
## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD( aovout, which=c('Treatment'))
##
Tukey multiple comparisons of means
##
95% family-wise confidence level
##
## Fit: aov(formula = log(OT) ~ Treatment * Month, data = UV)
##
## $Treatment
##
diff
lwr
upr
p adj
## B-A 1.07760 0.44516 1.71005 0.00107
## C-A 0.84174 0.26080 1.42268 0.00446
## C-B -0.23586 -0.89729 0.42556 0.64105
boxplot( log(OT)~Treatment, data=UV, col=7, boxwex=.5)
at <- c(0.7, 1.7, 2.7, 1.3, 2.3, 3.3)
boxplot( log(OT)~Treatment+Month, data=UV, add=T, at=at, xaxt='n', boxwex=.2)
212
CHAPTER 11. ANALYSIS OF VARIANCE
8.5
8.0
jan
jul
7.0
7.5
7.5
8.5
mean of log(OT)
Month
6.5
log(OT)
9.5
with( UV, interaction.plot( Treatment, Month, log(OT), col=2:3))
A
B
C
Treatment
A
B
C
Treatment
Figure 11.1: UVfilter data: box plots sorted by treatment and interaction plot. (See
R-Code 11.3.)
11.5
Bibliographic Remarks
Almost all books covering linear models have a section about ANOVA.
11.6
Exercises and Problems
Problem 11.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Show the second and third equality of
2
Radj
= R2 − (1 − R2 )
n−1
SSE /dofE
p
= 1 − (1 − R2 )
=1−
n−p−1
n−p−1
SST /dofT
b) Show the results of (11.20).
Problem 11.2 (ANOVA) We consider the data chemosphere_OC.csv available on the course
web page. The data describe the octocrylene (OC) concentration sampled from 12 wastewater
treatment plants in Switzerland. Further variables in the dateset are: Behandlung (treatment of
the wastewater), Monat (month when the sample was collected), Einwohner (number of inhabitant
connected to the plant), Produktion (sludge production (metric tons of dry matter per year),
everything that doesn’t enter the water system after treatment).
213
11.6. EXERCISES AND PROBLEMS
Octocrylene is an organic UV filter and is used in sunscreens and as additive in cosmetics
for daily usage. The substance is classified as irritant and dangerous for the environment (EU
classification of dangerous substances).
The data are published in C. Plagellat, T. Kupper, R. Furrer, L. F. de Alencastro, D. Grandjean, J. Tarradellas Concentrations and specific loads of UV filters in sewage sludge originating
from a monitoring network in Switzerland, Chemosphere 62 (2006) 915–25.
a) Describe the data. Do a visual inspection to check for differences between the treatment
types and between the months of data aquirement. Use an appropriate plot function to do
so. Describe your results.
Hint: Also try the function table()
b) Fit a one-way ANOVA with log(OC) as response variable and Behandlung as explanatory
variable.
Hint: use lm and perform an anova on the output. Don’t forget to check model assumptions.
c) Extend the model to a two-way ANOVA by adding Monat as a predictor. Interpret the
summary table.
d) Test if there is a significant interaction between Behandlung and Monat. Compare the
result with the output of interaction.plot
e) Extend the model from (b) by adding Produktion as an explanatory variable. Perform an
anova on the model output and interpret the summary table. (Such a model is sometimes
called Analysis of Covariance, ANCOVA).
Switch the order of your explanatory variables and run an anova on both model outputs.
Discuss the results of Behandlung + Produktion and Produktion + Behandlung. What
causes the differences?
Problem 11.3 (ANOVA table) Calculate the missing values in the following table:
library( aov( yield ~ block + N + P + K, digits=npk))
##
## block
## N
## P
## K
## Residuals
Df Sum Sq Mean Sq F value
5 343.3
1
8.4
8.40
1
95.2
5.946
15
16.01
Pr(>F)
0.00366
How many observations have been taken in total? Do we have a balanced or complete fourway setting?
Problem 11.4 (ANCOVA) The dataset rats of the package faraway consists of two-way
ANOVA design with factors poison (three levels I, II and III) and treatment (four levels A, B,
214
CHAPTER 11. ANALYSIS OF VARIANCE
C and D). To study the toxic agents, 4 rats were exposed to each pair of factors. The response
was survival time in tens of hours.
a) Test the effect of poison and treatment.
b) Would a transformation of the response variable improve the model?
Problem 11.5 (ANCOVA) The perceived stress scale (PSS) is the most widely used psychological instrument for measuring the perception of stress. It is a measure of the degree to which
situations in one’s life are appraised as stressful.
The dataset PrisonStress from the package PairedData gives the PSS measurements for 26
people in prison at the entry and at the exit. Part of these people were physically trained during
their imprisonment.
a) Describe the data. Do a visual inspection to check for differences between the treatment
types and PSS.
b) Propose a statistical model.
Problem 11.6 (BMJ Endgame) Discuss and justify the statements about ‘One way analysis of
variance’ given in doi.org/10.1136/bmj.e2427.
Chapter 12
Design of Experiments
Learning goals for this chapter:
⋄ Understand the issues and principles of Design of Experiments
⋄ Compute sample size for an experiment
⋄ Compute power of test
⋄ Describe different setups of an experiment
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter12.R.
Design of Experiments (DoE) is a relatively old field of statistics. Pioneering work has been
done almost 100 years ago by Sir Ronald Fisher and co-workers at Rothamsted Experimental
Station, England, where mainly agricultural questions have been discussed. The topic has been
taken up by the industry after the second world war to, e.g., optimize production of chemical
compounds, work with robust parameter designs. In recent decades, advances are still been made
on for example using the abundance of data in machine learning type discovery or in preclinical
and clinical research were the sample sizes are often extremely small.
In this chapter we will selectively cover different aspects of DoE, focusing on sample size
calculations, power and randomization. Additionally, we also cover a few domain specific concepts
and terms that are often used in the context of setting up experiments for clinical or preclinical
trials.
12.1
Basic Idea and Terminology
Often statisticians are consulted after the data has been collected. Moreover, if the data does
not “show” what has been hypothesized, frantic visits to “data clinics” are done. Along the same
lines, Sir Ronald Fisher once stated “To consult the statistician after an experiment is finished
is often merely to ask him to conduct a post mortem examination. He can perhaps say what the
experiment died of.” (Fisher, 1938, Presidential Address to the First Indian Statistical Congress).
215
216
CHAPTER 12. DESIGN OF EXPERIMENTS
Here, the term ‘experiment’ describes a controlled procedure that is (hopefully) carefully
designed to test one (or very few) scientific hypothesis. In the context of this chapter, the
hypothesis is often about the effect of independent variables on one dependent variable, i.e., the
outcome measure. In terms of our linear model equation (10.2), what is the effect of one or several
of the xiℓ on the Yi . Designing the experiment implies the choice of the independent variables
(we need to account for possible confounders, or effect modifiers), the values thereof (fixed at
particular “levels” or randomly chosen) and sample size. Again in terms of our model, we need
to include and well specify all necessary predictors xiℓ that have an effect on Yi . Finally, we
need to determine the sample size n such that the desired effects – if they exist – are statistically
significant.
The prime paradigm is of DoE is
Maximize primary variance, minimize error variance and control for secondary variance.
which translates to maximize the signal we are investigating, minimize the noise we are not
modeling and control for uncertainties with carefully chosen independent variables.
In the context of DoE we often want to compare the effect of a treatment (or procedure) on
an outcome. Examples that have been discussed in previous chapters are: “Is there a progression
of pododermatitis at the hind paws over time?”, “Is a diuretic medication during pregnancy
reducing the risk of pre-eclampsia?”, “How much can we increase hardness of metal springs with
lower temperatures of quenching baths?”, “Is residual octocrylene in waste water sludge linked
to particular waste water types?”.
To design an experiment, it is very important to differentiate between exploratory or confirmatory research questions. An exploratory experiment tries to discover as much as possible
about the sample or the phenomenon under investigation, given time and resource constraints.
Whereas in a confirmatory experiment we want to verify, to confirm, or to validate a result,
which was often derived from an earlier exploratory experiment. Table 12.1 summarizes both
approaches in a two-valued setting. Some of the design elements will be further discussed in later
sections of this chapter. The binary classification should be understood within each domain: few
observations in one domain could be many in another one. In both situations and all scientific
domains, however, a proper statistical analysis is crucial.
Table 12.1: Differences between exploratory or confirmatory experiments.
Design feature
Exploratory
Confirmatory
Design space
Subjects
Environment
Treatments
Outcomes
Hypotheses
Statistical tests
Inferences about population
Large
Heterogeneous
Varied
Many
Many
Loose and flexible
Generate hypotheses
Not possible
Small
Homogeneous
Standardized
Few
Few
Narrow and predefined
Confirm/reject hypotheses
Possible with rigorous designs
217
12.2. SAMPLE SIZE CALCULATIONS
12.2
Sample Size Calculations
When planning an experiment, we should always carefully evaluate sample size n, that is require
to be able to properly conclude our hypothesis. In many cases sample size needs to be determined
before starting the experiment: organizing (time-wise) the experiment, acquire necessary funds,
filing study protocols or submitting a license to an ethic commission. As a general rule, we
choose as many as possible but as few as necessary samples to balance statistical and economic
interest.
12.2.1
Experimental and Response Units
Suppose that we test the effect of a dietary treatment for female rabbits (say, with and without
a vitamin additive) on the weight of the litter within two housing boxes. Each doe (i.e., female
reproductive rabbit) in the box receives the same treatment, i.e., the treatment is not applied to
a single individual subject and could not be individually controlled for. All does form a single,
so-called experimental unit. The outcomes or responses are measured on the response units,
which are typically “smaller” than the experimental units. In our example, we would weight the
litter of each doe in the housing box individually, but aggregate or average these to a single
number. As a side note, this average often justifies the use of a normal response model when an
experimental unit consists of several response units.
Formally, experimental units are entities which are independent of each other and to which
it is possible to assign a treatment or intervention independently of the other units. The experimental unit is the unit which has to be replicated in an experiment. Below, when we talk about
a sample, we are talking about a sample of experimental units. We do not discuss the choice
of the response units here as it is most often situation specific were a statistician has little to
contribute.
12.2.2
Case of Single Confidence Interval
In this section, we link the sample size to the width of three different confidence intervals.
Specifically, we discuss the necessary number of observations required such that the width of the
empirical confidence interval has a predefined size.
To start, assume that we are in the setting of a simple z confidence interval at level (1 − α)
with known σ, as seen Equation (4.31). If we want to ensure an empirical interval width ω, we
need
2
n ≈ 4z1−α/2
σ2
ω2
(12.1)
observations. We use an approximation relation because sample size is determined as an integer.
In this setting, the right-hand-side of (12.1) does not involve the data and thus the width of the
confidence interval is in any case guaranteed. Note that in order to reduce the width by a factor
two, we need to quadruple the sample size.
The same approach is used when estimating a proportion. We use, for example, the Wald
218
CHAPTER 12. DESIGN OF EXPERIMENTS
confidence interval (6.10) to get
2
n ≈ 4z1−α/2
pb(1 − pb)
,
ω2
(12.2)
which corresponds to (12.1) with the the plug-in estimate pb(1 − pb) for σ
b2 in the setting of
a Bernoulli random variable and central limit approximation for X/n. Of course, pb is not
known a priori and we often take the conservative choice of pb = 1/2 as the function x(1 − x)
is maximized over (0, 1) at x = 1/2. Thus, without any prior knowledge on p we may choose
conservatively n ≈ (z1−α/2 /ω)2 . Alternatively, the sample size calculation can be done based on
a Wilson confidence interval (6.11), where a quadratic equation needs to be solved to obtain n
(see Problem 12.1.a).
If we are estimating a Pearson’s correlation coefficient, we can use CI 6 to link interval width
ω with n. Here, we use an alternative approach and would like to determine sample size such
that the interval does not contain the value zero, i.e., the width is just smaller than 2r. The
derivation relies on the duality of tests and confidence intervals (see Section 5.4). Recall Test 14
for Pearson’s correlation coefficient. From Equation (9.3) we construct the critical value for the
test (boundary of the rejection region, see Figure 5.3) and based on that we can calculate the
minimum sample size necessary to detect a correlation |r| ≥ rcrit as significant:
√
n−2
tcrit
=⇒
rcrit = q
.
(12.3)
tcrit = rcrit q
2
1 − rcrit
n − 2 + t2crit
Figure 12.1 illustrates the least significant correlation for specific sample sizes. Specifically, with
sample size n < 24 correlations between −0.4 and 0.4 are not significant and for a correlation of
±0.25 to be significant, we require n > 62 at level α = 5% (see R-Code 12.1).
R-Code 12.1 Significant correlation for specific sample sizes (See Figure 12.1.)
rcrit <- function(n, alpha=.05) {
tcrit <- qt( 1 - alpha/2, df=n-2)
return( tcrit / sqrt( n-2 + tcrit^2))
}
curve( rcrit(x), from=3, to=200, xlab="n", ylab="rcrit", ylim=c(0,1), yaxs='i')
round( c(rcrit(25), uniroot( function(x) rcrit(x)-0.25, c(50,100))$root), 2)
## [1]
0.40 62.02
abline( v=63, h=0.25, col='gray')
12.2.3
Case of t-Tests
Sample sizes are most often determined to be able to “detect” an alternative hypothesis with
a certain probability. That means we need to link the sample size with the power 1 − β of a
particular statistical test.
219
0.6
0.4
0.0
0.2
rcrit
0.8
1.0
12.2. SAMPLE SIZE CALCULATIONS
0
50
100
150
200
n
Figure 12.1: Significant correlation for specific sample sizes (at level α = 5%). For
a sample correlation of 0.25, n needs to be larger than 62 as indicated with the gray
lines. For a particular n, correlations above the line are significant, below are not. (See
R-Code 12.1.)
As a simple example, we consider a one-sided z-test with H0 : µ ≤ µ0 and H1 : µ > µ0 . The
Type II error probability for the true mean µ1 is
µ 0 − µ1 √
.
(12.4)
β = β(µ1 ) = P(H0 not rejected given µ = µ1 ) = · · · = Φ z1−α +
σ/ n
Suppose we want to “detect” the alternative with probability 1 − β(µ1 ), i.e., reject the null
hypothesis with probability 1 − β when the true mean is µ1 . Hence, plugging the values in (12.4)
and solving for n we have approximate sample size
z1−α + z1−β 2
n≈ σ
.
(12.5)
µ0 − µ1
Hence, the sample size depends on the Type I and II error probabilities as well as the standard
deviation and the difference of the means. The latter three quantities are often combined to the
standardized effect size
δ=
µ0 − µ1
,
σ
(12.6)
If the standard deviation is not known, an estimate can be used. An estimate based version of
δ is often termed Cohen’s d .
For a two-sided test, a similar expression is found where z1−α is replaced by z1−α/2 . For a
one-sample t-test (Test 1) the right hand side of (12.5) is analogue with the quantiles tn−1,1−α
and tn−1,1−β respectively. Note that now the right hand side depends on n as well as on the
quantiles. To determine n, we start with a reasonable value for n to obtain the quantiles,
calculate the resulting n and repeat the two steps for at least one more iteration. In R, the
function power.t.test() uses a numerical approach.
In the case of two independent samples, the degrees of freedom in the t-quantiles need to be
adjusted from n − 1 to n − 2. Cohen’s d is defined as (x1 −x2 )/sp , where s2p is an estimate of the
pooled variance (e.g., as given in Test 2).
For t-tests in the behavioral sciences, Cohen (1988) defined small, medium and large (standardized) effect sizes as d = 0.2, 0.5 and 0.8, respectively. These are often termed the conventional
220
CHAPTER 12. DESIGN OF EXPERIMENTS
effect sizes but depend on the type of test, see also the function cohen.ES() of the R package
pwr. In preclinical studies, effect sizes are typically larger.
Example 12.1. In the setting of a two-sample t-test with equal group sizes, we need at level
α = 5% and power 1 − β = 80% in each group 26, 64 and 394 observations for a large, medium
and small effect size, respectively, see, e.g., power.t.test( d=0.2, power=0.8) or pwr.t.test(
d=0.2, power=0.8) from the pwr package.
For unequal sample sizes, the sum of both group sizes is a bit larger compared to equal sample
sizes (balanced setting). For a large effect size, we would, for example, require n1 = 20 and n2 =
35, leading to three more observations compared to the balanced setting, (pwr.t2n.test(n1=20,
d=0.8, power=.8) from the pwr package).
♣
Many R packages and other web interfaces allow to calculate sample sizes for many different
settings of comparing means. Of course, care is needed when applying such toolboxes whether
they use the same parametrizations, definitions etc.
12.3
Blocking, Randomization and Biases
In many experiments the subjects or generally the experimental units are inherently heterogeneous with respect to factors that we are not of interest. This heterogeneity may imply variability
in the data, masking the effect we would like to study. Blocking is a technique for dealing with
this nuisance heterogeneity. Hence, we distinguish between the treatment factors that we are
interested in and the nuisance factors which have some effect on the response but are not of
interest to us.
The term blocking originated in agricultural experiments where it refers to a set of plots of
land that have very similar characteristics with respect to crop yield, in other words they are
homogeneous.
If a nuisance factor is known and controllable, we use blocking and control for it by including
a blocking factor in the experiment. Typical, blocking factors are factory, production batch, or
sex. These are controllable in the sense that we are able to choose which factor to include
If a nuisance factor is known and uncontrollable, we may use the concept of ANCOVA, i.e.,
to remove the effect of the factor. Suppose that age has an effect on the treatment. It is not
possible to control for age and creating age batches may not be efficient either. Hence we include
age in our model. This approach is less efficient than blocking as we do correct for the design
compared to design the experiment to account for the factor.
Unfortunately, there are also unknown and uncontrollable nuisance factors. To protect for
these we use proper randomization such that their impact is balanced in all groups. Hence, we
can see randomization as a insurance against systematic biases due to nuisance factors.
12.3.1
Randomization
Randomization is the mechanism to assign a treatment to an experimental unit by pure chance.
Randomization ensures that – on average – the only systematic difference between the groups is
12.3. BLOCKING, RANDOMIZATION AND BIASES
221
the treatment, all other effects that are not accounted for are averaged out. Proper randomization
also protects against spurios correlations in the observations.
The randomization procedure should be a truly randomized procedure for all assignments,
ultimately performed by a genuine random number generator. There are several procedures,
simple randomization, balanced or constrained randomization, stratified randomization etc. and
of course, the corresponding sample sizes are determined a priori.
Simple randomization randomly assigns the type of treatment to an experimental unit. For
example, to assign 12 subjects to three groups we use sample(x=3, size=12, replace=TRUE).
This procedure has the disadvantage of leading to a possibly unbalanced design and thus should
not be used in practice.
A practical randomization scheme is a completely randomized design (CRD) in which we
randomly assign the type of treatment to an experimental unit constrained that in all groups
the same number of subjects are assigned to a specific treatment (conditional on appropriate
sample size). This can be achieved by numbering the subjects, followed by random drawing
of the corresponding numbers and finnaly putting the corresponding subjects in the appropriate four groups: matrix(sample(x=12, size=12, replace=FALSE), nrow=3). The four groups
themselves should also be randomized: sample(c(’A’,’B’,’C’)). In a single call, we have
set.seed(1); matrix(sample(x=12, size=12, replace=FALSE), nrow=3, dimnames=list(sample(c(’A’,’B
When setting up an experiment, we need to carefully eliminate possible factors or variables
that influence both, the dependent variable and independent variable and thus causing a spurious
association. For example, smoking causes lung cancer and yellow fingers. It is not possible to
conclude that yellow fingers cause lung cancer. Similarly, there is a strong association between
the number of fire men at the incident and the filed insurance claim (the more men the larger
the claim). The confounding here is the size of the fire, of course. In less striking words, we
often have to take into account slope, aspect, altitude when comparing the yield of different fields
or the sex, age, weight, pre-existing conditions when working with human or animal data. Not
taking into account these factors will induce biases in the results.
In the case of discrete confounders it is possible to split your sample into subgroups according to these pre-defined factors. These subgroups are often called blocks (when controllable) or
strata (when not). To randomize, randomized complete block design (RCBD), a stratified randomization, is used. In RCBD each block receives the same amount of experimental units and
each block is like a small CRD experiment.
Suppose we have six male and six female subjects. The randomization is done for both the
male and female groups according to cbind( matrix(sample(x=6, size=6, replace=FALSE),
nrow=3), matrix(sample(x=7:12, size=6, replace=FALSE), nrow=3)).
Example 12.2. Suppose we are studying the effect of fertilizer type on plant growth. We have
access to 24 plant pots arranged in a four by six alignment inside a glass house. We have three
different fertilizer and one control group.
To allocate the fertilizers to the plants we number the plants row by row (starting top right).
We assign randomly the 24 pots randomly to four groups of four. The fertilizers are then
222
CHAPTER 12. DESIGN OF EXPERIMENTS
additionally randomly assigned to the different groups. This results in a CRD design (left panel
of Figure 12.2).
Suppose that the glass house has one open wall. This particular setup affects plant growth
unequally, because of temperature differences between the open and the opposite side. To account
for the difference we block the individual rows and we assign randomly one fertilizer to each plant
in every row. This results in a RCBD design (right panel of Figure 12.2).
♣
Figure 12.2: Different randomization of 24 plants.
Source online.stat.psu.edu/stat502/lesson/6.
CRD (left), RCBD (right).
RCBD are used in practice and might be seen on fields of agricultural education and research
stations, as illustrated in Figure 12.3.
Remark 12.1. In certain situations there are two different EUs in the same experiment. As
illustration suppose we are studying the effect of irrigation amount and fertilizer type on crop
Figure 12.3: Grain field with experimental plots. (Photo: R. Furrer).
12.3. BLOCKING, RANDOMIZATION AND BIASES
223
yield. If irrigation is more difficult to vary on a small scale and fields are large enough to be
split, a split plot design becomes appropriate. Irrigation levels are assigned to whole plots by
CRD and fertilizer is assigned to subplots using RCBD (irrigation is the block).
Split-plot experiments are not straight-forward to model statistically and we refer to followup
lectures.
♣
12.3.2
Biases
Similar to the bias of an estimator we denote any systematic error of an experiment as a bias.
Biases are introduced at the design of the study, at the experiment itself and at the analysis
phase. Inherently, these biases are not part of the statistical model and thus induce biases in the
estimates and/or inflate variance of the estimates.
There are many different types of biases and we explain these in its simplest from, assuming
an experimental setup of two groups comparing one treatment to a control.
• Selection bias occurs when the two groups differ systematically next to the effect that is
analyzed.
When studies are not carefully implemented, it is possible to associate smaller risk for
dementia for smokers compared to non-smokers. Here, selection bias occurs by smokers
having a lower life-expectancy compared to non-smokers and thus having fewer dementia
cases (Hernán et al., 2008)
• Confirmation bias occurs when the experimenter or analyziser searches for a confirmation
in the experiment.
In a famous study, students were told that there exist two different types of rats, “maze
bright” and “maze dull” rats where the former are genetically more apt to navigate in a
maze. The result of an experiment by the students showed that the maze bright ones did
perform systematically better than the maze dull ones (Rosenthal and Fode, 1963).
• Confounding is a bias that occurs due to another factor that distorts the relationship
between treatment and outcome.
There are many classical examples of confounding bias. One is that coffee drinkers have a
larger risk of lung cancers. Such studies typically neglect that there is a larger proportion of
smokers that are coffee drinkers, than non-smoking coffee drinkers (Galarraga and Boffetta,
2016).
• in certain cohort studies, subjects do not receive the treatment immediately after they have
entered the study. The time span between entering and receiving the treatment is called
the immortal time and this needs to be taken into account when comparing to the control
group.
A famous example was a study that wrongly claimed that Oscar laureates live about four
years longer than comparable actors without (Redelmeier and Singh, 2001). It is not
sufficient to group the actors in two groups (received Oscar or not) and then to match the
actors in both groups.
224
CHAPTER 12. DESIGN OF EXPERIMENTS
• Performance bias is when a care giver or analyzer treats the subjects of the two groups
differently.
• Attrition bias occurs when participants do not have the same drop-out rate in the control
and treatment groups.
There are many more types of biases, see, e.g., catalogofbias.org/biases/.
12.3.3
Some Particular Terms
This section summarizes different terms that are used in the context of design of experiments.
An intervention is a process where a group of subjects (or experimental units) is subjected
to such a surgical procedure, a drug injection, or some other form of treatment (intervention).
Control has several different uses in design. First, an experiment is controlled because scientists assign treatments to experimental units. Otherwise, we would have an observational study.
Second, a control treatment is a “standard” treatment that is used as a baseline or basis of comparison for the other treatments. This control treatment might be the treatment in common
use, or it might be a null treatment (no treatment at all). For example, a study of new pain
killing drugs could use a standard pain killer as a control treatment, or a study on the efficacy
of fertilizer could give some fields no fertilizer at all. This would control for average soil fertility
or weather conditions.
Placebo is a null treatment that is used when the act of applying a treatment has an effect.
Placebos are often used with human subjects, because people often respond to any treatment:
for example, reduction in headache pain when given a sugar pill. Blinding is important when
placebos are used with human subjects. Placebos are also useful for nonhuman subjects. The
apparatus for spraying a field with a pesticide may compact the soil. Thus we drive the apparatus
over the field, without actually spraying, as a placebo treatment. In case of several factors, they
are combined to form treatments. For example, the baking treatment for a cake involves a given
time at a given temperature. The treatment is the combination of time and temperature, but we
can vary the time and temperature separately. Thus we speak of a time factor and a temperature
factor. Individual settings for each factor are called levels of the factor.
A randomized controlled trial (RCT) is study in which people are allocated at random to
receive one of several clinical interventions. One of these interventions is the standard of comparison or control. The control may be a standard practice, a placebo, a sham treatment or no
intervention at all. Someone who takes part in a RCT is called a participant or subject. RCTs
seek to measure and compare the outcomes after the participants received their intervention.
Because the outcomes are measured, RCTs are quantitative studies.
In sum, RCTs are quantitative, comparative, controlled experiments in which investigators
study two or more interventions in a series of individuals who receive them in random order.
The RCT is one of the simplest and most powerful tools in clinical research but often relatively
expensive.
Confounding occurs when the effect of one factor or treatment cannot be distinguished from
that of another factor or treatment. The two factors or treatments are said to be confounded.
Except in very special circumstances, confounding should be avoided. Consider planting corn
225
12.4. DOE IN THE CLASSICAL FRAMEWORK
Systematic reviews
Randomized controlled trials (RCT)
Non−randomized controlled trials
Observational studies with comparison groups
Case reports and case studies
Expert opinion
Figure 12.4: (One possible) presentation of the evidence-based medicine pyramid.
variety A in Minnesota and corn variety B in Iowa. In this experiment, we cannot distinguish
location effects from variety effects: the variety factor and the location factor are confounded.
Blinding occurs when the evaluators of a response do not know which treatment was given
to which unit. Blinding helps prevent bias in the evaluation, even unconscious bias from wellintentioned evaluators. Double blinding occurs when both the evaluators of the response and
the subject (experimental units) do not know the assignment of treatments to units. Blinding
the subjects can also prevent bias, because subject responses can change when subjects have
expectations for certain treatments.
Before a new drug is admitted to the market, many steps are necessary: starting from a
discovery based step toward highly standardized clinical trials (type I, II and III). At the very
end, there are typically randomized controlled trials, that by design (should) eliminate all possible
confounders.
At later steps, when searching for an appropriate drug, a decision may be available based on
“evidence”: what has been used in the past, what has been shown to work (for similar situations).
This is part of evidence-based medicine. Past information may be of varying quality, ranging
from ideas opinions to case studies to RCTs or systematic reviews. Figure 12.4 represents a socalled evidence-based medicine pyramid which reflects the quality of research designs (increasing)
and quantity (decreasing) of each study design in the body of published literature (from bottom
to top). For other scientific domains, similar pyramids exist, with bottom and top typically
remaining the same.
12.4
DoE in the Classical Framework
DoE in the Fisher sense is heavily ANOVA driven by his analysis of the crop experiments at
Rothamsted Experimental Station and thus in many textbooks DoE is equated to the discussion
of ANOVA. Here, we have separated the statistical analysis in Chapter 11 from the conceptual
setup of the experiment in this chapter.
In a typical ANOVA setting we should strive to have the same amount of observations in
each cell (for all settings of levels). Such a setting is called a balanced design (otherwise it is
unbalanced). If every treatment has the same number of observations, effect of unequal variances
are mitigated.
226
CHAPTER 12. DESIGN OF EXPERIMENTS
P
In a simple regression setting, the standard errors of βb0 and βb1 depend on 1/ i (xi −x)2 , see
expressions for the estimates (9.12) and (9.13). Hence, to reduce the variability of the estimates,
P
2
we should increase
i (xi − x) as much as possible. Specifically, suppose the interval [ a, b ]
represents a natural range for the predictor, then we should choose half of the predictors as a
and the other half as b.
This last argument justifies a discretization of continuous (and controllable) predictor variables in levels. Of course this implies that we expect a linear relationship. If the relationship is
not linear, such a discretization may be devastating.
12.4.1
Sample Size in a One-Way and Two-Way ANOVA
To illustrate the difficulty of sample size calculation in an ANOVA context we consider a one-way
example. Recall the one-way model
Yij = µi + εij
(12.7)
see (11.1) for detailed assumptions. We reject H0 : µ1 = µ1 = · · · = µI when (SSA /(I −
1)) (SSE /(N − I)) = MSA MSE is larger than the 1 − α-quantile of an F -distribution with I − 1
and N − I the degrees of freedom.
The difficulty is in specifying the alternative. One often chooses the setting where all but
two group means are equal and the two deviate by ∆/2.
In R, the function Fpower1() from the package daewr can be used to calculate the power and
resulting sample size.
In the framework of two-way ANOVA, the same difficulties hold. In R, the function Fpower2()
from the package daewr can be used.
12.4.2
Sums of Squares for Unbalanced Two-Way ANOVA
If we have a balanced two-way ANOVA, the sums of squares partition is additive due to “orthogonality” induced by the equal number in each cell and thus we have
SST = SSA + SSB + SSAB + SSE
(12.8)
(see also Equation (11.24)). In the unbalanced setting this is not the case and the decomposition
depends on the order which we introduce the factors in the model. That means, the ANOVA
table of aov(y˜f1+f2) is not the same as aov(y˜f2+f1). We should consider the ANOVA table
as sequential: each additional factor (row in the table), we reduce the remaining variability.
Hence, we should rather write
SST = SSA + SSB|A + SSAB|A,B + SSE ,
(12.9)
where the term SSB|A indicates the sums of squares of factor B after correction of factor A. and
similarly, term SSAB|A,B indicates the sums of squares of the interaction AB after correction of
factors A and B.
This concept of sums of squares after correction is not new. We have encountered this type
of correction already: SST is actually calculated after correcting for the overall mean.
227
12.5. BIBLIOGRAPHIC REMARKS
Equation (12.9) represents the sequential sums of squares decomposition, called Type I sequential SS : SSA and SSB|A and SSAB|A,B . It is possible to show that SSB|A = SSA,B − SSA ,
where the former is the classical sums of squares of a model without interactions. An ANOVA
table such as given in Table 11.3 yields different p-values for H0 : β1 = · · · = βI = 0 and
H0 : γ1 = · · · = γJ = 0 if the order of the factors is exchanged. This is often a disadvantage and
for the F -test the so-called Type II partial SS, being SSA|B and SSB|A can be used. As there is
no interaction involved, we should use Type II only if the interaction is not significant (in which
case it is to be preferred over Type I). Alternatively, Type III partial SS, SSA|B,AB and SSB|A,AB ,
may be used.
In R, the output of aov, or anova are Type I sequential SS. To obtain the other types, manual
calculations may be done or using the function Anova(..., type=i) from the package car.
Example 12.3. Consider Example 11.2 in Section 11.4 but we eliminate the first observation
and the design is unbalanced in both factors. R-Code 12.2 calculates the Type I sequential SS
for the same order as in R-Code 11.3. Type II partial SS are subsequently slightly different.
Note that the design is balanced for the factor Month and thus simply exchanging the order
does not alter the SS here.
♣
R-Code 12.2 Type I and II SS for UVfilter data without the first observation.
require( car)
lmout2 <- lm( log(OT) ~ Month + Treatment, data=UV, subset=-1) # omit 1st!
print( anova( lmout2), signif.stars=FALSE)
## Analysis of Variance Table
##
## Response: log(OT)
##
Df Sum Sq Mean Sq F value Pr(>F)
## Month
1
1.14
1.137
4.28 0.053
## Treatment 2
5.38
2.692
10.12 0.001
## Residuals 19
5.05
0.266
print( Anova( lmout2, type=2), signif.stars=FALSE)
# type=2 is default
## Anova Table (Type II tests)
##
## Response: log(OT)
##
Sum Sq Df F value Pr(>F)
## Month
1.41 1
5.31 0.033
## Treatment
5.38 2
10.12 0.001
## Residuals
5.05 19
12.5
Bibliographic Remarks
Devore (2011) derives sample size calculations for many classical tests in an accessible fashion.
228
CHAPTER 12. DESIGN OF EXPERIMENTS
An online lecture about DoE is available at https://online.stat.psu.edu/stat503/home.
There are many different online apps to calculate sample sizes, for example https://w3.
math.uzh.ch/shiny/git/reinhard.furrer/SampleSizeR/. For specific operating systems the free
software G*Power calculates power and sample sizes for various settings (www.psychologie.hhu.
de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower.html.
For in-vivo studies, The Experimental Design Assistant https://eda.nc3rs.org.uk/ is a very
helpful tool. Classical approaches are PREPARE guidelines (planning guidelines before the
study) https://norecopa.no/PREPARE and ARRIVE guidelines (reporting guidelines after the
study) https://www.nc3rs.org.uk/arrive-guidelines.
12.6
Exercises and Problems
Problem 12.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Compare sample sizes when using Wilson and Wald type confidence intervals for an proportion.
b) In the context of a simple regression, the variances of βb0 and βb1 are given by
Var(βb0 ) = σ 2
1
n
x2
,
2
i (xi − x)
+P
Var(βb1 ) = σ 2 P
1
.
2
i (xi − x)
Assume that x1 , . . . , xn ∈ [a, b], with n even. Show that the variances are minimized by
choosing half at a and the other half at b. What if n is odd?
Problem 12.2 (Study design) Consider a placebo-controlled trial for a treatment B (compared
to a placebo A). The clinician proposes, to use ten patients, who first receive the placebo A and
after a long enough period treatment B. Your task is to help the clinician to find an optimal
design with at most 20 treatments and with at most 20 patients available.
a) Describe alternative designs, argue regarding which aspects those are better or worse than
the original.
b) Give an adequate statistical test for each of your suggestions.
Problem 12.3 (Sample size calculation) Suppose we compare the mean of some treatment
in two equally sized groups. Let zγ denote the γ-quantile of the standard normal distribution.
Furthermore, the following properties are assumed to be known or fixed:
• clinically relevant difference ∆ = µ1 − µ0 , we can assume without loss of generality that
∆>0
• standard deviation of the treatment effect σ (same in both groups)
• Power 1 − β.
229
12.6. EXERCISES AND PROBLEMS
• Type I error rate α.
a) Write down the suitable test statistic and its distributions under the null hypothesis.
b) Derive an expression for the power using the test statistic.
c) Prove analytically that the required sample size n in each group is at least
n=
2σ 2 (z1−β + z1−α/2 )2
∆2
Problem 12.4 (Sample size and group allocation) A randomized clinical trial to compare
treatment A to treatment B is being conducted. To this end 20 patients need to be allocated to
the two treatment arms.
a) Using R, randomize the 20 patients to the two treatments with equal probability. Repeat
the randomization in total a 1000 times retaining the difference in group size and visualize
the distribution of the differences with a histogram.
b) In order to obtain group sizes that are closer while keeping randomization codes secure a
random permuted block design with varying block sizes 2 and 4 and respective probabilities
0.25 and 0.75 is now to be used. Here, for a given length, each possible block of equal
numbers of As and Bs is chosen with equal probability. Using R, randomize the 20 patients
to the two treatments using this design. Repeat the randomization in total a 1000 times
retaining the difference in group size. What are the possible values this difference my take?
How often did these values occur?
Problem 12.5 (BMJ Endgame) Discuss and justify the statements about ‘Sample size: how
many participants are needed in a trial?’ given in doi.org/10.1136/bmj.f1041.
230
CHAPTER 12. DESIGN OF EXPERIMENTS
Chapter 13
Bayesian Approach
Learning goals for this chapter:
⋄ Describe the fundamental differences between the Bayesian and frequentist
approach
⋄ Describe the Bayesian terminology
⋄ Explain how to compute posterior probability
⋄ Interpret posterior probability and a posterior credible interval
⋄ Explain the idea of the Bayes factor and link it to model selection
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter13.R.
In statistics there exist two different philosophical approaches to inference: frequentist and
Bayesian inference. Past chapters dealt with the frequentist approach; now we introduce the
Bayesian approach. Here, we consider the parameter as a random variable with a suitable
distribution, which is chosen a priori, i.e., before the data is collected and analyzed. The goal is
to update this prior knowledge after observation of the data in order to draw conclusions (with
the help of the so-called posterior distribution).
As an example, suppose I have misplaced my cordless phone in my room. I should start
search in rooms where the chance of finding it is highest. Based on past experience, the chance
that I might have left the phone in the living room is much larger than on a shelf in the pantry.
But there is not much of a difference between the sofa of living room and counter in the kitchen.
If I use the locator button on the base station, the phone starts to beep and I can use this
information to update and tailor my search.
231
232
13.1
CHAPTER 13. BAYESIAN APPROACH
Motivating Example and Terminology
Bayesian statistics is often introduced by recalling the so-called Bayes theorem, which states for
two events A and B
P(A | B) =
P(B | A) P(A)
,
P(B)
for P(B) ̸= 0,
(13.1)
and is shown by using twice Equation (2.5). Bayes theorem is often used in probability theory
to calculate probabilities along an event tree, as illustrated in the arch-example below.
Example 13.1. A patient sees a doctor and gets a test for a (relatively) rare disease. The
prevalence of this disease is 0.5%. As typical, the screening test is not perfect and has a sensitivity
of 99%, i.e., true positive rate; properly identified the disease in a sick patient, and a specificity
of 98%, i.e., true negative rate; a healthy person is correctly identified disease free. What is the
probability that the patient has the disease provided the test is positive?
Denoting the events D = ‘Patient has disease’ and + = ‘test is positive’ we can use Equation (13.1) to calculate
P(+ | D) P(D)
P(+ | D) P(D)
=
P(+)
P(+ | D) P(D) + P(+ | Dc ) P(Dc )
99% · 0.5%
=
= 20%.
99% · 0.5% + 2% · 99.5%
P(D | +) =
(13.2)
Note that for the denominator we have used the so-called law of total probability to get an
expression for P(+).
♣
An interpretation of the previous example from a frequentist view is in terms of proportion
of outcomes (in a repeated sampling framework). In the Bayesian approach, we view the probabilities as “degree of belief”, where we have some proposition (event D in Example 13.1) and
new evidence (event + in Example 13.1). More specifically, P(D) represents the prior believe of
our proposition, P(+ | D)/ P(+) is the support of the evidence for the proposition and P(D | +)
is the posterior believe of the proposition after having accounted for the new evidence +.
Extending Bayes’ theorem to the setting of two continuous random variables X and Y along
the definition of the conditional density (8.14) we have
fX|Y =y (x | y) =
fY |X=x (y | x) fX (x)
, for all y such that fY (y) > 0.
fY (y)
(13.3)
In the context of Bayesian inference the random variable X will now be a parameter, typically
of the distribution of Y :
fΘ|Y =y (θ | y) =
fY |Θ=θ (y | θ) fΘ (θ)
, for all y such that fY (y) > 0.
fY (y)
(13.4)
Hence, current knowledge about the parameter is expressed by a probability distribution on the
parameter: the prior distribution. The model for our observations is called the likelihood. We
use our observed data to update the prior distribution and thus obtain the posterior distribution.
233
13.2. BAYESIAN INFERENCE
In the next section, we discuss examples where the parameter is the success probability of a
trial and the mean in a normal distribution.
Notice that P(B) in (13.1), P(+) in (13.2), or fY (y) in (13.3) and (13.4) serves as a normalizing constant, i.e., it is independent of A, D, x or the parameter θ, respectively. Thus, we often
write the posterior without this normalizing constant
fΘ|Y =y (θ | y) ∝ fY |Θ=θ (y | θ) × fΘ (θ),
(13.5)
(or in short form f (θ | y) ∝ f (y | θ)f (θ) if the context is clear). The symbol “∝” means
“proportional to”. For simplicity, we will omit the additional constraint that f (y) > 0.
Finally, we can summarize the most important result in Bayesian inference. The posterior
density is proportional to the likelihood multiplied by the prior density, i.e.,
Posterior density ∝ Likelihood × Prior density
(13.6)
In a nutshell, advantages of using a Bayesian framework are:
• formal way to incorporate priori knowledge;
• intuitive interpretation of the posterior;
• much easier to model complex systems;
• no n-dependency of ‘significance’ and the p-value.
As nothing comes for free, there are also some disadvantages:
• more ‘elements’ have to be specified for a statistical model;
• in virtually all cases, a Bayesian approach is computationally more demanding.
Until recently, there were clear fronts between frequentists and Bayesians. Luckily, these differences have vanished.
13.2
Bayesian Inference
We illustrate the concept of Bayesian inference with two typical examples that are tractable.
13.2.1
Bayesian Estimates
Example 13.2. (beta-binomial model) If we observe y successes (out of n), a frequentist setting
assumes Y ∼ Bin(n, p) and uses as estimate pb = y/n (Section 6.1).
In the Bayesian framework we assume the success probability p as a random variable with
an associated distribution. We require the support of the associated density to be the interval
(0, 1). One example is the uniform distribution U(0, 1) or the so-called Beta distribution, see
Section 13.5.1. The density of a Beta random variable is given by
f (p) = c · pα−1 (1 − p)β−1 ,
p ∈ [0, 1], α > 0, β > 0,
(13.7)
234
CHAPTER 13. BAYESIAN APPROACH
with normalization constant c. We write P ∼ Beta(α, β). Figure 13.6 shows densities for various
pairs (α, β).
If we investigate the probability of a lamb being male, then it is highly unlikely that p < 0.1
or p > 0.9. This additional knowledge about the parameter p would be reflected by using a prior
P ∼ Beta(5, 5), for example.
The posterior density is then proportional to
n y
∝
p (1 − p)n−y × c · pα−1 (1 − p)β−1
y
(13.8)
∝ py pα−1 (1 − p)n−y (1 − p)β−1 = py+α−1 (1 − p)n−y+β−1 ,
(13.9)
which can be recognized as a beta distribution Beta(y + α, n − y + β).
Figure 13.1 illustrates the case of y = 10, n = 13 with prior Beta(5, 5). The posterior mode
is now between the prior mode (0.5) and the frequentist estimate pb.
The expected value of a beta distributed random variable Beta(α, β) is α/(α + β) (here the
prior distribution). The posterior expected value is thus
E(P | Y = y) =
y+α
.
n+α+β
(13.10)
4
Specifically, the mean changed from 0.5 to (10 + 5)/(13 + 5 + 5) ≈ 0.65.
♣
0
1
2
3
Data/likelihood
Prior
Posterior
0.0
0.2
0.4
0.6
^
p
0.8
1.0
p
Figure 13.1: Beta-binomial model with prior density (cyan), data/likelihood (green)
and posterior density (blue).
In the previous example, we use P ∼ Beta(α, β) and fix α and β during model specification,
which is why they are called hyper-parameters.
The beta distribution Beta(1, 1), i.e., α = 1, β = 1, is equivalent to a uniform distribution
U(0, 1). The uniform distribution for the probability p, however, does not mean “informationfree”. As a result of Equation (13.10), a uniform distribution as prior is “equivalent” to two experiments, of which one is a success. That means, we can see the prior as two pseudo-observations.
In the next example, we have the data and the parameter both continuous.
235
13.2. BAYESIAN INFERENCE
iid
Example 13.3. (normal-normal model) Let Y1 , . . . , Yn ∼ N (µ, σ 2 ). We assume σ is known.
The mean µ is the only parameter of interest, for which we assume the prior N (η, τ 2 ). Thus, we
have the Bayesian model:
iid
Yi | µ ∼ N (µ, σ 2 ),
i = 1, . . . , n,
2
µ ∼ N (η, τ ).
(13.11)
(13.12)
where σ 2 , η and τ 2 are considered as hyper-parameters. Notice that we have again slightly abused
the notation by using µ as the realization in (13.11) and as the random variable in (13.12). Since
the context determines the meaning, we use this simplification for the parameters in the Bayesian
context. The posterior density is then
f (µ | y1 , . . . , yn ) ∝ f (y1 , . . . , yn | µ) × f (µ) =
∝
n
Y
i=1
n
Y
i=1
f (yi | µ) × f (µ)
1 (µ − η)2 1 (y − µ)2 i
exp
−
exp −
2
σ2
2
τ2
n
1X
(yi − µ)2 1 (µ − η)2 ∝ exp −
−
,
2
σ2
2
τ2
(13.13)
(13.14)
(13.15)
i=1
where the constants (2πσ 2 )−1/2 and (2πτ 2 )−1/2 do not need to be considered. Through further
manipulation (of the square in µ) one obtains
2 !
1 n
1
n
1 −1 ny
η
+
µ−
+
+
∝ exp −
2 σ2 τ 2
σ2 τ 2
σ2 τ 2
and thus the posterior distribution is
!
1 −1 ny
η
n
1 −1
n
+
+
,
+
.
N
σ2 τ 2
σ2 τ 2
σ2 τ 2
(13.16)
(13.17)
In other words, the posterior expected value
nτ 2
σ2
+
y
= η ω + y (1 − ω)
(13.18)
nτ 2 + σ 2
nτ 2 + σ 2
is a weighted mean of the prior mean η and the mean of the likelihood y. The weights ω =
σ 2 /(nτ 2 + σ 2 ) depend on the variance parameters and on n. With a smaller prior variance τ ,
more weight is given to the prior, for example. The larger n is, the less weight there is on the
prior mean, since σ 2 /(nτ 2 + σ 2 ) → 0 for n → ∞. Typically, the prior is fixed but if more data is
collected, the posterior mean will be closer to the mean of the data and the prior has a weaker
“influence” on the posterior.
Figure 13.2 (based on R-Code 13.1) illustrates the setting of this example with y = 2.1, n = 4
and the hyper-parameters σ 2 = 1, η = 0 and τ 2 = 2. Here, the likelihood is with respect to Y ,
i.e., the likelihood is a function of the parameter µ, given by the density of Y , a Gaussian with
mean y and variance σ 2 /n, see Problem 13.1.a.
♣
E(µ | y1 , . . . , yn ) = η
As a summary statistic of the posterior distribution the posterior mode is often used. Naturally, the posterior median and posterior mean (i.e., expectation of the posterior distribution)
are intuitive alternatives. In the case of the previous example, the posterior mode is the same as
the posterior mean.
236
CHAPTER 13. BAYESIAN APPROACH
R-Code 13.1 Normal-normal model. (See Figure 13.2.)
# Information about data:
ybar <- 2.1;
n <- 4;
sigma2 <- 1
# information about prior:
priormean <- 0;
priorvar <- 2
# Calculating the posterior variance and mean:
postvar <- 1/( n/sigma2 + 1/priorvar)
postmean <- postvar*( ybar*n/sigma2 + priormean/priorvar )
# Plotting follows:
y <- seq(-2, to=4, length=500)
plot( y, dnorm( y, postmean, sqrt( postvar)), type='l', col=4,
ylab='Density', xlab=bquote(mu))
lines( y, dnorm( y, ybar, sqrt( sigma2/n)), col=3)
lines( y, dnorm( y, priormean, sqrt( priorvar)), col=5)
legend( "topleft", legend=c("Data/likelihood", "Prior", "Posterior"),
col=c(3, 5, 4), bty='n', lty=1)
0.4
0.0
0.2
Density
0.6
0.8
Data/likelihood
Prior
Posterior
−2
−1
0
1
2
3
4
µ
Figure 13.2: Normal-normal model with prior (cyan), data/likelihood (green) and
posterior (blue). (See R-Code 13.1.)
13.2.2
Bayesian Confidence intervals
Interval estimation in the frequentist approach results in confidence intervals. But sample confidence intervals need to be interpreted with care, in a context of repeated sampling. A sample
(1 − α)% confidence interval [bu , bo ] contains the true parameter with a frequency of (1 − α)% in
infinite repetitions of the experiment. With a Bayesian approach, we can now make statements
about the parameter with probabilities. In Example 13.3, based on Equation (13.17)
P v −1 m − z1−α/2 v −1/2 ≤ µ ≤ v −1 m + z1−α/2 v −1/2 = 1 − α,
(13.19)
with v = n/σ 2 + 1/τ 2 and m = ny/σ 2 + η/τ 2 . That means that the bounds v −1 m ± z1−α/2 v −1/2
can be used to construct a Bayesian counterpart to a confidence interval.
237
13.2. BAYESIAN INFERENCE
Definition 13.1. The interval R with
Z
f (θ | y1 , . . . , yn ) dθ = 1 − α
(13.20)
R
is called a (1 − α)% credible interval for θ with respect to the posterior density f (θ | y1 , . . . , yn )
and 1 − α is the credible level of the interval.
♢
The definition states that the parameter θ, now seen as a random variable whose posterior
density is given by f (θ | y1 , . . . , yn ), is contained in the (1−α)% credible interval with probability
(1 − α).
Example 13.4 (continuation of Example 13.3). The interval [ 0.94, 2.79 ] is a 95% credible
interval for the parameter µ.
♣
Since the credible interval for a fixed α is not unique, the “narrowest” is often used. This
is the so-called HPD interval (highest posterior density interval). HPD intervals and credible
intervals in general are often determined numerically.
Example 13.5 (continuation of Example 13.2). The 2.5% and 97.5% quantiles of the posterior (13.9) are 0.45 and 0.83, respectively. A HPD is given by the bounds 0.46 and 0.84. The
differences are not pronounced as the posterior density is fairly symmetric. Hence, the widths of
both are almost identical: 0.377 and 0.375.
The frequentist sample 95% CI is [0.5, 0.92], with width 0.42, see Equation (6.9).
13.2.3
♣
Predictive Distribution
In the classical regression framework, the estimated regression line represented the mean of an
unobserved new location. To fully assess the uncertainty of the prediction, we had to take into
account the uncertainty of the estimates and argued that the prediction is given by a t-distribution
(see, CI 7).
In the Bayesian setting, the likelihood f (ynew | θ) can be seen as a the density of the predictive
distribution. That means, the distribution of an unobserved new observation ynew . As the
classical regression framework, using f (ynew | θb ), with θb some Bayesian estimate of the parameter
(e.g., posterior mean or posterior mode). The better approach is based on the posterior predictive
distribution, defined as follows.
Definition 13.2. The posterior predictive distribution of a Bayesian model with likelihood f (y |
θ) and prior f (θ) is
f (ynew | y1 , . . . , yn ) =
Z
f (ynew | θ)f (θ | y1 , . . . , yn ) dθ.
(13.21)
♢
In the previous equation, f (ynew | θ, y1 , . . . , yn ) represents the likelihood and thus there is no
dependency on the data. Hence f (ynew | θ, y1 , . . . , yn ) = f (ynew | θ).
238
CHAPTER 13. BAYESIAN APPROACH
0.00
0.05
0.10
0.15
0.20
0.25
Predictive posterior
Likelihood plugin
0
2
4
6
8
10
12
y
Figure 13.3: Predictive posterior distribution for the beta-binomial model (red) and
likelihood with plugin parameter pb = 10/13 (black).
Example 13.6 (continuation of Example 13.2). In the context of the beta-binomial model, the
posterior predictive distribution is constructed based on the single observation y only
Z 1
f (ynew | y) =
f (ynew | p) × f (p | y) dp
(13.22)
0
Z 1
n
y+α−1
n−y+β−1
n−ynew
ynew
×p
(1 − p)
dp,
(1 − p)
=
c
p
ynew
0
where c is the normalizing constant for the posterior. The integral itself gives us the normalizing
constant of a Beta(ynew + y + α, 2n − ynew − y + β) distribution. We do not recognize this
distribution per se. As illustration, Figure 13.3 shows the posterior predictive distribution based
on the observation y = 10 and prior Beta(5, 5). The prior implies that the posterior predictive
distribution is much more centered compared to the likelihood with plugin parameter pb = 10/13
(i.e., the binomial distribution Bin(13, 10/13)).
R-Code 13.2 Predictive distribution with the beta-binomial model. (See Figure 13.3.)
library(LearnBayes)
n <- 13
y <- 0:n
pred.probs <- pbetap(c( 10+5, 13-10+5), n, y) # prior Beta(5,5)
plot(y, pred.probs, type="h", ylim=c(0,.27), col=2, ylab='')
lines( y+0.07, dbinom(y, size=n, prob=10/13), type='h')
legend("topleft", legend=c("Predictive posterior", "Likelihood plugin"),
col=2:1, lty=1, bty='n')
Even in the simple case of the beta-binomial model, it is not straightforward to derive the
predictive posterior distribution. Quite often, more detailed integration knowledge is required.
In the case of the normal-normal model as introduced in Example 13.3, it is possible to show
2 ), where µ
2
that the posterior predictive distribution is again normal N (µpost , σ 2 + σpost
post , σpost
are the posterior mean and posterior variance as given in (13.17).
239
13.2. BAYESIAN INFERENCE
13.2.4
Bayes Factors
The Bayesian counterpart to hypothesis testing is done through a comparison of posterior probabilities. For example, consider two specific models specified by two hypotheses H0 and H1 . By
Bayes theorem,
P(H0 | y1 , . . . , yn )
P(y1 , . . . , yn | H0 )
P(H0 )
=
×
,
P(H1 | y1 , . . . , yn )
P(y1 , . . . , yn | H1 )
P(H1 )
|
{z
}
|
{z
}
| {z }
Posterior odds
Bayes factor (BF01 )
(13.23)
Prior odds
that means that the Bayes factor BF01 summarizes the evidence of the data for the hypothesis
H0 versus the hypothesis H1 . The Bayes factor is any positive number. However, it has to be
mentioned that a Bayes factor needs to exceed 3 to talk about substantial evidence for H0 . For
strong evidence we typically require Bayes factors larger than 10. More precisely, Jeffreys (1983)
differentiates
1<
very
barely worth
< 3 < substantial < 10 < strong < 30 <
< 100 < decisive
strong
mentioning
For values smaller than one, we would favor H1 and the situation is similar by inverting the
ratio, as also illustrated in the following example.
Example 13.7. We consider the setup of Example 13.2 and compare the models with p = 1/2
and p = 0.8 when observing 10 successes among the 13 trials. To calculate the Bayes factor, we
need to calculate P(Y = 10 | p) for p = 1/2 and p = 0.8. Hence, the Bayes factor is
10
13
3
0.0349
10 0.5 (1 − 0.5)
BF01 = 13
=
= 0.1421,
(13.24)
10
3
0.2457
10 0.8 (1 − 0.2)
which is somewhat substantial (1/0.1421 ≈ 7) in favor of H1 . This is not surprising, as the
observed proportion is pb = 10/13 = 0.77 close to p = 0.8 corresponding to H1 .
♣
In the example above, the hypotheses H0 and H1 are understood in the sense of H0 : θ = θ0
and H1 : θ = θ1 . The situation for an unspecified alternative H1 : θ ̸= θ0 is much more interesting
and relies on using the prior f (θ) and integrating out the parameter θ:
Z
f (y1 , . . . , yn | H1 : θ ̸= θ0 ) = f (y1 , . . . , yn | θ)f (θ) dθ,
(13.25)
illustrated as follows.
Example 13.8 (continuation of Example 13.7). For the situation H1 : p ̸= 0.5 using the prior
Beta(5, 5), we have
Z 1
P(Y = 13 | H1 ) =
P(Y = 13 | p)f (p) dp
0
(13.26)
Z 1 13 10
3
4
4
=
p (1 − p) · c p (1 − p) dp = 0.0704,
10
0
where we used integrate( function(p) dbinom(10,13,prob=p)*dbeta(p, 5,5),0,1). Thus,
BF01 = 0.0349/0.0704 = 0.4957. Hence, the Bayes factor is approximately 2 in favor of H1 , barely
worth calculating the value. Under a uniform prior, the support for H1 only marginally increases
(from 2.017 to 2.046).
♣
240
CHAPTER 13. BAYESIAN APPROACH
Example 13.9 (continuation of Example 13.3, data from Example 5.8). We now look at a
Bayesian extension of the frequentist t-test. For simplicity we assume the one sample setting
without deriving the explicit formulas. The package BayesFactor provides functionality to
calculate Bayes factors for different settings.
We take the pododermatitis scores (see R-Code 5.1). R-Code ?? shows that the Bayesfactor
comparing the null model µ = 3.33 against the alternative µ ̸= 3.33 is approximately 14. Here,
we have used the standard parameter setting which includes the specification of the prior and
the prior variance. The prior variance can be specified with the argument rscale with default
√
0.707 = 2. Increasing this variance leads to a flatter prior and thus to a smaller Bayes factor.
Default priors are typically very reasonable and we come back to the choice of the priors in the
next section.
♣
R-Code 13.3 Bayes factor within the normal-normal model.
library(BayesFactor)
ttestBF(PDHmean, mu=3.33)
# data may be reloaded.
## Bayes factor analysis
## -------------## [1] Alt., r=0.707 : 14.557 ±0%
##
## Against denominator:
##
Null, mu = 3.33
## --## Bayes factor type: BFoneSample, JZS
Bayes factors are popular because they are linked to the BIC (Bayesian Information Criterion) and thus automatically penalize model complexity. Further, they also work for non-nested
models.
13.3
Choice and Effect of Prior Distribution
The choice of the prior distribution belongs to the modeling process just like the choice of the
likelihood distribution. Naturally, the prior should be fixed before the data has been collected.
The examples in the last section were such, that the posterior and prior distributions did
belong to the same class. Naturally, this is no coincidence and such prior distributions are called
conjugate prior distributions.
With other prior distributions we may obtain posterior distributions that we no longer “recognize” and normalizing constants must be explicitly calculated. An alternative approach to
integrating the posterior density is discussed in Chapter 14.
Beside conjugate priors, there are many more classes that are typically discussed in a full
Bayesian lecture. We prefer to classify the effect of the prior instead. Although not universal and
241
13.3. CHOICE AND EFFECT OF PRIOR DISTRIBUTION
quite ambigiuous, we differentiate between informative, weakly informative and uninformative
priors. The first describes a prior that is specific for the data at hand where with different data
the prior typically chances. The prior has potentially a substantial effect on the posterior. Weakly
informative priors do not have a strong influence on the posterior but they may substantially
contribute to the model in situations where the model is ill posed. Finally, an uninformative
prior is such, that the likelihood relates essentially to the posterior in terms of some criterion
like posterior mean.
Uninformative priors are not classical prior distributions. In Example 13.2 we would require
a “beta density” with α = β = 0 in order to have a posterior mean that is equivalent to the
likelihood estimate (see Equation (13.10)). However, for α = β = 0 the normalizing constant
R1
of (13.7) is not finite as 0 p−1 (1 − p)−1 dp diverges. Similarly, in Example 13.3, in order that
E(µ | y1 , . . . , yn ) = y, we need τ → ∞, that means, the prior of µ is “completely constant”
(see Equation (13.18)). As we do not have a bounded range for µ, we have again not a “proper
density”, i.e., a so-called improper prior .
Without going into details, it is possible that for certain improper priors the posterior is a
letigimate density.
For large n the difference between a Bayesian and likelihood estimate is not pronounced.
As a matter of fact, it is possible to show that the posterior mode converges to the likelihood
estimate as the number of observations increase.
Data/likelihood
Prior
Posterior
2
0
1
Density
3
4
Example 13.10. We consider again the normal-normal model and compare the posterior density
for various n with the likelihood. We keep y = 2.1, independent of n. As shown in Figure 13.4,
the maximum likelihood estimate does not depend on n (y is kept constant by design). However,
√
the uncertainty decreases (standard error is σ/ n). For increasing n, the posterior approaches
the likelihood density. In the limit, there is no difference between the posterior and the likelihood.
The R-Code follows closely R-Code 13.1.
♣
0.5
1.0
1.5
2.0
2.5
µ
Figure 13.4: Normal-normal model with prior (cyan), data/likelihood (green) and
posterior (blue) for increasing n (n = 4, 36, 64, 100). Prior is N (0, 2) and y = 2.1 in all
cases.
242
13.4
CHAPTER 13. BAYESIAN APPROACH
Regression in a Bayesian Framework
In this section we introduce a Bayesian approach to simple linear and logistic regression. More
complex models are deferred to Chapter 14. Here, we will discuss the conceptual ideas and
use software tools as a black box approach. The underlying computational principles will be
discussed in Chapter 14.
The simplest Bayesian regression model is as follows
indep
Yi | β, σ 2 ∼ N (x i β, σ 2 ),
i = 1, . . . , n,
2
β ∼ Np+1 (η, σ T),
(13.27)
(13.28)
where σ 2 , η and T are hyper-parameters. The model is quite similar to (13.11) and (13.12) with
the exception that we have a multivariate prior distribution for β. Instead of the parameter τ 2
we use σ 2 T, where T is a (p + 1) × (p + 1) symmetric positive definite matrix. The special form
will allow us to factor σ 2 and simplify the posterior. With a few steps, it is possible to show that
β | y ∼ Np+1 (V−1 m, σ 2 V−1 )
(13.29)
with V−1 = T−1 + X⊤ X and m = T−1 η + X⊤ y (see Problem 13.1.b).
It is possible to show that the posterior is a weighted average of the prior mean η and the
b = (X⊤ X)−1 Xy (see Problem 13.1.c).
classical least squares estimate β
The function bayesglm() from the R package arm implements an accessible way for simple
linear regression and logistic regression. It is simple in the sense that it returns the posterior
modes of the estimates in a framework that is similar to a frequentist approach. We need to
specify the priors for the regression coefficients (separately for the intercept and the remaining
coefficients).
Example 13.11 (Bayesian approach to orings data). We revisit Example 10.5 and fit in RCode 13.4 a Bayesian logistic model to the data.
In the first bayesglm() call, we set the prior variances to infinity, resulting in uninformative
priors. The posterior mode is identical to the result of a classical glm() model fit.
In the second call, we use Gaussian priors for both parameters with mean zero and variance 9.
This choice is set by prior.df=Inf (i.e., a t-distribution with infinite degrees of freedom), by
the default prior.mean=0, and by prior.scale=3. and similarly for the intercept parameter.
The slope parameter is hardly affected by the prior. The intercept is, because with its rather
informative choice of the prior variance, the posterior mode is shrunk towards zero.
Note that summary(baye) should not be used, as the printed p-values are not relevant in the
Bayesian context. The function display() is the preferred way.
♣
Remark 13.1. For one particular class, consider several grades of some students. A possible
model might be
Yij = µ + αi + εij
with
iid
εij ∼ N (0, σ 2 ),
(13.30)
where αi represents the performance relative to the overall mean. This performance is not “fixed”,
it highly depends on the choice of the student and is thus variable. Hence, it would make more
13.4. REGRESSION IN A BAYESIAN FRAMEWORK
243
R-Code 13.4 orings data and estimated probability of defect dependent on air temperature. (See Figure 13.5.)
require(arm)
data( orings, package="faraway")
bayes1 <- bayesglm( cbind(damage,6-damage)~temp, family=binomial, data=orings,
prior.scale=Inf, prior.scale.for.intercept=Inf) #
coef(bayes1)
## (Intercept)
##
11.66299
temp
-0.21623
# result "similar" to
# coef( glm( cbind(damage,6-damage)~temp, family=binomial, data=orings))
bayes2 <- bayesglm( cbind(damage,6-damage)~temp, family=binomial, data=orings,
prior.df = Inf, prior.scale=3,
prior.df.for.intercept=Inf, prior.scale.for.intercept=3) #
arm::display(bayes2)
## bayesglm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial,
##
data = orings, prior.scale = 3, prior.df = Inf, prior.scale.for.intercept = 3,
##
prior.df.for.intercept = Inf)
##
coef.est coef.se
## (Intercept) 10.54
3.03
## temp
-0.20
0.05
## --## n = 23, k = 2
## residual deviance = 17.1, null deviance = 38.9 (difference = 21.8)
plot( damage/6~temp, xlim=c(21,80), ylim=c(0,1), data=orings, pch='+',
xlab="Temperature [F]", ylab='Probability of damage', cex=1.5) # data
glm1 <- glm( cbind(damage,6-damage)~temp, family=binomial, data=orings)
ct <- seq(20, to=85, length=100)
# vector to predict
p.out <- predict( glm1, new=data.frame(temp=ct), type="response")
lines(ct, p.out, lwd=3, col=4)
scoefs <- coef(sim(bayes2)) # simulation of coefficients...
for (i in 1:100) {
lines( ct, invlogit( scoefs[i,1]+scoefs[i,2]*ct), col=rgb(.8,.8,.8,.2))
}
iid
sense to consider αi as random, or, more specifically, αi ∼ N (0, σα2 ), and αi and εij independent.
Such models are called mixed effects models in contrast to the fixed effects model as discussed in
this book.
From a Bayesian perspective, such a separation is not necessary, as all Bayesian linear models
iid
are a mixed-effects model. In the example above, we impose as prior αi ∼ N (0, σα2 ).
♣
244
0.2
0.4
0.6
0.8
+
++
+
+
+
++
++++ ++ ++ ++ +
0.0
Probability of damage
1.0
CHAPTER 13. BAYESIAN APPROACH
20
30
40
50
60
70
80
Temperature [F]
Figure 13.5: orings data (proportion of damaged orings, black crosses), estimated
probability of defect dependent on air temperature by a logistic regression (blue line).
Gray lines are based on draws from the posterior distribution. (See R-Code 13.4.)
13.5
Appendix: Distributions Common in Bayesian Analysis
13.5.1
Beta Distribution
We introduce a random variable with support [0, 1]. Hence this random variable is well suited
to model probabilities (proportions, fractions) in the context of Bayesian modeling.
A random variable Y with density
fY (y) = c · y α−1 (1 − y)β−1 ,
y ∈ [0, 1], α > 0, β > 0,
(13.31)
where c is a normalization constant, is called beta distributed with parameters α and β. We
write this as Y ∼ Beta(α, β). The normalization constant cannot be written in closed form for
all parameters α and β. For α = β the density is symmetric around 1/2, for α > 1, β > 1 the
density is unimodal with mode (α − 1)/(α + β − 2) and for 0 < α < 1, 0 < β < 1 the density
has a bathtub shape. For arbitrary α > 0, β > 0 we have:
E(Y ) =
α
,
α+β
Var(Y ) =
αβ
.
(α + β + 1)(α + β)2
(13.32)
Figure 13.6 shows densities of the beta distribution for various pairs of (α, β).
13.5.2
Gamma Distribution
Many results involving the variance parameters of a normal distribution are simpler if we would
work with the precision, i.e., the inverse of the variance. A prior for the precision should take
245
13.5. APPENDIX: DISTRIBUTIONS COMMON IN BAYESIAN ANALYSIS
R-Code 13.5 Densities of beta distributed random variables for various pairs of (α, β).
(See Figure 13.6.)
3.0
p <- seq( 0, to=1, length=100)
a.seq <- c( 1:6, .8, .4, .2, 1, .5, 2)
b.seq <- c( 1:6, .8, .4, .2, 4, 4, 4)
col <- c( 1:6, 1:6)
lty <- rep( 1:2, each=6)
plot( p, dbeta( p, 1, 1), type='l', ylab='Density', xlab='x',
xlim=c(0,1.3), ylim=c(0,3))
for ( i in 2:length(a.seq))
lines( p, dbeta(p, a.seq[i], b.seq[i]), col=col[i], lty=lty[i])
legend("topright", col=c(NA,col), lty=c(NA, lty), cex=.9, bty='n',
legend=c(expression( list( alpha, beta)), paste(a.seq, b.seq, sep=',')))
1.5
0.0
0.5
1.0
Density
2.0
2.5
α, β
1,1
2,2
3,3
4,4
5,5
6,6
0.8,0.8
0.4,0.4
0.2,0.2
1,4
0.5,4
2,4
0.0
0.2
0.4
0.6
0.8
1.0
1.2
x
Figure 13.6: Densities of beta distributed random variables for various pairs of (α, β).
(See R-Code 13.5.)
any positive value and the gamma distribution is a natural choice because of its conjugacy with
the normal likelihood.
A random variable Y with density
fY (y) = f (y | α, β) = c · y α−1 exp(−βy),
y > 0, α > 0, β > 0,
(13.33)
is called gamma distributed with parameters α and β. We write Y ∼∼ Gam(α, β). The normalization constant c cannot be written in closed form for all parameters α and β.
246
CHAPTER 13. BAYESIAN APPROACH
For arbitrary α > 0, β > 0 we have:
E(Y ) =
α
,
β
Var(Y ) =
α
.
β2
(13.34)
The parameters α and β are also called the shape and rate parameter, respectively. The
parameterizations of the density in terms of the scale parameter 1/β is also frequently used.
13.5.3
Inverse Gamma Distribution
Many results involving the variance parameters of a normal distribution would be simpler if
we would work with the precision, i.e., the inverse of the variance. In such cases we choose a
so-called inverse-gamma distribution for the parameter τ = 1/σ 2 .
A random variable Y is said to be distributed according an inverse-gamma distribution if
1/Y is distributed according a gamma distribution. For parameters α and β, the density is given
by
fY (y) = f (y | α, β) = c · y −α−1 exp(−β/y),
y > 0, α > 0, β > 0,
(13.35)
and we write Y ∼ IGam(α, β). The normalization constant c cannot be written in closed form
for all parameters α and β.
For arbitrary β > 0 and for α > 1 and α > 2, we have respectively
E(Y ) =
13.6
β
,
α−1
Var(Y ) =
β2
.
(α − 1)2 (α − 2)
(13.36)
Bibliographic Remarks
An accessible discussion of the Bayesian approach can be found in Held and Sabanés Bové (2014),
including a discussion about the choice of prior distribution. A classic is Bernardo and Smith
(1994).
The online open access book “An Introduction to Bayesian Thinking” available at https:
//statswithr.github.io/book/ nicely melds theory and R source code.
The source https://www.nicebread.de/grades-of-evidence-a-cheat-sheet compares different
categorizations of evidence based on a Bayes factor and illustrates that the terminology is not
universal.
13.7
Exercises and Problems
Problem 13.1 (Theoretical derivations) In this problem we derive some of the theoretical and
mathematical results that we have stated in the chapter.
a) Show that the likelihood of Example 13.3 can be written as c · exp −n(y − µ)2 /σ 2 , where
c is the normalizing constant that does not depend on µ.
b) Derive the posterior distribution of β | y .
247
13.7. EXERCISES AND PROBLEMS
c) Show that the posterior mean of β | y can be written as
−1 −1
(σ 2 T)−1 + (σ 2 (X⊤ X)−1 )−1
(σ 2 T)−1 η + (σ 2 (X⊤ X)−1 )−1 (X⊤ X)−1 X⊤ y
(13.37)
and give an interpretation with respect to σ 2 T and σ 2 (X⊤ X)−1 .
Problem 13.2 (Sunrise problem) The sunrise problem is formulated as “what is the probability
that the sun rises tomorror?” Laplace formulated this problem by casting it into his rule of
succession which calculates the probability of a success after having observed y successes out of
n trials (the (n + 1)th is again independent of the previous ones).
a) Formulate the rule of succession in a Bayesian framework and calculate the expected probability for the n + 1th term.
b) Laplace assumed that the Earth was created about 6000 years ago. If we use the same
information, what is the probability that the sun rises tomorrow?
iid
Problem 13.3 (Normal-gamma model) Let Y1 , Y2 , . . . , Yn ∼ N (µ, 1/κ). We assume that the
value of the expectation µ is known (i.e., we treat it as a constant in our calculations), whereas
the precision, i.e., inverse of the variance, denoted here with κ, is the parameter of interest.
a) Write down the likelihood of this model.
b) We choose the a gamma prior for the parameter κ, i.e., κ ∼ Gam(α, β). How does this
distribution relates to the exponential distribution?
Plot four densities for (α, β) = (1,1), (1,2), (2,1) and (2,2). How does a certain choice of
α, β be interpreted with respect to our “beliefs” on κ?
c) Derive the posterior distribution of the precision κ.
d) Compare the prior and posterior distributions. Why is the choice in b) sensible?
e) Simulate some data with n = 50, µ = 10 and κ = 0.25. Plot the prior and posterior
distributions of κ for α = 2 and β = 1.
Problem 13.4 (Bayesian statistics) For the following Bayesian models, derive the posterior
distribution and give an interpretation thereof in terms of prior and data.
a) Let Y | µ ∼ N (µ, 1/κ), where κ is the precision (inverse of the variance) and is assumed
to be known (hyper-parameter). Further, we assume that µ ∼ N (η, 1/ν), for fixed hyperparameters η and ν > 0.
b) Let Y | λ ∼ Pois(λ) with a prior λ ∼ Gam(α, β) for fixed hyper-parameters α > 0, β > 0.
c) Let Y | θ ∼ U(0, θ) with a prior a shifted Pareto distribution with parameters γ > 0 and
ξ > 0, whose density is
f (θ; γ, ξ) ∝ θ−(γ+1) Iθ>ξ (θ).
(13.38)
248
CHAPTER 13. BAYESIAN APPROACH
Chapter 14
Monte Carlo Methods
Learning goals for this chapter:
⋄ Explain how to numerically approximate integrals
⋄ Explain how to sample from an arbitrary (univariate) distribution in R
⋄ Describe qualitatively Gibbs sampling
⋄ Explain the idea of a Bayesian hierarchical model
⋄ Able to interpret the output of a MCMC sampler of a simple model
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter14.R.
In Chapter 13, the posterior distribution of several examples were similar to the chosen prior
distribution albeit with different parameters. Specifically, for binomial data with a beta prior,
the posterior is again beta. This was no coincidence; rather, we chose so-called conjugate priors
based on our likelihood (distribution of the data).
With other prior distributions, we may have “complicated”, not standard posterior distributions, for which we no longer know the normalizing constant, the expected value or any other
moment in general. Theoretically, we could derive the normalizing constant and then in subsequent steps determine the expectation and the variance (via integration) of the posterior. The
calculation of these types of integrals is often complex and so here we consider classic simulation
procedures as a solution to this problem. In general, so-called Monte Carlo simulation is used
to numerically solve a complex problem through repeated random sampling.
In this chapter, we start with illustrating the power of Monte Carlo simulation where we
utilize, above all, the law of large numbers. We then discuss one method to draw a sample from
an arbitrary density and, finally, illustrate a method to derive (virtually) arbitrary posterior
densities by simulation. We conclude the chapter with a few realistic examples.
249
250
14.1
CHAPTER 14. MONTE CARLO METHODS
Monte Carlo Integration
In this section we discuss how to approximate integrals. Let X be a continuous random variable
and fX (x) be its density function and g(x) an arbitrary (sufficiently “well behaved”) function.
We aim to evaluate expected value of g(X), i.e.,
Z
g(x)fX (x) dx.
(14.1)
E g(X) =
R
Hence, g(x) cannot be entirely arbitrary, but such that the integral on the right-hand side
of (14.1) is well defined. An approximation of this integral is (along the idea of method of
moments)
Z
n
1X
\
g(x)fX (x) dx ≈ E g(X) =
E g(X) =
g(xi ),
(14.2)
n
R
i=1
where x1 , . . . , xn is a realization of a random sample with density fX (x). The method relies on
the law of large numbers (see Section 3.3).
Example 14.1. To estimate the expectation of a χ21 random variable we can use mean( rnorm(
100000)ˆ2), yielding 1 with a couple digits of precision, close to what we expect according to
Equation (3.6).
Of course, we can use the same approach to calculate arbitrary moments of a χ2n or Fn,m
distribution.
♣
We now discuss this justification in slightly more details. We consider a continuous function
Rb
g (over the interval [a, b]) and the integral I = a g(x) dx. There exists a value ξ such that
I = (b − a)g(ξ) (often termed as the mean value theorem for definite integrals). We do not
know ξ nor g(ξ), but we hope that the “average” value of g is close to g(ξ). More formally, let
iid
X1 , . . . , Xn ∼ U(a, b) which we use to calculate the average (the density of Xi is fX (x) = 1/(b−a)
over the interval [a, b] and zero elsewhere). We now show that on average, our approximation is
correct:
n
n
1X
1X
g(Xi ) = (b − a)
E(g(Xi )) = (b − a) E g(X)
E Ib = E (b − a)
n
n
i=1
i=1
(14.3)
Z b
Z b
Z b
1
= (b − a)
g(x)fX (x) dx = (b − a)
g(x)
dx =
g(x) dx = I .
b−a
a
a
a
We can generalize this to almost arbitrary densities fX (x) having a sufficiently large support:
1
Ib =
n
n
X
g(xi )
i=1
fX (xi )
,
(14.4)
where the justification is as in (14.3). The density in the denominator takes the role of an
additional weight for each term.
Similarly, to integrate over a rectangle R in two dimensions (or a cuboid in three dimensions,
etc.), we use a uniform random variable for each dimension. More specifically, let R = [a, b]×[c, d]
then
Z
Z bZ d
n
1X
g(xi , yi ),
g(x, y) dx dy =
g(x, y) dx dy ≈ (b − a)(d − c)
(14.5)
n
R
a
c
i=1
251
14.1. MONTE CARLO INTEGRATION
where x1 , . . . , xn and y1 , . . . , yn , are two samples of U(a, b) and of U(c, d), respectively.
R
To approximate A g(x, y) dx dy for some complex domain A ⊂ R2 . We choose a bivariate
random vector having a density fX,Y (x, y) whose support contains A. For example we define a
rectangle R such that A ⊂ R and let fX,Y (x, y) = (b − a)(d − c) over R and zero otherwise. We
define the indicator function IA (x, y) that is 1 if (x, y) ∈ A and zero otherwise. Then we have
the general formula
Z
Z bZ d
n
1X
g(xi , yi )
g(x, y) dx dy =
IA (x, y)g(x, y) dx dy ≈
IA (xi , yi )
.
(14.6)
n
fX,Y (xi , yi )
A
a
c
i=1
Testing if a point (xi , yi ) is in the domain A is typically an easy problem.
We now illustrate this idea with two particular examples.
Example 14.2. Consider the bivariate normal density specified in Example 8.4 and suppose we
are interested in evaluating the probability that P(X > Y 2 ). To approximate this probability we
can draw a large sample of the bivariate normal density and calculate the proportion for which
xi > yi2 , as illustrated in R-Code 14.1 and yielding 10.47%.
In this case, the function g is the density with which we are drawing the data points. Hence,
Equation (14.6) reduces to calculate the proportion of the data satisfying xi > yi2 .
♣
R-Code 14.1 Calculating probability with the aid of a Monte Carlo simulation
set.seed( 14)
require(mvtnorm)
# to sample the bivariate normals
l.sample <- rmvnorm( 10000, mean=c(0,0), sigma=matrix( c(1,2,2,5), 2))
mean( l.sample[,1] > l.sample[,2]^2) # calculate the proportion
## [1] 0.1047
Example 14.3. The area of the unit circle is π as well as the volume of a cylinder placed at the
origin with height one. To estimate π we estimate the volume of the cylinder and we consider
U(−1, 1) for both coordinates of a square that contains the unit circle. The function g(x, y) = 1
is the identity function, IA (x, y) is the indicator function of the set A = {x2 + y 2 ≤ 1} and
fX,Y (xi , yi ) = 1/4 for 0 ̸= x, y ̸= 1. We have the following approximation of the number π
Z 1Z 1
n
1X
IA (xi , yi ),
(14.7)
π=
IA (x, y) dx dy ≈ 4
n
−1 −1
i=1
where x1 , . . . , xn and y1 , . . . , yn , are two independent samples of U(−1, 1). Equation (14.6)
reduces to calculate a proportion again.
It is important to note that the convergence is very slow, see Figure 14.1. It can be shown
√
that the rate of convergence is of the order of 1/ n.
♣
In practice, more efficient “sampling” schemes are used. More specifically, we do not sample
uniformly but deliberately “stratified”. There are several reasons to sample randomly stratified
but the discussion is beyond the scope of the work here.
252
CHAPTER 14. MONTE CARLO METHODS
R-Code 14.2 Approximation of π with the aid of Monte Carlo integration. (See Figure 14.1.)
set.seed(14)
m <- 49
# calculate for 49 different n
n <- round( 10+1.4^(1:m)) # non-equal spacing
piapprox <- numeric(m)
# to store the approximation
for (i in 1:m) {
st <- matrix( runif( 2*n[i]), ncol=2)
# bivariate uniform
piapprox[i] <- 4*mean( rowSums( st^2)<= 1) # proportion
}
plot( n, abs( piapprox-pi)/pi, log='xy', type='l') # plotting on log-log scale
lines( n, 1/sqrt(n), col=2, lty=2)
# order of convergence
sel <- 1:7*7
# subset for printing
cbind( n=n[sel], pi.approx=piapprox[sel], rel.error=
# summaries
abs( piapprox[sel]-pi)/pi, abs.error=abs( piapprox[sel]-pi))
1e−01
1e−03
1e−05
abs(piapprox − pi)/pi
##
n pi.approx rel.error abs.error
## [1,]
21
2.4762 0.21180409 0.66540218
## [2,]
121
3.0083 0.04243968 0.13332819
## [3,]
1181
3.1634 0.00694812 0.02182818
## [4,]
12358
3.1662 0.00783535 0.02461547
## [5,]
130171
3.1403 0.00040166 0.00126186
## [6,] 1372084
3.1424 0.00025959 0.00081554
## [7,] 14463522
3.1406 0.00032656 0.00102592
1e+01
1e+02
1e+03
1e+04
1e+05
1e+06
1e+07
n
Figure 14.1: Convergence of the approximation for π: the relative error as a function
of n (log-log scale). (See R-Code 14.2.)
14.2
Rejection Sampling
We now discuss an approach to sample from a distribution with density fY (y) when no direct
method exists. There are many of such approaches and we discuss an intuitive but inefficient
one here. The approach is called rejection sampling.
14.2. REJECTION SAMPLING
253
In this method, values from a known density fZ (z) (proposal density) are drawn and through
rejection of “unsuitable” values, observations of the density fY (y) (target density) are generated.
This method can also be used when the normalizing constant of fY (y) is unknown and we write
fY (y) = c · f ∗ (y).
The procedure is as follows: Step 0: Find an m < ∞, so that f ∗ (y) ≤ m · fZ (y) for all y.
Step 1: draw a realization ỹ from fZ (y) and a realization u from a standard uniform distribution
U(0, 1). Step 2: if u ≤ f ∗ (ỹ)/ m · fZ (ỹ) then ỹ is accepted as a simulated value from fY (y),
otherwise ỹ is dicarded and no longer considered. We cycle along Steps 1 and 2 until a sufficiently
large sample has been obtained. The algorithm is illustrated in the following example.
Example 14.4. The goal is to draw a sample from a Beta(6, 3) distribution with the rejection
sampling method. That means, fY (y) = c · y 6−1 (1 − y)3−1 and f ∗ (y) = y 5 (1 − y)2 . As proposal
density we use a uniform distribution, hence fZ (y) = I0≤y≤1 (y). We select m = 0.02, which
fulfills the condition f ∗ (y) ≤ m · fZ (y) since optimize( function(x) xˆ5*(1-x)ˆ2, c(0, 1),
maximum=TRUE) is roughly 0.152.
An implementation of the example is given in R-Code 14.3. Of course, f_Z is always one
here. The R-Code can be optimized with respect to speed. It would then, however, be more
difficult to read.
Figure 14.2 shows a histogram and the density of the simulated values. By construction the
bars of the target density are smaller than the one of the proposal density. In this particular
example, we have sample size 285.
♣
R-Code 14.3: Rejection sampling in the setting of a beta distribution. (See Figure 14.2.)
set.seed( 14)
n.sim <- 1000
m <- 0.02
fstar <- function(y) y^( 6-1) * (1-y)^(3-1)
# unnormalized target
f_Z <- function(y) ifelse( y >= 0 & y <= 1, 1, 0) # proposal density
result <- sample <- rep( NA, n.sim)
# to store the result
for (i in 1:n.sim){
sample[i] <- runif(1)
# ytilde, proposal
u <- runif(1)
# u, uniform
if( u < fstar( sample[i]) /( m * f_Z( sample[i])) ) # if accept ...
result[i] <- sample[i]
#
... keep
}
mean( !is.na(result))
# proportion of accepted samples
## [1] 0.285
result <- result[ !is.na(result)]
# eliminate NAs
hist( sample, xlab="y", main="", col="lightblue") # hist of all proposals
hist( result, add=TRUE, col=4)
#
of the kept ones
curve( dbeta(x, 6, 3), frame =FALSE, ylab="", xlab='y', yaxt="n")
254
CHAPTER 14. MONTE CARLO METHODS
40
60
80
truth
smoothed empirical
0
20
Frequency
100 120
lines( density( result), lty=2, col=4)
legend( "topleft", legend=c("truth", "smoothed empirical"),
lty=1:2, col=c(1,4))
0.0
0.2
0.4
0.6
y
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
y
Figure 14.2: Left panel: histogram of the simulated values of fZ (y) (light blue) and
fY (y) (dark blue). Right panel: theoretical density (truth) black and the simulated
density (smoothed empirical) blue dashed. (See R-Code 14.3.)
For efficiency reasons the constant m should be chosen to be as small as possible to reduce
the number of rejections. Nevertheless in practice, rejection sampling is intuitive but often quite
inefficient. The next section illustrates an approach well suited for complex Bayesian models.
14.3
Gibbs Sampling
The idea of Gibbs sampling is to simulate the posterior distribution through the use of a so-called
Markov chain. This algorithm belongs to the family of Markov chain Monte Carlo (MCMC)
methods. We illustrate the principle in a Bayesian context with a likelihood that depends on two
parameters θ1 and θ2 . Based on some prior, the joint posterior density is written as f (θ1 , θ2 | y)
with, for simplicity, a single observation y. The Gibbs sampler reduces the problem to two onedimensional simulations f (θ1 | θ2 , y) and f (θ2 | θ1 , y). Starting with some initial value θ2,0 we
draw θ1,1 from f (θ1 | θ2,0 , y), followed by θ2,1 from f (θ2 | θ1,1 , y) and θ1,2 from f (θ1 | θ2,1 , y), etc.
If all is setup properly, the sample (θ1,i , θ2,i ), i = 1, . . . , n, is a sample of the posterior density
f (θ1 , θ2 | y). Often we omit the first few samples to avoid influence of possibly sub-optimal initial
values.
In many cases one does not have to program a Gibbs sampler oneself but can use a preprogrammed sampler. We use the sampler JAGS (Just Another Gibbs sampler) (Plummer,
2003) with the R-Interface package rjags (Plummer, 2016).
R-Codes 14.4, 14.5 and 14.6 give a short, but practical overview into MCMC methods with
JAGS in the case of a simple Gaussian likelihood. Luckily more complex models can easily be
constructed based on the approach shown here.
255
14.3. GIBBS SAMPLING
When using MCMC methods, you may encounter situations in which the sampler does not
converge (or converges too slowly). In such a case the posterior distribution cannot be approximated with the simulated values. It is therefore important to examine the simulated values
for eye-catching patterns. For example, the so-called trace-plot, observations in function of the
index, as illustrated in the right panel of Figure 14.3 is often used.
Example 14.5. R-Code 14.4 implements the normal-normal model for a single observation,
y = 1, n = 1, known variance, σ 2 = 1.1, and a normal prior for the mean µ:
Y | µ ∼ N (µ, 1.1),
(14.8)
µ ∼ N (0, 0.8).
(14.9)
−2
0.0
0.1
−1
0.2
0
0.3
1
0.4
2
0.5
The basic approach to use JAGS is to first create a file containing the Bayesian model definition.
This file is then transcribed into a model graph (function jags.model()) from which we can
finally draw samples (coda.samples()).
Defining a model for JAGS is quite straightforward, as the notation is very close to the one
fro, R. Some care is needed when specifying variance parameters. In our notation, we typically
use the variance σ 2 , as in N ( · , σ 2 ) ; in R we have to specify the standard deviation σ as parameter
sd in the function dnorm(..., sd=sigma); and in JAGS we have to specify the precision 1/σ 2
in the function dnorm(..., precision=1/sigma2), see also LeBauer et al. (2013).
The resulting samples are typically plotted with smoothed densities, as seen in the left panel
of Figure 14.3 with prior and likelihood, if possible. The posterior seems affected similarly by
likelihood (data) and prior, the mean is close to the average of the prior mean and the data.
More precisely, the prior is slightly tighter as its variance is slightly smaller (0.8 vs. 1.1), thus
the posterior mean is slightly closer to the prior mean than to y. The setting here is identical to
Example 13.3 and thus the posterior has again a normal distribution with N 0.8/(0.8 + 1.1), 0.8 ·
1.1/(0.8 + 1.1) , see Equation (13.17).
♣
−2
−1
0
1
2
N = 2000 Bandwidth = 0.1626
3
0
500
1000
1500
Iterations
Figure 14.3: Left: empirical densities: MCMC based posterior (black), exact (red),
prior (blue), likelihood (green). Right: trace-plot of the posterior µ | y = 1. (See
R-Code 14.4.)
2000
256
CHAPTER 14. MONTE CARLO METHODS
R-Code 14.4 JAGS sampler for normal-normal model, with n = 1.
require( rjags)
writeLines("model {
# File with Bayesian model definition
y ~ dnorm( mu, 1/1.1)
# here Precision = 1/Variance
mu ~ dnorm( 0, 1/0.8)
# Precision again!
}",
con="jags01.txt")
# arbitrary file name
jagsModel <- jags.model( "jags01.txt", data=list( 'y'=1)) # transcription
## Compiling model graph
##
Resolving undeclared variables
##
Allocating nodes
## Graph information:
##
Observed stochastic nodes: 1
##
Unobserved stochastic nodes: 1
##
Total graph size: 8
##
## Initializing model
postSamples <- coda.samples( jagsModel, 'mu', n.iter=2000) # draw samples
plot( postSamples, trace=FALSE, main="", auto.layout=FALSE, xlim=c(-2, 3))
y <- seq(-3, to=4, length=100)
lines( y, dnorm( y, 1, sqrt(1.1)), col=3)
# likelihood
lines( y, dnorm( y, 0, sqrt(0.8)
), col=4)
# prior
lines( y, dnorm( y, 1/1.1 * (1.1*0.8/(0.8 + 1.1)),
sqrt(1.1*0.8/(0.8 + 1.1))), col=2)
# posterior
plot( postSamples, density=FALSE, main="", auto.layout=FALSE)
Example 14.6. R-Code 14.5 extends the normal-normal model to n = 10 observations with
known variance:
iid
Y1 , . . . , Yn | µ ∼ N (µ, 1.1),
(14.10)
µ ∼ N (0, 0.8).
(14.11)
We draw the data in R via rnorm(n, 1, sqrt(1.1)) and proceed similarly as in R-Code 14.4.
Figure 14.4 gives the empirical and exact densities of the posterior, prior and likelihood and shows
√
a trace-plot as a basic graphical diagnostic tool. The density of the likelihood is N (y, 1.1/ n),
the prior density is based on (14.11) and the posterior density is based on (13.17). The latter
simplifies considerably because we have η = 0 in (14.11).
As the number of observations increases, the data gets more “weight”. From (13.18), the
weight increases from 0.8/(0.8 + 1.1) ≈ 0.42 to 0.8n/(0.8n + 1.1) ≈ 0.88. Thus, the posterior
is “closer” to the likelihood but slightly more peaked. As both the variance of the data and the
variance of the priors are comparable, the prior has a comparable impact on the posterior as if
we would possess an additional observation with value zero.
♣
257
0.5
1.0
1.5
2.0
0.0 0.2 0.4 0.6 0.8 1.0 1.2
14.3. GIBBS SAMPLING
−0.5
0.0
0.5
1.0
1.5
2.0
N = 2000 Bandwidth = 0.07252
2.5
3.0
0
500
1000
1500
2000
Iterations
Figure 14.4: Left: empirical densities: MCMC based posterior µ | y = y1 , . . . , yn ,
n = 10 (black), exact (red), prior (blue), likelihood (green). Black and green ticks
are posterior sample and observations, respectively. Right: trace-plot of the posterior.
(See R-Code 14.5.)
R-Code 14.5: JAGS sampler for the normal-normal model, with n = 10. (See Figure 14.4.)
set.seed( 4)
n <- 10
obs <- rnorm( n, 1, sqrt(1.1))
# generate artificial data
writeLines("model {
for (i in 1:n) {
# define a likelihood for each
y[i] ~ dnorm( mu, 1/1.1)
# individual observation
}
mu ~ dnorm( 0, 1/0.8)
}",
con="jags02.txt")
jagsModel <- jags.model( "jags02.txt", data=list('y'=obs, 'n'=n), quiet=T)
postSamples <- coda.samples( jagsModel, 'mu', n.iter=2000)
plot( postSamples, trace=FALSE, main="", auto.layout=FALSE,
xlim=c(-.5, 3), ylim=c(0, 1.3))
rug( obs, col=3)
y <- seq(-.7, to=3.5, length=100)
lines( y, dnorm( y, mean(obs), sqrt(1.1/n)), col=3)
# likelihood
lines( y, dnorm( y, 0, sqrt(0.8)
), col=4)
# prior
lines( y, dnorm( y, n/1.1*mean(obs)*(1.1*0.8/(n*0.8 + 1.1)),
sqrt(1.1*0.8/(n*0.8 + 1.1)) ), col=2) # posterior
plot( postSamples, density=FALSE, main="", auto.layout=FALSE)
Example 14.7. We consider an extension of the previous example by including an unknown
variance, respectively unknown precision. That means that we now specify two prior distributions
258
CHAPTER 14. MONTE CARLO METHODS
and we have a priori no knowledge of the posterior and cannot compare the empirical posterior
density with a true (bivariate) density (as we had the red densities in Figures 14.3 and 14.4).
R-Code 14.6 implements the following model in JAGS:
iid
Yi | µ, κ ∼ N (µ, 1/κ),
µ ∼ N (η, 1/λ),
κ ∼ Gam(α, β),
i = 1, . . . , n, with n = 10,
(14.12)
with η = 0, λ = 1.25,
(14.13)
with α = 1, β = 0.2.
(14.14)
For more flexibility with the code, we also pass the hyper-parameters η, λ, α, β to the JAGS
MCMC engine.
Figure 14.5 gives the marginal empirical posterior densities of µ and κ, as well as the priors
(based on (14.13) and (14.14)) and likelihood (based on (14.12)). The posterior is quite data
driven and by the choice of the prior, slightly shrunk towards zero.
√
Note that the marginal likelihood for µ is N (y, s2 / n), i.e., we have replaced the parameters
in the model with their unbiased estimates. The marginal likelihood for κ is a gamma distribution
P
based on parameters n/2 + 1 and ns2 /2 = ni=1 (yi − y)2 /2, see Problem 13.3.a.
♣
R-Code 14.6 JAGS sampler for priors on mean and precision parameter, with n = 10.
eta <- 0
# start with defining the four hyperparameters
lambda <- 1.25 # corresponds to a variance 0.8, as in previous examples
alpha <- 1
beta <- 0.2
writeLines("model {
# JAGS model as above with ...
for (i in 1:n) {
y[i] ~ dnorm( mu, kappa)
}
mu ~ dnorm( eta, lambda)
kappa ~ dgamma(alpha, beta)
# ... one additional prior
}",
con="jags03.txt")
jagsModel <- jags.model('jags03.txt', quiet=T, data=list('y'=obs, 'n'=n,
'eta'=eta, 'lambda'=lambda, 'alpha'=alpha, 'beta'=beta))
postSamples <- coda.samples(jagsModel, c('mu','kappa'), n.iter=2000)
plot( postSamples[,"mu"], trace=FALSE, auto.layout=FALSE,
xlim=c(-1,3), ylim=c(0, 1.3))
y <- seq( -2, to=5, length=100)
lines( y, dnorm(y, 0, sqrt(1.2)
), col=4)
# likelihood
lines( y, dnorm(y, mean(obs), sd(obs)/sqrt(n)), col=3) # prior
plot( postSamples[,"kappa"], trace=FALSE, auto.layout=FALSE, ylim=c(0, 1.3))
y <- seq( 0, to=5, length=100)
lines( y, dgamma( y, 1, .2), type='l', col=4)
# likelihood
lines( y, dgamma( y, n/2+1, (n-1)*var(obs)/2 ), col=3) # prior
259
0.0 0.2 0.4 0.6 0.8 1.0 1.2
0.0 0.2 0.4 0.6 0.8 1.0 1.2
14.4. BAYESIAN HIERARCHICAL MODELS
−1
0
1
2
3
0.0
N = 2000 Bandwidth = 0.07463
0.5
1.0
1.5
2.0
2.5
3.0
3.5
N = 2000 Bandwidth = 0.09084
Figure 14.5: Empirical posterior densities of µ | y1 , . . . , yn (left) and κ = 1/σ 2 |
y1 , . . . , yn (right), MCMC based (black), prior (blue), likelihood (green). (See RCode 14.6.)
This last example is another classical Bayesian example and with a very careful specification
of the priors, we can construct a closed form posterior density. Problem 14.1 gives a hint towards
this more advanced topic.
Remark 14.1. The marginal distribution of µ in the normal-normal-gamma model is a (shifted
and scaled) t-distribution.
♣
Note that the function jags.model() writes some local files that may be cleaned after the
analysis.
14.4
Bayesian Hierarchical Models
We conclude this chapter with a final rather flexible class of models, called Bayesian hierarchical
models.
A hierarchical Bayesian model is a Bayesian model in which the prior distribution of some of
the parameters depends on further parameters to which we assign priors too.
Suppose that we observe a linear relationship between two variables. The relationship may
be different for different subjects. A very simple hierarchical Bayesian model is the following
indep
Yij | β i , κ ∼ N (x ij β i , 1/κ),
i = 1, . . . , n, j = 1, . . . , ni ,
iid
β i , | η, λ ∼ Np (η, I/λ),
η ∼ Np (η 0 , I/τ ),
(14.15)
(14.16)
λ ∼ Gam(αλ , βλ ),
κ ∼ Gam(ακ , βκ ).
(14.17)
where τ , αλ , βλ , ακ and βκ are hyperparameters. The three levels (14.15) to (14.17) are often
referred to as observation level, state or process level and prior level.
Example 14.8. Parasite infection can pose a large economic burden on livestock such as sheep,
horses etc. Infected herds or animals receive an anthelmintic treatment that reduces the infection
260
CHAPTER 14. MONTE CARLO METHODS
of parasitic worms. To assess the efficacy of the treatment, the number of parasitic eggs per gram
feces are evaluated. We use the following Poisson model for the pre- and post-treatment counts,
denoted as yiC and yiT :
indep
YiC | µC
∼ Pois(µC
i
i ),
µC
i ∼ Gam(κ, κ/µ).
δ ∼ U(0, 1),
indep
YiT | µTi ∼ Pois(δµC
i ),
κ ∼ Gam(1, 0.001),
i = 1, . . . , n,
(14.18)
(14.19)
µ ∼ Gam(1, 0.7),
(14.20)
where we have used directly the numerical values for the hyperparameters (Wang et al., 2017).
The parameter δ represents the efficacy of the treatment. Notice that with the parameterization
of the Gamma distribution for µC
i the mean thereof is µ.
The package eggCounts provides the dataset epgs containing 14 eggs per gram (epg) values
in sheep before and after anthelmintic treatment of benzimidazole. The correction factor of
the diagnostic technique was 50 (thus all numbers are multiples of 50, due to the measuring
technique, see also Problem 14.4). R-Code 14.7 illustrates the implementation in JAGS. We
reduce all sheep that did not have any parasites, followed by the setup of the JAGS file with
dpois(), dunif() and dgamma() for the distributions of the different layers of the hierarchy.
Although the arguments of the gamma distribution are named differently in JAGS we can pass
the same parameters in the same order.
The sampler does not indicate any convergence issues (trace-plots and the empirical posterior densities behave well). The posterior median reduction factor is about 0.077, with a
95% HPD interval [ 0.073, 0.0812 ], virtually identical to the quantile based credible interval
[ 0.0729, 0.0812 ], (TeachingDemos::emp.hpd(postSamples[,"delta"][[1]], conf=0.95) and
summary( postSamples)$quantiles["delta",c(1,5)]). The posterior median epg is reduced
from 1885.8 to 145.
As a sanity check, we can compare the posterior median (or mean values) with corresponding
frequentist estimates. The average epgs before and after the treatment are 2094.4 and 161.1
(colMeans(epgs2)).
To compare the variances we can use the following ad-hoc approach. The (prior) variance
2
C
of µC
i is µ /κ. Although the posterior distribution of µi is not gamma anymore, we use the
same formula to estimate the variance (here we have very few observations and for a simple
Poisson likelihood, a gamma prior is conjugate, see Problem 13.4.b). Based on the posterior
medians, we have the following values 5.326 × 106 and 3.15 × 104 , which are somewhat smaller
than the frequentist values 9.007×106 and 6.861×104 . We should not be too surprised about the
difference but rather be assured that we have properly specified and interpreted the parameters
of the gamma distribution.
♣
R-Code 14.7: JAGS sampler for epgs data. (See Figure 14.6.)
require(eggCounts)
require(rjags)
data(epgs)
epgs2 <- epgs[rowSums(epgs[,c("before","after")])>0,c("before","after")]
14.4. BAYESIAN HIERARCHICAL MODELS
261
n <- nrow(epgs2)
writeLines("model {
for (i in 1:n) {
# define a likelihood for each
yC[i] ~ dpois( muC[i])
# pre-treatment
yT[i] ~ dpois( delta*muC[i]) # post-treatment
muC[i] ~ dgamma( kappa, kappa/mu) # pre-treatment mean
}
delta ~ dunif( 0, 1)
# reduction
mu ~ dgamma(1, 0.001)
# pre-treatment mean
kappa ~ dgamma(1, 0.7)
#
}",
con="jagsEggs.txt")
jagsModel <- jags.model( "jagsEggs.txt",
# write the model
data=list('yC'=epgs2[,1],'yT'=epgs2[,2], 'n'=n), quiet=T)
postSamples <- coda.samples( jagsModel, # run sampler and monitor all param.
c('mu', 'kappa', 'delta'), n.iter=5000)
summary( postSamples)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
##
plus standard error of the mean:
##
##
Mean
SD Naive SE Time-series SE
## delta
0.077 2.13e-03 3.01e-05
4.10e-05
## kappa
0.684 2.60e-01 3.68e-03
5.21e-03
## mu
1982.866 7.06e+02 9.99e+00
1.52e+01
##
## 2. Quantiles for each variable:
##
##
2.5%
25%
50%
75%
97.5%
## delta
0.0727 7.56e-02
0.077 7.85e-02 8.12e-02
## kappa
0.2892 4.96e-01
0.644 8.29e-01 1.30e+00
## mu
943.9016 1.49e+03 1864.182 2.35e+03 3.71e+03
par(mfcol=c(2,3), mai=c(.6,.6,.1,.1))
plot( postSamples[,"delta"], main="", auto.layout=FALSE, xlab=bquote(delta))
plot( postSamples[,"mu"], main="", auto.layout=FALSE, xlab=bquote(mu))
plot( postSamples[,"kappa"], main="", auto.layout=FALSE, xlab=bquote(kappa))
262
1000
3000
5000
1.5
0.5
0.070
2000
0.078
6000
CHAPTER 14. MONTE CARLO METHODS
1000
3000
5000
1000
µ
0.080
0.085
1.0
0.0
0e+00
0 50
0.075
5000
κ
4e−04
150
δ
0.070
3000
0
δ
2000
4000
6000
8000
0.0
0.5
1.0
µ
1.5
2.0
κ
Figure 14.6: Top row: trace-plots of the parameters δ, µ and κ. Bottom row: empirical
posterior densities of the parameters δ, µ and κ. (See R-Code 14.7.)
14.5
Bibliographic Remarks
There is ample literature about in-depth Bayesian methods and the computational implementation of these and we only give a few relevant links.
The list of textbooks discussing MCMC is long and extensive. Held and Sabanés Bové (2014)
has some basic and accessible ideas. Accessible examples for actual implementations can be
found in Kruschke (2015) (JAGS and STAN) and Kruschke (2010) (Bugs).
Further information about MCMC diagnostics is found in general Bayesian text books like
Lunn et al. (2012). Specific and often used tests are published in Geweke (1992) Gelman and
Rubin (1992), and Raftery and Lewis, 1992 and are implemented in the package coda with
geweke.plot(), gelman.diag(), raftery.diag().
An alternative to JAGS is BUGS (Bayesian inference Using Gibbs Sampling) which is distributed as two main versions: WinBUGS and OpenBUGS, see also Lunn et al. (2012). Additionally, there is the R-Interface package (R2OpenBUGS, Sturtz et al., 2005). Other possibilities
are the Stan or INLA engines with convenient user interfaces to R through rstan and INLA
(Gelman et al., 2015; Rue et al., 2009; Lindgren and Rue, 2015).
14.6
Exercises and Problems
iid
Problem 14.1 (Normal-normal-gamma model) Let Y1 , Y2 , . . . , Yn | µ, τ ∼ N (µ, 1/κ). Instead
of independent priors on µ and κ, we propose a joint prior density that can be factorized by
the density of κ and µ | κ. We assume κ ∼ Gam(α, β) and µ | κ ∼ N (η, 1/(κν)), for some
hyper-parameters η, ν > 0, α > 0, and β > 0. This distribution is a so-called normal-gamma
distribution, denoted by NΓ (η, ν, α, β).
iid
a) Create an artificial dataset consisting for Y1 , . . . , Yn ∼ N (1, 1), with n = 20.
263
14.6. EXERCISES AND PROBLEMS
b) Write a function called dnormgamma() that calculates the density at mu, kappa based on the
parameters eta, nu, alpha, beta. Visualize the bivariate density based on η = 1, ν = 1.5,
α = 1, and β = 0.2.
c) Setup a Gibbs sampler for the following values η = 0, ν = 1.5, α = 1, and β = 0.2. For a
sample of length 2000 illustrate the (empirical) joint posterior density of µ, κ | y1 , . . . , yn .
Hint: Follow closely R-Code 13.6.
d) It can be shown that the posterior is again normal-gamma with parameters
1
(ny + νη)
n+ν
n
αpost = α +
2
ηpost =
νpost = ν + n
βpost = β +
1
nν(η − x)2 (n − 1)s2 +
2
n+ν
(14.21)
(14.22)
where s2 is the usual unbiased estimate of σ 2 . Superimpose the true isolines of the normalgamma prior and posterior density in the plot form the previous problem.
e) Compare the posterior mean of the normal-normal model with ηpost .
Problem 14.2 (Monte Carlo integration) Estimate the volume of the unit ball in d = 2, 3, . . . , 10
dimensions and compare it to the exact value π d/2 /Γ(d/2 + 1). What do you notice?
Problem 14.3 (Rejection sampling) A random variable X has a Laplace distribution with
parameters µ ∈ R and λ > 0 if its density is of the form
fX (x) =
|x − µ| 1
.
exp −
2λ
λ
a) Draw 100 realizations of a Laplace distribution with parameters µ = 1 and λ = 1 with a
rejection sampling approach.
b) Propose an intuitive alternative sampling approach based on rexp().
Problem 14.4 (Anthelmintic model) The process of determining the parasitic load, a fecal sample is taken and is thoroughly mixed after dilution. We assume that the eggs are homogeneously
distributed within each sample. A proportion of the diluted sample p = 1/f is then counted.
Denote the raw number of eggs in the diluted sample of the ith control animal as Yi∗C , with
i = 1, 2, . . . , n. Given the true number of eggs per gram of feces YiC , the raw count Yi∗C follows
a binomial distribution Bin(YiC , p). This captures both the dilution and the counting variability.
For the true epg counts YiC we use the same models as in Example 14.8. Similar approach is
used for the observations after the treatment.
a) Implement the model in JAGS and compare the results with those of the simple JAGS
sampler of Example 14.8.
b) The package eggCounts provides samplers for the specific model discussed here as well as
further extensions. Interpret the output of
264
CHAPTER 14. MONTE CARLO METHODS
model <- fecr_stanSimple(epgs2$before, epgs2$after)
and compare the result with those of a). (See also the vignette of the package eggCounts).
Epilogue
Are we done yet? No of course not!
An introduction is typically followed by an immersion. Here based on the lecture STA121
Statistical Modeling.
Throughout the book we have encountered several gaps that should be filled to feel proficient in statistics. To name a few: the extension from linear models to mixed models or to
generalized linear models. In a similar fashion we can work with arbitrary many predictors and
work with non-parametric regressions (e.g., guide the eye curves in the scatterplots) or neural
nets. Dropping the iid Gaussian assumption and working with time series data, spatial data and
extreme values. Real world examples are typically much larger and messier than what we have
encountered here and thus methods to find low(er) dimensional structures in the data are often
first steps in an analysis. Similarly, one often has to find clusters, i.e., grouping observations into
similar groups, or construct classifiers, i.e., allocate new observations to known groups.
After all that, are we done yet? No of course not, the educational iteration requires a
further refinement. Books at virtually arbitrary length could be written for all the topics above.
However, I often feel that too many iterations is not advancing my capability to help statistically
in a proportional manner. Being a good statistician does not only require having solid knowledge
in statistics but also having domain specific knowledge, the skills to listen and to talk to experts
and to have fun stepping outside the own comfort zone, rather into a foreign backyard (in the
sense of John Tukey’s quote “The best thing about being a statistician is that you get to play in
everyone’s backyard.”).
For the specific document, I am done here up to some standard appendices and indices.
265
266
Epilogue
Appendix A
Software Environment R
R is a freely available language and environment for statistical computing and graphics which
provides a wide variety of statistical and graphical techniques. It compiles and runs on a wide
varieties operating systems (Windows, Mac, and Linux), its central entry point is https://www.
r-project.org.
The R software can be downloaded from CRAN (Comprehensive R Archive Network) https:
//cran.r-project.org, a network of ftp and web servers around the world that store identical,
up-to-date, versions of code and documentation for R. Figure A.1 shows a screenshot of the web
page.
Figure A.1: Screenshot of the entry webpage of CRAN (Comprehensive R Archive
Network, https://cran.r-project.org).
267
268
APPENDIX A. SOFTWARE ENVIRONMENT R
R is console based, that means that individual commands have to be typed. It is very important to save these commands to construct a reproducible workflow – which the big advantage
over a “click-and-go” approach. We strongly recommend to use some graphical, integrated development environment (IDE) for R. The prime choice these days is RStudio. RStudio includes
a console, syntax-highlighting editor that supports direct code execution, as well as tools for
plotting, history, debugging and workspace management, see Figure A.2.
RStudio is available in a desktop open source version for many different operating systems
(Windows, Mac, and Linux) or in a browser connected to an RStudio Server. There are several
providers of such servers, including https://rstudio.math.uzh.ch for the students of the STA120
lecture.
Figure A.2: RStudio screenshot. The four panels shown are (clock-wise starting top
left): (i) console, (ii) plots, (iii) environment, (iv) script.
4 min
The installation of all software components are quite straightforward, but the look of the
download page may change from time to time and the precise steps may vary a bit. Some
examples are given by the attached videos.
The biggest advantage of using R is the support from and for a huge user community. Sheer
endless packages provide almost seemingly every statistical task, often implemented by several
authors. The packages are documented and by the upload to CRAN confined to a limited level
of documentation, coding standards, (unit) testing etc. There are several forums (e.g., R mailing
lists, Stack Overflow with tag “r”) to get additional help, see https://www.r-project.org/help.
html.
In this document we tried to keep the level of R quite low and rely on few packages only,
examples were MASS, vioplot, mvtnorm, ellipse and some more. Due to complex dependencies,
269
more than the actual loaded packages are used. The following R-Code output shows all packages
(and their version number) used to compile this document (not including any packages required
for the problems).
R-Code A.1: R-Session infomation of this document.
print( sessionInfo(), locale=FALSE)
## R Under development (unstable) (2023-01-31 r83741)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS:
/usr/lib/R-devel/lib/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3; LAPACK version 3.10.0
##
## attached base packages:
## [1] stats
graphics grDevices utils
datasets methods
base
##
## other attached packages:
## [1] LearnBayes_2.15.1
BayesFactor_0.9.12-4.4 eggCounts_2.3-2
## [4] Rcpp_1.0.10
arm_1.13-1
lme4_1.1-31
## [7] Matrix_1.5-3
rjags_4-13
coda_0.19-4
## [10] TeachingDemos_2.12
daewr_1.2-7
pwr_1.3-0
## [13] car_3.1-1
carData_3.0-5
faraway_1.0.8
## [16] mvtnorm_1.1-3
ellipse_0.4.3
fields_14.1
## [19] viridis_0.6.2
viridisLite_0.4.1
spam_2.9-1
## [22] exactRankTests_0.8-35 coin_1.4-2
survival_3.5-0
## [25] palmerpenguins_0.1.1
vioplot_0.4.0
zoo_1.8-11
## [28] sm_2.2-5.7.1
MASS_7.3-58.2
knitr_1.42
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0
rootSolve_1.8.2.3
DoE.base_1.2-1
## [4] libcoin_1.0-9
dplyr_1.1.0
loo_2.5.1
## [7] combinat_0.0-8
TH.data_1.1-1
mathjaxr_1.6-0
## [10] numbers_0.8-5
digest_0.6.31
dotCall64_1.0-2
## [13] lifecycle_1.0.3
StanHeaders_2.21.0-7 processx_3.8.0
## [16] magrittr_2.0.3
compiler_4.3.0
rlang_1.0.6
## [19] tools_4.3.0
igraph_1.4.0
utf8_1.2.3
## [22] prettyunits_1.1.1
pkgbuild_1.4.0
scatterplot3d_0.3-42
## [25] multcomp_1.4-20
abind_1.4-5
partitions_1.10-7
## [28] grid_4.3.0
stats4_4.3.0
fansi_1.0.4
## [31] conf.design_2.0.0
inline_0.3.19
colorspace_2.1-0
## [34] ggplot2_3.4.0
scales_1.2.1
cli_3.6.0
## [37] crayon_1.5.2
generics_0.1.3
RcppParallel_5.1.6
## [40] FrF2_2.2-3
pbapply_1.7-0
minqa_1.2.5
## [43] polynom_1.4-1
stringr_1.5.0
rstan_2.21.8
## [46] modeltools_0.2-23
splines_4.3.0
maps_3.4.1
270
## [49] parallel_4.3.0
## [52] boot_1.3-28.1
## [55] vcd_1.4-11
## [58] ps_1.7.2
## [61] gtable_0.3.1
## [64] gmp_0.7-1
## [67] pillar_1.8.1
## [70] evaluate_0.16
## [73] rbibutils_2.2.13
## [76] gridExtra_2.3
## [79] pkgconfig_2.0.3
APPENDIX A. SOFTWARE ENVIRONMENT R
matrixStats_0.63.0
sandwich_3.0-2
glue_1.6.2
codetools_0.2-18
sfsmisc_1.1-14
munsell_0.5.0
R6_2.5.1
lattice_0.20-45
MatrixModels_0.5-1
nlme_3.1-161
vctrs_0.5.2
callr_3.7.3
nloptr_2.0.3
stringi_1.7.12
lmtest_0.9-40
tibble_3.1.8
Rdpack_2.4
highr_0.10
rstantools_2.2.0
xfun_0.37
Appendix B
Calculus
In this chapter we present some of the most important ideas and concepts of calculus. For example, we will not discuss sequences and series. It is impossible to give a formal, mathematically
precise exposition. Further, we cannot present all rules, identities, guidelines or even tricks.
B.1
Functions
We start with one of the most basic concepts, a formal definition that describes a relation between
two sets.
Definition B.1. A function f from a set D to a set W is a rule that assigns a unique value
element f (x) ∈ W to each element x ∈ D. We write
f :D→W
x 7→ f (x)
(B.1)
(B.2)
The set D is called the domain, the set W is called the range (or target set or codomain).
The graph of a function f is the set (x, f (x)) : x ∈ D .
♢
The function will not necessarily map to every element in W , and there may be several
elements in D with the same image in W . These functions are characterized as follows.
Definition B.2.
1. A function f is called injective, if the image of two different elements in
D is different.
2. A function f is called surjective, if for every element y in W there is at least one element
x in D such that y = f (x).
3. A function f is called bijective if it is surjective and injective. Such a function is also called
a one-to-one function.
♢
As an illustration, the first point can be ‘translated’ to ∀x, z ∈ D, x ̸= z =⇒ f (x) ̸= f (z),
which is equivalent to ∀x, z ∈ D, f (x) = f (z) =⇒ x = z.
By restricting the range, it is possible to render a function surjective. It is often possible to
restrict the domain to obtain a locally bijective function.
271
272
APPENDIX B. CALCULUS
In general, there is virtually no restriction on the domain and codomain. However, we often
work with real functions, i.e., D ⊂ R and W ⊂ R.
There are many different characterizations of functions. Some relevant one are as follows.
Definition B.3. A real function f is
1. periodic if there exists an ω > 0 such that f (x + ω) = f (x) for all x ∈ D. The smallest
value ω is called the period of f ;
2. called increasing if f (x) ≤ f (x + h) for all h ≥ 0. In case of strict inequalities, we call the
function strictly increasing. Similar definitions hold when reversing the inequalities.
♢
The inverse f −1 (y) of a bijective function f : D → W is defined as
f −1 : W → D
y 7→ f −1 (y), such that y = f f −1 (y) .
(B.3)
Subsequently, we require the “inverse” of increasing functions by generalizing the previous
definition. We call these function quantile functions.
To capture the behavior of a function locally, say at a point x0 ∈ D, we use the concept of a
limit.
Definition B.4. Let f : D → R and x0 ∈ D. The limit of f as x approaches x0 is a, written
as limx→x0 f (x) = a if for every ϵ > 0, there exists a δ > 0 such that for all x ∈ D with
0 < |x − x0 | < δ =⇒ |f (x) − a| < ϵ.
♢
The latter definition does not assume that the function is defined at x0 .
It is possible to define “directional” limits, in the sense that x approaches x0 from above (from
the right side) or from below (from the left side). These limits are denoted with
lim
x→x+
0
lim
x↘x0
for the former; or lim
x→x−
0
lim for the latter.
x↗x0
(B.4)
We are used to interpret graphs and when we sketch an arbitrary function we often use a
single, continuous line. This concept of not lifting the pen while sketching is formalized as follows
and linked directly to limits, introduced above.
Definition B.5. A function f is continous in x0 if the following limits exist
lim f (x0 + h)
h↗0
and are equal to f (x0 ).
lim f (x0 + h)
h↘0
(B.5)
♢
There are many other approaches to define coninuity, for example in terms of neighborhoods,
in terms of limits of sequences.
Another very important (local) characterization of a function is the derivative, which quantifies the (infinitesimal) rate of change.
273
B.1. FUNCTIONS
Definition B.6. The derivative of a function f (x) with respect to the variable x at the point
x0 is defined by
f ′ (x0 ) = lim
h→0
f (x0 + h) − f (x0 )
,
h
(B.6)
df (x0 )
= f ′ (x0 ).
provided the limit exists. We also write
dx
If the derivative exists for all x0 ∈ D, the function f is differentiable.
♢
Some of the most important properties in differential calculus are:
Property B.1.
1. Differentability implies continuity.
2. (Mean value theorem) For a continuous function f : [a, b] → R, which is differentiable on
f (b) − f (a)
(a, b) there exists a point ξ ∈ (a, b) such that f ′ (ξ) =
.
b−a
The integral of a (positive) function quantifies the area between the function and the x-axis.
A mathematical definition is a bit more complicated.
Definition B.7. Let f (x) : D → R a function and [a, b] ∈ D a finite interval such that |f (x)| <
∞ for x ∈ [a, b]. For any n, let t0 = a < t1 < · · · < tn = b a partition of [a, b].
The integral of f from a to b is defined as
Z b
n
X
f (x)dx = lim
f (ti )(ti − ti−1 ).
(B.7)
a
n→∞
i=1
♢
For non-finite a and b, the definition of the integral can be extended via limits.
Property B.2. (Fundamental theorem of calculus (I)). Let f : [a, b] → R continuous. For all
Rx
x ∈ [a, b], let F (x) = a f (u)du. Then F is continuous on [a, b], differentiable on (a, b) and
F ′ (x) = f (x), for all x ∈ (a, b).
The function F is often called the antiderivative of f . There exists a second form of the
previous theorem that does not assume continuity of f but only Riemann integrability, that
means that an integral exists.
Property B.3. (Fundamental theorem of calculus (II)). Let f : [a, b] → R. And let F such that
Z b
′
F (x) = f (x), for all x ∈ (a, b). If f is Riemann integrable then
f (u)du = F (b) − F (a).
a
There are many ‘rules’ to calculate integrals. One of the most used ones is called integration
by substitution and is as follows.
Property B.4. Let I be an interval and φ : [a, b] → I be a differentiable function with integrable
derivative. Let f : I → R be a continuous function. Then
Z φ(b)
Z b
f (u) du =
f (φ(x))φ′ (x) dx.
(B.8)
φ(a)
a
274
APPENDIX B. CALCULUS
B.2
Functions in Higher Dimensions
We denote with Rm the vector space with elements x = (x1 , . . . , xm )⊤ , called vectors, equipped
with the standard operations. We will discuss vectors and vector notation in more details in the
subsequent chapter.
A natural extension of a real function is as follows. The set D is subset of Rm and thus we
write
f : D ⊂ Rm → W
x 7→ f (x ).
(B.9)
Note that we keep W ⊂ R.
The concept of limit and continuity translates one-to-one. Differentiability, however, is different and slightly more delicate.
Definition B.8. The partial derivative of f : D ⊂ R → W with respect to xj is defined by
f (x1 , . . . , xj−1 , xj + h, xj+1 , . . . , xm ) − f (x1 , . . . , xn )
∂f (x )
= lim
,
h→0
∂xj
h
(provided it exists).
(B.10)
♢
The derivative of f with respect to all components is thus a vector
f ′ (x ) =
∂f (x )
∂x1
,...,
∂f (x ) ⊤
∂xm
(B.11)
Hence f ′ (x ) is a vector valued function from D to Rm and is called the gradient of f at x ,
also denoted with grad(f (x)) = ∇f (x).
Remark B.1. The existence of partial derivatives is not sufficient for the differentiability of the
function f .
♣
In a similar fashion, higher order derivatives can be calculated. For example, taking the
derivative of each component of (B.11) with respect to all components is an matrix with components
f ′′ (x ) =
∂ 2 f (x ) ∂xi ∂xj
,
(B.12)
called the Hessian matrix.
It is important to realize that the second derivative constitutes a set of derivatives of f : all
possible double derivatives.
275
B.3. APPROXIMATING FUNCTIONS
B.3
Approximating Functions
Quite often, we want to approximate functions.
Property B.5. Let f : D → R with continuous Then there exists ξ ∈ [a, x] such that
1
f (x) = f (a) + f ′ (a)(x − a) + f ′′ (a)(x − a)2 + . . .
2
1 (m)
1
+
f (a)(x − a)m +
f (m+1) (ξ)(x − a)m
m!
(m + 1)!
(B.13)
We call (B.13) Taylor’s formula and the last term, often denoted by Rn (x), as the reminder
of order n. Taylor’s formula is an extension of the mean value theorem.
If the function has bounded derivatives, the reminder Rn (x) converges to zero as x → a.
Hence, if the function is at least twice differentiable in a neighborhood of a then
1
f (a) + f ′ (a)(x − a) + f ′′ (a)(x − a)2
2
(B.14)
is the best quadratic approximation in this neighborhood.
If all derivatives of f exist in an open interval I with a ∈ I, we have for all x ∈ I
f (x) =
∞
X
1
r=0
r!
f (r) (a)(x − a)r
(B.15)
Often the approximation is for x = a + h, h small.
Taylor’s formula can be expressed for multivariate real functions. Without stating the precise
assumptions we consider here the following example
f (a + h) =
∞
X
X
r=0 i :i1 +···+in =r
extending (B.15) with x = a + h.
1
∂ r f (a)
hi1 hi2 . . . hinn ,
i1 !i2 ! . . . in ! ∂xi1 . . . ∂xin 1 2
(B.16)
276
APPENDIX B. CALCULUS
Appendix C
Linear Algebra
In this chapter we cover the most important aspects of linear algebra, namely of notational
nature.
C.1
Vectors, Matrices and Operations
A collection of p real numbers is called a vector, an array of n × m real numbers is called a
matrix. We write




x1
a11 . . . a 1m
 . 
 .
.. 
. 
.
x =
A = (aij ) = 
(C.1)
. 
 . ,
 .
.
xp
an1 . . . anm
Providing the dimensions are coherent, vector and matrix addition (and subtraction) is performed
componentwise, as is scalar multiplication. That means, for example, that x ± y is a vector with
elements xi ± yi and cA is a matrix with elements caij .
The n × n identity matrix I is defined as the matrix with ones on the diagonal and zeros
elsewhere. We denote the vector with solely one elements with 1 similarly, 0 is a vector with only
zero elements. A matrix with entries d1 , . . . , dn on the diagonal and zero elsewhere is denoted
with diag(d1 , . . . , dn ) or diag(di ) for short and called a diagonal matrix. Hence, I = diag(1).
To indicate the ith-jth element of A, we use (A)ij . The transpose of a vector or a matrix
flips its dimension. When a matrix is transposed, i.e., when all rows of the matrix are turned
into columns (and vice-versa), the elements aij and aji are exchanged. Thus (A⊤ )ij = (A)ji .
The vector x ⊤ = (xa , . . . , xp ) is termed a row vector. We work mainly with column vectors as
shown in (C.1).
In the classical setting of real numbers, there is only one type of multiplication. As soon
as we have several dimensions, several different types of multiplications exist, notably scalar
multiplication, matrix multiplication and inner product (and actually more such as the vector
product, outer product).
Let A and B be two n × p and p × m matrices. Matrix multiplication AB is defined as
AB = C
with
(C)ij =
p
X
k=1
277
aik bkj .
(C.2)
278
APPENDIX C. LINEAR ALGEBRA
This last equation shows that the matrix I is the neutral element (or identity element) of the
matrix multiplication.
Definition C.1. The inner product between two p-vectors x and y is defined as x ⊤ y =
Pp
⊤
i=1 xi yi . There are several different notations used: x y = ⟨a, b⟩ = x · y .
If for an n × n matrix A there exists an n × n matrix B such that
AB = BA = I,
(C.3)
then the matrix B is uniquely determined by A and is called the inverse of A, denoted by A−1 .
C.2
Linear Spaces and Basis
The following definition formalizes one of the main spaces we work in.
Definition C.2. A vector space over R is a set V with the following two operations:
1. + : V × V → V (vector addition)
2. · : R × V → V (scalar multiplication).
♢
Typically, V is Rp , p ∈ N.
In the following we assume a fixed d and the usual operations on the vectors.
Definition C.3.
1. The vectors v 1 , . . . , v k are linearly dependent if there exists scalars a1 , . . . , ak
(not all equal to zero), such that a1 v 1 + · · · + ak v k = 0.
2. The vectors v 1 , . . . v k are linearly independent if a1 v 1 + · · · + ak v k = 0 cannot be satisfied
by any scalars a1 , . . . , ak (not all equal to zero).
♢
In a set of linearly dependent vectors, each vector can be expressed as a linear combination
of the others.
Definition C.4. The set of vectors {b 1 , . . . , b d } is a basis of a vectors space V if the set is
linearly independent and any other vector v ∈ V can be expressed by v = v1 b 1 + · · · + vd b d . ♢
The following proposition summarizes some of the relevant properties of a basis.
Property C.1.
1. The decomposition of a vector v ∈ V in v = v1 b1 + · · · + vd bd is unique.
2. All basis of V have the same cardinality, which is called the dimension of V , dim(V ).
3. If there are two basis {b1 , . . . , bd } and {e1 , . . . , ed } then there exists a d × d matrix A such
that ei = Abi , for all i.
Definition C.5. The standard basis, or canonical basis of V = Rd is {e 1 , . . . , e d } with e i =
(0, . . . , 0, 1, 0, . . . )⊤ , i.e., the vector with a one at the ith position and zero elsewhere.
♢
279
C.3. PROJECTIONS
Definition C.6. Let A be a n × m matrix. The column rank of the matrix is the dimension
of the subspace that the m columns of A span and is denoted by rank(A). A matrix is said to
have full rank if rank(A) = m.
The row rank is the column rank of A⊤ .
♢
Some fundamental properties of the rank are as follows.
Property C.2. Let A be a n × m matrix.
1. The column rank and row rank are identical.
2. rank(A⊤ A) = rank(AA⊤ ) = rank(A).
3. rank(A) ≤ dim(V ).
4. rank(A) ≤ min(m, n).
5. For an appropriately sized matrix B rank(A + B) ≤ rank(A) + rank(B) and rank(AB) ≤
min rank(A), rank(B) .
C.3
Projections
We consider classical Euclidean vector spaces with elements x = (x1 , . . . , xp )⊤ ∈ Rp with EuP
clidean norm ||x || = ( i x2i )1/2 .
To illustrate projections, consider the setup illustrated in Figure C.1, where y and a are two
vectors in R2 . The subspace spanned by a is
{λa, λ ∈ R} = {λa/||a||, λ ∈ R}
(C.4)
where the second expression is based on a normalized vector a/||a||. By the (geometric) definition
of the inner product (dot product),
< a, b > = a ⊤ b = ||a||||b|| cos θ
(C.5)
where θ is the angle between the vectors. Classical trigonometric properties state that the length
of the projection is a/||a|| · ||y || cos(θ). Hence, the projected vector is
a a⊤
y = a(a ⊤ a)−1 a ⊤ y .
||a|| ||a||
(C.6)
In statistics we often encounter expressions like this last term. For example, ordinary least
squares (“classical” multiple regression) is a projection of the vector y onto the column space
spanned by X, i.e., the space spanned by the columns of the matrix X. The projection is
X(X⊤ X)−1 X⊤ y . Usually, the column space is in a lower dimension.
y
θ
a
Figure C.1: Projection of the vector y onto the subspace spanned by a.
280
APPENDIX C. LINEAR ALGEBRA
Remark C.1. Projection matrices (like H = X(X⊤ X)−1 X⊤ ) have many nice properties such
as being symmetric, being idempotent, i.e., H = HH, having eigenvalues within [0, 1], (see next
section), rank(H) = rank(X), etc.
♣
C.4
Matrix Decompositions
In this section we elaborate representations of a matrix as a product of two or three other
matrices.
Let x be a non-zero n-vector (i.e., at least one element is not zero) and A an n × n matrix.
We can interpret A(x ) as a function that maps x to Ax . We are interested in vectors that
change by a scalar factor by such a mapping
Ax = λx ,
(C.7)
where λ is called an eigenvalue and x an eigenvector.
A matrix has n eigenvalues, {λ1 , . . . , λn }, albeit not necessarily different and not necessarily
real. The set of eigenvalues and the associated eigenvectors denotes an eigendecomposition.
For all square matrices, the set of eigenvectors span an orthogonal basis, i.e., are constructed
that way.
We often denote the set of eigenvectors with γ 1 , . . . , γ n . Let Γ be the matrix with columns
γ i , i.e., Γ = (γ 1 , . . . , γ n ). Then
Γ⊤ AΓ = diag(λ1 , . . . , λn ),
(C.8)
due to the orthogonality property of the eigenvectors Γ⊤ Γ = I. This last identity also implies
that A = Γ diag(λ1 , . . . , λn )Γ⊤ .
In cases of non-square matrices, an eigendecomposition is not possible and a more general
approach is required. The so-called singular value decomposition (SVD) works or any n × m
matrix B,
B = UDV⊤
(C.9)
where U is an n × min(n, m) orthogonal matrix (i.e., U⊤ U = In ), D is an diagonal matrix
containing the so-called singular values and V is an min(n, m) × m orthogonal matrix (i.e.,
V⊤ V = Im ).
We say that the columns of U and V are the left-singular vectors and right-singular vectors,
respectively.
Note however, that the dimensions of the corresponding matrices differ in the literature, some
write U and V as square matrices and V as a rectangular matrix.
Remark C.2. Given an SVD of B, the following two relations hold:
BB⊤ = UDV⊤ (UDV⊤ )⊤ = UDV⊤ VDU⊤ = UDDU⊤
(C.10)
B⊤ B = (UDV⊤ )⊤ UDV⊤ = VDU⊤ UDV⊤ = VDDV⊤
(C.11)
and hence the columns of U and V are eigenvectors of BB⊤ and B⊤ B, respectively, and most
importantly, the elements of D are the square roots of the (non-zero) eigenvalues of BB⊤ or
B⊤ B.
♣
Besides an SVD there are many other matrix factorization. We often use the so-called
Cholesky factorization, as - to a certain degree - it generalizes the concept of a square root for
matrices. Assume that all eigenvalues of A are strictly positive, then there exists a unique lower
triangular matrix L with positive entries on the diagonal such that A = LL⊤ . There exist very
efficient algorithm to calculate L and solving large linear systems is often based on a Cholesky
factorization.
The determinant of a square matrix essentially describes the change in “volume” that associated linear transformation induces. The formal definition is quite complex but it can be written
Q
as det(A) = ni=1 λi for matrices with real eigenvalues.
The trace of a matrix is the sum of its diagonal entries.
C.5
Positive Definite Matrices
Besides matrices containing covariates, we often work with variance-covariance matrices, which
represent an important class of matrices as we see now.
Definition C.7. A n × n matrix A is positive definite (pd) if
x ⊤ Ax > 0,
for all x ̸= 0.
Further, if A = A⊤ , the matrix is symmetric positive definite (spd).
Relevant properties of spd matrices A = (aij ) are given as follows.
Property C.3.
1. rank(A) = n
2. the determinant is positive, det(A) > 0
3. all eigenvalues are positive, λi > 0
4. all elements on the diagonal are positive, aii > 0
5. aii ajj − a2ij > 0, i ̸= j
6. aii + ajj − 2|aij | > 0, i ̸= j
7. A−1 is spd
8. all principal sub-matrices of A are spd.
281
(C.12)
♢
282
References
For a non-singular matrix A, written as a 2 × 2 block matrix (with square matrices A11 and
A22 ), we have
A
−1
=
A11 A12
A21 A22
!−1
=
−1
−1
A−1
−A−1
11 + A11 A12 CA21 A11
11 A12 C
−1
−CA21 A11
C
−1
with C = (A22 − A21 A−1
11 A12 ) . Note that A11 and C need to be invertible.
It also holds that det(A) = det(A11 ) det(A22 ).
!
(C.13)
Bibliography
Abraham, B. and Ledolter, J. (2006). Introduction To Regression Modeling. Duxbury applied
series. Thomson Brooks/Cole.
Agresti, A. (2002). Categorical data analysis. A Wiley-Interscience publication. Wiley, New
York, second edition.
Agresti, A. (2007). An introduction to categorical data analysis. Wiley, New York, second edition.
Ahrens, J. H. and Dieter, U. (1972). Computer methods for sampling from the exponential and
normal distributions. Communications of the ACM, 15, 873–882.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
In Petrov, B. and Csaki, F., editors, 2nd International Symposium on Information Theory,
267–281. Akadémiai Kiadó.
Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. John Wiley & Sons Inc.,
Chichester.
Bland, J. M. and Bland, D. G. (1994). Statistics notes: One and two sided tests of significance.
BMJ, 309, 248.
Box, G. E. P. and Draper, N. R. (1987). Empirical Model-building and Response Surfaces. Wiley.
Brown, L. D., Cai, T. T., and DasGupta, A. (2002). Confidence intervals for a binomial proportion and asymptotic expansions. The Annals of Statistics, 30, 160–201.
Canal, L. (2005). A normal approximation for the chi-square distribution. Computational Statistics & Data Analysis, 48, 803 – 808.
Cleveland, W. S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, U.S.A.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Routledge.
Dalal, S. R., Fowlkes, E. B., and Hoadley, B. (1989). Risk analysis of the space shuttle: Prechallenger prediction of failure. Journal of the American Statistical Association, 84, 945–957.
Devore, J. L. (2011). Probability and Statistics for Engineering and the Sciences. Brooks/Cole,
8th edition.
Edgington, E. and Onghena, P. (2007). Randomization Tests. CRC Press, 4th edition.
283
284
BIBLIOGRAPHY
Ellis, T. H. N., Hofer, J. M. I., Swain, M. T., and van Dijk, P. J. (2019). Mendel’s pea crosses:
varieties, traits and statistics. Hereditas, 156, 33.
Fahrmeir, L., Kneib, T., and Lang, S. (2009). Regression: Modelle, Methoden und Anwendungen.
Springer, 2 edition.
Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression: Models, Methods and
Applications. Springer.
Faraway, J. J. (2006). Extending the Linear Model with R: Generalized Linear, Mixed Effects
and Nonparametric Regression Models. CRC Press.
Farcomeni, A. (2008). A review of modern multiple hypothesis testing, with particular attention
to the false discovery proportion. Statistical Methods in Medical Research, 17, 347–388.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications: Volume I. Number
Bd. 1 in Wiley series in probability and mathematical statistics. John Wiley & Sons.
Fisher, R. A. (1938). Presidential address. Sankhyā: The Indian Journal of Statistics, 4, 14–17.
Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data
analysis. IEEE Transactions on Computers, C-23, 881–890.
Friendly, M. and Denis, D. J. (2001).
Milestones in the history of thematic
cartography,
statistical graphics,
and data visualization.
Web document,
www.math.yorku.ca/SCS/Gallery/milestone/. Accessed: September 16, 2014.
Furrer, R. and Genton, M. G. (1999). Robust spatial data analysis of lake Geneva sediments
with S+SpatialStats. Systems Research and Information Science, 8, 257–272.
Galarraga, V. and Boffetta, P. (2016). Coffee drinking and risk of lung cancer–a meta-analysis.
Cancer Epidemiology, Biomarkers & Prevention, 25, 951–957.
Gelman, A., Lee, D., and Guo, J. (2015). Stan: A probabilistic programming language for
bayesian inference and optimization. Journal of Educational and Behavior Science, 40, 530–
543.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences.
Statistical Science, 7, 457–511.
Gemeinde Staldenried (1994). Verwaltungsrechnung 1993 Voranschlag 1994.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to calculating posterior
moments. In Bernardo, J. M., Berger, J., Dawid, A. P., and Smith, J. F. M., editors, Bayesian
Statistics 4, 169–193. Oxford University Press, Oxford.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and
Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to
misinterpretations. European Journal of Epidemiology, 31, 337–350.
Grinstead, C. M. and Snell, J. L. (2003). Introduction to Probability. AMS.
BIBLIOGRAPHY
285
Hampel, F. R., Ronchetti, E., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust statistics: the
approach based on influence functions. Wiley New York.
Held, L. (2008). Methoden der statistischen Inferenz: Likelihood und Bayes. Springer, Heidelberg.
Held, L. and Sabanés Bové, D. (2014). Applied Statistical Inference. Springer.
Hernán, M. A., Alonso, A., and Logroscino, G. (2008). Cigarette smoking and dementia: potential selection bias in the elderly. Epidemiology, 19, 448–450.
Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods. John Wiley & Sons.
Hombach, M., Ochoa, C., Maurer, F. P., Pfiffner, T., Böttger, E. C., and Furrer, R. (2016).
Relative contribution of biological variation and technical variables to zone diameter variations
of disc diffusion susceptibility testing. Journal of Antimicrobial Chemotherapy, 71, 141–151.
Horst, A. M., Hill, A. P., and Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago
(Antarctica) penguin data. R package version 0.1.0.
Huber, P. J. (1981). Robust Statistics. John Wiley & Sons Inc.
Hüsler, J. and Zimmermann, H. (2010). Statistische Prinzipien für medizinische Projekte. Huber,
5 edition.
Jeffreys, H. (1983). Theory of probability. The Clarendon Press Oxford University Press, third
edition.
Johnson, N. L., Kemp, A. W., and Kotz, S. (2005). Univariate Discrete Distributions. WileyInterscience, 3rd edition.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994). Continuous Univariate Distributions,
Vol. 1. Wiley-Interscience, 2nd edition.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous Univariate Distributions,
Vol. 2. Wiley-Interscience, 2nd edition.
Kruschke, J. K. (2010). Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic
Press, first edition.
Kruschke, J. K. (2015). Doing Bayesian Data Analysis: A Tutorial with R, JAGS and Stan.
Academic Press/Elsevier, second edition.
Kupper, T., De Alencastro, L., Gatsigazi, R., Furrer, R., Grandjean, D., and J., T. (2008). Concentrations and specific loads of brominated flame retardants in sewage sludge. Chemosphere,
71, 1173–1180.
Landesman, R., Aguero, O., Wilson, K., LaRussa, R., Campbell, W., and Penaloza, O. (1965).
The prophylactic use of chlorthalidone, a sulfonamide diuretic, in pregnancy. J. Obstet. Gynaecol., 72, 1004–1010.
286
BIBLIOGRAPHY
LeBauer, D. S., Dietze, M. C., and Bolker, B. M. (2013). Translating Probability Density
Functions: From R to BUGS and Back Again. The R Journal, 5, 207–209.
Leemis, L. M. and McQueston, J. T. (2008). Univariate distribution relationships. The American
Statistician, 62, 45–53.
Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation. Springer-Verlag, second
edition.
Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses. Springer, 3 edition.
Lindgren, F. and Rue, H. (2015). Bayesian spatial modelling with R-INLA. Journal of Statistical
Software, 63, i19.
Lunn, D., Jackson, C., Best, N., Thomas, A., and Spiegelhalter, D. (2012). The BUGS Book:
A Practical Introduction to Bayesian Analysis. Texts in Statistical Science. Chapman &
Hall/CRC.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis. Academic Press.
McGill, R., Tukey, J. W., and Larsen, W. A. (1978). Variations of box plots. The American
Statistician, 32, 12–16.
Modigliani, F. (1966). The life cycle hypothesis of saving, the demand for wealth and the supply
of capital. Social Research, 33, 160–217.
Moyé, L. A. and Tita, A. T. (2002). Defending the rationale for the two-tailed test in clinical
research. Circulation, 105, 3062–3065.
Needham, T. (1993). A visual explanation of jensen’s inequality.
Monthly, 100, 768–771.
American Mathematical
Olea, R. A. (1991). Geostatistical Glossary and Multilingual Dictionary. Oxford University Press.
Pachel, C. and Neilson, J. (2010). Comparison of feline water consumption between still and
flowing water sources: A pilot study. Journal of Veterinary Behavior, 5, 130–133.
Petersen, K. B. and Pedersen, M. S. (2008). The Matrix Cookbook. Version 2008-11-14, http:
//matrixcookbook.com.
Plagellat, C., Kupper, T., Furrer, R., de Alencastro, L. F., Grandjean, D., and Tarradellas, J.
(2006). Concentrations and specific loads of UV filters in sewage sludge originating from a
monitoring network in Switzerland. Chemosphere, 62, 915–925.
Plummer, M. (2003). Jags: A program for analysis of bayesian graphical models using gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing
(DSC 2003). Vienna, Austria.
Plummer, M. (2016). rjags: Bayesian Graphical Models using MCMC. R package version 4-6.
BIBLIOGRAPHY
287
R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria.
Raftery, A. E. and Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies
for Markov chain Monte Carlo. Statistical Science, 7, 493–497.
Redelmeier, D. A. and Singh, S. M. (2001). Survival in academy award–winning actors and
actresses. Annals of Internal Medicine, 134, 955–962.
Rice, J. A. (2006). Mathematical Statistics and Data Analysis. Belmont, CA: Duxbury Press.,
third edition.
Rosenthal, R. and Fode, K. L. (1963). The effect of experimenter bias on the performance of the
albino rat. Behavioral Science, 8, 183–189.
Ross, S. M. (2010). A First Course in Probability. Pearson Prentice Hall, 8th edition.
Ruchti, S., Kratzer, G., Furrer, R., Hartnack, S., Würbel, H., and Gebhardt-Henrich, S. G.
(2019). Progression and risk factors of pododermatitis in part-time group housed rabbit does
in switzerland. Preventive Veterinary Medicine, 166, 56–64.
Ruchti, S., Meier, A. R., Würbel, H., Kratzer, G., Gebhardt-Henrich, S. G., and Hartnack, S.
(2018). Pododermatitis in group housed rabbit does in switzerland prevalence, severity and
risk factors. Preventive Veterinary Medicine, 158, 114–121.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian
models by using integrated nested Laplace approximations. Journal of the Royal Statistical
Society B, 71, 319–392.
Siegel, S. and Castellan Jr, N. J. (1988). Nonparametric Statistics for The Behavioral Sciences.
McGraw-Hill, 2nd edition.
Snee, R. D. (1974). Graphical display of two-way contingency tables. The American Statistician,
28, 9–12.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
Sturtz, S., Ligges, U., and Gelman, A. (2005). R2WinBUGS: A package for running WinBUGS from
R. Journal of Statistical Software, 12, 1–16.
Swayne, D. F., Temple Lang, D., Buja, A., and Cook, D. (2003). GGobi: evolving from XGobi
into an extensible framework for interactive data visualization. Computational Statistics &
Data Analysis, 43, 423–444.
SWISS Magazine (2011). Key environmental figures, 10/2011–01/2012, p. 107. http://www.
swiss.com/web/EN/fly_swiss/on_board/Pages/swiss_magazine.aspx.
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press.
Tufte, E. R. (1990). Envisioning Information. Graphics Press.
288
BIBLIOGRAPHY
Tufte, E. R. (1997a). Visual and Statistical Thinking: Displays of Evidence for Making Decisions.
Graphics Press.
Tufte, E. R. (1997b). Visual Explanations: Images and Quantities, Evidence and Narrative.
Graphics Press.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
Wang, C., Torgerson, P. R., Höglund, J., and Furrer, R. (2017). Zero-inflated hierarchical models
for faecal egg counts to assess anthelmintic efficacy. Veterinary Parasitology, 235, 20–28.
Wasserstein, R. L. and Lazar, N. A. (2016). The asa statement on p-values: Context, process,
and purpose. The American Statistician, 70, 129–133.
Glossary
Throughout the document we tried to be consistent with standard mathematical notation. We
write random variables as uppercase letters (X, Y , . . . ), realizations as lower case letters (x, y,
. . . ), matrices as bold uppercase letters (Σ, X, . . . ), and vectors as bold italics lowercase letters
(x , β, . . . ). (The only slight confusion arises with random vectors and matrices.)
The following glossary contains a non-exhaustive list of the most important notation. Standard operators or products are not repeatedly explained.
:=
Define the left hand side by the expression on the other side.
♣, ♢
R P Q
, ,
End of example, end of definition end of remark.
[a, b], [c, d[
Closed interval {x ∈ R | a ≤ x ≤ b}, and half open interval {x ∈ R | c ≤ x <
d}.
∪, ∩
Union, intersection of sets or events.
Integration, summation and product symbol. If there is no ambiguity, we omit
the domain in inline formulas.
∅
Empty set.
Ac
Complement of the set A.
B\A
Relative complement of A in B.
All elements of the set B that are not in the set A: {x ∈ B | x ̸∈ A}.
θb
x
|x|
Estimator or estimate of the parameter θ.
P
Sample mean: ni=1 xi /n.
Absolute value of the scalar x.
||x ||
Norm of the vector x .
X⊤
Transpose of an matrix X.
x(i)
Order statistics of the sample x1 , . . . , xn .
0, 1
Vector or matrix with components 0 respectively 1.
Cov(X, Y )
Covariance between two random variables X and Y .
Corr(X, Y )
Correlation between two random variables X and Y .
d ′ ∂
dx , , ∂x
Derivative and partial derivative with respect to x.
diag(A)
Diagonal entries of an (n × n)-matrix A.
ε, εi
Random variable or process, usually measurement error.
E(X)
Expectation of the random variable X.
289
290
Glossary
e, exp(·)
Transcendental number e = 2.71828 18284, the exponential function.
n!
n
k
In = I
Factorial of a positive integer n: n! = n(n − 1)(n − 2) · · · 1, with 0! = 1.
n
n!
Binomial coefficient defined as
=
.
k
k!(n − k)!
Identity matrix, I = (δij ).
I{A}
Indicator function, talking the value one if A is true and zero otherwise.
limx→a , limx↗a ,
limx↘a
Limits as x approaches a (two sided), one sided limits where x approaches a
from the left and from the right.
log(·)
Logarithmic function to the base e.
max{A}, min{A} Maximum, minimum of the set A.

x
(n/2+1/2) ,
med(xi )
Median of the sample x1 , . . . , xn :
 1 (x
+x
2
(n/2
if n odd,
(n/2+1) ),
if n odd,
N, Nd
Space of natural numbers, of d-vectors with natural elements.
φ(x)
Φ(x)
Gaussian probability densitiy function φ(x) = (2π)−1/2 exp(−x2 /2).
Rx
Gaussian cumulative distribution function Φ(x) = −∞ φ(z) dz.
π
Transzendental number π = 3.14159 26535.
P(A)
Probability of the event A.
R, Rn , Rn×m
Space of real numbers, real n-vectors and real (n × m)-matrices.
rank(A)
s
s2
The rank of a matrix A is defined as the number of linearly independent rows
(or columns) of A.
√
Sample standard deviation: s = s2 .
1 Pn
2
Sample variance: s2 = n−1
i=1 (xi − x) .
tr(A)
Trace of an matrix A defined by the sum of its diagonal elements.
Var(X)
Variance of the random variable X.
Z, Zd
Space of integers, of d-vectors with integer elements.
The following table contains the abbreviations of the statistical distributions (dof denotes degrees
of freedom).
N (0, 1), zp
Standard standard normal distribution, p-quantile thereof.
N (µ, σ 2 )
Gaussian or normal distribution with parameters µ and σ 2 (being mean and
variance), σ 2 > 0.
Np (µ, Σ)
Normal p dimensional distribution with mean vector µ and symmetric positive
definite (co-)variance matrix Σ.
Bin(n, p),
bn,p,1−α
Binomial distribution with n trials and success probability p,
1 − α-quantile thereof, 0 < p < 1.
Pois(λ)
Exp(λ)
Poisson distribution with parameter λ, λ > 0.
Exponential distribution with rate parameter λ, λ > 0.
291
Glossary
U(a, b)
Uniform distribution over the support [a, b], −∞ < a < b < ∞.
Xν2 , χ2ν,p
Chi-squared distribution with ν dof, p-quantile thereof.
Tn , tn,p
Student’s t-distribution with n dof, p-quantile thereof.
Fm,n , fm,n,p
F -distribution with m and n dof, p-quantile thereof.
Ucrit (nx ,ny ;1−α) 1 − α-quantile of the distribution of the Wilcoxon rank sum statistic.
Wcrit (n⋆ ; 1−α)
1 − α-quantile of the distribution of the Wilcoxon signed rank statistic.
The following table contains the abbreviations of the statistical methods, properties and quality
measures.
EDA
Exploratory data analysis.
DoE
Design of experiment.
DF, dof
Degrees of freedom.
MAD
Median absolute deviation.
MAE
Mean absolute error.
ML
Maximum likelihood (ML estimator or ML estimation).
MM
Method of moments.
MSE
Mean squared error.
OLS, LS
Ordinary least squares.
RMSE
Root mean squared error.
SS
Sums of squares.
WLS
Weighted least squares.
292
Glossary
Index of Statistical Tests and
Confidence Intervals
General remarks about statistical tests, 91
Comparing a sample mean with a theoretical value, 93
Comparing means from two independent samples, 95
Comparing means from two paired samples, 97
Comparing the distribution of two paired samples, 129
Comparing the locations of repeated measures, 134
Comparing the locations of several independent samples, 136
Comparing the locations of two independent samples, 131, 138
Comparing the locations of two paired samples, 138
Comparing the locations of two paired samples with the sign test, 128
Comparing two variances, 98
Comparison of observations with expected frequencies, 118
Performing a one-way analysis of variance, 203
Test of a linear relationship in regression, 170
Test of correlation, 159
Test of proportions, 117
Confidence interval for regression coefficients, 179
Confidence interval for relative risk (RR), 113
Confidence interval for the mean µ, 71
Confidence interval for the odds ratio (OR), 114
Confidence interval for the Pearson correlation coefficient, 159
Confidence interval for the variance σ 2 , 72
Confidence intervals for mean response and for prediction, 172
Confidence intervals for proportions, 108
293
294
Index of Statistical Tests and CIs
Video Index
The following index gives a short description of the available videos, including a link to the
referenced page. The videos are uploaded to https://tube.switch.ch/.
Chapter 0
What are all these videos about?, vi
Chapter 8
Construction of general multivariate normal variables, 153
Important comment about an important equation, 154
Proof that the correlation is bounded, 148
Properties of expectation and variance in the setting of random vectors, 148
Chapter 9
Two classical estimators and estimates for random vectors, 161
Chapter A
Installing of RStudio, 268
295
296
Video Index
Index of Terms
A randomized controlled trial, 224
Convergence in distribution, 53
Absolute errors, 53
Convergence in probability, 53
Akaike information criterion, 187
Convolution, 57
Almost sure convergence, 53
Correlation, 148
ANOVA, 133
Covariance, 147
Attrition bias, 224
Covariance matrix, 147
Coverage probability, 73
Bar plot, 9
Cramér–Rao lower bound, 68
Bayes factor, 239
Critical values, 87
Bayesian hierarchical model, 259
Bayesian information criterion, 187
Dataset, 4
Bernoulli trial, 32
Delta method, 56
Bias, 67, 223
Density function, 35
Binomial distribution, 32
Dependent variable, 178
Blinding, 225
Dependent variable, 165
Box-and-wishker plot, 11
Diagram, 9
Boxplot, 11
Distribution
F, 51
Breakdown point, 124
Bernoulli, 32
Centered second moment, 38
binomial, 32
Central limit theorem, 52
chi-square, 48
Chart, 9
Gaussian, 39
Cholesky decomposition, 153
normal, 39
Coefficient of determination, 170
Poisson, 33
Coefficient of variation, 7
standard normal, 39
Cohen’s d , 219
Student’s t, 49
Completely randomized design, 221
Distribution function, 29
Composite hypothesis, 86
Confidence interval, 70
Efficiency, 125
Confirmation bias, 223
Elementary event, 28
Confirmatory research, 216
Empirical, 7
Confounding, 223, 224
Empirical distribution function
sample, 53
Conjugate prior, 240
Contingency table, 111
Empirical distribution function, 54
Continuity correction, 53, 106
Error term, 165, 178
Control, 224
Estimated regression coefficients, 167
Conventional effect sizes, 220
Estimated regression line, 167
297
298
Estimation, 64
Estimator, 64
Euler diagrams, 28
Event, 28
Evidence-based medicine, 225
Expectation, 147
Expectation, 37
Experimental unit, 217
Explanatory variables, 178
Exploratory data analysis, 2
Index of Terms
Location, 7
Location parameter, 55
Lower quartile, 7
False discovery rate, 137
Family-wise error rate, 136
Fisher transformation, 159
Fixed effects model, 243
Mean squared error, 68, 167, 179
Median, 37
Memoryless, 44
Minimal variance unbiased estimators, 68
Mixed effects model, 243
Model
Bayesian hierarchical, 259
fixed effects, 243
mixed effects, 243
Mosaic plot, 16
Multiple linear regression, 178
Gauss–Newton algorithm, 192
Generalized linear model, 191
Geometric mean, 59
Glivenko–Cantelli theorem, 54
Graph, 9
Noise, 165, 178
Nominal scale, 5
Non-parametric regression, 193
Non-parametric test, 127
Normal equation, 178
HARKing, 101
Hat matrix, 179
Higher moments, 38
Histogram, 9
Immortal time, 223
Improper prior, 241
Independence, 46
Independent variable, 165
Independent variables, 178
Indicator function, 57
Informative prior, 241
Interquartile range, 7
Interval scale, 5
Intervention, 224
Inverse transform sampling, 55
Jensen’s inequality, 56
Joint pdf, 144
Law of large numbers, 53
Left-skewed, 9
Level, 70
Observation, 4
Observational study, 224
Occam’s razor, 186
One-sided test, 86
Ordinal scale, 5
Ordinary least squares, 167
Outlier, 8
p-hacking, 101
Parallel coordinate plots, 18
Pearson correlation coefficient, 158
Performance bias, 224
Placebo, 224
Plot, 9
Point estimate, 64
Poisson distribution, 33
Post-hoc tests, 135
Posterior odds, 239
Posterior predictive distribution, 237
Power, 88
Predicted values, 167
Prediction, 155
Predictor, 165, 178
299
Index of Terms
Prior
improper, 241
informative, 241
uninformative, 241
weakly informative, 241
Prior odds, 239
Probability measure, 28
Projection pursuit, 20
QQ-plot, 12
Qualitative data, 5
Quantile function, 36
Quantitative data, 5
R-squared, 170
Randomized complete block design, 221
Range, 7
Rank, 127
Ratio scale, 5
Rejection region, 87
Relative errors, 53
Residual standard error, 167, 179
Residuals, 167
Response units, 217
Right-skewed, 9
Robust estimators, 124
Rule of succession, 247
Sample mode, 8
Sample size, 7
Sample space, 28
Sample standard deviation, 7
Sample variance, 7
Scale parameter, 55
Second moment, 38
Selection bias, 223
Significance level, 87
Simple hypothesis, 86
Simple linear regression, 165
Simple randomization, 221
Size, 87
Split plot design, 223
Spread, 7
Standard error, 71
Statistic, 7, 64
Statistical model, 61
Studentized range, 7
Taylor series, 56
Trace-plot, 255
Trimmed mean, 7
Two-sided test, 86
Unbiased, 67
Uniformly most powerful, 90
Unimodal, 9
Uninformative prior, 241
Upper quartile, 7
Variables, 4
Variance, 38
Variance–covariance, 147
Violin plot, 12
Weakly informative prior, 241
Welch’s two sample t-test, 95
z -transformation, 39
300
Index of Terms
Download