Seminar for Statistics
Week 1
Dr. Jonathan Koh
Computational Statistics FS2025
Outline
1. Introduction
2. Linear regression (§1 📖)
Seminar for Statistics
Computational Statistics: Week 1
1/33
Course details
• This is a statistics/mathematics course: knowledge of basic probability and statistics notions is
required.
• Course material posted on Moodle. Use the forum for Q&A!
• Evaluation: on-site “digital” exam (100%).
• Bring your laptops for the exercises.
• We will use R (www.r-project.org). Chance to learn a new language!
• What is examinable?
– The script (I will indicate which sections are not examinable).
– The slides (which will contain some extra material), and also the blackboard.
• Some references:
[ISL] An Introduction to Statistical Learning, Springer.
http://www-bcf.usc.edu/~gareth/ISL/.
[ESL] The Elements of Statistical Learning, Springer.
http://web.stanford.edu/~hastie/ElemStatLearn/.
[CASI] Computer Age Statistical Inference, CUP. http://web.stanford.edu/~hastie/CASI/.
What is Computational Statistics?
Original comic by @sandserifcomics
Motivation
• We will study the statistical aspects of some widely used machine learning (ML) methods.
• “Machine learning” aims at programming computers to learn information from data.
• Machine learning algorithms are mostly used for prediction, but can also be useful for interpreting
the relationships between inputs and outputs. More on this later...
• These methods are used everywhere: finance, marketing, healthcare, tech companies, climatology,
ecology, …
• Some terminology related to machine learning: statistical learning, computational statistics, data
science, data analytics, big data, …
• Aim of this course: give you a good basis to understand more advanced methods and to
develop your own strategies to analyze complex data. We will do a bit of theory, but will focus
more on the methodology.
• At the end of the semester you should be able to use the statistical techniques discussed in this course,
using existing software or writing your own code.
A (personal) teaching experiment
• EduApp
– Occasional breather in the lectures to test understanding
– Some data collection? – we will use this to have a bit of fun in the lectures
• Lots of R live demonstrations
• Chocolates? 🍫
• 🍫 indicates a “food for thought”, ✏️ indicates blackboard time!
EduApp break 1
What are some data collection problems in practice? 🍫
An important comment: Prediction vs. Inference
• With every ML method, ask yourself: why are we doing this?
• Many ML methods are focused more on prediction
– Lots of benefits, with very impressive (recent!) development in this field
– Downsides (especially in applications)? Inference?
▶ Understanding of nature, and how the world functions?
▶ Not always clear how to have a measure of uncertainty
• Breiman, 2001: Statistical Modeling: The Two Cultures (with comments and a rejoinder by the
author), Statist. Sci. 16(3): 199-231
– Caveat: read the comments too!
Trends
Google Trends: interest over 2004–Present for big data (blue), machine learning (red) and data science
(orange).
Example: body fat
Dataset bodyfat (mfp). The data are body fat estimates (siri) for 252 men, with measurements of different
body attributes. 5 suspicious observations were removed. The first 10 measurements and a plot of the responses
versus some of the predictors (for the remaining 247 observations):
siri  age  weight  height  neck  chest   abdo    hip  thigh  knee  ankle  biceps  forearm  wrist
12.3   23  154.25   67.75  36.2   93.1   85.2   94.5   59.0  37.3   21.9    32.0     27.4   17.1
 6.1   22  173.25   72.25  38.5   93.6   83.0   98.7   58.7  37.3   23.4    30.5     28.9   18.2
25.3   22  154.00   66.25  34.0   95.8   87.9   99.2   59.6  38.9   24.0    28.8     25.2   16.6
10.4   26  184.75   72.25  37.4  101.8   86.4  101.2   60.1  37.3   22.8    32.4     29.4   18.2
28.7   24  184.25   71.25  34.4   97.3  100.0  101.9   63.2  42.2   24.0    32.2     27.7   17.7
20.9   24  210.25   74.75  39.0  104.5   94.4  107.8   66.0  42.0   25.6    35.7     30.6   18.8
19.2   26  181.00   69.75  36.4  105.1   90.7  100.3   58.4  38.3   22.9    31.9     27.8   17.7
12.4   25  176.00   72.50  37.8   99.6   88.5   97.1   60.0  39.4   23.2    30.5     29.0   18.8
 4.1   25  191.00   74.00  38.1  100.9   82.5   99.9   62.9  38.3   23.8    35.9     31.1   18.2
11.7   23  198.25   73.50  42.1   99.6   88.6  104.1   63.1  41.7   25.0    35.6     30.0   19.2
[Figure: body fat percentage plotted against weight (lbs), abdomen circumference (cm) and knee circumference (cm).]
Example: spam classification
Dataset spam (kernlab). 4601 emails classified as spam/non-spam, and 57 variables indicating
the frequency of certain words and characters. A subset of the data:
 row  type     george  free  credit  money    hp  business   your     !  capitalTotal
1222  spam       0.00  0.00    5.19   0.64  0.00      1.29   1.29  0.09        135.00
1712  spam       0.00  0.44    0.00   0.00  0.00      0.00   0.00  0.00        186.00
2635  nonspam    0.66  1.33    0.00   0.22  3.34      0.00   0.44  0.37        411.00
4176  nonspam    0.00  0.00    0.00   0.00  0.00      0.00   0.22  0.04         97.00
 928  spam       0.00  0.00    0.00   0.00  0.00      0.00   0.00  0.40        495.00
4129  nonspam    0.00  0.00    0.00   1.16  0.00      0.00   1.16  0.49         34.00
4341  nonspam    0.00  0.00    0.00   0.00  0.00      0.00   0.00  0.77         18.00
3036  nonspam    0.00  0.00    0.00   0.00  0.00      0.00   2.32  0.00         37.00
2890  nonspam    0.00  0.00    0.00   0.00  1.72      0.00   2.58  0.11         58.00
 284  spam       0.00  0.40    1.81   0.60  0.00      1.61   2.62  1.45        513.00
 946  spam       0.00  0.00    0.00   0.00  0.00      0.00   0.00  0.03        339.00
 811  spam       0.00  0.00    1.02   0.00  0.34      0.00   0.68  0.90       1330.00
3153  nonspam    0.38  0.00    0.00   0.00  0.90      0.00   0.00  0.00       1232.00
1763  spam       0.00  0.00    0.00   0.00  0.00      0.00  11.11  0.00          4.00
3532  nonspam    2.00  0.00    0.00   0.00  4.00      0.00   0.00  0.00         46.00
2283  nonspam    0.00  0.00    0.00   0.00  0.00      0.00   0.00  0.00         14.00
3291  nonspam    0.00  0.00    0.00   0.00  0.00      0.00   0.00  0.00          1.00
4547  nonspam    0.00  0.00    0.00   0.00  0.00      0.00   0.00  0.00          5.00
1742  spam       0.00  0.00    0.00   0.00  0.00      0.00   0.47  0.69        239.00
3563  nonspam    4.16  0.00    0.00   0.00  8.33      0.00   0.00  0.00         30.00
Example: Netflix rating prediction
Netflix Prize: 100,480,507 ratings that 480,189 users gave to 17,770 movies. Each entry is of
the form <user, movie, date of grade, grade>, with grade ∈ {1, 2, …, 5}. Examples:
<1234, 456, 10/01/2005, 3>
<1234, 15021, 03/02/2005, 5>
<11623, 1201, 01/12/2004, 1>
<25876, 10387, 23/04/2004, 4>
<2189, 463, 05/11/2005, 1>
<324, 9056, 06/06/2004, 5>
Can you use these data to make recommendations for user #687?
Example: handwritten characters/digits recognition
The MNIST database of handwritten digits contains more than 60,000 examples (28 × 28
greyscale images). Some algorithms can recognize new handwritten digits with less than 1%
error.
Example: smart cars
Smart cars can recognize cars, pedestrians, buildings, road signs, traffic lights, … predict how
these objects will move, and make decisions based on these predictions.
CES 2016: NVIDIA DRIVENet Demo - Visualizing a Self-Driving Future (part 5)
(www.youtube.com/watch?v=HJ58dbd5g8g)
Example: heart disease
The heart dataset discussed in [ISL]. Heart disease (AHD) for 303 patients who presented with
chest pain, with 13 measurements (sex, age, chol,…). A subset of the data:
row  AHD  Age  Sex  ChestPain     RestBP  Chol  Fbs  RestECG  MaxHR
  1  No    63    1  typical          145   233    1        2    150
  2  Yes   67    1  asymptomatic     160   286    0        2    108
  3  Yes   67    1  asymptomatic     120   229    0        2    129
  4  No    37    1  nonanginal       130   250    0        0    187
  5  No    41    0  nontypical       130   204    0        2    172
  6  No    56    1  nontypical       120   236    0        0    178
  7  Yes   62    0  asymptomatic     140   268    0        2    160
  8  No    57    0  asymptomatic     120   354    0        0    163
  9  Yes   63    1  asymptomatic     130   254    0        2    147
 10  Yes   53    1  asymptomatic     140   203    1        2    155
 11  No    57    1  asymptomatic     140   192    0        0    148
 12  No    56    0  nontypical       140   294    0        2    153
 13  Yes   56    1  nonanginal       130   256    1        2    142
 14  No    44    1  nontypical       120   263    0        0    173
 15  No    52    1  nonanginal       172   199    1        0    162
 16  No    57    1  nonanginal       150   168    0        0    174
 17  Yes   48    1  nontypical       110   229    0        0    168
 18  No    54    1  asymptomatic     140   239    0        0    160
 19  No    48    0  nonanginal       130   275    0        0    139
 20  No    49    1  nontypical       130   266    0        0    171
Example: buying pattern
Loyalty cards provide a lot of information to supermarkets.
Forbes article Feb. 2012: “How Target Figured Out A Teen Girl Was Pregnant Before Her
Father Did.”
Example: ecology
Presences/absences of Persicaria vivipara recorded at 910 locations in the Swiss Alps. Included are climate
and topographic information at each location: annual mean temperature and precipitation, solar
radiation, altitude, slope, ground type, …
• Ecologists use these data to predict (the probability of) presences/absences at other locations
• Another example: first arrival of migratory birds at their breeding site
Example: climate modelling
Ground temperature is strongly modulated by the atmospheric conditions. Snapshot on July 19, 2022.
Statistical framework
• General framework: we observe some training data $(x_1, y_1), \ldots, (x_n, y_n)$ where
– the $x_i \in \mathbb{R}^p$ are called predictors, covariates, features or inputs;
– the $y_i$ are called responses or outputs (here univariate).
• Types of variables:
– quantitative variables typically take values in $\mathbb{R}$ and there is an ordering of their values
(measurements close in values are close in nature);
– qualitative variables (or categorical variables, or factors) take values in a finite set and there
is no ordering of the classes.
• Types of supervised learning tasks: regression, classification (=pattern recognition).
• When the $y_i$'s are unobserved (or nonexistent) we call the problem unsupervised learning. For
example, clustering methods aim at grouping “similar” $x_i$'s.
• Goals:
– Prediction: for a given $x_+$, predict the unobserved value of $y_+$.
– Interpretation: understand how the value of $x_i$ influences the response $y_i$.
• Typical aspects of machine learning problems:
– n is (very) large, sometimes huge, and p is also very large;
– different types of inputs, and typically missing values.
Statistical framework
• Training data are $(x_1, y_1), \ldots, (x_n, y_n)$.
• We assume $(x_i, y_i)$ are iid realizations from $(X, Y) \sim \Pr(X, Y)$, where the random vector $X$
takes values in $\mathbb{R}^p$ and the random variable $Y$ takes values in
– $\mathbb{R}$ for regression;
– a set $G = \{G_1, \ldots, G_K\}$ for classification (when $K = 2$ we typically identify $G_1$ and $G_2$ with
0/1 or −1/1).
Note that here, $X$ is random. In the lecture notes, $X$ is fixed! Does this matter? 🍫
EduApp break 2
Question: what aspects of $\Pr(X, Y)$ are we interested in if we want to predict $Y$? 🍫
We will come back to this in the next weeks! But do hold that thought!
Statistical framework
• Interpretation/inference: to fully understand the link between $X$ and $Y$ we need the conditional
distribution $\Pr(Y \mid X)$ (hard to estimate!). Often we could be interested in quantities like
$E(Y \mid X)$ (easier to estimate). But actually... why?
• Prediction: given $X = x$, we want to make a good prediction $\hat{Y}$ of the unobserved value of $Y$.
We will need to define what we mean by a good prediction (the prediction should be close to the truth,
in a sense to be defined soon).
Food for thought: What do we mean by a “good prediction”? 🍫
We will ALSO come back to this in the next weeks! But do hold that thought!
EduApp break 3
You may already know (at least) one method for prediction and interpretation!
Outline
1. Introduction
2. Linear regression (§1 📖)
History of the term “Linear Regression”
History of the regression estimation problem
• 1632: Galileo Galilei used a procedure which can be interpreted as fitting a linear relationship to
contaminated observed data
• 1805, 1809: A. M. Legendre and C. F. Gauss: solving the problem with least squares
• 1846: A. Bravais: probabilistic approach in the context of multivariate normal distributions
• 1889: F. Galton: concept of a regression function in linear regression analysis
• 1947: J. W. Tukey: first nonparametric regression estimate of local averaging type
• Present: Why is this still relevant today?
Linear models and least squares
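• A linear model describes the conditional expectation of $Y$ as a linear combination of the inputs
$X = (X_1, \ldots, X_p)^\top$,
$$E(Y \mid X) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j = \beta_0 + X^\top \beta, \tag{1}$$
where $\beta = (\beta_1, \ldots, \beta_p)^\top \in \mathbb{R}^p$.
• Assuming $X_1 = 1$, $\beta_0$ is included in $\beta$, and (1) can be written as
$$E(Y \mid X) = X^\top \beta.$$
• Linear models can be fit using least squares:
$$\hat{\beta} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} (\mathbf{y} - \mathbf{X}\beta)^\top (\mathbf{y} - \mathbf{X}\beta),$$
where $\mathbf{X}$ is an $n \times p$ matrix whose $i$th row is $x_i^\top$ and $\mathbf{y} = (y_1, \ldots, y_n)^\top$.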
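To make the least-squares criterion concrete, here is a minimal sketch in plain Python (the course itself uses R). It assumes a model with an intercept and a single predictor, and uses the first ten rows of the bodyfat example (siri regressed on abdo); that choice of predictor and the helper names `rss` and `fit_simple_least_squares` are illustrative assumptions, not part of the course material. For one predictor the minimizer has a well-known closed form, and the loop checks that nearby coefficient values give a larger residual sum of squares.

```python
# First 10 rows of the bodyfat example shown earlier in the slides.
abdo = [85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 88.6]  # abdomen circumference
siri = [12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11.7]     # body fat estimate


def rss(beta0, beta1, x, y):
    """Residual sum of squares of the line beta0 + beta1 * x."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


def fit_simple_least_squares(x, y):
    """Closed-form least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = fit_simple_least_squares(abdo, siri)
best = rss(b0, b1, abdo, siri)

# Perturbing either coefficient away from the fit should increase the RSS.
for d0, d1 in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert rss(b0 + d0, b1 + d1, abdo, siri) > best

print(f"intercept = {b0:.2f}, slope = {b1:.2f}, RSS = {best:.2f}")
```

With only ten observations the fitted numbers are purely illustrative; the point is that the fitted coefficients minimize the criterion, not that they describe the full dataset.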
Correcting/clarifying a small detail in the previous slide about the argmin. ✏️
Least squares
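• For a training set $D_n = \{(x_i, y_i)\}_{i=1}^{n}$, fit the model by minimizing
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \{y_i - x_i^\top \beta\}^2 = \|\mathbf{y} - \mathbf{X}\beta\|^2 = (\mathbf{y} - \mathbf{X}\beta)^\top (\mathbf{y} - \mathbf{X}\beta). \tag{2}$$
• Differentiating (2) with respect to $\beta$ gives the normal equations $\mathbf{X}^\top \mathbf{X} \hat{\beta} = \mathbf{X}^\top \mathbf{y}$. ✏️
– If $\mathbf{X}$ has full column rank $p$: $\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$.
– If $\mathrm{rank}(\mathbf{X}) < p$, $\hat{\beta}$ is not unique.
• The fitted values when $\mathrm{rank}(\mathbf{X}) = p$,
$$\hat{\mathbf{y}} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} = \mathbf{H}\mathbf{y}.$$
• The predicted value for an input $x_+$ is linear in $\mathbf{y}$, and when $\mathrm{rank}(\mathbf{X}) = p$,
$$\hat{y}_+ = x_+^\top \hat{\beta} = x_+^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.$$
• The residuals are $r_i = y_i - \hat{y}_i$.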
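The quantities on this slide can be computed directly. Setting the gradient of (2), $-2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\beta)$, to zero gives the normal equations, which the sketch below solves in plain Python for the first ten bodyfat rows with an intercept, abdo and knee as columns of the design matrix. The choice of predictors, the new input `x_plus` and the helpers `dot` and `solve` are hypothetical and for illustration only; in practice one would rely on a numerical library (or `lm()` in R) rather than hand-written elimination.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix (copies the inputs)
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for r in range(k + 1, n):
            factor = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= factor * m[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):
        z[k] = (m[k][n] - dot(m[k][k + 1:n], z[k + 1:])) / m[k][k]
    return z


# First 10 rows of the bodyfat example from the slides.
abdo = [85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 88.6]
knee = [37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, 41.7]
y = [12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11.7]  # siri

X = [[1.0, a, k] for a, k in zip(abdo, knee)]   # n x p design matrix, first column = intercept
Xt = [list(col) for col in zip(*X)]             # p x n transpose
XtX = [[dot(ci, cj) for cj in Xt] for ci in Xt]  # p x p matrix X^T X
Xty = [dot(col, y) for col in Xt]               # p-vector X^T y

beta_hat = solve(XtX, Xty)                      # normal equations (X has full column rank here)
fitted = [dot(row, beta_hat) for row in X]      # y_hat = X beta_hat
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Prediction at a hypothetical new input, computed in two equivalent ways.
x_plus = [1.0, 90.0, 38.0]
y_plus = dot(x_plus, beta_hat)
weights = [dot(row, solve(XtX, x_plus)) for row in X]  # X (X^T X)^{-1} x_plus
y_plus_linear = dot(weights, y)                 # same prediction, written as a linear function of y

print("beta_hat =", [round(b, 3) for b in beta_hat])
print("prediction:", round(y_plus, 3), "via weights:", round(y_plus_linear, 3))
```

The vector `weights` depends only on the design matrix and on `x_plus`, which is what "linear in y" means here; taking `x_plus` equal to the $i$th row of the design matrix gives the $i$th row of the hat matrix.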
Clarifying the difference between $x_i$, $\mathbf{X}$, and $X$; $y_i$, $\mathbf{y}$, and $Y$;
and also $\varepsilon$, $\varepsilon_i$ and $\boldsymbol{\varepsilon}$. ✏️
Linear model assumptions
• A common modelling assumption is (1) (note that it is a scalar quantity), which describes the
conditional expectation of $Y$ as a linear combination of the inputs $X = (X_1, \ldots, X_p)^\top$ and which we
repeat here:
$$E(Y \mid X) = X^\top \beta, \tag{3}$$
where $\beta = (\beta_1, \ldots, \beta_p)^\top \in \mathbb{R}^p$.
• We will also use another assumption that you might have seen (in other courses or textbooks),
which is directly on the random variable $Y$:
$$Y = X^\top \beta + \varepsilon, \qquad E(\varepsilon) = 0. \tag{4}$$
• We can get from (3) to (4) by further assuming that the deviations of $Y$ around its conditional
expectation are additive and given by $\varepsilon$, with $E(\varepsilon) = 0$. This still does not assume much about $\varepsilon$.
• A further assumption is $\mathrm{var}(\varepsilon) = \sigma^2$. Another further assumption is $\mathrm{var}(\boldsymbol{\varepsilon}) = \sigma^2 I$ (uncorrelated
errors across the observations).
• A further assumption is that $\varepsilon$ is Gaussian.
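The assumptions above can be illustrated by simulation. The following plain-Python sketch generates data from model (4) with one predictor, an intercept, and independent Gaussian errors of constant variance, then checks that least squares roughly recovers the parameters. The "true" values, the sample size, the uniform design and the seed are all arbitrary assumptions made for the illustration.

```python
import random

random.seed(1)  # fixed seed so that the sketch is reproducible

# Hypothetical "true" parameters of model (4): Y = beta0 + beta1 * X + eps
beta0_true, beta1_true, sigma = 2.0, 0.5, 1.0
n = 2000

x = [random.uniform(0.0, 10.0) for _ in range(n)]
eps = [random.gauss(0.0, sigma) for _ in range(n)]  # E(eps) = 0, var(eps) = sigma^2, Gaussian
y = [beta0_true + beta1_true * xi + ei for xi, ei in zip(x, eps)]

# Least-squares fit (closed form for a single predictor plus intercept).
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals and the usual estimate of sigma^2 (n - 2 degrees of freedom here).
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
sigma2_hat = sum(r * r for r in residuals) / (n - 2)

print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}")
```

Replacing `random.gauss` by another mean-zero error distribution would be one way to explore which assumptions the least-squares fit itself actually relies on.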
Correction to the previous slide (week 2)
When $X$ is fixed, (4) and (3) are equivalent. To see this:
• One can always define an $\varepsilon = Y - E(Y)$, which has expectation 0.
• Going from (4) to (3) is trivial.
When $X$ is random,
• One still does not need to assume more going from (3) to (4). To see this, one can always define
an $\varepsilon = Y - E(Y \mid X)$, which has expectation 0: $E(\varepsilon) = E(Y) - E\{E(Y \mid X)\} = 0$.
• Going from (4) to (3): if $\varepsilon$ and $X$ are independent, then we have what we need.
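The last step can be written out in one line. This is a sketch, assuming that (4) holds with $E(\varepsilon) = 0$ and that $\varepsilon$ is independent of $X$:

```latex
\begin{aligned}
E(Y \mid X) &= E(X^\top \beta + \varepsilon \mid X) \\
            &= X^\top \beta + E(\varepsilon \mid X) \\
            &= X^\top \beta + E(\varepsilon) && \text{(independence of } \varepsilon \text{ and } X\text{)} \\
            &= X^\top \beta .
\end{aligned}
```

Independence is sufficient but stronger than needed: the argument only uses $E(\varepsilon \mid X) = 0$.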
R demo – a chocolate example 🍫