From Summation to Vector Form: RSS
and Dot Product Explained
1. Problem Statement
In linear regression, we aim to find the best-fitting line that minimizes the difference
between the predicted and actual values. This difference is measured using the Residual
Sum of Squares (RSS). The initial form of RSS is:
RSS(β) = ∑ (yᵢ − xᵢᵀβ)² for i = 1 to N
Here:
- yᵢ is the observed value for the i-th data point
- xᵢ is the feature vector of the i-th observation (the i-th row of the design matrix X, written as a column vector), so xᵢᵀβ is the predicted value for that observation
- β is the parameter vector
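To make the summation concrete, here is a minimal sketch in Python/NumPy that evaluates RSS(β) exactly as written above, looping over the observations. The data and the candidate β are made-up values used purely for illustration.

```python
import numpy as np

# Made-up toy data: 4 observations, an intercept column plus 2 features.
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.5, 1.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.5, 0.5]])
y = np.array([1.2, 2.3, 3.1, 4.0])
beta = np.array([0.5, 1.0, -0.2])    # an arbitrary candidate parameter vector

# RSS(beta) = sum over i of (y_i - x_i^T beta)^2
rss = 0.0
for i in range(len(y)):
    residual_i = y[i] - X[i] @ beta  # y_i minus the prediction x_i^T beta
    rss += residual_i ** 2
print(rss)
```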
2. Matrix Representation of RSS
We can express RSS in matrix form as:
RSS(β) = (y − Xβ)ᵀ(y − Xβ)
Where:
- y is an N×1 vector of observed values
- X is an N×p design matrix
- β is a p×1 vector of model parameters
- (y − Xβ) is the residual vector
3. Connecting Matrix Form to Summation Form
Let r = y - Xβ. Then (y - Xβ)ᵀ(y - Xβ) = rᵀr, which is the sum of squares of the residuals. Thus:
(y - Xβ)ᵀ(y - Xβ) = ∑ (yᵢ - xᵢᵀβ)²
This confirms that the matrix form and the summation form are equivalent and represent
the same cost function.
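This equivalence is easy to check numerically. The sketch below, again with made-up toy values, confirms that the vectorized rᵀr and the explicit summation produce the same scalar.

```python
import numpy as np

# Same kind of made-up toy data as before.
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.5, 1.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.5, 0.5]])
y = np.array([1.2, 2.3, 3.1, 4.0])
beta = np.array([0.5, 1.0, -0.2])

r = y - X @ beta                                                 # residual vector r = y - Xβ
rss_matrix = r @ r                                               # rᵀr
rss_sum = sum((y[i] - X[i] @ beta) ** 2 for i in range(len(y)))  # summation form

print(np.isclose(rss_matrix, rss_sum))   # True: both forms give the same scalar
```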
4. Expanding the Matrix Form
Now we expand the matrix expression algebraically:
(y - Xβ)ᵀ(y - Xβ) = yᵀy - yᵀXβ - βᵀXᵀy + βᵀXᵀXβ
Since yᵀXβ and βᵀXᵀy are 1×1 scalars and each is the transpose of the other (and a scalar equals its own transpose), the two middle terms combine and the expression simplifies to:
RSS(β) = yᵀy - 2βᵀXᵀy + βᵀXᵀXβ
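The expansion can likewise be verified numerically. This sketch, using the same illustrative toy values as above, evaluates both sides of the identity and confirms they agree.

```python
import numpy as np

X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.5, 1.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.5, 0.5]])
y = np.array([1.2, 2.3, 3.1, 4.0])
beta = np.array([0.5, 1.0, -0.2])

lhs = (y - X @ beta) @ (y - X @ beta)                         # (y - Xβ)ᵀ(y - Xβ)
rhs = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta  # yᵀy - 2βᵀXᵀy + βᵀXᵀXβ
print(np.isclose(lhs, rhs))   # True
```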
5. Importance in Linear Regression
This form is useful because it allows us to compute the gradient of RSS with respect to β. To
find the optimal β that minimizes RSS, we set the gradient to zero:
∂RSS/∂β = -2Xᵀy + 2XᵀXβ = 0
Solving for β gives us the closed-form solution of linear regression:
β = (XᵀX)^(-1) Xᵀy
This solution assumes that XᵀX is invertible. When it is not, regularization techniques like
Ridge Regression are used.
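A minimal sketch of the closed-form solution in NumPy, on synthetic data generated only for illustration. Rather than forming the inverse explicitly, it is generally preferable to solve the linear system XᵀXβ = Xᵀy.

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + 0.1 * rng.normal(size=50)                 # noisy targets

# Closed-form OLS: solve (XᵀX) β = Xᵀy instead of inverting XᵀX explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # approximately [1.0, 2.0, -3.0]
```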
6. Understanding the Dot Product
The dot product of two vectors a and b is defined as:
a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ
It returns a scalar that measures how strongly the two vectors point in the same direction; when one of them is a unit vector, it equals the length of the projection of the other vector onto it.
Geometrically:
a · b = ||a|| ||b|| cos(θ), where θ is the angle between the two vectors.
- If θ = 0°, vectors are aligned (maximum dot product)
- If θ = 90°, vectors are orthogonal (dot product is zero)
7. Dot Product in Matrix Form
Matrix notation for dot product:
If a and b are column vectors, then:
aᵀb = scalar = a · b
Example:
a = [2, 3], b = [4, 5]
aᵀb = 2*4 + 3*5 = 23
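The same worked example in NumPy, also checking the geometric form from Section 6: the component-wise sum, the matrix product aᵀb, and ||a|| ||b|| cos(θ) all describe the same quantity.

```python
import numpy as np

a = np.array([2.0, 3.0])
b = np.array([4.0, 5.0])

dot_sum = 2 * 4 + 3 * 5                # component-wise form: 23
dot_matrix = a @ b                     # aᵀb, also 23.0

# Geometric form: a · b = ||a|| ||b|| cos(θ)
cos_theta = dot_matrix / (np.linalg.norm(a) * np.linalg.norm(b))
theta_deg = np.degrees(np.arccos(cos_theta))

print(dot_sum, dot_matrix, round(theta_deg, 1))   # 23 23.0 5.0 (the vectors are nearly aligned)
```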
8. Conclusion
The transition from summation form to matrix form for RSS allows for elegant and efficient
computation, especially when dealing with large datasets. Understanding the dot product
and matrix algebra is essential for grasping the mechanics of linear regression and its
optimization.
#LinearRegression #MatrixAlgebra #DotProduct #MachineLearning #MathForML
9. Understanding rᵀr = ∑ rᵢ²
Let r be the residual vector defined as r = y - Xβ, where:
- y is the vector of actual values
- Xβ is the vector of predicted values
Then, r is a column vector of shape (N×1). Taking the dot product of r with itself (i.e., rᵀr)
results in a scalar value:
rᵀr = ∑ (rᵢ)² for i = 1 to N
This is also the squared Euclidean norm of the residual vector. It measures the total squared
distance between the actual and predicted values, and is the objective function that linear
regression seeks to minimize.
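A small sketch, using stand-in predicted values rather than an actual Xβ, showing that rᵀr, the component-wise sum of squares, and the squared Euclidean norm are the same number.

```python
import numpy as np

y_actual = np.array([1.2, 2.3, 3.1, 4.0])
y_pred = np.array([1.0, 2.5, 3.0, 4.2])   # stand-in for the predictions Xβ

r = y_actual - y_pred                     # residual vector
rtr = r @ r                               # rᵀr
sum_sq = np.sum(r ** 2)                   # ∑ rᵢ²
sq_norm = np.linalg.norm(r) ** 2          # squared Euclidean norm ||r||²

print(np.isclose(rtr, sum_sq), np.isclose(rtr, sq_norm))   # True True
```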
10. Why βᵀXᵀy is a Scalar
To understand why βᵀXᵀy is a scalar, we need to examine the matrix dimensions involved:
- X is an N×p matrix (N observations, p features)
- y is an N×1 column vector
- Xᵀ is a p×N matrix
- βᵀ is a 1×p row vector
Multiplying in steps:
1. Xᵀy gives a p×1 column vector
2. βᵀXᵀy results in a 1×1 scalar value
This scalar is the weighted sum of the inner products between each feature column of X and the output vector y, with each inner product weighted by the corresponding coefficient in β. It plays a key role in the expanded form of the RSS equation:
RSS(β) = yᵀy - 2βᵀXᵀy + βᵀXᵀXβ
This compact expression is central to Ordinary Least Squares (OLS) regression. The middle
term βᵀXᵀy (or equivalently yᵀXβ) captures the alignment between the predicted and actual
values. The final term βᵀXᵀXβ is the squared length of the prediction vector, ||Xβ||²; it is the quadratic term that makes RSS a convex function of β.
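The dimension bookkeeping above is easy to confirm in code. The sketch below uses arbitrary random matrices of the stated shapes (the sizes N = 6 and p = 3 are purely illustrative) and also checks that βᵀXᵀy equals yᵀXβ.

```python
import numpy as np

N, p = 6, 3                         # illustrative sizes
rng = np.random.default_rng(1)
X = rng.normal(size=(N, p))         # N×p design matrix
y = rng.normal(size=(N, 1))         # N×1 column vector
beta = rng.normal(size=(p, 1))      # p×1 parameter vector

Xty = X.T @ y                       # (p×N)(N×1) -> p×1
scalar = beta.T @ Xty               # (1×p)(p×1) -> 1×1

print(Xty.shape, scalar.shape)      # (3, 1) (1, 1)
print(np.isclose(scalar.item(), (y.T @ X @ beta).item()))   # True: βᵀXᵀy equals yᵀXβ
```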
These matrix expressions allow us to easily compute gradients, solve for β analytically, and
scale linear regression to high-dimensional datasets. Understanding their structure and
scalar nature is critical when working with vector calculus, optimization, and advanced
machine learning algorithms.