
Final Exam Review - S24

CS 439
Final Exam Review
Spring 2024
5/8/2024
12:00-3:00 PM
Hill 114 - Lecture Hall
Logistics
The final exam will be given in person.
You are allowed up to one page of notes (both sides OK). Write your netID on
the notes and hand them in after the test.
Exam Composition
• Open-ended questions with short responses.
• Graduating seniors must perform significantly better
if other indicators are low
• 6 questions – multiple parts
• 3-hour long exam
• The material will mostly be drawn from the post-midterm portion of the course:
• Linear Regression
• Gradient descent
• Logistic Regression
• Feature Engineering
• Classification and regularization
• Unsupervised Learning
• Deep Learning
• Clustering
• Recommender systems
What general skills will be tested?
• SVD, PCA, interpretations, applications
• Linear regression, cost functions, gradient descent
• Logistic regression, non-linearity with sigmoid, ReLU
• Interpretation of linear models
• Bias-variance tradeoffs, regularization
• Choosing a cost function – L1, L2, huber
• Maximum likelihood estimators
• How to minimize the cost function and find optimal
parameters using differentiation/gradient descent?
• Memorization versus generalization
• Foundations of neural networks
• Computing general functions using NNs
• Recommender systems
• AND MORE …..
PCA
Capturing Variance:
Each principal component is associated with an eigenvalue, representing the variance captured
by that component.
The first principal component captures the most variance, followed by the second, and so on.
By keeping only the top few principal components, we can retain most of the important
information in the data while discarding redundant or irrelevant information.
Orthogonality:
Principal components are mutually orthogonal, meaning they are perpendicular to each other.
This property ensures that the information captured by each component is independent of the
others.
Geometric Interpretation:
Geometrically, principal components represent the axes of a new coordinate system.
The data points are projected onto these axes, preserving the distance relationships between
them.
This allows us to visualize the data in a more meaningful way, focusing on the directions that
are most relevant.
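For illustration, a minimal numpy sketch of PCA via the SVD of centered data (the toy data and variable names are our own):

```python
import numpy as np

# PCA-via-SVD sketch on toy data: 100 samples, 5 features.
X = np.random.randn(100, 5)
Xc = X - X.mean(axis=0)                       # center each feature

# Rows of Vt are the principal directions (mutually orthogonal axes)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the covariance matrix = variance captured by each component
explained_var = S**2 / (X.shape[0] - 1)
print(explained_var / explained_var.sum())    # fraction of variance per component

# Project onto the top k principal components
k = 2
Z = Xc @ Vt[:k].T                             # 100 x 2 reduced representation
```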
MLE
Find the MLE of a Bernoulli distribution with parameter p, where the likelihood of a single
observation x ∈ {0, 1} is L(p | x) = p^x * (1-p)^(1-x).
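A short worked sketch, assuming n i.i.d. samples x_1, …, x_n, each 0 or 1:
L(p | x_1, …, x_n) = Π p^(x_i) * (1-p)^(1-x_i)
log L(p) = (Σ x_i) log p + (n − Σ x_i) log(1 − p)
Setting d/dp log L(p) = (Σ x_i)/p − (n − Σ x_i)/(1 − p) = 0 gives p̂ = (Σ x_i)/n, i.e., the sample mean.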
Linear Regression
The Concept
Key ideas in regression
Regression analyzes the relationship between two types of variables:
Dependent variable (Y): The variable you are trying to predict or explain.
Independent variable (X): The variable you believe influences the dependent variable.
Linear regression: Models a linear relationship between the independent and
dependent variables.
Non-linear regression: Models a non-linear relationship, requiring more
complex models and interpretations.
The coefficient of an independent variable indicates its directional impact on
the dependent variable.
The magnitude of the coefficient represents the strength of the relationship.
Regression models provide insights into relationships, but they cannot prove
causation.
Results should be interpreted within the context of the data and limitations
of the model.
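For illustration, a minimal least-squares fit of one independent variable (the numbers below are made up):

```python
import numpy as np

# Toy illustration: fit Y = theta0 + theta1 * X by least squares.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # independent variable
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # dependent variable

A = np.column_stack([np.ones_like(X), X])         # design matrix with intercept
theta, *_ = np.linalg.lstsq(A, Y, rcond=None)     # [theta0, theta1]

# theta[1] > 0 means Y increases with X; its magnitude is the per-unit effect.
print("intercept:", theta[0], "slope:", theta[1])
```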
Extending Linear Regression
beyond one variable
Developing Notation for Multivariate
Linear Models
Linear Model
Linear in the Parameters
Feature Functions
Squared Loss
Loss Minimization
Gradient Descent for Multivariate
Linear Regression
Where alpha is the learning rate, and convergence is declared when the change
in error from one iteration to the next falls below some threshold.
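A hedged numpy sketch of this update and stopping rule (the function and variable names are our own, not the lab's):

```python
import numpy as np

# Batch gradient descent for multivariate linear regression.
# X: n x d design matrix (first column of ones for the intercept), y: n targets.
def gradient_descent(X, y, alpha=0.01, tol=1e-4, max_iter=10000):
    n, d = X.shape
    theta = np.zeros(d)
    prev_err = np.inf
    for it in range(max_iter):
        residual = X @ theta - y
        err = (residual ** 2).mean() / 2          # squared-error cost
        if abs(prev_err - err) < tol:             # stop when error barely changes
            break
        theta -= alpha * (X.T @ residual) / n     # theta_j -= alpha * dE/dtheta_j
        prev_err = err
    return theta, err, it
```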
The effect of Learning Rate & convergence
alpha | sq_error | iterations | Θ0      | Θ1      | accuracy
0.01  | 2.88153  | 71         | 0.29013 | 0.53484 | 10^-4
0.01  | 2.88080  | 92         | 0.29423 | 0.52744 | 10^-5
0.01  | 2.88072  | 114        | 0.29559 | 0.52497 | 10^-6
0.011 | 2.88147  | 65         | 0.29036 | 0.53444 | 10^-4
0.001 | 2.8901   | 493        | 0.27558 | 0.56117 | 10^-4
0.1   | NaN      | NaN        | NaN     | NaN     | 10^-4
0.005 | 2.88251  | 129        | 0.28721 | 0.54014 | 10^-4
0.02  | NaN      | NaN        | NaN     | NaN     | 10^-4
0.009 | 2.88162  | 78         | 0.28980 | 0.53544 | 10^-4
Reference: Regression Lab
• If alpha is too small, then the
convergence may be slow
• If alpha is too large, error may
not decrease on every iteration
and may not converge
• When choosing alpha, try
• 0.001, 0.01, 0.1, 1
• 0.003, 0.03, 0.3, etc.
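A small sweep in that spirit, reusing the gradient_descent sketch above (the toy data and values are our own):

```python
import numpy as np

# Try learning rates on a roughly logarithmic grid.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 1, 50)])
y = 0.3 + 0.5 * X[:, 1] + 0.05 * rng.standard_normal(50)

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    theta, err, iters = gradient_descent(X, y, alpha=alpha, tol=1e-4)
    print(f"alpha={alpha}: error={err:.5f} after {iters} iterations")
```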
Question
E(Θ): the error for a given Θ
Which of the following graphs shows a converging
gradient descent algorithm?
[Four plots of E(Θ) versus Θ: three are labeled bad, one is labeled good; the converging case is the one where the error decreases steadily toward a minimum.]
Feature Engineering
Feature Engineering (cont.)
Linear in the Parameters
Feature Functions
For Example:
Features:
How to decide on new features
Often a data scientist must design “new” features to get better models
Rules of Design
• Understand and capture domain knowledge
• Feature Extraction: Transform
• One-Hot Encoding
• Introduce polynomial features
• Encoding Cyclical Features
• Dimensionality Reduction
• Feature Importance Analysis
• Improve the expressiveness of the data representation
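For illustration, a small pandas/numpy sketch of two of these transforms (the column names are made up):

```python
import numpy as np
import pandas as pd

# Toy data with a numeric feature and a cyclical one.
df = pd.DataFrame({"size": [900, 1400, 2000], "month": [1, 6, 12]})

# Polynomial feature: lets a linear model capture a quadratic effect of size
df["size_sq"] = df["size"] ** 2

# Cyclical encoding: month 12 should be "close" to month 1
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
print(df)
```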
Encoding Categorical Data
• Categorical data → one-hot encoding
• Text data → bag-of-words & n-gram models
state | AL | … | CA | … | NY | … | WA | … | WY
NY    | 0  | … | 0  | … | 1  | … | 0  | … | 0
WA    | 0  | … | 0  | … | 0  | … | 1  | … | 0
CA    | 0  | … | 1  | … | 0  | … | 0  | … | 0
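A minimal pandas sketch producing the same kind of 0/1 columns (the toy data is our own):

```python
import pandas as pd

# One-hot encoding: each state becomes its own 0/1 column.
df = pd.DataFrame({"state": ["NY", "WA", "CA"]})
one_hot = pd.get_dummies(df["state"], prefix="state")
print(one_hot)   # columns state_CA, state_NY, state_WA with a single 1 per row
```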
The Feature Matrix
Logistic Regression Model
Notation
Hypothesis output and interpretation
• Let hΘ(x) be the probability: P(y=1|x,Θ)
• Probability that y=1 given x,Θ
• Example:
• suppose x = [1, 0.7, 0.5]
• Compute hΘ(x) for some Θ, say hΘ(x) = 0.8
• “predict” that there is an 80% chance that the patient has a malignant tumor.
• P(y=0|x,Θ) = 1 - P(y=1|x,Θ)
Issues with a linear hypothesis function
• In logistic regression, we prefer outputs between 0 and 1 (why?)
• Then we can decide that if the value is < 0.5 the label is more likely to be 0, and vice versa
• A linear hypothesis can produce outputs outside the range (0, 1)
• Solution?
• Introduce a non-linear transformation to the hypothesis function
• Next: The Sigmoid function
Non-linearizing the hypothesis function
0 ≤ g(hΘ(x)) = g(Θᵀx) ≤ 1
What are the properties of this function?
ReLU function
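A small numpy sketch of both non-linearities (the sample inputs are our own):

```python
import numpy as np

def sigmoid(z):
    # squashes any real z into (0, 1); sigmoid(0) = 0.5; monotonically increasing
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # max(0, z): zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, z)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))   # approx [0.0067, 0.5, 0.9933]
print(relu(z))      # [0., 0., 5.]
```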
Regularization
• A quadratic model could fit the training set well
• might generalize well to new examples
• But a higher order model may fit the training set
“perfectly”
• might not generalize well to new examples
The idea
The solution:
• use the higher order model, but penalize the
higher order parameters a “lot”
• Optimization problem
• Minimize {low order model + λ*(high order
terms) }
• λ, set to a large value, is called the regularization parameter
Regularization
idea
• The degree of the polynomial acts as a natural
measure of the “complexity” of the model
• higher degree polynomials are more complex
(can fit any finite data set exactly)
• fitting the models requires extremely large
coefficients on these polynomials
• Regularization is the notion of keeping weights
small
The Regularization Function R(θ)
Goal: Penalize model complexity
• More features → overfitting …
• How can we control overfitting through θ?
• Proposal: set weights = 0 to remove features
Regularized Loss Minimization
R(θ)
Question: Should we penalize Θ0 as well?
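A minimal sketch of a ridge-style regularized cost, following the common convention of not penalizing the intercept Θ0 (the names below are our own):

```python
import numpy as np

# L2-regularized squared-error cost: data fit + lam * R(theta).
def regularized_cost(theta, X, y, lam):
    err = ((X @ theta - y) ** 2).mean() / 2
    penalty = lam * np.sum(theta[1:] ** 2)   # skip theta[0], the intercept
    return err + penalty
```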
Unsupervised learning
K-means clustering
Unlabeled data
Finding labels
K-means
Algorithm
Question: What is the asymptotic complexity of this algorithm?
Example of k-means
Consider the eight 2D points in a grid given by (0,0), (0,1),(1,0),(-1,1),(1,2),(-2,1),(2,2),(3,-1).
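A minimal k-means sketch run on these eight points (k = 2; random initialization, no empty-cluster handling):

```python
import numpy as np

pts = np.array([[0, 0], [0, 1], [1, 0], [-1, 1],
                [1, 2], [-2, 1], [2, 2], [3, -1]], float)

k = 2
centers = pts[np.random.choice(len(pts), k, replace=False)]   # random initial centers
for _ in range(100):
    # Assignment step: each point goes to its nearest center (Euclidean)
    dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points
    new_centers = np.array([pts[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers
print(labels, centers)
```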
Hierarchical Clustering
example
• Cluster (0,0), (0,1),(1,0),(-1,1),(1,2),(-2,1),(2,2),(3,-1) into 2 clusters
based on Manhattan distance. What is the complexity of the
algorithm?
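A short sketch using scipy's agglomerative clustering with the cityblock (Manhattan) metric; the linkage method chosen here ("average") is our own assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[0, 0], [0, 1], [1, 0], [-1, 1],
                [1, 2], [-2, 1], [2, 2], [3, -1]], float)

Z = linkage(pts, method="average", metric="cityblock")   # repeatedly merge closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")          # cut the tree at 2 clusters
print(labels)
```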
Neural Networks for Deep
Learning (DL)
Feature Learning
Hypothesis function
Set of non-linear features
ML Challenge: good performance ~ “good” features
DL Objective: the algorithm will automatically “learn” the features
Architecture of a basic neural network (NN)
Question: What is the purpose of the hidden layer? Learn new features
Question: how many total parameters need to be optimized in this network (considering bias)?
Example
Consider one layer and
write equations to
produce the hypothesis
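A hedged sketch of a one-hidden-layer forward pass (the layer sizes and weights are made up); it also shows the usual parameter count with bias, n_hidden*(n_in+1) for the hidden layer plus (n_hidden+1) for the output unit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
x = np.array([0.5, -1.0, 2.0])

W1 = 0.1 * rng.standard_normal((n_hidden, n_in)); b1 = np.zeros(n_hidden)  # 4*(3+1) = 16 params
W2 = 0.1 * rng.standard_normal(n_hidden);         b2 = 0.0                 # 1*(4+1) = 5 params

a1 = sigmoid(W1 @ x + b1)     # hidden activations = learned features
h  = sigmoid(W2 @ a1 + b2)    # hypothesis output
# Total parameters with bias: n_hidden*(n_in+1) + (n_hidden+1) = 21 here
```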
Computing OR function
[Diagram: a single unit with constant input 1 (weight Θ0) and inputs x1, x2 (weights Θ1, Θ2), producing output h(Θ, x).]

x1 | x2 | h(Θ, x)
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 1
NAND Function
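As an illustration, a single sigmoid unit with suitably chosen weights computes OR (and, with the signs flipped, NAND); the weights below are one common textbook choice, not the only one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(theta, x1, x2):
    return sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)

theta_or   = np.array([-10.0,  20.0,  20.0])   # fires unless both inputs are 0
theta_nand = np.array([ 30.0, -20.0, -20.0])   # fires unless both inputs are 1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2,
          int(unit(theta_or, x1, x2) > 0.5),     # OR:   0 1 1 1
          int(unit(theta_nand, x1, x2) > 0.5))   # NAND: 1 1 1 0
```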
Gradient descent vs stochastic gradient descent
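A minimal sketch contrasting the two update styles for squared error (the function names are our own):

```python
import numpy as np

# Batch gradient descent: one update uses the gradient over all n examples.
def batch_gd_step(theta, X, y, alpha):
    n = X.shape[0]
    return theta - alpha * (X.T @ (X @ theta - y)) / n

# Stochastic gradient descent: shuffle, then update after each single example.
def sgd_epoch(theta, X, y, alpha, rng):
    for i in rng.permutation(X.shape[0]):
        grad_i = X[i] * (X[i] @ theta - y[i])   # gradient from one example
        theta = theta - alpha * grad_i
    return theta
```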
Recommender Systems
Definition (collaborative filtering)
Recommender systems that make recommendations based “solely” upon the
preferences that other users have expressed for those items
• “x bought y”
Highly Sparse matrix
Challenge: Fill in the missing entries
[Figure: users × items ratings matrix with most entries missing.]
User-user approach
Restrict sum to only k
users “most similar”
• Hypothesis: h(Θ, i, j)
• Prediction problem: if there are no similar users to user i, the prediction for i falls back to i’s average rating
Find the difference between the other user’s rating and that user’s mean
People have similar
ranges in relative rating
Similarity metric
• Pearson correlation
• Cosine similarity
Sum over items where both
have entries
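A hedged sketch of the user-user prediction described above (the function and variable names are our own):

```python
import numpy as np

# Start from user i's mean rating and add a weighted sum of the other users'
# deviations from their own means, normalized by the total |weight|.
def predict(R, i, j, weights):
    # R: user x item matrix with np.nan for missing entries
    # weights[k]: similarity between user i and user k (Pearson, cosine, ...)
    mean_i = np.nanmean(R[i])
    num, den = 0.0, 0.0
    for k in range(R.shape[0]):
        if k == i or np.isnan(R[k, j]):
            continue                               # only users who rated item j
        num += weights[k] * (R[k, j] - np.nanmean(R[k]))
        den += abs(weights[k])
    return mean_i if den == 0 else mean_i + num / den   # no similar users -> i's mean
```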
Quiz
Consider the following user-item rating matrix.
What would be the prediction (user-user
method using Pearson correlation) for the
missing point?
X(4,2) = 4 + 1·(2 − 3.5)/|4 − 5| = 2.5
Here we are using W(i,k) as the absolute distance; we can use Pearson
correlation or cosine distance as well.
Remember