
2.1-Intro-Statistical-Learning-1

9/1/2021
Data Mining
BIF 524 - CSC 498
Data is the sword of the 21st century, those who wield it well,
the Samurai. – Jonathan Rosenberg
Before we start
• Instructor: Joseph Rebehmed
• Contact: joseph.rebehmed@lau.edu.lb
• Office hours: TR, 9:00 – 11:00 AM; W, 5:30 – 7:30 PM
& by appointment (Online)
• Lecture: MWF, 9:00 – 9:50 AM
AKSOB 1003, Online via Collaborate platform
• Grading: (subject to 5% variation)
  • Midterm: 30%
  • Project: 25%
  • Final Exam: 35%
  • Participation: 10%
Textbook
https://www.statlearning.com/
Course Description
This course covers the fundamental techniques and applications
for mining data; topics include concepts from:
• Machine learning
• Statistics
• Techniques and algorithms for parametric and non-parametric
classification, clustering, classifier assessment.
• Supervised vs unsupervised learning.
• Expert systems
• Graphical models
Course Description (2)
This course aims to provide a very applied overview of:
• Modern non-linear methods such as:
  • Generalized Additive Models,
  • Decision Trees,
  • Boosting, Bagging,
  • Support Vector Machines
• More classical linear approaches such as:
  • Logistic Regression,
  • Linear Discriminant Analysis,
  • K-Means Clustering,
  • Nearest Neighbors.
• The course covers many cases and data sets, plus some additional
interesting applications and lab sessions.
Teaching/Learning methods
• Active learning approaches, no more passive learners
• The most important kind of learning comes from doing, not
from standing on the sidelines.
• In parallel to “Lectures”, this course makes extensive use of:
  • In-class group activities
  • Dialogues, discussions and sharing ideas
  • Reading provided materials before class, lecture preparation
• Plenty of applications
Tips for success
• Actively participate in class
• Don’t wait until the last minute to start your assignments or to
study for an exam.
• Please communicate with me if you have any
questions/difficulties/challenges.
Additional Remarks
• Reading the textbook is a must.
• Deadlines must be respected.
• Make-ups and Incompletes: students are not automatically
entitled to make-ups; an F will be given until reasons (in writing and
within one week of absence) are presented and approved.
• Some of the exam questions will be based on class discussion
and assignments.
• No mobile phones in the classroom.
Introduction
Introduction (2)
• Statistical learning refers to a set of tools for modelling and
understanding complex datasets.
• With the explosion of “Big Data” problems, statistical learning
has become a very hot field in many scientific areas (marketing,
finance, CS, biology, etc.)
• People with statistical learning skills are in high demand.
• Many companies are using Machine Learning in different and/or
cool ways
Pinterest – Improved Content Discovery
Twitter – Curated Timelines
IBM – Better Healthcare
Statistical Learning Problems
• Identify the risk factors for prostate cancer.
• Predict whether someone will have a heart attack based
on demographic, diet and clinical measurements.
• Customize an email spam detection system
• Classify a tissue sample into one of several cancer
classes, based on gene expression profile
Notation
• We use n to represent the number of distinct data points, or
observations, in our sample, and p the number of variables.
• $x_{ij}$ represents the value of the jth variable for the ith observation,
where i = 1, 2, ..., n and j = 1, 2, ..., p.
• X denotes an n×p matrix whose (i, j)th element is $x_{ij}$.
• The input variables are typically denoted using the symbol X,
with a subscript to distinguish them. The inputs go by different
names, such as predictors, independent variables, features or
sometimes just variables.
• The output variable is often called the response or dependent
variable and is typically denoted using the symbol Y.
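The notation above can be made concrete with a small sketch. The data values and the helper name `x_ij` below are hypothetical and chosen only for illustration; the sketch assumes the matrix X is stored as a list of rows, one row per observation.

```python
# Hypothetical toy data: n = 4 observations (rows), p = 3 variables (columns).
X = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
]
# One response value Y per observation (also made up).
Y = [0.2, 0.2, 2.3, 1.8]

n = len(X)     # number of observations
p = len(X[0])  # number of variables

def x_ij(X, i, j):
    """Return x_ij using the 1-based indices of the slide notation."""
    return X[i - 1][j - 1]

# Value of the 2nd variable for the 3rd observation.
value = x_ij(X, 3, 2)
```

Python indexes from 0 while the slides index from 1, which is why the helper subtracts 1 from both indices.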
What is Statistical Learning?
• In ML, we have a large set of inputs X and corresponding
outputs Y but not the function f(X).
• We believe that there is a relationship between Y and at least
one of the X’s.
• The goal is to find/model the relationship as:
$Y_i = f(X_i) + \varepsilon_i$
• Where f is some fixed but unknown function and ε is a random
error term, which is independent of X with mean zero.
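One way to get a feel for this model is to simulate from it. The sketch below assumes a made-up "true" function f (a sine curve) and normally distributed errors; neither choice comes from the slides, and in a real problem f would of course be unknown.

```python
import math
import random

def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return math.sin(2 * math.pi * x)

def simulate(n, sd, seed=0):
    """Draw n points from Y = f(X) + eps, with eps ~ Normal(0, sd)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    # The errors are drawn without looking at xs and have mean zero.
    eps = [rng.gauss(0.0, sd) for _ in range(n)]
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps

xs, ys, eps = simulate(n=1000, sd=0.1)
mean_eps = sum(eps) / len(eps)  # sample mean of the errors, close to zero
```

In practice only `xs` and `ys` would be observed; `f` and `eps` are visible here only because the data is simulated.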
Simple Example
The function f that connects the input variable to the output variable is
in general unknown. In this situation one must estimate f based on the
observed points.
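As a minimal sketch of estimating f from observed points, the code below assumes f is a straight line and fits it by least squares. The linear form, the true coefficients (2.0 and 3.0) and the noise level are assumptions made for this illustration, not something stated in the slides.

```python
import random

def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Simulated observations from a hypothetical f(x) = 2 + 3x plus noise.
rng = random.Random(1)
xs = [rng.random() for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

b0_hat, b1_hat = fit_line(xs, ys)

def f_hat(x):
    # The estimate of f built from the observed points.
    return b0_hat + b1_hat * x
```

A straight line is only one possible choice for the form of f; the estimate is only as good as that assumption.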
Different Standard Deviations
The difficulty of estimating f will depend on the standard deviation of
the ε’s.
[Figure: four panels plotting y against x, for sd=0.001, sd=0.005, sd=0.01 and sd=0.03.]
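The effect of the error standard deviation can be checked with a small simulation. The sketch below assumes a hypothetical linear f with slope 0.1 and reuses the four sd values from the panels (0.001, 0.005, 0.01, 0.03). For each sd it repeatedly simulates data, estimates the slope by least squares, and reports the root-mean-square error of that estimate.

```python
import random

def slope_estimate(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def rmse_of_slope(sd, n=50, reps=300, true_slope=0.1, seed=0):
    """RMSE of the slope estimate over repeated simulated samples."""
    rng = random.Random(seed)
    sq_errors = []
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        # Hypothetical f(x) = true_slope * x, plus Normal(0, sd) noise.
        ys = [true_slope * x + rng.gauss(0.0, sd) for x in xs]
        sq_errors.append((slope_estimate(xs, ys) - true_slope) ** 2)
    return (sum(sq_errors) / reps) ** 0.5

sds = [0.001, 0.005, 0.01, 0.03]
rmses = [rmse_of_slope(sd) for sd in sds]
for sd, rmse in zip(sds, rmses):
    print(f"sd={sd}: RMSE of estimated slope = {rmse:.5f}")
```

Under these assumptions the error of the estimate grows with sd: the noisier the observations, the harder it is to recover f.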