9/1/2021

Data Mining
BIF 524 - CSC 498

Data is the sword of the 21st century, those who wield it well, the Samurai. – Jonathan Rosenberg

Before we start
• Instructor: Joseph Rebehmed
• Contact: joseph.rebehmed@lau.edu.lb
• Office hours: TR, 9:00 – 11:00 AM; W, 5:30 – 7:30 PM; and by appointment (Online)
• Lecture: MWF, 9:00 – 9:50 AM, AKSOB 1003; online via the Collaborate platform
• Grading (subject to 5% variation):
    • Midterm: 30%
    • Project: 25%
    • Final Exam: 35%
    • Participation: 10%

Textbook
https://www.statlearning.com/

Course Description
This course covers the fundamental techniques and applications for mining data. Topics include concepts from:
• Machine learning
• Statistics
• Techniques and algorithms for parametric and non-parametric classification, clustering, and classifier assessment
• Supervised vs. unsupervised learning
• Expert systems
• Graphical models

Course Description (2)
This course aims to provide a very applied overview of:
• Modern non-linear methods such as:
    • Generalized Additive Models
    • Decision Trees
    • Boosting, Bagging
    • Support Vector Machines
• More classical linear approaches such as:
    • Logistic Regression
    • Linear Discriminant Analysis
    • K-Means Clustering
    • Nearest Neighbors
• The course covers many cases and data sets, plus some additional interesting applications and lab sessions.

Teaching/Learning methods
• Active learning approaches; no more passive learners.
• The most important kind of learning comes from doing, not from standing on the sidelines.
• In parallel to “Lectures”, this course makes extensive use of:
    • In-class group activities
    • Dialogues, discussions, and sharing of ideas
    • Reading the provided materials before class (lecture preparation)
    • Plenty of applications

Tips for success
• Actively participate in class.
• Don’t wait until the last minute to start your assignments or to study for an exam.
• Please share any questions, difficulties, or challenges with me.

Additional Remarks
• Reading the textbook is a must.
• Deadlines must be respected.
• Make-ups and Incompletes: students are not automatically entitled to make-ups; a grade of F will be given until reasons (submitted in writing and within one week of the absence) are presented and approved.
• Some of the exam questions will be based on class discussion and assignments.
• No mobile phones in the classroom.

Introduction

Introduction (2)
• Statistical learning refers to a set of tools for modelling and understanding complex datasets.
• With the explosion of “Big Data” problems, statistical learning has become a very hot field in many scientific areas (marketing, finance, CS, biology, etc.).
• People with statistical learning skills are in high demand.
• Many companies are using Machine Learning in different and/or cool ways.

Pinterest – Improved Content Discovery

Twitter – Curated Timelines

IBM – Better Healthcare

Statistical Learning Problems
• Identify the risk factors for prostate cancer.
• Predict whether someone will have a heart attack based on demographic, diet, and clinical measurements.
• Customize an email spam detection system.
• Classify a tissue sample into one of several cancer classes, based on its gene expression profile.

Notation
• Use n to represent the number of distinct data points, or observations, in our sample, and p to represent the number of variables.
• $x_{ij}$ represents the value of the jth variable for the ith observation, where i = 1, 2, ..., n and j = 1, 2, ..., p.
• $X$ denotes an $n \times p$ matrix whose $(i, j)$th element is $x_{ij}$.
• The input variables are typically denoted using the symbol X, with a subscript to distinguish them. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables.
• The output variable is often called the response or dependent variable and is typically denoted using the symbol Y.

What is Statistical Learning?
• In ML, we observe a large set of inputs X and corresponding outputs Y, but we do not know the function f that relates them.
• We believe that there is a relationship between Y and at least one of the X’s.
• The goal is to find/model the relationship as:
    $Y_i = f(X_i) + \varepsilon_i$
• Here f is some fixed but unknown function and ε is a random error term, which is independent of X and has mean zero.

Simple Example
The function f that connects the input variable to the output variable is in general unknown. In this situation one must estimate f based on the observed points.

Different Standard Deviations
The difficulty of estimating f will depend on the standard deviation of the ε’s.
[Figure: four scatter plots of y against x, with x ranging from 0.0 to 1.0, one panel per noise level: sd=0.001, sd=0.005, sd=0.01, sd=0.03]
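
The effect of the noise level can be illustrated with a small simulation. The following is a minimal sketch, not the code behind the slide's figure: the function `true_f`, the sample size n = 100, the degree-5 polynomial fit, and the random seed are all assumptions chosen for illustration. It draws data from $Y_i = f(X_i) + \varepsilon_i$ at the four standard deviations shown above and measures how far the estimate of f lands from the (here known) f.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def true_f(x):
    # Hypothetical stand-in for the "unknown" f; chosen only for illustration.
    return 0.05 * np.sin(2 * np.pi * x)


def simulate(n, sd, rng):
    # n observations of a single predictor (p = 1) on [0, 1].
    x = rng.uniform(0.0, 1.0, size=n)
    # Error term: mean zero, independent of x, standard deviation sd.
    eps = rng.normal(0.0, sd, size=n)
    y = true_f(x) + eps
    return x, y


def estimation_error(n, sd, degree=5, seed=0):
    # Estimate f with a least-squares polynomial fit (an assumed estimator),
    # then compare the estimate with true_f on an evenly spaced grid.
    rng = np.random.default_rng(seed)
    x, y = simulate(n, sd, rng)
    coefs = P.polyfit(x, y, degree)
    grid = np.linspace(0.0, 1.0, 200)
    f_hat = P.polyval(grid, coefs)
    return float(np.sqrt(np.mean((f_hat - true_f(grid)) ** 2)))


# Noise levels taken from the four panels of the figure.
errors = {sd: estimation_error(n=100, sd=sd) for sd in (0.001, 0.005, 0.01, 0.03)}
for sd, err in errors.items():
    print(f"sd={sd}: RMSE between estimated f and true f = {err:.5f}")
```

In a real problem f is not available, so this kind of comparison is only possible in a simulation; the point is that the same estimator applied to the same inputs recovers f less accurately when the ε's have a larger standard deviation.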