Update date: 26/01/20
Course name and number: Practical Topics in Machine Learning, 89-6541-01
Course type: Lecture
Hours: 2
Semester: B
Academic year: 5780 (2019-2020)
Course website:

We explore use cases in which machine-learning algorithms handle real-life problems involving large amounts of data in different formats, with a special focus on medical problems. After a brief introduction to machine learning, we will get familiar with some of the technologies most commonly used in practice. We will get to know the relevant libraries and platforms that provide tools for processing and cleaning data of different types, including unstructured data such as images and text. Our main coding language is Python. Among the libraries we will work with during the course are numpy, pandas, sklearn, XGBoost, LGBM, pytorch, and more.

Every class is divided into a preliminary part, in which we present the relevant theory of the technology in focus, followed by a practice session that includes a presentation of code examples. During the course we will show a number of case studies, in which we present a few recently published works and projects and study their machine-learning problem, their data, and their proposed technology.

We will cover various topics in machine learning, including decision trees, random forest, neural nets, boosting, and more. However, our focus is on the practical side. A background in machine learning is therefore required, and the introductory machine-learning course is a prerequisite for taking this course.
Class breakdown:

Class 1
Topic: Intro
Description: Background in machine learning: a quick reminder of linear/logistic regression and evaluation philosophy
Teaching material:
  Slides 1
  Slides 2
  Notebook - numpy, pandas
  Notebook - sklearn, bike sharing data
  Notebook - linear regression
  Notebook - logistic regression, ROC, AUC

Classes 2-3
Topic: Feature handling, data exploration; case study
Description: Feature cross and non-linear regression, handling different feature types (numeric, ordinal, categorical, string), exploring a dataset (types of visualization: contingency tables, normal/scatter plots, box plots), data imputation
Teaching material:
  Slides 1
  Slides 2 (TBD: data exploration, imputation)
  Notebook - one hot encoding
  Notebook - feature cross
  Notebook - feature cross on bike sharing
Case study (practical notebook): Notebook (TBD) using the following dataset: https://www.kaggle.com/osmi/mentalhealth-in-tech-survey
Topics that will be covered: encoding different types of features, data exploration (contingency tables, different distributions of features, and getting intuition about what to look at in the data), data cleansing (imputation)

Class 4
Topic: Overfitting and regularization; 2 case studies
Description: Variance/bias, feature selection, L1/L2 regularization
Assignment: Ex 1 (out)
Teaching material:
  Slides 1
  Notebook - regularization
Case study (paper): Development and validation of a predictive model for detection of colorectal cancer in primary care by analysis of complete blood counts: a binational retrospective study
Case study (practical notebook): TBD - a notebook about tuning the regularization parameter (inspired by https://www.kaggle.com/kashnitsky/topic4-linear-models-part-3-regularization)

Class 5
Topic: Multiclass classification, intro to neural networks (reminder)
Description: Basic image representation, convolution, max entropy (softmax) classifier, intro to feed forward networks, conv nets, introduction to Pytorch, with examples
Assignment: Ex 1 (in), Ex 2 (out)
Teaching material:
  Slides 1
  Slides 2
  Slides 3 - intro to feed forward and conv nets (TBD)
  Slides 4 - GPU vs. CPU
  Notebook - image classification
  Notebook - pytorch tutorial
  Notebook - simple FF network
  Notebook - CIFAR10 with conv net

Class 6
Topic: Case studies
Description: Presentation of two (or more) works that use deep learning to predict medical conditions:
  1. International evaluation of an AI system for breast cancer screening, Nature 2019
  2. A clinically applicable approach to continuous prediction of future acute kidney injury, Nature 2018

Classes 7-8
Topic: Predicting with trees; case study
Description: Trees, random forest, bagging, boosting (AdaBoost, Gradient boosting, XGBoost). Subtopics: optimization with grid search, input normalization
Assignment: Ex 2 (in), Ex 3 (out)
Teaching material:
  Slides 1
  Notebook - housing data with trees
Case study (paper): Trees vs Neurons: Comparison between random forest and ANN for high-resolution prediction of building energy consumption

Classes 9-10
Topic: Time series; case study
Description: Definition, univariate/multivariate, stochastic process, stationarity, seasonality, moving average, exponential smoothing, time-series clustering techniques (e.g. topics: hierarchical clustering, DTW, Ward, self-organizing maps (SOM))
Assignment: Ex 3 (in), Ex 4 (out)
Teaching material:
  Slides 1 (TBD)
  Notebook (TBD, using the dataset: https://www.kaggle.com/c/dsghackathon/data )
Case study (paper): The emotional arcs of stories are dominated by six basic shapes (EPJ Data Science, 2016)

Class 11
Topic: Text analysis
Description: Text representation, embeddings, tagging and classification with RNN
Teaching material:
  Slides 1 - tf idf
  Slides 2 - LSA, embedding
  Notebook - tweet classification with trees
  Slides 3 - RNN (TBD)

Class 12
Topic: Common deep learning architectures and their applications
Description: Encoder-decoder: image captioning. Seq2seq: translation. Attention models and transformer
Assignment: Ex 4 (in)
Teaching material: TBD

Class 13
Topic: Case study and project proposals
Description: Case study (paper): On the Automatic Generation of Medical Imaging Reports, ACL 2018. Project proposals and discussion
Assignment: Project ideas (out)

Required prerequisites: 89511 - Introduction to machine learning

Grade structure:
  4 home assignments - 40%
  Final project - 60%