B. Ramamurthy

[Diagram labels: Data; EDA; Intuition/understanding; Big-data analytics; Stats; Algs; Discoveries/intelligence; Statistical inference; Decisions/answers/results]

Three types of algorithms:
1. Data preparation algorithms such as sorting and workflows (pipelines to prepare data)
2. Optimization algorithms: stochastic gradient descent, least squares…
3. Machine learning algorithms…

Machine learning algorithms come from artificial intelligence.
There is no underlying generative process.
They are built to predict or classify something.
Three basic algorithms: linear regression, k-NN, k-means.
We already looked at linear regression as a case study for R/RStudio.
We will start with k-means…

K-means is unsupervised: there is no prior knowledge of the "right answer".
The goal of the algorithm is to determine the definition of the right answer by finding clusters in the data.
Example data: satisfaction survey data, incident report data.
Assume data {age, gender, income, state, household, size}; your goal is to segment the users.
K-means is the simplest of the clustering algorithms.

Let's understand k-means using an example.
Attributes: {age, income range, education, skills, social, paid work}
Let's take just the age: {23, 25, 24, 23, 21, 31, 32, 30, 31, 30, 37, 35, 38, 37, 39, 42, 43, 45, 43, 45}
Classify this data using k-means.
Let's assume K = 3, that is, 3 groups.
Give me a guess of the centroids.
Let's assume the initial centroids are {21, 30, 40}.
First let's calculate by hand and then use RStudio.

Supervised ML
You know the "right answers", or at least have data that is "labeled": the training set.
One set of objects has already been classified or labeled (training set).
Another set of objects is yet to be labeled or classified (test set).
Your goal is to automate the process of labeling the test set.
The intuition behind k-NN is to consider the most similar items, with similarity defined by their attributes, look at their existing labels, and assign the new object a label accordingly.
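Returning to the k-means age example (K = 3, initial centroids {21, 30, 40}): the hand calculation can be mirrored in a few lines of code. The sketch below is in Python rather than R, and the function name kmeans_1d, the max_iter cap, and the tie rule (a point equidistant from two centroids goes to the lower-indexed one) are assumptions made for illustration, not something the slides specify.

```python
# Minimal 1-D k-means sketch for the age data on the slide.
# Assumptions (not from the slides): absolute difference as the distance,
# ties go to the lower-indexed centroid, empty clusters keep their centroid.

ages = [23, 25, 24, 23, 21, 31, 32, 30, 31, 30,
        37, 35, 38, 37, 39, 42, 43, 45, 43, 45]


def kmeans_1d(values, centroids, max_iter=100):
    """Alternate assignment and centroid-update steps until centroids stop moving."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


final_centroids, final_clusters = kmeans_1d(ages, [21, 30, 40])
print(final_centroids)
for centroid, members in zip(final_centroids, final_clusters):
    print(centroid, sorted(members))
```

Under these assumptions the run settles after a single centroid update, at centroids 23.2, 31.5 and 41.0. Age 35 starts out equidistant from 30 and 40; sending it to either side leads to the same final clusters, which is worth checking by hand.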
Age   Loan (x1000)   Default
25    40             N
35    60             N
45    80             N
20    20             N
35    120            N
52    18             Y
23    95             Y
40    62             Y
60    100            Y
48    220            Y
33    150            Y

K = 3: decide whether you can lend money to a person aged 48 requesting a loan of 142K.
K = 5: repeat the same.
We need a lot more data to apply k-NN in practice.
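The loan exercise can be checked with a short k-NN sketch. The slides do not say which distance measure to use, so the code below assumes plain Euclidean distance on the raw (age, loan in thousands) values with a majority vote among the K nearest rows; the function name knn_predict is hypothetical.

```python
import math
from collections import Counter

# Training set from the table: (age, loan in thousands, default label).
training = [
    (25, 40, "N"), (35, 60, "N"), (45, 80, "N"), (20, 20, "N"),
    (35, 120, "N"), (52, 18, "Y"), (23, 95, "Y"), (40, 62, "Y"),
    (60, 100, "Y"), (48, 220, "Y"), (33, 150, "Y"),
]


def knn_predict(query, data, k):
    """Label a query by majority vote among its k nearest training rows.

    Assumption: unscaled Euclidean distance over (age, loan).
    """
    age_q, loan_q = query
    by_distance = sorted(
        data,
        key=lambda row: math.hypot(row[0] - age_q, row[1] - loan_q),
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


applicant = (48, 142)  # age 48, requesting 142K
for k in (3, 5):
    print("K =", k, "predicted default:", knn_predict(applicant, training, k))
```

With these assumptions the nearest rows are (33, 150, Y), (35, 120, N) and (60, 100, Y), so K = 3 votes Y; adding (23, 95, Y) and (45, 80, N) for K = 5 still gives Y. Because the loan column spans a much wider range than age, it dominates the unscaled distance; rescaling the features first is a common alternative and may change which neighbors are nearest.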