MACHINE LEARNING (BE - COMPUTER), 2019 Edition
Handcrafted by BackkBenchers Publications

CHAP - 1: INTRODUCTION TO MACHINE LEARNING

Q1. What is Machine Learning? Explain how supervised learning is different from unsupervised learning.
Q2. Define Machine Learning? Briefly explain the types of learning.
Ans: [5M | May17 & May18]

MACHINE LEARNING:
1. Machine learning is an application of Artificial Intelligence (AI).
2. It gives systems the ability to automatically learn and improve from experience without being explicitly programmed.
3. Machine learning teaches a computer to do what comes naturally to humans: learn from experience.
4. The primary goal of machine learning is to allow systems to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.
5. Real life examples: Google Search Engine, Amazon recommendations.
6. Machine learning is helpful in:
a. Improving business decisions.
b. Increasing productivity.
c. Detecting disease.
d. Forecasting weather.

TYPES OF LEARNING:
I) Supervised Learning:
1. Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
2. Basically, supervised learning is learning in which we teach or train the machine using data which is well labelled.
3. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data and produces a correct outcome from the labelled data.
4. Supervised learning is classified into two categories of algorithms:
a. Classification.
b. Regression.

II) Unsupervised Learning:
1. Unlike supervised learning, no teacher is provided; that means no training will be given to the machine.
2. Unsupervised learning is the training of a machine using information that is neither classified nor labelled.
3. It allows the algorithm to act on that information without guidance.
4. Unsupervised learning is classified into two categories of algorithms:
a. Clustering.
b. Association.

DIFFERENCE BETWEEN SUPERVISED AND UNSUPERVISED LEARNING:
Supervised learning                    | Unsupervised learning
Uses known and labelled data.          | Uses unknown and unlabelled data.
Less complex to develop.               | Very complex to develop.
Uses off-line analysis.                | Uses real-time analysis.
Number of classes is known.            | Number of classes is unknown.
Gives accurate and reliable results.   | Gives moderately accurate and reliable results.

Q3. Machine Learning applications.
Q4. Applications of Machine Learning algorithms.
Ans: [10M | May16, Dec16 & May17]

APPLICATIONS:
I) Virtual Personal Assistants:
1. Siri, Alexa and Google Now are some of the popular examples of virtual personal assistants.
2. As the name suggests, they assist in finding information when asked over voice.
3. All you need to do is activate them and ask, "What is my schedule for today?", "What are the flights from Germany to London?", or similar questions.

II) Image Recognition:
1. It is one of the most common machine learning applications.
2. There are many situations where you can classify an object as a digital image.
3. For digital images, the measurements describe the outputs of each pixel in the image.
4. In the case of a black and white image, the intensity of each pixel serves as one measurement.

III) Speech Recognition:
1. Speech recognition (SR) is the translation of spoken words into text.
2. In speech recognition, a software application recognizes spoken words.
3. The measurements in this machine learning application might be a set of numbers that represent the speech signal.
4. We can segment the signal into portions that contain distinct words or phonemes.
5. In each segment, we can represent the speech signal by the intensities or energy in different time-frequency bands.

IV) Medical Diagnosis:
1. ML provides methods, techniques and tools that can help in solving diagnostic and prognostic problems in a variety of medical domains.
2. It is being used for the analysis of the importance of clinical parameters and of their combinations for prognosis.
3. E.g. prediction of disease progression, and extraction of medical knowledge for outcomes research.

V) Search Engine Result Refining:
1. Google and other search engines use machine learning to improve the search results for you.
2. Every time you execute a search, the algorithms at the backend keep a watch on how you respond to the results.

VI) Statistical Arbitrage:
1. In finance, statistical arbitrage refers to automated short-term trading strategies that involve a large number of securities.
2. In such strategies, the user tries to implement a trading algorithm for a set of securities on the basis of quantities.
3. These measurements can be cast as a classification or estimation problem.

VII) Learning Associations:
1. Learning association is the process of developing insights into various associations between products.
2. A good example is how seemingly unrelated products may reveal an association to one another when analysed in relation to the buying behaviour of customers.

VIII) Classification:
1. Classification is a process of placing each individual from the population under study into one of many classes.
2. Classification helps analysts to use measurements of an object to identify the category to which that object belongs.
3. To establish an efficient rule, analysts use data identified as independent variables.

IX) Prediction:
1. Consider the example of a bank computing the probability of any of its loan applicants defaulting on the loan repayment.
2. To compute the probability of the default, the system will first need to classify the available data into certain groups.
3. This classification is described by a set of rules prescribed by the analysts.
4. Once the classification is done, as per need we can compute the probability.

X) Extraction:
1. Information Extraction (IE) is another application of machine learning.
2. It is the process of extracting structured information from unstructured data, for example web pages, articles, blogs, business reports and e-mails.
3. The process of extraction takes a set of documents as input and produces structured data.
Q5. Explain the steps required for selecting the right machine learning algorithm.
Ans: [10M | May16 & Dec17]

STEPS:
I) Understand your data:
1. The type and kind of data we have plays a key role in deciding which algorithm to use.
2. Some algorithms can work with smaller sample sets while others require tons and tons of samples.
3. Certain algorithms work with certain types of data.
4. E.g. Naive Bayes works well with categorical input but is not at all sensitive to missing data.
5. It includes:
a. Know your data: Look at summary statistics and visualizations. Percentiles can help identify the range for most of the data; averages and medians can describe central tendency.
b. Visualize the data: Box plots can identify outliers; density plots and histograms show the spread of data; scatter plots can describe bivariate relationships.
c. Clean your data: Deal with missing values. Missing data affects some models more than others, and missing data for certain variables can result in poor predictions.
d. Augment your data: Feature engineering is the process of going from raw data to data that is ready for modelling. It can serve multiple purposes; different models may have different feature engineering requirements, and some have built-in feature engineering.

II) Categorize the problem: This is a two-step process.
1. Categorize by input:
a. If you have labelled data, it is a supervised learning problem.
b. If you have unlabelled data and want to find structure, it is an unsupervised learning problem.
c. If you want to optimize an objective function by interacting with an environment, it is a reinforcement learning problem.
2. Categorize by output:
a. If the output of your model is a number, it is a regression problem.
b. If the output of your model is a class, it is a classification problem.
c. If the output of your model is a set of input groups, it is a clustering problem.
d. Do you want to detect an anomaly? That is anomaly detection.

III) Understand your constraints:
1. What is your data storage capacity? Depending on the storage capacity of your system, you might not be able to store gigabytes of classification/regression models or gigabytes of data to cluster.
2. Does the prediction have to be fast? In real-time applications, for example autonomous driving, it is important that the classification of road signs be as fast as possible to avoid accidents.
3. Does the learning have to be fast? In some circumstances training models quickly is necessary; sometimes you need to rapidly update your model with a different dataset.

IV) Find the available algorithms: A few of the factors affecting the choice of a model are:
1. Whether the model meets the business goals.
2. How much pre-processing the model needs.
3. How accurate the model is.
4. How explainable the model is.
5. How fast the model is.
6. How scalable the model is.

V) Try each algorithm, assess and compare (see the sketch below).
VI) Adjust and combine, using optimization techniques.
VII) Choose, operate and continuously measure.
VIII) Repeat.
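As an illustration of step V, the following sketch (not part of the original text) compares two candidate classifiers by cross-validation. It assumes scikit-learn is installed and uses the bundled Iris data purely as a stand-in dataset.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)        # labelled data -> supervised classification

    candidates = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "decision tree": DecisionTreeClassifier(max_depth=3),
    }

    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5)    # try each algorithm
        print(name, "mean accuracy:", scores.mean())   # assess and compare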
Q6. What are the steps in designing a machine learning problem? Explain with the checkers problem.
Q7. Explain the steps in developing a machine learning application.
Q8. Explain the procedure to design a machine learning system.
Ans: [10M | May17, May18 & Dec18]

STEPS FOR DEVELOPING ML APPLICATIONS:
I) Gathering data:
1. This step is very important because the quality of data that you gather will directly determine how good your predictive model will be.
2. We have to collect data from different sources for our ML application's training purpose.
3. This includes collecting samples by scraping a website and extracting data from an RSS feed or an API.

II) Preparing the data:
1. Data preparation is where we load our data into a suitable place and prepare it for use in our machine learning training.
2. The benefit of having a standard data format is that you can mix and match algorithms and data sources.

III) Choosing a model:
1. There are many models that data scientists and researchers have created over the years.
2. Some of them are well suited for image data, others for sequence data, some for numerical data and others for text-based data.
3. The task may also involve recognizing patterns, identifying outliers and detecting novelties.

IV) Training:
1. In this step we will use our data to incrementally improve our model's ability to predict.
2. Depending on the algorithm, we feed the algorithm good clean data from the previous steps.
3. The knowledge extracted is stored in a format that is readily usable by a machine.

V) Evaluation:
1. Once the training is complete, it is time to check if the model is good, using evaluation.
2. This is where the testing dataset comes into play.
3. Evaluation allows us to test our model against data that has never been used for training.

VI) Parameter tuning:
1. Once we are done with evaluation, we want to see if we can further improve our training in any way.
2. We can do this by tuning our parameters.

VII) Prediction:
1. It is the step where we get to answer some questions.
2. It is the point where the value of machine learning is realized.

CHECKERS LEARNING PROBLEM:
1. A computer program that learns to play checkers might improve its performance, as measured by its ability to win at the class of tasks involving playing checkers games, through experience obtained by playing games against itself.
2. Choosing a training experience:
a. The type of training experience 'E' available to a system can have a significant impact on the success or failure of the learning system.
b. One key attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
c. A second attribute is the degree to which the learner controls the sequence of training examples.
d. Another attribute is how well the training experience represents the distribution of examples over which the final system performance will be measured.
3. Assumptions:
a. Let us assume that our system will train by playing games against itself.
b. It is allowed to generate as much training data as time permits.
4. Issues related to experience:
a. What type of knowledge/experience should one learn?
b. How to represent the experience?
c. What should be the learning mechanism?
5. Target Function:
a. ChooseMove: B -> M.
b. ChooseMove is a function where the input B is the set of legal board states.
c. It produces M = ChooseMove(B), where M is the set of legal moves.
6. Representation of the Target Function:
a. x1: the number of white pieces on the board.
b. x2: the number of red pieces on the board.
c. x3: the number of white kings on the board.
d. x4: the number of red kings on the board.
e. x5: the number of white pieces threatened by red (i.e. which can be captured on red's next turn).
f. x6: the number of red pieces threatened by white.
F'(b) = w0 + w1*x1 + w2*x2 + w3*x3 + w4*x4 + w5*x5 + w6*x6
7. The problem of learning a checkers strategy thus reduces to the problem of learning values for the coefficients w0 through w6 in the target function representation.

Q9. What are the key tasks of Machine Learning?
Ans: [5M | May16 & Dec17]

CLASSIFICATION:
1. If we have data, say pictures of animals, we can classify them.
2. This animal is a cat, that animal is a dog, and so on.
3. A computer can do the same task using a machine learning algorithm that is designed for the classification task.
4. In the real world, this is used for tasks like voice classification and object detection.
5. This is a supervised learning task; we give training data to teach the algorithm the classes the examples belong to.

REGRESSION:
1. Sometimes you want to predict values: What are the sales next month? What is the salary for a job? These types of problems are regression problems.
2. The aim is to predict the value of a continuous response variable.
3. This is also a supervised learning task.

CLUSTERING:
1. Clustering is to create groups of data called clusters.
2. Observations are assigned to a group based on the algorithm.
3. This is an unsupervised learning task; clustering happens fully automatically.
4. Imagine having a bunch of documents on your computer; the computer will organize them in clusters based on their content automatically (a small sketch of these tasks follows this answer).

FEATURE SELECTION:
1. This task is important because it decides which attributes are fed to the model.
2. It not only helps in building models of higher accuracy but also helps in achieving the objective with simpler models.
3. It also helps in reducing the number of features and hence the computation.
4. Techniques used for feature selection include filter and wrapper methods.

TESTING AND MATCHING:
1. This task relates to comparing data sets and matching the patterns present in them.
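The following sketch (added here for illustration, not from the original text) shows the classification and clustering tasks listed above on a tiny toy dataset; it assumes scikit-learn is available, and the points and parameters are arbitrary.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    # Classification (supervised): labelled points, learn the classes.
    X = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])
    y = np.array(["cat", "cat", "dog", "dog"])
    clf = DecisionTreeClassifier().fit(X, y)
    print(clf.predict([[1.1, 1.0]]))       # -> ['cat']

    # Clustering (unsupervised): same points, no labels, groups found automatically.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)                       # e.g. [0 0 1 1]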
Q10. What are the issues in Machine Learning?
Ans: [10M | Dec16 & May17]

ISSUES:
1. What algorithms exist for learning general target functions from specific training examples? In what settings will particular algorithms converge to the desired function, given sufficient training data?
2. Which algorithms perform best for which types of problems and representations?
3. How much training data is sufficient? What general bounds relate the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?
4. When and how can prior knowledge held by the learner guide the process of generalizing from examples? Can prior knowledge be helpful even when it is only approximately correct?
5. What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
6. What is the best way to reduce the learning task to one or more function approximation problems?
7. How can the learner automatically alter its representation to improve its ability to represent and learn the target function?

Q11. Define well-posed learning problem. Hence, define robot driving learning problem.
Ans: [5M | Dec17]

WELL-POSED LEARNING PROBLEMS:
1. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
2. It identifies the following three features:
a. The class of tasks.
b. The measure of performance to be improved.
c. The source of experience.
3. Examples:
a. Learning to classify chemical compounds.
b. Learning to drive an autonomous vehicle.
c. Learning to play bridge.
d. Learning to parse natural language sentences.

ROBOT DRIVING PROBLEM:
1. Task (T): Driving on a public, 4-lane highway using vision sensors.
2. Performance measure (P): Average distance travelled before an error (as judged by a human overseer).
3. Training experience (E): A sequence of images and steering commands recorded while observing a human driver.

CHAP - 2: LEARNING WITH REGRESSION

Q1. Logistic Regression.
Ans: [10M | May16, May17 & Dec17]

LOGISTIC REGRESSION:
1. Logistic regression is one of the basic and popular algorithms used to solve classification problems.
2. It is the go-to method for binary classification problems (problems with two class values).
3. Linear regression algorithms are used to predict/forecast values, but logistic regression is used for classification tasks.
4. The term "logistic" is taken from the logit function that is used in this method of classification.
5. The logistic function, also called the sigmoid function, was developed to describe properties of population growth in ecology, rising quickly and levelling out at the carrying capacity of the environment.
6. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
7. Figure 2.1 shows an example of the logistic function. [Figure 2.1: Logistic function.]

SIGMOID FUNCTION (LOGISTIC FUNCTION):
1. The logistic regression algorithm also uses a linear equation with independent predictors to predict a value.
2. The predicted value can be anywhere between negative infinity and positive infinity.
3. We need the output of the algorithm to be a class variable, i.e. 0 (no) or 1 (yes).
4. Therefore, we squash the output of the linear equation into the range (0, 1); to squash the predicted value between 0 and 1, we use the sigmoid function.
Linear equation: y = b0 + b1*x1 + b2*x2 + ...
Sigmoid function: g(y) = 1 / (1 + e^(-y)), giving the squashed output h = g(y).

COST FUNCTION:
1. Since we are trying to predict class values, we cannot use the same cost function used in the linear regression algorithm.
2. Therefore, we use a logarithmic loss function to calculate the cost of misclassifying:
Cost(h(x), y) = -log(h(x))        if y = 1
Cost(h(x), y) = -log(1 - h(x))    if y = 0

CALCULATING GRADIENTS:
1. We take partial derivatives of the cost function with respect to each parameter (theta_0, theta_1, ...) to obtain the gradients.
2. With the help of these gradients, we can update the values of theta_0, theta_1, etc.
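A minimal NumPy sketch of the sigmoid, log loss and gradient update described above (added for illustration and not part of the original text; the learning rate, iteration count and toy data are assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy data: one feature plus an intercept column, binary labels.
    X = np.array([[1, 0.5], [1, 1.5], [1, 3.0], [1, 4.5]])   # column of 1s -> theta_0
    y = np.array([0, 0, 1, 1])

    theta = np.zeros(X.shape[1])
    lr = 0.1
    for _ in range(5000):
        h = sigmoid(X @ theta)                                # squashed output in (0, 1)
        cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))   # log loss
        grad = X.T @ (h - y) / len(y)                         # partial derivatives w.r.t. theta
        theta -= lr * grad                                    # gradient update

    print(theta, cost)
    print(sigmoid(X @ theta) >= 0.5)                          # predicted classes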
ASSUMPTIONS OF LOGISTIC REGRESSION:
1. Logistic regression does assume a linear relationship between the logit of the explanatory variables and the response.
2. Independent variables can even be power terms or some other nonlinear transformations of the original independent variables.
3. The dependent variable does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal); binary logistic regression assumes a binomial distribution of the response.
4. The homogeneity of variance does NOT need to be satisfied.
5. Errors need to be independent but NOT normally distributed.

Q2. Explain in brief Linear Regression Technique.
Q3. Explain the concepts behind Linear Regression.
Ans: [5M | May16 & Dec17]

LINEAR REGRESSION:
1. Linear regression is a machine learning algorithm based on supervised learning.
2. It is a simple machine learning model for regression problems, i.e. when the target variable is a real value.
3. It is used to predict a quantitative response y from the predictor variable x.
4. It is made with the assumption that there is a linear relationship between x and y.
5. This method is mostly used for forecasting and finding out cause-and-effect relationships between variables.
6. Figure 2.2 shows an example of linear regression. [Figure 2.2: Linear regression.]
7. The red line in the graph is referred to as the best-fit straight line.
8. Based on the given data points, we try to plot a line that models the points the best.
9. For example, in a simple regression problem (a single x and a single y), the form of the model would be:
y = a0 + a1*x   (linear equation)
10. The motive of the linear regression algorithm is to find the best values for a0 and a1.

COST FUNCTION:
1. The cost function helps us figure out the best possible values for a0 and a1 which would provide the best-fit line for the data points.
2. We convert this search problem into a minimization problem where we would like to minimize the error between the predicted value and the actual value.
3. The minimization and cost function are given below:
minimize J, where J = (1/n) * SUM_i (pred_i - y_i)^2
J is the mean squared error between the predicted y value and the true y value (its square root is the RMSE).

GRADIENT DESCENT:
1. To update a0 and a1 in order to reduce the cost function and achieve the best-fit line, the model uses gradient descent.
2. The idea is to start with random a0 and a1 values and then iteratively update the values, reaching minimum cost (a small sketch follows).
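The following sketch (added for illustration, not from the original) fits a0 and a1 by gradient descent on the mean-squared-error cost; the learning rate and the toy data are arbitrary assumptions.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # roughly y = 2x

    a0, a1 = 0.0, 0.0                            # start from arbitrary values
    lr = 0.01
    for _ in range(10000):
        pred = a0 + a1 * x
        error = pred - y
        cost = np.mean(error ** 2)               # J = (1/n) * sum((pred - y)^2)
        a0 -= lr * 2 * error.mean()              # dJ/da0
        a1 -= lr * 2 * (error * x).mean()        # dJ/da1

    print(round(a0, 3), round(a1, 3), round(cost, 5))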
Q4. Explain Regression line, Scatter plot, Error in Prediction and Best fitting line.
Ans: [5M | Dec16]

REGRESSION LINE:
1. The regression line is the line that best fits the data, such that the overall distance from the line to the points (variable values) plotted on a graph is the smallest.
2. There are as many regression lines as variables.
3. Suppose we take two variables, say X and Y; then there will be two regression lines:
4. Regression line of Y on X: this gives the most probable values of Y for given values of X: Ye = a + bX.
5. Regression line of X on Y: this gives the most probable values of X for given values of Y: Xe = a + bY.

SCATTER DIAGRAM:
1. If data is given in pairs, then the scatter diagram of the data is just the points plotted on the xy-plane, with the first and second entries of each pair as coordinates.
2. The scatter plot is used to visually identify relationships between paired data.
3. Example: a scatter plot of the age of a plant against its size. [Figure: scatter plot of age vs. size of a plant.]
4. It is clear from such a scatter plot that as the plant ages, its size tends to increase.

ERROR IN PREDICTION:
1. The standard error of the estimate is a measure of the accuracy of predictions.
2. It is defined as:
sigma_est = sqrt( SUM (Y - Y')^2 / N )
where sigma_est is the standard error of the estimate, Y is an actual score, Y' is a predicted score, and N is the number of pairs of scores.

BEST FITTED LINE:
1. A line of best fit is a straight line that best represents the data on a scatter plot.
2. This line may pass through some of the points, none of the points, or all of the points.
3. Figure 2.3 shows an example of a best fitted line. [Figure 2.3: Best fitted line.]
4. The red line in the graph is referred to as the best-fit straight line.

Q5. The following table shows the midterm and final exam grades obtained for students in a database course. Use the method of least squares using regression to predict the final exam grade of a student who received 86 on the midterm exam.

Midterm exam (x): 72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81
Final exam (y):   84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90

Ans: [10M | May17]

Finding x*y and x^2 using the given data:
x    y    x*y    x^2
72   84   6048   5184
50   63   3150   2500
81   77   6237   6561
74   78   5772   5476
94   90   8460   8836
86   75   6450   7396
59   49   2891   3481
83   79   6557   6889
65   77   5005   4225
33   52   1716   1089
88   74   6512   7744
81   90   7290   6561

Here n = 12 (total number of values in either x or y).
Now we have to find SUM(x), SUM(y), SUM(x*y) and SUM(x^2), i.e. the sums of the x values, y values, x*y values and x^2 values:
SUM(x) = 866, SUM(y) = 888, SUM(x*y) = 66088, SUM(x^2) = 65942

Now we have to find a and b:
a = (n*SUM(xy) - SUM(x)*SUM(y)) / (n*SUM(x^2) - (SUM(x))^2)
Putting values in the above equation:
a = ((12 * 66088) - (866 * 888)) / ((12 * 65942) - (866)^2)
a = 0.59
b = (SUM(y) - a*SUM(x)) / n = (888 - (0.59 * 866)) / 12
b = 31.42
Estimating the final exam grade of a student who received 86 marks on the midterm:
y = a*x + b, with a = 0.59, b = 31.42 and x = 86.
y = (0.59 * 86) + 31.42 = 82.16 marks
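A short check of the computation above, added here as an illustration (the original works only by hand). The script recomputes a, b and the predicted grade directly from the x and y columns using the same least-squares formulas.

    x = [72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81]
    y = [84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90]
    n = len(x)

    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
    b = (sum_y - a * sum_x) / n                                     # intercept

    print(round(a, 2), round(b, 2))   # about 0.58 and 32.0 (the worked answer rounds a to 0.59 first)
    print(round(a * 86 + b, 2))       # about 82.0, matching 82.16 up to rounding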
Q6. The values of the independent variable x and dependent variable y are given below. Find the least squares regression line y = ax + b. Estimate the value of y when x is 10.
x: 0, 1, 2, 3, 4
y: 2, 3, 5, 4, 6
Ans: [10M | May18]

Finding x*y and x^2 using the given data:
x   y   x*y   x^2
0   2   0     0
1   3   3     1
2   5   10    4
3   4   12    9
4   6   24    16

Here n = 5 (total number of values in either x or y).
SUM(x) = 10, SUM(y) = 20, SUM(x*y) = 49, SUM(x^2) = 30

Now we have to find a and b:
a = (n*SUM(xy) - SUM(x)*SUM(y)) / (n*SUM(x^2) - (SUM(x))^2) = ((5 * 49) - (10 * 20)) / ((5 * 30) - (10)^2)
a = 0.9
b = (SUM(y) - a*SUM(x)) / n = (20 - (0.9 * 10)) / 5
b = 2.2
Estimating the value of y when x = 10:
y = a*x + b = (0.9 * 10) + 2.2 = 11.2

Q7. What is linear regression? Find the best fitted line for the following example:

Sr. No   x (height)   y (weight)   y' (predicted)
1        63           127          120.1
2        64           121          126.3
3        66           142          138.5
4        69           157          157.0
5        69           162          157.0
6        71           156          169.2
7        71           169          169.2
8        72           165          175.4
9        73           181          181.5
10       75           208          193.8

Ans: [10M | Dec18]

LINEAR REGRESSION: Refer Q2.

SUM:
Finding x*y and x^2 using the given data (n = 10):
x    y     x*y     x^2
63   127   8001    3969
64   121   7744    4096
66   142   9372    4356
69   157   10833   4761
69   162   11178   4761
71   156   11076   5041
71   169   11999   5041
72   165   11880   5184
73   181   13213   5329
75   208   15600   5625

SUM(x) = 693, SUM(y) = 1588, SUM(x*y) = 110896, SUM(x^2) = 48163

Now we have to find a and b:
a = (n*SUM(xy) - SUM(x)*SUM(y)) / (n*SUM(x^2) - (SUM(x))^2)
Putting values in the above equation:
a = ((10 * 110896) - (693 * 1588)) / ((10 * 48163) - (693)^2)
a = 6.14
b = (SUM(y) - a*SUM(x)) / n = (1588 - (6.14 * 693)) / 10
b = -266.71
Finding the best fitting line: y = a*x + b, with a = 6.14 and b = -266.71.
Therefore, the best fitting line is y = 6.14x - 266.71.
[Graph: the best fitting line plotted over the height-weight scatter, with height on the x-axis (about 62 to 76) and weight on the y-axis (about 120 to 210).]

CHAP - 3: LEARNING WITH TREES

Q1. What is a decision tree? How will you choose the best attribute for a decision tree classifier? Give a suitable example.
Ans: [10M | Dec18]

DECISION TREE:
1. Decision tree is the most powerful and popular tool for classification and prediction.
2. A decision tree is a flowchart-like tree structure.
3. Each internal node denotes a test on an attribute.
4. Each branch represents an outcome of the test.
5. Each leaf node (terminal node) holds a class label.
6. The best attribute is the attribute that "best" classifies the available training examples.
7. There are two terms one needs to be familiar with in order to define the "best": entropy and information gain.
8. Entropy is a number that represents how heterogeneous a set of examples is based on their target class.
9. Information gain, on the other hand, shows how much the entropy of a set of examples will decrease if a specific attribute is chosen.
10. Criteria for selecting the "best" attribute:
a. We want to get the smallest tree.
b. Choose the attribute that produces the purest nodes.
(A small sketch of the entropy and information-gain computation follows.)
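The following helper functions are added as an illustration (the original answer defines these quantities only in words and, later in the chapter, by hand). They compute entropy for a list of class labels and the information gain of splitting a list of records on one attribute.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """I(p1, p2, ...) = -sum((pi/s) * log2(pi/s)) over the class counts."""
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def information_gain(rows, attribute, target):
        """Entropy of the target minus the weighted entropy after splitting on attribute."""
        labels = [r[target] for r in rows]
        remainder = 0.0
        for value in set(r[attribute] for r in rows):
            subset = [r[target] for r in rows if r[attribute] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return entropy(labels) - remainder

The worked examples later in this chapter (Q4 to Q7) follow exactly this weighted-entropy computation by hand.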
EXAMPLE:
The answer illustrates the idea on the standard "play tennis" weather data set, in which each day is described by the predictor attributes Outlook (sunny / overcast / rainy), Temperature (hot / mild / cool), Humidity (high / normal) and Windy (true / false), and by the target attribute Play (yes / no). Splitting on the attribute with the highest information gain (Outlook) gives the purest child nodes and hence the smallest tree. [The original page shows the 14-row data table and the resulting decision tree.]

Q2. Explain the procedure to construct a decision tree.
Ans: [5M | Dec18]

DECISION TREE:
1. Decision tree is the most powerful and popular tool for classification and prediction.
2. A decision tree is a flowchart-like tree structure.
3. Each internal node denotes a test on an attribute.
4. Each branch represents an outcome of the test.
5. Each leaf node (terminal node) holds a class label.
6. In decision trees, for predicting a class label for a record we start from the root of the tree.
7. We compare the values of the root attribute with the record's attribute.
8. On the basis of the comparison, we follow the branch corresponding to that value and jump to the next node.
9. We continue comparing our record's attribute values with the other internal nodes of the tree.
10. We do this comparison until we reach a leaf node with the predicted class value.
11. This is how the modelled decision tree can be used to predict the target class or value.

PSEUDO CODE FOR DECISION TREE ALGORITHM:
1. Place the best attribute of the dataset at the root of the tree.
2. Split the training set into subsets.
3. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
4. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.
(A short recursive sketch of this procedure is given after this answer.)

CONSTRUCTION OF DECISION TREE:
[Figure 3.1: Example of a decision tree for playing tennis. Outlook is at the root; the Sunny branch tests Humidity (High -> No, Normal -> Yes), the Overcast branch is Yes, and the Rain branch tests Wind (Strong -> No, Weak -> Yes).]
1. Figure 3.1 shows an example of a decision tree for playing tennis.
2. A tree can be "learned" by splitting the source set into subsets based on an attribute value test.
3. This process is repeated on each derived subset in a recursive manner called recursive partitioning.
4. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions.
5. The construction of a decision tree classifier does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
6. Decision trees can handle high dimensional data.
7. In general, the decision tree classifier has good accuracy.
8. Decision tree induction is a typical inductive approach to learn knowledge for classification.
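A compact recursive version of the pseudo code above, added purely as an illustration. It reuses the entropy and information_gain helpers sketched after Q1 and represents the tree as nested dictionaries.

    def build_tree(rows, attributes, target):
        labels = [r[target] for r in rows]
        if len(set(labels)) == 1:                      # pure node -> leaf
            return labels[0]
        if not attributes:                             # nothing left to split on
            return max(set(labels), key=labels.count)
        best = max(attributes, key=lambda a: information_gain(rows, a, target))
        tree = {best: {}}
        for value in set(r[best] for r in rows):       # one branch per attribute value
            subset = [r for r in rows if r[best] == value]
            remaining = [a for a in attributes if a != best]
            tree[best][value] = build_tree(subset, remaining, target)
        return tree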
Q3. What are the issues in decision tree induction?
Ans: [10M | Dec16 & May18]

ISSUES IN DECISION TREE INDUCTION:
I) Instability:
1. The reliability of the information in the decision tree depends on feeding precise internal and external information at the onset.
2. Even a small change in input data can at times cause large changes in the tree.
3. The following things will require reconstructing the tree:
a. Changing variables.
b. Excluding duplicate information.
c. Altering the sequence midway.

II) Analysis limitations:
1. Among the major disadvantages of decision tree analysis are its inherent limitations.
2. The major limitations include:
a. Inadequacy in applying regression and predicting continuous values.
b. Possibility of spurious relationships.
c. Unsuitability for estimation tasks that predict the value of a continuous attribute.
d. Difficulty in representing functions such as parity, which require trees of exponential size.

III) Overfitting:
1. Overfitting happens when the learning algorithm continues to develop hypotheses that reduce training set error at the cost of an increased test set error.
2. How to avoid overfitting:
a. Pre-pruning: stop growing the tree early, before it perfectly classifies the training set.
b. Post-pruning: allow the tree to perfectly classify the training set and then prune the tree.

IV) Attributes with many values:
1. If an attribute has a lot of values, the gain criterion can favour it simply because of the number of values.
2. This reduces the accuracy of classification.

V) Handling costs:
1. Strong quantitative and analytical knowledge is required to build complex decision trees.
2. This raises the possibility of having to train people to complete a complex decision tree analysis.
3. The costs involved in such training make decision tree analysis an expensive option.

VI) Unwieldy:
1. Decision trees, while providing easy-to-view illustrations, can also be unwieldy.
2. Even data that is perfectly divided into classes and uses only simple threshold tests may require a large decision tree.
3. Large trees are not intelligible and pose presentation difficulties.
4. Drawing decision trees manually usually requires several re-draws owing to space constraints in some sections.
5. There is no foolproof way to predict the number of branches or spears that emit from decisions or sub-decisions.

VII) Incorporating continuous valued attributes:
1. Attributes which have continuous values (for example, Age or Temperature can take any value) cannot have a proper class prediction.
2. There is no solution for this until a range is defined in the decision tree itself.

VIII) Handling examples with missing attribute values:
1. It is possible to have missing values in the training set.
2. To handle this, the most common value among the examples can be selected for the tuple in consideration.

IX) Unable to determine the depth of the decision tree:
1. If the training set does not have an end value, i.e. the set is continuous,
2. this can lead to building an infinite decision tree.

X) Complexity:
1. Among the major decision tree disadvantages is its complexity.
2. Decision trees are easy to use compared to other decision-making models.
3. But preparing decision trees, especially large ones with many branches, is a complex and time-consuming affair.
4. It involves computing probabilities of different possible branches and determining the best split for each node.

Q4. For the given data, determine the entropy after classification using each attribute for classification separately, and find which attribute is best as the decision attribute for the root by finding information gain with respect to the entropy of Temperature as the reference attribute.

Sr. No   Temperature   Wind     Humidity
1        Hot           Weak     High
2        Hot           Strong   High
3        Mild          Weak     Normal
4        Cool          Strong   High
5        Cool          Weak     Normal
6        Mild          Strong   Normal
7        Mild          Weak     High
8        Hot           Strong   High
9        Mild          Weak     Normal
10       Hot           Strong   Normal

Ans: [10M | May16]

First we have to find the entropy of all attributes.
1. Temperature:
There are three distinct values in Temperature: Hot, Mild and Cool. As there are three distinct values in the reference attribute, the total information gain will be I(p, n, r).
Here p = total count of Hot = 4, n = total count of Mild = 4, r = total count of Cool = 2, and s = p + n + r = 10.
I(p, n, r) = -(p/s)*log2(p/s) - (n/s)*log2(n/s) - (r/s)*log2(r/s)
           = -(4/10)*log2(4/10) - (4/10)*log2(4/10) - (2/10)*log2(2/10)
I(p, n, r) = 1.522   (using calculator)

2. Wind:
There are two distinct values in Wind: Strong and Weak.
Here p = total count of Strong = 5, n = total count of Weak = 5, s = 10.
I(p, n) = -(p/s)*log2(p/s) - (n/s)*log2(n/s) = -(5/10)*log2(5/10) - (5/10)*log2(5/10)
I(p, n) = 1   (as the values of p and n are the same, the answer is 1)

3. Humidity:
There are two distinct values in Humidity: High and Normal.
Here p = total count of High = 5, n = total count of Normal = 5, s = 10.
I(p, n) = -(5/10)*log2(5/10) - (5/10)*log2(5/10) = 1

Now we will find the best root node using Temperature as the reference attribute.
a. Here the reference attribute is Temperature.
b. There are three distinct values in Temperature: Hot, Mild and Cool.
c. We find the total information gain for the whole data using the reference attribute.
d. With p = 4 (Hot), n = 4 (Mild), r = 2 (Cool) and s = 10:
I(p, n, r) = -(4/10)*log2(4/10) - (4/10)*log2(4/10) - (2/10)*log2(2/10) = 1.522

Now we will find the information gain, entropy and gain of the other attributes (except the reference attribute).

1. Wind:
The Wind attribute has two distinct values, Weak and Strong. We find the information gain of these distinct values as follows.
i. Weak: p_i = Hot values related to Weak = 1, n_i = Mild values related to Weak = 3, r_i = Cool values related to Weak = 1, s = 5.
I(Weak) = -(1/5)*log2(1/5) - (3/5)*log2(3/5) - (1/5)*log2(1/5) = 1.371   (using calculator)
ii. Strong: p_i = Hot values related to Strong = 3, n_i = Mild values related to Strong = 1, r_i = Cool values related to Strong = 1, s = 5.
I(Strong) = -(3/5)*log2(3/5) - (1/5)*log2(1/5) - (1/5)*log2(1/5) = 1.371   (using calculator)

Therefore,
Wind value   related Hot (p_i)   related Mild (n_i)   related Cool (r_i)   I(p_i, n_i, r_i)
Weak         1                   3                    1                    1.371
Strong       3                   1                    1                    1.371
Now we will find the entropy of Wind as follows:
Entropy of Wind = SUM_i [ (p_i + n_i + r_i) / (p + n + r) ] * I(p_i, n_i, r_i)
Here p + n + r = 10 is the total count of Hot, Mild and Cool in the reference attribute, p_i + n_i + r_i is the total count of related values from the above table for each distinct value of Wind, and I(p_i, n_i, r_i) is the information gain of that distinct value.
Entropy of Wind = ((1+3+1)/10) * 1.371 + ((3+1+1)/10) * 1.371 = 1.371
Gain of Wind = Total information gain - Entropy of Wind = 1.522 - 1.371 = 0.151

2. Humidity:
The Humidity attribute has two distinct values, High and Normal.
i. High: p_i = Hot values related to High = 3, n_i = Mild values related to High = 1, r_i = Cool values related to High = 1, s = 5.
I(High) = -(3/5)*log2(3/5) - (1/5)*log2(1/5) - (1/5)*log2(1/5) = 1.371   (using calculator)
ii. Normal: p_i = Hot values related to Normal = 1, n_i = Mild values related to Normal = 3, r_i = Cool values related to Normal = 1, s = 5.
I(Normal) = -(1/5)*log2(1/5) - (3/5)*log2(3/5) - (1/5)*log2(1/5) = 1.371   (using calculator)

Therefore,
Humidity value   related Hot (p_i)   related Mild (n_i)   related Cool (r_i)   I(p_i, n_i, r_i)
High             3                   1                    1                    1.371
Normal           1                   3                    1                    1.371

Now we will find the entropy of Humidity:
Entropy of Humidity = ((3+1+1)/10) * 1.371 + ((1+3+1)/10) * 1.371 = 1.371
Gain of Humidity = Total information gain - Entropy of Humidity = 1.522 - 1.371 = 0.151

Gain of Wind = Gain of Humidity = 0.151.
Here both values are the same, so we can take either attribute as the root node. If they were different, we would select the attribute with the biggest gain.

Q5. Create a decision tree for the attribute "Class" using the respective values:

Eye Colour   Married   Sex      Hair Length   Class
Brown        Yes       Male     Long          Football
Blue         Yes       Male     Short         Football
Brown        Yes       Male     Long          Football
Blue         No        Male     Long          Football
Brown        No        Male     Short         Football
Blue         No        Male     Long          Football
Blue         No        Male     Short         Football
Brown        No        Female   Long          Netball
Brown        No        Female   Long          Netball
Brown        No        Female   Long          Netball
Brown        Yes       Female   Short         Netball
Brown        No        Female   Long          Netball

Ans: [10M | Dec16]
Finding the total information gain I(p, n) using the Class attribute:
There are two distinct values in Class: Football and Netball.
Here p = total count of Football = 7, n = total count of Netball = 5, s = p + n = 12.
I(p, n) = -(p/s)*log2(p/s) - (n/s)*log2(n/s) = -(7/12)*log2(7/12) - (5/12)*log2(5/12)
I(p, n) = 0.980   (using calculator)

Now we will find the information gain, entropy and gain of the other attributes (except the Class attribute).

1. Eye Colour:
The Eye Colour attribute has two distinct values, Brown and Blue.
i. Brown: p_i = Football values related to Brown = 3, n_i = Netball values related to Brown = 5, s = 8.
I(Brown) = -(3/8)*log2(3/8) - (5/8)*log2(5/8) = 0.955   (using calculator)
ii. Blue: p_i = Football values related to Blue = 4, n_i = Netball values related to Blue = 0, s = 4.
I(Blue) = 0   (if either count is 0, the information gain is 0)

Therefore,
Eye Colour value   related Football (p_i)   related Netball (n_i)   I(p_i, n_i)
Brown              3                        5                       0.955
Blue               4                        0                       0

Now we will find the entropy of Eye Colour:
Entropy of Eye Colour = SUM_i [ (p_i + n_i) / (p + n) ] * I(p_i, n_i), where p + n = 12 is the total count of Football and Netball in the Class attribute.
Entropy of Eye Colour = ((3+5)/12) * 0.955 + ((4+0)/12) * 0 = 0.637
Gain of Eye Colour = Total information gain - Entropy of Eye Colour = 0.980 - 0.637 = 0.343

2. Married:
The Married attribute has two distinct values, Yes and No.
i. Yes: p_i = Football values related to Yes = 3, n_i = Netball values related to Yes = 1, s = 4.
I(Yes) = -(3/4)*log2(3/4) - (1/4)*log2(1/4) = 0.812   (using calculator)
ii. No: p_i = Football values related to No = 4, n_i = Netball values related to No = 4, s = 8.
I(No) = -(4/8)*log2(4/8) - (4/8)*log2(4/8) = 1   (both counts are 4, i.e. equal, so the answer is 1)

Therefore,
Married value   related Football (p_i)   related Netball (n_i)   I(p_i, n_i)
Yes             3                        1                       0.812
No              4                        4                       1

Now we will find the entropy of Married:
Entropy of Married = ((3+1)/12) * 0.812 + ((4+4)/12) * 1 = 0.938
Gain of Married = Total information gain - Entropy of Married = 0.980 - 0.938 = 0.042
3. Sex:
The Sex attribute has two distinct values, Male and Female.
i. Male: p_i = Football values related to Male = 7, n_i = Netball values related to Male = 0, s = 7.
I(Male) = 0   (one of the counts is 0, so the information gain is 0)
ii. Female: p_i = Football values related to Female = 0, n_i = Netball values related to Female = 5, s = 5.
I(Female) = 0   (one of the counts is 0, so the information gain is 0)

Therefore,
Sex value   related Football (p_i)   related Netball (n_i)   I(p_i, n_i)
Male        7                        0                       0
Female      0                        5                       0

Now we will find the entropy of Sex:
Entropy of Sex = ((7+0)/12) * 0 + ((0+5)/12) * 0 = 0
Gain of Sex = Total information gain - Entropy of Sex = 0.980 - 0 = 0.980

4. Hair Length:
The Hair Length attribute has two distinct values, Long and Short.
i. Long: p_i = Football values related to Long = 4, n_i = Netball values related to Long = 4, s = 8.
I(Long) = -(4/8)*log2(4/8) - (4/8)*log2(4/8) = 1   (both counts are equal, so the answer is 1)
ii. Short: p_i = Football values related to Short = 3, n_i = Netball values related to Short = 1, s = 4.
I(Short) = -(3/4)*log2(3/4) - (1/4)*log2(1/4) = 0.812

Therefore,
Hair Length value   related Football (p_i)   related Netball (n_i)   I(p_i, n_i)
Long                4                        4                       1
Short               3                        1                       0.812

Now we will find the entropy of Hair Length:
Entropy of Hair Length = ((4+4)/12) * 1 + ((3+1)/12) * 0.812 = 0.938
Gain of Hair Length = Total information gain - Entropy of Hair Length = 0.980 - 0.938 = 0.042

Gain of all attributes except the Class attribute:
Gain (Eye Colour) = 0.343
Gain (Married) = 0.042
Gain (Sex) = 0.980
Gain (Hair Length) = 0.042

Here the Sex attribute has the largest gain, so 'Sex' will be the root node of the decision tree.

First we will go for the Male value to get its child node. We repeat the entire process where Sex = Male: we take those tuples from the given data which contain Male as Sex and construct a table of those tuples.
: “Eye Colour | Married Sex Brown Yes Male Long Football Biue Yes jale Short Football Brown Yes Male Long Football Blue No Male Long Football Brown No Male Short Football Blue No Male Long Football Blue No Male Short Football Hair Length — Class Here we can see that Football is the only one value of class which is related to Male value of Sex class. “Owe can say that all Male plays Football. Now we will go for Female value to get its child node, “We will take those tuples from given data which contains Female as Sex. And construct a table of those tuples a — gigs [laNdcraftea by BackkBenchers Publications Page 31 of 102 Scanned by CamScanner www.To PP&FSSOlution. ‘ “Sy Chap - 3] Learning with Trees Eye Colour Married Sex | Hair Length | a Brown No Female Long Netba Brown No Female Brown No Female Long Netball Brown Yes Female Short Netball Brown No Fernale Long Netball -——Tong.—«|~_—Nettballl pt clac. Here we can see that Netball is the only one value of class which is related to Female value of f SexSex clas. So we can say that all Female plays Netball. So Final Decision Tree will be as following 4 ( Eye Married Sex Brown Yes Blue Yes Brown Blue Yes No No No Brown Blue Blue <> ee \ Class Male | Long Male | Short | Football| Male Male Male | Male Cent M fale SNye Hair | Length Colour i Eye | Married Colour Brown No Brown No Brown No Brown Yes Brown No Football | Long | | Long | | Short | | Long | Football Football Football Football No___| Male | short | Football Sex Female Female Female Female Female | | | | | Hair Length Long | Long | Long | Short | long | Class Netball Netball Netball Netball Netball _ ( Football tee Q6. For the given data >) . determine the entropy after classification using each attribute fo! classification separately and fine which attribute is best as decisiun attribute for the root by finding information gain with respect to entropy of Temperature as reference attribute. Sr.No | Temperature Wind Humidity 7 Hot Weak Normal 2 Hot Strong High 3 Mild Weak Normal 4 Mild Strong High 5 Cool Weak Normal 6 Mild Strong Normal 7 Mild Weak High 8 Hot Strong Normal 9 Mild Strong Normal 10 Cool Strong Normal Ans: ‘® Handcrafted by BackkSenc hers Publications Page 32 of 102 = Scanned by CamScanner —a——_ cha www. ToppersSalutions.com p-3| Learning with Trees cirst we have to find entropy of all attributes, 1 Temperature: There are three distinct values in Temperature which are Hot, Mild and Coal As there are three distinct values in reference attribute, Total information gain will be Hp, nm. r) Here, p = total count of Hot = 3 n= total count of Mild =5 r= total count of cool = 2 s=ptn+r=3+5+2=10 Therefore, Ip,n.t) _ Pp P n n = —Zlog.® - “log,* — log,* 3. = 3 ~ 19 082 10 5 5 10 O82 35 I(p, 1, 1) = 1486 wee 2 2 ~ 75982 75 USING Calculator Wind: There are two distinct values in Wind which are Strong and Weak. As there are two distinct values in reference attribute, Total information gain will be I(p, n) Here, p = total count of Strong =6 n = total count of Weak = 4 s=p+n=6+4=10 Therefore, (0, n) =P = —*log2~Bot — ylog2>n =-*}Jog,% - +log,+ =~ 75108279 ~ 39 082 to WD, 1) = O97 1 oe ceeeseeecereeee as value of pandn are same, the answer will be 1. Humidity: There are two distinct values in Humidity which are High and Normal As there are two distinct values in reference attribute, Total information gain will be I(p, n). Here, p = total count of High = 3 n= total count of Normal = 7 S=p+n=3+7=10 Therefore, I(p, n) = —?log.® - “loge = 3 3 70 O82 —10” Se. 
— 7) 10 8275 —/0¢,— (1) = 0.882 vaeccsssnenenies as value of pand n are same, the answer will be 1, fieercestee by BackkBenchers Publications Paye 33 of 102 Scanned by CamScanner www: TOPPerssoluy,, 1 8, om Chap ~ 3 | Learning with Trees . é re as ref erence attribute Now we will find best root node using TTemperatu Here, reference attribute is Temperature. 4 cool : 7 Mild and There are three distinct values in Ternperature which are Hot, Mil As there are three distinct values in reference attrib Here, Cool. . ute, Total inforrnation gain will be I(p, n,r), p= total count of Hot =3 nN = total count of Mild =5 r = total count of cool = 2 S=p+n+r=3+5+2=10 Therefore, l(p,n,r)= Pp, ) —2Iop,2 clogs —— 2= log2 Sa 77 zlog, Ll= _ 3 3 ~ 2Igp,% 5 5 ~ Zoe, 2 =~ A]op,2 %2 190 O82 75 WPL 10 log, 10 1) = 1.486 cece 10 loge 10 USING calculator Now we will find Information Gain, Entropy and Gain of other attributes except reference attribute 1 Wind: Wind attribute have two distinct values which are weak and strong. We will find information gain of these distinct values as following = =Weak= P. = no of Hot values related to weak = 1 ni = n0 of Mild values related to weak = 2 ri = no of Cool values related to weak =1 S=prtaAt+n=l+2+1=4 Therefore, (weak) = Hip, n, r) = —2 log, — “log,” — *log,* =—Liop,2 =~ 710825 Z10p,2 — 70825 — I(weak) = I(P, 0, 1) = 1S wee I. Liop,2 7827 USING Calculator =—- Strong= pi = no of Hot values related to strong =2 ni = no of Mild values related to strong =3 r, = no of Cool values related to strong =1 S=pitn+n=2+3+1=6 Therefore, l(weak) = I(p, n, r) -_pF= —* log. * -- * logs ~ - “log: ~ ==—2Iog,2 Iop,32 ~— 510825 Lge, 2 610825 —— 519825 weak) = I(p, ny 1) = 1.460 wcccccccssees using calculator _ —_— Handcrafted by BackkBenchers Publications 4 Page 34 of 10! Scanned by CamScanner er www.ToppersSolutiens.com chap-3| Learning with Trees therefore, Wind vi Gain of value related | Information values of Cool) 7 values of Mild) values of Hot) from Wind related | (total related | (total values | (total Distinct pe nit i i ny Weak Strong Now we will find Entropy of Wind as following, ienper Purneery Entropy of Wind = 3%, Here, ptner x 1%, ny, nm) p+n+r=totalcount of Hot, Mild and Cold from reference attribute = 10 Pumer, = total count of related values from above table for distinct values in Wind attribute Ip.) Entropy of = Information Wind =n ptn+r arene = 2h obtttl TUN Entropy of wind gain of particular distinct value of attribute Plenytr, for weak eRe EOE Jac 24341 LS + = p per strong 4g, Perer ea A . x (pp nen | & 1.460 =1476 Gain of wind = Entropy of Reference 2. ay X 1p NT) - Entropy of wind = 1.486 - 1.476 = 0.01 Humidity: Humidity attribute have two distinct values which are High and Normal. We wil find information gain of these distinct values as following lL High= Pp. = no of Hot values related to High =1 n= no of Mild values related to High = 2 r =no of Cool values related to High = G S=pemtn=1+2+0=3 Therefore, High) = Kp, n, 9) > — log, * - “logs > ~ “log. * 1 1 — 2I09,~ 2 2 — “lop: 0 0 = —)o¢,1 3 B23 3 Ob 3 log25 I(High) = Ip, n, 1) = 0.919 we Using calculator I. = Normal = D = no of Hot values related to Normal = 2 n, = no of Mild values related to Normal = 3 r = no of Cool values related to Normal = 2 Se pentns2+3+2=7 Therefore, (Normal) = Ip, n,n) = —2log,® — Slog,> — <log.~ 2 z = -Flog.= — =log> — Sloga> 2 Mweak) = 1p, 0, 1) = 1-557 .ncesnnmnnns USING calculator 8 Handcrafted by BackkBenchers Publications Page 35 of 102 i fii. 
Q7. For the SunBurn dataset given below, construct a decision tree.

Name   | Hair   | Height  | Weight  | Location | Class
Sunita | Blonde | Average | Light   | No       | Yes
Anita  | Blonde | Tall    | Average | Yes      | No
Kavita | Brown  | Short   | Average | Yes      | No
Sushma | Blonde | Short   | Average | No       | Yes
Xavier | Red    | Average | Heavy   | No       | Yes
Balaji | Brown  | Tall    | Heavy   | No       | No
Ramesh | Brown  | Average | Heavy   | No       | No
Swetha | Blonde | Short   | Light   | Yes      | No

[10M | May17 & May16]

Ans:

First we find the entropy of the Class attribute. There are two distinct values in Class, Yes and No.
p = total count of Yes = 3, n = total count of No = 5, s = p + n = 8.
I(p, n) = -(3/8)log2(3/8) - (5/8)log2(5/8) = 0.955

Now we find the information gain, entropy and gain of the other attributes with respect to Class.

1. Hair:
Hair has three distinct values: Blonde, Brown and Red.
Blonde: p1 = Yes values related to Blonde = 2, n1 = No values related to Blonde = 2, s = 4.
I(Blonde) = -(2/4)log2(2/4) - (2/4)log2(2/4) = 1 (p1 and n1 are equal, so the information is 1)
Brown:  p1 = 0, n1 = 3. If either count is 0 the information is 0, so I(Brown) = 0.
Red:    p1 = 1, n1 = 0, so I(Red) = 0.

Distinct values of Hair | Yes (p1) | No (n1) | I(p1, n1)
Blonde                  | 2        | 2       | 1
Brown                   | 0        | 3       | 0
Red                     | 1        | 0       | 0

Entropy of Hair = sum over its values of (p1 + n1)/(p + n) x I(p1, n1)
               = (4/8) x 1 + (3/8) x 0 + (1/8) x 0 = 0.5
Gain of Hair = Entropy of Class - Entropy of Hair = 0.955 - 0.5 = 0.455
2. Height:
Height has three distinct values: Tall, Average and Short.
Tall:    p1 = 0, n1 = 2, so I(Tall) = 0 (one of the counts is 0).
Average: p1 = 2, n1 = 1, so I(Average) = -(2/3)log2(2/3) - (1/3)log2(1/3) = 0.919
Short:   p1 = 1, n1 = 2, so I(Short) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.919

Distinct values of Height | Yes (p1) | No (n1) | I(p1, n1)
Tall                      | 0        | 2       | 0
Average                   | 2        | 1       | 0.919
Short                     | 1        | 2       | 0.919

Entropy of Height = (2/8) x 0 + (3/8) x 0.919 + (3/8) x 0.919 = 0.690
Gain of Height = 0.955 - 0.690 = 0.265

3. Weight:
Weight has three distinct values: Heavy, Average and Light.
Heavy:   p1 = 1, n1 = 2, so I(Heavy) = 0.919
Average: p1 = 1, n1 = 2, so I(Average) = 0.919
Light:   p1 = 1, n1 = 1, so I(Light) = 1

Distinct values of Weight | Yes (p1) | No (n1) | I(p1, n1)
Heavy                     | 1        | 2       | 0.919
Average                   | 1        | 2       | 0.919
Light                     | 1        | 1       | 1

Entropy of Weight = (3/8) x 0.919 + (3/8) x 0.919 + (2/8) x 1 = 0.94
Gain of Weight = 0.955 - 0.94 = 0.015
4. Location:
Location has two distinct values: Yes and No.
Yes: p1 = 0, n1 = 3, so I(Yes) = 0.
No:  p1 = 3, n1 = 2, so I(No) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971

Distinct values of Location | Yes (p1) | No (n1) | I(p1, n1)
Yes                         | 0        | 3       | 0
No                          | 3        | 2       | 0.971

Entropy of Location = (3/8) x 0 + (5/8) x 0.971 = 0.607
Gain of Location = 0.955 - 0.607 = 0.348

Here,
Gain(Hair) = 0.455
Gain(Height) = 0.265
Gain(Weight) = 0.015
Gain(Location) = 0.348

The gain of the Hair attribute is the highest, so we take Hair as the root node. Now we construct a table for each distinct value of the Hair attribute.

Blonde:
Name   | Height  | Weight  | Location | Class
Sunita | Average | Light   | No       | Yes
Anita  | Tall    | Average | Yes      | No
Sushma | Short   | Average | No       | Yes
Swetha | Short   | Light   | Yes      | No

Brown:
Name   | Height  | Weight  | Location | Class
Kavita | Short   | Average | Yes      | No
Balaji | Tall    | Heavy   | No       | No
Ramesh | Average | Heavy   | No       | No

Red:
Name   | Height  | Weight | Location | Class
Xavier | Average | Heavy  | No       | Yes
For the Brown table the class value is No for all tuples, and for the Red table the class value is Yes for all tuples, so both become leaf nodes. We only need to expand the node for the Blonde value, using the Blonde table constructed above.

We find the entropy of the Class attribute again for the Blonde tuples.
p = count of Yes = 2, n = count of No = 2, s = 4.
I(p, n) = 1 (p and n are equal)

Now we find the gain of the remaining attributes for the Blonde tuples.

1. Height:
Tall:    p1 = 0, n1 = 1, so I(Tall) = 0.
Average: p1 = 1, n1 = 0, so I(Average) = 0.
Short:   p1 = 1, n1 = 1, so I(Short) = 1.

Distinct values of Height | Yes (p1) | No (n1) | I(p1, n1)
Tall                      | 0        | 1       | 0
Average                   | 1        | 0       | 0
Short                     | 1        | 1       | 1

Entropy of Height = (1/4) x 0 + (1/4) x 0 + (2/4) x 1 = 0.5
Gain of Height = 1 - 0.5 = 0.5

2. Weight:
Heavy:   p1 = 0, n1 = 0, so I(Heavy) = 0.
Average: p1 = 1, n1 = 1, so I(Average) = 1.
Light:   p1 = 1, n1 = 1, so I(Light) = 1.

Distinct values of Weight | Yes (p1) | No (n1) | I(p1, n1)
Heavy                     | 0        | 0       | 0
Average                   | 1        | 1       | 1
Light                     | 1        | 1       | 1

Entropy of Weight = 0 + (2/4) x 1 + (2/4) x 1 = 1
Gain of Weight = 1 - 1 = 0
3. Location:
Yes: p1 = 0, n1 = 2, so I(Yes) = 0.
No:  p1 = 2, n1 = 0, so I(No) = 0.

Entropy of Location = (2/4) x 0 + (2/4) x 0 = 0
Gain of Location = 1 - 0 = 1

Here,
Gain(Height) = 0.5
Gain(Weight) = 0
Gain(Location) = 1

As the gain of Location is the largest, we take Location as the splitting node under Blonde. Constructing a table for each distinct value of Location:

Location = Yes:
Name   | Height | Weight  | Class
Anita  | Tall   | Average | No
Swetha | Short  | Light   | No

Location = No:
Name   | Height  | Weight  | Class
Sunita | Average | Light   | Yes
Sushma | Short   | Average | Yes

The class value is No for all tuples with Location = Yes and Yes for all tuples with Location = No, so there is no need for further classification. The final decision tree is:

Hair = Blonde -> Location = Yes -> No
              -> Location = No  -> Yes
Hair = Brown  -> No
Hair = Red    -> Yes
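As an illustrative cross-check of Q7 (not part of the original solution), the same data can be fed to scikit-learn. This assumes scikit-learn and pandas are available, and the one-hot encoding is our own choice; the binary splits scikit-learn produces need not match the multi-way ID3 tree above, but the predictions on this small training set agree with it.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Hair":     ["Blonde", "Blonde", "Brown", "Blonde", "Red", "Brown", "Brown", "Blonde"],
    "Height":   ["Average", "Tall", "Short", "Short", "Average", "Tall", "Average", "Short"],
    "Weight":   ["Light", "Average", "Average", "Average", "Heavy", "Heavy", "Heavy", "Light"],
    "Location": ["No", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
    "Class":    ["Yes", "No", "No", "Yes", "Yes", "No", "No", "No"],
})

X = pd.get_dummies(data.drop(columns="Class"))      # one-hot encode the categorical features
y = data["Class"]
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))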
CHAP - 4: SUPPORT VECTOR MACHINES

Q1. What are the key terminologies of Support Vector Machine?
Q2. What is SVM? Explain the following terms: hyperplane, separating hyperplane, margin and support vectors with suitable example.
Q3. Explain the key terminologies of Support Vector Machine.

Ans: [5M | May16, Dec16 & May17]

SUPPORT VECTOR MACHINE:
1. A support vector machine is a supervised learning algorithm that sorts data into two categories.
2. A support vector machine is also known as a support vector network (SVN).
3. It is trained with a series of data already classified into two categories, building the model as it is initially trained.
4. An SVM outputs a map of the sorted data with the margins between the two categories as far apart as possible.
5. SVMs are used in text categorization, image classification, handwriting recognition and in the sciences.

HYPERPLANE:
1. A hyperplane is a generalization of a plane.
2. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes/groups.
3. Figure 4.1 shows an example of a hyperplane.
4. As a simple example, for a classification task with only two features as shown in figure 4.1, you can think of a hyperplane as a line that linearly separates and classifies a set of data.
5. When new testing data is added, whichever side of the hyperplane it lands on decides the class that we assign to it.

SEPARATING HYPERPLANE:
1. From figure 4.1 we can see that it is possible to separate the data.
2. We can use a line to separate the data.
3. All the data points representing men will be above the line.
4. All the data points representing women will be below the line.
5. Such a line is called a separating hyperplane.

MARGIN:
1. A margin is the separation of the line to the closest class points.
2. The margin is calculated as the perpendicular distance from the line to only the closest points.
3. A good margin is one where this separation is larger for both the classes.
4. A good margin allows the points to be in their respective classes without crossing to the other class.
5. The more width of margin there is, the more optimal the hyperplane we get.

SUPPORT VECTORS:
1. The vectors (cases) that define the hyperplane are the support vectors.
2. Vectors are separated using the hyperplane.
3. Support vectors are the points from each group that lie closest to the hyperplane.
4. Figure 4.2 shows an example of support vectors with the margin width marked between them.
Q4. Define Support Vector Machine (SVM) and further explain the maximum margin linear separators concept.

Ans: [10M | Dec17]

SUPPORT VECTOR MACHINE:
1. A support vector machine is a supervised learning algorithm that sorts data into two categories.
2. A support vector machine is also known as a support vector network (SVN).
3. It is trained with a series of data already classified into two categories, building the model as it is initially trained.
4. An SVM outputs a map of the sorted data with the margins between the two categories as far apart as possible.
5. SVMs are used in text categorization, image classification, handwriting recognition and in the sciences.

MAXIMAL-MARGIN CLASSIFIER/SEPARATOR:
1. The Maximal-Margin Classifier is a hypothetical classifier that best explains how SVM works in practice.
2. The numeric input variables (x) in your data (the columns) form an n-dimensional space.
3. For example, if you had two input variables, this would form a two-dimensional space.
4. A hyperplane is a line that splits the input variable space.
5. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1.
6. In two dimensions you can visualize this as a line, and let us assume that all of our input points can be completely separated by this line.
7. For example: B0 + (B1 * X1) + (B2 * X2) = 0
8. Here the coefficients B1 and B2 that determine the slope of the line and the intercept B0 are found by the learning algorithm, and X1 and X2 are the two input variables.
9. You can make classifications using this line.
10. By plugging input values into the line equation, you can calculate whether a new point is above or below the line.
11. Above the line, the equation returns a value greater than 0 and the point belongs to the first class.
12. Below the line, the equation returns a value less than 0 and the point belongs to the second class.
13. A value close to the line returns a value close to zero, and the point may be difficult to classify.
14. If the magnitude of the value is large, the model may have more confidence in the prediction.
15. The distance between the line and the closest data points is referred to as the margin.
16. The best or optimal line that can separate the two classes is the line that has the largest margin.
17. This is called the Maximal-Margin hyperplane.
18. The margin is calculated as the perpendicular distance from the line to only the closest points.
19. Only these points are relevant in defining the line and in the construction of the classifier.
20. These points are called the support vectors.
21. They support or define the hyperplane.
22. The hyperplane is learned from training data using an optimization procedure that maximizes the margin.

Q5. What is Support Vector Machine (SVM)? How to compute the margin?
Q6. What is the goal of the Support Vector Machine (SVM)? How to compute the margin?

Ans: [10M | May17 & May18]

SUPPORT VECTOR MACHINE:
Refer Q4 (SVM part).

MARGIN:
1. A margin is the separation of the line to the closest class points.
2. The margin is calculated as the perpendicular distance from the line to only the closest points.
3. A good margin is one where this separation is larger for both the classes.
4. A good margin allows the points to be in their respective classes without crossing to the other class.
5. The more width of margin there is, the more optimal the hyperplane we get.

EXAMPLE OF HOW TO FIND THE MARGIN:
Consider building an SVM over the (very little) data set shown in figure 4.4, where the closest points of the two classes are (1, 1) and (2, 3).
1. The maximum margin weight vector will be parallel to the shortest line connecting these closest points of the two classes.
2. The optimal decision surface is orthogonal to that line and intersects it at the halfway point, so the SVM decision boundary is x1 + 2x2 - 5.5 = 0.
3. Working algebraically, with the standard constraint that the functional margin of the support vectors equals 1, the weight vector has the form w = (a, 2a), and the constraint is satisfied with equality by the two support vectors:
   a + 2a + b = -1 and 2a + 6a + b = +1.
4. Solving gives a = 2/5 and b = -11/5, so the optimal hyperplane is w = (2/5, 4/5) with b = -11/5.
5. The margin is 2/||w|| = 2 / sqrt(4/25 + 16/25) = 2 / (2*sqrt(5)/5) = sqrt(5).
6. This answer can be confirmed geometrically by examining figure 4.4.
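A quick numerical check of the worked example above (our own sketch, not from the book; the two support vector coordinates are taken from figure 4.4 as reconstructed in the example):

import numpy as np

w = np.array([2/5, 4/5])
b = -11/5
neg, pos = np.array([1, 1]), np.array([2, 3])   # the two closest points of the two classes

print(w @ neg + b)               # -1.0, functional margin of the negative support vector
print(w @ pos + b)               # +1.0, functional margin of the positive support vector
print(2 / np.linalg.norm(w))     # 2.236... = sqrt(5), the geometric margin width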
Q7. Write short note on: Soft margin SVM.

Ans: [10M | Dec18]

SOFT MARGIN SVM:
1. Soft margin SVM is an extended version of hard margin SVM.
2. The hard margin formulation was given by Boser et al., 1992 (COLT), and the soft margin formulation by Vapnik et al., 1995.
3. Hard margin SVM can work only when the data is completely linearly separable, without any errors (noise or outliers).
4. In case of errors, either the margin becomes smaller or hard margin SVM fails.
5. Soft margin SVM was proposed by Vapnik to solve this problem by introducing slack variables.
6. As far as usage is concerned, since soft margin is the extended version of hard margin SVM, it is the formulation normally used.
7. The allowance of softness in the margins (i.e. a low cost setting) allows errors to be made while fitting the model (support vectors) to the training/discovery data set.
8. Conversely, hard margins result in fitting a model that allows zero errors.
9. Sometimes it can be helpful to allow for errors in the training set.
10. It may produce a more generalizable model when applied to new datasets.
11. Forcing rigid margins can result in a model that performs perfectly on the training set, but is possibly over-fit / less generalizable when applied to a new dataset.
12. Identifying the best setting for the 'cost' is related to the specific data set you are working with.
13. Currently there are not many good solutions for simultaneously optimizing cost, features and kernel parameters (if using a non-linear kernel).
14. In both the soft margin and hard margin cases we are maximizing the margin between support vectors.
15. In the soft margin case, we let our model give some relaxation to a few points.
16. If we considered these points as support vectors, our margin might reduce significantly and our decision boundary would be poorer, so instead we treat them as error points.
17. We give a certain penalty for them, proportional to the amount by which each data point violates the hard constraint.
18. Slack variables Ei can be added to allow misclassification of difficult or noisy examples.
19. These variables represent the deviation of the examples from the margin.
20. Doing this we are relaxing the margin; we are using a soft margin.
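To make the 'cost' setting concrete, here is a minimal sketch using scikit-learn's SVC (our own illustration; the book does not prescribe a library or dataset). A small C gives a softer margin and typically more support vectors; a large C approaches the hard margin behaviour.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])   # two noisy clusters
y = np.array([0] * 20 + [1] * 20)

for C in (0.01, 1.0, 100.0):                                  # soft ... harder margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_)                                   # support vector counts per class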
Q8. What is Kernel? How can a kernel be used with SVM to classify non-linearly separable data? Also, list standard kernel functions.

Ans: [10M | May16]

KERNEL:
1. A kernel is a similarity function.
2. SVM algorithms use a set of mathematical functions that are defined as the kernel.
3. The function of the kernel is to take data as input and transform it into the required form.
4. It is a function that you provide to a machine learning algorithm.
5. It takes two inputs and outputs how similar they are.
6. Different SVM algorithms use different types of kernel functions.
7. Standard kernel functions include, for example: linear, nonlinear, polynomial, radial basis function (RBF) and sigmoid.

EXAMPLE EXPLAINING HOW A KERNEL CAN BE USED FOR CLASSIFYING NON-LINEARLY SEPARABLE DATA:
1. To predict if a dog is of a particular breed, we load in millions of dog properties/information like height, skin colour, body hair length etc.
2. In ML language, these properties are referred to as 'features'.
3. A single entry of this list of features is a data instance, while the collection of everything is the training data, which forms the basis of your prediction.
4. That is, if you know the skin colour, body hair length, height and so on of a particular dog, then you can predict the breed it will probably belong to.
5. In support vector machines, the separating hyperplane of a two-dimensional space is a one-dimensional line dividing the red and blue dots (figure 4.5).
6. From the example of trying to predict the breed of a particular dog, the pipeline is: Data (all breeds of dog) -> Features (skin colour, hair etc.) -> Learning algorithm.
7. If the red and blue balls cannot be separated by a straight line because they are randomly distributed (figure 4.6), a linear separator is not possible, and here the kernel comes into the picture.
8. In machine learning, a "kernel" usually refers to the kernel trick, a method of using a linear classifier to solve a non-linear problem.
9. It entails transforming linearly inseparable data (figure 4.6) into linearly separable data (figure 4.5).
10. The kernel function is applied to each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable.
11. Using the dog breed prediction example again, kernels offer a better alternative: instead of defining a slew of features, you define a single kernel function to compute similarity between breeds of dog.
12. You provide this kernel, together with the data and labels, to the learning algorithm, and out comes a classifier.
13. Mathematical definition: K(x, y) = <f(x), f(y)>. Here K is the kernel function, x and y are n-dimensional inputs, f is a map from the n-dimensional to an m-dimensional space, and <x, y> denotes the dot product. Usually m is much larger than n.
14. Intuition: normally, calculating <f(x), f(y)> requires us to compute f(x) and f(y) first and then take their dot product; these steps can be quite expensive as they involve the m-dimensional space, where m can be a large number.
15. But after all the trouble of going to the high-dimensional space, the result of the dot product is a scalar, so we come back to a one-dimensional quantity again.
16. The question is: do we really have to go to the m-dimensional space to get this one number? The answer is no, if you find a clever kernel.
17. Simple example: for x = (x1, x2, x3) and y = (y1, y2, y3), take f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3); then the kernel is K(x, y) = (<x, y>)^2.
18. Let us plug in some numbers to make this more intuitive. Suppose x = (1, 2, 3) and y = (4, 5, 6). Then:
    f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
    f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
    <f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024
19. Now using the kernel instead: K(x, y) = (4 + 10 + 18)^2 = 32^2 = 1024.
20. Same result, but this calculation is so much easier.
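The kernel-trick arithmetic above can be verified in a few lines of Python (our own check, not from the book; the `feature_map` helper is our own name):

import numpy as np

def feature_map(v):
    # explicit map f(x) = all pairwise products xi*xj (9-dimensional for a 3-D input)
    return np.outer(v, v).ravel()

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(feature_map(x) @ feature_map(y))   # 1024, via the explicit 9-D feature space
print((x @ y) ** 2)                      # 1024, via the kernel K(x, y) = (x . y)^2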
Q9. Quadratic Programming solution for finding maximum margin separation in Support Vector Machine.

Ans: [10M | May16]

1. The linear programming model is a very powerful tool for the analysis of a wide variety of problems in the sciences, industry, engineering and business.
2. However, it does have its limits: not all phenomena are linear.
3. Once nonlinearities enter the picture, an LP model is at best only a first-order approximation.
4. The next level of complexity beyond linear programming is quadratic programming.
5. This model allows us to include nonlinearities of a quadratic nature in the objective function.
6. As we shall see, this is a useful tool, for example for including the Markowitz mean-variance model of uncertainty in the selection of optimal portfolios.
7. A quadratic program (QP) is an optimization problem wherein one either minimizes or maximizes a quadratic objective function of a finite number of decision variables, subject to a finite number of linear inequality and/or equality constraints.
8. A quadratic function of a finite number of variables x = (x1, x2, ..., xn)^T is any function of the form:

   f(x) = a + sum_j c_j x_j + (1/2) sum_j sum_k x_j Q_jk x_k

9. Using matrix notation, this expression simplifies to

   f(x) = a + c^T x + (1/2) x^T Q x,   where c is an n-vector and Q is an n x n matrix.

10. The factor of one half preceding the quadratic term is included for the sake of convenience, since it simplifies the expressions for the first and second derivatives of f.
11. With no loss in generality, we may assume that the matrix Q is symmetric, since

    x^T Q x = (1/2) x^T (Q + Q^T) x,

    and so we are free to replace the matrix Q by the symmetric matrix (1/2)(Q + Q^T). Henceforth we assume that the matrix Q is symmetric.
12. Just as in the case of linear programming, every quadratic program can be put into a standard form. The QP standard form that we use is:

    minimize    c^T x + (1/2) x^T Q x
    subject to  Ax <= b,  x >= 0,

    where A is an m x n matrix and b is an m-vector.
13. Observe that we have simplified the expression for the objective function by dropping the constant term a, since it plays no role in the optimization step.
Q10. Explain how Support Vector Machine can be used to find the optimal hyperplane to classify linearly separable data. Give suitable example.

Ans: [10M | Dec18]

OPTIMAL HYPERPLANE:
1. The optimal hyperplane is completely defined by the support vectors.
2. The optimal hyperplane is the one which maximizes the margin of the training data.

SUPPORT VECTOR MACHINE:
1. A support vector machine is a supervised learning algorithm that sorts data into two categories.
2. A support vector machine is also known as a support vector network (SVN).
3. It is trained with a series of data already classified into two categories, building the model as it is initially trained.
4. An SVM outputs a map of the sorted data with the margins between the two categories as far apart as possible.
5. SVMs are used in text categorization, image classification, handwriting recognition and in the sciences.

HYPERPLANE:
1. A hyperplane is a generalization of a plane.
2. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes/groups (figure 4.7).
3. As a simple example, for a classification task with only two features, you can think of a hyperplane as a line that linearly separates and classifies a set of data.
4. When new testing data is added, whichever side of the hyperplane it lands on decides the class that we assign to it.
5. Hyperplane: (w . x) + b = 0, with w in R^N and b in R.
6. Corresponding decision function: f(x) = sgn((w . x) + b).
7. Optimal hyperplane (maximal margin): maximize over (w, b) the distance from the hyperplane to the closest training points, where for labels yi in {-1, +1} it holds that yi((w . xi) + b) > 0 for all i = 1, ..., m.
8. Here w and b are not unique: they can be scaled so that |(w . xi) + b| = 1 for the xi closest to the hyperplane.
9. This gives the canonical form (w, b) of the hyperplane, for which yi((w . xi) + b) >= 1 for all i = 1, ..., m.
10. The margin of the optimal hyperplane in canonical form equals 2/||w||.

Q11. Find the optimal hyperplane for the data points:
{(1, 1), (2, 1), (1, -1), (2, -1), (4, 0), (5, 1), (5, -1), (6, 0)}

Ans: [10M | Dec18]

1. Plotting the given points on a graph, we assume the separating hyperplane will lie between the cluster on the left and the cluster on the right.
2. We use the three support vectors which are closest to the assumed region: s1 = (2, 1) and s2 = (2, -1) from the left (negative) cluster, and s3 = (4, 0) from the right (positive) cluster.
3. Based on these support vectors, we form augmented vectors by adding 1 as a bias entry:
   s1~ = (2, 1, 1), s2~ = (2, -1, 1), s3~ = (4, 0, 1).
4. We now find three parameters a1, a2, a3 from the following three linear equations:
   a1 (s1~ . s1~) + a2 (s2~ . s1~) + a3 (s3~ . s1~) = -1
   a1 (s1~ . s2~) + a2 (s2~ . s2~) + a3 (s3~ . s2~) = -1
   a1 (s1~ . s3~) + a2 (s2~ . s3~) + a3 (s3~ . s3~) = +1
5. Substituting the values of s1~, s2~, s3~ and simplifying, we get:
   6 a1 + 4 a2 + 9 a3 = -1
   4 a1 + 6 a2 + 9 a3 = -1
   9 a1 + 9 a2 + 17 a3 = +1
6. Solving these equations we get a1 = a2 = -3.25 and a3 = 3.5.
7. To find the hyperplane which discriminates the positive class from the negative class, we use
   w~ = sum over i of ai si~ = -3.25 (2, 1, 1) - 3.25 (2, -1, 1) + 3.5 (4, 0, 1) = (1, 0, -3).
8. Removing the bias entry which we added to the support vectors to get the augmented vectors, the hyperplane equation y = w . x + b has w = (1, 0) and b = -3, i.e. b + 3 = 0.
9. Since w = (1, 0), the hyperplane is a vertical line (if it were (0, 1) the line would be horizontal), and since b + 3 = 0 the hyperplane passes through x1 = 3, as shown in the plot.
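The small linear system in Q11 can be checked numerically (our own sketch, not from the book):

import numpy as np

S = np.array([[2, 1, 1], [2, -1, 1], [4, 0, 1]], dtype=float)   # augmented support vectors
y = np.array([-1.0, -1.0, 1.0])                                  # labels of the support vectors

G = S @ S.T                       # Gram matrix of the augmented vectors
a = np.linalg.solve(G, y)         # [-3.25, -3.25, 3.5]
w_aug = a @ S                     # [1, 0, -3]  ->  w = (1, 0), b = -3

print(a, w_aug)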
So we will y., r fok, « from negat ° a positlityive cl discriminate ss equation, w= Y ek Putting values of a and § in above equation < v= foe) ‘pe ¥ = @ 5, + A252 + A355 We get, = t get augmentedoO vectyeg Now we will remove the bias point which we have added to support vectors tog So we will use hyperplane equation which is y = wx + b Here w = ( 0) as we have removed the bias point from it which is -3 And b = -3 which is bias point. We can write this as b +3 =O also. So we will use w and b to plot the hyperplane on graph Asw = fl (5): oo . 0 . . so it means it is an vertical line. Ifit were () then the line would be horizonta! line. And b + 3 =0, soit means the hyperplane will go from point 3 as show || _ | | | 05 0 | — ° 4 | ° ' I : . | — | fe } | ore Fa | 1. 2 | as ° =? j = _ o——~-9 4.05 6 Gg 9 ' | 08 0 -1 below. rr”. __O 4 UnanaAnvaftad hv BackkBenichers “ pubtiewia. fo Se a i Scanned by CamScanner cha p-5! Learning with Classification www.ToppersSolutions.com Cc HAP - _e.5; LEARNING WITH CLASSIFICATION Explain Ql. with suitab le example approaches to Probability, “ the advantages Q2. Explain Stassitication g3. Explain, in brief, Bayesian Belief network of Bayesian approach over using Bayesian Belief Network with an example. [10M j Decl6, Deci7 & Deci8] ay BAYESIAN BELIEF NETWORK: 1. A Bayesian network is a grap hical model 2. It represents a set of variables and the dependencies between them by usin g probability. 3, The nodes in a Bayesian 4, classiicai of a situation. network represent the variables. The directional arcs represent the dependenc ies between the variables. The direction of the arrows show the direction of the dep endency. Each variable is associated with a conditional probability tabl (CPT) e . CPT gives the probability of this variable for different values of the variables on which this node depends. Using this model, it is possible to perform inference and learning. BBN provides a graphical model of casual relationship on which learning can be performed. . We can use a trained Bayesian . There are two components network for classification. that defines a BBN: a, Directed Acyclic Graphs (DAG) b. Aset of Conditional Probability Tables (CPT) . As an example, consider the following scenario. Pablo travels by air, if he is on an official visit. If he is on a personal visit, he travels by air if he has money. If he does not travei by plane, he travels by train but sometimes also takes a bus. . The variables involved are: a Pablo travels by air (A) b. Goes on official visit(F) 9 (M) Pablo has money d. Pablo travels by train (T) e. Pablo travels by bus (B) This situation is converted into a belief network as shown in figure 5.1 below. . Inthe graph, we can see the dependencies with respect to the variables. The probability values at a variable are dependent on the value of its parents. t on F andM. . In this case, the variable A is dependen The variable T is dependent onA and variable B is dependent on A. The variables F and M are independent variables which do not have any parent node. variable. So their probabilities are not dependent on any other _* on F and M. Node A has the biggest conditional probability table as A depends Tand B depend anA. Handcrafted by BackkBenchers Publications Page 59 of 102 Scanned by CamScanner www.ToppersSolutions,c, ¥ r eM Cher =f ! Learning with Classification M oo in pocket ' [ren os | bef [fas money Cipes ut oltigial triph . 
[Figure 5.1: Belief network for the example, with nodes M (has money), F (goes on official trip), A (travels by air), T (travels by train) and B (travels by bus), each annotated with its conditional probability table.]

21. First we take the independent nodes: node F has a probability P(F) = 0.7 and node M has a probability P(M) = 0.3.
22. The conditional probability tables attached to the dependent nodes are P(A | F, M) for node A, P(T | A) for node T and P(B | A) for node B; their entries are shown in figure 5.1.
23. Using the Bayesian belief network, we can get the probability of a combination of these variables.
24. For example, we can get the probability that Pablo travels by train, does not travel by air, goes on an official trip and has money; in other words, we are finding P(T, ~A, F, M).
25. The probability of each variable given its parents is found and multiplied together to give the probability:
    P(T, ~A, F, M) = P(T | ~A) x P(~A | F, M) x P(F) x P(M) = 0.6 x 0.98 x 0.7 x 0.3 = 0.123
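The factorization can be evaluated mechanically; here is a minimal Python sketch of computing a joint probability from the CPT entries quoted above (our own illustration; only the four numbers used in the worked product are taken from the example, and the structure of the computation is the point of the sketch).

# CPT entries quoted in the worked example
P_F = 0.7                      # P(F = true)
P_M = 0.3                      # P(M = true)
P_notA_given_FM = 0.98         # P(A = false | F = true, M = true), as used in the text
P_T_given_notA = 0.6           # P(T = true | A = false)

# joint factorizes along the DAG: P(T, ~A, F, M) = P(T|~A) P(~A|F,M) P(F) P(M)
joint = P_T_given_notA * P_notA_given_FM * P_F * P_M
print(round(joint, 3))         # 0.123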
c For unit) in output layer, the error Errj = Oj (1- Of)(Tj - Oj) d. Where, ©, = actual output of unity T, sKnown target value of given tuple Ol - ©) = derivate of logistic function e. The error of hidden layer unit J is Err, = O, (l- O)) ¥ Err.Wy 4. Terminating Condition: Training stops when: a. AILW,, in previous epoch are so small as to be below some specified threshold b. The percentage of tuple misclassified in previous epoch is below some threshold. c. Apre-specified number Seren of epochs has expired ewan ern eee een n wenn nen wenn enee - Q8. Hidden Markov Model Q9. What Ql0. Explain Hidden Markov Models. aoe. o- cnwene. = wee ennew enn, are the different Hidden Markov Models Ans: [10M | Mayl6, Decl6, Mayl7 & Dec17| HIDDEN 1. MARKOV MODEL: Hidden Markov Model is a statistical model based on the Markov precess with unobserved (hidden) states. A Markov process is referred as memory-less process which satisfies Markov property. 3. Markov property states that the conditional pronability distribution of future states of a process | depends only on present state, not on the sequence of events 4“. The Hidden Markov Modei is one ot the most popular graphical model S. Examples of Markov process: a. Measurement of weather pattern, b. Daily stock market prices. 6, HMM 7. HMM is a variant of finite state machine have following things: b. An output alpnabet of visible state (Vj. c. Transition probabilities (Aj. d. Output Probability (B). €, initial state probability {41} in es Aset of hidden states (W). The current state is not observable, instead €acn state produces en output with a certain probability 1B). 9. Usually states (Wj and outputs {¥j are understaod “* Handcrafted by BackkBenchers Publications Page 620f 107 nent at eat a. emeret generally works on set of temporal data. Bene 8. | | i Scanned by CamScanner yy chap 5] Learning with Classification jons.com www.Topperssolut oan HMM is said to be a triple (A, B, 71) 10.5 LANATION OF hMM: HMM ¢ 1 onsists of two t ypes of states, the hidden state (W) and visible state (V). e to another _ Transition probability is the probability of transitioning from one stat 2 ———————_""_ in single step. . ne + robability: is i 1 emission probability: ability: The conditional distributions of observed variables P(Vn| Zn) ,, HMM have to address following issues: a. Evaluation problem: e atfora given model @with W( hidden states), V(visible states), A, (transition probability) and Vv" (sequence of visible symbol emitted) « What is probability that the visible states sequence V™ will be emitted by model 6 i.e. P (VT/ 8) =? b. Decoding problem: * W’, the sequence of states generated by visible symbols sequence V" has to be calculated. ie. Wl =? c. Training problem: Foraknown set of hidden states W and visible states V, the training problem is tc find transition * probability A, and emission probability By. from training set. ie. Ay=? & Bus? EXAMPLE OF HMM: Coin toss 1, Heads, tails sequence 4 4 You are in a room with a wall. with 2 coins. WwW Person behind the wall flips coin. tells the resuit. ey Coin selection and toss is hidden. Cannot observe events, only output(headgs, tails) from events. Probjem ll. is then to build a model of neads and to explain observed sequence tails. 
Using Bayesian ciassification and the given data classify the tuple (Rupesh, M, 1.73 m) Value [Attribute Probability Count Short Medium Tall 3 V4 2/7 3/4 ] 3/4 S/7 V4 0 0 2/4 0 0 2 O o | 2/4 i) (1.7, 1.8} O 3 0 O 3/7 (1.8, 1.9) 0 3 O 0 3/7 (1.9, 2) 0 1 2 O V7 2/4 (2, ©) 0 O 2 O 2/4 Short Medium M FE 1 3 2 5 Height (0, 1.6) 2 range) (1.6, 1.7) | fGendar Lo OS |9 = [OM | Mayt6] Ans: MHandcratted Tall by BackkSenchers Publications Page 63 of 1C2 * Scanned by CamScanner } www.Topperssolutions.c,, Chap - 5 {| Learning with Classification _ We have given Condition Probability Table i \ eee (CPT). . | h, M, 1.73m) Tuple to be classified = (Rupes as following to find prob We will use sore part of table which is red boxed ability of Snort, ediury At, ee Tall. [Attribute [Value al ee WRU | Short ee erotarallil — Téountt Medium “Short Tall a 71 2 3 V4 ofr M F 3 5 1 3/4 5/7 Height O 0 aa o 2 Oo (0, 1.6) (16,17) | 2 0 (1.7, 1.8) 0 (1.8, 1.9) = 0 5 ' WF FR : (range) 6 3 3 5 5 Vi ya Gender aa (2, ©) oO oO —T aq 5 7 a 0 2 oO | 0 V4 ~ a . | 2/4 (There is no need to draw the above table again. The above table is just for understanding purpose) Here, n= 15 (surmming up all numbers in red box.) Probability (Short) = (1+3)/9 =0.45 Probability (Medium) = (2+5)/9 =0.78 = (3+1)/9 =0.45 Probability (Tall) Here, Short, Medium and Tall are class attributes values. Now we will find probability of tuple with every value of class attribute which is Short, Medium and Tall. Probability(tuple | Short) = P[M | Short] x P[Height, Short] Here, P[M ?} Short] = probability of M with Short ana P[Height | Short] = probability of height with Short Prohabrlityyte iple | Short} = P[Mi | short] « P[(I.7 - 1.8) | Short] (As height in given tuple 's 173m which lies in (1.7 ~ 1.8) range of height.) (Now we have ta get those values from Probability part of table) Probability(tuple | Short) = - xO0=0 Probability(tuple | Medium) = P[M | Medium] x P[Height | Medium] = 2y¥ 53-04% 7 7 Probability(tuple| Tall) = P[M | Tall] x P[Height | Tall] =3 x 0=0 Now we have to find likelihood of tuple for all class attribute values. Likelihood of Short = P[tuple | Short] x P[Short] = 0 x 0.45 Likelihood of Tall = P[tuple | Tall] x P[Tall] = 0 x 0.45=0 Now we have to calculate estimate of likelihood values as following, Estimate = Likelihood of Shert + Likelihood of Medium + Likelinood of Tall =0+034+07=034 Handcrafted by BackkBencheir Publications Page 64 of 102 \ aa Pere eee teetnteeernrnenterenersanee pepe teen ee Likelihood of Medium = P[tuple | Medium] x P(Medium] = 0.43 x 0.78 = 0.34 Scanned by CamScanner i chaP i {| _5| Learning with Classification www.ToppersSolutions.com wwe will find actual Probabili ty for tuple u a PO 12) xB) pl =~Po si Ng Naive Bayes formula as following calculating actual probabili ty for all class attribute val alues, _ short: p(Short | tuple) = SCple| snore) x p¢short) “V4 on oas =_ P(t(tuple) [ 2 Medium: ey P(Medium | tuple)= (tuple | Medium) xp(Medium) = 0.43 X 0,78 , ———_ _ 0.34 P(tuple) 0.34 =034 51 3. Tall: P(Tall | tuple)= PMupletratt) xpcrauy _ Oxnas_ P(tupley “ta ~ gy comparing actual probability of all three values, we can say that Rupesh height is of Medium. 
(We will choose largest value from these actual probability values) Important May be Tip: some time Name in 7 ' {C jo _ exam they can ask like find probability of tuple using Gender Height Class Female 1.6m Short Male 2.0m | Female 19m Medium | Female 185m tedium 2.8m Tall ' Maie 17m Short Male 18m Medium fe Female _ 16m Short | | Female 1.65m Short E — 7 io data - Tall Male E following | — in our solved sum. Now we have to construct a table as we have and Tall. There are three distinct values in Class Attribute as Short, Medium Male and Female. There are two distinct values in Gender Attribute as There are variable values in Height Attribute there we will make them as range values. As from lowest (1.6m — 1.7m), and so on. height to highest height. As (O 1.6m), Attributes Short Gender Probability Values Medium Tall Short Medium Tall Male Female Publications a Handcrafted by sgackkBench2? -s Page 65 of 102 Scanned by CamScanner www.Topperssolutions.co, con Chap - 5 | Learning with Classification “Height | {O-1.6m) (1.6m nent a, nan rene = 17m) (1.7m a fp - 1.8m) (L8m —~— - 19m) a (1.5m 2.0m) |(2.0m - ©) and Short) value. By seeing given data, We have to find how many tuples contains (Male Same for (Male and Medium) values and (Male for Tall) values. The above step will also appiicable for tuples containing Female value Now we will go through each touple and check their height value and find that in which range they com, and increase the count for it in their range part. Now we will find values for probability section For probability of Gender values — We will count all values for Short, Medium There is count of 4 for Short in Gender and Tall as following, part Short Male 1 Female Total 3 | 4 ~ | For probabilityof Male = count of Male in Short part / count of all short values =1 14 For probability of Female = count of Female in Short part / count of all short values = 2/4 There is count of3 for Medium in Gender part Medium Male 1 Female 2 Total 3 For probability of Male = count of Male in Medium p ar / coun t t of all Medium values =1/ 3 For probability of Female = count of Femal e in Medium pa/ rt count of all Medium value s ==2/3 =” 4 ~ There is count of 4 for Tall in Gender part Tall }Male [2.7 Female seh Tota! 0 2 ee ee Handcrafted by BackkBenchers Publications nt Page 66 of 102 Scanned by CamScanner y oe -5| Learning with Classi ficat; ston a For probabbabi lity ilii ty of ab www.Topperssolut ions.com of Male = count of Male .in Tall part / count of all Tall val ues = 2/2 For probability of Female = count of ¢ emale in Tali part / count of all Tall values = 0/7 ao 4 = 0 zhe game Process Can be applie d for Height part now W e can fill the table of Condititi ion Probabiliility as following l Atti a | Attributes Vale ues ee Short po |_ ————_ Gender Male Female Medium — | Tall Probability | Short [Medium i 1 1 22 1/4 3 > 0 3/4 ps j1/‘~3 | 73!oe j2f | 2 IE 1 ~ j_——— - height (O -1.6m) (1.6m 2 =| 2 : 0 0 2/4 | 0 0 0 2/4 o | io 7.7m 9 | 1.7m) | -|0o 1 0 0 1/0 2 0 0 V3 |° | 1.8m) (8m | 2/3 |° 1.9m) (1.9m | -|o 0 (Zom-«) | 0 0 1 . : ve 2.0m) ¥ Handcrafted by BackkBenchers Publi ‘cations 1 0 | 7 = PaqQe | 67 of 102 Scanned by CamScanner ” Www Chap ~6| Dimensionality Reduction CHAP Sng - 6: DIM “Ans: n Reduction [10M — Mayle, Decig INCIPAL COMPONENT ANALYSIS (PCA) FOR DIMENS ION REDUCT a given 1. Cy CTION ENSIONAL ITYDimeREDU ysis for nsio ; Explaina in detail Principa l Component A nal Ql. ToPPerssoius. &Se¢ ION: data sets. 
The main idea of PCA is to reduce dimensionality from ag ther either heavily or ligh Given data set consists of many variables, correlat ed to each Ot” t am aXxIML Im exte ; tasetupto Reducing is don while retaining ve 2. 3. the vari ation present In da iables which are !an e * The same is done by transforming variables to a new set of varia own as Pring, Compon “. ents n (PC) : iginal . ioti PC are orthogonal, ordered such that retentio n of variation prese nt in origginal compo Ponents decre., 6. So, aS we move down in this way, in order. the first principal component will have maximum - oe variation that was Dreser, orthogonal component . 7. PrincipalCom ponents are Eigen vectors of covariance matrix and hence they are called as Orthog,. 8. The data set on which PCA is to be applie d must be scaled 9. The result of PCA is sensitive to relative scaling. 10. Properties of Pc: a. PC are linear combination b. PC are orthogonal. c. Variation present in PC's decreases as we move from first PC to last PC. of Original variables. IMPLEMENTATION: 1) 1. OA 2. 4. Normalize the data: Data as input to PCA process must be normalised to work PCA properly. This can be done by subtracting the respective means from numbers in resp ective columns. If we have two dimensions X and Y then for all X becomes xana Yb Ecomss yThe result give us a dataset whose means is zero. IN Calculate covariance matrix: 1. Since we have taken 2 dimens ional dataset, the covariance mat rix will be 2. Matrix (covariance)= . ill) Finding 1. Var(X1 ar(X1) Cov(X2, X1) aro | Var(X2) Eigen values and Eigen Vectors: In this step we have to find Eige n values and Eigen vectors for covariance 2. It is possible because it is square matrix. 3. The Awill be Eigen value of matrix A, 4. If it satisfies following condition: det(al - A)=0 5. Then we can find Eigen vector for each Eigen value A by cal IV) Choosing 1. components and formin features Vectors: We order the Eigen values from highest to lowest. 2. So we get components.in ordet ® Handcrafted by SackkBenchers Publicat ion- matrix , cu | ating: (al — A)v =0 : a act! Scanned by CamScanner rei _6 | i Dimensionality Reductio chap data set have n variab } 3 Ire n www.Topperssolut ions.com . les then we will have n Eigen values and Eigen vectors. Ej : : We turns; out that that Eigen vector with highest Eigen values is PC of dataset we have to deci . Now de how much Eigen values for further processing. we choose first p Eigen values and discard others , we do lose out some in formation in this process, But if Eigen values are small, we do not lose much Now we form a feature vector. Jo. Since we are Working on 2D data, we can choose either greater Eigen value or simply take both 1. Feature vector = (Eigl, Eig2) vy) Forming Principal Components: |, Inthis step, we develop Principal Component based on data from previous steps. 2 We take transpose of feature vector and transpose of scaled dataset and multiply it to get Principal Component. New Data = Feature Vector’ X Scaled Dataset? New Data = Matrix of Principal Component Q2. Describe the two methods for reducing dimensionality Ans: [5M | May17] DIMENSIONALITY 1 Dimension with 2. REDUCTION: reduction refers to process of converting lesser dimension a set of data having vast dimensions into date ensuring that it conveys similar information concisely. This techniques are typically usec’ while solving machine learning problems to obtain better features for classificaticn. 3. Dimensions can be reduced by: a. 
Q2. Describe the two methods for reducing dimensionality.

Ans: [5M | May17]

DIMENSIONALITY REDUCTION:
1. Dimensionality reduction refers to the process of converting a set of data having vast dimensions into data with fewer dimensions, ensuring that it conveys similar information concisely.
2. These techniques are typically used while solving machine learning problems to obtain better features for classification.
3. Dimensions can be reduced by:
a. Combining features using a linear or non-linear transformation.
b. Selecting a subset of features.

METHODS FOR REDUCING DIMENSIONALITY:
I) Feature Selection:
1. It deals with finding the k most informative dimensions out of the d dimensions and discarding the remaining (d - k) dimensions.
2. It tries to find a subset of the original variables.
3. There are three strategies of feature selection:
a. Filter.
b. Wrapper.
c. Embedded.

II) Feature Extraction:
1. It deals with finding k new dimensions as combinations of the original d dimensions.
2. It transforms the data in a high dimensional space into data in a lower dimensional space.
3. The transformation may be linear, like PCA, or non-linear, like Laplacian Eigenmaps.

III) Missing Values:
1. Consider a training sample set with a given set of features.
2. Among the available features, a particular feature may have many missing values.
3. A feature with more missing values will contribute less to the classification process.
4. Such features can be eliminated.

IV) Low Variance:
1. Consider that a particular feature has constant values for all training samples.
2. That means the variance of the feature across different samples is comparatively low.
3. This implies that a feature with constant values or low variance has less impact on classification.
4. Such features can be eliminated.

Q3. What is independent component analysis?

Ans: [5M | May18 & Dec18]

INDEPENDENT COMPONENT ANALYSIS:
1. Independent component analysis (ICA) is a statistical and computational technique.
2. It is used for revealing hidden factors that underlie sets of random variables, measurements, or signals.
3. ICA defines a generative model for the observed multivariate data.
4. The given data is typically a large database of samples.
5. In the model, the data variables are assumed to be linear mixtures of some unknown latent variables.
6. The mixing system is also unknown.
7. The latent variables are assumed to be non-Gaussian and mutually independent.
8. These latent variables are called the independent components of the observed data.
9. These independent components, also called sources or factors, can be found by ICA.
10. ICA is superficially related to principal component analysis and factor analysis.
11. ICA is a much more powerful technique, capable of finding the underlying factors or sources when these classic methods fail completely.
12. The data analysed by ICA could originate from many different kinds of application fields.
13. These include digital images, document databases, economic indicators and psychometric measurements.
14. In many cases, the measurements are given as a set of parallel signals or time series.
15. The term blind source separation is used to characterize this problem.

EXAMPLES:
1. Mixtures of simultaneous speech signals that have been picked up by several microphones.
2. Brain waves recorded by multiple sensors.
3. Interfering radio signals arriving at a mobile phone.
4. Parallel time series obtained from some industrial process.

AMBIGUITIES:
1. The variances (energies) of the ICs cannot be determined.
2. The order of the ICs cannot be determined.

APPLICATION DOMAINS OF ICA:
1. Image de-noising.
2. Medical signal processing.
3. Feature extraction, face recognition.
4. Compression.
5. Redundancy reduction.
6. Scientific data mining.
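The original answer stops at the description above. As an optional illustration of the blind source separation setting, the sketch below uses scikit-learn's FastICA (assuming scikit-learn is installed) to unmix two artificially generated, mixed signals; the signals and the mixing matrix are invented for the example.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian sources (e.g. two "speakers")
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2]

# Unknown mixing system: each "microphone" records a mixture of the sources
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T

# ICA recovers the independent components, up to order and scale (the ambiguities above)
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)      # estimated sources
A_est = ica.mixing_               # estimated mixing matrix
print(S_est.shape, A_est.shape)
```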
Q4. Use Principal Component Analysis (PCA) to arrive at the transformed matrix for the given matrix A.

A^T = [ 2   1   0   -1
        4   3   1   0.5 ]

Ans: [10M | May17 & May18]

Formula for finding the covariance values:

Covariance(x, y) = Σ (x - x̄)(y - ȳ) / (n - 1)      ... [valid for both (x, y) and (y, x)]

Here we have an orthogonal transformation. We will consider the upper row as x and the lower row as y. Here n = 4 (the total count of data points; take the count from either the x row or the y row).

Now we find x̄ and ȳ as follows:
x̄ = (sum of all values in x) / (count of all values in x) = (2 + 1 + 0 + (-1)) / 4 = 2/4 = 0.5
ȳ = (sum of all values in y) / (count of all values in y) = (4 + 3 + 1 + 0.5) / 4 = 8.5/4 = 2.125

Now we have to find the values of (x - x̄), (y - ȳ), (x - x̄)(y - ȳ), (x - x̄)² and (y - ȳ)²:

x   | y   | (x - x̄) | (y - ȳ) | (x - x̄)(y - ȳ) | (x - x̄)² | (y - ȳ)²
2   | 4   |  1.5    |  1.875  |  2.8125        |  2.25    |  3.5156
1   | 3   |  0.5    |  0.875  |  0.4375        |  0.25    |  0.7656
0   | 1   | -0.5    | -1.125  |  0.5625        |  0.25    |  1.2656
-1  | 0.5 | -1.5    | -1.625  |  2.4375        |  2.25    |  2.6406

Finding the following sums:
Σ (x - x̄)(y - ȳ) = 2.8125 + 0.4375 + 0.5625 + 2.4375 = 6.25
Σ (x - x̄)² = 2.25 + 0.25 + 0.25 + 2.25 = 5
Σ (y - ȳ)² = 3.5156 + 0.7656 + 1.2656 + 2.6406 = 8.1875

Now we will find the covariance values:
Covariance(x, x) = Σ (x - x̄)² / (n - 1) = 5 / 3 = 1.67
Covariance(y, y) = Σ (y - ȳ)² / (n - 1) = 8.1875 / 3 = 2.73
Covariance(x, y) = Covariance(y, x) = Σ (x - x̄)(y - ȳ) / (n - 1) ≈ 2.09

Therefore, putting these values in matrix form:

Covariance matrix S = [ 1.67   2.09
                        2.09   2.73 ]

As the given data is two-dimensional, there will be two principal components. To find these principal components we use the characteristic equation:

| S - λI | = 0                                        ... (1)

Here S is the covariance matrix and I = [1 0; 0 1] is the identity matrix. Putting these values into the characteristic equation:

| 1.67 - λ    2.09     |
| 2.09        2.73 - λ |  = 0

(1.67 - λ)(2.73 - λ) - (2.09)² = 0
λ² - 4.4λ + 0.191 = 0
λ₁ = 4.3562,   λ₂ = 0.0439

Taking λ₁ and putting it into equation (1) to find the factors a₁₁ and a₁₂ of the first principal component:

[ 1.67 - 4.3562    2.09           ]  [ a₁₁ ]
[ 2.09             2.73 - 4.3562  ]  [ a₁₂ ]  = 0

-2.6862 a₁₁ + 2.09 a₁₂ = 0                          ... (2)
 2.09 a₁₁ - 1.6262 a₁₂ = 0                          ... (3)

Let us find the value of a₁₂ (you can also take a₁₁) by dividing equation (2) by 2.09; we get
a₁₂ = (2.6862 / 2.09) a₁₁
a₁₂ = 1.2853 a₁₁

We can now use the orthogonal transformation relation, which is
a₁₁² + a₁₂² = 1                                      ... (4)

Substituting the value of a₁₂ into equation (4), we get
a₁₁² + (1.2853 a₁₁)² = 1
a₁₁² + 1.5620 a₁₁² = 1
2.5620 a₁₁² = 1
a₁₁² = 0.40
a₁₁ = 0.64

Putting a₁₁ into equation (4):
(0.64)² + a₁₂² = 1
0.4096 + a₁₂² = 1
a₁₂² = 1 - 0.4096 = 0.5904
a₁₂ = 0.7684

Now taking λ₂ = 0.0439 and putting it into equation (1) to find the factors a₂₁ and a₂₂ of the second principal component:

[ 1.67 - 0.0439    2.09           ]  [ a₂₁ ]
[ 2.09             2.73 - 0.0439  ]  [ a₂₂ ]  = 0

1.6261 a₂₁ + 2.09 a₂₂ = 0                           ... (5)
2.09 a₂₁ + 2.6861 a₂₂ = 0                           ... (6)

a₂₂ = 0.7781 a₂₁

Substituting the value of a₂₂ into equation (4), we get
a₂₁² + (0.7781 a₂₁)² = 1
1.6055 a₂₁² = 1
a₂₁² = 0.6229
a₂₁ = 0.7893

Putting a₂₁ into equation (4): (0.7893)²
+ @z27 =1 0.6230 + az27 =1 a,* = ]- 0.6230 a)? = 0.377 a, = 0.6141 the Principal Components are Therefore using values Of @11, 412+ G21) A225 SS am + anxz | 22 0.64x, + 0.7684X2 | | &= auxi + 422X2 | &=0.7893x, + 0.6141X% . Handcrafted by BackkBencher s Publications Page 73 of 102 Scanned by CamScanner { } WA TOPP ETS Olu . — c hap ~ 7\ Learning with Clustering Ns ee CL “ARNING. wiTH. ~ —— * My USTERING a for clustering analysts. CHAP. 7. Lea a algorithm sultable oxample. Also, explain hoy, k. Describe the essential steps of Kemea” il Ql. Mh, Explain K-moeans clustering algorithm “m : clus torin clustering differs from hierarchical Q2. [5 -10M | Doerg i Sty, Ans: - K-MEANS CLUSTERING: algorithms that solve the Well-known ete,Ste rnlng alt joa Jed K-means is one of the simplest unsuperl 1. oroble problem, 2, The procedure to classilya given data 60set through tO Gino follows a siinple and easy W« ly aace certain ump * Er aprior clusters (assume k clusters) fixed 5. cluster. ) es diff fo = ithe fferent location Caus The main idea isi ta defiine k centres, on ’ for cach 5 € c because of cliffe € Way be placed ina cunning 4 These centres should 5. So, the better chaice 6 7, 8 is to place them from each other, has possiblef ar away as po as muc as neararest con nt belonging to a given data set <ancl associate it to the The next step is to take each poi When no point is pending, the first step 1s completed andan early group age! is done. ng g ¢ from». resultiltin rsrs resu steste ids as bar é ycentre of the : cluclu At this point we need to re-calculate k new centro previous step, 9. After we have these k new centroids, a new binding has to be done between the same data set Poin: and 10. the nearest new center. A loap has been generated AS a result of this loop we may notice that the k centres change their location step by step unt. more changes are done or in other words centres do not move any more Finally, this algorithm aims at minimizing an objective function know as squared error functi given by: Cc Cs 2 JV)= i=]= j=)E (x-wp 5, Where 7/x,- v//is the Euclidean distance between x and v, ‘c‘is the number of data points in 7" cluster, ‘c‘is the number of cluster centres. ALGORITHM STEPS: I Let X= (Xi.X2Ksy.-0Xn} be the set of data points and V = (vivo... Ve} be the set of centres 2. Randomly select ‘c'cluster centres. 3. Calculate the distance between each data point and cluster centr es, 4, Assign the data point to the cluster center whose distance from the cluster center is minimu™ of the cluster centres. 5. Recalculate the new cluster center using: wee) Where, F 7=1 x, ‘represents the number of data POiNts in 7 ¢ uster, a Handcrafted by BackkBenchers Publications Si : a al — l page " Scanned by CamScanner F ha o p-7 [ Learning with Clustering www.ToppersSolutions.com ee gecalculate the distance betwean ¢ 6 ; ae : ino data point was rez SSigned the fach data point : Land new obtained cluster centres "Stop, otherwise repeat from step 3. ,pVANTAGES: fast, robust and easier to Understand , aoe pelatively efficient: Olt j s, k is clusters, dis (tknd), where n is object dimension of each object, and t is iterations. 4 | Ges best result when data set are dis tinct or well separated from each other. piSADVANTAGES: 1, Applicable only when mean is de fined i.e. fails for categorical data 7 Unable to handle noisy data and outliers Algorithm fails for non-linear data set. 3, DIFFERENCE BETWEEN K MEANS AND HIERARCHICAL CLUSTERING: i, 2. 
Higrarchical clustering can’t handle big data well but K Means clustering can This is because the time complexity of K Means is linear i.e. O (n) while that of hierarchical clustering is quadratic i.e. O (n?). w In K Means Clustering, since we start with random choice of ciusters. The results produced by running the algorithm multiple times might differ. w While results are reproducible in Hierarchical clustering. ov 4. K Means is found to work well when the shape of the clusters is hyper spherical (like circle in 2D, sphere in 3D). 7. K Means clustering requires prior knowledge of K i.e. no. of clusters you want to divide your data into. a You can stop at whatever number of clusters you find appropriate in hierarchical clustering by interpreting the dendrogram 93. Hierarchicai clustering algorithms. [10M | Dect7j Ans: HIERARCHICAL CLUSTERING ALGORITHMS: Hierarchical clustering, also known as hierarchical cluster analysis. 2 Itis an algorithm that groups similar objects into groups called clusters. 3 The endpoint is a set of clusters. xs uw ~ 1. Each cluster is distinct from each other cluster. The objects within each cluster are broadly similar to each other. This algorithm starts with all the data points assigned to a cluster of their own. cluster. 1. Two nearest clusters are merged into the same &. Inthe end, this algorithm terminates when there is only a single cluster left, 5. Hierarchical clustering does not require us to prespecify the number of clusters tic. 10. Most hierarchical algorithms are determinis is of two types: 1. Hierarchical clusterin g algorithm a b. tering algorithm. Divisive Hierarchical clus tering algorithrn. Agglomerative Hierarchical clus a idcrafted by BackkBenchers Publications Page 75 of 102 Scanned by CamScanner IE E"'"SE"~C ERR NSIT www.TopPersSolug; 6 As. Chap - 7 | Learning with Clustering Sy Both this algorithm are exactly reverse of each other. 12. Mor DIVISIVE HIERARCHICAL CLUSTERING A-CORITH gle 1. Here we assign all of the observations to a 5/7 DIVISIVE ANALYSIS): DIANA (DIVISNS cluster an d then partition the cluster to 4, D te similar clusters. | there is one cluster for each observatig,, the . nti 4 cluster rchies than Agglom Finally, we proceed recursively on each a er hi te . : ‘ . e more accura There is evidence that divisive algorithms produc x. le mp co more algorithms in some circumstances but is conceptually . i 3. 4. ‘Grae Mi Advantages: a. ete hierarchy. More efficient if we do not generate a compl b. Run faster than HAC much algorithms. Disadvantages: 5. a. b. Algorithm will not identify outliers. Restricted to data which has the notion of a centre (centroid). Example: 6. a. K-—Means algorithm. b. K-Medoids algorithm. ALGORITHM OR AGNES (AGGLOMERATIVE NESTING, —_ HIERARCHICAL CLUSTERING In agglomerative or bottom-up clustering method we assign each observation to its own cluster N AGGLOMERATIVE Then, compute the similarity (e.g., distance) between each of the clusters and join the two most simi clusters, Ww Finally, repeat steps 2 and 3 until there is only a single cluster left. Advantages: 5. 6. a. No apriozi information about the number of clusters required. b, Easy toimpiernent and gives best result in some cases. Disadvantages: a. No objective function is directly minimized b. Algorithm c. Sometimes itis difficult to identify the correct number of clusters by the dendrogram. can never undo what was done previously. Types of Agglomerative Hierarchical clustering: a Single-nearest distarice or single linkage. 
b Complete-farthest distance or complete linkage. c. Average-average distance or average linkage. d Centroid distance. e. Ward's method - sum of squared Euclidean distance is minimized. x Handcrafted by BackkBenchers Publications | ee —— page 76 of 10 ; Scanned by CamScanner i cha ~ 7 | Learning with Clusteri ng www.ToppersSolutions.com “a oh - What is the role of radj lal basisi function In se Parating nonlinear patterns. 4 Write short note on - Radial Basis functions ; [1OM | Mayl8 & Decl8] ns gaDlUS BASIS FUNCTION: | p radial basis function (RBF) iis a real-valued function & whose value depends only on the distance from the origin, so that (x) = (|}x}]) z. Alternatively on the distance from some other point c, called a centre 3, Soany function that satisfies the Property is a radial function 4 | 5 The norrn is usually Euclidean distance, although other distance functions are also possible. Sums of radial basis functions are typically used to approximate given functions. 6 TNS APProzimation 7 &BEs are also used as a kernel in support vector classification process can also be interpreted as a simple kind of neural network mamonly used types of radial basis functions: a. Gaussian: é(r) = ewer)" b. Multiquadratic c. Inverse quadratic d. Inverse Multiquadratic alr) = ——— f 2 ylt (ery “fiat basis function network is an artificial neural network that uses radial basis functions as clivation functions. sutput of the network is a linear combination of radial basis functions of the inputs and neuron parameters. ). Radial basis function networks have many uses, including function approximation, time series control. nrediction, classification, and system .dial basis function networks is composed of input, hidden, and output layer. RBNN is strictly limited 3 as feature vector. to have exactly one hidden layer. We call this hidden layer Padial basis function networks increases dimension of feature vector. sages layer @ndcrafted by BackkBerchers Publications Regis dae ten tryer Curent boner Page 77 of 102 Scanned by CamScanner www. TOPPErsSolutig, Chap - 7 | Learning with Clustering is for the i Se FOFEAG Lee INPUT pag, saute an arbitrarctors y Basicl,s C2,---,ch ¢ cor 4s e ; hat 14. The hidden units provide a set of functions that 1S. Hidden units are known as radial c entres and repres ented 16. SP Transformation from input space to hidden unit by the ve ce is nonlinea e t * - r whereas transformatin, : fe hidden unit space to output space is linear ' . eis pl 17, dimension of each centre fora pinput network is P 18. uc es The radial basis functions in the hidden layer P rodduc ace. ' . . - : response only y, a significant non-zero input falls within a small focalized region of the input sP 19. Each Ex i ive fi field in input $ hidden unituni has iits own receptive ; . ; pace. ed 20. An input vector xi which lies in the receptive field for cen of weights the target output is obtained. 21. . would activate ¢j and by Proper cp. C : The output is given as: h uy= >d. Ow, , A= (lle — call) wr Weight of j" centre, y: some radial function. 22. Learning in RBEN Training of RBFN requires optimal selection of the parameters vectors ci andy; ; Leech. 23. Both layers are optimized using different techniques and in different time scales. 24. Following techniques are used to update the weights and centres of a RBFN. mel a, Pseudo-Inverse Technique (Off line) b. Gradient Descent Learning (On line) c. Hybrid Learning (On line) What are the requirements of clustering algorithms? 
Ans: [5M | Mayia} REQUIREMENTS OF CLUSTERING: 1. Sealability: We need highly scalable clustering algorithms te deal with iarge databases & leaning system. 2. Ability to deal with different kinds of attributes: Algorithms should be Capable to be applied onary kind of data sucn as interval-based (numerical) data, categorical, and binary data. 3. Biscovery of clusters with attribute shape: The Clustering algorithm should be capable of detecting clusters of arbitrary shape. They should not be bounded to only distance measures that tend to find spherical cluster of small sizes. 4. High dimensionality: The clustering algorithm should not only be able to handle low dimension! data but also the high dimensional space. 5. 6. Ability to deal with noisy data: Databases contain Noisy, missin 9 Or erroneous data. Some algavit”™ are sensitive to such data and may lead to poor quality cluster S Interpretability: The clustering results 5should be j be interpret able, com Prehensible, and usable. Handcr>fted by BackkBenchers Publication, ons Scanned by CamScanner y- ap e ‘ _-7| Learning with Clustering | www.ToppersSolutions.com Apply K-means algorithm on centres. _ given | - data for k=3. Use C,(2), C2(16) and C;(38) as initial cluster | Data: 2, 4 6, 3, 31,12, 15, 16, 38, 35,14, 21, 23, 95 39 1 ’ ’ ' , Ans: ay16 & Decl6] [10M | May number of clusters k=3 initial cluster centre for C= 2,C,=16 we WI ill check distance between Cy= 38 ; data Points and all cluster centres. We will use Euclidean . Distance formula for finding distance. | pistance [x, a] = /(— xa)? OR + (y—bye pistance [(x, y), (a, b)} = fx =a) As given data is not in pair, we will use first formula of Euclidean Distance. Finding Distance between data points and cluster centres. We will use following notations for calculating Distance: { D, = Distance from cluster C;centre D2 = Distance from cluster C; centre D; = Distance from cluster C;centre ; 2 =0 (2, 2) = {@&— a}? = {2-2}? D,(2,16) = {@ =a) = (@— 16)? =14 D,(2, 38) = f(x — a)? = J (2 — 38)? = 34 Here 0 is smallest distance sc Data point 2 helcngs to () 2)? =2 | D4, 2) = fOr — ay? = (4 | = 12 = — ay? = (416)? D;(4, 16) = (Ge D.(4, 38) = (@ — a)? = V4 — 38)? = 34 : 2 point 4 belongs to C; Here 2 is smallest distance so Data (6—2)*=4 D.(6, 2) = fe — a)? = = /(6— D,(6, 16) = V(x — aj? Ds(6, 38) = fe — O° = l» 16)2 = 10 (6 — 38)? = 32 s to C, ce so Data point 6 belong Here 4 is smallest distan Di3,2)2 Vana = VG=2)? =) b,(3,16) = (@= a = VG 16)" = . fro =35 Gana = \G— 30" Ds(3, 48) 38) = (@-0 co so Data point3 belongs to Cy tan dis st le al sm lis Here ers derafted by BackkBench Publications | Page 79 of 102 s erences ets tr nese e a p Scanned by CamScanner i ie 2 OLUtion 8, in Chap - 7 | Learning with Clustering a. 31: Di(31, 2) = fay? = (1-2)? = 29 D.(31, 16) = {Gaye= (Or 16)" = 1 : int 31 belongs tos Vena? “¥Gh so on ee Data point distance smallest is 0 Here 72: Di(12, 2) = {@ — a)? = D2(12, 16) = (x — ay? = (2-2)? = 10 (12 — 16)? = 4 Ds(2, 38) = {(@@ ~ ay? = f(z — 38)? = 26 P Here 0 is smallest distance so Data point 2 belongs to% 15: DiS, 2) = {@— a)? = f5— 2)? = 3 Da(15, 16) = {@— ay? = (G5 — 16)? = 1 Ds(15, 38) = {@ — ay? = (15 — 38)? = 23 Here 1is smallest distance so Data point 15 belongs to Cz 16: Di(16, 2) = (@— a)? = (6-2)?= 14 D.(16, 16) = f(x — a)? = (16 - 16)? = 0 Ds(16, 38) = /(x — a)? = (16 — 38)? = 22 Here 0 is smallest distance so Data point 4 belongs to C2 38: D(38, 2) = {@ — a)? = (8-2) = 36 1238, 16) = f(x — a)? = (8 — 16)? = 22 D-(38, 38) = {@ — a)? = (8-38)? 
= 0 Here O is smailest distance so Data point 38 belongs to C; 35: 1D,(35, 2) = f@ — a)? = Y(85 — 2)? = 33 D2(35, 16) = fe — a)? = (35 — 16)? = 21 D3(35, 38) = /(@ — a)? = {5 — 38)? = 3 Here 3 is smallest distance so Data point 35 belongs to C; 14: D,(14, 2) = {@—a)*= (04-2)? = 12 D2(14, 16) = fe — a)? = (04-16)?= 2 D(14, 38) = f(@ — a)?= (04-38)?= 24 Here 2 is smallest distance so Data point 14 belongs toc, 21: D,(21, 2) = f— a)? = ¥(21=2)? = i9 D.(21, 16) = f(— a)? = {QT - 12 =5 ¥ Handcrafted by BackkBenchets Publications ee “a pace 80° d Scanned by CamScanner FS ; EP _7| Learning with Clustering wwww.ToppersSolutions.com ps(2, 38) = Yea)? = JT aa = 17 Here 5 is smallest distance so Data point 21 belongs to C; D:(23,2) = J@— a)? = (3B D,(23, 16) = V(e=a)y De = ay = V@3— 16) = 7 0:(23, 38) = (xa)? = /@3— 3a = 15 Here7 is smallest distance so Data point 23 belongs to C; 25: D(25,2) = (=a)? = (@5—ae = 23 D,(25, 16) = f(v—=a)? = /@5—1He=9 = 3a D125, 38) = f(r -a)? = J(@5 = 13 Here 91s smallest distance so Data paint 25 belonas to C; 30: D(3C, 2) = fe —a)? = J@0— 27 = 28 = 16) = 14 D.(30, 16) = Y= a)? = YG (,(30, 38) = /(r—a¥ = (G0 38) = 8 Here & 1s smallest distance so Data point 30 belongs to C; The clusters will be, C= {2,4,6,3}, C> = {12, 15, 16, 14, 21, 23, 25}, C3 = {31, 38, 35, 30} Now we have to recalculate the centre of these clusters as following C= tet = 18 = 2.75 (we can rcund off this value to 4 also) C.= vasaseteeraszie9628 =*-18 Cse —~ = = = 33.5 (we can round of this value to 34 also) Now we will again calculate distance from each data point to all new cluster centres, Ps - 4)? =2 D2.4) = f@—ay = V2 0.2, 18) = (@— a}? = ¥@ — 18)? = 16 Dii2, 84) = YO — a? = V2 — 34)? = 32 to C, Here 2 is smallest distance so Data point 2 belongs \ & D4, 4) = J@-9 =/(4-4)7=0 D.(4,18) = (@—a = V4 18)? = 16 D434) = Va VGH 34} =50 so Date point 4 belongs te Here 0 is smallest distance ( ; * Handcrafted by BackkBenchers Publications a Page 81 of 102 ee ee Scanned by CamScanner | WWW-TOPPErsSolutig,, i eee Chap- 7 | Learning with Clustering G: DG, 4) = JY apt = VO =e = 2 D.{6, 18) = (= aye = SE 18)* 7 2 — 34 = 28 Ds(6, 34)= /@— ay = Ve - Data point 6 belongs to Here 2is smallest distance so 3: D3, 4)= (@—ay = {G-H*=1 D.(3, 18) = ((— a) = (@— 18)? =15 — a)? = /@-347 =31 Ds(3, 34) = {@ belongs to Ci Here 1is smallest distance so Data point 3 3] — 4? = 27 — ay? = (GI D,(31, 4) = (@ — 18)? = 8 D.2(31, 18) = {@— ay? = (1 — a)? = D5(31, 34) = {@ (G1 — 34)? =3 belongs to Cs Here 3 is smallest distance so Data point 31 T2: | Dy(12, 4) = /@ — ap? = (02-4)? = 8 | =6 D.2(12, 18) = /(@ — a)? = (12 — 18)? — 34)? = 22 — a)? = (12 Ds(12, 34) = JG Here 6 is smallest distance so Data point 12 belongs to C2 15: D,(15, 4) = (@— a)? = (05-4)? =71 — 18)? =3 — a)? = (5 D.(15, 18) = /@ — 34)? =19 — a)? = V5 D:(15, 34) = (@ Herve & is smallest distance so Data point 15 heicrigs to C, 16: — 4)? = 12 — a)? = {6 Di(16, 4) = /@ — a)? = D.2(16, 18) = /@ /(16 — 18)? = 2 — 34)? = 18 D:(16, 34) = (@ — a)?= /(16 Here 2 is smallest distance so Data point 4 belongs to C2 38: D.(38, 4) = ea aye= (GB= Aj? = 34 — a)? = D»(38, 18) = /@ /(38 — 18)? = 20 D3(38, 34) = /@— a)? = (38 — 34)? =4 belongs to C, Here 4 is smallest distance so Data point 38 35; — a)? = ¥35— 4)? = 2 D1(35, 4) = V@ = 18)? =17 — a)? = ¥(35 1.35, 18) = f( Publications 2 capenchers Pabheeg ® Handcrafted by Bach«Benchers oo 8? of” nage Scanned by CamScanner SEE SED v pap REE | Learning with Clustering www.ToppersSolutions.com a — ps(35.34) = >a! 
= VOTRE - y Here 1s smallest distance so Data point 35 belongs to C, is (14,4) = {aj = (G4 ~ HF = 10 Da(I4, 18) = Oe = a)? = (Tay? = 4 — ay? = (AT Ds(14, 34) = fx 31y? = 20 Here 4 Is smallest distance so Data point 4 belongs to C, 2k: D(21, 4) = JO =a)? = {Q1— a? = 17 0,(21, 18) = {@— a}? = (= 18)? = 3 Here 3 is smallest distance so Data point 21 belongs to Cz } 23: D(23, 4) = ((—a) = (3-4? = 19 D,(23, 18) = f(x — a)? = (23-18)? = 5 (22, 34) = f(x = a)? = f(23—34)? = 11 i Here 5 is smallest distance so Data point 23 belongs to C? 25: (25, 4) = {@—a)* = {25-47= 21 1D,(25, 18) = J — a)? = J (25 — 18)? =7 (25, 34) = @ -a)2 = (25-34)? =9 tw es Here 7 is srnallest distance so Data point 25 belongs to C2 D,(30, 4) = .[@ =a)? = (G0— 4)?= 26 1D,(30, 18) = f(x — a)? = J(30- 18)? = 12 1:(30, 34) = f — a)? = (G0 - 34)? = 4 s to Cz Here 4 is smallest distance so Data point 30 belong | The updated clusters will be, C) = {2, 4, 6, 3), ne a | Cz = (12,15, 16, 14, 21, 23, 25), Cy = (31, 38, 35, 30} Raia MeL Sg MSN ERO We can see that there is no difference between previous clusters and these updated clusters, so we will Stop the process here. Finalised clusters ~ G= (2, 4, 6, 3}, | C= (12, 15, 16, 14, 21, 23, 25), C= (31, 58, 35, 30} Be “een es — . oe Publications Y Handerafted by BackkBenchers Page 83 of 102 Scanned by CamScanner 1 ' P ’ f i li _ 3 SoOlut; www. Topporss, a _ g ustering i clus Chap - 7] Learniin with ~ Q8. Ons tie erence ttc initial dun, 7 ‘. 22. Use c,(2,4) & C,(6, 3) as oe K°* datamefor on gii ve™ m th h t ri go ns al ? Apply K-mea d(6 3 ), e(4 els 3), b(3,3), ¢(5,5), , 4) 2, a( : ta Da all cluster centres. 7 | | ) d r = ( 4),} C2= (6.3 wats an Initial cluster centre - fo C i= 2, . points ¢ een data We will check distance betw | ey Ans: rs k = 2 Number of cluste " will We use $ Uetidear . | formula for finding distance. Distance [x, a] = (x — a)? OR alt Distance [(x, y), (a, b)] = V¥@- or by Euclidean ond formula of sec use will we pair, in is data As given res. : . a. and cluster centres points Finding Distance between data Distance. We will use following notations for calculating Distance. D, = Distance from cluster C. centre D2 = Distance from cluster C2 centre 2, 4): D2, 4), (2, 4-47 =0 + G5) = (@=a+ = Y@-a? Dill2, 4), (6, 3) = YR OEF = 3)? = 415 ODP = VE= H+ belongs to cluster Ci. Here O is smallest distance so Data point (2, 4) af cluster C; as followingAs Data point belongs to cluster G, we will recalculate the centre Using foliowing formula for finding new centres of cluster = Centre [(x. y), fa, b)] = (=2.2") Here, (x, y) = current data point (a, b) = old centre of cluster Updated Centre of cluster C= xXta ytb\ _ (2+2 444 ) = (=) = (2,4) (—=,4> 2 3, 3): Dil, 3), (2. Y= (Ea + O= dF = (BDF GoD = 142 DA(3, 3), (6, 3)] = V@—a)? + (y—b = J(3—6)? 4 (3-3)? =3 Here 1.42 is smallest distance so Data point (3, 3) belongs to cluster ' uster C C,. 7 As Data point belongs to cluster ©, we will recalculate the centre of cluster er C,C) as followingas followi 344 Updated Centre of cluster C = (= **) = (32 22) (>) = (25,35) 5, 5): Dil{S, 5), (2.5.35) = Viw~ aye + GBF by? = (y ~ DullS, 5), (6, 3 = Ve-al + Oe: (5- 2.5)? 4 (5 ~ 3.5)? =292 ESE 5-6)? 
+ 6 ber : so Data point (5, Here 2.45 iis smallest distance ‘ Handcrafted by BackkBenchers Publicat ions —3)2= e Ongs a 2.45 to cluster Cc sore aaa mma Scanned by CamScanner eh EET EO ot EEE vy {sq ; aga upd www.Topperssolutions.com ———— i cha | Learning with Clustering gata point belongs to cluster C,, we wil r 2 ated Centre of cluster Co= (eu re will recalculate the centre of cluster C, as following- “*) _ (#8 543 7S) = (55, 4) >) Rte 63) Dill, 3).(25,3.5))= J@la ee a | 4) NS = - , 3),3), (55, (5.5, MS VQ= OF ODE _ = (CHDK. FGA THP = 356 V@—-aYa Goby Goa? = 12 V@- as GbE = [= 55) Here 1.12 Is smallest distance so Data point (6, 3) belongs to cluster C2 | as Data point belongs to cluster C,, we will recalculate the centre of cluster C2 as following= e [Xta yrb : = (5:75, 35) s i) - (: b+5.5 C, =. (4) of cluster "updated Centre ' (43): Dil(4, 3), (2.5, 3.5)] = V(x -al+ Dll4, 3), (5:75, 3.5)] = Ja G by = Ja — 2.5)? + (3 —3.5)2 = 1.59 + (= by = (A= 575) + B= 3.5)* = 183 Here 1.59 is smallest distance so Data point (4, 3) belongs to cluster C, As Data point beiongs to cluster ©, we will recalculate the centre of cluster C; as following- Updated Centre of cluster C, = (= “t) 7 (7#22*) = (3.25, 3.25) (6,.6): DAL(6, 6), (3.25, 3.25))= f@-a?+ De[(6, 6), (5.75, 3.S)] = V(x-a)*? + Here 2.52 is smallest JG — 3.25)? + (6 — 3.25)? = 3.89 (y—b= (yb)? (6 — 3.5)? = 2.52 Y(6- 5.75)? + = distance so Data point (6, 6) belongs to cluster C) The final clusters will be, C= {(2, 4), (3,3), (4, 3), (9. G)}, C= {(5, 5), (6, 3)} Apply Agglomerative clustering algorithm on given data and draw dendrogram. Show three ‘ method. clusters with its allocated points. Use single link Adjacency Matrix v2 a | oO b[ v2 |0 [2 e| fF V5 |1 v2o | via |2 [1 vs | v5 lc | vio | ve | ld} viz | V5 | V20 10 | ¥17 | 8 f e d Cc b a vs |o [2 v5 12 0 {3 | vis Vis [2 3 vi3 ]o —— [10M - Mayl6 & Dec17]} ‘ Ans; ‘ Jve this sum. We use following formula for Single link sum, ; | We have to use single link methed to so | Min [dist (a), (b)} one We will choose smallest distance value from two distance vaiues. 4 Be : Handcrafted ky BackkBenchers Publications ».¥ Page 85 of 102 . Scanned by CamScanner | WWW.-TOPPErsSolutio, Son Chap ~ 7 | Learning with Clustering OS TET BO OS Le oe ae ; can see 2 tha tt h e upper bound part of diagona| idens * “Al t We have givers Acjacency matrix in which we ©* atrix 44 art of the Mate. lower bound of diagonal se we can Use any par show below. Wo will use Lower bound ef diagonal as ; e matrix. We can see that1 is smay. Mallen distance value from above distanc it. from e matrixi so We can choos e anyYY one valu clistance value but it appears twice in Now we have to find minimum Taking distance value of (b, @) =1 Now we will draw dendrogram Now we have for it to recalculate distance matrix. remaining We will find distance between clustered paints Le. (b, e) and other points. e Distance between (b, e) and a: | Min[dist (b, e), a] = Min{dist (b, a), (e, a)] = Min{V2, V5] | sh. ccscasentinaw isin as we have to choose smallest value. * Distance between (b, e) and c: Min|dist (b, e), ¢] = Minfdist (b,c), (e, ¢)] = Min[v@, v5] = ¥5 « Distance between (b, e) and d: Minf{dist (b, e), d] = Min{dist (b, d), (e, d)] = Min{1,2] = 7 ¢ Distance between (b, e) and f: Min{dist (b, e), f] = Min{dist (b, f), (2, t)] = Min[v18,13] = V73 Now we have put these values to update distance matrix: — |. 
ibe) c d f [V2 vi0 Vii V2 Now we have to find smallest cistance value from updated _| distance mat rix again Here distance between [(b, e), d] = lis small est distance value j ne ae in distance matrix a ¢ Handcrafted by BackkBenchers Publications Scanned by CamScanner i| ‘ ae? jb earning 9 wit withClustering .com www.ToppersSolutions wwe have to draw de ndrogram for the Se New clustered points . I wow recalculating distance matrix again finding distance between clustered points and other remaining points. |, Distance between [[(b, e), d] and a: | Min (dist [(, e), d], a} = Min {dist [(b, e), a], [4, al} = Min (V2, V17] = v2 _, Distance between [(b, e), d] andc: | Min {dist {(b, e), a], c} = Min {dist {(b, e), ¢),[d, ¢]} = Min [V5, v5] = V3 |. Distance between [[b, e), d} and #: Min {dist [(b, ¢), d], f} = Min {dist [(b, e), f], [d, f]} = Min [v73,3] = 3 ow we have to put these distance values in distance matrix to update it. a Here distance between (b, e,d) }c f [(b, €, d), a] = V2 is smallest distance value in updated distance matrix. Drawing dendrogram tor new clustered points Now recalculating distance matrix. c: Finding distance between [(b, e, d), al and c]} = Min [¥5, V10] = v5 Min (dist [(b, e, a), a}, [c]} = Min {dist [(b, e, d), ], [a, _* | » d), a] and f: Finding distance between [(b, e, fl, {a, f]} = Min [3, V20] =3 Min {dist [(b, e, 2), al, [fl] = Min {dist [(b, e, d), distance matrix to update it. Putting these new distance values in (b,e,d,a) ] ¢ | F (b, e,d, a) | O c f V5 3 O 0 2 Ba f 4 : Handcrafted by BackkBenchers Publications : Page 87 of 102 Scanned by CamScanner x Lunesta Wwww.TOPDETSSolutio, Cg . Chap - 7 | Learning with Clustering . . Here, distance between (¢, f) = 2is smallest ™ distance. [ 4 Recalculating distance matrix, « Finding distance between (c, f) and (b.e.d.a) d, a)]} = Min [v5, 3] = V5 Min { dist (c, f), (b, e, d, a)} = Min (dist [¢, (b, & d, a)], [F, (Be % Putting this value to update distance matrix Drawing dendrogram for new cluster, | [| c Ql0. f o a e For the given set of points identify clusters using complete link and average link using agglomerative clustering. A B | 1 P2 115 15 P3 15 5 P4 13 4 P) rps [aya Cae Ans: [loM - May!@l We have to solve this sum using complete link method and Average link Ink methoh d. oe . We use following formula for complete link sum . 7 Max [dist (a), (b)] ..........-....... We will choose biggest distance . value fr Om two distance j value S. We use following formula for complete link sum Avg [dist (a), (bJ]=* 5 [a+b] é ~ We will choose biggest dista nce value from two distance yalue? ne i tr x i ¥ Handcrafted by BackkBenchers Publications ee a ae Ee — Page — ga of ™ Scanned by CamScanner with Clustering 7! Learning ee, chap : yore given data is notin distance20 / adja AC VAce CENCY tNatriyi form, &So we will convert it to distance / adjacency gattix using Euclidean distance formul a a WhichWhich j1S a8 followinc bye fv , www.ToppersSolutions.com a . a at ' gistance [% ¥) (a, BY] = fo “ = « ayey. (y~b)? vm yop | yow finding distance between p} and P2, Dista “7 . 
\ Stance J, P2] ine | gistancelt 1), (1.5, 1.5) = {aye | jasperabov hovee ste stepp we we wil ¢ OO* OW = (OTTaTae will find distance between other points as = 0.7 well Distance [P1, P3] = 5.66 Distance [P1, P4] = 4.6) Distance [PI], P5) = 4.25 Distance [P1, Pé] = 3.21 Distance [P2, P3] = 4.95 Distance [P2, P4] = 2.92 Distance [P2, PS] = 3.54 Distance [P2, P6] = 2.5 Distance [P3, P4] = 2.24 Distance [P3, P5] = 1.42 Distance [P3, P6] = 2.5 Distance [P4, P5] =1 Distance [P4, P6] =0.5 Distance [P5S, P6] = 1.12 Putting these values in lower bound of diagonal of distance / adjacency matrix, PS P4 P3 P2 Pl p2 O71 a O PS 5.66 | 495 0 | P4 3.61 292 PS. 4.25 3.54 2.24 142 0 7 0 [PG 3.21 2.5 25 0.5 12 ipt”™*é‘“‘i‘C | I 5 . COMPLETE LINK METHOD: ' Here, distance between P6 P4 and PG is 0.5 which is smallest distance in matrix. P4 P6 matrix: scalculating distance “© P1 Distance between (P4. PE) and = Max{dist (P4, P6), Pll = Max{dist (P4, Pi), (P6, PN] = Max(3.61, 3.21] = 3.6) mms Page 89 of 102 Se Mun. 8 we ! lete link methoa nce value as it is Comp a dist t ges big We take ant Renchers ue Publications Scanned by CamScanner SSI Lo ” ‘ www-TOPPersSolutia,,, oy y Chap -— 7 | Learning with Clustering Distance between (P4, P6) and P2 = Max{dist (P4, P6), P2] = Max{dist (P4, P2), (P6, P2)] = Max(2.92, 2.5] = 2.92 Distance between (P4, P6) and P3 = Max{dist /P4, P6), P3] = Max{[dist (P4, P3), (P6, P3)] = Max[2.24, 2.5] =2.5 Distance between (P4, P6) and PS = Max{dist (P4, P6), P5] = Max[dist (P4, PS), (P6, PS)] = Max], 1.12] = 1.12 Up pdating distance matrix: Pl O Here (P1, P2) = 0.71 is smallest distance in matrix PI. Pl P2 Recalculating distance matrix: e Distance between (P1, P2) and P3 = Max[dist (P1, P2), P3] = Max{dist (P1, P3), (P2, P3)] = Max[5.66, 4.95] = 5.66 Distance between (PI, P2) and (P4, P6) = Max{dist (Pi, P2), (P4, Pe)} = Max{ dist (P1, (P4, P6)}, [P2, (P4, P6)}] = Max[3.61, 2.92} = 3.61 \ Handeratted by BackkBenehere Publeatong Handcrafted by BackkBenchers Publications eo page 90° - ad Scanned by CamScanner “nap 7! 
Learning with Clustering ns.com www.ToppersSolutio pistance between (P1, P2) and ps = Max([dist (P1, P2), PS] - Max[dist (P1, PS), (P2, PS) 2 Max[4.25, 3.54] 2425 updating distance matrix using above values (PI, P2 (Pi, P2) p4,P6 ( | i ) 5 es 0 -—__] 5 5.66 | P3 a (P4, P6) P3 3.61 25 5 te 1.42 112 | O Here [(P4, P6), PS] = 1.12 is smallest distance in matrix, Pll Ed PS P1 P2 Recaiculating distance matrix + Distance between [(P4, P6), PS} and (P1, P2) = Max[dist {(P4, P6), PS}, (P1, P2)] = Max[ dist {(P4, P6), (P1, P2)}, (PS, (P1, P2)}] = Max[3.61, 4.25] =425 * P3 Distance between {(P4, P©), PS} and = Max{dist {(P4, P6), PS}, P3] P3)] = Max{ dist {(P4, PS), P3}, {P5, = Max[2.5, 1.42] =2.5 Updating distance matrix (P1, P2) (Pi, P2) (P4, P6, PS) 0 5.66 (P4, P6, PS) Here, [(P4, P6, P5), P3] = RS 4.25 ance in matrix, 2.5 is smallest dist Page 91 of 102 Dublications _* Handcrafted by BackkBenchers Scanned by Cam Scanner Vivi vite tM RR POU tign, Pn, —— Chap - 7| Learning with Clustering 6 pA pS P3 pl P2 Recalculating distance matrix: * Distance between {(P4, P6, PS), P3} and {P1, P2} = Max(dist {(P4, P6, PS), P3}, (PI, P2}] = Max{dist{(P4, P6, PS), (PT, P2)}, (P3, (Pl, P2)}] = Max[4.25, 5.66] = 5.66 Updating distance matrix | (P4, P6, P5, P3) |(Pl, P2) (Pl, P2) 0 (P4, P6, PS, P3) 5.66 a P5 P3 P4 AVEPAGE py P2 LINK METHOD: Original Distance Matrix: Pl PS P4 P3 P2 | PS 0 PI 0 P2 0.71 P3 5.66 4.95 P4 3.61 2.92 7 2.24 0 PS 4.25 3.54 1.42 1 5 25 25 a = P6 3.21 - 0 | | rT ~ =a Lo Here, distance between P4 and P6 is 0.5 which is smallest distance in matrix P4 6 a ee Handcrafted py BackkBenchers Publications paye of I? i Scanned by CamScanner -Tp oes ent ~'ustering ee we _giculating distance matriy pistance between ot Www.ToppersSolutions.com oo ’ (P4, P6) and py - avg[dist (P4, P6), P)] - Avgldist (P4, Pl), (P6, Pl}] 2213.61 + 3.21) 2 BAT ccecctteeecneeeee. WO take average distance : value as it is Average link method Distance between (P4, P68) and P2 = Avg[dist (P4, P6), P2] = Avg[dist (P4, P2), (P6, P2)] z : [2.92 + 2.5] =271 « Distance between (P4, P6) and P3 = Avgldist (P4, P6), P3] = Avg{dist (P4, F3), (P6, P3)] = 5 (2.24 +25) = 2.37 « Distance between (P4, P6) and PS = Avg|[dist (P4, P6), PS] = Avgfdist (P4, PS), (P6, PS)] = +f +112] = 1.06 ‘Jpdating distance matrix: | T Pi P2 P3 IP oO | P2 0.71 0 ps 5.66 4.95 0 (P4, P6) PS 3.41 4.25 27 3.54 237 1.42 (P4, P6) PS 0 1.06 [° }lere (Pl, P2) = 0.71 is smallest distance in matrix Plt P1 P2 Recalculating distance matrix: * Distance between (P1, P2) and P3 = Ava{dist (Pl, P2), P3] = Ava|dist (P1, P3), (P2, P3)] = +[5.66 + 4.95] =53) ~ ications _¥ Handcrafted by BackkBenchers Publ Page 93 of 102 Scanned by CamScanner 4 WWww.TopPersSolut; on Si Chap -7 | Learning with Clustering * Distance between (P1, P2) and (P4, P6) = Avg[dist (P1, P2), (P4, P6)] = Avg[ dist {P1, (P4, P6)}, {P2, (P4, PEI}] = 5 (341+ 2.71] = 3,06 ¢ Distance between (P1, P2) and PS = Avg[dist (P1, P2), P5] = Avgldist (P1, PS), (P2, P5)] =5[4.25 + 3.54] = 3.90 Updating distance matrix using above values, (Pi, P2) (Pl, P2) p3 oO P3 0 (P4, P6) 2.5 PS 1.42 Here [(P4, P6), P5] = 1.12 is smallest distance in matrix, rd Recalculating ¢ [| Pl P2 distance matrix: Distance between {(P4, P6), PS} and (PI, P2) = Avgldist {(P4. 
P6), P5}, (P1, P2)j = Avg[ dist {(P4, P6), (P1, P2)}, {P5, (P1, P2)} ] > [3.06 + 3.90] = 3.48 Distance between {(P4, P6), P5) and P3 = Avg[dist {(P4, P6), PS}, P3] = Avg| dist {(P4, P6), P3}, {P5, P3}] 1 5 [2.5 + 1.42] Mt ¢ 1 96 Updating distance matrix (P), P2) P3 (P4, P6, PS) (P, P2) 0 5.66 3.48 (P4, P6, PS) om, Lap _7| Learning with Clusterin g ere, [(P4, P6, PS), P3] = 1.96 is SMallest dist ance in www.ToppersSolutions.com Matrix, PLT gecalculating distance matrix: Pl P2 Distance between {(P4, P6, PS), P3} and {P1, p2) Avgldist ((P4, P6, PS), P3}, {P1, P2y] = Avgldist{((P4, PG, PS), (PI, P2)}, (P3, (PI, P2)} =1 (3.48 + 5.66] 2457 Updating distance matrix (Pi, P2) (P4, Pe, P5, PS) 7 (Pi, P2) a 457 | (P4, Po, PS, PS) ; BackkBenchers Publications Page 95 of 102 icatl Scanned by CamScanner www. Topparssolutlons 00 A Chap - 8 | Reinforcement Learning ee Om _ CHAP- 8: REINFORCEMENT. BReY RNING arning? What are the elements of reinforcement banenine help of an example. Ql. arious atements involved In form, Q2.. What is Reinforcement Learning? Explain with the Q3. . Explain reinforcement learning in detail along with ne servablele state. state " the concept. : Also define what is meant by partially [5-10M | May16, Decl6, Mayl7, ob 17,D Decl7 & Mayra} Ans: REINFORCEMENT LEARNING: 1. algorithm. e of machine learning typ a is ng rni Lea t Reinforcemen and error. nm ent by trial 4. It enables an agent to learn in an interactive enviro . Agent uses feedback from its own actions and expenences, . . . imnereas l a reward of the agent. e total re The goal is to find a suitable action model that increas 5. Output depends on the state of the current input. 6. The next input depends on the output of the previous input. 7. Types of Reinforcement Learning: 2. 3. } 1 8. a. Positive RL. b. Negative RL. The most common a. application of reinforcement learning are: PC Games: Reinforcement learning is widely being used in PC games like Assassin's Creed, Chess etc. the enemies change their moves and approach based on your performance, b. Robotics: Most of the robots that see you in the are world present running on Reinforcemert Learning. EXAMPLE: 1. We have anagent and a reward, with many hurdles in between. 2. The agent is supposed to find the best possible path to reach the on ‘ . 4 i reward. a < oma) Figure 8.1: Example of Reinforcement Learning. 3. The figure 8.15 hows robot, diamond 4. The goal of the robot is to get the reward 5. Reward is the diamond and avoid the hurdles that is fire. 6. The robot learns by trying all the possible paths 7. After learning robot chooses a path which gives him the reward with the least hurdles 8, 9. and fire. Each right step will give the robot a reward. Each wrong step will subtract the reward of the robot, reward will be calculat it 10. The total Sen ed when it reaches the final reward “Ward wy Handcrafted by BackkBenchers Publications ISS ; d. that is the diamon Cn 102 Scanned by CamScanner a BI Reinforcement Learning Topparssolutions60M LEARNI porcEMENTNT LEARNING ELEMENTs: Z eee four Main Sub elements of 5 ew www.ToppersSolutions.com = reinfor cementni leaar le rninina system: , pore, r ,yeward function Are pyalue function A model of the envi ronment IS ) (optional | >» op a z | > a, is = » ig! er a ig 4 ae z po i i CBstERVATIONS Policy: The policy is th core of a reinforcement learning =p Pobevis ; Policy IS sufficient Suiliciént to to dearer determine 3 ingeneral, 4 Apokcy defines the learning agent's way of behaving at a given time. 
policies may bs, behaviour be stochastic she environment to actions to be taken when in those 3. = a 5=e w& 5 b» 9 {Q oO ay perc each mans function Peward = o 0 Wi of Areward functuo Qa. tb 1 ~ table d bad events are for the agent. The reward function defines wh erable by the agent. nec must necesse funct.o7m n must rdd funct.o T tne policy. Reward functions sere as a basis for altering 3 lil) Value Function: a reward Whereas 1 > Avelue 5 The function value funaction starting from 4 F ds 5 Wathout 5 the to in an immediate sense, in the fong run. Species ' vfat 1ssua of a state is good indicates what amount cf reward an agent can expect to collect over the future, that state are ina sense primary, whereas values. a5 p redicticns of rewards, are secondary. rewards there could be no values, and the only purcose of estimating values is to achieve more reward we are most concerned when making and evaluating decisions. Action choices s. are made based on value judgment element of Reinforcement learning system behaviour of the environment. % action, ¢ ne mode! might predict the resultant next state and next €or example, given a state and reward 4 nning. plad for e Models are us 5. if model know current state @ e and ion, aciic next reward. Gre ase predict the resuultant next state and then it can iit a to hers Publications pZHanderafted by BackkBenc Page 97 of 162 Scanned by CamScanner Chap — 8 | Reinforcement Learning Q4. www. ToPPersSolutions co, $$ Model Based Learning Ans ns: MODEL [om BASED LEARNING: 1. arni Model-based machine learning refers to machine learning M odels. 2. Those models are parameterized with a certain num ber of parameters which do not change ag the size of training data changes. 3. |Deciy : . The goals “to provide a single development framework which supports the creation of a wide range of bespoke models", 4. 5. ‘ } that a set of data {Xl, .., n, YI... MN} you are given subject toa linear model Yi=sign(w'X, + b), Where w € R™ and m is the dimension of each data point regardless of n, For example, if you assume There are 3 steps to model based machine learning, namely: a. Describe the Model: Describe the process that generated the data using factor graphs. b. Condition on Observed Data: Condition the observed variables to their Known quantities. Perform Inference: Perform backward reasoning to update the prior distribution over the laten: variables or parameters. 6. This framework emerged from an important convergence of three key ideas: a. The adoption of a Bayesian viewpoint, b. c. MODEL i. The use of factor graphs (a type ofa probabilistic graphical model), and The application of fast, deterministic, efficient and approximate inference algorithms. BASED LEARNING PROCESS: 2. We start with medel based learning where we completely know the environment Environment pararncters are P(rt +1) and P(st +1] st, at). In such a case, we do not need any exploration, “4. Wecan model parameters directly selve for the option al value function and policy using dvnamic Proyramirnin g The optimal value function is unique and is the solution to simultaneous equations Once we have the optimal value function, the aptima l policy is to choose the action that maximize value in next state as following Tr * (st) = args: max(E [rt +1] st, at] +r y Pist + If st, at] v * [st + 1) BENEFITS: hWN 1. Provides a systematic process of creating ML solutions. Allow for incorporation of prior knowledge, Does not suffers from over fitting. Separates model from inference/traini ng code. ——. 
® Handcrafted by BackkBenchers ———— Publications a _ ee ee page 98 of 102 Scanned by CamScanner | -e.. 9-8 | poinforcoment Learning po ar i a eco www.ToppersSolutions.com ao ans a explain in detail Temporal Difference Learning fb [IOM | Dec6, May17 & Dec18] j MPORAL DIFFERENCE LEARNING: pean approach to learning how to predict a quantity that depends on future values of a given signal. jermporal Difference Learning methods can be used to ebtirnate these value functions. peat hing Happens through the iterative correction of your estimated returns towards a more accurate rarget relay Walgonttinns are often Used to predict a measure of the total arnount of reward expected over the future they can be used to predict other quantities as well. Different 1 TO algorithens: a $30) algonthen b TD) © 190) algorithen algonthen The easiest to understand temporal-difference algorithrn is the TD(O) algorithm. TEMPORAL ) DIFFERENCE LEARNING METHODS; On-Policy Temporal Difference methods: Learns the The value § value of the policy that is used to make decisions. functions are updated using [hese policies are usually "soft" and results frorn executing actions determined by some policy. non-deterministic. ILensures there is always an element of exploration to the policy. The policy is not so strict that it always chooses the action that gives the most 6 On-policy algorithrns cannot separate exploration reward. from control. ll) Off-Policy Temporal Difference methods: 1} it candeann different policies for behaviour anc estimation 2 Again, Whe behaviour policy is usually "soft" so tnere is sufficient exploration going an Off-policy alyorithrns can update the estimated value functions using actions which have not actually been tried. Off-policy algorithrns can separate exploration from control. » Agent may eng up learning tactics that it did not necessarily shows ACTION SELECTION during the | 2arnin h = 9g phase, POLICIES: The aim of these policies is to balance the trade-off between 1) . exploitation and exploration c-Creedy: Most of the time the action with the highest estimated reward is chosen, called the greediest action Every once in a while, say with a small probability, an action is selected at random. The action is selected uniformly, independent of the action-value estimates. This rnethod ensures that if enough trials are done, each action will be tried an infinite number of times, ? \ 5 ensures optimal actions are discovered. -by BackkBenchers Publications Page 99 of 102 } ea Scanned by CamScanner . oo Chap - 8 | Reinforcement Learning www ‘Tepperssolutions ¢ bb " fy 2- Soft: 1 Vary sitilar toe - greedy. Z. The best action: is selected with probability1 -e 5. Bast of the time a random action is chosen uniformly. MW) Sottcnax: Gne trausback of e -yreedy and ¢ -soft is that they select randorn ctians Uniformity a 4 “ny wo bo Sarna ‘ . a ; the ’ secand THE WGESE GOSsiOle action Is just as likely to be selectec! as 2 SNe 4. » est bes raredies this by assigning a rank or weight to each of the actions ctording to their actisr. AStirn ate Krandart action is selected with regards to the weight associated with each action VT fiearis that the worst actions are unlikely to be chosen. & Frisia good approach to take where the worst actions are very unfavourable ADVANTAGES OF TD METHODS: 1 Gorn't need a model of the environment. 2. Grr-line and incremental so can be fast. | % Theydon't | 4 Needless memory and computation Of. need to wait till the end of the episode — Mriat is Q-learning? 
Explain algorithm for learning Q Ans: [10M | Deci8} O-LEAPRNING: 1 Grlearning 4 Q-learning is a values-based learning alyorithrn. is a reinforcernent learning technique used in machine learning. fre goal of O-learning is to learn a policy, 4 Glearung , & tells an agent what action to take under what circumstances Orlearring uses temporal differences to estimate the value of Q'(s, a). In G learning, the agent maintains a table of Q[SA], YVinere, Sis the set of states and A is the set of actions. 7 Q [s, a} represents its current estimate of Q* (s, a). Ae experience (s, a, 1,5') provides one data point for the value of ls, a) & The data point is that the agent received the future value of rt yV(s4j, Vihere Vis}= max,’ Ofs', a’) & ky ii. Ths the actual current reward plus the discounted estimated future value, This nev data point is called a return. The agent can use the temporal difference equation to upd ate its est imate for Qls, a): Ofs, a] « Ofs, a} + afr yrnaxa’ Ofs', a'] - Ofs, al) 12 Ot equivalently, Ofs, a} ¢ Mo) Ofs, a] + alr yrmaxa' Qfs', a’]). Scanned by CamScanner ~ EE nae - 8 | Reinforcement Learning ESSN ee Www. Tapparssolutions.com g iecan be e p proven that given i fol imi sufficient training under any» - soft policy, the algorithin converges with ability] to a close j i neil prob ¥ approximation of the action-yalue function foran arbitrary target policy : ing le learns i ' 4, Q-Learning the optimal policy even when actions are selected according to a mure exploratory or even random policy. ‘0° E: | Q-Table is just a simple lookup table 2 In Q-Table we calculate the maximum 4 Basically, this table will guide us to the best action at each state, In the Q-Table, the columns are the actions and the rows are the states, 5. § expected future rewards for action at each state, Each Q-table score will be the maxirnum expected future reward that the agent will get if it takes that action at that state, 6 This isan iterative process, as we need to improve the Q-Table at each iteration. PARAMETERS USED IN THE Q-VALUEUPDATE PROCESS: 1) «>the learning rate: Set between 0 and. 2 Setting it to 0 means that the Q-values are never updated, hence nothing is learned. 33. Setting a high value such as 0.9 means that learning can occur quickly facto 1 Set 0 and 2 This models the fact that future 3. Mathematically, the discount factor needs to be set less than O for the algorithm to converge. between il) Max, oe ll) y- Discount 1. - the maximum rewards are worth less than revvards. reward: 1 This is attaunable in the state following the current one. 2. Le. the reward for taking the optimal action thereafter PROCEDURAL immediate APPROACH: } yet tialize the Q-values table, Q(s, a). 2? Observe the current state, s. Choose an action, a, for that state based on one of thie action selection policies (r -soft, ¢ -greedy or Softrnax). lake the action, and observe the reward, 1, as well as the new state, s'. 5. 6. Q7. Update the Q-value for the state using the observed reward and the maximum reward possible for the next state. (The updating is done according to the forrnula and pararneters described above.) Set the state to tne new state, and repeat the process until a terrninal state is reached. Explain following terms with respect to Reinforcement learning: delayed rewards, exploration, and partially observable states.(10 mark — d18) Q8. Explain reinforcement learning in detail along with the various elements involved in forming (10 mark - d17) the concept. 
Also define what is meant by partiaily observable state. [IOM | Dect7 & Deci8} ‘Ans : erafted bv BackkBenchers ©i Hand Publications Page 101 of 102 A Scanned by CamScanner eine SIOSSISSSSSE TF 2 ERLE LLL oui netl Chap - 8 | Reinforcement Learning as «| Ystttl TOpperannutians we “ee ty,/ EXPLORATION: 1, Itmeans gathering more information about the problerty 2, Reinforcement learning requires clever exploration mechanisns 3. Randomly selecting actions, without reference to an estimated probabilit y Oistnbations stevie tee. performance. 4. The case of (small) finite Markov decision 5. However, due to the lack of algorithrns that properly scale well with the nurriter of states, sirigty 6 processes is relatively Well uniderstagd exploration methods are the most practical One such method is e-greedy, when the agent chooses the action that it believes has trier nase tye gterm effect with probability 1- «. 7. 8. Ifno action which satisfies this condition is found, the agent chooses an action Unity at Cae dees Here,O <e<1isatuning parameter, which is sometimes ¢ hanged, either according (6 4 fized wt edly or adaptively based on heuristics. DELAYED REWARDS: 1. Inthe general case of the reinforcernent learning problem, the agent's actions detorrnyine sor only as immediate reward, 2. Andit also determine (at least probabilistically) the next state of the environrmnert 3, It may take a long sequence of actions, receiving insignifica nt reinforcernent, Then finally arrive at a state with high reinforcernent. 4. 5. 6. It has to learn from delayed reinforcement. This is called delayed rewards. The agent must be able to learn which of its actions are desirable based on revratd that can tave olace arbitrarily far in the future. Ts It can also be done with eligibility traces, which weignt the previous action a lot 8. The action less and so on before that a littic less, and the action before that even But it takes for of computational time. PARTIALLY OBSERVABLE STATES: '. In certain applications, the agent does not know the state exactly. 2. It is equipped with sensors that return an observation USING Which the agent should estirates ine state. For example, we have a robot which navigates in a roorn 4. The robot may not know oO ON Hw 3. its exact location in the '00m, or what else is in roorn The robot may have a camera with which sensory Observati ons are re corded This does not tell the robot its state exactly but gives Inclic ation as to its likely state. For example, the rob ot may only know that there is a wall to its night. The setting is like a Markov decision process, except that after taking an action at The new state s+] is not known but we have an observation o, ‘Iwhich is stochastic function of s: a4 p (o:+1 | st, at). 10. This is called as partially observable state. . @ Handcratted by BackkBenchers Publications onnee Page 102 0 a Scanned by CamScanner || | “Education is Free.,.. But its Technology used & Efforts utilized which we charge” it takes lot of efforts for searching out each & every question and transforming it into Short & Simple Language. Entire Topper’s Solutions Team is working out for betterment of students, do help us. “Say No to Photocopy....” With Regards, Topper’s Solutions Team. Scanned by CamScanner