M10 Homework Assignment
Data Mining: k-Nearest Neighbor (kNN)

Section One

Problem 1. Consider any organization you are familiar with, other than your workplace (e.g., your church, favorite restaurant, or school). Answer the following questions using the format below. Instructions on what to write are in brackets:

Definition of Organization: [Define the organization in one incomplete sentence.]
Definition of Independent Variable Data Stream: [Define an independent variable data stream from this organization in just a few words.]
Field Name 1: [Provide field name.]
Field Name 2: [Provide field name.]
Definition of Dependent: [Define the dependent variable you are trying to evaluate from this organization in one sentence.]
Potential Benefits: [Discuss how a kNN application would be beneficial in a few sentences.]
Potential Pitfalls: [Discuss the pitfalls of this potential implementation in a few sentences.]

Problem 2. Research the literature and find any application of kNN that is of interest to you. Provide one brief paragraph that describes the problem and the benefits of the application.

Problem 3. Provide a one-paragraph discussion of a problem you are interested in and why you feel kNN would be applicable to it.

Please use the text Business Intelligence, Analytics, and Data Science: A Managerial Perspective, found in your custom textbook. Read Chapter 4: Predictive Analytics I: Data Mining Process, Methods, and Algorithms, Section 4.5 (Data Mining Methods), and carefully review Table 4.1, p. 217.

Section Two

M10 Lab Description

The Excel file contains data describing the occurrence of high blood pressure. We have recorded high blood pressure as YES.
While it would probably be better to describe high blood pressure with a continuous variable, for the sake of this lab we will describe it as a nominal variable. We will accept the doctor’s (MD’s) definition of high blood pressure as described in the physician’s "after visit summary" in the patient’s notes. As you can imagine, it would be easy to search through patient files and quantify patient blood pressures as high versus acceptable.

The question to answer in this lab is whether we can use a machine learning tool (i.e., kNN) to determine the relationship between the independent variables and the dependent variable (high blood pressure = yes or no).

In this lab, we have defined two independent variables:
• Minutes of exercise per week (domain 0–150 minutes per week)
• Quantity of vegetables consumed per day (domain 0–150 calories per day)

The following heat map plots, on a Cartesian coordinate system, minutes of exercise per week and quantity of vegetables consumed per day against the occurrence or nonoccurrence of high blood pressure. This heat map is a randomized result from our collected data. The x-axis represents minutes of exercise per week and the y-axis represents quantity of vegetables consumed per day. Red dots indicate no high blood pressure (NO). Green dots represent high blood pressure (YES).

Instructions

Using Excel and VBA, construct a kNN algorithm for a sample set of the data. For simplicity, please use just the first 500 data points. Execute your module for various k values. Please provide the following in an MS Word document:

Part One
• Insert a screenshot of results from k=5.
• Apply kNN with k=5 to the remaining 10,000 data points.
• Compute the accuracy of your model. Use the accuracy formula as described on p. 217 in Table 4.1.

Part Two
• Insert a screenshot of results from k=10.
• Apply kNN with k=10 to the remaining 10,000 data points.
• Compute the accuracy of your model.
Use the accuracy formula as described on p. 217 in Table 4.1.

Part Three
• Insert a screenshot of results from k=20.
• Apply kNN with k=20 to the remaining 10,000 data points.
• Compute the accuracy of your model. Use the accuracy formula as described on p. 217 in Table 4.1.

Part Four
• Insert a screenshot of results from k=50.
• Apply kNN with k=50 to the remaining 10,000 data points.
• Compute the accuracy of your model. Use the accuracy formula as described on p. 217 in Table 4.1.

Part Five
• Create a grid with a sensitivity of 10. Find the centroid of each grid cell. As an example, the first grid cell would be (0–10 minutes) and (0–10 calories of vegetables). The centroid position would be the ordered pair (5,5). Apply your best kNN algorithm (i.e., best k), as found in the four preceding parts and defined by accuracy. You can simply add these centroids to your table as pseudo data. Provide the results of your grid.
• Using Tableau, compare the results of your grid to the original data provided.

Part Six
• Provide a proposed weighting of the distance formula you are applying. You do not need to execute your proposed weighting.
• In just a few sentences, provide the reasoning for your proposed weighting. Please be specific and critical as to your reasoning.

Part Seven
• Provide a screenshot from Tableau comparing the centroid grid you created in Part Five to the actual data provided.
• In two to three sentences, comment on the quality of your grid as compared in Tableau.

Open the Excel file provided in this lab. In the Excel sheet named Class Data, place the k to analyze in Cell B10. Your instructor will alter k to review your developed model.
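
Illustrative sketch (not part of the required deliverable). The steps in Parts One through Five can be outlined in a few lines of Python to show the logic before building it in VBA. This is a minimal sketch under stated assumptions, not the expected solution: it assumes unweighted Euclidean distance, a simple majority vote, and accuracy computed as correct predictions divided by total predictions (the standard definition, which is assumed here to match Table 4.1). The function names and the toy data points are made up for illustration and do not come from the lab's Excel file.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training rows.

    Each training row is (exercise_minutes, vegetable_calories, label).
    Ties go to the label that appears first among the nearest neighbors.
    """
    # Sort training rows by unweighted Euclidean distance to the query point.
    nearest = sorted(train, key=lambda row: math.dist(row[:2], query))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test rows whose predicted label matches the recorded label."""
    correct = sum(knn_predict(train, (x, y), k) == label for x, y, label in test)
    return correct / len(test)


def grid_centroids(low=0, high=150, step=10):
    """Centers of step-by-step grid cells covering [low, high] on both axes."""
    centers = [low + step / 2 + i * step for i in range((high - low) // step)]
    return [(x, y) for x in centers for y in centers]


# Hypothetical toy rows, invented purely to exercise the functions.
train = [
    (10, 20, "YES"), (20, 10, "YES"), (15, 15, "YES"),
    (120, 130, "NO"), (130, 120, "NO"), (125, 125, "NO"),
]
test = [(5, 5, "YES"), (140, 140, "NO"), (30, 30, "YES"), (12, 12, "NO")]

print(accuracy(train, test, k=3))  # 0.75 on this toy data (3 of 4 correct)

cells = grid_centroids()
print(len(cells), cells[0])  # 225 centroids, starting at (5.0, 5.0)
grid_labels = [(x, y, knn_predict(train, (x, y), 3)) for x, y in cells]
```

In the lab itself, the training rows would be the first 500 data points, the test rows the remaining data points, and k would be read from Cell B10. The grid_labels list corresponds to the pseudo data described in Part Five.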