M10 Homework Assignment
Data Mining—k-Nearest Neighbor (kNN)
Section One
Problem 1.
Consider any organization you are familiar with, other than your workplace (e.g., your
church, favorite restaurant, school). Answer the following questions using the following
format. Instructions on what to write are in brackets:
Definition of Organization: [Define the organization in one incomplete sentence.]
Definition of Independent Variable Data Stream: [Define an independent variable data
stream from this organization in just a few words.]
Field Name 1: [Provide field name.]
Field Name 2: [Provide field name.]
Definition of Dependent Variable: [Define the dependent variable you are trying to evaluate from this
organization in one sentence.]
Potential Benefits: [Discuss how a kNN application would be beneficial in a few sentences.]
Potential Pitfalls: [Discuss the pitfalls of this potential implementation in a few sentences.]
Problem 2.
Research the literature and find any application of kNN that is of interest to you. Provide one brief
paragraph that describes the problem and the benefits of the application.
Problem 3.
Provide a one-paragraph discussion of a problem you are interested in and why you
feel kNN would be applicable to it.
Please use the text Business Intelligence, Analytics, and Data Science: A Managerial
Perspective, found in your custom textbook, and read the following:
Chapter 4: Predictive Analytics I: Data Mining Process, Methods, and Algorithms, Section
4.5 (Data Mining Methods)
Table 4.1, p. 217 (review carefully)
Section Two
M10 Lab
Description
The Excel file contains data describing the occurrence of high blood pressure. We have
coded the presence of high blood pressure as YES. While it would probably be better to describe
blood pressure with a continuous variable, for the sake of this lab we will treat it
as a nominal variable.
We will accept the doctor’s (MD’s) definition of high blood pressure as described in the
physician's "after visit summary" in the patient's notes. As you can imagine, it would be easy
to search through patient files and quantify patient blood pressures as high versus
acceptable. The question to answer in this lab is whether we can use a machine learning
tool (i.e., kNN) to determine the relationship between the independent variables and the
dependent variable (high blood pressure = yes or no).
In this lab, we have defined two independent variables:
• Minutes of exercise per week (domain 0–150 minutes per week)
• Quantity of vegetables consumed per day (domain 0–150 calories per day)
The following heat map plots, on a Cartesian coordinate system, minutes of
exercise per week against quantity of vegetables consumed per day, colored by the occurrence or
non-occurrence of high blood pressure.
This heat map is a randomized result from our collected data. The x-axis represents minutes
of exercise per week and the y-axis represents quantity of vegetables consumed per day.
Red dots indicate no high blood pressure (NO). Green dots represent high blood pressure
(YES).
Instructions
Using Excel and VBA, construct a kNN algorithm for a sample set of the data. For simplicity,
please use just the first 500 data points. Execute your module for various k values.
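The lab asks for a VBA module; the following Python sketch only illustrates the logic such a module might follow. It assumes unweighted Euclidean distance and a simple majority vote, neither of which the lab prescribes, and the function name `knn_predict` and the sample points are hypothetical (they are not the lab data).

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of ((exercise_minutes, vegetable_calories), label) pairs
    query: (exercise_minutes, vegetable_calories)
    """
    # Sort training points by unweighted Euclidean distance to the query.
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    # Tally the labels of the k closest points.
    votes = Counter(label for _, label in by_distance[:k])
    # Return the label with the most votes; on a tie, the label that
    # appeared first among the nearest points wins.
    return votes.most_common(1)[0][0]

# Made-up points for illustration only (not taken from the lab's Excel file).
train = [
    ((10, 20), "YES"), ((15, 30), "YES"), ((20, 10), "YES"),
    ((120, 130), "NO"), ((130, 110), "NO"), ((140, 140), "NO"),
]
print(knn_predict(train, (12, 25), k=3))    # YES
print(knn_predict(train, (125, 120), k=3))  # NO
```

The same three steps (compute distances, take the k smallest, count labels) carry over to a VBA loop over worksheet rows.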
Please provide the following in an MS Word document:
Part One
• Insert a screenshot of results from k=5.
• Apply kNN with k=5 to the remaining 10,000 data points.
• Compute the accuracy of your model. Use the accuracy formula as described on p.
217 in Table 4.1.
Part Two
• Insert a screenshot of results from k=10.
• Apply kNN with k=10 to the remaining 10,000 data points.
• Compute the accuracy of your model. Use the accuracy formula as described on p.
217 in Table 4.1.
Part Three
• Insert a screenshot of results from k=20.
• Apply kNN with k=20 to the remaining 10,000 data points.
• Compute the accuracy of your model. Use the accuracy formula as described on p.
217 in Table 4.1.
Part Four
• Insert a screenshot of results from k=50.
• Apply kNN with k=50 to the remaining 10,000 data points.
• Compute the accuracy of your model. Use the accuracy formula as described on p.
217 in Table 4.1.
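The accuracy calculation requested in Parts One through Four can be sketched as follows. This assumes Table 4.1 uses the standard definition, accuracy = (TP + TN) / (TP + TN + FP + FN); please confirm against the textbook. The function name `accuracy` and the sample labels are hypothetical.

```python
def accuracy(actual, predicted, positive="YES"):
    """Return (TP + TN) / (TP + TN + FP + FN) for two equal-length label lists."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted high blood pressure, and it was
        elif p != positive and a != positive:
            tn += 1  # predicted no high blood pressure, and it was not
        elif p == positive:
            fp += 1  # predicted high blood pressure, but it was not
        else:
            fn += 1  # missed a case of high blood pressure
    return (tp + tn) / (tp + tn + fp + fn)

# Made-up labels for illustration only.
actual    = ["YES", "YES", "NO", "NO", "NO"]
predicted = ["YES", "NO",  "NO", "NO", "YES"]
print(accuracy(actual, predicted))  # 3 correct out of 5
```

Computing this once per k value gives a direct basis for choosing the "best k" used later in the lab.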
Part Five
• Create a grid with a cell size (sensitivity) of 10 and find the centroid of each grid cell. As an example,
the first grid cell would cover (0–10 minutes) and (0–10 calories of vegetables), so its centroid
would be the ordered pair (5, 5). Apply your best kNN algorithm (i.e., best k, as
determined by accuracy in the four preceding parts) to these centroids. You can simply add the
centroids to your table as pseudo data. Provide the results of your grid.
• Using Tableau, compare the results of your grid to the original data provided.
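The centroid grid from Part Five can be generated along these lines. This is a sketch assuming both variables span 0–150 and each cell is 10 units wide, as stated in the lab; the function name `grid_centroids` and its parameters are hypothetical.

```python
def grid_centroids(lo=0, hi=150, step=10):
    """Return the (x, y) centroid of every step-by-step cell covering [lo, hi]."""
    # Cell centers along one axis: 5, 15, ..., 145 for the default arguments.
    centers = [lo + step / 2 + i * step for i in range((hi - lo) // step)]
    # Pair every x center with every y center.
    return [(x, y) for x in centers for y in centers]

centroids = grid_centroids()
print(len(centroids))   # 15 x 15 cells
print(centroids[0])     # centroid of the (0-10, 0-10) cell
```

Each centroid can then be classified by the kNN routine with the best k and appended to the table as a pseudo data point.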
Part Six
• Provide a proposed weighting of the distance formula you are applying. You do
not need to execute your proposed weighting.
• In just a few sentences, provide the reasoning for your proposed weighting.
Please be specific and critical as to your reasoning.
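For reference, one common general form of a weighted Euclidean distance between points $p$ and $q$ is shown below. It is an illustration of what a "weighting of the distance formula" can look like, not the expected answer; the symbols $w_1$, $w_2$ (non-negative weights), $x$ (minutes of exercise per week), and $y$ (calories of vegetables per day) are notation assumed here.

```latex
d_w(p, q) = \sqrt{\, w_1 \,(x_p - x_q)^2 + w_2 \,(y_p - y_q)^2 \,}
```

Setting $w_1 = w_2 = 1$ recovers the ordinary unweighted Euclidean distance; choosing $w_1 > w_2$ makes differences in exercise count more heavily than differences in vegetable intake, and vice versa.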
Part Seven
• Provide a screenshot from Tableau comparing the centroid grid you created in
Part Five to the actual data provided.
• In two to three sentences, comment on the quality of your grid as compared in
Tableau.
Open the Excel file provided in this lab. In the Excel sheet named Class Data, place the value of k to analyze in cell B10.
Your instructor will change k to review the model you developed.