MATH 3220 Assignment 6 - report (18 nov 10)

MATH 3220
Assignment #6 - Self-Organizing Map Exercise [Messy Data version]
This exercise will allow you to experiment with a multivariate clustering algorithm. The algorithm
that we’ll use is the Self-Organizing Map (SOM) network. The SOM is an artificial neural network
algorithm that maps multivariate data into a two-dimensional grid. The resulting map has the
property that there is a strong correlation between proximity of the map nodes and similarity of
the vectors associated with these nodes. For this assignment you will use the Diabetes.xls
dataset as your training data [use the dataset posted on the course website – there are other
versions on the web]. You must design your own experiment to answer the following questions:
1. How effective is the SOM algorithm in clustering data?
2. Is this effectiveness sensitive to differences in scale among the data vector attributes?
3. How tolerant is the SOM algorithm to noise?
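As a starting point for designing such an experiment, the following is a minimal sketch of a SOM in plain Python, not a prescribed implementation. The function names (train_som, best_matching_unit, cluster_purity, add_gaussian_noise), the 3 x 3 grid, the linear decay schedules and the Gaussian neighbourhood are all assumptions chosen for illustration. The toy vectors are made up and are not taken from the Diabetes dataset.

```python
import math
import random
from collections import Counter, defaultdict

def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)))

def train_som(data, rows=3, cols=3, epochs=30, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    sigma0 = max(rows, cols) / 2.0
    # Initialise every node from a randomly chosen training vector (an assumption).
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                # Pull the node toward x, more strongly near the BMU on the grid.
                for i, xi in enumerate(x):
                    w[i] += lr * h * (xi - w[i])
            step += 1
    return weights

def cluster_purity(weights, data, labels):
    """Fraction of samples that share the majority label of their map node."""
    by_node = defaultdict(Counter)
    for x, label in zip(data, labels):
        by_node[best_matching_unit(weights, x)][label] += 1
    return sum(max(c.values()) for c in by_node.values()) / len(data)

def add_gaussian_noise(data, sigma, seed=0):
    """Return a copy of data with zero-mean Gaussian noise added to each value."""
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in data]

# Toy data only (hypothetical values): two well-separated groups with known labels.
toy = ([[0.1 * i, 0.1 * i] for i in range(10)]
       + [[10 + 0.1 * i, 10 + 0.1 * i] for i in range(10)])
toy_labels = [0] * 10 + [1] * 10
som = train_som(toy)
print("purity on clean data:", cluster_purity(som, toy, toy_labels))
noisy = add_gaussian_noise(toy, sigma=1.0)
print("purity on noisy data:", cluster_purity(train_som(noisy), noisy, toy_labels))
```

Comparing purity across raw versus rescaled attributes, and across increasing noise levels, is one possible way to address the second and third questions. Purity is only one candidate measure of effectiveness, and other measures may suit your experiment better.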
The Diabetes dataset complicates this assignment. It contains the values of eight attributes for
individuals classified as either healthy or sick, based on a medical study conducted in 1994.
Several of the attributes have missing or incorrect data, so you will have to preprocess the data
before training the SOM. You can consult the medical research literature to determine which
attributes are relevant and which are not. This research can help guide your data preprocessing
methodology.
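One possible preprocessing pipeline is sketched below. It assumes that missing entries have already been marked as None, that median imputation is acceptable, and that z-score standardization is the chosen rescaling. None of these choices is required by the assignment, and the function names and sample rows are hypothetical.

```python
import statistics

def impute_median(rows, is_missing=lambda v: v is None):
    """Replace missing entries with the median of the observed values in that column."""
    columns = list(zip(*rows))
    medians = [statistics.median([v for v in col if not is_missing(v)])
               for col in columns]
    return [[medians[j] if is_missing(v) else v for j, v in enumerate(row)]
            for row in rows]

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 instead to avoid an error.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]

# Hypothetical rows with two attributes on very different scales.
raw = [[1.0, 100.0], [None, 200.0], [3.0, None]]
clean = impute_median(raw)
scaled = standardize(clean)
print(clean)
print(scaled)
```

Whether to impute, drop incomplete records, or drop an entire attribute is a judgment call that your report should justify. Training once on `clean` and once on `scaled` gives a direct comparison for the question about attribute scale.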
Write a COMPLETE Analysis Report describing your analysis and recommendations.
Your research report is due in two weeks.
References:
Self-Organizing Map: http://en.wikipedia.org/wiki/Self_organizing_map
Diabetes: http://en.wikipedia.org/wiki/Diabetes
Diabetes dataset: http://archive.ics.uci.edu/ml/datasets/Diabetes