Cluster Analysis - Institute of Information Sciences and Technology

161.323 2004
Institute of Information Sciences and Technology
Massey University
R: Distances between cases, and cluster analysis
Library
The functions that you will need for finding distances and performing cluster analysis are contained in a
package called “stats”. (In versions of R before 1.9.0 they were in a separate package called “mva”.) The “stats”
package is normally loaded when R starts. To check, type
> search()
If “package:stats” is not listed, type
> library("stats")
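The same check can be written as a short script. This is only a sketch, and it assumes a current R release in which dist() and hclust() are part of the stats package; the variable names are illustrative.

```r
# Assumption: a current R release, where dist() and hclust() live in "stats".
pkg <- "package:stats"
if (!(pkg %in% search())) {
  library("stats")   # attach only if it is not already on the search path
}

# Both functions should now be visible from the command line.
funs.found <- c(dist = exists("dist"), hclust = exists("hclust"))
print(funs.found)
```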
Example data set
We will use the Cars data set to illustrate.
> cars <- read.table("Cars.text", header=TRUE)
To print information about the first 4 cars, type
> cars[1:4,]
The names of the cars are saved in the first variable (Brand, a character variable). Later commands are easier to
use if these names are stored as the row names of the table. We could do this with
> row.names(cars) <- cars$Brand
but it is easier to modify the original command that read the data, telling it that the first column contains the car
names.
> cars <- read.table("Cars.text", header=TRUE, row.names=1)
We will continue as though the data were read in this way (so there is no Brand variable).
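To see that the two approaches give the same row names, here is a small sketch. Since Cars.text is not reproduced in this handout, it reads a made-up three-line table through the text argument of read.table(); the brands and numbers are invented for illustration.

```r
# Hypothetical three-row stand-in for Cars.text (all values are invented).
txt <- "Brand Price Wt
CarA 10 1200
CarB 12 1350
CarC 20 1600"

# Approach 1: read everything, then copy Brand into the row names.
a <- read.table(text = txt, header = TRUE)
row.names(a) <- a$Brand

# Approach 2: let read.table() take the row names from column 1.
b <- read.table(text = txt, header = TRUE, row.names = 1)

print(row.names(b))   # the car names
print(names(a))       # Brand is still a variable in approach 1
print(names(b))       # Brand is no longer a variable in approach 2
```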
Distances
Although the most common way to save and read data sets in R is with data frames, many of the statistical
functions do not work on data frames. Instead, you must first convert a data frame into a matrix before passing it to
a statistical function. The function “as.matrix()” does the conversion.
> cars.m <- as.matrix(cars[3:8])
Note that “cars[3:8]” is a data frame containing variables 3 to 8 of the original data frame (from Reliability to
Cylinders). A matrix must be made from only numeric variables so we cannot include the variable Country.
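The following sketch shows what goes wrong if a non-numeric variable is left in. The data frame is made up for illustration (it is not the Cars data), but the behaviour of as.matrix() is the same.

```r
# Hypothetical mini data frame: one character column, two numeric columns.
d <- data.frame(Country = c("Japan", "USA", "Germany"),
                Wt      = c(1200, 1350, 1600),
                Cyl     = c(4, 6, 4),
                row.names = c("CarA", "CarB", "CarC"))

m.all <- as.matrix(d)        # includes Country: every entry becomes a string
m.num <- as.matrix(d[2:3])   # numeric columns only, in the style of cars[3:8]

print(mode(m.all))           # "character", so no use for distances
print(mode(m.num))           # "numeric"
print(rownames(m.num))       # the row names carry over to the matrix
```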
Before finding distances, we should standardise the variables. (Otherwise the distance measure will be
dominated by the variables with biggest standard deviation – Wt in this example.)
> cars.std.m <- scale(cars.m, center=TRUE, scale=TRUE)
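As a sketch of what scale() does, the example below standardises a small made-up matrix and checks the result against the calculation done by hand: subtract each column's mean, then divide by its standard deviation. The numbers are invented; only the column names echo the Cars data.

```r
# Hypothetical matrix: Wt has a much larger spread than Cyl.
m <- cbind(Wt = c(1200, 1350, 1600, 1450), Cyl = c(4, 6, 4, 8))

m.std <- scale(m, center = TRUE, scale = TRUE)

# The same calculation by hand for each column: (x - mean) / sd.
by.hand <- apply(m, 2, function(x) (x - mean(x)) / sd(x))

print(apply(m, 2, sd))               # very different spreads before scaling
print(round(colMeans(m.std), 10))    # column means are 0 after scaling
print(apply(m.std, 2, sd))           # column standard deviations are 1
print(all.equal(as.vector(m.std), as.vector(by.hand)))
```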
We can now find distances with the command
> cars.dist <- dist(cars.std.m, method="euclidean")
This distance matrix is big so don’t try printing it!
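Options for the method parameter include
euclidean   ordinary Euclidean distance
manhattan   city-block distance
binary      a distance for 0/1 variables only: the proportion of variables on which exactly one of the
            two cases is non-zero, out of those on which at least one is non-zero. Compare the simple
            matching index given in section 5.5 of Manly, which also counts 0/0 matches as agreement.
canberra    not mentioned in section 5.4 of Manly, but can be used for proportions.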
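To make the first two options concrete, here is a minimal sketch on three made-up points whose distances are easy to check by hand.

```r
# Three hypothetical cases measured on two variables.
x <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

d.euc <- dist(x, method = "euclidean")   # a-b: sqrt(3^2 + 4^2) = 5
d.man <- dist(x, method = "manhattan")   # a-b: |3| + |4| = 7

print(d.euc)
print(d.man)
```

Both results hold one distance per pair of cases, in the order a-b, a-c, b-c.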
A distance matrix is a different type of R object from an ordinary matrix. If you have created a square matrix of
distances by other means, you can convert it into a distance matrix with the command
> my.dist <- as.dist(my.square.matrix)
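A short sketch of the conversion, using a made-up symmetric matrix of distances between three cases labelled A, B and C:

```r
# Hypothetical symmetric matrix of distances between three cases.
my.square.matrix <- matrix(c(0, 2, 5,
                             2, 0, 4,
                             5, 4, 0),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

my.dist <- as.dist(my.square.matrix)

print(class(my.square.matrix))    # an ordinary matrix
print(class(my.dist))             # a "dist" object
print(my.dist)                    # only the lower triangle is kept and shown
print(attr(my.dist, "Labels"))    # the case names are retained
```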
Cluster analysis
Cluster analysis is based on a distance matrix, such as that produced by the dist() command. Performing a
hierarchical cluster analysis has two stages. Firstly the command hclust() is used to perform the analysis.
> cluster.results <- hclust(cars.dist, method="single")
The ‘method’ parameter describes how the distance between two clusters is defined from the individual
distances. Possible values include
single      distance between nearest neighbours
complete    distance between furthest neighbours
average     average of the distances between pairs in the two clusters
centroid    distance between the centroids of the two groups
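The difference between the first three methods can be seen in a sketch with three made-up cases on a line, where the merge heights are easy to work out by hand. Centroid linkage is left out to keep the arithmetic short.

```r
# Three hypothetical cases on a line at positions 0, 1 and 3.
pts <- matrix(c(0, 1, 3), ncol = 1, dimnames = list(c("A", "B", "C"), "x"))
d <- dist(pts)   # A-B = 1, A-C = 3, B-C = 2

# A and B always merge first (height 1); the methods differ in how far
# the cluster {A, B} is judged to be from C.
h.single   <- hclust(d, method = "single")$height     # min(3, 2)  = 2
h.complete <- hclust(d, method = "complete")$height   # max(3, 2)  = 3
h.average  <- hclust(d, method = "average")$height    # mean(3, 2) = 2.5

print(rbind(single = h.single, complete = h.complete, average = h.average))
```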
Secondly, the results of a cluster analysis are usually displayed in a dendrogram. This is produced with the
command plot(). (Older versions of R used plclust() for this, with the same hang and labels parameters; it is no
longer available in current versions of R.)
> plot(cluster.results)
Another display option (which I prefer) is to use the parameter “hang=-1”. This extends all dendrogram
branches down to 0.
> plot(cluster.results, hang=-1)
If you have produced your distance matrix from a data set with its row.names set, the dendrogram will be
labelled with these row names. If your distance matrix does not contain row names, the dendrogram branches will
be labelled with the numbers 1, 2, …. This can be avoided by providing a vector of names for the individuals in the
labels parameter. For example, if you had read the cars data set without the “row.names=1” parameter, you could
have got the car names printed in the dendrogram with the command
> plot(cluster.results, labels=cars$Brand)
To see the countries, try
> plot(cluster.results, labels=cars$Country)
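For readers without the Cars file, the whole sequence can be tried on a data set that ships with R. The sketch below substitutes the built-in mtcars data, which also has car names as row names; this is an assumption made for illustration and not the data used above. It draws the dendrogram with plot(), since plclust() is not available in current versions of R. The choice of four variables is arbitrary.

```r
# Assumption: the built-in mtcars data stands in for the Cars data.
cars.std.m <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")]))
cluster.results <- hclust(dist(cars.std.m), method = "single")

# The row names travel from the data frame through dist() into the labels.
print(head(cluster.results$labels))

# Default labels first, then the number of cylinders as alternative labels,
# in the same spirit as labels=cars$Country.
plot(cluster.results, hang = -1)
plot(cluster.results, hang = -1, labels = as.character(mtcars$cyl))
```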