Stat 579: More on Array and Matrix Operations Ranjan Maitra

2220 Snedecor Hall
Department of Statistics
Iowa State University.
Phone: 515-294-7757
maitra@iastate.edu
Example: Color Quantization of Images
Objective: Represent an image using only a small number of
colors.
We represent the PET image with 8 colors in an alternative
way:
> PET <- matrix(scan(file =
    "http://maitra.public.iastate.edu/stat579/datasets/fbp-img.dat"),
    ncol = 128, nrow = 128, byrow = TRUE)
> image(1:128, 1:128, PET[, 128:1], col =
    topo.colors(128^2))
We will represent each pixel value by the closest mid-point
of the 12.5-percentile bins.
Let us obtain these bins:
> PET.qt <- quantile(PET, probs = seq(0, 1, length = 9))
Get the mid-points of the quantiles:
> PET.qt.mid <- (PET.qt[-1] + PET.qt[-length(PET.qt)])/2
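To see what the two quantile lines do, here is a minimal sketch on a toy vector (the vector v and the choice of four bins are assumptions made for illustration, not part of the PET example):

```r
# Toy data: the integers 0, 1, ..., 8 (hypothetical, for illustration only)
v <- 0:8

# Five quantiles (0%, 25%, 50%, 75%, 100%) delimit four bins
v.qt <- quantile(v, probs = seq(0, 1, length.out = 5))

# Mid-point of each bin: average of its lower and upper quantile
v.qt.mid <- (v.qt[-1] + v.qt[-length(v.qt)]) / 2

unname(v.qt)      # 0 2 4 6 8
unname(v.qt.mid)  # 1 3 5 7
```

Dropping the first element of v.qt gives the upper bin edges and dropping the last gives the lower edges, so there is always one fewer mid-point than quantiles.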
A Simplified Problem
Let us simplify to a more manageable problem:
> x <- matrix(rpois(15, lambda = 10), ncol = 5)
> x.qt <- quantile(x, probs = seq(0, 1, length = 5))
> x.qt.mid <- (x.qt[-1] + x.qt[-length(x.qt)])/2
For each observation in x, we want to see which mid-point
it is closest to.
We want to do this efficiently.
Strategy: make 3-d arrays and obtain the distances from
each observation to each mid-point.
Order the distances and take the index of the smallest one
at each pixel.
Implementation on a Simplified Problem
Make the appropriate 3-d arrays:
> x.arr <- array(x, dim = c(dim(x),
    length(x.qt.mid))) # stack the matrix 4 times (once
    # per mid-point) on top of each other
> mid.arr <- array(rep(x.qt.mid, each =
    prod(dim(x))), dim = dim(x.arr))
Now subtract and square the two arrays:
> sq.diff.arr <- (x.arr - mid.arr)^2
For each element of x, we want the index in the third
dimension which has the smallest element. Either of the
following gives it:
> min.idx1 <- apply(X = sq.diff.arr, MARGIN = c(1, 2),
    FUN = which.min)
> min.idx2 <- apply(X = sq.diff.arr, MARGIN = c(1, 2),
    FUN = order, decreasing = FALSE)[1, , ]
> x.quant <- array(x.qt.mid[min.idx1], dim = dim(x))
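As a cross-check on the 3-d array approach, the same nearest-mid-point indices can be sketched with outer(), which builds an observations-by-mid-points table of squared distances. This is an alternative formulation, not the method used on the slide; the seed and the name min.idx3 are assumptions made for illustration.

```r
set.seed(579)  # arbitrary seed, only so the sketch is reproducible
x <- matrix(rpois(15, lambda = 10), ncol = 5)
x.qt <- quantile(x, probs = seq(0, 1, length.out = 5))
x.qt.mid <- (x.qt[-1] + x.qt[-length(x.qt)]) / 2

# 3-d array approach, as on the slide
x.arr <- array(x, dim = c(dim(x), length(x.qt.mid)))
mid.arr <- array(rep(x.qt.mid, each = prod(dim(x))), dim = dim(x.arr))
sq.diff.arr <- (x.arr - mid.arr)^2
min.idx1 <- apply(X = sq.diff.arr, MARGIN = c(1, 2), FUN = which.min)

# outer() approach: one row per observation, one column per mid-point
d2 <- outer(as.vector(x), unname(x.qt.mid), FUN = function(a, b) (a - b)^2)
min.idx3 <- matrix(apply(d2, 1, which.min), nrow = nrow(x))

all(min.idx1 == min.idx3)  # both approaches pick the same mid-point
x.quant <- array(x.qt.mid[min.idx1], dim = dim(x))
```

Both versions use which.min(), so ties are resolved the same way (the first of the tied mid-points is chosen).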
Translate to PET case
Make the appropriate 3-d arrays:
> PET.arr <- array(PET, dim = c(dim(PET),
length(PET.qt.mid))) # stack the matrix 8 times on
top of each other
> mid.arr <- array(rep(PET.qt.mid, each =
prod(dim(PET))), dim = dim(PET.arr))
Now subtract and square the two arrays:
> sq.diff.arr <- (PET.arr - mid.arr)^2
For each pixel, we want the index in the third dimension
having the smallest element:
> min.idx1 <- apply(X = sq.diff.arr, MARGIN = c(1, 2),
    FUN = which.min)
> min.idx2 <- apply(X = sq.diff.arr, MARGIN = c(1, 2),
    FUN = order, decreasing = FALSE)[1, , ]
> PET.quant <- array(PET.qt.mid[min.idx2], dim =
    dim(PET))
Display the quantized image:
> image(1:nrow(PET), 1:ncol(PET), PET.quant[, 128:1],
    col = topo.colors(256))
k -means Color Quantization of Images
Try to minimize
$$\mathrm{SSW} = \min \sum_{i=1}^{n} \sum_{k=1}^{K} I[X_i \in G_k]\,(X_i - \mu_k)^2$$
Both $I[X_i \in G_k]$ and $\mu_k$ are parameters to be estimated
(solution provided by k-means and other algorithms)
> PET.kmns <- kmeans(as.vector(PET), centers = 8,
nstart = 1000)
PET.kmns is a list with components centers and
cluster, among other things
> PET.kmns.img <-
    array(PET.kmns$centers[PET.kmns$cluster], dim =
    dim(PET))
> image(1:nrow(PET), 1:ncol(PET), PET.kmns.img[,
    128:1], col = topo.colors(256))
# clearly does much better
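One way to put a number on the comparison is to compute the sum of squared differences between each pixel and its assigned color under both schemes. The sketch below uses a simulated image as a stand-in for the PET data; the gamma-distributed values, the 64 x 64 size, the seed, and nstart = 20 are all assumptions chosen to keep the example small and self-contained.

```r
set.seed(1)  # arbitrary seed for reproducibility
# Simulated stand-in for the PET image (hypothetical data)
img <- matrix(rgamma(64 * 64, shape = 2), nrow = 64)
K <- 8

# Quantile mid-point quantization, as on the earlier slides
qt <- quantile(img, probs = seq(0, 1, length.out = K + 1))
qt.mid <- unname((qt[-1] + qt[-length(qt)]) / 2)
d2 <- outer(as.vector(img), qt.mid, FUN = function(a, b) (a - b)^2)
idx <- apply(d2, 1, which.min)               # nearest mid-point per pixel
ssw.quant <- sum((as.vector(img) - qt.mid[idx])^2)

# k-means quantization
kmns <- kmeans(as.vector(img), centers = K, nstart = 20)
ssw.kmns <- sum((as.vector(img) - kmns$centers[kmns$cluster])^2)

# Squared error of each scheme; smaller means a closer approximation
c(quantile.midpoints = ssw.quant, kmeans = ssw.kmns)
```

The k-means figure computed this way is the same quantity that kmeans() reports as tot.withinss. The mid-point figure measures distance to the mid-points rather than to the group means, so it is an upper bound on the SSW of that grouping.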
Factors
A factor is a vector object used to specify a discrete
classification (grouping) of the components of other
vectors of the same length. R provides both ordered and
unordered factors. The most common application of factors
is with model formulae. As an example, suppose that
we have a sample of 50 tax accountants from all the states
and territories of the US and their individual state of origin
is specified by a character vector of state mnemonics.
To save typing, we will just obtain the state names from
somewhere, in particular the rownames of the USArrests
dataset.
> data(USArrests)
> state <- rownames(USArrests)
Notice that in the case of a character vector, sorting (e.g.
with sort()) means sorting in alphabetical order.
Factors (continued)
A factor is created from such a vector using the factor()
function:
> statef <- factor(state)
The print() function handles factors slightly differently
from other objects:
> statef
To find out the levels of a factor the function levels() can
be used.
> levels(statef)
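A small sketch of what a factor stores may help here: the distinct values become the levels (sorted alphabetically by default), and each element is recorded as an integer code pointing into those levels. The vector sizes and its values are hypothetical, chosen only for illustration.

```r
# Hypothetical character vector with repeated values
sizes <- c("small", "large", "medium", "small", "large")
sizef <- factor(sizes)

levels(sizef)      # "large" "medium" "small"  (alphabetical)
as.integer(sizef)  # 3 1 2 3 1                 (codes into levels)
table(sizef)       # counts per level
```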
Ordering Factors
The levels of factors are stored in alphabetical order, or in
the order they were specified to factor if they were
specified explicitly.
Sometimes the levels will have a natural ordering that we
want to record and want our statistical analysis to make
use of. The ordered() function creates such ordered
factors but is otherwise identical to factor. For most
purposes the only difference between ordered and
unordered factors is that the former are printed showing
the ordering of the levels, but the contrasts generated for
them in fitting linear models are different.
> statef.order <- ordered(statef)
> levels(statef.order)
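State names have no natural ordering beyond the alphabet, so the following sketch uses a hypothetical sizes vector to show an ordering specified explicitly through the levels argument, and the comparisons that an ordered factor then supports.

```r
# Hypothetical data with a natural ordering: small < medium < large
sizes <- c("small", "large", "medium", "small", "large")
sizef.order <- ordered(sizes, levels = c("small", "medium", "large"))

levels(sizef.order)    # "small" "medium" "large"
sizef.order < "large"  # TRUE FALSE TRUE TRUE FALSE
max(sizef.order)       # large
```

With an unordered factor, comparisons such as < are not meaningful; R returns NA with a warning.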
Ragged Arrays and the tapply() function - I
To continue the previous example, suppose we have the
incomes of the same tax accountants in another vector (in
suitably large units of money). We simulate between 1 and
10 accountants per state, and an income for each:
> x.rep <- rep(1:nlevels(statef), times =
    sample(1:10, size = length(statef), replace = TRUE))
> state.f.rep <- statef[x.rep]
> incomes <- rpois(n = length(state.f.rep), lambda =
    300)
To calculate the sample mean income for each state we
can now use the special function tapply():
> incmeans <- tapply(X = incomes, INDEX = state.f.rep, FUN
= mean)
giving a means vector with the components labelled by the
levels
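Because the incomes above are simulated, the output changes from run to run. A minimal sketch with fixed, hypothetical values (the names grp and inc are made up for illustration) makes the grouping easy to verify by hand:

```r
# Hypothetical data: six accountants in three states
grp <- factor(c("IA", "MN", "IA", "WI", "MN", "IA"))
inc <- c(300, 280, 320, 310, 300, 310)

grp.means <- tapply(X = inc, INDEX = grp, FUN = mean)
grp.means
#  IA  MN  WI
# 310 290 310

# Group sizes differ (3, 2, 1), which is what makes the array "ragged"
tapply(X = inc, INDEX = grp, FUN = length)
```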
Ragged Arrays and the tapply() function - II
The function tapply() is used to apply a function, here
mean(), to each group of components of the first
argument, here incomes, defined by the levels of the
second argument, here state.f.rep, as if they were separate
vector structures. The result is a structure of the same
length as the levels attribute of the factor containing the
results.
Suppose further we needed to calculate the standard
errors of the state income means. Since sd() returns the
standard deviation, we divide it by the square root of the
group size (a state with a single accountant gives NA):
> incster <- tapply(X = incomes, INDEX = state.f.rep,
    FUN = function(x) sd(x)/sqrt(length(x)))
Ragged Arrays and the tapply() function - III
The function tapply() can also be used to handle more
complicated indexing of a vector by multiple categories.
For example, we might wish to split the tax accountants by
both state and sex. However, in this simple instance (just
one factor) what happens can be thought of as follows.
The values in the vector are collected into groups
corresponding to the distinct entries in the factor. The
function is then applied to each of these groups
individually. The value is a vector of function results,
labelled by the levels attribute of the factor.
The combination of a vector and a labelling factor is an
example of what is sometimes called a ragged array, since
the subclass sizes are possibly irregular. When the
subclass sizes are all the same, the indexing may be done
implicitly and much more efficiently using the matrix
and array utilities above.
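Both points on this slide can be sketched with a few hypothetical values (state, sex, inc and g are made-up names and data, not from the example above). Passing a list of factors to tapply() cross-classifies the vector, and when every group has the same size a matrix with one group per column gives the same means without any factor.

```r
# Hypothetical data: six accountants, two states, two sexes
state <- factor(c("IA", "IA", "MN", "MN", "IA", "MN"))
sex   <- factor(c("F", "M", "F", "M", "F", "M"))
inc   <- c(300, 280, 320, 310, 310, 290)

# Two factors: result is a state-by-sex table of mean incomes
by.both <- tapply(inc, list(state, sex), mean)
by.both
#      F   M
# IA 305 280
# MN 320 300

# Equal group sizes: three groups of two consecutive values each
g <- rep(1:3, each = 2)
via.tapply <- tapply(inc, g, mean)
via.matrix <- colMeans(matrix(inc, nrow = 2))  # one group per column
```

Here via.tapply and via.matrix hold the same three means (290, 315, 300); the matrix version relies on the data being laid out group by group.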