CS910 Exercise Sheet 2: Trying out tools

advertisement
CS910 Exercise Sheet 2: Trying out tools
The exercises on this sheet ask you to accomplish a number of tasks. It’s up to you to
choose tools that will help you achieve them — either from those covered in lectures,
or others that you are familiar with.
Please present your results as a single printed document. Limit your answers to 4
sides of paper – this should be more than enough to express your findings; any more
and you are including a lot of unnecessary detail.
Warning: some of the questions require you to figure out how to do things in your
tool of choice that have not been discussed in detail in lectures. You’ll need to do some
searching on the web and trial and error to find the commands that will produce the
output you want.
Manipulating Data
In this section, we will use the “automobile” data set, which contains information
about the characteristics of cars from the 1980s.
This is available from: http://archive.ics.uci.edu/ml/datasets/Automobile
1. How many vehicle models are produced by manufacturers beginning with ’m’
(e.g. mazda)? Describe how you produced your answer.
[10]
Solution: One could do:
grep mazda imports-85.data | wc -l
grep mercedes-benz imports-85.data | wc -l
grep mitsubishi imports-85.data | wc -l
and add up the results (17+9+13 = 39).
However, it may be more direct to
cut -f3 -d, imports-85.data | grep ˆm | wc -l
to get the answer 39.
2. How many different (unique) combinations of {fuel type, aspiration, number of
doors, body style, drive wheels, engine location} are there present in the data
(counting examples where there is an unknown value as distinct)? For example,
there is a (gas, std, two, hatchback, rwd, front) combination in the data, so this
counts 1; there is no (diesel, std, two, wagon, fwd, rear) combination in the data,
so this is not counted.
What is the answer if we remove examples where there is some unknown value
within this set of attributes?
1
[15]
Solution: we can use
cut -f4,5,6,7,8,9 -d, imports-85.data | sort | uniq | wc -l
to get the answer 36.
Then
cut -f4,5,6,7,8,9 -d, imports-85.data | grep -v "?" |
sort | uniq | wc -l
removes “?” values, to give 34. (Note, the position of the “grep” affects the
results: if we do it first, then we remove some vehicles where the value of
these attributes is known but others are not.
3. What is the median and average price of four door vehicles? Describe briefly the
steps you followed to achieve this.
[15]
Solution: We can first select out the data that we want:
cut -f6,26 -d, imports-85.data | grep "four" |
grep -v "?" > fourprices
Then use R or a spreadsheet to get the values we want:
p4 <- read.csv("fourprices")
summary(p4)
tells us that the median price is 11245, and the mean price is 13566. Note that
we need care to avoid including vehicles with four cylinders in the results, and
we remove “?” values so R can treat the values as numeric, not categoric.
Plotting Data
In this section, we will use the “abalone” data, which contains information about physical measurements of these sea creatures.
This is available from: http://archive.ics.uci.edu/ml/datasets/Abalone
4. Create a scatter plot that shows the distribution of height (on the x-axis) against
length (on the y-axis). Ensure the axes are labeled clearly. Include a line of best fit,
and give the equation. Describe the outliers in the data.
Page 2
[20]
Solution:
Spreadsheet Solution. Here’s one solution using Excel:
after some manipulation to change formatting etc.
Gnuplot solution. Here’s one solution in gnuplot
Height vs Length in Abalone Data
3
2.5
length
2
1.5
1
0.5
0
0
0.2
0.4
0.6
0.8
1
height
Abalone
Fit
set term pdf
set output "heightlength.pdf"
set title "Height vs Length in Abalone Data"
set xlabel "height"
set ylabel "length"
y(x) = a*x + b
Page 3
1.2
set datafile separator ","
set key below
fit y(x) "../datasets/abalone.data" using 4:2 via a,b
plot "../datasets/abalone.data" using 4:2 \
with points title "Abalone", y(x) with line title "Fit"
gnuplot finds a line of best fit as length = 2.36705 height + 0.192494
This is slightly different from the parameters found by Excel. Note, gnuplot
does not make it convenient to obtain the pearson correlation coefficient, since
it performs fitting in a different way (gradient descent).
Outliers. There are two clear outliners with much larger heights compared to
the rest of the data: 1.13 and 0.515. It is not clear if these are unusual heights,
or transcription errors in the data. There are some other outliers visible which
seem to have much smaller height than expected for their length.
5. Which pairs of numeric variables out of {length, diameter, height, whole weight,
shucked weight, viscera weight, shell weight} have a correlation coefficient of
more than 0.95? Outline briefly the steps you followed to answer this question.
[20]
Hint: Find a built-in function in your tool of choice that will compute the pearson
product moment correlation coefficient for you. Can you find a way to apply it to
all pairs of variables in turn?
Solution: The pairs are: (length, diameter) : 0.98612, (whole weight, shucked
weight) : 0.969393, (whole weight, viscera weight) : 0.96636, (whole weight,
shell weight) : 0.955335.
One way to do this is to use the PEARSON function in excel to compute the
correlation for each pair – ideally, setting up a table so that each cell gives the
corresponding PMCC.
Here’s one approach:
Load the abalone data in to the spreadsheet in cells a2 to i4178.
Fill in the cells m1 to s1 with 0, 1, 2, 3, 4, 5, 6, and fill cells l2 to l8 with 0, 1, 2, 3,
4, 5, 6.
Then in cell m2, fill in the formula =PEARSON(OFFSET($B1:$B10000,0,$L2),
OFFSET($B1:$B10000,0,M$1))
and drag it across then down to fill in the cells m2 to s8
0
1
0
=PEARSON(OFFSET($B1:$B10000,0,$L2),
OFFSET($B1:$B10000,0,M$1))
=PEARSON(OFFSET($B2:$B10001,0,$L3),
OFFSET($B2:$B10001,0,M$1))
Page 4
1
=PEARSON(OFFSET($B1:$B10000,0,$L2),
OFFSET($B1:$B10000,0,N$1))
=PEARSON(OFFSET($B2:$B10001,0,$L3),
OFFSET($B2:$B10001,0,N$1))
R can also do this quite easily:
abalone <- read.csv("abalone.data", header =F)
cor(abalone[2],abalone[3])
cor(abalone[2],abalone[4])
cor(abalone[2],abalone[5])
cor(abalone[2],abalone[6])
cor(abalone[2],abalone[7])
cor(abalone[2],abalone[8])
cor(abalone[3],abalone[4])
cor(abalone[3],abalone[5])
cor(abalone[3],abalone[6])
cor(abalone[3],abalone[7])
cor(abalone[3],abalone[8])
cor(abalone[4],abalone[5])
cor(abalone[4],abalone[6])
cor(abalone[4],abalone[7])
cor(abalone[4],abalone[8])
cor(abalone[5],abalone[6])
cor(abalone[5],abalone[7])
cor(abalone[5],abalone[8])
cor(abalone[6],abalone[7])
cor(abalone[6],abalone[8])
cor(abalone[7],abalone[8])
All this typing is quite tiresome, so we can do a bit better:
for (i in 2:7) {
for (j in i+1:(8-i)) {
if (cor(abalone[i],abalone[j])>0.95) {
print(cor(abalone[i],abalone[j]))
}
}
}
(note the slightly unintuitive form of the inner for-loop)
which produces
V3
V2 0.9868116
V6
V5 0.9694055
V7
V5 0.9663751
V8
V5 0.9553554
Page 5
corresponding to (length, diam), (whole weight, shucked weight), (whole weight,
viscera weight) and (whole weight, shell weight).
6. Generate a single plot showing the (empirical) cumulative distribution function
for the number of rings for each of the three sexes (male, female, infant). Outline
briefly the steps you followed to do this.
Hint: In R, the “subset” and “lines” functions may be useful. For spreadsheets
or gnuplot, you may have to manipulate the data first before plotting.
Solution:
Approach in R
abaloneM <- subset(abalone, V1==’M’)
abaloneF <- subset(abalone, V1==’F’)
abaloneI <- subset(abalone, V1==’I’)
plot(ecdf(abaloneM$V9), xlab="Rings", verticals=TRUE,
col="green", pch=" ", ann=FALSE)
lines(ecdf(abaloneF$V9), verticals=TRUE,col="blue",pch=" ")
lines(ecdf(abaloneI$V9), verticals=TRUE, col="red",pch=" ")
legend(’bottomright’,c("Male", "Female", "Infant"),
col=c(’green’, ’blue’, ’red’), lty=1, bty=’n’)
title(main="CDFs of rings in abalone data")
Which (using a few options to neaten up the plot) generates:
Page 6
[20]
0.4
0.8
CDFs of rings in abalone data
0.0
Male
Female
Infant
0
5
10
15
20
25
30
which can probably be made more attractive.
Approach in Excel First sort the data by sex and secondarily by the number
of rings (using the “data sort” functions). This makes it much easier to work
with.
Then count how many examples there are of each sex: there are 1307 F, 1342 I,
and 1528 M.
From this, we can generate a cumulative probability value for each observation
in the three series in turn, and then add these (manually) to a plot for the CDF.
This should look something like this:
Page 7
Approach in Gnuplot
mands:
First, some manipulation of the data via shell com-
grep ˆM abalone.data | cut -d, -f9 | sort -n > abaloneM.data
grep ˆI abalone.data | cut -d, -f9 | sort -n > abaloneI.data
grep ˆF abalone.data | cut -d, -f9 | sort -n > abaloneF.data
Then some gnuplot:
set output "gnusexplot.pdf"
set title "CDFs of rings in abalone data"
set xlabel "Rings"
set ylabel "Probability"
set key bottom right
set yrange [0:1]
m= "‘wc -l abaloneM.data‘"
f="‘wc -l abaloneF.data‘"
i="‘wc -l abaloneI.data‘"
plot "abaloneM.data" using 1:(1./m) smooth cumulative t "Male", \
"abaloneF.data" using 1:(1./f) smooth cumulative t "Female", \
"abaloneI.data" using 1:(1./i) smooth cumulative t "Infant"
Which results in
Page 8
CDFs of rings in abalone data
1
Probability
0.8
0.6
0.4
0.2
0
0
5
10
15
20
25
Rings
Male
Page 9
Female
Infant
30
Download