CS910 Exercise Sheet 2: Trying out tools

The exercises on this sheet ask you to accomplish a number of tasks. It’s up to you to choose tools that will help you achieve them, either from those covered in lectures or others that you are familiar with. Please present your results as a single printed document. Limit your answers to 4 sides of paper: this should be more than enough to express your findings; any more and you are including a lot of unnecessary detail.

Warning: some of the questions require you to figure out how to do things in your tool of choice that have not been discussed in detail in lectures. You’ll need to do some searching on the web and trial and error to find the commands that will produce the output you want.

Manipulating Data

In this section, we will use the “automobile” data set, which contains information about the characteristics of cars from the 1980s. This is available from: http://archive.ics.uci.edu/ml/datasets/Automobile

1. How many vehicle models are produced by manufacturers beginning with ‘m’ (e.g. mazda)? Describe how you produced your answer. [10]

Solution: One could do:

    grep mazda imports-85.data | wc -l
    grep mercedes-benz imports-85.data | wc -l
    grep mitsubishi imports-85.data | wc -l

and add up the results (17+9+13 = 39). However, it may be more direct to use

    cut -f3 -d, imports-85.data | grep '^m' | wc -l

to get the answer 39.

2. How many different (unique) combinations of {fuel type, aspiration, number of doors, body style, drive wheels, engine location} are there present in the data (counting examples where there is an unknown value as distinct)? For example, there is a (gas, std, two, hatchback, rwd, front) combination in the data, so this counts 1; there is no (diesel, std, two, wagon, fwd, rear) combination in the data, so this is not counted. What is the answer if we remove examples where there is some unknown value within this set of attributes?
[15]

Solution: We can use

    cut -f4,5,6,7,8,9 -d, imports-85.data | sort | uniq | wc -l

to get the answer 36. Then

    cut -f4,5,6,7,8,9 -d, imports-85.data | grep -v "?" | sort | uniq | wc -l

removes rows containing “?” values, to give 34. (Note that the position of the “grep” affects the results: if we run it before the “cut”, we also remove vehicles where these six attributes are all known but some other attribute is not.)

3. What is the median and average price of four door vehicles? Describe briefly the steps you followed to achieve this. [15]

Solution: We can first select the data that we want:

    cut -f6,26 -d, imports-85.data | grep "four" | grep -v "?" > fourprices

Then use R or a spreadsheet to get the values we want:

    p4 <- read.csv("fourprices", header=FALSE)   # the file has no header row
    summary(p4)

tells us that the median price is 11245 and the mean price is 13566. Note that we need care to avoid including vehicles with four cylinders in the results: cutting out the doors and price fields before the “grep” means that “four” can only match the number of doors. We also remove “?” values so that R can treat the prices as numeric, not categorical.

Plotting Data

In this section, we will use the “abalone” data, which contains information about physical measurements of these sea creatures. This is available from: http://archive.ics.uci.edu/ml/datasets/Abalone

4. Create a scatter plot that shows the distribution of height (on the x-axis) against length (on the y-axis). Ensure the axes are labeled clearly. Include a line of best fit, and give the equation. Describe the outliers in the data. [20]

Solution: Spreadsheet Solution. Here’s one solution using Excel, after some manipulation to change formatting etc.

[Figure: Excel scatter plot]

Gnuplot solution.
Here’s one solution in gnuplot:

[Figure: “Height vs Length in Abalone Data”; x-axis: height, y-axis: length; series: Abalone (points) and Fit (line)]

    set term pdf
    set output "heightlength.pdf"
    set title "Height vs Length in Abalone Data"
    set xlabel "height"
    set ylabel "length"
    y(x) = a*x + b
    set datafile separator ","
    set key below
    fit y(x) "../datasets/abalone.data" using 4:2 via a,b
    plot "../datasets/abalone.data" using 4:2 \
         with points title "Abalone", y(x) with lines title "Fit"

gnuplot finds a line of best fit as

    length = 2.36705 × height + 0.192494

This is slightly different from the parameters found by Excel. Note that gnuplot does not make it convenient to obtain the Pearson correlation coefficient, since it performs fitting in a different way (iterative nonlinear least squares, using the Marquardt-Levenberg algorithm).

Outliers. There are two clear outliers with much larger heights than the rest of the data: 1.13 and 0.515. It is not clear whether these are genuinely unusual heights or transcription errors in the data. There are some other outliers visible which seem to have much smaller height than expected for their length.

5. Which pairs of numeric variables out of {length, diameter, height, whole weight, shucked weight, viscera weight, shell weight} have a correlation coefficient of more than 0.95? Outline briefly the steps you followed to answer this question. [20]

Hint: Find a built-in function in your tool of choice that will compute the Pearson product moment correlation coefficient (PMCC) for you. Can you find a way to apply it to all pairs of variables in turn?

Solution: The pairs are:

    (length, diameter):              0.98612
    (whole weight, shucked weight):  0.969393
    (whole weight, viscera weight):  0.96636
    (whole weight, shell weight):    0.955335

One way to do this is to use the PEARSON function in Excel to compute the correlation for each pair, ideally setting up a table so that each cell gives the corresponding PMCC. Here’s one approach: Load the abalone data into the spreadsheet in cells a2 to i4178.
Fill in the cells m1 to s1 with 0, 1, 2, 3, 4, 5, 6, and fill cells l2 to l8 with 0, 1, 2, 3, 4, 5, 6. Then in cell m2, fill in the formula

    =PEARSON(OFFSET($B$1:$B$10000,0,$L2), OFFSET($B$1:$B$10000,0,M$1))

and drag it across then down to fill in the cells m2 to s8. The row numbers of the data range are fixed with $ so that the range does not shift down (and start to miss the first rows of data) as the formula is dragged down. The top-left corner of the table then contains:

    m2: =PEARSON(OFFSET($B$1:$B$10000,0,$L2), OFFSET($B$1:$B$10000,0,M$1))
    n2: =PEARSON(OFFSET($B$1:$B$10000,0,$L2), OFFSET($B$1:$B$10000,0,N$1))
    m3: =PEARSON(OFFSET($B$1:$B$10000,0,$L3), OFFSET($B$1:$B$10000,0,M$1))
    n3: =PEARSON(OFFSET($B$1:$B$10000,0,$L3), OFFSET($B$1:$B$10000,0,N$1))

R can also do this quite easily:

    abalone <- read.csv("abalone.data", header=FALSE)
    cor(abalone[2],abalone[3])
    cor(abalone[2],abalone[4])
    cor(abalone[2],abalone[5])
    cor(abalone[2],abalone[6])
    cor(abalone[2],abalone[7])
    cor(abalone[2],abalone[8])
    cor(abalone[3],abalone[4])
    cor(abalone[3],abalone[5])
    cor(abalone[3],abalone[6])
    cor(abalone[3],abalone[7])
    cor(abalone[3],abalone[8])
    cor(abalone[4],abalone[5])
    cor(abalone[4],abalone[6])
    cor(abalone[4],abalone[7])
    cor(abalone[4],abalone[8])
    cor(abalone[5],abalone[6])
    cor(abalone[5],abalone[7])
    cor(abalone[5],abalone[8])
    cor(abalone[6],abalone[7])
    cor(abalone[6],abalone[8])
    cor(abalone[7],abalone[8])

All this typing is quite tiresome, so we can do a bit better:

    for (i in 2:7) {
      for (j in (i+1):8) {
        r <- cor(abalone[i], abalone[j])   # 1x1 matrix labelled by column names
        if (r > 0.95) {
          print(r)
        }
      }
    }

(note the parentheses in the inner for-loop: “:” binds more tightly than “+”, so i+1:8 would be read as i + (1:8)) which produces

              V3
    V2 0.9868116
              V6
    V5 0.9694055
              V7
    V5 0.9663751
              V8
    V5 0.9553554

corresponding to (length, diameter), (whole weight, shucked weight), (whole weight, viscera weight) and (whole weight, shell weight).

6. Generate a single plot showing the (empirical) cumulative distribution function for the number of rings for each of the three sexes (male, female, infant). Outline briefly the steps you followed to do this.

Hint: In R, the “subset” and “lines” functions may be useful. For spreadsheets or gnuplot, you may have to manipulate the data first before plotting.
[20]

Solution: Approach in R. With the data loaded into “abalone” by read.csv as in question 5:

    abaloneM <- subset(abalone, V1=='M')
    abaloneF <- subset(abalone, V1=='F')
    abaloneI <- subset(abalone, V1=='I')
    plot(ecdf(abaloneM$V9), xlab="Rings", verticals=TRUE, col="green",
         pch=" ", ann=FALSE)
    lines(ecdf(abaloneF$V9), verticals=TRUE, col="blue", pch=" ")
    lines(ecdf(abaloneI$V9), verticals=TRUE, col="red", pch=" ")
    legend('bottomright', c("Male", "Female", "Infant"),
           col=c('green', 'blue', 'red'), lty=1, bty='n')
    title(main="CDFs of rings in abalone data")

Which (using a few options to neaten up the plot) generates:

[Figure: “CDFs of rings in abalone data”; series: Male, Female, Infant; x-axis from 0 to 30]

which can probably be made more attractive.

Approach in Excel. First sort the data by sex and secondarily by the number of rings (using the “data sort” functions). This makes it much easier to work with. Then count how many examples there are of each sex: there are 1307 F, 1342 I, and 1528 M. From this, we can generate a cumulative probability value for each observation in the three series in turn, and then add these (manually) to a plot for the CDF. This should look something like this:

[Figure: Excel plot of the three CDFs]

Approach in Gnuplot. First, some manipulation of the data via shell commands:

    grep '^M' abalone.data | cut -d, -f9 | sort -n > abaloneM.data
    grep '^I' abalone.data | cut -d, -f9 | sort -n > abaloneI.data
    grep '^F' abalone.data | cut -d, -f9 | sort -n > abaloneF.data

Then some gnuplot:

    set term pdf
    set output "gnusexplot.pdf"
    set title "CDFs of rings in abalone data"
    set xlabel "Rings"
    set ylabel "Probability"
    set key bottom right
    set yrange [0:1]
    # number of observations in each file ("<" makes wc print only the count)
    m = "`wc -l < abaloneM.data`"
    f = "`wc -l < abaloneF.data`"
    i = "`wc -l < abaloneI.data`"
    plot "abaloneM.data" using 1:(1./m) smooth cumulative t "Male", \
         "abaloneF.data" using 1:(1./f) smooth cumulative t "Female", \
         "abaloneI.data" using 1:(1./i) smooth cumulative t "Infant"

Which results in:

[Figure: “CDFs of rings in abalone data”; x-axis: Rings, y-axis: Probability; series: Male, Female, Infant]
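The sort, count and accumulate steps described for the spreadsheet approach can also be sketched directly in the shell. The following is a minimal sketch, not part of the original solutions: the function name ecdf and its two-column output format are assumptions made for illustration, and the sample input is made up rather than taken from the abalone data.

```shell
#!/bin/sh
# ecdf (hypothetical helper): read one numeric value per line on stdin and
# print "value cumulative_fraction" for each distinct value, smallest first.
ecdf() {
    # sort numerically, collapse repeats into "count value" pairs,
    # then accumulate the counts and divide by the total.
    sort -n | uniq -c | awk '
        { count[NR] = $1; value[NR] = $2; total += $1 }
        END {
            running = 0
            for (k = 1; k <= NR; k++) {
                running += count[k]
                printf "%s %.4f\n", value[k], running / total
            }
        }'
}

# Tiny made-up sample (not abalone data): the values 3, 1, 2, 2
printf '3\n1\n2\n2\n' | ecdf
# prints:
# 1 0.2500
# 2 0.7500
# 3 1.0000
```

Under the same assumptions, something like grep '^M' abalone.data | cut -d, -f9 | ecdf would give one line per ring count for the male abalone, which could then be plotted as a step function in the tool of your choice.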