Bat Echolocation Data - due Feb. 25

A former student in data mining, Adam Morris, supplied this information on 7 types of bats. As you may
know, bats are flying mammals that are active at night. The challenge is to identify the species of bat
from its calls. As you also may know, bats navigate by sending out calls and listening for their echoes.
Some information on this “echolocation” process is given at
http://animals.howstuffworks.com/mammals/bat2.htm
The calls of several bats of each of the 7 types have been characterized by a set of their features as
described below. In Adam’s e-mail, the data are described as follows:
The target variable is species (there are 7 species).
Labo = Eastern red bat
Nyhu = Evening bat
Pisu = Tricolored bat
Epfu = Big brown bat
hoary = Hoary bat
Myau = Northern Long-eared bat
Tabr = LeConte Free-tail bat
There are 11 continuous variables (features), which are parameters measured from sonograms of the
echolocation recordings. The species identifications were made by comparing each sonogram with
known-species reference calls (by eye). Since this is incredibly tedious, a quantitative sorting algorithm
would be quite useful. (Note the large number of bats for which this was done.)
Measured Call Characteristics:
dur = duration
pre = preceding interval
highf = high frequency
lowf = low frequency
band = bandwidth
fmaxamp = frequency of maximum amplitude
maxamp = maximum amplitude (% duration)
slope = overall slope
heel = location of heel if present
upper = upper slope (if heel is present)
lower = lower slope (if heel is present)
Click on the link to get the SAS program that reads in the data and gets you started on part 2. Note that
it assumes proportional priors (i.e., a representative sample). You do not need to organize a nice report
this time but please use complete sentences/paragraphs to answer these questions. The tasks are:
(1) Describe the data: What are the counts and percentages of the 7 species of bat? Assuming this is a
representative sample, which species are the most common and the rarest?
(2) Run a Fisher Linear Discriminant function for identifying the species of bat.
(A) For the Fisher Linear Discriminant function, what assumptions are made about the seven
covariance matrices?
(B) How many rows and columns does each of these seven covariance matrices have?
(C) Besides the intercept, how many coefficients does each discriminant function involve? Would this
answer change if there were more features? Would it change if there were more species?
(D) Why are the discriminant numbers (Fj in our notes) different for different individual bats? Is it the
coefficients, the features, or both?
(E) Suppose (for simplicity) that a bat’s discriminant functions were F1=2 for comparing to Labo and
Fj=1 for j=2,3,…,7 for comparing to each of the other 6 species. What is the (posterior) probability
that this bat is a Labo bat (Eastern Red bat)?
(F) Suppose (again for simplicity) that we have a bat whose sonogram trace has highf=10 and all other
features equal to 0. Find from your Fisher Linear Discriminant function output, the discriminant
number (Fj in the notes) for comparing this bat to each of the seven species’ distributions.
(G) How many Epfu bats were accidentally classified as Labo and how many Labo bats were
accidentally classified as Epfu using your linear discriminant function?
(H) How would you change your code to force PROC DISCRIM to run a quadratic discriminant function?
Under what conditions would you prefer quadratic to linear? (Note the relationship between this
question and question 2(A).)
(I) Test to see if a quadratic discriminant function is needed by changing your SAS code appropriately.
Report the result. Run a quadratic discriminant function and compare the misclassification rate to
that of the linear discriminant function by showing both rates.
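
As background for the questions on proportional priors and posterior probabilities, here is a minimal Python sketch (not part of the SAS program) of the two calculations involved. It assumes that each discriminant number Fj is a linear discriminant score that already includes the log-prior term, so that the posterior probability of class j is exp(Fj) divided by the sum of exp(Fk) over all classes. If Fj is defined differently in our notes, adjust accordingly. The class labels, counts, and scores below are made-up placeholders, not the bat data, and the function names are hypothetical.

```python
import math


def proportional_priors(counts):
    """Priors proportional to class counts (assumes a representative sample)."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def posterior_probabilities(scores):
    """Turn discriminant scores into posterior probabilities.

    Assumption: each score already contains the log-prior term, so the
    posterior for class j is exp(F_j) / sum_k exp(F_k).
    """
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {label: math.exp(f - top) for label, f in scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Placeholder inputs (hypothetical classes A, B, C; not the bat species).
example_counts = {"A": 50, "B": 30, "C": 20}
example_scores = {"A": 2.0, "B": 1.0, "C": 0.0}

priors = proportional_priors(example_counts)
posts = posterior_probabilities(example_scores)

print(priors)
print({label: round(p, 4) for label, p in posts.items()})
```

The same pattern applies to any number of classes: only the differences between scores matter, which is why subtracting the largest score before exponentiating leaves the posteriors unchanged.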