Computer lab 5: Kernel smoothers and splines

732A20 Data Mining and Statistical Learning
Department of Computer and Information Science
Learning objectives
The main objective of this computer lab is to make the student familiar with smoothing
splines, local polynomial regression, and kernel density estimation and classification.
After completing the lab the student shall be able to:
(i) Perform spline and kernel smoothing using the SAS procedures TPSPLINE
and LOESS, respectively.
(ii) Understand how the choice of smoothing parameters influences the quality of
the fit.
(iii) Construct a confidence band and use it for inference about the mean function
of the underlying model.
(iv) Perform kernel density estimation and use the estimated densities for kernel
density classification according to the KDE procedure in SAS.
Recommended reading
Chapters 5-6 in Hastie et al.
Assignment 1: Examination of the randomness of a military draft
lottery using local polynomial smoothing
In 1970, the US Congress instituted a random selection process for the military draft. All
366 possible birth dates were placed in plastic capsules in a rotating drum and were
selected one by one. The first date drawn from the drum received draft number one, the
second date drawn received draft number two, etc. Then, eligible men were drafted in the
order given by the draft number of their birth date. In a truly random lottery there should
be no relationship between the date and the draft number. Your task is to investigate
whether or not the draft numbers were randomly selected. The draft numbers
(Y=Draft_No) sorted by day of year (X=Day_of_year) are given in the file lottery.xls.
Consider the model $Y = f(X) + \varepsilon$, estimate the function $f$, and test whether
$f = (1 + 366)/2$.
1. Plot Y versus X using the GPLOT procedure in SAS or any other software in
which scatter charts can be drawn. State whether you can see any trend or pattern
in the data.
2. Fit a smooth function to the data using a local polynomial smoother in proc
LOESS in SAS. Start by choosing a suitable kernel width by varying the smooth
parameter from zero to one, and visually inspecting the fitted functions. Describe
how the curve of fitted values changes with increasing kernel width. Explain the
fundamental statistical issue you must address when you select the kernel width.
What levels of the smoothing parameter appear to be reasonable? Plot the data set
and the fitted values corresponding to the selected smoothing parameter in the
same plot using the GPLOT procedure.
3. Use proc LOESS to fit local linear and local quadratic polynomials to the observed data.
Let the width of the kernel be chosen automatically by specifying select=GCV.
Plot the two fitted curves and explain the similarities and differences between the two
polynomial regression models. Use the two models to draw conclusions about the
randomness of the draft lottery.
4. Produce 90% confidence bands of the mean function by running proc LOESS
with parameters clm and alpha. Plot the obtained results using GPLOT and
judge whether or not the given plot supports or contradicts the hypothesis that the
draft was completely random.
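The ideas behind steps 2-4 can also be sketched outside SAS. The Python fragment below
is a minimal illustration, not a reimplementation of proc LOESS. It assumes a tricube
kernel, a nearest-neighbour span, local linear fits only, distinct X values, a normal
quantile for the band, and a rough residual variance estimate. The data are synthetic
with a flat mean (the lottery file is not reproduced here), and the function names
smoother_weights and loess_sketch are hypothetical.

```python
import math
import random
from statistics import NormalDist


def tricube(u):
    """Tricube kernel: positive for 0 <= u < 1, zero otherwise."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def smoother_weights(xs, x0, span):
    """Weights l_i(x0) such that the local linear fit at x0 is sum(l_i * y_i).

    span is the fraction of observations in the local neighbourhood.
    Assumes distinct x values (otherwise the bandwidth h could be zero).
    """
    n = len(xs)
    k = max(4, min(n, math.ceil(span * n)))      # neighbourhood size
    h = sorted(abs(x - x0) for x in xs)[k - 1]   # distance to k-th neighbour
    d = [x - x0 for x in xs]
    w = [tricube(abs(di) / h) for di in d]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    det = s0 * s2 - s1 * s1
    return [wi * (s2 - di * s1) / det for wi, di in zip(w, d)]


def loess_sketch(xs, ys, span, level=0.90):
    """Return fitted values, pointwise band half-widths, and the GCV score."""
    n = len(xs)
    fits, norms, trace = [], [], 0.0
    for i, x0 in enumerate(xs):
        lw = smoother_weights(xs, x0, span)
        fits.append(sum(li * yi for li, yi in zip(lw, ys)))
        norms.append(math.sqrt(sum(li * li for li in lw)))
        trace += lw[i]                     # diagonal of the smoother matrix
    rss = sum((y - f) ** 2 for y, f in zip(ys, fits))
    sigma = math.sqrt(rss / (n - trace))   # rough residual scale (assumption)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = [z * sigma * nm for nm in norms]
    gcv = (rss / n) / (1.0 - trace / n) ** 2
    return fits, half, gcv


# Synthetic data with a constant mean (NOT the lottery data).
random.seed(1)
xs = list(range(1, 121))
ys = [60.0 + random.gauss(0.0, 20.0) for _ in xs]

for span in (0.05, 0.3, 0.8):
    fits, half, gcv = loess_sketch(xs, ys, span)
    # Sum of squared second differences as a crude measure of wiggliness.
    rough = sum((fits[i + 1] - 2 * fits[i] + fits[i - 1]) ** 2
                for i in range(1, len(fits) - 1))
    print(f"span={span}: roughness={rough:.1f}, GCV={gcv:.1f}")
```

A small span follows the noise closely (low bias, high variance), while a large span
gives a smoother but possibly biased curve; GCV is one way to balance the two. For the
randomness question, one would check whether the constant (1 + 366)/2 stays inside the
band fits[i] +/- half[i] over the whole range of X.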
Assignment 2: Mortality rates hypothesis testing using
smoothing splines
Gompertz' (1825) theory that the mortality rates (probability of dying per unit time) of many
organisms increase at an exponential rate was examined in an experiment involving
fruit flies. A total of 1,203,646 fruit flies comprised the population for this experiment, and
the number of flies found dead each day was recorded. The data set mortality_rate.xls
contains the mortality rate (Rate) of the flies for each day (Day).
1. Compute the Log-Mortality-Rate (LMR) as the logarithm of Rate and augment the
data set with the observations of this variable.
2. Fit a smoothing spline using proc TPSPLINE and make a scatter chart of the
observed and fitted values of LMR versus Day using proc GPLOT. Vary the
roughness penalty of the spline function by using different values of
lognlambda0, and choose (by visual inspection of the fitted curves) penalty
factors that
a. Make the curve too smooth so it approaches a linear function
b. Make the curve too wiggly
c. Make a reasonably good fit
List the selected penalty factors and the corresponding degrees of freedom for a, b,
and c. Explain how the degrees of freedom change when the penalty factor
increases. Also include in your report the plots obtained by proc GPLOT for
a, b, and c. (In total, there should be three (lambda, degrees of freedom) pairs and three plots.)
3. Run TPSPLINE using the degrees of freedom obtained in 2a, b, and c (use the parameter DF)
and verify that you obtain the corresponding values of the penalty factor.
4. Use proc TPSPLINE to compute an 80% confidence band using the value of the
penalty factor from 2c (by specifying /lognlambda0, alpha, uclm, lclm).
Plot the confidence band and test the hypothesis of exponentially increasing
mortality rate.
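The link between the penalty factor and the degrees of freedom in steps 2-3 can be
illustrated with a small Python sketch. It does not reproduce the thin-plate spline
fitted by proc TPSPLINE. Instead it uses a discrete analogue for equally spaced data
(a second-difference penalty, often called a Whittaker smoother), and it takes the
degrees of freedom to be the trace of the smoother matrix. The data are synthetic, and
the function names are hypothetical.

```python
import random


def second_difference_penalty(n):
    """Return K = D^T D, where D takes second differences of a length-n vector."""
    K = [[0.0] * n for _ in range(n)]
    for r in range(n - 2):
        idx = (r, r + 1, r + 2)
        coef = (1.0, -2.0, 1.0)
        for a, ca in zip(idx, coef):
            for b, cb in zip(idx, coef):
                K[a][b] += ca * cb
    return K


def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def penalized_smoother(ys, lam):
    """Minimise sum((y - f)^2) + lam * sum((second differences of f)^2).

    Returns the fitted values and the effective degrees of freedom,
    i.e. the trace of the smoother matrix S = (I + lam * K)^(-1).
    """
    n = len(ys)
    K = second_difference_penalty(n)
    A = [[(1.0 if i == j else 0.0) + lam * K[i][j] for j in range(n)]
         for i in range(n)]
    S = invert(A)
    fits = [sum(S[i][j] * ys[j] for j in range(n)) for i in range(n)]
    df = sum(S[i][i] for i in range(n))
    return fits, df


# Synthetic stand-in for log-mortality versus day (NOT the fruit fly data).
random.seed(2)
days = list(range(1, 41))
lmr = [-6.0 + 0.15 * d - 0.0015 * d * d + random.gauss(0.0, 0.2) for d in days]

for lam in (0.01, 10.0, 1e6):
    fits, df = penalized_smoother(lmr, lam)
    print(f"lambda={lam:g}: degrees of freedom={df:.2f}")
```

In this sketch the degrees of freedom fall from about n towards 2 as the penalty grows,
and the fit approaches a straight line, which mirrors case a in step 2. For step 4, note
that an exponentially increasing Rate means that LMR is linear in Day, so the hypothesis
can be examined informally by asking whether a straight line fits inside the band.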
Assignment 3: Prediction of the beetle class using kernel
density classification
The data set beetles.xls contains a collection of measurements (X) made on two types
of beetles (Class). The aim is to define the best classification rule based on the measurement
values.
1. Run the KDE procedure to construct a kernel density estimate for each type of
beetle. Use the out parameter to specify where the results of the estimation shall be
stored.
2. Plot the two density estimates in the same graph using proc GPLOT.
3. Assuming prior probabilities to be equal, construct a classification rule using the
plotted data.
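The classification rule in step 3 can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not the output of the KDE procedure: it uses a
Gaussian kernel, a normal-reference rule-of-thumb bandwidth, synthetic measurements,
and the hypothetical class labels "A" and "B". With equal priors, an observation is
assigned to the class whose estimated density is larger at that point.

```python
import math
import random
from statistics import stdev


def gaussian_kde(sample, bandwidth):
    """Return a function x -> Gaussian kernel density estimate built from sample."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                   for xi in sample) / norm

    return density


def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth 1.06 * s * n^(-1/5) (an assumed choice)."""
    return 1.06 * stdev(sample) * len(sample) ** (-0.2)


def classify(x, densities, priors):
    """Pick the class that maximises prior * estimated density at x."""
    return max(densities, key=lambda c: priors[c] * densities[c](x))


# Synthetic measurements for two hypothetical classes (NOT the beetle data).
random.seed(3)
samples = {
    "A": [random.gauss(10.0, 1.0) for _ in range(50)],
    "B": [random.gauss(14.0, 1.5) for _ in range(50)],
}
densities = {c: gaussian_kde(s, rule_of_thumb_bandwidth(s))
             for c, s in samples.items()}
priors = {"A": 0.5, "B": 0.5}   # equal prior probabilities

# Scan a grid and report where the predicted class changes.
grid = [8.0 + 0.01 * i for i in range(801)]
labels = [classify(x, densities, priors) for x in grid]
changes = [grid[i] for i in range(1, len(grid)) if labels[i] != labels[i - 1]]
print("predicted class changes near:", changes)
```

The points where the predicted class changes are the crossings of the two density
curves, which is what one reads off the plot from step 2 when the priors are equal.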
To hand in