732A20 Data Mining and Statistical Learning
Department of Computer and Information Science

Computer lab 5: Kernel smoothers and splines

Learning objectives

The main objective of this computer lab is to make the student familiar with smoothing splines, local polynomial regression, and kernel density estimation and classification. After completing the lab the student shall be able to:

(i) Perform spline and kernel smoothing using the SAS procedures TPSPLINE and LOESS, respectively.
(ii) Understand how the choice of smoothing parameters influences the quality of the fit.
(iii) Construct a confidence band and use it for inference about the mean function of the underlying model.
(iv) Perform kernel density estimation and use the estimated densities for kernel density classification according to the KDE procedure in SAS.

Recommended reading

Chapters 5-6 in Hastie et al.

Assignment 1: Examination of the randomness of a military draft lottery using local polynomial smoothing

In 1970, the US Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one, the second date drawn received draft number two, etc. Then, eligible men were drafted in the order given by the draft number of their birth date. In a truly random lottery there should be no relationship between the date and the draft number. Your task is to investigate whether or not the draft numbers were randomly selected.

The draft numbers (Y=Draft_No) sorted by day of year (X=Day_of_year) are given in the file lottery.xls. Consider the model $Y = f(X) + \varepsilon$, estimate the function $f$ and test whether $f \equiv (1 + 366)/2$.

1. Plot Y versus X using the GPLOT procedure in SAS or any other software in which scatter charts can be drawn. State whether you can see any trend or pattern in the data.
2.
Fit a smooth function to the data using a local polynomial smoother in proc LOESS in SAS. Start by choosing a suitable kernel width by varying the smooth parameter from zero to one, and visually inspecting the fitted functions. Describe how the curve of fitted values changes with increasing kernel width. Explain the fundamental statistical issue you must address when you select the kernel width. What levels of the smoothing parameter appear to be reasonable? Plot the data set and the fitted values corresponding to the selected smoothing parameter in the same plot using the GPLOT procedure.
3. Use proc LOESS to fit local linear and local quadratic polynomials to the observed data. Let the width of the kernel be chosen automatically by specifying select=GCV. Plot the two graphs and explain the similarities and differences between the two polynomial regression models. Use the two models to draw conclusions about the randomness of the draft lottery.
4. Produce 90% confidence bands for the mean function by running proc LOESS with the parameters clm and alpha. Plot the obtained results using GPLOT and judge whether the plot supports or contradicts the hypothesis that the draft was completely random.

Assignment 2: Mortality rates hypothesis testing using smoothing splines

Gompertz' (1825) theory that the mortality rates (probability of dying per unit time) of many organisms increase at an exponential rate was examined in an experiment involving fruit flies. A total of 1,203,646 fruit flies comprised the population for this experiment, and the number of flies found dead each day was recorded. The data set mortality_rate.xls contains the mortality rate (Rate) of the flies for each day (Day).

1. Compute the Log-Mortality-Rate (LMR) as the logarithm of Rate and augment the data set with the observations of this variable.
2.
Fit a smoothing spline using proc TPSPLINE and make a scatter chart of the observed and fitted values of LMR versus Day using proc GPLOT. Vary the roughness penalty of the spline function by using different values of lognlambda0, and choose (by visual inspection of the fitted curves) penalty factors that
   a. Make the curve too smooth, so that it approaches a linear function
   b. Make the curve too wiggly
   c. Make a reasonably good fit
List the selected penalty factors and the corresponding degrees of freedom in a, b, and c. Explain how the degrees of freedom change when the penalty factor increases. Also, include in your report the plots obtained by proc GPLOT for a, b, and c. (In total, there should be 3 pairs (lambda, degrees of freedom) and 3 plots.)
3. Run TPSPLINE using the degrees of freedom obtained in 2a, 2b, and 2c (use the parameter DF) and check that you get the corresponding values of the penalty factor.
4. Use proc TPSPLINE to compute an 80% confidence band using the value of the penalty factor from 2c (by specifying /lognlambda0, alpha, uclm, lclm). Plot the confidence band and test the hypothesis of an exponentially increasing mortality rate.

Assignment 3: Prediction of the beetle class using kernel density classification

The data set beetles.xls contains a collection of measurements (X) made for two types of beetles (Class). The aim is to define the best classification rule based on the measurement values.

1. Run the KDE procedure to construct kernel density estimates for each type of beetle. Use the out parameter to specify where the results of the estimation shall be stored.
2. Plot the two density estimates in the same graph using proc GPLOT.
3. Assuming the prior probabilities to be equal, construct a classification rule using the plotted data.

To hand in

Highlighted items
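
Illustration (not part of the assignment): the idea behind the classification rule in Assignment 3, step 3, can be sketched outside SAS. The Python snippet below is a minimal sketch under stated assumptions. The two samples are synthetic placeholders, not the beetles data. The kernel is Gaussian, and the bandwidth comes from Silverman's rule of thumb, chosen here for brevity and not necessarily what proc KDE uses. The function names are hypothetical. With equal priors, a point is assigned to the class whose estimated density is larger at that point.

```python
import numpy as np

def gaussian_kde_1d(sample, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `sample` at each point of `grid`."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance between every grid point and every observation.
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (sample.size * bandwidth)

def silverman_bandwidth(sample):
    """Rule-of-thumb bandwidth (an assumption of this sketch, not the SAS default)."""
    sample = np.asarray(sample, dtype=float)
    return 1.06 * sample.std(ddof=1) * sample.size ** (-1 / 5)

def classify(x, sample_a, sample_b):
    """Equal priors: label 'A' where the density of A is at least that of B, else 'B'."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    dens_a = gaussian_kde_1d(sample_a, x, silverman_bandwidth(sample_a))
    dens_b = gaussian_kde_1d(sample_b, x, silverman_bandwidth(sample_b))
    return np.where(dens_a >= dens_b, "A", "B")

# Synthetic placeholder measurements for two hypothetical classes.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=10.0, scale=1.0, size=200)
sample_b = rng.normal(loc=14.0, scale=1.5, size=200)

labels = classify([9.0, 15.0], sample_a, sample_b)
print(labels)
```

In a plot of the two density estimates, the same rule corresponds to reading off the points where the curves cross: these crossing points are the class boundaries.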