Functional Data Analysis in Matlab and R James Ramsay, Professor, McGill U., Montreal Hadley Wickham, Grad student, Iowa State, Ames, IA Spencer Graves, Statistician, PDF Solutions, San José, CA 1 Outline • What is Functional Data Analysis? • FDA and Differential Equations • Examples: – Squid Neurons – Continuously Stirred Tank Reactor (CSTR) • Conclusions • References 2 What is FDA? • Functional data analysis is a collection of techniques to model data from dynamic systems – possibly governed by differential equations – in terms of some set of basis functions • The ‘fda’ package supports the use of 8 different types of basis functions: constant, monomial, polynomial, polygonal, B-splines, power, exponential, and Fourier. 3 Observations of different lengths • Observation vectors of different lengths can be mapped to coordinates of a fixed basis set • All examples in the ‘fda’ package have the same numbers of observations • No conceptual obstacles to handling observation vectors of different lengths 4 Time Warping • “start” and “stop” are sometimes determined by certain transitions • Example: growth spurts in the life cycle of various species do not occur at exactly the same ages in different individuals (even within the same species) 5 160 140 120 100 80 Height (cm.) • Tuddenham, R. D., and Snyder, M. M. (1954) "Physical growth of California boys and girls from birth to age 18", _University of California Publications in Child Development_, 1, 183-364. 180 10 Girls: Berkeley Growth Study o o o o o o o o o o o oo o o ooo o o o o o o o oo o o o o o o o o oo o o o o o o o 5 o o o o o o o o o o o o o o o ooo ooooo ooo o o o o oooo oo ooo ooo ooo oo oo o ooo oo o o o o o ooo oo oooo oo o oo oo o o o o o o o o o ooooooo o oo o o o o ooo o o o o oooo o o o o o o o o o o o o o o o o o o o ooo o oo o o oo o o o o o o o ooo o o o o oo o o o o o o o o o o o o o o oo o o o o 10 15 age 6 180 160 140 120 100 80 2 1 o o o o oo o o ooo o o oo o o ooo o oo o o ooo o o o o o o o o o o o o o o o o o o o o o o 10 15 0 5 o o o o o o o -3 -2 -1 age -4 Growth acceleration (cm/year^2) • Growth spurts occur at different ages • Average shows the basic trend, but features are damped by improper registration Height (cm.) Acceleration oooo ooooo o o o o o o oooo oo ooo ooo ooo oo oo ooo o o o o o o ooo o ooo oo o oo oooo o oo oo o o o o o o ooooooo o oo o o o o o o o o o oooo o o o o oo o oo o o o o o o o o o o o o oo o o oo o o o o o ooo oo oo o o o o o ooo o o o o o o o o o o o oo oo o o o o 5 10 15 7 age -4 -3 -2 -1 0 1 2 -4 • Not perfect, but better -3 -2 -1 0 1 Growth acceleration (cm/year^2) • register.fd all to the mean Growth acceleration (cm/yr^2) Registration 5 5 10 10 15 age 8 15 A Stroll Along the Beach • Light intensity over 365 days at each of 190*143 = 27140 pixels was – smoothed – functional principal components • http://www.stat.berkeley.edu/~wickham/us erposter.pdf 9 Other fda capabilities – even with series of different lengths! 0.006 15 5 S O A N m D j F Montreal average daily temp deviation from average (C) Apr Jun Sep Dec Month j F 0.000 m O D A N -0.006 – good estimates of derivatives M Jan Acceleration • Phase plane plots J A J 0 • Correlations -10 Mean Temperature Montreal average daily temp deviation from average (C) -10 -5 0 5 M J S J A afda-ch03.R fda-ch01.R fda-ch02.R 10 15 20 10 Temperature (C) Script files for fda books • Ramsay and Silverman – (2002) Applied Functional Data Analysis (Springer) – (2006) Functional Data Analysis, 2nd ed. (Springer) • ~R\library\fda\scripts – Some but not all data sets discussed in the books are in the ‘fda’ package – Script files are available to reproduce some but not all of the analyses in the books. – plus CSTR demo 11 FDA and Differential Equations • Many dynamic systems are believed to follow processes where output changes are a function of the outputs, x, and inputs, u (and unknown parameters q): x t f x,u, t | θ, t 0, T • Matlab was designed in part for these types of models 12 Squid Neurons • FitzHugh (1961) - Nagumo et al. (1962) Equations: Estimate a, b and c in: 3 Voltage across Axon Membrane V Recovery via Outward Currents V c V V 3 R R V a bR c R 13 Tank Reactions Temperature Concentration • Continuously Stirred Tank Reactor (CSTR) 14 Functional Data Analysis Process 1. Select Basis Set 2. Select Smoothing Operator – e.g., differential equation – equivalent to a Bayesian prior over coefficients to estimate 3. Estimate coefficients to optimize some objective function 4. Model criticism, residual plots, etc. 5. Hypothesis testing 15 Inputs to Tank Reaction Simulation 16 Computations: Nonlinear ODE dC dt CC T , Fin C FinCin dT dt TT Fco , Fin T TC T , Fin C FinTin Fco Tco CC T , Fin exp 104 1 T 1 Tref Fin TT Fco , Fin Fco Fin TC T , Fin 130 CC T , Fin Temperature • Compute Input vectors • Define functions • Call differential equation solver • Summarize, plot Concentration Fco aFcob 1 Fco aFcob 2 4 parameters: , , a, b estimate parameters (, , a, b) 17 Three problems • Estimate (, , a, b) to minimize SSE in Temperature only function Matlab lsqnonlin R nls optim Nelder-Mead BFGS CG SANN nlminb SSE 5.09888 5.09652 5.09652 5.09652 5.09900 5.17504 5.09652 SSE-min 0.00236 0 0 0 0.00248 0.07852 0 18 SSE(Temp, Conc) • Matlab: lsqnonlin Median absolute relative error Matlab R Concentration 1.149E-03 1.145E-03 Temperature 2.640E-04 2.636E-04 • R: nls Concentration (red = true, blue = estimate) C(t) 1.6 1.4 C(t) 1.6 1.8 Concentration (red = true, blue = estimated) 1.2 1.4 1.2 0 10 20 30 40 50 0 60 10 20 30 40 50 60 50 60 Temperature 360 Temperature 350 350 340 340 0 10 20 30 40 50 60 330 330 C(t) T(t) 360 0 10 20 30 40 19 R vs. Matlab • Gave comparable answers • R code for CSTR slightly more accurate but requires much more compute time – coded by different people • R has helper functions not so easily replicated in Matlab – summary.nls – confint.nls – profile.nls Estimate kref 0.466 EoverR 0.840 a 1.720 b 0.496 StdErr t 0.004 113.0 0.009 94.7 0.232 7.4 0.050 10.0 Pr(>|t|) < 2e-16 *** < 2e-16 *** 8.2e-13 *** < 2e-16 *** 20 confint.nls • Likelihood-based confidence intervals: generally more accurate than Wald intervals – Wald subject to parameter effects curvature – Likelihood: only affected by intrinsic curvature > confintNlsFit 2.5% kref 0.458 EoverR 0.823 a 1.300 b 0.401 97.5% 0.474 0.858 2.222 0.599 21 plot.profile.nls 2.0 80 1.0 2.0 1.0 0.0 0.0 50 0.455 0.465 0.475 0.82 0.84 0.86 2.0 EoverR 2.0 kref 95 1.0 0.0 1.0 90 0.0 • for a plot showing the sqrt(log(LR)) 99 1.2 1.6 2.0 a 2.4 0.40 0.50 0.60 b 22 Conclusions • R and Matlab give comparable answers • R:nls has helper functions absent from Matlab:lsqnonlin • Functional data analysis tools are key for – estimating derivatives and – working with differential operators 23 References • www.functionaldata.org • Ramsay and Silverman (2006) Functional Data Analysis, 2nd ed. (Springer) • ________(2002) Applied Functional Data Analysis (Springer) • Ramsay, J. O., Hooker, G., Cao, J. and Campbell, D. (2007) Parameter estimation for differential equations: A generalized smoothing approach (with discussion). Journal of the Royal Statistical Society, Series B. To appear. 24 NOT free-knot splines • For this, see – DierckxSpline package – Companion to Dierckx, P. (1993). Curve and Surface Fitting with Splines. Oxford Science Publications, New York. • R package by Sundar Dorai-Raj – links to Fortran code by Dierckx available from www.netlib.org/dierckx • soon to appear on CRAN 25