The Interface of Functional and Longitudinal Data

Raymond J. Carroll
Department of Statistics
Member, Center for Statistical Bioinformatics
Director, Institute for Applied Mathematics and Computational Science
Texas A&M University
http://stat.tamu.edu/~carroll

My Charge
• “Please feel free to talk about anything you wish” (Dangerous)
• “Your thinking about longitudinal data and perhaps functional data from a wider perspective”
• “Goals of the workshop are to inspire new researchers, and to take stock of where the interface of longitudinal-functional data and dynamics is headed”

What I Want to Talk about
Mother and joey, Tidbinbilla (outside Canberra), September 2010

What I Want to Talk about
Namadgi National Park, July 2005

What I Will Talk About
• I will talk about some of the problems I have worked on
• No technical solutions; the other speakers look to be providing them
• Investigators think marginally, statisticians think of random effects

Some Observations
• In my work, there is a tension between
  • Providing answers to my collaborators that they can understand
  • Developing new general methodology that is publishable in statistics and can solve more general problems
  • Thinking about parts of the actual problem that my collaborators would not have thought about
• It’s easy to get caught up in either of the first two

Some Observations
• When I am simply providing answers to stated questions, I find themes similar to the distinction between marginal models such as GEE and nonlinear mixed effects models for longitudinal data
• GEE is simply easier
• Most scientists think marginally because they are uncomfortable with the idea of variability

What I Will Talk About
• Think what the typical smart biologist knows about statistics.
• t-tests, ANOVA, simple linear regression
• All the focus is on the mean, none on the variability

Some Observations
• What we have to do is to deliver the analysis the data collectors can understand, and teach them about variability
• Pictures work wonders: functions are no harder to understand than histograms, and understanding variability can help investigators tell stories

Some Observations
• We need to advance the field of statistics
• Deeper understanding of the underlying process, through random effects modeling, often helps inform future studies and helps investigators tell their story

An Old Colon Carcinogenesis Project
• Experiment with 2 lipids (fish oil and corn oil)
• With and without butyrate (a fatty acid) supplementation, with p27 or MGMT repair measured as the response
• Longitudinal, maybe even dynamic, hierarchical and functional. Hierarchical because each of the treatment groups has multiple samples, and each of them has multiple functions
• Functional because of the biology

Colon Cancer Data
Jeff Morris, Naisyin Wang, Ciprian Crainiceanu, Veera B, Ana-Maria Staicu, Yehua Li

Functional
• The colonic crypts have cells: near the bottom (x=0) are the stem cells, near the top (x=1) are the differentiated cells

MGMT Repair Enzyme, 1 crypt
• MGMT curve in one crypt.
• Original analysis found large diet effects

MGMT Repair Enzyme, 1 crypt
• The large diet effects on the MGMT repair enzyme are real.
• There are also large diet effects on apoptosis

MGMT Repair Enzyme, 1 crypt
• What do biologists do (define original analysis)?
• They simplify the data so that they can do ANOVA, duh!
• They average all the responses (p27 or MGMT, about 200 observations in each analysis) in the bottom third, middle third and top third. Then they run 3 ANOVAs.

MGMT Repair Enzyme, 1 crypt
• They then tell a story about all the ANOVAs they have done.
• We all smile about this, but my collaborator (Joanne Lupton) just got elected into the U.S. National Academy of Sciences.
MGMT Repair Enzyme, 1 crypt
• I like to think that our more nuanced analyses help her tell her stories, which is hopefully not wishful thinking!

MGMT Repair Enzyme, 1 crypt
• Wavelet functional coefficients for apoptotic index in the top 1/3 of the crypt, for fish oil and for corn oil.
• From Morris and Carroll (2006): “fish-oil-fed animals who had a large amount of apoptosis near their lumenal surface also had high levels of the DNA repair enzyme MGMT near their lumenal surface, meaning that the two major mechanisms for dealing with DNA damage were correlated. This relationship was not so strong for corn-oil-fed animals”.

MGMT Repair Enzyme, the story
• We did a full-blown wavelet-based functional mixed model analysis to get these conclusions. Could it have been done marginally?
• Probably yes, but then that’s dull.
• However, (a) we know much more about the pattern of variability and (b) we built up methods and software that can be used in a wide variety of settings

Longitudinal
• Colon carcinogenesis is a localized phenomenon. The crypts closest to one another are highly correlated

Colon Cancer Data
• The locality hypothesis says that colon cancer starts because of highly localized damage.
• Longitudinal and hierarchical FDA can tell us many things about this hypothesis, e.g., where is localized damage more likely to occur?
• While most research focuses on the proximal and distal portions of the colon, FDA reveals that there is as much or more in the middle

Basic Model for p27
$Y_{drc}(x) = \eta_d(x) + \gamma_{dr}(x) + \theta_{drc}(x) + \varepsilon_{drc}(x)$
$\eta_d(x) + \gamma_{dr}(x)$ = rat-level function within diet-fatty acid group
$\theta_{drc}(x)$ = crypt-level functions, which are not independent

Colon Cancer Data
• Lots of fun fitting this longitudinal, hierarchical functional data set
• What did the investigators want to know?
• They were interested in how correlated neighboring crypts are, consistent with the locality hypothesis.
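To make the basic model for p27 concrete, here is a minimal simulation sketch for one rat in one diet group. It is only an illustration under stated assumptions: the linear mean function, the sine-shaped rat effect, the AR(1) dependence across neighboring crypts (`rho`), and the names `simulate_rat` and `neighbor_correlation` are all hypothetical and are not taken from the actual analysis.

```python
import math
import random


def simulate_rat(n_crypts=20, n_grid=50, rho=0.7, noise_sd=0.1, seed=0):
    """Simulate Y_c(x) = eta(x) + gamma(x) + theta_c(x) + eps_c(x) for one rat.

    Every functional form and parameter value here is an assumption.
    """
    rng = random.Random(seed)
    # Relative cell position: x = 0 is the crypt bottom, x = 1 the top.
    x = [j / (n_grid - 1) for j in range(n_grid)]
    # Diet-group mean function eta_d(x) (assumed linear).
    eta = [1.0 + 0.5 * xj for xj in x]
    # Rat-level deviation gamma_dr(x) (assumed sine-shaped, random amplitude).
    rat_amp = rng.gauss(0.0, 0.2)
    gamma = [rat_amp * math.sin(math.pi * xj) for xj in x]
    curves = []
    crypt_amp = rng.gauss(0.0, 1.0)
    for c in range(n_crypts):
        if c > 0:
            # AR(1) across crypt index: neighboring crypts are correlated,
            # so the crypt-level functions are not independent.
            crypt_amp = rho * crypt_amp + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        theta = [0.3 * crypt_amp * xj for xj in x]  # crypt-level function theta_drc(x)
        y = [eta[j] + gamma[j] + theta[j] + rng.gauss(0.0, noise_sd)
             for j in range(n_grid)]
        curves.append(y)
    return x, curves


def neighbor_correlation(curves):
    """Pearson correlation between the mean responses of adjacent crypts."""
    means = [sum(y) / len(y) for y in curves]
    a, b = means[:-1], means[1:]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


x, curves = simulate_rat()
print(round(neighbor_correlation(curves), 2))
```

The crude lag-1 correlation of crypt means is a marginal summary of the kind an investigator might ask for; the hierarchical functional model instead describes how the dependence between crypts varies along x.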
Colon Cancer Data
• The Bayesian analysis gives them strong point-wise evidence (can supplement with FDR)
• Allows summary measures

Colon Cancer Data
• Acknowledging the longitudinal nature led to much more precise inferences. This is the interaction function between diet and treatment: guess which one allows for locality?

Cell Signaling Data
• Myometrial cells meant to mimic what goes on near birth were either exposed to dioxin (TCDD) or not exposed.
• They were then exposed to a hormone, oxytocin, that stimulates calcium ion (Ca2+) signaling
• The Ca2+ signal was observed at many pixels of each cell for 512 time points (85 minutes)

Cell Signaling Data
Josue Martinez, Jianhua Huang

Cell Signaling Data
• The cells were segmented, and the intensity of the signals was obtained for each pixel, each cell and all time points.
• Roughly 25 cells in each treatment group (control and TCDD)
• Hierarchical because of pixels within cells within treatments

Cell Signaling Data
• Functional because pixels are measured over time
• Possibly spatial at different levels, because the cells are in spatial alignment
• Lots of preprocessing: cell segmentation, adjustment for saturation, and more

Cell Signaling Data
First two minutes of the experiment for the TCDD treated plate. Next come two movies of the data

Cell Signaling Data
All cells (Control and TCDD), at a basal state in which the cells were cultured, 0-4 minutes and 40-80 minutes after oxytocin exposure

Cell Signaling Data
All cells (Control and TCDD), at a low estrogen state, just before pregnancy (note the delayed response due to TCDD)

Cell Signaling Data
All cells (Control and TCDD), at a high estrogen state, near full term in pregnancy

Cell Signaling Data
All cells (Control and TCDD), at a high estrogen state, near full term in pregnancy, after normalization and registration

Cell Signaling Data
All cells (Control and TCDD), at a high estrogen state, near full term in pregnancy, after normalization and registration.
Areas under the curve (p < 0.001)

Cell Signaling Data
• You should see that in this analysis, we have not made use of the structure of the data.
• We have thought like GEE people, and indeed reduced the comparison of control and TCDD to single numbers, e.g., peak time and area under the curve.
• We did lots of dimension reduction (4 weighted SVDs) to get here

Cell Signaling Data
• There was a lot of work to get the data into a format for analysis
• Question: what can hierarchical, possibly spatial FDA do for us here, and given the structure, how should an analysis proceed?
• I feel that there is a lot more that we can learn about the process by thinking more deeply about the modeling

Bat Chirp Data
• Bats of the same species, residing in Austin (city bats) and College Station (Aggie bats)

Bat Chirp Data
Josue Martinez, Jeff Morris

Bat Chirp Data
• Bat chirps were recorded, some multiple times for each bat.
• The hierarchy is species, bat, replicate
• I believe this analysis is a poster child for why to think functionally and hierarchically

A Representative Bat Chirp

Bat Chirp Spectrogram

Bat Chirp Data
• The chirp is mainly composed of frequencies that start at about 40 kilohertz (kHz) and slowly decrease to 20 kHz from 0 to 8 milliseconds into the chirp.
• The bat then transitions to predominant frequencies at 60 kHz that slowly decrease back down to 40 kHz and then rise up to 60 kHz towards the end of the chirp.
• Frequencies above ∼80 kHz are harmonics of the fundamental signal.

One Chirp per Bat

Bat Chirp Data
• It seems clear to me that this is an inherently functional problem.
• Trying to reduce it to a single number to do a t-test seems difficult to contemplate, but it is not impossible.
• People have tried t-tests and classification based on measures such as duration, start frequency, end frequency, etc.
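For contrast with the functional view, the "reduce it to a single number" route can be sketched in a few lines. This is a hedged illustration only: the helper names `chirp_summaries` and `welch_t` are hypothetical, the choice of Welch's unequal-variance t statistic is an assumption (the talk does not say which t-test was used), and all numbers are made-up toy values, not the bat data.

```python
import math
import statistics


def chirp_summaries(times_ms, freqs_khz):
    """Collapse one frequency track (time, dominant frequency) to scalars."""
    return {
        "duration_ms": times_ms[-1] - times_ms[0],
        "start_khz": freqs_khz[0],
        "end_khz": freqs_khz[-1],
    }


def welch_t(a, b):
    """Welch two-sample t statistic (unequal variances), an assumed choice."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


# Made-up toy numbers for illustration, NOT the bat data.
track = chirp_summaries([0.0, 4.0, 8.0], [40.0, 30.0, 20.0])
city_durations = [7.5, 8.0, 8.5, 8.2]
aggie_durations = [8.8, 9.1, 9.4, 9.0]
print(track["duration_ms"], welch_t(city_durations, aggie_durations))
```

Each chirp contributes three numbers here, so the shape of the spectrogram between its start and end, and the replicate structure within each bat, are discarded before any inference is done.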
Bat Chirp Data
• One could simply take each pixel of the spectrogram and do t-tests, with FDR control
• This would ignore the replicate data, would ignore the correlated nature of the data, would do no dimension reduction, etc.

Bat Chirp Data
• What did the biologist (Kisi Bohn) want to know?
• She wanted to know if the bats from the same species (City Bats and Aggie Bats) have evolved different vocalizations
• What did we want to do:
  • Answer her question precisely, and let her tell a story (the marginal question, imprecisely framed)
  • Use all the data
  • Understand the variability

Bat Chirp Data
• We wavelet transformed the spectrograms, fit a 2-D hierarchical wavelet-based functional mixed model (WFMM), transformed back, and did analysis of the results (see next)

Bat Chirp Data
• Difference in mean spectrogram inferred from model.
• Red favors College Station, blue favors Austin
• This could be done without random effects

Bat Chirp Data
• White areas are those in which the spectrograms differ by 1.5 fold or more, with a global FDR control of 15%.
• Hard to do legitimately without random effects?

Frequency Agile Lidar Data
• This is a recent project from Bani Mallick’s group
• Here is a comic describing the process

LIDAR Data
Bani Mallick and his student Swarup De; Peter Hall and Aurore Delaigle

Frequency Agile Lidar Data
• There is a transmitted signal
• There is background
• There is a received signal, which is then background corrected
• For each time (100+) and wavelength (19), we see 625 observations across the physical range of observation, i.e., equally spaced functional data with noise.

Frequency Agile Lidar Data
• There are two types of signals
• The first is benign, ordinary dust that has been released
• The second is biological aerosols

Frequency Agile Lidar Data
Four samples at the same time and wavelength. Background corrected only

Frequency Agile Lidar Data
Four samples at the same time and wavelength.
Background corrected, truncated at zero and normalized

Data
• For aerosol type a = 1,2, and sample i=1,…,n within type, we observe background corrected received data $R_{aiw}(t,x)$
• Here t = time, w = wavelength and x = distance.
• This is hierarchical: there are samples within types

Data
• For aerosol type a = 1,2, and sample i=1,…,n within type, we observe $R_{aiw}(t,x)$
• This is functional: there are bivariate space-time curves over distance x and time t
• It is longitudinal, over wavelength

Approaches
• For aerosol type a = 1,2, and sample i=1,…,n within type, we observe $R_{aiw}(t,x)$
• There are a vast number of approaches possible
• The fun thing to do is to build a hierarchical, longitudinal, space-time model
• Doing this is not trivial, will advance the field, will allow sharing of data, will allow understanding of variability, etc.

Approaches
• The investigators want things far more boring
• They want to know if there are differences between the two types of samples (biological and non-biological), sigh.

Approaches
• Both simple questions can be handled by a model-based approach, of course.
• But they can also be answered by much simpler, ad hoc, dimension reduction-based and not particularly innovative approaches
• We will have to decide what to do!

Conclusions
• Functional, hierarchical and longitudinal data are the wave of the future.
• I have given 4 examples of functional data that are either hierarchical or longitudinal
• Analyzing data like this is great fun!

Conclusions
• The questions I have raised are about the goals of such studies.
• If investigators only think marginally, they miss out.
• If we do not think marginally, we have less influence

Conclusions
• Marginal approaches are often much faster to implement, and easier to explain.
• I’d like speakers at this conference to help me by indicating why powerful random effects models are “better” than marginal approaches.
Advertisement
• TAMU has a full professor opening in computational statistics as broadly defined.
• Startup funding is at least $750,000

Other Acknowledgments
• I gratefully acknowledge financial support from the U.S. National Cancer Institute (R37CA057030) and King Abdullah University of Science and Technology (KAUST, Award Number KUS-CI-016-04).

Approaches
• There is a deconvolution aspect to this problem that is fairly unique
• Along with the received signal $R_{aiw}(t,x)$, there is a transmitted signal $T_{aiw}(t,x)$
• There is thought to be a true signal $G_{aiw}(t,x)$

Approaches
• The deconvolution equation is
  $R_{aiw}(t,x) = \int_0^x G_{aiw}(t,v)\, T_{aiw}(t,x-v)\, dv + \varepsilon_{aiw}(t,x)$
• Here, over x, $\varepsilon_{aiw}(t,x)$ is supposed to be white noise
• Should one use $R_{aiw}(t,x)$ or $\hat{G}_{aiw}(t,x)$?

Approaches
• It turns out that there are no systematic differences across treatment for $\varepsilon_{aiw}(t,x)$ or for $T_{aiw}(t,x)$
• So differences across treatments in the received signal reflect differences in the true signal, and vice-versa
• Is deconvolution a good idea? It is a heck of a lot of work, and the model assumptions are stringent

Approaches
• We think deconvolution here is not only harder than simply using the observed data, but less efficient because of the excess noise induced by deconvolution
• The Mallick group has made great progress on attacking this in a systematic, functional, hierarchical, Bayesian manner
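The concern about excess noise from deconvolution can be illustrated with a small discretized sketch of the deconvolution equation, for one fixed aerosol type, sample, wavelength and time (so the indices a, i, w and t are dropped). Everything here is an assumption made for illustration: the Riemann-sum discretization, the exponential transmitted signal, the bump-shaped true signal, the noise level, and the function names `convolve` and `naive_deconvolve`. It is not the Mallick group's method, which the talk describes as functional, hierarchical and Bayesian.

```python
import math
import random


def convolve(G, T, dx):
    """Riemann-sum version of R(x_j) = integral_0^{x_j} G(v) T(x_j - v) dv."""
    return [dx * sum(G[k] * T[j - k] for k in range(j + 1))
            for j in range(len(G))]


def naive_deconvolve(R, T, dx):
    """Invert the discretized equation by forward substitution.

    Requires T[0] != 0. There is no regularization, so noise in R is
    amplified by roughly a factor 1/dx.
    """
    G = []
    for j in range(len(R)):
        acc = sum(G[k] * T[j - k] for k in range(j))
        G.append((R[j] / dx - acc) / T[0])
    return G


# Hypothetical signals on an equally spaced grid over [0, 1].
n = 50
dx = 1.0 / (n - 1)
G_true = [math.exp(-((j * dx - 0.5) ** 2) / 0.02) for j in range(n)]
T = [math.exp(-3.0 * j * dx) for j in range(n)]
R_clean = convolve(G_true, T, dx)

# Add white noise to the received signal, as in the deconvolution equation.
rng = random.Random(0)
sigma = 0.01
R_noisy = [r + rng.gauss(0.0, sigma) for r in R_clean]

G_hat_clean = naive_deconvolve(R_clean, T, dx)
G_hat_noisy = naive_deconvolve(R_noisy, T, dx)
err_clean = max(abs(g - h) for g, h in zip(G_true, G_hat_clean))
err_noisy = max(abs(g - h) for g, h in zip(G_true, G_hat_noisy))
print(err_clean, err_noisy)
```

Without noise the inversion recovers the true signal up to rounding error; with even small white noise in the received signal, the error in the recovered signal is much larger than the noise itself, which is the sense in which working with the observed data directly can be more efficient.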