
The Interface of Functional and
Longitudinal Data
Raymond J. Carroll
Department of Statistics
Member, Center for Statistical Bioinformatics
Director, Institute for Applied Mathematics
and Computational Science
Texas A&M University
http://stat.tamu.edu/~carroll
My Charge
• “Please feel free to talk about anything you
wish” (Dangerous)
• “Your thinking about longitudinal data and
perhaps functional data from a wider
perspective”
• “Goals of the workshop are to inspire new
researchers, and to take stock of where the
interface of longitudinal-functional data and
dynamics is headed”
What I Want to Talk about
Mother and joey, Tidbinbilla (outside Canberra), September 2010
What I Want to Talk about
Namadgi National Park, July 2005
What I Will Talk About
• I will talk about some of the problems I have
worked on
• No technical solutions, the other speakers look
to be providing them
• Investigators think marginally, statisticians think
of random effects
Some Observations
• In my work, there is a tension between
  • Providing answers to my collaborators that
    they can understand
  • Developing new general methodology
    publishable in statistics and that can solve
    more general problems
  • Thinking about parts of the actual problem
    that my collaborators would not have thought
    about
• It’s easy to get caught up in either of the 1st two
Some Observations
• When I am simply providing answers to stated
questions, I find similar themes as the
distinction between marginal models such as
GEE and nonlinear mixed effects models for
longitudinal data
• GEE is simply easier
• Most scientists think marginally because they are
uncomfortable with the idea of variability
What I Will Talk About
• Think what the typical smart biologist knows
about statistics.
• t-tests, ANOVA, simple linear regression
• All the focus is on the mean, none on the
variability
Some Observations
• What we have to do is to deliver the analysis
the data collectors can understand, and teach
them about variability
• Pictures work wonders: functions are no harder
to understand than histograms, and
understanding variability can help investigators
tell stories
Some Observations
• We need to advance the field of statistics
• Deeper understanding of the underlying
process, through random effects modeling, often
helps inform future studies and helps
investigators tell their story
An Old Colon Carcinogenesis Project
• Experiment with 2 lipids (fish oil and corn oil)
with and without butyrate (a fatty acid)
supplementation, with p27 or MGMT repair
measured as the response
• Longitudinal, maybe even dynamic, hierarchical
and functional.
• Hierarchical because each of the treatment
groups has multiple samples, and each of
them has multiple functions
• Functional because of the biology
Colon Cancer Data
Jeff Morris
Naisyin Wang
Ciprian Crainiceanu
Veera B
Ana-Maria Staicu
Yehua Li
Functional
• The colonic crypts have cells: near the bottom (x=0) are the
stem cells, near the top (x=1) are the differentiated cells
MGMT Repair Enzyme, 1 crypt
• MGMT curve in
one crypt.
• Original analysis
found large diet
effects
MGMT Repair Enzyme, 1 crypt
• The large diet effects on the MGMT repair
enzyme are real.
• There are also large diet effects on apoptosis
MGMT Repair Enzyme, 1 crypt
• What do biologists do (define original analysis)?
• They simplify the data so that they can do
ANOVA, duh!
• They average all the responses (p27 or MGMT,
about 200 observations in each analysis) in the
bottom 1/3rd, middle 1/3rd and top 1/3rd. Then
they run 3 ANOVAs.
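The reduction described above can be sketched in a few lines. This is only an illustration of the idea, not the investigators' code: the function names `thirds_means` and `one_way_f` are made up here, crypt depth is assumed to be scaled to [0, 1], and only the F statistic (no p-value) is computed.

```python
from statistics import mean

def thirds_means(x, y):
    """Average responses y over the bottom, middle and top third of depth x in [0, 1]."""
    bins = ([], [], [])
    for xi, yi in zip(x, y):
        k = min(int(xi * 3), 2)  # 0 = bottom, 1 = middle, 2 = top; x = 1 goes to the top bin
        bins[k].append(yi)
    # statistics.mean raises an error if a third contains no observations
    return tuple(mean(b) for b in bins)

def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups (e.g. one group per diet)."""
    all_vals = [v for g in groups for v in g]
    grand = mean(all_vals)
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Each crypt curve is collapsed to three numbers, and each of the three is then compared across diets with its own ANOVA, so everything about the shape within a third and the correlation between thirds is discarded.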
MGMT Repair Enzyme, 1 crypt
• They then tell a story about all the ANOVAs
they have done.
• We all smile about this, but my collaborator
(Joanne Lupton) just got elected into the U. S.
National Academy of Science.
MGMT Repair Enzyme, 1 crypt
• I like to think that our more nuanced analyses
help her tell her stories, which is hopefully not
wishful thinking!
MGMT Repair Enzyme, 1 crypt
• Wavelet functional coefficients for apoptotic index in the top 1/3 of the crypt,
for fish oil and for corn oil. From Morris and Carroll (2006): “fish-oil-fed
animals who had a large amount of apoptosis near their lumenal surface also
had high levels of the DNA repair enzyme MGMT near their lumenal surface,
meaning that the two major mechanisms for dealing with DNA damage were
correlated. This relationship was not so strong for corn-oil-fed animals”.
MGMT Repair Enzyme, the story
• We did a full-blown wavelet-based functional
mixed model analysis to get these conclusions.
Could it have been done marginally?
• Probably Yes, but then that’s dull.
• However, we (a) know much more about the
pattern of variability and (b) built up
methods and software that can be used in a
wide variety of settings
Longitudinal
• Colon carcinogenesis is a localized phenomenon.
The crypts closest to one another are highly correlated
Colon Cancer Data
• The locality hypothesis says that colon cancer
starts because of highly localized damage.
• Longitudinal and hierarchical FDA can tell us
many things about this hypothesis, e.g., where
is localized damage more likely to occur?
• While most research focuses on the proximal
and distal portions of the colon, FDA reveals that
there is as much or more in the middle
Basic Model for p27
$Y_{drc}(x) = \eta_d(x) + \gamma_{dr}(x) + \theta_{drc}(x) + \varepsilon_{drc}(x)$
$\eta_d(x) + \gamma_{dr}(x)$ = rat-level function
within diet-fatty acid group
$\theta_{drc}(x)$ = crypt-level functions,
which are not independent
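To make the structure of the model concrete, here is a toy simulation for one rat in one diet group. Everything specific in it is an assumption chosen for illustration: the curve shapes, the variances, the AR(1) dependence between neighboring crypts (parameter `rho`), and the function name `simulate_crypts`. It is a sketch of the hierarchy, not the fitted model.

```python
import math
import random

def simulate_crypts(n_crypts=5, n_x=11, rho=0.8, seed=0):
    """Simulate Y_drc(x) for several crypts c of one rat r in one diet group d."""
    rng = random.Random(seed)
    xs = [i / (n_x - 1) for i in range(n_x)]          # crypt depth, 0 = bottom, 1 = top
    eta = [1.0 + x for x in xs]                       # diet-level mean eta_d(x) (made up)
    a = rng.gauss(0, 0.3)
    gamma = [a * math.sin(math.pi * x) for x in xs]   # rat-level deviation gamma_dr(x)
    curves = []
    b = rng.gauss(0, 0.5)
    for c in range(n_crypts):
        if c > 0:
            # AR(1) step across neighboring crypts: theta_drc is NOT independent over c
            b = rho * b + math.sqrt(1 - rho ** 2) * rng.gauss(0, 0.5)
        theta = [b * x for x in xs]                   # crypt-level deviation theta_drc(x)
        curves.append([e + g + t + rng.gauss(0, 0.1)  # plus white noise epsilon_drc(x)
                       for e, g, t in zip(eta, gamma, theta)])
    return xs, curves
```

With `rho` near 1, adjacent crypts share nearly the same deviation, which is the kind of neighbor correlation the locality hypothesis is about; with `rho = 0` the crypts within a rat are independent given the rat-level function.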
Colon Cancer Data
• Lots of fun fitting this longitudinal, hierarchical
functional data set
• What did the investigators want to know?
• They were interested in how correlated
neighboring crypts are, consistent with the
locality hypothesis.
Colon Cancer Data
• The Bayesian analysis gives them strong point-wise
evidence (can supplement with FDR)
• Allows summary measures
Colon Cancer Data
• Acknowledging the longitudinal nature led to much
more precise inferences. This is the interaction
function between diet and treatment: guess which
one allows for locality?
Cell Signaling Data
• Myometrial cells meant to mimic what goes on near
birth were either exposed to dioxin (TCDD) or not
exposed.
• They were then exposed to a hormone, oxytocin,
that stimulates calcium ion (Ca2+) signaling
• The Ca2+ signal was observed at many pixels of
each cell for 512 time points (85 minutes)
Cell Signaling Data
Josue Martinez
Jianhua Huang
Cell Signaling Data
• The cells were segmented, and signal intensities
were obtained for each pixel, each cell and
all time points.
• Roughly 25 cells in each treatment group (control
and TCDD)
• Hierarchical because of pixels within cells within
treatments
Cell Signaling Data
• Functional because pixels are measured over time
• Possibly different levels of spatial because the cells
are in spatial alignment
• Lots of preprocessing: cell segmentation,
adjustment for saturation, and more
Cell Signaling Data
First two minutes of the experiment for the TCDD treated plate.
Next come two movies of the data
Cell Signaling Data
All cells (Control and TCDD), at a basal state in which the cells
were cultured, 0-4 minutes and 40-80 minutes after oxytocin
exposure
Cell Signaling Data
All cells (Control and TCDD), at a low estrogen state, just before
pregnancy (note the delayed response due to TCDD)
Cell Signaling Data
All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy
Cell Signaling Data
All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy, after normalization and registration
Cell Signaling Data
All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy, after normalization and registration. Areas under
the curve (p < 0.001)
Cell Signaling Data
• You should see that in this analysis, we have not
made use of the structure of the data.
• We have thought like GEE people, and indeed
reduced the comparison of control and TCDD to
single numbers, e.g., peak time and area under the
curve.
• We did lots of dimension reduction (4 weighted
SVD) to get here
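Reducing a signal to single numbers such as peak time and area under the curve takes only a few lines. The sketch below is an assumption about how such summaries might be computed (trapezoidal rule, first maximum), with a hypothetical function name; it is not the preprocessing pipeline used for these data.

```python
def peak_time_and_auc(times, signal):
    """Return (time at which the signal is largest, trapezoidal area under the curve)."""
    # index of the first maximum of the signal
    peak_idx = max(range(len(signal)), key=signal.__getitem__)
    # trapezoidal rule; works for unequally spaced time points as well
    auc = sum(
        0.5 * (signal[i] + signal[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(signal) - 1)
    )
    return times[peak_idx], auc
```

Once each cell is summarized this way, comparing control and TCDD is an ordinary two-sample problem, which is exactly why the pixel-within-cell structure plays no role in that analysis.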
Cell Signaling Data
• There was a lot of work to get the data into a
format for analysis
• Question: what can hierarchical, possibly spatial
FDA do for us here, and given the structure, how
should an analysis proceed?
• I feel that there is a lot more that we can learn
about the process by thinking more deeply about
the modeling
Bat Chirp Data
• Bats of the same species, residing in Austin (city
bats) and College Station (Aggie bats)
Bat Chirp Data
Josue Martinez
Jeff Morris
Bat Chirp Data
• Bat chirps were recorded, some multiple times for
each bat.
• The hierarchy is species, bat, replicate
• I believe this analysis is a poster child for why to
think functionally and hierarchically
A Representative Bat Chirp
Bat Chirp Spectrogram
Bat Chirp Data
• The chirp is mainly composed of frequencies that
start at about 40 kilohertz (kHz) and slowly
decrease to 20 kHz from 0 to 8 milliseconds into the
chirp.
• The bat then transitions to predominant frequencies
at 60 kHz that slowly decrease back down to 40 kHz
and then rise up to 60 kHz towards the end of the
chirp.
• Frequencies above ∼ 80 kHz are harmonics of the
fundamental signal.
One Chirp per Bat
Bat Chirp Data
• It seems clear to me that this is an inherently
functional problem.
• Trying to reduce it to a single number to do a t-test
seems difficult to contemplate, but it is not
impossible.
• People have tried t-tests and classification
based on measures such as duration, start
frequency, end frequency, etc.
Bat Chirp Data
• One could simply take each pixel of the
spectrogram and do t-tests, with FDR control
• This would ignore the replicate data, would
ignore the correlated nature of the data, would
do no dimension reduction, etc.
• What did the biologist want to know?
Kisi Bohn
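For reference, the FDR step of the pixel-by-pixel approach could look like the sketch below. It assumes the Benjamini-Hochberg step-up procedure, which the slides do not name, and it takes the per-pixel p-values as given; the function name and the default level `q` are illustrative only.

```python
def benjamini_hochberg(pvalues, q=0.15):
    """Return a list of booleans: True where the pixel is flagged at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # pixel indices, smallest p first
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # step-up rule: remember the largest rank whose p-value is under q * rank / m
        if pvalues[i] <= q * rank / m:
            cutoff_rank = rank
    reject = [False] * m
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject
```

The procedure treats each pixel's p-value as a separate input, so it does nothing to repair the problems listed above: replicates are still ignored and neighboring pixels are still handled as if they carried separate information.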
Bat Chirp Data
• She wanted to know if the bats from the same
species (City Bats and Aggie Bats) evolved and
have different vocalizations
• What did we want to do:
  • Answer her question precisely, and let her tell
    a story (the marginal question,
    imprecisely framed)
  • Use all the data
  • Understand the variability
Bat Chirp Data
• We wavelet transformed the spectrograms, fit a
2-D hierarchical wavelet-based functional mixed
model (WFMM), transformed back, and
did analysis of the results (see next)
Bat Chirp Data
• Difference in mean spectrogram inferred from model.
• Red favors College Station, Blue favors Austin
• This could be done without random effects
Bat Chirp Data
• White areas are those in which the spectrograms
differ by 1.5 fold or more, with a global FDR control of 15%.
• Hard to do legitimately without random effects?
Frequency Agile Lidar Data
• This is a recent project from Bani Mallick’s group
• Here is a comic describing the process
LIDAR Data
Bani Mallick and his student Swarup De
Peter Hall and Aurore Delaigle
Frequency Agile Lidar Data
• There is a transmitted signal
• There is background
• There is a received signal, which is then
background corrected
• For each time (100+) and wavelength (19), we
see 625 observations across the physical range
of observation, i.e., equally spaced functional
data with noise.
Frequency Agile Lidar Data
• There are two types of signals
• The first is benign, ordinary dust that has been
released
• The second are biological aerosols
Frequency Agile Lidar Data
Four samples at the same time and wavelength.
Background corrected only
Frequency Agile Lidar Data
Four samples at the same time and wavelength.
Background corrected, truncated at zero and normalized
Data
• For aerosol type a = 1,2, and sample i=1,…,n
within type, we observe background corrected
received data $R_{aiw}(t,x)$
• Here t = time, w = wavelength and x =
distance.
• This is hierarchical: there are samples within
types
Data
• For aerosol type a = 1,2, and sample i=1,…,n
within type, we observe $R_{aiw}(t,x)$
• This is functional: there are bivariate space-time
curves over distance x and time t
• It is longitudinal, over wavelength
Approaches
• For aerosol type a = 1,2, and sample i=1,…,n
within type, we observe $R_{aiw}(t,x)$
• There are a vast number of approaches possible
• The fun thing to do is to build a hierarchical,
longitudinal, space-time model
• Doing this is not trivial, will advance the field,
will allow sharing of data, will allow
understanding of variability, etc.
Approaches
• The investigators want things far more boring
• They want to know if there are differences
between the two types of samples (biological
and non-biological), sigh.
Approaches
• Both simple questions can be handled by a
model-based approach, of course.
• But they can also be answered by much simpler,
ad hoc, dimension reduction-based and not
particularly innovative approaches
• We will have to decide what to do!
Conclusions
• Functional, hierarchical and longitudinal data
are the wave of the future.
• I have given 4 examples of functional data that
are either hierarchical or longitudinal
• Analyzing data like this is great fun!
Conclusions
• The questions I have raised are about the goals
of such studies.
• If investigators only think marginally, they miss
out.
• If we do not think marginally, we have less
influence
Conclusions
• Marginal approaches are often much faster to
implement, and easier to explain.
• I’d like speakers at this conference to help me by
indicating why powerful random effects models
are “better” than marginal approaches.
Advertisement
• TAMU has a full professor opening in
computational statistics as broadly defined.
• Startup funding is at least $750,000
Other Acknowledgments
• I gratefully acknowledge financial support from
the U. S. National Cancer Institute (R37-CA057030)
and King Abdullah University of
Science and Technology (KAUST, Award Number
KUS-CI-016-04).
Approaches
• There is a deconvolution aspect to this problem
that is fairly unique
• Along with the received signal $R_{aiw}(t,x)$, there
is a transmitted signal $T_{aiw}(t,x)$
• There is thought to be a true signal $G_{aiw}(t,x)$
Approaches
• The deconvolution equation is
$R_{aiw}(t,x) = \int_0^x G_{aiw}(t,v)\, T_{aiw}(t,x-v)\, dv + \varepsilon_{aiw}(t,x)$
• Here, $\varepsilon_{aiw}(t,x)$ is supposed to be white noise
over x
• Should one use $R_{aiw}(t,x)$ or $\hat{G}_{aiw}(t,x)$?
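On an equally spaced range grid, the forward side of this equation is just a discrete convolution. The sketch below fixes one aerosol type, sample, wavelength and time, and replaces the integral by a Riemann sum; the function name, the grid spacing `dx` and the Gaussian noise are assumptions made for illustration, not the group's implementation.

```python
import random

def received_signal(G, T, dx, noise_sd=0.0, seed=0):
    """Riemann-sum version of R(x) = int_0^x G(v) T(x - v) dv + eps(x).

    G and T are lists of the true and transmitted signals on the same grid
    (T must be at least as long as G); dx is the grid spacing.
    """
    rng = random.Random(seed)
    R = []
    for k in range(len(G)):
        # sum over v = 0..x of G(v) * T(x - v), times the grid spacing
        conv = sum(G[j] * T[k - j] for j in range(k + 1)) * dx
        R.append(conv + rng.gauss(0.0, noise_sd))  # white noise over x
    return R
```

Going forward from G to R is a stable sum. Recovering G from a noisy R means inverting that triangular system, which is where noise gets amplified and where the modeling assumptions start to matter.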
Approaches
• It turns out that there are no systematic
differences across treatments for $\varepsilon_{aiw}(t,x)$ or for
$T_{aiw}(t,x)$
• So differences across treatments in the received
signal reflect differences in the true signal, and
vice-versa
• Is deconvolution a good idea? It is a heck of
a lot of work, and the model assumptions are
stringent
Approaches
• We think deconvolution here is not only harder
than simply using the observed data, but less
efficient because of the excess noise induced by
deconvolution
• The Mallick group has made great progress on
attacking this in a systematic, functional,
hierarchical, Bayesian manner