The Stats side of things - Department of Mathematics and Statistics

Nairanjana Dasgupta
Also called JAN
Professor, Dept of Mathematics and Statistics
WSU.

My theoretical interests are simultaneous testing, binary data, and multivariate data.

If you are running a very large number of tests and are worried about Type I error at that scale, I am the person you want to talk to.

I will delineate four areas that I am currently working on.
Washington (WA) is associated with apples.
• I have been working with the Tree Fruit Research Commission for the last 5 years, modeling various aspects of apples.
• An apple grows from a bud and goes through about 8 stages before it becomes a fruit.
• One relevant question the growers ask: can we build a model that predicts what stage an apple will be in at a particular time, given weather information as well as prior data?
• Statistically, it becomes a question of modeling the pattern while accounting for issues like autocorrelation, missing data, the isotonic nature of the data, etc.
Fun project if you like to dirty your hands with large data sets and solve a REAL problem for this state!
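
One of the issues listed above, the isotonic (never-decreasing) nature of stage data, can be illustrated with the pool-adjacent-violators algorithm. This is a generic sketch and not the model used in the project; the function name and the observed stage scores are made up for illustration.

```python
def pool_adjacent_violators(values):
    """Least-squares non-decreasing fit to a sequence (equal weights assumed)."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Hypothetical stage scores recorded on successive visits; the dip from 3 to 2
# is the kind of backward step a growth stage cannot really take.
observed = [1, 3, 2, 4]
print(pool_adjacent_violators(observed))
```

The fit replaces each run of out-of-order observations with its average, so the smoothed stage sequence never moves backward.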


Modeling Growth: Can we predict the SIZE an apple will be at harvest
from prior data and weather information at the current time point?

Statistically, we need to deal with nonlinear curves and try to establish
growth patterns.
Let's talk about how we worked on this.

• About 70% of apple production in the United States takes place in the State of Washington alone, with $2.18 billion worth of apples produced in 2013 (WSDA, 2015).
• Issue: there is a considerable time gap between ordering crates of defined sizes and the delivery of those crates by the manufacturers/suppliers.
• Because of this time lag, farmers have to judge correctly which crate sizes to order far ahead of the apple harvest date.
• Therefore, a decision tool that helps apple growers/marketers judge the size range of apples at harvest accurately is warranted.
• Objective: to develop a statistical model that can be used for making management decisions on apple marketing.
• Specifically, to develop a model that can predict the size of apples at harvest from production data.
• With such a model, farmers can input the production parameters in advance, and the model will yield the average apple size for a particular harvest date.
• Model for each growth stage
• Use the models from above to estimate the model for the harvest stage
• Select a few competing models
• Predict the test sample
• Select the final model

• The model will be developed with data from commercial apple growers in Washington State.
• Data are available for different apple varieties for different years (2010-2014) across various locations in the state.
• For this project, we use data only from the Cripps Pink apple variety.
• Data are obtained from 11 locations: East Wenatchee, Lewis Delay, Auvil Chelan, Auvil Brays, Sun Orondo CO, Sunrise, Olmstead Wapato, Ines Kon Pass, Gwen Ballard, Prosser, and Finley.
• Suppressing the index for individual observations, the regression model can be expressed as:
  t = {40, 50, …, 180}, i = {10, 20, …}.
• We chose 180 DAFB (days after full bloom) as the target date for which apple size is to be predicted.
• Each estimated regression first determines the appropriate predictors for its dependent variable; finally, the model for Mean180 is estimated.
• The predictors that seem appropriate to include in the Mean180 model are then sequentially replaced by their own predictors, as identified in the earlier regressions.
Predicting Mean180
[Figure: four panels (m180_1, m180_2, m180_3, m180_4), each showing mean180 and fitted values; both axes run from 2.4 to 3.2.]
Particulars                         Model 1                      Model 2
Predictors                          FB, DAFB, Mean40, Mean130    FB, DAFB, Mean50, Mean60, Mean100
RMFSE, inch                         0.0576                       0.3684
RMFSE, % of Average Harvest Size    1.993%                       12.75%
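
As a quick consistency check on the table, the two RMFSE rows should imply the same average harvest size. The sketch below assumes that the percentage row is simply the RMFSE in inches divided by the average harvest size (that reading of the row label is an assumption, and the helper name is hypothetical).

```python
def implied_average_size(rmfse_inch, rmfse_percent):
    """Average harvest size (inches) implied by an RMFSE given in inches and in percent."""
    return rmfse_inch / (rmfse_percent / 100.0)


model_1 = implied_average_size(0.0576, 1.993)
model_2 = implied_average_size(0.3684, 12.75)
print(f"Model 1 implies {model_1:.2f} in; Model 2 implies {model_2:.2f} in")
```

Both models imply an average of about 2.89 inches, so the two rows agree with each other under that assumption.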

I do spend a significant amount of time modeling genes.

Currently I am working on a project where we are trying to establish early-stage
markers for hepatocellular carcinoma (a type of liver cancer, believed to
occur after chronic liver disease).

The idea is to follow people who have chronic Hepatitis B or C and see
which genes have undergone changes (methylation). Methylation is a reversible
process, so if we find the groups of genes that are methylated early on, maybe
we can stop the cancer early. The end product of the research is a therapy that
might replace chemo (which doesn’t work for liver cancers anyway).

My part is in identifying the specific genes out of the entire genome by
following their expression across the stages of liver disease, in order to
establish pathways for the genes and the proteins involved in the process.
• We had 4 groups:
  • T: Cancer Tumor Group (people with active HCC)
  • C: Chronic Group (people with active Hep B or Hep C)
  • A: Asymptomatic Group (people who have had Hep B/C but have had a liver transplant)
  • B: normal healthy people
• We wanted to see the effect of methylation, or its absence, on some specific genes that are known to be oncogenes or tumor suppressor genes.
• I will share some pictures:





In any testing, false positives (FPs) are always a risk.
But in larger studies (fMRI brain imaging, genomic, proteomic studies),
when thousands of tests are conducted simultaneously, FPs are more than a
nuisance: they can have drastic consequences, including financial ones.
One cannot eliminate FPs, but we can reduce the probability of such an
occurrence.
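
To see why thousands of simultaneous tests change the picture, here is a small sketch. It assumes the tests are independent, all null hypotheses are true, and every test is run at the same level (simplifying assumptions that are not stated above); the function names are made up for illustration. It computes the chance of at least one false positive and shows the Bonferroni adjustment as one standard way to reduce it.

```python
def familywise_error_rate(n_tests, alpha):
    """P(at least one false positive) across n independent true-null tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_level(n_tests, alpha):
    """Per-test level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


for m in (1, 10, 100, 10_000):
    unadjusted = familywise_error_rate(m, 0.05)
    adjusted = familywise_error_rate(m, bonferroni_level(m, 0.05))
    print(f"{m:>6} tests: unadjusted {unadjusted:.3f}, Bonferroni {adjusted:.3f}")
```

With 100 tests at the 0.05 level, a false positive somewhere is already almost certain, which is why the per-test level has to shrink as the number of tests grows.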
One option people often use is looking at the top k genes or pixels (ordered
by some criterion).
One question I am investigating is: what is the expected value of
misclassification using these top-k tables? How do we decide how to pick the
top k? What does k depend on?
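
A minimal simulation sketch of the top-k question, under assumptions that are mine rather than part of the research described here: test statistics are independent normals with unit variance, true signals are shifted by a common `effect`, and "misclassification" is counted as the number of null tests that land in the top k. All names and parameter values are hypothetical.

```python
import random


def expected_false_in_top_k(n_tests, n_true, k, effect, n_sims=2000, seed=0):
    """Monte Carlo estimate of the mean number of null tests ranked in the top k."""
    rng = random.Random(seed)
    total_false = 0
    for _ in range(n_sims):
        # (statistic, is_true_signal) pairs; true signals are shifted by `effect`.
        stats = [(rng.gauss(effect, 1.0), True) for _ in range(n_true)]
        stats += [(rng.gauss(0.0, 1.0), False) for _ in range(n_tests - n_true)]
        top_k = sorted(stats, key=lambda pair: pair[0], reverse=True)[:k]
        total_false += sum(1 for _, is_true in top_k if not is_true)
    return total_false / n_sims


for effect in (0.0, 1.0, 3.0):
    estimate = expected_false_in_top_k(n_tests=50, n_true=5, k=5, effect=effect)
    print(f"effect {effect}: about {estimate:.2f} nulls in the top 5")
```

Even in this toy setting the answer depends on k, on how many true signals there are, and on how far they sit from the nulls, which is the dependence the questions above ask about.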

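\[
\begin{aligned}
P\bigl(u_{(k)} > v_{(N-k-r-1)}\bigr)
&= \int_{-\infty}^{\infty} P\bigl(u_{(k)} > v \mid v\bigr)\, g(v)\, dv \\
&= \int_{-\infty}^{\infty} \int_{v}^{\infty} h(u)\, g(v)\, du\, dv \\
&= \int_{-\infty}^{\infty} \Bigl\{ \int_{v}^{\infty} k\,[\Phi(u)]^{k-1}\, \phi(u)\, du \Bigr\}\, g(v)\, dv \\
&= \frac{(N-k)!}{(N-k-r-1)!\; r!} \int_{-\infty}^{\infty} \Bigl\{ \int_{v}^{\infty} k\,[\Phi(u)]^{k-1}\, \phi(u)\, du \Bigr\}\, [\Psi(v)]^{N-k-r-1}\, \bigl(1-\Psi(v)\bigr)^{r}\, \psi(v)\, dv
\end{aligned}
\]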







I have worked with an anthropologist interested in comparing the hunting
patterns of two tribes.
I have worked with a plant pathologist on potato blight (the cause of the Irish
Famine).
I have worked with an ecologist looking at heavy metal contamination of WA lakes.
I work on many interesting (at least to me) real problems.
For some of these I see immediate applications; for others I have to wait.
But each problem has its own story, and it is interesting to be able to
contribute something to the real world.
Please feel free to ask me questions and email me: dasgupta@wsu.edu