Statistical Considerations for
Studies Involving Mice (et al.)
ELIZABETH GARRETT-MAYER
PROFESSOR OF BIOSTATISTICS
DIRECTOR, BIOSTATISTICS SHARED
RESOURCE
HOLLINGS CANCER CENTER
What started this? ARRIVE
 Improving Bioscience Research Reporting: The ARRIVE
Guidelines for Reporting Animal Research*
 The gist:
   Careful consideration should be given to planning experiments with
   animals.
   Reports of results should give details of the experimental designs, including
   sample sizes, of mouse experiments.
 “A wealth of evidence shows that across many areas, the
reporting of biomedical research is often inadequate, leading
to the view that even if the science is sound, in many cases the
publications themselves are not ‘fit for purpose,’ meaning
that incomplete reporting of relevant information effectively
renders many publications of limited value as instruments to
inform policy or clinical and scientific practice.”
*Kilkenny et al., PLOS Biology, June 2010.
Why are specifics not reported?
 Lots of reasons
 space
 lack of understanding of their importance
 lack of confidence in methods used or inability to articulate
methods
 recognition that results may look less important if all details
are included
 they are not required!
 peer reviewers don’t care or don’t understand
 novel approaches often get panned because they aren’t a t-test, for
example.
Why are specifics not reported?
 If you don’t have good rationale for your planned
experimental design, what are you going to say?




“We experimented on 3 mice, and the p-value was non-significant. So, we continued adding 3 mice to the experiment
until we got a significant p-value.”
“We chose 6 mice per group because we always use 6 mice per
group.”
“We chose 8 mice per group because that was all the budget
allowed.”
“We chose to report the results of differences in tumor size at
day 45 because differences at all other days were not
statistically significant.”
ARRIVE guidelines: 20 item checklist
TITLE (1)
 Provide as accurate and concise a description of the content of the article as
possible.
ABSTRACT (2)
 Provide an accurate summary of the background, research objectives
(including details of the species or strain of animal used), key methods,
principal findings, and conclusions of the study.
INTRODUCTION
 Background


(3 a). Include sufficient scientific background (including relevant references to previous
work) to understand the motivation and context for the study, and explain the experimental
approach and rationale.
(3 b.) Explain how and why the animal species and model being used can address the
scientific objectives and, where appropriate, the study’s relevance to human biology.
 Objectives (4) Clearly describe the primary and any secondary objectives of
the study, or specific hypotheses being tested.
ARRIVE guidelines: 20 item checklist
METHODS
 Ethical statement (5) Indicate the nature of the ethical review permissions, relevant licences
(e.g. Animal [Scientific Procedures] Act 1986), and national or institutional guidelines for the
care and use of animals, that cover the research.
 Study design (6) For each experiment, give brief details of the study design, including:
   a. The number of experimental and control groups.
   b. Any steps taken to minimise the effects of subjective bias when allocating animals to treatment (e.g.,
   randomisation procedure) and when assessing results (e.g., if done, describe who was blinded and when).
   c. The experimental unit (e.g. a single animal, group, or cage of animals). A time-line diagram or flow chart can be
   useful to illustrate how complex study designs were carried out.
 Experimental procedures (7) For each experiment and each experimental group, including
controls, provide precise details of all procedures carried out. For example:
   a. How (e.g., drug formulation and dose, site and route of administration, anaesthesia and analgesia used [including
   monitoring], surgical procedure, method of euthanasia). Provide details of any specialist equipment used, including
   supplier(s).
   b. When (e.g., time of day).
   c. Where (e.g., home cage, laboratory, water maze).
   d. Why (e.g., rationale for choice of specific anaesthetic, route of administration, drug dose used).
 Experimental animals (8)
   a. Provide details of the animals used, including species, strain, sex, developmental stage (e.g., mean or median age
   plus age range), and weight (e.g., mean or median weight plus weight range).
   b. Provide further relevant information such as the source of animals, international strain nomenclature, genetic
   modification status (e.g. knock-out or transgenic), genotype, health/immune status, drug- or test-naïve, previous
   procedures, etc.
ARRIVE guidelines: 20 item checklist
 Housing and husbandry (9) Provide details of:
 a. Housing (e.g., type of facility, e.g., specific pathogen free (SPF); type of cage or housing;
bedding material; number of cage companions; tank shape and material etc. for fish).
 b. Husbandry conditions (e.g., breeding programme, light/dark cycle, temperature, quality of water
etc. for fish, type of food, access to food and water, environmental enrichment).
 c. Welfare-related assessments and interventions that were carried out before, during, or after the
experiment.
 Sample size (10)

a. Specify the total number of animals used in each experiment and the number of animals in each
experimental group.
 b. Explain how the number of animals was decided. Provide details of any sample size calculation
used.
 c. Indicate the number of independent replications of each experiment, if relevant.
 Allocating animals to experimental groups (11)

a. Give full details of how animals were allocated to experimental groups, including randomisation
or matching if done.
 b. Describe the order in which the animals in the different experimental groups were treated and
assessed.
 Experimental outcomes (12) Clearly define the primary and secondary experimental
outcomes assessed (e.g., cell death, molecular markers, behavioural changes).
ARRIVE guidelines: 20 item checklist
 Statistical methods (13)
 a. Provide details of the statistical methods used for each analysis.
 b. Specify the unit of analysis for each dataset (e.g. single animal, group of animals, single neuron).
 c. Describe any methods used to assess whether the data met the assumptions of the statistical
approach.
RESULTS
 Baseline data (14) For each experimental group, report relevant characteristics and
health status of animals (e.g., weight, microbiological status, and drug- or test-naïve) before treatment or testing (this information can often be tabulated).
 Numbers analysed (15)


a. Report the number of animals in each group included in each analysis. Report absolute numbers
(e.g. 10/20, not 50%).
b. If any animals or data were not included in the analysis, explain why.
 Outcomes and estimation (16) Report the results for each analysis carried out, with
a measure of precision (e.g., standard error or confidence interval).
 Adverse events (17)


a. Give details of all important adverse events in each experimental group.
b. Describe any modifications to the experimental protocols made to reduce adverse events.
ARRIVE guidelines: 20 item checklist
DISCUSSION
 Interpretation/scientific implications (18)



a. Interpret the results, taking into account the study objectives and hypotheses,
current theory, and other relevant studies in the literature.
b. Comment on the study limitations including any potential sources of bias, any
limitations of the animal model, and the imprecision associated with the results.
c. Describe any implications of your experimental methods or findings for the
replacement, refinement, or reduction (the 3Rs) of the use of animals in research.
 Generalisability/translation (19)
 Comment on whether, and how, the findings of this study are likely to translate to
other species or systems, including any relevance to human biology.
 Funding (20)
 List all funding sources (including grant number) and the role of the funder(s) in
the study.
Manuscripts vs. grants
 Similar principles, but very limited space in grants
 Convince the reviewers
 you have a clear question
 you have a method for gathering relevant data
 you can measure the outcomes of interest
 you know what to do with the data
 the results will answer the clear question
1. You have a clear question
 Stated objective and rationale should make it clear
what your scientific question is.
 Stating your hypothesis can be very helpful.


E.g., the knockout model will have faster tumor growth than
the wildtype model.
E.g., gene expression will be lower in the mice treated with
inhibitor compared to untreated mice.
 This is standard in clinical research but not as
common in basic/translational research.
2. You have a method for gathering relevant data
 Experimental design!!
 Details… we need details.
 If you cannot explain the design to the reviewers, they
cannot understand the data that are generated.
 Be very careful of biases in your design
 Example: evaluating metastases.
   The design says mice will be followed until death or until the primary tumor
   reaches xx mm3.
   Result: differential follow-up time.
 Example: comparing tumor size
   The design says you will compare tumor volumes at day 60.
   Preliminary data suggest that you will have had to sacrifice most of
   the animals in your control group by day 50. How can you compare
   tumors when the mouse died 10+ days ago?
2. You have a method for gathering relevant data
 Randomization?
 important to consider “confounders”
 confounder: a ‘variable’ that might affect your outcome that is
not related to the experimental conditions
 e.g., shipping batch
 e.g., diet/temperature/location of cage
 Blinding?
 inherent biases when you know group assignment!
 important to NOT know if the mouse being evaluated is in the
group you expect to do better/worse.
 subconscious effects. In clinical research, blinding is taken for granted.
2. You have a method for gathering relevant data
 Sources of variation?
   transgenic vs. xenograft models?
   how many cell lines? Or, are you using (multiple) primary
   tumors?
   fresh vs. frozen tissue?
   reference gene? is it really ‘stable’?
 Sampling
   longitudinal? same mouse measured repeatedly over time
   separate cohorts? sacrificing different cohorts over time
 Measurement
   same vs. different ‘raters’?
   different measurement approaches?
3. How you measure the outcome
 Very often see “we will evaluate differences in antitumor efficacy”
   tumor size at time t?
   tumor growth rate?
   presence/absence of metastases?
   tumor take rate?
   caliper vs. imaging measures
 Other measures: gene expression, methylation,
mutations, histology, etc.
   assays need to be included
   type of measure should be included
     continuous expression?
     positive vs. negative?
     IHC, for example, can be expressed in many ways.
4. You know what to do with data
 Example:
   You have measured the tumor volume every 5 days for 100 days on
   100 mice.
   21 measures per mouse x 100 mice = 2100 measures.
   Why would you look only at the measures at one time point?
 Statistical analysis plan is very important.
 There are different approaches for
   continuous outcomes (e.g., tumor volume at time t)
   binary/categorical outcomes (e.g., presence of metastases)
   time-to-event outcomes (e.g., time to death/sacrifice)
 There are different approaches for
   longitudinal data
   comparing vs. estimating vs. dose finding
Example
 Moussa O, Ashton AW, Fraig M, Garrett-Mayer E, Ghoneim
MA, Halushka PV, Watson DK. Novel role of thromboxane
receptors beta isoform in bladder cancer pathogenesis. Cancer
Research, 2008, Jun 1; 68(11): 4097-4104.
Xenograft mouse model. TCC-SUP tumorigenic human bladder cancer cells were
selected as they express TP-β receptor and were used for the drug combination studies.
Immortalized nontransformed normal urothelial SV-HUC cells were selected because they
express the TP-α. These cells were stably transfected with pcDNA3, TP-α, or TP-β for cell
transformation studies. Both cell lines were used in a s.c. model in immunocompromised
(nu/nu) mice. TCC-SUP cells (5 × 10^6) or SV-HUC cells (5 × 10^7) in Matrigel (BD
Bioscience, Inc.) were injected s.c. into the right and left flanks of anesthetized mice. Tumor
growth was monitored in these mice twice a week. For mice injected with TCC-SUP,
GR32191 or vehicle control was administered daily (20 mg/kg) by gavage with treatment
initiated 24 h after initial injection. Two cycles of cisplatin [single high dose (5 mg/kg) or
single low dose (0.5 mg/kg)] were administrated at day 4 and day 11 post-tumor cell
injection
The data
[Figure: tumor volume versus time (days, 0 to 80), by treatment group: High Cis+GR, High Cis, Low Cis+GR, Low Cis, GR, Control.]
What are our questions?
 Is the time to tumor initiation different across treatment
groups?

is onset later in the cisplatin groups than in the other groups?
 Is the growth rate different across treatment groups?

is the growth rate for high cisplatin smaller or larger than for low
cisplatin in the GR group?
 These are questions that can be addressed statistically.
 Vague questions:


Is tumor size different? (when?)
Which treatment is the most effective? (using what metric?)
Time to tumor initiation: antitumor effects of GR32191 and cisplatin treatment in
immunocompromised mice. Subcutaneous tumors from TCC-SUP human bladder
cancer cells were treated with vehicle control (12 mice), GR32191 (15 mice), 5 mg/kg
cisplatin (cisplatin high; 12 mice), 5 mg/kg cisplatin in combination with GR32191 (13
mice), 0.5 mg/kg cisplatin (single low-dose cisplatin) alone (10 mice), or 0.5 mg/kg cisplatin
in combination with GR32191 (10 mice). Tumor size was measured over time. Kaplan-Meier
curves showing time to tumor onset across the treatment groups.
[Figure: Kaplan-Meier curves. x-axis: time to tumor (days, 0 to 80); y-axis: proportion tumor-free (0.0 to 1.0); groups: High Cis+GR, High Cis, Low Cis+GR, Low Cis, GR, noGR.]
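For reference, a Kaplan-Meier curve of the kind described above can be computed by hand. The sketch below is a plain-Python version run on made-up onset times (not the study data); mice that never develop a tumor are treated as censored at their last day of follow-up, which is an assumption about how such mice would be handled.

```python
def kaplan_meier(times, events):
    """Return [(day, proportion tumor-free)] at each day a tumor first appears.

    times  : day of tumor onset, or last day observed if no tumor appeared
    events : 1 if a tumor appeared on that day, 0 if the mouse was censored
    """
    at_risk = len(times)
    tumor_free = 1.0
    curve = []
    for t in sorted(set(times)):
        tied = [e for tt, e in zip(times, events) if tt == t]
        onsets = sum(tied)                   # tumors first seen on day t
        if onsets > 0:
            tumor_free *= 1 - onsets / at_risk
            curve.append((t, tumor_free))
        at_risk -= len(tied)                 # onsets and censored mice both leave the risk set
    return curve

# Hypothetical group of 6 mice followed to day 80; two never develop a tumor.
times = [10, 14, 14, 21, 80, 80]
events = [1, 1, 1, 1, 0, 0]
curve = kaplan_meier(times, events)
for day, s in curve:
    print(f"day {day}: {s:.2f} tumor-free")
```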
Table 2: Hazard ratios comparing time to tumor in treatment groups. A hazard ratio greater than 1.00
implies that the first treatment has shorter time to tumor than the second in the comparison. For example,
the hazard ratio comparing Low Cis vs. Low Cis + GR is 23.5. This implies that, at any given time, among mice
that have not yet developed a tumor, those treated with Low Cis were 23.5 times as likely to develop a tumor
as those treated with Low Cis + GR. Hazard ratios less than 1 imply a protective effect.
For example, the hazard ratio for GR vs. Low Cis is 0.51. This implies that, at any given time, among mice
that have not yet developed a tumor, those treated with GR are 0.51 times as likely to develop a tumor as
those treated with Low Cis.
Comparison                 Hazard ratio   p-value    95% confidence interval for hazard ratio
No GR vs. GR               2.42           0.02       1.12, 5.22
No GR vs. Low              1.24           0.63       0.51, 2.99
No GR vs. Low + GR         29.1           <0.0001    8.70, 97.2
No GR vs. High             304.1          <0.0001    65.4, 1414.2
No GR vs. High + GR        556.5          <0.0001    113.7, 2722.7
GR vs. Low                 0.51           0.12       0.22, 1.20
GR vs. Low + GR            12.0           0.0001     3.29, 43.79
GR vs. High                125.6          <0.0001    25.2, 626.4
GR vs. High + GR           229.8          <0.0001    43.9, 1203.3
Low vs. Low + GR           23.5           <0.0001    5.88, 93.9
Low vs. High               245.7          <0.0001    45.7, 1320.5
Low vs. High + GR          449.6          <0.0001    79.8, 2531.5
Low + GR vs. High          10.5           <0.0001    3.36, 32.6
Low + GR vs. High + GR     19.1           <0.0001    5.77, 63.4
High vs. High + GR         1.83           0.19       0.73, 4.61
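One way to sanity-check a table like this: if the intervals are Wald intervals computed on the log hazard ratio scale (an assumption; the slides do not say how they were computed), then each hazard ratio should sit at the geometric mean of its confidence limits, and the p-value can be recovered from the interval width. The sketch below applies that arithmetic to the first row (No GR vs. GR: hazard ratio 2.42, interval 1.12 to 5.22, p = 0.02). The function name `wald_check` is made up for this example.

```python
import math

def wald_check(hr, lower, upper, z_crit=1.96):
    """Recover the log-HR standard error and two-sided p-value from a 95% Wald CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    geo_mean = math.sqrt(lower * upper)    # should be close to hr if the CI is symmetric on the log scale
    return geo_mean, se, z, p

geo_mean, se, z, p = wald_check(2.42, 1.12, 5.22)
print(f"geometric mean of limits = {geo_mean:.2f}, z = {z:.2f}, p = {p:.3f}")
```

Under that assumption the first row is internally consistent: the geometric mean of the limits lands very close to 2.42 and the recovered p-value rounds to 0.02.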
Tumor growth rate: how to compare?
[Figure: tumor volume versus time (days, 0 to 80), by treatment group: High Cis+GR, High Cis, Low Cis+GR, Low Cis, GR, Control.]
Data used for tumor growth analysis: notice that mice with no tumor onset
are included from day 60 onward. The remaining mice all have data shown
for volumes > 0.
[Figure: tumor volume versus time from tumor initiation (days) or day 60, by treatment group: High Cis+GR, High Cis, Low Cis+GR, Low Cis, GR, Control.]
Fitted regression lines per mouse, by treatment group (stage 1 of
two-stage analysis).
[Figure: tumor volume versus time from injection (days, 0 to 80), one fitted line per mouse, by treatment group: High Cis+GR, High Cis, Low Cis+GR, Low Cis, GR, Control.]
Estimated regression lines per treatment group (result of stage 2 of
two-stage analysis).
[Figure: tumor volume versus time from injection (days, 0 to 80), one estimated line per treatment group: High Cis+GR, High Cis, Low Cis+GR, Low Cis, GR, Control.]
Comparisons of slopes
Table 3: P-values for comparing slopes of tumor growth
                 GR       Low Cis   Low Cis + GR   High Cis   High Cis + GR
Control          0.004    0.0001    <0.0001        0.14       0.0006
GR                        0.75      0.47           0.08       0.89
Low Cis                             0.58           0.01       0.84
Low Cis + GR                                       0.003      0.49
High Cis                                                      0.03
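The two-stage idea can be sketched in a few lines. The slides do not say which models were used at each stage, so the choices here are assumptions: stage 1 fits an ordinary least-squares line to each mouse, and stage 2 compares the per-mouse slopes between two groups with a permutation test. The data are simulated, and all names and numbers are hypothetical.

```python
import random
from statistics import mean

def ols_slope(xs, ys):
    """Stage 1: least-squares slope of ys on xs for one mouse."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def permutation_p(slopes_a, slopes_b, n_perm=5000, seed=1):
    """Stage 2: two-sided permutation test for a difference in mean slope."""
    rng = random.Random(seed)
    observed = abs(mean(slopes_a) - mean(slopes_b))
    pooled = list(slopes_a) + list(slopes_b)
    n_a = len(slopes_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Simulated volumes (mm^3): 8 mice per group, measured every 5 days.
days = [0, 5, 10, 15, 20]
rng = random.Random(7)

def simulate_mouse(true_slope):
    return [100 + true_slope * d + rng.gauss(0, 20) for d in days]

control = [simulate_mouse(40) for _ in range(8)]
treated = [simulate_mouse(15) for _ in range(8)]

control_slopes = [ols_slope(days, v) for v in control]   # one number per mouse
treated_slopes = [ols_slope(days, v) for v in treated]
p_value = permutation_p(control_slopes, treated_slopes)
```

The point of the design is that stage 2 sees one slope per mouse, so the mouse (not the individual measurement) is the unit of analysis.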
Simpler ways to deal with the data?
 Simple comparisons across groups
 Example:
 two groups of mice have tail vein injections to establish
tumors. they are followed for a fixed amount of time and then
are sacrificed.
 Question 1: All mice get tumors. you want to compare
tumor burden in the two groups of mice. What test would you
use to compare them?
a) t-test
b) Wilcoxon rank sum test
c) ANOVA
d) Fisher’s exact test
e) it depends
Simpler ways to deal with the data?
 Simple comparisons across groups
 Example:
 two groups of mice have tail vein injections to establish
tumors. they are followed for a fixed amount of time and then
are sacrificed.
 Question 3: SOME mice get metastases. You want to
compare incidence of metastases. What test should you use?
a) Chi-square test
b) Fisher’s exact test
c) ANOVA
d) Kaplan-Meier
e) Signed rank test
f) it depends
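As an illustration of one of the options above: when groups are small and every mouse is followed for the same fixed time, Fisher's exact test is a common choice for comparing incidence. The sketch below computes the two-sided version by enumerating all 2x2 tables with the observed margins. The counts are hypothetical, and the function name is made up for this example.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups; columns are (metastasis, no metastasis).
    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):  # probability that the top-left cell equals x under the null
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical: 7 of 8 control mice vs. 2 of 8 treated mice develop metastases.
p = fisher_exact_two_sided(7, 1, 2, 6)
print(f"p = {p:.3f}")
```

If follow-up time differed between mice, incidence at sacrifice would no longer be comparable and a time-to-event method would be the more natural fit, which is why "it depends" remains a defensible answer.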
Simpler ways to deal with the data?
 Simple comparisons across groups
 Example:
 two groups of mice have tail vein injections to establish
tumors. they are followed for a fixed amount of time and then
are sacrificed.
 Question 2: All mice get tumors. you want to compare
tumor burden in the two groups of mice. What are you
comparing if you use a t-test?
a) the distribution of tumor volume
b) the distribution of the log of tumor volume
c) the mean of tumor volume
d) the mean of log tumor volume
e) it depends
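A small numerical illustration of why the scale matters: a t-test on raw volumes compares arithmetic means, while the same test on log volumes compares mean log volume (equivalently, geometric means). The sketch computes Welch's t statistic both ways on made-up volumes where one group contains a single very large tumor. No p-values are computed here, and the data are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic for a difference in means (unequal variances allowed)."""
    return (mean(x) - mean(y)) / math.sqrt(variance(x) / len(x) + variance(y) / len(y))

# Hypothetical tumor volumes (mm^3); group B contains one very large tumor.
group_a = [100, 120, 150, 180, 200]
group_b = [150, 200, 260, 300, 2000]

t_raw = welch_t(group_a, group_b)
t_log = welch_t([math.log(v) for v in group_a], [math.log(v) for v in group_b])
print(f"t on raw volumes: {t_raw:.2f}; t on log volumes: {t_log:.2f}")
```

On the raw scale the single large tumor inflates the variance of group B and weakens the comparison; on the log scale its influence is reduced. The two analyses answer different questions, so the choice should be stated in advance.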
Simpler way to deal with the data?
 By using simple approaches, you might be
oversimplifying
 This can hurt you.
   Example: you average triplicate values
     naively assuming that there is only 1/3 as much data as truly exists
     this will make your standard errors larger than they should be
 This can invalidate your results.
   Example: repeated measures on the same mouse
   assumed to be independent
     naively assuming that there is more data than truly exists
     this will make your standard errors smaller than they should be
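The second example can be seen in a quick simulation. The sketch assumes mice differ from one another much more than repeated readings on the same mouse differ (standard deviations of 10 and 2 are made-up values), then compares the standard error of the mean computed two ways.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(3)

# Hypothetical: 6 mice, 10 repeated measurements each.
mice = []
for _ in range(6):
    mouse_level = rng.gauss(100, 10)                       # mouse-to-mouse variation
    mice.append([rng.gauss(mouse_level, 2) for _ in range(10)])  # within-mouse variation

all_readings = [v for mouse in mice for v in mouse]
mouse_means = [mean(mouse) for mouse in mice]

# Naive: treats 60 readings as if they came from 60 independent animals.
se_naive = stdev(all_readings) / math.sqrt(len(all_readings))
# Mouse as the unit of analysis: 6 independent values.
se_by_mouse = stdev(mouse_means) / math.sqrt(len(mouse_means))

print(f"naive SE = {se_naive:.2f}, per-mouse SE = {se_by_mouse:.2f}")
```

The naive standard error comes out smaller because it counts correlated readings as independent information.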
5. Lastly…how many mice should you use?
 Stay tuned for next week’s talk by EKG.
Quick aside… interpreting p-values
 Definition: the p-value is the probability of getting a
result as or more extreme than you observed if the
null hypothesis is true
 Can anyone translate that?
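One way to make the definition concrete is an exact permutation test: if the null hypothesis is true, the group labels are arbitrary, so we can list every way of relabeling the mice and count how often the difference in means is at least as large as the one observed. The expression values below are made up.

```python
from itertools import combinations
from statistics import mean

# Hypothetical expression values for 4 knockout and 4 wild-type mice.
ko = [8.1, 7.4, 9.0, 8.6]
wt = [6.9, 7.2, 6.5, 7.8]

observed = abs(mean(ko) - mean(wt))
pooled = ko + wt
n = len(pooled)

as_extreme = 0
total = 0
for idx in combinations(range(n), len(ko)):   # every way to choose which 4 mice are labeled "KO"
    group1 = [pooled[i] for i in idx]
    group2 = [pooled[i] for i in range(n) if i not in idx]
    total += 1
    if abs(mean(group1) - mean(group2)) >= observed - 1e-12:
        as_extreme += 1

p_value = as_extreme / total
print(f"{as_extreme} of {total} relabelings are as or more extreme: p = {p_value:.3f}")
```

The p-value is simply that fraction: the share of outcomes, in a world where the null is true, that look at least as extreme as the data in hand.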
Hypotheses
 H0: mean gene expression is the same in the two groups
 H1: mean gene expression is different in the two groups
 Next: we do an experiment, we collect data (e.g.
gene expression), we perform a test.
 What will affect the p-value?
 which hypothesis is actually true
 the variance of the values in each group
 the SAMPLE SIZE!!!
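To see the sample size effect in isolation, the sketch below holds the difference in means and the standard deviation fixed and changes only the number of animals per group. It uses a normal approximation for the p-value, which is crude for very small groups; all numbers are hypothetical.

```python
import math

def two_sided_p(mean_diff, sd, n_per_group):
    """Approximate two-sided p-value for a difference in two group means.

    Assumes equal group sizes, a common known SD, and a normal approximation.
    """
    se = sd * math.sqrt(2 / n_per_group)
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2))

# Same hypothetical effect (difference 0.5) and spread (SD 1.0); only n changes.
p_values = {n: two_sided_p(0.5, 1.0, n) for n in (5, 20, 100)}
for n, p in p_values.items():
    print(f"n = {n:3d} per group: p = {p:.4f}")
```

The same effect is far from significant with 5 animals per group and highly significant with 100, which is why a p-value alone says little about how large the effect is.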
What if…?
 What if we find a p-value of 0.02 comparing the
mean gene expression in two groups? Your
conclusion is….
 I need more information, including:
   the effect size! how different is the gene expression?
   the sample size! how many samples/animals were in each group?
 And, it would be nice to ‘see’ the data, either in a figure or by
knowing the means and standard deviations in the two groups
Two scenarios
 Scenario 1: 5 mice per group. Gene expression is 4
times higher on average in the KO group compared
to the WT group. P-value is p=0.02.
 Scenario 2: Tissue microarray. Gene expression is
1.2 times higher in late stage cancers (n=100)
compared to the normals (n=50) (p=0.02). There is
no significant difference (fold change = 1.05)
between early stage cancers (n=100) and normals
(p=0.45).
Take home points
1. Never interpret a p-value without additional
information.
2. Statisticians can help you. At a minimum, we help
you clarify your experimental design and definition
of outcomes.
3. Statisticians have many tools in our toolboxes. A t-test is not the hammer and every experiment is not
a nail.
4. Sample size justification will be next… a very
important (maybe the most important) piece of the
statistical considerations in your grant.