Statistical Considerations for Studies Involving Mice (et al.)

ELIZABETH GARRETT-MAYER
PROFESSOR OF BIOSTATISTICS
DIRECTOR, BIOSTATISTICS SHARED RESOURCE
HOLLINGS CANCER CENTER

What started this? ARRIVE

Improving Bioscience Research Reporting: The ARRIVE Guidelines for Reporting Animal Research*

The gist:
- careful consideration should be given to planning experiments with animals
- reported results should include details of the experimental design, including sample sizes, of mouse experiments

“A wealth of evidence shows that across many areas, the reporting of biomedical research is often inadequate, leading to the view that even if the science is sound, in many cases the publications themselves are not ‘fit for purpose,’ meaning that incomplete reporting of relevant information effectively renders many publications of limited value as instruments to inform policy or clinical and scientific practice.”

*Kilkenny et al., PLOS Biology, June 2010.

Why are specifics not reported?

Lots of reasons:
- space
- lack of understanding of their importance
- lack of confidence in methods used or inability to articulate methods
- recognition that results may look less important if all details are included
- they are not required!
- peer reviewers don’t care or understand
- novel approaches often get panned because, for example, the method isn’t a t-test

If you don’t have a good rationale for your planned experimental design, what are you going to say?
- “We experimented on 3 mice, and the p-value was nonsignificant.
So, we continued adding 3 mice to the experiment until we got a significant p-value.”
- “We chose 6 mice per group because we always use 6 mice per group.”
- “We chose 8 mice per group because that was all the budget allowed.”
- “We chose to report the results of differences in tumor size at day 45 because differences at all other days were not statistically significant.”

ARRIVE guidelines: 20 item checklist

TITLE (1) Provide as accurate and concise a description of the content of the article as possible.
ABSTRACT (2) Provide an accurate summary of the background, research objectives (including details of the species or strain of animal used), key methods, principal findings, and conclusions of the study.

INTRODUCTION
Background (3)
  a. Include sufficient scientific background (including relevant references to previous work) to understand the motivation and context for the study, and explain the experimental approach and rationale.
  b. Explain how and why the animal species and model being used can address the scientific objectives and, where appropriate, the study’s relevance to human biology.
Objectives (4) Clearly describe the primary and any secondary objectives of the study, or specific hypotheses being tested.

METHODS
Ethical statement (5) Indicate the nature of the ethical review permissions, relevant licences (e.g. Animal [Scientific Procedures] Act 1986), and national or institutional guidelines for the care and use of animals, that cover the research.
Study design (6) For each experiment, give brief details of the study design, including:
  a. The number of experimental and control groups.
  b. Any steps taken to minimise the effects of subjective bias when allocating animals to treatment (e.g., randomisation procedure) and when assessing results (e.g., if done, describe who was blinded and when).
  c. The experimental unit (e.g. a single animal, group, or cage of animals).
  A time-line diagram or flow chart can be useful to illustrate how complex study designs were carried out.
Experimental procedures (7) For each experiment and each experimental group, including controls, provide precise details of all procedures carried out. For example:
  a. How (e.g., drug formulation and dose, site and route of administration, anaesthesia and analgesia used [including monitoring], surgical procedure, method of euthanasia). Provide details of any specialist equipment used, including supplier(s).
  b. When (e.g., time of day).
  c. Where (e.g., home cage, laboratory, water maze).
  d. Why (e.g., rationale for choice of specific anaesthetic, route of administration, drug dose used).
Experimental animals (8)
  a. Provide details of the animals used, including species, strain, sex, developmental stage (e.g., mean or median age plus age range), and weight (e.g., mean or median weight plus weight range).
  b. Provide further relevant information such as the source of animals, international strain nomenclature, genetic modification status (e.g. knock-out or transgenic), genotype, health/immune status, drug- or test-naïve, previous procedures, etc.
Housing and husbandry (9) Provide details of:
  a. Housing (e.g., type of facility, e.g., specific pathogen free (SPF); type of cage or housing; bedding material; number of cage companions; tank shape and material etc. for fish).
  b. Husbandry conditions (e.g., breeding programme, light/dark cycle, temperature, quality of water etc. for fish, type of food, access to food and water, environmental enrichment).
  c. Welfare-related assessments and interventions that were carried out before, during, or after the experiment.
Sample size (10)
  a. Specify the total number of animals used in each experiment and the number of animals in each experimental group.
  b.
Explain how the number of animals was decided. Provide details of any sample size calculation used.
  c. Indicate the number of independent replications of each experiment, if relevant.
Allocating animals to experimental groups (11)
  a. Give full details of how animals were allocated to experimental groups, including randomisation or matching if done.
  b. Describe the order in which the animals in the different experimental groups were treated and assessed.
Experimental outcomes (12) Clearly define the primary and secondary experimental outcomes assessed (e.g., cell death, molecular markers, behavioural changes).
Statistical methods (13)
  a. Provide details of the statistical methods used for each analysis.
  b. Specify the unit of analysis for each dataset (e.g. single animal, group of animals, single neuron).
  c. Describe any methods used to assess whether the data met the assumptions of the statistical approach.

RESULTS
Baseline data (14) For each experimental group, report relevant characteristics and health status of animals (e.g., weight, microbiological status, and drug- or test-naïve) before treatment or testing (this information can often be tabulated).
Numbers analysed (15)
  a. Report the number of animals in each group included in each analysis. Report absolute numbers (e.g. 10/20, not 50%).
  b. If any animals or data were not included in the analysis, explain why.
Outcomes and estimation (16) Report the results for each analysis carried out, with a measure of precision (e.g., standard error or confidence interval).
Adverse events (17)
  a. Give details of all important adverse events in each experimental group.
  b. Describe any modifications to the experimental protocols made to reduce adverse events.

DISCUSSION
Interpretation/scientific implications (18)
  a.
Interpret the results, taking into account the study objectives and hypotheses, current theory, and other relevant studies in the literature.
  b. Comment on the study limitations including any potential sources of bias, any limitations of the animal model, and the imprecision associated with the results.
  c. Describe any implications of your experimental methods or findings for the replacement, refinement, or reduction (the 3Rs) of the use of animals in research.
Generalisability/translation (19) Comment on whether, and how, the findings of this study are likely to translate to other species or systems, including any relevance to human biology.
Funding (20) List all funding sources (including grant number) and the role of the funder(s) in the study.

Manuscripts vs. grants

Similar principles, but very limited space in grants.
Convince the reviewers that:
- you have a clear question
- you have a method for gathering relevant data
- you can measure the outcomes of interest
- you know what to do with the data
- the results will answer the clear question

1. You have a clear question

- Stated objective and rationale should make it clear what your scientific question is.
- Stating your hypothesis can be very helpful.
  - E.g., the knockout model will have faster tumor growth than the wildtype model.
  - E.g., gene expression will be lower in the mice treated with inhibitor compared to untreated mice.
- This is standard in clinical research but not as common in basic/translational research.

2. You have a method for gathering relevant data

- Experimental design!! Details... we need details.
- If you cannot explain the design to the reviewer, the reviewer cannot understand the data that are generated.
- Be very careful of biases in your design.
- Example: evaluating metastases.
  - The design says mice will be followed until death or until the primary tumor reaches xx mm^3.
  - Differential follow-up time: mice that are followed longer have more time to develop detectable metastases.
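The differential follow-up problem in the metastasis example can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a constant daily hazard of metastasis (an exponential model) that is identical in both groups, and the hazard and follow-up lengths are hypothetical numbers, not values from any study.

```python
import math

def prob_metastasis_detected(hazard_per_day: float, followup_days: float) -> float:
    """P(metastasis occurs within follow-up) under an assumed constant daily hazard."""
    return 1.0 - math.exp(-hazard_per_day * followup_days)

# Hypothetical numbers: both groups share the SAME metastasis hazard,
# but control mice leave the study earlier because their primary tumors
# reach the size limit sooner.
hazard = 0.02            # assumed hazard per mouse-day, identical in both groups
control_followup = 30    # assumed days of follow-up in the control group
treated_followup = 60    # assumed days of follow-up in the treated group

p_control = prob_metastasis_detected(hazard, control_followup)
p_treated = prob_metastasis_detected(hazard, treated_followup)

print(f"control: {p_control:.2f} of mice expected to show metastases")
print(f"treated: {p_treated:.2f} of mice expected to show metastases")
```

Under these assumptions the treated group appears to have more metastases even though the treatment has no effect on metastasis at all; the difference comes entirely from the longer follow-up.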
- Example: comparing tumor size.
  - The design says you will compare tumor volumes at day 60.
  - Preliminary data suggest that you will have had to sacrifice most of the animals in your control group by day 50.
  - How can you compare tumors when the mouse died 10+ days ago?

Randomization?
- important to consider “confounders”
- confounder: a variable that might affect your outcome and that is not related to the experimental conditions
  - e.g., shipping batch
  - e.g., diet/temperature/location of cage

Blinding?
- inherent biases when you know group assignment!
- important to NOT know whether the mouse being evaluated is in the group you expect to do better/worse
- subconscious effects
- in clinical research, taken for granted

Sources of variation?
- transgenic vs. xenograft models?
- how many cell lines? Or, are you using (multiple) primary tumors?
- fresh vs. frozen tissue?
- reference gene? is it really ‘stable’?

Sampling
- longitudinal? same mouse measured repeatedly over time
- separate cohorts? sacrificing different cohorts over time

Measurement
- same vs. different ‘raters’?
- different measurement approaches?

3. How you measure the outcome

Very often see “we will evaluate differences in antitumor efficacy”
- tumor size at time t?
- tumor growth rate?
- presence/absence of metastases?
- tumor take rate?
- caliper vs. imaging measures
Other measures: gene expression, methylation, mutations, histology, etc.
- assays need to be included
- type of measure should be included
  - continuous expression?
  - positive vs. negative?
  - IHC, for example, can be expressed in many ways.

4. You know what to do with the data

Example: You have measured the tumor volume every 5 days for 100 days on 100 mice.
- 21 measures per mouse x 100 mice = 2100 measures.
- Why would you look only at the measures at one time point?
Statistical analysis plan is very important.
There are different approaches for
- continuous outcomes (e.g. tumor volume at time t).
- binary/categorical outcomes (e.g., presence of metastases)
- time-to-event outcomes (e.g., time to death/sacrifice)
There are different approaches for longitudinal data.
Comparing vs. estimating vs. dose finding.

Example

Moussa O, Ashton AW, Fraig M, Garrett-Mayer E, Ghoneim MA, Halushka PV, Watson DK. Novel role of thromboxane receptors beta isoform in bladder cancer pathogenesis. Cancer Research, 2008 Jun 1; 68(11): 4097-4104.

Xenograft mouse model. TCC-SUP tumorigenic human bladder cancer cells were selected as they express TP-β receptor and were used for the drug combination studies. Immortalized nontransformed normal urothelial SV-HUC cells were selected because they express the TP-α. These cells were stably transfected with pcDNA3, TP-α, or TP-β for cell transformation studies. Both cell lines were used in a s.c. model in immunocompromised (nu/nu) mice. TCC-SUP cells (5 × 10^6) or SV-HUC cells (5 × 10^7) in Matrigel (BD Bioscience, Inc.) were injected s.c. into the right and left flanks of anesthetized mice. Tumor growth was monitored in these mice twice a week. For mice injected with TCC-SUP, GR32191 or vehicle control was administered daily (20 mg/kg) by gavage with treatment initiated 24 h after initial injection. Two cycles of cisplatin [single high dose (5 mg/kg) or single low dose (0.5 mg/kg)] were administered at day 4 and day 11 post-tumor cell injection.

The data

[Figure: tumor volume vs. time (days) by treatment group (High Cis+GR, High Cis, Low Cis+GR, Low Cis, GR, Control).]

What are our questions?
- Is the time to tumor initiation different across treatment groups?
  - Is onset later in the cisplatin groups than in the other groups?
- Is the growth rate different across treatment groups?
  - Is the growth rate for high cisplatin smaller or larger than for low cisplatin in the GR group?
These are questions that can be addressed statistically.
Vague questions:
- Is tumor size different? (when?)
- Which treatment is the most effective? (using what metric?)
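The first question above (time to tumor initiation) is a time-to-event question, where mice that never develop a tumor during follow-up are censored rather than dropped. As a minimal sketch of how the proportion tumor-free can be estimated in that setting, here is a plain-Python Kaplan-Meier calculation. The function name and the six-mouse data set are hypothetical illustrations, not the analysis or data from the study.

```python
from typing import List, Tuple

def kaplan_meier(times: List[float], events: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, estimated proportion still tumor-free) at each onset time.

    times  : day of tumor onset, or last day observed if no tumor was seen
    events : True if a tumor was observed at that time, False if censored
    """
    survival = 1.0
    curve = []
    # Step through the distinct days on which at least one tumor appeared.
    for t in sorted(set(t for t, e in zip(times, events) if e)):
        at_risk = sum(1 for u in times if u >= t)  # mice still tumor-free just before t
        onsets = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - onsets / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical data for one treatment group of 6 mice (not from the study):
# four mice develop tumors, two are still tumor-free when follow-up ends at day 60.
days = [10, 14, 14, 21, 60, 60]
tumor_seen = [True, True, True, True, False, False]

km_curve = kaplan_meier(days, tumor_seen)
for day, s in km_curve:
    print(f"day {day}: {s:.3f} tumor-free")
```

Comparing such curves across treatment groups (for example with a log-rank test or a Cox model that yields hazard ratios) would normally be done with a statistics package rather than by hand.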
Time to tumor initiation

Antitumor effects of GR32191 and cisplatin treatment in immunocompromised mice. Subcutaneous tumors from TCC-SUP human bladder cancer cells were treated with vehicle control (12 mice), GR32191 (15 mice), 5 mg/kg cisplatin (cisplatin high; 12 mice), 5 mg/kg cisplatin in combination with GR32191 (13 mice), 0.5 mg/kg cisplatin (single low-dose cisplatin) alone (10 mice), or 0.5 mg/kg cisplatin in combination with GR32191 (10 mice). Tumor size was measured over time. Kaplan-Meier curves showing time to tumor onset across the treatment groups.

[Figure: Kaplan-Meier curves, proportion tumor-free vs. time to tumor (days), by treatment group (High Cis+GR, High Cis, Low Cis+GR, Low Cis, GR, noGR).]

Table 2: Hazard ratios comparing time to tumor in treatment groups. A hazard ratio greater than 1.00 implies that the first treatment has shorter time to tumor than the second in the comparison. For example, the hazard ratio comparing Low Cis vs. Low Cis + GR is 23.5. This implies that, at any given time, for mice who haven’t yet developed a tumor, mice treated with Low Cis were about 23.5 times as likely to have tumor incidence as mice treated with Low Cis + GR. Hazard ratios less than 1 imply a protective effect. For example, the hazard ratio for GR vs. Low is 0.51. This implies that for mice who haven’t yet developed a tumor, those treated with GR are 0.51 times as likely to have tumor incidence as mice treated with Low Cis at any given point in time.

Comparison                 Hazard Ratio   p-value    95% Confidence interval for hazard ratio
No GR vs. GR               2.42           0.02       1.12, 5.22
No GR vs. Low              1.24           0.63       0.51, 2.99
No GR vs. Low + GR         29.1           <0.0001    8.70, 97.2
No GR vs. High             304.1          <0.0001    65.4, 1414.2
No GR vs. High + GR        556.5          <0.0001    113.7, 2722.7
GR vs. Low                 0.51           0.12       0.22, 1.20
GR vs. Low + GR            12.0           0.0001     3.29, 43.79
GR vs. High                125.6          <0.0001    25.2, 626.4
GR vs. High + GR           229.8          <0.0001    43.9, 1203.3
Low vs. Low + GR           23.5           <0.0001    5.88, 93.9
Low vs. High               245.7          <0.0001    45.7, 1320.5
Low vs. High + GR          449.6          <0.0001    79.8, 2531.5
Low + GR vs. High          10.5           <0.0001    3.36, 32.6
Low + GR vs. High + GR     19.1           <0.0001    5.77, 63.4
High vs. High + GR         1.83           0.19       0.73, 4.61

Tumor growth rate: how to compare?

[Figure: tumor volume vs. time (days) by treatment group (High Cis+GR, High Cis, Low Cis+GR, Low Cis, GR, Control).]

Data used for tumor growth analysis: notice that mice with no tumor onset are included from day 60 onward. The remaining mice all have data shown for volumes > 0.

[Figure: tumor volume vs. time from tumor initiation (days) or day 60, by treatment group.]

Fitted regression lines per mouse, by treatment group (stage 1 of two-stage analysis).

[Figure: tumor volume vs. time from injection (days), one fitted line per mouse, by treatment group.]

Estimated regression lines per treatment group (result of stage 2 of two-stage analysis).

[Figure: tumor volume vs. time from injection (days), one estimated line per treatment group.]

Comparisons of slopes

Table 3: P-values for comparing slopes of tumor growth

                 Control    GR      Low Cis   Low Cis + GR   High Cis
GR               0.004
Low Cis          0.0001     0.75
Low Cis + GR     <0.0001    0.47    0.58
High Cis         0.14       0.08    0.01      0.003
High Cis + GR    0.0006     0.89    0.84      0.49           0.03

Simpler ways to deal with the data?

Simple comparisons across groups
Example: two groups of mice have tail vein injections to establish tumors. They are followed for a fixed amount of time and then are sacrificed.
Question 1: All mice get tumors. You want to compare tumor burden in the two groups of mice. What test would you use to compare them?
  a) t-test
  b) Wilcoxon rank sum test
  c) ANOVA
  d) Fisher’s exact test
  e) it depends

Question 2 (same two-group example): All mice get tumors. You want to compare tumor burden in the two groups of mice. What are you comparing if you use a t-test?
  a) the distribution of tumor volume
  b) the distribution of the log of tumor volume
  c) the mean of tumor volume
  d) the mean of log tumor volume
  e) it depends

Question 3 (same two-group example): SOME mice get metastases. You want to compare incidence of metastases. What test should you use?
  a) Chi-square test
  b) Fisher’s exact test
  c) ANOVA
  d) Kaplan-Meier
  e) Signed rank test
  f) it depends

Simpler way to deal with the data?

By using simple approaches, you might be oversimplifying.
- This can hurt you.
  - Example: you average triplicate values
  - you are naively assuming that there is only 1/3 as much data as truly exists
  - this will make your standard errors larger than they should be
- This can invalidate your results.
  - Example: repeated measures on the same mouse assumed to be independent
  - you are naively assuming that there is more data than truly exists
  - this will make your standard errors smaller than they should be

5. Lastly... how many mice should you use?

Stay tuned for next week’s talk by EKG.

Quick aside... interpreting p-values

Definition: the p-value is the probability of getting a result as or more extreme than the one you observed if the null hypothesis is true.
Can anyone translate that?

Hypotheses
- H0: mean gene expression is the same in two groups
- H1: mean gene expression is different in two groups
Next: we do an experiment, we collect data (e.g. gene expression), we perform a test.
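One way to translate the definition is a permutation test: if the null hypothesis is true, the group labels are arbitrary, so we can relabel the mice in every possible way and count how often the relabeled difference in means is as or more extreme than the one observed. The sketch below is a swapped-in illustration, not the test used in the talk; the function name and the expression values are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    as_extreme = 0
    n_splits = 0
    # Enumerate every way of assigning n_a of the pooled values to "group A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        # Small tolerance so floating-point rounding does not miss exact ties.
        if diff >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits

# Hypothetical expression values, 5 mice per group (not real data).
wildtype = [1.0, 1.2, 0.9, 1.1, 1.3]
knockout = [2.1, 1.9, 2.4, 2.0, 2.2]

p = permutation_p_value(wildtype, knockout)
print(f"permutation p-value: {p:.4f}")
```

In this made-up data set every knockout value exceeds every wildtype value, so only 2 of the 252 possible relabelings are as extreme as what was observed, which gives a p-value of about 0.008. With 5 mice per group that is the smallest two-sided p-value this test can produce.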
What will affect the p-value?
- which hypothesis is actually true
- the variance of the values in each group
- the SAMPLE SIZE!!!

What if...?

What if we find a p-value of 0.02 comparing the mean gene expression in two groups?
Your conclusion is... I need more information, including:
- the effect size! How different is the gene expression?
- the sample size! How many samples/animals were in each group?
And, it would be nice to ‘see’ the data, either in a figure or by knowing the means and standard deviations in the two groups.

Two scenarios
- Scenario 1: 5 mice per group. Gene expression is 4 times higher on average in the KO group compared to the WT group. P-value is p=0.02.
- Scenario 2: Tissue microarray. Gene expression is 1.2 times higher in late stage cancers (n=100) compared to the normals (n=50) (p=0.02). There is no significant difference (fold change = 1.05) between early stage cancers (n=100) and normals (p=0.45).

Take home points
1. Never interpret a p-value without additional information.
2. Statisticians can help you. At a minimum, we help you clarify your experimental design and definition of outcomes.
3. Statisticians have many tools in our toolboxes. A t-test is not the hammer and every experiment is not a nail.
4. Sample size justification will be next... a very important (maybe the most important) piece of the statistical considerations in your grant.
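To illustrate the first take-home point, the sketch below holds the difference in means and the spread fixed and changes only the number of animals per group. It is a simplification under stated assumptions: it uses a two-sample z-test with a known common standard deviation instead of a t-test, and the difference, SD, and sample sizes are hypothetical.

```python
import math

def two_sample_z_p_value(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means, assuming a known common SD."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    # For a standard normal Z, P(|Z| >= z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))

# Hypothetical: the same small difference and the same spread, different n.
for n in (5, 50, 500):
    p_n = two_sample_z_p_value(mean_diff=0.2, sd=1.0, n_per_group=n)
    print(f"n = {n:3d} per group -> p = {p_n:.4f}")
```

The same effect size moves from clearly nonsignificant to highly significant as n grows, which is why a p-value on its own says little about how large or how important a difference is.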