Models for Integrating Statistics in Biology Education

Laura Kubatko — The Ohio State University
Danny Kaplan — Macalester College
Jeff Knisley — East Tennessee State University
Models for Integrating Statistics
in Biology Education:
The Ohio State University
The Ohio State University
• Approximately 38,000 undergraduates on main
campus in Columbus, OH
• Six biology departments, offering eight distinct
majors
– 2,300 majors in biological sciences
• Also undergraduate programs in medical fields,
environmental sciences, etc.
• Variability in mathematical and statistical
requirements across majors
The Ohio State University
• Growing presence in mathematical biology
– NSF-funded Mathematical Biosciences Institute
– Associated faculty hires, joint appointments, etc.
– Degree programs under development
• M.S. in Mathematical Biology
• Track with undergrad Math Major for Mathematical Biology
– NSF-funded UBM Program (2008-2013)
Development of Curriculum in
Mathematical Biology at OSU
• At the request of the College of Biological
Sciences
• Four courses:
– Calculus for the Life Sciences I and II
– Statistics for the Life Sciences
– Mathematical Modeling
• Student population: Freshmen life science
majors who place into calculus
Development of Curriculum in
Mathematical Biology at OSU
• Goal: All biology majors take calculus; Statistics
and Modeling are optional
• Considerations in designing statistics course:
– Build on calculus sequence
– Satisfy requirements for statistics courses in majors
– Include analysis of actual data sets
– Introduce computing
Statistics for the Life Sciences
• Three 48-minute lectures per week
– Traditional lecture format (with some activities)
• Two 48-minute labs in computer room (taught by GTA)
– Half lab activities, half problem-solving sessions
– StatCrunch software used for data analysis
• Advantages: Runs in Java, easy to use in a 10-week course
• Disadvantage: Availability after course
Statistics for the Life Sciences
• Four data sets integrated in lecture and lab
throughout the quarter:
– Magnetic field data (Barnothy, 1964)
– Fisher’s iris data set (Fisher, 1936)
– Limnology data (collected in 1993 at the
James H. Barrow Field Station)
– Forest composition data (collected in 1993 at
the James H. Barrow Field Station)
Statistics for the Life Sciences
• Overview of Topics:
– Descriptive statistics, graphical displays (1 week)
– Probability, including Bayes’ Theorem (1 week)
– Discrete distributions, analyzing categorical data
(2.5 weeks)
– One- and two-sample inference for means and
variances (2.5 weeks)
– Experimental design (1 week)
– Correlation and regression (1.5 weeks)
Successes
• Student feedback: course very useful
• Use of GTAs in all of the courses: assist in
training interdisciplinary teachers
• MBI post-docs have been involved in calculus
projects
• Several students recruited into our UBM program
Challenges
• Enrollment!!
– Not required for any students at present
– Shifts in administrative structure in College of
Biological Sciences
– Decreasing enrollment in Calculus for Life
Sciences (only one section next year)
– Students have full schedules in their first year
Challenges
• Experience of students
– Freshmen: may only have one or two courses
in biology, and often none in genetics
• Selection of topics
– 10-week course
– Balance coverage of fundamental ideas with
more current topics
The Future
• As OSU converts to semesters, work to have these
courses included formally in the appropriate places in
biology majors
• Work more closely with Center for Life Science Education
to understand how to integrate this course better with
other experiences of these students
• Enhance lab activities
• Possibly use R for data analysis
The Future
• More broadly, the math-bio curriculum at OSU
continues to grow
• UBM program has recently hired its first group of 9
Undergraduate Research Fellows
• New course: Undergraduate Seminar in
Mathematical Biology
• New majors/minors will soon be available
More Information
• Syllabus, Lab Material available at
http://www.stat.osu.edu/~lkubatko/CAUSEwebinar/
Models for Integrating Statistics in
Biology Education:
Macalester College’s Program
The Revolution in the Biosciences
The biological and medical sciences have changed
dramatically in the last half century.
• The dominance of molecular biology and genetics.
Example: the sequencing of whole genomes.
• Dramatic improvements in instrumentation and
techniques. Example: DNA microarrays.
• The emergence of the clinical trial, large cohort
studies, and “scientific medicine.” Example: The
Framingham Heart Study ongoing from 1948.
Biology used to be a haven for non-quantitative
students with an interest in science.
Now it is data-intensive: Large and multivariate.
What Statistics Do We Teach?
The typical statistics course required by a biology
department is:
• Single-variable. There is a treatment group and a
control group that are alike in every other way.
• Emphasizes small data sets. n = 3 is pretty common,
perhaps reaching up to n = 20.
• Warns about “lurking” or “confounding” variables,
but offers no way to deal with them except
randomization. We do t-tests and one-way ANOVA,
not multiple regression.
• Has no university-level mathematics pre-requisite.
Example: DNA Microarrays
An array of thousands to tens of thousands of small
dots of different features (DNA oligonucleotides) that
can probe which genes are being expressed at a
given time.
From the Wikipedia article http://en.wikipedia.org/wiki/DNA_microarray
DNA Microarrays: Statistics
“A basic difference between microarray data analysis and much
traditional biomedical research is the dimensionality of the data.
A large clinical study might collect 100 data items per patient for
thousands of patients. A medium-size microarray study will
obtain many thousands of numbers per sample for perhaps a
hundred samples. Many analysis techniques treat each sample
as a single point in a space with thousands of dimensions, then
attempt by various techniques to reduce the dimensionality of
the data to something humans can visualize.”
“Experimenters must account for multiple comparisons: even if the
statistical P-value assigned to a gene indicates that it is
extremely unlikely that differential expression of this gene was
due to random rather than treatment effects, the very high
number of genes on an array makes it likely that differential
expression of some genes represent false positives or false
negatives.”
From the Wikipedia article http://en.wikipedia.org/wiki/DNA_microarray
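The multiple-comparisons point in the quotation can be made concrete with a small simulation. What follows is only a sketch in Python (the courses described here use R and StatCrunch). It assumes a hypothetical array of 10,000 genes, none of which is truly differentially expressed, so every p-value is uniform on [0, 1]. The function name and the choice of a Bonferroni correction are illustrative assumptions, not part of the talk.

```python
import random

def count_false_positives(n_genes, alpha, seed=0):
    """Count 'significant' genes when no gene has a real effect.

    Under the null hypothesis each p-value is uniform on [0, 1],
    so we simulate p-values directly (an assumption of this sketch).
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_genes)]
    # Naive rule: test every gene at level alpha.
    naive = sum(p < alpha for p in p_values)
    # Bonferroni rule: divide alpha by the number of tests.
    bonferroni = sum(p < alpha / n_genes for p in p_values)
    return naive, bonferroni

naive_hits, bonferroni_hits = count_false_positives(10_000, 0.05)
print("naive false positives:", naive_hits)
print("Bonferroni false positives:", bonferroni_hits)
```

With 10,000 null genes the naive rule is expected to flag roughly 500 of them, which is why a per-gene P-value cannot be read in isolation.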
What do Medical Residents Know about Statistics?
From D. M. Windish, S. J. Huot, M. L. Green (2007) “Medical Resident's Understanding of the Biostatistics and Results in the Medical Literature,” JAMA 298 (9): 1010-1017
Questions & Responses
• If we don’t teach biology students about multiple
variables and the complications that arise from
them, where are they going to learn about this?
• Biology students are NOT strong enough
mathematically to handle multivariate material.
So why do we think they are going to learn it on
their own?
• You have to learn the basics first. Crawl before
walking. Walk before running. If the plan is for
students to take a “second course” in statistics, is
there any evidence that this plan is working?
Assumptions We Made in Revising
our Introductory Quantitative
Curriculum
• We would have only two semester courses in which to
provide material that students can use to study biology
in a sophisticated way.
• It was our job to figure out how to make the material
accessible to the students we have.
• The technical skills to work with multivariate data are
important.
• We want students to have a good theoretical
understanding of the material, not just technical skills.
• Our courses would be suitable for students preparing
to take Calc I. No requirement for previous work in
calculus.
Our Goals
• Common foundation for all students, more or less
regardless of their earlier preparation. (Students who
are ready to take Calc III are the exception: they do
that, although some opt to take Applied Calculus.)
• Provide skills and concepts that are directly and
concretely relevant to the follow-up courses students
will take in other areas, e.g., biology, economics, ... NOT
“this teaches them to think rigorously”
• Add value to the student’s existing mathematical
knowledge. Not so important to refine that existing
knowledge (e.g., learn how to do symbolic integrals) but
to EXTEND it in ways that the student would not be able
to do on his or her own. (Why do we think that students
can learn multivariate stuff on their own?)
The Constraints We Faced
• Students come from different entry points.
• Students have to be prepared to do calculus-based physics.
• Pre-meds have to have a calculus course. (But
there are good reasons to make it calculus,
too.)
• Statistics had to be accessible to mid-level
mathematics students as a stand-alone course.
• Students cannot be channelled into a special
section or a special course: they typically don’t
know their major when they enter.
Broad View
• In order to teach students about multivariate statistics, we need
for them to know something about multivariate functions. So to
teach statistics, we also had to teach calculus in a manner that
would be useful for statistics.
– What a linear approximation looks like.
– What a quadratic approximation looks like. (Including interactions.)
– What a partial derivative is.
• The program had to be organized as two distinct, stand-alone
courses: one in calculus and one in statistics.
– Some programs require a calculus course, and many students and
parents expect a calculus course, so one of the courses would be
calculus. This does NOT mean it has to be about the chain rule, the
quotient rule, etc.
– Some programs require a statistics course, and many students
come in with some calculus already, so the statistics course had to
be accessible to them.
Broad View (cont.)
• Macalester is small (1800 students), and
students don’t necessarily know their
major when they start. So it wouldn’t work
to have specialized courses just for biology
majors. The new courses had to be
suitable for the mainstream student.
• We wanted biology students (and others)
to get a reasonable introduction to
computation. This includes the
organization of data and a familiarity with
the structure of computer commands.
Calculus, Mathematics, and Statistics
• Calculus and statistics are taught as if they have little in
common.
• There are actually very strong connections in terms of
modeling and the interpretation of statistical models.
• The problem is that students don’t have a language for
talking about modeling, change and difference. So the
statistics course is forced to focus on very simple
descriptions, e.g., are these group means different?
• Why?
– Calculus topics were almost entirely established BEFORE
1900. Statistics starts AFTER 1900.
– Mathematicians usually have no training in statistics
whatsoever.
The way calculus is taught should change in order to support
statistics.
Comment on Calculus and Statistics
There is a strong link between calculus and statistics, but
many people assume that it is about:
– Integrating probability densities
– Using derivatives to optimize: e.g. finding the least
squares fit.
Neither of these is particularly important. Students can
understand areas without calculus. Least squares can be
completely explained without derivatives.
The more important links are:
– Approximating relationships with functions (esp. linear
and quadratic functions)
– Describing rates of change: how one variable changes
with another
– The idea of partial change: the consequences of
changing one variable while holding others constant.
– Ideas of linear combinations: subspaces, projections,
collinearity, redundancy.
Before Bio2010 ... there was
CRAFTY
• Mathematical Association of America
project on “Curriculum Reform in the First
Two Years”
• A dozen CRAFTY workshops in 1999–2001
covered a broad range of STEM
fields — biology, chemistry, computer
science, engineering in various flavors,
mathematics, statistics, physics.
• The conclusions reached are remarkably
consistent across all disciplines (and,
broadly, with Bio2010).
CRAFTY Recommendations
CRAFTY calls for much greater emphasis on ...
• Mathematical modeling, the process of
constructing a representation of an object, system,
or process that can be manipulated using
mathematical operations.
• Statistics and data analysis.
• Multivariate topics. The reports refer specifically
to two- and three-dimensional topics. Many of the
topics mentioned are related to the traditional
calculus sequence (including linear algebra,
differential equations, and multivariable calculus)
— we’ll refer to these topics as “calculus.”
• The appropriate use of computers.
• Bio 2010: Transforming Undergraduate Education for Future Research Biologists.
National Research Council (U.S.), Committee on Undergraduate Biology Education to Prepare Research Scientists for the 21st Century.
National Academies Press, 2003.
Bio2010 & Mathematics/CS
RECOMMENDATION #1.5 Quantitative analysis, modeling,
and prediction play increasingly significant day-to-day
roles in today’s biomedical research. To prepare for this
sea change in activities, biology majors headed for
research careers need to be educated in a more
quantitative manner .... The committee recommends that
all biology majors master the concepts listed below. —
Bio2010, pp. 41-46
Topics are organized by
• Calculus
• Linear Algebra
• Dynamical Systems
• Probability and Statistics
• Information and Computation
• Data Structures
See appendix for a detailed list
Commentary on BIO2010
The recommendations are certainly ambitious and laudable, but…
• The recommendations seem to have been formed without any
time-budget constraint.
• Some of them are vague and there is no prioritization of them.
Examples: “the integral”, “integration over multiple variables.”
Does this mean the concept of accumulation, or rules for
symbolic integration?
• The statistics topics are out of line with current thought on
“statistical literacy” and “statistical thinking.”
• To follow them with the courses currently available at most
schools, every biology major would have to major in
mathematics as well.
• Even though it might be impractical to cover all of the Bio2010
quantitative topics, a good majority can be covered in a
coherently organized two-course sequence.
Outline of our “Applied Calculus”
Course
One-semester course. Pre-requisite to
“Statistical Modeling.”
1. Modeling basics
2. Derivatives and change
3. Differential equations (emphasis on
phenomena: growth, stability, oscillation)
4. Linear algebra (emphasis on geometry as
it applies to statistics)
See slides in the appendix
Introduction to Statistical Modeling
• Organization and (simple) descriptions of data.
• Construction of (linear) statistical models. This includes
multiple variables and nonlinear terms, esp. interactions.
• Adjustment for covariation. The idea of “partial change.”
• Inference:
– Confidence intervals and the effects of collinearity.
– Analysis of Covariance. Central question: Does this variable
contribute to the explanation?
– Resampling and bootstrapping.
• Causation & Experimental design: Randomization,
blocking, and orthogonality.
• Logistic regression and non-parametrics.
For the preface and outline, see
http://www.macalester.edu/~kaplan/ISM
Example of a Case Study: Nitrogen
Fixing by Plants
Macalester Biologist Mike Anderson studies the
ecology of nitrogen fixing bacteria.
Students are given the data he collected in field
studies of alder bushes in Alaska.
• Measured nitrogen fixation.
• The genotype ID of the bacteria on each plant’s
roots.
• The characteristics of the site: e.g., soil
temperatures at 1cm and 5cm, water content in
soil.
• The time in the season when the data were
taken.
Case Study: Nitrogen Fixing
(continued)
The analysis involves modeling nitrogen fixation by
these other explanatory variables, taking into
account the highly non-normal distribution of the
nitrogen fixation, and the strong collinearity among
the explanatory variables.
• Naive models indicate strongly that fixation varies
among genotypes (p < 0.001), one-way ANOVA.
• Using analysis of covariance, the p-value is
reduced even further (p < 0.0001). However, ...
• The association with genotype is completely
captured by the covariates of site characteristics,
especially when non-parametric techniques are
used.
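How a covariate can "capture" an apparent genotype effect can be sketched with simulated data. The example below is hypothetical throughout: it does not use the alder field data, the variable names and numbers are invented, and the adjustment is a crude one (compare residuals after a straight-line fit on temperature) rather than the analysis of covariance or non-parametric methods used in the course. It is written in Python, although the course itself uses R.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for y ~ x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def simulate(n=2000, seed=1):
    """Hypothetical sites: temperature drives fixation AND genotype."""
    rng = random.Random(seed)
    temp, genotype, fixation = [], [], []
    for _ in range(n):
        t = rng.gauss(10, 2)                        # soil temperature
        g = 1 if t + rng.gauss(0, 1) > 10 else 0    # genotype 1 favors warm sites
        f = 2.0 * t + rng.gauss(0, 1)               # fixation ignores genotype
        temp.append(t)
        genotype.append(g)
        fixation.append(f)
    return temp, genotype, fixation

def group_gap(values, genotype):
    """Mean of genotype-1 values minus mean of genotype-0 values."""
    ones = [v for v, g in zip(values, genotype) if g == 1]
    zeros = [v for v, g in zip(values, genotype) if g == 0]
    return mean(ones) - mean(zeros)

temp, genotype, fixation = simulate()
naive_gap = group_gap(fixation, genotype)

# Crude adjustment: remove the temperature trend, then compare genotypes.
intercept, slope = fit_line(temp, fixation)
residuals = [f - (intercept + slope * t) for t, f in zip(temp, fixation)]
adjusted_gap = group_gap(residuals, genotype)

print("naive genotype gap:", round(naive_gap, 2))
print("gap after adjusting for temperature:", round(adjusted_gap, 2))
```

In this simulation genotype has no effect on fixation by construction, yet the naive comparison shows a large gap because genotype and temperature are collinear. The gap largely disappears once temperature is accounted for.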
Approach in Both Courses
• Multivariate from the beginning. Lets us treat F
= ma seriously, but also look at interesting
biology models, e.g., predator-prey, nerve-cell,
SEIR, damped harmonic oscillator, …
• De-emphasis on algebraic manipulation.
Geometry used: Contours, gradients,
directional derivatives, subspaces, ...
• Computation integrated into both courses. We
use R, a statistics package.
• Simulations, e.g.,
– Motion in the phase plane.
– Hypothetical causal networks
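As an illustration of "motion in the phase plane," here is a minimal sketch of a predator-prey simulation. It is written in Python rather than the R used in the courses. The Lotka-Volterra form, the parameter values, the starting point, and the fourth-order Runge-Kutta stepper are all assumptions chosen for illustration, not the course's actual materials.

```python
import math

# Hypothetical parameters for the Lotka-Volterra equations:
#   dx/dt = A*x - B*x*y   (prey)
#   dy/dt = C*x*y - D*y   (predator)
A, B, C, D = 1.0, 0.5, 0.5, 1.0

def rates(x, y):
    """Right-hand side of the predator-prey equations."""
    return A * x - B * x * y, C * x * y - D * y

def rk4_step(x, y, dt):
    """Advance (x, y) by one classical Runge-Kutta step of size dt."""
    k1x, k1y = rates(x, y)
    k2x, k2y = rates(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y)
    k3x, k3y = rates(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y)
    k4x, k4y = rates(x + dt * k3x, y + dt * k3y)
    x_new = x + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
    y_new = y + dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6
    return x_new, y_new

def trajectory(x0, y0, dt=0.01, steps=2000):
    """List of (prey, predator) points tracing the phase-plane orbit."""
    points = [(x0, y0)]
    x, y = x0, y0
    for _ in range(steps):
        x, y = rk4_step(x, y, dt)
        points.append((x, y))
    return points

def invariant(x, y):
    """Quantity that stays constant along exact Lotka-Volterra orbits."""
    return C * x - D * math.log(x) + B * y - A * math.log(y)

orbit = trajectory(2.0, 1.0)
drift = abs(invariant(*orbit[-1]) - invariant(*orbit[0]))
print("points on orbit:", len(orbit))
print("drift in conserved quantity:", drift)
```

Plotting predator against prey from `orbit` gives the closed loop around the equilibrium at (D/C, A/B). The conserved quantity serves as a check that the numerical orbit stays on one loop.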
Some Successes of Our Program
• The courses genuinely cover many of the Bio2010
topics.
• The courses have been popular with both students
and faculty.
– Fully one-third of the student body at Macalester takes
Applied Calculus.
– One-quarter takes Introduction to Statistical Modeling.
• Math/Statistics faculty enjoy teaching the courses.
• They have become the mainstream courses and
are taught by multiple faculty in multiple sections
each semester.
Some Failures of our Program
• The topics, skills, and techniques haven’t
been picked up in the downstream biology
courses.
• We still don’t offer an easy route to a
reasonable education in computing. We think
we would need to have a three-course
sequence in order to do this well.
Toward the Future
• Introduction to Statistical Modeling
– A textbook, exercises, class activities, etc. are available now
in draft form and will be published this summer.
– Workshops on ISM at the US Conference on Teaching
Statistics (Columbus, OH, June 23-25, 2009) and the Joint
Mathematics Meetings (San Francisco, January 2010). See
www.macalester.edu/~kaplan/ISM
• An NSF CCLI Phase 2 proposal: Building a Community
around Modeling, Statistics, Computation, and Calculus.
See www.macalester.edu/~kaplan/MSCC
• The plan is to provide support for faculty who want to
develop materials and who want to adopt materials that
unify modeling, statistics, computation, and calculus in the
quantitative curriculum.
Thanks to ...
• W.M. Keck Foundation for their support of Introduction to
Statistical Modeling through the Keck Data Fluency project
grant.
• The Howard Hughes Medical Institute, which funded the
first three years of the project: the original “Calculus with
Biological Applications” and “Statistics with Biological
Applications.”
• The Macalester biology department, esp. Jan Serie, who
sponsored the original project and agreed to require their
students to take these courses even before they were fully
developed.
• Other Macalester faculty involved in teaching and
developing these courses: Tom Halverson, Karen Saxe,
Dan Flath, David Bressoud (current president of the
Mathematical Association of America), Victor Addona,
Chad Topaz, Andrew Beveridge.
APPENDICES
• See
www.macalester.edu/~kaplan/ISM/CauseMay2009.pdf
Models for Integrating Statistics in
Biology Education:
The Symbiosis Project
East Tennessee State University
Symbiosis: An Introductory
Integrated Mathematics and
Biology Curriculum for the 21st
Century (HHMI 52005872)
• Team-taught by Biologists (6),
Mathematicians (3), and Statisticians (1)
– Biologists progress to needs for analyses,
models, or related concepts (e.g., optimization)
– A complete intro stats and calculus curriculum via
the needs and contexts provided by the biologists
(presentation is primarily about our experiences
working with our biologists)
Goals of the Symbiosis Project
• Implement a large subset of the
recommendations of the BIO2010 report in an
introductory lab science sequence
– Semester 1: Statistics + Precalculus, Limits,
Continuity
– Semester 2: Completion of a Calculus I course +
Statistics
(Our focus on Semesters 1 and 2)
– Semester 3: Modeling, BioInformatics, reinforcement
of previous ideas, More Statistics
Goals of the Symbiosis Project
• Use Biological contexts to motivate
mathematical and statistical concepts and
tools
– Analysis of data used to inform and interpret
– Models and inference used to predict and explain
• Use Mathematical concepts and Statistical
Inference to produce biological insights
– Insights often need to be quantified if only to
predict the scale on which the insight is valid
– Especially useful are insights that cannot be
obtained without resorting to mathematics or
statistics
Table of Contents
• Symbiosis I and II
– List of “modules” with topics selected by
biologists
– Mathematical and Statistical Highlights included
(Not enough time to explore Symbiosis III)
• Logistics: 5 + 1 format, student populations
between 7 and 30, and 3 or 4 faculty per
course
Symbiosis I
1. The Scientific Method: Numbers, models, binomial,
Randomization Test, Intro to Statistical Inference
2. The Cell: Descriptive Statistics and Correlation
3. Size and Scale: Lines, power laws, fractals, Poisson,
exponentials, logarithms, and linear regression
4. Mendelian Genetics: Chi-Square, Normal, Goodness
of Fit Test, Test of Independence
5. DNA: Conditional Probability, the Markov Property,
Sampling distributions
6. Proteins and Evolution: Limits, continuity,
approximations, and the t-test
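The goodness-of-fit test paired with Mendelian genetics in module 4 can be sketched in a few lines. This is a Python illustration, not Symbiosis course material. It uses Mendel's classic published dihybrid pea counts (which do not come from these slides) and the standard 5% critical value for 3 degrees of freedom; the function name is an assumption of the sketch.

```python
def chi_square_statistic(observed, expected_ratio):
    """Pearson chi-square statistic for counts against a ratio hypothesis."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Mendel's dihybrid cross: round-yellow, round-green,
# wrinkled-yellow, wrinkled-green seed counts.
observed_counts = [315, 108, 101, 32]

# Independent assortment predicts a 9:3:3:1 ratio.
statistic = chi_square_statistic(observed_counts, [9, 3, 3, 1])

# Standard table value for alpha = 0.05 with 4 - 1 = 3 degrees of freedom.
CRITICAL_VALUE_DF3 = 7.815
consistent_with_ratio = statistic < CRITICAL_VALUE_DF3

print("chi-square statistic:", round(statistic, 3))   # about 0.47
print("consistent with 9:3:3:1 at the 5% level:", consistent_with_ratio)
```

The statistic falls far below the critical value, so these counts give no evidence against the 9:3:3:1 hypothesis.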
Symbiosis II
7. Population Ecology: Derivatives, Rates of Change,
Power, Product, Quotient rules, Differential Equations
8. Species-Species Interactions: Chain rule, Properties
of the Derivative, Differential Equations Qualitatively,
Equilibria, Parameter Estimation
9. Behavioral Ecology: Optimization, curve-sketching,
L’Hôpital’s rule
10. Chronobiology: Trigonometric functions and their
derivatives, Periodograms
11. Integration and Plant Growth: Antiderivatives,
Definite Integrals, and the Fundamental Theorem
12. Energy and Enzymes: Applications of the Integral,
differential equations methods, Nonlinear Regression
Major Outcomes
• Complete and/or Comprehensive Biological
Investigations
– Traditional Bio Curriculum: Biological questions
pursued to a point short of quantitative analysis
– Symbiosis: Data and Models used to explore
biological questions and predict answers
• Mendelian genetics via chi-square analysis of
data
• r- and K-strategists based on logistic model and its
solutions, including N(t) = K as an equilibrium
solution
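The role of N(t) = K as an equilibrium of the logistic model can be checked numerically. The sketch below is in Python and assumes the standard form dN/dt = rN(1 - N/K) with its closed-form solution. The growth rate, carrying capacity, and function names are hypothetical values chosen for illustration, not taken from the Symbiosis materials.

```python
import math

R_GROWTH = 0.8      # hypothetical intrinsic growth rate r
K_CAPACITY = 500.0  # hypothetical carrying capacity K

def logistic_rate(n, r=R_GROWTH, k=K_CAPACITY):
    """Right-hand side of dN/dt = r * N * (1 - N / K)."""
    return r * n * (1 - n / k)

def logistic_solution(t, n0, r=R_GROWTH, k=K_CAPACITY):
    """Closed-form solution N(t) for initial population n0 > 0."""
    return k / (1 + ((k - n0) / n0) * math.exp(-r * t))

# Starting exactly at K: the rate is zero, so the population stays at K.
rate_at_k = logistic_rate(K_CAPACITY)
stays_at_k = logistic_solution(5.0, K_CAPACITY)

# Starting well below or above K: the population approaches K over time.
from_below = logistic_solution(50.0, 10.0)
from_above = logistic_solution(50.0, 900.0)

print("dN/dt at N = K:", rate_at_k)
print("N(5) when N(0) = K:", stays_at_k)
print("N(50) from N(0) = 10:", from_below)
print("N(50) from N(0) = 900:", from_above)
```

Solutions starting on either side of K converge to it, which is the sense in which K acts as a stable equilibrium in the model.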
Aspects of Integration
• Biologists need or can use almost all the
math and stats we can provide
– But their goals are radically different
• Statistical inference as a tool for justifying
classification of organisms into different categories
• Models as a means of separating different
phenomena
– And the results are used to address their (often
non-quantitative) questions
• E.g.: Simple epidemiological models used to
suggest whether or not mosquitoes can carry the
AIDS virus
Aspects of Integration
• Statisticians and Mathematicians can
contribute to biology in a variety of ways
– But transparency is paramount
• Examples of techniques “Transparent” to our biologists:
The Randomization test, Chi-square, Periodograms,
Nonlinear Regression, phase-plane analysis
– Or time/effort must be devoted to importance of
subtleties within biological contexts
• Example: Logarithms and exponentials with base e.
(Why not just use base 10 for everything?)
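Of the techniques listed above as transparent, the randomization test is perhaps the easiest to write out in full, which may be part of its appeal. The following is a minimal sketch in Python (Symbiosis itself uses R). The function name, the choice of the difference in means as the test statistic, and the example measurements are all hypothetical.

```python
import random
from statistics import mean

def randomization_test(group_a, group_b, n_shuffles=10_000, seed=0):
    """Two-sided randomization test for a difference in group means.

    Repeatedly shuffles the pooled values, re-splits them into groups
    of the original sizes, and reports the fraction of shuffles whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical measurements for a treatment and a control group.
treatment = [10, 11, 12, 13, 14]
control = [1, 2, 3, 4, 5]

p_separated = randomization_test(treatment, control)
p_identical = randomization_test([1, 2, 3, 4], [1, 2, 3, 4])

print("p-value, well-separated groups:", p_separated)
print("p-value, identical groups:", p_identical)
```

The logic requires no distributional theory: if the group labels did not matter, shuffling them should often produce a difference as large as the one observed.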
Observation
• The issues preventing “downstream” usage of math
and stats by biologists and their students
– Start as small issues at the most elementary levels
• Nearly all of Symbiosis module 1 addresses the difference
between a scientific hypothesis and a statistical hypothesis
• Surface area to volume ratio: First we must agree on notation.
• Is a math idea that holds for an arbitrary f(x) also always true
for a population with density N(t) at time t?
– And grow into major obstacles
• E.g.: If time is not spent exploring what a biologist means by a
population density, ecological models may become impossible
to interpret biologically.
• Statistical results are useless if based on invalid assumptions
(e.g., populations of same species may differ quantitatively)
Further Insights
• Computing and Computational Science
have emerged as major components
– Informatics, genetics, proteomics, …
– And Even in Ecology!
• Programming in R
– Need is for math/stat informed algorithms
– Not for elaborate structures or
sophisticated programming languages
Further Insights
• Logistics are a challenge
– Transcripts are important!!!
– Course sizes / delivery methods differ significantly
• Biology lectures can be huge
• Biology labs are typically smaller than math/stat sections
• (I had never had to consider how to combine a lab grade
with a lecture grade)
• Communication is very important, especially
about the “little issues” that tend to grow
Future Directions for Symbiosis
• An “Integrated Courses” model
– Separate Math/Biology courses
• Better for transcript
• Allows familiar examination techniques
– Common Curriculum
• Same materials as 5+1 courses
• Biology section maintains the lab component
• This is a re-constitution of Symbiosis,
not a replacement for it!!!
– i.e., a better (logistically, in particular)
approach to what we are doing now
Future Directions for Symbiosis
• More emphasis on computation
– Algorithms as method to address biological
inquiries
– Algorithms as statistical tools
• Inference via bootstrapping
• Predictions via clustering
• Informatics
• Avoiding reliance on “off-the-shelf” approaches
• Symbiosis IV: A Gen Ed “Intro to
Computational Science” course for math
and bio majors
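"Inference via bootstrapping" is one example of an algorithm serving as a statistical tool. Below is a minimal sketch of a percentile bootstrap confidence interval for a mean, in Python rather than R. The function name, the number of resamples, and the sample values are hypothetical, and the percentile method is only one of several bootstrap interval constructions.

```python
import random
from statistics import mean

def bootstrap_ci(sample, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Cut off equal tails on each side of the sorted resample means.
    lower_index = round((1 - level) / 2 * n_resamples)
    upper_index = round((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]

# Hypothetical measurements, for illustration only.
measurements = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4]

ci_low, ci_high = bootstrap_ci(measurements)
print("sample mean:", round(mean(measurements), 3))
print("95% bootstrap interval:", round(ci_low, 3), "to", round(ci_high, 3))
```

Writing the resampling loop by hand, rather than calling a packaged routine, is in the spirit of avoiding reliance on "off-the-shelf" approaches.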
Thank you!
Any questions?