Introduction to data analysis and the course

ANALYSIS OF
BIOLOGICAL DATA
BIOL4062/5062
Hal Whitehead
• Introduction
• Assignments
• Tentative schedule
• Analysis of biological data
Introduction
• Instructors
• Purpose of class
• Related classes
• Books
• Computer programs
http://myweb.dal.ca/~hwhitehe/BIOL4062/handout4062.htm
• Instructor: Hal Whitehead
– LSC3076 (Ph 3723; email hwhitehe@dal.ca)
– Best times: 8:00-9:00 a.m.
• Teaching Assistant: ?
• Other instructors
– Dr David Lusseau
Why “Analysis of Biological Data”?
• Biologists
– increasingly using quantitative techniques
– to analyze larger and larger data sets
– need skills in data analysis
• especially in broad area of ecology
• BIOL4062/5062
– introduce techniques for analysis of biological data
– emphasis will be on the practical use and abuse of
techniques, not derivations or mathematical formulae
– in assignments students explore real and realistic data
sets
Related classes
• Design of Biological Experiments (BIOL4061/5061)
– most useful for those who work with systems that
can be manipulated
• Courses in Statistics
– more emphasis on the mathematical side
Some books (on reserve)
• Legendre, P. and L. Legendre. Numerical
Ecology (2nd edition). Elsevier (1998)
• Manly, B.F.J. Multivariate statistical
methods: a primer (2nd edition). Chapman &
Hall (1994)
• Other books:
– Many, do not need to be right up to date
Computer programs
• Good, comprehensive packages, can do analyses for this class:
– MINITAB
– SPSS
– SYSTAT
– SAS
• More sophisticated and powerful, harder to use:
– MATLAB (Statistics toolbox)
– S-plus
– R
Computer programs
• Support from Hal:
– MINITAB *†
– SPSS *†
– SYSTAT †
– SAS *†
• Support from ?:
– MATLAB (Statistics toolbox)
– S-plus (freely available at Dal.?)
– R † (freely available on the web)
* on GS.DAL.CA
† in Biology-Earth Sciences computer lab
Assignments
• Type 1
– artificial data sets for trying different
techniques
• Type 2
– real data set to try a real analysis
Type 1 assignments
• Five assignments, sent by email (next few days)
• Each worth 10% of final mark
• Artificial but realistic data sets
– Different data sets to each student, but
structurally similar
– More analyses expected for graduate students
(BIOL5062)
• Analyze using a computer statistical
package
Type 1 assignments
• Hand in a short write-up, explaining clearly:
– what you did
– what you found
– what you think the results might mean
biologically
• Beware of:
– Rubbish!
• Check the results against patterns in the original data
to make sure they make sense.
– Over-interpreting the results
– Not answering the questions posed
Type 1 assignments
• Five assignments:
– Multiple regression 10%
– Log-linear models 10%
– Principal components analysis 10%
– Discriminant function analysis 10%
– Cluster analysis, multidimensional scaling, network analysis 10%
Type 2 assignment
• Find a biological data set, and then analyze it
• The analysis should not be:
– part of past, present, or future Honours, MSc or
PhD thesis, or used for another class:
self-plagiarism
– the same as, or a repeat of, an analysis done by
someone else: plagiarism
Type 2 assignment
• The analysis can
– use same data as in thesis or another course, but
totally different analysis
– use data collected by your supervisor, or someone
else, but you should ask them
– use a data set that you find on the web, or
somewhere else, but you should check that it is
OK
– be submitted for publication, but you must check
that you have all necessary permissions
Type 2 assignment
• Minimum sizes of data set (ask Hal for
exceptions or in case of uncertainty):
– For undergraduates (BIOL4062):
• >50 units x >3 variables
– For graduates (BIOL5062)
• >50 units x >5 variables
• either, two types of variables
– e.g. “Dependent; Independent”; “Species; Environment”
• or, link two data sets with one at least as large as the
undergraduate data set
• Must address at least 3 biological questions
(BIOL4062), or 4 questions (BIOL5062)
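The size and question minimums above can be written down as a quick check. This is only a sketch: the function name `meets_minimum` and the table layout are assumptions made for illustration, and it covers just the unit, variable and question thresholds. It does not cover the graduate options of two types of variables or two linked data sets, nor any exceptions agreed with Hal.

```python
# Thresholds copied from the slide. Units and variables are strict minimums
# ("more than"); the number of questions is "at least".
MINIMUMS = {
    "BIOL4062": {"units": 50, "variables": 3, "questions": 3},
    "BIOL5062": {"units": 50, "variables": 5, "questions": 4},
}

def meets_minimum(course, n_units, n_variables, n_questions):
    """Return True if a proposed data set passes the basic size rules.

    Hypothetical helper: it ignores the extra graduate conditions
    (two types of variables, or two linked data sets).
    """
    req = MINIMUMS[course]
    return (
        n_units > req["units"]
        and n_variables > req["variables"]
        and n_questions >= req["questions"]
    )

print(meets_minimum("BIOL4062", 60, 4, 3))  # True
print(meets_minimum("BIOL5062", 60, 5, 4))  # False: needs more than 5 variables
```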
Type 2 assignment (4 steps)
• a) Short meeting with Hal or *** to discuss your
proposed data set and proposed analysis: feedback
– bring draft of 2b assignment
• b) Description of data set and proposed analysis.
– where it came from
– its structure(s) (number of variables, units, names of
variables, types of variables, ...)
– proposed biological questions
– proposed analytical methods
– possible problems
– Example on web
Type 2 assignment (4 steps)
• c (i) Presentation of results to the class by
graduate students
– biological questions being addressed
– brief description of the data set
– how you analyzed it
– conclusions
– Example in Class
• c(ii) Undergraduate students should go to
graduate presentations and will be tested on
general issues arising from them on last day
Type 2 assignment (4 steps)
• d) Write-up of your analysis as for a scientific journal
paper
– Max 5 pages (4062) or 7 pages (5062) single-spaced
• excluding references, tables, figures
– Explain biological question, methods in sufficient detail for
someone to replicate them, problems, and biological
conclusions
– Show graphically, or in tables, the major effects
• Do not just present summaries of ordinations or significance levels of
hypothesis tests
– Introduction and Discussion can be shorter and less detailed
than in published paper
• sufficient to give a good feel for biological issue being examined and
the potential biological significance of the results
– Example on web
Type 2 assignment
• Marks
• 2b Description of data set and proposed
analysis 5%
• 2c 15%
– (i) Presentation of results by graduate students
(BIOL5062)
– (ii) Test on general principles from graduate
student presentations (BIOL4062)
• 2d Write-up of results 30%
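As a check on the arithmetic, the percentages listed for the five Type 1 assignments and the Type 2 components can be added up. The dictionary keys below are labels chosen for this sketch, not official names.

```python
# Percentages as listed on the slides; keys are illustrative labels only.
marks = {
    "Type 1: Multiple regression": 10,
    "Type 1: Log-linear models": 10,
    "Type 1: Principal components analysis": 10,
    "Type 1: Discriminant function analysis": 10,
    "Type 1: Cluster analysis, MDS, network analysis": 10,
    "Type 2b: Description of data set and proposed analysis": 5,
    "Type 2c: Presentation (BIOL5062) or test (BIOL4062)": 15,
    "Type 2d: Write-up of results": 30,
}

total = sum(marks.values())
print(total)  # prints 100
```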
Tentative schedule
Date    Day    Topic
6-Sep   Thurs  Introduction to data analysis and the course
11-Sep  Tues   Modes of statistical analysis
13-Sep  Thurs  Plotting and tabulating data and results
18-Sep  Tues   Introduction to S-plus and R (optional)
20-Sep  Thurs  Correlation
25-Sep  Tues   Linear regression
27-Sep  Thurs  Multiple linear regression, path analysis
2-Oct   Tues   General linear models
4-Oct   Thurs  Introduction to likelihood
9-Oct   Tues   Logistic regression
11-Oct  Thurs  Categorical data and log-linear models
16-Oct  Tues   Introduction to multivariate analysis and multivariate distances
18-Oct  Thurs  Principal Components Analysis
23-Oct  Tues   Network analysis-1
25-Oct  Thurs  Network analysis-2
30-Oct  Tues   Discriminant Function Analysis and Canonical Variate Analysis
1-Nov   Thurs  Canonical Correlation Analysis, Redundancy Analysis and Canonical Correspondence Analysis
6-Nov   Tues   Principal Coordinate Analysis, Correspondence Analysis and Multidimensional Scaling
8-Nov   Thurs  Cluster analyses
13-Nov  Tues   Bootstraps and Jackknives
15-Nov  Thurs  Permutation tests, Mantel tests and matrix correlations
20-Nov  Tues   Graduate presentations
22-Nov  Thurs  Graduate presentations
27-Nov  Tues   Graduate presentations
29-Nov  Thurs  Test for undergraduates (BIOL4062) on grad. student projects
Who: HW for most classes, DL for two
Examples: SYSTAT for most classes (also TREE and S-Plus); SYSTAT demo. at end of lectures
Type 1 Assignments: 1a to 1e are each given out and later due during the term
Analysis of Biological Data
• Types of biological data
• History (very abbreviated!)
• The process of biological data analysis
– why garbage may come out
• Hypothesis testing and data analysis
– assumptions
– other issues
Types of biological data
• Morphometric
• Community ecology
– organism distribution and environmental
variation
• Genetic data for ecological and evolutionary
questions
• Population data for management,
conservation, evolutionary questions
• Behavioural, physiological, ...
Development of biological data analysis
• >~1850 Displays
• >~1900 ANOVAs, regression, correlation
– without computers
• >~1930 Non-parametric methods
• >~1970 Multiple regression and
multivariate analysis
– matrix algebra using computers
• >~1980 Robust methods: bootstraps,
jackknives, permutations
– need powerful computers
Real Biological System → [Stochastic error, Measurement error, Sampling process] → Data → [Model+Assumptions] → Data Analysis → Inferences about Biological System
Garbage in => Garbage out
• Good data + Errors => Garbage in => Garbage out
– Check data entry
• Good data + Errors in routine => Garbage out
– Check results, run routines on data with known answer, run on 2 routines
• Good data + Wrong model => Garbage out
– Think about, read about and discuss model
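The advice to run routines on data with a known answer can be sketched as follows. This is a minimal illustration rather than course material: the helper `fit_line`, the true intercept and slope, and the noise level are all assumptions chosen for the example.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical "known answer": simulate data from y = 2 + 0.5*x plus noise.
rng = random.Random(1)
true_a, true_b = 2.0, 0.5
x = [float(i) for i in range(100)]
y = [true_a + true_b * xi + rng.gauss(0.0, 1.0) for xi in x]

est_a, est_b = fit_line(x, y)
print(f"intercept: true {true_a}, estimated {est_a:.3f}")
print(f"slope:     true {true_b}, estimated {est_b:.3f}")

# A routine that cannot recover the parameters it was fed is suspect.
assert abs(est_b - true_b) < 0.05
assert abs(est_a - true_a) < 1.0
```

The same simulated data could also be run through a second package to compare the two sets of estimates, in the spirit of "run on 2 routines".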
Hypothesis Testing
• Hypothesis → Experimental Design → Experiment → Analysis → Conclusion
• [ANOVA, T-test]
• Agriculture, Experimental ecology, Physiology, Animal behaviour
Data Analysis
• Data Collection → Data Analysis → Hypothesis
• [scatter plots, box plots, most multivariate analyses]
• Fisheries, Community ecology, Paleontology
Some assumptions
• Normality
– can only be properly examined on large data
sets
– mainly a problem on small ones
– an important issue for hypothesis testing
– normality desirable in data analysis
• Linearity
– makes hypothesis testing easier
– makes data analysis easier
• Independence
– major problem for hypothesis testing
– no problem, or advantage, in data analysis
• Transform data or use non-linear or non-parametric methods
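As a rough illustration of the "transform data" remedy, the sketch below compares the skewness of right-skewed values before and after a log transform. The simulated log-normal data, the helper `skewness` and the sample size are assumptions made for the example, and skewness is only one informal indicator of non-normality.

```python
import math
import random

def skewness(values):
    """Sample skewness (third standardized moment); near 0 for symmetric data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)
# Hypothetical right-skewed measurements, drawn from a log-normal distribution.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness of raw data:         {skew_raw:.2f}")
print(f"skewness after log transform: {skew_log:.2f}")
```

With data like these the raw values should be strongly right-skewed and the logged values close to symmetric. Real data will rarely behave this cleanly, so plotting before and after is still worthwhile.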
Other issues in data analysis
• Missing data
– Often present in ecological data
• Outliers
– What do we do with apparent outliers?
– Remove them?
• Multiple comparisons
– Major issue with hypothesis testing
– Not an issue with data analysis
• although: Patterns appear in random data
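The point that patterns appear in random data can be seen in a small simulation. The sketch below generates unrelated random variables, correlates every pair, and counts how many pairs pass a nominal two-sided 5% test. The sample sizes, the helper names and the use of the Fisher z approximation for the test are all assumptions made for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def count_nominally_significant(n_units=50, n_vars=20, seed=0):
    """Count variable pairs whose correlation passes an approximate
    two-sided 5% test (Fisher z approximation) on pure noise."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_units)] for _ in range(n_vars)]
    n_pairs = 0
    n_hits = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            r = pearson_r(data[i], data[j])
            z = math.atanh(r) * math.sqrt(n_units - 3)
            n_pairs += 1
            if abs(z) > 1.96:
                n_hits += 1
    return n_hits, n_pairs

hits, pairs = count_nominally_significant()
print(f"{hits} of {pairs} pairs of random variables pass the 5% test")
```

With 20 variables there are 190 pairs, so roughly 5% of them (about 9 or 10) are expected to pass by chance even though no variable is related to any other.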
Next class:
• Inference in ecology and evolution:
– Null hypothesis statistical tests
– Effect size statistics
– Bayesian statistics
– Information theoretic model comparisons
Performance in BIOL4062/5062
• Graduate students (BIOL5062)
– some do well with rather little effort
– some do well with a lot of effort
• Undergraduate students (BIOL4062)
– most do well with some effort
• adequate statistical background
– some do poorly
• inadequate statistical background or effort