ANALYSIS OF BIOLOGICAL DATA
BIOL4062/5062
Hal Whitehead

• Introduction
• Assignments
• Tentative schedule
• Analysis of biological data

Introduction
• Instructors
• Purpose of class
• Related classes
• Books
• Computer programs
http://myweb.dal.ca/~hwhitehe/BIOL4062/handout4062.htm

• Instructor: Hal Whitehead
  – LSC3076 (Ph 3723; email hwhitehe@dal.ca)
  – Best times: 8:00-9:00 a.m.
• Teaching Assistant: ?
• Other instructors
  – Dr David Lusseau

Why “Analysis of Biological Data”?
• Biologists
  – are increasingly using quantitative techniques
  – to analyze larger and larger data sets
  – need skills in data analysis
    • especially in the broad area of ecology
• BIOL4062/5062
  – introduces techniques for analysis of biological data
  – emphasis will be on the practical use and abuse of techniques, not derivations or mathematical formulae
  – in assignments students explore real and realistic data sets

Related classes
• Design of Biological Experiments (BIOL4061/5061)
  – most useful for those who work with systems that can be manipulated
• Courses in Statistics
  – more emphasis on the mathematical side

Some books (on reserve)
• Legendre, P. and L. Legendre. Numerical Ecology (2nd edition). Elsevier (1998)
• Manly, B.F.J. Multivariate statistical methods: a primer (2nd edition). Chapman & Hall (1994)
• Other books:
  – Many; they do not need to be right up to date

Computer programs
• Good, comprehensive packages, can do analyses for this class:
  – MINITAB
  – SPSS
  – SYSTAT
  – SAS
• More sophisticated and powerful, harder to use:
  – MATLAB (Statistics toolbox)
  – S-plus
  – R

Computer programs
• Support from Hal:
  – MINITAB *†
  – SPSS *†
  – SYSTAT †
  – SAS *†
• Support from ?:
  – MATLAB (Statistics toolbox)
  – S-plus (freely available at Dal.?)
  – R † (freely available on the web)
* on GS.DAL.CA
† in Biology-Earth Sciences computer lab

Assignments
• Type 1
  – artificial data sets for trying different techniques
• Type 2
  – real data set to try a real analysis

Type 1 assignments
• Five assignments, sent by email (next few days)
• Each worth 10% of final mark
• Artificial but realistic data sets
  – Different data sets for each student, but structurally similar
  – More analyses expected from graduate students (BIOL5062)
• Analyze using a computer statistical package

Type 1 assignments
• Hand in a short write-up, explaining clearly:
  – what you did
  – what you found
  – what you think the results might mean biologically
• Beware of:
  – Rubbish!
    • Check the results against patterns in the original data to make sure they make sense.
  – Over-interpreting the results
  – Not answering the questions posed

Type 1 assignments
• Five assignments:
  – Multiple regression 10%
  – Log-linear models 10%
  – Principal components analysis 10%
  – Discriminant function analysis 10%
  – Cluster analysis, multidimensional scaling, network analysis 10%

Type 2 assignment
• Find a biological data set, and then analyze it
• The analysis should not be:
  – part of a past, present, or future Honours, MSc or PhD thesis, or used for another class: self-plagiarism
  – the same as, or a repeat of, an analysis done by someone else: plagiarism

Type 2 assignment
• The analysis can
  – use the same data as in a thesis or another course, but with a totally different analysis
  – use data collected by your supervisor, or someone else, but you should ask them
  – use a data set that you find on the web, or somewhere else, but you should check that it is OK
  – be submitted for publication, but you must check that you have all necessary permissions

Type 2 assignment
• Minimum sizes of data set (ask Hal for exceptions or in case of uncertainty):
  – For undergraduates (BIOL4062):
    • >50 units x >3 variables
  – For graduates (BIOL5062):
    • >50 units x >5 variables
    • either two types of variables
      – e.g. “Dependent; Independent”; “Species; Environment”
    • or link two data sets, with one at least as large as the undergraduate data set
• Must address at least 3 biological questions (BIOL4062), or 4 questions (BIOL5062)

Type 2 assignment (4 steps)
• a) Short meeting with Hal or *** to discuss your proposed data set and proposed analysis: feedback
  – bring draft of 2b assignment
• b) Description of data set and proposed analysis
  – where it came from
  – its structure(s) (number of variables, units, names of variables, types of variables, ...)
  – proposed biological questions
  – proposed analytical methods
  – possible problems
  – Example on web

Type 2 assignment (4 steps)
• c(i) Presentation of results to the class by graduate students
  – biological questions being addressed
  – brief description of the data set
  – how you analyzed it
  – conclusions
  – Example in class
• c(ii) Undergraduate students should go to graduate presentations and will be tested on general issues arising from them on the last day

Type 2 assignment (4 steps)
• d) Write-up of your analysis as for a scientific journal paper
  – Max 5 pages (4062) or 7 pages (5062) single-spaced
    • excluding references, tables, figures
  – Explain biological question, methods in sufficient detail for someone to replicate them, problems, and biological conclusions
  – Show graphically, or in tables, the major effects
    • Do not just present summaries of ordinations or significance levels of hypothesis tests
  – Introduction and Discussion can be shorter and less detailed than in a published paper
    • sufficient to give a good feel for the biological issue being examined and the potential biological significance of the results
  – Example on web

Type 2 assignment
• Marks
• 2b Description of data set and proposed analysis 5%
• 2c 15%
  – (i) Presentation of results by graduate students (BIOL5062)
  – (ii) Test on general principles from graduate student presentations (BIOL4062)
• 2d Write-up of results 30%

Tentative schedule

Date    Day    Topic
6-Sep   Thurs  Introduction to data analysis and the course
11-Sep  Tues   Modes of statistical analysis
13-Sep  Thurs  Plotting and tabulating data and results
18-Sep  Tues   Introduction to S-plus and R (optional)
20-Sep  Thurs  Correlation
25-Sep  Tues   Linear regression
27-Sep  Thurs  Multiple linear regression, path analysis
2-Oct   Tues   General linear models
4-Oct   Thurs  Introduction to likelihood
9-Oct   Tues   Logistic regression
11-Oct  Thurs  Categorical data and log-linear models
16-Oct  Tues   Introduction to multivariate analysis and multivariate distances
18-Oct  Thurs  Principal Components Analysis
23-Oct  Tues   Network analysis-1
25-Oct  Thurs  Network analysis-2
30-Oct  Tues   Discriminant Function Analysis and Canonical Variate Analysis
1-Nov   Thurs  Canonical Correlation Analysis, Redundancy Analysis and Canonical Correspondence Analysis
6-Nov   Tues   Principal Coordinate Analysis, Correspondence Analysis and Multidimensional Scaling
8-Nov   Thurs  Cluster analyses
13-Nov  Tues   Bootstraps and Jackknives
15-Nov  Thurs  Permutation tests, Mantel tests and matrix correlations
20-Nov  Tues   Graduate presentations
22-Nov  Thurs  Graduate presentations
27-Nov  Tues   Graduate presentations
29-Nov  Thurs  Test for undergraduates (BIOL4062) on grad. student projects

Other columns, in listed order:
• Who: HW, except DL for two classes
• Examples: TREE; SYSTAT; S-Plus; SYSTAT for all remaining entries
• Type 1 Assignments: 1a give; 1a due; 1b give; 1c give; 1e give; 1b due; 1d give; 1c due; 1e give; 1e give, 1d due; 1e due
• SYSTAT demo. at end of lectures

Analysis of Biological Data
• Types of biological data
• History (very abbreviated!)
• The process of biological data analysis
  – why garbage may come out
• Hypothesis testing and data analysis
  – assumptions
  – other issues

Types of biological data
• Morphometric
• Community ecology
  – organism distribution and environmental variation
• Genetic data for ecological and evolutionary questions
• Population data for management, conservation, evolutionary questions
• Behavioural, physiological, ...

Development of biological data analysis
• From ~1850: Displays
• From ~1900: ANOVAs, regression, correlation
  – without computers
• From ~1930: Non-parametric methods
• From ~1970: Multiple regression and multivariate analysis
  – matrix algebra using computers
• From ~1980: Robust methods: bootstraps, jackknives, permutations
  – need powerful computers

[Diagram: Real Biological System -> (Stochastic error, Measurement error, Sampling process) -> Data -> Data Analysis (Model + Assumptions) -> Inferences about Biological System]

Garbage in => Garbage out
• Good data + Errors => Garbage in => Garbage out
  – Check data entry
• Good data + Errors in routine => Garbage out
  – Check results; run routines on data with a known answer
  – Run the analysis with 2 different routines
• Good data + Wrong model => Garbage out
  – Think about, read about and discuss the model

[Diagram contrasting two workflows]
• Hypothesis Testing: Hypothesis -> Experimental Design -> Experiment -> Analysis -> Conclusion
  – [ANOVA, T-test]
  – Agriculture, Experimental ecology, Physiology, Animal behaviour
• Data Analysis: Data Collection -> Data Analysis -> Hypothesis
  – [scatter plots, box plots, most multivariate analyses]
  – Fisheries, Community ecology, Paleontology

Some assumptions
• Normality
  – can only be properly examined on large data sets
  – mainly a problem on small ones
  – an important issue for hypothesis testing
  – normality desirable in data analysis
• Linearity
  – makes hypothesis testing easier
  – makes data analysis easier
• Independence
  – lack of independence is a major problem for hypothesis testing
  – no problem, or even an advantage, in data analysis
• If assumptions fail: transform data, or use non-linear or non-parametric methods

Other issues in data analysis
• Missing data
  – Often present in ecological data
• Outliers
  – What do we do with apparent outliers?
  – Remove them?
• Multiple comparisons
  – Major issue with hypothesis testing
  – Not an issue with data analysis
    • although: patterns appear in random data

Next class:
• Inference in ecology and evolution:
  – Null hypothesis statistical tests
  – Effect size statistics
  – Bayesian statistics
  – Information theoretic model comparisons

Performance in BIOL4062/5062
• Graduate students (BIOL5062)
  – some do well with rather little effort
  – some do well with a lot of effort
• Undergraduate students (BIOL4062)
  – most do well with some effort
    • adequate statistical background
  – some do poorly
    • inadequate statistical background or effort
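
The check suggested on the "Garbage in => Garbage out" slide, running a routine on data with a known answer, might be sketched as follows. This is only an illustration in Python, which is not one of the packages listed for the class; the function name `fit_line`, the simulated slope and intercept, the sample size, and the noise level are all assumptions chosen for the example.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Simulate data whose answer is known in advance (hypothetical values).
TRUE_SLOPE = 2.0
TRUE_INTERCEPT = 1.0
rng = random.Random(4062)  # fixed seed so the check is repeatable
x = [rng.uniform(0.0, 10.0) for _ in range(200)]
y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + rng.gauss(0.0, 0.5) for xi in x]

# The routine under test should recover values close to the known ones.
slope, intercept = fit_line(x, y)
print(f"estimated slope = {slope:.3f} (true {TRUE_SLOPE})")
print(f"estimated intercept = {intercept:.3f} (true {TRUE_INTERCEPT})")
```

If the estimates were far from the simulated values, the routine (or the way it is being called) would be suspect before any real data are analysed. The same idea applies to routines in any statistical package.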