Lecture 13
Analysis of variance

Spiders on Mazurian lake islands: Wigry, Mikołajki, Nidzkie, Bełdany
Photo: Wigierski Park Narodowy
Photo: Ruciane.net
Salticidae, Araneus diadematus (Photo: Eurospiders.com)

Spider species richness on Mazurian lake islands

Island                  Disturbance  Species
Górna E                 High         33
Kopanka                 High         34
Kopanka N               High         32
Piaseczna               High         38
Górna W                 High         29
Królewski Ostrów        Medium       51
Wygryńska               Medium       43
Maleńka                 Low          6
Ruciane - ląd           Low          28
Mikołajki - ląd         Low          75
Wierzba                 Low          47
Kamień                  Low          60
Mysia Wigry             Low          49
Ordów                   Low          64
Koń                     Pristine     25
Mała Wierzba            Pristine     27
Ośrodek                 Pristine     22
Śluza                   Pristine     19
Bryzgiel                Pristine     21
Bryzgiel - ląd          Pristine     46
Brzozowa L              Pristine     31
Brzozowa P              Pristine     30
Cimochowski Grądzik C   Pristine     25
Cimochowski Grądzik N   Pristine     31
Cimochowski Grądzik S   Pristine     25
Krowa                   Pristine     34
Ostrów                  Pristine     93
Rośków                  Pristine     24
Walędziak               Pristine     28
Wysoki Węgieł           Pristine     57

Species richness grouped by disturbance:
High:      33 34 32 38 29
Medium:    51 43
Low:       6 28 75 47 60 49 64
Pristine:  25 27 22 19 21 46 31 30 25 31 25 34 93 24 28 57

Does species richness differ with respect to the degree of disturbance?

Pairwise t-tests (probabilities)
          Medium     Low        Pristine
High      0.145265   0.172254   0.931288
Medium               1          0.081749
Low                             0.211812

Single test: $p(nsig) = 1 - p(sig)$
n independent tests:
$p_{Exp}(nsig) = (1 - p_{test}(sig))^n$
$p_{Exp}(sig) = 1 - (1 - p_{test}(sig))^n \approx 1 - (1 - n\,p_{test}(sig)) = n\,p_{test}(sig)$

If we use the same test several times with the same data we have to apply a Bonferroni correction:
$p_{test} = \frac{p_{Exp}}{n} = \frac{0.05}{n}$

Bonferroni corrected
          Medium     Low        Pristine
High      0.857544   0.862042   0.988548
Medium               1          0.846958
Low                             0.868635

One-way analysis of variance
Sir Ronald Aylmer Fisher (1890-1962)

[Figure: the four groups with their means $\bar x_H$, $\bar x_M$, $\bar x_L$, $\bar x_P$ and variances $s_H^2$, $s_M^2$, $s_L^2$, $s_P^2$, the total variance $s_T^2$ and the between-group variance $s_{Between}^2$]

If there were no difference between the sites, the average within-group variance $s_{Within}^2$ should equal the variance between the sites $s_{Between}^2$.

$F = \frac{s_{Between}^2}{s_{Within}^2}$

We test for significance using the F-test of Fisher with k-1 (Between) and n-k (Within) degrees of freedom.

$n - 1 = (n - k) + (k - 1)$
$df_{Total} = df_{Within} + df_{Between}$

$s_{Between}^2 = \frac{\sum_{i=1}^{k} n_i (\bar x_i - \bar x_{Total})^2}{k - 1} = \frac{SS_{Between}}{df_{Between}}$

$s_{Within}^2 = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{i,j} - \bar x_i)^2}{n - k} = \frac{SS_{Within}}{df_{Within}}$

$s_{Total}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar x_{Total})^2}{n - 1} = \frac{SS_{Total}}{df_{Total}}$

$SS_{total} = SS_{between} + SS_{within}$
$df_{total} = df_{between} + df_{within}$
$MS = \frac{SS}{df}$
$F = \frac{MS_{Between}}{MS_{Within}}$

Welch test:
$t = \frac{\bar x_1 - \bar x_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

The Levene test compares the group variances using the F distribution. Variances should not differ too much (the data should not be heteroskedastic).

The Tukey test simultaneously compares the means of all combinations of groups.
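The one-way decomposition above can be tried directly on the spider data. The following is a minimal sketch using only the Python standard library; the function name `one_way_anova` and the variable names are illustrative choices, not part of the lecture. It returns F and the two degrees of freedom and leaves the probability to an F table or a statistics package.

```python
from statistics import mean

# Species richness per disturbance class, copied from the table above.
groups = {
    "High": [33, 34, 32, 38, 29],
    "Medium": [51, 43],
    "Low": [6, 28, 75, 47, 60, 49, 64],
    "Pristine": [25, 27, 22, 19, 21, 46, 31, 30, 25, 31, 25, 34, 93, 24, 28, 57],
}


def one_way_anova(samples):
    """Return (SS_between, SS_within, df_between, df_within, F) for a list of groups."""
    values = [x for sample in samples for x in sample]
    grand_mean = mean(values)
    # Between groups: squared distance of each group mean from the grand mean,
    # weighted by the group size n_i.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Within groups: squared distance of each observation from its own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)
    df_between = len(samples) - 1            # k - 1
    df_within = len(values) - len(samples)   # n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_value


ss_between, ss_within, df_between, df_within, f_value = one_way_anova(
    list(groups.values())
)
print(f"SS_between = {ss_between:.1f} (df = {df_between})")
print(f"SS_within  = {ss_within:.1f} (df = {df_within})")
print(f"F = {f_value:.3f}")
```

The F value is then compared with the F distribution for k-1 and n-k degrees of freedom.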
The Tukey test is a t-test corrected for multiple comparisons (similar to a Bonferroni correction).

Worked example: four treatments with five observations each

Observations             A        B        C        D
1                        0.08     0.19     0.83     2.80
2                        0.71     1.21     0.71     2.69
3                        0.19     1.97     1.10     1.93
4                        0.51     0.19     0.11     2.57
5                        0.73     0.19     0.30     2.58
Group mean               0.445    0.750    0.611    2.515

SS within (squared deviations from the group mean)
1                        0.131    0.319    0.046    0.081
2                        0.070    0.216    0.010    0.032
3                        0.065    1.484    0.244    0.342
4                        0.004    0.314    0.250    0.004
5                        0.082    0.312    0.096    0.004
Total SS within          4.11

SS between (squared deviation of the group mean from the grand mean, counted once per observation)
1 to 5                   0.404    0.109    0.220    2.059
Total SS between         13.96

Grand mean               1.08

Grand SS (squared deviations from the grand mean)
1                        1.00     0.80     0.06     2.96
2                        0.14     0.02     0.14     2.61
3                        0.79     0.79     0.00     0.72
4                        0.32     0.79     0.94     2.23
5                        0.12     0.79     0.61     2.24
Grand SS                 18.07
SS between + SS within   18.07

F                        18.14
F-test (probability)     2.118E-05

We include the effect of island complex: Wigry versus NBM (Nidzkie, Bełdany, Mikołajki).

Island                  Complex  Disturbance  Species
Górna E                 NBM      High         33
Kopanka                 NBM      High         34
Kopanka N               NBM      High         32
Piaseczna               NBM      High         38
Górna W                 NBM      High         29
Królewski Ostrów        NBM      Medium       51
Wygryńska               NBM      Medium       43
Maleńka                 NBM      Low          6
Ruciane - ląd           NBM      Low          28
Mikołajki - ląd         NBM      Low          75
Wierzba                 NBM      Low          47
Kamień                  Wigry    Low          60
Mysia Wigry             Wigry    Low          49
Ordów                   Wigry    Low          64
Koń                     NBM      Pristine     25
Mała Wierzba            NBM      Pristine     27
Ośrodek                 NBM      Pristine     22
Śluza                   NBM      Pristine     19
Bryzgiel                Wigry    Pristine     21
Bryzgiel - ląd          Wigry    Pristine     46
Brzozowa L              Wigry    Pristine     31
Brzozowa P              Wigry    Pristine     30
Cimochowski Grądzik C   Wigry    Pristine     25
Cimochowski Grądzik N   Wigry    Pristine     31
Cimochowski Grądzik S   Wigry    Pristine     25
Krowa                   Wigry    Pristine     34
Ostrów                  Wigry    Pristine     93
Rośków                  Wigry    Pristine     24
Walędziak               Wigry    Pristine     28
Wysoki Węgieł           Wigry    Pristine     57

There must be at least two data points for each combination of groups.
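The requirement of at least two data points per combination can be checked by counting islands per cell. This is a small sketch in plain Python; the list `design` encodes only the Complex and Disturbance columns of the table above, and the names are illustrative.

```python
from collections import Counter

# (complex, disturbance) per island, grouped as in the table above.
design = (
    [("NBM", "High")] * 5
    + [("NBM", "Medium")] * 2
    + [("NBM", "Low")] * 4
    + [("Wigry", "Low")] * 3
    + [("NBM", "Pristine")] * 4
    + [("Wigry", "Pristine")] * 12
)

cell_counts = Counter(design)
complexes = ["NBM", "Wigry"]
levels = ["High", "Medium", "Low", "Pristine"]

# A disturbance level can enter the two-way design only if every
# island complex contributes at least two islands to it.
usable_levels = [
    level
    for level in levels
    if all(cell_counts[(c, level)] >= 2 for c in complexes)
]

for level in levels:
    counts = ", ".join(f"{c}: {cell_counts[(c, level)]}" for c in complexes)
    print(f"{level:<9} {counts}")
print("Usable disturbance levels:", usable_levels)
```

High and Medium disturbance occur only in the NBM complex, so only the Low and Pristine islands can be used in a two-way design with island complex.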
We use a simple two-way ANOVA.

Island                  Complex  Disturbance  Species
Maleńka                 NBM      Low          6
Ruciane - ląd           NBM      Low          28
Mikołajki - ląd         NBM      Low          75
Wierzba                 NBM      Low          47
Kamień                  Wigry    Low          60
Mysia Wigry             Wigry    Low          49
Ordów                   Wigry    Low          64
Koń                     NBM      Pristine     25
Mała Wierzba            NBM      Pristine     27
Ośrodek                 NBM      Pristine     22
Śluza                   NBM      Pristine     19
Bryzgiel                Wigry    Pristine     21
Bryzgiel - ląd          Wigry    Pristine     46
Brzozowa L              Wigry    Pristine     31
Brzozowa P              Wigry    Pristine     30
Cimochowski Grądzik C   Wigry    Pristine     25
Cimochowski Grądzik N   Wigry    Pristine     31
Cimochowski Grądzik S   Wigry    Pristine     25
Krowa                   Wigry    Pristine     34
Ostrów                  Wigry    Pristine     93
Rośków                  Wigry    Pristine     24
Walędziak               Wigry    Pristine     28
Wysoki Węgieł           Wigry    Pristine     57

$SS_{total} = SS_A + SS_B + SS_{A \times B} + SS_{error}$
Main effects: $SS_A$, $SS_B$ (here $SS_{Complex}$ and $SS_{Disturbance}$)
Secondary effects: $SS_{A \times B}$ (here $SS_{Complex \times Disturbance}$)

The significance levels have to be divided by the number of tests (Bonferroni correction).

Spider species richness does not depend significantly on island complex or on degree of disturbance.
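The effect of dividing the significance level by the number of tests can be illustrated numerically. The sketch below assumes an experiment-wide level of 0.05 and treats the tests as independent. It uses the six pairwise t-test probabilities from the beginning of the lecture, with the pairing of values read from the upper-triangular layout of that table. Counting the three F-tests of a two-way ANOVA (two main effects and the interaction) as one family is an assumption made for the example.

```python
def familywise_error(p_test: float, n_tests: int) -> float:
    """Probability of at least one 'significant' result among n independent tests."""
    return 1.0 - (1.0 - p_test) ** n_tests


def bonferroni_level(p_exp: float, n_tests: int) -> float:
    """Per-test significance level for an experiment-wide level p_exp."""
    return p_exp / n_tests


# Pairwise t-test probabilities between disturbance classes (from the lecture table).
pairwise_p = {
    ("High", "Medium"): 0.145265,
    ("High", "Low"): 0.172254,
    ("High", "Pristine"): 0.931288,
    ("Medium", "Low"): 1.0,
    ("Medium", "Pristine"): 0.081749,
    ("Low", "Pristine"): 0.211812,
}

n_pairs = len(pairwise_p)
uncorrected_risk = familywise_error(0.05, n_pairs)
level_pairs = bonferroni_level(0.05, n_pairs)
level_two_way = bonferroni_level(0.05, 3)  # assumed family: A, B and A x B

significant_pairs = [pair for pair, p in pairwise_p.items() if p < level_pairs]

print(f"Chance of at least one false positive in {n_pairs} tests: {uncorrected_risk:.3f}")
print(f"Bonferroni level for {n_pairs} pairwise tests: {level_pairs:.4f}")
print(f"Bonferroni level for 3 two-way ANOVA tests: {level_two_way:.4f}")
print("Significant pairs after correction:", significant_pairs)
```

Without correction, six tests at the 0.05 level carry a risk of roughly one in four of producing at least one spurious significant result.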
Correcting for covariates: Analysis of covariance

Disturbance is coded 1 = High, 2 = Medium, 3 = Low, 4 = Pristine.

Island                  Complex  Disturbance  Area [ha]  Species
Górna E                 NBM      1            0.7        33
Koń                     NBM      4            0.5        25
Kopanka                 NBM      1            0.69       34
Królewski Ostrów        NBM      2            6.15       51
Maleńka                 NBM      3            0.0003     6
Mała Wierzba            NBM      4            0.4        27
Kopanka N               NBM      1            0.18       32
Ośrodek                 NBM      4            0.09       22
Piaseczna               NBM      1            0.63       38
Ruciane - ląd           NBM      3            15         28
Mikołajki - ląd         NBM      3            20         75
Śluza                   NBM      4            0.48       19
Górna W                 NBM      1            0.44       29
Wierzba                 NBM      3            0.78       47
Wygryńska               NBM      2            0.67       43
Bryzgiel                Wigry    4            0.2        21
Bryzgiel - ląd          Wigry    4            16         46
Brzozowa L              Wigry    4            3.81       31
Brzozowa P              Wigry    4            2.32       30
Cimochowski Grądzik C   Wigry    4            0.15       25
Cimochowski Grądzik N   Wigry    4            0.14       31
Cimochowski Grądzik S   Wigry    4            0.76       25
Kamień                  Wigry    3            3.13       60
Krowa                   Wigry    4            4.49       34
Mysia Wigry             Wigry    3            1.55       49
Ordów                   Wigry    3            8.69       64
Ostrów                  Wigry    4            38.82      93
Rośków                  Wigry    4            0.56       24
Walędziak               Wigry    4            0.76       28
Wysoki Węgieł           Wigry    4            18         57

[Figure: species number plotted against area with the fitted power function $y = 33.431 x^{0.1917}$, $R^2 = 0.7215$]

Instead of using the raw data we use the residuals. These are the area-corrected species numbers.

The comparison of within-group residuals and between-group residuals gives our F-statistic.

$SS_{total} = SS_{between} + SS_{error}$, where $SS_{total}$ comes from the total residuals and $SS_{error}$ from the within-group residuals.

We need four regression equations: one from all data points and three within groups.

Disturbance does not significantly influence area-corrected species richness.

Repetitive designs

[Figure: patients measured before and after medical treatment, with $SS_{between}$ (between patients), $SS_{within}$ (within patients) and $SS_{total}$]

In medical research we test patients before and after medical treatment to infer the influence of the therapy.

We have to divide the total variance ($SS_{total}$) into a part that contains the variance between patients ($SS_{between}$) and a part within the patient ($SS_{within}$).
The latter can be divided into a part that comes from the treatment ($SS_{treat}$) and the error ($SS_{error}$).

Here n is the number of subjects (patients, islands), k the number of treatments, $\bar P_i$ the mean of subject i, $\bar T_j$ the mean of treatment j, and $\bar x$ the grand mean.

$SS_{total} = SS_{between} + SS_{within}$
$SS_{within} = SS_{treat} + SS_{error}$

$SS_{total} = \sum_{i=1}^{n} \sum_{j=1}^{k} (x_{ij} - \bar x)^2$
$SS_{between} = k \sum_{i=1}^{n} (\bar P_i - \bar x)^2$
$SS_{within} = \sum_{i=1}^{n} \sum_{j=1}^{k} (x_{ij} - \bar P_i)^2$
$SS_{treat} = n \sum_{j=1}^{k} (\bar T_j - \bar x)^2$
$SS_{error} = \sum_{i=1}^{n} \sum_{j=1}^{k} (x_{ij} - \bar P_i - \bar T_j + \bar x)^2$

$df_{total} = df_{between} + df_{within} = df_{between} + df_{treat} + df_{error}$
$kn - 1 = (n - 1) + n(k - 1) = (n - 1) + (k - 1) + (n - 1)(k - 1)$

$F = \frac{SS_{treat} / df_{treat}}{SS_{error} / df_{error}} = \frac{SS_{treat}\, df_{error}}{SS_{error}\, df_{treat}}$, with $df_{treat} = k - 1$ and $df_{error} = (n - 1)(k - 1)$

Before-after analysis in environmental protection

Island                  Spring  Summer  Autumn  Mean P  SS Error
Górna E                 26      14      22      21      15.2350
Koń                     19      10      16      15      9.7935
Kopanka                 21      17      15      18      11.4201
Królewski Ostrów        50      46      47      48      8.0776
Maleńka                 6       5       4       5       29.2908
Mała Wierzba            25      19      21      22      4.8073
Kopanka N               28      17      23      23      4.5288
Ośrodek                 16      15      12      14      31.6569
Piaseczna               34      25      29      29      0.6449
Ruciane - ląd           22      15      13      17      10.9042
Mikołajki - ląd         43      39      26      36      120.5322
Śluza                   12      10      7       10      19.2222
Górna W                 19      10      11      13      5.0881
Wierzba                 29      25      23      25      13.5529
Wygryńska               26      18      26      23      18.3658
Bryzgiel                15      11      14      14      16.5768
Bryzgiel - ląd          44      23      28      32      98.0089
Brzozowa L              22      20      13      18      45.4914
Brzozowa P              29      17      23      23      6.8833
Cimochowski Grądzik C   19      15      17      17      9.5781
Cimochowski Grądzik N   29      25      29      28      17.3434
Cimochowski Grądzik S   14      8       14      12      13.2847
Kamień                  37      21      37      32      93.9253
Krowa                   19      11      13      14      0.0904
Mysia Wigry             32      16      29      26      60.1676
Ordów                   37      25      25      29      18.7889
Ostrów                  77      50      57      61      193.4698
Rośków                  21      14      17      17      2.5835
Walędziak               14      8       13      12      9.1076
Wysoki Węgieł           32      19      19      24      28.6667
Mean                    27      19      21

Grand mean = 23
$SS_{treat}$ = 1115.30, $df_{treat} = k - 1 = 2$
$SS_{error}$ = 917.0866 (sum of the SS Error column), $df_{error} = (n - 1)(k - 1) = 58$

In the case of unequal variances between groups it is safe to use the conservative ANOVA with (n-1) $df_{error}$ and only one $df_{Effect}$ in the final F-test.
Results of the before-after analysis

$SS_{treat} / SS_{error} = 1115.30 / 917.0866 = 1.2161338$

Full ANOVA ($df_{treat} = 2$, $df_{error} = 58$): $F = 1.2161338 \times 58 / 2 = 35.26788$, p(F) ≈ 9.5E-11
Conservative ANOVA ($df_{treat} = 1$, $df_{error} = 29$): $F = 1.2161338 \times 29 / 1 = 35.26788$, p(F) = 1.885E-06

Bivariate comparisons in environmental protection

Island                  Complex  Area [ha]  Species  Predicted species  Residual
Górna E                 NBM      0.7        33       31.22156           1.778435
Koń                     NBM      0.5        25       29.27129           -4.27129
Kopanka                 NBM      0.69       34       31.13556           2.864436
Królewski Ostrów        NBM      6.15       51       47.35619           3.643813
Maleńka                 NBM      0.0003     6        7.060143           -1.06014
Mała Wierzba            NBM      0.4        27       28.04557           -1.04557
Kopanka N               NBM      0.18       32       24.06496           7.935042
Ośrodek                 NBM      0.09       22       21.07064           0.929363
Piaseczna               NBM      0.63       38       30.59729           7.402711
Ruciane - ląd           NBM      15         28       56.18315           -28.1831
Mikołajki - ląd         NBM      20         75       59.3686            15.6314
Śluza                   NBM      0.48       19       29.04312           -10.0431
Górna W                 NBM      0.44       29       28.5627            0.437301
Wierzba                 NBM      0.78       47       31.87601           15.12399
Wygryńska               NBM      0.67       43       30.9605            12.0395
Bryzgiel                Wigry    0.2        21       24.55595           -3.55595
Bryzgiel - ląd          Wigry    16         46       56.88256           -10.8826
Brzozowa L              Wigry    3.81       31       43.20288           -12.2029
Brzozowa P              Wigry    2.32       30       39.28379           -9.28379
Cimochowski Grądzik C   Wigry    0.15       25       23.23839           1.761609
Cimochowski Grądzik N   Wigry    0.14       31       22.93307           8.066934
Cimochowski Grądzik S   Wigry    0.76       25       31.71767           -6.71767
Kamień                  Wigry    3.13       60       41.60497           18.39503
Krowa                   Wigry    4.49       34       44.58461           -10.5846
Mysia Wigry             Wigry    1.55       49       36.36101           12.63899
Ordów                   Wigry    8.69       64       50.60104           13.39896
Ostrów                  Wigry    38.82      93       67.41729           25.58271
Rośków                  Wigry    0.56       24       29.91417           -5.91417
Walędziak               Wigry    0.76       28       31.71767           -3.71767
Wysoki Węgieł           Wigry    18         57       58.18153           -1.18153

[Figure: species number against area on logarithmic axes with the fitted power function $y = 33.431 x^{0.1917}$, $R^2 = 0.7215$]

The outlier would disturb direct comparisons of species richness.

Due to possible differences in island areas between the two island complexes we have to use the residuals. A direct t-test on raw data would be erroneous.

[Figure labels for the permutation histogram: observed t, P(t), upper 2.5% confidence limit]
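The area correction can be sketched as follows: predicted richness comes from the power function quoted on the slide, the residual is observed minus predicted, and the two island complexes are then compared with the Welch t statistic given earlier in the lecture. The coefficients 33.431 and 0.1917 are taken from the slide rather than refitted, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

A, B = 33.431, 0.1917  # power-function coefficients quoted on the slide

# (complex, area in ha, species) for the 30 islands in the table above.
islands = [
    ("NBM", 0.7, 33), ("NBM", 0.5, 25), ("NBM", 0.69, 34), ("NBM", 6.15, 51),
    ("NBM", 0.0003, 6), ("NBM", 0.4, 27), ("NBM", 0.18, 32), ("NBM", 0.09, 22),
    ("NBM", 0.63, 38), ("NBM", 15, 28), ("NBM", 20, 75), ("NBM", 0.48, 19),
    ("NBM", 0.44, 29), ("NBM", 0.78, 47), ("NBM", 0.67, 43),
    ("Wigry", 0.2, 21), ("Wigry", 16, 46), ("Wigry", 3.81, 31), ("Wigry", 2.32, 30),
    ("Wigry", 0.15, 25), ("Wigry", 0.14, 31), ("Wigry", 0.76, 25), ("Wigry", 3.13, 60),
    ("Wigry", 4.49, 34), ("Wigry", 1.55, 49), ("Wigry", 8.69, 64), ("Wigry", 38.82, 93),
    ("Wigry", 0.56, 24), ("Wigry", 0.76, 28), ("Wigry", 18, 57),
]


def residual(area, species):
    """Observed species number minus the number predicted from island area."""
    return species - A * area ** B


def welch_t(x, y):
    """t statistic for two samples without assuming equal variances."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


residuals = [(cx, residual(area, sp)) for cx, area, sp in islands]
nbm = [r for cx, r in residuals if cx == "NBM"]
wigry = [r for cx, r in residuals if cx == "Wigry"]

t_residuals = welch_t(nbm, wigry)
print(f"Residual of the first island (Górna E): {residuals[0][1]:.3f}")
print(f"t for area-corrected richness, NBM versus Wigry: {t_residuals:.4f}")
```

Because both complexes contain 15 islands, the Welch statistic equals the classical pooled-variance t here.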
Permutation testing

Observed values (residuals)
NBM        Wigry
1.778435   -3.55595
-4.27129   -10.8826
2.864436   -12.2029
3.643813   -9.28379
-1.06014   1.761609
-1.04557   8.066934
7.935042   -6.71767
0.929363   18.39503
7.402711   -10.5846
-28.1831   12.63899
15.6314    13.39896
-10.0431   25.58271
0.437301   -5.91417
15.12399   -3.71767
12.0395    -1.18153
Observed t = 0.118799

Randomized values: the same 30 residuals are reassigned at random to NBM and Wigry. Three example randomizations gave t = 0.34257, t = 0.766559 and t = 0.346264.

[Figure: frequency histogram of the randomized t-values; x-axis t-values from 0 to 1, y-axis frequency from 0 to 0.08]

10000 randomizations of the observed values give a null distribution of t-values and associated probability levels, with which we compare the observed t. This gives the probability level for our t-test.

Bivariate comparisons using ANOVA

$t = 0.11884$
$F = 0.01412$
$F = t^2$

t and F tests can both be used for pairwise comparisons.
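The randomization procedure can be sketched with the standard library alone. The residuals are the observed values listed above. Using the absolute value of t, the seed, and the helper names are assumptions for this example; the lecture does not specify how its randomizations were implemented.

```python
import random
from math import sqrt

nbm = [1.778435, -4.27129, 2.864436, 3.643813, -1.06014, -1.04557, 7.935042,
       0.929363, 7.402711, -28.1831, 15.6314, -10.0431, 0.437301, 15.12399, 12.0395]
wigry = [-3.55595, -10.8826, -12.2029, -9.28379, 1.761609, 8.066934, -6.71767,
         18.39503, -10.5846, 12.63899, 13.39896, 25.58271, -5.91417, -3.71767, -1.18153]


def t_value(x, y):
    """Two-sample t statistic with separate variance estimates."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    return (mx - my) / sqrt(vx / len(x) + vy / len(y))


def permutation_p(x, y, n_perm=10000, seed=1):
    """Share of random regroupings whose |t| is at least the observed |t|."""
    rng = random.Random(seed)
    observed = abs(t_value(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign residuals to the two groups at random
        if abs(t_value(pooled[:len(x)], pooled[len(x):])) >= observed:
            hits += 1
    return hits / n_perm


t_observed = t_value(nbm, wigry)
p_permutation = permutation_p(nbm, wigry)
print(f"Observed t = {t_observed:.4f}")
print(f"Permutation probability = {p_permutation:.3f}")
print(f"t squared = {t_observed ** 2:.5f}")
```

The squared t reproduces the F value of the two-group ANOVA, which is the relation $F = t^2$ stated on the slide.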
Repeated measures

Species richness of ground-living Hymenoptera in a beech forest
Photo: Simon van Noort
Photo: Tim Murray

Plot   Before   Leaf-litter free   Mean    SS Error
1      34       52                 43      74.69136
2      39       58                 48.5    87.41358
3      1        10                 5.5     5.191358
4      52       50                 51      30.24691
5      45       49                 47      1.580247
6      6        15                 10.5    5.191358
7      33       32                 32.5    22.96914
8      12       14                 13      7.135802
9      28       52                 40      166.0247
10     1        19                 10      74.69136
11     35       29                 32      69.35802
12     7        22                 14.5    42.52469
13     33       18                 25.5    215.858
14     7        11                 9       1.580247
15     9        15                 12      0.024691
16     10       15                 12.5    0.302469
17     3        2                  2.5     22.96914
18     7        3                  5       47.80247
Mean   20.11111 25.88889
Sum                                        875.5556

Grand mean = 23
Paired t-test: p = 0.027271

$SS_{treat} = n \sum_{j=1}^{k} (\bar T_j - \bar x)^2$
$SS_{error} = \sum_{i=1}^{n} \sum_{j=1}^{k} (x_{ij} - \bar P_i - \bar T_j + \bar x)^2$

$\sum_j (\bar T_j - \bar x)^2 = 16.69136$, so $SS_{Effect} = 18 \times 16.69136 = 300.44$ with df = 1
$SS_{Error} = 875.5556$ with df = 17
$SS_{Effect} / SS_{Error} = 0.34315$
$F = 0.34315 \times 17 = 5.83$
P(F) = 0.0273, the same probability as the paired t-test, because $F = t^2$ for two treatments.

Advice for using ANOVA:
• You need a specific hypothesis about your variables. In particular, designs with more than one predictor level (multifactorial designs) have to be stated clearly.
• ANOVA is a hypothesis-testing method. Pattern seeking will in many cases lead to erroneous results.
• Predictor variables should really measure different things; they should not correlate too highly with each other.
• The general assumptions of the GLM should be fulfilled. In particular, predictors should be additive. The distribution of errors should be normal.
• It is often better to use log-transformed values.
• In monofactorial designs, where only one predictor variable is tested, it is often preferable to use the non-parametric alternative to ANOVA, the Kruskal-Wallis test. This test does not rely on the GLM assumptions but is nearly as powerful as the classical ANOVA.
• Another non-parametric alternative for multifactorial designs is to use ranked dependent variables. You lose information but become less dependent on the GLM assumptions.
• ANOVA as the simplest multivariate technique is quite robust against violations of its assumptions.
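For the Hymenoptera plots, the repeated-measures sums of squares can be computed directly from the two columns of counts. The sketch below applies $SS_{treat} = n \sum_j (\bar T_j - \bar x)^2$ with the factor n included and compares the resulting F with the squared paired t statistic. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Species numbers per plot before and after leaf-litter removal (plots 1 to 18).
before = [34, 39, 1, 52, 45, 6, 33, 12, 28, 1, 35, 7, 33, 7, 9, 10, 3, 7]
after = [52, 58, 10, 50, 49, 15, 32, 14, 52, 19, 29, 22, 18, 11, 15, 15, 2, 3]

n, k = len(before), 2                      # n plots, k treatments
grand_mean = mean(before + after)
treat_means = [mean(before), mean(after)]  # T_j
plot_means = [(b + a) / 2 for b, a in zip(before, after)]  # P_i

# Treatment effect: spread of the treatment means around the grand mean.
ss_treat = n * sum((t - grand_mean) ** 2 for t in treat_means)

# Error: what remains after removing plot and treatment effects.
ss_error = sum(
    (x - p - t + grand_mean) ** 2
    for b, a, p in zip(before, after, plot_means)
    for x, t in zip((b, a), treat_means)
)

df_treat = k - 1
df_error = (n - 1) * (k - 1)
f_value = (ss_treat / df_treat) / (ss_error / df_error)

# Paired t statistic on the within-plot differences, for comparison.
diffs = [a - b for b, a in zip(before, after)]
t_paired = mean(diffs) / (stdev(diffs) / sqrt(n))

print(f"SS_treat = {ss_treat:.4f} (df = {df_treat})")
print(f"SS_error = {ss_error:.4f} (df = {df_error})")
print(f"F = {f_value:.4f}, paired t squared = {t_paired ** 2:.4f}")
```

With only two treatments the repeated-measures F and the squared paired t coincide, so both tests lead to the same probability.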
Homework and literature

Refresh:
• ANOVA
• Treatments
• Degrees of freedom
• Repeated design
• Incomplete design
• Permutation testing
• Welch test
• Tukey test

Prepare for the next lecture:
• Binomial distribution
• Combinations

Literature:
Łomnicki: Statystyka dla biologów
http://statsoft.com/textbook/