Choosing and Using Statistics
Presentation · December 2015
DOI: 10.13140/RG.2.1.3829.0005
Jinliang Liu
Wenzhou University
Jinliang Liu (刘金亮)
Institute of Ecology, College of Life Sciences
Zhejiang University
E-mail: jinliang.liu@foxmail.com
What is statistics for?
• To find patterns in your data
• To detect differences among groups of data
• To detect relations between variables
Eight steps to successful data analysis
1. Decide what you are interested in.
2. Formulate a hypothesis or hypotheses.
3. Design the experiment or sampling routine.
4. Collect dummy data: make up approximate values based on what you expect.
5. Use statistical methods to decide on the appropriate test or tests.
6. Carry out the test(s) using the dummy data.
7. If there are problems, go back to step 3 (or 2); otherwise collect the real data.
8. Carry out the test(s) on the real data.
• Observations
• Hypothesis testing
• P-value
• Sampling
• Experiments
• Statistics
– Descriptive statistics
– Tests of difference
– Tests of relationships
– Tests for data investigation
• P ≤ 0.01: very strong presumption against the null hypothesis
• 0.01 < P ≤ 0.05: strong presumption against the null hypothesis
• 0.05 < P ≤ 0.1: low presumption against the null hypothesis
• P > 0.1: no presumption against the null hypothesis
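The verbal scale above can be applied mechanically. The sketch below is only an illustration: `presumption` is a hypothetical helper name that does not appear in the slides, and the cut-offs are exactly the four bands listed.

```r
# Map P-values onto the four bands listed above.
# `presumption` is a hypothetical helper, not part of base R or the slides.
presumption <- function(p) {
  cut(p,
      breaks = c(0, 0.01, 0.05, 0.1, 1),          # upper bounds are inclusive
      labels = c("very strong", "strong", "low", "none"),
      include.lowest = TRUE, right = TRUE)
}

as.character(presumption(c(0.004, 0.03, 0.08, 0.4)))
```

The labels describe the presumption against the null hypothesis; they are a convention, not a property of the test.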
Types of statistics
• Descriptive statistics
– mean, median, dispersion…
• Parametric statistics
– known distributions, e.g. normal
• Non-parametric statistics
– little or no knowledge of the distribution
• Measurement variables
– Continuous variables
– Discrete variables
• Ranked variables
• Attributes
• Derived variables (or “computed variables”)
Types of distribution
Discrete distributions
1. The Poisson distribution
Describes the number of times an event occurs in a unit of time or space
Used to test for randomness or independence in either space or time
Variance > mean: more clumped than random
Variance < mean: more ordered (uniform) than random
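The variance-to-mean comparison above can be sketched in R as follows. The counts are simulated purely for illustration (think of individuals per quadrat), and the chi-square dispersion check is a standard approximation added here, not something stated on the slide.

```r
set.seed(42)                            # simulated data, for illustration only
counts <- rpois(200, lambda = 3)        # e.g. number of individuals per quadrat

vmr <- var(counts) / mean(counts)       # variance-to-mean ratio
vmr                                     # near 1: random; > 1: clumped; < 1: uniform

# Under a Poisson model, (n - 1) * vmr is approximately chi-square with n - 1 df
n <- length(counts)
p_upper <- pchisq((n - 1) * vmr, df = n - 1, lower.tail = FALSE)
p_upper                                 # small values suggest clumping
```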
2. The binomial distribution
A discrete distribution of the number of times an event occurs in a fixed number of independent trials
3. The negative binomial distribution
It can be used to describe clumped data (i.e. when there are more
very crowded and more sparse observations than a Poisson
distribution with the same mean)
4. The hypergeometric distribution
To describe events where individuals are removed from a population
and not replaced
Continuous distributions
1. The rectangular distribution
Describes any distribution where all values are equally
likely to occur.
2. The normal distribution (Gaussian distribution)
Tests: Kolmogorov-Smirnov test, Anderson-Darling test, Shapiro-Wilk test, chi-square goodness of fit
R code: shapiro.test(), qqnorm(), qqline()
Described by: mean, standard deviation, skewness, kurtosis
Transformations: logarithmic, square root, arcsine square root (for percentage or proportion data), probits, logits
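A minimal sketch of how the functions named above might be combined: check normality, transform, and check again. The data are simulated (log-normal) so that the log transformation is known to help; real data may behave differently.

```r
set.seed(1)
x <- rlnorm(50)                    # simulated right-skewed data, illustration only

shapiro.test(x)                    # Shapiro-Wilk test on the raw data
sw_log <- shapiro.test(log(x))     # and again after a logarithmic transformation
sw_log

qqnorm(log(x)); qqline(log(x))     # points close to the line suggest normality

# Arcsine square-root transformation for proportion data
p <- c(0.05, 0.20, 0.50, 0.85)
asin(sqrt(p))
```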
3. The t-distribution
Related to the normal distribution but is flatter with
extended tails
Confidence intervals
95% confidence intervals (CI) are calculated for samples using t-distributions
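As a sketch of the calculation, a 95% CI for a sample mean is the mean plus or minus the t quantile times the standard error. The values below are made up for illustration.

```r
x <- c(4.1, 5.3, 4.8, 5.9, 5.0, 4.6)    # made-up sample values

n  <- length(x)
se <- sd(x) / sqrt(n)                   # standard error of the mean
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se
ci

t.test(x)$conf.int                      # t.test() reports the same interval
```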
4. The chi-square distribution
It is asymmetric and varies from 0 to positive infinity. The
chi-square distribution is related to variance
5. The exponential distribution
It occurs when there is a constant probability of birth,
death, increase or decrease
Non-parametric “distributions”
Ignore distributions totally
Ranking, quartiles and the interquartile range
Box and whisker plots
Descriptive statistics
Statistics of location or position
• Arithmetic mean
– the 'normal' mean, also called the average
• Geometric mean
– the antilog of the mean of the logged data; it is never larger than the arithmetic mean
– used when data have been logged, or when data sets known to be right-skewed are being compared
• Harmonic mean
– the reciprocal of the mean of the reciprocals; it is never larger than the geometric mean
• Median
• Mode
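The three means defined above can be computed directly from their definitions. This is a small sketch with made-up positive values; base R has no built-in geometric or harmonic mean, so they are written out.

```r
x <- c(1, 2, 4, 8)                 # made-up positive values

arith <- mean(x)                   # arithmetic mean
geom  <- exp(mean(log(x)))         # antilog of the mean of the logged data
harm  <- 1 / mean(1 / x)           # reciprocal of the mean of the reciprocals

c(arithmetic = arith, geometric = geom, harmonic = harm)
median(x)
```

For positive data the ordering harmonic ≤ geometric ≤ arithmetic always holds, with equality only when all values are identical.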
Statistics of distribution, dispersion or spread
• Range
• Interquartile range
• Variance
• Standard deviation (SD)
• Standard error (SE)
—the standard deviation of a distribution of means for repeated
samples from a population
• Confidence intervals (CI) or confidence limits
• Coefficient of variation (CV)
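A short sketch of the spread statistics listed above, using made-up values. The coefficient of variation is expressed here as a percentage of the mean, which is one common convention.

```r
x <- c(12.1, 9.8, 11.4, 10.6, 13.0, 10.2)   # made-up sample values
n <- length(x)

range(x)                  # minimum and maximum
IQR(x)                    # interquartile range
var(x)                    # variance
s  <- sd(x)               # standard deviation
se <- s / sqrt(n)         # standard error of the mean
cv <- 100 * s / mean(x)   # coefficient of variation, in percent

c(SD = s, SE = se, CV = cv)
```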
Displaying data
A single variable
• Summarizing
– Box and whisker plot (box plot)
• Showing the distribution
– Bar chart: for discrete data
– Histogram: for continuous data
– Pie chart: for categorical or attribute data
Two or more variables
• Summarizing
—Box and whisker plots (box plots)
—Error bars and confidence intervals
• Comparing two variables
—Multiple scatterplots
—Trends, predictions and time series
Lines, fitted lines, confidence intervals
• Comparing more variables
—Three-dimensional scatterplots
—Multiple fitted lines
Tests to look at difference
1 Do frequency distributions differ?
1.1 Does one observed set of frequencies differ from another?
Use where you have observed frequencies for various categories and expected proportions for those categories that were not derived from the data themselves.
Chi-square goodness of fit
Null hypothesis: the observed and expected frequencies do not differ from each other.
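In R, a goodness-of-fit test of this kind can be run with chisq.test() by supplying the expected proportions through the `p` argument. The counts and proportions below are made up for illustration.

```r
observed   <- c(A = 43, B = 31, C = 26)   # made-up observed frequencies
expected_p <- c(0.5, 0.3, 0.2)            # expected proportions, fixed in advance

gof <- chisq.test(observed, p = expected_p)
gof
gof$expected                              # expected frequencies under the null
```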
1.2 Do the observed frequencies conform to a standard distribution?
Kolmogorov-Smirnov test
• Continuous data
• To compare two sets of data to determine whether they come from the
same distribution
• R code: ks.test()
The K-S test gives a P-value for the null hypothesis that two samples come from the same distribution, whereas the t-test is concerned with means and the Mann-Whitney U test with medians. Two distributions may have identical means and medians and yet differ elsewhere.
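The contrast with the t-test can be sketched with two simulated samples that share a mean but differ in spread. The data are generated only for illustration, and no particular P-values are claimed.

```r
set.seed(7)
a <- rnorm(100, mean = 0, sd = 1)    # simulated sample 1
b <- rnorm(100, mean = 0, sd = 3)    # simulated sample 2: same mean, wider spread

ks <- ks.test(a, b)                  # two-sample K-S test: compares whole distributions
tt <- t.test(a, b)                   # t-test: compares means only
c(ks = ks$p.value, t = tt$p.value)

# One-sample form: does `a` conform to a standard normal distribution?
ks.test(a, "pnorm", mean = 0, sd = 1)
```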
2. Do the observations from two groups differ?
Paired data
• Paired t-test: parametric test, equal variances
– R code: t.test(x1, x2, paired=TRUE)
• Wilcoxon signed ranks test: non-parametric test
• Sign test: non-parametric test
Unpaired data
• t-test: parametric test, equal variances
– R code: t.test(x1, x2, paired=FALSE)
• One-way ANOVA: parametric test, equal variances
• Mann-Whitney U test (Wilcoxon rank sum test): non-parametric test
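A sketch of the two-group tests using R's built-in `sleep` data set (ten subjects measured under two drugs, so the data are paired). The sign test is written here as a binomial test on the signs of the differences, which is one common way to do it since base R has no dedicated sign-test function. Note that t.test() uses the Welch correction unless `var.equal = TRUE` is set.

```r
x1 <- sleep$extra[sleep$group == 1]
x2 <- sleep$extra[sleep$group == 2]

# Paired data
paired_t <- t.test(x1, x2, paired = TRUE)
paired_t
wilcox.test(x1, x2, paired = TRUE, exact = FALSE)   # Wilcoxon signed ranks test

d <- x2 - x1
d <- d[d != 0]                                      # drop zero differences
sign_test <- binom.test(sum(d > 0), length(d), p = 0.5)
sign_test

# The same values treated as unpaired, for comparison only
t.test(x1, x2, paired = FALSE, var.equal = TRUE)    # classic equal-variance t-test
wilcox.test(x1, x2, exact = FALSE)                  # Mann-Whitney U test
```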
3. Do the observations from more than two groups differ?
3.1 Repeated measures
• Friedman test
– Is a non-parametric analogue of a two-way ANOVA
– Null hypothesis: that observations in the same group (factor
level) have the same median values
– R code: friedman.test()
Example: The data comprise the number of cyanobacterial cells in 1 mm³ of water from six ponds, with samples taken on four different days and only one sample taken each day from each pond.
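A sketch of how data laid out like the pond example might be passed to friedman.test(). The cell counts are simulated, not the values from the original example; with a matrix, rows are treated as blocks (ponds) and columns as groups (days).

```r
set.seed(3)                                   # simulated counts, illustration only
cells <- matrix(rpois(24, lambda = 40), nrow = 6, ncol = 4,
                dimnames = list(pond = paste0("P", 1:6),
                                day  = paste0("D", 1:4)))
cells

fr <- friedman.test(cells)                    # rows = blocks, columns = groups
fr
```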
• Repeated-measure ANOVA
– Unlike standard ANOVA, which assumes that each of the factor levels is independent of all others, repeated-measures ANOVA allows for repeated observations on the same subjects
– R code: summary(aov(cells~pond*day+Error(cells/(pond*day))))
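One widely used way to specify a repeated-measures model with aov() is to name the repeatedly measured unit inside Error(). The sketch below assumes that each pond is the subject and day is the within-subject factor; the data are simulated and the model form is an assumption, not necessarily the author's.

```r
set.seed(1)                                        # simulated data, illustration only
d <- expand.grid(pond = factor(1:6), day = factor(1:4))
d$cells <- rpois(nrow(d), lambda = 50)

# Assumption: pond is the repeatedly sampled subject, day the within-subject factor
fit <- aov(cells ~ day + Error(pond), data = d)
summary(fit)
```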
3.2 Independent samples
• One-way ANOVA
– Data: continuous, normally distributed, equal variances
– Null hypothesis: all groups have the same mean
– R code
> summary(aov(y~as.factor(x)))
A significant result in the ANOVA will only show that at least one pair of the groups is
significantly different.
> TukeyHSD(aov(y~as.factor(x))) #### Post hoc testing
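A runnable sketch of the same workflow on R's built-in `PlantGrowth` data set (plant weight under a control and two treatments). The assumption checks at the end are additions for illustration.

```r
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)                       # overall test: do the group means differ?

tk <- TukeyHSD(fit)                # post hoc: which pairs of groups differ?
tk

# Checks on the assumptions listed above
bartlett.test(weight ~ group, data = PlantGrowth)   # equal variances
shapiro.test(residuals(fit))                        # normality of residuals
```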
• Kruskal-Wallis test
– Data: need not be continuous, normally distributed or of equal variance
– Non-parametric equivalent of the one-way ANOVA
– Null hypothesis: all samples are taken from populations with the same median
– R code
> kruskal.test(y~as.factor(x))
Note: this test may be used when there are only two samples, but the Mann-Whitney U test is more powerful for two samples and should be preferred.
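The same call on the built-in `PlantGrowth` data, followed by one possible post hoc step. Pairwise Wilcoxon tests with a Holm correction are shown as an assumption about how one might follow up; the slides do not name a post hoc method for this test.

```r
kw <- kruskal.test(weight ~ group, data = PlantGrowth)
kw

# One possible follow-up (an assumption, not from the slides):
# pairwise Mann-Whitney U tests with Holm-adjusted P-values
pw <- pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                           p.adjust.method = "holm", exact = FALSE)
pw
```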
4 Two independent ways of classifying the data
4.1 One observation for each factor combination (no replication)
• Friedman test
– non-parametric analogue of a two-way ANOVA
– a single observation for each factor combination
– null hypotheses: the median values of each factor level are the same
between columns and between rows
– R code: friedman.test()
• Two-way ANOVA (without replication)
– Data: continuous, approximately normally distributed, same variance in
each factor combination
– R code: summary(aov(Yield~Farm*Blend-Farm:Blend) )
4.2 More than one observation for each factor combination (with replication)
Three null hypotheses are tested:
• (1) that all levels of the first factor have the same mean;
• (2) that all levels of the second factor have the same mean
– one-way ANOVA analyses
• (3) that there is no interaction between the two factors
– two-way ANOVA and Scheirer–Ray–Hare test
• Interaction
• If the test gives a significant result for the interaction term, the effects of the two factors are not additive: the response to one factor depends on the level of the other.
• R code:
– interaction.plot(Farm,Blend,Yield)
– interaction.plot(Blend,Farm,Yield)
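A runnable sketch of a two-way ANOVA with replication and its interaction plot, using R's built-in `warpbreaks` data set (two wool types, three tension levels, nine observations per combination) in place of the Farm and Blend data, which are not reproduced in the slides.

```r
fit <- aov(breaks ~ wool * tension, data = warpbreaks)
tab <- summary(fit)[[1]]
tab                                # rows: wool, tension, wool:tension, Residuals

# Non-parallel lines in the plot suggest an interaction
with(warpbreaks, interaction.plot(tension, wool, breaks))
```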
• Two-way ANOVA
Example: a significant interaction between sex and day length means that the two sexes are not responding to day length in the same way.
• Scheirer–Ray–Hare test
• non-parametric equivalent of a two-way ANOVA with replication
• conservative and has much lower power than the parametric ANOVA
• If you do use it, do so with some caution, and perhaps consider a generalized linear model with an error structure that doesn't require normality
5 More than two independent ways of classifying the data
• Multifactorial testing
• If the factors are all fully independent of each other, the data are suitable for multifactorial testing
• Three-way ANOVA (without replication)
• R code:
• Three-way ANOVA (with replication)
R code:
model <- aov(grass~exclosure*distance*site)
TukeyHSD(model, "exclosure")
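Since the grass data are not included in the slides, the same pattern can be tried on R's built-in `npk` data set, a replicated factorial experiment with three two-level factors (N, P, K). This is a stand-in example that ignores the blocking structure in `npk` for simplicity.

```r
model <- aov(yield ~ N * P * K, data = npk)   # three-way ANOVA with replication
summary(model)                                # main effects and all interactions

tk <- TukeyHSD(model, "N")                    # post hoc comparison for one factor
tk
```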
• Multi-way ANOVA
If there are more than three ways of dividing the data into groups and each of the classifications is independent of the others, then ANOVA may be carried out.
Not all classifications are independent
– Non-independent factors
– Nested factors
– Random or fixed factors
• Nested or hierarchical designs
Two-level nested-design ANOVA
R code:
summary(aov(cholest~intake/cage))
#### '/' indicates that a factor is nested in another: 'aov(data~A/B)' indicates that factor 'B' is nested in factor 'A'.
Tests to look at relationships
Correlation or association between two variables
Are the observations for two categorical variables associated?
• Chi-square test of association
• Phi coefficient
• Cramér coefficient
Categorical variable
• Chi-square test of association
Null hypothesis: the categories in the two variables are independent
• For example: if ‘eye color’ and ‘sex’ are the two variables and individuals are assigned to either ‘blue’ or ‘brown’ and to either ‘male’ or ‘female’ then the null hypothesis is that there is no association between sex and eye color.
Data: frequencies (numbers of observations); the test should never be carried out on percentages or on data transformed in any way
R code:
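A sketch of the eye color and sex example using chisq.test() on R's built-in `HairEyeColor` table, collapsed over hair color and restricted to blue and brown eyes. The choice of data set is an assumption made for illustration; the slide's own data are not shown.

```r
# Eye x Sex frequency table, keeping only the 'Blue' and 'Brown' rows
eye_sex <- margin.table(HairEyeColor, c(2, 3))[c("Blue", "Brown"), ]
eye_sex

assoc <- chisq.test(eye_sex)   # Yates' continuity correction is the default for 2x2
assoc
assoc$expected                 # expected frequencies under independence
```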
• Cramér coefficient of association
– A test carried out on tables of frequencies in conjunction with a chi-square test; it provides additional information about the strength of the association
– The statistic χ² is used to determine significance, while the Cramér coefficient (C) is a measure from 0 (no association) to 1 (perfect association) that is independent of the sample size
• Phi coefficient of association
– This is a special case of the Cramér coefficient for 2×2 tables (i.e. there are only two categories for each of the two variables)
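Base R has no dedicated function for either coefficient, so the sketch below computes them from the chi-square statistic using the usual formula sqrt(χ² / (n × (k − 1))), where k is the smaller table dimension. The formula and the use of the built-in `HairEyeColor` data are assumptions made for illustration.

```r
# Cramér coefficient for a 4x4 table (hair color x eye color)
tab  <- margin.table(HairEyeColor, c(1, 2))
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
n    <- sum(tab)
cramer <- sqrt(chi2 / (n * (min(dim(tab)) - 1)))
cramer

# Phi coefficient: the same formula for a 2x2 table (eye color x sex)
tab2  <- margin.table(HairEyeColor, c(2, 3))[c("Blue", "Brown"), ]
chi22 <- unname(chisq.test(tab2, correct = FALSE)$statistic)
phi   <- sqrt(chi22 / sum(tab2))
phi
```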
Observations assigned a value
• Pearson’s product-moment correlation
• Spearman’s rank-order correlation
• Kendall rank-order correlation
• Regression
Pearson’s product-moment correlation
• Data: Continuous scale, Normally distributed
• R code:
cor(); cor.test()
Spearman’s rank-order correlation
Kendall rank-order correlation
• Data: ranked data, or data that do not meet parametric assumptions (non-parametric)
• R code:
cor(x, y, method="spearman") #### or method="kendall"
cor.test(x, y, method="spearman")
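The three correlation tests side by side, sketched on R's built-in `cars` data set (speed and stopping distance). Setting `exact = FALSE` uses the normal approximation, which avoids warnings when the data contain ties.

```r
pearson  <- cor.test(cars$speed, cars$dist)                        # Pearson's r
spearman <- cor.test(cars$speed, cars$dist, method = "spearman", exact = FALSE)
kendall  <- cor.test(cars$speed, cars$dist, method = "kendall",  exact = FALSE)

c(pearson$estimate, spearman$estimate, kendall$estimate)
c(pearson$p.value, spearman$p.value, kendall$p.value)
```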
• Regression
Linear regression, Kendall robust line-fit method, logistic regression, model II regression, polynomial regression
More than two variables
• Correlation
• Partial correlation
• Kendall partial rank-order correlation
• Regression
• Analysis of covariance (ANCOVA)
– R code: summary(aov(BMP~Species+Temp)) or summary.aov(lm(BMP~Species+Temp))
• Multiple regression
• Stepwise regression
• Path analysis
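A sketch of ANCOVA, multiple regression and stepwise selection from the list above, using the built-in `iris` data as a stand-in for the BMP, Species and Temp example (whose data are not in the slides). Here Petal.Length plays the role of the covariate.

```r
# ANCOVA: one factor (Species) plus one continuous covariate (Petal.Length)
ancova <- aov(Sepal.Length ~ Petal.Length + Species, data = iris)
summary(ancova)

# Check whether slopes differ among species (covariate x factor interaction)
slopes <- aov(Sepal.Length ~ Petal.Length * Species, data = iris)
anova(ancova, slopes)

# Multiple regression, then stepwise selection by AIC
mr  <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Sepal.Width, data = iris)
sel <- step(mr, trace = 0)
summary(sel)
```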
Tests for data exploration
• Principal component analysis (PCA) and factor analysis
• Canonical variate analysis
• Discriminant function analysis
• Multivariate analysis of variance (MANOVA)
• Multivariate analysis of covariance (MANCOVA)
• Cluster analysis
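Two of the exploratory methods above, PCA and cluster analysis, can be sketched with base R functions on the built-in `iris` measurements. Scaling the variables first and cutting the tree into three groups are choices made for illustration only.

```r
vars <- iris[, 1:4]                        # four continuous measurements

# Principal component analysis on standardized variables
pca <- prcomp(vars, scale. = TRUE)
summary(pca)                               # proportion of variance per component
biplot(pca)

# Hierarchical cluster analysis on the same standardized data
hc <- hclust(dist(scale(vars)))
groups <- cutree(hc, k = 3)                # k = 3 is an arbitrary illustrative choice
table(groups, iris$Species)
```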
