median - Workspace

advertisement
Central tendency and spread
Stats Club 4
Marnie Brennan
References
• Petrie and Sabin - Medical Statistics at a
Glance: Chapter 5, 6, 10, 35 Good
• Petrie and Watson - Statistics for Veterinary
and Animal Science: Chapter 2, 4 Good
• Thrusfield – Veterinary Epidemiology: Chapter
12
• Kirkwood and Sterne – Essential Medical
Statistics: Chapter 4
Terminology!
• Along similar lines of previous Stats Clubs, we
are talking about ways of describing your
continuous data
– Gives you basic calculations to do to explore your
data (get a feel for it)
– Enables you to compare your data with those
collected by other researchers
Central tendency
• Central tendency = a measure of location or
position of data, i.e. the ‘average’
– This basically means calculating things like:
•
•
•
•
Mean (arithmetic mean)
Median
Mode
Others
– E.g. geometric mean (distn. skewed to the right), weighted
mean
– Nice table in Petrie and Sabin (Chapter 5)
summarising advantages and disadvantages of all
measurements
Central tendency – Mean, Median
• Mean = Sum of your data/total number of
measurements
– Algebraically defined
– Affected by skewed data THEREFORE good to use for
normally distributed variables
• Median = The midpoint of your values i.e. what the
‘halfway’ value in your data is
– If the observations are arranged in increasing order, the
median would be the middle value
– Not algebraically defined
– Not affected by skewed data THEREFORE good to use for
non-normally distributed variables
Distributions
Median
Mean
Mean and
median the
same
Central tendency - Mode
• Mode = the value that occurs the most
frequently in a data set
– Generally means more if you have categorical data
e.g. The most common litter size of bearded collie
dogs is 7
– Not often used
What is the mode?
Spread
• Spread = measure of dispersion or variability
(variation) of data
– This basically means calculating things like:
•
•
•
•
•
Range
Percentiles (Quartiles, Interquartile range)
Variance
Standard deviation
Others
– E.g. coefficient of variation
– Nice table in Petrie and Sabin (Chapter 6) summarising
main points about these measurements
Range and percentiles
• Range = the range between the minimum and maximum
values of your data
– Gives an indication of spread at a very basic level
– Distorted by outliers (get a large range)
• Percentiles = if data is ordered from lowest to highest,
these divide the data up into ‘compartments’
– E.g. The 5th percentile is the point along the data below which
5% of the data lies; the 20th percentile is the point in the data
below which 20% of the data lies
– Special types of percentiles are called ‘quartiles’ – these divide
the data into 4 equal parts (the 25th, 50th and 75th percentiles)
– From these, you get an ‘interquartile range’ - IQR, which is
values between the 25th and 75th percentiles
– The 50th percentile is the median
– Not distorted by outliers
Range = 22-28 (6)
Q1 (25th percentile) = 24
Q3 (75th percentile) = 26
IQR = 24-26 (2)
Range = 0.12-134 (133.9)
Q1 (25th percentile) = 6
Q3 (75th percentile) = 36
IQR = 6-36 (30)
What conclusions can we
draw about what to use
when??
Rule of thumb
• Mean and range = good to use for normally
distributed variables
• Median and interquartile range = good to use
for non-normally distributed variables
Variance
• Variance = the deviations of the data values from the
mean
– e.g. If the data are bunched around the mean, the variance
is small; if the data are spread out, the variance is large
– Calculated by squaring each distance between the
observations and the mean
– We then take the mean of this (add all values together and
divide by the total number of observations minus 1)
– DON’T WORRY ABOUT HOW TO DO THIS! This is what
computers are for!
– Measured in the same units as the observations, but
squared e.g. If the units are grams, the variance will be in
grams squared
Mean = 26
Variance = 430
Mean = 23
Variance = 11090
Example
• If we had 6 observations (with mean = 0.17):
15, 18, -14, -17, -3 and 2
• What is the variance?
= (15 – 0.17)2 + (18-0.17) 2 + (-14 – 0.17) 2 + (-17
– 0.17) 2 + (-3 – 0.17) 2 + (2-0.17) 2/6-1
= 209.37
It is n-1 to reduce bias (again don’t worry too
much!)
Standard Deviation (SD)
• Standard deviation = square root of the
variance
– The average of the deviations of the observations
from the mean
– Therefore the units are the same as for the
observations – more convenient
– If we have a normally distributed dataset, then the
mean +/- 2 x standard deviations approximately
encompasses the central 95% of observations
What about the standard error of the
mean (SE or SEM)?
• Similar to standard deviation, but relates to
the precision of the sample mean as an
estimate of the population mean
• Can use SEM to construct confidence intervals
• This will be covered in greater detail in
another session
General rule
• Standard deviation, variance and SEM are for
normally distributed variables only
• For non-normally distributed variables, stick
with interquartile range
Equal variances?
• It is an assumption of some of the tests used to
compare different continuous data groups (e.g. Ttests, ANOVAs) that the variances must be equal
(homogeneity of variance) in the groups
compared
– This is because these tests are not particularly robust
under conditions of heterogeneity of variance
– In order to use these tests, you need to know whether
your groups meet these criteria – if they do not, you
need to use other non-parametric tests, or transform
your data to fit the assumptions
Tests for equal variances
• Eyeball the distributions!
• Levene’s test (two or more groups)
– Null hypothesis – groups have equal variances
– Calculation not affected by normality status
• F-test (variance-ratio test; two groups only)
– Calculation is affected by non-normal data
• Bartlett’s test (two or more groups)
– Calculation is affected by non-normal data
Next month
• The bunfight that is:
– P-values.................!
– Type I and Type II errors
Download