Descriptive Statistics

Example Dataset: Shellfish Contamination
Observed variables:
• Location, Year, Species, Species2, Method, …
• Metals (mg/kg): Lead, Cadmium, Chromium, Copper, Mercury, Zinc

Understanding Data
• Understand the distribution of the variables
• Find relationships among variables (trends, correlation)
• Consider the need for transformation
• Look for potentially influential observations
• Find errors that have occurred during data entry
• Test the assumptions of the statistical models that you intend to employ

Descriptive Statistics
• Numerical: size, ‘middle’, spread, …
• Graphical: boxplot, histogram, …

Categorical Variables
• Size: n
• Proportions: p1, …, pk

Continuous Variables
• Size: n
• ‘Middle’: arithmetic mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, or median
• Spread: standard deviation s; range R = max(x) − min(x)

Quantiles
Sort the data into ascending order to obtain a sequence of order statistics $x_{(1)}, x_{(2)}, \dots, x_{(n)}$.
The p'th quantile qp is the (1+(n−1)p)'th order statistic $x_{(1+(n-1)p)}$ (or an average of neighbouring values if 1+(n−1)p is not an integer).
q0.25 = lower quartile, q0.5 = median, q0.75 = upper quartile.
E.g. for n = 11, the median is the 1+(10)(0.5) = 6th order statistic.

Unlike the arithmetic mean, the median is not at all influenced by the exact values of the largest observations and so provides a resistant measure of central location.

Graphical Summaries
A picture can save a thousand numbers …

Boxplot (box-and-whiskers plot)
• The boxplot is a useful way of plotting the 5 quantiles q0, q0.25, q0.5, q0.75 and q1 of the data.
• The ends of the whiskers show the positions of the minimum and maximum of the data, whereas the edges of the box and the line in its centre show the lower and upper quartiles and the median.
• The whiskers show at a glance the behaviour of the extreme values, whereas the box edges and mid-line summarize the sample in a resistant manner.
• Strong asymmetry in the box mid-line and whiskers suggests that the distribution of the data is not symmetric.
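The quantile rule and the five boxplot quantiles can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample values are hypothetical, and linear interpolation is assumed as one way of "averaging" neighbouring order statistics when 1+(n−1)p is not an integer.

```python
import math


def quantile(data, p):
    """Return the p'th quantile as the (1 + (n-1)p)'th order statistic.

    Assumption: when the position is not an integer, interpolate linearly
    between the two neighbouring order statistics.
    """
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    xs = sorted(data)                 # order statistics x(1), ..., x(n)
    n = len(xs)
    position = 1 + (n - 1) * p        # 1-based position in the sorted data
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return xs[lower - 1]          # exact order statistic
    return xs[lower - 1] + fraction * (xs[lower] - xs[lower - 1])


def five_number_summary(data):
    """Quantiles q0, q0.25, q0.5, q0.75 and q1, as drawn in a boxplot."""
    return tuple(quantile(data, p) for p in (0.0, 0.25, 0.5, 0.75, 1.0))


# Hypothetical sample with n = 11, so the median is the 6th order statistic.
sample = [7, 1, 4, 10, 3, 8, 2, 6, 9, 5, 11]
print(five_number_summary(sample))

# Replacing the largest value by an extreme one moves the mean, not the median.
with_outlier = [1000 if x == 11 else x for x in sample]
print(quantile(with_outlier, 0.5), sum(with_outlier) / len(with_outlier))
```

The second print shows the resistance of the median: the mean is pulled far upwards by the single extreme value, while the median stays at the 6th order statistic.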
Modified Box-Plot
• The modified version draws the whiskers at most 1.5 × IQR beyond the quartiles.
• IQR stands for the interquartile range, which is q0.75 − q0.25.
• Points beyond the whiskers, called outliers, are plotted individually (in MINITAB using the * symbol).

Time Series plots
• Useful way of seeing if there is any trend in a continuous variable across time.

Scatter plots
• Useful way of seeing if there is any relationship between pairs of continuous variables.

Histogram
• The range of values is divided up into a finite set of class intervals (bins). The number of observations in each bin is counted and divided by the sample size to obtain the relative frequency of occurrence, and these are plotted as vertical bars of varying height.
• The histogram quickly reveals the location, spread, and shape of the distribution. The shape of the distribution can be unimodal (one hump), multimodal (many humps) or skewed (longer tail to the left or right).

Probability Distributions
• Models for population variability
• Provide simple descriptions
• Used as basis for statistical inference
• Many different models
  – discrete: categories, counts
  – continuous: standard measurements

Discrete Distributions
• Described by probabilities
Example: Binomial distribution B(n, π)
20% of fish have high PCB levels, i.e. π = 0.2.
How many contaminated fish are there in a group of size n?
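The binomial example can be made concrete with a short calculation. The sketch below uses the standard binomial probability formula with π = 0.2 from the example and assumes a group size of n = 10; the function name is illustrative.

```python
from math import comb


def binomial_pmf(k, n, pi):
    """P(X = k) for X ~ B(n, pi): probability of k contaminated fish out of n."""
    if not 0 <= k <= n:
        return 0.0
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


n, pi = 10, 0.2   # group size (assumed) and proportion with high PCB levels
pmf = [binomial_pmf(k, n, pi) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"P(X = {k:2d}) = {prob:.4f}")

# The probabilities sum to 1 and the expected count is n * pi = 2.
expected_count = sum(k * prob for k, prob in enumerate(pmf))
print(f"expected number of contaminated fish: {expected_count:.2f}")
```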
Illustration: Binomial distributions B(n, π) for n = 10

Continuous Distributions
• For variables measured to an arbitrary precision on some scale
• No probability is associated with any specific value
• Histograms provide a useful lead-in …

Equal-width intervals:
• Draw boxes of height equal to the frequency for each interval
• The area of each bar is then proportional to both the frequency and the relative frequency
[Figure: frequency histogram of bwt]

Probability Densities
For a population histogram, as you
• increase the number of histogram cells, and
• decrease the interval width
the histogram approaches a smooth curve (conceptually). This is called a probability density function, or simply a density.

Illustration
[Figure: population histograms of pop on the density scale, with 5, 10, 20 and 40 bins]

Probability Models
This smooth density curve gives us a probability model for the population. Such models
• take (simple) mathematical forms
• allow probability calculations for the population (areas under the density curve)
• can be compared with the distribution of the sample given by a histogram

Histogram with superimposed normal probability model
[Figure: density histogram of x with a normal curve superimposed; annotation: good agreement]
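The density scale used when comparing a histogram with a probability model can be sketched as follows. This is an illustration under stated assumptions: the data are simulated stand-ins (not the bwt or pop variables from the slides) and the function name is hypothetical. Each bar height is the relative frequency divided by the bin width, so the bars always enclose a total area of 1, however many bins are used.

```python
import random


def density_histogram(data, n_bins):
    """Equal-width histogram on the density scale.

    Returns the bin width and a list of bar heights chosen so that
    (height * width) is the relative frequency of each bin.
    """
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must contain at least two distinct values")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value is placed in the last bin.
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    n = len(data)
    heights = [count / (n * width) for count in counts]
    return width, heights


random.seed(1)
# Simulated "population": 10,000 draws from a standard normal distribution.
population = [random.gauss(0.0, 1.0) for _ in range(10_000)]

for n_bins in (5, 10, 20, 40):
    width, heights = density_histogram(population, n_bins)
    area = sum(height * width for height in heights)
    print(f"{n_bins:2d} bins: width = {width:.3f}, total area = {area:.6f}")
```

Because the total area stays fixed at 1, the bar outline can be compared directly with a density curve as the bins get narrower.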
Normal Distribution
• Model for continuous measurements
• Bell-shaped curve that approximates a density histogram for many types of observations
• Single mode
• Symmetric
• Parameters:
  – mean µ
  – standard deviation σ (variance σ²)

Effects of µ and σ
(a) Changing µ shifts the curve along the axis
(b) Increasing σ increases the spread and flattens the curve
[Figure: normal curves on an axis from 140 to 200. Panel (a): µ1 = 160 and µ2 = 174 with σ1 = σ2 = 6. Panel (b): σ1 = 6 and σ2 = 12 with µ1 = µ2 = 170.]

Understanding the standard deviation σ
(c) Probabilities and numbers of standard deviations
• Shaded area = 0.683: 68% chance of falling between µ − σ and µ + σ
• Shaded area = 0.954: 95% chance of falling between µ − 2σ and µ + 2σ
• Shaded area = 0.997: 99.7% chance of falling between µ − 3σ and µ + 3σ

Histogram with normal curve
[Figure: histogram of Cadmium (frequency scale) with fitted normal curve; Mean = 0.2687, StDev = 0.1633, N = 168]
Note: the approximation to a normal distribution improves when the values are log-transformed.
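The shaded areas 0.683, 0.954 and 0.997 quoted for one, two and three standard deviations can be checked numerically. The sketch below uses `NormalDist` from the Python standard library (Python 3.8 or later); the helper name `prob_within` is illustrative.

```python
from statistics import NormalDist


def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(f"P(mu - {k} sigma < X < mu + {k} sigma) = {prob_within(k):.3f}")

# The result does not depend on the particular mean and standard deviation,
# e.g. mu = 170 and sigma = 6 give the same probabilities.
print(f"{prob_within(2, mu=170, sigma=6):.3f}")
```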
Probability Plot
[Figure: normal probability plot of Cadmium with 95% CI; Mean = 0.2687, StDev = 0.1633, N = 168, AD = 4.539, P-value < 0.005]

Probability Plot – Normality
[Figure: normal probability plot of C22 with 95% CI; Mean = 0.04470, StDev = 1.068, N = 100, AD = 0.194, P-value = 0.890]

Plotting by groups
[Figure: normal probability plots of Cadmium with 95% CI; panel variable: SpeciesGroup]
• Group M: Mean = 0.1656, StDev = 0.08702, N = 89, AD = 3.465, P-value < 0.005
• Group O: Mean = 0.3848, StDev = 0.1510, N = 79, AD = 0.242, P-value = 0.763

Skewness
Measured by the skewness coefficient
– Negative ⇒ left skewed (tail to left)
– Zero ⇒ symmetric
– Positive ⇒ right skewed (tail to right)
Environmental data are frequently positive and skewed to the right, so that mean > median.

Variable    Skewness
Cadmium     0.84

Outliers
Points which are outside the general pattern of the data
– Recording errors
– Measurement failures
– Rogue values
– Greater variability
– Unsuspected factors
Identify, assess impact, delete?

[Figure: histograms of Copper (frequency scale); panel variable: SpeciesGroup, panels M and O]

Measures of location
• Mean – highly sensitive to outliers, skewness
• Median – insensitive to outliers, distribution shape
• Trimmed mean – trim 5% from each tail; calculate mean of central part
  – The median is the 50% trimmed mean

Measures of Spread
• Range = max − min – highly sensitive to outliers
• Standard deviation – very sensitive to outliers, skewness
• Interquartile range – length of central box of boxplot
• MAD – median absolute deviation of data values from the median; robust
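The contrast between sensitive and resistant summaries can be illustrated with a small sketch. The data are hypothetical (the integers 1 to 20, with the largest value then replaced by 200), the function names are illustrative, and the quartiles use the "inclusive" method of `statistics.quantiles`, which is assumed here because it places the p'th quantile at position 1+(n−1)p.

```python
from statistics import mean, median, quantiles, stdev


def trimmed_mean(data, proportion=0.05):
    """Mean of the central part after trimming `proportion` from each tail."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    xs = sorted(data)
    k = int(len(xs) * proportion)      # observations dropped from each tail
    return mean(xs[k:len(xs) - k])


def iqr(data):
    """Interquartile range q0.75 - q0.25 (length of the boxplot's central box)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return q3 - q1


def mad(data):
    """Median absolute deviation of the data values from the median."""
    centre = median(data)
    return median([abs(x - centre) for x in data])


clean = list(range(1, 21))             # hypothetical data: 1, 2, ..., 20
contaminated = clean[:-1] + [200]      # largest value replaced by an outlier

for label, values in (("clean", clean), ("contaminated", contaminated)):
    print(
        f"{label:>12}: mean={mean(values):.2f} median={median(values):.2f} "
        f"trimmed={trimmed_mean(values):.2f} range={max(values) - min(values)} "
        f"sd={stdev(values):.2f} IQR={iqr(values):.2f} MAD={mad(values):.2f}"
    )
```

In this example the single outlier changes the mean, range and standard deviation, while the median, 5% trimmed mean, IQR and MAD are unchanged.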