Descriptive and exploratory statistics

Garib Murshudov

Contents

1. Introduction
2. Location
3. Spread
4. Various plots: plots depend on data
5. Histograms, cumulative distributions and all that

Purpose of descriptive statistics and various plots

• maximize insight into a data set
• uncover underlying structure
• extract important variables
• detect outliers and anomalies
• test underlying assumptions
• In general: build intuition about the data and the problem

Descriptive statistics

There are two types of simple numerical descriptors of a data set:
1) Values describing location: mean, median, mode
2) Values describing spread: variance, interquartile range

Descriptive statistics

Additional descriptive numerical values:

Skewness is a measure of the asymmetry of the distribution. Positive skewness: the right tail is fatter. Negative skewness: the left tail is fatter.

Kurtosis is a measure of tail weight. Positive kurtosis: heavier tails than the normal distribution. Negative kurtosis: lighter tails than the normal distribution.

For the normal distribution both skewness and (excess) kurtosis are zero.

[Figure: Histogram of aa (skewness = 1.18, kurtosis = 2.33) and Histogram of bb (skewness = -1.18, kurtosis = 2.33); y-axis: Frequency]

Location

The simplest information about a data set is its location. There are three different location parameters: average, median and mode:
1) Average = sum(data)/Ndata
2) Median: the proportion of data above the median equals the proportion below it
3) Mode: the most frequently occurring data point.

Example: mean = 9.91, median = 9.92, mode = 9.98

[Figure: Histogram of aa; y-axis: Density]

Location

The average is very sensitive to a few outliers. By changing a single data value arbitrarily we can change the average substantially.
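To make this sensitivity concrete, here is a minimal Python sketch (standard library only). The sample values and the helper name mean_after_shift are hypothetical, chosen only for illustration. It shows that adding delta to one of n values moves the average by exactly delta/n, so a large enough change to a single value moves the average as far as we like.

```python
from statistics import mean

def mean_after_shift(data, index, delta):
    """Return the mean after adding delta to data[index]."""
    shifted = list(data)
    shifted[index] += delta
    return mean(shifted)

# Hypothetical sample, chosen only for illustration.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
base = mean(sample)  # 3.0

for delta in (10.0, 1000.0, 1e6):
    moved = mean_after_shift(sample, 0, delta)
    # The change in the mean equals delta / n, whichever value was altered.
    print(delta, moved - base, delta / len(sample))
```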
However, the median is hardly affected by such a change.

Example:

13.2  8.2  10.9  14.3  10.7  6.6  9.5  10.8  8.8  13.3   - Av = 10.63, median = 10.75
13.2  8.2  10.9  74.3  10.7  6.6  9.5  10.8  8.8  13.3   - Av = 16.63, median = 10.75

The breakdown point of the average is 0; the breakdown point of the median is 0.5. I.e. you have to change 50% of the data dramatically to affect the median.

The median is the most robust estimator.
The average is the most convenient estimator, with nice properties.
If the sample is small then it may be impossible to estimate the mode.

Breakdown point (Wiktionary): The number or proportion of arbitrarily large or small extreme values that must be introduced into a batch or sample to cause the estimator to yield an arbitrarily large result.

Simpson’s paradox: batting averages

One should be careful in dealing with averages. The most famous paradox related to averages is Simpson’s paradox.

                 Runs   Outs   Average
1st Ashes   MW    270      6     45
            SW    500     10     50
2nd Ashes   MW    700     10     70
            SW    320      4     80
Total       MW    970     16     60.63
            SW    820     14     58.57

MW: Mark Waugh, SW: Steve Waugh

Steve Waugh has the higher average in each series, yet Mark Waugh has the higher average overall, because most of Mark's outs fall in the high-scoring second series while most of Steve's fall in the first.

Spread

There are two main indicators of the spread of a data set:
1) Standard deviation (= var^(1/2)). It is the usual indicator of spread and very easy to calculate, but it is not robust to outliers: one outlier is sufficient to corrupt the standard deviation.
2) Interquartile range (IQR): 50% of the data lie between the first and third quartiles. This indicator is more robust: you need to corrupt 25% of the data to corrupt the IQR.

[Figure: Histogram of rn; y-axis: Density. Black vertical lines: quartiles. Blue vertical lines: mean+sd, mean-sd]

Spread: robustness

The standard deviation is very sensitive to a few outliers. By changing a single data value arbitrarily we can change the standard deviation substantially. The IQR, however, is hardly affected.

Example:

13.2  8.2  10.9  14.3  10.7  6.6  9.5  10.8  8.8  13.3   - sd = 2.45, IQR = 3.65
13.2  8.2  10.9  74.3  10.7  6.6  9.5  10.8  8.8  13.3   - sd = 20.37, IQR = 3.65

The breakdown point of the sd is 0; the breakdown point of the IQR is 0.25. I.e.
you have to change 25% of the data dramatically to affect the IQR.

The IQR is much more robust than the sd.

Tukey’s five number summaries

One of the important books on statistical data analysis is: Tukey, JW. (1977) Exploratory data analysis.

After this book there was an explosion of exploratory data analysis, i.e. visualisation of data sets and modelling based on visual analysis. One of the suggestions in this book is the five number summary of a data set. Essentially these numbers are (although different numbers are suggested in Tukey’s book): minimum, 1st quartile, median, 3rd quartile, maximum.

These numbers are calculated in R with the command summary, which also reports the mean. For example:

    A = 13.2 8.2 10.9 14.3 10.7 6.6 9.5 10.8 8.8 13.3
    summary(A)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      6.600   8.975  10.750  10.630  12.620  14.300

Various plots

In general, data visualisation depends on the type of data and the system it comes from. For some data sets a few general plots can be suggested. These include:
1) Box and whisker plot (boxplot)
2) Histograms
3) Cumulative distribution plots
4) QQ plots

Boxplots

Boxplots are a convenient way to visualise one dimensional data. A boxplot shows the minimum, maximum, first quartile, median and third quartile: a visual representation of the five number summary. The plot can indicate whether the distribution of the data is symmetric. It may also indicate outliers, i.e. points that are too different from the others, e.g. points outside the interval median ± 2*IQR.

[Figure: two boxplots]

Side by side boxplot

Boxplots can be used for visual comparison of data, e.g. the effects of different treatments.

[Figure: side by side boxplots for sprays A, B, C, D, E, F; caption: Effect of different insecticides]

Boxplots

Boxplots are just schematic plots. Sometimes they may mask some of the features of the data.
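As a small illustration of such masking, the following Python sketch (standard library only) computes the five number summary of two samples. Both samples and the helper name five_number_summary are hypothetical, constructed only for illustration: one sample is evenly spread, the other sits in two clusters, yet the summaries (and hence the boxplots) are identical. The "inclusive" quantile method is used because it reproduces the quartiles shown in the summary(A) output (8.975 and 12.62 for the example data).

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, 1st quartile, median, 3rd quartile, maximum)."""
    # "inclusive" interpolates linearly between order statistics.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return (min(data), q1, med, q3, max(data))

# Hypothetical samples, constructed only for illustration.
evenly_spread = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
two_clusters = [0, 2.5, 2.5, 2.5, 2.5, 5, 7.5, 7.5, 7.5, 7.5, 10]

print(five_number_summary(evenly_spread))  # (0, 2.5, 5.0, 7.5, 10)
print(five_number_summary(two_clusters))   # (0, 2.5, 5.0, 7.5, 10)
```

In the clustered sample the quartiles coincide with the clusters themselves, so nothing in the five numbers reveals the gap between them; a dot plot of the raw values would.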
A classic example is Lord Rayleigh’s data on measurements of the density of nitrogen derived from different sources, which led to the discovery of argon.

Rayleigh was led into the investigation by small anomalies he found in measurements of the density of nitrogen purified by different methods. Those different methods led to different quantities of nitrogen, and thus to different proportions of nitrogen and a hitherto unsuspected atmospheric gas. Argon was the first noble gas isolated. Ramsay's subsequent work isolated helium and discovered neon, krypton, and xenon by the end of the century. Ramsay and Rayleigh were awarded Nobel Prizes in 1904: Rayleigh the physics prize for argon, Ramsay the chemistry prize for argon and the family of noble gases.

[Figure: boxplot of Nitrogen (axis range about 2.298 to 2.310), and the same boxplot side by side with a scatter plot of Nitrogen against Index]

Looking at the boxplot alone we do not see any peculiarity in the data. However, one can notice that the whiskers are very close to the edges of the box, i.e. the minimum and maximum are close to the first and third quartiles respectively. When you see that, you should be suspicious about the data.

If we plot the scatter (dot) plot and the boxplot side by side we see peculiar behaviour: there seem to be two classes.

Boxplots

Let us use boxplots for the different sources of nitrogen. There are definitely two classes: one derived from air and another from other sources.

[Figure: scatter plot of Nitrogen against Index, and boxplots of Nitrogen for the groups Air and NoAir]

Insectsprays revisited

If we plot the scatter plot and the boxplots of the InsectSprays data side by side we see that there is some peculiarity for spray F. I do not know the reason, but it may be interesting to investigate if you see something like that in your data.

[Figure: scatter plot of N Insects against Index, and boxplots of N Insects for sprays A, B, C, D, E, F]
Histograms

Histograms are a good way of visualising 1D data (there are higher dimensional versions also). If there are enough data points then histograms may indicate the potential distribution, multimodality and skewness. For visually pleasing histograms the number of bins is important: too many bins can give a very noisy histogram, too few bins can mask important features.

[Figure: histograms of rn1 and rn2 with Nbin = 5, Nbin = 50 and Nbin = 500; y-axis: Frequency]

Scott DW, Multivariate Density Estimation

Cumulative frequency (probability) plot

Histograms represent the density of a probability distribution. To plot a histogram we divide the range of the data into bins and then count the number of data points in each bin (for bin number i we count the data points obeying $x_i \le y < x_{i+1}$, where the $x_i$ are the bin boundaries and y is an observation). A bin may contain very few data points, so the counts can vary strongly, resulting in a noisy histogram.

Cumulative frequency (probability) plots are another way of representing data. In this case we count the number of data points below a given point (all y for which $y < x_i$). The number of data points becomes larger and larger as $x_i$ approaches the maximum value of the data.

Cumulative frequency plots

Cumulative distributions may indicate whether the data points have a normal distribution, a heavy tail or some other peculiarity. These plots can also help to select an appropriate distribution.
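The two counting rules (per-bin counts for a histogram, running counts for a cumulative plot) can be sketched in a few lines of Python, standard library only. The data, the bin boundaries and the function names bin_counts and cumulative_counts are hypothetical, chosen only for illustration.

```python
from bisect import bisect_left

def bin_counts(data, edges):
    """Count observations y with edges[i] <= y < edges[i+1] for each bin."""
    ordered = sorted(data)
    # bisect_left(ordered, x) is the number of observations strictly below x.
    return [
        bisect_left(ordered, hi) - bisect_left(ordered, lo)
        for lo, hi in zip(edges, edges[1:])
    ]

def cumulative_counts(data, points):
    """Count observations y with y < x for each x in points."""
    ordered = sorted(data)
    return [bisect_left(ordered, x) for x in points]

# Hypothetical data and bin boundaries.
data = [0.5, 1.2, 1.9, 2.1, 2.4, 3.3, 3.8, 4.0]
edges = [0, 1, 2, 3, 4, 5]

print(bin_counts(data, edges))         # [1, 2, 2, 2, 1]
print(cumulative_counts(data, edges))  # [0, 1, 3, 5, 7, 8]
```

Each cumulative count is the sum of all bin counts to its left. Because every cumulative value pools many observations, it fluctuates less than an individual bin count, and no choice of bin width is needed.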
However, these plots are hard to interpret on their own.

One way of comparing two distributions is to plot them on the same plot. To do this we need at least to standardise the data. Even after standardisation the ranges of the data can be very different.

Data standardisation: y = (x-mean(x))/sd(x)

[Figure: empirical cumulative distribution functions ecdf(rn), ecdf(rn1) and ecdf(rn2); x-axis: x, y-axis: Fn(x)]

QQ plots

Quantile-quantile plots are useful when testing distributional assumptions. These plots can indicate whether two data sets are from the same distribution; if they are, the plot can help to transform one of them linearly into the other.

Mathematically: let us say that X is from the distribution with cumulative distribution function (CDF) F(x) and Y has the CDF G(y). Then by solving

$G(y) = F(x)$
$y = G^{-1}(F(x))$

we can find the relationship between y and x. Random variables can thus be converted from one to another using QQ plots. For example, if x is from the exponential distribution, $F(x) = 1 - \exp(-\lambda x)$, and y is from the uniform distribution on the interval (a,b), $G(y) = (y-a)/(b-a)$, then we need to solve:

$(y-a)/(b-a) = 1 - \exp(-\lambda x)$
$y = b - (b-a)\exp(-\lambda x)$

If we see an exponential function in the QQ plot then we may have this particular relationship.

QQ plots

Example: uniform and exponential distributions

[Figure: Empirical: ru against r1. Theoretical: 1 - exp(-3 * xx) against xx]

QQ norm

QQ norm is a special case of the QQ plot: it is a quantile-quantile plot against the normal distribution. QQ norm can already indicate some properties of the data.
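How the points of a normal QQ plot are obtained can be sketched in Python (standard library only). This is a sketch under stated assumptions, not the method used for the figures: the function name qqnorm_points is hypothetical, and the plotting positions (i - 0.5)/n are one common convention; plotting software may use slightly different positions. The sample is the ten-value example from the robustness slides with 14.3 replaced by 74.3.

```python
from statistics import NormalDist

def qqnorm_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a normal QQ plot.

    Assumption: plotting positions p_i = (i - 0.5) / n for i = 1..n.
    """
    ordered = sorted(sample)  # sample quantiles are the sorted observations
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [
        standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)
    ]
    return list(zip(theoretical, ordered))

# Example data from the slides, with one value replaced by an outlier.
sample = [13.2, 8.2, 10.9, 74.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]

for theoretical, observed in qqnorm_points(sample):
    print(f"{theoretical:6.2f} {observed:6.1f}")
```

For data that are close to normal the printed pairs fall near a straight line; the single value 74.3 lies far above the line through the other nine points.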
1) Outliers

[Figure: three Normal Q-Q plots (Sample Quantiles against Theoretical Quantiles): Normal; Too large value; Too small value]

QQ norm

2) Bimodality, skewness (note that a small curvature at small and large values can be expected)

[Figure: three Normal Q-Q plots (Sample Quantiles against Theoretical Quantiles): Normal; Skewed to left; Bimodal. Annotations: Convex, Curvature]

QQ norm

3) Heavier tails

[Figure: Histogram of rn (Normal), Histogram of rtt (t distribution, df = 3), and a Normal Q-Q plot annotated "Heavier tails"]

QQ norm

If the distributions of two random variables have the same form then we may derive the linear transformation of one data set into the other.

QQ plot: one data set against another. The slope and intercept of the line give the linear transformation needed: y = a + bx with a = 3, b = 10.

[Figure: QQ plot of rn1 against rn]

Conclusions

• Average and variance are the usual measures of location and spread of the data. However, they are not robust. Median and IQR are more robust.
• A boxplot is a good way of summarising data; however, it might mask features of the data.
• QQ plots can be used to check distributional assumptions.

References

• Tukey, JW. (1977) Exploratory data analysis
• Scott DW, Multivariate Density Estimation