Chapter 7 What to do when you have the data

We saw in the previous chapters how to collect data. We will spend the rest of this course looking at how to analyse the data that we have collected.

Stem and Leaf Diagrams

Stem and Leaf Diagrams are graphical ways to display a group of integers in a dataset.

Steps for Constructing a Stem and Leaf Diagram
1. Select one or more of the leading digits to be the Stem values; the remaining digits become the Leaves.
2. List the possible Stem values in a column.
3. Record the Leaf for every observation beside the corresponding Stem value.
4. Indicate on the display what units are used for the Stems and Leaves.

Example
The following are a selection of exam marks:

71 52 52 75 64 60 48 56 67 29
11 53 25 46 58 46 49 62 66 40
19 54 57 54 60 19 59 43 51 40
21 45 46 62 73 59 36 45 55 46
45 32 55 46 51 46 65 49 61 40

A Stem and Leaf Diagram will look like this:

1 | 199
2 | 159
3 | 26
4 | 0003555666666899
5 | 11223445567899
6 | 001224567
7 | 135

STEM UNIT = TENS
LEAF UNIT = ONES

Histogram for Discrete Numerical Data
1. Draw a horizontal X-axis and on it mark the possible values taken by the observations.
2. Draw a vertical Y-axis marked with either relative frequencies or frequencies.
3. Above each possible value on the X-axis draw a rectangle centred on the value with width 1 and height equal to the relative frequency or frequency of that value.

Value       30   40   50   60
Frequency  100  150  200  100

[Figure: histogram of the table above, with Value (30, 40, 50, 60) on the X-axis and Frequency (0 to 250) on the Y-axis]

The Shape of Histograms

The general shape of a histogram is important. The number of peaks in the histogram determines whether a distribution is classed as Unimodal, Bimodal or Multimodal. In addition to this classification we can further classify Unimodal distributions as to whether they are symmetric or not.
A unimodal distribution is defined to be Symmetric if there is a vertical line of symmetry through the middle of the distribution, such that the distribution to the left of this line is the mirror image of the distribution to the right of this line.

The right part of a unimodal distribution is called the Upper Tail of the distribution, while the left part is called the Lower Tail.

A Unimodal distribution which is not symmetric is called skewed. There are two types of skewness.

Positive Skew: If the upper tail of the distribution stretches out more than the lower tail then the distribution is said to be positively skewed.
Negative Skew: If the lower tail of the distribution stretches out more than the upper tail then the distribution is said to be negatively skewed.

[Figures: Symmetric Distributions; Positively Skewed Distribution; Negatively Skewed Distribution]

Definitions
Mean: The Mean of a quantitative dataset is the sum of the observations in the dataset divided by the number of observations in the dataset.
Median: The Median (m) of a quantitative dataset is the middle number when the observations are arranged in ascending order.
Mode: The Mode of a dataset is the observation that occurs most frequently in the dataset.

How to calculate these
Dataset: $X_1, X_2, X_3, \dots, X_n$

Mean = $(X_1 + X_2 + X_3 + \dots + X_n)/n$

Median: Arrange the n observations in order from smallest to largest, then:
if n is odd, the median (m) is the middle number;
if n is even, the median is the mean of the middle two numbers.

Mode: If given a dataset, the mode is easily chosen as the value which appears most often.
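The three calculation rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the small dataset are made up for the illustration and are not part of the course material. When two values tie for most frequent, this version simply reports the one that appears first.

```python
from collections import Counter


def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Middle number of the ordered data (mean of the middle two if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # n is odd: the median is the single middle number
        return ordered[middle]
    # n is even: the median is the mean of the middle two numbers
    return (ordered[middle - 1] + ordered[middle]) / 2


def mode(values):
    """Value that occurs most often (first one encountered if there is a tie)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical dataset, chosen only to illustrate the even-n median rule
data = [4, 1, 7, 4, 10, 6]
print("Mean:", mean(data))
print("Median:", median(data))
print("Mode:", mode(data))
```

Python's standard `statistics` module offers `mean`, `median` and `mode` functions that do the same job; the hand-written versions are shown only because they mirror the steps described above.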
Example A:
Dataset: 5, 3, 8, 5, 6
Mean = 27/5 = 5.4
Mode = 5
Median: 3, 5, 5, 6, 8 so m = 5
Note: 5.4 is not one of the original values in the dataset.

Example B:
11, 140, 98, 23, 45, 14, 56, 78, 93, 200, 123, 165
n = 12
Mean = 1046/12 = 87.17 (to two decimal places)
Median: 11, 14, 23, 45, 56, 78, 93, 98, 123, 140, 165, 200
m = (78 + 93)/2 = 85.5

Example C:
Generate a dataset containing 9 numbers using the Day, Month and Year of your birth and that of the people sitting to your left and right, i.e. DD/MM/YY.

Mean vs Median vs Mode: which measures the centre best?

Choosing which of these three measures to use in practice can sometimes seem like a difficult task. However, if we understand a little about the relative merits of each we should at least be able to make an informed decision.

If the distribution is symmetric then Mean = Median
If the distribution is Positively Skewed (to the right) then Median < Mean
If the distribution is Negatively Skewed (to the left) then Mean < Median

So the difference between the mean and median can be used to measure the skewness of a dataset.

Note: The presence of outliers affects the mean but has little or no effect on the median. This can be seen from the diagrams and from the following example.

Example: Ten statistics graduates who are now working as statisticians are surveyed for their annual salary. The survey produced the following dataset:

£60,000 £20,000 £19,000 £22,000 £21,500 £21,000 £18,000 £16,000 £17,500 £20,000

Mode = £20,000
Median = £20,000
Mean = £23,500

Notice that the distribution is positively skewed. The presence of the one high earner has affected the Mean, causing it to be £1,500 higher than the highest of all the salaries excluding £60,000. For this dataset the Mean is therefore not a good measure of the centre of the dataset. Notice also that the median would be unaffected if the £60,000 was changed to a value like £23,000, which is more in line with the rest of the data.
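The salary example can be checked with a short Python sketch. It uses the standard `statistics` module; the helper name `summarise` is just a label chosen for this illustration. The second call replaces the £60,000 salary with £23,000, as suggested above, to show how the mean moves while the median stays put.

```python
import statistics

# Annual salaries (in pounds) of the ten surveyed graduates
salaries = [60000, 20000, 19000, 22000, 21500, 21000, 18000, 16000, 17500, 20000]


def summarise(values):
    """Return the three measures of centre for a dataset."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


original = summarise(salaries)

# Replace the one high earner with a salary more in line with the rest
adjusted = summarise([23000 if s == 60000 else s for s in salaries])

print("Original:", original)  # mean 23500, median 20000, mode 20000
print("Adjusted:", adjusted)  # mean 19800, median 20000, mode 20000
```

The mean drops from £23,500 to £19,800 when the outlier is replaced, whereas the median and mode remain at £20,000.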
Examples
Would you expect the datasets described below to be symmetric, skewed to the right or skewed to the left?
A. The salaries of people employed by UCD
B. The grades on an easy exam
C. The grades on a difficult exam
D. The amount of time spent by students in a difficult 3 hour exam
E. The amount of time students in this class studied last week
F. The age of cars on a used car lot

Example: The median age of the population in Ireland is now 32 years old. The median age of the Irish population in 1986 was 27. Interpret these values and explain the trend. What implications does this data have for Irish society? What are the consequences for the entertainment industry in Ireland?

Numerical Measures of Variability

When we want to describe a dataset, providing a measure of the centre of that dataset is only part of the story. Consider the following two distributions:

Both of these distributions are symmetric and meanA = meanB, modeA = modeB and medianA = medianB. However these two distributions are obviously different: the data in A is quite spread out compared to the data in B. This spread is technically called variability and we will now examine how best to measure it.

[Table: Revision Tutorials timetable, Monday to Friday, hours 11 to 6]

Definitions
Range: The Range of a quantitative dataset is equal to the largest value minus the smallest value.
Sample Variance: The Sample Variance is equal to the sum of the squared distances from the mean divided by n-1.
Standard Deviation: The Sample Standard Deviation, s, is defined as the positive square root of the Sample Variance, $s^2$.

Sample Variance

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

or, equivalently,

$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$

Which is best?

The meaning of the Range is easily seen from its definition. It is a very crude measure of the variability contained in a dataset, as it is only interested in the largest and smallest values and does not measure the variability of the rest of the dataset.
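The three definitions above can be sketched in Python as follows. This is an illustrative sketch only: the function names are invented for the illustration, and the data is the dataset from Example A earlier (5, 3, 8, 5, 6). Both forms of the sample variance are computed so that they can be compared.

```python
import math


def data_range(values):
    """Largest value minus the smallest value."""
    return max(values) - min(values)


def sample_variance(values):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Equivalent form: (sum of squares - (sum)^2 / n) divided by n - 1."""
    n = len(values)
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return (total_of_squares - total ** 2 / n) / (n - 1)


def sample_std_dev(values):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [5, 3, 8, 5, 6]  # dataset from Example A
print("Range:", data_range(data))
print("Variance (definition):", sample_variance(data))
print("Variance (shortcut):", sample_variance_shortcut(data))
print("Standard deviation:", sample_std_dev(data))
```

For this dataset the range is 5 and both variance formulas give 13.2/4 = 3.3, up to floating-point rounding.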
Example: These two datasets have the same range but do they have the same variability?
Dataset1: 1, 5, 5, 5, 9
Dataset2: 1, 2, 5, 8, 9
No. Dataset2 is obviously more spread out than Dataset1, which has three values clustered at 5.

Example
Once upon a time there were two lecturers, A and B. Each delivered the same course to a different class. When exam time came both classes had the same average mark of 70%. The marks for Lecturer A's class, however, had a Standard Deviation of 25%, whereas the Standard Deviation for Lecturer B's class was 5%. Whose class would you rather be in?

Chapter 8 Normal Curves and Relative Standing

We have just seen how datasets can be described by histograms. For large datasets of continuous variables the histograms have so many possible values that it would be impracticable to draw all of the really narrow rectangles necessary. Instead we represent these datasets by curves (distributions). The curve can be thought of as joining the centre points of the tops of all the rectangles in the histogram.

These distributions, which are like generalised relative frequency histograms, can take many different shapes, some symmetrical and some skewed. There is one shape however that crops up all through the natural world and that is ...

THE NORMAL DISTRIBUTION
aka The Gaussian Distribution or The Bell Curve

The Normal Curve

The Normal Distribution is Symmetric.
There are many different Normal curves: some are fat, some are thin. Some are centred at 0, some at 1, some at 5, etc.
Each Normal curve can be uniquely identified by two parameters: the Mean and the Standard Deviation.
Once you know the Mean and the Standard Deviation for a Normal curve then it is possible to draw the curve.
Normal curves are centred at the Mean, and the Standard Deviation describes how spread out they are.

[Figure: The Normal Curve, with the MEAN and the Standard Deviation marked]

The area under a Normal curve to the left of the mean is 0.5.
This indicates that the probability that something which is normally distributed is less than its mean is 0.5.

The area under the curve to the left of any point A on the X axis represents the probability that a Normal variable is less than A.

X ~ Normal: Probability(X < A) is the area under the curve to the left of A.

[Figure: Normal curve with A and the MEAN marked on the X axis]

There are an infinite number of different Normal curves, one for each possible combination of values of the mean and the standard deviation. However there is a relationship between all Normal curves. All Normal variables X can be transformed into a Standard Normal Variable Z. Z is Normal with Mean 0 and Standard Deviation 1.

$Z = \frac{X - \mu}{\sigma}$

where $\mu$ is the Mean and $\sigma$ is the Standard Deviation of X.

We can use tables to look up areas under the Standard Normal Curve.

Example: Find the Probability that a Normal variable with Mean 3 and Standard Deviation 2 is less than 4.

$\Pr(X < 4) = \Pr(X - 3 < 4 - 3) = \Pr\left(\frac{X - 3}{2} < \frac{4 - 3}{2}\right) = \Pr(Z < 0.5) = 0.6915$

Section Interpreting the Standard Deviation: the Empirical Rule

We have seen that the Variance, and hence the Standard Deviation, of a dataset provides us with a relative measure of the variability contained in a dataset. So if we are given two datasets, the one with the larger Standard Deviation will be the dataset which exhibits the greater variability.

Is it possible for the Standard Deviation to give more than a relative measure of variability? Can we actually say how spread out the data is? The answer is yes, and we will see later how to give detailed answers for particular distributions. In the meantime there are two rules which will provide us with a good deal of information about some general datasets.

The Empirical Rule provides us with some definite statements about the proportion of observations in a specified interval. It only works for Symmetric Bell-Shaped (mound-shaped) distributions. Also, this rule is an approximation, and more or less data than is indicated by the rule may lie in each interval.

The Empirical Rule
For a Symmetric Bell-Shaped distribution (Normal or close to Normal):
Approximately 68% of the observations are within 1 Standard Deviation of the Mean
Approximately 95% of the observations are within 2 Standard Deviations of the Mean
Approximately 99.7% of the observations are within 3 Standard Deviations of the Mean

Example
In Tombstone, Arizona Territory, people used Colt .45 revolvers. However people used different ammunition. Wyatt Earp knew that his brothers and Doc Holliday were the only ones in the territory who used Colt .45s with Winchester ammunition. The Earp brothers conducted tests on many different combinations of weapons and ammunition. They found that the dataset of observations produced by the combination of Colt .45 with Winchester shells showed a Mean velocity of 936 feet/second and a Standard Deviation of 10 feet/second. The measurements were taken at a distance of 15 feet from the gun.

When Wyatt examined the body of a cowboy shot in the back in cold blood he concluded that he was shot at a distance of 15 feet and that the velocity of the bullet at impact was 1,000 feet/second. The dastardly Ike Clanton claimed that this cowboy was shot by the Earp brothers or Doc Holliday. Was Wyatt able to clear his good name using the Empirical Rule?

The distribution of this bullet velocity data should be approximately bell-shaped. This implies that the Empirical Rule should give a good estimation of the percentages of the data within each interval.

k (number of Standard Deviations)   Interval (ft/sec)   Empirical approximate Percentage
2                                   916, 956            95%
3                                   906, 966            99.7%
4                                   896, 976            ~100%
5                                   886, 986            ~100%
6                                   876, 996            ~100%
7                                   866, 1006           ~100%

This table quite clearly demonstrates that, since the bullet velocity in the shooting was 1,000 ft/sec and since this lies more than 6 Standard Deviations away from the mean, the probability is extremely high that the Earps were not responsible for this shooting. This is especially evident from looking at the column showing percentages from the Empirical Rule.
Practically 100% of bullet velocities should be between 896 and 976 ft/sec.

Numerical Measures of Relative Standing

While it is useful to know how to measure the centre of a dataset and the variability of a dataset, many times we want to be able to compare one observation with the rest of the observations in the dataset. Is one observation larger than many others? For example, suppose you get 35% on the exam for this course. You will probably feel quite bad about your performance, but what if 90% of the class actually did worse than you? Then you might feel a bit better about your 35%. So in some cases knowing how one observation compares with others can be more useful than just knowing the value of that observation. We will now look at some different ways of measuring Relative Standing.

Definitions
Percentile: For any dataset the Pth percentile is the observation which is greater in value than P% of all the numbers. Consequently this observation will be smaller than (100-P)% of the data.
Z-Score: The Z-Score of an observation is the distance between that observation and the mean, expressed in units of standard deviations. So:

$Z = \frac{X - \mu}{\sigma}$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the dataset.

The numerical value of the Z-score reflects the relative standing of the observation. A large positive Z-score implies that the observation is larger than most of the other observations. A large negative Z-score indicates that the observation is smaller than almost all the other observations. A Z-score of zero or close to 0 means that the observation is located close to the mean of the dataset.

Example
A sample of 120 statistics students was chosen and their exam results summarised. The mean and standard deviation were shown to be:
mean = 53%
st.dev. = 7%
Eric and Kenny are two students in this class. Eric's exam result was 47%. What was his Z-score? If Kenny's Z-Score is 2, what was his percentage on the exam? What happens to Kenny then?
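The Z-score calculation can be sketched in Python as below. This is a minimal sketch rather than a model answer: the function names are invented for the illustration. The last function is an addition that is not in the notes. It computes the area under the Standard Normal curve to the left of z using the error function `math.erf` from the standard library, in place of looking the value up in tables, and is used here only to reproduce the Chapter 8 example Pr(Z < 0.5) = 0.6915.

```python
import math


def z_score(x, mean, std_dev):
    """Distance of x from the mean, in units of standard deviations."""
    return (x - mean) / std_dev


def value_from_z(z, mean, std_dev):
    """Invert the Z-score formula to recover the original observation."""
    return mean + z * std_dev


def standard_normal_area_left(z):
    """Area under the Standard Normal curve to the left of z (uses math.erf)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Exam results: mean = 53%, standard deviation = 7%
MEAN, STD_DEV = 53, 7

eric_z = z_score(47, MEAN, STD_DEV)          # (47 - 53) / 7, about -0.86
kenny_mark = value_from_z(2, MEAN, STD_DEV)  # 53 + 2 * 7 = 67

print("Eric's Z-score:", round(eric_z, 2))
print("Kenny's mark:", kenny_mark)

# Chapter 8 example: X is Normal with Mean 3 and Standard Deviation 2
z = z_score(4, 3, 2)
print("Pr(X < 4):", round(standard_normal_area_left(z), 4))  # 0.6915
```

Eric's negative Z-score says his mark is a little under one standard deviation below the mean, while Kenny's Z-score of 2 puts him two standard deviations above it.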