Representation of Data

• In the last unit we focused on the definition of statistics and looked at what a population and a sample are.
• In this unit we shall look at visual representations of data, which are used often.
• We will also use what we have learnt so far and consider when each technique is appropriate.

Stem and Leaf Diagrams

You will have seen stem and leaf diagrams at GCSE. They also appear at A-level, where you will be asked more questions about them.

Data: 20, 9, 17, 12, 28, 31, 22, 24, 17, 25, 24, 24, 26

The leaf will usually be the last digit, and the stem the rest.

Unordered:
Stem | Leaf
0 | 9
1 | 7 2 7
2 | 0 8 2 4 5 4 4 6
3 | 1

Ordered:
Stem | Leaf
0 | 9
1 | 2 7 7
2 | 0 2 4 4 4 5 6 8
3 | 1

Make sure the data is in order!

4A

Twin Stem and Leaf Diagrams

Sometimes you will have 2 sets of data on one diagram. The following numbers represent flower widths (cm) for 2 different plants of the same species.

Plant 1: 2.5 2.1 3.0 3.2 1.9 1.5 1.9 2.2 2.4
Plant 2: 3.1 2.6 2.9 3.3 3.5 4.0 3.7 2.7 3.0

Plant 2 | Stem | Plant 1
| 1 | 5 9 9
9 7 6 | 2 | 1 2 4 5
7 5 3 1 0 | 3 | 0 2
0 | 4 |

Key: 6 | 2 | 1 means 2.1 for plant 1 and 2.6 for plant 2

Calculate the Median and Inter-quartile range for the following Stem and Leaf diagram.

Stem | Leaf
2 | 3 6
3 | 1 5 7 7 8 9
4 | 0 3 3 4
5 | 2
13 numbers

Q2: n/2 = 13/2 = 6.5, not a whole number, so round up to the 7th term = 38
Q1: n/4 = 13/4 = 3.25, so round up to the 4th term = 35
Q3: 3n/4 = 39/4 = 9.75, so round up to the 10th term = 43
Q3 - Q1 = 43 - 35 = 8

Calculate the Median and Inter-quartile range for the following Stem and Leaf diagram.

Stem | Leaf
6 | 1 2 5 5 8
7 | 0 1 2 3 6 7
8 | 1 4
9 | 0
14 numbers

Q2: n/2 = 14/2 = 7, a whole number, so take the 7.5th term (the mean of the 7th and 8th terms) = 71.5
Q1: n/4 = 14/4 = 3.5, so round up to the 4th term = 65
Q3: 3n/4 = 42/4 = 10.5, so round up to the 11th term = 77
Q3 - Q1 = 77 - 65 = 12

Outliers

An outlier is an extreme value that lies outside the overall pattern of data.
An outlier is any value that is:

• Bigger than Upper Quartile + (1.5 x Inter-quartile Range), i.e. Q3 + 1.5(IQR)
• Smaller than Lower Quartile - (1.5 x Inter-quartile Range), i.e. Q1 - 1.5(IQR)

So work out 1.5 x IQR, add it to the upper quartile and subtract it from the lower quartile, and you have the acceptable range of values.

The rules above are standard, but you may be given a different rule to apply in the exam.

4B

Outliers

For the Stem and Leaf diagram below, calculate the quartiles and find any outliers.

Key: 3 | 1 means 3.1
Stem | Leaf
2 | 2 2 3 3 5 7
3 | 1 2 6 7 7 7 8 8 8 8 9 9 9
4 | 0 0 0 0 4 5 6 7 8
5 | 1 5
30 numbers

Q2: n/2 = 30/2 = 15, so take the 15.5th term = 3.8
Q1: n/4 = 30/4 = 7.5, so take the 8th term = 3.2
Q3: 3n/4 = 90/4 = 22.5, so take the 23rd term = 4.0
IQR = Q3 - Q1 = 4.0 - 3.2 = 0.8

Lowest acceptable value: Q1 - 1.5(IQR) = 3.2 - 1.5(0.8) = 2.0
Highest acceptable value: Q3 + 1.5(IQR) = 4.0 + 1.5(0.8) = 5.2

So 5.5 is the only outlier.

Box Plots and comparing data

[Figure: box plot on a scale from 10 to 80, with the smallest value, lower quartile, median, upper quartile, largest value and an outlier labelled.]

Any outliers are plotted as crosses outside the main plot.
Each 'section' contains 25% of the observations in the sample.

4C/4D

Drawing the box plot

For the same Stem and Leaf diagram, Q1 = 3.2, Q2 = 3.8, Q3 = 4.0, the acceptable range is 2.0 to 5.2, and 5.5 is the only outlier.

[Figure: box plot of this data on a scale from 2 to 5.5.]

• Drawing the box plot

The blood glucose level of 30 males is recorded. Below is a summary of the results.
Lowest Value = 1.4
Lower Quartile = 3.6
Median = 4
Upper Quartile = 4.7
Highest Value = 5.2

Given that there was only one outlier, draw a box plot for the data.

IQR = 4.7 - 3.6 = 1.1
Highest acceptable value = 4.7 + 1.5(1.1) = 6.35
Lowest acceptable value = 3.6 - 1.5(1.1) = 1.95
So 1.4 is the outlier.

As we do not know the lowest value that is not an outlier, we draw the lower whisker to the lower boundary (1.95).

[Figure: box plot of the data on a scale from 1 to 6.]

Comparing Box Plots

[Figure: box plots of Glucose Level for Females and Males on a scale from 1 to 6.]

When you compare 2 box plots you should always comment on the Median and the Inter-quartile range. This is because the Median is a measure of location (average), and the Inter-quartile range is a measure of spread.

The median is higher for males, and they also have a larger Inter-quartile range. This indicates that males have a higher blood glucose level on average, and that their values are more spread out.

Histograms

A Histogram is similar to a bar chart, but there are 2 major differences:

• There are no gaps between bars (continuous data)
• The area of a bar is proportional to the frequency

When drawing a Histogram, use Frequency Density rather than frequency.

$$\text{Frequency Density} = \frac{\text{Frequency}}{\text{Class width}}$$

You may also need to use the following formula when interpreting a Histogram.

Area of Bar = k x Frequency

Usually the Area of the bar is equal to the frequency (k = 1). But it may be that all areas have been halved (i.e. k = 0.5) in order to make the diagram smaller.

4E

Histograms

The following table shows how long a sample of 200 students took to complete their homework. Draw a Histogram to represent the data.

Time (mins)    Frequency    Frequency Density
25-30          55           11 (55 ÷ 5)
30-35          39           7.8 (39 ÷ 5)
35-40          68           13.6 (68 ÷ 5)
40-50          32           3.2 (32 ÷ 10)
50-80          6            0.2 (6 ÷ 30)

[Figure: histogram of Frequency Density (0 to 14) against Time (mins) (20 to 90).]

As Area represents Frequency, we need to calculate the Area of each Rectangle we are including.
The task is to use the Histogram to estimate the number of students whose times were between 36 and 45 minutes. This interval covers part of two bars:

Rectangle 1 (36 to 40, Frequency Density 13.6): 4 x 13.6 = 54.4 students
Rectangle 2 (40 to 45, Frequency Density 3.2): 5 x 3.2 = 16 students

Overall our estimate would be 54.4 + 16 = 70.4, so about 70 students between 36 and 45 minutes.

[Figure: the homework histogram with the region from 36 to 45 minutes marked as rectangles 1 and 2.]

Histograms

A Histogram shows the time taken (s) for a group of children to complete a puzzle.

[Figure: histogram with Time (s) from 14 to 32 on the horizontal axis; one bar is labelled A.]

Why has a Histogram been used?
Time is Continuous Data.

What is the underlying feature of each bar?
Its area is proportional to the frequency of the group.

Bar A represents 78 children. What Area represents 1 child?
Area represents Frequency. Bar A measures 2 by 27.3, so its area is 2 x 27.3 = 54.6 cm².
78 Children = 54.6 cm²
1 Child = 54.6 ÷ 78 = 0.7 cm²

If the Area is 210 cm² in total, how many children were surveyed?
1 Child = 0.7 cm²
210 cm² ÷ 0.7 = 300 Children

Skewness and Comparisons

The Skewness of data can be described using diagrams, measures of location and measures of spread.

[Figure: three distributions labelled Symmetrical, Positive Skew and Negative Skew.]

Symmetrical: data which is spread evenly
Positive Skew: data which is mostly at the lower values
Negative Skew: data which is mostly at the higher values

4F

There are several ways of comparing Skewness. Sometimes you will be told which to use, and sometimes you will have to choose one depending on what data you have available.

You can see the shape of the data from a box plot.
[Figure: three box plots with Q1, Q2 and Q3 marked, one for each case below.]

You can also look at the quartiles:

Symmetrical: Q2 - Q1 = Q3 - Q2
Positive Skew: Q2 - Q1 < Q3 - Q2
Negative Skew: Q2 - Q1 > Q3 - Q2

Another test uses the measures of location:

Symmetrical: mean = median = mode
Positive Skew: mean > median > mode
Negative Skew: mean < median < mode

Low mode = lots of low values, i.e. Positive Skew
High mode = lots of high values, i.e. Negative Skew

The final test is a formula:

$$\frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$$

A value of 0 implies that mean = median: Symmetrical Data
A positive value implies that median < mean: Positive Skew
A negative value implies that median > mean: Negative Skew

The further from 0 a positive or negative value is, the more skewed the data is.

Find the 3 Quartiles for this data on test marks for 50 students.

Key: 6 | 1 means 61
Stem | Leaf
2 | 1 2 8
3 | 3 4 7 8 9
4 | 1 2 3 5 6 7 9
5 | 0 2 3 3 5 5 6 8 9 9
6 | 1 2 2 3 4 4 5 6 6 8 8 8 9 9
7 | 0 2 3 4 5 7 8 9
8 | 0 1 4

Q2: n/2 = 50/2 = 25, so take the 25.5th term = 60
Q1: n/4 = 50/4 = 12.5, so take the 13th term = 46
Q3: 3n/4 = 150/4 = 37.5, so take the 38th term = 69

Given the two values below, calculate the Mean and Standard Deviation of the data.
$$\sum x = 2873, \qquad \sum x^2 = 177353, \qquad n = 50$$

Mean:

$$\bar{x} = \frac{\sum x}{n} = \frac{2873}{50} = 57.46$$

Standard Deviation:

$$\sigma = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2} = \sqrt{\frac{177353}{50} - \left(\frac{2873}{50}\right)^2} = \sqrt{245.4084} = 15.67 \text{ (2dp)}$$

Use the formula below to calculate the Skewness of the data.

$$\frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}} = \frac{3(57.46 - 60)}{15.67} = \frac{-7.62}{15.67} = -0.486$$

So the data is Negatively Skewed!

Summary of the test marks data:
Q1 = 46, Q2 = 60, Q3 = 69
Mean = 57.46, Standard Deviation = 15.67, Mode = 68

Use another two methods to show the data is Negatively Skewed.

1) Q2 - Q1 = 14 and Q3 - Q2 = 9
Q2 - Q1 > Q3 - Q2, so Negative Skew

2) Mean < Median < Mode
57.46 < 60 < 68
A high mode implies many higher values, so Negative Skew

A company runs two manufacturing lines, A and B. They both make 2 cm rods in different ways. Samples are taken from both lines and the data is summarised in the following table. Which manufacturing line is best in this situation?

Line    Mean    Standard Deviation
A       2       0.015
B       2       0.05

• The rods need to be accurate
• Standard Deviation measures spread
• The rods from line A have a lower Standard Deviation
• Line A is therefore more reliable

This table shows data on pupils taking a Statistics and a Mechanics Paper. Which will be easier to set fair grade boundaries for?
Paper         Mean    Standard Deviation
Statistics    55      16
Mechanics     55      4

• A higher standard deviation means the marks are more spread out
• Therefore the grade boundaries will be more spread out for Statistics
• And will therefore be fairer!

Summary

• We have looked at using Stem and Leaf diagrams and Histograms to represent data
• We have looked at comparing data using these, as well as box plots
• We have learnt what outliers are
• We have learnt what Skewness is and used several measures to test it
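
The calculations summarised in this unit (quartile positions, outlier boundaries and the skewness formula) can be sketched in Python. This is a minimal sketch, not a definitive implementation: the function names are illustrative, and the quartile rule is an assumption read off the worked examples (if q x n/4 is a whole number, take the mean of that term and the next; otherwise round up). The outlier example is stored in tenths so that all values are whole numbers.

```python
import math


def from_stem_and_leaf(rows):
    """Expand (stem, "leaves") pairs into a sorted list of integers."""
    return sorted(10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves)


def quartile(sorted_values, q):
    """Return quartile q (1, 2 or 3) of an already sorted list.

    Assumed convention: work out q*n/4; if it is a whole number k, take the
    mean of the k-th and (k+1)-th terms, otherwise round up to a single term.
    """
    whole, remainder = divmod(q * len(sorted_values), 4)
    if remainder == 0:
        return (sorted_values[whole - 1] + sorted_values[whole]) / 2
    return sorted_values[whole]  # 0-based index of the rounded-up term


def outliers(sorted_values):
    """Values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, q3 = quartile(sorted_values, 1), quartile(sorted_values, 3)
    margin = 1.5 * (q3 - q1)
    return [x for x in sorted_values if x < q1 - margin or x > q3 + margin]


def skewness(sorted_values):
    """3(mean - median) / standard deviation."""
    n = len(sorted_values)
    mean = sum(sorted_values) / n
    variance = sum(x * x for x in sorted_values) / n - mean ** 2
    return 3 * (mean - quartile(sorted_values, 2)) / math.sqrt(variance)


# 30 values from the outlier example, in tenths (3 | 1 means 3.1, stored as 31).
tenths = from_stem_and_leaf([(2, "223357"), (3, "1267778888999"),
                             (4, "000045678"), (5, "15")])

# 50 test marks from the skewness example (6 | 1 means 61).
marks = from_stem_and_leaf([(2, "128"), (3, "34789"), (4, "1235679"),
                            (5, "0233556899"), (6, "12234456688899"),
                            (7, "02345789"), (8, "014")])

print(quartile(tenths, 1), quartile(tenths, 2), quartile(tenths, 3))  # 32 38.0 40
print(outliers(tenths))           # [55], i.e. 5.5 is the only outlier
print(round(skewness(marks), 3))  # -0.486
```

The skewness here uses the unrounded standard deviation, which agrees with the worked value of -0.486 to 3 decimal places.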