COIT11224 – Computer Systems – Lecture – Week 6

2.1 Listing Numerical Data

Suppose we have a class of 60 students who have submitted assignment 1 (worth a maximum of 10 marks) for the course COIT11224 – Computer Systems. We would have 60 pieces of data, each piece being a number in the range 0 to 10. The data may look like the following:

 5   2   9   5   0   8   7   4   9  10
 6   9   9   9   8   8   8   6   3   7
 7   6   7   8  10   5   8   7   4   3
 7   1   6   7   3   8   9   1   5   4
 2   0   5   6   8   8   9   2   6   8
 7   9   1   4   8   7   9  10   1   5

This is called raw data, and a very basic listing has been done simply by having six rows with ten marks in each row. By inspection, we can see that some students received the top mark of 10. We can also see that some students really struggled and didn’t get any marks at all. We can see quite a few marks greater than five, and it would appear that maybe more students passed the assignment than failed. However, to get a good understanding of the above data, different ways of listing the data or grouping it into categories can make it easier to analyse just how well the students performed.

We could list the above data in ascending order (reading down each column) and see how it looks.

 0   3   5   7   8   9
 0   3   5   7   8   9
 1   4   6   7   8   9
 1   4   6   7   8   9
 1   4   6   7   8   9
 1   4   6   7   8   9
 2   5   6   7   8   9
 2   5   6   8   8  10
 2   5   7   8   9  10
 3   5   7   8   9  10

It is reasonably easy to see that two students received 0 and three students received the maximum mark of 10. It is also relatively easy to see that far more students passed than failed the assignment. Can we list or group the data even more effectively?

Dot diagrams use dots (and other symbols) to indicate the number of times each data value occurs. Using the above data, we have already noticed that 2 students obtained 0 marks and another 3 obtained 10. We can continue in like manner and count the number of students achieving each mark represented by the above data. Then we can construct a dot diagram. Below is a dot diagram of the above data. I have used the * symbol, but could have used other symbols or a dot.
# of students
11                                   *
10                                   *
 9                               *   *   *
 8                               *   *   *
 7                               *   *   *
 6                       *   *   *   *   *
 5                       *   *   *   *   *
 4       *           *   *   *   *   *   *
 3       *   *   *   *   *   *   *   *   *   *
 2   *   *   *   *   *   *   *   *   *   *   *
 1   *   *   *   *   *   *   *   *   *   *   *
     0   1   2   3   4   5   6   7   8   9  10
              Marks for Assignment 1

Very easily we can see that the mark of 8 was the most common mark obtained by the students. It is also easy to see that the asterisks cluster far more heavily on the right-hand side of the diagram. This indicates that far more students passed than failed. By grouping data, we can improve the view. It is easier to gauge how well overall the class performed in assignment 1 with the above dot diagram.

The above dot diagram uses vertically displayed asterisks to represent the number of students in each mark category. We could quite easily rearrange the diagram so that the asterisks run horizontally and the mark categories are listed vertically on the left-hand side.

Another popular way to represent the frequency of categories is to use rectangles or bars of proportional length. Such diagrams are referred to as bar graphs. In the above diagram, we would simply replace each column of asterisks with a bar of the same height. These bars are usually separated, but sometimes you will see bars touching each other. The important part is that the height of each bar is proportional to the frequency of its category. Again, we can also have horizontal bars, with the length of each bar to the right indicating the frequency.

2.2 Stem-and-Leaf Displays

Please refer to section 2.2, page 302 of the textbook.

2.3 Frequency Distribution

Suppose now we turn our attention from the marks allocated for assignment 1 to the overall marks out of 100 that each of the 60 students obtained for the course COIT11224. The possible range of marks is now 0 to 100. If we tried to use a dot diagram to represent the overall marks, we would soon see that the resultant diagram is not so useful. Imagine if we had a graph with 101 mark columns along the horizontal axis.
It would be roughly 10 times the width of our assignment 1 diagram. Additionally, we only have 60 marks in total spread out over 101 vertical columns or mark categories. We would also have many categories with a frequency of zero, particularly towards the lower end of the marks. The rest of the diagram would consist of a light scattering of data symbols and the diagram would lose its appeal. Trends in how well students were performing overall could be difficult to see by inspection.

We can make improvements to the dot diagram to convey a more meaningful view. We can introduce classes of data (intervals or categories that cover a range of data). Instead of having a diagram with 101 individual mark categories, we could have five classes of marks whereby each class represents a range of marks. Performing this operation loses the individual raw marks, but can improve the visual appeal and understanding of the diagram.

Using the above example, we can construct custom classes to line up with the grading system used at CQU. If a student achieves 85 marks or higher in a course, CQU awards an HD; 75 to 84, a D; 65 to 74, a C; 50 to 64, a P; and finally an F for all marks less than 50. Using these classes, we sort and tally all the student marks and count the number of marks (the frequency) in each class. Again, we can use a dot or bar diagram to represent graphically the frequency of data in each class. Alternatively, we can simply tabulate the frequency distribution as follows:

Grade      # of students
______________________________
HD          3
D          11
C          15
P          22
F           9
______________________________
Total      60

A simple dot diagram of the above could look like the following:

Grade
HD    ***
D     ***********
C     ***************
P     **********************
F     *********
      Number of Students

The above dot diagram indicates visually that the class or category with the greatest number of students was P – Pass.
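
The tallying of marks into grade classes described above can be sketched in Python. This is only a minimal sketch: the names `grade_for`, `frequency_distribution` and `example_marks` are our own, chosen for illustration, and the ten marks in `example_marks` are made up (the real overall marks for the 60 students are not listed in these notes). Only the class boundaries come from the grading scheme described above.

```python
from collections import Counter

# Lower bound of each grade class, highest class first,
# following the grading scheme described in the notes.
GRADE_BOUNDS = [(85, "HD"), (75, "D"), (65, "C"), (50, "P"), (0, "F")]


def grade_for(mark):
    """Return the grade class that a mark out of 100 falls into."""
    if not 0 <= mark <= 100:
        raise ValueError(f"mark must be between 0 and 100, got {mark}")
    for lower_bound, grade in GRADE_BOUNDS:
        if mark >= lower_bound:
            return grade


def frequency_distribution(marks):
    """Count how many marks fall into each grade class."""
    counts = Counter(grade_for(mark) for mark in marks)
    # Report every class, including any with a frequency of zero.
    return {grade: counts.get(grade, 0) for _, grade in GRADE_BOUNDS}


# Hypothetical overall marks, invented purely for illustration.
example_marks = [91, 78, 66, 52, 49, 85, 74, 50, 64, 33]

distribution = frequency_distribution(example_marks)

# Print a horizontal dot diagram, one row per grade class.
for grade, count in distribution.items():
    print(f"{grade:<4}{'*' * count}  ({count})")
```

Because each mark is tested against the class boundaries from the highest class down and stops at the first match, every mark lands in exactly one class.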
General conclusions can be drawn quite quickly on how well students performed overall using the above frequency distribution diagrams.

A couple of points should be mentioned here. Firstly, our classes should be such that every piece of data fits within one class or another, but never fits in more than one of the classes. Secondly, classes would normally cover equal ranges of values. Our example above doesn’t, but the classes used are designed to match the allocation of grades. So, some freedom in using or not using equal class sizes can improve the use and effectiveness of the diagram.

2.4 Graphical Presentations

Histograms are often used to graphically represent frequency distributions. Histograms are like bar or dot graphs with the intervals on the horizontal axis representing a range of values. Other graphical representations of frequency distributions are frequency polygons, pictograms and pie charts. Refer to Section 2.4 of the textbook.

2.5 Summarizing Two-Variable Data

Lightly read Section 2.5 of the textbook and have an understanding of a scatter diagram.

3.1 Populations and Samples

What is a population in statistical terms? It is the set of data that covers all conceivable possibilities in the area of data under consideration. For example, if there are 60 students in the class and we have the 60 marks, we would say that the set of data constitutes a population. If we pick (usually randomly) a group of student marks from the population, we refer to this smaller number of marks as a sample. For example, we may pick randomly 32 student marks from the population. These 32 marks form a sample. Interestingly, a sample of reasonable size can be used to predict the trends of the population with reasonable accuracy.

3.2 The Mean

One of the most important calculations performed on a population or sample is the “average” or, in statistical terms, the arithmetic mean or simply the mean. The mean of n numbers is their sum divided by n.
For example, to calculate the mean of the 60 student marks, we would add all of the 60 marks together then divide by 60. This gives us the population mean. We can calculate the mean of our sample of 32 student marks in a similar way. We simply add all the 32 marks together and divide by 32, giving us the sample mean. We might be surprised at just how close the value of our sample mean is to the population mean.

Why do we use samples? Consider the case where there may be more than 5,000 students. It could be quite a task analyzing the data and calculating the population mean. We could simply select a random sample of 50 student marks and calculate the sample mean. The resulting sample mean and sample dot diagram can usually predict the population mean and population trends with reasonable accuracy, without all the work.

In mathematical terms, the mean of the sample is represented by:

$$\bar{x} = \frac{\sum x}{n}$$

where $\bar{x}$ = the sample mean, $\sum x$ = the sum of the data in the sample and $n$ = the number of data objects in the sample.

If we consider the mean of the total population,

$$\mu = \frac{\sum x}{N}$$

where $\mu$ = the population mean, $\sum x$ = the sum of the data in the population and $N$ = the number of data objects in the population.

3.3 The Weighted Mean

Skip this section dealing with “Weighted Mean”, “Geometric Mean” and “Harmonic Mean”.

3.4 Median

The median is the value of the middle term when n is odd, and the mean of the middle two terms when n is even. Before the median can be found or calculated, the data must be arranged according to size (i.e. in ascending or descending order). The median of a sample is denoted by $\tilde{x}$. The median of a population is denoted by $\tilde{\mu}$.

The median is the value of the $\frac{n+1}{2}$th term. When n is even, this position falls halfway between the two middle terms, so we take the mean of those two terms.

3.5 Other Fractiles

Skip this section on “Other Fractiles”.

3.6 Mode

The mode is another statistical measure. It is simply the value or category that occurs with the highest frequency.
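
The mean, median and mode can be checked on the assignment 1 marks from section 2.1 with a short Python sketch. The list `marks` is copied from the raw data in these notes; the variable names, the use of Python's standard `statistics` module and the random seed are our own choices for illustration. The sample of 32 marks is drawn at random, so a different seed gives a different sample mean.

```python
import random
import statistics

# Assignment 1 marks for the 60 students (raw data from section 2.1).
marks = [
    5, 2, 9, 5, 0, 8, 7, 4, 9, 10,
    6, 9, 9, 9, 8, 8, 8, 6, 3, 7,
    7, 6, 7, 8, 10, 5, 8, 7, 4, 3,
    7, 1, 6, 7, 3, 8, 9, 1, 5, 4,
    2, 0, 5, 6, 8, 8, 9, 2, 6, 8,
    7, 9, 1, 4, 8, 7, 9, 10, 1, 5,
]

# Population mean: the sum of all marks divided by N = 60.
population_mean = sum(marks) / len(marks)

# Median: statistics.median sorts the data first and, because N is even,
# returns the mean of the two middle terms.
population_median = statistics.median(marks)

# Mode: the mark that occurs with the highest frequency.
population_mode = statistics.mode(marks)

# Sample mean: draw 32 marks at random, without replacement.
# The seed is an arbitrary choice so that the run is repeatable.
rng = random.Random(6)
sample = rng.sample(marks, 32)
sample_mean = sum(sample) / len(sample)

print("Population mean:  ", population_mean)
print("Population median:", population_median)
print("Population mode:  ", population_mode)
print("Sample mean (n=32):", round(sample_mean, 2))
```

For this data the marks sum to 363, so the population mean is 363 / 60 = 6.05. The 30th and 31st terms of the sorted data are both 7, so the median is 7, and the mode is 8, which agrees with the tallest column of the dot diagram.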
3.7 The Description of Grouped Data

Skip section 3.7.

3.8 Technical Note (Summations)

The notation $\sum x$ used above does not tell us how many and which values of x we should add together. We can use a more explicit notation as follows:

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$$

where it is clear we are adding all the terms of x with subscripts i ranging from 1 to n.

In next week’s class, we will come across notation as follows:

$$\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$$

which simply means that we are summing the squares of each term of x with subscripts i ranging from 1 to n.
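
The two summations above can be written out in Python as a small sketch. The four values in `x` are made up for illustration, and the variable names are our own. Note that Python lists are indexed from 0, so the term $x_i$ in the notation corresponds to `x[i - 1]` in the code.

```python
# Hypothetical data values x_1, x_2, x_3, x_4, invented for illustration.
x = [2, 4, 6, 8]
n = len(x)

# Sum of the terms x_i for i = 1 to n.
# Python indexes from 0, so x_i is stored at x[i - 1].
total = sum(x[i - 1] for i in range(1, n + 1))

# Sum of the squares of the terms x_i for i = 1 to n.
total_of_squares = sum(x[i - 1] ** 2 for i in range(1, n + 1))

print(total)             # 2 + 4 + 6 + 8 = 20
print(total_of_squares)  # 4 + 16 + 36 + 64 = 120
```

In everyday code we would simply write `sum(x)`; the explicit index is kept here only to mirror the subscript notation.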