Drawing a histogram using Excel

STEP 1: Examine the data to decide how many class intervals you need and what the class boundaries should be. (In an assignment you may be told what class boundaries to use, in which case you can skip this step.)

To illustrate the process, the Excel file House Sales in Kaleen 2003 will be used. A portion of the file is shown below; the data were obtained from www.allhomes.com.au. The histogram is to be drawn on the variable Price. The file is conveniently ordered from smallest to largest, so it can easily be seen that there are 134 values (there are 136 lines of data but the first two are headings), with a minimum value of $142,500 and a maximum value of $650,000. If your file is not ordered, see "What to do if your data file is not ordered" at the end of this guide.

How many class intervals should there be? A rough estimate is the square root of the number of observations. The square root of 134 is approximately 11.6, so somewhere around 10-12 class intervals would be suitable. (Remember that a histogram is trying to capture the shape of the distribution. With too few classes the shape is lost. With too many classes the shape is hidden by the random fluctuations of the data.) Look at the worksheet Different Numbers of Class Intervals in the Excel file House Sales in Kaleen 2003 for examples of histograms with different numbers of class intervals.

What should the class boundaries be? Where possible the class intervals should be of equal width and the boundaries should be natural values. If we choose the stated lower boundary to be $100,000 and the class width to be $50,000, then there are 12 classes. This seems a fairly natural choice but others are possible.

STEP 2: Decide on the bin values.

The bin column allows you to choose the class intervals for the histogram. If you do not include a bin column, Excel will choose the class intervals. Letting Excel choose is not recommended. For the first bin value, Excel takes all data items less than or equal to this value.
For the next class it takes all the data items above the previous bin value up to and including the next bin value. It continues like this until the last bin value. If there are any data items greater than the last bin value, Excel puts them all into one final class.

Unfortunately, this process does not exactly mirror the process we use when classifying data items into class intervals. We take all values up to but not including the stated upper boundary. In the house price example the stated upper boundary of the first class is $150,000. Excel will take all values up to and including this value, whereas we want it to take all values up to but not including this value. No houses were actually sold for $150,000, so this does not cause a problem for the first class. However, there was a house sold for $300,000 and Excel will put this house in the wrong class interval.

To overcome this problem, the bin values should be given as the true upper class boundaries, not the stated ones. To decide what the true upper class boundaries are, we need to know the level of accuracy to which the house prices have been given. The house at 65, Wakool Cct was sold for $223,333, so we can assume that the house prices have been given to the nearest dollar. The true class boundaries should have one more decimal place than the data, so here they should be given to one decimal place. The stated upper boundary of the first class is $150,000 but the bin value given should be 149,999.5. Excel will then place the data items into the class intervals the way we wish it to.

(I usually include the lower boundary of the first class to force Excel to include a space between the histogram and the vertical axis. This is not necessary but I prefer it. You can choose whether or not to do this according to how you prefer the histogram to appear.)

The table below shows the class intervals and the corresponding bin values for the house price example.
Class                           Bin value
(lower boundary of first class) 99,999.5
$100,000 up to $150,000         149,999.5
$150,000 up to $200,000         199,999.5
$200,000 up to $250,000         249,999.5
$250,000 up to $300,000         299,999.5
$300,000 up to $350,000         349,999.5
$350,000 up to $400,000         399,999.5
$400,000 up to $450,000         449,999.5
$450,000 up to $500,000         499,999.5
$500,000 up to $550,000         549,999.5
$550,000 up to $600,000         599,999.5
$600,000 up to $650,000         649,999.5
$650,000 up to $700,000         699,999.5

STEP 3: Use Excel’s histogram tool.

Type the bin values into the Excel spreadsheet. (I usually copy and paste the data column of interest into a new worksheet but it is not necessary.) Look at the first two columns of the worksheet labeled Histogram in the Excel file House Sales in Kaleen 2003.

Now select Tools from the top bar, then Data Analysis, and then select Histogram from the box that appears, like the one below. Then click on OK. (If Data Analysis does not appear, select Tools, then Add-Ins, and tick Analysis ToolPak. Click OK and now when you select Tools, Data Analysis should be there.)

Having clicked OK, the following dialogue box should appear. I have filled it in for the house price example using the worksheet labeled Histogram. The Input Range and Bin Range can be filled in by using the cursor to highlight the required cells on the Excel spreadsheet. Notice that I have ticked the box marked Labels because I have included the column labels in the selected ranges. I prefer to place the histogram on the same worksheet as the data, so I have selected Output Range and specified where I want the output to go. Some people prefer to have the histogram on a new worksheet. This is the default. Don’t forget to tick Chart Output, otherwise Excel will not draw the histogram.

Once you have filled in the dialogue box and clicked OK, you should get output similar to that shown over the page.

The histogram produced unfortunately leaves a lot to be desired and requires quite a bit of editing before it is acceptable. This is detailed in Step 4.
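The reasoning in Steps 1 and 2 can be checked outside Excel. The following is a minimal Python sketch, not part of the Excel workflow: it computes the square-root estimate, builds the bin values from the true upper class boundaries, and counts values using the "less than or equal to the bin value" rule described in Step 2. The function names are hypothetical, and the rule is implemented as this guide describes it rather than taken from Excel itself.

```python
import math
from bisect import bisect_left


def suggested_class_count(n):
    """Rough estimate from Step 1: square root of the number of observations."""
    return math.sqrt(n)


def true_boundaries(lower, width, classes, precision=1.0):
    """Bin values from Step 2: each stated boundary minus half the data's
    precision (data to the nearest dollar gives boundaries ending in .5).
    The first entry is the lower boundary of the first class."""
    return [lower + i * width - precision / 2 for i in range(classes + 1)]


def count_like_step_2(values, bins):
    """Place each value in the first bin whose bin value is >= the value.
    Values above the last bin value go into one extra final class."""
    counts = [0] * (len(bins) + 1)
    for v in values:
        counts[bisect_left(bins, v)] += 1
    return counts


# House price example: 134 observations, 12 classes of width $50,000
# starting at $100,000.
estimate = suggested_class_count(134)
bins = true_boundaries(100_000, 50_000, 12)

# Stated boundaries used directly as bin values, for comparison.
stated = [100_000 + i * 50_000 for i in range(13)]

# The $300,000 house: index 4 is the class "$250,000 up to $300,000",
# index 5 is the class "$300,000 up to $350,000".
with_stated = count_like_step_2([300_000], stated)
with_true = count_like_step_2([300_000], bins)
```

With the stated boundaries as bin values the $300,000 house is counted at index 4, the class ending at $300,000. With the true boundaries it is counted at index 5, the class starting at $300,000, which is the classification we want.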
STEP 4: Edit Excel’s output to get a suitable histogram.

• Make the histogram bigger. Click inside the histogram chart to make black squares appear around the edges. Hold the cursor over one of the black squares at a corner until the cursor becomes a diagonal black arrow. Hold the left mouse button down and drag the cursor to make the histogram box the size you want.

• Remove the box on the right hand side that says Frequency. Click on the box and then press the Delete key.

• Remove the gaps between the bars. Right click when the cursor is over one of the histogram bars. A new box will appear and you should select Format Data Series. The following dialogue box will appear. Select Options as shown and reduce the Gap width to zero. Click OK.

• Fix the title. Histogram is not a suitable title. You need to have something that describes the data you have. (Don’t forget the Figure number.) Click on the word Histogram so that its border is displayed. Highlight the word Histogram and then type in your desired title. You can move the title by dragging it with the mouse.

• Fix the horizontal axis label. Bin is not a suitable label for the horizontal axis. You must label the axis with the variable that was used to draw the histogram. Don’t forget to include the units.

• Include a source. The primary source of the data should be shown in the bottom left hand corner of the histogram. Click in the histogram box to make the black squares appear. Type what you want for the source. Nothing will appear until you press Enter, at which point a text box containing what you typed appears in the centre of your histogram. Use the cursor to move it to the bottom left hand corner. Change the font size to 8 point.

• Fix the scale on the horizontal axis. Excel is really drawing a bar chart where the width of the bars is of no importance.
The area of each bar of a histogram is proportional to the frequency recorded for the corresponding class interval, so the scale along the horizontal axis of a histogram must be a proper number scale. The easiest way to fix this is to clear out what Excel has put there and type in your own numbers using text boxes. In the house price example I chose to display the stated class boundaries and work in thousands, as I felt that this would make for a clearer histogram.

• You may be asked to put your ID number in the top right hand corner of the histogram. This can also be done using a text box, which should be placed inside Excel’s chart area.

STEP 5: Presenting the histogram.

You can print the histogram as you would any chart, or you can copy and paste it into a Word document. If you paste it into a Word document you may wish to remove the outside border. Right click in the outside area around the histogram. Select Format Chart Area and the following dialogue box will appear. Select None for Border and then click OK.

What to do if your data file is not ordered.

Either

• Order the data file yourself. Using the cursor, highlight all the data, not just the column of interest. Then from the bar across the top select Data and then Sort. For the Kaleen house price example the screen should be as below. In the dialogue box in the middle, select the variable you want sorted. When I selected the data I included the row with the headings. (Notice that Excel has shown that the header row is selected.) If you don’t select the header row, the variable names will not appear, only the Excel column names. Click OK and the data will be sorted.

OR

• Use the descriptive statistics tool to find the number of observations and the minimum and maximum values. Select Tools > Data Analysis > Descriptive Statistics and the following dialogue box should appear. Highlight the column that you want descriptive statistics for and the input range will be filled in.
Tick Summary statistics and decide where you want the output to go. The descriptive statistics for the house price example are in the worksheet Descriptive Statistics in the Excel file House Sales in Kaleen 2003.
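Only three figures from the descriptive statistics are needed for Step 1: the number of observations, the minimum and the maximum. As a minimal sketch, assuming the column has been copied into a Python list, the same three figures can be found without sorting. The function name is hypothetical, and the sample list holds only the four prices quoted in this guide, not the full data set.

```python
def summarise(values):
    """Return the three figures Step 1 needs from an unsorted column."""
    return {
        "count": len(values),
        "minimum": min(values),
        "maximum": max(values),
    }


# Four prices quoted in this guide, deliberately left unsorted.
sample_prices = [300_000, 142_500, 650_000, 223_333]
summary = summarise(sample_prices)
```

For the full Kaleen file the count would be 134, with the minimum and maximum as given in Step 1.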