Drawing a histogram using Excel

STEP 1: Examine the data to decide how many class intervals you need and what the
class boundaries should be. (In an assignment you may be told what class boundaries
to use and so you can skip this step.)
To illustrate the process, the Excel file House Sales in Kaleen 2003 will be used. A
portion of the file is shown below; the data were obtained from www.allhomes.com.au.
The histogram is to be drawn on the variable Price. The file is conveniently ordered
from smallest to largest and so it can easily be seen that there are 134 values (there are
136 lines of data but the first two are headings) with the minimum value $142,500 and
the maximum value $650,000. If your file is not ordered, see What to do if your data
file is not ordered, near the end of this document, for instructions.
How many class intervals should there be?
A rough estimate is the square root of the number of observations. The square root of
134 is approximately 11.6 and so somewhere around 10-12 class intervals would be suitable.
(Remember that a histogram is trying to capture the shape of the distribution. Too few
classes and the shape is lost. Too many classes and the shape is lost by the random
fluctuations of the data.) Look at the Excel worksheet Different Numbers of Class
Intervals in the Excel file House Sales in Kaleen 2003 for examples of histograms
with different numbers of class intervals.
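For readers who want to check the square-root rule of thumb outside Excel, it can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the variable names are assumptions and are not part of the Excel file.

```python
import math

# Number of house prices in the Kaleen example (from the text above).
n_observations = 134

# Rule of thumb: roughly sqrt(n) class intervals.
estimate = math.sqrt(n_observations)

print(round(estimate, 1))  # 11.6
```

The result is only a starting point; the final number of classes still depends on choosing natural boundaries.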
What should the class boundaries be?
Where possible the class boundaries should be of equal width and should be natural. If
we choose the stated lower boundary to be $100,000 and the class width to be $50,000
then there are 12 classes. This seems a fairly natural choice but others are possible.
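The count of 12 classes can be checked with a short Python sketch. It assumes the lower boundary, class width and maximum price quoted above; the names `lower_boundary`, `class_width` and `max_price` are illustrative only.

```python
lower_boundary = 100_000  # stated lower boundary of the first class
class_width = 50_000      # equal class width
max_price = 650_000       # largest price in the data

# Each class runs from its lower bound up to (but not including) its upper
# bound, so the maximum needs a class that starts at or below it.
n_classes = (max_price - lower_boundary) // class_width + 1

# Stated (lower, upper) boundaries of every class.
classes = [
    (lower_boundary + i * class_width, lower_boundary + (i + 1) * class_width)
    for i in range(n_classes)
]

print(n_classes)    # 12
print(classes[0])   # (100000, 150000)
print(classes[-1])  # (650000, 700000)
```

Note that the maximum of $650,000 sits exactly on a boundary, which is why a final class of $650,000 up to $700,000 is needed.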
STEP 2: Decide on the bin values.
The bin column allows you to choose the class intervals for the histogram. If you do
not include a bin column Excel will choose the class intervals. It is not recommended
to let Excel choose.
For the first bin value Excel counts all data items less than or equal to this value.
For each subsequent class it counts all the data items greater than the previous bin
value, up to and including the next bin value. This continues until the last bin value.
If there are any data items greater than the last bin value, Excel puts them all into a
final class.
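The counting rule just described can be mimicked in Python as a rough sketch. This is not Excel's own code; the function name `excel_style_counts` is hypothetical, and the sample uses only a few prices quoted in this document with a shortened list of bin values.

```python
from bisect import bisect_left

def excel_style_counts(values, bin_values):
    """Count values per bin following the rule described in the text.

    Each bin takes values greater than the previous bin value and up to and
    including its own bin value. Values above the last bin value go into one
    extra final class. bin_values must be sorted in increasing order.
    """
    counts = [0] * (len(bin_values) + 1)
    for v in values:
        # bisect_left gives the index of the first bin value >= v,
        # or len(bin_values) if v is larger than every bin value.
        counts[bisect_left(bin_values, v)] += 1
    return counts

# A few prices mentioned in this document (not the full data set).
prices = [142_500, 223_333, 300_000, 650_000]
bins = [150_000, 200_000, 250_000, 300_000]  # shortened, for illustration

print(excel_style_counts(prices, bins))  # [1, 0, 1, 1, 1]
```

The last count is the final class holding everything above the last bin value.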
Unfortunately, this process does not exactly mirror the process we use when
classifying data items into class intervals. We take all values up to but not including
the stated upper bound.
In the house price example the stated upper bound of the first class is $150,000. Excel
will take all values up to and including this value whereas we want it to take all values
up to but not including this value. No houses were actually sold for $150,000 and so
this may not be a problem. However, there was a house sold for $300,000 and Excel
will put this house in the wrong class interval.
To overcome this problem the bin values should be given as the true upper class
boundaries not the stated ones. To decide what the true upper class boundaries are we
need to know the level of accuracy to which the house prices have been given. The
house at 65, Wakool Cct was sold for $223,333 and so we can assume that the house
prices have been given to the nearest dollar. The true class boundaries should have
one more decimal place than the data and so the true class boundaries should be to
one decimal place.
The stated upper boundary of the first class is $150,000 but the bin value given should
be 149,999.5. Then Excel will place the data items into the class intervals the way we
wish it to.
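The effect of using true rather than stated upper boundaries can be seen for the $300,000 house in a small Python sketch. It assumes the "up to and including" rule described earlier and uses only the first five classes; the variable names are illustrative.

```python
from bisect import bisect_left

# Stated upper boundaries of the first five classes.
stated_upper = [150_000, 200_000, 250_000, 300_000, 350_000]

# True upper boundaries: half a dollar below, since prices are whole dollars.
true_upper = [b - 0.5 for b in stated_upper]

price = 300_000

# Index of the class the price lands in under an "up to and including" rule
# (0 is "$100,000 up to $150,000", 3 is "$250,000 up to $300,000", and so on).
class_with_stated = bisect_left(stated_upper, price)
class_with_true = bisect_left(true_upper, price)

print(class_with_stated)  # 3, the wrong class
print(class_with_true)    # 4, "$300,000 up to $350,000" as intended
```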
(I usually include the lower boundary of the first class to force Excel to include a
space between the histogram and the vertical axis. This is not necessary but I prefer it.
You can choose whether or not to do this by deciding how you prefer the histogram to
appear.)
The table below shows the class intervals and the corresponding bin values for the
house price example.
Class                              Bin value
(lower boundary of first class)     99,999.5
$100,000 up to $150,000            149,999.5
$150,000 up to $200,000            199,999.5
$200,000 up to $250,000            249,999.5
$250,000 up to $300,000            299,999.5
$300,000 up to $350,000            349,999.5
$350,000 up to $400,000            399,999.5
$400,000 up to $450,000            449,999.5
$450,000 up to $500,000            499,999.5
$500,000 up to $550,000            549,999.5
$550,000 up to $600,000            599,999.5
$600,000 up to $650,000            649,999.5
$650,000 up to $700,000            699,999.5
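Rather than typing the bin values by hand, they could be generated. The following Python sketch assumes the lower boundary, class width and number of classes chosen in Step 1 and includes the optional lower boundary of the first class; the names are illustrative.

```python
lower_boundary = 100_000
class_width = 50_000
n_classes = 12

# One bin value for the lower boundary of the first class, then one for the
# true upper boundary of each class (stated boundary minus 0.5).
bin_values = [lower_boundary + i * class_width - 0.5 for i in range(n_classes + 1)]

for b in bin_values:
    print(f"{b:,.1f}")  # first line printed: 99,999.5; last line: 699,999.5
```

Drop the first element if you prefer not to include the lower boundary of the first class.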
STEP 3: Use Excel’s histogram tool.
Type the bin values into the Excel spreadsheet. (I usually copy and paste the data
column of interest into a new worksheet but it is not necessary.) Look at the first two
columns of the worksheet labeled Histogram in the Excel file House Sales in Kaleen
2003.
Now select Tools from the top bar and then Data Analysis and then select Histogram
from the box that appears like the one below. Then click on OK. (If Data Analysis
does not appear then select Add-Ins and select the Analysis ToolPak. Click OK and
now when you select Tools, Data Analysis should be there.)
Having clicked OK, the following dialogue box should appear. I have filled it in for
the house price example using the worksheet labeled Histogram. The Input Range and
Bin Range can be filled in by using the cursor to highlight the required cells on the
Excel spreadsheet. Notice that I have ticked the box marked Labels because I have
included the labels in the data items. I prefer to place the histogram on the same
worksheet as the data and so I have selected Output Range and specified where I want
the output to go. Some people prefer to have the histogram on a new worksheet. This
is the default. Don’t forget to tick Chart Output; otherwise Excel will not draw the
histogram.
Once you have filled in the dialogue box and clicked OK you should get output
similar to that shown over the page.
The histogram produced unfortunately leaves a lot to be desired and requires quite a
bit of editing before it is acceptable. This is detailed in Step 4.
STEP 4: Edit Excel’s output to get a suitable histogram.
• Make the histogram bigger.
Click inside the histogram chart to make black squares appear around the edges.
Hold the cursor over one of the black squares at a corner until the cursor becomes
a diagonal black arrow. Hold the left mouse button down and drag the cursor to
make the histogram box the size you want.
• Remove the box on the right hand side that says Frequency.
Click on the box and then press the Delete button.
• Remove the gaps between the bars.
Right click the mouse when the cursor is over one of the histogram bars. A new
box will appear and you should select Format Data Series. The following
dialogue box will appear. Select Options as shown and reduce the Gap width to
zero.
Click OK.
• Fix the Title.
Histogram is not a suitable title. You need to have something that describes the
data you have. (Don’t forget the Figure number.) Click on the word Histogram so
that its border is displayed. Highlight the word Histogram and then type in your
desired title. You can move the position of the title by using the cursor and
dragging it with the mouse.
• Fix the horizontal axis label.
Bin is not a suitable label for the horizontal axis. You must label the axis with the
variable that was used to draw the histogram. Don’t forget to include the units.
• Include a source.
The primary source of the data should be shown in the bottom left hand corner of
the histogram. Click in the histogram box to make the black squares appear. Type
what you want for the source. Nothing will appear until you press Enter and a text
box appears in the centre of your histogram containing what you typed. Use the
cursor to move it to the bottom left hand corner. Change the font size to 8 point.
• Fix the scale on the horizontal axis.
Excel is really drawing a bar chart where the width of the bars is of no importance.
The area of the bar of a histogram is proportional to the frequency recorded for the
corresponding class interval and so the scale along the horizontal axis of a
histogram must be a proper number scale. The easiest way to fix this is to clear out
what Excel has put there and type in your own numbers using text boxes. In the
house price example I chose to display the stated class boundaries and work in
thousands as I felt that this would make for a clearer histogram.
• You may be asked to put your ID number in the top right hand corner of the
histogram. This can also be done using a text box and should be placed inside
Excel’s chart area.
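For the horizontal axis scale discussed above, the numbers to type into the text boxes are simply the stated class boundaries divided by 1,000. A minimal Python sketch, assuming the boundaries chosen in Step 1, lists them:

```python
# Stated class boundaries from $100,000 to $700,000 in steps of $50,000,
# expressed in thousands of dollars for the axis labels.
axis_labels = [str(b // 1000) for b in range(100_000, 700_001, 50_000)]

print(axis_labels[0], axis_labels[-1], len(axis_labels))  # 100 700 13
```

If you work in thousands like this, remember to say so in the axis label, for example by including the units as thousands of dollars.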
STEP 5: Presenting the histogram.
You can print the histogram as you would for any chart or you can copy and paste it
into a Word document. If you do this you may wish to remove the outside border.
Right click in the outside area around the histogram. Select Format Chart Area and
the following dialogue box will appear. Select None for Border and then click OK.
What to do if your data file is not ordered.
Either
• Order the data file yourself.
Using the cursor, highlight all the data, not just the column of interest. Then from
the bar across the top select Data and then Sort. For the Kaleen house price
example the screen should be as below.
For the dialogue box in the middle select the variable you want sorted. When I
selected the data I included the row with the headings. (Notice that Excel has
shown that the header row is selected.) If you don’t select the header row, the
variable names will not appear, just the Excel columns. Click OK and the data will
be sorted.
OR
• Use the descriptive statistics tool to find the number of observations, the
minimum and maximum values.
Select Tools > Data Analysis > Descriptive Statistics and the following dialogue
box should appear. Highlight the column that you want descriptive statistics for
and the input range should be filled in. Tick Summary statistics and decide
where you want the output to go. The descriptive statistics are in a worksheet of
the same name in the Excel file House Sales in Kaleen 2003.
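The same three quantities (number of observations, minimum and maximum) can also be found without sorting, as this Python sketch shows. The list below is only an assumed sample made up of a few prices quoted in this document, deliberately left unordered; it is not the full data set.

```python
# A few prices quoted in this document, in no particular order.
prices = [300_000, 142_500, 650_000, 223_333]

n_observations = len(prices)
minimum = min(prices)
maximum = max(prices)

print(n_observations, minimum, maximum)  # 4 142500 650000

# Sorting, the first option described above, gives the same minimum and
# maximum as the first and last items.
ordered = sorted(prices)
print(ordered[0], ordered[-1])  # 142500 650000
```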