Data exploration with Microsoft Excel: univariate analysis

Contents
1 Introduction
2 Exploring a variable’s frequency distribution
3 Calculating measures of central tendency
4 Calculating measures of dispersion (spread)
5 Exploring the shape of a variable’s distribution
6 Generating summary statistics
1 Introduction
This guide covers the use of Microsoft Excel (hereafter: Excel) for univariate data
exploration. It shows how techniques discussed in Chapter 13 can be applied in Excel. Please
refer to Chapter 13 for more details on the specific techniques and their interpretation; the
focus here is on how to carry them out in Excel. It covers five topics:
1. Exploring a variable’s frequency distribution
2. Calculating measures of central tendency
3. Calculating measures of dispersion (spread)
4. Exploring the shape of a variable’s distribution
5. Generating summary statistics
The guide is not written for a specific version of Excel although it includes screenshots for
Excel 2010. Most of the functionality referred to in the guide is also available in earlier and
later versions, although the user interface has changed somewhat.
The guide assumes that you have entered your data and prepared it for analysis as described
in the guide Introduction to using Microsoft Excel for quantitative data analysis. It also
assumes that you are familiar with basic Excel functionality, including creating and editing
charts (for information on how to use functions and the Data Analysis ToolPak see
Introduction to using Microsoft Excel for quantitative data analysis).
Management Research: Applying the Principles © 2015 Susan Rose, Nigel Spinks & Ana Isabel Canhoto
2 Exploring a variable’s frequency distribution
As explained in Chapter 13, there are two main ways of exploring a variable’s frequency
distribution: frequency tables and graphical displays.
2.1 Creating frequency tables in Excel using pivot tables
Pivot tables are the most efficient way of creating frequency tables in Excel. They can be
extended to more complex analysis such as contingency tables and can be used as the basis
for graphical outputs. They are widely used in business and management, for example for
financial analysis, so you may already be familiar with how they work. If you are new to
pivot tables, it is worth taking some time to learn how to use them as they are a very flexible
analysis tool which makes them ideally suited for data exploration.
We will demonstrate their use in creating frequency tables for a simple dataset about
customers’ shopping habits (Figure 1) (available on the website as a downloadable file
customer satisfaction.xlsx). One of the nominal variables in the dataset records the store
location where the customer shops (north, central or south). Our aim is to create a simple
frequency table showing how many customers shop in each location and what per cent that
represents of the total. We will also add a cumulative per cent column.
Figure 1 – Customer satisfaction dataset
2.1.1 Creating a table of frequency counts
To create a pivot table, carry out the following steps:
1. Check that the data are ready for analysis. Each column requires a unique header,
there should be no blank rows or columns within the data and, if nominal variables
are left as text, spelling should be consistent.
2. Click on any cell in the dataset.
3. Select Insert > PivotTable > PivotTable to open up the Create PivotTable dialogue
box (see Figure 2). Note: Excel writes PivotTable as a single word.
Figure 2 – PivotTable dialogue box
4. In the dialogue box, select the table or range you wish to analyse. If you placed the
cursor in a cell in the dataset before opening up the dialogue box, the dataset should
automatically be selected. If not, select it manually.
5. Choose where you want the PivotTable to be placed. The default is New Worksheet
which is generally the easier option.
6. When ready, click OK. This will insert an empty PivotTable report and a PivotTable
Field List into a new worksheet (Figure 3).
Figure 3 – PivotTable report and Field List
7. The PivotTable Field List consists of two parts. The upper part, the field section, lists
all the field names you can add to the PivotTable. In this case it is all the variable
names (column headers) in the dataset. The lower part, the layout section, contains the
Report Filter area, the Column Labels area, the Row Labels area and the Values area.
You populate the PivotTable report by dragging and dropping fields from the field
section into the appropriate area in the layout section (Figure 4).
Figure 4 – PivotTable Field List
8. To create a frequency table for the variable Store location, start by dragging and
dropping the field ‘Store location’ from the field section into the Row Labels area.
This creates a set of row labels in the table, one for each value in the Store location
variable and a grand total.
9. Next, drag and drop the field ‘Store location’ from the field section into the Values
area. This immediately creates a new column ‘Count of Store location’ in the
PivotTable report that gives the frequency of occurrence of each category in the
variable, as well as a count of the grand total. (Note: count is the default value setting
when a field contains text, as nominal data often do; for numeric data, the default is
to sum the values in each category.) The resulting table is shown in Figure 5.
Figure 5 – Frequency table of Store location showing counts only (n = 20)
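If you want to cross-check the pivot table counts outside Excel, the same frequency count can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list below is hypothetical (only n = 20 and the North count of 7 come from this guide; the Central/South split is invented), and the function name frequency_table is our own.

```python
from collections import Counter

# Hypothetical store-location values for 20 customers. Only n = 20 and the
# North count of 7 come from the guide; the Central/South split is invented.
store_location = ["North"] * 7 + ["Central"] * 8 + ["South"] * 5


def frequency_table(values):
    """Return (category, count) rows sorted by category, plus a grand total row."""
    counts = Counter(values)
    rows = sorted(counts.items())  # alphabetical row labels
    rows.append(("Grand Total", sum(counts.values())))
    return rows


for category, count in frequency_table(store_location):
    print(f"{category:<12}{count:>5}")
```

The rows are sorted alphabetically to resemble the default ordering of row labels in a pivot table.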
2.1.2 Adding per cent columns to a frequency table
Now that the basic frequency table has been created, the next step is to add a column showing
the per cent of the total represented by each category. To do this:
1. Drag and drop another copy of the ‘Store location’ field into the Values box in the
PivotTable Field List. This will add another column to the pivot table, called Count of
Store location 2 (Figure 6). (Hint: if your PivotTable Field List has disappeared, click
on the PivotTable report and it will reappear.)
Figure 6 – Adding an additional column
2. Click on the down arrow of the new field in the Values area of the PivotTable Field
List. This opens up a new menu. Choose Value Field Settings to open the Value Field
Settings dialogue box (Figure 7).
Figure 7 – Value Field Settings dialogue box
3. The Value Field Settings dialogue box allows you to manipulate the values displayed
in the pivot table for that field:
a. The Summarise Values By tab allows you to determine the value displayed in
each cell (e.g. count, average, etc.)
b. The Show Values As tab gives further options for displaying the data.
4. Select the Show Values As tab. From the Show Values As drop down box, select % of
Column Total.
5. At this point you can change the default name at the column head by typing your
choice of name in the Custom Name field. (Note: you can also change column
headers by typing directly into the PivotTable report.)
6. If you wish to change the format of the numbers (e.g. to set the number of decimal
places), click on the Number Format button to open up Excel’s number format
dialogue box. (Note: you can also change number formats directly in the
PivotTable report using the standard commands available under the Home tab.)
7. Click OK. The resulting frequency table is shown in Figure 8, with the column header
changed to Per cent of total and the number of decimal places set to 0 for the per cent
column.
Figure 8 – Frequency table of Store location showing counts and per cent (n = 20)
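The per cent column is simply each category count divided by the grand total. A minimal Python sketch of that arithmetic is shown here, again using hypothetical data (only n = 20 and the North count of 7 are taken from this guide) and a helper name, percent_of_total, of our own choosing.

```python
from collections import Counter

# Hypothetical data: only n = 20 and North = 7 come from the guide.
store_location = ["North"] * 7 + ["Central"] * 8 + ["South"] * 5


def percent_of_total(values):
    """Map each category to its count as a percentage of all observations."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: 100 * count / total
            for category, count in sorted(counts.items())}


for category, pct in percent_of_total(store_location).items():
    print(f"{category:<10}{pct:>4.0f}%")
```

Formatting with zero decimal places mirrors the number format chosen for the per cent column above.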
2.1.3 Adding a cumulative per cent column to a frequency table
To add a cumulative per cent column to your frequency table, repeat steps 1 to 3 above to
create a new column and open the Value Field Settings dialogue box. In that box, select the
Show Values As tab and, in the Show Values As drop down box, select % Running Total In.
Select ‘Store location’ in the Base field box. As before, you can change the name and set the
number format (Figure 9).
Figure 9 – Value Field Settings for cumulative per cent column
Once complete, the pivot table fields can be renamed and reformatted if required. Figure 10
shows the final result. The finished table can be copied and pasted into a word-processing
package for further editing or be used as the basis for generating graphs (charts).
Figure 10 – Frequency table of Store location showing counts, per cent and cumulative per cent (n = 20)
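A cumulative per cent column is a running total of the counts, expressed as a percentage of the grand total. The sketch below illustrates the calculation in Python; the data are hypothetical (only n = 20 and the North count of 7 come from this guide) and cumulative_percent is an illustrative name, not an Excel feature.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical data: only n = 20 and North = 7 come from the guide.
store_location = ["North"] * 7 + ["Central"] * 8 + ["South"] * 5


def cumulative_percent(values):
    """Return (category, cumulative per cent) rows in alphabetical category order."""
    counts = Counter(values)
    categories = sorted(counts)
    total = sum(counts.values())
    running_totals = accumulate(counts[c] for c in categories)
    return [(c, 100 * r / total) for c, r in zip(categories, running_totals)]


for category, cum_pct in cumulative_percent(store_location):
    print(f"{category:<10}{cum_pct:>4.0f}%")
```

The final row always reaches 100 per cent, which is a quick check that no category has been missed. The running total depends on the row order, so re-sorting the rows changes the intermediate values.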
The contents of a PivotTable report can easily be changed by adding, removing or replacing
fields in the Field List. Similarly, you can change how the values are displayed via the Value
Field Settings dialogue box at any time. The down arrow filter on the Row Labels header
(next to Store location in Figure 10) can be used to sort and filter the rows. In the guide Data
exploration with Excel: analysing more than one variable, we will show how pivot tables can
be used to analyse two or more variables.
2.2 Creating frequency tables in Excel using the COUNTIF function
Another, less flexible, way of creating frequency tables is to use Excel’s COUNTIF function.
An example is shown in Figure 11.
Figure 11 – Frequency table created using COUNTIF function
The cells in the Count column are populated using the COUNTIF function. Choose the
destination cell and then select Formulas > More Functions > Statistical > COUNTIF. In the
Function Argument dialogue box (Figure 12), enter the range of the data and the criteria in
the relevant boxes; these tell Excel where to look and what to count. In this case the word
‘North’ is entered in the Criteria box which gives a count of 7, as expected.
Figure 12 – COUNTIF Function Argument dialogue box
Once the individual cells in the Count column have been populated, use Excel to
calculate the grand total (Hint: use the SUM function) and to add further columns for
per cent and cumulative per cent, as shown in Figure 11.
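The logic of this approach (one count per category, then a sum for the grand total) can be sketched in Python as follows. The countif helper is our own simplified stand-in: it handles only an exact text match, ignoring case as COUNTIF does for text criteria, and does not attempt wildcards or comparison operators. The data are hypothetical apart from n = 20 and the North count of 7.

```python
# Hypothetical data: only n = 20 and North = 7 come from the guide.
store_location = ["North"] * 7 + ["Central"] * 8 + ["South"] * 5


def countif(values, criterion):
    """Count entries equal to a text criterion, ignoring letter case."""
    target = criterion.casefold()
    return sum(1 for value in values if str(value).casefold() == target)


# One count per category, as in the Count column of the frequency table.
categories = ["North", "Central", "South"]
counts = {category: countif(store_location, category) for category in categories}

# Grand total, the equivalent of applying SUM to the Count column.
grand_total = sum(counts.values())

print(counts, grand_total)
```

Because each category has to be typed in by hand, a misspelt or omitted category silently drops out of the table; comparing the grand total with the number of rows in the dataset is a useful check.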
2.3 Graphical techniques for exploring frequency distributions
Excel’s suite of chart (graph) tools can be used to explore frequency distributions visually. If
your data are already in a suitable format, for example if you have pre-existing frequency
tables or you have created frequency tables using the COUNTIF function, you can generate
suitable graphs by going to the Insert tab and selecting an appropriate chart type, such as a bar or pie
chart. Figure 13 shows a pie chart created from the frequency table in Figure 11. It has been
edited using the Chart Tools in Excel to include per cent labels for the slices, a chart legend
and a suitable title (Hint: click on the chart to activate these tools).
Figure 13 – Pie chart created from pre-existing frequency table
(Note: For convenience of presentation and to make it easier to relate the output to the raw
data, we have created the frequency table and the pie chart in the same worksheet as the main
dataset. For large data sets and to avoid overwriting your data, it is usually better to work in a
separate worksheet when creating output of this kind.)
2.3.1 Using pivot charts to display frequency distributions
Pivot charts provide a very useful way of generating graphical displays of frequency
distributions directly from a dataset. They can be generated either from a pivot table that you
have created or directly using the PivotChart command. We will demonstrate the latter using
the customer satisfaction data and the Store location variable.
To create a pivot chart using the PivotChart command, select Insert > PivotChart. This opens
up the Create PivotChart with PivotTable dialogue box. This is similar to the PivotTable
equivalent (Figure 2) so select the data table/range and the location for the output (New
Worksheet is the default). Click OK.
This opens up a blank PivotChart area, along with a blank PivotTable report and Field List
similar to those you have already seen (Figure 14).
Figure 14 – PivotChart area
To populate the chart area, carry out the following steps:
1. In the PivotTable Field List drag and drop a copy of the ‘Store location’ field into the
Axis Fields (Categories) area.
2. Drag and drop a second copy of the ‘Store location’ field into the Values area (as with
the pivot table, this field contains nominal data so the Excel default value setting is
count).
3. A PivotChart in the form of a bar (Excel: column) chart is created along with a
PivotTable report of the data (Figure 15).
4. This chart can now be formatted using the PivotChart tools if needed.
Figure 15 – Bar chart of shopping by store location (n = 20)
If you want to change the type of chart, for example to a pie chart, you can do this via
PivotChart Tools > Design > Change Chart Type > Pie. Select the type of pie chart you want
and click OK. The resulting chart can then be formatted as desired using the PivotChart
Tools. Figure 16 shows a pie chart created in this way after formatting.
Figure 16 – Pie chart of shopping by store location (n = 20)
2.3.2 Creating histograms in Excel
Histograms are a useful way of inspecting the frequency distribution and the shape of the
distribution of metric variables. Excel’s Data Analysis ToolPak contains a function for
generating histograms from your data. To illustrate how this is done, we will create a
histogram for the Satisfaction variable in the customer satisfaction dataset.
1. Select Data > Data Analysis to open the Data Analysis dialogue box. Select
Histogram and click OK. This opens up the Histogram dialogue box (Figure 17).
2. In the Histogram dialogue box enter the desired range in the Input Range box. If you
have included the column header, tick the Labels in first row box. Confirm where you
want the output to go; New Worksheet Ply is the default and is recommended.
3. Select the Chart Output box (leave the others blank). (Note: the histogram function
can also be used to generate Pareto charts and cumulative per cent outputs by
checking the appropriate box but the result will not be a standard histogram.)
Figure 17 – Histogram dialogue box
4. Click OK. The resulting output is shown in Figure 18. It includes both a summary
table and the histogram graph.
Figure 18 – Histogram output
The resulting chart can be edited in Excel as any normal chart. Conventionally there are no
gaps between the bars in histograms. To remove the gaps, right click on a bar and choose
Format Data Series > Series Options and move the Gap Width slider to No Gap. You can also
add an outline to the bar if desired using the Format Data Series tools. A formatted version of
the histogram is shown in Figure 19.
Figure 19 – Formatted histogram
The histogram function automatically selects a bin range for the histogram. In some cases, for
example if the dataset is small, the resulting bin range may not be very informative. It also
groups high values together under the label ‘more’ which makes it harder to spot outliers or
extreme values. Additionally, if you are working with Likert-scale data it is often useful to set
the bin range so that the intervals represent a point on the scale (see footnote 1). You can set your own bin
range intervals as follows:
1. Create a new column in your worksheet (Hint: keep it separate from your main
dataset) showing the bin intervals you want to use. The number entered sets the upper
level of that interval (inclusive). Give the column an appropriate title. If using
Likert-scale data, set the bin range 1, 2…n (n = maximum value of scale). See Figure 20.
Customer satisfaction is measured on a 7-point scale so we have set the bin range 1,
2…7.
Footnote 1: Note we are treating the Likert data as interval data for the purposes of this
illustration (see Chapter 13 for a discussion).
Figure 20 – Creating intervals for a histogram bin range
2. Open up the Histogram dialogue box (Data > Data Analysis > Histogram > OK).
3. In the Histogram dialogue box enter the desired range in the Input Range box.
4. Now select the bin range (i.e. the new column of bin range intervals that you have
created).
5. If you have included the column header, tick the Labels in first row box (Note: both
the data to be analysed and the bin range must have headers). Confirm where you
want the output to go; New Worksheet Ply is the default and is recommended. Select
the Chart Output box (see Figure 21).
Figure 21 – Histogram dialogue box with bin range specified
6. Click OK.
The resulting output is shown in Figure 22. It can now be formatted as required.
Figure 22 – Histogram with specified bin range
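To see how a user-defined bin range translates into counts, the following Python sketch applies the rule described above: each bin value is the inclusive upper level of its interval, and anything above the last bin is collected under ‘More’. The satisfaction scores are hypothetical (they are not the values in the customer satisfaction dataset) and histogram_counts is an illustrative helper of our own.

```python
from bisect import bisect_left

# Hypothetical 7-point satisfaction scores for 20 customers (not the guide's data).
satisfaction = [5, 6, 4, 7, 5, 3, 6, 5, 2, 6, 7, 5, 4, 6, 5, 3, 6, 7, 5, 4]


def histogram_counts(values, bins):
    """Count values per bin, treating each bin value as an inclusive upper bound.

    Values greater than the largest bin are counted under the label 'More'.
    """
    bins = sorted(bins)
    counts = [0] * (len(bins) + 1)  # the final slot is the 'More' bucket
    for value in values:
        # Index of the first bin whose upper bound is >= value.
        counts[bisect_left(bins, value)] += 1
    labels = [str(b) for b in bins] + ["More"]
    return list(zip(labels, counts))


for label, count in histogram_counts(satisfaction, range(1, 8)):
    print(f"{label:<6}{count:>3}")
```

With the bin range set to 1, 2…7, every scale point gets its own bar and the ‘More’ row stays at zero, which makes any out-of-range value easy to spot.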
3 Calculating measures of central tendency
Measures of central tendency introduced in Chapter 13 can be calculated using Excel’s
statistical functions (select Formulas > More Functions > Statistical and choose the required
function to open the relevant Function Argument dialogue box). These are shown in Table 1.
See Introduction to using Microsoft Excel for quantitative data analysis (Appendix A) for
more details on how to select and use functions.
Table 1 – Measures of central tendency in Excel’s statistical functions
Function name    Description
AVERAGE          Returns the arithmetic mean (average) of the given numbers
MEDIAN           Returns the median of the given numbers
MODE.SNGL        Returns the mode of the given numbers
These can also be calculated using the Descriptive Statistics function in the Data Analysis
ToolPak (see below).
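As a way of checking Excel’s results independently, the three measures can also be computed with Python’s standard statistics module. This is a sketch only: the satisfaction scores are hypothetical, and the comments pair each call with the Excel function it is intended to mirror.

```python
import statistics

# Hypothetical 7-point satisfaction scores for 20 customers (not the guide's data).
satisfaction = [5, 6, 4, 7, 5, 3, 6, 5, 2, 6, 7, 5, 4, 6, 5, 3, 6, 7, 5, 4]

mean_value = statistics.mean(satisfaction)      # counterpart of AVERAGE
median_value = statistics.median(satisfaction)  # counterpart of MEDIAN
mode_value = statistics.mode(satisfaction)      # counterpart of MODE.SNGL

print(f"mean = {mean_value:.2f}, median = {median_value}, mode = {mode_value}")
```

The example data have a single most frequent value. If several values tie for the mode, different tools may report different ones, so check the frequency table before relying on a single figure.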
4 Calculating measures of dispersion (spread)
Measures of dispersion (spread) introduced in Chapter 13 can be calculated using Excel’s
statistical functions (select Formulas > More Functions > Statistical and choose the required
function to open the relevant Function Argument dialogue box). These are shown in Table 2;
note that if you are using sample data, you should use STDEV.S and VAR.S for calculating
the standard deviation and variance of your sample. See Introduction to using Microsoft
Excel for quantitative data analysis (Appendix A) for more details on how to select and use
functions.
Table 2 – Measures of dispersion in Excel’s statistical functions
Function name    Description
MAX              Returns the maximum value of the given numbers
MIN              Returns the minimum value of the given numbers
STDEV.P          Returns the standard deviation of the given numbers, based on the population
STDEV.S          Returns the standard deviation of the given numbers, based on a sample
VAR.P            Returns the variance of the given numbers, based on the population
VAR.S            Returns the variance of the given numbers, based on a sample
These can also be calculated using the Descriptive Statistics function in the Data Analysis
ToolPak (see below).
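The difference between the population and sample versions is the divisor: n for the population functions and n − 1 for the sample functions. The Python sketch below illustrates this with a small hypothetical set of numbers, using the standard statistics module; the comments pair each call with the Excel function it is intended to mirror.

```python
import statistics

# Small hypothetical dataset chosen so that the population results are round numbers.
data = [2, 4, 4, 4, 5, 5, 7, 9]

maximum = max(data)                               # counterpart of MAX
minimum = min(data)                               # counterpart of MIN
population_sd = statistics.pstdev(data)           # counterpart of STDEV.P (divides by n)
sample_sd = statistics.stdev(data)                # counterpart of STDEV.S (divides by n - 1)
population_variance = statistics.pvariance(data)  # counterpart of VAR.P
sample_variance = statistics.variance(data)       # counterpart of VAR.S

print(f"range: {minimum} to {maximum}")
print(f"population: variance = {population_variance}, sd = {population_sd}")
print(f"sample:     variance = {sample_variance:.3f}, sd = {sample_sd:.3f}")
```

The sample figures are always slightly larger than the population figures for the same data, and the gap narrows as n grows.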
5 Exploring the shape of a variable’s distribution
Excel’s Histogram function in the Data Analysis ToolPak described above (select Data >
Data Analysis > Histogram > OK) can be used to generate a histogram for visual evaluation
of the shape of a metric variable’s distribution.
Excel’s statistical functions include functions for calculating skewness and kurtosis (Table 3).
See Introduction to using Microsoft Excel for quantitative data analysis (Appendix A) for more
details on how to select and use functions.
Table 3 – Measures of distribution shape in Excel’s statistical functions
Function name    Description
KURT             Returns the kurtosis of a dataset
SKEW             Returns the skewness of a dataset
These can also be calculated using the Descriptive Statistics function in the Data Analysis
ToolPak (see below).
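Different software packages use slightly different formulas for skewness and kurtosis, so results may not match exactly across tools. The Python sketch below implements the sample-adjusted formulas that, as we understand Microsoft’s documentation, SKEW and KURT use (KURT reports excess kurtosis, so a normal distribution scores about 0). The function names excel_skew and excel_kurt are our own, and the test values are hypothetical.

```python
import statistics


def excel_skew(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


def excel_kurt(values):
    """Sample excess kurtosis using the n, n-1, n-2, n-3 adjustment terms."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 values")
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    fourth_powers = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth_powers - correction


symmetric = [1, 2, 3, 4, 5]       # hypothetical, perfectly symmetric
right_skewed = [1, 2, 3, 4, 10]   # hypothetical, one high value

print(excel_skew(symmetric), excel_kurt(symmetric))
print(excel_skew(right_skewed))
```

A symmetric dataset gives a skewness of zero, while a long right-hand tail gives a positive value. Both functions fail with a division by zero if every value is identical, because the standard deviation is then zero.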
6 Generating summary statistics
Excel’s Descriptive Statistics routine in the Data Analysis ToolPak provides a quick way of
generating summary statistics for metric variables that includes measures of central tendency,
dispersion and skewness/kurtosis. To calculate descriptive statistics (for convenience this is
repeated from Appendix B to Introduction to using Microsoft Excel for quantitative data
analysis):
• Select Data > Data Analysis to open the Data Analysis menu dialogue box (Figure
23).
Figure 23 – Data Analysis menu dialogue box
• Select the desired function, in this case Descriptive Statistics, which opens the
relevant dialogue box (Figure 24).
• In the dialogue box, enter the desired range in the Input Range box. If you have
included the column header, select the Labels in first row box. Confirm where you
want the output to go. The default setting is New Worksheet Ply which creates a new
worksheet for the output; since most ToolPak outputs are quite large, this is a sensible
option.
• Select Summary Statistics to get descriptive statistics for your chosen data; you can
also select an appropriate confidence interval for the mean if desired (the default is
95%).
Figure 24 – Descriptive Statistics dialogue box
• Click OK. The output will be shown in a new worksheet (Figure 25). Note that here
the column widths have been adjusted to make it easier to read.
Figure 25 – Descriptive Statistics output for variable Age
Note also that this output is not dynamically linked to the original dataset so changes to the
dataset will not automatically be updated in the output. You will need to run a new analysis.
Once created, the output can be copied and pasted into word-processing software for further
editing.
(Hint: if using the Descriptive Statistics function, you can select multiple adjacent metric
variables and the function will report the output for each one.)
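For comparison, a summary of this kind can be assembled in Python from the same building blocks. The sketch below is illustrative only: the row names follow what we understand the typical Descriptive Statistics output to contain, the rows for mode, kurtosis, skewness and the confidence level are left out for brevity, the ages are hypothetical, and describe is a helper name of our own.

```python
import math
import statistics


def describe(values):
    """Return a dictionary of summary statistics for one metric variable."""
    n = len(values)
    sample_sd = statistics.stdev(values)  # sample standard deviation
    return {
        "Mean": statistics.mean(values),
        "Standard Error": sample_sd / math.sqrt(n),  # standard error of the mean
        "Median": statistics.median(values),
        "Standard Deviation": sample_sd,
        "Sample Variance": statistics.variance(values),
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": n,
    }


# Hypothetical ages (not the values behind Figure 25).
age = [23, 35, 41, 29, 52, 38, 47, 31]

for name, value in describe(age).items():
    print(f"{name:<20}{value:>10.3f}")
```

Unlike the ToolPak output, a function like this is recalculated every time it is run, so it always reflects the current data.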