Data exploration with Microsoft Excel: univariate analysis

Contents
1 Introduction
2 Exploring a variable’s frequency distribution
3 Calculating measures of central tendency
4 Calculating measures of dispersion (spread)
5 Exploring the shape of a variable’s distribution
6 Generating summary statistics

1 Introduction

This guide covers the use of Microsoft Excel (hereafter: Excel) for univariate data exploration. It shows how the techniques discussed in Chapter 13 can be applied in Excel. Please refer to Chapter 13 for more details on the specific techniques and their interpretation; the focus here is on how to carry them out in Excel. It covers five topics:

1. Exploring a variable’s frequency distribution
2. Calculating measures of central tendency
3. Calculating measures of dispersion (spread)
4. Exploring the shape of a variable’s distribution
5. Generating summary statistics

The guide is not written for a specific version of Excel, although it includes screenshots from Excel 2010. Most of the functionality referred to in the guide is also available in earlier and later versions, although the user interface has changed somewhat. The guide assumes that you have entered your data and prepared it for analysis as described in the guide Introduction to using Microsoft Excel for quantitative data analysis.
It also assumes that you are familiar with basic Excel functionality, including creating and editing charts (for information on how to use functions and the Data Analysis ToolPak, see Introduction to using Microsoft Excel for quantitative data analysis).

Management Research: Applying the Principles © 2015 Susan Rose, Nigel Spinks & Ana Isabel Canhoto

2 Exploring a variable’s frequency distribution

As explained in Chapter 13, there are two main ways of exploring a variable’s frequency distribution: frequency tables and graphical displays.

2.1 Creating frequency tables in Excel using pivot tables

Pivot tables are the most efficient way of creating frequency tables in Excel. They can be extended to more complex analysis such as contingency tables and can be used as the basis for graphical outputs. They are widely used in business and management, for example for financial analysis, so you may already be familiar with how they work. If you are new to pivot tables, it is worth taking some time to learn how to use them: they are a very flexible analysis tool, which makes them ideally suited to data exploration.

We will demonstrate their use in creating frequency tables for a simple dataset about customers’ shopping habits (Figure 1) (available on the website as a downloadable file customer satisfaction.xlsx). One of the nominal variables in the dataset records the store location where the customer shops (north, central or south). Our aim is to create a simple frequency table showing how many customers shop in each location and what per cent of the total that represents. We will also add a cumulative per cent column.

Figure 1 – Customer satisfaction dataset

2.1.1 Creating a table of frequency counts

To create a pivot table, carry out the following steps:

1. Check that the data are ready for analysis.
Each column requires a unique header, there should be no blank rows or columns and, if nominal variables are left as text, spelling should be consistent.
2. Click on any cell in the dataset.
3. Select Insert > PivotTable > PivotTable to open up the Create PivotTable dialogue box (see Figure 2). Note: Excel writes PivotTable as a single word.

Figure 2 – PivotTable dialogue box

4. In the dialogue box, select the table or range you wish to analyse. If you placed the cursor in a cell in the dataset before opening up the dialogue box, the dataset should automatically be selected. If not, select it manually.
5. Choose where you want the PivotTable to be placed. The default is New Worksheet, which is generally the easier option.
6. When ready, click OK. This will insert an empty PivotTable report and a PivotTable Field List into a new worksheet (Figure 3).

Figure 3 – PivotTable report and Field List

7. The PivotTable Field List consists of two parts. The upper part, the field section, lists all the field names you can add to the PivotTable. In this case it is all the variable names (column headers) in the dataset. The lower part, the layout section, contains the Report Filter area, the Column Labels area, the Row Labels area and the Values area. You populate the PivotTable report by dragging and dropping fields from the field section into the appropriate area in the layout section (Figure 4).

Figure 4 – PivotTable Field List

8. To create a frequency table for the variable Store location, start by dragging and dropping the field ‘Store location’ from the field section into the Row Labels area. This creates a set of row labels in the table, one for each value in the Store location variable and a grand total.
9.
Next, drag and drop the field ‘Store location’ from the field section into the Values area. This immediately creates a new column ‘Count of Store location’ in the PivotTable report that gives the frequency of occurrence of each category in the variable, as well as a count of the grand total. (Note: count is the default value setting for nominal data in an Excel pivot table; for metric data, the default is to sum the values in each category.) The resulting table is shown in Figure 5.

Figure 5 – Frequency table of Store location showing counts only (n = 20)

2.1.2 Adding per cent columns to a frequency table

Now that the basic frequency table has been created, the next step is to add a column showing the per cent of the total represented by each category. To do this:

1. Drag and drop another copy of the ‘Store location’ field into the Values area in the PivotTable Field List. This will add another column to the pivot table, called Count of Store location 2 (Figure 6). (Hint: if your PivotTable Field List has disappeared, click on the PivotTable report and it will reappear.)

Figure 6 – Adding an additional column

2. Click on the down arrow of the new field in the Values area of the PivotTable Field List. This opens up a new menu. Choose Value Field Settings to open the Value Field Settings dialogue box (Figure 7).

Figure 7 – Value Field Settings dialogue box

3. The Value Field Settings dialogue box allows you to manipulate the values displayed in the pivot table for that field:
   a. The Summarise Values By tab allows you to determine the value displayed in each cell (e.g. count, average, etc.).
   b. The Show Values As tab gives further options for displaying the data.
4. Select the Show Values As tab. From the Show Values As drop down box, select % of Column Total.
5.
At this point you can change the default name at the column head by typing your choice of name in the Custom Name field. (Note: you can also change column headers by typing directly into the PivotTable report.)
6. If desired, click on the Number Format button to open up Excel’s number format dialogue box and change the format of the numbers (e.g. to set the number of decimal places). (Note: you can also change number formats directly in the PivotTable report using the standard commands available under the Home tab.)
7. Click OK. The resulting frequency table is shown in Figure 8, with the column header changed to Per cent of total and the number of decimal places set to 0 for the per cent column.

Figure 8 – Frequency table of Store location showing counts and per cent (n = 20)

2.1.3 Adding a cumulative per cent column to a frequency table

To add a cumulative per cent column to your frequency table, repeat steps 1 to 3 above to create a new column and open the Value Field Settings dialogue box. In that box, select the Show Values As tab and in the Show Values As drop down box select % Running Total In. Select ‘Store location’ in the Base field box. As before, you can change the name and set the number format (Figure 9).

Figure 9 – Value Field Settings for cumulative per cent column

Once complete, the pivot table fields can be renamed and reformatted if required. Figure 10 shows the final result. The finished table can be copied and pasted into a word-processing package for further editing or be used as the basis for generating graphs (charts).

Figure 10 – Frequency table of Store location showing counts, per cent and cumulative per cent (n = 20)

The contents of a PivotTable report can easily be changed by adding, removing or replacing fields in the Field List.
Similarly, you can change how the values are displayed via the Value Field Settings dialogue box at any time. The down arrow filter on the Row Labels header (next to Store location in Figure 10) can be used to sort and filter the rows. In the guide Data exploration with Excel: analysing more than one variable, we will show how pivot tables can be used to analyse two or more variables.

2.2 Creating frequency tables in Excel using the COUNTIF function

Another, less flexible, way of creating frequency tables is to use Excel’s COUNTIF function. An example is shown in Figure 11.

Figure 11 – Frequency table created using COUNTIF function

The cells in the Count column are populated using the COUNTIF function. Choose the destination cell and then select Formulas > More Functions > Statistical > COUNTIF. In the Function Argument dialogue box (Figure 12), enter the range of the data and the criteria in the relevant boxes; these tell Excel where to look and what to count. In this case the word ‘North’ is entered in the Criteria box, which gives a count of 7, as expected.

Figure 12 – COUNTIF Function Argument dialogue box

Once the individual cells in the Count column have been populated, Excel can be used to calculate the grand total (Hint: use the SUM function), and additional columns can be calculated for per cent and cumulative per cent as shown in Figure 11.

2.3 Graphical techniques for exploring frequency distributions

Excel’s suite of chart (graph) tools can be used to explore frequency distributions visually. If your data are already in a suitable format, for example if you have pre-existing frequency tables or you have created frequency tables using the COUNTIF function, you can generate suitable graphs via the Insert tab by selecting an appropriate chart type, such as a bar or pie chart. Figure 13 shows a pie chart created from the frequency table in Figure 11.
It has been edited using the Chart Tools in Excel to include per cent labels for the slices, a chart legend and a suitable title (Hint: click on the chart to activate these tools).

Figure 13 – Pie chart created from pre-existing frequency table

(Note: For convenience of presentation and to make it easier to relate the output to the raw data, we have created the frequency table and the pie chart in the same worksheet as the main dataset. For large data sets and to avoid overwriting your data, it is usually better to work in a separate worksheet when creating output of this kind.)

2.3.1 Using pivot charts to display frequency distributions

Pivot charts provide a very useful way of generating graphical displays of frequency distributions directly from a dataset. They can be generated either from a pivot table that you have created or directly using the PivotChart command. We will demonstrate the latter using the customer satisfaction data and the Store location variable.

To create a pivot chart using the PivotChart command, select Insert > PivotChart. This opens up the Create PivotChart with PivotTable dialogue box. This is similar to the PivotTable equivalent (Figure 2), so select the data table/range and the location for the output (New Worksheet is the default). Click OK. This opens up a blank PivotChart area, along with a blank PivotTable report and Field List similar to those you have already seen (Figure 14).

Figure 14 – PivotChart area

To populate the chart area, carry out the following steps:

1. In the PivotTable Field List drag and drop a copy of the ‘Store location’ field into the Axis Fields (Categories) area.
2.
Drag and drop a second copy of the ‘Store location’ field into the Values area (as with the pivot table, this field contains nominal data so the Excel default value setting is count).
3. A PivotChart in the form of a bar (Excel: column) chart is created along with a PivotTable report of the data (Figure 15).
4. This chart can now be formatted using the PivotChart tools if needed.

Figure 15 – Bar chart of shopping by store location (n = 20)

If you want to change the type of chart, for example to a pie chart, you can do this via PivotChart Tools > Design > Change Chart Type > Pie. Select the type of pie chart you want and click OK. The resulting chart can then be formatted as desired using the PivotChart Tools. Figure 16 shows a pie chart created in this way after formatting.

Figure 16 – Pie chart of shopping by store location (n = 20)

2.3.2 Creating histograms in Excel

Histograms are a useful way of inspecting the frequency distribution and the shape of the distribution of metric variables. Excel’s Data Analysis ToolPak contains a function for generating histograms from your data. To illustrate how this is done, we will create a histogram for the Satisfaction variable in the customer satisfaction dataset.

1. Select Data > Data Analysis to open the Data Analysis dialogue box. Select Histogram and click OK. This opens up the Histogram dialogue box (Figure 17).
2. In the Histogram dialogue box enter the desired range in the Input Range box. If you have included the column header, tick the Labels in first row box. Confirm where you want the output to go; New Worksheet Ply is the default and is recommended.
3. Select the Chart Output box (leave the others blank). (Note: the histogram function can also be used to generate Pareto charts and cumulative per cent outputs by checking the appropriate box but the result will not be a standard histogram.)

Figure 17 – Histogram dialogue box

4. Click OK. The resulting output is shown in Figure 18. It includes both a summary table and the histogram graph.

Figure 18 – Histogram output

The resulting chart can be edited in Excel like any normal chart. Conventionally there are no gaps between the bars in histograms. To remove the gaps, right click on a bar and choose Format Data Series > Series Options and move the Gap Width slider to No Gap. You can also add an outline to the bars if desired using the Format Data Series tools. A formatted version of the histogram is shown in Figure 19.

Figure 19 – Formatted histogram

The histogram function automatically selects a bin range for the histogram. In some cases, for example if the dataset is small, the resulting bin range may not be very informative. It also groups high values together under the label ‘more’, which makes it harder to spot outliers or extreme values. Additionally, if you are working with Likert-scale data it is often useful to set the bin range so that each interval represents a point on the scale [1]. You can set your own bin range intervals as follows:

1. Create a new column in your worksheet (Hint: keep it separate from your main dataset) showing the bin intervals you want to use. The number entered sets the upper limit of that interval (inclusive). Give the column an appropriate title. If using Likert-scale data, set the bin range 1, 2…n (n = maximum value of scale). See Figure 20. Customer satisfaction is measured on a 7-point scale so we have set the bin range 1, 2…7.

[1] Note that we are treating the Likert data as interval data for the purposes of this illustration (see Chapter 13 for a discussion).

Figure 20 – Creating intervals for a histogram bin range

2. Open up the Histogram dialogue box (Data > Data Analysis > Histogram > OK).
3. In the Histogram dialogue box enter the desired range in the Input Range box.
4. Now select the bin range (i.e. the new column of bin range intervals that you have created).
5. If you have included the column header, tick the Labels in first row box (Note: both the data to be analysed and the bin range must have headers). Confirm where you want the output to go; New Worksheet Ply is the default and is recommended. Select the Chart Output box (see Figure 21).

Figure 21 – Histogram dialogue box with bin range specified

6. Click OK.

The resulting output is shown in Figure 22. It can now be formatted as required.

Figure 22 – Histogram with specified bin range

3 Calculating measures of central tendency

Measures of central tendency introduced in Chapter 13 can be calculated using Excel’s statistical functions (select Formulas > More Functions > Statistical and select the chosen function to open the relevant Function Argument dialogue box). These are shown in Table 1. See Introduction to using Microsoft Excel for quantitative data analysis (Appendix A) for more details on how to select and use functions.

Table 1 – Measures of central tendency in Excel’s statistical functions

Function name   Description
AVERAGE         Returns the arithmetic mean (average) of the given numbers
MEDIAN          Returns the median of the given numbers
MODE.SNGL       Returns the mode of the given numbers

These can also be calculated using the Descriptive Statistics function in the Data Analysis ToolPak (see below).
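
If you want to cross-check the values Excel returns for the functions in Table 1, the same three measures can be computed independently. The sketch below is a minimal illustration in Python using only its standard library; it is not part of the Excel workflow. The scores are hypothetical values on a 7-point scale (they are not the values in customer satisfaction.xlsx), and the worksheet range named in the comments is an assumption about where such data might sit.

```python
from statistics import mean, median, mode

# Hypothetical satisfaction scores on a 7-point scale (illustrative only,
# not taken from the customer satisfaction dataset).
scores = [5, 6, 4, 6, 7, 3, 6, 5, 2, 6]

# Each entry is the Python counterpart of one function in Table 1.
# In a worksheet the formulas would look like =AVERAGE(B2:B11),
# where B2:B11 is an assumed range holding the same ten values.
central_tendency = {
    "mean": mean(scores),      # counterpart of AVERAGE
    "median": median(scores),  # counterpart of MEDIAN
    "mode": mode(scores),      # counterpart of MODE.SNGL
}

for name, value in central_tendency.items():
    print(f"{name}: {value}")
```

The example data contain a single most frequent value, so the question of how ties are handled does not arise; if your data have several equally frequent values, check how each tool reports the mode before comparing results.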

4 Calculating measures of dispersion (spread)

Measures of dispersion (spread) introduced in Chapter 13 can be calculated using Excel’s statistical functions (select Formulas > More Functions > Statistical and select the chosen function to open the relevant Function Argument dialogue box). These are shown in Table 2; note that if you are using sample data, you should use STDEV.S and VAR.S for calculating the standard deviation and variance of your sample. See Introduction to using Microsoft Excel for quantitative data analysis (Appendix A) for more details on how to select and use functions.

Table 2 – Measures of dispersion in Excel’s statistical functions

Function name   Description
MAX             Returns the maximum value of the given numbers
MIN             Returns the minimum value of the given numbers
STDEV.P         Returns the standard deviation of the given numbers, based on the population
STDEV.S         Returns the standard deviation of the given numbers, based on a sample
VAR.P           Returns the variance of the given numbers, based on the population
VAR.S           Returns the variance of the given numbers, based on a sample

These can also be calculated using the Descriptive Statistics function in the Data Analysis ToolPak (see below).

5 Exploring the shape of a variable’s distribution

Excel’s Histogram function in the Data Analysis ToolPak described above (select Data > Data Analysis > Histogram > OK) can be used to generate a histogram for visual evaluation of the shape of a metric variable’s distribution. Excel’s statistical functions include functions for calculating skewness and kurtosis (Table 3). See Introduction to using Microsoft Excel for quantitative data analysis (Appendix A) for more details on how to select and use functions.

Table 3 – Measures of distribution shape in Excel’s statistical functions

Function name   Description
KURT            Returns the kurtosis of a dataset
SKEW            Returns the skewness of a dataset

These can also be calculated using the Descriptive Statistics function in the Data Analysis ToolPak (see below).

6 Generating summary statistics

Excel’s Descriptive Statistics routine in the Data Analysis ToolPak provides a quick way of generating summary statistics for metric variables, including measures of central tendency, dispersion and skewness/kurtosis. To calculate descriptive statistics (for convenience this is repeated from Appendix B to Introduction to using Microsoft Excel for quantitative data analysis):

Select Data > Data Analysis to open the Data Analysis menu dialogue box (Figure 23).

Figure 23 – Data Analysis menu dialogue box

Select the desired function, in this case Descriptive Statistics, which opens the relevant dialogue box (Figure 24).

In the dialogue box, enter the desired range in the Input Range box. If you have included the column header, select the Labels in first row box.

Confirm where you want the output to go. The default setting is New Worksheet Ply, which creates a new worksheet for the output; since most ToolPak outputs are quite large, this is a sensible option.

Select Summary Statistics to get descriptive statistics for your chosen data; you can also select an appropriate confidence level for the mean if desired (the default is 95%).

Figure 24 – Descriptive Statistics dialogue box

Click OK. The output will be shown in a new worksheet (Figure 25). Note that here the column widths have been adjusted to make it easier to read.

Figure 25 – Descriptive Statistics output for variable Age

Note also that this output is not dynamically linked to the original dataset, so changes to the dataset will not automatically be reflected in the output. You will need to run a new analysis. Once created, the output can be copied and pasted into word-processing software for further editing. (Hint: if using the Descriptive Statistics function, you can select multiple adjacent metric variables and the function will report the output for each one.)
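
Because the Descriptive Statistics output is a static snapshot, it can be useful to have an independent way of recomputing the same figures when the data change. The following is a minimal sketch in Python (standard library only), not a replacement for the ToolPak. The function name `describe` and the ages are hypothetical, introduced purely for illustration. The skewness and kurtosis lines use the sample-adjusted formulas that, to the best of our knowledge, Microsoft documents for SKEW and KURT; treat that correspondence as an assumption and compare the results against your own Excel output.

```python
from math import sqrt
from statistics import mean, median, mode, stdev, variance


def describe(values):
    """Return summary statistics similar to the ToolPak's Summary Statistics."""
    n = len(values)
    if n < 4:
        raise ValueError("at least four values are needed for kurtosis")
    m = mean(values)
    s = stdev(values)  # sample standard deviation (counterpart of STDEV.S)
    if s == 0:
        raise ValueError("values must not all be identical")
    # Standardised deviations used by the skewness and kurtosis formulas.
    z = [(x - m) / s for x in values]
    skewness = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurtosis = (
        n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )
    return {
        "Mean": m,
        "Standard Error": s / sqrt(n),
        "Median": median(values),
        "Mode": mode(values),
        "Standard Deviation": s,
        "Sample Variance": variance(values),  # counterpart of VAR.S
        "Kurtosis": kurtosis,
        "Skewness": skewness,
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": n,
    }


# Hypothetical ages (illustrative only, not the Age variable in Figure 25).
ages = [23, 35, 41, 29, 35, 52, 47, 35, 31, 44]

for label, value in describe(ages).items():
    print(f"{label}: {value}")
```

The sketch uses the sample versions of the standard deviation and variance, in line with the advice in Section 4 for sample data. It also assumes the data contain at least one repeated value, since tools differ in what they report as the mode when no value repeats.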