Statistical Analysis using Excel

Preparing the workspace
Upon opening or creating a spreadsheet of data you should see something like what is shown in Figure 1.
Figure 1: View when data are first loaded
One extremely useful feature of a spreadsheet application is the availability of multiple “sheets”. The sheet shown
here is called “Project1Data”. Right click on the sheet tab (bottom left of screen) and rename the first sheet to
simply “Data”. Then create a new sheet called “BasicStats” (Figure 2). Using multiple tabs allows us to separate
the presentation and summary elements from the data. This helps keep us from being distracted by the large
table of data and also keeps the original data safe from accidental modification since we will spend very little time
working on the actual “Data” sheet itself.
Figure 2: Creating a new sheet and saving in native format
Since we intend to make use of some advanced spreadsheet features it is a good idea to make sure that the
document is saved in the native format for our application (Figure 2).
We can access values on different sheets by referring to the sheet’s name. This allows us to structure our document
logically and consistently, even if we choose to change column names or data values in the future. On the
“BasicStats” sheet we duplicate the column headers by referring to the headers on the first page. Type =Data!D1
into the D1 cell on the “BasicStats” sheet and press enter (Figure 3). If we later decide to change the header
labels on the first page, that change will be duplicated on our “BasicStats” sheet.
Figure 3: Referencing another sheet and the magic behavior of pasting
When you copy and paste a formula in a spreadsheet the referenced cells behave somewhat magically. On the
“BasicStats” sheet select the D1 cell (notice that the formula in the formula bar still says “=Data!D1”) and copy
that cell by pressing Ctrl - c (at the same time). Then click and drag so that cells E1 through T1 are selected
(still on the “BasicStats” sheet). Paste your formula into these cells by pressing Ctrl - v (Figure 3). Notice that
if you select one of the other cells, its formula has been changed to reference the corresponding cell on the “Data”
sheet (see footnote 1).
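The adjustment a spreadsheet makes when a formula is pasted can be sketched in a few lines of Python. This is only an illustration of the idea, not how Excel is implemented: the helper names are hypothetical, and the sketch assumes a simple relative reference such as Data!D1 with no absolute ($) markers.

```python
import re


def col_to_num(col: str) -> int:
    """Convert a column label such as 'D' or 'AA' to a 1-based number."""
    num = 0
    for ch in col:
        num = num * 26 + (ord(ch) - ord("A") + 1)
    return num


def num_to_col(num: int) -> str:
    """Convert a 1-based column number back to a label such as 'T'."""
    label = ""
    while num > 0:
        num, rem = divmod(num - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def shift_reference(ref: str, d_cols: int, d_rows: int = 0) -> str:
    """Shift a relative reference like 'Data!D1' by some columns and rows."""
    match = re.fullmatch(r"(?:(.+)!)?([A-Z]+)(\d+)", ref)
    if match is None:
        raise ValueError(f"unsupported reference: {ref!r}")
    sheet, col, row = match.groups()
    new_ref = num_to_col(col_to_num(col) + d_cols) + str(int(row) + d_rows)
    return f"{sheet}!{new_ref}" if sheet else new_ref


# Pasting the formula from D1 into E1 through T1 moves it 1 to 16 columns right.
pasted = [shift_reference("Data!D1", offset) for offset in range(1, 17)]
print(pasted[0], pasted[-1])
```

The sheet name is left alone and only the column letter moves, which is why every pasted cell still points at the “Data” sheet.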
Standard Numerical Summaries
We are now able to easily compute all of the basic numerical summaries. We start by labeling each row (just type
the label text into the corresponding cell) and compute the mean of the “not English” column (column D), which
has data in rows 2 through 237, by typing =AVERAGE(Data!D2:D237) (Figure 4). We can similarly use the MAX,
MEDIAN, MIN, PERCENTILE, and STDEV functions to compute the other values (Figure 4).
Figure 4: Computing numerical summaries
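As a cross-check outside the spreadsheet, the same summaries can be sketched with Python's standard library. The data below are made up, and the choice of seven values (mean, minimum, first quartile, median, third quartile, maximum, standard deviation) is an assumption about what Figure 4 contains. To our understanding, method="inclusive" interpolates quartiles the same way PERCENTILE does, and statistics.stdev is the sample standard deviation, like STDEV.

```python
import statistics


def numerical_summaries(values):
    """Return seven basic numerical summaries for one column of data."""
    # method="inclusive" uses linear interpolation over positions 0..n-1.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "Q1": q1,
        "median": statistics.median(values),
        "Q3": q3,
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Made-up percentages, for illustration only (not the tutorial's data).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
for name, value in numerical_summaries(sample).items():
    print(f"{name:>6}: {value:.3f}")
```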
We can now easily compute all of the other summaries by simply copying these seven cells and pasting them into
the remaining columns. Furthermore, since the spreadsheet can do standard arithmetic as well, we can compute
some of our own values such as the LOW and HIGH values that arise from the 1.5 IQR rule (Figure 5). Notice
that we do not need to precede the cell number with the sheet name when referring to a cell on the same sheet
we are working on.
1 You can stop the originally copied cell from blinking by pressing the Escape key.
IQR      =  D8 - D6
1.5 IQR  =  1.5 * D11
LOW      =  D6 - D12
HIGH     =  D8 + D12
Figure 5: Applying the 1.5 IQR Rule
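The arithmetic behind the 1.5 IQR rule can be sketched in Python as follows. The function name and the data are hypothetical, and the comments assume that D6 holds the first quartile and D8 the third quartile, which the cell formulas suggest but the text does not state.

```python
import statistics


def iqr_fences(values):
    """Return (IQR, 1.5 * IQR, LOW, HIGH) following the 1.5 IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1        # D8 - D6   (assuming D6 = Q1 and D8 = Q3)
    step = 1.5 * iqr     # 1.5 * D11
    low = q1 - step      # D6 - D12
    high = q3 + step     # D8 + D12
    return iqr, step, low, high


# Made-up data with one unusually large value, for illustration only.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]
iqr, step, low, high = iqr_fences(sample)
outliers = [v for v in sample if v < low or v > high]
print(iqr, step, low, high, outliers)
```

Any value below LOW or above HIGH is flagged as a potential outlier; in this made-up sample only the value 30 falls outside the fences.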
Analyzing a single variable
Calculations
I recommend creating a new sheet for each variable you wish to analyze. As an example, I have created a new
sheet titled “NotEnglish”. The description for column D is “the percentage of people who do not speak English
at home.” Therefore those values are percentages of the total population of each county which is held in column
T. If we wanted to compute the (approximate) number of individuals who do not speak English at home in Ada
county (row 2 of our data) we would need the following computation.
number not english = total population × percent not english (as decimal)
                   = value in column T × value in column D / 100
                   = Data!T2 * Data!D2/100
We place this computation into cell A2 on our “NotEnglish” sheet then copy and paste that formula into rows 3
through 237 (select them all at once before pasting!). This gives us the approximate number of individuals in each
county who do not speak English at home. We could then SUM (add) those values to approximate the total
number of such individuals across all of the counties in our data (Figure 6).
Figure 6: Counting the number of individuals who do not speak English at home
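The same percentage-to-count conversion, followed by the sum, can be sketched in Python. The populations and percentages below are hypothetical stand-ins for columns T and D, not values from the data set, and the function name is our own.

```python
def count_not_english(total_population, percent_not_english):
    """Approximate head count from a population and a percentage (0-100)."""
    return total_population * percent_not_english / 100


# Hypothetical (population, percent) pairs standing in for columns T and D.
counties = [(400_000, 10.0), (50_000, 2.5), (1_200_000, 30.0)]
per_county = [count_not_english(pop, pct) for pop, pct in counties]
total = sum(per_county)  # plays the role of SUM over the pasted column
print(per_county, total)
```

Dividing by 100 is the step that is easiest to forget: without it the result would be one hundred times too large.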
Histogram
Creating a histogram in a spreadsheet first requires constructing a frequency table. Looking at the 5 number
summary for the “not english” column helps us decide how to break up our bins for the histogram. Since our
minimum value is 2.7 and our maximum value is 78.4 we will choose to start at zero and take steps of 5. This
should (potentially) give us 17 bars in our histogram. We start by typing in some headers and the bin values
we chose (Figure 7; ignore the highlighted “frequency” column for now).
Figure 7: Frequency data for non-English households
Now that we have decided how the data should be grouped, we need to tell the spreadsheet to actually compute
the frequencies. Highlight the set of cells where you wish the frequencies to be placed (in this case we are putting
them in cells D8 through D24). When we type in the FREQUENCY command we need to tell it what data are of
interest as well as what bins to use. Here we want the “not english” data and the bins that we just typed so
our command will be =FREQUENCY(Data!D2:D237, C8:C24). Now, instead of pressing enter as usual press
Ctrl - Shift - Enter all at once (see footnote 2). This should fill in the frequencies as shown in Figure 7. Notice that
Excel treats each bin value as an upper limit: a value is counted in the first bin whose value it does not exceed,
so our minimum, 2.7, was counted in the 5 bin rather than the 0 bin.
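The counting rule that FREQUENCY applies can be sketched in Python. To our understanding, each bin value acts as an inclusive upper limit, and the function returns one extra count for values above the largest bin. The function name and all of the data except the minimum (2.7) and maximum (78.4) are made up for illustration.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin, treating each bin value as an inclusive upper limit.

    Returns len(bins) + 1 counts; the last one holds values above the top bin.
    """
    counts = [0] * (len(bins) + 1)
    for value in data:
        # Index of the first bin value that is >= value.
        counts[bisect_left(bins, value)] += 1
    return counts


bins = list(range(0, 85, 5))  # 0, 5, ..., 80: the 17 bin values in column C
# Made-up values; only 2.7 and 78.4 (the minimum and maximum) come from the text.
data = [2.7, 5.0, 5.1, 41.3, 78.4]
counts = frequency(data, bins)
print(dict(zip(bins, counts)))
```

Here 2.7 and 5.0 both land in the 5 bin, while 5.1 lands in the 10 bin, which shows that a value exactly equal to a bin value is counted in that bin.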
To create the actual chart, select “Insert” → “Column” → “Clustered Column” then “Select Data”. Select the
frequency data that we just generated (Figure 8). Then in the “Horizontal (Category) Axis Labels” panel click
“Edit”. This allows us to choose the horizontal axis labels which we have typed into column C (Figure 9).
2 Even though one says “all at once” you may hold down the Ctrl and Shift keys and then press enter.
Figure 8: Creating a histogram — selecting data
Figure 9: Creating a histogram — selecting labels
Finally you should add a heading (select “Layout” → “Chart Title” → “Above Chart”) and remove the unnecessary
“Series 1” label leaving a histogram which is ready to be copied into a Word document as part of a report.
Figure 10: Creating a histogram — completed histogram
Boxplot
The following method was found on Neville Hunt’s webpage (see footnote 3). There are also instructions there for other versions
of Excel.
Begin by arranging the 5 number summary in the order shown in Figure 11 (see footnote 4). Select the data and labels (6 rows and
2 columns here, though it is possible to plot more than one boxplot simultaneously by adding more columns). In
the menus select “Insert” → “Chart” → “Line Chart” → “Line with Markers” then select “Switch Row/Column”.
Your boxplot should now look like Figure 11.
Figure 11: Creating a boxplot — Switch Row/Column
We now need to do a bit of formatting to make the boxplot look right.
• Select any of the data points and click “Layout” → “Lines” → “High-Low Lines”, then “Layout” →
“Up/Down Bars” → “Up/Down Bars”
• Right click on the box part of the boxplot, select “Format Up Bars”, and set Fill to “No Fill”
• Select “Layout” → “Gridlines” → “Primary Horizontal Gridlines” → “None”
• Select “Layout” → “Chart Title” → “Above Chart”
• Finally, delete the legend
3 http://www.mis.coventry.ac.uk/~nhunt/boxplot.htm
4 I included my boxplot on my “NotEnglish” sheet just below my histogram (I scrolled down a page) so that all of my “not english”
analysis can be found on the same sheet.
Once finished you should have a slightly strange but serviceable boxplot (Figure 12).
Figure 12: Creating a boxplot — completed boxplot
Two-Variable Statistics
Fortunately for us, spreadsheets are a bit more friendly when it comes to two-variable statistics. We can compute
the correlation between a pair of variables using the CORREL function. Create a new sheet with the name “above 65
vs men never”. We will be comparing columns N (percent above age 65) and O (percent men never married). In our
new sheet we compute the correlation by typing =CORREL(Data!N2:N237,Data!O2:O237) and find a weak
negative association (r = −0.2502).
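For readers who want to see what CORREL computes, the Pearson correlation coefficient can be sketched directly from its definition. The function name is hypothetical and the two lists are made-up percentages, not the values in columns N and O.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up percentages, for illustration only (not columns N and O).
above_65 = [10.0, 12.0, 14.0, 16.0, 18.0]
men_never = [36.0, 30.0, 33.0, 29.0, 31.0]
print(round(pearson_r(above_65, men_never), 4))
```

The result always lies between -1 and 1. Values near zero indicate a weak linear association, and the sign gives its direction.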
Pressing on, we can draw a scatterplot by switching to the “Data” sheet, highlighting columns N and O (simply
by selecting the N and O headers), and then clicking “Insert” → “Scatter” → “Scatter with only Markers”.
Unfortunately this puts the chart on the Data sheet which is bad organization so we select “Design” → “Move
Chart” and select our “above 65 vs men never” sheet.
We may also add a regression line by selecting “Layout” → “Trendline” → “Linear Trendline”. Removing the
legend and adjusting the title appropriately gives us the following scatterplot.
Figure 13: A complete scatterplot
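The linear trendline is, to our understanding, an ordinary least-squares fit. A minimal Python sketch of that calculation follows; the function name and data are hypothetical, and this is not a description of Excel's internal method.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = least_squares_line(xs, ys)
print(slope, intercept)
```

The slope has the same sign as the correlation coefficient, so a negative r goes with a downward-sloping trendline.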