Excel for Statistical Analysis 2011

USING MICROSOFT® EXCEL FOR STATISTICAL ANALYSIS
1.1 INTRODUCTION TO EXCEL AS A STATISTICAL TOOL
Excel is undeniably the dominant spreadsheet package in use today. As part of
the Microsoft® suite of programmes it integrates seamlessly with Word and
PowerPoint, and it can also be used as a convenient tool for capturing and storing
field data such as that gathered for the Samouel’s restaurant study. Standard cut
and paste techniques can often be used to transfer data records into and out of
Excel when more formal file importing procedures are unavailable.
In addition to its standard spreadsheet capabilities that are well-known to most
readers, Excel offers a wide range of statistical routines that may be considered
adequate for most standard survey applications. Developing an appreciation of
this potential of Excel can save you considerable money, given the expense
involved in buying comprehensive specialist statistical software, and time, given
the learning curve attached to many of these specialist packages.
This note provides a brief introduction to Excel for readers who have never used
the package, before introducing some of the more useful statistics that can be
accessed within Excel using its standard built-in routines and through its add-in
functionality.
1.2 EXCEL BASICS
Upon first loading Excel, you will see a screen like the one displayed in Exhibit 1.
This screen provides all the functionality you will require to analyse data, present
it in various graphical and text formats, and to export the desired output to
standard presentation or word processing packages such as PowerPoint and
Word.
Most of what appears in Exhibit 1 is obvious to the regular computer user
including the menu bar structure, the scroll bars, and minimise, maximise and
close icons. Distinguishing characteristics of Excel are the cell structure that is
identified by the column and row labelling and the sheet tabs displayed at
the bottom of the screen. Currently, the selected cell is B9 as identified by the
bold box around its borders. Letters of the alphabet are used to identify the
columns and Arabic numerals are used to identify the rows of the displayed sheet.
Each sheet has a limiting capacity of 256 columns labelled A through IV and
65,536 rows labelled 1 through 65,536. This is far more than the average user will
ever need.
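The column lettering scheme can be illustrated with a short Python sketch. It is only an illustration of the labelling rule described above; the function name column_label is an assumption made for this example and is not part of Excel.

```python
def column_label(index: int) -> str:
    """Convert a 1-based column number to its spreadsheet letter label."""
    if index < 1:
        raise ValueError("column numbers start at 1")
    label = ""
    while index > 0:
        # Letters behave like base-26 digits with no zero, hence the "- 1".
        index, remainder = divmod(index - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


# Columns 1, 26, 27 and 256 of the sheet.
print(column_label(1), column_label(26), column_label(27), column_label(256))
```

Under this rule the 256th and last column receives the label IV, which agrees with the range A through IV quoted above.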
Exhibit 1: Opening Screen for an Excel Workbook named Book1
Selecting any cell on the spreadsheet by moving the cursor and pressing the left
mouse button will result in the selected cell being identified in the cell identification
box and the contents of the cell being displayed in the cell contents box. Exhibit 1
shows these to be B9 and =sum(B3:B8) respectively. The result of the
formula is obtained by adding the contents of cells B3 down to B8 and it is
displayed in cell B9. This is the basic principle of spreadsheets. Each of the cells
within the spreadsheet can contain numerical data, text data or formulae that
calculate values based upon the contents of other referenced cells. In the
illustration described above, cells B3 to B8 contain the numerical data 1, 2, 3, 4, 5
and 6 respectively while cell B9 contains the formula needed to add these
numbers together. Cell B2 contains the text data “Samouel’s Restaurant”.
Although the text appears to extend into cell C2, this is not the case but merely
seems to be so because the column display width is narrower than the text it
contains.
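The spreadsheet principle described above can be mimicked in a few lines of Python. This is a rough sketch only: the dictionary cells and the helper sum_range are hypothetical stand-ins for the worksheet and for the =sum(B3:B8) formula, not a description of how Excel works internally.

```python
# Hypothetical in-memory stand-in for the cells described in Exhibit 1.
cells = {
    "B2": "Samouel's Restaurant",  # text data
    "B3": 1, "B4": 2, "B5": 3, "B6": 4, "B7": 5, "B8": 6,  # numerical data
}


def sum_range(sheet, column, first_row, last_row):
    """Add the numeric cells of one column, ignoring text and empty cells."""
    total = 0
    for row in range(first_row, last_row + 1):
        value = sheet.get(f"{column}{row}")
        if isinstance(value, (int, float)):
            total += value
    return total


# Equivalent of placing =sum(B3:B8) in cell B9.
cells["B9"] = sum_range(cells, "B", 3, 8)
print(cells["B9"])
```

The value stored in B9 is the summed value of 21 referred to in the text.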
Formatting areas of a spreadsheet is achieved by highlighting the desired
elements of the spreadsheet and then using the Format menu contained in the
menu bar or by referring directly to the icon options that are on display. This
approach has been adopted to make the contents of cell B9 appear in bold and
have lines (borders) above and below the summed value of 21. Exhibit 2 displays
the scope of formatting available. Individual cells and ranges of cells can be
formatted based upon desired numerical presentation, alignment, font style and
colour, border, pattern or shading, and whether they should be protected against
overwriting.
Exhibit 2: Options available once the Format>Cells dropdown menu has been selected
1.3 STANDARD BUILT-IN STATISTICAL OPTIONS
Excel contains approximately eighty easily accessible built-in statistical functions. These
are accessed by moving the cursor to the cell where you wish to place the desired output
and entering “=” into the cell. Immediately this is done, the Cell Identification Box
described in Exhibit 1 will change its appearance to grey and display a function for
selection (the displayed function may be one that you have used recently). Placing the
cursor over the down arrow displayed immediately to the right of the function label and
clicking the left mouse button causes a drop down box to appear as displayed on the left
hand side of Exhibit 3. Once the More Functions … option is selected by again
highlighting it with the cursor and clicking the left mouse button, the next drop down box
appears. It is displayed on the right hand side of Exhibit 3. Selecting the category
Statistical lists all the standard statistical functions available for selection.
Exhibit 3: Displaying the standard built-in statistical functions for selection
Excel provides continued support to help you use the built-in function of your
choice. This is best illustrated by way of an example. Assume that you wish to test
for the equivalence of means between two data samples. Exhibit 4 displays the
samples in columns B and C respectively. Each sample contains twenty data
points and the drop down menu is obtained by selecting the TTEST statistical
function using the methodology outlined above. As can be seen in the exhibit, this
function requires four inputs. Array1 contains the range of cells B2:B21. It is
created by either typing the range directly into the data box or by first clicking on
the red arrow located at the right of the data box and then simply highlighting the
cells containing the first sample data. Array2 identifies the second sample as
being in cells C2:C21 in the identical manner. Tails is designed to contain either a
1 for a single tailed t-statistic or a 2 for a two tailed t-statistic. Finally, Type should
contain 1 if a paired comparison test is required, 2 for a two-sample equal
variance test or 3 for a two-sample unequal variance test. Although not shown in
the exhibit, entering 2 in the Type data box will result in the answer 0.097120096
appearing at the lower equal sign. Selecting [OK] transfers this result to cell E1.
As declared in the drop down box, the answer provided by this test is the
probability associated with the Student’s t-Test.
A final point to note from the above illustration is that the Cell Contents Box
contains the exact format required for the formula, namely:
=TTEST(B2:B21,C2:C21,2,2) where the four arguments contained within
brackets in the formula are for Array1, Array2, Tails and Type respectively.
Exhibit 4: Comparison of means using the TTEST function
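For readers who want to see what a call such as =TTEST(Array1,Array2,2,2) calculates, the following Python sketch reproduces the two-sample equal variance test with a two tailed probability. It is a minimal illustration under stated assumptions: the function names are invented for this example, the tail probability is approximated by Simpson's rule rather than by whatever method Excel uses, and the two short samples are made-up numbers, not the data shown in Exhibit 4.

```python
import math
from statistics import mean, variance


def t_pdf(t, dof):
    """Density of the Student's t-distribution with dof degrees of freedom."""
    log_coeff = math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
    coeff = math.exp(log_coeff) / math.sqrt(dof * math.pi)
    return coeff * (1 + t * t / dof) ** (-(dof + 1) / 2)


def two_tailed_p(t, dof, steps=2000):
    """Two tailed probability, integrating the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central_area = total * h / 3
    return 1 - 2 * central_area


def pooled_t_test(sample1, sample2):
    """Equal variance two-sample t-test (Type 2) with two tails."""
    n1, n2 = len(sample1), len(sample2)
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / dof
    t = (mean(sample1) - mean(sample2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, dof, two_tailed_p(t, dof)


# Illustrative data only.
t_stat, dof, p_value = pooled_t_test([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(t_stat, 4), dof, round(p_value, 4))
```

As with TTEST, the quantity of interest is the probability: a small value suggests that the two sample means differ by more than chance alone would explain.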
Although listed alphabetically, the popularly used built in statistical functions may
be categorised as descriptive, inferential and distributional. These are presented
and discussed in Exhibit 5.
Exhibit 5: List of commonly used built in statistical functions
A. Descriptive Statistics
Function:
=AVERAGE(Array1,Array2,Array3,…)
Purpose:
Computes the arithmetic average of a range of numbers.
Argument(s): Arrays or values separated by commas. Examples include (B1:B35)
to average the thirty-five numbers contained in the identified array or
(B1:B10,12,C1:C20) to average the ten numbers in the first array,
the number 12 and the twenty numbers in the second array.
Function:
=CORREL(Array1,Array2)
Purpose:
Computes the correlation between the two identified arrays.
Argument(s): The two arrays are separated by commas and are as described for
the AVERAGE function.
Function:
=COUNT(Array1,Array2,Array3,…)
Purpose:
Computes the number of numerical values contained within the
identified arrays.
Argument(s): Arrays or values separated by commas as described for the
AVERAGE function.
Function:
=COVAR(Array1,Array2)
Purpose:
Computes the covariance between the two identified arrays.
Argument(s): The two arrays are separated by commas and are as described for
the AVERAGE function.
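The relationship between COVAR and CORREL can be sketched in Python as follows. The function names simply mirror the Excel ones for readability, and the sketch assumes the population convention (division by the number of observations) that COVAR uses; the two short lists are illustrative data only.

```python
import math


def covar(xs, ys):
    """Population covariance: average product of deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correl(xs, ys):
    """Correlation: covariance scaled by the two standard deviations."""
    return covar(xs, ys) / math.sqrt(covar(xs, xs) * covar(ys, ys))


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(covar(x, y), round(correl(x, y), 6))
```

Because the correlation is a scaled covariance, it always lies between -1 and 1 whereas the covariance depends on the units of the data.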
Function:
=FREQUENCY(Data_Array,Bin_Array)
Purpose:
Computes the frequency count of an array of numbers based upon
a pre-specified bin range.
Argument(s): Data Array contains the numerical values that you want to develop a
frequency count for and Bin Array contains the reference values that
you wish to group the data into. An example of this would be
(B1:B200,C1:C5) where cells C1 to C5 contain the numbers 0.2, 0.4,
0.6, 0.8 and 1.0. The solution will appear as a vector one
observation longer than the Bin Array. It includes the number of
values less than or equal to 0.2; the number greater than 0.2 and up
to 0.4; the number greater than 0.4 and up to 0.6; and so on. Each
value in the vector may be shown by using the
=INDEX(FREQUENCY(Data_Array,Bin_Array),Index_Value)
function. For example if the Index Value from the above example is
6 then the cell containing the function will give the number of
observations greater than 1.0 in the original Data Array.
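The binning rule can be made concrete with a small Python sketch. The function frequency below is an assumed stand-in written for this note, and the data values are invented for illustration; it counts each value in the first bin whose upper limit it does not exceed, with one extra slot for values above the last bin.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin; bins must be sorted in ascending order."""
    counts = [0] * (len(bins) + 1)  # one extra slot for values above the last bin
    for value in data:
        # Index of the first bin limit that is greater than or equal to value.
        counts[bisect_left(bins, value)] += 1
    return counts


bins = [0.2, 0.4, 0.6, 0.8, 1.0]
data = [0.1, 0.2, 0.25, 0.5, 0.9, 1.0, 1.3, 2.0]
counts = frequency(data, bins)
print(counts)
# Python lists are 0-based, so an Index Value of 6 corresponds to counts[5].
print(counts[5])
```

The result has six elements for five bin values, and its last element counts the observations greater than 1.0.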
Function:
=KURT(Array1,Array2,Array3,…)
Purpose:
Computes the kurtosis of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the
AVERAGE function.
Function:
=MAX(Array1,Array2,Array3,…)
Purpose:
Computes the maximum number of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the
AVERAGE function.
Function:
=MEDIAN(Array1,Array2,Array3,…)
Purpose:
Computes the median or middle number of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the
AVERAGE function.
Function:
=MIN(Array1,Array2,Array3,…)
Purpose:
Computes the minimum number of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the
AVERAGE function.
Function:
=MODE(Array1,Array2,Array3,…)
Purpose:
Computes the mode or most frequently occurring number of a range
of numbers.
Argument(s): Arrays or values separated by commas as described for the
AVERAGE function.
Function:
=PEARSON(Array1,Array2)
Purpose:
Computes the Pearson product moment correlation between the
two identified arrays.
Argument(s): The two arrays are separated by commas and are as described for
the AVERAGE function.
Function:
=PERCENTILE(Array,K)
Purpose:
Computes the K-th percentile of a range of numbers.
Argument(s): Array contains the numeric values that you want to find the
percentile of and K is a percentile number between 0 and 1.
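A percentile that falls between two observations has to be interpolated. The Python sketch below uses linear interpolation between the sorted values, which is the rule PERCENTILE is understood to apply; the function name and the four-value sample are assumptions made for illustration.

```python
def percentile(values, k):
    """K-th percentile (k between 0 and 1) by linear interpolation."""
    if not values or not 0 <= k <= 1:
        raise ValueError("need at least one value and 0 <= k <= 1")
    ordered = sorted(values)
    position = k * (len(ordered) - 1)  # fractional position in the sorted list
    lower = int(position)
    fraction = position - lower
    if lower + 1 >= len(ordered):
        return float(ordered[-1])
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])


sample = [1, 2, 3, 4]
print(percentile(sample, 0.3), percentile(sample, 0.5), percentile(sample, 1))
```

With K equal to 0 the function returns the minimum and with K equal to 1 it returns the maximum, so MIN, MEDIAN and MAX are special cases of the same calculation.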
Function:
=SKEW(Array1,Array2,Array3,…)
Purpose:
Computes the skewness or degree of asymmetry of a range of
numbers.
Argument(s): Arrays or values separated by commas as described for the
AVERAGE function.
Function:
=STDEV(Array1,Array2,Array3,…)
Purpose:
Computes the sample standard deviation of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the
AVERAGE function.
Function:
=VAR(Array1,Array2,Array3,…)
Purpose:
Computes the sample variance of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the
AVERAGE function.
B. Inferential Statistics
Function:
=CHITEST(Actual,Expected)
Purpose:
Computes the test for independence of two classification techniques
using the Chi-square distribution and the appropriate degrees of
freedom. The value produced is the probability of observing
differences at least this large if the two classifications are
independent.
Argument(s): Actual is the array containing the observed frequencies and
Expected is the array containing the expected observations
assuming the row and column classifications are independent. As an
example, consider a classroom containing 24 male students and
18 female students in which 9 male students and 8 female students
smoke cigarettes. This test can be used to assess whether smoking
is independent of gender. The Actual data for this test may be set
up in cells A1:B2 with A1 containing the number of male smokers (9),
B1 containing the number of males who do not smoke (15), A2
containing the female smokers (8) and B2 containing the number of
females who do not smoke (10). If the Expected data assuming
independence is placed in cells D1:E2, then the function
=CHITEST(A1:B2,D1:E2) will produce the result 0.650014,
which gives no evidence against the two classifications of gender
and smoking being independent.
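The classroom example can be checked outside Excel with the Python sketch below. It builds the Expected table from the row and column totals and then evaluates the Chi-square probability. The function name is an assumption, and the closed form used for the probability is valid only for one degree of freedom, which is the case for a table with two rows and two columns.

```python
import math


def chitest_2x2(actual):
    """Expected counts, Chi-square statistic and probability for a 2 x 2 table."""
    row_totals = [sum(row) for row in actual]
    col_totals = [sum(col) for col in zip(*actual)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total x column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi_square = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(actual, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # Upper tail of the Chi-square distribution with ONE degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return expected, chi_square, p_value


# Rows: male, female. Columns: smokers, non-smokers.
actual = [[9, 15], [8, 10]]
expected, chi_square, p_value = chitest_2x2(actual)
print(expected)
print(round(chi_square, 6), round(p_value, 6))
```

The expected counts printed first are the values that would be placed in cells D1:E2, and the probability agrees with the CHITEST result quoted in the example.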
Function:
=CONFIDENCE(Alpha,SD,Size)
Purpose:
Computes the half-width of the confidence interval for a population mean.
Argument(s): Alpha is the significance level used to compute the confidence
interval, SD is the standard deviation (treated as known, so that the
normal distribution is used) and Size is the sample
size. The value returned must be added to and subtracted from the
sample mean to produce the desired confidence interval.
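The calculation behind CONFIDENCE can be sketched in Python as a normal critical value multiplied by the standard error of the mean. The function name mirrors the Excel one for readability only, and the sample mean of 30 used at the end is an invented figure for illustration.

```python
from math import sqrt
from statistics import NormalDist


def confidence(alpha, sd, size):
    """Half-width of a (1 - alpha) confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two tailed normal critical value
    return z * sd / sqrt(size)


margin = confidence(0.05, 2.5, 50)
sample_mean = 30  # hypothetical sample mean
print(round(margin, 6))
print(round(sample_mean - margin, 4), round(sample_mean + margin, 4))
```

The last line shows the returned value being subtracted from and added to the sample mean to give the lower and upper confidence limits.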
Function:
=FTEST(Array1,Array2)
Purpose:
Computes the result of an F-test that the variances in the two arrays
are not significantly different, returned as a two tailed probability.
Argument(s): Array1 contains the first data sample and Array2 contains the
second data sample.
Function:
=LINEST(Ys,Xs,Const,Stat)
Purpose:
Computes the results of a univariate or multi-variate linear
regression.
Argument(s): Ys is a vector containing the dependent variable for the
regression, Xs is a vector or array containing the independent
variable(s) for the regression, Const is a logical value of True or
False indicating whether a constant of the regression is allowed or
whether the regression equation should be forced through the origin,
and Stat is a logical value of True or False indicating whether
additional statistics are required. As an example, consider
=LINEST(A1:A20,B1:D20,TRUE,TRUE). Here the dependent
variable consists of the twenty observations in cells A1 to A20.
There are three independent variables contained in cells B1 to B20,
C1 to C20 and D1 to D20 respectively. The function allows for a
regression constant and additional statistics are produced. The full
statistics produced by the function are: (1) the slope coefficients for
each of the independent variables and the constant of the
regression; (2) the standard errors of each slope coefficient and of
the constant to test for their respective significances; (3) the
R-Square and standard error of the regression; (4) the F-statistic and
the residual degrees of freedom; and (5) the sum of squares of
the regression and sum of squares of the residuals. As with the
FREQUENCY function, the output of this function is not a single
value but an array. As such the =INDEX(LINEST(…),Row,Column)
function must be used to access individual outputs. The =LINEST output
array contains the coefficients in its first row starting with the last
independent variable in the first column and progressing to the
regression constant. The corresponding standard errors are
contained in the second row. The R-Square and the standard error
of the regression are contained in the first and second column
respectively of the third row. The F-statistic and the residual
degrees of freedom are contained in the first and second column
respectively of the fourth row. Finally, the regression and residual
sum of squares are contained in the first and second column
respectively of the fifth row.
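The layout of the output array is easier to follow with a worked case. The Python sketch below fits a regression with a single independent variable and arranges the results in the five rows described above. It is a simplified illustration, not a reimplementation of LINEST: the names linest_simple and index are assumptions, only one X variable is supported, a constant is always fitted, and the five observations are made-up data.

```python
import math


def linest_simple(ys, xs):
    """Five-row result table for a regression of ys on one x variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_reg = ss_total - ss_resid
    dof = n - 2  # residual degrees of freedom
    se_y = math.sqrt(ss_resid / dof)
    se_slope = se_y / math.sqrt(sxx)
    se_intercept = se_y * math.sqrt(sum(x * x for x in xs) / (n * sxx))
    return [
        [slope, intercept],         # row 1: coefficients, constant last
        [se_slope, se_intercept],   # row 2: standard errors
        [ss_reg / ss_total, se_y],  # row 3: R-Square, standard error
        [ss_reg / (ss_resid / dof), dof],  # row 4: F-statistic, residual DoF
        [ss_reg, ss_resid],         # row 5: sums of squares
    ]


def index(table, row, column):
    """1-based lookup, in the spirit of =INDEX(array,Row,Column)."""
    return table[row - 1][column - 1]


table = linest_simple([2, 4, 5, 4, 5], [1, 2, 3, 4, 5])
print(index(table, 1, 1), index(table, 3, 1), index(table, 4, 1))
```

The three values printed are the slope coefficient, the R-Square and the F-statistic, picked out by row and column in the same way that INDEX is used on the LINEST array.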
Function:
=TTEST(Array1,Array2,Tails,Type)
Purpose:
Computes the result of a Student’s t-test that the means in the two
arrays are not significantly different.
Argument(s): Array1 contains the first data sample, Array2 contains the second
data sample, Tails indicates whether a one tailed or two tailed test is
to be conducted by containing the integer 1 or 2 respectively (or a
cell reference to one of these numbers) and Type should contain
the integer 1, 2 or 3 (or a cell reference to one of these numbers) to
indicate respectively whether a paired test, a test
assuming equal variance across the two populations from which the
samples have been drawn, or a test allowing for unequal
variances is being conducted. The function returns the probability
associated with the test.
C. Distributional Statistics
Function:
=CHIDIST(X,DoF)
Purpose:
Computes the one tailed probability of the Chi-square distribution.
Argument(s): X is a value or cell reference to a value at which to evaluate the
function and DoF is the degrees of freedom of the distribution or a
cell reference to where the value is located.
Function:
=CHIINV(Prob,DoF)
Purpose:
Computes the inverse of the one tailed probability of the Chi-square
distribution.
Argument(s): Prob is the probability associated with the Chi-square distribution
and may be any numeric between 0 and 1 inclusive (or a cell
reference to such a number) and DoF is the degrees of freedom of
the distribution or a cell reference to where the value is located.
Function:
=FDIST(X,DoF1,DoF2)
Purpose:
Computes the right tailed probability of the F distribution.
Argument(s): X is a value or cell reference to a value at which to evaluate the
function and must be a non-negative number, DoF1 is the
numerator degrees of freedom and DoF2 is the denominator
degrees of freedom. Both degrees of freedom must be 1 or greater.
Function:
=FINV(Prob,DoF1,DoF2)
Purpose:
Computes the inverse of the F probability-distribution.
Argument(s): Prob is the cumulative probability associated with the F distribution
and may be any numeric between 0 and 1 inclusive (or a cell
reference to such a number), and DoF1 is the numerator degrees of
freedom and DoF2 is the denominator degrees of freedom. As for
FDIST, both degrees of freedom must be greater than or equal to 1.
Function:
=NORMSDIST(Z)
Purpose:
Computes the standard normal cumulative distribution.
Argument(s): Z is a value or cell reference to a value at which to evaluate the
function.
Function:
=NORMSINV(Prob)
Purpose:
Computes the inverse of the standard normal cumulative distribution.
Argument(s): Prob is the probability associated with the normal distribution and
may be any numeric between 0 and 1 inclusive (or a cell reference
to such a number).
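NORMSDIST and NORMSINV are inverses of one another, which the following Python sketch illustrates using the standard library. The function names are chosen to mirror the Excel ones and are not part of Python; note also that Python's inv_cdf accepts only probabilities strictly between 0 and 1.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def normsdist(z):
    """Cumulative probability of the standard normal distribution at z."""
    return standard_normal.cdf(z)


def normsinv(prob):
    """Z value whose cumulative probability equals prob (0 < prob < 1)."""
    return standard_normal.inv_cdf(prob)


print(round(normsdist(1.96), 4))
print(round(normsinv(0.975), 4))
print(round(normsinv(normsdist(1.25)), 4))
```

The last line shows the round trip: applying the inverse to a cumulative probability recovers the original Z value.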
Function:
=TDIST(X,DoF,Tails)
Purpose:
Computes the Student’s t-distribution.
Argument(s): X is a value or cell reference to a value at which to evaluate the
function, DoF is the degrees of freedom of the distribution or a cell
reference to where the integer value is located and Tails is a 1 or a
2 indicating whether the one tailed probability or two tailed
probability value is required.
Function:
=TINV(Prob,DoF)
Purpose:
Computes the inverse of the Student’s t-distribution.
Argument(s): Prob is the probability associated with the two tailed Student’s
t-distribution and may be any numeric between 0 and 1 inclusive (or a
cell reference to such a number) and DoF is the degrees of freedom
of the distribution or a cell reference to where the value is located.
1.4 ADD-IN STATISTICAL OPTIONS
In addition to the standard built-in functions described above, Excel also offers the
user the opportunity to add statistical functionality through the Analysis
ToolPak. This functionality is added through the Tools menu bar at the top of the
Excel screen. Selecting Tools and then Add-Ins… results in a tick box of
available functionality options being displayed. This box is displayed as Exhibit 6.
Selecting the Analysis ToolPak and OK installs the statistical functionality
described below.
Exhibit 6: List of add in functionality available in Excel
Once the Analysis ToolPak has been installed, the Tools menu bar includes a
Data Analysis option that offers the range of statistical options displayed in
Exhibit 7. Each of these techniques offers a comprehensive range of data
identification and method selection options that are described within the technique
load function using a form of Wizard® identical to that observed for the built-in
statistical functions. The approach is illustrated below for two of the more popular
techniques. The main difference between the data analysis routines described
here and the built-in functions is that these produce multiple cell output that can
be displayed on a different worksheet if required. If this is not required then you
need to make sure that your existing sheet has sufficient free cells to the right and
below the selected output cell to avoid overwriting existing spreadsheet data.
Exhibit 7: Add in Data Analysis Options available through the ToolPak
1. Anova: Single-Factor
2. Anova: Two-Factor with Replication
3. Anova: Two-Factor without Replication
4. Correlation
5. Covariance
6. Descriptive Statistics
7. Exponential Smoothing
8. F-Test: Two-Sample for Variance
9. Fourier Series
10. Histogram
11. Moving Average
12. Random Number Generation
13. Rank and Percentile
14. Regression
15. Sampling
16. t-Test: Paired Two-Sample for Means
17. t-Test: Two-Sample assuming Equal Variances
18. t-Test: Two-Sample assuming Unequal Variances
19. z-Test: Two-Sample for Means
The approach and outcome of using the Descriptive Statistics procedure are
displayed in Exhibit 8 and Exhibit 9 respectively. As illustrated in Exhibit 8, the
input panel (Wizard) that is opened when Tools → Data Analysis → Descriptive
Statistics is selected allows you to select an input range such as A1:B21 in this
case and indicate that the variables are stored in column format. Furthermore,
ticking the available box allows the procedure to recognise that variable names
are included in the first row of the selected array. Finally, the displayed procedure
shows that the output is required to be placed in cell D1 (or at least that this cell
will be the upper left most cell of the output range) and that summary statistics
and the 95% confidence level for the means are required.
Exhibit 9 displays the outcome once OK is selected. For each of the two variables,
the procedure gives the mean, standard error of the mean, median, mode,
standard deviation, sample variance, kurtosis, skewness, range, minimum,
maximum, sum, count and 95% confidence Level.
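Most of the quantities in this output can be reproduced with a few lines of Python, which may help when checking results by hand. The sketch below is an approximation of the procedure rather than a copy of it: the function name describe is an assumption, the mode and the 95% confidence level are left out, the skewness and kurtosis use the small-sample adjusted formulas that SKEW and KURT are understood to apply, and the five data values are illustrative only.

```python
import math
from statistics import mean, median, stdev, variance


def describe(values):
    """Summary statistics for one variable; needs at least four values."""
    n = len(values)
    m, s = mean(values), stdev(values)
    z = [(v - m) / s for v in values]  # standardised observations
    skewness = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurtosis = (
        n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )
    return {
        "Mean": m,
        "Standard Error": s / math.sqrt(n),
        "Median": median(values),
        "Standard Deviation": s,
        "Sample Variance": variance(values),
        "Kurtosis": kurtosis,
        "Skewness": skewness,
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": n,
    }


summary = describe([1, 2, 3, 4, 5])
for name, value in summary.items():
    print(f"{name}: {value}")
```

Each entry corresponds to one row of the output table, so the same built-in functions listed in Exhibit 5 can be used to confirm any individual figure.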
Exhibit 8: Illustration of the Descriptive Statistics procedure capture screen
Exhibit 9: Illustration of the output from the Descriptive Statistics procedure
As a final illustration of the statistical procedures available through the add-in
Analysis ToolPak, Exhibits 10 and 11 present the approach and outcome of the
Regression procedure. As illustrated in Exhibit 10, the input panel (Wizard) that is
opened when Tools → Data Analysis → Regression is selected allows you to
select a range for the dependent or Y-variable such as A1:A21 in this case and a
range for the independent or X-variable or variables such as B1:D21 in this case.
Furthermore, ticking the Labels box allows the procedure to recognise that
variable names are included in the first row of the selected arrays. Finally, the
displayed procedure shows that the output is required to be placed in cell F1 (or at
least that this cell will be the upper left most cell of the output range). Although not
selected for this illustration, the Regression procedure allows you to request
various graphical plots that may be of interest.
Exhibit 10: Illustration of the Regression procedure capture screen
Exhibit 11 displays the outcome once OK is selected. The regression statistics include
the multiple-R, the R-square, the adjusted R-square, the standard error of the regression
and the number of observations. Additionally the regression ANOVA table is
produced and it provides the regression F-statistic and its associated probability.
The last table provided as output gives the estimated intercept and slope
coefficients for each independent variable as well as their associated standard
errors, t-statistics and probabilities. Finally, the 95% confidence limits for the
intercept and coefficients are presented (together with an additional confidence
limit if this has been selected as part of the regression requirement).
Exhibit 11: Illustration of the output from the Regression procedure
Microsoft product screen shots reprinted with permission from Microsoft Corporation.