Ch07.PowerPoint

advertisement
Spreadsheet-Based Decision Support Systems
Chapter 7: Statistical Analysis with Excel
Prof. Name
Position
University Name
name@email.com
(123) 456-7890
Overview





7.1 Introduction
7.2 Understanding Data
7.3 Relationships in Data
7.4 Distributions
7.5 Summary
2
Introduction

Performing basic statistical analysis of data using Excel functions

Statistical features of the Data Analysis ToolPak

Trend curves for analyzing data patterns

Basic linear regression techniques in Excel

Several different distribution functions in Excel
3
Understanding Data

Statistical Functions

Descriptive Statistics

Histograms
4
Statistical Functions

AVERAGE
– Finds the mean of a set of data.
– =AVERAGE(range or range_name)

MEDIAN
– Finds the middle number in a list of sorted data.
– =MEDIAN(range or range_name)

STDEV
– Finds the standard deviation of a set of data.
– This is equal to the square root of the variance, which measures the
difference between the mean of the data set and the individual values.
=STDEV.P(range or range_name)
=STDEV.S(range or range_name)
5
Figures 7.1 and 7.2
6
Figures 7.3 and 7.4
7
Data Analysis ToolPak

An Excel Add-In which includes several statistical analysis techniques

To ensure that it is an active Add-in:
– Display Excel Options dialog box:
 Select Options from the list of options in the File tab.
– Select the Add-Ins tab on the left side of the dialog box.
– Select Analysis ToolPak listed on the Add-ins window.
8
Descriptive Statistics

Provides a list of statistical information about your data set including
–
–
–
–
Mean
Median
Standard deviation
Variance

Click on Data > Analysis > Data Analysis command to display the Data
Analysis dialog box.

Choose the Descriptive Statistics option and click OK.
9
Descriptive Statistics (cont’d)

The Input Range refers to the
location of the data set.

Check the option button Columns or
Rows to indicate how your data is
grouped.

If there are labels in the first row of
each column of data, then check the
Labels in First Row box.

The Output Range refers to where
the results of the analysis will be
displayed in the current worksheet.

Check the Summary Statistics box
to calculate the most commonly used
statistics.
10
Figure 7.7

Quarterly stock returns for three
different companies are recorded. We
want to know
– Average stock return
– Variability of stock returns
– Which quarters had the highest and
lowest stock returns
11
Figures 7.8 and 7.9
12
Figure 7.10

Almost all of the data points lie between +2s and –2s from the mean.

Outliers are data that are inconsistent with the main pattern of data.
13
Figure 7.11

The standard deviation is used to identify outliers in a data set.
14
Figure 7.12

Conditional Formatting with the Formula Is option is used to identify
outliers.
– Select the column of values in the data set; and fill in the Conditional
Formatting dialog box to highlight outlier points.
15
Figure 7.13

The cell that holds an outlier is highlighted.
16
More Descriptive Statistics

Confidence Level for Mean
– The mean is calculated using the specified confidence level (for example, 95% or
99%), the standard deviation, and the size of the sample data.
– The confidence level and calculated mean are then added to the analysis report.
– You can compare the actual mean to this calculated mean based on the specified
confidence level.

Kth Largest
– Gives the largest ranked data value for a specified value of k.
– For k = 1, the maximum data value would be returned.

Kth Smallest
– Gives the smallest ranked data value for a specified value of k.
– For k = 1, the minimum data value would be returned.
17
Descriptive Statistics Functions

PERCENTILE.INC
– Returns a value for which a desired percentile k of the specified data_set falls
below.
= PERCENTILE.INC(data_set, k)

For example, for the MSFT data, the value for which 95% of the data falls
below is
=PERCENTILE.INC(B4:B27,0.95) = 0.108

PERCENTILE.EXC
– Excludes the value of k-th percentile from the calculations
= PERCENTILE.EXC (data_set, k)

For the MSFT data, the value for which 95% of the data falls below is
=PERCENTILE.EXC(B4:B27,0.95) = 0.135
18
Descriptive Statistics Functions (cont’d)

PERCENTRANK.INC
– Returns the percentile of the data_set which falls below a given value.
=PERCENTRANK.INC(data_set, value)

For example, percent of the MSFT data falls below the value 0.108,
inclusive of 0.108 is
=PERCENTRANK.INC(B4:B27, 0.108) = 0.95, or 95%

PERCENTRANK.EXC
– Calculates the same percentile, exclusive of the value of k.
=PERCENTRANK.EXC(data_set, value)

For example, percent of the MSFT data falls below the value 0.135,
exclusive of 0.135 is
=PERCENTRANK.EXC(B4:B27, 0.135) = 0.95, or 95%
19
Histograms

Histograms calculate the number of occurrences, or frequency, with
which values in a data set fall into various intervals.

Choose the Histogram option from the Analysis ToolPak list.
20
Histograms (cont’d)

The Input Range is the range of the data set.

The Bin Range is used to specify the location of the bin values.
– Bins are the intervals into which values can fall; they can be defined by the
user or can be evenly distributed among the data by Excel.

The Output Range is the location of the output, or the frequency
calculations for each bin.

The chart options include a simple Chart Output (the actual
histogram), Cumulative Percentage for each bin value, and a Pareto
organization of the chart.
21
Figures 7.15 and 7.16
22
Figures 7.17 and 7.18

To create your own bin values, make a list of upper bounds for each
interval.
23
Figure 7.19
24
Histograms (cont’d)

To change the format of a Histogram:
– Click on the histogram to activate the Chart Tools contextual tabs.
– Use the commands listed on these tabs to change the design, layout and
format of the histogram.
25
Histograms (cont’d)

There are four basic shapes to a histogram:
– Symmetric: has peaks and dips with equal amplitude
 A curve with only one peak is also symmetric; that is, there is a central high part
and almost equal lower parts to the left and right of this peak.
– Positively skewed: has a peak on the left and many lower points (stretching)
to the right.
– Negatively skewed: has a peak on the right and many lower points
(stretching) to the left.
– Multiple peaks: imply that more than one source, or population, of data is
being evaluated.
26
Relationships in Data

Trend Curves

Regression
27
Data Relationships

Relationships in data are usually identified by comparing two variables:
the dependent variable and the independent variable.
– The dependent variable is the variable we are most interested in. By
understanding its current behavior we can better predict its future behavior.
– The independent variable is the variable we use as the comparison in order
to make this prediction.
28
Trend Curves

Trend curves are used to graph and analyze these relationships between
data.

Trend curves graph the data with
– The independent variable on the x-axis
– The dependent variable on the y-axis

To add a trend curve to your chart:
– Click on the data points in an XY Scatter chart to activate Chart Tools
contextual tabs.
– Click on the Chart Tools Layout > Analysis > Trendline command.
– Select a trend curves from the trendlines options listed.
29
Trend Curves (cont’d)

There are six types of trend
curves which Excel can model:
–
–
–
–
–
–
Exponential
Linear
Logarithmic
Polynomial
Power
Moving Average
30
Trend Curves (cont’d)

Double click on a trendline to
activate the Format
Trendline dialog box.

We can modify:
– The type of the trendline by
selecting one of the options
listed.
– The trendline’s name.

We can specify a period
forward or backward for
which we want to predict the
behavior of our dependent
variable.
31
Linear Trend Curves

Number of Units Produced each month
and the corresponding Monthly Plant Cost
are recorded.

The company needs to estimate plant
costs based on the planned production
amounts.

The dependent variable is therefore the
Monthly Plant Cost and the independent
variable is the Units Produced.
32
Figure 7.25

Begin this analysis by making an XY Scatter chart of the data.
33
Figure 7.26

Right-click on any of the data
points and choose Add
Trendline from the short-cut
menu.

The Format Trendline dialog
box appears.

Select Linear from the Types
listed.

Select Display Equation on
Chart checkbox.
34
Figure 7.27

The trendline and the equation are then added to the chart.
35
Figure 7.28

Use the displayed equation to predict future values.

Check the accuracy of the equation by calculating the error from the
known data.

Linear trends have the relationship: y = a*x - b
36
Figure 7.29

Copy the formula for “Predicted Cost” to the rest of the rows to calculate
the predicted monthly costs.
37
Exponential Trend Curves

Sales data for ten years is recorded.

We want to predict sales for the next few years.

The independent variable is Years and our dependent variable is
Sales.
38
Figure 7.31

Exponential trends have
the following relationship:
– y = a*e^(b*x) or
– y = a*EXP(b*x)

Build a XY Scatter chart
of the data.

Right-click on a data
point to add the trendline.

Choose the Exponential
curve to fit the data.
39
Figures 7.32 and 7.33
40
Figure 7.34

We use the formula to predict sales values for future years.

However, the Exponential trend curve has a sharply increasing slope
that may not be accurate for many situations.
41
Power Trend Curves

We are given yearly Production values and yearly Unit Cost for
production.

We want to determine the relationship between Unit Cost and Production
in order to predict future Unit Costs.
42
Figure 7.36

Power trends have the relationship:
y = a*x^b

Begin by creating the XY Scatter
chart.

Right-click on a data point to add a
trendline.

Choose a Power curve to fit the
data.
43
Figures 7.37 and 7.38
44
Regression Analysis

We can use some regression analysis parameters to ensure that the
relationships we have chosen for our data are “good” fits.

These parameters include
–
–
–
–
R-Squared value
Standard error
Slope
Intercept
45
R-Squared Value

The R-Squared value measures the amount of influence the
independent variable has on the dependent variable.

The closer the R-Squared value is to 1, the stronger the relationship is
between the independent and dependent variables.

If the R-Squared value is closer to 0, then there may not be a relationship
between these two variables.
46
Figure 7.39

We fit a Linear trendline to the Monthly Plant Cost per Units Produced
chart.

The R-Squared value is 0.8137, which is fairly close to 1, implying a
good fit.
47
Figure 7.40

We fit an Exponential trendline to the Sales per year chart.

The R-Squared value is 0.9828, which is fairly close to 1, implying a
sound fit.
48
Figure 7.41

We fit a Power trendline to the Unit Cost per Cumulative Production
chart.

The R-Squared value is 0.9062, which is fairly close to 1, implying a
good fit.
49
Figure 7.42

The RSQ Excel function can
calculate the R-squared value from a
set of data.
– =RSQ(y_range, x_range)

Note that this function only works
with Linear trend curves.
50
Standard Error

The standard error measures the
accuracy of any predictions made.

It can be calculated in Excel using the
STEYX function
=STEYX(y_range, x_range)

This function can also only be used for
Linear trend curves.
51
Slope and Intercept

Two Excel functions can be used with a linear regression line of a
collection of data.

SLOPE function
=SLOPE(y_range, x_range)

INTERCEPT function
=INTERCEPT(y_range, x_range)
52
Distributions

Many distributions have Excel functions associated with them.
– These functions are equivalent to using distribution tables.
– That is, given certain parameters of a set of data for a particular
distribution, you would look at a distribution table to find the corresponding
area from the distribution curve.

Some common distributions are
–
–
–
–
–
–
–
Normal
Exponential
Uniform
Binomial
Poisson
Beta
Weibull
53
Normal Distribution

The parameters for this distribution are simply the value we are
interested in finding the probability for, and the mean and standard
deviation of the set of data.

The function we use with the Normal distribution is NORM.DIST
=NORM.DIST(x, mean, std_dev, cumulative)
54
Normal Distribution (cont’d)

The cumulative parameter will be seen in many Excel distribution
functions.

This parameter can take the values True or False to determine if you
want the value returned from the cumulative distribution function or
the probability density function, respectively.
– The cumulative distribution function (cdf) will find the probability that a value
in the data set is less than or equal to x.
– The probability density function (pdf) will find the probability that a value is
exactly equal to x.
55
Figure 7.45

Annual drug sales at a local drugstore are Normally distributed with a
mean of $40,000 and standard deviation of $10,000.

The probability that the actual sales for the year are $42,000 is 0.58, or
58%.
56
Figure 7.46

What is the probability that annual sales will be between $35,000 and
$49,000?

To find this value, we will subtract the cdf values for these two bounds.
– =NORM.DIST(49000, 40000, 10000, True) –
NORM.DIST(35000, 40000, 10000, True)

This will return a 0.51 probability, or 51% chance.
57
Standard Normal Distribution

The Standard Normal distribution function is a Normal distribution
function with mean 0 and the standard deviation 1.

The STANDARDIZE function will convert the x value from a data set with
mean not equal to 0 and a standard deviation not equal to 1 into a value
which does assume a mean of 0 and a standard deviation of 1.
=STANDARDIZE(x, mean, std_dev)

The resulting standardized value is then used as the main parameter in
the NORM.S.DIST function
=NORM.S.DIST(standardized_x, cumulative)
58
Figure 7.47

Consider the same example used previously to find the probability that a
drugstore’s annual sales are $42,000.
59
Uniform Distribution

The Uniform distribution does not actually have a corresponding Excel
function.

A simple formula can also be used to model this discrete distribution.
– 1 / (b – a)
Given that a value x is Uniformly distributed between a and b, we can
use this formula to determine the probability that x will take an integer
value in this interval.
60
Figure 7.48

Consider any values for a and b, then use the formula to calculate the
Uniform value.
61
Poisson Distribution

The Poisson distribution has only one parameter, the distribution mean.

The function we use for this distribution is POISSON.DIST
=POISSON.DIST(x, mean, cumulative)

The value returned by the Poisson distribution is the probability that the
number events which occur within a time interval is either between 0 and
x (cdf), or equal to x (pdf).
62
Figure 7.49

For example, consider a bakery which serves an average of 20
customers per hour.

Find the probability that at most 35 customers will be served in the next
two hours.
63
Exponential Distribution

The Exponential distribution has only one parameter: lambda = 1 / mean
of the data set.

The function we use for this distribution is EXPON.DIST
=EXPON.DIST(x, lambda, cumulative)

The Exponential distribution is commonly used for modeling interarrival
times.
64
Figure 7.50

Let us use the same example with the bakery data.

Arrival rate is said to be 20 customers per hour.

Interarrival mean, or the Exponential mean, is 1 / arrival rate. Therefore,
for this example, the interarrival mean is 1/20 hours per customer arrival.

To find the probability that a customer arrives in 10 minutes, we would
set
– x = 10/60 = 0.17 hours
– lambda = 1/(1/20) = 20 hours
– =EXPON.DIST(0.17, 20, True)
65
Binomial Distribution

The Binomial distribution has the following parameters: the number of
trials and the probability of a success.

We are trying to determine the probability that the number of successes
is less than or equal to (using cdf) or equal to (pdf) some x value.

The function we use for this distribution is BINOM.DIST
=BINOM.DIST(x, trials, prob_success, cumulative)
66
Figure 7.51

Suppose a survey shows that 40 percent of people pay more attention to
ads in the newspaper, and 60 percent pays more attention to ads on
television.

What is the probability that out of 100 people surveyed, 50 of them
respond more to ads on television?
67
Beta Distribution

The Beta distribution has the following parameters: alpha, beta, A, and B.
– Alpha and beta are determined from the data set
– A and B are optional bounds on the x value for which you want the Beta
distribution value

The function we use for this distribution is BETA.DIST
=BETA.DIST(x, alpha, beta, A, B)

If A and B are omitted, then a standard cumulative distribution is assumed
and they are given the values 0 and 1, respectively.
68
Figure 7.52

Determine the probability that a team can complete a project in 10 days.
Estimated total time needed is1 to 2 weeks; these estimates will be the
bound values, or the A and B parameters.

Use a mean and standard deviation of 12 and 3 days to compute the
alpha and beta parameters.
69
Weibull Distribution

The Weibull distribution has two parameters: alpha and beta.

The function we use for this distribution is WEIBULL.DIST
=WEIBULL.DIST(x, alpha, beta, cumulative)

The Weibull distribution is most commonly used to model reliability
functions.
70
Figure 7.53

On average, a lightbulb will last 1,200 hours, with a standard deviation of
100 hours. We can use these values to calculate alpha and beta.

We can now use the WEIBULL distribution to determine the probability
that a lightbulb will be reliable for 55 days = 1,320 hours.
71
Inverse Functions

When we build simulation models, we need to generate random numbers
in Excel which are within a given distribution.

To accomplish this we must use the inverse function of the
corresponding distribution function.

An inverse function returns the inverse of the cumulative probability
function.

These functions are listed under the Formulas > Function Library >
More Functions drop-down menu on the Ribbon.

Some of the inverse functions of more common distributions are
BETA.INV, BINOM.INV, LOGNORM.INV, NORM.INV.
72
Figure 7.54

The format of the inverse functions is
=DIST.INV(probability, distribution_parameters)

The probability parameter is a number between 0 and 1 associated with
the given distribution.

We will use the RAND function as the value for this parameter to
generate a number between 0 and 1.
73
Summary







The Analysis ToolPak is an Excel Add-In that includes statistical analysis
techniques such as Descriptive Statistics, Histograms, Exponential
Smoothing, Correlation, Covariance, Moving Average, and others.
The Descriptive Statistics option provides a list of statistical information
about a data set, including the mean, median, standard deviation, and
variance.
Histograms calculate the number of occurrences, or frequency, which
values in a data set fall into various intervals.
Relationships in data are usually identified by comparing the dependent
variable and the independent variable.
There are six basic trend curves that Excel can model: Exponential,
Linear, Logarithmic, Polynomial, Power, and Moving Average.
Some of the more common distributions that can be recognized when
performing a statistical analysis of data are the Normal, Exponential,
Uniform, Binomial, Poisson, Beta, and Weibull distributions.
Inverse distribution functions such as BETA.INV, BIONOM.INV,
LOGNORM.INV and NORM.INV are used in simulation models to
generate random numbers from a specific distribution.
74
Additional Links

(place links here)
75
Download