Regression Analysis

Productivity of a Clerical Team
Problem Definition - Purpose of this assignment
Data on various jobs performed by a team of clerks was collected for the purpose of analyzing
the relations between the volumes of each job and the level of productivity of the clerical team.
With a better understanding of these relationships, we should be able to estimate productivity for
specific circumstances.
Objective
The ultimate objective is to create a spreadsheet model that estimates productivity from input
values for the various jobs performed by the team. To do so, we will identify the dependent and
independent variables, examine their relationships with productivity, and estimate a productivity
equation using Excel/Data analysis/Regression and StatPro’s Stepwise Regression Analysis. Forward
and backward regression analyses will also be performed to further evaluate the independent
variables.
Variables
Dependent variable:
Number of hours in productive work performed by a clerical team in one day (PROD)
Possible independent or explanatory variables:
1. Number of pieces of mail opened, sorted and distributed (MAIL)
2. Number of times a clerk assisted another clerk (ASSIST)
3. Number of incoming calls answered or forwarded (PHONE)
4. Number of people showing up in the work area to request assistance (PERSON)
5. Number of sales orders processed (ORDER). These orders are generally processed by phone but are not counted
towards the PHONE variable.
6. Number of completed schedules such as meetings and/or tickets (SCHEDULE)
7. Number of completed photocopying orders (DUPLICATE). The orders vary in size.
Data set
Fifty-two days’ worth of data were compiled and stored.
View data set in “Data on Clerical Team Productivity”
Visualization of Relationships between variables
Below are graphical representations of the data series formed by pairing each explanatory variable
(X) with the dependent variable, productivity level (Y).
Through the use of “Add trend-line” in Excel, we can visualize the relationships. The respective
R-squares help us gauge the degree of fit between the trend line (the least-squares line) and the
series of actual observations. Five types of fitting equations (linear, logarithmic, polynomial,
power, and exponential) were compared for each series and the one with the highest R-square was
plotted. (View results in “Data on Clerical Team Productivity” under the respective tabs PRODMAIL,
PRODASSIST, PRODPHONE, PRODPERSON, PRODORDER, PRODSCHEDULE, PRODDUPLICATE)
All the series turned out to be highly non-linear. Among the five equations tested, the polynomial
and the power functions seem to fit best; however, the R-square was never very high.
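The comparison of trend lines by R-square can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the observations below are hypothetical (not taken from the workbook), only two of the five trend-line types are compared, and the helper name fit_line is invented for this sketch.

```python
import math

def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, returned with its R-square."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical observations (assumption: made up for illustration only).
volume = [1, 2, 4, 8, 16]
hours = [0.0, 1.0, 2.0, 3.0, 4.0]

# Linear trend line: regress Y on X.
_, _, r2_linear = fit_line(volume, hours)
# Logarithmic trend line: regress Y on ln(X).
_, _, r2_log = fit_line([math.log(v) for v in volume], hours)

# Keep the trend-line type with the highest R-square.
best = max([("linear", r2_linear), ("logarithmic", r2_log)], key=lambda t: t[1])
print(best[0], round(best[1], 4))
```

The same helper could be reused for the other trend-line types by transforming X, Y, or both before fitting.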
Correlation Analysis
            PROD          MAIL          ASSIST        PHONE         PERSON        ORDER         SCHEDULE      DUPLICATE
PROD        1
MAIL        -0.007650103  1
ASSIST      0.292819235   0.011282017   1
PHONE       0.461519082   0.054803588   0.24521511    1
PERSON      0.084798217   -0.043117518  0.036861483   0.47780716    1
ORDER       0.58731901    -0.276585736  -0.015889717  0.508993671   0.442805163   1
SCHEDULE    0.499012658   -0.015940412  0.338924407   0.34892016    0.167351761   0.382271946   1
DUPLICATE   0.449594128   -0.311766861  0.12226462    0.508788538   0.275074969   0.566073265   0.297154731   1
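The results of the correlation analysis show that no single explanatory variable can explain the
level of productivity of the team on its own: the largest correlation with PROD is only 0.59. The
three variables with the highest correlation with productivity are (in descending order) ORDER
(0.59), SCHEDULE (0.50), and PHONE (0.46), closely followed by DUPLICATE (0.45).
The explanatory variables are at most moderately correlated with one another (the largest pairwise
correlation is 0.57, between ORDER and DUPLICATE), so there is no sign of serious
multicollinearity and we do not need to worry about redundancy.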
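As a rough check of this reading, the correlations can be ranked programmatically. The sketch below assumes that the first column of values in the correlation output is the PROD column and that the remaining values form the lower triangle among the explanatory variables; the 0.8 screening threshold is a common rule of thumb chosen for illustration, not something stated in the study.

```python
# Correlation of each explanatory variable with PROD (values from the correlation output).
corr_with_prod = {
    "MAIL": -0.007650103, "ASSIST": 0.292819235, "PHONE": 0.461519082,
    "PERSON": 0.084798217, "ORDER": 0.58731901, "SCHEDULE": 0.499012658,
    "DUPLICATE": 0.449594128,
}

# Pairwise correlations among the explanatory variables (lower triangle).
pairs = {
    ("MAIL", "ASSIST"): 0.011282017, ("MAIL", "PHONE"): 0.054803588,
    ("ASSIST", "PHONE"): 0.24521511, ("MAIL", "PERSON"): -0.043117518,
    ("ASSIST", "PERSON"): 0.036861483, ("PHONE", "PERSON"): 0.47780716,
    ("MAIL", "ORDER"): -0.276585736, ("ASSIST", "ORDER"): -0.015889717,
    ("PHONE", "ORDER"): 0.508993671, ("PERSON", "ORDER"): 0.442805163,
    ("MAIL", "SCHEDULE"): -0.015940412, ("ASSIST", "SCHEDULE"): 0.338924407,
    ("PHONE", "SCHEDULE"): 0.34892016, ("PERSON", "SCHEDULE"): 0.167351761,
    ("ORDER", "SCHEDULE"): 0.382271946, ("MAIL", "DUPLICATE"): -0.311766861,
    ("ASSIST", "DUPLICATE"): 0.12226462, ("PHONE", "DUPLICATE"): 0.508788538,
    ("PERSON", "DUPLICATE"): 0.275074969, ("ORDER", "DUPLICATE"): 0.566073265,
    ("SCHEDULE", "DUPLICATE"): 0.297154731,
}

# Rank explanatory variables by the strength of their correlation with PROD.
ranked = sorted(corr_with_prod, key=lambda name: abs(corr_with_prod[name]), reverse=True)

# Screen for multicollinearity. THRESHOLD is a rule-of-thumb assumption.
THRESHOLD = 0.8
strongest_pair = max(pairs, key=lambda p: abs(pairs[p]))
flagged = [p for p, r in pairs.items() if abs(r) > THRESHOLD]

print(ranked[:3], strongest_pair, flagged)
```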
Regression Analysis
• Using Excel/Tools/Data analysis/Regression
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.75390031
R Square             0.56836568
Adjusted R Square    0.49969658
Standard Error       10.9901846
Observations         52
ANOVA
             df    SS           MS           F            Significance F
Regression   7     6998.00941   999.71563    8.27687717   2.0527E-06
Residual     44    5314.5029    120.784157
Total        51    12312.5123

             Coefficients   Standard Error   t Stat        P-value      Lower 95%     Upper 95%
Intercept    60.553792      9.49521302       6.37729684    9.4015E-08   41.4174483    79.6901357
MAIL         0.00134964     0.00091684       1.47204684    0.14812538   -0.00049814   0.00319741
ASSIST       0.08727154     0.04825609       1.8085082     0.07736449   -0.00998222   0.1845253
PHONE        0.00868785     0.00916813       0.947615      0.34850113   -0.00978929   0.027165
PERSON       -0.04277815    0.01734491       -2.46632309   0.01761786   -0.07773451   -0.00782178
ORDER        0.04679021     0.0119808        3.90543428    0.00031977   0.0226445     0.07093592
SCHEDULE     0.20921295     0.13022364       1.60656661    0.11530325   -0.05323554   0.47166144
DUPLICATE    0.0048192      0.0055105        0.87454912    0.38656806   -0.00628648   0.01592488
P-values greater than 5% (shown in red in the original output) mark the variables that are least
significant in explaining the team’s productivity level: MAIL, ASSIST, PHONE, SCHEDULE, and
DUPLICATE.
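The link between the coefficient table and the significance reading can be checked by hand: each t Stat is the coefficient divided by its standard error, and a p-value below 5% corresponds to an absolute t Stat above the two-sided critical value. The sketch below is illustrative only; the critical value of roughly 2.015 for 44 residual degrees of freedom is an approximation supplied for this example, not a figure from the Excel output.

```python
# (coefficient, standard error) pairs from the full-model summary output.
full_model = {
    "Intercept": (60.553792, 9.49521302),
    "MAIL": (0.00134964, 0.00091684),
    "ASSIST": (0.08727154, 0.04825609),
    "PHONE": (0.00868785, 0.00916813),
    "PERSON": (-0.04277815, 0.01734491),
    "ORDER": (0.04679021, 0.0119808),
    "SCHEDULE": (0.20921295, 0.13022364),
    "DUPLICATE": (0.0048192, 0.0055105),
}

# Approximate two-sided 5% critical value of Student's t with 44 degrees
# of freedom (assumption: rounded value used for illustration).
T_CRIT = 2.015

# t Stat = coefficient / standard error.
t_stats = {name: coef / se for name, (coef, se) in full_model.items()}

# Terms whose |t| exceeds the critical value are significant at the 5% level.
significant = [name for name, t in t_stats.items() if abs(t) > T_CRIT]

for name, t in t_stats.items():
    print(f"{name:10s} t = {t:8.4f}  significant: {name in significant}")
```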
The data analysis/regression is rerun until all p-values come out below 5%. Each additional run
drops the explanatory variable that had the highest p-value in the previous run.
The following independent variables were taken out one at a time (in sequence): DUPLICATE,
MAIL, PHONE, SCHEDULE.
The last summary sheet is shown below. The significant independent variables using this method
turn out to be ASSIST, PERSON, and ORDER.
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.69324186
R Square             0.48058428
Adjusted R Square    0.4481208
Standard Error       11.5427759
Observations         52
ANOVA
             df    SS           MS           F            Significance F
Regression   3     5917.19986   1972.39995   14.8038424   5.9103E-07
Residual     48    6395.31245   133.235676
Total        51    12312.5123

             Coefficients   Standard Error   t Stat       P-value      Lower 95%    Upper 95%
Intercept    77.7256395     6.91019862       11.2479603   4.6921E-15   63.8317621   91.6195169
ASSIST       0.13626447     0.04541256       3.00058983   0.00426472   0.04495645   0.22757249
PERSON       -0.034689      0.01714031       -2.0238248   0.04857403   -0.0691518   -0.0002261
ORDER        0.0582678      0.00971385       5.99842319   2.5213E-07   0.0387368    0.0777988
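The summary statistics of this final model follow from its ANOVA figures, which can be verified with the standard formulas R-square = SS(Regression)/SS(Total), Adjusted R-square = 1 - (1 - R-square)(n - 1)/(n - k - 1), and F = MS(Regression)/MS(Residual). The short sketch below recomputes them; the function names are invented for this illustration.

```python
def r_square(ss_regression, ss_total):
    """Share of the total variation explained by the regression."""
    return ss_regression / ss_total

def adjusted_r_square(r2, n_obs, n_vars):
    """R-square penalized for the number of explanatory variables."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_vars - 1)

def f_statistic(ss_regression, ss_residual, n_obs, n_vars):
    """Ratio of the regression mean square to the residual mean square."""
    ms_regression = ss_regression / n_vars
    ms_residual = ss_residual / (n_obs - n_vars - 1)
    return ms_regression / ms_residual

# Figures from the final summary output (ASSIST, PERSON, ORDER; 52 observations).
N_OBS, N_VARS = 52, 3
SS_REG, SS_RES, SS_TOT = 5917.19986, 6395.31245, 12312.5123

r2 = r_square(SS_REG, SS_TOT)
adj_r2 = adjusted_r_square(r2, N_OBS, N_VARS)
f_value = f_statistic(SS_REG, SS_RES, N_OBS, N_VARS)

print(round(r2, 6), round(adj_r2, 6), round(f_value, 4))
```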
• Using StatPro/Regression Analysis/Stepwise
With StatPro’s stepwise regression, the same explanatory variables were selected. This is logical
since the underlying methodology was the same; here, the steps were automated. In step one the
variable ORDER was entered into the equation, in step two ASSIST, and finally in step three
PERSON.
• Using StatPro/Regression Analysis/Backward
View results in “Data on Clerical Team Productivity” under the tab BackwardRegr.
• Using StatPro/Regression Analysis/Forward
View results in “Data on Clerical Team Productivity” under the tab ForwardRegr.
Results of the Regression Analysis & Evaluation of Explanatory Variables
• Results of the regression analysis
The following equation gives an estimate of the clerical team’s level of productivity:
PROD = 77.72564 + 0.1362645*ASSIST - 0.034689*PERSON + 0.0582678*ORDER
The equation implies that the baseline level of productivity (the intercept, when ASSIST, PERSON,
and ORDER are all zero) is 77.72564 hours. Depending on the types of jobs performed on a
particular day, productivity can go up (if clerks assist one another or if orders are processed)
or down (if people come to the work area to request assistance). More specifically, holding the
other variables constant, productivity increases by 0.1362645 hours for each additional unit of
ASSIST and by 0.0582678 hours for each additional unit of ORDER, and it declines by 0.034689
hours each time a person shows up in the work area to request assistance.
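The estimated equation translates directly into a small estimator, which is essentially what the spreadsheet model computes. The sketch below is a minimal illustration: the function name and the sample input values are hypothetical, and only the coefficients come from the regression output.

```python
# Coefficients of the estimated productivity equation.
INTERCEPT = 77.72564
COEFFICIENTS = {"ASSIST": 0.1362645, "PERSON": -0.034689, "ORDER": 0.0582678}

def estimate_prod(assist, person, order):
    """Estimated productive hours for one day, given the three job volumes."""
    return (INTERCEPT
            + COEFFICIENTS["ASSIST"] * assist
            + COEFFICIENTS["PERSON"] * person
            + COEFFICIENTS["ORDER"] * order)

# Hypothetical day (assumption): 20 assists, 100 walk-in requests, 150 orders.
example = estimate_prod(assist=20, person=100, order=150)
print(round(example, 4))
```

Because the R-square of the equation is below 0.5, any single estimate from this function should be read as a rough central value, not a precise forecast.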
However, the corresponding R-square of 0.4805843 is rather low: the equation explains only about
48% of the variation in productivity.
The search for other significant explanatory variables should be pursued.
• Additional Evaluation of the Explanatory Variables
The forward regression analysis gives us some understanding of the gain in explanation contributed
by each explanatory variable. Here is a summary of the amounts and percentages gained in R-square.
ORDER, entered first, accounts for the largest share of the R-square (0.3449); of the variables
entered afterwards, ASSIST gives the largest gain.
          Variable entered   R-square   Change   % Change
Step 1    ORDER              0.3449
Step 2    ASSIST             0.4363     0.0913   26.5%
Step 3    PERSON             0.4806     0.0443   10.2%
The backward regression analysis gives us some understanding of the loss in explanation
resulting from the elimination of an explanatory variable. Here is a summary of the amounts and
percentages lost in R-square. The elimination of the variable SCHEDULE results in the largest
loss.
          Variable leaving     R-square   Change    % Change
Step 1    (none, full model)   0.5684
Step 2    DUPLICATE            0.5609     -0.0075   -1.3%
Step 3    MAIL                 0.5450     -0.0159   -2.8%
Step 4    PHONE                0.5182     -0.0268   -4.9%
Step 5    SCHEDULE             0.4806     -0.0376   -7.3%
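The Change and % Change columns of the forward and backward summaries can be recomputed from the R-square reported at each step. The sketch below assumes that % Change is the change divided by the previous step's R-square, which is consistent with the reported figures; the function name is invented for this illustration.

```python
def r2_changes(r2_by_step):
    """Return (change, percent change) between consecutive R-square values."""
    rows = []
    for previous, current in zip(r2_by_step, r2_by_step[1:]):
        change = current - previous
        rows.append((round(change, 4), round(100 * change / previous, 1)))
    return rows

# R-square after each step, as reported in the two summaries.
forward_r2 = [0.3449, 0.4363, 0.4806]                   # ORDER, +ASSIST, +PERSON
backward_r2 = [0.5684, 0.5609, 0.5450, 0.5182, 0.4806]  # full model, then removals

forward = r2_changes(forward_r2)
backward = r2_changes(backward_r2)

# The largest backward loss identifies the most costly elimination.
largest_loss_step = min(range(len(backward)), key=lambda i: backward[i][0])

print(forward)
print(backward)
```

Small differences from the reported figures (for example 0.0914 here versus 0.0913 in the forward summary) arise because the inputs are already rounded to four decimals.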
The Clerical Team Productivity Spreadsheet Model
Look up link in Index page under “Clerical Team Productivity Spreadsheet Model”
Other Modeling Possibilities – Nonlinear Transformations
Given the low R-squares obtained, it would be interesting to transform the original values of the
observation set, replacing one or several explanatory variables by their fitted values on the
best-fitting trend line found earlier in this study.
For example, we could replace the original observation values for ORDER by their fitted values on
the polynomial equation 0.1017X^2 - 16848X + 1146.7 and rerun the regression analysis to see
whether this improves our estimated productivity equation (higher R-square).
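The idea of replacing a variable by its fitted trend-line values can be illustrated on a small scale. Everything in the sketch below is an assumption made for illustration: the observations are invented, the quadratic (x - 5)^2 is a hypothetical trend line and not the polynomial quoted above, and a single-variable R-square (the squared Pearson correlation) stands in for the full regression rerun.

```python
def r_square(x, y):
    """R-square of a simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)

def trend(x):
    """Hypothetical best-fitting polynomial trend line (assumption)."""
    return (x - 5) ** 2

# Invented observations with a clearly curved relationship.
x_raw = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y_obs = [16.5, 8.7, 4.2, 0.9, 0.3, 1.2, 3.8, 9.4, 15.6]

# Replace the original X values by their fitted values on the trend line.
x_transformed = [trend(x) for x in x_raw]

r2_raw = r_square(x_raw, y_obs)
r2_transformed = r_square(x_transformed, y_obs)
print(round(r2_raw, 4), round(r2_transformed, 4))
```

In this constructed case the transformed variable explains far more of the variation than the raw one; whether the same holds for the clerical data can only be established by rerunning the regression on the actual observations.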