Introduction to StatTools and NeuralTools

Presented by:
Thompson Terry
Palisade Corporation
In this presentation we will…
• Introduce StatTools and explore several statistical analyses
• Investigate Neural Networks and their role in Predictive Modeling
• Explore NeuralTools
• Analyze predictive models
StatTools Functionality
Statistical Inference
• Sample Size Selection
• Confidence Interval Analysis
• Hypothesis Tests
• ANOVA
• Chi-square Independence Test
• Runs Test for Randomness
Forecasting
• Moving Averages
• Exponential Smoothing
• Seasonality
Classification Analysis
• Discriminant Analysis
• Logistic Regression
Data Management
• Categorical Data
• Stacked and Unstacked data types
• Variable Transformations
• Random Sample Generation
• Analysis across multiple datasets and worksheets
Summary Analyses
• One-Variable Summary
• Correlation/Covariance
• Autocorrelation
• Histogram
• Scatterplot
• Time Series
• Boxplot
Tests for Normality
• Chi-square Test
• Lilliefors Test
• Q-Q Normal Plot
Regression Analysis
• Simple
• Stepwise
Quality Control Charts
• X-Bar, R, P, C, U, Pareto Charts
StatTools Functionality
We will be exploring the following StatTools
functions and features:
– Summary Statistics
– Summary Graphs
– Statistical inference
– Data management
– Regression Analysis
– Time Series & Forecasting
Data Set Manager
Summary Statistics One Variable Analysis
01 - ExamScores1.xls
StatTools
(Core Analysis Pack)
Analysis: One Variable Summary
Performed By:
Date: Monday, October 22, 2007
Updating: Live
One Variable Summary     Score
                         Exam Scores
Mean                     67.54
Variance                 323.22
Std. Dev.                17.98
Skewness                 -0.4571
Kurtosis                 2.4521
Median                   70.00
Mean Abs. Dev.           14.88
Minimum                  24.00
Maximum                  99.00
Range                    75.00
Count                    212
Sum                      14319.00
1st Quartile             55.00
3rd Quartile             82.00
Interquartile Range      27.00
1.00%                    26.00
2.50%                    28.00
5.00%                    33.00
10.00%                   42.00
20.00%                   52.00
80.00%                   84.00
90.00%                   89.00
95.00%                   94.00
97.50%                   95.00
99.00%                   97.00
Summary Statistics for Stacked
Data with Categorical Variable
02 – ExamScores2.xls
StatTools
(Core Analysis Pack)
Analysis: One Variable Summary
Performed By:
Date: Monday, October 22, 2007
Updating: Live
One Variable Summary     Score (Female)   Score (Male)
                         Data Set #1      Data Set #1
Mean                     68.21            66.98
Std. Dev.                16.94            18.87
Median                   71.00            70.00
Minimum                  24.00            24.00
Maximum                  98.00            99.00
Count                    97               115
1st Quartile             58.00            54.00
3rd Quartile             80.00            83.00
Correlation and Covariance
03 – StockReturns.xls
StatTools
(Core Analysis Pack)
Analysis: Correlation and Covariance
Performed By:
Date: Monday, October 22, 2007
Updating: Live
Correlation Table (data set: Stocks)
        AXP       FDX       GM        IBM       MCD       MSFT
AXP     1.000
FDX     0.400     1.000
GM      0.483     0.250     1.000
IBM     0.358     0.292     0.259     1.000
MCD     0.354     0.265     0.318     0.397     1.000
MSFT    0.361     0.119     0.303     0.313     0.289     1.000

Covariance Table (data set: Stocks)
        AXP       FDX       GM        IBM       MCD       MSFT
AXP     0.00658
FDX     0.00369   0.01295
GM      0.00319   0.00231   0.00661
IBM     0.00270   0.00309   0.00196   0.00866
MCD     0.00205   0.00216   0.00185   0.00264   0.00512
MSFT    0.00325   0.00151   0.00273   0.00323   0.00229   0.01229
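The two tables are linked: each correlation is the covariance divided by the product of the two standard deviations. A minimal sketch, taking the AXP and FDX variances (0.00658 and 0.01295) and their covariance (0.00369) from the output above; the function name is illustrative and not part of StatTools:

```python
import math

# Values copied from the covariance output above (AXP and FDX).
var_axp = 0.00658
var_fdx = 0.01295
cov_axp_fdx = 0.00369


def correlation(cov_xy: float, var_x: float, var_y: float) -> float:
    """Pearson correlation from a covariance and the two variances."""
    return cov_xy / math.sqrt(var_x * var_y)


r = correlation(cov_axp_fdx, var_axp, var_fdx)
print(round(r, 3))  # compare with the AXP/FDX entry of the correlation table
```

The result agrees with the AXP/FDX correlation of 0.400 up to rounding of the displayed inputs.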
Summary Graphs: Histograms
02 – ExamScores2.xls
[Figure: Histogram of Score / Data Set #1 (Female). x-axis: score bins; y-axis: Frequency.]
[Figure: Histogram of Score / Data Set #1 (Male). x-axis: score bins; y-axis: Frequency.]
Summary Graphs: Scatterplots
04 - Expenses.xls
Let’s first study correlations!
[Figure: Scatterplot of Salary vs Culture of Data Set #1. x-axis: Culture / Data Set #1; y-axis: Salary / Data Set #1.]
Correlation: 0.506
Summary Graphs: Box-Whisker
Plots
02 – ExamScores2.xls
[Figure: Box Plot of Comparison of Score / Data Set #1, with one box for Gender = Male and one for Gender = Female; score axis from 0 to 120.]
Statistical Inference: Confidence Intervals
05 – GasPrices.xls
One- Sample Analysis
StatTools
(Core Analysis Pack)
Analysis: Confidence Interval
Performed By:
Date: Monday, October 22, 2007
Updating: Live
Conf. Intervals (One-Sample)   Price of regular unleaded
                               Data Set #1
Sample Size                    32
Sample Mean                    1.57906
Sample Std Dev                 0.07945
Confidence Level (Mean)        95.0%
Degrees of Freedom             31
Lower Limit                    1.55042
Upper Limit                    1.60771
Confidence Level (Std Dev)     95.0%
Degrees of Freedom             31
Lower Limit                    0.06369
Upper Limit                    0.10562
Statistical Inference: Confidence Intervals
06 – SandwichRatings.xls
Two- Sample Analysis
StatTools (Core Analysis Pack)
Analysis: Confidence Interval
Performed By:
Date: Monday, October 22, 2007
Updating: Live
Sample Summaries                        Satisfaction (Female)   Satisfaction (Male)
                                        Data Set #1             Data Set #1
Sample Size                             39                      71
Sample Mean                             5.949                   6.746
Sample Std Dev                          2.470                   1.787

Conf. Intervals (Difference of Means)   Equal Variances         Unequal Variances
Confidence Level                        95.0%                   95.0%
Sample Mean Difference                  -0.798                  -0.798
Standard Error of Difference            0.409249402             0.448812955
Degrees of Freedom                      108                     60
Lower Limit                             -1.608964238            -1.695520502
Upper Limit                             0.013442389             0.099998653

Equality of Variances Test
Ratio of Sample Variances               1.9119
p-Value                                 0.0190
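The standard errors and degrees of freedom in the two columns can be approximately reproduced from the sample summaries alone. The sketch below assumes the usual textbook formulas (pooled variance for the equal-variance column, Welch standard error with Satterthwaite degrees of freedom for the unequal-variance column); the deck does not state which formulas StatTools uses, and the variable names are illustrative:

```python
import math

# Sample summaries from the report above (Female, Male).
n1, sd1 = 39, 2.470
n2, sd2 = 71, 1.787

# Equal-variance case: pool the two sample variances.
df_pooled = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df_pooled
se_equal = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Unequal-variance case: Welch standard error, Satterthwaite df.
v1, v2 = sd1**2 / n1, sd2**2 / n2
se_unequal = math.sqrt(v1 + v2)
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(df_pooled, round(se_equal, 4))
print(round(df_welch), round(se_unequal, 4))
```

Small differences from the reported values are expected because the inputs here are the rounded summaries shown on the slide.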
Statistical Inference: Hypothesis Test
05 – GasPrices.xls
StatTools (Core Analysis Pack)
Analysis: Hypothesis Test
Performed By:
Date: Monday, October 22, 2007
Updating: Live
Hypothesis Test (One-Sample)       Price of regular unleaded
                                   Data Set #1
Sample Size                        32
Sample Mean                        1.57906
Sample Std Dev                     0.07945
Hypothesized Mean                  1.55
Alternative Hypothesis             > 1.55
Standard Error of Mean             0.014044567
Degrees of Freedom                 31
t-Test Statistic                   2.0693
p-Value                            0.0235
Null Hypoth. at 10% Significance   Reject
Null Hypoth. at 5% Significance    Reject
Null Hypoth. at 1% Significance    Don't Reject
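The standard error and t statistic in this report follow directly from the three sample summaries. A minimal sketch using only the Python standard library; the p-value is left out because it needs the t distribution's CDF, which the standard library does not provide:

```python
import math

# Sample summaries from the report above.
n = 32
sample_mean = 1.57906
sample_sd = 0.07945
hypothesized_mean = 1.55

standard_error = sample_sd / math.sqrt(n)
t_statistic = (sample_mean - hypothesized_mean) / standard_error
degrees_of_freedom = n - 1

print(round(standard_error, 6), round(t_statistic, 3), degrees_of_freedom)
```

The values match the report up to rounding of the displayed mean and standard deviation. With a p-value of 0.0235, the null hypothesis is rejected at the 10% and 5% levels but not at 1%, as the last three rows show.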
Statistical Inference: Sample Size Selection
05 – GasPrices.xls
StatTools
(Core Analysis Pack)
Analysis: Sample Size Selection
Performed By:
Date: Monday, October 22, 2007
Updating: Live
Sample Size for Mean
Confidence Level          95.00%
Half-length of Interval   0.005
Std Dev (estimate)        0.08
Sample Size               984
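The reported sample size is consistent with the standard normal-approximation formula n = (z · σ / B)², rounded up, where B is the half-length of the interval. A minimal sketch under that assumption; the function name is illustrative, and the deck does not confirm that StatTools uses exactly this formula:

```python
import math
from statistics import NormalDist


def sample_size_for_mean(confidence: float, half_length: float,
                         sd_estimate: float) -> int:
    """Smallest n whose normal-approximation interval has the given half-length."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sd_estimate / half_length) ** 2)


n = sample_size_for_mean(confidence=0.95, half_length=0.005, sd_estimate=0.08)
print(n)  # 984
```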
Data Management: Stacking & Unstacking Variables
07- EmpowerRatings.xls
Unstacked data (one column per plant):
South   Midwest   Northeast
7       7         7
1       6         5
8       10        5
7       3         5
2       9         4
3       2         5
7       8         1
5       3         5
7       2         3
4       7         3
        7         3
        5         5
        10        5
        10        6

Stacked data (one row per rating, first rows shown):
Plant     Rating
South     7
South     1
South     8
South     7
South     2
South     3
South     7
South     5
South     7
South     4
Midwest   7
Midwest   6
Midwest   10
Midwest   3
Midwest   9
Midwest   2
Midwest   8
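Outside the spreadsheet, the same conversion between the two layouts can be sketched in a few lines of Python. The ratings below are the ones listed under Plant and Rating above; the `stack` and `unstack` helpers are illustrative names, not StatTools functions:

```python
# Unstacked layout: one list of ratings per plant.
unstacked = {
    "South": [7, 1, 8, 7, 2, 3, 7, 5, 7, 4],
    "Midwest": [7, 6, 10, 3, 9, 2, 8],
}


def stack(columns):
    """Turn {category: values} into a list of (category, value) rows."""
    return [(name, value) for name, values in columns.items() for value in values]


def unstack(rows):
    """Turn (category, value) rows back into {category: values}."""
    columns = {}
    for name, value in rows:
        columns.setdefault(name, []).append(value)
    return columns


stacked = stack(unstacked)
print(stacked[:3])
print(unstack(stacked) == unstacked)
```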
Data Management: Transforming Variables
04- Expenses.xls
Salary    Culture   Sports    Dining    Log(Salary)   Log(Culture)   Log(Sports)   Log(Dining)
$54,600   $1,020    $990      $1,510    10.91         6.93           6.90          7.32
$57,500   $1,100    $460      $1,180    10.96         7.00           6.13          7.07
$53,300   $900      $780      $1,590    10.88         6.80           6.66          7.37
$43,500   $570      $860      $1,750    10.68         6.35           6.76          7.47
$57,200   $900      $1,390    $2,120    10.95         6.80           7.24          7.66
$63,400   $820      $1,880    $3,090    11.06         6.71           7.54          8.04
$58,500   $1,340    $710      $1,540    10.98         7.20           6.57          7.34
$55,600   $1,250    $680      $1,800    10.93         7.13           6.52          7.50
$61,300   $1,190    $1,220    $2,330    11.02         7.08           7.11          7.75
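The Log columns are natural logarithms of the original values. A short check in Python, using the first three salaries from the sheet above:

```python
import math

# First three Salary values from the sheet above.
salary = [54600, 57500, 53300]

# Natural-log transform, rounded to two decimals as displayed.
log_salary = [round(math.log(value), 2) for value in salary]
print(log_salary)  # [10.91, 10.96, 10.88]
```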
Data Management: Creating Dummy Variables
08 - Salaries1.xls
Employee   EducLevel   Gender   Salary    EducLevel = 1   EducLevel = 2   EducLevel = 3
1          3           Male     $32,000   0               0               1
2          1           Female   $39,100   1               0               0
3          1           Female   $33,200   1               0               0
4          2           Female   $30,600   0               1               0
5          3           Male     $29,000   0               0               1
6          3           Female   $30,500   0               0               1
7          3           Female   $30,000   0               0               1
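Each dummy column is 1 where the categorical variable equals that level and 0 elsewhere. A minimal sketch using the seven EducLevel values shown above; `make_dummies` is an illustrative helper, not a StatTools function:

```python
# EducLevel for employees 1 to 7, as shown above.
educ_level = [3, 1, 1, 2, 3, 3, 3]


def make_dummies(values, name):
    """Return one 0/1 column per distinct category in `values`."""
    return {
        f"{name} = {category}": [1 if v == category else 0 for v in values]
        for category in sorted(set(values))
    }


dummies = make_dummies(educ_level, "EducLevel")
for column, indicator in dummies.items():
    print(column, indicator)
```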
Regression Analysis
09 – Salaries3.xls
Employee   Age   Gender   EducLevel   YrsExper   Salary   Female   Female*YrsExper   EducLevel1   EducLevel2   EducLevel3
1          26    Male     3           3          32000    0        0                 0            0            1
2          38    Female   1           14         39100    1        14                1            0            0
3          35    Female   1           12         33200    1        12                1            0            0
4          40    Female   2           8          30600    1        8                 0            1            0
5          28    Male     3           3          29000    0        0                 0            0            1

Employee   Age   Gender   EducLevel   YrsExper   Salary   Female   Female*Yrs
209        37    Female   3           7          38810    1
210        49    Male     2           10         41201    0
211        55    Male     3           16         56272    0
Time Series & Forecasting: Time Series Graph
10 – StereoSales.xls
[Figure: Time Series of Sales / Data Set #1. x-axis: months from Jan-95 to Oct-98; y-axis: sales, 0 to 300.]
Time Series & Forecasting: Autocorrelation
10 – StereoSales.xls
[Figure: Time Series of Sales / Data Set #1. x-axis: months from Jan-95 to Oct-98; y-axis: sales, 0 to 300.]

Autocorrelation Table   Sales
                        Data Set #1
Number of Values        48
Standard Error          0.1443
Lag #1                  0.3492
Lag #2                  0.0772
Lag #3                  0.0814
Lag #4                  -0.0095
Lag #5                  -0.1353
Lag #6                  0.0206
Lag #7                  -0.1494
Lag #8                  -0.1492
Lag #9                  -0.2626
Lag #10                 -0.1792
Lag #11                 0.0121
Lag #12                 -0.0516

[Figure: Autocorrelation of Sales / Data Set #1. x-axis: Number of Lags, 1 to 12; y-axis: autocorrelation, -1 to 1.]
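A lag-k autocorrelation compares each observation's deviation from the mean with the deviation k periods earlier. The sketch below uses the common textbook estimator and the approximate standard error 1/√n; the deck does not state which estimator StatTools uses, although 1/√48 does reproduce the reported standard error of 0.1443. The toy series is made up for illustration:

```python
import math


def autocorrelation(series, lag):
    """Textbook lag-k autocorrelation of a sequence of numbers."""
    n = len(series)
    mean = sum(series) / n
    deviations = [y - mean for y in series]
    denominator = sum(d * d for d in deviations)
    numerator = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
    return numerator / denominator


def approx_standard_error(n):
    """Approximate standard error of an autocorrelation for n values."""
    return 1 / math.sqrt(n)


toy_series = [1, 2, 3, 4, 5]  # illustrative data, not from StereoSales.xls
print(autocorrelation(toy_series, lag=1))
print(round(approx_standard_error(48), 4))
```

A common rule of thumb treats an autocorrelation larger than about two standard errors in magnitude as notable; in the table above only Lag #1 (0.3492) exceeds 2 × 0.1443.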
Time Series & Forecasting: Simple Exponential Smoothing
11 – HardwareSales.xls
[Figure: Time Series of Sales / Data Set #1. x-axis: week; y-axis: sales, 0 to 3500.]
[Figure: Forecast and Original Observations. Series: Sales and Forecast; x-axis: Week; y-axis: 0.00 to 3500.00.]
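Simple exponential smoothing keeps a single smoothed level and uses it as the forecast for the next period. A minimal sketch, assuming the level is initialised to the first observation (the deck does not say how StatTools initialises it); the smoothing constant and data are illustrative:

```python
def ses_forecasts(series, alpha):
    """One-step-ahead forecasts from simple exponential smoothing.

    forecasts[t] is the forecast for period t; the final element is the
    forecast for the period after the last observation.
    """
    level = series[0]          # assumption: start at the first observation
    forecasts = [level]
    for y in series:
        level = alpha * y + (1 - alpha) * level
        forecasts.append(level)
    return forecasts


print(ses_forecasts([10, 20, 30], alpha=0.5))  # [10, 10.0, 15.0, 22.5]
```

A larger alpha makes the forecast react faster to recent observations; a smaller alpha smooths more heavily.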
Time Series & Forecasting: Holt’s Method
12 – ChipSales.xls
[Figure: Time Series of Sales / Data Set #1. x-axis: month; y-axis: sales, 0 to 10000.]
[Figure: Forecast and Original Observations. Series: Sales and Forecast; x-axis: months from Jan-1986 to Jun-1991; y-axis: 0.00 to 12000.00.]
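Holt's method extends simple exponential smoothing with a second smoothed quantity, the trend, so that forecasts can follow a rising or falling series. A minimal sketch with assumed initial values (level = first observation, trend = first difference); the deck does not describe StatTools' initialisation, and the data are illustrative:

```python
def holt_forecast(series, alpha, beta, steps_ahead):
    """Forecasts from Holt's linear-trend exponential smoothing."""
    level = series[0]                 # assumption: initial level
    trend = series[1] - series[0]     # assumption: initial trend
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + k * trend for k in range(1, steps_ahead + 1)]


# A perfectly linear series is extrapolated exactly.
print(holt_forecast([2, 4, 6, 8], alpha=0.5, beta=0.5, steps_ahead=2))  # [10.0, 12.0]
```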
Time Series & Forecasting: Winters’ Method
13 – SoftdrinkSales.xls
[Figure: Time Series of Sales / Data Set #1. x-axis: quarter; y-axis: sales, 0 to 6000.]
[Figure: Forecast and Original Observations. Series: Sales and Forecast; x-axis: quarters from Q1-1986 to Q2-2002; y-axis: 0.00 to 7000.00.]
NeuralTools®
Predictive Modeling
• A statistical model of future behavior
• Made up of predictor variables that are likely to influence results
• Historical data is analyzed to find relationships between predictor variables and the outcome (or outcomes)
• Outputs and predictors can be numerical or categorical/classification in nature
Numerical Modeling
• The output of interest has a numerical value
• Conventional approach using Linear Regression (simple or multiple)
• Multiple numerical outputs require Multivariate Regression techniques
• Predictors can be numerical or categorical
• Examples:
– Investment prediction
– Air and sea currents
Categorical Modeling
• The model requires observations to be placed in groups
• Logistic Regression for binary models (two categories)
• Places observations into either group based on exceeding a critical value or not
• Discriminant analysis for more than two categories
• Places observations into groups based on their statistical distance from each group
• Predictors can be numerical or categorical
• Examples:
– Tumour diagnosis
– Credit scoring
Neural Networks
• A neural network is a system that takes numeric inputs, performs computations on these inputs, and outputs one or more numeric values
• Inspired by the structure of the brain
• Consists of a large number of cells (neurons)
• Neurons receive impulses from other neurons
• Depending on the impulse received, the neuron may send a signal to other neurons
• This signal will be a simple function of the impulses it received
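To make "a simple function of the impulses it received" concrete, here is a sketch of a single artificial neuron: a weighted sum of its inputs passed through a logistic (sigmoid) activation. The activation choice, weights, and bias are assumptions for illustration; the deck does not say which function NeuralTools uses:

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs squashed into (0, 1) by a sigmoid."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 / (1 + math.exp(-total))


# Two incoming signals with illustrative weights.
print(neuron_output([1.0, 2.0], weights=[0.5, -0.25], bias=0.0))  # 0.5
```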
Neural Nets vs. Statistical Methods
• Neural nets provide an alternative to more traditional statistical methods
• Used for function approximation and classification, just as Linear Regression, Discriminant Analysis, and Logistic Regression are
• An advantage of neural nets is that they are capable of modelling extremely complex functions, in contrast with the traditional linear techniques
• Neural nets are not subject to the same assumptions as statistical methods (autocorrelation, Gaussian errors, equality of variance, etc.)
Training a Net
• The process of fine-tuning the parameters of the computation, where the purpose is to make the net output approximately correct values for given inputs
• The training algorithm selects various sets of computation parameters, and evaluates each set by applying the net to each training case
• Each set of parameters is a "trial"
• The training algorithm selects new sets of parameters based on the results of previous trials
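The trial loop described above can be sketched with a deliberately simple stand-in: a one-input linear "net" trained by random hill climbing, where each candidate parameter set is one trial and new candidates are drawn near the best set found so far. This is not the NeuralTools training algorithm, which the deck does not specify; the model, step size, and data are all illustrative assumptions:

```python
import random


def net(params, x):
    """A toy one-input 'net': output = weight * x + bias."""
    weight, bias = params
    return weight * x + bias


def mean_squared_error(params, cases):
    """Evaluate one parameter set by applying the net to every training case."""
    return sum((net(params, x) - y) ** 2 for x, y in cases) / len(cases)


def train(cases, trials=2000, step=0.1, seed=0):
    rng = random.Random(seed)
    best = (0.0, 0.0)
    best_error = mean_squared_error(best, cases)
    for _ in range(trials):
        # Each trial perturbs the best parameters found in earlier trials.
        candidate = tuple(p + rng.gauss(0, step) for p in best)
        error = mean_squared_error(candidate, cases)
        if error < best_error:
            best, best_error = candidate, error
    return best, best_error


training_cases = [(0, 1), (1, 3), (2, 5), (3, 7)]  # generated from y = 2x + 1
initial_error = mean_squared_error((0.0, 0.0), training_cases)
best_params, final_error = train(training_cases)
print(initial_error, final_error, best_params)
```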
Neural Networks available in NeuralTools
• Multi-Layer Feedforward Network
– The user specifies one or two layers of hidden nodes, and how many nodes the hidden layers should contain
• Generalized Regression Neural Nets and Probabilistic Neural Nets
– GRN Net used for numeric prediction
– PN Net used for category prediction/classification
– Always have two hidden layers of nodes, with one node per training case in the first hidden layer, and the size of the second layer determined by the training data
Multi-Layer Feedforward Nets
• Also referred to as "Multi-Layer Perceptron Networks"
• Capable of approximating complex functions, and thus capable of modelling complex relationships between independent variables and a dependent one
• When MLF nets are used for classification, they have multiple output nodes, one corresponding to each possible dependent category
• A net classifies a case by computing its numeric outputs; the selected category is the one corresponding to the node that outputs the highest value
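The final classification step is just a search for the output node with the largest value. A minimal sketch; the category names and output values are made up for illustration:

```python
def select_category(outputs_by_category):
    """Return the category whose output node has the highest value."""
    return max(outputs_by_category, key=outputs_by_category.get)


# Hypothetical numeric outputs of a trained net for one case.
node_outputs = {"low risk": 0.15, "medium risk": 0.70, "high risk": 0.35}
print(select_category(node_outputs))  # medium risk
```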
MLF Architecture
Generalized Regression Neural Nets
• Used for numeric prediction modeling
• Built on the intuitive idea that the closer a known case is to the unknown one, the more important it is when estimating the unknown dependent value
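The "closer cases matter more" idea can be sketched as a distance-weighted average of the known dependent values, which is the standard form of a generalized regression neural net. The Gaussian weighting and the smoothing parameter `sigma` are assumptions here; the deck does not describe how NeuralTools sets them, and the training cases are illustrative:

```python
import math


def grnn_predict(training_cases, query, sigma=1.0):
    """Distance-weighted average of the training targets.

    training_cases is a list of (inputs, target) pairs; closer cases
    receive exponentially larger weights.
    """
    weights = []
    for inputs, _ in training_cases:
        squared_distance = sum((a - b) ** 2 for a, b in zip(inputs, query))
        weights.append(math.exp(-squared_distance / (2 * sigma**2)))
    weighted_sum = sum(w * target for w, (_, target) in zip(weights, training_cases))
    return weighted_sum / sum(weights)


cases = [([0.0], 10.0), ([10.0], 20.0)]
print(grnn_predict(cases, [5.0]))   # midway between the two cases
print(grnn_predict(cases, [1.0]))   # dominated by the nearer case
```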
GRN Net Architecture
Probabilistic Neural Nets
• Used for classification modeling
• Similar in concept to GRN nets
• Considers the distance of a new case to every training case, giving greater weight to closer cases
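For classification, the same distance weighting can be summed per category and normalised. The sketch below sums Gaussian weights over the training cases in each category, which implicitly treats category frequencies in the training data as prior probabilities; that choice, the `sigma` value, and the data are assumptions rather than documented NeuralTools behaviour:

```python
import math


def pnn_probabilities(training_cases, query, sigma=1.0):
    """Category scores from distance-weighted votes, normalised to sum to 1."""
    scores = {}
    for inputs, category in training_cases:
        squared_distance = sum((a - b) ** 2 for a, b in zip(inputs, query))
        weight = math.exp(-squared_distance / (2 * sigma**2))
        scores[category] = scores.get(category, 0.0) + weight
    total = sum(scores.values())
    return {category: score / total for category, score in scores.items()}


cases = [([0.0], "A"), ([1.0], "A"), ([5.0], "B")]
probabilities = pnn_probabilities(cases, [0.5])
print(max(probabilities, key=probabilities.get), probabilities)
```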
PN Net Architecture
Advantages of GRN/PN nets
• Training time is much shorter
• Do not require topology specification
• PN nets not only classify, but also return the probabilities that the case falls in different possible dependent categories
Advantages of MLF nets
• Smaller in size, thus faster to make predictions
• More reliable outside the range of training data (for example, when the value of some independent variable falls outside the range of values for that variable in the training data); though note that prediction outside the range of training data is still risky with MLF nets
• Capable of generalizing from very small training sets
Sources of help
• On-line tutorials
• Help menu within the software
• Software manuals (PDF)
• Palisade website: www.palisade.com
• Helpdesk: http://helpdesk.palisade.com/
• Forum: http://forums.palisade.com/
• Web encyclopedia: www.wikipedia.com
• Your Regional Sales Manager(s)
• Palisade Training and Consulting Services