Introduction to StatTools and NeuralTools
Presented by: Thompson Terry
Palisade Corporation

In this presentation we will…
– Introduce StatTools and explore several statistical analyses
– Investigate Neural Networks and their role in Predictive Modeling
– Explore NeuralTools
– Analyze predictive models

StatTools Functionality
– Statistical Inference: Sample Size Selection, Confidence Interval Analysis, Hypothesis Tests, ANOVA, Chi-square Independence Test, Runs Test for Randomness
– Forecasting: Moving Averages, Exponential Smoothing, Seasonality
– Classification Analysis: Discriminant Analysis, Logistic Regression
– Data Management: Categorical Data, Stacked and Unstacked data types, Variable Transformations, Random Sample Generation, Analysis across multiple datasets and worksheets
– Summary Analyses: One-Variable Summary, Correlation/Covariance, Autocorrelation, Histogram, Scatterplot, Time Series, Boxplot
– Tests for Normality: Chi-square Test, Lilliefors Test, Q-Q Normal Plot
– Regression Analysis: Simple, Stepwise
– Quality Control Charts: X-Bar, R, P, C, U, Pareto Charts

StatTools Functionality
We will be exploring the following StatTools functions and features:
– Summary Statistics
– Summary Graphs
– Statistical inference
– Data management
– Regression Analysis
– Time Series & Forecasting

Data Set Manager

Summary Statistics: One Variable Analysis
01 - ExamScores1.xls

StatTools (Core Analysis Pack)
Analysis: One Variable Summary
Performed By:
Date: Monday, October 22, 2007
Updating: Live

One Variable Summary    Score (Exam Scores)
Mean                    67.54
Variance                323.22
Std. Dev.               17.98
Skewness                -0.4571
Kurtosis                2.4521
Median                  70.00
Mean Abs. Dev.          14.88
Minimum                 24.00
Maximum                 99.00
Range                   75.00
Count                   212
Sum                     14319.00
1st Quartile            55.00
3rd Quartile            82.00
Interquartile Range     27.00
1.00%                   26.00
2.50%                   28.00
5.00%                   33.00
10.00%                  42.00
20.00%                  52.00
80.00%                  84.00
90.00%                  89.00
95.00%                  94.00
97.50%                  95.00
99.00%                  97.00

Summary Statistics for Stacked Data with Categorical Variable
02 – ExamScores2.xls

One Variable Summary    Score (Female)    Score (Male)
                        Data Set #1       Data Set #1
Mean                    68.21             66.98
Std. Dev.               16.94             18.87
Median                  71.00             70.00
Minimum                 24.00             24.00
Maximum                 98.00             99.00
Count                   97                115
1st Quartile            58.00             54.00
3rd Quartile            80.00             83.00

Correlation and Covariance
03 – StockReturns.xls

Correlation Table (all variables from the Stocks data set)
        AXP      FDX      GM       IBM      MCD      MSFT
AXP     1.000
FDX     0.400    1.000
GM      0.483    0.250    1.000
IBM     0.358    0.292    0.259    1.000
MCD     0.354    0.265    0.318    0.397    1.000
MSFT    0.361    0.119    0.303    0.313    0.289    1.000

Covariance Table (all variables from the Stocks data set)
        AXP      FDX      GM       IBM      MCD      MSFT
AXP     0.00658
FDX     0.00369  0.01295
GM      0.00319  0.00231  0.00661
IBM     0.00270  0.00309  0.00196  0.00866
MCD     0.00205  0.00216  0.00185  0.00264  0.00512
MSFT    0.00325  0.00151  0.00273  0.00323  0.00229  0.01229

Summary Graphs: Histograms
02 – ExamScores2.xls
[Figure: Histogram of Score / Data Set #1 (Female); x-axis: score bins from 25.00 to 95.00; y-axis: Frequency]
[Figure: Histogram of Score / Data Set #1 (Male); x-axis: score bins from 25.00 to 95.00; y-axis: Frequency]

Summary Graphs: Scatterplots
04 - Expenses.xls
Let’s first study correlations!
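As a rough illustration of what the Correlation and Covariance tables report, the sketch below computes a sample covariance and a correlation coefficient in plain Python. The function names are hypothetical, the n - 1 divisor is an assumption about how the covariance is scaled, and the input is only the nine Salary and Culture rows visible in the Expenses.xls excerpt on the Transforming Variables slide, so the result need not match the full-data correlation shown on the scatterplot.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance of two equal-length sequences (assumes an n - 1 divisor)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Nine rows taken from the Expenses.xls excerpt in this deck (not the full data set).
salary = [54600, 57500, 53300, 43500, 57200, 63400, 58500, 55600, 61300]
culture = [1020, 1100, 900, 570, 900, 820, 1340, 1250, 1190]

r = correlation(salary, culture)
print(f"Correlation of Salary and Culture (nine rows only): {r:.3f}")
```

The correlation does not depend on the divisor choice, because the same divisor appears in the numerator and in both variances under the square root.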
[Figure: Scatterplot of Salary vs Culture of Data Set #1; x-axis: Culture / Data Set #1; y-axis: Salary / Data Set #1; Correlation 0.506]

Summary Graphs: Box-Whisker Plots
02 – ExamScores2.xls
[Figure: Box Plot of Comparison of Score / Data Set #1; boxes for Gender = Male and Gender = Female]

Statistical Inference: Confidence Intervals
05 – GasPrices.xls
One-Sample Analysis

Conf. Intervals (One-Sample)    Price of regular unleaded (Data Set #1)
Sample Size                     32
Sample Mean                     1.57906
Sample Std Dev                  0.07945
Confidence Level (Mean)         95.0%
Degrees of Freedom              31
Lower Limit                     1.55042
Upper Limit                     1.60771
Confidence Level (Std Dev)      95.0%
Degrees of Freedom              31
Lower Limit                     0.06369
Upper Limit                     0.10562

Statistical Inference: Confidence Intervals
06 – SandwichRatings.xls
Two-Sample Analysis

Sample Summaries    Satisfaction (Female)    Satisfaction (Male)
                    Data Set #1              Data Set #1
Sample Size         39                       71
Sample Mean         5.949                    6.746
Sample Std Dev      2.470                    1.787

Conf. Intervals (Difference of Means)    Equal Variances    Unequal Variances
Confidence Level                         95.0%              95.0%
Sample Mean Difference                   -0.798             -0.798
Standard Error of Difference             0.409249402        0.448812955
Degrees of Freedom                       108                60
Lower Limit                              -1.608964238       -1.695520502
Upper Limit                              0.013442389        0.099998653

Equality of Variances Test
Ratio of Sample Variances    1.9119
p-Value                      0.0190

Statistical Inference: Hypothesis Test
05 – GasPrices.xls

Hypothesis Test (One-Sample)       Price of regular unleaded (Data Set #1)
Sample Size                        32
Sample Mean                        1.57906
Sample Std Dev                     0.07945
Hypothesized Mean                  1.55
Alternative Hypothesis             > 1.55
Standard Error of Mean             0.014044567
Degrees of Freedom                 31
t-Test Statistic                   2.0693
p-Value                            0.0235
Null Hypoth. at 10% Significance   Reject
Null Hypoth. at 5% Significance    Reject
Null Hypoth. at 1% Significance    Don't Reject

Statistical Inference: Sample Size Selection
05 – GasPrices.xls

Sample Size for Mean
Confidence Level           95.00%
Half-length of Interval    0.005
Std Dev (estimate)         0.08
Sample Size                984

Data Management: Stacking & Unstacking Variables
07- EmpowerRatings.xls

Unstacked layout (columns A to C; excerpt)
Row    South    Midwest    Northeast
2      7        7          7
3      1        6          5
4      8        10         5
5      7        3          5
6      2        9          4
7      3        2          5
8      7        8          1
9      5        3          5
10     7        2          3
11     4        7          3
12              7          3
13              5          5
14              10         5
15              10         6

Stacked layout (excerpt)
Plant      Rating
South      7
South      1
South      8
South      7
South      2
South      3
South      7
South      5
South      7
South      4
Midwest    7
Midwest    6
Midwest    10
Midwest    3
Midwest    9
Midwest    2
Midwest    8

Data Management: Transforming Variables
04- Expenses.xls

Columns A to H (excerpt)
Salary     Culture    Sports    Dining    Log(Salary)    Log(Culture)    Log(Sports)    Log(Dining)
$54,600    $1,020     $990      $1,510    10.91          6.93            6.90           7.32
$57,500    $1,100     $460      $1,180    10.96          7.00            6.13           7.07
$53,300    $900       $780      $1,590    10.88          6.80            6.66           7.37
$43,500    $570       $860      $1,750    10.68          6.35            6.76           7.47
$57,200    $900       $1,390    $2,120    10.95          6.80            7.24           7.66
$63,400    $820       $1,880    $3,090    11.06          6.71            7.54           8.04
$58,500    $1,340     $710      $1,540    10.98          7.20            6.57           7.34
$55,600    $1,250     $680      $1,800    10.93          7.13            6.52           7.50
$61,300    $1,190     $1,220    $2,330    11.02          7.08            7.11           7.75

Data Management: Creating Dummy Variables
08 - Salaries1.xls

Columns A to G (excerpt)
Employee    EducLevel    Gender    Salary     EducLevel = 1    EducLevel = 2    EducLevel = 3
1           3            Male      $32,000    0                0                1
2           1            Female    $39,100    1                0                0
3           1            Female    $33,200    1                0                0
4           2            Female    $30,600    0                1                0
5           3            Male      $29,000    0                0                1
6           3            Female    $30,500    0                0                1
7           3            Female    $30,000    0                0                1

Regression Analysis
09 – Salaries3.xls

Columns A to K (excerpt)
Employee    Age    Gender    EducLevel    YrsExper    Salary    Female    Female*YrsExper    EducLevel1    EducLevel2    EducLevel3
1           26     Male      3            3           32000     0         0                  0             0             1
2           38     Female    1            14          39100     1         14                 1             0             0
3           35     Female    1            12          33200     1         12                 1             0             0
4           40     Female    2            8           30600     1         8                  0             1             0
5           28     Male      3            3           29000     0         0                  0             0             1

Columns M to T (excerpt; no Female*Yrs values appear in the source)
Employee    Age    Gender    EducLevel    YrsExper    Salary    Female    Female*Yrs
209         37     Female    3            7           38810     1
210         49     Male      2            10          41201     0
211         55     Male      3            16          56272     0

Time Series & Forecasting: Time Series Graph
10 – StereoSales.xls
[Figure: Time Series of Sales / Data Set #1; monthly, Jan-95 to Oct-98]

Time Series & Forecasting: Autocorrelation
10 – StereoSales.xls
[Figure: Time Series of Sales / Data Set #1; monthly, Jan-95 to Oct-98]

Autocorrelation Table    Sales (Data Set #1)
Number of Values         48
Standard Error           0.1443
Lag #1                   0.3492
Lag #2                   0.0772
Lag #3                   0.0814
Lag #4                   -0.0095
Lag #5                   -0.1353
Lag #6                   0.0206
Lag #7                   -0.1494
Lag #8                   -0.1492
Lag #9                   -0.2626
Lag #10                  -0.1792
Lag #11                  0.0121
Lag #12                  -0.0516

[Figure: Autocorrelation of Sales / Data Set #1; x-axis: Number of Lags (1 to 12); y-axis: -1 to 1]

Time Series & Forecasting: Simple Exponential Smoothing
11 – HardwareSales.xls
[Figure: Time Series of Sales / Data Set #1]
[Figure: Forecast and Original Observations; series: Sales, Forecast; x-axis: Week]

Time Series & Forecasting: Holt’s Method
12 – ChipSales.xls
[Figure: Time Series of Sales / Data Set #1]
[Figure: Forecast and Original Observations; series: Sales, Forecast; monthly, Jan-1986 to Jun-1991]

Time Series & Forecasting: Winters' Method
13- SoftdrinkSales.xls
[Figure: Time Series of Sales / Data Set #1]
[Figure: Forecast and Original Observations; series: Sales, Forecast; quarterly, Q1-1986 to Q2-2002]

NeuralTools®

Predictive Modeling
– A statistical model of future behavior
– Made up of predictor variables that are likely to influence results
– Historical data is analyzed to find relationships between predictor variables and the outcome (or outcomes)
– Outputs and predictors can be numerical or categorical/classification in nature

Numerical Modeling
– The output of interest has a numerical value
– Conventional approach using Linear Regression (simple or multiple)
– Multiple numerical outputs require Multivariate Regression techniques
– Predictors can be numerical or categorical
– Examples:
  – Investment prediction
  – Air and sea currents

Categorical Modeling
– The model requires observations to be placed in groups
– Logistic Regression for binary models (two categories)
  – Places observations into either group based on exceeding a critical value or not
– Discriminant analysis for more than two categories
  – Places observations into groups based on their statistical distance from each group
– Predictors can be numerical or categorical
– Examples:
  – Tumour diagnosis
  – Credit scoring

Neural Networks
– A neural network is a system that takes numeric inputs, performs computations on these inputs, and outputs one or more numeric values
– Inspired by the structure of the brain
  – Consists of a large number of cells (neurons)
  – Neurons receive impulses from other neurons
  – Depending on the impulse received,
    the neuron may send a signal to other neurons
  – This signal will be a simple function of the impulses it received

Neural Nets vs. Statistical Methods
– Neural nets provide an alternative to more traditional statistical methods
– Used for function approximation and classification, just as Linear Regression, Discriminant Analysis, and Logistic Regression are
– An advantage of neural nets is that they are capable of modelling extremely complex functions, in contrast with the traditional linear techniques
– Neural nets are not subject to the same assumptions as statistical methods (uncorrelated errors, Gaussian errors, equality of variance, etc.)

Training a Net
– The process of fine-tuning the parameters of the computation, where the purpose is to make the net output approximately correct values for given inputs
– The training algorithm selects various sets of computation parameters, and evaluates each set by applying the net to each training case
– Each set of parameters is a "trial"
– The training algorithm selects new sets of parameters based on the results of previous trials

Neural Networks available in NeuralTools
– Multi-Layer Feedforward Network
  – The user specifies one or two layers of hidden nodes, and how many nodes the hidden layers should contain
– Generalized Regression Neural Nets and Probabilistic Neural Nets
  – GRN Net used for numeric prediction
  – PN Net used for category prediction/classification
  – Always have two hidden layers of nodes, with one node per training case in the first hidden layer, and the size of the second layer determined by the training data

Multi-Layer Feedforward Nets
– Also referred to as "Multi-Layer Perceptron Networks"
– Capable of approximating complex functions, and thus capable of modelling complex relationships between independent variables and a dependent one
– When MLF nets are used for classification, they have multiple output nodes, one corresponding to each possible dependent category
– A net classifies a case by computing its numeric outputs; the selected category is the one corresponding to the node that outputs the highest value

MLF Architecture

Generalized Regression Neural Nets
– Used for numeric prediction modeling
– Built on the intuitive idea that the closer a known case is to the unknown one, the more important it is when estimating the unknown dependent value

GRN Net Architecture

Probabilistic Neural Nets
– Used for classification modeling
– Similar in concept to GRN nets
– Considers the distance of a new case to every training case, giving greater weight to closer cases

PN Net Architecture

Advantages of GRN/PN nets
– Training time is much shorter than for MLF nets
– Do not require topology specification
– PN nets not only classify, but also return the probabilities that the case falls in different possible dependent categories

Advantages of MLF nets
– Smaller in size, thus faster to make predictions
– More reliable outside the range of training data (for example, when the value of some independent variable falls outside the range of values for that variable in the training data); though note that prediction outside the range of training data is still risky with MLF nets
– Capable of generalizing from very small training sets

Sources of help
– On-line tutorials
– Help menu within the software
– Software manuals (PDF)
– Palisade web-site: www.palisade.com
  – Helpdesk: http://helpdesk.palisade.com/
  – Forum: http://forums.palisade.com/
– Web encyclopedia: www.wikipedia.com
– Your Regional Sales Manager(s)
– Palisade Training and Consulting Services
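Appendix sketch: the GRN net idea described earlier (the closer a known case is to the unknown one, the more it counts) can be illustrated with a distance-weighted average. This is a minimal sketch and not NeuralTools' implementation: the function name grnn_predict, the Gaussian weighting, and the smoothing parameter sigma are assumptions chosen for illustration, and the tiny data set is made up.

```python
from math import exp


def grnn_predict(train_x, train_y, query, sigma=1.0):
    """Estimate the dependent value for `query` as a weighted average of the
    training outputs, with weights that shrink as distance grows.

    Assumption: Gaussian weights exp(-d^2 / (2 * sigma^2)); sigma is a
    hypothetical smoothing parameter, not a documented NeuralTools setting.
    """
    weights = []
    for case in train_x:
        # Squared Euclidean distance between the training case and the query.
        dist_sq = sum((a - b) ** 2 for a, b in zip(case, query))
        weights.append(exp(-dist_sq / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from every training case for this sigma")
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Made-up training cases: one independent variable, one numeric output.
train_x = [[0.0], [2.0], [4.0]]
train_y = [10.0, 20.0, 30.0]

# A query near the first case is dominated by that case's output.
print(grnn_predict(train_x, train_y, [0.1], sigma=0.5))
```

Each training case contributes one weight, which mirrors the slide's note that the first hidden layer has one node per training case. A smaller sigma makes the prediction follow the nearest cases more closely; a larger one smooths across more of the training data.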