CIS732-Lecture-16

Lecture 16 of 42
Regression and Prediction
Wednesday, 27 February 2008
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.kddresearch.org
http://www.cis.ksu.edu/~bhsu
Readings:
Section 6.11, Han & Kamber 2e
Chapter 1, Sections 6.1-6.5, Goldberg
Sections 9.1-9.4, Mitchell
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Lecture Outline
•
Readings
– Section 6.11, Han & Kamber 2e
– Suggested: Chapter 1, Sections 6.1-6.5, Goldberg
•
Paper Review: “Genetic Algorithms and Classifier Systems”, Booker et al
•
Evolutionary Computation
– Biological motivation: process of natural selection
– Framework for search, optimization, and learning
•
Prototypical (Simple) Genetic Algorithm
– Components: selection, crossover, mutation
– Representing hypotheses as individuals in GAs
•
An Example: GA-Based Inductive Learning (GABIL)
•
GA Building Blocks (aka Schemas)
•
Taking Stock (Course Review): Where We Are, Where We’re Going
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
– Working with relationships between two variables
• Size of Teaching Tip & Stats Test Score
[Scatterplot: Stats Test Score (0–100) on the y-axis vs. size of teaching tip ($0–$80) on the x-axis]
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Correlation & Regression
•
Univariate & Bivariate Statistics
– U: frequency distribution, mean, mode, range, standard deviation
– B: correlation – two variables
•
Correlation
– linear pattern of relationship between one variable (x) and another variable (y) – an
association between two variables
– relative position of one variable correlates with relative distribution of another
variable
– graphical representation of the relationship between two variables
•
Warning:
– No proof of causality
– Cannot assume x causes y
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Scatterplot!
•
No Correlation
– Random or circular assortment of
dots
•
Positive Correlation
– ellipse leaning to right
– GPA and SAT
– Smoking and Lung Damage
•
Negative Correlation
– ellipse leaning to left
– Depression & Self-esteem
– Studying & test errors
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Pearson’s Correlation Coefficient
•
“r” indicates…
– strength of relationship (strong, weak, or none)
– direction of relationship
• positive (direct) – variables move in same direction
• negative (inverse) – variables move in opposite directions
•
r ranges in value from –1.0 to +1.0
–1.0 (Strong Negative) … 0.0 (No Relationship) … +1.0 (Strong Positive)
• Go to website: playing with scatterplots
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
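As a minimal illustration of how r is computed (not part of the original slides), here is a short Python sketch; the data values are made up to mimic the tip-vs-test-score example:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

# Hypothetical data: size of teaching tip (x) vs. stats test score (y)
tip = [0, 10, 20, 40, 60, 80]
score = [55, 60, 68, 74, 83, 90]
print(round(pearson_r(tip, score), 3))  # near +1.0: strong positive correlation
```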
Practice with Scatterplots
r = .__ __
r = .__ __
r = .__ __
r = .__ __
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Correlation Guestimation
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Correlations (Pearson correlation, Sig. 2-tailed, N = 12 for every pair)

                          Miles walked/day   Weight     Depression   Anxiety
Miles walked per day            1            -.797**     -.800**      -.774**
  Sig. (2-tailed)                              .002        .002         .003
Weight                        -.797**          1           .648*        .780**
  Sig. (2-tailed)              .002                         .023         .003
Depression                    -.800**          .648*        1            .753**
  Sig. (2-tailed)              .002            .023                      .005
Anxiety                       -.774**          .780**       .753**       1
  Sig. (2-tailed)              .003            .003         .005
**. Correlation is s ignificant at the 0.01 level (2-tailed).
*. Correlation is s ignificant at the 0.05 level (2-tailed).
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Samples vs. Populations
•
Sample statistics estimate Population parameters
– M tries to estimate μ
– r tries to estimate ρ (“rho” – greek symbol --- not “p”)
• r: correlation for a sample
  • based on the limited observations we have
• ρ: actual correlation in the population
  • the true correlation
•
Beware Sampling Error!!
– even if ρ=0 (there’s no actual correlation), you might get r =.08 or r = -.26 just by
chance.
– We look at r, but we want to know about ρ
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Hypothesis testing with Correlations
•
Two possibilities
– Ho: ρ = 0 (no actual correlation; The Null Hypothesis)
– Ha: ρ ≠ 0 (there is some correlation; The Alternative Hyp.)
•
Case #1 (see correlation worksheet)
– Correlation between distance and points r = -.904
– Sample small (n=6), but r is very large
– We guess ρ < 0 (we guess there is some correlation in the pop.)
•
Case #2
– Correlation between aiming and points, r = .628
– Sample small (n=6), and r is only moderate in size
– We guess ρ = 0 (we guess there is NO correlation in pop.)
•
Bottom-line
– We can only guess about ρ
– We can be wrong in two ways
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
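A small sketch of the hypothesis test described above, assuming SciPy is available; the six data points are hypothetical stand-ins for the worksheet values:

```python
from scipy import stats

# Hypothetical small sample (n = 6) standing in for the worksheet data
distance = [10, 12, 15, 18, 22, 25]   # distance from target
points   = [95, 90, 70, 65, 40, 30]   # ball toss points

r, p = stats.pearsonr(distance, points)   # tests H0: rho = 0
print(f"r = {r:.3f}, p = {p:.3f}")
if p <= 0.05:
    print("Reject H0: we guess there is some correlation in the population")
else:
    print("Fail to reject H0: we guess rho = 0")
```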
Reading Correlation Matrix
Correlations(a)  (Pearson correlation, with Sig. 2-tailed in parentheses; N = 6 for every pair)

                            Toss pts   Distance   Time spun   Aiming    Dexterity    GPA      Confidence
Total ball toss points        1        -.904*      -.582       .628      .821*       -.037     -.502
                                       (.013)      (.226)     (.181)    (.045)       (.945)    (.310)
Distance from target        -.904*       1          .279      -.653     -.883*        .228      .522
                            (.013)                 (.592)     (.159)    (.020)       (.664)    (.288)
Time spun before throwing   -.582        .279        1        -.390     -.248        -.087      .267
                            (.226)     (.592)                 (.445)    (.635)       (.869)    (.609)
Aiming accuracy              .628      -.653       -.390        1        .758        -.546     -.250
                            (.181)     (.159)      (.445)               (.081)       (.262)    (.633)
Manual dexterity             .821*     -.883*      -.248       .758       1          -.553     -.101
                            (.045)     (.020)      (.635)     (.081)                 (.255)    (.848)
College grade point avg     -.037       .228       -.087      -.546     -.553          1       -.524
                            (.945)     (.664)      (.869)     (.262)    (.255)                 (.286)
Confidence for task         -.502       .522        .267      -.250     -.101        -.524       1
                            (.310)     (.288)      (.609)     (.633)    (.848)       (.286)

*. Correlation is significant at the 0.05 level (2-tailed).
a. Day sample collected = Tuesday
Correlations(a) – annotated (same matrix as above)
• r = -.904, p = .013
• p is the probability of getting a correlation this size by sheer chance; reject H0 if p ≤ .05
• Report as: r(4) = -.904, p < .05 (degrees of freedom = sample size − 2)
*. Correlation is significant at the 0.05 level (2-tailed).
a. Day sample collected = Tuesday
© 2005 Sinn, J.
Winthrop University
Kansas State University
CIS 732: Machine Learning and Pattern Recognition
Department of Computing and Information Sciences
Predictive Potential
•
Coefficient of Determination
– r²
– Amount of variance accounted for in y by x
– Percentage increase in accuracy you gain by using the regression line to make
predictions
– Without correlation, you can only guess the mean of y
– [Used with regression]
[Scale: 0% – 100% of the variance in y accounted for]
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
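To make the "variance accounted for" idea concrete, here is an illustrative Python sketch (not from the slides) showing that r² equals the proportional reduction in squared error when the regression line replaces the mean of y as the prediction:

```python
import numpy as np

# Illustrative data, not the worksheet values
x = np.array([8, 10, 14, 18, 22, 26], dtype=float)
y = np.array([100, 92, 75, 62, 38, 20], dtype=float)

r = np.corrcoef(x, y)[0, 1]
b, a = np.polyfit(x, y, 1)                     # least-squares slope and intercept

sse_mean = np.sum((y - y.mean()) ** 2)         # error from guessing the mean of y
sse_line = np.sum((y - (b * x + a)) ** 2)      # error from using the regression line
print(f"r^2 = {r ** 2:.3f}")
print(f"reduction in squared error = {1 - sse_line / sse_mean:.3f}")  # same value
```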
Limitations of Correlation
•
linearity:
– can’t describe non-linear relationships
– e.g., relation between anxiety & performance
•
truncation of range:
– underestimates the strength of the relationship if you can't see the full range of x values
•
no proof of causation
– third variable problem:
• could be 3rd variable causing change in both variables
• directionality: can’t be sure which way causality “flows”
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Regression
•
Regression: Correlation + Prediction
– predicting y based on x
– e.g., predicting….
• throwing points (y)
• based on distance from target (x)
•
Regression equation
– formula that specifies a line
– y' = bx + a
– plug in an x value (distance from target) and predict y (points)
– note:
  • y = actual value of a score
  • y' = predicted value
• Go to website: Regression Playground
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Regression Graphic – Regression Line
See correlation
& regression
worksheet
[Regression plot: ball toss points (0–120) vs. distance from target (8–26 ft), with fitted regression line; Rsq = 0.6031; if x = 18 then y' ≈ 47; if x = 24 then y' ≈ 20]
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Regression Equation
•
y’= bx + a
– y' = predicted value of y
– b = slope of the line
– x = value of x that you plug in
– a = y-intercept (where the line crosses the y-axis)
See correlation & regression worksheet
•
In this case…
– y’ = -4.263(x) + 125.401
•
So if the distance is 20 feet
– y’ = -4.263(20) + 125.401
– y’ = -85.26 + 125.401
– y’ = 40.141
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
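The worked example above translates directly into a few lines of Python; this sketch simply plugs x into y' = bx + a with the slope and intercept quoted on the slide:

```python
def predict(x, b=-4.263, a=125.401):
    """y' = bx + a, using the slope and intercept from the worksheet example."""
    return b * x + a

print(predict(20))   # -4.263 * 20 + 125.401 = 40.141
print(predict(18))   # about 48.7 points
print(predict(24))   # about 23.1 points
```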
SPSS Regression Set-up
• "Criterion": the y-axis variable, what you're trying to predict
• "Predictor": the x-axis variable, what you're basing the prediction on
© 2005 Sinn, J.
Winthrop University
Note: never refer to the IV or DV when doing regression (use "predictor" and "criterion" instead)
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Getting Regression Info from SPSS
Model Summary
Model 1: R = .777(a), R Square = .603, Adjusted R Square = .581, Std. Error of the Estimate = 18.476
a. Predictors: (Constant), Distance from target
See correlation & regression worksheet

y' = b(x) + a
y' = -4.263(20) + 125.401

Coefficients(a)
Model 1, (Constant): Unstandardized B = 125.401, Std. Error = 14.265, t = 8.791, Sig. = .000
Model 1, Distance from target: Unstandardized B = -4.263, Std. Error = .815, Standardized Beta = -.777, t = -5.230, Sig. = .000
a. Dependent Variable: Total ball toss points
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
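The same slope, intercept, and R² that SPSS reports can be reproduced with SciPy's linregress; a hedged sketch with hypothetical data (the worksheet's raw values are not reproduced in the slides):

```python
from scipy import stats

# Hypothetical ball-toss data (n = 6); not the actual worksheet values
distance = [8, 10, 14, 18, 22, 26]
points   = [100, 92, 75, 62, 38, 20]

res = stats.linregress(distance, points)
print("slope (B)     =", round(res.slope, 3))
print("intercept (a) =", round(res.intercept, 3))
print("R squared     =", round(res.rvalue ** 2, 3))
print("p-value       =", round(res.pvalue, 3))
```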
Predictive Ability
•
Mantra!!
– As variability decreases, prediction accuracy ___
– if we can account for variance, we can make better predictions
•
As r increases:
– r² increases
• “variance accounted for” increases
• the prediction accuracy increases
– prediction error decreases (distance between y’ and y)
– Sy’ decreases
• the standard error of the residual/predictor
• measures overall amount of prediction error
•
We like big r’s!!!
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Drawing a Regression Line by Hand
Three steps
1. Plug zero in for x to get a y' value, and then plot this value
   – Note: it will be the y-intercept
2. Plug in a large value for x (so that it falls at the right end of the graph) and plot the resulting point
3. Connect the two points with a straight line!
© 2005 Sinn, J.
Winthrop University
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Time Series Prediction
Forecasting the Future and
Understanding the Past
Santa Fe Institute Studies in the Sciences of Complexity Proceedings
Edited by Andreas Weigend and Neil Gershenfeld
NIST Complex System Program
Perspectives on Standard Benchmark Data
In Quantifying Complex Systems
Vincent Stanford
Complex Systems Test Bed project
August 31, 2007
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Chaos in Nature, Theory, and Technology
Rings of Saturn
Lorenz Attractor
CIS 732: Machine Learning and Pattern Recognition
Aircraft dynamics at
high angles of attack
Kansas State University
Department of Computing and Information Sciences
Time Series Prediction
A Santa Fe Institute competition using standard data sets
• Santa Fe Institute (SFI) founded in 1984 to "… focus the tools of traditional scientific disciplines and emerging computer resources on … the multidisciplinary study of complex systems…"
• "This book is the result of an unsuccessful joke. … Out of frustration with the fragmented and anecdotal literature, we made what we thought was a humorous suggestion: run a competition. … no one laughed."
• Time series from physics, biology, economics, …, beg the same questions:
  – What happens next?
  – What kind of system produced this time series?
  – How much can we learn about the producing system?
• Quantitative answers can permit direct comparisons
• Make some standard data sets in consultation with subject matter experts in a variety of areas
• Very NISTY; but we are in a much better position to do this in the age of Google and the Internet
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Selecting benchmark data sets
For inclusion in the book
• Subject matter expert advisor group:
  – Biology
  – Economics
  – Astrophysics
  – Numerical Analysis
  – Statistics
  – Dynamical Systems
  – Experimental Physics
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
The Data Sets
A. Far-infrared laser excitation
B. Sleep apnea
C. Currency exchange rates
D. Particle driven in nonlinear multiple well potentials
E. Variable star data
F. J. S. Bach fugue notes
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
J.S. Bach benchmark
• Dynamic, yes.
• But is it an iterative map?
• Is it amenable to time delay embedding?
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Competition Tasks
• Predict the withheld continuations of the data sets provided for training and measure errors
• Characterize the systems as to:
  – Degrees of freedom
  – Predictability
  – Noise characteristics
  – Nonlinearity of the system
• Infer a model for the governing equations
• Describe the algorithms employed
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Complex Time Series Benchmark Taxonomy
• Natural vs. Synthetic
• Stationary vs. Nonstationary
• Low dimensional vs. Stochastic
• Clean vs. Noisy
• Short vs. Long
• Documented vs. Blind
• Linear vs. Nonlinear
• Scalar vs. Vector
• One trial vs. Many trials
• Continuous vs. Discontinuous (switching, catastrophes, episodes)
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Time-honored linear models

y[t+1] = Σ_{i=0..N_AR} a_i · y[t−i] + Σ_{j=0..N_MA} b_j · x[t−j]

• Auto-Regressive Moving Average (ARMA)
• Many linear estimation techniques based on Least Squares or Least Mean Squares
• Power spectra and autocorrelation characterize such linear systems
• Randomness comes only from the forcing function x(t)
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
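As a rough illustration of the AR part of such linear models (an assumption-laden sketch, not taken from the SFI book), the coefficients a_i can be estimated by ordinary least squares on lagged values:

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares fit of an AR(p) model: y[t] ≈ sum_i a[i] * y[t-1-i]."""
    X = np.vstack([y[t - 1 - np.arange(p)] for t in range(p, len(y))])  # lagged values
    coeffs, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coeffs

# Synthetic AR(2) series driven by Gaussian noise (the forcing function)
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal(scale=0.1)

print(fit_ar(y, 2))   # should recover approximately [0.6, -0.3]
```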
Simple nonlinear systems
can exhibit chaotic behavior
x[t+1] = r · x[t] · (1 − x[t])

• Spectrum and autocorrelation characterize linear systems, not these
• Deterministic chaos looks random to linear analysis methods
• The logistic map is an early example (Elam 1957)
Logistic map, 2.9 < r < 3.99
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
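A minimal sketch of the logistic map itself; iterating the one-line recurrence shows settled behavior at small r and erratic, chaos-like behavior near r = 3.99:

```python
def logistic_map(r, x0=0.2, n=50):
    """Iterate x[t+1] = r * x[t] * (1 - x[t]) and return the trajectory."""
    xs = [x0]
    for _ in range(n):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# r = 2.9: settles toward a fixed point; r = 3.99: deterministic chaos
print([round(v, 3) for v in logistic_map(2.9)[-5:]])
print([round(v, 3) for v in logistic_map(3.99)[-5:]])
```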
Understanding and learning
comments from SFI
• Weak to strong models – many parameters to few
• Data poor to data rich
• Theory poor to theory rich
• Weak models progress to strong, e.g. planetary motion:
  – Tycho Brahe: observes and records raw data
  – Kepler: equal areas swept in equal time
  – Newton: universal gravitation, mechanics, and calculus
  – Poincaré: fails to solve the three-body problem
  – Sussman and Wisdom: chaos ensues with computational solution!
• Is that a simplification?
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Discovering properties of data
and inferring (complex) models
• Can't decompose an output into the product of input and transfer function, Y(z) = H(z)X(z), by doing a Z, Laplace, or Fourier transform
• Linear perceptrons were shown to have severe limitations by Minsky and Papert
• Perceptrons with non-linear threshold logic can solve XOR and many classifications not available with the linear version
• But according to SFI: "Learning XOR is as interesting as memorizing the phone book. More interesting – and more realistic – are real-world problems, such as prediction of financial data."
• Many approaches are investigated
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Time delay embedding
Differs from traditional experimental measurements
– Provides detailed information about degrees of freedom beyond the scalar
measured
– Rests on probabilistic assumptions - though not guaranteed to be valid for any
particular system
– Reconstructed dynamics are seen through an unknown “smooth transformation”
– Therefore allows precise questions only about invariants under “smooth
transformations”
– It can still be used for forecasting a time series and “characterizing essential
features of the dynamics that produced it”
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Time delay embedding theorems
“The most important Phase Space Reconstruction technique is the method of delays”
Vector sequence:      x_{n+1} = f(x_n)
Scalar measurements:  {s_n} = {s(x_n)}
Time delay vectors:   S_n = (s_{n−(M−1)τ}, s_{n−(M−2)τ}, …, s_n)
– Assuming the dynamics f(X) on a V-dimensional manifold has a strange attractor A with box-counting dimension d_A
– s(X) is a twice-differentiable scalar measurement giving {s_n} = {s(X_n)}
– M is called the embedding dimension
– τ is generally referred to as the delay, or lag
– Embedding theorems: if {s_n} consists of scalar measurements of the state of a dynamical system then, under suitable hypotheses, the time delay embedding {S_n} is a one-to-one transformed image of the {X_n}, provided M > 2·d_A (e.g. Takens 1981, Lecture Notes in Mathematics, Springer-Verlag; or Sauer and Yorke, J. of Statistical Physics, 1991)
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
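A small Python sketch (assuming a uniformly sampled scalar series is already in hand) of building the delay vectors S_n defined above:

```python
import numpy as np

def delay_embed(s, M, tau):
    """Time delay vectors S_n = (s[n-(M-1)*tau], ..., s[n-tau], s[n])."""
    s = np.asarray(s, dtype=float)
    start = (M - 1) * tau
    return np.array([s[n - start:n + 1:tau] for n in range(start, len(s))])

# Illustrative scalar measurements of some underlying dynamics
s = np.sin(0.3 * np.arange(30))
S = delay_embed(s, M=3, tau=2)
print(S.shape)   # (26, 3): one 3-dimensional delay vector per usable time index
print(S[0])      # (s[0], s[2], s[4])
```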
Time series prediction
Many different techniques thrown at the data to “see if anything sticks”
Examples:
– Delay coordinate embedding – short-term prediction by filtered delay coordinates and reconstruction with local linear models of the attractor (T. Sauer)
– Neural networks with internal delay lines – performed well on data set A (E. Wan), (M. Mozer)
– Simple architectures for fast machines – "Know the data and your modeling technique" (X. Zhang and J. Hutchinson)
– Forecasting pdfs using HMMs with mixed states – capturing "Embedology" (A. Fraser and A. Dimitriadis)
– More…
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
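In the spirit of the local-model entries above, here is a crude nearest-neighbor forecaster in delay coordinates; it is only a sketch of the general idea, not a reconstruction of any competition entry:

```python
import numpy as np

def knn_forecast(s, M=3, tau=1, k=3):
    """Average what followed the k delay vectors most similar to the latest one."""
    s = np.asarray(s, dtype=float)
    start = (M - 1) * tau
    vecs = np.array([s[n - start:n + 1:tau] for n in range(start, len(s) - 1)])
    successors = s[start + 1:]                   # value that followed each vector
    query = s[len(s) - 1 - start::tau]           # most recent delay vector
    nearest = np.argsort(np.linalg.norm(vecs - query, axis=1))[:k]
    return successors[nearest].mean()

s = np.sin(0.25 * np.arange(200))
print(knn_forecast(s), np.sin(0.25 * 200))       # forecast vs. true next value
```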
Time series characterization
Many different techniques thrown at the data to “see if anything sticks”
Examples:
– Stochastic and deterministic modeling – local linear approximation to attractors (M. Casdagli and A. Weigend)
– Estimating dimension and choosing time delays – box counting (F. Pineda and J. Sommerer)
– Quantifying chaos using information-theoretic functionals – mutual information and nonlinearity testing (M. Palus)
– Statistics for detecting deterministic dynamics – coarse-grained flow averages (D. Kaplan)
– More…
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
What to make of this?
Handbook for the corpus driven study of nonlinear dynamics
Very NISTY:
– Convene a panel of leading researchers
– Identify areas of interest where improved characterization and predictive measurements can
be of assistance to the community
– Identify standard reference data sets:
• Development corpora
• Test sets
– Develop metrics for prediction and characterization
– Evaluate participants
– Is there a sponsor?
– Are there areas of special importance to communities we know? For example: predicting
catastrophic failures of machines from sensors.
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Ideas?
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
Terminology
•
Evolutionary Computation (EC): Models Based on Natural Selection
•
Genetic Algorithm (GA) Concepts
– Individual: single entity of model (corresponds to hypothesis)
– Population: collection of entities in competition for survival
– Generation: single application of selection and crossover operations
– Schema aka building block: descriptor of GA population (e.g., 10**0*)
– Schema theorem: representation of schema proportional to its relative fitness
•
Simple Genetic Algorithm (SGA) Steps
– Selection
• Proportionate reproduction (aka roulette wheel): P(individual) ∝ f(individual)
• Tournament: let individuals compete in pairs or tuples; eliminate unfit ones
– Crossover
• Single-point: 11101001000 × 00001010101 → { 11101010101, 00001001000 }
• Two-point: 11101001000 × 00001010101 → { 11001011000, 00101000101 }
• Uniform: 11101001000 × 00001010101 → { 10001000100, 01101011001 }
– Mutation: single-point (“bit flip”), multi-point
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences
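A minimal sketch of the single-point crossover operator listed above; the cut position is chosen to reproduce the slide's example:

```python
import random

def single_point_crossover(p1, p2, point=None):
    """Swap the tails of two equal-length bit strings at one crossover point."""
    if point is None:
        point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

# Reproduces the single-point example from the slide (cut after bit 5)
print(single_point_crossover("11101001000", "00001010101", point=5))
# -> ('11101010101', '00001001000')
```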
Summary Points
•
Evolutionary Computation
– Motivation: process of natural selection
• Limited population; individuals compete for membership
• Method for parallel, stochastic search
– Framework for problem solving: search, optimization, learning
•
Prototypical (Simple) Genetic Algorithm (GA)
– Steps
• Selection: reproduce individuals probabilistically, in proportion to fitness
• Crossover: generate new individuals probabilistically, from pairs of “parents”
• Mutation: modify structure of individual randomly
– How to represent hypotheses as individuals in GAs
•
An Example: GA-Based Inductive Learning (GABIL)
•
Schema Theorem: Propagation of Building Blocks
•
Next Lecture: Genetic Programming, The Movie
CIS 732: Machine Learning and Pattern Recognition
Kansas State University
Department of Computing and Information Sciences