Lecture 2 Slides

Chapter 2
Overview of the Data Mining
Process
1
Introduction
• Data Mining
– Predictive analysis
• Tasks of Classification & Prediction
• Core of Business Intelligence
• Database Methods
– OLAP
– SQL
– Do not involve statistical modeling
2
Core Ideas in Data Mining
• Analytical Methods Used in Predictive Analytics
– Classification
• Used with categorical response variables
• E.g. Will purchase be made / not made?
– Prediction
• Predict (estimate) value of continuous response variable
• The term “prediction” is sometimes used with categorical outcomes as well
– Association Rules
• Affinity analysis – “what goes with what”
• Seeks correlations among data
3
Core Ideas in Data Mining
• Data Reduction
– Reduce variables
– Group together similar variables
• Data Exploration
– View data as evidence
– Get “a feel” for the data
• Data Visualization
– Graphical representation of data
– Locate trends, correlations, etc.
4
Supervised Learning
• “Supervised learning” algorithms are those used in classification
and prediction.
– Data is available in which the value of the outcome of interest is known.
• “Training data” are the data from which the classification or
prediction algorithm “learns,” or is “trained,” about the
relationship between predictor variables and the outcome
variable.
• This process results in a “model”
– Classification Model
– Predictive Model
5
Supervised Learning
• The model is then run with another sample of data
– the “validation data”
– The outcome is known, but we wish to see how well the model performs
– If many different models are being tried out, a third sample of known
outcomes – the “test data” – is used with the final, selected model to predict
how well it will do.
• The model can then be used to classify or predict the outcome
of interest in new cases where the outcome is unknown.
6
Supervised Learning
• Linear regression analysis is an example of
supervised learning
– The Y variable is the (known) outcome variable
– The X variable is some predictor variable.
– A regression line is drawn to minimize the sum of squared deviations
between the actual Y values and the values predicted by this line.
– The regression line can now be used to predict Y values for new values
of X for which we do not know the Y value.
7
Unsupervised Learning
• No outcome variable to predict or classify
• No “learning” from cases
• Unsupervised learning methods
– Association Rules
– Data Reduction Methods
– Clustering Techniques
8
The Steps in Data Mining
• 1. Develop an understanding of the purpose of the data
mining project
– Is it a one-shot effort to answer a question (or questions), or
– an ongoing application (a procedure that will be repeated)?
• 2. Obtain the dataset to be used in the analysis.
– Random sampling from a large database to capture records to be used
in an analysis
– Pulling together data from different databases.
• Internal (e.g. Past purchases made by customers)
• External (credit ratings).
– Usually the analysis to be done requires only thousands or tens of
thousands of records.
9
The Steps in Data Mining
• 3. Explore, clean, and preprocess the data
– Verifying that the data are in reasonable condition.
– How should missing data be handled?
– Are the values in a reasonable range, given what you would expect for
each variable?
– Are there obvious “outliers”?
– Data are reviewed graphically –
• For example, a matrix of scatter plots showing the relationship of each
variable with each other variable.
– Ensure consistency in the definitions of fields, units of measurement,
time periods, etc.
10
The Steps in Data Mining
• 4. Reduce the data
– If supervised learning is involved, separate the data into training,
validation, and test datasets.
– Eliminating unneeded variables,
• Transforming variables
– e.g. Turning “money spent” into “spent > $100” vs. “spent ≤ $100”
• Creating new variables
– A variable that records whether at least one of several products was purchased
– Make sure you know what each variable means, and whether it is
sensible to include it in the model.
• 5. Determine the data mining task
– Classification, prediction, clustering, etc.
• 6. Choose the data mining techniques to be used
– Regression, neural nets, hierarchical clustering, etc.
11
The Steps in Data Mining
• 7. Use algorithms to perform the task.
– Iterative process - trying multiple variants, and often using multiple variants of
the same algorithm (choosing different variables or settings within the
algorithm).
– When appropriate, feedback from the algorithm's performance on validation
data is used to refine the settings.
• 8. Interpret the results of the algorithms.
– Choose the best algorithm to deploy,
– Use final choice on the test data to get an idea how well it will perform.
• 9. Deploy the model.
– Integrate the model into operational systems
– Run it on real records to produce decisions or actions.
– For example, the model might be applied to a purchased list of possible
customers, and the action might be “include in the mailing if the predicted
amount of purchase is > $10."
12
Preliminary Steps
• Organization of datasets
– Records in rows
– Variables in columns
• In supervised learning one of these will be the outcome variable
• Usually placed in the first or last column
• Sampling from a database
– Use samples to create, validate, and test the model
• Oversampling rare events
– If the value of interest for the response variable is seldom found in the data,
increase its share of the sample (oversample the rare cases)
– Adjust the algorithm as necessary
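A minimal sketch (not part of the slides) of oversampling a rare response class with pandas; the column name "response", the toy data, and the 1:1 ratio are illustrative assumptions:

    import pandas as pd

    # Tiny illustrative data set: "response" = 1 is the rare event.
    df = pd.DataFrame({
        "x": range(10),
        "response": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    })

    rare = df[df["response"] == 1]
    common = df[df["response"] == 0]

    # Oversample the rare class with replacement up to a 1:1 ratio (ratio is illustrative).
    rare_oversampled = rare.sample(n=len(common), replace=True, random_state=1)
    balanced = pd.concat([common, rare_oversampled]).sample(frac=1, random_state=1)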
13
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Types of variables
– Continuous – assumes any real numerical value
(generally within a specified range)
– Categorical – assumes one of a limited number of values
• Text (e.g. Payments ∈ {current, not current, bankrupt})
• Numerical (e.g. Age ∈ {0 … 120})
• Nominal (e.g. payments – unordered)
• Ordinal (e.g. age – ordered)
14
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Handling categorical variables
– If a categorical variable is ordered, it can be used as a continuous variable
(e.g. age, level of credit, etc.)
– Use of “dummy” variables when range of values not large
• e.g. Variable occupation ∈ {student, unemployed, employed, retired}
• Create binary (yes/no) dummy variables
– Student – yes/no
– Unemployed – yes/no
– Employed – yes/no
– Retired – yes/no
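A minimal sketch (not part of the slides) of building these yes/no dummy variables with pandas; the data values are made up for illustration:

    import pandas as pd

    # Hypothetical data for the occupation example above.
    df = pd.DataFrame({
        "occupation": ["student", "unemployed", "employed", "retired", "employed"],
    })

    # One binary (0/1) dummy column per category of "occupation".
    dummies = pd.get_dummies(df["occupation"], prefix="occupation").astype(int)
    df = pd.concat([df, dummies], axis=1)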
• Variable selection
– The more predictor variables, the more records are needed to build the
model
– Reduce number of variables whenever appropriate
15
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Overfitting
– Building a model – describing relationships among variables in order to
predict future outcome (dependent) values on the basis of future
predictor (independent) values.
– Avoid “explaining” variation in the data that is nothing more than
chance variation; avoid mislabeling “noise” in the data as if it were
“signal”
– Caution: if the dataset is not much larger than the number of
predictor variables, it is very likely that such a spurious relationship
will creep into the model
16
Overfitting
17
Preliminary Steps
(Pre-processing and Cleaning the Data)
• How many variables & how much data
• A good rule of thumb is to have ten records for every predictor
variable.
• For classification procedures
– At least 6xmxp records,
– Where m = number of outcome classes, and p = number of variables
• Compactness or parsimony is a desirable feature in a model.
• A matrix of x-y plots can be useful in variable selection.
• Can see at a glance x-y plots for all variable combinations.
– A straight line would be an indication that one variable is exactly correlated
with another.
– We would want to include only one of them in our model.
• Weed out irrelevant and redundant variables from our model
• Consult domain expert whenever possible
18
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Outliers
– Values that lie far away from the bulk of the data are called outliers
– no statistical rule can tell us whether such an outlier is the result of an
error
– these are judgments best made by someone with “domain"
knowledge
– if the number of records with outliers is very small, they might be
treated as missing data.
19
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Missing values
– If the number of records with missing values is small, those records
might be omitted
– The more variables, the more records are likely to be dropped
• One solution – replace a missing value with the average of that variable
computed from the records that have valid data
• This reduces the variability in the data set
– Human judgment can be used to determine best way to handle
missing data
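A minimal sketch (not from the slides) of that average-value imputation in pandas; the column name and values are made up for illustration:

    import pandas as pd

    # Hypothetical variable with two missing values.
    df = pd.DataFrame({"income": [52000, None, 61000, 48000, None, 57000]})

    # Replace missing values with the average computed from the valid records.
    df["income"] = df["income"].fillna(df["income"].mean())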
20
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Normalizing (standardizing) the data
– To normalize the data, we subtract the mean from each value, and divide
by the standard deviation of the resulting deviations from the mean
• Expressing each value as the “number of standard deviations away from the
mean” – the z-score
• Needed if variables are in different units, e.g. hours, thousands of dollars, etc.
– Clustering algorithms measure variables values in distance from each
other – need a standard value for distance.
– Data mining software, including XLMiner, typically has an option that
normalizes the data in those algorithms where it may be required
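A minimal sketch (not from the slides) of this normalization in pandas; the column names and values are made up for illustration, and the slide notes that XLMiner offers the same operation as a built-in option:

    import pandas as pd

    df = pd.DataFrame({
        "hours": [10, 25, 40, 5],
        "dollars_000": [1.2, 3.4, 0.8, 2.0],
    })

    # z-score: subtract each column's mean and divide by its standard deviation.
    z = (df - df.mean()) / df.std()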
21
Preliminary Steps
• Use and creation of partition
– Training partition
• The largest partition
• Contains the data used to build the various models
• Same training partition is generally used to develop multiple models.
– Validation partition
• Used to assess the performance of each model,
• Used to compare models and pick the best one.
• In classification and regression tree algorithms the validation partition
may be used automatically to tune and improve the model.
– Test partition
• Sometimes called the “holdout” or “evaluation” partition; used to assess
the performance of the chosen model on new data.
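A minimal sketch (not part of the slides) of creating the three partitions in Python; the toy data and the 50/30/20 proportions are illustrative assumptions:

    import pandas as pd

    # Toy data set: one record per row.
    df = pd.DataFrame({"x": range(100), "y": range(100)})

    shuffled = df.sample(frac=1, random_state=1)        # shuffle the records
    n = len(shuffled)
    train = shuffled.iloc[: int(0.5 * n)]               # 50% training partition
    valid = shuffled.iloc[int(0.5 * n): int(0.8 * n)]   # 30% validation partition
    test = shuffled.iloc[int(0.8 * n):]                 # 20% test (holdout) partition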
22
The Three Data Partitions and Their
Role in the Data Mining Process
23
Example – Linear Regression
Boston Housing Data

CRIM   ZN   INDUS CHAS NOX  RM   AGE  DIS  RAD TAX PTRATIO B   LSTAT MEDV CAT.MEDV
0.006  18   2.31  0    0.54 6.58 65.2 4.09 1   296 15.3    397 5     24   0
0.027  0    7.07  0    0.47 6.42 78.9 4.97 2   242 17.8    397 9     21.6 0
0.027  0    7.07  0    0.47 7.19 61.1 4.97 2   242 17.8    393 4     34.7 1
0.032  0    2.18  0    0.46 7.00 45.8 6.06 3   222 18.7    395 3     33.4 1
0.069  0    2.18  0    0.46 7.15 54.2 6.06 3   222 18.7    397 5     36.2 1
0.030  0    2.18  0    0.46 6.43 58.7 6.06 3   222 18.7    394 5     28.7 0
0.088  12.5 7.87  0    0.52 6.01 66.6 5.56 5   311 15.2    396 12    22.9 0
0.145  12.5 7.87  0    0.52 6.17 96.1 5.95 5   311 15.2    397 19    27.1 0
0.211  12.5 7.87  0    0.52 5.63 100  6.08 5   311 15.2    387 30    16.5 0
0.170  12.5 7.87  0    0.52 6.00 85.9 6.59 5   311 15.2    387 17    18.9 0
24
CRIM     per capita crime rate by town
ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS    proportion of non-retail business acres per town
CHAS     Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX      nitric oxides concentration (parts per 10 million)
RM       average number of rooms per dwelling
AGE      proportion of owner-occupied units built prior to 1940
DIS      weighted distances to five Boston employment centres
RAD      index of accessibility to radial highways
TAX      full-value property-tax rate per $10,000
PTRATIO  pupil-teacher ratio by town
B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT    % lower status of the population
MEDV     median value of owner-occupied homes in $1000s
25
Partitioning the data
26
Using XLMiner for Multiple Linear Regression
27
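The XLMiner dialogs appear as screenshots in the original slides and are not reproduced here. As a rough stand-in, here is a hedged Python sketch of fitting a multiple linear regression to the ten records listed above; the choice of predictors (CRIM, RM, LSTAT) is an illustrative assumption, and a real analysis would use the full data set:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # The ten records shown on the data slide (a tiny illustrative subset).
    df = pd.DataFrame({
        "CRIM":  [0.006, 0.027, 0.027, 0.032, 0.069, 0.030, 0.088, 0.145, 0.211, 0.170],
        "RM":    [6.58, 6.42, 7.19, 7.00, 7.15, 6.43, 6.01, 6.17, 5.63, 6.00],
        "LSTAT": [5, 9, 4, 3, 5, 5, 12, 19, 30, 17],
        "MEDV":  [24, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9],
    })

    X = df[["CRIM", "RM", "LSTAT"]]   # predictors (illustrative subset)
    y = df["MEDV"]                    # outcome: median home value

    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)  # actual minus predicted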
Specifying Output
28
Prediction of Training Data
Row Id.  Predicted Value  Actual Value  Residual
1        30.24690555      24            -6.246905549
4        28.61652272      33.4           4.783477282
5        27.76434086      36.2           8.435659135
6        25.6204032       28.7           3.079596801
9        11.54583087      16.5           4.954169128
10       19.13566187      18.9          -0.235661871
12       21.95655773      18.9          -3.05655773
17       20.80054199      23.1           2.299458015
18       16.94685562      17.5           0.553144385
29
Prediction of Validation Data
Row Id.  Predicted Value  Actual Value  Residual
2        25.03555247      21.6          -3.435552468
3        30.1845219       34.7           4.515478101
7        23.39322259      22.9          -0.493222593
8        19.58824389      27.1           7.511756109
11       18.83048747      15            -3.830487466
13       21.20113865      21.7           0.498861352
14       19.81376359      20.4           0.586236414
15       19.42217211      18.2          -1.222172107
16       19.63108414      19.9           0.268915856
30
Summary of errors
Training Data scoring – Summary Report
Total sum of squared errors   RMS Error     Average Error
6977.106                      4.790720883   3.11245E-07

Validation Data scoring – Summary Report
Total sum of squared errors   RMS Error     Average Error
4251.582211                   4.587748542   -0.011138034
31
RMS error
• Error = actual - predicted
• RMS = Root-mean-squared error
• = Square root of average squared error
• In the previous example the sizes of the training and
validation sets differ, so only the RMS error and
average error are comparable (not the total sum of squared errors)
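A minimal sketch (not from the slides) of how these two error measures are computed; the actual and predicted values are rounded from the first rows of the training-data table above:

    import math

    # Rounded actual and predicted values from the first four training rows.
    actual = [24.0, 33.4, 36.2, 28.7]
    predicted = [30.2, 28.6, 27.8, 25.6]

    errors = [a - p for a, p in zip(actual, predicted)]             # error = actual - predicted
    rms_error = math.sqrt(sum(e * e for e in errors) / len(errors)) # root of average squared error
    average_error = sum(errors) / len(errors)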
32
Using Excel and XLMiner for Data
Mining
• Excel is limited in data capacity
• However, the training and validation of DM models can
be handled within the modest limits of Excel and
XLMiner
• Models can then be used to score larger databases
• XLMiner has functions for interacting with various
databases (taking samples from a database, and
scoring a database from a developed model)
33
Simple Regression Example
34
Simple Regression Model
• Make prediction about the starting salary of a current college
graduate
• Data set of starting salaries of recent college graduates
Data Set
Compute Average Salary
How certain are we of this prediction?
There is variability in the data.
35
Simple Regression Model
• Use total variation as an index of uncertainty about our prediction
Compute Total Variation
• The smaller the amount of total variation, the more accurate
(certain) our prediction will be.
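For reference (added here, not on the slide), the total variation is the sum of squared deviations of the salaries from their average:

    \text{Total variation} \;=\; \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2

where y_i is the i-th starting salary and \bar{y} is the average salary.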
36
Simple Regression Model
• How do we “explain” the variability? Perhaps it depends on
the student’s GPA
(scatter plot figure: Salary vs. GPA)
37
Simple Regression Model
• Find a linear relationship between GPA and starting salary
• As GPA increases/decreases starting salary increases/decreases
38
Simple Regression Model
• Least squares method to find the regression model
– Choose a and b in the regression model (equation) so as to minimize the sum
of the squared deviations – actual Y value minus predicted Y value (Y-hat)
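In symbols (added for reference, matching the description above):

    \hat{y}_i = a + b\,x_i, \qquad
    \min_{a,\,b} \; \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)^2
    \;=\; \min_{a,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (a + b\,x_i) \bigr)^2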
39
Simple Regression Model
• How good is the model?
a= 4,779 & b = 5,370
A computer program computed these values
u-hat is a “residual” value
The sum of all u-hats is zero
The sum of all u-hats squared is the total variance not explained by the model
“unexplained variance” is 7,425,926
40
Simple Regression Model
Total Variation = 23,000,000
41
Simple Regression Model
Total Unexplained Variation = 7,425,726
42
Simple Regression Model
• Relative Goodness of Fit
– Summarize the improvement in prediction using regression model
• Compute R2 – coefficient of determination
The regression model (equation) is a better predictor than guessing the average salary
The GPA is a more accurate predictor of starting salary than the average alone
R2 is the “performance measure” for the model
Predicted Starting Salary = 4,779 + 5,370 × GPA
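Using the figures quoted on the preceding slides (total variation 23,000,000 and an unexplained variation of roughly 7.43 million; the two slides quote slightly different values), R2 works out to approximately:

    R^2 \;=\; 1 - \frac{\text{unexplained variation}}{\text{total variation}}
        \;\approx\; 1 - \frac{7{,}425{,}726}{23{,}000{,}000} \;\approx\; 0.68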
43
Detailed Regression Example
44
Data Set
Obs #  Salary  GPA  Months Work
1      20000   2.8  48
2      24500   3.4  24
3      23000   3.2  24
4      25000   3.8  24
5      20000   3.2  48
6      22500   3.4  36
7      27500   4.0  20
8      19000   2.6  48
9      24000   3.2  36
10     28500   3.8  12
45
Scatter Plot – GPA vs. Salary
(figure: scatter plot of GPA (0–4.5) against Salary ($0–$30,000))
46
Scatter Plot – Work vs. Salary
(figure: scatter plot of Months Work (0–60) against Salary ($0–$30,000))
47
Pearson Correlation Coefficients
-1 <= r <= 1
             Salary     GPA        Months Work
Salary       1
GPA          0.898007   1
Months Work  -0.93927   -0.82993   1
48
Three Regressions
• Salary = f(GPA)
• Salary = f(Work)
• Salary = f(GPA, Work)
• Interpret Excel Output
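A hedged Python sketch (not part of the slides, which use Excel) of fitting the same three models to the data set on the earlier "Data Set" slide with statsmodels; the column name "Work" is shortened from "Months Work", and the printed coefficients and R2 values should roughly match the Excel output on the following slides:

    import pandas as pd
    import statsmodels.api as sm

    # Data typed in from the "Data Set" slide (10 graduates).
    df = pd.DataFrame({
        "Salary": [20000, 24500, 23000, 25000, 20000, 22500, 27500, 19000, 24000, 28500],
        "GPA":    [2.8, 3.4, 3.2, 3.8, 3.2, 3.4, 4.0, 2.6, 3.2, 3.8],
        "Work":   [48, 24, 24, 24, 48, 36, 20, 48, 36, 12],
    })

    y = df["Salary"]
    for predictors in (["GPA"], ["Work"], ["GPA", "Work"]):
        X = sm.add_constant(df[predictors])          # adds the intercept term
        model = sm.OLS(y, X).fit()
        print(predictors, round(model.rsquared, 4), model.params.round(2).to_dict())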
49
Interpreting Results
• Regression Statistics
– Multiple R,
– R2,
– R2adj
– Standard Error Sy
• Statistical Significance
– t-test
– p-value
– F test
50
Regression Statistics Table
• Multiple R
– R = square root of R2
• R2
– Coefficient of Determination
• R2adj
– used if more than one x variable
• Standard Error Sy
– This is the sample estimate of the standard
deviation of the error (actual – predicted)
51
ANOVA Table
• The ANOVA table gives the F statistic
• Tests the claim that
– there is no significant relationship between any of your
independent variables and the dependent variable
• The Significance F value is a p-value
• You should reject the claim
– of NO significant relationship between your
independent and dependent variables if p < α
– Generally α = 0.05
52
Regression Coefficients Table
• The Coefficients column gives
– the b0, b1, b2, …, bn values for the regression equation
– b0 is the intercept
– the b1 value is next to your independent variable x1
– b2 is next to your independent variable x2
– b3 is next to your independent variable x3
53
Regression Coefficients Table
• p-values for individual t tests on each independent
variable
• t test – tests the claim that there is no relationship
between the independent variable (in the
corresponding row) and your dependent variable
• You should reject the claim
• of NO significant relationship between your independent variable (in
the corresponding row) and the dependent variable if p < α.
54
Salary = f(GPA)
Regression Statistics   f(GPA)
Multiple R              0.898006642
R Square                0.806415929
Adjusted R Square       0.78221792
Standard Error          1479.019946
Observations            10

ANOVA
            df   SS         MS         F          Significance F
Regression  1    72900000   72900000   33.32571   0.00041792
Residual    8    17500000   2187500
Total       9    90400000

            Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept   1928.571429    3748.677         0.514467   0.620833   -6715.89326   10573.04
GPA         6428.571429    1113.589         5.772843   0.000418   3860.63173    8996.511
55
Salary = f(Work)
Regression Statistics   f(Work)
Multiple R              0.939265177
R Square                0.882219073
Adjusted R Square       0.867496457
Standard Error          1153.657002
Observations            10

ANOVA
            df   SS            MS         F          Significance F
Regression  1    79752604.17   79752604   59.92271   5.52993E-05
Residual    8    10647395.83   1330924
Total       9    90400000

              Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept     30691.66667    1010.136344      30.38369   1.49E-09   28362.28808    33021.0453
Months Work   -227.864583    29.43615619      -7.74098   5.53E-05   -295.7444812   -159.98469
56
Salary = f(GPA, Work)
Regression Statistics   f(GPA, Work)
Multiple R              0.962978985
R Square                0.927328525
Adjusted R Square       0.906565246
Standard Error          968.7621974
Observations            10

ANOVA
            df   SS         MS         F          Significance F
Regression  2    83830499   41915249   44.66195   0.00010346
Residual    7    6569501    938500.2
Total       9    90400000

              Coefficients    Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept     19135.92896     5608.184         3.412144   0.011255   5874.682112    32397.176
GPA           2725.409836     1307.468         2.084495   0.075582   -366.2602983   5817.08
Months Work   -151.2124317    44.30826         -3.41274   0.011246   -255.9848174   -46.440046
57
Compare Three “Models”
Regression Statistics   f(GPA)        f(Work)       f(GPA, Work)
Multiple R              0.898006642   0.939265177   0.962978985
R Square                0.806415929   0.882219073   0.927328525
Adjusted R Square       0.78221792    0.867496457   0.906565246
Standard Error          1479.019946   1153.657002   968.7621974
Observations            10            10            10
58