Data Mining Chapter 1

Data vs Information
• Society produces huge amounts of data
• Potentially valuable resource
• Needs to be organized to reveal the patterns underlying the
data
Data Mining
• People have always looked for patterns
• Computing power makes this more feasible
• Data Mining = Extraction of implicit, previously
unknown, and potentially useful information
from data
Data Mining Has Commercial
Value
• Example Practical Uses
– Database Marketing
– Credit Approval
– Fraud Detection (e.g. Credit Card)
3 General Techniques
• Statistical
• Machine Learning
• Visualization
Machine Learning
• Skip the philosophy
• Acquisition of knowledge and ability to use it
• Computer performance via indirect
programming rather than direct programming
Black Box vs Clear Box
• Black Box – what the program has learned and
how it is using it is incomprehensible to humans
• Clear Box – what is learned is understandable
– Structural descriptions represent patterns
explicitly (e.g. as rules)
– ….
Let’s Look at Some Data and
Structural Patterns
• clraw - Access database with data related to
community crime – including socioeconomic
and law enforcement data
• cl-revised-attribute-list – lists attribute names
• NJDOHalldata – Access database with data
related to community addictions
• njcrimenominal – heavily processed from clraw
– ready to run Weka data mining tools on
• …
Run Weka
• On njcrimenominal –
– Rules – Prism
– Tree – ID3
• On my-weather –
– Rules – Prism
– Tree – ID3
Black Box
• Some machine learning / data mining methods may be
successful but not produce understandable structural
descriptions
• E.g. neural networks
• This book does not focus on them
– Frequently what is desired is a “take home message” for
future HUMAN decision making – this requires an
understandable result of learning
– Frequently, even if the results of learning may be used
“automatically” by a computer program, the human decision to
TRUST the automatic program may depend on the human
being able to make sense of and evaluate what has been learned
The Real World
• The examples we learn from are NOT complete
• Data is NOT complete
• Frequently there are ERRORS or mistakes in
the data
Lessons from Simple Examples –
Weather
• Numeric vs “Symbolic” attributes
– Nominal/ categorical weather data could have at
most 3 X 3 X 2 X 2 = 36 possible combinations of
values
– With numeric values for temperature and humidity,
there is a much larger set of possibilities, and the
learning problem gets much more difficult – may
need to learn an inequality test (e.g. temp > 80)
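The 3 X 3 X 2 X 2 = 36 count can be checked by enumerating the combinations. This is a minimal sketch: the attribute names and values below are an assumption (the slide gives only the counts), and `inequality_test` is a hypothetical helper showing the kind of threshold test a numeric attribute calls for.

```python
from itertools import product

# Assumed nominal values for the four weather attributes; the slide
# itself only states how many values each attribute has (3, 3, 2, 2).
attribute_values = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": [True, False],
}

# Every distinct instance the nominal attributes can describe.
combinations = list(product(*attribute_values.values()))
print(len(combinations))  # 3 * 3 * 2 * 2


def inequality_test(temperature, threshold=80):
    """Hypothetical numeric test: the threshold has to be learned, not listed."""
    return temperature > threshold
```

With numeric attributes there is no finite list of values to enumerate, so the learner has to choose a threshold such as 80 from the data.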
Lessons from Simple Examples –
Weather
• Classification vs Association tasks
– In our examples so far, we learned rules / trees to
predict the value for the “Play” attribute
• It was pre-determined what was to be predicted
– A less-focused approach would be to look for any
rule that can be inferred from the data (e.g. if temp =
cool then humidity = normal)
– It is possible to generate large sets of association
rules; these must be carefully controlled
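The example rule (if temp = cool then humidity = normal) can be checked by counting. This is a sketch that assumes my-weather holds the standard 14-instance nominal weather table; `rule_stats` is a hypothetical helper, not a Weka function.

```python
# Assumption: the standard 14-instance nominal weather data, taken to
# match the my-weather file used in the Weka runs.
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
weather = [dict(zip(FIELDS, row)) for row in ROWS]


def rule_stats(instances, antecedent, consequent):
    """Return (coverage, confidence) for 'if antecedent then consequent'."""
    def matches(instance, conditions):
        return all(instance[attr] == value for attr, value in conditions.items())

    covered = [inst for inst in instances if matches(inst, antecedent)]
    correct = [inst for inst in covered if matches(inst, consequent)]
    confidence = len(correct) / len(covered) if covered else 0.0
    return len(covered), confidence


coverage, confidence = rule_stats(
    weather, {"temperature": "cool"}, {"humidity": "normal"}
)
print(coverage, confidence)
```

Any attribute can appear on the right-hand side, which is why the number of candidate rules grows quickly and minimum coverage and confidence limits are typically used to keep the rule set under control.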
Lessons from Simple Data –
Contact Lenses
• Decision Rules vs Decision Trees
– A decision tree may be a more compact representation
of the patterns in the data – see next two slides
If tear production rate = reduced then recommendation = none.
If age = young and astigmatic = no and tear production rate = normal
then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production
rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and
astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and
tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and
tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal
then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
Figure 1.1 Rules for the contact lens data.
Figure 1.2 Decision tree for the contact lens data.
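The rules of Figure 1.1 can be written down directly as data, which is what makes them a "clear box". The sketch below assumes the rules are tried in the order listed and that the first match wins; the figure does not state an ordering, and the key names (`tear_rate`, `prescription`) are hypothetical shorthand.

```python
# Each rule is (conditions, recommendation), in the order of Figure 1.1.
# Assumption: rules are tried top to bottom and the first match wins.
RULES = [
    ({"tear_rate": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def recommend(patient):
    """Return the recommendation of the first matching rule, or None."""
    for conditions, recommendation in RULES:
        if all(patient.get(attr) == value for attr, value in conditions.items()):
            return recommendation
    return None


patient = {
    "age": "young",
    "prescription": "myope",
    "astigmatic": "no",
    "tear_rate": "normal",
}
print(recommend(patient))
```

A decision tree encodes the same decisions as nested tests, so shared conditions such as the tear production rate are tested once near the root rather than repeated in several rules.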
Lessons from Simple Data –
CPU performance
• In some cases, what is desired is a numeric
prediction rather than a classification (here
prediction of CPU performance) – see table 1.5
on p15
• This is more challenging, and some methods
used for classification are not appropriate for
numeric prediction
• Statistical regression is the standard for
comparison
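As a point of reference, the simplest form of statistical regression fits a straight line by least squares. This is a minimal sketch on hypothetical toy numbers (not the CPU performance data), with a single input attribute; `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one attribute)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy values chosen so the fit is easy to check by hand.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

The output is a number rather than a class label, so accuracy is judged by the size of the prediction error instead of by counting correct classifications.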
Lessons from Simple Data –
Labor Negotiations
• As we saw with the crime dataset, realistic datasets
sometimes have missing values. This is also
seen in the labor negotiations dataset (sketched
in Table 1.6 on p16)
• Also – not all data mining involves thousands of
records – generating this 57-record dataset was
A LOT OF WORK
Lessons from Simple Data –
Labor Negotiations
• Training and Testing –
– A program that uses data to learn (train) should be tested on
other data (as an independent judgment)
• Overfitting –
– Getting 100% correct on training data may not always be the
best thing to do
– May be adjusting for idiosyncratic aspects of the training
data that will not generalize to yet-to-be-seen instances
– Assuming the goal is to learn something that will be useful in
the future, then it is better to avoid “overfitting” what is
learned to the training data – see next slide
Figure 1.3 Decision trees for the labor negotiations data, panels (a) and (b).
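The training/testing and overfitting points can be illustrated on a small synthetic problem. Everything below is a made-up toy, not the labor negotiations data: the true concept is `x >= 10`, two training labels are deliberately wrong, and a "memorizer" (1-nearest-neighbour) is compared with a single learned threshold.

```python
def true_label(x):
    """The concept we would like to learn (synthetic)."""
    return 1 if x >= 10 else 0


NOISY = {5, 15}  # training points whose recorded label is wrong

# Train on odd x (with two label errors), test on even x (clean labels).
train = [(x, (1 - true_label(x)) if x in NOISY else true_label(x))
         for x in range(1, 20, 2)]
test = [(x, true_label(x)) for x in range(0, 20, 2)]


def accuracy(predict, data):
    """Fraction of (x, label) pairs the predictor gets right."""
    return sum(predict(x) == y for x, y in data) / len(data)


def memorizer(x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def fit_threshold(data):
    """Pick the cut-off t that makes 'x >= t' most accurate on the data."""
    return max(range(21), key=lambda t: accuracy(lambda x: int(x >= t), data))


t = fit_threshold(train)


def threshold_rule(x):
    return int(x >= t)


for name, model in (("memorizer", memorizer), ("threshold", threshold_rule)):
    print(name, accuracy(model, train), accuracy(model, test))
```

The memorizer is perfect on the training data because it reproduces the two label errors, and those errors then cost it accuracy on the held-out data. The simpler threshold rule gets two training instances "wrong" but generalizes better.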
Fielded Applications
• Loan approval (for borderline cases)
– 20 attributes – age, years with current employer,
years at current address, years with bank, other
credit …
– 1000 training cases
– Learn rules
– 2/3 correct vs ½ correct by human decision makers
– Rules could be used to explain decision to rejected
applicants
Fielded Applications
• Marketing and Sales
– E.g. a bank detecting when it might be in danger of losing a
customer
• E.g. bank-by-phone customers who call at times when response is
slow
• Market Basket Analysis (Supermarkets and …)
– Use of “association” techniques to determine groups of items
that tend to occur together in transactions
– Famous example – diapers and beer
• May help in planning store layouts
• Send out coupons for one of the items
• Give out cash register coupons for one when other is purchased
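The core of market basket analysis, finding items that tend to occur together, can be sketched as pair counting. The transactions below are hypothetical, and real association learners use more efficient schemes than counting every pair.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
]

# Count how often each unordered pair of items shares a transaction.
pair_counts = Counter()
for basket in transactions:
    # Sorting gives each pair one canonical order, e.g. ("beer", "diapers").
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)
```

Pairs with high counts are the candidates for layout or coupon decisions like those listed above.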
Fielded Applications
• Direct Marketing
– Better response rate allows fewer mailings to
produce same result, or same mailings to produce
larger sales
– May use data from outside organization too (e.g.
socio-economic data based on zip code)
Other Fielded Applications
• Oil slick detection from images
• Energy use (load) prediction
• Diagnosis of machine faults
1.5 Learning as “Search”
• The space of possible rule sets or possible decision trees is too
large to consider them all and pick the best
• Many learning algorithms follow a “hill climbing” approach
to search – try one possibility, then look for small changes that
will improve; stop when no improvement can be found.
• Such a process IS NOT guaranteed to find the best solution. It is
“heuristic” – it uses rules of thumb that are likely to help, but
which are not a sure thing.
• But it is more tractable than exhaustive brute-force search.
• In many learning schemes, search is also “greedy” – decisions,
once made, are not retracted
• Learning methods have a “bias” – they are more likely to come
to some conclusions than to others – for efficiency's sake
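The hill-climbing idea can be sketched on a toy search space. Here the candidate solutions are simply positions in a list of hypothetical quality scores, and the "small changes" are moves to an adjacent position; a real learner would instead search over rule sets or trees.

```python
def hill_climb(scores, start):
    """Greedy hill climbing: move to the better neighbour until none improves."""
    current = start
    while True:
        neighbours = [i for i in (current - 1, current + 1)
                      if 0 <= i < len(scores)]
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[current]:
            return current  # no small change improves the score: stop
        current = best      # greedy: the move is never retracted


# Hypothetical scores with a local peak (index 2) and the best peak (index 6).
scores = [1, 3, 5, 4, 2, 6, 9, 7]
print(hill_climb(scores, start=0))
print(hill_climb(scores, start=4))
```

Starting at index 0 the search stops on the local peak and never sees the better solution at index 6, which is the sense in which the process is heuristic rather than guaranteed.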
1.6. Data Mining and Ethics
• Discrimination
• Privacy policies
• Human common sense on using conclusions
• Profits vs service
End Chapter 1
• Show coming into Weka from Start, opening
file, visualizing data