Data Mining & Machine Learning Course Overview

Introduction & Overview
F20DL/F21DL
Data Mining & Machine Learning
Outline
• Course description and Important Information
‣ Assessment
• Overview on data mining and machine learning
‣ Scope and Goals
• Example applications
• What does it mean to “learn by example”?
‣ Classification tasks
‣ Inductive bias
‣ Formalizing learning
• Ethical issues related to data
Course Logistics
Global Course Team
• Edinburgh
‣ Ioannis Konstas (i.konstas@hw.ac.uk)
‣ Dongdong Chen (d.chen@hw.ac.uk)
‣ Marta Vallejo (m.valleho@hw.ac.uk)
• Dubai
‣ Neamat El Gayar (n.elgayar@hw.ac.uk)
‣ Radu-Casian Mihăilescu (r.mihailescu@hw.ac.uk)
• Malaysia
‣ Abdullah Almasri (a.almasri@hw.ac.uk)
• Lab tutors: see extended Teaching Team
VLE (Virtual Learning Environment): Canvas
• We use ONLY F21DL as the Canvas course
• If you enrolled in F20DL, you should also see
‣ F21DL (Data Mining and Machine Learning) on your list of courses
• You will find all relevant information there, so please check it regularly:
https://canvas.hw.ac.uk
‣ Course Outline
‣ Learning Materials / Reading List
‣ Python Tutorials
‣ Weekly lab tasks
‣ Coursework description and guidelines
‣ Supplementary material and more
‣ (Campus specific information will be labelled accordingly)
Assessment
• Exam (60%)
‣ Online Canvas
• Group Coursework (40%)
‣ You will write a series of Python programs (Jupyter notebooks) to practise and understand the main concepts of data mining and machine learning
‣ This coursework will be assessed in parts during the semester
‣ The weekly labs will help you practise Python and implement the goals of the project
‣ MSc students have an extra component added to the coursework
‣ Students will work in groups of 5
Data Mining and Machine Learning Portfolio (Group Work)
Why do we want you to work in groups?
• To encourage discussion that justifies findings and explains conclusions
• To share experiences and explore wider ideas
• To divide work and do better research on each task
Software & Reading
• Python
‣ 6 Python tutorials on data processing & training models using scikit-learn
• Important links to online resources are provided
‣ Recommended books with GitHub repositories
https://github.com/ageron/handson-ml3
Main Textbook
• A Course in Machine Learning by Hal Daumé III
• Free online: http://ciml.info
• You can also find a slightly modified version on Canvas
Additional Textbook
• Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Ian H. Witten, Eibe Frank and Mark A. Hall, Elsevier, 2011
• New 4th Edition
• “Deep learning”
https://www.cs.waikato.ac.nz/ml/weka/book.html
Data Mining and Machine Learning: Overview
Data (Volume, Variety, Veracity, Velocity)
What is Machine Learning?
• Algorithms and techniques that allow computers to learn and extract information from data automatically
(We will give a more formal definition later…)
• Find a model that best captures data characteristics
• Paradigm: “Programming by example”
‣ Replace “human writing code” with “human supplying data”
• Most central issue: generalisation
‣ How to abstract from “training” examples to “test” examples?
[Figure: Data, Algorithm, Model 𝑓(𝑥)]
What’s the relationship between Data Science (DS), Artificial Intelligence (AI) and Machine Learning (ML)?
Data Science (DS), Artificial Intelligence (AI) and Machine Learning (ML)
• Artificial Intelligence: deals with mimicking human thinking and perception
• Machine Learning: algorithms that allow computers to learn from data, discover patterns and forecast trends
• Data Science: a collection of methods to draw actionable insights from raw data
Types of machine learning schemes
1. Classification
• Given a set of classified examples, learn to classify a new example
2. Association
• Find any interesting associations between attributes or combinations of attributes
3. Clustering
• Group together similar examples
4. Numeric prediction
• Instead of classifying, predict a numeric value
(We will mention examples later…)
Data Mining and Machine Learning: Scope and Goals
Goal of DM&ML Course
• Studying techniques to explore, analyse and better understand data, and to discover meaningful patterns, associations and rules
So, what can we do with data?
Data Mining & Machine Learning to answer questions
Simple Examples
• How many female clients do we have?
• How much paint did we sell in 2007?
• Which is the most profitable branch of our supermarket?
• Which postcodes suffered the most dropped calls in July?
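Questions of this kind need no learning: each is a filter followed by a count or a sum. A minimal sketch in plain Python, assuming a hypothetical list of sales records (the field names and figures below are invented for illustration):

```python
# Hypothetical sales records; field names and values are illustrative only.
sales = [
    {"product": "paint", "year": 2007, "litres": 120.0, "branch": "North"},
    {"product": "paint", "year": 2007, "litres": 80.5, "branch": "South"},
    {"product": "paint", "year": 2008, "litres": 60.0, "branch": "North"},
    {"product": "brushes", "year": 2007, "litres": 0.0, "branch": "North"},
]

# "How much paint did we sell in 2007?" is a filter followed by a sum.
paint_2007 = sum(
    r["litres"] for r in sales if r["product"] == "paint" and r["year"] == 2007
)
print(paint_2007)
```

The harder questions on the next slide cannot be answered this way, because the answer is not already stored in any record.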
More valuable (and harder) questions
• How can we predict ovarian cancer early enough to treat it successfully?
• How can I tell whether an essay was written by a Generative AI model?
• How can we get our customers to spend more money in the store?
• Is this loan applicant a good credit risk?
• Is this sonar image a mine or a rock?
• What other websites will this customer be interested in?
Some examples of large databases
• Shopping basket data
‣ Much commercial DM is done with this. In one store, 18,000 baskets per month. Tesco has >500 stores. Per year, 100,000,000 baskets?
• The Internet
‣ ~1.1 billion websites (in 2024), with 194 million of them active (source: Gemini AI)
How can we begin to understand and exploit such datasets?
• Like this...
[Chart: average number of distinct items purchased per visit]
[Chart: distribution of visits per day of the week]
• But there are also more interesting questions:
‣ What items do people tend to buy together?
Goal of DM&ML
• Given some data, we want to extract information that is
‣ Implicit
‣ Previously unknown
‣ Potentially useful
‣ Non-trivial
• We need programs to extract patterns and regularities from the data
• Strong patterns lead to good answers and predictions
• But
‣ Most patterns are not interesting
‣ Patterns may be inexact (or spurious)
‣ Data may be garbled or missing
‣ There may be more data than we can process
Applications
1. Processing loan applications (American Express)
• Given:
‣ questionnaires with financial and personal information
• Question:
‣ should money be lent?
• Simple statistical method covers 90% of cases
• Borderline cases referred to loan officers
• But: 50% of accepted borderline cases defaulted!
‣ Solution: reject all borderline cases?
‣ No! Borderline cases are most active customers
Loan applications with machine learning
• 1000 training examples of borderline cases
• 20 attributes e.g.
‣ age
‣ years with current employer
‣ years at current address
‣ years with the bank
‣ other credit cards possessed,…
• Learned rules: correct on 70% of cases
• Human experts only 50%
• Rules could be used to explain decisions to customers
2. Detecting oil slicks from images
• Radar satellite images of coastal waters
• Oil slicks appear as dark regions with changing size and shape
• Not easy: lookalike dark regions can be caused by weather conditions (e.g., high wind)
• Expensive process requiring highly trained personnel
Detecting oil slicks from images with machine learning
• Extract the dark regions from normalized images
• Attributes:
‣ size of region
‣ shape, area
‣ intensity
‣ sharpness and jaggedness of boundaries
‣ proximity of other regions
‣ info about background
• Constraints:
‣ Few training examples—oil slicks are rare!
‣ Unbalanced data: most dark regions aren’t slicks
‣ Requirement: adjustable false-alarm rate
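The adjustable false-alarm rate can be met by thresholding a classifier's score. The sketch below assumes a hypothetical classifier has already assigned each dark region a score (higher meaning more slick-like); the scores, labels and function names are illustrative and not taken from the original system.

```python
def threshold_for_false_alarm_rate(scores, labels, max_rate):
    """Return a threshold whose false-alarm rate is at most max_rate.

    scores: classifier scores, higher means "more likely an oil slick".
    labels: 1 for a true slick, 0 for a lookalike.
    A region is flagged when its score is >= the threshold.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    allowed = int(max_rate * len(negatives))  # false alarms we may tolerate
    if allowed >= len(negatives):
        return float("-inf")  # every region may be flagged
    # Flag only scores strictly above the (allowed + 1)-th highest lookalike score.
    cutoff = negatives[allowed]
    higher = [s for s in scores if s > cutoff]
    return min(higher) if higher else float("inf")


def rates(scores, labels, threshold):
    """Return (false-alarm rate, detection rate) at the given threshold."""
    flagged = [s >= threshold for s in scores]
    false_alarms = sum(f and y == 0 for f, y in zip(flagged, labels))
    hits = sum(f and y == 1 for f, y in zip(flagged, labels))
    return false_alarms / labels.count(0), hits / labels.count(1)


# Invented, unbalanced data: 2 slicks among 10 dark regions.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

for max_rate in (0.25, 0.0):
    t = threshold_for_false_alarm_rate(scores, labels, max_rate)
    print(max_rate, t, rates(scores, labels, t))
```

On this toy data, lowering the permitted rate from 25% to 0% removes both false alarms but also misses one of the two slicks, which is the trade-off an operator would tune.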
3. Marketing and sales
• Companies precisely record massive amounts of marketing and sales data
• Applications:
‣ Customer loyalty: identify customers who are likely to defect by detecting changes in their behavior (e.g., banks/phone companies)
‣ Special offers: identify profitable customers (e.g., reliable owners of credit cards who need extra money during the holiday season)
Marketing and sales
• Market basket analysis from checkout data
‣ Find groups of items that tend to occur together in a transaction
• Historical analysis of purchasing patterns
• Identify prospective customers
• Focus promotional mailings
‣ targeted campaigns are cheaper than mass-marketed ones
Learning Associations
• Basket analysis: finding associations between products bought by customers
‣ 𝑃(𝑌|𝑋): probability that somebody who buys 𝑋 also buys 𝑌, where 𝑋 and 𝑌 are products/services
‣ Example: 𝑃( Hands-On ML with Scikit-Learn | Introduction to ML with Python ) = 0.7
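𝑃(𝑌|𝑋) can be estimated from checkout data as the fraction of baskets containing 𝑋 that also contain 𝑌. A minimal sketch, assuming a small hypothetical set of baskets (the products are invented):

```python
def conditional_purchase_probability(baskets, x, y):
    """Estimate P(y | x) as (#baskets with both x and y) / (#baskets with x)."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        raise ValueError(f"no basket contains {x!r}")
    return sum(y in b for b in with_x) / len(with_x)


# Hypothetical checkout data: one set of products per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

p_butter_given_bread = conditional_purchase_probability(baskets, "bread", "butter")
p_bread_given_butter = conditional_purchase_probability(baskets, "butter", "bread")
print(p_butter_given_bread, p_bread_given_butter)
```

The two estimates differ (0.75 and 1.0 here), which shows that the association is directional: 𝑃(𝑌|𝑋) is generally not equal to 𝑃(𝑋|𝑌).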
Regression (predicting a numerical value)
• Example: Price of a used car
• 𝑥 : car attributes
• 𝑦 : price
• 𝑦 = 𝑔(𝑥; 𝜃)
‣ 𝑔(𝑥) is a model; 𝜃 are its parameters
‣ e.g., a linear model: 𝑦 = 𝑤𝑥 + 𝑤₀
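For a single attribute, the linear model y = wx + w0 can be fitted by least squares in closed form. The sketch below assumes least squares as the fitting criterion (the slide does not say how 𝜃 is chosen) and uses invented car ages and prices:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w * x + w0 for one input attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    w0 = mean_y - w * mean_x
    return w, w0


# Hypothetical data: car age in years vs. price in thousands.
ages = [1, 2, 3, 4, 5]
prices = [18.0, 16.0, 14.0, 12.0, 10.0]

w, w0 = fit_line(ages, prices)  # the parameters theta = (w, w0)
predicted_price = w * 6 + w0    # numeric prediction for an unseen 6-year-old car
print(w, w0, predicted_price)
```

Here the parameters 𝜃 are the pair (w, w0); a real used-car model would take several attributes and noisy prices.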
…and many, many more
• Forecast electricity supply
• Diagnose machine faults
• Separate crude oil and natural gas
• Find appropriate technicians for telephone faults
• Scientific applications in biology, astronomy, chemistry
• Monitor intensive care patients
• Mine the web
• Recommend TV programs, sales, films, music…
• Self-driving cars
Usually we care about performance (prediction or classification), but structure (the reason for the answer) matters too.
“Learning by example”
Classification tasks
• How would you write a program to distinguish a picture of you from a picture of someone else?
• Provide example pictures of you and pictures of other people and let a classifier learn to distinguish the two.
• How would you write a program to distinguish if a sentence is grammatical or not?
• Provide examples of grammatical and ungrammatical sentences and let a classifier learn to distinguish the two.
• How would you write a program to distinguish cancerous cells from normal cells?
• Provide examples of cancerous cells and normal cells and let a classifier learn to distinguish the two.
Let’s try it out…
• What you have to do:
‣ Learn a classifier to distinguish class A from class B from examples
[Images: examples of class A]
[Images: examples of class B]
➡ Now: predict the class of new examples using what you have just learned
[Images: new examples to classify]
Some people say…
[Figure: new examples labelled A, B, B, A (bird/not-bird bias)]
but others say…
[Figure: new examples labelled A, A, B, B (fly/no-fly bias)]
Key ingredients needed for learning
• Training vs test examples
‣ Memorising the training examples is not enough!
‣ Need to generalise to make good predictions on test examples
• Inductive bias
‣ Many classifier hypotheses are plausible
‣ Need assumptions about the nature of the relation between examples and classes
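The difference between memorising and generalising can be made concrete with two toy classifiers. This is a sketch on invented one-dimensional data: the first classifier only looks up training examples, while the second adds the assumption (an inductive bias) that nearby inputs share a class.

```python
def memoriser(train):
    """Classifier that only recalls exact training examples."""
    table = dict(train)
    return lambda x: table.get(x)  # None means "never seen this input"


def nearest_neighbour(train):
    """Classifier with an inductive bias: nearby inputs share a class."""
    return lambda x: min(train, key=lambda ex: abs(ex[0] - x))[1]


def accuracy(classifier, examples):
    """Fraction of examples whose predicted class matches the true class."""
    return sum(classifier(x) == y for x, y in examples) / len(examples)


# Hypothetical data: small values are class "A", large values are class "B".
train = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (7.0, "B"), (8.0, "B"), (9.0, "B")]
test = [(1.5, "A"), (2.5, "A"), (7.5, "B"), (8.5, "B")]

for name, build in (("memoriser", memoriser), ("nearest neighbour", nearest_neighbour)):
    clf = build(train)
    print(name, accuracy(clf, train), accuracy(clf, test))
```

Both classifiers are perfect on the training examples, but only the one with an inductive bias says anything useful about the unseen test examples. A different bias could just as well generalise differently, as in the bird/fly example.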
“Formalizing the Learning Problem”
Machine Learning as Function Approximation
• Problem formulation
‣ Set of possible instances 𝑋
‣ Unknown target function 𝑓: 𝑋 → 𝑌
‣ Set of function hypotheses 𝐻 = {ℎ|ℎ: 𝑋 → 𝑌}
• Input
‣ Training examples $\{(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})\}$ of unknown target function 𝑓
• Output
‣ Hypothesis ℎ ∈ 𝐻 that best approximates target function 𝑓
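As a toy instance of this formulation, the sketch below takes 𝑋 to be the integers 0 to 10, 𝐻 to be a small hypothetical family of threshold functions, and picks the hypothesis that disagrees least with the training examples. The data and the choice of 𝐻 are assumptions made for illustration.

```python
# Hypothesis space H: threshold functions h_t(x) = 1 if x >= t else 0.
def make_hypothesis(t):
    return lambda x: 1 if x >= t else 0


H = {t: make_hypothesis(t) for t in range(0, 11)}

# Training examples (x, y) produced by a target f that the learner never sees.
train = [(1, 0), (3, 0), (4, 0), (6, 1), (8, 1)]


def disagreements(h, examples):
    """Number of training examples on which h differs from the given label."""
    return sum(h(x) != y for x, y in examples)


# Output: one hypothesis from H that best matches the training examples.
best_t = min(H, key=lambda t: disagreements(H[t], train))

# Every hypothesis that fits the training examples perfectly.
consistent = [t for t in H if disagreements(H[t], train) == 0]
print(best_t, consistent)
```

Two thresholds (5 and 6) fit the training examples equally well, so the data alone cannot decide between them. Which one is returned depends on a tie-breaking rule, which is a small instance of inductive bias.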
Formalising Learning: Loss Function
• 𝑙(𝑦, 𝑓(𝑥)) where 𝑦 is the truth and 𝑓(𝑥) is the system’s prediction
‣ e.g.
$$l(y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{otherwise} \end{cases}$$
• Captures our notion of what is important to learn
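In code, a loss function is simply a function of the truth and the prediction. The sketch below writes out the 0/1 loss from the slide and, as a hypothetical contrast, a loss that charges more for missing a positive case than for a false alarm (the costs 10 and 1 are invented):

```python
def zero_one_loss(y, prediction):
    """Return 0 when the prediction equals the truth y, and 1 otherwise."""
    return 0 if y == prediction else 1


def asymmetric_loss(y, prediction, miss_cost=10, false_alarm_cost=1):
    """Hypothetical loss: missing a positive (y == 1) costs more than a false alarm."""
    if y == prediction:
        return 0
    return miss_cost if y == 1 else false_alarm_cost


print(zero_one_loss("A", "A"), zero_one_loss("A", "B"))
print(asymmetric_loss(1, 0), asymmetric_loss(0, 1))
```

Both losses agree on correct predictions and differ only in how they price mistakes, which is how a loss encodes what is important to learn.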
Probabilistic model of learning
Formalising learning: Data generating distribution
• Where does the data come from?
• Data generating distribution
‣ A probability distribution 𝐷 over (𝑥, 𝑦) pairs
• We don’t know what 𝐷 is!
‣ We only get a random sample from it: our training data
Formalising learning: Expected loss
• 𝑓 should make good predictions
‣ as measured by loss 𝑙
‣ on future examples that are also drawn from 𝐷
• Formally
‣ 𝜀, the expected loss of 𝑓 over 𝐷 with respect to 𝑙, should be small
$$\varepsilon \doteq \mathbb{E}_{(x,y)\sim D}\{l(y, f(x))\} = \sum_{(x,y)} D(x,y)\, l(y, f(x))$$
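When 𝐷 is known, the expected loss is a probability-weighted sum. The sketch below assumes a tiny, made-up distribution over four (𝑥, 𝑦) pairs, which is never available in practice, purely to show the computation:

```python
from fractions import Fraction

# Hypothetical distribution D over (x, y) pairs; the probabilities sum to 1.
D = {
    ("sunny", "play"): Fraction(4, 10),
    ("sunny", "stay"): Fraction(1, 10),
    ("rainy", "play"): Fraction(1, 10),
    ("rainy", "stay"): Fraction(4, 10),
}


def zero_one_loss(y, prediction):
    return 0 if y == prediction else 1


def f(x):
    """An illustrative predictor: play when sunny, stay otherwise."""
    return "play" if x == "sunny" else "stay"


def expected_loss(D, f, loss):
    """Sum of D(x, y) * loss(y, f(x)) over all (x, y) pairs."""
    return sum(p * loss(y, f(x)) for (x, y), p in D.items())


eps = expected_loss(D, f, zero_one_loss)
print(eps)
```

Only the two pairs on which 𝑓 is wrong contribute, each with probability 1/10, so the expected loss is 1/5.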
Formalising learning: Training error
• We can’t compute expected loss because we don’t know what 𝐷 is
• We only have a sample of 𝐷
‣ training examples $\{(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})\}$
• All we can compute is the training error
$$\hat{\varepsilon} \doteq \frac{1}{N} \sum_{n=1}^{N} l(y^{(n)}, f(x^{(n)}))$$
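The training error is the average loss over the 𝑁 training examples. A minimal sketch with a made-up sample and predictor:

```python
def zero_one_loss(y, prediction):
    return 0 if y == prediction else 1


def f(x):
    """An illustrative predictor: play when sunny, stay otherwise."""
    return "play" if x == "sunny" else "stay"


def training_error(f, loss, examples):
    """Average loss of f over the N training examples."""
    return sum(loss(y, f(x)) for x, y in examples) / len(examples)


# Hypothetical training sample of N = 5 (x, y) pairs.
train = [
    ("sunny", "play"),
    ("sunny", "play"),
    ("sunny", "stay"),
    ("rainy", "stay"),
    ("rainy", "stay"),
]

eps_hat = training_error(f, zero_one_loss, train)
print(eps_hat)
```

Unlike the expected loss, this quantity needs only the sample, not 𝐷 itself.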
Formalizing learning: Minimizing the expected error
• Given
‣ a loss function 𝑙
‣ a sample from some unknown distribution 𝐷
• Our task is to compute a function 𝑓 that has a low expected error over 𝐷 with respect to 𝑙.
$$\mathbb{E}_{(x,y)\sim D}\{l(y, f(x))\} = \sum_{(x,y)} D(x,y)\, l(y, f(x))$$
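One simple way to approach this task is to pick, from a set of candidate functions, the one with the lowest training error. The sketch below assumes a made-up distribution 𝐷 (known here only so the result can be checked), draws a random sample from it, and compares the chosen function's training error with its expected loss. The strategy is an illustrative assumption, not something the slides prescribe.

```python
import itertools
import random

# Hypothetical data generating distribution D over (x, y) pairs.
D = {("sunny", "play"): 0.4, ("sunny", "stay"): 0.1,
     ("rainy", "play"): 0.1, ("rainy", "stay"): 0.4}


def loss(y, prediction):
    return 0 if y == prediction else 1


def expected_loss(f):
    """Exact expected loss; computable only because this toy D is known."""
    return sum(p * loss(y, f[x]) for (x, y), p in D.items())


def training_error(f, sample):
    return sum(loss(y, f[x]) for x, y in sample) / len(sample)


# Candidate functions: every mapping from the two inputs to the two outputs.
candidates = [dict(zip(("sunny", "rainy"), ys))
              for ys in itertools.product(("play", "stay"), repeat=2)]

# The learner only sees a random sample drawn from D, never D itself.
rng = random.Random(0)
sample = rng.choices(list(D), weights=list(D.values()), k=500)

chosen = min(candidates, key=lambda f: training_error(f, sample))
print(chosen, training_error(chosen, sample), expected_loss(chosen))
```

With a sample this large the training error is close to the expected loss; with very few examples the two can differ, and the candidate with the lowest training error need not be the one with the lowest expected loss.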
Ethics
(Some) Ethics of Data Mining
• Ethical issues arise in practical applications
• Personal Information
‣ anonymizing data is difficult
‣ 85% of Americans can be identified from just zip code, birth date and sex
• Discrimination
‣ e.g., loan applications: using some information (sex, religion, race) is unethical
‣ Ethical situation depends on the application
‣ e.g., medical applications may need the same information
• Attributes may contain problematic information
‣ e.g., area code may correlate with race
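How identifying a combination of attributes is can be checked by counting how many records share each combination. The sketch below uses invented records with names removed and reports the fraction that are unique on the chosen attributes:

```python
from collections import Counter

# Hypothetical "anonymised" records: names removed, other attributes kept.
records = [
    {"zip": "02138", "birth_date": "1960-07-31", "sex": "F"},
    {"zip": "02138", "birth_date": "1960-07-31", "sex": "M"},
    {"zip": "02139", "birth_date": "1985-01-12", "sex": "F"},
    {"zip": "02139", "birth_date": "1985-01-12", "sex": "F"},
    {"zip": "02140", "birth_date": "1972-03-05", "sex": "M"},
]


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(combos[tuple(r[k] for k in keys)] == 1 for r in records)
    return unique / len(records)


by_zip = unique_fraction(records, ["zip"])
by_all = unique_fraction(records, ["zip", "birth_date", "sex"])
print(by_zip, by_all)
```

Each attribute alone may look harmless, but the combination singles out more records, which is why removing names does not by itself anonymise a dataset.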
(Some) Ethics of Data Mining
• Important questions:
‣ Who is allowed access to the data?
- Social norms – a library will not tell you who has a book
‣ For what purpose was the data collected?
‣ What kind of conclusions can be legitimately drawn from it?
‣ Are we just reinforcing old behaviour patterns? [Data Bias]
- Learning from existing data
‣ Are results put to good use?
- Easier for shoppers or more profit for the shop?
• Justification must be attached to results
‣ Sometimes human judgment
‣ Purely statistical arguments are never sufficient!
Thinking about ethics….
• All projects at HWU need to apply for ethical approval:
‣ University Research Ethics procedure
‣ University Ethics policy
‣ Data Protection Policy
• Interesting Reads:
‣ https://www.siliconrepublic.com/enterprise/ethics-data-science-bias
‣ https://dssg.uchicago.edu/2015/09/18/an-ethical-checklist-for-data-science/
Outro
• Read from the textbook (A Course in Machine Learning):
‣ Sections 4.1-4.4, 4.5 (contains the proof of the convergence theorem; it’s good to try to understand it), 4.6 (optional, but it’s a cool extension of the standard algorithm), 4.7
Image Credits
• https://www.istockphoto.com/photos/excited-kid