Introduction & Overview
F20DL/F21DL Data Mining & Machine Learning

Outline
• Course description and important information
  ‣ Assessment
• Overview of data mining and machine learning
  ‣ Scope and goals
• Example applications
• What does it mean to “learn by example”?
  ‣ Classification tasks
  ‣ Inductive bias
  ‣ Formalising learning
• Ethical issues related to data

Course Logistics

Global Course Team
• Edinburgh
  ‣ Ioannis Konstas, i.konstas@hw.ac.uk
  ‣ Dongdong Chen, d.chen@hw.ac.uk
  ‣ Marta Vallejo, m.valleho@hw.ac.uk
• Dubai
  ‣ Neamat El Gayar, n.elgayar@hw.ac.uk
  ‣ Radu-Casian Mihăilescu, r.mihailescu@hw.ac.uk
• Malaysia
  ‣ Abdullah Almasri, a.almasri@hw.ac.uk
• Lab tutors: see the extended Teaching Team

VLE (Virtual Learning Environment): Canvas
• We ONLY use F21DL as the Canvas course
• If you enrolled in F20DL, you should also see
  ‣ F21DL (Data Mining and Machine Learning) in your list of courses
• You will find all relevant information there, so please check it regularly: https://canvas.hw.ac.uk
  ‣ Course outline
  ‣ Learning materials / reading list
  ‣ Python tutorials
  ‣ Weekly lab tasks
  ‣ Coursework description and guidelines
  ‣ Supplementary material, and more
  ‣ (Campus-specific information will be labelled accordingly)

Assessment
• Exam (60%)
  ‣ Online, on Canvas
• Group Coursework (40%)
  ‣ You will write a series of Python programs (Jupyter notebooks) to practise and understand the main concepts of data mining and machine learning
  ‣ The coursework will be assessed in parts during the semester
  ‣ The weekly labs will help you practise Python and implement the goals of the project
  ‣ MSc students have an additional coursework component
  ‣ Students will work in groups of 5

Data Mining and Machine Learning Portfolio (Group Work)
Why do we want you to work in groups?
• To encourage discussion, justify findings and explain conclusions
• To share experiences and explore wider ideas
• To divide the work and do better research on each task

Software & Reading
• Python
  ‣ 6 Python tutorials on data processing and training models using scikit-learn
• Important links to online resources are provided
  ‣ Recommended books with GitHub repositories: https://github.com/ageron/handson-ml3

Main Textbook
• A Course in Machine Learning by Hal Daumé III
• Free online: http://ciml.info
• You can also find a slightly modified version on Canvas

Additional Textbook
• Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Ian H. Witten, Eibe Frank and Mark A. Hall, Elsevier 2011
• New 4th Edition
• “Deep learning”
  https://www.cs.waikato.ac.nz/ml/weka/book.html

Data Mining and Machine Learning: Overview

Data (Volume, Variety, Veracity, Velocity)

What is Machine Learning?
• Algorithms and techniques that allow computers to
  ‣ learn and extract information from data automatically
  (We will give a more formal definition later.)
• Find a model that best captures the characteristics of the data
• Paradigm: “Programming by example”
  ‣ Replace “human writing code” with “human supplying data”
• Most central issue: generalisation
  ‣ How to abstract from “training” examples to “test” examples?
(Diagram: Data, Algorithm, Model 𝑓(𝑥))

What’s the relationship between
• Data Science (DS),
• Artificial Intelligence (AI) and
• Machine Learning (ML)?

Data Science (DS), Artificial Intelligence (AI) and Machine Learning (ML)
• Artificial Intelligence: deals with mimicking human thinking and perception
• Machine Learning: algorithms that allow computers to learn from data, discover patterns and forecast trends
• Data Science: a collection of methods for drawing actionable insights from raw data

Types of machine learning schemes
1. Classification
• Given a set of classified examples, learn to classify a new example
2.
Association
• Find any interesting associations between attributes or combinations of attributes
3. Clustering
• Group together similar examples
4. Numeric prediction
• Instead of classifying, predict a numeric value
(We will mention examples later.)

Data Mining and Machine Learning: Scope and Goals

Goal of DM&ML Course
Studying techniques to explore, analyse and better understand data, and to discover meaningful patterns, associations and rules

So, what can we do with data?

Data Mining & Machine Learning to answer questions
Simple examples
• How many female clients do we have?
• How much paint did we sell in 2007?
• Which is the most profitable branch of our supermarket?
• Which postcodes suffered the most dropped calls in July?

More valuable (and harder) questions
• How can we predict ovarian cancer early enough to treat it successfully?
• How can I tell whether an essay was written by a Generative AI model?
• How can we get our customers to spend more money in the store?
• Is this loan applicant a good credit risk?
• Is this sonar image a mine or a rock?
• What other websites will this customer be interested in?

Some examples of large databases
• Shopping basket data
  ‣ Much commercial data mining (DM) is done with this. One store sees 18,000 baskets per month. Tesco has more than 500 stores, so that is roughly 100,000,000 baskets per year (18,000 × 12 × 500 ≈ 108 million).
• The Internet
  ‣ About 1.1 billion websites (in 2024), 194 million of them active (source: Gemini AI)

How can we begin to understand and exploit such datasets?
Like this...
• Average number of distinct items purchased per visit
• Distribution of visits per day of the week
• But there are also more interesting questions:
  ‣ What items do people tend to buy together?
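
As a rough illustration of the two kinds of question above, the sketch below computes one simple summary (average number of distinct items per visit) and one "more interesting" answer (which pairs of items are bought together most often). The baskets and item names are invented for illustration, and the helper names `average_distinct_items` and `pair_counts` are hypothetical, not part of any library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each basket is the set of distinct items in one visit.
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "nappies", "beer"},
    {"bread", "milk", "beer"},
    {"bread", "eggs"},
]


def average_distinct_items(baskets):
    """Simple summary: mean number of distinct items per basket."""
    return sum(len(basket) for basket in baskets) / len(baskets)


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


avg_items = average_distinct_items(baskets)
counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]

print(f"Average distinct items per visit: {avg_items:.1f}")
print(f"Most frequent pair: {top_pair} in {top_count} baskets")
```

Counting every pair is only feasible for small baskets and small catalogues; with real checkout data the number of candidate pairs grows quickly, which is one reason dedicated association-mining methods exist.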

Goal of DM&ML
• Given some data, we want to extract information that is
  ‣ Implicit
  ‣ Previously unknown
  ‣ Potentially useful
  ‣ Non-trivial
• We need programs to extract patterns and regularities from the data
• Strong patterns lead to good answers and predictions
• But
  ‣ Most patterns are not interesting
  ‣ Patterns may be inexact (or spurious)
  ‣ Data may be garbled or missing
  ‣ There may be more data than we can process

Applications

1. Processing loan applications (American Express)
• Given:
  ‣ questionnaires with financial and personal information
• Question:
  ‣ should money be lent?
• A simple statistical method covers 90% of cases
• Borderline cases are referred to loan officers
• But: 50% of accepted borderline cases defaulted!
  ‣ Solution: reject all borderline cases?
  ‣ No! Borderline cases are the most active customers

Loan applications with machine learning
• 1000 training examples of borderline cases
• 20 attributes, e.g.
  ‣ age
  ‣ years with current employer
  ‣ years at current address
  ‣ years with the bank
  ‣ other credit cards possessed, …
• Learned rules: correct on 70% of cases
• Human experts: only 50%
• Rules could be used to explain decisions to customers

2. Detecting oil slicks from images
• Radar satellite images of coastal waters
• Oil slicks appear as dark regions with changing size and shape
• Not easy: lookalike dark regions can be caused by weather conditions (e.g., high wind)
• Expensive process requiring highly trained personnel

Detecting oil slicks from images with machine learning
• Extract the dark regions from normalised images
• Attributes:
  ‣ size of region
  ‣ shape, area
  ‣ intensity
  ‣ sharpness and jaggedness of boundaries
  ‣ proximity of other regions
  ‣ information about the background
• Constraints:
  ‣ Few training examples: oil slicks are rare!
  ‣ Unbalanced data: most dark regions aren’t slicks
  ‣ Requirement: adjustable false-alarm rate

3.
Marketing and sales
• Companies precisely record massive amounts of marketing and sales data
• Applications:
  ‣ Customer loyalty: identify customers who are likely to defect by detecting changes in their behaviour (e.g., banks/phone companies)
  ‣ Special offers: identify profitable customers (e.g., reliable owners of credit cards who need extra money during the holiday season)

Marketing and sales
• Market basket analysis from checkout data
  ‣ Find groups of items that tend to occur together in a transaction
• Historical analysis of purchasing patterns
• Identify prospective customers
• Focus promotional mailings
  ‣ targeted campaigns are cheaper than mass-marketed ones

Learning Associations
• Basket analysis: finding associations between products bought by customers
• $P(Y \mid X)$: the probability that somebody who buys $X$ also buys $Y$, where $X$ and $Y$ are products/services
• Example: $P(\text{Hands-On ML with Scikit-Learn} \mid \text{Introduction to ML with Python}) = 0.7$

Regression (predicting a numerical value)
• Example: price of a used car
• $x$: car attributes; $y$: price
• $y = g(x; \theta)$, where $g$ is a model and $\theta$ are its parameters
• Linear example: $y = wx + w_0$

... and many, many more
• Forecast electricity supply
• Diagnose machine faults
• Separate crude oil and natural gas
• Find appropriate technicians for telephone faults
• Scientific applications in biology, astronomy, chemistry
• Monitor intensive care patients
• Mine the web
• Recommend TV programmes, sales, films, music, …
• Self-driving cars
Usually we care about performance (prediction or classification), but structure (the reason for the answer) matters too.

“Learning by example”

Classification tasks
• How would you write a program to distinguish a picture of you from a picture of someone else?
• Provide example pictures of you and of other people, and let a classifier learn to distinguish the two.

Classification tasks
• How would you write a program to decide whether a sentence is grammatical or not?
• Provide examples of grammatical and ungrammatical sentences and let a classifier learn to distinguish the two.

Classification tasks
• How would you write a program to distinguish cancerous cells from normal cells?
• Provide examples of cancerous cells and normal cells and let a classifier learn to distinguish the two.

Let’s try it out…
• What you have to do:
  ‣ Learn a classifier to distinguish class A from class B from examples
(Images: examples of class A)
(Images: examples of class B)

Let’s try it out…
• What you have to do:
  ➡ Now: predict the class of new examples using what you have just learned
(Images: four new examples to classify)

Some people say…
• A, B, B, A (bird/not-bird bias)
but others say…
• A, A, B, B (fly/no-fly bias)

Key ingredients needed for learning
• Training vs test examples
  ‣ Memorising the training examples is not enough!
  ‣ Need to generalise to make good predictions on test examples
• Inductive bias
  ‣ Many classifier hypotheses are plausible
  ‣ Need assumptions about the nature of the relation between examples and classes

“Formalising the Learning Problem”

Machine Learning as Function Approximation
• Problem formulation
  ‣ Set of possible instances $X$
  ‣ Unknown target function $f: X \to Y$
  ‣ Set of function hypotheses $H = \{h \mid h: X \to Y\}$
• Input
  ‣ Training examples $\{(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})\}$ of the unknown target function $f$
• Output
  ‣ Hypothesis $h \in H$ that best approximates the target function $f$

Formalising Learning: Loss Function
• $l(y, f(x))$, where $y$ is the truth and $f(x)$ is the system’s prediction
  ‣ e.g. the zero-one loss:
$$l(y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{otherwise} \end{cases}$$
• Captures our notion of what is important to learn

Probabilistic model of learning

Formalising learning: Data generating distribution
• Where does the data come from?
• Data generating distribution
  - A probability distribution $D$ over $(x, y)$ pairs
• We don’t know what $D$ is!
  ‣ We only get a random sample from it: our training data

Formalising learning: Expected loss
• $f$ should make good predictions
  ‣ as measured by the loss $l$
  ‣ on future examples that are also drawn from $D$
• Formally
  - $\varepsilon$, the expected loss of $f$ over $D$ with respect to $l$, should be small
$$\varepsilon \doteq \mathbb{E}_{(x,y) \sim D}\{l(y, f(x))\} = \sum_{(x,y)} D(x, y)\, l(y, f(x))$$

Formalising learning: Training error
• We can’t compute the expected loss because we don’t know what $D$ is
• We only have a sample from $D$
  ‣ training examples $\{(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})\}$
• All we can compute is the training error
$$\hat{\varepsilon} \doteq \frac{1}{N} \sum_{n=1}^{N} l(y^{(n)}, f(x^{(n)}))$$

Formalising learning: Minimising the expected error
• Given
  ‣ a loss function $l$
  ‣ a sample from some unknown distribution $D$
• Our task is to compute a function $f$ that has a low expected error over $D$ with respect to $l$:
$$\mathbb{E}_{(x,y) \sim D}\{l(y, f(x))\} = \sum_{(x,y)} D(x, y)\, l(y, f(x))$$

Ethics

(Some) Ethics of Data Mining
• Ethical issues arise in practical applications
• Personal information
  ‣ anonymising data is difficult
  ‣ 85% of Americans can be identified from just zip code, birth date and sex
• Discrimination
  ‣ e.g., loan applications: using some information (sex, religion, race) is unethical
  ‣ The ethical situation depends on the application
  ‣ e.g., a medical application may need the same information
• Attributes may contain problematic information
  ‣ e.g., area code may correlate with race

(Some) Ethics of Data Mining
• Important questions:
  ‣ Who is allowed access to the data?
    - Social norms: a library will not tell you who has a book
  ‣ For what purpose was the data collected?
  ‣ What kind of conclusions can be legitimately drawn from it?
  ‣ Are we just reinforcing old behaviour patterns? [Data Bias]
    - Learning from existing data
  ‣ Are results put to good use?
    - Easier for shoppers or more profit for the shop?
• Justification must be attached to results
  ‣ Sometimes human judgment
  ‣ Purely statistical arguments are never sufficient!

Thinking about ethics…
• All projects at HWU need to apply for ethical approval:
  ‣ University Research Ethics procedure
  ‣ University Ethics policy
  ‣ Data Protection Policy
• Interesting reads:
  ‣ https://www.siliconrepublic.com/enterprise/ethics-data-science-bias
  ‣ https://dssg.uchicago.edu/2015/09/18/an-ethical-checklist-for-data-science/

Outro
• Read from the textbook (A Course in Machine Learning):
  ‣ Sections 4.1-4.4, 4.5 (contains the proof of the convergence theorem; it’s good to try to understand it), 4.6 (optional, but it’s a cool extension of the standard algorithm), 4.7

Image Credits
• https://www.istockphoto.com/photos/excited-kid
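
Worked example: training error under the zero-one loss

As a small illustration of the "Formalising learning" slides, the sketch below computes the training error (the average loss over the N training examples) for a fixed predictor under the zero-one loss. The data, the threshold classifier `f` and the helper names `zero_one_loss` and `training_error` are all assumptions made up for this example; they are not taken from the course material.

```python
def zero_one_loss(y_true, y_pred):
    """Zero-one loss: 0 if the prediction matches the truth, 1 otherwise."""
    return 0 if y_true == y_pred else 1


def training_error(f, examples, loss=zero_one_loss):
    """Average loss of predictor f over a list of (x, y) training examples."""
    return sum(loss(y, f(x)) for x, y in examples) / len(examples)


# Hypothetical training sample: x is a single number, y is a class label.
examples = [(1.0, "A"), (2.0, "A"), (3.5, "B"), (4.0, "B"), (2.5, "B")]


def f(x):
    """Hypothetical classifier: a fixed threshold at 3.0 (nothing is learned here)."""
    return "A" if x < 3.0 else "B"


err = training_error(f, examples)
print(f"Training error: {err:.2f}")
```

Only the example (2.5, "B") is misclassified, so the training error is 1/5. This number says how well `f` fits the sample; the expected loss over the unknown distribution D still cannot be computed directly.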